396. Text file parser -- How to extract text from many webpage files and form a database file?

User: editor -- 2010-01-16
Type: Text file parser   
Description:
How to extract text from many webpage files and form a database file?
There are many HTML files, each containing the fields Title|Author|Author Affiliation|Source|Abstract|Descriptors|Keywords|Geographic Descriptors|Geographic Region|Accession Number. How can I extract the content of every field and write one line of text per file in the format "yyy1|yyy2|yyy3|yyy4....|yyy10"?
Input Sample:
Each HTML file contains:
....
<dt>Title:</dt><dd xxx>yyy1</dd>
<dt>Author:</dt><dd xxx>yyy2</dd>
<dt>Author Affiliation:</dt><dd xxx>yyy3</dd>
<dt>Source:</dt><dd xxx>yyy4</dd>
<dt>Abstract:</dt><dd xxx>yyy5</dd>
<dt>Descriptors:</dt><dd xxx>yyy6</dd>
<dt>Keywords:</dt><dd xxx>yyy7</dd>
<dt>Geographic Descriptors:</dt><dd xxx>yyy8</dd>
<dt>Geographic Region:</dt><dd xxx>yyy9</dd>
<dt>Accession Number:</dt><dd xxx>yyy10</dd>
Output Sample:
yyy1|yyy2|yyy3|yyy4....|yyy10 (from file1)
yyy1|yyy2|yyy3|yyy4....|yyy10 (from file2)
yyy1|yyy2|yyy3|yyy4....|yyy10 (from file3)
Answer:
Hint: You need to download and install "Replace Pioneer" on Windows to complete the following steps.
1. Press Ctrl-H to open the 'Replace' dialog
* in 'Search for Pattern' enter:

* in 'Replace with Pattern' enter:

* uncheck the "print unmatched unit" option
* between the "Output Page" and "Output File" entries at the bottom right, change the symbol ">" to ">> Append"
2. Click the "Batch..." button to open the "Batch Runner" window.
3. Drag all HTML files from Windows File Explorer into the "Batch Runner" window.
4. Check the "Set output file name" option, and change "${FILENAME}" to "result.txt" in the entry that follows it.
5. Click the "Batch Replace" button; the desired content of every HTML file will be extracted and appended to result.txt.
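
For readers without Replace Pioneer, the same extraction can be sketched in Python. This is not the Replace Pioneer script and does not reproduce its search and replace patterns, which are not shown on this page; it is a minimal sketch that assumes each field appears as <dt>Name:</dt><dd ...>value</dd>, as in the input sample. The function names extract_record and build_database, the "*.html" glob, and the UTF-8 encoding are assumptions made for illustration.

```python
import re
from pathlib import Path

# Field labels in output order, taken from the input sample.
FIELDS = [
    "Title", "Author", "Author Affiliation", "Source", "Abstract",
    "Descriptors", "Keywords", "Geographic Descriptors",
    "Geographic Region", "Accession Number",
]


def extract_record(html: str) -> str:
    """Return one 'yyy1|yyy2|...|yyy10' line for a single page (hypothetical helper)."""
    values = []
    for field in FIELDS:
        # Anchoring on "<dt>" and the trailing colon keeps "Author" from
        # matching "Author Affiliation", and "Descriptors" from matching
        # "Geographic Descriptors".
        pattern = (r"<dt>\s*" + re.escape(field)
                   + r":\s*</dt>\s*<dd[^>]*>(.*?)</dd>")
        match = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
        text = match.group(1) if match else ""  # missing field -> empty column
        values.append(" ".join(text.split()))   # collapse newlines and extra spaces
    return "|".join(values)


def build_database(html_dir: str, out_file: str = "result.txt") -> None:
    """Write one line per HTML file in html_dir to out_file (hypothetical helper)."""
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(html_dir).glob("*.html")):
            html = path.read_text(encoding="utf-8", errors="replace")
            out.write(extract_record(html) + "\n")
```

Calling build_database("pages") would process every .html file in a folder named "pages" (a placeholder name) and write result.txt in the current directory. Unlike the ">> Append" setting in step 1, this sketch overwrites result.txt on each run. A value that itself contains "|" or nested HTML tags is copied unchanged, so real pages may need extra cleanup.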
Download Script:  scripts/396.rst.zip

Similar Examples:
How to extract titles from many html files into a txt file? (68%)
How to extract distinct webpages from weblog file? (63%)
How to extract first line from multiple files and generate a new file? (63%)
How to extract all the links and images from many web page files? (62%)
How to extract multiple fields from data file and create a csv file? (61%)
How to extract titles of all html files and save them to one file? (61%)
How to extract email from webpage and remove duplicate? (60%)
How to extract and join text from multiple files with user defined format? (59%)