396. Text file parser -- How to extract text from many webpage files and form a database file?

User: editor -- 2010-01-16
Type: Text file parser
Description:
How to extract text from many webpage files and form a database file?
There are many HTML files, each containing the fields Title|Author|Author Affiliation|Source|Abstract|Descriptors|Keywords|Geographic Descriptors|Geographic Region|Accession Number. How can I extract the content of every field and write one line per file in the format "yyy1|yyy2|yyy3|yyy4....|yyy10"?
Input Sample:
Each HTML file contains:
.... 
<dt>Title:</dt><dd xxx>yyy1</dd> 
<dt>Author:</dt><dd xxx>yyy2</dd> 
<dt>Author Affiliation:</dt><dd xxx>yyy3</dd> 
<dt>Source:</dt><dd xxx>yyy4</dd> 
<dt>Abstract:</dt><dd xxx>yyy5</dd> 
<dt>Descriptors:</dt><dd xxx>yyy6</dd> 
<dt>Keywords:</dt><dd xxx>yyy7</dd> 
<dt>Geographic Descriptors:</dt><dd xxx>yyy8</dd> 
<dt>Geographic Region:</dt><dd xxx>yyy9</dd> 
<dt>Accession Number:</dt><dd xxx>yyy10</dd>
Output Sample:
yyy1|yyy2|yyy3|yyy4....|yyy10 (from file1) 
yyy1|yyy2|yyy3|yyy4....|yyy10 (from file2) 
yyy1|yyy2|yyy3|yyy4....|yyy10 (from file3) 
Answer:
Hint: You need to download and install "Replace Pioneer" on Windows to complete the following steps.
1. Press Ctrl-H to open the 'Replace' dialog
* in 'Search for Pattern' enter: 
 
* in 'Replace with Pattern' enter: 
 
* Uncheck the "print unmatched unit" option
* At the bottom right, between the "Output Page" and "Output File" entries, change the symbol ">" to ">> Append"
2. Click the "Batch..." button to open the "Batch Runner" window
3. Drag all the HTML files from Windows File Explorer into the "Batch Runner" window.
4. Check the "Set output file name" option, and change "${FILENAME}" to "result.txt" in the entry that follows it.
5. Click the "Batch Replace" button; the desired content of every HTML file will be extracted and appended to result.txt.
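
For comparison, the same extraction can be sketched in Python. This is not the Replace Pioneer script: the search and replace patterns are not shown in the steps above, so the regular expression here is an assumption based only on the input sample. The names FIELDS, extract_record and build_database are hypothetical, and the files are assumed to be UTF-8. A field that is missing from a file is written as an empty value so every line still has ten columns.

```python
import re
from pathlib import Path

# Field labels in the order required by the output sample.
FIELDS = [
    "Title", "Author", "Author Affiliation", "Source", "Abstract",
    "Descriptors", "Keywords", "Geographic Descriptors",
    "Geographic Region", "Accession Number",
]


def extract_record(html):
    """Return the ten field values of one page joined with '|'."""
    values = []
    for name in FIELDS:
        # Assumed layout: <dt>Label:</dt><dd ...>value</dd>
        pattern = (r"<dt>\s*" + re.escape(name)
                   + r":\s*</dt>\s*<dd[^>]*>(.*?)</dd>")
        match = re.search(pattern, html, flags=re.IGNORECASE | re.DOTALL)
        value = match.group(1) if match else ""
        # Collapse line breaks and repeated spaces so each record stays on one line.
        values.append(" ".join(value.split()))
    return "|".join(values)


def build_database(html_paths, output_path="result.txt"):
    """Append one record per HTML file to output_path (like '>> Append')."""
    with open(output_path, "a", encoding="utf-8") as out:
        for path in html_paths:
            html = Path(path).read_text(encoding="utf-8", errors="replace")
            out.write(extract_record(html) + "\n")
```

A possible call is build_database(sorted(Path(".").glob("*.html"))). Because the output file is opened in append mode, running it twice adds the records twice, which mirrors the ">> Append" setting in step 1.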
Download Script:  scripts/396.rst.zip
