397.Text file parser -- How to run a program on many html files to extract useful text?

User: editor -- 2010-01-17
Type: Text file parser   
Description:
How can a program be run on many HTML files to extract useful text?
There are many HTML files, each containing a subset of the following items: Title|Author|Author Affiliation|Source|Abstract|Descriptors|Keywords|Geographic Descriptors|Geographic Region|Accession Number. How can the content of every item be extracted to form one line of text per file in the format "yyy1|yyy2|yyy3|yyy4....|yyy10" (yyyN is blank if the corresponding item does not exist)?
Input Sample:
Each HTML file contains a subset of the following:
....  
<dt>Title:</dt><dd xxx>yyy1</dd>  
<dt>Author:</dt><dd xxx>yyy2</dd>  
<dt>Author Affiliation:</dt><dd xxx>yyy3</dd>  
<dt>Source:</dt><dd xxx>yyy4</dd>  
<dt>Abstract:</dt><dd xxx>yyy5</dd>  
<dt>Descriptors:</dt><dd xxx>yyy6</dd>  
<dt>Keywords:</dt><dd xxx>yyy7</dd>  
<dt>Geographic Descriptors:</dt><dd xxx>yyy8</dd>  
<dt>Geographic Region:</dt><dd xxx>yyy9</dd>  
<dt>Accession Number:</dt><dd xxx>yyy10</dd>
Output Sample:
yyy1|yyy2|yyy3|yyy4....|yyy10 (from file1)  
yyy1|yyy2| |yyy4....|yyy10 (from file2)  
yyy1| |yyy3|yyy4....|yyy10 (from file3)  
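To check the expected output outside Replace Pioneer, the mapping from input sample to output sample can be sketched in Python. This is not the Replace Pioneer script used in the answer below. The name `extract_row` and the regular expression are assumptions made for illustration, and the regex assumes every item follows the `<dt>Label:</dt><dd ...>value</dd>` shape shown in the input sample.

```python
import re

# Field labels in the order required by the output format.
FIELDS = [
    "Title", "Author", "Author Affiliation", "Source", "Abstract",
    "Descriptors", "Keywords", "Geographic Descriptors",
    "Geographic Region", "Accession Number",
]

# Captures the label and the value of each <dt>Label:</dt><dd ...>value</dd> pair.
PAIR_RE = re.compile(
    r"<dt>\s*([^<:]+?)\s*:\s*</dt>\s*<dd[^>]*>(.*?)</dd>",
    re.S,
)


def extract_row(html: str) -> str:
    """Return the ten fields joined by '|', with a blank for each missing one."""
    found = {label: value.strip() for label, value in PAIR_RE.findall(html)}
    # A single space stands in for a missing item, as in the output sample.
    return "|".join(found.get(name, " ") for name in FIELDS)
```

Calling `extract_row` on the text of one HTML file yields one output line; items that are absent from the file come out as a blank between two `|` separators.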
Answer:
Hint: You need to download and install "Replace Pioneer" on a Windows platform to complete the following steps.
1. Press Ctrl-H to open the 'Replace' dialog  
* In 'Replace with Pattern' enter: 
 
* Click the "Advanced" page and, in "Run Following for each matched Unit", fill in: 
 
* Between the "Output Page" and "Output File" entries at the bottom right, change the symbol ">" to ">> Append" 
2. Click the "Batch..." button to open the "Batch Runner" window 
3. Drag all HTML files from Windows File Explorer to the "Batch Runner" window. 
4. Check the "Set output file name" option and change "${FILENAME}" to "result.txt" in the entry that follows. 
5. Click the "Batch Replace" button; all the desired content of the HTML files will be extracted and written to result.txt.
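
The batch steps above (run one extraction per file and append each result to a single result.txt) can also be sketched outside the tool. The following is a rough Python analogue, not Replace Pioneer's Batch Runner: `batch_append` and its `extract` parameter are hypothetical names, where `extract` stands for any function that turns the text of one HTML file into one output line.

```python
from pathlib import Path
from typing import Callable, Iterable


def batch_append(paths: Iterable[Path], out_path: Path,
                 extract: Callable[[str], str]) -> int:
    """Append one extracted line per input file to out_path; return the count."""
    count = 0
    # Mode "a" appends to the file, mirroring the ">> Append" output setting.
    with out_path.open("a", encoding="utf-8") as out:
        for path in sorted(paths):
            html = path.read_text(encoding="utf-8", errors="replace")
            out.write(extract(html) + "\n")
            count += 1
    return count
```

Because the output is opened in append mode, rerunning the batch adds the same lines again; delete or rename an existing result.txt first if a fresh file is wanted.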
Download Script:  scripts/397.rst.zip

Screenshot 1:  Replace_Window


Screenshot 2:  Replace_Advanced_Window


Similar Examples:
How to rename many files to the format of file.number.ext? (59%)
How to extract titles from many html files into a txt file? (55%)
How to count the number of words for many files and append result to each file? (53%)
How to change the many urls in html to ids in sequence? (53%)
How to extract text enclosed by "body" tag from many html files and join together? (53%)
How to rename many files with the last modified date of the file? (53%)
How to rename many jpg files with sequence id and record it? (53%)
How to create many different html files base on a given template?  (52%)
