User: editor -- 2010-01-17 |
Hits: 3540 |
Type: Text file parser |
Description: |
How can I run a program over many text files to extract useful text? I have many HTML files, each containing a subset of the following items: Title|Author|Author Affiliation|Source|Abstract|Descriptors|Keywords|Geographic Descriptors|Geographic Region|Accession Number. How can I extract the content of every item and write one line per file in the format "yyy1|yyy2|yyy3|yyy4....|yyy10" (yyyN is blank if the corresponding item does not exist)? |
Input Sample: |
Each HTML file contains a subset of the following:
....
<dt>Title:</dt><dd xxx>yyy1</dd>
<dt>Author:</dt><dd xxx>yyy2</dd>
<dt>Author Affiliation:</dt><dd xxx>yyy3</dd>
<dt>Source:</dt><dd xxx>yyy4</dd>
<dt>Abstract:</dt><dd xxx>yyy5</dd>
<dt>Descriptors:</dt><dd xxx>yyy6</dd>
<dt>Keywords:</dt><dd xxx>yyy7</dd>
<dt>Geographic Descriptors:</dt><dd xxx>yyy8</dd>
<dt>Geographic Region:</dt><dd xxx>yyy9</dd>
<dt>Accession Number:</dt><dd xxx>yyy10</dd> |
Output Sample: |
yyy1|yyy2|yyy3|yyy4....|yyy10 (from file1)
yyy1|yyy2| |yyy4....|yyy10 (from file2)
yyy1| |yyy3|yyy4....|yyy10 (from file3) |
Answer: |
Hint: You need to download and install "Replace Pioneer" on Windows to complete the following steps. |
1. Press Ctrl-H to open the 'Replace' dialog:
* in 'Replace with Pattern' enter:
* click the "Advanced" page, and in "Run Following for each matched Unit" fill:
* between the "Output Page" and "Output File" entries at the bottom right, change the symbol ">" to ">> Append"
2. Click the "Batch..." button to open the "Batch Runner" window.
3. Drag all HTML files from Windows File Explorer into the "Batch Runner" window.
4. Check the "Set output file name" option, and change "${FILENAME}" to "result.txt" in the entry that follows.
5. Click the "Batch Replace" button; all the desired content of the HTML files will be extracted and appended to result.txt. |
Download Script: scripts/397.rst.zip |
Screenshot 1: Replace_Window |
Screenshot 2: Replace_Advanced_Window |
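For readers without Replace Pioneer, the same extraction can be sketched in Python. This is not the Replace Pioneer script from the answer, only a rough equivalent under stated assumptions: the names `FIELDS`, `extract_row`, and `collect` are hypothetical, each item is assumed to appear as `<dt>Label:</dt><dd ...>value</dd>` as in the input sample, a missing item is written as an empty string, and the files are assumed to be readable as UTF-8.

```python
import glob
import html
import re

# Item labels in output order, taken from the question.
FIELDS = [
    "Title", "Author", "Author Affiliation", "Source", "Abstract",
    "Descriptors", "Keywords", "Geographic Descriptors",
    "Geographic Region", "Accession Number",
]


def extract_row(page: str) -> str:
    """Build one 'yyy1|yyy2|...|yyy10' line from the text of one HTML page."""
    values = []
    for label in FIELDS:
        # Match <dt>Label:</dt><dd ...>value</dd>. Anchoring on <dt> keeps
        # "Descriptors" from matching inside "Geographic Descriptors".
        pattern = (
            r"<dt>\s*" + re.escape(label) + r"\s*:\s*</dt>\s*"
            r"<dd[^>]*>(.*?)</dd>"
        )
        match = re.search(pattern, page, re.IGNORECASE | re.DOTALL)
        if match is None:
            values.append("")  # assumption: a missing item becomes an empty field
            continue
        text = re.sub(r"<[^>]+>", "", match.group(1))  # drop nested tags
        text = html.unescape(text)                     # decode entities such as &amp;
        values.append(" ".join(text.split()))          # collapse whitespace and newlines
    return "|".join(values)


def collect(file_pattern: str = "*.html", out_path: str = "result.txt") -> None:
    """Append one extracted line per matching file to out_path."""
    # Mode "a" appends, mirroring the ">> Append" setting in the steps above.
    with open(out_path, "a", encoding="utf-8") as out:
        for name in sorted(glob.glob(file_pattern)):
            with open(name, encoding="utf-8", errors="replace") as source:
                out.write(extract_row(source.read()) + "\n")
```

Calling `collect("*.html", "result.txt")` in the folder that holds the HTML files would append one line per file. A value that itself contains "|" is not escaped in this sketch, so such input would need extra handling.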