User: Aaron -- 2016-03-14 |
Type: Text file parser |
Description: |
Split the words in a concatenated text file in which all words are joined without spaces. This will likely require a dictionary file of common English words sorted by frequency, for example: https://github.com/first20hours/google-10000-english The script needs to handle all punctuation marks and line breaks, and treat uppercase and lowercase words as distinct. Words not included in the dictionary file should be treated as new words. |
Input Sample: |
Itwasthebestoftimes,itwastheworstoftimes,itwastheageofwisdom,itwastheageoffoolishness, itwastheepochofbelief,itwastheepochofincredulity,itwastheseasonoflight,itwastheseasonofDarkness,etc. |
Output Sample: |
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, etc. |
Answer: |
Hint: You need to download and install "Replace Pioneer" on Windows to complete the following steps. |
Splitting by frequency priority is not the best approach. Here we instead sort the words by length, so that longer words have higher priority.
1. Press Ctrl-O to open the text file.
2. Press Ctrl-H to open the 'replace' dialogue:
* set 'search for pattern' to:
* set 'replace with pattern' to:
* click the 'advanced' tab:
* set 'run following at the beginning of replace' to:
* set 'run following for each matched unit' to:
3. Click 'replace', done.
Note:
(1) You need to put the dictionary file in d:\test\google-10000-english.txt.
(2) Some words, such as 'epoch' and 'foolishness', do not exist in google-10000-english.txt, which will cause problems. You can add them manually.
(3) Even if all words exist in the dictionary, still there will be some pro |
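For readers without Replace Pioneer, the longest-word-first idea can be sketched in Python. This is only an illustration under stated assumptions, not the script used in the answer: the function names (`load_words`, `split_run`, `split_text`) are hypothetical, the dictionary file is assumed to hold one word per line, matching is assumed to ignore case while keeping the original capitalization in the output, and letters that match no dictionary entry are grouped together as a new word.

```python
import re


def load_words(path):
    """Read one dictionary word per line (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def split_run(run, words, max_len):
    """Greedily split one run of letters, trying the longest word first."""
    tokens = []
    unknown = ""  # letters that have not matched any dictionary word yet
    i = 0
    while i < len(run):
        # Try candidates from the longest possible length down to 1.
        for length in range(min(max_len, len(run) - i), 0, -1):
            candidate = run[i:i + length]
            if candidate.lower() in words:  # assumption: case-insensitive lookup
                if unknown:
                    tokens.append(unknown)  # flush the pending new word
                    unknown = ""
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: treat the letter as part of a new word.
            unknown += run[i]
            i += 1
    if unknown:
        tokens.append(unknown)
    return " ".join(tokens)


def split_text(text, words):
    """Split every run of letters; punctuation and line breaks pass through."""
    max_len = max(map(len, words), default=0)
    spaced = re.sub(
        r"[A-Za-z]+",
        lambda match: split_run(match.group(0), words, max_len),
        text,
    )
    # Add a space after punctuation that is directly followed by a letter.
    return re.sub(r"([,.;:!?])(?=[A-Za-z])", r"\1 ", spaced)
```

A call such as `split_text(text, load_words(r"d:\test\google-10000-english.txt"))` would apply it to the input sample. Because the match is greedy, it can still pick a wrong split when a longer dictionary word happens to span a word boundary.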
Screenshot 1: Replace_Window |
Screenshot 2: Replace_Advanced_Window |