This sed-based shell pipeline creates a list of words with their frequencies from a text input. It is quite fast even for a large input text.
As an example, I have run it on Leo Tolstoy's novel "War and Peace" (German text):
cat Krieg-und-Frieden.txt | sed 's/ /\n/g' | sed 's/\r/\n/g' | sed 's/^[[:punct:]]*//' | sed 's/[[:punct:]]*$//' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n

Output:

1 aaa
1 aaah
1 aalte
1 aasgeier
1 aasknochen
1 abänderungen
1 abbeé
1 abbekommen
...
4241 ich
4250 dem
4394 ein
4453 war
5533 auf
6050 nicht
6216 mit
6655 das
7185 den
7630 sich
7754 in
8709 zu
9166 sie
10559 er
13453 der
15114
15230 die
21906 und
Explanation:
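The first two sed commands replace all spaces and carriage returns (\r) with regular linefeeds (\n), so that each word ends up on its own line. The next two sed commands strip punctuation characters from the beginning and end of each line, and tr converts all characters to lower case. The resulting word list is then sorted so that identical words are adjacent, which uniq -c requires in order to count them. Finally, the list is sorted again numerically, giving word frequencies from lowest to highest count. The entry with a count but no word (15114 in the output above) is the number of empty lines, which arise, for example, from consecutive spaces or blank lines in the input.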
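The steps explained above can also be wrapped in a small shell function. The following is only a sketch of one possible variant, not the command that was run above: the function name word_freq is a hypothetical choice, whitespace is split with tr -s instead of the two sed substitutions, and an extra sed step drops empty lines so that no wordless count appears in the result.

```shell
# word_freq: hypothetical helper name (not part of the original command).
# Reads text on stdin and prints one "count word" line per distinct word,
# ordered from lowest to highest count.
word_freq() {
    # Turn every whitespace character (space, tab, \r, \n) into a newline;
    # -s squeezes runs of newlines into a single one.
    tr -s '[:space:]' '\n' |
    # Strip punctuation from the beginning and end of each word.
    sed -e 's/^[[:punct:]]*//' -e 's/[[:punct:]]*$//' |
    # Drop lines that are empty after stripping (for example a lone "-").
    sed '/^$/d' |
    # Fold upper case to lower case.
    tr '[:upper:]' '[:lower:]' |
    # Sort so identical words are adjacent, count them, then sort by count.
    sort | uniq -c | sort -n
}
```

It could be called as word_freq < Krieg-und-Frieden.txt. Depending on the tr implementation and the locale, non-ASCII letters such as Ä may not be converted to lower case, so words containing them may be counted separately from their lower-case forms.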