This shell pipeline, built from sed, tr, sort and uniq, creates a list of words with their frequencies from a text input. It is quite fast even for a large input text.

As an example, I ran this script on Leo Tolstoy's novel "War and Peace" (German text):

    cat Krieg-und-Frieden.txt |
    sed 's/ /\n/g' |
    sed 's/\r/\n/g' |
    sed 's/^[[:punct:]]*//' |
    sed 's/[[:punct:]]*$//' |
    tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n

Output:
      1 aaa
      1 aaah
      1 aalte
      1 aasgeier
      1 aasknochen
      1 abänderungen
      1 abbeé
      1 abbekommen
...
   4241 ich
   4250 dem
   4394 ein
   4453 war
   5533 auf
   6050 nicht
   6216 mit
   6655 das
   7185 den
   7630 sich
   7754 in
   8709 zu
   9166 sie
  10559 er
  13453 der
  15114 
  15230 die
  21906 und

Explanation:

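The first two sed commands replace every space and every carriage return (\r) with a regular linefeed (\n), so that each word ends up on its own line. The next two sed commands strip punctuation characters from the beginning and the end of each line; punctuation inside a word, such as a hyphen, is kept. After that, tr changes all upper-case characters to lower-case. The resulting word list is then sorted so that identical words are adjacent, and uniq -c collapses each run of identical words into one line prefixed with its count. Finally, the list is sorted again numerically, to get word frequencies from lowest to highest count. The entry that has a count but no word (15114 in the output above) is the number of empty lines, which come from blank lines, consecutive spaces, and tokens that consist only of punctuation.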
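
The same steps can be sketched as a small reusable shell function. This is only a variant under a few assumptions, not the original script: the function name word_freq is made up for illustration, the splitting is done with tr -s instead of sed (this avoids relying on \n in a sed replacement, which as far as I know is a GNU sed extension), and an extra grep drops empty lines so that they do not show up in the result.

```shell
# word_freq: hypothetical helper name, not part of the original script.
# Reads text on stdin and prints "count word" lines, lowest count first.
word_freq() {
    # Turn every run of whitespace (spaces, tabs, \r, \n) into one newline.
    tr -s '[:space:]' '\n' |
    # Strip punctuation at the start and at the end of each word.
    sed -e 's/^[[:punct:]]*//' -e 's/[[:punct:]]*$//' |
    # Fold upper-case letters to lower-case.
    tr '[:upper:]' '[:lower:]' |
    # Drop lines that are empty after stripping (e.g. a lone "--").
    grep -v '^$' |
    # Group identical words, count them, sort by count.
    sort | uniq -c | sort -n
}

# Small sample input instead of a whole novel.
printf 'The cat, the dog.\nThe END!\n' | word_freq
```

For this sample input, "the" is counted three times and "cat", "dog" and "end" once each. To process a file, redirect it into the function, for example word_freq < Krieg-und-Frieden.txt.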
