This is a text-only version of the following page on https://raymii.org:
---
Title       : 	Word occurrence counter and analyzer
Author      : 	Remy van Elst
Date        : 	07-03-2013
URL         : 	https://raymii.org/s/articles/Word_occurrence_counter_and_analyzer.html
Format      : 	Markdown/HTML
---



With these commands you can analyze a text file. They count the occurrences of
every word and print the statistics. This is useful for song lyrics, books,
notes and any other text. It helps me analyze my writing style: which words I
use most often, where my spelling errors are, and so on. It is also a nice way
to win an argument with someone over a Dragonforce song. This article uses
lyrics as the example, but the commands apply to any text file.



##### Get the Lyrics (text)

First get the lyrics, or the text you want to analyze, into a text file. I've
heard nano, vi(m) and emacs are quite good with text. In this example I will use
a song by Dragonforce. It does not matter which one because they're all full of
the same words.

My lyrics file is named: `df1.txt`

##### Sanitize them

The tools we are going to use do not like all those commas, colons, exclamation
marks and other non-alphanumeric characters. So sanitize the file like this:

    
    
    cat df1.txt | tr -cd '[:alnum:] [:space:]' > dfsan.txt
    

This pipes the file through the `tr` command. With these arguments, `-c` takes
the complement of the given set and `-d` deletes it, so everything that is not a
letter, a digit or whitespace (spaces, tabs, newlines) is stripped. Exactly what
we want.
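
To see what the filter does without touching the lyrics file, here is a small sketch on a made-up line. The `sanitize` function name is only there for illustration; it wraps the same `tr` call and reads from standard input.

```shell
# Hypothetical helper: the same tr filter as above, reading from stdin.
sanitize() {
    # -c: complement the set, -d: delete. Together: remove everything
    # that is NOT a letter, a digit or whitespace.
    tr -cd '[:alnum:][:space:]'
}

# Made-up sample line, not from a real song.
printf 'Fire, fire! Through the night: we ride.\n' | sanitize
```

This prints `Fire fire Through the night we ride`: the punctuation is gone, the words and the spaces between them are kept.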

##### Analyze it

Now we do the magic:

    
    
    sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' dfsan.txt | sort | uniq -c | sort -nr | head -n 20
    
    
    
    remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' dfsan.txt | sort | uniq -c | sort -nr | head -n 20
    72 the
    32 
    25 and
    22 of
    20 in
    17 we
    16 on
    14 our
    13 a
    8 were
    8 lost
    8 for
    7 will
    7 still
    7 light
    6 to
    6 so
    6 fire
    6 far
    5 through
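
The count with no word next to it in the output above is the number of empty lines: every blank line and every doubled space in the input ends up as an empty "word". The `\L` in the sed expression is also a GNU sed extension. A minimal sketch of a variant that avoids both by using only `tr`; the `wordfreq` function name and the sample text are made up for this example:

```shell
# Hypothetical helper: word frequency table from stdin, most frequent first.
wordfreq() {
    tr -cd '[:alnum:][:space:]' |      # drop punctuation
    tr '[:upper:]' '[:lower:]' |       # lowercase everything
    tr -s '[:space:]' '\n' |           # one word per line, squeeze repeats
    grep -v '^$' |                     # drop a possible leading empty line
    sort | uniq -c | sort -nr
}

# Made-up sample with a doubled space and a blank line.
printf 'The fire, the  light!\n\nThe fire.\n' | wordfreq | head -n 20
```

On the sample this reports `the` three times, `fire` twice and `light` once, without an empty entry.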
    

### Other Examples

#### My class notes about blood and the immune system

    
    
    remy@vps8:~$ cat afweer.txt | tr -cd '[:alnum:] [:space:]' > afweersan.txt      
    remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' afweersan.txt | sort | uniq -c | sort -nr | head -n 20                 
    195 
    108 de
    80 een
    72 van
    65 het
    51 in
    46 is
    40 en
    24 zijn
    24 op
    24 afweer
    22 die
    20 vraag
    20 deze
    19 worden
    18 kan
    17 bij
    16 dit
    15 er
    14 of
    

After stripping out the words that are not useful:

    
    
    remy@vps8:~$ cat afwres.txt | head -n 10
    24 afweer
    14 cellen
    11 bacterin
    9 waar
    9 reactie
    9 antigeen
    8 specifieke
    7 milieu
    7 lymfocyten
    7 lichaam
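
How the common words were stripped is not shown here. One possible way, as a sketch only: keep the common words in a stop word file and drop every counted line whose word is listed in it. The `stopwords.txt` file name, its contents and the `strip_stopwords` function are assumptions made for this example.

```shell
# Hypothetical helper: read "count word" lines from stdin and drop those
# whose word is listed in the stop word file given as first argument
# (one word per line).
strip_stopwords() {
    awk 'NR == FNR { stop[$1]; next }   # first file: remember the stop words
         NF >= 2 && !($2 in stop)       # stdin: keep lines with other words
        ' "$1" -
}

# Made-up stop word list with a few of the Dutch words from the output above.
printf '%s\n' de een van het in is en > stopwords.txt

printf '108 de\n80 een\n24 afweer\n14 cellen\n' | strip_stopwords stopwords.txt
```

Only the `afweer` and `cellen` lines survive. The `NF >= 2` test also drops the count of empty lines, which has no word in the second column.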
    

#### Fabian Scherschel's NaNoWriMo 2011 Book: Nightwatch

[Git tree of the book][2] & [NaNoWriMo page][3]. The book is licensed under the
`Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License`.

    
    
    1020 the
    454 he
    421 and
    418 of
    357 to
    347 had
    297 a
    267 was
    257 his
    241 that
    216 in
    132 it
    130 marc
    112 him
    108 as
    105 this
    105 they
    93 with
    90 but
    82 were
    82 from
    82 been
    82 at
    74 on
    70 would
    68 for
    68 could
    56 their
    56 be
    53 out
    51 into
    50 man
    49 all
    48 there
    48 so
    48 by
    47 looked
    46 not
    44 up
    44 them
    44 like
    

#### Analyzing IP and log files

Today I found another useful application for this command: analyzing IP
addresses. First I grepped my entire lighttpd log file:

    cat access.log | egrep -o '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}' | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr

(`egrep -o` prints only the matched IP address, not the whole line it appears
on. The dots are escaped so they match a literal dot instead of any character.)

That gives this nice list (the list is made up, these are not real IP addresses):

    
    
    2 83.64.150.248
    2 94.0.74.75
    2 94.142.55.252
    2 95.237.133.3
    2 98.225.130.26
    3 108.100.28.45
    3 213.93.70.87
    5 81.30.145.69
    348 66.228.43.247
    467 173.255.236.50
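
The egrep approach counts every IP-like string anywhere in a line, so an address that shows up in a referer or a URL is counted too. A sketch of an alternative, assuming the client address is the first whitespace-separated field of each log line (as in the common log format; check your own `accesslog.format` first). The `count_clients` name and the sample log lines are made up:

```shell
# Hypothetical helper: count requests per client, assuming the client
# address is the first whitespace-separated field of every log line.
count_clients() {
    awk '{ print $1 }' | sort | uniq -c | sort -nr
}

# Made-up log lines using documentation addresses (RFC 5737).
printf '%s\n' \
    '192.0.2.10 example.org - [07/Mar/2013:10:00:00 +0100] "GET / HTTP/1.1" 200 512' \
    '198.51.100.7 example.org - [07/Mar/2013:10:00:01 +0100] "GET /a HTTP/1.1" 200 128' \
    '192.0.2.10 example.org - [07/Mar/2013:10:00:02 +0100] "GET /b HTTP/1.1" 404 64' |
    count_clients
```

On a real log you would run it as `count_clients < access.log`.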
    

[Thanks to the wonderful community at Stack Exchange][4]

   [1]: https://www.digitalocean.com/?refcode=7435ae6b8212
   [2]: https://gitorious.org/nano2011
   [3]: http://www.nanowrimo.org/en/participants/fabsh
   [4]: http://unix.stackexchange.com/questions/39039/get-text-file-word-occurrence-count-of-all-words-print-output-sorted

---

License:
All the text on this website is free as in freedom unless stated otherwise. 
This means you can use it in any way you want, you can copy it, change it 
the way you like and republish it, as long as you release the (modified) 
content under the same license to give others the same freedoms you've got 
and place my name and a link to this site with the article as source.

All the code on this website is licensed under the GNU GPL v3 license 
unless already licensed under a license which does not allow this form 
of licensing or if another license is stated on that page / in that software:

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

Just to be clear, the information on this website is meant for educational 
purposes and you use it at your own risk. I do not take responsibility if you 
screw something up. Use common sense, do not 'rm -rf /' as root for example. 
If you have any questions then do not hesitate to contact me.

See https://raymii.org/s/static/About.html for details.