10. Power laws, Zipf, Benford, …¶
Areas: pure math, curiosity, economics, complex systems
[status: partly written]
10.1. Motivation, prerequisites, plan¶
We can start by realizing that certain relationships between numbers are “curious”, and our curiosity might be enough to get us going. Here are two examples that might spark it:
- Longer words in a language occur less often and, more remarkably, there is a clear mathematical relationship between the frequency of a word and its rank in a frequency table (Zipf’s law).
- In numbers that come up in the real world, the most significant digit is more often 1 and 2 than 8 and 9 (Benford’s law).
There are many areas of nature and of human endeavor in which similar relationships come up. These relationships are called power laws (where “power” refers to the exponent applied to a variable). Many of them give insight into the underlying mechanisms of those areas of study.
We will start by studying Zipf’s law and write some programs that illustrate it. This will let us study other data sets that have similar relationships, some from astronomy, some from geography, some from economics. We will look at those examples, discussing the insights that come from the power law relationships. Finally we will dedicate some time to Benford’s law and some curious and amusing historical applications.
10.2. Zipf’s law¶
We start with Zipf’s law. In the first half of the 20th century George Kingsley Zipf noticed that if you take a large sample of words from a real-world English text, there is an inverse proportionality between the frequency of a word and its rank in the frequency table (see Section 5.5 where we introduced inverse proportionality).
This means that a plot of frequency versus rank will look like a \(1/x\) plot, or, if you plot both the x and y axes on a logarithmic scale, you will get a straight line with a slope of -1.
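To see why the log-log plot gives a straight line, take the logarithm of both sides of the inverse proportionality (here \(C\) is just the constant of proportionality):

```latex
y = \frac{C}{x}
\quad\Longrightarrow\quad
\log y = \log C - \log x
```

so plotting \(\log y\) against \(\log x\) gives a straight line with slope \(-1\) and intercept \(\log C\).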
How do we explore this law? As usual, we write a program! Start by typing in the program in Listing 10.2.1.
#! /usr/bin/env python3
import sys
import re

def main():
    if len(sys.argv) == 1:
        f = sys.stdin
    elif len(sys.argv) == 2:
        fname = sys.argv[1]
        f = open(fname, 'r')
    else:
        sys.stderr.write('error: use 0 or 1 arguments\n')
        sys.exit(1)
    sorted_words, rank_dict = read_words_from_file(f)
    f.close()
    print('## rank word frequency')
    for i, word in enumerate(sorted_words):
        print('%8d %-16s %8d' % (i+1, word, rank_dict[word]))

def read_words_from_file(f):
    word_set = set()
    rank_dict = {}
    for line in f.readlines():
        word_list = re.split(r'--|\s', line)
        ## now that we have the words, let's strip away all the
        ## punctuation marks and discard any empty strings
        cleaned_up_words = []
        for word in word_list:
            word = word.strip(""",.;:_"'&%^$#@!?/\\|+-()*""").lower()
            if len(word) > 0:
                cleaned_up_words.append(word)
        ## now that we have found the words in this line, we add
        ## them to the word rank dictionary before we lose the count
        ## information by adding them to the set
        for word in cleaned_up_words:
            if word in rank_dict:
                rank_dict[word] += 1
            else:
                rank_dict[word] = 1
        ## finally, add them to a set, which discards repeated
        ## occurrences
        word_set.update(cleaned_up_words)
    ## use the rank dictionary to sort the list of words by
    ## how often they occur (their rank)
    sorted_word_list = sorted(word_set, key=lambda x: -rank_dict[x])
    return sorted_word_list, rank_dict

main()
10.2.1. Example from a paragraph of your own.¶
Type a simple paragraph of English text into a file, call it simple-para.txt. Then run the program on that file with:
$ python3 word-freq-rank.py simple-para.txt
You’ll see a table with three columns: the rank of each word, the word itself, and how many times it occurs. Even in a short paragraph, common words like “the” and “a” should already appear near the top of the table.
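If you want to sanity-check the counting logic on a tiny input without creating a file, here is a minimal sketch using Python’s collections.Counter (the listing above builds the dictionary by hand instead):

```python
import re
from collections import Counter

text = "the cat and the dog and the bird"
words = [w.lower() for w in re.split(r'--|\s', text) if w]
counts = Counter(words)
## sort by decreasing frequency, as word-freq-rank.py does
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)
## "the" comes out at rank 1 with frequency 3
```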
10.2.2. Example from Project Gutenberg¶
To download a full book from Project Gutenberg and plot its word frequency distribution with gnuplot, you can use these instructions:
##REQUIRED_FILE: swanns-way.txt
##PRE_EXEC: wget --output-document swanns-way.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt
##PRE_EXEC: ./word-freq-rank.py swanns-way.txt > swanns-way-freq.out
set multiplot layout 2,1 title "Zipf's law"
set grid
plot 'swanns-way-freq.out' using 1:3 with linespoints pt 6 ps 0.2 title "word frequency (linear scale)"
set logscale xy
plot 'swanns-way-freq.out' using 1:3 with linespoints pt 6 ps 0.2 title "word frequency (log-log scale)"
Figure 10.2.2.1 Plot of word frequency versus rank for the text of Swann’s Way (volume 1 of Marcel Proust’s “In Search of Lost Time”, formerly translated into English as “Remembrance of Things Past”).
The top plot in Figure 10.2.2.1 uses the usual linear scale. This can be hard to read: a few words have very large frequencies, and the small ones are hard to tell apart. That’s why the bottom plot uses a logarithmic scale on both axes (log scales are discussed in Section 4.2).
Out of curiosity, let us also look at the first few and last few lines of the output file.
## rank word frequency
1 the 10051
2 of 7169
3 to 6749
4 and 4631
5 a 4440
6 in 4160
7 that 3632
8 had 2712
9 which 2686
...
13449 recognition 1
13450 blankets 1
13451 enjoyable 1
13452 hoary 1
13453 episodes 1
13454 absolved 1
13455 retreat 1
13456 corporal 1
13457 basis 1
13458 my-poor 1
13459 barometers 1
10.2.3. Fitting the curve in Zipf’s law¶
TODO
10.2.4. Explanations of Zipf’s law¶
TODO
10.3. What are “power laws”?¶
If you have a relationship where the number of items with a given measure is proportional to that measure to a negative power, then we say that we have power law behavior.
In the case of Zipf’s law, if \(r\) is the rank of a word in the frequency table and \(f(r)\) is the frequency with which that word occurs in our text, we have:

\[f(r) \propto \frac{1}{r}\]

More in general we will have:

\[f(r) \propto r^{\alpha}\]

where \(\alpha\) is the “power” in “power law”. In Zipf’s law the power is \(\alpha = -1\).
10.3.1. Calculating the power from the data¶
How do you calculate the power \(\alpha\) if you are given a collection of data points?
In Section 6 we learned some techniques to fit functions to data. How would we fit a power law \(f(r) = C\, r^{\alpha}\) to our data? A standard trick is to take the logarithm of both sides: \(\log f = \log C + \alpha \log r\), so fitting a straight line to the log-log data gives \(\alpha\) as the slope.
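This fit can be sketched with numpy’s polyfit on the log-log data. The data here is synthetic, generated to follow an exact power law; a real data set would be noisier:

```python
import numpy as np

## synthetic frequency data following an exact power law f(r) = C * r**alpha
ranks = np.arange(1, 101)
freqs = 10000.0 * ranks ** -1.0
## fit a straight line to the log-log data: the slope is alpha
alpha, log_c = np.polyfit(np.log(ranks), np.log(freqs), 1)
print('alpha = %.3f' % alpha)   ## prints alpha = -1.000
```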
Zipf’s law was just one rather notable example of a power law, but there are many others. We will discuss a few more and write programs that download the relevant data and analyze it.
- population-rank distribution for nations
wget https://raw.githubusercontent.com/datasets/population/master/data/population.csv
- an easier-to-parse alternative:
wget https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2119.txt
- rank-size distribution for nations
- city size
- deadly conflicts
- luminosity in astronomy
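As a sketch of how one of these data sets could be turned into a rank-size table once downloaded (the column layout is an assumption based on the datasets/population CSV; check the actual file before relying on it):

```python
import io
import csv

## a tiny inline sample standing in for population.csv
## (assumed columns: Country Name, Country Code, Year, Value)
sample = """Country Name,Country Code,Year,Value
Aruba,ABW,2016,104822
Afghanistan,AFG,2016,34656032
Angola,AGO,2016,28813463
"""
reader = csv.reader(io.StringIO(sample))
next(reader)                       ## skip the header row
populations = sorted((int(row[3]) for row in reader), reverse=True)
## rank-size table: rank 1 is the most populous entry
for rank, pop in enumerate(populations, start=1):
    print(rank, pop)
```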
10.4. Benford’s law¶
10.4.1. Physical constants¶
$ wget https://physics.nist.gov/cuu/Constants/Table/allascii.txt
$ cat allascii.txt | cut -c 61 | grep '[0-9]' > first-digits.dat
gnuplot> plot 'first-digits.dat'
This has downloaded an ASCII table of physical constants (recall that ASCII is basically plain text, so an ASCII table is simply a table stored in a plain text file).
Now we want to look at the distribution of those first digits. Which appears most?
In Section 3.3 we learned how to make and plot histograms. The most effective way to make and plot histograms might be the snippet of code that uses numpy and matplotlib in Section 3.5. Let’s use that to write a small program that analyzes the frequency of these first digits:
#! /usr/bin/env python3
## run this with
##    ./first-digit-hist.py first-digits.dat
## to get an interactive plot, or with
##    ./first-digit-hist.py first-digits.dat first-digit-freq.png
## (or you can put pdf or svg)
import sys
import numpy as np
import matplotlib.pyplot as plt

fname = sys.argv[1]
out_fname = ''
if len(sys.argv) == 3:
    out_fname = sys.argv[2]
    out_format = sys.argv[2][-3:]
## collect the quantity to be histogrammed in an array x
xs = open(fname, 'r').read().split()
x = [int(digit) for digit in xs]
## prepare the plot
n, bins, patches = plt.hist(x, bins=np.arange(1, 11) - 0.5,
                            density=True, facecolor='g', alpha=0.75)
plt.xticks(range(1, 10))
plt.xlim([0, 10])
plt.xlabel('first digit')
plt.ylabel('probability')
plt.title('Histogram of first digit probability')
plt.grid(True)
## if there is no output filename then we plot interactively
if not out_fname:
    plt.show()
else:
    plt.savefig(out_fname, format=out_format)
    print('output written to file %s' % out_fname)
Go ahead and enter the program first-digit-hist.py and then run it with:
## to get an interactive plot:
$ ./first-digit-hist.py first-digits.dat
## to put the output into a file:
$ ./first-digit-hist.py first-digits.dat first-digit-freq.png
## or you can replace .png with .pdf or .svg
The result of these runs should be a plot a bit like the one in Figure 10.4.1.1: you will see that 1 occurs much more frequently, followed by 2 and then 3. Once you get beyond 4 you would have to collect a lot of data to get a clear result, but you would then find that 4 occurs more than 5, which occurs more than 6 and so on.
Figure 10.4.1.1 Histogram of how many times the various digits appear as the first digit in a number. The list of numbers we use are the physical constants in the National Institute of Standards and Technology list.
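For comparison, Benford’s law predicts that the first digit \(d\) appears with probability \(\log_{10}(1 + 1/d)\), which we can tabulate directly:

```python
import math

## Benford's law: P(d) = log10(1 + 1/d) for first digit d
for d in range(1, 10):
    print(d, round(math.log10(1 + 1 / d), 3))
## digit 1 gets probability 0.301, digit 9 only 0.046
```

These are the theoretical values that the histogram of the NIST constants should approximate.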
As an example of how one can use shell pipelines to look at another data set, let us take the physical constants defined in the GNU Scientific Library sources, which can be found on github and on the GNU Savannah server:
$ wget https://raw.githubusercontent.com/ampl/gsl/master/const/gsl_const_mks.h
$ wget http://git.savannah.gnu.org/cgit/gsl.git/plain/const/gsl_const_cgs.h
$ cat gsl_const_mks.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-mks.dat
$ cat gsl_const_cgs.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-cgs.dat
$ ./first-digit-hist.py first-digits-gsl-cgs.dat &
$ ./first-digit-hist.py first-digits-gsl-mks.dat &
10.4.2. Stock quotes¶
Another source of numbers to which Benford’s law might apply is stock quotes. Let us FIXME
wget --output-document nasdaq-list.csv 'http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download'
#! /usr/bin/env python3
## run this with the CSV file as the only argument
## to get interactive plots, or with a second argument like
## first-digit-freq.png to save the plots to files
## (or you can put pdf or svg)
import sys
import numpy as np
import matplotlib.pyplot as plt
import csv

fname = sys.argv[1]
out_fbase = ''
if len(sys.argv) == 3:
    out_fbase = sys.argv[2][:-4]
    out_format = sys.argv[2][-3:]
    print(out_fbase)
csvfile = open(fname, 'r')
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
firstdigit_sale = []
firstdigit_cap = []
for row in list(reader)[1:]:       ## skip the header row
    print(row[0], row[2], row[3])
    if row[2] != 'n/a':
        sale = float(row[2])
        if sale != 0:
            ## '%e' formats in scientific notation, so the first
            ## character is the first significant digit
            firstdigit = ('%e' % sale)[0]
            firstdigit_sale.append(int(firstdigit))
    if row[3] != 'n/a':
        cap = float(row[3])
        if cap != 0:
            firstdigit = ('%e' % cap)[0]
            firstdigit_cap.append(int(firstdigit))
csvfile.close()
## prepare the plots: the quantities to be histogrammed are in x
for (label, x) in [('sale', firstdigit_sale), ('cap', firstdigit_cap)]:
    n, bins, patches = plt.hist(x, bins=np.arange(1, 11) - 0.5,
                                density=True, facecolor='g', alpha=0.75)
    plt.xticks(range(1, 10))
    plt.xlim([0, 10])
    plt.xlabel('first digit')
    plt.ylabel('probability')
    plt.title('Histogram of first digit probability')
    plt.grid(True)
    ## if there is no output filename then we plot interactively
    if not out_fbase:
        plt.show()
    else:
        out_fname = out_fbase + '_' + label + '.' + out_format
        plt.savefig(out_fname, format=out_format)
        print('output written to file %s' % out_fname)
    plt.gcf().clear()
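The way the program extracts the first significant digit deserves a note: formatting a number with '%e' produces scientific notation, whose first character is always the leading digit, regardless of the number’s magnitude. A quick check:

```python
## '%e' gives scientific notation, e.g. '1.230000e-03',
## so the first character is the first significant digit
for value in [0.00123, 45.6, 7890000.0]:
    print(value, '->', ('%e' % value)[0])
## prints 1, 4 and 7 as the first digits
```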
10.4.3. Further reading¶
- https://physics.nist.gov/cuu/Constants/Table/allascii.txt
- https://en.wikipedia.org/wiki/Benford%27s_law
- https://en.wikipedia.org/wiki/Zipf%27s_law
- stats on word length
- /usr/share/dict/words
- https://en.wikipedia.org/wiki/Brown_Corpus
- http://www.nltk.org/nltk_data/
- random book from gutenberg
- stats on individual income or wealth