10. Power laws, Zipf, Benford, …¶
Areas: pure math, curiosity, economics, complex systems
[status: partly written]
10.1. Motivation, prerequisites, plan¶
We can start by realizing that certain relationships between numbers are “curious”, and our curiosity might be enough to get us going. Here are two examples that might spark it:
- Longer words in a language occur less often and, more remarkably, there is a clear mathematical relationship between the frequency of a word and its rank in a frequency table (Zipf’s law).
- In numbers that come up in the real world, the most significant digit is more often 1 and 2 than 8 and 9 (Benford’s law).
There are many areas of nature and of human endeavor in which similar relationships come up. These relationships are called power laws (where “power” refers to the exponent applied to a variable). Many of them give insight into the underlying mechanisms of those areas of study.
We will start by studying Zipf’s law and write some programs that illustrate it. This will let us study other data sets that have similar relationships, some from astronomy, some from geography, some from economics. We will look at those examples, discussing the insights that come from the power law relationships. Finally we will dedicate some time to Benford’s law and some curious and amusing historical applications.
10.2. Zipf’s law¶
We start with Zipf’s law. In the first half of the 20th century George Kingsley Zipf noticed that if you take a large sample of words from a real-world English text, there is an inverse proportionality between the frequency of a word and its rank in the frequency table (see Section 5.5 where we introduced inverse proportionality).
This means that a plot of frequency versus rank will look like a \(1/x\) plot, or, if you plot both the x and y axes on a logarithmic scale, you will get a straight line with a slope of -1.
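To see why the log-log plot gives a straight line, take the logarithm of both sides of the inverse proportionality (here \(C\) is just the constant of proportionality):

```latex
y = \frac{C}{x}
\quad\Longrightarrow\quad
\log y = \log C - \log x
```

so plotting \(\log y\) against \(\log x\) gives a straight line with slope \(-1\) and intercept \(\log C\).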
How do we explore this law? As usual, we write a program! Start by typing in the program in Listing 10.2.1.
#! /usr/bin/env python3
import sys
import re

def main():
    if len(sys.argv) == 1:
        f = sys.stdin
    elif len(sys.argv) == 2:
        fname = sys.argv[1]
        f = open(fname, 'r')
    else:
        sys.stderr.write('error: use 0 or 1 arguments\n')
        sys.exit(1)
    sorted_words, rank_dict = read_words_from_file(f)
    f.close()
    print('## rank word frequency')
    for i, word in enumerate(sorted_words):
        print('%8d %-16s %8d' % (i+1, word, rank_dict[word]))

def read_words_from_file(f):
    word_set = set()
    rank_dict = {}
    for line in f.readlines():
        word_list = re.split(r'--|\s', line)
        ## now that we have the words, let's strip away all the
        ## punctuation marks and discard any empty strings
        cleaned_up_words = []
        for word in word_list:
            word = word.strip(""",.;:_"'&%^$#@!?/\\|+-()*""").lower()
            if len(word) > 0:
                cleaned_up_words.append(word)
        ## now that we have found the words in this line, we add
        ## them to the word rank dictionary before we lose the count
        ## information by adding them to the set
        for word in cleaned_up_words:
            if word in rank_dict:
                rank_dict[word] += 1
            else:
                rank_dict[word] = 1
        ## finally, add them to a set, which discards repeated
        ## occurrences
        word_set.update(cleaned_up_words)
    ## use the rank dictionary to sort the list of words by
    ## how often they occur (their rank)
    sorted_word_list = sorted(word_set, key=lambda x: -rank_dict[x])
    return sorted_word_list, rank_dict

main()
10.2.1. Example from a paragraph of your own.¶
Type a simple paragraph of English text into a file, call it simple-para.txt. Then run the program on that file with:
$ python3 word-freq-rank.py simple-para.txt
You’ll see a table with three columns: the rank of each word, the word itself, and how many times it occurs. Even in a short paragraph, common words like “the” and “a” should already appear near the top of the table.
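If you want to sanity-check the counting logic on a tiny input without creating a file, here is a minimal sketch using Python’s collections.Counter (the listing above builds the dictionary by hand instead):

```python
import re
from collections import Counter

text = "the cat and the dog and the bird"
words = [w.lower() for w in re.split(r'--|\s', text) if w]
counts = Counter(words)
## sort by decreasing frequency, as word-freq-rank.py does
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq)
## "the" comes out at rank 1 with frequency 3
```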
10.2.2. Example from Project Gutenberg¶
To download a full book from Project Gutenberg and plot its word frequency distribution with gnuplot, you can use these instructions:
##REQUIRED_FILE: swanns-way.txt
##PRE_EXEC: wget --output-document swanns-way.txt http://www.gutenberg.org/cache/epub/1128/pg1128.txt
##PRE_EXEC: ./word-freq-rank.py swanns-way.txt > swanns-way-freq.out
set multiplot layout 2,1 title "Zipf's law"
set grid
plot 'swanns-way-freq.out' using 1:3 with linespoints pt 6 ps 0.2 title "word frequency (linear scale)"
set logscale xy
plot 'swanns-way-freq.out' using 1:3 with linespoints pt 6 ps 0.2 title "word frequency (log-log scale)"
Figure 10.2.2.1 Plot of word frequency versus rank for the text of Swann’s Way (volume 1 of Marcel Proust’s “In Search of Lost Time”, formerly translated into English as “Remembrance of Things Past”).
The top plot in Figure 10.2.2.1 uses the usual linear scale. This can be hard to read: a few words have very large frequencies, and the small ones are hard to tell apart. That’s why the bottom plot uses a logarithmic scale on both axes (log scales are discussed in Section 4.2).
Out of curiosity, let us also look at the first few and last few lines of the output file.
## rank word frequency
1 the 10051
2 of 7169
3 to 6749
4 and 4631
5 a 4440
6 in 4160
7 that 3632
8 had 2712
9 which 2686
...
13449 recognition 1
13450 blankets 1
13451 enjoyable 1
13452 hoary 1
13453 episodes 1
13454 absolved 1
13455 retreat 1
13456 corporal 1
13457 basis 1
13458 my-poor 1
13459 barometers 1
10.2.3. Fitting the curve in Zipf’s law¶
TODO
10.2.4. Explanations of Zipf’s law¶
TODO
10.3. What are “power laws”?¶
If you have a relationship where the number of items with a given measure is proportional to that measure to a negative power, then we say that we have power law behavior.
In the case of Zipf’s law, if \(r\) is the rank of a word in the frequency table and \(f(r)\) is the frequency with which that word occurs in our text, we have:

\[f(r) \propto \frac{1}{r}\]

More in general we will have:

\[f(r) \propto r^{\alpha}\]

where \(\alpha\) is the “power” in “power law”. In Zipf’s law the power is \(\alpha = -1\).
10.3.1. Calculating the power from the data¶
How do you calculate the power \(\alpha\) if you are given a collection of data points?
In Section 6 we learned some techniques to fit functions to data. How would we fit a power law \(f(r) = C\, r^{\alpha}\) to our data? A standard trick is to take the logarithm of both sides: \(\log f = \log C + \alpha \log r\), so fitting a straight line to the log-log data gives \(\alpha\) as the slope.
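This fit can be sketched with numpy’s polyfit on the log-log data. The data here is synthetic, generated to follow an exact power law; a real data set would be noisier:

```python
import numpy as np

## synthetic frequency data following an exact power law f(r) = C * r**alpha
ranks = np.arange(1, 101)
freqs = 10000.0 * ranks ** -1.0
## fit a straight line to the log-log data: the slope is alpha
alpha, log_c = np.polyfit(np.log(ranks), np.log(freqs), 1)
print('alpha = %.3f' % alpha)   ## prints alpha = -1.000
```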
Zipf’s law was just one rather notable example of a power law, but there are many others. We will discuss a few more and write programs that download the relevant data and analyze it.
- population-rank distribution for nations
wget https://raw.githubusercontent.com/datasets/population/master/data/population.csv
- an easier-to-parse alternative:
wget https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2119.txt
- rank-size distribution for nations
- city size
- deadly conflicts
- luminosity in astronomy
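As a sketch of how one of these data sets could be turned into a rank-size table once downloaded (the column layout is an assumption based on the datasets/population CSV; check the actual file before relying on it):

```python
import io
import csv

## a tiny inline sample standing in for population.csv
## (assumed columns: Country Name, Country Code, Year, Value)
sample = """Country Name,Country Code,Year,Value
Aruba,ABW,2016,104822
Afghanistan,AFG,2016,34656032
Angola,AGO,2016,28813463
"""
reader = csv.reader(io.StringIO(sample))
next(reader)                       ## skip the header row
populations = sorted((int(row[3]) for row in reader), reverse=True)
## rank-size table: rank 1 is the most populous entry
for rank, pop in enumerate(populations, start=1):
    print(rank, pop)
```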
10.4. Benford’s law¶
10.4.1. Physical constants¶
$ wget https://physics.nist.gov/cuu/Constants/Table/allascii.txt
$ cat allascii.txt | cut -c 61 | grep '[0-9]' > first-digits.dat
gnuplot> plot 'first-digits.dat'
This has downloaded an ASCII table of physical constants (recall that ASCII is basically plain text, so an ASCII table is simply a table stored in a plain text file).
Now we want to look at the distribution of those first digits. Which appears most?
In Section 3.3 we learned how to make and plot histograms. The most effective way to make and plot histograms might be the snippet of code that uses numpy and matplotlib in Section 3.5. Let’s use that to write a small program that analyzes the frequency of these first digits:
#! /usr/bin/env python3
## run this with
##    ./first-digit-hist.py first-digits.dat
## to get an interactive plot, or with
##    ./first-digit-hist.py first-digits.dat first-digit-freq.png
## (or you can put pdf or svg)
import sys
import numpy as np
import matplotlib.pyplot as plt

fname = sys.argv[1]
out_fname = ''
if len(sys.argv) == 3:
    out_fname = sys.argv[2]
    out_format = sys.argv[2][-3:]
## collect the quantity to be histogrammed in an array x
xs = open(fname, 'r').read().split()
x = [int(digit) for digit in xs]
## prepare the plot
n, bins, patches = plt.hist(x, bins=np.arange(1, 11) - 0.5,
                            density=True, facecolor='g', alpha=0.75)
plt.xticks(range(1, 10))
plt.xlim([0, 10])
plt.xlabel('first digit')
plt.ylabel('probability')
plt.title('Histogram of first digit probability')
plt.grid(True)
## if there is no output filename then we plot interactively
if not out_fname:
    plt.show()
else:
    plt.savefig(out_fname, format=out_format)
    print('output written to file %s' % out_fname)
Go ahead and enter the program first-digit-hist.py and then run it with:
## to get an interactive plot:
$ ./first-digit-hist.py first-digits.dat
## to put the output into a file:
$ ./first-digit-hist.py first-digits.dat first-digit-freq.png
## or you can replace .png with .pdf or .svg
The result of these runs should be a plot a bit like the one in Figure 10.4.1.1: you will see that 1 occurs much more frequently, followed by 2 and then 3. Once you get beyond 4 you would have to collect a lot of data to get a clear result, but you would then find that 4 occurs more than 5, which occurs more than 6 and so on.
Figure 10.4.1.1 Histogram of how many times the various digits appear as the first digit in a number. The list of numbers we use are the physical constants in the National Institute of Standards and Technology list.
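For comparison, Benford’s law predicts that the first digit \(d\) appears with probability \(\log_{10}(1 + 1/d)\), which we can tabulate directly:

```python
import math

## Benford's law: P(d) = log10(1 + 1/d) for first digit d
for d in range(1, 10):
    print(d, round(math.log10(1 + 1 / d), 3))
## digit 1 gets probability 0.301, digit 9 only 0.046
```

These are the theoretical values that the histogram of the NIST constants should approximate.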
As an example of how one can use shell pipelines to look at another data set, let us take the physical constants defined in the GNU Scientific Library sources, which can be found on github and on the GNU Savannah server:
$ wget https://raw.githubusercontent.com/ampl/gsl/master/const/gsl_const_mks.h
$ wget http://git.savannah.gnu.org/cgit/gsl.git/plain/const/gsl_const_cgs.h
$ cat gsl_const_mks.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-mks.dat
$ cat gsl_const_cgs.h | grep 'define GSL_' | awk '{print $3}' | grep -v '(1e' | cut -c 2 > first-digits-gsl-cgs.dat
$ ./first-digit-hist.py first-digits-gsl-cgs.dat &
$ ./first-digit-hist.py first-digits-gsl-mks.dat &
10.4.2. Stock quotes¶
Another source of numbers to which Benford’s law might apply is stock quotes. Let us FIXME
wget --output-document nasdaq-list.csv 'http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download'
#! /usr/bin/env python3
## run this with the CSV file as the only argument
## to get interactive plots, or with a second argument like
## first-digit-freq.png to save the plots to files
## (or you can put pdf or svg)
import sys
import numpy as np
import matplotlib.pyplot as plt
import csv

fname = sys.argv[1]
out_fbase = ''
if len(sys.argv) == 3:
    out_fbase = sys.argv[2][:-4]
    out_format = sys.argv[2][-3:]
    print(out_fbase)
csvfile = open(fname, 'r')
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
firstdigit_sale = []
firstdigit_cap = []
for row in list(reader)[1:]:       ## skip the header row
    print(row[0], row[2], row[3])
    if row[2] != 'n/a':
        sale = float(row[2])
        if sale != 0:
            ## '%e' formats in scientific notation, so the first
            ## character is the first significant digit
            firstdigit = ('%e' % sale)[0]
            firstdigit_sale.append(int(firstdigit))
    if row[3] != 'n/a':
        cap = float(row[3])
        if cap != 0:
            firstdigit = ('%e' % cap)[0]
            firstdigit_cap.append(int(firstdigit))
csvfile.close()
## prepare the plots: the quantities to be histogrammed are in x
for (label, x) in [('sale', firstdigit_sale), ('cap', firstdigit_cap)]:
    n, bins, patches = plt.hist(x, bins=np.arange(1, 11) - 0.5,
                                density=True, facecolor='g', alpha=0.75)
    plt.xticks(range(1, 10))
    plt.xlim([0, 10])
    plt.xlabel('first digit')
    plt.ylabel('probability')
    plt.title('Histogram of first digit probability')
    plt.grid(True)
    ## if there is no output filename then we plot interactively
    if not out_fbase:
        plt.show()
    else:
        out_fname = out_fbase + '_' + label + '.' + out_format
        plt.savefig(out_fname, format=out_format)
        print('output written to file %s' % out_fname)
    plt.gcf().clear()
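The way the program extracts the first significant digit deserves a note: formatting a number with '%e' produces scientific notation, whose first character is always the leading digit, regardless of the number’s magnitude. A quick check:

```python
## '%e' gives scientific notation, e.g. '1.230000e-03',
## so the first character is the first significant digit
for value in [0.00123, 45.6, 7890000.0]:
    print(value, '->', ('%e' % value)[0])
## prints 1, 4 and 7 as the first digits
```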
10.4.3. Further reading¶
- https://physics.nist.gov/cuu/Constants/Table/allascii.txt
- https://en.wikipedia.org/wiki/Benford%27s_law
- https://en.wikipedia.org/wiki/Zipf%27s_law
- stats on word length
- /usr/share/dict/words
- https://en.wikipedia.org/wiki/Brown_Corpus
- http://www.nltk.org/nltk_data/
- random book from gutenberg
- stats on individual income or wealth