6. Fitting functions to data

[status: partly-written]

6.1. Motivation, Prerequisites, Plan

In Section 5 we looked at how to take a function and represent it as points and lines in a plot. Here we do the opposite: if we are given a collection of \((x, y)\) points we try to find what kind of function might have generated those points.

There are so many types of functions that there is some artistry involved in picking which kind of function to fit to a set of data.

In many cases we will want to fit a straight line to our data. Sometimes it will be a polynomial. Our intuition on what functions to try to fit can come from two sources: (a) looking at the data visually (“hey! that looks like a parabola!”), and (b) having some intuition about the process that underlies that data (“should be an exponential decay; let’s try fitting an exponential function!”)

Terminology: the techniques we use here are sometimes referred to as regression, linear regression, fitting, curve fitting, line fitting, least squares fitting, …

6.2. Examples to get started

Let us start by visualizing some data sets. We’ll start with the dataset we had for age, height and weight of the !Kung people of the Kalahari desert. We downloaded and analyzed this data set back in Section 3.2, but there we were looking at height and weight historgrams. Here we look at something simpler: height versus age.

In this example we will grab the data set,

$ wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
$ gnuplot
gnuplot> set datafile separator ";"
gnuplot> plot 'Howell1.csv' using 3:1 with points

You can carry it all out with the following instructions:

Listing 6.2.1 Instructions to plot the height vs. age for the !Kung.
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot 'Howell1.csv' using 3:1 with points

and you should get the figure in Figure 6.2.1.

../_images/plot-height.svg

Figure 6.2.1 Height vs. age for the DUDE !Kung people of the Kalahari desert. Note that we have two separate behaviors of the height data: one for before the age of 18 (rapid growth), the other for after the age of 20 (mostly norizontal).

We see this drastic difference between lower and higher age. One thing that comes to mind is that the mechanisms that cause this are obvious biological ones: we grow fast when we are young, then we stop growing taller.

The rising height part of the graph (up to age 18) seems to be reasonably close to a straight line, as is the top part (from 20 years on). The entire plot does not at all look like a straight line.

Let us look at those separate areas in separate plots:

Listing 6.2.2 Zooming in on the straight line between ages 2 and 18.
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [2:20] 'Howell1.csv' using 3:1 with points

and you should get the figure in Figure 6.2.2.

../_images/plot-height-2-18.svg

Figure 6.2.2 Zooming in on the age 2-18 region of the height plot.

6.3. Straight line fits

6.3.1. Our goal

Let us state a bold goal: can we find a straight line that seems to fit the points in the rising area of the height plot of the !Kung people (Figure 6.2.2)?

Remember from Section 5.2 that a straight line looks like \(y = m x + b\) is described by two parameters: the slope \(m\) and the y intercept \(b\), so another way to phrase this goal is:

Given a collection of points \((x_i, y_i)\), find the slope and intercept for the line \(y = m x + b\) which most closely fits the points.

You could do this visually: print it, take a ruler, place it so that it runs through the points in the plot, and draw a line. The result is not really optimal, so we look for better techniques.

6.3.2. Stepping back: just two points

Let us start with just two points, one for a 3-year-old and the other for a 17-year-old. From the data file we can pick the (age, height) pairs: (3, 96.52) and (17, 142.875).

In middle school math we learn how to find the line that goes through these points. We write out the equation \(y = m x + b\) using the specific \((x, y)\):

(6.3.2.1)\[\begin{split}96.52 & = m \times 3 + b \\ 142.875 & = m \times 17 + b\end{split}\]

By subtracting the equations we get:

\[\begin{split}142.875 & - 96.52 = m \times 17 - m \times 3 = m \times 14 \\ & \implies m = (142.875 - 96.52) / 14 = 3.311\end{split}\]

and substituting back into the first equation we get:

\[\begin{split}96.52 & = 3.311 \times 3 + b \\ & \implies b = 96.52 - 3 \times 3.3111 = 86.5867\end{split}\]

So the equation for our line is:

\[\begin{split}m & = 3.311 \\ b & = 86.5867 \\ y & = 3.311 \times x + 86.5867\end{split}\]

Note

This is just a first stab at it! We do not know if we chose those two points well – in fact it seems from this picture that the 17-year-old we picked was shorter than average, which would skew the results. The 3-year-old is also taller than average. This example was just to get going so that we can talk about line fits. In Section 6.4 we will do proper line fitting.

6.3.3. Let’s plot that line with our data

Now that we have a line fit we should feel really excited about plotting it together with our data, so that we can get some visual satisfaction. I always encourage people to do this right away.

Let us take the plotting instructions in Listing 6.2.1 and add to it the plotting of our fitted line:

Listing 6.3.3.1 Height vs. age and a line fit for ages 2-18.
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [0:26] 'Howell1.csv' using 3:1 with points, \
     [2:18] 3.311 * x + 86.5867 lw 6 title "two point fit"

and you should get Figure 6.3.3.1.

../_images/plot-height-with-fit.svg

Figure 6.3.3.1 Height vs. age for the !Kung people of the Kalahari desert, with a line fit for ages 2-18.

6.3.4. Physical interpretation of the line fit

One realy cool thing we learn about fitting functions to data is that those functions have a physical interpretation! This is one of the things that gets scientists really excited, so I will mention it before we move on to a better procedure for fitting curves.

If you look closely at the units of measure in our scatter plot you see that we are plotting height (a measure of length) versus age (a measure of time). This means that the slope \(m\) is measured in units of length divided by time, in this case in centimeters per year. So the slope we found of 3.311 should really be reported as 3.311 centimeters/year, and it tells us how much children grow per year between the ages of 2 and 18.

Now let us look at the intercept \(b\) (in our case 86.5867). This number tels us what the value of y (height) is at time zero (birth). This would seem to imply that babies are born some 87cm tall.

Important

But wait! Babies are not born 87cm tall. They are born around 50cm tall. Then why did our straight line fit report that height at birth?? This is explained by remembering the original plot: in Figure 6.2.1 we see that the height forms a straight line between the ages of 2 and 18, but not for children less than 2. Those children grow much faster, so the straight line is not a valid approximation for infants. This means that, sadly, the intercept \(b\) does not give a gratifying physical interpretation in this case.

6.4. Proper line fitting

We’ve seen how to pick two points from our data set and make a line that goes through them, but this approach has some real problems: if I had chosen a very tall 3-year-old and a very short 17-year-old, then the slope would have been much smaller. I could have also skewed it the other way by picking a very short 3-year-old and a very tall 17-year-old, which would have given a much steeper slope. [FIXME: write exercise to do all of this]

So what is a good objective way of finding the “best fit”? This is much debated, and one get get quite subtle and make a real profession of the line fitting business, but for our purposes we will use the most comon approach. It is called “least squares fitting”.

Look at Figure 6.4.1:

../_images/Linear_least_squares_example2.svg

Figure 6.4.1 From wikipedia, https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics) A plot of the data points (in red), the least squares line of best fit (in blue), and the residuals (in green).

You see four points and an example of a line that tries to go through those four points. How well did it succeed?

Look at those green lines in the figure: they show how far the point is from the line, and they give a measure of the error in your fit. If you had chosen a radically different blue line, then the sum of all those green lines might be quite big, and you would have a poor fit.

The idea behind “least squares” is that you want to minimize the sum of the squares of those errors.

There is a lot going on here. You might immediately ask at least two questions: (a) why the sum of the squares of the errors? (b) how do you minimize?

For now I will give a very brief answer to the first question: squares are often considered “nicer” functions than just taking the distance itself. You can see this with this plot:

$ gnuplot
gnuplot> plot [-3:3] x*x
gnuplot> replot abs(x)

both the lines tend to grow bigger when your error is bigger, but the smooth one is more suited to various mathematical techniques.

As for the second question: we will minimize the sum of the squares of the errors using techniques from calculus. I don’t describe them here, but I will show a cool pair of equations. First wrap your head around this expression:

(6.4.1)\[E = \sum_{i=1}^N (m x_i + b - y_i)^2\]

I like formulae like this one because that capital greek sigma letter is visually appealing, but what does it all mean?

First note that in math we use the symbol \(\sum_{i=1}^N\) to mean “take a sum of all these terms, where the letter i will be replaced in turn by 1, 2, 3, … up to N.

Now let’s say you have N data points \((x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\). Then the E in equation (6.4.1) is the sum of a bunch of terms, each of which looks like \((m x_i + b - y_i)^2\). That is difference between “what the fitted line would have given you” and “the real data point that you have”. Squared. These differences are often called the “residuals”.

So E is a pretty good measure of how poor your line fit is, so if you find the values of m and b that make E as small as possible then they might give you a nice straight line fit through your data.

To show you some more cool math typesetting, and to entice you to study calculus, I will show you the equations that are written to solve this problem:

(6.4.2)\[\begin{split}E & = \sum_{i=1}^N (m x_i + b - y_i)^2 \\ \frac{\partial E}{\partial m} & = 0 \\ \frac{\partial E}{\partial b} & = 0\end{split}\]

This gives you two equations which you can solve to get m and b. We will not go through the details of it since it is calculus material (“finding minima using derivatives”), but we will now learn to use a Python library that does this calculation for us.

Note

There are many pleasing aspects to learning about least squares fitting. One of them is that this is one of the earliest places where you run in to … FIXME

6.5. Using Python’s scientific libraries to fit lines

Python comes with an extensive scientific library called scipy. Scipy has a statistics subpackage, which in turn has a function called linregress() which does all that work for us.

Enter the program in

Listing 6.5.1 fit-height.py – fit a line through the height data and print out the slope and intercept.
#! /usr/bin/env python3

import sys
from scipy.stats import linregress

def main():
    if len(sys.argv) != 4:
        print('error: wrong number of arguments')
        print('usage: %s filename min_age max_age' % sys.argv[0])
        print('example: %s Howell1.csv 2 18' % sys.argv[0])
        sys.exit(1)
    fname = sys.argv[1]
    lowest_age = int(sys.argv[2])
    highest_age = int(sys.argv[3])
    
    xdata, ydata = load_file(fname, lowest_age, highest_age)
    slope, intercept, r_value, p_value, std_error = linregress(xdata, ydata)
    print('the least squares fit returns slope, intercept:')
    print('m =', slope, ', b =', intercept)


## load columns 2 and 0 from the file, return two data vectors.  only
## pick out ages betwen min_age and max_age (inclusive)
def load_file(fname, min_age, max_age):
    xdata = []
    ydata = []
    f = open(sys.argv[1], 'r')
    for line in f.readlines()[1:]:
        words = line.split(';')
        x, y = float(words[2]), float(words[0])
        if x >= min_age and x <= max_age:
            xdata.append(x)
            ydata.append(y)
    f.close()
    return xdata, ydata

main()

Run the program with:

$ ./fit-height.py  Howell1.csv 2 18
the least squares fit returns slope, intercept:
m = 3.99026836995 , b = 79.6805438775

Now compare these values to those found in in Section 6.3.2 where we just picked two points to work with. It turns out that 3.99 is not too far from 4.12, and 79.68 is not far from 79.06, but they are different. Let us update Figure 6.3.3.1 to also show the line found by least squares linear regression:

Listing 6.5.2 Height vs. age, a naive line fit and a least squares fit for ages 2-18.
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot [0:26] 'Howell1.csv' using 3:1 with points, \
    [2:18] 3.311 * x + 86.5867 lw 4 title "two point fit", \
    [2:18] 3.99026836995 * x + 79.6805438775 lw 4 title "least squares fit"

and you should get Figure 6.5.1.

../_images/plot-height-with-two-fits.svg

Figure 6.5.1 Height vs. age for the !Kung people of the Kalahari desert, with a naive line fit for ages 2-18, and a least squares fit as well. Note the difference between the two lines - the steeper line from the least squares fit is a better fit for the age 2-18 range.

This line fit from the least squares method was a much better fit to the 2-18 year old data, so we can now say that the !Kung youth grow approximately 3.99 centimeters/year.

6.6. When to not try a linear fit

We should only try linear fits if we have reason to believe that the data is linear (either visually or from the scientific mechanism that underlies the data).

In the case of our height plot, what would happen if we were to try to fit the entire plot, from age 0 to the maximum age?

Let us try and see what it looks like. We run our program with ages 0 and 90:

$ ./fit-height.py  Howell1.csv 0 90
the least squares fit returns slope, intercept:
m = 0.909605221912 , b = 111.571782869

What would this line look like? Plotting like this:

Listing 6.6.1 Plot inappripriate line fit.
##REQUIRED_FILE: Howell1.csv
##PRE_EXEC: wget --continue https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Howell1.csv
set datafile separator ";"
set grid
set key right bottom
set title 'height versus age for the !Kung people'
set xlabel 'age (years)'
set ylabel 'height (cm)'
plot 'Howell1.csv' using 3:1 with points, \
     0.909605221912 * x + 111.571782869 lw 4 title "least squares line fit"

and you should get the Figure 6.6.1.

../_images/plot-height-with-inappropriate-fit.svg

Figure 6.6.1 Height vs. age for the !Kung people of the Kalahari desert. We also plot a fit based on the entire plot, and we see that it is a poor fit to the data, since the data is not linear.

We see that the mechanisms that determine height are quite different for infants, youths and older people, so there is no single line fit.