.. _chap-data-files-and-first-plots:

Starting out: data files and first plots
========================================

[status: mostly-complete-needs-polishing-and-proofreading]

.. _sec-very-first-data-plots:

Very first data plots with gnuplot
----------------------------------

Our first goal is to become comfortable with data files and with
plotting. We first get the students to renew their acquaintance with
creating files with an editor and make a file with some hand-crafted
data.

Use your favorite editor (possibly ``emacs`` for those who have taken
my previous course, but ``vi`` or ``gedit`` should also work) to open
a file called ``simpledata.dat``

Enter two columns of simple data into this file. For example:

.. include:: simpledata.dat
   :literal:

Then save it, and enter ``gnuplot`` to plot this data:

.. code:: bash

    $ sudo apt-get install gnuplot-x11
    $ gnuplot
    gnuplot> plot 'simpledata.dat'

.. _fig-sampleplot-simpledata:

.. figure:: simpledata.*

   First example of data plot.


then have the students plot the data with slightly different options in
gnuplot:

.. code:: bash

    $ gnuplot
    gnuplot> plot 'simpledata.dat' with lines
    gnuplot> set xlabel 'this is the "x" axis'
    gnuplot> set ylabel 'this is the "y" axis'
    gnuplot> plot 'simpledata.dat' with linespoints

..
   .. note:: We want to give early hints to how you can make this
      automatic and reproducible, so we will also give an example of
      automatically making a PDF file and including into a document. We
      should make the students input this file, but on the projector we
      can show that :numref:`fig-sampleplot-simpledata` is generated
      by the gnuplot script ``simpledata.gp``
      running on the file ``simpledata.dat``

      sampleplot-gp-simpledata-tex

Plotting *functions* with gnuplot
---------------------------------

More examples of using gnuplot. We don’t assume knowledge of
trigonometry from younger students, so we tell the story as "this is
the \sin function, which you will learn about some day; it plots these
waves."

On the other hand the polynomial :math:`y = x^3 - 3x` should be within
their reach: I might call on the class to tell me "what’s
:math:`-10^3` and :math:`-2^3`, :math:`2^3`, and :math:`10^3`" - this
establishes that the plot goes down on the left hand side, and up on
the right hand side.  The interesting play in the middle can be
narrated by showing that :math:`(1/2)^3` is smaller than :math:`3 *
(1/2)`, so that the negative term dominates briefly (see
:numref:`fig-simplefunction`)..

.. code:: bash

    gnuplot> plot sin(x)
    gnuplot> plot x*x*x - 3*x
    gnuplot> plot x**3 - 3*x
    ## note that this last one did show a dip in the middle,
    ## but zooming in on the range from -3 to 3 shows the
    ## interesting features in the middle of the plot.  This
    ## plot has two flat points (one local maximum and one
    ## local minimum), rather than one saddle point.
    gnuplot> plot [-3:3] x**3 - 3*x

[style=AdvancedFrame] If I am getting a good mathematical vibe from the
classroom I will briefly step into calculus territory and ask students
to predict the local max and min for :math:`y = x^3 - 3x`. I will then
quickly show them that

.. math:: \frac{d(x^3 - 3x)}{dx} = 3 x^2 - 3

and using the quadratic formula

.. math:: x = \frac{-b \pm \sqrt{b^2 - 4ac}} { 2a }

we get :math:`\pm 6 / 6 = \pm 1`, which is exactly where the local max
and min are located (see :numref:`fig-simplefunction`).

You can now set the grid in the plot and see if this calculation matches
the plot:

.. literalinclude:: simplefunction.gp
   :language: gnuplot
   :caption: Instructions to plot a simple function and its
             derivative, in this case the function :math:`f(x) = x^3 -
             3 x` and its derivative :math:`df(x)/dx = 3 x^2 - 3`.

The two plots, superimposed in :numref:`fig-simplefunction`, show that
where the derivative function (:math:`3x^2-3`) is zero, the original
function (:math:`x^3-3x`) has its flat point. This can be presented,
especially to older kids, in a rapid way that conveys "you don’t have
to understand this, but if you have heard about slopes then note that
the second curve shows the slope of the first one..." For the younger
kids we can emphasize that the figure shows two functions and looks
intriguing.

.. _fig-simplefunction:

.. figure:: simplefunction.*

   A simple plot of the function :math:`y = x^3 - 3x` and its
   derivative :math:`y = 3x^2 - 3`.


Visualizing functions like this is cool and it can be useful during the
exploratory phase of a research project, but it seldom comes up in the
bulk of the work and in the final scientific write-up. We will now move
on to tasks which come up quite frequently.


.. _sec-reading-and-writing-files:

Reading and writing files, in brief
-----------------------------------

First let us make sure we know a couple of shell commands to look at a
file. Here I usually will take a portion of the board and write a boxed
inset "cheat sheet" with some useful shell commands. [4]_

Since we already have a file called ``simpledata.dat`` which we
created earlier, let us look at three shell commands that give us a
quick glance at what’s in that file: ``head``, ``tail`` and ``less``.

.. code:: bash

    $ head simpledata.dat
    $ tail simpledata.dat
    $ less simpledata.dat
    ## (when using less be sure to quit with 'q')

These are simple ways to peek at a file, and will work with any text
file. You should always remember these commands.

Next we will look at how to read a file in a Python program. This is a
crucial pattern and we will use it a lot. Type in the program
``reader.py`` shown in :numref:`listing-reader-py` and run it to see
what happens.

.. _listing-reader-py:

.. literalinclude:: reader.py
   :language: python
   :caption: reader.py -- Reads a simple set of 2-column data

Remember that after writing and saving the program you do the following
to make it executable and then run it:

.. code:: bash

    $ chmod +x reader.py
    $ ./reader.py

Finally let us see how to write files to disk. We will extend to do an
easy manipulation of the file ``simpledata.dat`` and then write it
back out to a new file ``simpledata.dat.sums``.  This new program will
be called ``simple-writer.py``, so we need to copy it first:

.. code:: bash

    $ cp reader.py simple-writer.py

and edit ``simple-writer.py`` to look like
:numref:`listing-simple-writer-py`.

.. _listing-simple-writer-py:

.. literalinclude:: simple-writer.py
   :language: python
   :caption: simple-writer.py -- Reads a simple set of 2-column data
             and sums the second column and then writes out line with
             (x y sumy).


At this point we are comfortable with basic programming patterns for
reading and writing data. This is very important: this kind of file
manipulation is one of the big steps in becoming a confident programmer.
This is a good time to make the kids familiar with the terminology
"input/output" (I/O).

Generating our own data to plot
-------------------------------

Now we look at an example of writing a python program to generate some
data which we will then plot. Initially this just feels like silly
data: it calculates the :math:`sin` function for many opints and
prints out the values. There is a reason to start with this very
simple model: it will allow us (in the more advanced course on Fourier
analysis) to give clear cut examples of deeper data anlysis.  We will
not be lazy: in the more advanced courses we will look at real signals
instead of the toy ones.

Start with the ``python3`` command line and let us see the few lines
of code that generate :math:`sin` wave data:

.. literalinclude:: simplewave.py
   :language: python
   :caption: simplewave.py -- simple :math:`sin` wave generator.

FIXME: put gnuplot result here

..
   FIXME: must do this in markdown for making this plot.
   You can see the plot in ,

   sampleplot-gp-simplewave-tex

Comments on this:

-  This just prints values to the terminal, so we limited it to 30
   values of ``x``. When we write it to a file will do much more.

-  The classic ``for`` loop generates a sequence of integers, which does
   not do well at all for plotting: we want to space our ``x`` values
   much more tightly to get a nice plot, so we divide the integer ``i``
   by 10.0 to get finely spaced values of ``x``.

Let us now put this into a file called ``simplewave-write.py``

.. literalinclude:: simplewave-write.py
   :language: python
   :caption: simplewave-write.py -- simple :math:`sin` wave generator,
             writes output to a file.


We run the program and do a quick scan of its output with:

.. code:: bash

    $ python3 simplewave-write.py
    $ ls -l sin_wave.dat
    $ head sin_wave.dat
    $ tail sin_vave.dat

and when we have seen the first and last few lines of the output file we
realize we can plot it (after going back to our gnuplot window) with:

::

    $ gnuplot
    gnuplot> plot 'sin_wave.dat' using 1:2 with lines

Describing what this program does is rather straightforward, and it can
be compared to the gnuplot instruction ``plot sin(x)``.

At this point, at the blackboard, I will suggest to the students a
couple of edits they "should have already tried on their own":

-  See what happens if we don’t divide :math:`i/100.0`: try both
   ``x = i`` (for this use ``range(7)``) and some intermediate value
   ``x = i/4.0`` (for this use ``range(28)``). Note the loss of
   resolution.

-  Use python’s ``sys.argv`` to take the file name as a command line
   argument.

We then show a cleaner version of this program which adds some comments,
makes clear what the :math:`sin` period is, and uses some robust
proramming paradigms such as using command line arguments.

.. literalinclude:: simplewave-cleaner.py
   :language: python
   :caption: simplewave-cleaner.py -- More elaborate version of
             simplewave-write.py

Note that the comments also tell you how to run a program and
visualize the output. This use of comments to explain behavior is an
important detail, even for small programs.

In the more advanced course on Fourier analysis we will revisit this
simple program and make it generate more complex waveforms.

The broad landscape of plotting software
----------------------------------------

Now that we have seen some examples, let us talk broadly about plotting
software, since you will soon be bombarded with people telling you about
their favorites.

By now we know well that in the free software world there is often a
dizzying variety of tools to do any task, with fans advocating each
approach. Gnuplot is full-featured, stable, actively maintained and
ubiquitous so I have chosen it, there are several other valid choices.

There are at least three main categories of plotting tools:

a. The "just a plotting program" kind,

b. The "plotting program with some data analysis that grew into a full
   programming language",

c. The "plotting library for a well-established programming language".

Gnuplot is clearly one of the first, R and Octave the second, Python
with Matplotlib and Cern’s Root are examples of the third.

Often it comes down to where a particular scientist did her early
research work: a boss will tell you to "use this tool because it’s what
I use". I recommend forming a broad knowlege of scientific tools so that
you can use *the most appropriate tool* instead of *the tool that makes
your boss comfortable*. You will often find that astrophysicists often
use Python with matplotlib, particle physicists use Cern’s Root,
biologists use R or Python, social scientists who do much statistical
work use R. You shoud always know the tool your community uses (if it’s
a free software tool), as well as some others which might be more
appropriate.

There are many proprietary plotting packages. I advocate against the use
of proprietary software, and it is certainly unacceptable to do science
with proprietary tools, but I will mention a couple of packages so that
when you come across users you will be able to categorize and compare
them and offer an effective free software approach to the same problem.

CricketGraph, often used in the 1990s, was a light-weight plotting
program, later supplanted by KaleidaGraph. I would recommend gnuplot as
a good way of doing what those programs did.

Matlab and IDL are simple plotting and data analysis programs that grew
out of control and added an ad-hoc programming language to the package.
The languages were never meant for large software applications, but are
often used to write very large programs. It is interesting to note that
these programs always start with the stated intention of not requiring a
scientist to know how to program, but they end up channeling scientists
into using an ill-designed language for large programs.

Matlab and IDL programs can be written in a much cleaner way using
Python for the programming parts, and the matplotlib plotting library
for graphics.

There is a final outlier in the proprietary data analysis world, which
is the "using a spreadsheet to do data analysis" approach, often with
the proprietary Excel spreadsheet. There is no saving grace to this
approach: apart from technical concerns with the validity of the
numerical subroutines, there is also the complete lack of
reproducibility of a person moving a mouse around over cells. One of
the most embarrassing cases of incorrect analysis was in a much-cited
Economics paper about debt rations in European countries
:cite:`reinhart2010growth`. The analysis was done with an Excel
spreadsheet, and some readers concluded that the authors selected the
wrong range of data with a mouse movement. There is no reproducibility
when mouse movements are part of the data analysis. The economics
article was disgraced because of the faulty analysis as well as other
problems with their methodology.

Data formats
------------

In :numref:`sec-very-first-data-plots` and
:numref:`sec-reading-and-writing-files` we saw the simplest examples
of data files: columns of numbers separated by white space. These are
the simplest to work with, and if your files are smaller than about 10
megabytes you should always treat yourself to that simplicity. This
format is often called "whitespace separated ascii" or names similar
to that.

Often you will find that the columns of data are separated by commas.
This format is called "comma separated values" (csv) and the files
often end with the ``.csv`` extension. The format has been around
almost half a century. It has some advantages over the whitespace
separated columns and is used by almost all spreadsheet programs as an
import/export format.  In gnuplot you handle this with the instruction
``set datafile separator comma`` as we see in
:numref:`sec-population-data-from-the-web`.

Sometimes files are in a variety of *binary formats*, which you cannot
read directly. We will not deal with these at this time, since we are
not yet working with very big files, but later on we will show how to
convert mp3 files to an ascii format which is easily read by our
programs and by gnuplot.

.. _sec-population-data-from-the-web:

Population data from the web
----------------------------

Our goals here are to:

-  Automate fetching of data sets from the web.

-  Look at a plot in a few different ways to get a narrative out of it.

We will start by looking at the population history of the whole world.
When I discuss this with students I often ask "what do you think the
population of the world is today?" (then you can have them search the
web for "world population clock", which will take them to
http://www.worldometers.info/world-population/).

Then ask "what do you think the world population was in 1914? And 1923?
And 1776? And 1066? And in the early and late Roman empire? And in the
Age of Pericles?

Let us search for

``world population growth``

and we will come to this web site:
https://ourworldindata.org/world-population-growth/ and if we go down a
bit further we will see a link to download the annual world population
data.  The text on the link is FIXME: this section is incomplete.

We will *not* click on the link. Instead we will use the program
``wget`` to download it automatically [5]_:

.. code:: bash

    $ wget http://ourworldindata.org/roser/graphs/[...]/....csv -O world-pop.csv

Note that this is a very long URL, but students can get it as a result
of their search, so nobody has to type the full thing in.

Once they have the file downloaded they can look at the data with:

.. code:: bash

    $ less world-pop.csv

and will quickly see that it is slightly different from the data we have
seen so far. The columns of data are separated by *commas* instead of
spaces. This type of file format is called *comma-separated-value*
format and is quite common. Our plotting program, ``gnuplot``, works
with space-separated columns by default, so there are two tricks to plot
the file. Either use the cool program ``sed`` to change the commas into
spaces:

.. code:: bash

    $ sed 's/,/   /g' world-pop.csv > world-pop.dat
    $ gnuplot
    gnuplot> plot 'world-pop.dat' using 1:3 with linespoints

or tell gnuplot to use a comma as a column separator:

.. literalinclude:: plotworldpopulation.gp
   :language: gnuplot
   :caption: Instructions to plot the world population.

.. _fig-plotworldpopulation:

.. figure:: plotworldpopulation.*
   :width: 90%

   The world population from 10000 BCE until the present time.

And what a story we could tell from this plot if it weren’t so hard to
read! The main problem with this plot is that the world population in
ancient times was quite small, and then it grew dramatically with
various milestones in history which allowed for longer life expectancy
and for the occupation of more of the world.

There are a couple of ways of trying to get more out of this plot. One
is to *zoom in* to certain parts of it. For example, in we zoom in to
the milennium from the founding of Rome to the fall of the western Roman
empire, shown in :numref:`fig-plotworldpopulationroman`.

.. literalinclude:: plotworldpopulationroman.gp
   :language: gnuplot
   :caption: Plot the world population from the founding of Rome until
             the fall of the western Roman empire.


.. _fig-plotworldpopulationroman:

.. figure:: plotworldpopulationroman.*
   :width: 90%

   The world population from the founding of Rome (753 BCE) until the
   fall of the western Roman empire (476 CE).

This is a good time to stop and discuss the graph.  In discussing
:numref:`fig-plotworldpopulationroman` students might make interesting
connections referring to the Wikipedia `Roman demography article
<https://en.wikipedia.org/wiki/Demography_of_the_Roman_Empire#Population>`_
It is sometimes estimated that the Roman empire might have had about
70 million citizens at the height of the empire, in the 2nd centry CE.
The world population at that time was approximately 200 million
people, so the Roman empire would have accounted for some 35% of the
world's population.  This means that large scale population events in
the Roman empire, like the Antonine Plauge in 165-180 CE, or the
decline and fall of the empire in the 4th and 5th centuries might
account for dips in :numref:`fig-plotworldpopulationroman`.

We can also zoom in to the 20th century.  In
:numref:`fig-plotworldpopulation20th` we zoom in to the 20th century.

.. literalinclude:: plotworldpopulation20th.gp
   :language: gnuplot
   :caption: Plot the world population in the 20th century.


.. _fig-plotworldpopulation20th:

.. figure:: plotworldpopulation20th.*
   :width: 90%

   The world population in the 20th century.

Discussion of :numref:`fig-plotworldpopulation20th` can point out that
there is exponential growth from 1900 to 1962 (the year in which the
world's rate of population growth peaked), but that the exponential
growth has interruptions due to World War I, the Spanish flu, and
World War II.

.. literalinclude:: plotworldpopulation0-1800.gp
   :language: gnuplot
   :caption: Plot the world population from 0 to 1800 CE.


.. _fig-plotworldpopulation0-1800:

.. figure:: plotworldpopulation0-1800.*
   :width: 90%

   The world population from 0 to 1800 CE.

And in :numref:`fig-plotworldpopulation0-1800` we zoom in to the
period from year 0 to 1800 CE.  It can be interesting to look at
pandemics and wars in this period and see if you can find features in
the plot that correspond to those periods in history.

These attempts at zooming in tell us a some interesting things:

-  It is frustrating that there is so little data before 1950.

-  The 0 to 1800 plot allows us to see things clearly before the
   population jumps up so much.

-  In the 0-1800 plot we see that the world population starts growing as
   we approach the year 1000, after which it flattens off around the
   year 1300 (the period of the great plague), after which it starts
   pick up and never stops growing.

The other way to look at data when the :math:`y` axis has too much
range is to use what is called a *log
scale*. :numref:`fig-plotworldpopulationlog` shows how this can be
done in ``gnuplot``, and you can see that the :math:`y` axis has been
adjusted so that we can see some of the features in the data.  This
plot is more useful than that in :numref:`fig-plotworldpopulation`.


.. literalinclude:: plotworldpopulationlog.gp
   :language: gnuplot
   :caption: Instructions to plot the world population with log scale.


.. _fig-plotworldpopulationlog:

.. figure:: plotworldpopulationlog.*
   :width: 90%

   The world population from 10000 BCE until the present time, with a
   log scale for population.  You can see some features because the
   log scale compresses the 20th century population explosion.


.. _sec-simple-surface-plots:

Simple surface plots
--------------------

So far we have looked at *line plots*. Let us now look at another type
of plot: the *surface plot*.  A surface plot comes from a function of
two variables: :math:`z = f(x, y)`.  In these plots the value of the
function is plotted as the *height* over the x, y plane.  Here is an
example:

.. literalinclude:: simplesurfacefunction.gp
   :language: gnuplot
   :caption: Instructions to plot a simple surface, in this case the
             function :math:`f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}`

.. _fig-simplesurfacefunction:

.. figure:: simplesurfacefunction.*
   :width: 60%

   The surface plot of :math:`f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}`

The plot in :numref:`fig-simplesurfacefunction` is visually
gratifying, and even more so because when you generate it
interactively with gnuplot you can grab it with the left mouse button
and rotate it to get more insight into what the surface looks like
from different angles.

Another way of showing the same information is a *heat
map*. :numref:`fig-heatmap` shows the same function, but this
time the value of the function is represented with *color* instead of
height.

.. literalinclude:: heatmap.gp
   :language: gnuplot
   :caption: Instructions to plot a heat map, in this case the
             function :math:`f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}`

.. _fig-heatmap:

.. figure:: heatmap.*
   :width: 60%

   The heat map for of :math:`f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}`


Topics we have covered
----------------------

-  data files

-  plots

-  gnuplot

-  reading features from simple plots

-  simple surface plots


.. [4]
   In the introductory course I have insets on the board with shell
   commands, ``emacs`` keybindings, and some Python commands. The
   ``emacs`` keybindings are especially important since the students
   have not necessarily done the full tutorial.


.. [5]
   The full URL is
   http://ourworldindata.org/roser/graphs/WorldPopulationAnnual12000years_interpolated_HYDEandUN/WorldPopulationAnnual12000years_interpolated_HYDEandUN.csv
   but we don’t need to type it all, so in the text I show an
   abbreviation of it.