.. _chap-data-files-and-first-plots: Starting out: data files and first plots ======================================== [status: mostly-complete-needs-polishing-and-proofreading] .. _sec-very-first-data-plots: Very first data plots with gnuplot ---------------------------------- Our first goal is to become comfortable with data files and with plotting. We first get the students to renew their acquaintance with creating files with an editor and make a file with some hand-crafted data. Use your favorite editor (possibly ``emacs`` for those who have taken my previous course, but ``vi`` or ``gedit`` should also work) to open a file called ``simpledata.dat`` Enter two columns of simple data into this file. For example: .. include:: simpledata.dat :literal: Then save it, and enter ``gnuplot`` to plot this data: .. code:: bash $ sudo apt-get install gnuplot-x11 $ gnuplot gnuplot> plot 'simpledata.dat' .. _fig-sampleplot-simpledata: .. figure:: simpledata.* First example of data plot. then have the students plot the data with slightly different options in gnuplot: .. code:: bash $ gnuplot gnuplot> plot 'simpledata.dat' with lines gnuplot> set xlabel 'this is the "x" axis' gnuplot> set ylabel 'this is the "y" axis' gnuplot> plot 'simpledata.dat' with linespoints .. .. note:: We want to give early hints to how you can make this automatic and reproducible, so we will also give an example of automatically making a PDF file and including into a document. We should make the students input this file, but on the projector we can show that :numref:`fig-sampleplot-simpledata` is generated by the gnuplot script ``simpledata.gp`` running on the file ``simpledata.dat`` sampleplot-gp-simpledata-tex Plotting *functions* with gnuplot --------------------------------- More examples of using gnuplot. We don’t assume knowledge of trigonometry from younger students, so we tell the story as "this is the \sin function, which you will learn about some day; it plots these waves." On the other hand the polynomial :math:`y = x^3 - 3x` should be within their reach: I might call on the class to tell me "what’s :math:`-10^3` and :math:`-2^3`, :math:`2^3`, and :math:`10^3`" - this establishes that the plot goes down on the left hand side, and up on the right hand side. The interesting play in the middle can be narrated by showing that :math:`(1/2)^3` is smaller than :math:`3 * (1/2)`, so that the negative term dominates briefly (see :numref:`fig-simplefunction`).. .. code:: bash gnuplot> plot sin(x) gnuplot> plot x*x*x - 3*x gnuplot> plot x**3 - 3*x ## note that this last one did show a dip in the middle, ## but zooming in on the range from -3 to 3 shows the ## interesting features in the middle of the plot. This ## plot has two flat points (one local maximum and one ## local minimum), rather than one saddle point. gnuplot> plot [-3:3] x**3 - 3*x [style=AdvancedFrame] If I am getting a good mathematical vibe from the classroom I will briefly step into calculus territory and ask students to predict the local max and min for :math:`y = x^3 - 3x`. I will then quickly show them that .. math:: \frac{d(x^3 - 3x)}{dx} = 3 x^2 - 3 and using the quadratic formula .. math:: x = \frac{-b \pm \sqrt{b^2 - 4ac}} { 2a } we get :math:`\pm 6 / 6 = \pm 1`, which is exactly where the local max and min are located (see :numref:`fig-simplefunction`). You can now set the grid in the plot and see if this calculation matches the plot: .. literalinclude:: simplefunction.gp :language: gnuplot :caption: Instructions to plot a simple function and its derivative, in this case the function :math:`f(x) = x^3 - 3 x` and its derivative :math:`df(x)/dx = 3 x^2 - 3`. The two plots, superimposed in :numref:`fig-simplefunction`, show that where the derivative function (:math:`3x^2-3`) is zero, the original function (:math:`x^3-3x`) has its flat point. This can be presented, especially to older kids, in a rapid way that conveys "you don’t have to understand this, but if you have heard about slopes then note that the second curve shows the slope of the first one..." For the younger kids we can emphasize that the figure shows two functions and looks intriguing. .. _fig-simplefunction: .. figure:: simplefunction.* A simple plot of the function :math:`y = x^3 - 3x` and its derivative :math:`y = 3x^2 - 3`. Visualizing functions like this is cool and it can be useful during the exploratory phase of a research project, but it seldom comes up in the bulk of the work and in the final scientific write-up. We will now move on to tasks which come up quite frequently. .. _sec-reading-and-writing-files: Reading and writing files, in brief ----------------------------------- First let us make sure we know a couple of shell commands to look at a file. Here I usually will take a portion of the board and write a boxed inset "cheat sheet" with some useful shell commands. [4]_ Since we already have a file called ``simpledata.dat`` which we created earlier, let us look at three shell commands that give us a quick glance at what’s in that file: ``head``, ``tail`` and ``less``. .. code:: bash $ head simpledata.dat $ tail simpledata.dat $ less simpledata.dat ## (when using less be sure to quit with 'q') These are simple ways to peek at a file, and will work with any text file. You should always remember these commands. Next we will look at how to read a file in a Python program. This is a crucial pattern and we will use it a lot. Type in the program ``reader.py`` shown in :numref:`listing-reader-py` and run it to see what happens. .. _listing-reader-py: .. literalinclude:: reader.py :language: python :caption: reader.py -- Reads a simple set of 2-column data Remember that after writing and saving the program you do the following to make it executable and then run it: .. code:: bash $ chmod +x reader.py $ ./reader.py Finally let us see how to write files to disk. We will extend to do an easy manipulation of the file ``simpledata.dat`` and then write it back out to a new file ``simpledata.dat.sums``. This new program will be called ``simple-writer.py``, so we need to copy it first: .. code:: bash $ cp reader.py simple-writer.py and edit ``simple-writer.py`` to look like :numref:`listing-simple-writer-py`. .. _listing-simple-writer-py: .. literalinclude:: simple-writer.py :language: python :caption: simple-writer.py -- Reads a simple set of 2-column data and sums the second column and then writes out line with (x y sumy). At this point we are comfortable with basic programming patterns for reading and writing data. This is very important: this kind of file manipulation is one of the big steps in becoming a confident programmer. This is a good time to make the kids familiar with the terminology "input/output" (I/O). Generating our own data to plot ------------------------------- Now we look at an example of writing a python program to generate some data which we will then plot. Initially this just feels like silly data: it calculates the :math:`sin` function for many opints and prints out the values. There is a reason to start with this very simple model: it will allow us (in the more advanced course on Fourier analysis) to give clear cut examples of deeper data anlysis. We will not be lazy: in the more advanced courses we will look at real signals instead of the toy ones. Start with the ``python3`` command line and let us see the few lines of code that generate :math:`sin` wave data: .. literalinclude:: simplewave.py :language: python :caption: simplewave.py -- simple :math:`sin` wave generator. FIXME: put gnuplot result here .. FIXME: must do this in markdown for making this plot. You can see the plot in , sampleplot-gp-simplewave-tex Comments on this: - This just prints values to the terminal, so we limited it to 30 values of ``x``. When we write it to a file will do much more. - The classic ``for`` loop generates a sequence of integers, which does not do well at all for plotting: we want to space our ``x`` values much more tightly to get a nice plot, so we divide the integer ``i`` by 10.0 to get finely spaced values of ``x``. Let us now put this into a file called ``simplewave-write.py`` .. literalinclude:: simplewave-write.py :language: python :caption: simplewave-write.py -- simple :math:`sin` wave generator, writes output to a file. We run the program and do a quick scan of its output with: .. code:: bash $ python3 simplewave-write.py $ ls -l sin_wave.dat $ head sin_wave.dat $ tail sin_vave.dat and when we have seen the first and last few lines of the output file we realize we can plot it (after going back to our gnuplot window) with: :: $ gnuplot gnuplot> plot 'sin_wave.dat' using 1:2 with lines Describing what this program does is rather straightforward, and it can be compared to the gnuplot instruction ``plot sin(x)``. At this point, at the blackboard, I will suggest to the students a couple of edits they "should have already tried on their own": - See what happens if we don’t divide :math:`i/100.0`: try both ``x = i`` (for this use ``range(7)``) and some intermediate value ``x = i/4.0`` (for this use ``range(28)``). Note the loss of resolution. - Use python’s ``sys.argv`` to take the file name as a command line argument. We then show a cleaner version of this program which adds some comments, makes clear what the :math:`sin` period is, and uses some robust proramming paradigms such as using command line arguments. .. literalinclude:: simplewave-cleaner.py :language: python :caption: simplewave-cleaner.py -- More elaborate version of simplewave-write.py Note that the comments also tell you how to run a program and visualize the output. This use of comments to explain behavior is an important detail, even for small programs. In the more advanced course on Fourier analysis we will revisit this simple program and make it generate more complex waveforms. The broad landscape of plotting software ---------------------------------------- Now that we have seen some examples, let us talk broadly about plotting software, since you will soon be bombarded with people telling you about their favorites. By now we know well that in the free software world there is often a dizzying variety of tools to do any task, with fans advocating each approach. Gnuplot is full-featured, stable, actively maintained and ubiquitous so I have chosen it, there are several other valid choices. There are at least three main categories of plotting tools: a. The "just a plotting program" kind, b. The "plotting program with some data analysis that grew into a full programming language", c. The "plotting library for a well-established programming language". Gnuplot is clearly one of the first, R and Octave the second, Python with Matplotlib and Cern’s Root are examples of the third. Often it comes down to where a particular scientist did her early research work: a boss will tell you to "use this tool because it’s what I use". I recommend forming a broad knowlege of scientific tools so that you can use *the most appropriate tool* instead of *the tool that makes your boss comfortable*. You will often find that astrophysicists often use Python with matplotlib, particle physicists use Cern’s Root, biologists use R or Python, social scientists who do much statistical work use R. You shoud always know the tool your community uses (if it’s a free software tool), as well as some others which might be more appropriate. There are many proprietary plotting packages. I advocate against the use of proprietary software, and it is certainly unacceptable to do science with proprietary tools, but I will mention a couple of packages so that when you come across users you will be able to categorize and compare them and offer an effective free software approach to the same problem. CricketGraph, often used in the 1990s, was a light-weight plotting program, later supplanted by KaleidaGraph. I would recommend gnuplot as a good way of doing what those programs did. Matlab and IDL are simple plotting and data analysis programs that grew out of control and added an ad-hoc programming language to the package. The languages were never meant for large software applications, but are often used to write very large programs. It is interesting to note that these programs always start with the stated intention of not requiring a scientist to know how to program, but they end up channeling scientists into using an ill-designed language for large programs. Matlab and IDL programs can be written in a much cleaner way using Python for the programming parts, and the matplotlib plotting library for graphics. There is a final outlier in the proprietary data analysis world, which is the "using a spreadsheet to do data analysis" approach, often with the proprietary Excel spreadsheet. There is no saving grace to this approach: apart from technical concerns with the validity of the numerical subroutines, there is also the complete lack of reproducibility of a person moving a mouse around over cells. One of the most embarrassing cases of incorrect analysis was in a much-cited Economics paper about debt rations in European countries :cite:`reinhart2010growth`. The analysis was done with an Excel spreadsheet, and some readers concluded that the authors selected the wrong range of data with a mouse movement. There is no reproducibility when mouse movements are part of the data analysis. The economics article was disgraced because of the faulty analysis as well as other problems with their methodology. Data formats ------------ In :numref:`sec-very-first-data-plots` and :numref:`sec-reading-and-writing-files` we saw the simplest examples of data files: columns of numbers separated by white space. These are the simplest to work with, and if your files are smaller than about 10 megabytes you should always treat yourself to that simplicity. This format is often called "whitespace separated ascii" or names similar to that. Often you will find that the columns of data are separated by commas. This format is called "comma separated values" (csv) and the files often end with the ``.csv`` extension. The format has been around almost half a century. It has some advantages over the whitespace separated columns and is used by almost all spreadsheet programs as an import/export format. In gnuplot you handle this with the instruction ``set datafile separator comma`` as we see in :numref:`sec-population-data-from-the-web`. Sometimes files are in a variety of *binary formats*, which you cannot read directly. We will not deal with these at this time, since we are not yet working with very big files, but later on we will show how to convert mp3 files to an ascii format which is easily read by our programs and by gnuplot. .. _sec-population-data-from-the-web: Population data from the web ---------------------------- Our goals here are to: - Automate fetching of data sets from the web. - Look at a plot in a few different ways to get a narrative out of it. We will start by looking at the population history of the whole world. When I discuss this with students I often ask "what do you think the population of the world is today?" (then you can have them search the web for "world population clock", which will take them to http://www.worldometers.info/world-population/). Then ask "what do you think the world population was in 1914? And 1923? And 1776? And 1066? And in the early and late Roman empire? And in the Age of Pericles? Let us search for ``world population growth`` and we will come to this web site: https://ourworldindata.org/world-population-growth/ and if we go down a bit further we will see a link to download the annual world population data. The text on the link is FIXME: this section is incomplete. We will *not* click on the link. Instead we will use the program ``wget`` to download it automatically [5]_: .. code:: bash $ wget http://ourworldindata.org/roser/graphs/[...]/....csv -O world-pop.csv Note that this is a very long URL, but students can get it as a result of their search, so nobody has to type the full thing in. Once they have the file downloaded they can look at the data with: .. code:: bash $ less world-pop.csv and will quickly see that it is slightly different from the data we have seen so far. The columns of data are separated by *commas* instead of spaces. This type of file format is called *comma-separated-value* format and is quite common. Our plotting program, ``gnuplot``, works with space-separated columns by default, so there are two tricks to plot the file. Either use the cool program ``sed`` to change the commas into spaces: .. code:: bash $ sed 's/,/ /g' world-pop.csv > world-pop.dat $ gnuplot gnuplot> plot 'world-pop.dat' using 1:3 with linespoints or tell gnuplot to use a comma as a column separator: .. literalinclude:: plotworldpopulation.gp :language: gnuplot :caption: Instructions to plot the world population. .. _fig-plotworldpopulation: .. figure:: plotworldpopulation.* :width: 90% The world population from 10000 BCE until the present time. And what a story we could tell from this plot if it weren’t so hard to read! The main problem with this plot is that the world population in ancient times was quite small, and then it grew dramatically with various milestones in history which allowed for longer life expectancy and for the occupation of more of the world. There are a couple of ways of trying to get more out of this plot. One is to *zoom in* to certain parts of it. For example, in we zoom in to the milennium from the founding of Rome to the fall of the western Roman empire, shown in :numref:`fig-plotworldpopulationroman`. .. literalinclude:: plotworldpopulationroman.gp :language: gnuplot :caption: Plot the world population from the founding of Rome until the fall of the western Roman empire. .. _fig-plotworldpopulationroman: .. figure:: plotworldpopulationroman.* :width: 90% The world population from the founding of Rome (753 BCE) until the fall of the western Roman empire (476 CE). This is a good time to stop and discuss the graph. In discussing :numref:`fig-plotworldpopulationroman` students might make interesting connections referring to the Wikipedia `Roman demography article `_ It is sometimes estimated that the Roman empire might have had about 70 million citizens at the height of the empire, in the 2nd centry CE. The world population at that time was approximately 200 million people, so the Roman empire would have accounted for some 35% of the world's population. This means that large scale population events in the Roman empire, like the Antonine Plauge in 165-180 CE, or the decline and fall of the empire in the 4th and 5th centuries might account for dips in :numref:`fig-plotworldpopulationroman`. We can also zoom in to the 20th century. In :numref:`fig-plotworldpopulation20th` we zoom in to the 20th century. .. literalinclude:: plotworldpopulation20th.gp :language: gnuplot :caption: Plot the world population in the 20th century. .. _fig-plotworldpopulation20th: .. figure:: plotworldpopulation20th.* :width: 90% The world population in the 20th century. Discussion of :numref:`fig-plotworldpopulation20th` can point out that there is exponential growth from 1900 to 1962 (the year in which the world's rate of population growth peaked), but that the exponential growth has interruptions due to World War I, the Spanish flu, and World War II. .. literalinclude:: plotworldpopulation0-1800.gp :language: gnuplot :caption: Plot the world population from 0 to 1800 CE. .. _fig-plotworldpopulation0-1800: .. figure:: plotworldpopulation0-1800.* :width: 90% The world population from 0 to 1800 CE. And in :numref:`fig-plotworldpopulation0-1800` we zoom in to the period from year 0 to 1800 CE. It can be interesting to look at pandemics and wars in this period and see if you can find features in the plot that correspond to those periods in history. These attempts at zooming in tell us a some interesting things: - It is frustrating that there is so little data before 1950. - The 0 to 1800 plot allows us to see things clearly before the population jumps up so much. - In the 0-1800 plot we see that the world population starts growing as we approach the year 1000, after which it flattens off around the year 1300 (the period of the great plague), after which it starts pick up and never stops growing. The other way to look at data when the :math:`y` axis has too much range is to use what is called a *log scale*. :numref:`fig-plotworldpopulationlog` shows how this can be done in ``gnuplot``, and you can see that the :math:`y` axis has been adjusted so that we can see some of the features in the data. This plot is more useful than that in :numref:`fig-plotworldpopulation`. .. literalinclude:: plotworldpopulationlog.gp :language: gnuplot :caption: Instructions to plot the world population with log scale. .. _fig-plotworldpopulationlog: .. figure:: plotworldpopulationlog.* :width: 90% The world population from 10000 BCE until the present time, with a log scale for population. You can see some features because the log scale compresses the 20th century population explosion. .. _sec-simple-surface-plots: Simple surface plots -------------------- So far we have looked at *line plots*. Let us now look at another type of plot: the *surface plot*. A surface plot comes from a function of two variables: :math:`z = f(x, y)`. In these plots the value of the function is plotted as the *height* over the x, y plane. Here is an example: .. literalinclude:: simplesurfacefunction.gp :language: gnuplot :caption: Instructions to plot a simple surface, in this case the function :math:`f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}` .. _fig-simplesurfacefunction: .. figure:: simplesurfacefunction.* :width: 60% The surface plot of :math:`f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}` The plot in :numref:`fig-simplesurfacefunction` is visually gratifying, and even more so because when you generate it interactively with gnuplot you can grab it with the left mouse button and rotate it to get more insight into what the surface looks like from different angles. Another way of showing the same information is a *heat map*. :numref:`fig-heatmap` shows the same function, but this time the value of the function is represented with *color* instead of height. .. literalinclude:: heatmap.gp :language: gnuplot :caption: Instructions to plot a heat map, in this case the function :math:`f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}` .. _fig-heatmap: .. figure:: heatmap.* :width: 60% The heat map for of :math:`f(x, y) = e^{\frac{-x^2 - 1.7 y^2}{10}}` Topics we have covered ---------------------- - data files - plots - gnuplot - reading features from simple plots - simple surface plots .. [4] In the introductory course I have insets on the board with shell commands, ``emacs`` keybindings, and some Python commands. The ``emacs`` keybindings are especially important since the students have not necessarily done the full tutorial. .. [5] The full URL is http://ourworldindata.org/roser/graphs/WorldPopulationAnnual12000years_interpolated_HYDEandUN/WorldPopulationAnnual12000years_interpolated_HYDEandUN.csv but we don’t need to type it all, so in the text I show an abbreviation of it.