.. _chap-getting-to-philosophy:

=======================
Getting to philosophy
=======================

[status: content-mostly-written]

Motivation, prerequisites, plan
===============================

.. rubric:: Motivation

Go to any Wikipedia page and follow the first link in the body of its
text, then follow the first link of that page, and so forth. For
almost all Wikipedia pages this procedure will eventually lead you to
the Wikipedia page on Philosophy.

This observation has its own Wikipedia page:
https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy

.. note::

   When we say "first link" on a Wikipedia page, we mean the first
   link of the article content, after all the "for other uses",
   "(from Greek ...)", and other frontmatter. These are not part of
   the article itself.

This is not a rigorous or deep observation, but it allows us to write
some software to analyze and visualize this assertion, and that
journey will teach us some very cool programming techniques.

* Explore the "Getting to Philosophy" observation.
* Learn how to do a bit of *web scraping* and text manipulation.
* Use recursive programming for a real world application.
* Learn about the remarkable ``graphviz`` software.

.. rubric:: Prerequisites

* The 10-hour "serious programming" course.
* The "Data files and first plots" mini-course in
  :numref:`chap-data-files-and-first-plots`.
* Recursion from :numref:`chap-recursion`.
* Web scraping from :numref:`chap-web-scraping`.

.. rubric:: Plan

So how do we write programs that study and visualize this idea? We
will:

#. Review what web pages look like.
#. Write programs that retrieve and pick apart web pages looking for
   links.
#. Learn about graphviz.
#. Use graphviz to analyze the flow of links in our simple web pages.
#. Refine those programs so that they can search through the more
   complex HTML structure in Wikipedia articles.
#. Output the "first link" chain from various Wikipedia pages to a
   file so that graphviz can show us an interesting visualization of
   that chain.

Parsing simple web pages
========================

You should quickly review the brief section on what web pages look
like in :numref:`sec-what-does-a-web-page-look-like-underneath` before
continuing in this section.

Let us start with the simple web page we had in
:numref:`listing-simple-web-page-with-anchor` back in
:numref:`sec-what-does-a-web-page-look-like-underneath`.

Now write a program which finds the first hyperlink in a web page.
There are many ways of doing this using sophisticated Python
libraries, but we will start with a simple approach that uses only
Python's string methods. An example is in
:numref:`listing-find-first-anchor`.

.. _listing-find-first-anchor:

.. literalinclude:: find_first_link.py
   :language: python
   :caption: Look through the text of a page for the first hypertext
             link.

Running this program will give the position and text of the first
hyperlink in that HTML file::

   $ ./find_first_link.py myinfo.html
   pos, link URL: 330 myinfo.html
   last_part: myinfo

Making vertex and edge graphs
=============================

*Graph* can mean many things. In computer science it is a structure,
usually drawn as a picture, that shows connections between things.
The "things" (vertices) are shown as shapes and the connections
(edges) are shown as lines or arrows.

There is a very cool program called ``graphviz`` which lets you make a
simple text file and get a graph drawn from it. In
:numref:`listing-kennedys-py` there is a simple example that shows a
bit of President Kennedy's family tree:

.. _listing-kennedys-py:

.. literalinclude:: kennedys.dot
   :caption: The Kennedy family tree

You can then generate the picture with:

.. parsed-literal::

   dot -Tsvg -O kennedys.dot
   dot -Tpng -O kennedys.dot
   dot -Tpdf -O kennedys.dot

.. _fig-kennedys:

.. figure:: kennedys.dot.*

   The immediate family tree of President Kennedy, rendered with
   graphviz.
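Since the plan for this chapter is to have a Python program write a
file for graphviz, it may help to see how little is needed to produce
one. The following is only a minimal sketch, not part of the chapter's
programs: the function name ``edges_to_dot``, the graph name ``chain``
and the example page names are all made up for illustration. It
assumes that node names contain no double-quote characters.

```python
def edges_to_dot(edges, graph_name="chain"):
    """Return DOT source text for a directed graph.

    edges is a sequence of (source, target) pairs of node names.
    Assumption: node names do not contain double-quote characters.
    """
    lines = [f"digraph {graph_name} {{"]
    for source, target in edges:
        # Quote the node names so that spaces and punctuation are allowed.
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical example data: a made-up chain of page names.
example_edges = [("Page A", "Page B"), ("Page B", "Page C")]

if __name__ == "__main__":
    # Print the DOT text; redirect it to a file to process it with dot.
    print(edges_to_dot(example_edges), end="")
```

If you save the printed text in a file with a ``.dot`` extension, you
can process it with the same ``dot`` commands shown above.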

You can see more elaborate and sometimes quite visually striking
examples at the graphviz web site:
http://www.graphviz.org/Gallery.php

You can see that it would be illustrative to make such a graph of the
paths through Wikipedia pages. But first let's take some baby steps:
to get more comfortable with how graphviz works, students should
create their own ``.dot`` file with their own family tree. This
requires some fast typing, but then they can process it with ``dot``
and view the picture generated by graphviz.

A program to get to philosophy
==============================

The program I show you here is quite elaborate because it has to deal
with some scenarios that confuse the issue of which link is the "first
link" in a Wikipedia page. We have provisions that:

* exclude links that come in parentheses
* exclude links before the start of the first paragraph
* exclude links to Wikipedia "meta pages": those that start with
  ``File:``, ``Help:`` or ``Wikipedia:``, and those that end with
  ``.svg``

In :numref:`listing-gtp-py` we get to see the kind of algorithm we
invent as we do this sort of text processing. For example, to decide
whether a link is inside parentheses, the code counts the number of
open parentheses that have not yet been closed.

Now enter the program in :numref:`listing-gtp-py`:

.. _listing-gtp-py:

.. literalinclude:: gtp.py
   :language: python
   :caption: Examine the "Getting To Philosophy" principle on
             Wikipedia.

The results can be seen in :numref:`fig-gtp_graph`.

.. _fig-gtp_graph:

.. figure:: gtp_graph.dot.*

   A graph that shows what happens when you keep clicking the first
   link in a Wikipedia page. This often ends up in the Wikipedia
   entry on Philosophy. Note, though, that sometimes the articles get
   slightly modified so that they end up in areas related to
   philosophy, such as "existence" or "reality".

Run the program either with ``python3 gtp.py`` or, after making it
executable, directly as ``./gtp.py``. You can run it without
arguments, or you can give it the URL of a Wikipedia page:

.. code-block:: console

   $ chmod 755 gtp.py
   $ ./gtp.py https://en.wikipedia.org/wiki/Roman_Empire
   $ evince Roman_Empire.gv.pdf &
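
To make the parenthesis-counting idea described earlier more concrete,
here is a minimal sketch. It is not the code in ``gtp.py``. The names
``first_link_outside_parens``, ``is_meta_link`` and ``META_PREFIXES``
are invented for this illustration, and the sketch assumes links are
written as ``<a href="...">`` with double quotes. It handles the
parenthesis and "meta page" provisions, but it does not try to find
the start of the first paragraph.

```python
META_PREFIXES = ("File:", "Help:", "Wikipedia:")


def is_meta_link(href):
    """Return True if href looks like a link to a Wikipedia "meta page"."""
    name = href.rsplit("/", 1)[-1]   # the last part of the URL path
    return name.startswith(META_PREFIXES) or name.endswith(".svg")


def first_link_outside_parens(html):
    """Return the href of the first acceptable link, or None.

    A link is acceptable if it is not inside parentheses in the
    visible text and is not a meta page link.
    """
    depth = 0   # open parentheses seen so far that are not yet closed
    pos = 0
    while pos < len(html):
        ch = html[pos]
        if ch == "<":
            end = html.find(">", pos)
            if end == -1:
                break   # malformed tag: give up
            tag = html[pos:end + 1]
            if depth == 0 and tag.startswith("<a "):
                marker = 'href="'
                start = tag.find(marker)
                if start != -1:
                    start += len(marker)
                    stop = tag.find('"', start)
                    if stop != -1:
                        href = tag[start:stop]
                        if not is_meta_link(href):
                            return href
            # Skip the whole tag, so that parentheses inside URLs
            # are not counted as parentheses in the text.
            pos = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        pos += 1
    return None
```

For example, given the text
``<p>Rome (<a href="/wiki/Latin">Latin</a>: Roma) was a <a href="/wiki/City">city</a>.</p>``
the function skips the link to ``/wiki/Latin`` because the counter is
1 at that point, and returns ``/wiki/City``.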