An exploding field into which many mathematics graduates (and others) are going is known as "Data Science". This is a profession in which people look at data (sometimes "Big Data", sometimes not such big data), and try to figure things out from it. It uses statistics, computer programming, visualisation and knowledge of the context of the data.
This article is a case study showing the type of work a data scientist would do in order to pull together information from different sources to tell a story about it. It is very technical in places, but even skipping over the technical details will give you insight into how exciting it can be to chase after an elusive goal in this field. And if you are interested in finding out more,
there are some suggested references included as well.
This case study discussed what I did to produce the data table in the activity Where Are You Flying?
It turned out to require much more effort than one might have expected, and have many points at which choices had to be made. This gives a little flavour of what someone doing "data science" might have to do on a regular basis, and also some
of the (potentially important) issues in presented data. For this reason, it is written more like a story or journal entry than a nice "cleaned-up" explanation of how the data was finally extracted. At the end, there are some questions to ponder and further investigation.
Getting the raw data
The two things I needed were a list of which countries are in which continent, and where planes flew from London. The continents turned out to be a little easier, at least to begin with...
- I obtained information on continents from Wikipedia. This page has lots more information than I needed (or so I thought). But it links to a page with nice
computer-readable source (click on the "Edit page" icon or the "Edit" tab to access the source text). I copied the list of countries and the continent abbreviations into two separate files which you can download: continents.txt and continent-abbreviations.csv. (CSV stands for "comma-separated values", a plain-text format for spreadsheet-like data; this file can be opened in any spreadsheet software.)
- Getting historical flight data proved to be more challenging. The American Government publishes freely-available historic flight data on internal flights within the USA via the Bureau of Transportation Statistics website. But there is no British equivalent, as far as I can tell.
- There are commercial providers which will sell data on international flights and internal commercial flights within other countries. This data clearly has commercial value. But I was looking for an "open data" solution to my question as I really didn't want to have to pay to create this activity.
- The UK's Civil Aviation Authority publishes some summary statistics, rather than the flight-by-flight data that the USA publishes. So I decided to make do with that. I downloaded the 2017 annual punctuality statistics from their website
and saved the file as caa-flight-delays.csv.
Wrangling the data
Real-life data is messy. Different data sources are in incompatible formats or use different conventions. Sometimes data is missing. This is a problem that librarians and information professionals have been dealing with for well over a century: how do we organise information in such a way that connected items can be easily matched up? They have come up with a variety of
good solutions, but as there are multiple solutions, different organisations use different ones (and some don't follow any of the standard solutions), which can also cause difficulties, as we will see...
So here's what I did to clean up and draw the data together, using the Open Source software R
and the tidyverse suite. (You can read much more about working with this software in Hadley Wickham's book R for Data Science
, which he has made freely available.) The
final code I wrote is in the file get-london-flights.R
, but there was also a fair amount of playing with the data in RStudio to get a feel for it and to try things out. For example, I regularly looked at the data tables to ensure that things appeared to be reasonable before going further, or asked for a list of airports in the file to
ensure that I spelled everything correctly.
The continents and countries
- The continent abbreviation data was almost fine: I just read in the data table, except that the abbreviation for North America, "NA", was interpreted by R as meaning "the data is Not Available", so I had to take account of that.
- The format of the country names in the Wikipedia data file is something like "Afghanistan, Islamic Republic of" (just taking the first country on the list). Some countries don't have a comma, for example "American Samoa". That's not too bad; I can handle that. I started by checking that each country appears only once in the table, as problems will arise later if that's not
the case. This is relatively easy to do in R, using something like:
continents_raw %>% group_by(Name) %>% count() %>% filter(n > 1)
Oh dear. It turned out that some countries appear twice in the data file, mostly those spanning Europe and Asia. There are also interesting and significant cases (at least for my purposes) such as Madeira, which is part of Portugal but lies on the continental shelf of Africa: there are many flights to Madeira listed in the CAA flight data file (as I discovered later). A decision
had to be made for each of these. I decided that most of the Europe/Asia countries should be classified as Asia for this purpose, for no particularly good reason. However, for the Russian Federation, I decided to regard it as lying entirely in Europe: almost all of the flights are to Moscow or St Petersburg, which are in Europe. This was the first set of decisions I had to make;
different decisions might have made a difference to the results.
- On a more technical note, the format of the Wikipedia data is a little weird. I regarded it as a CSV file with funny field separators (the "|" symbol), but then fields were separated by a pair of these "|" symbols. So I ended up with lots of unnecessary columns, which I then deleted. (Oh, and the header row was separated with "!" symbols instead, so I manually edited the
file to make it consistent.)
The flight data
- The first step was to look at the data table to get a feel for it. My initial observations included noting:
- The country names are in UPPER CASE, whereas in the Wikipedia file they are just Capitalised.
- The country names are short; they don't appear to have the parts after the commas in the Wikipedia file.
- There is a column called "arrival_departure" which seems to indicate whether this row of the table is referring to flights arriving into the airport ("A") or departing from the airport ("D").
- Some rows have 0 flights! (The "number_flights_matched" column.)
- To handle these, I did the following:
- I split the Wikipedia country names at the comma, and made two columns in the continents table, one with just the main name ("name.main") and one with the full name ("name.full"), and I also capitalised them; for example, "Afghanistan, Islamic Republic of" became "AFGHANISTAN" (for "name.main") and "ISLAMIC REPUBLIC OF AFGHANISTAN" (for "name.full"). By looking at a few rows of the
flight data, I suspect that the "name.main" column will be much more useful.
- For the flight data, I only kept rows which had an "arrival_departure" value of "D", and for which there was at least one flight recorded.
- At some point, I thought I really ought to check that only one country has each main name, but it turns out that this is not the case: there are two pairs of countries which have the same main name, CONGO and KOREA. For each of these, the two countries involved are in the same continent (Africa and Asia, respectively), so I need keep only one of each pair for my purposes. (The
particular choice of which to keep does not make any political statement, though other choices I made earlier might do so. I did, however, choose to remain consistent with the countries actually appearing in the CAA data.)
- I filtered the CAA flight data to only keep flights out of the London airports, as those are what I am interested in on this occasion. There is a question here: which airports count as London airports? Is "London Gatwick" a London airport? What about "London Southend"? I followed the CAA classification that these are both considered as "London area" airports (though in
my first iteration, I excluded Southend - I later found the CAA classification).
Tying the flights and continents data together
I now have a table of countries and the continents they lie in, and a table giving the number of flights out of the London airports and their destinations. To count the number of flights to different continents, I somehow need to tie these two together.
- I started by listing all the countries appearing in the CAA table, and doing a quick visual scan. (Something like: unique(caa_data$origin_destination_country) will do this.) It is hard to scan over the entire file, so listing each country appearing makes this a manageable task.
- A couple of things jumped out at me: three entries have parentheses, namely "SPAIN(CANARY ISLANDS)", "PORTUGAL(MADEIRA)" and "PORTUGAL(EXCLUDING MADEIRA)". I'll get rid of the parentheses before I go further. (This is where I needed to decide whether Madeira is in Europe or Africa; I went with Europe.) There's also a country called "Unknown" (not in upper case) which I
removed as there are only two flights involved. This left me with a new column called "destination country" with the cleaned-up name.
- Now, all I need to do is to match the countries in the CAA table with the short names in the continents table. I started by checking that every country in the CAA table appears in the Wikipedia table. It turned out that there were 19 missing. (I used the anti_join function to do this, as in the R source code.) 13 of these just had slightly different names in the two
tables, 5 had the full names in the CAA table, and 1 (KOSOVO) did not appear at all. Kosovo is missing from the Wikipedia table, presumably because it is not currently recognised as a country by the UN, so I added it to the table of continents. For the other 18, I manually created a new table listing these, giving the corresponding CAA and Wikipedia names, and then merged this
information into the flights table.
Why is it that the country names in these two data sources don't match? It turns out that there is no generally agreed naming system for countries! Wikipedia bases its list on the United Nations list of preferred English names; see here
for information on their working group and their report. Within the report, it
is clear that countries have multiple names.
On the other hand, the Getty Research Institute has an internationally-recognised Thesaurus of Geographic Names (TGN)
, which is relied upon by institutions around the world; their preferred English names of countries sometimes differs from the UN names.
The CAA data does not seem to match either of these two; it is not clear where their country names come from.
This sort of data processing issue shows how valuable it is to make use of a widely recognised naming convention for something like country names: it makes further data analysis feasible.
Analysing the data
- At long last, I was ready to analyse the data! It made sense to handle domestic flights separately, which I did by adding a fictitious continent called "UK" to the table of continent codes, and changing the continent code of "UNITED KINGDOM" to "UK". That was far simpler than trying to separate out the domestic flights from the caa_data table and dealing with it as a special
- After that, I simply matched up the continents table to the caa_data table to associate every route with the continent it's in, then counted up the total number of flights for each continent.
- Finally, I tied up the continent names with each continent code, so that the table would be more readable.
- I then output the table as a CSV file, so that I could read it into any spreadsheet software. This table, continents-count.csv, is reproduced at the start of the activity Where Are You Flying?
Representing the data
Now that I had the data, I had to decide how to represent it. For the sake of this activity, I decided to use Excel to produce pictorial representations (even though R would do a perfectly good job), as that way I can play with some "interesting" representations which R might be less happy about generating. I also used graphics software to produce some of the graphics.
Some further questions
This resource is part of the collection Statistics - Maths of Real Life
- Would the results be significantly different if we included all UK airports rather than just the London-area ones?
- Did I use the correct data? What other sources of data can you find on the CAA website which might give better data, or at least more useful data?
- What other interesting questions could you ask about the data available? How might you go about answering them?
- Are there other questions related to flights you'd like to know the answers to, but for which the data is not available? How might you go about collecting the data you'd need to answer your questions?
The icon for this article incorporates the R logo, obtained from the R Project website. Its use is subject to the Creative Commons Attribution-ShareAlike 4.0 International license (CC-BY-SA 4.0). While this article talks about using R, it has not been
endorsed by the R Project and should not be taken as representing the views of the R Project in any way.