Using SQL to Query and Analyze a CIA Factbook Database

I use SQL to explore a database of facts about countries in the world. In addition, I use Pandas and Altair to perform data transformations and create interactive charts on population-related issues.
sql
sqlite
python
pandas
altair
Author

Migs Germar

Published

July 24, 2021

Unsplash | NASA

Overview

The CIA World Factbook website is a collection of basic information on hundreds of countries and geopolitical entities. Each country is discussed with regards to a wide range of aspects, such as politics, security, socio-economics, and environment.

For this project, we have downloaded the factbook.db database file, which contains CIA World Factbook data compiled by Dataquest. The goal of the project is to use sqlite3 in Jupyter Notebook to query this database and perform basic exploratory data analysis. Specifically, SQL is used to produce summary statistics and data transformations, Pandas is used to perform further transformations, and Altair is used to create interactive charts.

Note

I wrote this notebook by following a guided project on the Dataquest platform, specifically, the Analyzing CIA Factbook Data Using SQL Guided Project. The general project flow and research questions came from Dataquest. However, the text and code in this notebook are written by me unless stated otherwise.

Preparations

Install necessary packages.

import pandas as pd
import numpy as np
import altair as alt

Connect Jupyter Notebook to the database file using the SQLite flavor of SQL.

%load_ext sql
%sql sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
'Connected: @private/2021-07-24-SQL-Fundamentals-Files/factbook.db'

Exploratory Data Analysis

Database Tables

First, we inspect the tables in the database. SQL is used in a code cell via the %%sql magic.

%%sql
SELECT *
FROM sqlite_master
WHERE type = "table";
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
type name tbl_name rootpage sql
table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float)

There are two tables in the database: sqlite_sequence and facts. The latter contains the data that we will analyze. The sql column above lists the columns in the facts table and their data types. The column names include “area”, “population”, “population_growth”, etc. which are the kind of information that we would expect to know about a country.

The facts Table

Let us inspect the first 5 rows of the facts table.

%%sql
SELECT *
FROM facts
LIMIT 5 --#Limit the result to the first 5 rows.
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.3 12.92 6.58 3.3
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.0
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

Most of the column names are self-explanatory. The following are some additional information about the data according to Dataquest.

  • The areas are given in square kilometers (\(\text{km}^2\))
  • population_growth is given as the percentage of increase of the total population per year.
  • birth_rate and death_rate are given as the number of people who are born/pass away per year per 1000 people.

Summary Statistics on Population Size and Growth

Below, we calculate some basic summary statistics about the countries’ populations and population growths. We want to see which countries have the highest and lowest value of each variable.

%%sql
SELECT
    MIN(population),
    MAX(population),
    MIN(population_growth),
    MAX(population_growth)
FROM facts
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
0 7256490011 0.0 4.02

Interestingly, the minimum population size among all of the countries in the table is 0, meaning no people at all. This is surprising since one would expect a country to have many people. This may indicate an error in the data, so let’s check which countries have 0 people.

%%sql
SELECT *
FROM facts
WHERE population == 0
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
250 ay Antarctica None 280000 None 0 None None None None

It turns out that it is not a country, but rather the continent of Antarctica, which is at the South Pole. It is too cold there for people to live comfortably, so of course its population size would be 0.

The maximum population size, on the other hand, is over 7 billion, which seems too large. Let’s check which country this is.

%%sql
SELECT *
FROM facts
WHERE population == (
    SELECT MAX(population) --# Use a subquery to get the exact value of the maximum population size.
    FROM facts
)
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
261 xx World None None None 7256490011 1.08 18.6 7.8 None

It turns out that 7 billion refers to the total population in the entire world, not one country, so there is no error in the data. The data seems to have been collected in the mid-2010s, since the world population reached 7.2 billion in 2013.

Revised Summary Statistics

We’ve come to the realization that this database doesn’t just include entries on individual countries but also on entire continents and the entire world. Therefore, if we want to find useful information, we should exclude the entries for Antarctica and the World. We can do this using SQL’s NOT IN operator.

Let’s run our initial query again with this in mind.

%%sql
SELECT
    MIN(population),
    MAX(population),
    MIN(population_growth),
    MAX(population_growth)
FROM facts
WHERE name NOT IN ("World", "Antarctica")
--# Exclude World and Antarctica from the entries being used in computation.
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
48 1367485388 0.0 4.02

The population growth statistics are the same as before, but now, the population size statistics are different.

Interestingly, the minimum population size is still very small—only 48 people. Which country is this?

%%sql
SELECT *
FROM facts
WHERE population == (
    SELECT MIN(population)
    FROM facts
    WHERE name NOT IN ("World", "Antarctica")
)
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
238 pc Pitcairn Islands 47 47 0 48 0.0 None None None

The country with a population of 48 is the Pitcairn Islands, which, according to the World Factbook, is a colony of the UK. The islands are small (the query result above shows only 47 square kilometers of land), so there are few people.

On the other hand, the maximum population size was over 1.3 billion. Let’s check which country this is.

%%sql
SELECT *
FROM facts
WHERE population == (
    SELECT MAX(population)
    FROM facts
    WHERE name NOT IN ("World", "Antarctica")
)
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
37 ch China 9596960 9326410 270550 1367485388 0.45 12.49 7.53 0.44

Of course, the country with the largest population would be China. I expected this before I ran the query; China is well-known for the one-child policy that it enacted from 1980 to 2015 in order to curb the growth of its huge population.

Now, let’s investigate the minimum population growth, which was 0%. This means that the population wasn’t increasing at all at that time.

%%sql
SELECT *
FROM facts
WHERE
    population_growth == 0
    AND name NOT IN ("Antarctica", "World")
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
190 vt Holy See (Vatican City) 0 0 0 842 0.0 None None None
200 ck Cocos (Keeling) Islands 14 14 0 596 0.0 None None None
207 gl Greenland 2166086 2166086 None 57733 0.0 14.48 8.49 5.98
238 pc Pitcairn Islands 47 47 0 48 0.0 None None None

Four countries have shown up in the results. One of these is the Pitcairn Islands which were mentioned earlier. This makes sense as they have few people to begin with, and the Factbook states that it is rare for outsiders to migrate there.

Interestingly, Vatican City is among the countries without population growth, but this makes sense too. It is a small independent state within Italy that serves as the authority of Roman Catholicism.

Finally, let’s check the country with the highest population growth rate.

%%sql
SELECT *
FROM facts
WHERE population_growth == (
    SELECT MAX(population_growth)
    FROM facts
    WHERE name NOT IN ("World", "Antarctica")
)
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
162 od South Sudan 644329 None None 12042910 4.02 36.91 8.18 11.47

The country with the highest population growth rate is South Sudan. Its population was increasing by 4.02% annually at the time that the data was collected. According to Hicks (2017), there was widespread famine in South Sudan, and its high population growth rendered the country’s level of food production insufficient.

Population Raw Increase

Out of curiosity, we can investigate which countries may have the highest population raw increase. This means the increase in terms of the number of people as opposed to the percentage of the population.

We can calculate this in an SQL query by multiplying population size by population growth rate. However, the query result needs to be stored as a DataFrame so that it can be used in a chart.

First, we can store the result of an SQL query as a ResultSet object by using the << operator, according to the ipython-sql README.

%%sql result <<
SELECT
    name,
    population,
    population_growth,
    (population * population_growth / 100.0) AS "population_raw_increase" 
FROM facts
WHERE name NOT IN ("World", "Antarctica")
ORDER BY population_raw_increase DESC
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
Returning data to local variable result

Then, the ResultSet can be converted into a DataFrame via its .DataFrame() method.

pop_increase = result.DataFrame()

print(pop_increase.shape)
pop_increase.head(10)
(259, 4)
name population population_growth population_raw_increase
0 India 1.251696e+09 1.22 1.527069e+07
1 China 1.367485e+09 0.45 6.153684e+06
2 Nigeria 1.815621e+08 2.45 4.448270e+06
3 Pakistan 1.990858e+08 1.46 2.906653e+06
4 Ethiopia 9.946582e+07 2.89 2.874562e+06
5 Bangladesh 1.689577e+08 1.60 2.703324e+06
6 United States 3.213689e+08 0.78 2.506677e+06
7 Indonesia 2.559937e+08 0.92 2.355142e+06
8 Congo, Democratic Republic of the 7.937514e+07 2.45 1.944691e+06
9 Philippines 1.009984e+08 1.61 1.626074e+06

The DataFrame above shows the top 10 countries in terms of population raw increase, which is written in scientific notation. Notably, while most of the values have an exponent of 6, India’s has an exponent of 7. This means that its population raw increase is a whole order of magnitude higher than that of the other countries.

We can communicate this result better by graphing population growth against population size, with population raw increase as a continuous grouping variable.

Code
(
    alt.Chart(pop_increase)
    .mark_point()
    .encode(
        x = alt.X("population:Q", title = "Population Size"),
        y = alt.Y("population_growth:Q", title = "Population Growth Rate (% of Population)"),
        color = alt.Color("population_raw_increase:Q", title = "Population Raw Increase"),
        tooltip = [
            "name",
            "population",
            "population_growth",
            "population_raw_increase",
        ]
    )
    .properties(title = "Population Growth Rate against Population Size")
    .interactive()
)

One can click, drag, and scroll on the chart above in order to explore its data points. Hover over a point in order to view extra information like the country’s name.

In the case of the chart above, the population raw increase variable ranges from 0 people to 15 million people; this range is expressed with a color gradient. Looking at the data points, we can see that most of them are pale blue, indicating a low increase. India’s point has the darkest blue color.

Hovering over India’s point, we can see its tooltip, which states that the population growth rate of India is 1.22%. This is not the highest global population growth rate; we saw earlier that this is around 4%. However, India’s extremely large population size (over 1 billion) leads it to have the highest raw population increase.

Population Density

Next, in this section, we will find the average population size and average land area among all countries.

%%sql
SELECT
    AVG(population),
    AVG(area_land)
FROM facts
WHERE name NOT IN ("World", "Antarctica")
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
AVG(population) AVG(area_land)
32377011.0125 523693.2

The average population size is over 32 million people, and the average land area is over 500 thousand square kilometers.

We can now investigate which countries have above-average population size and below-average land area.

%%sql
SELECT
    name,
    population,
    population_growth,
    area_land
FROM facts
WHERE
    population > (
        SELECT AVG(population)
        FROM facts
        WHERE name NOT IN ("World", "Antarctica")
    )
    AND
    area_land < (
        SELECT AVG(area_land)
        FROM facts
        WHERE name NOT IN ("World", "Antarctica")
    )
    AND
    name NOT IN ("World", "Antarctica")
ORDER BY population_growth DESC
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
name population population_growth area_land
Uganda 37101745 3.24 197100
Iraq 37056169 2.93 437367
Philippines 100998376 1.61 298170
Bangladesh 168957745 1.6 130170
Morocco 33322699 1.0 446300
Vietnam 94348835 0.97 310070
Spain 48146134 0.89 498980
United Kingdom 64088222 0.54 241930
Thailand 67976405 0.34 510890
Italy 61855120 0.27 294140
Germany 80854408 0.17 348672
Japan 126919659 0.16 364485
Korea, South 49115196 0.14 96920
Poland 38562189 0.09 304255

Since the countries in this table have above-average population size and below-average land area, these may be at risk of facing high population density in the future. This may be especially true for those with the highest population growth rates, such as Uganda, Iraq, and the Philippines.

If we want to know which countries already have a high population density, though, we can run the query below. It calculates the population density as population size over land area, and then it displays the countries with the top 10 highest population densities.

%%sql result <<
SELECT
    name,
    population,
    population_growth,
    area_land,
    population / area_land AS "population_density"
FROM facts
WHERE name NOT IN ("World", "Antarctica")
ORDER BY population_density DESC
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
Returning data to local variable result

The resulting DataFrame is shown below. We have added 2 additional columns which will help us in making a chart:

  • place: Whether the country is in the top 10 countries in terms of population density.
  • text: The text to display for each country.
pop_density = result.DataFrame()

# Place variable
pop_density["place"] = ["Top 10"] * 10 + ["Below Top 10"] * (len(pop_density) - 10)

# Text variable
pop_density["text"] = (
    pop_density["name"].copy()
    # Append text that indicates the numeric place of the country.
    .str.cat([" (Top {})".format(i) for i in range(1, len(pop_density) + 1)])
)

# Make the text blank for the countries below the top 10.
pop_density.loc[
    pop_density["place"] == "Below Top 10",
    "text"
] = ""

pop_density.head(11)
name population population_growth area_land population_density place text
0 Macau 592731.0 0.80 28.0 21168.0 Top 10 Macau (Top 1)
1 Monaco 30535.0 0.12 2.0 15267.0 Top 10 Monaco (Top 2)
2 Singapore 5674472.0 1.89 687.0 8259.0 Top 10 Singapore (Top 3)
3 Hong Kong 7141106.0 0.38 1073.0 6655.0 Top 10 Hong Kong (Top 4)
4 Gaza Strip 1869055.0 2.81 360.0 5191.0 Top 10 Gaza Strip (Top 5)
5 Gibraltar 29258.0 0.24 6.0 4876.0 Top 10 Gibraltar (Top 6)
6 Bahrain 1346613.0 2.41 760.0 1771.0 Top 10 Bahrain (Top 7)
7 Maldives 393253.0 0.08 298.0 1319.0 Top 10 Maldives (Top 8)
8 Malta 413965.0 0.31 316.0 1310.0 Top 10 Malta (Top 9)
9 Bermuda 70196.0 0.50 54.0 1299.0 Top 10 Bermuda (Top 10)
10 Bangladesh 168957745.0 1.60 130170.0 1297.0 Below Top 10

The results above show the top 10 countries in terms of population density in people per square kilometer of land. Notably, Monaco and Gibraltar are cases of countries with under 10 square kilometers of land but also tens of thousands of people. On the other hand, Singapore and Hong Kong are cases of countries with a somewhat larger land area but huge population size due to urbanization. Hong Kong has the highest number of over 150 m tall buildings in the world, and Singapore has the 12th highest.

Jackson (2017) states that high population density in a place can lead to noise pollution, territorial behavior, and lack of personal space for the people there. Thus, these may be significant challenges for the countries listed above.

Continuing on, we can take our analysis a step further by investigating countries with both high population density and high population growth rate. Such countries may be at risk of further aggravation of population density issues.

Code
# Base layer with x and y axes
base = (
    alt.Chart(pop_density)
    .mark_point()
    .encode(
        x = alt.X(
            "area_land:Q",
            title = "Land Area (Square Kilometers)",
            scale = alt.Scale(domain = (0, 1100)),
            # Limit the x and y axis domains.
            # We will zoom in on the part of the chart that shows the important points.
        ),
        y = alt.Y(
            "population:Q",
            title = "Population Size",
            scale = alt.Scale(domain = (0, 6000000)),
        ),
    )
)

# Scatter plot layer
points = (
    base
    .encode(
        # Express groupings using color and size.
        color = alt.Color("place:N", title = "Category of Population Density"),
        size = alt.Size("population_growth:Q", title = "Population Growth Rate (% of Population)"),
        # Provide a tooltip with extra information.
        tooltip = [
            "name",
            "area_land",
            "population",
            "population_density",
            "population_growth",
        ],
    )
)

# Text layer
text = (
    base
    .mark_text(
        align = "center",
        baseline = "middle",
        dx = 60,
        # Nudge text to the right.
    )
    .encode(
        # Display the text variable.
        text = "text",
    )
)

# Combine layers.
chart = (
    (points + text)
    .properties(title = "Top 10 Countries in Population Density")
    .interactive()
)

# Display chart.
chart

In the chart above, the orange points represent the top 10 countries in terms of population density, while the blue points represent the other countries. Larger points have a higher population growth rate.

Macau has the highest population density. However, Gaza Strip, Bahrain, and Singapore have higher population growth rates than Macau does (as indicated by the larger circles). This indicates that these countries’ population sizes may increase quickly, thus leading to increased population density and aggravating the issues that come with it.

Death Rate and Birth Rate

Our last query in SQL shall investigate which countries have a higher death rate than birth rate. Such countries may face a decrease in their population size.

Note that we will calculate a new variable, death_offset_birth, by subtracting birth rate from death rate. This variable is indicative of the net decrease in population (if only births and deaths are taken into account).

%%sql result <<
SELECT
    name,
    population,
    population_growth,
    birth_rate,
    death_rate,
    death_rate - birth_rate AS "death_offset_birth"
FROM facts
WHERE name NOT IN ("World", "Antarctica")
ORDER BY death_offset_birth DESC
;
 * sqlite:///private/2021-07-24-SQL-Fundamentals-Files/factbook.db
Done.
Returning data to local variable result

The result of the query is shown below. Additionally, we have created a new column, higher_category. This indicates whether the death rate or the birth rate is higher for a particular country.

db_rates = result.DataFrame()

db_rates["higher_category"] = (
    db_rates["death_offset_birth"]
    .apply(lambda x: "Death Rate Higher" if x > 0 else "Birth Rate Higher")
)

db_rates.head(10)
name population population_growth birth_rate death_rate death_offset_birth higher_category
0 Bulgaria 7186893.0 0.58 8.92 14.44 5.52 Death Rate Higher
1 Serbia 7176794.0 0.46 9.08 13.66 4.58 Death Rate Higher
2 Latvia 1986705.0 1.06 10.00 14.31 4.31 Death Rate Higher
3 Lithuania 2884433.0 1.04 10.10 14.27 4.17 Death Rate Higher
4 Ukraine 44429471.0 0.60 10.72 14.46 3.74 Death Rate Higher
5 Hungary 9897541.0 0.22 9.16 12.73 3.57 Death Rate Higher
6 Germany 80854408.0 0.17 8.47 11.42 2.95 Death Rate Higher
7 Slovenia 1983412.0 0.26 8.42 11.37 2.95 Death Rate Higher
8 Romania 21666350.0 0.30 9.14 11.90 2.76 Death Rate Higher
9 Croatia 4464844.0 0.13 9.45 12.18 2.73 Death Rate Higher

The results above show the top 10 countries in terms of the difference between death rate and birth rate. At the top is Bulgaria; for every 1000 people each year, 14.44 die and 8.92 are born, meaning that deaths outnumber births by 5.52.

We can make an area chart in order to get an idea of the death rate situation for all countries.

Before that, though, how many countries have a death rate that is higher than the birth rate?

(db_rates["higher_category"] == "Death Rate Higher").sum()
24

There are 24 such countries. Thus, we can mention this fact in the chart title.

The chart is shown below.

Code
(
    alt.Chart(db_rates)
    .mark_area(point = True) # Show a point for each observation.
    .encode(
        x = alt.X("birth_rate:Q", title = "Birth Rate (Births per 1000 People per Year)"),
        y = alt.Y("death_offset_birth:Q", title = "Death Rate offset by Birth Rate"),
        color = alt.Color("higher_category:N", title = "Higher Rate"),
        tooltip = [
            "name",
            "birth_rate",
            "death_rate",
            "death_offset_birth",
        ],
    )
    .properties(title = "Death Rate is Higher than Birth Rate for 24 Countries")
    .interactive()
)

In the chart above, each point represents a country. Click, drag, and scroll in order to explore the countries and read their tooltips.

The horizontal line marks the spot where the birth rate and death rate are equal—they cancel each other out. The blue part represents the majority of countries, which have a birth rate higher than the death rate. On the other hand, the orange part represents the 24 countries where deaths outnumber births.

Interestingly, there is a downward trend in the chart. Death rate offset by birth rate appears to have a strong negative correlation with birth rate.

More importantly, one would expect the 24 highlighted countries to have a negative population growth. However, if we look at the table, the population growth rates are still positive. This means that other factors such as migration may be involved in keeping the population growing rather than shrinking.

Summary

In this project, we used a local database file of CIA World Factbook data about 259 countries. We explored the topics of population size, growth rate, raw increase, and density, as well as death rate and birth rate. We queried the database with SQL in order to calculate summary statistics and generate tables showing the top countries with certain characteristics. After identifying countries of interest, we did a little research on each one in order to understand why it had such characteristics. Lastly, we created interesting interactive charts that highlighted the countries of interest. These charts allowed the viewer to read extra information about each country via tooltips.

Thanks for reading!