import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
Overview
The Chinook database by Luis Rocha and Brice Lambson is a sample database about a hypothetical digital media store called Chinook. This store sells individual music tracks online, similar to iTunes. The database contains tables covering various aspects of the company, such as the employees, customers, invoices, tracks, albums, and artists.
The schema below, which was designed by Dataquest, lists the columns under each table. Columns connected by lines contain matching information.
The matching columns allow us to perform joins on these tables. Thus, we are able to answer more complicated questions about the data.
In our hypothetical scenario, the Chinook company has requested us to answer the following business questions:
- What are the best-selling music genres with regards to USA customers? Based on this, which new albums should be purchased for the Chinook store?
- Which of Chinook’s sales support agents has the highest total sales from their assigned customers? Can the exemplary performance of these employees be explained by any information in the database?
- What are the statistics on the customers and sales for each country where Chinook offers its service?
- How many purchases are full albums, and how many are selected sets of tracks? Based on this, what strategy should Chinook adopt when buying new tracks from record companies?
SQL will be used to answer all of these questions. Matplotlib and Altair will also be used to produce helpful visualizations.
I wrote this notebook by following a guided project on the Dataquest platform, specifically the Guided Project: Answering Business Questions Using SQL. The general project flow and research questions came from Dataquest. However, the text and code here are written by me unless stated otherwise.
Preparations
Install the necessary packages.
Connect to the database using SQLite.
%load_ext sql
%sql sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
'Connected: @private/2021-07-31-Intermediate-SQL-Files/chinook.db'
Analysis
Tables and Views
First, we’ll inspect the tables and views available in the chinook.db
database.
- Tables contain columns of data. Each column has a different name and data type.
- Views do not contain data. Instead, these are pre-written SQL queries which show a transformation of existing data. Thus, it can be called a “virtual table.” (Sławińska 2020)
%%sql
SELECT
name,type
FROM sqlite_mastertype IN ("table", "view")
WHERE ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
usa_track_purchases | view |
usa_genre_sales | view |
agent_customer | view |
agent_stats | view |
country_invoices | view |
country_customers | view |
country_labels | view |
country_avg_order | view |
country_stats | view |
invoice_tracks_bought | view |
invoice_album | view |
invoice_full_album_tracks | view |
invoice_purchase_type | view |
purchase_type_proportion | view |
Originally, there were only 11 tables and 0 views in the database. The views listed above are ones which I made throughout this project. I will show how I created these views in later sections.
For now, let’s inspect the customer
table, as we will be using it to answer most of the company’s business questions.
%%sql
*
SELECT
FROM customer5
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
customer_id | first_name | last_name | company | address | city | state | country | postal_code | phone | fax | support_rep_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Luís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923-5555 | +55 (12) 3923-5566 | luisg@embraer.com.br | 3 |
2 | Leonie | Köhler | None | Theodor-Heuss-Straße 34 | Stuttgart | None | Germany | 70174 | +49 0711 2842222 | None | leonekohler@surfeu.de | 5 |
3 | François | Tremblay | None | 1498 rue Bélanger | Montréal | QC | Canada | H2G 1A7 | +1 (514) 721-4711 | None | ftremblay@gmail.com | 3 |
4 | Bjørn | Hansen | None | Ullevålsveien 14 | Oslo | None | Norway | 0171 | +47 22 44 22 22 | None | bjorn.hansen@yahoo.no | 4 |
5 | František | Wichterlová | JetBrains s.r.o. | Klanova 9/506 | Prague | None | Czech Republic | 14700 | +420 2 4172 5555 | +420 2 4172 5555 | frantisekw@jetbrains.com | 4 |
Each row in this table contains data on a different customer of Chinook. Each customer has a unique customer ID and an assigned support representative from Chinook. The support rep’s employee ID is stored in the support_rep_id
column. The other columns contain information on the customer’s name, occupation, location, and contact details.
All names and personal details in the Chinook database are fictitious and randomly generated. Public use of this database is not a breach of data privacy.
Best-Selling Music Genres in the USA
In our first scenario, Chinook has signed a deal with a new record company, so its tracks can now be put up for sale on the Chinook store. The record company has 4 albums so far; below are the artist names and their genres.
- Regal (Hip-Hop)
- Red Tone (Punk)
- Meteor and the Girls (Pop)
- Slim Jim Bites (Blues)
However, Chinook would like to spread its releases over time, so it will only add 3 albums to the store. Thus, we have to determine the best-selling genres on the store. Furthermore, since the record company would like to target a USA audience, we can narrow our analysis to Chinook’s USA customers.
First, we create a view called usa_track_purchases
. This will show the genre, track name, unit price, and quantity bought for each of the invoice lines of USA customers.
%%sql
;
DROP VIEW IF EXISTS usa_track_purchases
CREATE VIEW usa_track_purchases AS
SELECT
il.invoice_line_id AS invoice_line_id,
g.name AS genre,
t.name AS track_name,
il.unit_price AS unit_price,
il.quantity AS quantity
FROM customer AS c
INNER JOIN
invoice AS iv= c.customer_id
ON iv.customer_id
INNER JOIN
invoice_line AS il= iv.invoice_id
ON il.invoice_id
INNER JOIN
track AS t= il.track_id
ON t.track_id
INNER JOIN
genre AS g= t.genre_id
ON g.genre_id = "USA"
WHERE c.country ;
*
SELECT
FROM usa_track_purchases7
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
invoice_line_id | genre | track_name | unit_price | quantity |
---|---|---|---|---|
1 | Rock | Right Next Door to Hell | 0.99 | 1 |
2 | Rock | Dust N' Bones | 0.99 | 1 |
3 | Rock | Live and Let Die | 0.99 | 1 |
4 | Rock | Don't Cry (Original) | 0.99 | 1 |
5 | Rock | Perfect Crime | 0.99 | 1 |
6 | Rock | You Ain't the First | 0.99 | 1 |
7 | Rock | Bad Obsession | 0.99 | 1 |
By querying the view above, we will create another view, usa_genre_sales
. This will contain the following specific information about each genre:
- Number of tracks sold
- Percentage of tracks sold
- Total sales in US dollars
%%sql
;
DROP VIEW IF EXISTS usa_genre_sales
CREATE VIEW usa_genre_sales AS
SELECT
genre,
SUM(quantity) AS number_sold,
--Get the quantity per genre and divide it by the total quantity of USA purchases.
ROUND(
CAST(SUM(quantity) AS Float)/ CAST(
*) FROM usa_track_purchases)
(SELECT COUNT(
AS Float
)* 100.0,
2
) AS percentage_sold,
ROUND(* CAST(quantity AS Float)),
SUM(unit_price 2
) AS total_sales
FROM usa_track_purchases
GROUP BY genre
ORDER BY number_sold DESC, total_sales DESC;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
[]
%%sql result <<
*
SELECT
FROM usa_genre_sales;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Returning data to local variable result
= result.DataFrame()
usa_genre_df usa_genre_df
genre | number_sold | percentage_sold | total_sales | |
---|---|---|---|---|
0 | Rock | 561 | 53.38 | 555.39 |
1 | Alternative & Punk | 130 | 12.37 | 128.70 |
2 | Metal | 124 | 11.80 | 122.76 |
3 | R&B/Soul | 53 | 5.04 | 52.47 |
4 | Blues | 36 | 3.43 | 35.64 |
5 | Alternative | 35 | 3.33 | 34.65 |
6 | Latin | 22 | 2.09 | 21.78 |
7 | Pop | 22 | 2.09 | 21.78 |
8 | Hip Hop/Rap | 20 | 1.90 | 19.80 |
9 | Jazz | 14 | 1.33 | 13.86 |
10 | Easy Listening | 13 | 1.24 | 12.87 |
11 | Reggae | 6 | 0.57 | 5.94 |
12 | Electronica/Dance | 5 | 0.48 | 4.95 |
13 | Classical | 4 | 0.38 | 3.96 |
14 | Heavy Metal | 3 | 0.29 | 2.97 |
15 | Soundtrack | 2 | 0.19 | 1.98 |
16 | TV Shows | 1 | 0.10 | 0.99 |
We can make a bar graph from this result in order to communicate findings better.
Code
(
alt.Chart(usa_genre_df)
.mark_bar()
.encode(= alt.X("genre:N", title = "Music Genre", sort = "-y"),
x = alt.Y("percentage_sold:Q", title = "Percentage of All Purchases in the USA"),
y = alt.Color("total_sales:Q", title = "Total Sales (USD)"),
color = usa_genre_df.columns.tolist(),
tooltip
)
.properties(= "Popularity of Music Genres with Chinook's USA Customers",
title = 300,
height = 600,
width
)
.configure_axis(= 30,
labelAngle
)
.interactive() )
One can hover over each bar in the chart above for a tooltip with more specific information.
Results show that Rock is the best-selling music genre as it makes up 53% of total purchases in the USA. Rock is followed by the Alternative & Punk and Metal genres, which each make up over 10% of purchases.
On a side note, the total_sales
column’s values are very close to that of the number_sold
column. This can be explained by the fact that track prices range from USD 0.99 to USD 1.99.
%%sql
SELECT
MIN(unit_price) AS min_price,
MAX(unit_price) AS max_price
FROM track;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
min_price | max_price |
---|---|
0.99 | 1.99 |
Since there is little variation among track prices, the genre with the most units sold is usually also the genre with the highest sales.
Going back to the scenario at hand, we need to compare the statistics on the Hip-Hop, Punk, Pop, and Blues genres. We will run a query for this below.
%%sql
*
SELECT
FROM usa_genre_sales"Hip Hop/Rap", "Alternative & Punk", "Pop", "Blues")
WHERE genre IN (;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
genre | number_sold | percentage_sold | total_sales |
---|---|---|---|
Alternative & Punk | 130 | 12.37 | 128.7 |
Blues | 36 | 3.43 | 35.64 |
Pop | 22 | 2.09 | 21.78 |
Hip Hop/Rap | 20 | 1.9 | 19.8 |
The result above shows that Alternative & Punk, Blues, and Pop are the three best-selling genres out of the four. Notably, Alternative & Punk makes up 12.37% of all purchases in the USA.
Therefore, we would recommend the Chinook administration to add the albums of Red Tone (Punk), Slim Jim Bites (Blues), and Meteor and the Girls (Pop) to the digital store. The album of Regal (Hip-Hop) has lower priority and can be added at a later date.
Sales Support Agent Performance
Next, Chinook is requesting us to evaluate the performance of its sales support agents. Each customer is assigned to an agent after their first purchase. Since there are only 3 agents, each agent provides support to many customers. Details about the agents are shown in the query below.
%%sql
*
SELECT
FROM employee= "Sales Support Agent"
WHERE title ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
employee_id | last_name | first_name | title | reports_to | birthdate | hire_date | address | city | state | country | postal_code | phone | fax | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Peacock | Jane | Sales Support Agent | 2 | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 1111 6 Ave SW | Calgary | AB | Canada | T2P 5M5 | +1 (403) 262-3443 | +1 (403) 262-6712 | jane@chinookcorp.com |
4 | Park | Margaret | Sales Support Agent | 2 | 1947-09-19 00:00:00 | 2017-05-03 00:00:00 | 683 10 Street SW | Calgary | AB | Canada | T2P 5G3 | +1 (403) 263-4423 | +1 (403) 263-4289 | margaret@chinookcorp.com |
5 | Johnson | Steve | Sales Support Agent | 2 | 1965-03-03 00:00:00 | 2017-10-17 00:00:00 | 7727B 41 Ave | Calgary | AB | Canada | T3B 1Y7 | 1 (780) 836-9987 | 1 (780) 836-9543 | steve@chinookcorp.com |
Since we have data on each customer’s purchases, we can calculate the total purchases associated with each support agent. We can then use this to compare the performance of the agents.
First, we create a view called agent_customer
by joining the employee
table with the customer
table based on the support representative ID. We will include some extra details about each agent, such as their birthdate and hire date. We won’t include their location since we know that they are all in Calgary, AB, Canada. As for the customers, we will include their customer ID, name, and total purchases as the sum of their invoices.
%%sql
;
DROP VIEW IF EXISTS agent_customer
CREATE VIEW agent_customer AS
SELECT
e.employee_id AS agent_id,|| " " || e.last_name AS agent_name,
e.first_name
e.birthdate AS agent_bd,
e.hire_date AS agent_hire_date,
c.customer_id AS customer_id,|| " " || c.last_name AS customer_name,
c.first_name
SUM(iv.total) AS customer_total_purchases
FROM employee AS e
LEFT JOIN
customer AS c= e.employee_id
ON c.support_rep_id
LEFT JOIN
invoice AS iv= c.customer_id
ON iv.customer_id = "Sales Support Agent"
WHERE e.title
GROUP BY c.customer_id
ORDER BY agent_id, customer_id;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
[]
%%sql result <<
*
SELECT
FROM agent_customer;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Returning data to local variable result
= result.DataFrame()
agent_customer_df agent_customer_df.head()
agent_id | agent_name | agent_bd | agent_hire_date | customer_id | customer_name | customer_total_purchases | |
---|---|---|---|---|---|---|---|
0 | 3 | Jane Peacock | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 1 | Luís Gonçalves | 108.90 |
1 | 3 | Jane Peacock | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 3 | François Tremblay | 99.99 |
2 | 3 | Jane Peacock | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 12 | Roberto Almeida | 82.17 |
3 | 3 | Jane Peacock | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 15 | Jennifer Peterson | 66.33 |
4 | 3 | Jane Peacock | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 18 | Michelle Brooks | 79.20 |
Next, we will query this view in order to determine the following information about each agent:
- number of customers
- total sales from the agent’s customers
- percentage of all agents’ sales
- average sales per customer
%%sql
;
DROP VIEW IF EXISTS agent_stats
CREATE VIEW agent_stats AS
SELECT
agent_id,
agent_name,
agent_bd,
agent_hire_date,
COUNT(customer_id) AS number_customers,
ROUND(
SUM(customer_total_purchases),2
) AS sales_number,
ROUND(
SUM(customer_total_purchases) / (SELECT SUM(customer_total_purchases) FROM agent_customer)
* 100,
2
)AS sales_percentage,
ROUND(
AVG(customer_total_purchases),2
) AS average_per_customer
FROM agent_customer
GROUP BY agent_id;
*
SELECT
FROM agent_stats;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
agent_id | agent_name | agent_bd | agent_hire_date | number_customers | sales_number | sales_percentage | average_per_customer |
---|---|---|---|---|---|---|---|
3 | Jane Peacock | 1973-08-29 00:00:00 | 2017-04-01 00:00:00 | 21 | 1731.51 | 36.77 | 82.45 |
4 | Margaret Park | 1947-09-19 00:00:00 | 2017-05-03 00:00:00 | 20 | 1584.0 | 33.63 | 79.2 |
5 | Steve Johnson | 1965-03-03 00:00:00 | 2017-10-17 00:00:00 | 18 | 1393.92 | 29.6 | 77.44 |
At first glance, Jane Peacock seems like the best-performing agent since Chinook got the highest total sales from her customers (USD 1731.51). This idea is cast into doubt when it is mentioned that she had the highest number of customers (21).
Also, the differences among the average sales per customer is quite small. Each of Jane’s customers spends 3 more dollars on Chinook than each of Margaret’s. Each of Margaret’s customers spends 2 more dollars on Chinook than each of Steve Johnson’s.
The average sales per customer may also be influenced by outliers, as shown by the boxplot below.
Code
(
alt.Chart(agent_customer_df)
.mark_boxplot()
.encode(= alt.Y("customer_total_purchases:Q", title = "Customer Total Purchases"),
y = alt.X("agent_name:N", title = "Agent Name"),
x
)
.properties(= "Distribution of Customer Purchases by Sales Support Agent",
title = 300,
height = 500,
width
)
.interactive() )
If we hover over each of the boxes above, we can get more information such as minimum, 1st quartile, median, etc. The median is less influenced by outliers, so we can look at that. The median values are 79.20 (Jane), 77.72 (Margaret), and 75.74 (Steve). These values are still very close to each other.
Therefore, we cannot conclusively state that any agent performs better than the others.
Sales Data by Country
Next, Chinook is requesting us to analyze sales in each country where it offers its service. Specifically, they would like to know the:
- total number of customers
- total value of sales
- average sales per customer
- average order value
- Every order is a batch purchase of multiple tracks.
Furthermore, since there are some countries in the database with only one customer, we shall group these customers together under a category called “Other”, which must appear at the very bottom of our final result.
First, we will create a view, country_invoices
, which shows all invoices and the country of the customer.
%%sql
;
DROP VIEW IF EXISTS country_invoices
CREATE VIEW country_invoices AS
SELECT
c.country AS country,
c.customer_id AS customer_id,|| " " || c.last_name AS customer_name,
c.first_name
iv.invoice_id AS invoice_id,
iv.total AS order_value
FROM customer AS c
LEFT JOIN
invoice AS iv= c.customer_id
ON iv.customer_id ;
*
SELECT
FROM country_invoices5
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
country | customer_id | customer_name | invoice_id | order_value |
---|---|---|---|---|
Brazil | 1 | Luís Gonçalves | 16 | 8.91 |
Brazil | 1 | Luís Gonçalves | 77 | 5.9399999999999995 |
Brazil | 1 | Luís Gonçalves | 149 | 8.91 |
Brazil | 1 | Luís Gonçalves | 153 | 13.86 |
Brazil | 1 | Luís Gonçalves | 182 | 5.9399999999999995 |
Then, we will create a view, country_customers
, which shows the total purchases per customer.
%%sql
;
DROP VIEW IF EXISTS country_customers
CREATE VIEW country_customers AS
SELECT
country,
customer_id,
customer_name,
SUM(order_value) AS customer_total_purchase
FROM country_invoices
GROUP BY customer_id;
*
SELECT
FROM country_customers5
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
country | customer_id | customer_name | customer_total_purchase |
---|---|---|---|
Brazil | 1 | Luís Gonçalves | 108.89999999999998 |
Germany | 2 | Leonie Köhler | 82.17 |
Canada | 3 | François Tremblay | 99.99 |
Norway | 4 | Bjørn Hansen | 72.27000000000001 |
Czech Republic | 5 | František Wichterlová | 144.54000000000002 |
We will then create a view called country_labels
. It will show the number of customers per country. Countries with only 1 customer will be given a label of “Other”.
%%sql
;
DROP VIEW IF EXISTS country_labels
CREATE VIEW country_labels AS
SELECT
country,
COUNT(customer_id) AS number_customers,
CASE> 1 THEN country
WHEN COUNT(customer_id) "Other"
ELSE
END AS country_label,
CASE> 1 THEN 0
WHEN COUNT(customer_id) 1
ELSE
END AS is_other
FROM country_customers
GROUP BY country;
*
SELECT
FROM country_labels10
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
country | number_customers | country_label | is_other |
---|---|---|---|
Argentina | 1 | Other | 1 |
Australia | 1 | Other | 1 |
Austria | 1 | Other | 1 |
Belgium | 1 | Other | 1 |
Brazil | 5 | Brazil | 0 |
Canada | 8 | Canada | 0 |
Chile | 1 | Other | 1 |
Czech Republic | 2 | Czech Republic | 0 |
Denmark | 1 | Other | 1 |
Finland | 1 | Other | 1 |
We will also create a view called country_avg_order
which simply shows the average order value per country. This will be done by querying the country_invoices
view.
%%sql
;
DROP VIEW IF EXISTS country_avg_order
CREATE VIEW country_avg_order AS
SELECT
cl.country_label,
AVG(civ.order_value) AS avg_order_value
FROM country_invoices AS civ
INNER JOIN
country_labels AS cl= civ.country
ON cl.country
GROUP BY cl.country_label;
*
SELECT
FROM country_avg_order;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
country_label | avg_order_value |
---|---|
Brazil | 7.011147540983605 |
Canada | 7.047236842105264 |
Czech Republic | 9.107999999999999 |
France | 7.781400000000001 |
Germany | 8.161463414634145 |
India | 8.72142857142857 |
Other | 7.44857142857143 |
Portugal | 6.3837931034482756 |
USA | 7.942671755725194 |
United Kingdom | 8.768571428571429 |
Finally, we will create a view called country_stats
that shows all of the 4 statistics that were requested by the Chinook management. The “Other” entry will be forced to the bottom of the query result using the is_other
column we created in the country_labels
view.
%%sql
;
DROP VIEW IF EXISTS country_stats
CREATE VIEW country_stats AS
SELECT
l.country_label,
COUNT(c.customer_id) AS number_customers,2) AS total_sales,
ROUND(SUM(c.customer_total_purchase), 2) AS avg_sales_customer,
ROUND(AVG(c.customer_total_purchase), 2) AS avg_order_value
ROUND(a.avg_order_value,
FROM country_customers AS c
INNER JOIN
country_labels AS l= c.country
ON l.country
INNER JOIN
country_avg_order AS a= l.country_label
ON a.country_label
GROUP BY l.country_label
ORDER BY
l.is_other ASC,
total_sales DESC;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
[]
%%sql result <<
*
SELECT
FROM country_stats;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Returning data to local variable result
= result.DataFrame()
country_stats_df country_stats_df
country_label | number_customers | total_sales | avg_sales_customer | avg_order_value | |
---|---|---|---|---|---|
0 | USA | 13 | 1040.49 | 80.04 | 7.94 |
1 | Canada | 8 | 535.59 | 66.95 | 7.05 |
2 | Brazil | 5 | 427.68 | 85.54 | 7.01 |
3 | France | 5 | 389.07 | 77.81 | 7.78 |
4 | Germany | 4 | 334.62 | 83.66 | 8.16 |
5 | Czech Republic | 2 | 273.24 | 136.62 | 9.11 |
6 | United Kingdom | 3 | 245.52 | 81.84 | 8.77 |
7 | Portugal | 2 | 185.13 | 92.57 | 6.38 |
8 | India | 2 | 183.15 | 91.58 | 8.72 |
9 | Other | 15 | 1094.94 | 73.00 | 7.45 |
We can make a scatter plot with the data above in order to make results more clear.
Code
# Exclude the "Other" entry.
= country_stats_df.loc[country_stats_df["country_label"] != "Other"]
main_df
# Base layer.
= (
base
alt.Chart(main_df)= 300) # Set default size of points to 300 pixels.
.mark_point(size
.encode(= alt.X("number_customers:Q", title = "Number of Customers"),
x = alt.Y("total_sales:Q", title = "Total Sales (USD)"),
y
)
)
# Scatter plot layer.
= base.encode(
points = alt.Color("avg_sales_customer:Q", title = "Average Sales per Customer (USD)"),
color = country_stats_df.columns.tolist(),
tooltip
)
# Text layer.
= (
text
base# Move text to the top left of each point.
.mark_text( = "right",
align = -5,
dy = -20,
dx
)
.encode(= "country_label:N"
text
)
)
# Combine layers.
= (
chart + text)
(points
.properties(= "Chinook Sales by Country",
title = 300,
height = 700,
width
)
.interactive()
)
# Display chart.
chart
Looking at the top right of the chart, we can see that the USA has the highest number of Chinook customers (13), as well as the highest total sales (USD 1040.49). All of the other countries have only 2 to 8 customers.
On the other hand, the Czech Republic has the highest average sales per customer at USD 136.62; this is indicated by its dark color in the chart. This country also had the highest average order value at USD 9.11. This means that though the country has few customers, these people are avid buyers of music.
Chinook may benefit from marketing its service more aggressively in countries other than the USA where it has gained a good foothold, such as Canada, Brazil, and France. Chinook may also target the Czech Republic since the customers there seem to buy a lot of music on a per-person basis.
Comparing Purchase Types: Full Album vs Selected Sets
In our last scenario, we have been requested to compare the popularities of Chinook’s two purchase types:
- Full album
- The customer buys a full album.
- Albums are pre-defined in Chinook’s library. The customer may not add other tracks on top of an album.
- Selected set of tracks
- The customer manually selects any number of individual tracks.
In both cases, each track is bought at its unit price; there are no discounts.
Currently, Chinook’s purchasing strategy is to buy full albums from record companies. However, Chinook doesn’t know whether full album purchases are popular among its customers. If selected sets are more popular, then Chinook may switch to a new strategy in which it will only buy the most popular individual tracks from record companies.
Our analysis will help Chinook make the final decision. For each purchase type, we will show the number of invoices and percentage of all invoices with that type. The type with the higher number will be the more popular one.
First, we create a view, invoice_tracks
, that shows all tracks under each invoice. The album associated with each track is also shown.
%%sql
;
DROP VIEW IF EXISTS invoice_tracks_bought
CREATE VIEW invoice_tracks_bought AS
SELECT
iv.invoice_id AS invoice_id,
il.track_id AS track_id,
t.name AS track_name,
t.album_id AS album_id,
a.title AS album_name
FROM invoice AS iv
INNER JOIN
invoice_line AS il= iv.invoice_id
ON il.invoice_id
INNER JOIN
track AS t= il.track_id
ON t.track_id
INNER JOIN
album AS a= t.album_id
ON a.album_id
ORDER BY invoice_id, track_id, album_id;
*
SELECT
FROM invoice_tracks_bought7
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
invoice_id | track_id | track_name | album_id | album_name |
---|---|---|---|---|
1 | 1158 | Right Next Door to Hell | 91 | Use Your Illusion I |
1 | 1159 | Dust N' Bones | 91 | Use Your Illusion I |
1 | 1160 | Live and Let Die | 91 | Use Your Illusion I |
1 | 1161 | Don't Cry (Original) | 91 | Use Your Illusion I |
1 | 1162 | Perfect Crime | 91 | Use Your Illusion I |
1 | 1163 | You Ain't the First | 91 | Use Your Illusion I |
1 | 1164 | Bad Obsession | 91 | Use Your Illusion I |
Then, we create a view called invoice_album
, which shows only the album associated with the first track of each invoice. This way, there is only one row per invoice. We will also include the total purchase amount of each invoice so that we can use it later.
%%sql
;
DROP VIEW IF EXISTS invoice_album
CREATE VIEW invoice_album AS
SELECT
itb.invoice_id AS invoice_id,
MIN(itb.album_id) AS album_id,
itb.album_name AS album_name,
iv.total AS total_purchase
FROM invoice_tracks_bought AS itb
INNER JOIN
invoice AS iv= itb.invoice_id
ON iv.invoice_id
GROUP BY itb.invoice_id
ORDER BY itb.invoice_id;
*
SELECT
FROM invoice_album5
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
invoice_id | album_id | album_name | total_purchase |
---|---|---|---|
1 | 91 | Use Your Illusion I | 15.84 |
2 | 20 | The Best Of Buddy Guy - The Millenium Collection | 9.9 |
3 | 203 | A-Sides | 1.98 |
4 | 58 | Come Taste The Band | 7.92 |
5 | 163 | From The Muddy Banks Of The Wishkah [live] | 16.83 |
Then, we join invoice_album
with the track
table in order to list all of the tracks under the album associated with each invoice. The result will be a view called invoice_full_album_tracks
.
%%sql
;
DROP VIEW IF EXISTS invoice_full_album_tracks
CREATE VIEW invoice_full_album_tracks AS
SELECT*,
ia.
t.track_id AS track_id,
t.name AS track_name
FROM invoice_album AS ia
INNER JOIN
track AS t= ia.album_id
ON t.album_id
ORDER BY invoice_id, album_id, track_id;
*
SELECT
FROM invoice_full_album_tracks5
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
invoice_id | album_id | album_name | total_purchase | track_id | track_name |
---|---|---|---|---|---|
1 | 91 | Use Your Illusion I | 15.84 | 1158 | Right Next Door to Hell |
1 | 91 | Use Your Illusion I | 15.84 | 1159 | Dust N' Bones |
1 | 91 | Use Your Illusion I | 15.84 | 1160 | Live and Let Die |
1 | 91 | Use Your Illusion I | 15.84 | 1161 | Don't Cry (Original) |
1 | 91 | Use Your Illusion I | 15.84 | 1162 | Perfect Crime |
The invoice_full_album_tracks
view looks very similar to the invoice_tracks_bought
view at first glance. However, there is a difference:
invoice_tracks_bought
- It contains the tracks bought in each invoice.
- Each set of tracks may or may not be a full album.
invoice_full_album_tracks
- It contains all of the tracks under one album associated with an invoice.
- Some of these tracks may not have been bought by the customer.
Next, we will create a new view called invoice_purchase_type
which indicates whether each invoice is a “Full Album” or “Selected Set” purchase.
In order to determine this, we will have a CASE
statement which can be explained as follows:
WHEN
clause: If the set of tracks bought and the full album are exactly the same, mark the purchase type as “Full Album”.ELSE
clause: Otherwise, mark the purchase type as “Selected Set”.
Inside the WHEN
clause, we have a rather complicated-looking set of operations. Let’s look at one part:
(
SELECT itb.track_id
FROM invoice_tracks_bought AS itb
WHERE itb.invoice_id = ia.invoice_id
EXCEPT
SELECT ifa.track_id
FROM invoice_full_album_tracks AS ifa
WHERE ifa.invoice_id = ia.invoice_id
) IS NULL
In order to make the explanation more simple, we can call the subqueries above “set 1” and “set 2”.
(
{set 1}
EXCEPT
{set 2}
) IS NULL
- Set 1 represents all tracks bought in one invoice.
- Set 2 represents the full set of tracks in an album associated with the invoice.
- Via
EXCEPT
andIS NULL
, we check whether Set 1 is a subset of Set 2.- If it is, the result is True.
- Otherwise, False.
We then repeat the same process but in reverse, to check if Set 2 is a subset of Set 1. We thus end up with two boolean values, and we use the AND
operator on these.
(
{set 1}
EXCEPT
{set 2}
) IS NULL
AND
(
{set 2}
EXCEPT
{set 1}
) IS NULL
The purpose of AND
is to determine the following.
- If both conditions are True:
- The two sets of tracks match exactly.
- The invoice is a “Full Album” purchase.
- If any condition is False:
- The two sets of tracks do not match exactly.
- The invoice is a “Selected Set” purchase.
The full query is shown below.
%%sql
;
DROP VIEW IF EXISTS invoice_purchase_type
CREATE VIEW invoice_purchase_type AS
SELECT
ia.invoice_id,
CASE
WHEN (
(
SELECT itb.track_id
FROM invoice_tracks_bought AS itb= ia.invoice_id
WHERE itb.invoice_id
EXCEPT
SELECT ifa.track_id
FROM invoice_full_album_tracks AS ifa= ia.invoice_id
WHERE ifa.invoice_id
) IS NULL
AND
(
SELECT ifa.track_id
FROM invoice_full_album_tracks AS ifa= ia.invoice_id
WHERE ifa.invoice_id
EXCEPT
SELECT itb.track_id
FROM invoice_tracks_bought AS itb= ia.invoice_id
WHERE itb.invoice_id
) IS NULL
"Full Album"
) THEN "Selected Set"
ELSE
END AS purchase_type,
ia.total_purchase AS total_purchase
FROM invoice_album AS ia;
*
SELECT
FROM invoice_purchase_type10
LIMIT ;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
Done.
invoice_id | purchase_type | total_purchase |
---|---|---|
1 | Full Album | 15.84 |
2 | Selected Set | 9.9 |
3 | Selected Set | 1.98 |
4 | Selected Set | 7.92 |
5 | Full Album | 16.83 |
6 | Selected Set | 1.98 |
7 | Selected Set | 10.89 |
8 | Selected Set | 9.9 |
9 | Selected Set | 8.91 |
10 | Selected Set | 1.98 |
With the view above, we can finally answer the question being asked. A view called purchase_type_proportion
will be created which shows the following information per purchase type:
- number of invoices
- percentage of invoices
- average sales per invoice
- sales in USD
- percentage of total sales
%%sql
;
DROP VIEW IF EXISTS purchase_type_proportion
CREATE VIEW purchase_type_proportion AS
SELECT
purchase_type,
COUNT(purchase_type) AS type_count,
ROUND(
CAST(COUNT(purchase_type) AS Float)/ CAST(
(SELECT COUNT(purchase_type)
FROM invoice_purchase_type)
AS Float
)* 100,
2
) AS type_percentage,
ROUND(
AVG(total_purchase),2
) AS avg_sales_per_invoice,
ROUND(
SUM(total_purchase),2
) AS sales_number,
ROUND(
SUM(total_purchase)/ (SELECT SUM(total_purchase)
FROM invoice_purchase_type)* 100,
2
) AS sales_percentage
FROM invoice_purchase_type
GROUP BY purchase_type
ORDER BY type_count DESC;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Done.
[]
%%sql result <<
*
SELECT
FROM purchase_type_proportion;
* sqlite:///private/2021-07-31-Intermediate-SQL-Files/chinook.db
Done.
Returning data to local variable result
= result.DataFrame()
purchase_type_df purchase_type_df
purchase_type | type_count | type_percentage | avg_sales_per_invoice | sales_number | sales_percentage | |
---|---|---|---|---|---|---|
0 | Selected Set | 500 | 81.43 | 6.50 | 3248.19 | 68.97 |
1 | Full Album | 114 | 18.57 | 12.82 | 1461.24 | 31.03 |
The type_percentage
column shows that the majority (81%) of all Chinook invoices are Selected Sets. Full Album purchases only make up 19% of all invoices.
This is shown in the pie chart below.
Code
= sns.color_palette('pastel')[:2]
colors
plt.pie(= purchase_type_df["type_percentage"],
x = purchase_type_df["purchase_type"],
labels = colors,
colors = "%0.2f%%",
autopct
)
"Proportion of Chinook Store Invoices by Purchase Type")
plt.title(
plt.show()
However, the average sales per Full Album purchase is almost twice that of a Selected Set purchase. This is shown in the bar chart below.
Code
# Base layer with bar chart
= (
base
alt.Chart(purchase_type_df)
.mark_bar()
.encode(= alt.X("purchase_type:N", title = "Purchase Type"),
x = alt.Y("avg_sales_per_invoice:Q", title = "Average Sales per Invoice (USD)"),
y = purchase_type_df.columns.tolist(),
tooltip
)
)
# Text layer
= (
text
base
.mark_text(= "center",
align = 10,
dy = "white",
color
)
.encode(= "avg_sales_per_invoice:Q",
text
)
)
# Combine layers into one chart
= (
chart + text)
(base
.properties(= "Average Sales per Invoice by Purchase Type",
title = 300,
height = 200,
width
)= 30)
.configure_axis(labelAngle
.interactive()
)
# Display chart.
chart
Full Albums cost more than Selected Sets because the former tend to have more tracks. Thus, even if Full Albums only represent 19% of all invoices, these also represent 31% of all dollar sales. This is shown in the pie chart below.
Code
plt.pie(= purchase_type_df["sales_percentage"],
x = purchase_type_df["purchase_type"],
labels = colors,
colors = "%0.2f%%",
autopct
)
"Proportion of Chinook Store Sales by Purchase Type")
plt.title(
plt.show()
Overall, though, Selected Sets still represent the majority of Chinook’s invoices (81%) and sales (69%). Therefore, we recommend that Chinook shift to a new purchasing strategy in which it only buys the most popular tracks from record companies. Chinook should not buy full albums since it is less likely for customers to purchase these.
Conclusion
In this project, we used intermediate SQL techniques to answer business questions for a hypothetical digital media store called Chinook. We solved a total of 4 scenarios by creating views and gradually working towards a final query. We also ended each scenario with a chart that communicates our findings more engagingly.
Below is a summary of all of the questions and our findings.
What are the best-selling music genres with regards to USA customers? Based on this, which new albums should be purchased for the Chinook store?
Rock is the best-selling music genre as it makes up 53% of total purchases in the USA. Rock is followed by the Alternative & Punk genre and Metal genre, which each make up over 10% of purchases.
Among the 4 new albums which may be added to the digital store, we recommend the following 3 artists’ albums: Red Tone (Punk), Slim Jim Bites (Blues), and Meteor and the Girls (Pop).
Which of Chinook’s sales support agents has the highest total sales from their assigned customers? Can the exemplary performance of these employees be explained by any information in the database?
Sales support agent Jane Peacock has the highest total sales (37% of all sales) from her customers. She also has the highest average sales per customer (USD 82).
However, her statistics are only slightly higher than that of her colleagues. Therefore, we cannot conclusively say that she is the best-performing agent.
What are the statistics on the customers and sales for each country where Chinook offers its service?
The USA has the highest number of Chinook customers (13), as well as the highest total sales (USD 1040.49). All of the other countries have only 2 to 8 customers.
We recommend marketing Chinook more aggressively in Canada, Brazil, and France since these countries have the highest number of customers after the USA. We also recommend expanding into the Czech Republic since it has a high average value of sales per customer.
How many purchases are full albums, and how many are manually selected sets of tracks? Based on this, what strategy should Chinook adopt when buying new tracks from record companies?
Selected Sets account for 81% of all invoices and 69% of total sales. Therefore, these are more popular than Full Albums.
Thus, we recommend that Chinook buy only the most popular tracks from record companies. This would be more cost-effective than buying full albums.
That concludes this project. Thanks for reading!