import pandas as pd
import numpy as np
import altair as alt
import re
Overview
Welcome. In this project, we will be working on data about different types of apps and their corresponding number of users. The goal is to determine which apps can best attract the largest number of users. This will help a hypothetical app company make decisions regarding what apps to develop in the near future.
I wrote this notebook for the Dataquest course’s Guided Project: Profitable App Profiles for the App Store and Google Play Markets. The hypothetical app company and the general project flow came from Dataquest. However, all of the text and code here are written by me unless stated otherwise.
App Company’s Context
First, we must know the context of the hypothetical app company so that we can align our analysis with it.
This company only makes free apps directed toward an English-speaking audience. They get revenue from in-app advertisements and purchases. Thus, they rely on having a large number of users so that they can generate more revenue.
Additionally, their apps should ideally be successful on both Google Play Store and the Apple App Store. The reason is that the company has the following 3-step validation strategy:
(from the Dataquest guided project)
- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.
We’ll take this information into consideration throughout our analysis.
Package Installs
App Data Overview
This project uses two datasets.
- The Google Play Store dataset lists over 10,000 Android apps.
- The Apple App Store dataset lists over 7,000 iOS apps.
data_apple = pd.read_csv("./private/2021-05-08-PAP-Files/AppleStore.csv", header = 0)
data_google = pd.read_csv("./private/2021-05-08-PAP-Files/googleplaystore.csv", header = 0)
Apple App Store dataset
print(data_apple.shape)
(7197, 16)
The dataset has 7197 rows (1 row per app), and 16 columns which describe these apps.
According to the Kaggle documentation (Mobile App Store ( 7200 apps)), the following are the columns and their meanings.
- “id” : App ID
- “track_name”: App Name
- “size_bytes”: Size (in Bytes)
- “currency”: Currency Type
- “price”: Price amount
- “rating_count_tot”: User Rating counts (for all versions)
- “rating_count_ver”: User Rating counts (for current version)
- “user_rating”: Average User Rating value (for all versions)
- “user_rating_ver”: Average User Rating value (for current version)
- “ver” : Latest version code
- “cont_rating”: Content Rating
- “prime_genre”: Primary Genre
- “sup_devices.num”: Number of supporting devices
- “ipadSc_urls.num”: Number of screenshots showed for display
- “lang.num”: Number of supported languages
- “vpp_lic”: Vpp Device Based Licensing Enabled
A sample of the first 5 rows of the dataset is shown below.
data_apple.head()
id | track_name | size_bytes | currency | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 284882215 | 389879808 | USD | 0.0 | 2974676 | 212 | 3.5 | 3.5 | 95.0 | 4+ | Social Networking | 37 | 1 | 29 | 1 | |
1 | 389801252 | 113954816 | USD | 0.0 | 2161558 | 1289 | 4.5 | 4.0 | 10.23 | 12+ | Photo & Video | 37 | 0 | 29 | 1 | |
2 | 529479190 | Clash of Clans | 116476928 | USD | 0.0 | 2130805 | 579 | 4.5 | 4.5 | 9.24.12 | 9+ | Games | 38 | 5 | 18 | 1 |
3 | 420009108 | Temple Run | 65921024 | USD | 0.0 | 1724546 | 3842 | 4.5 | 4.0 | 1.6.2 | 9+ | Games | 40 | 5 | 1 | 1 |
4 | 284035177 | Pandora - Music & Radio | 130242560 | USD | 0.0 | 1126879 | 3594 | 4.0 | 4.5 | 8.4.1 | 12+ | Music | 37 | 4 | 1 | 1 |
Google Play Store dataset
print(data_google.shape)
(10841, 13)
The dataset has 10841 rows and 13 columns.
The column names are self-explanatory, so the Kaggle documentation (Google Play Store Apps) does not describe them.
print(list(data_google.columns))
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Below is a sample of the dataset.
data_google.head()
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
Data Cleaning
Before analysis, the data must first be cleaned of unwanted datapoints.
Inaccurate Data
This Kaggle discussion about the Google Play dataset indicates that row 10472 (excluding the header) has an error.
Below, I have printed row 0 and row 10472 so that these can be compared.
data_google.iloc[[0, 10472]]
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up | NaN |
As we look at row 10472 in the context of the column headers and row 0, the following things become clear.
- The “Category” value is not present. Thus, all values to the right of it have been shifted leftward.
- The “Android Ver” column was left with a missing value.
Thus, this row will be removed.
if data_google.iloc[10472, 0] == 'Life Made WI-Fi Touchscreen Photo Frame':
    # This if-statement prevents more rows from being deleted
    # if the cell is run again.
    data_google.drop(10472, inplace = True)
    print("The inaccurate row was deleted.")
The inaccurate row was deleted.
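A shifted row like this one could also be spotted programmatically rather than through a forum post. The sketch below is only an illustration on a made-up two-row sample (the `sample` DataFrame is hypothetical), and it assumes that valid “Category” values consist only of capital letters and underscores, as in the rows shown above.

```python
import pandas as pd

# Hypothetical miniature of the Google Play data. The second row mimics
# the shifted row, where the rating landed in the "Category" column.
sample = pd.DataFrame({
    "App": [
        "Photo Editor & Candy Camera & Grid & ScrapBook",
        "Life Made WI-Fi Touchscreen Photo Frame",
    ],
    "Category": ["ART_AND_DESIGN", "1.9"],
})

# Assumption: a valid category is made of capital letters and underscores.
# na=False treats missing values as not matching.
looks_valid = sample["Category"].str.fullmatch(r"[A-Z_]+", na=False)

# Rows whose category does not fit the pattern deserve a manual look.
suspect_rows = sample[~looks_valid]
print(suspect_rows)
```

On the full dataset, the same check would be applied to `data_google["Category"]` before dropping anything.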
Duplicate Data
There are also duplicate app entries in the Google Play dataset. We can consider a row a duplicate if another row exists that has the same “App” value.
Here, I count every row whose “App” value appears more than once, including the first occurrence of each app. This turns out to be 1979 rows.
def count_duplicates(df, col_name):
    """Count the number of duplicate rows in a DataFrame.
    `col_name` is the name of the column to be used as a basis
    for duplicate values."""
    all_apps = {}

    for index, row in df.iterrows():
        name = row[col_name]
        all_apps.setdefault(name, []).append(index)

    duplicate_inds = [ind
                      for lst in all_apps.values()
                      for ind in lst
                      if len(lst) > 1]

    n_duplicates = "Duplicates: {}".format(len(duplicate_inds))
    # `iterrows` yields index labels, so select by label with `.loc`.
    # (`.iloc` would be off by one after a row has been dropped.)
    duplicate_rows = df.loc[duplicate_inds]

    return n_duplicates, duplicate_rows

google_dupes = count_duplicates(data_google, "App")
print(google_dupes[0])
Duplicates: 1979
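As a side note, pandas can produce the same kind of count without an explicit loop. The sketch below is a minimal illustration on a made-up sample (the `sample` DataFrame and its “Slack” and “Waze” review counts are hypothetical), not the code that produced the number above.

```python
import pandas as pd

# Hypothetical sample: "Instagram" appears three times and "Slack" twice.
sample = pd.DataFrame({
    "App": ["Instagram", "Slack", "Instagram", "Waze", "Slack", "Instagram"],
    "Reviews": [66577313, 51507, 66577446, 7232629, 51510, 66509917],
})

# keep=False marks every row whose "App" value occurs more than once,
# including the first occurrence of each app.
is_dupe = sample.duplicated(subset="App", keep=False)

n_duplicates = int(is_dupe.sum())
duplicate_rows = sample[is_dupe]
print("Duplicates:", n_duplicates)
```

With `keep="first"` instead, only the extra copies would be marked, which corresponds to the number of rows that would actually be dropped.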
As an example, there are 4 rows for Instagram:
ig_filter = data_google["App"] == "Instagram"
ig_rows = data_google.loc[ig_filter]
ig_rows
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2545 | Instagram | SOCIAL | 4.5 | 66577313 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
2604 | Instagram | SOCIAL | 4.5 | 66577446 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
2611 | Instagram | SOCIAL | 4.5 | 66577313 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
3909 | Instagram | SOCIAL | 4.5 | 66509917 | Varies with device | 1,000,000,000+ | Free | 0 | Teen | Social | July 31, 2018 | Varies with device | Varies with device |
Looking closely, we can see that duplicate rows are not exactly identical. The “Reviews” column, which shows the total number of reviews of the app, has different values.
It can be inferred that the row with the largest value is the newest entry for the app. Therefore, all duplicate rows will be dropped except for the ones with the largest “Reviews” values.
def remove_duplicates(df, name_col, reviews_col):
    # Each key-value pair will follow the format:
    # {"App Name": maximum number of reviews among all duplicates}
    reviews_max = {}

    for index, row in df.iterrows():
        name = row[name_col]
        n_reviews = int(row[reviews_col])

        if n_reviews > reviews_max.get(name, -1):
            reviews_max[name] = n_reviews

    # List of duplicate indices to drop,
    # excluding the row with the highest number of reviews
    # among that app's duplicate rows.
    indices_to_drop = []

    # Rows with names that have already been added into this list
    # will be dropped.
    already_added = []

    for index, row in df.iterrows():
        name = row[name_col]
        n_reviews = int(row[reviews_col])

        if (name not in already_added) and (n_reviews == reviews_max[name]):
            already_added.append(name)
        else:
            indices_to_drop.append(index)

    # Remove duplicates and return the clean dataset.
    clean = df.drop(indices_to_drop)
    return clean

android_clean = remove_duplicates(data_google, "App", "Reviews")
print(android_clean.shape)
(9659, 13)
After duplicates were removed, the Google Play dataset was left with 9659 rows.
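The same rule (keep the row with the most reviews for each app) can probably be expressed more compactly with `sort_values` and `drop_duplicates`. This is a sketch under assumptions, not the function used in this notebook: `remove_duplicates_vectorized` and the helper column `_n_reviews` are hypothetical names, and the last row of the sample is made up. The Instagram rows reuse the review counts from the table shown earlier.

```python
import pandas as pd

def remove_duplicates_vectorized(df, name_col, reviews_col):
    """Keep, for each name, one row with the highest review count."""
    ordered = (
        df.assign(_n_reviews=df[reviews_col].astype(int))
          # A stable sort is used so that tied rows should keep
          # their original relative order.
          .sort_values("_n_reviews", ascending=False, kind="stable")
    )
    # After sorting, the first row seen for each name has the most reviews.
    kept = ordered.drop_duplicates(subset=name_col, keep="first")
    # Restore the original row order and remove the helper column.
    return kept.sort_index().drop(columns="_n_reviews")

sample = pd.DataFrame(
    {
        "App": ["Instagram", "Instagram", "Instagram", "Instagram", "Some Other App"],
        "Reviews": ["66577313", "66577446", "66577313", "66509917", "10"],
    },
    index=[2545, 2604, 2611, 3909, 4000],
)

clean = remove_duplicates_vectorized(sample, "App", "Reviews")
print(clean)
```

Here the Instagram row at index 2604 survives because it has the largest “Reviews” value.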
As for the Apple App Store dataset, there are 4 duplicate rows.
apple_dupes = count_duplicates(data_apple, "track_name")

print(apple_dupes[0])
apple_dupes[1]
Duplicates: 4
id | track_name | size_bytes | currency | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2948 | 1173990889 | Mannequin Challenge | 109705216 | USD | 0.0 | 668 | 87 | 3.0 | 3.0 | 1.4 | 9+ | Games | 37 | 4 | 1 | 1 |
4463 | 1178454060 | Mannequin Challenge | 59572224 | USD | 0.0 | 105 | 58 | 4.0 | 4.5 | 1.0.1 | 4+ | Games | 38 | 5 | 1 | 1 |
4442 | 952877179 | VR Roller Coaster | 169523200 | USD | 0.0 | 107 | 102 | 3.5 | 3.5 | 2.0.0 | 4+ | Games | 37 | 5 | 1 | 1 |
4831 | 1089824278 | VR Roller Coaster | 240964608 | USD | 0.0 | 67 | 44 | 3.5 | 4.0 | 0.81 | 4+ | Games | 38 | 0 | 1 | 1 |
The “rating_count_tot” column in the Apple App Store dataset is like the “Reviews” column in the Google Play dataset. It tells the total number of reviews so far. Therefore, Apple App Store dataset duplicates can be removed by keeping the rows with the highest rating count totals.
ios_clean = remove_duplicates(data_apple, "track_name", "rating_count_tot")
print(ios_clean.shape)
(7195, 16)
From 7197 rows, there are now 7195 rows in the Apple App Store dataset.
Non-English Apps
The hypothetical app company only makes apps in English. Thus, all apps with non-English titles shall be removed from the datasets.
The task now is to identify titles which are not in English. ASCII covers codes 0 to 127, which include the letters, digits, and punctuation most commonly used in English text. Some English app titles may have special characters or emojis, though, so I will only remove titles which have more than 3 characters outside of this range.
def is_english(text):
    unicode = [ord(char) for char in text]
    normal = [(code >= 0 and code <= 127) for code in unicode]
    non_english = len(text) - sum(normal)
    return non_english <= 3

def keep_english(df, name_col):
    """Return a new DataFrame containing only rows with English names."""
    remove_indices = []

    for index, row in df.iterrows():
        name = row[name_col]
        if not is_english(name):
            remove_indices.append(index)

    return df.drop(remove_indices)

android_clean = keep_english(android_clean, "App")
ios_clean = keep_english(ios_clean, "track_name")

print("Google Play Store Dataset:", android_clean.shape)
print("Apple App Store Dataset:", ios_clean.shape)
Google Play Store Dataset: (9614, 13)
Apple App Store Dataset: (6181, 16)
Now, there are only English apps in both datasets.
Paid Apps
As mentioned earlier, the app company only makes free apps. Therefore, data on paid apps is irrelevant to this analysis. Paid apps shall be identified and removed from both datasets.
def remove_paid(df, price_col):
    """Return a new DataFrame without paid apps."""
    remove_indices = []

    for index, row in df.iterrows():
        price = str(row[price_col])

        # Keep characters that are numeric or periods.
        price = float(re.sub("[^0-9.]", "", price))

        if price != 0.0:
            remove_indices.append(index)

    return df.drop(remove_indices)

android_clean = remove_paid(android_clean, "Price")
ios_clean = remove_paid(ios_clean, "price")

print("Google Play Store Dataset:", android_clean.shape)
print("Apple App Store Dataset:", ios_clean.shape)
Google Play Store Dataset: (8864, 13)
Apple App Store Dataset: (3220, 16)
The datasets were left with 8864 apps in Google Play and 3220 apps in the App Store.
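For reference, the price filter could also be written with pandas string methods instead of a row loop. The sketch below is one possible version under assumptions: `remove_paid_vectorized` is a hypothetical name and both sample DataFrames are made up. It handles prices stored as strings (such as "$4.99") as well as prices stored as floats.

```python
import pandas as pd

def remove_paid_vectorized(df, price_col):
    """Return only the rows whose price parses to zero."""
    prices = pd.to_numeric(
        # Keep only digits and periods, e.g. "$4.99" becomes "4.99".
        df[price_col].astype(str).str.replace(r"[^0-9.]", "", regex=True)
    )
    return df[prices == 0.0]

# Hypothetical samples: string prices (Google Play style)
# and float prices (App Store style).
android_sample = pd.DataFrame({
    "App": ["Free App", "Paid App", "Another Free App"],
    "Price": ["0", "$4.99", "0"],
})
ios_sample = pd.DataFrame({
    "track_name": ["Free iOS App", "Paid iOS App"],
    "price": [0.0, 2.99],
})

free_android = remove_paid_vectorized(android_sample, "Price")
free_ios = remove_paid_vectorized(ios_sample, "price")
print(free_android)
print(free_ios)
```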
Missing Data
Lastly, let us remove rows with missing data. Note that it would be wasteful to remove rows with missing data in columns that we will not inspect. Therefore, we will only remove rows with missing data in relevant columns. (Why these are relevant will be explained later.) These would be the following.
Google Play Store dataset
- App
- Category
- Installs
- Genres
Apple App Store dataset
- track_name
- prime_genre
- rating_count_tot
I will now remove all rows with missing values in these columns.
android_clean.dropna(
    subset = ["App", "Category", "Installs", "Genres"],
    inplace = True,
)

ios_clean.dropna(
    subset = ["track_name", "prime_genre", "rating_count_tot"],
    inplace = True,
)

print("Google Play Store Dataset:", android_clean.shape)
print("Apple App Store Dataset:", ios_clean.shape)
Google Play Store Dataset: (8864, 13)
Apple App Store Dataset: (3220, 16)
These are the same shapes as before. Therefore, there were no missing values in the relevant columns. No datapoints were removed at this step.
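A direct way to confirm this is to count the missing values per relevant column before dropping anything. The following is a minimal sketch on a made-up sample (the `sample` DataFrame is hypothetical); on the real data, `relevant` would be the column lists given above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing "Category" value.
sample = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["TOOLS", None, "GAME"],
    # "Rating" is not a relevant column, so its gaps are ignored.
    "Rating": [4.1, np.nan, np.nan],
})

relevant = ["App", "Category"]

# Number of missing values in each relevant column.
missing_per_column = sample[relevant].isna().sum()
print(missing_per_column)

# Only rows with gaps in the relevant columns are removed.
cleaned = sample.dropna(subset=relevant)
print(cleaned.shape)
```

If every count is zero, `dropna` leaves the shape unchanged, which is what happened with both datasets here.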
Data cleaning is done, so now we can move on to the analysis.
Common App Genres
Now that the data has been cleaned, let us find out which genres of apps are most common in both app markets. If an app genre is common, then there may be high demand for it among users.
Which columns in the datasets can give information about the app genres?
print("Google Play Store")
android_clean.head()
Google Play Store
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up |
print("\nApple App Store")
ios_clean.head()
Apple App Store
id | track_name | size_bytes | currency | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 284882215 | 389879808 | USD | 0.0 | 2974676 | 212 | 3.5 | 3.5 | 95.0 | 4+ | Social Networking | 37 | 1 | 29 | 1 | |
1 | 389801252 | 113954816 | USD | 0.0 | 2161558 | 1289 | 4.5 | 4.0 | 10.23 | 12+ | Photo & Video | 37 | 0 | 29 | 1 | |
2 | 529479190 | Clash of Clans | 116476928 | USD | 0.0 | 2130805 | 579 | 4.5 | 4.5 | 9.24.12 | 9+ | Games | 38 | 5 | 18 | 1 |
3 | 420009108 | Temple Run | 65921024 | USD | 0.0 | 1724546 | 3842 | 4.5 | 4.0 | 1.6.2 | 9+ | Games | 40 | 5 | 1 | 1 |
4 | 284035177 | Pandora - Music & Radio | 130242560 | USD | 0.0 | 1126879 | 3594 | 4.0 | 4.5 | 8.4.1 | 12+ | Music | 37 | 4 | 1 | 1 |
For Google Play, some columns that seem relevant are “Category” and “Genres”. For the Apple App Store, the relevant column is “prime_genre”.
We can determine the most common genres by using frequency tables of the mentioned columns.
def freq_table(df, label):
    """Return a frequency table of the values in a column of a DataFrame."""
    col = df[label]
    freq = {}

    for value in col:
        freq.setdefault(value, 0)
        freq[value] += 1

    for key in freq:
        freq[key] /= len(df) / 100

    freq_series = pd.Series(freq).sort_values(ascending = False)

    return freq_series

def sr_to_df(sr, col_name = "number", n_head = None):
    """Return a DataFrame by resetting the index of a Series."""
    df = sr.rename(col_name).reset_index().rename(columns = {"index":"name"})
    if n_head is not None:
        df = df.head(n_head)

    return df

google_categories = freq_table(android_clean, "Category")
google_genres = freq_table(android_clean, "Genres")
apple_genres = freq_table(ios_clean, "prime_genre")
The frequency tables will be inspected in the sections below. Only the top positions in each table will be shown, for brevity.
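For comparison, pandas has a built-in that gives the same kind of percentage table. The sketch below uses a made-up sample (the `sample` DataFrame is hypothetical) and is only meant to show the idea behind `freq_table`.

```python
import pandas as pd

# Hypothetical sample: 3 of the 5 apps are games.
sample = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Games", "Music", "Education"],
})

# normalize=True returns proportions; multiplying by 100 gives percentages.
# value_counts sorts in descending order by default.
freq = sample["prime_genre"].value_counts(normalize=True) * 100
print(freq)
```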
Apple App Store: Prime Genres
First, the frequency table of Apple App Store prime genres shall be analyzed. Below, I have ordered the table by frequency, descending. I have also made bar graphs showing the top 10 positions in each frequency table.
sr_to_df(apple_genres, "percentage", n_head = 5)
name | percentage | |
---|---|---|
0 | Games | 58.136646 |
1 | Entertainment | 7.888199 |
2 | Photo & Video | 4.968944 |
3 | Education | 3.664596 |
4 | Social Networking | 3.291925 |
def bar_n(series, chart_title, ylabel, n = 10, perc = False):
    """Takes a series and outputs a bar graph of the first n items."""
    series.index.name = "name"
    df = series.rename("number").reset_index()
    df["number"] = [round(i, 2) for i in df["number"]]
    df = df[:n]

    bar = alt.Chart(df).mark_bar().encode(
        x = alt.X("name", title = "Name", sort = "-y"),
        y = alt.Y("number", title = ylabel),
    )

    text = bar.mark_text(
        align = 'center',
        baseline = 'middle',
        dy = -5, # Nudge text upward
    ).encode(
        text = 'number:Q'
    )

    chart = (bar + text).properties(
        title = chart_title,
        width = 700,
        height = 400,
    )

    return chart

bar_n(
    apple_genres,
    "Top 10 Most Common Prime Genres of iOS Apps",
    "Percentage of Apps",
    perc = True,
)
The top 5 most common prime genres in the Apple App Store are Games, Entertainment, Photo & Video, Education, and Social Networking. Games are at the top, accounting for over 58% of all apps. This is a much higher percentage than that of any other single genre.
The general impression is that entertainment-related iOS apps far outnumber practical ones.
Google Play Store: Categories
Next, below is the frequency table for Google Play Store app categories.
sr_to_df(google_categories, "percentage", 5)
name | percentage | |
---|---|---|
0 | FAMILY | 18.907942 |
1 | GAME | 9.724729 |
2 | TOOLS | 8.461191 |
3 | BUSINESS | 4.591606 |
4 | LIFESTYLE | 3.903430 |
bar_n(
    google_categories,
    "Top 10 Most Common Categories of Android Apps",
    "Percentage of Apps",
    perc = True,
)
The picture here seems to be different. The most common category is Family occupying almost 19% of all apps, followed by Game, Tools, Business, and Lifestyle.
The table suggests that practical app categories are more common in Google Play than in the Apple App Store.
Google Play Store: Genres
Lastly, below is the frequency table for Google Play Store app genres.
sr_to_df(google_genres, "percentage", 5)
name | percentage | |
---|---|---|
0 | Tools | 8.449910 |
1 | Entertainment | 6.069495 |
2 | Education | 5.347473 |
3 | Business | 4.591606 |
4 | Productivity | 3.892148 |
bar_n(
    google_genres,
    "Top 10 Most Common Genres of Android Apps",
    "Percentage of Apps",
)
There are 114 genres in this table, so it is not fully displayed. However, it would appear that the top 5 genres are Tools (8%), Entertainment, Education, Business, and Productivity. Like with the categories, practical apps are very common.
However, I noticed something special about this frequency table. Some genres are actually combinations of multiple genres, separated by semi-colons. If I can extract and count individual genres from these combined genres, then I can get a more accurate idea of app genres in the Google Play Store.
This frequency table will show numbers instead of percentages. Since the genres overlap, the percentages would add up to greater than 100%.
freq = {}
for value in android_clean["Genres"]:
    genres = value.split(";")
    for genre in genres:
        freq.setdefault(genre, 0)
        freq[genre] += 1

google_genres_split = pd.Series(freq).sort_values(ascending = False)

sr_to_df(google_genres_split, n_head = 5)
name | number | |
---|---|---|
0 | Tools | 750 |
1 | Education | 606 |
2 | Entertainment | 569 |
3 | Business | 407 |
4 | Lifestyle | 347 |
bar_n(
    google_genres_split,
    "Top 10 Most Common Genres of Android Apps (Split Up)",
    "Number of Apps",
)
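The splitting and counting of combined genres can also be done with pandas string methods. The sketch below is a small illustration, not the code that produced the table above; the `genres` Series is a made-up sample whose values follow the “Genres” format seen in the dataset.

```python
import pandas as pd

# Sample values in the style of the "Genres" column.
genres = pd.Series([
    "Art & Design",
    "Art & Design;Pretend Play",
    "Art & Design;Creativity",
    "Tools",
])

# Split each value on ";" into a list, give every list element its own row,
# and then count how often each individual genre occurs.
counts = genres.str.split(";").explode().value_counts()
print(counts)
```

Because one app can contribute to several genres, the counts add up to more than the number of apps (6 counts for 4 apps in this sample).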
The frequency table has slightly different placements now: Education has moved ahead of Entertainment, and Lifestyle has replaced Productivity in fifth place. The top genres are now Tools, Education, Entertainment, Business, and Lifestyle. Practical app genres are very common in the Google Play Store. They are more common here than in the Apple App Store.
Based on the results, the Google Play Store has a selection of apps that is more balanced between entertainment and practicality.
Going back to the frequency table of Categories, it seems that each Category represents a group of Genres. For example, one would expect apps in the Simulation, Arcade, Puzzle, Strategy, etc. genres to be under the Game category. It was shown earlier that this category is the 2nd most common in the Google Play Store.
The Categories column is more general and gives a more accurate picture of the common types of apps. Thus, from here on, I will be analyzing only the “Category” column and not the “Genres” column.
I will now use “app type” to generally refer to the Apple App Store’s “prime_genre” values or the Google Play Store’s “Category” values.
App Types by Number of Users
We first looked at app types in terms of how common they are in the two app markets. Now, we shall see how many users there are for each app type.
Apple App Store: Rating Counts
In the Apple App Store dataset, there is no column that indicates the number of users.
print(list(ios_clean.columns))
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
However, the "rating_count_tot" column exists. It indicates the total number of ratings given to each app. We can use it as a proxy for the number of users of each app.
The function below will return a Series showing the average number of users per app within each type. (Not the total number of users per type.)
def users_by_type(df, type_col, users_col, moct = "mean"):
    """Return a Series that maps each app type to
    the average number of users per app for that type.
    Specify 'mean' or 'median' for the measure of central tendency."""
    dct = {}

    for index, row in df.iterrows():
        app_type = row[type_col]
        users = row[users_col]

        dct.setdefault(app_type, []).append(users)

    dct2 = {}

    for app_type in dct:
        counts = dct[app_type]
        if moct == "mean":
            dct2[app_type] = np.mean(counts)
        elif moct == "median":
            dct2[app_type] = np.median(counts)

    result = pd.Series(dct2).sort_values(ascending = False)
    return result

ios_users = users_by_type(ios_clean, "prime_genre", "rating_count_tot")

sr_to_df(ios_users, n_head = 5)
name | number | |
---|---|---|
0 | Navigation | 86090.333333 |
1 | Reference | 74942.111111 |
2 | Social Networking | 71548.349057 |
3 | Music | 57326.530303 |
4 | Weather | 52279.892857 |
bar_n(
    ios_users,
    "Top 10 Most Popular iOS App Types",
    "Mean Number of Users per App",
)
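The per-type average is a standard group-by aggregation, so it could also be sketched with `groupby`. The version below is an assumption-laden illustration: `users_by_type_grouped` is a hypothetical name and the sample numbers are made up. Unlike `users_by_type`, it raises an error for an unknown `moct` value instead of returning an empty Series.

```python
import pandas as pd

def users_by_type_grouped(df, type_col, users_col, moct="mean"):
    """Map each app type to the mean or median number of users per app."""
    if moct not in ("mean", "median"):
        raise ValueError("moct must be 'mean' or 'median'")
    return (
        df.groupby(type_col)[users_col]
          .agg(moct)
          .sort_values(ascending=False)
    )

# Hypothetical sample with one dominant Navigation app.
sample = pd.DataFrame({
    "prime_genre": ["Navigation", "Navigation", "Navigation", "Music", "Music"],
    "rating_count_tot": [300, 100, 20, 50, 70],
})

mean_users = users_by_type_grouped(sample, "prime_genre", "rating_count_tot")
median_users = users_by_type_grouped(
    sample, "prime_genre", "rating_count_tot", moct="median"
)
print(mean_users)
print(median_users)
```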
The top 5 iOS app types with the highest mean average number of users per app are Navigation, Reference, Social Networking, Music, and Weather.
However, these mean averages may be skewed by a few particularly popular apps. For example, let us look at the number of users of the top 5 Navigation apps.
ios_nav = ios_clean[[
    "track_name",
    "rating_count_tot",
]].loc[
    ios_clean["prime_genre"] == "Navigation"
].sort_values(
    by = "rating_count_tot",
    ascending = False,
).set_index(
    "track_name",
)

# `ios_nav` is still a DataFrame at this point.
# It becomes a Series below.
ios_nav = ios_nav["rating_count_tot"]

sr_to_df(ios_nav, n_head = 5)
track_name | number | |
---|---|---|
0 | Waze - GPS Navigation, Maps & Real-time Traffic | 345046 |
1 | Google Maps - Navigation & Transit | 154911 |
2 | Geocaching® | 12811 |
3 | CoPilot GPS – Car Navigation & Offline Maps | 3582 |
4 | ImmobilienScout24: Real Estate Search in Germany | 187 |
bar_n(
    ios_nav,
    "iOS Navigation Apps by Popularity",
    "Number of Users",
)
Clearly, the distribution is skewed because Waze has such a high number of users. Therefore, a better measure of central tendency to use would be the median, not the mean.
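To see how strongly a few large apps pull the mean, here is a small worked example using only the five rating counts shown in the table above. Since it covers just these five apps, the figures differ from the genre-wide values reported elsewhere in this post.

```python
import numpy as np

# Rating counts of the five Navigation apps listed in the table above.
top_five_navigation = [345046, 154911, 12811, 3582, 187]

mean_users = np.mean(top_five_navigation)      # 516537 / 5
median_users = np.median(top_five_navigation)  # middle value of the sorted list

print("Mean:", mean_users)
print("Median:", median_users)
```

The mean is about 103,307 while the median is 12,811, so the mean is roughly eight times larger than what a typical app in this list has.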
Let us repeat the analysis using the median this time:
ios_users = users_by_type(
    ios_clean,
    "prime_genre",
    "rating_count_tot",
    moct = "median",
)

sr_to_df(ios_users, n_head = 5)
name | number | |
---|---|---|
0 | Productivity | 8737.5 |
1 | Navigation | 8196.5 |
2 | Reference | 6614.0 |
3 | Shopping | 5936.0 |
4 | Social Networking | 4199.0 |
bar_n(
    ios_users,
    "Top 10 Most Popular iOS App Types",
    "Median Number of Users per App",
)
The top 5 most popular iOS app types by median number of users per app are:
- Productivity
- Navigation
- Reference
- Shopping
- Social Networking
These placements are quite different from the top 5 most common iOS apps (Games, Entertainment, Photo & Video, Education, and Social Networking).
We can say the following about the Apple App Store.
- Apps for entertainment and fun, notably Games, are the most common apps.
- Apps for practical purposes, notably Productivity, are the most popular apps.
Google Play Store: Installs
Let us see which columns in the Google Play Store dataset can tell us about the number of users per app.
android_clean.head()
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up |
The “Installs” column seems like the best indicator of the number of users.
android_clean[["App", "Installs"]]
App | Installs | |
---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | 10,000+ |
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | 5,000,000+ |
3 | Sketch - Draw & Paint | 50,000,000+ |
4 | Pixel Draw - Number Art Coloring Book | 100,000+ |
5 | Paper flowers instructions | 50,000+ |
... | ... | ... |
10836 | Sya9a Maroc - FR | 5,000+ |
10837 | Fr. Mike Schmitz Audio Teachings | 100+ |
10838 | Parkinson Exercices FR | 1,000+ |
10839 | The SCP Foundation DB fr nn5n | 1,000+ |
10840 | iHoroscope - 2018 Daily Horoscope & Astrology | 10,000,000+ |
8864 rows × 2 columns
The column contains strings which indicate the general range of how many users installed the apps. Since we cannot find the exact number of installs, we will simply remove the commas and “+” signs and convert the numbers into integers.
android_clean["Installs"] = [int(re.sub("[,+]", "", text))
                             for text in android_clean["Installs"]]

android_clean[["Installs"]]
Installs | |
---|---|
0 | 10000 |
2 | 5000000 |
3 | 50000000 |
4 | 100000 |
5 | 50000 |
... | ... |
10836 | 5000 |
10837 | 100 |
10838 | 1000 |
10839 | 1000 |
10840 | 10000000 |
8864 rows × 1 columns
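The same conversion can be written with pandas string methods. This is a minimal sketch on a few sample strings in the format of the “Installs” column; the `installs` Series is made up for the illustration.

```python
import pandas as pd

# Sample strings in the format of the "Installs" column.
installs = pd.Series(["10,000+", "5,000,000+", "100+"])

# Remove commas and plus signs, then convert the remaining digits to integers.
as_int = installs.str.replace(r"[,+]", "", regex=True).astype(int)
print(as_int)
```

Keep in mind that each result is a lower bound: "10,000+" means at least 10,000 installs, not exactly 10,000.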
Let us now see which app categories are most popular. We will use the median average here, as we did for iOS apps.
android_users = users_by_type(
    android_clean,
    "Category",
    "Installs",
    moct = "median",
)

sr_to_df(android_users, n_head = 10)
name | number | |
---|---|---|
0 | ENTERTAINMENT | 1000000.0 |
1 | EDUCATION | 1000000.0 |
2 | GAME | 1000000.0 |
3 | PHOTOGRAPHY | 1000000.0 |
4 | SHOPPING | 1000000.0 |
5 | WEATHER | 1000000.0 |
6 | VIDEO_PLAYERS | 1000000.0 |
7 | COMMUNICATION | 500000.0 |
8 | FOOD_AND_DRINK | 500000.0 |
9 | HEALTH_AND_FITNESS | 500000.0 |
bar_n(
    android_users,
    "Top 10 Most Popular Android App Types",
    "Median Number of Users per App",
    n = 10,
)
Since the top 5 spots all had the same median number of users per app (1000000), the graph was expanded to include the top 10 spots.
It appears that the types of Android apps with the highest median number of users per app are:
- GAME
- VIDEO_PLAYERS
- WEATHER
- EDUCATION
- ENTERTAINMENT
- PHOTOGRAPHY
- SHOPPING
We can say the following about the Google Play Store.
- Both fun apps and practical apps are very common.
- The most popular apps are also a mix of fun apps and practical apps.
App Profile Ideas
Based on the results, we can now determine a profitable app profile for the hypothetical app company.
Here is a summary of the findings on the 2 app stores.
- The Google Play Store has a balanced mix of fun and practical apps, so we can pick either kind.
- On the other hand, the Apple App Store appears to be oversaturated with game apps, and practical apps are more popular.
Therefore, in order to get the most users, the app company can set themselves apart in the Apple App Store by developing a useful practical app.
The most popular types of practical apps for the Apple App Store would be:
- Productivity
- Navigation
- Reference
- Shopping
For the Google Play Store, these would be:
- Weather
- Education
- Photography
- Shopping
Shopping appears in both lists, so it may be the most profitable type of app. However, the app company would have to make a unique app that has an edge over existing popular shopping apps. The same would apply for making a navigation app.
Considering that Reference and Education apps are popular, perhaps these two types could be combined into one app. First, let us find out the titles of the most popular apps in these genres.
reference_popularity = ios_clean[[
    "track_name",
    "rating_count_tot"
]].loc[
    ios_clean["prime_genre"] == "Reference"
].dropna(
).sort_values(
    "rating_count_tot",
    ascending = False,
).set_index(
    "track_name",
)["rating_count_tot"]

sr_to_df(reference_popularity, n_head = 10)
track_name | number | |
---|---|---|
0 | Bible | 985920 |
1 | Dictionary.com Dictionary & Thesaurus | 200047 |
2 | Dictionary.com Dictionary & Thesaurus for iPad | 54175 |
3 | Google Translate | 26786 |
4 | Muslim Pro: Ramadan 2017 Prayer Times, Azan, Q... | 18418 |
5 | New Furniture Mods - Pocket Wiki & Game Tools ... | 17588 |
6 | Merriam-Webster Dictionary | 16849 |
7 | Night Sky | 12122 |
8 | City Maps for Minecraft PE - The Best Maps for... | 8535 |
9 | LUCKY BLOCK MOD ™ for Minecraft PC Edition - T... | 4693 |
bar_n(
    reference_popularity,
    "Top 10 Most Popular iOS Reference Apps",
    "Number of Users",
)
education_popularity = android_clean[[
    "App",
    "Installs"
]].loc[
    android_clean["Category"] == "EDUCATION"
].dropna(
).sort_values(
    "Installs",
    ascending = False,
).set_index(
    "App",
)["Installs"]

sr_to_df(education_popularity, n_head = 5)
App | number | |
---|---|---|
0 | Quizlet: Learn Languages & Vocab with Flashcards | 10000000 |
1 | Learn languages, grammar & vocabulary with Mem... | 10000000 |
2 | Learn English with Wlingua | 10000000 |
3 | Remind: School Communication | 10000000 |
4 | Math Tricks | 10000000 |
bar_n(
    education_popularity,
    "Top 10 Most Popular Android Education Apps",
    "Number of Users",
)
The most popular Reference apps are the Bible and some dictionary and translation apps. The most popular Education apps teach languages (especially English), or Math.
Therefore, the following are some ideas for a profitable app:
- An app containing the Bible, another religious text, or another well-known text. The app can additionally include reflections, analyses, or quizzes about the text.
- An app that contains an English dictionary, a translator, and some quick guides on English vocabulary and grammar.
- An app like the above, but for a different language that is spoken by many people.
- An app that teaches English and Math lessons. Perhaps it could be marketed as a practice app for an entrance exam.
Conclusion
In this project, we analyzed app data from a Google Play Store dataset and an Apple App Store dataset. Apps were limited to free apps targeted towards English speakers, because the hypothetical app company makes these kinds of apps. The most common and popular app genres were determined.
In the end, several ideas of profitable apps were listed. The app company may now review the analysis and consider the suggestions. This may help them make an informed, data-driven decision regarding the next app that they will develop.