import pandas as pd
import numpy as np
import altair as alt
import datetime as dt
Overview
Hacker News is a popular website about technology. Specifically, it is a community choice aggregator of tech content. Users can:
- Submit tech articles that the user found online.
- Submit “Ask” posts to ask the community a question.
- Submit “Show” posts to show the community something that the user made.
- Vote and comment on other people’s posts.
In this project, we will analyze and compare Ask posts and Show posts in order to answer the following questions:
- Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?
- Can posting at a certain time of day result in getting more comments?
This analysis can be helpful for Hacker News users who would like for their posts to reach a larger audience on the platform.
I wrote this notebook for the Dataquest course’s Guided Project: Exploring Hacker News Posts. The research questions and general project flow came from Dataquest. However, all of the text and code here are written by me unless stated otherwise.
Package Installs
Dataset
The dataset for this project is the Hacker News Posts dataset on Kaggle, uploaded by Hacker News.
The following is quoted from the dataset’s Description.
This data set is Hacker News posts from the last 12 months (up to September 26 2016).
It includes the following columns:
- title: title of the post (self explanatory)
- url: the url of the item being linked to
- num_points: the number of upvotes the post received
- num_comments: the number of comments the post received
- author: the name of the account that made the post
- created_at: the date and time the post was made (the time zone is Eastern Time in the US)
Let us view the first 5 rows of the dataset below.
hn = pd.read_csv("./private/2021-05-11-OHNP-Files/HN_posts_year_to_Sep_26_2016.csv")

hn.head()
| | id | title | url | num_points | num_comments | author | created_at |
---|---|---|---|---|---|---|---|
0 | 12579008 | You have two days to comment if you want stem ... | http://www.regulations.gov/document?D=FDA-2015... | 1 | 0 | altstar | 9/26/2016 3:26 |
1 | 12579005 | SQLAR the SQLite Archiver | https://www.sqlite.org/sqlar/doc/trunk/README.md | 1 | 0 | blacksqr | 9/26/2016 3:24 |
2 | 12578997 | What if we just printed a flatscreen televisio... | https://medium.com/vanmoof/our-secrets-out-f21... | 1 | 0 | pavel_lishin | 9/26/2016 3:19 |
3 | 12578989 | algorithmic music | http://cacm.acm.org/magazines/2011/7/109891-al... | 1 | 0 | poindontcare | 9/26/2016 3:16 |
4 | 12578979 | How the Data Vault Enables the Next-Gen Data W... | https://www.talend.com/blog/2016/05/12/talend-... | 1 | 0 | markgainor1 | 9/26/2016 3:14 |
Below is the shape of the dataset.
print(hn.shape)
(293119, 7)
There are 293,119 rows and 7 columns in the dataset.
Before the data can be analyzed, it must first be cleaned.
Data Cleaning
Duplicate Rows
Below, I use pandas to delete duplicate rows except for the first instance of each duplicate.
Rows will be considered as duplicates if they are exactly alike in all features. I decided on this because it is possible for two posts to have the same title and/or url but be posted at different times or by different users. Thus, we cannot identify duplicates based on one or two features alone.
hn = hn.drop_duplicates(keep = "first")

print(hn.shape)
(293119, 7)
No duplicates were found. All rows were kept.
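To make the reasoning about exact-match duplicates concrete, here is a minimal sketch on toy data. The rows, titles, and URLs are hypothetical and are not taken from the dataset. It compares dropping rows that match in every column with dropping rows that match on title and url only.

```python
import pandas as pd

# Toy data (hypothetical): the same link submitted at two different times,
# plus one row that is an exact copy of another.
posts = pd.DataFrame({
    "title": ["Cool tool", "Cool tool", "Cool tool", "Other story"],
    "url": ["http://a.example", "http://a.example", "http://a.example", "http://b.example"],
    "created_at": ["9/1/2016 10:00", "9/2/2016 15:30", "9/2/2016 15:30", "9/3/2016 8:00"],
})

# Exact-match deduplication: only the verbatim copy (row 2) is dropped.
exact = posts.drop_duplicates(keep="first")

# Matching on title and url alone would also drop the legitimate repost (row 1).
by_title_url = posts.drop_duplicates(subset=["title", "url"], keep="first")

print(len(exact), len(by_title_url))
```

In this sketch the stricter rule keeps the repost made on a different day, which is the behavior wanted here.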
Posts without Comments
Our research questions involve the number of comments on each post. However, there are many posts with 0 comments.
To illustrate this, below is a frequency table of the number of comments on each post.
def freq_comments(df = hn):
    """Function to make a frequency table of the number of comments per post
    specifically for the Hacker News dataset."""

    freq_df = df["num_comments"].value_counts().reset_index()

    freq_df.columns = ["num_comments", "frequency"]

    freq_df = freq_df.sort_values(
        by = "num_comments",
    ).reset_index(
        drop = True,
    )

    return freq_df

freq_df1 = freq_comments()

freq_df1
| | num_comments | frequency |
---|---|---|
0 | 0 | 212718 |
1 | 1 | 28055 |
2 | 2 | 9731 |
3 | 3 | 5016 |
4 | 4 | 3272 |
... | ... | ... |
543 | 1007 | 1 |
544 | 1120 | 1 |
545 | 1448 | 1 |
546 | 1733 | 1 |
547 | 2531 | 1 |
548 rows × 2 columns
The table above shows that posts with 0 comments are most frequent.
Let us plot the table on a histogram.
def hist_comments(df, title):
    """Function to make a histogram of the number of comments per post
    specifically for the Hacker News dataset."""

    chart = alt.Chart(df).mark_bar().encode(
        x = alt.X(
            "num_comments:Q",
            title = "Number of Comments",
            bin = alt.Bin(step = 1)
        ),
        y = alt.Y(
            "frequency:Q",
            title = "Frequency",
        ),
    ).properties(
        title = title,
        width = 700,
        height = 400,
    )

    return chart

hist_comments(freq_df1, "Histogram of Number of Comments per Post")
There are so many posts with 0 comments that we cannot see the histogram bins for other numbers of comments.
Considering that the dataset is large and most rows have 0 comments, it would be best to drop all rows with 0 comments. This would make analysis less computationally expensive and allow us to answer our research questions.
with_comments = hn["num_comments"] > 0
hn = hn.loc[with_comments].reset_index(drop = True)

print(hn.shape)
(80401, 7)
Now, the dataset is left with only 80,401 rows. This will be easier to work with.
Below is the new histogram.
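# Pass hn explicitly: the default argument of freq_comments still refers
# to the DataFrame as it was before the 0-comment rows were removed.
freq_df2 = freq_comments(hn)

hist_comments(freq_df2, "Histogram of Number of Comments per Post")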
The distribution is still heavily right-skewed since many posts have very few comments. What’s important is that unnecessary data has been removed.
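One Python detail matters when `freq_comments` is called without arguments after `hn` has been reassigned. A default such as `df = hn` is evaluated once, when the function is defined, so it keeps pointing at the original object. The sketch below shows this with plain lists. The names `records` and `count_rows` are hypothetical stand-ins, not part of the project code.

```python
records = [0, 0, 3, 5]

def count_rows(rows=records):
    """The default is bound to the list object that `records` names right now."""
    return len(rows)

# Rebinding the name creates a new object; the default still points at the old one.
records = [r for r in records if r > 0]

stale = count_rows()         # uses the original 4-element list
fresh = count_rows(records)  # uses the filtered 2-element list

print(stale, fresh)
```

Passing the DataFrame explicitly avoids this kind of surprise.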
Missing Values
Finally, let us remove rows with missing values. In order to answer our research questions, we only need the following columns:
- title
- num_comments
- created_at
Thus, we will delete rows with missing values in these columns.
hn.dropna(
    subset = ["title", "num_comments", "created_at"],
    inplace = True,
)

print(hn.shape)
(80401, 7)
The number of rows did not change from 80401. Therefore, no missing values were found in these columns, and no rows were dropped.
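A more direct check is to count missing values per column before dropping anything. The following is a small sketch on toy data with hypothetical rows. It also shows why the `subset` argument matters: Ask posts have no url, so a plain `dropna()` would discard them.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one row lacks a url, another lacks a title.
toy = pd.DataFrame({
    "title": ["Ask HN: A?", None, "Show HN: B"],
    "url": [np.nan, "http://x.example", "http://y.example"],
    "num_comments": [3, 1, 2],
})

# Missing values per column, restricted to the columns the analysis needs.
needed = ["title", "num_comments"]
missing_counts = toy[needed].isna().sum()

# Rows are dropped only when a needed column is missing;
# the row with a missing url (row 0) is kept.
cleaned = toy.dropna(subset=needed)

print(missing_counts.to_dict(), len(cleaned))
```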
Data cleaning is now done.
Filtering Posts
As mentioned earlier, the first research question involves comparing Ask posts to Show posts. In order to do this, we have to group the posts into three types:
- Ask Posts
- Show Posts
- Other Posts
Other posts are usually posts that share a tech article found online.
Ask and Show posts can be identified using the start of the post title. Ask posts start with “Ask HN:”.
# Series.items() is used here because Series.iteritems() was removed in pandas 2.0.
ask_mask = [
    index
    for index, value in hn["title"].items()
    if value.startswith("Ask HN: ")
]

hn.loc[ask_mask].head()
| | id | title | url | num_points | num_comments | author | created_at |
---|---|---|---|---|---|---|---|
1 | 12578908 | Ask HN: What TLD do you use for local developm... | NaN | 4 | 7 | Sevrene | 9/26/2016 2:53 |
6 | 12578522 | Ask HN: How do you pass on your work when you ... | NaN | 6 | 3 | PascLeRasc | 9/26/2016 1:17 |
18 | 12577870 | Ask HN: Why join a fund when you can be an angel? | NaN | 1 | 3 | anthony_james | 9/25/2016 22:48 |
27 | 12577647 | Ask HN: Someone uses stock trading as passive ... | NaN | 5 | 2 | 00taffe | 9/25/2016 21:50 |
41 | 12576946 | Ask HN: How hard would it be to make a cheap, ... | NaN | 2 | 1 | hkt | 9/25/2016 19:30 |
On the other hand, Show posts start with “Show HN:”.
# Series.items() is used here because Series.iteritems() was removed in pandas 2.0.
show_mask = [
    index
    for index, value in hn["title"].items()
    if value.startswith("Show HN: ")
]

hn.loc[show_mask].head()
| | id | title | url | num_points | num_comments | author | created_at |
---|---|---|---|---|---|---|---|
35 | 12577142 | Show HN: Jumble Essays on the go #PaulInYourP... | https://itunes.apple.com/us/app/jumble-find-st... | 1 | 1 | ryderj | 9/25/2016 20:06 |
43 | 12576813 | Show HN: Learn Japanese Vocab via multiple cho... | http://japanese.vul.io/ | 1 | 1 | soulchild37 | 9/25/2016 19:06 |
52 | 12576090 | Show HN: Markov chain Twitter bot. Trained on ... | https://twitter.com/botsonasty | 3 | 1 | keepingscore | 9/25/2016 16:50 |
68 | 12575471 | Show HN: Project-Okot: Novel, CODE-FREE data-a... | https://studio.nuchwezi.com/ | 3 | 1 | nfixx | 9/25/2016 14:30 |
88 | 12574773 | Show HN: Cursor that Screenshot | http://edward.codes/cursor-that-screenshot | 3 | 3 | ed-bit | 9/25/2016 10:50 |
Other posts do not start with any special label.
Below, I create a new column “post_type” and assign the appropriate value to each row.
hn["post_type"] = "Other"

hn.loc[ask_mask, "post_type"] = "Ask"
hn.loc[show_mask, "post_type"] = "Show"

hn[["title", "post_type"]]
| | title | post_type |
---|---|---|
0 | Saving the Hassle of Shopping | Other |
1 | Ask HN: What TLD do you use for local developm... | Ask |
2 | Amazons Algorithms Dont Find You the Best Deals | Other |
3 | Emergency dose of epinephrine that does not co... | Other |
4 | Phone Makers Could Cut Off Drivers. So Why Don... | Other |
... | ... | ... |
80396 | My Keyboard | Other |
80397 | Google's new logo was created by Russian desig... | Other |
80398 | Why we aren't tempted to use ACLs on our Unix ... | Other |
80399 | Ask HN: What is/are your favorite quote(s)? | Ask |
80400 | Dying vets fuck you letter (2013) | Other |
80401 rows × 2 columns
Each row has now been labeled as a type of post.
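The same labeling can also be done without an explicit Python loop. The following is a sketch of one possible alternative, not the approach used above. It builds boolean masks with the vectorized `Series.str.startswith` and picks a label with `numpy.select`. The three titles are sample values and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

titles = pd.Series([
    "Ask HN: What TLD do you use?",
    "Show HN: Cursor that Screenshot",
    "SQLAR the SQLite Archiver",
])

# Boolean masks, one value per title.
is_ask = titles.str.startswith("Ask HN: ")
is_show = titles.str.startswith("Show HN: ")

# The first matching condition wins; anything else falls back to "Other".
post_type = np.select([is_ask, is_show], ["Ask", "Show"], default="Other")

print(list(post_type))
```

On a DataFrame, the result could be assigned directly to a new column in place of the two `.loc` assignments.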
Research Question 1: Comparing Ask and Show Posts
The first research question is, “Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?”
Note that the data is not normally distributed; it is right-skewed. For example, here is the distribution of the number of comments per Ask post.
ask_freq = freq_comments(
    df = hn.loc[hn["post_type"] == "Ask"]
)

ask_freq
| | num_comments | frequency |
---|---|---|
0 | 1 | 1383 |
1 | 2 | 1238 |
2 | 3 | 762 |
3 | 4 | 592 |
4 | 5 | 373 |
... | ... | ... |
203 | 898 | 1 |
204 | 910 | 1 |
205 | 937 | 1 |
206 | 947 | 1 |
207 | 1007 | 1 |
208 rows × 2 columns
hist_comments(
    ask_freq,
    "Histogram of Number of Comments per Ask Post"
)
The histogram is similar for Show posts.
show_freq = freq_comments(
    df = hn.loc[hn["post_type"] == "Show"],
)

show_freq
| | num_comments | frequency |
---|---|---|
0 | 1 | 1738 |
1 | 2 | 814 |
2 | 3 | 504 |
3 | 4 | 300 |
4 | 5 | 196 |
... | ... | ... |
137 | 250 | 1 |
138 | 257 | 1 |
139 | 280 | 1 |
140 | 298 | 1 |
141 | 306 | 1 |
142 rows × 2 columns
hist_comments(
    show_freq,
    "Histogram of Number of Comments per Show Post"
)
Because both distributions are right-skewed, a few heavily commented posts would pull the mean upward, so the mean would not be a good measure of central tendency for the “average number of comments per post.” We will use the median instead.
dct = {"Ask": None, "Show": None}

for key in dct:
    median = np.median(
        hn["num_comments"].loc[hn["post_type"] == key]
    )
    dct[key] = median

table = pd.DataFrame.from_dict(
    dct,
    orient = "index",
).reset_index().rename(columns = {
    "index": "post_type",
    0: "median_comments",
})

chart = alt.Chart(table).mark_bar().encode(
    y = alt.Y("post_type:N", title = "Post Type"),
    x = alt.X("median_comments:Q", title = "Median Number of Comments per Post"),
).properties(
    title = "Median Number of Comments for the Two Post Types",
)

chart
The bar graph shows that Ask posts have a higher median number of comments per post, compared to Show posts.
The results suggest that Ask posts get more comments than Show posts. It may be easier for users to reach a larger audience via Ask posts.
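To see why the choice between mean and median matters on skewed data, here is a small sketch with made-up comment counts, where one hypothetical Ask post is very heavily commented. It also shows that `groupby` can compute both statistics per post type in a single step.

```python
import pandas as pd

# Toy data (hypothetical): one Ask post with 900 comments skews the mean.
toy = pd.DataFrame({
    "post_type": ["Ask", "Ask", "Ask", "Show", "Show", "Show"],
    "num_comments": [1, 2, 900, 1, 3, 5],
})

# Mean and median of num_comments for each post type.
summary = toy.groupby("post_type")["num_comments"].agg(["mean", "median"])

print(summary)
```

In this toy data the Ask mean is 301 while the Ask median is 2, so the single outlier dominates the mean but leaves the median untouched.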
Research Question 2: Active Times
The second research question is, “Can posting at a certain time of day result in getting more comments?”
For this part of the analysis, we will only be using Ask post data for simplicity.
We will divide the day into 24 one-hour periods, and then calculate the number of Ask posts created in each period.
String Template for Time
Before analyzing, we need to inspect the “created_at” column of the dataset.
hn_ask = hn.loc[
    hn["post_type"] == "Ask"
].reset_index(
    drop = True,
)

hn_ask[["created_at"]].head()
| | created_at |
---|---|
0 | 9/26/2016 2:53 |
1 | 9/26/2016 1:17 |
2 | 9/25/2016 22:48 |
3 | 9/25/2016 21:50 |
4 | 9/25/2016 19:30 |
The strings in this column appear to use the following format:
month/day/year hour:minute
With the datetime module, the following is the equivalent formatting template.
template = "%m/%d/%Y %H:%M"
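As a quick sanity check, the template can be tried on a single string with the standard library before applying it to the whole column. The sample string is the first value shown in the table above.

```python
from datetime import datetime

template = "%m/%d/%Y %H:%M"

# %m, %d, and %H also accept values without a leading zero, such as "9" and "2".
parsed = datetime.strptime("9/26/2016 2:53", template)

print(parsed, parsed.hour)
```

If the template did not match the string, `strptime` would raise a `ValueError`, which makes a mismatch easy to spot early.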
Parsing Times
The time data can now be parsed and used for analysis.
hn_ask["created_at"] = pd.to_datetime(
    hn_ask["created_at"],
    format = template,
)

hn_ask["created_at"].head()
0 2016-09-26 02:53:00
1 2016-09-26 01:17:00
2 2016-09-25 22:48:00
3 2016-09-25 21:50:00
4 2016-09-25 19:30:00
Name: created_at, dtype: datetime64[ns]
The column is now in datetime format.
With this data, we will make 2 dictionaries. The hours_posts dictionary will count the number of posts at certain hours. The hours_comments dictionary will count the number of comments received by posts made at certain hours.
hours_posts = {}
hours_comments = {}

for index, row in hn_ask.iterrows():
    date_dt = row["created_at"]
    num_comments = row["num_comments"]

    # extract hour
    hour = date_dt.hour

    # update dictionaries
    hours_posts.setdefault(hour, 0)
    hours_posts[hour] += 1

    hours_comments.setdefault(hour, 0)
    hours_comments[hour] += num_comments
The hours were parsed and mapped to their respective counts of posts and comments.
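For comparison, the same per-hour counts and sums can be produced with a pandas `groupby` instead of a row-by-row loop. This is a sketch of an alternative on toy data with hypothetical rows, not the method used in this notebook. The output column names `num_posts` and `num_comments` are chosen to mirror the dictionaries above.

```python
import pandas as pd

# Toy data (hypothetical): two posts in hour 2 and one post in hour 15.
toy = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["9/26/2016 2:53", "9/26/2016 2:10", "9/25/2016 15:30"],
        format="%m/%d/%Y %H:%M",
    ),
    "num_comments": [7, 3, 10],
})

# Group by the hour of creation, then count posts and sum comments per hour.
by_hour = (
    toy.groupby(toy["created_at"].dt.hour)["num_comments"]
    .agg(num_posts="size", num_comments="sum")
    .rename_axis("hour")
    .reset_index()
)

print(by_hour)
```

The result is already a DataFrame sorted by hour, so no dictionary-to-DataFrame conversion would be needed with this approach.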
The code below transforms the dictionaries into DataFrames for ease of use.
def hour_to_df(dct, data_label):
    """Make a DataFrame from a dictionary that maps
    an 'hour' column to another column, named by `data_label`."""

    result = pd.DataFrame.from_dict(
        dct,
        orient = "index",
    ).reset_index().rename(columns = {
        "index": "hour",
        0: data_label,
    }).sort_values(
        by = "hour",
    ).reset_index(
        drop = True,
    )

    return result

hours_posts_df = hour_to_df(
    hours_posts,
    data_label = "num_posts",
)

hours_comments_df = hour_to_df(
    hours_comments,
    data_label = "num_comments",
)

hours_posts_df.head()
| | hour | num_posts |
---|---|---|
0 | 0 | 228 |
1 | 1 | 222 |
2 | 2 | 227 |
3 | 3 | 210 |
4 | 4 | 184 |
hours_comments_df.head()
| | hour | num_comments |
---|---|---|
0 | 0 | 2261 |
1 | 1 | 2068 |
2 | 2 | 2996 |
3 | 3 | 2152 |
4 | 4 | 2353 |
The hours have been parsed, and the tables have been generated.
Additionally, another DataFrame is created below. It calculates the median number of comments per post by the hour posted.
hn_ask["hour"] = hn_ask["created_at"].dt.hour

hours_median = hn_ask.pivot_table(
    index = "hour",
    values = "num_comments",
    aggfunc = np.median,
).reset_index()

hours_median
| | hour | num_comments |
---|---|---|
0 | 0 | 3.0 |
1 | 1 | 3.0 |
2 | 2 | 4.0 |
3 | 3 | 3.0 |
4 | 4 | 4.0 |
5 | 5 | 3.0 |
6 | 6 | 3.0 |
7 | 7 | 4.0 |
8 | 8 | 3.5 |
9 | 9 | 3.0 |
10 | 10 | 4.0 |
11 | 11 | 4.0 |
12 | 12 | 4.0 |
13 | 13 | 4.0 |
14 | 14 | 3.0 |
15 | 15 | 4.0 |
16 | 16 | 3.0 |
17 | 17 | 4.0 |
18 | 18 | 3.0 |
19 | 19 | 3.0 |
20 | 20 | 4.0 |
21 | 21 | 3.0 |
22 | 22 | 4.0 |
23 | 23 | 4.0 |
Number of Posts by Hour of the Day
Below is a table showing the total number of Ask posts that were created, grouped by hour of the day.
hours_posts_df
| | hour | num_posts |
---|---|---|
0 | 0 | 228 |
1 | 1 | 222 |
2 | 2 | 227 |
3 | 3 | 210 |
4 | 4 | 184 |
5 | 5 | 165 |
6 | 6 | 176 |
7 | 7 | 156 |
8 | 8 | 190 |
9 | 9 | 176 |
10 | 10 | 218 |
11 | 11 | 250 |
12 | 12 | 272 |
13 | 13 | 323 |
14 | 14 | 377 |
15 | 15 | 467 |
16 | 16 | 412 |
17 | 17 | 402 |
18 | 18 | 450 |
19 | 19 | 418 |
20 | 20 | 392 |
21 | 21 | 407 |
22 | 22 | 286 |
23 | 23 | 276 |
This table is in 24-hour time. Hour 13 refers to 1:00 PM. The table shows how many posts are made for every hour in the day.
Below is a line chart that shows this visually.
chart = alt.Chart(hours_posts_df).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_posts:Q", title = "Number of Posts"),
).properties(
    title = "Number of Posts by Hour of the Day",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart
The line chart shows that Hacker News users most actively make Ask posts between 15:00 and 18:00, or from 3:00 PM to 6:00 PM.
The most active hour for posting is 3:00 PM - 4:00 PM.
Number of Comments by Hour Posted
Next, a similar analysis is done for the total number of comments written by Hacker News users, grouped by the hour that the original posts were created. Below is the table for this data.
hours_comments_df
| | hour | num_comments |
---|---|---|
0 | 0 | 2261 |
1 | 1 | 2068 |
2 | 2 | 2996 |
3 | 3 | 2152 |
4 | 4 | 2353 |
5 | 5 | 1838 |
6 | 6 | 1587 |
7 | 7 | 1584 |
8 | 8 | 2362 |
9 | 9 | 1477 |
10 | 10 | 3011 |
11 | 11 | 2794 |
12 | 12 | 4226 |
13 | 13 | 7219 |
14 | 14 | 4970 |
15 | 15 | 18525 |
16 | 16 | 4458 |
17 | 17 | 5536 |
18 | 18 | 4824 |
19 | 19 | 3949 |
20 | 20 | 4462 |
21 | 21 | 4500 |
22 | 22 | 3369 |
23 | 23 | 2297 |
Below is the line chart that visualizes the table.
chart = alt.Chart(hours_comments_df).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_comments:Q", title = "Number of Comments"),
).properties(
    title = "Number of Comments by Hour Posted",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart
The line chart shows that many comments are made on posts created from 13:00 to 17:00, or 1:00 PM - 5:00 PM.
Notably, there was a total of over 18,000 comments on posts created at 3:00 PM. This may suggest that Hacker News users most actively comment at around this time.
However, there is also the possibility that this total was influenced by a few outliers. Let’s check the distribution of the number of comments made at 3:00 PM.
alt.Chart(
    hn_ask.loc[hn_ask["hour"] == 15]
).mark_bar().encode(
    x = alt.X("num_comments:Q"),
    y = alt.Y("count()")
).properties(
    title = "Distribution of Number of Comments per Post (3:00 PM)",
    width = 700,
    height = 200,
).configure_axis(
    grid = True,
)
Indeed, there are several outlier posts with a very high number of comments, going up to 1000. These influenced the spike in the line chart.
Therefore, we can’t say that a post will definitely get many comments if it is posted at 3:00 PM. However, we can say that generally a lot of comments are made in the afternoon from 1:00 PM to 5:00 PM.
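The effect of a single outlier on an hourly total can be illustrated with a few made-up numbers. The values below are hypothetical and only mimic the shape of the 3:00 PM distribution: many small counts and one very large one.

```python
import pandas as pd

# Hypothetical comment counts for one hour, including one viral post.
comments = pd.Series([1, 2, 3, 4, 4, 5, 950])

total = comments.sum()

# Share of all comments in this hour that belong to the single largest post.
top_share = comments.nlargest(1).sum() / total

print(total, comments.median(), round(top_share, 2))
```

In this toy case one post accounts for almost the entire total, while the median stays at 4. A large hourly total therefore says little about what a typical post receives.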
Median Number of Comments per Post, by Hour Posted
The table below shows the median number of comments per post, by the hour of posting.
hours_median
| | hour | num_comments |
---|---|---|
0 | 0 | 3.0 |
1 | 1 | 3.0 |
2 | 2 | 4.0 |
3 | 3 | 3.0 |
4 | 4 | 4.0 |
5 | 5 | 3.0 |
6 | 6 | 3.0 |
7 | 7 | 4.0 |
8 | 8 | 3.5 |
9 | 9 | 3.0 |
10 | 10 | 4.0 |
11 | 11 | 4.0 |
12 | 12 | 4.0 |
13 | 13 | 4.0 |
14 | 14 | 3.0 |
15 | 15 | 4.0 |
16 | 16 | 3.0 |
17 | 17 | 4.0 |
18 | 18 | 3.0 |
19 | 19 | 3.0 |
20 | 20 | 4.0 |
21 | 21 | 3.0 |
22 | 22 | 4.0 |
23 | 23 | 4.0 |
This is visualized in the line chart below, which looks quite different from the previous two charts.
chart = alt.Chart(hours_median).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_comments:Q", title = "Median Number of Comments per Post"),
).properties(
    title = "Median Number of Comments per Post, by Hour Posted",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart
This graph shows that the median number of comments per post is very consistent throughout the day. It ranges from 3 comments to 4 comments.
This brings us a new question. We’ve seen that Hacker News users most actively post and comment in the afternoon. So, why does the median number of comments per post not increase in the afternoon?
A possible explanation is that since the site is oversaturated with new posts in the afternoon, only the very best posts receive attention. The rest are lost in the flood of new posts.
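The gap between totals and medians can be made concrete by dividing the total comments by the number of posts for a few hours. The figures below are copied from the tables above for hours 7, 13, and 15. The mean itself is not computed elsewhere in this analysis, so treat this as a rough side calculation.

```python
# (num_posts, num_comments) for three hours, copied from the tables above.
totals = {7: (156, 1584), 13: (323, 7219), 15: (467, 18525)}

# Median comments per post for the same hours, also from the tables above.
medians = {7: 4.0, 13: 4.0, 15: 4.0}

# Mean comments per post = total comments / number of posts.
mean_per_post = {
    hour: comments / posts for hour, (posts, comments) in totals.items()
}

for hour, mean in mean_per_post.items():
    print(hour, round(mean, 1), medians[hour])
```

The means come out several times larger than the medians, especially for hour 15. This is consistent with the idea that a small number of heavily commented posts account for most of the afternoon comments.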
Conclusion
In this project, we analyzed data about Hacker News posts, specifically regarding the number of comments that they receive. Below are the research questions, and the best answers that we could come up with from our analysis.
- Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?
Ask posts receive a higher median number of comments per post, compared to Show posts. In order to reach a wider audience or converse with more users, it is better to make an Ask post.
- Can posting at a certain time of day result in getting more comments?
Hacker News users are very active in the afternoon, from 1:00 PM to 6:00 PM. However, if you post in the afternoon, your post may get lost in the flood of new posts. Hypothetically, it may be better to post in the morning, like at 7:00 AM, so that some people can notice your post. Then, your post can get more attention in the afternoon.
Thanks for reading!