Overview

Hacker News is a popular website about technology. Specifically, it is a community-driven aggregator of tech content. Users can:

Submit tech articles that the user found online.
Submit “Ask” posts to ask the community a question.
Submit “Show” posts to show the community something that the user made.
Vote and comment on other people’s posts.

In this project, we will analyze and compare Ask posts and Show posts in order to answer the following questions:

Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?
Can posting at a certain time of day result in getting more comments?

This analysis can be helpful for Hacker News users who would like their posts to reach a larger audience on the platform.

Note

I wrote this notebook for the Dataquest course’s Guided Project: Exploring Hacker News Posts. The research questions and general project flow came from Dataquest. However, all of the text and code here are written by me unless stated otherwise.

Package Imports

import pandas as pd
import numpy as np
import altair as alt
import datetime as dt

Dataset

The dataset for this project is the Hacker News Posts dataset on Kaggle, uploaded by Hacker News.

The following is quoted from the dataset’s Description.

This data set is Hacker News posts from the last 12 months (up to September 26 2016).

It includes the following columns:

title: title of the post (self explanatory)
url: the url of the item being linked to
num_points: the number of upvotes the post received
num_comments: the number of comments the post received
author: the name of the account that made the post
created_at: the date and time the post was made (the time zone is Eastern Time in the US)

Let us view the first 5 rows of the dataset below.

hn = pd.read_csv("./private/2021-05-11-OHNP-Files/HN_posts_year_to_Sep_26_2016.csv")
hn.head()

	id	title	url	num_points	author	created_at
0	12579008	You have two days to comment if you want stem ...	http://www.regulations.gov/document?D=FDA-2015...	1	altstar	9/26/2016 3:26
1	12579005	SQLAR the SQLite Archiver	https://www.sqlite.org/sqlar/doc/trunk/README.md	1	blacksqr	9/26/2016 3:24
2	12578997	What if we just printed a flatscreen televisio...	https://medium.com/vanmoof/our-secrets-out-f21...	1	pavel_lishin	9/26/2016 3:19
3	12578989	algorithmic music	http://cacm.acm.org/magazines/2011/7/109891-al...	1	poindontcare	9/26/2016 3:16
4	12578979	How the Data Vault Enables the Next-Gen Data W...	https://www.talend.com/blog/2016/05/12/talend-...	1	markgainor1	9/26/2016 3:14

Below is the shape of the dataset.

print(hn.shape)

(293119, 7)

There are 293,119 rows and 7 columns in the dataset.

Before the data can be analyzed, it must first be cleaned.

Data Cleaning

Duplicate Rows

Below, I use pandas to delete duplicate rows except for the first instance of each duplicate.

Rows will be considered as duplicates if they are exactly alike in all features. I decided on this because it is possible for two posts to have the same title and/or url but be posted at different times or by different users. Thus, we cannot identify duplicates based on one or two features alone.

hn = hn.drop_duplicates(keep = "first")
print(hn.shape)

(293119, 7)

No duplicates were found. All rows were kept.
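To illustrate why all columns are compared, here is a minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from the dataset). Rows 0 and 1 share a title and URL but differ in author and time, while row 2 is an exact repeat of row 1.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share title and url, but not author or time.
toy = pd.DataFrame({
    "title": ["Same title", "Same title", "Same title"],
    "url": ["http://example.com", "http://example.com", "http://example.com"],
    "author": ["alice", "bob", "bob"],
    "created_at": ["9/26/2016 3:26", "9/26/2016 4:10", "9/26/2016 4:10"],
})

# Comparing all columns: only the exact repeat (row 2) is dropped.
all_cols = toy.drop_duplicates(keep="first")

# Comparing title and url only: rows 1 and 2 are both dropped.
subset_cols = toy.drop_duplicates(subset=["title", "url"], keep="first")

print(len(all_cols), len(subset_cols))
```

Restricting the comparison to title and URL would discard bob's post even though it is a separate submission, which is the situation the all-columns rule avoids.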

Posts without Comments

Our research questions involve the number of comments on each post. However, there are many posts with 0 comments.

To illustrate this, below is a frequency table of the number of comments on each post.

def freq_comments(df = hn):
    
    """Function to make a frequency table of the number of comments per post
    specifically for the Hacker News dataset."""

    freq_df = df["num_comments"].value_counts().reset_index()
    
    freq_df.columns = ["num_comments", "frequency"]
    
    freq_df = freq_df.sort_values(
        by = "num_comments",
    ).reset_index(
        drop = True,
    )

    return freq_df

freq_df1 = freq_comments()

freq_df1

	num_comments	frequency
0	0	212718
1	1	28055
2	2	9731
3	3	5016
4	4	3272
...	...	...
543	1007	1
544	1120	1
545	1448	1
546	1733	1
547	2531	1

548 rows × 2 columns

The table above shows that posts with 0 comments are most frequent.
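A frequency table of this kind can also be built in a single chain by sorting the counts by their index. The following is only a sketch on a made-up Series (the name `toy_counts` and its values are hypothetical); it is an alternative formulation, not the function used in this notebook.

```python
import pandas as pd

# Hypothetical comment counts for seven posts.
toy_counts = pd.Series([0, 0, 0, 1, 1, 2, 5], name="num_comments")

freq = (
    toy_counts.value_counts()          # frequency of each distinct value
    .sort_index()                      # order by number of comments
    .rename_axis("num_comments")       # name the index before resetting it
    .reset_index(name="frequency")     # two columns: num_comments, frequency
)

print(freq)
```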

Let us plot the table on a histogram.

def hist_comments(df, title):

    """Function to make a histogram of the number of comments per post
    specifically for the Hacker News dataset."""
    
    chart = alt.Chart(df).mark_bar().encode(
        x = alt.X(
            "num_comments:Q",
            title = "Number of Comments",
            bin = alt.Bin(step = 1)
        ),
        y = alt.Y(
            "frequency:Q",
            title = "Frequency",
        ),
    ).properties(
        title = title,
        width = 700,
        height = 400,
    )
    
    return chart
    
hist_comments(freq_df1, "Histogram of Number of Comments per Post")

There are so many posts with 0 comments that we cannot see the histogram bins for other numbers of comments.

Considering that the dataset is large and most rows have 0 comments, it would be best to drop all rows with 0 comments. This would make analysis less computationally expensive and allow us to answer our research questions.

with_comments = hn["num_comments"] > 0
hn = hn.loc[with_comments].reset_index(drop = True)

print(hn.shape)

(80401, 7)

Now, the dataset is left with only 80,401 rows. This will be easier to work with.
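The row counts reported above are consistent with the frequency table: removing the 212,718 zero-comment posts from 293,119 rows leaves 80,401. A quick check, using only figures already shown in this post:

```python
total_rows = 293_119         # rows before filtering (from hn.shape)
zero_comment_rows = 212_718  # frequency of num_comments == 0

remaining = total_rows - zero_comment_rows
share_dropped = zero_comment_rows / total_rows

print(remaining)
print(f"{share_dropped:.1%} of posts had no comments")
```

Since roughly 73% of posts are removed at this step, the later results describe only posts that received at least one comment.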

Below is the new histogram.

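# Pass the filtered frame explicitly: the function's default argument was bound
# when the function was defined, so it still refers to the unfiltered data.
freq_df2 = freq_comments(df = hn)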

hist_comments(freq_df2, "Histogram of Number of Comments per Post")

The distribution is still heavily right-skewed since many posts have very few comments. What’s important is that unnecessary data has been removed.

Missing Values

Finally, let us remove rows with missing values. In order to answer our research questions, we only need the following columns:

title
num_comments
created_at

Thus, we will delete rows with missing values in these columns.

hn.dropna(
    subset = ["title", "num_comments", "created_at"],
    inplace = True,
)

print(hn.shape)

(80401, 7)

The number of rows did not change from 80401. Therefore, no missing values were found in these columns, and no rows were dropped.
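A way to confirm this directly, rather than inferring it from the shape, is to count missing values per column before dropping. The sketch below uses a hypothetical toy frame (not the real data) with one missing title and one missing URL.

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame with one missing title and one missing url.
toy = pd.DataFrame({
    "title": ["Ask HN: A question", None, "Show HN: A project"],
    "url": [np.nan, "http://example.com", "http://example.org"],
    "num_comments": [3, 1, 2],
    "created_at": ["9/26/2016 2:53", "9/26/2016 1:17", "9/25/2016 22:48"],
})

needed = ["title", "num_comments", "created_at"]

# Count missing values per needed column before dropping anything.
missing_counts = toy[needed].isna().sum()
print(missing_counts)

# Only rows missing a needed column are removed; the missing url is ignored.
cleaned = toy.dropna(subset=needed)
print(cleaned.shape)
```

Leaving `url` out of the subset matters here because Ask posts have no URL, so dropping on that column would remove them.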

Data cleaning is now done.

Filtering Posts

As mentioned earlier, the first research question involves comparing Ask posts to Show posts. In order to do this, we have to group the posts into three types:

Ask Posts
Show Posts
Other Posts

Other posts are usually posts that share a tech article found online.

Ask and Show posts can be identified using the start of the post title. Ask posts start with “Ask HN:”.

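# Collect the index labels of rows whose title starts with the Ask prefix.
# Series.items() replaces iteritems(), which was removed in pandas 2.0.
ask_mask = [index
            for index, value in hn["title"].items()
            if value.startswith("Ask HN: ")
           ]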

hn.loc[ask_mask].head()

	id	title	url	num_points	num_comments	author	created_at
1	12578908	Ask HN: What TLD do you use for local developm...	NaN	4	7	Sevrene	9/26/2016 2:53
6	12578522	Ask HN: How do you pass on your work when you ...	NaN	6	3	PascLeRasc	9/26/2016 1:17
18	12577870	Ask HN: Why join a fund when you can be an angel?	NaN	1	3	anthony_james	9/25/2016 22:48
27	12577647	Ask HN: Someone uses stock trading as passive ...	NaN	5	2	00taffe	9/25/2016 21:50
41	12576946	Ask HN: How hard would it be to make a cheap, ...	NaN	2	1	hkt	9/25/2016 19:30

On the other hand, Show posts start with “Show HN:”.

# Same approach for Show posts; Series.items() works in pandas 2.0 and later.
show_mask = [index
             for index, value in hn["title"].items()
             if value.startswith("Show HN: ")
            ]

hn.loc[show_mask].head()

	id	title	url	num_points	num_comments	author	created_at
35	12577142	Show HN: Jumble Essays on the go #PaulInYourP...	https://itunes.apple.com/us/app/jumble-find-st...	1	1	ryderj	9/25/2016 20:06
43	12576813	Show HN: Learn Japanese Vocab via multiple cho...	http://japanese.vul.io/	1	1	soulchild37	9/25/2016 19:06
52	12576090	Show HN: Markov chain Twitter bot. Trained on ...	https://twitter.com/botsonasty	3	1	keepingscore	9/25/2016 16:50
68	12575471	Show HN: Project-Okot: Novel, CODE-FREE data-a...	https://studio.nuchwezi.com/	3	1	nfixx	9/25/2016 14:30
88	12574773	Show HN: Cursor that Screenshot	http://edward.codes/cursor-that-screenshot	3	3	ed-bit	9/25/2016 10:50

Other posts do not start with any special label.

Below, I create a new column “post_type” and assign the appropriate value to each row.

hn["post_type"] = "Other"

hn.loc[ask_mask, "post_type"] = "Ask"
hn.loc[show_mask, "post_type"] = "Show"

hn[["title", "post_type"]]

	title	post_type
0	Saving the Hassle of Shopping	Other
1	Ask HN: What TLD do you use for local developm...	Ask
2	Amazons Algorithms Dont Find You the Best Deals	Other
3	Emergency dose of epinephrine that does not co...	Other
4	Phone Makers Could Cut Off Drivers. So Why Don...	Other
...	...	...
80396	My Keyboard	Other
80397	Google's new logo was created by Russian desig...	Other
80398	Why we aren't tempted to use ACLs on our Unix ...	Other
80399	Ask HN: What is/are your favorite quote(s)?	Ask
80400	Dying vets fuck you letter (2013)	Other

80401 rows × 2 columns

Each row has now been labeled as a type of post.
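The same labeling can be expressed without an explicit loop by using vectorized string methods. This is a sketch on three hypothetical titles, assuming the same "Ask HN: " and "Show HN: " prefixes used above; `np.select` is an alternative to the list-of-indexes approach, not the method this notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical titles standing in for hn["title"].
titles = pd.Series([
    "Ask HN: What TLD do you use?",
    "Show HN: Cursor that Screenshot",
    "SQLAR the SQLite Archiver",
])

is_ask = titles.str.startswith("Ask HN: ")
is_show = titles.str.startswith("Show HN: ")

# np.select picks the label of the first condition that is True for each row.
post_type = np.select([is_ask, is_show], ["Ask", "Show"], default="Other")

print(list(post_type))
```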

Research Question 1: Comparing Ask and Show Posts

The first research question is, “Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?”

Note that the data is not normally distributed; it is right-skewed. For example, here is the distribution of the number of comments per Ask post.

ask_freq = freq_comments(
    df = hn.loc[hn["post_type"] == "Ask"]
)

ask_freq

	num_comments	frequency
0	1	1383
1	2	1238
2	3	762
3	4	592
4	5	373
...	...	...
203	898	1
204	910	1
205	937	1
206	947	1
207	1007	1

208 rows × 2 columns

hist_comments(
    ask_freq,
    "Histogram of Number of Comments per Ask Post"
)

The histogram is similar for Show posts.

show_freq = freq_comments(
    df = hn.loc[hn["post_type"] == "Show"],
)

show_freq

	num_comments	frequency
0	1	1738
1	2	814
2	3	504
3	4	300
4	5	196
...	...	...
137	250	1
138	257	1
139	280	1
140	298	1
141	306	1

142 rows × 2 columns

hist_comments(
    show_freq,
    "Histogram of Number of Comments per Show Post"
)

Therefore, the mean would not be a good measure of central tendency for the “average number of comments per post.” Thus, we will use the median instead.
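To see why the choice matters for right-skewed data, here is a small sketch with made-up comment counts (hypothetical values, chosen so that one post dominates):

```python
from statistics import mean, median

# Hypothetical comment counts: mostly small, with one very large value.
comments = [1, 1, 2, 2, 3, 4, 5, 1000]

print(mean(comments))    # pulled far upward by the single large value
print(median(comments))  # stays near the typical post
```

The mean comes out above 100 even though seven of the eight posts have five comments or fewer, while the median stays between 2 and 3.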

dct = {"Ask": None, "Show": None}

for key in dct:
    median = np.median(
        hn["num_comments"].loc[hn["post_type"] == key]
    )
    
    dct[key] = median

table = pd.DataFrame.from_dict(
    dct,
    orient = "index",
).reset_index(
).rename(columns = {
    "index": "post_type",
    0: "median_comments",
})

chart = alt.Chart(table).mark_bar().encode(
    y = alt.Y("post_type:N", title = "Post Type"),
    x = alt.X("median_comments:Q", title = "Median Number of Comments per Post"),
).properties(
    title = "Median Number of Comments for the Two Post Types",
)

chart

The bar graph shows that Ask posts have a higher median number of comments per post, compared to Show posts.
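As an aside, the per-type medians can also be computed with a single `groupby` instead of a dictionary loop. The sketch below runs on hypothetical toy data, so its numbers are not the notebook's results.

```python
import pandas as pd

# Hypothetical toy data standing in for hn[["post_type", "num_comments"]].
toy = pd.DataFrame({
    "post_type": ["Ask", "Ask", "Ask", "Show", "Show", "Other"],
    "num_comments": [2, 4, 100, 1, 3, 7],
})

# Keep only the two types being compared, then take the median per group.
medians = (
    toy[toy["post_type"].isin(["Ask", "Show"])]
    .groupby("post_type")["num_comments"]
    .median()
    .reset_index(name="median_comments")
)

print(medians)
```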

Important

The results suggest that Ask posts get more comments than Show posts. It may be easier for users to reach a larger audience via Ask posts.

Research Question 2: Active Times

The second research question is, “Can posting at a certain time of day result in getting more comments?”

For this part of the analysis, we will only be using Ask post data for simplicity.

We will divide the day into 24 one-hour periods, and then calculate the number of Ask posts created in each period and the number of comments those posts received.

String Template for Time

Before analyzing, we need to inspect the “created_at” column of the dataset.

hn_ask = hn.loc[
    hn["post_type"] == "Ask"
].reset_index(
    drop = True,
)

hn_ask[["created_at"]].head()

	created_at
0	9/26/2016 2:53
1	9/26/2016 1:17
2	9/25/2016 22:48
3	9/25/2016 21:50
4	9/25/2016 19:30

The strings in this column appear to use the following format:

month/day/year hour:minute

With the datetime module, the following is the equivalent formatting template.

template = "%m/%d/%Y %H:%M"
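As a quick sanity check of the template, the standard library's `datetime.strptime` can parse one of the strings shown above. This is a minimal sketch on a single value:

```python
from datetime import datetime

template = "%m/%d/%Y %H:%M"

# A timestamp string in the same form as the created_at column.
parsed = datetime.strptime("9/26/2016 2:53", template)

print(parsed)
print(parsed.hour)
```

Note that `%m` and `%H` accept values without a leading zero when parsing, which is why "9" and "2" are read correctly.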

Parsing Times

The time data can now be parsed and used for analysis.

hn_ask["created_at"] = pd.to_datetime(
    hn_ask["created_at"],
    format = template,
)

hn_ask["created_at"].head()

0   2016-09-26 02:53:00
1   2016-09-26 01:17:00
2   2016-09-25 22:48:00
3   2016-09-25 21:50:00
4   2016-09-25 19:30:00
Name: created_at, dtype: datetime64[ns]

The column is now in datetime format.

With this data, we will make 2 dictionaries. The hours_posts dictionary will count the number of posts at certain hours. The hours_comments dictionary will count the number of comments received by posts made at certain hours.

hours_posts = {}
hours_comments = {}

for index, row in hn_ask.iterrows():
    date_dt = row["created_at"]
    num_comments = row["num_comments"]
    
    # extract hour
    hour = date_dt.hour
    
    # update dictionaries
    hours_posts.setdefault(hour, 0)
    hours_posts[hour] += 1
    
    hours_comments.setdefault(hour, 0)
    hours_comments[hour] += num_comments

The hours were parsed and mapped to their respective counts of posts and comments.
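The same two tallies can also be obtained with a `groupby` and named aggregation, which avoids iterating over rows. The following is a sketch on hypothetical toy data standing in for `hn_ask`; it is an alternative to the loop above, shown for comparison only.

```python
import pandas as pd

# Hypothetical toy data standing in for hn_ask.
toy = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["9/26/2016 2:53", "9/26/2016 2:10", "9/25/2016 15:30"],
        format="%m/%d/%Y %H:%M",
    ),
    "num_comments": [7, 3, 10],
})

by_hour = (
    toy.assign(hour=toy["created_at"].dt.hour)
    .groupby("hour")["num_comments"]
    .agg(num_posts="count", num_comments="sum")  # one row per hour
    .reset_index()
)

print(by_hour)
```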

The code below transforms the dictionaries into DataFrames for ease of use.

def hour_to_df(dct, data_label):
    
    """Make a DataFrame from a dictionary that maps
    an 'hour' column to another column, named by `data_label`."""
    
    result = pd.DataFrame.from_dict(
        dct,
        orient = "index",
    ).reset_index(
    ).rename(columns = {
        "index": "hour",
        0: data_label,
    }).sort_values(
        by = "hour",
    ).reset_index(
        drop = True,
    )
    
    return result
    
hours_posts_df = hour_to_df(
    hours_posts,
    data_label = "num_posts",
)

hours_comments_df = hour_to_df(
    hours_comments,
    data_label = "num_comments",
)

hours_posts_df.head()

	hour	num_posts
0	0	228
1	1	222
2	2	227
3	3	210
4	4	184

hours_comments_df.head()

	hour	num_comments
0	0	2261
1	1	2068
2	2	2996
3	3	2152
4	4	2353

The hours have been parsed, and the tables have been generated.

Additionally, another DataFrame is created below. It calculates the median number of comments per post by the hour posted.

hn_ask["hour"] = hn_ask["created_at"].dt.hour

hours_median = hn_ask.pivot_table(
    index = "hour",
    values = "num_comments",
    aggfunc = "median",
).reset_index()

hours_median

	hour	num_comments
0	0	3.0
1	1	3.0
2	2	4.0
3	3	3.0
4	4	4.0
5	5	3.0
6	6	3.0
7	7	4.0
8	8	3.5
9	9	3.0
10	10	4.0
11	11	4.0
12	12	4.0
13	13	4.0
14	14	3.0
15	15	4.0
16	16	3.0
17	17	4.0
18	18	3.0
19	19	3.0
20	20	4.0
21	21	3.0
22	22	4.0
23	23	4.0

Number of Posts by Hour of the Day

Below is a table showing the total number of posts that were created, grouped by hour of the day.

hours_posts_df

	hour	num_posts
0	0	228
1	1	222
2	2	227
3	3	210
4	4	184
5	5	165
6	6	176
7	7	156
8	8	190
9	9	176
10	10	218
11	11	250
12	12	272
13	13	323
14	14	377
15	15	467
16	16	412
17	17	402
18	18	450
19	19	418
20	20	392
21	21	407
22	22	286
23	23	276

This table is in 24-hour time. Hour 13 refers to 1:00 PM. The table shows how many posts are made for every hour in the day.
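For readers who prefer 12-hour labels, the conversion can be sketched with a small helper. The function name `hour_label` is hypothetical and is not used elsewhere in this notebook.

```python
def hour_label(hour):
    """Convert an hour in 24-hour time (0-23) to a 12-hour clock label."""
    suffix = "AM" if hour < 12 else "PM"
    hour_12 = hour % 12 or 12  # 0 and 12 both display as 12
    return f"{hour_12}:00 {suffix}"

print(hour_label(0), hour_label(13), hour_label(15))
```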

Below is a line chart that shows this visually.

chart = alt.Chart(hours_posts_df).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_posts:Q", title = "Number of Posts"),
).properties(
    title = "Number of Posts by Hour of the Day",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart

The line chart shows that Hacker News users most actively make Ask posts between 15:00 and 18:00, or from 3:00 PM to 6:00 PM.

The most active hour for posting is 3:00 PM - 4:00 PM.

Number of Comments by Hour Posted

Next, a similar analysis is done for the total number of comments written by Hacker News users, grouped by the hour that the original posts were created. Below is the table for this data.

hours_comments_df

	hour	num_comments
0	0	2261
1	1	2068
2	2	2996
3	3	2152
4	4	2353
5	5	1838
6	6	1587
7	7	1584
8	8	2362
9	9	1477
10	10	3011
11	11	2794
12	12	4226
13	13	7219
14	14	4970
15	15	18525
16	16	4458
17	17	5536
18	18	4824
19	19	3949
20	20	4462
21	21	4500
22	22	3369
23	23	2297

Below is the line chart that visualizes the table.

chart = alt.Chart(hours_comments_df).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_comments:Q", title = "Number of Comments"),
).properties(
    title = "Number of Comments by Hour Posted",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart

The line chart shows that many comments are made on posts created from 13:00 to 17:00, or 1:00 PM - 5:00 PM.

Notably, there was a total of over 18,000 comments on posts created at 3:00 PM. This may suggest that posts created at around this time attract the most comments.

However, there is also the possibility that this total was influenced by a few outliers. Let’s check the distribution of the number of comments per post for posts created at 3:00 PM.

alt.Chart(
    hn_ask.loc[hn_ask["hour"] == 15]
).mark_bar().encode(
    x = alt.X("num_comments:Q"),
    y = alt.Y("count()")
).properties(
    title = "Distribution of Number of Comments per Post (3:00 PM)",
    width = 700,
    height = 200,
).configure_axis(
    grid = True,
)

Indeed, there are several outlier posts with a very high number of comments, going up to 1000. These influenced the spike in the line chart.

Therefore, we can’t say that a post will definitely get many comments if it is posted at 3:00 PM. However, we can say that posts created in the afternoon, from 1:00 PM to 5:00 PM, collectively receive a lot of comments.
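One way to quantify how much a few posts can drive an hourly total is to compute the share of comments contributed by the largest posts. The sketch below uses made-up counts (hypothetical, not the actual 3:00 PM data) just to show the calculation.

```python
# Hypothetical comment counts for posts created in one hour.
comments = [1, 2, 2, 3, 4, 5, 6, 900, 1000]

total = sum(comments)
top_two = sum(sorted(comments, reverse=True)[:2])

# Fraction of the hour's comments that come from just the two largest posts.
top_share = top_two / total
print(f"{top_share:.1%}")
```

In this toy case two posts account for nearly all of the comments, which is the kind of pattern that would inflate an hourly total without saying much about a typical post.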

Median Number of Comments per Post, by Hour Posted

The table below shows the median number of comments per post, by the hour of posting.

hours_median

	hour	num_comments
0	0	3.0
1	1	3.0
2	2	4.0
3	3	3.0
4	4	4.0
5	5	3.0
6	6	3.0
7	7	4.0
8	8	3.5
9	9	3.0
10	10	4.0
11	11	4.0
12	12	4.0
13	13	4.0
14	14	3.0
15	15	4.0
16	16	3.0
17	17	4.0
18	18	3.0
19	19	3.0
20	20	4.0
21	21	3.0
22	22	4.0
23	23	4.0

This is visualized in the line chart below, which looks quite different from the previous two charts.

chart = alt.Chart(hours_median).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_comments:Q", title = "Median Number of Comments per Post"),
).properties(
    title = "Median Number of Comments per Post, by Hour Posted",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart

This graph shows that the median number of comments per post is very consistent throughout the day. It ranges from 3 comments to 4 comments.
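For contrast, the mean number of comments per post can be derived from the two hourly tables shown earlier by dividing total comments by total posts. The sketch below does this for three hours, with the totals copied from those tables.

```python
# Totals copied from the tables above for three hours: (posts, comments).
hour_totals = {
    9: (176, 1477),
    15: (467, 18525),
    18: (450, 4824),
}

# Mean comments per post for each hour.
mean_per_post = {
    hour: comments / posts
    for hour, (posts, comments) in hour_totals.items()
}

for hour, value in mean_per_post.items():
    print(hour, round(value, 1))
```

The means vary far more across hours than the medians do, which is consistent with a small number of heavily commented posts driving the hourly totals.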

This raises a new question. We’ve seen that Hacker News users most actively post in the afternoon, and that afternoon posts collect the most comments in total. So, why does the median number of comments per post not increase in the afternoon?

A possible explanation is that since the site is oversaturated with new posts in the afternoon, only the very best posts receive attention. The rest are lost in the flood of new posts.

Conclusion

In this project, we analyzed data about Hacker News posts, specifically regarding the number of comments that they receive. Below are the research questions, and the best answers that we could come up with from our analysis.

Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?

Ask posts receive a higher median number of comments per post, compared to Show posts. In order to reach a wider audience or converse with more users, it is better to make an Ask post.

Can posting at a certain time of day result in getting more comments?

Hacker News users are very active in the afternoon, from 1:00 PM to 6:00 PM. However, if you post in the afternoon, your post may get lost in the flood of new posts. Hypothetically, it may be better to post in the morning, like at 7:00 AM, so that some people can notice your post. Then, your post can get more attention in the afternoon.

Thanks for reading!