Optimizing Hacker News Posts

Hacker News is a popular tech posting site. I used data analysis in order to determine the best practices for optimizing the number of comments on your post.
python
pandas
altair
Author

Migs Germar

Published

May 11, 2021

Unsplash | Clint Patterson

Overview

Hacker News is a popular website about technology. Specifically, it is a community-driven aggregator of tech content, where users decide what rises to the top. Users can:

  • Submit tech articles that the user found online.
  • Submit “Ask” posts to ask the community a question.
  • Submit “Show” posts to show the community something that the user made.
  • Vote and comment on other people’s posts.

In this project, we will analyze and compare Ask posts and Show posts in order to answer the following questions:

  • Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?
  • Can posting at a certain time of day result in getting more comments?

This analysis can be helpful for Hacker News users who would like for their posts to reach a larger audience on the platform.

Note

I wrote this notebook for the Dataquest course’s Guided Project: Exploring Hacker News Posts. The research questions and general project flow came from Dataquest. However, all of the text and code here are written by me unless stated otherwise.

Package Imports

import pandas as pd
import numpy as np
import altair as alt
import datetime as dt

Dataset

The dataset for this project is the Hacker News Posts dataset on Kaggle, uploaded by Hacker News.

The following is quoted from the dataset’s Description.

This data set is Hacker News posts from the last 12 months (up to September 26 2016).

It includes the following columns:

  • title: title of the post (self explanatory)
  • url: the url of the item being linked to
  • num_points: the number of upvotes the post received
  • num_comments: the number of comments the post received
  • author: the name of the account that made the post
  • created_at: the date and time the post was made (the time zone is Eastern Time in the US)

Let us view the first 5 rows of the dataset below.

hn = pd.read_csv("./private/2021-05-11-OHNP-Files/HN_posts_year_to_Sep_26_2016.csv")
hn.head()
id title url num_points num_comments author created_at
0 12579008 You have two days to comment if you want stem ... http://www.regulations.gov/document?D=FDA-2015... 1 0 altstar 9/26/2016 3:26
1 12579005 SQLAR the SQLite Archiver https://www.sqlite.org/sqlar/doc/trunk/README.md 1 0 blacksqr 9/26/2016 3:24
2 12578997 What if we just printed a flatscreen televisio... https://medium.com/vanmoof/our-secrets-out-f21... 1 0 pavel_lishin 9/26/2016 3:19
3 12578989 algorithmic music http://cacm.acm.org/magazines/2011/7/109891-al... 1 0 poindontcare 9/26/2016 3:16
4 12578979 How the Data Vault Enables the Next-Gen Data W... https://www.talend.com/blog/2016/05/12/talend-... 1 0 markgainor1 9/26/2016 3:14

Below is the shape of the dataset.

print(hn.shape)
(293119, 7)

There are 293,119 rows and 7 columns in the dataset.

Before the data can be analyzed, it must first be cleaned.

Data Cleaning

Duplicate Rows

Below, I use pandas to delete duplicate rows except for the first instance of each duplicate.

Rows will be considered as duplicates if they are exactly alike in all features. I decided on this because it is possible for two posts to have the same title and/or url but be posted at different times or by different users. Thus, we cannot identify duplicates based on one or two features alone.

hn = hn.drop_duplicates(keep = "first")
print(hn.shape)
(293119, 7)

No duplicates were found. All rows were kept.
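To illustrate why full-row comparison was chosen, here is a minimal sketch on a made-up table (the rows and the names `toy`, `full_row`, and `title_only` are hypothetical, not taken from the dataset). It contrasts comparing all columns with comparing only the title.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact copies,
# row 2 shares the title but has a different author and time.
toy = pd.DataFrame({
    "title": ["Post A", "Post A", "Post A", "Post B"],
    "author": ["ann", "ann", "bob", "cat"],
    "created_at": [
        "9/26/2016 3:26", "9/26/2016 3:26",
        "9/25/2016 1:00", "9/24/2016 8:15",
    ],
})

# Full-row comparison: only the exact copy (row 1) is dropped.
full_row = toy.drop_duplicates(keep="first")

# Title-only comparison: row 2 is dropped too, even though it was
# posted by a different user at a different time.
title_only = toy.drop_duplicates(subset=["title"], keep="first")

print(len(full_row), len(title_only))
```

Comparing on the title alone would discard a legitimate repost, which is the situation the full-row rule is meant to avoid.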

Posts without Comments

Our research questions involve the number of comments on each post. However, there are many posts with 0 comments.

To illustrate this, below is a frequency table of the number of comments on each post.

def freq_comments(df = hn):
    
    """Function to make a frequency table of the number of comments per post
    specifically for the Hacker News dataset."""

    freq_df = df["num_comments"].value_counts().reset_index()
    
    freq_df.columns = ["num_comments", "frequency"]
    
    freq_df = freq_df.sort_values(
        by = "num_comments",
    ).reset_index(
        drop = True,
    )

    return freq_df

freq_df1 = freq_comments()

freq_df1
num_comments frequency
0 0 212718
1 1 28055
2 2 9731
3 3 5016
4 4 3272
... ... ...
543 1007 1
544 1120 1
545 1448 1
546 1733 1
547 2531 1

548 rows × 2 columns

The table above shows that posts with 0 comments are most frequent.
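As a side note, the same kind of frequency table can probably be built a little more compactly by chaining pandas methods. The sketch below uses a small made-up Series (`comments` is hypothetical) rather than the real column.

```python
import pandas as pd

# Hypothetical comment counts for six posts.
comments = pd.Series([0, 0, 0, 1, 1, 4], name="num_comments")

freq = (
    comments.value_counts()          # count how often each value occurs
    .sort_index()                    # order by number of comments
    .rename_axis("num_comments")     # name the index before resetting it
    .reset_index(name="frequency")   # turn it into a two-column DataFrame
)

print(freq)
```

On the real data, the same chain would be applied to hn["num_comments"].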

Let us plot the table on a histogram.

def hist_comments(df, title):

    """Function to make a histogram of the number of comments per post
    specifically for the Hacker News dataset."""
    
    chart = alt.Chart(df).mark_bar().encode(
        x = alt.X(
            "num_comments:Q",
            title = "Number of Comments",
            bin = alt.Bin(step = 1)
        ),
        y = alt.Y(
            "frequency:Q",
            title = "Frequency",
        ),
    ).properties(
        title = title,
        width = 700,
        height = 400,
    )
    
    return chart
    
hist_comments(freq_df1, "Histogram of Number of Comments per Post")

There are so many posts with 0 comments that we cannot see the histogram bins for other numbers of comments.

Considering that the dataset is large and most rows have 0 comments, it would be best to drop all rows with 0 comments. This would make analysis less computationally expensive and allow us to answer our research questions.

with_comments = hn["num_comments"] > 0
hn = hn.loc[with_comments].reset_index(drop = True)

print(hn.shape)
(80401, 7)

Now, the dataset is left with only 80,401 rows. This will be easier to work with.

Below is the new histogram.

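# Pass hn explicitly. The default argument of freq_comments was bound
# when the function was defined, so it still points to the unfiltered data.
freq_df2 = freq_comments(df = hn)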

hist_comments(freq_df2, "Histogram of Number of Comments per Post")

The distribution is still heavily right-skewed since many posts have very few comments. What’s important is that unnecessary data has been removed.

Missing Values

Finally, let us remove rows with missing values. In order to answer our research questions, we only need the following columns:

  • title
  • num_comments
  • created_at

Thus, we will delete rows with missing values in any of these columns.

hn.dropna(
    subset = ["title", "num_comments", "created_at"],
    inplace = True,
)

print(hn.shape)
(80401, 7)

The number of rows did not change from 80401. Therefore, no missing values were found in these columns, and no rows were dropped.
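Comparing row counts works, but missing values can also be counted directly. Here is a minimal sketch on a made-up table (`toy`, `needed`, `missing_per_column`, and `cleaned` are hypothetical names) that counts missing values per column before dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values.
toy = pd.DataFrame({
    "title": ["Ask HN: example", None, "Show HN: example"],
    "url": [np.nan, "http://example.com", np.nan],
    "num_comments": [3, 1, 2],
    "created_at": ["9/26/2016 2:53", "9/26/2016 1:17", None],
})

needed = ["title", "num_comments", "created_at"]

# Number of missing values in each column that the analysis needs.
missing_per_column = toy[needed].isna().sum()

# Drop rows that are missing any needed column.
# Missing values in "url" are ignored because it is not in `needed`.
cleaned = toy.dropna(subset=needed)

print(missing_per_column)
print(len(cleaned))
```

This makes it visible which column is responsible when rows do get dropped.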

Data cleaning is now done.

Filtering Posts

As mentioned earlier, the first research question involves comparing Ask posts to Show posts. In order to do this, we have to group the posts into three types:

  • Ask Posts
  • Show Posts
  • Other Posts

Other posts are usually posts that share a tech article found online.

Ask and Show posts can be identified using the start of the post title. Ask posts start with “Ask HN:”.

# Series.iteritems() was removed in pandas 2.0; items() is the equivalent.
ask_mask = [index
            for index, value in hn["title"].items()
            if value.startswith("Ask HN: ")
           ]

hn.loc[ask_mask].head()
id title url num_points num_comments author created_at
1 12578908 Ask HN: What TLD do you use for local developm... NaN 4 7 Sevrene 9/26/2016 2:53
6 12578522 Ask HN: How do you pass on your work when you ... NaN 6 3 PascLeRasc 9/26/2016 1:17
18 12577870 Ask HN: Why join a fund when you can be an angel? NaN 1 3 anthony_james 9/25/2016 22:48
27 12577647 Ask HN: Someone uses stock trading as passive ... NaN 5 2 00taffe 9/25/2016 21:50
41 12576946 Ask HN: How hard would it be to make a cheap, ... NaN 2 1 hkt 9/25/2016 19:30

On the other hand, Show posts start with “Show HN:”.

# Series.iteritems() was removed in pandas 2.0; items() is the equivalent.
show_mask = [index
            for index, value in hn["title"].items()
            if value.startswith("Show HN: ")
           ]

hn.loc[show_mask].head()
id title url num_points num_comments author created_at
35 12577142 Show HN: Jumble Essays on the go #PaulInYourP... https://itunes.apple.com/us/app/jumble-find-st... 1 1 ryderj 9/25/2016 20:06
43 12576813 Show HN: Learn Japanese Vocab via multiple cho... http://japanese.vul.io/ 1 1 soulchild37 9/25/2016 19:06
52 12576090 Show HN: Markov chain Twitter bot. Trained on ... https://twitter.com/botsonasty 3 1 keepingscore 9/25/2016 16:50
68 12575471 Show HN: Project-Okot: Novel, CODE-FREE data-a... https://studio.nuchwezi.com/ 3 1 nfixx 9/25/2016 14:30
88 12574773 Show HN: Cursor that Screenshot http://edward.codes/cursor-that-screenshot 3 3 ed-bit 9/25/2016 10:50

Other posts do not start with any special label.

Below, I create a new column “post_type” and assign the appropriate value to each row.

hn["post_type"] = "Other"

hn.loc[ask_mask, "post_type"] = "Ask"
hn.loc[show_mask, "post_type"] = "Show"

hn[["title", "post_type"]]
title post_type
0 Saving the Hassle of Shopping Other
1 Ask HN: What TLD do you use for local developm... Ask
2 Amazons Algorithms Dont Find You the Best Deals Other
3 Emergency dose of epinephrine that does not co... Other
4 Phone Makers Could Cut Off Drivers. So Why Don... Other
... ... ...
80396 My Keyboard Other
80397 Google's new logo was created by Russian desig... Other
80398 Why we aren't tempted to use ACLs on our Unix ... Other
80399 Ask HN: What is/are your favorite quote(s)? Ask
80400 Dying vets fuck you letter (2013) Other

80401 rows × 2 columns

Each row has now been labeled as a type of post.
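The same labeling could likely be done without list comprehensions, using vectorized string methods. The sketch below is one possible alternative on made-up titles (`toy`, `is_ask`, and `is_show` are hypothetical names), not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical titles covering each case.
toy = pd.DataFrame({"title": [
    "Ask HN: What editor do you use?",
    "Show HN: My weekend project",
    "A blog post about databases",
    "Ask HN without the colon",
]})

# Boolean Series: True where the title starts with the given prefix.
is_ask = toy["title"].str.startswith("Ask HN: ")
is_show = toy["title"].str.startswith("Show HN: ")

# Pick "Ask" or "Show" where a condition holds, otherwise "Other".
toy["post_type"] = np.select(
    [is_ask, is_show],
    ["Ask", "Show"],
    default="Other",
)

print(toy)
```

Note that the last title is labeled "Other" because the prefix check includes the colon and the trailing space.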

Research Question 1: Comparing Ask and Show Posts

The first research question is, “Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?”

Note that the data is not normally distributed; it is right-skewed. For example, here is the distribution of the number of comments per Ask post.

ask_freq = freq_comments(
    df = hn.loc[hn["post_type"] == "Ask"]
)

ask_freq
num_comments frequency
0 1 1383
1 2 1238
2 3 762
3 4 592
4 5 373
... ... ...
203 898 1
204 910 1
205 937 1
206 947 1
207 1007 1

208 rows × 2 columns

hist_comments(
    ask_freq,
    "Histogram of Number of Comments per Ask Post"
)

The histogram is similar for Show posts.

show_freq = freq_comments(
    df = hn.loc[hn["post_type"] == "Show"],
)

show_freq
num_comments frequency
0 1 1738
1 2 814
2 3 504
3 4 300
4 5 196
... ... ...
137 250 1
138 257 1
139 280 1
140 298 1
141 306 1

142 rows × 2 columns

hist_comments(
    show_freq,
    "Histogram of Number of Comments per Show Post"
)

Therefore, the mean would not be a good measure of central tendency for the “average number of comments per post.” Thus, we will use the median instead.
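To make the reasoning concrete, here is a small sketch with made-up numbers (the list `comments` is hypothetical). It shows how a single very popular post pulls the mean far away from what a typical post receives, while the median stays put.

```python
from statistics import mean, median

# Hypothetical comment counts: eight ordinary posts and one viral post.
comments = [1, 1, 2, 2, 3, 3, 4, 5, 900]

mean_comments = mean(comments)      # pulled upward by the viral post
median_comments = median(comments)  # the middle value of the sorted list

print(mean_comments)
print(median_comments)
```

Here the mean is above 100 even though eight of the nine posts have 5 comments or fewer.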

dct = {"Ask": None, "Show": None}

for key in dct:
    median = np.median(
        hn["num_comments"].loc[hn["post_type"] == key]
    )
    
    dct[key] = median

table = pd.DataFrame.from_dict(
    dct,
    orient = "index",
).reset_index(
).rename(columns = {
    "index": "post_type",
    0: "median_comments",
})

chart = alt.Chart(table).mark_bar().encode(
    y = alt.Y("post_type:N", title = "Post Type"),
    x = alt.X("median_comments:Q", title = "Median Number of Comments per Post"),
).properties(
    title = "Median Number of Comments for the Two Post Types",
)

chart

The bar graph shows that Ask posts have a higher median number of comments per post, compared to Show posts.
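For reference, the medians can also be computed with a groupby instead of a loop over a dictionary. This is a sketch on made-up data (`toy` and `median_table` are hypothetical names), so the numbers are not the real results.

```python
import pandas as pd

# Hypothetical posts with their type and number of comments.
toy = pd.DataFrame({
    "post_type": ["Ask", "Ask", "Ask", "Show", "Show", "Other"],
    "num_comments": [2, 4, 30, 1, 3, 7],
})

median_table = (
    toy.loc[toy["post_type"].isin(["Ask", "Show"])]  # keep the two types
    .groupby("post_type")["num_comments"]
    .median()                                        # median per post type
    .reset_index(name="median_comments")
)

print(median_table)
```

The resulting table has the same two columns, post_type and median_comments, that the bar chart expects.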

Important

The results suggest that Ask posts get more comments than Show posts. It may be easier for users to reach a larger audience via Ask posts.

Research Question 2: Active Times

The second research question is, “Can posting at a certain time of day result in getting more comments?”

For this part of the analysis, we will only be using Ask post data for simplicity.

We will divide the day into 24 one-hour periods, and then calculate the number of Ask posts created in each period.

String Template for Time

Before analyzing, we need to inspect the “created_at” column of the dataset.

hn_ask = hn.loc[
    hn["post_type"] == "Ask"
].reset_index(
    drop = True,
)

hn_ask[["created_at"]].head()
created_at
0 9/26/2016 2:53
1 9/26/2016 1:17
2 9/25/2016 22:48
3 9/25/2016 21:50
4 9/25/2016 19:30

The strings in this column appear to use the following format:

month/day/year hour:minute

With the datetime module, the following is the equivalent formatting template.

template = "%m/%d/%Y %H:%M"
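As a quick sanity check, the template can be tried on a single string with the standard library before parsing the whole column. The sample strings below are illustrative; the first follows the format shown above and the second deliberately swaps day and month.

```python
import datetime as dt

template = "%m/%d/%Y %H:%M"

# A string in the same format as the created_at column.
parsed = dt.datetime.strptime("9/26/2016 2:53", template)
print(parsed)
print(parsed.hour)

# A string that does not match the template (day first) raises ValueError.
try:
    dt.datetime.strptime("26/9/2016 2:53", template)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)
```

If the template were wrong for this dataset, the same kind of error would surface when parsing the full column.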

Parsing Times

The time data can now be parsed and used for analysis.

hn_ask["created_at"] = pd.to_datetime(
    hn_ask["created_at"],
    format = template,
)

hn_ask["created_at"].head()
0   2016-09-26 02:53:00
1   2016-09-26 01:17:00
2   2016-09-25 22:48:00
3   2016-09-25 21:50:00
4   2016-09-25 19:30:00
Name: created_at, dtype: datetime64[ns]

The column is now in datetime format.

With this data, we will make 2 dictionaries. The hours_posts dictionary will count the number of posts created in each hour. The hours_comments dictionary will total the number of comments received by posts created in each hour.

hours_posts = {}
hours_comments = {}

for index, row in hn_ask.iterrows():
    date_dt = row["created_at"]
    num_comments = row["num_comments"]
    
    # extract hour
    hour = date_dt.hour
    
    # update dictionaries
    hours_posts.setdefault(hour, 0)
    hours_posts[hour] += 1
    
    hours_comments.setdefault(hour, 0)
    hours_comments[hour] += num_comments

The hours were parsed and mapped to their respective counts of posts and comments.

The code below transforms the dictionaries into DataFrames for ease of use.

def hour_to_df(dct, data_label):
    
    """Make a DataFrame from a dictionary that maps
    an 'hour' column to another column, named by `data_label`."""
    
    result = pd.DataFrame.from_dict(
        dct,
        orient = "index",
    ).reset_index(
    ).rename(columns = {
        "index": "hour",
        0: data_label,
    }).sort_values(
        by = "hour",
    ).reset_index(
        drop = True,
    )
    
    return result
    
hours_posts_df = hour_to_df(
    hours_posts,
    data_label = "num_posts",
)

hours_comments_df = hour_to_df(
    hours_comments,
    data_label = "num_comments",
)

hours_posts_df.head()
hour num_posts
0 0 228
1 1 222
2 2 227
3 3 210
4 4 184
hours_comments_df.head()
hour num_comments
0 0 2261
1 1 2068
2 2 2996
3 3 2152
4 4 2353

The hours have been parsed, and the tables have been generated.
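The same two tables could probably be produced in one step with a groupby, without the explicit loop and dictionaries. Below is a minimal sketch on made-up rows (`toy` and `by_hour` are hypothetical names); it is an alternative, not the code used above.

```python
import pandas as pd

# Hypothetical Ask posts: two created at hour 2, one at hour 15.
toy = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["9/26/2016 2:53", "9/26/2016 2:10", "9/25/2016 15:30"],
        format="%m/%d/%Y %H:%M",
    ),
    "num_comments": [7, 3, 10],
})

by_hour = (
    toy.groupby(toy["created_at"].dt.hour)["num_comments"]
    .agg(num_posts="count", num_comments="sum")  # posts and comments per hour
    .rename_axis("hour")
    .reset_index()
)

print(by_hour)
```

Each row of the result holds an hour together with its post count and comment total, which is the information stored in the two dictionaries.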

Additionally, another DataFrame is created below. It calculates the median number of comments per post by the hour posted.

hn_ask["hour"] = hn_ask["created_at"].dt.hour

hours_median = hn_ask.pivot_table(
    index = "hour",
    values = "num_comments",
    aggfunc = "median",
).reset_index()

hours_median
hour num_comments
0 0 3.0
1 1 3.0
2 2 4.0
3 3 3.0
4 4 4.0
5 5 3.0
6 6 3.0
7 7 4.0
8 8 3.5
9 9 3.0
10 10 4.0
11 11 4.0
12 12 4.0
13 13 4.0
14 14 3.0
15 15 4.0
16 16 3.0
17 17 4.0
18 18 3.0
19 19 3.0
20 20 4.0
21 21 3.0
22 22 4.0
23 23 4.0

Number of Posts by Hour of the Day

Below is a table showing the total number of Ask posts that were created, grouped by hour of the day.

hours_posts_df
hour num_posts
0 0 228
1 1 222
2 2 227
3 3 210
4 4 184
5 5 165
6 6 176
7 7 156
8 8 190
9 9 176
10 10 218
11 11 250
12 12 272
13 13 323
14 14 377
15 15 467
16 16 412
17 17 402
18 18 450
19 19 418
20 20 392
21 21 407
22 22 286
23 23 276

This table is in 24-hour time. Hour 13 refers to 1:00 PM. The table shows how many posts are made for every hour in the day.

Below is a line chart that shows this visually.

chart = alt.Chart(hours_posts_df).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_posts:Q", title = "Number of Posts"),
).properties(
    title = "Number of Posts by Hour of the Day",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart

The line chart shows that Hacker News users create the most Ask posts in the hours from 15:00 to 18:00, or from 3:00 PM to 6:00 PM.

The most active hour for posting is 3:00 PM - 4:00 PM.

Number of Comments by Hour Posted

Next, a similar analysis is done for the total number of comments written by Hacker News users, grouped by the hour that the original posts were created. Below is the table for this data.

hours_comments_df
hour num_comments
0 0 2261
1 1 2068
2 2 2996
3 3 2152
4 4 2353
5 5 1838
6 6 1587
7 7 1584
8 8 2362
9 9 1477
10 10 3011
11 11 2794
12 12 4226
13 13 7219
14 14 4970
15 15 18525
16 16 4458
17 17 5536
18 18 4824
19 19 3949
20 20 4462
21 21 4500
22 22 3369
23 23 2297

Below is the line chart that visualizes the table.

chart = alt.Chart(hours_comments_df).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_comments:Q", title = "Number of Comments"),
).properties(
    title = "Number of Comments by Hour Posted",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart

The line chart shows that many comments are made on posts created from 13:00 to 17:00, or 1:00 PM - 5:00 PM.

Notably, there was a total of over 18,000 comments on posts created at 3:00 PM. This may suggest that Hacker News users most actively comment at around this time.

However, there is also the possibility that this total was influenced by a few outliers. Let’s check the distribution of the number of comments per post, for posts created at 3:00 PM.

alt.Chart(
    hn_ask.loc[hn_ask["hour"] == 15]
).mark_bar().encode(
    x = alt.X("num_comments:Q"),
    y = alt.Y("count()")
).properties(
    title = "Distribution of Number of Comments per Post (3:00 PM)",
    width = 700,
    height = 200,
).configure_axis(
    grid = True,
)

Indeed, there are several outlier posts with a very high number of comments, going up to 1000. These influenced the spike in the line chart.

Therefore, we can’t say that a post will definitely get many comments if it is posted at 3:00 PM. However, we can say that generally a lot of comments are made in the afternoon from 1:00 PM to 5:00 PM.
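One way to quantify how much a few outliers drive an hourly total is to compute the share of comments that comes from the most-commented post. The sketch below uses made-up numbers (`comments_at_3pm` is hypothetical and much smaller than the real data).

```python
import pandas as pd

# Hypothetical comment counts for posts created in one hour,
# including a single outlier.
comments_at_3pm = pd.Series([2, 3, 3, 4, 5, 6, 1000])

total = comments_at_3pm.sum()

# Fraction of the hour's comments that belong to the top post.
top_share = comments_at_3pm.nlargest(1).sum() / total

# The median is unaffected by how large the outlier is.
typical = comments_at_3pm.median()

print(total, round(top_share, 3), typical)
```

In this toy case almost all of the total comes from one post, while the median stays at 4 comments. A large total for an hour therefore says little about what a typical post created in that hour receives.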

Median Number of Comments per Post, by Hour Posted

The table below shows the median number of comments per post, by the hour of posting.

hours_median
hour num_comments
0 0 3.0
1 1 3.0
2 2 4.0
3 3 3.0
4 4 4.0
5 5 3.0
6 6 3.0
7 7 4.0
8 8 3.5
9 9 3.0
10 10 4.0
11 11 4.0
12 12 4.0
13 13 4.0
14 14 3.0
15 15 4.0
16 16 3.0
17 17 4.0
18 18 3.0
19 19 3.0
20 20 4.0
21 21 3.0
22 22 4.0
23 23 4.0

This is visualized in the line chart below, which looks quite different from the previous two charts.

chart = alt.Chart(hours_median).mark_line().encode(
    x = alt.X(
        "hour:O", title = "Hour of the Day",
        axis = alt.Axis(labelAngle = 0)
    ),
    y = alt.Y("num_comments:Q", title = "Median Number of Comments per Post"),
).properties(
    title = "Median Number of Comments per Post, by Hour Posted",
    width = 700,
    height = 400,
).configure_axis(
    grid = True,
)

chart

This graph shows that the median number of comments per post is very consistent throughout the day. It ranges from 3 comments to 4 comments.

This raises a new question. We’ve seen that Hacker News users most actively post and comment in the afternoon. So, why does the median number of comments per post not increase in the afternoon?

A possible explanation is that since the site is oversaturated with new posts in the afternoon, only the very best posts receive attention. The rest are lost in the flood of new posts.

Conclusion

In this project, we analyzed data about Hacker News posts, specifically regarding the number of comments that they receive. Below are the research questions, and the best answers that we could come up with from our analysis.


  • Between Ask posts and Show posts, which type receives more comments in terms of average number of comments per post?

Ask posts receive a higher median number of comments per post, compared to Show posts. In order to reach a wider audience or converse with more users, it is better to make an Ask post.


  • Can posting at a certain time of day result in getting more comments?

Hacker News users are very active in the afternoon, from 1:00 PM to 6:00 PM (US Eastern Time, the time zone of the dataset). However, if you post in the afternoon, your post may get lost in the flood of new posts. Hypothetically, it may be better to post in the morning, like at 7:00 AM, so that some people can notice your post. Then, your post can get more attention in the afternoon.


Thanks for reading!