Introduction

The Multinomial Naive Bayes Algorithm is a machine learning algorithm based on the Bayes Theorem. It calculates the probability that an event $B$ occurred given that event $A$ occurred. Thus, it is usually used in classification problems. (Vadapalli, 2021)

In this project, we will use the algorithm to determine the probability that a message is spam given its contents. We will then use this probability to decide whether to treat new messages as spam or not. For example, if the probability of being spam is over 50%, then we may treat the message as spam.

Identifying spam is important in the Philippines because phishing campaigns went up by 200% after the pandemic began (Devanesan, 2020), and a telecommunications provider recently had to block around 71 million spam messages (Yap, 2021). Such messages may attempt to steal personal information, steal money from an account, or install malware (FTC, 2020). Thus, machine learning can be a very helpful tool in preventing such harm from occurring.

Though the algorithm can be easily implemented using existing functions such as those in the scikit-learn package, I will manually code the algorithm step-by-step in order to explain the mathematical intuition behind it.

Note: I wrote this notebook by following a guided project on the Dataquest platform, specifically the Guided Project: Building a Spam Filter with Naive Bayes The general project flow came from Dataquest. The mathematical explanations are also based on what I learned from Dataquest.

Packages

Below are the packages necessary for this project.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


The Dataset

We will use the SMS Spam Collection Dataset by Almeida and Hidalgo in 2012. It can be downloaded from the UCI Machine Learning Repository.

sms_df = pd.read_csv(
"./private/Naive-Bayes-Files/SMSSpamCollection",
# Tab-separated
sep = "\t",
names = ["label", "sms"]
)

sms_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   label   5572 non-null   object
1   sms     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


The dataset has 2 columns and 5572 rows.

• The label column contains "ham" if the message is legitimate, or "spam" if it is spam.
• The sms column contains individual SMS messages.

For example, below are the first 5 rows of the dataset.

sms_df.head()

label sms
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

Training and Testing Sets

The messages will be split into two sets. The training set, comprising 80% of the total data, will be used to train the Naive Bayes Algorithm. The testing set, with 20% of the total data, will be used to test the model's accuracy.

First, however, let us calculate what percentage of the messages in the dataset are spam.

spam_perc = sms_df["label"].eq("spam").sum() / sms_df.shape[0] * 100
print(f"Percentage of spam messages: {spam_perc:.2f}%")

Percentage of spam messages: 13.41%


Only 13% of the messages are spam. Therefore, spam and non-spam messages are not equally represented in this dataset, and this may be problematic. However, this is all the data we have, so the best we can do is to ensure that both the training and testing sets have around 13% of their messages as spam.

This is an example of proportionate stratified sampling (Thomas, 2020). We first separate the data into two strata (spam and non-spam). We then take 80% of the messages from each strata as the training set. The remaining 20% of each strata is set aside for the testing set. Thus, each of the two sets will contain around 13% spam.

This has been done with the code below.

# Note: I could have used train_test_split from sklearn, but I coded this manually for the sake of grasping the logic.
split_lists = {
"training": [],
"testing": [],
}

# Stratify the dataset
for label in "spam", "ham":
stratum = sms_df.loc[sms_df["label"] == label]

train_part = stratum.sample(
# Sample 80% of the data points
frac = 0.8,
random_state = 1,
)

# The other 20% that were not sampled go to the testing set.
test_part = stratum.loc[~stratum.index.isin(train_part.index)]

split_lists["training"].append(train_part)
split_lists["testing"].append(test_part)

split_dfs = pd.Series(dtype = "object")
for key in split_lists:
# Concatenate spam and non-spam parts into one DataFrame.
set_df = pd.concat(split_lists[key]).reset_index()
split_dfs[key] = set_df

perc_spam = set_df.label.eq('spam').sum() / set_df.shape[0] * 100

print(f"Number of rows in {key} set: {set_df.shape[0]}")
print(f"Percentage of {key} messages that are spam: {perc_spam:.2f}%")

Number of rows in training set: 4458
Percentage of training messages that are spam: 13.41%
Number of rows in testing set: 1114
Percentage of testing messages that are spam: 13.38%


We can see that the percentage of spam messages is roughly the same between the two sets. This will help the accuracy of the model later on.

Now, the two sets will be further split into X and y. y refers to the target, or the variable that we are trying to predict. In this case, we are trying to predict whether a message is spam or non-spam, so the "label" column is the target:

sms_df.label.head()

0     ham
1     ham
2    spam
3     ham
4     ham
Name: label, dtype: object

On the other hand, X refers to the features, which are information used to predict the target. We only have one feature column as of now, which is the "sms" column.

sms_df.sms.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: sms, dtype: object

Thus, we end up with four final objects:

• X_train: The messages in the training data.
• X_test: The messages in the testing data.
• y_train: The labels in the training data. These correspond to X_train.
• y_test: The labels in the testing data. These correspond to X_test.

# The four objects listed above.
X_train = split_dfs.training[["sms"]].copy()
X_test = split_dfs.testing[["sms"]].copy()
y_train = split_dfs.training["label"].copy()
y_test = split_dfs.testing["label"].copy()


The Algorithm

Now, let's discuss the multinomial naive bayes algorithm. Conditional probability is necessary in order to understand it. For our use case, let $Spam$ be the event that a message is spam, and $Ham$ be the event for non-spam.

Note: The mathematical explanations below are not my own ideas. I learned these from the Dataquest course on Naive Bayes.

Main Formulas

We want to compare the probability that a given message is spam to the probability that it is ham. Thus, we use the following formulas:

$P(Spam|w_1, w_2, \dots , w_n) \propto P(Spam) \cdot \Pi_{i=1}^n P(w_i|Spam)$

$P(Ham|w_1, w_2, \dots , w_n) \propto P(Ham) \cdot \Pi_{i=1}^n P(w_i|Ham)$

Note: These formulas are not the same as the Bayes Theorem. To understand how these were derived from the Bayes Theorem, see the Appendix of this post.

These two formulas are identical except for the $Spam$ or $Ham$ event. Let us just look at the first equation to unpack it.

The probability of event $B$ given that event $A$ has happened can be represented as $P(B|A)$ ("probability of B given A"). Thus, the left side of the formula, $P(Spam|w_1, w_2, \dots , w_n)$, represents the probability of spam given the contents of a message. Each variable $w_i$ represents one word in the message. For example, $w_1$ is the first word in the message, and so on.

In the middle, the "directly proportional to" sign ($\propto$) is used instead of the equals sign. The left and right sides are not equal, but one increases as the other increases.

At the right side, $P(Spam)$ simply refers to the probability that any message is spam. It can be calculated as the number of spam messages in the dataset over the total number of messages.

Finally, the formula ends with $\Pi_{i=1}^n P(w_i|Spam)$. The $P(w_i|Spam)$ part refers to the probability of a certain word occurring given that the message is known to be spam. We must calculate this probability for each word in the message. Then, because the uppercase pi ($\Pi$) refers to a product, we must multiply the word probabilities together.

In order to calculate $P(w_i|Spam)$, we need to use the following formula:

$P(w_i | Spam) = \frac{N_{w_i | Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$

We use an almost identical equation for $P(w_i|Ham)$ as well:

$P(w_i | Ham) = \frac{N_{w_i | Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$

Again, let us just unpack the first formula. $N_{w_i|Spam}$ refers to the number of times that the word appears in the dataset's spam messages.

$\alpha$ is the additive smoothing parameter. We will use $\alpha = 1$. This is added to the numerator to prevent it from becoming zero. If it does become zero, the entire product in the main formula will become zero.

$N_{Spam}$ refers to the total number of words in all of the spam messages. Duplicate words are not removed when this is calculated.

Lastly, $N_{Vocabulary}$ refers to the number of words in the vocabulary. This is the set of all unique words found in any of the messages, whether spam or non-spam. Duplicates are removed.

Implementation

Based on the theory behind the algorithm, I have written a set of steps to implement it. I will use these steps as the pseudocode for this project.

1. Determine the model parameters. These are the variables in the formulas shown earlier. Only the training data will be used for this.
• Find $P(Spam), P(Ham)$.
• Divide the number of spam messages by the total number of messages.
• Do the same for ham messages.
• Preprocess the messages to focus on individual words.
• Make all words lowercase.
• Remove punctuation marks.
• Form a vocabulary.
• Make a set of all the words in the messages, without duplicates.
• $N_{Vocabulary}$ is the number of words in this set.
• Find $N_{Spam}, N_{Ham}$.
• Count the number of times each word appears in each message.
• Count the total number of words in spam messages. Do the same for ham messages.
• Find $N_{w_i|Spam}, N_{w_i|Ham}$ for each word in the vocabulary.
• Sum up the word counts in spam messages to get $N_{w_i|Spam}$.
• Do the same for ham messages to get $N_{w_i|Spam}$.
2. Write a predictive function. This takes a new message and predicts whether it is spam or not.
• Plug the values that we calculated previously into the equation.
• Return $P(Spam|w_1, w_2, \dots , w_n)$, $P(Ham|w_1, w_2, \dots , w_n)$, and the prediction ("spam" or "ham").
3. Evaluate the model using the testing data.
• Make predictions for all messages in the testing set.
• Divide the number of correct predictions by the total number of predictions. This will result in the accuracy of the model.

Model Parameters

In the first step, we will calculate the parameters of the model, which include the following.

• $P(Spam), P(Ham)$
• $N_{Vocabulary}$
• $N_{Spam}, N_{Ham}$
• $N_{w_i|Spam}, N_{w_i|Ham}$

We will calculate these values first so that we can plug them into the equation later on when we predict whether new messages are spam or non-spam.

$P_{Spam}, P_{Ham}$

The probability of spam is equal to the number of spam messages over the total number of messages. The same goes for ham messages.

p_label = {}
p_label["spam"] = y_train.eq("spam").sum() / y_train.shape[0]
p_label["ham"] = 1 - p_label["spam"]

print(f"P(Spam) = {p_label['spam'] * 100:.2f}%")
print(f"P(Ham) = {p_label['ham'] * 100:.2f}%")

P(Spam) = 13.41%
P(Ham) = 86.59%


Message Preprocessing

Below are the messages:

X_train.head()

sms
0 Marvel Mobile Play the official Ultimate Spide...
1 Thank you, winner notified by sms. Good Luck! ...
2 Free msg. Sorry, a service you ordered from 81...
3 Thanks for your ringtone order, ref number R83...
4 PRIVATE! Your 2003 Account Statement for shows...

In order to get individual words, we make all words lowercase and remove punctuation marks and other non-word characters. We then turn each message into a list of its words.

def preprocess_messages(series):
result = (
series
.str.lower()
# Delete all non-word characters.
.str.replace(r"[^a-z0-9 ]", "", regex = True)
.str.strip()
.str.split()
)

return result

X_train = pd.DataFrame(preprocess_messages(X_train.sms))


sms
0 [marvel, mobile, play, the, official, ultimate...
1 [thank, you, winner, notified, by, sms, good, ...
2 [free, msg, sorry, a, service, you, ordered, f...
3 [thanks, for, your, ringtone, order, ref, numb...
4 [private, your, 2003, account, statement, for,...

Vocabulary

Using the preprocessed messages, we can form a set of all of the unique words that they contain.

vocab = set()
for lst in X_train.sms:
vocab.update(lst)

# Use a Series to delete items that are blank or only contain whitespace.
vocab_series = pd.Series(list(vocab))