CISC482 - Lecture18

Supervised Learning - Naive Bayes

Dr. Jeremy Castagno

Class Business

Schedule

  • Reading 7-2: Mar 31 @ 12PM, Friday
  • Reading 7-3: Apr 5 @ 12PM, Wednesday
  • HW6 - Working on it… April 12 @ Midnight, Wednesday

Today

  • Short Review
  • Naive Bayes
  • Class Activity

Review

Terms

  • An instance is labeled if
  • Supervised learning is
  • Regression vs Classification?
  • What does KNN stand for?

Metric Space

A metric space is an ordered pair \((M, d)\) where \(M\) is a set and \(d\) is a metric on \(M\), i.e., a function \(d: M \times M \rightarrow \mathbb{R}\) that satisfies the following axioms for all points \(x,y,z \in M\):

  1. The distance from a point to itself is 0: \(d(x,x) = 0\)
  2. The distance between any two distinct points is always positive: if \(x \neq y\), then \(d(x,y) > 0\)
  3. The distance from x to y is always the same distance from y to x: \(d(x,y) = d(y,x)\)
  4. Triangle Inequality holds: \(d(x,z) \leq d(x,y) + d(y, z)\)
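As a quick sanity check, the familiar Euclidean distance on \(\mathbb{R}^2\) satisfies all four axioms; here is a minimal numeric sketch on three made-up points:

import numpy as np

def d(a, b):
    return float(np.linalg.norm(a - b))   # Euclidean distance, a metric on R^2

x, y, z = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 0.0])
print(d(x, x) == 0)                   # axiom 1: identity
print(d(x, y) > 0)                    # axiom 2: positivity for distinct points
print(d(x, y) == d(y, x))             # axiom 3: symmetry
print(d(x, z) <= d(x, y) + d(y, z))   # axiom 4: triangle inequality (6 <= 5 + 5)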

Naive Bayes

Basic Idea

  • Naive Bayes classification is a supervised learning classifier that uses the number of times a feature occurs in each possible class to estimate the likelihood an instance is in the class.
  • Naive Bayes is often used for applications with large amounts of text data.
    • identifying the author of a new document based on prior documents with known authors.
    • detecting spam emails

Motivating Example

Professor Castagno, I have a question about when HW6 is due. Is it due on Wednesday before class or at midnight. Thank you for your help!

Proffesssor Castagno, Congratulations!!!1 You won free trip to B@hamas. Give credit card to confirm sh1pping of tickets. Free no, expense trip! Respond now!

Which one is spam? How do you know?

Example Problem Focus

  • For the rest of the class we will be doing examples on spam classification
  • This means we are given a document that has words in it and we want to predict its class: ham or spam
  • All machine learning models need features to learn from, and we expect these features to be numbers
  • We need to transform our document into a feature vector: \(X\)

Two Approaches

There are two main approaches to Naive Bayes, and they boil down to how the document is transformed. But first, some definitions:

  • We denote \(D = \{d_1, d_2, ..., d_i, ..., d_m \}\) as the set of all documents. Often called the corpus. \(\lvert D \rvert = m\).
  • Every \(d_i\) is composed of tokens (words) where spaces are ignored
  • The set of all unique tokens from \(D\) is called the vocabulary. Let's denote the cardinality of this set as \(n\), so there are \(n\) unique words in the corpus.

Example

D1 - The best part of waking up is Folgers in your cup. Fill your cup up with Folgers.

D2 - You are all great students! I am so proud to be a teacher of great students!

d_1 = "The best part of waking up is Folgers in your cup. Fill your cup up with Folgers."
d_2 = "You are all great students! I am so proud to be a teacher of great students!"
# Split on spaces and take the set of unique tokens across both documents
vocabulary = set(d_1.split(' ') + d_2.split(' '))
print(vocabulary)
print(len(vocabulary))
{'up', 'so', 'am', 'waking', 'of', 'Folgers.', 'be', 'your', 'teacher', 'cup.', 'cup', 'best', 'all', 'are', 'proud', 'Folgers', 'Fill', 'to', 'is', 'a', 'in', 'I', 'great', 'part', 'You', 'The', 'students!', 'with'}
28

Preprocess Text

  • Remove Punctuation
  • Strip leading and trailing whitespace - '  Hey  ' -> 'Hey'
  • Replace commonly occurring text patterns with a single token using regular expressions - 'http://spam.me' -> 'url'
    • '$', '£' -> 'mnsymb'
    • '55555' -> 'shrtcode'
    • '867-5309' -> 'phonenumber'
    • '88' -> 'number'
  • Lower case
  • Porter Stemmer - 'testing' -> 'test'
  • Remove stop words - 'the', 'a' (a minimal sketch of several of these steps follows below)
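For example, assuming made-up regex patterns and replacement tokens (and leaving out stemming, which would need something like NLTK's PorterStemmer):

import re
import string

def preprocess(text):
    text = text.strip()                                  # strip leading/trailing whitespace
    text = re.sub(r'http\S+', 'url', text)               # URLs -> 'url'
    text = re.sub(r'\d{3}-\d{4}', 'phonenumber', text)   # phone numbers -> 'phonenumber'
    text = re.sub(r'\d+', 'number', text)                # remaining digits -> 'number'
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = text.lower()                                  # lower case
    stop_words = {'the', 'a'}                            # tiny illustrative stop-word list
    return ' '.join(w for w in text.split() if w not in stop_words)

print(preprocess('Check out http://spam.me, call 867-5309! The deal ends in 88 hours.'))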

Creating Features Vectors

Every document \(d_i\) has a feature vector, \(x_i\), that has \(n\) elements in it. Each element will represent information about a unique vocab word in our corpus.

  • Multinomial Naive Bayes
    • Each document is broken up into tokens. The feature vector is then the frequency count of each vocabulary word in the token set.
  • Bernoulli Naive Bayes
    • Each document is broken up into tokens. The feature vector is then a Boolean indicator of whether each vocab word was found in the document.

Multinomial Feature Vector

Code
from sklearn.feature_extraction.text import CountVectorizer
docs = [d_1, d_2]  # the two example documents from earlier
vectorizer_m = CountVectorizer()
X_m = vectorizer_m.fit_transform(docs)
print(f"Vocab: {len(vectorizer_m.get_feature_names_out())}")
print(vectorizer_m.get_feature_names_out())
print("Feature Vector: ")
print(X_m.toarray())
Vocab: 24
['all' 'am' 'are' 'be' 'best' 'cup' 'fill' 'folgers' 'great' 'in' 'is'
 'of' 'part' 'proud' 'so' 'students' 'teacher' 'the' 'to' 'up' 'waking'
 'with' 'you' 'your']
Feature Vector: 
[[0 0 0 0 1 2 1 2 0 1 1 1 1 0 0 0 0 1 0 2 1 1 0 2]
 [1 1 1 1 0 0 0 0 2 0 0 1 0 1 1 2 1 0 1 0 0 0 1 0]]

Bernoulli Feature Vector

Code
vectorizer_b = CountVectorizer(binary=True)  # binary=True records presence (0/1) instead of counts
X_b = vectorizer_b.fit_transform(docs)
print(f"Vocab: {len(vectorizer_b.get_feature_names_out())}")
print(vectorizer_b.get_feature_names_out())
print("Feature Vector: ")
print(X_b.toarray())
Vocab: 24
['all' 'am' 'are' 'be' 'best' 'cup' 'fill' 'folgers' 'great' 'in' 'is'
 'of' 'part' 'proud' 'so' 'students' 'teacher' 'the' 'to' 'up' 'waking'
 'with' 'you' 'your']
Feature Vector: 
[[0 0 0 0 1 1 1 1 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1]
 [1 1 1 1 0 0 0 0 1 0 0 1 0 1 1 1 1 0 1 0 0 0 1 0]]

Which one to use?

  • Multinomial is the more complex model; Bernoulli is the simpler model
  • Less data -> Bernoulli -> less overfitting
  • A lot of data -> Multinomial
  • The zyBook only shows Multinomial

Naive Bayes - How to use Feature Vectors?

Bayes Models

\(P(H|\textbf{D}) = P(H ) \frac{P(\textbf{D} | H)}{P(\textbf{D})}\)

\(P(C = Spam|\textbf{X}) = \underbrace{P(C = Spam)}_{Prior} \frac{P(\textbf{X} | C = Spam)}{P(\textbf{X})}\)

\(P(C = Spam|\textbf{X}) = P(C = Spam) \frac{\underbrace{P(\textbf{X} | C = Spam)}_{Likelihood}}{P(\textbf{X})}\)

Remember that \(\textbf{X} = x_1, x_2, ..., x_n\), one element for each vocabulary word. For multinomial, \(x_i\) is the _____ and for Bernoulli it is the _____.

The Prior

\(P(C = Spam)\) = ?

Irrespective of the data, what is the probability that this is a spam document?

\(P(C = Spam)\) = \(\frac{\text{# Spam Documents}}{\text{# All Documents}} = \frac{m_s}{m}\)

\(P(C = Ham)\) = \(1 - P(C = Spam)\)
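As a tiny worked sketch with made-up labels (1 = spam, 0 = ham):

labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # hypothetical training labels

m = len(labels)        # total number of documents
m_s = sum(labels)      # number of spam documents
p_spam = m_s / m       # P(C = Spam) = m_s / m = 3/10
p_ham = 1 - p_spam     # P(C = Ham) = 0.7
print(p_spam, p_ham)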

The Likelihood (the hard one)

\(P(\textbf{X} | C = Spam) = P(x_1, x_2, ..., x_i, ..., x_n | Spam)\)

This is a very large joint probability! Very difficult to compute! But we have some tricks…

\(P(\textbf{X} | Spam) = P(x_1 | x_2, ... , x_n, Spam) P(x_2, ... x_n | Spam)\)

\(P(\textbf{X} | Spam) = P(x_1 | x_2, ... , x_n, Spam) P(x_2 | x_3, ..., x_n, Spam) P(x_3, ..., x_n | Spam)\)

This is still incredibly difficult to compute… but we can make a naive assumption.

The Naive Assumption

\(P(x_1 | x_2, ..., x_n, Spam ) = P(x_1 | Spam)\)

  • What is this assumption saying?
  • Is it true?
  • Is it useful?

\(P(\textbf{X} | Spam) = \prod P(x_i | Spam)\)

Calculating \(P(x_i | C=Spam)\), Bernoulli

  • Looking at only spam documents
  • Number of spam documents where \(x_i\) appeared = \(x_{i}^s\)
  • Total number of spam documents = \(m_s\)
  • \(P(x_i | C=Spam) = \frac{x_{i}^s}{m_s}\)

Calculating \(P(x_i | C=Ham)\), Bernoulli

  • Looking at only ham documents
  • Number of ham documents where \(x_i\) appeared = \(x_{i}^h\)
  • Total number of ham documents = \(m_h\)
  • \(P(x_i | C=Ham) = \frac{x_{i}^h}{m_h}\)
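These counts can be read straight off a binary (Bernoulli) feature matrix. A minimal sketch on made-up data (the matrix and labels below are hypothetical):

import numpy as np

# Hypothetical binary feature matrix: rows = documents, columns = vocab words
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])
y = np.array([1, 1, 0, 0])       # made-up labels: 1 = spam, 0 = ham

spam_docs = X[y == 1]            # look at only the spam documents
m_s = len(spam_docs)             # total number of spam documents
x_s = spam_docs.sum(axis=0)      # number of spam documents containing each word
print(x_s / m_s)                 # P(x_i | C=Spam) for each vocab word: [1.  0.5 0.5]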

Putting it all together

\(P(C=Spam | \textbf{X}) = P(C=Spam)\dfrac{P(\textbf{X} | C=Spam)}{P(\textbf{X})}\)

\(P(C=Spam | \textbf{X}) \propto P(C=Spam) \; P(\textbf{X} | C=Spam)\)

\(P(C=Spam | \textbf{X}) \propto P(C=Spam) \; \prod P(x_i | C=Spam)\)

\(P(C=Spam | \textbf{X}) \propto \dfrac{m_s}{m} \; \prod \dfrac{x_{i}^s}{m_s}\)

Warning

Anyone see a problem here? What happens if a word in the document never appeared in any spam (or ham) training document? Then \(x_{i}^s = 0\) and the whole product becomes 0.

Laplacian Smoothing for Naive Bayes

  • \(P(C=Spam | \textbf{X}) \propto \dfrac{m_s}{m} \; \prod \dfrac{x_{i}^s + 1}{m_s + 2}\)
  • \(P(C=Ham | \textbf{X}) \propto \dfrac{m_h}{m} \; \prod \dfrac{x_{i}^h + 1}{m_h + 2}\)
  • Compute both quantities. Whichever class has the higher probability, classify the document as that class (a minimal end-to-end sketch follows below).
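Below is a minimal end-to-end sketch on the same made-up data, using the slide's Laplace-smoothed formula over the words present in the new document. (For real data you would typically use sklearn's BernoulliNB, which also folds in the \(1 - P(x_i | C)\) term for absent words and applies Laplace smoothing by default.)

import numpy as np

# Same hypothetical data as above: binary features and labels (1 = spam, 0 = ham)
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])
y = np.array([1, 1, 0, 0])
x_new = np.array([1, 1, 0])        # a hypothetical new document to classify

def class_score(X, y, x_new, c):
    docs = X[y == c]
    m_c = len(docs)                              # number of documents in class c
    prior = m_c / len(X)                         # m_s/m or m_h/m
    p = (docs.sum(axis=0) + 1) / (m_c + 2)       # Laplace-smoothed P(x_i | C=c)
    return prior * np.prod(p[x_new == 1])        # multiply over words present in x_new

spam_score = class_score(X, y, x_new, 1)
ham_score = class_score(X, y, x_new, 0)
print("spam" if spam_score > ham_score else "ham")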

Warning

Do any astute computer scientists see a possible problem…

Floating point number inaccuracies

import math
probs = [0.01] * 20
final_prob = math.prod(probs)
print(f"Probabilities: {probs}")
print(f"Final Probability: {final_prob}")
Probabilities: [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
Final Probability: 1.0000000000000005e-40
import math
probs = [0.01] * 10_000
final_prob = math.prod(probs)
print(f"Final Probability: {final_prob}")
Final Probability: 0.0

Anyone have any ideas to fix this?

Log function

  • \(P(C=Spam | \textbf{X}) \propto \dfrac{m_s}{m} \; \prod \dfrac{x_{i}^s + 1}{m_s + 2}\)
  • \(P(C=Spam | \textbf{X}) \propto \ln \left( \dfrac{m_s}{m} \; \prod \dfrac{x_{i}^s + 1}{m_s + 2} \right)\)
  • What is \(\log(a \cdot b)\)?
  • \(\log(a \cdot b) = \log(a) + \log(b)\)
  • \(P(C=Spam | \textbf{X}) \propto \ln \left(\dfrac{m_s}{m} \right) + \; \sum \ln \left(\dfrac{x_{i}^s + 1}{m_s + 2}\right)\)
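Revisiting the earlier underflow example, the sum of logs stays comfortably within floating-point range:

import math

# The product of 10,000 probabilities of 0.01 underflowed to 0.0 above,
# but the sum of their logs is a perfectly ordinary float.
log_probs = [math.log(0.01)] * 10_000
print(sum(log_probs))   # about -46052, no underflow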

Log Function Graph

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 1000000)  # log maps (0, 1] to (-inf, 0]
y = np.log(x)
plt.plot(x, y)

Tip

We can transform our probabilities with log and make everything numerically stable!