CISC482 - Lecture18

Supervised Learning - Naive Bayes

Dr. Jeremy Castagno

Class Business

Schedule

  • Reading 7-2: Mar 31 @ 12PM, Friday
  • Reading 7-3: Apr 5 @ 12PM, Wednesday
  • HW6 - Working on it… April 12 @ Midnight, Wednesday

Today

  • Short Review
  • Naive Bayes
  • Class Activity

Review

Terms

  • An instance is labeled if
  • Supervised learning is
  • Regression vs Classification?
  • What does KNN stand for?

Metric Space

A metric space is an ordered pair \((M, d)\) where \(M\) is a set and \(d\) is a metric on \(M\), i.e., a function \(d: M \times M \rightarrow \mathbb{R}\) that satisfies the following axioms for all points \(x,y,z \in M\):

  1. The distance from a point to itself is 0: \(d(x,x) = 0\)
  2. The distance between any two distinct points is always positive: if \(x \neq y\), then \(d(x,y) > 0\)
  3. The distance from x to y is always the same distance from y to x: \(d(x,y) = d(y,x)\)
  4. Triangle Inequality holds: \(d(x,z) \leq d(x,y) + d(y, z)\)
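As a quick sanity check, the familiar Euclidean distance on \(\mathbb{R}^2\) satisfies all four axioms; here is a minimal numeric sketch on three made-up points:

import numpy as np

def d(a, b):
    return float(np.linalg.norm(a - b))   # Euclidean distance, a metric on R^2

x, y, z = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 0.0])
print(d(x, x) == 0)                   # axiom 1: identity
print(d(x, y) > 0)                    # axiom 2: positivity for distinct points
print(d(x, y) == d(y, x))             # axiom 3: symmetry
print(d(x, z) <= d(x, y) + d(y, z))   # axiom 4: triangle inequality (6 <= 5 + 5)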

Naive Bayes

Basic Idea

  • Naive Bayes classification is a supervised learning classifier that uses the number of times a feature occurs in each possible class to estimate the likelihood an instance is in the class.
  • Naive Bayes is often used for applications with large amounts of text data.
    • identifying the author of a new document based on prior documents with known authors.
    • detecting spam emails

Motivating Example

Professor Castagno, I have a question about when HW6 is due. Is it due on Wednesday before class or at midnight. Thank you for your help!

Proffesssor Castagno, Congratulations!!!1 You won free trip to B@hamas. Give credit card to confirm sh1pping of tickets. Free no, expense trip! Respond now!

Which one is spam? How do you know?

Example Problem Focus

  • For the rest of the class we will be doing examples on spam classification
  • This means we are given a document that has words in it and we want to predict its class: ham or spam
  • All machine learning models need features to learn from, and we expect these features to be numbers
  • We need to transform our document into a feature vector: \(X\)

Two Approaches

There are two main approaches to Naive Bayes, and they boil down to how the document is transformed. But first, some definitions:

  • We denote \(D = \{d_1, d_2, ..., d_i, ..., d_m \}\) as the set of all documents. Often called the corpus. \(\lvert D \rvert = m\).
  • Every \(d_i\) is composed of tokens (words) where spaces are ignored
  • The set of all unique tokens from \(D\) is called the vocabulary. Let's denote the cardinality of this set as \(n\), so there are \(n\) unique words in the corpus.

Example

D1 - The best part of waking up is Folgers in your cup. Fill your cup up with Folgers.

D2 - You are all great students! I am so proud to be a teacher of great students!

d_1 = "The best part of waking up is Folgers in your cup. Fill your cup up with Folgers."
d_2 = "You are all great students! I am so proud to be a teacher of great students!"
# Split on spaces and take the set of unique tokens across both documents
vocabulary = set(d_1.split(' ') + d_2.split(' '))
print(vocabulary)
print(len(vocabulary))
{'up', 'so', 'am', 'waking', 'of', 'Folgers.', 'be', 'your', 'teacher', 'cup.', 'cup', 'best', 'all', 'are', 'proud', 'Folgers', 'Fill', 'to', 'is', 'a', 'in', 'I', 'great', 'part', 'You', 'The', 'students!', 'with'}
28

Preprocess Text

  • Remove Punctuation
  • Strip leading and trailing whitespace - '  Hey  ' -> 'Hey'
  • Replace commonly occurring text patterns with a single token using regular expressions - 'http://spam.me' -> 'url'
    • '$', '£' -> 'mnsymb'
    • '55555' -> 'shrtcode'
    • '867-5309' -> 'phonenumber'
    • '88' -> 'number'
  • Lower case
  • Porter Stemmer - 'testing' -> 'test'
  • Remove stop words - 'the', 'a' (a minimal sketch of several of these steps follows below)
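For example, assuming made-up regex patterns and replacement tokens (and leaving out stemming, which would need something like NLTK's PorterStemmer):

import re
import string

def preprocess(text):
    text = text.strip()                                  # strip leading/trailing whitespace
    text = re.sub(r'http\S+', 'url', text)               # URLs -> 'url'
    text = re.sub(r'\d{3}-\d{4}', 'phonenumber', text)   # phone numbers -> 'phonenumber'
    text = re.sub(r'\d+', 'number', text)                # remaining digits -> 'number'
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = text.lower()                                  # lower case
    stop_words = {'the', 'a'}                            # tiny illustrative stop-word list
    return ' '.join(w for w in text.split() if w not in stop_words)

print(preprocess('Check out http://spam.me, call 867-5309! The deal ends in 88 hours.'))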

Creating Features Vectors

Every document \(d_i\) has a feature vector, \(x_i\), that has \(n\) elements in it. Each element will represent information about a unique vocab word in our corpus.

  • Multinomial Naive Bayes
    • Each document is broken up into tokens. The feature vector is then the frequency count of each vocabulary word in the token set.
  • Bernoulli Naive Bayes
    • Each document is broken up into tokens. The feature vector is then a Boolean indicator of whether each vocab word was found in the document.

Multinomial Feature Vector

Code
from sklearn.feature_extraction.text import CountVectorizer
docs = [d_1, d_2]  # the two example documents from earlier
vectorizer_m = CountVectorizer()
X_m = vectorizer_m.fit_transform(docs)
print(f"Vocab: {len(vectorizer_m.get_feature_names_out())}")
print(vectorizer_m.get_feature_names_out())
print("Feature Vector: ")
print(X_m.toarray())
Vocab: 24
['all' 'am' 'are' 'be' 'best' 'cup' 'fill' 'folgers' 'great' 'in' 'is'
 'of' 'part' 'proud' 'so' 'students' 'teacher' 'the' 'to' 'up' 'waking'
 'with' 'you' 'your']
Feature Vector: 
[[0 0 0 0 1 2 1 2 0 1 1 1 1 0 0 0 0 1 0 2 1 1 0 2]
 [1 1 1 1 0 0 0 0 2 0 0 1 0 1 1 2 1 0 1 0 0 0 1 0]]

Bernoulli Feature Vector

Code
vectorizer_b = CountVectorizer(binary=True)  # binary=True records presence (0/1) instead of counts
X_b = vectorizer_b.fit_transform(docs)
print(f"Vocab: {len(vectorizer_b.get_feature_names_out())}")
print(vectorizer_b.get_feature_names_out())
print("Feature Vector: ")
print(X_b.toarray())
Vocab: 24
['all' 'am' 'are' 'be' 'best' 'cup' 'fill' 'folgers' 'great' 'in' 'is'
 'of' 'part' 'proud' 'so' 'students' 'teacher' 'the' 'to' 'up' 'waking'
 'with' 'you' 'your']
Feature Vector: 
[[0 0 0 0 1 1 1 1 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1]
 [1 1 1 1 0 0 0 0 1 0 0 1 0 1 1 1 1 0 1 0 0 0 1 0]]

Which one to use?

  • Multinomial is the more complex model; Bernoulli is the simpler model
  • Less data -> Bernoulli -> less overfitting
  • A lot of data -> Multinomial
  • The zyBook only shows Multinomial

Naive Bayes - How to use Feature Vectors?

Bayes Models

\(P(H|\textbf{D}) = P(H ) \frac{P(\textbf{D} | H)}{P(\textbf{D})}\)

\(P(C = Spam|\textbf{X}) = \underbrace{P(C = Spam)}_{Prior} \frac{P(\textbf{X} | C = Spam)}{P(\textbf{X})}\)

\(P(C = Spam|\textbf{X}) = P(C = Spam) \frac{\underbrace{P(\textbf{X} | C = Spam)}_{Likelihood}}{P(\textbf{X})}\)

Remember that \(\textbf{X} = x_1, x_2, ..., x_n\), one element for each vocabulary word. For multinomial, \(x_i\) is the _____ and for Bernoulli it is the _____.

The Prior

\(P(C = Spam)\) = ?

Irrespective of the data, what is the probability that this is a spam document?

\(P(C = Spam)\) = \(\frac{\text{# Spam Documents}}{\text{# All Documents}} = \frac{m_s}{m}\)

\(P(C = Ham)\) = \(1 - P(C = Spam)\)
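As a tiny worked sketch with made-up labels (1 = spam, 0 = ham):

labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # hypothetical training labels

m = len(labels)        # total number of documents
m_s = sum(labels)      # number of spam documents
p_spam = m_s / m       # P(C = Spam) = m_s / m = 3/10
p_ham = 1 - p_spam     # P(C = Ham) = 0.7
print(p_spam, p_ham)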

The Likelihood (the hard one)

\(P(\textbf{X} | C = Spam) = P(x_1, x_2, ..., x_i, ..., x_n | Spam)\)

This is a very large joint probability! Very difficult to compute! But we have some tricks…

\(P(\textbf{X} | Spam) = P(x_1 | x_2, ... , x_n, Spam) P(x_2, ... x_n | Spam)\)

\(P(\textbf{X} | Spam) = P(x_1 | x_2, ... , x_n, Spam) P(x_2 | x_3, ..., x_n, Spam) P(x_3, ..., x_n | Spam)\)

This is still incredibly difficult to compute… but we can make a naive assumption.

The Naive Assumption

\(P(x_1 | x_2, ..., x_n, Spam ) = P(x_1 | Spam)\)

  • What is this assumption saying?
  • Is it true?
  • Is it useful?

\(P(\textbf{X} | Spam) = \prod P(x_i | Spam)\)

Calculating \(P(x_i | C=Spam)\), Bernoulli

  • Looking at only spam documents
  • Number of spam documents where \(x_i\) appeared = \(x_{i}^s\)
  • Total number of spam documents = \(m_s\)
  • \(P(x_i | C=Spam) = \frac{x_{i}^s}{m_s}\)

Calculating \(P(x_i | C=Ham)\), Bernoulli

  • Looking at only ham documents
  • Number of ham documents where \(x_i\) appeared = \(x_{i}^h\)
  • Total number of ham documents = \(m_h\)
  • \(P(x_i | C=Ham) = \frac{x_{i}^h}{m_h}\)
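These counts can be read straight off a binary (Bernoulli) feature matrix. A minimal sketch on made-up data (the matrix and labels below are hypothetical):

import numpy as np

# Hypothetical binary feature matrix: rows = documents, columns = vocab words
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])
y = np.array([1, 1, 0, 0])       # made-up labels: 1 = spam, 0 = ham

spam_docs = X[y == 1]            # look at only the spam documents
m_s = len(spam_docs)             # total number of spam documents
x_s = spam_docs.sum(axis=0)      # number of spam documents containing each word
print(x_s / m_s)                 # P(x_i | C=Spam) for each vocab word: [1.  0.5 0.5]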

Putting it all together

\(P(C=Spam | \textbf{X}) = P(C=Spam)\dfrac{P(\textbf{X} | C=Spam)}{P(\textbf{X})}\)

\(P(C=Spam | \textbf{X}) \propto P(C=Spam) \; P(\textbf{X} | C=Spam)\)

\(P(C=Spam | \textbf{X}) \propto P(C=Spam) \; \prod P(x_i | C=Spam)\)

\(P(C=Spam | \textbf{X}) \propto \dfrac{m_s}{m} \; \prod \dfrac{x_{i}^s}{m_s}\)

Warning

Anyone see a problem here? What happens if a word in the document never appeared in any spam (or ham) training document? Then \(x_{i}^s = 0\) and the whole product becomes 0.

Laplacian Smoothing for Naive Bayes

  • \(P(C=Spam | \textbf{X}) \propto \dfrac{m_s}{m} \; \prod \dfrac{x_{i}^s + 1}{m_s + 2}\)
  • \(P(C=Ham | \textbf{X}) \propto \dfrac{m_h}{m} \; \prod \dfrac{x_{i}^h + 1}{m_h + 2}\)
  • Compute both quantities. Whichever class has the higher probability, classify the document as that class (a minimal end-to-end sketch follows below).
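Below is a minimal end-to-end sketch on the same made-up data, using the slide's Laplace-smoothed formula over the words present in the new document. (For real data you would typically use sklearn's BernoulliNB, which also folds in the \(1 - P(x_i | C)\) term for absent words and applies Laplace smoothing by default.)

import numpy as np

# Same hypothetical data as above: binary features and labels (1 = spam, 0 = ham)
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])
y = np.array([1, 1, 0, 0])
x_new = np.array([1, 1, 0])        # a hypothetical new document to classify

def class_score(X, y, x_new, c):
    docs = X[y == c]
    m_c = len(docs)                              # number of documents in class c
    prior = m_c / len(X)                         # m_s/m or m_h/m
    p = (docs.sum(axis=0) + 1) / (m_c + 2)       # Laplace-smoothed P(x_i | C=c)
    return prior * np.prod(p[x_new == 1])        # multiply over words present in x_new

spam_score = class_score(X, y, x_new, 1)
ham_score = class_score(X, y, x_new, 0)
print("spam" if spam_score > ham_score else "ham")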

Warning

Do any astute computer scientists see a possible problem…

Floating point number inaccuracies

import math
probs = [0.01] * 20
final_prob = math.prod(probs)
print(f"Probabilities: {probs}")
print(f"Final Probability: {final_prob}")
Probabilities: [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
Final Probability: 1.0000000000000005e-40
import math
probs = [0.01] * 10_000
final_prob = math.prod(probs)
print(f"Final Probability: {final_prob}")
Final Probability: 0.0

Anyone have any ideas to fix this?

Log function

  • \(P(C=Spam | \textbf{X}) \propto \dfrac{m_s}{m} \; \prod \dfrac{x_{i}^s + 1}{m_s + 2}\)
  • \(P(C=Spam | \textbf{X}) \propto \ln \left( \dfrac{m_s}{m} \; \prod \dfrac{x_{i}^s + 1}{m_s + 2} \right)\)
  • What is \(\log(a \cdot b)\)?
  • \(\log(a \cdot b) = \log(a) + \log(b)\)
  • \(P(C=Spam | \textbf{X}) \propto \ln \left(\dfrac{m_s}{m} \right) + \; \sum \ln \left(\dfrac{x_{i}^s + 1}{m_s + 2}\right)\)
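Revisiting the earlier underflow example, the sum of logs stays comfortably within floating-point range:

import math

# The product of 10,000 probabilities of 0.01 underflowed to 0.0 above,
# but the sum of their logs is a perfectly ordinary float.
log_probs = [math.log(0.01)] * 10_000
print(sum(log_probs))   # about -46052, no underflow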

Log Function Graph

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 1000000)  # log maps (0, 1] to (-inf, 0]
y = np.log(x)
plt.plot(x, y)

Tip

We can transform our probabilities with log and make everything numerically stable!