CISC482 - Lecture24

Principal Component Analysis

Dr. Jeremy Castagno

Class Business

Schedule

Today

  • Recap Peer Review
  • Review Hierarchical Clustering
  • Principal Component Analysis

Peer Review

My Review

  • Overall we did a great job! You can find my feedback and your peers' feedback on Brightspace.
    • It's also inside your shared Google Drive folder under PeerReview
  • I particularly appreciated the detailed walkthrough that many of you did.
  • It was very interesting seeing all your ideas!
  • I did see a few common errors that I wanted to bring up.

Common Issues 1

  • Don’t have empty cells.
  • Don’t have grammatical errors.
  • Label all your axes in your figures.
  • The primary goal of this paper is to create a predictive model.
    • You need to be clear in your introduction what your response variable is and if you are doing classification or regression.

Common Issues 2

  • Be sure to have a good amount of detail in your report that explains WHY you are doing what you are doing.
  • I recommend making tables showing your results. You can do this by creating a data frame of your results or by using this website to make markdown tables (see the sketch after this list).
  • Please review this document to learn how to write equations in Google Colab.
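
One way to build such a results table is to put your metrics in a data frame and export it as markdown; a minimal sketch where the model names and metric values are placeholders, not results from any particular project.

Code
import pandas as pd

# Placeholder model names and metric values, just to show the pattern
results = pd.DataFrame({
    "Model": ["Logistic Regression", "KNN", "Decision Tree"],
    "Accuracy": [0.91, 0.88, 0.85],
    "F1 Score": [0.90, 0.86, 0.84],
})
print(results.to_markdown(index=False))   # requires the `tabulate` package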

Common Issues 3

  • If you are doing regression, please make residual plots and report the \(R^2\) metric.
  • If you are doing classification, please report metrics such as accuracy, precision, recall, and F1 score (see the sketch below).
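
One convenient way to report all of these at once is sklearn's classification_report; a small sketch where the labels are made up just to show the call. Replace them with your test labels and predictions.

Code
from sklearn.metrics import accuracy_score, classification_report

# Toy labels for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))   # precision, recall, F1 per class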

Hierarchical clustering

Types of Clustering

  • Agglomerative hierarchical clustering is _______
  • Divisive hierarchical clustering is ____

Measures of Similarity

  • The single linkage method calculates the distance between a pair of samples, one from each cluster, that are the _____.
  • The complete linkage method calculates the distance between a pair of samples, one from each cluster, that are the _____.
  • The centroid linkage method calculates the distance between the ________ of two clusters.

Questions - 1

Which two samples should be used to determine similarity using the Single linkage? Complete Linkage?


Terminology

  • A _______ is a ________ that shows the order in which clusters are grouped together and the distances between clusters.
  • Read it from the BOTTOM up!

  • A _____ is a branch of a dendrogram (a vertical line).
  • A _____ is a horizontal line that connects two ______; its height gives the distance between clusters.
  • A _____ is the terminal end of each _______ in a dendrogram, which represents a single sample.

Question Threshold

  • How many total samples?
  • How many clusters would there be with dashed blue line?
  • How many clusters would there be with dashed red line?

Principal Component Analysis

Terms

  • Principal component analysis, or PCA, is a dimensionality reduction technique.
  • It can be used to compress data.
  • Let's say we have \(n=10\) points such that \(x \in \mathbf{R}^2\)
  • PCA can reduce the data to \(n=10\) points inside \(\mathbf{R}^{1}\)
  • It does this by finding the component, or feature vector, that captures the largest variability (see the sketch below)
  • Let's look at an example
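
As a rough sketch of the idea, assuming the data has been mean-centered: if \(w\) is a unit vector, projecting a sample \(x_i\) onto it gives a single number \(z_i\), and the first principal component is the direction whose projections have the largest variance:

\[
z_i = w^\top x_i, \qquad w_1 = \arg\max_{\lVert w \rVert = 1} \; \frac{1}{n}\sum_{i=1}^{n} \left(w^\top x_i\right)^2
\]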

Example 1 Start

Find me a vector, or axis, such that projecting the data onto it maximizes the variance
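
A minimal numpy sketch of what "project and measure the variance" means, using data generated the same way as in Example 2 below; the two candidate directions are arbitrary choices for illustration.

Code
import numpy as np

x = np.arange(0, 10) + np.random.randn(10) * 0.1
y = x * 1 + np.random.randn(10) * 0.3
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)                       # center the data first

# Two candidate unit vectors: the diagonal direction and the x-axis direction
for w in (np.array([0.707, 0.707]), np.array([1.0, 0.0])):
    z = Xc @ w                                # project each sample onto w
    print(w, "-> variance of projections:", round(z.var(), 2))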

Projection

Example 1 Animation

Example 2 Start

Code
import numpy as np
import seaborn as sns

x = np.arange(0, 10) + np.random.randn(10) * 0.1
y = x * 1 + np.random.randn(10) * 0.3
sns.scatterplot(x=x, y=y)

Find me a vector, or axis, such that projecting the data onto it maximizes the variance

Example 2 - First Principal Component

Code
x = np.arange(0, 10) + np.random.randn(10) * 0.1
y = x * 1 + np.random.randn(10) * 0.3
ax = sns.scatterplot(x=x, y=y)
# Dashed red line: the direction of the first principal component (the line y = x)
ax.plot([0, 10], [0, 10], ls='dashed', c='r')

Do you see how this vector maximizes the variance? The vector is \([1, 1]\), or approximately \([0.707, 0.707]\) when normalized.
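
To check this numerically, sklearn's PCA should recover roughly the same direction; a small sketch, regenerating the same kind of data as above (the sign of the component may be flipped).

Code
import numpy as np
from sklearn.decomposition import PCA

x = np.arange(0, 10) + np.random.randn(10) * 0.1
y = x * 1 + np.random.randn(10) * 0.3
X = np.column_stack([x, y])

pca = PCA(n_components=1)
pca.fit(X)
print(pca.components_)   # roughly [0.707, 0.707], up to sign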

Intuition

Back to Example 2

Code
import pandas as pd

X = np.column_stack([x, y])
df = pd.DataFrame(X, columns=['x', 'y'])
df
x y
0 -0.20 -0.28
1 0.83 0.66
2 2.03 2.07
3 3.24 3.10
4 4.11 4.50
5 5.17 5.23
6 6.01 6.13
7 7.14 7.04
8 7.97 8.35
9 9.06 8.84
Code
pc1 = np.array([[.707], [.707]])
print(f"PC1 = {pc1.flatten()}")
pc1_vals = X @ pc1   # project each 2-D sample onto PC1 -> one value per sample
df = pd.DataFrame(pc1_vals, columns=['PC1'])
df
PC1 = [0.7 0.7]
PC1
0 -0.34
1 1.05
2 2.89
3 4.48
4 6.09
5 7.35
6 8.58
7 10.02
8 11.54
9 12.66

Compression from Example 2

Code
sns.scatterplot(x=x,y=y, label="Raw");

Code
import matplotlib.pyplot as plt

points = pc1.flatten() * pc1_vals   # reconstruct 2-D points from the 1-D projections
ax = sns.scatterplot(x=x, y=y, label='Raw')
ax.scatter(x=points[:, 0], y=points[:, 1], label='Compressed')
plt.legend();
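
One way to quantify what this compression loses is the average reconstruction error; a minimal sketch, regenerating the same kind of data and reusing the approximate PC1 direction \([0.707, 0.707]\).

Code
import numpy as np

x = np.arange(0, 10) + np.random.randn(10) * 0.1
y = x * 1 + np.random.randn(10) * 0.3
X = np.column_stack([x, y])

w = np.array([0.707, 0.707])                 # approximate first principal component
z = X @ w                                    # 1-D compressed representation
X_hat = np.outer(z, w)                       # reconstructed 2-D points
print("mean reconstruction error:", np.mean(np.linalg.norm(X - X_hat, axis=1)))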

Higher Dimensions

3D

Code
x = np.arange(0, 10) + np.random.randn(10) * 0.1
y = np.arange(0, 10) + np.random.randn(10) * 0.1
# z is (nearly) a linear combination of x and y, so the cloud is almost planar
z = x * 1 + y * 2 + np.random.randn(10) * 1
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

X = np.column_stack([x, y, z])

ax.scatter(x, y, z, marker='o')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')

PCA

Code
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
# fit and transform data
X_pca = pca.fit_transform(X)
print(X_pca.shape)
X_comp = pca.inverse_transform(X_pca)   # map the 1-D representation back to 3-D

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

ax.scatter(x, y, z, marker='o', label='Raw')
ax.scatter(X_comp[:, 0], X_comp[:, 1], X_comp[:, 2], marker='x', label='compressed')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
(10, 1)
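
How much of the 3-D spread does that single component keep? A short sketch, regenerating the same style of data and checking pca.explained_variance_ratio_.

Code
import numpy as np
from sklearn.decomposition import PCA

x = np.arange(0, 10) + np.random.randn(10) * 0.1
y = np.arange(0, 10) + np.random.randn(10) * 0.1
z = x * 1 + y * 2 + np.random.randn(10) * 1
X = np.column_stack([x, y, z])

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)   # fraction of total variance kept by one component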

Wine Data Set

The Data

Features: ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
Classes: (array([1, 2, 3]), array([59, 71, 48]))
group 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 14.23 1.71 2.43 15.60 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.20 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.60 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.80 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.00 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

Can we visualize this data? Why or why not?

Split Data and Scale

Code
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(
    df, test_size=0.30, random_state=0)

y_train = df_train['group']
y_test = df_test['group']

scaler = StandardScaler().set_output(transform="pandas")
scaled_X_train = scaler.fit_transform(df_train.iloc[:, 1:])
scaled_X_test = scaler.transform(df_test.iloc[:, 1:])   # reuse the training fit; don't refit on test data
scaled_X_train.head()
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12
22 0.91 -0.46 -0.01 -0.82 0.06 0.59 0.94 -0.76 0.13 -0.51 0.66 1.94 0.94
108 -0.96 -0.97 -1.54 -0.15 -0.55 0.17 0.07 0.21 0.78 -0.98 -0.41 0.58 -1.41
175 0.36 1.68 -0.37 0.13 1.36 -1.12 -1.31 0.53 -0.44 2.22 -1.56 -1.45 0.29
145 0.22 1.05 -0.77 0.41 0.13 -1.27 -1.46 0.53 -0.52 -0.43 -1.52 -1.28 0.27
71 1.10 -0.77 1.11 1.54 -0.96 1.16 0.92 -1.25 0.43 -0.69 1.72 0.78 -1.09

\(z = (x - \mu) / \sigma\)
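
The StandardScaler call above applies exactly this formula to each column; a quick sketch verifying that on a toy column (the values are the first few Alcohol entries from the table above).

Code
import numpy as np
from sklearn.preprocessing import StandardScaler

col = np.array([14.23, 13.20, 13.16, 14.37, 13.24]).reshape(-1, 1)

manual = (col - col.mean()) / col.std()       # z = (x - mu) / sigma
scaled = StandardScaler().fit_transform(col)
print(np.allclose(manual, scaled))            # True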

PCA

Code
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(scaled_X_train)   # 13 standardized features -> 2 principal components
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, cmap='viridis', edgecolor='k');

[[2.6 -0.0]
 [0.2 2.3]
 [-2.6 -2.7]
 ...
 [-0.1 2.0]
 [2.9 -0.8]
 [-2.4 -2.2]]
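
How much of the original 13-dimensional variance do these two components keep? A short sketch, assuming the 2-component pca object fit in the cell above.

Code
# Assumes `pca` is the 2-component PCA fit on scaled_X_train above
print(pca.explained_variance_ratio_)          # variance captured by PC1 and PC2
print(pca.explained_variance_ratio_.sum())    # total fraction kept by the 2-D view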

Separate with Logistic Regression

Code
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1e5)   # large C -> very little regularization
model.fit(X_train_pca, y_train)

disp = DecisionBoundaryDisplay.from_estimator(
    model, X_train_pca, response_method="predict", cmap=plt.cm.viridis, alpha=0.5, xlabel='PC1', ylabel='PC2'
)
disp.ax_.scatter(X_train_pca[:, 0], X_train_pca[:,1], c=y_train, cmap='viridis', edgecolor='k');

Why is this a big deal?

  • We just separated these classes with only 2 dimensions! We started with 13 dimensions.
  • Remember, we did not simply drop features. We extracted two new features, PC1 and PC2, that best explain the data.
  • PC1 is a combination of many of the 13 features (see the loadings sketch below).
PC1 : [0.1 -0.2 -0.0 -0.3 0.1 0.4 0.4 -0.3 0.3 -0.1 0.3 0.4 0.3]
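
To see which original features PC1 weights most heavily, pca.components_ can be wrapped in a labeled data frame; a sketch assuming the 2-component pca fit on the scaled wine data above.

Code
import pandas as pd

features = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
            'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
            'Proanthocyanins', 'Color intensity', 'Hue',
            'OD280/OD315 of diluted wines', 'Proline']

# Assumes `pca` is the 2-component PCA fit on the scaled wine data above
loadings = pd.DataFrame(pca.components_, columns=features, index=['PC1', 'PC2'])
print(loadings.round(2))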

Even higher Dimensions (IMAGES!)

MNIST Digits - 8 X 8 Images

Code
from sklearn.datasets import load_digits
mnist = load_digits()
X = mnist.data
y = mnist.target
images = mnist.images

fig, axes = plt.subplots(2, 10, figsize=(16, 6))
for i in range(20):
    axes[i//10, i %10].imshow(images[i], cmap='gray');
    axes[i//10, i %10].axis('off')
    axes[i//10, i %10].set_title(f"target: {y[i]}")
    
plt.tight_layout()

MNIST Data

Code
pd.DataFrame(X)
0 1 2 3 4 5 6 7 8 9 ... 54 55 56 57 58 59 60 61 62 63
0 0.00 0.00 5.00 13.00 9.00 1.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 6.00 13.00 10.00 0.00 0.00 0.00
1 0.00 0.00 0.00 12.00 13.00 5.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 11.00 16.00 10.00 0.00 0.00
2 0.00 0.00 0.00 4.00 15.00 12.00 0.00 0.00 0.00 0.00 ... 5.00 0.00 0.00 0.00 0.00 3.00 11.00 16.00 9.00 0.00
3 0.00 0.00 7.00 15.00 13.00 1.00 0.00 0.00 0.00 8.00 ... 9.00 0.00 0.00 0.00 7.00 13.00 13.00 9.00 0.00 0.00
4 0.00 0.00 0.00 1.00 11.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 2.00 16.00 4.00 0.00 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1792 0.00 0.00 4.00 10.00 13.00 6.00 0.00 0.00 0.00 1.00 ... 4.00 0.00 0.00 0.00 2.00 14.00 15.00 9.00 0.00 0.00
1793 0.00 0.00 6.00 16.00 13.00 11.00 1.00 0.00 0.00 0.00 ... 1.00 0.00 0.00 0.00 6.00 16.00 14.00 6.00 0.00 0.00
1794 0.00 0.00 1.00 11.00 15.00 1.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 2.00 9.00 13.00 6.00 0.00 0.00
1795 0.00 0.00 2.00 10.00 7.00 0.00 0.00 0.00 0.00 0.00 ... 2.00 0.00 0.00 0.00 5.00 12.00 16.00 12.00 0.00 0.00
1796 0.00 0.00 10.00 14.00 8.00 1.00 0.00 0.00 0.00 2.00 ... 8.00 0.00 0.00 1.00 8.00 12.00 14.00 12.00 1.00 0.00

1797 rows × 64 columns

More Images - Digit 1

More Images - Digit 8

Apply PCA

Code
pca = PCA(n_components=64)
pca_data = pca.fit_transform(X)
# With all 64 components kept, this equals pca.explained_variance_ratio_
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
cum_var_explained = np.cumsum(percentage_var_explained) * 100

plt.plot(range(1,65), cum_var_explained, linewidth=2)
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance Explained");
plt.xticks(np.arange(0, 64, 4));
plt.ylim(0,100);

Compression - Using 10 PC - Digit 8
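
The compressed digits on this slide and the next could be produced along these lines: fit PCA, transform, then inverse_transform back to the 64-pixel space. A sketch, picking the first sample labeled 8; set n_components=2 for the next slide.

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

mnist = load_digits()
X, y = mnist.data, mnist.target

pca = PCA(n_components=10)                     # keep 10 principal components
X_compressed = pca.inverse_transform(pca.fit_transform(X))

idx = np.where(y == 8)[0][0]                   # first sample of the digit 8
fig, axes = plt.subplots(1, 2)
axes[0].imshow(X[idx].reshape(8, 8), cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(X_compressed[idx].reshape(8, 8), cmap='gray')
axes[1].set_title('10 PCs')
for ax in axes:
    ax.axis('off')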

Compression - Using 2 PC - Digit 8

Plot PC1 and PC2
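
One way the PC1 vs PC2 plot on this slide could be produced; each point is a digit image projected onto the first two components, colored by its label.

Code
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

mnist = load_digits()
X, y = mnist.data, mnist.target

X_pca = PCA(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=10)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(label='digit');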