CISC482 - Lecture23

Hierarchical Clustering

Dr. Jeremy Castagno

Class Business

Schedule

Today

  • Recap Peer Review
  • Hierarchical Clustering

Peer Review

Groups

You should have been e-mailed view-only access to your assigned peers' drafts

  • Supriya -> Review Jaydon and Carter
  • Anmol -> Review Carter and Iana
  • Jaydon -> Review Iana and Chhandak
  • Carter -> Review Prashant and Haven
  • Iana -> Review Haven and Chhandak
  • Haven -> Review Prashant and Shivay
  • Chhandak -> Review Shivay and Supriya
  • Prashant -> Review Supriya and Anmol
  • Shivay -> Review Anmol and Jaydon

Tip

How did it go?

Questions 1

  1. What did you learn from reviewing your peers’ work? Were there any insights or perspectives that you gained from seeing other students’ writing or ideas?
  2. How did the peer review process impact your own writing or thinking about the assignment? Did you make any changes to your work based on the feedback you received from your peers?

Questions 2

  1. Looking forward, what advice would you give to someone who is new to peer review? What are some best practices or strategies that you would recommend for someone who wants to give or receive feedback effectively?
  2. What feedback did you receive from your peers? Was it helpful, and if so, why? Did any feedback surprise you, or challenge your assumptions about your own work?

Hierarchical clustering

What is it?

Types of Clustering

  • Agglomerative hierarchical clustering is a bottom-up method in which each sample starts as its own cluster
    • If we have 100 samples, we start with 100 clusters!
    • The two most similar clusters are combined iteratively until all samples belong to a single cluster
  • Divisive hierarchical clustering is the top-down counterpart
    • Start with one cluster containing every sample and split it repeatedly

Tip

Use agglomerative! It lets us observe local patterns first, before larger groups are formed, and it tends to give good results.
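
The agglomerative procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: the function names `single_linkage` and `agglomerate` and the four toy points are assumptions made up for this example, and single linkage is used only because it is the simplest choice.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def single_linkage(a, b):
    """Distance between the closest pair of samples, one from each cluster."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, linkage=single_linkage):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # every sample starts as its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Drop the two merged clusters and append the combined one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Toy data (hypothetical): two tight pairs that are far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
history = agglomerate(points)
for left, right, d in history:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

With n samples there are always n - 1 merges, so the four toy points produce three merges: the two tight pairs first, then the two pairs with each other.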

Measures of Similarity

  • The single linkage method uses the distance between the two most similar (closest) samples, one from each cluster.
  • The complete linkage method uses the distance between the two most different (farthest apart) samples, one from each cluster.
  • The centroid linkage method uses the distance between the centroids of the two clusters.
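
To make the three measures concrete, here is a small sketch that computes each one for the same pair of clusters. The helper names and the two toy clusters are assumptions for illustration; Euclidean distance is assumed throughout.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def single(a, b):
    """Single linkage: distance between the closest pair of samples."""
    return min(dist(p, q) for p in a for q in b)

def complete(a, b):
    """Complete linkage: distance between the farthest pair of samples."""
    return max(dist(p, q) for p in a for q in b)

def centroid(cluster):
    """Coordinate-wise mean of the samples in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_linkage(a, b):
    """Centroid linkage: distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b))

# Toy clusters (hypothetical) on a line, so the numbers are easy to check by hand.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(5.0, 0.0), (9.0, 0.0)]

print(single(cluster_a, cluster_b))            # closest pair: (2, 0) and (5, 0)
print(complete(cluster_a, cluster_b))          # farthest pair: (0, 0) and (9, 0)
print(centroid_linkage(cluster_a, cluster_b))  # centroids: (1, 0) and (7, 0)
```

The same two clusters are 3, 9, or 6 units apart depending on the measure, which is why the choice of linkage can change which clusters get merged first.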

Single Linkage

Complete Linkage

Centroid Linkage

Questions - 1

Which two samples should be used to determine similarity using single linkage? Using complete linkage?

Questions - 1

  • How many total samples?
  • How many total clusters?
  • What is the Euclidean distance between the clusters using complete linkage?

Dendrograms

Terminology

  • The output of a hierarchical clustering algorithm can be visualized using a dendrogram.
  • A dendrogram is a tree that shows the order in which clusters are grouped together and the distances between clusters.
  • Read it from the BOTTOM up!

  • A clade is a branch of a dendrogram, drawn as a vertical line.
  • A link is a horizontal line that connects two clades. Its height gives the distance between the two clusters being merged.
  • A leaf is the terminal end of a clade, which represents a single sample.
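
The information a dendrogram draws can also be written down as a table of links. The sketch below is an illustration under assumptions: the function name `build_links`, the five one-dimensional toy samples, and the choice of complete linkage are all made up for this example. Leaves get ids 0 to n - 1, and each merge creates a new cluster id, which mirrors the layout of SciPy's linkage matrix.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def complete(a, b):
    """Complete linkage: distance between the farthest pair of samples."""
    return max(dist(p, q) for p in a for q in b)

def build_links(points, linkage=complete):
    """Return one (left id, right id, height, size) tuple per link, bottom up."""
    # Leaves are ids 0..n-1; every merge creates the next unused id.
    active = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    links = []
    while len(active) > 1:
        ids = sorted(active)
        # Pick the pair of active clusters with the smallest linkage distance.
        height, a, b = min(
            (linkage(active[a], active[b]), a, b)
            for k, a in enumerate(ids)
            for b in ids[k + 1:]
        )
        active[next_id] = active.pop(a) + active.pop(b)
        links.append((a, b, height, len(active[next_id])))
        next_id += 1
    return links

# Toy data (hypothetical): five samples on a line, leaf ids 0..4.
points = [(1.0,), (2.0,), (6.0,), (7.5,), (15.0,)]
links = build_links(points)
for a, b, height, size in links:
    print(f"link {a} + {b} at height {height:.1f} -> cluster of {size} samples")
```

Reading the printed rows top to bottom corresponds to reading the dendrogram from the bottom up: low links join leaves, and the last link, at the greatest height, joins everything into one cluster.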

  • How many total samples?
  • How many total clusters in the beginning?
  • Which clusters get grouped 1st? 2nd? 3rd? 4th?

Visual 1

Visual 2

Visual 3

Visual 4

Visual 5

Visual 6

Visual 7

Thresholding

  • A dendrogram can be used as a starting point to find the optimal number of clusters.
  • No conclusions about the optimal number of clusters should be made without using more quantitative techniques like the elbow method.
  • We can specify a maximum distance between groups.
  • Clusters joined by a link below that distance are merged
  • Clusters joined by a link above that distance stay separate
  • We call this distance a threshold
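
Cutting a dendrogram at a threshold can be sketched as follows. This is an illustration under assumptions: the function name `cut` and the merge history in `links` are hypothetical toy values (leaf ids 0 to 4, with merge k creating cluster id 5 + k), and a link exactly at the threshold is treated as merged.

```python
def cut(links, n_samples, threshold):
    """Label each sample after applying only the merges at or below the threshold."""
    # One slot per leaf plus one per merge; each id starts as its own root.
    parent = list(range(n_samples + len(links)))

    def find(x):
        # Follow parent pointers up to the root of x's current cluster.
        while parent[x] != x:
            x = parent[x]
        return x

    for k, (a, b, height) in enumerate(links):
        if height <= threshold:  # assumption: a tie with the threshold counts as merged
            new_id = n_samples + k
            parent[find(a)] = new_id
            parent[find(b)] = new_id

    roots = [find(i) for i in range(n_samples)]
    # Renumber the roots 0, 1, 2, ... in order of first appearance.
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Hypothetical merge history for 5 samples: (left id, right id, link height).
links = [(0, 1, 1.0), (2, 3, 1.5), (5, 6, 6.5), (4, 7, 14.0)]

for threshold in (0.5, 3.0, 10.0, 20.0):
    labels = cut(links, n_samples=5, threshold=threshold)
    print(threshold, labels, "->", len(set(labels)), "clusters")
```

Raising the threshold lets more links through, so the number of clusters drops from 5 (no merges) to 3, then 2, then 1. In practice, SciPy's `scipy.cluster.hierarchy.fcluster` with `criterion='distance'` performs this kind of cut on a linkage matrix.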

Threshold 1

Threshold 2

Question Threshold

  • How many total samples?
  • How many clusters would there be with the dashed blue line as the threshold?
  • How many clusters would there be with the dashed red line as the threshold?