CISC482 - Lecture 24

The Recap

Dr. Jeremy Castagno

Class Business

Schedule

Today

  • Part 1
  • Part 2
  • Part 3
  • Part Future?

Triangle of Doom

Part 1 - Stats

flowchart LR
    A[Stats and Prob] --> B[Desc. Stats]
    A --> C[Distributions]
    A --> D[Population Inference]

    E[Data Exploration] --> F[Plotting]
    E --> G[Dataframes]
    E --> H[Exploratory Data Analysis]

Key Takeaways 1

  • Data science relies on statistics to make data-driven decisions
  • Sampling methods are used to efficiently collect data and reduce bias
  • We use descriptive statistics to describe our samples
    • Measures of: center, spread, position, shape (a minimal sketch follows)
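
A minimal sketch of these four kinds of measures using pandas; the sample values are made up:

```python
import pandas as pd

# Hypothetical sample of duck weights in grams
weights = pd.Series([1420, 1510, 1480, 1605, 1390, 1550, 1470])

print(weights.mean(), weights.median())               # center
print(weights.std(), weights.max() - weights.min())   # spread
print(weights.quantile([0.25, 0.5, 0.75]))            # position (quartiles)
print(weights.skew(), weights.kurtosis())             # shape
```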

Key Takeaways 2

  • Inferential statistics are methods that result in conclusions and estimates about the population based on a sample
  • Estimating a Population Mean with a CONFIDENCE INTERVAL
  • Hypothesis Testing - rejecting or failing to reject the null hypothesis (see the sketch below)
    • Example null hypothesis: ducks on our lake have a mean weight of 1500 grams.
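
A minimal sketch of both ideas using scipy.stats; the duck weights here are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of duck weights in grams
sample = np.array([1450, 1520, 1480, 1610, 1390, 1555, 1470, 1530])

# 95% confidence interval for the population mean weight
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))

# One-sample t-test of H0: "mean weight is 1500 grams"
t_stat, p_value = stats.ttest_1samp(sample, popmean=1500)

print(ci)               # (lower, upper) bounds in grams
print(t_stat, p_value)  # reject H0 only if p_value < 0.05
```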

Key Takeaways 3

  • Working with data is a multi-step process
  • We usually get a dataset as a giant table, often in a pandas dataframe
  • Each row is a sample; the columns are the features for that sample
  • We first do Exploratory Data Analysis (EDA) - see the sketch below
    • Trying to understand our data.
    • What is the shape of these features? How are they related?
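
A minimal EDA sketch with pandas; "data.csv" is a placeholder for whatever dataset you are exploring:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder dataset

df.info()             # dtypes and missing-value counts per column
print(df.describe())  # center, spread, and position of numeric features
print(df.head())      # eyeball the first few rows

# How are the features related? Correlations are a quick first look.
print(df.select_dtypes("number").corr())

df.hist(figsize=(10, 8))  # shape of each numeric feature (needs matplotlib)
```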

Key Takeaways 3 (Continued)

  • We also
    • clean
    • enrich
    • validate
  • This preps us to start creating models for prediction (a minimal cleaning sketch follows)
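
A minimal clean/enrich/validate sketch; the "weight" column is hypothetical:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder dataset

# Clean: drop duplicate rows and rows missing the response variable
df = df.drop_duplicates().dropna(subset=["weight"])

# Enrich: derive a new feature from an existing one
df["weight_kg"] = df["weight"] / 1000

# Validate: fail fast on values outside the plausible range
assert (df["weight"] > 0).all(), "weights must be positive"
```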

Part 2 - Regression, Evaluation

flowchart LR
    A[Regression] --> B[Linear Regression]
    A --> C[Multiple Linear Regression]
    A --> D[Polynomial Regression]
    A --> Z[Logistic Regression]

    E[Model Evaluation] --> F[RMSE, R^2]
    E --> G[Precision, Recall]
    E --> H[Train, Valid, Test]

Key Takeaways 1

  • An input feature, \(X\), is used to predict …
  • an output feature has values that vary in response to variation in some other feature(s).
    • We often call this the response variable
  • If the output feature is continuous, we call this a regression

Key Takeaways 2

  • Set of data: \(\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\)
    • x: input, y: output
  • Model: \(y = \beta_0 + \beta_1 x + \epsilon\); Prediction: \(\hat{y} = \beta_0 + \beta_1 x\)
    • \(\hat{y}\) = prediction
    • \(\beta_0\) = y-intercept
    • \(\beta_1\) = slope
    • \(\epsilon\) = error
  • \(\beta_1\) (slope) = \(\frac{\sum\limits_{i=1}^{n}[(x_i-\bar{x})(y_i- \bar{y})]}{\sum\limits_{i=1}^{n} (x_i - \bar{x})^2}\)
  • \(\beta_0\) (intercept) = \(\bar{y} - \beta_1 \bar{x}\) (see the sketch below)
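
A minimal sketch computing both coefficients straight from these formulas on made-up data, checked against np.polyfit:

```python
import numpy as np

# Made-up data points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Slope and intercept from the formulas above
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(beta1, beta0)
print(np.polyfit(x, y, deg=1))  # returns [slope, intercept] -- should match
```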

Key Takeaways 3

  • Assumptions of Linear Regression (a residual-check sketch follows)
    • \(x\) and \(y\) have a linear relationship.
    • The residuals of the observations are independent.
    • The mean of the residuals is 0 and the variance of the residuals is constant.
    • The residuals are approximately normally distributed.
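
A minimal sketch checking two of these assumptions on made-up data:

```python
import numpy as np
from scipy import stats

# Made-up data; fit a line, then inspect the residuals
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
beta1, beta0 = np.polyfit(x, y, deg=1)

residuals = y - (beta0 + beta1 * x)

print(residuals.mean())          # should be close to 0
print(stats.shapiro(residuals))  # normality test: large p-value = looks normal
```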

Key Takeaways 4

  • \(R^2\) and RMSE for Regression model evaluation
  • Precision, Recall, F1 Score for Classification Evaluation (see the sketch below)
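
A minimal sketch of all five metrics with sklearn.metrics; the true and predicted values are made up:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, r2_score,
                             precision_score, recall_score, f1_score)

# Regression: RMSE and R^2
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])
print(np.sqrt(mean_squared_error(y_true, y_pred)), r2_score(y_true, y_pred))

# Classification: precision, recall, F1
labels = [1, 0, 1, 1, 0, 1]
preds  = [1, 0, 0, 1, 1, 1]
print(precision_score(labels, preds),
      recall_score(labels, preds),
      f1_score(labels, preds))
```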

Key Takeaways 5

  • Underfit - model is too simple to fit the data well.
  • Overfit - model is too complex; it fits the training data too closely and generalizes poorly to new data.

Key Takeaways 6

  • Training data is used to fit a model.
    • Use this dataset to do K-Fold Cross Validation
  • Test data is used to evaluate final model performance and compare different models.
  • Typical ratios for this split: 80/20 or 70/30 (see the sketch below)
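
A minimal sketch of the split-then-cross-validate workflow in sklearn, using its built-in diabetes dataset:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# 80/20 split: the test set is held out until the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()

# 5-fold cross validation on the training data only
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())

# Final evaluation on the untouched test set
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```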

Part 3 - Supervised and Unsupervised Learning

flowchart LR
    A[Supervised Learning] --> B[Linear Regression]
    A --> C[KNN]
    A --> D[Naive Bayes]
    A --> Z[SVM]

    E[Unsupervised Learning] --> F[Clustering]
    F --> I[K-Means]
    F --> J[AHC]
    E --> G[PCA]

Key Takeaways 1

  • Regression - Given \(x_1, x_2, ..., x_n\), predict a continuous output, \(y\)
  • Classification - Given \(x_1, x_2, ..., x_n\), predict a grouping/class, \(y\)
  • Regression
    • Linear Regression (Multiple, Polynomial, etc)
    • K Nearest Neighbors (KNN)
    • We use the library sklearn to build our models for us (see the sketch below).
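
A minimal sketch fitting both model types with sklearn on a tiny made-up dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up dataset: one input feature, one continuous output
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

linear = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)

print(linear.predict([[2.5]]))  # prediction from the fitted line
print(knn.predict([[2.5]]))     # average of the 2 nearest neighbors
```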

Key Takeaways 2

  • Logistic Regression works great for simple linear classifications
  • SVMs are awesome and can do non-linear separations, but you need to try different hyperparameters (see the grid-search sketch below)
    • Kernel - rbf, polynomial, linear
    • C - try values from 0.0001 to 10_000
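
A minimal sketch of that hyperparameter search with sklearn's GridSearchCV; the toy dataset and exact grid are illustrative:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy non-linear classification problem
X, y = make_moons(noise=0.2, random_state=0)

param_grid = {
    "kernel": ["rbf", "poly", "linear"],
    "C": np.logspace(-4, 4, 9),  # 0.0001 ... 10_000
}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```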

Key Takeaways 3

  • K-Means clustering works only for easy clustering problems
  • If your clusters have complicated shapes or densities, stick with AHC or DBSCAN (see the comparison below)
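
A minimal comparison on sklearn's two-moons toy data, where single-linkage AHC and DBSCAN can recover the shapes that K-Means cannot:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: a "complicated shape" for K-Means
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ahc_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
db_labels = DBSCAN(eps=0.3).fit_predict(X)

# Scatter-plot X colored by each label set to see the difference
```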

Key Takeaways 4

  • Principal component analysis, or PCA, is a dimensionality reduction technique.
  • It can be used to compress data (see the sketch below).
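
A minimal compression sketch using sklearn's PCA on its built-in digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64 pixel features per image

pca = PCA(n_components=8)
compressed = pca.fit_transform(X)             # 8 features per image
reconstructed = pca.inverse_transform(compressed)

print(pca.explained_variance_ratio_.sum())    # fraction of variance kept
```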

Final Words

Happy

  • I have really enjoyed teaching this course, and I hope you enjoyed taking it!

What are some concepts that were completely new to you in this course?

Future

  • Data science, machine learning, and artificial intelligence are not going anywhere
  • This course has laid a small foundation for you to build upon.
  • Many jobs are becoming automated or enhanced by AI and Machine Learning.
  • In order to stay competitive in the workforce, you should have a basic understanding of these technologies and learn how to use them.

The rise of AI

  • Large Language Models (ChatGPT) and AI image generation (Midjourney) are the start of big changes to come
  • Jobs affected - writing/journalism, digital artists, even junior coding jobs

You have the skills!

  • You have more understanding of how AI works than most others
  • It's just pattern matching. That's it. Features come in, we apply some mathematical models, response variables come out.
  • There is a bright future ahead of you! Even if you don't pursue data science, you are now more prepared to use these tools effectively!