CISC482 - Lecture09

Exploratory Data Analysis

Dr. Jeremy Castagno

Class Business

Schedule

Reading 5-1: Feb 22 @ 12PM, Wednesday
Reading 5-2: Feb 24 @ 12PM, Friday
Topic Ideas - Feb 22 @ Midnight

Today

Review Exam
Review Topic Ideas
More Exploratory Data Analysis

Exam Review

Most Missed Questions

Student spending at the bookstore is normally distributed with a mean of $130 and a standard deviation of $30. Approximately what percentage of students spend less than $100 a month?
- 16, 20, 33, 50
Springfield College wants to give gift cards to the top 3% of spenders. What should the monthly spending cutoff be to determine which students get gift cards?
- 130, 160, 186, 220
Hypothesis testing: given a p-value of 0.01 and a 95% confidence level, decide whether to reject or accept the null hypothesis.

Normal Distribution

Normal Distribution Solution

50% of students spend more than $130. 34% of students spend between $100 and $130. Therefore 50+34=84% of students spend more than $100, and 100-84=16% of students spend less than $100.
50% of students spend less than $130. Roughly 34+14=48% of students spend between $130 and $190. Therefore 50+48=98% of students spend less than $190, so the top 2% spend $190 or more. For the top 3%, the cutoff will be just a little less than $190. Of the options, 186 is the best one.
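
The rule-of-thumb percentages above can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative and not part of the exam.

```python
from statistics import NormalDist

# Bookstore spending model from the exam question: mean $130, sd $30.
spending = NormalDist(mu=130, sigma=30)

# Question 1: percentage of students spending less than $100
# (one standard deviation below the mean).
pct_below_100 = spending.cdf(100) * 100

# Question 2: spending level that only the top 3% exceed (97th percentile).
cutoff_top_3 = spending.inv_cdf(0.97)

print(f"Spend less than $100: {pct_below_100:.1f}%")
print(f"Top 3% cutoff: ${cutoff_top_3:.0f}")
```

The exact values come out to roughly 15.9% and about $186, which agree with the answers 16 and 186.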

Hypothesis Testing Solution

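A 95% confidence level corresponds to a significance level of 0.05.
Reject the null hypothesis when the p-value is less than 0.05 and accept the alternative hypothesis.
Here 0.01 < 0.05, so we reject the null hypothesis.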
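
The decision rule can be written as a short comparison. This is a minimal sketch with the numbers from the exam question; the variable names are illustrative.

```python
# Values from the exam question.
p_value = 0.01
confidence_level = 0.95

# The significance level is the complement of the confidence level.
alpha = 1 - confidence_level

# Reject the null hypothesis when the p-value falls below alpha.
reject_null = p_value < alpha
print("Reject the null hypothesis" if reject_null else "Do not reject the null hypothesis")
```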

Topic Ideas

Purpose

Find 2-3 datasets that you are interested in and that are connected to your core signature assignments
Verify the data is suitable for analysis
Practice writing concisely and clearly about complex topics

Requirements

Data Requirements
Paper Requirements

Data Requirements

At least 100 observations
At least 8 columns
At least 6 of the columns must be useful and unique predictor variables.
At least one variable that can be identified as a reasonable response variable
- The response variable can be quantitative or categorical.
Observations should reasonably meet the independence condition.
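
The size-based requirements can be screened automatically before you commit to a dataset. The helper below is a hypothetical sketch using pandas (`check_data_requirements` and its parameters are not part of the assignment). It only counts rows and columns and flags constant columns; whether a predictor is actually useful, and whether observations are independent, still needs your own judgment.

```python
import pandas as pd

def check_data_requirements(df, response, min_rows=100, min_cols=8, min_predictors=6):
    """Return pass/fail flags for the size-based data requirements."""
    predictors = [c for c in df.columns if c != response]
    # A column with a single repeated value cannot help predict anything.
    varying = [c for c in predictors if df[c].nunique(dropna=True) > 1]
    return {
        "enough_rows": len(df) >= min_rows,
        "enough_columns": df.shape[1] >= min_cols,
        "enough_predictors": len(varying) >= min_predictors,
        "has_response": response in df.columns,
    }

# Tiny made-up frame, only to show the call; it fails the size checks on purpose.
demo = pd.DataFrame({"x1": [1, 2, 3], "x2": [5, 5, 5], "y": [0, 1, 0]})
result = check_data_requirements(demo, response="y")
print(result)
```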

Paper Requirements

Introduction
Research Question
Glimpse of Data

Introduction Section

State source of data
Describe when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Describe the observations and the general characteristics being measured in the data.
Describe how the data set connects to your Core Studies.

Research Question

Describe a research question you’re interested in answering using this data
- Example: Can you accurately predict whether a person will survive in the Titanic data set? What features would be most important to make that prediction?

Glimpse of Data

Please print out the results of the info() function of the dataframe
Also print out the first few rows of data with head()
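
A minimal sketch of the two calls, assuming pandas. The frame is built by hand from the five penguin rows shown in the Example Data slide of this lecture; in your own paper you would load your dataset instead.

```python
import pandas as pd

# First five rows of the penguin example shown in this lecture.
df = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, None, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, None, 19.3],
    "flipper_length_mm": [181.0, 186.0, 195.0, None, 193.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, None, 3450.0],
    "sex": ["male", "female", "female", None, "female"],
    "year": [2007] * 5,
})

df.info()          # column names, non-null counts, and dtypes
print(df.head())   # first rows (up to five by default)
```

Note that `info()` prints its report itself and returns `None`, so wrapping it in `print()` adds an extra `None` line to the output.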

Example Template

Brightspace assignment
Link to template to follow
Copy and put in your shared google drive folder
Print PDF and submit to brightspace

Exploratory Data Analysis (EDA)

Steps

Understand the Data
1. Size of the dataset (rows,cols), features (categorical, numerical).
Identify Relationships between features
1. Direction and strength of correlation
Describe the shape of the data
1. Symmetric, Skewed
Detect outliers and missing values
1. Box and Whisker!
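
A box-and-whisker plot conventionally flags points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule with NumPy; the measurements are made up for illustration only.

```python
import numpy as np

# Made-up measurements with one suspicious value, for illustration only.
values = np.array([10, 11, 11, 12, 12, 13, 50])

# Quartiles and interquartile range (the "box").
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; points outside are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)
```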

Understand the Data

Number of Rows, Columns?
What features are categorical vs numeric?
Any missing data?
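
Each of these questions maps to a short pandas call. The sketch below is one way to answer them, using four columns from the penguin rows shown in this lecture; the variable names are illustrative.

```python
import pandas as pd

# Four columns from the penguin preview rows in this lecture.
df = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, None, 36.7],
    "sex": ["male", "female", "female", None, "female"],
    "year": [2007] * 5,
})

# Number of rows and columns.
n_rows, n_cols = df.shape

# Split features by type: numeric versus everything else (treated as categorical).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Count of missing values per column.
missing_per_col = df.isna().sum()

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print(missing_per_col)
```

A column such as `year` is stored as a number but may be better treated as a category, so the automatic split is only a starting point.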

Example Data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
None

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.10	18.70	181.00	3,750.00	male	2007
1	Adelie	Torgersen	39.50	17.40	186.00	3,800.00	female	2007
2	Adelie	Torgersen	40.30	18.00	195.00	3,250.00	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.70	19.30	193.00	3,450.00	female	2007

Relationship 1

Relationship 2

Relationship 3

Quantifying the Relationship

Covariance - measure of the joint variability of two random variables
The sign of the covariance shows whether the relationship is positive or negative
$Cov(X,Y)= \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{n}$
The magnitude doesn’t tell you much. We need a way to normalize it. Correlation to the rescue!
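
The covariance formula can be computed directly. The sketch below uses the first two columns of this lecture's example data; the `covariance` helper is illustrative. It also shows why the magnitude is hard to interpret: rescaling one variable (for example, changing its units) rescales the covariance by the same factor.

```python
import numpy as np

# First two columns of the lecture's example data.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 4, 3.5, 7.5, 10.2, 12.1])

def covariance(a, b):
    """Covariance as written in the formula above (divides by n)."""
    return np.sum((a - a.mean()) * (b - b.mean())) / len(a)

cov_xy = covariance(x, y)              # positive: y rises with x
cov_x_neg_y = covariance(x, -y)        # same size, opposite sign
cov_x_100y = covariance(x, 100 * y)    # 100 times larger, same relationship

# NumPy divides by n - 1 by default; bias=True switches to dividing by n.
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_xy, cov_x_neg_y, cov_x_100y, cov_numpy)
```

Because `np.cov` divides by n - 1 unless told otherwise, its default result for these two columns is slightly larger than the value from the formula above.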

Correlation

A metric between -1 and +1
$r(X,Y)= \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2 \cdot \sum_{i=1}^n(y_i - \bar{y})^2}}$
$r(X,Y)= \frac{Cov(X,Y)}{\sigma_x \cdot \sigma_y}$
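
The sketch below computes the correlation from the sum-of-products form and compares it with `np.corrcoef`, using the first two columns of this lecture's example data. The `correlation` helper is illustrative. Unlike covariance, the result does not change when a variable is rescaled.

```python
import numpy as np

# First two columns of the lecture's example data.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 4, 3.5, 7.5, 10.2, 12.1])

def correlation(a, b):
    """Pearson correlation from deviations about the means."""
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da * db) / np.sqrt(np.sum(da**2) * np.sum(db**2))

r_xy = correlation(x, y)
r_scaled = correlation(x, 100 * y)   # rescaling y leaves r unchanged
r_neg = correlation(x, -y)           # flipping y flips the sign of r
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_xy, r_scaled, r_neg, r_numpy)
```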

Example Data

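import numpy as np

# Columns: x, y, -y (a mirrored copy of y), and a constant.
data = np.array([
  [1, 2,   -2,     3],
  [2, 4,   -4,     3],
  [3, 3.5, -3.5,   3],
  [4, 7.5, -7.5,   3],
  [5, 10.2, -10.2, 3],
  [6, 12.1, -12.1, 3],
])
# Sample covariance matrix; rowvar=False treats each column as a variable.
cov = np.cov(data, rowvar=False)
print('Cov:', np.array2string(cov, prefix='Cov: ', formatter={'float_kind':lambda x: f"{x:5.1f}"}))
# Sample variances (ddof=1) match the first two diagonal entries of cov.
print(np.var(data[:, 0], ddof=1))
print(np.var(data[:, 1], ddof=1))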

Cov: [[  3.5   7.3  -7.3   0.0]
      [  7.3  16.3 -16.3   0.0]
      [ -7.3 -16.3  16.3   0.0]
      [  0.0   0.0   0.0   0.0]]
3.5
16.307

Example Correlation Graph

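import seaborn as sns

# Reuses `np` and `data` from the previous example.
corr = np.corrcoef(data, rowvar=False)
# Diverging colormap with the correlation value annotated in each cell.
dataplot = sns.heatmap(corr, cmap="vlag", annot=True)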
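
One detail worth knowing about this example: the fourth column is constant, so its standard deviation is zero and its correlation with every column is undefined. The sketch below shows that `np.corrcoef` returns NaN there, and one possible workaround (dropping constant columns before computing correlations). The `keep` mask is an illustrative choice, not something from the lecture.

```python
import numpy as np

# Same example data as in the lecture.
data = np.array([
    [1, 2,    -2,    3],
    [2, 4,    -4,    3],
    [3, 3.5,  -3.5,  3],
    [4, 7.5,  -7.5,  3],
    [5, 10.2, -10.2, 3],
    [6, 12.1, -12.1, 3],
])

# Dividing by a zero standard deviation yields NaN; silence the warnings here.
with np.errstate(divide="ignore", invalid="ignore"):
    corr = np.corrcoef(data, rowvar=False)
print(np.isnan(corr[:, 3]).all())

# Possible workaround: keep only columns that actually vary.
keep = data.std(axis=0) > 0
corr_clean = np.corrcoef(data[:, keep], rowvar=False)
print(corr_clean.shape)
```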

Secret Weapon

Now it's time to show you my secret weapon
Don't tell anyone : )
What I am about to show you single-handedly landed me an offer to work at a big company in California
Now for the story

The Story

Background

My job was to predict where on the airport runway the airplane was using cameras
Specifically, the airplane's lateral position on the runway

The Story (Continued)

I created a model that predicted cross-track position.
- $\hat{CT} = f(\text{FlightState}) = f(\theta, \psi, \phi, \text{image, etc.})$
We had the true CT position (from GPS) to compare my model against. $Error = \hat{CT} - CT$
It worked great most of the time!
However, we kept getting really large errors sometimes
No obvious pattern?
WHY!?!?!

Data

	ct_error	roll	pitch	yaw	downtracks	crosstracks
0	1.63	7.99	2.55	3.91	0.00	0.00
1	3.33	2.93	-2.10	3.89	4.02	1.10
2	2.43	0.12	-0.72	-0.41	8.03	2.20
3	3.24	-1.19	-2.21	-15.68	12.05	3.30
4	3.16	4.12	-3.75	-5.25	16.06	4.39

Pair Plot

sns.pairplot(df_bad)

More Data!

	ct_error	roll	pitch	yaw	downtracks	crosstracks	alt
0	1.63	7.99	2.55	3.91	0.00	0.00	203.27
1	3.33	2.93	-2.10	3.89	4.02	1.10	206.84
2	2.43	0.12	-0.72	-0.41	8.03	2.20	205.51
3	3.24	-1.19	-2.21	-15.68	12.05	3.30	207.36
4	3.16	4.12	-3.75	-5.25	16.06	4.39	207.70

The Error

Long story short, the GPS altitude was broken!
It was reporting that the airplane was off the ground!
My algorithm took into account your height off the ground.
At first, my bosses would not believe me! It was a $10,000 GPS!
But the correlation plots and all my subsequent research convinced them

The True Error

Someone forgot to renew the subscription service for the High Precision GPS!

Recap EDA

Key Workflow and Graphs

Descriptive Statistics
- Means, quartiles, etc. for each feature
Histogram of any interesting features
Shape of Data
Missing Data
Find relationship (Correlation)
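
The workflow above can be sketched in a few pandas calls. The frame below holds only the five preview rows from the "More Data!" slide, so the numbers it produces are far too few for any real conclusion and are not meant to reproduce the finding from the story; the point is the sequence of calls. Variable names are illustrative.

```python
import pandas as pd

# The five preview rows from the "More Data!" slide (illustration only).
df = pd.DataFrame({
    "ct_error": [1.63, 3.33, 2.43, 3.24, 3.16],
    "roll": [7.99, 2.93, 0.12, -1.19, 4.12],
    "pitch": [2.55, -2.10, -0.72, -2.21, -3.75],
    "yaw": [3.91, 3.89, -0.41, -15.68, -5.25],
    "downtracks": [0.00, 4.02, 8.03, 12.05, 16.06],
    "crosstracks": [0.00, 1.10, 2.20, 3.30, 4.39],
    "alt": [203.27, 206.84, 205.51, 207.36, 207.70],
})

summary = df.describe()      # descriptive statistics: mean, quartiles, etc.
skewness = df.skew()         # shape: sign indicates direction of skew
missing = df.isna().sum()    # missing values per column

# Relationships: rank features by strength of correlation with the error.
corr_with_error = (
    df.corr()["ct_error"]
    .drop("ct_error")
    .sort_values(key=abs, ascending=False)
)

print(summary)
print(skewness)
print(missing)
print(corr_with_error)
```

On a full dataset, `df.hist()` and `sns.pairplot(df)` would add the histogram and pair-plot views used earlier in the lecture.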

Class Activity

Work on your Topic Idea!