CISC482 - Lecture03

Statistics

Dr. Jeremy Castagno

Class Business

Schedule

  • Reading 2-1: Jan 25 @ 12PM, Wed
  • Reading 2-2: Jan 27 @ 12PM, Friday
  • Reading 2-3: Feb 1 @ 12PM, Wed
  • HW2: Feb 1 @ Midnight, Wed

Data Collection

Stats for data science

  • Data science relies on statistics to make data driven decisions
  • Sampling methods are used to efficiently collect data and reduce bias

Note

Bias - anything that leads to a systematic difference between the true parameters of a population and the statistics used to estimate those parameters. E.g., sampling only men.
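A minimal sketch of the "sampling only men" example, using made-up numbers: the population below is synthetic, and the means, spreads, and sample sizes are assumptions chosen only for illustration. The men-only sample gives an estimate that is systematically too high, while a random sample from the whole population lands close to the true mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Synthetic population (made-up numbers, heights in feet)
men = rng.normal(loc=5.8, scale=0.25, size=5000)
women = rng.normal(loc=5.4, scale=0.25, size=5000)
population = np.concatenate([men, women])

true_mean = population.mean()

# Biased sample: drawn only from the men
biased_sample = rng.choice(men, size=200, replace=False)
# Random sample: drawn from the whole population
random_sample = rng.choice(population, size=200, replace=False)

print(f"Population mean:   {true_mean:.2f}")
print(f"Men-only estimate: {biased_sample.mean():.2f}")
print(f"Random estimate:   {random_sample.mean():.2f}")
```

Collecting more men would not fix the men-only estimate: the error is systematic, not a matter of sample size.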

It’s hard…

Sampling Errors

  • A small, systematic polling error made a big difference
  • All polls were off in the same direction (5 pts) in swing states
  • Correlated sampling errors in the Midwest
  • Failure to appreciate uncertainty

Stats for data science

  • Descriptive statistics -> explore visualizations
  • Inferential statistics -> modeling and estimation
  • Statistics is foundational for ensuring results are interpreted correctly

Sampling

  • Population - Entire set of individuals, items, or events of interest
  • Observational Unit - individual item or event
  • Sample - subset of observational units from the population

Types of Sampling

  • Random Sampling
    • Observational units (OU) are selected at random
  • Stratified Sampling
    • Population is divided into groups (primary feature). Each group is sampled.
  • Cluster Sampling
    • Population is divided into groups (not by a primary feature, e.g., geography). Whole groups are selected at random and sampled.
  • Systematic Sampling
    • Every k’th observational unit is sampled
  • Convenience Sampling
    • OU are selected because they are the easiest to reach
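The five sampling types above can be sketched with NumPy. This is only an illustration: the population of 100 units, the group labels `A` to `D`, and the sample sizes are hypothetical, and the cluster example keeps every unit in each chosen group (one-stage cluster sampling).

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population: 100 observational units, each with a group label
ids = np.arange(100)
groups = np.repeat(["A", "B", "C", "D"], 25)  # 25 units per group

# Random sampling: pick 12 units at random, without replacement
random_sample = rng.choice(ids, size=12, replace=False)

# Stratified sampling: pick 3 units at random from every group
stratified_sample = np.concatenate([
    rng.choice(ids[groups == g], size=3, replace=False)
    for g in np.unique(groups)
])

# Cluster sampling: pick 2 whole groups at random and keep all their units
chosen_clusters = rng.choice(np.unique(groups), size=2, replace=False)
cluster_sample = ids[np.isin(groups, chosen_clusters)]

# Systematic sampling: every k-th unit, starting from a random offset
k = 10
start = rng.integers(0, k)
systematic_sample = ids[start::k]

# Convenience sampling: whatever is easiest to reach, here the first 12 units
convenience_sample = ids[:12]

print("Random:     ", np.sort(random_sample))
print("Stratified: ", np.sort(stratified_sample))
print("Clusters:   ", chosen_clusters)
print("Systematic: ", systematic_sample)
print("Convenience:", convenience_sample)
```

Note that the convenience sample contains only units from group `A`, which is exactly the kind of systematic skew that the other methods try to avoid.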

Observational vs Experimental

  • Observational
    • Observing or collecting data
    • Not trying to control or influence an outcome
  • Experimental
    • You are controlling a variable
    • You manipulate that variable to get a different response

Descriptive Statistics

Overview

  • Terminology
  • Measure of center
  • Measure of spread
  • Measure of position
  • Measure of shape

Terminology

  • Descriptive Statistics - summarize and describe a feature's important characteristics
  • distribution - the possible values the feature can take and how often each occurs
  • cluster - a distinct group of neighboring values in a distribution
  • tails - the end values of a distribution

Terminology - Visualized

Measure of Center

  • Mean - average, sum of all values divided by the total number of values, n \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i} \]
  • Median - the middle value of the ordered data

Code Example!

heights = [5.5, 5.7, 5.8, 5.9]
avg_height = sum(heights) / len(heights)
print(f"Average Height: {avg_height:.2f}")
Average Height: 5.72


Using NumPy

import numpy as np
heights = np.array([5.5, 5.7, 5.8, 5.9, 6.2])
avg_height = np.average(heights)
print(f"Average Height: {avg_height:.2f}")
Average Height: 5.82

What is NumPy

  • NumPy is the fundamental package for scientific computing in Python.
  • Python library that provides a multidimensional array object.
  • Stores data very efficiently and is very fast
a = np.array([1, 2])

2D array - rows and columns

b = np.array([
  [1, 2], # first row
  [3, 4]  # second row
])
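A short sketch of working with the 2D array above. The `axis` argument picks the direction to average over; the printed values are left for you to check by running it.

```python
import numpy as np

b = np.array([
    [1, 2],  # first row
    [3, 4],  # second row
])

print(b.shape)         # (rows, columns)
print(b[0, 1])         # element in the first row, second column
print(b.mean(axis=0))  # mean of each column
print(b.mean(axis=1))  # mean of each row
```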

Question

np.average(...) - finds the average

Tip

What function would you use to get the median?

heights = np.array([5.5, 5.7, 5.8, 5.9, 6.2])
median_height = np.median(heights)
print(f"Median Height: {median_height:.2f}")
Median Height: 5.80

Descriptive Stats

Spread

Spread Terminology

  • range - distance between the min and max
  • interquartile range (IQR) - range of the middle 50%
  • variance - the average squared distance between a feature's values and the mean of its distribution

\[ \sigma^2 = Var(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x} )^2}{n-1} \]

  • Standard deviation - \(\sigma = \sqrt{Var(x)}\)
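The spread measures above can be sketched on the small heights array from earlier. One assumption to flag: `ddof=1` is passed so that NumPy divides by n - 1, as in the variance formula above; NumPy's default (`ddof=0`) divides by n instead.

```python
import numpy as np

heights = np.array([5.5, 5.7, 5.8, 5.9, 6.2])

# Range: distance between the min and max
value_range = heights.max() - heights.min()

# IQR: distance between the 75% and 25% quantiles
q1, q3 = np.quantile(heights, [0.25, 0.75])
iqr = q3 - q1

# Sample variance with n - 1 in the denominator (ddof=1)
variance = np.var(heights, ddof=1)
std = np.sqrt(variance)

print(f"Range: {value_range:.2f}")
print(f"IQR: {iqr:.2f}")
print(f"Variance: {variance:.3f}")
print(f"Standard deviation: {std:.3f}")
```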

Position Terminology (Quantiles)

Example

Measures of Shape

  • Skewness - measure of the amount and direction of skew
  • Kurtosis - measure of tail heaviness
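For reference, one common moment-based convention for these two measures is sketched below. Assumption: these are the biased forms (dividing by n rather than n - 1), which is what `scipy.stats.skew` and `scipy.stats.kurtosis` compute with their default arguments; other texts use bias-corrected variants.

\[ m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k, \qquad \text{Skewness} = \frac{m_3}{m_2^{3/2}}, \qquad \text{Excess kurtosis} = \frac{m_4}{m_2^{2}} - 3 \]

Positive skewness points to a longer right tail and negative skewness to a longer left tail. The "- 3" shifts the scale so that a normal distribution has an excess kurtosis of 0.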

Code

Mean

                            # mean,  std,    n
samples = np.random.normal(loc=5.0, scale=1, size=11)
samples
array([6.624, 4.388, 4.472, 3.927, 5.865, 2.698, 6.745, 4.239, 5.319,
       4.751, 6.462])

How would I get the mean?

print(np.mean(samples))
5.044609002634907

Standard Deviation

std = np.std(samples)  # default ddof=0 divides by n; pass ddof=1 to divide by n - 1
print(f"The standard deviation is: {std:.2f}")
The standard deviation is: 1.22

The variance?

var = std * std
print(f"The variance is: {var:.2f}")
The variance is: 1.49

Quantiles

samples_sorted = np.sort(samples)
print(samples_sorted)
[2.698 3.927 4.239 4.388 4.472 4.751 5.319 5.865 6.462 6.624 6.745]

Getting the quantile

fifty_percent_quantile = np.quantile(samples, 0.50)
median = np.median(samples)

print(f"{fifty_percent_quantile:.3f}")
print(f"{median:.3f}")
4.751
4.751
quantiles = np.quantile(samples, [0.25, 0.5, 0.75])

print(quantiles)
[4.314 4.751 6.164]
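To see where values such as 4.314 come from, here is a sketch of the linear interpolation that `np.quantile` uses by default. The helper `manual_quantile` is a hypothetical name introduced for illustration, and the array is the sorted output printed above, so it is rounded to three decimals.

```python
import numpy as np

# The sorted samples as printed above (rounded to three decimals)
samples_sorted = np.array([2.698, 3.927, 4.239, 4.388, 4.472, 4.751,
                           5.319, 5.865, 6.462, 6.624, 6.745])

def manual_quantile(sorted_values, q):
    """Quantile by linear interpolation between neighboring sorted values."""
    position = q * (len(sorted_values) - 1)  # fractional index into the array
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    weight = position - lower                # how far we are toward `upper`
    return (1 - weight) * sorted_values[lower] + weight * sorted_values[upper]

for q in [0.25, 0.5, 0.75]:
    print(f"{q:.2f}: manual={manual_quantile(samples_sorted, q):.3f}, "
          f"numpy={np.quantile(samples_sorted, q):.3f}")
```

With 11 values, the 25% quantile sits at index 2.5, halfway between the third and fourth sorted values (4.239 and 4.388).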

Shape

For more advanced statistics, I recommend the SciPy library. It is built on top of NumPy and offers more functionality.

from scipy.stats import skewnorm
samples_skewed = skewnorm.rvs(3, loc=15, scale=2, size=1000)
samples = skewnorm.rvs(0, loc=5, scale=1, size=1000)

Skewness and Kurtosis

Calculate the skewness

import scipy.stats as stats
print(f"Skewed sample data set, skewness is: {stats.skew(samples_skewed):.3f}")
print(f"Normal sample data set, skewness is: {stats.skew(samples):.3f}")
Skewed sample data set, skewness is: 0.790
Normal sample data set, skewness is: -0.003

Calculate the kurtosis

import scipy.stats as stats
print(f"Skewed sample data set, kurtosis is: {stats.kurtosis(samples_skewed):.3f}")
print(f"Normal sample data set, kurtosis is: {stats.kurtosis(samples):.3f}")
Skewed sample data set, kurtosis is: 0.938
Normal sample data set, kurtosis is: -0.067

Warning

This skewed distribution is not very good at demonstrating kurtosis.
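A heavier-tailed distribution may show kurtosis more clearly. The sketch below swaps in a Laplace distribution, which is not part of the lecture's examples; the seed and sample size are arbitrary choices. Both distributions are symmetric, so the difference comes from the tails rather than from skew.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=42)

normal_samples = rng.normal(loc=0, scale=1, size=100_000)
# Laplace distribution: symmetric like the normal, but with heavier tails
heavy_tailed_samples = rng.laplace(loc=0, scale=1, size=100_000)

# stats.kurtosis returns excess kurtosis by default (fisher=True),
# so a normal distribution scores close to 0
print(f"Normal kurtosis: {stats.kurtosis(normal_samples):.3f}")
print(f"Laplace kurtosis: {stats.kurtosis(heavy_tailed_samples):.3f}")
```

The theoretical excess kurtosis of a Laplace distribution is 3, so its sample value should come out well above the near-zero value for the normal samples.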

All

import scipy.stats as stats
stats.describe(samples)
DescribeResult(nobs=1000, minmax=(1.7616568032476234, 7.787361447950662), mean=5.053665979218166, variance=1.0126662448100565, skewness=-0.0026145821525120016, kurtosis=-0.06654059199852824)