CISC482 - Lecture06

Data Wrangling

Dr. Jeremy Castagno

Class Business

Schedule

Reading 3-1: Feb 3 @ 12PM, Today!
Reading 3-2: Feb 8 @ 12PM, Wednesday
Reading 4-1: Feb 10 @ 12PM, Friday
HW3: Feb 15 @ Midnight, Wednesday
Exam 1: Feb 15 in Class

CS Faculty Candidate

Benz Tran is here today!
Please attend a meet and greet at 3:15 in SBSC 112
Extra Credit!

Data Wrangling

Steps

Discovery - Familiarize with source data
Structuring - Transforms features to uniform formats, units, and scales.
Cleaning - Removes or replaces missing and outlier data.
Enriching - Derives new features from existing features and appends new data from external sources.
Validating - Verifies that the dataset is internally consistent and accurate
Publishing - Makes the dataset available to others

Example

Extract, Transform, Load

Extract, Transform, Load (ETL) is a process that extracts data from databases, transforms the data, and loads the data into an analytic database.
Similar to data wrangling: process data, clean, enrich, etc.
Difference is that data wrangling is more informal, where ETL is a usually a business process or service.

Pandas

What is Pandas

Pandas is an open source Python package
Widely used for data science/data analysis
Key idea is organizing data into a dataframe
Tabular data, effecient queries, uses numpy if possible

Loading Pandas from python lists

import pandas as pd
d = {'a': [1, 2, 3], 
    'b': [4.0, 5.0, 6.0]
    }
df = pd.DataFrame(data=d)
df

	a	b
0	1	4.0
1	2	5.0
2	3	6.0

Pandas creates an index automatically (unnamed column)

Loading Pandas from NumPy

d = {'a': np.array([1, 2, 3]), 
    'b': np.array([4.0, 5.0, 6.0])
    }
df = pd.DataFrame(data=d)
df

	a	b
0	1	4.0
1	2	5.0
2	3	6.0

You can also use NumPy arrays

CSV Files

Link

Loading Pandas from a CSV

df = pd.read_csv('https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins-raw.csv')
df.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	2007-11-11	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	2007-11-11	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	2007-11-16	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	2007-11-16	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	2007-11-16	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN

Pandas Methods - Info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   studyName            344 non-null    object 
 1   Sample Number        344 non-null    int64  
 2   Species              344 non-null    object 
 3   Region               344 non-null    object 
 4   Island               344 non-null    object 
 5   Stage                344 non-null    object 
 6   Individual ID        344 non-null    object 
 7   Clutch Completion    344 non-null    object 
 8   Date Egg             344 non-null    object 
 9   Culmen Length (mm)   342 non-null    float64
 10  Culmen Depth (mm)    342 non-null    float64
 11  Flipper Length (mm)  342 non-null    float64
 12  Body Mass (g)        342 non-null    float64
 13  Sex                  333 non-null    object 
 14  Delta 15 N (o/oo)    330 non-null    float64
 15  Delta 13 C (o/oo)    331 non-null    float64
 16  Comments             54 non-null     object 
dtypes: float64(6), int64(1), object(10)
memory usage: 45.8+ KB

Select Only Interesting Columns

interesting_columns = ['Species', 'Island', 'Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex', 'Date Egg']
df = df[interesting_columns]
df.head()

	Species	Island	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Date Egg
0	Adelie Penguin (Pygoscelis adeliae)	Torgersen	39.1	18.7	181.0	3750.0	MALE	2007-11-11
1	Adelie Penguin (Pygoscelis adeliae)	Torgersen	39.5	17.4	186.0	3800.0	FEMALE	2007-11-11
2	Adelie Penguin (Pygoscelis adeliae)	Torgersen	40.3	18.0	195.0	3250.0	FEMALE	2007-11-16
3	Adelie Penguin (Pygoscelis adeliae)	Torgersen	NaN	NaN	NaN	NaN	NaN	2007-11-16
4	Adelie Penguin (Pygoscelis adeliae)	Torgersen	36.7	19.3	193.0	3450.0	FEMALE	2007-11-16

Rename Columns

new_names = ['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']
mapping = dict(zip(interesting_columns, new_names))
df = df.rename(columns=mapping)
df.head()

{'Body Mass (g)': 'body_mass_g',
 'Culmen Depth (mm)': 'bill_depth_mm',
 'Culmen Length (mm)': 'bill_length_mm',
 'Date Egg': 'year',
 'Flipper Length (mm)': 'flipper_length_mm',
 'Island': 'island',
 'Sex': 'sex',
 'Species': 'species'}

Rename Columns

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie Penguin (Pygoscelis adeliae)	Torgersen	39.1	18.7	181.0	3750.0	MALE	2007-11-11
1	Adelie Penguin (Pygoscelis adeliae)	Torgersen	39.5	17.4	186.0	3800.0	FEMALE	2007-11-11
2	Adelie Penguin (Pygoscelis adeliae)	Torgersen	40.3	18.0	195.0	3250.0	FEMALE	2007-11-16
3	Adelie Penguin (Pygoscelis adeliae)	Torgersen	NaN	NaN	NaN	NaN	NaN	2007-11-16
4	Adelie Penguin (Pygoscelis adeliae)	Torgersen	36.7	19.3	193.0	3450.0	FEMALE	2007-11-16

Remove Nan

df = df.dropna()
df = df.reset_index(drop=True) # used to be 343, now 333
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    object 
dtypes: float64(4), object(4)
memory usage: 20.9+ KB

Recap

Got raw data by using pd.read_csv(..)
Look at data
Selected columns of interest
Rename columns
TODO
- Extract year as integer from string year column
- lower case sex column
- simplify species name

Advanced Pandas

Selecting Data

You can select a column by using the [] notation
One column wil return a series, more than one column will return a new data frame is returned

column = df['year']
column

0      2007-11-11
1      2007-11-11
2      2007-11-16
3      2007-11-16
4      2007-11-16
          ...    
328    2009-11-19
329    2009-11-21
330    2009-11-21
331    2009-11-21
332    2009-11-21
Name: year, Length: 333, dtype: object

Selecting Data, View

df2 = df[['year', 'sex']]
df2.head()

	year	sex
0	2007-11-11	MALE
1	2007-11-11	FEMALE
2	2007-11-16	FEMALE
3	2007-11-16	FEMALE
4	2007-11-16	MALE

Create a new column

df2['test'] = 'CISC482'
df2.head()

	year	sex	test
0	2007-11-11	MALE	CISC482
1	2007-11-11	FEMALE	CISC482
2	2007-11-16	FEMALE	CISC482
3	2007-11-16	FEMALE	CISC482
4	2007-11-16	MALE	CISC482

Changing Column

df2.loc[:, 'test'] = 'DataScience'
# df2['test'] = "Data Science" # don't do it!
df2.head()

	year	sex	test
0	2007-11-11	MALE	DataScience
1	2007-11-11	FEMALE	DataScience
2	2007-11-16	FEMALE	DataScience
3	2007-11-16	FEMALE	DataScience
4	2007-11-16	MALE	DataScience

Selecting Some Rows

df_big_bills = df[df['bill_length_mm'] > 56]
df_big_bills

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
178	Gentoo penguin (Pygoscelis papua)	Biscoe	59.6	17.0	230.0	6050.0	MALE	2007-12-03
282	Chinstrap penguin (Pygoscelis antarctica)	Dream	58.0	17.8	181.0	3700.0	FEMALE	2007-11-30

Note

How would I just select the male penguins

df_male = df[df['sex'] == "MALE"]

Back to Wrangling

Fixing the Year

df.loc[:, 'year'] = df['year'].str[:4]
# df.loc[:, 'year'] = df['year'].apply(lambda x: str(x).split('-')[0])
df.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie Penguin (Pygoscelis adeliae)	Torgersen	39.1	18.7	181.0	3750.0	MALE	2007
1	Adelie Penguin (Pygoscelis adeliae)	Torgersen	39.5	17.4	186.0	3800.0	FEMALE	2007
2	Adelie Penguin (Pygoscelis adeliae)	Torgersen	40.3	18.0	195.0	3250.0	FEMALE	2007
3	Adelie Penguin (Pygoscelis adeliae)	Torgersen	36.7	19.3	193.0	3450.0	FEMALE	2007
4	Adelie Penguin (Pygoscelis adeliae)	Torgersen	39.3	20.6	190.0	3650.0	MALE	2007

Fixing the Species

mapping = {
  "Adelie Penguin (Pygoscelis adeliae)": "Adelie", 
  "Gentoo penguin (Pygoscelis papua)": "Gentoo",
  "Chinstrap penguin (Pygoscelis antarctica)": "Chinstrap"
}
df.loc[:, 'species'] = df['species'].map(mapping)
# df.loc[:, 'species'] = df['species'].apply(lambda x: str(x).split('-')[0])
df.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	MALE	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	FEMALE	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	FEMALE	2007
3	Adelie	Torgersen	36.7	19.3	193.0	3450.0	FEMALE	2007
4	Adelie	Torgersen	39.3	20.6	190.0	3650.0	MALE	2007

Fixing the sex

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007
4	Adelie	Torgersen	39.3	20.6	190.0	3650.0	male	2007

df.loc[:, 'sex'] = df['sex'].str.lower()

Checking Data Types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    object 
dtypes: float64(4), object(4)
memory usage: 20.9+ KB

Checking Data Types

df.loc[:, 'year'] = pd.to_numeric(df['year'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    object 
dtypes: float64(4), object(4)
memory usage: 20.9+ KB

Class Activity

Practice Pandas

Cleaning Data and Graphing