CISC482 - Lecture06

Data Wrangling

Dr. Jeremy Castagno

Class Business

Schedule

  • Reading 3-1: Feb 3 @ 12PM, Today!
  • Reading 3-2: Feb 8 @ 12PM, Wednesday
  • Reading 4-1: Feb 10 @ 12PM, Friday
  • HW3: Feb 15 @ Midnight, Wednesday
  • Exam 1: Feb 15 in Class

CS Faculty Candidate

  • Benz Tran is here today!
  • Please attend a meet and greet at 3:15 in SBSC 112
  • Extra Credit!

Data Wrangling

Steps

  1. Discovery - Familiarize with source data
  2. Structuring - Transforms features to uniform formats, units, and scales.
  3. Cleaning - Removes or replaces missing and outlier data.
  4. Enriching - Derives new features from existing features and appends new data from external sources.
  5. Validating - Verifies that the dataset is internally consistent and accurate
  6. Publishing - Makes the dataset available to others

Example

Extract, Transform, Load

  • Extract, Transform, Load (ETL) is a process that extracts data from databases, transforms the data, and loads the data into an analytic database.
  • Similar to data wrangling: process data, clean, enrich, etc.
  • Difference is that data wrangling is more informal, where ETL is a usually a business process or service.

Pandas

What is Pandas

  • Pandas is an open source Python package
  • Widely used for data science/data analysis
  • Key idea is organizing data into a dataframe
  • Tabular data, effecient queries, uses numpy if possible

Loading Pandas from python lists

import pandas as pd
d = {'a': [1, 2, 3], 
    'b': [4.0, 5.0, 6.0]
    }
df = pd.DataFrame(data=d)
df
a b
0 1 4.0
1 2 5.0
2 3 6.0
  • Pandas creates an index automatically (unnamed column)

Loading Pandas from NumPy

d = {'a': np.array([1, 2, 3]), 
    'b': np.array([4.0, 5.0, 6.0])
    }
df = pd.DataFrame(data=d)
df
a b
0 1 4.0
1 2 5.0
2 3 6.0
  • You can also use NumPy arrays

CSV Files

Link

Loading Pandas from a CSV

df = pd.read_csv('https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins-raw.csv')
df.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 2007-11-11 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 2007-11-11 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 2007-11-16 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 2007-11-16 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 2007-11-16 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN

Pandas Methods - Info

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   studyName            344 non-null    object 
 1   Sample Number        344 non-null    int64  
 2   Species              344 non-null    object 
 3   Region               344 non-null    object 
 4   Island               344 non-null    object 
 5   Stage                344 non-null    object 
 6   Individual ID        344 non-null    object 
 7   Clutch Completion    344 non-null    object 
 8   Date Egg             344 non-null    object 
 9   Culmen Length (mm)   342 non-null    float64
 10  Culmen Depth (mm)    342 non-null    float64
 11  Flipper Length (mm)  342 non-null    float64
 12  Body Mass (g)        342 non-null    float64
 13  Sex                  333 non-null    object 
 14  Delta 15 N (o/oo)    330 non-null    float64
 15  Delta 13 C (o/oo)    331 non-null    float64
 16  Comments             54 non-null     object 
dtypes: float64(6), int64(1), object(10)
memory usage: 45.8+ KB

Select Only Interesting Columns

interesting_columns = ['Species', 'Island', 'Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex', 'Date Egg']
df = df[interesting_columns]
df.head()
Species Island Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Date Egg
0 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.1 18.7 181.0 3750.0 MALE 2007-11-11
1 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.5 17.4 186.0 3800.0 FEMALE 2007-11-11
2 Adelie Penguin (Pygoscelis adeliae) Torgersen 40.3 18.0 195.0 3250.0 FEMALE 2007-11-16
3 Adelie Penguin (Pygoscelis adeliae) Torgersen NaN NaN NaN NaN NaN 2007-11-16
4 Adelie Penguin (Pygoscelis adeliae) Torgersen 36.7 19.3 193.0 3450.0 FEMALE 2007-11-16

Rename Columns

new_names = ['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']
mapping = dict(zip(interesting_columns, new_names))
df = df.rename(columns=mapping)
df.head()
{'Body Mass (g)': 'body_mass_g',
 'Culmen Depth (mm)': 'bill_depth_mm',
 'Culmen Length (mm)': 'bill_length_mm',
 'Date Egg': 'year',
 'Flipper Length (mm)': 'flipper_length_mm',
 'Island': 'island',
 'Sex': 'sex',
 'Species': 'species'}

Rename Columns

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.1 18.7 181.0 3750.0 MALE 2007-11-11
1 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.5 17.4 186.0 3800.0 FEMALE 2007-11-11
2 Adelie Penguin (Pygoscelis adeliae) Torgersen 40.3 18.0 195.0 3250.0 FEMALE 2007-11-16
3 Adelie Penguin (Pygoscelis adeliae) Torgersen NaN NaN NaN NaN NaN 2007-11-16
4 Adelie Penguin (Pygoscelis adeliae) Torgersen 36.7 19.3 193.0 3450.0 FEMALE 2007-11-16

Remove Nan

df = df.dropna()
df = df.reset_index(drop=True) # used to be 343, now 333
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    object 
dtypes: float64(4), object(4)
memory usage: 20.9+ KB

Recap

  • Got raw data by using pd.read_csv(..)
  • Look at data
  • Selected columns of interest
  • Rename columns
  • TODO
    • Extract year as integer from string year column
    • lower case sex column
    • simplify species name

Advanced Pandas

Selecting Data

  • You can select a column by using the [] notation
  • One column wil return a series, more than one column will return a new data frame is returned
column = df['year']
column
0      2007-11-11
1      2007-11-11
2      2007-11-16
3      2007-11-16
4      2007-11-16
          ...    
328    2009-11-19
329    2009-11-21
330    2009-11-21
331    2009-11-21
332    2009-11-21
Name: year, Length: 333, dtype: object

Selecting Data, View

df2 = df[['year', 'sex']]
df2.head()
year sex
0 2007-11-11 MALE
1 2007-11-11 FEMALE
2 2007-11-16 FEMALE
3 2007-11-16 FEMALE
4 2007-11-16 MALE

Create a new column

df2['test'] = 'CISC482'
df2.head()
year sex test
0 2007-11-11 MALE CISC482
1 2007-11-11 FEMALE CISC482
2 2007-11-16 FEMALE CISC482
3 2007-11-16 FEMALE CISC482
4 2007-11-16 MALE CISC482

Changing Column

df2.loc[:, 'test'] = 'DataScience'
# df2['test'] = "Data Science" # don't do it!
df2.head()
year sex test
0 2007-11-11 MALE DataScience
1 2007-11-11 FEMALE DataScience
2 2007-11-16 FEMALE DataScience
3 2007-11-16 FEMALE DataScience
4 2007-11-16 MALE DataScience

Selecting Some Rows

df_big_bills = df[df['bill_length_mm'] > 56]
df_big_bills
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
178 Gentoo penguin (Pygoscelis papua) Biscoe 59.6 17.0 230.0 6050.0 MALE 2007-12-03
282 Chinstrap penguin (Pygoscelis antarctica) Dream 58.0 17.8 181.0 3700.0 FEMALE 2007-11-30

Note

How would I just select the male penguins

df_male = df[df['sex'] == "MALE"]

Back to Wrangling

Fixing the Year

df.loc[:, 'year'] = df['year'].str[:4]
# df.loc[:, 'year'] = df['year'].apply(lambda x: str(x).split('-')[0])
df.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.1 18.7 181.0 3750.0 MALE 2007
1 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.5 17.4 186.0 3800.0 FEMALE 2007
2 Adelie Penguin (Pygoscelis adeliae) Torgersen 40.3 18.0 195.0 3250.0 FEMALE 2007
3 Adelie Penguin (Pygoscelis adeliae) Torgersen 36.7 19.3 193.0 3450.0 FEMALE 2007
4 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.3 20.6 190.0 3650.0 MALE 2007

Fixing the Species

mapping = {
  "Adelie Penguin (Pygoscelis adeliae)": "Adelie", 
  "Gentoo penguin (Pygoscelis papua)": "Gentoo",
  "Chinstrap penguin (Pygoscelis antarctica)": "Chinstrap"
}
df.loc[:, 'species'] = df['species'].map(mapping)
# df.loc[:, 'species'] = df['species'].apply(lambda x: str(x).split('-')[0])
df.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE 2007
3 Adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE 2007
4 Adelie Torgersen 39.3 20.6 190.0 3650.0 MALE 2007

Fixing the sex

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
4 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
df.loc[:, 'sex'] = df['sex'].str.lower()

Checking Data Types

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    object 
dtypes: float64(4), object(4)
memory usage: 20.9+ KB

Checking Data Types

df.loc[:, 'year'] = pd.to_numeric(df['year'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    object 
dtypes: float64(4), object(4)
memory usage: 20.9+ KB

Class Activity

Practice Pandas

Cleaning Data and Graphing