CISC482 - Lecture07

Data Exploration

Dr. Jeremy Castagno

Class Business

Schedule

  • Reading 3-2: Feb 8 @ 12PM, Wednesday
  • Reading 4-1: Feb 10 @ 12PM, Friday
  • Reading 4-2: Feb 15 @ 12PM, Wednesday
  • HW3: Feb 15 @ Midnight, Wednesday

CS Faculty Candidate

  • Razuan Hossain is here on Friday
  • Please attend a meet and greet at 3:15 in SBSC 112
  • Extra Credit!

Visualizing Data with *one* feature

Bar Chart

  • A bar chart shows groups on one axis and rectangles whose heights represent the number of samples in each group.

Example Bar Chart

Code
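import seaborn as sns
# Assumption: df is the Palmer penguins dataset, which has the columns used in these slides
df = sns.load_dataset("penguins")
# Count the samples in each species, then draw one bar per species
species_count = df.groupby(['species'])['species'].count()
ax = sns.barplot(x=species_count.index.values, y=species_count.values)
ax.bar_label(ax.containers[0]);  # print each count above its bar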

Numerical Features

  • Sometimes we want to visualize numerical features
  • We are interested in showing users the variation of this feature
  • Histograms
  • Density Plots
  • Box Plots

Histogram Bar Chart

  • Divide the range of the numerical feature into small regions (bins), then count the number of values in each region
  • Notice: the axes have labels!
  • Notice: the bars are narrow enough that you can see the distribution's shape
  • What do you notice about this distribution?
  • What is (roughly) the most likely flipper length?
Code
ax = sns.histplot(data=df, x="flipper_length_mm");
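
The binning step described above can be sketched in plain Python. This is only an illustration of the idea, not how seaborn implements `histplot`; the helper `bin_counts`, the sample values, and the bin width of 10 mm are all made up for this example.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`.

    Bins are half-open intervals [start, start + bin_width).
    Bins that contain no values are simply absent from the result.
    """
    counts = Counter()
    for v in values:
        start = (v // bin_width) * bin_width  # left edge of the bin holding v
        counts[start] += 1
    return dict(sorted(counts.items()))


# Hypothetical flipper lengths in mm (illustrative values, not the real dataset)
flipper_lengths = [181, 186, 195, 193, 190, 210, 215, 217, 222, 198]

# Print a crude text histogram: one '#' per value in the bin
for start, n in bin_counts(flipper_lengths, 10).items():
    print(f"[{start}, {start + 10}): {'#' * n}")
```

A smaller bin width shows more detail but makes the counts noisier; a larger one hides the shape of the distribution.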

Horizontal Bar Chart

  • Sometimes it's better to have the bars grow horizontally
Code
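# Categorical feature on y, numerical feature on x: the bars grow horizontally.
# By default barplot draws the mean of x (body mass) for each category (island).
ax = sns.barplot(data=df, y="island", x="body_mass_g", errorbar=None)
ax.bar_label(ax.containers[0]);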

Density Plot

  • A plot that approximates the density function of the distribution for the feature.
  • Density plots can be thought of as a smoothed histogram
Code
ax = sns.kdeplot(data=df, x="body_mass_g");
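
To see why a density plot behaves like a smoothed histogram, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. It is an assumption-laden illustration, not seaborn's implementation of `kdeplot`: the helper `make_kde`, the sample values, and the hand-picked bandwidth of 300 g are hypothetical.

```python
import math


def make_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x.

    Each sample contributes a Gaussian bump centered on it; the estimate
    is the average of those bumps, so the total area under the curve is 1.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical body masses in grams (illustrative values only)
body_mass = [3400, 3650, 3800, 4200, 4700, 5000, 5400]
density = make_kde(body_mass, bandwidth=300)  # bandwidth chosen by hand

# Evaluate the smooth curve at a few points
for x in range(3000, 6001, 500):
    print(x, f"{density(x):.6f}")
```

The bandwidth plays the same role as the bin width in a histogram: larger values give a smoother curve, smaller values follow the individual samples more closely.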

Density Plot with Histogram

Code
ax = sns.histplot(data=df, x="body_mass_g", kde=True);

Box Plot

  • A visual representation of summary statistics:
    • minimum, maximum
    • first quartile, median, third quartile
    • outliers
Code
ax = sns.boxplot(data=df, x="body_mass_g");
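
The numbers behind a box plot can be computed directly. The sketch below assumes the common convention that whiskers extend to the most extreme values within 1.5 times the interquartile range of the box, and that anything beyond is drawn as an outlier. The helper `box_summary` and the sample values are hypothetical, and plotting libraries may compute quartiles slightly differently.

```python
import statistics


def box_summary(values, whis=1.5):
    """Compute the quantities a box plot displays.

    Assumes the 1.5 * IQR rule for whiskers and outliers.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "whisker_low": min(inside),   # smallest value that is not an outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": max(inside),  # largest value that is not an outlier
        "outliers": outliers,
    }


# Hypothetical body masses in grams, with one extreme value added on purpose
body_mass = [2900, 3400, 3650, 3800, 4200, 4700, 5000, 5400, 9000]
summary = box_summary(body_mass)
for name, value in summary.items():
    print(f"{name}: {value}")
```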

Boxen Plot

  • Plots more quantiles
  • Provides more information about the shape of the distribution, particularly in the tails.
  • The central box holds 50% of the data; each successive pair of boxes holds half as much: 25%, 12.5%, 6.25%, 3.13%, ...
Code
ax = sns.boxenplot(data=df, x="body_mass_g");
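
The halving pattern can be sketched by computing nested quantile pairs. This is a simplified illustration rather than seaborn's implementation of `boxenplot`, which chooses the number of levels itself; the helper `letter_value_boxes` and the data are hypothetical.

```python
import statistics


def letter_value_boxes(values, levels=4):
    """Return (low, high) quantile pairs for successively wider boxes.

    Level 1 spans the 25th to 75th percentiles (the middle 50% of the data),
    level 2 spans 12.5% to 87.5% (adding 25%), level 3 spans 6.25% to 93.75%
    (adding 12.5%), and so on.
    """
    boxes = []
    for level in range(1, levels + 1):
        cuts = statistics.quantiles(values, n=2 ** (level + 1), method="inclusive")
        boxes.append((cuts[0], cuts[-1]))  # outermost cut points at this level
    return boxes


# Hypothetical data: the integers 1..100
data = list(range(1, 101))
boxes = letter_value_boxes(data)
for level, (low, high) in enumerate(boxes, start=1):
    share = sum(low <= v <= high for v in data) / len(data)
    print(f"level {level}: [{low}, {high}] contains {share:.0%} of the data")
```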

Multiple Features

Two features

  • Visualizing a single feature uses one axis to display the feature value and another axis to display the value's frequency
  • However, what if we want to communicate or investigate the relationship between two variables?
    • Scatter plots, line plots, etc.!

Example Scatter Plot Data

tips = sns.load_dataset("tips")
tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Example Scatter Plot

  • Every point is a data point
  • Point out the best tips
  • Point out the worst tips
Code
sns.relplot(data=tips, x="total_bill", y="tip", aspect=1.5);

Example Line Plot

Code
tips = sns.load_dataset("tips")
sns.relplot(data=tips, x="total_bill", y="tip", aspect=1.5, kind='line');

Combined Plot

  • Data points and linear regression model
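
The line in such a combined plot comes from a least-squares fit. As a minimal sketch (not the slide's original code), the fit can be computed by hand; the helper `fit_line` is hypothetical, and the data are just the first five rows of the tips table, so the resulting line is only illustrative. In seaborn, `sns.regplot` draws the points and a fitted line together.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# First five rows of the tips dataset (total_bill, tip)
total_bill = [16.99, 10.34, 21.01, 23.68, 24.59]
tip = [1.01, 1.66, 3.50, 3.31, 3.61]

slope, intercept = fit_line(total_bill, tip)
print(f"tip = {slope:.3f} * total_bill + {intercept:.3f}")
```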

Looks can be deceiving

  • Always show the points…

Categorical Features

Categorical Features

  • Sometimes we have a categorical feature and want to see how a numerical feature differs between its groups
    • Category: Man, Woman. Feature: Height
    • Category: Friday, Saturday, Sunday. Feature: Avg. Tip
    • Category: Penguin Species. Feature: Body Mass
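
Comparing a feature across categories usually starts with grouping the rows by category and summarizing each group. A minimal sketch, assuming the summary is the mean; the helper `mean_by_group` is hypothetical, and the rows are the first five rows of the tips table, so the numbers are only illustrative.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, category, feature):
    """Group rows by `category` and return the mean of `feature` per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row[feature])
    return {name: mean(vals) for name, vals in groups.items()}


# First five rows of the tips dataset, reduced to two columns
rows = [
    {"sex": "Female", "tip": 1.01},
    {"sex": "Male", "tip": 1.66},
    {"sex": "Male", "tip": 3.50},
    {"sex": "Male", "tip": 3.31},
    {"sex": "Female", "tip": 3.61},
]

means = mean_by_group(rows, "sex", "tip")
for name, value in means.items():
    print(f"{name}: {value:.2f}")
```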

Density Plot with Histogram, Species as Hue

Code
ax = sns.histplot(data=df, x="body_mass_g", kde=True, hue='species');

Bar Plot, Sex as Hue

Code
# Draw an overlaid barplot by species and sex (dodge=False stacks the hue bars on top of each other)
g = sns.catplot(
    data=df, kind="bar", 
    x="species", y="body_mass_g", hue="sex", 
    errorbar=None, alpha=0.6, dodge=False
)

Bar Plot, Sex as Hue

  • Which species and sex combination has the largest mean body mass?
  • Which species has the smallest numerical difference between sexes?
Code
# Draw a nested barplot by species and sex
g = sns.catplot(
    data=df, kind="bar", 
    x="species", y="body_mass_g", hue="sex", 
    errorbar=None, alpha=0.6
)

Awesome Plots

Strip Plot

Code
g = sns.catplot(
    data=df, kind="strip", 
    y="species", x="bill_length_mm", hue="sex",  aspect=2, alpha=0.6
)

Swarm Plot

  • A swarm plot is a scatter plot with points jittered off the lines for the categorical feature so the points do not overlap.
  • A swarm plot is useful for small datasets, but with an increasing number of points, the plots get too wide.

Swarm Plot Example

Code
g = sns.swarmplot(
    data=df, 
    y="species", x="bill_length_mm", hue='sex', alpha=0.6
)

Violin

Code
sns.violinplot(data=df, x="bill_length_mm", y="species");

Violin Explained

Data Plotting Tools

Matplotlib

  • Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
  • The most popular Python plotting library (2D and 3D)
  • Out of the box, plots are functional but don’t look too nice

Matplotlib Example

Seaborn

  • Seaborn is a Python data visualization library based on matplotlib
  • Nice and simple API that integrates very nicely with pandas
  • Just pass it a data frame and call functions like
    • histplot
    • scatterplot
    • relplot
    • boxplot

Seaborn Examples