9 Lab: Seaborn

import warnings
warnings.filterwarnings('ignore')

In the previous notebooks we created basic, yet fast, visualisations using matplotlib. In this one we will use a different plotting library called Seaborn. Some consider it the ggplot of Python, with excellent default settings that make your data life easier.

9.1 Preparations

As usual, we need to load the packages and data needed for our work.

import numpy as np
import pandas as pd
import seaborn as sns

office_df = pd.read_csv('data/raw/office_ratings.csv', encoding='UTF-8')

And check our data:

office_df.head()

	season	episode	title	imdb_rating	total_votes	air_date
0	1	1	Pilot	7.6	3706	2005-03-24
1	1	2	Diversity Day	8.3	3566	2005-03-29
2	1	3	Health Care	7.9	2983	2005-04-05
3	1	4	The Alliance	8.1	2886	2005-04-12
4	1	5	Basketball	8.4	3179	2005-04-19

We can try to replicate the same plots as in the previous notebooks.

9.2 Scatterplots

This is similar to what we did in Section 7.3, but in this case we will use seaborn’s relplot() function.

sns.relplot(x='total_votes', y='imdb_rating', data=office_df)

9.2.1 Dates

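If we want to create a scatterplot with dates on an axis, we first need to convert the air_date column from strings to datetime values:

# Parse the date strings; the errors='ignore' option is deprecated in recent pandas and would hide parsing failures
office_df['air_date'] = pd.to_datetime(office_df['air_date'])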

g = sns.relplot(x="air_date", y="imdb_rating", kind="scatter", data=office_df)
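
It can be worth confirming that the conversion actually produced datetime values before plotting. The following is a minimal sketch on a small, hypothetical dataframe (toy_df and its contents are made up for illustration). It assumes we want unparseable strings turned into missing values (NaT) by passing errors='coerce', which is one option among several.

```python
import pandas as pd

# Hypothetical toy data: two valid dates and one string that cannot be parsed
toy_df = pd.DataFrame({'air_date': ['2005-03-24', '2005-03-29', 'not a date']})

# errors='coerce' replaces unparseable entries with NaT instead of raising an error
toy_df['air_date'] = pd.to_datetime(toy_df['air_date'], errors='coerce')

# The column should now have a datetime dtype rather than object
print(toy_df['air_date'].dtype)

# Count how many entries failed to parse
n_missing = toy_df['air_date'].isna().sum()
print(n_missing)
```

Counting the missing values afterwards shows how many rows failed to parse, so problems are visible rather than silently ignored.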

9.3 Functions

We can define our own functions. A function helps us with code we are going to run multiple times. For instance, the function below scales values between 0 and 1.

Here is a modified function from Stack Overflow.

def normalize(df, feature_name):
    result = df.copy()
    
    max_value = df[feature_name].max()
    min_value = df[feature_name].min()
    
    result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    
    return result

Passing the dataframe and name of the column will return a dataframe with that column scaled between 0 and 1.

normalize(office_df, 'imdb_rating')

	season	episode	title	imdb_rating	total_votes	air_date
0	1	1	Pilot	0.300000	3706	2005-03-24
1	1	2	Diversity Day	0.533333	3566	2005-03-29
2	1	3	Health Care	0.400000	2983	2005-04-05
3	1	4	The Alliance	0.466667	2886	2005-04-12
4	1	5	Basketball	0.566667	3179	2005-04-19
...	...	...	...	...	...	...
183	9	19	Stairmageddon	0.433333	1484	2013-04-11
184	9	20	Paper Airplane	0.433333	1482	2013-04-25
185	9	21	Livin' the Dream	0.733333	2041	2013-05-02
186	9	22	A.A.R.M.	0.866667	2860	2013-05-09
187	9	23	Finale	1.000000	7934	2013-05-16

188 rows × 6 columns
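
The arithmetic is easier to check by hand on a few values. The sketch below repeats the normalize function from above and applies it to a small, hypothetical dataframe (toy_df and its rating column are made up for illustration). Each value is transformed as (value - min) / (max - min).

```python
import pandas as pd

def normalize(df, feature_name):
    """Return a copy of df with one column min-max scaled to the range 0 to 1."""
    result = df.copy()
    max_value = df[feature_name].max()
    min_value = df[feature_name].min()
    result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

# Hypothetical toy data: min is 6.0, max is 10.0, so the range is 4.0
toy_df = pd.DataFrame({'title': ['a', 'b', 'c'], 'rating': [6.0, 8.0, 10.0]})

scaled_df = normalize(toy_df, 'rating')
print(scaled_df['rating'].tolist())  # [0.0, 0.5, 1.0]

# The function works on a copy, so the input dataframe is unchanged
print(toy_df['rating'].tolist())  # [6.0, 8.0, 10.0]
```

Because the function returns a copy, the result has to be assigned to a name if we want to keep it.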

We can normalize both our votes and our ratings, replacing the original dataframe each time.

office_df = normalize(office_df, 'imdb_rating')

office_df = normalize(office_df, 'total_votes')

office_df

	season	episode	title	imdb_rating	total_votes	air_date
0	1	1	Pilot	0.300000	0.353616	2005-03-24
1	1	2	Diversity Day	0.533333	0.332212	2005-03-29
2	1	3	Health Care	0.400000	0.243082	2005-04-05
3	1	4	The Alliance	0.466667	0.228253	2005-04-12
4	1	5	Basketball	0.566667	0.273047	2005-04-19
...	...	...	...	...	...	...
183	9	19	Stairmageddon	0.433333	0.013912	2013-04-11
184	9	20	Paper Airplane	0.433333	0.013606	2013-04-25
185	9	21	Livin' the Dream	0.733333	0.099067	2013-05-02
186	9	22	A.A.R.M.	0.866667	0.224278	2013-05-09
187	9	23	Finale	1.000000	1.000000	2013-05-16

188 rows × 6 columns

9.3.1 Long format

Seaborn prefers a long-format table, which pandas can produce with melt(). Details of melt can be found in the pandas documentation.

office_df_long = pd.melt(office_df, id_vars=['season', 'episode', 'title', 'air_date'], value_vars=['imdb_rating', 'total_votes'])
office_df_long

	season	episode	title	air_date	variable	value
0	1	1	Pilot	2005-03-24	imdb_rating	0.300000
1	1	2	Diversity Day	2005-03-29	imdb_rating	0.533333
2	1	3	Health Care	2005-04-05	imdb_rating	0.400000
3	1	4	The Alliance	2005-04-12	imdb_rating	0.466667
4	1	5	Basketball	2005-04-19	imdb_rating	0.566667
...	...	...	...	...	...	...
371	9	19	Stairmageddon	2013-04-11	total_votes	0.013912
372	9	20	Paper Airplane	2013-04-25	total_votes	0.013606
373	9	21	Livin' the Dream	2013-05-02	total_votes	0.099067
374	9	22	A.A.R.M.	2013-05-09	total_votes	0.224278
375	9	23	Finale	2013-05-16	total_votes	1.000000

376 rows × 6 columns
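
To see what melt does on a smaller scale, here is a minimal sketch with a hypothetical two-row dataframe (wide_df and its rounded values are for illustration only). The columns listed in value_vars are stacked into a pair of columns named variable and value, while the id_vars columns are repeated for each stacked row.

```python
import pandas as pd

# Hypothetical wide table: one row per episode, one column per measurement
wide_df = pd.DataFrame({
    'title': ['Pilot', 'Finale'],
    'imdb_rating': [0.3, 1.0],
    'total_votes': [0.35, 1.0],
})

# Stack the two measurement columns into 'variable' and 'value'
long_df = pd.melt(wide_df, id_vars=['title'], value_vars=['imdb_rating', 'total_votes'])

print(long_df)
print(long_df.shape)  # (4, 3): two episodes times two measurements
```

The number of rows doubles because each episode now appears once per measurement, which matches the jump from 188 to 376 rows above.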

We can plot the long-format table in seaborn like so.

sns.relplot(x='air_date', y='value', size='variable', data=office_df_long)

?sns.relplot
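
Besides size, relplot also accepts hue and style arguments for mapping a column to colour and marker shape. The sketch below is one possible variation, not the only sensible choice. It uses a small, hypothetical long-format dataframe (toy_long is made up for illustration) and distinguishes the two variables by colour and marker instead of point size.

```python
import pandas as pd
import seaborn as sns

# Hypothetical long-format data with two measurements per air date
toy_long = pd.DataFrame({
    'air_date': pd.to_datetime(['2005-03-24', '2013-05-16'] * 2),
    'variable': ['imdb_rating'] * 2 + ['total_votes'] * 2,
    'value': [0.3, 1.0, 0.35, 1.0],
})

# Map the 'variable' column to both colour (hue) and marker shape (style)
g = sns.relplot(x='air_date', y='value', hue='variable', style='variable', data=toy_long)

# relplot returns a FacetGrid, which lets us relabel the axes
g.set_axis_labels('Air date', 'Scaled value')
```

Colour and marker shape are often easier to tell apart than point size when the column holds categories rather than quantities.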