import warnings
warnings.filterwarnings('ignore')
9 Lab: Seaborn
In the previous notebooks we created basic, yet fast, visualisations using matplotlib. In this one we will be using a different plotting library called Seaborn. Some consider it the ggplot of Python, with excellent default settings that make your data life easier.
9.1 Preparations
As usual, we need to load the packages and data needed for our work.
import numpy as np
import pandas as pd
import seaborn as sns
office_df = pd.read_csv('data/raw/office_ratings.csv', encoding='UTF-8')
And check our data:
office_df.head()
|   | season | episode | title | imdb_rating | total_votes | air_date |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 7.6 | 3706 | 2005-03-24 |
| 1 | 1 | 2 | Diversity Day | 8.3 | 3566 | 2005-03-29 |
| 2 | 1 | 3 | Health Care | 7.9 | 2983 | 2005-04-05 |
| 3 | 1 | 4 | The Alliance | 8.1 | 2886 | 2005-04-12 |
| 4 | 1 | 5 | Basketball | 8.4 | 3179 | 2005-04-19 |
We can try to replicate the same plots as in the previous notebooks.
9.2 Scatterplots
This is similar to what we did in Section 7.3, but in this case we will be using seaborn's relplot() function.
sns.relplot(x='total_votes', y='imdb_rating', data=office_df)
9.2.1 Dates
If we want to create a scatterplot with dates on an axis, we first need to convert the air_date column from text to datetime values:
office_df['air_date'] = pd.to_datetime(office_df['air_date'], errors='ignore')
g = sns.relplot(x="air_date", y="imdb_rating", kind="scatter", data=office_df)
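The conversion step can be checked on a tiny example. The sketch below uses made-up date strings (not taken from the dataset) and `errors='coerce'`, which turns values that cannot be parsed into `NaT` instead of leaving the column as text. Recent pandas releases deprecate `errors='ignore'`, so this is one possible alternative, not the notebook's own choice.

```python
import pandas as pd

# Hypothetical values for illustration; the last one is deliberately invalid.
dates = pd.Series(['2005-03-24', '2005-03-29', 'not a date'])

# errors='coerce' replaces anything that cannot be parsed with NaT.
parsed = pd.to_datetime(dates, errors='coerce')

print(parsed)
print(parsed.isna().sum())  # number of values that failed to parse
```

Counting the `NaT` values afterwards is a quick way to see whether any rows failed to convert before plotting.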
9.3 Functions
We can define our own functions. A function is useful for code we are going to run multiple times. For instance, the function below scales the values of a column to between 0 and 1. It is a modified version of a function from Stack Overflow.
office_df.head()
|   | season | episode | title | imdb_rating | total_votes | air_date |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 7.6 | 3706 | 2005-03-24 |
| 1 | 1 | 2 | Diversity Day | 8.3 | 3566 | 2005-03-29 |
| 2 | 1 | 3 | Health Care | 7.9 | 2983 | 2005-04-05 |
| 3 | 1 | 4 | The Alliance | 8.1 | 2886 | 2005-04-12 |
| 4 | 1 | 5 | Basketball | 8.4 | 3179 | 2005-04-19 |
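def normalize(df, feature_name):
    # Work on a copy so the caller's dataframe is left untouched
    result = df.copy()

    max_value = df[feature_name].max()
    min_value = df[feature_name].min()

    # Min-max scaling: the smallest value maps to 0 and the largest to 1
    result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)

    return result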
Passing the dataframe and name of the column will return a dataframe with that column scaled between 0 and 1.
normalize(office_df, 'imdb_rating')
|   | season | episode | title | imdb_rating | total_votes | air_date |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 0.300000 | 3706 | 2005-03-24 |
| 1 | 1 | 2 | Diversity Day | 0.533333 | 3566 | 2005-03-29 |
| 2 | 1 | 3 | Health Care | 0.400000 | 2983 | 2005-04-05 |
| 3 | 1 | 4 | The Alliance | 0.466667 | 2886 | 2005-04-12 |
| 4 | 1 | 5 | Basketball | 0.566667 | 3179 | 2005-04-19 |
| ... | ... | ... | ... | ... | ... | ... |
| 183 | 9 | 19 | Stairmageddon | 0.433333 | 1484 | 2013-04-11 |
| 184 | 9 | 20 | Paper Airplane | 0.433333 | 1482 | 2013-04-25 |
| 185 | 9 | 21 | Livin' the Dream | 0.733333 | 2041 | 2013-05-02 |
| 186 | 9 | 22 | A.A.R.M. | 0.866667 | 2860 | 2013-05-09 |
| 187 | 9 | 23 | Finale | 1.000000 | 7934 | 2013-05-16 |
188 rows × 6 columns
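To see exactly what the scaling does, it can help to run it on a column small enough to check by hand. The sketch below repeats the min-max function so that it runs on its own, and applies it to a made-up `rating` column (the values are hypothetical, not from the dataset).

```python
import pandas as pd

def normalize(df, feature_name):
    # Same min-max scaling as the function above, repeated so the example is self-contained.
    result = df.copy()
    max_value = df[feature_name].max()
    min_value = df[feature_name].min()
    result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

# Hypothetical ratings for illustration only.
toy = pd.DataFrame({'rating': [7.0, 8.0, 9.0]})
scaled = normalize(toy, 'rating')

print(scaled['rating'].tolist())  # [0.0, 0.5, 1.0]
print(toy['rating'].tolist())     # [7.0, 8.0, 9.0], the input is unchanged
```

Because the function works on a copy, the original dataframe only changes if we assign the result back to it. Note that a column in which every value is identical has a range of zero, so this formula would produce missing values for it.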
We can normalize both the ratings and the votes, replacing the original dataframe each time.
office_df = normalize(office_df, 'imdb_rating')
office_df = normalize(office_df, 'total_votes')
office_df
|   | season | episode | title | imdb_rating | total_votes | air_date |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 0.300000 | 0.353616 | 2005-03-24 |
| 1 | 1 | 2 | Diversity Day | 0.533333 | 0.332212 | 2005-03-29 |
| 2 | 1 | 3 | Health Care | 0.400000 | 0.243082 | 2005-04-05 |
| 3 | 1 | 4 | The Alliance | 0.466667 | 0.228253 | 2005-04-12 |
| 4 | 1 | 5 | Basketball | 0.566667 | 0.273047 | 2005-04-19 |
| ... | ... | ... | ... | ... | ... | ... |
| 183 | 9 | 19 | Stairmageddon | 0.433333 | 0.013912 | 2013-04-11 |
| 184 | 9 | 20 | Paper Airplane | 0.433333 | 0.013606 | 2013-04-25 |
| 185 | 9 | 21 | Livin' the Dream | 0.733333 | 0.099067 | 2013-05-02 |
| 186 | 9 | 22 | A.A.R.M. | 0.866667 | 0.224278 | 2013-05-09 |
| 187 | 9 | 23 | Finale | 1.000000 | 1.000000 | 2013-05-16 |
188 rows × 6 columns
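When several columns need the same treatment, the scaling can also be written as a single expression over a list of columns. This is a sketch of an alternative approach, not the method used in this lab, and the `toy` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    'rating': [7.0, 8.0, 9.0],
    'votes': [100.0, 300.0, 200.0],
})

cols = ['rating', 'votes']

# .min() and .max() return one value per column, and pandas lines them
# up with the matching columns when doing the arithmetic.
toy[cols] = (toy[cols] - toy[cols].min()) / (toy[cols].max() - toy[cols].min())

print(toy)
```

Each column is still scaled using its own minimum and maximum, so the result matches calling the function once per column.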
9.3.1 Long format
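Seaborn prefers a long-format table, where each row holds a single value together with the name of the variable it belongs to. Details of melt can be found in the pandas documentation.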
office_df_long = pd.melt(office_df, id_vars=['season', 'episode', 'title', 'air_date'], value_vars=['imdb_rating', 'total_votes'])
office_df_long
|   | season | episode | title | air_date | variable | value |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 2005-03-24 | imdb_rating | 0.300000 |
| 1 | 1 | 2 | Diversity Day | 2005-03-29 | imdb_rating | 0.533333 |
| 2 | 1 | 3 | Health Care | 2005-04-05 | imdb_rating | 0.400000 |
| 3 | 1 | 4 | The Alliance | 2005-04-12 | imdb_rating | 0.466667 |
| 4 | 1 | 5 | Basketball | 2005-04-19 | imdb_rating | 0.566667 |
| ... | ... | ... | ... | ... | ... | ... |
| 371 | 9 | 19 | Stairmageddon | 2013-04-11 | total_votes | 0.013912 |
| 372 | 9 | 20 | Paper Airplane | 2013-04-25 | total_votes | 0.013606 |
| 373 | 9 | 21 | Livin' the Dream | 2013-05-02 | total_votes | 0.099067 |
| 374 | 9 | 22 | A.A.R.M. | 2013-05-09 | total_votes | 0.224278 |
| 375 | 9 | 23 | Finale | 2013-05-16 | total_votes | 1.000000 |
376 rows × 6 columns
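The reshaping is easier to follow on a table small enough to read in full. The sketch below melts a made-up two-row table (the `wide` dataframe and its values are hypothetical) using the same `id_vars` and `value_vars` arguments.

```python
import pandas as pd

# Hypothetical wide table: one row per episode, one column per measure.
wide = pd.DataFrame({
    'title': ['A', 'B'],
    'imdb_rating': [0.3, 0.5],
    'total_votes': [0.4, 0.2],
})

# id_vars are repeated on every output row; each value_vars column name
# becomes a label in 'variable', with its number in 'value'.
long = pd.melt(wide, id_vars=['title'], value_vars=['imdb_rating', 'total_votes'])

print(long)
```

Two rows and two measures give four rows in the long table, which is why the 188 episodes above become 376 rows.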
We can then plot the long-format table in seaborn like so:
sns.relplot(x='air_date', y='value', size='variable', data=office_df_long)
?sns.relplot
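The `size` argument distinguishes the two variables by marker size. For a categorical column such as `variable`, mapping it to colour with `hue` may be easier to read. The sketch below shows that variant on a small made-up long-format table (`toy_long` and its values are hypothetical).

```python
import pandas as pd
import seaborn as sns

# Hypothetical long-format data for illustration only.
toy_long = pd.DataFrame({
    'air_date': pd.to_datetime(['2005-03-24', '2005-03-29'] * 2),
    'variable': ['imdb_rating'] * 2 + ['total_votes'] * 2,
    'value': [0.30, 0.53, 0.35, 0.33],
})

# hue colours the points by the 'variable' column instead of sizing them.
g = sns.relplot(x='air_date', y='value', hue='variable', data=toy_long)
```

The same call should work on the full long-format table by passing it as `data`.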