6 Lab: Pandas

Last week we loaded data using pandas’ read.csv() method, but Pandas can do way more than that. Pandas is an essential library for data science, as it provides the data structures (nameley, series -1D- and data frames -2D) and operations for manipulating tabular data.

In this unit’s labs we will be using pandas to read in, process and explore data (this notebook); create (basic) visualisations capabilities (Chapter 7) as well as transforming data (Chapter 8). To do so, we will be using a dataset about “The Office”.

6.1 Dataset

The Office is a humoristic TV series originally created in 2001 by Ricky Gervais and Stephen Merchant that has received several adaptations. The dataset that we will be using contains information (i.e., title, date and ratings from IMBDB) about every episode of the 9 seaons of the very successful USA’s adaptation aired between 2005 and 2013.

The dataset is stored in a csv file that has the following columns: season, episode, title, imdb_rating, total_votes, air_date.

6.2 Starting

To work with the dataset we will need to import pandas so we can use every feature provided by the library, as well as loading the dataset stored in the office_ratings.csv.

# These two lines are added so that not all the warnings are rendered in the cell. 
# We do this not to confuse you during your learning journey with some of the warnings but 
# normally you would want them turned on since they can tell you something about things that might not be working as expected.
import warnings
warnings.filterwarnings('ignore')

# this following import will always be needed whenever you want to work with Pandas.
import pandas as pd

df = pd.read_csv('data/raw/office_ratings.csv', encoding='UTF-8')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       188 non-null    int64  
 1   episode      188 non-null    int64  
 2   title        188 non-null    object 
 3   imdb_rating  188 non-null    float64
 4   total_votes  188 non-null    int64  
 5   air_date     188 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 8.9+ KB

6.3 Help!

Python has inbuilt documentation. To access this add a ? before an object or method.

Note

The output of the help function has been omitted in the handbook. Please run the cells in your notebook to read the different outputs

For example, our dataframe

?df

or the dtypes property

Tip

Properties of object are values associated with the object and are not called with a () at the end.

?df.dtypes

The info method for dataframes.

?df.info

If you would like to get help in-line like the examples above, that can give a very long help message that might not be always convenient. If you like, you can try to get the help for this following function like this:

?pd.read_csv

However, the below will be quite long – it provides you the various arguments (options) you can use with the method.

Instead of this approach, a much better way to get help is to refer to the documentation and the API of the library that you are using. For instance, for read_csv(), this page is much more useful – https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

We recommend that you use a search engine very frequently.

?pd.read_csv

The Pandas documentation is rather good. Relevent to our below work is:

I also found a rather nice series of lessons a kind person put together. There are lots of online tutorials which will help you.

6.4 Structure

In Section 2.3 we introduced Python’s data types as well as how to use the function type() to retrieve an object’s data type. Pandas expands python’s data types by creating a new one called data frame

Do you remember what are Python’s data types? You can refer to Section 2.3 for a refresher and to know more about them.

Data frames

Data frames are 2-dimensional data structures that store information in columns and rows, very much like data is stored in a spreadsheet or a database. Typically, every column will contain variables (or sometimes called attributes) whereas every row represents an observation. This is known as wide data frames, as opposed to long data frames.

In pandas, every column has a name and rows can be named, too.

So let’s check the what our newly created object’s (df) data type:

type(df)

pandas.core.frame.DataFrame

Unsurprisingly, df is a `DataFrame`` object, provided by pandas.

The DataFrame object has lots of built in methods and attributes.

The info method gives us information about datatypes, dimensions and the presence of null values in our dataframe. Let’s see how can we use it and what information is returned:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       188 non-null    int64  
 1   episode      188 non-null    int64  
 2   title        188 non-null    object 
 3   imdb_rating  188 non-null    float64
 4   total_votes  188 non-null    int64  
 5   air_date     188 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 8.9+ KB

We can just dtypes to check the data types of every variable in the data frame.

df.dtypes

season           int64
episode          int64
title           object
imdb_rating    float64
total_votes      int64
air_date        object
dtype: object

Or just the dimensions (e.g., rows and columns).

df.shape

(188, 6)

In this case, there are only 188 rows. But for larger datasets we might want to look at the head (top 5) and tail (bottom 5) rows using .head() and .tail(), respectively.

df.head()

	season	episode	title	imdb_rating	total_votes	air_date
0	1	1	Pilot	7.6	3706	2005-03-24
1	1	2	Diversity Day	8.3	3566	2005-03-29
2	1	3	Health Care	7.9	2983	2005-04-05
3	1	4	The Alliance	8.1	2886	2005-04-12
4	1	5	Basketball	8.4	3179	2005-04-19

df.tail()

	season	episode	title	imdb_rating	total_votes	air_date
183	9	19	Stairmageddon	8.0	1484	2013-04-11
184	9	20	Paper Airplane	8.0	1482	2013-04-25
185	9	21	Livin' the Dream	8.9	2041	2013-05-02
186	9	22	A.A.R.M.	9.3	2860	2013-05-09
187	9	23	Finale	9.7	7934	2013-05-16

6.5 Summary

To get an overview of our data we can ask Python to ‘describe our (numeric) data’

df.describe()

	season	episode	imdb_rating	total_votes
count	188.000000	188.000000	188.000000	188.000000
mean	5.468085	11.877660	8.257447	2126.648936
std	2.386245	7.024855	0.538067	787.098275
min	1.000000	1.000000	6.700000	1393.000000
25%	3.000000	6.000000	7.900000	1631.500000
50%	6.000000	11.500000	8.200000	1952.500000
75%	7.250000	18.000000	8.600000	2379.000000
max	9.000000	26.000000	9.700000	7934.000000

or we can pull out specific statistics for numeric columns.

df.mean()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 df.mean()

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/frame.py:11693, in DataFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  11685 @doc(make_doc("mean", ndim=2))
  11686 def mean(
  11687     self,
   (...)
  11691     **kwargs,
  11692 ):
> 11693     result = super().mean(axis, skipna, numeric_only, **kwargs)
  11694     if isinstance(result, Series):
  11695         result = result.__finalize__(self, method="mean")

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/generic.py:12420, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12413 def mean(
  12414     self,
  12415     axis: Axis | None = 0,
   (...)
  12418     **kwargs,
  12419 ) -> Series | float:
> 12420     return self._stat_function(
  12421         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12422     )

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/generic.py:12377, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12373 nv.validate_func(name, (), kwargs)
  12375 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12377 return self._reduce(
  12378     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12379 )

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/frame.py:11562, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  11558     df = df.T
  11560 # After possibly _get_data and transposing, we are now in the
  11561 #  simple case where we can use BlockManager.reduce
> 11562 res = df._mgr.reduce(blk_func)
  11563 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
  11564 if out_dtype is not None and out.dtype != "boolean":

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/internals/managers.py:1500, in BlockManager.reduce(self, func)
   1498 res_blocks: list[Block] = []
   1499 for blk in self.blocks:
-> 1500     nbs = blk.reduce(func)
   1501     res_blocks.extend(nbs)
   1503 index = Index([None])  # placeholder

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/internals/blocks.py:404, in Block.reduce(self, func)
    398 @final
    399 def reduce(self, func) -> list[Block]:
    400     # We will apply the function and reshape the result into a single-row
    401     #  Block with the same mgr_locs; squeezing will be done at a higher level
    402     assert self.ndim == 2
--> 404     result = func(self.values)
    406     if self.values.ndim == 1:
    407         res_values = result

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/frame.py:11481, in DataFrame._reduce.<locals>.blk_func(values, axis)
  11479         return np.array([result])
  11480 else:
> 11481     return op(values, axis=axis, skipna=skipna, **kwds)

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145         result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
    718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
    719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):
    723     count = cast(np.ndarray, count)

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/nanops.py:1686, in _ensure_numeric(x)
   1683 inferred = lib.infer_dtype(x)
   1684 if inferred in ["string", "mixed"]:
   1685     # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1686     raise TypeError(f"Could not convert {x} to numeric")
   1687 try:
   1688     x = x.astype(np.complex128)

TypeError: Could not convert ["PilotDiversity DayHealth CareThe AllianceBasketballHot GirlThe DundiesSexual HarassmentOffice OlympicsThe FireHalloweenThe FightThe ClientPerformance ReviewE-Mail SurveillanceChristmas PartyBooze CruiseThe InjuryThe SecretThe CarpetBoys and GirlsValentine's DayDwight's SpeechTake Your Daughter to Work DayMichael's BirthdayDrug TestingConflict ResolutionCasino NightGay Witch HuntThe ConventionThe CoupGrief CounselingInitiationDiwaliBranch ClosingThe MergerThe ConvictA Benihana ChristmasBack from VacationTraveling SalesmenThe ReturnBen FranklinPhyllis' WeddingBusiness SchoolCocktailsThe NegotiationSafety TrainingProduct RecallWomen's AppreciationBeach GamesThe JobFun RunDunder Mifflin InfinityLaunch PartyMoneyLocal AdBranch WarsSurvivor ManThe DepositionDinner PartyChair ModelNight OutDid I Stutter?Job FairGoodbye, TobyWeight LossBusiness EthicsBaby ShowerCrime AidEmployee TransferCustomer SurveyBusiness TripFrame TobyThe SurplusMoroccan ChristmasThe DuelPrince Family PaperStress ReliefLecture Circuit: Part 1Lecture Circuit: Part 2Blood DriveGolden TicketNew BossTwo WeeksDream TeamMichael Scott Paper CompanyHeavy CompetitionBrokeCasual FridayCafe DiscoCompany PicnicGossipThe MeetingThe PromotionNiagara: Part 1Niagara: Part 2MafiaThe LoverKoi PondDouble DateMurderShareholder MeetingScott's TotsSecret SantaThe BankerSabreManager and SalesmanThe Delivery: Part 1The Delivery: Part 2St. Patrick's DayNew LeadsHappy HourSecretary's DayBody LanguageThe Cover-UpThe ChumpWhistleblowerNepotismCounselingAndy's PlaySex EdThe StingCostume ContestChristeningViewing PartyWUPHF.comChinaClassy ChristmasUltimatumThe SeminarThe SearchPDAThreat Level MidnightTodd PackerGarage SaleTraining DayMichael's Last DundiesGoodbye, MichaelThe Inner CircleDwight K. Schrute, (Acting) ManagerSearch CommitteeThe ListThe IncentiveLottoGarden PartySpookedDoomsdayPam's ReplacementGettysburgMrs. CaliforniaChristmas WishesTriviaPool PartyJury DutySpecial ProjectTallahasseeAfter HoursTest the StoreLast Day in FloridaGet the GirlWelcome PartyAngry AndyFundraiserTurf WarFree Family Portrait StudioNew GuysRoy's WeddingAndy's AncestryWork BusHere Comes TrebleThe BoatThe WhaleThe TargetDwight ChristmasLiceSuit WarehouseCustomer LoyaltyJunior SalesmanVandalismCouples DiscountMoving OnThe FarmPromosStairmageddonPaper AirplaneLivin' the DreamA.A.R.M.Finale"
 '2005-03-242005-03-292005-04-052005-04-122005-04-192005-04-262005-09-202005-09-272005-10-042005-10-112005-10-182005-11-012005-11-082005-11-152005-11-222005-12-062006-01-052006-01-122006-01-192006-01-262006-02-022006-02-092006-03-022006-03-162006-03-302006-04-272006-05-042006-05-112006-09-212006-09-282006-10-052006-10-122006-10-192006-11-022006-11-092006-11-162006-11-302006-12-142007-01-042007-01-112007-01-182007-02-012007-02-082007-02-152007-02-222007-04-052007-04-122007-04-262007-05-032007-05-102007-05-172007-09-272007-10-042007-10-112007-10-182007-10-252007-11-012007-11-082007-11-152008-04-102008-04-172008-04-242008-05-012008-05-082008-05-152008-09-252008-10-092008-10-162008-10-232008-10-302008-11-062008-11-132008-11-202008-12-042008-12-112009-01-152009-01-222009-02-012009-02-052009-02-122009-03-052009-03-122009-03-192009-03-262009-04-092009-04-092009-04-162009-04-232009-04-302009-05-072009-05-142009-09-172009-09-242009-10-012009-10-082009-10-082009-10-152009-10-222009-10-292009-11-052009-11-122009-11-192009-12-032009-12-102010-01-212010-02-042010-02-112010-03-042010-03-042010-03-112010-03-182010-03-252010-04-222010-04-292010-05-062010-05-132010-05-202010-09-232010-09-302010-10-072010-10-142010-10-212010-10-282010-11-042010-11-112010-11-182010-12-022010-12-092011-01-202011-01-272011-02-032011-02-102011-02-172011-02-242011-03-242011-04-142011-04-212011-04-282011-05-052011-05-122011-05-192011-09-222011-09-292011-10-062011-10-132011-10-272011-11-032011-11-102011-11-172011-12-012011-12-082012-01-122012-01-192012-02-022012-02-092012-02-162012-02-232012-03-012012-03-082012-03-152012-04-122012-04-192012-04-262012-05-032012-05-102012-09-202012-09-272012-10-042012-10-182012-10-252012-11-082012-11-152012-11-292012-12-062013-01-102013-01-172013-01-242013-01-312013-01-312013-02-072013-02-142013-03-142013-04-042013-04-112013-04-252013-05-022013-05-092013-05-16'] to numeric

Note the error triggered above due to pandas attempting to calculate the mean of the wrong type (i.e. non-numeric values). We can address that by only computing the mean of numeric values (see below):

df.mean(numeric_only=True)

season            5.468085
episode          11.877660
imdb_rating       8.257447
total_votes    2126.648936
dtype: float64

or the sum of every value within the same column:

df.sum()

season                                                      1028
episode                                                     2233
title          PilotDiversity DayHealth CareThe AllianceBaske...
imdb_rating                                               1552.4
total_votes                                               399810
air_date       2005-03-242005-03-292005-04-052005-04-122005-0...
dtype: object

Similarly to what happened with mean(), sum() is adding all values in every observation of every attribute, regardless of their type, but this time is not producing an error. Can you see what happens with strings? And with dates?

Again, we can force to use numeric values only:

df.sum(numeric_only=True)

season           1028.0
episode          2233.0
imdb_rating      1552.4
total_votes    399810.0
dtype: float64

6.6 Subsetting

Often times we may have a large dataset and we only need to work with just a part of it (a subset) consisting of certain columns and/or rows. Selecting specific columns and/or rows is known as subsetting.

6.6.1 Selecting columns

Because in pandas every column has a name, we can select columns by their name or their position.

6.6.1.1 Selecting by name

To select by name we will use the syntax df['<column_name>']. For example, if we wanted to select the ratings:

df['imdb_rating']

0      7.6
1      8.3
2      7.9
3      8.1
4      8.4
      ... 
183    8.0
184    8.0
185    8.9
186    9.3
187    9.7
Name: imdb_rating, Length: 188, dtype: float64

or we could select the date in which the chapters were first aired:

df['air_date']

0      2005-03-24
1      2005-03-29
2      2005-04-05
3      2005-04-12
4      2005-04-19
          ...    
183    2013-04-11
184    2013-04-25
185    2013-05-02
186    2013-05-09
187    2013-05-16
Name: air_date, Length: 188, dtype: object

We can even select more than one column!

df[['imdb_rating', 'total_votes']]

	imdb_rating	total_votes
0	7.6	3706
1	8.3	3566
2	7.9	2983
3	8.1	2886
4	8.4	3179
...	...	...
183	8.0	1484
184	8.0	1482
185	8.9	2041
186	9.3	2860
187	9.7	7934

188 rows × 2 columns

Important

Did you notice that we used two sets of squared brackets ([[]])? This is needed because we need to passing a list of the column names to the __getitem__ method of the pandas dataframe object, and as you may remember from Section 2.3.2, this is the syntax used for lists (thank this stackoverflow question).

This is what we’d get otherwise:

df['imdb_rating', 'total_votes']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('imdb_rating', 'total_votes')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[22], line 1
----> 1 df['imdb_rating', 'total_votes']

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/anaconda3/envs/IM939/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: ('imdb_rating', 'total_votes')

You can also check out the pandas documentation on indexing and selecting data.

We can also apply methods to subset, such as this one to get the average rating:

df['imdb_rating'].mean()

8.25744680851064

Or to calculate the total number of votes:

df['total_votes'].sum()

Or a combination of multiple columns:

df[['imdb_rating', 'total_votes']].mean()

imdb_rating       8.257447
total_votes    2126.648936
dtype: float64

6.6.1.2 Selecting by position

If we do not want to use column names, we can use iloc method by using the syntax <object>.iloc[<row slice>, <column slice>], where a slice is a range of numbers separated by a colon :. So, if we were to select the value in the 4th row and 2nd column, we’d use:

df.iloc[4,2]

'Basketball'

But if we just wanted to select a column? In that case, we can use the same method but instead of specifiying a row, we will need to use : to indicate that we are selecting all the rows, such as:

df.iloc[:,2]

0                 Pilot
1         Diversity Day
2           Health Care
3          The Alliance
4            Basketball
             ...       
183       Stairmageddon
184      Paper Airplane
185    Livin' the Dream
186            A.A.R.M.
187              Finale
Name: title, Length: 188, dtype: object

Conversely, if we just wanted to select all the columns from a given row, we’d use : on the right side of the , like this:

df.iloc[4,:]

season                  1
episode                 5
title          Basketball
imdb_rating           8.4
total_votes          3179
air_date       2005-04-19
Name: 4, dtype: object

We can use negative values in indexes to indicate ‘from the end’. So, an index of [-10, :] returns the 10th from last row.

df.iloc[-10,:]

season                  9
episode                14
title           Vandalism
imdb_rating           7.6
total_votes          1402
air_date       2013-01-31
Name: 178, dtype: object

Instead of using tail, we could ask for the last 5 rows with an index of [-5:, :]. I read : as ‘and everything else’ in these cases.

df.iloc[-5:,:]

	season	episode	title	imdb_rating	total_votes	air_date
183	9	19	Stairmageddon	8.0	1484	2013-04-11
184	9	20	Paper Airplane	8.0	1482	2013-04-25
185	9	21	Livin' the Dream	8.9	2041	2013-05-02
186	9	22	A.A.R.M.	9.3	2860	2013-05-09
187	9	23	Finale	9.7	7934	2013-05-16

df.tail()

	season	episode	title	imdb_rating	total_votes	air_date
183	9	19	Stairmageddon	8.0	1484	2013-04-11
184	9	20	Paper Airplane	8.0	1482	2013-04-25
185	9	21	Livin' the Dream	8.9	2041	2013-05-02
186	9	22	A.A.R.M.	9.3	2860	2013-05-09
187	9	23	Finale	9.7	7934	2013-05-16

Note that the row is shown on the left. That will stop you getting lost in slices of the data.

For the top ten rows

df.iloc[:10,:]

	season	episode	title	imdb_rating	total_votes	air_date
0	1	1	Pilot	7.6	3706	2005-03-24
1	1	2	Diversity Day	8.3	3566	2005-03-29
2	1	3	Health Care	7.9	2983	2005-04-05
3	1	4	The Alliance	8.1	2886	2005-04-12
4	1	5	Basketball	8.4	3179	2005-04-19
5	1	6	Hot Girl	7.8	2852	2005-04-26
6	2	1	The Dundies	8.7	3213	2005-09-20
7	2	2	Sexual Harassment	8.2	2736	2005-09-27
8	2	3	Office Olympics	8.4	2742	2005-10-04
9	2	4	The Fire	8.4	2713	2005-10-11

Of course, we can run methods on these slices. We could, if we wanted to, calculate the mean imdb rating of only the first and last 100 episodes. Note the indexing starts at 0 so we want the column index of 3 (0:season, 1:episode, 2:title, 3:imdb_rating).

df.iloc[:100,3].mean()

8.483

df.iloc[-100:,3].mean()

8.062

If you are unsure how many rows you have then the count method comes to the rescue.

df.iloc[-100:,3].count()

df.describe()

	season	episode	imdb_rating	total_votes
count	188.000000	188.000000	188.000000	188.000000
mean	5.468085	11.877660	8.257447	2126.648936
std	2.386245	7.024855	0.538067	787.098275
min	1.000000	1.000000	6.700000	1393.000000
25%	3.000000	6.000000	7.900000	1631.500000
50%	6.000000	11.500000	8.200000	1952.500000
75%	7.250000	18.000000	8.600000	2379.000000
max	9.000000	26.000000	9.700000	7934.000000

So it looks like the last 100 episodes were less good than the first 100. I guess that is why it was cancelled.

Our data is organised by season. Looking at the average by season might help.

df[['season', 'imdb_rating']].groupby('season').mean()

	imdb_rating
season
1	8.016667
2	8.436364
3	8.573913
4	8.600000
5	8.492308
6	8.219231
7	8.316667
8	7.666667
9	7.956522

The above line groups our dataframe by values in the season column and then displays the mean for each group. Pretty nifty.

Season 8 looks pretty bad. We can look at just the rows for season 8.

df[df['season'] == 8]

	season	episode	title	imdb_rating	total_votes	air_date
141	8	1	The List	8.2	1829	2011-09-22
142	8	2	The Incentive	8.2	1668	2011-09-29
143	8	3	Lotto	7.3	1601	2011-10-06
144	8	4	Garden Party	8.1	1717	2011-10-13
145	8	5	Spooked	7.6	1543	2011-10-27
146	8	6	Doomsday	7.8	1476	2011-11-03
147	8	7	Pam's Replacement	7.7	1563	2011-11-10
148	8	8	Gettysburg	7.0	1584	2011-11-17
149	8	9	Mrs. California	7.7	1553	2011-12-01
150	8	10	Christmas Wishes	8.0	1547	2011-12-08
151	8	11	Trivia	7.9	1488	2012-01-12
152	8	12	Pool Party	8.0	1612	2012-01-19
153	8	13	Jury Duty	7.5	1478	2012-02-02
154	8	14	Special Project	7.8	1432	2012-02-09
155	8	15	Tallahassee	7.9	1522	2012-02-16
156	8	16	After Hours	8.1	1567	2012-02-23
157	8	17	Test the Store	7.8	1478	2012-03-01
158	8	18	Last Day in Florida	7.8	1429	2012-03-08
159	8	19	Get the Girl	6.7	1642	2012-03-15
160	8	20	Welcome Party	7.2	1489	2012-04-12
161	8	21	Angry Andy	7.1	1585	2012-04-19
162	8	22	Fundraiser	7.1	1453	2012-04-26
163	8	23	Turf War	7.7	1393	2012-05-03
164	8	24	Free Family Portrait Studio	7.8	1464	2012-05-10

6.6.2 Filtering rows

We can filter rows matching some criteria by using the syntax <object>.loc[<criteria>]. So, if we wanted to filter all the episodes from the 8th season, we would do the following:

df.loc[df['season'] == 8]

	season	episode	title	imdb_rating	total_votes	air_date
141	8	1	The List	8.2	1829	2011-09-22
142	8	2	The Incentive	8.2	1668	2011-09-29
143	8	3	Lotto	7.3	1601	2011-10-06
144	8	4	Garden Party	8.1	1717	2011-10-13
145	8	5	Spooked	7.6	1543	2011-10-27
146	8	6	Doomsday	7.8	1476	2011-11-03
147	8	7	Pam's Replacement	7.7	1563	2011-11-10
148	8	8	Gettysburg	7.0	1584	2011-11-17
149	8	9	Mrs. California	7.7	1553	2011-12-01
150	8	10	Christmas Wishes	8.0	1547	2011-12-08
151	8	11	Trivia	7.9	1488	2012-01-12
152	8	12	Pool Party	8.0	1612	2012-01-19
153	8	13	Jury Duty	7.5	1478	2012-02-02
154	8	14	Special Project	7.8	1432	2012-02-09
155	8	15	Tallahassee	7.9	1522	2012-02-16
156	8	16	After Hours	8.1	1567	2012-02-23
157	8	17	Test the Store	7.8	1478	2012-03-01
158	8	18	Last Day in Florida	7.8	1429	2012-03-08
159	8	19	Get the Girl	6.7	1642	2012-03-15
160	8	20	Welcome Party	7.2	1489	2012-04-12
161	8	21	Angry Andy	7.1	1585	2012-04-19
162	8	22	Fundraiser	7.1	1453	2012-04-26
163	8	23	Turf War	7.7	1393	2012-05-03
164	8	24	Free Family Portrait Studio	7.8	1464	2012-05-10

Understanding the criteria

To understand why we have to write the name of the dataframe twice, we can focus on the output provided by the filtering criteria only:

df['season'] == 9

0      False
1      False
2      False
3      False
4      False
       ...  
183     True
184     True
185     True
186     True
187     True
Name: season, Length: 188, dtype: bool

As you can see, it returns a boolean serie specifiying which rows are matching the criteria (True) and which ones are not (False)

As a side note, while writing the name of the dataframe twice may seem redundant, this means that we could filter rows based on other objects.

We can get an overview of the rating of all chapters within season 8 by:

df.loc[df['season'] == 8, 'imdb_rating'].describe()

count    24.000000
mean      7.666667
std       0.405041
min       6.700000
25%       7.450000
50%       7.800000
75%       7.925000
max       8.200000
Name: imdb_rating, dtype: float64

Generally pretty bad, but there is clearly one very disliked episode.

6.7 Adding columns

We can add new columns pretty simply.

df['x'] = 44
df.head()

	season	episode	title	imdb_rating	total_votes	air_date	x
0	1	1	Pilot	7.6	3706	2005-03-24	44
1	1	2	Diversity Day	8.3	3566	2005-03-29	44
2	1	3	Health Care	7.9	2983	2005-04-05	44
3	1	4	The Alliance	8.1	2886	2005-04-12	44
4	1	5	Basketball	8.4	3179	2005-04-19	44

Our new column can be an operation on other columns

df['rating_div_total_votes'] = df['imdb_rating'] / df['total_votes']
df.head()

	season	episode	title	imdb_rating	total_votes	air_date	x	rating_div_total_votes
0	1	1	Pilot	7.6	3706	2005-03-24	44	0.002051
1	1	2	Diversity Day	8.3	3566	2005-03-29	44	0.002328
2	1	3	Health Care	7.9	2983	2005-04-05	44	0.002648
3	1	4	The Alliance	8.1	2886	2005-04-12	44	0.002807
4	1	5	Basketball	8.4	3179	2005-04-19	44	0.002642

or as simple as adding one to every value.

df['y'] = df['season'] + 1
df.iloc[0:5,:]

	season	episode	title	imdb_rating	total_votes	air_date	x	rating_div_total_votes	y
0	1	1	Pilot	7.6	3706	2005-03-24	44	0.002051	2
1	1	2	Diversity Day	8.3	3566	2005-03-29	44	0.002328	2
2	1	3	Health Care	7.9	2983	2005-04-05	44	0.002648	2
3	1	4	The Alliance	8.1	2886	2005-04-12	44	0.002807	2
4	1	5	Basketball	8.4	3179	2005-04-19	44	0.002642	2

y =  df['season'] + 1

6.8 Writing data

Pandas supports writing out data frames to various formats.

?df.to_csv

Now you can uncomment the code below to save your dataframe into a csv file. But before doing so, check that your data/output folder is empty, as it would override its content:

df.to_csv('data/output/my_output_ratings.csv', encoding='UTF-8')

Likewise, we could export our dataset to an excel file by using to_excel:

?df.to_excel

Now you can uncomment the code below to save your dataframe into an excel file. But before doing so, check that your data/output folder is empty:

# df.to_excel('data/output/my_output_ratings.xlsx')

6.9 Combining datasets

In this notebook, our dataset was created from a single file that contained all the data that was needed. However, often times data will be spread into different files that we will need to combine to create our own dataset.

Consider the two dataframes below:

df_1 = pd.read_csv('data/raw/office1.csv', encoding='UTF-8')
df_2 = pd.read_csv('data/raw/office2.csv', encoding='UTF-8')

df_1.head()

	id	season	episode	imdb_rating
0	5-1	5	1	8.8
1	9-13	9	13	7.7
2	5-6	5	6	8.5
3	3-23	3	23	9.3
4	9-16	9	16	8.2

df_2.head()

	id	total_votes
0	4-10	2095
1	3-21	2403
2	7-24	2040
3	6-18	1769
4	8-8	1584

As can be seen, the total votes and imdb ratings data are split between files that we will need to combine. Usually this is done by using a shared column between the two datasets that works as an index. Gladly, head() reveals that in both cases there is a common column called id. We can join the two dataframes together using the common column.

inner_join_office_df = pd.merge(df_1, df_2, on='id', how='inner')
inner_join_office_df

	id	season	episode	imdb_rating	total_votes
0	5-1	5	1	8.8	2501
1	9-13	9	13	7.7	1394
2	5-6	5	6	8.5	2018
3	3-23	3	23	9.3	3010
4	9-16	9	16	8.2	1572
...	...	...	...	...	...
183	5-21	5	21	8.7	2032
184	2-13	2	13	8.3	2363
185	9-6	9	6	7.8	1455
186	2-2	2	2	8.2	2736
187	3-4	3	4	8.0	2311

188 rows × 5 columns

In this way you can combine datasets using common columns and an inner join. We will leave that for the moment. If you want more information about merging data then see this page and the pandas documentation.

6.9.1 Well done!

Well done! You’ve reached the end of a pretty long notebook that went through a lot of details about how to work with this pretty critical package called Pandas. We haven’t really done a lot of detailed analysis in this one yet but we will use Pandas a lot and frequently.

Your best friend will be the Pandas documentation – https://pandas.pydata.org/docs/index.html

This documentation is great. We particuarly recomment the User Guide that will answer most of your questions and will give you a lot of code to copy and paste first and then modify to do what you need to do – https://pandas.pydata.org/docs/user_guide/index.html#user-guide