# These two lines are added so that not all the warnings are rendered in the cell.
# We do this so as not to confuse you during your learning journey with some of the warnings, but
# normally you would want them turned on, since they can tell you something about things that might not be working as expected.
import warnings
warnings.filterwarnings('ignore')
6 Lab: Pandas
Last week we loaded data using pandas’ `read_csv()` method, but pandas can do way more than that. Pandas is an essential library for data science, as it provides the data structures (namely, series (1D) and data frames (2D)) and operations for manipulating tabular data.
In this unit’s labs we will be using pandas to read in, process and explore data (this notebook), create (basic) visualisations (Chapter 7), and transform data (Chapter 8). To do so, we will be using a dataset about “The Office”.
6.1 Dataset
The Office is a comedy TV series originally created in 2001 by Ricky Gervais and Stephen Merchant that has received several adaptations. The dataset that we will be using contains information (i.e., title, air date and IMDb ratings) about every episode of the 9 seasons of the very successful US adaptation, aired between 2005 and 2013.
The dataset is stored in a csv file that has the following columns: `season`, `episode`, `title`, `imdb_rating`, `total_votes`, `air_date`.
6.2 Starting
To work with the dataset we will need to import pandas so we can use every feature provided by the library, as well as load the dataset stored in the `office_ratings.csv` file.
# The following import is always needed when you want to work with pandas.
import pandas as pd
df = pd.read_csv('data/raw/office_ratings.csv', encoding='UTF-8')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 season 188 non-null int64
1 episode 188 non-null int64
2 title 188 non-null object
3 imdb_rating 188 non-null float64
4 total_votes 188 non-null int64
5 air_date 188 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 8.9+ KB
6.3 Help!
Python has inbuilt documentation. To access it, add a `?` before an object or method.
The output of the help function has been omitted in the handbook. Please run the cells in your notebook to read the different outputs.
For example, for our dataframe:
?df
or the `dtypes` property.
Properties of an object are values associated with the object and are not called with `()` at the end.
?df.dtypes
The `info` method for dataframes:
?df.info
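If the distinction between properties and methods in the note above feels abstract, here is a minimal sketch contrasting the two on our dataframe (all of these are standard pandas attributes and methods):
# A property is a plain attribute lookup: no parentheses.
print(df.shape)    # a tuple of (rows, columns)
print(df.dtypes)   # a Series of column data types
# A method is a function call: note the parentheses.
df.info()          # prints a summary of the dataframe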
Getting help in-line like in the examples above can produce a very long message, which might not always be convenient. If you like, you can try to get help for the following function like this:
?pd.read_csv
The output will be quite long: it lists the various arguments (options) you can use with the method.
Instead of this approach, a much better way to get help is to refer to the documentation and the API of the library that you are using. For instance, for `read_csv()`, this page is much more useful: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
We recommend that you use a search engine very frequently.
The Pandas documentation is rather good. Relevant to the work below are:
- What kind of data does pandas handle?
- How to calculate summary statistics?
- How to create plots in pandas?
- How to handle time series data with ease?
I also found a rather nice series of lessons a kind person put together. There are lots of online tutorials which will help you.
6.4 Structure
In Section 2.3 we introduced Python’s data types as well as how to use the function `type()` to retrieve an object’s data type. Pandas expands Python’s data types with a new one: the data frame.
Do you remember what Python’s data types are? You can refer to Section 2.3 for a refresher and to learn more about them.
Data frames are 2-dimensional data structures that store information in columns and rows, very much like data is stored in a spreadsheet or a database. Typically, every column contains a variable (sometimes called an attribute) whereas every row represents an observation. This is known as a wide data frame, as opposed to a long data frame.
In pandas, every column has a name and rows can be named, too.
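To make that concrete, here is a minimal sketch that builds a tiny data frame by hand (the values are copied from the first two episodes of our dataset): the dictionary keys become the column names, and the `index` argument names the rows.
# Column names come from the dictionary keys;
# row names come from the index argument.
mini = pd.DataFrame(
    {'imdb_rating': [7.6, 8.3], 'total_votes': [3706, 3566]},
    index=['Pilot', 'Diversity Day']
)
print(mini)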
So let’s check our newly created object’s (`df`) data type:
type(df)
pandas.core.frame.DataFrame
Unsurprisingly, `df` is a `DataFrame` object, provided by pandas.
The `DataFrame` object has lots of built-in methods and attributes.
The `info` method gives us information about data types, dimensions and the presence of null values in our dataframe. Let’s see how we can use it and what information is returned:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 season 188 non-null int64
1 episode 188 non-null int64
2 title 188 non-null object
3 imdb_rating 188 non-null float64
4 total_votes 188 non-null int64
5 air_date 188 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 8.9+ KB
We can use `dtypes` to check the data type of every variable in the data frame.
df.dtypes
season int64
episode int64
title object
imdb_rating float64
total_votes int64
air_date object
dtype: object
Or just the dimensions (i.e., rows and columns).
df.shape
(188, 6)
In this case, there are only 188 rows. But for larger datasets we might want to look at the head (top 5) and tail (bottom 5) rows using `.head()` and `.tail()`, respectively.
df.head()
 | season | episode | title | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|
0 | 1 | 1 | Pilot | 7.6 | 3706 | 2005-03-24 |
1 | 1 | 2 | Diversity Day | 8.3 | 3566 | 2005-03-29 |
2 | 1 | 3 | Health Care | 7.9 | 2983 | 2005-04-05 |
3 | 1 | 4 | The Alliance | 8.1 | 2886 | 2005-04-12 |
4 | 1 | 5 | Basketball | 8.4 | 3179 | 2005-04-19 |
df.tail()
 | season | episode | title | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|
183 | 9 | 19 | Stairmageddon | 8.0 | 1484 | 2013-04-11 |
184 | 9 | 20 | Paper Airplane | 8.0 | 1482 | 2013-04-25 |
185 | 9 | 21 | Livin' the Dream | 8.9 | 2041 | 2013-05-02 |
186 | 9 | 22 | A.A.R.M. | 9.3 | 2860 | 2013-05-09 |
187 | 9 | 23 | Finale | 9.7 | 7934 | 2013-05-16 |
6.5 Summary
To get an overview of our data we can ask Python to ‘describe’ our (numeric) data:
df.describe()
 | season | episode | imdb_rating | total_votes |
---|---|---|---|---|
count | 188.000000 | 188.000000 | 188.000000 | 188.000000 |
mean | 5.468085 | 11.877660 | 8.257447 | 2126.648936 |
std | 2.386245 | 7.024855 | 0.538067 | 787.098275 |
min | 1.000000 | 1.000000 | 6.700000 | 1393.000000 |
25% | 3.000000 | 6.000000 | 7.900000 | 1631.500000 |
50% | 6.000000 | 11.500000 | 8.200000 | 1952.500000 |
75% | 7.250000 | 18.000000 | 8.600000 | 2379.000000 |
max | 9.000000 | 26.000000 | 9.700000 | 7934.000000 |
or we can try to pull out specific statistics, such as the mean:
df.mean()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 df.mean()
    ...
TypeError: Could not convert ["PilotDiversity DayHealth Care...Finale" '2005-03-24...2013-05-16'] to numeric
Note the error triggered above, due to pandas attempting to calculate the mean of non-numeric values. We can address that by computing the mean of numeric columns only:
df.mean(numeric_only=True)
season 5.468085
episode 11.877660
imdb_rating 8.257447
total_votes 2126.648936
dtype: float64
or the sum of every value within each column:
df.sum()
season 1028
episode 2233
title PilotDiversity DayHealth CareThe AllianceBaske...
imdb_rating 1552.4
total_votes 399810
air_date 2005-03-242005-03-292005-04-052005-04-122005-0...
dtype: object
Similarly to what happened with `mean()`, `sum()` is adding all the values in every observation of every attribute, regardless of their type, but this time it does not produce an error. Can you see what happens with strings? And with dates?
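As a hint, here is a minimal sketch of what `sum()` does to a column of strings: Python’s `+` concatenates strings, so ‘summing’ them glues the values together, which is exactly what happened to `title` and `air_date` above (both are stored as `object`, i.e. strings).
# Summing strings concatenates them instead of raising an error.
print(pd.Series(['Pilot', 'Diversity Day']).sum())  # 'PilotDiversity Day'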
Again, we can restrict it to numeric values only:
df.sum(numeric_only=True)
season 1028.0
episode 2233.0
imdb_rating 1552.4
total_votes 399810.0
dtype: float64
6.6 Subsetting
Oftentimes we have a large dataset but only need to work with a part of it (a subset) consisting of certain columns and/or rows. Selecting specific columns and/or rows is known as subsetting.
6.6.1 Selecting columns
Because in pandas every column has a name, we can select columns either by their name or by their position.
6.6.1.1 Selecting by name
To select by name we use the syntax `df['<column_name>']`. For example, if we wanted to select the ratings:
df['imdb_rating']
0 7.6
1 8.3
2 7.9
3 8.1
4 8.4
...
183 8.0
184 8.0
185 8.9
186 9.3
187 9.7
Name: imdb_rating, Length: 188, dtype: float64
or we could select the date on which the episodes were first aired:
df['air_date']
0 2005-03-24
1 2005-03-29
2 2005-04-05
3 2005-04-12
4 2005-04-19
...
183 2013-04-11
184 2013-04-25
185 2013-05-02
186 2013-05-09
187 2013-05-16
Name: air_date, Length: 188, dtype: object
We can even select more than one column!
df[['imdb_rating', 'total_votes']]
 | imdb_rating | total_votes |
---|---|---|
0 | 7.6 | 3706 |
1 | 8.3 | 3566 |
2 | 7.9 | 2983 |
3 | 8.1 | 2886 |
4 | 8.4 | 3179 |
... | ... | ... |
183 | 8.0 | 1484 |
184 | 8.0 | 1482 |
185 | 8.9 | 2041 |
186 | 9.3 | 2860 |
187 | 9.7 | 7934 |
188 rows × 2 columns
Did you notice that we used two sets of square brackets (`[[]]`)? This is needed because we are passing a list of the column names to the `__getitem__` method of the pandas dataframe object, and as you may remember from Section 2.3.2, this is the syntax used for lists (thanks to this stackoverflow question).
This is what we’d get otherwise:
df['imdb_rating', 'total_votes']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[22], line 1
----> 1 df['imdb_rating', 'total_votes']
    ...
KeyError: ('imdb_rating', 'total_votes')
You can also check out the pandas documentation on indexing and selecting data.
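Since the inner brackets are just an ordinary Python list, you can also build the list of column names separately and store it in a variable first. A minimal sketch (the variable name `cols` is our own choice):
# The inner brackets are an ordinary Python list,
# so we can build it ahead of time and reuse it.
cols = ['imdb_rating', 'total_votes']
df[cols].head()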
We can also apply methods to these subsets, such as this one to get the average rating:
df['imdb_rating'].mean()
8.25744680851064
Or to calculate the total number of votes:
df['total_votes'].sum()
399810
Or a combination of multiple columns:
df[['imdb_rating', 'total_votes']].mean()
imdb_rating 8.257447
total_votes 2126.648936
dtype: float64
6.6.1.2 Selecting by position
If we do not want to use column names, we can use the `iloc` method with the syntax `<object>.iloc[<row slice>, <column slice>]`, where a slice is a range of numbers separated by a colon (`:`). So, if we wanted to select the value at row index 4 and column index 2 (remember that Python counts from 0), we’d use:
df.iloc[4,2]
'Basketball'
But what if we just wanted to select a column? In that case, we can use the same method, but instead of specifying a row we use `:` to indicate that we are selecting all the rows, such as:
df.iloc[:,2]
0 Pilot
1 Diversity Day
2 Health Care
3 The Alliance
4 Basketball
...
183 Stairmageddon
184 Paper Airplane
185 Livin' the Dream
186 A.A.R.M.
187 Finale
Name: title, Length: 188, dtype: object
Conversely, if we just wanted to select all the columns from a given row, we’d use `:` on the right side of the `,`, like this:
df.iloc[4,:]
season 1
episode 5
title Basketball
imdb_rating 8.4
total_votes 3179
air_date 2005-04-19
Name: 4, dtype: object
We can use negative values in indexes to mean ‘from the end’. So, an index of `[-10, :]` returns the 10th-from-last row.
df.iloc[-10,:]
season 9
episode 14
title Vandalism
imdb_rating 7.6
total_votes 1402
air_date 2013-01-31
Name: 178, dtype: object
Instead of using tail, we could ask for the last 5 rows with an index of `[-5:, :]`. I read `:` as ‘and everything else’ in these cases.
df.iloc[-5:,:]
 | season | episode | title | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|
183 | 9 | 19 | Stairmageddon | 8.0 | 1484 | 2013-04-11 |
184 | 9 | 20 | Paper Airplane | 8.0 | 1482 | 2013-04-25 |
185 | 9 | 21 | Livin' the Dream | 8.9 | 2041 | 2013-05-02 |
186 | 9 | 22 | A.A.R.M. | 9.3 | 2860 | 2013-05-09 |
187 | 9 | 23 | Finale | 9.7 | 7934 | 2013-05-16 |
df.tail()
 | season | episode | title | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|
183 | 9 | 19 | Stairmageddon | 8.0 | 1484 | 2013-04-11 |
184 | 9 | 20 | Paper Airplane | 8.0 | 1482 | 2013-04-25 |
185 | 9 | 21 | Livin' the Dream | 8.9 | 2041 | 2013-05-02 |
186 | 9 | 22 | A.A.R.M. | 9.3 | 2860 | 2013-05-09 |
187 | 9 | 23 | Finale | 9.7 | 7934 | 2013-05-16 |
Note that the row label is shown on the left. That will stop you getting lost in slices of the data.
For the top ten rows:
df.iloc[:10,:]
 | season | episode | title | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|
0 | 1 | 1 | Pilot | 7.6 | 3706 | 2005-03-24 |
1 | 1 | 2 | Diversity Day | 8.3 | 3566 | 2005-03-29 |
2 | 1 | 3 | Health Care | 7.9 | 2983 | 2005-04-05 |
3 | 1 | 4 | The Alliance | 8.1 | 2886 | 2005-04-12 |
4 | 1 | 5 | Basketball | 8.4 | 3179 | 2005-04-19 |
5 | 1 | 6 | Hot Girl | 7.8 | 2852 | 2005-04-26 |
6 | 2 | 1 | The Dundies | 8.7 | 3213 | 2005-09-20 |
7 | 2 | 2 | Sexual Harassment | 8.2 | 2736 | 2005-09-27 |
8 | 2 | 3 | Office Olympics | 8.4 | 2742 | 2005-10-04 |
9 | 2 | 4 | The Fire | 8.4 | 2713 | 2005-10-11 |
Of course, we can run methods on these slices. We could, if we wanted to, calculate the mean IMDb rating of only the first and the last 100 episodes. Note that the indexing starts at 0, so we want a column index of 3 (0: season, 1: episode, 2: title, 3: imdb_rating).
df.iloc[:100,3].mean()
8.483
df.iloc[-100:,3].mean()
8.062
If you are unsure how many rows you have, the `count` method comes to the rescue.
df.iloc[-100:,3].count()
100
df.describe()
 | season | episode | imdb_rating | total_votes |
---|---|---|---|---|
count | 188.000000 | 188.000000 | 188.000000 | 188.000000 |
mean | 5.468085 | 11.877660 | 8.257447 | 2126.648936 |
std | 2.386245 | 7.024855 | 0.538067 | 787.098275 |
min | 1.000000 | 1.000000 | 6.700000 | 1393.000000 |
25% | 3.000000 | 6.000000 | 7.900000 | 1631.500000 |
50% | 6.000000 | 11.500000 | 8.200000 | 1952.500000 |
75% | 7.250000 | 18.000000 | 8.600000 | 2379.000000 |
max | 9.000000 | 26.000000 | 9.700000 | 7934.000000 |
So it looks like the last 100 episodes were less good than the first 100. I guess that is why it was cancelled.
Our data is organised by season. Looking at the average by season might help.
df[['season', 'imdb_rating']].groupby('season').mean()
season | imdb_rating |
---|---|
1 | 8.016667 |
2 | 8.436364 |
3 | 8.573913 |
4 | 8.600000 |
5 | 8.492308 |
6 | 8.219231 |
7 | 8.316667 |
8 | 7.666667 |
9 | 7.956522 |
The above line groups our dataframe by values in the season column and then displays the mean for each group. Pretty nifty.
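If you want more than the mean per group, pandas’ `agg()` lets you request several statistics at once. A minimal sketch:
# Compute several summary statistics per season in one go.
df.groupby('season')['imdb_rating'].agg(['mean', 'min', 'max'])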
Season 8 looks pretty bad. We can look at just the rows for season 8.
df[df['season'] == 8]
 | season | episode | title | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|
141 | 8 | 1 | The List | 8.2 | 1829 | 2011-09-22 |
142 | 8 | 2 | The Incentive | 8.2 | 1668 | 2011-09-29 |
143 | 8 | 3 | Lotto | 7.3 | 1601 | 2011-10-06 |
144 | 8 | 4 | Garden Party | 8.1 | 1717 | 2011-10-13 |
145 | 8 | 5 | Spooked | 7.6 | 1543 | 2011-10-27 |
146 | 8 | 6 | Doomsday | 7.8 | 1476 | 2011-11-03 |
147 | 8 | 7 | Pam's Replacement | 7.7 | 1563 | 2011-11-10 |
148 | 8 | 8 | Gettysburg | 7.0 | 1584 | 2011-11-17 |
149 | 8 | 9 | Mrs. California | 7.7 | 1553 | 2011-12-01 |
150 | 8 | 10 | Christmas Wishes | 8.0 | 1547 | 2011-12-08 |
151 | 8 | 11 | Trivia | 7.9 | 1488 | 2012-01-12 |
152 | 8 | 12 | Pool Party | 8.0 | 1612 | 2012-01-19 |
153 | 8 | 13 | Jury Duty | 7.5 | 1478 | 2012-02-02 |
154 | 8 | 14 | Special Project | 7.8 | 1432 | 2012-02-09 |
155 | 8 | 15 | Tallahassee | 7.9 | 1522 | 2012-02-16 |
156 | 8 | 16 | After Hours | 8.1 | 1567 | 2012-02-23 |
157 | 8 | 17 | Test the Store | 7.8 | 1478 | 2012-03-01 |
158 | 8 | 18 | Last Day in Florida | 7.8 | 1429 | 2012-03-08 |
159 | 8 | 19 | Get the Girl | 6.7 | 1642 | 2012-03-15 |
160 | 8 | 20 | Welcome Party | 7.2 | 1489 | 2012-04-12 |
161 | 8 | 21 | Angry Andy | 7.1 | 1585 | 2012-04-19 |
162 | 8 | 22 | Fundraiser | 7.1 | 1453 | 2012-04-26 |
163 | 8 | 23 | Turf War | 7.7 | 1393 | 2012-05-03 |
164 | 8 | 24 | Free Family Portrait Studio | 7.8 | 1464 | 2012-05-10 |
6.6.2 Filtering rows
We can filter rows matching some criteria by using the syntax `<object>.loc[<criteria>]`. So, if we wanted to filter all the episodes from the 8th season, we would do the following:
df.loc[df['season'] == 8]
 | season | episode | title | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|
141 | 8 | 1 | The List | 8.2 | 1829 | 2011-09-22 |
142 | 8 | 2 | The Incentive | 8.2 | 1668 | 2011-09-29 |
143 | 8 | 3 | Lotto | 7.3 | 1601 | 2011-10-06 |
144 | 8 | 4 | Garden Party | 8.1 | 1717 | 2011-10-13 |
145 | 8 | 5 | Spooked | 7.6 | 1543 | 2011-10-27 |
146 | 8 | 6 | Doomsday | 7.8 | 1476 | 2011-11-03 |
147 | 8 | 7 | Pam's Replacement | 7.7 | 1563 | 2011-11-10 |
148 | 8 | 8 | Gettysburg | 7.0 | 1584 | 2011-11-17 |
149 | 8 | 9 | Mrs. California | 7.7 | 1553 | 2011-12-01 |
150 | 8 | 10 | Christmas Wishes | 8.0 | 1547 | 2011-12-08 |
151 | 8 | 11 | Trivia | 7.9 | 1488 | 2012-01-12 |
152 | 8 | 12 | Pool Party | 8.0 | 1612 | 2012-01-19 |
153 | 8 | 13 | Jury Duty | 7.5 | 1478 | 2012-02-02 |
154 | 8 | 14 | Special Project | 7.8 | 1432 | 2012-02-09 |
155 | 8 | 15 | Tallahassee | 7.9 | 1522 | 2012-02-16 |
156 | 8 | 16 | After Hours | 8.1 | 1567 | 2012-02-23 |
157 | 8 | 17 | Test the Store | 7.8 | 1478 | 2012-03-01 |
158 | 8 | 18 | Last Day in Florida | 7.8 | 1429 | 2012-03-08 |
159 | 8 | 19 | Get the Girl | 6.7 | 1642 | 2012-03-15 |
160 | 8 | 20 | Welcome Party | 7.2 | 1489 | 2012-04-12 |
161 | 8 | 21 | Angry Andy | 7.1 | 1585 | 2012-04-19 |
162 | 8 | 22 | Fundraiser | 7.1 | 1453 | 2012-04-26 |
163 | 8 | 23 | Turf War | 7.7 | 1393 | 2012-05-03 |
164 | 8 | 24 | Free Family Portrait Studio | 7.8 | 1464 | 2012-05-10 |
To understand why we have to write the name of the dataframe twice, we can focus on the output produced by the filtering criterion on its own:
df['season'] == 9
0 False
1 False
2 False
3 False
4 False
...
183 True
184 True
185 True
186 True
187 True
Name: season, Length: 188, dtype: bool
As you can see, it returns a boolean series specifying which rows match the criteria (`True`) and which ones do not (`False`).
As a side note, while writing the name of the dataframe twice may seem redundant, it means that we could filter rows based on other objects.
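For example, the boolean series can live in its own variable before being passed back to the dataframe. A minimal sketch (the threshold of 3000 votes is an arbitrary choice of ours):
# Build the boolean mask first, then use it to filter the rows.
high_votes = df['total_votes'] > 3000
df.loc[high_votes].head()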
We can get an overview of the ratings of all the episodes within season 8:
df.loc[df['season'] == 8, 'imdb_rating'].describe()
count 24.000000
mean 7.666667
std 0.405041
min 6.700000
25% 7.450000
50% 7.800000
75% 7.925000
max 8.200000
Name: imdb_rating, dtype: float64
Generally pretty bad, but there is clearly one very disliked episode.
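If you want to pin that episode down, `idxmin()` returns the row label of the smallest value, which can be fed back into `.loc`. A minimal sketch:
# Find the row label of season 8's lowest-rated episode,
# then pull out the whole row.
season_8 = df.loc[df['season'] == 8]
season_8.loc[season_8['imdb_rating'].idxmin()]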
6.7 Adding columns
We can add new columns pretty simply.
df['x'] = 44
df.head()
 | season | episode | title | imdb_rating | total_votes | air_date | x |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | Pilot | 7.6 | 3706 | 2005-03-24 | 44 |
1 | 1 | 2 | Diversity Day | 8.3 | 3566 | 2005-03-29 | 44 |
2 | 1 | 3 | Health Care | 7.9 | 2983 | 2005-04-05 | 44 |
3 | 1 | 4 | The Alliance | 8.1 | 2886 | 2005-04-12 | 44 |
4 | 1 | 5 | Basketball | 8.4 | 3179 | 2005-04-19 | 44 |
Our new column can be an operation on other columns:
df['rating_div_total_votes'] = df['imdb_rating'] / df['total_votes']
df.head()
 | season | episode | title | imdb_rating | total_votes | air_date | x | rating_div_total_votes |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Pilot | 7.6 | 3706 | 2005-03-24 | 44 | 0.002051 |
1 | 1 | 2 | Diversity Day | 8.3 | 3566 | 2005-03-29 | 44 | 0.002328 |
2 | 1 | 3 | Health Care | 7.9 | 2983 | 2005-04-05 | 44 | 0.002648 |
3 | 1 | 4 | The Alliance | 8.1 | 2886 | 2005-04-12 | 44 | 0.002807 |
4 | 1 | 5 | Basketball | 8.4 | 3179 | 2005-04-19 | 44 | 0.002642 |
or something as simple as adding one to every value:
df['y'] = df['season'] + 1
df.iloc[0:5,:]
 | season | episode | title | imdb_rating | total_votes | air_date | x | rating_div_total_votes | y |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Pilot | 7.6 | 3706 | 2005-03-24 | 44 | 0.002051 | 2 |
1 | 1 | 2 | Diversity Day | 8.3 | 3566 | 2005-03-29 | 44 | 0.002328 | 2 |
2 | 1 | 3 | Health Care | 7.9 | 2983 | 2005-04-05 | 44 | 0.002648 | 2 |
3 | 1 | 4 | The Alliance | 8.1 | 2886 | 2005-04-12 | 44 | 0.002807 | 2 |
4 | 1 | 5 | Basketball | 8.4 | 3179 | 2005-04-19 | 44 | 0.002642 | 2 |
# The result can also be assigned to a standalone variable rather than a new column.
y = df['season'] + 1
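Before moving on, note that the throwaway columns we just added (`x`, `y` and `rating_div_total_votes`) can be removed again with `drop()`. A minimal sketch (`df_clean` is our own name):
# drop() returns a copy without the named columns;
# assign it to a name (or back to df) to keep the result.
df_clean = df.drop(columns=['x', 'y', 'rating_div_total_votes'])
df_clean.head()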
6.8 Writing data
Pandas supports writing out data frames to various formats.
?df.to_csv
Now you can uncomment the code below to save your dataframe into a csv file. But before doing so, check that your data/output folder is empty, as the code would overwrite its content:
# df.to_csv('data/output/my_output_ratings.csv', encoding='UTF-8')
Likewise, we could export our dataset to an Excel file by using `to_excel`:
?df.to_excel
Now you can uncomment the code below to save your dataframe into an Excel file. But before doing so, check that your data/output folder is empty:
# df.to_excel('data/output/my_output_ratings.xlsx')
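One thing worth knowing: by default `to_csv()` (and `to_excel()`) also writes the row index as an extra first column. If you do not want that, pass `index=False`. A minimal sketch, again left commented out:
# index=False keeps the output file to just the data columns.
# df.to_csv('data/output/my_output_ratings.csv', encoding='UTF-8', index=False)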
6.9 Combining datasets
In this notebook, our dataset was created from a single file that contained all the data we needed. However, oftentimes data is spread across different files that we need to combine to create our own dataset.
Consider the two dataframes below:
df_1 = pd.read_csv('data/raw/office1.csv', encoding='UTF-8')
df_2 = pd.read_csv('data/raw/office2.csv', encoding='UTF-8')
df_1.head()
 | id | season | episode | imdb_rating |
---|---|---|---|---|
0 | 5-1 | 5 | 1 | 8.8 |
1 | 9-13 | 9 | 13 | 7.7 |
2 | 5-6 | 5 | 6 | 8.5 |
3 | 3-23 | 3 | 23 | 9.3 |
4 | 9-16 | 9 | 16 | 8.2 |
df_2.head()
 | id | total_votes |
---|---|---|
0 | 4-10 | 2095 |
1 | 3-21 | 2403 |
2 | 7-24 | 2040 |
3 | 6-18 | 1769 |
4 | 8-8 | 1584 |
As can be seen, the total votes and the IMDb ratings are split between two files that we will need to combine. Usually this is done by using a shared column between the two datasets that works as an index. Luckily, `head()` reveals that in both cases there is a common column called `id`. We can join the two dataframes together using this common column.
inner_join_office_df = pd.merge(df_1, df_2, on='id', how='inner')
inner_join_office_df
 | id | season | episode | imdb_rating | total_votes |
---|---|---|---|---|---|
0 | 5-1 | 5 | 1 | 8.8 | 2501 |
1 | 9-13 | 9 | 13 | 7.7 | 1394 |
2 | 5-6 | 5 | 6 | 8.5 | 2018 |
3 | 3-23 | 3 | 23 | 9.3 | 3010 |
4 | 9-16 | 9 | 16 | 8.2 | 1572 |
... | ... | ... | ... | ... | ... |
183 | 5-21 | 5 | 21 | 8.7 | 2032 |
184 | 2-13 | 2 | 13 | 8.3 | 2363 |
185 | 9-6 | 9 | 6 | 7.8 | 1455 |
186 | 2-2 | 2 | 2 | 8.2 | 2736 |
187 | 3-4 | 3 | 4 | 8.0 | 2311 |
188 rows × 5 columns
In this way you can combine datasets using common columns and an inner join. We will leave that for the moment. If you want more information about merging data then see this page and the pandas documentation.
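As a taste of what else `merge()` can do, here is a minimal sketch of a left join on the same two dataframes: it keeps every row of `df_1` and fills unmatched rows from `df_2` with `NaN`, instead of dropping them as an inner join would.
# how='left' keeps all rows from df_1, even without a match in df_2.
left_join_office_df = pd.merge(df_1, df_2, on='id', how='left')
left_join_office_df.head()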
6.9.1 Well done!
Well done! You’ve reached the end of a pretty long notebook that went through a lot of detail about how to work with this pretty critical package called Pandas. We haven’t really done a lot of detailed analysis in this one yet, but we will be using Pandas a lot going forward.
Your best friend will be the Pandas documentation – https://pandas.pydata.org/docs/index.html
This documentation is great. We particularly recommend the User Guide, which will answer most of your questions and give you a lot of code to copy and paste first and then modify to do what you need to do: https://pandas.pydata.org/docs/user_guide/index.html#user-guide