import warnings
warnings.filterwarnings('ignore')
# Importing libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.spatial.distance import cdist # This is the function that computes the Mahalanobis distance
from scipy.stats import skew, kurtosis # The skew and kurtosis of a distribution
from statsmodels.robust.scale import mad # This computes the median absolute deviation of a distribution
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline # Display matplotlib plots inline with the text. +info: https://stackoverflow.com/a/43028034/1913457
11 Lab: Data Processing and Summarization
Here we will be exploring different functions in data processing and summarization.
11.1 Outliers
In this section we will be applying several methods to identify outliers. We will be using a custom dataset called accord_sedan.csv,
which contains car properties such as price, mileage and trim. The dataset is provided as a CSV file.
11.1.1 Assess data
As usual, we will want to start loading the data and assessing it:
# Loading Data
df = pd.read_csv('data/raw/accord_sedan.csv')
# Inspecting the few first rows of the dataframe
df.head()
  | price | mileage | year | trim | engine | transmission |
---|---|---|---|---|---|---|
0 | 14995 | 67697 | 2006 | ex | 4 Cyl | Manual |
1 | 11988 | 73738 | 2006 | ex | 4 Cyl | Manual |
2 | 11999 | 80313 | 2006 | lx | 4 Cyl | Automatic |
3 | 12995 | 86096 | 2006 | lx | 4 Cyl | Automatic |
4 | 11333 | 79607 | 2006 | lx | 4 Cyl | Automatic |
And we can get some summary statistics, too:
df.describe()
  | price | mileage | year |
---|---|---|---|
count | 417.000000 | 417.000000 | 417.0 |
mean | 12084.242206 | 89725.779376 | 2006.0 |
std | 2061.430034 | 25957.872271 | 0.0 |
min | 6900.000000 | 19160.000000 | 2006.0 |
25% | 10779.000000 | 71844.000000 | 2006.0 |
50% | 11995.000000 | 89900.000000 | 2006.0 |
75% | 13000.000000 | 106705.000000 | 2006.0 |
max | 18995.000000 | 149269.000000 | 2006.0 |
We only get summaries for the numeric columns here. What about the others, such as trim, engine or transmission? Can you think of a way to get statistics that give you an overview of the distribution of these values? Also, does the mean of the year values make sense? What would be a better statistic here?
# One thing to try is the describe function with an "all" parameter:
df.describe(include='all')
  | price | mileage | year | trim | engine | transmission |
---|---|---|---|---|---|---|
count | 417.000000 | 417.000000 | 417.0 | 417 | 417 | 417 |
unique | NaN | NaN | NaN | 3 | 2 | 2 |
top | NaN | NaN | NaN | ex | 4 Cyl | Automatic |
freq | NaN | NaN | NaN | 288 | 238 | 382 |
mean | 12084.242206 | 89725.779376 | 2006.0 | NaN | NaN | NaN |
std | 2061.430034 | 25957.872271 | 0.0 | NaN | NaN | NaN |
min | 6900.000000 | 19160.000000 | 2006.0 | NaN | NaN | NaN |
25% | 10779.000000 | 71844.000000 | 2006.0 | NaN | NaN | NaN |
50% | 11995.000000 | 89900.000000 | 2006.0 | NaN | NaN | NaN |
75% | 13000.000000 | 106705.000000 | 2006.0 | NaN | NaN | NaN |
max | 18995.000000 | 149269.000000 | 2006.0 | NaN | NaN | NaN |
# But this doesn't address our question. Think about visuals that could help you here. Maybe a histogram would help?
sns.histplot(data=?, x="?")
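If you are stuck, here is one possible way to fill in the blanks. It is only a sketch, not the only answer: a histogram of a categorical column such as trim, together with value_counts() and the mode, which are more meaningful summaries than the mean for columns like year.

# One possible way to fill in the blanks: a histogram of a categorical column
sns.histplot(data=df, x="trim")

# Frequency counts and the mode are more useful summaries for these columns
print(df['trim'].value_counts())
print(df['year'].mode())  # the mode makes more sense than the mean for year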
Once you have an idea of the descriptive statistics, you might want to focus on the outliers. It might be difficult to spot outliers just with the descriptive statistics. This is where we can use data visualisations to identify outliers visually and also make use of a few proxy metrics to help us assess data points.
11.1.2 Identify outliers visually
In this case we will explore variables individually (1D) by creating a boxplot1. We will visualise the price and mileage columns using matplotlib’s subplots, which combine multiple plots into a single figure:
1 Boxplots may be difficult to interpret if we are not used to them. A boxplot visualises a distribution: the box represents the interquartile range (IQR), whereas the whiskers can be defined in multiple ways (e.g. the first and third quartiles, 1.5 IQR…). From Wikipedia: “In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness groups of numerical data through their quartiles.”
# 1D Visualizations
plt.figure(figsize=(9, 5))  # Setting the figure's size: width, height (in inches)
plt.subplot(1, 2, 1)        # subplot(nrows, ncols, index, **kwargs)
plt.boxplot(df.price)
plt.title("Boxplot of the Price attribute")
plt.subplot(1, 2, 2)
plt.boxplot(df.mileage)
plt.title("Boxplot of the Mileage attribute");
# A semicolon in Python denotes separation, rather than termination.
# It allows you to write multiple statements on the same line.
But we may also want to identify 2D outliers using a scatterplot:
# 2D Visualization
plt.scatter(df.price, df.mileage, alpha=.5)
plt.xlabel('Price')
plt.ylabel('Mileage')
plt.title('Mileage vs. Price\n');
Visually, outliers appear to be outside the ‘normal’ range of the rest of the points. A few outliers are quite obvious to spot, but the choice of the threshold (the limit after which you decide to label a point as an outlier) visually remains a very subjective matter.
11.1.3 Computing outliers’ threshold: standard deviation
Add two new columns to the dataframe called isOutlierPrice and isOutlierMileage. We will define our threshold as two standard deviations away from the mean for price and mileage, respectively.
# Computing the isOutlierPrice column
upper_threshold_price = df.price.mean() + 2*df.price.std()
lower_threshold_price = df.price.mean() - 2*df.price.std()
df['isOutlierPrice'] = ((df.price > upper_threshold_price) | (df.price < lower_threshold_price))

# Computing the isOutlierMileage column
upper_threshold_mileage = df.mileage.mean() + 2*df.mileage.std()
lower_threshold_mileage = df.mileage.mean() - 2*df.mileage.std()
df['isOutlierMileage'] = ((df.mileage > upper_threshold_mileage) | (df.mileage < lower_threshold_mileage))
# Inspect the new DataFrame with the added columns
df.head()
  | price | mileage | year | trim | engine | transmission | isOutlierPrice | isOutlierMileage |
---|---|---|---|---|---|---|---|---|
0 | 14995 | 67697 | 2006 | ex | 4 Cyl | Manual | False | False |
1 | 11988 | 73738 | 2006 | ex | 4 Cyl | Manual | False | False |
2 | 11999 | 80313 | 2006 | lx | 4 Cyl | Automatic | False | False |
3 | 12995 | 86096 | 2006 | lx | 4 Cyl | Automatic | False | False |
4 | 11333 | 79607 | 2006 | lx | 4 Cyl | Automatic | False | False |
We may want to use this more succinct approach:
# Second way of doing the above using the np.where() function
df['isOutlierPrice'] = np.where(abs(df.price - df.price.mean()) < 2*df.price.std(), False, True)
df['isOutlierMileage'] = np.where(abs(df.mileage - df.mileage.mean()) < 2*df.mileage.std(), False, True)
# Inspect the new DataFrame with the added columns
df.head()
  | price | mileage | year | trim | engine | transmission | isOutlierPrice | isOutlierMileage |
---|---|---|---|---|---|---|---|---|
0 | 14995 | 67697 | 2006 | ex | 4 Cyl | Manual | False | False |
1 | 11988 | 73738 | 2006 | ex | 4 Cyl | Manual | False | False |
2 | 11999 | 80313 | 2006 | lx | 4 Cyl | Automatic | False | False |
3 | 12995 | 86096 | 2006 | lx | 4 Cyl | Automatic | False | False |
4 | 11333 | 79607 | 2006 | lx | 4 Cyl | Automatic | False | False |
As we’d expect, both methods produce the same output.
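If you want to confirm the agreement programmatically rather than by eyeballing df.head(), a quick sanity check (just a sketch, reusing the threshold variables defined above) is to recompute the flags with the explicit thresholds and compare:

# Sanity check: recompute the price flags with the explicit thresholds and compare
check = (df.price > upper_threshold_price) | (df.price < lower_threshold_price)
print(check.equals(df['isOutlierPrice']))  # True if both methods agree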
Now that we have these two new columns, we could visualize these values with a different color in the plot. Observe whether they are the same as you would mark them:
# Visualizing outliers in a different color
col = ['tomato' if i+j else 'seagreen' for i, j in zip(df.isOutlierPrice, df.isOutlierMileage)]
plt.scatter(df.price, df.mileage, color=col)
plt.xlabel('Price')
plt.ylabel('Mileage')
plt.title('Mileage vs. Price : Outliers 2+ std\'s away from the mean\n');
Visually filtering out outliers can be an effective tactic if we’re just trying to conduct a quick and dirty experimentation. However, when we need to perform a solid and founded analysis, it’s better to have a robust justification for our choices.
In this case, we can use the deviation from the mean to define a threshold that separates ‘normal’ values from ‘outliers’. Here, we opted for a two standard deviation threshold.
The mathematical intuition behind this is that, under the normality assumption (if we assume our variable is normally distributed, which it almost is; refer to the next plot), the probability of observing a value two or more standard deviations away from the mean is around 5%, which is quite unlikely. This is why we label these data points as outliers with respect to the (assumed) probability distribution of the variable. But this remains a way to identify 1D outliers only (i.e. identifying outliers within each column separately).
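We can quickly check that figure. The sketch below uses scipy.stats.norm (not imported at the top of this notebook) to compute the two-tailed probability of landing at least two standard deviations from the mean of a normal distribution:

from scipy.stats import norm

# Two-tailed probability of being at least 2 standard deviations from the mean
p = 2 * (1 - norm.cdf(2))
print(round(p, 4))  # ~0.0455, i.e. roughly 5%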
# Histograms of Price and Mileage (checking the normality assumption)
plt.subplot(1, 2, 1)
plt.hist(df.price, bins=12)
plt.title('Price')
plt.subplot(1, 2, 2)
plt.hist(df.mileage, bins=15)
plt.title('Mileage');
11.1.4 Computing outliers’ threshold: Mahalanobis distance
- Using the 2D Mahalanobis distance to find outliers. Unlike the per-column thresholds above, the Mahalanobis distance measures how far each point lies from the mean of both variables jointly, taking their covariance into account.
# Mean vector (computing the mean returns a Series, which we need to convert back to a DataFrame because cdist requires it)
mean_v = df.iloc[:, 0:2].mean().to_frame().T  # DataFrame.T returns the transpose of the DataFrame
#mean_v = np.asarray([df.price.mean(), df.mileage.mean()]).reshape(1,2)  # This is a better way of writing the line before (for our use case: the cdist function)

# Computing the Mahalanobis distance of each row to the mean vector
d = cdist(df.iloc[:, 0:2], mean_v, metric='mahalanobis')
#d = cdist(df[['price', 'mileage']].values, mean_v, metric='mahalanobis')  # Another way of writing the line before
# Visualizing the scatter plot while coloring each point (i.e. row) with a color from a chosen gradient colormap corresponding to the Mahalanobis score
plt.figure(figsize=(12, 5))
plt.scatter(df.price, df.mileage, c=d.flatten(), cmap='plasma')  # in order to know why we use flatten() on d, try printing d with and without flatten
plt.colorbar()  # to show the colorbar
plt.xlabel('Price')
plt.ylabel('Mileage')
plt.title('Mileage vs. Price colored by a 2D Mahalanobis score\n');
11.2 Q-Q Plots
11.2.1 Data
In this case, we will be using data reported by countries to WHO and estimates of tuberculosis burden generated by WHO for the Global Tuberculosis Report, as part of WHO’s Global Tuberculosis Programme. You can refer to this section at WHO’s website to know more about the data and their variables.
For your convenience, we have stored a copy in a csv file, but you could try to replicate the code with more up-to-date data from the original website. As usual, it is a best practice to explore the data before proceeding any further:
# Getting the Data
df_tuber = pd.read_csv('data/raw/TB_burden_countries_2014-09-29.csv')

df_tuber.head()
  | country | iso2 | iso3 | iso_numeric | g_whoregion | year | e_pop_num | e_prev_100k | e_prev_100k_lo | e_prev_100k_hi | ... | e_inc_tbhiv_100k | e_inc_tbhiv_100k_lo | e_inc_tbhiv_100k_hi | e_inc_tbhiv_num | e_inc_tbhiv_num_lo | e_inc_tbhiv_num_hi | source_tbhiv | c_cdr | c_cdr_lo | c_cdr_hi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | AF | AFG | 4 | EMR | 1990 | 11731193 | 327.0 | 112.0 | 655.0 | ... | 0.35 | 0.22 | 0.52 | 41.0 | 25.0 | 60.0 | Model | 20.0 | 13.0 | 32.0 |
1 | Afghanistan | AF | AFG | 4 | EMR | 1991 | 12612043 | 359.0 | 172.0 | 613.0 | ... | 0.36 | 0.19 | 0.58 | 45.0 | 24.0 | 73.0 | Model | 97.0 | 77.0 | 120.0 |
2 | Afghanistan | AF | AFG | 4 | EMR | 1992 | 13811876 | 387.0 | 169.0 | 693.0 | ... | 0.37 | 0.19 | 0.62 | 51.0 | 26.0 | 86.0 | Model | NaN | NaN | NaN |
3 | Afghanistan | AF | AFG | 4 | EMR | 1993 | 15175325 | 412.0 | 186.0 | 724.0 | ... | 0.38 | 0.20 | 0.63 | 58.0 | 30.0 | 95.0 | Model | NaN | NaN | NaN |
4 | Afghanistan | AF | AFG | 4 | EMR | 1994 | 16485018 | 431.0 | 199.0 | 751.0 | ... | 0.40 | 0.21 | 0.64 | 65.0 | 35.0 | 100.0 | Model | NaN | NaN | NaN |
5 rows × 39 columns
df_tuber.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4903 entries, 0 to 4902
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 4903 non-null object
1 iso2 4880 non-null object
2 iso3 4903 non-null object
3 iso_numeric 4903 non-null int64
4 g_whoregion 4903 non-null object
5 year 4903 non-null int64
6 e_pop_num 4903 non-null int64
7 e_prev_100k 4892 non-null float64
8 e_prev_100k_lo 4892 non-null float64
9 e_prev_100k_hi 4892 non-null float64
10 e_prev_num 4892 non-null float64
11 e_prev_num_lo 4892 non-null float64
12 e_prev_num_hi 4892 non-null float64
13 e_mort_exc_tbhiv_100k 4902 non-null float64
14 e_mort_exc_tbhiv_100k_lo 4902 non-null float64
15 e_mort_exc_tbhiv_100k_hi 4902 non-null float64
16 e_mort_exc_tbhiv_num 4902 non-null float64
17 e_mort_exc_tbhiv_num_lo 4902 non-null float64
18 e_mort_exc_tbhiv_num_hi 4902 non-null float64
19 source_mort 4902 non-null object
20 e_inc_100k 4902 non-null float64
21 e_inc_100k_lo 4902 non-null float64
22 e_inc_100k_hi 4902 non-null float64
23 e_inc_num 4902 non-null float64
24 e_inc_num_lo 4902 non-null float64
25 e_inc_num_hi 4902 non-null float64
26 e_tbhiv_prct 3653 non-null float64
27 e_tbhiv_prct_lo 3457 non-null float64
28 e_tbhiv_prct_hi 3457 non-null float64
29 e_inc_tbhiv_100k 3597 non-null float64
30 e_inc_tbhiv_100k_lo 3597 non-null float64
31 e_inc_tbhiv_100k_hi 3597 non-null float64
32 e_inc_tbhiv_num 3597 non-null float64
33 e_inc_tbhiv_num_lo 3597 non-null float64
34 e_inc_tbhiv_num_hi 3597 non-null float64
35 source_tbhiv 4902 non-null object
36 c_cdr 4476 non-null float64
37 c_cdr_lo 4476 non-null float64
38 c_cdr_hi 4476 non-null float64
dtypes: float64(30), int64(3), object(6)
memory usage: 1.5+ MB
As you can see, we have a pretty large dataset with numerical and categorical variables, some of which have many missing values. In this case we will be filling missing values with the mean2
2 Refer to Section 8.2 for a detailed explanation and alternative methods for imputing missing values.
# Filling missing numeric values ONLY (categorical values will be untouched)
df_tuber = df_tuber.fillna(value=df_tuber.mean(numeric_only=True))

# Count missing values
df_tuber.isna().sum()
country 0
iso2 23
iso3 0
iso_numeric 0
g_whoregion 0
year 0
e_pop_num 0
e_prev_100k 0
e_prev_100k_lo 0
e_prev_100k_hi 0
e_prev_num 0
e_prev_num_lo 0
e_prev_num_hi 0
e_mort_exc_tbhiv_100k 0
e_mort_exc_tbhiv_100k_lo 0
e_mort_exc_tbhiv_100k_hi 0
e_mort_exc_tbhiv_num 0
e_mort_exc_tbhiv_num_lo 0
e_mort_exc_tbhiv_num_hi 0
source_mort 1
e_inc_100k 0
e_inc_100k_lo 0
e_inc_100k_hi 0
e_inc_num 0
e_inc_num_lo 0
e_inc_num_hi 0
e_tbhiv_prct 0
e_tbhiv_prct_lo 0
e_tbhiv_prct_hi 0
e_inc_tbhiv_100k 0
e_inc_tbhiv_100k_lo 0
e_inc_tbhiv_100k_hi 0
e_inc_tbhiv_num 0
e_inc_tbhiv_num_lo 0
e_inc_tbhiv_num_hi 0
source_tbhiv 1
c_cdr 0
c_cdr_lo 0
c_cdr_hi 0
dtype: int64
- Pick one of the columns from the Tuberculosis data and copy it into a numpy array as before
# Picking a column (I created a variable for this so I (and you (: ) can modify the column easily here and the change will be carried out everywhere I use the variable colname)
colname = 'e_prev_100k'

# Creating a numpy array from our column
col = np.array(df_tuber[colname])

# Printing the type of our newly created column
print(type(col))
<class 'numpy.ndarray'>
- Compare this selected column to a Normal distribution. Then sample from a Normal distribution and show a second Q-Q plot
# Plotting the Q-Q Plot for our column
sm.qqplot(col, line='r')
plt.title('Q-Q Plot of the "{}" column of our dataset'.format(colname));

# Plotting the Q-Q Plot for the log of our column
sm.qqplot(np.log(col), line='r')
plt.title('Q-Q Plot of the Log of the "{}" column'.format(colname));
# Sampling from a Gaussian and a uniform distribution
sample_norm = np.random.randn(1000)
sample_unif = np.random.rand(1000)

# Plotting the second Q-Q Plot for our samples (generated from a normal and a uniform distribution)
sm.qqplot(sample_norm, line='r')
plt.title('Q-Q Plot of the generated sample (Gaussian)')
sm.qqplot(sample_unif, line='r')
plt.title('Q-Q Plot of the generated sample (Uniform)');
Go ahead and change the colname variable (question 1) to a different column name, picked from the list shown just before question 1 (do pick a numeric column). Then re-execute the code from questions 1 and 2 and you will see the Q-Q plot of the column you just picked.
- Have a look at the slides from Week 03 for different shapes
OK? Now try to guess the shape of the distribution of our selected column (the shape of its histogram) from its Q-Q plot above.
- Visualise the column on a histogram and reflect on whether the shape you inferred from Q-Q plots and the shape of the histogram correlate
# Histogramming the column we picked (not sure the verb exists though)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.title('Histogram of "{}"'.format(colname))
plt.hist(col)
plt.subplot(1, 2, 2)
plt.title('Histogram of Log "{}"'.format(colname))
plt.hist(np.log(col));
Of course it does! From the shape of the Q-Q plot above (convex, sloping upwards) and the slide on Q-Q plots from Week 3, we could conclude, before even looking at the histogram, that our distribution was right-tailed (or positively skewed, if you prefer the fancier vocabulary). And it is!
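If you want a number to back this up, the skew function imported at the top of the notebook gives the sample skewness directly; a clearly positive value confirms the right tail (the exact values will depend on the column you picked):

# Sample skewness of the chosen column: positive => right tail, negative => left tail
print(skew(col))
print(skew(np.log(col)))  # the log transform should bring this much closer to 0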
11.3 Distributions, Sampling
- Inspecting the effect of sample size on descriptive statistics
# Defining a few variables so we can change their values easily without having to change the rest of the code
n = [5, 20, 100, 2000]
stds = [0.5, 1, 3]

# Initializing empty 2D arrays where we're going to store the results of our simulation
mean = np.empty([len(n), len(stds)])
std = np.empty([len(n), len(stds)])
skewness = np.empty([len(n), len(stds)])
kurtos = np.empty([len(n), len(stds)])

# Conducting the experiments and storing the results in the respective 2D arrays
for i, sample_size in enumerate(n):
    for j, theoretical_std in enumerate(stds):
        sample = np.random.normal(loc=0, scale=theoretical_std, size=sample_size)
        mean[i, j] = sample.mean()
        std[i, j] = sample.std()
        skewness[i, j] = skew(sample)
        kurtos[i, j] = kurtosis(sample)
# Turning the mean 2D array into a pandas dataframe
mean = pd.DataFrame(mean, columns=stds, index=n)
mean = mean.rename_axis('Sample Size').rename_axis("Standard Deviation", axis="columns")

# Turning the std 2D array into a pandas dataframe
std = pd.DataFrame(std, columns=stds, index=n)
std = std.rename_axis('Sample Size').rename_axis("Standard Deviation", axis="columns")

# Turning the skewness 2D array into a pandas dataframe
skewness = pd.DataFrame(skewness, columns=stds, index=n)
skewness = skewness.rename_axis('Sample Size').rename_axis("Standard Deviation", axis="columns")

# Turning the kurtosis 2D array into a pandas dataframe
kurtos = pd.DataFrame(kurtos, columns=stds, index=n)
kurtos = kurtos.rename_axis('Sample Size').rename_axis("Standard Deviation", axis="columns")
print("GAUSSIAN DISTRIBUTION\n")
print('Results for the Mean :')
# This is a dataframe containing the means of the samples generated with different values of std and sample size
mean print('Results for the Standard Deviation :')
# This is a dataframe containing the standard deviations of the samples generated with different values of std and sample size
std print('Results for the Skewness :')
# This is a dataframe containing the skews of the samples generated with different values of std and sample size
skewness print('Results for the Kurtosis :')
# This is a dataframe containing the kurtosis of the samples generated with different values of std and sample size kurtos
GAUSSIAN DISTRIBUTION
Results for the Mean :
Results for the Standard Deviation :
Results for the Skewness :
Results for the Kurtosis :
Standard Deviation | 0.5 | 1.0 | 3.0 |
---|---|---|---|
Sample Size | |||
5 | -1.014009 | -0.367523 | -0.592649 |
20 | 1.613498 | 0.497900 | -1.049867 |
100 | -0.110038 | -0.491078 | -0.109846 |
2000 | 0.038718 | 0.184164 | 0.031062 |
Basically, the more data you have (the bigger your sample), the more accurate your empirical estimates are going to be. Observe, for example, the values of the mean (1st DataFrame) and standard deviation (2nd DataFrame) for the 2000 sample size (last row). In the first one, the values are all close to 0, because we generated our sample from a Gaussian with mean 0, and in the second one the values are all close to the column labels (the standard deviations used to generate the samples). This means that with a sample size of 2000, our estimates are really close to the “true” values (with which we generated the sample). Also, the skewness of a Gaussian distribution should be 0, and this is confirmed in the 3rd DataFrame, where the values are close to 0 in the last row (i.e. large sample size).
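A compact way to see the same effect, sketched here under the same Gaussian set-up (the 1000-repeat loop is ours, not part of the exercise): the spread of the sample mean across repeated experiments shrinks roughly like sigma/sqrt(n) as the sample size grows.

# Spread of the sample mean across 1000 repeated experiments, for increasing sample sizes
sigma = 1
for sample_size in [5, 20, 100, 2000]:
    means = [np.random.normal(loc=0, scale=sigma, size=sample_size).mean() for _ in range(1000)]
    print(sample_size, np.std(means), sigma / np.sqrt(sample_size))  # empirical vs theoretical sigma/sqrt(n)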
2) Same as before but with a Poisson distribution (which has just one parameter, lambda, instead of two like the Gaussian)
# Defining a few variables so we can change their values easily without having to change the rest of the code
n = [5, 20, 100, 2000]
lambd = [0.5, 1, 3]  # In the Gaussian case we had two parameters: a std specified here and a mean we chose to be 0
                     # everywhere. Here we have one parameter called lambda.

# Initializing empty 2D arrays where we're going to store the results of our simulation
mean = np.empty([len(n), len(lambd)])
std = np.empty([len(n), len(lambd)])
skewness = np.empty([len(n), len(lambd)])
kurtos = np.empty([len(n), len(lambd)])

# Conducting the experiments and storing the results in the respective 2D arrays
for i, sample_size in enumerate(n):
    for j, theoretical_lambd in enumerate(lambd):
        #**********************************************************************
        sample = np.random.poisson(lam=theoretical_lambd, size=sample_size)  # THIS IS WHAT WE CHANGED IN Q2 !
        #**********************************************************************
        mean[i, j] = sample.mean()
        std[i, j] = sample.std()
        skewness[i, j] = skew(sample)
        kurtos[i, j] = kurtosis(sample)
# Turning the mean 2D array into a pandas dataframe
mean = pd.DataFrame(mean, columns=lambd, index=n)
mean = mean.rename_axis('Sample Size').rename_axis("Lambda", axis="columns")

# Turning the std 2D array into a pandas dataframe
std = pd.DataFrame(std, columns=lambd, index=n)
std = std.rename_axis('Sample Size').rename_axis("Lambda", axis="columns")

# Turning the skewness 2D array into a pandas dataframe
skewness = pd.DataFrame(skewness, columns=lambd, index=n)
skewness = skewness.rename_axis('Sample Size').rename_axis("Lambda", axis="columns")

# Turning the kurtosis 2D array into a pandas dataframe
kurtos = pd.DataFrame(kurtos, columns=lambd, index=n)
kurtos = kurtos.rename_axis('Sample Size').rename_axis("Lambda", axis="columns")
print("POISSON DISTRIBUTION\n")
print('Results for the Mean :')
# This is a dataframe containing the means of the samples generated with different values of std and sample size
mean print('Results for the Standard Deviation :')
# This is a dataframe containing the standard deviations of the samples generated with different values of std
std #and sample size
print('Results for the Skewness :')
# This is a dataframe containing the skews of the samples generated with different values of std and sample
skewness #size
print('Results for the Kurtosis :')
# This is a dataframe containing the kurtosis of the samples generated with different values of std and sample
kurtos #size
POISSON DISTRIBUTION
Results for the Mean :
Results for the Standard Deviation :
Results for the Skewness :
Results for the Kurtosis :
Lambda | 0.5 | 1.0 | 3.0 |
---|---|---|---|
Sample Size | |||
5 | NaN | -0.270833 | -0.500000 |
20 | 3.285444 | 0.212842 | -0.411993 |
100 | 3.485728 | 0.895970 | 0.288245 |
2000 | 1.910071 | 0.895998 | 0.407630 |
Just remember, the lambda parameter that defines the Poisson distribution is also the mean of the distribution (and its variance, which is why the standard deviations in the second DataFrame should be close to the square root of lambda). This is confirmed in the first DataFrame where the values (means of samples) are close to the column labels (the theoretical lambda, which equals the theoretical mean), especially in the last row.
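A quick illustrative check of that property (the sample size of 100000 is just an arbitrarily large number): for a Poisson distribution both the mean and the variance equal lambda.

# For a Poisson distribution, mean == variance == lambda
sample = np.random.poisson(lam=3, size=100000)
print(sample.mean(), sample.var())  # both should be close to 3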
11.4 Robust Statistics
- Choose a number of columns with different shapes: for instance, “e_prev_100k_hi” is strongly right skewed, or pick columns where the variation is high or where you notice potential outliers. You can make use of a series of boxplots to exploratively analyse the data for outliers
# Listing the columns
df_tuber.columns
Index(['country', 'iso2', 'iso3', 'iso_numeric', 'g_whoregion', 'year',
'e_pop_num', 'e_prev_100k', 'e_prev_100k_lo', 'e_prev_100k_hi',
'e_prev_num', 'e_prev_num_lo', 'e_prev_num_hi', 'e_mort_exc_tbhiv_100k',
'e_mort_exc_tbhiv_100k_lo', 'e_mort_exc_tbhiv_100k_hi',
'e_mort_exc_tbhiv_num', 'e_mort_exc_tbhiv_num_lo',
'e_mort_exc_tbhiv_num_hi', 'source_mort', 'e_inc_100k', 'e_inc_100k_lo',
'e_inc_100k_hi', 'e_inc_num', 'e_inc_num_lo', 'e_inc_num_hi',
'e_tbhiv_prct', 'e_tbhiv_prct_lo', 'e_tbhiv_prct_hi',
'e_inc_tbhiv_100k', 'e_inc_tbhiv_100k_lo', 'e_inc_tbhiv_100k_hi',
'e_inc_tbhiv_num', 'e_inc_tbhiv_num_lo', 'e_inc_tbhiv_num_hi',
'source_tbhiv', 'c_cdr', 'c_cdr_lo', 'c_cdr_hi'],
dtype='object')
# Alright I already know a few columns with outliers but let's try to find them together exploratively using BoxPlots
colname = 'c_cdr'  # change the column name by choosing different ones from above (numeric ones)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.hist(df_tuber[colname])
plt.title('Histogram of "{}"'.format(colname))
plt.subplot(1, 2, 2)
plt.boxplot(df_tuber[colname]);
plt.title('Boxplot of "{}"'.format(colname));
# Chosen columns : I picked 3, feel free to change them and experiment
chosen_colnames = ['e_pop_num', 'e_prev_100k', 'c_cdr']
2) For the chosen columns, estimate both the conventional and the robust descriptive statistics and compare. Observe how these pairs deviate from each other based on the characteristics of the underlying data
# Central Tendency : Mean vs Median (Median is the robust version of the mean, because it takes into account
# the ordering of the points and not the actual values like the mean does)
df_tuber[chosen_colnames].describe().loc[['mean', '50%'], :]  # The 50% is the median (50% quantile)
  | e_pop_num | e_prev_100k | c_cdr |
---|---|---|---|
mean | 2.899179e+07 | 207.694422 | 67.570706 |
50% | 5.140332e+06 | 93.000000 | 70.000000 |
Look at how different the mean and the median are! This is why, when you have a skewed (asymmetric) distribution, it is usually more informative to use the median as a measure of the central tendency of the data. One important thing to note here: for the first two attributes the mean is higher than the median, but for the last one it is the opposite. This can tell you a thing or two about the shape of your distribution: if the mean is higher than the median, the distribution is skewed to the right (right tail), which pulls the mean up. And vice versa.
Moral of the story is … outliers are a pain in the a**.
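To make the point concrete, here is a tiny made-up example (the numbers are invented purely for illustration): a single extreme value drags the mean a long way but barely moves the median.

# A single outlier moves the mean a lot, the median not at all
clean = pd.Series([10, 11, 12, 13, 14])
dirty = pd.Series([10, 11, 12, 13, 1000])
print(clean.mean(), clean.median())  # 12.0 12.0
print(dirty.mean(), dirty.median())  # 209.2 12.0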
# Spread : Standard Deviation vs Inter-Quartile Range vs Median Absolute Deviation (MAD)
stds = df_tuber[chosen_colnames].std()
iqrs = df_tuber[chosen_colnames].quantile(0.75) - df_tuber[chosen_colnames].quantile(0.25)
medianAD = mad(df_tuber[chosen_colnames])

output = pd.DataFrame(stds, columns=['std']).T
output = pd.concat([output, pd.DataFrame(iqrs, columns=['IQR']).T], ignore_index=False)
output = pd.concat([output, pd.DataFrame(medianAD, columns=['MAD'], index=chosen_colnames).T], ignore_index=False)
output
  | e_pop_num | e_prev_100k | c_cdr |
---|---|---|---|
std | 1.177827e+08 | 269.418159 | 25.234773 |
IQR | 1.677193e+07 | 280.500000 | 34.000000 |
MAD | 7.454908e+06 | 120.090780 | 25.204238 |
The values here are different as well, perhaps more so for the “e_pop_num” attribute than for the others, but that is mostly a matter of scale: “e_pop_num” takes much larger values overall than the other columns, which you can check against the mean values right above.
For the first attribute, the standard deviation is the highest, and the IQR and MAD are closer to each other. For the second attribute, the inter-quartile range is slightly higher than the standard deviation, but the MAD is far below (less than half) the other two values, and the reason for that is a little involved: the standard deviation measures spread through squared deviations from the mean, while the median absolute deviation measures spread through the median of absolute deviations. This means that when outliers take much larger values than the “normal” points, the squared differences explode (figuratively, of course) compared to the absolute differences. And this is actually the case for our second distribution (e_prev_100k), where most values are between 50 and 300 while many outliers lie above the 750 mark and go all the way up to 1800 (look at the boxplots below). For the third attribute the values are fairly close, especially the std and the MAD; if you inspect the boxplot, this column doesn’t have many outliers to begin with.
Nonetheless, the differences are real, and if we don’t want to have to handle outliers, then we should be using robust statistics like the median to describe the central tendency and inter-quartile range or median absolute deviation to measure the spread of our data.
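One caveat worth knowing (this is about the library default, not something stated in the lab text): statsmodels' mad() divides the raw median absolute deviation by roughly 0.6745 so that it estimates the standard deviation under normality. If you want the raw MAD, you can compute it yourself or pass c=1, as sketched below.

# Raw (unscaled) median absolute deviation, computed by hand
raw_mad = (df_tuber[chosen_colnames] - df_tuber[chosen_colnames].median()).abs().median()
print(raw_mad)

# The same thing via statsmodels, disabling the normal-consistency scaling
print(mad(df_tuber[chosen_colnames], c=1))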
# Boxplots of the different columns
plt.figure(figsize=(12, 20))

plt.subplot(3, 2, 1)
plt.hist(df_tuber[chosen_colnames[0]])
plt.title('Histogram of "{}"'.format(chosen_colnames[0]))
plt.subplot(3, 2, 2)
plt.boxplot(df_tuber[chosen_colnames[0]])
plt.title('Boxplot of "{}"'.format(chosen_colnames[0]))

plt.subplot(3, 2, 3)
plt.hist(df_tuber[chosen_colnames[1]])
plt.title('Histogram of "{}"'.format(chosen_colnames[1]))
plt.subplot(3, 2, 4)
plt.boxplot(df_tuber[chosen_colnames[1]])
plt.title('Boxplot of "{}"'.format(chosen_colnames[1]))

plt.subplot(3, 2, 5)
plt.hist(df_tuber[chosen_colnames[2]])
plt.title('Histogram of "{}"'.format(chosen_colnames[2]))
plt.subplot(3, 2, 6)
plt.boxplot(df_tuber[chosen_colnames[2]])
plt.title('Boxplot of "{}"'.format(chosen_colnames[2]));