%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('data/WDI_countries_v2.xlsx', sheet_name='Data4')
33 Lab: Poverty and Inequality
This lab explores how poverty and inequality can be measured, drawing on the case study “The Statistics of Poverty and Inequality” (Rouncefield 1995). Mary Rouncefield asked her students the following questions:
- Is the world’s wealth distributed evenly? What countries are outliers?
- Do people living in different countries have similar life expectancies?
- Do men and women have similar life expectancies? What is the average difference? What is the minimum difference? What is the maximum difference? In which countries do these occur? What are possible explanations for these differences?
- Are birth rates related to death rates?
- How quickly are populations growing?
By examining six variables, you can investigate some major inequalities across the globe.
33.1 Source
In this lab, we will be using the World Development Indicators dataset from the World Bank, which contains the following features: Country Code, birthrate, Deathrate, GNI, Lifeexp_female, Lifeexp_male and Neonatal_death.
33.2 Reading the dataset
Let’s have a look at our dataset:
df.head()
| Country Code | birthrate | Deathrate | GNI | Lifeexp_female | Lifeexp_male | Neonatal_death |
---|---|---|---|---|---|---|---|
0 | AFG | 32.487 | 6.423 | 2260.0 | 66.026 | 63.047 | 44503.0 |
1 | ALB | 11.780 | 7.898 | 13820.0 | 80.167 | 76.816 | 243.0 |
2 | DZA | 24.282 | 4.716 | 11450.0 | 77.938 | 75.494 | 16407.0 |
3 | AND | 7.200 | 4.400 | NaN | NaN | NaN | 1.0 |
4 | AGO | 40.729 | 8.190 | 6550.0 | 63.666 | 58.064 | 35489.0 |
33.2.1 Missing values
Let’s check if we have any missing data:
df.info()
df.isna().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Code 216 non-null object
1 birthrate 205 non-null float64
2 Deathrate 205 non-null float64
3 GNI 187 non-null float64
4 Lifeexp_female 198 non-null float64
5 Lifeexp_male 198 non-null float64
6 Neonatal_death 193 non-null float64
dtypes: float64(6), object(1)
memory usage: 11.9+ KB
Country Code 0
birthrate 11
Deathrate 11
GNI 29
Lifeexp_female 18
Lifeexp_male 18
Neonatal_death 23
dtype: int64
It looks like there are null values in every column except Country Code.
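Before deciding what to do about these gaps, it can help to see what share of each column is missing and which rows are affected. The following is a minimal sketch rather than part of the original lab: it rebuilds the five rows printed by df.head() above in a stand-in DataFrame called sample (a name introduced here for illustration), so its results describe those five rows only.

```python
import numpy as np
import pandas as pd

# Stand-in data: the five rows printed by df.head() above
sample = pd.DataFrame({
    "Country Code": ["AFG", "ALB", "DZA", "AND", "AGO"],
    "birthrate": [32.487, 11.780, 24.282, 7.200, 40.729],
    "Deathrate": [6.423, 7.898, 4.716, 4.400, 8.190],
    "GNI": [2260.0, 13820.0, 11450.0, np.nan, 6550.0],
    "Lifeexp_female": [66.026, 80.167, 77.938, np.nan, 63.666],
    "Lifeexp_male": [63.047, 76.816, 75.494, np.nan, 58.064],
    "Neonatal_death": [44503.0, 243.0, 16407.0, 1.0, 35489.0],
})

# Share of missing values in each column (0.0 means no gaps)
missing_share = sample.isna().mean()

# Rows that have at least one missing value
incomplete = sample[sample.isna().any(axis=1)]

# One possible treatment: keep only the complete rows
complete = sample.dropna()

print(missing_share)
print(incomplete["Country Code"].tolist())
```

Dropping incomplete rows is only one option; whether it is appropriate depends on the analysis, since countries with missing GNI may differ systematically from the rest.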
# Let's look at the distribution of values in the birthrate and deathrate columns
df.boxplot(column=['birthrate', 'Deathrate'])
# Let's look at the distribution of values in terms of male and female life expectancy
df.boxplot(column=['Lifeexp_female', 'Lifeexp_male'])
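The boxplots compare the two distributions as a whole, but Rouncefield's question about the average, minimum and maximum difference between female and male life expectancy needs a per-country gap. One possible way to compute it is sketched below. It is an illustration only: sample and gap are names introduced here, and the data are just the rows shown by df.head() earlier, so the answers apply to those rows and not to the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in data: life expectancy columns from the df.head() output
sample = pd.DataFrame({
    "Country Code": ["AFG", "ALB", "DZA", "AND", "AGO"],
    "Lifeexp_female": [66.026, 80.167, 77.938, np.nan, 63.666],
    "Lifeexp_male": [63.047, 76.816, 75.494, np.nan, 58.064],
}).set_index("Country Code")

# Female minus male life expectancy, in years, per country
gap = sample["Lifeexp_female"] - sample["Lifeexp_male"]

# Average, smallest and largest gap (missing values are skipped)
summary = gap.agg(["mean", "min", "max"])

# Countries where the smallest and largest gaps occur
smallest = gap.idxmin()
largest = gap.idxmax()

print(summary)
print(smallest, largest)
```

Applied to the full DataFrame, the same steps would use df.set_index('Country Code') in place of the stand-in rows.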
# Let's look at the distribution of the Gross National Income (surely we expect to see outliers!)
df.boxplot(column=['GNI'])
# So, as expected, the world's wealth is not distributed evenly, but what countries are the outliers as Mary Rouncefield asked?
# We don't have the geographic coordinates as with our Hate Crime dataset to project on a map
# But perhaps we can think of other ways to get this information. Following is one way of doing this.
# Get the 25th and 75th percentile GNI values
GNI_quartiles = df['GNI'].describe()[['25%', '75%']]
# Calculate the IQR
IQR_GNI = GNI_quartiles['75%'] - GNI_quartiles['25%']
# Calculate the whiskers
lb_GNI = GNI_quartiles['25%'] - 1.5 * IQR_GNI
ub_GNI = GNI_quartiles['75%'] + 1.5 * IQR_GNI
# Retrieve the outliers
outliers_GNI_df = df[(df['GNI'] < lb_GNI) | (df['GNI'] > ub_GNI)]
# Get Country Codes for the outliers
outliers_GNI_countries = outliers_GNI_df['Country Code']
outliers_GNI_countries
85 HKG
92 IRL
115 LUX
116 MAC
146 NOR
158 QAT
170 SGP
187 CHE
203 ARE
205 USA
Name: Country Code, dtype: object
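The steps above (quartiles, IQR, whisker limits, filter) can be wrapped in a small helper so that the same rule is easy to apply to other columns. This is a sketch under stated assumptions, not the lab's own code: the function name iqr_outliers, its parameter k, and the toy values with single-letter labels are all hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR beyond the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical income-like values, labelled A to I for illustration
toy_gni = pd.Series(
    [780, 2260, 5090, 6550, 11450, 13280, 13820, 28360, 123290],
    index=list("ABCDEFGHI"),
)

outliers = iqr_outliers(toy_gni)
print(outliers)
```

The 1.5 multiplier is the usual boxplot convention; a larger k flags only the more extreme values.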
You can check out this link if you are unsure about the country names and codes.
# As with Part 1, let's standardise our data before we attempt further modelling using the data
from sklearn import preprocessing
import numpy as np
import seaborn as sns
# Get column names first
df_stand = df[['birthrate', 'Deathrate', 'GNI', 'Lifeexp_female', 'Lifeexp_male', 'Neonatal_death']]
names = df_stand.columns
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit the scaler to our data and transform it
df2 = scaler.fit_transform(df_stand)
# What type is df2? (Do you recall this from Part 1?)
type(df2)
numpy.ndarray
# Let's convert the numpy array into a DataFrame before further processing
df2 = pd.DataFrame(df2, columns=names)
df2.tail()
| birthrate | Deathrate | GNI | Lifeexp_female | Lifeexp_male | Neonatal_death |
---|---|---|---|---|---|---|
211 | -0.727171 | 0.200024 | NaN | 0.994355 | 0.807538 | NaN |
212 | 0.982566 | -1.563870 | -0.661830 | 0.051167 | 0.262031 | -0.233989 |
213 | 1.101866 | -0.604926 | NaN | -0.942333 | -0.798039 | 0.209120 |
214 | 1.686551 | -0.425077 | -0.813823 | -1.114030 | -1.323007 | 0.039069 |
215 | 1.124585 | 0.117514 | -0.840505 | -1.604284 | -1.462458 | -0.026205 |
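To see what the scaler has done, it may help to reproduce the calculation by hand. StandardScaler subtracts each column's mean and divides by its population standard deviation (ddof=0), and it ignores missing values when fitting, which is why the NaN entries are still present in the table above. The sketch below is an illustration using pandas only; sample and z are names introduced here, and the values are two columns from the df.head() output.

```python
import numpy as np
import pandas as pd

# Stand-in data: two columns from the df.head() output
sample = pd.DataFrame({
    "birthrate": [32.487, 11.780, 24.282, 7.200, 40.729],
    "GNI": [2260.0, 13820.0, 11450.0, np.nan, 6550.0],
})

# z-score: subtract the column mean, divide by the population std.
# pandas skips NaN when computing mean and std, so NaN stays NaN.
z = (sample - sample.mean()) / sample.std(ddof=0)

print(z.round(3))
print(z.mean().round(6))       # approximately 0 for every column
print(z.std(ddof=0).round(6))  # approximately 1 for every column
```

After this transformation each column is expressed in standard deviations from its own mean, so columns with very different units can share one axis.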
# Now that our data has been standardised, let's look at the distribution across all columns
ax = sns.boxplot(data=df2, orient="h")
# There are clearly several outliers in some columns. Let's take a look at the numbers
df.describe()
| birthrate | Deathrate | GNI | Lifeexp_female | Lifeexp_male | Neonatal_death |
---|---|---|---|---|---|---|
count | 205.000000 | 205.000000 | 187.000000 | 198.000000 | 198.000000 | 193.000000 |
mean | 19.637580 | 7.573941 | 20630.427807 | 75.193288 | 70.323854 | 12948.031088 |
std | 9.839573 | 2.636414 | 21044.240160 | 7.870933 | 7.419214 | 48782.770706 |
min | 5.900000 | 1.202000 | 780.000000 | 54.991000 | 50.582000 | 0.000000 |
25% | 10.900000 | 5.800000 | 5090.000000 | 69.497250 | 65.533500 | 163.000000 |
50% | 17.545000 | 7.163000 | 13280.000000 | 77.193000 | 71.140500 | 1288.000000 |
75% | 27.100000 | 9.100000 | 28360.000000 | 80.776500 | 76.047500 | 7316.000000 |
max | 46.079000 | 15.400000 | 123290.000000 | 87.700000 | 82.300000 | 546427.000000 |
# Let's look at the histogram for birthrate
df[['birthrate']].plot(kind='hist', ec='black')
Tip for reflection: Is the above a unimodal or bimodal distribution? Is the skew positive or negative? We have worked on this in earlier labs, so try to recall, revisit and reflect. Would a kernel density plot show the shape more clearly?
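One numeric check to set beside the histogram is to compare the mean with the median and to compute the sample skewness. In the df.describe() output above, the birthrate mean (about 19.64) is larger than the median (17.545), which hints at a longer right tail. The sketch below is only an illustration: because the full column is not reproduced here, it uses the five-number summary of birthrate from df.describe() as stand-in data, so the skewness value it prints is not the skewness of the real column.

```python
import pandas as pd

# Stand-in data: min, quartiles and max of birthrate from df.describe()
birth = pd.Series([5.9, 10.9, 17.545, 27.1, 46.079], name="birthrate")

mean_minus_median = birth.mean() - birth.median()
skewness = birth.skew()  # sample skewness; positive means a longer right tail

print(round(mean_minus_median, 3))
print(skewness > 0)
```

On the full data, df['birthrate'].skew() gives the corresponding figure, and df['birthrate'].plot(kind='kde') draws a kernel density estimate (this needs SciPy to be installed).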
# Let's create a pairwise plot as we have done many times before to get an idea about the relationship between variables
sns.pairplot(data=df.iloc[:, 1:])
Hmmm, some likely positive and negative correlations
# Let's check the relationship between birthrate and deathrate
df.plot(x='birthrate', y='Deathrate', kind='scatter')
# Let's create a heatmap like we did in Part 1
corrMatrix = df.corr(numeric_only=True).round(1)  # Again, round(1) so that it's easier to read given number of variables
sns.heatmap(corrMatrix, annot=True)
plt.show()
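A heatmap is easy to scan, but a ranked list of variable pairs can make the strongest relationships easier to pick out. The following is a sketch of one way to produce that list, assuming nothing beyond pandas and NumPy; sample, upper and pairs are names introduced here, and the data are the rows from df.head(), so the correlations it prints are not those of the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in data: four columns from the df.head() output
sample = pd.DataFrame({
    "birthrate": [32.487, 11.780, 24.282, 7.200, 40.729],
    "Deathrate": [6.423, 7.898, 4.716, 4.400, 8.190],
    "GNI": [2260.0, 13820.0, 11450.0, np.nan, 6550.0],
    "Lifeexp_female": [66.026, 80.167, 77.938, np.nan, 63.666],
})

corr = sample.corr()

# Keep only the cells above the diagonal so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs by the absolute strength of the correlation
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.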
33.3 How quickly are populations growing?
This question can be investigated by calculating birth rate minus death rate.
# Let's add a new column to our DataFrame to indicate the rate of population change (assuming the absence of migration)
df['pop_change_rate'] = df['birthrate'] - df['Deathrate']
df.head()
| Country Code | birthrate | Deathrate | GNI | Lifeexp_female | Lifeexp_male | Neonatal_death | pop_change_rate |
---|---|---|---|---|---|---|---|---|
0 | AFG | 32.487 | 6.423 | 2260.0 | 66.026 | 63.047 | 44503.0 | 26.064 |
1 | ALB | 11.780 | 7.898 | 13820.0 | 80.167 | 76.816 | 243.0 | 3.882 |
2 | DZA | 24.282 | 4.716 | 11450.0 | 77.938 | 75.494 | 16407.0 | 19.566 |
3 | AND | 7.200 | 4.400 | NaN | NaN | NaN | 1.0 | 2.800 |
4 | AGO | 40.729 | 8.190 | 6550.0 | 63.666 | 58.064 | 35489.0 | 32.539 |
# Let's look at the distribution of our new rate of population change column
df.boxplot(column=['pop_change_rate'])
Let’s investigate the values in the column more deeply:
pop_change_rate_summary = df['pop_change_rate'].describe()
pop_change_rate_summary
count 205.000000
mean 12.063639
std 10.477867
min -6.500000
25% 2.700000
50% 11.213000
75% 20.582000
max 37.811000
Name: pop_change_rate, dtype: float64
The results range from -6.5 (a decreasing population) to +37.811 (an increasing population). The mean is around 12.06. What does this signify?
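One way to make these numbers more tangible is to convert them into a yearly percentage and a doubling time. The sketch below rests on assumptions that the lab does not state explicitly: that birth and death rates are expressed per 1,000 people per year (as the World Bank's crude birth and death rates are), that there is no migration, and that the rate stays constant. The variable names are introduced here for illustration.

```python
import math

# Mean of pop_change_rate from the summary above (assumed per 1,000 people per year)
rate_per_1000 = 12.063639

# Per 1,000 people -> percent per year
growth_pct = rate_per_1000 / 10

# Years for a population to double at a constant compound growth rate
doubling_years = math.log(2) / math.log(1 + growth_pct / 100)

print(round(growth_pct, 2))
print(round(doubling_years, 1))
```

Under those assumptions, a negative rate such as the minimum of -6.5 would correspond to a population shrinking by about 0.65% per year.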