33  Lab: Poverty and Inequality

The idea of measuring Poverty and Inequality using the case study on “The Statistics of Poverty and Inequality” (Rouncefield 1995). The questions that Mary Rouncefield asked her students were the following:

  1. Is the world’s wealth distributed evenly? What countries are outliers?
  2. Do people living in different countries have similar life expectancies?
  3. Do men and women have similar life expectancies? What is the average difference? What is the minimum difference? What is the maximum difference? In which countries do these occur? What are possible explanations for these differences?
  4. Are birth rates related to death rates?
  5. How quickly are populations growing?

By looking into 6 variables, you can investigate some major inequalites across the globe.

33.1 Source

In this lab, we will be using World Development Indicators dataset from the World Bank, which contains the following features:features world bank development indicators

33.2 Reading the dataset

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('data/WDI_countries_v2.xlsx', sheet_name='Data4')

Let’s have a look at our dataset

df.head()
Country Code birthrate Deathrate GNI Lifeexp_female Lifeexp_male Neonatal_death
0 AFG 32.487 6.423 2260.0 66.026 63.047 44503.0
1 ALB 11.780 7.898 13820.0 80.167 76.816 243.0
2 DZA 24.282 4.716 11450.0 77.938 75.494 16407.0
3 AND 7.200 4.400 NaN NaN NaN 1.0
4 AGO 40.729 8.190 6550.0 63.666 58.064 35489.0

33.2.1 Missing values

Let’s check if we have any missing data

df.info()
df.isna().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Code    216 non-null    object 
 1   birthrate       205 non-null    float64
 2   Deathrate       205 non-null    float64
 3   GNI             187 non-null    float64
 4   Lifeexp_female  198 non-null    float64
 5   Lifeexp_male    198 non-null    float64
 6   Neonatal_death  193 non-null    float64
dtypes: float64(6), object(1)
memory usage: 11.9+ KB
Country Code       0
birthrate         11
Deathrate         11
GNI               29
Lifeexp_female    18
Lifeexp_male      18
Neonatal_death    23
dtype: int64

Looks like there are null values in all but one column (Country Code)

# Let's look at the distribution of values in the birthrate and deathrate columns
df.boxplot(column=['birthrate', 'Deathrate'])
<Axes: >

# Let's look at the distribution of values in terms of male and female life expectancy
df.boxplot(column=['Lifeexp_female', 'Lifeexp_male'])
<Axes: >

# Let's look at the distribution of the Gross National Income (surely we expect to see outliers!)
df.boxplot(column=['GNI'])
<Axes: >

# So, as expected, the world's wealth is not distributed evenly, but what countries are the outliers as Mary Rouncefield asked?
# We don't have the geographic coordinates as with our Hate Crime dataset to project on a map
# But perhaps we can think of other ways to get this information. Following is one way of doing this. 

# Get the 25th and 75th percentile GNI values
GNI_quartiles = df['GNI'].describe()[['25%', '75%']]

# Calculate the IQR
IQR_GNI = GNI_quartiles['75%'] - GNI_quartiles['25%']

# Calculate the whiskers
lb_GNI = GNI_quartiles['25%'] - 1.5 * IQR_GNI
ub_GNI = GNI_quartiles['75%'] + 1.5 * IQR_GNI

# Retrieve the outliers
outliers_GNI_df = df[(df['GNI'] < lb_GNI) | (df['GNI'] > ub_GNI)]

# Get Country Codes for the outliers
outliers_GNI_countries = outliers_GNI_df['Country Code']
outliers_GNI_countries
85     HKG
92     IRL
115    LUX
116    MAC
146    NOR
158    QAT
170    SGP
187    CHE
203    ARE
205    USA
Name: Country Code, dtype: object

Perhaps you can checkout this link if you are doubtful about the country names and codes

# As with Part 1, let's standardise our data before we attempt further modelling using the data
from sklearn import preprocessing
import numpy as np
import seaborn as sns

# Get column names first
df_stand = df[['birthrate', 'Deathrate', 'GNI', 'Lifeexp_female', 'Lifeexp_male', 'Neonatal_death']]
names = df_stand.columns

# Create the Scaler object
scaler = preprocessing.StandardScaler()

# Fit our data on the scaler object
df2 = scaler.fit_transform(df_stand)

# Check what type is df2? (Do you recollect this from Part 1?)
type(df2)
numpy.ndarray
# Let's convert the numpy array into a DataFrame before further processing
df2 = pd.DataFrame(df2, columns=names)
df2.tail()
birthrate Deathrate GNI Lifeexp_female Lifeexp_male Neonatal_death
211 -0.727171 0.200024 NaN 0.994355 0.807538 NaN
212 0.982566 -1.563870 -0.661830 0.051167 0.262031 -0.233989
213 1.101866 -0.604926 NaN -0.942333 -0.798039 0.209120
214 1.686551 -0.425077 -0.813823 -1.114030 -1.323007 0.039069
215 1.124585 0.117514 -0.840505 -1.604284 -1.462458 -0.026205
# Now that our data has been standardised, let's look at the distribution across all columns
ax = sns.boxplot(data=df2, orient="h")

# There are clearly several outliers in some columns. Let's take a look at the numbers
df.describe()
birthrate Deathrate GNI Lifeexp_female Lifeexp_male Neonatal_death
count 205.000000 205.000000 187.000000 198.000000 198.000000 193.000000
mean 19.637580 7.573941 20630.427807 75.193288 70.323854 12948.031088
std 9.839573 2.636414 21044.240160 7.870933 7.419214 48782.770706
min 5.900000 1.202000 780.000000 54.991000 50.582000 0.000000
25% 10.900000 5.800000 5090.000000 69.497250 65.533500 163.000000
50% 17.545000 7.163000 13280.000000 77.193000 71.140500 1288.000000
75% 27.100000 9.100000 28360.000000 80.776500 76.047500 7316.000000
max 46.079000 15.400000 123290.000000 87.700000 82.300000 546427.000000
# Let's look at the histogram for birthrate 
df[['birthrate']].plot(kind='hist', ec='black')
<Axes: ylabel='Frequency'>

Tip for reflection: Is the above a unimodal or bimodal distribution? Positive or negative skew - We have worked on this in earlier labs. Try to recollect/revisit and reflect. Can we see/say any better with a kernel density plot?

# Let's create a pairwise plot as we have done many times before to get an idea about the relationship between variables  
sns.pairplot(data = df.iloc[:,1:])

Hmmm, some likely positive and negative correlations

# Let's check the relationship between birthrate and deathrate
df.plot(x = 'birthrate', y = 'Deathrate', kind='scatter')
<Axes: xlabel='birthrate', ylabel='Deathrate'>

# Let's create a heatmap like we did in Part 1
corrMatrix = df.corr(numeric_only=True).round(1) # Again, round(1) so that it's easier to read given number of variables
sns.heatmap(corrMatrix, annot=True)
plt.show()

33.3 How quickly are populations growing?

This question can be investigated by calculating birth rate minus death rate.

# Let's add a new column to our DataFrame to indicate the rate of population change (assuming the absence of migration)
df['pop_change_rate'] = df['birthrate'] - df['Deathrate']
df.head()
Country Code birthrate Deathrate GNI Lifeexp_female Lifeexp_male Neonatal_death pop_change_rate
0 AFG 32.487 6.423 2260.0 66.026 63.047 44503.0 26.064
1 ALB 11.780 7.898 13820.0 80.167 76.816 243.0 3.882
2 DZA 24.282 4.716 11450.0 77.938 75.494 16407.0 19.566
3 AND 7.200 4.400 NaN NaN NaN 1.0 2.800
4 AGO 40.729 8.190 6550.0 63.666 58.064 35489.0 32.539
# Let's look at the distribution of our new rate of population change column 
df.boxplot(column=['pop_change_rate'])
<Axes: >

Let’s investigate the values in the column more deeply:

pop_change_rate_summary = df['pop_change_rate'].describe()
pop_change_rate_summary
count    205.000000
mean      12.063639
std       10.477867
min       -6.500000
25%        2.700000
50%       11.213000
75%       20.582000
max       37.811000
Name: pop_change_rate, dtype: float64

The results range from -6.5 (a decreasing population) to +37.811 (an increasing population). The mean is around 12.06. What does this signify?