34 Lab: Gender gaps

34.1 Source (Dataset)

Office for National Statistics (ONS), Gender Pay Gap: ONS Source

34.2 Explanations (from the source)

Gender pay gap (GPG) - calculated as the difference between average hourly earnings (excluding overtime) of men and women as a proportion of average hourly earnings (excluding overtime) of men. For example, a 4% GPG denotes that women earn 4% less, on average, than men. Conversely, a -4% GPG denotes that women earn 4% more, on average, than men.
Mean: a measure of the average which is derived by summing the values for a given sample, and then dividing the sum by the number of observations (i.e. jobs) in the sample. In earnings distributions, the mean can be disproportionately influenced by a relatively small number of high-paying jobs.
Median: the value below which 50% of jobs fall. It is ONS’s preferred measure of average earnings as it is less affected by a relatively small number of very high earners and the skewed distribution of earnings. It therefore gives a better indication of typical pay than the mean.
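To make these definitions concrete, here is a minimal sketch in Python. The pay figures are invented purely for illustration (they are not ONS data), and the helper name `gender_pay_gap` is our own rather than anything from the source.

```python
from statistics import mean, median

def gender_pay_gap(men_hourly, women_hourly):
    """GPG (%) = difference in average hourly pay as a proportion of men's pay."""
    return 100 * (men_hourly - women_hourly) / men_hourly

# Made-up average hourly earnings (excluding overtime), for illustration only
print(gender_pay_gap(men_hourly=25.0, women_hourly=24.0))  # positive: women earn less
print(gender_pay_gap(men_hourly=25.0, women_hourly=26.0))  # negative: women earn more

# Made-up hourly pay for seven jobs; one very high-paying job skews the mean
hourly_pay = [11, 12, 12, 13, 14, 15, 95]
print(mean(hourly_pay))    # pulled upwards by the single high earner
print(median(hourly_pay))  # the middle value, closer to 'typical' pay
```

Notice how a single high-paying job moves the mean a long way from the median, which is why the median is the preferred measure here.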

34.2.1 Coverage and timeliness

The Annual Survey of Hours and Earnings (ASHE) covers employee jobs in the United Kingdom. It does not cover the self-employed, nor does it cover employees not paid during the reference period (2023).

GPG estimates are provided for the pay period that included a specified date in April. They relate to employees on adult rates of pay, whose earnings for the survey pay period were not affected by absence.

ASHE is based on a 1% sample of jobs taken from HM Revenue and Customs’ Pay As You Earn (PAYE) records. Consequently, individuals with more than one job may appear in the sample more than once.

34.3 Reading the dataset

%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df_profession = pd.read_excel('data/genderpaygap.xlsx', sheet_name='All')
df_profession_category = pd.read_excel('data/genderpaygap.xlsx', sheet_name='Main')
df_age = pd.read_excel('data/genderpaygap.xlsx', sheet_name='Age')
df_geography = pd.read_excel('data/genderpaygap.xlsx', sheet_name='Geography')

Let’s have a look at our datasets:

df_profession.tail()

	Description	Code	GPGmedian	GPGmean
31	Process, plant and machine operatives	81	14.0	14.1
32	Transport and mobile machine drivers and ope...	82	10.5	2.9
33	Elementary occupations	9	5.8	8.1
34	Elementary trades and related occupations	91	7.1	7.7
35	Elementary administration and service occupa...	92	5.6	8.2

df_profession_category.tail()

	Description	Code	GPGmedian	GPGmean
5	Skilled trades occupations	5	19.0	14.5
6	Caring, leisure and other service occupations	6	1.5	2.0
7	Sales and customer service occupations	7	3.7	4.5
8	Process, plant and machine operatives	8	14.1	13.0
9	Elementary occupations	9	5.8	8.1

df_age

	age_group	GPGmedian	GPGmean
0	16-17b	0.0	-7.9
1	18-21	0.8	10.6
2	22-29	4.8	4.3
3	30-39	11.5	9.8
4	40-49	17.0	15.1
5	50-59	19.7	17.9
6	60+	18.1	18.2

df_geography.tail()

	Description	Code	GPGmedian	GPGmean
385	South Lanarkshire	S12000029	6.1	7.5
386	Stirling	S12000030	7.4	21.9
387	West Dunbartonshire	S12000039	17.5	12.8
388	West Lothian	S12000040	8.3	9.6
389	Northern Ireland	N92000002	8.1	9.6

If you look at the Excel data files, you will see that occupations have a main category and sub-categories. Since we already have the main-category values in df_profession_category, let’s drop them from df_profession so that it focuses on sub-categories only. We can do this using the Code column: as you can see, main categories have codes below 10 and sub-categories have codes above 10.

indices_to_drop = df_profession[df_profession['Code'] < 10].index
df_profession.drop(indices_to_drop, inplace=True)
df_profession

	Description	Code	GPGmedian	GPGmean
2	Corporate managers and directors	11	12.4	12.8
3	Other managers and proprietors	12	4.8	8.6
5	Science, research, engineering and technolog...	21	10.2	9.2
6	Health professionals	22	10.2	15.2
7	Teaching and other educational professionals	23	3.8	8.9
8	Business, media and public service professio...	24	7.9	11
10	Science, engineering and technology associat...	31	11.8	8
11	Health and social care associate professionals	32	4.7	4.9
12	Protective service occupations	33	4.7	3.4
13	Culture, media and sports occupations	34	5.2	x
14	Business and public service associate profes...	35	13.9	18
16	Administrative occupations	41	5.9	6.4
17	Secretarial and related occupations	42	-0.9	-2.3
19	Skilled agricultural and related trades	51	-6.4	-3.8
20	Skilled metal, electrical and electronic trades	52	9.4	2.3
21	Skilled construction and building trades	53	12.6	6.2
22	Textiles, printing and other skilled trades	54	3.4	4.4
24	Caring personal service occupations	61	0.7	0.7
25	Leisure, travel and related personal service...	62	5.3	7.4
26	Community and civil enforcement occupations	63	-28.9	-20.6
28	Sales occupations	71	1.8	6.3
29	Customer service occupations	72	0.0	2.8
31	Process, plant and machine operatives	81	14.0	14.1
32	Transport and mobile machine drivers and ope...	82	10.5	2.9
34	Elementary trades and related occupations	91	7.1	7.7
35	Elementary administration and service occupa...	92	5.6	8.2
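An equivalent way to express the same filter is a boolean mask that keeps the rows you want, rather than dropping the ones you don't. The sketch below uses a three-row stand-in for df_profession (rows copied from the tables above); `df_demo` and `sub_categories` are names introduced here for illustration only.

```python
import pandas as pd

# A small stand-in for df_profession, using rows shown in the tables above
df_demo = pd.DataFrame({
    'Description': ['Elementary occupations',
                    'Elementary trades and related occupations',
                    'Process, plant and machine operatives'],
    'Code': [9, 91, 81],
    'GPGmedian': [5.8, 7.1, 14.0],
})

# Keep only the sub-categories (two-digit codes); .copy() gives an independent frame
sub_categories = df_demo[df_demo['Code'] >= 10].copy()
print(sub_categories)
```

This version leaves the original dataframe untouched, which can make it easier to re-run notebook cells in any order.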

34.3.1 Missing values

Let’s check our data

df_profession.info()
df_profession_category.info()
df_age.info()
df_geography.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26 entries, 2 to 35
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Description  26 non-null     object 
 1   Code         26 non-null     int64  
 2   GPGmedian    26 non-null     float64
 3   GPGmean      26 non-null     object 
dtypes: float64(1), int64(1), object(2)
memory usage: 1.0+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Description  10 non-null     object 
 1   Code         10 non-null     int64  
 2   GPGmedian    10 non-null     float64
 3   GPGmean      10 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 452.0+ bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age_group  7 non-null      object 
 1   GPGmedian  7 non-null      float64
 2   GPGmean    7 non-null      float64
dtypes: float64(2), object(1)
memory usage: 300.0+ bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Description  390 non-null    object
 1   Code         390 non-null    object
 2   GPGmedian    390 non-null    object
 3   GPGmean      390 non-null    object
dtypes: object(4)
memory usage: 12.3+ KB

# It looks like GPGmean is read as an object (string) in df_profession dataframe. 
# GPGmean and GPGmedian are both objects in df_geography
# Let's convert the data to float64, so we can create plots later
df_profession['GPGmean'] = pd.to_numeric(df_profession['GPGmean'], errors='coerce')
df_geography['GPGmean'] = pd.to_numeric(df_geography['GPGmean'], errors='coerce')
df_geography['GPGmedian'] = pd.to_numeric(df_geography['GPGmedian'], errors='coerce')
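It is worth seeing what errors='coerce' actually does before moving on. The sketch below uses four GPGmean values copied from the table above, including the 'x' in the Culture, media and sports row (our assumption is that this marks an estimate ONS did not publish; check the notes in the workbook). The names `raw` and `converted` are ours.

```python
import pandas as pd

# Values as they arrive from Excel: mostly numbers, plus an 'x' marker
raw = pd.Series([4.9, 3.4, 'x', 18], name='GPGmean')

# Anything that cannot be parsed as a number becomes NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)         # the column is now numeric
print(converted.isna().sum())  # how many entries could not be parsed
print(raw[converted.isna()])   # the original entries that became NaN
```

So the conversion fixes the dtype, but it does so by quietly creating missing values, which is exactly why a missing-value check should follow it.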

# Next, let's check for missing values
df_profession.isna().sum()
df_profession_category.isna().sum()
df_age.isna().sum()

age_group    0
GPGmedian    0
GPGmean      0
dtype: int64

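Note that a notebook cell displays only its last expression, so the output above is for df_age alone, which has no missing values. df_profession is not quite as clean: the ‘x’ in its GPGmean column was coerced to NaN, which is why describe() below reports a count of 25 rather than 26 for GPGmean. With that in mind, let’s get plotting.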

# Let's plot the mean and median Gender Pay Gap (GPG)
df_profession.boxplot(column=['GPGmedian', 'GPGmean'])

Hmmm, there are outliers. Let’s check the descriptive statistics

# Let's look at the distribution of the values in the columns
df_profession.describe()

	Code	GPGmedian	GPGmean
count	26.000000	26.000000	25.00000
mean	47.923077	4.988462	5.70800
std	23.918065	8.505778	7.49277
min	11.000000	-28.900000	-20.60000
25%	31.250000	3.500000	2.90000
50%	46.500000	5.450000	6.40000
75%	62.750000	10.200000	8.90000
max	92.000000	14.000000	18.00000
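If you want to know which values the box plot treats as outliers, you can reproduce the rule yourself. The sketch below assumes the common 1.5 × IQR rule (the default whisker setting in matplotlib box plots) and uses the 26 GPGmedian values copied from the sub-category table above; the variable names are ours.

```python
import pandas as pd

# GPGmedian for the 26 sub-categories, copied from the table above
gpg_median = pd.Series([
    12.4, 4.8, 10.2, 10.2, 3.8, 7.9, 11.8, 4.7, 4.7, 5.2, 13.9, 5.9, -0.9,
    -6.4, 9.4, 12.6, 3.4, 0.7, 5.3, -28.9, 1.8, 0.0, 14.0, 10.5, 7.1, 5.6,
])

# Quartiles and interquartile range (these match the 25% and 75% rows of describe())
q1, q3 = gpg_median.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones drawn as individual outlier markers
outliers = gpg_median[(gpg_median < lower_fence) | (gpg_median > upper_fence)]
print(q1, q3, lower_fence, upper_fence)
print(outliers)
```

Under this rule only the most negative value falls outside the fences, so the GPGmedian outlier is a single sub-category rather than a cluster.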

# Let's try to visualise what's going on with a histogram - what type of skew do you notice?
df_profession[['GPGmedian']].plot(kind='hist', ec='black')

Hmmm, there appears to be a lone bin in our histogram. Which might be the profession or professions where women earn more than men?

# Is there one profession or more professions where women earn more? Let's do some investigation through visualisation. 
import altair as alt
alt.Chart(df_profession).mark_bar().encode(
    alt.X("GPGmedian:Q", bin=True, title='GPGmedian'), 
    y=alt.Y('Description:N', sort='-x', title='Professional Category'),  
    color='Description:N',
    tooltip=['Description', 'GPGmedian']
).properties(
    width=600,
    height=400
)

This plot shows us that Community and civil enforcement occupations, skilled agricultural and related trades, and secretarial and related occupations are the ones where women earn, on average, more than men.
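You can also confirm this reading of the chart with a simple filter, since a negative GPG means women earn more than men on average. The sketch below uses a handful of rows copied from df_profession above (not the full table); `df_demo` and `women_earn_more` are illustrative names.

```python
import pandas as pd

# A few rows copied from df_profession above (not the full table)
df_demo = pd.DataFrame({
    'Description': ['Administrative occupations',
                    'Secretarial and related occupations',
                    'Skilled agricultural and related trades',
                    'Community and civil enforcement occupations',
                    'Customer service occupations'],
    'GPGmedian': [5.9, -0.9, -6.4, -28.9, 0.0],
})

# Keep rows with a negative gap, most negative first
women_earn_more = df_demo[df_demo['GPGmedian'] < 0].sort_values('GPGmedian')
print(women_earn_more)
```

On the real data, the same filter applied to df_profession should return the three occupations named above.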

If you are wondering what ‘community and civil enforcement occupations’ means, this ONS source says it includes police community and parking and civil enforcement officers.

Are these the occupations in which you expected women to earn more than men (on average)?

Sidenote

The above visualisation is detailed, but it’s busy and cluttered. How about if we try doing this on df_profession_category instead?

# Is there a main category where women earn more? Let's do some investigation.
import altair as alt
alt.Chart(df_profession_category).mark_bar().encode(
    alt.X("GPGmedian:Q", bin=True), 
    y=alt.Y('Description:N', sort='-x'),  
    color='Description:N',
    tooltip=['Description', 'GPGmedian']
).properties(
    width=600,
    height=400
)

Here we have lost some of the detail we had in the earlier visualisation, but we learn that “Caring, leisure and other service occupations” is a ‘main category’ of occupation where the GPG is low (but women don’t earn more than men).

Sidenote

What does this narrative tell you about women being more likely to do multiple jobs to work around their domestic responsibilities, which we spoke about in the lecture (and recordings)?

# Alternative visualisation (excluding all employees category)

# In which main professional categories is the gap narrow? Let's find out!
df_professions_sorted = df_profession_category.sort_values('GPGmedian', ascending=True)

# Let's drop the row corresponding to 'All employees' because we are more interested in looking at the differences across professional categories and sub-categories here 
df_professions_sorted = df_professions_sorted[df_professions_sorted['Description'] != 'All employees']

# Let's create the bar plot
df_professions_sorted.plot.bar(x='Description', y='GPGmedian')

Let’s look at age-based differences next:

df_age.sort_values('age_group', ascending=True).plot.bar(x = 'age_group', y = 'GPGmedian')

Reflection

It seems that GPG increases with age. What does this say about our discussion during the lectures about GPG increasing for women who take time off from work for a variety of reasons, compared to their male and female counterparts who do not take time out of work? What do you think might be the reasons for the minor fall in GPG at 60+?

34.4 Geography

34.4.0.1 But first:

Now that we have moved on from the Iris and Wine datasets, remember that real-world datasets rarely come prepared (ready to use).

If you download the zip file for the latest 2023 Pay Gap ONS statistics, you will notice that they have color-coded their cells based on the certainty of estimates. On the one hand, this is very good practice - being transparent about the quality of the data. On the other hand, it tells us that we need to be careful about what insights we can draw from the data.
If you look at the statistics for one year, you get a glimpse of what’s happening across various categories (geography, age, profession, etc.) in terms of GPG, but it’s a cross-sectional view. You can collate a longitudinal view should you wish to, e.g. by downloading the zip folders for the desired years and collating the information for the desired categories across those years. But remember that this will be a ‘simplified approach’ to a longitudinal view and will have limitations. Also, recall one of the figures Cagatay showed in the earlier lectures: it’s common to spend a lot of time at the start of your Data Science project just collating the necessary information. If you fancy, you can write a script to automate the data collation process!
We have geography information as area codes from the ONS source, but wouldn’t it be nice if we were able to visualise GPG by geography on a map of England (with the Levelling Up agenda and all)? That’s the data hunt I went on. The ONS’ Geodata portal provides datasets for the different administrative boundaries, so I downloaded this one: Counties and Unitary Authorities (May 2023) Boundaries UK BUC. Now let’s see what visualisation we can create with it.

Takeaway

Collating data from multiple sources is a significant, valuable and legitimate part of the Data Science project journey
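If you do fancy automating the collation across years, one possible shape for such a script is sketched below. The helper `collate_years` is hypothetical, and so is the idea that each year's sheet has the same columns. The 2023 rows come from df_age above, while the 2022 values are deliberately left as NaN placeholders because we have not loaded that year's workbook here.

```python
import pandas as pd

def collate_years(frames_by_year):
    """Stack one DataFrame per year into a single long table with a 'year' column."""
    tagged = [df.assign(year=year) for year, df in frames_by_year.items()]
    return pd.concat(tagged, ignore_index=True)

# Real rows for 2023 (from df_age above)
age_2023 = pd.DataFrame({'age_group': ['22-29', '30-39'],
                         'GPGmedian': [4.8, 11.5]})

# PLACEHOLDER rows for 2022: replace the NaNs with values read from that year's file
age_2022 = pd.DataFrame({'age_group': ['22-29', '30-39'],
                         'GPGmedian': [float('nan'), float('nan')]})

long_view = collate_years({2022: age_2022, 2023: age_2023})
print(long_view)
```

A long table like this (one row per category per year) is usually the easiest shape to plot as a line chart over time.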

# Getting the geospatial polygons for England
import geopandas as gpd 
import altair as alt

geo_states_england = gpd.read_file('data/Counties_and_Unitary_Authorities_May_2023_UK_BUC_-7406349609691062173.gpkg')
geo_states_england.head()

	CTYUA23CD	CTYUA23NM	BNG_E	BNG_N	LONG	LAT	GlobalID	geometry
0	E06000001	Hartlepool	447160	531474	-1.27018	54.676102	{224B1BB0-27FA-4B44-AD01-F22525CE232E}	MULTIPOLYGON (((448973.593 536745.277, 448290....
1	E06000002	Middlesbrough	451141	516887	-1.21099	54.544701	{8A06DF87-1F09-4A1C-9D6E-A32D40A0B159}	MULTIPOLYGON (((451894.299 521145.303, 448410....
2	E06000003	Redcar and Cleveland	464361	519597	-1.00608	54.567501	{4A930CE8-4656-4A98-880E-8110EE3D8501}	MULTIPOLYGON (((478232.568 518788.831, 478074....
3	E06000004	Stockton-on-Tees	444940	518183	-1.30664	54.556900	{304224A1-E808-4BF2-8F3E-AC43B0368BE8}	MULTIPOLYGON (((452243.536 526335.188, 451148....
4	E06000005	Darlington	428029	515648	-1.56835	54.535301	{F7BBD06A-7E09-4832-90D0-F6CA591D4A1D}	MULTIPOLYGON (((436388.002 522354.197, 435529....

print(geo_states_england.columns)

Index(['CTYUA23CD', 'CTYUA23NM', 'CTYUA23NMW', 'BNG_E', 'BNG_N', 'LONG', 'LAT',
       'GlobalID', 'geometry'],
      dtype='object')

# Let's drop the columns we don't need
geo_states_england = geo_states_england.drop(['CTYUA23NMW', 'BNG_E', 'BNG_N', 'GlobalID'], axis=1)

# Let's check again
geo_states_england.head()

	CTYUA23CD	CTYUA23NM	LONG	LAT	geometry
0	E06000001	Hartlepool	-1.27018	54.676102	MULTIPOLYGON (((448973.593 536745.277, 448290....
1	E06000002	Middlesbrough	-1.21099	54.544701	MULTIPOLYGON (((451894.299 521145.303, 448410....
2	E06000003	Redcar and Cleveland	-1.00608	54.567501	MULTIPOLYGON (((478232.568 518788.831, 478074....
3	E06000004	Stockton-on-Tees	-1.30664	54.556900	MULTIPOLYGON (((452243.536 526335.188, 451148....
4	E06000005	Darlington	-1.56835	54.535301	MULTIPOLYGON (((436388.002 522354.197, 435529....

# Let's create a map of England
pre_GPG_England = alt.Chart(geo_states_england, title='Map of England').mark_geoshape().encode(
    tooltip=['CTYUA23NM']
).properties(
    width=500,
    height=300
)
pre_GPG_England

Wait, what’s that?! That’s not what we were expecting!

Map projections

Because the Earth is round, and maps are flat, geospatial data needs to be “projected”. There are many ways of projecting geospatial data, and all of them come with some tradeoff in terms of distorting area and/or distance (in other words, none of them is perfect). You can read more here.

Now, the geospatial dataset that we are using for this notebook was downloaded from the Office for National Statistics’ Geoportal and uses a Coordinate Reference System (CRS) known as EPSG:27700 - OSGB36 / British National Grid. Unfortunately, Altair works with a different CRS, WGS 84 (also known as EPSG:4326), and this mismatch is what distorts the map.

We have two options: either reproject our data using geopandas or, following the Altair documentation, use the projection configuration type='identity', reflectY=True, which draws the geometries without applying a projection.
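For reference, the first option (reprojecting with geopandas) might look like the sketch below. Rather than loading the boundary file, it builds a one-row GeoDataFrame from Hartlepool's BNG_E and BNG_N values shown in the table above, so that the result can be compared with the LONG and LAT columns of the same row. The name `demo` is ours.

```python
import geopandas as gpd
from shapely.geometry import Point

# One point in British National Grid coordinates (Hartlepool's BNG_E / BNG_N)
demo = gpd.GeoDataFrame(
    {'CTYUA23NM': ['Hartlepool']},
    geometry=[Point(447160, 531474)],
    crs='EPSG:27700',
)

# Reproject from British National Grid to WGS 84 (longitude / latitude)
demo_wgs84 = demo.to_crs(epsg=4326)
point = demo_wgs84.geometry.iloc[0]

print(demo_wgs84.crs)
print(round(point.x, 3), round(point.y, 3))  # compare with the file's LONG / LAT
```

On the real data the equivalent call would be geo_states_england.to_crs(epsg=4326). In this notebook we take the second option instead, which avoids changing the data.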

# Let's create a map of England
pre_GPG_England = alt.Chart(
    geo_states_england, title='Map of England'
).mark_geoshape().encode(
    tooltip=['CTYUA23NM']
).properties(
    width=500,
    height=300
).project(
    type='identity',
    reflectY=True
)
pre_GPG_England

df_geography.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Description  390 non-null    object 
 1   Code         390 non-null    object 
 2   GPGmedian    386 non-null    float64
 3   GPGmean      387 non-null    float64
dtypes: float64(2), object(2)
memory usage: 12.3+ KB

geo_states_england

	CTYUA23CD	CTYUA23NM	LONG	LAT	geometry
0	E06000001	Hartlepool	-1.27018	54.676102	MULTIPOLYGON (((448973.593 536745.277, 448290....
1	E06000002	Middlesbrough	-1.21099	54.544701	MULTIPOLYGON (((451894.299 521145.303, 448410....
2	E06000003	Redcar and Cleveland	-1.00608	54.567501	MULTIPOLYGON (((478232.568 518788.831, 478074....
3	E06000004	Stockton-on-Tees	-1.30664	54.556900	MULTIPOLYGON (((452243.536 526335.188, 451148....
4	E06000005	Darlington	-1.56835	54.535301	MULTIPOLYGON (((436388.002 522354.197, 435529....
...	...	...	...	...	...
213	W06000020	Torfaen	-3.05101	51.698399	MULTIPOLYGON (((333723 192653.903, 330700.402 ...
214	W06000021	Monmouthshire	-2.90280	51.778301	MULTIPOLYGON (((329597.402 229251.797, 326793....
215	W06000022	Newport	-2.89769	51.582298	MULTIPOLYGON (((343091.833 184213.309, 342279....
216	W06000023	Powys	-3.43531	52.348598	MULTIPOLYGON (((322891.55 333139.949, 321104.0...
217	W06000024	Merthyr Tydfil	-3.36425	51.748600	MULTIPOLYGON (((308057.304 211036.201, 306294....

218 rows × 5 columns

df_geography

	Description	Code	GPGmedian	GPGmean
0	Darlington UA	E06000005	5.4	13.3
1	Hartlepool UA	E06000001	6.2	8.9
2	Middlesbrough UA	E06000002	14.5	15.6
3	Redcar and Cleveland UA	E06000003	12.8	12.3
4	Stockton-on-Tees UA	E06000004	17.1	16.9
...	...	...	...	...
385	South Lanarkshire	S12000029	6.1	7.5
386	Stirling	S12000030	7.4	21.9
387	West Dunbartonshire	S12000039	17.5	12.8
388	West Lothian	S12000040	8.3	9.6
389	Northern Ireland	N92000002	8.1	9.6

390 rows × 4 columns

# Add the data
geo_states_england_merged = geo_states_england.merge(df_geography, left_on = 'CTYUA23CD', right_on = 'Code')

# Check the merged data
geo_states_england_merged.head(10)

	CTYUA23CD	CTYUA23NM	LONG	LAT	geometry	Description	Code	GPGmedian	GPGmean
0	E06000001	Hartlepool	-1.27018	54.676102	MULTIPOLYGON (((448973.593 536745.277, 448290....	Hartlepool UA	E06000001	6.2	8.9
1	E06000002	Middlesbrough	-1.21099	54.544701	MULTIPOLYGON (((451894.299 521145.303, 448410....	Middlesbrough UA	E06000002	14.5	15.6
2	E06000003	Redcar and Cleveland	-1.00608	54.567501	MULTIPOLYGON (((478232.568 518788.831, 478074....	Redcar and Cleveland UA	E06000003	12.8	12.3
3	E06000004	Stockton-on-Tees	-1.30664	54.556900	MULTIPOLYGON (((452243.536 526335.188, 451148....	Stockton-on-Tees UA	E06000004	17.1	16.9
4	E06000005	Darlington	-1.56835	54.535301	MULTIPOLYGON (((436388.002 522354.197, 435529....	Darlington UA	E06000005	5.4	13.3
5	E06000006	Halton	-2.68853	53.334202	MULTIPOLYGON (((358131.901 385425.802, 355191....	Halton UA	E06000006	3.4	4.6
6	E06000007	Warrington	-2.56167	53.391602	MULTIPOLYGON (((367582.201 396058.199, 367158....	Warrington UA	E06000007	12.8	14.6
7	E06000008	Blackburn with Darwen	-2.46360	53.700802	MULTIPOLYGON (((372966.498 423266.501, 371465....	Blackburn with Darwen UA	E06000008	22.3	15.3
8	E06000009	Blackpool	-3.02199	53.821602	MULTIPOLYGON (((333572.799 437130.702, 333041....	Blackpool UA	E06000009	4.4	3.5
9	E06000010	Kingston upon Hull, City of	-0.30382	53.769199	MULTIPOLYGON (((515429.592 427689.472, 516047....	Kingston upon Hull UA	E06000010	16.1	7.9
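One thing to keep in mind is that merge performs an inner join by default, so any area whose code appears in only one of the two tables silently disappears. A quick way to check is sketched below with small stand-in tables: the first two codes and values are copied from above, while 'X00000000' is a hypothetical code added only to show what an unmatched area looks like.

```python
import pandas as pd

# Stand-ins for the boundary table and the GPG table
boundaries = pd.DataFrame({
    'CTYUA23CD': ['E06000001', 'E06000002', 'X00000000'],  # last code is HYPOTHETICAL
    'CTYUA23NM': ['Hartlepool', 'Middlesbrough', 'Hypothetical area'],
})
gpg = pd.DataFrame({
    'Code': ['E06000001', 'E06000002'],
    'GPGmedian': [6.2, 14.5],
})

# how='left' keeps every boundary; indicator=True records where each row matched
checked = boundaries.merge(gpg, left_on='CTYUA23CD', right_on='Code',
                           how='left', indicator=True)

# Boundaries with no GPG data would be dropped by the default inner join
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['CTYUA23CD', 'CTYUA23NM']])
```

Running the same check on geo_states_england and df_geography would tell you whether any areas go missing from the map, and which ones.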

# Let's plot the GPG by geography now
post_GPG_England = alt.Chart(geo_states_england_merged, title='GPG by region - England').mark_geoshape().encode(
    color='GPGmedian',
    tooltip=['Description', 'GPGmedian']
).properties(
    width=500,
    height=300
).project(
    type='identity',
    reflectY=True
)
post_GPG_England

# side by side view
GPG_England = pre_GPG_England | post_GPG_England
GPG_England

How do the results in this workbook compare to the visualisation we saw during the lecture, for example, the one for the UK in Information is Beautiful? But remember that the earnings across the two might be for different years, so do check the metadata!