In the previous lab, we learned how to work with dimension reduction in a relatively simple dataset. In this lab we are going to work with a much larger dataset: a subset of the US Census & Crime data (Redmond 2009). Details about the various variables can be found here.
How can we tackle a dataset with 100 different dimensions? We will try using PCA and clustering. There is a lot you can do with this dataset, so we encourage you to dig in, look for some interesting patterns, and share your findings in the Teams channel for discussion.
17.1 Data
```python
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

df = pd.read_csv('data/censusCrimeClean.csv')
df
```
| | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | ... | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LandArea | PopDens | PctUsePubTrans | ViolentCrimesPerPop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lakewoodcity | 1 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | 0.17 | 0.34 | 0.47 | ... | 0.00 | 0.12 | 0.42 | 0.50 | 0.51 | 0.64 | 0.12 | 0.26 | 0.20 | 0.20 |
| 1 | Tukwilacity | 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | 0.07 | 0.26 | 0.59 | ... | 0.00 | 0.21 | 0.50 | 0.34 | 0.60 | 0.52 | 0.02 | 0.12 | 0.45 | 0.67 |
| 2 | Aberdeentown | 1 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | 0.04 | 0.39 | 0.47 | ... | 0.00 | 0.14 | 0.49 | 0.54 | 0.67 | 0.56 | 0.01 | 0.21 | 0.02 | 0.43 |
| 3 | Willingborotownship | 1 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | 0.10 | 0.51 | 0.50 | ... | 0.00 | 0.19 | 0.30 | 0.73 | 0.64 | 0.65 | 0.02 | 0.39 | 0.28 | 0.12 |
| 4 | Bethlehemtownship | 1 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | 0.05 | 0.38 | 0.38 | ... | 0.00 | 0.11 | 0.72 | 0.64 | 0.61 | 0.53 | 0.04 | 0.09 | 0.02 | 0.03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1989 | TempleTerracecity | 10 | 0.01 | 0.40 | 0.10 | 0.87 | 0.12 | 0.16 | 0.43 | 0.51 | ... | 0.00 | 0.22 | 0.28 | 0.34 | 0.48 | 0.39 | 0.01 | 0.28 | 0.05 | 0.09 |
| 1990 | Seasidecity | 10 | 0.05 | 0.96 | 0.46 | 0.28 | 0.83 | 0.32 | 0.69 | 0.86 | ... | 0.00 | 0.53 | 0.25 | 0.17 | 0.10 | 0.00 | 0.02 | 0.37 | 0.20 | 0.45 |
| 1991 | Waterburytown | 10 | 0.16 | 0.37 | 0.25 | 0.69 | 0.04 | 0.25 | 0.35 | 0.50 | ... | 0.02 | 0.25 | 0.68 | 0.61 | 0.79 | 0.76 | 0.08 | 0.32 | 0.18 | 0.23 |
| 1992 | Walthamcity | 10 | 0.08 | 0.51 | 0.06 | 0.87 | 0.22 | 0.10 | 0.58 | 0.74 | ... | 0.01 | 0.45 | 0.64 | 0.54 | 0.59 | 0.52 | 0.03 | 0.38 | 0.33 | 0.19 |
| 1993 | Ontariocity | 10 | 0.20 | 0.78 | 0.14 | 0.46 | 0.24 | 0.77 | 0.50 | 0.62 | ... | 0.08 | 0.68 | 0.50 | 0.34 | 0.35 | 0.68 | 0.11 | 0.30 | 0.05 | 0.48 |

1994 rows × 102 columns
Do-this-yourself: Let’s say “hi” to the data
You have probably learnt by now that the best start to working with a new data set is to get to know it. What would be a good way to approach this?
A few steps we always take:
- Use the internal spreadsheet viewer in JupyterLab: go into the data folder and have a look at the data.
- Check the size of the dataset. Do you remember how to look at the shape of a dataset?
- Get a summary of the data and observe the descriptive statistics.
- Check the data types and see if there is a mix of different types, such as numerical, categorical, textual, etc.
- Create a visualisation to get a first peek into the relations. Think about the number of features. Is a visualisation feasible? (See the sketch below.)
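As a rough sketch of the steps above (assuming the `df` loaded earlier; the columns picked for plotting are just an example and you should choose your own), a first look might be something like:

```python
# A quick first look at the data: shape, types, and summary statistics.
print(df.shape)                      # how many rows and columns?
print(df.dtypes.value_counts())      # which data types are present?
df.describe()                        # descriptive statistics for numeric columns

# With ~100 features a full pairplot is not feasible; a small subset is.
import seaborn as sns
sns.pairplot(df[['population', 'householdsize', 'PctUsePubTrans', 'ViolentCrimesPerPop']])
```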
Do-this-yourself: Check whether we need to do any normalisation in this case.
What do the descriptive statistics look like? Is there anything more we need to do?
Standardisation is usually recommended before you apply PCA to a dataset. Here is that link again: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
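If you decide standardisation is needed, a minimal sketch using `StandardScaler` on the numeric columns (skipping `communityname`) could look like this. As the note below explains, this dataset has already been cleaned, so it may make little difference here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardise every column except the community name.
numeric_cols = df.columns[1:]
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_cols]), columns=numeric_cols)
df_scaled.describe()
```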
17.2 Principal Component Analysis
Note for the above: in case you wondered, we have done some cleaning and preprocessing on this data to make things easier for you, so normalisation might not be needed, or at least won't make much difference to the results. Real life will always be much messier.
OK, you have probably noticed that there are a lot of features here. Let's start experimenting with PCA and see what it can surface. Maybe it will help us reduce the dimensionality a little and tell us something about the features.
```python
from sklearn.decomposition import PCA

n_components = 2
pca = PCA(n_components=n_components)
df_pca = pca.fit(df.iloc[:, 1:])
```
```python
df_pca.explained_variance_ratio_
```
array([0.67387831, 0.08863102])
What do these numbers mean?
```python
df_pca_vals = pca.fit_transform(df.iloc[:, 1:])
df['c1'] = [item[0] for item in df_pca_vals]
df['c2'] = [item[1] for item in df_pca_vals]
```
```python
import seaborn as sns

sns.scatterplot(data=df, x='c1', y='c2')
```
Hmm, that looks problematic. What is going on? When you see patterns like this, you will need to take a step back and look at everything closely.
We should look at the component loadings, though there are going to be over 200 of them (two components times roughly 100 features). We can put them into a Pandas dataframe and then sort them.
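A sketch of one way to do this: the loadings live in `pca.components_` (one row per component, one column per input feature). Note that `df` now also contains the added `c1` and `c2` columns, so we take only the original feature names.

```python
# Original feature names: everything between 'communityname' and the added 'c1'/'c2'.
feature_names = df.columns[1:-2]

# Rows = features, columns = the two principal components.
loadings = pd.DataFrame(pca.components_.T, index=feature_names, columns=['c1', 'c2'])

# Sort by the absolute size of the loading on the first component.
loadings.sort_values('c1', key=abs, ascending=False).head(10)
```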
Do you notice anything extraordinary within these variable loadings?
Remember, the variable loadings tell us to what extent the information encoded along a principal component axis is determined by a single variable; in other words, how important a variable is. Have a look and see if anything is too problematic here.
The fold variable is messing with our PCA. What does that variable look like?
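A quick way to inspect it, as a sketch:

```python
# Summary statistics and the distinct values taken by 'fold'.
print(df['fold'].describe())
print(df['fold'].value_counts().sort_index())
```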
Aha! It looks to be some sort of categorical variable. Looking into the dataset more, it is actually the fold number used for cross-validation, not a real attribute of the communities.
We should not include this variable in our analysis.
```python
df.iloc[:, 2:-2]
```
| | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | ... | NumStreet | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LandArea | PopDens | PctUsePubTrans | ViolentCrimesPerPop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | ... | 0.00 | 0.12 | 0.42 | 0.50 | 0.51 | 0.64 | 0.12 | 0.26 | 0.20 | 0.20 |
| 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | 0.07 | 0.26 | 0.59 | 0.35 | 0.27 | ... | 0.00 | 0.21 | 0.50 | 0.34 | 0.60 | 0.52 | 0.02 | 0.12 | 0.45 | 0.67 |
| 2 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | 0.04 | 0.39 | 0.47 | 0.28 | 0.32 | ... | 0.00 | 0.14 | 0.49 | 0.54 | 0.67 | 0.56 | 0.01 | 0.21 | 0.02 | 0.43 |
| 3 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | 0.10 | 0.51 | 0.50 | 0.34 | 0.21 | ... | 0.00 | 0.19 | 0.30 | 0.73 | 0.64 | 0.65 | 0.02 | 0.39 | 0.28 | 0.12 |
| 4 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | 0.05 | 0.38 | 0.38 | 0.23 | 0.36 | ... | 0.00 | 0.11 | 0.72 | 0.64 | 0.61 | 0.53 | 0.04 | 0.09 | 0.02 | 0.03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1989 | 0.01 | 0.40 | 0.10 | 0.87 | 0.12 | 0.16 | 0.43 | 0.51 | 0.35 | 0.30 | ... | 0.00 | 0.22 | 0.28 | 0.34 | 0.48 | 0.39 | 0.01 | 0.28 | 0.05 | 0.09 |
| 1990 | 0.05 | 0.96 | 0.46 | 0.28 | 0.83 | 0.32 | 0.69 | 0.86 | 0.73 | 0.14 | ... | 0.00 | 0.53 | 0.25 | 0.17 | 0.10 | 0.00 | 0.02 | 0.37 | 0.20 | 0.45 |
| 1991 | 0.16 | 0.37 | 0.25 | 0.69 | 0.04 | 0.25 | 0.35 | 0.50 | 0.31 | 0.54 | ... | 0.02 | 0.25 | 0.68 | 0.61 | 0.79 | 0.76 | 0.08 | 0.32 | 0.18 | 0.23 |
| 1992 | 0.08 | 0.51 | 0.06 | 0.87 | 0.22 | 0.10 | 0.58 | 0.74 | 0.63 | 0.41 | ... | 0.01 | 0.45 | 0.64 | 0.54 | 0.59 | 0.52 | 0.03 | 0.38 | 0.33 | 0.19 |
| 1993 | 0.20 | 0.78 | 0.14 | 0.46 | 0.24 | 0.77 | 0.50 | 0.62 | 0.40 | 0.17 | ... | 0.08 | 0.68 | 0.50 | 0.34 | 0.35 | 0.68 | 0.11 | 0.30 | 0.05 | 0.48 |

1994 rows × 100 columns
```python
from sklearn.decomposition import PCA

n_components = 2
pca_no_fold = PCA(n_components=n_components)
df_pca_no_fold = pca_no_fold.fit(df.iloc[:, 2:-2])
df_pca_vals = pca_no_fold.fit_transform(df.iloc[:, 2:-2])
df['c1_no_fold'] = [item[0] for item in df_pca_vals]
df['c2_no_fold'] = [item[1] for item in df_pca_vals]

sns.scatterplot(data=df, x='c1_no_fold', y='c2_no_fold')
```
It is interesting that our first component's variables are income related, whereas our second component's variables are immigration related. If we look at the projections of the model, coloured by crime, what do we see?
```python
sns.scatterplot(data=df, x='c1_no_fold', y='c2_no_fold',
                hue='ViolentCrimesPerPop', size='ViolentCrimesPerPop')
```
17.3 Clustering
What about clustering using these income and immigration components?
17.3.1 Defining number of clusters
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k, n_init=10)
    # Fit model to samples
    model.fit(df[['c1_no_fold', 'c2_no_fold']])
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
```
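The scatterplot below colours points by a 'Four clusters' column, which is not created by any of the code shown above. A minimal sketch of that missing step, assuming four clusters are chosen from the elbow plot, might be:

```python
# Fit KMeans with k=4 on the two PCA components and store the labels.
kmeans_four = KMeans(n_clusters=4, n_init=10)
labels = kmeans_four.fit_predict(df[['c1_no_fold', 'c2_no_fold']])

# Store the labels as strings so seaborn treats the hue as categorical.
df['Four clusters'] = labels.astype(str)
```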
```python
sns.scatterplot(data=df, x='c1_no_fold', y='c2_no_fold', hue='Four clusters')
```
Hmm, how might we interpret this plot? It is not entirely clear. Let us assume c1 is an income component and c2 is an immigration component. Cluster 0 contains places high on income and low on immigration. Cluster 2 contains places low on income and low on immigration. The interesting group is the one high on immigration and relatively low on income.
We can include crime on the plot.
```python
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (15, 10)
sns.scatterplot(data=df, x='c1_no_fold', y='c2_no_fold',
                hue='Four clusters', size='ViolentCrimesPerPop')
```
Alas, we stop here in this exercise, but you could take this further. There are outliers you might want to remove, or you might want to look at more components, say 3 or 4, and look for other factors.
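If you do want to look at more components, a sketch of one way to start (the variable names here are illustrative) is to re-run PCA with a larger `n_components` and check how much variance each component explains before interpreting its loadings:

```python
from sklearn.decomposition import PCA

# The 100 original feature columns, excluding communityname and fold
# (the added c1/c2 columns sit at the end of df, so positions 2:102 are unchanged).
features = df.iloc[:, 2:102]

pca_four = PCA(n_components=4)
components = pca_four.fit_transform(features)
print(pca_four.explained_variance_ratio_)
```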