16  Lab: Dimension Reduction

So far, we have worked with relatively small datasets, but datasets often have a high number of dimensions. In those cases, applying the methods we have covered in past sessions (e.g., visual exploration using data visualisations, correlations between two variables…) may be really difficult, if not impossible.

In this notebook we are going to explore two different techniques for reducing the dimensionality of our data: Clustering and Principal Component Analysis (PCA).

16.1 Data preparations

In this notebook, we are going to use the simple Iris flower dataset (Fisher 1936). The dataset, which is famously used to introduce these methods, consists of 4 measures or attributes (the length and the width of the sepals and petals, in centimeters) describing 50 samples from each of three species of flowers (Iris setosa, Iris virginica and Iris versicolor).

Image of a primrose willowherb Ludwigia octovalvis (Family Onagraceae), flower showing petals and sepals. Photograph made in Hawai’i by Eric Guinther and released under the GNU Free Documentation License.

Contrary to previous sessions, in this case the dataset will not be read from a CSV file; instead, it is provided by the sklearn.datasets submodule:

import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()

# Explore our data.
iris
{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
        [5.5, 4.2, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.2],
        [5. , 3.2, 1.2, 0.2],
        [5.5, 3.5, 1.3, 0.2],
        [4.9, 3.6, 1.4, 0.1],
        [4.4, 3. , 1.3, 0.2],
        [5.1, 3.4, 1.5, 0.2],
        [5. , 3.5, 1.3, 0.3],
        [4.5, 2.3, 1.3, 0.3],
        [4.4, 3.2, 1.3, 0.2],
        [5. , 3.5, 1.6, 0.6],
        [5.1, 3.8, 1.9, 0.4],
        [4.8, 3. , 1.4, 0.3],
        [5.1, 3.8, 1.6, 0.2],
        [4.6, 3.2, 1.4, 0.2],
        [5.3, 3.7, 1.5, 0.2],
        [5. , 3.3, 1.4, 0.2],
        [7. , 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.9, 3.1, 4.9, 1.5],
        [5.5, 2.3, 4. , 1.3],
        [6.5, 2.8, 4.6, 1.5],
        [5.7, 2.8, 4.5, 1.3],
        [6.3, 3.3, 4.7, 1.6],
        [4.9, 2.4, 3.3, 1. ],
        [6.6, 2.9, 4.6, 1.3],
        [5.2, 2.7, 3.9, 1.4],
        [5. , 2. , 3.5, 1. ],
        [5.9, 3. , 4.2, 1.5],
        [6. , 2.2, 4. , 1. ],
        [6.1, 2.9, 4.7, 1.4],
        [5.6, 2.9, 3.6, 1.3],
        [6.7, 3.1, 4.4, 1.4],
        [5.6, 3. , 4.5, 1.5],
        [5.8, 2.7, 4.1, 1. ],
        [6.2, 2.2, 4.5, 1.5],
        [5.6, 2.5, 3.9, 1.1],
        [5.9, 3.2, 4.8, 1.8],
        [6.1, 2.8, 4. , 1.3],
        [6.3, 2.5, 4.9, 1.5],
        [6.1, 2.8, 4.7, 1.2],
        [6.4, 2.9, 4.3, 1.3],
        [6.6, 3. , 4.4, 1.4],
        [6.8, 2.8, 4.8, 1.4],
        [6.7, 3. , 5. , 1.7],
        [6. , 2.9, 4.5, 1.5],
        [5.7, 2.6, 3.5, 1. ],
        [5.5, 2.4, 3.8, 1.1],
        [5.5, 2.4, 3.7, 1. ],
        [5.8, 2.7, 3.9, 1.2],
        [6. , 2.7, 5.1, 1.6],
        [5.4, 3. , 4.5, 1.5],
        [6. , 3.4, 4.5, 1.6],
        [6.7, 3.1, 4.7, 1.5],
        [6.3, 2.3, 4.4, 1.3],
        [5.6, 3. , 4.1, 1.3],
        [5.5, 2.5, 4. , 1.3],
        [5.5, 2.6, 4.4, 1.2],
        [6.1, 3. , 4.6, 1.4],
        [5.8, 2.6, 4. , 1.2],
        [5. , 2.3, 3.3, 1. ],
        [5.6, 2.7, 4.2, 1.3],
        [5.7, 3. , 4.2, 1.2],
        [5.7, 2.9, 4.2, 1.3],
        [6.2, 2.9, 4.3, 1.3],
        [5.1, 2.5, 3. , 1.1],
        [5.7, 2.8, 4.1, 1.3],
        [6.3, 3.3, 6. , 2.5],
        [5.8, 2.7, 5.1, 1.9],
        [7.1, 3. , 5.9, 2.1],
        [6.3, 2.9, 5.6, 1.8],
        [6.5, 3. , 5.8, 2.2],
        [7.6, 3. , 6.6, 2.1],
        [4.9, 2.5, 4.5, 1.7],
        [7.3, 2.9, 6.3, 1.8],
        [6.7, 2.5, 5.8, 1.8],
        [7.2, 3.6, 6.1, 2.5],
        [6.5, 3.2, 5.1, 2. ],
        [6.4, 2.7, 5.3, 1.9],
        [6.8, 3. , 5.5, 2.1],
        [5.7, 2.5, 5. , 2. ],
        [5.8, 2.8, 5.1, 2.4],
        [6.4, 3.2, 5.3, 2.3],
        [6.5, 3. , 5.5, 1.8],
        [7.7, 3.8, 6.7, 2.2],
        [7.7, 2.6, 6.9, 2.3],
        [6. , 2.2, 5. , 1.5],
        [6.9, 3.2, 5.7, 2.3],
        [5.6, 2.8, 4.9, 2. ],
        [7.7, 2.8, 6.7, 2. ],
        [6.3, 2.7, 4.9, 1.8],
        [6.7, 3.3, 5.7, 2.1],
        [7.2, 3.2, 6. , 1.8],
        [6.2, 2.8, 4.8, 1.8],
        [6.1, 3. , 4.9, 1.8],
        [6.4, 2.8, 5.6, 2.1],
        [7.2, 3. , 5.8, 1.6],
        [7.4, 2.8, 6.1, 1.9],
        [7.9, 3.8, 6.4, 2. ],
        [6.4, 2.8, 5.6, 2.2],
        [6.3, 2.8, 5.1, 1.5],
        [6.1, 2.6, 5.6, 1.4],
        [7.7, 3. , 6.1, 2.3],
        [6.3, 3.4, 5.6, 2.4],
        [6.4, 3.1, 5.5, 1.8],
        [6. , 3. , 4.8, 1.8],
        [6.9, 3.1, 5.4, 2.1],
        [6.7, 3.1, 5.6, 2.4],
        [6.9, 3.1, 5.1, 2.3],
        [5.8, 2.7, 5.1, 1.9],
        [6.8, 3.2, 5.9, 2.3],
        [6.7, 3.3, 5.7, 2.5],
        [6.7, 3. , 5.2, 2.3],
        [6.3, 2.5, 5. , 1.9],
        [6.5, 3. , 5.2, 2. ],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 'frame': None,
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n:Number of Instances: 150 (50 in each of three classes)\n:Number of Attributes: 4 numeric, predictive attributes and the class\n:Attribute Information:\n    - sepal length in cm\n    - sepal width in cm\n    - petal length in cm\n    - petal width in cm\n    - class:\n            - Iris-Setosa\n            - Iris-Versicolour\n            - Iris-Virginica\n\n:Summary Statistics:\n\n============== ==== ==== ======= ===== ====================\n                Min  Max   Mean    SD   Class Correlation\n============== ==== ==== ======= ===== ====================\nsepal length:   4.3  7.9   5.84   0.83    0.7826\nsepal width:    2.0  4.4   3.05   0.43   -0.4194\npetal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\npetal width:    0.1  2.5   1.20   0.76    0.9565  (high!)\n============== ==== ==== ======= ===== ====================\n\n:Missing Attribute Values: None\n:Class Distribution: 33.3% for each of 3 classes.\n:Creator: R.A. Fisher\n:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n:Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n.. dropdown:: References\n\n  - Fisher, R.A. "The use of multiple measurements in taxonomic problems"\n    Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n    Mathematical Statistics" (John Wiley, NY, 1950).\n  - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n    (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n  - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n    Structure and Classification Rule for Recognition in Partially Exposed\n    Environments".  IEEE Transactions on Pattern Analysis and Machine\n    Intelligence, Vol. PAMI-2, No. 1, 67-71.\n  - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n    on Information Theory, May 1972, 431-433.\n  - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n    conceptual clustering system finds 3 classes in the data.\n  - Many, many more ...\n',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 'filename': 'iris.csv',
 'data_module': 'sklearn.datasets.data'}

Unfortunately, the iris dataset is not a data frame like the previous ones:

type(iris)
sklearn.utils._bunch.Bunch

This means that, if we want to apply all the methods that we are familiar with, we will need to convert this odd data type into the pandas dataframe format we know and love. We can do this by following this stackoverflow answer:

import pandas as pd

iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

As usual, we may want to see some descriptive measures to get a sense of the data:

iris_df.describe()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

As can be seen above, we do not have any missing data (all variables have 150 observations), but the scales (ranges) of the variables differ considerably (look at the min and max values of sepal length and compare them with petal width).

16.1.1 Normalisation

Some algorithms are sensitive to the scale of variables. For example, if the sepal widths were in meters and the other variables in cm then an algorithm may underweight sepal widths. This means that we will need to rescale our variables to be on the safe side. There are two ways to put variables onto the same scale: normalisation and standardisation.

  • Normalisation: rescales a dataset so that each value falls between 0 and 1. It uses the following formula to do so:

\[x_{new} = \frac{x_i - x_{min}}{x_{max} - x_{min}}\]

  • Standardisation: rescales a dataset to have a mean of 0 and a standard deviation of 1. It uses the following formula to do so:

\[x_{new} = \frac{x_i - \bar{x}}{s}\]

Which one should we use?

If you cannot choose between them, try it both ways and compare the results you get with the raw data, the normalised data and the standardised data.
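To see what the two rescalings do in practice, here is a minimal sketch using a made-up single-column array (not the iris data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])  # toy values, purely for illustration

# Normalisation: every value is squeezed into the range [0, 1]
MinMaxScaler().fit_transform(x).ravel()    # array([0.  , 0.25, 0.5 , 0.75, 1.  ])

# Standardisation: the values end up with mean 0 and standard deviation 1
StandardScaler().fit_transform(x).ravel()  # roughly [-1.41, -0.71, 0., 0.71, 1.41]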

These blog posts may help you: (Statology) Standardization vs. Normalization: What’s the Difference? and (Analytics Vidhya) Feature Engineering: Scaling, Normalization, and Standardization (Updated 2023).

In the code below we will normalise the data between 0 and 1 using MinMaxScaler’s .fit_transform() from sklearn.

from sklearn.preprocessing import MinMaxScaler
col_names = iris_df.columns
iris_df =  pd.DataFrame(MinMaxScaler().fit_transform(iris_df))
iris_df.columns = col_names # Column names were lost, so we need to re-introduce
iris_df
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 0.222222 0.625000 0.067797 0.041667
1 0.166667 0.416667 0.067797 0.041667
2 0.111111 0.500000 0.050847 0.041667
3 0.083333 0.458333 0.084746 0.041667
4 0.194444 0.666667 0.067797 0.041667
... ... ... ... ...
145 0.666667 0.416667 0.711864 0.916667
146 0.555556 0.208333 0.677966 0.750000
147 0.611111 0.416667 0.711864 0.791667
148 0.527778 0.583333 0.745763 0.916667
149 0.444444 0.416667 0.694915 0.708333

150 rows × 4 columns

Let’s see how the descriptive measures have changed after the transformation:

iris_df.describe()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 0.428704 0.440556 0.467458 0.458056
std 0.230018 0.181611 0.299203 0.317599
min 0.000000 0.000000 0.000000 0.000000
25% 0.222222 0.333333 0.101695 0.083333
50% 0.416667 0.416667 0.567797 0.500000
75% 0.583333 0.541667 0.694915 0.708333
max 1.000000 1.000000 1.000000 1.000000

Great.

Our dataset shows us the length and width of both the sepals (leaves) and petals of 150 plants. The dataset is quite famous and you can find a Wikipedia page with more details about it.

Questions

To motivate our exploration of the data, consider the sorts of questions we can ask:

  • Are all our plants from the same species?
  • Do some plants have similar leaf and petal sizes?
  • Can we differentiate between the plants using all 4 variables (dimensions)?
  • Do we need to include both length and width, or can we reduce these dimensions and simplify our analysis?

16.1.2 Initial exploration

We can explore a dataset with few variables using plots.

16.1.2.1 Distribution (variability and density)

We’d like to see each variable’s distribution in terms of variability and density. We have seen several ways to do this, but in this case we will be using a new plot type (a violin plot) to visualise the distributions.

To do so, we will be using seaborn’s violinplot(). Because we want to create a single plot with one violin per variable, we will need to transform our data from a wide to a long format.

Explanation of Violin plot. (Chambers 2017)
Wide vs long data

A dataset can be written in two different formats: wide and long.

  • wide: every row is a unique observation, where columns are variables (or attributes) describing the observation.
  • long: single observations are split into multiple rows. Usually, the first column contains the index to the observation, and there are two more columns: the name of the variable, and the actual value of the variable.

Wide vs long data (Source: Statology)
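As a tiny illustration (with a made-up two-column table, not the iris data), pandas’ melt() is what converts a wide table into a long one:

import pandas as pd

wide = pd.DataFrame({'height': [10, 20], 'width': [3, 4]})  # wide: one row per observation
wide.melt()                                                 # long: one row per (variable, value) pair
#   variable  value
# 0   height     10
# 1   height     20
# 2    width      3
# 3    width      4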
import seaborn as sns

# some plots require a long dataframe structure
iris_df_long = iris_df.melt()
iris_df_long
variable value
0 sepal length (cm) 0.222222
1 sepal length (cm) 0.166667
2 sepal length (cm) 0.111111
3 sepal length (cm) 0.083333
4 sepal length (cm) 0.194444
... ... ...
595 petal width (cm) 0.916667
596 petal width (cm) 0.750000
597 petal width (cm) 0.791667
598 petal width (cm) 0.916667
599 petal width (cm) 0.708333

600 rows × 2 columns

And now that we have transformed our data into a long data format, we can create the visualisation:

sns.violinplot(data = iris_df_long, x = 'variable', y = 'value')

16.1.2.2 Correlations

The below plots use the wide data structure.

iris_df
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 0.222222 0.625000 0.067797 0.041667
1 0.166667 0.416667 0.067797 0.041667
2 0.111111 0.500000 0.050847 0.041667
3 0.083333 0.458333 0.084746 0.041667
4 0.194444 0.666667 0.067797 0.041667
... ... ... ... ...
145 0.666667 0.416667 0.711864 0.916667
146 0.555556 0.208333 0.677966 0.750000
147 0.611111 0.416667 0.711864 0.791667
148 0.527778 0.583333 0.745763 0.916667
149 0.444444 0.416667 0.694915 0.708333

150 rows × 4 columns

sns.scatterplot(data = iris_df, x = 'sepal length (cm)', y = 'sepal width (cm)')

sns.scatterplot(data = iris_df, x = 'sepal length (cm)', y = 'petal length (cm)')

Interesting. There seem to be two groupings in the data.

It might be easier to look at all the variables at once.

sns.pairplot(iris_df)

There seem to be some groupings in the data, though we cannot easily identify which point corresponds to which row.

16.2 Clustering

A cluster is simply a group based on similarity. There are several clustering methods; we will use a relatively simple one called K-means clustering.

In K-means clustering an algorithm tries to group our items (plants in the iris dataset) based on similarity. We decide how many groups we want and the algorithm does the best it can (an accessible introduction to k-means clustering is here).
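To give a feel for what the algorithm does, here is a minimal NumPy sketch of the two steps k-means repeats: assign each point to its nearest centre, then move each centre to the mean of its assigned points. This is only an illustration under simplifying assumptions (fixed number of iterations, no handling of empty clusters), not how sklearn’s KMeans is implemented:

import numpy as np

def simple_kmeans(X, k, n_iters = 10, seed = 0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen rows as the initial centres
    centres = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centre for every row
        distances = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis = 2)
        labels = distances.argmin(axis = 1)
        # Update step: move each centre to the mean of its assigned rows
        centres = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
    return labels, centres

# e.g. simple_kmeans(iris_df.values, 3) would give one label per plant

Real implementations add smarter initialisation (e.g. k-means++) and convergence checks; that is what sklearn gives us below.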

To start, we import the KMeans class from sklearn’s cluster module and turn our data into a matrix.

from sklearn.cluster import KMeans

iris = iris_df.values
iris
array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.30555556, 0.70833333, 0.08474576, 0.04166667],
       [0.13888889, 0.58333333, 0.10169492, 0.04166667],
       [0.13888889, 0.41666667, 0.06779661, 0.        ],
       [0.        , 0.41666667, 0.01694915, 0.        ],
       [0.41666667, 0.83333333, 0.03389831, 0.04166667],
       [0.38888889, 1.        , 0.08474576, 0.125     ],
       [0.30555556, 0.79166667, 0.05084746, 0.125     ],
       [0.22222222, 0.625     , 0.06779661, 0.08333333],
       [0.38888889, 0.75      , 0.11864407, 0.08333333],
       [0.22222222, 0.75      , 0.08474576, 0.08333333],
       [0.30555556, 0.58333333, 0.11864407, 0.04166667],
       [0.22222222, 0.70833333, 0.08474576, 0.125     ],
       [0.08333333, 0.66666667, 0.        , 0.04166667],
       [0.22222222, 0.54166667, 0.11864407, 0.16666667],
       [0.13888889, 0.58333333, 0.15254237, 0.04166667],
       [0.19444444, 0.41666667, 0.10169492, 0.04166667],
       [0.19444444, 0.58333333, 0.10169492, 0.125     ],
       [0.25      , 0.625     , 0.08474576, 0.04166667],
       [0.25      , 0.58333333, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.10169492, 0.04166667],
       [0.13888889, 0.45833333, 0.10169492, 0.04166667],
       [0.30555556, 0.58333333, 0.08474576, 0.125     ],
       [0.25      , 0.875     , 0.08474576, 0.        ],
       [0.33333333, 0.91666667, 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.5       , 0.03389831, 0.04166667],
       [0.33333333, 0.625     , 0.05084746, 0.04166667],
       [0.16666667, 0.66666667, 0.06779661, 0.        ],
       [0.02777778, 0.41666667, 0.05084746, 0.04166667],
       [0.22222222, 0.58333333, 0.08474576, 0.04166667],
       [0.19444444, 0.625     , 0.05084746, 0.08333333],
       [0.05555556, 0.125     , 0.05084746, 0.08333333],
       [0.02777778, 0.5       , 0.05084746, 0.04166667],
       [0.19444444, 0.625     , 0.10169492, 0.20833333],
       [0.22222222, 0.75      , 0.15254237, 0.125     ],
       [0.13888889, 0.41666667, 0.06779661, 0.08333333],
       [0.22222222, 0.75      , 0.10169492, 0.04166667],
       [0.08333333, 0.5       , 0.06779661, 0.04166667],
       [0.27777778, 0.70833333, 0.08474576, 0.04166667],
       [0.19444444, 0.54166667, 0.06779661, 0.04166667],
       [0.75      , 0.5       , 0.62711864, 0.54166667],
       [0.58333333, 0.5       , 0.59322034, 0.58333333],
       [0.72222222, 0.45833333, 0.66101695, 0.58333333],
       [0.33333333, 0.125     , 0.50847458, 0.5       ],
       [0.61111111, 0.33333333, 0.61016949, 0.58333333],
       [0.38888889, 0.33333333, 0.59322034, 0.5       ],
       [0.55555556, 0.54166667, 0.62711864, 0.625     ],
       [0.16666667, 0.16666667, 0.38983051, 0.375     ],
       [0.63888889, 0.375     , 0.61016949, 0.5       ],
       [0.25      , 0.29166667, 0.49152542, 0.54166667],
       [0.19444444, 0.        , 0.42372881, 0.375     ],
       [0.44444444, 0.41666667, 0.54237288, 0.58333333],
       [0.47222222, 0.08333333, 0.50847458, 0.375     ],
       [0.5       , 0.375     , 0.62711864, 0.54166667],
       [0.36111111, 0.375     , 0.44067797, 0.5       ],
       [0.66666667, 0.45833333, 0.57627119, 0.54166667],
       [0.36111111, 0.41666667, 0.59322034, 0.58333333],
       [0.41666667, 0.29166667, 0.52542373, 0.375     ],
       [0.52777778, 0.08333333, 0.59322034, 0.58333333],
       [0.36111111, 0.20833333, 0.49152542, 0.41666667],
       [0.44444444, 0.5       , 0.6440678 , 0.70833333],
       [0.5       , 0.33333333, 0.50847458, 0.5       ],
       [0.55555556, 0.20833333, 0.66101695, 0.58333333],
       [0.5       , 0.33333333, 0.62711864, 0.45833333],
       [0.58333333, 0.375     , 0.55932203, 0.5       ],
       [0.63888889, 0.41666667, 0.57627119, 0.54166667],
       [0.69444444, 0.33333333, 0.6440678 , 0.54166667],
       [0.66666667, 0.41666667, 0.6779661 , 0.66666667],
       [0.47222222, 0.375     , 0.59322034, 0.58333333],
       [0.38888889, 0.25      , 0.42372881, 0.375     ],
       [0.33333333, 0.16666667, 0.47457627, 0.41666667],
       [0.33333333, 0.16666667, 0.45762712, 0.375     ],
       [0.41666667, 0.29166667, 0.49152542, 0.45833333],
       [0.47222222, 0.29166667, 0.69491525, 0.625     ],
       [0.30555556, 0.41666667, 0.59322034, 0.58333333],
       [0.47222222, 0.58333333, 0.59322034, 0.625     ],
       [0.66666667, 0.45833333, 0.62711864, 0.58333333],
       [0.55555556, 0.125     , 0.57627119, 0.5       ],
       [0.36111111, 0.41666667, 0.52542373, 0.5       ],
       [0.33333333, 0.20833333, 0.50847458, 0.5       ],
       [0.33333333, 0.25      , 0.57627119, 0.45833333],
       [0.5       , 0.41666667, 0.61016949, 0.54166667],
       [0.41666667, 0.25      , 0.50847458, 0.45833333],
       [0.19444444, 0.125     , 0.38983051, 0.375     ],
       [0.36111111, 0.29166667, 0.54237288, 0.5       ],
       [0.38888889, 0.41666667, 0.54237288, 0.45833333],
       [0.38888889, 0.375     , 0.54237288, 0.5       ],
       [0.52777778, 0.375     , 0.55932203, 0.5       ],
       [0.22222222, 0.20833333, 0.33898305, 0.41666667],
       [0.38888889, 0.33333333, 0.52542373, 0.5       ],
       [0.55555556, 0.54166667, 0.84745763, 1.        ],
       [0.41666667, 0.29166667, 0.69491525, 0.75      ],
       [0.77777778, 0.41666667, 0.83050847, 0.83333333],
       [0.55555556, 0.375     , 0.77966102, 0.70833333],
       [0.61111111, 0.41666667, 0.81355932, 0.875     ],
       [0.91666667, 0.41666667, 0.94915254, 0.83333333],
       [0.16666667, 0.20833333, 0.59322034, 0.66666667],
       [0.83333333, 0.375     , 0.89830508, 0.70833333],
       [0.66666667, 0.20833333, 0.81355932, 0.70833333],
       [0.80555556, 0.66666667, 0.86440678, 1.        ],
       [0.61111111, 0.5       , 0.69491525, 0.79166667],
       [0.58333333, 0.29166667, 0.72881356, 0.75      ],
       [0.69444444, 0.41666667, 0.76271186, 0.83333333],
       [0.38888889, 0.20833333, 0.6779661 , 0.79166667],
       [0.41666667, 0.33333333, 0.69491525, 0.95833333],
       [0.58333333, 0.5       , 0.72881356, 0.91666667],
       [0.61111111, 0.41666667, 0.76271186, 0.70833333],
       [0.94444444, 0.75      , 0.96610169, 0.875     ],
       [0.94444444, 0.25      , 1.        , 0.91666667],
       [0.47222222, 0.08333333, 0.6779661 , 0.58333333],
       [0.72222222, 0.5       , 0.79661017, 0.91666667],
       [0.36111111, 0.33333333, 0.66101695, 0.79166667],
       [0.94444444, 0.33333333, 0.96610169, 0.79166667],
       [0.55555556, 0.29166667, 0.66101695, 0.70833333],
       [0.66666667, 0.54166667, 0.79661017, 0.83333333],
       [0.80555556, 0.5       , 0.84745763, 0.70833333],
       [0.52777778, 0.33333333, 0.6440678 , 0.70833333],
       [0.5       , 0.41666667, 0.66101695, 0.70833333],
       [0.58333333, 0.33333333, 0.77966102, 0.83333333],
       [0.80555556, 0.41666667, 0.81355932, 0.625     ],
       [0.86111111, 0.33333333, 0.86440678, 0.75      ],
       [1.        , 0.75      , 0.91525424, 0.79166667],
       [0.58333333, 0.33333333, 0.77966102, 0.875     ],
       [0.55555556, 0.33333333, 0.69491525, 0.58333333],
       [0.5       , 0.25      , 0.77966102, 0.54166667],
       [0.94444444, 0.41666667, 0.86440678, 0.91666667],
       [0.55555556, 0.58333333, 0.77966102, 0.95833333],
       [0.58333333, 0.45833333, 0.76271186, 0.70833333],
       [0.47222222, 0.41666667, 0.6440678 , 0.70833333],
       [0.72222222, 0.45833333, 0.74576271, 0.83333333],
       [0.66666667, 0.45833333, 0.77966102, 0.95833333],
       [0.72222222, 0.45833333, 0.69491525, 0.91666667],
       [0.41666667, 0.29166667, 0.69491525, 0.75      ],
       [0.69444444, 0.5       , 0.83050847, 0.91666667],
       [0.66666667, 0.54166667, 0.79661017, 1.        ],
       [0.66666667, 0.41666667, 0.71186441, 0.91666667],
       [0.55555556, 0.20833333, 0.6779661 , 0.75      ],
       [0.61111111, 0.41666667, 0.71186441, 0.79166667],
       [0.52777778, 0.58333333, 0.74576271, 0.91666667],
       [0.44444444, 0.41666667, 0.69491525, 0.70833333]])

16.2.1 Specify our number of clusters

IMPORTANT: Check if your data features are standardised/normalised!!

Before you apply techniques such as PCA, clustering or other feature embedding techniques (such as t-SNE, MDS, etc.), it is very important to make sure that the data features that go into these techniques are normalised/standardised:

  • you can bring the value ranges between 0 and 1 for all of them with a MinMax scaling operation - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
  • you can standardise the values to have a mean of 0 and a standard deviation of 1, aka z-score standardisation - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
  • or make use of more specific normalisation operators that might be more suitable for a particular context. The scikit-learn collection is a good place to look for alternatives - https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

These operations ensure that the results are not biased/skewed/dominated by inherent characteristics of the data that are simply due to the domain of values.

Scikit-learn has a very nice tutorial here: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

Do-this-yourself: Check whether we need to do any normalisation in this case.

We have already looked at how the data looks and what the descriptive statistics look like; see if we need to do anything more.

k_means = KMeans(n_clusters = 3, init = 'random',  n_init = 10)

Fit our kmeans model to the data

k_means.fit(iris)
KMeans(init='random', n_clusters=3, n_init=10)

The algorithm has assigned a label to each row:

k_means.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)

Each row has been assigned a label.

To tidy things up we should put everything into a dataframe.

iris_df['Three clusters'] = pd.Series(k_means.predict(iris_df.values), index = iris_df.index)
iris_df
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Three clusters
0 0.222222 0.625000 0.067797 0.041667 1
1 0.166667 0.416667 0.067797 0.041667 1
2 0.111111 0.500000 0.050847 0.041667 1
3 0.083333 0.458333 0.084746 0.041667 1
4 0.194444 0.666667 0.067797 0.041667 1
... ... ... ... ... ...
145 0.666667 0.416667 0.711864 0.916667 2
146 0.555556 0.208333 0.677966 0.750000 0
147 0.611111 0.416667 0.711864 0.791667 2
148 0.527778 0.583333 0.745763 0.916667 2
149 0.444444 0.416667 0.694915 0.708333 0

150 rows × 5 columns

sns.pairplot(iris_df, hue = 'Three clusters')

That seems quite nice. We can also do individual plots if preferred.

sns.scatterplot(data = iris_df, x = 'sepal length (cm)', y = 'petal width (cm)', hue = 'Three clusters')

K-means works by clustering the data around central points (often called centroids, means or cluster centers). We can extract the cluster centres from the kmeans object.

k_means.cluster_centers_
array([[0.44125683, 0.30737705, 0.57571548, 0.54918033],
       [0.19611111, 0.595     , 0.07830508, 0.06083333],
       [0.70726496, 0.4508547 , 0.79704476, 0.82478632]])

It is tricky to plot these using seaborn, but we can use a normal matplotlib scatter plot.

Let us grab the groups.

group1 = iris_df[iris_df['Three clusters'] == 0]
group2 = iris_df[iris_df['Three clusters'] == 1]
group3 = iris_df[iris_df['Three clusters'] == 2]

Grab the centroids

import pandas as pd

centres = k_means.cluster_centers_

# x = sepal length (column 0), y = petal width (column 3),
# matching the axes used in the scatter plots below
data = {'x': [centres[0][0], centres[1][0], centres[2][0]],
        'y': [centres[0][3], centres[1][3], centres[2][3]]}

df = pd.DataFrame(data, columns = ['x', 'y'])

Create the plot

import matplotlib.pyplot as plt

# Plot each group individually
plt.scatter(
    x = group1['sepal length (cm)'], 
    y = group1['petal width (cm)'], 
    alpha = 0.1, color = 'blue'
)

plt.scatter(
    x = group2['sepal length (cm)'], 
    y = group2['petal width (cm)'], 
    alpha = 0.1, color = 'orange'
)

plt.scatter(
    x = group3['sepal length (cm)'], 
    y = group3['petal width (cm)'], 
    alpha = 0.1, color = 'red'
)

# Plot cluster centres
plt.scatter(
    x = df['x'], 
    y = df['y'], 
    alpha = 1, color = 'black'
)

16.2.2 Changing the number of clusters

What happens if we change the number of clusters?

Two groups

k_means_2 = KMeans(n_clusters = 2, init = 'random', n_init = 10)
k_means_2.fit(iris)
iris_df['Two clusters'] = pd.Series(k_means_2.predict(iris_df.iloc[:,0:4].values), index = iris_df.index)

Note that I have added a new column to the iris dataframe called ‘Two clusters’ and passed only our original 4 columns to the predict function (hence the use of .iloc[:,0:4]).

How do our groupings look now (without plotting the cluster column)?

sns.pairplot(iris_df.loc[:, iris_df.columns != 'Three clusters'], hue = 'Two clusters')

Hmm, does the data have more than two groups in it?

Perhaps we should try 5 clusters instead.

k_means_5 = KMeans(n_clusters = 5, init = 'random', n_init = 10)
k_means_5.fit(iris)
iris_df['Five clusters'] = pd.Series(k_means_5.predict(iris_df.iloc[:,0:4].values), index = iris_df.index)

Plot without the columns called ‘Three clusters’ and ‘Two clusters’

sns.pairplot(iris_df.loc[:, (iris_df.columns != 'Three clusters') & (iris_df.columns != 'Two clusters')], hue = 'Five clusters')

iris_df
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Three clusters Two clusters Five clusters
0 0.222222 0.625000 0.067797 0.041667 1 1 4
1 0.166667 0.416667 0.067797 0.041667 1 1 1
2 0.111111 0.500000 0.050847 0.041667 1 1 1
3 0.083333 0.458333 0.084746 0.041667 1 1 1
4 0.194444 0.666667 0.067797 0.041667 1 1 4
... ... ... ... ... ... ... ...
145 0.666667 0.416667 0.711864 0.916667 2 0 0
146 0.555556 0.208333 0.677966 0.750000 0 0 3
147 0.611111 0.416667 0.711864 0.791667 2 0 3
148 0.527778 0.583333 0.745763 0.916667 2 0 0
149 0.444444 0.416667 0.694915 0.708333 0 0 3

150 rows × 7 columns

Which did best?

k_means.inertia_
6.982216473785234
k_means_2.inertia_
12.127790750538193
k_means_5.inertia_
4.58977540011789

It looks like our k = 5 model captures the data best, at least by this measure. Inertia is described in the sklearn documentation as the “Sum of squared distances of samples to their closest cluster center”.
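We can check this definition by hand (a quick sketch, reusing the fitted three-cluster k_means object from above): summing the squared distances from each row to its assigned cluster centre should reproduce .inertia_.

# Centre assigned to each row, then the sum of squared distances to those centres
assigned_centres = k_means.cluster_centers_[k_means.labels_]
((iris - assigned_centres) ** 2).sum()   # should match k_means.inertia_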

If you want to dive further into this then Real Python’s practical guide to K-Means Clustering is quite good.

16.3 Principal Component Analysis (PCA)

PCA reduces the dimensionality of our data. The method derives points in an n-dimensional space from our data, where the new dimensions (components) are uncorrelated with each other.

Let us carry out a PCA on our Iris dataset, keeping only two dimensions (components).

from sklearn.decomposition import PCA

n_components = 2

pca = PCA(n_components=n_components)
iris_pca = pca.fit(iris_df.iloc[:,0:4])

We can look at the components.

iris_pca.components_
array([[ 0.42494212, -0.15074824,  0.61626702,  0.64568888],
       [ 0.42320271,  0.90396711, -0.06038308, -0.00983925]])

These components are interesting. You may want to look at a PennState article on interpreting PCA components.

Our second column, ‘sepal width (cm)’, is positively correlated with our second principal component, whereas the first column, ‘sepal length (cm)’, is positively correlated with both.
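The loadings are easier to read if we attach the feature names to them (a small convenience sketch; the loadings name is just an example):

loadings = pd.DataFrame(
    iris_pca.components_,
    columns = iris_df.columns[0:4],           # the four original measurements
    index = ['component 1', 'component 2']
)
loadings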

You may want to consider:

  • Do we need more than two components?
  • Is it useful to keep sepal length (cm) in the dataset?

We can also examine the explained variance of each principal component.

iris_pca.explained_variance_
array([0.23245325, 0.0324682 ])

A nice worked example showing the link between the explained variance and the component is here.

Our first principal component explains a lot more of the variance of the data than the second.

Another way to explore these indicators is to look at the explained_variance_ratio_ values. These present similar information, but as proportions of the total variance, so they are easier to interpret. You can also create a plot and see how these proportions add up. In this case, the first two components add up to about 0.96, which means they are able to represent around 96% of the variation in the data, not bad. These values are not always this high.

A value close to 100% means that the retained components capture most of the variance, so they will be good representations of the data without losing much of the variance in the underlying features. This is of course based on the assumption that variance is a good proxy for how informative a feature is.

iris_pca.explained_variance_ratio_
array([0.84136038, 0.11751808])
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
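As a quick sanity check (a sketch of the relationship, assuming the normalised features are still in columns 0 to 3), each ratio is just the component’s explained variance divided by the total variance of the four scaled features:

total_variance = iris_df.iloc[:, 0:4].var().sum()
iris_pca.explained_variance_ / total_variance   # roughly array([0.84, 0.12])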

16.3.1 Dimension reduction

For our purposes, we are interested in using PCA to reduce the number of dimensions in our data whilst preserving the maximal data variance.

We can extract the projected components from the model.

iris_pca_vals = pca.fit_transform(iris_df.iloc[:,0:4])

The numpy array contains the projected values.

type(iris_pca_vals)
numpy.ndarray
iris_pca_vals
array([[-6.30702931e-01,  1.07577910e-01],
       [-6.22904943e-01, -1.04259833e-01],
       [-6.69520395e-01, -5.14170597e-02],
       [-6.54152759e-01, -1.02884871e-01],
       [-6.48788056e-01,  1.33487576e-01],
       [-5.35272778e-01,  2.89615724e-01],
       [-6.56537790e-01,  1.07244911e-02],
       [-6.25780499e-01,  5.71335411e-02],
       [-6.75643504e-01, -2.00703283e-01],
       [-6.45644619e-01, -6.72080097e-02],
       [-5.97408238e-01,  2.17151953e-01],
       [-6.38943190e-01,  3.25988375e-02],
       [-6.61612593e-01, -1.15605495e-01],
       [-7.51967943e-01, -1.71313322e-01],
       [-6.00371589e-01,  3.80240692e-01],
       [-5.52157227e-01,  5.15255982e-01],
       [-5.77053593e-01,  2.93709492e-01],
       [-6.03799228e-01,  1.07167941e-01],
       [-5.20483461e-01,  2.87627289e-01],
       [-6.12197555e-01,  2.19140388e-01],
       [-5.57674300e-01,  1.02109180e-01],
       [-5.79012675e-01,  1.81065123e-01],
       [-7.37784662e-01,  9.05588211e-02],
       [-5.06093857e-01,  2.79470846e-02],
       [-6.07607579e-01,  2.95285112e-02],
       [-5.90210587e-01, -9.45510863e-02],
       [-5.61527888e-01,  5.52901611e-02],
       [-6.08453780e-01,  1.18310099e-01],
       [-6.12617807e-01,  8.16682448e-02],
       [-6.38184784e-01, -5.44873860e-02],
       [-6.20099660e-01, -8.03970516e-02],
       [-5.24757301e-01,  1.03336126e-01],
       [-6.73044544e-01,  3.44711846e-01],
       [-6.27455379e-01,  4.18257508e-01],
       [-6.18740916e-01, -6.76179787e-02],
       [-6.44553756e-01, -1.51267253e-02],
       [-5.93932344e-01,  1.55623876e-01],
       [-6.87495707e-01,  1.22141914e-01],
       [-6.92369885e-01, -1.62014545e-01],
       [-6.13976551e-01,  6.88891719e-02],
       [-6.26048380e-01,  9.64357527e-02],
       [-6.09693996e-01, -4.14325957e-01],
       [-7.04932239e-01, -8.66839521e-02],
       [-5.14001659e-01,  9.21355196e-02],
       [-5.43513037e-01,  2.14636651e-01],
       [-6.07805187e-01, -1.16425433e-01],
       [-6.28656055e-01,  2.18526915e-01],
       [-6.70879139e-01, -6.41961326e-02],
       [-6.09212186e-01,  2.05396323e-01],
       [-6.29944525e-01,  2.04916869e-02],
       [ 2.79951766e-01,  1.79245790e-01],
       [ 2.15141376e-01,  1.10348921e-01],
       [ 3.22223106e-01,  1.27368010e-01],
       [ 5.94030131e-02, -3.28502275e-01],
       [ 2.62515235e-01, -2.95800761e-02],
       [ 1.03831043e-01, -1.21781742e-01],
       [ 2.44850362e-01,  1.33801733e-01],
       [-1.71529386e-01, -3.52976762e-01],
       [ 2.14230599e-01,  2.06607890e-02],
       [ 1.53249619e-02, -2.12494509e-01],
       [-1.13710323e-01, -4.93929201e-01],
       [ 1.37348380e-01, -2.06894998e-02],
       [ 4.39928190e-02, -3.06159511e-01],
       [ 1.92559767e-01, -3.95507760e-02],
       [-8.26091518e-03, -8.66610981e-02],
       [ 2.19485489e-01,  1.09383928e-01],
       [ 1.33272148e-01, -5.90267184e-02],
       [-5.75757060e-04, -1.42367733e-01],
       [ 2.54345249e-01, -2.89815304e-01],
       [-5.60800300e-03, -2.39572672e-01],
       [ 2.68168358e-01,  4.72705335e-02],
       [ 9.88208151e-02, -6.96420088e-02],
       [ 2.89086481e-01, -1.69157553e-01],
       [ 1.45033538e-01, -7.63961345e-02],
       [ 1.59287093e-01,  2.19853643e-04],
       [ 2.13962718e-01,  5.99630005e-02],
       [ 2.91913782e-01,  4.04990109e-03],
       [ 3.69148997e-01,  6.43480720e-02],
       [ 1.86769115e-01, -4.96694916e-02],
       [-6.87697501e-02, -1.85648007e-01],
       [-2.15759776e-02, -2.87970157e-01],
       [-5.89248844e-02, -2.86536746e-01],
       [ 3.23412419e-02, -1.41140786e-01],
       [ 2.88906394e-01, -1.31550706e-01],
       [ 1.09664252e-01, -8.25379800e-02],
       [ 1.82266934e-01,  1.38247021e-01],
       [ 2.77724803e-01,  1.05903632e-01],
       [ 1.95615410e-01, -2.38550997e-01],
       [ 3.76839264e-02, -5.41130122e-02],
       [ 4.68406593e-02, -2.53171683e-01],
       [ 5.54365941e-02, -2.19190186e-01],
       [ 1.75833387e-01, -8.62037590e-04],
       [ 4.90676225e-02, -1.79829525e-01],
       [-1.53444261e-01, -3.78886428e-01],
       [ 6.69726607e-02, -1.68132343e-01],
       [ 3.30293747e-02, -4.29708545e-02],
       [ 6.62142547e-02, -8.10461198e-02],
       [ 1.35679197e-01, -2.32914079e-02],
       [-1.58634575e-01, -2.89139847e-01],
       [ 6.20502279e-02, -1.17687974e-01],
       [ 6.22771338e-01,  1.16807265e-01],
       [ 3.46009609e-01, -1.56291874e-01],
       [ 6.17986434e-01,  1.00519741e-01],
       [ 4.17789309e-01, -2.68903690e-02],
       [ 5.63621248e-01,  3.05994289e-02],
       [ 7.50122599e-01,  1.52133800e-01],
       [ 1.35857804e-01, -3.30462554e-01],
       [ 6.08945212e-01,  8.35018443e-02],
       [ 5.11020215e-01, -1.32575915e-01],
       [ 7.20608541e-01,  3.34580389e-01],
       [ 4.24135062e-01,  1.13914054e-01],
       [ 4.37723702e-01, -8.78049736e-02],
       [ 5.40793776e-01,  6.93466165e-02],
       [ 3.63226514e-01, -2.42764625e-01],
       [ 4.74246948e-01, -1.20676423e-01],
       [ 5.13932631e-01,  9.88816323e-02],
       [ 4.24670824e-01,  3.53096310e-02],
       [ 7.49026039e-01,  4.63778390e-01],
       [ 8.72194272e-01,  9.33798117e-03],
       [ 2.82963372e-01, -3.18443776e-01],
       [ 6.14733184e-01,  1.53566018e-01],
       [ 3.22133832e-01, -1.40500924e-01],
       [ 7.58030401e-01,  8.79453649e-02],
       [ 3.57235237e-01, -9.50568671e-02],
       [ 5.31036706e-01,  1.68539991e-01],
       [ 5.46962123e-01,  1.87812429e-01],
       [ 3.28704908e-01, -6.81237595e-02],
       [ 3.14783811e-01, -5.57223965e-03],
       [ 5.16585543e-01, -5.40299414e-02],
       [ 4.84826663e-01,  1.15348658e-01],
       [ 6.33043632e-01,  5.92290940e-02],
       [ 6.87490917e-01,  4.91179916e-01],
       [ 5.43489246e-01, -5.44399104e-02],
       [ 2.91133358e-01, -5.82085481e-02],
       [ 3.05410131e-01, -1.61757644e-01],
       [ 7.63507935e-01,  1.68186703e-01],
       [ 5.47805644e-01,  1.58976299e-01],
       [ 4.06585699e-01,  6.12192966e-02],
       [ 2.92534659e-01, -1.63044284e-02],
       [ 5.35871344e-01,  1.19790986e-01],
       [ 6.13864965e-01,  9.30029331e-02],
       [ 5.58343139e-01,  1.22041374e-01],
       [ 3.46009609e-01, -1.56291874e-01],
       [ 6.23819644e-01,  1.39763503e-01],
       [ 6.38651518e-01,  1.66900115e-01],
       [ 5.51461624e-01,  5.98413741e-02],
       [ 4.07146497e-01, -1.71820871e-01],
       [ 4.47142619e-01,  3.75600193e-02],
       [ 4.88207585e-01,  1.49677521e-01],
       [ 3.12066323e-01, -3.11303854e-02]])

Each row corresponds to a row in our data.

iris_pca_vals.shape
(150, 2)
iris_df.shape
(150, 7)

We can add the components to our dataset. I prefer to keep everything in one table, but it is not at all required; you can assign the values to whichever variables you prefer.

iris_df['c1'] = [item[0] for item in iris_pca_vals]
iris_df['c2'] = [item[1] for item in iris_pca_vals]
iris_df
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Three clusters Two clusters Five clusters c1 c2
0 0.222222 0.625000 0.067797 0.041667 1 1 4 -0.630703 0.107578
1 0.166667 0.416667 0.067797 0.041667 1 1 1 -0.622905 -0.104260
2 0.111111 0.500000 0.050847 0.041667 1 1 1 -0.669520 -0.051417
3 0.083333 0.458333 0.084746 0.041667 1 1 1 -0.654153 -0.102885
4 0.194444 0.666667 0.067797 0.041667 1 1 4 -0.648788 0.133488
... ... ... ... ... ... ... ... ... ...
145 0.666667 0.416667 0.711864 0.916667 2 0 0 0.551462 0.059841
146 0.555556 0.208333 0.677966 0.750000 0 0 3 0.407146 -0.171821
147 0.611111 0.416667 0.711864 0.791667 2 0 3 0.447143 0.037560
148 0.527778 0.583333 0.745763 0.916667 2 0 0 0.488208 0.149678
149 0.444444 0.416667 0.694915 0.708333 0 0 3 0.312066 -0.031130

150 rows × 9 columns
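The list comprehensions above work fine; an equivalent, slightly more direct option (purely a style choice) is to slice the columns of the array:

iris_df['c1'] = iris_pca_vals[:, 0]
iris_df['c2'] = iris_pca_vals[:, 1]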

Plotting out our data in our new two-component space.

sns.scatterplot(data = iris_df, x = 'c1', y = 'c2')

We have reduced our four dimensions to two.

We can also colour by our clusters. What does this show us and is it useful?

sns.scatterplot(data = iris_df, x = 'c1', y = 'c2', hue = 'Three clusters')

iris_df
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Three clusters Two clusters Five clusters c1 c2
0 0.222222 0.625000 0.067797 0.041667 1 1 4 -0.630703 0.107578
1 0.166667 0.416667 0.067797 0.041667 1 1 1 -0.622905 -0.104260
2 0.111111 0.500000 0.050847 0.041667 1 1 1 -0.669520 -0.051417
3 0.083333 0.458333 0.084746 0.041667 1 1 1 -0.654153 -0.102885
4 0.194444 0.666667 0.067797 0.041667 1 1 4 -0.648788 0.133488
... ... ... ... ... ... ... ... ... ...
145 0.666667 0.416667 0.711864 0.916667 2 0 0 0.551462 0.059841
146 0.555556 0.208333 0.677966 0.750000 0 0 3 0.407146 -0.171821
147 0.611111 0.416667 0.711864 0.791667 2 0 3 0.447143 0.037560
148 0.527778 0.583333 0.745763 0.916667 2 0 0 0.488208 0.149678
149 0.444444 0.416667 0.694915 0.708333 0 0 3 0.312066 -0.031130

150 rows × 9 columns

16.3.2 PCA to Clusters

We have reduced our 4D dataset to 2D whilst keeping most of the data variance. Reducing the data to fewer dimensions can help with the ‘curse of dimensionality’, reduce the chance of overfitting a machine learning model (see here) and reduce the computational complexity of a model fit.

Putting our new dimensions into a kMeans model

k_means_pca = KMeans(n_clusters = 3, init = 'random', n_init = 10)
# Fit on the last two columns of the dataframe, i.e. the PCA components c1 and c2
iris_pca_kmeans = k_means_pca.fit(iris_df.iloc[:,-2:])
type(iris_df.iloc[:,-2:].values)
numpy.ndarray
iris_df['PCA 3 clusters'] = pd.Series(k_means_pca.predict(iris_df.iloc[:,-2:].values), index = iris_df.index)
iris_df
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Three clusters Two clusters Five clusters c1 c2 PCA 3 clusters
0 0.222222 0.625000 0.067797 0.041667 1 1 4 -0.630703 0.107578 2
1 0.166667 0.416667 0.067797 0.041667 1 1 1 -0.622905 -0.104260 2
2 0.111111 0.500000 0.050847 0.041667 1 1 1 -0.669520 -0.051417 2
3 0.083333 0.458333 0.084746 0.041667 1 1 1 -0.654153 -0.102885 2
4 0.194444 0.666667 0.067797 0.041667 1 1 4 -0.648788 0.133488 2
... ... ... ... ... ... ... ... ... ... ...
145 0.666667 0.416667 0.711864 0.916667 2 0 0 0.551462 0.059841 0
146 0.555556 0.208333 0.677966 0.750000 0 0 3 0.407146 -0.171821 1
147 0.611111 0.416667 0.711864 0.791667 2 0 3 0.447143 0.037560 0
148 0.527778 0.583333 0.745763 0.916667 2 0 0 0.488208 0.149678 0
149 0.444444 0.416667 0.694915 0.708333 0 0 3 0.312066 -0.031130 1

150 rows × 10 columns

As we only have two dimensions we can easily plot this on a single scatterplot.

# a different seaborn theme
# see https://python-graph-gallery.com/104-seaborn-themes/
sns.set_style("darkgrid")
sns.scatterplot(data = iris_df, x = 'c1', y = 'c2', hue = 'PCA 3 clusters')

I suspect having two clusters would work better. We should try a few different models.

Copying the code from here we can fit multiple numbers of clusters.

ks = range(1, 10)
inertias = [] # Create an empty list (will be populated later)
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k, n_init = 10)
    
    # Fit model to samples
    model.fit(iris_df[['c1', 'c2']])  # use the two PCA components explicitly; 'PCA 3 clusters' is now the last column, so .iloc[:,-2:] would pick it up
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

Three seems ok. We clearly want no more than three.

These types of plots show a point about model complexity. More free parameters in the model (here the number of clusters) will improve how well the model captures the data, often with diminishing returns. However, a model which overfits the data will not be able to fit new data well - this is referred to as overfitting. Randomish internet blogs introduce the topic pretty well, see here, and also Wikipedia, see here.

16.3.3 Missing values

Finally, how we deal with missing values can impact the results of PCA and kMeans clustering.

Let us load the iris dataset again and randomly remove about 20% of the data (see code from here).

import numpy as np

x = load_iris()
iris_df = pd.DataFrame(x.data, columns = x.feature_names)

# Mask each value with probability 0.2, i.e. roughly 20% of cells become missing
mask = np.random.choice([True, False], size = iris_df.shape, p = [0.2, 0.8])
# Keep at least one observed value per row by unmasking the last column of fully masked rows
mask[mask.all(1),-1] = 0

df = iris_df.mask(mask)

df.isna().sum()
sepal length (cm)    27
sepal width (cm)     33
petal length (cm)    35
petal width (cm)     30
dtype: int64
df
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 NaN 1.4 0.2
2 4.7 NaN NaN 0.2
3 4.6 3.1 NaN 0.2
4 5.0 3.6 1.4 NaN
... ... ... ... ...
145 NaN 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns

About 20% of the values are now randomly set to NaN.

16.3.3.1 Zeroing

We can set them to 0 and fit our models.

df_1 = df.copy()
df_1 = df_1.fillna(0)
df_1
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 0.0 1.4 0.2
2 4.7 0.0 0.0 0.2
3 4.6 3.1 0.0 0.2
4 5.0 3.6 1.4 0.0
... ... ... ... ...
145 0.0 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns

k_means_zero = KMeans(n_clusters = 4, init = 'random', n_init = 10)
k_means_zero.fit(df_1)
df_1['Four clusters'] = pd.Series(k_means_zero.predict(df_1.iloc[:,0:4].values), index = df_1.index)
sns.pairplot(df_1, hue = 'Four clusters')

What impact has zeroing the values had on our results?

Now, onto PCA.

# PCA analysis
n_components = 2

pca = PCA(n_components=n_components)
df_1_pca = pca.fit(df_1.iloc[:,0:4])

# Extract projected values
df_1_pca_vals = df_1_pca.transform(df_1.iloc[:,0:4])
df_1['c1'] = [item[0] for item in df_1_pca_vals]
df_1['c2'] = [item[1] for item in df_1_pca_vals]

sns.scatterplot(data = df_1, x = 'c1', y = 'c2')

df_1_pca.explained_variance_
array([6.53211311, 4.47901067])
df_1_pca.components_
array([[ 0.78683196, -0.00372672,  0.60451846,  0.12425384],
       [-0.61327545, -0.1129114 ,  0.77113291,  0.1284456 ]])

16.3.3.2 Replacing with the average

df_2 = df.copy()
# Replace the missing values in each of the four columns with that column's mean
for i in range(4):
    df_2.iloc[:,i] = df_2.iloc[:,i].fillna(df_2.iloc[:,i].mean())
df_2
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.100000 3.500000 1.400000 0.2000
1 4.900000 3.086325 1.400000 0.2000
2 4.700000 3.086325 3.858261 0.2000
3 4.600000 3.100000 3.858261 0.2000
4 5.000000 3.600000 1.400000 1.1725
... ... ... ... ...
145 5.869106 3.000000 5.200000 2.3000
146 6.300000 2.500000 5.000000 1.9000
147 6.500000 3.000000 5.200000 2.0000
148 6.200000 3.400000 5.400000 2.3000
149 5.900000 3.000000 5.100000 1.8000

150 rows × 4 columns

k_means_mean = KMeans(n_clusters = 4, init = 'random', n_init = 10)
k_means_mean.fit(df_2)
df_2['Four clusters'] = pd.Series(k_means_mean.predict(df_2.iloc[:,0:4].values), index = df_2.index)
sns.pairplot(df_2, hue = 'Four clusters')

# PCA analysis
n_components = 2

pca = PCA(n_components=n_components)
df_2_pca = pca.fit(df_2.iloc[:,0:4])

# Extract projected values
df_2_pca_vals = df_2_pca.transform(df_2.iloc[:,0:4])
df_2['c1'] = [item[0] for item in df_2_pca_vals]
df_2['c2'] = [item[1] for item in df_2_pca_vals]

sns.scatterplot(data = df_2, x = 'c1', y = 'c2')

df_2_pca.explained_variance_
array([3.07318001, 0.30043442])
df_2_pca.components_
array([[ 0.35776982, -0.08204292,  0.87268026,  0.32202311],
       [ 0.80763506,  0.30285429, -0.41134634,  0.29461683]])
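Filling each column in a loop works, but scikit-learn also offers this as a transformer. A small sketch of the same mean imputation using SimpleImputer (df_3 is just an illustrative name):

from sklearn.impute import SimpleImputer

df_3 = pd.DataFrame(SimpleImputer(strategy = 'mean').fit_transform(df), columns = df.columns)
df_3.isna().sum()   # no missing values should remain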

16.4 Useful resources

The scikit-learn User Guide is very good. Both approaches covered here are often referred to as unsupervised learning methods and you can find the scikit-learn section on them here.

If you have issues with the documentation then also look at the scikit-learn examples.

Also, in no particular order:

In case you are bored:

  • Stack abuse - Some fun blog entries to look at
  • Towards data science - a blog that contains a mix of intro, intermediate and advanced topics. Nice to skim through to try and understand something new.

Please do try out some of the techniques detailed in the lecture material. The simple examples found in the scikit-learn documentation are rather good. Generally, I find it much easier to try to understand a method using a simple dataset.