18 Exercise: Wine dataset

As with previous exercises, fill in the question marks with the correct code.

Last week you were introduced to the wine dataset. It has 11 input variables and 1 output variable.

Input variables (based on physicochemical tests):

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

Output variable (based on sensory data):

quality (score between 0 and 10)

I suggest we look at two broad questions with this dataset:

Will dimension reduction reveal variable groupings? Think back to how we interpreted the loadings in the crime dataset.
What does clustering the wines tell us?

18.1 Load data and import libraries

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import seaborn as sn?
from sklearn.cluster import KMeans
from sklearn.decomposition import PC?
from sklearn.decomposition import S????ePCA
from sklearn.manifold import TSNE

df = pd.read_excel('data/winequality-red_v2.xlsx')

df.h??d()

# May take a while depending on your computer
# feel free not to run this
sns.pair????(df)

19 Normalisation

Before you carry out any operation, you might want to normalise the data. Normalisation helps to meet some of the assumptions the algorithms make, and it stops the results from being dominated by variables that simply have larger value ranges or more variation than the others.

Do try out the following steps without normalisation first. Then come back here, normalise the data, and rerun the steps on the normalised copy to see what difference it makes.
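To see why value ranges matter, here is a minimal sketch on a small made-up table (the column names and values are hypothetical, not taken from the wine data). It compares MinMaxScaler with StandardScaler, one of the other scikit-learn transformations you could experiment with.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two columns on very different scales
toy = pd.DataFrame({
    "small_range": [0.1, 0.2, 0.3, 0.4, 0.5],
    "large_range": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Min-max scaling: each column is mapped onto the interval [0, 1]
minmax = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Standardisation: each column is centred on 0 and scaled to unit variance
standard = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(toy.var())                               # variances before scaling
print(minmax.describe().loc[["min", "max"]])   # range after min-max scaling
print(standard.mean().round(6))                # column means after standardisation
```

Before scaling, the second column has a far larger variance than the first, so any method based on distances or variance would be driven almost entirely by it. After either transformation the two columns are on a comparable footing.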

from sklearn.preprocessing import MinMaxScaler

# first save the column names, we will create a new dataset with the scaled data
col_names = df.columns

# This is the normalisation step.
# We are using MinMaxScaler, which rescales each column to lie between 0 and 1.
# Try the other transformations offered by scikit-learn, experiment, and note what changes.

# The last column of the data contains the "quality" scores. We don't want to normalise them,
# as they are the "dependent" (or "target") variable and the scores themselves carry meaning.
# Let's normalise the first 11 columns, which are our "independent" columns.
scaled_df = pd.DataFrame(MinMaxScaler().fit_transform(df.iloc[:, 0:11]))

# now we want to add the "quality" values back in. We'll need them.
scaled_df = scaled_df.join(df.iloc[:, 11:12])

# now we name the columns with the original column names. We do this because fit_transform
# returns a NumPy array, so the column names are lost when we wrap it in a new data frame.
scaled_df.columns = col_names

# let's have a look at what the data is looking like:
scaled_df.head()

Important note: The rest of the code continues to use the non-normalised version of the data. For now, carry on with that and run the operations on the non-normalised data. Once you are through (or partway through), try them out with the normalised data and see what changes.

20 Dimension reduction

from sklearn.decomposition import PCA

n_components = 2
 
pca = PCA(n_??????????=n_components)
df_pca = pca.fit(df?iloc[:, 0:11])

df_pca_vals = df_pca.???_transform(df.iloc[:, 0:11])
df['c1'] = [item[0] for item in df_pca_????]
df['c2'] = [item[1] for item in df_pca_vals]

sns.scatterplot(data = df, x = ?, y = ?, hue = 'quality')

print(df.columns)
df_pca.components_

What about other dimension reduction methods?

20.1 SparsePCA

s_pca = SparsePCA(n_components=n_components)
df_s_pca = s_pca.fit(df.????[:, 0:11])

df_s_pca_vals = s_pca.fit_?????????(df.iloc[:, 0:11])
df['c1 spca'] = [item[0] for item in df_s_pca_vals]
df['c2 spca'] = [item[1] for item in df_s_pca_vals]

sns.scatterplot(data = df, x = 'c1 spca', y = 'c2 spca', hue = 'quality')

20.2 t-SNE

tsne_model = TSNE(n_components=n_components)
df_tsne = tsne_model.fit(df.iloc[:, 0:11])

df_tsne_vals = tsne_model.fit_transform(df.iloc[:, 0:11])
df['c1 tsne'] = [item[0] for item in ??_tsne_vals]
df['c2 tsne'] = [item[1] for item in df_tsne_vals]

# This plot does not look right
# I am not sure why.
sns.scatterplot(data = ??, x = 'c1 tsne', y = 'c1 tsne', hue = 'quality')

That looks concerning - there is a straight line. It looks like something in the above code might not be correct.

Can you find out what that might be?

Hint: think about when you would get a straight line in a scatterplot.

Once you have fixed the error above, you will notice a structure that differs from the ones you observed in the PCA runs. There isn’t really a clear rule for which of these projections one should adopt.

For now, we will use the PCA components. PCA is a good choice if the interpretability of the components is important to us. Since PCA is a linear projection method, each component carries a weight (a loading) for every raw feature, which enables us to make inferences about the axes. However, if we are more interested in finding structure and identifying groups of similar items, t-SNE might be a better projection to use, since it emphasises proximity. Its axes don’t mean much, though, because the layout is formed stochastically (fancy speak for saying that there is randomness in the algorithm and the layout can be different each time you run it).
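The contrast between the two methods can be sketched on made-up data (the random matrix X below is a hypothetical stand-in, not the wine features). The t-SNE runs use init="random" with two different random_state values purely to make the randomness visible; these settings are an assumption for illustration, not the ones used elsewhere in this exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical stand-in for the wine features: 100 rows, 5 random columns
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# PCA: fitting twice on the same data gives the same projection
pca_a = PCA(n_components=2).fit_transform(X)
pca_b = PCA(n_components=2).fit_transform(X)
print("PCA runs identical:", np.allclose(pca_a, pca_b))

# PCA loadings: one row per component, one weight per original feature
loadings_shape = PCA(n_components=2).fit(X).components_.shape
print("Loadings shape:", loadings_shape)

# t-SNE: a random start with different seeds gives different layouts
tsne_a = TSNE(n_components=2, init="random", random_state=1).fit_transform(X)
tsne_b = TSNE(n_components=2, init="random", random_state=2).fit_transform(X)
print("t-SNE runs identical:", np.allclose(tsne_a, tsne_b))
```

If you want a t-SNE layout you can reproduce between runs, passing a fixed integer as random_state is the usual way to do it.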

data = {'columns' : df.iloc[:, 0:11].columns,
        'component 1' : df_pca.components_[0],
        'component 2' : df_pca.components_[1]}


loadings = pd.?????????(data)
loadings_sorted = loadings.sort_values(by=['component 1'], ascending=False)
loadings_sorted.iloc[1:10,:]

loadings_sorted = loadings.sort_values(by=['component 2'], ascending=False)
loadings_sorted.iloc[1:10,:]

20.3 Clustering

from sklearn.cluster import KMeans

ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    ????? = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(df[['c1', 'c2']])
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

import matplotlib.pyplot as plt

plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

k_means_3 = KMeans(n_clusters = 3, init = 'random')
k_means_3.fit(df[['c1', 'c2']])
df['Three clusters'] = pd.Series(k_means_3.???????(df[['c1', 'c2']].values), index = df.index)

sns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Three clusters')

Consider:

Is that useful?
What might it mean?
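One possible way to approach these questions is to compare the cluster labels with the quality scores. The sketch below uses a small made-up table (the values are hypothetical); with the real data you would use df and its 'Three clusters' and 'quality' columns instead.

```python
import pandas as pd

# Hypothetical toy table standing in for df after clustering
toy = pd.DataFrame({
    "Three clusters": [0, 0, 1, 1, 2, 2, 2, 0],
    "quality":        [5, 6, 5, 5, 7, 7, 6, 5],
})

# Count how many wines of each quality score fall into each cluster
counts = pd.crosstab(toy["Three clusters"], toy["quality"])
print(counts)

# Average quality score per cluster
mean_quality = toy.groupby("Three clusters")["quality"].mean()
print(mean_quality)
```

If the clusters have very similar quality profiles, the grouping probably reflects something other than quality, and it may be worth looking at which input variables differ between clusters.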

Outside of this session, go back to normalising the data and try out different methods of normalisation (e.g., centring around the mean). You could also try clustering the raw data (rather than the projections from PCA), getting t-SNE working, or using different numbers of components.
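As a starting point for those experiments, here is a minimal sketch of clustering mean-centred raw features directly, without projecting them first. The random matrix X is a hypothetical stand-in for the 11 wine features, and the choices of three clusters, n_init=10 and random_state=0 are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the 11 raw wine features
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 11))

# Centre each feature on its mean and scale to unit variance, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

print(np.bincount(labels))  # number of rows assigned to each cluster
```

With the wine data you would pass df.iloc[:, 0:11] in place of X, and you could swap StandardScaler for MinMaxScaler to compare the two.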