from sklearn.decomposition import SparsePCA
df.head()
# May take a while depending on your computer
# feel free not to run this
sns.pairplot(df)
19 Normalisation
Before you carry out any operation, you might want to perform some normalisation. This ensures that the assumptions the algorithms make are met, and that the results are not biased or dominated by the different value ranges and amounts of variation inherent in the data.
Do try out the following steps without normalisation first, then come back to this point, normalise the data, and see what difference working with a normalised copy of the data makes.
from sklearn.preprocessing import MinMaxScaler

# First save the column names; we will create a new dataset with the scaled data.
col_names = df.columns

# This is the normalisation step.
# We are using the MinMaxScaler, which brings all the data between 0 and 1.
# Make use of other transformations offered by scikit-learn, experiment, note changes.
# The last column of the data contains the "quality" labels/scores; we don't want to normalise them,
# as they are sort of the "dependent" (or "target") variable and there is meaning in these scores.
# Let's normalise the first 11 columns, which are our "independent" columns.
scaled_df = pd.DataFrame(MinMaxScaler().fit_transform(df.iloc[:, 0:11]))

# Now we want to add the "quality" values back in. We'll need them.
scaled_df = scaled_df.join(df.iloc[:, 11:12])

# Now we name the columns with the original column names. We do this because MinMaxScaler
# returns a plain array, so the new data frame has no column names.
scaled_df.columns = col_names

# Let's have a look at what the data looks like:
scaled_df.head()
Important note: The rest of the code will continue to use the non-normalised version of the data. For now, just carry on with that and try running the operations on the non-normalised version. Once you are through (or somewhere in the middle), try them out with the normalised data and see what this changes.
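The comments in the normalisation cell above suggest making use of other transformations offered by scikit-learn. As one example, here is a minimal sketch (assuming the same df as above; the name standardised_df is just for illustration) that uses StandardScaler, which centres each feature around its mean and scales it to unit variance:

from sklearn.preprocessing import StandardScaler

# Standardise the 11 "independent" columns: subtract the mean and divide by the standard deviation.
standardised_df = pd.DataFrame(StandardScaler().fit_transform(df.iloc[:, 0:11]))

# Add the "quality" column back in and restore the original column names, as before.
standardised_df = standardised_df.join(df.iloc[:, 11:12])
standardised_df.columns = df.columns
standardised_df.head()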
20 Dimension reduction
from sklearn.decomposition import PCA

n_components = 2
pca = PCA(n_components=n_components)
df_pca = pca.fit(df.iloc[:, 0:11])

# Project the data onto the two components and keep them as new columns,
# following the same pattern used for the other projections below.
df_pca_vals = df_pca.transform(df.iloc[:, 0:11])
df['c1'] = [item[0] for item in df_pca_vals]
df['c2'] = [item[1] for item in df_pca_vals]
sns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'quality')
print(df.columns)
df_pca.components_
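To get a quick sense of how much of the overall variation these two components capture, you can also inspect the fitted model's explained variance (a small addition, assuming the df_pca fitted above):

# Proportion of the total variance captured by each of the two components,
# and the total captured by both together.
print(df_pca.explained_variance_ratio_)
print(df_pca.explained_variance_ratio_.sum())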
# Now the same idea with SparsePCA.
s_pca = SparsePCA(n_components=n_components)
df_s_pca_vals = s_pca.fit_transform(df.iloc[:, 0:11])
df['c1 spca'] = [item[0] for item in df_s_pca_vals]
df['c2 spca'] = [item[1] for item in df_s_pca_vals]
sns.scatterplot(data = df, x = 'c1 spca', y = 'c2 spca', hue = 'quality')
from sklearn.manifold import TSNE

tsne_model = TSNE(n_components=n_components)
df_tsne = tsne_model.fit(df.iloc[:, 0:11])
df_tsne_vals = tsne_model.fit_transform(df.iloc[:, 0:11])
df['c1 tsne'] = [item[0] for item in df_tsne_vals]
df['c2 tsne'] = [item[1] for item in df_tsne_vals]
# This plot does not look right
# I am not sure why.
sns.scatterplot(data = df, x = 'c1 tsne', y = 'c1 tsne', hue = 'quality')
That looks concerning: there is a straight line. It looks like something in the above code might not be correct.
Can you find out what that might be?
Hint: think about when you would get a straight line in a scatterplot.
Once you have fixed the error above, you will notice a different structure to the ones you observed in the PCA runs. There isn't really a clear-cut answer as to which of these projections one should adopt.
For now, we will use the PCA components. PCA would be a good choice if the interpretability of the components is important to us: since PCA is a linear projection method, the components carry the weights of each raw feature, which enables us to make inferences about the axes. However, if we are more interested in finding structures and identifying groups of similar items, t-SNE might be a better projection to use, since it emphasises proximity; its axes, though, don't mean much, since the layout is formed stochastically (fancy speak for saying that there is randomness in the algorithm and the layout will be different each time you run it).
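The next cell sorts a table of loadings, which is not constructed anywhere in this excerpt. Here is a minimal sketch of how it could be built from the fitted PCA above; the column names 'component 1' and 'component 2' are an assumption chosen to match the sort that follows:

# Hypothetical construction of the loadings table: one row per raw feature,
# one column per principal component, with the weights taken from the fitted PCA.
loadings = pd.DataFrame(df_pca.components_.T,
                        columns=['component 1', 'component 2'],
                        index=df.columns[0:11])
loadings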
loadings_sorted = loadings.sort_values(by=['component 2'], ascending=False)
loadings_sorted.iloc[1:10,:]
20.3 Clustering
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 10)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters
    model = KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(df[['c1', 'c2']])
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
# Fit k-means with three clusters and record each row's cluster assignment.
k_means_3 = KMeans(n_clusters=3)
k_means_3.fit(df[['c1', 'c2']])
df['Three clusters'] = pd.Series(k_means_3.predict(df[['c1', 'c2']].values), index = df.index)
sns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Three clusters')
Consider:
Is that useful?
What might it mean?
Outside of this session, go back to normalising the data and try out different methods for normalisation as well (e.g., centering around the mean), cluster the raw data (rather than the PCA projections), try to get t-SNE working, or use different numbers of components. A sketch of clustering the raw columns follows below.
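As a starting point for clustering the raw data, here is a minimal sketch, assuming the scaled_df built in the normalisation step and the 'c1'/'c2' PCA columns from above; the names raw_kmeans and 'Raw clusters' are just for illustration, and the choice of three clusters simply mirrors the example above:

from sklearn.cluster import KMeans

# Cluster on all 11 normalised feature columns instead of the 2D PCA projection.
raw_kmeans = KMeans(n_clusters=3)
df['Raw clusters'] = raw_kmeans.fit_predict(scaled_df.iloc[:, 0:11])

# Compare with the PCA-based clusters by colouring the same PCA projection.
sns.scatterplot(data = df, x = 'c1', y = 'c2', hue = 'Raw clusters')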