13  Exercise: Regression

import warnings
warnings.filterwarnings('ignore')

Now it’s your turn to prepare a linear regression model.

13.1 Scikit Learn

You can use the scikit-learn library here as well. More information is available at: https://www.tutorialspoint.com/scikit_learn/scikit_learn_linear_regression.htm

13.2 Wine Dataset

For this exercise we will be using the Wine Quality dataset from Cortez et al. (2009). You can find more information about it here: https://doi.org/10.24432/C56S3T

What would be your research question? What would you like to learn, given the data you have?

13.3 Reading Data

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
wine = pd.read_excel('data/raw/winequality-red_v2.xlsx', engine = 'openpyxl')

# Note: an .xlsx file needs no encoding argument, and current pandas versions of
# pd.read_excel do not accept one. The encoding option (e.g. encoding='UTF-8')
# belongs to pd.read_csv, should you work from a CSV export of the data instead.

13.4 Data exploration

Let’s check the data, their distribution and central tendencies.

print('shape:', wine.shape)
wine.head()
shape: (1599, 12)

   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
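
One possible way to look at distributions and central tendencies is sketched below. To stay self-contained it rebuilds three columns from the five rows printed by wine.head(); the name `sample` is a stand-in introduced here, and in the exercise you would call the same methods on the full `wine` DataFrame.

```python
import pandas as pd

# Three columns copied from the five rows shown by wine.head();
# `sample` is a stand-in for the full `wine` DataFrame.
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

# Count, mean, standard deviation, min, quartiles and max for each column.
summary = sample.describe()
print(summary)

# Median of each column, and how often each quality score occurs.
medians = sample.median()
quality_counts = sample["quality"].value_counts().sort_index()
print(medians)
print(quality_counts)
```

Comparing the mean and the median of a column is a quick first check for skew: a mean well above the median suggests a long right tail.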

13.4.1 Check your variables

Use the lmplot() function from Seaborn to explore linear relationships. Input data must be in a Pandas DataFrame. To plot them, provide the predictor and response variable names along with the dataset.
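
A minimal sketch of such a call follows. It uses synthetic data so that it runs on its own; the frame `demo` and the column names `predictor` and `response` are hypothetical, and with the real data you would pass `wine` and two of its column names instead.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (hypothetical); replace with `wine` and real column names.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"predictor": rng.normal(10, 1, 200)})
demo["response"] = 0.5 * demo["predictor"] + rng.normal(0, 0.5, 200)

# lmplot draws a scatter plot plus a fitted regression line with a confidence band.
grid = sns.lmplot(data=demo, x="predictor", y="response",
                  scatter_kws={"alpha": 0.5})
grid.set_axis_labels("predictor", "response")
```

lmplot returns a FacetGrid, so labels and titles are set on the returned object rather than through plt directly.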

Did you find outliers or missing data? You can use the function np.unique to find the unique elements of an array.

?np.unique
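
As a sketch of how these checks might look, the example below uses a small made-up frame (not the wine data) containing one missing value and one implausible score. The frame `demo`, its values, and the 0 to 10 plausibility range are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing pH value and one implausible quality score (55).
demo = pd.DataFrame({
    "quality": [5, 5, 6, 7, 55, 6],
    "pH": [3.51, 3.20, np.nan, 3.16, 3.26, 3.30],
})

# Unique values and their counts: unexpected entries such as 55 stand out.
values, counts = np.unique(demo["quality"], return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))

# Number of missing entries in each column.
missing = demo.isna().sum()
print(missing)

# Keep complete rows whose quality lies in an assumed plausible range of 0 to 10.
clean = demo.dropna()
clean = clean[clean["quality"].between(0, 10)]
print(clean.shape)
```

Whether a case should really be removed is a judgment call; an extreme value can be a data-entry error or a genuine observation.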

Do you need to remove any cases?

 

Do you need to standardize the data?

If you standardized the data, try plotting them again.
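
One common way to standardize is to rescale each column to zero mean and unit variance, sketched here with scikit-learn's StandardScaler. The two columns are copied from the five rows shown by wine.head(); `sample` and `scaled` are names introduced for this illustration, and other scaling choices are equally valid.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales, copied from the wine.head() output.
sample = pd.DataFrame({
    "total_sulfur_dioxide": [34.0, 67.0, 54.0, 60.0, 34.0],
    "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9978],
})

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample),
                      columns=sample.columns, index=sample.index)

# After scaling, each column has mean 0 and (population) standard deviation 1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Standardizing changes the units of the regression coefficients but not the strength of a linear relationship, so a plot of the scaled data should show the same pattern on different axes.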

 

13.5 Form ideas about the data

Before you move on to exploring correlations and maybe other kinds of models, try to build some understanding of the relations between the variables. What are some relations that stand out? Do you know a bit more about the wines in this dataset, or about wines more generally?

13.6 Move on to building some simple models

You can calculate a Pearson correlation coefficient and the p-value for testing non-correlation.

The correlation test comes from SciPy. For the regression models we will be using the scikit-learn package, which we will make use of very frequently.

import scipy.stats

# Replace ??? with the names of the two columns you want to compare.
scipy.stats.pearsonr(wine['???'].values, wine['???'].values)
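
To see what pearsonr returns before choosing your own columns, here is a small sketch on synthetic data. The arrays `a` and `b` are hypothetical and are constructed to be positively correlated.

```python
import numpy as np
import scipy.stats

# Two synthetic variables (hypothetical); b depends partly on a, partly on noise.
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = 0.6 * a + rng.normal(scale=0.8, size=300)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = scipy.stats.pearsonr(a, b)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```

The coefficient lies between -1 and 1. A small p-value means a correlation this strong would be unlikely if the two variables were uncorrelated; it says nothing about how large or important the effect is.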

Using scikit-learn, build a simple linear regression (OLS).

from sklearn.linear_model import LinearRegression

est = LinearRegression(fit_intercept = True)

# Replace ??? with the predictor (x) and response (y) column names.
x = wine[['???']]
y = wine[['???']]

est.fit(x, y)

print("Coefficients:", est.coef_)
print("Intercept:", est.intercept_)
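
For reference, the sketch below fits the same kind of model on synthetic data with a known slope and intercept, so you can see what the fitted attributes look like. The frame `demo`, its column names, and the true values 0.5 and 2.0 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): response = 2.0 + 0.5 * predictor + noise.
rng = np.random.default_rng(1)
demo = pd.DataFrame({"predictor": rng.uniform(0, 10, 200)})
demo["response"] = 2.0 + 0.5 * demo["predictor"] + rng.normal(0, 0.3, 200)

X = demo[["predictor"]]  # double brackets: scikit-learn expects a 2-D predictor
y = demo["response"]     # a 1-D response gives a 1-D coef_ and a scalar intercept_

est = LinearRegression(fit_intercept=True).fit(X, y)
print("Slope:", est.coef_[0])
print("Intercept:", est.intercept_)
```

If the response is passed as a one-column DataFrame (double brackets), coef_ comes back with shape (1, 1) and intercept_ as a one-element array; the fitted values are the same.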

What are the model’s mean squared error (\(MSE\)) and coefficient of determination (\(R^2\))?
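
As a reminder, \(MSE = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\) and \(R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\). The sketch below computes both by hand on a few made-up numbers (not results from the wine data) and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn import metrics

# Made-up observed values and predictions, for illustration only.
y_true = np.array([5.0, 5.0, 6.0, 7.0, 5.0, 6.0])
y_pred = np.array([5.2, 5.1, 5.8, 6.4, 5.3, 6.0])

# MSE: the average squared residual.
residuals = y_true - y_pred
mse = np.mean(residuals ** 2)

# R^2: one minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse, metrics.mean_squared_error(y_true, y_pred))
print("R^2:", r2, metrics.r2_score(y_true, y_pred))

# MSE is symmetric in its arguments, but R^2 is not: swapping them changes the result.
r2_swapped = metrics.r2_score(y_pred, y_true)
print("R^2 with swapped arguments:", r2_swapped)
```

Because \(R^2\) divides by the variance of whichever array is passed first, scikit-learn's metrics should receive the observed values first and the predictions second.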

from sklearn import metrics

# Fit on the full dataset; replace ??? with your predictor and response columns.
x = wine[['???']]
y = wine[['???']]
model = LinearRegression()
model.fit(x, y)
y_hat = model.predict(x)
plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(x, y_hat, 'r', alpha = 0.5)
plt.xlabel('?')
plt.ylabel('?')
# scikit-learn metrics expect the observed values first, then the predictions.
print("MSE:", metrics.mean_squared_error(y, y_hat))
print("R^2:", metrics.r2_score(y, y_hat))
print("var:", y.var())
plt.savefig("?.png", dpi = 300, bbox_inches = 'tight')

What’s the conclusion?