import warnings
warnings.filterwarnings('ignore')
13 Exercise: Regression
Now it’s your turn to prepare a linear regression model.
13.1 Scikit Learn
You can use the scikit-learn library here as well. You can find more information here: https://www.tutorialspoint.com/scikit_learn/scikit_learn_linear_regression.htm
13.2 Wine Dataset
For this exercise we will be using the Wine Quality Dataset from Cortez et al. (2009). You can find more information about it here: https://doi.org/10.24432/C56S3T
What would be your research question? What would you like to learn, given the data you have?
13.3 Reading Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
wine = pd.read_excel('data/raw/winequality-red_v2.xlsx', engine = 'openpyxl')
# Note: in current pandas versions pd.read_excel has no `encoding` argument, so .xlsx files are read without one.
# An encoding (for example encoding='UTF-8') is only relevant for text files read with pd.read_csv.
13.4 Data exploration
Let’s check the data, their distribution, and their central tendencies.
print('shape:', wine.shape)
wine.head()
shape: (1599, 12)
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
13.4.1 Check your variables
Use the lmplot() function from Seaborn to explore linear relationships. The input data must be in a Pandas DataFrame. To plot them, we provide the predictor and response variable names along with the dataset.
Did you find outliers or missing data? You can use the function np.unique to find the unique elements of an array.
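As a rough sketch of what such a call can look like, the block below rebuilds three columns from the five rows printed by wine.head() above, so that it runs without the Excel file. The name `sample` and the choice of fixed_acidity and pH as predictor and response are assumptions made for illustration only; with the full dataset you would pass `wine` and your own pair of columns.

```python
import pandas as pd
import seaborn as sns

# Three columns from the five rows shown by wine.head() above.
# `sample` is a hypothetical stand-in for the full `wine` DataFrame.
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
    "quality": [5, 5, 5, 6, 5],
})

# x names the predictor column, y names the response column,
# and data is the DataFrame that holds both.
grid = sns.lmplot(x="fixed_acidity", y="pH", data=sample)
grid.set_axis_labels("fixed_acidity", "pH")
```

Five rows are far too few to judge a relationship; the point here is only the shape of the call.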
?np.unique
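One possible way to approach these checks is sketched below, again on a small stand-in built from the rows printed by wine.head() above (the name `sample` is an assumption, not part of the exercise). It counts missing values per column with pandas and lists the distinct values of one column with np.unique.

```python
import numpy as np
import pandas as pd

# Two columns from the five rows shown by wine.head() above.
# `sample` is a hypothetical stand-in for the full `wine` DataFrame.
sample = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
    "quality": [5, 5, 5, 6, 5],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Distinct values of a column and how often each one occurs.
values, counts = np.unique(sample["quality"].to_numpy(), return_counts=True)
print("values:", values)
print("counts:", counts)
```

Unexpected entries in the list of unique values, such as a quality score outside the expected range, are one hint that a column needs cleaning.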
Do you need to remove any cases?
Do you need to standardize the data?
If you standardized the data, try plotting them again.
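If you decide to standardize, one common option is scikit-learn's StandardScaler, which rescales each column to mean 0 and standard deviation 1. The sketch below is only one way to do it and uses a hypothetical `sample` DataFrame built from the rows printed by wine.head() above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two columns from the five rows shown by wine.head() above.
# `sample` is a hypothetical stand-in for the full `wine` DataFrame.
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
})

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
# to keep the column names for plotting.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)

print(scaled.mean())
print(scaled.std(ddof=0))  # StandardScaler uses the population standard deviation
```

Standardizing changes the scale of the axes but not the shape of a scatter plot, so the plots should look the same apart from the tick values.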
13.5 Form ideas about the data
Before you move on to exploring correlations and maybe other kinds of models, try to build some understanding of the relations between the variables. What are some relations that stand out? Do you know a bit more about the wines in this dataset, or about wines more generally?
13.6 Move on to building some simple models
You can calculate a Pearson correlation coefficient and the p-value for testing non-correlation.
We will use scipy.stats for the correlation and the scikit-learn package for the regression models. scikit-learn is a package we will be making use of very frequently.
import scipy.stats
scipy.stats.pearsonr(wine.???.values, wine.???.values)
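The ??? placeholders have to be replaced with column names before the cell will run. As an illustration only, the sketch below applies the same call to fixed_acidity and pH, using a hypothetical `sample` DataFrame built from the rows printed by wine.head() above; which pair of columns to test in the exercise is your choice.

```python
import pandas as pd
import scipy.stats

# Two columns from the five rows shown by wine.head() above.
# `sample` is a hypothetical stand-in for the full `wine` DataFrame.
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
})

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p_value = scipy.stats.pearsonr(sample["fixed_acidity"].values, sample["pH"].values)
print("r:", r)
print("p-value:", p_value)
```

With only five rows the p-value is not meaningful; run the same call on the full `wine` DataFrame to draw conclusions.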
Using scikit-learn, build a simple linear regression (OLS).
from sklearn.linear_model import LinearRegression

est = LinearRegression(fit_intercept = True)

x = wine[['???']]
y = wine[['???']]

est.fit(x, y)

print("Coefficients:", est.coef_)
print("Intercept:", est.intercept_)
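A filled-in version of this template might look like the sketch below. The predictor and response (fixed_acidity and pH) and the `sample` DataFrame, built from the rows printed by wine.head() above, are assumptions chosen for illustration; they are not the intended answer to the exercise.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Two columns from the five rows shown by wine.head() above.
# `sample` is a hypothetical stand-in for the full `wine` DataFrame.
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
})

est = LinearRegression(fit_intercept=True)

# Double brackets keep x and y as DataFrames, i.e. two-dimensional,
# which is the shape scikit-learn expects for the predictors.
x = sample[["fixed_acidity"]]
y = sample[["pH"]]

est.fit(x, y)

print("Coefficients:", est.coef_)
print("Intercept:", est.intercept_)
```

Because y is passed as a one-column DataFrame, est.coef_ comes back as a 2-D array with one row per response column and one column per predictor.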
What are the model’s mean squared error (\(MSE\)) and coefficient of determination (\(R^2\))?
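For reference, the usual definitions of the two quantities are given below. The symbols are notation introduced here rather than taken from the exercise: \(y_i\) are the observed values, \(\hat{y}_i\) the model predictions, \(\bar{y}\) the mean of the observed values, and \(n\) the number of observations.

```latex
\[
MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
\]
```

A lower \(MSE\) means predictions closer to the observations, while \(R^2\) expresses how much of the variance in \(y\) the model accounts for.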
from sklearn import metrics

# Fit the model, plot the fit, and evaluate it on the same data.
x = wine[['???']]
y = wine[['???']]
model = LinearRegression()
model.fit(x, y)
y_hat = model.predict(x)

plt.plot(x, y, 'o', alpha = 0.5)
plt.plot(x, y_hat, 'r', alpha = 0.5)
plt.xlabel('?')
plt.ylabel('?')

# scikit-learn metrics take the observed values first and the predictions second.
print("MSE:", metrics.mean_squared_error(y, y_hat))
print("R^2:", metrics.r2_score(y, y_hat))
print("var:", y.var())

plt.savefig("?.png", dpi = 300, bbox_inches = 'tight')
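One detail worth checking when computing these metrics: scikit-learn's metric functions expect the observed values first and the predictions second. For the mean squared error the order makes no difference, but for r2_score it does. The sketch below shows this on a hypothetical `sample` DataFrame built from the rows printed by wine.head() above; the column choice is an assumption made for illustration.

```python
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression

# Two columns from the five rows shown by wine.head() above.
# `sample` is a hypothetical stand-in for the full `wine` DataFrame.
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
})

x = sample[["fixed_acidity"]]
y = sample[["pH"]]

model = LinearRegression()
model.fit(x, y)
y_hat = model.predict(x)

# MSE is symmetric in its two arguments.
mse = metrics.mean_squared_error(y, y_hat)
mse_swapped = metrics.mean_squared_error(y_hat, y)

# R^2 is not: the first argument defines the variance being explained.
r2 = metrics.r2_score(y, y_hat)
r2_swapped = metrics.r2_score(y_hat, y)

print("MSE:", mse, "swapped:", mse_swapped)
print("R^2:", r2, "swapped:", r2_swapped)
```

The value to report is the one computed with the observed values first.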
What’s the conclusion?