Linear Regression Using scikit-learn in ML

Manish Patel

2023-05-18

Python code for a scikit-learn linear regression example

Importing required libraries

import numpy as np  
import matplotlib.pyplot as plt  
from sklearn.datasets import load_diabetes  
from sklearn.linear_model import LinearRegression  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
import math

Loading the sklearn diabetes dataset

X, Y = load_diabetes(return_X_y=True)  

Taking only one feature to perform simple linear regression

X = X[:,8].reshape(-1,1)  
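To see which feature column index 8 refers to, the dataset's column names can be inspected. The following is a small sketch; the variable names `data`, `feature_name`, and `X_one` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_diabetes

# Load the full dataset object (not just X and Y) to get at the column names.
data = load_diabetes()

# Name of the column selected with index 8.
feature_name = data.feature_names[8]
print(feature_name)

# Selecting one column and reshaping keeps X two-dimensional
# (n_samples, 1), the shape scikit-learn estimators expect.
X_one = data.data[:, 8].reshape(-1, 1)
print(X_one.shape)
```

The reshape matters because slicing a single column returns a one-dimensional array, which `fit` would reject as a feature matrix.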

Splitting the independent feature (X) and the dependent target (Y) into training and test sets

X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size = 0.3, random_state = 10 ) 
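A quick way to confirm the split behaved as intended is to print the shapes of the four arrays. This sketch repeats the loading steps so it runs on its own; it assumes the same `test_size` and `random_state` as above.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, Y = load_diabetes(return_X_y=True)
X = X[:, 8].reshape(-1, 1)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=10
)

# Roughly 70% of the rows go to training and 30% to testing.
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```

Fixing `random_state` makes the split reproducible, so the metrics reported later can be regenerated.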

Creating an instance for the linear regression model of sklearn

reg = linear_model.LinearRegression()  

Training the model by passing the dependent and independent features of the training dataset

reg.fit( X_train, Y_train )   
LinearRegression()

Creating an array of predictions made by the model for the unseen or test dataset

Y_pred = reg.predict( X_test ) 

Regression model equation

print("Value of the Coefficients: \n", reg.coef_)  
Value of the Coefficients: 
 [875.72295475]
# The fitted line has the form y = a*x + b
# 'a' is the coefficient (slope), rounded to 2 decimal places
print('a is: ',np.round(reg.coef_,2))

# 'b' is the intercept (constant), rounded to 2 decimal places
print('b is: ',np.round(reg.intercept_,2))
a is:  [875.72]
b is:  150.35
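The equation y = a*x + b can be checked against the model directly. The sketch below refits the same model so it is self-contained, then compares predictions computed by hand with those from `predict`; the names `manual_pred` and `matches` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, Y = load_diabetes(return_X_y=True)
X = X[:, 8].reshape(-1, 1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=10
)

reg = LinearRegression().fit(X_train, Y_train)

a = reg.coef_[0]    # slope for the single feature
b = reg.intercept_  # constant term

# Apply y = a*x + b by hand to the test feature values.
manual_pred = a * X_test[:, 0] + b

# The hand-computed values should agree with reg.predict.
matches = np.allclose(manual_pred, reg.predict(X_test))
print(matches)
```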

Importance of Model Evaluation

Being able to measure the performance of a machine learning model correctly is a critical skill for every machine learning practitioner.

Regression Metrics

Regression metrics are different from classification metrics because we are predicting a continuous quantity.

  • Mean Absolute Error
  • Mean Squared Error
  • R2 score (the coefficient of determination)
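For reference, these are the standard definitions of the three metrics for n observations. The notation is an assumption made here, not taken from the original: y_i is an actual value, \hat{y}_i the corresponding prediction, and \bar{y} the mean of the actual values. The R² expression is the form that scikit-learn's `r2_score` computes.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```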

Mean Squared Error

  • The Mean squared error (MSE) represents the error of the estimator or predictive model created based on the given set of observations in the sample.
  • It measures the average squared difference between the predicted values and the actual values, quantifying the discrepancy between the model’s predictions and the true observations.
  • The lower the MSE, the closer the model's predictions are to the actual values, and the better the regression model fits.
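The definition above can be reproduced in a few lines. This is a minimal sketch on made-up numbers (the arrays are illustrative, not from the diabetes dataset) that computes MSE by hand and compares it with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Average of the squared differences between actual and predicted values.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

The squared errors here are 0.25, 0.25, 0 and 1, so both values come out to 0.375.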

Why MSE?

  1. Ease of interpretation
  2. Squared terms emphasize larger errors
  3. Differentiable

Drawbacks

  • MSE has some limitations, such as its sensitivity to outliers and the absence of an upper bound on its values.
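The sensitivity to outliers can be illustrated with a toy comparison against the mean absolute error. The arrays below are made up for the purpose: the two sets of predictions differ only in one large miss.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.zeros(4)
pred_clean = np.array([1.0, 1.0, 1.0, 1.0])     # every error is 1
pred_outlier = np.array([1.0, 1.0, 1.0, 10.0])  # one error of 10

mae_clean = mean_absolute_error(y_true, pred_clean)
mse_clean = mean_squared_error(y_true, pred_clean)
mae_outlier = mean_absolute_error(y_true, pred_outlier)
mse_outlier = mean_squared_error(y_true, pred_outlier)

print(f"clean:   MAE={mae_clean}, MSE={mse_clean}")
print(f"outlier: MAE={mae_outlier}, MSE={mse_outlier}")
```

A single large error moves MAE from 1 to 3.25 but moves MSE from 1 to 25.75, because the error of 10 contributes 100 to the sum of squares.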

What is R-Squared? (coefficient of determination)

  • It measures the proportion of the total variation in the dependent variable (output) that can be explained by the independent variables (inputs) in the model.
  • Mathematically, it can be written as the ratio of the regression sum of squares (SSR) to the total sum of squares (SST). For a least-squares fit with an intercept, evaluated on the data it was fitted to, this equals one minus the ratio of the residual sum of squares to SST.
  • The R-squared value is used to measure the goodness of fit of the regression line.
  • The greater the value of R-squared, the better the regression model, because more of the variation of the actual values around their mean is explained by the model.
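The ratio described above can be computed directly. This sketch uses a tiny made-up dataset and evaluates the fit on the same data it was trained on, which is the setting where SSR/SST and the residual-based form agree; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Illustrative data only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
sse = np.sum((y - y_hat) ** 2)         # residual (unexplained) variation
sst = np.sum((y - y.mean()) ** 2)      # total variation around the mean

r2_ratio = ssr / sst
r2_residual = 1 - sse / sst

print(r2_ratio, r2_residual, r2_score(y, y_hat))
```

All three values agree here (0.6). On held-out test data the two forms can differ, and `r2_score` uses the residual-based one.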

DIFFERENCES - MSE VS COEFFICIENT OF DETERMINATION

| Metric | Mean Squared Error (MSE) | Coefficient of Determination (R-squared) |
|---|---|---|
| Definition | The average of the squared differences between the predicted values and the true values. | The proportion of the variance in the dependent variable explained by the independent variables. |
| Range of Values | Can take any non-negative value. | At most 1. Usually between 0 and 1, but it can be negative on test data when the model predicts worse than the mean of the target. |
| Interpretation | Lower MSE indicates better model performance, as it reflects the average squared prediction error. | Higher R-squared indicates better model performance, as it explains a larger proportion of the variance in the data. |
| Sensitivity to Outliers | Highly sensitive to outliers, as the squared differences amplify the impact of outliers on the overall error. | Also affected by outliers, since it is built from squared differences, but it is expressed as a proportion of variance rather than in absolute units. |
| Relationship to Data | Directly measures the prediction accuracy by quantifying the average squared deviation between predicted and actual values. | Indirectly measures the goodness of fit by assessing how well the model explains the variability in the data. |
| Combination of Errors | MSE combines both systematic and random errors into a single metric. | R-squared separates explained variance from unexplained variance. |
| Usage | Useful for comparing different models or tuning hyperparameters, as it provides a numerical measure of prediction accuracy. | Widely used for model evaluation and selection, as it provides insights into the model's explanatory power. |
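One caveat on the range of R-squared: as computed by scikit-learn's `r2_score`, the value is not bounded below by 0. The toy example here (made-up numbers) shows a constant mean prediction scoring 0 and a worse-than-mean prediction scoring below 0.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
y_mean = [2.0, 2.0, 2.0]      # always predict the mean of y_true
y_reversed = [3.0, 2.0, 1.0]  # predictions that trend the wrong way

r2_baseline = r2_score(y_true, y_mean)
r2_bad = r2_score(y_true, y_reversed)

print(r2_baseline, r2_bad)
```

Predicting the mean gives 0, and the reversed predictions give -3, so a negative score on a test set signals a model that does worse than the mean baseline.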

The value of the coefficient for the independent feature in the simple linear regression model

print("Value of the Coefficients: \n", reg.coef_)  
Value of the Coefficients: 
 [875.72295475]

The values of the mean absolute error and the mean squared error

print(f"Mean Absolute error: {mean_absolute_error( Y_test, Y_pred)}")  

print(f"Mean square error: {mean_squared_error( Y_test, Y_pred)}")  
Mean Absolute error: 54.02374262410346
Mean square error: 4254.615583911326
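Because MSE is expressed in squared units of the target, its square root (the root mean squared error, RMSE) is often easier to compare with the MAE. The sketch below simply takes the square root of the MSE value printed above; RMSE is an addition here and is not computed in the original notebook.

```python
import math

# MSE value reported in the output above.
mse = 4254.615583911326

# RMSE is in the same units as the target variable.
rmse = math.sqrt(mse)
print(f"Root mean squared error: {rmse:.2f}")
```

The result is a little over 65, compared with an MAE of about 54. RMSE is never smaller than MAE, and a gap between the two suggests that some individual errors are considerably larger than the typical one.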

The value of the coefficient of determination, i.e., the R-squared score of the model

print(f"Coefficient of determination: {r2_score( Y_test, Y_pred )}")  
Coefficient of determination: 0.3276174567207636

Plotting the output

plt.scatter(X_test, Y_test, color = "black", label = "original data")  
plt.plot(X_test, Y_pred, color = "blue", linewidth=3, label = "regression line")  
plt.xlabel("Independent Feature")  
plt.ylabel("Target Values")  
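# The fitted line is y = a*x + b, so the slope must multiply x in the title
titlestr = f'Predicted Function: y = {reg.coef_[0]:.2f}x + {reg.intercept_:.2f}'
plt.title(titlestr)
plt.legend()  # display the labels passed to scatter() and plot()
plt.show()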