Linear Regression

Akashdeep Mahata
5 min read · Feb 25, 2024

Introduction

Linear regression is a foundational concept in statistics and machine learning, serving as a fundamental technique for modelling the relationship between two or more variables. In this comprehensive article, we’ll delve into the intricacies of linear regression, covering its definition, mathematical representation, assumptions, types, and practical applications. Additionally, we’ll explore the process of fitting a linear regression line, including cost function optimization using gradient descent and evaluating model performance using the R-squared method.

Definition

Linear regression is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (predictors) by fitting a linear equation to observed data.

Description

Linear regression aims to establish a linear relationship between the independent variables (X) and the dependent variable (Y) by estimating the coefficients (slope and intercept) of the linear equation. The model assumes that the relationship between the variables is linear, meaning that a change in one variable is associated with a proportional change in the other variable.
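
To make this concrete, here is a minimal sketch that fits a line to invented data with NumPy (every number below is made up for illustration):

```python
import numpy as np

# Invented data: predictor X and target Y with an approximately
# linear relationship plus random noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 3.0 + 2.0 * X + rng.normal(scale=1.0, size=X.shape)

# Fit a degree-1 polynomial: np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(X, Y, deg=1)
print(f"estimated slope = {slope:.2f}, intercept = {intercept:.2f}")
```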

Assumptions

Linear regression relies on several assumptions, including:

  • Linearity: The independent and dependent variables have a linear relationship with one another.
  • Independence: The observations in the dataset are independent of each other.
  • Homoscedasticity: The variance of the errors is constant across all levels of the independent variable.
  • Normality: The errors are normally distributed around the regression line.

For Multiple Linear Regression, all four assumptions from Simple Linear Regression apply. In addition, a few more considerations arise:

  • No Multicollinearity: There should be little or no correlation between the independent variables (a quick check is shown in the sketch after this list).
  • No Overfitting: The model should not fit the training data so closely that it captures noise or random fluctuations rather than the true underlying relationship between variables; an overfit model generalizes poorly to new, unseen data.
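
One common way to check the No Multicollinearity assumption is the variance inflation factor (VIF). Below is a minimal sketch with statsmodels, using invented data in which one predictor is deliberately built from another:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Invented predictors: x3 is built from x1, so they are nearly collinear.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + rng.normal(scale=0.1, size=100)

# Add an intercept column, as VIF assumes a model with a constant term.
X = add_constant(np.column_stack([x1, x2, x3]))

# A VIF well above ~5-10 is a common rule of thumb for multicollinearity;
# expect large values here for x1 and x3.
for i in range(1, X.shape[1]):
    print(f"VIF for predictor {i}: {variance_inflation_factor(X, i):.1f}")
```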

Application

Linear regression is widely used in various fields, including:

  • Economics: Predicting GDP growth based on factors like investment and consumption.
  • Finance: Forecasting stock prices using historical data.
  • Healthcare: Estimating patient outcomes based on clinical variables.
  • Marketing: Predicting sales revenue based on advertising expenditure.

Types

Linear regression can be classified into two main types:

  1. Simple Linear Regression: Involves one independent variable.
  2. Multiple Linear Regression: Involves two or more independent variables.

Graphical and Mathematical Representation

Graphically, linear regression is represented by a straight line that best fits the observed data points.

Mathematically, the linear regression equation is given below for each type.

1. Simple Linear Regression

In simple linear regression, we have one independent variable X and one dependent variable Y. The relationship between X and Y is modelled using a linear equation with two parameters: the slope (β₁) and the intercept (β₀).

The equation for simple linear regression is given by:

Y = β₀ + β₁X + ϵ

where:

  • Y is the dependent variable (target),
  • X is the independent variable (predictor),
  • β₀ is the intercept (y-intercept),
  • β₁ is the slope (regression coefficient),
  • ϵ is the error term.
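
These coefficients can be estimated in closed form by ordinary least squares. A minimal sketch with invented data (variable names are illustrative):

```python
import numpy as np

# Invented sample: one predictor, one target.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# OLS estimates: beta1 = cov(X, Y) / var(X), beta0 = mean(Y) - beta1 * mean(X).
x_mean, y_mean = X.mean(), Y.mean()
beta1 = ((X - x_mean) * (Y - y_mean)).sum() / ((X - x_mean) ** 2).sum()
beta0 = y_mean - beta1 * x_mean
print(f"fitted line: Y ≈ {beta0:.2f} + {beta1:.2f}·X")
```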

2. Multiple Linear Regression

In multiple linear regression, we have multiple independent variables (X₁, X₂, …, Xₙ) and one dependent variable Y. The relationship between the predictors and Y is modelled using a linear equation with n+1 parameters: one intercept (β₀) and n regression coefficients (β₁, β₂, …, βₙ).

The equation for multiple linear regression is given by:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ϵ

where:

  • Y is the dependent variable (target),
  • X₁, X₂, …, Xₙ are the independent variables (predictors),
  • β₀ is the intercept,
  • β₁, β₂, …, βₙ are the regression coefficients,
  • ϵ is the error term.
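
As a sketch, the same model can be fitted with scikit-learn's LinearRegression; the two-predictor data below is invented for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: two predictors (X1, X2) and one target Y.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
Y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X, Y)
print("intercept (beta0):", model.intercept_)
print("coefficients (beta1, beta2):", model.coef_)
```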

Linear Regression Line

The linear regression line is the line that best fits the observed data points, minimizing the sum of squared differences between the observed and predicted values. It is determined by estimating the coefficients, written θ₀ and θ₁ in the optimization notation below (corresponding to β₀ and β₁ above), that minimize the residual sum of squares (RSS).

Finding Best Fit Line

  • Cost Function: The cost function measures the difference between the predicted and actual values of the dependent variable. In linear regression, the most commonly used cost function is the Mean Squared Error (MSE), which calculates the average squared difference between the observed and predicted values:

J(θ₀, θ₁) = (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

where hθ(x⁽ⁱ⁾) is the predicted value for the i-th observation, y⁽ⁱ⁾ is the actual value, and m is the number of observations.
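
Translated directly into code, the cost function might look like this (a sketch; names are illustrative):

```python
import numpy as np

def mse_cost(theta0, theta1, X, Y):
    """Mean squared error of the line h(x) = theta0 + theta1 * x."""
    predictions = theta0 + theta1 * X
    return np.mean((predictions - Y) ** 2)

# A perfect fit has zero cost: here Y is exactly 2 * X.
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])
print(mse_cost(0.0, 2.0, X, Y))  # 0.0
```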

  • Gradient Descent: Gradient descent is an optimization algorithm used to minimize the cost function by iteratively updating the coefficients θ₀ and θ₁ in the direction of the steepest descent:

θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ,  for j = 0, 1

where α is the learning rate, which controls the size of each update step.
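
A minimal sketch of this update loop for simple linear regression (the learning rate and iteration count are arbitrary choices for the example):

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.05, n_iters=1000):
    """Minimize the MSE cost by gradient descent; returns (theta0, theta1)."""
    theta0, theta1 = 0.0, 0.0
    m = len(X)
    for _ in range(n_iters):
        residuals = (theta0 + theta1 * X) - Y
        # Partial derivatives of J(theta0, theta1) = (1/m) * sum(residuals^2).
        grad0 = (2.0 / m) * residuals.sum()
        grad1 = (2.0 / m) * (residuals * X).sum()
        # Simultaneous update in the direction of steepest descent.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# On the same invented data as above, this should approach the OLS solution.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
print(gradient_descent(X, Y))
```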

  • Model Performance — R-squared Method: R-squared (R²) is a statistical measure that represents the proportion of the variance in the dependent variable explained by the independent variables in the model. It ranges from 0 to 1, with higher values indicating a better fit of the model to the data:

R² = 1 − SS_res / SS_tot

where:

  • SS_res is the sum of squared residuals (errors),
  • SS_tot is the total sum of squares.
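
Computed directly from this definition (a sketch; it should agree with sklearn.metrics.r2_score):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Example with invented values: predictions close to the truth give R^2 near 1.
y_true = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
y_pred = np.array([2.0, 4.2, 6.3, 8.0, 10.0])
print(f"R^2 = {r_squared(y_true, y_pred):.3f}")
```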

Conclusion

Linear regression is a powerful and versatile tool for modelling the relationship between variables in various fields. By understanding its principles, mathematical representation, assumptions, and techniques for model fitting and evaluation, practitioners can effectively apply linear regression to analyse data, make predictions, and derive valuable insights in real-world scenarios.
