Linear Regression

  • Linear regression is a statistical method used to model the relationship between two continuous variables. One variable is treated as the explanatory variable (often denoted by x), and the other as the dependent variable (often denoted by y).

  • The basic model of linear regression can be represented by the equation y = mx + c, where ‘y’ is the dependent variable, ‘x’ is the independent or explanatory variable, ‘m’ is the slope of the line (also known as the regression coefficient), and ‘c’ is the y-intercept.

  • The objective of regression is to find the ‘line of best fit’, the line that deviates least from the observed data points. This deviation is usually quantified by the sum of the squares of the residuals, where each residual is the difference between an actual y value and the y value predicted by the line.

  • Understanding the slope is critical in linear regression as it represents the expected change in the dependent variable (y) for a one-unit change in the explanatory variable (x).

  • The y-intercept is the value of y when x = 0. It acts as a ‘baseline’ value for the dependent variable when the independent variable is zero, although it only has a practical interpretation if x = 0 lies within or near the range of the observed data.

  • Several assumptions underpin linear regression: linearity (the relationship between x and y is linear), independence (observations are independent of each other), homoscedasticity (the variance of the residuals is the same for any value of x), normality (for any fixed value of x, y is normally distributed) and, when the model has more than one independent variable, absence of multicollinearity (the independent variables are not too highly correlated with each other).

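  • The coefficient of determination, denoted by R², is an important metric in linear regression. R² measures the proportion of the variability in the dependent variable that is explained by the independent variables in the model; for a least-squares line it lies between 0 and 1, with values closer to 1 indicating a closer fit.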

  • The line of best fit is usually calculated by the method of ‘least squares’, which chooses the slope and intercept that minimise the sum of the squares of the residuals (the vertical deviations from each data point to the line).

  • Linear regression can form the basis for predictive modelling: substituting a new value of the independent variable into the fitted equation gives the predicted value of the dependent variable, which lies on the regression line.

  • Limitations of linear regression include its sensitivity to outlier values (a single outlier can significantly alter the line of best fit) and its inability to model complex, non-linear relationships between variables.

  • It is necessary to check the reliability of a fitted linear regression model by examining residual plots and checking the residuals for normality and homoscedasticity. If the assumptions of the model are violated, you may need to transform the variables or use a non-linear regression model instead.

  • Understanding linear regression conceptually, as well as being able to calculate and interpret it, will be crucial in this component of the syllabus.
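
The calculation described in the points above can be sketched in a few lines of Python. This is only an illustrative sketch: the data values are made up, and the function names fit_line and r_squared are hypothetical helpers written for this example, not part of any library. It fits y = mx + c by least squares, lists the residuals, computes R², and predicts y for a new value of x.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c (hypothetical helper)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations from the means, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx            # slope (regression coefficient)
    c = y_bar - m * x_bar    # y-intercept: the line passes through (x_bar, y_bar)
    return m, c


def r_squared(xs, ys, m, c):
    """Proportion of the variability in y explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))  # residual sum of squares
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total sum of squares
    return 1 - ss_res / ss_tot


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = fit_line(xs, ys)
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]  # actual minus predicted y
r2 = r_squared(xs, ys, m, c)
prediction = m * 6 + c  # predicted y for a new observation with x = 6

print(f"slope m = {m:.2f}, intercept c = {c:.2f}")  # slope m = 0.60, intercept c = 2.20
print(f"R^2 = {r2:.2f}")                            # R^2 = 0.60
print(f"predicted y at x = 6: {prediction:.2f}")    # predicted y at x = 6: 5.80
```

Here the slope of 0.6 means y is expected to rise by 0.6 for each one-unit increase in x, and an R² of 0.6 means the line accounts for 60% of the variability in y. The residuals sum to zero (up to rounding), which is a property of any least-squares line with an intercept. Note that x = 6 lies outside the observed range of x, so the prediction is an extrapolation and should be treated with more caution than a prediction within the range.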