Regression with Linear Algebra

Joseph Jojoe
4 min read · Jun 20, 2022


Linear algebra is a powerful tool in any mathematician’s arsenal; today, we explore how matrices can be used to plot lines and curves of best fit for a given set of data points.

Motivation

Regression is intimately tied to the idea of projection matrices: a projection matrix P maps a vector b onto the space spanned by the columns of a matrix A.

P = A(AᵀA)⁻¹Aᵀ
The formula for the projection matrix P: pre-multiplying a vector b by P projects it onto the column space of a given matrix A.
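To make this concrete, here is a minimal NumPy sketch of the projection; the matrix and vector are borrowed from the worked example later in this post, and the variable names are my own.

import numpy as np

# Columns of A span the subspace we project onto
# (the matrix and vector from the worked example below).
A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
b = np.array([1.0, 1.0, 2.0])

# Projection matrix P = A (A^T A)^(-1) A^T.
P = A @ np.linalg.inv(A.T @ A) @ A.T

# p is the vector in the column space of A that is closest to b.
p = P @ b
print(p)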

For a given set of data points, there may not be a straight line that passes through each point ― it is far more likely that we can only produce a ‘line of best fit’, which minimises the distance between the line and each point of our dataset.

A tentative line of best fit for the points (0,0), (1,1), and (2,1). Notice how the line does not pass through any of the data points.

In matrix form, this means that Ax = b is not solvable, as there is no straight line that can pass through every point and satisfy each equation. Instead, we will project the vector b onto the column space C(A) of A, forming a vector p, and solve for coefficients of our line of best fit which will be able to produce the vector p.

The projection is orthogonal, meaning that b is projected to the point of C(A) which is closest to it. This allows us to assert that the line produced by the coefficients found through this method is indeed the ‘best fit’; the orthogonal projection of b onto C(A), which we name p, is the closest that Ax can get to the original vector b.
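In symbols, writing x̂ for the best-fit coefficient vector introduced below, orthogonality means the error b − Ax̂ is perpendicular to every column of A; this is exactly where the equation we solve later comes from:

A^{\mathsf{T}}(b - A\hat{x}) = 0 \quad\Longrightarrow\quad A^{\mathsf{T}}A\,\hat{x} = A^{\mathsf{T}}b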

Proving that a given line is the ‘best’ line for a dataset, or alternatively finding the coefficients of the line of best fit for a set of points, is the task of regression. We find these coefficients by minimising the error, defined as the sum of the squares of the vertical distances between each point and the line ― producing the lowest value of the sum of the squared distances leads to the name for this method, least squares.

We minimise the sum of the squared errors (shown in red) to produce a line of best fit; this is the foundational principle of least squares regression.
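For a candidate line y = Ax + B and data points (xᵢ, yᵢ), the quantity being minimised can be written as

E(A, B) = \sum_{i} \bigl( y_i - (A x_i + B) \bigr)^2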

Example: Linear Regression with Least Squares

We will now produce a line of best fit of the form y = Ax + B for the given points (1, 1), (2, 1), and (3, 2).

If the line went through all three points, we would have three equations which we could solve simultaneously to find A and B.

A + B = 1
2A + B = 1
3A + B = 2
The system of linear equations formed by our generalised line of best fit equation and the supplied data points.

This is a system of linear equations, which we can now write in the language of linear algebra using matrices and vectors.

[ 1  1 ] [ A ]   [ 1 ]
[ 2  1 ] [ B ] = [ 1 ]
[ 3  1 ]         [ 2 ]
Our system of equations, now in matrix form.

We describe this system as Ax = b; a matrix of coefficients A is applied to a vector of unknowns x, producing a solution vector b. Unfortunately, A is a tall rectangular matrix with more equations than unknowns: it has no inverse and the system has no exact solution, so we cannot solve it directly.

Instead, we solve the related equation:

Ax = b   →   AᵀAx̂ = Aᵀb
The first equation is not solvable, so we instead pre-multiply both sides by the transpose of A and solve the second equation for the best possible values of the solution vector x̂.

AᵀA is a square matrix and invertible as long as A has independent columns; this is true for our matrix A, so we can now multiply both sides of the second equation by the inverse of AᵀA to solve for the coefficients. In fact, this second equation relates to projecting b onto the column space of A; this is what ensures that the coefficients we find will indeed produce the line of best fit.

Note the hat over x (written x̂): this emphasises that while x̂ is our best possible solution, it is still not an exact solution to our original matrix equation. This corresponds to our geometric intuition of a straight line not always being able to pass through every point contained within a dataset.

AᵀAx̂ = Aᵀb becomes

[ 14  6 ] [ A ]   [ 9 ]
[  6  3 ] [ B ] = [ 4 ]

We finally solve this system for our best possible coefficients A and B.

This gives us A = 1/2 and B = 1/3, which we now substitute into our original line of best fit equation, y = Ax + B.

The line of best fit with coefficients A = 1/2 and B = 1/3.

The error at each point is small, and the projection argument above guarantees that the sum of the squared errors is as small as any straight line can achieve. Our least squares linear regression is complete.
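As a sanity check, here is a short NumPy sketch that solves the normal equations for the three points above and, equivalently, calls NumPy's built-in least squares solver; both reproduce A = 1/2 and B = 1/3.

import numpy as np

# Data points (1, 1), (2, 1), (3, 2) from the worked example.
A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
b = np.array([1.0, 1.0, 2.0])

# Solve the normal equations (A^T A) x_hat = A^T b.
coeffs = np.linalg.solve(A.T @ A, A.T @ b)
print(coeffs)  # [0.5, 0.333...] -> A = 1/2, B = 1/3

# The built-in least squares solver gives the same answer.
lstsq_coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
print(lstsq_coeffs)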

Is non-linear regression possible?

Yes! In our example we chose to fit the points with the function y = Ax + B, but we could equally have chosen y = Ax² + Bx + C, or any other function that is linear in its unknown coefficients. Substituting (x, y) pairs into such a function still gives a system of linear equations in those coefficients, so we can use the same least squares method to solve for them (a short sketch follows the figure below).

An example of least squares regression with a quadratic curve of best fit.
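Here is a minimal NumPy sketch of such a quadratic fit; the data points are made up for illustration and are not from the article, and the design matrix simply gains a column of squared x-values.

import numpy as np

# Illustrative data points (not from the article).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.5, 1.5, 4.0, 8.5])

# Each row is [x_i^2, x_i, 1], matching y = Ax^2 + Bx + C.
M = np.column_stack([x**2, x, np.ones_like(x)])

# Solve the normal equations (M^T M) c = M^T y for the coefficients.
A_coef, B_coef, C_coef = np.linalg.solve(M.T @ M, M.T @ y)
print(A_coef, B_coef, C_coef)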

Further Exploration

Lecture 16 of 18.06Sc Linear Algebra ― Projection Matrices and Least Squares

US Particle Accelerator School ― Least Squares Fitting
