- Linear regression
- Cost functions
- Learning / Estimation / Fitting
- Grid search
- Optimization landscapes
- Smooth (differentiable) optimization
- Non-smooth (non-differentiable) optimization
- Constrained optimization
- Implementation issues in gradient methods
- Least squares
- Maximum likelihood
- Overfitting and underfitting
- Model selection
- Logistic regression
- Generalized Linear Models
- Nearest neighbor classifiers and the curse of dimensionality
Support Vector Machines
- Alternative formulation: Duality
- Kernel trick
- Properties of kernels
- Unsupervised learning
- Matrix Factorization
- SVD and PCA
⚠ Work in progress
We’ll always use subscript for data point, and for feature. is the data size and is the dimensionality.
Recommended website: http://www.matrixcalculus.org/
A linear regression is a model that assumes a linear relationship between inputs and the output. We will study three types of methods:
- Grid search
- Iterative optimization algorithms
- Least squares
Simple linear regression
For a single input dimension (), we can use a simple linear regression, which is given by:
are the parameters of the model.
Multiple linear regression
If our data has multiple input dimensions, we obtain multivariate linear regression:
👉🏼 If we wanted to be a little more strict, we should write , as the model of course also depends on the weights.
The tilde notation means that we have included the offset term , also known as the bias:
If the number of parameters exceeds the number of data examples, we say that the task is under-determined. This can be solved by regularization, which we’ll get to more precisely later
is the data, which we can easily understand where comes from. But how does one find a good from the data?
A cost function (also called loss function) is used to learn parameters that explain the data well. It quantifies how well our model does by giving errors a score, quantifying penalties for errors. Our goal is to find weights that minimize the loss functions.
Desirable properties of cost functions are:
- Symmetry around 0: that is, being off by a positive or negative amount is equivalent; what matters is the amplitude of the error, not the sign.
- Robustness: penalizes large errors at about the same rate as very large errors. This is a way to make sure that outliers don’t completely dominate our regression.
Good cost functions
Probably the most commonly used cost function is Mean Square Error (MSE):
MSE is symmetrical around 0, but also tends to penalize outliers quite harshly (because it squares error): MSE is not robust. In practice, this is problematic, because outliers occur more often than we’d like to.
Note that we often use MSE with a factor instead of . This is because it makes for a cleaner derivative, but we’ll get into that later. Just know that for all intents and purposes, it doesn’t change really change anything about the behavior of the models we’ll study.
When outliers are present, Mean Absolute Error (MAE) tends to fare better:
Instead of squaring, we take the absolute value. This is more robust. Note that MAE isn’t differentiable at 0, but we’ll talk about that later.
There are other cost functions that are even more robust; these are available as additional reading, but is not exam material.
A function is convex iff a line joining two points never intersects with the function anywhere else. More strictly defined, a function with is convex if, for any , and for any , we have:
A function is strictly convex if the above inequality is strict ().
A stritly convex function has a unique global minimum . For convex functions, every local minimum is a global minimum. This makes it a desirable property for loss functions, since it means that cost function optimization is guaranteed to find the global minimum.
Sums of convex functions are also convex. Therefore, MSE and MAE are convex.
Learning / Estimation / Fitting
Given a cost function (or loss function) , we wish to find which minimizes the cost:
This is what we call learning: learning is simply an optimization problem, and as such, we’ll use an optimization algorithm to solve it – that is, find a good .
This is one of the simplest optimization algorithms, although far from being the most efficient one. It can be described as “try all the values”, a kind of brute-force algorithm; you can think of it as nested for-loops over the individual weights.
For instance, if our weights are , then we can try, say 4 values for , 4 values for , for a total of 16 values of .
But obviously, complexity is exponential (where is the number of values to try), which is really bad, especially when we can have millions of parameters. Additionally, grid search has no guarantees that it’ll find an optimum; it’ll just find the best value we tried.
If grid search sounds bad for optimization, that’s because it is. In practice, it is not used for optimization of parameters, but it is used to tune hyperparameters.
A vector is a local minimum of a function (we’re interested in the minimum of cost functions , which we denote with , as opposed to any other value , but this obviously holds for any function) if such that
In other words, the local minimum is better than all the neighbors in some non-zero radius.
The global minimum is defined by getting rid of the radius and comparing to all other values:
A minimum is said to be strict if the corresponding equality is strict for , that is, there is only one minimum value.
Smooth (differentiable) optimization
A gradient at a given point is the slope of the tangent to the function at that point. It points to the direction of largest increase of the function. By following the gradient (in the opposite direction, because we’re searching for a minimum and not a maximum), we can find the minimum.
Gradient is defined by:
This is a vector, i.e. . Each dimension of the vector indicates how fast the cost changes depending on the weight .
Gradient descent is an iterative algorithm. We start from a candidate , and iterate.
As stated previously, we’re adding the negative gradient to find the minimum, hence the subtraction.
is known as the step-size, which is a small value (maybe 0.1). You don’t want to be too aggressive with it, or you might risk overshooting in your descent. In practice, the step-size that makes the learning as fast as possible is often found by trial and error 🤷🏼♂️.
As an example, we will take an analytical look at a gradient descent, in order to understand its behavior and components. We will do gradient descent on a 1-parameter model, in which we minimize the MSE, which is defined as follows:
Note that we’re dividing by 2 on top of the regular MSE; it has no impact on finding the minimum, but when we will compute the gradient below, it will conveniently cancel out the .
The gradient of is:
And thus, our gradient descent is given by:
This sequence is guaranteed to converge for (so the solution to this exact problem can be extracted analytically from gradient descent). This would set the cost function to 0, which is the minimum.
The choice of has an influence on the algorithm’s outcome:
- If we pick , we would get to the optimum in one step
- If we pick , we would get a little closer in every step, eventually converging to
- If we pick , we are going to overshoot . Slightly bigger than 1 (say, 1.5) would still converge; would loop infinitely between two points; diverges.
Gradient descent for linear MSE
Our linear regression is given by a line that is a regression for some data :
Our model is:
We define the error vector by:
The MSE can then be restated as follows:
And the gradient is, component-wise:
We’re using column notation to signify column of the matrix .
And thus, all in all, our gradient is:
To compute this expression, we must compute:
- The error , which takes floating point operations (flops) for the matrix-vector multiplication, and for the subtraction, for a total of , which is
- The gradient , which costs , which is .
In total, this process is at every step. This is not too bad, it’s equivalent to reading the data once.
Stochastic gradient descent (SGD)
In ML, most cost functions are formulated as a sum of:
In practice, this can be expensive to compute, so the solution is to pick a random uniformly at random in to be able to make the sum go away.
The stochastic gradient descent is thus:
Why is it allowed to pick just one instead of the full thing? We won’t give a full proof, but the intuition is that:
The gradient of a single n is:
Note that , and . Computational complexity for this is .
But perhaps just picking a single value is too extreme; there is an intermediate version in which we choose a subset instead of points, instead of a single point.
Note that if then we’re performing a full gradient descent.
The computation of can be parallelized easily over GPU threads, which is quite common in practice; is thus often dictated by the number of available threads.
Computational complexity is .
Non-smooth (non-differentiable) optimization
We’ve defined convexity previously, but we can also use the following alternative characterization of convexity:
Meaning that the function must always lie above its linearization (which is the first-order Taylor expansion) to be convex.
A vector such that:
is called a subgradient to the function at . The subgradient forms a line that is always below the curve, somewhat like the gradient of a convex function.
This definition is valid even for an arbitrary that may not be differentiable, and not even necessarily convex.
If the function is differentiable at , then the only subgradient at is .
This is exactly like gradient descent, except for the fact that we use the subgradient at the current iterate instead of the gradient:
For instance, MAE is not differentiable at 0, so we must use the subgradient.
Here, is somewhat confusing notation for the set of all possible subgradients at our position.
For linear regressions, the (sub)gradient is easy to compute using the chain rule.
Let be non-differentiable, differentiable, and . The chain rule tells us that, at , our subgradient is:
Stochastic subgradient descent
This is still commonly abbreviated SGD.
It’s exactly the same, except that is a subgradient to the randomly selected at the current iterate .
|Full gradient descent||Gradient of
|Stochastic gradient descent||Gradient of||Subgradient of|
Sometimes, optimization problems come posed with an additional constraint.
We’ve seen convexity for functions, but we can also define it for sets. A set is convex iff the line segment between any two points of lies in . That is, , we have:
This means that the line between any two points in the set must also be fully contained within the set.
A couple of properties of convex sets:
- Intersection of convex sets is also convex.
- Projections onto convex sets are unique (and often efficient to compute).
Projected gradient descent
When dealing with constrained problems, we have two options. The first one is to add a projection onto in every step:
The rule for gradient descent can thus be updated to become:
This means that at every step, we compute the new normally, but apply a projection on top of that. In other words, if the regular gradient descent sets our weights outside of the constrained space, we project them back.
This is the same for stochastic gradient descent, and we have the same convergence properties.
Note that the computational cost of the projection is very important here, since it is performed at every step.
Turning constrained problems into unconstrained problems
If projection as described above is approach A, this is approach B.
We use a penalty function, such as the “brick wall” indicator function below:
We could also perhaps use something with a less drastic error value than , if we don’t care about the constraint quite as extreme.
Note that this is similar to regularization, which we’ll talk about later.
Now, instead of directly solving , we solve for:
Implementation issues in gradient methods
When is zero (or close to zero), we are often close to the optimum.
If the second order derivative is positive (or positive semi-definite for the general case ), then it is a (possibly local) minimum. If the function is also convex, then this is necessarily a global minimum. That is:
If is too big, we might diverge (as seen previously). But if it is too small, we might be very slow! Convergence is only guaranteed for , which is a value that depends on the problem.
In some rare cases, we can take an analytical approach to computing the optimum of the cost function, rather than a computational one; for instance, for linear regression with MSE, as we’ve done previously. These types of equations are sometimes called normal equations. This is one of the most popular methods for data fitting, called least squares.
How do we get these normal equations?
First, we show that the problem is convex. If that is the case, then according to the optimality conditions for convex functions, the point at which the derivative is zero is the optimum:
This gives us a system of equations known as the normal equations.
Single parameter linear regression
Let’s try this for a single parameter linear regression, with MSE as the cost function.
First, we will just accept that the cost function is convex in the parameter.
As proven previously, we know that for the single parameter model, the derivative is:
This means that the derivative is 0 for . This allows us to define our optimum parameter as .
Multiple parameter linear regression
As we know by now, the cost function for linear regression with MSE is:
Where the matrices are defined as:
We denote the row of by . Each represents a different data point.
We claim that this cost function is convex in . We can prove that in any of the following ways:
The cost function is the sum of many convex functions, and is thus also convex.
Directly verify the definition
The left-hand side of the inequality reduces to:
which is indeed non-positive.
Compute the Hessian
The Hessian is the matrix of second derivatives, defined as follows:
If the Hessian is positive semidefinite (i.e. all its eigenvalues are non-negative), then the function is convex.
For our case, the Hessian is given by:
This is indeed positive semi-definite, as its eigenvalues are the squares of the eigenvalues of , and must therefore be positive.
Knowing that the function is convex, we can find the minimum. If we take the gradient of this expression, we get:
We can set this to 0 to get the normal equations for linear regression, which are:
This proves that the normal equations for linear regression are given by .
The above definition of normal equations are given by . How can visualize that?
The error is given by:
By definition, this error vector is orthogonal to all columns of . Indeed, it tells us how far above or below the span our prediction is.
The span of is the space spanned by the columns of . Every element of the span can be written as for some choice of .
For the normal equations, we must pick an optimal for which the gradient is 0. Picking an is equivalent to picking an optimal from the span of .
But which element of shall we take, which one is the optimal one? The normal equations tell us that the optimum choice for , called is the element such that is orthogonal to .
In other words, we should pick to be the projection of onto .
All we’ve done so far is to solve the same old problem of a matrix equation:
But we’ve always done so with a bit of a twist; there may not be an exact value of satisfying exact equality, but we could find one that gets us as close as possible:
This is also what least squares does. It attempts to minimize the MSE to get as close as possible to .
In this course, we often denote the data matrix as , the weights as , and as ; in other words, we’re trying to solve:
In least squares, we multiply this whole equation by on the left. We attempt to find , the minimal weight that gets us as minimally wrong as possible. In other we’re trying to solve:
One way to solve this problem would simply be to invert the matrix, which in our case is :
As such, we can use this model to predict values for unseen data points:
Invertibility and uniqueness
Note that the Gram matrix, defined as , is invertible if and only if has full column rank, or in other words, .
Unfortunately, in practice, our data matrix is often rank-deficient.
- If , we always have (since column and row rank are the same).
If , but some of the columns are collinear (or in practice, nearly collinear), then the matrix is ill-conditioned. This leads to numerical issues when solving the linear system.
To know how bad things are, we can compute the condition number, which is the maximum eigenvalue of the Gram matrix, divided by the minimum See course contents of Numerical Methods.
If our data matrix is rank-deficient or ill-conditioned (which is practically always the case), we certainly shouldn’t be inverting it directly! We’ll introduce high numerical errors that falsify our output.
That doesn’t mean we can’t do least squares in practice. We can still use a linear solver. In Python, that means you should use
np.linalg.solve, which uses a LU decomposition internally and thus avoids the worst numerical errors. In any case, do not directly invert the matrix as we have done above!
Maximum likelihood offers a second interpretation of least squares, but starting with a probabilistic approach.
A Gaussian random variable in has mean and variance .
For a Gaussian random vector (instead of a single random variable), with mean and covariance (which is positive semi-definite) is:
Remember that .
As another reminder, two variables and are said to be independent when .
A probabilistic model for least squares
We assume that our data is generated by a linear model , with added Gaussian noise :
This is often a realistic assumption in practice.
The noise is for each dimension . In other words, it is centered at 0, has a certain variance, and the error in each dimension is independent of that in other dimensions. The model is, as always, unknown.
Given samples, the likelihood of the data vector given the model and the input (where each row is one input) is:
Intuitively, we’d like to maximize this likelihood over the choice of the best model . The best model is the one that maximizes this likelihood.
Defining cost with log-likelihood
The log-likelihood (LL) is given by:
Taking the log allows us to get away from the nasty product, and get a nice sum instead.
Notice that this definition looks pretty similar to MSE:
Note that we would like to minimize MSE, but we want LL to be as high as possible (intuitively, we can look at the sign to understand that).
Maximum likelihood estimator (MLE)
Maximizing the log-likelihood (and thus the likelihood) will be equivalent to minimizing the MSE; this gives us another way to design cost functions. We can describe the whole process as:
The maximum likelihood estimator (MLE) can be understood as finding the model under which the observed data is most likely to have been generated from (probabilistically). This interpretation has some advantages that we discuss below.
Properties of MLE
MLE is a sample approximation to the expected log-likelihood. In other words, if we had an infinite amount of data, MLE would perfectly be equal to the expected value of the log-likelihood.
This means that MLE is consistent, i.e. it gives us the correct model assuming we have enough data. In probability, we can write this as:
This sounds amazing, but the catch is that this all is under the assumption that the noise indeed was generated under a Gaussian model.
Overfitting and underfitting
Underfitting with linear models
Linear models can very easily underfit; as soon as the data itself is given by anything more complex than a line, fitting a linear model will underfit: the model is too simple for the data, and we’ll have huge errors.
But we can also easily overfit, where our model learns the specificities of the data too intimately. And this happens quite easily with linear combination of high-degree polynomials.
Extended feature vectors
We can actually get high-degree linear combinations of polynomials, but still keep our linear model. Instead of making the model more complex, we simply “augment” the input to become degree . If the input is one-dimensional, we can add a polynomial basis to the input:
Note that this is basically a Vandermonde matrix.
We then fit a linear model to this extended feature vector :
Here, . In other words, there are parameters in a degree extended feature vector. One should be careful with this degree; too high may overfit, too low may underfit.
If it is important to distinguish the original input from the augmented input then we will use the notation. But often, we can just consider this as a part of the pre-processing, and simply write as the input, which will save us a lot of notation.
To reduce overfitting, we can chose a less complex model (in the above, we can pick a lower degree ), but we could also just add more data:
To prevent overfitting, we can introduce regularization to penalize complex models. This can be applied to any model.
The idea is to not only minimize cost, but also minimize a regularizer:
The function is the regularizer, measuring the complexity of the model. We’ll see some good candidates for the regularizer below.
-Regularization: Ridge Regression
The most frequently used regularizer is the standard Euclidean norm (-norm):
Where . The value of will affect the fit; can have overfitting, while can have underfitting.
The norm is given by:
The main effect of this is that large model weights will be penalized, while small ones won’t affect our minimization too much.
Depending on the values we choose for and , we get into some special cases. For instance, choosing MSE for is called ridge regression, in which we optimize the following:
Least squares is also a special case of ridge regression, where
We can find an explicit solution for in ridge regression by differentiating the cost and regularizer, and setting them to zero:
We can now set the full cost to zero, which gives us the result:
Where . Note that for , we indeed have the solution least squares.
Ridge regression to fight ill-conditioning
This formulation of is quite nice, because adding the identity matrix helps us get something that always is invertible; in cases where we have ill-conditioned matrices, it also means that we can invert with more stability.
We’ll prove that the matrix indeed is invertible. The gist is that the eigenvalues of are all at least .
To prove it, we’ll write the singular value decomposition (SVD) of as . We then have:
The singular value is “lifted” by an amount . There’s an alternative proof in the class notes, but we won’t go into that.
-Regularization: The Lasso
We can use a different norm as an alternative measure of complexity. The combination of -norm and MSE is known as The Lasso:
Where the -norm is defined as
If we draw out a constant value of the norm, we get a sort of “ball”. Below, we’ve graphed .
To keep things in the following, we’ll just claim that is invertible. We’ll also claim that the following set is an ellipsoid which scales around the origin as we change :
The slides have a formal proof for this, but we won’t get into it.
Note that the above definition of the set corresponds to the set of points with equal loss (which we can assume is MSE, for instance):
Under these assumptions, we claim that for regularization, the optimum solution will likely be sparse (many zero components) compared to regularization.
To prove this, suppose we know the norm of the optimum solution. Visualizing that ball, we know that our optimum solution will be somewhere on the surface of that ball. We also know that there are ellipsoids, all with the same mean and rotation, describing the equal error surfaces. The optimum solution is where the “smallest” of these ellipsoids just touches the ball.
Due to the geometry of this ball this point is more likely to be on one of the “corner” points. In turn, sparsity is desirable, since it leads to a “simple” model.
As we’ve seen in ridge regression, we have a regularization parameter that can be tuned to reduce overfitting by reducing model complexity. We say that the parameter is a hyperparameter.
We’ve also seen ways to enrich model complexity, like polynomial feature expansion, in which the degree is also a hyperparameter.
We’ll now see how best to choose these hyperparameters; this is called the model selection problem.
We assume that there is an (unknown) underlying distribution producing the dataset, with range . The dataset we see is produced from samples from :
Based on this, the learning algorithm choses the “best” model, under the parameters of the algorithm.
We write , where denotes the learning algorithm. It depends on the data subset we’re given, and is the resulting prediction of our model.
To indicate that sometimes depend on hyperparameters, we’ll write .
Training Error vs. Generalization Error
Given a model , how can we asses if is any good? We already have the loss function, but its result is highly dependent on the error in the data, not to how good the model is. Instead, we can compute the expected error over all samples chosen according to .
Where is our loss function; e.g. for ridge regression, .
The quantity has many names, including generalization error (or true/expected error/risk/loss). This is the quantity that we are fundamentally interested in, but we cannot compute it since is unknown.
What we do know is the data subset . It’s therefore natural to compute the equivalent empirical quantity, which is the average loss:
But again, we run into trouble. The function is itself a function of the data , so what we really do is to compute the quantity:
is the trained model. This is called the training error. Usually, the training error is smaller than the generalization error, because overfitting can happen (even with regularization, because the hyperparameter may still be too low).
Splitting the data
To avoid validating the model on the same data subset we trained it on (which is conducive to overfitting), we can split the data into a training set and a test set (aka validation set), which we call and .
We apply the learning algorithm on the training set , and compute the function . We then compute the error on the test set:
If we have duplicates in our data, then this could be a bit dangerous. Still, in general, this really helps us with the problem of overfitting since is a “fresh” sample, which means that we can hope that defined above is close to the quantity . Indeed, in expectation both are the same:
This is a quite nice property, but there are a few limits. First, we paid a price by splitting the data and thus reducing the size of our training data, though this can be mediated using cross-validation, which we’ll see later.
Generalization error vs test error
Assume that we have a model and that our loss function is bounded in . We are given a test set chosen i.i.d. from the underlying distribution .
How far apart is the test error (empirical) from the true generalization error? As we’ve seen above, they are the same in expectation. But we need to worry about the variation, about how far off from the true error we typically are:
We claim that:
Where is a quality parameter. This gives us an upper bound on how far away our empirical loss is from the true loss.
This bound gives us some nice insights. Error decreases in the size of the test set as , so the more data points we have, the more confident we can be in the empirical loss being close to the true loss.
We’ll prove . We assumed that each sample in the test set is chosen independently. Therefore, given a model , the associated losses are also i.i.d. random variables, taking values in by assumption. We can call each such loss :
This is just a naming alias; since the underlying value is that of the loss function, the expected value of is simply that of the loss function, which is the true loss:
The empirical loss on the other hand is equal to the average of such i.i.d. values.
The formula of gives us the probability that empirical loss diverges from the true loss by more than a given constant, which is a classical problem addressed in the following lemma (which we’ll just assert, not prove).
Chernoff Bound: Let be a sequence of i.i.d random variables with mean and range . Then, for any :
Using we can show . By setting , we find that as claimed.
Our main goal was to look for a way to select the hyperparameters of our model. Given a finite set of values for of a hyperparameter , we can run the learning algorithm times on the same training set , and compute the prediction functions . For each such prediction function we compute the test error, and choose the which minimizes the test error.
This is essentially a grid search on using the test error function.
Model selection based on test error
How do we know that, for a fixed function , is a good approximation to ?
The answer to this follows the same idea as when we talked about generalization vs test error, but we now assume that we have models for . We assume again that the loss function is bounded in , and that we’re given a test set whose samples are chosen i.i.d. in .
How far is each of the (empirical) test errors from the true ? As before, we can bound the deviation for all candidates, by:
A bit of intuition of where this comes from: for a general , if we check the deviations for independent samples and ask for the probability that for at least one such sample we get a deviation of at least (this is what the Chernoff bound answers). Then by the union bound this probability is at most times as large as in the case where we are only concerned with a single instance. Thus the upper bound in Chernoff becomes , which gives us as above.
As before, this tells us that error decreases in . Now that we test hyperparameters, our error only goes up by a tiny amount of . This follows from , which we proved for the special case of .
Splitting the data once into two parts (one for training and one for testing) is not the most efficient way to use the data. Cross-validation is a better way.
K-fold cross-validation is a popular variant. We randomly partition the data into groups, and train times. Each time, we use one of the groups as our test set, and the remaining groups for training.
To get a common result, we average out the results. This means we’ll use the average weights to get the average test error over the folds.
Cross-validation returns an unbiased estimate of the generalization error and its variance.
When we perform model selection, there is an inherent bias–variance trade-off.
For now, we’ll just look at “high-bias & low-variance” models, and “high-variance & low-bias” models.
- High-bias & low-variance: the model is too simple. It’s underfit, has a large bias, and and the variance of is small.
- High-variance & low-bias: the model is too complex. It’s overfit, has a small bias and large variance of (as a single addition of a data point is likely to change the prediction function considerably)
Consider a linear regression with one-dimensional input and polynomial feature expansion of degree . The former can be achieved by picking a too low value for , while the latter by picking too high. The same principle applies for other parameters, such as ridge regression with hyperparameter .
Data generation model
Let’s assume that our data is generated by some arbitrary, unknown function , and a noise source with distribution (i.i.d. from sample to sample, and independent from the data). We can think of representing the precise, hypothetical function that perfectly produced the data. We assume that the noise has mean zero (without loss of generality, as a non-zero mean could be encoded into ).
We assume that is generated according to some fixed but unknown distribution . We’ll be working with square loss as our loss function . We will denote the joint distribution on pairs as .
As always, we have a training set , which consists of i.i.d. samples from . Given our learning algorithm , we compute the prediction function . The square loss of a single prediction for a fixed element is given by the computation of:
Our experiment was to create , learn