10b Machine Learning: LASSO Regression

Hey howdy everyone. I'm michael perch i'm an associate professor at the university of texas at austin and continuing on with my course on subsurface machine learning we have previously discussed all of the or many of the concepts that are required as prerequisites before we do machine. Learning things like feature engineering probability theory and so forth. Now what we'll do and we've talked about some inferential methods we just started into predictive methods covered linear regression talked about Ridge regression. I'll do a little bit of recap of that and we'll dive in today to the last su or is it the last Sol. Well let's talk about that okay so let's go ahead and dive into the concept of something la. SS Oh regression. So what's the motivation. Why cover this topic anyway. Well first of all we're still want to cover methodologies that are simple enough high degree of interpretability it builds on linear and ridge regression has a linkage once again just like Ridge regression to variance and biased trade-off. Which is super helpful. And we want to enforce the concept of hyper parameter tuning that's essential we want to also have a chance to talk about feature selection while performing prediction and that's super super interesting not too useful thing. So let's cover la. SS Oh regression. So let's go back and just recap linear regression remember. Our linear regression model was quite simple. It is um we will take a linear combination of our predictor features. X alphas going from 1 through m predictor features. We have the parameters we're working with are the B alphas 1 through m plus an intercept term. And we're going to predict our response feature now. So let's we're going to do this under the constraint of an l2 normalization in other words. We're going to look at the sum of the squares of the error at all the training data locations training data. I 1 through n training data and this is the model right here just shown right there so this is going to be our prediction.

And so we're gonna get the sum of these square root errors okay so residual sum of squared errors is what we're going to minimize. That's what we do with the linear regression model now. People may suggest and this is kind of funny because linear regression model is quite simple. We don't usually think of it being very sensitive but in fact for some class of problems it may be considered too sensitive and this is usually when we have sparse data sets that have high variance for a single training set. There's only one possible linear regression representation so we want to have a methodology that allows us some opportunity to to fit or I should say Tuna hyper parameters have some choice about model complexity or control on model variance. And so that motivates us to move towards the ridge regression and rich regression adds a new term. It's a regularization term. We know it as the shrinkage penalty which basically says go ahead and we want to fit or minimize the squared error at the training data locations. But we'll put a penalty on the size of the parameters in other words who want to encourage the model or machine to have parameter magnitudes that are closer to zero that are smaller. And so we'll have a lambda hyper parameter. That's going to determine how important that is. Relative to the problem of minimizing the residual sum of squares of the error. So this is known as the shrinkage method. And what's the impact of it. Well we can look at it. Graphically it's going to create increased model bias. We can see that graphically linear regression in regular linear regression in trying to maximize the fit or minimize the error at the training data locations could be shown as the blue line right here the Ridge regression approach is going to result in a model that is not OP. In other words it does increase the bias. We could see this difference between the models would be increased bias but what do we gain while we gain something really important.

We're gonna find out as we go along this course that in fact model variance is huge and many things that we do to try to improve accuracy are all about decreasing model variants. And so what this does is you can imagine if I was to do a bootstrap of multiple realizations of the training data that I would get multiple models that they would have different slopes the parameters that are trained for each of the data sets would vary. And because of the fact that we're working with l2 normalization we might see quite a bit of fluctuation even with this very simple model but by putting in this regularization term what will happen is we will in fact dampen that sensitivity to the actual training data and therefore and decrease the model variance now what we hope with the model variance and bias. Trade-off is in fact what we're going to do is we're going to while we may in fact lose in the from the perspective of increasing model bias in fact we'll be moving this direction as we increase our lambda parameter will increase model bias. But we hope the decrease in model variance is sufficient to offset that loss and the result is that we hope that we end up to decrease the overall expected test mean squared error in fact what. I should say is improve the performance in making predictions at unseen locations in in other words in the real world use of the model. Now so give them we've done a little bit of a recap on the regression linear regression on the ridge. Regression approach using a shrinkage term to try to decrease model variance improve accuracy the model bias and variance trade-off now. Let's go ahead and talk about la. SS oh okay so now why. I'm a little bit hesitant about pronouncing it because I'm Canadian and Canadians sometimes struggle with how to pronounce things because we're kind of a little bit British and we're kind of a little bit American and so I wondered when I start to use this term in my class well I was struggling with and I realized it's because of my background the American pronunciation is last sooo last soo the British pronunciation is last so last so so you decide how to pronounce it.

I'm gonna call it last so from now on because my canadian heritage no anyway my mom was British too so you know I guess that makes me British so last so is actually an acronym. And that's why most of the time in the notes. I'm using capital letters it means least absolute shrinkage and selection operator. I think they kind of reach for that one. But that's okay I get where it's a regression analysis method that performs. This is the super cool thing. And this is the first time we're going to show a prediction method to does this know not only regularization to try to dampen model variants but also a variable selection at the same time and. I'm going to show you how it does that. That's super cool. Now it's all about getting enhanced model accuracy. We know that we need to prove that model accuracy but the feature selection actually provides us with a really cool opportunity for improved interpretability. So what are we gonna do. How do we get a different method. We're gonna build off of ritual. Gresham and with last solo regression. What we're gonna do is we're going to just simply change the shrinkage penalty not to use an l2 norm this square but to use an l1. So remember with shrinkage what we had was this squaring. So we've lambda multiplied by the weighting the sum of the squares of the coefficients instead. What we'll do is we'll do the lambda times the sum of the absolute values of the coefficients. So this is the only difference here. Is that squared term going to absolute. And that's how we get the last so approach okay so now once again just recall just the difference between the l1 and l2 norm. The l2 norm was known as a Euclidean distance or Euclidean norm. And it's all based on these square and if you were to replace the deltas in each one of the features or the coefficients with a X my Delta this would be a measure of regular Euclidean distance and we have a generous generalized.

P norm that we could be working with. Now what we're gonna do is we're gonna go ahead and just work with P equal to 1 which is the l1. Now what's the difference just to recap on this. The difference between the l1 and the l2 norm. What the l1 norm. We get a solution that is more robust in other words more resistant outliers but it is more unstable it will tend to have jumps as we change the lambda parameter or the training data. Now there were the others concepts that. I mentioned before I won't recap all those right now. But that's sufficient right now for our discussion. Go back to lecture 9 if you want to hear more about norms. I have on recorded lecture just about norms. Check that out okay. So robustness stability. The robustness is defined as the insensitivity to outliers in the data l1 is better and we can safely be less concerned about outliers. L2 is better if the outliers are important in other words if we if we actually feel that the outliers are meaningful. We don't necessarily want to dampen them just like in the case of working with the pearson product-moment correlation coefficient versus the Spearman rank correlation coefficient. If the outliers are meaningful. We don't want to dampen that information right. We're just might be sweeping something under the rug. Stability and solutions l1 for a small horizontal change in the data can in fact have a massive jump in the model solution this slope may just switch and so there's like these tipping points and the model will just change. Okay so why what causes that. Like how would we explain that and this is one of the best figures. I found looking around that too in order in order to explain this idea. I just noticed that. I'd forgotten to put the citation to the website so appreciation to the blogpost that's available at this address were had adapted this figure from. I took their figure and just redrew something. A little less complicated that would fit on my slide here.

Ok so basically what's happening you can imagine that if I was working in a system where I'm trying to find the shortest path between this point and this point the l2 solution has only one solution. The straight line would be the solution but if you look at an l1 solution the l1 solution is behaving like a city block solution in other words as long as it is simply taking the sum of the differences. And if you look at every one of these paths this is solution number one solution number two solution number three they would all have the same l1 norm their distance would all be the same ok so now imagine if I take this training data and I just move it slightly. What could happen is I could have a big jump where suddenly I go from solution 3 if I move just a little bit this way maybe solution 1 becomes a better solution or if I move just a little bit up like this you can imagine that there will be instabilities. They can jump very quickly between dramatically different solutions. Now this a very simplified way of visualizing the problem now imagine working a very high dimensionality under this constraint of city block type of distances and then you understand how we could end up with instabilities. I'll make a couple of comments right now but analytical solutions but then I'll speak a little bit later about the idea of training the parameters for this model. The l1 norm does not have an analytical solution because it is non differentiable it's a piecewise function. It's including that absolute value. So we'll require some type of numerical solution scheme and approximation. L2 does have a direct closed-form analytical solution. So we benefit from that sparsity is also really interesting we compare the two methodologies l1 l2 the sparsity is the property of having coefficients that are either 0 or are not near zero in other words. If we don't have sparsity we have a bunch of values that go close to zero but you never really get to zero or and so l1 does remove features in other words that as we increase the lambda it's going to in fact shrink feature weights or I should say the slope parameters right to 0 they'll go right to 0 and then they 0 and that feature has no impact on the model.

That's where we get into model feature selection while building the model and now l2 will shrink the coefficients near 0 as lambda increases. And but you will still have lower sparsity. Excuse the spelling mistake there so let's visualize that. Let's see the difference between these norms. And this is a common visualization and found online. I believe James and all in their book they have this type of display too and many authors have shown these types of displays. So let's compare the ridge regression l2 regularization against the lasso so for the same regularization cost will have different shapes in the parameter space. So let's go on b1 b2 parameter space. We're putting constraints in the last so that the we're setting basically that the absolute value of B 1 plus B 2 must be less than or equal to s for a specific lambda value. That's effectively what we're doing now. Consider that for the l2 regularization. That's the b1 squared. Plus the b2 squared less than or equal to s a constraint and the result will be a circle solution space in the case of the last so it's going to be this type of square oriented with the corners on each one of the particular features. Now imagine that shape in a hyper dimensional space. Now if this was the least square solution combination of the parameters that we would have found with regular linear if we had a large enough s in both cases in other words of lambda parameter was low enough and had you know enough as too little influence to change us from getting that solution. Then we would just we would get that we would get back to having linear regression but now what would happen is as we increase the lambda parameter well as we increase the lambda parameter.

We can't get to that. Optimum solution the least squares fit and so what will happen is we can imagine will start to deviate from that and we could actually draw these shapes right here and they would represent ISO squared error contours in other words. We're accepting more error because of the constraint on the parameters the regularization constraint and so what will happen is if this is ultimately our constraint that this is only as far as we can go under the last so constraint you could grow these contours out. Keep moving away from the optimal solution and guess what's going to happen. The last so the very first solution we find is going to hit a corner with a very high likelihood it's going to be one of these corners and if you look at the corner what it is it's a solution for which the B one parameter is zero Doh and the B two parameter would have this negative value down here and along this axis now we look at a circle type of solution set to our regularization. What will happen is. We're more likely to intersect along. One of these arcs here. And we're not going to be zeroing out our individual parameters for a model so this is interesting. The last so is going to zero our parameters. This is really really cool now. Imagine that's happening in a hyper dimensional space and I've drawn it in a very simplified space now once again what's the impact of our regularization just like Richburg regression is going to increase the bias. The very best model the least biased linear model. We can work with is the linear regression model. Because it's going to minimize the error at the training data locations. We're going to introduce new bias. Because we're in fact in a flat node or model decrease the slope terms and so this is going to be the bias in our model now but what we gain is going to be reduced model variance and this is super cool. We'll see later on when we get into ensemble methods that all we're doing putting multiple estimators together is all about trying to attack model variance.

That's usually the biggest impact on as far as decreasing our prediction accuracy. And so what we'll gain is you can imagine if we bootstrap the training data and we created multiple models. It's basically a form of bagging with our linear regression. Model that we would have and we'll talk about bagging later. We would have multiple models. They'd all have different slope terms. But if we use regularization what we're going to do is we're going to decrease the sensitivity of our model to the training data therefore we decrease model variance variance in the model now once again. We're going to do that with a shrinkage penalty term. We've just already discussed that. We showed that it's going to be a lambda hyper parameter applied to the sum of the absolute values and just like with Ridge regression if we were to have the lambda term basically decrease until zero. We're at a point where this term just disappears and we effectively have we're just can have regular residual sum squares and our loss function and that'll be just regular linear regression if we if lambda approaches a very large number the solution will approach a global mean because as lambda goes to a very large number. We are shrinking all the model parameters so these parameters right here go to zero these parameters go to zero right here and the result is that we are left with just minimizing the squared difference between the training data and a constant value and the very best constant value would come up with to minimize that squared error in fact is going to be the mean. And we know an expectation to minimize the squared residual. The best value. We use is the mean okay so now. I mentioned the fact that this approach has a really cool aspect to it and that is from the standpoint of feature selection so let's go back to let's produce a plot and what this plot is is we're solving a multi regression with rich regression. Right here with lambda hyper parameter and with the last so down here with our lambda hyper parameter now.

Notice what you put in here different scales for the lambda parameters. I've used the log scale here a log scale here but I use different scales to kind of highlight specific behaviors and forms. And so what we can see in. General is that if we use a lambda parameter. That's very very small. We're going to basically have linear regression. There's no real impact at lambda parameter at some point. We're gonna see that we start to impact the parameters of the model and they start to all shrink towards zero now notice they cut they all shrink basically together for the most part. They're all kind of shrinking together towards zero. Now what's really interesting is if we go ahead and we look at the behavior of the last so what actually happens. Is that if we look in detail here they don't shrink together. What actually is happening here and if you look here the lambda parameter we're talking about is 10 to negative 2. That would be right here on this scale. We can already see that. Some of the individual parameters are already zeroing note first of all acoustic impedance then total organic carbon brittleness log permeability and finally porosity at the very end a large enough lambda parameter once again. We've reached the point where you just can estimate the global mean all the parameters are just a 0 a note so this is really cool. We can cease the sequential order over which our individual predictor features are being zeroed out and we could use that as a scheme given the assumption of a linear model and so forth all the other assumptions of homoscedasticity and so forth. We could use that as a feature selection method. What's our most important feature the last one to be 0 doe the first feature to be 0 doe acoustic impedance is the least important feature and so we can use that. We have the order now porosity. It's most important than log permanent brittleness toc and finally acoustic impedance.

Now let me just make a couple of comments around the last so and training the model parameters I want to reinforce this point this important when we switch to l1 regularization once again. We don't have a analytical close form solution any longer we showed in the ridge regression lecture that we can actually derive using matrix math but we could derive the actual methodology to calculate the and now there was a method. There was a matrix inversion but to calculate the parameters and we said that we had confidence that in general that matrix would be invertible. And so we'd be able to calculate those parameters using matrix math. Now we don't have that closed form solution for the last. So and we're now going to rely on a numerical solution scheme like gradient descent. We'll talk much more about gradient descent when we get into gradient boosting later on in this course but there has been a recent paper that demonstrate the last so solution is unique for any number of features and for any number of features that actually have weights. P in this case would be features that remain for a lambda value in other words haven't been zeroed out because remember this is equivalent to feature selection so just like we did with dimensionality reduction in feature selection. Emma was the total number of features and P were the remaining selected features in the case of projection feature projection. There were the number of say principal components. We've worked with now. I should comment. This has been shown that unique for the case in which all features are continuous. If some of your predictor features are categorical. You will you may lose this attribute all right so we will have some hands-on. I have an example in Python for working with the last so but this is the end of the lecture right now so if you want to check that out that's on github on my github repository under a Python numerical demos under juice. That's guy is the name of the account.

I hope that this lecture has been useful to you. I really enjoyed covering this topic. I think it's essential that we cover this topic. As far as understanding the overall idea of regularization feature selection and working with simple highly interpretive methods and then building up alright so. I am Michael perch. I'm an associate professor at the University of Texas at Austin. I work in data analytics geo statistics and machine learning with my group of PhD students we are the Texas Center for data analytics and geostatistics. I share all of my lectures online on YouTube and all my worked out examples are on github and I share a lot of good information on Twitter. Alright take care everyone bye.