How do I build a Bayesian Logistic Regression model?
Welcome to another episode of the probabilistic programming primer. I'm going to talk about how to build a Bayesian logistic regression model, based on the work of J. Benjamin Cook, and the question is: how likely am I to make more than 50,000 US dollars? We're going to use the adult dataset, which comes from the UC Irvine machine learning repository, and we're going to do a little bit of feature engineering. We exclude anybody for whom we don't have any income data, we restrict the data to the US, and we create a binary variable for income. We're going to look only at age, education, hours worked per week, and age squared as our features, or covariates. We can see that we have a bit of class imbalance here, so we will use a simple model which assumes that the probability of making more than $50K is a function of age, years of education, and hours worked per week. We will use PyMC3 to do inference.

For those of you who do not know Bayesian statistics: we treat everything as a random variable, and we want to know the posterior probability distribution of the parameters, which in this case are the regression coefficients. The posterior is equal to the likelihood times the prior, divided by a normalising factor (there's a small error in the slide there, by the way). That denominator is an integral which we'd prefer to skip computing, and fortunately, if we draw samples from the parameter space with probability proportional to the height of the posterior at any given point, we end up with an empirical distribution that converges to the posterior as the number of samples approaches infinity. What this means in practice is that we only need to worry about the numerator. So, getting back to logistic regression.
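The preprocessing steps described above (drop missing income, restrict to the US, binarise income, add age squared) can be sketched in pandas. This is a minimal sketch: the column names and the tiny in-memory frame are stand-ins for the real UCI adult CSV, which you would normally download and load yourself.

```python
import pandas as pd

# Stand-in for the UCI adult dataset; column names here are assumptions,
# the real file has its own header conventions.
raw = pd.DataFrame({
    "age": [39, 50, 38, 28],
    "native_country": ["United-States", "United-States", "Cuba", "United-States"],
    "educ": [13, 13, 9, 16],
    "hours": [40, 13, 40, 45],
    "income": [" <=50K", " >50K", None, " >50K"],
})

# Exclude anybody for whom we have no income data
data = raw.dropna(subset=["income"])

# Restrict the data to the US
data = data[data["native_country"] == "United-States"].copy()

# Binary target: 1 if income is greater than $50K, 0 otherwise
data["income_gt_50k"] = (data["income"].str.strip() == ">50K").astype(int)

# Covariates: age, education, hours, plus an age-squared term
data["age2"] = data["age"] ** 2
```

With the real dataset the same four operations apply unchanged; only the loading step differs.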
We need to specify a prior and a likelihood in order to draw samples from the posterior. We could use sociological knowledge about the effects of age and education on income.
But we don't have that, because I'm an outsider to sociology, so instead let's just use the default prior specification for GLM (generalised linear model) coefficients that PyMC3 gives us, which is to sample from a normal distribution. This is a very vague prior that will let the data speak for itself. The likelihood is a product of n Bernoulli trials, where p_i = 1 / (1 + exp(-z_i)), z_i is the linear combination of the coefficients and covariates, and y_i = 1 if income is greater than $50K, y_i = 0 otherwise.

With the math out of the way we can get back to the data. Here I use PyMC3 to draw samples from the posterior. The sampling algorithm uses NUTS, which is a form of Hamiltonian Monte Carlo in which the parameters are tuned automatically. Notice, and this is probably one of the nicest things about PyMC3, that we get to borrow the syntax for specifying GLMs from R. It's very convenient. The last line in the cell tosses out the first 2,000 samples, which are taken before the Markov chain has converged and therefore did not come from the target distribution, so they are not important for us. As you can see, you get a nice syntax. I've done a little trick here: I've started initialisation with ADVI, because otherwise I couldn't get this to work. You can see that I return the trace, which is one of our most important objects here, and I also sample the posterior predictive, which is what we can use to do criticism of this model if you want to.

If we have a little look at the trace, we can see the intercepts, and we have a little look at the trace for age, just to see if there are actually any numbers there. We can see how beta_education and beta_age are distributed; they have roughly normal distributions, and there's not much correlation between the two of them, so that's fine. So here's our question.
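The likelihood just described can be written out directly: the inverse-logit link p_i = 1 / (1 + exp(-z_i)) and the log of the product of Bernoulli trials. This is a plain NumPy sketch of the math, not the PyMC3 model itself, which builds the same likelihood for you from the R-style formula.

```python
import numpy as np

def sigmoid(z):
    # Inverse-logit link: p_i = 1 / (1 + exp(-z_i))
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_loglik(beta, X, y):
    # Log-likelihood of a product of Bernoulli trials:
    #   sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
    # where z_i is the linear combination of coefficients and covariates.
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Because sampling only needs the posterior up to a constant, this log-likelihood plus the log-prior is all NUTS ever has to evaluate.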
How do age and education affect the probability of making more than $50K? To answer this question, we can show how the probability of making more than $50K changes with age for a few different education levels. Here we assume that the number of hours worked per week is fixed at 50. PyMC3 gives us a convenient way to plot the posterior predictive distribution: we give it a function, a linear model, and a set of points to evaluate. So we'll pass in three different models: one with education = 12, which is finishing high school; one with education = 16, which is finishing undergrad; and one with education = 19, which is three years of grad school.

And you can see here that grad school increases your probability of earning more than $50K until you hit about fifty years of age, and then it levels off and goes down, and the curves have the same kind of shape for all three education levels. This indicates that education matters up to a certain age, but beyond that age you probably won't earn that much more, presumably because above the age of, say, 50 you're not likely to dramatically change your earnings, or in some cases even still be in the labour market. So we can scroll over here and say things like: at approximately age 40, the probability of an income greater than $50K is 60%, say, for the sake of argument, but if you did grad school your probability is 68 percent.
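The curves just described can be sketched with plug-in coefficients. To be clear, the coefficient values below are made up for illustration; in practice you would take posterior means (or better, the full posterior) from the trace. The shape is the point: a positive age term and a negative age-squared term give curves that rise, peak, and fall, separated vertically by the education coefficient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients (assumptions, not fitted values):
# intercept, age, age^2, education, hours
beta = np.array([-7.0, 0.15, -0.0015, 0.25, 0.02])

ages = np.linspace(20, 70, 51)
hours = 50  # fixed at 50 hours per week, as in the text

def p_income_gt_50k(educ):
    z = (beta[0] + beta[1] * ages + beta[2] * ages**2
         + beta[3] * educ + beta[4] * hours)
    return sigmoid(z)

# Three education levels: high school, undergrad, three years of grad school
p_hs, p_ug, p_grad = p_income_gt_50k(12), p_income_gt_50k(16), p_income_gt_50k(19)
```

With these made-up values the linear predictor peaks at age -beta[1] / (2 * beta[2]) = 50, which mirrors the levelling-off around fifty seen in the real posterior predictive plot.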
So there's an 8 percentage point uplift at the same age, all other things being equal, from having grad school versus high school education. We can also look at the odds ratio, and we can find our credible interval (I called it a confidence interval when I ran this before, which is technically wrong, but it doesn't matter much here): we are 95% confident that the true odds ratio lies within our interval, which means that we can trust these results in some sense. The next cells are just a bunch of functions to run multiple models.
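The odds ratio and its 95% credible interval fall straight out of the posterior samples: exponentiate the education coefficient draws and take percentiles. The samples below are synthetic stand-ins; in practice you would use the draws for the education coefficient from the trace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for posterior draws of the education coefficient
# (an assumption for the sketch; really these come from the trace).
beta_educ_samples = rng.normal(0.25, 0.03, size=2000)

# Odds ratio per extra year of education: exponentiate each draw,
# then read off the central 95% credible interval.
odds_ratio = np.exp(beta_educ_samples)
lo, hi = np.percentile(odds_ratio, [2.5, 97.5])
```

Because the interval is computed from posterior draws, it is a credible interval: given the model and data, the odds ratio lies in [lo, hi] with 95% probability.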
I'm not going to run these, but basically one of the questions we have is: why should it be age squared versus just age? So we run the model with age cubed, age to the power of four, and so on, and here's our evaluation metric: the WAIC, which is a standard evaluation metric in PyMC3. You can see that the age-squared model is better than the model with plain age. There's not much difference between age cubed and age squared, and age to the power of four is only slightly better, so we can say that age squared was our best choice in this case. That pretty much wraps up what we have so far.
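For reference, the WAIC that PyMC3 reports can be computed from the matrix of pointwise log-likelihoods over posterior draws: the log pointwise predictive density minus an effective-parameter penalty, on the deviance scale. A minimal NumPy sketch, assuming you already have that log-likelihood matrix:

```python
import numpy as np

def waic(loglik):
    # loglik: array of shape (n_posterior_draws, n_observations),
    # the pointwise log-likelihood evaluated at each posterior draw.
    # lppd: log pointwise predictive density.
    lppd = np.sum(np.log(np.mean(np.exp(loglik), axis=0)))
    # p_waic: effective number of parameters, the sum over observations
    # of the variance of the log-likelihood across draws.
    p_waic = np.sum(np.var(loglik, axis=0, ddof=1))
    # Deviance scale: lower is better, matching how PyMC3 ranks models.
    return -2.0 * (lppd - p_waic)
```

Ranking the age, age-squared, age-cubed, and age-to-the-fourth models by this number is exactly the comparison described above: lower WAIC means better expected out-of-sample fit.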