Neural Network Regression Model with Keras | Keras #3
In this video, I will be using a really cool dataset to build both a linear and a non-linear regression model with Keras. I’ll explain exactly what this means in a second, but first I would like to touch on the dataset. This dataset contains statistics on YouTube videos, including the number of views, likes, and dislikes, and also the number of subscribers on that video’s channel. How I got this dataset is actually super cool. My dad and I developed a web crawler that scoured the videos on the homepage of YouTube, as well as the videos that appeared when a particular search query was entered into the YouTube search engine. Some examples of these queries are “sports”, “politics”, and “gardening”. A web crawler is simply an automated script that browses the web in a methodical manner, often retrieving data as it goes. An important thing to mention is that our dataset consists mainly of very popular videos, so it isn’t necessarily representative of all YouTube videos. For our purposes, though, it will suffice. In the video description there is a link to a GitHub repository I made which contains the supplementary code and dataset for this video. If you are interested, you can download the files by clicking the green “Clone or download” button in the top right-hand corner and then clicking “Download ZIP”; you will then need to unzip it. Alternatively, it is even easier to simply click on the code and look at it directly on GitHub. The goal of our model will be to predict the number of views based on the other three parameters: likes, dislikes, and subscribers. This is a prime example of a regression model. But what does regression mean? Regression just means that our model predicts a continuous value: for any number of likes, dislikes, and subscribers, it outputs a predicted number of views. This is different from classification problems, which I will talk about in the next video in this series, again with a cool dataset.
Linear regression means that our model will essentially be a straight line, and it can be described by the equation y = mx + b, with y being the output, m being the weights, x being the input, and b being the bias, though in our case these are vectors and matrices rather than single numbers. In the case of a neural network, this is a net with no hidden layers. It has an input layer and an output layer connected by weights, and we also have a bias for each output. If you are unfamiliar with the structure of neural networks, you may want to watch my video where I explain just this; it is linked in the video description. Our linear regression model will look like this: it has 3 inputs, one output, and one bias. Let’s go ahead and implement this in Python and see how it performs. Before we do that, we will need to install a couple more packages in addition to what we installed in the last video. So, in your command line, make sure your environment is activated and type the following commands, summarized below. First, conda install pip, which installs pip in your virtual environment; pip is just a package management system that we will use to install some other packages. From there we type pip install pandas, then pip install scikit-learn (the package that provides sklearn), then pip install matplotlib. We will see exactly what these packages do in the implementation. Next, we need to upload our dataset into Jupyter Notebook. I would highly recommend that you make a folder within your home directory in Jupyter Notebook, so that your interface remains clean and uncluttered. You can do this by clicking New in the top right and then Folder. From here you should rename the folder by selecting it and clicking Rename in the top left corner. Next, open the newly created folder. Here, you can click the Upload button in the top right corner and select both the inputs, StatsVideosXALL.csv, and the outputs, StatsVideosYALL.csv. After doing this, click Upload next to the files, and you are done.
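For reference, here are those setup commands, to be run with your environment activated. Note that on PyPI the package is named scikit-learn even though it is imported in Python as sklearn:

    conda install pip
    pip install pandas
    pip install scikit-learn
    pip install matplotlib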
From here, in Jupyter Notebook, we will import all of these modules. For some, we just import the module, and for others, like pyplot, we must type from matplotlib import pyplot. This is because pyplot is a sub-module of matplotlib that doesn’t get imported with a simple import matplotlib. Additionally, for some modules, like tensorflow, we type import tensorflow as tf. This is simply so that when we use tensorflow we don’t have to write out tensorflow every time; instead, we can just type tf. Next, we need to assign the dataset to variables. I do this with a function in pandas called read_csv, where we pass the path to our file in quotes. Because the CSVs are in the same folder as our notebook, we can simply pass the names of the files, in this case StatsVideosXALL.csv and StatsVideosYALL.csv. Now we have two pandas dataframes containing our input data and our output data, which is why we call the variables df1 and df2 (df is short for dataframe). After this, we split our data into training and testing data with a function from sklearn called train_test_split. If you remember from my neural network videos, training data is the data we use to minimize our cost, and testing data is a separate chunk of data reserved for measuring our model's performance. Remember, overfitting occurs when our model learns the training data too well and, as a result, suffers at generalizing, which is the main goal of a machine learning model in the first place. Therefore, it is important that we can test whether or not our model is overfitting, and one way to do this is to reserve some data for testing that we don’t use to train the model. When we do this with the train_test_split function, it randomly samples our data, which is ideal. By setting test_size to 0.2, we are saying that we want 20 percent of the data reserved for testing and the remaining 80 percent reserved for training. When we pass df1, it takes a random 80 percent of it and assigns it to X_train, and it takes the remaining 20 percent and assigns it to X_test.
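Here is a rough sketch of what these first cells look like; treat it as one possible version, since your exact imports may differ slightly:

    import pandas as pd
    import tensorflow as tf
    from matplotlib import pyplot
    from sklearn.model_selection import train_test_split

    # Inputs (likes, dislikes, subscribers) and outputs (views).
    df1 = pd.read_csv("StatsVideosXALL.csv")
    df2 = pd.read_csv("StatsVideosYALL.csv")

    # Reserve a random 20 percent of the rows for testing.
    X_train, X_test, y_train, y_test = train_test_split(df1, df2, test_size=0.2)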
The same occurs for df2, but into y_train and y_test. Once we have done this, we scale our data. Scaling is usually a good practice when the values of different input features, or different output features, vary greatly. I’ll explain why in just a moment, but first, let’s look at our data. You can see that we only have one output parameter, views, so scaling it won’t offer any performance benefit, other than making the optimum hyperparameters, like the learning rate, more consistent between models for different datasets and in turn easier to find. In my implementation I choose not to scale the outputs, but you very well could. Note that if you were to do so, then when using your model’s predictions you would need to “undo” the scaling, which you can do by performing the inverse of the operations you originally applied to the data. Our inputs, however, are a different story. We have likes, dislikes, and subscribers, and as you can see, the values in dislikes, and even likes for that matter, are orders of magnitude smaller than subscribers. In a neural network, this would result in an underrepresentation of these parameters in the output when the weights are first initialized, and, most likely, when training has been completed. This is because training doesn’t always find the perfect combination of weights, which can be called the “global minimum”; instead it adjusts the weights from wherever it begins and often ends up in a “local minimum”, a minimum of the cost within a certain range of weights. This would be an example of a global minimum and a local minimum. In these graphs we can see the correlation between dislikes and views, likes and views, and subscribers and views. You can see that dislikes actually has a stronger correlation with views than subscribers does.
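If you did decide to scale the outputs, one convenient way to make the scaling reversible is sklearn's StandardScaler. Here is a tiny, self-contained sketch with made-up view counts (remember, the model in this video does not scale the outputs):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Made-up view counts, just to illustrate the round trip.
    y = np.array([[48_000.0], [120_000.0], [3_500_000.0]])

    y_scaler = StandardScaler()
    y_scaled = y_scaler.fit_transform(y)                # centered on 0, standard deviation of 1
    y_restored = y_scaler.inverse_transform(y_scaled)   # "undo" the scaling

    print(np.allclose(y, y_restored))                   # True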
From this we can tell that standardizing maximizes our chances of reaching the global minimum, or at least something near it, because it makes sure the features are more or less equally represented. In this implementation I will be using the scale function from sklearn.preprocessing, which rescales each feature so that it is centered around 0 and has a standard deviation of 1. Note that this is a simple rescaling: each feature has its mean subtracted and is then divided by its standard deviation, so the shape of the data doesn’t change. The standard normal curve shown here gives a rough sense of what centered, unit-standard-deviation data looks like. The next step is to actually define the structure of our neural network. First, we define our model with model = Sequential(). After this, we can begin adding layers. For this first linear regression model, I only add one layer with 3 inputs (likes, dislikes, and subscribers) and one output (views), and it doesn’t have an activation function. Note that it does have a bias. The predictions of this model would be a straight line generalized to 4 dimensions, whatever that would look like. In the next line, we state our optimizer, Adam, and our loss function, mean squared error. All that you need to know about optimizers is that they state how we update, or “optimize”, our weights, and that Adam is an alternative to stochastic gradient descent that usually performs better. Stochastic gradient descent simply updates all the weights using derivatives computed on a subset of the data; Adam modifies this process slightly and brings many performance benefits. In this instance, because our dataset has such large outputs, I set the initial learning rate to 100, and even this, as we will see, is not nearly enough. Additionally, mean squared error is the cost function we defined in the neural network series: the average of (y - yhat)^2 over the data points. From this point, we train our model. We do this by calling model.fit and passing certain parameters: first the training inputs, then the training outputs. In Jupyter Notebook we can see all the different parameters by pressing Shift-Tab.
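Continuing the sketch from above, here is a plausible version of the scaling and the linear model; the exact import paths depend on your Keras and TensorFlow versions:

    from sklearn.preprocessing import scale
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam

    # Standardize the three input features. (In practice you may prefer to fit a
    # scaler on the training data only and reuse it on the test data.)
    X_train_scaled = scale(X_train)
    X_test_scaled = scale(X_test)

    # Linear regression as a neural net: 3 inputs -> 1 output, no hidden layers,
    # no activation function, and a bias, which Dense includes by default.
    model = Sequential()
    model.add(Dense(1, input_dim=3))

    # Adam with the deliberately large learning rate from the video, and
    # mean squared error as the loss.
    model.compile(optimizer=Adam(learning_rate=100), loss="mean_squared_error")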
In my implementation, I state the number of epochs, which is the number of times all of the data is cycled through; the validation split, which I will explain in just a moment; and verbose, which controls whether or not the model prints certain things, like the loss, at each epoch. I usually like to have this off. The validation split is a random sample of our training data, in our case a random 10 percent, that is used to check the accuracy of our neural network on something besides the training data at each epoch. It is different from the testing data, which we only use once the model is done training. Although I use the default batch size of 32 and don’t explicitly define it in the .fit function, it is important that I mention batch size and give you an overview of what it actually is. During each iteration, the model calculates the predicted output for 32 data points, calculates the cost for each of them, averages these values, and uses derivatives to update all the weights. The model then repeats this process until it has cycled through all the data points; this would be one epoch. Note that in our case we will have many iterations in every epoch. If we use something significantly smaller than 32, then we won’t be using a representative chunk of data during each weight update; significantly larger, and we are being inefficient, performing too many calculations before we update the weights. 32 is a good sweet spot that many programmers use. I assign the result of .fit to a variable called history, so that I can graph the loss of the model on the training data and on the validation split over the epochs. This is exactly what I do in the next lines. Don’t worry about the specifics; all you need to know is that the red line is the validation loss, or cost, and the blue line is the training loss, or cost. Once I run this, you can see that the model hasn’t learned much yet, but it is on the right track, with both the blue training loss and the red validation loss going down.
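A sketch of the training call and the loss curves just described, continuing from the code above; the 500 epochs here match the number mentioned later for the linear model, but your numbers may differ:

    # Train, holding out a random 10 percent of the training data for validation.
    # batch_size is left at its default of 32.
    history = model.fit(X_train_scaled, y_train,
                        epochs=500,
                        validation_split=0.1,
                        verbose=0)

    # Blue: training loss, red: validation loss.
    pyplot.plot(history.history["loss"], color="blue")
    pyplot.plot(history.history["val_loss"], color="red")
    pyplot.xlabel("epoch")
    pyplot.ylabel("mean squared error")
    pyplot.show()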
When I make the learning rate 1000, the model continues to improve. When I raise the learning rate once more, to 10000, you can see that the model stops improving, and therefore we can only further improve its performance by optimizing other parameters. Next I add an activation function, specifically “relu”, which is one of the better performing activation functions. As you can see, this has no significant effect on the performance of our model. Now that we have our trained model, let’s compare its performance on the training data and the testing data. So, let’s run the model with its current weights on both by typing model.predict, passing x_train and x_test, and assigning the results to y_train_pred and y_test_pred respectively. Now that we have done this, we need a good way to measure the performance of our model. After all, we can’t measure accuracy as a fraction of correct and incorrect predictions as we can in a classification problem. A good way of measuring our model’s accuracy is the r squared score. R squared is a statistical measure of how close the data is to the regression model. It ranges from 0 percent to 100 percent, which in our code will be displayed as 0 to 1, with 0 meaning that the model explains none of the variation in the output and 1 meaning that it explains all of it. I implement it in these lines. First I import the r2 score from sklearn.metrics. Don’t worry about the formatting in the following 2 lines; just know that I print out the r2 score, which takes 2 arguments: the actual outputs and our model’s predicted outputs, for both the training data and the testing data. If you remember, we often call these y and yhat respectively. In this instance, the r2 score is slightly better on the training data than on the testing data, so we are likely overfitting a little bit.
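And a sketch of the prediction and scoring steps, again continuing from the earlier code:

    from sklearn.metrics import r2_score

    # Run the trained model on both splits.
    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)

    # An r2 of 1 means the model explains all of the variation in views, 0 means none of it.
    print("Train r2:", r2_score(y_train, y_train_pred))
    print("Test r2: ", r2_score(y_test, y_test_pred))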
Overall, our model is performing quite well despite the fact that it is very simple. Let’s see what happens when we make the model more complex. There is no guarantee that performance will improve, as our linear regression model is already performing quite well, and the actual relationship in our dataset between input and output is simple, even linear, meaning that a more complex model may not be necessary. I begin by adding 13 outputs to the first layer. There is no real reason why I chose 13 as opposed to any integer near it; perhaps because it is how old I am. I also add 4 hidden layers, each with 13 inputs and 13 outputs. I then add an output layer with one output, which will be the predicted views. This is what our model now looks like. Next, I define our optimizer, Adam, the cost function, mean squared error, and the learning rate, which will be 0.003. After training this model for 6000 epochs, you can see that both the training loss and the validation split loss converged quickly, and then the model began to overfit, with the training cost continuing to improve and the validation split cost beginning to go up again. Note that when you run this code, you will most likely need to adjust these values in order to achieve similar results; neural networks perform differently every time you run them, especially when the data is randomly sampled. We could rerun the training and limit the number of epochs to around a hundred, when our model was doing the best. Or we could simply use the EarlyStopping function. We implement it in the following way. You pass several parameters, the most important of which are the loss, or cost, that you want to monitor; the min delta, which is the smallest improvement in that loss that counts as progress; the patience, which is the number of epochs the loss is allowed to not improve before training stops; and verbose, which controls whether or not certain things are printed out. Again, if you would like to see all the different arguments that can be passed, click inside the parentheses and press Shift-Tab on your keyboard.
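Here is one plausible way to write the deeper model and the early stopper, continuing from the earlier sketches; the relu activations in the hidden layers and the specific min_delta and patience values are my own placeholders, since the video leaves them for you to tune:

    from tensorflow.keras.callbacks import EarlyStopping

    # Deeper model: a first layer with 13 outputs, 4 hidden layers of 13, one output.
    model = Sequential()
    model.add(Dense(13, input_dim=3, activation="relu"))   # activation assumed
    for _ in range(4):
        model.add(Dense(13, activation="relu"))            # activation assumed
    model.add(Dense(1))

    model.compile(optimizer=Adam(learning_rate=0.003), loss="mean_squared_error")

    # Stop training once the monitored validation loss stops improving.
    early_stopper = EarlyStopping(monitor="val_loss",   # the loss to watch
                                  min_delta=0,          # placeholder value
                                  patience=20,          # placeholder value
                                  verbose=1)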
I assign all this to a variable which I name early_stopper, and pass it in as a callback, which is an argument that the .fit function takes, as shown on screen. Note that I very well could have implemented the early stopper in the linear regression model, but I essentially nailed it with 500 epochs and a learning rate of 10000. Immediately, from the graphs, we can see that the linear and non-linear regression models perform quite similarly. Next, let’s compare the performance of the two models in more detail with the r2 score. First, I need to calculate the r2 score of the deep neural network, which is exactly what I do in these lines. As you can see, the scores are quite similar. Let’s display our model’s performance visually. In the next lines, I do my best to visually represent the model and its performance. I make a scatterplot with the actual number of views on the x axis and the model’s predictions on the y axis; note that red is the training data and green is the testing data. I also plot a straight line with a slope of one on a separate graph to show you what a perfect model would look like. Remember, a perfect model would have the same actual output and predicted output. From this graph we can conclude that our model is decent. There are several other important ways many programmers improve their models’ performance, including batch normalization, weight regularization, dropout, data augmentation, and several others. I will talk about these in future videos, with a dataset for which they improve the model’s performance substantially; on this regression dataset, there is no such improvement. Anyhow, that is it for this video. Stay tuned for similar videos on things like classification problems, convolutional neural networks, recurrent neural networks, and more, which will all feature cool datasets.
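Finally, for reference, a sketch of the training call with the callback and of the plots described above, continuing from the earlier code; the colors and axes follow the description in the video, and everything else is one possible arrangement:

    # Pass the early stopper to .fit as a callback.
    history = model.fit(X_train_scaled, y_train,
                        epochs=6000,
                        validation_split=0.1,
                        verbose=0,
                        callbacks=[early_stopper])

    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)

    # Actual views on the x axis, predicted views on the y axis.
    # Red: training data, green: testing data.
    pyplot.scatter(y_train, y_train_pred, color="red")
    pyplot.scatter(y_test, y_test_pred, color="green")
    pyplot.xlabel("actual views")
    pyplot.ylabel("predicted views")
    pyplot.show()

    # On a separate graph, a line with a slope of one: a perfect model would have
    # predicted views equal to actual views.
    limit = y_train.to_numpy().max()
    pyplot.plot([0, limit], [0, limit])
    pyplot.show()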