Decision Tree Machine Learning | Decision Tree Python | Intellipaat


Hey guys, welcome to Intellipaat. In today's session, we are going to learn about decision trees. First of all, let me tell you why you should be interested in learning about decision trees. A decision tree is a schematic, graphical representation of all the possible solutions to a decision-based problem. It is used to make smart decisions. Not only is it very crucial in the field of Data Science, but it is also very crucial in the field of Machine Learning, where it is used for predictive analysis. So, if you are interested in any of these fields, you must stay tuned till the end of this video. Now, before we go forward, do subscribe to Intellipaat's YouTube channel so that you never miss out on any of our upcoming videos. First, we are going to learn what a decision tree is with the help of a real-world example, and then we are going to learn how to build a decision tree. Also guys, if you are interested in becoming a certified Data Science professional, do check out the Data Science course offered by Intellipaat; you can find the course link below in the description box. Now, without any further delay, let's get started. So, a decision tree is basically a technique, or a data structure which we build, that helps us in making decisions. Here, all the internal nodes represent test conditions on an attribute, and all the leaf nodes are the categories into which the data is divided. Let's take an example to understand this better. Say you're the manager of a telecom company and you want to understand what factors make a customer churn out, so you decide to build a decision tree. This decision tree will give you a series of test conditions. Here, the root node is gender; that is, the first condition is decided by the gender column. If the customer is male, then we'll further check the duration of his tenure. On the other hand, if the customer is female, then the next test condition is based on monthly charges.

So, let's say the customer is male. Then we'll check his tenure: if his tenure is less than 30 months, he will churn out, and if his tenure is greater than 30 months, he'll stick with the same company, and this is the final prediction given by the decision tree. Similarly, if we get a female customer, then we'll check her monthly charges: if her monthly charges are greater than $80, she will churn out, and if her monthly charges are less than $80, she'll stick with the same company. So, we are exploring a series of alternatives to reach a particular decision point. Now that we've understood what a decision tree is, let's look at the types of decision trees. A decision tree can be either a classification tree or a regression tree. A classification tree is used when the response or target variable is categorical in nature, and a regression tree is used when the response variable is numerical or continuous. So, let's say we have a dataset A with n records in it. Now, what I'm going to do is draw samples from this dataset. This will be sampling with replacement: I'll take one record from dataset A, take note of it, enter the same sample into dataset A1, and then put the record back where it came from. I will repeat this process n times, so that there are n records in dataset A1 as well. What you need to keep in mind is that, out of these n records in A1, some might appear twice, thrice, or even several times, while some records from A might not have made it into A1 at all. So, I've created A1 like this. Then I'll go ahead and create multiple datasets the same way. So, I have A1, A2, A3, up to Ax, and each of these has the same number of records as A. The x here could be anything — let's say 100, 500, or even 1000. So, from just one dataset A, we are able to create multiple datasets to our advantage. Let's say dataset A has 1000 rows and the value of x is also 1000.

So, this would be 1000 multiplied by 1000, which gives us 1 million rows; that is, from just 1000 rows of data, we were able to get 1 million rows. Now, what we'll do is fit one decision tree for each of these x datasets, so we have x decision trees coming from x datasets. We now have a group of trees, or in other words, an ensemble of trees. Now, let's say a new record Ri comes our way. We're going to pass this record to each of these x trees, and we are going to get each tree's prediction of which class this new record belongs to. Since we have x trees, we will get x predictions; that is, if x was 500, you'd get 500 predictions, and if x was 1000, we'd get 1000 predictions. Now, to get the final prediction, all we have to do is select the class which has the majority of the votes across all of the predictions from the individual trees. So, what we are really doing here is aggregating the predictions across all these trees. And guys, this is the concept of bagging. Now, we will use the same example for Random Forest and see where the difference comes in. Again, we have the dataset A with n records in it, and again I'm going to draw samples from it by sampling with replacement: I take one record from dataset A, take note of it, enter the same sample into dataset A1, and then put the record back where it came from, and I repeat this process n times so that there are n records in dataset A1 as well. As before, out of these n records in A1, some might appear twice, thrice, or even several times, while some records from A might not have made it into A1 at all.
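As a rough illustration of this sampling-with-replacement step, here is a minimal R sketch; the built-in iris data stands in for dataset A, and the names boot_sets and majority_vote are purely illustrative.

    # Bootstrap sampling with replacement: x datasets, each with the n rows of A.
    # iris is used here only as a stand-in for dataset A.
    A <- iris
    n <- nrow(A)
    x <- 100
    set.seed(1)
    boot_sets <- lapply(seq_len(x), function(i) {
      A[sample(n, n, replace = TRUE), ]   # some rows repeat, some never appear
    })
    # Aggregating by majority vote across x class predictions:
    majority_vote <- function(preds) names(which.max(table(preds)))
    majority_vote(c("Yes", "No", "Yes"))  # "Yes"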

So, I've created A1 like this, and then I go ahead and create multiple datasets the same way. Each of these has the same number of records as A, and the x here could be anything — let's say 100, 500, or even 1000. So, from just one dataset A, we are able to create multiple datasets to our advantage. Just for our sake, let's say dataset A has 1000 rows and the value of x is also 1000. That would be 1000 multiplied by 1000, which gives us 1 million rows; that is, from just 1000 rows of data, we were able to get 1 million rows. So, up to this point, the process is the same as bagging, and this is where the difference comes in. Now, what we'll do is fit one decision tree for each of these x datasets, but the process of building the decision tree changes here. Let's say this A1 dataset has 10 independent variables. When it came to bagging, we considered all 10 of these independent variables as choices for the split candidate, but what happens in a Random Forest is that each time a node is being split in a decision tree, not all 10 columns will be provided to the algorithm. This is important, so I am repeating it, guys: each time a node is being split in a decision tree, not all 10 columns will be provided to the Random Forest algorithm. So, the question arises: what will be made available to the algorithm? Only a random subset of these 10 columns will be available. Let's say I want to split the root node. Instead of providing all 10 columns, only a subset of these columns will be provided — say, 3 columns, and it could be any 3 of the 10. With those 3, the algorithm goes on to split the node. Similarly, the left node over here is again going to be provided with a random set of 3 variables. It is not necessary that the left node gets the same 3 variables; it can be a different set of 3 columns altogether. So, whenever we're splitting a node, it is given a random set of m predictors from the entire predictor space, and the reason this is done is to make each of these x trees very different.
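To make the "random subset of columns per split" idea concrete, here is a tiny hedged sketch; the predictor names X1 to X10 are hypothetical.

    # At each node, only m randomly chosen predictors (out of p) are candidates.
    p <- 10
    m <- 3
    predictors <- paste0("X", seq_len(p))
    set.seed(2)
    sample(predictors, m)  # candidates offered at one node
    sample(predictors, m)  # a different node may get a different subset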

So, let's compare bagging and Random Forest. In bagging, all of the trees have the entire predictor space available to them, so the trees you eventually end up building are going to be very similar to each other. In Random Forest, we bring in randomness with respect to the columns provided — only a random set of columns is offered from the entire predictor space — and that is why the set of decision trees you get is quite different from each other. After this, the steps are pretty much the same as bagging. Let's say a new record Ri comes our way. We pass this record to each of these x trees, and we get each tree's prediction of which class this new record belongs to. Since we have x trees, we get x predictions, and to get the final prediction, all we have to do is select the class which has the majority of the votes across all the predictions of the individual trees. So guys, this is the concept of Random Forests. Today, we will actually be using three different packages to build decision trees: the tree package, the rpart package, and the ctree function from the party package. So, we'll learn three different functions to implement the decision tree model. First, we'll start off by loading the ISLR package library. This package has the Carseats data, so you'd have to say data(Carseats) to load it. Now, let me have a glance at it. This is our first dataset, which basically has sales of child car seats. Let me go to the help and show the description: this is a simulated dataset containing sales of child car seats at 400 different stores, and these are the different columns.
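The setup just described looks roughly like this in R (install the packages once if you don't have them):

    # install.packages(c("ISLR", "tree", "party", "rpart"))  # once, if needed
    library(ISLR)
    data("Carseats")
    head(Carseats)   # unit Sales plus predictors such as Price and ShelveLoc
    ?Carseats        # full description of the dataset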

So, Sales is the unit sales at each location, CompPrice is the price charged by the competitor at each location, Income is the community income level, Advertising is the local advertising budget, Population is the population size, Price is the price the company charges for car seats, and these are the rest of the columns. What we will do first is start off with classification. As I told you guys in yesterday's session, a decision tree can be used for both classification and regression purposes, and on this dataset we will be trying to classify something. Our first task is to take this Sales column and convert it into a categorical column: wherever the value of Sales is less than or equal to 8, we will tag it as low, and wherever the value is greater than 8, we will tag it as high. Since it's a classification task, we need a categorical column, so let's do that. We will use the ifelse function: we take the Sales column from the Carseats dataset, and wherever the value is less than or equal to 8, we say it is "No" — basically, it is not a high-sales store — and wherever it is greater than 8, we say "Yes". So, let's go ahead and create this new variable. I'd have to convert the dataset name into a small c first: I'm taking the dataset and storing it into a new object — the original is capital C, the new one is small c; that's pretty much the only difference. Now I take the Sales column, tag values less than or equal to 8 as "No" and values greater than 8 as "Yes", and put the result in the object High. Then I create a new data frame which consists of all the columns from the carseats dataset plus this new object, and I store it back into carseats. If we view carseats, we see that we have added this new column of Yes and No values: "Yes" indicates that the Sales value is greater than 8, and "No" indicates that the Sales value is less than or equal to 8.
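Putting those steps into code, this is roughly what the session runs (the lowercase carseats object is the working copy):

    carseats <- Carseats
    High <- ifelse(carseats$Sales <= 8, "No", "Yes")
    carseats <- data.frame(carseats, High)
    carseats$High <- factor(carseats$High)  # tree() expects a factor response
    View(carseats)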

So now we have a dataset with us, and it's time to build the model. We'll start off with the tree function; to use the tree function, you need the tree package, so let's load it. No — I am creating that new column, not modifying Sales. Let me actually show you the original dataset. View(Carseats): this is our original data frame, which does not contain the High column. What I am doing is taking the Sales column, tagging values less than or equal to 8 as "No" and values greater than 8 as "Yes", storing that result in a new object named High, and adding that new column into the carseats data frame — our working copy of the original data frame — and storing the result back into carseats. Able to follow this? Yeah. So, we have our dataset; now it's time to build the model. To build the model, we'll be using the tree package, so library(tree). Now we use the tree function: we want to understand whether the result is High or not — in other words, whether the Sales value is high or not — on the basis of all of the other columns. If I want to use all of the other columns, I just put a dot in the formula, so the dependent variable is High and the independent variables are the rest of the columns minus Sales. This is because the High column was created from the Sales column, so I take all of the columns except the Sales column. Are you able to follow why I'm removing Sales from the independent variables? Right. Also, I am not splitting the dataset as of now — I'm building this model directly on top of the entire dataset — so data will be equal to carseats. Now let's have a glance at the summary: summary(tree.carseats) tells us about the independent variables actually used, the number of terminal nodes, and the misclassification error rate. Now let's go ahead and plot this: plot(tree.carseats). We'll also add text to it: text() with the same model we've built, and pretty = 0. If I remove pretty = 0 and plot again, the categorical levels show up as just letters; if I want the categorical splits labelled with their actual names, I have to pass in pretty = 0, so let me add that back — now we get the actual category names. Let's look at this and understand what is happening. We want to see if the sales are high or not, and the first split point is based on the ShelveLoc (shelf location) column — this column determines the first split. If the value is either bad or medium, we go to the left side; on the other hand, if the value is good, we go to the right side. Let's go to the right side: there we check if Price is less than 135; if it is, we come to the left and check again whether Price is less than 109, and if Price is less than 109, the value of Sales will be high. The same kind of logic applies in the other branches. So, this is how we can plot the decision tree which we've just built.
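The commands walked through above, collected in one place (object names follow the session):

    library(tree)
    # Fit on the full data; High was derived from Sales, so Sales is excluded.
    tree.carseats <- tree(High ~ . - Sales, data = carseats)
    summary(tree.carseats)  # variables used, terminal nodes, error rate
    plot(tree.carseats)
    text(tree.carseats, pretty = 0)  # pretty = 0 prints full category names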
So, this is the model we built on top of the entire data. Now what we'll do is divide the data into train and test sets, build the model on top of the train set, and predict the values on top of the test set. Again, I'll be using that same caTools package, which gives me the sample.split function, so I'll load it up. Let me set a seed value first so that you guys get the same values as me — I'm setting a seed value of 101. Then I use the sample.split function, where the split criterion is the High column and the split ratio is 0.65.

This basically means that 65% of records go into the training set and 35% of records go into the testing set, and I store the result in split_tag. Now I use the subset function: from the entire carseats dataset, wherever the split_tag value is TRUE, I store that in the train set; similarly, wherever the split_tag value is FALSE, I store that in the test set. So, we have our training and testing sets ready, and now we'll go ahead and build the model on top of the training set. Again we'll be using the tree function, and again the formula is the same: High is our dependent variable and all other columns except the Sales column are the independent variables, but this time we are building the model on top of the train set. "Sorry, I just lost you — split_tag is equal to TRUE and FALSE? Where did that come from?" Okay, I'll start from the top. The sample.split function takes in two parameters: the first is the column you want to split on — and since our dependent variable is High, we take that as the split criterion — and the second is the split ratio, which is basically the percentage of the split. It gives us TRUE or FALSE values: 65% of the observations get the TRUE label and the remaining 35% of observations get the FALSE label, and I've stored that in split_tag. Let me just print split_tag out — you can see it's a bunch of TRUE and FALSE values. From these TRUE and FALSE values, what I do is: from the entire carseats dataset, wherever split_tag is TRUE, I take all of those records and store them in the train set, and wherever split_tag is FALSE, I store those records in the test set. "I understand that part — so when we do this split ratio and store it in split_tag, does it assign TRUE or FALSE at a 65% ratio?" Yes, the division is 65/35: 65% of the records get TRUE, 35% of the records get FALSE, and we use the TRUE ones for training. Is everyone able to follow this? Right. So, we've made the split_tag, and with the subset function, wherever split_tag is TRUE goes into train, and wherever it is FALSE goes into test. We have training and testing ready, so it's time to build the model on the training set. Again, we use the tree function; the dependent variable is High, and for the independent variables we take everything except the Sales column (because the High column was made from the Sales column, we write - Sales), we build this model on top of the train set, and we store it in tree.carseats. I'll hit Enter. Now let me have a glance at the plot — I'll build the plot again. This time we see that the first split criterion is determined by Price: if Price is less than 90, we go to the left side, and if Price is greater than 90, we go to the right side. So, this is the entire decision tree we have here. We have built the model; now let's also go ahead and predict the values.
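Roughly, the split-and-train steps just described:

    library(caTools)
    set.seed(101)  # so everyone gets the same split
    split_tag <- sample.split(carseats$High, SplitRatio = 0.65)
    train <- subset(carseats, split_tag == TRUE)
    test  <- subset(carseats, split_tag == FALSE)
    tree.carseats <- tree(High ~ . - Sales, data = train)
    plot(tree.carseats); text(tree.carseats, pretty = 0)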
To predict the values, we will be using the predict function, which takes in three parameters: the first parameter is the model which we built, the second parameter is the test set (because we want to predict the values on top of the test set), and the third parameter is the type of prediction. The type of prediction here is "class", so we directly get the class — basically Yes or No — and we store the result in tree.pred. So, we also have the predicted values; now I will build the confusion matrix. In this confusion matrix, the actual values come from test$High and the predicted values are in tree.pred. This value indicates that, of all the actual values which were No, we classified 68 of them correctly; this indicates that we incorrectly classified 15 of the actual No values as Yes; this 18 indicates the values which were actually Yes that we incorrectly classified as No; and these 39 are the observations which were actually Yes and which we correctly classified as Yes. So basically, the main diagonal holds all of the correctly classified observations and the off-diagonal holds all of the misclassified observations, and to get the accuracy we divide the diagonal by all of the observations: that is (68 + 39) divided by (68 + 39 + 15 + 18), which gives us an accuracy of 76%. Now, as we learned in yesterday's class, we have a fully grown tree here. What we'll do next is prune this tree and see what sort of difference it makes to the accuracy of the model. For that, we'd have to do a bit of cross-validation first — yesterday we also saw k-fold cross-validation, and cv.tree is used exactly for that. With the help of the cv.tree function we can do the cross-validation; it takes in two parameters: first, the model which we built, and next, the function, where prune.misclass basically says that we are doing this cross-validation for the purpose of pruning the tree — it's an inbuilt function which does the entire thing in the background — and we store the result in cv.carseats. Now let me print out cv.carseats. The first vector is the size of the tree: we started off with one root node and it kept increasing — one, two, three — until finally we have a tree which has twenty-four terminal nodes in total, and the next vector is the misclassification (deviance) at each size. When we have just one node, the resubstitution error is at its maximum, and the fully grown tree with all the terminal nodes has the minimum deviance, or in other words the minimum misclassification on the training data; the last vector is the cost-complexity value associated with each of these tree sizes. So, that was the cross-validation; now let me also plot it: plot(cv.carseats). Let me zoom in. What we basically see is that as the size of the tree increases, the misclassification rate decreases up to a point and then increases again: initially, with two or three nodes, there was very high misclassification, and when the number of nodes reached around fifteen or sixteen, we had the minimum misclassification rate.
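In code, the prediction, confusion matrix, and cross-validation steps look roughly like this (the counts in the comments are the ones quoted in the session and will vary with the split):

    tree.pred <- predict(tree.carseats, test, type = "class")
    table(tree.pred, test$High)       # confusion matrix
    (68 + 39) / (68 + 39 + 15 + 18)   # accuracy ~0.76 with the session's split
    cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
    cv.carseats                        # $size vs. $dev (misclassification)
    plot(cv.carseats)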
After that, when we kept on splitting, there was actually an increase in misclassification, so what we can find out is that the ideal number of nodes would be somewhere around fifteen or sixteen. This is how we basically prune our tree — what we are doing here is post-pruning, or cost-complexity pruning — and we found out that the fully grown tree is not the right idea. Let me print this out: the best value comes out to be around sixteen, since past sixteen the misclassification increases again. So, what I'll do is build a model with the number of nodes set to 16. This time the function we use is prune.misclass; it helps us prune a tree so that, at most, the number of terminal nodes is the best value we pass in — 16 — and I store the result in prune.carseats. Now let me plot this: plot(prune.carseats), and I'll add the text as well. This tree has 1, 2, 3, and so on up to 16 terminal nodes. So initially we had a fully grown tree, but after a bit of cross-validation we found that a fully grown tree is not a good idea, because after a certain point the misclassification rate basically increases — there is no point to a fully grown tree — and we understood that 16 is the level where we have to stop splitting the nodes. So, we have pruned our tree; now we'll predict the values again with this pruned tree. Again I use the predict function with three parameters: first the pruned tree, next the test set, and then the type, which is "class". I'll hit Enter, build the confusion matrix, and see what accuracy we get this time: (68 + 40) divided by (68 + 40 + 17 + 15), which gives us an accuracy of 77%. As we saw earlier, the initial accuracy was 76% — printing out the initial value again, (68 + 39) / (68 + 39 + 18 + 15) gives 76% — but after we pruned our tree and predicted the values again, we get an accuracy of 77%. So, is everyone able to understand how we did the pruning and how we got a better accuracy after pruning? Any questions?
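The pruning steps above, as a sketch:

    prune.carseats <- prune.misclass(tree.carseats, best = 16)
    plot(prune.carseats); text(prune.carseats, pretty = 0)
    tree.pred <- predict(prune.carseats, test, type = "class")
    table(tree.pred, test$High)
    (68 + 40) / (68 + 40 + 17 + 15)   # ~0.77, slightly better than the full tree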
"So, what is the seed value for?" The seed value is basically this: when I'm building this model, all of you might get different values, so if you want the same values as I'm getting, you'll have to set a seed value, which will give everyone the same result. "Actually, I didn't understand the seed concept — if I set a seed, will it give the same data every time? How does that work?" There's nothing more to it: when I call set.seed and give it a value — it could be any value — then whatever random sampling follows will be reproducible. Let's say I generate a sample: if you use set.seed(333) and then the sample function, even you will get the same values, and if I do set.seed(333) and sample(10, 2) again, I'll get the same value. So basically, if I want the same result, I need to use set.seed; that's pretty much it. "But you are doing the splitting every time — do you need to use it each time?" No, no — it's like this: if I want to show this result to someone else, then whenever I'm using the sample function it will give me a different result every time; but say I want sample(10, 2) to give the same result every time, then I just set a seed value, and that seed value will give me the same result each time. If I do set.seed(1) and sample(10, 2), I get 3 and 4; I set the seed value of 1 again, sample(10, 2), and I get the same result. So basically, when we want the same result, we use set.seed — there is nothing else to it. "Is the seed related to replacement — when we do the sampling, do the values get replaced so they can repeat?" No, we're not replacing anything here; set.seed is only about getting the same result each time. "A quick question: when we are doing this kind of classification, the accuracy and all of that is good, but on the basis of my test data, can I get a probability — I want to do a scoring. You took the Sales column and classified greater than 8 versus less than 8 as yes/no; I want a score, something like the actual sales value predicted, and possibly take another decision on that. Can I get those sales values predicted?" So, what you're asking is basically whether we can do regression with the help of a decision tree — yes, we can, and there is an example which we'll be doing with the rpart package, where we will actually be predicting continuous values with the help of a decision tree. So again, a decision tree is used for both classification purposes and regression purposes; you can do both with them. This example on this dataset, though, will be classification.
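A quick demo of the set.seed() behaviour just discussed:

    set.seed(333)
    sample(10, 2)  # some pair of values (the exact pair depends on your R version)
    set.seed(333)
    sample(10, 2)  # the same pair again, because the seed was reset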
So, we did a bit of pruning and we got this result over here; now what we'll do is print this out again and actually prune the tree with a different number of nodes. We have this cross-validation result — let me paste it over here. Earlier we took the number of nodes to be 16; now, instead of 16, we'll say the number of nodes is 9 and see what the accuracy is with 9 nodes. So, we'll prune this tree at 9 nodes: all we have to do is set the best value equal to 9. It's the same thing again — we use the prune.misclass function, where the first argument is the model which we built and the next is basically the number of nodes we want — and then we plot it. Here we have 1, 2, 3, up to 9 nodes, and this time again the first split is on the basis of Price: if Price is less than 90.5, then High is basically Yes. So this is the model; now let's go ahead and predict the values again. Again we use the predict function: the model we've just built — this is the pruned model — then the test set on which we are predicting, and type equals "class"; then we also go ahead and build the confusion matrix and see what the accuracy is this time: (68 + 37) divided by (68 + 37 + 20 + 15), so this time we see an accuracy of 75%. So, the ideal split, or the ideal level where we'll have to cut our tree, is when we have 16 nodes — and again, that is why this cross-validation is very important for us. In this result over here we compared 9 and 16, and 16 is ideal; we saw the same thing when we made the plot of cv.carseats, where up to 16 there is a decrease in the misclassification rate, and after the number of nodes progresses past 16 and we keep on splitting, the misclassification rate increases. So 16 is our ideal value. So, with this, that is basically how we can build decision trees with the help of the tree package. After this, we'll be building decision trees with the party package — this is just an alternative way to build decision trees — so let me go ahead and load this package. Has everyone installed this package? A quick confirmation, please. Yeah. All right, for this we will be using the iris dataset, so let me open it up: View(iris). This time we have a three-way classification: we'll be trying to classify whether the species is setosa, virginica, or versicolor. With levels(iris$Species) we see that there are three classes, and this time we'll be building a decision tree on top of this dataset to understand whether the flower belongs to setosa, versicolor, or virginica. So, let's do that. We have loaded the party package. Now, till now I've been using the caTools package, and the caTools package gave us the sample.split function; similar to sample.split, we also have the createDataPartition function. It's basically analogous to sample.split: createDataPartition is a part of the caret package, while sample.split is a part of the caTools package. So I'll load the caret package, and now I will use createDataPartition. It's pretty much the same thing: instead of sample.split it's createDataPartition, the split column is Species, and the split ratio is 0.65. We also say list = FALSE: normally we want a vector, but this function actually gives us a list instead of a vector by default, so we set list = FALSE, and we store the result in split_tag. Again, it's pretty much the same idea, but where sample.split gave us TRUE or FALSE values, createDataPartition gives us the record numbers — let me print split_tag out, and you can see the row numbers. Now, to make train and test from this split_tag, we basically pass it in as an index into our dataset: from the iris dataset, I select all of these row numbers — these row numbers comprise 65 percent of the iris dataset — and store them in the train set. Similarly, since split_tag contains 65 percent of the row numbers, I put a minus symbol to take all of the records except those 65 percent of row numbers present in split_tag, which is basically the remaining 35 percent, and those 35 percent of records go into the test set.
So, this is how we create the train and test sets this time — again, I'm repeating, this is basically analogous to sample.split, just a different way to split the dataset. This time we have the ctree function, which is a part of the party package, and we'll be building our model on top of the train set: we want to understand what type of species the flower is with respect to these four columns — sepal length, sepal width, petal length, and petal width — and that is why we put a dot in the formula, and I store the result in my_tree. Now let me also go ahead and plot this. Here you can see the difference between tree and ctree: this time the first split criterion is on the basis of the petal length column.
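Pulling the caret split and the ctree fit together, roughly as run in the session (set.seed is added here only for reproducibility):

    library(party)
    library(caret)
    set.seed(101)
    split_tag <- createDataPartition(iris$Species, p = 0.65, list = FALSE)
    train <- iris[split_tag, ]   # createDataPartition returns row numbers
    test  <- iris[-split_tag, ]
    my_tree <- ctree(Species ~ ., data = train)
    plot(my_tree)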

If the petal length is less than or equal to 1.7, we classify the flower as setosa — you can see the probability there is close to 1. On the other hand, if petal length is greater than 1.7, the next split criterion is petal width (the p-value shown on the node is less than 0.001). If petal width is less than or equal to 1.7, then again we check petal length, and this time, if petal length is less than or equal to 4.8, it will be versicolor; on the other hand, if petal length is greater than 4.8, there's around a 60 percent probability that it's versicolor and around a 40 percent probability that it's virginica. On the other hand, if petal length is greater than 1.7 and petal width is also greater than 1.7, then it is virginica. So, over here we are doing a multi-way classification. "Boxes one and three are confusing — the splits have the same values: both show less than 0.001, and then it's splitting at greater than 1.7 and less than or equal to 1.7, and the same thing is happening at box three as well. And about this p-value: when you move right at 1.7, anything greater than 1.7 goes towards the right-hand side, right? Then how can we split again after box three?" Right — forget these p-values for a moment if they are confusing you. First we split on the basis of petal length: if petal length is less than or equal to 1.7, we can be almost 100% sure that it is setosa. On the other hand, if petal length is greater than 1.7, we then check petal width — it's not petal length this time, it's petal width. If petal width is less than or equal to 1.7, we check petal length again, and if petal length is less than or equal to 4.8, we can be almost 100% sure that it's versicolor.

If it's greater than 4.8, there's around a 60% probability that it's versicolor and around a 40 percent probability that it's virginica. On the other hand, if petal length is greater than 1.7 and petal width is greater than 1.7, then again we can be almost 100% sure that this flower belongs to virginica. So, what we did over here is a multi-class classification where we are trying to understand whether the flower belongs to setosa, versicolor, or virginica, and we built this model with the help of the ctree function. So we have built the model; now let's go ahead and predict the values. Again I will be using the predict function, but where in the previous case we had said type equals "class", this time, for the ctree function, the type value will be "response" — these are the minute differences between these functions. So, when we used the tree function and were predicting values, we set the type of prediction to "class", and when we are using the ctree function and predicting values, the type will be "response". Otherwise the parameters are pretty much the same: first the model which we've just built, next the test set on which we are predicting, and next the type of prediction — "response" here is pretty much the same as "class"; it's just that the nomenclature changes when it comes to the ctree function — and we store this in mypred. So, we have also predicted the values; now let's go ahead and build a confusion matrix. These are the values where the actual was setosa and it has been correctly classified as setosa; these are two cases where it was actually setosa but has been classified as versicolor; these are the cases where the actual was versicolor and it has been correctly classified as versicolor; this is the case where the actual was versicolor and it has been incorrectly classified as virginica; this is the case where the actual was virginica and it has been incorrectly classified as versicolor; and this is the case where the actual was virginica and it has been correctly classified as virginica. So again, the main diagonal which you see holds the correctly classified values, and the rest — the 2, 1, and 2 — are the incorrectly classified values. Let me go ahead and find the accuracy: this time it is (15 + 16 + 15) divided by (15 + 16 + 15 + 2 + 1 + 2), so we get an accuracy of about 90%, which is very, very good.
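The prediction and accuracy check, roughly as run in the session (the counts are the session's and depend on the split):

    mypred <- predict(my_tree, test, type = "response")  # ctree uses "response"
    table(test$Species, mypred)                          # 3x3 confusion matrix
    (15 + 16 + 15) / nrow(test)                          # ~0.90 in the session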
Now, what we saw in the plot was that the splits were determined by only petal width and petal length — these are the only columns determining the splits, and these are the only columns which determine whether the flower belongs to setosa, versicolor, or virginica. That is why what we'll do is build another model where we take only petal width and petal length as the independent variables, because what is the point of including other variables when they are not even a part of the split criteria? So let me go ahead and build another model: again ctree, and in the formula this time the dependent variable is Species and the independent variables are petal width and petal length only — we do not take sepal length and sepal width this time — and again we are building this model on top of the train set. Let me delete the old plot first and now make a plot of this. So, this is pretty much the same. "Yeah — from what you said, from the decision tree you got to know that only two things are getting used, and you did not use the other columns. Is this a way to reduce the number of dimensions in your model? If I have ten variables and I get a clue from this particular model saying I don't require the other variables, can I leave them out of my modeling — is that a right way?" Yes, when it comes to decision trees, this is something you can do, but then again, this is trial and error; you need to see what works with respect to the model. Over here we got a clue that sepal width and sepal length were not part of the split process, so we thought, why not build the model without using these two — otherwise they basically bring in redundancy, and we don't want redundancy. So let's go ahead, build the model, and see what the output is. "So basically, you look for clues to limit the data, or to limit the number of independent variables?" Right — we can use this to limit the number of independent variables. So, I have plotted the second one, and we see that we have a very similar plot over here — not just similar, actually, it's pretty much the same.

Again, petal length less than or equal to 1.7 gives us setosa; again, coming over here, greater than 1.7 and greater than 1.7 gives us virginica — so we get pretty much the same result. Now, just to be sure, we'll again predict the values with the model which we've just built. We use the predict function: the model we built, my_tree2, is the first parameter, we are predicting on top of the test set, the type is "response", and I store this in mypred. Now let me again go ahead and build my confusion matrix: first the actual values, which come from the test set, and next the predicted values, which are stored in this mypred object. I'll hit Enter. Again we see that we get a very similar result, and when I check the accuracy — (15 + 15 + 16) divided by (15 + 15 + 16 + 2 + 1 + 2) — we get the same accuracy. So, with the help of this plot we could find out that sepal length and sepal width did not provide any information to the model, and when we found that out, we removed those two columns and built the model again; after building the model again, we saw that it gives the same accuracy as including all of the independent variables. So this is how, basically through trial and error, you try to find your best-fit model.
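And the reduced, petal-only model as a sketch:

    my_tree2 <- ctree(Species ~ Petal.Width + Petal.Length, data = train)
    plot(my_tree2)
    mypred <- predict(my_tree2, test, type = "response")
    table(test$Species, mypred)  # essentially the same accuracy as the full model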

So, we are done with tree and we are also done with ctree; now it's time for the final decision tree function, which is rpart. Have you all installed the rpart package? A quick confirmation, please. Yeah, sure. Before we go to it: "You know, normally when you do any classification, it doesn't depend only on accuracy. As we discussed in the last session, for any classification problem you also depend on the area under the curve, the ROC, and the errors, right? Do we need to do that here too?" Yes — for every classification problem, you would have to take care of all of these factors; whatever the classification problem is, you need the right trade-off between accuracy, specificity, sensitivity, and all of these factors. "Now suppose we take an ROC threshold of some particular value, 0.5 or 0.7 — should we use that value in the confusion matrix, like table of predictions greater than that threshold, the way we did it over there?" No, because when you use the glm function, that gives you a probability; on the other hand, when you're using this tree function and the ctree function, you are directly getting the class. When we predicted, they directly gave us a class — they did not give us a probability. This function by itself takes in a threshold value, and on the basis of that threshold value it divides the data into classes, or predicts the data in classes, so over here you would not be required to set a manual threshold yourself. In glm we had to do that because the logistic regression function gave us a probability, so we would predict with type equal to "prob". "Can you show us how to use the ROC idea on one of these?" Again, there is no point in using ROC for this — we would not require the threshold here. Okay, what I will do is show you the predicted results of these. Let me build this model again — I'll copy this, paste it over here, and load this entire thing again. All right, let me print out these values: tree.pred. You see that these predicted values are actually the final results which you get — Yes/No — so there is no probability over here; you don't get a probability like "there is an 80% probability that this is Yes and a 20% probability that this is No". This function by itself takes in a threshold value and gives you the final classified result, so all you have to do is take this final classified result, compare it with the original result, and find out how accurate the model is. By themselves, these models give you the optimal threshold — this tree, ctree, and even the rpart functions by themselves give the optimal threshold values — so you don't have to manually set those threshold values, and there is no point in doing ROC here. You need to understand that we wanted the trade-off: ROC again gives you the threshold value for the accuracy, so if we cut off at a given threshold, it says you will get the maximum accuracy, but this function by itself does that, so we don't have to manually set a threshold value for the probability. "If we don't set the threshold manually, then do we need to use the AUC functionality and all of this at all?" No, not really. "And the second question: since it's a multi-class classification when using the party package, can we use the same thing in the rpart package as multi-class?" Yes, you can do that — with all three of these functions you can do that. "But for logistic regression, can you use the glm model?" No, that is not possible, I guess — even I am not sure; I have never done multi-class classification with logistic regression, so I'd have to check up on that. "I think we have used some other function for that." Yes — tree, rpart, and ctree you can directly use for multi-class classification. Again, what you need to understand about glm is that it is a binomial model and it will give you a probability, whereas these decision tree models are used for both purposes — they are used for regression as well as classification.
So, with rpart we'll be doing regression. This time we have the Boston dataset: you'd have to load the MASS package first, so just type in library(MASS), and then load the Boston dataset and View(Boston). All right, this is our dataset; now let me actually show you what these columns mean. This Boston data frame has 506 rows and 14 columns, and these are all of the different columns: crim basically stands for the per capita crime rate by town, zn is the proportion of residential land zoned for lots over 25,000 square feet, indus is the proportion of non-retail business acres per town, and so on — you can go through this list. Our main focus will be on the medv column: medv is the median value of owner-occupied homes in thousands of dollars. Basically, we are trying to predict the value of the house — the value of this house is $24,000, this one is $21,000, this one is $24,000, and so on — so, on the basis of the other columns, we want to build a regression model which helps us predict the median value of the house. So let's use the rpart function and do that. The first task, again, is to load the rpart package.
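The setup steps for the regression example (install the packages once if needed):

    # install.packages(c("MASS", "rpart", "rpart.plot"))
    library(MASS)    # provides the Boston housing data
    library(rpart)
    data("Boston")
    dim(Boston)      # 506 rows, 14 columns; medv = median home value in $1000s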

Now, again I am using the createDataPartition function, which is a part of the caret package. This time the split column is medv and the ratio is 0.65, so 65% of the records will be in training and 35% in testing, and we say list = FALSE. We have the split_tag with us again: from the Boston data, all of the rows whose numbers are in split_tag I take into the training set, and all of the rows which are not in split_tag — that is, the remaining 35% of the values — I store in the test set. So we have our training and testing sets ready, and as we've been doing, we build the model on top of the training set. This time the function we are using is rpart, and again it's the same pattern — we put in the formula and then give the data. The formula is medv ~ . — medv is our dependent variable and all other columns are the independent variables — and we are building this model on top of the train set. Now, to plot this tree we would require the rpart.plot package, so you'll have to load that package and then visualize the tree which you've just built: rpart.plot, and I pass in the object, which is my_tree. All right, this is the sort of visualization you get. Over here, the first split criterion is on the basis of lstat: if lstat is greater than or equal to 9.7, we go to the left side; on the other hand, if lstat is less than 9.7, we go to the right side. Let's go to the right side first: with lstat less than 9.7, we check whether rm is less than 7.5 — if it's less than 7.5, again left side, and if it's greater than 7.5, we come to the right side. So this is how the splitting happens, and these values which you see at the leaves are the average values of the price of the house. Yesterday we saw, during the very first example, when we were trying to predict the salary of a player, that it gave us the average salaries of the players — when you use a decision tree model to predict a continuous value, it gives an average value. So if the split goes something like this, your average house price is $9,000; if it goes something like this, the average price of the house is $15,000; and if it follows this path, the average price of the house is $45,000. So this is how your splits work over here. We have built the model; now it's time to predict. This time we'll use the predict function and not give the third parameter: since we get a continuous value this time, we don't have to set the type — it automatically gives a continuous value. First we give the name of the model we built, which is my_tree, and then we give the dataset for which we want to predict values — we want to predict the values on top of the test set — and we store the result in predict_tree. Now I'll take the actual values and the predicted values and use the cbind function: the actual values are in the test set and the predicted values are in this object; I combine these two and store them in final_data. Now, this is actually a matrix, so I convert it into a data frame first — as.data.frame(final_data) — and store it back into final_data. View(final_data): these are the actual values and these are the predicted values which we have. Now what we'll do is go ahead and find the error in prediction.
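As a sketch, the regression-tree steps just walked through (caret is reused for the split; names follow the session, and set.seed is added for reproducibility):

    library(caret)
    library(rpart.plot)
    set.seed(101)
    split_tag <- createDataPartition(Boston$medv, p = 0.65, list = FALSE)
    train <- Boston[split_tag, ]
    test  <- Boston[-split_tag, ]
    my_tree <- rpart(medv ~ ., data = train)
    rpart.plot(my_tree)  # leaves show the average medv in each region
    predict_tree <- predict(my_tree, test)  # no type needed for regression
    final_data <- as.data.frame(cbind(actual = test$medv,
                                      predicted = predict_tree))
    head(final_data)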
The error in prediction is what we get when we subtract the predicted values from the actual values, so let's do that. It's pretty simple: all you have to do is final_data$actual minus final_data$predicted, store that in error, and then bind the error back onto final_data. Now let me have a glance: View(final_data) — actual values, predicted values, and the error in prediction. Again, if we want to find the average error, there is something known as the root mean square error: we take this error, square it up, take the mean, and then take the square root. For the first model, we get a root mean square error of 3.93, and I store this in rmse1. Now, looking at the tree again, we see that only lstat, nox, crim, and rm have been used — out of all of the independent variables which we have, only a limited number of columns have been used for the splits. So we will build a model which takes in only those independent variables; we will not use all of the independent variables. Let's do that: we have rm, lstat, crim, and nox, and dis is in there as well, so let's also add that. For the second model, again, it's the same thing: rpart, with the formula where medv is the dependent variable and the independent variables this time are rm, lstat, crim, nox, and dis, and we are building this model on top of the train set. So we build the model again; let me have a glance at it. I hit Enter, and you don't notice any change, right, because we've got the same result — let me delete this and hit Enter again. What we see is that even though we included just these five independent variables, we've got the same splits over here, which again basically reinforces our belief that no other column was used for the splitting. So, we have built the model; now let's go ahead, predict the values again, and calculate the RMSE for this model. We use the predict function, take the model as the first parameter, predict the values on the test set, and store the result in predict_tree; again I take the actual values and the predicted values and store them in final_data, convert that to a data frame, and calculate the error in prediction; then I bind the error back onto final_data, again using the cbind function. Now let me have a glance: View(final_data). This time, when we used only these five independent variables, these are the actual values, these are the predicted values, and this is the error in prediction. Now, again, let me go ahead and find the root mean square error: the root mean square error is 3.93. Let me print rmse1 — it's the same. So for the first model and for the second model, the root mean square error is the same, which means there is absolutely no need to include any other variable beyond these five independent variables, because they do not add anything to the model. So this is how we can choose our ideal independent variables, and this is how we can do regression with the help of a decision tree.
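The error and RMSE computations, plus the reduced five-variable model, as a sketch (the 3.93 figure is the session's; yours will vary with the split):

    final_data$error <- final_data$actual - final_data$predicted
    rmse1 <- sqrt(mean(final_data$error^2))  # ~3.93 in the session
    rmse1
    # Second model with only the predictors the first tree actually used
    my_tree2 <- rpart(medv ~ rm + lstat + crim + nox + dis, data = train)
    predict_tree2 <- predict(my_tree2, test)
    rmse2 <- sqrt(mean((test$medv - predict_tree2)^2))
    c(rmse1, rmse2)  # effectively the same error with far fewer predictors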
So, any doubts in these tree functions? "No doubts, I am good, but how can we get that R file?" I'll share this with the operations team after the session, and they'll share the R file with you guys. "Yeah, great, thank you." All right. Also, I was told that I had not recorded the first half of yesterday's session, so that was basically my fault, and I'll have to record the earlier part again. What I'll do is record the theory part again after today's session, and you will have yesterday's theory session uploaded by tomorrow night. Go ahead, any other doubts?

"Here also, should we be doing pruning in rpart and ctree?" Yes, even here you can do pruning, but it is not the same pruning method; there are control parameters, so you will use those control parameters for rpart and ctree. Do read up on that. "What about ctree?" So for ctree there are some tree control parameters, and inside those control parameters you can set, say, the ideal number of nodes, when to split, or the ideal threshold value for a split, and that is how you do it. "Will all the pruning techniques give the same result?" So in post-pruning there is just one type, cost complexity. What we saw earlier was cost-complexity pruning. The idea behind it is that the root node has the highest misclassification rate, and as you keep coming down the tree, the misclassification rate, or the resubstitution error, decreases. So you need to find that particular level of split where the misclassification rate is minimum. As we saw in the earlier graph, at sixteen nodes the tree had the least misclassification rate, but after sixteen, when we kept splitting, there was not a decrease but an increase in the misclassification rate. So in cost-complexity pruning, the idea is to find that minimum value of the misclassification rate, and the number of terminal nodes where you hit that minimum will be your ideal number of terminal nodes.
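To make the two pruning routes concrete, here is a minimal sketch with rpart; the model name fit and the train object are assumptions. For ctree, the analogous knobs live in its control function (ctree_control in the partykit package), which is worth reading up on.

    library(rpart)

    # Pre-pruning: control parameters bound the tree while it is being grown
    fit <- rpart(medv ~ ., data = train,
                 control = rpart.control(minsplit = 20, maxdepth = 5, cp = 0.01))

    # Post-pruning (cost complexity): the cp table reports the
    # cross-validated error for each size of the tree
    printcp(fit)

    # Prune back to the cp value with the lowest cross-validated error,
    # i.e. the level of split where the error is at its minimum
    best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned  <- prune(fit, cp = best_cp)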
"Were these using the Gini index or information gain?" So by default, the impurity function which these tree packages use is the Gini index. "How do we check that?" Again, even I'm not sure, so I'd have to read up on that, but by default the impurity measure these decision tree functions use is Gini.

So for random forest we'll be working with this CTG dataset. This is basically a medical dataset which I took from the UCI Machine Learning Repository, and I will be sending it to you guys after the session, so you can follow along with this video. It consists of all of these columns: the dataset measures the fetal heart rate of a patient, these are the different parameters, and this is the final categorical column which we are trying to predict. This NSP column basically stands for normal, suspect, or pathological: the fetal heart rate is either normal, it is suspected to be pathological, or it is pathological. So this will be a multi-class classification, and we'll be doing it with the help of random forest.

With a plain decision tree, we just have a single tree which we build on top of our dataset. Now what we have is something called ensemble learning. The perfect example for this could be: let's say you want to watch a movie, Avengers, and you take your friends' advice, and one particular friend hates all action movies. He has a very biased opinion towards action movies, and even though he has not watched Avengers, he'll tell you that it's a bad movie. Now, what happens in ensemble learning is that you basically take the opinion of multiple people. So in ensemble learning, instead of just building one decision tree, you have multiple decision trees; in other words, you take the opinion of ten people, and out of those ten, eight would tell you to watch Avengers because it is a good movie, while the other two have their own biased opinions, and that is why they tell you it is a bad movie. So on the whole you get a collective opinion that Avengers is good. Basically, in ensemble learning you get results from multiple decision trees.

The first extension from decision trees is bagging, and from bagging we get something known as random forests. What we do in bagging is this: we have an initial dataset L, and from it we create multiple datasets by sampling with replacement. So from this dataset L with n records, I create another dataset L1 which has the same number of records, taken from L by sampling with replacement. Similarly, I create L2 with n records taken from L by sampling with replacement, and then L3 the same way. I will create X such datasets, and I will build one decision tree on top of each of them. So instead of getting just one result, I get multiple results from multiple decision trees, and I take the aggregate result of all of the decision trees. This is what happens in bagging. So, are you able to follow me? Is the concept of bagging clear? "Yes." Right.

Once we know what bagging is, random forest is just an extension of it. In random forest, the first part is the same: we create multiple datasets. The part where random forest differs from bagging is the split criterion at the nodes. For a split, a tree will not get all of the independent variables; it will only get a subset of them, and that will be a random subset. So let's say I have ten independent variables in my dataset: from those ten, only three randomly chosen independent variables will be available for this split, and for the next split, again, only three random variables will be available. So all ten variables are never available for a split; only m random variables are, and typically for classification problems this m value is the square root of p, where p is the total number of independent variables. So if you have ten independent variables, then m will be the square root of 10, which is around three. The word "random" in random forest basically comes from this part: the split of the nodes depends on a random subset of the independent variables, and that is why this ensemble method is known as random forest. This is the only part which differs from bagging. Again, we have these X decision trees, we get a result from each of them, and we take an aggregate of those results, and that will be our final result.
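As a rough illustration of these two ideas, here is a minimal sketch; the helper bagged_trees is a made-up name for illustration, and the randomForest call assumes the CTG data is loaded in data with NSP already converted to a factor (which we do next).

    library(rpart)

    # Bagging by hand: X bootstrap samples (sampling with replacement),
    # one tree per sample; predictions are then aggregated across the trees
    bagged_trees <- function(formula, data, X = 100) {
      lapply(seq_len(X), function(i) {
        idx <- sample(nrow(data), nrow(data), replace = TRUE)  # bootstrap sample
        rpart(formula, data = data[idx, ])
      })
    }

    # Random forest: at every split only m = sqrt(p) randomly chosen variables
    # are candidates; the mtry parameter of randomForest is that m
    library(randomForest)
    rf <- randomForest(NSP ~ ., data = data, ntree = 500,
                       mtry = floor(sqrt(ncol(data) - 1)))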
This is the basic idea behind random forest. So, does that help you? "Yes, thank you." Great. Now that we know what a random forest is, let's go ahead and work with it. Let me start off by having a look at the structure of this data: I'll use the str function, and these are all of the columns which I have. My dependent column is this one, NSP, and we see that it is of integer type. But since we're doing classification, we want this to be a categorical variable, so the first step would be to convert this integer variable into a categorical one. To do that I will use the as.factor function: as.factor(data$NSP), and I will store this back in data$NSP. Now let me have a look at the structure of data again, and we see that the integer type has been converted to a factor. Again, let me have a glance at the levels of NSP: 1, 2, and 3. So 1 denotes that the patient is normal, 2 denotes that the patient is suspected to have the fetal heart disease, and 3 denotes that the patient has the pathological heart disease. There are 1,655 normal patients, 295 patients who are suspected to have the fetal heart disease, and there are 176 patients who