Exploratory data analysis

Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat and I'm an associate professor of bioinformatics, and on this YouTube channel we cover data science concepts and practical tutorials. So if you're into this type of content, please consider subscribing.

Data pre-processing and exploratory data analysis are crucial to the success of every data science project, so in this video I'm going to show you how to do basic data pre-processing and exploratory data analysis in Python using the pandas library, so that you can tackle your own data science projects. So without further ado, let's get started.

Okay, so the first thing that you want to do is head over to the GitHub of the Data Professor and click on the code repository. Scroll down, click on python, scroll down, and find pandas exploratory data analysis. Click on that. All right, and then you want to right-click on the download button, choose save link as, and save it to your computer. And then you can open up the Jupyter notebook on your computer and follow along, or you can go to Google Colab and load it directly from GitHub: click on the GitHub tab, search for dataprofessor, press enter, and then find pandas exploratory data analysis. Okay, but since I already have that, I'll just open it up. And then before we begin, let me clear all of the outputs so that we can start from scratch together.

Okay, so the data that we're gonna use in this tutorial is based on the one from the previous videos, where we used the NBA player stats data scraped directly from the Basketball-Reference.com website. You're gonna see that the code is right here in the first block of code, so let's run that. What it essentially does is import pandas, and then use pandas to read the contents into a data variable, whereby the content will be a data table. And then we're gonna do some basic data cleaning by removing the redundant header rows that are present more than one time in the content of the table, and that was shown in the previous tutorial.
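The header clean-up described above can be sketched as follows. This is a minimal illustration on a made-up table, not the notebook's exact code: the player names, the values, and the choice of the Age column as the marker for a repeated header row are all assumptions.

```python
import pandas as pd

# Hypothetical miniature of a scraped stats table: the header row
# repeats inside the body, as it can on long web tables.
df2019 = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Player": ["Player A", "Player B", "Player", "Player C"],
        "Age": ["25", "31", "Age", "22"],
        "PTS": ["5.3", "1.7", "PTS", "8.9"],
    }
)

# A repeated header row carries the column name as its value,
# so rows where Age == "Age" are headers, not players.
is_header = df2019["Age"] == "Age"
raw = df2019[~is_header].reset_index(drop=True)

print(raw.shape)
```

Filtering with a boolean mask keeps the original table untouched, so the raw scrape can still be inspected if something looks off later.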

So if you haven't yet watched that, please click on the link above. Okay, and so you will see that the header contains several acronyms. If you're not familiar with them, they are summarized in the following block of text, here in this table, where the acronym is followed by the description. So Rk is rank, Pos is the position, Age is the player's age on February 1 of the season, Tm is the team name, and then the number of games played, games started, and minutes played per game. Okay, and so today we're gonna take a deeper look into that in terms of exploratory data analysis. We're gonna use a lot of pandas in order to retrieve the data that we want to have a look at, and we'll also make some basic plots and graphs here.

Okay, so let's have a look here. As I mentioned already, this block of code will scrape the data directly from the Basketball Reference website, and the table data will be put into the data frame called df2019. And then we're gonna drop all of the headers that are redundant, and the result will be contained within the raw variable. Let's have a look at the shape of the data: it has 708 rows and 30 columns. Let's take a look at the first few rows, so it's essentially here.

Let's check for any missing values using the isnull function, and then we're gonna do a summation of how many missing values there are. We see that there are a couple of missing values here, and let's just say that we're gonna replace all of the missing values with the number zero. Then we're gonna check the missing values again, and we see that we have solved the missing value issue here. And then rank is not telling us anything at the moment, so we're just gonna take it out by dropping the column. We see that the rank column is now dropped, and we have this table data here called df.

So let's write this to a CSV file. We're gonna type in df, which is the name containing the table, and the dot to_csv function, and then we're going to call the file nba2019.csv. And we're gonna have index equal to False, because we don't want the index to be written out. So let's write the file, and then ls to check if the file has been created, and it is right here, nba2019.csv. And then let's briefly take a look at the contents of the file in bash, and so the content is here as a CSV file. Okay, and so we're going to read the data back in and assign it to the df data frame again, and the data is right here. Looks very clean.

Now, you'll notice that it displays only ten rows of your data frame, so it will show the first five and the bottom five of the data. Let's say that you want to see the entire content, all of the rows. Is that possible? Yes, you could do that using set_option, so go ahead and run that and then run the data frame again, and now it will list all of the data of the entire data frame. So that's just in case you want to see all of the players and it's not possible to do it within the Jupyter notebook. And let's say that you don't want to have this lengthy data frame: you could just revert it back by using set_option and setting max_rows back to ten. Let's have a look at the data frame again, and it looks just as before, you see the first five and the bottom five of the data frame.

All right, and so let's have a look at the data type of each column of your data frame. We're gonna see that player is an object, position is an object, team is an object, and the rest are either integers or floating-point numbers. Let's say that we want to show a specific data type in our data frame. We could use the select_dtypes function with include equal to the data type that we want to be shown. So if we want to show all of the numbers, that will include integer and floating-point numbers.
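The cleaning, CSV round trip, and display steps described so far might look roughly like this. It is a sketch on a made-up three-row table; the values, the FG% column, and the temporary file location are assumptions, and only the pandas calls themselves come from the walkthrough.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the `raw` table after header clean-up.
raw = pd.DataFrame(
    {
        "Rk": [1, 2, 3],
        "Player": ["Player A", "Player B", "Player C"],
        "FG%": [0.45, np.nan, 0.51],
        "PTS": [5.3, 1.7, 8.9],
    }
)

# Count missing values per column, then replace them with 0.
missing_before = raw.isnull().sum()
df = raw.fillna(0)
missing_after = df.isnull().sum()

# Rk carries no information here, so drop that column.
df = df.drop(["Rk"], axis=1)

# Write to CSV without the index, then read it back in.
# (A temporary directory is used here only to keep the sketch tidy.)
path = os.path.join(tempfile.mkdtemp(), "nba2019.csv")
df.to_csv(path, index=False)
df = pd.read_csv(path)

# Show every row, then go back to a short ten-row preview.
pd.set_option("display.max_rows", df.shape[0] + 1)
pd.set_option("display.max_rows", 10)

print(df.dtypes)
```

If the goal is to restore whatever the library default is rather than a specific number, `pd.reset_option("display.max_rows")` does that.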

Then we want to use the include equals number argument here, and so we're gonna see only the numbers. Or if we want to show only the objects, then in the argument we're gonna use include object, and it's gonna show only the object columns, which are the player name, the position, and the team.

Okay, and so when we're doing exploratory data analysis, before we dive into all of the functions and commands that allow you to do exploratory data analysis, which might be a bit boring or quite lengthy, why don't we focus on the fun part? Let's ask some questions and see which commands can help us answer them. I'm going to group this into the various headings that I will show you hereafter.

So the first concept here is conditional selection. Let's say that we want to show specific rows or columns in the data set which match a particular condition. Let's demonstrate that with our first example: which player scored the most points per game? Let's see how we can do that. Points is the column PTS. Let's have a look at the meaning: PTS means points per game, so in basketball it is the average number of points that a player scores per game. This is calculated by taking all of the points that the player has accumulated over the season divided by the number of games that the player played, and that will be in the column PTS.

So let's have a look at the question again. The question asks which player scored the most points per game, and we see that there are about 700 players in the data frame. In order to find which player scored the most points, we have to use the max function, which will tell us the maximum value for the given variable. In order to select the column PTS from the df data frame we will use df dot PTS. That is one way. Another way would be to use df, bracket, quotation, PTS.
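As a rough sketch of the type-based selection and the two ways of picking a column, here is a small example. The three players and their numbers are invented for illustration; only the column names follow the tutorial.

```python
import pandas as pd

# Hypothetical three-player table using the tutorial's column names.
df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C"],
        "Pos": ["PG", "C", "SF"],
        "Tm": ["AAA", "BBB", "CCC"],
        "G": [70, 64, 58],
        "PTS": [12.4, 30.1, 8.9],
    }
)

# Keep only the numeric columns (integers and floats).
numbers = df.select_dtypes(include=["number"])

# Keep only the object (text) columns, as in the walkthrough.
objects = df.select_dtypes(include=["object"])

# Two equivalent ways to select the PTS column.
pts_attr = df.PTS
pts_key = df["PTS"]

# The highest points-per-game value, without the player's name.
top = df["PTS"].max()
print(top)
```

The bracket form works for any column name, while the dot form only works when the name is a valid Python identifier.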

Okay, so this is the selection of the particular column, and to get the maximum value we will use the max function again. That will tell you 36.1, but it won't tell you the name of the player. Either way it will be the same answer, 36.1. So the next question is which player scores 36.1, and that player will be the answer to our question right here. For that we have to use this thing called conditional selection. We already know the answer to be 36.1, and notice that this block of code here is essentially right here. When we type in the data frame df, it will display all of the players, and if I say df with a particular column, it will show the values of that column. So in order to show the player who scores the most points, we're going to type in df, open bracket, df dot PTS, then a double equal sign followed by df dot PTS dot max with opening and closing parentheses, and then the closing bracket. That will show us that James Harden from the Houston Rockets, with the position of point guard, scores the most points at 36.1. And this will display the entire row, so the name of the player along with all of the data that are associated with this player.

But let's say that we want to return specific values about this player. Let's say that we want to know the team. So how are we going to return the name of the team? We're gonna do this by copying this block of code here and assigning it to a variable called PlayerMaxPoints, in order to simplify the look of the code a bit. Then we're gonna call PlayerMaxPoints again and use dot Tm, because we're gonna select the column called Tm, which is the team. Let's run this, and we see that the team is HOU, so it's the Houston Rockets. And which position is the player playing? We're gonna select only the Pos column, and the position is PG, point guard.
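The conditional selection just described can be sketched like this. The table is a made-up stand-in, so the player, team, and numbers are assumptions; the pattern of comparing a column against its own maximum is the one from the walkthrough.

```python
import pandas as pd

# Hypothetical players; real values would come from the scraped table.
df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C"],
        "Pos": ["PG", "C", "SF"],
        "Tm": ["AAA", "BBB", "CCC"],
        "G": [70, 64, 58],
        "PTS": [12.4, 30.1, 8.9],
    }
)

# The comparison yields one True/False per row; indexing df with it
# keeps only the rows where the condition holds.
is_top = df.PTS == df.PTS.max()
PlayerMaxPoints = df[is_top]

# Pull individual columns from the selected row(s).
team = PlayerMaxPoints.Tm
position = PlayerMaxPoints.Pos
games = PlayerMaxPoints.G

print(PlayerMaxPoints)
```

If two players tie for the maximum, this selection returns both rows, which is usually what you want for this kind of question.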

How many games did the player play in the season? So dot G, and 78 is the answer. Okay, so that is the answer to the first question.

So now let's move on to the second question. Let's say that we want to know which players scored more than 20 points per game. The first question was which single player had the most points per game, and this one is which players score more than 20 points per game, so we're going to retrieve many players here. The condition here will be df dot PTS greater than 20, and we're putting this as the argument inside the brackets. So inside the bracket is the condition, and df bracket means we want to select the rows satisfying the given condition, which is df dot PTS greater than 20. And so here we're gonna retrieve the names of all players, along with the associated data, who score more than 20 points per game. You're gonna see that the middle rows are not displayed here, so if you want to show them you can go up, find the code about set_option, and run that. But we're gonna move on to the next question now.

So the next question is which player had the highest three-point field goals per game, and this will be the 3P column. As always we're going to use df and then the bracket, and inside the bracket we're gonna use df 3P equal equal df 3P max. That will give you the maximum value of the 3P column, and then it will return the rows matching that condition. Stephen Curry is the answer, whereby he scores on average 5.1 three-pointers per game.

Okay, and the next question is which player had the highest assists per game. The column name is AST, and it's the same concept here: we're going to use df, and in the bracket as the argument we're going to use df dot AST equal equal df dot AST dot max. That will return Russell Westbrook, where he has AST, or assists, of 10.7. All right, and so the next several questions will be using the groupby function.
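The threshold and per-column maximum questions above follow the same pattern, sketched here on invented data. One detail worth noting is that a column named 3P starts with a digit, so it has to be selected with brackets rather than the dot form.

```python
import pandas as pd

# Hypothetical values for illustration only.
df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C", "Player D"],
        "PTS": [12.4, 30.1, 8.9, 21.5],
        "3P": [1.2, 0.4, 3.3, 2.0],
        "AST": [9.8, 2.1, 3.0, 4.4],
    }
)

# Every player averaging more than 20 points per game.
over_20 = df[df["PTS"] > 20]

# "3P" is not a valid Python identifier, so df.3P would be a
# syntax error; bracket selection is required for this column.
top_3p = df[df["3P"] == df["3P"].max()]

# Same idea for assists; the dot form is fine for AST.
top_ast = df[df.AST == df.AST.max()]

print(over_20)
```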

So the question here is which player scored the highest points from the Los Angeles Lakers. Let's have a look sequentially at what this block of code does. First we're going to assign a variable called LAL, and the content will use the df data frame. It will group the data by the team, and after grouping it will select the specific team that we want, which is the Los Angeles Lakers. Let's run this code and have a look at LAL. As you can see in the team column, all of the players here are from the Los Angeles Lakers, and there are 22 rows, and the full set of columns is shown. If we change LAL to something else, the team will change, right? OKC, and the team changes to OKC. Let's change it back to LAL, and that's the answer.

So let's go to the next question: of the five positions, which position scores the most points? In order to answer this question, the first thing that we want to do is to group by the position. So we're gonna use df dot groupby, group by the position, Pos, then dot PTS, which is the points, and then dot describe, which will give us the descriptive statistics. Here we see that there are more than five positions, because some positions are hybrid, meaning that some players played two positions. The player could play both as a center and a power forward, both as a power forward and a small forward, both as a small forward and a shooting guard, or both as a power forward and a shooting guard. You see that the count here is very low, so only one or two players are playing both positions.

Let's say that we want to remove these low-occurrence data. How are we going to do that? First we're going to define a variable called position, and we're going to make it a list of the five positions that we want to be shown, which are the traditional positions: center, power forward, small forward, point guard, and shooting guard. Then we're going to define a variable called POS, with capital letters, and we're gonna define it as df, open bracket, and inside we have df with the Pos column in brackets, then the isin function with position as the argument. What this essentially does is remove all of the irrelevant positions and display only the subset of data containing the positions in the list. Because we have only five positions in the list, it will display only those five positions. Just run that, and we see that only the five positions are shown. There are 700 rows, right, because the hybrid positions contain 2 and 2, which is 4, then 2, so 6, and then 1 and 1, so it's 8. Before it was 708 rows and now we have 700 rows, so 8 are removed, and that is the correct answer.

So now let's take a look at the descriptive statistics again, and we see a beautiful answer here: the five positions, and we get the count, the mean, the standard deviation, the quartiles, and the maximum value. Here we're going to see that the average points are roughly 8, plus or minus 5 to 6 points. We see that the most points are scored by the position of center, 8.78, but still, they're roughly similar. And interestingly, we see that the point guard had the highest standard deviation as well, so it probably means that there are several point guards scoring quite high here.

Okay, and so let's take a look now at the visuals. Let's make some histograms, but before doing that, let's create a subset of the data frame. Here we're going to select the columns position and points and assign them to a new data frame called PTS, and then we're gonna select only the five positions.
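The grouping steps above can be sketched as follows. The eight rows are invented, including a single hybrid C-PF entry, so the counts and means here are illustrative and do not match the real table.

```python
import pandas as pd

# Hypothetical rows; one player has a hybrid position ("C-PF").
df = pd.DataFrame(
    {
        "Player": [f"Player {c}" for c in "ABCDEFGH"],
        "Pos": ["C", "PF", "SF", "SG", "PG", "C-PF", "PG", "C"],
        "Tm": ["LAL", "LAL", "OKC", "OKC", "LAL", "OKC", "OKC", "LAL"],
        "PTS": [10.0, 7.5, 12.0, 9.0, 20.0, 3.0, 15.0, 6.0],
    }
)

# Group by team, then pull out a single team's rows.
LAL = df.groupby("Tm").get_group("LAL")

# Descriptive statistics of points for every position label,
# hybrids included.
by_pos = df.groupby("Pos")["PTS"].describe()

# Keep only the five traditional positions.
position = ["C", "PF", "SF", "PG", "SG"]
POS = df[df["Pos"].isin(position)]
by_pos5 = POS.groupby("Pos")["PTS"].describe()

print(by_pos5[["count", "mean", "std"]])
```

To finish the Lakers question, one option (not shown in the walkthrough) is to apply the earlier conditional selection to the LAL frame, for example `LAL[LAL.PTS == LAL.PTS.max()]`.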

Okay, and let's run that. So we have the five positions here and the points column. And so let's show the histogram. This is the built-in function of pandas: as you can see, we're taking the PTS data frame and the PTS column, and then we're going to use the built-in hist function, which is going to display the histogram. We're gonna make several histogram subplots, and that will be by Pos, which is according to the position, and because we have five positions it will create the five separate histogram plots shown here. Let's say that we don't like the layout, the layout is a bit off. We can specify the layout option, and as the tuple we will use 1 comma 5, so it will show you one row and five columns. And let's say that the width is suboptimal. We could customize that further by setting the figsize option to 16 and 2, and now it looks quite good. You could go on and further customize the number of bins that are shown in the graph here. Okay, so this is the built-in function of pandas.

And let's say that we want to use seaborn to do the same thing. This is the seaborn code: we're gonna import seaborn and matplotlib, and here we're gonna use sns.FacetGrid, which will create the multiple subplots that you see here from the input data frame. The multiple subplots will be created according to the position column, and as you can see, the position column is broken down into the five unique values: center, power forward, small forward, shooting guard, and point guard. And it shows this as the facet grid.

All right, so let's move on to the box plot. Here we're gonna use the PTS data frame, dot boxplot, so this is the built-in box plot function of pandas, and as arguments we're gonna define column equal to PTS and by equal to position again, so we're gonna have the five boxes inside the plot here. If we don't define this, let's see what happens.
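The pandas built-in plots described above might be produced along these lines. This is a sketch with invented points values; it assumes matplotlib is installed, and the off-screen backend line is only there so the example runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical points per game, two players per position plus one hybrid.
df = pd.DataFrame(
    {
        "Pos": ["C", "C", "PF", "PF", "SF", "SF",
                "PG", "PG", "SG", "SG", "C-PF"],
        "PTS": [10.0, 6.0, 7.5, 4.0, 12.0, 9.5,
                20.0, 15.0, 9.0, 11.0, 3.0],
    }
)

# Subset to the position and points columns, five positions only.
position = ["C", "PF", "SF", "PG", "SG"]
PTS = df[["Pos", "PTS"]]
PTS = PTS[PTS["Pos"].isin(position)]

# One histogram per position, laid out as 1 row by 5 columns.
hist_axes = PTS["PTS"].hist(by=PTS["Pos"], layout=(1, 5), figsize=(16, 2))

# One box per position; leaving out `by` gives a single combined box.
box_ax = PTS.boxplot(column="PTS", by="Pos")

n_figures = len(plt.get_fignums())
```

The `hist` call also accepts a `bins` argument if the default bin count does not suit the data.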

We see only the consolidated points per game, but if we say by position, this one box will be separated into the five individual positions, and each position will be given its own box. And let's say that we don't like this box plot and we want to do it in seaborn. We could do this using the sns.boxplot function with the arguments x equal to the position and y equal to the points, and the data is the PTS data frame. It looks quite good here, very simple code, and it looks really nice with the colors as well.

Now let's say that we want to show the box plot and also see the individual data points for each of the boxes. We will use the stripplot function and set jitter to True, otherwise the points will be superimposed, or stacked up on the same spot, if there are multiple points at that same position. Jitter will randomize the position a bit so that instead of overlapping, the points will spread out a bit. And then here we're going to use an alpha transparency of 0.8. And so it looks like this: we see the actual data points on top of the box plot.

Okay, and so let's now have a look at the heat map. We're gonna compute the correlation matrix, which will be the data that we use to make the heat map. We're gonna assign it to the corr variable, and the df data frame will be used to compute the correlation. Here we obtain the correlation matrix by using only the df.corr function, and it is 26 rows by 26 columns because we have 26 variables, so it will be a pairwise correlation matrix. For each of the 26 columns we're gonna compute the pairwise Pearson's correlation coefficient. Variable X1 with X1 has a correlation of 1, right? X2 and X1 has its correlation coefficient, and in the X1 row here we have X1 and X3, X1 and X4, X1 and X5, et cetera. Then we move on to the next row: X2 and X1, X2 and X2, X2 and X3, et cetera, and then we move on to the third row, the fourth row, the fifth row, until we have gone through all of the 26 rows. So essentially it will be 26 by 26, but as you will see here, it is symmetric about the diagonal, meaning the values below and above the diagonal of ones will be mirror images of one another, because they are the same pairwise correlation coefficient. Age and G here is the same as G and Age right here, and GS and Age is the same as Age and GS. It's the same value, just a mirror image. So I will show you how to make the diagonal version or the full box version of the correlation matrix in the format of a heat map.

Okay, so let's make the heat map. It's as simple as using the heatmap function with the corr data frame, which contains the correlation coefficient matrix, as the argument. This will give you a heat map of the intercorrelation of each variable with one another. You see that the white color here is the diagonal, having a correlation coefficient of 1, so when the color is lighter the correlation coefficient is close to 1, and if it has a darker color it means that the correlation coefficient is low, and if the color is red it means that the correlation coefficient is about 0.5. So this is a gradient scale.

All right, and let's say that we want to adjust the figure size. Then we'll have to use the matplotlib functions here as well, by defining fig and ax equal to plt.subplots with the figsize argument set to the tuple 7 comma 5, and then sns.heatmap. We're gonna make the heat map square, so we're gonna say square equal to True, and notice that we have a square heat map now and the size is adjusted to what we wanted.

Okay, and so let's create another version. I got this from the link here, from seaborn. We will mask, or hide, half of the heat map, because the values there are essentially just a mirror image of the bottom part. So we're gonna show only the bottom half, and the mask variable here will allow us to do that.
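To make the mirror-image idea concrete, here is a small sketch of the correlation matrix and an upper-triangle mask, using only pandas and NumPy. The data are invented, and restricting to numeric columns before calling corr is a choice made here so the example does not depend on the pandas version; the notebook itself may do this differently.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric stats for five players.
df = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D", "E"],
        "Age": [22, 25, 31, 28, 35],
        "G": [60, 72, 55, 80, 41],
        "PTS": [8.1, 15.3, 6.2, 21.0, 4.4],
    }
)

# Pairwise Pearson correlation between the numeric columns.
corr = df.select_dtypes(include=["number"]).corr()

# True on and above the diagonal: the half that would be hidden.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Blank out the masked cells to see the lower triangle on its own.
lower = corr.mask(mask)

print(lower)
```

With seaborn installed, passing these to `sns.heatmap(corr, mask=mask, square=True)` should draw only the lower half, which is the masked version described above.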

Okay, all right, and so let's move on to the scatter plots. Have a look at the data frame again. We're gonna select the columns that have a numerical data type, and as I have mentioned above, we're gonna use the select_dtypes function with include set to number, and so here we see only the numbers. We're gonna assign this content to the number variable. And then here we're gonna select only the first five columns, so we're gonna write number dot iloc, and iloc means index location, then a bracket, and the first value that we see is the colon. The colon here means that we're gonna select all of the rows. Then we have a comma, and colon five means that we're gonna select columns 0, 1, 2, 3, 4. So it's gonna select the first five columns, and this is selection based on the index number.

And let's say that we want to select the columns based on the column names. It's the same concept as when we selected the five positions that I created above, the center, power forward, small forward, point guard, and shooting guard, right? So we're going to use the same concept. We're gonna define a variable called selections, and the content will be a list of the column names. Then we're gonna write df with an opening and closing bracket, and the argument will be selections, and it will display data containing only these columns. Let's have a look. All right, it does exactly as we expected: it displays age, games, steals, blocks, assists, and points.

Okay, so now let's make the scatterplot grid, because above we just created our data. Let's run both the five-column version and the all-column version at the same time, because the all-column version will take a long time to compute.
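The two selection styles just described, by position with iloc and by name with a list, can be sketched like this. The rows are invented, and the column abbreviations STL, BLK, and AST are assumed to be the table's labels for steals, blocks, and assists.

```python
import pandas as pd

# Hypothetical two-player table with a mix of text and numeric columns.
df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B"],
        "Pos": ["PG", "C"],
        "Age": [25, 31],
        "G": [70, 64],
        "GS": [70, 10],
        "MP": [34.1, 15.2],
        "FG": [7.2, 2.1],
        "STL": [1.5, 0.4],
        "BLK": [0.3, 1.8],
        "AST": [9.8, 1.1],
        "PTS": [21.4, 5.6],
    }
)

# Numeric columns only.
number = df.select_dtypes(include=["number"])

# iloc[rows, columns]: ":" means all rows, ":5" means columns 0 to 4.
first_five = number.iloc[:, :5]

# Selection by name: pass a list of column labels.
selections = ["Age", "G", "STL", "BLK", "AST", "PTS"]
selected = df[selections]

print(list(first_five.columns))
print(list(selected.columns))
```

A scatterplot grid of `selected` could then be drawn with something like seaborn's `pairplot`; the transcript does not name the plotting function, so that particular call is an assumption.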
And so we're gonna create the scatterplot grid for the first five columns. Here we see the scatterplot grid showing the scatterplots between the various columns, age, games, steals, blocks, assists, and points, against the same set of columns, points, assists, blocks, steals, games, and age. Notice that a perfect line here means that a column is compared with itself, so it is a self-comparison: age and age, games and games, steals and steals, blocks and blocks, assists and assists, points and points. And half of the plots will be mirror images of the other half, so it's the same concept as the correlation matrix heat map.

All right, and so now we see that all of the columns have been computed for the scatterplot grid. We see that there is a 26 by 26 plot grid here, the diagonals are shown here and the remaining ones are shown here, and the upper half and the lower half are essentially the same information. We see a lot of positive correlation, or no correlation at all, right? Some variables have positive correlations and other variables don't. Field goals and points, okay, so there is a positive correlation. Field goal attempts and points, too. So they are related to scoring, right? Field goal attempts are the number of times they shoot in order to score, and so that is directly related to the points scored in a game.

Okay, so congratulations, you have now done some basic data pre-processing and exploratory data analysis. As always, in order to learn data science you have to do data science, so feel free to apply this block of code to your data set of interest in order to expand your data science portfolio. Let me know which data set you are working on, and let me know your success in applying this code to your own data set. If you find value in this video, please give it a thumbs up, and please enjoy the journey. Thank you for watching. Please like, subscribe, and share, and I'll see you in the next one. But in the meantime, please check out these videos.