How To Speed Up Pandas by 4x: The Modin Library
Hello all, my name is Krish and welcome to my YouTube channel. Today we are going to see how we can speed up the pandas library by at least three to four times. One of the major disadvantages of pandas is that it is slow, and I'll tell you the reason. If you compare pandas and NumPy, we know that NumPy is fast because its core is implemented in C; it is basically a combination of C and Python. Now there is an amazing library called Modin, and if you install it, it can make pandas much faster.

What is this library all about? Suppose my current system has 16 CPU cores; similarly, if you have a laptop, some people may have 4 CPU cores, some may have 8. Usually, pandas utilizes only one of those cores. Once we install the Modin library, whenever we read a data set or do some pre-processing on it, Modin divides the work across all the CPU cores that are present. So Modin can definitely make pandas really fast.

One important thing you should know before practicing this: to actually see Modin's effect, you need a huge data set. For that purpose I'll give you the link to a data set available on Kaggle; it should be at least 800 MB or so. If your file is only around 500 or 600 MB, you won't really be able to see much of a difference in the timings.
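As a quick aside (this snippet is mine, not from the video), you can check how many CPU cores your own machine has; plain pandas will use only one of them, while Modin can use all of them:

```python
import multiprocessing

# Total CPU cores visible to Python on this machine.
# Plain pandas uses one; Modin spreads work across all of them.
cores = multiprocessing.cpu_count()
print(f"This machine has {cores} CPU cores")
```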
With a really huge data set, though, you will definitely be able to see the difference. And I have a powerful system, so even with plain pandas the timings here will be on the lower side.

So let's go ahead, and first of all let's see how to install Modin. To install it you just have to write pip install modin. Apart from this, Modin has two engine dependencies: one is Ray and one is Dask. Both are used for parallel processing, and we can use either Dask or Ray.

Before going ahead with this video: this video has been sponsored by Unacademy. Unacademy has come up with a comprehensive and concise track to become an expert in C++, by the mentor Pulkit Chopra. You can check all the details about this course; it is a one-year-long, structured, goal-oriented batch that begins with data structures, the most important topic for programmers at all levels. They are going to cover very important topics like greedy algorithms, number theory, recursion and DP, discrete mathematics, computational geometry, graph algorithms, and many more. Apart from this, if you really want to see the full syllabus, you can check it out; here is the schedule, which starts from Jan 18th, beginning with sorting algorithms, and you can also see the complete schedule with all the dates. You can also check out the other educators at Unacademy, and take the free classes and tests that have been provided for competitive programming. Apart from this, if you are really interested in going ahead with the subscription, just click on Get
Subscription, and make sure you apply the referral code KN06; once you apply it, you get 10 percent off. All the information and the link for this program are given in the description of this video, so please go ahead and check it out.

Now, in this example, the first step is the installation: pip install "modin[dask]" if I want to go with Dask, or pip install "modin[ray]" if I want to go with Ray, which is what I'll use in this example. So I paste that into the notebook and press Shift+Enter, and here you can see "Requirement already satisfied", because I have already done this installation.

Once the installation is done, the next step: I'll import pandas as pd. When I import pandas as pd like this, it means I'm using normal pandas. Remember, I downloaded this particular data set (the part-two CSV) and renamed it to test.csv, so that it is easy to refer to; you can see test.csv over here. First I'll use the %%time cell magic, which will tell us how much time it takes to load the data set, and then call pd.read_csv("test.csv"). That's all.

And remember, execute this cell separately first, because loading the library itself may take some time depending on your system. Once I execute it, you can see it is a more-than-800-MB file, and the total time is somewhere around 7.3 seconds; this is your entire data set. Let me quickly do one thing: restart the kernel and run it again, this time also displaying df, so there is no confusion; I import pandas, read the CSV, and display df. It finally takes somewhere around 7.37 seconds. So with the pandas library, reading this CSV file takes about 7.37 seconds.

Now let's do some data pre-processing. There is a column called at_team, so I'll perform a group-by on that attribute: df.groupby("at_team"). If I check the type of the result, it says it is a pandas.core.groupby.generic.DataFrameGroupBy object at some memory location. On this group-by I want to use an aggregate function called count. If I measure the time for that and execute it, you'll see it takes somewhere around 1.76 seconds.

So with pandas we took around 7.37 seconds just to read this huge data set, and remember, while pandas is reading the CSV it is using only one core of the CPU. Similarly, when we perform this pre-processing, a simple group-by operation, it again uses only one core.
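Put together, the pandas-only steps look roughly like this (a sketch: test.csv and the at_team column are the names used in the video; here I create a tiny stand-in file so the snippet is self-contained, whereas on the real ~800 MB file the read took about 7.37 s and the group-by about 1.76 s):

```python
import time

import pandas as pd

# Tiny stand-in for the ~800 MB Kaggle file used in the video.
pd.DataFrame(
    {"at_team": ["A", "B", "A", "C"], "score": [1, 2, 3, 4]}
).to_csv("test.csv", index=False)

start = time.time()
df = pd.read_csv("test.csv")          # plain pandas: single-core read
print(f"read_csv took {time.time() - start:.4f} s")

grouped = df.groupby("at_team")       # a DataFrameGroupBy object
start = time.time()
counts = grouped.count()              # aggregate: row counts per team
print(f"groupby-count took {time.time() - start:.4f} s")
print(counts)
```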
Okay, now let me do one thing: let me restart the kernel, and after restarting I am going to import modin.pandas as pd, so now I'm planning to use Modin. If you want to import Modin, you just have to write import modin.pandas as pd; you can check the documentation, and that is exactly how it is given there. It may take some time; okay, it has executed perfectly. Now I'll do the exact same operations; just see the difference. I paste the same read cell over here, and note I've kept the same name, pd, only right now it is modin.pandas, not plain pandas. Once I execute it, you can see it reads the entire data set in about 2.9 seconds. What is happening in the back end is that when we read the CSV, the work is distributed between multiple cores: one chunk goes to core one, another chunk goes to core two, and so on, so the file is read quickly; parallelism, multi-core parallelism, is being used to read this data set.

Next, I'll execute that same group-by statement and see the time: here you can see 1.52 seconds, and I think the previous one was around 1.76 seconds. That is with just this many records; even so, it has performed better. So that is Modin versus pandas.

Now, Modin definitely has a lot of advantages, and you should try to use it. Why? A group-by is just one kind of operation in pandas, and Modin has that operation too. In fact, the documentation says that more than 73 percent of the operations present in pandas are available in Modin, so people can use those operations for their data pre-processing; Modin will utilize the CPU cores that are present and execute quickly.

So let us revise what we did. To install Modin we just write pip install "modin[ray]"; you can also use Dask if you want, not a problem. Ray and Dask are the two engines you can choose between.
Modin will use Ray if you install it with the Ray extra, or it can use Dask; so what are these? These are Modin's dependencies. After that, in the next cell, you import pandas, read the CSV, and compute the time, and you get somewhere around 7.37 seconds. If you then do a group-by and apply an aggregate function, count, on it, it takes somewhere around 1.76 seconds; and just imagine, if you have millions of records this time will definitely go up if you are using pandas. But in the case of Modin, when I import Modin, read the same data set, and check the time, it is very, very low, and similarly the group-by operation also took much less time.

Now try different operations: try handling missing values, try applying different kinds of aggregate functions, try different pre-processing; you will definitely see the difference in time. And this is how we can speed up our pandas code by at least three to four times, just by using Modin. I hope you liked this video; please do subscribe to the channel if you haven't already. I'll see you all in the next video. Have a great day ahead. Thank you all, bye-bye.
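For example, the kind of follow-up experiment suggested here, handling missing values and applying several aggregates, could look like this (my own small sketch in plain pandas syntax, which Modin mirrors; swap the import for import modin.pandas as pd to run it in parallel):

```python
import pandas as pd  # or: import modin.pandas as pd

df = pd.DataFrame({
    "at_team": ["A", "B", None, "A"],
    "score":   [10.0, None, 30.0, 50.0],
})

# Handle missing values.
df["at_team"] = df["at_team"].fillna("unknown")
df["score"] = df["score"].fillna(df["score"].mean())  # mean of 10, 30, 50 = 30

# Apply several aggregate functions in one pass.
summary = df.groupby("at_team").agg(["count", "mean"])
print(summary)
```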