Data Science Tools: Working with Large Datasets (CSV Files) in Python [2019]


Hello everyone, welcome back again. My name is Jessie, and today is one of our tutorials on data science tools. We are going to see how to work with large and huge datasets. What do I mean by large? Let's say your dataset is more than a hundred megabytes, up to one gigabyte, two gigabytes or even terabytes. It can be very difficult to work with a file like that on your local system. In the cloud it may be faster, but on a local system it can be difficult. So how do you work around that issue?

There are several recommended ways. We have pandas itself, we have Modin, we have Dask, we have Vaex, and others. These are some of the popular packages you can use to make your work easier when you are dealing with a large dataset. To install any of them you use pip: pip install pandas if it is not installed, pip install dask for Dask, which is a very powerful package, pip install vaex, which is another package, and pip install modin, which is pandas working on Ray.

Now let's see the first method. The first method is not about using any of these packages but about splitting. You split your dataset into smaller chunks. Let's see how to do that. If you are working on a Unix-based system like Linux or macOS, you can just go with split, which already comes with your system. It goes split -l, then the number of lines, then the name of the big dataset, and afterwards you can rename the parts. You can also split by size with split -b, maybe 250 kilobytes or 15 megabytes per part. You split the file, work on the pieces, and then cat them back together if you need to. It is a basic way of dividing the file into chunks.

That is if you are working on Linux. On a non-Unix system like Windows, you can go with a tool called CSV Splitter.
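The splitting idea can also be done from Python itself, without any extra tool. The following is only a minimal sketch of that idea, not the tool used in the video: the function name split_csv, the part_0001.csv naming scheme and the default of 10,000 rows per part are assumptions made for illustration. It repeats the header row in every part, like the "include header" option of a splitter tool.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, rows_per_file=10_000):
    """Split `source` into numbered CSV parts, repeating the header in each part.

    Assumes the first row of `source` holds the column names.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(source, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # first row: column names
        out, writer, count = None, None, 0
        for row in reader:
            # Start a new part on the first row and whenever the current part is full.
            if writer is None or count == rows_per_file:
                if out is not None:
                    out.close()
                part_path = out_dir / f"part_{len(parts) + 1:04d}.csv"  # hypothetical naming
                out = open(part_path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                parts.append(part_path)
                count = 0
            writer.writerow(row)
            count += 1
        if out is not None:
            out.close()
    return parts
```

You would call it as split_csv("bigdataset.csv", "results"), where both names are placeholders for your own file and output folder. Only one row is held in memory at a time, so the size of the source file does not matter much.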

Let's see how it works. I'll just open it here. This is CSV Splitter. I have extracted it here, so let me open it. It is a simple tool. You give it the file name, so I'm going to supply the file for my dataset. We are using just a hundred-megabyte file, which is not that huge. Then you supply the location where you want to put the output, so let's put it in this results folder. Then you tick "first row contains column names", which is true for this file, and "include header in each new part". For the number of rows per part it says a hundred thousand, so let's make it ten thousand. Then I click Execute and it starts splitting, so let's give it some time.

Let's check our folder while it is running. This is the folder we are writing into, and you can see it is splitting the file into individual chunks and placing them there in a very nice format, which is quite interesting. You can open each of them in a regular editor, and each one has the header, which is very useful. It takes some time, but that's it. That is the first method if you are working on Windows, and split is the one to use if you are working on Linux.

Now let's see the next option. You can also use Node.js. To do that, you install a CSV package with npm install. Let's move on to the next method, which is just to view your file without any extra packages, with the normal default tools. You just pass the name of the file or dataset, and it is going to preview the contents instead of reading the entire thing.

It is not loading the file, just previewing it, which is very interesting. We also get to know the number of rows in this file. Once you know the number of rows, you can split the file and work on the individual parts. Let's see how. You run it like this. It depends on how fast your system is. This took about six seconds to go through the file and give us the number of rows, together with a preview of the first few rows.

What you want to do next is calculate. You have the row count, so you know what to do with it: how you are going to divide the dataset and how you are going to cut it into chunks. Since we know the number of rows, we can read it thousands of rows at a time, or hundreds, or whatever size we want, and we can also read only some of the columns. Let's see how to do this.

We go with normal pandas. We import pandas as pd, if it is not imported already, and attempt to read our data the usual way to see how it goes. We time it: df = pd.read_csv, then our dataset, the big dataset. Let's read it and see how long it takes. This file is about a hundred megabytes. We could use one terabyte, but for this tutorial let's use only a hundred megabytes. It takes about 8.5 seconds. So the preview took 6.17 seconds and this one takes 8.5 seconds. That is one way of reading it, attempting to read it as one big file. Let's check df.head(). It shows the first rows of our dataset with all of these columns, very nice.

Now let's see how to read only the first rows. We know roughly how many rows the file has, so we can just read the first part of it.
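The preview and the row count mentioned above do not need any package. Here is one way to get them in plain Python. It is a sketch under a few assumptions: the function name preview_and_count is made up for illustration, the file has a single header row, and no field contains an embedded line break (otherwise counting lines overcounts the rows).

```python
from itertools import islice


def preview_and_count(path, n_preview=5):
    """Return the first `n_preview` lines and the number of data rows (header excluded).

    Assumes one header row and no line breaks inside quoted fields.
    """
    # Read only the first few lines for the preview.
    with open(path, newline="") as f:
        preview = [line.rstrip("\r\n") for line in islice(f, n_preview)]
    # Stream through the file once to count lines without loading it into memory.
    with open(path, newline="") as f:
        total_lines = sum(1 for _ in f)
    return preview, max(total_lines - 1, 0)
```

Called as preview_and_count("bigdataset.csv"), with the file name being a placeholder, it gives you the numbers you need to decide on a chunk size before you hand the file to pandas.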

It is very simple. Time it again, then df1 = pd.read_csv with our big dataset, and then I can supply the number of rows I want with nrows. Let's say I want to read only the first rows. If I run this, it takes about 160 milliseconds, which is very interesting. With this option you have a simple and very fast way to load a sample, work on it, do whatever you need, and later apply the same steps to the real dataset. The full read was 8.5 seconds and this is 160 milliseconds, which is very fast. You can also check how many rows we got. Let's check the shape of df1. We only have the first rows here, not the entire file. That is how this comes in handy. Okay, done, let's move on.

Now let's see the amount of memory that is being used by these rows. That is quite simple. It is going to be df1.info with memory_usage set to True, and it gives us the memory used by the DataFrame as a whole. You can also pass memory_usage='deep', which inspects the actual contents of the columns and gives a more accurate figure.

In case you want to check each and every column, you move on to a second method: instead of info, you call memory_usage on the DataFrame itself.
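To make the nrows step concrete, here is a small sketch. The helper name read_first_rows and the default of 1,000 rows are assumptions for illustration, and "bigdataset.csv" in the usage note is a placeholder file name. It reads only the first rows and reports the total memory the resulting DataFrame uses.

```python
import pandas as pd


def read_first_rows(path, n_rows=1000):
    """Load only the first `n_rows` data rows and report the frame's memory footprint."""
    # nrows stops the parser early, so the rest of the file is never read.
    df = pd.read_csv(path, nrows=n_rows)
    # deep=True measures the real size of object (text) columns, not just pointers.
    total_bytes = int(df.memory_usage(index=True, deep=True).sum())
    return df, total_bytes
```

After df1, size = read_first_rows("bigdataset.csv"), you can look at df1.shape to confirm the number of rows and call df1.info(memory_usage="deep") for the summary view.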

We set index to True, so the index itself is counted as well, and it gives the usage for each of these columns in bytes. I can convert that to kilobytes or megabytes. Let's also set deep to True. With deep equal to True you will see that some columns come out much higher, typically the object columns that hold text. This is basically how you find out which columns use the most memory, so you can choose which columns to read when you make your selection. That gives you an idea about it. To compare, let's convert the bytes by dividing. Megabytes are too coarse for this sample, so let's make it kilobytes.

That is how to check the columns. The next thing is to select only the columns you need. Let's see how to do that. We use the time magic again, then df2 = pd.read_csv with the big dataset, and then I supply the columns I want through usecols, which takes a list of columns. Let's see how long it takes. It takes 5.17 seconds. The nrows read was 160 milliseconds and this one is 5.17 seconds, so reading selected columns takes longer than reading a limited number of rows, because it still goes through every row in the file just to keep the five columns. Let's check this one with df2.head(), which shows the first five rows. We can also check how much memory is being used with df2.info and memory_usage set to True. Now it gives a totally different figure, because this is reading all of the rows for these columns.
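The column selection and the per-column memory check can be combined in one small helper. This is a sketch only: the name read_columns is made up, and the column names you pass in depend entirely on your own file.

```python
import pandas as pd


def read_columns(path, columns):
    """Read only `columns` from the CSV and return the frame plus per-column size in KB."""
    # usecols makes pandas keep just these columns; every row is still scanned.
    df = pd.read_csv(path, usecols=columns)
    # memory_usage returns bytes per column (plus an "Index" entry); convert to kilobytes.
    kb_per_column = df.memory_usage(index=True, deep=True) / 1024
    return df, kb_per_column
```

Sorting the result with kb_per_column.sort_values(ascending=False) is a quick way to see which columns are the heaviest before you decide what to load from the full file.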

So that gives you an idea about it. Okay, now let's move on to the next method, which is to read the entire file in chunks. I have already prepared a simple snippet for it, and it goes like this. You set the chunk size, which is the number of rows you want to read at a time, and you create an empty list to hold the results. Then you loop through the entire dataset. pd.read_csv gives you the option of supplying a chunksize, and inside the loop you do whatever processing you want. For example, you can check for values greater than 10,000 or 100,000. You do your calculation on each chunk, then concatenate the results into one place and work on that. Let's run it and see how long it takes. It took about 16 minutes to read the entire file. It took a long time because it is doing the computation as well as the reading. On the result you can call nlargest and then head to check the first five rows. This method works chunk by chunk, so it is simple to understand and very interesting.

Now let's see the next method, which is to use Modin. Modin is a very powerful package that runs on Ray. It is like pandas running on steroids. You just import modin.pandas as pd instead of importing pandas, and it runs all of this processing in parallel. That is the basic idea: what you already do with pandas is going to run the same way.
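Going back to the chunked read for a moment, the loop described above can be sketched roughly as follows. The function name filter_in_chunks, the column argument and the default chunk size are assumptions for illustration. The filter (keep rows where a column is greater than a threshold) mirrors the "greater than 10,000" example, but your own processing step would go in its place.

```python
import pandas as pd


def filter_in_chunks(path, column, threshold, chunksize=100_000):
    """Read `path` chunk by chunk, keep rows where `column` > `threshold`, and combine them."""
    kept = []
    # With chunksize set, read_csv returns an iterator of DataFrames instead of one frame.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        kept.append(chunk[chunk[column] > threshold])
    if not kept:  # nothing was read, for example a header-only file
        return pd.DataFrame()
    # Combine the filtered pieces into a single DataFrame.
    return pd.concat(kept, ignore_index=True)
```

The filtered result is usually much smaller than the source file, so follow-up calls such as nlargest or head run on a frame that fits comfortably in memory.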

Let's try to read our dataset with it. One thing to note: sometimes, if the kernel gets interrupted while Modin is running, it may have problems, and you just have to restart the kernel. I'm going to leave the call as it is, reading the file with Modin, so let's run it, give it some time, and compare it with the previous one. That is very interesting: 12.47 seconds. Now let's see the next thing you can do with Modin. You do not need a different set of functions. You can do almost all of the things you do with pandas. For example, you can just check the head, and it gives us the head of this dataset, the same basic stuff. It has loaded all of these columns and all of these rows.

Let's check the next method, which is to use Dask. Dask is a very powerful package that has several other features, such as dask.dataframe and dask.distributed, so you can work with it in different ways. Let's see how to do it. We import dask.dataframe as dd. Then we time the read: dd.read_csv with our dataset. Let's run it and see the time it takes. It is very, very fast. It is able to set up the read of the entire file in about 260 milliseconds, because Dask is lazy and does not load all the data at this point. The result is a Dask DataFrame, and we can check the same kind of functions on it, very interesting. All of these packages can be used in a pandas-like way.

If you want information about the data, you just call df.info and run it, and it gives some summary information. You can also call describe. On its own, describe only sets up the work, so you add compute to actually run it and describe the entire dataset. It is not that fast, but you get the result, and compute lets you do a lot of things with Dask. It is a better method for this kind of dataset.

Method six is to use Vaex, which is another powerful tool. Sometimes there may be issues with the installation, so I'm just going to skip over it.

Now for method seven. Method seven is to pickle. You import pandas as usual, you open the file no matter how long it takes, and then you pickle it: you convert it from the CSV format to the pickle format, so it becomes easier to read next time. Let's try that. I'm going to read our file: df3 = pd.read_csv with the big dataset. Let's read it first, which takes quite some time. After that, I'm going to pickle it with df3.to_pickle and give it a new file name. So this is how to save it as a pickle. It becomes faster for me to work with the next time I need it. The data no longer sits in the raw CSV format, and the pickle format is easier to read back. Let's reload it and check the time it takes: time it, then pd.read_pickle with the file we just pickled. It takes about three seconds, 2.75 seconds, which is very fast compared with the previous read of the CSV format.
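The pickle round trip described above comes down to two pandas calls. The sketch below wraps them in a helper. The name cache_as_pickle and both file paths are placeholders you would replace with your own.

```python
import pandas as pd


def cache_as_pickle(csv_path, pickle_path):
    """Parse the CSV once, save it as a pickle, and return the frame reloaded from the pickle."""
    df = pd.read_csv(csv_path)          # the slow step: parsing text into typed columns
    df.to_pickle(pickle_path)           # store the already-parsed DataFrame in binary form
    return pd.read_pickle(pickle_path)  # later sessions only need this call
```

In later sessions you would skip the CSV entirely and call pd.read_pickle on the saved file. Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.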

We didn't change anything in the data. I can just check the head and it works perfectly, just like a normal DataFrame, so you can do whatever you need with it. These are the first five rows, read back for us very nicely. There are several other methods you can work with as well. One of them is using the cloud, for example GCP or similar cloud systems. Thank you for watching this long tutorial. In case you have any question or contribution, you can leave it in the comment section so that everybody can benefit. Thank you.