Data Science Course | Intellipaat
We'll take a look at why we need data science, and then move on to the data science process: data gathering, data processing, data analysis, data cleaning and data visualization. Then we'll understand how to create a model and how to test whether the model is performing up to the mark. After that we'll do a hands-on in which we use the logistic regression algorithm to create a model that can make predictions, and finally we'll end with a quiz to test whether you've grasped everything you need to grasp.

So what is data science? Data science is the process of finding hidden patterns in raw, unstructured data. What is raw, unstructured data? It is data that has no visible structure to it. When you visit an e-commerce platform such as Amazon or Flipkart, they store information about the items you've recently viewed, the items you clicked on, the items you searched for, what your search results were, and which ads got you interested in particular items. This kind of data is not really interrelated, so it is unstructured data, and these companies store it to make better predictions for you, so that they can recommend products that match your taste and direct you towards products you might be interested in.

When we talk about data science, one thing that really gets in the way is the terminology: machine learning, data analysis, statistics and many other things. People don't know where data science fits into all of this. If you take a look at the Venn diagram, it explains it quite beautifully: we use statistics and scientific methods, we use algorithms from machine learning, and we use data analysis to create models and use those models to make predictions. Data science is the part of the
Venn diagram where all three of these overlap, which signifies that all three are used in data science: statistics, machine learning and data analysis. Just because it uses these three things doesn't mean you have to be a master of all of them to get started; you can get started quite easily in data science if you know the basics.

So why do we need data science? Around 80% of the data gathered by companies is unstructured. As we've already seen, unstructured data is data with no visible structure, no visible patterns you could use to provide better services to your users. Data science is used to analyze this unstructured data and extract simple observations that help a company structure its business model, serve its customers better and build a better reputation. Unstructured data is quite difficult to understand and grasp with the naked eye, which is why we perform data science operations on it: to extract meaningful observations, gather information about the users, and figure out how best to move forward in the future.

The incoming data can come from various sources. It could come from the web or from a database; that database could be normalized or denormalized, it could be an SQL database or a NoSQL database. The data sources are quite varied, and since the sources are varied, the data models will be quite varied as well. This is a problem many data scientists face: they get a lot of data but have no idea how to join it together, and it is one of the most difficult tasks in data science. Once you gather the data, you need to work out how to join it, which data to leave out from certain sources, how to get the data clean, and so on.

You could take the data acquired from different sources and put it directly into a BI tool (BI stands for business intelligence), but business intelligence tools are often not capable of handling such large quantities of data, because the amount you gather grows with the size of your user base; companies like Amazon have data in terabytes. You can't just throw all of that into a BI tool and expect it to work. This is where data science comes in: you extract meaningful information out of the data and create a model. To handle large amounts of unstructured data we need to draw meaningful trends out of it, and for that we need data science: from the raw data we have collected from our users we want to generate trends and extract information we can use to steer the business in a better direction.

Let's take a look at a real-time use case. Data science is quite useful when we are trying to build a model for credit card fraud detection. Suppose you have a lot of information about credit card transactions, and you know which of them were fraudulent: transactions that should not have gone through the credit card system but did, maybe because someone tried to trick the system for more money, or someone stole a credit card and made a transaction that was never supposed to be made. These kinds of things are quite common, and to handle them what we do is gather a lot of data
about fraudulent credit card transactions, and then we perform the data science operations on it: we gather data from multiple sources, clean it, process it, visualize it, and generate a model that, when fed information about a credit card transaction, can give us an indication of whether it is fraudulent or not. If a transaction is classified as fraudulent, the banking officials can look into it and call the customer to verify whether it was really them who made the transaction, and so on. That is one use case; data science is useful for many other tasks in the banking industry as well, but this is one of its most popular use cases in credit card companies.

There are other use cases too, such as social media analytics: have you ever wondered how Facebook recommends people you may know? There are also targeted ads: Google Ads recommends advertisements based on the things you've interacted with, the YouTube videos you've watched, the searches you've performed on Google, and many other signals. This is where Google Ads comes into play; it uses data science techniques to create models that target advertisements to a specific user. Then there's augmented reality, which places images into a frame where they did not really exist; Apple uses augmented reality, and many other AR devices also rely on data science. There are recommendation engines: when Netflix recommends the next movie or TV show that might interest you, it uses data science in the background. And there's healthcare imaging: systems that can look at brain scans and figure out whether there is a tumor, look at heart scans, ECGs and various other medical images, and give a reading on whether a patient is at risk of heart disease, or whether a tumor is malignant or benign, and so on.

Now, there are many algorithms you can use in data science. Some of them are: linear regression, which is most commonly used when you want to predict a continuous value; it's easiest to understand when there are only two variables you want to plot and predict from, and it can get a little tricky with multiple variables. Similarly we have logistic regression. Logistic regression is a classification algorithm; it's a bit confusing that it has the word "regression" in its name, but it is a classification algorithm. The major difference is that in linear regression we try to predict a continuous value, like the price of a house depending on its square footage, the area it is in and how old it is, whereas logistic regression is used for classification: we take some data and predict which class it belongs to. We might get an image and predict whether or not it contains a dog, or take an image of a flower and predict which species it is. These are different classes our data could be put into, and that is why we use a classification algorithm.
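To make that regression-versus-classification distinction concrete, here is a minimal scikit-learn sketch; the tiny datasets, feature values and prices in it are invented purely for illustration and are not taken from the course material.

```python
# A minimal sketch of regression vs. classification with scikit-learn.
# The tiny datasets below are made up purely for illustration.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g. house price from square footage).
X_houses = [[750], [1100], [1500], [2000]]   # square footage
y_prices = [150000, 220000, 310000, 400000]  # price (illustrative numbers)
reg = LinearRegression().fit(X_houses, y_prices)
print(reg.predict([[1300]]))   # output is a continuous number

# Classification: predict which class a sample belongs to (e.g. dog vs. not-dog).
X_images = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]]  # toy feature vectors
y_labels = [1, 1, 0, 0]                                      # 1 = dog, 0 = not a dog
clf = LogisticRegression().fit(X_images, y_labels)
print(clf.predict([[0.15, 0.85]]))  # output is a class label, 1 or 0
```

The point of the sketch is only the shape of the output: the regressor returns a number on a continuous scale, while the classifier returns one of a fixed set of class labels.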
Decision trees are also classification algorithms. Naive Bayes uses a probabilistic model based on Bayes' theorem and lets you make assertions based on that probabilistic model. KNN stands for k-nearest neighbours: it places a new data point among the existing points and makes a prediction based on the features of its nearest neighbours. K-means clustering, on the other hand, groups data into clusters and is an unsupervised learning technique. Regression and classification are supervised learning: we give the model a labelled data set where we already know the answers, for example the house prices in our data set, and we want it to learn from that and make accurate predictions. In clustering we send in a lot of data and tell the algorithm to find some sort of grouping based on the similarities the data points have, and then tell us what those groupings are. This is very useful when we're trying to work out which customers are similar to each other, and that is where k-means clustering comes into play. Random forest is an advanced version of the decision tree: it uses multiple decision trees to create a classification or regression model.

Okay, so let's take a look at the data science process. The process consists of several steps. The first step, with any problem we're trying to solve, is to understand the business problem: what exactly does the business want from us, given the data we have? Do they want us to make a product recommendation engine? Okay, then what kind of recommendation system: should we recommend products to a user based on their previous purchases, or should we recommend the items that people with a similar taste to the user have bought? This kind of brainstorming, problem understanding and analysis reduces our work and tells us exactly what we are trying to achieve. If we skip it, we might end up solving a problem that did not need to be solved or that the company has no use for; if the company wants a recommendation system based on previous purchases and we build it on some other kind of data, that becomes a real problem. Understanding the business problem also lets us look at the data ahead of time and judge whether it would be useful for the kind of problem we're trying to solve.

Then comes data gathering. Data gathering is one of the most important parts of the data science process, because if the gathered data is incorrect, if it contains too many null values, if most of its values are out of range, if it is too varied, or if it is biased towards one result, then the problem becomes really difficult to solve. Data science works entirely on data, so if you don't give it a large amount of good data, you might not be able to solve the problem you're trying to solve.

Then comes data processing: we take the data, load it into a data frame, process it, join it, and gather data from
multiple sources and convert it into a single source, so that it can be analyzed later and used to make important predictions.

Then comes data analysis. The data we have gathered from multiple sources and converted into a single format now needs to be examined: is it correct, how many null values are there, and is it biased towards one kind of prediction? If I'm building a classifier for whether a person is going to pass or fail his or her next class, and all we have is data about people who passed, then we can't build a model, because our data is completely biased. These are the kinds of things we need to understand: we analyze the data, look at the outliers, work out how much of the data is usable and how many of the features, the columns, are simply not needed for the problem, and so on.

Then come visualizations. After analysis we visualize the data to check whether something looks right and to look at the trends in it. For instance, suppose we have the stock prices of some company as a long list of numbers. We could read through the prices and watch them increase and decrease, but that would be tedious. Data visualization lets us create graphs that make the trend in the data intuitively obvious: a quick look at the chart tells us whether the data shows an upward or downward trend, whether profits are increasing or decreasing, when the profits started decreasing, and what might have been going on around that time. That is where data visualization comes in.

Then comes data cleaning. Depending on the kind of data you have, you may need to convert categorical data to numerical data; maybe you have dates in the wrong format, say the US format, and need to convert them to the Indian format or vice versa; maybe you have data that is simply irrelevant, or outliers, or null values. Dealing with all of this is data cleaning, and it's something you need to take care of. You also need to understand the domain before cleaning the data, because a value that looks like an outlier might be a real value we just didn't understand, and removing it could cause a lot of problems.

Finally, once we have our clean, processed, analyzed and visualized data set, we use machine learning and data science algorithms to create models. These models perform regression or classification, allow us to make accurate predictions, and provide value to the business, which is the problem we were trying to solve. When we create the model we also test it to check whether it is performing well. If it isn't, maybe we need more data, maybe we need another algorithm, maybe we need to clean the data better, or rescale and reshape it; there is a lot of scope in that department.
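As a rough illustration of the cleaning steps just mentioned, here is a minimal pandas sketch. The dataframe, its column names and the fill strategy are all invented for the example; the right choices always depend on the domain, as discussed above.

```python
import pandas as pd

# Hypothetical raw data showing the issues discussed above: US-format dates,
# a categorical column, and missing values. Column names are made up.
df = pd.DataFrame({
    "purchase_date": ["03/25/2021", "04/02/2021", None],  # MM/DD/YYYY
    "city": ["Delhi", "Mumbai", "Delhi"],
    "amount": [1200.0, None, 800.0],
})

# Parse the US-format dates into proper datetimes.
df["purchase_date"] = pd.to_datetime(df["purchase_date"], format="%m/%d/%Y")

# Convert categorical data to numerical data (one-hot encoding).
df = pd.get_dummies(df, columns=["city"])

# Handle null values: here we fill numeric gaps with the median
# (dropping rows is another option, depending on the domain).
df["amount"] = df["amount"].fillna(df["amount"].median())

print(df)
```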
So the first step in a data science solution is to understand the problem. As we've discussed, understanding the problem is key: if we start off wrong in that step, our data science flow is going to be disrupted, and we might end up solving the wrong problem, or hit a dead end because the data we have simply isn't useful for the problem we misheard or misunderstood. So always try to understand what kind of data you have and what kind of problem you're trying to solve; that lets you judge whether the data is useful for solving that problem. When you get a problem, the first things to ask are why do we need to solve it and how can we solve it; the "why" is the most important question in this entire flow. We also need to understand the end product: what exactly are we trying to build, what features are required in the model, what should the input and output be, how fast does it need to be, and why do we take the input in one particular format and not another. These are all questions you, and your entire team, need to have a clear answer to.

Then we determine the data sources for the problem. As already discussed, data is collected from several sources: it could come from a CSV file, a text file, a legacy database, or from the internet using web scraping. We can get a lot of data this way, and it can be a challenge to figure out how to merge all of it and where to get it from. Suppose we are getting data from a legacy database and it is just too much effort for too little data; then we might decide that the source is not worth it and simply not use it. These are the kinds of trade-offs you need to understand when you are trying to solve a data science problem. After all of these steps have been performed, you should have the information and the context required to solve the problem: what exactly the problem is and whether we have the right data for it. If we don't perform these steps in this way, we might spend a lot of time, effort and money solving a problem that didn't need to be solved, that we misunderstood, or that could not have been solved with the data we had, which is, in a sense, a lot of wasted time.

Now let's take a look at some of the steps in data science, starting with data gathering. Data gathering is also known as data extraction: the process of retrieving data from various sources to be used in your data science process. It's not strictly necessary to extract data from multiple sources; if you have one data source with a large amount of data, well and good. However, most of the time a data science project needs large amounts of data, which requires you to gather it from very varied sources. You might need to get data from a CSV file, or from a website or a web API; if the website does not provide an API, you need to scrape the website and pull out the data you need.
Sometimes you also need to query a legacy database: maybe the application that interacts with your customers stores information about their likes and dislikes in a MySQL database that was designed a long time ago, and now you need to extract just the information you need from it. Then you gather the data from all of these sources into one common place; this is where data gathering comes into play. The more data you have, the better you can train your model and the better your chances of solving the problem with greater accuracy. That said, more data doesn't automatically mean your model will be good: data gathering is only successful when the data is well cleaned, well prepared, well preserved, and free of excessive null values and unrealistic values, and when the gathering process leaves you with a large amount of such data.

So data extraction is performed to gather data from diverse sources and data repositories: we extract data from multiple sources, gather it, and store it in one single repository, and that repository is then used to train our machine learning models and make predictions. You can gather data from a lot of places, but two of the most common are databases and the internet. A database could be a NoSQL database or an SQL database, and within NoSQL there are multiple kinds: document stores, graph databases, column stores, and so on. Similarly on the internet,
you have a lot of sources: you could get data from a web API, which might be a RESTful API or a SOAP-based API; there is a lot of diversity in the sources you can get data from. And if the internet source you're looking at does not provide a web API, you might need to write scripts that reach into the web pages, extract the information you need, and store it somewhere you can use it in your data science process.

So now let's get our hands dirty with some code. We'll perform some SQL queries to see how to extract data from an SQL database that contains multiple tables, which may require operations like sub-queries and joins. Let's start with SQL sub-queries. A sub-query is a nested query that lets us extract some data from another table and use that data inside the current query. We have a database with two tables, make and model, used by a car dealership. Let's look at the data we'll be working with: SELECT * FROM make shows three rows, the names of the car companies, and the model table has six rows, each belonging to a specific make. This relationship is expressed through a make_id column, a foreign key that references the id column of the make table.

What we want is a query that looks at the model table and gives us the make_id of every car built in 2010 or later (anything with a year less than 2010 will not be included in the result), and then looks into the make table and shows the name for each of those IDs. To do that we start writing the outer query: we want the name from the make table where the id is in the result of a sub-query, and the sub-query selects the make_id from the model table where the year is greater than or equal to 2010. Run it, and sure enough we get Honda and Hyundai. Let's check that this is the correct answer: we wanted the names of the makes that have models released in or after 2010. There are three such models, City, i10 and Verna, with make IDs 2 and 3; make ID 2 refers to Honda and make ID 3 refers to Hyundai, and that is exactly what we get. So essentially, the inner query returns the make IDs we need, and the outer query takes that result and uses it to display the names. That's sub-queries done.
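Here is a small, self-contained sketch of that sub-query, run through Python's built-in sqlite3 so it works without a separate database server. The exact schema and the rows for the pre-2010 models are not shown in the course, so they are assumptions made for illustration; only the make names and the City/i10/Verna models come from the walkthrough above.

```python
import sqlite3

# A tiny in-memory version of the dealership database described above.
# The schema and the pre-2010 rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE make  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT, year INTEGER,
                    make_id INTEGER REFERENCES make(id));
INSERT INTO make  VALUES (1, 'Maruti'), (2, 'Honda'), (3, 'Hyundai');
INSERT INTO model VALUES (1, 'Alto', 2005, 1), (2, 'Swift', 2008, 1),
                         (3, 'Civic', 2006, 2), (4, 'City', 2012, 2),
                         (5, 'i10', 2010, 3), (6, 'Verna', 2015, 3);
""")

# Sub-query: the inner query returns the make_ids of models released in 2010
# or later; the outer query turns those ids into make names.
rows = conn.execute("""
    SELECT name FROM make
    WHERE id IN (SELECT make_id FROM model WHERE year >= 2010);
""").fetchall()
print(rows)   # [('Honda',), ('Hyundai',)]
```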
Next, let's look at how to join data from two tables. We'll use the JOIN keyword to get data from two tables that are linked through a foreign key and display it in a single result. What we want to do here is extract the make name, the model name and the model year and display them in a single result set. The issue is that the data lives in two separate tables, make and model, linked through the make_id column in the model table; make_id is the foreign key, and we can use it to combine data from both tables wherever the key matches.

The first thing we need is a SELECT query naming the things we wish to extract. From the make table we want the name column, and I alias it as make_name; the reason is that the model table also has a column called name, and I don't want any confusion about which name refers to which table. The other columns we want are model.name and model.year. We select this from the make table, INNER JOIN the model table ON make.id = model.make_id, and there we have the data we wanted: the make name, the model name and the year. Instead of showing IDs we now show the name of the make, so wherever the make_id is 1 we see Maruti as the make name, and similarly for Honda and Hyundai. We have successfully combined data from two separate tables. The reason data gets stored in separate tables in the first place is that we want to keep it in normalized form, with no duplication or redundancy, and that makes the tables a little harder to use on their own; this is exactly why we use the JOIN keyword to join the two tables and extract data from them.
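The same illustrative schema as in the sub-query sketch can be used to show that join; again, the exact rows are assumptions, and only the make names and a few models come from the walkthrough.

```python
import sqlite3

# Same illustrative dealership schema as in the sub-query sketch above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE make  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT, year INTEGER,
                    make_id INTEGER REFERENCES make(id));
INSERT INTO make  VALUES (1, 'Maruti'), (2, 'Honda'), (3, 'Hyundai');
INSERT INTO model VALUES (1, 'City', 2012, 2), (2, 'i10', 2010, 3), (3, 'Verna', 2015, 3);
""")

# INNER JOIN: show each model alongside its make name instead of a bare make_id.
query = """
    SELECT make.name AS make_name, model.name AS model_name, model.year
    FROM make
    INNER JOIN model ON make.id = model.make_id;
"""
for make_name, model_name, year in conn.execute(query):
    print(make_name, model_name, year)
```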
Now let's move on to web scraping. We'll do a hands-on that gives a brief overview of how to write a script that grabs just the content you need from a web page for your data science process, using a simple HTML page. First we create the HTML page: a div tag with a p tag inside it, and the text we'd like to scrape out of it is "Hello World". Suppose this were a website out on the web and you wished to extract the content of the p tag that sits inside the div with the class "content". To do that, I create a Python file, scrape.py (you could name it anything), and install some packages with pip: pip install beautifulsoup4 lxml. Beautiful Soup 4 is the library used for scraping websites, and lxml is the parser it will use underneath: it takes the HTML content from the file, parses it, and lets us navigate that content to reach the data we want to scrape. I have already installed these packages, so your output from pip may look different.

The first thing to do is import BeautifulSoup from the bs4 package. Then I open the index.html file we created, referring to it through a variable f, and create a BeautifulSoup object from that file, instructing it to use the lxml parser underneath. Just to check that everything is working correctly, I print the soup variable, and as you can see it is reading the file correctly. Next I want to extract the div, so I use find on the soup object, looking for a div with the class "content"; printing it shows the output we're looking for, but we need to go one step further and pull out just the text. Beautiful Soup makes that very easy: inside the div we ask for the p tag's text. I save that text in a variable, print it to confirm it's working, and now this data can be stored in a database, a CSV file or a text file, used for pre-processing before a machine learning or data science algorithm, or perhaps passed through word tokenization for NLP. The text has been extracted from the HTML using web scraping, and you can now use it to perform whatever tasks you like or save it wherever you wish.
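For reference, here is a compact, runnable version of that scraping walkthrough. The HTML is inlined so the snippet runs on its own; in the walkthrough the same markup lives in an index.html file next to scrape.py, and the class name "content" follows the transcript.

```python
# pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup

# The index.html described above: a div with class "content" wrapping a p tag.
html = """
<html><body>
  <div class="content">
    <p>Hello World</p>
  </div>
</body></html>
"""

# In the walkthrough the file is opened from disk instead:
# with open("index.html") as f:
#     soup = BeautifulSoup(f, "lxml")
soup = BeautifulSoup(html, "lxml")

div = soup.find("div", class_="content")  # locate the div with class "content"
data = div.p.text                         # the text inside its p tag
print(data)                               # -> Hello World
```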
So let's take a look at data processing. Data processing is the process of converting data into more organized, easily readable information. When you get data from very different platforms, such as databases and the web, during data gathering, it's our job to use libraries such as NumPy and pandas to convert those separate data formats into a single format so that they can all be used together. There are many libraries for data manipulation: NumPy, pandas, SciPy; they all work in conjunction, and pandas actually uses NumPy under the hood, so there is a lot of interdependency between these libraries as well.

Let's look at these libraries. First, NumPy. NumPy is a very popular Python library; the name is an abbreviation of Numerical Python. It is extremely fast, performs well even with large data sets, and is used for mathematical operations, including complicated ones you cannot perform on normal lists, which is why we use NumPy arrays. It is very popular in scientific computation because it provides many functions and methods for converting data from one form to another and organizing it. Pandas, similarly, is a simple and powerful solution for manipulating data in tabular format; it is mostly used for its DataFrames, but it has Series and other features as well. It is built on top of Python, uses NumPy underneath for some of its features, and gives you a lot of functionality out of the box to make your life simple. As a rule of thumb, pandas performs better for around 500K rows or more: for large data sets its internal algorithms are highly optimized and its API is very simple to use. NumPy performs better for around 50K rows or fewer, so for a smaller data set, loading it into a NumPy array, vectorizing it, reshaping it and working with it directly would be the better option.
A pandas Series is a very flexible data structure that lets you define your own labelled indexes; for instance, if you want key-value pairs that perform well even under heavy computational load, you can use a pandas Series. NumPy, on the other hand, accesses the elements of an array using default integer positions, just like a list, whereas in pandas you can use your own labelled indexes.

The first step to creating a NumPy array is to import the numpy package; people usually put the import at the top of the file because it is used so often. You can simply write import numpy, but since NumPy is so commonly used, the convention is to abbreviate it: import numpy as np. You don't have to do this, but most people do. It's easy to create a NumPy array from a simple list: just pass the list into the np.array function. You can also create a 1-D vector or 2-D matrix of zeros using the np.zeros method, or a vector or matrix of a specified shape filled with random numbers using np.random.random.

Then there is the Series in pandas. A pandas Series is a one-dimensional labelled array, and it can hold elements of any data type: strings, numbers, objects, many different kinds of data. To create an empty Series, use the Series constructor from the pandas package; to give it indexes, pass in your data list along with a keyword argument named index, a list of index labels, making sure the number of labels matches the number of elements in the data list passed as the first argument. Then there is the DataFrame, a two-dimensional tabular structure and one of the most important and impressive features of pandas. A pandas DataFrame can be built from several data sources, such as SQL tables, CSV files and Excel files, and you can construct DataFrames yourself. They are very useful: they let you group, aggregate, apply methods and manipulate data in many ways. To create an empty DataFrame just call pd.DataFrame(), and you can also create a DataFrame from a Series by passing the Series into the constructor.

Now let's do some hands-on data processing with NumPy and pandas. First we import numpy and pandas. Let's start with how to create a NumPy array from a list: I create a numeric list containing 1, 2, 3, 4 and 5, then call np.array and pass in the list, and there we have it; checking its type shows a NumPy n-
dimensional array (numpy.ndarray). Now, it's not necessary to pass in only a one-dimensional list; you could, for instance, pass in a two-dimensional one, say [[1, 2, 3], [4, 5, 6]], and you get a two-dimensional array whose type is still numpy.ndarray. Here's another trick: suppose that instead of a two-dimensional array you start with a one-dimensional array of six elements, that is one row and six columns, and you want to reshape it to 2 by 3 so that 1, 2, 3 sit in the first row and 4, 5, 6 in the second. That's quite simple: call reshape and pass in the shape (2, 3). It's important to notice that the original array is not reshaped in place; reshape returns a new array of the requested shape, so if you want to keep it you assign the result back to a variable.

What if you want a NumPy array of zeros? Use np.zeros (you can look at the documentation by pressing Shift+Tab in the notebook) and pass the shape as a tuple, say (2, 3) for a 2-by-3 array or (1, 3) for a 1-by-3 one. There is also a range method, np.arange: if I create an array for the range 0 to 10, remember that the ending number is excluded, so I get the numbers from 0 up to but not including 10, that is 0 to 9. You can combine these operations: take np.arange(1, 10) and reshape it to (3, 3), and you have a 3-by-3 matrix containing the numbers 1 to 9.

Now let's look at pandas. Pandas comes with an object called Series; again, you can view its documentation by placing the cursor inside the brackets and pressing Shift+Tab, and it has a very good descriptive docstring. Pass in some data, say 1, 2 and 3, and you get a pandas Series instantly. You can also build a Series on top of a NumPy range and get a Series of 1 to 9 automatically. This is why NumPy and pandas work so well together: pandas uses NumPy under the hood, so it lets you use NumPy ndarrays with it.
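For reference, here is roughly what that notebook session looks like as plain code; the variable names are my own, and the values follow the walkthrough above.

```python
import numpy as np
import pandas as pd

# Create a NumPy array from a plain Python list.
num_list = [1, 2, 3, 4, 5]
arr = np.array(num_list)
print(type(arr))                         # <class 'numpy.ndarray'>

# A one-dimensional array of six elements reshaped into 2 rows and 3 columns
# (reshape returns a new array; assign it back if you want to keep it).
arr2 = np.array([1, 2, 3, 4, 5, 6]).reshape(2, 3)

# Arrays of zeros, and a range of numbers (the end of the range is excluded).
zeros = np.zeros((2, 3))
grid = np.arange(1, 10).reshape(3, 3)    # the numbers 1..9 as a 3 x 3 matrix

# A pandas Series from a list, and one built on top of a NumPy range.
s1 = pd.Series([1, 2, 3])
s2 = pd.Series(np.arange(1, 10))
print(grid)
print(s2)
```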
Now that we've created a Series, let's look at how to create a DataFrame as well. A DataFrame is essentially a multi-column Series: it lets us store data in tabular format using rows and columns. The way I like to create DataFrames is from a Series, but that's not strictly necessary; you can also build one directly from a list, and there you have your DataFrame. You can pass in the index as well: I'll create an index of 'a', 'b', 'c', and instead of 0, 1, 2 we now have a, b and c as row labels. The index needs to have the same length as the data; if it doesn't, you get an error, in fact a big ValueError, so always make sure the index matches the data.

Creating DataFrames like this can be tedious, so a better way is to use a dictionary, where the keys become the column headers and the values become the column values. Say I have an ID column equal to 1, 2, 3 and a name column with the names a, b and c; pandas returns a DataFrame containing the ID and the name. Another common thing you can do, instead of typing the index by hand, is to turn a column into the index with the set_index method on the DataFrame object: pass in 'ID', and instead of 0, 1, 2 the index is now 1, 2, 3. So if you were extracting data from a database and needed to keep the ID, you can do it this way. We have now created DataFrames, Series and NumPy ndarrays, and we can reshape them; you can combine these operations to create and manipulate data however you wish, and to create dummy data for yourself, which is really quite helpful.
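A minimal sketch of those DataFrame-creation steps, using the same toy values as the walkthrough:

```python
import pandas as pd

# DataFrame from a list, with custom index labels
# (the index length must match the data, or pandas raises a ValueError).
df_list = pd.DataFrame([1, 2, 3], index=["a", "b", "c"])

# A more convenient way: a dictionary, where keys become the column headers.
df = pd.DataFrame({"ID": [1, 2, 3], "name": ["a", "b", "c"]})

# Use one of the columns as the index instead of the default 0, 1, 2.
df = df.set_index("ID")
print(df)
```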
So let's take a look at some data manipulation. To do that I import pandas and read a CSV file named housing.csv; it's some housing data collected in a comma-separated values file. Looking at the head of the data we see four columns, all numeric, which is good, and the shape is 489 rows and 4 columns. Now let's try getting a subset of the data: instead of the whole data set, say we want the first five rows, but instead of using head we'll do it ourselves with iloc. The way iloc works is that you provide a comma-separated pair: the rows you want and the columns you want. The colon operator defines a starting and an ending range, the value on the left being the start and the value on the right the end; if you omit a value it goes to the extreme end. For the rows I write :5, so the start is assumed to be zero and we get the first five rows, excluding index 5 because ranges are exclusive in Python; for the columns I provide neither a start nor an end, so we get all the columns. It's exactly the same result we got from data.head().

But suppose you don't want that: instead of starting from the zeroth row you want rows 2, 3 and 4, and instead of all the columns you want everything except the last one. This is called negative indexing: when I say -1, it goes to the end and excludes one column, so since the last column is MEDV it is left out of the result, and as you can see we get RM, LSTAT and PTRATIO for rows 2, 3 and 4. You're not limited to iloc, either; there is another method called loc, which lets you specify the columns by name: from RM through PTRATIO, and I get all the rows. Because 489 is a large number, pandas truncates the output and shows the first five and last five rows, and I can slice it again to get just the first five. Note the difference: with loc the ending label is included, whereas with iloc the ending position is excluded.

Now that we've loaded the CSV into a DataFrame and taken subsets of it, there are other things you can do as well. For instance, to sort the DataFrame by the RM column, use the sort_values method; it takes a parameter called by, the name of the column to sort on, in our case RM, and an ascending parameter which is True by default. With the default, the values are sorted in ascending order of RM; set ascending to False and they are sorted in descending order, so the largest values come first. You can also set values. Say that instead of the existing PTRATIO values I want every value in the PTRATIO column to be zero: I select data['PTRATIO'] and assign it zero, and looking at the data again, the values have changed. Similarly you can change the values of just some rows by selecting them first: let me reload the data so it's correct again, then use iloc to select the first five rows and the columns from PTRATIO to MEDV, and set all of that to zero. As you can see, the first five rows of the PTRATIO and MEDV columns are now set to zero.
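Here is a sketch of the commands used in that walkthrough. It assumes a housing.csv file with the four numeric columns described above, in the order RM, LSTAT, PTRATIO, MEDV; the threshold values and variable names are just for illustration.

```python
import pandas as pd

# housing.csv is assumed to have four numeric columns: RM, LSTAT, PTRATIO, MEDV.
data = pd.read_csv("housing.csv")
print(data.head())                   # first five rows
print(data.shape)                    # e.g. (489, 4)

# iloc: position-based slicing (the end of the range is excluded).
first_five = data.iloc[:5, :]        # first five rows, all columns
subset     = data.iloc[2:5, :-1]     # rows 2-4, all columns except the last (MEDV)

# loc: label-based slicing (the end label is included).
cols = data.loc[:, "RM":"PTRATIO"]

# Sort by a column, largest values first.
data_sorted = data.sort_values(by="RM", ascending=False)

# Overwrite values: a whole column, or a block selected with iloc.
data["PTRATIO"] = 0
data.iloc[:5, 2:4] = 0               # first five rows of PTRATIO and MEDV
print(data.head())
```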
Again, this is not something you always have to do, but in case you need it, say all the null or otherwise incorrect values are concentrated in some range, you can change them this way. Another thing you can do is apply functions. Suppose you wanted to double every value in the LSTAT column, so 4.03 becomes 8.06 and so on, for all the rows: there is a method called apply, you give it a lambda, x multiplied by 2, and it does it for you. If you don't want it applied to everything, just select the LSTAT column first (and note the column is LSTAT, not LSAT). It's important to note that apply does not change the column in place: the values are doubled in the result it returns, but the original column is still there. To change it, as we discussed earlier, you assign the result back, and now all the values in the LSTAT column have been doubled. This is a very simple example, but it shows that you can apply functions this way; you can perform other transformations similarly.

Finally, there is boolean indexing. Say you want every row where the RM value is greater than some threshold; with the first threshold we get 474 rows. When you type just the condition, you get a boolean mask that is True wherever the condition is met; when you index the DataFrame with that mask, rows where it is True are kept and rows where it is False are skipped, so we end up with only the rows that satisfy the condition. Change the condition to RM greater than 6 and you get 318 rows. And you are not limited to one condition: you can combine multiple conditions, say RM greater than 6 and LSTAT greater than 10, and now we have 268 rows; you can keep adding conditions to filter the data further. Filtering like this matters when you want to extract only the data that fits some condition or rule instead of using the entire data set. Suppose your data set had 200,000 data points, say census data, and you only wanted the subset where the state is Rajasthan or some other place; you could do it exactly this way. So this is how you can extract data, manipulate it, take subsets of it, and apply methods to change its values; it comes in really handy for these kinds of tasks.
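A short sketch of the apply and boolean-indexing steps, under the same assumed housing.csv columns as before:

```python
import pandas as pd

data = pd.read_csv("housing.csv")    # assumed columns: RM, LSTAT, PTRATIO, MEDV

# apply: double every value in the LSTAT column (assign back to make it stick).
data["LSTAT"] = data["LSTAT"].apply(lambda x: x * 2)

# Boolean indexing: keep only the rows that satisfy a condition.
tall_rooms = data[data["RM"] > 6]

# Multiple conditions are combined with &, each condition in parentheses.
filtered = data[(data["RM"] > 6) & (data["LSTAT"] > 10)]
print(filtered.shape)
```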
Now let's look at merging, joining and concatenating the DataFrames created with pandas. Pandas has merge, join and concat features. Merge and join combine two given DataFrames into a new DataFrame using a common column, similar to how you perform join queries on SQL tables; the only difference between them is that join works on indexes, while merge can take any two columns and combine the two DataFrames based on them. Concatenation, on the other hand, takes DataFrames and stacks them one on top of the other in the order in which they were passed. If I create three DataFrames, df1, df2 and df3, make a list of them in that order and pass it to the concat function, it returns a DataFrame containing the data of df1 followed by df2 followed by df3, provided all of the DataFrames have a similar shape and the same columns; you combine them using the pandas concat function. When you merge or join two DataFrames, the columns of the first DataFrame are shown alongside the columns of the second in the same row, much like the output of a join in an SQL query.

There are four kinds of joins or merges: inner join, left join, right join and full outer join. In an inner join, only the rows that are common to both DataFrames are joined together. In a left join, the DataFrame on the left is treated as the higher-priority one: all of its data is shown, and where there is no matching row in the second DataFrame, the columns coming from that DataFrame are filled with nulls. A full outer join combines both DataFrames, and wherever values are missing on either side they are filled with nulls. A right join, or right merge, does the same thing as the left join but with the roles reversed.

Let's do a hands-on that shows merge, join and concatenate. Merging basically means that we specify a common column, and the rows where that column matches are merged together. First we create two DataFrames; before that we import pandas as pd. I create the first DataFrame, df1, from a dictionary: it's a user DataFrame with user IDs 1, 2, 3 and 4 and a username column. Similarly I create a qualifications DataFrame; it has a user ID column of its own (you can give it the same IDs or different ones) and a qualification column. To merge the two DataFrames we use the pandas merge function, passing in the left and the right DataFrames, the user DataFrame and the qualification DataFrame, and then the columns we wish to join on: from the left DataFrame the ID column, and from the right DataFrame the user ID column. Wherever the ID and the user ID are the same in both DataFrames, the rows are merged together, and we get back a DataFrame merged on those IDs, with the usernames a, b, c and d alongside their qualifications.
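A compact sketch of that merge, plus a concat example for comparison; the qualification strings and the extra users in the concat step are invented for illustration.

```python
import pandas as pd

# Two small frames along the lines of the users/qualifications example above.
users = pd.DataFrame({"id": [1, 2, 3, 4],
                      "username": ["a", "b", "c", "d"]})
qualifications = pd.DataFrame({"user_id": [1, 2, 3, 4],
                               "qualification": ["B.Tech", "B.Sc", "M.Tech", "MBA"]})

# Merge on the columns that link the two frames: id on the left, user_id on the right.
merged = pd.merge(users, qualifications, left_on="id", right_on="user_id")
print(merged)

# Concatenation simply stacks frames with the same columns on top of each other.
more_users = pd.DataFrame({"id": [5, 6], "username": ["e", "f"]})
stacked = pd.concat([users, more_users])
print(stacked)
```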
Venn diagram where all of these three things overlap it signifies that all of these three things are used in data science statistics machine learning and data analysis now just because it uses these three things doesn't mean that you have to be a master in all of these things to get started you can get started quite easily in data science you need to just know the basics so why do we need data science 80% of the data gathered by companies is unstructured data as we've already seen unstructured data basically means data that has no visible structure visible patterns that you could use to provide better services to your users so 80% of the data gathered by other companies is unstructured and data science is used to analyze this unstructured data and extract simple observations that could help the company's structure the business model in order to help the customers better and get a better reputation so unstructured data as visible to the naked eye is quite difficult to an understand and grasp which is why we perform these data analysis tasks to data science and we perform data science operations on it to extract some meaningful observations out of the unstructured data and gather information about the users and how to best move forward in future so the incoming data can be of from various sources it could be from the web it could be from a database this database could be normalized could be denormalized it could be an SQL database it could be in no SQL database the data sources are quite varied and since the data sources are quite varied the data model will be quite varied as well this is the problem that many data scientists face when they get a lot of data but they have no idea how to join the data together this is one of the most difficult tasks in data science once you gather the data then you need to understand how to join the data of which data to leave out from certain sources how to get the data clean so on and so forth now you can use these data this data acquired from different data sources and put them directly into a bi tool stands for business intelligence but sometimes business intelligence tools are not capable of handling such large quantities of data because you gather a lot of data based on the your user base the size of your user base companies like Amazon have data in terabytes of volume so you can't really just throw all the data in a bi tool and expect it to work this is the data science comes in where you extract meaningful information out of the data and create a model now to handle large amounts of unstructured data we draw meaningful trends so for that we need data science from that unstructured data from that raw data that we have collected from our users we want to generate some trends we want to extract some information so that we could use that to suggest our business to move in a better way so let's take a look at a real time use case so data science is quite useful when we are trying to create a model that could predict credit-card fraud detection so suppose you have a lot of information about all the credit card transactions and you have information about which of these transactions for fraudulent transactions transactions that should not have gone through the credit card system but they did so maybe someone tried to trick the system got more money or someone stole someone's credit card and made a transaction that was not supposed to be made so these kinds of things are quite common and to handle these kinds of things what we do is we gather a lot of data 
about fraudulent credit card transactions and then we perform the data science operations on that we gather data from multiple sources we clean it we process it will visualize it we generate a model that when fed some information about a credit card transaction could give us an indication on whether it's fraudulent transaction or not if it is a fraudulent transaction or if it's classified as such then our banking officials can take a look into it can call the customers if either it was them who produce the transactions so on and so forth so this is one use case now data science is useful not just for this but for many other tasks even in the banking industry but this is one of the most popular use cases of data science in credit card companies there are other use cases as well such as social media analytics so have you wondered how Facebook recommends use in fields like this there's also targeted AdWords so it's just Google Ads recommends you advertisements based on the things that you've interacted with the YouTube videos that you've watched the searches that you performed the tasks that you've performed on Google the things that you searched for and So many other things this is where google ads come in to play, they use data science and all the techniques in data science to create and generate models to make targeted advertisements specifically to a specific user this augmented reality augmented reality are used to put images into the frame where they were not really images so Apple uses augmented reality many other AR gears are also available that uses data science then there's recommendation engines so when Netflix recommends you the next movie or the next TV show that might of interest to you it uses data sense in the background and then there's healthcare imaging there are systems that could take a look at your brain scans and figure out if there is a tumor in your brain scans look at the heart scans look at the ECGs look at various other imagings that you could perform in medicals and you can get some sort of a reading after those systems whether or not the patient is at risk of heart disease whether the person has a cancerous tumor whether it is malignant or benign and so on and so forth now there are many algorithms that you can use in data science some of them are linear regression linear regression is most commonly used when you want to predict something and most commonly when you have only two features and you want to plot them and predict something out of it linear regression is quite useful for that it's quite easy to understand when there is only two variables when there are multiple variables it could get a little tricky similarly we have logistic regression now logistic regression is a classification algorithm also although it's a bit confusing that it has the word regression in its name it is a classification algorithm so the major difference is in linear regression we try to predict a continuous value likely price of a house depending on its square footage the area that it is in and how old the house is similarly logistic regression issues for classification we are we're trying to take some data and predict which class it goes into so we get an image and we wish to predict whether or not it contains a dog we take in an image again we try to predict what species of flower is it so these are different classes that our data could be put into and this is why we use Classification Algorithm this entry is also a classification algorithm Naivs base uses probabilistic model 
Naive Bayes is also a classification algorithm; it uses a probabilistic model based on Bayes' theorem and allows you to make assertions based on that probabilistic model. KNN stands for K-nearest neighbours: it lets us take new data points, place them among the existing points, and make predictions based on the features of their nearest neighbours, while k-means clustering, as the name says, is used for clustering. K-means clustering is unsupervised learning. Regression and classification, on the other hand, are supervised learning, in which we give the model information about data we already have: we have a labelled dataset that we want our model to learn from, and we already know the answers, for example the house prices in our dataset, and we want the model to learn from that and make accurate predictions. In clustering we send in a lot of data and tell the algorithm to figure out some sort of grouping, to group the data points based on the similarities they have, and then tell us what those groupings are; this is very useful when we are trying to find which customers are similar to each other, and this is where k-means clustering comes into play. Random forest is essentially an advanced version of a decision tree: it uses multiple decision trees to create a classification or regression model. Okay, so let's take a look at the data science process. The data science process consists of several steps. The first step, with any problem we are trying to solve, is to understand the business problem, the problem that the business is trying to solve using the data we have. The first thing we try to understand is what exactly the business wants from us: do they want us to make a product recommendation engine? Okay, then what kind of recommendation system do they want? Should we recommend products to a user based on their previous purchases, or should we recommend the items that people with a similar taste to the user have bought? This kind of brainstorming, problem understanding and analysis allows us to reduce our work and understand what exactly it is that we are trying to achieve. If we had not done this, we might be solving a problem that did not need to be solved, a problem the company has no use for; if the company wants us to build a recommendation system based on previous purchases and we build it on some other kind of data, then that becomes a real problem. Understanding the business problem also allows us to look at the data ahead of time and judge whether it will be useful for the kind of problem we are trying to solve. Then comes data gathering. Data gathering is one of the most important aspects of the data science process, because if the data that is gathered is incorrect, if it is wrong, if it contains too many null values, if most of the values are out of range, or if the data is just too varied or biased towards one result, then the problem becomes really difficult to solve, because data science works entirely on data; if you don't give it a large amount of good data, you might not be able to solve the problem you are trying to solve. Then comes data processing: we take the data, load it into a data frame, process it, join it, and try to gather data from
multiple sources, and through processing we convert it into a single source so that it can be analyzed later and used to make important predictions. Then comes data analysis. The data we have gathered from multiple sources and converted into a single format now needs to be checked: is the data correct, how many null values are there, is the data biased towards one kind of prediction? If I am building a classifier for whether a person is going to pass or fail his or her next class, and all we have is data on people who have passed, then we can't build a model, because our data is completely biased. These are the kinds of things we need to understand: we need to analyze the data, understand the outliers, how much of the data is usable, and how many of the features, the columns in the data, are simply not needed for the problem, and so on. Then comes visualization. After analysis we visualize the data to see whether something looks correct or incorrect and to look at the trends in the data. For instance, say we have the stock prices of some company, a long list of prices; we could just read through the prices and see that the price is increasing and then decreasing and increasing again, but that would be tedious. Data visualization allows us to create graphs that let us intuitively understand the trend inside the data: we can look at the visualization of the stock prices and quickly see whether the data shows an upward or a downward trend, whether profits are increasing or decreasing, when the profits started decreasing, and what might have been the issue around that time. This is where data visualization comes in. Then comes data cleaning. Depending on the kind of data you have, one thing you might well have to do is convert categorical data to numerical data; maybe you have dates in the wrong format, say in the US format, and you need to convert them to the Indian format, or vice versa; or maybe you have data that is simply irrelevant, or some outliers, or some null values. Dealing with all of this forms data cleaning, and it is something you need to take care of. You need to understand the domain before cleaning the data, because maybe some value is not an outlier but an actual value that we just didn't understand; if we remove it, that could cause a lot of problems. So data cleaning is the next step. Finally, after all these steps, once we have our clean, processed, analyzed and visualized dataset, we use machine learning and data science algorithms to create models. These models could perform regression, prediction or classification, and they allow us to make accurate predictions and provide value to our business, which is the problem we were trying to solve. When we create the model, we test it to check whether or not it is performing well; if it is not performing well, then maybe we need more data, maybe we need another algorithm, maybe we need to clean the data differently; rescaling and reshaping leave a lot of scope in that department.
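As a quick illustration of the kind of analysis and cleaning checks described above, here is a minimal pandas sketch on a small made-up data frame; the column names and values are purely hypothetical:

    import pandas as pd

    # Hypothetical raw data illustrating nulls, a US-format date column and an outlier.
    df = pd.DataFrame({
        "city":  ["Delhi", "Mumbai", None, "Delhi"],
        "date":  ["03/25/2021", "04/01/2021", "04/15/2021", "05/02/2021"],
        "price": [120.0, 95.0, None, 4000000.0],
    })

    print(df.isnull().sum())                                     # how many nulls per column
    df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")   # parse the US-style dates
    df = pd.get_dummies(df, columns=["city"])                    # categorical -> numerical columns
    print(df)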
So the first step in a data science solution is to understand the problem. As we've discussed, understanding the problem is key: if we start off wrong at that step, our data science flow is going to be disrupted, and we might end up solving the wrong problem, or we might hit a dead end because the data we have is just not useful for solving the problem we misheard or misunderstood and are trying to solve. So always try to understand what kind of data you have and what kind of problem you are trying to solve; this allows us to analyze the data and judge whether it is useful for this kind of problem. When you get a problem, the first things you need to ask are why do we need to solve this problem and how can we solve it, and the why is the most important question in this entire flow. We also need to understand the end product: what exactly are we trying to build, what features are required in this model, what should be the input, what should be the output, how fast does it need to be, why do we need to take the input in one particular format and not another? These are all questions that you, and your entire team, need to have a clear understanding of. Then we determine the data sources for the problem. As I've already discussed, data can be collected from several sources: it could come from a CSV file, from a text file, from a legacy database, or from the internet, where through web scraping we can get a lot of data. That can itself be a problem, understanding how to merge all this data and how to get it; suppose we are getting data from a legacy database and it is just too much effort for too little data, then we might regard that source as not useful and simply not use data from there. These are the kinds of things you need to understand when you are trying to solve a data science problem, and after all of these steps have been performed you should have the information and the context you require to solve the problem. This context allows you to understand what exactly the problem is and whether we have the correct data to solve it. If we don't perform these steps in this way, we might end up spending a lot of time, effort and money solving a problem that either didn't need to be solved, or that we misunderstood, or that we could not have solved with the data we had; we would, in a sense, have wasted a lot of time. So now let's take a look at some of the steps in data science, starting with data gathering. Data gathering is also known as data extraction, which is the process of retrieving data from various sources to be used in your data science process. It is not strictly necessary to extract data from multiple sources; if you have one data source that contains large amounts of data, well and good. However, most of the time when you are working on a data science project you need large amounts of data, and that requires you to gather data from extremely varied sources: you might need to get data from a CSV file, you might need to get data from a website or a web API, and if the website does not provide a web API, you need to scrape that website and get the data you need from it. And sometimes you need to query a
legacy database: maybe your application that interacts with the customers stores information about their likes and dislikes in a MySQL database that was designed a long time ago, and now you need to extract just the information you need from that database. You then need to gather the data from all of these sources into one common place, and this is where data gathering comes into play. The more data you have, the better you can train your model and the better your chances of solving the problem with greater accuracy. Now, just because you have more data doesn't necessarily mean your model is going to be perfect; data gathering is considered successful when you end up with a large amount of data that is well cleaned, well prepared, preserved in good condition, and does not contain a lot of null values or unrealistic values. So data extraction is performed in order to gather data from various diverse sources or data repositories: we extract data from multiple sources and then store it in one single data repository, and this data repository is then used to train our machine learning models and make predictions. Now, you can gather data from a lot of places, and two of the most common ones are databases and the internet. When you're using databases, it could be a NoSQL database or an SQL database, and among NoSQL databases there are multiple kinds as well: document stores, graph databases, column stores, and so on. Similarly, on the internet
you have a lot of sources: you could get data from a web API, which might be a RESTful web API or a SOAP-driven web API, so there is a lot of diversity in the sources you can get data from, and if the internet source you're looking at does not provide a web API, then you might need to create scripts that reach into the web pages, extract the information you need from them, and store it somewhere you can use it in your data science process. So now let's get our hands dirty with the code, and we'll perform some SQL queries to see how to extract data from an SQL database that contains multiple tables, where we might need to perform operations like sub-queries and joins. Okay, so let's take a look at SQL sub-queries. A sub-query is basically a nested query that allows us to extract some data from another table and use that data inside the current query. To demonstrate, we have a database with two tables, make and model; this database is used inside a car dealership. Let's take a look at the data we'll be working with: if we select * from make, as you can see, we have three make names, the names of the car companies, and if we look at the model table, we have six rows in it, each of which belongs to a specific make, referenced using a make ID; so make_id is the foreign key that references the id column of the make table. Now what we want is a query that will look at all the data in the model table and give us the make IDs of all the cars whose models were built in 2010 or later, so anything with a year less than 2010 will not get included in the result; then we look into the make table, take the IDs that were returned from the model table, check if each ID exists there, and if it does, we show the name. To do that, we start writing the query: we want the name from the make table where the id is in, and this is where we start the sub-query, the make_id from the model table where the year is greater than or equal to 2010. Let's run it, and sure enough we get Honda and Hyundai. Let's check if that is the correct answer: what we wanted was the names of the makes that have models released in 2010 or later, and we have three such models, City, i10 and Verna, with make IDs of two and three; the ID of two refers to the make Honda and the ID of three refers to the make Hyundai, and that is exactly what we get as the answer. So essentially a sub-query allows us to extract some data from another table and lets the outer query work on that data: our inner query returns the make IDs we need, and then our outer query takes the data from the sub-query and uses it to display the result. So our sub-queries are done.
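For reference, here is a minimal sketch of the same sub-query, rebuilt as an in-memory SQLite database from Python; the makes and the models City, i10 and Verna follow the demo, while the remaining rows and all the years are made-up placeholders:

    import sqlite3

    # A hypothetical, in-memory recreation of the make/model tables from the demo.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE make  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT, year INTEGER, make_id INTEGER);
        INSERT INTO make VALUES (1, 'Maruti'), (2, 'Honda'), (3, 'Hyundai');
        INSERT INTO model VALUES
            (1, 'Alto', 2005, 1), (2, 'Swift', 2008, 1),
            (3, 'City', 2012, 2), (4, 'Civic', 2006, 2),
            (5, 'i10', 2014, 3), (6, 'Verna', 2011, 3);
    """)

    # Inner query: make_ids of models from 2010 onwards; outer query: their make names.
    names = conn.execute("""
        SELECT name FROM make
        WHERE id IN (SELECT make_id FROM model WHERE year >= 2010)
    """).fetchall()
    print(names)   # expected: Honda and Hyundai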
So now let's take a look at how to join data from two tables: we'll look at the join keyword and how to use it to get data from two tables that are linked using a foreign key and display it with a single command. What we want to do in this problem is extract the make name, the model name and the model year and display them in a single result set. The issue here is that the data lives in two separate tables: we have a make table and a model table, and they are linked using the make_id in the model table, so make_id is the foreign key, and we can use it to match rows from both tables and show the combined data. Let's take a look at how to do that. The first thing we need is a select query, in which we name the things we wish to extract. From the make table we want the name column, and I will alias it as make_name; the reason I do this is that the model table also has a column named name, and I don't want any confusion about which name column is which. So the first column will be make_name, the next column we want is model.name, and finally we want model.year. We will extract this data from make and model: from the make table, we inner join the model table on make.id matching model.make_id, and here we have the data we wished to extract, the make name, the model name and the year. So instead of showing the IDs we can now show the name of the make; for instance, wherever the make ID was one we now see Maruti as the make name, and similarly for Honda and Hyundai. So we have successfully extracted data from two separate tables. The reason data sometimes gets stored in two separate tables is that we want to keep the data in normalized form, keeping the tables separate so that there is no duplication or redundancy, and that makes it a bit more work for us to use the data together, which is exactly why we use the join keyword to join the two tables and extract data from them.
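And here is the same join as a self-contained SQLite sketch with the same made-up rows as before, showing the aliased make_name column next to each model:

    import sqlite3

    # Same hypothetical make/model tables as in the sub-query sketch above.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE make  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT, year INTEGER, make_id INTEGER);
        INSERT INTO make VALUES (1, 'Maruti'), (2, 'Honda'), (3, 'Hyundai');
        INSERT INTO model VALUES (3, 'City', 2012, 2), (5, 'i10', 2014, 3), (6, 'Verna', 2011, 3);
    """)

    # Inner join: show the make name next to each model instead of the numeric make_id.
    rows = conn.execute("""
        SELECT make.name AS make_name, model.name, model.year
        FROM make
        INNER JOIN model ON make.id = model.make_id
    """).fetchall()
    for make_name, model_name, year in rows:
        print(make_name, model_name, year)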
Now let's move on to web scraping. We'll perform a hands-on that gives some useful insight into how to scrape a simple HTML page, just to give you a brief overview of how to create a script that grabs just the content you need from a web page for your data science process. So in this hands-on we'll take a look at web scraping: what we'll do is create a simple HTML page and extract some content out of it, to see how web scraping works in general. To do that, first we'll create the HTML page: I'll create a div tag with a p tag inside it, and I'd like to scrape "Hello World" out of this HTML; suppose this is a website on the web and you wish to extract the content inside the p tag, which sits inside the div tag with the class "content". To do that I'll first create a Python file, scrape.py (you could name it anything), and then I need to install some packages using the pip command: pip install beautifulsoup4 and lxml. beautifulsoup4 is the library used for scraping websites, and lxml is the parser that BeautifulSoup uses underneath: it will use lxml to read the HTML content from the file, parse it, and then let us navigate the HTML to get to the data we wish to scrape. Let's install them; I have already installed these packages, so you might see a different output if you haven't. Now let's look at how to do this. The first thing I have to do is import BeautifulSoup from the package bs4. Once that is done, I open the file we created, index.html; f will be the variable I use to refer to this file, and I create a BeautifulSoup object from it, instructing it to use the lxml parser underneath. Just to check that everything is working correctly, let's print the soup variable and see what happens; as you can see, it is reading the file correctly. Now what I wish to do is extract the div from the soup, so I find a div in the soup object with the class of content; if you look at our HTML page, we have a div tag with the class "content", and we wish to extract the text inside the p tag of that div. After getting the div, let's print it and see if we get the correct output, and yes, this is the output we were looking for, but we need to go a step further and extract just the text. BeautifulSoup makes this very easy: all we have to do is ask for the p tag's text inside the div. Let me run this, and we have the text we wished to extract. Now I'll save the text in a variable, and I can do with it whatever I wish; I print the data so you can see it is working properly, and yes, it is. This data can now be stored inside a database, a CSV file or a text file, or used for some pre-processing and then fed into a machine learning or data science algorithm, or perhaps just used for some word tokenization with NLP. So this text has been extracted from the HTML using web scraping, and you can now use it to perform several tasks or save it wherever you wish.
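Pulling the steps above together, here is a minimal end-to-end version of the scrape; it writes the small index.html page itself so it can be run as-is, and it assumes beautifulsoup4 and lxml are installed:

    from bs4 import BeautifulSoup   # pip install beautifulsoup4 lxml

    # Write the small page described above so the sketch is self-contained.
    with open("index.html", "w") as f:
        f.write('<div class="content"><p>Hello World</p></div>')

    with open("index.html") as f:
        soup = BeautifulSoup(f, "lxml")       # parse the file with the lxml parser

    div = soup.find("div", class_="content")  # the div with class="content"
    data = div.p.text                         # the text inside its <p> tag
    print(data)                               # -> Hello World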
So let's take a look at data processing. Data processing is the process of converting data into easily readable, more organized information. When you get data from very different platforms, such as databases and the web, through data gathering, it is our job to use libraries such as NumPy and pandas to convert those separate data formats into a single format so that they can all be used together. There are many libraries for data manipulation: there is NumPy, there is pandas, there is SciPy, and they all work in conjunction; pandas actually uses NumPy under the hood, so there is a lot of interdependency between these libraries as well. Let's take a look at them. First, NumPy. NumPy is a very popular library in Python; the name is an abbreviation of Numerical Python. It is an extremely fast library that performs really well even with large datasets, and it is specifically used to perform mathematical, even complicated mathematical, operations that you cannot perform on normal lists, which is why we use NumPy and NumPy arrays. It is very popular in scientific computation because it lets us apply many functions and methods to convert data from one form to another, organize it, and do a lot more with it. Similarly, pandas is a very simple and powerful solution for manipulating data that is in tabular format; pandas is mostly used because of its data frames, but it has Series and other features as well. It is built on top of Python and uses NumPy underneath for some of its features, and pandas gives you a lot of functionality out of the box to make your life simple. As you can see, pandas performs better than NumPy for 500K rows or more: if you have a large dataset, pandas tends to perform better because its internal algorithms are highly optimized and exposed through a very simple API, so these functions are very easy to use. NumPy, on the other hand, performs better for 50K rows or less, so if you have a smaller dataset, loading it into a NumPy array, vectorizing it and reshaping it before using it would be a better option.
A pandas Series is a very flexible data structure, and it allows you to define your own labelled indexes; so if, for instance, you want key-value pairs that perform really well even under heavy computation, you can use a pandas Series. NumPy, on the other hand, accesses the elements of an array using default positional indexes, similar to a list: just as with lists, you access NumPy array elements by their positions, whereas in pandas you can define your own labelled indexes and use those instead. The first step to creating a NumPy array is to import the numpy package; most people put the import at the top of their file because it is used so often. To import it you can just type import numpy, but since NumPy is so commonly used and has five characters in its name, people abbreviate it to np, writing import numpy as np; it's a convention, you don't really need to do it, but people do. It's easy to create a NumPy array out of a simple list: just pass the list into the np.array function. Similarly, you can create a 1-D vector or a 2-D matrix of zeros using the zeros method, and you can create a vector or matrix of a specified shape filled with random numbers using the np.random.random method. Then there is the Series in pandas: a pandas Series is just a one-dimensional labelled array, and it can hold elements of any data type, so you could have strings, numbers or objects, many different kinds of data, in a Series. To create an empty Series just use the Series constructor from the pandas package, and if you want to give it indexes, pass in the data list along with another keyword argument named index, which should be a list of all the indexes; make sure the number of elements in the index list is the same as the number of elements in the data list passed as the first argument. Then there is the data frame. A data frame is a two-dimensional tabular structure, and it is one of the most important and impressive features of pandas. A pandas data frame can be built from several data sources, such as SQL tables, CSV files, Excel files and so on, and you can also construct data frames yourself. Data frames are very useful: they allow you to perform tasks like grouping and aggregating, and you can apply methods on them to manipulate the data in various ways. To create an empty data frame, just use the pd.DataFrame function, and you can create a data frame out of a Series as well: just pass the Series into the constructor and it will create a data frame based on the Series you passed in.
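Here is a small sketch of the constructors just described, contrasting NumPy's positional indexes with a Series' labelled index; the values are arbitrary:

    import numpy as np
    import pandas as pd

    arr = np.array([10, 20, 30])               # NumPy array from a plain list
    print(arr[0])                               # positional index only -> 10

    zeros = np.zeros((2, 3))                    # 2 x 3 matrix of zeros
    rand = np.random.random((2, 3))             # 2 x 3 matrix of random numbers

    s = pd.Series([10, 20, 30], index=["a", "b", "c"])   # Series with a labelled index
    print(s["b"])                               # access by label -> 20

    df = pd.DataFrame(s)                        # a data frame built from the Series
    print(df)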
Now let's take a look at a hands-on on data processing using NumPy and pandas. To do that, first we have to import numpy and pandas. Let's first see how to create a NumPy array from a list: let me create a numeric list containing one, two, three, four and five, and to create an array out of it all we have to do is call np.array and pass in the list, and there we have it; if we look at the type, we have a NumPy n-dimensional array. Now, it's not necessary to pass in a one-dimensional list only; you could, for instance, pass in a two-dimensional one, say one, two, three in the first row and four, five, six in the second, and here we have a two-dimensional array; if we check the type, it's still the same, an np.ndarray. Let me show you another trick: suppose instead of a two-dimensional array you have a one-dimensional array of those six values, so an array of shape one by six, one row and six columns, and you want to reshape it to two by three, so that one, two, three are in the first row and four, five, six in the second. That's quite simple: just call reshape and pass in the shape, two by three, and we have it. It is important to note that the original array is not reshaped in place; if I check its shape again it is still the same, because reshape returns a new array of the new shape, so if I assign the original variable to the returned array and print it, I get the reshaped version. What if you want to create a NumPy array of zeros? You can do that with np.zeros, and you can look at the documentation as well by pressing Shift+Tab, which will tell you what it does: you give it the shape you want, passed in as a tuple, so (2, 3) gives a two-by-three array of zeros, (1, 3) gives one by three, and so on. There is also an arange method: say I want an array over the range 0 to 10; whenever you create a range, remember that the ending number is excluded, so np.arange(0, 10) gives all the numbers from 0 up to but excluding 10, that is 0 to 9. With that you can do a lot: say I take an arange of 1 to 9 and reshape it to three by three, and now I have a matrix of the numbers one to nine with shape 3 x 3. Now let's take a look at pandas. pandas comes with something called a Series object, and again you can look at the documentation by stepping into the brackets and pressing Shift+Tab; it has a very good descriptive docstring. Let me just pass in some data, say 1, 2 and 3, and you get a pandas Series instantly. Now let's say I want to create a pandas Series out of a NumPy array: I can just pass in the arange array and I get a pandas Series of 1 to 9 automatically. This is why NumPy and pandas work so well together: pandas uses NumPy under the hood, so it lets you use NumPy objects, NumPy ndarrays, with it.
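The NumPy steps from this hands-on look roughly like the following sketch:

    import numpy as np
    import pandas as pd

    flat = np.array([1, 2, 3, 4, 5, 6])     # one row of six values
    grid = flat.reshape(2, 3)               # returns a NEW 2 x 3 array; flat is unchanged
    print(flat.shape, grid.shape)           # (6,) (2, 3)

    print(np.zeros((2, 3)))                 # the shape is passed as a tuple
    print(np.arange(0, 10))                 # 0..9 -- the end value is excluded
    print(np.arange(1, 10).reshape(3, 3))   # the numbers 1..9 as a 3 x 3 matrix

    print(pd.Series(np.arange(1, 10)))      # pandas happily wraps a NumPy array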
So now that we have created these, let's take a look at how to create a data frame as well. A data frame is nothing more than a multi-column Series; it allows us to store data in tabular format using rows and columns. The way I like to create data frames is from a Series, but that's not really necessary; you could just pass in a list, and here you have a Series and here you have a data frame created from a list. You can also pass in the index: I can create an index with the labels a, b and c, but the index needs to have the same length as the data, and now instead of the default 0, 1, 2 we have a, b and c. If the index were not the same length, I would get an error; in fact I got a big error saying ValueError, so always make sure the indexes match the data. Now, creating data frames like this can be tedious, so a better way of doing it is to use a dictionary, where the keys represent the column headers and the values represent the column values. Say I have an id column equal to 1, 2, 3, and a name column with the names a, b and c; as you can see, it returns a data frame that contains the id and the name. Another common thing you can do, instead of typing the index in, is to make an existing column the index using the set_index method on the data frame object: I just pass in id, and now instead of having the index 0, 1, 2 we have 1, 2, 3. So if you were extracting data from a database and you needed to keep the ID, you could do it this way. So we have created a data frame, we have created a Series, we have created NumPy ndarrays and reshaped them, and you can combine these tasks to create and manipulate data as you wish, and to create dummy data for yourself as well, which is really quite helpful.
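A minimal sketch of the dictionary-based data frame and the set_index step, with the same toy ids and names:

    import pandas as pd

    # The dictionary keys become the column headers, the values become the columns.
    df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

    df = df.set_index("id")   # use the id column as the index instead of 0, 1, 2
    print(df)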
So let's take a look at some data manipulation. To do that I'll import pandas and read a CSV file; the file is named housing.csv, and it is some housing data collected in a comma-separated values file. Let's take a look at what the data looks like: when we look at the head of the data, we see that we have four columns, all of them numeric, which is good, and if we look at the shape of the data, we have 489 rows and 4 columns. Now let's try getting a subset of the data: instead of the entire dataset, say we want the first five rows, but instead of using head we'll do it ourselves. For that I use something called iloc, which lets me index into the rows and columns. The way iloc works is that you provide it two things separated by a comma: the rows you want and the columns you want. Since we want the first five rows and all the columns, I write it with slices: the colon operator defines the start and end of a range, the thing on the left is the start, the thing on the right is the end, and if you omit a value it goes to the extreme end. On the left I have not provided a value, so it is assumed to be zero and starts from the first row, and it gives me the first five rows; since the end is five, row five itself is excluded, because ranges are exclusive in Python. For the columns I have provided neither the start nor the end of the slice, so it assumes everything from the beginning to the end and gives me all the columns; it's exactly the same result we got with data.head. But suppose you don't want that: you want the rows starting from the second row, so you get rows two, three and four, and instead of all the columns you want every column except the last one. This is called negative indexing: when I say minus one, it goes to the end and excludes one column, and since the last column is MEDV, it is left out of our result; as you can see, we got RM, LSTAT and PTRATIO, and rows two, three and four, with the MEDV column excluded. Now, you don't have to use only the iloc method; there is, for instance, another method called loc which allows you to specify the columns by name. So from RM up to PTRATIO: as you can see, I have given the names of the columns, RM through PTRATIO, and I've got all the rows, all 489, but because 489 is a large number pandas truncates the output and shows the first five and the last five. I can again slice the rows however I want and get just the first five. Note that with loc the ending label is included, whereas with iloc the end of the range is excluded, so keep that in mind. Now that we have done that, let's see some additional tasks. We have loaded the entire dataset into a data frame from a CSV file and taken subsets of it, but there are other things you can do as well. For instance, say you wish to sort the data frame based on the RM values: that's easy, you call sort_values, and the sort_values method takes a parameter called by, which is the name of the column you are going to sort by, in our case RM. Another thing you can do is pass ascending as True or False; by default it is True, so if you don't do anything the values are sorted in ascending order of RM, but if I set ascending to False they are sorted in descending order, so the largest values come first and decrease one by one. So we have also sorted these values. You can also set the values of columns: since we have these four columns, say that instead of its current values I want every value in PTRATIO to be zero. To do that I just select data['PTRATIO'] and assign it a value of zero, and if I look at the data now, as you can see, I have changed the values. Similarly, you can change the values of just some rows by selecting those rows and then assigning to them. Let me reload the data so it is correct again, and then, using iloc, I select the first five rows and the columns from PTRATIO to MEDV and set all of that to zero; as you can see, the first five rows of the PTRATIO and MEDV columns are now set to zero.
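Put together, the subsetting, sorting and assignment steps look roughly like the sketch below, assuming a housing.csv with the columns RM, LSTAT, PTRATIO and MEDV in that order, as in the demo:

    import pandas as pd

    data = pd.read_csv("housing.csv")         # assumes the same housing.csv as the demo

    print(data.shape)                          # (rows, columns)
    print(data.iloc[:5, :])                    # first five rows, all columns (end excluded)
    print(data.iloc[2:5, :-1])                 # rows 2-4, every column except the last
    print(data.loc[:4, "RM":"PTRATIO"])        # label-based; here the end label is included

    by_rm = data.sort_values(by="RM", ascending=False)   # largest RM values first

    data["PTRATIO"] = 0                        # overwrite a whole column
    data.iloc[:5, 2:] = 0                      # or just a block of rows and columns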
Again, this is not something you will always have to do, but in case you need to, say all the null or incorrect values are concentrated in some range, you can change them this way. Another thing you can do is apply functions to the data. Let me show you: suppose you wanted to change the values in LSTAT to double what they are, so instead of 4.03 a value becomes 8.06, and so on, and you want to do this for every row in the LSTAT column. There is a method called apply: you can define a lambda there and it will do it for you, x multiplied by 2, and if you don't want to do it for the whole data frame, you do it just for the LSTAT column. Now we have the column, and it is important to note that apply has not changed the column in place: the result shows the values multiplied by two, but the original column is still there. To change it, as we discussed earlier, I just assign the result back, and now, as you can see, we have doubled all the values inside the LSTAT column. You can perform other tasks in the same way; this is a very simple example, but it shows that you can apply functions like that. Finally, another thing you can do is boolean indexing. Say you want every row where the RM value is greater than some threshold, and now you have 474 rows. This is what is called boolean indexing: when you type in just the condition, you get a boolean mask, which is True wherever the condition is met, and when you index the data frame with it, it checks each row, shows the rows where the mask is True and skips the rows where it is False. So now we have all the rows that satisfy the condition. Similarly, if I say RM greater than 6, now we have 318 rows, so that's how it works. And you are not limited to just one condition: you can combine multiple conditions, so I could also require that the LSTAT value is greater than 10, and after adjusting the condition, here is the data, now with 268 rows; you can keep adding conditions like this and filter the data further. Filtering the data is important when you want to extract just the data that fits your requirements, based on some condition or rule, instead of using the entire dataset; suppose your dataset had 200,000 data points, say some census data, and you wished to extract the subset where the state is Rajasthan or some other place, you could do it this way. So this is how you can extract some data, manipulate it, take a subset of it and apply methods on it to change its values, which is really handy for these kinds of tasks.
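And the apply and boolean-indexing steps, again assuming the same hypothetical housing.csv and column names:

    import pandas as pd

    data = pd.read_csv("housing.csv")

    # apply() returns a new Series; assign it back to actually change the column.
    data["LSTAT"] = data["LSTAT"].apply(lambda x: x * 2)

    # Boolean indexing: the condition produces a True/False mask, one value per row.
    tall = data[data["RM"] > 6]
    both = data[(data["RM"] > 6) & (data["LSTAT"] > 10)]   # combine conditions with &
    print(len(tall), len(both))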
So let's take a look at merging, joining and concatenating data frames created using pandas. pandas has merge, join and concat features. Merge and join combine two given data frames into a new data frame using a common column, similar to how you perform join queries on SQL tables; the only difference between them is that join works on indexes, while merge can take any two columns and combine the two data frames based on them. Concatenation, on the other hand, takes two or more data frames and stacks them one on top of the other, in the order the list was passed in: if I create three data frames df1, df2 and df3, put them in a list in that order, and pass it to the concat function, it returns a data frame with the combined data of data frame one, followed by data frame two, followed by data frame three, given that all of these data frames have a similar shape and the same columns; so they can be combined together using the pandas concat function. Now, when you merge or join two data frames, the data of the first data frame is shown alongside the matching data of the second, in the same row but in different columns, just like the output you get when you join two tables with an SQL query. There are four kinds of joins, or merges: inner join, left join, right join and full outer join. In an inner join, only the data that is common between the two is joined together. In a left join, the data frame on the left is considered to be of higher priority, so its data is always shown, and if a row has no match in the second data frame, the columns coming from it are filled with null. An outer join combines both data frames, and wherever the values don't match on either side, the missing columns are filled with null. A right join, or right merge, does the same thing as the left join but with the roles reversed: where the two frames have common values the columns are filled in, and where they don't, they are null. So let's take a look at a hands-on that shows merge, join and concatenate. Okay, so let's look at merging two data frames. Merging basically means that we specify a common column, and the rows where that column's values match are merged together. Let's try it: first we create two data frames; before that we should import pandas as pd, and it's imported. Now I create a data frame df1 from a dictionary; this will be our user data frame, with user IDs one, two, three and four, and similarly a user name column. Then we create a qualifications data frame, which has a user_id column referring to the same IDs along with each user's qualification. To merge the two data frames we use the pandas merge functionality, passing in the left and the right data frames, the user data frame and the qualification data frame, and then the columns on which we wish to join them: from the left data frame we want the id column, and from the right data frame we want the user_id column. As you can see, the user_id and the id of these two data frames match, so they are merged together and we get back a data frame merged on the IDs we specified, and here we have it, with the user names a, b, c and d.
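Finally, a minimal sketch of the merge just described; the qualification values are made up for illustration:

    import pandas as pd

    users = pd.DataFrame({"id": [1, 2, 3, 4],
                          "name": ["a", "b", "c", "d"]})
    qualifications = pd.DataFrame({"user_id": [1, 2, 3, 4],
                                   "qualification": ["B.Tech", "M.Sc", "MBA", "B.A"]})

    # Merge on the columns that link the two frames: id on the left, user_id on the right.
    merged = pd.merge(users, qualifications, left_on="id", right_on="user_id")
    print(merged)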