Time Series Analysis in R | Time Series Forecasting | Intellipaat
Hello everyone, welcome to this session on introduction to time series. Today we'll take you through the various concepts involved in creating a time series model, so let's see what we'll be covering. First we'll start by understanding what a time series is, and then we'll discuss the different types of time series out there. After that we'll cover the various components of a time series, and then move on to how we can make a time series fit for modelling using three methods: differencing, the Dickey-Fuller test, and the ACF plot method. Once that's done, we'll understand what the ARIMA model is and how we can create a time series model using the ARIMA method, and then we'll wrap up the session. Before we begin, if you haven't already, please subscribe to the Intellipaat channel, and comment any queries you have in the comment section; we'll be happy to answer all of them. So let's start with time series forecasting.
To put it simply, time series forecasting is the use of statistical models to predict future values based on past results. So what kind of variables can be forecast? Any value that can be tracked and collected over time: think of annual population data, a company's daily stock price, or quarterly sales figures. For each of these examples, data is collected over time, and the time series model simply uses that data to forecast future values.

Now we'll concentrate on some of the building blocks of a time series model. The first defining characteristic of a time series is that it is a list of observations where the ordering matters. Ordering is very important because there is a dependency on time, and changing the order could change the meaning of the data. To accurately forecast future values, we need each measurement to be taken at sequential, equal intervals, with each time unit having at most one data point.

Once we have collected our data, we have two objectives in mind: first, identifying the patterns represented by the sequence of observations, and second, forecasting or predicting future values of the time series. The patterns we observe tell us a story of how a business interacts with time. In time series analysis it is assumed that the data consists of a systematic pattern, usually a set of identifiable components, plus random noise (error) that usually makes the pattern difficult to identify. Most time series analysis techniques therefore involve some form of filtering of the noise in order to make the pattern more noticeable.

Now let's learn about the different types of time series. The data can be either stationary or non-stationary. Given a series of data points, if the mean and variance of the data points remain constant with time, we call it a stationary series; on screen we have a simple example of what a stationary series looks like. It is important to know about stationarity because without a stationary series we cannot move forward with time series analysis: we need to calculate the mean of the time series in order to estimate its expected value, and if the series is not stationary, that calculation will give misleading results and interpretations. So if a series is not stationary, we need to convert it into a stationary series, and to do this we difference the series.

Apart from a stationary series, a white noise series is one where the mean and variance of the data points are constant but there is no autocorrelation between the values. The correlation between values of data points at different time intervals is known as autocorrelation; it is also sometimes termed lag correlation. When the mean and variance of a time series are not constant, that is, they vary with time, we say the data is just taking a random walk over time. A random walk is a series where each data point is the previous value plus a purely random step, so the steps themselves carry no usable information from the past; this makes the series non-stationary because its mean and variance vary with time.
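To make the stationary-versus-random-walk distinction concrete, here is a minimal R sketch (not from the video): it simulates a white noise series and a random walk and runs an augmented Dickey-Fuller test on each. The tseries package, the seed, and the series length are illustrative assumptions.

```r
# Illustrative sketch (not shown in the video): white noise vs. a random walk.
# Assumes the 'tseries' package is installed for adf.test().
library(tseries)

set.seed(42)
white_noise <- ts(rnorm(200))          # constant mean and variance, no autocorrelation
random_walk <- ts(cumsum(rnorm(200)))  # each value = previous value + random shock

par(mfrow = c(1, 2))
plot(white_noise, main = "White noise (stationary)")
plot(random_walk, main = "Random walk (non-stationary)")

adf.test(white_noise)  # small p-value: stationarity is plausible
adf.test(random_walk)  # large p-value: unit root not rejected, non-stationary
```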
There are three basic criteria for a series to be classified as stationary. First, the mean of the series should not be a function of time; rather, it should be constant. On the slide, the green graph satisfies this condition, whereas the red graph has a time-dependent mean. Second, the variance of the series should not be a function of time either; this property is known as homoscedasticity, and here you can notice the varying spread of the distribution in the right-hand graph. Third, the covariance of the i-th term and the (i + m)-th term should not be a function of time; in the graph shown, the spread becomes narrower as time increases, hence the covariance is not constant with time for the red series.

Next we'll study the different components of time series data. As we know, a time series varies with time, and there are many factors that cause this variation. The effects of these factors are studied through four major components: trend, seasonal variation, cyclic variation, and irregular variation. To better understand and apply these ideas, we'll examine a business problem: how management at Hotel ABC, a mountain resort, can prepare for their year by forecasting the number of room bookings they expect each month. This information can help management make informed decisions on staffing, hospitality arrangements, and room pricing. They have tasked their business analyst with forecasting bookings so they can make the necessary preparations. The historical data contains monthly information from the past 10 years, and the forecast should contain monthly bookings for the next six months. So your job is to use the historical bookings data to investigate and clean the data, determine the trend and seasonal components, then apply the findings to an ARIMA model and finally forecast the bookings for the next six months. That's quite a lot to think about, but don't worry, we'll be taking a closer look at time series forecasting methods and what they all mean.

Before we crunch any numbers and make a predictive model, we need to plot our data to get a feel for what the time series looks like. A time series plot is a graphical presentation of the relationship between time and the time series target variable: time is on the horizontal axis and the target variable's values are shown on the vertical axis. The first plot shows the complete ABC Hotel time series, and you can see that the series shows a general movement up, which we call an upward trend. In fact, our bookings plot shows two main patterns: first the upward trend, and second some regularly occurring fluctuations up and down within the same calendar year, which we call a seasonal pattern. Plotting the time series allows us to visualize these patterns and behaviors, and we can then use the findings from the plot to create and fine-tune more sophisticated forecasting models.

In time series analysis, a trend is a gradual shift or movement to relatively higher or lower values over a long period of time. When a trend pattern exhibits a general upward direction, with higher highs and higher lows, we call it an uptrend; when it exhibits a general downward direction, with lower highs and lower lows, we call it a downtrend. A time series can change direction, going from an uptrend to a downtrend, and if there is no trend we call it horizontal or stationary. Each trend we see in a time series is part of a trend cycle. Looking at the plot of the bookings data, we see that ABC Hotel continues to increase its bookings year after year, creating an upward trend in the time series. If ABC Hotel reached capacity, bookings would turn sideways, and should a competing hotel open up down the street and steal their bookings, ABC would experience a downtrend.
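The ABC Hotel bookings data itself isn't provided with the session, so here is a small sketch with made-up monthly bookings that shows how such a series could be built and plotted in R. The numbers, start date, and the trend/seasonal shape are purely illustrative.

```r
# Hypothetical data: 10 years of monthly bookings with an upward trend,
# a within-year seasonal swing, and some irregular noise.
set.seed(1)
m        <- 1:120
bookings <- 200 + 2 * m +                    # gradual upward trend
            40 * sin(2 * pi * m / 12) +      # seasonal pattern within each year
            rnorm(120, sd = 15)              # irregular (random) variation
bookings_ts <- ts(bookings, start = c(2010, 1), frequency = 12)

plot(bookings_ts, main = "Hypothetical monthly bookings",
     xlab = "Time", ylab = "Bookings")       # time on x-axis, target on y-axis
```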
A time series that exhibits a repeating pattern at fixed intervals of time within a one-year period is said to have a seasonal pattern, or seasonality. Seasonality is a common pattern seen across many different kinds of time series. For example, if you live in a climate with cold winters and warm summers, your home's air conditioning costs probably rise in the summer and fall in the winter, and you would reasonably expect the seasonality of your air conditioning costs to recur every year. Likewise, a company that sells heavy coats would see sales jump in the winter and drop in the summer. Companies that understand the seasonal patterns of their business can time inventory, staffing, and other decisions to coincide with the expected change in business.

What we see now is a seasonality plot. Again, as in all time series plots, the vertical axis is our target variable, and the horizontal axis shows our seasonality: for example, with monthly data, all the monthly values are plotted in chronological order with each month on the horizontal axis, and the numbers 1 through 10 represent the 10 years of data in the time series. Looking at this plot for the ABC Hotel time series, we can tell that results during the winter and summer months are significantly higher. These findings are not surprising, considering most people go to the hotel during those months to take advantage of skiing and hiking. Since this pattern repeats at the same time intervals each year, we can safely say that our time series contains seasonality. Another noticeable characteristic of the ABC Hotel bookings seasonality plot is that the magnitude of bookings increases year after year; keep this in mind, as it will be very important for predictive modeling.
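A seasonality plot like the one described can be approximated in base R. This sketch reuses the hypothetical bookings_ts from the snippet above (the built-in AirPassengers series would work the same way); the specific functions are my own suggestion rather than what the video uses.

```r
# Two quick seasonal views in base R, using the hypothetical bookings_ts above.
monthplot(bookings_ts,
          main = "Bookings by month across the 10 years")   # one sub-series per month

boxplot(bookings_ts ~ cycle(bookings_ts),
        names = month.abb,
        main  = "Distribution of bookings by calendar month")
```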
One question which arises here is: what if we see a pattern in our data that doesn't occur within the same calendar year? Is that still a seasonal pattern? No, we call it a cyclical pattern. A cyclical pattern exists when the data exhibits rises and falls that do not occur over a fixed period. Think of business cycles, which usually last several years but where the length of the current cycle is unknown beforehand. In finance, times of expansion and recession in the stock market reveal cyclical patterns: a cyclical uptrend is referred to as a bull market, while a cyclical downtrend is referred to as a bear market, and these patterns in the general market span multiple years without a repeating pattern within each year. Many people confuse cyclical behavior with seasonal behavior, but they're quite different: if the fluctuations are not of a fixed period, they are cyclical; if the period is unchanging and associated with some aspect of the calendar, then the pattern is seasonal. In general, with cyclical patterns the average length of the cycles is longer than the length of the seasonal pattern, and the magnitude of the cycles tends to change more than the magnitude of seasonal patterns. Cyclical patterns are also much harder to predict; for example, a decline in stock markets is often sudden and violent, coming as a surprise to most.

A variation in our time series that is unusual or unexpected is known as irregular variation. It is also termed random variation and is usually unpredictable; examples are strikes or natural disasters, which are unusual or unexpected events.

Now that we've covered some of the basic building blocks of time series analysis, we can start discussing the first model type, which is the exponential smoothing model. Exponential smoothing forecasts use weighted averages of past observations, giving more weight to the most recent observation, with weights gradually getting smaller as the observations get older. The E, T, and S terms represent how the error, trend, and seasonality are applied in the smoothing calculation; each term can be applied either additively or multiplicatively, or in some cases left out of the model altogether. This framework allows for a wide spectrum of time series analyses while keeping the calculation simple.

So how do we determine how to apply the error, trend, and seasonality terms of an ETS model? A good way to start is to visualize the data using a time series decomposition plot, which separates the time series into its seasonal, trend, and error components. Let's start by looking at the data from our business problem. The first panel shows the actual time series. The seasonal portion shows us that there is a seasonal pattern. The trend line indicates the general course or tendency of the time series; it is essentially a centered moving average that fits between the seasonal peaks and valleys and is considered deseasonalized. Lastly, the remainder is the error in the model, calculated as the difference between the observed value and the trend line estimate: it is the piece not accounted for by combining the seasonal piece and the trend piece, and all time series will have this residual error to help explain what trend and seasonality cannot. Making use of the trend, seasonal, and error plots shown together in a decomposition plot allows us to identify the main components of the time series, so that later we can extract these components and configure our exponential smoothing model to best represent the underlying data.
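As a sketch of the decomposition plot and an ETS fit on the same hypothetical series: decompose() is in base R, while ets() and forecast() assume the forecast package is installed, and the automatic model selection shown here is my assumption rather than how the video configures it.

```r
# Decomposition plot plus an exponential smoothing (ETS) model.
library(forecast)                    # assumed installed; provides ets() and forecast()

plot(decompose(bookings_ts))         # observed / trend / seasonal / random panels

fit_ets <- ets(bookings_ts)          # error, trend and seasonal terms chosen automatically
summary(fit_ets)
plot(forecast(fit_ets, h = 6))       # forecast the next six months, as in the brief
```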
So why do we need the ARIMA model? An ARIMA model is a class of statistical models for analyzing and forecasting time series data. It is an acronym that stands for AutoRegressive Integrated Moving Average, and it is a generalization of the simpler autoregressive moving average (ARMA) model that adds the notion of integration. ARIMA models are applied in cases where the data shows evidence of non-stationarity, and an initial differencing step can be applied one or more times to eliminate that non-stationarity.

A random variable that is a time series is said to be stationary if its statistical properties are all constant over time. A stationary series has no trend, its variations around its mean have a constant amplitude, and it wiggles in a consistent fashion, that is, its short-term random time patterns always look the same in a statistical sense. The latter condition means that its autocorrelations, which are nothing but correlations with its own prior deviations from the mean, remain constant over time, or equivalently, that its power spectrum remains constant over time. A random variable of this form can be viewed as a combination of signal and noise: the signal could be a pattern of fast or slow mean reversion, or a sinusoidal oscillation, and it could also have a seasonal component. An ARIMA model can be viewed as a filter that tries to separate the signal from the noise, and the signal is then extrapolated into the future to obtain forecasts.

So what exactly is the ARIMA model? The ARIMA forecasting equation for a stationarized time series is linear, that is, a regression-type equation in which the predictors consist of lags of the dependent variable and/or lags of the forecast errors: the predicted value of y is a constant plus a weighted sum of one or more recent values of y plus a weighted sum of one or more recent values of the errors.
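Written out in standard textbook notation (not taken from the video), that regression-type equation for an ARIMA(p, d, q) model looks roughly like this, where y' is the d-times differenced series and e denotes the forecast errors:

```latex
% General ARIMA(p, d, q) forecasting equation (standard notation, for reference):
\[
  \hat{y}'_t = c
             + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p}
             + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q}
\]
```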
The acronym ARIMA is descriptive, capturing the key aspects of the model itself. AR means autoregression: a model that uses the dependent relationship between an observation and some number of lagged observations. I stands for integrated: this refers to differencing of the raw observations, that is, subtracting an observation from the observation at the previous time step, in order to make the time series stationary. And MA stands for moving average: a model that uses the dependency between an observation and the residual errors from a moving average model applied to lagged observations. Each of these components is explicitly specified in the model as a parameter, and the standard notation for these three aspects is p, d, and q, where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used. Here p denotes the number of lag observations included in the model, also called the lag order; d is the number of times the raw observations are differenced, also called the degree of differencing; and q denotes the size of the moving average window, also called the order of the moving average.

Now let's look at the assumptions of the ARIMA model. The first assumption is that the series is stationary: essentially this means the series is normally distributed and the mean and variance are constant over a long time period. Next is uncorrelated random error: we assume the error term is randomly distributed and its mean and variance are constant over time; the Durbin-Watson test is a standard test for correlated errors. We also assume there are no outliers in the series, as outliers may strongly affect conclusions and can be misleading. The last assumption concerns the random shocks, the random error component: if any shocks are present, they are assumed to be randomly distributed with a mean of 0 and a constant variance.

Now let's look at the steps to build an ARIMA model. This approach is also known as the Box-Jenkins method, a stochastic model-building process that is an iterative approach consisting of the following three steps. The first step is identification: here we use the data and all related information to help select a subclass of models that may best summarize the data. The next step is estimation: here we use the data to train the parameters of the model.
The third step is diagnostic checking: here we evaluate the fitted model in the context of the available data and check for areas where the model may be improved. The entire process is iterative, so as new information is gained during diagnostics, we can circle back to step 1 and incorporate that information into new model classes.

Let's take a look at these steps in more detail. The identification step can be further broken down: first we assess whether the time series is stationary, and if it is not, we determine how many differences are required to make it stationary; after that we identify the parameters of an ARIMA model for the data. A couple of tips during identification: it is advised to use unit root statistical tests on the time series to determine whether or not it is stationary, and we need to avoid over-differencing as much as possible, since differencing the series more than required can introduce extra serial correlation and additional complexity.

Now let's look at how to configure an ARIMA model. Two diagnostic plots can be used to choose the p and q parameters: the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The ACF plot summarizes the correlation of an observation with its lagged values; the x-axis shows the lag and the y-axis shows the correlation coefficient between -1 and 1 for negative and positive correlation. The PACF plot summarizes the correlations of an observation with lagged values that are not accounted for by prior lagged observations. These plots are drawn as bar charts, with the 95% and 99% confidence intervals shown as horizontal lines; bars that cross these confidence intervals are more significant and worth noting. You may observe some useful patterns when you make these two plots: the model is AR if the ACF trails off after a lag and the PACF has a hard cutoff after a lag, and that lag is taken as the value for p; the model is MA if the PACF trails off while the ACF has a hard cutoff after a lag, and that lag is taken as the value for q; and the model is a mix of AR and MA if both the ACF and the PACF trail off.

That was the process involved in identification. The next step is estimation, which involves using numerical methods to minimize a loss or error term. The method of least squares can be used for this; however, for models involving an MA component there is no simple formula that can be applied to obtain the estimates. The third step is diagnostic checking, and the idea here is to look for evidence that the model is not a good fit for the data. The two useful areas to investigate are overfitting and residual errors. What do we do about overfitting? We start by checking whether the model overfits the data; generally this means that the model is more complex than it needs to be and captures random noise in the training data. This is a problem for time series forecasting because it negatively impacts the ability of the model to generalize, resulting in poor forecast performance on out-of-sample data, so careful attention must be paid to both in-sample and out-of-sample performance, and this requires the careful design of a robust test harness for evaluating models. And what do we do in the case of residual errors? The forecast residuals provide a great opportunity for diagnostics.
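Putting the three Box-Jenkins steps together in R, here is a compact sketch using the built-in AirPassengers series (the same data used in the demo that follows). The tseries package, the log transform, and the specific (0, 1, 1) order are illustrative choices, not prescriptions.

```r
# Illustrative Box-Jenkins workflow on AirPassengers.
library(tseries)                 # assumed installed, for adf.test()

y <- log(AirPassengers)          # log to stabilise the variance

# 1. Identification: check stationarity, difference once, inspect ACF/PACF.
adf.test(y)
y_diff <- diff(y)                # d = 1
acf(y_diff); pacf(y_diff)

# 2. Estimation: fit a candidate ARIMA(p, d, q) by maximum likelihood.
fit <- arima(y, order = c(0, 1, 1))

# 3. Diagnostic checking: residuals should look like white noise.
tsdiag(fit)                                        # residual plots + Ljung-Box p-values
Box.test(residuals(fit), lag = 12, type = "Ljung-Box")
```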
A review of the distribution of errors helps in rooting out bias in the model: the errors from an ideal model would resemble white noise, that is, a Gaussian distribution with a mean of zero and a symmetric variance. For this purpose you may use density plots, histograms, and Q-Q plots and compare the distribution of errors to the expected distribution. A non-Gaussian distribution may suggest an opportunity for data pre-processing, and a skew in the distribution or a non-zero mean may suggest a bias in the forecasts that can be corrected. Additionally, an ideal model would leave no temporal structure in the time series of forecast residuals; this can be checked by creating ACF and PACF plots of the residual error time series, and the presence of serial correlation in the residuals suggests a further opportunity for using that information in the model.

We'll be implementing all of this on top of the AirPassengers dataset, so let's quickly go to RStudio and start working. Alright, this is RStudio, and our first task will be to have a glance at the AirPassengers dataset. I'll type in AirPassengers, and this is our time series data: the series starts in the year 1949 and goes on till the year 1960, and each entry tells us the number of passengers for that month. If we look at this entry, there were 302 passengers during July 1954; similarly, this entry over here shows 178 passengers during March 1951, and this one shows 405 passengers during December 1959. Just to be sure, let's also have a glance at the class of this dataset: class(AirPassengers) gives "ts", and ts is nothing but time series. Now let's also explore some time series functions. The start function gives us the first entry point of the dataset: start(AirPassengers) returns 1949 1, which tells us the first entry is the first month of 1949. Similarly, we can use the end function to get the last entry point: end(AirPassengers) returns 1960 12, so the last entry is the 12th month of 1960. In other words, the first entry point is January 1949 and the last is December 1960. Now let's also have a glance at the summary of this dataset with summary(AirPassengers): the minimum number of passengers for any month is 104, the maximum is 622, and the mean is around 280.
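Reconstructed from the spoken commands (the exact wording typed on screen may differ slightly), the exploration step looks like this:

```r
AirPassengers            # monthly international airline passengers, 1949-1960
class(AirPassengers)     # "ts" - a time series object
start(AirPassengers)     # 1949 1  (January 1949)
end(AirPassengers)       # 1960 12 (December 1960)
summary(AirPassengers)   # min 104, max 622, mean ~280
```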
Okay, now let's also plot this with plot(AirPassengers). Here time is mapped onto the x-axis and the number of passengers onto the y-axis, and what we see is that the number of passengers increases with respect to time. We can also add some colour and labels to this plot, so I'll give it a green colour and a title, main = "Passengers vs Time"; this is the plot after giving it the colour and the label. Now I'll also add a linear line on top of this plot with abline(): I'll use the lm function, where the dependent variable is AirPassengers and the independent variable is time(AirPassengers). We have now mapped a linear line on top of the plot, and this linear line is nothing but the mean line. The inference we can draw is that as time increases, the mean also increases, or in other words, the mean is a function of the time variable; and if the mean is a function of time, then this time series is not stationary, so that's one problem. Another problem is that the variance is not equal either: if you take these two peaks over here, the distance from this peak to the mean line and the distance from that peak to the mean line are different, and if we take, say, these two other peaks, the distances to the mean line are again different. So the two major problems are that the variance is not equal and the mean is not constant, and that is why this AirPassengers time series is not stationary.

Now we'll go ahead and have a glance at the decomposition plot. I'll type plot(decompose(AirPassengers)) and we get the decomposition plot: this is the original graph, this is the trend line, and we see that there is a general upward trend; this is the seasonal pattern and this is the random pattern. We'll also have a look at the pattern across the yearly cycle, and to get that we'll build a box plot, with AirPassengers on the y-axis and cycle(AirPassengers) mapped onto the x-axis. If we look at this graph closely, most of the traffic comes during the seventh and eighth months, that is, July and August, and the minimum traffic comes during the second and eleventh months, that is, February and November. So again, to restate it: most of the traffic we get is in July and August, and the least is in February and November; that's the within-year pattern we're observing.

Okay, so as I've already told you, this AirPassengers data is not stationary, so to make it stationary we need to do two things: first make the variance equal, and second make the mean constant. We'll start with the first task, making the variance equal, and to do that we use the logarithmic function. Let me actually plot the original graph first with plot(AirPassengers); we see that the variance is not equal, and to make it equal all I need to do is apply the log function.
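The plotting commands described above, reconstructed from the audio; the colour and title strings are a best guess at what was typed:

```r
plot(AirPassengers,
     col  = "springgreen4",                  # colour choice is a guess from the audio
     main = "Passengers vs Time")
abline(reg = lm(AirPassengers ~ time(AirPassengers)))   # mean / trend line

plot(decompose(AirPassengers))                   # trend, seasonal and random components
boxplot(AirPassengers ~ cycle(AirPassengers))    # traffic by calendar month
```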
So I plot log(AirPassengers), and this is what we get after using the logarithmic function: this was the original trend, and after applying the log this is the new series. Now let's also add the linear line on top with abline(): I use the lm function again, and this time the dependent variable is log(AirPassengers) and the independent variable is time(AirPassengers). With this linear line mapped on top, what we see is that the variance is now equal: the distance from this peak to the mean line and the distance from that peak to the mean line are equal, and if you look at these two other peaks as well, the distances to the mean line are equal. So we have used the logarithmic function to make the variance equal.

Now we'll also go ahead and make the mean constant, and to do that we use the differencing function. So I plot diff(log(AirPassengers)): log(AirPassengers) gives us a series where the variance is equal, and applying the differencing function on top of that gives us a series where the mean is also constant. This is the graph where the mean is constant and the variance is equal. Again I plot the abline on top of this with lm(), where the dependent variable is diff(log(AirPassengers)) and the independent variable is time(diff(log(AirPassengers))). Having mapped the mean line on top, we see that it is horizontal, in other words the mean is constant, and the variance is equal too. So we have successfully converted the non-stationary data into stationary data, and that is why we can now go ahead and build the ARIMA model on top of it.
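The variance- and mean-stabilising steps, again reconstructed from the audio:

```r
plot(log(AirPassengers))                     # log makes the variance roughly equal
abline(reg = lm(log(AirPassengers) ~ time(AirPassengers)))

plot(diff(log(AirPassengers)))               # differencing makes the mean constant
abline(reg = lm(diff(log(AirPassengers)) ~ time(diff(log(AirPassengers)))))
```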
Before we go ahead and build the ARIMA model, let's understand it properly. It's AR-I-MA: AR stands for autoregression, I stands for integration, and MA stands for moving averages, and AR is denoted by p, I by d, and MA by q. Now, to find out the values of p and q we can use the acf and pacf functions. I'll type acf and give it the original dataset first, acf(AirPassengers), and keep that graph in mind. Now what I'll do is give acf the modified series instead, which is diff(log(AirPassengers)). This is the modified ACF graph: in the first plot there were no inverted (negative) lines, and over here we see that there are some inverted lines. To get the q value, we take the line that comes just before the first inverted line; I'm repeating it, q is the line just before the first inverted line. The numbering here starts from 0, so this is the zeroth line, this is the first line, and this is the second line; the second line is the first inverted one, so the line just before it gives q = 1. Similarly, to get the value of p we use the pacf function, where pacf stands for partial autocorrelation function. Again it's the same idea: p is the line that comes just before the first inverted line; this is the zeroth line, this is the first line, and since the zeroth line sits just before the first inverted one, the value of p is 0. So we've found the values of p and q; the value of d is just the number of times we differenced the dataset to make the mean constant, and since we differenced only once, d is 1. So p = 0, d = 1, and q = 1.

Since we've got the values of p, d, and q, we can go ahead and build the ARIMA model. I'll type arima(), and inside it I'll give log(AirPassengers), followed by the values of p, d, and q, which is order = c(0, 1, 1). You might be wondering why I haven't differenced the series here: that is because I'm already giving the value of d as 1, so the differencing is handled by the model. I'll also give the seasonal parameters, seasonal = list(order = c(0, 1, 0), period = 12), since there are 12 months in a year, and I'll store the model in, let's say, mod_time. We have successfully built the ARIMA model, and now it's time to predict values. I'll use the predict function, where the first parameter is the model we just built and the next parameter is the number of months ahead we want to predict; I want predictions for the next ten years, so that will be 10 × 12 = 120 months, and I'll store the result in, let's say, pred_time.
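The modelling, forecasting, and plotting commands described here and in the wrap-up that follows, reconstructed from the audio; the variable names mod_time and pred_time, and the colour string, are guesses at what was typed:

```r
acf(AirPassengers)                 # ACF of the raw series, for comparison
acf(diff(log(AirPassengers)))      # q = 1: lag just before the first inverted spike
pacf(diff(log(AirPassengers)))     # p = 0
# d = 1, since one difference made the mean constant

mod_time <- arima(log(AirPassengers),
                  order    = c(0, 1, 1),
                  seasonal = list(order = c(0, 1, 0), period = 12))

pred_time <- predict(mod_time, n.ahead = 10 * 12)   # forecast the next 10 years
pred_vals <- 2.718 ^ pred_time$pred                 # undo the log (exp() would be exact)

ts.plot(AirPassengers, pred_vals,
        log = "y", lty = c(1, 3),                   # solid = actual, dotted = forecast
        col = "springgreen4", main = "Forecasted values")
```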
Let me print pred_time. What you need to keep in mind is that all of these values are logarithmic values, which is why we get values like 1.37, 1.61, 1.77, so we need to raise e to the power of these values; e is about 2.718, so I'll compute 2.718 to the power of pred_time$pred, store the result back, and print it again. These are the predicted values: we get predictions from 1961 to 1970. Looking at a few entries, the number of passengers for March 1966 would be 653, the number for August 1970 would be 1271, the number for November 1967 would be 655, and the number for May 1970 would be 990. So we have successfully forecasted values for the next 10 years.

Now we'll also make a plot of the actual values and the predicted values with ts.plot(): I give it the actual values stored in AirPassengers and the predicted values we just computed, and since we used a logarithmic function I'll set log = "y". I'll use the lty parameter to draw the actual values with solid lines and the predicted values with dotted lines, so lty = c(1, 3). These are the actual values and these are the predicted values: we have actual values from 1949 to 1960 and then forecasted values for the next ten years. I'll also give the plot a colour and a title, main = "Forecasted values", and this is the final plot; the full sequence of commands is sketched above. So that's how we can work with the ARIMA model and forecast new values. If this session was useful to you, please like this video and comment any queries you have regarding it. Thanks a lot for listening, have a great day ahead, and goodbye.
Forecasting is the use of statistical models to predict future values based on past results so what kind of variables can be focused so any value that can be tracked and collected over time think of annual population data or a company's daily stock price or quarterly sales figures for each of these examples data is collected over time and the time series model simply uses that data to forecast future values now we'll concentrate on some of the building blocks of time series model the first defining characteristic of a time series is a list of observations while ordering matters so ordering is very important because there is a dependency on time and changing the order could change the meaning of the data now to accurately focus future values we'll need each measurement of data to be taken across sequential and equal intervals and with each time unit having at most one data point so once we have collected our data we have two objectives in mind so first would be identifying the patterns represented by the sequence of observations and second would be forecasting or predicting future values of time series so the patterns we observe will tell us a story of how a business interacts with time in time series analysis it is assumed that the data consists of a systematic pattern that is usually a set of identifiable components and random noise that is error which usually makes the pattern difficult to identify so most time series analysis techniques involve some form of filtering of the noise in order to make the pattern more notable so now we'll learn about the different types of time series models so the time series data could either be stationary or non stationary so given a series of data points if the mean and variance of all the data points remains constant with time then we'll call that series are stationary series so here on the screen we have a simple example of how stationary series will look like now it is important to know about stationary model because of the fact that without a stationary series we cannot move forward with time series analysis and we would have to need to calculate the mean of a time series in order to estimate the expected value but if the time series is not stationary then our calculation for expected value will give false results and interpretation so now if a series is not stationary then we'll need to convert the non stationary series into a stationary series in order to do this we'll have to differentiate the series so apart from a stationary series a white noise series is a series where the mean and variance of the data points is constant but there is no auto correlation between values of this points so the correlation between values of data points at different time intervals is known as autocorrelation it is also sometimes termed as lag correlation now when the mean and variance of time series data is not constant that is fading with time then we can say that the data is just taking a random walk over time so random walk is a term used when data points in the series are not dependent on their past values this makes the series non stationary series because the mean and variance will vary with time so there are three basic criteria for a series to be classified as a stationary series the mean of the series should not be a function of time rather it should be a constant so here the green colored graph satisfies the condition whereas the graph in red has a time dependent mean then the variance of the series should also not be a function of time this property is 
known as homoscedasticity so over here you can notice the varying spread of distribution and the right hand graph and the third criteria is the covariance of the I th term and the I + MH term should not be a function of time in the graph here you will notice the spread becomes closer as the time increases hence the covariance is not constant with time for the red series so next we will study about different components on a time series data now as we know that a time series data varies with time there are many factors which results in this variation the effects of these factors are studied by the following four major components trends seasonal variation cyclic variation and irregular variation to better understand and apply these models will examine a business problem surrounding how management at Hotel EBC a mountain resort us able to prepare for their year by forecasting the number of room bookings they expect each month so this information can help management make informed decisions on staffing hospitality arrangements and pricing for rooms we have tasks their business analyst to forecast bookings so they can make the necessary preparations so the historical data contains monthly information from the past 10 years and the forecasts should contain monthly bookings for the next six months so your job will be to use the historical bookings data from the past years to investigate and clean the data and then determine the trend and seasonal components afterwards should have to apply the findings to an arima model and finally forecast the bookings for the next six months so that's quite a lot to think about but don't worry we'll be taking a closer look at time series forecasting methods and what they all mean so before we crunch any numbers and make a predictive model we need to plot our data in order to get a feel for what a time series looks like and time series plot shows graphical presentation of the relationship between time and the time series target variable so time is on the horizontal axis and the target variables values are shown on the vertical axis the first plot shows the complete ABC hotel time series so you can see that the series shows a general trend up and we call that an upward trend lastly our bookings plot shows two main patterns first we see the upward trend secondly we see some regularly occurring fluctuations up and down within the same calendar year so we call that a seasonal pattern so plotting or the time series allows us to visualize these patterns and behaviors we can the news of findings in the time series plot to create and fine-tune sophisticated forecasting models in our time series analysis or trend is a gradual shift or movement to relatively higher or lower values over a long period of time so in a trend pattern exhibits a general direction which is upward where there are higher highs and lower lows we call this an uptrend and when a trend pattern exhibits a general direction which is down where there are lower highs and lower lows we call this a downtrend so time series can experience changing direction where it can go from an uptrend or downtrend and if there were no trend we call it a horizontal or stationary trend so each trend we see in a time series is called the trend cycle when looking at the plot of the bookings data from ABC Hotel we see that ABC Hotel continues to increase their bookings year on year creating an upward trend in the time series so if ABC Hotel reaches capacity then booking will turn sideways and should a competing Hotel open up 
down the street and steal their bookings then ABC would experience a Down trend so time series data exhibits a repeating pattern at fixed intervals of time within a 1-year period is said to have a seasonal pattern or seasonality seasonality is a common pattern seen across many different kinds of time series for example if you live in a climate with cold winters and warm summers your homes air conditioning cause probably rise in the summer and fall in the winter and you would reasonably expect the seasonality of your air conditioning cost to recur every year likewise a company that sells heavy coats would see sales jump in the winter but drop in the summer so companies that understand the seasonal patterns of the business can time inventory staffing and other decisions that coincide with the expected change in business so what we see now was a seasonality plot again an all time series plots the vertical axis is our target variable and the horizontal axis shows our seasonality for example with monthly data all the monthly values are plotted in chronological order with each month on the horizontal axis and the numbers 1 through 10 represent the 10 years of data that we have seen in the time series and when looking at this plot for the ABC Hotel time series we can tell that our results during winter and summer months are significantly higher so these findings are not surprising considering most people go to the hotel during these months to take advantage of skiing and hiking since this pattern repeats each year the same time intervals you can safely say that our time series contains seasonality another noticeable characteristic of the ABC hotel booking seasonality plot is that the magnitude of bookings increase year after year so guys keep this in mind us will be very important for a predective modeling so one question which arises here is what if we see a pattern in our data that doesn't occur in the same calendar year.
Is that still a seasonal pattern no we call it as a cyclical pattern so what cyclical pattern exists when data exhibits rise and falls but not over a fixed period so think of business cycles which usually last several years but where the length of the current cycle is unknown beforehand in finance times of expansion and recession the stock market reveal cyclical patterns a cyclic apprentice referred to as a bull market while a cyclically downtrend is referred to as a bear market and these patterns in the general market occupy multiple years and don't have repeating pattern within each year so many people confuse cyclical behavior with seasonal behavior but they're quite different so the fluctuations are not of a fixed period than they are cyclically if the period is unchanging and associated with some aspect of the calendar then the pattern is seasonal in general with cyclically patterns the average length of cycles is longer than the length of the seasonal pattern and the magnitude of the cycles tends to change more than the magnitude of seasonal patterns also cyclical patterns are much harder to predict as well for example the decline in stock markets is often too sudden and violent all coming as a surprise to most and the variation of observation in our time series which is unusual or unexpected is known as irregular variation it is also termed as a random variation and is usually unpredictable the example of irregular variation can be the time of strikes or natural disasters which are unusual or unexpected so now that we've covered some of the basic building blocks of time series analysis next we can start discussing the first model type which is exponential smoothing model so exponential smoothing forecasts use weighted averages of past observations giving more weight to the most recent observation with weights gradually getting smaller as the observation gets older so the e T and s terms represent how the error trend and seasonality are applied in the smoothing method calculation so each term can be applied either additively multiplicativly or in some cases we left out of the model altogether this framework allows for a wide spectrum of time series analysis due to simplicity of the calculation so how do we determine how to apply the error trend and seasonality terms of an ets model a good way to start is to visualize the data by using a time series decomposition plot so what this plot does is separate the time series into its seasonal trend and error component so let's start by looking at the data from a business problem the first plot shows the actual time series the seasonal portion shows us that there is a seasonal pattern our trend line indicates the general course or tendency of the time series so it has a centered moving average of the time series and fits between the seasonal peaks and valleys this line is considered DC's analyzed and lastly the remainder is the error in the model that calculates the difference between the observed value and the trendline estimate here's the piece that is not accounted for by combining the seasonal Peice in the trend piece all time series will have this residual error to help explain what trend and seasonality cannot and making use of the trend seasonal and error plots shown together in a decomposition plot allows us to identify this main components of the time series so later we can extract this components so that we can figure our exponential smoothing model to best represent the underlying data for time series so why do we need ARIMA model 
panorama model is a class of statistical models for analyzing and forecasting time series data it is an acronym which stands for auto regressive integrated moving average if it is our generalization of the simpler autoregressive moving average and that's the notion of integration these ARIMA models are applied in some cases where data shows evidence of non stationarity and an initial differencing step can be applied one or more times to eliminate this non stationarity so a random variable which is time series is said to be stationary if it's statistical properties are all constant over time so were stationary series has no trend but as its variations around its mean have a constant amplitude and it.
Wiggles in a consistent fashion that is a short-term random time patterns always took the same in a statistical sense the latter condition means that it's Auto correlations which is nothing but correlations with its own prior deviations from the mean remain constant over time or equivalently that its power spectrum remains constant over time random variable of this form can be viewed as a combination of signal and noise and the signal could be a pattern of fast or slow mean reversion or sinusoidal oscillation and it could also have a seasonal component so ARIMA model can be viewed as a filter that tries to separate the signal from the noise and the signal is then extrapolated into the future to obtain forecasts so what exactly is ARIMA model the ARIMA forecasting equation for a stationary time series as linear that as regression type equation in which the predictors consist of large of the dependent variable or large to the forecast errors that as predicted value of y is equal to a constant or a weighted sum of one or more recent values of Pi or a weighted sum of one over recent values of the errors to the acronym ARIMA as descriptive capturing the key aspects of the model itself so a R means order regression so model that uses the dependent relationship between an observation and some number of lagged observations and I stands for integrated so this is used for differencing of raw observations that as subtracting one observation from another observation of the previous time step in order to make the time series stationary and that means stands for moving average so it is a model that uses the dependency between an observation and residual errors from moving average model apply to lagged observations and each of these components are explicitly specified in the model as a parameter and this is a standard notation used for these three aspects P T and Q but the parameters are substituted with integer values to quickly indicate the specific ARMA model being used so P denotes the number of lag observations included in the model and it is also called the lag order D stands for the number of times so the raw observations are different and it is also called the degree of differencing and Q denotes the size of the moving average window it is also called the order of moving average so now we look at the assumptions of ARIMA model so here the first assumption is that the series is a stationary essentially this means that the series is normally distributed and the mean and variance are constant over a long time period next is uncorrelated random error so we assume that the error term is randomly distributed and the mean and variance are constant over a time period so the Durbin Watson test is a standard test for correlated errors we also assume that there are no outliers in the series as outliers may affect conclusion strongly and can be misleading the last assumption is the random shocks or the random error component so if any shocks are present they are assumed to be randomly distributed with a mean of 0 and a constant variance now we look at the steps to build ARIMA model sometimes I am a model is also known as box Jenkins method so the Box elkins method as a stochastic model building process and it is an iterative approach that consists of the following three steps the first step is identification here we use the data and all related information to help select a subclass of the model that members summarize the data and next step is estimating so here we use the data to train the parameters of the 
model and the third step is diagnostic checking so here we evaluate the fitted model in the context of the available data and check for areas where the model may be improved in this entire process is iterative but as new information is K in during Diagnostics we can circle back to step 1 and incorporate that new information back into new model classes so let's take a look at these steps in more detail the identification step can be further broken down first we assess where there was a time series is stationary and if it is not we determine how many differences are required to make it stationary and after that we identify the parameters of an ARIMA model for the data now let's have a look at some of the tips during identification so it is advised to use unit route statistical tests on the time series to determine whether or not the stationery and also we need to avoid over differencing as much as possible so differencing the time series more than what is required can result in the addition of extra serial correlation and additional complexity now we look at the steps for configuring a are endemic so two diagnostic plots can be used to choose the P and Q parameters for the ARIMA model they are auto correlation function and partial auto correlation function so the ACF plot summarizes the correlation of an observation with lag values the x-axis shows the lag and the y-axis shows the correlation coefficient between minus 1 and 1 for negative and positive correlation while the PCF plot summarizes the correlations for an observation with lag values which are not accounted for by prior lagged observations so these plots are drawn as bar charts showing the 95 percentile 99 percent confidence intervals as horizontal lines for the bars that cross these confidence intervals are therefore more significant and worth noting now you may observe some useful patterns when you make this two plots such as the model is ar if the ECF trails off after a lag and has a hard cutoff in the pse of after lag so this lag is taken as a value for P and the model is MA if the ACF trails off after a lag and has a hard cutoff in the DCF after the light and this lag value is taken as the value for Q and the model is a mix of AR and ma if both the ECF and p ACF trail off so that was the process involved in identification and the next step is estimation so estimation involves using numerical methods to minimize a loss or error term the method of least squares can be used for this however for models involving anime component there is no simple form rather can be applied to obtain the estimates and the third step is diagnostic checking for the idea of diagnostic checking just to look for evidence that the model is not a good fit for the data the two useful areas to investigate Diagnostics are overfitting and residual errors so what do we do in or fitting we start off by checking of the model or fits the data generally this means that the model is more complex than it needs to be and captures random noise in the training data so this is a problem for time series forecasting because it negatively impacts the ability of the model to generalize resulting in poor forecast performance on data which is out of the sample data so careful attention must be paid to both in-sample and out-of-sample performance and this requires the careful design of a robust test harness for evaluating models and what do we do in case of residual errors so these forecast residuals provide a great opportunity for Diagnostics a review of the distribution of 
errors helps in removing out the bias in the model so the errors from an ideal model would resemble white noise which is a Gaussian distribution with a mean of zero and a symmetrical variance and for this purpose you may use density plots histograms and QQ plots and compare the distribution of erros to the expected distribution so a non Gaussian distribution may suggest an opportunity for data pre-processing and the skew in the distribution or a nonzero mean may suggest a bias and forecasts that may be correct additionally an ideal model would leave no temporal structure in the time series of forecasts residuals so this can be checked by creating a CF and P ACF plots of the residual error time series and the presence of serial correlation in the residual errors such as further opportunity for using this information in the model so we'll be implementing all of these on top of the eight passengers data set so let's quickly go to R studio one start working right guys so this is R studio and our first stars could be to have a glance to the eight passengers data set so I'll type in a passengers and this is our time series data so thus time series data starts from the year 1949 and goes on till the year 1960 and all of these are just entries which tell us about the number of passengers for each month so if we have a look at this entry then there were three hundred and two passengers during July 1954 similarly if we take this entry Over here there were 178 passengers during March of 1951 and if we take this entry there were four hundred and five passengers during December of 1959 and just to be sure let's also have a glance at the class of this data set so it'll be class of eight passengers so you get TS so TS is nothing but time series now let's also explore some time series functions so the start function would give us the first entry point of the time series dataset so start of a passenger and this is the first entry point so 1949 one so this basically is telling us the first entry point is the first month of the year 1949 and similarly we can use the end function to get the last entry point which would be end of a passengers the last entry point is the 12th month of the year 1960 right so the first entry point is January 1949 and the last entry point is December 1960 okay now let's also have a glance the summary of this data set so it'll be summary of air passengers so we see that the minimum number of passengers for any month is 104 the maximum number of passengers for any month was 622 and the mean number of passengers around 280.
Okay, now let's also plot this, so it'll be plot of AirPassengers, and this is what we get. Here time is mapped onto the x-axis and the number of passengers is mapped onto the y-axis, and what we see is that the number of passengers increases with respect to time. Now we can also add some labels and color to this plot, so I'll give this plot a shade of green. Right, so I've added color to this plot; let me also add a title to this, so it'll be main equals passengers versus time. So this is the plot after giving the color and the label, and the label which I have given is passengers versus time. Okay, now what I'll do is I'll also add a linear line on top of this plot with abline, where I'll use the lm function, and the dependent variable would be AirPassengers and the independent variable would be time of AirPassengers. So we have also mapped a linear line on top of this, and this linear line is nothing but the mean, guys. The inference which you can draw from this is that as the time increases the mean also increases, or in other words the mean is a function of the time variable. But if the mean is a function of time, then this time series data is not stationary, so this is one problem. Another problem is that the variance is also not equal: if you take these two peaks over here, the distance from this peak to the mean line and the distance from this peak to the mean line is different, and again if we take, let's say, these two peaks, the distance from this peak to the mean line and the distance from this peak to the mean line is again different. So the two major problems are that the variance is not equal and the mean is not constant, and that is why this AirPassengers time series data is not stationary. Okay, so now we'll go ahead and have a glance at the decomposition plot, so I'll type plot of decompose of AirPassengers and we get the decomposition plot. This is the original graph and this is the trend line which we get, so we see that there is a general upward trend, and this is the seasonal pattern and this is the random pattern. Now we'll also have a glance at the cyclical pattern, and to get the cyclical pattern we'd have to build a box plot, so it'll be boxplot, and I'll be mapping AirPassengers onto the y-axis and cycle of AirPassengers onto the x-axis. Right, so this is the cyclical pattern, guys. If we have a look at this graph closely, what we see is that most of the traffic comes during the seventh and eighth months, that is during July and August, and the minimum traffic comes during the second and eleventh months, that is February and November. Again, I'm restating it guys: most of the traffic which we get is during the seventh and eighth months, or in other words July and August, and the minimum traffic which we get is during February and November, so this is the cyclical pattern which we obtain. Okay guys, so now as I've already told you, this AirPassengers data is not stationary, so to make it stationary we need to do two things: first we need to make the variance equal, and second we need to make the mean constant. So we'll start with the first task: we'll go ahead and make the variance equal, and to do that we would have to use the logarithmic function. But let me actually plot the original graph first, so it'll be plot of AirPassengers, and we see here that the variance is not equal. Now to make the variance equal, all I need to do is use the log function, so it'll be plot of log of AirPassengers.
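Here is a minimal R sketch of the plotting steps just walked through; the green shade and the title string are stand-ins for whatever was used on screen, the rest follows the spoken commands:

```r
# Plot the series with a color and a title, then overlay the fitted mean line
plot(AirPassengers, col = "green4", main = "Passengers vs Time")  # color name is a stand-in
abline(lm(AirPassengers ~ time(AirPassengers)))  # the mean line rises, so mean depends on time

# Decompose into observed, trend, seasonal and random components
plot(decompose(AirPassengers))

# Month-wise box plot to see the within-year (cyclical) pattern
boxplot(AirPassengers ~ cycle(AirPassengers))    # peaks in months 7-8, dips in months 2 and 11
```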
And this is what we get after using the logarithmic function: this was the original trend line, and after applying the logarithmic function this is what we get. Now let's also add the linear line on top of this with abline; let me use the lm function, and this time the dependent variable would be log of AirPassengers and the independent variable would be time of log of AirPassengers. Right, so we've mapped this linear line on top of this, and this time what we see is that the variance is equal: the distance from this peak to the mean line and the distance from this peak to the mean line is equal, and again if you have a look at these two peaks, the distance from this peak to the mean line and the distance from this peak to the mean line is equal. So we have used the logarithmic function to make the variance equal. Okay, so now we'll also go ahead and make the mean constant, and to make the mean constant we have to use the differencing function. So let's do that: it'll be plot of diff of log of AirPassengers. Right, so log of AirPassengers gives us a trend line where the variance is equal, and applying the differencing function on top of that gives us a trend line where the mean is also constant. So this is the graph where the mean is constant and the variance is equal.
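A minimal R sketch of the two stabilizing steps just described, log() to even out the variance and diff() to flatten the mean:

```r
# Stabilize the variance with a log transform, then overlay the fitted mean line
plot(log(AirPassengers))
abline(lm(log(AirPassengers) ~ time(log(AirPassengers))))

# Difference the log series once so the mean becomes constant as well
plot(diff(log(AirPassengers)))
```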
Now, once again I will plot the abline on top of this, so it'll be abline with lm, and this time the dependent variable will be diff of log of AirPassengers and the independent variable would be time of diff of log of AirPassengers. Right, so we have also mapped the mean line on top of this, and we see that the mean line is horizontal, or in other words the mean is constant, and the variance is also equal. So we have successfully converted the non-stationary data into stationary data, and that is why we can go ahead and build the ARIMA model on top of it. But before we go ahead and build ARIMA, let's understand it properly. It's AR-I-MA: AR stands for auto regression, I stands for integration and MA stands for moving averages, and AR is denoted by p, I is denoted by d and MA is denoted by q. Now to find out the values of p and q we can use the ACF and PACF functions. So I will type acf, and let me give in the original dataset first, so it'll be acf of AirPassengers; keep this graph in mind. Now what I'll do is, inside acf, I'll give in the modified series, which would be diff of log of AirPassengers. Right, so this is the modified ACF graph: in the first graph we see that there are no inverted lines, and over here we see that there are some inverted lines. Now to get the q value, it will be that line which comes just before the first inverted line; I am repeating it guys, q would be that line which comes just before the first inverted line. Here the numbering starts from 0, so this would be the zeroth line, this would be the first line and this would be the second line, and since this is the first inverted line, the value of q would be 1: 0, 1, 2, and this is the value of q. Okay, so the value of q is 1. Now similarly, to get the value of p we need to use the pacf function, where PACF stands for partial autocorrelation function. All right, so again it's the same case: to get the value of p it will be that line which comes just before the first inverted line, so this is the zeroth line, this is the first line, and since the line just before the first inverted line is the zeroth line, the value of p would be 0. Okay, so we have found out the value of p and the value of q, and the value of d is just the number of times we difference the dataset to make the mean constant, and since we differenced the dataset only once, the value of d would be 1. So we have successfully got the values of p, d and q: the value of p is 0, the value of d is 1 and the value of q is also 1. Okay, since we've got the values of p, d and q, we can go ahead and build the ARIMA model. So I'll type in arima here, and inside this I will give in log of AirPassengers, and after this I'll give in the values of p, d and q, so it will be 0, 1, 1. Now you guys must be wondering why I haven't differenced this: that is because I'm already giving the value of d as 1 over here, and that is why I'm not differencing it. Okay, so I've given the dataset and the values of p, d and q, and now I'll also give in the seasonal parameters, so seasonal equals list of order equals c of 0, 1, 0, and after this I will also give in the period, and since there are 12 months in one year the period would be equal to 12, and I will store this in, let's say, modtime. Right guys, we have successfully built the ARIMA model, and now it's time to predict the values. So I'll use the predict function, and the first parameter would be the model which we just built, so predict of modtime, and after this the next parameter would be the number of months, that is, the number of years for which we'd want the prediction, and I'd want the prediction for the next ten years, so it'll be 10 into 12, and I will store this in, let's say, predtime.
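To tie these identification and fitting steps together, here is a minimal R sketch; the object names modtime and predtime mirror the ones used in the session, and the (0,1,1) with seasonal (0,1,0) order is the one chosen above:

```r
# Identify q and p from the differenced, log-transformed (stationary) series
acf(AirPassengers)               # original series, for comparison
acf(diff(log(AirPassengers)))    # line just before the first inverted line -> q = 1
pacf(diff(log(AirPassengers)))   # line just before the first inverted line -> p = 0

# Fit ARIMA(0,1,1) with a seasonal (0,1,0) component over a 12-month period.
# d = 1 is passed to arima(), so the series is not pre-differenced here.
modtime <- arima(log(AirPassengers), order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 0), period = 12))

# Forecast the next 10 years (10 * 12 = 120 months), still on the log scale
predtime <- predict(modtime, n.ahead = 10 * 12)
```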
Okay, so let me print this, so it'll be predtime. Now guys, what you need to keep in mind is that all of these values are logarithmic values, which is why we see small log-scale numbers instead of raw passenger counts. So we need to raise e to the power of these values; the value of e is 2.718, so I'll compute 2.718 to the power of predtime dollar pred. Right, so I will store this back in predtime, and this time let me print it again: predtime. So guys, these are the predicted values, and we get predicted values from 1961 to 1970. Let's have a look at this entry: the number of passengers for March 1966 would be 653; similarly, the number of passengers for August 1970 would be 1271; the number of passengers for November 1967 would be 655; and again, if we have a look at this, the number of passengers for May 1970 would be 990. So guys, we have successfully forecasted values for the next 10 years. Now we will also go ahead and make a plot of the actual values and the predicted values, so it'll be ts dot plot, and let me give in the actual values which are stored in AirPassengers and the predicted values which are stored in predtime, and since we used a logarithmic function it'll be log equals "y". Now I'll use the lty parameter to represent all of the actual values with solid lines and all of the predicted values with dotted lines, so it'll be lty equals c of 1, 3. Right guys, so these are the actual values and these are the predicted values: we have actual values from 1949 to 1960 and then the forecasted values for the next ten years. All right, okay, so now I'll also give a color to this, so col equals a shade of green; yep, so I've changed the color. Now I'll also give a title to this, and the title which I'll be giving is forecasted values. Right guys, so this is the final plot: I've changed the color and I've also given a title to this. So guys, this is how we can work with the ARIMA model and forecast new values. Now if this session was useful to you guys, please like this video and also comment down any queries that you have in regards to this particular video. So thanks a lot for listening to me guys, have a great day ahead and goodbye.
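For reference, a minimal sketch of the back-transformation and comparison plot described above; the session uses the rounded constant 2.718, while exp() would be the exact equivalent, and the color and title strings are stand-ins:

```r
# Convert the log-scale forecasts back to passenger counts
predtime <- 2.718 ^ predtime$pred   # exp(predtime$pred) is the exact form

# Plot actual values (solid) and forecasted values (dotted) on a log-scaled y-axis
ts.plot(AirPassengers, predtime, log = "y", lty = c(1, 3),
        col = "green4", main = "Forecasted Values")
```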