In this lab, we're going to build an ARIMA model for some stock closing values. The lab objectives are to pull data from Google Cloud Storage into a Pandas DataFrame, prepare raw stock closing data for an ARIMA model, apply the Dickey-Fuller test for stationarity, and build an ARIMA model using the statsmodels library. To do all this, we'll be using an AI Platform Notebook.

Let's get started by booting up an AI Platform Notebook. First we'll go to the AI Platform Notebooks menu page by selecting it from the left-side navigation pane. Then we'll click on New Instance and select TensorFlow Without GPUs. Leave the defaults and then hit Create. It may take a few minutes for the instance to be ready. When it is ready, you will see the option to open JupyterLab. Go ahead and click on Open JupyterLab when you see it.

Since we'll be working out of a notebook with pre-filled content, let's go ahead and pull the course repository from GitHub. First open up a terminal, then clone the GitHub repository by pasting in the git clone command from the Qwiklabs directions. This will load the training-data-analyst repository onto your instance. When everything has been cloned, go ahead and navigate to the AI for Finance course folder, and in the solution folder open up the ARIMA model notebook.

For this lab we're going to need the statsmodels library, which isn't installed by default. In some AI Platform Notebooks this command will work, but it's safer to go back to the terminal and copy and paste the pip install command. This will take a few seconds. When statsmodels is finished installing, go back to the ARIMA model notebook and restart the kernel.

The first cell imports most of the libraries we'll need and sets some environment variables as well. In this next cell, we load the stock data from Google Cloud Storage by invoking Pandas' read_csv function. This loads our data into a DataFrame.
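The loading step can be sketched like this. Since the lab's actual bucket path and column names aren't shown here, this example reads a tiny inline CSV instead; with the real data you would pass the gs:// path from the lab directions to read_csv (pandas can read gs:// paths directly when the gcsfs package is installed).

```python
import io
import pandas as pd

# A tiny inline CSV standing in for the real file on Google Cloud Storage.
# The column names here ("date", "close") are assumptions for illustration.
csv_data = io.StringIO(
    "date,close\n"
    "2019-01-02,157.92\n"
    "2019-01-03,142.19\n"
    "2019-01-04,148.26\n"
)

# parse_dates + index_col set the index to a datetime object, which is what
# unlocks the time-series-specific methods (resample, rolling, and so on).
df = pd.read_csv(csv_data, parse_dates=["date"], index_col="date")
print(df.index.dtype)  # datetime64[ns]
```

The key detail is the `parse_dates` and `index_col` pair: without them the dates load as plain strings and pandas won't treat the data as a time series.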
Notice how we also set the index of the DataFrame to be a datetime object. This tells Pandas that we're working with time series data, which opens up some methods specific to time series.

So, ARIMA stands for AutoRegressive Integrated Moving Average. In this lab, rather than having statsmodels do the integrated part, we're going to do it manually in the next few steps. We integrate our data by differencing it, that is, subtracting the previous timestep's observation from each observation. Ultimately, we hope this process makes our time series stationary.

Rather than look at our data at a daily granularity, we'll look at it at a weekly granularity. We'll resample the data in this cell here to do just that. Okay. In the next cell we difference the data. We also take the log to normalize large fluctuations. The next cell drops the one null row, the earliest row in our dataset, since there's nothing to difference it with. Then here we plot the data as a time series. Finally, in this next cell we drop the close column since we won't be using it anymore; we'll only be using the differenced data.

Visually, our differenced time series does not look like it has any trends in it. But to be a bit more rigorous, we'll apply the Dickey-Fuller test for stationarity. So first, we'll load the relevant statsmodels libraries we'll need for this. In this cell we calculate a rolling standard deviation and average based on a 20-value window. Now let's plot the moving averages against the differenced time series data. Notice how the rolling average is centered around zero with some fluctuations.

Now in this next cell, we apply the Dickey-Fuller test. Since the p-value returned here is less than 0.05, our chosen threshold, we can reject the null hypothesis and conclude that our differenced data is in fact stationary.

In this next section, we're going to make some plots of autocorrelation and partial autocorrelation in order to help us choose hyperparameters for the ARIMA model.
Let's go ahead and view those plots before we dive into how to extract meaning from them. So observe this chart here, which shows how to interpret the plots we just made. For an ARMA model, which is what we're interested in since we already ensured our differenced data is stationary, we want to pick the lags corresponding to points outside of this shaded area here. So for the autocorrelation plot, which corresponds to the ARMA parameter q, we can choose one. Then looking at the partial autocorrelation plot, which helps us select the hyperparameter p, we can go with either one or three.

Now, let's go ahead and fit an ARMA model with these hyperparameters, and feel free to try different values. One way to evaluate the model is to look at the AIC, or Akaike Information Criterion. The more negative the AIC, the better.

One way to gain confidence in the ARMA model is to plot its fitted values against our original differenced data. That's what we do in this next cell. So it appears that our fitted model does a good job predicting direction, but it doesn't do a great job predicting the variance, or the extent of the peaks and valleys. The fitted values are in red here, while the original differenced data is in blue.

Finally, in the last cell we use our ARMA model to make a forecast two weeks into the future. We then plot the results. So the forecasted values are here in green at the very end, and as before, the blue is the original differenced data and the red is the fitted data.

See if you can improve on this model. Try different hyperparameter values when you train the model. What if you don't take the log of the differences?