Time Series Part 1: An Introduction to Time Series Analysis
This tutorial provides the basics of time series analysis.
In this tutorial we cover some basics of time series so that you can go on to explore different types of time series problems. It will serve as a base for the forecasting algorithms introduced in future tutorials.
The complete code of this tutorial can be found at 01-Intro_time_series_tutorial.ipynb on GitHub.
After completing this tutorial, you will know:
- How to decompose time series
- How to identify time series properties
- The importance of stationarity
- How to make a time series stationary
Introduction to Time Series
Time series analysis deals with data that is ordered in time. Time series data is one of the most common data types and it is used in a wide variety of domains: finance, climate, health, energy, governance, industry, agriculture, business etc. Being able to effectively work with such data is an increasingly important skill for data scientists, especially when the goal is to report trends, forecast, and even detect anomalies.
The current explosion of the Internet of Things (IoT), which collects data using sensors and other sources, allows industries to use anomaly detection to predict when a machine will malfunction. This makes it possible to take action in advance and avoid stopping production. However, this is an example of anomaly detection, which won’t be the main focus of this tutorial. Our main focus is to introduce time series and some time series forecasting methods.
Some examples where time series are used for forecasting include:
- Public Administration: By forecasting the consumption of water and energy for the coming years, governments can plan and build the necessary infrastructure and avoid a collapse in the distribution of these resources.
- Health: Hospitals can use historical data to estimate the number of intensive care units needed in the future, or to plan the number of nurses per shift in order to reduce waiting times in emergency rooms.
- Different types of businesses: Analysing business trends, forecasting company revenue or exploring customer behaviour.
All in all, no matter the application, there is great interest in using historical data to forecast demand so that we can provide consumers with what they need without wasting resources. If we think about agriculture, for example, we want to be able to produce what people need without harming the environment. As a result, we have a positive impact not only on the lives of producers and consumers but also on society as a whole.
These few examples already give a good idea of why time series are important and why a data scientist should have some knowledge of them.
Components of Time Series and Time Series Properties
Time Series Decomposition
- trend shows whether the series is consistently decreasing (downward trend), constant (no trend) or increasing (upward trend) over time.
- seasonality describes the periodic signal in your time series.
- noise or residual displays the unexplained variance and volatility of the time series.
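To make the three components concrete, here is a minimal sketch of a textbook additive decomposition written with NumPy only. The function name `classical_additive_decompose` and the synthetic series are assumptions made for illustration; this is not the code from the tutorial notebook, and library implementations may differ in the details.

```python
import numpy as np

def classical_additive_decompose(y, period):
    """Split y into trend, seasonality and residual (additive, textbook version)."""
    y = np.asarray(y, dtype=float)
    n = len(y)

    # Trend: centred moving average spanning exactly one full period.
    if period % 2 == 0:
        # Even period: half weight on the two end points keeps the window centred.
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)          # the edges cannot be estimated and stay NaN
    trend[half:n - half] = np.convolve(y, weights, mode="valid")

    # Seasonality: average detrended value at each position in the cycle,
    # shifted so that the seasonal effects sum to zero over one period.
    detrended = y - trend
    pattern = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    pattern -= pattern.mean()
    seasonal = np.tile(pattern, n // period + 1)[:n]

    # Residual: whatever trend and seasonality do not explain.
    residual = y - trend - seasonal
    return trend, seasonal, residual

# Synthetic monthly-style series: upward trend + yearly cycle + noise (made-up values).
rng = np.random.default_rng(0)
t = np.arange(120)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, size=t.size)
trend, seasonal, residual = classical_additive_decompose(y, period=12)
```

Adding the three returned arrays back together recovers the original series wherever the trend is defined.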
Time Series Properties
Seasonality x Cyclicality
A time series is stationary if its statistical properties do not change over time.
Many algorithms, such as SARIMAX models, are built on this concept, so it is important to check for this property before using them. The reason is that regression techniques such as linear regression assume that the observations are independent of each other, whereas observations in a time series depend on time. Making the series stationary lets us apply regression techniques to time-dependent variables. In other words, the data becomes easier to analyse over long periods because its statistical properties no longer drift, so the algorithms can rely on that stability and make better predictions.
If the time series is non-stationary there are ways to make it stationary.
A stationary time series fulfills the following criteria:
- The mean is constant over time, i.e., there is no trend.
- The variance in the seasonality component is constant: The amplitude of the signal does not change much over time.
- Autocorrelation is constant: The relationship of each value of the time series and its neighbors stays the same.
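A rough way to get a feel for these criteria is to compare summary statistics over different parts of a series. The sketch below does this for two synthetic series; the helper `half_summary` is a hypothetical name introduced here, and the check is only an informal illustration, not a statistical test.

```python
import numpy as np

rng = np.random.default_rng(42)
white_noise = rng.normal(0, 1, size=1000)   # stationary by construction
random_walk = np.cumsum(white_noise)        # non-stationary: its variance grows with time

def half_summary(series):
    """Return (mean, variance) for the first and the second half of a series."""
    first, second = np.array_split(np.asarray(series, dtype=float), 2)
    return (first.mean(), first.var()), (second.mean(), second.var())

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    (m1, v1), (m2, v2) = half_summary(series)
    print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")
```

For the white noise, both halves should show a similar mean and variance; for the random walk, the two halves usually differ noticeably.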
Additive x Multiplicative Model
Y(t) = trend + seasonality + residual
- Linear trend: Trend is a straight line
- Linear seasonality: Seasonality with same frequency (width of cycles) and amplitude (height of cycles).
Y(t) = trend * seasonality * residual
- Non-linear trend: trend is a curved line
- Non-linear seasonality: Seasonality varies in frequency (width of cycles) and/or amplitude (height of cycles).
So to choose between additive and multiplicative decompositions we consider that:
- The additive model is useful when the seasonal variation is relatively constant over time.
- The multiplicative model is useful when the seasonal variation changes over time, typically growing as the level of the series rises.
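The difference between the two models can be sketched with synthetic data. All values below are made up for illustration. The last line also shows a common trick: taking logarithms turns a multiplicative series into an additive one.

```python
import numpy as np

t = np.arange(104)                                    # two years of weekly points
trend = 50 + 0.5 * t
seasonality = 1 + 0.2 * np.sin(2 * np.pi * t / 52)    # multiplicative factor around 1

additive = trend + 10 * np.sin(2 * np.pi * t / 52)    # swings keep the same size
multiplicative = trend * seasonality                  # swings grow with the trend

# Size of the seasonal swing (peak to trough) in year 1 and in year 2.
add_swings = [np.ptp((additive - trend)[:52]), np.ptp((additive - trend)[52:])]
mul_swings = [np.ptp((multiplicative - trend)[:52]), np.ptp((multiplicative - trend)[52:])]
print("additive swings:      ", add_swings)
print("multiplicative swings:", mul_swings)

# log(trend * seasonality) = log(trend) + log(seasonality)
log_multiplicative = np.log(multiplicative)
```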
Example of Time Series
Google Trends Data
The plot above shows a clear pattern: at the end of the year the word Diet has the lowest number of searches, while at the beginning of the year it has the highest. Do people at the end of the year just want to celebrate and enjoy good food? And, as a consequence, do they make getting in shape their New Year’s resolution?
This time series has a seasonal pattern, i.e., it is influenced by seasonal factors. Seasonality occurs over a fixed and known period (e.g., the quarter of the year, the month, or the day of the week). In this case the period seems to be yearly, with a dip during the end-of-year festivities followed by a peak at the beginning of the year.
We can also observe that there is no constant increase or decrease in the trend, which suggests a non-linear trend.
Let’s decompose this time series into its components. Because we believe that the trend is non-linear, we will set the parameter model to multiplicative. By default, this parameter is additive.
The parameter period is optional, but you can set it depending on the time series. Because the data is given in weeks, we set period to the number of weeks in a year.
Observe that neither the frequency nor the amplitude of the seasonal component changes with time, which suggests linear seasonality, i.e., a seasonal additive model. Let’s change the parameter model to additive.
As we suspected, the trend is non-linear. More than this, the time series follows no consistent upwards or downwards slope. Therefore, there is no positive (upwards slope) or negative (downwards slope) trend.
Also, if we compare the multiplicative and additive residuals, we can see that the latter are much smaller. As a result, an additive model (Trend + Seasonality) fits the original data much more closely.
Want to know more? Check this interesting article about performing additive and multiplicative decomposition.
The Augmented Dickey-Fuller (ADF) test is a very popular test for stationarity. However, a time series can pass the ADF test without being stationary. The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test is another test for checking the stationarity of a time series. It is prudent to apply both tests to gain more confidence that the series is truly stationary. In addition, we should not forget the importance of inspecting the time series plot.
The ADF test is used to determine the presence of a unit root in the series, and hence helps in understanding whether the series is stationary. The null and alternate hypotheses of this test are:
Null Hypothesis: The series has a unit root, meaning it is non-stationary. It has some time dependent structure.
Alternate Hypothesis: The series has no unit root, meaning it is stationary. It does not have time-dependent structure.
If we fail to reject the null hypothesis, the test has found no evidence against a unit root, so we treat the series as non-stationary.
A p-value below a threshold (such as 5% or 1%) suggests we reject the null hypothesis (the series is stationary); a p-value above the threshold suggests we fail to reject the null hypothesis (the series is non-stationary).
The null and alternate hypotheses of the KPSS test are the opposite of those of the ADF test.
Null Hypothesis: The process is trend stationary.
Alternate Hypothesis: The series has a unit root (series is not stationary).
A p-value below a threshold (such as 5% or 1%) suggests we reject the null hypothesis (the series is non-stationary); a p-value above the threshold suggests we fail to reject the null hypothesis (the series is stationary).
The following functions can be found here.
When applying those tests the following outcomes are possible:
Case 1: Both tests conclude that the series is not stationary – The series is not stationary.
Case 2: Both tests conclude that the series is stationary – The series is stationary.
Case 3: KPSS indicates stationarity and ADF indicates non-stationarity – The series is trend stationary. The trend needs to be removed to make the time series strictly stationary. After that, the detrended series is checked for stationarity.
Case 4: KPSS indicates non-stationarity and ADF indicates stationarity – The series is difference stationary. Differencing is used to make the series stationary. After that, the differenced series is checked for stationarity.
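The four cases above can be written down as a small lookup. The function name `interpret_stationarity` is hypothetical; it only encodes the decision table and assumes each test verdict has already been reduced to a boolean.

```python
def interpret_stationarity(adf_stationary, kpss_stationary):
    """Map the verdicts of the ADF and KPSS tests to the four cases listed above."""
    if not adf_stationary and not kpss_stationary:
        return "case 1: not stationary"
    if adf_stationary and kpss_stationary:
        return "case 2: stationary"
    if kpss_stationary:
        return "case 3: trend stationary, remove the trend and test again"
    return "case 4: difference stationary, difference the series and test again"

# ADF says stationary, KPSS says non-stationary.
print(interpret_stationarity(adf_stationary=True, kpss_stationary=False))
```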
Let’s apply the tests to the Google Trends data.
Based upon the significance level of 0.05 and the p-value:
ADF test: The null hypothesis is rejected. Hence, the series is stationary as per the ADF test.
KPSS test: There is evidence for rejecting the null hypothesis in favour of the alternative. Hence, the series is non-stationary as per the KPSS test.
These results fall under case 4 of the outcomes listed above. In this case, we should apply differencing to make the time series stationary and then apply the tests again, repeating until both tests point to stationarity.
Summing up, the analysis made for the Google dataset so far shows that:
- The trend is non-linear (multiplicative) and is not increasing or decreasing all the time.
- We have seasonality, which apparently is influenced by the end-of-year festive period.
- Seasonality is linear, i.e., it varies neither in frequency (width of cycles) nor in amplitude (height of cycles).
- Additive residuals are smaller than multiplicative residuals.
- Time series is stationary according to ADF test, but not according to KPSS test. Therefore, we need differencing to make the time series stationary.
Because of the linear seasonality and small additive residuals we can conclude that an additive model is more appropriate in this case.
Making Time Series Stationary
You can’t see any trend, or any obvious changes in variance, or dynamics. This time series now looks stationary.
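Differencing, the remedy suggested earlier for case 4, can be sketched with pandas as follows. The series is a deterministic toy example made up for illustration (a straight-line trend plus a pattern that repeats every 12 steps), so the effect of each step is easy to see. It is not the Google Trends data.

```python
import numpy as np
import pandas as pd

t = np.arange(48)
# Deterministic toy series: linear trend plus a seasonal pattern that repeats every 12 steps.
series = pd.Series(2.0 * t + 5.0 * np.sin(2 * np.pi * t / 12))

first_diff = series.diff().dropna()            # y(t) - y(t-1): removes the linear trend
seasonal_diff = first_diff.diff(12).dropna()   # value minus the value 12 steps earlier

print(first_diff.mean())           # roughly the slope of the trend, 2.0
print(seasonal_diff.abs().max())   # practically zero: only rounding error is left
```

On real data the differenced series will not be exactly zero, so the ADF and KPSS tests are applied again after each differencing step.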
Visit the notebook 01-Intro_time_series_tutorial.ipynb on GitHub for one more example, which uses a global temperature time series. This dataset includes global monthly mean temperature anomalies in degrees Celsius from 1880 to the present. Data are included from the GISS Surface Temperature (GISTEMP) analysis and the global component of Climate at a Glance (GCAG).
In this tutorial you were introduced to time series. You’ve learnt about important time series properties and how to identify them using both statistical and graphical tools.
In future tutorials we will build on what was learnt so far and introduce some forecasting algorithms: