Performing an EDA, with a Focus on Correlation, for Time Series Forecasting
In my previous post, I started learning about time series forecasting, and focused on some foundational concepts and techniques for preparing data. In this post, we are going to learn about several more key concepts (the components of a time series) and then discuss the exploratory data analysis (EDA) process, with a focus on calculating correlations between time series. Let’s get started.
Parts of a Time Series
A time series can contain [1]:
- Trends: Long-term changes (linear or otherwise) in the average value of a time series. For example, global warming shows up in temperature data as a long-term rise in the yearly average for many regions of the world.
- Seasonality: Regular variations that repeat over a fixed period, like a peak in power consumption during warm months (for air conditioning) or increased rainfall during the rainy season.
- Cycles: Cyclic patterns are often conflated with seasonality, but the difference is that a cyclic pattern is irregular, while seasonal patterns repeat on a fixed schedule. Manu Joseph cites economic recessions as cyclic because they tend to recur, but not on a regularly predictable timetable like the changes in temperature across seasons [1].
- Irregularities: The remainder of a time series signal after the trend, seasonal, and cyclic patterns have been accounted for. This is typically the stochastic component, and is also called the residual or error term.
Often data scientists combine these components via addition or multiplication to build models of the underlying data-generating process (DGP).
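For example, with $T_t$, $S_t$, and $R_t$ denoting the trend, seasonal, and residual components at time $t$, the two standard forms are an additive model and a multiplicative model:

\[y_t = T_t + S_t + R_t \quad \text{or} \quad y_t = T_t \times S_t \times R_t\]

The multiplicative form tends to fit better when the size of the seasonal swings grows with the overall level of the series.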
Exploratory Data Analysis
Some useful steps when exploring a time series dataset include:

- Look for correlations between the features and the target. Use correlation heat maps for some key variables, or generate line plots (with smoothing if necessary) of variables you hypothesize are related to see if they exhibit similar patterns [1]. A minimal sketch of this step appears after this list.
- Plot some or all of the data, perhaps grouping by geographic region, time period, etc. Box plots can be a good choice here because they indicate the interquartile range, outliers, and the median. Another good choice is a calendar heatmap, which also shows how a variable changes over time [1].
- Examine the distribution of the target variable.
- Look for trends, seasonality, and other cyclic patterns in the dataset.
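Here is a minimal sketch of the first step using pandas and seaborn; the dataset, column names, and the 7-day smoothing window are all hypothetical choices for illustration:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical daily dataset: temperature drives power consumption.
rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=365, freq="D")
temperature = 20 + 10 * np.sin(2 * np.pi * np.arange(365) / 365) + rng.normal(0, 2, 365)
df = pd.DataFrame(
    {
        "temperature": temperature,
        "humidity": rng.normal(60, 10, 365),
        "power_consumption": 100 + 3 * temperature + rng.normal(0, 5, 365),
    },
    index=idx,
)

# Correlation heat map for the key variables.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()

# Smoothed line plots of two variables we suspect move together.
df[["temperature", "power_consumption"]].rolling("7D").mean().plot()
plt.show()
```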
Decomposing Time Series Data
As we discussed above, time series data can contain several patterns combined together, and it is possible to break a time series apart into these components. Manu Joseph recommends detrending the data first, then removing the seasonality, leaving behind the residual signal at the end [1].
To detect trends in a time series, you can use something as simple as a moving average [1]. A moving average smooths out the data, often making it easier to see long-term trends [1]. However, moving averages usually leave behind a signal that is still somewhat noisy, which is not ideal: we want the trend and seasonality patterns that we extract to be very clean, with all the noise contained in the residual [1]. An alternative to a moving average is locally estimated scatterplot smoothing (LOESS) regression, which fits a smooth curve to a noisy signal using a non-parametric approach (meaning we do not assume a particular functional form for the trend) [1]. LOESS can be viewed as a more general version of the moving average: it fits a weighted regression model to the nearest neighbors of each point, in the style of a KNN meta-model [2].
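Both approaches are easy to try in Python. The sketch below uses pandas for the moving average and the lowess function from statsmodels; the series, the 30-step window, and the frac setting are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical noisy series with a linear trend.
rng = np.random.default_rng(0)
t = np.arange(365)
y = pd.Series(0.05 * t + rng.normal(scale=2.0, size=t.size))

# Moving average: smooth, but often still somewhat noisy.
trend_ma = y.rolling(window=30, center=True).mean()

# LOESS: local weighted regressions; frac controls how much of the
# data is used around each point (larger = smoother curve).
trend_loess = pd.Series(lowess(y, t, frac=0.3, return_sorted=False), index=y.index)
```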
Assuming we have successfully extracted the trend from the data, the next thing we need to do is extract the seasonal patterns. Since seasonal patterns are periodic in nature, we will use a different set of tools to identify them, for example period-adjusted averages or a Fourier series [1]. The period-adjusted average is a straightforward method where we compute the average of the series for one specific part of the cycle [1]. For example, if we are looking for seasonality in rainfall, we can imagine that the pattern repeats every year, and that the rainfall in one particular month should be similar from year to year [1]. The period-adjusted average method assumes this and would, in our example, compute the average rainfall for each month of the year over all the years of data that we have [1].
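A minimal sketch of the period-adjusted average for the monthly rainfall example, assuming a pandas Series with a datetime index (the data here is synthetic):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly rainfall over ten years: a yearly pattern plus noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2013-01-01", periods=120, freq="MS")
rain = pd.Series(50 + 30 * np.sin(2 * np.pi * idx.month / 12) + rng.normal(0, 5, 120), index=idx)

# Period-adjusted average: mean rainfall for each calendar month,
# averaged over all the years of data.
monthly_profile = rain.groupby(rain.index.month).mean()

# Broadcasting the profile back onto the original index gives the
# seasonal component of the series.
seasonal = pd.Series(rain.index.month.map(monthly_profile), index=rain.index)
```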
Instead of the period-adjusted average, we can also decompose seasonal patterns into a Fourier series. A Fourier series is a sum of sines and cosines, which are naturally repeating signals. We can fit the coefficients of a Fourier series (with a limited number of terms) to a seasonal pattern in the data [1].
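As a sketch of that fitting step, we can build sine and cosine features for an assumed period and fit the coefficients with ordinary least squares; the period of 12 and the three harmonics are assumptions for illustration:

```python
import numpy as np

# Hypothetical detrended series: a period-12 seasonal pattern plus
# noise, e.g. monthly data over ten years.
rng = np.random.default_rng(1)
t = np.arange(120)
detrended = 5 * np.sin(2 * np.pi * t / 12) + rng.normal(size=120)

def fourier_features(t, period, n_terms):
    """Columns of sines and cosines for n_terms harmonics, plus an intercept."""
    cols = [np.ones_like(t, dtype=float)]
    for k in range(1, n_terms + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Fit the Fourier coefficients by ordinary least squares.
X = fourier_features(t, period=12, n_terms=3)
coeffs, *_ = np.linalg.lstsq(X, detrended, rcond=None)
seasonal_fit = X @ coeffs
```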
There are several packages in Python that can perform this decomposition for you. One popular tool is the seasonal_decompose() function from the statsmodels.tsa.seasonal module [3].
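A minimal usage sketch on a synthetic monthly series; the model and period arguments would need to match your own data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series: trend + yearly seasonality + noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
series = pd.Series(
    0.5 * np.arange(96)                            # trend
    + 10 * np.sin(2 * np.pi * np.arange(96) / 12)  # yearly seasonality
    + rng.normal(scale=2.0, size=96),              # noise
    index=idx,
)

result = seasonal_decompose(series, model="additive", period=12)
trend, seasonal, residual = result.trend, result.seasonal, result.resid
result.plot()  # quick visual of the observed series and all three components
```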
Correlation in Time Series
Finding relationships between different time series is a different challenge from finding correlation between features in a dataset. For a non-time series dataset, we can compute the correlation between features using a number of methods, including the Pearson, Kendall, or Spearman methods [4]. The Pearson correlation coefficient is a commonly used measure of correlation between two datasets, and it specifically assumes that the relationship is linear [5]. The Pearson correlation coefficient takes values between -1 and 1, where values near 0 indicate weak correlation and values near either extreme represent strong correlation (positive or negative) [5]. For a population, we can write the Pearson correlation coefficient between two random variables $X$ and $Y$ as [5]:
\[\rho_{X, Y} = \frac{ \mathbb{E} \left[ (X - \mu_X) (Y - \mu_Y) \right] }{\sigma_X \sigma_Y}\]

The Kendall rank correlation coefficient and Spearman’s rank correlation coefficient also measure the similarity between two datasets. Both of them measure the amount of rank correlation between two variables [6,7].
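All three methods are available through pandas [4]; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 1, 4, 3, 5]})

# Pairwise correlation matrices under each method [4].
pearson = df.corr(method="pearson")
kendall = df.corr(method="kendall")
spearman = df.corr(method="spearman")
```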
But for time series data, how can we take the temporal nature of the data into account? According to [8], different metrics are not strictly required: we can still use the Pearson correlation coefficient described above to measure the correlation between two time series. Dr. Cheong warns us that outliers can skew the results of the calculation, and that the Pearson correlation coefficient assumes the data is homoskedastic, but still considers it a usable metric [8]. Note that a single Pearson coefficient does not tell us whether one time series leads or lags another; it is just a global estimate of the correlation over the entire time period [8]. It is also possible to compute the correlation coefficient over a local window and to roll that window along the time series to get a better sense of how the correlation itself varies with time [8].
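A rolling-window Pearson correlation is a one-liner in pandas; the two series and the 30-sample window below are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Two noisy copies of the same underlying oscillation.
rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=200, freq="D")
s1 = pd.Series(np.sin(np.arange(200) / 10) + rng.normal(0, 0.3, 200), index=idx)
s2 = pd.Series(np.sin(np.arange(200) / 10) + rng.normal(0, 0.3, 200), index=idx)

global_corr = s1.corr(s2)                      # single global estimate
rolling_corr = s1.rolling(window=30).corr(s2)  # local, time-varying estimate
```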
Cross-correlation techniques, by contrast, can give us a sense of which signals lead and which lag, something a single Pearson coefficient cannot [8]. For example, time lagged cross correlation (TLCC) still calculates the Pearson correlation, but does so repeatedly while shifting one signal relative to the other [8]. The lag at which the correlation peaks (again, a global value for the entire time period) indicates whether one signal leads or lags the other [8]. If we want to know at a local level which signal is leading and how that changes over time, we can again use a windowed approach (WTLCC) [8]. The limitation of these methods is that they assume we should look for correlation either globally or over a fixed window size [8].
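A simple TLCC sketch built from pandas shift(); the lag range and the synthetic 5-step delay are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def tlcc(s1: pd.Series, s2: pd.Series, max_lag: int) -> pd.Series:
    """Pearson correlation between s1 and s2 shifted by each lag."""
    return pd.Series(
        {lag: s1.corr(s2.shift(lag)) for lag in range(-max_lag, max_lag + 1)}
    )

# Hypothetical example: s2 trails s1 by 5 steps, plus noise.
rng = np.random.default_rng(0)
s1 = pd.Series(np.sin(np.arange(300) / 15))
s2 = s1.shift(5) + rng.normal(0, 0.1, 300)

correlations = tlcc(s1, s2, max_lag=20)
# The peak sits near lag = -5: shifting s2 back by 5 steps aligns it
# with s1, i.e. s1 leads s2.
best_lag = correlations.idxmax()
```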
Another technique, which also allows comparing signals of different lengths, is dynamic time warping (DTW) [8]. DTW finds the minimum-distance alignment between two signals by computing the distance between each time step of one signal and every time step of the other, then finding the cheapest path through that grid of distances [8]. It is particularly useful when the two signals represent things happening at different speeds, such as comparing speech or walking gaits [9]. To be more specific, DTW looks for an optimal matching between two sequences based on some rules [9]:
- Every index from the first time series must be matched with one or more indices from the other sequence, and vice versa.
- The first indices of each series need to match with each other, but these do not have to be the only matches they have. The same is true for the final indices of each sequence.
- The mapping of the indices of the first series to the other must be monotonically increasing.
There are multiple mappings that satisfy these rules, but the optimal one also minimizes some cost, like the total Euclidean distance [9]. The Python packages that implement DTW will often plot the two series in question and show where the matching occurred, indicating which parts of one series correlated best with the other [8].
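Rather than lean on a particular package’s API, here is a compact pure-NumPy sketch of the classic dynamic-programming formulation (for real work a dedicated DTW library would be the usual choice):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic DTW between two 1-D sequences (they may differ in length)."""
    n, m = len(a), len(b)
    # acc[i, j] = cost of the best alignment of a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # pointwise distance
            # Extend the cheapest of: diagonal match, or advancing
            # only one of the two sequences.
            acc[i, j] = cost + min(acc[i - 1, j - 1],
                                   acc[i - 1, j],
                                   acc[i, j - 1])
    return acc[n, m]

# Same shape, different speeds: DTW stays small where a lockstep
# point-by-point comparison would not.
slow = np.sin(np.linspace(0, 2 * np.pi, 100))
fast = np.sin(np.linspace(0, 2 * np.pi, 60))
print(dtw_distance(slow, fast))
```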
Finally, we can also use instantaneous phase synchrony to see when two oscillating signals are in phase [8]. This method does require that we first filter the inputs down to the primary wavelength we want to study [8]. It then uses the Hilbert transform to decompose each input signal into its instantaneous phase and power, and we compare the phases to see when the two signals are in or out of phase [8]. This method is similar to the rolling window correlation approach we saw earlier, but does not require the user to choose a window size [8].
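A minimal sketch following the approach in [8], using scipy’s Hilbert transform; the inputs here are synthetic and assumed to already be band-limited to the frequency of interest:

```python
import numpy as np
from scipy.signal import hilbert

# Two synthetic oscillations that slowly drift in and out of phase.
t = np.linspace(0, 10, 1000)
y1 = np.sin(2 * np.pi * 1.00 * t)
y2 = np.sin(2 * np.pi * 1.05 * t)

# The Hilbert transform yields the analytic signal, whose angle is
# the instantaneous phase.
phase1 = np.angle(hilbert(y1))
phase2 = np.angle(hilbert(y2))

# Per-sample synchrony in [0, 1], where 1 means perfectly in phase.
synchrony = 1 - np.sin(np.abs(phase1 - phase2) / 2)
```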
References
[1] Joseph, M. Modern Time Series Forecasting with Python : Explore Industry-Ready Time Series Forecasting Using Modern Machine Learning and Deep Learning. 1st ed. Birmingham: Packt Publishing, Limited, 2022.
[2] “Local regression.” Wikipedia. https://en.wikipedia.org/wiki/Local_regression Visited 8 Feb 2023.
[3] “statsmodels.tsa.seasonal.seasonal_decompose.” statsmodels 0.13.5. 02 Nov 2022. https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html Visited 8 Feb 2023.
[4] “pandas.DataFrame.corr.” pandas 1.5.3. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html Visited 17 Feb 2023.
[5] “Pearson correlation coefficient.” Wikipedia. https://en.wikipedia.org/wiki/Pearson_correlation_coefficient Visited 17 Feb 2023.
[6] “Kendall rank correlation coefficient.” Wikipedia. https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient Visited 17 Feb 2023.
[7] “Spearman’s rank correlation coefficient.” Wikipedia. https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Visited 17 Feb 2023.
[8] Cheong, J. “Four ways to quantify synchrony between time series data.” Towards Data Science. 13 May 2019. https://towardsdatascience.com/four-ways-to-quantify-synchrony-between-time-series-data-b99136c4a9c9 Visited 17 Feb 2023.
[9] “Dynamic time warping.” Wikipedia. https://en.wikipedia.org/wiki/Dynamic_time_warping Visited 17 Feb 2023.