Introduction

I believe there is no need to emphasize the importance of reliable time series predictions. After decades of research on the subject, two main approaches survive: statistical (ARIMA, SARIMA, Box-Jenkins, Holt-Winters, ...) and neural (TLNN, CNN, recurrent networks, LSTM, ...). Each one has its merits, otherwise there wouldn't be two: if you choose statistics, you know the what and the why of every number you get; if you go neural, you possibly get better results, but without a justification. This article follows neither approach. There are plenty of good courses teaching how to do serious forecasting, and even articles on CodeProject ([1], [2]).

This article presents a quick and dirty way to get something, knowing that one can do better. The associated code is quite simple and completely self-contained. I wrote it in JavaScript just because, but it is straightforward to port it to other languages. It is also easy to do the same processing in Excel; in fact, you can download the same processing as an Excel file.

Background

If you go neural, you might even consider downloading a code library and feeding it your raw data. You would get something. If you prefer a more model-grounded approach, some preprocessing of the data is in order, and this also proves beneficial if neural processing follows later. Besides, most of the fun is here: here is the art, here is the magic that can turn a mediocre result into a good one.

The objective of preprocessing is to turn the data we have into a series that is as easy as possible to predict. The ideal one would be this:

Easy series

Fig.1 - Easy series

As dreams mostly do not come true, we will aim at a stationary time series. This basically means that if an interval of the series is cut out and presented alone, its descriptive statistics (mean, variance and so on) give no clue about which time interval it was taken from.

Stationary series

Fig.2 - Stationary series

Given a generic series, how do we try to get there? First, we check that the data is complete and credible. There should be no missing values in the series and no dubious outliers. If there are, there are techniques to mend the series, whose description unfortunately does not fit into this article. Then, we assume that the series derives from the combination of four components added to a baseline mean value like that of Fig.1:

a trend component, often assumed to be linear, but it could be a higher order polynomial, or exponential, or whatever;
a seasonality component, should there be any;
a cyclic component, typically related to economic cycles;
a random component, to account for our ignorance, which is itself stationary;
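The additive model listed above can be sketched in a few lines. This is only an illustration under my own assumptions: the helper name syntheticSeries, its option names, and the sinusoidal shape chosen for seasonality and cycle are hypothetical and are not part of the article's code.

```javascript
// Hypothetical helper: builds a series as baseline + trend + seasonality + cycle + noise.
// The sinusoidal seasonality and cycle are an assumption made for illustration only.
function syntheticSeries(n, opts) {
   const { baseline, slope, seasonLength, seasonAmp, cycleLength, cycleAmp, noise } = opts;
   const series = new Array(n);
   for (let t = 0; t < n; t++) {
      const trend  = slope * t;                                            // linear trend
      const season = seasonAmp * Math.sin(2 * Math.PI * t / seasonLength); // seasonality
      const cycle  = cycleAmp * Math.sin(2 * Math.PI * t / cycleLength);   // slow cycle
      series[t] = baseline + trend + season + cycle + noise(t);            // random part
   }
   return series;
}

// Example: 48 monthly values, a 12-month season, a 96-month cycle, and no noise
const demo = syntheticSeries(48, {
   baseline: 100, slope: 0.5,
   seasonLength: 12, seasonAmp: 10,
   cycleLength: 96, cycleAmp: 5,
   noise: () => 0   // swap in e.g. () => Math.random() - 0.5 for a random component
});
console.log(demo.slice(0, 3));
```

Setting an amplitude or the slope to zero switches the corresponding component off, which is a quick way to produce series resembling the ones shown in the figures that follow.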

This model accounts for the following. A series with just a (linear) trend looks like this:

Trendy series

Fig.3 - Series with a trend

Adding seasonality, we get something like this:

Fig.4 - Series with trend and seasonality

Should there be any cyclic effect, the result could be this:

Fig.5 - Series with trend, seasonality and cyclicity

And the full model, including randomness, could become this:

Fig.5 - Series, the full model

Now, given the data, if we could de-randomize it, de-seasonalize it, de-trend it and remove the cycle effect (de-cyclify?), we would be left with the simple series of Figure 1. This is what data preprocessing aims to do. I will show it on one example, which is *the* time series, the one used by Box and Jenkins to present their eponymous method [3]. The series describes the monthly totals of international airline passengers for the period between January 1949 and December 1960. These data are available from many sites on the net; I will use a copy in JSON format. The series looks as follows:

Box Jenkins series

Fig.6 - Airline passengers series

The series is made of 144 values, but I will not use them all, at least not for forecasting. A number of the last values will be set apart, and I will assess the quality of the forecast by trying to generate them (without having previously seen them, obviously) and by comparing the forecast with the actual data. How many values are to be set apart? I will take the last 12 values, for reasons that will become apparent in the following. Therefore, the first 132 values of the series will be stored in an array (named tSeries in the code), and the last 12 in another array (named checkSeries in the code).
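A minimal sketch of this split is shown below. The helper name splitSeries and the placeholder data are my assumptions; the article's own code simply fills the two arrays tSeries and checkSeries.

```javascript
// Hypothetical helper: keeps the last `holdOut` values apart for checking the forecast
function splitSeries(values, holdOut) {
   if (holdOut < 0 || holdOut > values.length)
      throw new RangeError("holdOut must be between 0 and the series length");
   const cut = values.length - holdOut;
   return { train: values.slice(0, cut), check: values.slice(cut) };
}

// Placeholder standing in for the 144 monthly totals (not the real data)
const full = Array.from({ length: 144 }, (_, i) => i + 1);
const { train, check } = splitSeries(full, 12);
console.log(train.length, check.length);   // prints 132 12
```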

Preprocessing

Since this series has no missing values and no outliers, the objective of this phase is to make the data more amenable to algorithmic intelligence, without tampering with its content. Two elements are worth noting: the trend is clearly nonlinear, and the amplitude of the seasonal swings increases dramatically over time. Moreover, there is no cyclic effect to speak of (it's before 2008, right?).

Box Jenkins series

Fig.7 - Nonlinearities in the series

Now we come to some code, finally. The nonlinearities can be effectively countered by taking the log of the data, which would (hopefully) leave a linear trend. This, in turn, can be removed by replacing each value with its difference from the previous one. The log-diff operator is as follows.

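// logdiff preprocessing (works in place on the global array tSeries of length n)
function logdiff()
{
    var i;
    // replace each value with its natural logarithm
    for (i=n-1;i>=0;i--)
        tSeries[i] = Math.log(tSeries[i]);

    // replace each value, except the first, with its difference from the previous one;
    // tSeries[0] keeps the log of the first value, needed later for reconstruction
    for (i=n-1;i>0;i--)
        tSeries[i] -= tSeries[i-1];
}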
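To see why this helps, here is a small sketch on made-up data: a series growing by 10% per step becomes a constant after log-diff. The pure functions logDiff and invLogDiff are hypothetical variants written for this illustration; the article's code works in place on tSeries instead.

```javascript
// Pure variant of log-diff: element 0 keeps log(values[0]), the others are log differences
function logDiff(values) {
   const logs = values.map(v => Math.log(v));
   return logs.map((v, i) => (i === 0 ? v : v - logs[i - 1]));
}

// Inverse operation: cumulative sum followed by the exponential
function invLogDiff(diffs) {
   let acc = 0;
   return diffs.map(d => Math.exp(acc += d));
}

const growth = [100, 110, 121, 133.1];      // made-up data, 10% growth per step
const d = logDiff(growth);
console.log(d.slice(1).map(v => v.toFixed(4)));    // every difference is about ln(1.1) = 0.0953
console.log(invLogDiff(d).map(v => v.toFixed(1))); // recovers the original values, up to rounding
```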

This leaves the following series to analyze, which is clearly nonstationary because of seasonality, but which got rid of all those pesky nonlinearities.

Box Jenkins series

Fig.8 - Preprocessed data

Deseasonalization

The series still contains seasonality and random components. We can see it does, but we have to let the code detect it. One way to determine whether there is a significant seasonality effect, and if so which one, is to use correlograms (to be precise, it should be Partial Autocorrelation Plots, see PACF), which make use of the Pearson correlation coefficient (warning: there is a lot more to say, but here we are in a hurry). Granted, my presentation here is a bit uncouth, and so is the method put forth, but this was implied in the title.

The idea is to compute autocorrelations for data values at varying time lags, that is, to determine the level of correlation between the series and the series itself shifted by 1, 2, ... time periods. If there is a seasonality, there will be a high correlation between the original series and the series shifted by a number of periods equal to the season duration.

Autocorrelations

Fig.9 - Series shifts for autocorrelation

It is now enough to compute the correlation of original series with each of the shifted ones: the one that is maximally correlated was shifted by a lag corresponding to the season length. Here is the code for seasonality assessment.

// seasonality = max correlation among lags
function getSeasonality(k)
{  var i,j,pindex,imax;
   correlations = new Array(k);

   let arr1 = new Array(n-k);
   let arr2 = new Array(n-k);
   let maxcorr = imax = Number.NEGATIVE_INFINITY;

   // tSeries[0] holds the log of the first value, not a difference: it is needed
   // only to reconstruct the series and is never read here
   for (j = 0; j < n-k; j++) arr1[j]=tSeries[j+k];
   for (i = 1; i < k; i++)
   {
      for (j = 0; j < n-k; j++) arr2[j]=tSeries[j+k-i];
      pindex = pearson(arr1,arr2);
      correlations[i] = pindex.toFixed(3);
      if(pindex > maxcorr)
      {  maxcorr = pindex;
         imax = i;
      }
   }
   txtConsole.value += "Correlations: "+correlations+"\n ";

   return imax;
}

And, for the sake of completeness, here is the code that computes Pearson's coefficient.

// Pearson's correlation index
function pearson(arr1, arr2) 
{
   var i,np,num,avg1,avg2,den1,den2,den;

   if(arr1.length != arr2.length) return undefined;
   else
      np = arr1.length;

   if (np == 0) return 0;

   avg1 = avg2 = 0;
   for (i = 0; i < np; i++) 
   {  avg1 += arr1[i]/np;
      avg2 += arr2[i]/np;
   }

   num  = den1 = den2 = 0;

   for (i = 0; i < np; i++) 
   {  let dx = (arr1[i] - avg1);
      let dy = (arr2[i] - avg2);
      num  += dx * dy;
      den1 += dx * dx;
      den2 += dy * dy;
   }

   den = Math.sqrt(den1) * Math.sqrt(den2);
   if (den == 0) return 0;

   return num / den;
}
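As a quick sanity check of the idea, the sketch below runs the same kind of lag search on a made-up series with a period of 4. The names pearsonDemo and bestLag are hypothetical; the correlation is restated compactly only so that the example runs on its own. Note that the upper bound is chosen so that only one multiple of the period falls in the search range, just as 15 is used later for a 12-month season.

```javascript
// Compact Pearson correlation, restated so that this sketch is self-contained
function pearsonDemo(a, b) {
   const n = a.length;
   const ma = a.reduce((s, v) => s + v, 0) / n;
   const mb = b.reduce((s, v) => s + v, 0) / n;
   let num = 0, da = 0, db = 0;
   for (let i = 0; i < n; i++) {
      num += (a[i] - ma) * (b[i] - mb);
      da  += (a[i] - ma) ** 2;
      db  += (b[i] - mb) ** 2;
   }
   const den = Math.sqrt(da * db);
   return den === 0 ? 0 : num / den;
}

// Hypothetical helper: the lag in 1..maxLag-1 whose shifted copy correlates best with the series
function bestLag(series, maxLag) {
   let best = -1, bestCorr = -Infinity;
   for (let lag = 1; lag < maxLag; lag++) {
      const head = series.slice(lag);                     // series from position lag onward
      const tail = series.slice(0, series.length - lag);  // series shifted by lag periods
      const c = pearsonDemo(head, tail);
      if (c > bestCorr) { bestCorr = c; best = lag; }
   }
   return best;
}

// Made-up series repeating the same 4 values ten times
const pattern = [3, -1, -4, 2];
const periodic = Array.from({ length: 40 }, (_, t) => pattern[t % 4]);
console.log(bestLag(periodic, 7));   // prints 4
```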

Upon applying this procedure to our series, we get the following coefficients, which unsurprisingly confirm that there is a high correlation for a shift of 12 months. The season is therefore 12 months long.

Autocorrelations

Fig.10 - Pearson's coefficients for time-lagged series

In order to de-seasonalize the data, it is now enough to determine the offset, with respect to the baseline, induced by each period within the season. Unfortunately, there is still the random component, which causes the same period of the season (the first, for example) to take different values from one season to the next.

Fortunately, de-randomization is easy if we accept the usual assumption that random values are serially uncorrelated, with zero mean and limited variance, and if we have enough values. Then, averaging all the values that correspond to the same period within the season will remove the random effects. Here is the code.

// remove seasonality effect
function deseasonalize(k)
{  var i;
   var inum = new Array(k).fill(0);
   seasoff  = new Array(k).fill(0);
   
   for(i=1;i < n;i++)
   {  seasoff[i%k] += tSeries[i];
      inum[i%k]++;
   }

   for(i=0;i < k;i++)
      seasoff[i] /= inum[i];  // averages of summed values
    
   txtConsole.value += "Seasonality indices: "+seasoff+"\n ";
}

Running this, we have the average offsets from the baseline induced by each period within the season, which in the example are:

Seasoff

Fig.11 - Season offsets from baseline
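The averaging step can be checked on a toy example. The sketch below is a hypothetical pure variant of deseasonalize (the name seasonalOffsets and the data are made up); it shows that disturbances averaging to zero cancel out and leave the underlying offsets.

```javascript
// Hypothetical pure variant of deseasonalize: averages the values that share
// the same position (index % k) within the season
function seasonalOffsets(diffs, k) {
   const sum = new Array(k).fill(0);
   const cnt = new Array(k).fill(0);
   for (let i = 1; i < diffs.length; i++) {   // index 0 is skipped: it is not a difference
      sum[i % k] += diffs[i];
      cnt[i % k]++;
   }
   return sum.map((s, j) => (cnt[j] > 0 ? s / cnt[j] : 0));
}

// Made-up data, season of length 3: offsets (1, -2, 1) plus small disturbances
const noisy = [0, -2.1, 1.2, 0.9, -2.0, 0.9, 1.1, -1.9, 0.9, 1.0];
const offsets = seasonalOffsets(noisy, 3);
console.log(offsets.map(v => v.toFixed(2)));   // roughly 1, -2, 1
```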

Forecasting

Finally, we are ready to forecast the next season's values, i.e., the values stored in the array checkSeries. At this point, this is trivial. Having removed randomness, seasonality and trend, and having assumed no cycle effect, we are back to the case of Figure 1, with just a flat baseline at a value corresponding to the first value of the log series (that is, 4.718). That's it, the forecast is done. We must however reintroduce seasonality and trend into the forecast. Not randomness, as this is unknown by hypothesis. First, I will reintroduce seasonality, adding the expected offsets to the flat baseline. The code is as follows:

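// append the forecast of the next season to the preprocessed history
function makeForecast()
{   var i;
    forSeries = new Array(tSeries.length+checkSeries.length);
    for(i=0;i < n;i++)
        forSeries[i]=tSeries[i];
    // seasoff is indexed by position within the season, i.e. by (index % season length)
    for(i=0;i < checkSeries.length;i++)
        forSeries[n+i] = seasoff[(n+i)%seasoff.length];
}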

Then, I reintroduce the trend by inverting the diff operator, that is, by adding each value to the previous one. At this point, it is enough to invert the log operator to get the final forecast values.

// data reconstruction
function recostructdata()
{
   var i;
   for(i=1;i < forSeries.length;i++)
      forSeries[i] += forSeries[i-1];

   for(i=0;i < forSeries.length;i++)
      forSeries[i] = Math.exp(forSeries[i]);
}

The results are shown in the following figures, first for the full series and then zoomed in on the last 3 seasons.

Forecast values

Fig.12 - Actual and forecast, whole series

Forecast values

Fig.13 - Actual and forecast, 3 seasons

How good are these? There are a number of indices quantifying the quality of a forecast, when compared to actual data. Some easy ones are BIAS (average of the differences between forecast and corresponding actual datum), Mean Absolute Deviation (MAD, same as BIAS, but with absolute value of differences), Standard Error (square root of the mean of squared differences), and Mean Absolute Percent Error (MAPE, mean of the absolute values of percent errors). In our case, they take the following values:

BIAS -6.50
MAD 15.47
Std.Err. 21.96
MAPE 3.21
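For reference, the four indices can be computed along these lines. This is a sketch under my own assumptions: the helper name forecastErrors is hypothetical, the difference is taken as forecast minus actual, and MAPE is expressed as a percentage.

```javascript
// Hypothetical helper computing the four indices for paired forecast/actual arrays.
// Assumptions: error = forecast - actual, MAPE in percent, no actual value equal to 0.
function forecastErrors(forecast, actual) {
   if (forecast.length !== actual.length || actual.length === 0)
      throw new Error("arrays must be non-empty and of equal length");
   const m = actual.length;
   let bias = 0, mad = 0, sq = 0, mape = 0;
   for (let i = 0; i < m; i++) {
      const e = forecast[i] - actual[i];
      bias += e;                          // signed errors can cancel out
      mad  += Math.abs(e);                // absolute errors cannot
      sq   += e * e;                      // squared errors penalize large misses
      mape += Math.abs(e / actual[i]);    // error relative to the actual value
   }
   return {
      bias:   bias / m,
      mad:    mad / m,
      stdErr: Math.sqrt(sq / m),
      mape:   100 * mape / m
   };
}

// Made-up example: the errors are 10, -10 and 30
const errs = forecastErrors([110, 190, 330], [100, 200, 300]);
console.log(errs);
```

In this example, BIAS is 10 while MAD is about 16.67, which shows how errors of opposite sign partly cancel in the former but not in the latter.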

If we unintelligently fed the raw data to the neural module of R (mind you, nothing wrong with it, but as I implied at the beginning, before stepping into a Ferrari, one should know how to drive), we would get these figures:

BIAS 4.10
MAD 15.67
Std.Err. 16.71
MAPE 3.26

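Therefore, the quality of the results is comparable: MAD and MAPE are nearly identical, the neural model has a lower standard error, and the two forecasts are biased in opposite directions.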

The Full Code

The full code is provided with the article, and is also available from this site. I wrote it with the objective of making it as readily intelligible as possible. The main function just pipelines the functions described above:

// Forecast pipeline
function forecast()
{
   console.log("Starting forecast pipeline")
   logdiff();
   let seasonality = getSeasonality(15); // 15 is an upper bound to seasonality
   deseasonalize(seasonality);
   makeForecast();
   recostructdata();
}

Points of Interest

What I presented is just a simple and lazy way to quickly obtain forecast indications. The possibility of quickly coding the algorithm in any reasonable language I know of, or even of unrolling it in Excel, makes it a worthwhile piece of knowledge, I think.

Clearly, the topic can be expanded. By making a proper model, surely, but also - in keeping with the spirit of this work - by adding quick and dirty confidence intervals or by moving to quick and dirty multivariate forecast.

Bonus content! The associated Excel file contains a sheet that applies this method to another time series, downloaded from the U.S. Census Bureau. As there are no nonlinearities here, the preprocessing uses just diff, with no log, although including the log would do no harm. You can try the code on this series, too. The series can also be downloaded in JSON format.

Let me add one last remark. I wrote this article in the belief that this publication outlet can be a useful and worthwhile complement to the standard, indexed scientific journals, for academics. I hope to have gauged correctly the level of the presentation. Comments are welcome.

Bibliography

[1] A Time-series Forecasting Library in C#
[2] Time Series Analysis in C#.NET
[3] Box G.E.P., Jenkins G.M., Reinsel G.C. Time Series Analysis, Forecasting and Control. Third Edition. Holden-Day. Series G. (1976)
[4] Correlograms, Wikipedia
[5] Pearson correlation coefficient, Wikipedia
[6] Jewelry retail monthly data, U.S. Census Bureau, Not Seasonally Adjusted Sales - Monthly [Millions of Dollars]

History

July 2018: Initial version