Time series forecasting¶
Introduction¶
Time series forecasting is the task of predicting future values of a time series, i.e., a sequence of observations taken sequentially in time. This is a very common task in many domains, such as finance, weather, retail, etc.
Time series forecasting is a type of regression problem in supervised learning, where the input features are the previous values of the time series, and the target is the next value of the time series. Features built from previous values of the series are called lags. For example, if we want to predict the value of a time series at time t, we can use the values of the time series at times t-1, t-2, t-3, etc. as features.
However, in practice we can enrich the features of a time series problem by adding other features that are not necessarily previous observations of the time series. For example, if we want to predict the price of a stock at time t, we can use the price of the stock at times t-1, t-2, t-3, etc. as features, but we can also use other features such as the price of other stocks, the weather, etc. In any case, the features of a time series are naturally tabular data (and can be encoded in pandas DataFrames).
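As a minimal sketch of such a feature table, the snippet below builds three lags of a price series and adds one extra ("exogenous") column. The data, the column names price and temperature, and the choice of three lags are all assumptions made for illustration. Here the row for time t holds the lags as features and the value at time t as target, which is equivalent to pairing each row with the series shifted one step into the future.

```python
import pandas as pd

# Hypothetical toy data: a price series plus one exogenous variable.
df = pd.DataFrame(
    {
        "price": [10.0, 11.0, 12.5, 12.0, 13.0, 14.5],
        "temperature": [20.1, 19.8, 21.0, 22.3, 21.7, 20.9],
    },
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Lag features: the value of the series 1, 2 and 3 steps in the past.
features = pd.DataFrame(index=df.index)
for lag in (1, 2, 3):
    features[f"price_lag_{lag}"] = df["price"].shift(lag)

# Exogenous feature, lagged as well so that only information
# available before time t is used to predict time t.
features["temperature_lag_1"] = df["temperature"].shift(1)

# Target: the value of the series at time t.
target = df["price"]

# The first rows lack a full history (NaN lags) and are dropped.
valid = features.dropna().index
features, target = features.loc[valid], target.loc[valid]
```

The resulting features and target are ordinary tabular data, so any regression model that accepts a DataFrame can be trained on them.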
Note
The fact that the data is sequential in time means that we need to be careful when splitting the data into training and test sets.
Basic time series processing¶
Pandas provides some basic tools for time series processing. Sometimes it's not necessary to embark on a big time series forecasting project, and we can get away with some simple transformations of the data.
Pandas provides the following (simple) tools for time series processing:
- shift: shifts the values of a time series by a given number of steps. This is useful to create lags of the time series, i.e., features that are the previous values of the time series. For example, shift(1) creates a lag of 1.
- diff: computes the difference between consecutive values of a time series, i.e., the current value minus the previous value.
- rolling: computes a rolling statistic of a time series. For example, rolling(window=3).mean() computes the rolling mean (also known as the moving average) over the last 3 observations. The window parameter specifies the number of observations used to compute the rolling statistic. By default, the result is NaN for the first window - 1 observations, where there are not enough observations to compute it. rolling can be combined with mean(), but also with other functions such as sum(), min(), max(), etc.
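The three tools above can be sketched on a small series. The values and the name time_series are made up for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 10.0, 15.0], name="time_series")

# Lag of 1: each position holds the previous observation (NaN for the first one).
lag_1 = s.shift(1)

# First difference: current value minus previous value (NaN for the first one).
delta = s.diff()

# Rolling mean over the last 3 observations (NaN until 3 values are available).
rolling_mean = s.rolling(window=3).mean()

summary = pd.DataFrame(
    {"value": s, "lag_1": lag_1, "diff": delta, "rolling_mean_3": rolling_mean}
)
print(summary)
```

Passing min_periods=1 to rolling would instead compute the statistic over however many observations are available at the start of the series.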
Note
For more advanced time series and signal processing, we can use the SciPy library (and its scipy.signal module).
Time series in SKLearn¶
Unfortunately, Scikit Learn does not contain any of the classical time series models such as ARIMA, SARIMA, etc. However, it does contain some tools that are useful for time series forecasting, such as TimeSeriesSplit, and some general-purpose models that can be used for time series forecasting, such as the RandomForestRegressor.
Train-test split¶
When splitting time series data into training and test sets, we need to be careful to respect the time order of the data. We cannot shuffle the data before a split, as we would do with regular tabular data, because the time order of the data is important. The reason is that in time series forecasting we need to predict the future values of the time series using only the past values of the time series, since this is what we will have access to in practice.
Note
This is easily seen with an example: predicting the price of an asset for next Monday, using information from the previous Sunday and the Tuesday after said Monday is much easier than predicting Monday's price using only information from the past; also, in practice we will not have access to data from the future :)
In practice, we can split the data ourselves using iloc (if we are working with pandas DataFrames) or, if we are using NumPy arrays, the slicing operator []:
# The last row is left out of the test features: it has no next value to use as target
# Pandas DataFrame
X_train = df.iloc[:n_train]
X_test = df.iloc[n_train:-1]
# Numpy array
X_train = X[:n_train]
X_test = X[n_train:-1]
That was only the time series features. What is the target? The target is the time series shifted one step into the future. Assume that our time series is in the column time_series of the DataFrame df, and that n_train is the number of observations in the training set. Then, the target is:
# Pandas DataFrame
y_train = df['time_series'].iloc[1:n_train+1]
y_test = df['time_series'].iloc[n_train+1:]
# Numpy array
y_train = y[1:n_train+1]
y_test = y[n_train+1:]
The following figure shows how to create the training matrix from a pure time series:
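As a rough sketch of that construction in code, the helper below turns a one-dimensional series into a lag matrix X and a one-step-ahead target y, and then splits them chronologically. The function name make_lag_matrix, the toy series, and the values of n_lags and n_train are assumptions chosen for illustration.

```python
import numpy as np


def make_lag_matrix(series, n_lags):
    """Hypothetical helper: build (X, y) for one-step-ahead forecasting.

    Row i of X holds series[i], ..., series[i + n_lags - 1] (oldest first),
    and y[i] is the value that follows them, series[i + n_lags].
    """
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - n_lags
    if n_rows <= 0:
        raise ValueError("series is too short for the requested number of lags")
    X = np.column_stack([series[i : i + n_rows] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y


series = np.arange(10, dtype=float)  # 0, 1, ..., 9
X, y = make_lag_matrix(series, n_lags=3)

# Chronological split: the first rows are for training, the last rows for testing.
n_train = 5
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```

Because every row of X is already paired with its own target, X and y have the same number of rows and the split never needs to look ahead of the test period.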
Cross-validation with time series data¶
As just discussed, we need to be careful when splitting time series data into training and test sets. TimeSeriesSplit is a Scikit Learn cross-validation strategy that is useful for time series data. It is similar to the KFold strategy, but it differs in that the training data in each fold always comes before the test data. We can use TimeSeriesSplit with the cross_val_score function to perform cross-validation on time series data:
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
X = ... # Features, that include the time series from the past
y = ... # Target, that is the time series shifted one step into the future
tscv = TimeSeriesSplit(n_splits=5)
cross_val_score(model, X, y, cv=tscv)
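To see what the splitter does, it can help to print the indices it produces. The snippet below is a small illustration on 12 dummy observations with 3 splits (both numbers are arbitrary choices).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve dummy observations, already sorted in time order.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv.split(X)
]

for train_idx, test_idx in folds:
    # In every fold, all training indices come before all test indices.
    print("train:", train_idx, "test:", test_idx)
```

The training window grows from fold to fold while the test window moves forward in time, so no fold is ever evaluated on observations older than its training data.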
We can also use TimeSeriesSplit with the GridSearchCV and RandomizedSearchCV classes in a similar way:
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15]
}
tscv = TimeSeriesSplit(n_splits=5)
rf_grid_search = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid=param_grid,
    cv=tscv
)
rf_grid_search.fit(X, y)
Afterwards, we can use rf_grid_search to make predictions on the test set (by default, the search refits the best model on all the data passed to fit).
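A possible end-to-end sketch of this workflow is shown below. It assumes a synthetic noisy sine wave, 5 lags, a hold-out of the last 45 rows, and a deliberately small parameter grid so that it runs quickly; none of these choices come from a real dataset. The search is fitted on the training rows only, so the test rows stay unseen until the final prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic series (an assumption for illustration): noisy sine wave.
rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 20) + 0.1 * rng.normal(size=t.size)

# Lag matrix: each row holds 5 consecutive values, the target is the next one.
n_lags = 5
n_rows = len(series) - n_lags
X = np.column_stack([series[i : i + n_rows] for i in range(n_lags)])
y = series[n_lags:]

# Chronological hold-out: the last rows are never seen during the search.
n_train = 150
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, 5]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

# With the default refit=True, predict() uses the best model refitted on X_train.
y_pred = search.predict(X_test)
mae = np.mean(np.abs(y_pred - y_test))
print(search.best_params_, mae)
```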
Models for time series forecasting¶
As mentioned before, Scikit Learn does not contain any of the classical time series models, but we can nonetheless use a RandomForestRegressor to perform time series forecasting. When we need to forecast several steps ahead, we have two main options to choose from:
- Recursive multi-step forecasting: in this case, we train a model to predict the next value of the time series given the previous values of the time series. Then, we use the model to predict the next value, feed this prediction back in as if it were an observation to predict the following value, and so on. This option has the advantage that we only need to train one model, but it has the disadvantage that we have no control over how the errors propagate over time.
- Direct multi-step forecasting: in this case, we train n different models to predict the next n values of the time series, one model per forecasting step.
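The two strategies can be sketched side by side as follows. The synthetic series, the helper name lagged, and the values of n_lags and horizon are assumptions for illustration, not a reference implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def lagged(values, n_lags, step):
    """Hypothetical helper: windows of n_lags values as X, and as y the
    value located `step` steps after the end of each window."""
    n_rows = len(values) - n_lags - step + 1
    X = np.column_stack([values[i : i + n_rows] for i in range(n_lags)])
    y = values[n_lags + step - 1 : n_lags + step - 1 + n_rows]
    return X, y


# Synthetic series (an assumption for illustration): noisy sine wave.
rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 25) + 0.05 * rng.normal(size=t.size)

n_lags, horizon = 6, 4
history, actual = series[:-horizon], series[-horizon:]

# Recursive: one one-step model, fed with its own predictions.
X1, y1 = lagged(history, n_lags, step=1)
one_step = RandomForestRegressor(n_estimators=50, random_state=0).fit(X1, y1)
window = list(history[-n_lags:])
recursive = []
for _ in range(horizon):
    next_value = one_step.predict(np.array(window[-n_lags:]).reshape(1, -1))[0]
    recursive.append(next_value)
    window.append(next_value)  # the prediction becomes the newest "observation"

# Direct: one model per step, all fed with the last observed window.
last_window = history[-n_lags:].reshape(1, -1)
direct = []
for step in range(1, horizon + 1):
    Xh, yh = lagged(history, n_lags, step=step)
    model_h = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xh, yh)
    direct.append(model_h.predict(last_window)[0])

print("recursive:", recursive)
print("direct:   ", direct)
print("actual:   ", actual.tolist())
```

The first forecast is the same in both approaches, since both use a one-step model on the last observed window. They differ from the second step onwards, where the recursive forecast depends on earlier predictions and the direct forecast does not.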
The RandomForestRegressor is a good choice for time series forecasting because it can handle time series data, but can also handle other types of features to enrich the model. Sometimes, ARIMA models can even be used to generate features for a RandomForestRegressor model.
Note
Other tree-based models such as XGBoost and CatBoost can also be used for time series forecasting with extra ("exogenous") variables.
Note
Alternatives to the RandomForestRegressor for time series forecasting are the boosted trees from external libraries such as XGBoost or CatBoost. Neural networks are starting to be used for time series forecasting as well, but they are not yet as popular as the previous options (and require much more data to perform reasonably well).