Feature Engineering#
ForecastFlowML includes a preprocessing module that creates features from a time series dataset. This user guide shows how to create those features in a scalable way before the modelling phase.
Imports#
from forecastflowml import FeatureExtractor
from forecastflowml import ForecastFlowML
from forecastflowml.data.loader import load_walmart_m5
from pyspark.sql import SparkSession
from lightgbm import LGBMRegressor
import pandas as pd
pd.set_option("display.max_columns", 100)
Initialize Spark#
spark = (
    SparkSession.builder.master("local[4]")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "4")
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)
Sample Dataset#
df = load_walmart_m5(spark).localCheckpoint()
df.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|
+--------------------+-----------+-------+------+--------+--------+----------+-----+
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-29| 2.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-30| 5.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-31| 3.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-01| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-02| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-03| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-04| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-05| 1.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-06| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-07| 3.0|
+--------------------+-----------+-------+------+--------+--------+----------+-----+
only showing top 10 rows
Feature Overview#
With FeatureExtractor, we can extract:
Lag features
Rolling statistics (mean, standard deviation, etc.) with specified lags
Count of consecutive occurrences of a specific value, which can be used to count the number of out-of-stock periods
History length that refers to the number of periods from the beginning of the time series
Date features
Lags#
When extracting the features, we should be careful about the lags we create, because a model cannot use a lag that is shorter than the horizon it predicts. In this example, we are going to prepare features for four weekly models.
Model 1 will predict days 1–7, so it cannot use the 6 most recent lags (lag_1 to lag_6).
Model 2 will predict days 8–14, so it cannot use the 13 most recent lags (lag_1 to lag_13).
Model 3 will predict days 15–21, so it cannot use the 20 most recent lags (lag_1 to lag_20).
Model 4 will predict days 22–28, so it cannot use the 27 most recent lags (lag_1 to lag_27).
For lag features, we are going to extract the sales on the same weekday over the past 4 weeks.
Since each model has a different horizon, each will be allowed to use different lags in the modelling phase. In summary, we need to extract lag_7, lag_14, lag_21, lag_28, lag_35, lag_42, lag_49 and lag_56 as features.
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    lag_window_features={
        "lag": [7 * (i + 1) for i in range(8)],
    },
)
df_features = feature_extractor.transform(df)
df_features.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|lag_7|lag_14|lag_21|lag_28|lag_35|lag_42|lag_49|lag_56|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-01-31| 2.0| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-01| 0.0| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-02| 0.0| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-03| 0.0| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-04| 0.0| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-05| 0.0| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-06| 1.0| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-07| 0.0| 2.0| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-08| 0.0| 0.0| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-09| 0.0| 0.0| null| null| null| null| null| null| null|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+
only showing top 10 rows
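To make the lag semantics concrete, the following is a minimal pandas sketch that reproduces the lag_7 column for the first ten rows shown above. It is an illustration only and not how FeatureExtractor is implemented; the toy frame and the assumption of one row per day with no missing dates are ours.

```python
import pandas as pd

# Toy frame built from the first ten sales values of the series shown above.
toy = pd.DataFrame(
    {
        "id": ["FOODS_1_011_WI_2_evaluation"] * 10,
        "date": pd.date_range("2011-01-31", periods=10, freq="D"),
        "sales": [2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    }
)

# Sort so that shifting by rows corresponds to shifting by days
# (assumption: one row per day, no gaps in the dates).
toy = toy.sort_values(["id", "date"])

# lag_7 is the sales value observed 7 rows earlier within the same series.
toy["lag_7"] = toy.groupby("id")["sales"].shift(7)

print(toy[["date", "sales", "lag_7"]])
```

The first seven rows have no value 7 days earlier, which is why they are null in the Spark output as well.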
Rolling Statistics#
For rolling statistics, we are going to calculate the mean over windows of 7, 14 and 30 days. Each window is offset by the most recent lag that a model can use: 7 days for model 1, 14 days for model 2, 21 days for model 3 and 28 days for model 4.
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    lag_window_features={
        "mean": [[window, lag] for lag in [7, 14, 21, 28] for window in [7, 14, 30]],
    },
)
df_features = feature_extractor.transform(df)
df_features.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|window_7_lag_7_mean|window_14_lag_7_mean|window_30_lag_7_mean|window_7_lag_14_mean|window_14_lag_14_mean|window_30_lag_14_mean|window_7_lag_21_mean|window_14_lag_21_mean|window_30_lag_21_mean|window_7_lag_28_mean|window_14_lag_28_mean|window_30_lag_28_mean|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-01-31| 2.0| null| null| null| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-01| 0.0| null| null| null| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-02| 0.0| null| null| null| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-03| 0.0| null| null| null| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-04| 0.0| null| null| null| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-05| 0.0| null| null| null| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-06| 1.0| null| null| null| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-07| 0.0| 2.0| 2.0| 2.0| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-08| 0.0| 1.0| 1.0| 1.0| null| null| null| null| null| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-09| 0.0| 0.6666666666666666| 0.6666666666666666| 0.6666666666666666| null| null| null| null| null| null| null| null| null|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+
only showing top 10 rows
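The values in the output above are consistent with shifting the series by the lag and then averaging over whatever part of the window is available. A minimal pandas sketch of that reading follows; the helper name `lagged_rolling_mean` is hypothetical, and treating partial windows with `min_periods=1` is an assumption inferred from the output rather than from the library's source.

```python
import pandas as pd

# First ten sales values of the series shown above.
sales = pd.Series([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])


def lagged_rolling_mean(series: pd.Series, window: int, lag: int) -> pd.Series:
    """Hypothetical helper: mean over `window` rows, ending `lag` rows back."""
    # Shift first so the window never sees values more recent than the lag,
    # then average over the non-null values currently inside the window.
    return series.shift(lag).rolling(window, min_periods=1).mean()


window_7_lag_7_mean = lagged_rolling_mean(sales, window=7, lag=7)
print(window_7_lag_7_mean)
```

For a frame with many series, the same function would be applied per series, for example through a group-wise transform on the id column.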
Out-of-stock Periods#
Sometimes a product might be out of stock for a certain period. We are now going to count the consecutive periods in which no sales occurred, again offset by the most recent lag that each model can use.
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    count_consecutive_values={
        "value": 0,
        "lags": [7, 14, 21, 28],
    },
)
df_features = feature_extractor.transform(df).localCheckpoint()
df_features.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----------------------------+------------------------------+------------------------------+------------------------------+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|count_consecutive_value_lag_7|count_consecutive_value_lag_14|count_consecutive_value_lag_21|count_consecutive_value_lag_28|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----------------------------+------------------------------+------------------------------+------------------------------+
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-01-31| 2.0| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-01| 0.0| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-02| 0.0| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-03| 0.0| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-04| 0.0| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-05| 0.0| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-06| 1.0| null| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-07| 0.0| 0| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-08| 0.0| 1| null| null| null|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-09| 0.0| 2| null| null| null|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----------------------------+------------------------------+------------------------------+------------------------------+
only showing top 10 rows
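The counts above behave like a running length of the current streak of zeros, shifted by the lag. The sketch below reproduces count_consecutive_value_lag_7 for the first ten rows in pandas; the helper name `count_consecutive` is hypothetical and the logic is inferred from the output, not taken from the library.

```python
import pandas as pd

# First ten sales values of the series shown above.
sales = pd.Series([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])


def count_consecutive(series: pd.Series, value: float, lag: int) -> pd.Series:
    """Hypothetical helper: length of the current streak of `value`, `lag` rows back."""
    is_value = series.eq(value)
    # Every non-matching row starts a new run, so its cumulative count is a run id.
    run_id = (~is_value).cumsum()
    # Within each run, count how many matching rows have been seen so far.
    run_length = is_value.astype(int).groupby(run_id).cumsum()
    return run_length.shift(lag)


count_lag_7 = count_consecutive(sales, value=0, lag=7)
print(count_lag_7)
```

A non-zero sale resets the streak to 0, which matches the 0 shown for 2011-02-07, seven days after a day with sales of 2.0.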
History Length#
We can also count the total number of periods that have elapsed since the start of each time series.
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    history_length=True,
)
df_features = feature_extractor.transform(df).localCheckpoint()
df_features.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+--------------+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|history_length|
+--------------------+-----------+-------+------+--------+--------+----------+-----+--------------+
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-01-31| 2.0| 1|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-01| 0.0| 2|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-02| 0.0| 3|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-03| 0.0| 4|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-04| 0.0| 5|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-05| 0.0| 6|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-06| 1.0| 7|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-07| 0.0| 8|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-08| 0.0| 9|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-09| 0.0| 10|
+--------------------+-----------+-------+------+--------+--------+----------+-----+--------------+
only showing top 10 rows
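In pandas terms, the column above is a 1-based row counter within each series. The sketch below uses two hypothetical series ids, "a" and "b", to show that the counter restarts for every id; it assumes one row per period.

```python
import pandas as pd

# Two toy series of different lengths (ids "a" and "b" are made up).
toy = pd.DataFrame(
    {
        "id": ["a", "a", "a", "b", "b"],
        "date": pd.to_datetime(
            ["2011-01-31", "2011-02-01", "2011-02-02", "2011-01-31", "2011-02-01"]
        ),
        "sales": [2.0, 0.0, 0.0, 1.0, 3.0],
    }
)

# Order each series by date, then number its rows starting from 1.
toy = toy.sort_values(["id", "date"])
toy["history_length"] = toy.groupby("id").cumcount() + 1

print(toy)
```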
Date Features#
Finally, we can also include features derived from the date.
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    date_features=[
        "day_of_month",
        "day_of_week",
        "week_of_year",
        "week_of_month",
        "weekend",
        "quarter",
        "month",
        "year",
    ],
)
df_features = feature_extractor.transform(df).localCheckpoint()
df_features.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+------------+-----------+------------+-------------+-------+-------+-----+----+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|day_of_month|day_of_week|week_of_year|week_of_month|weekend|quarter|month|year|
+--------------------+-----------+-------+------+--------+--------+----------+-----+------------+-----------+------------+-------------+-------+-------+-----+----+
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-29| 2.0| 29| 7| 4| 5| 1| 1| 1|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-30| 5.0| 30| 1| 4| 5| 1| 1| 1|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-31| 3.0| 31| 2| 5| 5| 0| 1| 1|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-01| 0.0| 1| 3| 5| 1| 0| 1| 2|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-02| 0.0| 2| 4| 5| 1| 0| 1| 2|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-03| 0.0| 3| 5| 5| 1| 0| 1| 2|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-04| 0.0| 4| 6| 5| 1| 0| 1| 2|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-05| 1.0| 5| 7| 5| 1| 1| 1| 2|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-06| 0.0| 6| 1| 5| 1| 1| 1| 2|2011|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-07| 3.0| 7| 2| 6| 1| 0| 1| 2|2011|
+--------------------+-----------+-------+------+--------+--------+----------+-----+------------+-----------+------------+-------------+-------+-------+-----+----+
only showing top 10 rows
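The conventions behind these columns can be read off the output: day_of_week runs from 1 (Sunday) to 7 (Saturday), week_of_year follows ISO week numbering, and week_of_month matches the day of the month divided by 7 and rounded up. The pandas sketch below reproduces a few of the rows under those inferred conventions; it is not the library's implementation.

```python
import pandas as pd

# Four dates that appear in the outputs of this guide.
dates = pd.Series(
    pd.to_datetime(["2011-01-29", "2011-01-30", "2011-01-31", "2011-02-08"])
)

features = pd.DataFrame(
    {
        "date": dates,
        "day_of_month": dates.dt.day,
        # pandas counts Monday=0..Sunday=6; the table uses Sunday=1..Saturday=7.
        "day_of_week": (dates.dt.dayofweek + 1) % 7 + 1,
        "week_of_year": dates.dt.isocalendar().week.astype(int),
        # Days 1-7 fall in week 1, days 8-14 in week 2, and so on.
        "week_of_month": (dates.dt.day - 1) // 7 + 1,
        # Saturday (5) and Sunday (6) in pandas numbering.
        "weekend": dates.dt.dayofweek.isin([5, 6]).astype(int),
        "quarter": dates.dt.quarter,
        "month": dates.dt.month,
        "year": dates.dt.year,
    }
)

print(features)
```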
Combine Features#
Let’s combine all of the feature extraction steps together.
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    lag_window_features={
        "lag": [7 * (i + 1) for i in range(8)],
        "mean": [[window, lag] for lag in [7, 14, 21, 28] for window in [7, 14, 30]],
    },
    date_features=[
        "day_of_month",
        "day_of_week",
        "week_of_year",
        "week_of_month",
        "weekend",
        "quarter",
        "month",
        "year",
    ],
    count_consecutive_values={
        "value": 0,
        "lags": [7, 14, 21, 28],
    },
    history_length=True,
)
PySpark DataFrame#
df_train = feature_extractor.transform(df).localCheckpoint()
df_train.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------------+-------+-------+-----+----+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|lag_7|lag_14|lag_21|lag_28|lag_35|lag_42|lag_49|lag_56|window_7_lag_7_mean|window_14_lag_7_mean|window_30_lag_7_mean|window_7_lag_14_mean|window_14_lag_14_mean|window_30_lag_14_mean|window_7_lag_21_mean|window_14_lag_21_mean|window_30_lag_21_mean|window_7_lag_28_mean|window_14_lag_28_mean|window_30_lag_28_mean|count_consecutive_value_lag_7|count_consecutive_value_lag_14|count_consecutive_value_lag_21|count_consecutive_value_lag_28|history_length|day_of_month|day_of_week|week_of_year|week_of_month|weekend|quarter|month|year|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------------+-------+-------+-----+----+
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-01-31| 2.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 1| 31| 2| 5| 5| 0| 1| 1|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-01| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 2| 1| 3| 5| 1| 0| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-02| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 3| 2| 4| 5| 1| 0| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-03| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 4| 3| 5| 5| 1| 0| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-04| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 5| 4| 6| 5| 1| 0| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-05| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 6| 5| 7| 5| 1| 1| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-06| 1.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 7| 6| 1| 5| 1| 1| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-07| 0.0| 2.0| null| null| null| null| null| null| null| 2.0| 2.0| 2.0| null| null| null| null| null| null| null| null| null| 0| null| null| null| 8| 7| 2| 6| 1| 0| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-08| 0.0| 0.0| null| null| null| null| null| null| null| 1.0| 1.0| 1.0| null| null| null| null| null| null| null| null| null| 1| null| null| null| 9| 8| 3| 6| 2| 0| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-09| 0.0| 0.0| null| null| null| null| null| null| null| 0.6666666666666666| 0.6666666666666666| 0.6666666666666666| null| null| null| null| null| null| null| null| null| 2| null| null| null| 10| 9| 4| 6| 2| 0| 1| 2|2011|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------------+-------+-------+-----+----+
only showing top 10 rows
Pandas DataFrame#
feature_extractor.transform(df.toPandas(), spark=spark)
| | id | item_id | dept_id | cat_id | store_id | state_id | date | sales | lag_7 | lag_14 | lag_21 | lag_28 | lag_35 | lag_42 | lag_49 | lag_56 | window_7_lag_7_mean | window_14_lag_7_mean | window_30_lag_7_mean | window_7_lag_14_mean | window_14_lag_14_mean | window_30_lag_14_mean | window_7_lag_21_mean | window_14_lag_21_mean | window_30_lag_21_mean | window_7_lag_28_mean | window_14_lag_28_mean | window_30_lag_28_mean | count_consecutive_value_lag_7 | count_consecutive_value_lag_14 | count_consecutive_value_lag_21 | count_consecutive_value_lag_28 | history_length | day_of_month | day_of_week | week_of_year | week_of_month | weekend | quarter | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-01-31 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 31 | 2 | 5 | 5 | 0 | 1 | 1 | 2011 |
| 1 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-02-01 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | 1 | 3 | 5 | 1 | 0 | 1 | 2 | 2011 |
| 2 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-02-02 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3 | 2 | 4 | 5 | 1 | 0 | 1 | 2 | 2011 |
| 3 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-02-03 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 3 | 5 | 5 | 1 | 0 | 1 | 2 | 2011 |
| 4 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-02-04 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5 | 4 | 6 | 5 | 1 | 0 | 1 | 2 | 2011 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1470899 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-18 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.166667 | 0.142857 | 0.142857 | 0.166667 | 0.142857 | 0.142857 | 0.166667 | 0.142857 | 0.214286 | 0.133333 | 9.0 | 2.0 | 5.0 | 6.0 | 1936 | 18 | 4 | 20 | 3 | 0 | 2 | 5 | 2016 |
| 1470900 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-19 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.100000 | 0.142857 | 0.142857 | 0.166667 | 0.142857 | 0.071429 | 0.166667 | 0.000000 | 0.214286 | 0.133333 | 10.0 | 3.0 | 6.0 | 7.0 | 1937 | 19 | 5 | 20 | 3 | 0 | 2 | 5 | 2016 |
| 1470901 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-20 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.100000 | 0.142857 | 0.071429 | 0.166667 | 0.000000 | 0.071429 | 0.166667 | 0.142857 | 0.285714 | 0.166667 | 11.0 | 4.0 | 7.0 | 0.0 | 1938 | 20 | 6 | 20 | 3 | 0 | 2 | 5 | 2016 |
| 1470902 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-21 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.066667 | 0.142857 | 0.071429 | 0.166667 | 0.000000 | 0.071429 | 0.166667 | 0.142857 | 0.285714 | 0.166667 | 12.0 | 5.0 | 8.0 | 1.0 | 1939 | 21 | 7 | 20 | 3 | 1 | 2 | 5 | 2016 |
| 1470903 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-22 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.066667 | 0.142857 | 0.071429 | 0.166667 | 0.000000 | 0.071429 | 0.166667 | 0.142857 | 0.285714 | 0.166667 | 13.0 | 6.0 | 9.0 | 2.0 | 1940 | 22 | 1 | 20 | 4 | 1 | 2 | 5 | 2016 |
1470904 rows × 41 columns
Training#
We can now pass the features created by FeatureExtractor to ForecastFlowML for training. As mentioned in the lag feature creation step, we set use_lag_range=28 so that each model uses the lag features that fall within 28 days of the most recent lag available to it.
forecast_flow = ForecastFlowML(
    group_col="store_id",
    id_col="id",
    date_col="date",
    target_col="sales",
    date_frequency="days",
    model_horizon=7,
    max_forecast_horizon=28,
    model=LGBMRegressor(),
    use_lag_range=28,
)
trained_models = forecast_flow.train(df_train).toPandas()
trained_models
| | group | forecast_horizon | model | start_time | end_time | elapsed_seconds |
|---|---|---|---|---|---|---|
| 0 | CA_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:36) | 01-May-2023 (02:43:41) | 5.5 |
| 1 | CA_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:42) | 01-May-2023 (02:43:50) | 7.9 |
| 2 | WI_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:50) | 01-May-2023 (02:43:53) | 2.7 |
| 3 | WI_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:53) | 01-May-2023 (02:43:57) | 3.8 |
| 4 | CA_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:34) | 01-May-2023 (02:43:40) | 5.9 |
| 5 | CA_4 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:41) | 01-May-2023 (02:43:49) | 7.7 |
| 6 | TX_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:49) | 01-May-2023 (02:43:53) | 3.8 |
| 7 | TX_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:53) | 01-May-2023 (02:43:57) | 4.2 |
| 8 | WI_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:58) | 01-May-2023 (02:44:00) | 2.1 |
| 9 | TX_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (02:43:32) | 01-May-2023 (02:43:39) | 7.2 |
Examine Features#
Let’s examine which features are used for each model.
import pickle
features = {}
for i in range(4):
    model = pickle.loads(bytes(trained_models["model"].iloc[0][i], "latin1"))
    features[f"model_{i}"] = sorted(model.feature_name_)
pd.DataFrame(features)
| | model_0 | model_1 | model_2 | model_3 |
|---|---|---|---|---|
| 0 | count_consecutive_value_lag_7 | count_consecutive_value_lag_14 | count_consecutive_value_lag_21 | count_consecutive_value_lag_28 |
| 1 | day_of_month | day_of_month | day_of_month | day_of_month |
| 2 | day_of_week | day_of_week | day_of_week | day_of_week |
| 3 | history_length | history_length | history_length | history_length |
| 4 | lag_14 | lag_14 | lag_21 | lag_28 |
| 5 | lag_21 | lag_21 | lag_28 | lag_35 |
| 6 | lag_28 | lag_28 | lag_35 | lag_42 |
| 7 | lag_35 | lag_35 | lag_42 | lag_49 |
| 8 | lag_7 | lag_42 | lag_49 | lag_56 |
| 9 | month | month | month | month |
| 10 | quarter | quarter | quarter | quarter |
| 11 | week_of_month | week_of_month | week_of_month | week_of_month |
| 12 | week_of_year | week_of_year | week_of_year | week_of_year |
| 13 | weekend | weekend | weekend | weekend |
| 14 | window_14_lag_7_mean | window_14_lag_14_mean | window_14_lag_21_mean | window_14_lag_28_mean |
| 15 | window_30_lag_7_mean | window_30_lag_14_mean | window_30_lag_21_mean | window_30_lag_28_mean |
| 16 | window_7_lag_7_mean | window_7_lag_14_mean | window_7_lag_21_mean | window_7_lag_28_mean |
| 17 | year | year | year | year |
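The plain lag_N columns in the table follow a simple pattern: each model keeps the lags between the last day of its horizon and that day plus use_lag_range. The sketch below reproduces that pattern in plain Python. The function `usable_lags` is hypothetical and the rule is inferred from the table, not taken from the ForecastFlowML source; it covers only the lag_N columns, since the rolling and consecutive-count features in the table appear only at each model's most recent lag.

```python
def usable_lags(available_lags, model_max_horizon, use_lag_range):
    """Hypothetical helper: lags a model can use under the inferred rule."""
    lower = model_max_horizon                  # shortest lag known at forecast time
    upper = model_max_horizon + use_lag_range  # furthest lag kept for this model
    return [lag for lag in available_lags if lower <= lag <= upper]


# Same lags as extracted in this guide: 7, 14, ..., 56.
available = [7 * (i + 1) for i in range(8)]

# The four weekly models end at days 7, 14, 21 and 28.
for index, horizon_end in enumerate([7, 14, 21, 28]):
    lags = usable_lags(available, horizon_end, use_lag_range=28)
    print(f"model_{index}:", [f"lag_{lag}" for lag in lags])
```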