Quick Start#
ForecastFlowML is designed for scalable forecasting and uses Spark for feature engineering, training, prediction, and hyperparameter optimisation.
Use Cases#
ForecastFlowML generally covers three use cases:
Data is stored in a PySpark DataFrame, and we need to build many or large group models in parallel that do not fit into driver memory.
Data is stored in a PySpark DataFrame, and we need to build a few or small group models in parallel that fit into driver memory.
Data is stored in a Pandas DataFrame, and we need to build a few or small group models in parallel that fit into driver memory.
This quick guide shows how to develop a scalable forecasting system on a sample of the Kaggle Walmart M5 Competition dataset.
Goal#
Build independent models for each of the stores in the dataset.
Parallelize the training and inference steps.
Use LightGBM as the machine learning algorithm.
Use the direct multi-step forecasting approach.
Perform backtesting.
Import Packages#
from forecastflowml import ForecastFlowML
from forecastflowml import FeatureExtractor
from forecastflowml.data.loader import load_walmart_m5
from lightgbm import LGBMRegressor
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import plotly.express as px
import plotly.io as pio
import pandas as pd
pd.set_option("display.max_columns", 40)
Initialize Spark#
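spark = (
    SparkSession.builder.master("local[4]")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "4")
    # Arrow speeds up Spark <-> pandas conversion. On Spark 3.x this key is
    # deprecated in favour of "spark.sql.execution.arrow.pyspark.enabled".
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)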
Sample Dataset#
df = load_walmart_m5(spark)
df.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|
+--------------------+-----------+-------+------+--------+--------+----------+-----+
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-29| 2.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-30| 5.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-01-31| 3.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-01| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-02| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-03| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-04| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-05| 1.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-06| 0.0|
|FOODS_1_013_TX_2_...|FOODS_1_013|FOODS_1| FOODS| TX_2| TX|2011-02-07| 3.0|
+--------------------+-----------+-------+------+--------+--------+----------+-----+
only showing top 10 rows
Feature Engineering#
feature_extractor = FeatureExtractor(
    id_col="id",
    date_col="date",
    target_col="sales",
    lag_window_features={
        "lag": [7 * (i + 1) for i in range(8)],
        "mean": [[window, lag] for lag in [7, 14, 21, 28] for window in [7, 14, 30]],
    },
    date_features=[
        "day_of_month",
        "day_of_week",
        "week_of_year",
        "quarter",
        "month",
        "year",
    ],
    count_consecutive_values={
        "value": 0,
        "lags": [7, 14, 21, 28],
    },
    history_length=True,
)
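To make the generated column names concrete, here is a minimal pandas sketch of how a lag, a lagged rolling mean, and a lagged count of consecutive zeros could be computed for a single series. It reproduces the first ten rows of the `show()` output in the PySpark section of this page, but it is an illustration only, not ForecastFlowML's implementation: the helper names `lag_feature`, `lagged_window_mean`, and `lagged_consecutive_count` are hypothetical, and `min_periods=1` is an assumption inferred from that output.

```python
import pandas as pd


def lag_feature(sales: pd.Series, lag: int) -> pd.Series:
    """Value observed `lag` rows (days) earlier."""
    return sales.shift(lag)


def lagged_window_mean(sales: pd.Series, window: int, lag: int) -> pd.Series:
    """Mean over a `window`-day span that ends `lag` days earlier."""
    # min_periods=1 is an assumption: partial windows still yield a value.
    return sales.shift(lag).rolling(window, min_periods=1).mean()


def lagged_consecutive_count(sales: pd.Series, value: float, lag: int) -> pd.Series:
    """Length of the run of `value` that ends `lag` days earlier."""
    is_value = sales.eq(value)
    run_id = (~is_value).cumsum()  # a new run starts after every non-match
    run_length = is_value.astype(int).groupby(run_id).cumsum()
    return run_length.shift(lag)


# First ten daily sales of FOODS_1_011_WI_2 as shown in the sample output.
sales = pd.Series([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

features = pd.DataFrame(
    {
        "sales": sales,
        "lag_7": lag_feature(sales, lag=7),
        "window_7_lag_7_mean": lagged_window_mean(sales, window=7, lag=7),
        "count_consecutive_value_lag_7": lagged_consecutive_count(sales, value=0, lag=7),
        "history_length": range(1, len(sales) + 1),
    }
)
print(features.tail(3))
```

Because every feature is shifted by at least 7 days, the first seven rows of each lagged column are empty, which is why the early rows of the transformed data contain only nulls.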
Pandas DataFrame#
feature_extractor.transform(df.toPandas(), spark=spark)
| | id | item_id | dept_id | cat_id | store_id | state_id | date | sales | lag_7 | lag_14 | lag_21 | lag_28 | lag_35 | lag_42 | lag_49 | lag_56 | window_7_lag_7_mean | window_14_lag_7_mean | window_30_lag_7_mean | window_7_lag_14_mean | window_14_lag_14_mean | window_30_lag_14_mean | window_7_lag_21_mean | window_14_lag_21_mean | window_30_lag_21_mean | window_7_lag_28_mean | window_14_lag_28_mean | window_30_lag_28_mean | count_consecutive_value_lag_7 | count_consecutive_value_lag_14 | count_consecutive_value_lag_21 | count_consecutive_value_lag_28 | history_length | day_of_month | day_of_week | week_of_year | quarter | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-01-31 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 31 | 2 | 5 | 1 | 1 | 2011 |
| 1 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-02-01 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | 1 | 3 | 5 | 1 | 2 | 2011 |
| 2 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-02-02 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3 | 2 | 4 | 5 | 1 | 2 | 2011 |
| 3 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-02-03 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 3 | 5 | 5 | 1 | 2 | 2011 |
| 4 | FOODS_1_011_WI_2_evaluation | FOODS_1_011 | FOODS_1 | FOODS | WI_2 | WI | 2011-02-04 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5 | 4 | 6 | 5 | 1 | 2 | 2011 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1470899 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-18 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.166667 | 0.142857 | 0.142857 | 0.166667 | 0.142857 | 0.142857 | 0.166667 | 0.142857 | 0.214286 | 0.133333 | 9.0 | 2.0 | 5.0 | 6.0 | 1936 | 18 | 4 | 20 | 2 | 5 | 2016 |
| 1470900 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-19 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.100000 | 0.142857 | 0.142857 | 0.166667 | 0.142857 | 0.071429 | 0.166667 | 0.000000 | 0.214286 | 0.133333 | 10.0 | 3.0 | 6.0 | 7.0 | 1937 | 19 | 5 | 20 | 2 | 5 | 2016 |
| 1470901 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-20 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.100000 | 0.142857 | 0.071429 | 0.166667 | 0.000000 | 0.071429 | 0.166667 | 0.142857 | 0.285714 | 0.166667 | 11.0 | 4.0 | 7.0 | 0.0 | 1938 | 20 | 6 | 20 | 2 | 5 | 2016 |
| 1470902 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-21 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.066667 | 0.142857 | 0.071429 | 0.166667 | 0.000000 | 0.071429 | 0.166667 | 0.142857 | 0.285714 | 0.166667 | 12.0 | 5.0 | 8.0 | 1.0 | 1939 | 21 | 7 | 20 | 2 | 5 | 2016 |
| 1470903 | HOUSEHOLD_2_514_WI_3_evaluation | HOUSEHOLD_2_514 | HOUSEHOLD_2 | HOUSEHOLD | WI_3 | WI | 2016-05-22 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071429 | 0.066667 | 0.142857 | 0.071429 | 0.166667 | 0.000000 | 0.071429 | 0.166667 | 0.142857 | 0.285714 | 0.166667 | 13.0 | 6.0 | 9.0 | 2.0 | 1940 | 22 | 1 | 20 | 2 | 5 | 2016 |
1470904 rows × 39 columns
PySpark DataFrame#
df_features = feature_extractor.transform(df).localCheckpoint()
df_features.show(10)
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------+-----+----+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|lag_7|lag_14|lag_21|lag_28|lag_35|lag_42|lag_49|lag_56|window_7_lag_7_mean|window_14_lag_7_mean|window_30_lag_7_mean|window_7_lag_14_mean|window_14_lag_14_mean|window_30_lag_14_mean|window_7_lag_21_mean|window_14_lag_21_mean|window_30_lag_21_mean|window_7_lag_28_mean|window_14_lag_28_mean|window_30_lag_28_mean|count_consecutive_value_lag_7|count_consecutive_value_lag_14|count_consecutive_value_lag_21|count_consecutive_value_lag_28|history_length|day_of_month|day_of_week|week_of_year|quarter|month|year|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------+-----+----+
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-01-31| 2.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 1| 31| 2| 5| 1| 1|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-01| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 2| 1| 3| 5| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-02| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 3| 2| 4| 5| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-03| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 4| 3| 5| 5| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-04| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 5| 4| 6| 5| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-05| 0.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 6| 5| 7| 5| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-06| 1.0| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 7| 6| 1| 5| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-07| 0.0| 2.0| null| null| null| null| null| null| null| 2.0| 2.0| 2.0| null| null| null| null| null| null| null| null| null| 0| null| null| null| 8| 7| 2| 6| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-08| 0.0| 0.0| null| null| null| null| null| null| null| 1.0| 1.0| 1.0| null| null| null| null| null| null| null| null| null| 1| null| null| null| 9| 8| 3| 6| 1| 2|2011|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2011-02-09| 0.0| 0.0| null| null| null| null| null| null| null| 0.6666666666666666| 0.6666666666666666| 0.6666666666666666| null| null| null| null| null| null| null| null| null| 2| null| null| null| 10| 9| 4| 6| 1| 2|2011|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------+-----+----+
only showing top 10 rows
Training/Test Dataset#
df_train = df_features.filter(F.col("date") < "2016-04-25")
df_test = df_features.filter(F.col("date") >= "2016-04-25")
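The cutoff `2016-04-25` appears to hold out the last 28 days of the sample, whose final date in the outputs on this page is `2016-05-22`. A small sketch of deriving such a cutoff from the last date and a holdout length follows; the variable names are illustrative and the 28-day holdout is an assumption chosen to match `max_forecast_horizon=28` in the training configuration.

```python
from datetime import date, timedelta

last_date = date(2016, 5, 22)  # last date seen in the sample outputs
holdout_days = 28              # assumed to equal max_forecast_horizon

# The test window is inclusive of both its first and last day.
split_date = last_date - timedelta(days=holdout_days - 1)
print(split_date.isoformat())
```

Rows dated before `split_date` would go to training, and rows on or after it to testing.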
Training#
forecast_flow = ForecastFlowML(
    group_col="store_id",
    id_col="id",
    date_col="date",
    target_col="sales",
    categorical_cols=["item_id", "dept_id", "cat_id"],
    date_frequency="days",
    model_horizon=7,
    max_forecast_horizon=28,
    model=LGBMRegressor(),
)
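The `forecast_horizon` column in the training output below begins with `[[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, ...`, which suggests that the forecast range is cut into consecutive blocks of `model_horizon` steps, with one model per block. The sketch below shows that partition under this assumption; `horizon_groups` is a hypothetical helper, not part of the ForecastFlowML API.

```python
from typing import List


def horizon_groups(model_horizon: int, max_forecast_horizon: int) -> List[List[int]]:
    """Split steps 1..max_forecast_horizon into blocks of model_horizon steps."""
    steps = list(range(1, max_forecast_horizon + 1))
    return [
        steps[start:start + model_horizon]
        for start in range(0, len(steps), model_horizon)
    ]


groups = horizon_groups(model_horizon=7, max_forecast_horizon=28)
for block in groups:
    print(block[0], "to", block[-1])
```

Under this reading, the settings above would give four blocks per store, covering days 1 to 7, 8 to 14, 15 to 21, and 22 to 28.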
PySpark DataFrame with Distributed Results#
trained_models = forecast_flow.train(df_train).localCheckpoint()
trained_models.show()
+-----+--------------------+--------------------+--------------------+--------------------+---------------+
|group| forecast_horizon| model| start_time| end_time|elapsed_seconds|
+-----+--------------------+--------------------+--------------------+--------------------+---------------+
| CA_2|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 3.8|
| CA_3|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 3.2|
| WI_2|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 3.2|
| WI_3|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 2.9|
| CA_1|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 4.3|
| CA_4|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 3.5|
| TX_1|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 3.2|
| TX_3|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 3.0|
| WI_1|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 2.0|
| TX_2|[[1, 2, 3, 4, 5, ...|[clightgbm.skle...|01-May-2023 (03:2...|01-May-2023 (03:2...| 3.8|
+-----+--------------------+--------------------+--------------------+--------------------+---------------+
PySpark DataFrame with Local Results#
forecast_flow.train(df_train, local_result=True)
forecast_flow.model_
| | group | forecast_horizon | model | start_time | end_time | elapsed_seconds |
|---|---|---|---|---|---|---|
| 0 | CA_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:31) | 01-May-2023 (03:22:38) | 6.6 |
| 1 | CA_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:38) | 01-May-2023 (03:22:42) | 3.6 |
| 2 | WI_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:42) | 01-May-2023 (03:22:47) | 5.1 |
| 3 | WI_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:47) | 01-May-2023 (03:22:51) | 3.2 |
| 4 | CA_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:30) | 01-May-2023 (03:22:37) | 7.5 |
| 5 | CA_4 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:38) | 01-May-2023 (03:22:41) | 3.8 |
| 6 | TX_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:42) | 01-May-2023 (03:22:47) | 5.3 |
| 7 | TX_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:48) | 01-May-2023 (03:22:51) | 3.4 |
| 8 | WI_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:51) | 01-May-2023 (03:22:54) | 2.4 |
| 9 | TX_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:22:28) | 01-May-2023 (03:22:33) | 4.7 |
Pandas DataFrame#
forecast_flow.train(df_train.toPandas(), spark=spark)
forecast_flow.model_
| | group | forecast_horizon | model | start_time | end_time | elapsed_seconds |
|---|---|---|---|---|---|---|
| 0 | CA_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:16) | 01-May-2023 (03:23:21) | 4.4 |
| 1 | CA_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:21) | 01-May-2023 (03:23:25) | 3.4 |
| 2 | WI_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:25) | 01-May-2023 (03:23:28) | 3.0 |
| 3 | WI_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:28) | 01-May-2023 (03:23:32) | 3.3 |
| 4 | CA_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:14) | 01-May-2023 (03:23:20) | 5.8 |
| 5 | CA_4 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:21) | 01-May-2023 (03:23:24) | 3.3 |
| 6 | TX_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:24) | 01-May-2023 (03:23:28) | 3.4 |
| 7 | TX_3 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:28) | 01-May-2023 (03:23:32) | 3.4 |
| 8 | WI_1 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:32) | 01-May-2023 (03:23:34) | 2.2 |
| 9 | TX_2 | [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13,... | [clightgbm.sklearn\nLGBMRegressor\nq)q}q... | 01-May-2023 (03:23:12) | 01-May-2023 (03:23:17) | 5.0 |
Prediction#
PySpark DataFrame#
forecast = forecast_flow.predict(df_test, trained_models).localCheckpoint()
forecast.show(10)
+-----+--------------------+----------+----------+
|group| id| date|prediction|
+-----+--------------------+----------+----------+
| CA_2|FOODS_1_179_CA_2_...|2016-04-25| 0.481568|
| CA_2|FOODS_1_179_CA_2_...|2016-04-26|0.46724537|
| CA_2|FOODS_1_179_CA_2_...|2016-04-27|0.41596597|
| CA_2|FOODS_1_179_CA_2_...|2016-04-28|0.40775877|
| CA_2|FOODS_1_179_CA_2_...|2016-04-29|0.43439913|
| CA_2|FOODS_1_179_CA_2_...|2016-04-30| 0.4951446|
| CA_2|FOODS_1_179_CA_2_...|2016-05-01| 0.4308696|
| CA_2|FOODS_1_192_CA_2_...|2016-04-25| 0.2172628|
| CA_2|FOODS_1_192_CA_2_...|2016-04-26| 0.1687214|
| CA_2|FOODS_1_192_CA_2_...|2016-04-27| 0.1687214|
+-----+--------------------+----------+----------+
only showing top 10 rows
df_test.show()
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------+-----+----+
| id| item_id|dept_id|cat_id|store_id|state_id| date|sales|lag_7|lag_14|lag_21|lag_28|lag_35|lag_42|lag_49|lag_56|window_7_lag_7_mean|window_14_lag_7_mean|window_30_lag_7_mean|window_7_lag_14_mean|window_14_lag_14_mean|window_30_lag_14_mean|window_7_lag_21_mean|window_14_lag_21_mean|window_30_lag_21_mean|window_7_lag_28_mean|window_14_lag_28_mean|window_30_lag_28_mean|count_consecutive_value_lag_7|count_consecutive_value_lag_14|count_consecutive_value_lag_21|count_consecutive_value_lag_28|history_length|day_of_month|day_of_week|week_of_year|quarter|month|year|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------+-----+----+
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-04-25| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 1.0| 0.0| 0.0| 1.0| 0.5714285714285714| 0.8| 0.14285714285714285| 0.7857142857142857| 0.8| 1.4285714285714286| 0.7142857142857143| 0.9666666666666667| 0.0| 0.5714285714285714| 0.7333333333333333| 1| 5| 0| 8| 1912| 25| 2| 17| 2| 4|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-04-26| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 4.0| 0.0| 1.0| 0.5714285714285714| 0.6| 0.14285714285714285| 0.7857142857142857| 0.6666666666666666| 1.4285714285714286| 0.7142857142857143| 0.9666666666666667| 0.0| 0.5714285714285714| 0.6333333333333333| 2| 6| 1| 9| 1913| 26| 3| 17| 2| 4|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-04-27| 0.0| 2.0| 0.0| 1.0| 0.0| 0.0| 1.0| 2.0| 0.0| 1.2857142857142858| 0.6428571428571429| 0.6666666666666666| 0.0| 0.7857142857142857| 0.6333333333333333| 1.5714285714285714| 0.7857142857142857| 1.0| 0.0| 0.5| 0.6333333333333333| 0| 7| 0| 10| 1914| 27| 4| 17| 2| 4|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-04-28| 1.0| 0.0| 4.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.7142857142857143| 0.6428571428571429| 0.6666666666666666| 0.5714285714285714| 1.0714285714285714| 0.7666666666666667| 1.5714285714285714| 0.7857142857142857| 0.8666666666666667| 0.0| 0.42857142857142855| 0.6333333333333333| 1| 0| 1| 11| 1915| 28| 5| 17| 2| 4|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-04-29| 0.0| 0.0| 0.0| 0.0| 4.0| 0.0| 0.0| 0.0| 0.0| 0.7142857142857143| 0.6428571428571429| 0.6666666666666666| 0.5714285714285714| 0.7857142857142857| 0.7333333333333333| 1.0| 0.7857142857142857| 0.8| 0.5714285714285714| 0.7142857142857143| 0.7666666666666667| 2| 1| 2| 0| 1916| 29| 6| 17| 2| 4|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-04-30| 1.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 1.0| 0.7857142857142857| 0.7333333333333333| 0.5714285714285714| 0.7857142857142857| 0.7| 1.0| 0.7857142857142857| 0.8| 0.5714285714285714| 0.7142857142857143| 0.7666666666666667| 0| 2| 3| 1| 1917| 30| 7| 17| 2| 4|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-01| 4.0| 0.0| 3.0| 0.0| 5.0| 0.0| 6.0| 4.0| 0.0| 0.5714285714285714| 0.7857142857142857| 0.7333333333333333| 1.0| 0.6428571428571429| 0.8| 0.2857142857142857| 0.7857142857142857| 0.8| 1.2857142857142858| 0.6428571428571429| 0.9333333333333333| 1| 0| 4| 0| 1918| 1| 1| 17| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-02| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 1.0| 0.0| 0.5714285714285714| 0.7857142857142857| 0.7333333333333333| 1.0| 0.5714285714285714| 0.8| 0.14285714285714285| 0.7857142857142857| 0.8| 1.4285714285714286| 0.7142857142857143| 0.9666666666666667| 2| 1| 5| 0| 1919| 2| 2| 18| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-03| 0.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 4.0| 0.8571428571428571| 0.9285714285714286| 0.8| 1.0| 0.5714285714285714| 0.6| 0.14285714285714285| 0.7857142857142857| 0.6666666666666666| 1.4285714285714286| 0.7142857142857143| 0.9666666666666667| 0| 2| 6| 1| 1920| 3| 3| 18| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-04| 0.0| 0.0| 2.0| 0.0| 1.0| 0.0| 0.0| 1.0| 2.0| 0.5714285714285714| 0.9285714285714286| 0.8| 1.2857142857142858| 0.6428571428571429| 0.6666666666666666| 0.0| 0.7857142857142857| 0.6333333333333333| 1.5714285714285714| 0.7857142857142857| 1.0| 1| 0| 7| 0| 1921| 4| 4| 18| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-05| 0.0| 1.0| 0.0| 4.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.7142857142857143| 0.7142857142857143| 0.8333333333333334| 0.7142857142857143| 0.6428571428571429| 0.6666666666666666| 0.5714285714285714| 1.0714285714285714| 0.7666666666666667| 1.5714285714285714| 0.7857142857142857| 0.8666666666666667| 0| 1| 0| 1| 1922| 5| 5| 18| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-06| 4.0| 0.0| 0.0| 0.0| 0.0| 4.0| 0.0| 0.0| 0.0| 0.7142857142857143| 0.7142857142857143| 0.8333333333333334| 0.7142857142857143| 0.6428571428571429| 0.6666666666666666| 0.5714285714285714| 0.7857142857142857| 0.7333333333333333| 1.0| 0.7857142857142857| 0.8| 1| 2| 1| 2| 1923| 6| 6| 18| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-07| 0.0| 1.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.5714285714285714| 0.7857142857142857| 0.8666666666666667| 1.0| 0.7857142857142857| 0.7333333333333333| 0.5714285714285714| 0.7857142857142857| 0.7| 1.0| 0.7857142857142857| 0.8| 0| 0| 2| 3| 1924| 7| 7| 18| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-08| 0.0| 4.0| 0.0| 3.0| 0.0| 5.0| 0.0| 6.0| 4.0| 1.1428571428571428| 0.8571428571428571| 0.8666666666666667| 0.5714285714285714| 0.7857142857142857| 0.7333333333333333| 1.0| 0.6428571428571429| 0.8| 0.2857142857142857| 0.7857142857142857| 0.8| 0| 1| 0| 4| 1925| 8| 1| 18| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-09| 1.0| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 1.0| 1.1428571428571428| 0.8571428571428571| 0.8666666666666667| 0.5714285714285714| 0.7857142857142857| 0.7333333333333333| 1.0| 0.5714285714285714| 0.8| 0.14285714285714285| 0.7857142857142857| 0.8| 1| 2| 1| 5| 1926| 9| 2| 19| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-10| 0.0| 0.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.0| 0.8571428571428571| 0.8571428571428571| 0.7| 0.8571428571428571| 0.9285714285714286| 0.8| 1.0| 0.5714285714285714| 0.6| 0.14285714285714285| 0.7857142857142857| 0.6666666666666666| 2| 0| 2| 6| 1927| 10| 3| 19| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-11| 0.0| 0.0| 0.0| 2.0| 0.0| 1.0| 0.0| 0.0| 1.0| 0.8571428571428571| 0.7142857142857143| 0.6666666666666666| 0.5714285714285714| 0.9285714285714286| 0.8| 1.2857142857142858| 0.6428571428571429| 0.6666666666666666| 0.0| 0.7857142857142857| 0.6333333333333333| 3| 1| 0| 7| 1928| 11| 4| 19| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-12| 1.0| 0.0| 1.0| 0.0| 4.0| 0.0| 0.0| 0.0| 1.0| 0.7142857142857143| 0.7142857142857143| 0.6666666666666666| 0.7142857142857143| 0.7142857142857143| 0.8333333333333334| 0.7142857142857143| 0.6428571428571429| 0.6666666666666666| 0.5714285714285714| 1.0714285714285714| 0.7666666666666667| 4| 0| 1| 0| 1929| 12| 5| 19| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-13| 0.0| 4.0| 0.0| 0.0| 0.0| 0.0| 4.0| 0.0| 0.0| 1.2857142857142858| 1.0| 0.7666666666666667| 0.7142857142857143| 0.7142857142857143| 0.8333333333333334| 0.7142857142857143| 0.6428571428571429| 0.6666666666666666| 0.5714285714285714| 0.7857142857142857| 0.7333333333333333| 0| 1| 2| 1| 1930| 13| 6| 19| 2| 5|2016|
|FOODS_1_011_WI_2_...|FOODS_1_011|FOODS_1| FOODS| WI_2| WI|2016-05-14| 0.0| 0.0| 1.0| 2.0| 0.0| 0.0| 0.0| 0.0| 0.0| 1.1428571428571428| 0.8571428571428571| 0.7666666666666667| 0.5714285714285714| 0.7857142857142857| 0.8666666666666667| 1.0| 0.7857142857142857| 0.7333333333333333| 0.5714285714285714| 0.7857142857142857| 0.7| 1| 0| 0| 2| 1931| 14| 7| 19| 2| 5|2016|
+--------------------+-----------+-------+------+--------+--------+----------+-----+-----+------+------+------+------+------+------+------+-------------------+--------------------+--------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+--------------------+---------------------+---------------------+-----------------------------+------------------------------+------------------------------+------------------------------+--------------+------------+-----------+------------+-------+-----+----+
only showing top 20 rows
Pandas DataFrame#
forecast_flow.predict(df_test.toPandas(), spark=spark)
| | group | id | date | prediction |
|---|---|---|---|---|
| 0 | CA_2 | FOODS_1_179_CA_2_evaluation | 2016-04-25 | 0.481568 |
| 1 | CA_2 | FOODS_1_179_CA_2_evaluation | 2016-04-26 | 0.467245 |
| 2 | CA_2 | FOODS_1_179_CA_2_evaluation | 2016-04-27 | 0.415966 |
| 3 | CA_2 | FOODS_1_179_CA_2_evaluation | 2016-04-28 | 0.407759 |
| 4 | CA_2 | FOODS_1_179_CA_2_evaluation | 2016-04-29 | 0.434399 |
| ... | ... | ... | ... | ... |
| 26427 | TX_2 | HOUSEHOLD_2_481_TX_2_evaluation | 2016-05-18 | 0.215980 |
| 26428 | TX_2 | HOUSEHOLD_2_481_TX_2_evaluation | 2016-05-19 | 0.215980 |
| 26429 | TX_2 | HOUSEHOLD_2_481_TX_2_evaluation | 2016-05-20 | 0.222249 |
| 26430 | TX_2 | HOUSEHOLD_2_481_TX_2_evaluation | 2016-05-21 | 0.334569 |
| 26431 | TX_2 | HOUSEHOLD_2_481_TX_2_evaluation | 2016-05-22 | 0.313987 |
26432 rows × 4 columns
Visualize Predictions#
past_future = (
    df.select("id", "store_id", "date", "sales")
    .join(forecast, on=["id", "date"], how="left")
    .groupBy("store_id", "date")
    .agg(
        F.sum("sales").alias("sales"),
        F.sum("prediction").alias("prediction"),
    )
    .orderBy("store_id", "date")
    .toPandas()
)
pio.renderers.default = "notebook"
fig = px.line(
    past_future,
    x="date",
    y=["sales", "prediction"],
    facet_row_spacing=0.04,
    facet_col="store_id",
    facet_col_wrap=2,
    height=1000,
    width=720,
)
fig.update_layout(
    legend=dict(orientation="h", yanchor="top", y=1.07, xanchor="center", x=0.5),
    margin=dict(l=0, r=10, t=5, b=5),
    legend_title="",
)
fig.update_traces(line=dict(width=1.7))
fig.update_yaxes(matches=None, title="")
fig.update_xaxes(type="date", range=["2015-11-01", "2016-05-22"])
Backtesting#
cv_forecast = forecast_flow.cross_validate(df_train).localCheckpoint()
cv_forecast.show(10)
+-----+--------------------+----------+---+------+-----------+
|group| id| date| cv|target| prediction|
+-----+--------------------+----------+---+------+-----------+
| CA_2|FOODS_1_179_CA_2_...|2016-03-28| 0| 0.0| 0.44766802|
| CA_2|FOODS_1_179_CA_2_...|2016-03-29| 0| 0.0| 0.43386874|
| CA_2|FOODS_1_179_CA_2_...|2016-03-30| 0| 0.0| 0.40635538|
| CA_2|FOODS_1_179_CA_2_...|2016-03-31| 0| 1.0| 0.3618364|
| CA_2|FOODS_1_179_CA_2_...|2016-04-01| 0| 0.0| 0.40051356|
| CA_2|FOODS_1_179_CA_2_...|2016-04-02| 0| 1.0| 0.42851403|
| CA_2|FOODS_1_179_CA_2_...|2016-04-03| 0| 0.0| 0.40656742|
| CA_2|FOODS_1_192_CA_2_...|2016-03-28| 0| 0.0| 0.13468084|
| CA_2|FOODS_1_192_CA_2_...|2016-03-29| 0| 0.0|0.103752814|
| CA_2|FOODS_1_192_CA_2_...|2016-03-30| 0| 2.0|0.103752814|
+-----+--------------------+----------+---+------+-----------+
only showing top 10 rows
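Since the cross-validation result carries both `target` and `prediction` for every fold, an error metric per store and fold can be computed directly from it. The sketch below does this in pandas on a toy frame with the same `group`, `cv`, `target`, and `prediction` columns; the values are made up for illustration, and mean absolute error is just one possible choice of metric.

```python
import pandas as pd

# Toy rows shaped like cv_forecast; the numbers are illustrative only.
cv_pdf = pd.DataFrame(
    {
        "group": ["CA_2", "CA_2", "CA_2", "CA_2", "TX_1", "TX_1"],
        "cv": [0, 0, 1, 1, 0, 0],
        "target": [0.0, 1.0, 2.0, 0.0, 1.0, 3.0],
        "prediction": [0.5, 0.5, 1.0, 1.0, 1.0, 2.0],
    }
)

# Mean absolute error for each (store, fold) pair.
errors = (
    cv_pdf.assign(abs_error=(cv_pdf["target"] - cv_pdf["prediction"]).abs())
    .groupby(["group", "cv"], as_index=False)["abs_error"]
    .mean()
    .rename(columns={"abs_error": "mae"})
)
print(errors)
```

With real results, `cv_forecast.toPandas()` could replace the toy frame when it fits into driver memory; otherwise the same aggregation can be expressed with `groupBy` and `F.avg(F.abs(...))` in Spark.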
Visualize Cross Validation#
cv_forecast = (
    df_train.select("id", "store_id", "date", "sales")
    .join(
        cv_forecast.select("id", "date", "cv", "prediction"),
        on=["id", "date"],
        how="left",
    )
    .groupBy("id", "store_id", "date", "sales")
    .pivot("cv")
    .sum("prediction")
    .groupBy("store_id", "date")
    .agg(
        F.sum("sales").alias("sales"),
        *[F.sum(f"{i}").alias(f"cv_{i}") for i in range(3)],
    )
    .orderBy("store_id", "date")
).toPandas()
pio.renderers.default = "notebook"
fig = px.line(
    cv_forecast,
    x="date",
    y=["sales", *[f"cv_{i}" for i in range(3)]],
    facet_row_spacing=0.04,
    facet_col="store_id",
    facet_col_wrap=2,
    height=1000,
    width=720,
)
fig.update_layout(
    legend=dict(orientation="h", yanchor="top", y=1.07, xanchor="center", x=0.5),
    margin=dict(l=0, r=10, t=5, b=5),
    legend_title="",
)
fig.update_traces(line=dict(width=1.7))
fig.update_yaxes(matches=None, title="")
fig.update_xaxes(type="date", range=["2015-11-01", "2016-04-24"])