forecastflowml.FeatureExtractor

class forecastflowml.FeatureExtractor(id_col, date_col, target_col, lag_window_features=None, date_features=None, count_consecutive_values=None, history_length=False)

Extract features from time series

Parameters:

id_col (str) – Id column name.
date_col (str) – Date column name.
target_col (str) – Target column name.

lag_window_features (Optional[Dict[str, List[Union[int, List[int]]]]]) –

Dictionary with function names as keys and their corresponding lag-window arguments as values. The lag specifies how many time units before the current time stamp the window ends, while the window specifies how many observations the function is applied across.

For the lag function, only a list of integers needs to be provided.
For all other functions, a list of [window, lag] pairs (a list of lists) needs to be provided.

function	example
lag	{"lag": [1, 2, 3, 4]}
mean	{"mean": [[window, lag] for lag in [1, 2, 3] for window in [7, 14]]}
stddev	{"stddev": [[window, lag] for lag in [1, 2, 3] for window in [7, 14]]}
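
As a minimal sketch, the dictionary can be assembled with plain Python list comprehensions before it is passed as lag_window_features. The variable names lags and window_lag_pairs are illustrative only and are not part of the library.

```python
# Lags for the plain "lag" function: a flat list of integers.
lags = [1, 2, 3, 4]

# Every other function takes [window, lag] pairs.
# The outer loop runs over lags, the inner loop over windows.
window_lag_pairs = [[window, lag] for lag in [1, 2, 3] for window in [7, 14]]

lag_window_features = {
    "lag": lags,
    "mean": window_lag_pairs,
    "stddev": window_lag_pairs,
}

print(lag_window_features["mean"])
```

The comprehension expands to six pairs, one for each combination of the three lags and two windows.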

The logic of the code is represented visually using symbols:

o: denotes the time stamp for which the window is summarized.
*: represents the time stamps within the window being summarized.
-: denotes observations, past or future, that are not part of the window.

lag	window	calculation
1	3	[- - - - - * * * o - - - -]
2	3	[- - - - * * * - o - - - -]
1	5	[- - - * * * * * o - - - -]

Keys need to be native PySpark functions.
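
The lag-window diagram above can be reproduced with a small pure-Python sketch. It is not the library's PySpark implementation: the helper lag_window_mean is hypothetical, and returning None while a full window of history is unavailable is an assumption about edge handling.

```python
from statistics import mean
from typing import List, Optional


def lag_window_mean(values: List[float], window: int, lag: int) -> List[Optional[float]]:
    """Mean over `window` observations ending `lag` steps before each position."""
    result: List[Optional[float]] = []
    for t in range(len(values)):
        end = t - lag + 1      # exclusive end of the window
        start = end - window   # inclusive start of the window
        if start < 0:
            # Assumption: emit None until a full window of history exists.
            result.append(None)
        else:
            result.append(mean(values[start:end]))
    return result


series = [float(v) for v in range(13)]  # 0.0, 1.0, ..., 12.0

# Position 8 plays the role of "o" in the diagram.
print(lag_window_mean(series, window=3, lag=1)[8])  # averages positions 5, 6, 7
print(lag_window_mean(series, window=3, lag=2)[8])  # averages positions 4, 5, 6
print(lag_window_mean(series, window=5, lag=1)[8])  # averages positions 3 to 7
```

Because the series values equal their positions, each printed mean identifies the positions marked with * in the corresponding row of the diagram.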

date_features (List[str]) – Date features to extract: day_of_week, day_of_year, day_of_month, week_of_year, week_of_month, weekend, month, quarter, year.
count_consecutive_values (Optional[Dict[str, List[Union[int, List[int]]]]]) –
Counts consecutive appearances of a specific value. Needs to be a dictionary that contains the value to count and the lags that specify how many units in the past the counting should start.
- Example: count_consecutive_values={"value": 0, "lags": [7, 14, 21, 28]}
history_length (bool) – Whether to count the number of time periods since the start of the time series.
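
To illustrate what count_consecutive_values describes, the following pure-Python sketch counts the run of a given value that ends lag steps in the past. The helper count_consecutive is hypothetical, and both the backward-looking interpretation and the None output for positions without enough history are assumptions, not confirmed library behavior.

```python
from typing import List, Optional


def count_consecutive(values: List[int], value: int, lag: int) -> List[Optional[int]]:
    """Length of the run of `value` that ends `lag` steps before each position."""
    result: List[Optional[int]] = []
    for t in range(len(values)):
        pos = t - lag
        if pos < 0:
            # Assumption: no output until `lag` observations exist.
            result.append(None)
            continue
        run = 0
        # Walk backwards while the observations keep matching `value`.
        while pos >= 0 and values[pos] == value:
            run += 1
            pos -= 1
        result.append(run)
    return result


sales = [5, 0, 0, 0, 3, 0, 0]
print(count_consecutive(sales, value=0, lag=1))
```

With lag=1, each output describes the streak of zeros ending at the previous observation, so the current value never leaks into its own feature.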

Methods

transform(df[, spark])

Extract features