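Intermittent Data

Intermittent or sparse data has very few non-zero observations. This type of data is hard to forecast because the zero values increase the uncertainty about the underlying patterns in the data. Furthermore, once a non-zero observation occurs, its size can vary considerably. Intermittent time series are common in many industries, including finance, retail, transportation, and energy. Given the ubiquity of this type of series, special methods have been developed to forecast them. The first was from Croston (1972), followed by several variants and by different aggregation frameworks.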
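To make the idea behind Croston (1972) concrete, here is a minimal sketch of the classic method in plain Python. It smooths the non-zero demand sizes and the intervals between them separately, then forecasts their ratio. The function name `croston_forecast`, the default `alpha=0.1`, and the initialization from the first non-zero observation are illustrative assumptions; this is not the NeuralForecast approach used in the rest of this tutorial.

```python
def croston_forecast(y, alpha=0.1):
    """Return the per-period demand rate from classic Croston smoothing."""
    size = None                # smoothed size of non-zero demands
    interval = None            # smoothed number of periods between demands
    periods_since_demand = 1
    for value in y:
        if value > 0:
            if size is None:
                # Initialize from the first non-zero observation (an assumption)
                size = float(value)
                interval = float(periods_since_demand)
            else:
                size += alpha * (value - size)
                interval += alpha * (periods_since_demand - interval)
            periods_since_demand = 1
        else:
            periods_since_demand += 1
    if size is None:
        return 0.0             # no demand observed at all
    return size / interval


demand = [0, 0, 3, 0, 0, 3, 0, 0, 3]
print(croston_forecast(demand))  # 1.0
```

In the toy series a demand of 3 units arrives every 3 periods, so the forecast rate is 1 unit per period.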

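The models of NeuralForecast can be trained to model sparse or intermittent time series using a Poisson distribution loss. By the end of this tutorial, you'll have a good understanding of these models and how to use them.

Outline

Install libraries
Load and explore the data
Train models for intermittent data
Perform cross-validation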

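Tip

You can use Colab to run this notebook interactively.

Warning

To reduce the computation time, it is recommended to use a GPU. When using Colab, make sure to activate it: go to Runtime > Change runtime type and select GPU as the hardware accelerator.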

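1. Install libraries

We assume that you have NeuralForecast already installed. If not, check the installation guide for instructions on how to install NeuralForecast.

Install the necessary packages with pip.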

!pip install statsforecast s3fs fastparquet neuralforecast

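2. Load and explore the data

This example uses a subset of the M5 Competition dataset. Each time series represents the unit sales of a particular product in a given Walmart store. At this level (product-store), most of the data is intermittent. We first need to import the data.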

import pandas as pd
from utilsforecast.plotting import plot_series

Y_df = pd.read_parquet('https://m5-benchmarks.s3.amazonaws.com/data/train/target.parquet')
Y_df = Y_df.rename(columns={
    'item_id': 'unique_id', 
    'timestamp': 'ds', 
    'demand': 'y'
})
Y_df['ds'] = pd.to_datetime(Y_df['ds'])

For simplicity, we will keep just one category.

Y_df = Y_df.query('unique_id.str.startswith("FOODS_3")')
Y_df['unique_id'] = Y_df['unique_id'].astype(str)
Y_df = Y_df.reset_index(drop=True)
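Before modeling, it can help to quantify how intermittent a series is. The sketch below computes two simple statistics on a plain list of counts: the share of zero observations and the average inter-demand interval (ADI), taken here as the number of periods divided by the number of non-zero observations, which is one common definition. The helper name `intermittency_stats` and the toy series are hypothetical.

```python
def intermittency_stats(y):
    """Return (share of zeros, average inter-demand interval) for a series."""
    n = len(y)
    nonzero = sum(1 for value in y if value != 0)
    zero_share = (n - nonzero) / n if n else 0.0
    # ADI: periods per non-zero observation; infinite if demand never occurs
    adi = n / nonzero if nonzero else float("inf")
    return zero_share, adi


zero_share, adi = intermittency_stats([0, 0, 1, 0, 3, 0, 0, 0, 2, 0])
print(f"zero share: {zero_share:.2f}, ADI: {adi:.2f}")
```

A higher zero share and a larger ADI both indicate a sparser series.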

Plot some series using the plot_series function from utilsforecast. It plots up to 8 randomly selected series from the dataset and is useful for basic EDA.

plot_series(Y_df)

3. Train models for intermittent data

from ray import tune

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoNHITS, AutoTFT
from neuralforecast.losses.pytorch import DistributionLoss

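Each Auto model contains a default search space that was extensively tested on multiple large-scale datasets. Additionally, users can define specific search spaces tailored for particular datasets and tasks.

First, we create custom search spaces for the AutoNHITS and AutoTFT models. A search space is specified with a dictionary whose keys correspond to the model's hyperparameters and whose values are Tune functions that specify how each hyperparameter is sampled. For example, use randint to sample integers uniformly and choice to sample values from a list.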

config_nhits = {
    "input_size": tune.choice([28, 28*2, 28*3, 28*5]),              # Length of input window
    "n_blocks": 5*[1],                                              # Number of blocks in each stack
    "mlp_units": 5 * [[512, 512]],                                  # Hidden layer sizes of the MLPs in each stack
    "n_pool_kernel_size": tune.choice([5*[1], 5*[2], 5*[4],         
                                      [8, 4, 2, 1, 1]]),            # MaxPooling Kernel size
    "n_freq_downsample": tune.choice([[8, 4, 2, 1, 1],
                                      [1, 1, 1, 1, 1]]),            # Interpolation expressivity ratios
    "learning_rate": tune.loguniform(1e-4, 1e-2),                   # Initial Learning rate
    "scaler_type": tune.choice([None]),                             # Scaler type
    "max_steps": tune.choice([1000]),                               # Max number of training iterations
    "batch_size": tune.choice([32, 64, 128, 256]),                  # Number of series in batch
    "windows_batch_size": tune.choice([128, 256, 512, 1024]),       # Number of windows in batch
    "random_seed": tune.randint(1, 20),                             # Random seed
}

config_tft = {
    "input_size": tune.choice([28, 28*2, 28*3]),                    # Length of input window
    "hidden_size": tune.choice([64, 128, 256]),                     # Size of embeddings and encoders
    "learning_rate": tune.loguniform(1e-4, 1e-2),                   # Initial learning rate
    "scaler_type": tune.choice([None]),                             # Scaler type
    "max_steps": tune.choice([500, 1000]),                          # Max number of training iterations
    "batch_size": tune.choice([32, 64, 128, 256]),                  # Number of series in batch
    "windows_batch_size": tune.choice([128, 256, 512, 1024]),       # Number of windows in batch
    "random_seed": tune.randint(1, 20),                             # Random seed
}
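To illustrate what the sampling functions in these search spaces do, here is a small sketch that mimics choice, randint, and loguniform with Python's standard random module. It is a stand-in for illustration only, not Ray Tune itself; the helper names (`sample_choice`, `sample_randint`, `sample_loguniform`, `sample_config`) are hypothetical, and the exclusive upper bound in `sample_randint` follows our reading of how tune.randint behaves.

```python
import math
import random


def sample_choice(rng, options):
    """Pick one element of the list uniformly at random."""
    return rng.choice(options)


def sample_randint(rng, lower, upper):
    """Uniform integer with lower inclusive and upper exclusive."""
    return rng.randrange(lower, upper)


def sample_loguniform(rng, low, high):
    """Uniform in log space, so each order of magnitude is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one configuration from a reduced version of the TFT search space."""
    return {
        "input_size": sample_choice(rng, [28, 28 * 2, 28 * 3]),
        "hidden_size": sample_choice(rng, [64, 128, 256]),
        "learning_rate": sample_loguniform(rng, 1e-4, 1e-2),
        "random_seed": sample_randint(rng, 1, 20),
    }


rng = random.Random(0)
for _ in range(3):
    print(sample_config(rng))
```

Each call to `sample_config` corresponds to one trial; num_samples controls how many such draws the tuner evaluates.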

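To instantiate an Auto model you need to define:

h: forecasting horizon.
loss: training and validation loss from neuralforecast.losses.pytorch.
config: hyperparameter search space. If None, the Auto class will use a pre-defined suggested hyperparameter space.
search_alg: search algorithm (from tune.search); the default is random search. Refer to https://docs.rayai.org.cn/en-latest/tune/api_docs/suggestion.html for more information on the different search algorithm options.
num_samples: number of configurations explored.

In this example we set the horizon h to 28, use the Poisson distribution loss (suitable for count data) for training and validation, and use the default search algorithm.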

nf = NeuralForecast(
    models=[
        AutoNHITS(h=28, config=config_nhits, loss=DistributionLoss(distribution='Poisson', level=[80, 90]), num_samples=5),
        AutoTFT(h=28, config=config_tft, loss=DistributionLoss(distribution='Poisson', level=[80, 90]), num_samples=2), 
    ],
    freq='D'
)
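As background on why a Poisson loss suits count data, the sketch below writes out the Poisson negative log-likelihood in plain Python and shows that, for a constant rate, it is minimized at the sample mean of the counts. This is a conceptual illustration with hypothetical helper names (`poisson_nll`, `mean_poisson_nll`) and toy data; it is not the DistributionLoss implementation.

```python
import math


def poisson_nll(y, rate):
    """Negative log-likelihood of the count y under Poisson(rate)."""
    # -log( rate**y * exp(-rate) / y! ) = rate - y*log(rate) + log(y!)
    return rate - y * math.log(rate) + math.lgamma(y + 1)


def mean_poisson_nll(ys, rate):
    """Average negative log-likelihood over a series of counts."""
    return sum(poisson_nll(y, rate) for y in ys) / len(ys)


counts = [0, 0, 1, 0, 3, 0, 0, 2]          # toy intermittent series
candidates = [0.25, 0.5, 0.75, 1.0, 1.5]   # candidate constant rates
best = min(candidates, key=lambda rate: mean_poisson_nll(counts, rate))
print(best)  # 0.75, the sample mean of counts
```

Because the likelihood is defined on non-negative integers, zeros are a natural outcome of the model rather than outliers.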

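Tip

The number of samples, num_samples, is a crucial parameter! Larger values will usually produce better results because more configurations of the search space are explored, but they increase training time. Larger search spaces usually require more samples. As a general rule, we recommend setting num_samples higher than 20.

Next, we use the NeuralForecast class to train the Auto models. In this step, each Auto model automatically performs hyperparameter tuning: it trains multiple models with different hyperparameters, produces forecasts on the validation set, and evaluates them. The best configuration is selected based on the error on the validation set. Only the best model is stored and used during inference.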

nf.fit(df=Y_df)
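The tuning loop described above can be sketched as a plain random search: draw num_samples configurations, score each one on a validation set, and keep the best. In the sketch below `validation_loss` is a toy stand-in for training and validating a model, and all names and candidate values are hypothetical; the Auto models delegate this work to Ray Tune.

```python
import random


def validation_loss(config):
    """Toy stand-in for 'train a model and score it on the validation set'."""
    return (config["learning_rate"] - 0.003) ** 2 + 0.001 * config["input_size"] / 28


def sample_fn(rng):
    """Draw one configuration from a small, made-up search space."""
    return {
        "learning_rate": rng.choice([1e-4, 1e-3, 3e-3, 1e-2]),
        "input_size": rng.choice([28, 56, 84]),
    }


def random_search(sample_fn, score_fn, num_samples, seed=0):
    """Evaluate num_samples random configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(num_samples):
        config = sample_fn(rng)
        loss = score_fn(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(sample_fn, validation_loss, num_samples=5)
print(best_config, best_loss)
```

With a fixed seed, raising num_samples can only keep or improve the best loss found, which is the intuition behind the recommendation to use more samples.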

Next, we use the predict method to forecast the next 28 days using the optimal hyperparameters.

fcst_df = nf.predict()

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Predicting: |          | 0/? [00:00<?, ?it/s]

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Predicting: |          | 0/? [00:00<?, ?it/s]

plot_series(Y_df, 
            fcst_df.drop(columns=["AutoNHITS-median", "AutoTFT-median"]), 
            max_insample_length=28*3, 
            level=[90])
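To build intuition for the prediction intervals in these plots, the sketch below computes the quantiles of a Poisson distribution with a small rate. For a 90% level it takes the 5th and 95th percentiles. The helper names are hypothetical, and the rate 0.55 is only an example of a low forecast mean; how the library derives its intervals internally is not covered in this tutorial and may differ.

```python
import math


def poisson_quantile(rate, q):
    """Smallest integer k with P(Y <= k) >= q for Y ~ Poisson(rate), 0 < q < 1."""
    k = 0
    pmf = math.exp(-rate)   # P(Y = 0)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= rate / k     # P(Y = k) from P(Y = k - 1)
        cdf += pmf
    return k


def poisson_interval(rate, level):
    """Central prediction interval (lo, hi) for a level given in percent."""
    alpha = (100 - level) / 100
    return poisson_quantile(rate, alpha / 2), poisson_quantile(rate, 1 - alpha / 2)


print(poisson_interval(0.55, 90))  # (0, 2)
print(poisson_interval(0.55, 80))  # (0, 2)
```

With such a small rate the lower bound is 0 and the bounds are integers, so the 80% and 90% intervals coincide.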

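4. Cross-validation

Time series cross-validation is a method for evaluating how a model would have performed in the past. It works by defining a sliding window across the historical data and predicting the period following it.

NeuralForecast has an implementation of time series cross-validation that is fast and easy to use.

The cross_validation method of the NeuralForecast class takes the following arguments.

df: training data frame.
step_size (int): step size between each window. In other words: how often do you want to run the forecasting process.
n_windows (int): number of windows used for cross-validation. In other words: how many forecasting processes in the past do you want to evaluate.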

nf = NeuralForecast(
    models=[
        AutoNHITS(h=28, config=config_nhits, loss=DistributionLoss(distribution='Poisson', level=[80, 90]), num_samples=5),
        AutoTFT(h=28, config=config_tft, loss=DistributionLoss(distribution='Poisson', level=[80, 90]), num_samples=2), 
    ],
    freq='D'
)

cv_df = nf.cross_validation(Y_df, n_windows=3, step_size=28)

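The cv_df object is a new data frame that includes the following columns:

unique_id: the ID of the time series.
ds: datestamp or temporal index.
cutoff: the last datestamp or temporal index of the training data for each window. If n_windows=1, there is one unique cutoff value; if n_windows=2, there are two unique cutoff values.
y: true value.
"model": columns with the model's name and its forecast values.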

cv_df.head()

	unique_id	ds	cutoff	AutoNHITS	AutoNHITS-hi-80	AutoNHITS-hi-90	AutoTFT	AutoTFT-median	AutoTFT-hi-80	AutoTFT-hi-90	y
0	FOODS_3_001_CA_1	2016-02-29	2016-02-28	0.550	2.0	2.0	0.775	1.0	2.0	2.0	0.0
1	FOODS_3_001_CA_1	2016-03-01	2016-02-28	0.611	2.0	2.0	0.746	1.0	2.0	2.0	1.0
2	FOODS_3_001_CA_1	2016-03-02	2016-02-28	0.567	2.0	2.0	0.750	1.0	2.0	2.0	1.0
3	FOODS_3_001_CA_1	2016-03-03	2016-02-28	0.554	2.0	2.0	0.750	1.0	2.0	2.0	0.0
4	FOODS_3_001_CA_1	2016-03-04	2016-02-28	0.627	2.0	2.0	0.788	1.0	2.0	3.0	0.0
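The cutoff dates follow directly from h, n_windows, and step_size. The sketch below reproduces them with the standard library, assuming daily data whose last observation falls on 2016-05-22; that end date is an assumption inferred from the cutoffs in this tutorial's output, and `cv_cutoffs` is a hypothetical helper, not part of NeuralForecast.

```python
from datetime import date, timedelta


def cv_cutoffs(last_date, h, n_windows, step_size):
    """Cutoff dates, oldest first, for windows whose last one ends at last_date."""
    cutoffs = []
    for i in range(n_windows):
        # The newest window ends at last_date, so its cutoff is h days earlier;
        # each older window is shifted back by another step_size days.
        offset = h + step_size * (n_windows - 1 - i)
        cutoffs.append(last_date - timedelta(days=offset))
    return cutoffs


for cutoff in cv_cutoffs(date(2016, 5, 22), h=28, n_windows=3, step_size=28):
    first = cutoff + timedelta(days=1)
    last = cutoff + timedelta(days=28)
    print(f"cutoff {cutoff}: forecasts from {first} to {last}")
```

Because step_size equals h here, the three forecast windows are adjacent and do not overlap.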

for cutoff in cv_df['cutoff'].unique():
    display(plot_series(
        Y_df,
        cv_df.query('cutoff == @cutoff').drop(columns=['cutoff', 'y', 'AutoNHITS-median', 'AutoTFT-median']),
        max_insample_length=28*4,
        ids=['FOODS_3_001_CA_1'],
        level=[90],
    ))

Evaluate

In this section we evaluate the performance of each model for each cross-validation window using the MSE and MAE metrics.

from utilsforecast.losses import mse, mae
from utilsforecast.evaluation import evaluate

metrics = pd.DataFrame()
for cutoff in cv_df["cutoff"].unique():
    metrics_per_cutoff = evaluate(cv_df.query("cutoff == @cutoff"),
                                metrics=[mse, mae],
                                models=['AutoNHITS', 'AutoTFT'],
                                level=[80, 90],
                                agg_fn="mean")
    metrics_per_cutoff = metrics_per_cutoff.assign(cutoff=cutoff)
    metrics = pd.concat([metrics, metrics_per_cutoff])

metrics

	metric	AutoNHITS	AutoTFT	cutoff
0	mse	10.059308	10.909020	2016-02-28
1	mae	1.485914	1.554572	2016-02-28
0	mse	9.590549	10.253903	2016-03-27
1	mae	1.494229	1.561868	2016-03-27
0	mse	9.596170	10.300666	2016-04-24
1	mae	1.501949	1.564157	2016-04-24
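For reference, both metrics are simple averages of pointwise errors. The sketch below computes them by hand on the first five AutoNHITS forecasts shown by cv_df.head() earlier; the function names are hypothetical and were chosen so they do not shadow the utilsforecast imports. Because only five rows are used, the numbers are not comparable to the table above, which aggregates over all series and the full horizon.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large misses heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; less sensitive to large misses."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# First five rows of cv_df for FOODS_3_001_CA_1 (cutoff 2016-02-28)
y = [0.0, 1.0, 1.0, 0.0, 0.0]
auto_nhits = [0.550, 0.611, 0.567, 0.554, 0.627]

print(round(mean_squared_error(y, auto_nhits), 4))   # 0.2683
print(round(mean_absolute_error(y, auto_nhits), 4))  # 0.5106
```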
