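Categorical variables

Categorical variables are external factors that can influence a forecast. Each one takes one of a limited, fixed set of possible values and groups the observations accordingly.

For example, if you are forecasting daily product demand for a retailer, you could benefit from an event variable that tells you what kind of event took place on a given date, such as "None", "Sporting", or "Cultural".

To incorporate categorical variables in TimeGPT, you need to pair each point in your time series data with the corresponding external data.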
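As a minimal sketch of what "pairing" means in practice, the snippet below joins a toy sales series with a toy event calendar on the series identifier and timestamp. The series name `item_A`, the dates, and all values are made up for illustration; only the column names `unique_id`, `ds`, and `y` follow the convention used in this tutorial.

```python
import pandas as pd

# Toy daily sales for one series (values are made up for illustration).
sales = pd.DataFrame({
    "unique_id": ["item_A"] * 4,
    "ds": pd.date_range("2024-01-01", periods=4, freq="D"),
    "y": [10.0, 12.0, 30.0, 11.0],
})

# Toy event calendar: one categorical label per series and timestamp.
events = pd.DataFrame({
    "unique_id": ["item_A"] * 4,
    "ds": pd.date_range("2024-01-01", periods=4, freq="D"),
    "event": ["None", "None", "Sporting", "None"],
})

# Pair every observation with its label. validate="one_to_one" raises
# if either frame has duplicate (unique_id, ds) keys.
paired = sales.merge(events, on=["unique_id", "ds"], how="left", validate="one_to_one")
print(paired)
```

Joining on both keys explicitly keeps the pairing unambiguous when several series share the same dates.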

1. Import packages

First, we install and import the required packages and initialize the Nixtla client.

import pandas as pd
import os

from nixtla import NixtlaClient
from datasetsforecast.m5 import M5

nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'   
)

👍 Use an Azure AI endpoint

To use an Azure AI endpoint, remember to also set the base_url argument:

nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")

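2. Load M5 data

Let's look at an example of forecasting product sales in the M5 dataset. The M5 dataset contains daily product demand (sales) for 10 retail stores in the US.

First, we load the data using datasetsforecast. This returns:

Y_df, containing the sales (y column) for each unique product (unique_id column) at every timestamp (ds column).
X_df, containing additional relevant information for each unique product (unique_id column) at every timestamp (ds column).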

Y_df, X_df, _ = M5.load(directory=os.getcwd())
Y_df['ds'] = pd.to_datetime(Y_df['ds'])
X_df['ds'] = pd.to_datetime(X_df['ds'])
Y_df.head(10)

	unique_id	ds	y
0	FOODS_1_001_CA_1	2011-01-29	3.0
1	FOODS_1_001_CA_1	2011-01-30	0.0
2	FOODS_1_001_CA_1	2011-01-31	0.0
3	FOODS_1_001_CA_1	2011-02-01	1.0
4	FOODS_1_001_CA_1	2011-02-02	4.0
5	FOODS_1_001_CA_1	2011-02-03	2.0
6	FOODS_1_001_CA_1	2011-02-04	0.0
7	FOODS_1_001_CA_1	2011-02-05	2.0
8	FOODS_1_001_CA_1	2011-02-06	0.0
9	FOODS_1_001_CA_1	2011-02-07	0.0

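For this example, we keep only the additional information in the event_type_1 column. This column is a categorical variable that indicates whether an important event that might affect product sales took place on a given date.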

X_df = X_df[['unique_id', 'ds', 'event_type_1']]

X_df.head(10)

	unique_id	ds	event_type_1
0	FOODS_1_001_CA_1	2011-01-29	nan
1	FOODS_1_001_CA_1	2011-01-30	nan
2	FOODS_1_001_CA_1	2011-01-31	nan
3	FOODS_1_001_CA_1	2011-02-01	nan
4	FOODS_1_001_CA_1	2011-02-02	nan
5	FOODS_1_001_CA_1	2011-02-03	nan
6	FOODS_1_001_CA_1	2011-02-04	nan
7	FOODS_1_001_CA_1	2011-02-05	nan
8	FOODS_1_001_CA_1	2011-02-06	Sporting
9	FOODS_1_001_CA_1	2011-02-07	nan

As you can see, a Sporting event took place on February 6, 2011.

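3. Forecast product demand using categorical variables

We will forecast demand for a single product only. We choose a high-selling food product identified by FOODS_3_090_CA_3.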

product = 'FOODS_3_090_CA_3'
Y_df_product = Y_df.query('unique_id == @product')
X_df_product = X_df.query('unique_id == @product')

We merge the two dataframes to create the dataset that will be used in TimeGPT.

df = Y_df_product.merge(X_df_product)

df.head(10)

	unique_id	ds	y	event_type_1
0	FOODS_3_090_CA_3	2011-01-29	108.0	nan
1	FOODS_3_090_CA_3	2011-01-30	132.0	nan
2	FOODS_3_090_CA_3	2011-01-31	102.0	nan
3	FOODS_3_090_CA_3	2011-02-01	120.0	nan
4	FOODS_3_090_CA_3	2011-02-02	106.0	nan
5	FOODS_3_090_CA_3	2011-02-03	123.0	nan
6	FOODS_3_090_CA_3	2011-02-04	279.0	nan
7	FOODS_3_090_CA_3	2011-02-05	175.0	nan
8	FOODS_3_090_CA_3	2011-02-06	186.0	Sporting
9	FOODS_3_090_CA_3	2011-02-07	120.0	nan

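To use categorical variables in TimeGPT, the variables must be encoded numerically. In this tutorial, we use one-hot encoding.

We can one-hot encode the event_type_1 column with the get_dummies function built into pandas. After encoding the event_type_1 variable, we add the encoded columns to the dataframe and drop the original column.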

event_type_1_ohe = pd.get_dummies(df['event_type_1'], dtype=int)
df = pd.concat([df, event_type_1_ohe], axis=1)
df = df.drop(columns = 'event_type_1')

df.tail(10)

	unique_id	ds	y	Sporting	nan
1959	FOODS_3_090_CA_3	2016-06-10	140.0	0	1
1960	FOODS_3_090_CA_3	2016-06-11	151.0	0	1
1961	FOODS_3_090_CA_3	2016-06-12	87.0	0	1
1962	FOODS_3_090_CA_3	2016-06-13	67.0	0	1
1963	FOODS_3_090_CA_3	2016-06-14	50.0	0	1
1964	FOODS_3_090_CA_3	2016-06-15	58.0	0	1
1965	FOODS_3_090_CA_3	2016-06-16	116.0	0	1
1966	FOODS_3_090_CA_3	2016-06-17	124.0	0	1
1967	FOODS_3_090_CA_3	2016-06-18	167.0	0	1
1968	FOODS_3_090_CA_3	2016-06-19	118.0	1	0

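As you can see, we have now added 5 columns, each holding a binary indicator (1 or 0) of whether a Cultural, National, Religious, Sporting, or no (nan) event took place that day. For example, a Sporting event took place on June 19, 2016.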
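In this tutorial the history and the future values are sliced from the same encoded dataframe, so their columns match automatically. If the two were encoded separately, a category that never occurs in one of them would silently produce a different set of columns. A minimal sketch of one way to guard against that, assuming the five category names listed above, is to pin the categories before calling get_dummies. The helper `one_hot_events` and the toy labels are hypothetical and not part of the tutorial's code.

```python
import pandas as pd

# Assumed category list, taken from the five columns named in the text.
EVENT_TYPES = ["Cultural", "National", "Religious", "Sporting", "nan"]

def one_hot_events(labels: pd.Series) -> pd.DataFrame:
    """One-hot encode labels with a fixed, ordered set of columns."""
    # A categorical dtype with explicit categories makes get_dummies emit
    # one column per category, even for categories absent from `labels`.
    fixed = labels.astype(pd.CategoricalDtype(categories=EVENT_TYPES))
    ohe = pd.get_dummies(fixed, dtype=int)
    ohe.columns = [str(c) for c in ohe.columns]  # plain string column names
    return ohe

# Toy labels (made up); "nan" is the string label for "no event".
history = pd.Series(["nan", "Sporting", "nan", "Cultural"])
future = pd.Series(["nan", "Sporting"])

history_ohe = one_hot_events(history)
future_ohe = one_hot_events(future)
print(future_ohe)
```

Both frames end up with the same five columns in the same order, regardless of which events happen to occur in each slice.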

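Next, we turn to the forecasting task. We will forecast the first 7 days of February 2016. This window includes February 7, 2016, the date Super Bowl 50 was held. Large national events like this typically affect retail product sales.

To use the encoded categorical variables in TimeGPT, we have to supply them as future values. We therefore create a future-values dataframe that contains the unique_id, the timestamp ds, and the encoded categorical variables.

Of course, we drop the target column, since it is normally not available: it is the quantity we are trying to forecast!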

future_ex_vars_df = df.drop(columns = ['y'])
future_ex_vars_df = future_ex_vars_df.query("ds >= '2016-02-01' & ds <= '2016-02-07'")

future_ex_vars_df.head(10)

	unique_id	ds	Sporting	nan
1829	FOODS_3_090_CA_3	2016-02-01	0	1
1830	FOODS_3_090_CA_3	2016-02-02	0	1
1831	FOODS_3_090_CA_3	2016-02-03	0	1
1832	FOODS_3_090_CA_3	2016-02-04	0	1
1833	FOODS_3_090_CA_3	2016-02-05	0	1
1834	FOODS_3_090_CA_3	2016-02-06	0	1
1835	FOODS_3_090_CA_3	2016-02-07	1	0

Next, we restrict the input dataframe to all data except the 7 forecast days:

df_train = df.query("ds < '2016-02-01'")

df_train.tail(10)

	unique_id	ds	y	nan
1819	FOODS_3_090_CA_3	2016-01-22	94.0	1
1820	FOODS_3_090_CA_3	2016-01-23	144.0	1
1821	FOODS_3_090_CA_3	2016-01-24	146.0	1
1822	FOODS_3_090_CA_3	2016-01-25	87.0	1
1823	FOODS_3_090_CA_3	2016-01-26	73.0	1
1824	FOODS_3_090_CA_3	2016-01-27	62.0	1
1825	FOODS_3_090_CA_3	2016-01-28	64.0	1
1826	FOODS_3_090_CA_3	2016-01-29	102.0	1
1827	FOODS_3_090_CA_3	2016-01-30	113.0	1
1828	FOODS_3_090_CA_3	2016-01-31	98.0	1
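Before calling forecast, it can help to confirm that the future-values frame lines up with the training frame: same exogenous columns, no target column, and exactly the h days that follow the last training timestamp. The sketch below is a hypothetical helper written for daily data, not part of the nixtla package, and the toy frames use a made-up series name.

```python
import pandas as pd

def check_future_frame(train: pd.DataFrame, future: pd.DataFrame, h: int) -> None:
    """Raise ValueError if `future` does not line up with `train` for an h-day forecast."""
    exog_cols = [c for c in train.columns if c not in ("unique_id", "ds", "y")]
    missing = [c for c in exog_cols if c not in future.columns]
    if missing:
        raise ValueError(f"future frame lacks columns: {missing}")
    if "y" in future.columns:
        raise ValueError("future frame must not contain the target column 'y'")
    for uid, grp in future.groupby("unique_id"):
        last_train = train.loc[train["unique_id"] == uid, "ds"].max()
        # The h consecutive days right after the end of the training data.
        expected = pd.date_range(last_train + pd.Timedelta(days=1), periods=h, freq="D")
        if list(grp["ds"].sort_values()) != list(expected):
            raise ValueError(f"{uid}: future dates do not cover the {h} days after {last_train.date()}")

# Toy frames shaped like df_train and future_ex_vars_df (series name is made up).
train = pd.DataFrame({
    "unique_id": ["item_A"] * 3,
    "ds": pd.date_range("2016-01-29", periods=3, freq="D"),
    "y": [102.0, 113.0, 98.0],
    "Sporting": [0, 0, 0],
    "nan": [1, 1, 1],
})
future = pd.DataFrame({
    "unique_id": ["item_A"] * 7,
    "ds": pd.date_range("2016-02-01", periods=7, freq="D"),
    "Sporting": [0, 0, 0, 0, 0, 0, 1],
    "nan": [1, 1, 1, 1, 1, 1, 0],
})

check_future_frame(train, future, h=7)  # passes silently when the frames line up
```

With the tutorial's own data, the equivalent call would use df_train, future_ex_vars_df, and h=7.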

We first call the forecast method without the categorical variables.

timegpt_fcst_without_cat_vars_df = nixtla_client.forecast(df=df_train, h=7, level=[80, 90])
timegpt_fcst_without_cat_vars_df.head()

INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Inferred freq: D
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...

	unique_id	ds	TimeGPT	TimeGPT-lo-90	TimeGPT-lo-80	TimeGPT-hi-80	TimeGPT-hi-90
0	FOODS_3_090_CA_3	2016-02-01	73.304092	53.449049	54.795078	91.813107	93.159136
1	FOODS_3_090_CA_3	2016-02-02	66.335518	47.510669	50.274136	82.396899	85.160367
2	FOODS_3_090_CA_3	2016-02-03	65.881630	36.218617	41.388896	90.374364	95.544643
3	FOODS_3_090_CA_3	2016-02-04	72.371864	-26.683115	25.097362	119.646367	171.426844
4	FOODS_3_090_CA_3	2016-02-05	95.141045	-2.084882	34.027078	156.255011	192.366971

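📘 Available models in Azure AI

If you are using an Azure AI endpoint, be sure to set model="azureai":

nixtla_client.forecast(..., model="azureai")

For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon.

By default, timegpt-1 is used. See this tutorial for when and how to use timegpt-1-long-horizon.

We plot the forecast together with the last 28 days of data before the forecast period: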

nixtla_client.plot(
    df[['unique_id', 'ds', 'y']].query("ds <= '2016-02-07'"), 
    timegpt_fcst_without_cat_vars_df, 
    max_insample_length=28, 
)

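TimeGPT already provides a reasonable forecast, but it appears to somewhat underestimate the peak on February 6, 2016, the day before the Super Bowl.

We call the forecast method again, this time with the categorical variables.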

timegpt_fcst_with_cat_vars_df = nixtla_client.forecast(df=df_train, X_df=future_ex_vars_df, h=7, level=[80, 90])
timegpt_fcst_with_cat_vars_df.head()

INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Inferred freq: D
INFO:nixtla.nixtla_client:Using the following exogenous variables: Cultural, National, Religious, Sporting, nan
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...

	unique_id	ds	TimeGPT	TimeGPT-lo-90	TimeGPT-lo-80	TimeGPT-hi-80	TimeGPT-hi-90
0	FOODS_3_090_CA_3	2016-02-01	70.661271	-0.204378	14.593348	126.729194	141.526919
1	FOODS_3_090_CA_3	2016-02-02	65.566941	-20.394326	11.654239	119.479643	151.528208
2	FOODS_3_090_CA_3	2016-02-03	68.510010	-33.713710	6.732952	130.287069	170.733731
3	FOODS_3_090_CA_3	2016-02-04	75.417710	-40.974649	4.751767	146.083653	191.810069
4	FOODS_3_090_CA_3	2016-02-05	97.340302	-57.385361	18.253812	176.426792	252.065965

We plot the forecast together with the last 28 days of data before the forecast period:

nixtla_client.plot(
    df[['unique_id', 'ds', 'y']].query("ds <= '2016-02-07'"), 
    timegpt_fcst_with_cat_vars_df, 
    max_insample_length=28, 
)

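We can visually verify that the forecast is closer to the actual observations, which is the result of including the categorical variable in the forecast.

We confirm this conclusion by computing the mean absolute error (MAE) of the two forecasts.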

from utilsforecast.losses import mae

# Create target dataframe
df_target = df[['unique_id', 'ds', 'y']].query("ds >= '2016-02-01' & ds <= '2016-02-07'")

# Rename forecast columns
timegpt_fcst_without_cat_vars_df = timegpt_fcst_without_cat_vars_df.rename(columns={'TimeGPT': 'TimeGPT-without-cat-vars'})
timegpt_fcst_with_cat_vars_df = timegpt_fcst_with_cat_vars_df.rename(columns={'TimeGPT': 'TimeGPT-with-cat-vars'})

# Merge forecasts with target dataframe
df_target = df_target.merge(timegpt_fcst_without_cat_vars_df[['unique_id', 'ds', 'TimeGPT-without-cat-vars']])
df_target = df_target.merge(timegpt_fcst_with_cat_vars_df[['unique_id', 'ds', 'TimeGPT-with-cat-vars']])

# Compute errors
mean_absolute_errors = mae(df_target, ['TimeGPT-without-cat-vars', 'TimeGPT-with-cat-vars'])

mean_absolute_errors

	unique_id	TimeGPT-without-cat-vars	TimeGPT-with-cat-vars
0	FOODS_3_090_CA_3	24.285649	20.028514

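Indeed, the error of TimeGPT with the categorical variables is about 17.5% lower than without them (20.03 versus 24.29), which indicates better performance for this product and forecast window when the categorical variables are included.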
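The relative improvement follows directly from the two MAE values in the table above. The sketch below reproduces that arithmetic and also shows, on made-up numbers, what the mean absolute error computes. The helper `mean_absolute_error` is written here for illustration and is not the utilsforecast function.

```python
import pandas as pd

def mean_absolute_error(actual: pd.Series, predicted: pd.Series) -> float:
    """Average of the absolute differences between actual and predicted values."""
    return float((actual - predicted).abs().mean())

# Toy check with made-up numbers: absolute errors are 10, 5, and 30.
toy_mae = mean_absolute_error(
    pd.Series([100.0, 120.0, 200.0]),
    pd.Series([90.0, 125.0, 170.0]),
)

# MAE values reported in the table above.
mae_without = 24.285649
mae_with = 20.028514

# Relative reduction in error from including the categorical variables.
relative_reduction = (mae_without - mae_with) / mae_without
print(f"toy MAE: {toy_mae}, relative reduction: {relative_reduction:.1%}")
```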
