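Our task is to forecast the next 7 days of daily visits to the website cienciadedatos.net.

In this tutorial, we will show:

  • How to load time series data to be used for forecasting with TimeGPT

  • How to create cross-validated forecasts with TimeGPT

This tutorial is adapted from "Forecasting web traffic with machine learning and Python" by Joaquín Amat Rodrigo and Javier Escobar Ortiz. We will show you:

  • how you can achieve forecasting results that are almost 10% better;

  • using significantly fewer lines of code;

  • in a fraction of the time needed to run the original tutorial.

1. Import packages

First, we import the required packages and initialize the Nixtla client.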

import pandas as pd
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'
)
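The comment in the snippet above says the client falls back to the NIXTLA_API_KEY environment variable when no key is passed. The sketch below illustrates that fallback pattern in plain Python so the key never has to be hard-coded. The helper name resolve_api_key and the placeholder key value are assumptions made for illustration; they are not part of the nixtla package.

```python
import os


def resolve_api_key(explicit_key=None):
    """Hypothetical helper: prefer an explicit key, else read NIXTLA_API_KEY."""
    key = explicit_key or os.environ.get("NIXTLA_API_KEY")
    if not key:
        raise ValueError("No API key found: pass one explicitly or set NIXTLA_API_KEY.")
    return key


# Simulate a key exported in the shell (placeholder value, not a real key).
os.environ["NIXTLA_API_KEY"] = "placeholder-key-from-environment"

api_key = resolve_api_key()
print(api_key)
```

The resolved value could then be passed as api_key when constructing the client, or the argument could simply be omitted so the client reads the environment variable itself.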

👍 Use an Azure AI endpoint

To use an Azure AI endpoint, remember to also set the base_url argument:

nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")

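2. Load data

We load the website visit data and put it in the format TimeGPT expects. In this case, we only need to add an identifier column for the time series, unique_id, whose value we set to daily_visits.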

url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/' +
       'master/data/visitas_por_dia_web_cienciadedatos.csv')
df = pd.read_csv(url, sep=',', parse_dates=[0], date_format='%d/%m/%y')
df['unique_id'] = 'daily_visits'

df.head(10)
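        date  users     unique_id
0 2020-07-01   2324  daily_visits
1 2020-07-02   2201  daily_visits
2 2020-07-03   2146  daily_visits
3 2020-07-04   1666  daily_visits
4 2020-07-05   1433  daily_visits
5 2020-07-06   2195  daily_visits
6 2020-07-07   2240  daily_visits
7 2020-07-08   2295  daily_visits
8 2020-07-09   2279  daily_visits
9 2020-07-10   2155  daily_visits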

That's it! No further preprocessing is necessary.
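To make the loading step more concrete, here is a small standard-library sketch of what it does: parse day/month/two-digit-year dates and attach a constant series identifier. The CSV snippet is made up for illustration and only assumes the layout implied by date_format='%d/%m/%y'; the real file may differ in detail.

```python
import csv
import io
from datetime import datetime

# Made-up CSV snippet in the assumed layout: day/month/two-digit year, then visits.
raw = "date,users\n01/07/20,2324\n02/07/20,2201\n03/07/20,2146\n"

rows = []
for record in csv.DictReader(io.StringIO(raw)):
    rows.append(
        {
            # '%d/%m/%y' reads 01/07/20 as 1 July 2020, not 7 January 2020.
            "date": datetime.strptime(record["date"], "%d/%m/%y").date(),
            "users": int(record["users"]),
            # A single series, so every row carries the same identifier.
            "unique_id": "daily_visits",
        }
    )

for row in rows:
    print(row["date"].isoformat(), row["users"], row["unique_id"])
```

Stating the date format explicitly matters here because a day-first date such as 01/07/20 is ambiguous without it.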

3. Cross-validation with TimeGPT

We can perform cross-validation on our data as follows:

timegpt_cv_df = nixtla_client.cross_validation(
    df, 
    h=7, 
    n_windows=8, 
    time_col='date', 
    target_col='users', 
    freq='D',
    level=[80, 90, 99.5]
)
timegpt_cv_df.head()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
      unique_id        date      cutoff  users      TimeGPT  TimeGPT-lo-99.5  TimeGPT-lo-90  TimeGPT-lo-80  TimeGPT-hi-80  TimeGPT-hi-90  TimeGPT-hi-99.5
0  daily_visits  2021-07-01  2021-06-30   3123  3310.908447      3041.925497    3048.363220    3082.721924    3539.094971    3573.453674      3579.891397
1  daily_visits  2021-07-02  2021-06-30   2870  3090.971680      2793.535905    2838.480298    2853.750488    3328.192871    3343.463062      3388.407455
2  daily_visits  2021-07-03  2021-06-30   2020  2346.991455      2043.731296    2150.005078    2171.187012    2522.795898    2543.977832      2650.251614
3  daily_visits  2021-07-04  2021-06-30   1828  2182.191895      1836.848173    1897.684900    1929.914575    2434.469214    2466.698889      2527.535616
4  daily_visits  2021-07-05  2021-06-30   2722  3082.715088      2736.008055    2746.997034    2791.375342    3374.054834    3418.433142      3429.422121
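One way to read the lo/hi columns is to check how often the actual value falls inside an interval. The sketch below does this for the 80% interval, using only the five rows printed above; the function name empirical_coverage is an illustrative choice, and five rows are far too few to judge calibration, so treat it as a reading aid rather than an evaluation.

```python
# (actual users, TimeGPT-lo-80, TimeGPT-hi-80), copied from the five rows above.
rows_80 = [
    (3123, 3082.721924, 3539.094971),
    (2870, 2853.750488, 3328.192871),
    (2020, 2171.187012, 2522.795898),
    (1828, 1929.914575, 2434.469214),
    (2722, 2791.375342, 3374.054834),
]


def empirical_coverage(rows):
    """Fraction of rows whose actual value lies within [lower, upper]."""
    inside = sum(1 for actual, lower, upper in rows if lower <= actual <= upper)
    return inside / len(rows)


coverage_80 = empirical_coverage(rows_80)
print(f"{coverage_80:.2f}")
```

Running the same check over all cross-validation rows, and for each level, would give a fairer picture of how well the intervals are calibrated.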

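📘 Available models in Azure AI

If you are using an Azure AI endpoint, be sure to set model="azureai":

nixtla_client.cross_validation(..., model="azureai")

For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon.

By default, timegpt-1 is used. See this tutorial for how and when to use timegpt-1-long-horizon.

Here, we have performed a rolling cross-validation with 8 folds. Let's plot the cross-validated forecasts, including the prediction intervals: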

nixtla_client.plot(
    df, 
    timegpt_cv_df.drop(columns=['cutoff', 'users']), 
    time_col='date',
    target_col='users',
    max_insample_length=90, 
    level=[80, 90, 99.5]
)

This looks reasonable, and very similar to the results obtained in the original tutorial.
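To see where the eight folds sit on the calendar, the sketch below computes the cutoff dates of a rolling cross-validation with h=7 and n_windows=8. It rests on two assumptions that the text does not state: that consecutive windows are shifted by h days, and that the series ends on 2021-08-25 (chosen because it makes the first cutoff match the 2021-06-30 shown in the output above). The function rolling_cutoffs is a hypothetical helper, not a nixtla API.

```python
from datetime import date, timedelta


def rolling_cutoffs(last_date, h, n_windows, step_size=None):
    """Cutoff dates, oldest first, for windows that each forecast h days."""
    step = h if step_size is None else step_size
    # The most recent window ends on last_date, so its cutoff is h days earlier;
    # each earlier window moves back by `step` days.
    last_cutoff = last_date - timedelta(days=h)
    return [last_cutoff - timedelta(days=step * i) for i in reversed(range(n_windows))]


# Assumed end of the series (see the lead-in); not taken from the original text.
cutoffs = rolling_cutoffs(date(2021, 8, 25), h=7, n_windows=8)
for cutoff in cutoffs:
    print(cutoff.isoformat())
```

Under these assumptions the eight windows cover the final 56 days of the series without overlapping, which is why each date in the cross-validation output appears under exactly one cutoff.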

Let's check the Mean Absolute Error (MAE) of our cross-validation:

from utilsforecast.losses import mae
mae_timegpt = mae(df = timegpt_cv_df.drop(columns=['cutoff']),
    models=['TimeGPT'],
    target_col='users')

mae_timegpt
      unique_id     TimeGPT
0  daily_visits  167.691711

The MAE of our backtest is 167.69. TimeGPT therefore achieves a lower forecast error than the fully customized pipeline of the original tutorial.
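For readers who want to see what the metric does, here is a plain-Python sketch of the mean absolute error, applied to the five rows displayed in the cross-validation output above. Because it uses only those five rows rather than all cross-validation rows, its result differs from the 167.69 reported for the full backtest. The function name mean_absolute_error is an illustrative choice.

```python
# Actual visits and TimeGPT point forecasts from the five rows shown earlier.
actuals = [3123, 2870, 2020, 1828, 2722]
forecasts = [3310.908447, 3090.971680, 2346.991455, 2182.191895, 3082.715088]


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actuals and forecasts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae_first_rows = mean_absolute_error(actuals, forecasts)
print(round(mae_first_rows, 2))
```

The MAE is expressed in the same unit as the target, so here it reads directly as an average error in daily visits.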

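Exogenous variables

Now let's add some exogenous variables to see if we can improve the forecasting performance further.

We will add weekday indicators, which we extract from the date column.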

# We have 7 days, for each day a separate column denoting 1/0
for i in range(7):
    df[f'week_day_{i + 1}'] = 1 * (df['date'].dt.weekday == i)

df.head(10)
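        date  users     unique_id  week_day_1  week_day_2  week_day_3  week_day_4  week_day_5  week_day_6  week_day_7
0 2020-07-01   2324  daily_visits           0           0           1           0           0           0           0
1 2020-07-02   2201  daily_visits           0           0           0           1           0           0           0
2 2020-07-03   2146  daily_visits           0           0           0           0           1           0           0
3 2020-07-04   1666  daily_visits           0           0           0           0           0           1           0
4 2020-07-05   1433  daily_visits           0           0           0           0           0           0           1
5 2020-07-06   2195  daily_visits           1           0           0           0           0           0           0
6 2020-07-07   2240  daily_visits           0           1           0           0           0           0           0
7 2020-07-08   2295  daily_visits           0           0           1           0           0           0           0
8 2020-07-09   2279  daily_visits           0           0           0           1           0           0           0
9 2020-07-10   2155  daily_visits           0           0           0           0           1           0           0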
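The indicator columns follow Python's weekday convention, where Monday is 0 and Sunday is 6, so week_day_1 marks Mondays and week_day_7 marks Sundays. The standard-library sketch below reproduces the encoding for a single date; weekday_indicators is a hypothetical helper name used only for illustration.

```python
from datetime import date


def weekday_indicators(day):
    """One-hot encode the weekday: week_day_1 is Monday, week_day_7 is Sunday."""
    return {f"week_day_{i + 1}": int(day.weekday() == i) for i in range(7)}


# 2020-07-01 is the first date in the table above.
indicators = weekday_indicators(date(2020, 7, 1))
print(indicators)
```

Exactly one indicator is 1 for any date, which matches the single 1 in each row of the table.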

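Let's rerun the cross-validation procedure with the added exogenous variables. The call itself is unchanged; the log output confirms that the week_day columns in df are used as exogenous variables.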

timegpt_cv_df_with_ex = nixtla_client.cross_validation(
    df, 
    h=7, 
    n_windows=8, 
    time_col='date', 
    target_col='users', 
    freq='D',
    level=[80, 90, 99.5]
)
timegpt_cv_df_with_ex.head()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Using the following exogenous variables: week_day_1, week_day_2, week_day_3, week_day_4, week_day_5, week_day_6, week_day_7
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
INFO:nixtla.nixtla_client:Validating inputs...
      unique_id        date      cutoff  users      TimeGPT  TimeGPT-lo-99.5  TimeGPT-lo-90  TimeGPT-lo-80  TimeGPT-hi-80  TimeGPT-hi-90  TimeGPT-hi-99.5
0  daily_visits  2021-07-01  2021-06-30   3123  3314.773743      2793.566942    3043.304261    3085.668122    3543.879364    3586.243226      3835.980544
1  daily_visits  2021-07-02  2021-06-30   2870  3093.066529      2139.727892    2725.964112    2779.082154    3407.050904    3460.168946      4046.405166
2  daily_visits  2021-07-03  2021-06-30   2020  2347.973573      1386.090529    1915.487550    1973.679628    2722.267519    2780.459596      3309.856618
3  daily_visits  2021-07-04  2021-06-30   1828  2182.467408      1003.677454    1681.246491    1874.572327    2490.362488    2683.688324      3361.257361
4  daily_visits  2021-07-05  2021-06-30   2722  3083.629453      1257.248435    2220.430357    2556.408628    3610.850279    3946.828550      4910.010472

Let's plot our forecasts again and calculate our error.

nixtla_client.plot(
    df, 
    timegpt_cv_df_with_ex.drop(columns=['cutoff', 'users']), 
    time_col='date',
    target_col='users',
    max_insample_length=90, 
    level=[80, 90, 99.5]
)

mae_timegpt_with_exogenous = mae(df = timegpt_cv_df_with_ex.drop(columns=['cutoff']),
    models=['TimeGPT'],
    target_col='users')

mae_timegpt_with_exogenous
      unique_id    TimeGPT
0  daily_visits  167.22857

To conclude, we obtained the following forecasting results in this notebook:

mae_timegpt['Exogenous features'] = False
mae_timegpt_with_exogenous['Exogenous features'] = True

df_results = pd.concat([mae_timegpt, mae_timegpt_with_exogenous])
df_results = df_results.rename(columns={'TimeGPT':'MAE backtest'})
df_results = df_results.drop(columns=['unique_id'])
df_results['model'] = 'TimeGPT'

df_results[['model', 'Exogenous features', 'MAE backtest']]
     model  Exogenous features  MAE backtest
0  TimeGPT               False    167.691711
0  TimeGPT                True    167.228570
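To put the two backtest errors side by side, the short sketch below computes the absolute and relative change in MAE from adding the weekday indicators. The two input values are the MAEs reported in this notebook; the variable names are illustrative.

```python
# Backtest MAEs reported above, without and with the weekday indicators.
mae_without_exogenous = 167.691711
mae_with_exogenous = 167.228570

absolute_gain = mae_without_exogenous - mae_with_exogenous
relative_gain_pct = 100 * absolute_gain / mae_without_exogenous

print(f"{absolute_gain:.3f} visits, {relative_gain_pct:.2f}%")
```

In this backtest the weekday indicators lower the MAE by less than one visit per day, so most of the accuracy comes from the model without exogenous variables.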

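We have shown how to forecast the daily visits of a website. Compared with the original tutorial, we achieved forecasting results that are almost 10% better, using significantly fewer lines of code, in a fraction of the time needed to run it.

Did you notice how little effort that took? Here is what you did not have to do:

  • Elaborate data preprocessing - a table containing the time series is sufficient
  • Creating validation and test sets - TimeGPT handles the cross-validation in a single function
  • Choosing and testing different models - a single call to TimeGPT is all it takes
  • Hyperparameter tuning - not necessary.

Happy forecasting!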