Feature Selection and Feature Engineering in Time Series Forecasting

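1. Core Challenges of Feature Selection in Time Series Forecasting

When I first tried to build a sales forecasting model in Python in 2013, I was completely lost in front of a wholesale dataset with more than 200 features. That was when I understood that feature selection for time series forecasting differs fundamentally from feature selection in conventional machine learning: beyond the correlation between each feature and the target, we also have to deal with temporal dependence, lag effects, and multicollinearity.

Time series data is like a layered pastry: every observation carries echoes of its own history. Take daily sales forecasting for an e-commerce platform. Yesterday's sales, the effect of the promotion on the same day last week, and last month's overall trend all influence today's forecast. But if we simply turn all of that history into features, the model quickly sinks into the curse of dimensionality.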

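2. Feature Engineering Methodology for Time Series

2.1 Basic Feature Construction

The most basic calendar features are often overlooked by beginners. Assuming we are working with daily sales data, the following code shows how to extract the key time features: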

    def create_time_features(df, timestamp_col):
        """Derive calendar features from a datetime column."""
        df = df.copy()  # avoid mutating the caller's frame
        ts = df[timestamp_col]
        df['day_of_week'] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
        df['day_of_month'] = ts.dt.day
        df['week_of_year'] = ts.dt.isocalendar().week  # ISO week number
        df['month'] = ts.dt.month
        df['quarter'] = ts.dt.quarter
        df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_month_start'] = ts.dt.is_month_start.astype(int)
        return df

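Practical tip: for strongly seasonal businesses (such as ice cream sales), the month feature usually matters more than the quarter feature, whereas B2B businesses may be more sensitive to quarter-end effects.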
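
The quarter-end effect mentioned above can be made explicit as a feature. The following is a minimal sketch, assuming a datetime column; the helper name add_period_end_features and the column names are mine and purely illustrative.

```python
import pandas as pd

def add_period_end_features(df, timestamp_col):
    """Add quarter-end markers (hypothetical helper, not part of the original code)."""
    out = df.copy()
    ts = out[timestamp_col]
    # 1 on the last calendar day of a quarter, 0 otherwise
    out["is_quarter_end"] = ts.dt.is_quarter_end.astype(int)
    # Calendar days remaining until the last day of the current quarter
    quarter_end = ts.dt.to_period("Q").dt.end_time.dt.normalize()
    out["days_to_quarter_end"] = (quarter_end - ts.dt.normalize()).dt.days
    return out

demo = pd.DataFrame({"date": pd.to_datetime(["2024-03-29", "2024-03-31", "2024-04-01"])})
demo = add_period_end_features(demo, "date")
print(demo)
```

A countdown such as days_to_quarter_end lets a tree model pick up a ramp in the final days of a quarter, which a plain quarter number cannot express.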

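2.2 Lag Feature Engineering

Lagged features are the cornerstone of time series forecasting. Choosing the right lag orders requires analyzing both the autocorrelation function (ACF) and the partial autocorrelation function (PACF):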

    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # Plot the ACF and PACF; `series` is the target as a pandas Series
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
    plot_acf(series, lags=40, ax=ax1)
    plot_pacf(series, lags=40, ax=ax2)
    plt.show()

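In an energy load forecasting project, I found that the ACF of electricity consumption decayed with a 7-day weekly cycle, while the PACF showed significant spikes at lag 1 and lag 7. This suggested including both recent lags (lag 1 to lag 3) and seasonal lags (lag 7 and lag 14).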
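
Once the lag orders are chosen, building the features is a matter of shifting the series. Here is a minimal sketch using the lags from the load forecasting example; the helper add_lag_features, the column name load, and the data are made up for illustration.

```python
import pandas as pd

def add_lag_features(df, value_col, lags):
    """Create one column per lag; row t receives the value observed at t - lag."""
    out = df.copy()
    for lag in lags:
        out[f"{value_col}_lag_{lag}"] = out[value_col].shift(lag)
    return out

load = pd.DataFrame({"load": range(1, 22)})  # 21 days of made-up values
lagged = add_lag_features(load, "load", lags=[1, 2, 3, 7, 14])
usable = lagged.dropna()  # rows where every lag is available
print(len(load), len(usable))
```

The longest lag decides how many warm-up rows are lost, so a lag of 14 costs the first 14 observations.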

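2.3 Rolling Statistics Features

Rolling-window statistics are an effective way to capture local patterns in a time series. Commonly used ones include: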

    window_sizes = [3, 7, 14]  # choose according to the business cycle
    # Assuming `value` is the target: shift by one step so each window ends
    # at t-1 and never includes the value being predicted
    past = df['value'].shift(1)
    for window in window_sizes:
        df[f'rolling_mean_{window}'] = past.rolling(window).mean()
        df[f'rolling_std_{window}'] = past.rolling(window).std()
        df[f'rolling_min_{window}'] = past.rolling(window).min()
        df[f'rolling_max_{window}'] = past.rolling(window).max()

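Pitfall: when handling the edges of the series, I recommend setting min_periods=1 rather than calling dropna directly; otherwise many rows at the start of the series are lost. Be wary, though, of the bias this introduces, because the earliest windows are computed from very few observations.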
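
To make the trade-off concrete, the sketch below compares the default behaviour with min_periods=1 on a short made-up series. Keeping the per-window observation count as an extra column is my own suggestion for monitoring the warm-up bias, not something the approach above requires.

```python
import pandas as pd

values = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0, 18.0])  # made-up data

strict = values.rolling(7).mean()                  # NaN until 7 points are available
relaxed = values.rolling(7, min_periods=1).mean()  # averages whatever is available
n_obs = values.rolling(7, min_periods=1).count()   # how many points back each window

print(pd.DataFrame({"strict": strict, "relaxed": relaxed, "n_obs": n_obs}))
```

The first relaxed value is just the first observation itself, which is why early windows deserve less trust than full ones.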

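3. Advanced Feature Selection Techniques

3.1 Model-Based Importance Evaluation

Tree models such as LightGBM can evaluate feature importance efficiently. The advantage of this approach is that it captures nonlinear relationships automatically: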

    import lightgbm as lgb
    import pandas as pd

    model = lgb.LGBMRegressor()
    model.fit(X_train, y_train)

    # Collect feature importances (LightGBM reports split counts by default)
    importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

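In practice, I have found that combining feature importance analysis with time series cross-validation gives more reliable results: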

    from sklearn.model_selection import TimeSeriesSplit

    tscv = TimeSeriesSplit(n_splits=5)
    importance_results = []
    for train_idx, test_idx in tscv.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        model.fit(X_train, y_train)  # refit on each expanding training window
        fold_importance = model.feature_importances_
        importance_results.append(fold_importance)
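
The per-fold importances still have to be combined into one ranking. A minimal sketch of one way to do that follows, assuming each fold yields one importance vector; the helper summarize_importance, the per-fold normalization, and the numbers are my assumptions.

```python
import numpy as np
import pandas as pd

def summarize_importance(importance_results, feature_names):
    """Average per-fold importances after normalizing each fold to sum to 1."""
    arr = np.vstack(importance_results).astype(float)
    # Normalize so folds trained on more data do not dominate the average
    arr = arr / arr.sum(axis=1, keepdims=True)
    summary = pd.DataFrame({
        "feature": feature_names,
        "mean_importance": arr.mean(axis=0),
        "std_importance": arr.std(axis=0),  # spread across folds
    })
    return summary.sort_values("mean_importance", ascending=False).reset_index(drop=True)

# Made-up importances from three folds for three features
fold_values = [np.array([30, 10, 0]), np.array([25, 15, 0]), np.array([35, 5, 0])]
summary = summarize_importance(fold_values, ["lag_1", "lag_7", "noise"])
print(summary)
```

A feature with a high mean but a large spread across folds is a candidate for the stability checks in section 4.2.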

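3.2 Adapting Recursive Feature Elimination (RFE) to Time Series

Standard RFE has to be adapted before it can be used on time series. The key changes are:

- Use time series cross-validation instead of random splits
- Add a protection mechanism for lag features (to prevent key lags from being eliminated too early)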

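    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import TimeSeriesSplit

    # Use time series cross-validation instead of random K-fold
    tscv = TimeSeriesSplit(n_splits=5)
    selector = RFECV(estimator=model, step=1, cv=tscv,
                     scoring='neg_mean_squared_error')
    selector = selector.fit(X, y)
    selected_features = X.columns[selector.support_]
    # Note: this covers only the first change; RFECV has no option
    # for protecting specific features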
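
As far as I know, RFECV offers no parameter for pinning features, so the lag protection mechanism needs a custom loop. The sketch below shows the idea under simplifying assumptions: importance is the absolute standardized least-squares coefficient (a stand-in for a model's importances), ranking is done in-sample rather than on time-ordered folds, and all names and data are hypothetical.

```python
import numpy as np
import pandas as pd

def ols_importance(X, y):
    """Absolute standardized OLS coefficients, used here as a simple importance proxy."""
    Z = (X - X.mean()) / X.std()
    coef, *_ = np.linalg.lstsq(Z.to_numpy(), (y - y.mean()).to_numpy(), rcond=None)
    return pd.Series(np.abs(coef), index=X.columns)

def eliminate_with_protection(X, y, protected, n_keep):
    """Backward elimination that never removes a feature listed in `protected`."""
    kept = list(X.columns)
    while len(kept) > n_keep:
        importance = ols_importance(X[kept], y)
        candidates = importance.drop(labels=[f for f in protected if f in kept])
        if candidates.empty:
            break  # only protected features remain
        kept.remove(candidates.idxmin())  # drop the weakest unprotected feature
    return kept

# Synthetic data: lag_7 has a weak but real effect, noise has none
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "lag_1": rng.normal(size=n),
    "lag_7": rng.normal(size=n),
    "roll_std_3": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = (3.0 * X_demo["lag_1"] + 0.05 * X_demo["lag_7"]
          + 1.0 * X_demo["roll_std_3"] + rng.normal(scale=0.1, size=n))

unprotected = eliminate_with_protection(X_demo, y_demo, protected=[], n_keep=2)
protected = eliminate_with_protection(X_demo, y_demo, protected=["lag_7"], n_keep=2)
print(unprotected, protected)
```

Without protection the weak weekly lag is eliminated; with protection it survives at the cost of another feature. Whether that trade is worth it should still be judged by forecast error on held-out periods.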

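3.3 Nonlinear Selection Based on Mutual Information

For time series with complex nonlinear relationships, mutual information is a better measure of the dependence between a feature and the target: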

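    import pandas as pd
    from sklearn.feature_selection import mutual_info_regression

    # X must not contain NaN, so drop the warm-up rows created by lags first
    mi_scores = mutual_info_regression(X, y, random_state=0)
    mi_scores = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)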
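
To see why mutual information helps, consider a feature whose effect on the target is quadratic. The sketch below uses a simple histogram estimator that I wrote for illustration; it is not the nearest-neighbor estimator that scikit-learn uses, and the data is synthetic.

```python
import numpy as np

def binned_mutual_info(x, y, bins=16):
    """Histogram estimate of mutual information in nats (illustrative only)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()            # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nonzero = pxy > 0                    # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 5000)
y_quadratic = x ** 2 + rng.normal(0, 0.05, 5000)  # depends on x, but not linearly
y_noise = rng.normal(0, 1, 5000)                  # independent of x

pearson_quadratic = abs(np.corrcoef(x, y_quadratic)[0, 1])
mi_quadratic = binned_mutual_info(x, y_quadratic)
mi_noise = binned_mutual_info(x, y_noise)
print(pearson_quadratic, mi_quadratic, mi_noise)
```

The Pearson correlation is close to zero for the quadratic relationship, while its mutual information is clearly above that of the independent pair. A correlation filter would discard this feature; a mutual information filter would keep it.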

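4. Feature Selection Strategies in Practice

4.1 A Prioritization Framework

Based on experience across several projects, I prioritize features in the following order:

1) Core features driven by domain knowledge (such as promotion flags and holidays)
2) Statistically significant lags (determined from the PACF)
3) Rolling statistics (mean, standard deviation, and so on)
4) Interaction features (such as the cross term between promotion and weekend)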
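
The last item in the list, an interaction between promotion and weekend, can be built as a product of two binary flags. A minimal sketch with made-up data and hypothetical column names:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-06-07", "2024-06-08", "2024-06-09", "2024-06-10"]),
    "is_promo": [1, 1, 0, 1],  # made-up promotion calendar
})
sales["is_weekend"] = sales["date"].dt.dayofweek.isin([5, 6]).astype(int)
# 1 only when a promotion runs on a weekend day
sales["promo_x_weekend"] = sales["is_promo"] * sales["is_weekend"]
print(sales)
```

Tree models can learn such interactions on their own, but an explicit cross term tends to help linear models and makes the effect easy to inspect.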

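4.2 Stability Checks

A good feature selection should remain stable over time. The check I usually run is:

1) Split the data into consecutive time segments
2) Run feature selection independently on each segment
3) Compute the Jaccard similarity between the feature sets selected on adjacent segments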

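    # feature_sets holds the features selected on each time segment, in order
    stability_scores = []
    for i in range(len(feature_sets) - 1):
        set1 = set(feature_sets[i])
        set2 = set(feature_sets[i + 1])
        union = set1 | set2
        # Jaccard similarity: size of the intersection over size of the union;
        # two empty sets are treated as identical
        stability_scores.append(len(set1 & set2) / len(union) if union else 1.0)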

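4.3 Validating Forecast Performance

Ultimately, the effect of feature selection has to be validated through forecast performance. I recommend the following evaluation procedure: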

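    import numpy as np
    from sklearn.metrics import mean_absolute_percentage_error
    from sklearn.model_selection import TimeSeriesSplit

    def evaluate_features(X, y, selected_features):
        """Mean MAPE over time-ordered folds using only `selected_features`."""
        tscv = TimeSeriesSplit(n_splits=5)
        scores = []
        for train_idx, test_idx in tscv.split(X):
            X_train = X.iloc[train_idx][selected_features]
            X_test = X.iloc[test_idx][selected_features]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            model.fit(X_train, y_train)  # `model` is the estimator defined earlier
            preds = model.predict(X_test)
            # MAPE becomes unstable when y_test contains values near zero
            score = mean_absolute_percentage_error(y_test, preds)
            scores.append(score)
        return np.mean(scores)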

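5. Solutions to Common Problems

5.1 Handling Highly Correlated Features

Highly correlated rolling statistics are common in time series. My workflow:

1) Compute the feature correlation matrix
2) Identify feature pairs with a correlation coefficient above 0.9
3) Keep the feature with the stronger business interpretation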

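    import numpy as np

    corr_matrix = X.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Flags the later column of each pair; step 3 (interpretability) is still manual
    to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]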
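
The snippet above drops whichever column of a pair comes later, regardless of interpretability. One way to encode the third step is an explicit priority list. This is a sketch under my own assumptions: the helper drop_correlated, the priority ordering, and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(X, priority, threshold=0.9):
    """For each highly correlated pair, drop the feature ranked lower in `priority`."""
    rank = {name: i for i, name in enumerate(priority)}  # smaller rank = keep first
    corr = X.corr().abs()
    cols = list(X.columns)
    dropped = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                # Features missing from `priority` rank last; ties drop the later column
                loser = a if rank.get(a, len(rank)) > rank.get(b, len(rank)) else b
                dropped.add(loser)
    return [c for c in cols if c not in dropped]

rng = np.random.default_rng(1)
base = rng.normal(size=200)
X_demo = pd.DataFrame({
    "rolling_max_7": base + 0.01 * rng.normal(size=200),  # near-duplicate of the mean
    "rolling_mean_7": base,
    "lag_1": rng.normal(size=200),
})
kept = drop_correlated(X_demo, priority=["rolling_mean_7", "lag_1", "rolling_max_7"])
print(kept)
```

Here the rolling mean is kept even though it appears later in the frame, because the priority list marks it as the more interpretable of the two.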

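5.2 Dealing with Concept Drift

When feature importance changes over time, it is a sign that concept drift may be present. One way to detect it: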

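    # Track how feature importance changes across sliding windows
    window_size = 100
    importance_changes = []
    # +1 so the final full window is included; this refits once per row,
    # so consider a larger stride on long series
    for i in range(len(X) - window_size + 1):
        X_window = X.iloc[i:i + window_size]
        y_window = y.iloc[i:i + window_size]
        model.fit(X_window, y_window)
        importance_changes.append(model.feature_importances_)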
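
The collected importance vectors still need to be turned into a drift signal. One simple option, which is my assumption rather than a prescribed method, is the total variation distance between the normalized importance vectors of consecutive windows. The numbers below are made up.

```python
import numpy as np

def importance_drift(importance_changes):
    """Total variation distance between consecutive normalized importance vectors."""
    arr = np.asarray(importance_changes, dtype=float)
    shares = arr / arr.sum(axis=1, keepdims=True)  # each row sums to 1
    return 0.5 * np.abs(np.diff(shares, axis=0)).sum(axis=1)

# Made-up importances for three windows and three features
history = [[60, 30, 10], [58, 32, 10], [20, 30, 50]]
drift = importance_drift(history)
print(drift)
```

The distance ranges from 0 (identical importance profile) to 1 (completely different). The jump between the second and third windows is the kind of change worth investigating; any alert threshold would have to be tuned per dataset.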

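5.3 Feature Caching for Real-Time Prediction

Production systems need to compute rolling features efficiently. My solution is to maintain a circular buffer: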

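    from collections import deque

    import numpy as np

    class CircularBuffer:
        """Fixed-size window; deque(maxlen=...) evicts the oldest value automatically."""

        def __init__(self, size):
            self.buffer = deque(maxlen=size)

        def add(self, value):
            self.buffer.append(value)

        def mean(self):
            return np.mean(self.buffer)

        def std(self):
            # ddof=1 matches the sample std that pandas rolling().std() computes offline
            return np.std(self.buffer, ddof=1)

    # Usage example
    price_buffer = CircularBuffer(7)  # 7-day window
    price_buffer.add(current_price)   # current_price comes from the live data feed
    rolling_avg = price_buffer.mean()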
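
The buffer above recomputes the mean over the whole window on every call. When the window is large or updates are frequent, running sums make each update constant time. This is a sketch of that variant under my own naming (RunningWindow); note that the sum-of-squares formula can lose precision for very large values.

```python
from collections import deque
import math

class RunningWindow:
    """Fixed-size window that keeps running sums so mean and std cost O(1) per call."""

    def __init__(self, size):
        self.buffer = deque(maxlen=size)
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, value):
        if len(self.buffer) == self.buffer.maxlen:
            oldest = self.buffer[0]  # about to be evicted by append
            self.total -= oldest
            self.total_sq -= oldest * oldest
        self.buffer.append(value)
        self.total += value
        self.total_sq += value * value

    def mean(self):
        # Raises ZeroDivisionError if nothing has been added yet
        return self.total / len(self.buffer)

    def std(self):
        n = len(self.buffer)
        if n < 2:
            return float("nan")  # sample std is undefined for fewer than 2 points
        var = (self.total_sq - self.total ** 2 / n) / (n - 1)
        return math.sqrt(max(var, 0.0))  # clamp tiny negatives from rounding

window = RunningWindow(3)
for price in [10, 12, 14, 16]:  # made-up prices; 10 is evicted by the fourth add
    window.add(price)
print(window.mean(), window.std())
```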

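6. Toolchain and Performance Optimization

6.1 Distributed Feature Computation with Dask

When working with very long time series, I use Dask for parallel computation: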

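    import dask.dataframe as dd

    ddf = dd.from_pandas(df, npartitions=4)
    # Define the rolling feature lazily; assigning an already computed
    # pandas Series back to a Dask column would defeat the parallelism
    ddf['rolling_mean'] = ddf['value'].rolling(7).mean()
    # Trigger the parallel computation once and collect a pandas DataFrame
    result = ddf.compute()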

6.2 Feature Computation Pipeline

Wrap the feature engineering steps in an sklearn Pipeline:

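    from sklearn.feature_selection import SelectKBest, mutual_info_regression
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer

    def create_lags(df):
        df = df.copy()  # do not mutate the input frame
        for lag in [1, 2, 3, 7, 14]:
            df[f'lag_{lag}'] = df['value'].shift(lag)
        return df

    pipeline = Pipeline([
        ('create_features', FunctionTransformer(create_lags)),
        # Shifting leaves NaN in the first rows, which SelectKBest rejects.
        # The imputation step is an assumption added for runnability; dropping
        # those rows from X and y before fitting is an alternative.
        ('impute', SimpleImputer(strategy='median')),
        # k must not exceed the number of columns produced above
        ('feature_selection', SelectKBest(score_func=mutual_info_regression, k=10))
    ])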

6.3 Accelerating Hot Spots with Cython

For compute-intensive rolling statistics, I use Cython for acceleration:

    # cython: language_level=3
    import numpy as np
    cimport numpy as np

    np.import_array()

    def rolling_mean(np.ndarray[double, ndim=1] arr, int window):
        """Rolling mean that averages over the available points during warm-up."""
        cdef int i, n = arr.shape[0]
        cdef np.ndarray[double, ndim=1] out = np.empty(n)
        cdef double current_sum = 0.0
        for i in range(n):
            current_sum += arr[i]
            if i >= window:
                current_sum -= arr[i - window]  # remove the value leaving the window
            out[i] = current_sum / min(i + 1, window)
        return out

In a recent project, this optimization cut feature computation time from 45 seconds to 0.8 seconds.
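
Before relying on a compiled routine, it helps to check its output against a plain NumPy reference. The sketch below implements the same warm-up rule as the Cython function in this section (divide by the number of points seen so far, capped at the window) using cumulative sums. The function name is mine, and window is assumed to be at least 1.

```python
import numpy as np

def rolling_mean_reference(arr, window):
    """Rolling mean via cumulative sums; early entries average the points seen so far."""
    arr = np.asarray(arr, dtype=float)
    csum = np.cumsum(arr)
    out = csum.copy()
    # From index `window` on, subtract the cumulative sum that left the window
    out[window:] = csum[window:] - csum[:-window]
    counts = np.minimum(np.arange(1, arr.size + 1), window)
    return out / counts

reference = rolling_mean_reference([1, 2, 3, 4, 5], window=3)
print(reference)
```

Comparing the two implementations with np.allclose on random input is a cheap regression test to run whenever the Cython code changes.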