Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/u014281392/article/details/83238999
TSAP : TimeSeries Analysis with Python
( 4 ) Trend & Seasonality
import numpy as np # version : 1.14.0
import pandas as pd # version : 0.22.0
import statsmodels # version : 0.8.0
%matplotlib inline
import matplotlib.pyplot as plt
print(statsmodels.__version__)
print(np.__version__)
print(pd.__version__)
Let’s start with some informal exploration
air_passengers = pd.read_csv("./data/AirPassengers.csv", header = 0, parse_dates = [0], names = ['Month', 'Passengers'], index_col = 0)
air_passengers.head()
Month | Passengers |
---|---|
1949-01-01 | 112 |
1949-02-01 | 118 |
1949-03-01 | 132 |
1949-04-01 | 129 |
1949-05-01 | 121 |
parse_dates : boolean / list of [ints|names] / list of lists / dict, default False
boolean. If True -> try parsing the index.
list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type.
Note: A fast-path exists for iso8601-formatted dates.
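To try the list-of-ints form without the data file, the sketch below parses a small inline CSV. The inline text, its header line, and the name `sample` are illustrative stand-ins for `./data/AirPassengers.csv`; the five rows are the ones shown in the `head()` output above.

```python
import io
import pandas as pd

# Illustrative inline CSV; the header text is an assumption, the rows mirror head().
csv_text = """Month,Passengers
1949-01-01,112
1949-02-01,118
1949-03-01,132
1949-04-01,129
1949-05-01,121
"""

# parse_dates=[0] parses column 0 as dates; index_col=0 then makes it the index.
sample = pd.read_csv(io.StringIO(csv_text), header=0, parse_dates=[0],
                     names=['Month', 'Passengers'], index_col=0)

print(sample.index.dtype)       # a datetime64 dtype
print(sample.loc['1949-03'])    # partial-string selection works on a DatetimeIndex
```

Because the index is a DatetimeIndex, later steps such as `resample` and year-by-year selection can work directly on it.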
Visualization
air_passengers.plot(grid=True, figsize=(10, 5))
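# Yearly curves of AirPassengers: the variance grows over time
plt.figure(figsize=(10,5))
plt.grid(True)
for year in [str(x) for x in range(1949, 1961)]:
    # .loc with a year string selects that year's 12 monthly rows
    plt.plot(range(1,13), air_passengers.loc[year, 'Passengers'], label=year)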
plt.legend()
Getting a little more formal time series analysis
- mean
- variance
- autocovariance
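These three quantities are the ones weak stationarity is stated in terms of: a stationary series has a constant mean, a constant variance, and an autocovariance that depends only on the lag. A minimal sketch of the sample versions follows, using made-up random series rather than the passenger data; the `autocovariance` helper is illustrative, not a library function.

```python
import numpy as np

def autocovariance(x, lag):
    """Sample autocovariance at a given lag (biased estimator, divides by n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    return np.sum(centered[:n - lag] * centered[lag:]) / n

rng = np.random.RandomState(0)
noise = rng.normal(size=500)    # made-up stationary series
walk = np.cumsum(noise)         # random walk built from it: not stationary

for name, series in [('noise', noise), ('walk', walk)]:
    first, second = series[:250], series[250:]
    print(name,
          'mean:', first.mean(), second.mean(),
          'var:', first.var(), second.var(),
          'autocov(1):', autocovariance(series, 1))
```

Comparing the two halves of each series is a crude version of what the rolling-mean and yearly-variance plots do for the passenger data.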
plot the moving average
air_passengers.rolling(window = 12).mean().plot(grid=True, figsize=(10,5),title='12-month moving average')
var( ) & Time
air_passengers.resample('1Y').var().plot(figsize=(10,5), grid=True)
Yearly variance
Autocorrelation
from statsmodels.tsa.stattools import acf, pacf
# autocorrelation for lags 0..40 (nlags set explicitly; the default differs between statsmodels versions)
ac = acf(air_passengers.Passengers, nlags=40)
plt.figure(figsize=(10, 5))
plt.title('ACF')
plt.grid(True, axis='x')
plt.xticks(range(0,41,12))
plt.plot(ac)
And now let’s make it formal
# Augmented Dickey-Fuller test
from statsmodels.tsa.stattools import adfuller
adfuller(air_passengers.Passengers, autolag = 'AIC', regression = 'ct')
(-2.100781813844671,
0.545658934312454,
13,
130,
{'1%': -4.030152423759672,
'10%': -3.1471816659080565,
'5%': -3.444817634956759},
993.2814778200581)
What do these numbers mean?
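- adf (float) - the test statistic
- pvalue (float) - MacKinnon's approximate p-value, based on MacKinnon (1994, 2010)
- usedlag (int) - the number of lags used
- nobs (int) - the number of observations used for the ADF regression and for calculating the critical values
- critical values (dict) - critical values for the test statistic at the 1%, 5% and 10% levels, based on MacKinnon (2010)
- icbest (float) - the maximized information criterion, if autolag is not None
- resstore (ResultStore, optional) - a dummy class with the results attached as attributes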
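Reading the tuple by position is easier once it is unpacked. The sketch below reuses the numbers printed above (nothing is recomputed) and compares the statistic with each critical value; the variable names are illustrative.

```python
# Result tuple copied from the adfuller output above.
adf_result = (-2.100781813844671,
              0.545658934312454,
              13,
              130,
              {'1%': -4.030152423759672,
               '10%': -3.1471816659080565,
               '5%': -3.444817634956759},
              993.2814778200581)

adf_stat, pvalue, usedlag, nobs, critical_values, icbest = adf_result

# The null hypothesis of the ADF test is that the series has a unit root.
# It is rejected only when the statistic is below (more negative than) the critical value.
for level in ['1%', '5%', '10%']:
    reject = adf_stat < critical_values[level]
    print(level, 'critical value:', critical_values[level], 'reject unit root:', reject)

print('p-value:', pvalue)
```

Here the statistic (about -2.10) lies above all three critical values and the p-value is about 0.55, so the unit-root null cannot be rejected: the test gives no evidence that the raw series is stationary around a deterministic trend.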
data transformation
# log transformation
log_passengers = np.log(air_passengers.Passengers)
log_passengers.plot(figsize=(10, 5), grid=True, title='Log(Passengers)')
# power transformation (square root)
rt_passengers = np.sqrt(air_passengers.Passengers)
rt_passengers.plot(figsize=(10, 5), grid=True, title='sqrt(passengers)')
calculate a rolling mean (moving average)
# window size = 12
air_passengers.rolling(window = 12).mean().plot(figsize=(10,5),grid=True)
Detrended
original time series - rolling_mean
rolling_mean = air_passengers.rolling(window = 12).mean()
passengers_detrended = air_passengers - rolling_mean
passengers_detrended.plot(figsize=(10, 5), grid=True)
detrended: log(Passenger) - rolling_mean(log(Passenger))
log_rolling_mean = log_passengers.rolling(window = 12).mean()
log_detrended = log_passengers - log_rolling_mean
log_detrended.plot(figsize=(10, 5), grid=True)
# periodic pattern that remains after the trend is removed
log_detrended.rolling(window=5).mean().plot(figsize=(10, 5), grid=True)
rolling median
log_detrended.rolling(12).median().plot(figsize=(10,5), grid=True)
log(original time series - rolling_mean)
rolling_mean = air_passengers.rolling(window = 12).mean()
passengers_detrended = air_passengers - rolling_mean
log_detrended2 = passengers_detrended.Passengers.apply(lambda x: np.log(x))
log_detrended2.plot(figsize=(10, 5), grid=True)
Why didn’t that work?
- The missing points are where the detrended value is <= 0, for which the logarithm is undefined
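The gaps can be reproduced on a tiny example. The sketch below uses a made-up eight-point series and a window of 3 instead of 12, purely for illustration: subtracting the rolling mean leaves values on both sides of zero, and `np.log` returns NaN for the negative ones (the first `window - 1` entries are NaN already, because the rolling mean is not defined there).

```python
import numpy as np
import pandas as pd

# Made-up series with an upward trend and an alternating bump.
values = pd.Series([10.0, 14.0, 11.0, 16.0, 13.0, 19.0, 15.0, 22.0])

rolling_mean = values.rolling(window=3).mean()   # first 2 entries are NaN
detrended = values - rolling_mean                # mixes positive and negative values

# Silence the floating-point warnings NumPy raises for log of a negative number.
with np.errstate(invalid='ignore', divide='ignore'):
    logged = np.log(detrended)                   # NaN wherever detrended < 0

print(pd.DataFrame({'detrended': detrended, 'log': logged}))
```

Taking the log first and detrending afterwards, as in the previous step, avoids the problem because the original passenger counts are all positive.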
detrended (original ts - regression ts)
# Now let's use a regression rather than a rolling mean to detrend
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
# time index 0..n-1 as the regressor; OLS does not add an intercept by itself, so add one
X = sm.add_constant(np.arange(len(air_passengers)))
model = OLS(air_passengers.Passengers.values, X)
result = model.fit()
result.params
regression_fit = pd.Series(result.predict(X), index = air_passengers.index)
passengers_detrended = air_passengers.Passengers - regression_fit
passengers_detrended.plot(figsize=(10, 5), grid=True)
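- After a log transform followed by detrending, the trend of the original series is essentially removed, and the periodic character of the original data shows through.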
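For regression-based detrending, whether the fitted line has an intercept matters a great deal, and statsmodels' `OLS` fits only the columns it is given. The sketch below illustrates the difference with plain NumPy on a made-up series (intercept 100, slope 2, period-12 wave); none of these numbers come from the passenger data.

```python
import numpy as np

t = np.arange(48)
# Made-up series: intercept 100, slope 2, plus a period-12 seasonal wave.
series = 100.0 + 2.0 * t + 10.0 * np.sin(2 * np.pi * t / 12)

# Least-squares line with an intercept (degree-1 polynomial).
slope, intercept = np.polyfit(t, series, 1)
with_intercept = series - (slope * t + intercept)

# Least-squares line forced through the origin (no intercept term).
slope_origin = np.sum(t * series) / np.sum(t * t)
through_origin = series - slope_origin * t

print('residual mean with intercept:', with_intercept.mean())
print('residual mean through origin:', through_origin.mean())
```

With an intercept the residuals are centered on zero; forcing the line through the origin leaves a large offset and a leftover slope in the "detrended" series.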
Seasonality
Differencing
- original timeseries differencing
- log(original timeseries) differencing
# original timeseries differencing
(air_passengers.Passengers - air_passengers.Passengers.shift()).plot(figsize=(10,5),grid=True)
# log(original timeseries) differencing
log_passengers_diff = log_passengers - log_passengers.shift()
log_passengers_diff.plot(figsize=(10, 5), grid=True)
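Two small facts about these differences are worth checking on a toy example: `s - s.shift()` is the same as pandas' built-in `s.diff()`, and the difference of logs equals the log of the ratio between consecutive values, so it is constant for a series with a constant growth rate. The series below is made up (10% growth per step).

```python
import numpy as np
import pandas as pd

# Made-up series growing by 10% per step.
s = pd.Series([100.0, 110.0, 121.0, 133.1])

manual_diff = s - s.shift()     # the form used above
builtin_diff = s.diff()         # pandas shortcut for the same operation

log_diff = np.log(s).diff()     # difference of logs = log(s[t] / s[t-1])

print(builtin_diff.tolist())
print(log_diff.tolist())
```

The raw differences keep growing with the level of the series, while the log differences stay flat, which is why the log-differenced passenger plot looks more stable than the raw one.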
Seasonal Decompose
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(log_passengers, model = 'multiplicative')
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
# log(Original timeseries), Trend
plt.figure(figsize=(10,5))
plt.plot(log_passengers, label='Original')
plt.plot(trend, label='Trend')
plt.ylabel('log(Passengers)')
plt.grid(True)
plt.legend(loc = 'best')
# Seasonality & Residuals
plt.figure(figsize=(10, 5))
plt.plot(seasonal,label='Seasonality')
plt.plot(residual, label='Residuals')
plt.grid(True)
plt.legend(loc='best')
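To make the three components less of a black box, here is a hand-rolled additive decomposition in the classical moving-average style. It is a sketch of the general idea only and is not guaranteed to match the numbers `seasonal_decompose` returns; the monthly series is made up (linear trend plus a fixed period-12 pattern), and all names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up monthly series: linear trend plus a fixed period-12 pattern (sums to zero).
index = pd.date_range('2000-01-01', periods=48, freq='MS')
pattern = np.array([-3, -2, 0, 1, 2, 4, 6, 5, 1, -2, -5, -7], dtype=float)
series = pd.Series(0.5 * np.arange(48) + np.tile(pattern, 4), index=index)

# 1. Trend: centered 2x12 moving average (12-term mean, then 2-term mean, re-centered).
trend = series.rolling(window=12).mean().rolling(window=2).mean().shift(-6)

# 2. Seasonal: average the detrended values by calendar month, then center them on zero.
detrended = series - trend
monthly = detrended.groupby(detrended.index.month).mean()
monthly = monthly - monthly.mean()
seasonal = pd.Series(monthly.loc[series.index.month].values, index=series.index)

# 3. Residual: whatever trend and seasonal do not explain.
residual = series - trend - seasonal

print(seasonal.head(12))
print(residual.dropna().abs().max())
```

On this noise-free toy series the seasonal component recovers the pattern and the residual is zero up to rounding; on real data the residual carries the irregular part. The trend is undefined for the first and last six months because the centered window does not fit there.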