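Regression-Based Stock Selection in Practice
Overview
The regression method regresses past stock returns on multiple factors to obtain a regression equation, then plugs the latest factor values into that equation to forecast future stock returns. Stocks are selected on the basis of that forecast, and the selection model is then evaluated for effectiveness and return.
The advantage of the regression method is that it can adjust each stock's sensitivity to each factor in a fairly timely manner, and different stocks may have different sensitivities to different factors. Its drawback is that it is easily distorted by extreme values, and it tends to perform poorly in markets where factor sensitivities change a lot.
Regression-based stock selection workflow
We usually rebalance monthly. The cross-sectional regression pairs the factor values on the last trading day of a month with each stock's return over the following month.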
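Written out, the cross-sectional regression described above has roughly the following form. The symbols are notation introduced here for illustration and do not come from the original text: $r_{i,t+1}$ is stock $i$'s return over month $t+1$, $f_{i,k,t}$ is the value of factor $k$ for stock $i$ on the last trading day of month $t$, and $K$ is the number of factors. The coefficients carry no time index because the code below pools the samples from all months into a single fit.

```latex
r_{i,t+1} = \alpha + \sum_{k=1}^{K} \beta_k \, f_{i,k,t} + \varepsilon_{i,t+1}
\qquad\Longrightarrow\qquad
\hat{r}_{i,T+1} = \hat{\alpha} + \sum_{k=1}^{K} \hat{\beta}_k \, f_{i,k,T}
```

The second expression is the forecast step: the estimated coefficients are applied to the factor values of the latest month $T$.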
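Step breakdown
- Regression training interval: 2016-01-01 ~ 2021-01-01 (the code below uses the shorter range 2019-01-01 ~ 2021-01-01)
- Regression stock pool: constituents of the HS300 (CSI 300) index
- Prepare the regression factor data and compute returns
- Factor data: concatenate the cross-sectional data, add a date column, drop null values
- Return calculation: compute the return for every sample
- Extract the target and feature values and estimate the regression
- Data processing: drop samples with a return of 0 (no price data), winsorize, standardize
Code
Imports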
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
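1. Prepare the date data
# Trading dates from 2019-01-01 to 2021-01-01
# (get_trading_dates is supplied by the quant platform's research environment)
dates = get_trading_dates(start_date="2019-01-01", end_date="2021-01-01")
# Last trading day of each month, used to compute monthly returns
month_date = []
for i in range(len(dates) - 1):
    # A month ends where the next trading date falls in a different month
    if dates[i].month != dates[i + 1].month:
        month_date.append(dates[i])
# Append the final trading day of the range
month_date.append(dates[-1])
# Debug output
print(month_date)
print(month_date[:-1])
print(np.shape(month_date))
Output: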
[datetime.date(2019, 1, 31), datetime.date(2019, 2, 28), datetime.date(2019, 3, 29), datetime.date(2019, 4, 30), datetime.date(2019, 5, 31), datetime.date(2019, 6, 28), datetime.date(2019, 7, 31), datetime.date(2019, 8, 30), datetime.date(2019, 9, 30), datetime.date(2019, 10, 31), datetime.date(2019, 11, 29), datetime.date(2019, 12, 31), datetime.date(2020, 1, 23), datetime.date(2020, 2, 28), datetime.date(2020, 3, 31), datetime.date(2020, 4, 30), datetime.date(2020, 5, 29), datetime.date(2020, 6, 30), datetime.date(2020, 7, 31), datetime.date(2020, 8, 31), datetime.date(2020, 9, 30), datetime.date(2020, 10, 30), datetime.date(2020, 11, 30), datetime.date(2020, 12, 31)]
[datetime.date(2019, 1, 31), datetime.date(2019, 2, 28), datetime.date(2019, 3, 29), datetime.date(2019, 4, 30), datetime.date(2019, 5, 31), datetime.date(2019, 6, 28), datetime.date(2019, 7, 31), datetime.date(2019, 8, 30), datetime.date(2019, 9, 30), datetime.date(2019, 10, 31), datetime.date(2019, 11, 29), datetime.date(2019, 12, 31), datetime.date(2020, 1, 23), datetime.date(2020, 2, 28), datetime.date(2020, 3, 31), datetime.date(2020, 4, 30), datetime.date(2020, 5, 29), datetime.date(2020, 6, 30), datetime.date(2020, 7, 31), datetime.date(2020, 8, 31), datetime.date(2020, 9, 30), datetime.date(2020, 10, 30), datetime.date(2020, 11, 30)]
(24,)
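The month-end search above can also be written as a group-by. The sketch below is an illustration only: `pd.bdate_range` is a stand-in for the platform's `get_trading_dates` and ignores exchange holidays, so it is an assumption rather than a replacement for the real calendar.

```python
import pandas as pd

# Hypothetical stand-in for get_trading_dates(): weekdays only.
# A real exchange calendar would also exclude public holidays.
dates = pd.bdate_range("2019-01-01", "2019-03-31")

# Group the dates by (year, month) and keep the latest date in each group.
s = pd.Series(dates, index=dates)
month_end = s.groupby([s.index.year, s.index.month]).max().tolist()

print(month_end)
```

Like the loop, this keeps the last available date of the final month even when the range stops before that month's true end.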
2. Prepare the factor data
# Get the constituent list of the CSI 300 index
stocks = index_components("000300.XSHG")
all_data = pd.DataFrame()
# Features are each month's factor values. The last date is skipped
# because the following month it would be paired with is not in the date list.
for date in month_date[:-1]:
    q = query(
        fundamentals.eod_derivative_indicator.pe_ratio,
        fundamentals.eod_derivative_indicator.pb_ratio,
        fundamentals.eod_derivative_indicator.market_cap,
        fundamentals.financial_indicator.ev,
        fundamentals.financial_indicator.return_on_asset_net_profit,
        fundamentals.financial_indicator.du_return_on_equity,
        fundamentals.financial_indicator.earnings_per_share,
        fundamentals.income_statement.revenue,
        fundamentals.income_statement.total_expense
    ).filter(
        fundamentals.stockcode.in_(stocks)
    )
    # Query the factor data for this date
    fund = get_fundamentals(q, entry_date=date).iloc[:, 0, :]
    # Add the date column
    fund["date"] = date
    # Append this month's factor data
    all_data = pd.concat([all_data, fund])
# Basic cleaning: dropna() returns a new frame, so the result must be assigned back
all_data = all_data.dropna()
# Placeholder column, filled in step 5
all_data["next_month_return"] = np.nan
# Debug output
print(all_data.head())
Debug output:
du_return_on_equity revenue return_on_asset_net_profit \
000002.XSHE 10.2572 1.76022e+11 1.6783
000001.XSHE 8.9467 8.6664e+10 0.6198
000066.XSHE 3.2718 6.27052e+09 1.5837
000063.XSHE -26.7064 5.87662e+10 -5.4534
000069.XSHE 9.6044 2.45503e+10 2.1907
pb_ratio pe_ratio earnings_per_share total_expense ev \
000002.XSHE 1.9667 9.0705 1.267 1.47083e+11 4.96679e+11
000001.XSHE 0.866 7.6796 1.14 6.005e+10 3.30656e+12
000066.XSHE 2.5219 14.0424 0.071 6.31538e+09 2.24526e+10
000063.XSHE 3.6823 -12.0731 -1.73 6.02975e+10 9.32565e+10
000069.XSHE 0.8881 4.9651 0.6205 1.7947e+10 1.5581e+11
market_cap date next_month_return
000002.XSHE 3.06336e+11 2019-01-31 NaN
000001.XSHE 1.90592e+11 2019-01-31 NaN
000066.XSHE 1.57378e+10 2019-01-31 NaN
000063.XSHE 8.43146e+10 2019-01-31 NaN
000069.XSHE 5.25061e+10 2019-01-31 NaN
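3. Get prices
# Each sample pairs a stock's factor values in one month with its return in the next month
# Fetch the month-end closing prices
all_price = pd.DataFrame()
for date in month_date:
    price = get_price(stocks, start_date=date, end_date=date, fields="close")
    all_price = pd.concat([all_price, price])
# Transpose so that rows are stocks and columns are dates
all_price = all_price.T
# Drop stocks with any missing price
all_price = all_price.dropna()
# Debug output
print(all_price.head())
Output: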
date 2019-01-31 2019-02-28 2019-03-29 2019-04-30 2019-05-31 \
000001.XSHE 10.7961 12.0216 12.4690 13.4708 11.8465
000002.XSHE 25.7326 25.9552 28.4867 26.7805 24.7589
000063.XSHE 20.0043 29.6434 29.0466 31.9810 28.6089
000066.XSHE 5.3014 6.8443 10.4544 8.6543 9.6731
000069.XSHE 5.8395 6.3139 7.0256 7.2537 6.4952
date 2019-06-28 2019-07-31 2019-08-30 2019-09-30 2019-10-31 ... \
000001.XSHE 13.5490 13.8931 13.9226 15.3286 15.9874 ...
000002.XSHE 25.7882 26.6692 24.8829 24.9793 25.5870 ...
000063.XSHE 32.3591 32.7470 28.6586 31.8418 33.2245 ...
000066.XSHE 10.1676 9.4349 10.5981 12.8251 14.3960 ...
000069.XSHE 6.6190 6.8667 6.4857 6.6952 6.7048 ...
date 2020-03-31 2020-04-30 2020-05-29 2020-06-30 2020-07-31 \
000001.XSHE 12.5854 13.6965 13.0000 12.8000 13.3400
000002.XSHE 24.7382 25.8474 24.7865 25.2108 25.8859
000063.XSHE 42.5751 40.8741 35.8606 39.9191 38.9642
000066.XSHE 11.8707 11.7613 13.6503 13.1234 18.3000
000069.XSHE 6.0857 6.2095 5.5619 6.0600 7.1700
date 2020-08-31 2020-09-30 2020-10-30 2020-11-30 2020-12-31
000001.XSHE 15.08 15.17 17.75 19.74 19.34
000002.XSHE 27.27 28.02 27.55 30.70 28.70
000063.XSHE 39.00 33.10 32.27 34.73 33.65
000066.XSHE 17.71 16.04 15.15 14.42 18.99
000069.XSHE 7.15 6.78 6.56 7.28 7.09
[5 rows x 24 columns]
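4. Compute the corresponding returns
for i in range(len(all_price.columns) - 1):
    # (next month's close - this month's close) / this month's close
    all_price.iloc[:, i] = all_price.iloc[:, i + 1] / all_price.iloc[:, i] - 1
# The last column has no following month, so it still holds raw prices.
# It is never looked up, because the factor samples stop at month_date[:-1].
# Debug output
print(all_price.head())
Output: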
date 2019-01-31 2019-02-28 2019-03-29 2019-04-30 2019-05-31 \
000001.XSHE 0.113513 0.037216 0.080343 -0.120579 0.143713
000002.XSHE 0.008651 0.097533 -0.059895 -0.075488 0.041573
000063.XSHE 0.481851 -0.020133 0.101024 -0.105441 0.131085
000066.XSHE 0.291036 0.527461 -0.172186 0.117722 0.051121
000069.XSHE 0.081240 0.112720 0.032467 -0.104567 0.019060
date 2019-06-28 2019-07-31 2019-08-30 2019-09-30 2019-10-31 ... \
000001.XSHE 0.025397 0.002123 0.100987 0.042978 -0.059653 ...
000002.XSHE 0.034163 -0.066980 0.003874 0.024328 0.044101 ...
000063.XSHE 0.011987 -0.124848 0.111073 0.043424 -0.080540 ...
000066.XSHE -0.072062 0.123287 0.210132 0.122486 0.025549 ...
000069.XSHE 0.037423 -0.055485 0.032302 0.001434 -0.021313 ...
date 2020-03-31 2020-04-30 2020-05-29 2020-06-30 2020-07-31 \
000001.XSHE 0.088285 -0.050852 -0.015385 0.042187 0.130435
000002.XSHE 0.044838 -0.041045 0.017118 0.026778 0.053469
000063.XSHE -0.039953 -0.122657 0.113174 -0.023921 0.000919
000066.XSHE -0.009216 0.160611 -0.038600 0.394456 -0.032240
000069.XSHE 0.020343 -0.104292 0.089556 0.183168 -0.002789
date 2020-08-31 2020-09-30 2020-10-30 2020-11-30 2020-12-31
000001.XSHE 0.005968 0.170073 0.112113 -0.020263 19.34
000002.XSHE 0.027503 -0.016774 0.114338 -0.065147 28.70
000063.XSHE -0.151282 -0.025076 0.076232 -0.031097 33.65
000066.XSHE -0.094297 -0.055486 -0.048185 0.316921 18.99
000069.XSHE -0.051748 -0.032448 0.109756 -0.026099 7.09
[5 rows x 24 columns]
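The same next-month return can be computed without a loop. The following is a minimal sketch on made-up prices (the tickers `AAA` and `BBB` and all values are hypothetical); it also drops the final column, which has no following month and would otherwise stay as raw prices.

```python
import pandas as pd

# Hypothetical month-end closes: rows are stocks, columns are dates.
all_price = pd.DataFrame(
    {
        "2019-01-31": [10.0, 20.0],
        "2019-02-28": [11.0, 19.0],
        "2019-03-29": [12.1, 19.0],
    },
    index=["AAA", "BBB"],
)

# shift(-1, axis=1) moves next month's close into the current column,
# so each column becomes next_close / this_close - 1.
next_month_return = (all_price.shift(-1, axis=1) / all_price - 1).iloc[:, :-1]

print(next_month_return)
```

Working on a separate frame instead of overwriting `all_price` keeps the original prices available for later checks.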
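5. Fill in the next-month return for each factor sample
# Copy each return into the next_month_return column of the matching factor row.
# The index repeats every month, so rows are addressed by position
# (.iloc replaces .ix, which was removed from recent pandas versions).
date_col = all_data.columns.get_loc("date")
return_col = all_data.columns.get_loc("next_month_return")
for i in range(len(all_data)):
    # Stock code and date of this sample
    stock = all_data.index[i]
    date = all_data.iloc[i, date_col]
    # Look up the return in all_price
    if stock in all_price.index and date in all_price.columns:
        all_data.iloc[i, return_col] = all_price.loc[stock, date]
# Drop samples whose return is missing
all_data = all_data.dropna()
# Debug output
print(all_data.head())
Output: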
du_return_on_equity revenue return_on_asset_net_profit \
000002.XSHE 10.2572 1.76022e+11 1.6783
000001.XSHE 8.9467 8.6664e+10 0.6198
000066.XSHE 3.2718 6.27052e+09 1.5837
000063.XSHE -26.7064 5.87662e+10 -5.4534
000069.XSHE 9.6044 2.45503e+10 2.1907
pb_ratio pe_ratio earnings_per_share total_expense ev \
000002.XSHE 1.9667 9.0705 1.267 1.47083e+11 4.96679e+11
000001.XSHE 0.866 7.6796 1.14 6.005e+10 3.30656e+12
000066.XSHE 2.5219 14.0424 0.071 6.31538e+09 2.24526e+10
000063.XSHE 3.6823 -12.0731 -1.73 6.02975e+10 9.32565e+10
000069.XSHE 0.8881 4.9651 0.6205 1.7947e+10 1.5581e+11
market_cap date next_month_return
000002.XSHE 3.06336e+11 2019-01-31 0.008651
000001.XSHE 1.90592e+11 2019-01-31 0.113513
000066.XSHE 1.57378e+10 2019-01-31 0.291036
000063.XSHE 8.43146e+10 2019-01-31 0.481851
000069.XSHE 5.25061e+10 2019-01-31 0.081240
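With several thousand samples the row-by-row lookup is slow. One alternative, sketched here on made-up data (tickers, factor values and returns are all hypothetical), is to reshape the return table to long form and join it on stock and date.

```python
import pandas as pd

# Hypothetical factor samples: the index repeats once per month.
factors = pd.DataFrame(
    {"pe_ratio": [9.0, 7.5, 9.5],
     "date": ["2019-01-31", "2019-01-31", "2019-02-28"]},
    index=["AAA", "BBB", "AAA"],
)

# Hypothetical next-month returns: rows are stocks, columns are dates.
returns = pd.DataFrame(
    {"2019-01-31": [0.10, -0.05], "2019-02-28": [0.02, 0.03]},
    index=["AAA", "BBB"],
)

# Wide -> long: one row per (stock, date) pair.
long_returns = (
    returns.rename_axis("stock")
    .reset_index()
    .melt(id_vars="stock", var_name="date", value_name="next_month_return")
)

# An inner join keeps only the samples that have a matching return.
merged = (
    factors.rename_axis("stock")
    .reset_index()
    .merge(long_returns, on=["stock", "date"], how="inner")
    .set_index("stock")
)

print(merged)
```

The join keys must have the same type on both sides; here both date columns hold strings.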
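6. Process the feature and target values
def mad(factor):
    """Winsorize at 3 scaled median absolute deviations (MAD) from the median."""
    # Median of the factor values
    median = np.median(factor)
    # Median of the absolute deviations from the median
    mad_value = np.median(np.abs(factor - median))
    # Upper and lower bounds: 1.4826 scales the MAD to match a normal standard deviation
    high = median + (3 * 1.4826 * mad_value)
    low = median - (3 * 1.4826 * mad_value)
    # Clip values outside the bounds
    factor = np.where(factor > high, high, factor)
    factor = np.where(factor < low, low, factor)
    return factor
def stand(factor):
    """Standardize to zero mean and unit standard deviation."""
    mean = factor.mean()
    std = factor.std()
    return (factor - mean) / std
y = all_data["next_month_return"]
x = all_data.drop(["next_month_return", "date"], axis=1)
# Market cap, captured before the feature processing below
x_market_cap = x["market_cap"]
# Feature processing: winsorize, then standardize (which also centers the data)
for name in x.columns:
    x[name] = mad(x[name])
    x[name] = stand(x[name])
# Neutralization (feature: market cap, target: each of the other factors)
for name in x.columns:
    if name == "market_cap":
        continue
    # The factor being neutralized is the regression target
    y_factor = x[name]
    # Fit a linear regression of the factor on market cap
    lr = LinearRegression()
    lr.fit(x_market_cap.values.reshape(-1, 1), y_factor)
    y_predict = lr.predict(x_market_cap.values.reshape(-1, 1))
    # The residual (actual minus predicted) becomes the new factor value
    x[name] = y_factor - y_predict
# Standardize the target returns y
y = stand(y)
print(y)
Output: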
000002.XSHE -0.190612
000001.XSHE 0.656175
000066.XSHE 2.089711
000063.XSHE 3.630581
000069.XSHE 0.395561
000100.XSHE 1.149259
000157.XSHE 0.781338
000425.XSHE 1.423916
000538.XSHE 0.503104
000568.XSHE 1.361395
000596.XSHE 1.451692
000625.XSHE 0.824005
000627.XSHE 1.834775
000651.XSHE 0.446647
000656.XSHE 0.177199
000661.XSHE 1.485170
000671.XSHE 1.489377
000703.XSHE 0.345296
000708.XSHE 1.240191
000723.XSHE 1.583630
000725.XSHE 4.502545
000728.XSHE 2.082080
000768.XSHE 1.094074
000776.XSHE 2.157632
000783.XSHE 2.334973
000786.XSHE 1.180269
000858.XSHE 1.239401
000860.XSHE 0.213113
000876.XSHE 2.431267
000895.XSHE -0.535029
...
601808.XSHG -0.451852
601878.XSHG -0.843214
601881.XSHG -0.498526
601818.XSHG -0.911693
601901.XSHG 0.948128
601888.XSHG 3.469926
601877.XSHG 1.296525
601939.XSHG -1.243044
601872.XSHG -0.781879
601933.XSHG -0.921353
601899.XSHG -0.260467
601919.XSHG 1.603668
601990.XSHG -0.737300
601989.XSHG -0.539564
601985.XSHG -0.244020
603019.XSHG -0.518526
601998.XSHG -0.460806
601988.XSHG -0.554111
603156.XSHG -0.350088
603160.XSHG -0.810521
603259.XSHG 2.192650
603288.XSHG 1.836448
603369.XSHG 1.194465
603501.XSHG 0.326507
603658.XSHG -0.249327
603799.XSHG 3.903695
603833.XSHG 0.177536
603899.XSHG 1.493082
603986.XSHG -0.615777
603993.XSHG 2.659977
Name: next_month_return, Length: 6486, dtype: float64
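To make the effect of winsorizing and neutralizing concrete, here is a small sketch on synthetic numbers. The name `mad_clip` and all data are hypothetical, and `np.polyfit` is swapped in for `LinearRegression` only to keep the example dependency-free; for a single regressor with an intercept the two give the same least-squares fit.

```python
import numpy as np

def mad_clip(factor, n=3):
    """Clip values to median +/- n * 1.4826 * MAD (hypothetical helper)."""
    median = np.median(factor)
    mad_value = np.median(np.abs(factor - median))
    bound = n * 1.4826 * mad_value
    return np.clip(factor, median - bound, median + bound)

# One obvious outlier: median is 3 and MAD is 1,
# so the upper bound is 3 + 3 * 1.4826 = 7.4478.
factor = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = mad_clip(factor)
print(clipped)

# Neutralization: regress a factor on market cap and keep the residual.
rng = np.random.default_rng(0)
cap = rng.normal(size=200)                 # synthetic "market cap"
raw = 0.8 * cap + rng.normal(size=200)     # factor contaminated by size
slope, intercept = np.polyfit(cap, raw, 1)
residual = raw - (slope * cap + intercept)

# Least-squares residuals are uncorrelated with the regressor.
corr = np.corrcoef(cap, residual)[0, 1]
print(corr)
```

The near-zero correlation is the point of neutralization: whatever predictive power the residual factor has left cannot be attributed to a linear size effect.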
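7. Build the regression equation
# Regress the standardized next-period return (target) on the processed factor values (features)
lr = LinearRegression()
lr.fit(x, y)
# One coefficient per factor, in the column order of x
print(lr.coef_)
Output: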
[ 0.04549957 0.01249463 -0.02397849 0.06077185 -0.00195205 -0.00892116
-0.04641399 -0.05644752 -0.08393869]
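The overview describes a final step that the code above stops short of: plugging the latest factor values into the fitted equation and picking stocks by predicted return. The sketch below illustrates that step on synthetic data only. The factor names, stock codes, true coefficients and the choice of the top 5 are all assumptions made for the example, not values from the original.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
factor_names = ["f1", "f2", "f3"]  # hypothetical factor names

# Synthetic training samples with a known linear relationship plus noise.
x_train = pd.DataFrame(rng.normal(size=(500, 3)), columns=factor_names)
y_train = (
    0.5 * x_train["f1"]
    - 0.3 * x_train["f2"]
    + rng.normal(scale=0.1, size=500)
)

lr = LinearRegression().fit(x_train, y_train)

# Synthetic "latest month" cross-section for 20 hypothetical stocks.
stocks = [f"STOCK{i:02d}" for i in range(20)]
x_latest = pd.DataFrame(
    rng.normal(size=(20, 3)), index=stocks, columns=factor_names
)

# Predicted next-month return for each stock, then keep the top 5.
predicted = pd.Series(lr.predict(x_latest), index=stocks, name="predicted_return")
selected = predicted.sort_values(ascending=False).head(5).index.tolist()

print(selected)
```

In a real run the latest factor values would need the same winsorizing, standardizing and neutralizing as the training features before being passed to `predict`.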