Quantitative Trading in Practice: Stock Selection by Regression

Overview

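The regression method regresses past stock returns on multiple factors to obtain a regression equation. Plugging the latest factor values into that equation gives a forecast of future stock returns. Stocks are then selected on the basis of this forecast, and the selection model is evaluated for effectiveness and return.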
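The workflow described above can be sketched with synthetic numbers. This is only an illustration, not the article's pipeline: the factor values, the returns, and the helper name `select_top` are made up for the example, and ordinary least squares is solved with `np.linalg.lstsq`.

```python
import numpy as np

def select_top(past_factors, past_returns, latest_factors, top_n):
    """Fit returns on factors, score the latest factor values, pick the top_n stocks."""
    # Add an intercept column so the fitted equation is r = b0 + b1*f1 + b2*f2 + ...
    design = np.column_stack([np.ones(len(past_factors)), past_factors])
    coef, *_ = np.linalg.lstsq(design, past_returns, rcond=None)

    # Plug the latest factor values into the fitted equation.
    latest_design = np.column_stack([np.ones(len(latest_factors)), latest_factors])
    predicted = latest_design @ coef

    # Indices of the stocks with the highest predicted return, best first.
    picks = np.argsort(-predicted)[:top_n]
    return coef, predicted, picks

# Hypothetical history: 5 samples, 2 factors, returns generated as 0.01 + 0.02*f1 - 0.01*f2.
past_factors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [0.5, 2.0]])
past_returns = 0.01 + 0.02 * past_factors[:, 0] - 0.01 * past_factors[:, 1]

# Hypothetical latest factor values for 3 stocks.
latest_factors = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])

coef, predicted, picks = select_top(past_factors, past_returns, latest_factors, top_n=2)
print(coef)
print(predicted)
print(picks)
```

Because the synthetic returns are an exact linear function of the factors, the fit recovers the generating coefficients and the first and third stocks are picked. Real data is noisy, so the ranking is only as good as the fitted relationship.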

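The advantage of the regression method is that stocks' sensitivities to each factor can be re-estimated fairly promptly, and different stocks can have different sensitivities to different factors. Its drawback is that it is easily distorted by extreme values, and it tends to perform poorly in markets where factor sensitivities change sharply.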
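The sensitivity to extreme values is easy to see on a toy example. The sketch below uses made-up numbers and a hypothetical helper `ols_slope` that computes the single-factor least-squares slope in closed form.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Made-up factor values and returns lying exactly on y = 0.01 * x.
factor = [1.0, 2.0, 3.0, 4.0, 5.0]
returns = [0.01, 0.02, 0.03, 0.04, 0.05]
clean_slope = ols_slope(factor, returns)

# Replace the last return with an extreme value (a made-up +100% month).
returns_with_outlier = returns[:-1] + [1.0]
outlier_slope = ols_slope(factor, returns_with_outlier)

print(clean_slope, outlier_slope)
```

A single extreme observation moves the estimated slope from 0.01 to 0.2, which is why the factor values are winsorized before the regression is fitted.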

Regression-Based Stock Selection Workflow

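Positions are usually rebalanced monthly. For the regression, we take several factor values on the last day of a month together with the stock returns over the following month, and fit a cross-sectional regression equation.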

Step-by-Step Breakdown

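  1. Regression training interval: 2019-01-01 ~ 2021-01-01
  2. Regression stock universe: HS300 (CSI 300) index constituents
  3. Prepare the factor data and compute returns
    • Factor data: concatenate the cross-sectional data, add a date column, drop missing values
    • Return calculation: compute the return for every sample
  4. Extract the target and feature values and estimate the regression
    • Data processing: drop samples whose return is 0 (no price data), winsorize, standardize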
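The data-processing items in step 4 can be sketched on a toy array. The numbers are invented and the helper names `winsorize_mad` and `zscore` are placeholders for this illustration. The constant 1.4826 rescales the median absolute deviation (MAD) so that it is comparable to a standard deviation for normally distributed data.

```python
import numpy as np

def winsorize_mad(values, n=3.0):
    """Clip values to median +/- n * 1.4826 * MAD."""
    median = np.median(values)
    mad_value = np.median(np.abs(values - median))
    bound = n * 1.4826 * mad_value
    return np.clip(values, median - bound, median + bound)

def zscore(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    return (values - values.mean()) / values.std()

# Made-up factor values and next-month returns for 6 samples.
factor = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
returns = np.array([0.02, 0.0, -0.01, 0.03, 0.01, 0.05])

# 1. Drop samples whose return is exactly 0 (treated as missing price data).
keep = returns != 0
factor, returns = factor[keep], returns[keep]

# 2. Winsorize the factor, then 3. standardize it.
clipped = winsorize_mad(factor)
standardized = zscore(clipped)
print(clipped)
print(standardized)
```

Here the extreme value 100 is pulled back to the upper bound 8.4478 (median 4 plus 3 × 1.4826 × MAD, with MAD equal to 1), while the other values are left unchanged.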

Code

Imports

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

1. Prepare the date data

# Trading days from 2019-01-01 to 2021-01-01
dates = get_trading_dates(start_date="2019-01-01", end_date="2021-01-01")

# Last trading day of each month; returns are computed month over month
month_date = []
for i in range(len(dates) - 1):
    if dates[i].month != dates[i + 1].month:
        month_date.append(dates[i])

# Append the final trading day of the range
month_date.append(dates[-1])

# Debug output
print(month_date)
print(month_date[:-1])
print(np.shape(month_date))

Output:

[datetime.date(2019, 1, 31), datetime.date(2019, 2, 28), datetime.date(2019, 3, 29), datetime.date(2019, 4, 30), datetime.date(2019, 5, 31), datetime.date(2019, 6, 28), datetime.date(2019, 7, 31), datetime.date(2019, 8, 30), datetime.date(2019, 9, 30), datetime.date(2019, 10, 31), datetime.date(2019, 11, 29), datetime.date(2019, 12, 31), datetime.date(2020, 1, 23), datetime.date(2020, 2, 28), datetime.date(2020, 3, 31), datetime.date(2020, 4, 30), datetime.date(2020, 5, 29), datetime.date(2020, 6, 30), datetime.date(2020, 7, 31), datetime.date(2020, 8, 31), datetime.date(2020, 9, 30), datetime.date(2020, 10, 30), datetime.date(2020, 11, 30), datetime.date(2020, 12, 31)]
[datetime.date(2019, 1, 31), datetime.date(2019, 2, 28), datetime.date(2019, 3, 29), datetime.date(2019, 4, 30), datetime.date(2019, 5, 31), datetime.date(2019, 6, 28), datetime.date(2019, 7, 31), datetime.date(2019, 8, 30), datetime.date(2019, 9, 30), datetime.date(2019, 10, 31), datetime.date(2019, 11, 29), datetime.date(2019, 12, 31), datetime.date(2020, 1, 23), datetime.date(2020, 2, 28), datetime.date(2020, 3, 31), datetime.date(2020, 4, 30), datetime.date(2020, 5, 29), datetime.date(2020, 6, 30), datetime.date(2020, 7, 31), datetime.date(2020, 8, 31), datetime.date(2020, 9, 30), datetime.date(2020, 10, 30), datetime.date(2020, 11, 30)]
(24,)
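The month-end logic above relies on the platform call `get_trading_dates`, so it cannot be run outside that environment. The sketch below applies the same rule to a stand-in calendar built from weekdays only. The helper name `month_end_dates` and the stand-in calendar are assumptions made for this illustration; a real trading calendar also excludes holidays.

```python
from datetime import date, timedelta

def month_end_dates(trading_dates):
    """Return the last trading date of each month, plus the final date in the list."""
    ends = [
        current
        for current, following in zip(trading_dates, trading_dates[1:])
        if current.month != following.month
    ]
    if trading_dates:
        ends.append(trading_dates[-1])
    return ends

# Stand-in calendar: every weekday from 2019-01-01 to 2019-03-15.
start = date(2019, 1, 1)
weekdays = [start + timedelta(days=i) for i in range(74)]
weekdays = [d for d in weekdays if d.weekday() < 5]

print(month_end_dates(weekdays))
```

As in the listing above, the final date is appended even when it is not a true month end, so that the last factor date still has a price to compute a return against.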

2. Prepare the factor data

# Constituents of the CSI 300 (HS300) index
stocks = index_components("000300.XSHG")

all_data = pd.DataFrame()

# Features are each month's factor values; the last month is skipped
# because its following month is not in the date list
for date in month_date[:-1]:
    q = query(
        fundamentals.eod_derivative_indicator.pe_ratio,
        fundamentals.eod_derivative_indicator.pb_ratio,
        fundamentals.eod_derivative_indicator.market_cap,
        fundamentals.financial_indicator.ev,
        fundamentals.financial_indicator.return_on_asset_net_profit,
        fundamentals.financial_indicator.du_return_on_equity,
        fundamentals.financial_indicator.earnings_per_share,
        fundamentals.income_statement.revenue,
        fundamentals.income_statement.total_expense
    ).filter(
        fundamentals.stockcode.in_(stocks)
    )

    # Query the factor data
    fund = get_fundamentals(q, entry_date=date).iloc[:, 0, :]

    # Add the date column
    fund["date"] = date

    # Concatenate the monthly factor data
    all_data = pd.concat([all_data, fund])


# Basic cleaning: dropna() returns a new frame, so assign the result back
all_data = all_data.dropna()
all_data["next_month_return"] = np.nan

# Debug output
print(all_data.head())

Debug output:

            du_return_on_equity      revenue return_on_asset_net_profit  \
000002.XSHE             10.2572  1.76022e+11                     1.6783   
000001.XSHE              8.9467   8.6664e+10                     0.6198   
000066.XSHE              3.2718  6.27052e+09                     1.5837   
000063.XSHE            -26.7064  5.87662e+10                    -5.4534   
000069.XSHE              9.6044  2.45503e+10                     2.1907   

            pb_ratio pe_ratio earnings_per_share total_expense           ev  \
000002.XSHE   1.9667   9.0705              1.267   1.47083e+11  4.96679e+11   
000001.XSHE    0.866   7.6796               1.14     6.005e+10  3.30656e+12   
000066.XSHE   2.5219  14.0424              0.071   6.31538e+09  2.24526e+10   
000063.XSHE   3.6823 -12.0731              -1.73   6.02975e+10  9.32565e+10   
000069.XSHE   0.8881   4.9651             0.6205    1.7947e+10   1.5581e+11   

              market_cap        date  next_month_return  
000002.XSHE  3.06336e+11  2019-01-31                NaN  
000001.XSHE  1.90592e+11  2019-01-31                NaN  
000066.XSHE  1.57378e+10  2019-01-31                NaN  
000063.XSHE  8.43146e+10  2019-01-31                NaN  
000069.XSHE  5.25061e+10  2019-01-31                NaN  

3. Fetch the prices

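# Build one sample per stock per month: that month's factor values paired with next month's return

# Month-end close prices
all_price = pd.DataFrame()
for date in month_date:
    price = get_price(stocks, start_date=date, end_date=date, fields="close")
    all_price = pd.concat([all_price, price])


# Transpose so that rows are stocks and columns are dates
all_price = all_price.T

# Drop stocks with missing prices
all_price = all_price.dropna()

# Debug output
print(all_price.head())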

Output:

date         2019-01-31  2019-02-28  2019-03-29  2019-04-30  2019-05-31  \
000001.XSHE     10.7961     12.0216     12.4690     13.4708     11.8465   
000002.XSHE     25.7326     25.9552     28.4867     26.7805     24.7589   
000063.XSHE     20.0043     29.6434     29.0466     31.9810     28.6089   
000066.XSHE      5.3014      6.8443     10.4544      8.6543      9.6731   
000069.XSHE      5.8395      6.3139      7.0256      7.2537      6.4952   

date         2019-06-28  2019-07-31  2019-08-30  2019-09-30  2019-10-31  ...  \
000001.XSHE     13.5490     13.8931     13.9226     15.3286     15.9874  ...   
000002.XSHE     25.7882     26.6692     24.8829     24.9793     25.5870  ...   
000063.XSHE     32.3591     32.7470     28.6586     31.8418     33.2245  ...   
000066.XSHE     10.1676      9.4349     10.5981     12.8251     14.3960  ...   
000069.XSHE      6.6190      6.8667      6.4857      6.6952      6.7048  ...   

date         2020-03-31  2020-04-30  2020-05-29  2020-06-30  2020-07-31  \
000001.XSHE     12.5854     13.6965     13.0000     12.8000     13.3400   
000002.XSHE     24.7382     25.8474     24.7865     25.2108     25.8859   
000063.XSHE     42.5751     40.8741     35.8606     39.9191     38.9642   
000066.XSHE     11.8707     11.7613     13.6503     13.1234     18.3000   
000069.XSHE      6.0857      6.2095      5.5619      6.0600      7.1700   

date         2020-08-31  2020-09-30  2020-10-30  2020-11-30  2020-12-31  
000001.XSHE       15.08       15.17       17.75       19.74       19.34  
000002.XSHE       27.27       28.02       27.55       30.70       28.70  
000063.XSHE       39.00       33.10       32.27       34.73       33.65  
000066.XSHE       17.71       16.04       15.15       14.42       18.99  
000069.XSHE        7.15        6.78        6.56        7.28        7.09  

[5 rows x 24 columns]

4. Compute the corresponding returns

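for i in range(len(all_price.columns) - 1):

    # Next-month return = (next month's close - this month's close) / this month's close
    all_price.iloc[:, i] = all_price.iloc[:, i + 1] / all_price.iloc[:, i] - 1


# The last column (2020-12-31) has no following month and still holds raw prices;
# it is never looked up, because the factor data stops at month_date[:-1].

# Debug output
print(all_price.head())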

Output:

date         2019-01-31  2019-02-28  2019-03-29  2019-04-30  2019-05-31  \
000001.XSHE    0.113513    0.037216    0.080343   -0.120579    0.143713   
000002.XSHE    0.008651    0.097533   -0.059895   -0.075488    0.041573   
000063.XSHE    0.481851   -0.020133    0.101024   -0.105441    0.131085   
000066.XSHE    0.291036    0.527461   -0.172186    0.117722    0.051121   
000069.XSHE    0.081240    0.112720    0.032467   -0.104567    0.019060   

date         2019-06-28  2019-07-31  2019-08-30  2019-09-30  2019-10-31  ...  \
000001.XSHE    0.025397    0.002123    0.100987    0.042978   -0.059653  ...   
000002.XSHE    0.034163   -0.066980    0.003874    0.024328    0.044101  ...   
000063.XSHE    0.011987   -0.124848    0.111073    0.043424   -0.080540  ...   
000066.XSHE   -0.072062    0.123287    0.210132    0.122486    0.025549  ...   
000069.XSHE    0.037423   -0.055485    0.032302    0.001434   -0.021313  ...   

date         2020-03-31  2020-04-30  2020-05-29  2020-06-30  2020-07-31  \
000001.XSHE    0.088285   -0.050852   -0.015385    0.042187    0.130435   
000002.XSHE    0.044838   -0.041045    0.017118    0.026778    0.053469   
000063.XSHE   -0.039953   -0.122657    0.113174   -0.023921    0.000919   
000066.XSHE   -0.009216    0.160611   -0.038600    0.394456   -0.032240   
000069.XSHE    0.020343   -0.104292    0.089556    0.183168   -0.002789   

date         2020-08-31  2020-09-30  2020-10-30  2020-11-30  2020-12-31  
000001.XSHE    0.005968    0.170073    0.112113   -0.020263       19.34  
000002.XSHE    0.027503   -0.016774    0.114338   -0.065147       28.70  
000063.XSHE   -0.151282   -0.025076    0.076232   -0.031097       33.65  
000066.XSHE   -0.094297   -0.055486   -0.048185    0.316921       18.99  
000069.XSHE   -0.051748   -0.032448    0.109756   -0.026099        7.09  

[5 rows x 24 columns]
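The column-by-column loop above can also be written as one vectorized expression. The sketch below is an alternative formulation, not the article's code; it uses NumPy on a few closes copied from the price table in step 3.

```python
import numpy as np

# Month-end closes for two stocks, copied from the first three columns of the price table.
closes = np.array([
    [10.7961, 12.0216, 12.4690],   # 000001.XSHE
    [25.7326, 25.9552, 28.4867],   # 000002.XSHE
])

# Return earned over the month that follows each month-end: next close / this close - 1.
next_month_returns = closes[:, 1:] / closes[:, :-1] - 1
print(next_month_returns.round(6))
```

The result has one column fewer than the input, so there is no leftover column of raw prices as in the in-place loop.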

5. Fill in the next-month returns

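# Fill each factor sample's next_month_return with the matching return
date_col = all_data.columns.get_loc("date")
return_col = all_data.columns.get_loc("next_month_return")

for i in range(len(all_data)):
    # Stock code and factor date of the i-th sample
    # (positional access, because the index holds repeated stock codes)
    stock = all_data.index[i]
    date = all_data.iloc[i, date_col]

    # Look up the return in all_price
    if stock in all_price.index and date in all_price.columns:
        all_data.iloc[i, return_col] = all_price.loc[stock, date]


# Drop samples without a return
all_data = all_data.dropna()

# Debug output
print(all_data.head())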

Output:

            du_return_on_equity      revenue return_on_asset_net_profit  \
000002.XSHE             10.2572  1.76022e+11                     1.6783   
000001.XSHE              8.9467   8.6664e+10                     0.6198   
000066.XSHE              3.2718  6.27052e+09                     1.5837   
000063.XSHE            -26.7064  5.87662e+10                    -5.4534   
000069.XSHE              9.6044  2.45503e+10                     2.1907   

            pb_ratio pe_ratio earnings_per_share total_expense           ev  \
000002.XSHE   1.9667   9.0705              1.267   1.47083e+11  4.96679e+11   
000001.XSHE    0.866   7.6796               1.14     6.005e+10  3.30656e+12   
000066.XSHE   2.5219  14.0424              0.071   6.31538e+09  2.24526e+10   
000063.XSHE   3.6823 -12.0731              -1.73   6.02975e+10  9.32565e+10   
000069.XSHE   0.8881   4.9651             0.6205    1.7947e+10   1.5581e+11   

              market_cap        date  next_month_return  
000002.XSHE  3.06336e+11  2019-01-31           0.008651  
000001.XSHE  1.90592e+11  2019-01-31           0.113513  
000066.XSHE  1.57378e+10  2019-01-31           0.291036  
000063.XSHE  8.43146e+10  2019-01-31           0.481851  
000069.XSHE  5.25061e+10  2019-01-31           0.081240 
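The fill step is essentially a keyed lookup on (stock, date). The plain-Python sketch below shows the idea in isolation. The two returns are copied from the tables above, while the stock code `000004.XSHE` and the string dates are placeholders chosen for the illustration.

```python
# Next-month returns keyed by (stock, factor date).
returns_by_key = {
    ("000001.XSHE", "2019-01-31"): 0.113513,
    ("000002.XSHE", "2019-01-31"): 0.008651,
}

# Factor samples as (stock, date) pairs; the last one has no matching return.
samples = [
    ("000002.XSHE", "2019-01-31"),
    ("000001.XSHE", "2019-01-31"),
    ("000004.XSHE", "2019-01-31"),  # placeholder code without price data
]

# dict.get returns None for a missing key, playing the role of the NaN left by the loop.
filled = [(stock, day, returns_by_key.get((stock, day))) for stock, day in samples]

# Keep only the samples that received a return.
labelled = [row for row in filled if row[2] is not None]
print(labelled)
```

Samples without a matching price are dropped, which is what the final `dropna()` achieves in the listing above.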

6. Process the features and the target

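def mad(factor):
    """Winsorize at 3 scaled median absolute deviations (MAD) from the median."""

    # Median of the factor values
    median = np.median(factor)

    # Median of the absolute deviations from the median
    mad_value = np.median(abs(factor - median))

    # Upper and lower bounds: median +/- 3 * 1.4826 * MAD
    high = median + (3 * 1.4826 * mad_value)
    low = median - (3 * 1.4826 * mad_value)

    # Clip values that fall outside the bounds
    factor = np.where(factor > high, high, factor)
    factor = np.where(factor < low, low, factor)
    return factor

def stand(factor):
    """Standardize to zero mean and unit standard deviation."""
    mean = factor.mean()
    std = factor.std()

    return (factor - mean) / std

y = all_data["next_month_return"]
x = all_data.drop(["next_month_return", "date"], axis=1)
# Raw market cap, captured before the processing below
x_market_cap = x["market_cap"]

# Feature processing (winsorization, then centering and scaling)
for name in x.columns:
    x[name] = mad(x[name])
    x[name] = stand(x[name])

# Market-cap neutralization (regressor: market cap, target: each other factor)

for name in x.columns:
    if name == "market_cap":
        continue
    # Target: the factor to be neutralized
    y_factor = x[name]

    # Fit a linear regression of the factor on market cap
    lr = LinearRegression()
    lr.fit(x_market_cap.values.reshape(-1, 1), y_factor)

    y_predict = lr.predict(x_market_cap.values.reshape(-1, 1))

    # Use the residual (actual minus predicted) as the new factor value
    x[name] = y_factor - y_predict


# Target y: next-month return (standardized)
y = stand(y)
print(y)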

Output:

000002.XSHE   -0.190612
000001.XSHE    0.656175
000066.XSHE    2.089711
000063.XSHE    3.630581
000069.XSHE    0.395561
000100.XSHE    1.149259
000157.XSHE    0.781338
000425.XSHE    1.423916
000538.XSHE    0.503104
000568.XSHE    1.361395
000596.XSHE    1.451692
000625.XSHE    0.824005
000627.XSHE    1.834775
000651.XSHE    0.446647
000656.XSHE    0.177199
000661.XSHE    1.485170
000671.XSHE    1.489377
000703.XSHE    0.345296
000708.XSHE    1.240191
000723.XSHE    1.583630
000725.XSHE    4.502545
000728.XSHE    2.082080
000768.XSHE    1.094074
000776.XSHE    2.157632
000783.XSHE    2.334973
000786.XSHE    1.180269
000858.XSHE    1.239401
000860.XSHE    0.213113
000876.XSHE    2.431267
000895.XSHE   -0.535029
                 ...   
601808.XSHG   -0.451852
601878.XSHG   -0.843214
601881.XSHG   -0.498526
601818.XSHG   -0.911693
601901.XSHG    0.948128
601888.XSHG    3.469926
601877.XSHG    1.296525
601939.XSHG   -1.243044
601872.XSHG   -0.781879
601933.XSHG   -0.921353
601899.XSHG   -0.260467
601919.XSHG    1.603668
601990.XSHG   -0.737300
601989.XSHG   -0.539564
601985.XSHG   -0.244020
603019.XSHG   -0.518526
601998.XSHG   -0.460806
601988.XSHG   -0.554111
603156.XSHG   -0.350088
603160.XSHG   -0.810521
603259.XSHG    2.192650
603288.XSHG    1.836448
603369.XSHG    1.194465
603501.XSHG    0.326507
603658.XSHG   -0.249327
603799.XSHG    3.903695
603833.XSHG    0.177536
603899.XSHG    1.493082
603986.XSHG   -0.615777
603993.XSHG    2.659977
Name: next_month_return, Length: 6486, dtype: float64
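The neutralization step above replaces each factor with the residual from regressing it on market cap. The sketch below shows, on made-up numbers, why that removes the linear market-cap exposure. The helper name `neutralize` is hypothetical, and the fit uses `np.linalg.lstsq` instead of `LinearRegression`, which gives the same least-squares line.

```python
import numpy as np

def neutralize(factor, market_cap):
    """Return the part of `factor` that a straight-line fit on `market_cap` cannot explain."""
    design = np.column_stack([np.ones(len(market_cap)), market_cap])
    coef, *_ = np.linalg.lstsq(design, factor, rcond=None)
    return factor - design @ coef

# Made-up standardized market caps and a factor that partly tracks them.
market_cap = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
noise = np.array([0.2, -0.1, 0.0, -0.3, 0.2])
factor = 0.8 * market_cap + noise

residual = neutralize(factor, market_cap)

# Least-squares residuals are orthogonal to the regressors,
# so this correlation is zero up to floating-point rounding.
print(np.corrcoef(residual, market_cap)[0, 1])
```

After neutralization the factor no longer carries a linear size effect, so its regression coefficient reflects the factor itself and not a hidden large-cap or small-cap tilt.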

7. Fit the regression equation

# Regress the standardized next-period return on the processed factor values
lr = LinearRegression()

lr.fit(x, y)
print(lr.coef_)

Output:

[ 0.04549957  0.01249463 -0.02397849  0.06077185 -0.00195205 -0.00892116
 -0.04641399 -0.05644752 -0.08393869]
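To read the coefficients, it helps to pair each one with its factor name. The sketch below assumes that `x.columns` follows the column order printed in the debug output of step 5; in the live session the same pairing would come from `zip(x.columns, lr.coef_)`.

```python
# Column order as printed in the debug output (an assumption about x.columns).
factor_names = [
    "du_return_on_equity", "revenue", "return_on_asset_net_profit",
    "pb_ratio", "pe_ratio", "earnings_per_share",
    "total_expense", "ev", "market_cap",
]

# Coefficients copied from the output above.
coefficients = [
    0.04549957, 0.01249463, -0.02397849, 0.06077185, -0.00195205,
    -0.00892116, -0.04641399, -0.05644752, -0.08393869,
]

# Sort factors by the absolute size of their coefficient, largest first.
ranked = sorted(zip(factor_names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranked:
    print(f"{name:28s} {weight:+.4f}")
```

Under that assumption, market cap carries the largest weight in absolute terms and the P/E ratio the smallest. Market cap was standardized but not neutralized, so it still enters the regression as a factor in its own right.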


Reposted from blog.csdn.net/weixin_46274168/article/details/115220865