Getting Started with statsmodels
This very simple case study is designed to get you up and running with statsmodels quickly. Starting from raw data, we will show the steps needed to estimate a statistical model and to draw a diagnostic plot. We will only use functions provided by statsmodels or its pandas and patsy dependencies.
Source: http://www.statsmodels.org/stable/gettingstarted.html
import statsmodels.api as sm
import pandas
from patsy import dmatrices  # patsy describes statistical models and builds design matrices using R-like formulas
# Download the Guerry dataset, a collection of historical data used in support of Andre-Michel Guerry's 1833 essay on the moral statistics of France
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df.head()
|   | dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | ... | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | E | Ain | 28870 | 15890 | 37 | 5098 | 33120 | 35039 | 2:Med | ... | 71 | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 |
| 1 | 2 | N | Aisne | 26226 | 5521 | 51 | 8901 | 14572 | 12831 | 2:Med | ... | 4 | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.00 |
| 2 | 3 | C | Allier | 26747 | 7925 | 13 | 10973 | 17044 | 114121 | 2:Med | ... | 46 | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 |
| 3 | 4 | E | Basses-Alpes | 12935 | 7289 | 46 | 2733 | 23018 | 14238 | 1:Sm | ... | 70 | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.90 |
| 4 | 5 | E | Hautes-Alpes | 17488 | 8174 | 69 | 6962 | 23076 | 16171 | 1:Sm | ... | 22 | 23 | 64 | 79 | 35 | 7 | 1 | 320.280 | 5549 | 129.10 |
5 rows × 23 columns
df.columns
Index(['dept', 'Region', 'Department', 'Crime_pers', 'Crime_prop', 'Literacy',
'Donations', 'Infants', 'Suicides', 'MainCity', 'Wealth', 'Commerce',
'Clergy', 'Crime_parents', 'Infanticide', 'Donation_clergy', 'Lottery',
'Desertion', 'Instruction', 'Prostitutes', 'Distance', 'Area',
'Pop1831'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 23 columns):
dept 86 non-null int64
Region 85 non-null object
Department 86 non-null object
Crime_pers 86 non-null int64
Crime_prop 86 non-null int64
Literacy 86 non-null int64
Donations 86 non-null int64
Infants 86 non-null int64
Suicides 86 non-null int64
MainCity 86 non-null object
Wealth 86 non-null int64
Commerce 86 non-null int64
Clergy 86 non-null int64
Crime_parents 86 non-null int64
Infanticide 86 non-null int64
Donation_clergy 86 non-null int64
Lottery 86 non-null int64
Desertion 86 non-null int64
Instruction 86 non-null int64
Prostitutes 86 non-null int64
Distance 86 non-null float64
Area 86 non-null int64
Pop1831 86 non-null float64
dtypes: float64(2), int64(18), object(3)
memory usage: 15.5+ KB
# Region has only 85 non-null values (one missing), so drop rows with missing data
df = df.dropna()
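Before dropping rows it is worth confirming exactly which columns are incomplete. A minimal sketch on a toy DataFrame (the values here are made up for illustration) mirrors the situation above:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the Guerry situation: one missing Region value
toy = pd.DataFrame({'Region': ['E', 'N', np.nan, 'S'],
                    'Lottery': [41.0, 38.0, 66.0, 80.0]})

print(toy.isnull().sum())        # per-column count of missing values
clean = toy.dropna()             # drops the single incomplete row
print(len(toy), '->', len(clean))
```

`isnull().sum()` is how you would discover that only `Region` needs attention; `dropna()` then removes every row containing any missing value.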
We want to know whether literacy rates in the 86 French departments were associated with per capita wagers on the Royal Lottery in the 1820s.
We need to control for the level of wealth in each department, and we also want to include a set of dummy variables on the right-hand side of the regression equation to control for unobserved heterogeneity due to regional effects.
The model is estimated using ordinary least squares (OLS) regression.
Design matrices
To fit most models in statsmodels we need two design matrices. The first is a matrix of endogenous variables (i.e. dependent, response, regressand). The second is a matrix of exogenous variables (i.e. independent, predictor, regressor).
# Use patsy's dmatrices function to create the design matrices
# y is an N×1 column of per capita lottery-wager data (Lottery); X is N×7: an intercept, the Literacy and Wealth variables, and 4 regional dummy variables
y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
y[:3]
|   | Lottery |
|---|---|
| 0 | 41.0 |
| 1 | 38.0 |
| 2 | 66.0 |
X[:3]
|   | Intercept | Region[T.E] | Region[T.N] | Region[T.S] | Region[T.W] | Literacy | Wealth |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 37.0 | 73.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51.0 | 22.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 61.0 |
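Building the matrices with `dmatrices` and passing them to `sm.OLS` is one route; the formula API (`statsmodels.formula.api.ols`) accepts the formula string and DataFrame directly and runs patsy internally. A minimal sketch on synthetic stand-in data (the values below are made up, not the Guerry data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Guerry data
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'Lottery':  rng.uniform(0, 100, size=40),
    'Literacy': rng.uniform(0, 100, size=40),
    'Wealth':   rng.uniform(0, 100, size=40),
    'Region':   np.tile(['C', 'E', 'N', 'S', 'W'], 8),
})

# smf.ols builds the same treatment-coded design matrix internally,
# so the fitted parameters carry the same names as the X matrix above
res = smf.ols('Lottery ~ Literacy + Wealth + Region', data=toy).fit()
print(res.params.index.tolist())
```

Note that one region (`C`, the first alphabetically) is absorbed into the intercept as the baseline category, which is why only four `Region[T.*]` dummies appear.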
# Model fitting and summary
mod = sm.OLS(y, X) # Describe model
res = mod.fit() # Fit model
print(res.summary()) # Summarize model
OLS Regression Results
==============================================================================
Dep. Variable: Lottery R-squared: 0.338
Model: OLS Adj. R-squared: 0.287
Method: Least Squares F-statistic: 6.636
Date: Thu, 09 May 2019 Prob (F-statistic): 1.07e-05
Time: 18:02:22 Log-Likelihood: -375.30
No. Observations: 85 AIC: 764.6
Df Residuals: 78 BIC: 781.7
Df Model: 6
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 38.6517 9.456 4.087 0.000 19.826 57.478
Region[T.E] -15.4278 9.727 -1.586 0.117 -34.793 3.938
Region[T.N] -10.0170 9.260 -1.082 0.283 -28.453 8.419
Region[T.S] -4.5483 7.279 -0.625 0.534 -19.039 9.943
Region[T.W] -10.0913 7.196 -1.402 0.165 -24.418 4.235
Literacy -0.1858 0.210 -0.886 0.378 -0.603 0.232
Wealth 0.4515 0.103 4.390 0.000 0.247 0.656
==============================================================================
Omnibus: 3.049 Durbin-Watson: 1.785
Prob(Omnibus): 0.218 Jarque-Bera (JB): 2.694
Skew: -0.340 Prob(JB): 0.260
Kurtosis: 2.454 Cond. No. 371.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
res.params
Intercept 38.651655
Region[T.E] -15.427785
Region[T.N] -10.016961
Region[T.S] -4.548257
Region[T.W] -10.091276
Literacy -0.185819
Wealth 0.451475
dtype: float64
# Apply the Rainbow test for linearity
print(sm.stats.linear_rainbow(res))  # The first number is the F-statistic, the second is its p-value
(0.847233997615691, 0.6997965543621644)
# Plot the partial regression of Lottery on Wealth, conditioning on Region and Literacy
sm.graphics.plot_partregress('Lottery', 'Wealth', ['Region', 'Literacy'],
                             data=df, obs_labels=False)
To recap
statsmodels is a statistical modeling library that is well worth learning.