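This exercise comes from an advanced programming course.
Given a CSV file, complete the following two problems:
The corresponding code is as follows: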
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the data; the 'dataset' column labels each of the four groups.
anascombe = pd.read_csv('anscombe.csv')

# Per-dataset mean and variance of x and y, then the x-y correlation.
print(anascombe.groupby('dataset')['x'].mean())
print(anascombe.groupby('dataset')['y'].mean())
print(anascombe.groupby('dataset')['x'].var())
print(anascombe.groupby('dataset')['y'].var())
print(anascombe.groupby('dataset').corr())

# Fit a linear regression y ~ x on a random ~70% subset of each dataset.
dataset_names = ['I', 'II', 'III', 'IV']
for i in dataset_names:
    n = len(anascombe[anascombe.dataset == i])
    is_train = np.random.rand(n) < 0.7
    train = anascombe[anascombe.dataset == i][is_train].reset_index(drop=True)
    # Held-out rows; they are not used further in this script.
    test = anascombe[anascombe.dataset == i][~is_train].reset_index(drop=True)
    lin_model = smf.ols('y ~ x', train).fit()
    print(lin_model.summary())

# One scatter plot of y against x per dataset.
g = sns.FacetGrid(anascombe, col='dataset')
g.map(plt.scatter, 'x', 'y')
plt.show()
Command-line output of the program:
dataset
I      9.0
II     9.0
III    9.0
IV     9.0
Name: x, dtype: float64
dataset
I      7.500909
II     7.500909
III    7.500000
IV     7.500909
Name: y, dtype: float64
dataset
I      11.0
II     11.0
III    11.0
IV     11.0
Name: x, dtype: float64
dataset
I      4.127269
II     4.127629
III    4.122620
IV     4.123249
Name: y, dtype: float64
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.650
Model:                            OLS   Adj. R-squared:                  0.592
Method:                 Least Squares   F-statistic:                     11.15
Date:                Sun, 10 Jun 2018   Prob (F-statistic):             0.0156
Time:                        12:18:34   Log-Likelihood:                -12.931
No. Observations:                   8   AIC:                             29.86
Df Residuals:                       6   BIC:                             30.02
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4459      1.497      1.634      0.153      -1.216       6.108
x              0.5464      0.164      3.339      0.016       0.146       0.947
==============================================================================
Omnibus:                        0.157   Durbin-Watson:                   3.211
Prob(Omnibus):                  0.925   Jarque-Bera (JB):                0.343
Skew:                          -0.096   Prob(JB):                        0.842
Kurtosis:                       2.004   Cond. No.                         27.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.654
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     15.10
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00464
Time:                        12:18:34   Log-Likelihood:                -15.546
No. Observations:                  10   AIC:                             35.09
Df Residuals:                       8   BIC:                             35.70
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0642      1.169      2.621      0.031       0.369       5.760
x              0.4842      0.125      3.886      0.005       0.197       0.772
==============================================================================
Omnibus:                        1.436   Durbin-Watson:                   2.438
Prob(Omnibus):                  0.488   Jarque-Bera (JB):                0.889
Skew:                          -0.413   Prob(JB):                        0.641
Kurtosis:                       1.795   Cond. No.                         27.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.699e+06
Date:                Sun, 10 Jun 2018   Prob (F-statistic):           2.08e-12
Time:                        12:18:34   Log-Likelihood:                 29.314
No. Observations:                   6   AIC:                            -54.63
Df Residuals:                       4   BIC:                            -55.04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0098      0.003   1498.423      0.000       4.002       4.017
x              0.3451      0.000   1303.508      0.000       0.344       0.346
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.677
Prob(Omnibus):                    nan   Jarque-Bera (JB):                2.907
Skew:                           1.640   Prob(JB):                        0.234
Kurtosis:                       3.933   Cond. No.                         29.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
  "anyway, n=%i" % int(n))
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\regression\linear_model.py:1633: RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(eigvals[0]/eigvals[-1])
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\regression\linear_model.py:1554: RuntimeWarning: divide by zero encountered in double_scalars
  return self.ess/self.df_model
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                      -0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                      -inf
Date:                Sun, 10 Jun 2018   Prob (F-statistic):                nan
Time:                        12:18:34   Log-Likelihood:                -13.393
No. Observations:                   9   AIC:                             28.79
Df Residuals:                       8   BIC:                             28.98
Df Model:                           0
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1107      0.006     18.991      0.000       0.097       0.124
x              0.8856      0.047     18.991      0.000       0.778       0.993
==============================================================================
Omnibus:                        0.591   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.744   Jarque-Bera (JB):                0.509
Skew:                          -0.052   Prob(JB):                        0.775
Kurtosis:                       1.840   Cond. No.                          inf
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

C:\Users\10617\Desktop\Python\statistics_exercise\cme193-ipython-notebooks-lecture-master\data1.py
dataset
I      9.0
II     9.0
III    9.0
IV     9.0
Name: x, dtype: float64
dataset
I      7.500909
II     7.500909
III    7.500000
IV     7.500909
Name: y, dtype: float64
dataset
I      11.0
II     11.0
III    11.0
IV     11.0
Name: x, dtype: float64
dataset
I      4.127269
II     4.127629
III    4.122620
IV     4.123249
Name: y, dtype: float64
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.144
Model:                            OLS   Adj. R-squared:                 -0.070
Method:                 Least Squares   F-statistic:                    0.6714
Date:                Sun, 10 Jun 2018   Prob (F-statistic):              0.459
Time:                        12:20:16   Log-Likelihood:                -9.2736
No. Observations:                   6   AIC:                             22.55
Df Residuals:                       4   BIC:                             22.13
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.5660      3.535      1.575      0.190      -4.249      15.381
x              0.2723      0.332      0.819      0.459      -0.650       1.195
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.587
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.403
Skew:                           0.513   Prob(JB):                        0.818
Kurtosis:                       2.252   Cond. No.                         66.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.696
Model:                            OLS   Adj. R-squared:                  0.658
Method:                 Least Squares   F-statistic:                     18.33
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00268
Time:                        12:20:16   Log-Likelihood:                -15.103
No. Observations:                  10   AIC:                             34.21
Df Residuals:                       8   BIC:                             34.81
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.8740      1.120      2.565      0.033       0.291       5.457
x              0.5000      0.117      4.281      0.003       0.231       0.769
==============================================================================
Omnibus:                        1.425   Durbin-Watson:                   2.338
Prob(Omnibus):                  0.490   Jarque-Bera (JB):                0.931
Skew:                          -0.471   Prob(JB):                        0.628
Kurtosis:                       1.840   Cond. No.                         28.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 7.652e+05
Date:                Sun, 10 Jun 2018   Prob (F-statistic):           3.71e-14
Time:                        12:20:16   Log-Likelihood:                 31.802
No. Observations:                   7   AIC:                            -59.60
Df Residuals:                       5   BIC:                            -59.71
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0036      0.004   1102.706      0.000       3.994       4.013
x              0.3456      0.000    874.754      0.000       0.345       0.347
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.583
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.574
Skew:                           0.284   Prob(JB):                        0.750
Kurtosis:                       1.717   Cond. No.                         29.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.803
Model:                            OLS   Adj. R-squared:                  0.754
Method:                 Least Squares   F-statistic:                     16.34
Date:                Sun, 10 Jun 2018   Prob (F-statistic):             0.0156
Time:                        12:20:16   Log-Likelihood:                -8.3460
No. Observations:                   6   AIC:                             20.69
Df Residuals:                       4   BIC:                             20.28
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.3904      1.264      2.683      0.055      -0.118       6.899
x              0.4795      0.119      4.042      0.016       0.150       0.809
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.450
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.200
Skew:                           0.199   Prob(JB):                        0.905
Kurtosis:                       2.199   Cond. No.                         27.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
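Plot output:
In the third plot, x and y follow a linear relationship most closely, and in the regression analysis the third dataset also has the smallest errors (R-squared of 1.000 and coefficient standard errors no larger than 0.004 in both runs).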
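The regression summaries above are fitted on random subsets of about 70% of each dataset, so their numbers change from run to run. As a rough cross-check, here is a minimal sketch that fits y = intercept + slope * x on each full dataset using only the standard library. The data values are the commonly published Anscombe quartet typed in by hand, on the assumption that they match anscombe.csv; fit_line and summarize are hypothetical helper names, not part of the original script.

```python
# Anscombe's quartet, typed in directly so the sketch runs without anscombe.csv.
# Assumption: these values match the contents of the CSV used in the post.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared, residual_std_error).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / syy
    resid_se = (sse / (n - 2)) ** 0.5  # two fitted parameters
    return intercept, slope, r_squared, resid_se


def summarize(quartet=QUARTET):
    """Fit every dataset and return {name: (intercept, slope, r2, resid_se)}."""
    return {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}


def fit_iii_without_outlier():
    """Refit dataset III after dropping its single outlier (x=13, y=12.74)."""
    xs, ys = QUARTET["III"]
    kept = [(x, y) for x, y in zip(xs, ys) if not (x == 13 and y == 12.74)]
    return fit_line([x for x, _ in kept], [y for _, y in kept])


if __name__ == "__main__":
    for name, (b0, b1, r2, se) in summarize().items():
        print(f"{name:>3}: y = {b0:.3f} + {b1:.3f} x   R^2 = {r2:.3f}   resid. SE = {se:.3f}")
    b0, b1, r2, se = fit_iii_without_outlier()
    print(f"III without outlier: y = {b0:.3f} + {b1:.3f} x   R^2 = {r2:.3f}")
```

On the full data, all four fits come out almost the same (slope about 0.50, intercept about 3.0, R-squared about 0.67). Dataset III only becomes a near-perfect line once its one outlier is left out, which suggests that the training subsets in the runs above happened to exclude that point.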