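Statistical Learning (II): Statistical Inference

Concepts

Statistical inference is the task of inferring, from a given sample $x_1, x_2,\dots, x_n$, the population distribution $F$ or numerical characteristics of $F$ such as its mean and variance.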

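Statistical Models

Parametric Models

A parametric model is a set of distributions $\mathfrak{F}$ whose members can be indexed by a finite number of parameters.

Example 2.1 The family of one-dimensional normal distributions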

$\mathfrak{F}=\left\{f(x; \mu, \sigma^2): f(x; \mu, \sigma^2)= \dfrac{1}{\sqrt{2\pi}\, \sigma} \exp\left\{-\dfrac{1}{2\sigma^2}(x-\mu)^2\right\},\ \mu\in \mathbb{R},\, \sigma^2\in \mathbb{R}^+\right\}$

In general, a parametric model can be written as

$\mathfrak{F}=\{f(x; \theta):\, \theta\in\Theta\}$, where $\Theta$ is the parameter space.
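As a minimal sketch of what "indexed by finitely many parameters" means, the following Python function evaluates the normal density of Example 2.1 for a given pair $(\mu, \sigma^2)$. The function name `normal_pdf` and the parameter values are illustrative choices, not part of the text.

```python
import math


def normal_pdf(x: float, mu: float, sigma2: float) -> float:
    """Density f(x; mu, sigma^2) of the one-dimensional normal family."""
    if sigma2 <= 0:
        raise ValueError("sigma2 must be positive")
    sigma = math.sqrt(sigma2)
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / (math.sqrt(2 * math.pi) * sigma)


# Each choice of (mu, sigma2) picks out exactly one member of the family.
standard = normal_pdf(0.0, mu=0.0, sigma2=1.0)
shifted = normal_pdf(0.0, mu=1.0, sigma2=4.0)
print(standard, shifted)
```

Two numbers are enough to pin down the whole density, which is what makes the family parametric.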

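Nonparametric Models

A nonparametric model is a set of distributions $\mathfrak{F}$ that cannot be indexed by a finite number of parameters.

Example 2.2 $\mathfrak{F}_{All}=\{\text{all CDFs}\}$, where CDF stands for cumulative distribution function.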

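Example 2.3 One-dimensional parameter estimation

Let the sample $x_1, x_2,\dots, x_n$ be drawn from Bernoulli(p); estimate $p$.

Example 2.4 Two-dimensional parameter estimation

Let the sample $x_1, x_2,\dots, x_n$ be drawn from a member of the one-dimensional normal family $\mathfrak{F}$; estimate $\mu, \sigma^2$.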

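Example 2.5 Nonparametric density estimation

Let the sample $x_1, x_2,\dots, x_n$ be drawn from a continuous distribution $F$ with density $f$; estimate $f$.
Assuming only $F\in\mathfrak{F}_{All}$ is not enough here. To estimate $f$ we need the further assumption
$f\in\mathfrak{F}_{DENS}\cap\mathfrak{F}_{SOB}$,
where $\mathfrak{F}_{DENS}$ is the set of all probability density functions and
$\mathfrak{F}_{SOB}=\{f: \int (f''(x))^2 \,{\rm d}x<\infty\}$ is called a Sobolev space.
Functions in this space have a certain degree of smoothness: their second derivative is square-integrable, so they cannot oscillate too wildly.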

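Example 2.6 Nonparametric estimation of functionals

Let $x_1, x_2,\dots, x_n \sim F$. Any function of $F$ is called a statistical functional and is written $T(F)$. Examples include
the mean $\mu=\int x \,{\rm d}F(x)$, the variance $\sigma^2=\int x^2\,{\rm d}F(x)-\left(\int x \,{\rm d}F(x)\right)^2$,
and the median $F^{-1}(\frac{1}{2})$.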
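One common way to estimate such functionals, assumed here rather than stated in the text, is the plug-in approach: evaluate the same functional on the sample itself, so that integrals against $F$ become sample averages. The sketch below does this for the three functionals just listed, using only the Python standard library; the sample values are made up for illustration.

```python
import statistics

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]

# Mean: the integral of x dF(x) becomes the sample average.
mean_hat = statistics.fmean(sample)

# Variance: second moment minus squared first moment, as in the formula above.
second_moment = statistics.fmean([x * x for x in sample])
var_hat = second_moment - mean_hat ** 2

# Median: the middle order statistic of the sample.
median_hat = statistics.median(sample)

print(mean_hat, var_hat, median_hat)
```

Note that this variance estimate divides by $n$, not $n-1$, so it agrees with `statistics.pvariance` rather than `statistics.variance`.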

Point Estimation

Let $x_1, x_2,\dots, x_n$ be a sample from some distribution $F$. A point estimator of a parameter $\theta$ is a function of the sample, $\hat{\theta}_n=g(x_1, x_2, \dots, x_n)$.

Definition 2.1 Bias of an estimator

$bias(\hat{\theta}_n)=E_{\theta}(\hat{\theta}_n)-\theta$

Definition 2.2 The estimator $\hat{\theta}_n$ is said to be unbiased if $E(\hat{\theta}_n)=\theta$, that is, if $bias(\hat{\theta}_n)=0$.

Definition 2.3 The estimator $\hat{\theta}_n$ is said to be consistent if $\hat{\theta}_n \xrightarrow{p} \theta$ as $n\rightarrow \infty$, that is,

$\text{for all } \varepsilon > 0,\quad \lim_{n\to\infty}\mathcal{P}(|\hat{\theta}_n -\theta|\ge \varepsilon)=0$

Definition 2.4 The distribution of the estimator $\hat{\theta}_n$ is called its sampling distribution.

Definition 2.5 The standard deviation of $\hat{\theta}_n$ is called its standard error, that is,

$se=se(\hat{\theta}_n)=\sqrt{Var(\hat{\theta}_n)}$

Example 3.1 Let the sample $x_1, x_2,\dots, x_n$ be drawn from $Bernoulli(p)$, and consider the estimator
$\hat{p}_n =\bar{x}=\dfrac{1}{n}\sum\limits_{i=1}^n x_i$. Then
$E(\hat{p}_n)=\dfrac{1}{n}\sum\limits_{i=1}^n E(x_i)=p$ and, since the $x_i$ are independent with $Var(x_i)=p(1-p)$,
$se=\sqrt{Var(\hat{p}_n)}=\sqrt{\dfrac{p(1-p)}{n}}$.
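The formulas of Example 3.1 can be tried out on simulated data. The following is a minimal sketch: the true value `p_true = 0.3`, the sample size, the seed, and the helper name `bernoulli_sample` are all assumptions made for illustration. Replacing $p$ by $\hat{p}_n$ in the standard error formula, as done for `se_hat`, is a common practical step that the text does not discuss.

```python
import math
import random


def bernoulli_sample(p: float, n: int, rng: random.Random) -> list:
    """Draw n independent Bernoulli(p) values (0 or 1)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


rng = random.Random(0)   # fixed seed so the run is reproducible
p_true = 0.3             # assumed "unknown" parameter
n = 10_000

xs = bernoulli_sample(p_true, n, rng)
p_hat = sum(xs) / n                              # the estimator x-bar
se_true = math.sqrt(p_true * (1 - p_true) / n)   # sqrt(p(1-p)/n), needs the true p
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)      # same formula with p_hat plugged in

print(f"p_hat={p_hat:.4f}  se_true={se_true:.5f}  se_hat={se_hat:.5f}")
```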

Definition 2.6 The quantity $E_{\theta}(\hat{\theta}_n -\theta)^2$ is called the mean squared error, denoted MSE, that is, $MSE(\hat{\theta}_n)=E_{\theta}(\hat{\theta}_n -\theta)^2$.

Theorem 2.1 $MSE(\hat{\theta}_n)=bias^2(\hat{\theta}_n)+Var(\hat{\theta}_n)$.

Proof: Let $\bar{\theta}_n=E_{\theta}(\hat{\theta}_n)$. Then

$E_{\theta}(\hat{\theta}_n -\theta)^2=E_{\theta}(\hat{\theta}_n -\bar{\theta}_n +\bar{\theta}_n -\theta)^2= E_{\theta}(\hat{\theta}_n -\bar{\theta}_n)^2 + 2(\bar{\theta}_n -\theta)E_{\theta}(\hat{\theta}_n -\bar{\theta}_n) + E_{\theta}(\bar{\theta}_n -\theta)^2$

$=(\bar{\theta}_n -\theta)^2+E_{\theta}(\hat{\theta}_n-\bar{\theta}_n)^2 =bias^2(\hat{\theta}_n)+Var(\hat{\theta}_n)$,

where the cross term vanishes because $E_{\theta}(\hat{\theta}_n -\bar{\theta}_n)=0$, and $\bar{\theta}_n -\theta$ is a constant, so its expected square is just $(\bar{\theta}_n -\theta)^2$.
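As a sanity check on Theorem 2.1, the decomposition can be verified exactly for a deliberately biased estimator. The sketch below assumes Bernoulli data, so the number of successes $k$ is Binomial$(n, p)$, and uses the hypothetical estimator $(k+1)/(n+2)$, which is not from the text. Bias, variance, and MSE are each computed directly from the binomial probabilities, without using the theorem.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def moments(estimator, n: int, p: float):
    """Return (bias, variance, mse) of estimator(k, n) with k ~ Binomial(n, p)."""
    probs = [binom_pmf(k, n, p) for k in range(n + 1)]
    values = [estimator(k, n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(probs, values))
    var = sum(w * (v - mean) ** 2 for w, v in zip(probs, values))
    mse = sum(w * (v - p) ** 2 for w, v in zip(probs, values))
    return mean - p, var, mse


def shrunk(k: int, n: int) -> float:
    """Hypothetical biased estimator of p, used only for illustration."""
    return (k + 1) / (n + 2)


bias, var, mse = moments(shrunk, n=20, p=0.3)
print(bias, var, mse, bias**2 + var)
```

The last two printed numbers agree up to floating-point rounding, as the theorem requires.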

Theorem 2.2 If $bias\rightarrow 0$ and $se\rightarrow 0$ as $n\rightarrow\infty$, then $\hat{\theta}_n$ is a consistent estimator of $\theta$.

Proof: By Theorem 2.1, $MSE=E_{\theta}(\hat{\theta}_n -\theta)^2=bias^2+se^2\rightarrow 0$ as $n\rightarrow\infty$. Hence, for any $\varepsilon>0$, Chebyshev's inequality gives

$\mathcal{P}(|\hat{\theta}_n-\theta|>\varepsilon)=\mathcal{P}(|\hat{\theta}_n-\theta|^2>\varepsilon^2)\le\dfrac{E_{\theta}(\hat{\theta}_n -\theta)^2}{\varepsilon^2}\rightarrow 0$,

so $\hat{\theta}_n \xrightarrow{p} \theta$ as $n\rightarrow \infty$.

Example 3.2 Continuing Example 3.1, $bias(\hat{p}_n)=E(\hat{p}_n)-p=0$ and $se=\sqrt{\dfrac{p(1-p)}{n}}\rightarrow 0$ as $n\rightarrow \infty$, so by Theorem 2.2 $\hat{p}_n$ is consistent.
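To see the consistency of $\hat{p}_n$ numerically, one can compute $\mathcal{P}(|\hat{p}_n-p|\ge\varepsilon)$ exactly from the binomial distribution and compare it with the Chebyshev bound $p(1-p)/(n\varepsilon^2)$ used in the proof. This is a sketch under assumed values: $p=0.3$, $\varepsilon=0.05$, and the three sample sizes are arbitrary illustrative choices.

```python
import math


def tail_prob(n: int, p: float, eps: float) -> float:
    """Exact P(|p_hat - p| >= eps), where n * p_hat ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - p) >= eps
    )


p, eps = 0.3, 0.05
rows = []
for n in (50, 150, 450):
    bound = p * (1 - p) / (n * eps**2)   # Chebyshev bound MSE / eps^2
    rows.append((n, tail_prob(n, p, eps), bound))

for n, tail, bound in rows:
    print(f"n={n:4d}  exact tail={tail:.4f}  Chebyshev bound={bound:.4f}")
```

The exact probability shrinks toward zero as $n$ grows and always stays below the bound, although the bound is loose (for small $n$ it can exceed 1).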

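Estimating the Distribution

Definition 2.7 Empirical distribution
The function $\hat{F}_n(x)=\dfrac{1}{n}\sum\limits_{i=1}^n I(x_i \le x),\, x\in \mathbb{R}$, is called the empirical distribution (function), where $I(\cdot)$ is the indicator function.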

Theorem 2.3 For any $x\in \mathbb{R}$,
$E(\hat{F}_n(x))=F(x)$, $Var(\hat{F}_n(x))=\dfrac{F(x)(1-F(x))}{n}$, and
$MSE(\hat{F}_n(x))=\dfrac{F(x)(1-F(x))}{n}\rightarrow 0$ as $n\rightarrow\infty$, so
$\hat{F}_n(x)\xrightarrow{p} F(x)$, where $F(x)$ is the population distribution.
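The empirical distribution function of Definition 2.7 is straightforward to compute: sort the sample once, then count how many values are at most $x$. The sketch below uses the standard-library `bisect_right` for the count; the helper name `make_ecdf` and the sample values are illustrative assumptions.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return the function x -> (1/n) * #{i : x_i <= x}."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x: float) -> float:
        # bisect_right counts the sorted values that are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


F_hat = make_ecdf([3.1, 0.5, 2.2, 2.2, 4.0])
for x in (0.0, 0.5, 2.2, 3.5, 4.0):
    print(x, F_hat(x))
```

The result is a step function that jumps by $1/n$ at each sample point (by $2/n$ at the repeated value 2.2).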

Theorem 2.4 (The Glivenko-Cantelli Theorem)
Let the sample $x_1, x_2, \dots, x_n$ be drawn from the distribution $F$. Then

$\sup\limits_{x} |\hat{F}_n(x)-F(x)|\xrightarrow{p} 0,\quad n\rightarrow\infty$
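The uniform convergence in Theorem 2.4 can be illustrated by simulation. The sketch below assumes the sample comes from Uniform(0, 1), so that $F(x)=x$ on $[0,1]$; the seed and the two sample sizes are arbitrary. Because $\hat{F}_n$ is a step function and $F$ is continuous, the supremum is attained at a jump, so it is enough to check just before and at each order statistic.

```python
import random


def sup_distance_uniform(sample) -> float:
    """sup over x of |F_hat_n(x) - x| for data assumed to be Uniform(0, 1)."""
    ordered = sorted(sample)
    n = len(ordered)
    # At the i-th order statistic u, the ECDF jumps from (i-1)/n to i/n.
    return max(
        max(i / n - u, u - (i - 1) / n)
        for i, u in enumerate(ordered, start=1)
    )


rng = random.Random(0)   # fixed seed so the run is reproducible
distances = {
    n: sup_distance_uniform([rng.random() for _ in range(n)])
    for n in (100, 10_000)
}
for n, d in distances.items():
    print(f"n={n:6d}  sup distance={d:.4f}")
```

For other continuous distributions the same idea applies with `u` replaced by $F$ evaluated at the order statistic.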
