1. Some theorems
Markov inequality: for a r.v. $\mathsf{x}\ge0$,

$$\mathbb{P}(\mathsf{x}\ge\mu)\le \frac{\mathbb{E}[\mathsf{x}]}{\mu}$$

Proof: omitted.
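A quick numerical sanity check of the inequality (a sketch, not from the original notes; the exponential distribution is an arbitrary choice of nonnegative r.v.):

```python
import random

def markov_check(samples, mu):
    """Return (empirical P(x >= mu), empirical E[x]/mu) for nonnegative samples."""
    mean = sum(samples) / len(samples)
    tail = sum(1 for s in samples if s >= mu) / len(samples)
    return tail, mean / mu

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # x >= 0, E[x] = 1
tail, bound = markov_check(xs, mu=3.0)
assert tail <= bound  # P(x >= 3) = e^{-3} ~ 0.05, while the bound is ~ 1/3
```

The bound is loose here (0.05 vs 1/3), which is typical: Markov only uses nonnegativity and the mean.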
Weak law of large numbers (WLLN): let $\vec{y}=[y_1,y_2,\dots,y_N]^T$ with $y_i \sim p$ i.i.d., and let $L_p(\vec{y})=\frac{1}{N}\log p_{\mathbf{y}}(\vec{y})$. Then

$$\lim_{N\to\infty}\mathbb{P}\left(\left|L_p(\vec{y})+H(p)\right|>\varepsilon\right)=0, \quad \forall\, \varepsilon>0$$

Proof: omitted.
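The statement can be checked numerically: for i.i.d. Bernoulli samples, $-\frac{1}{N}\log_2 p(\vec y)$ concentrates around $H(p)$ (a hedged sketch; the Bernoulli parameter and sample size are arbitrary choices):

```python
import math
import random

def neg_log_likelihood_rate(p, N, rng):
    """-(1/N) log2 p(y) for y drawn i.i.d. Bern(p); the WLLN says this -> H(p)."""
    total = 0.0
    for _ in range(N):
        y = rng.random() < p
        total -= math.log2(p if y else 1 - p)
    return total / N

p = 0.3
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # ~0.8813 bits
rng = random.Random(0)
val = neg_log_likelihood_rate(p, 200_000, rng)
assert abs(val - H) < 0.01  # concentration around the entropy
```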
2. Typical set
3. Divergence
ε-typical set
WLLN: $\vec{y}=[y_1,y_2,\dots,y_N]^T$, $y_i \sim p$ i.i.d.

$$L_{p|q}(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{\mathbf{y}}(\boldsymbol{y})}{q_{\mathbf{y}}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p\left(y_{n}\right)}{q\left(y_{n}\right)}$$

$$\lim_{N \rightarrow \infty} \mathbb{P}\left(\left|L_{p|q}(\boldsymbol{y})-D(p \| q)\right|>\epsilon\right)=0$$

Remarks: the earlier WLLN involved only the mean of a single distribution; here a second distribution $q$ is brought in as well.
Definition: $\vec{\boldsymbol{y}}=[y_1,y_2,\dots,y_N]^T$, $y_i \sim p$ i.i.d.

$$\mathcal{T}_{\epsilon}(p \| q ; N)=\left\{\boldsymbol{y} \in \mathcal{Y}^{N}:\left|L_{p|q}(\boldsymbol{y})-D(p \| q)\right| \leq \epsilon\right\}$$
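Numerically, when $\vec y$ is drawn from $p$, the statistic $L_{p|q}(\vec y)$ indeed concentrates at $D(p\|q)$, so $\vec y$ falls in $\mathcal{T}_\epsilon(p\|q;N)$ with high probability (a sketch with arbitrary Bernoulli parameters):

```python
import math
import random

def kl_bern(p, q):
    """D(Bern(p) || Bern(q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

p, q, N, eps = 0.5, 0.2, 100_000, 0.02
rng = random.Random(1)
ys = [rng.random() < p for _ in range(N)]
# L_{p|q}(y) = (1/N) sum log2 p(y_n)/q(y_n)
L = sum(math.log2((p if y else 1 - p) / (q if y else 1 - q)) for y in ys) / N
assert abs(L - kl_bern(p, q)) <= eps  # y lands in T_eps(p||q; N)
```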
Properties

$$\text{WLLN} \Longrightarrow q_{\mathbf{y}}(\boldsymbol{y}) \approx p_{\mathbf{y}}(\boldsymbol{y})\, 2^{-N D(p \| q)}$$

$$Q\left\{\mathcal{T}_{\epsilon}(p \| q ; N)\right\} \approx 2^{-N D(p \| q)} \to 0$$
Remarks: a typical set of $p$ may be an atypical set of $q$; when $N$ is large, the typical sets of different distributions are essentially orthogonal (disjoint).
Theorem

$$(1-\epsilon)\, 2^{-N(D(p \| q)+\epsilon)} \leq Q\left\{\mathcal{T}_{\epsilon}(p \| q ; N)\right\} \leq 2^{-N(D(p \| q)-\epsilon)}$$
4. Large deviation of sample averages
Theorem (Cramér's Theorem): let $\vec{\boldsymbol{y}}=[y_1,y_2,\dots,y_N]^T$, $y_i \sim q$ i.i.d., with mean $\mu<\infty$, and let $\gamma>\mu$. Then
$$\lim _{N \rightarrow \infty}-\frac{1}{N} \log \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right)=E_{C}(\gamma)$$

where $E_C(\gamma)$ is referred to as the Chernoff exponent:
$$E_{C}(\gamma) \triangleq D(p(\cdot ; x) \| q), \qquad p(y ; x)=q(y)\, e^{x y-\alpha(x)}$$

where $\alpha(x)=\log \mathbb{E}_q\left[e^{x y}\right]$ and $x>0$ is chosen such that $\mathbb{E}_{p(\cdot;x)}[y]=\gamma$.

Proof:
$$\begin{aligned} \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) &=\mathbb{P}\left(e^{x \sum_{n=1}^{N} y_{n}} \geq e^{N x \gamma}\right) \\ & \leq e^{-N x \gamma}\, \mathbb{E}\left[e^{x \sum_{n=1}^{N} y_{n}}\right] \\ &=e^{-N x \gamma}\left(\mathbb{E}\left[e^{x y}\right]\right)^{N} \\ & \leq e^{-N\left(x_{*} \gamma-\alpha\left(x_{*}\right)\right)} \end{aligned}$$
$\varphi(x)=x\gamma-\alpha(x)$ is concave (since the log-MGF $\alpha$ is convex), and its maximum is attained where

$$\mathbb{E}_{p\left(\cdot ; x_{*}\right)}[y]=\dot{\alpha}\left(x_{*}\right)=\gamma$$
One can show that

$$x_{*} \gamma-\alpha\left(x_{*}\right)=x_{*} \dot{\alpha}\left(x_{*}\right)-\alpha\left(x_{*}\right)=D\left(p\left(\cdot ; x_{*}\right) \| q\right)$$
Hence

$$\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \leq e^{-N E_{C}(\gamma)}$$
The proof of the lower bound is omitted for now… It relies on two facts about the tilted family $p(y;x)=q(y)\exp(xy-\alpha(x))$:

- $D(p(\cdot;x) \| q)$ increases monotonically with $x$
- $\mathbb{E}_{p(\cdot;x)}[y]$ increases monotonically with $x$
Remarks: the theorem equivalently says that

$$\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \cong 2^{-N E_{\mathrm{C}}(\gamma)}$$

This can be read as projecting the distribution $q$ onto the convex set of distributions satisfying $\mathbb{E}[y] \geq \gamma$; the projection lands exactly on the boundary, the linear family $\mathbb{E}[y]=\gamma$, and the projection of $q$ onto that linear family is precisely the exponential-family expression in (10).
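For a Bernoulli $q$ everything is available in closed form, which makes the identity $E_C(\gamma)=x_*\gamma-\alpha(x_*)=D(p(\cdot;x_*)\|q)$ easy to verify (a sketch; tilting a Bernoulli gives another Bernoulli, and setting the tilted mean to $\gamma$ yields $D(\mathrm{Bern}(\gamma)\,\|\,\mathrm{Bern}(q))$):

```python
import math

def kl_bern(a, b):
    """D(Bern(a) || Bern(b)) in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def chernoff_exponent(q, gamma):
    """E_C(gamma) = x* gamma - alpha(x*) for y ~ Bern(q), alpha(x) = log E[e^{xy}].
    The tilted law p(.; x) is Bern(q e^x / (1 - q + q e^x)); x* makes its mean gamma."""
    # solve q e^x / (1 - q + q e^x) = gamma for x
    x = math.log(gamma * (1 - q) / (q * (1 - gamma)))
    alpha = math.log(1 - q + q * math.exp(x))
    return x * gamma - alpha

q, gamma = 0.3, 0.6
assert abs(chernoff_exponent(q, gamma) - kl_bern(gamma, q)) < 1e-12
```

Here $x_*>0$ because $\gamma>\mu=q$, matching the condition in the theorem.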
5. Types and type classes
Definition: $\vec{\boldsymbol{y}}=[y_1,y_2,\dots,y_N]^T$ (no assumption is made about the true underlying distribution). The type (empirical distribution) is

$$\hat{p}(b ; \mathbf{y})=\frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{b}\left(y_{n}\right)=\frac{N_{b}(\mathbf{y})}{N}$$
$\mathcal{P}_{N}^{\mathcal{Y}}$ denotes the set of all possible types of length-$N$ sequences.
Type class:

$$\mathcal{T}_{N}^{\mathcal{Y}}(p)=\left\{\mathbf{y} \in \mathcal{Y}^{N}: \hat{p}(\cdot ; \mathbf{y}) \equiv p(\cdot)\right\}, \qquad p\in\mathcal{P}_{N}^{\mathcal{Y}}$$
Exponential Rate Notation: $f(N) \doteq 2^{N \alpha}$ means

$$\lim _{N \rightarrow \infty} \frac{\log f(N)}{N}=\alpha$$

Remarks: $\alpha$ captures the order in $N$ of the exponent (logarithmic, linear, quadratic, …).
Properties

$$\left|\mathcal{P}_{N}^{\mathcal{Y}}\right| \leq(N+1)^{|\mathcal{Y}|}$$

$$q^{N}(\mathbf{y})=2^{-N(D(\hat{p}(\cdot ; \mathbf{y}) \| q)+H(\hat{p}(\cdot ; \mathbf{y})))}$$

$$p^{N}(\mathbf{y})=2^{-N H(p)} \quad \text{for } \mathbf{y} \in \mathcal{T}_{N}^{\mathcal{Y}}(p)$$

$$c\, N^{-|\mathcal{Y}|}\, 2^{N H(p)} \leq\left|\mathcal{T}_{N}^{\mathcal{Y}}(p)\right| \leq 2^{N H(p)}$$
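For a binary alphabet a type is determined by the number of ones $k$, so these properties can be verified exactly by enumeration (a sketch; $N$ is an arbitrary choice, and the lower bound is checked in the common $1/(N+1)$ form):

```python
import math
from math import comb

N = 12  # binary alphabet Y = {0, 1}: a type is k/N, k = number of ones
for k in range(1, N):
    p = k / N
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    size = comb(N, k)  # |T_N(p)| = C(N, k)
    assert size <= 2 ** (N * H) * (1 + 1e-9)   # upper bound 2^{N H(p)}
    assert size >= 2 ** (N * H) / (N + 1)      # lower bound with c = 1/(N+1)
    # p^N(y) = 2^{-N H(p)} exactly for every y in the type class of p
    prob_of_one_sequence = p**k * (1 - p) ** (N - k)
    assert abs(prob_of_one_sequence - 2 ** (-N * H)) < 1e-12
```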
Theorem

$$c\, N^{-|\mathcal{Y}|}\, 2^{-N D(p \| q)} \leq Q\left\{\mathcal{T}_{N}^{\mathcal{Y}}(p)\right\} \leq 2^{-N D(p \| q)}$$

$$Q\left\{\mathcal{T}_{N}^{\mathcal{Y}}(p)\right\} \doteq 2^{-N D(p \| q)}$$
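The theorem can also be checked exactly for a binary alphabet: under $Q=\mathrm{Bern}(q)^{\otimes N}$ the probability of each type class is $\binom{N}{k}q^k(1-q)^{N-k}$, which is sandwiched by the stated bounds (a sketch; $N$, $q$, and the $1/(N+1)$ form of $c N^{-|\mathcal{Y}|}$ are arbitrary choices):

```python
import math
from math import comb

def kl_bern(a, b):
    """D(Bern(a) || Bern(b)) in bits, with the 0 log 0 = 0 convention."""
    d = 0.0
    if a > 0:
        d += a * math.log2(a / b)
    if a < 1:
        d += (1 - a) * math.log2((1 - a) / (1 - b))
    return d

N, q = 10, 0.3
for k in range(N + 1):
    p = k / N
    Q_type_class = comb(N, k) * q**k * (1 - q) ** (N - k)
    D = kl_bern(p, q)
    assert Q_type_class <= 2 ** (-N * D) * (1 + 1e-9)  # upper bound
    assert Q_type_class >= 2 ** (-N * D) / (N + 1)     # lower bound with c = 1/(N+1)
```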
6. Large Deviation Analysis via Types
Definition:

$$\mathcal{R}=\left\{\mathbf{y} \in \mathcal{Y}^{N}: \hat{p}(\cdot ; \mathbf{y}) \in \mathcal{S} \cap \mathcal{P}_{N}^{\mathcal{Y}}\right\}$$
Sanov's Theorem:

$$Q\left\{\mathcal{S} \cap \mathcal{P}_{N}^{\mathcal{Y}}\right\} \leq(N+1)^{|\mathcal{Y}|}\, 2^{-N D\left(p_{*} \| q\right)}$$

$$Q\left\{\mathcal{S} \cap \mathcal{P}_{N}^{\mathcal{Y}}\right\} \,\dot{\leq}\, 2^{-N D\left(p_{*} \| q\right)}, \qquad p_{*}=\underset{p \in \mathcal{S}}{\arg \min }\, D(p \| q)$$
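A concrete instance (a sketch with arbitrary numbers): take $\mathcal{S}=\{p:\mathbb{E}_p[y]\ge\gamma\}$ for Bernoulli sequences. Then $p_*=\mathrm{Bern}(\gamma)$, and the exact tail probability decays at rate $D(p_*\|q)$:

```python
import math
from math import comb

q, gamma, N = 0.3, 0.6, 500
# S = {p : E_p[y] >= gamma}; for q = Bern(0.3), p* = Bern(gamma) minimizes D(p||q) over S
D_star = gamma * math.log2(gamma / q) + (1 - gamma) * math.log2((1 - gamma) / (1 - q))

# exact Q{S ∩ P_N} = P(sample mean >= gamma) under i.i.d. Bern(q)
kmin = math.ceil(N * gamma)
tail = sum(comb(N, k) * q**k * (1 - q) ** (N - k) for k in range(kmin, N + 1))

rate = -math.log2(tail) / N  # empirical exponent
assert abs(rate - D_star) < 0.05  # the polynomial prefactor vanishes as N grows
```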
7. Asymptotics of hypothesis testing
LRT:

$$L(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{1}^{N}(\boldsymbol{y})}{p_{0}^{N}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_{1}\left(y_{n}\right)}{p_{0}\left(y_{n}\right)} \gtrless \gamma$$
Writing $t_{n}=\log \frac{p_{1}(y_{n})}{p_{0}(y_{n})}$, with $p_{0}^{\prime}, p_{1}^{\prime}$ the distributions of $t_n$ under $H_0, H_1$ and $p^{*}$ the minimizing (tilted) distribution satisfying $\mathbb{E}_{p^{*}}[t]=\gamma$,

$$P_{F}=\mathbb{P}_{0}\left\{\frac{1}{N} \sum_{n=1}^{N} t_{n} \geq \gamma\right\} \approx 2^{-N D\left(p^{*} \| p_{0}^{\prime}\right)}$$
$$P_{M}=1-P_{D} \approx 2^{-N D\left(p^{*} \| p_{1}^{\prime}\right)}$$
$$D\left(p^{*} \| p_{0}^{\prime}\right)-D\left(p^{*} \| p_{1}^{\prime}\right)=\int p^{*}(t) \log \frac{p_{1}^{\prime}(t)}{p_{0}^{\prime}(t)}\, \mathrm{d} t=\int p^{*}(t)\, t\, \mathrm{d} t=\mathbb{E}_{p^{*}}[t]=\gamma$$
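A numerical sanity check of the false-alarm exponent (a sketch; the Bernoulli hypotheses, threshold, and $N$ are arbitrary choices): for binary observations the LLR test reduces to thresholding the fraction of ones, so $P_F$ can be computed exactly and its exponent compared with a KL divergence.

```python
import math
from math import comb, log2

p0, p1 = 0.2, 0.5  # H0: y ~ Bern(p0); H1: y ~ Bern(p1)
# per-sample LLR t_n takes value t1 (when y = 1) or t0 (when y = 0)
t1, t0 = log2(p1 / p0), log2((1 - p1) / (1 - p0))

def kl(a, b):
    return a * log2(a / b) + (1 - a) * log2((1 - a) / (1 - b))

# threshold gamma such that the test reads "fraction of ones >= theta"
theta = 0.35
gamma = theta * t1 + (1 - theta) * t0

# exact P_F = P_0{(1/N) sum t_n >= gamma} = P_0{#ones >= N * theta}
N = 400
kmin = math.ceil(N * theta)
PF = sum(comb(N, k) * p0**k * (1 - p0) ** (N - k) for k in range(kmin, N + 1))

rate = -log2(PF) / N
assert abs(rate - kl(theta, p0)) < 0.05  # exponent ~ D(Bern(theta) || Bern(p0))
```

Here $\mathrm{Bern}(\theta)$ plays the role of the tilted distribution on the boundary of the decision region.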
8. Asymptotics of parameter estimation
Strong Law of Large Numbers (SLLN):

$$\mathbb{P}\left(\lim _{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^{N} w_{n}=\mu\right)=1$$

Central Limit Theorem (CLT):
$$\lim _{N \rightarrow \infty} \mathbb{P}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^{N}\left(\frac{w_{n}-\mu}{\sigma}\right) \leq b\right)=\Phi(b)$$

The following three modes of convergence are listed in decreasing order of strength:
Convergence with probability 1 (SLLN): $\mathsf{x}_{N} \stackrel{w.p.1}{\longrightarrow} a$
Convergence in probability: the deviation probability tends to 0 (WLLN)
Convergence in distribution: $\mathsf{x}_{N} \stackrel{d}{\longrightarrow} p$
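A quick Monte Carlo illustration of the CLT (a sketch; the Uniform(0,1) summands and all parameters are arbitrary choices):

```python
import math
import random

def Phi(b):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(b / math.sqrt(2)))

rng = random.Random(0)
N, trials, b = 100, 20_000, 1.0
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and std of Uniform(0, 1)
hits = 0
for _ in range(trials):
    # standardized sum (1/sqrt(N)) sum (w_n - mu)/sigma
    z = sum(rng.random() - mu for _ in range(N)) / (sigma * math.sqrt(N))
    hits += z <= b
frac = hits / trials
assert abs(frac - Phi(b)) < 0.02  # Phi(1) ~ 0.8413
```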
For the other parts of this series, see: 统计推断(一) Hypothesis Test 统计推断(二) Estimation Problem 统计推断(三) Exponential Family 统计推断(四) Information Geometry 统计推断(五) EM algorithm 统计推断(六) Modeling 统计推断(七) Typical Sequence 统计推断(八) Model Selection 统计推断(九) Graphical models 统计推断(十) Elimination algorithm 统计推断(十一) Sum-product algorithm