I. Cross-Entropy Fundamentals
1 Information content
The information content of an event is inversely related to its probability: the rarer the event, the more information it carries.
The formula is:
$$I(x) = -\log P(x)$$
where $I(x)$ is the information content and $P(x)$ is the probability of the event.
2 Information entropy (entropy)
Information entropy is the expected information content over all outcomes.
The formula is:
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)$$
where $X$ is a discrete random variable taking values $(x_1, x_2, \ldots, x_n)$.
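As a quick illustration (a minimal NumPy sketch, not part of the original text), the entropy of a discrete distribution can be computed directly from the definition:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_i p_i * log(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]          # by convention, 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log(p)))

# A fair coin has entropy log(2) ≈ 0.6931 nats;
# a certain event has entropy 0 (no surprise at all).
print(entropy([0.5, 0.5]))
print(entropy([1.0]))
```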
3 Relative entropy (KL divergence)
KL divergence measures the difference between two probability distributions over the same random variable.
The formula is:
$$D_{KL}(p \| q) = \sum_{i=1}^{n} p(x_i) \log\left(\frac{p(x_i)}{q(x_i)}\right)$$
where $p(x)$ is the true distribution of the samples and $q(x)$ is the distribution predicted by the model.
The smaller the KL divergence, the closer the distributions $p(x)$ and $q(x)$ are; repeated training drives $q(x)$ toward $p(x)$.
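A direct translation of the formula into code (a minimal sketch, assuming both distributions share the same support) makes the "distance-like" behavior visible: the divergence is zero when the distributions match and positive otherwise.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0          # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0 when p == q
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # positive when they differ
```

Note that $D_{KL}$ is not symmetric: $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ in general, which is why the roles of the true and predicted distributions matter.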
4 Cross-entropy
Cross-entropy is the sum of information entropy and relative entropy, i.e. $H(p, q) = H(p) + D_{KL}(p \| q)$:
$$H(p, q) = -\sum_{i=1}^{n} p(x_i) \log q(x_i)$$
Note:
$$\begin{aligned} D_{KL}(p \| q) &= \sum_{i=1}^{n} p(x_i) \log\left(\frac{p(x_i)}{q(x_i)}\right) \\ &= \sum_{i=1}^{n} p(x_i) \log p(x_i) - \sum_{i=1}^{n} p(x_i) \log q(x_i) \\ &= -H(p) + \left[-\sum_{i=1}^{n} p(x_i) \log q(x_i)\right] = H(p, q) - H(p) \end{aligned}$$
When training a network, the inputs and labels are fixed, so $p(x)$ is fixed and the information entropy $H(p)$ is a constant. The smaller the KL divergence, the better the prediction, so we want to minimize it; since $H(p)$ is constant, minimizing the KL divergence is equivalent to minimizing the cross-entropy, which is why the cross-entropy loss function is used.
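The identity $H(p, q) = H(p) + D_{KL}(p \| q)$ can be checked numerically (a small sketch with two hypothetical example distributions chosen for illustration):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (fixed by the data)
q = np.array([0.5, 0.3, 0.2])   # distribution predicted by the model

H_p  = -np.sum(p * np.log(p))        # entropy H(p), a constant
H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
D_kl =  np.sum(p * np.log(p / q))    # KL divergence D_KL(p || q)

# Since H(p) is constant, minimizing H(p, q) minimizes D_KL as well.
print(np.isclose(H_pq, H_p + D_kl))  # True
```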
5 Summary
Cross-entropy comes from information theory and is mainly used to measure the difference between two probability distributions.
In linear regression problems, MSE is the usual loss function. In classification problems, cross-entropy is the usual loss: the output layer applies softmax so that the predicted values over the classes sum to 1, and the loss is then computed with cross-entropy.
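The softmax-then-cross-entropy pipeline described above can be sketched as follows (a minimal NumPy version for a single sample; the max-subtraction trick for numerical stability is standard but not mentioned in the text):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow
    return e / np.sum(e)

def cross_entropy(y, a):
    """C = -sum_i y_i * log(a_i), with y a one-hot label vector."""
    return float(-np.sum(y * np.log(a)))

z = np.array([2.0, 1.0, 0.1])    # raw network outputs (logits)
a = softmax(z)                   # class probabilities, sum to 1
y = np.array([1.0, 0.0, 0.0])    # one-hot true label

print(a.sum())                   # 1.0
print(cross_entropy(y, a))       # equals -log(a[0]) for this one-hot y
```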
II. Derivations
1 Cross-entropy loss for logistic regression
Formula:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}\left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right)\right]$$
Derivative:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
Derivation
For logistic regression with $m$ samples, each input is $x^{(i)} = \left(1, x_1^{(i)}, x_2^{(i)}, \ldots, x_p^{(i)}\right)^T$, a $(p+1)$-dimensional vector (the leading 1 accounts for the bias); $y^{(i)}$ is the class label, here 0 or 1; the model parameters are $\theta = \left(\theta_0, \theta_1, \ldots, \theta_p\right)^T$, with
$$\theta^T x^{(i)} := \theta_0 + \theta_1 x_1^{(i)} + \cdots + \theta_p x_p^{(i)}.$$
The hypothesis function is defined as:
$$h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^T x^{(i)}}}$$
$$\begin{aligned} P\left(\hat{y}^{(i)}=1 \mid x^{(i)}; \theta\right) &= h_\theta(x^{(i)}) \\ P\left(\hat{y}^{(i)}=0 \mid x^{(i)}; \theta\right) &= 1 - h_\theta(x^{(i)}) \\ \log P\left(\hat{y}^{(i)}=1 \mid x^{(i)}; \theta\right) &= \log h_\theta(x^{(i)}) = \log \frac{1}{1 + e^{-\theta^T x^{(i)}}} \\ \log P\left(\hat{y}^{(i)}=0 \mid x^{(i)}; \theta\right) &= \log\left(1 - h_\theta(x^{(i)})\right) = \log \frac{e^{-\theta^T x^{(i)}}}{1 + e^{-\theta^T x^{(i)}}} \end{aligned}$$
For the $i$-th sample, the log-probability that the hypothesis assigns to the correct label is:
$$\begin{aligned} & I\{y^{(i)}=1\} \log P\left(\hat{y}^{(i)}=1 \mid x^{(i)}; \theta\right) + I\{y^{(i)}=0\} \log P\left(\hat{y}^{(i)}=0 \mid x^{(i)}; \theta\right) \\ &= y^{(i)} \log P\left(\hat{y}^{(i)}=1 \mid x^{(i)}; \theta\right) + \left(1 - y^{(i)}\right) \log P\left(\hat{y}^{(i)}=0 \mid x^{(i)}; \theta\right) \\ &= y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \end{aligned}$$
Averaging over all $m$ samples gives the loss function:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}\left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right)\right]$$
Why $J$ carries a minus sign: the larger the log-probability of the correct label, the better the model fits the data, but a loss function should measure error and should be minimized. To reconcile the two, the loss is defined as the negative of the average correct-label log-probability.
Differentiation
Step 1:
$$\begin{aligned} J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m}\left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right)\right] \\ &= -\frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log\left(1 + e^{-\theta^T x^{(i)}}\right) + \left(1 - y^{(i)}\right)\left(-\theta^T x^{(i)} - \log\left(1 + e^{-\theta^T x^{(i)}}\right)\right)\right] \\ &= -\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \theta^T x^{(i)} - \theta^T x^{(i)} - \log\left(1 + e^{-\theta^T x^{(i)}}\right)\right] \\ &= -\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \theta^T x^{(i)} - \log e^{\theta^T x^{(i)}} - \log\left(1 + e^{-\theta^T x^{(i)}}\right)\right] \\ &= -\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \theta^T x^{(i)} - \left(\log e^{\theta^T x^{(i)}} + \log\left(1 + e^{-\theta^T x^{(i)}}\right)\right)\right] \\ &= -\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \theta^T x^{(i)} - \log\left(e^{\theta^T x^{(i)}} + 1\right)\right] \end{aligned}$$
Step 2:
$$\begin{aligned} \frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j}\left(\frac{1}{m} \sum_{i=1}^{m}\left[\log\left(1 + e^{\theta^T x^{(i)}}\right) - y^{(i)} \theta^T x^{(i)}\right]\right) \\ &= \frac{1}{m} \sum_{i=1}^{m}\left(\frac{x_j^{(i)} e^{\theta^T x^{(i)}}}{1 + e^{\theta^T x^{(i)}}} - y^{(i)} x_j^{(i)}\right) \\ &= \frac{1}{m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \end{aligned}$$
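The closed-form gradient can be verified against a numerical gradient (a minimal sketch with a tiny synthetic dataset; the data and seed are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    """Closed form from the derivation: (1/m) * X^T (h - y)."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # bias column of 1s
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

# Central-difference numerical gradient for comparison
eps = 1e-6
num = np.array([(loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(grad(theta, X, y), num, atol=1e-6))   # True
```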
2 Cross-entropy loss with softmax
Formula:
$$C = -\sum_i y_i \ln a_i$$
$$a_i = \frac{e^{z_i}}{\sum_k e^{z_k}}, \qquad z_i = \sum_j w_{ij} x_j + b_i$$
where $y_i$ is the true label for class $i$, $w_{ij}$ is the $j$-th weight of the $i$-th neuron, $b_i$ is the bias, $z_i$ is the $i$-th raw output of the network, and $a_i$ is the $i$-th output after applying the softmax function.
Derivative:
$$\frac{\partial C}{\partial z_i} = a_i - y_i$$
Derivation:
$$\frac{\partial C}{\partial z_i} = \sum_j\left(\frac{\partial C_j}{\partial a_j} \frac{\partial a_j}{\partial z_i}\right)$$
$$\frac{\partial C_j}{\partial a_j} = \frac{\partial\left(-y_j \ln a_j\right)}{\partial a_j} = -\frac{y_j}{a_j}$$
For $\frac{\partial a_j}{\partial z_i}$ there are two cases:
(1) $i = j$:
$$\frac{\partial a_i}{\partial z_i} = \frac{\partial}{\partial z_i}\left(\frac{e^{z_i}}{\sum_k e^{z_k}}\right) = \frac{e^{z_i} \sum_k e^{z_k} - \left(e^{z_i}\right)^2}{\left(\sum_k e^{z_k}\right)^2} = \left(\frac{e^{z_i}}{\sum_k e^{z_k}}\right)\left(1 - \frac{e^{z_i}}{\sum_k e^{z_k}}\right) = a_i\left(1 - a_i\right)$$
(2) $i \neq j$:
$$\frac{\partial a_j}{\partial z_i} = \frac{\partial}{\partial z_i}\left(\frac{e^{z_j}}{\sum_k e^{z_k}}\right) = -e^{z_j}\left(\frac{1}{\sum_k e^{z_k}}\right)^2 e^{z_i} = -a_i a_j$$
Putting the two cases together:
$$\begin{aligned} \frac{\partial C}{\partial z_i} &= \sum_j\left(\frac{\partial C_j}{\partial a_j} \frac{\partial a_j}{\partial z_i}\right) = \sum_{j \neq i}\left(\frac{\partial C_j}{\partial a_j} \frac{\partial a_j}{\partial z_i}\right) + \frac{\partial C_i}{\partial a_i} \frac{\partial a_i}{\partial z_i} \\ &= \sum_{j \neq i}\left(-\frac{y_j}{a_j}\right)\left(-a_i a_j\right) + \left(-\frac{y_i}{a_i}\right) a_i\left(1 - a_i\right) \\ &= \sum_{j \neq i} a_i y_j - y_i\left(1 - a_i\right) \\ &= \sum_{j \neq i} a_i y_j + a_i y_i - y_i \\ &= a_i \sum_j y_j - y_i \end{aligned}$$
In a classification problem the label $y$ is one-hot: exactly one class is 1 and the rest are 0, so $\sum_j y_j = 1$ and therefore
$$\frac{\partial C}{\partial z_i} = a_i - y_i$$
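The clean result $\frac{\partial C}{\partial z_i} = a_i - y_i$ can also be checked numerically (a small sketch; the logits and label below are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # stable softmax
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0])  # example logits
y = np.array([0.0, 0.0, 1.0])   # one-hot label
a = softmax(z)

def C(z):
    """Softmax cross-entropy loss C = -sum_i y_i * ln(a_i)."""
    return -np.sum(y * np.log(softmax(z)))

# Central-difference numerical gradient of C with respect to z
eps = 1e-6
num = np.array([(C(z + eps * e) - C(z - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(num, a - y, atol=1e-6))   # True
```

This simple gradient is the reason softmax and cross-entropy are almost always combined: the backward pass through both layers together reduces to a subtraction.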
Appendix: Differentiation formulas and rules
Derivatives of basic elementary functions
(1) $(C)' = 0$
(2) $(x^\mu)' = \mu x^{\mu-1}$
(3) $(\sin x)' = \cos x$
(4) $(\cos x)' = -\sin x$
(5) $(\tan x)' = \sec^2 x$
(6) $(\cot x)' = -\csc^2 x$
(7) $(\sec x)' = \sec x \tan x$
(8) $(\csc x)' = -\csc x \cot x$
(9) $(a^x)' = a^x \ln a$
(10) $(e^x)' = e^x$
(11) $(\log_a x)' = \frac{1}{x \ln a}$
(12) $(\ln x)' = \frac{1}{x}$
(13) $(\arcsin x)' = \frac{1}{\sqrt{1-x^2}}$
(14) $(\arccos x)' = -\frac{1}{\sqrt{1-x^2}}$
(15) $(\arctan x)' = \frac{1}{1+x^2}$
(16) $(\operatorname{arccot} x)' = -\frac{1}{1+x^2}$
Differentiation rules
Let $u = u(x)$ and $v = v(x)$ both be differentiable. Then
(1) $(u \pm v)' = u' \pm v'$
(2) $(Cu)' = Cu'$ ($C$ a constant)
(3) $(uv)' = u'v + uv'$
(4) $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$
Chain rule
Let $y = f(u)$ with $u = \varphi(x)$, where both $f(u)$ and $\varphi(x)$ are differentiable. Then the composite function $y = f[\varphi(x)]$ has derivative
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \quad\text{or}\quad y' = f'(u) \cdot \varphi'(x)$$