Andrew Ng machine learning course notes: supervised learning applications, gradient descent

Last lesson I talked about supervised learning. Supervised learning was this machine-learning problem where I said we're going to tell the algorithm what the "right answer" is for a number of examples.


Supervised learning was applied to get a car to drive itself. The essential learning algorithm for this is something called gradient descent.


Autonomous driving (supervised learning)


In supervised learning, this is what we're going to do. We're given a training set, and we feed our training set, comprising our m training examples (so 47 training examples), into a learning algorithm. The algorithm then outputs a function that, by tradition and for historical reasons, is usually denoted by lowercase h and is called a hypothesis.

The hypothesis h maps from inputs x to outputs y.

What we'll do is minimize, as a function of the parameters theta, the quantity J of theta.
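As a rough illustration of a hypothesis and of J of theta, here is a minimal sketch. It assumes a linear hypothesis and the ordinary least squares cost with the conventional factor of 1/2; neither formula is written out in these notes, and the toy data and function names are hypothetical.

```python
def hypothesis(theta, x):
    """Linear hypothesis h(x) = theta_0 * x_0 + ... + theta_n * x_n (assumed form)."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, xs, ys):
    """Assumed least squares cost: J(theta) = 1/2 * sum of (h(x) - y)^2."""
    return 0.5 * sum((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data: each input is [1.0, feature]; the leading 1.0 is the intercept term.
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [2.0, 4.0, 6.0]

print(cost([0.0, 0.0], xs, ys))  # cost with all-zero parameters
print(cost([0.0, 2.0], xs, ys))  # cost with parameters that fit the toy data exactly
```

Minimizing J then means searching for the theta that makes this number as small as possible.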


Search algorithm: we'll start with some value of the parameter vector theta. Then I'm going to keep changing the parameter vector theta to reduce J of theta a little bit, until we hopefully end up at the minimum of J of theta with respect to theta.

I'm going to take a small step in the direction of steepest descent, which turns out to be the direction of the negative gradient. You take a small step and end up at a new point, then you take another step, and you keep going until you end up at a local minimum of this function, J of theta. (Is it guaranteed to stop?) If you use a slightly different starting point and again follow the steepest descent direction from there, you may finally stop at a completely different local optimum. This is another property of gradient descent.
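The dependence on the starting point can be seen on a small example. This is only a sketch: the function J(theta) = theta^4 - 2 * theta^2 is a hypothetical one-dimensional stand-in with two local minima, and the step size and iteration count are arbitrary choices.

```python
def descend(theta, alpha=0.05, steps=200):
    """Run gradient descent on the hypothetical J(theta) = theta**4 - 2 * theta**2."""
    for _ in range(steps):
        grad = 4 * theta ** 3 - 4 * theta  # dJ/dtheta
        theta = theta - alpha * grad       # small step against the gradient
    return theta


print(descend(0.5))   # starts right of the hump at 0, settles near theta = 1
print(descend(-0.5))  # starts left of the hump, settles near theta = -1
```

Two nearby starting points end up in two different local minima, simply because they sit on opposite sides of the hump.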

Gradient descent: we repeatedly take a step in the direction of steepest descent, and it turns out that you can write that as: theta i is updated to theta i minus alpha times the partial derivative, with respect to theta i, of J of theta.

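The Greek letter alpha here is a parameter of the algorithm called the learning rate (it sets the size of the step). The gradient decides what direction to take a step in, and this parameter alpha controls how large a step you take. This parameter is usually set by hand. If it is set too small, the steepest descent algorithm moves only a tiny step each time, so it takes a long time to converge; if it is set too large, the algorithm may overshoot the minimum, because the steps are too big. (37:47)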
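The effect of the learning rate can be tried on a toy problem. A minimal sketch, assuming the hypothetical cost J(theta) = theta^2 (gradient 2 * theta) and three arbitrarily chosen values of alpha:

```python
def run(alpha, steps=50, theta=1.0):
    """Gradient descent on the hypothetical J(theta) = theta**2, starting at theta = 1."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # theta := theta - alpha * dJ/dtheta
    return theta


for alpha in (0.001, 0.1, 1.1):
    # Too small: barely moves. Moderate: close to the minimum at 0.
    # Too large: every step overshoots 0 and lands farther away than before.
    print(alpha, run(alpha))
```

With the smallest alpha, theta has hardly left its starting value after 50 steps; with the largest, the iterates jump back and forth across the minimum and grow.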

For ordinary least squares, J turns out to be a quadratic function. So we'll always have a nice bowl shape, with only one global minimum and no other local optima. So when you run gradient descent, here are actually the contours of the function J.


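Is the alpha changing every time? Because the step is not … This is not about the value of alpha; it is a property of gradient descent. As you approach a local minimum, the steps really do get smaller and smaller, until the algorithm converges. When you reach the local minimum, the gradient becomes 0.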
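To see the steps shrink while alpha stays fixed, here is a small sketch on the hypothetical cost J(theta) = theta^2; the value alpha = 0.1 and the starting point are arbitrary.

```python
def step_sizes(alpha=0.1, theta=1.0, steps=10):
    """Record |alpha * gradient| at each iteration for the hypothetical J(theta) = theta**2."""
    sizes = []
    for _ in range(steps):
        step = alpha * 2 * theta  # alpha is constant; only the gradient changes
        sizes.append(abs(step))
        theta = theta - step
    return sizes


print(step_sizes())  # each step is smaller than the one before
```

The step is alpha times the gradient, so it shrinks only because the gradient shrinks near the minimum.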




Batch gradient descent: every update goes through the entire training set.
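A minimal sketch of batch gradient descent, assuming a linear hypothesis and the least squares cost J(theta) = 1/2 * sum of (h(x) - y)^2. The data, the function name, alpha and the number of iterations are hypothetical choices for illustration.

```python
def batch_gradient_descent(xs, ys, alpha, num_iters):
    """Each update uses the gradient summed over the whole training set."""
    theta = [0.0] * len(xs[0])
    for _ in range(num_iters):
        # Prediction error h(x) - y for every training example.
        errors = [sum(t * xj for t, xj in zip(theta, x)) - y for x, y in zip(xs, ys)]
        # dJ/dtheta_j = sum over examples of error * x_j
        grad = [sum(e * x[j] for e, x in zip(errors, xs)) for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data generated from y = 1 + 2 * feature; inputs are [1.0, feature].
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [3.0, 5.0, 7.0]

theta = batch_gradient_descent(xs, ys, alpha=0.05, num_iters=2000)
print(theta)  # approaches intercept 1 and slope 2
```

Note that every single update reads all the examples, which is what becomes expensive when the training set is large.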

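Stochastic gradient descent (incremental gradient descent): as usual, you update for all i, that is, every i-th component of the parameter vector is updated in this way. To start learning, you only need to look at your first training example and perform an update using it; then you use the second training example to perform the next update. This way you adjust the parameters much faster. However, this algorithm does not converge exactly to the global minimum. It moves toward the global minimum and may keep wandering around it. The result you get is usually very close to the global minimum.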
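A minimal sketch of the per-example update, again assuming a linear hypothesis with squared error. The toy data (which no straight line fits exactly), alpha and the number of passes are hypothetical, and the examples are visited in a fixed order rather than shuffled.

```python
def stochastic_gradient_descent(xs, ys, alpha, num_passes):
    """Update theta after every single training example."""
    theta = [0.0] * len(xs[0])
    for _ in range(num_passes):
        for x, y in zip(xs, ys):
            error = sum(t * xj for t, xj in zip(theta, x)) - y  # h(x) - y for this example only
            theta = [t - alpha * error * xj for t, xj in zip(theta, x)]
    return theta


# Hypothetical noisy toy data; inputs are [1.0, feature].
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
ys = [3.1, 4.9, 7.2, 8.8]

theta = stochastic_gradient_descent(xs, ys, alpha=0.01, num_passes=2000)
print(theta)  # ends up near the least squares fit, not exactly on it
```

For this data the exact least squares fit has intercept 1.15 and slope 1.94; with a constant alpha the per-example updates keep nudging theta around that point instead of settling on it.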










