For the 24 people involved, the local encoding is created using a sparse 24-dimensional vector with all components zero, except one. E.g.
Colin \equiv (1,0,0,0,0,\ldots,0), Charlotte \equiv (0,0,1,0,0,\ldots,0), Victoria \equiv (0,0,0,0,1,\ldots,0)
and so on.
Why don't we use a more succinct encoding, like the one computers use for representing numbers in binary? E.g.
Colin \equiv (0,0,0,0,1), Charlotte \equiv (0,0,0,1,1), Victoria \equiv (0,0,1,0,1)
etc., even though this encoding uses 5-dimensional vectors as opposed to 24-dimensional ones.
Check all that apply.
It's always better to have more input dimensions
The 24-d encoding makes each subset of persons linearly separable from every other disjoint subset while the 5-d does not
Considering the way this encoding is used, the 24-d encoding asserts no a priori knowledge about the persons, while the 5-d one does.
Correct
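The two encodings above can be built in a few lines. This is a minimal Python sketch (the helper names `one_hot` and `binary` are mine, not from the course); note how the binary encoding implicitly asserts structure, since people whose indices share bits get overlapping representations:

```python
def one_hot(index, size=24):
    # 24-d local encoding: all zeros except a 1 at the person's index.
    v = [0] * size
    v[index] = 1
    return v

def binary(index, bits=5):
    # 5-d binary encoding: the person's index written in base 2,
    # most significant bit first.
    return [(index >> b) & 1 for b in reversed(range(bits))]

print(one_hot(0)[:6])  # a 1 in the first slot, zeros elsewhere
print(binary(5))       # [0, 0, 1, 0, 1]
```

With `one_hot`, every person is orthogonal to every other, so any subset is linearly separable from any disjoint subset; with `binary`, e.g. indices 1, 3, and 5 all share the last bit, which is an arbitrary built-in similarity.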
In what ways is the task of predicting 'B' given 'A R' different from predicting a class label given the pixels of an image? Check all that apply.
'A' and 'R' are symbols rather than dense vectors of real numbers.
'B' given 'A R' involves predicting a set of targets from a set of inputs.
In the case of 'A R', the input dimension is the number of possible values for A plus the number of possible values for R; whereas for images, the input dimension is exponentially less than the number of possible values.
The ordering of the elements of A (i.e. the fact that 'John' might correspond to input index 1 and 'Mary' to input index 2) provides no additional information while the spatial ordering of the pixels does provide information.
For E = \frac{1}{2}(y-t)^2, where y = \sigma(z) = \frac{1}{1+\exp(-z)}, derivatives tend to "plateau out" when y is close to 0 or 1.
Which of the following statements are true?
\frac{dE}{dz} = (y-t)*y*(1-y)
A good way to fix the problem is by having a large global learning rate.
\frac{dE}{dz} = (y-t)*y
Using a loss function E = -t\log(y) - (1-t)\log(1-y) will fix the problem because then \frac{dE}{dz} = y-t.
The first option can be seen to be true just by taking derivatives; similarly, the third option can be trivially shown to be wrong. The second option is subtle, but in general this is not a good way to fix this problem, since it will amplify the gradients for training cases that are not close to 0 or 1. The cost function used in the last option is called cross-entropy, and it has a nice-looking derivative that doesn't suffer from this plateau problem. Don't worry if it is not immediately obvious how we arrived at it.
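The plateau is easy to see numerically. This is a quick sketch (a worked example, not course code) comparing the two gradients for a confidently wrong prediction:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dE_dz_squared(z, t):
    # Gradient of E = 1/2 (y - t)^2 w.r.t. z: (y - t) * y * (1 - y).
    y = sigmoid(z)
    return (y - t) * y * (1 - y)

def dE_dz_xent(z, t):
    # Gradient of cross-entropy E = -t log y - (1-t) log(1-y): simply y - t.
    return sigmoid(z) - t

# Target t = 1 but z is very negative, so y = sigmoid(z) is close to 0:
z, t = -10.0, 1.0
print(dE_dz_squared(z, t))  # tiny (~ -4.5e-5): learning has stalled
print(dE_dz_xent(z, t))     # close to -1: still a strong error signal
```

The y(1-y) factor in the squared-error gradient is what crushes it near y = 0 or y = 1; the cross-entropy loss cancels that factor exactly.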
If \mathbf{z} = (z_1, z_2, \ldots, z_k) is the input to a k-way softmax unit, the output distribution is \mathbf{y} = (y_1, y_2, \ldots, y_k), where
y_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}
Which of the following statements are true?
The output distribution would still be the same if the input vector was c\mathbf{z} for any positive constant c.
The output distribution would still be the same if the input vector was c + \mathbf{z} for any positive constant c.
Regarding the first two options:
Let's say we have two z's: z_1 = 2, z_2 = -2. Now let's take a softmax over them: \frac{\exp(z_1)}{\exp(z_1) + \exp(z_2)} = \frac{\exp(2)}{\exp(2)+\exp(-2)}. If we add some positive constant c to each z_i, this becomes:
\frac{\exp(2+c)}{\exp(2+c) + \exp(-2+c)} = \frac{\exp(2)\exp(c)}{(\exp(2)+\exp(-2))\exp(c)} = \frac{\exp(2)}{\exp(2)+\exp(-2)}.
Multiplying each z_i by c gives:
\frac{\exp(2c)}{\exp(2c) + \exp(-2c)} = \frac{\exp(2)^c}{\exp(2)^c + \exp(-2)^c} \neq \frac{\exp(2)}{\exp(2)+\exp(-2)} (for c \neq 1).
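The same algebra can be checked numerically. A minimal sketch (the `softmax` helper is mine): shifting the inputs leaves the output unchanged, while scaling them does not. In fact, the standard stability trick of subtracting max(z) before exponentiating relies on exactly this shift invariance:

```python
import math

def softmax(z):
    # Subtract max(z) before exponentiating for numerical stability;
    # by the shift invariance shown above, this doesn't change the output.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

z = [2.0, -2.0]
c = 3.0
print(softmax(z))                   # baseline distribution
print(softmax([v + c for v in z]))  # shifted by c: same distribution
print(softmax([v * c for v in z]))  # scaled by c: sharper, different
```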
Any probability distribution P over discrete states (P(x) > 0 \ \forall x) can be represented as the output of a softmax unit for some inputs.
Each output of a softmax unit always lies in (0,1).
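The first of these two statements also has a one-line construction: feeding z_i = \log P(x_i) into the softmax recovers P exactly, since \exp cancels \log and the terms already sum to 1. A quick sketch (example distribution is mine):

```python
import math

def softmax(z):
    e = [math.exp(v) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Any strictly positive distribution P is softmax(log P):
P = [0.1, 0.2, 0.7]
z = [math.log(p) for p in P]
print(softmax(z))  # recovers P up to floating-point rounding
```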
Consider the following two networks with no bias weights. The network on the left takes 3 n-length word vectors corresponding to the previous 3 words, computes 3 d-length individual word-feature embeddings and then a k-length joint hidden layer which it uses to predict the 4th word. The network on the right is comparatively simpler in that it takes the previous 3 words and uses them to predict the 4th word.
If n=100,000, d=1,000 and k=10,000, which network has more parameters?
The network on the left.
The network on the right.
The network on the left has 3nd + 3dk + nk parameters, which comes out to 1,330,000,000, while the network on the right has 30,000,000,000 parameters, more than an order of magnitude more. One advantage of the neural representation is that we can get much more compact representations of our data while still making good predictions.
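The parameter counts above can be verified directly from the layer sizes (this is just the arithmetic from the explanation, written out):

```python
n, d, k = 100_000, 1_000, 10_000

# Left network: 3 word-to-embedding matrices (3*n*d), embeddings to the
# joint hidden layer (3*d*k), hidden layer to the n-way output (k*n).
left = 3 * n * d + 3 * d * k + k * n

# Right network: 3 one-hot input words wired directly to the n-way output.
right = 3 * n * n

print(left)   # 1330000000
print(right)  # 30000000000
```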