For the 24 people involved, the local encoding is created using a sparse 24-dimensional vector with all components zero, except one. E.g.
Colin \equiv (1,0,0,0,0,\ldots,0), Charlotte \equiv (0,0,1,0,0,\ldots,0), Victoria \equiv (0,0,0,0,1,\ldots,0)
and so on.
Why don't we use a more succinct encoding, like the one computers use for representing numbers in binary? E.g.
Colin \equiv (0,0,0,0,1), Charlotte \equiv (0,0,0,1,1), Victoria \equiv (0,0,1,0,1)
etc., even though this encoding uses 5-dimensional vectors as opposed to 24-dimensional ones.
Check all that apply.
It's always better to have more input dimensions
The 24-d encoding makes each subset of persons linearly separable from every other disjoint subset while the 5-d does not
Considering the way this encoding is used, the 24-d encoding asserts no a priori knowledge about the persons, while the 5-d one does.
Correct
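The two encodings above can be built in a few lines. This is a minimal Python sketch (the helper names `one_hot` and `binary` are mine, not from the course); note how the binary encoding implicitly asserts structure, since people whose indices share bits get overlapping representations:

```python
def one_hot(index, size=24):
    # 24-d local encoding: all zeros except a 1 at the person's index.
    v = [0] * size
    v[index] = 1
    return v

def binary(index, bits=5):
    # 5-d binary encoding: the person's index written in base 2,
    # most significant bit first.
    return [(index >> b) & 1 for b in reversed(range(bits))]

print(one_hot(0)[:6])  # a 1 in the first slot, zeros elsewhere
print(binary(5))       # [0, 0, 1, 0, 1]
```

With `one_hot`, every person is orthogonal to every other, so any subset is linearly separable from any disjoint subset; with `binary`, e.g. indices 1, 3, and 5 all share the last bit, which is an arbitrary built-in similarity.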
In what ways is the task of predicting 'B' given 'A R' different from predicting a class label given the pixels of an image? Check all that apply.
'A' and 'R' are symbols rather than dense vectors of real numbers.
'B' given 'A R' involves predicting a set of targets from a set of inputs.
In the case of 'A R', the input dimension is the number of possible values for A plus the number of possible values for R; whereas for images, the input dimension is exponentially less than the number of possible values.
The ordering of the elements of A (i.e. the fact that 'John' might correspond to input index 1 and 'Mary' to input index 2) provides no additional information while the spatial ordering of the pixels does provide information.
For E = \frac{1}{2}(y-t)^2, where y = \sigma(z) = \frac{1}{1+\exp(-z)}, derivatives tend to "plateau out" when y is close to 0 or 1.
Which of the following statements are true?
\frac{dE}{dz} = (y-t)*y*(1-y)
A good way to fix the problem is by having a large global learning rate.
\frac{dE}{dz} = (y-t)*y
Using a loss function E = -t\log(y) - (1-t)\log(1-y) will fix the problem because then \frac{dE}{dz} = y-t.
The first option can be seen to be true just by taking derivatives; similarly, the third option can be trivially shown to be wrong. The second option is subtle, but in general this is not a good way to fix this problem, since it will amplify the gradients for training cases that are not close to 0 or 1. The cost function used in the last option is called cross-entropy, and it has a nice-looking derivative that doesn't suffer from this plateau problem. Don't worry if it is not immediately obvious how we arrived at it.
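The plateau is easy to see numerically. This is a quick sketch (a worked example, not course code) comparing the two gradients for a confidently wrong prediction:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dE_dz_squared(z, t):
    # Gradient of E = 1/2 (y - t)^2 w.r.t. z: (y - t) * y * (1 - y).
    y = sigmoid(z)
    return (y - t) * y * (1 - y)

def dE_dz_xent(z, t):
    # Gradient of cross-entropy E = -t log y - (1-t) log(1-y): simply y - t.
    return sigmoid(z) - t

# Target t = 1 but z is very negative, so y = sigmoid(z) is close to 0:
z, t = -10.0, 1.0
print(dE_dz_squared(z, t))  # tiny (~ -4.5e-5): learning has stalled
print(dE_dz_xent(z, t))     # close to -1: still a strong error signal
```

The y(1-y) factor in the squared-error gradient is what crushes it near y = 0 or y = 1; the cross-entropy loss cancels that factor exactly.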
If \mathbf{z} = (z_1, z_2, \ldots, z_k) is the input to a k-way softmax unit, the output distribution is \mathbf{y} = (y_1, y_2, \ldots, y_k), where
y_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}
Which of the following statements are true?
The output distribution would still be the same if the input vector was c\mathbf{z} for any positive constant c.
The output distribution would still be the same if the input vector was c + \mathbf{z} for any positive constant c.
Regarding the first two options:
Let's say we have two z's: z_1 = 2, z_2 = -2. Now let's take a softmax over them: \frac{\exp(z_1)}{\exp(z_1) + \exp(z_2)} = \frac{\exp(2)}{\exp(2)+\exp(-2)}. If we add some positive constant c to each z_i, this becomes:
\frac{\exp(2+c)}{\exp(2+c) + \exp(-2+c)} = \frac{\exp(2)\exp(c)}{(\exp(2)+\exp(-2))\exp(c)} = \frac{\exp(2)}{\exp(2)+\exp(-2)}.
Multiplying each z_i by c gives:
\frac{\exp(2c)}{\exp(2c) + \exp(-2c)} = \frac{\exp(2)^c}{\exp(2)^c + \exp(-2)^c} \neq \frac{\exp(2)}{\exp(2)+\exp(-2)} (for c \neq 1).
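The same algebra can be checked numerically. A minimal sketch (the `softmax` helper is mine): shifting the inputs leaves the output unchanged, while scaling them does not. In fact, the standard stability trick of subtracting max(z) before exponentiating relies on exactly this shift invariance:

```python
import math

def softmax(z):
    # Subtract max(z) before exponentiating for numerical stability;
    # by the shift invariance shown above, this doesn't change the output.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

z = [2.0, -2.0]
c = 3.0
print(softmax(z))                   # baseline distribution
print(softmax([v + c for v in z]))  # shifted by c: same distribution
print(softmax([v * c for v in z]))  # scaled by c: sharper, different
```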
Any probability distribution P over discrete states (P(x) > 0 \ \forall x) can be represented as the output of a softmax unit for some inputs.
Each output of a softmax unit always lies in (0,1).
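The first of these two statements also has a one-line construction: feeding z_i = \log P(x_i) into the softmax recovers P exactly, since \exp cancels \log and the terms already sum to 1. A quick sketch (example distribution is mine):

```python
import math

def softmax(z):
    e = [math.exp(v) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Any strictly positive distribution P is softmax(log P):
P = [0.1, 0.2, 0.7]
z = [math.log(p) for p in P]
print(softmax(z))  # recovers P up to floating-point rounding
```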
Consider the following two networks with no bias weights. The network on the left takes 3 n-length word vectors corresponding to the previous 3 words, computes 3 d-length individual word-feature embeddings and then a k-length joint hidden layer which it uses to predict the 4th word. The network on the right is comparatively simpler in that it takes the previous 3 words and uses them to predict the 4th word.
If n=100,000, d=1,000 and k=10,000, which network has more parameters?
The network on the left.
The network on the right.
The network on the left has 3nd + 3dk + nk parameters, which comes out to 1,330,000,000, while the network on the right has 30,000,000,000 parameters, more than an order of magnitude more. One advantage of the neural representation is that we can get much more compact representations of our data while still making good predictions.
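The parameter counts above can be verified directly from the layer sizes (this is just the arithmetic from the explanation, written out):

```python
n, d, k = 100_000, 1_000, 10_000

# Left network: 3 word-to-embedding matrices (3*n*d), embeddings to the
# joint hidden layer (3*d*k), hidden layer to the n-way output (k*n).
left = 3 * n * d + 3 * d * k + k * n

# Right network: 3 one-hot input words wired directly to the n-way output.
right = 3 * n * n

print(left)   # 1330000000
print(right)  # 30000000000
```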