The complete course Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization is currently offered by DeepLearning.AI through the Coursera platform.
About this Course
In the second course of the Deep Learning Specialization, you will open the deep learning black box to understand the processes that drive performance and generate good results systematically.
By the end, you will learn the best practices to train and develop test sets and analyze bias/variance for building deep learning applications; be able to use standard neural network techniques such as initialization, L2 and dropout regularization, hyperparameter tuning, batch normalization, and gradient checking; implement and apply a variety of optimization algorithms, such as minibatch gradient descent, Momentum, RMSprop and Adam, and check for their convergence; and implement a neural network in TensorFlow.
Instructors:
– Andrew Ng
– Kian Katanforoosh
– Younes Bensouda Mourri
Skills You Will Gain
 TensorFlow
 Deep Learning
 Mathematical Optimization
 Hyperparameter Tuning
Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization Week 1 Quiz Answers – Coursera!
Practice Exercise – Practical aspects of Deep Learning
Question 1) If you have 10,000,000 examples, how would you split the train/dev/test set?
 98% train. 1% dev. 1% test
 60% train. 20% dev. 20% test
 33% train. 33% dev. 33% test
Question 2) The dev and test set should:
 Come from the same distribution
 Have the same number of examples
 Come from different distributions
 Be identical to each other (same (x,y) pairs)
Question 3) If your Neural Network model seems to have high bias, which of the following would be promising things to try? (Check all that apply.)
 Get more test data
 Add regularization
 Get more training data
 Make the Neural Network deeper
 Increase the number of units in each hidden layer
Question 4) You are working on an automated checkout kiosk for a supermarket, and are building a classifier for apples, bananas and oranges.
Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)
 Get more training data
 Use a bigger neural network
 Increase the regularization parameter lambda
 Decrease the regularization parameter lambda
Question 5) What is weight decay?
 The process of gradually decreasing the learning rate during training.
 A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.
 Gradual corruption of the weights in the neural network if it is trained on noisy data.
 A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.
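A minimal sketch of why L2 regularization is described as weight decay: the regularized gradient gains an extra (lambda/m)·w term, so every update first multiplies the weight by a factor slightly below 1. The function name, the scalar weight, and the numbers are illustrative assumptions, not course code.

```python
def l2_update(w, grad, learning_rate, lambd, m):
    """One gradient descent step with L2 regularization on a scalar weight.

    The regularized gradient is grad + (lambd / m) * w, so the update equals
    (1 - learning_rate * lambd / m) * w - learning_rate * grad.
    """
    return w - learning_rate * (grad + (lambd / m) * w)


# With a zero data gradient, only the decay factor (here 0.95) acts on w.
w = 2.0
for _ in range(3):
    w = l2_update(w, grad=0.0, learning_rate=0.1, lambd=5.0, m=10)
print(w)
```

Raising lambd makes the decay factor smaller, which is one way to see why a larger lambda pushes the weights toward 0.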
Question 6) What happens when you increase the regularization hyperparameter lambda?
 Weights are pushed toward becoming smaller (closer to 0)
 Doubling lambda should roughly result in doubling the weights
 Weights are pushed toward becoming bigger (further from 0)
 Gradient descent taking bigger steps with each iteration (proportional to lambda)
Question 7) With the inverted dropout technique, at test time:
 You apply dropout (randomly eliminating units) and do not keep the 1/keep_prob factor in the calculations used in training
 You do not apply dropout (do not randomly eliminate units), but keep the 1/keep_prob factor in the calculations used in training.
 You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training
 You apply dropout (randomly eliminating units) but keep the 1/keep_prob factor in the calculations used in training
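The difference between training and test time in inverted dropout can be sketched as follows. This is an illustrative NumPy version under assumed names (`dropout_forward`, the `training` flag), not the course's reference implementation.

```python
import numpy as np


def dropout_forward(a, keep_prob, training, rng):
    """Apply inverted dropout to activations `a`.

    Training: zero each unit with probability 1 - keep_prob, then divide by
    keep_prob so the expected activation is unchanged.
    Test time: return `a` untouched (no mask, no 1/keep_prob scaling).
    """
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones((4, 1000))
train_out = dropout_forward(a, keep_prob=0.8, training=True, rng=rng)
test_out = dropout_forward(a, keep_prob=0.8, training=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

Because the scaling already happens during training, the mean of `train_out` stays close to the mean of `a`, and nothing needs to be rescaled at test time.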
Question 8) Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following: (Check the two that apply)
 Increasing the regularization effect
 Reducing the regularization effect
 Causing the neural network to end up with a higher training set error
 Causing the neural network to end up with a lower training set error
Question 9) Which of these techniques are useful for reducing variance (reducing overfitting)? (Check all that apply.)
 Vanishing gradient
 Gradient Checking
 Xavier initialization
 Dropout
 Data augmentation
 L2 regularization
 Exploding gradient
Question 10) Why do we normalize the inputs x?
 It makes the parameter initialization faster
 It makes it easier to visualize the data
 It makes the cost function faster to optimize
 Normalization is another word for regularization–It helps to reduce variance
Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization Week 2 Quiz Answers
Practice Exercise – Optimization Algorithms
Question 1) Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
 a[3]{7}(8)
 a[8]{7}(3)
 a[3]{8}(7)
 a[8]{3}(7)
Question 2) Which of these statements about minibatch gradient descent do you agree with?
 Training one epoch (one pass through the training set) using minibatch gradient descent is faster than training one epoch using batch gradient descent.
 You should implement minibatch gradient descent without an explicit for-loop over different minibatches, so that the algorithm processes all minibatches at the same time (vectorization).
 One iteration of minibatch gradient descent (computing on a single minibatch) is faster than one iteration of batch gradient descent.
Question 3) Why is the best minibatch size usually not 1 and not m, but instead something in between?
 If the minibatch size is m, you end up with stochastic gradient descent, which is usually slower than minibatch gradient descent.
 If the minibatch size is 1, you end up having to process the entire training set before making any progress.
 If the minibatch size is 1, you lose the benefits of vectorization across examples in the minibatch.
 If the minibatch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress. —> Correct
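As a rough illustration of the trade-off discussed in these questions, the sketch below splits a training set into minibatches. Each minibatch is still processed in vectorized form, while the loop over minibatches stays explicit. The helper name `random_minibatches` and the data shapes are assumptions made for this example.

```python
import numpy as np


def random_minibatches(X, Y, batch_size, rng):
    """Shuffle the columns (examples) of X and Y and yield minibatches."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    for start in range(0, m, batch_size):
        yield X[:, start:start + batch_size], Y[:, start:start + batch_size]


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1000))           # 5 features, m = 1000 examples
Y = rng.integers(0, 2, size=(1, 1000))   # binary labels
sizes = [xb.shape[1] for xb, _ in random_minibatches(X, Y, 64, rng)]
print(len(sizes), sizes[-1])  # 16 40
```

With batch_size equal to 1 every minibatch holds a single column, so there is nothing to vectorize across; with batch_size equal to m the generator yields one batch containing the whole training set.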
Question 4) Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
Which of the following do you agree with?
 If you’re using minibatch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
 If you’re using minibatch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
 Whether you’re using batch gradient descent or minibatch gradient descent, something is wrong.
 Whether you’re using batch gradient descent or minibatch gradient descent, this looks acceptable.
Question 5) Suppose the temperatures in Casablanca over the first three days of January are the same:
Jan 1st: θ_1 = 10°C
Jan 2nd: θ_2 = 10°C
(We used Fahrenheit in lecture, so will use Celsius here in honor of the metric world.)
Say you use an exponentially weighted average with β = 0.5 to track the temperature: v_0 = 0, v_t = β v_{t−1} + (1 − β) θ_t. If v_2 is the value computed after day 2 without bias correction, and v_2^corrected is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don’t actually need one. Remember what bias correction is doing.)
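The arithmetic in this question can be checked with a few lines of Python. This is only a sketch of the recurrence stated above; the function name `ewa` is an assumption for the example.

```python
def ewa(thetas, beta):
    """Return (v_t, bias-corrected v_t) after processing all of `thetas`."""
    v = 0.0
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
    t = len(thetas)
    return v, v / (1 - beta ** t)


v2, v2_corrected = ewa([10.0, 10.0], beta=0.5)
print(v2, v2_corrected)  # 7.5 10.0
```

Starting from v_0 = 0 drags the early averages toward zero (v_1 = 5, v_2 = 7.5). Dividing by 1 − β^t removes that startup bias, which recovers 10 when every observation is 10.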
Question 6) Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
Question 7) You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: v_t = β v_{t−1} + (1 − β) θ_t. The red line below was computed using β = 0.9. What would happen to your red curve as you vary β? (Check the two that apply)
 Decreasing β will shift the red line slightly to the right.
 Increasing β will create more oscillations within the red line.
 Increasing β will shift the red line slightly to the right.
 Decreasing β will create more oscillation within the red line.
Question 8) Consider this figure:
These plots were generated with gradient descent, with gradient descent with momentum (β = 0.5), and with gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?
 (1) is gradient descent with momentum (small β), (2) is gradient descent with momentum (small β), (3) is gradient descent
 (1) is gradient descent. (2) is gradient descent with momentum (large β) . (3) is gradient descent with momentum (small β)
 (1) is gradient descent with momentum (small β). (2) is gradient descent. (3) is gradient descent with momentum (large β)
 (1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)
Question 9) Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J(W[1],b[1],…,W[L],b[L]). Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)
 Try minibatch gradient descent
 Try better random initialization for the weights
 Try tuning the learning rate α
 Try initializing all the weights to zero
Question 10) Which of the following statements about Adam is False?
 Adam should be used with batch gradient computations, not with minibatches.
 The learning rate hyperparameter α in Adam usually needs to be tuned.
 We usually use “default” values for the hyperparameters β1, β2 and ε in Adam (β1 = 0.9, β2 = 0.999, ε = 10^−8)
 Adam combines the advantages of RMSProp and momentum
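To make the statements about Adam concrete, here is a minimal scalar sketch of one Adam update using the default β1, β2 and ε values quoted above. The function name, the toy cost J(w) = w^2, and the learning rate of 0.05 are assumptions for illustration only.

```python
import math


def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # momentum-style first moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop-style second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction for m
    v_hat = v / (1 - beta2 ** t)              # bias correction for v
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Minimize the toy cost J(w) = w^2 (gradient 2w), starting from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.05)
print(w)
```

Nothing in the update depends on whether `grad` came from the full batch or from a minibatch, and alpha is the one argument changed from its default here.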
Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization Week 3 Quiz Answers
Practice Exercise – Hyperparameter tuning, Batch Normalization, Programming Frameworks
Question 1) If searching among a large number of hyperparameters, you should try values in a grid rather than random values, so that you can carry out the search more systematically and not rely on chance. True or False?
Question 2) Every hyperparameter, if set poorly, can have a huge negative impact on training, and so all hyperparameters are about equally important to tune well. True or False?
Question 3) During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:
 The amount of computational power you can access
 The number of hyperparameters you have to tune
 Whether you use batch or minibatch optimization
 The presence of local minima (and saddle points) in your neural network
Question 4) If you think β (hyperparameter for momentum) is between 0.9 and 0.99, which of the following is the recommended way to sample a value for beta?
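import numpy as np
r = np.random.rand()  # uniform in [0, 1)
beta = 1 - 10 ** (-r - 1)  # 1 - beta lies in (0.01, 0.1], so beta lies in [0.9, 0.99)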
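A short, hedged comparison of log-scale and linear-scale sampling for β in the range 0.9 to 0.99. Sampling 1 − β on a log scale between 0.01 and 0.1 spends more of the search budget near 0.99, where small changes in β matter most. The threshold 0.98 and the sample count are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(10_000)          # uniform samples in [0, 1)

beta_log = 1 - 10 ** (-r - 1)   # 1 - beta spread evenly in log space
beta_lin = 0.9 + 0.09 * r       # beta spread evenly in linear space

# Fraction of samples with beta >= 0.98, i.e. 1 - beta <= 0.02.
frac_log = np.mean(beta_log >= 0.98)
frac_lin = np.mean(beta_lin >= 0.98)
print(frac_log, frac_lin)
```

Roughly 30% of the log-scale samples land at or above 0.98, compared with about 11% of the linear-scale samples.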
Question 5) Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again. True or false?
Question 6) In batch normalization as presented in the videos, if you apply it on the lth layer of your neural network, what are you normalizing?
Question 7) In the normalization formula, why do we use epsilon?
 To speed up convergence
 To have a more accurate normalization
 In case μ is too small
 To avoid division by zero
Question 8) Which of the following statements about γ and β in Batch Norm are true? Only correct options listed
 They can be learned using Adam, Gradient descent with momentum, or RMSprop, not just with gradient descent.
 They set the mean and variance of the linear variable z[l] of a given layer.
 The optimal values are γ = √(σ^2 + ε), and β = μ.
 β and γ are hyperparameters of the algorithm, which we tune via random sampling.
 There is one global value of γ and one global value of β for each layer, which applies to all the hidden units in that layer.
Question 9) After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:
 Use the most recent minibatch’s value of μ and σ^2 to perform the needed normalizations.
 Skip the step where you normalize using μ and σ^2 since a single test example cannot be normalized.
 Perform the needed normalizations, use μ and σ^2 estimated using an exponentially weighted average across minibatches seen during training.
 If you implemented Batch Norm on minibatches of (say) 256 examples, then to evaluate on one test example, duplicate that example 256 times so that you’re working with a minibatch the same size as during training.
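The role of ε and the test-time behaviour of Batch Norm can be sketched together. This is an illustrative NumPy version, not the course's implementation: the `running` dictionary, the `momentum` parameter (the weight of the exponentially weighted average), and the data are all assumptions for the example.

```python
import numpy as np


def batchnorm_forward(z, gamma, beta, running, training, momentum=0.9, eps=1e-8):
    """Batch norm for z of shape (units, examples).

    `running` is a dict holding exponentially weighted averages of the
    per-unit mean and variance; it is updated in place during training.
    """
    if training:
        mu = z.mean(axis=1, keepdims=True)
        var = z.var(axis=1, keepdims=True)
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mu"], running["var"]
    z_norm = (z - mu) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * z_norm + beta


rng = np.random.default_rng(0)
units = 3
gamma, beta = np.ones((units, 1)), np.zeros((units, 1))
running = {"mu": np.zeros((units, 1)), "var": np.ones((units, 1))}

for _ in range(200):                       # training on minibatches of 256
    z = rng.normal(loc=5.0, scale=2.0, size=(units, 256))
    out = batchnorm_forward(z, gamma, beta, running, training=True)

single = rng.normal(loc=5.0, scale=2.0, size=(units, 1))   # one test example
test_out = batchnorm_forward(single, gamma, beta, running, training=False)
print(test_out.ravel())
```

At test time the single example is normalized with the running estimates collected during training, so no minibatch statistics are needed for it.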
Question 10) Which of these statements about deep learning programming frameworks are true? (Check all that apply)
 Deep learning programming frameworks require cloud-based machines to run.
 A programming framework allows you to code up deep learning algorithms with typically fewer lines of code than a lower-level language such as Python.
 Even if a project is currently open source, good governance of the project helps ensure that it remains open even in the long term, rather than become closed or modified to benefit only one company.