Interactive Quiz
Test your knowledge!
1. What is the main difference between classic gradient descent and stochastic gradient descent (SGD)?
A. Classic gradient descent uses an adaptive learning rate while SGD uses a fixed learning rate.
B. Classic gradient descent calculates the loss over the entire dataset while SGD calculates the loss over a mini-batch of data.
C. Classic gradient descent uses momentum while SGD does not.
D. Classic gradient descent is faster than SGD on large datasets.
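For reference on question 1, here is a minimal sketch contrasting a full-batch gradient step with a mini-batch (stochastic) step. The names (grad_loss, w, X, y, lr, batch_size) and the linear-regression gradient are illustrative assumptions, not taken from the course:

```python
import numpy as np

def grad_loss(w, X, y):
    # Gradient of mean squared error for a linear model (illustrative choice).
    return 2 * X.T @ (X @ w - y) / len(y)

def full_batch_step(w, X, y, lr=0.01):
    # Classic gradient descent: the gradient is computed over the entire dataset.
    return w - lr * grad_loss(w, X, y)

def sgd_step(w, X, y, lr=0.01, batch_size=32):
    # Stochastic gradient descent: the gradient is estimated on a random mini-batch.
    idx = np.random.choice(len(y), size=min(batch_size, len(y)), replace=False)
    return w - lr * grad_loss(w, X[idx], y[idx])
```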
2. What is the main effect of adding the momentum term in stochastic gradient descent with momentum?
A. It allows reducing the mini-batch size without performance loss.
B. It automatically adapts the learning rate for each parameter of the model.
C. It retains the previous optimization direction in memory to accelerate convergence and traverse flat regions more efficiently.
D. It completely eliminates oscillations in the gradient descent trajectory.
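For question 2, a hedged sketch of one common formulation of the momentum update (the coefficient beta and the variable names are illustrative; exact formulations vary between frameworks):

```python
def sgd_momentum_step(w, velocity, grad, lr=0.01, beta=0.9):
    # The velocity accumulates an exponentially weighted history of past gradients,
    # so previous update directions keep influencing the current step.
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity
```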
3. What major problem does Adagrad encounter during model training?
A. It requires a large number of hyperparameters to be tuned.
B. The learning rate can become too large, leading to model divergence.
C. The learning rate continuously decreases, which can slow down convergence or prevent final convergence.
D. It does not perform well on noisy data.
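To illustrate the behavior question 3 asks about, a minimal Adagrad update sketch (variable names such as accum are hypothetical):

```python
import numpy as np

def adagrad_step(w, accum, grad, lr=0.01, eps=1e-8):
    # accum only ever grows, so the effective step size lr / sqrt(accum)
    # keeps shrinking over the course of training.
    accum = accum + grad ** 2
    return w - lr * grad / (np.sqrt(accum) + eps), accum
```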
4. How does RMSProp improve upon the Adagrad optimizer?
A. RMSProp uses an exponentially decaying average of squared gradients instead of their cumulative sum, which prevents the learning rate from shrinking too much.
B. RMSProp adds a momentum term to accelerate convergence.
C. RMSProp completely eliminates the need to choose a learning rate.
D. RMSProp calculates the loss over the entire dataset at each training step.
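For question 4, a sketch of the RMSProp update under the same illustrative naming as the Adagrad sketch above (rho is the decay rate of the running average):

```python
import numpy as np

def rmsprop_step(w, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
    # An exponentially decaying average of squared gradients replaces Adagrad's
    # cumulative sum, so the denominator can shrink again when gradients get small.
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq
```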
5. Why is Adam often recommended as the default optimizer?
A. Because it does not require any hyperparameter tuning.
B. Because it combines momentum and RMSProp-style adaptive learning rates, enabling fast convergence and good performance even on noisy data.
C. Because it uses a fixed learning rate that guarantees convergence.
D. Because it requires less memory than classic stochastic gradient descent.
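For question 5, a compact sketch of Adam's update, combining the momentum and RMSProp ideas from the previous questions (the default hyperparameters shown are the commonly cited ones, but treat the exact values and names as illustrative):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: momentum-style average of gradients; v: RMSProp-style average of squared gradients.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```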