Interactive Quiz
Test your knowledge!
1. What is the main difference between an autoregressive model (like GPT) and an encoder model (like BERT) in natural language processing?
A. The autoregressive model predicts masked words in a sentence, while the encoder model predicts the next word.
B. The autoregressive model generates a token based solely on previous tokens, whereas the encoder model considers both left and right context of a token.
C. The encoder model works only for translation, while the autoregressive model works for all NLP tasks.
D. The autoregressive model uses a full architecture with cross-attention, unlike the encoder model.
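For reference on question 1, a minimal PyTorch sketch of the two training objectives (token ids and the 15% masking rate are illustrative assumptions, not from the course):

import torch

tokens = torch.tensor([5, 2, 9, 7, 3])

# Autoregressive (GPT-style) objective: predict the next token from the previous ones only.
ar_inputs, ar_targets = tokens[:-1], tokens[1:]

# Masked (BERT-style) objective: hide some tokens and predict them using both left and right context.
MASK_ID = 0                               # assumption: 0 is the [MASK] id
mask = torch.rand(tokens.shape) < 0.15    # mask roughly 15% of positions
mlm_inputs = tokens.clone()
mlm_inputs[mask] = MASK_ID
mlm_targets = tokens                      # only the masked positions are scored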
2. What is the role of the Query (Q), Key (K), and Value (V) matrices in the self-attention mechanism of a transformer?
A. Q represents what each position is looking for, K what it contains, and V the actual value to extract if deemed relevant.
B. Q is a masked version of the input, K is a normalized version, and V is the final output.
C. Q, K, and V are identical matrices used to compute a weighted average.
D. Q contains the weights, K contains the biases, and V contains the network activations.
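For reference on question 2, a minimal single-head self-attention sketch in PyTorch (dimensions and layer names are illustrative assumptions):

import torch
import torch.nn.functional as F

B, T, C = 1, 8, 32                      # batch, sequence length, embedding size
x = torch.randn(B, T, C)

Wq, Wk, Wv = (torch.nn.Linear(C, C, bias=False) for _ in range(3))
q = Wq(x)                               # what each position is looking for
k = Wk(x)                               # what each position contains
v = Wv(x)                               # what each position offers if selected

scores = q @ k.transpose(-2, -1) / C ** 0.5   # (B, T, T) affinities between positions
weights = F.softmax(scores, dim=-1)
out = weights @ v                             # (B, T, C) weighted mix of the values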
3. In the transformer architecture, what is the purpose of residual connections between attention and feed-forward layers?
A. They allow increasing the model size without changing its depth.
B. They facilitate the training of deep models by preventing the vanishing gradient problem.
C. They normalize activations between layers.
D. They mask future tokens in the decoder.
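For reference on question 3, a sketch of a transformer block with residual (skip) connections; the "x +" terms let gradients flow directly through a deep stack (layer sizes are illustrative assumptions):

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.ff(self.ln2(x))                        # residual around feed-forward
        return x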
4. In the implementation of a bigram language model, what is the main limitation that explains the poor quality of generated texts?
A. It predicts the next character based solely on a single context character.
B. It uses an encoder architecture instead of a decoder.
C. It applies incorrect masking of future tokens.
D. It lacks the normalization layer (layer norm).
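For reference on question 4, a sketch of a character-level bigram model: each character's embedding directly gives the logits of the next character, so only one character of context is ever used (class and variable names are illustrative):

import torch
import torch.nn as nn

class BigramLM(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):               # idx: (B, T) character ids
        return self.table(idx)            # (B, T, vocab_size) logits for the next character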
5. What is the main difference between the self-attention layer used in a decoder and the one in a transformer encoder?
A. The encoder layer applies a lower triangular mask, the decoder layer does not.
B. The decoder layer masks future tokens via a lower triangular matrix, whereas the encoder layer does not mask.
C. The encoder layer uses cross-attention, the decoder layer does not.
D. The decoder layer uses multiple attention heads, the encoder layer uses only one.
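For reference on question 5, a sketch of the causal mask used on the decoder side (sizes are illustrative assumptions):

import torch
import torch.nn.functional as F

T = 5
scores = torch.randn(T, T)                       # raw Q K^T / sqrt(d) affinities
tril = torch.tril(torch.ones(T, T))              # 1s on and below the diagonal
masked = scores.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)              # row t only attends to positions <= t
# An encoder block skips the masked_fill, so every token attends to the full sequence.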
6. In the Vision Transformer (ViT), how are images processed before being passed into the transformer?
A. Images are transformed into sequences of individual pixels.
B. Images are divided into fixed patches (e.g., 16x16), flattened, and then projected into an embedding space.
C. Images are transformed into feature maps by a CNN before the transformer.
D. Images are converted to grayscale before being processed.
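For reference on question 6, a sketch of ViT-style patch embedding (image size, patch size, and embedding dimension are illustrative assumptions):

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                   # (B, C, H, W)
patch, dim = 16, 768

# A Conv2d with kernel = stride = patch size applies one linear projection per 16x16 patch.
to_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_embed(img)                              # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)          # (1, 196, 768): one token per patch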
7. What is the purpose of the 'class token' in the Vision Transformer?
A. To enable text generation from an image.
B. To provide a special token dedicated to classification, avoiding the need to aggregate all transformer outputs.
C. To replace position embedding in the transformer.
D. To enable masking of patches in the image.
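For reference on question 7, a sketch of the class token: a learnable embedding prepended to the patch tokens, whose output alone feeds the classification head (sizes are illustrative assumptions):

import torch

B, N, dim = 1, 196, 768
patch_tokens = torch.randn(B, N, dim)

cls_token = torch.nn.Parameter(torch.zeros(1, 1, dim))
x = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)   # (B, 197, dim)
# ... transformer blocks run on x ...
cls_out = x[:, 0]          # (B, dim): the single vector used for classification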
8. What is the main innovation of the Swin Transformer compared to the Vision Transformer?
A. Use of an encoder and a decoder in the same architecture.
B. Attention computed only within local windows arranged hierarchically, with shifted windowing between successive layers.
C. Switching from multi-head attention to a single attention head.
D. Exclusive use of convolutions instead of feed-forward layers.
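For reference on question 8, a sketch of window partitioning and the cyclic shift applied between consecutive Swin blocks (feature-map and window sizes are illustrative assumptions):

import torch

B, H, W, C = 1, 8, 8, 32     # batch, height, width, channels (in patch units)
window = 4                   # each window holds 4x4 patches

def partition_windows(x, w):
    # (B, H, W, C) -> (num_windows*B, w*w, C): attention is computed inside each window only.
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

x = torch.randn(B, H, W, C)

# Layer l: attention within regular windows.
windows = partition_windows(x, window)            # (4, 16, 32)

# Layer l+1: cyclically shift the feature map by half a window before partitioning,
# so information can flow across the previous window boundaries.
shifted = torch.roll(x, shifts=(-window // 2, -window // 2), dims=(1, 2))
shifted_windows = partition_windows(shifted, window)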
9. What is the advantage of relative position embedding in the Swin Transformer?
A. It replaces the attention mechanism.
B. It better captures spatial relationships between patches and adapts the model to different image resolutions.
C. It masks irrelevant patches in a window.
D. It increases the model's capacity to handle large images.
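For reference on question 9, a sketch of a relative position bias table for one window: each pair of positions looks up a learned bias indexed by their relative offset, which is added to the attention scores before the softmax (window size and head count are illustrative assumptions):

import torch

M = 7            # window size: 7x7 patches per window
num_heads = 3

# One learnable bias per possible relative offset (2M-1 choices per axis) and per head.
bias_table = torch.nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

# Precompute, for every pair of positions inside the window, the index of its relative offset.
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
coords = coords.flatten(1)                                   # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]                # (2, M*M, M*M) relative (dy, dx)
rel = rel.permute(1, 2, 0) + (M - 1)                         # shift offsets to start at 0
rel_index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]        # (M*M, M*M)

# At attention time, the gathered bias is added to Q K^T / sqrt(d) before the softmax.
bias = bias_table[rel_index].permute(2, 0, 1)                # (num_heads, M*M, M*M)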
10. What is the training principle of the CLIP model that associates text and image?
A. Supervised training with precise annotation of objects in images.
B. Contrastive training on positive (image-description) pairs and negative pairs, maximizing similarity for matching pairs and minimizing it for mismatched ones.
C. Generation of images from textual descriptions.
D. Prediction of the next text token from an image.
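For reference on question 10, a sketch of a CLIP-style contrastive objective on a batch of N matching (image, text) pairs: the diagonal of the similarity matrix holds the positive pairs, everything else acts as negatives (batch size, embedding dimension, and temperature are illustrative assumptions):

import torch
import torch.nn.functional as F

N, dim = 4, 512
img_emb = F.normalize(torch.randn(N, dim), dim=-1)   # stand-in for image encoder outputs
txt_emb = F.normalize(torch.randn(N, dim), dim=-1)   # stand-in for text encoder outputs

logits = img_emb @ txt_emb.t() / 0.07                # cosine similarities / temperature
labels = torch.arange(N)                             # image i matches description i
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2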