Deep Learning — MCQ Practice Questions & Answers

Question 1

In a neural network, the vanishing gradient problem is MOST commonly associated with which activation function?

Accepted Answer

Correct answer: B. Sigmoid. The sigmoid activation function squashes inputs into the range (0, 1). Its derivative is at most 0.25, so during backpropagation through many layers the gradient is repeatedly multiplied by these small derivative factors and shrinks exponentially with depth; this is the vanishing gradient problem. ReLU and Leaky ReLU were introduced partly to mitigate this issue.
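The 0.25 bound and the exponential shrinkage can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the weights are ignored so that only the sigmoid derivatives contribute to the product.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s * (1 - s), which peaks at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def best_case_gradient_factor(num_layers: int) -> float:
    """Largest possible product of sigmoid derivatives over num_layers layers.

    Illustrative assumption: weights are ignored, so only the activation
    derivatives contribute to the product.
    """
    return sigmoid_grad(0.0) ** num_layers


max_grad = sigmoid_grad(0.0)               # maximum of the derivative (0.25)
factor_10 = best_case_gradient_factor(10)  # 0.25 ** 10
print(max_grad, factor_10)
```

Even in this best case, ten sigmoid layers scale the gradient by less than one millionth; away from x = 0 the derivative is smaller still.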

Answer

A. ReLU

Answer

C. Leaky ReLU

Answer

D. Softmax

Question 2

Which of the following best describes the role of the 'stride' parameter in a Convolutional Neural Network (CNN)?

Accepted Answer

Correct answer: B. It controls how many pixels the filter moves at each step during convolution. Stride specifies the number of pixels by which the convolutional filter shifts across the input at each step. A stride of 1 moves the filter one pixel at a time, while a stride of 2 moves it two pixels, which roughly halves each spatial dimension of the output feature map.
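To make the effect of stride concrete, the following sketch lists where a filter would be placed along a single axis. It assumes no padding, and `filter_positions` is a hypothetical helper written for this illustration, not part of any library.

```python
from typing import List


def filter_positions(input_size: int, filter_size: int, stride: int) -> List[int]:
    """Start indices at which the filter is placed along one axis (no padding)."""
    return list(range(0, input_size - filter_size + 1, stride))


# A 3-wide filter sliding over a 7-wide input.
positions_s1 = filter_positions(7, 3, 1)  # one output value per start index
positions_s2 = filter_positions(7, 3, 2)  # every second start index is skipped
print(len(positions_s1), len(positions_s2))
```

The number of start indices is the output size along that axis, so a larger stride yields a smaller feature map.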

Answer

A. It determines the number of filters applied to the input.

Answer

C. It sets the depth of the convolutional layer.

Answer

D. It defines the size of the padding added around the input.

Question 3

In the context of Batch Normalization, what is normalized during the forward pass of training?

Accepted Answer

Correct answer: C. The activations of each layer using the mean and variance computed over the current mini-batch. Batch Normalization normalizes the activations of a layer (applied either before or after the nonlinearity) by computing the mean \mu_B and variance \sigma_B^2 of each feature over the current mini-batch. The normalized value is \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, which is then scaled and shifted by the learnable parameters \gamma and \beta to give y = \gamma \hat{x} + \beta.
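The training-time forward pass can be sketched for a single feature as follows. This is a minimal illustration, assuming a plain list of floats as the mini-batch and fixed values for \gamma and \beta; `batch_norm_1d` is a hypothetical name, and running statistics for inference are deliberately left out.

```python
import math
from typing import List


def batch_norm_1d(batch: List[float], gamma: float = 1.0,
                  beta: float = 0.0, eps: float = 1e-5) -> List[float]:
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n                          # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / n  # biased mini-batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm_1d([1.0, 2.0, 3.0, 4.0])
out_mean = sum(normalized) / len(normalized)
out_var = sum((x - out_mean) ** 2 for x in normalized) / len(normalized)
print(out_mean, out_var)
```

With \gamma = 1 and \beta = 0 the output has mean close to 0 and variance close to 1; the small gap from 1 comes from \epsilon.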

Answer

A. The weights of each layer

Answer

B. The activations of each layer using the mean and variance of the entire dataset

Answer

D. The learning rate at each iteration

Question 4

What is the primary purpose of the 'dropout' regularization technique in deep learning?

Accepted Answer

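Correct answer: B. To randomly deactivate neurons during training to prevent overfitting. During each training forward pass, dropout sets each neuron activation to zero independently with probability p. This prevents co-adaptation of neurons, forcing the network to learn more robust and distributed representations. At inference time, all neurons are active, and in the original formulation the weights are scaled by (1 - p) to compensate. Many modern implementations instead use inverted dropout, which scales the surviving activations by 1 / (1 - p) during training so that no rescaling is needed at inference.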
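The (1 - p) compensation can be checked with a small simulation. The sketch below follows the classic formulation described above; the function names are hypothetical, and scaling the activations stands in for scaling the outgoing weights, which has the same effect on the next layer's input.

```python
import random
from typing import List


def dropout_train(activations: List[float], p: float,
                  rng: random.Random) -> List[float]:
    """Zero each activation independently with probability p (no rescaling)."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_inference(activations: List[float], p: float) -> List[float]:
    """Keep every unit, scaled by (1 - p) to match the training-time average."""
    return [a * (1.0 - p) for a in activations]


rng = random.Random(0)   # fixed seed so the run is repeatable
acts = [1.0] * 10000
p = 0.5
train_mean = sum(dropout_train(acts, p, rng)) / len(acts)
infer_mean = sum(dropout_inference(acts, p)) / len(acts)
print(train_mean, infer_mean)
```

Averaged over many units, the training-time output and the scaled inference-time output agree closely, which is the point of the compensation.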

Answer

A. To increase the learning rate adaptively

Answer

C. To remove neurons with the smallest weights permanently

Answer

D. To normalize the output of each layer

Question 5

Which optimizer uses both the first moment (mean) and the second moment (uncentered variance) of gradients to adapt the learning rate for each parameter?

Accepted Answer

Correct answer: D. Adam. Adam (Adaptive Moment Estimation) maintains exponentially decaying moving averages of past gradients m_t (first moment) and past squared gradients v_t (second moment). The update rule is \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t, where \hat{m}_t and \hat{v}_t are the bias-corrected estimates of m_t and v_t. This combines the benefits of momentum and RMSProp.
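A single-parameter version of the update makes the roles of the two moments easier to see. This is a sketch under stated assumptions: `adam_step` is a hypothetical helper, the default hyperparameters are the commonly quoted ones, and the objective f(\theta) = \theta^2 is chosen only for illustration.

```python
import math
from typing import Tuple


def adam_step(theta: float, grad: float, m: float, v: float, t: int,
              lr: float = 0.001, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[float, float, float]:
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimize f(theta) = theta ** 2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
first_theta = None
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
    if t == 1:
        first_theta = theta
print(first_theta, theta)
```

On the first step the bias-corrected ratio \hat{m}_t / \sqrt{\hat{v}_t} has magnitude close to 1, so the parameter moves by roughly the learning rate regardless of the gradient's scale.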

Answer

A. SGD with momentum

Answer

B. RMSProp

Answer

C. Adagrad

Question 6

In an LSTM (Long Short-Term Memory) network, which gate is responsible for deciding what information to discard from the cell state?

Accepted Answer

Correct answer: C. Forget gate. The forget gate in an LSTM uses a sigmoid activation to output a value between 0 and 1 for each element of the previous cell state C_{t-1}. A value near 0 means 'forget this', while a value near 1 means 'keep this'. It is computed as f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) and multiplies C_{t-1} element-wise.
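The forget-gate computation can be sketched with plain lists. The weights, biases, and sizes below are made-up values chosen so that one gate saturates near 1 and the other near 0; the helper names are hypothetical, and only the forget term of the cell-state update is shown, not the input gate's contribution.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forget_gate(W_f: List[List[float]], b_f: List[float],
                h_prev: List[float], x_t: List[float]) -> List[float]:
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell-state element."""
    concat = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    return [sigmoid(sum(w * z for w, z in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]


def apply_forget(f_t: List[float], c_prev: List[float]) -> List[float]:
    """Element-wise product f_t * C_{t-1}: the part of the old state that is kept."""
    return [f * c for f, c in zip(f_t, c_prev)]


# Hidden size 2, input size 1; illustrative values only.
W_f = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_f = [10.0, -10.0]
f_t = forget_gate(W_f, b_f, h_prev=[0.1, -0.2], x_t=[0.5])
retained = apply_forget(f_t, c_prev=[5.0, 5.0])
print(f_t, retained)
```

The first cell-state element is almost fully kept and the second is almost fully discarded, matching the 'keep' and 'forget' reading of the gate values.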

Answer

A. Input gate

Answer

B. Output gate

Answer

D. Update gate

Question 7

The output size of a convolutional layer is given by \lfloor \frac{n + 2p - f}{s} + 1 \rfloor, where n is the input size, p is padding, f is filter size, and s is stride. For an input of size 32 \times 32, filter 5 \times 5, padding 0, and stride 1, what is the output size?

Accepted Answer

Correct answer: B. 28 \times 28. Applying the formula: \lfloor \frac{32 + 2(0) - 5}{1} + 1 \rfloor = \lfloor \frac{27}{1} + 1 \rfloor = 28. So the output feature map is 28 \times 28.
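The same arithmetic can be wrapped in a small function to try other settings. `conv_output_size` is a hypothetical helper for one spatial axis; it uses integer floor division, which matches the floor in the formula because adding 1 does not change where the floor falls.

```python
def conv_output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output size along one axis: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


size_question = conv_output_size(32, 5, p=0, s=1)  # the case in the question
size_stride_2 = conv_output_size(32, 5, p=0, s=2)  # larger stride, smaller map
size_padded = conv_output_size(32, 5, p=2, s=1)    # padding 2 preserves the size
print(size_question, size_stride_2, size_padded)
```

With f = 5 and s = 1, a padding of 2 keeps the output at 32, which is why option C would only be right for a padded convolution.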

Answer

A. 30 \times 30

Answer

C. 32 \times 32

Answer

D. 27 \times 27

Question 8

Which of the following is the key architectural difference that distinguishes a Transformer model from a traditional RNN?

Accepted Answer

Correct answer: B. Transformers process all input tokens in parallel using self-attention, unlike RNNs, which process them sequentially. The core innovation of the Transformer architecture is the self-attention mechanism, which lets every token in a sequence attend to every other token in a single step. This parallelism removes the sequential dependency of RNNs/LSTMs, enabling much faster training and making long-range dependencies easier to capture.
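A stripped-down version of self-attention shows how every token is compared with every other token at once. This sketch makes simplifying assumptions: the learned query, key, and value projections are omitted (the token vectors are used directly for all three), there is a single head, and the function names and toy input are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X: List[List[float]]) -> Tuple[List[List[float]], List[List[float]]]:
    """Scaled dot-product self-attention with Q = K = V = X (no projections)."""
    d = len(X[0])
    # Score of every token against every token; no step depends on a previous one.
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in X]
              for qi in X]
    weights = [softmax(r) for r in scores]
    outputs = [[sum(w * X[j][c] for j, w in enumerate(wrow)) for c in range(d)]
               for wrow in weights]
    return weights, outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
weights, outputs = self_attention(tokens)
print(weights)
```

Each row of `weights` is a distribution over all tokens in the sequence, so the first and last tokens interact directly instead of through a chain of recurrent steps.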

Answer

A. Transformers use convolutional layers instead of recurrent connections.

Answer

C. Transformers use sigmoid activations while RNNs use tanh.

Answer

D. Transformers require labeled data while RNNs are unsupervised.
