Knowledge + Practice

CCNA Aio Ai Concepts Techniques Questions

75 questions · Aio Ai Concepts Techniques topic · All types, answers revealed

Practice these questions Exam hub All questions

1

MCQhard

A machine learning engineer is training a transformer model for machine translation. The model's perplexity on the validation set is 8.5, and the BLEU score is 32. After increasing the number of encoder layers from 6 to 12, perplexity drops to 7.2 but BLEU decreases to 28. What is the MOST likely cause?

A.The model is overfitting the training data

B.The batch size is too small

C.The model is underfitting the training data

D.The learning rate is too high

AnswerA

Overfitting leads to lower perplexity on validation (memorization) but worse generalization, reflected in the BLEU drop.

Why this answer

Perplexity measures language model confidence, but BLEU measures translation quality. The deeper model may overfit to the training data, reducing perplexity but hurting generalization to validation translations. Overfitting causes high confidence (low perplexity) but poor translation diversity or exact matches.

Practice this question →

2

MCQmedium

An AI engineer is tuning a large language model for a summarization task. The output summaries are too verbose and include irrelevant details. Which technique should be applied to encourage concise outputs?

A.Provide a few-shot example with concise summaries

B.Use chain-of-thought prompting

C.Decrease the top-k value

D.Increase the temperature

AnswerA

Few-shot examples demonstrate the desired output style, teaching the model to produce concise summaries.

Why this answer

Providing a few-shot example with concise summaries (Option A) directly demonstrates the desired output format to the model, leveraging in-context learning to bias generation toward brevity and relevance. This is the most effective technique for controlling output style without altering the model's underlying parameters.

Exam trap

Cisco often tests the misconception that adjusting sampling parameters (top-k, temperature) is the primary way to control output length, when in fact these parameters affect randomness and diversity, not the explicit length or relevance of the generated text.

How to eliminate wrong answers

Option B is wrong because chain-of-thought prompting encourages step-by-step reasoning, which typically increases verbosity and is designed for complex reasoning tasks, not for reducing output length. Option C is wrong because decreasing the top-k value restricts the sampling pool to the k most likely tokens, which can reduce randomness but does not inherently enforce conciseness or relevance; it may even produce repetitive or incomplete summaries. Option D is wrong because increasing the temperature raises the randomness of token selection, often leading to more diverse but also more verbose and irrelevant outputs, the opposite of the desired effect.

Practice this question →

3

MCQmedium

A product team wants a system that can generate high-quality synthetic images of furniture in different room settings for an online catalog. The images must be photorealistic and vary in style. Which generative AI approach is BEST suited for this task?

A.Variational autoencoder (VAE)

B.Diffusion model

C.Generative adversarial network (GAN)

D.Recurrent neural network (RNN)

AnswerB

Diffusion models (e.g., Stable Diffusion) produce state-of-the-art photorealistic images with high diversity.

Why this answer

Diffusion models are the best choice because they iteratively denoise random noise to produce high-quality, photorealistic images with diverse styles. Unlike GANs, they avoid mode collapse and training instability, and they generate more detailed and varied outputs than VAEs, making them ideal for furniture catalog images in different room settings.

Exam trap

Cisco often tests the misconception that GANs are always the best for image generation, but the trap here is that GANs' mode collapse and training instability make diffusion models superior for high-quality, diverse outputs in production systems.

How to eliminate wrong answers

Option A is wrong because VAEs generate blurry images due to their variational lower bound objective, which smooths over fine details, making them unsuitable for photorealistic furniture images. Option C is wrong because GANs can suffer from mode collapse, where they generate limited variations (e.g., only one style of room), and training instability, reducing reliability for diverse catalog images. Option D is wrong because RNNs are designed for sequential data (e.g., text, time series) and cannot generate high-dimensional spatial images like furniture in room settings.

Practice this question →

4

MCQeasy

Which machine learning paradigm is best suited for training a model to play a game by learning from its own actions and rewards, without labeled data?

A.Unsupervised learning

B.Semi-supervised learning

C.Reinforcement learning

D.Supervised learning

AnswerC

Reinforcement learning uses rewards from the environment to learn optimal actions through exploration and exploitation.

Why this answer

Reinforcement learning learns via trial-and-error using rewards and penalties, ideal for game-playing agents. Supervised learning requires labeled data; unsupervised learning finds patterns without rewards; semi-supervised uses a mix.

Practice this question →

5

MCQeasy

A company wants to recommend products to users based on their past purchase history. Which machine learning paradigm is BEST suited for this task?

A.Reinforcement learning

B.Unsupervised clustering

C.Supervised learning with regression

D.Self-supervised learning

AnswerC

Supervised regression can predict the likelihood or rating of a product for a user based on historical data.

Why this answer

Recommender systems are a classic application of supervised learning (if using regression or classification to predict ratings) or unsupervised learning (collaborative filtering). Among the options, supervised learning with regression is appropriate for predicting purchase likelihood.

Practice this question →

6

MCQmedium

A developer is using a large language model via an API. They want the model to solve a math problem step by step. Which prompt engineering technique should they use?

A.Set temperature to 0.9

B.Chain-of-thought prompting

C.Few-shot prompting

D.Zero-shot prompting

AnswerB

Chain-of-thought prompts the model to output intermediate reasoning steps, which improves performance on arithmetic and logic problems.

Why this answer

Chain-of-thought prompting encourages the model to show intermediate reasoning steps, improving accuracy on multi-step problems. Zero-shot gives no examples; few-shot provides examples but not necessarily step-by-step; temperature controls randomness.

Practice this question →

7

Multi-Selecthard

A company wants to deploy an LLM-based chatbot that can handle sensitive customer information. Which THREE measures should be implemented to mitigate prompt injection attacks? (Choose 3)

Select 3 answers

A.Use a system prompt that instructs the model to ignore any instructions in the user input

B.Implement output filtering to detect and block harmful responses

C.Sanitize user inputs to remove special characters and escape sequences

D.Use a smaller model with fewer parameters

E.Set temperature to a low value

AnswersA, B, C

A well-crafted system prompt can reduce the success of injection attacks by separating instructions from data.

Why this answer

Input sanitization removes special characters or patterns; output filtering checks responses for sensitive data; system prompts with separation instructions can reduce injection risk. Restricting temperature only affects randomness; using a smaller model does not prevent injection.

Practice this question →

8

MCQmedium

An organization's AI system uses a decision tree model for loan approval. The compliance team requires explanations for each decision. Which property of decision trees makes them suitable for this requirement?

A.They can handle nonlinear relationships

B.They are robust to outliers

C.The decision rules are transparent and can be visualized as a tree

D.They can handle missing values

AnswerC

The tree structure provides clear if-then rules for each decision.

Why this answer

Decision trees inherently provide interpretable decision rules by splitting data based on feature thresholds at each node. The entire model can be visualized as a tree structure, allowing compliance teams to trace the exact path and logic behind each loan approval or rejection, which directly satisfies explainability requirements.

Exam trap

Cisco often tests the distinction between model performance properties (e.g., handling nonlinearity, robustness) and interpretability properties, leading candidates to select a technically true but irrelevant advantage instead of the one that directly satisfies the compliance requirement.

How to eliminate wrong answers

Option A is wrong because handling nonlinear relationships is a general capability of many models (e.g., neural networks, SVMs with kernels) and is not unique to decision trees, nor does it directly address the need for transparent explanations. Option B is wrong because decision trees are not inherently robust to outliers; in fact, they can be sensitive to outliers that cause splits to be skewed, and robustness is not related to explainability. Option D is wrong while decision trees can handle missing values through surrogate splits or other imputation methods, this property does not provide the transparency or traceability required for compliance explanations.

Practice this question →

9

MCQmedium

A company wants to automatically group customer support tickets into categories (e.g., billing, technical, account) without pre-labeled data. Which machine learning approach should they use?

A.Supervised classification with logistic regression

B.Semi-supervised learning with a small labeled set

C.Unsupervised clustering using K-means

D.Reinforcement learning with a reward function

AnswerC

K-means clustering groups similar tickets without labels.

Why this answer

Option C is correct because the company has no pre-labeled data, which means supervised learning (which requires labeled examples) is not feasible. Unsupervised clustering, such as K-means, groups data points into clusters based on feature similarity without needing any labels, making it ideal for automatically discovering categories like billing, technical, or account from raw ticket text.

Exam trap

Cisco often tests the distinction between supervised and unsupervised learning by presenting a scenario with 'no pre-labeled data' to trick candidates into choosing semi-supervised learning (Option B) because it sounds like a compromise, but the correct answer is always unsupervised clustering when zero labels are available.

How to eliminate wrong answers

Option A is wrong because supervised classification with logistic regression requires a pre-labeled training dataset, which the company does not have. Option B is wrong because semi-supervised learning still requires at least a small set of labeled data to guide the model, contradicting the 'without pre-labeled data' condition. Option D is wrong because reinforcement learning uses a reward function to learn a policy through trial-and-error interactions with an environment, which is not suited for static grouping of text data into categories.

Practice this question →

10

MCQeasy

In unsupervised learning, which task involves grouping similar data points together based on feature similarities?

A.Anomaly detection

B.Classification

C.Clustering

D.Regression

AnswerC

Clustering groups unlabeled data based on similarity.

Why this answer

Clustering partitions data into groups where intra-cluster similarity is high. Classification is supervised; anomaly detection finds outliers; regression predicts continuous values.

Practice this question →

11

Multi-Selecthard

An AI engineer is fine-tuning a transformer-based language model for a domain-specific task. They want to improve the model's factual accuracy and reduce hallucinations. Which THREE strategies should they consider? (Select THREE)

Select 3 answers

A.Increase the model's context window size beyond the training limit

B.Fine-tune the model on a curated domain-specific corpus

C.Use a higher temperature setting during generation

D.Apply chain-of-thought prompting for complex queries

E.Implement Retrieval-Augmented Generation (RAG)

AnswersB, D, E

Fine-tuning adapts the model's knowledge to the domain, improving accuracy.

Why this answer

Option B is correct because fine-tuning on a curated domain-specific corpus directly aligns the model with the factual patterns and terminology of the target domain. This supervised learning process adjusts the model's weights to reduce the probability of generating incorrect or hallucinated content by reinforcing ground-truth examples from the domain.

Exam trap

Cisco often tests the misconception that increasing randomness (higher temperature) or extending context windows beyond training limits can improve factual accuracy, when in fact these techniques degrade reliability.

Practice this question →

12

MCQhard

A team is training a recurrent neural network (RNN) with LSTM units to predict stock prices. The validation loss is significantly higher than the training loss. Which action is MOST likely to reduce the gap?

A.Increase the number of LSTM units

B.Increase the number of training epochs

C.Reduce the sequence length

D.Increase the dropout rate in LSTM layers

AnswerD

Dropout regularises the network, reducing overfitting and closing the train-validation gap.

Why this answer

A large gap between training and validation loss indicates overfitting. Increasing dropout (a regularisation technique) reduces overfitting by preventing co-adaptation of neurons. Increasing LSTM units or epochs would worsen overfitting, and reducing sequence length may lose important temporal patterns.

Practice this question →

13

MCQmedium

A data scientist is using a linear regression model to predict house prices and observes that the model performs well on training data but poorly on test data. Which regularisation technique is MOST appropriate to reduce overfitting?

A.L1 regularisation (Lasso)

B.Dropout

C.L2 regularisation (Ridge)

D.Data augmentation

AnswerC

Ridge adds squared magnitude penalty, shrinking coefficients smoothly, which helps generalise.

Why this answer

L2 regularisation (Ridge) adds a penalty term equal to the sum of the squared coefficients to the loss function, which shrinks coefficient magnitudes without forcing them to zero. This reduces variance and overfitting by making the model less sensitive to individual features, which is ideal when the model performs well on training data but poorly on test data due to high variance.

Exam trap

Cisco often tests the distinction between L1 and L2 regularisation by presenting a scenario where feature selection is not needed, and candidates mistakenly choose Lasso because they confuse 'reducing coefficients' with 'eliminating coefficients'.

How to eliminate wrong answers

Option A is wrong because L1 regularisation (Lasso) performs feature selection by shrinking some coefficients exactly to zero, which is more appropriate when you suspect many features are irrelevant, not for general variance reduction. Option B is wrong because Dropout is a regularisation technique specific to neural networks, not linear regression models. Option D is wrong because data augmentation is used to artificially increase the size of the training dataset, typically for image or text data, and does not directly address overfitting in a linear regression context.

Practice this question →

14

Multi-Selectmedium

A data scientist is evaluating a binary classifier for a medical diagnosis task. The dataset is imbalanced with 5% positive cases. Which THREE metrics should the data scientist consider for a comprehensive evaluation?

Select 3 answers

A.Precision

B.F1 score

C.Accuracy

D.Perplexity

E.Recall

AnswersA, B, E

Precision measures the proportion of positive predictions that are correct.

Why this answer

Precision (A) is correct because it measures the proportion of true positive predictions among all positive predictions, which is critical in imbalanced medical diagnosis where false positives can lead to unnecessary stress or procedures. In a dataset with only 5% positive cases, a model that predicts all negatives would achieve high accuracy but zero precision, so precision helps assess the cost of false alarms.

Exam trap

The trap here is that candidates often default to accuracy as a universal metric, but Cisco tests the understanding that accuracy is unreliable for imbalanced datasets, and that metrics like precision, recall, and F1 score are required for a comprehensive evaluation.

Practice this question →

15

MCQeasy

An AI practitioner needs to measure the performance of a binary classification model for disease detection, where the cost of false negatives is very high. Which metric should be prioritized?

A.Recall

B.Precision

C.F1-score

D.Accuracy

AnswerA

Recall measures the proportion of actual positives correctly identified, directly addressing the cost of false negatives.

Why this answer

Recall (true positive rate) minimises false negatives, which is critical when missing a positive case is dangerous.

Practice this question →

16

MCQmedium

A team is developing a sentiment analysis model and obtains the following performance on the test set: accuracy=0.92, precision=0.75, recall=0.80, F1=0.77. The baseline majority-class classifier achieves 0.85 accuracy. Which conclusion is MOST justified?

A.The model should use a different evaluation metric like BLEU

B.The model likely suffers from class imbalance, as the gap between accuracy and precision suggests

C.The model is excellent because accuracy is high

D.The model has high variance and is overfitting

AnswerB

High accuracy with lower precision/recall is a classic sign of imbalance; the model predicts majority class too often.

Why this answer

Accuracy is high but precision and recall are notably lower, indicating class imbalance where the model biases toward the majority class, inflating accuracy.

Practice this question →

17

MCQmedium

A machine learning engineer is training a neural network for image classification. The training loss decreases slowly and the model accuracy improves only marginally each epoch. Which hyperparameter adjustment is MOST likely to accelerate convergence?

A.Add more hidden layers

B.Increase the batch size

C.Increase the learning rate

D.Decrease the number of epochs

AnswerC

A small learning rate causes slow convergence; increasing it can accelerate training.

Why this answer

The training loss decreasing slowly and accuracy improving marginally each epoch indicates that the learning rate is too small, causing the optimizer to take very small steps toward the minimum of the loss function. Increasing the learning rate allows the optimizer to take larger steps per update, which accelerates convergence. Option C is correct because adjusting the learning rate directly addresses the step size in gradient descent.

Exam trap

Cisco often tests the misconception that adding more layers or increasing batch size always improves training speed, when in fact the learning rate is the primary hyperparameter controlling convergence rate.

How to eliminate wrong answers

Option A is wrong because adding more hidden layers increases model complexity and can lead to slower convergence or overfitting, not faster convergence. Option B is wrong because increasing the batch size reduces the variance of gradient estimates but does not directly speed up convergence; it can actually slow down training due to fewer weight updates per epoch. Option D is wrong because decreasing the number of epochs reduces training time but does not accelerate convergence per epoch; it may stop training before the model has converged.

Practice this question →

18

MCQmedium

A machine learning engineer is training a logistic regression model and notices that the loss is decreasing very slowly. The learning rate is set to 0.001. What is the MOST likely cause and appropriate fix?

A.The learning rate is too low; increase it to 0.01

B.The learning rate is too high; decrease it to 0.0001

C.The model is overfitting; add L2 regularisation

D.The batch size is too large; reduce it

AnswerA

A learning rate of 0.001 is very small; increasing it to 0.01 will speed up convergence without causing divergence.

Why this answer

A learning rate of 0.001 is very low for many logistic regression implementations, causing the gradient descent algorithm to take extremely small steps toward the minimum of the loss function. This results in a slow decrease in loss because each weight update is minimal. Increasing the learning rate to 0.01 allows larger steps per iteration, accelerating convergence without typically causing divergence in well-scaled data.

Exam trap

Cisco often tests the misconception that a slow decrease in loss always indicates a learning rate that is too high, when in fact a very low learning rate is the typical cause for slow convergence.

How to eliminate wrong answers

Option B is wrong because a learning rate that is too high would cause the loss to oscillate or diverge, not decrease slowly; decreasing it further would worsen the slow convergence. Option C is wrong because overfitting manifests as low training loss but high validation loss, not as a slow decrease in training loss; L2 regularization addresses overfitting, not convergence speed. Option D is wrong because a large batch size can slow training in terms of wall-clock time per epoch but does not inherently cause the loss to decrease slowly per iteration; it actually provides more stable gradient estimates.

Practice this question →

19

MCQmedium

A bank wants to detect fraudulent transactions in real-time. The dataset is highly imbalanced (99.9% legitimate, 0.1% fraud). Which evaluation metric is MOST appropriate for model performance?

A.AUC-ROC

B.Accuracy

C.Recall

D.Precision

AnswerA

AUC-ROC is robust to imbalance and evaluates the model's ability to distinguish classes.

Why this answer

AUC-ROC is the most appropriate metric because it evaluates the model's ability to distinguish between the minority fraud class (0.1%) and the majority legitimate class across all classification thresholds, without being biased by the extreme class imbalance. Unlike accuracy, AUC-ROC remains robust when the dataset is 99.9% legitimate, as it measures the true positive rate against the false positive rate, providing a comprehensive view of model performance for rare event detection.

Exam trap

Cisco often tests the misconception that accuracy is a reliable metric for imbalanced datasets, leading candidates to overlook that AUC-ROC or precision-recall curves are required when the minority class is extremely rare.

How to eliminate wrong answers

Option B (Accuracy) is wrong because in a highly imbalanced dataset (99.9% legitimate), a model that predicts all transactions as legitimate would achieve 99.9% accuracy, masking its complete failure to detect fraud. Option C (Recall) is wrong because while recall measures the proportion of actual fraud cases correctly identified, it ignores false positives, which can lead to an overwhelming number of false alerts in real-time transaction systems, degrading user experience and operational efficiency. Option D (Precision) is wrong because precision focuses only on the proportion of flagged transactions that are actually fraud, but it does not account for missed fraud cases (false negatives), which is critical in fraud detection where undetected fraud causes direct financial loss.

Practice this question →

20

MCQmedium

A data scientist is selecting a model for a binary classification task where interpretability is critical because of regulatory requirements. The dataset has 20 features and 10,000 samples. Which model is MOST appropriate?

A.Neural network (MLP)

B.Decision tree

C.Gradient boosting machine

D.Random forest classifier

AnswerB

A single decision tree provides clear, human-readable decision rules, meeting regulatory interpretability needs.

Why this answer

Decision trees are inherently interpretable, showing the decision rules. Random forests and gradient boosting are ensembles that sacrifice interpretability for accuracy. Neural networks are black-box models.

Practice this question →

21

MCQmedium

A developer is using a pre-trained BERT model for a question-answering system. They want to ensure the model can handle out-of-vocabulary words. Which component of the BERT architecture is responsible for this?

A.Positional encoding

B.Feed-forward layers

C.WordPiece tokenisation

D.Attention mechanism

AnswerC

WordPiece tokenisation splits rare words into subwords, enabling handling of any input.

Why this answer

WordPiece tokenisation is the component of BERT that handles out-of-vocabulary (OOV) words by breaking them into subword units (e.g., 'playing' → 'play' + '##ing'). This allows the model to represent any word, even unseen ones, as a sequence of known subword tokens, ensuring no word is truly out of vocabulary.

Exam trap

The trap here is that candidates often associate 'handling unknown words' with the attention mechanism or positional encoding, but Cisco specifically tests the understanding that tokenisation—not the model's internal layers—is what makes BERT robust to OOV words.

How to eliminate wrong answers

Option A is wrong because positional encoding adds information about the position of tokens in a sequence, not about handling unknown words. Option B is wrong because feed-forward layers apply non-linear transformations to the attention output and do not address tokenisation or vocabulary coverage. Option D is wrong because the attention mechanism computes relationships between tokens but relies on the tokeniser to first convert input text into known subword pieces; it cannot handle OOV words on its own.

Practice this question →

22

MCQmedium

A data analyst wants to use a model that provides feature importance scores to understand which factors most influence customer churn. They also need the model to handle both numerical and categorical data with minimal preprocessing. Which algorithm is BEST suited?

A.Random forest

B.Support vector machine (SVM) with RBF kernel

C.Logistic regression

D.k-nearest neighbours (k-NN)

AnswerA

Random forests output feature importance, handle mixed data, and are robust to scaling.

Why this answer

Random forests provide feature importance, handle mixed data types, and require little preprocessing.

Practice this question →

23

MCQmedium

A developer is building a natural language processing system to classify customer reviews as positive, neutral, or negative. They have 50,000 labeled reviews. Which model architecture is MOST appropriate for this task?

A.Use a convolutional neural network (CNN) on raw text

B.Train a recurrent neural network (RNN) from scratch

C.Fine-tune a pre-trained BERT model

D.Word2vec embeddings followed by logistic regression

AnswerC

BERT provides deep bidirectional representations; fine-tuning on the labeled reviews yields state-of-the-art text classification accuracy.

Why this answer

Fine-tuning a pre-trained BERT model is most appropriate because BERT is a transformer-based model pre-trained on a large corpus and can be fine-tuned on the 50,000 labeled reviews to achieve high accuracy with relatively little data. It captures bidirectional context, which is crucial for sentiment classification, and avoids the need for training from scratch.

Exam trap

Cisco often tests the misconception that training from scratch or using simpler models is sufficient for NLP tasks, when in reality pre-trained transformers like BERT are the standard for achieving high accuracy with limited labeled data.

How to eliminate wrong answers

Option A is wrong because using a CNN on raw text without embeddings or pre-processing ignores the sequential and contextual nature of language, leading to poor performance on sentiment classification. Option B is wrong because training an RNN from scratch on only 50,000 samples is prone to overfitting and underperformance compared to leveraging a pre-trained model like BERT. Option D is wrong because Word2vec embeddings followed by logistic regression provides only shallow, bag-of-words-like features and cannot capture complex contextual relationships needed for nuanced sentiment analysis.

Practice this question →

24

MCQeasy

A data scientist is building a model to predict whether a credit card transaction is fraudulent, using labeled historical data. Which machine learning paradigm is being used?

A.Reinforcement learning

B.Unsupervised learning

C.Supervised learning

D.Self-supervised learning

AnswerC

The model is trained on labeled data (fraud vs. legitimate) to predict outcomes, which is supervised learning.

Why this answer

Supervised learning uses labeled data to train a model to map inputs to outputs. Fraud detection with historical labels is a classic binary classification problem.

Practice this question →

25

MCQhard

A generative AI model produces images from text prompts. The outputs are often blurry and lack fine details. Which model type is MOST likely being used, and which improvement would best address this issue?

A.Variational Autoencoder (VAE); switch to a diffusion model

B.Variational Autoencoder (VAE); switch to a Generative Adversarial Network (GAN)

C.Generative Adversarial Network (GAN); increase the discriminator's capacity

D.Diffusion model; use a larger batch size during training

AnswerA

VAEs tend to blur; diffusion models iteratively denoise, producing high-quality details.

Why this answer

Variational Autoencoders (VAEs) are known for producing blurry outputs because their loss function (ELBO) encourages pixel-wise averaging, which smooths out fine details. Diffusion models, by contrast, iteratively denoise a random field, learning to reconstruct high-frequency details through a multi-step reverse process, directly addressing the blurriness issue.

Exam trap

Cisco often tests the misconception that GANs are always the best for sharp images, but the trap here is that the question specifically describes blurry outputs—a hallmark of VAEs—and the best modern improvement is a diffusion model, not a GAN.

How to eliminate wrong answers

Option B is wrong because switching from a VAE to a GAN would improve sharpness but GANs are prone to mode collapse and training instability, making diffusion models a more robust and state-of-the-art choice for fine detail generation. Option C is wrong because GANs already produce sharp images; increasing discriminator capacity would not fix blurriness (which is a VAE characteristic) and could worsen training instability. Option D is wrong because diffusion models do not inherently produce blurry outputs; using a larger batch size during training improves gradient stability but does not address a blurriness problem that is not characteristic of diffusion models.

Practice this question →

26

MCQmedium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A.Use a larger foundation model with a longer context window and paste all documents into each prompt

B.Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store

C.Fine-tune a base LLM on the policy documents monthly

D.Train a custom model from scratch on the policy documents each month

AnswerB

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

Retrieval-Augmented Generation (RAG) is the most appropriate approach because it allows the chatbot to answer questions based on the latest policy documents without retraining the model. By indexing the documents in a vector store and retrieving relevant chunks at query time, RAG provides up-to-date, contextually accurate answers while keeping the underlying LLM static, which avoids the cost and complexity of monthly retraining.

Exam trap

Cisco often tests the misconception that a larger context window or fine-tuning is the only way to handle dynamic data, when in fact RAG is the scalable, cost-effective solution for frequently updated knowledge bases without retraining.

How to eliminate wrong answers

Option A is wrong because pasting all policy documents into each prompt would quickly exceed the model's context window (even with larger models, context windows are finite and costly), leading to truncated inputs, degraded performance, and high token costs. Option C is wrong because fine-tuning a base LLM monthly on the policy documents is expensive, time-consuming, and requires storing and managing multiple model versions, which directly contradicts the requirement to avoid retraining. Option D is wrong because training a custom model from scratch each month is prohibitively expensive, requires vast amounts of data and compute resources, and is entirely unnecessary for a task that only needs to retrieve and synthesize existing information.

Practice this question →

27

MCQeasy

Which machine learning paradigm involves training an agent to make decisions by interacting with an environment and receiving rewards or penalties based on its actions?

A.Unsupervised learning

B.Reinforcement learning

C.Supervised learning

D.Self-supervised learning

AnswerB

RL uses an agent that learns from rewards and penalties through interaction with an environment.

Why this answer

Reinforcement learning (RL) is the correct paradigm because it explicitly involves an agent learning a policy through trial-and-error interactions with an environment, receiving scalar reward signals (positive or negative) to maximize cumulative reward. This matches the question's description of making decisions based on rewards or penalties, which is the defining characteristic of RL, as opposed to learning from labeled data or discovering hidden patterns without feedback.

Exam trap

Cisco often tests the distinction between reinforcement learning and supervised learning by phrasing the question to emphasize 'rewards or penalties' — candidates mistakenly think supervised learning uses penalties (like loss functions) and confuse it with RL's delayed reward signals from an environment.

How to eliminate wrong answers

Option A is wrong because unsupervised learning discovers hidden patterns or structures in unlabeled data without any reward or penalty signals from an environment. Option C is wrong because supervised learning maps inputs to outputs using labeled training data, where the model receives direct error feedback (e.g., loss function) rather than delayed rewards from environmental interactions. Option D is wrong because self-supervised learning generates its own supervisory signal from the input data itself (e.g., predicting masked tokens) and does not involve an agent acting in an environment to receive rewards or penalties.

Practice this question →

28

MCQmedium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A.Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store

B.Train a custom model from scratch on the policy documents each month

C.Use a larger foundation model with a longer context window and paste all documents into each prompt

D.Fine-tune a base LLM on the policy documents monthly

AnswerA

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

Retrieval-Augmented Generation (RAG) is the most appropriate approach because it allows the chatbot to answer questions by retrieving relevant chunks from the policy documents stored in a vector store, without requiring model retraining when documents are updated monthly. The retrieval component dynamically fetches the latest content, while the generation component uses a pre-trained LLM to produce answers, making it cost-effective and scalable for frequently changing knowledge bases.

Exam trap

Cisco often tests the misconception that fine-tuning or training from scratch is the only way to adapt a model to new data, ignoring the efficiency of retrieval-based approaches like RAG for dynamic knowledge bases.

How to eliminate wrong answers

Option B is wrong because training a custom model from scratch each month is prohibitively expensive and time-consuming, requiring large datasets and significant compute resources, which contradicts the constraint of not being able to afford retraining. Option C is wrong because pasting all policy documents into each prompt exceeds the context window limits of even the largest foundation models (e.g., 128k tokens for GPT-4), leading to truncation, high latency, and increased cost per query. Option D is wrong because fine-tuning a base LLM monthly on the policy documents still requires retraining the model, which incurs similar costs and effort as training from scratch, and does not efficiently handle document updates without full retraining cycles.

Practice this question →

29

Multi-Selecthard

A healthcare startup is building a diagnostic support system using a large language model. The system must provide accurate, evidence-based answers and avoid generating harmful or fabricated information. Which THREE techniques should be implemented to achieve this? (Choose 3)

Select 3 answers

A.Retrieval-Augmented Generation (RAG)

B.Disabling output filtering to speed up generation

C.Using chain-of-thought prompting for reasoning steps

D.Increasing the temperature parameter to encourage creativity

E.Fine-tuning on medical textbooks and guidelines

AnswersA, C, E

RAG retrieves relevant medical literature to ground responses.

Why this answer

RAG grounds answers in retrieved evidence, fine-tuning can align with medical domain, and prompt engineering can enforce accuracy and safety.

Practice this question →

30

MCQhard

A deep learning engineer is training a transformer model and notices that validation perplexity increases after a few epochs while training perplexity continues to decrease. Which of the following is the MOST likely cause?

A.The temperature parameter is set too high

B.The batch size is too small

C.The learning rate is too low

D.The model is overfitting the training data

AnswerD

Overfitting leads to good training performance but poor generalisation, causing validation metrics to worsen.

Why this answer

Option D is correct because the described pattern—decreasing training perplexity alongside increasing validation perplexity—is the classic signature of overfitting. The model is memorizing the training data rather than learning generalizable patterns, causing its performance on unseen validation data to degrade after a certain point in training.

Exam trap

Cisco often tests the distinction between optimization issues (like learning rate or batch size) and generalization issues (like overfitting), and the trap here is that candidates may confuse a rising validation loss with a learning rate that is too high, when in fact the divergence between training and validation metrics is the definitive clue for overfitting.

How to eliminate wrong answers

Option A is wrong because the temperature parameter controls the sharpness of the output probability distribution during inference (e.g., in softmax), not the training dynamics or the divergence between training and validation loss; a high temperature would make predictions more uniform, not cause overfitting. Option B is wrong because a batch size that is too small typically introduces high gradient variance and can slow convergence or cause instability, but it does not directly cause the specific pattern of training loss decreasing while validation loss increases—that is a hallmark of overfitting, not a batch-size issue. Option C is wrong because a learning rate that is too low would cause the model to converge very slowly or get stuck in a local minimum, but both training and validation perplexity would likely plateau or decrease together; it would not produce a divergence where training perplexity continues to drop while validation perplexity rises.

Practice this question →

31

Multi-Selectmedium

A company is deploying a chatbot using a large language model. They want to mitigate the risk of prompt injection attacks. Which TWO measures should be implemented?

Select 2 answers

A.Implement input validation and sanitisation

B.Use a system prompt that strictly defines the chatbot's behavior

C.Fine-tune the model on safe conversational examples

D.Use a larger context window

E.Limit the maximum output token length

AnswersA, B

Input validation and sanitisation filter out harmful or injected content before processing.

Why this answer

Input validation and sanitisation (A) prevent malicious user inputs from being interpreted as instructions by the LLM, directly mitigating prompt injection by stripping or escaping special characters and control sequences. A strict system prompt (B) defines the chatbot's role and boundaries, reducing the attack surface by making it harder for injected prompts to override the intended behavior.

Exam trap

Cisco often tests the misconception that fine-tuning or output limits can prevent prompt injection, when in fact these measures do not address the root cause of untrusted input being processed as instructions.

Practice this question →

32

MCQmedium

A natural language processing team wants to build a sentiment analysis model for customer reviews. They have 10,000 labeled reviews and 1 million unlabeled reviews. Which approach would MOST effectively leverage the unlabeled data?

A.Use self-supervised learning to pretrain on the unlabeled data, then fine-tune on the labeled data

B.Train a supervised classifier on only the 10,000 labeled reviews

C.Implement a semi-supervised learning algorithm that propagates labels from the labeled to the unlabeled data

D.Use reinforcement learning with the unlabeled data as rewards

AnswerC

Semi-supervised learning leverages the unlabeled data by using the labeled data to infer labels for similar unlabeled examples, improving model generalization.

Why this answer

Semi-supervised learning uses the small labeled set to guide learning from the large unlabeled set. Self-supervised learning would require a pretext task; fine-tuning a pre-trained model is also valid but semi-supervised directly addresses the labeled-unlabeled mix.

Practice this question →

33

Multi-Selecthard

A company is deploying an LLM-powered application that answers questions based on internal documents. They want to minimize prompt injection attacks where users trick the model into ignoring instructions. Which THREE measures should they implement? (Select THREE)

Select 3 answers

A.Use a system-level prompt that clearly defines allowed behavior and boundaries

B.Set temperature to 0.0 for all queries

C.Allow the model to execute any code from user prompts for flexibility

D.Implement a separate classifier to detect and block injection attempts

E.Sanitize user inputs to remove special tokens or injection patterns

AnswersA, D, E

A strong system prompt sets context and restricts the model from following malicious instructions.

Why this answer

A is correct because a system-level prompt establishes a foundational instruction set that defines the model's allowed behavior and boundaries. This acts as a first line of defense by explicitly instructing the model to ignore any user attempts to override its core directives, thereby reducing the risk of prompt injection attacks.

Exam trap

Cisco often tests the misconception that reducing model temperature or randomness can mitigate security threats, when in fact temperature only affects output creativity, not instruction adherence or input safety.

Practice this question →

34

MCQhard

A team is fine-tuning a BERT model for a document classification task. They notice the model achieves high F1 scores on the training set but low F1 on the validation set. Which regularization technique would be MOST effective?

A.L1 regularization

B.L2 regularization

C.Dropout

D.Reduce batch size

AnswerC

Dropout is widely used in transformer models; increasing dropout rate during fine-tuning can reduce overfitting.

Why this answer

Dropout randomly drops neurons during training, preventing co-adaptation and overfitting. L1 and L2 add penalties to weights but are less common for transformers; L1 induces sparsity, L2 reduces weight magnitude. However, dropout is the standard regularization in BERT-like models.

Practice this question →

35

MCQmedium

A data scientist is building a model to predict credit default using historical loan data. The dataset contains 100,000 records with 50 features, including income, debt-to-income ratio, and loan amount. The target variable is binary (default vs. no default). The goal is to maximize interpretability while maintaining high accuracy. Which algorithm is MOST appropriate?

A.Logistic regression

B.Random forest

C.Gradient boosting machine

D.Decision tree

AnswerA

Logistic regression provides clear odds ratios and feature coefficients, making it highly interpretable, and it performs well on large datasets with moderate feature complexity.

Why this answer

Logistic regression is interpretable (coefficients show feature impact) and performs well on binary classification with a large dataset. Decision trees are interpretable but may overfit; random forests and gradient boosting are less interpretable.

Practice this question →

36

MCQmedium

A data scientist is building a model to predict whether a transaction is fraudulent. The dataset has 99.9% legitimate transactions and 0.1% fraudulent ones. Which evaluation metric is MOST appropriate to assess model performance given this class imbalance?

A.BLEU score

B.Accuracy

C.F1-score

D.Perplexity

AnswerC

F1-score balances precision and recall, making it robust for imbalanced classification tasks.

Why this answer

With 99.9% legitimate transactions and only 0.1% fraudulent ones, accuracy would be misleadingly high (99.9%) even if the model never predicts fraud. The F1-score is the harmonic mean of precision and recall, making it robust to class imbalance by penalizing both false positives and false negatives. This makes it the most appropriate metric for evaluating fraud detection performance.

Exam trap

Cisco often tests the trap that candidates default to accuracy as the universal metric, failing to recognize that in extreme class imbalance (e.g., 99.9% vs 0.1%), accuracy becomes meaningless and F1-score is the standard alternative.

How to eliminate wrong answers

Option A is wrong because BLEU score is a metric for evaluating machine translation quality by comparing n-gram overlap, not for binary classification or imbalanced datasets. Option B is wrong because accuracy is misleading in extreme class imbalance; a model that always predicts 'legitimate' would achieve 99.9% accuracy but fail to detect any fraud. Option D is wrong because perplexity is a metric used in language models to measure how well a probability distribution predicts a sample, not for evaluating classification performance on imbalanced data.

Practice this question →

37

MCQmedium

A company wants to generate realistic images of new product designs. They have a large dataset of existing product images. Which generative AI approach is MOST suitable for creating novel, high-quality images?

A.Large language model (LLM)

B.Variational autoencoder (VAE)

C.Generative adversarial network (GAN)

D.Diffusion model

AnswerC

GANs are designed to generate high-quality, realistic images by adversarial training.

Why this answer

GANs (Generative Adversarial Networks) consist of a generator and discriminator that compete, producing highly realistic images. Diffusion models are also good but GANs are historically the go-to for image generation. VAEs produce blurrier images; LLMs are for text.

Practice this question →

38

MCQmedium

A data scientist is evaluating a binary classification model. The model achieves 95% accuracy on the test set, but the precision is 0.60 and recall is 0.55. The dataset has 90% negative class samples. Which metric should the team focus on to improve the model?

A.F1 score

B.Perplexity

C.BLEU score

D.Accuracy

AnswerA

F1 score is the harmonic mean of precision and recall, providing a balanced metric that accounts for both false positives and false negatives.

Why this answer

With high class imbalance (90% negatives), accuracy is misleading. F1 score balances precision and recall, giving a better picture of performance on the minority class. AUC-ROC is also good but F1 directly optimizes for positive class.

Practice this question →

39

MCQeasy

An AI practitioner needs to extract key phrases from a large collection of customer support emails for trend analysis. Which technique is MOST suitable?

A.Named entity recognition (NER)

B.Language translation

C.Text classification

D.Sentiment analysis

AnswerA

NER extracts specific entities (e.g., product names, problems) which can serve as key phrases for trend analysis.

Why this answer

Named entity recognition (NER) identifies predefined entities like names, dates, or product names. For extracting key phrases, keyword extraction or topic modeling (e.g., TF-IDF, RAKE) is more appropriate. Sentiment analysis gives sentiment scores, and text classification assigns categories.

Practice this question →

40

MCQhard

A developer is fine-tuning a large language model for a legal document summarization task. They notice that during training, the loss decreases rapidly in the first few epochs but then plateaus with high variance. Which hyperparameter adjustment is MOST likely to help stabilize training?

A.Add L1 regularization

B.Decrease the learning rate

C.Increase the batch size

D.Increase the number of epochs

AnswerB

A lower learning rate reduces gradient step sizes, stabilizing training and reducing variance.

Why this answer

A high-variance loss plateau after rapid initial convergence typically indicates that the learning rate is too large, causing the optimizer to overshoot the minima and oscillate. Decreasing the learning rate allows smaller, more stable weight updates, reducing variance and enabling smoother convergence.

Exam trap

Cisco often tests the misconception that high variance in loss is always solved by increasing batch size or regularization, when in fact the immediate cause is often an overly aggressive learning rate that prevents convergence.

How to eliminate wrong answers

Option A is wrong because L1 regularization adds a penalty on the absolute magnitude of weights to induce sparsity, which does not directly address high variance in the loss curve during fine-tuning. Option C is wrong because increasing the batch size reduces gradient noise and can stabilize training, but the question describes high variance after a plateau, which is more directly tied to learning rate oscillations rather than batch size. Option D is wrong because increasing the number of epochs does not fix the underlying instability; it may even exacerbate overfitting or variance if the learning rate remains too high.

Practice this question →

41

MCQmedium

A data scientist is building a model to predict whether a loan application will default. The dataset has 10,000 labeled examples with 1,000 defaults. Which metric is MOST appropriate for evaluating this highly imbalanced binary classification?

A.Precision

B.AUC-ROC

C.Recall

D.Accuracy

AnswerB

AUC-ROC evaluates model performance across all thresholds and is insensitive to class imbalance.

Why this answer

AUC-ROC is robust to class imbalance because it measures the trade-off between true positive rate and false positive rate across all thresholds. Accuracy is misleading when classes are imbalanced. Precision and recall focus on one class but are threshold-dependent.

Practice this question →

42

Multi-Selecteasy

A machine learning team is splitting a dataset for a binary classification problem. They want to ensure robust evaluation and avoid data leakage. Which TWO practices should they follow? (Choose 2)

Select 2 answers

A.Normalise the entire dataset before splitting

B.Split into training, validation, and test sets

C.Include validation data in the training set for more data

D.Shuffle the data before splitting

E.Use the same split for all experiments

AnswersB, D

A three-way split allows tuning on validation and final evaluation on test.

Why this answer

Train/validation/test split is standard; cross-validation gives more robust estimates. Shuffling before split prevents ordering bias.

Practice this question →

43

Multi-Selectmedium

A team is training a deep learning model for image classification. They observe that training accuracy is high but validation accuracy is low, indicating overfitting. Which TWO techniques should they apply to reduce overfitting? (Select TWO)

Select 2 answers

A.Use L2 regularization

B.Increase learning rate

C.Add dropout layers

D.Increase the number of layers

E.Reduce training data size

AnswersA, C

L2 regularization penalizes large weights, encouraging simpler models.

Why this answer

L2 regularization (also known as weight decay) adds a penalty proportional to the square of the magnitude of the weights to the loss function. This discourages the model from learning overly complex patterns that fit the training data noise, effectively reducing overfitting by keeping weights small and the decision boundary simpler.

Exam trap

Cisco often tests the misconception that increasing model complexity (more layers or data reduction) helps generalization, when in fact these actions typically worsen overfitting.

Practice this question →

44

Multi-Selecteasy

A company wants to build a system that can generate new product images for an online catalog. Which TWO generative AI approaches are most suitable?

Select 2 answers

A.Diffusion models

B.Variational autoencoders (VAEs)

C.Generative Adversarial Networks (GANs)

D.BERT-based model

E.GPT-style language model

AnswersA, C

Diffusion models like DALL-E and Stable Diffusion produce high-quality images from noise.

Why this answer

Generative Adversarial Networks (GANs) are widely used for image generation, and diffusion models (like Stable Diffusion) have achieved state-of-the-art results in image synthesis. Variational autoencoders (VAEs) can generate images but often produce blurrier outputs. GPT is for text, and BERT is for understanding.

Practice this question →

45

MCQmedium

A team is deploying a sentiment analysis model for social media posts. The model currently performs well on English text but poorly on code-switched text (e.g., Spanglish). Which approach is MOST effective for improving performance on code-switched data without starting from scratch?

A.Use a larger base model without additional training

B.Apply data augmentation by translating all code-switched posts to English

C.Train a new model from scratch on a mix of English and code-switched data

D.Fine-tune the existing model on a corpus of code-switched text

AnswerD

Fine-tuning leverages pre-trained knowledge and adapts to the target domain with less data and compute.

Why this answer

Fine-tuning the existing model on a corpus of code-switched text adapts the model to the new language pattern efficiently.

Practice this question →

46

Multi-Selecteasy

A company wants to use machine learning to recommend products to customers based on their purchase history. Which TWO techniques are appropriate for this task? (Select TWO)

Select 2 answers

A.Collaborative filtering

B.Principal Component Analysis (PCA)

C.K-Nearest Neighbors (KNN)

D.Naive Bayes

E.Linear regression

AnswersA, C

Collaborative filtering uses behavior patterns to recommend items.

Why this answer

Collaborative filtering recommends based on user similarities. K-Nearest Neighbors can find similar users or items. Both are suitable for recommendation.

Practice this question →

47

MCQhard

A research team is training a deep learning model for image classification using a small dataset of 1,000 labeled images. They are concerned about overfitting. Which combination of regularisation techniques would be MOST effective?

A.Use early stopping without any other regularisation

B.Dropout with a rate of 0.5 and L2 regularisation

C.L1 regularisation and batch normalisation

D.Increase learning rate and use momentum

AnswerB

Dropout and L2 regularisation together effectively reduce overfitting by preventing reliance on specific neurons and penalising large weights.

Why this answer

Dropout randomly disables neurons during training to prevent co-adaptation, and L2 regularisation penalises large weights. Both are standard regularisation techniques. L1 promotes sparsity but is less common for dense layers.

Batch normalisation helps convergence but is not primarily a regularisation method.

Practice this question →

48

Multi-Selectmedium

A data scientist is preparing a dataset for a binary classification model. The dataset has 1000 samples, with 800 positives and 200 negatives. To evaluate the model properly, which THREE steps should they take? (Select THREE)

Select 3 answers

A.Remove the minority class samples to make the dataset balanced

B.Use a stratified train-test split to preserve class proportions

C.Apply SMOTE (Synthetic Minority Over-sampling Technique) to balance the training set

D.Report only accuracy as the evaluation metric

E.Use precision, recall, and F1-score for evaluation

AnswersB, C, E

Stratified split ensures both training and test sets have similar class ratios.

Why this answer

Option B is correct because stratified train-test splitting ensures that the class distribution (80% positive, 20% negative) is preserved in both training and test sets. This prevents the model from being evaluated on a test set that has a different class ratio, which could give a misleading impression of performance, especially in imbalanced datasets.

Exam trap

Cisco often tests the misconception that removing minority samples or relying solely on accuracy is acceptable for imbalanced datasets, when in fact these approaches degrade model performance and evaluation validity.

Practice this question →

49

MCQhard

A team is training a deep learning model for image classification. The training loss decreases steadily but the validation loss plateaus after 20 epochs and then starts to increase. Which action is MOST likely to improve generalization?

A.Add more convolutional layers

B.Increase the learning rate

C.Implement early stopping

D.Reduce the batch size

AnswerC

Early stopping monitors validation loss and stops training before overfitting occurs, directly addressing the plateau and rise.

Why this answer

Early stopping halts training when validation loss stops improving, preventing overfitting. Increasing learning rate would worsen divergence; adding more layers increases capacity and overfitting; reducing batch size may help optimization but not directly address overfitting.

Practice this question →

50

MCQhard

A research team is fine-tuning a BERT model for a text classification task. They notice that the model's performance on the validation set fluctuates wildly across epochs, sometimes dropping significantly from one epoch to the next. Which technique is MOST likely to stabilise training?

A.Use a smaller batch size

B.Increase the learning rate

C.Apply gradient clipping

D.Increase the number of epochs

AnswerC

Gradient clipping limits the norm of gradients, preventing large destabilising updates.

Why this answer

Gradient clipping directly addresses the problem of exploding gradients, which can cause large, destabilizing weight updates during fine-tuning of large models like BERT. By capping the gradient norm (e.g., to a value like 1.0), it prevents a single batch from drastically altering the model's parameters, thus smoothing out validation performance fluctuations across epochs.

Exam trap

Cisco often tests the misconception that increasing epochs or adjusting batch size alone can fix training instability, when the root cause is gradient explosion, which only gradient clipping directly mitigates.

How to eliminate wrong answers

Option A is wrong because using a smaller batch size typically increases gradient variance, which can actually worsen fluctuations in validation performance rather than stabilize them. Option B is wrong because increasing the learning rate amplifies the magnitude of weight updates, making the model more prone to overshooting minima and causing even more erratic validation loss spikes. Option D is wrong because increasing the number of epochs does not address the underlying instability; it merely extends training, which could allow the model to eventually converge but does not prevent the wild epoch-to-epoch drops caused by gradient instability.

Practice this question →

51

MCQmedium

A team is using a pre-trained BERT model for a sentiment analysis task on product reviews. They want to adapt it to their specific domain with limited labeled data. Which approach is MOST effective?

A.Use BERT as a feature extractor and train a logistic regression on top

B.Apply data augmentation to increase the dataset and then train from scratch

C.Train a new BERT model from scratch on the domain data

D.Fine-tune the pre-trained BERT model on the small labeled dataset

AnswerD

Fine-tuning leverages pre-trained knowledge and adapts effectively with limited data.

Why this answer

Fine-tuning the pre-trained model on the small labeled dataset is standard for transfer learning with BERT. Training from scratch would require massive data; feature extraction with a linear classifier is possible but less effective with limited data; data augmentation alone does not adapt the model.

Practice this question →

52

Multi-Selectmedium

A data scientist is preparing to train a convolutional neural network (CNN) for image classification. Which TWO actions are most effective for preventing overfitting? (Choose 2)

Select 2 answers

A.Use data augmentation

B.Increase the number of epochs

C.Use L2 regularization

D.Add more convolutional layers

E.Use dropout layers

AnswersA, E

Data augmentation generates variations of training images, effectively increasing the dataset size and reducing overfitting.

Why this answer

Data augmentation (A) is effective for preventing overfitting because it artificially expands the training dataset by applying random transformations (e.g., rotation, flipping, cropping, color jitter) to existing images. This exposes the CNN to a wider variety of input patterns, reducing the model's tendency to memorize noise or specific details and improving generalization to unseen data.

Exam trap

Cisco often tests the distinction between regularization techniques that directly reduce overfitting (like dropout and data augmentation) versus architectural changes (like adding layers) that increase capacity and may worsen overfitting if not balanced with regularization.

Practice this question →

53

Multi-Selectmedium

A data scientist is building a recommendation system for an e-commerce platform. The dataset includes user purchase history, product descriptions, and user demographics. The goal is to recommend products that a user is likely to purchase. Which TWO techniques are most appropriate for this task? (Select TWO.)

Select 2 answers

A.Content-based filtering

B.Association rule mining

C.Linear regression

D.Anomaly detection

E.Collaborative filtering

AnswersA, E

Uses product descriptions and demographics to match user preferences.

Why this answer

Collaborative filtering uses user-item interactions; content-based filtering uses item features. Association rule mining is for basket analysis; regression is not typically used for recommendations; anomaly detection is for outliers.

Practice this question →

54

MCQeasy

Which of the following best describes the difference between narrow AI and general AI?

A.Narrow AI is designed for a specific task; general AI aims to perform any cognitive task a human can.

B.Narrow AI relies on supervised learning; general AI uses unsupervised learning exclusively.

C.Narrow AI requires large datasets; general AI can learn from few examples.

D.Narrow AI can perform any intellectual task; general AI is limited to specific tasks.

AnswerA

Narrow AI excels at one domain (e.g., chess), whereas general AI would be versatile.

Why this answer

Narrow AI specializes in one task, while general AI would possess human-like cognitive abilities across domains.

Practice this question →

55

Multi-Selectmedium

A machine learning engineer is training a convolutional neural network (CNN) for object detection in satellite imagery. The training loss is not decreasing significantly. Which TWO adjustments could help the model converge? (Select TWO)

Select 2 answers

A.Normalize the pixel values to zero mean and unit variance

B.Remove dropout layers to allow more gradient flow

C.Use a smaller batch size to reduce memory

D.Increase the learning rate by 10x

E.Reduce the learning rate if the loss plateaus

AnswersA, E

Normalization ensures consistent scale, helping gradient descent converge faster.

Why this answer

Normalizing pixel values to zero mean and unit variance (A) ensures that input features have similar scales, which prevents certain weights from updating disproportionately and stabilizes gradient descent. This is especially important for CNNs processing satellite imagery, where raw pixel intensities can vary widely across bands and scenes, leading to poor convergence.

Exam trap

Cisco often tests the misconception that increasing the learning rate always accelerates convergence, when in practice it can cause divergence, and that removing regularization layers like dropout directly improves training loss reduction.

Practice this question →

56

MCQhard

A team is training a generative adversarial network (GAN) to generate realistic images of furniture. The generator loss decreases sharply while the discriminator loss increases. What is the MOST likely issue and recommended action?

A.Mode collapse has occurred; increase the generator's learning rate

B.The discriminator is overfitting; decrease its capacity

C.The learning rates are too high; reduce both

D.The generator is too strong; train the discriminator more frequently

AnswerD

Training the discriminator more often helps it catch up to the generator, balancing the GAN.

Why this answer

If the generator loss drops too fast and discriminator loss rises, the generator is overpowering the discriminator. The typical remedy is to train the discriminator more often (e.g., 5 steps per generator step) or adjust the learning rates.

Practice this question →

57

MCQhard

A prompt engineer wants to reduce the risk of prompt injection attacks in an LLM-based application that processes user input. Which strategy is MOST effective?

A.Set the temperature to 0

B.Use a system prompt that instructs the model to ignore injection attempts

C.Sanitize user input to remove or neutralize special characters and instruction-like patterns

D.Use a larger, more powerful LLM

AnswerC

Input sanitization directly removes attempts to hijack the prompt.

Why this answer

Option C is correct because sanitizing user input to remove or neutralize special characters and instruction-like patterns directly addresses the root cause of prompt injection attacks: the ability for user-supplied text to alter the intended behavior of the LLM. By stripping or escaping tokens that mimic system instructions (e.g., 'Ignore previous instructions' or delimiter sequences), the application prevents the injection vector from reaching the model's instruction-following logic. This is a fundamental input validation technique analogous to SQL injection prevention, applied to the LLM context.

Exam trap

Cisco often tests the misconception that model-level parameters (like temperature) or simple prompt instructions can substitute for robust input validation, when in fact only sanitization directly neutralizes the injection vector at the application layer.

How to eliminate wrong answers

Option A is wrong because setting the temperature to 0 only makes the model's output more deterministic and less creative, but it does not prevent the model from following injected instructions within the user input; the model will still execute malicious commands regardless of temperature. Option B is wrong because a system prompt instructing the model to ignore injection attempts is unreliable—it can be overridden by a cleverly crafted user input that tells the model to disregard prior instructions, as LLMs are susceptible to instruction hierarchy attacks. Option D is wrong because using a larger, more powerful LLM does not inherently improve security against prompt injection; larger models may even be more capable of following complex injected instructions, increasing the risk.

Practice this question →

58

MCQeasy

Which of the following is a key characteristic of Narrow AI (Weak AI)?

A.It can perform any intellectual task that a human can

B.It requires no training data

C.It is designed to excel at a single, specific task

D.It surpasses human intelligence in all domains

AnswerC

Narrow AI focuses on a limited domain.

Why this answer

Narrow AI, also known as Weak AI, is designed and trained to perform a single, specific task with high proficiency, such as language translation, image recognition, or playing chess. It cannot generalize its intelligence to other domains, which distinguishes it from Artificial General Intelligence (AGI). Option C correctly captures this fundamental characteristic.

Exam trap

Cisco often tests the distinction between Narrow AI and AGI, and the trap here is that candidates confuse 'narrow' with 'limited performance' rather than understanding it means 'restricted to a single task domain'.

How to eliminate wrong answers

Option A is wrong because the ability to perform any intellectual task that a human can describes Artificial General Intelligence (AGI), not Narrow AI, which is limited to a specific domain. Option B is wrong because Narrow AI systems require extensive training data to learn patterns and make accurate predictions; without training data, they cannot function. Option D is wrong because surpassing human intelligence in all domains is a trait of superintelligence, which is a theoretical concept beyond current Narrow AI capabilities.

Practice this question →

59

Multi-Selectmedium

A machine learning team is evaluating a logistic regression model for a binary classification task. The dataset has 1,000 samples and 20 features. Which TWO metrics are most appropriate for evaluating model performance? (Choose 2)

Select 2 answers

A.BLEU score

B.F1 score

C.Perplexity

D.Accuracy

E.AUC-ROC

AnswersB, E

F1 score combines precision and recall, useful when class distribution is uneven.

Why this answer

The F1 score is appropriate because it balances precision and recall, making it robust for binary classification when class distribution may be imbalanced. AUC-ROC measures the model's ability to distinguish between positive and negative classes across all classification thresholds, providing a threshold-independent evaluation of discriminative performance.

Exam trap

Cisco often tests the distinction between metrics for classification versus metrics for sequence generation or language modeling, leading candidates to mistakenly select BLEU or perplexity for a binary classification task.

Practice this question →

60

MCQmedium

A team trains a neural network for image classification. During training, the loss decreases on the training set but increases on the validation set after a few epochs. What is the most likely cause?

A.Vanishing gradients

B.Incorrect learning rate scheduling

C.Overfitting

D.Underfitting

AnswerC

Overfitting causes the model to perform well on training data but poorly on validation data.

Why this answer

Overfitting occurs when the model learns the training data too well, including noise and irrelevant patterns, causing it to memorize rather than generalize. This is evidenced by the loss decreasing on the training set while increasing on the validation set after a few epochs, as the model's performance on unseen data degrades.

Exam trap

Cisco often tests the distinction between overfitting and underfitting by presenting a scenario where training loss decreases but validation loss increases, which candidates may confuse with a learning rate issue or gradient problem.

How to eliminate wrong answers

Option A is wrong because vanishing gradients cause the network to stop learning entirely (loss plateaus on both sets), not a divergence between training and validation loss. Option B is wrong because incorrect learning rate scheduling typically causes erratic loss behavior (e.g., oscillations or failure to converge) on both sets, not a clear overfitting pattern. Option D is wrong because underfitting results in high loss on both training and validation sets, not a decreasing training loss with increasing validation loss.

Practice this question →

61

Multi-Selecthard

A team is deploying a sentiment analysis model that must achieve high precision and high recall. They have a labeled dataset of 10,000 samples. They want to minimize overfitting. Which THREE actions are most appropriate? (Select THREE.)

Select 3 answers

A.Decrease the learning rate

B.Apply L2 regularization to the model weights

C.Use dropout layers in the neural network

D.Increase the training batch size

E.Augment the training data with synthetic examples

AnswersB, C, E

Penalizes large weights, reducing overfitting.

Why this answer

L2 regularization (option B) penalizes large weights by adding a squared magnitude term to the loss function, which discourages the model from fitting noise in the training data. This directly reduces overfitting while maintaining high precision and recall by keeping the decision boundary smooth and generalizable.

Exam trap

Cisco often tests the misconception that decreasing the learning rate is a regularization technique, when in fact it only affects optimization speed and not model complexity or overfitting prevention.

Practice this question →

62

MCQmedium

A company is deploying a large language model for customer support. They want to reduce the number of off-topic or nonsensical responses while maintaining creativity. Which parameter adjustment would BEST achieve this?

A.Decrease temperature to 0.2

B.Set top-p to 0.1

C.Increase top-k to 100

D.Increase temperature to 0.9

AnswerA

Lower temperature reduces randomness, making the model more focused and less likely to generate nonsensical outputs.

Why this answer

Lowering temperature makes the model more deterministic and less likely to produce random outputs. Top-p and top-k can also help but are secondary; temperature directly controls randomness.

Practice this question →

63

Multi-Selectmedium

A data scientist needs to select a regression model to predict house prices. The dataset contains many features, some of which are irrelevant. Which TWO algorithms are BEST suited for this scenario, and why? (Select TWO)

Select 2 answers

A.Ridge regression (L2 regularization)

B.Linear regression

C.Lasso regression (L1 regularization)

D.K-Nearest Neighbors

E.Random Forest

AnswersC, E

Lasso applies L1 penalty, driving coefficients of irrelevant features to zero.

Why this answer

Random Forest handles irrelevant features well via feature importance. Lasso (L1) regression performs automatic feature selection by shrinking coefficients to zero.

Practice this question →

64

MCQeasy

A data scientist needs to predict whether a customer will churn (yes/no) based on historical data. Which type of machine learning problem is this?

A.Reinforcement learning

B.Regression

C.Binary classification

D.Clustering

AnswerC

Churn prediction with two classes (yes/no) is a binary classification problem.

Why this answer

This is a binary classification problem because the target variable has exactly two discrete outcomes: 'yes' (churn) or 'no' (no churn). Classification algorithms such as logistic regression, decision trees, or support vector machines are used to assign input features to one of these two predefined classes. The output is a categorical label, not a continuous value or a reward signal.

Exam trap

Cisco often tests the distinction between classification and regression by presenting a binary outcome and expecting candidates to recognize it as classification, not regression, even though the term 'regression' appears in 'logistic regression' which is actually a classification algorithm.

How to eliminate wrong answers

Option A is wrong because reinforcement learning involves an agent learning to make sequences of decisions by interacting with an environment to maximize cumulative reward, not predicting a static binary outcome from historical data. Option B is wrong because regression predicts a continuous numeric value (e.g., revenue, temperature), not a discrete class label like churn yes/no. Option D is wrong because clustering is an unsupervised learning technique that groups data points based on similarity without using labeled target variables, whereas churn prediction requires labeled historical data to train a supervised model.

Practice this question →

65

Multi-Selectmedium

An AI developer is selecting a model architecture for a real-time video surveillance system that must detect objects in each frame and also track movement patterns across frames. Which TWO architectures should the developer combine? (Choose 2)

Select 2 answers

A.Transformer encoder only

B.Generative adversarial network (GAN)

C.Variational autoencoder (VAE)

D.Recurrent neural network (RNN) or LSTM

E.Convolutional neural network (CNN)

AnswersD, E

RNNs/LSTMs capture temporal dependencies across frames.

Why this answer

CNNs are ideal for image feature extraction; RNNs/LSTMs are designed for sequence modelling to track temporal patterns.

Practice this question →

66

MCQmedium

A company is deploying a text generation model for customer service emails. They want to ensure the model's responses are factual and based on internal knowledge bases. Which technique is most effective?

A.Use Retrieval-Augmented Generation (RAG)

B.Fine-tune the model on historical customer service emails

C.Write a detailed system prompt

D.Set the temperature to 0

AnswerA

RAG retrieves relevant knowledge base content at query time, providing the model with factual context to generate accurate responses.

Why this answer

RAG retrieves relevant documents from the knowledge base at inference time, grounding the model's responses in facts. Prompt engineering alone can't ensure factual accuracy; fine-tuning may still hallucinate; temperature reduction only reduces randomness.

Practice this question →

67

Multi-Selecteasy

A company wants to classify images of products into categories. They have a large dataset of labeled images. Which TWO types of neural networks are most suitable for this task? (Select TWO.)

Select 2 answers

A.Generative Adversarial Network (GAN)

B.Convolutional Neural Network (CNN)

C.Recurrent Neural Network (RNN)

D.Transformer (e.g., Vision Transformer)

E.Multi-layer Perceptron (MLP)

AnswersB, D

Specialized for image classification.

Why this answer

Convolutional Neural Networks (CNNs) are the standard architecture for image classification because they use convolutional layers to automatically learn spatial hierarchies of features (edges, textures, shapes) from pixel data. They are highly effective for large labeled image datasets due to their parameter efficiency and translation invariance.

Exam trap

Cisco often tests the misconception that any neural network can handle images, but the trap here is that RNNs and MLPs are technically capable of processing image data yet are fundamentally unsuitable for spatial feature extraction, leading candidates to select them over the correct specialized architectures.

Practice this question →

68

MCQhard

A model trained on customer reviews achieves 98% accuracy on the test set. However, when deployed, it performs poorly on real-world data. The data scientist suspects distribution shift. Which action is MOST important to address this?

A.Reduce the learning rate during training

B.Implement a monitoring system to detect data drift and retrain with fresh data

C.Add more features to the model

D.Increase the number of cross-validation folds

AnswerB

Detecting drift and retraining with representative data directly addresses distribution shift.

Why this answer

Option B is correct because distribution shift (data drift) causes the model's training distribution to differ from the real-world distribution, degrading performance despite high test accuracy. Implementing a monitoring system to detect drift and retraining with fresh data directly addresses this by ensuring the model adapts to the current data distribution, which is the most critical action for maintaining performance in production.

Exam trap

Cisco often tests the misconception that high test accuracy guarantees real-world performance, leading candidates to focus on training improvements (like tuning hyperparameters or adding features) rather than addressing the root cause of distribution shift through monitoring and retraining.

How to eliminate wrong answers

Option A is wrong because reducing the learning rate affects the optimization step size during training, which does not address distribution shift after deployment; it only changes how the model converges on the training data. Option C is wrong because adding more features may improve model capacity but does not fix the mismatch between training and real-world distributions; it could even exacerbate overfitting to the original distribution. Option D is wrong because increasing cross-validation folds improves the reliability of performance estimates on the training/validation data but does not detect or correct for distribution shift in the deployed environment.

Practice this question →

69

MCQhard

An AI engineer is designing a system to detect unusual patterns in network traffic that may indicate a security breach. The system should learn from normal traffic patterns and flag deviations. Which machine learning approach is MOST appropriate?

A.Reinforcement learning with reward shaping

B.Supervised classification using logistic regression

C.Semi-supervised learning with a small labeled set

D.Unsupervised anomaly detection

AnswerD

Anomaly detection learns normal patterns from unlabeled data and flags deviations, ideal for unknown attacks.

Why this answer

Unsupervised anomaly detection is the most appropriate approach because the system must learn 'normal' traffic patterns from unlabeled data and then flag deviations without requiring pre-labeled examples of attacks. This aligns with the core requirement of detecting unknown or novel security breaches, which supervised methods cannot handle due to the lack of labeled attack data.

Exam trap

Cisco often tests the misconception that semi-supervised learning (Option C) is a middle ground for anomaly detection, but the trap is that it still requires labeled attack data, which is unavailable for unknown security breaches, making unsupervised methods the only viable choice.

How to eliminate wrong answers

Option A is wrong because reinforcement learning with reward shaping is designed for sequential decision-making problems (e.g., autonomous agents) and is not suited for static pattern detection in network traffic; it would require a reward function for 'normal' behavior, which is impractical for anomaly detection. Option B is wrong because supervised classification using logistic regression requires a fully labeled dataset of both normal and attack traffic, which is unavailable when the goal is to detect unknown or novel breaches. Option C is wrong because semi-supervised learning with a small labeled set still relies on labeled attack examples, which are scarce or nonexistent for novel security threats, and it does not purely model normal behavior like unsupervised methods do.

Practice this question →

70

MCQeasy

A startup wants to identify unusual patterns in network traffic to detect potential security breaches. They have a large dataset of normal traffic but very few labeled attacks. Which machine learning approach is MOST suitable?

A.Supervised classification with logistic regression

B.Unsupervised anomaly detection

C.Reinforcement learning

D.Semi-supervised learning

AnswerB

Unsupervised anomaly detection can find deviations from normal traffic without needing labeled attack data.

Why this answer

Unsupervised anomaly detection is the most suitable approach because the startup has a large dataset of normal traffic but very few labeled attacks. This technique learns the baseline of normal behavior from unlabeled data and flags deviations as potential anomalies, which is ideal for detecting unknown or rare attack patterns without requiring labeled attack samples.

Exam trap

Cisco often tests the misconception that semi-supervised learning is the best choice when labeled data is scarce, but the key distinction is that semi-supervised learning still requires a meaningful amount of labeled data for the target class, whereas unsupervised anomaly detection works with zero labeled attacks.

How to eliminate wrong answers

Option A is wrong because supervised classification with logistic regression requires a large, balanced set of labeled attack and normal traffic data to train effectively, which the startup lacks. Option C is wrong because reinforcement learning is designed for sequential decision-making problems (e.g., autonomous agents) and is not suited for static pattern detection in network traffic. Option D is wrong because semi-supervised learning still requires at least some labeled attack data to guide the model, and the startup has very few labeled attacks, making it less effective than pure unsupervised anomaly detection.

Practice this question →

71

MCQeasy

An AI system that can perform any intellectual task that a human being can is referred to as:

A.Machine learning

B.Artificial General Intelligence (AGI)

C.Narrow AI

D.Deep learning

AnswerB

AGI is the hypothetical ability of an AI to perform any intellectual task a human can.

Why this answer

Artificial General Intelligence (AGI) is the concept of a machine that understands or learns any intellectual task that a human being can.

Practice this question →

72

MCQhard

A generative AI model is asked to 'Write a poem about AI' and returns a very short, generic response. The user wants longer, more creative outputs. Which parameter adjustment is MOST likely to help?

A.Decrease the top-p value

B.Increase the frequency penalty

C.Decrease the max tokens limit

D.Increase the temperature parameter

AnswerD

Higher temperature (e.g., 0.8-1.0) makes the model take more risks, leading to more creative and varied outputs.

Why this answer

Increasing the temperature parameter raises the randomness of token selection, encouraging the model to explore less probable word sequences and produce more varied, creative, and longer outputs. A low temperature (e.g., 0.1) makes the model deterministic and repetitive, often yielding short, generic responses. By increasing temperature (e.g., to 0.8 or 1.0), the model is more likely to generate diverse and expansive text, directly addressing the user's request for longer, more creative poems.

Exam trap

Cisco often tests the misconception that increasing max tokens (or decreasing it) is the primary way to control output length, when in fact temperature and top-p are the key parameters for influencing creativity and diversity, while max tokens simply sets a hard cutoff.

How to eliminate wrong answers

Option A is wrong because decreasing top-p (nucleus sampling) narrows the cumulative probability mass considered for token selection, making the output more focused and less diverse, which would further shorten and genericize the response. Option B is wrong because increasing the frequency penalty reduces the likelihood of repeating tokens or phrases, which can help with variety but does not directly encourage longer outputs; it may even shorten the response by penalizing common words. Option C is wrong because decreasing the max tokens limit explicitly caps the output length, which would make the response even shorter, opposite to the user's goal of longer outputs.

Practice this question →

73

MCQeasy

Which type of neural network is BEST suited for processing sequential data such as time series or natural language?

A.Generative Adversarial Network (GAN)

B.Multi-layer Perceptron (MLP)

C.Recurrent Neural Network (RNN)

D.Convolutional Neural Network (CNN)

AnswerC

RNNs have loops that allow information to persist, making them ideal for sequences.

Why this answer

RNNs (including LSTMs) are designed for sequential data with temporal dependencies. CNNs excel at spatial data; transformers are also used but RNNs are the classic answer.

Practice this question →

74

MCQeasy

Which neural network architecture is specifically designed to process sequential data, such as time series or sentences, by maintaining a hidden state that captures information about previous inputs?

A.Transformer

B.Convolutional Neural Network (CNN)

C.Multi-layer Perceptron (MLP)

D.Recurrent Neural Network (RNN)

AnswerD

RNNs have a hidden state that evolves over time steps, ideal for sequences.

Why this answer

Recurrent Neural Networks (RNNs) are specifically designed for sequential data because they maintain a hidden state that is updated at each time step, allowing information about previous inputs to persist and influence current and future outputs. This feedback loop makes them ideal for tasks like time series forecasting, natural language processing, and speech recognition, where order and context matter.

Exam trap

Cisco often tests the misconception that Transformers are the default architecture for all sequence tasks, but the question specifically asks for a network that 'maintains a hidden state'—a defining feature of RNNs, not Transformers.

How to eliminate wrong answers

Option A (Transformer) is wrong because, while Transformers process sequences using self-attention mechanisms, they do not maintain a recurrent hidden state; they rely on positional encodings and parallel processing of the entire sequence. Option B (CNN) is wrong because CNNs are designed for spatial data (e.g., images) using convolutional filters and pooling layers, not for capturing temporal dependencies via a hidden state. Option C (MLP) is wrong because MLPs are feedforward networks with no memory or sequential processing capability; each input is processed independently without any hidden state carrying information across time steps.

Practice this question →

75

MCQeasy

Which neural network architecture is specifically designed to handle sequential data and mitigate the vanishing gradient problem?

A.Convolutional Neural Network (CNN)

B.Transformer

C.Vanilla Recurrent Neural Network (RNN)

D.Long Short-Term Memory (LSTM) network

AnswerD

LSTMs use forget, input, and output gates to control information flow, effectively handling long sequences and vanishing gradients.

Why this answer

LSTM (Long Short-Term Memory) is a type of RNN designed with gating mechanisms to prevent vanishing gradients in long sequences. CNNs are for spatial data; vanilla RNNs suffer from vanishing gradients; transformers use attention but are not specifically designed to mitigate vanishing gradients (they use residual connections).

Practice this question →

Ready to test yourself?

Try a timed practice session using only Aio Ai Concepts Techniques questions.

Start 20-question session