CCNA OCI GenAI LLM Fundamentals Questions

70 of 145 questions · Page 2/2 · OCI GenAI LLM Fundamentals topic · Answers revealed

76

MCQ · medium

A data scientist is using OCI Generative AI to process a large batch of legal documents. The total cost is higher than expected. Which factor is most likely the primary driver of cost?

A. The number of API requests made

B. The total number of tokens processed (input + output)

C. The choice of sampling strategy (e.g., top-k vs greedy)

D. The latency of the inference endpoint

Answer: B

Pricing is typically per token; longer documents mean more tokens and higher cost.

Why this answer

In OCI Generative AI, pricing is primarily based on the total number of tokens processed, which includes both input (prompt) and output (generated) tokens. Processing large batches of legal documents generates high token counts due to lengthy text inputs and verbose outputs, directly increasing cost. The number of API requests alone does not determine cost—a single request with many tokens costs more than many requests with few tokens.

Exam trap

The exam often tests the misconception that API request count is the primary cost driver. Under token-based pricing, a single large request can cost more than hundreds of tiny ones.

How to eliminate wrong answers

Option A is wrong because OCI Generative AI charges per token, not per API request; a single request with a large prompt and long response incurs higher cost than many small requests. Option C is wrong because sampling strategy (e.g., top-k vs greedy) affects output diversity and quality, not the token count or pricing model. Option D is wrong because latency of the inference endpoint impacts response time and throughput, not the cost per token or total cost.
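The token-versus-request argument can be checked with a little arithmetic. The sketch below assumes token-based billing as described above; the two price constants are hypothetical placeholders, not actual OCI rates.

```python
# Hypothetical prices for illustration only; these are NOT real OCI rates.
PRICE_PER_1K_INPUT_TOKENS = 0.003
PRICE_PER_1K_OUTPUT_TOKENS = 0.006

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under simple per-token billing."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS + (
        output_tokens / 1000
    ) * PRICE_PER_1K_OUTPUT_TOKENS

# One request carrying a long legal document and a long answer.
one_large_request = request_cost(input_tokens=100_000, output_tokens=4_000)

# Two hundred short requests.
many_small_requests = sum(
    request_cost(input_tokens=50, output_tokens=20) for _ in range(200)
)

print(f"1 large request:    {one_large_request:.4f}")
print(f"200 small requests: {many_small_requests:.4f}")
```

With these assumed prices the single large request costs several times more than the 200 small ones combined, which is the point option B makes.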

77

MCQ · hard

An OCI practitioner observes that an LLM consistently generates incorrect answers for questions about recent events (last 6 months). The model was fine-tuned on company data but not retrained recently. What is the MOST likely root cause?

A. The context window is too short

B. The sampling strategy is too greedy

C. The model's knowledge cutoff date is before the recent events

D. The fine-tuning dataset was too small

Answer: C

LLMs have a fixed pre-training cutoff; fine-tuning with company data does not add world knowledge beyond that cutoff unless the data includes those events.

Why this answer

The model's knowledge cutoff date determines the temporal boundary of its training data. Since the LLM was fine-tuned on company data but not retrained, it lacks exposure to events occurring after its cutoff, making it unable to generate accurate answers about recent events. This is the most direct and common cause for such temporal knowledge gaps in LLMs.

Exam trap

The exam often tests the distinction between inference-time constraints (like the context window) and training data limitations (like the knowledge cutoff), and candidates tend to confuse the two.

How to eliminate wrong answers

Option A is wrong because a short context window limits the amount of input text the model can process at inference time, not the model's knowledge of recent events; it does not prevent the model from recalling information it was trained on. Option B is wrong because greedy sampling (e.g., always picking the highest probability token) affects output diversity and creativity, not the model's factual knowledge about recent events. Option D is wrong because the size of the fine-tuning dataset influences the model's ability to learn domain-specific patterns, but it does not determine the temporal cutoff of the base model's pre-training data; the issue is the lack of recent data in the base model, not the volume of fine-tuning data.

78

Multi-Select · medium

An organization wants to use OCI Generative AI for a multilingual translation task. They need high quality and must avoid biases present in the training data. Which THREE strategies should they consider? (Select THREE.)

Select 3 answers

A. Use a RAG pipeline to retrieve canonical translations from a trusted database

B. Implement a human-in-the-loop review process to catch biased translations

C. Fine-tune a pre-trained model on a high-quality parallel corpus for the target language pairs

D. Increase the temperature parameter to 1.5 to reduce repetitive biases

E. Use an encoder-decoder model such as T5 or BART

Answers: B, C, E

Human review is an effective way to identify and correct biased outputs.

Why this answer

Fine-tuning on high-quality parallel corpora improves accuracy. Using models designed for translation (e.g., encoder-decoder) often yields better results. Implementing human-in-the-loop review catches biases.

Raising the temperature increases randomness and tends to lower translation quality; it does not address bias learned from the training data, so it is not a bias-mitigation strategy. RAG is not directly applicable to translation here because it requires retrieved documents in the target language.

79

MCQ · easy

Which model architecture is used by BERT for natural language understanding tasks?

A. Recurrent neural network

B. Encoder-decoder

C. Encoder-only

D. Decoder-only

Answer: C

BERT is an encoder-only model that uses bidirectional self-attention to understand the full context of the input.

Why this answer

BERT uses an encoder-only architecture, which processes the entire input sequence bidirectionally. This makes it well-suited for tasks like classification, NER, and QA where understanding the full context is important.

80

MCQ · medium

A company has a large dataset of legal documents in multiple languages. They need to find documents semantically similar to a query. Which step is essential for this task?

A. Apply BPE tokenization to all documents

B. Use a text embedding model to convert documents into dense vector representations

C. Fine-tune a generation model on the legal documents

D. Use beam search to identify similar passages

Answer: B

Embedding models produce vectors that enable semantic similarity computation via cosine similarity.

Why this answer

Embedding models convert text into dense vectors that capture semantic meaning. Cosine similarity between query and document embeddings is then used to find similar documents.
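As a rough illustration of that ranking step, the sketch below scores documents against a query by cosine similarity. The three-dimensional vectors and document names are made up for the example; in practice the vectors would come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embedding-model output (assumed values).
doc_vectors = {
    "lease_agreement": [0.9, 0.1, 0.0],
    "employment_contract": [0.2, 0.8, 0.1],
    "patent_filing": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked)
```

Because similarity is computed on vectors rather than on surface words, the same approach works across languages when a multilingual embedding model is used.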

81

MCQ · medium

A researcher wants to compare two summarization models. Model A achieves a higher ROUGE-L score than Model B, but human evaluators prefer Model B's summaries. Which of the following is the MOST likely reason?

A. Model A overfits to the training data

B. Model B has a larger context window

C. ROUGE-L measures surface word overlap (longest common subsequence), which may not align with human judgment of quality

D. Model A is an encoder-decoder model while Model B is decoder-only

Answer: C

Human evaluators consider factors like readability and conciseness, which ROUGE-L does not capture fully.

Why this answer

ROUGE-L scores a summary by the longest common subsequence of words it shares with the reference, a surface-level overlap measure that may not capture semantic quality. Human evaluators often prefer summaries that are fluent, coherent, and concise, even if they use different wording. The discrepancy indicates that ROUGE-L alone is insufficient for evaluation.
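A small sketch makes the gap concrete. It computes a simplified ROUGE-L F1 (longest common subsequence over whitespace tokens, without the stemming or weighting options that real ROUGE toolkits offer); the example sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the court dismissed the appeal"
extractive = "the court dismissed the appeal yesterday"  # copies the wording
paraphrase = "judges rejected the challenge"             # same meaning, new words

print(rouge_l_f1(extractive, reference))
print(rouge_l_f1(paraphrase, reference))
```

The paraphrase scores far lower even though a human reader might judge it an equally good or better summary.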

82

MCQ · medium

A company wants to build a sentiment analysis system for customer reviews. They have a labeled dataset of 10,000 reviews. Which approach is most cost-effective and likely to yield good performance?

A. Use GPT-4 with a prompt and no fine-tuning

B. Use a simple bag-of-words model with logistic regression

C. Fine-tune a pre-trained BERT model on the labeled dataset

D. Train a Transformer model from scratch on the reviews

Answer: C

BERT is pre-trained for language understanding; fine-tuning on a small classification dataset is efficient and effective.

Why this answer

Fine-tuning a pre-trained encoder-only model like BERT on the labeled dataset is a standard approach for classification tasks, offering good performance with relatively modest data and compute.

83

MCQ · hard

An LLM is being used to answer customer queries about a product catalog. The answers are fluent but sometimes include plausible-sounding but incorrect product details. What is this phenomenon called, and which technique is most effective to mitigate it?

A. Knowledge cutoff; fine-tune the model on the catalog

B. Hallucination; use Retrieval-Augmented Generation (RAG) with the catalog indexed

C. Bias amplification; increase temperature

D. Overfitting; reduce the model size

Answer: B

Hallucination is the correct term; RAG is the standard mitigation.

Why this answer

Hallucination is the generation of false information; RAG grounds responses in retrieved factual documents, reducing hallucinations.

84

MCQ · medium

A developer is building a code generation assistant and wants to minimize the number of tokens sent to the OCI Generative AI service. Which tokenization approach results in the lowest token count for a given code snippet?

A. WordPiece tokenizer

B. SentencePiece tokenizer with unigram LM

C. BPE tokenizer trained on code corpora

D. Character-level tokenization

Answer: C

BPE learns frequent subword patterns in code, reducing token count.

Why this answer

Option C is correct because BPE (Byte Pair Encoding) tokenizers trained specifically on code corpora learn subword units that align closely with programming language syntax (e.g., common keywords, operators, and identifier fragments), resulting in fewer tokens for a given code snippet than general-purpose tokenizers. Fewer tokens per snippet means less of the context window is consumed and token-based cost is lower; it does not by itself change the number of API calls.

Exam trap

The exam often tests the misconception that any subword tokenizer (like WordPiece or SentencePiece) is equally effective for code. The deciding factor is the training corpus: a vocabulary learned from code captures the repetitive, syntax-heavy patterns of programming languages, while one learned from general text tends to over-segment them.

How to eliminate wrong answers

Option A is wrong because a WordPiece tokenizer built for natural language (e.g., BERT's) splits code according to subword frequencies in general text, leading to higher token counts for code-specific patterns like indentation or operators. Option B is wrong because, as presented, the SentencePiece unigram tokenizer is not trained on code, so it is likely to break identifiers and operators into more pieces than a code-trained vocabulary would. Option D is wrong because character-level tokenization makes every character a separate token, which gives the highest token count of the four options.

85

MCQ · medium

An OCI user observes that their embedding model returns vectors that are not normalized, and they want to compute cosine similarity between two text embeddings. What should they do?

A. Compute the Euclidean distance between the vectors

B. Compute the L1 norm of the difference

C. Normalize the vectors to unit length, then compute the dot product

D. Compute the dot product directly

Answer: C

Cosine similarity is dot product of normalized vectors. Normalizing ensures the result is in [-1,1] and reflects the cosine of the angle.

Why this answer

Cosine similarity measures the cosine of the angle between two vectors, which is equivalent to the dot product of the vectors after they have been normalized to unit length (L2 norm = 1). Option C correctly describes this process: first normalize each embedding vector to unit length, then compute the dot product. This is the standard approach because raw embedding vectors from models like OCI's AI services may not be unit vectors, and the dot product alone does not account for magnitude differences.

Exam trap

The exam often tests the misconception that the dot product alone is equivalent to cosine similarity. That only holds if the vectors are already normalized to unit length, which is not guaranteed by default.

How to eliminate wrong answers

Option A is wrong because Euclidean distance measures the straight-line distance between vectors, which is sensitive to vector magnitude and does not directly compute cosine similarity. Option B is wrong because the L1 norm of the difference (Manhattan distance) is a different metric that does not capture angular similarity. Option D is wrong because computing the dot product directly on non-normalized vectors yields a value that is influenced by both the angle and the magnitudes of the vectors, not purely the cosine of the angle.
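The difference between options C and D shows up on any pair of vectors that point the same way but differ in length. The sketch below uses two hand-picked two-dimensional vectors (assumed values, not real embeddings).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude

raw_dot = dot(a, b)                             # mixes angle and magnitude
cosine = dot(l2_normalize(a), l2_normalize(b))  # angle only

print(raw_dot, cosine)
```

The raw dot product is 50, while the cosine similarity is 1 (up to floating-point rounding), reflecting that the two vectors have identical direction.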

86

MCQ · easy

Which of the following is a decoder-only model architecture?

A. T5

B. GPT-3

C. BART

D. BERT

Answer: B

GPT-3 is decoder-only, using masked self-attention.

Why this answer

GPT-3 is a decoder-only model. BERT is encoder-only. T5 and BART are encoder-decoder.

87

MCQ · medium

A company needs to evaluate a text summarization model. They have reference summaries and want a metric that measures overlap of n-grams. Which metric is MOST appropriate?

A. BLEU

B. Perplexity

C. ROUGE

D. BERTScore

Answer: C

ROUGE measures recall of n-grams and is standard for summarization evaluation.

Why this answer

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the most appropriate metric because it measures the overlap of n-grams between the generated summary and reference summaries, directly aligning with the company's requirement. Unlike BLEU, which focuses on precision, ROUGE emphasizes recall, making it better suited for summarization tasks where capturing all key content from the reference is critical.

Exam trap

The exam often tests the distinction between precision-focused metrics (BLEU) and recall-focused metrics (ROUGE). Candidates who assume BLEU suits summarization because it also measures n-gram overlap miss that summarization evaluation prioritizes recall of reference content over precision of generated output.

How to eliminate wrong answers

Option A is wrong because BLEU (Bilingual Evaluation Understudy) primarily measures precision of n-gram overlap and was designed for machine translation, not summarization; it penalizes shorter outputs and does not emphasize recall of reference content. Option B is wrong because Perplexity measures how well a language model predicts a sequence, not n-gram overlap between summaries, and is used for evaluating language model fluency, not summarization quality. Option D is wrong because BERTScore uses contextual embeddings from BERT to compute semantic similarity via cosine similarity, not direct n-gram overlap, and while it captures meaning, it does not meet the specific requirement for n-gram-based overlap measurement.
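The recall-versus-precision distinction can be seen on a single pair of sentences. This is a simplified sketch: it computes plain unigram recall (the idea behind ROUGE-1) and plain unigram precision (the starting point for BLEU, which additionally combines several n-gram orders and applies a brevity penalty). The sentences are invented for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n=1):
    """Return (recall, precision) of n-gram overlap with clipped counts."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision

reference = "the contract was terminated for breach"
summary = "the contract was terminated"

recall, precision = ngram_overlap(summary, reference)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

Every word of the summary appears in the reference, so precision is perfect, but recall drops because the summary omits "for breach". A recall-oriented metric penalizes that omission.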

88

MCQ · medium

A company's AI system uses RAG to answer customer questions. Users often get incomplete answers because the retrieved chunks do not contain all relevant information. Which step in the RAG pipeline is most likely the issue?

A. Retrieval top-k setting

B. Generation model temperature

C. Chunking strategy (chunk size and overlap)

D. Embedding model selection

Answer: C

If chunks are too small or have insufficient overlap, relevant information may be split, leading to incomplete retrieval.

Why this answer

Chunking determines how documents are split into pieces. If chunks are too small, key information may be split across chunks, causing incomplete retrieval. Adjusting chunk size and overlap can improve completeness.
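A minimal word-based chunker shows how overlap keeps related words together. The function name, the word-level splitting, and the sample sentence are assumptions made for illustration; production pipelines usually chunk by tokens, sentences, or document structure.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into chunks of `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

text = "refunds are issued within 30 days if the item is returned unused"

no_overlap = chunk_words(text, chunk_size=6, overlap=0)
with_overlap = chunk_words(text, chunk_size=6, overlap=3)

for chunk in with_overlap:
    print(chunk)
```

Without overlap, the time limit ("within 30 days") and its condition ("if the item ...") land in different chunks. With an overlap of three words, the middle chunk contains both, so a retriever has a better chance of returning the complete rule.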

89

MCQ · hard

A team is building a code generation assistant and needs to choose between fine-tuning a base LLM or using in-context learning with a few examples. They have 500 high-quality code examples. The assistant must generate code for a wide variety of tasks. Which approach is BETTER and why?

A. Fine-tuning, because it reduces inference cost compared to providing examples each time

B. Fine-tuning, because it permanently encodes the examples into the model weights

C. In-context learning, because it allows the model to adapt to each task dynamically without risking catastrophic forgetting

D. In-context learning, because it requires no additional training infrastructure

Answer: C

In-context learning uses the model's existing knowledge and adapts via examples in the prompt, which is more flexible for diverse tasks with a small dataset.

Why this answer

Fine-tuning with 500 examples may lead to overfitting or catastrophic forgetting, especially when the tasks are diverse. In-context learning with a few examples per task is more flexible and leverages the model's pre-trained knowledge. The small dataset size makes fine-tuning risky.

90

MCQ · medium

A developer is using the Cohere Command model for text generation and wants to ensure the output is deterministic for testing purposes. Which sampling strategy should they use?

A. Top-k sampling with k=50

B. Temperature sampling with temperature=0.7

C. Top-p (nucleus) sampling with p=0.9

D. Greedy decoding

Answer: D

Greedy decoding picks the most likely token at each step, making outputs deterministic.

Why this answer

Greedy decoding always selects the token with highest probability, producing the same output for a given input. Temperature, top-k, and top-p introduce randomness.
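The contrast can be sketched on a single decoding step. The next-token distribution below is made up for the example and does not come from any real model.

```python
import random

# Assumed next-token probabilities for one decoding step.
next_token_probs = {"approved": 0.5, "denied": 0.3, "pending": 0.2}

def greedy_pick(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Repeat the same step 100 times with each strategy.
greedy_runs = {greedy_pick(next_token_probs) for _ in range(100)}

rng = random.Random(0)  # seeded only so the example is reproducible
sampled_runs = {sample_pick(next_token_probs, rng) for _ in range(100)}

print("greedy: ", greedy_runs)
print("sampled:", sampled_runs)
```

Greedy decoding yields one distinct result across all repetitions, while sampling yields several, which is why greedy decoding is the usual choice for repeatable tests.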

91

MCQ · easy

Which of the following best describes the role of the self-attention mechanism in a Transformer model?

A. It encodes the order of tokens in the sequence

B. It computes a weighted sum of all input token representations, where weights depend on pairwise compatibility between tokens

C. It applies a convolutional filter over local windows of tokens

D. It replaces the need for positional encoding by using recurrence

Answer: B

Self-attention calculates attention scores between every pair of tokens and uses them to aggregate information.

Why this answer

The self-attention mechanism computes a weighted sum of all input token representations, where the weights are determined by the pairwise compatibility (attention scores) between tokens. This allows each token to dynamically attend to every other token in the sequence, capturing global dependencies without the limitations of fixed local windows or recurrence.

Exam trap

The exam often tests the misconception that self-attention inherently encodes positional information, when in fact it is permutation-invariant and relies on separate positional encodings to maintain sequence order.

How to eliminate wrong answers

Option A is wrong because encoding the order of tokens is the role of positional encoding, not the self-attention mechanism itself; self-attention is permutation-invariant and requires explicit positional information. Option C is wrong because applying a convolutional filter over local windows describes a CNN approach, not the global, pairwise weighting of self-attention in Transformers. Option D is wrong because self-attention does not replace positional encoding; it operates without recurrence, but positional encoding is still necessary to inject sequence order information into the model.
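The "weighted sum with pairwise weights" description can be written out directly. This is a bare-bones sketch of scaled dot-product self-attention that, as a simplifying assumption, uses the token vectors themselves as queries, keys, and values; a real Transformer first applies learned projection matrices and uses multiple heads.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    d_k = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Pairwise compatibility between this token and every token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in vectors
        ]
        weights = softmax(scores)
        # Weighted sum of all token representations.
        output = [
            sum(w * value[j] for w, value in zip(weights, vectors))
            for j in range(d_k)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
outputs, weights = self_attention(tokens)
print(weights[0])
```

Nothing in the computation depends on where a token sits in the list, which is why order information has to be supplied separately.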

92

MCQ · hard

An OCI user notices that their Llama 3 model generates exactly the same output sequence every time the same prompt is submitted. Which setting is most likely causing this lack of diversity?

A. Top-k sampling with k=50

B. Temperature sampling with temperature=0.8

C. Top-p (nucleus) sampling with p=0.9

D. Greedy decoding (temperature=0)

Answer: D

Greedy decoding always picks the token with highest probability, producing no variation.

Why this answer

Greedy decoding, which is what temperature=0 amounts to, always selects the token with the highest probability at each step. Generation is therefore deterministic: a given prompt produces the exact same output sequence on every run, because there is no randomness in token selection. Different prompts can still yield different outputs; what disappears is run-to-run variation for the same prompt.

Exam trap

The exam often tests the misconception that temperature=0 is just another sampling setting, when in practice it disables sampling and forces greedy decoding, leading to deterministic outputs.

How to eliminate wrong answers

Option A is wrong because top-k sampling with k=50 introduces randomness by sampling from the 50 most likely tokens, which promotes diversity and would not cause identical outputs. Option B is wrong because temperature=0.8 rescales the logits before the softmax but still samples probabilistically, producing varied outputs across repeated runs of the same prompt. Option C is wrong because top-p (nucleus) sampling with p=0.9 samples from the smallest set of tokens whose cumulative probability reaches 0.9, introducing stochasticity and preventing deterministic repetition.
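The effect of temperature on the next-token distribution can be sketched numerically. The logits below are arbitrary example values; the function divides them by the temperature before applying a softmax, which is the common formulation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # assumed scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.1)
default = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 1.5)

for name, probs in [("T=0.1", sharp), ("T=1.0", default), ("T=1.5", flat)]:
    print(name, [round(p, 4) for p in probs])
```

As the temperature approaches 0, almost all probability mass collapses onto the top token, so sampling behaves like greedy decoding; higher temperatures flatten the distribution and increase variation.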

93

MCQ · easy

In the transformer architecture, what is the primary purpose of positional encoding?

A. To normalize the input embeddings

B. To reduce the number of parameters in the model

C. To enable multi-head attention

D. To provide the model with information about the order of tokens

Answer: D

Positional encoding injects sequence order information, allowing the model to use token positions.

Why this answer

Since self-attention processes tokens in parallel without inherent order, positional encoding adds information about the position of each token in the sequence.

94

MCQ · easy

In a Transformer model, what is the role of positional encoding?

A. To reduce the number of parameters in the model

B. To enable the model to process tokens in parallel

C. To encode the semantic meaning of each token

D. To provide information about the position of each token in the sequence

Answer: D

This is the exact purpose of positional encoding.

Why this answer

Positional encoding is essential in Transformer models because the self-attention mechanism processes all tokens in parallel and has no inherent notion of sequence order. By adding positional encodings (often sinusoidal or learned) to the input embeddings, the model can distinguish between tokens at different positions, enabling it to capture word order and relative positions. Without this, the model would treat the sequence as a bag of tokens, losing all sequential context.

Exam trap

The exam often tests the misconception that positional encoding is responsible for enabling parallel processing, when in fact it is the self-attention mechanism's non-sequential computation that allows parallelism, and positional encoding merely injects order information into that parallel framework.

How to eliminate wrong answers

Option A is wrong because positional encoding does not reduce the number of parameters; it adds a fixed or learned vector to each token embedding, which may slightly increase parameters if learned, but its primary purpose is not parameter reduction. Option B is wrong because parallel processing is enabled by the self-attention mechanism itself, not by positional encoding; positional encoding actually compensates for the lack of recurrence that would otherwise provide order information in sequential models. Option C is wrong because semantic meaning is encoded by the token embeddings (e.g., learned word vectors), while positional encoding only provides information about the token's position in the sequence.
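The sinusoidal variant mentioned above can be sketched in a few lines. The four-dimensional embedding is a toy value chosen for the example; the encoding follows the standard sine/cosine scheme with base 10000.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding vector for one position."""
    encoding = []
    for i in range(0, d_model, 2):  # i is the even dimension index
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

def add_position(embedding, position):
    """Add the positional encoding to a token embedding."""
    pe = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

token_embedding = [0.5, -0.2, 0.1, 0.7]  # same token, assumed toy values

at_position_0 = add_position(token_embedding, 0)
at_position_3 = add_position(token_embedding, 3)

print(at_position_0)
print(at_position_3)
```

The same token ends up with different input vectors at positions 0 and 3, which gives the otherwise order-blind attention layers a way to tell the two occurrences apart.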

95

Multi-Select · hard

An organization is deploying an LLM for document question answering. They want to reduce hallucinations and ensure answers are grounded in provided documents. Which THREE techniques should they implement? (Choose three.)

Select 3 answers

A. Use a longer context window to include more document text

B. Fine-tune the model on a corpus of in-domain documents

C. Set a low temperature (e.g., 0.1) for sampling

D. Set a high temperature (e.g., 1.5) for sampling

E. Use Retrieval-Augmented Generation (RAG)

Answers: B, C, E

Fine-tuning on relevant documents improves the model's knowledge and can reduce hallucination.

Why this answer

RAG retrieves relevant document chunks and conditions the generation on them, reducing hallucination. Fine-tuning on the document domain can improve grounding. Using a lower temperature (closer to 0) makes the model more deterministic and less likely to fabricate.

Higher temperature increases hallucination risk, and longer context window alone does not guarantee grounding.

96

MCQ · medium

A team is implementing a RAG pipeline in OCI. They have a large collection of PDF documents. After chunking and embedding the documents, retrieval quality is poor. Which step is MOST likely the root cause?

A. The retrieval step uses greedy decoding

B. The chunk size is too large, causing each chunk to contain multiple topics

C. The embedding model is a generation model, not an embedding model

D. Cosine similarity is not appropriate for comparing embeddings

Answer: B

Large chunks dilute the semantic focus, making it hard for the retriever to find passages relevant to a specific query.

Why this answer

Chunking strategy (size and overlap) directly affects how well the retrieval step can find relevant passages. Too large or poorly split chunks can dilute semantic meaning.

97

MCQ · medium

A company wants to translate legal documents from English to Spanish. They have a small parallel corpus of 500 sentence pairs. Which approach is MOST likely to yield the best translation quality?

A. Fine-tune an encoder-decoder model like T5 on the parallel corpus

B. Use a zero-shot prompt with a decoder-only model like GPT

C. Train a new model from scratch on the 500 sentence pairs

D. Use a rule-based machine translation system

Answer: A

Encoder-decoder architecture is well-suited for translation; fine-tuning on domain-specific data improves accuracy.

Why this answer

Fine-tuning an encoder-decoder model like T5 on the 500 sentence pairs is the best approach because it leverages the model's pre-trained knowledge of language structure and translation patterns, then adapts it to the specific legal domain with a small but relevant parallel corpus. This transfer learning method requires far less data than training from scratch and typically outperforms zero-shot prompting for specialized, low-resource translation tasks.

Exam trap

The exam often tests the misconception that zero-shot prompting with large language models can match fine-tuned models for specialized tasks. For domain-specific translation with limited data, the expected answer is that transfer learning via fine-tuning is more reliable than relying on a model's general-purpose capabilities.

How to eliminate wrong answers

Option B is wrong because zero-shot prompting with a decoder-only model like GPT makes no use of the 500 sentence pairs, so the model gets no exposure to the company's legal terminology and is likely to translate specialized terms inconsistently. Option C is wrong because training a new model from scratch on only 500 sentence pairs is insufficient to learn the complex syntax, vocabulary, and translation mappings needed for high-quality output, leading to severe overfitting and poor generalization. Option D is wrong because rule-based machine translation systems require extensive manual creation of linguistic rules and dictionaries, and they cannot adapt to the nuances of legal language or learn from the provided parallel corpus, resulting in rigid and often incorrect translations.

98

Multi-Select · easy

Which TWO of the following are advantages of using Byte-Pair Encoding (BPE) tokenization compared to word-level tokenization?

Select 2 answers

A. Guaranteed lossless encoding of all Unicode characters

B. Smaller vocabulary size

C. Fixed token length for every input

D. Faster inference due to reduced sequence length

E. Ability to handle out-of-vocabulary words by decomposing them into known subword tokens

Answers: B, E

BPE learns a limited set of subword units, which reduces the vocabulary size compared to storing every possible word.

Why this answer

BPE reduces vocabulary size by representing words as subword units, and it can handle out-of-vocabulary words by breaking them into known subwords. Fixed-length tokens and losslessness are not advantages of BPE.
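The out-of-vocabulary behaviour can be sketched by applying a list of BPE merge rules to a word. The merge list here is hand-written and hypothetical; a real tokenizer learns thousands of merges from corpus statistics.

```python
def apply_bpe(word, merges):
    """Segment a word by applying BPE merge rules in priority order."""
    symbols = list(word)  # start from individual characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merges, as if learned from a corpus containing "token" and "-ize".
merges = [
    ("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),
    ("i", "z"), ("iz", "e"),
]

print(apply_bpe("tokenize", merges))
print(apply_bpe("tokens", merges))
```

Neither "tokenize" nor "tokens" needs its own vocabulary entry: each is covered by the shared subword "token" plus a short remainder, so the vocabulary stays small and unseen words still get a valid encoding.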

99

MCQ · easy

Which component of the Transformer architecture allows the model to weigh the importance of different words in a sequence when processing a given word?

A. Self-attention mechanism

B. Feed-forward neural network

C. Positional encoding

D. Layer normalization

Answer: A

Self-attention computes attention scores between all pairs of positions, allowing the model to focus on relevant words.

Why this answer

The self-attention mechanism is the core component of the Transformer architecture that computes attention scores between every pair of words in the input sequence. These scores determine how much each word should influence the representation of the current word, allowing the model to dynamically weigh the importance of different words regardless of their positional distance. This mechanism is what enables the Transformer to capture long-range dependencies and contextual relationships in parallel.

Exam trap

The exam often tests the distinction between components that process information (feed-forward networks) and components that enable contextual weighting (self-attention), leading candidates to mistakenly choose the feed-forward network because it is a more familiar neural network layer.

How to eliminate wrong answers

Option B is wrong because the feed-forward neural network processes each position independently after attention has already aggregated contextual information; it does not perform any cross-word weighting. Option C is wrong because positional encoding only adds information about the order of words in the sequence; it does not weigh the importance of words relative to each other. Option D is wrong because layer normalization stabilizes training by normalizing activations across features for each sample; it has no role in determining word importance.

100

MCQ · hard

A research team is comparing two LLMs for a translation task. Model A uses greedy decoding, Model B uses beam search with width=5. Both models are otherwise identical. Which statement about their outputs is MOST likely true?

A. Model A will have higher BLEU scores than Model B

B. Model B will generally produce more fluent and accurate translations

C. Model A will produce more diverse translations

D. Model B will have lower latency than Model A

Answer: B

Beam search explores multiple paths and picks the best sequence, often improving fluency and accuracy over greedy decoding.

Why this answer

Beam search considers multiple candidate sequences and selects the one with the highest overall probability, which often results in more fluent and accurate translations than greedy decoding, but at higher computational cost.
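A toy example shows how greedy decoding can miss the most probable sequence. The two-step probability table below is invented for illustration, and a beam width of 2 is used instead of 5 to keep the example small; greedy decoding is treated as beam search with width 1.

```python
import math

# Assumed next-token probabilities, keyed by the tokens generated so far.
NEXT_TOKEN_PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(width, steps=2):
    """Return the best (sequence, log-probability) found with the given beam width."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for sequence, log_prob in beams:
            for token, prob in NEXT_TOKEN_PROBS[sequence].items():
                candidates.append((sequence + (token,), log_prob + math.log(prob)))
        # Keep only the `width` highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]
    return beams[0]

greedy_sequence, greedy_log_prob = beam_search(width=1)
beam_sequence, beam_log_prob = beam_search(width=2)

print(greedy_sequence, round(math.exp(greedy_log_prob), 2))
print(beam_sequence, round(math.exp(beam_log_prob), 2))
```

Greedy decoding commits to "A" because it is the best first token and ends with a sequence of probability 0.30. Beam search keeps "B" alive for one more step and finds a sequence of probability 0.36, at the cost of scoring more candidates per step.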

101

MCQ · medium

Which of the following best describes the difference between pre-training and fine-tuning?

A. Pre-training uses labeled data; fine-tuning uses unlabeled data

B. Pre-training learns general language representations; fine-tuning adapts to a specific task

C. Fine-tuning requires more data than pre-training

D. Pre-training is done on a single task; fine-tuning is done on multiple tasks

Answer: B

This accurately describes the two stages.

Why this answer

Pre-training is the initial phase where a model learns general language patterns from a large corpus. Fine-tuning adapts the pre-trained model to a specific task using a smaller labeled dataset.

102

MCQ · medium

A practitioner wants to generate embeddings for a set of legal documents to enable semantic search. Which type of model should they use?

A. An embedding model like Cohere Embed

B. A large language model fine-tuned for classification

C. A vision transformer model

D. A generative LLM like Cohere Command

Answer: A

Embedding models output dense vectors that capture semantic meaning, suitable for similarity search.

Why this answer

Embedding models (e.g., Cohere Embed, OpenAI text-embedding-ada-002) are specialized to produce dense vector representations of text. Generation models (like GPT or Cohere Command) are built to produce text, not embeddings for similarity search.

103

Multi-Selecthard

A team is evaluating two LLMs for a summarization task. Model X has a BERTScore of 0.85, Model Y has a BERTScore of 0.82. However, human evaluators prefer Model Y. Which TWO reasons could explain this discrepancy?

Select 2 answers

A.BERTScore is based on BERT embeddings, which may not fully capture summary-specific qualities like conciseness or readability

B.BERTScore uses precision only, so it misses recall aspects

C.Human evaluators were not given clear criteria for evaluation

D.Model Y was fine-tuned on a different dataset, causing distribution shift

E.Model X overfits to the reference summaries, achieving high BERTScore but poor general quality

AnswersA, E

BERTScore measures semantic similarity but may not reflect human preferences for style.

Why this answer

BERTScore correlates with human judgment but is not perfect; it may favor certain styles. Additionally, BERTScore may be inflated if the reference summaries are similar to the model's training data.

104

MCQmedium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A.Train a custom model from scratch on the policy documents each month

B.Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store

C.Use a larger foundation model with a longer context window and paste all documents into each prompt

D.Fine-tune a base LLM on the policy documents monthly

AnswerB

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

RAG (Retrieval-Augmented Generation) allows the LLM to retrieve relevant document sections at inference time, so knowledge stays current without retraining. The other options either require expensive retraining for each update or lack document grounding.
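
A minimal sketch of the retrieval step follows. It assumes a bag-of-words `embed` function as a stand-in for a real embedding model and an in-memory `VectorStore` class in place of a real vector database; both names, the sample policy text, and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """Hypothetical in-memory index of (vector, chunk) pairs."""
    def __init__(self):
        self.items = []

    def add(self, chunk):
        self.items.append((embed(chunk), chunk))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

def build_prompt(question, store):
    # The retrieved chunks are placed in the prompt; the model itself is unchanged.
    context = "\n".join(store.search(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

store = VectorStore()
store.add("Remote work is allowed three days per week.")
store.add("Expense reports are due within 30 days.")
print(build_prompt("How many days of remote work are allowed?", store))
```

When a policy changes, only the store is updated (new chunks added, old ones replaced); nothing about the model needs retraining.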

105

MCQhard

A developer is using OCI Generative AI for a question-answering system. The model frequently provides outdated information because the training data cutoff is over a year old. Which approach would most effectively address this issue?

A.Implement a Retrieval-Augmented Generation (RAG) pipeline that retrieves up-to-date documents from an external knowledge base

B.Increase the context window to include more of the user's prompt

C.Fine-tune the model on a dataset that includes recent information up to today

D.Switch to a larger model that has a more recent knowledge cutoff

AnswerA

RAG allows the model to access current information dynamically, solving the cutoff problem.

Why this answer

Retrieval-Augmented Generation (RAG) directly addresses the problem of stale training data by dynamically retrieving current documents from an external knowledge base at inference time. This allows the model to generate answers grounded in up-to-date information without requiring retraining or a larger model, making it the most effective and practical solution for a question-answering system.

Exam trap

The exam often tests the misconception that simply increasing model size or context length can solve knowledge staleness, when in fact only retrieval-based methods like RAG provide a scalable, real-time solution to keep answers current without retraining.

How to eliminate wrong answers

Option B is wrong because increasing the context window only allows the model to process more of the user's prompt, but it does not inject new or recent information into the model's responses — the model's parametric knowledge remains frozen at its training cutoff. Option C is wrong because fine-tuning on recent data up to today would require a new, curated dataset and significant compute resources, and the model would still be limited to the knowledge in that dataset; moreover, fine-tuning is not a real-time solution and cannot adapt to information that changes after the fine-tuning process. Option D is wrong because switching to a larger model with a more recent knowledge cutoff only shifts the staleness problem forward in time — the model will still eventually become outdated, and it does not provide a mechanism to access live or continuously updated information.

106

MCQeasy

Which of the following is a primary limitation of large language models that can lead to generating factually incorrect information?

A.Bias in training data

B.Hallucinations

C.Context window limitation

D.Knowledge cutoff

AnswerB

Hallucinations occur when the model generates content that is not factually accurate or grounded in the training data.

Why this answer

Hallucinations are a primary limitation of large language models because they cause the model to generate text that is factually incorrect, nonsensical, or not grounded in the training data. This occurs due to the probabilistic nature of token prediction, where the model prioritizes fluency and coherence over factual accuracy, especially when the prompt lacks sufficient context or the model is asked to recall specific facts not well-represented in its training.

Exam trap

The exam often tests the distinction between hallucinations and other limitations like bias or context windows. The trap here is confusing 'bias in training data' with factual inaccuracy: bias is about systematic prejudice, not the confident fabrication of false facts.

How to eliminate wrong answers

Option A is wrong because bias in training data leads to skewed or prejudiced outputs, not necessarily factually incorrect information; it affects fairness and representation rather than factual accuracy. Option C is wrong because context window limitation restricts the amount of input the model can process at once, which can cause loss of context but does not directly cause the generation of factually incorrect information—it may lead to incomplete or irrelevant responses. Option D is wrong because knowledge cutoff refers to the date after which the model has no training data, meaning it cannot answer about events after that date, but it does not cause the model to fabricate facts; it simply limits the temporal scope of knowledge.

107

MCQmedium

A data scientist is evaluating two LLMs for a summarization task. Model X scores 45 on ROUGE-L, while Model Y scores 42. However, in human evaluation, Model Y is preferred 60% of the time. What is the most likely explanation?

A.Human evaluators are biased and cannot be trusted for objective assessment

B.Model Y overfits to the training data, causing poor generalisation

C.ROUGE-L measures lexical overlap, which may not capture the semantic quality that humans value

D.ROUGE-L is not a reliable metric for summarization because it only measures recall

AnswerC

ROUGE-L rewards word-for-word overlap with the reference; Model Y might produce more concise or coherent summaries that humans prefer but that share fewer exact word sequences with the reference.

Why this answer

ROUGE-L measures the longest common subsequence (LCS) between generated and reference summaries, focusing on lexical (word-level) overlap. It does not assess semantic meaning, fluency, or factual correctness. Human evaluators often prefer summaries that are coherent and capture key ideas, even if they use different wording, which explains why Model Y can score lower on ROUGE-L but be preferred 60% of the time.

Exam trap

The exam often tests the distinction between lexical metrics (like ROUGE) and semantic quality, trapping candidates who assume higher automated scores always indicate better performance without considering human preferences.

How to eliminate wrong answers

Option A is wrong because human evaluators are not inherently biased in this context; their preference reflects subjective quality (e.g., coherence, relevance) that automated metrics may miss. Option B is wrong because overfitting would typically cause poor performance on unseen data, but here Model Y performs worse on ROUGE-L yet is preferred by humans, suggesting it generalizes better in terms of human-perceived quality. Option D is wrong because ROUGE-L measures both precision and recall via the F1-score of the LCS, not just recall; the issue is its reliance on lexical overlap, not a limitation to recall.
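
To make the lexical nature of the metric concrete, here is a minimal sketch of ROUGE-L as the F1 score over the longest common subsequence. It assumes simple whitespace tokenization and equal weighting of precision and recall; production implementations may tokenize, stem, and weight differently. The sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """F1 of LCS-based precision and recall (whitespace tokens, lowercased)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
near_copy = "the cat sat on a mat"
paraphrase = "a feline rested on the rug"

print(rouge_l(near_copy, reference))
print(rouge_l(paraphrase, reference))
```

The paraphrase conveys much the same meaning as the reference but scores far lower than the near copy, which is the gap between lexical overlap and human judgment described above.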

108

MCQhard

In the self-attention mechanism, what is the role of the 'scaling factor' (division by sqrt(d_k)) in the softmax computation?

A.To make the attention mechanism translation invariant

B.To prevent the softmax from saturating and producing small gradients

C.To increase the variance of attention scores

D.To ensure the sum of attention weights equals 1

AnswerB

Scaling avoids large values that cause softmax saturation.

Why this answer

Scaling prevents the dot products from growing too large in magnitude, which would push softmax into regions with extremely small gradients.
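
The effect can be seen numerically. The sketch below uses made-up raw dot products for one query against three keys and assumes d_k = 64; only the comparison between the two softmax outputs matters.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
# Hypothetical raw dot products q·k for one query against three keys.
raw_scores = [16.0, 8.0, 0.0]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # divide by sqrt(d_k) = 8

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print(unscaled_weights)  # almost all weight on one key: softmax is saturated
print(scaled_weights)    # weight is spread out, so gradients stay usable
```

Without scaling, nearly all the weight lands on a single key and the softmax sits in a flat region where gradients are tiny; after dividing by sqrt(d_k) the distribution is much softer.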

109

MCQeasy

Which of the following best describes the role of positional encoding in the Transformer architecture?

A.To compress the input sequence length

B.To increase the dimensionality of the hidden states

C.To reduce the effect of vanishing gradients

D.To provide information about the order of tokens in the input sequence

AnswerD

Positional encodings inject positional information so the model can use word order.

Why this answer

Positional encoding is essential in the Transformer architecture because the self-attention mechanism processes all tokens in parallel and has no inherent sense of order. By adding sinusoidal or learned positional vectors to the input embeddings, the model gains information about the relative or absolute position of each token in the sequence, enabling it to understand word order and sequence structure.

Exam trap

The exam often tests the misconception that positional encoding is used to increase model capacity or dimensionality, when in fact it is purely a mechanism to inject sequence order information into a permutation-invariant attention mechanism.

How to eliminate wrong answers

Option A is wrong because positional encoding does not compress the input sequence length; sequence length compression is handled by pooling or stride mechanisms, not by positional encoding. Option B is wrong because positional encoding adds information to the existing embedding dimension but does not increase the dimensionality of the hidden states; it is added element-wise to the input embeddings of the same dimension. Option C is wrong because positional encoding does not address vanishing gradients; vanishing gradients are mitigated by residual connections and layer normalization in the Transformer, not by positional encoding.
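
As an illustration, the sinusoidal variant can be sketched in a few lines. This follows the standard sine/cosine formulation with base 10000; the sequence length, model dimension, and zero-valued embeddings are arbitrary values chosen for the example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(embeddings):
    """Element-wise addition: the hidden dimension stays the same."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(row, pe_row)]
            for row, pe_row in zip(embeddings, pe)]

# Hypothetical embeddings: 4 identical tokens, d_model = 8.
embeddings = [[0.0] * 8 for _ in range(4)]
encoded = add_positions(embeddings)
```

Even though the four input embeddings are identical, the encoded rows differ by position, and the dimensionality is unchanged because the encoding is added rather than concatenated.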

110

MCQeasy

What is the primary purpose of the self-attention mechanism in a Transformer model?

A.To generate token embeddings in parallel

B.To reduce the dimensionality of token embeddings

C.To encode positional information of tokens

D.To compute a weighted sum of all token representations based on pairwise relevance

AnswerD

Self-attention computes attention scores between all pairs and aggregates information.

Why this answer

Self-attention allows each token to attend to every other token in the sequence, capturing contextual relationships regardless of distance.
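
The weighted sum can be sketched as follows. To stay short, this sketch uses the input vectors directly as queries, keys, and values; a real Transformer layer first applies learned projection matrices, which are omitted here. The three 2-dimensional token vectors are invented.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (a simplification)."""
    d_k = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Pairwise relevance of this token to every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all token vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d_k)])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one token attends to every other token, independent of how far apart they are in the sequence.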

111

MCQhard

A developer is implementing a text generation pipeline and wants to produce diverse, creative outputs. They set temperature=1.2, top_k=50, and top_p=1.0. What is the MOST likely effect of this combination?

A.The output will be identical to greedy decoding because top_p=1.0 disables sampling

B.The output will be mostly factual because top_k filters out unlikely tokens

C.The output will be diverse and creative, but may occasionally be incoherent or off-topic

D.The output will be highly deterministic and repetitive

AnswerC

High temperature increases randomness, and the relaxed cutoffs allow less likely tokens, yielding creative but sometimes nonsensical outputs.

Why this answer

Temperature >1 flattens the probability distribution, making low-probability tokens more likely. top_k=50 restricts to top 50 tokens, but top_p=1.0 imposes no cumulative probability cutoff. The combination yields diverse but potentially incoherent outputs.
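
How the three parameters interact can be sketched as below. The logits are made up, and the order of operations (temperature, then top-k, then top-p, then renormalization) is one common arrangement assumed for this example; individual libraries and services may apply the filters differently.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then cut at cumulative mass top_p."""
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {idx: p / total for idx, p in kept}  # renormalized sampling pool

logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical scores for four candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.2)

# top_p=1.0 applies no cumulative cutoff, so only top_k limits the pool.
pool = filter_top_k_top_p(flat, top_k=3, top_p=1.0)
token = random.Random(0).choices(list(pool), weights=list(pool.values()))[0]
```

Comparing `sharp` and `flat` shows the higher temperature moving probability from the top token to the tail, which is where the extra diversity (and the occasional incoherent token) comes from.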

112

Multi-Selecteasy

Which TWO of the following sampling strategies introduce randomness into text generation?

Select 2 answers

A.Beam search

B.Greedy decoding

C.Temperature sampling

D.Top-k sampling

E.Top-p (nucleus) sampling

AnswersC, E

Temperature scales the logits before softmax, affecting the randomness of the distribution.

Why this answer

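Temperature sampling and top-p (nucleus) sampling both draw the next token at random from a probability distribution: temperature reshapes the distribution, and top-p restricts sampling to the smallest set of tokens whose cumulative probability reaches p. Greedy decoding and beam search are deterministic. Top-k sampling (option D) also samples at random from its truncated pool, so it is stochastic as well; the answer key favors top-p because its candidate pool adapts to the shape of the distribution, whereas top-k uses a fixed pool size.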

113

MCQhard

An ML engineer is selecting a pre-trained model for a code generation task. The model must be able to generate syntactically correct code in multiple programming languages. Which model family is BEST suited for this task?

A.Meta Llama (Code Llama variant)

B.BERT

C.Cohere Command

D.Mistral

AnswerA

Code Llama is a variant of Llama fine-tuned on code, making it well-suited for code generation across languages.

Why this answer

Models like Code Llama (a variant of Llama) are specifically fine-tuned on code and are known for strong code generation capabilities. While other models can generate code, Code Llama is the best fit among the options.

114

MCQhard

A researcher is evaluating two LLMs for a summarization task. Model A achieves a ROUGE-L score of 0.45 and a BERTScore of 0.92. Model B achieves a ROUGE-L score of 0.50 and a BERTScore of 0.88. Which model is likely better for producing summaries that are semantically faithful to the source, even if not using the exact same words?

A.Neither model is acceptable because ROUGE-L is below 0.6

B.Both are equally good because the scores are close

C.Model B because ROUGE-L is higher

D.Model A because BERTScore is higher

AnswerD

Higher BERTScore suggests better semantic alignment with the source, which is more important for faithfulness.

Why this answer

BERTScore measures semantic similarity using contextual embeddings, while ROUGE-L measures lexical overlap via the longest common subsequence. A higher BERTScore indicates better semantic faithfulness even without exact phrase matches.

115

MCQmedium

A data scientist wants to compare the semantic similarity between two sentences generated by an LLM. Which evaluation metric is most suitable for this purpose?

A.ROUGE-L

B.BLEU

C.BERTScore

D.Perplexity

AnswerC

BERTScore uses contextual embeddings to evaluate semantic similarity.

Why this answer

BERTScore computes cosine similarity between contextual embeddings, capturing semantic meaning better than surface-level n-gram metrics.

116

MCQmedium

What is the main advantage of using Byte-Pair Encoding (BPE) over word-level tokenization?

A.It can represent any word as a sequence of subword tokens, including rare or unseen words

B.It produces fixed-length token sequences

C.It reduces the number of tokens by merging all letters into single tokens

D.It eliminates the need for a vocabulary altogether

AnswerA

BPE's subword approach ensures open vocabulary.

Why this answer

Byte-Pair Encoding (BPE) is a subword tokenization algorithm that iteratively merges the most frequent pairs of bytes or characters into new tokens. Its main advantage over word-level tokenization is that it can represent any word, including rare or unseen words, as a sequence of subword tokens, thereby largely eliminating the out-of-vocabulary (OOV) problem. This allows models like GPT and RoBERTa to handle arbitrary input without requiring a fixed word vocabulary. (BERT uses WordPiece, a closely related subword algorithm.)

Exam trap

The exam often tests the misconception that BPE eliminates the need for a vocabulary or that it produces fixed-length outputs, but the core advantage is its ability to handle rare and unseen words through subword decomposition.

How to eliminate wrong answers

Option B is wrong because BPE does not produce fixed-length token sequences; the number of tokens depends on the input text, and BPE can generate variable-length sequences. Option C is wrong because BPE does not merge all letters into single tokens; it merges the most frequent character pairs iteratively, but the final vocabulary contains many multi-character subwords, not just single letters. Option D is wrong because BPE still requires a predefined vocabulary of subword tokens (typically 32k–100k tokens), and the model cannot operate without any vocabulary.
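
The merge procedure can be sketched on a tiny corpus. The word frequencies below are invented, the sketch works on characters rather than bytes, and it omits details such as end-of-word markers that real tokenizers add.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    words = {tuple(w): f for w, f in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a (possibly unseen) word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # hypothetical counts
merges = learn_bpe(corpus, num_merges=3)
print(tokenize("lowest", merges))
```

The word "lowest" never appears in the corpus, yet it is still tokenized into known pieces; in the worst case a word falls back to single characters rather than to an unknown-word token.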

117

Multi-Selectmedium

Which TWO of the following are characteristics of decoder-only models like GPT? (Select TWO)

Select 2 answers

A.They process input through an encoder and a decoder

B.They use bidirectional self-attention

C.They use masked self-attention to prevent attending to future tokens

D.They are ideal for tasks requiring full bidirectional context like NER

E.They are typically used for generative tasks like text completion

AnswersC, E

Masked self-attention ensures each token only attends to previous tokens.

Why this answer

Decoder-only models use masked self-attention (causal) and generate tokens left-to-right. They cannot use bidirectional context because future tokens are masked.
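
A minimal sketch of the causal mask is shown below. The attention scores are arbitrary numbers; the point is only that masked positions receive a weight of exactly zero after the softmax.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    # Future positions are set to -inf so their softmax weight becomes 0.
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
scores = [1.0, 2.0, 3.0]  # hypothetical raw attention scores for one query
for i in range(3):
    print(masked_softmax(scores, mask[i]))
```

The first token can only attend to itself, the second to the first two, and so on, which is what allows left-to-right generation.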

118

MCQhard

A team is building a system to detect duplicate customer support tickets. They have a dataset of 10,000 resolved tickets and want to find pairs with similar intent. Which approach would be MOST efficient and effective?

A.Embed each ticket using an embedding model and compute cosine similarity between all pairs

B.Use a long-context LLM to process all tickets in a single prompt

C.Use a generation model like Cohere Command to compare tickets one pair at a time

D.Fine-tune a generation model on a classification task to predict duplicates

AnswerA

Embeddings reduce tickets to dense vectors; cosine similarity allows efficient comparison, and approximate nearest neighbor algorithms can handle large sets.

Why this answer

Option A is correct because embedding each ticket into a dense vector space and computing cosine similarity between all pairs is both efficient and effective for detecting duplicate intents. This approach leverages pre-trained embedding models (e.g., sentence-transformers) that capture semantic similarity, and pairwise cosine similarity scales well for 10,000 tickets (approximately 50 million comparisons) using optimized matrix operations. It avoids the quadratic cost of LLM inference per pair while preserving high accuracy for intent matching.

Exam trap

The exam often tests the misconception that using a powerful LLM for every pairwise comparison is the most accurate approach, ignoring the massive computational cost and the fact that embedding similarity is both faster and equally effective for semantic duplicate detection.

How to eliminate wrong answers

Option B is wrong because a long-context LLM cannot process all 10,000 tickets in a single prompt due to context window limits (typically 4K-128K tokens), and even if it could, it would not perform pairwise duplicate detection—it would generate a summary or classification, not identify specific duplicate pairs. Option C is wrong because using a generation model like Cohere Command to compare tickets one pair at a time would require O(n²) LLM calls (50 million for 10,000 tickets), which is computationally prohibitive, slow, and cost-ineffective. Option D is wrong because fine-tuning a generation model on a classification task to predict duplicates is overkill and inefficient; it requires labeled data, significant training resources, and still does not directly solve the pairwise comparison problem—embedding-based similarity is simpler, faster, and equally effective for this unsupervised or semi-supervised task.
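
The embedding approach can be sketched as follows. The three ticket vectors are made-up 3-dimensional stand-ins for the output of a real embedding model, and the 0.9 threshold is an arbitrary assumption that would need tuning on real data.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_duplicates(embeddings, threshold=0.9):
    """Return ticket-id pairs whose cosine similarity meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(embeddings.items(), 2)
        if cosine(a, b) >= threshold
    ]

# Hypothetical embeddings; a real system would get these from an embedding model.
tickets = {
    "T1": [0.90, 0.10, 0.00],  # "cannot log in"
    "T2": [0.88, 0.12, 0.01],  # "login fails"
    "T3": [0.00, 0.20, 0.95],  # "invoice is wrong"
}
print(find_duplicates(tickets))

# Number of unordered pairs among 10,000 tickets.
print(math.comb(10_000, 2))
```

Each ticket is embedded once, and every pair then costs only a dot product; at larger scale the all-pairs loop would typically be replaced by matrix operations or an approximate nearest neighbor index.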

119

MCQeasy

Which tokenization algorithm is commonly used in models like GPT and RoBERTa and builds tokens by merging the most frequent pairs of characters or subwords iteratively?

A.WordPiece

B.SentencePiece

C.Unigram tokenization

D.Byte-Pair Encoding (BPE)

AnswerD

BPE is the algorithm that iteratively merges the most frequent byte pairs to build a subword vocabulary.

Why this answer

Byte-Pair Encoding (BPE) is a subword tokenization method that starts with individual characters and merges the most frequent pairs iteratively until a vocabulary size is reached.

120

MCQhard

An ML engineer notices that when using temperature sampling with temperature=0.8 for code generation, the model sometimes produces syntactically incorrect code. The engineer needs to ensure syntactically valid outputs while maintaining some creativity. Which combination of sampling parameters is MOST appropriate?

A.Reduce temperature to 0.2 and use top-p=0.9

B.Increase temperature to 1.2 and use top-k=50

C.Use beam search with width=5

D.Set temperature=1.0 and use greedy decoding

AnswerA

Low temperature sharpens the distribution; top-p limits the token pool to the top 90% probability mass, reducing chances of sampling improbable tokens that break syntax.

Why this answer

Reducing temperature to 0.2 makes the output distribution more peaked, favoring high-probability tokens, which reduces syntax errors. Combining this with top-p=0.9 (nucleus sampling) limits the sampling pool to the smallest set of tokens whose cumulative probability reaches 0.9, further filtering out low-probability tokens that often cause invalid syntax. This balance preserves some creativity while ensuring syntactically valid code.

Exam trap

The exam often tests the misconception that increasing temperature or using beam search improves output quality, when in fact reducing temperature and using top-p sampling is the standard approach for balancing correctness and creativity in code generation tasks.

How to eliminate wrong answers

Option B is wrong because increasing temperature to 1.2 flattens the probability distribution, making low-probability tokens more likely, which would increase syntax errors, not reduce them. Option C is wrong because beam search with width=5 is a deterministic decoding method that maximizes sequence probability, which can produce repetitive or overly conservative outputs and does not inherently guarantee syntactic validity; it also lacks the stochastic creativity needed. Option D is wrong because setting temperature=1.0 with greedy decoding (temperature=1.0 is effectively no scaling, and greedy decoding always picks the highest-probability token) eliminates creativity entirely and can still produce syntactically incorrect code if the highest-probability token leads to an invalid sequence.

121

Multi-Selectmedium

A machine learning engineer is designing a RAG pipeline in OCI to improve the accuracy of an LLM-based FAQ bot. Which TWO components are essential for the retrieval phase? (Select TWO.)

Select 2 answers

A.Document chunking

B.Tokenization before the generation step

C.Text generation model

D.A reranker model

E.Embedding model to convert chunks into vectors

AnswersA, E

Documents must be split into chunks for effective retrieval.

Why this answer

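Document chunking is essential because it breaks large documents into smaller, manageable pieces that can be individually indexed and retrieved. Without chunking, the retrieval phase would either miss relevant context or return overly large documents that exceed the LLM's context window, reducing accuracy. The embedding model is equally essential: it converts each chunk, and the user's query at search time, into a vector so that relevant chunks can be found by vector similarity.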

Exam trap

The exam often tests the distinction between retrieval-phase components (chunking and embeddings) and generation-phase components (tokenization and the LLM itself), leading candidates to mistakenly include reranking as essential when it is only an optional refinement.
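
A simple word-based chunker with overlap is sketched below. The function name, the chunk size, and the overlap are illustrative assumptions; real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word windows; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]

# Hypothetical 12-word document, small sizes so the overlap is easy to see.
document = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(document, chunk_size=5, overlap=2)
for c in chunks:
    print(c)
```

Each chunk would then be passed to the embedding model and stored in the vector index; the overlap reduces the chance that a sentence split across a boundary is lost to retrieval.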

122

MCQmedium

A team is evaluating two embedding models for a similarity search task. Model A has a higher BERTScore on a reference dataset. Model B has a lower perplexity on the same dataset. Which model is likely better for retrieval?

A.Both are equally good for retrieval

B.Model A, because BERTScore directly measures semantic similarity of embeddings

C.Model B, because lower perplexity indicates better language modeling, which improves retrieval

D.Neither metric is relevant for retrieval tasks

AnswerB

BERTScore is a semantic similarity metric that evaluates the quality of embeddings for capturing meaning, which is crucial for retrieval.

Why this answer

For retrieval tasks, embedding quality is best measured by semantic similarity metrics like BERTScore, which correlate with how well embeddings capture meaning. Perplexity measures language model fluency, not embedding quality.

123

MCQeasy

Which component of the Transformer architecture allows each token to consider the relevance of every other token in the input sequence?

A.Multi-head attention

B.Self-attention

C.Feed-forward network

D.Positional encoding

AnswerB

Self-attention directly computes relevance weights between every pair of tokens in the input.

Why this answer

Self-attention computes attention scores between all pairs of tokens, enabling the model to capture dependencies across the entire sequence.

124

MCQhard

An AI engineer is deploying a RAG pipeline using OCI Generative AI. They notice the generated answers sometimes include information not present in the retrieved documents. What is the MOST likely cause?

A.The chunking strategy splits sentences across chunks

B.The embedding model is too small to represent document semantics

C.The context window of the generation model is exceeded

D.The generation model is hallucinating because it does not rely solely on the retrieved context

AnswerD

Even with RAG, generation models can ignore or misuse the context, leading to hallucinations.

Why this answer

Option D is correct because the generation model in a RAG pipeline is designed to leverage retrieved context but can still produce outputs not grounded in that context, a phenomenon known as hallucination. This occurs when the model relies on its parametric knowledge or statistical patterns rather than strictly adhering to the retrieved documents, especially if the prompt does not enforce strict grounding or the model's training biases override the context.

Exam trap

The trap here is that candidates often confuse retrieval failures (e.g., poor chunking or embedding) with generation failures, assuming that if the retrieved documents are correct, the model will always use them faithfully, but the core issue is the generation model's tendency to hallucinate when not strictly constrained to the context.

How to eliminate wrong answers

Option A is wrong because splitting sentences across chunks affects retrieval quality and context coherence, but it does not directly cause the model to generate information absent from retrieved documents; it may lead to incomplete or fragmented context, not hallucination. Option B is wrong because the embedding model's size impacts the quality of semantic representation and retrieval accuracy, but a small embedding model would likely cause poor retrieval (missing relevant documents) rather than causing the generation model to invent information not in the retrieved set. Option C is wrong because exceeding the context window would cause truncation of input, potentially losing retrieved context, but the generation model would then operate on incomplete context, not necessarily invent new information; hallucination can occur even within the context window.

125

Multi-Selecthard

A data scientist is evaluating an LLM's performance on a summarization task. They observe that the model produces fluent summaries but often misses key information. Which TWO metrics would best capture this issue? (Select TWO.)

Select 2 answers

A.BLEU score

B.Perplexity

C.Human evaluation with a rubric for completeness

D.ROUGE-L

E.BERTScore

AnswersC, D

Human judgment can directly assess whether key information is included.

Why this answer

ROUGE-L measures recall of the longest common subsequence, capturing information coverage. Human evaluation can assess completeness. BLEU emphasizes precision and fluency.

BERTScore measures semantic similarity but not directly the presence of key points. Perplexity measures model confidence, not recall.

126

MCQeasy

Which component of the Transformer architecture allows the model to weigh the importance of different tokens in the input sequence when generating an output?

A.Positional encoding

B.Self-attention mechanism

C.Layer normalization

D.Feed-forward network

AnswerB

Self-attention computes query-key-value dot products to assign importance weights across tokens.

Why this answer

The self-attention mechanism is the core component of the Transformer architecture that computes attention scores between every pair of tokens in the input sequence. These scores determine how much each token should influence the representation of every other token, allowing the model to dynamically weigh the importance of different tokens when generating an output. This is achieved through scaled dot-product attention, where queries, keys, and values are derived from the input embeddings.

Exam trap

The exam often tests the misconception that positional encoding or layer normalization is responsible for weighting token importance, when in fact only the self-attention mechanism performs this dynamic weighting based on content relationships.

How to eliminate wrong answers

Option A is wrong because positional encoding adds information about the position of tokens in the sequence, not about their relative importance; it enables the model to use order but does not weigh token importance. Option C is wrong because layer normalization stabilizes training by normalizing activations across features, but it does not perform any weighting or attention between tokens. Option D is wrong because the feed-forward network applies a non-linear transformation to each token independently after attention, processing individual token representations without considering inter-token relationships.

127

MCQhard

A developer notices that an LLM-based question-answering system sometimes provides answers that are correct but from an outdated version of the knowledge base. The system uses RAG with a vector database updated daily. What is the MOST likely root cause?

A.The retrieval top-k parameter is set too high

B.The chunking strategy splits documents into too-small pieces

C.The embedding model was not re-run on the updated documents, so the index contains old embeddings

D.The LLM's training data has a knowledge cutoff date

AnswerC

If the vector database is updated but embeddings are not recomputed, the index still matches old chunks, causing retrieval of outdated information.

Why this answer

Option C is correct because the core issue is that the vector database index still contains old embeddings. Even though the knowledge base documents are updated daily, if the embedding model is not re-run on those updated documents, the vector representations in the index remain stale. When the RAG system retrieves, it fetches these outdated embeddings, leading to correct but outdated answers.

This is a classic index synchronization problem in RAG pipelines.

Exam trap

The exam often tests the distinction between retrieval-side issues (index staleness) and model-side issues (knowledge cutoff), so candidates mistakenly pick D because they confuse the LLM's training cutoff with the freshness of the vector database index.

How to eliminate wrong answers

Option A is wrong because a high top-k parameter would retrieve more documents, potentially including both old and new versions, but it does not cause the system to systematically favor outdated content; it would increase recall, not introduce staleness. Option B is wrong because chunking into too-small pieces might reduce context or cause fragmentation, but it does not inherently cause the system to retrieve outdated information; the chunks themselves would still reflect the current document content if embeddings are updated. Option D is wrong because the LLM's training data cutoff date affects the model's parametric knowledge, not the retrieval from the vector database; the RAG system is designed to overcome this by retrieving fresh documents, so the cutoff date is irrelevant to the index staleness problem.
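
The index synchronization problem described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an OCI API: `toy_embed`, `VectorIndex`, and `sync` are hypothetical names, and comparing content hashes is just one assumed way to detect documents that need re-embedding.

```python
import hashlib

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a call to a real embedding model."""
    counts = [0.0] * 8
    for word in text.lower().split():
        counts[sum(map(ord, word)) % 8] += 1.0
    return counts

class VectorIndex:
    """Toy index storing a content hash, an embedding, and the text per document."""

    def __init__(self):
        self.entries = {}  # doc_id -> (content_hash, embedding, text)

    def sync(self, documents: dict[str, str]) -> list[str]:
        """Re-embed every document whose content changed; return the refreshed ids."""
        refreshed = []
        for doc_id, text in documents.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            stored = self.entries.get(doc_id)
            if stored is None or stored[0] != digest:
                self.entries[doc_id] = (digest, toy_embed(text), text)
                refreshed.append(doc_id)
        return refreshed

index = VectorIndex()
index.sync({"policy": "Refunds are allowed within 30 days."})
# The source document changes. Skipping this second sync is the failure
# mode in the question: the index would keep serving the 30-day text.
changed = index.sync({"policy": "Refunds are allowed within 14 days."})
print(changed)
```

Running `sync` as part of the daily update keeps the stored vectors and chunk text aligned with the current documents.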

128

MCQ · medium

A developer is building a code generation assistant and needs to ensure the LLM follows a specific output format (e.g., JSON). Which approach is MOST effective for achieving format adherence without retraining?

A. Lower the temperature to 0 to reduce output variability

B. Provide a few-shot example of the desired JSON format in the prompt

C. Fine-tune the model on a dataset of JSON code examples

D. Increase the context window to include more code context

Answer: B

In-context learning (few-shot) guides the model to mimic the provided format without retraining.

Why this answer

Option B is correct because few-shot prompting—providing explicit examples of the desired JSON format in the prompt—directly guides the LLM's output structure without requiring retraining. This technique leverages in-context learning, where the model infers the required schema from the examples, making it the most effective and efficient method for format adherence.

Exam trap

The exam often tests the misconception that lowering temperature or increasing context window can enforce output format, when in fact these parameters only affect randomness or input length, not structural adherence.

How to eliminate wrong answers

Option A is wrong because lowering temperature to 0 reduces randomness but does not enforce a specific output format; the model may still produce valid JSON with varying structures or deviate entirely. Option C is wrong because fine-tuning requires retraining the model on a dataset, which is costly, time-consuming, and contradicts the constraint of 'without retraining.' Option D is wrong because increasing the context window provides more input context but does not constrain the output format; the model may still generate malformed or non-JSON responses.
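
A few-shot prompt of the kind described above might be assembled as follows. This is a sketch only: the example tasks, the `language`/`code` schema, and the helper names `build_prompt` and `is_valid_response` are assumptions made for illustration, and the call to the model itself is left out.

```python
import json

# Hypothetical example pairs that demonstrate the desired JSON shape.
FEW_SHOT_EXAMPLES = [
    ("Add two numbers",
     {"language": "python", "code": "def add(a, b):\n    return a + b"}),
    ("Reverse a string",
     {"language": "python", "code": "def rev(s):\n    return s[::-1]"}),
]

def build_prompt(task: str) -> str:
    """Build a prompt whose examples show the exact output format."""
    parts = ['Respond with a JSON object containing exactly the keys "language" and "code".']
    for request, response in FEW_SHOT_EXAMPLES:
        parts.append(f"Request: {request}\nResponse: {json.dumps(response)}")
    parts.append(f"Request: {task}\nResponse:")
    return "\n\n".join(parts)

def is_valid_response(raw: str) -> bool:
    """Check that the model output parses as JSON with the expected keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == {"language", "code"}

prompt = build_prompt("Compute a factorial")
print(prompt)
```

Few-shot examples guide the format but do not guarantee it, so validating the response and retrying on failure is a common companion step.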

129

Multi-Select · medium

Which THREE of the following are known limitations of large language models? (Select THREE)

Select 3 answers

A. Real-time awareness of current events

B. Knowledge cutoff (lack of information after a certain date)

C. Bias in training data leading to skewed outputs

D. Hallucination (generating factually incorrect information)

E. Unlimited context window

Answers: B, C, D

LLMs are trained on static datasets and do not know events after their cutoff date.

Why this answer

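Option B is correct because large language models are trained on static datasets that have a fixed cutoff date, after which they have no knowledge of new events, publications, or data. The model's parameters are frozen at the end of training, so it cannot learn or incorporate information beyond that point without retraining or fine-tuning. Options C and D are also genuine limitations: models inherit biases present in their training data, and they can generate fluent but factually incorrect text (hallucination). Options A and E are phrased as capabilities rather than limitations, and LLMs have neither: they lack real-time awareness, and their context window is finite.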

Exam trap

The exam often tests the misconception that LLMs can access real-time data or have unlimited memory, when in fact both are hard architectural constraints tied to training data cutoffs and transformer attention mechanisms.

130

Multi-Select · medium

An organization is concerned about bias in their LLM-powered hiring assistant. Which TWO actions are MOST effective in mitigating bias?

Select 2 answers

A. Use a larger context window to include more examples

B. Increase the temperature parameter to introduce more randomness

C. Use only encoder-only models like BERT for classification

D. Implement human-in-the-loop evaluation with fairness-focused rubrics

E. Fine-tune the model on a carefully curated dataset that balances demographic representation

Answers: D, E

Human evaluation with explicit fairness criteria can catch biased responses.

Why this answer

Option D is correct because human-in-the-loop evaluation with fairness-focused rubrics directly addresses bias by incorporating human judgment to detect and correct biased outputs. This approach allows reviewers to systematically assess responses against predefined fairness criteria, catching subtle biases that automated methods might miss. It is a standard practice in responsible AI deployment for high-stakes applications like hiring. Option E is correct because fine-tuning on a curated dataset with balanced demographic representation reduces the skew the model learns from its data.

Exam trap

The exam often tests the misconception that technical parameters like temperature or context window size can solve bias, when in fact bias mitigation requires deliberate data curation and human oversight, not model hyperparameter tuning.

131

MCQ · medium

A developer wants to compare two sentences for semantic similarity using embeddings. Which distance or similarity metric is most commonly used for dense vector representations?

A. Cosine similarity

B. Jaccard similarity

C. Manhattan distance

D. Euclidean distance

Answer: A

Cosine similarity is the standard metric for comparing embedding vectors because it focuses on orientation, not magnitude.

Why this answer

Cosine similarity measures the cosine of the angle between two vectors and ranges from -1 to 1, where 1 indicates identical direction. Because it depends on orientation and ignores magnitude, it is the metric most commonly used for comparing embedding vectors.
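
As a small illustration of the metric, the function below computes cosine similarity for two plain Python lists. The vectors are made-up values, not real sentence embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction, different magnitude: similarity stays at the maximum.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal vectors: no similarity.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The first pair shows why magnitude does not matter: scaling a vector leaves the angle unchanged.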

132

MCQ · easy

In the context of LLMs, what is the primary function of tokenization?

A. To assign positional encodings to each word

B. To convert tokens into dense vector representations

C. To split text into manageable pieces (tokens) that the model can understand

D. To remove stop words and punctuation from the input

Answer: C

Tokenization breaks text into tokens, which are the atomic units processed by the model.

Why this answer

Tokenization is the first step in processing text for LLMs, where raw input is split into smaller units called tokens (words, subwords, or characters). This is essential because models like GPT or BERT operate on discrete tokens, not raw strings, and tokenization defines the model's vocabulary and input structure.

Exam trap

The exam often tests the distinction between tokenization and embedding, so the trap here is confusing the splitting of text into tokens (tokenization) with the subsequent conversion of those tokens into numerical vectors (embedding).

How to eliminate wrong answers

Option A is wrong because positional encodings are added after tokenization to inject sequence order information, not assigned during tokenization. Option B is wrong because converting tokens into dense vector representations is the role of the embedding layer, not tokenization. Option D is wrong because LLMs typically retain stop words and punctuation as tokens to preserve context and syntactic structure; removal is a preprocessing step in traditional NLP, not a function of tokenization.
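
The separation between tokenization and the later steps can be shown with a toy example. The regex-based splitter below is an assumption for illustration only; production LLM tokenizers use learned subword vocabularies rather than this word-and-punctuation scheme.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens (toy scheme, not subword)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

tokens = tokenize("The model reads tokens, not strings.")
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(tokens)  # punctuation is kept as its own token
print(ids)     # ids are what an embedding layer would later look up
```

Note that the output of this step is a list of integer ids. Turning those ids into dense vectors and adding positional information happen afterwards, in separate components.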

133

MCQ · easy

What is the primary purpose of the self-attention mechanism in a transformer model?

A. To reduce the number of parameters in the model

B. To convert tokens into fixed-length vectors

C. To ensure the model is autoregressive

D. To process tokens in parallel while modeling long-range dependencies

Answer: D

Self-attention enables parallelization by computing attention scores between all token pairs simultaneously, and its receptive field covers the entire sequence.

Why this answer

The self-attention mechanism allows each token in the input sequence to attend to every other token, computing a weighted sum of their representations. This enables the model to capture long-range dependencies directly without the sequential processing constraints of RNNs, and because the attention scores for all tokens can be computed simultaneously, the mechanism supports parallel processing of the entire sequence.

Exam trap

The exam often tests the distinction between the self-attention mechanism's core function (parallel processing and long-range dependencies) and other transformer components like embeddings or causal masking, leading candidates to confuse the purpose of self-attention with the overall autoregressive nature of the decoder.

How to eliminate wrong answers

Option A is wrong because self-attention actually increases the number of parameters (through query, key, and value projection matrices) rather than reducing them. Option B is wrong because converting tokens into fixed-length vectors is the role of the embedding layer, not the self-attention mechanism. Option C is wrong because self-attention itself is not autoregressive; autoregressive behavior in transformers is enforced by causal masking (masking future tokens) during decoding, not by the self-attention mechanism itself.
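
The weighted-sum computation described above can be sketched without any libraries. To keep it short, this example assumes the queries, keys, and values are the token vectors themselves; a real transformer applies learned projection matrices first, and the input vectors here are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    dim = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Score this token against every token in the sequence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted sum of all token vectors.
        outputs.append([sum(row[j] * x[j][i] for j in range(len(x))) for i in range(dim)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` depends only on the inputs, not on any other row, which is why all rows can be computed in parallel.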

134

Multi-Select · easy

A developer is comparing different foundation models for a text completion API on OCI. Which TWO of the following are model families available through OCI Generative AI service? (Choose two.)

Select 2 answers

A. OpenAI GPT

B. BERT

C. Meta Llama

D. Cohere Command/Embed

E. Mistral

Answers: C, D

Meta Llama models are available on OCI.

Why this answer

OCI Generative AI offers models including Cohere Command/Embed and Meta Llama. The answer key does not treat Mistral or OpenAI GPT as OCI Generative AI model families, and BERT is an encoder-only model that is not offered as a text generation model.

135

Multi-Select · medium

A team wants to reduce hallucinations in their LLM-powered question-answering system. Which TWO techniques are most effective?

Select 2 answers

A. Implementing RAG to retrieve relevant documents

B. Switching to a smaller model

C. Using a lower temperature (e.g., 0) for more deterministic outputs

D. Using a larger context window

E. Increasing the temperature to 1.5

Answers: A, C

RAG grounds answers in retrieved facts.

Why this answer

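RAG provides factual grounding by supplying retrieved documents as context. Lowering the temperature makes outputs more deterministic, so the model is less likely to sample low-probability tokens that introduce fabricated details.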
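
The effect of temperature can be seen by applying it to a small set of logits. The logit values below are invented for illustration, and a temperature of exactly 0 is treated as a separate case (greedy decoding) because the scaling divides by the temperature.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy decoding")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.2)
flat = softmax_with_temperature(logits, 1.5)

print([round(p, 3) for p in sharp])  # mass concentrates on the top token
print([round(p, 3) for p in flat])   # mass spreads across all tokens
```

At low temperature nearly all probability sits on the most likely token, while at 1.5 the less likely tokens are sampled much more often.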

136

Multi-Select · hard

A team is building a RAG pipeline on OCI. Which THREE steps are essential components of a standard RAG pipeline? (Select THREE)

Select 3 answers

A. Embedding each chunk into a dense vector using an embedding model

B. Retrieving relevant chunks based on cosine similarity to the query embedding

C. Training a custom LLM from scratch on the document corpus

D. Fine-tuning the generation model on the retrieved chunks

E. Chunking documents into passages

Answers: A, B, E

Chunks are converted to vectors for similarity search.

Why this answer

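Option A is correct because embedding each chunk into a dense vector using an embedding model is a fundamental step in a RAG pipeline. The embedding model converts text chunks into high-dimensional vector representations that capture semantic meaning, enabling efficient similarity search during retrieval. Without this step, the system cannot compare query intent with document content in a vector space. Options E and B are the other two essential steps: documents are first chunked into passages, and at query time the chunks most similar to the query embedding are retrieved and passed to the LLM as context.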

Exam trap

The exam often tests the distinction between RAG (which uses a frozen LLM with retrieved context) and fine-tuning or training a model, leading candidates to mistakenly select fine-tuning or training steps as essential components.
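
The three essential steps (chunk, embed, retrieve) can be sketched end to end with toy components. Everything here is an assumption for illustration: `embed` is a bag-of-words counter standing in for a real embedding model, the chunk size is chosen so that chunks align with the two sample sentences, and no LLM is called.

```python
import math
from collections import Counter

def chunk(text, size):
    """Split text into passages of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(word.strip(".,?").lower() for word in text.split())

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]

corpus = ("Employees accrue fifteen vacation days per year. "
          "Expense reports are due within thirty days.")
chunks = chunk(corpus, size=7)
question = "How many vacation days do employees get?"
top = retrieve(question, chunks)

# The retrieved text is placed in the prompt; the LLM itself stays frozen.
prompt = f"Context: {top[0]}\nQuestion: {question}"
print(prompt)
```

No training or fine-tuning step appears anywhere in the sketch, which is the distinction the question tests.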

137

MCQ · medium

An organization wants to deploy an LLM for legal document analysis where accuracy is critical, and the model must not reference any external data outside the provided legal corpus. Which approach BEST satisfies these requirements?

A. Use a decoder-only model with zero-shot prompting

B. Use a fine-tuned encoder-only model for classification only

C. Use a large foundation model with a high temperature setting

D. Use RAG with a vector store containing only the legal documents, and set the retriever to return a fixed number of chunks with high similarity threshold

Answer: D

RAG ensures answers are grounded in the provided legal corpus; similarity threshold can prevent retrieval of irrelevant chunks.

Why this answer

RAG grounds generation in a curated corpus. When the vector store contains only the legal documents and the retriever returns only chunks above a high similarity threshold, the model answers from the provided corpus, which reduces hallucinations and reliance on outside knowledge.

138

MCQ · medium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A. Use a larger foundation model with a longer context window and paste all documents into each prompt

B. Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store

C. Fine-tune a base LLM on the policy documents monthly

D. Train a custom model from scratch on the policy documents each month

Answer: B

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

Retrieval-Augmented Generation (RAG) is the most appropriate approach because it allows the chatbot to answer questions from the policy documents without retraining the model. By indexing the documents in a vector store and retrieving relevant chunks at query time, RAG handles monthly updates by simply re-indexing the new documents, avoiding the cost and complexity of fine-tuning or retraining.

Exam trap

The exam often tests the misconception that fine-tuning or retraining is required for domain-specific knowledge, when in fact RAG provides a cost-effective, update-friendly alternative that leverages the LLM's existing reasoning capabilities.

How to eliminate wrong answers

Option A is wrong because pasting all documents into each prompt can exceed the model's context window and, even when the documents fit, incurs high token costs on every request and can degrade answer quality through information overload. Option C is wrong because fine-tuning a base LLM monthly on the policy documents is expensive, time-consuming, and risks catastrophic forgetting of previous content, making it impractical for frequent updates. Option D is wrong because training a custom model from scratch each month is prohibitively expensive, requires massive computational resources and data, and is unnecessary when RAG can achieve the same goal with far less overhead.

139

MCQ · easy

Which tokenization algorithm is used by models like BERT and GPT-2?

A. SentencePiece

B. Byte-Pair Encoding (BPE)

C. WordPiece

D. Unigram Language Model

Answer: C

BERT uses WordPiece, while GPT-2 uses byte-level BPE, so no single option fits both models. The answer key accepts WordPiece, which is correct for BERT only.

Why this answer

BERT uses WordPiece and GPT-2 uses BPE; both are subword tokenization methods. SentencePiece is used by models like T5 and Llama.

140

MCQ · medium

Which of the following sampling strategies selects tokens based on a cumulative probability threshold from the highest probability tokens?

A. Top-p (nucleus) sampling

B. Top-k sampling

C. Greedy decoding

D. Temperature sampling

Answer: A

Top-p selects the smallest set of tokens whose cumulative probability exceeds p.

Why this answer

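Top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches p, discards the remaining tail of the distribution, and samples from what is left. The number of candidate tokens therefore varies from step to step.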
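
A minimal sketch of the nucleus selection step, using an invented four-token distribution. The function name `top_p_filter` and the choice to renormalize the kept probabilities before sampling are assumptions of this example.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest top-ranked token set with cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the kept probabilities sum to 1 before sampling.
    return {token: prob / total for token, prob in kept}

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, 0.75)
print(nucleus)

# Sample the next token from the nucleus only.
next_token = random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(next_token)
```

With p = 0.75 the nucleus holds two tokens here; a flatter distribution would need more tokens to reach the same threshold, which is what distinguishes top-p from a fixed top-k cutoff.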

141

Multi-Select · hard

A machine learning engineer is evaluating the performance of a translation model using BLEU score. Which THREE statements about BLEU are correct? (Choose three.)

Select 3 answers

A. BLEU includes a brevity penalty to penalize outputs that are too short

B. BLEU computes n-gram precision up to a maximum n (usually 4)

C. BLEU correlates well with human judgment at the corpus level

D. BLEU measures recall of n-grams by comparing the output to the reference

E. BLEU is a recall-oriented metric

Answers: A, B, C

The brevity penalty prevents short outputs from achieving artificially high scores.

Why this answer

BLEU is a precision-based metric (not recall). It uses modified n-gram precision with a brevity penalty. It correlates reasonably well with human judgment at the corpus level but has known limitations such as not capturing semantic equivalence.
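
The two ingredients named above, modified n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified single-reference, sentence-level version written for illustration; it omits smoothing and multi-reference handling, so its scores will not match those of production BLEU tools.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clip each candidate n-gram count by its count in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(candidate, reference) * geo_mean

reference = "the cat sat on the mat".split()
print(bleu(reference, reference))
print(modified_precision("the the the".split(), reference, 1))
print(brevity_penalty("the cat".split(), reference))
```

The clipping step is what makes the precision "modified": repeating a correct word cannot earn more credit than the reference contains.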

142

MCQ · medium

A data scientist is fine-tuning a Llama 2 7B model on a custom dataset using OCI Data Science. After training, the model generates fluent but factually incorrect statements about the new domain. Which post-training technique would BEST address this issue without retraining?

A. Decrease the temperature to 0.1

B. Switch to a larger model like Llama 2 70B

C. Apply top-p sampling with p=0.9

D. Use a retrieval-augmented generation (RAG) pipeline

Answer: D

RAG retrieves relevant documents and feeds them as context, reducing hallucinations by grounding responses in verified sources.

Why this answer

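RAG retrieves factual information from an external knowledge base to ground the generation, reducing hallucinations. Lowering the temperature (A) or applying top-p sampling (C) only changes how tokens are sampled, and a larger model (B) still lacks the facts of the new domain, so none of these addresses factual accuracy.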

143

MCQ · easy

Which component of the Transformer architecture allows the model to weigh the importance of different tokens in the input sequence when generating each output token?

A. Feed-forward neural network

B. Multi-head attention

C. Self-attention mechanism

D. Positional encoding

Answer: C

The self-attention mechanism computes attention scores between each token and every other token, allowing the model to focus on relevant parts of the input.

Why this answer

The self-attention mechanism computes attention scores between all pairs of tokens, enabling the model to dynamically focus on relevant parts of the input. Positional encoding adds order information, multi-head attention runs multiple attention heads in parallel, and the feed-forward network processes each position independently.

144

MCQ · medium

Which sampling strategy selects the token with the highest probability at each step, resulting in deterministic and often repetitive outputs?

A. Beam search

B. Temperature sampling

C. Top-k sampling

D. Greedy decoding

Answer: D

Greedy decoding selects the token with the highest probability at each step, producing deterministic outputs.

Why this answer

Greedy decoding selects the token with the highest probability at each step, making it deterministic and often leading to repetitive outputs because it always chooses the most likely next token without considering future alternatives. This contrasts with stochastic methods that introduce randomness or explore multiple paths.

Exam trap

The exam often tests the distinction between deterministic (greedy) and stochastic (sampling-based) strategies, and the trap here is confusing 'greedy decoding' with 'beam search' because both involve selecting high-probability tokens, but beam search maintains multiple paths while greedy does not.

How to eliminate wrong answers

Option A is wrong because beam search maintains multiple candidate sequences (beams) at each step and keeps the highest-scoring ones, rather than committing to the single highest-probability token; it is deterministic too, but it is not the strategy the question describes. Option B is wrong because temperature sampling scales the logits before applying softmax to control randomness, not deterministically picking the top token. Option C is wrong because top-k sampling restricts the next token selection to the k most probable tokens but still samples randomly from that set, not deterministically choosing the single highest-probability token.
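
The contrast between greedy decoding and top-k sampling can be shown on a single decoding step. The three-token distribution and the helper names `greedy_pick` and `top_k_sample` are assumptions of this sketch; beam search is not shown.

```python
import random

def greedy_pick(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k, rng):
    """Sample from the k most probable tokens, weighted by probability."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [weight for _, weight in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"cat": 0.6, "dog": 0.3, "eel": 0.1}

# Greedy decoding gives the same token on every call.
greedy = [greedy_pick(probs) for _ in range(5)]
# Top-k sampling can vary between calls but never leaves the top k tokens.
sampled = [top_k_sample(probs, 2, random.Random(seed)) for seed in range(5)]

print(greedy)
print(sampled)
```

Repeating the greedy choice at every step of a long generation is what tends to produce the repetitive outputs mentioned in the question.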

145

MCQ · easy

Which component of the transformer architecture allows the model to weigh the importance of different words in a sentence when processing input?

A. Layer normalization

B. Positional encoding

C. Self-attention mechanism

D. Feed-forward neural network

Answer: C

Self-attention computes pairwise relevance scores and produces context-aware representations for each token.

Why this answer

The self-attention mechanism is the core component of the transformer architecture that enables the model to dynamically assign weights to each word in a sentence relative to every other word. This allows the model to capture contextual relationships and dependencies, such as determining which words are most relevant to the current word being processed, regardless of their distance in the sequence.

Exam trap

The exam often tests the distinction between components that provide positional information (positional encoding) versus those that compute relational importance (self-attention), leading candidates to confuse the role of positional encoding with the weighting of word significance.

How to eliminate wrong answers

Option A is wrong because layer normalization stabilizes training by normalizing activations across features, not by weighing word importance. Option B is wrong because positional encoding adds information about the order of words in a sequence, but does not perform any weighting of importance. Option D is wrong because the feed-forward neural network applies non-linear transformations to each position independently after attention has been computed, and does not weigh word importance.
