Oracle Cloud Infrastructure Generative AI Professional (1Z0-1127): Questions 751-825

991 questions total · 14 pages · All types, answers revealed

751
MCQ · hard

A company is using OCI Generative AI service to power a customer support chatbot. They observe that the chatbot sometimes provides outdated information because the model was trained on data up to 2022. They want to incorporate real-time knowledge without retraining the model. Which approach should they use?

A. Increase the max-tokens parameter to allow longer responses.
B. Use prompt engineering to instruct the model to ignore old information.
C. Implement a Retrieval-Augmented Generation (RAG) pattern using OCI OpenSearch.
D. Fine-tune the model with recent data from 2023 onwards.
Answer: C

RAG retrieves relevant up-to-date documents and feeds them to the model, enabling current responses without retraining.

Why this answer

Option C is correct because Retrieval-Augmented Generation (RAG) allows the model to access real-time information from an external knowledge base, such as OCI OpenSearch, without retraining. This pattern retrieves relevant documents or data at inference time and injects them into the prompt, enabling the model to answer with up-to-date context. It directly addresses the need for real-time knowledge while keeping the base model static.
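The retrieve-then-inject flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the `DOCUMENTS` list, the word-overlap `score` function, and the prompt wording are all hypothetical stand-ins, and a real deployment would query an external index (the question names OCI OpenSearch) rather than an in-memory list.

```python
# Hypothetical in-memory knowledge base standing in for an external index.
DOCUMENTS = [
    "Return window: items can be returned within 60 days as of March 2024.",
    "Shipping: standard delivery usually arrives within one week.",
    "Warranty: hardware is covered for 24 months from purchase.",
]

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def score(query, document):
    """Toy relevance: number of distinct query words found in the document."""
    doc_words = set(tokenize(document))
    return sum(1 for word in set(tokenize(query)) if word in doc_words)

def retrieve(query, documents, k=1):
    """Return the k documents with the highest toy relevance score."""
    ranked = sorted(documents, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Inject retrieved context into the prompt; the model itself is unchanged."""
    context = "\n".join(retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

prompt = build_prompt("How many days do I have to return an item?", DOCUMENTS)
print(prompt)
```

Updating the knowledge base only means editing `DOCUMENTS` (or, in practice, re-indexing), which is why the pattern keeps answers current without touching model weights.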

Exam trap

The trap here is that candidates often confuse prompt engineering (Option B) as a way to 'override' training data, but in reality, prompt instructions cannot erase the model's learned parameters, making RAG the only viable solution for real-time knowledge without retraining.

How to eliminate wrong answers

Option A is wrong because increasing max-tokens only extends the length of the response, not the recency or accuracy of the information; it does not provide any mechanism to incorporate new data. Option B is wrong because prompt engineering cannot force the model to 'ignore' outdated training data; the model's parametric knowledge is fixed and cannot be selectively suppressed by instructions alone, leading to hallucinations or contradictions. Option D is wrong because fine-tuning requires retraining the model on new data, which contradicts the requirement to avoid retraining and is also resource-intensive and time-consuming.

752
Multi-Select · hard

Which THREE of the following are known limitations of large language models that practitioners must consider?

Select 3 answers
A. Hallucination of facts not present in the input.
B. Generation of toxic or harmful language.
C. Limited to processing only one language at a time.
D. Bias amplification from training data.
E. Inability to process inputs longer than a few hundred tokens.
Answers: A, B, D

LLMs often generate plausible but false information.

Why this answer

Option A is correct because large language models (LLMs) are prone to hallucination, where they generate plausible-sounding but factually incorrect information that was not present in the input. This occurs because LLMs are next-token predictors without a built-in fact-checking mechanism, and they can invent details, citations, or events to maintain coherence. Practitioners must implement retrieval-augmented generation (RAG) or external verification to mitigate this risk.

Exam trap

Oracle often tests the misconception that LLMs have a hard token limit of a few hundred tokens, but the trap is that modern models have large context windows (e.g., 128K tokens) and the real limitation is the quadratic computational cost of attention, not a strict inability to process longer inputs.

753
MCQ · hard

An architect needs to ensure that an LLM deployed in OCI does not reveal sensitive information in its outputs. Which technique should be used?

A. Limiting max tokens
B. OCI Data Safe masking
C. Output filtering via custom inference wrapper
D. Input sanitization
Answer: C

A custom wrapper can filter outputs to remove sensitive information.

Why this answer

Option C is correct because output filtering via a custom inference wrapper allows the architect to inspect and sanitize the model's generated text before it reaches the user, preventing the leakage of sensitive information such as PII, credentials, or internal data. This technique operates at the application layer, intercepting the LLM's response and applying rules or regex patterns to redact or block prohibited content, which is essential for compliance and data security in production deployments.
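A minimal sketch of such a wrapper, assuming a handful of illustrative regex rules. The patterns, the placeholder labels, and the `fake_model` function are hypothetical; a production filter would need a vetted and much broader rule set, and the real model call would replace `fake_model`.

```python
import re

# Illustrative rules only: (compiled pattern, replacement label).
REDACTION_RULES = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
    (re.compile(r"\b(?:api[_-]?key|password)\s*[:=]\s*\S+", re.IGNORECASE), "[REDACTED_SECRET]"),
]

def filter_output(text):
    """Apply every redaction rule to the generated text."""
    for pattern, replacement in REDACTION_RULES:
        text = pattern.sub(replacement, text)
    return text

def guarded_generate(generate, prompt):
    """Call the model, then sanitize its response before it reaches the user."""
    return filter_output(generate(prompt))

def fake_model(prompt):
    """Stand-in for a real LLM call; returns text containing sensitive data."""
    return "Contact jane.doe@example.com, SSN 123-45-6789, password: hunter2"

print(guarded_generate(fake_model, "Who should I contact?"))
```

Because the filter runs on the response rather than the prompt, it also catches sensitive strings that originate from the model's training data or retrieved context, which input sanitization alone cannot do.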

Exam trap

Oracle often tests the distinction between input-side controls (like sanitization) and output-side controls (like filtering), and the trap here is that candidates confuse input sanitization with output filtering, assuming that cleaning the input is sufficient to prevent data leakage from the model's training or internal knowledge.

How to eliminate wrong answers

Option A is wrong because limiting max tokens only restricts the length of the output, not its content, and does nothing to prevent sensitive information from appearing within the allowed token count. Option B is wrong because OCI Data Safe masking is designed for structured databases and relational data, not for unstructured text generated by an LLM; it cannot be applied to model outputs in real-time. Option D is wrong because input sanitization focuses on cleaning user prompts before they reach the model, which is important for prompt injection prevention but does not control what the model generates in its response.

754
MCQ · medium

What is the main advantage of using Byte-Pair Encoding (BPE) over word-level tokenization?

A. It can represent any word as a sequence of subword tokens, including rare or unseen words
B. It produces fixed-length token sequences
C. It reduces the number of tokens by merging all letters into single tokens
D. It eliminates the need for a vocabulary altogether
Answer: A

BPE's subword approach ensures open vocabulary.

Why this answer

Byte-Pair Encoding (BPE) is a subword tokenization algorithm that iteratively merges the most frequent pairs of bytes or characters into new tokens. Its main advantage over word-level tokenization is that it can represent any word, including rare or unseen words, as a sequence of subword tokens, which largely eliminates the out-of-vocabulary (OOV) problem. This allows models such as GPT to handle arbitrary input without a fixed word vocabulary; BERT achieves the same effect with the closely related WordPiece algorithm.
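The merge loop can be illustrated with a toy character-level BPE. This is a simplified sketch, not any particular tokenizer's implementation: the four-word `corpus` and its frequencies are made up, ties between equally frequent pairs are broken by first occurrence, and production tokenizers typically work on bytes and add word-boundary handling.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across a {symbol-tuple: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_symbols(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():
            key = tuple(merge_symbols(list(symbols), best))
            merged[key] = merged.get(key, 0) + freq
        words = merged
    return merges

def encode(word, merges):
    """Tokenize a word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # made-up frequencies
merges = learn_merges(corpus, 4)
print(encode("lowest", merges))  # "lowest" never appears in the corpus
```

With this corpus the word "lowest", which was never seen during training, is encoded as the subwords `low` and `est`, and a word made entirely of unfamiliar letters simply falls back to single characters instead of an unknown-word token.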

Exam trap

Oracle often tests the misconception that BPE eliminates the need for a vocabulary or that it produces fixed-length outputs, but the core advantage is its ability to handle rare and unseen words through subword decomposition.

How to eliminate wrong answers

Option B is wrong because BPE does not produce fixed-length token sequences; the number of tokens depends on the input text, and BPE can generate variable-length sequences. Option C is wrong because BPE does not merge all letters into single tokens; it merges the most frequent character pairs iteratively, but the final vocabulary contains many multi-character subwords, not just single letters. Option D is wrong because BPE still requires a predefined vocabulary of subword tokens (typically 32k–100k tokens), and the model cannot operate without any vocabulary.

755
MCQ · easy

A developer wants to use LangChain to connect to OCI Generative AI Service for text generation. Which LangChain wrapper class should they use for the chat model?

A. OCIGenAI
B. ChatOCIGenAI
C. ChatOpenAI
D. OCIGenAIEmbeddings
Answer: B

ChatOCIGenAI is specifically designed for chat models and supports messages, roles, and conversation context.

Why this answer

ChatOCIGenAI is the correct LangChain wrapper class for chat-based interactions with OCI Generative AI Service. LangChain distinguishes between standard LLM wrappers (OCIGenAI) and chat model wrappers (ChatOCIGenAI), where the latter is designed to handle multi-turn conversation history and structured message formats (HumanMessage, AIMessage, SystemMessage). This aligns with the developer's requirement for a chat model.

Exam trap

Oracle often tests the distinction between LLM and ChatModel wrappers in LangChain, trapping candidates who assume all text generation tasks use the same class, or who confuse OCI-specific wrappers with generic OpenAI ones.

How to eliminate wrong answers

Option A is wrong because OCIGenAI is the base LangChain wrapper for OCI Generative AI Service's text generation (LLM), not the chat-specific variant; it lacks built-in support for chat message history and roles. Option C is wrong because ChatOpenAI is a wrapper for OpenAI's chat models (e.g., GPT-3.5/GPT-4), not for OCI Generative AI Service; using it would require different API keys and endpoints. Option D is wrong because OCIGenAIEmbeddings is a wrapper for generating embeddings (vector representations of text), not for text generation or chat; it serves a completely different purpose in RAG or similarity search pipelines.

756
MCQ · medium

A security administrator needs to grant a group of data scientists access to use OCI Generative AI resources (models, endpoints) in compartment 'GenAI-Prod', but not allow them to create or manage infrastructure. Which IAM policy statement should be used?

A. Allow group DataScientists to inspect generative-ai-family in compartment GenAI-Prod
B. Allow group DataScientists to use generative-ai-family in compartment GenAI-Prod
C. Allow group DataScientists to read generative-ai-family in compartment GenAI-Prod
D. Allow group DataScientists to manage generative-ai-family in compartment GenAI-Prod
Answer: B

The 'use' verb grants permission to invoke models and use endpoints without management rights.

Why this answer

The 'use' verb on generative-ai-family resources allows inference and use of models/endpoints without permitting management (create/update/delete). This matches the requirement.

757
MCQ · easy

Which of the following is the correct format for a training dataset used in OCI Generative AI fine-tuning?

A. JSONL file with 'prompt' and 'completion' fields
B. CSV file with columns 'input' and 'output'
C. TXT file with one prompt-completion pair per line separated by a tab
D. Parquet file with 'text' and 'label' columns
Answer: A

OCI Generative AI fine-tuning requires training data in JSONL format.

Why this answer

OCI GenAI fine-tuning expects a JSONL file with prompt/completion pairs.
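As a rough illustration of that layout, the sketch below serializes two made-up examples as JSON Lines, one JSON object per line with exactly the `prompt` and `completion` keys. The example records and the `to_jsonl` helper are hypothetical; only the standard-library `json` module is used.

```python
import json

# Made-up training examples; real datasets would be far larger.
examples = [
    {"prompt": "Summarize: The meeting moved to Friday.", "completion": "Meeting is now Friday."},
    {"prompt": "Translate to French: Good morning", "completion": "Bonjour"},
]

def to_jsonl(records):
    """Serialize records as JSON Lines, rejecting unexpected keys."""
    for record in records:
        if set(record) != {"prompt", "completion"}:
            raise ValueError(f"unexpected keys: {sorted(record)}")
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

jsonl_text = to_jsonl(examples)
print(jsonl_text, end="")
```

Each output line parses independently with `json.loads`, which is what distinguishes JSONL from a single JSON array and makes large files easy to stream and validate line by line.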

758
MCQ · easy

An organization wants to fine-tune a large language model on OCI using their proprietary data. They are concerned about data privacy and want to ensure that fine-tuning data does not leave the OCI region. Which OCI service should they use to securely store and manage their training data?

A. OCI Block Volume
B. OCI File Storage
C. OCI Object Storage
D. Oracle Autonomous Database
Answer: C

Object Storage provides secure, regional storage ideal for large datasets.

Why this answer

C is correct because OCI Object Storage is a regional service that stores data within a specific OCI region, ensuring that fine-tuning data does not leave that region. It provides secure, durable, and scalable storage for large datasets, such as training data for LLMs, with encryption at rest and in transit, and supports direct integration with OCI Data Science and Generative AI services for fine-tuning workflows.

Exam trap

Oracle often tests whether candidates know which storage service fits fine-tuning data. Block Volume and File Storage also keep data inside a region, but they are scoped to an availability domain and must be attached to or mounted by compute instances, whereas Object Storage is the regional service designed for secure, scalable storage of unstructured data such as LLM training datasets.

How to eliminate wrong answers

Option A is wrong because OCI Block Volume is block-level storage attached to compute instances, designed for low-latency, persistent storage for databases or applications; its data also stays in the region, but a volume is tied to an availability domain and an attached instance, so it is a poor fit for storing and managing large training datasets that other services must read. Option B is wrong because OCI File Storage is a network file system (NFS) service for shared file access across compute instances; it is not optimized for large-scale storage of training data and is typically used for shared file systems, not as the primary store for fine-tuning datasets. Option D is wrong because Oracle Autonomous Database is a managed database service for transactional and analytical workloads; it is optimized for structured data and SQL queries, not for storing large unstructured datasets such as LLM training data.

759
Multi-Select · medium

Which TWO of the following are characteristics of decoder-only models like GPT? (Select TWO)

Select 2 answers
A. They process input through an encoder and a decoder
B. They use bidirectional self-attention
C. They use masked self-attention to prevent attending to future tokens
D. They are ideal for tasks requiring full bidirectional context like NER
E. They are typically used for generative tasks like text completion
Answers: C, E

Masked self-attention ensures each token only attends to previous tokens.

Why this answer

Decoder-only models use masked self-attention (causal) and generate tokens left-to-right. They cannot use bidirectional context because future tokens are masked.
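A small sketch of what the causal mask does to attention weights, using plain Python lists and made-up scores. Here `scores[i][j]` is assumed to be the raw attention score of query position i for key position j; real models compute these from learned projections and operate on tensors.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax with future positions (j > i) masked to -infinity."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

raw_scores = [
    [0.0, 1.0, 2.0],
    [0.0, 0.0, 5.0],
    [1.0, 1.0, 1.0],
]
for row in causal_attention_weights(raw_scores):
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on position 0 and the second row splits its weight evenly between positions 0 and 1, even though the unmasked scores favored later positions; every entry above the diagonal is exactly zero.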

760
MCQ · hard

An enterprise with strict data residency requirements wants to use OCI Generative AI. They must ensure that no training data or inference data leaves a specific OCI region. Which configuration option should they choose?

A. Use a dedicated AI cluster in the desired region and disable cross-region access.
B. Configure a service gateway with a private endpoint.
C. Implement a policy restricting data transfer via OCI Identity and Access Management.
D. Use OCI Data Transfer Service to keep data within the region.
Answer: A

Dedicated clusters are region-specific and can be restricted to prevent cross-region data flow.

Why this answer

A dedicated AI cluster in the desired region, with cross-region access disabled, ensures that all compute, training data, and inference data remain physically within that OCI region. This satisfies strict data residency requirements because the cluster is isolated from other regions at the network and infrastructure level, preventing any data egress.

Exam trap

The trap here is that candidates confuse network-level controls (like service gateways or private endpoints) with data residency enforcement, but only a dedicated, region-locked compute cluster guarantees that no data leaves the specified region.

How to eliminate wrong answers

Option B is wrong because a service gateway with a private endpoint only provides private connectivity within a VCN and does not prevent data from being processed or stored in other regions; it does not enforce regional data residency. Option C is wrong because OCI IAM policies control user permissions and resource access, not the physical location or movement of data between regions. Option D is wrong because OCI Data Transfer Service is designed for offline bulk data migration and does not provide ongoing control over where inference or training data resides during active AI workloads.

761
MCQ · easy

What is the primary benefit of using a system prompt to set the persona and tone before the user message?

A. It reduces the token cost of each user message
B. It sets the overall behavior, tone, and constraints for the model throughout the conversation
C. It automatically grounds the model in the latest training data
D. It replaces the need for few-shot examples
Answer: B

System prompts define the model's role and rules for the entire session.

Why this answer

The system prompt establishes persistent behavioral guidelines that influence all subsequent interactions, ensuring consistency without repeating instructions in every user message.

762
MCQ · medium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A. Train a custom model from scratch on the policy documents each month
B. Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store
C. Fine-tune a base LLM on the policy documents monthly
D. Use a larger foundation model with a longer context window and paste all documents into each prompt
Answer: B

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

RAG (Retrieval-Augmented Generation) allows the LLM to retrieve relevant document sections at inference time, so knowledge stays current without retraining. The other options either require expensive retraining for each update or lack document grounding.

763
Multi-Select · medium

A DevOps engineer is setting up monitoring and logging for a generative AI inference endpoint. Which three resources should they enable? (Select THREE.)

Select 3 answers
A. OCI VCN flow logs for network traffic
B. OCI Logging for inference requests and responses
C. OCI Monitoring metrics for endpoint latency and error rates
D. OCI Application Performance Monitoring (APM) for tracing inference requests
E. OCI Audit logs for all API calls
Answers: B, C, D

Correct: Logging allows auditing and debugging of inference calls.

Why this answer

Option B is correct because OCI Logging captures detailed logs of inference requests and responses, which is essential for auditing, debugging, and analyzing the behavior of a generative AI endpoint. This service provides a centralized repository for log data, enabling DevOps engineers to track input prompts and model outputs for compliance and troubleshooting purposes.

Exam trap

The trap here is that candidates may confuse OCI Audit logs (which track administrative API calls) with OCI Logging (which captures data-plane request/response details), leading them to select Audit logs instead of Logging for monitoring inference payloads.

764
Multi-Select · easy

Which TWO of the following are best practices for building a RAG pipeline in OCI?

Select 2 answers
A. Use overlapping chunks
B. Always use exact vector search for accuracy
C. Use a pre-trained embedding model from OCI Generative AI
D. Avoid storing metadata alongside vectors
E. Use a single large chunk for each document
Answers: A, C

Overlapping chunks preserve context across boundaries, improving retrieval.

Why this answer

Overlapping chunks ensure that context is not lost at chunk boundaries, which is critical for retrieval accuracy in RAG pipelines. By including overlapping text segments, the embedding model can capture semantic continuity, reducing the risk of missing relevant information when a query spans chunk edges.
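A minimal sketch of overlapping chunking over a list of words. The sample sentence, the `chunk_size` of 5, and the `overlap` of 2 are arbitrary choices for illustration; real pipelines usually chunk by tokens or characters and use much larger windows.

```python
def chunk_with_overlap(words, chunk_size, overlap):
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

text = "the refund policy allows returns within sixty days of the delivery date"
chunks = chunk_with_overlap(text.split(), chunk_size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last two words of the previous one, so a phrase such as "within sixty days" that would otherwise straddle a boundary still appears intact in at least one chunk.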

Exam trap

Oracle often tests the misconception that exact vector search is always superior for accuracy, when in practice ANN search with proper tuning (e.g., efSearch parameter) achieves equivalent recall for RAG while being orders of magnitude faster.

765
MCQ · medium

A data engineer is building a RAG application using OCI Generative AI Agents. They have documents stored in OCI Object Storage. Which resource must they create to make these documents searchable by the agent?

A. Create a Dedicated AI Cluster
B. Create a data source pointing to the Object Storage bucket
C. Create an endpoint
D. Create a fine-tuning job
Answer: B

Data sources connect to storage and index the content for retrieval.

Why this answer

In OCI Generative AI Agents, a data source points to Object Storage buckets and indexes the documents. Knowledge bases combine data sources, but the first step is creating the data source.

766
Multi-Select · medium

A developer is using LangChain's ChatPromptTemplate to construct a prompt for a conversational agent. The prompt should include a system message, a placeholder for conversation history, and the latest user query. Which THREE components should they include in the template?

Select 3 answers
A. AIMessage
B. SystemMessage
C. MessagesPlaceholder with variable name 'history'
D. HumanMessage
Answers: B, C, D

A system message sets the behavior or context for the model.

Why this answer

Option B is correct because `SystemMessage` is the LangChain class used to define the system-level instruction that sets the behavior and context for the conversational agent. In a `ChatPromptTemplate`, the system message is typically the first message in the prompt, providing overarching guidance to the model before any conversation history or user input is added.

Exam trap

Oracle often tests the distinction between message types used in prompt construction versus message types used in conversation history, leading candidates to incorrectly include `AIMessage` as a template component when it should only appear in the `history` placeholder.

767
MCQ · easy

Which parameter controls the randomness of the model's output by scaling the probability distribution before sampling?

A. top-k
B. temperature
C. frequency_penalty
D. top-p
Answer: B

Temperature adjusts the softmax distribution's sharpness, directly controlling randomness.

Why this answer

Temperature scales the logits before softmax, affecting creativity. Higher values increase randomness; lower values make output more deterministic.
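The scaling can be made concrete with a short sketch: each logit is divided by the temperature T before the softmax, so the probability of token i is exp(z_i / T) divided by the sum of exp(z_j / T) over all tokens. The three logits below are made-up values chosen only to show the effect.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At the low temperature most of the probability mass concentrates on the highest-scoring token, while at the high temperature the distribution flattens and lower-ranked tokens are sampled more often.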

768
MCQ · easy

Which prompting technique involves providing the model with a small number of input-output examples within the prompt to guide its behavior?

A. Few-shot prompting
B. Chain-of-thought prompting
C. Zero-shot prompting
D. Tree-of-thought prompting
Answer: A

Few-shot prompting provides a few examples in the prompt to demonstrate the task.

Why this answer

Few-shot prompting includes several examples of the desired input-output mapping to help the model understand the task and output format.

769
MCQ · easy

A company is building a RAG application for customer support. The knowledge base includes documents in English, Spanish, and French. Which embedding model should they use from OCI Generative AI to ensure accurate retrieval across all languages?

A. cohere.embed-multilingual-light-v3.0
B. cohere.generate-english-v2:0
C. cohere.embed-english-light-v3.0
D. cohere.command-r-plus-v1:0
Answer: A

This multilingual embedding model is designed to handle multiple languages, providing accurate retrieval for the company's needs.

Why this answer

Option A is correct because the cohere.embed-multilingual-light-v3.0 model is specifically designed to handle multiple languages, including English, Spanish, and French, ensuring accurate semantic retrieval across all three languages in the RAG application. OCI Generative AI offers this multilingual embedding model to support diverse knowledge bases, whereas single-language models would fail to represent non-English text effectively.

Exam trap

The trap here is that candidates may confuse text generation models (like command-r-plus or generate) with embedding models, or assume that an English-only embedding model can still handle multilingual content through translation, which is incorrect because embeddings are language-specific and not cross-lingual without multilingual training.

How to eliminate wrong answers

Option B is wrong because cohere.generate-english-v2:0 is a text generation model, not an embedding model, and it only supports English, making it unsuitable for retrieval across Spanish and French. Option C is wrong because cohere.embed-english-light-v3.0 is an embedding model but is limited to English, so it cannot accurately retrieve content in Spanish or French. Option D is wrong because cohere.command-r-plus-v1:0 is a large language model for generation and reasoning, not an embedding model, and it does not provide vector representations needed for retrieval.

770
MCQ · medium

A data scientist has fine-tuned a Cohere Command R model using the T-Few technique. They now need to deploy this custom model for real-time inference with low latency. What is the recommended deployment option in OCI Generative AI?

A. Use the OCI Generative AI Playground to test the model
B. Provision a dedicated AI cluster and host the fine-tuned model on it
C. Use the shared infrastructure endpoint with an API call
D. Create an InferenceClient pointing to the fine-tuned model directly without a cluster
Answer: B

Dedicated AI clusters allow you to deploy your own fine-tuned models with low latency and dedicated compute resources, suitable for production real-time inference.

Why this answer

Dedicated AI clusters provide isolated, low-latency inference for custom fine-tuned models. Shared infrastructure is multi-tenant and may have variable latency; on-demand inference does not support custom models directly.

771
Multi-Select · medium

A developer is building a RAG application using OCI Generative AI Agents. They want to ensure the agent only retrieves information from approved documents in a specific compartment. Which THREE steps are required?

Select 3 answers
A. Provision a dedicated AI cluster in the same compartment
B. Create a knowledge base that references the Object Storage bucket
C. Use the Embedding API to manually generate embeddings for each document
D. Create an IAM policy that allows the agent to read objects in the specific compartment
E. Store the approved documents in an Object Storage bucket located in that compartment
Answers: B, D, E

The knowledge base is the bridge between the agent and the data; it must be configured to use that bucket.

Why this answer

Three steps are required: storing the approved documents in an Object Storage bucket in the correct compartment, creating a knowledge base that references that bucket, and writing an IAM policy that lets the agent read objects in that compartment. Provisioning a dedicated AI cluster and calling the Embedding API directly are not necessary steps for the agent.

772
MCQ · medium

An application uses an LLM to summarize legal documents. The summaries sometimes include hallucinations (details not in the original text). Which prompt engineering technique is MOST effective at reducing hallucinations?

A. Increase the temperature to 0.9 to make the model more cautious
B. Use a few-shot prompt with examples of correct summaries
C. Include the full document text in the prompt and instruct the model to base its summary only on that text
D. Set the presence penalty to a high value
Answer: C

Grounding the model with the source text is the most direct way to reduce hallucination.

Why this answer

Providing the full document context within the prompt (e.g., using a template that includes the document text) grounds the model's response and reduces the chance it invents details. Few-shot examples can also help but are secondary to providing the source.

773
MCQ · medium

A machine learning engineer is fine-tuning a model on OCI Data Science and notices that the training loss decreases but then suddenly increases. What is the most likely cause?

A. Reduce model size
B. Add dropout
C. Increase batch size
D. Increase learning rate
Answer: A

Reducing model size reduces capacity and helps prevent overfitting, making it the best solution among given options.

Why this answer

The sudden increase in training loss after a period of decrease is a classic sign of gradient explosion, often caused by an excessively large learning rate. When the learning rate is too high, the optimizer overshoots the minima, causing the loss to diverge. Reducing the learning rate stabilizes training by ensuring smaller, more controlled weight updates.

Exam trap

Oracle often tests the misconception that overfitting is the cause of loss divergence, leading candidates to choose regularization techniques like dropout, when the actual issue is an unstable learning rate causing gradient explosion.

How to eliminate wrong answers

Option A is the keyed answer, although reducing the model size does not directly address loss divergence caused by an overly large learning rate; it lowers capacity, which mainly guards against overfitting, and none of the listed options lowers the learning rate. Option B is wrong because adding dropout is a regularization technique to prevent overfitting, not to fix gradient explosion or learning rate issues. Option C is wrong because increasing batch size can improve gradient stability but does not prevent the loss from spiking due to a high learning rate. Option D is wrong because increasing the learning rate would exacerbate the problem, making the loss diverge further.

774
MCQ · medium

An AI engineer is designing a prompt to generate a report summary. The prompt currently says: 'Summarize the following text.' The output is often too verbose. Which modification would best enforce a concise, bullet-list format?

A. Change the prompt to: 'Summarize the following text in a bullet list of at most 5 items. Each bullet must be under 10 words.'
B. Set temperature to 1.0 for more focused outputs
C. Increase the frequency penalty to 2.0
D. Add an example summary at the end of the prompt
Answer: A

This provides explicit format and length constraints.

Why this answer

Explicitly specifying the output format (bullet list, max 5 items) gives the model clear constraints, reducing verbosity.

775
MCQ · medium

A developer is using the OCI Generative AI Chat API to build a conversational assistant. They want the assistant to adopt a formal tone regardless of user input. Which parameter should they set in the API request?

A. Set a high temperature (e.g., 0.9)
B. Set the system prompt to 'You are a formal assistant that responds in a professional tone.'
C. Set the frequency penalty to a high value
D. Set max tokens to a low value
Answer: B

The system prompt defines the assistant's behavior and tone across the conversation.

Why this answer

The system prompt (or preamble) sets the assistant's behavior consistently. Temperature affects randomness but not tone directly.

776
MCQ · hard

A research institution uses OCI Data Flow to process large-scale document corpora for a RAG system. They want to minimize latency for end-user queries. Which architecture decision would most effectively reduce query latency?

A. Embed documents on-the-fly during query time to ensure freshness.
B. Use a larger, more accurate embedding model.
C. Increase the number of Spark workers for parallel processing of queries.
D. Precompute embeddings offline using OCI Data Flow and store them in an OCI OpenSearch index.
Answer: D

Precomputation removes runtime embedding cost.

Why this answer

Precomputing embeddings offline with OCI Data Flow and storing them in an OCI OpenSearch index eliminates the need to generate embeddings at query time, which is the primary source of latency. This approach shifts the computationally expensive embedding generation to a batch process, allowing queries to perform only a fast vector similarity search against the precomputed index, drastically reducing end-user response time.

Exam trap

The trap here is that candidates often confuse batch processing with real-time processing, assuming that more parallelism (Option C) or a better model (Option B) can solve latency issues, when in fact the fundamental latency reduction comes from moving the expensive embedding computation out of the query path entirely.

How to eliminate wrong answers

Option A is wrong because embedding documents on-the-fly during query time introduces significant latency, as the embedding model must process each document in real time, which is impractical for large-scale corpora and defeats the purpose of minimizing query latency. Option B is wrong because using a larger, more accurate embedding model increases the computational cost and time for each embedding generation, which would actually increase latency rather than reduce it, especially if done at query time. Option C is wrong because increasing the number of Spark workers for parallel processing of queries does not address the bottleneck of embedding generation; Spark workers are used for batch processing in OCI Data Flow, not for real-time query serving, and adding more workers would not reduce the latency of the embedding step itself.

777
MCQhard

An OCI GenAI model generates English to French translation. Which metric is most appropriate to evaluate its quality?

A.Perplexity
B.ROUGE
C.F1 score
D.BLEU
AnswerD

BLEU is the standard metric for translation tasks.

Why this answer

BLEU (Bilingual Evaluation Understudy) is the standard metric for machine translation tasks because it measures the n-gram overlap between the generated translation and one or more reference translations, directly assessing fluency and adequacy. For English-to-French translation, BLEU correlates well with human judgment of translation quality, making it the most appropriate choice.

Exam trap

Oracle often tests the distinction between metrics for generation tasks (BLEU for translation, ROUGE for summarization, perplexity for language modeling) and classification metrics (F1 score), leading candidates to confuse their appropriate domains.

How to eliminate wrong answers

Option A is wrong because perplexity measures how well a language model predicts a sequence of tokens, not the quality of a translation against a reference. Option B is wrong because ROUGE is designed for summarization tasks, focusing on recall of n-grams and longest common subsequences, not translation accuracy. Option C is wrong because F1 score is a classification metric (precision and recall) that does not capture the sequential and lexical alignment required for evaluating translation output.
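
To make the n-gram overlap idea concrete, here is a minimal sketch of a simplified sentence-level BLEU, assuming a single reference, n-grams up to length 2, and whitespace tokenization. The function names and the French sample sentences are illustrative assumptions; production evaluations normally use an established BLEU implementation with up to 4-grams and smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU against one reference (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        precisions.append(overlap / total)
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the clipped n-gram precisions.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical reference and candidate translations.
reference = "le chat est sur le tapis".split()
exact = "le chat est sur le tapis".split()
partial = "le chat est sur la table".split()

score_exact = bleu(exact, reference)
score_partial = bleu(partial, reference)
print(score_exact, round(score_partial, 3))
```

An exact match scores 1.0, while the partial match is penalized for the two wrong words and the bigrams they break.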

778
MCQhard

A team is building a system to detect duplicate customer support tickets. They have a dataset of 10,000 resolved tickets and want to find pairs with similar intent. Which approach would be MOST efficient and effective?

A.Embed each ticket using an embedding model and compute cosine similarity between all pairs
B.Use a long-context LLM to process all tickets in a single prompt
C.Use a generation model like Cohere Command to compare tickets one pair at a time
D.Fine-tune a generation model on a classification task to predict duplicates
AnswerA

Embeddings reduce tickets to dense vectors; cosine similarity allows efficient comparison, and approximate nearest neighbor algorithms can handle large sets.

Why this answer

Option A is correct because embedding each ticket into a dense vector space and computing cosine similarity between all pairs is both efficient and effective for detecting duplicate intents. This approach leverages pre-trained embedding models (e.g., sentence-transformers) that capture semantic similarity, and pairwise cosine similarity scales well for 10,000 tickets (approximately 50 million comparisons) using optimized matrix operations. The number of comparisons is still quadratic, but each one is a cheap vector operation rather than an LLM inference call, so the approach stays fast while preserving high accuracy for intent matching.

Exam trap

Oracle often tests the misconception that using a powerful LLM for every pairwise comparison is the most accurate approach, ignoring the massive computational cost and the fact that embedding similarity is both faster and equally effective for semantic duplicate detection.

How to eliminate wrong answers

Option B is wrong because a long-context LLM cannot process all 10,000 tickets in a single prompt due to context window limits (typically 4K-128K tokens), and even if it could, it would not perform pairwise duplicate detection—it would generate a summary or classification, not identify specific duplicate pairs. Option C is wrong because using a generation model like Cohere Command to compare tickets one pair at a time would require O(n²) LLM calls (50 million for 10,000 tickets), which is computationally prohibitive, slow, and cost-ineffective. Option D is wrong because fine-tuning a generation model on a classification task to predict duplicates is overkill and inefficient; it requires labeled data, significant training resources, and still does not directly solve the pairwise comparison problem—embedding-based similarity is simpler, faster, and equally effective for this unsupervised or semi-supervised task.
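
The pairwise comparison in Option A can be sketched in a few lines. This is a minimal illustration, assuming the embeddings have already been produced by some embedding model; the three-dimensional vectors, the ticket IDs, and the 0.95 threshold are hypothetical, and a real pipeline would use a vectorized library or an approximate nearest neighbor index instead of a Python loop.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def duplicate_pairs(embeddings, threshold):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((id_a, id_b, sim))
    return pairs


# Hypothetical 3-dimensional embeddings; real models return far more dimensions.
tickets = {
    "T1": [0.9, 0.1, 0.0],
    "T2": [0.8, 0.2, 0.0],
    "T3": [0.0, 0.1, 0.9],
}

found = duplicate_pairs(tickets, threshold=0.95)

# Number of unordered pairs among 10,000 tickets: n * (n - 1) / 2.
pair_count_10k = 10_000 * 9_999 // 2
print(found, pair_count_10k)
```

The pair count for 10,000 tickets is 49,995,000, which is where the "approximately 50 million" figure comes from.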

779
MCQmedium

A company wants to use OCI Generative AI to analyze legal documents and extract key clauses. Which model type is best suited for this task?

A.Cohere Command (generate)
B.Cohere Chat
C.Cohere Embed
D.Cohere Summarize
AnswerD

Summarize models are optimized for condensing content, suitable for extracting key clauses.

Why this answer

Cohere Summarize is specifically designed to condense long documents into concise summaries, making it ideal for extracting key clauses from legal documents. Unlike other Cohere models, Summarize focuses on distilling the most important information from text, which aligns with the task of identifying and extracting critical clauses.

Exam trap

Oracle often tests the misconception that any generative model can perform extraction tasks, but the key distinction is that Cohere Summarize is purpose-built for condensation and extraction, whereas other models are designed for generation, conversation, or embedding.

How to eliminate wrong answers

Option A is wrong because Cohere Command (generate) is a text generation model for creating new content, not for extracting or summarizing existing information. Option B is wrong because Cohere Chat is optimized for conversational interactions and multi-turn dialogue, not for document analysis or clause extraction. Option C is wrong because Cohere Embed generates vector embeddings for semantic search or clustering, but does not perform text extraction or summarization.

780
MCQhard

During iterative refinement, a prompt engineer tests two prompt variants on the same 100 inputs and measures accuracy. Variant A yields 85% accuracy, Variant B yields 82%. However, Variant B's outputs are more concise and preferred by users. What should the engineer do NEXT?

A.Select Variant B because user preference outweighs the small accuracy difference
B.Select Variant A because accuracy is the primary metric
C.Run a larger A/B test with statistical significance check before deciding
D.Define clear evaluation criteria that balance accuracy and conciseness, then re-evaluate
AnswerD

Establishing weighted criteria ensures both dimensions are considered and the decision is objective.

Why this answer

Evaluation should be based on multiple criteria beyond accuracy, especially user preference. The engineer should establish a composite metric that includes both accuracy and conciseness.

781
MCQeasy

Your organization uses OCI Data Science to train a generative AI model for code generation. After training, you want to deploy it as a REST API. You create a model deployment using the OCI console, but after 30 minutes the deployment status is still 'Creating'. You check the logs and see the message: 'Insufficient capacity for shape VM.GPU.A10.1 in availability domain AD-1'. The deployment is configured with a single replica. You have verified your tenancy has sufficient service limits for GPU instances. What should you do to resolve this issue quickly?

A.Change the deployment to use a different GPU shape, such as VM.GPU.A10.2
B.Delete the deployment and create it in a different region with more GPU capacity
C.Request a service limit increase for GPU shapes
D.Wait for 1 hour and check again; capacity may become available
AnswerA

A different GPU shape may have available capacity in the same availability domain.

Why this answer

Option A is correct because the error indicates that the specific GPU shape VM.GPU.A10.1 lacks capacity in the current availability domain. Switching to a different GPU shape, such as VM.GPU.A10.2, which uses a different instance configuration, can bypass the capacity constraint without requiring a region change or service limit increase. This is the fastest resolution because it directly addresses the availability domain capacity issue while keeping the deployment in the same region and AD.

Exam trap

The trap here is that candidates confuse service limits with capacity availability, assuming a limit increase will fix the issue, when in fact the error explicitly states 'Insufficient capacity' for the shape, not a limit breach.

How to eliminate wrong answers

Option B is wrong because deleting and recreating in a different region is an overreaction; the capacity issue is specific to the shape and AD, not the region, and moving regions introduces latency and complexity. Option C is wrong because the error is about capacity, not service limits; the user already verified sufficient service limits, so a limit increase would not resolve the immediate capacity shortage. Option D is wrong because waiting does not guarantee capacity will become available; the error indicates a persistent lack of capacity for that specific shape in that AD, and waiting could waste time without resolution.

782
MCQmedium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A.Use a larger foundation model with a longer context window and paste all documents into each prompt
B.Fine-tune a base LLM on the policy documents monthly
C.Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store
D.Train a custom model from scratch on the policy documents each month
AnswerC

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

Option C is correct because Retrieval-Augmented Generation (RAG) allows the chatbot to answer questions from dynamic policy documents without retraining. By indexing the documents in a vector store and retrieving relevant chunks at query time, the system can handle monthly updates simply by re-indexing the new documents, keeping the LLM's weights unchanged.

Exam trap

Oracle often tests the misconception that fine-tuning is the only way to incorporate custom data, but the trap here is that RAG provides a cheaper, more flexible alternative for frequently updated documents without modifying the model.

How to eliminate wrong answers

Option A is wrong because pasting all documents into each prompt would exceed the context window limits of even the largest foundation models (e.g., 128K tokens for GPT-4 Turbo), leading to truncation, high latency, and increased cost per query. Option B is wrong because fine-tuning a base LLM monthly on the policy documents is expensive, time-consuming, and still requires retraining each time the documents change, which the team explicitly cannot afford. Option D is wrong because training a custom model from scratch each month is prohibitively costly and resource-intensive, requiring massive datasets, GPU hours, and ML expertise, far beyond the scope of a simple chatbot.
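
The re-indexing idea behind Option C can be sketched as follows. This is a toy illustration, not an OCI API: the PolicyIndex class, the word-overlap scoring (standing in for vector similarity), and the sample policy sentences are all assumptions made for the example.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words, ignoring basic punctuation."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())


class PolicyIndex:
    """Toy in-memory index; word overlap stands in for vector similarity."""

    def __init__(self):
        self.chunks = []

    def reindex(self, documents):
        # Monthly update: swap the indexed chunks, the model itself is untouched.
        self.chunks = list(documents)

    def retrieve(self, question, top_k=1):
        q = tokenize(question)
        ranked = sorted(self.chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
        return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Inject the retrieved chunks into the prompt sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"


question = "How many days of remote work are allowed?"
index = PolicyIndex()

# Hypothetical January policy.
index.reindex(["Remote work is allowed two days per week.",
               "Expense reports are due within 30 days."])
january_prompt = build_prompt(question, index.retrieve(question))

# Hypothetical February policy: only the index changes.
index.reindex(["Remote work is allowed three days per week.",
               "Expense reports are due within 30 days."])
february_prompt = build_prompt(question, index.retrieve(question))
print(february_prompt)
```

The same question yields a prompt grounded in the current month's policy because only the index was refreshed.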

783
MCQeasy

Which tokenization algorithm is commonly used in models like GPT and BERT and builds tokens by merging the most frequent pairs of characters or subwords iteratively?

A.WordPiece
B.SentencePiece
C.Unigram tokenization
D.Byte-Pair Encoding (BPE)
AnswerD

BPE is the algorithm that iteratively merges the most frequent byte pairs to build a subword vocabulary.

Why this answer

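Byte-Pair Encoding (BPE) is a subword tokenization method that starts with individual characters and merges the most frequent pairs iteratively until a vocabulary size is reached. GPT-family models use BPE; BERT uses the closely related WordPiece, which chooses merges by likelihood gain rather than raw pair frequency, so the "most frequent pairs" wording is what identifies BPE here.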
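
The merge loop can be illustrated with a minimal sketch, assuming a toy corpus of four words with made-up frequencies and character-level starting symbols. The function names are illustrative; real tokenizers operate on bytes and much larger corpora.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word is a tuple of characters mapped to a hypothetical frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(2):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)
print(corpus)
```

After two merges the frequent suffix "est" has become a single symbol, which is how BPE builds subword units out of character pairs.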

784
MCQeasy

A researcher wants to compare the performance of two LLMs on OCI Generative AI: a base model and an instruct model. They notice the instruct model often refuses to generate certain types of content. Which factor most likely explains this behavior?

A.The base model was programmed to follow stricter rules.
B.The instruct model has been fine-tuned with reinforcement learning from human feedback (RLHF) to align with safety guidelines.
C.The instruct model was trained on a smaller dataset.
D.The base model rejects content more often.
AnswerB

RLHF makes instruct models more likely to reject unsafe requests.

Why this answer

Option B is correct because instruct models are typically fine-tuned using reinforcement learning from human feedback (RLHF) to align with safety guidelines and ethical constraints. This fine-tuning process teaches the model to refuse generating harmful, biased, or unsafe content, which explains why the instruct model refuses certain types of content while the base model does not.

Exam trap

Oracle often tests the misconception that refusal behavior is due to dataset size or rule-based programming, when in fact it is a direct result of RLHF-based safety alignment in instruct models.

How to eliminate wrong answers

Option A is wrong because base models are not programmed with explicit rule-based filters; they are trained on large text corpora without specific refusal mechanisms. Option C is wrong because the training dataset size does not directly cause refusal behavior; instruct models are often fine-tuned on smaller, curated datasets but the refusal stems from RLHF alignment, not dataset size. Option D is wrong because base models typically do not reject content more often; they generate outputs freely without the safety alignment that instruct models undergo.

785
MCQmedium

An AI engineer is testing a large language model on OCI Generative AI and receives this error: 'Token limit exceeded. Maximum context length is 4096 tokens.' The prompt is 4000 tokens long. What is the most effective way to resolve the issue without losing important context?

A.Reduce the prompt length by summarizing or trimming less relevant information.
B.Switch to a model with a larger context window, if available.
C.Increase the max_tokens parameter in the API call.
D.Split the prompt into multiple requests and combine outputs.
AnswerA

Reducing prompt length ensures it fits within the token limit while preserving key context.

Why this answer

Option A is correct because the error indicates that the combined prompt and generated output exceed the model's maximum context length of 4096 tokens. Since the prompt alone is 4000 tokens, there is very little room for the model to generate a response. Trimming or summarizing less relevant parts of the prompt directly reduces the token count, allowing the model to produce a complete output without exceeding the limit.

This approach preserves the most critical context while staying within the model's constraints.

Exam trap

Oracle often tests the misconception that increasing max_tokens or switching models can bypass the token limit, but the core issue is the total context length, which is a fixed architectural constraint of the model.

How to eliminate wrong answers

Option B is wrong because switching to a model with a larger context window may not be available in the current environment or may introduce additional costs and latency; the question asks for the most effective way to resolve the issue without losing important context, and reducing the prompt is a more direct and universally applicable solution. Option C is wrong because increasing the max_tokens parameter does not change the total context length limit; it only controls the maximum number of tokens the model can generate, and if the prompt already consumes 4000 tokens, increasing max_tokens would still cause the total to exceed 4096. Option D is wrong because splitting the prompt into multiple requests and combining outputs can lead to loss of coherence and context across the separate calls, and the model does not maintain state between requests, so important relationships between parts of the prompt would be lost.
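
The arithmetic behind the error can be checked with a small sketch. The 4096 and 4000 figures come from the scenario; the 500-token max_tokens value and the 3000-token trimmed prompt are hypothetical values chosen for illustration.

```python
def generation_budget(context_limit, prompt_tokens):
    """Tokens left for the response once the prompt is counted against the limit."""
    return max(context_limit - prompt_tokens, 0)


def fits(context_limit, prompt_tokens, max_tokens):
    """True when prompt plus requested output stays within the context window."""
    return prompt_tokens + max_tokens <= context_limit


CONTEXT_LIMIT = 4096  # from the error message
PROMPT_TOKENS = 4000  # from the scenario

remaining = generation_budget(CONTEXT_LIMIT, PROMPT_TOKENS)

# Hypothetical request asking for 500 output tokens.
original_fits = fits(CONTEXT_LIMIT, PROMPT_TOKENS, max_tokens=500)

# Hypothetical prompt trimmed to 3000 tokens, same 500-token request.
trimmed_fits = fits(CONTEXT_LIMIT, 3000, max_tokens=500)

print(remaining, original_fits, trimmed_fits)
```

With only 96 tokens left for output, raising max_tokens cannot help; shrinking the prompt is what frees room inside the fixed window.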

786
MCQmedium

An administrator notices that a dedicated AI cluster is not scaling down after a period of low traffic. What could be the cause?

A.The cluster has a minimum size set to the current number of nodes
B.There are pending inference requests
C.The cluster is in a compartment without permissions
D.The autoscaling policy uses a cooldown period that is too short
AnswerA

A minimum size setting prevents scaling down below that threshold.

Why this answer

A dedicated AI cluster in OCI has a minimum size configuration that prevents the autoscaler from reducing the node count below that threshold. If the current number of nodes equals the configured minimum, the cluster will not scale down even during low traffic, as the autoscaler respects this lower bound. This ensures baseline capacity is always available for inference workloads.

Exam trap

Oracle often tests the misconception that autoscaling always scales down when traffic is low, without considering the minimum size constraint that overrides scaling policies.

How to eliminate wrong answers

Option B is wrong because pending inference requests would actually prevent scaling down, but the question states the cluster is not scaling down after a period of low traffic, implying no pending requests are present. Option C is wrong because compartment permissions affect resource access and management operations, not the autoscaling behavior of a cluster. Option D is wrong because a cooldown period that is too short would cause the cluster to scale down too aggressively or oscillate, not prevent scaling down entirely.

787
Multi-Selectmedium

A company is using OCI Generative AI Agents to implement a RAG system for employee onboarding. They want to ensure the agent only answers from the uploaded documents and avoids making up information. Which THREE configuration steps should they take?

Select 3 answers
A.Configure the agent to use only the knowledge base and disable internet search
B.Set the preamble to instruct the agent to only answer based on provided context
C.Increase the temperature to 2.0 for more creative responses
D.Create a knowledge base that indexes the onboarding documents
E.Fine-tune the underlying model on the onboarding documents
AnswersA, B, D

This ensures the agent retrieves only from provided documents.

Why this answer

To ground the agent in provided documents, they should use a knowledge base, set a preamble to restrict knowledge, and disable internet search. Fine-tuning is not needed.

788
MCQmedium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A.Train a custom model from scratch on the policy documents each month
B.Use a larger foundation model with a longer context window and paste all documents into each prompt
C.Fine-tune a base LLM on the policy documents monthly
D.Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store
AnswerD

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

RAG (Retrieval-Augmented Generation) allows the LLM to retrieve relevant document sections at inference time, so knowledge stays current without retraining. The other options either require expensive retraining for each update or lack document grounding.

789
MCQeasy

A data scientist needs to generate vector embeddings for a large corpus of text documents to use in a semantic search application. Which OCI service is best suited for this task?

A.OCI Vision
B.OCI Speech
C.OCI Generative AI
D.OCI Language
AnswerC

OCI Generative AI offers embedding models (e.g., Cohere embed) specifically for text.

Why this answer

OCI Generative AI is the correct choice because it provides a managed service for generating vector embeddings from text using large language models (LLMs) like Cohere. This service is specifically designed for tasks such as semantic search, where embeddings capture the meaning of text to enable similarity comparisons. OCI Vision, Speech, and Language focus on other modalities (images, audio, and NLP tasks like sentiment analysis) and do not offer embedding generation for semantic search.

Exam trap

Oracle often tests the misconception that OCI Language can generate embeddings because it handles text, but OCI Language lacks an embedding API, while OCI Generative AI is the only service that provides this capability for semantic search.

How to eliminate wrong answers

Option A is wrong because OCI Vision is designed for image and video analysis (e.g., object detection, OCR), not for generating text embeddings. Option B is wrong because OCI Speech handles audio-to-text transcription and speaker diarization, not text embedding generation. Option D is wrong because OCI Language provides NLP features like sentiment analysis, entity extraction, and text classification, but it does not offer a dedicated embedding API for semantic search; that capability is exclusive to OCI Generative AI.

790
MCQhard

A data scientist is designing a prompt to extract structured information (e.g., JSON) from text using an instruct model on OCI Generative AI. The model sometimes outputs additional text beyond the JSON, breaking parsing. Which prompt engineering technique is most effective to enforce structured output?

A.Use a base model instead of an instruct model.
B.Set the temperature to 0.0 to reduce randomness.
C.Include a few-shot example of the expected JSON output in the prompt.
D.Increase max_tokens to allow for additional output.
AnswerC

Few-shot examples teach the model to output precisely in the desired format.

Why this answer

Option C is correct because few-shot prompting provides explicit examples of the desired output format, which instructs the model to follow the exact JSON structure and reduces the likelihood of extraneous text. This technique leverages the model's in-context learning ability to adhere to formatting constraints, making it the most effective for enforcing structured output in OCI Generative AI instruct models.

Exam trap

Oracle often tests the misconception that lowering temperature or increasing tokens can enforce output format, when in reality only explicit formatting examples (few-shot) reliably constrain the model's output structure.

How to eliminate wrong answers

Option A is wrong because base models lack instruction-following capabilities and are more prone to generating unstructured or irrelevant text, making them less suitable for structured output tasks. Option B is wrong because setting temperature to 0.0 reduces randomness but does not prevent the model from outputting additional explanatory text beyond the JSON; it only makes outputs more deterministic, not format-compliant. Option D is wrong because increasing max_tokens allows more room for additional output, which would exacerbate the problem of extra text beyond the JSON, not solve it.
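
A few-shot prompt for structured output might be assembled like this. This is a minimal sketch: the instruction wording, the example records, and the helper names are assumptions, and no OCI API call is shown. The strict parser demonstrates why extra text around the JSON breaks downstream parsing.

```python
import json


def build_few_shot_prompt(examples, text):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    parts = ["Extract the fields as JSON. Respond with JSON only."]
    for source, expected in examples:
        parts.append(f"Text: {source}\nJSON: {json.dumps(expected)}")
    # End on an open "JSON:" cue so the model continues with the object itself.
    parts.append(f"Text: {text}\nJSON:")
    return "\n\n".join(parts)


def parse_strict(raw):
    """Return the parsed object, or None when the reply is not pure JSON."""
    try:
        return json.loads(raw.strip())
    except json.JSONDecodeError:
        return None


# Hypothetical few-shot examples.
examples = [
    ("Ana Silva joined on 2023-01-15.", {"name": "Ana Silva", "date": "2023-01-15"}),
    ("Start date for Raj Patel: 2024-03-01.", {"name": "Raj Patel", "date": "2024-03-01"}),
]

prompt = build_few_shot_prompt(examples, "Mei Chen was hired on 2022-11-30.")

# Hypothetical model replies: one clean, one with extra chatter.
clean = parse_strict('{"name": "Mei Chen", "date": "2022-11-30"}')
chatty = parse_strict('Sure! Here is the JSON: {"name": "Mei Chen", "date": "2022-11-30"}')
print(prompt)
print(clean, chatty)
```

Validating the reply with a strict parser is a useful safety net even with few-shot examples, since prompting reduces but does not eliminate stray text.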

791
MCQeasy

An OCI AI Language text classification request returns the output shown. Which conclusion is most accurate?

A.The model is uncertain about the sentiment.
B.The text is classified as Positive with high confidence.
C.The API endpoint is misconfigured.
D.The --endpoint parameter is optional.
AnswerB

The label 'Positive' with score 0.98 confirms high-confidence classification.

Why this answer

The output shows a sentiment label of 'Positive' with a confidence score of 0.98, indicating the model is highly confident in its classification. Option B correctly identifies this as a positive sentiment with high confidence, which is the most accurate conclusion based on the provided data.

Exam trap

The trap here is that candidates may misinterpret a high confidence score as uncertainty (Option A) due to a common misconception that AI models always express doubt, but in OCI AI Language, a score near 1.0 explicitly indicates high certainty.

How to eliminate wrong answers

Option A is wrong because a confidence score of 0.98 indicates the model is very certain, not uncertain, about the sentiment. Option C is wrong because the API endpoint is not misconfigured; the request returned a valid response with a sentiment label and confidence score, which would not happen if the endpoint were misconfigured. Option D is wrong because the --endpoint parameter is not optional; it is required to specify the OCI AI Language endpoint for the API call, and its absence would cause a request failure.

792
MCQmedium

A model generates code with security issues. Which approach is best to mitigate this?

A.Reduce max_tokens
B.Increase temperature
C.Use a different model
D.Add a system prompt with security guidelines
AnswerD

System prompts can guide the model to produce secure code.

Why this answer

Adding a system prompt with security guidelines (option D) instructs the model to follow best practices, directly addressing security concerns without changing model training.

793
MCQmedium

A developer is using the OCI Generative AI Chat API to build a multi-turn conversational assistant. They want the assistant to adopt a formal tone throughout the conversation. Which parameter should they set in the API request to achieve this?

A.preamble_override
B.temperature
C.frequency_penalty
D.max_tokens
AnswerA

Preamble override sets the system prompt that defines the assistant's behavior and tone for the entire conversation.

Why this answer

The 'preamble_override' parameter sets the system-level instruction (e.g., 'You are a formal assistant...') that persists across turns. The other options control generation statistics, not system behavior.

794
MCQhard

An ML engineer notices that when using temperature sampling with temperature=0.8 for code generation, the model sometimes produces syntactically incorrect code. The engineer needs to ensure syntactically valid outputs while maintaining some creativity. Which combination of sampling parameters is MOST appropriate?

A.Reduce temperature to 0.2 and use top-p=0.9
B.Increase temperature to 1.2 and use top-k=50
C.Use beam search with width=5
D.Set temperature=1.0 and use greedy decoding
AnswerA

Low temperature sharpens the distribution; top-p limits the token pool to the top 90% probability mass, reducing chances of sampling improbable tokens that break syntax.

Why this answer

Reducing temperature to 0.2 makes the output distribution more peaked, favoring high-probability tokens, which reduces syntax errors. Combining this with top-p=0.9 (nucleus sampling) limits the sampling pool to the smallest set of tokens whose cumulative probability reaches 0.9, further filtering out low-probability tokens that often cause invalid syntax. This balance preserves some creativity while ensuring syntactically valid code.

Exam trap

Oracle often tests the misconception that increasing temperature or using beam search improves output quality, when in fact reducing temperature and using top-p sampling is the standard approach for balancing correctness and creativity in code generation tasks.

How to eliminate wrong answers

Option B is wrong because increasing temperature to 1.2 flattens the probability distribution, making low-probability tokens more likely, which would increase syntax errors, not reduce them. Option C is wrong because beam search with width=5 is a deterministic decoding method that maximizes sequence probability, which can produce repetitive or overly conservative outputs and does not inherently guarantee syntactic validity; it also lacks the stochastic creativity needed. Option D is wrong because setting temperature=1.0 with greedy decoding (temperature=1.0 is effectively no scaling, and greedy decoding always picks the highest-probability token) eliminates creativity entirely and can still produce syntactically incorrect code if the highest-probability token leads to an invalid sequence.
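
The effect of temperature and top-p on a next-token distribution can be seen in a small sketch. The four logits below are hypothetical, and the functions are illustrative rather than any service's actual sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving tokens form a valid distribution.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits

p_08 = softmax_with_temperature(logits, 0.8)
p_02 = softmax_with_temperature(logits, 0.2)

nucleus_08 = top_p_filter(p_08, 0.9)
nucleus_02 = top_p_filter(p_02, 0.9)

print([round(p, 3) for p in p_08], sorted(nucleus_08))
print([round(p, 3) for p in p_02], sorted(nucleus_02))
```

At temperature 0.8 three tokens survive the top-p cut, while at 0.2 the distribution is so peaked that only the top token remains in the nucleus. With a flatter set of logits, several candidates would still survive at 0.2, which is where the residual creativity comes from.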

795
MCQhard

Refer to the exhibit. A developer runs the OCI CLI command and receives the output. However, the text "Hello, how are you?" is actually a mix of English and French words. Why does the model assign only 0.03 to French?

A.The text is overwhelmingly English, so the model assigns a low probability to French.
B.The model is limited to identifying a single language per query.
C.The model cannot detect multiple languages in a single text.
D.The model's scores are normalized to sum to 1, so a high English score forces low others.
AnswerA

The phrase is mostly English, so the model is confident it is English.

Why this answer

Option A is correct because the model's output shows a probability distribution over languages, and the text is predominantly English with only a few French words. The model assigns a low probability (0.03) to French because the overwhelming majority of tokens are English, making the text far more likely to be classified as English. This reflects how language identification models evaluate the overall composition of the input.

Exam trap

Oracle often tests the misconception that normalized probabilities force a single language to dominate, but the trap here is that candidates may think the low French score is an artifact of normalization rather than a reflection of the actual token distribution in the text.

How to eliminate wrong answers

Option B is wrong because the model can output probabilities for multiple languages simultaneously, as shown in the exhibit where both English and French scores are present. Option C is wrong because the model can detect multiple languages in a single text, as evidenced by the non-zero probability assigned to French; it does not have a hard limit of one language per query. Option D is wrong because while scores are normalized to sum to 1, the low French score is due to the actual token composition, not merely a forced consequence of normalization; normalization reflects the relative likelihoods, but the model could assign high scores to multiple languages if the text were genuinely multilingual.

796
MCQhard

A company runs batch inference jobs daily using the OCI Generative AI service. The current cost is higher than expected. Which change would most effectively reduce cost while maintaining throughput?

A.Switch from on-demand to dedicated AI cluster with batch endpoint.
B.Reduce the max token limit for all requests.
C.Use a larger model to reduce retries.
D.Increase the number of parallel requests to improve efficiency.
AnswerA

Dedicated clusters provide lower cost per token for batch workloads and avoid contention.

Why this answer

Switching from on-demand to a dedicated AI cluster with a batch endpoint reduces cost because dedicated clusters provide reserved capacity at a lower per-token rate compared to on-demand pay-per-token pricing, and batch endpoints allow you to process multiple inference requests in a single job, amortizing overhead and reducing idle time. This combination directly addresses the high cost of per-request on-demand pricing while maintaining the same throughput for daily batch jobs.

Exam trap

Oracle often tests the misconception that reducing token limits or increasing parallelism is the most effective cost-saving measure, when in fact the pricing model change from on-demand to dedicated infrastructure yields the greatest savings for predictable batch workloads.

How to eliminate wrong answers

Option B is wrong because reducing the max token limit may lower per-request cost but can degrade output quality or truncate results, and it does not address the underlying pricing model inefficiency for batch workloads. Option C is wrong because using a larger model typically increases cost per token and latency, and retries are not a significant cost driver in batch inference; larger models would worsen, not reduce, cost. Option D is wrong because increasing parallel requests on an on-demand endpoint only changes how quickly the same volume is processed; the per-token pricing structure is unchanged, so the total bill does not drop, and higher concurrency can run into rate limits.

797
MCQhard

A RAG application is hallucinating because the LLM receives irrelevant context from the retrieval step, even when topK is set to 3. Which strategy would best reduce hallucination by improving the relevance of retrieved documents?

A.Reduce the chunk size to one sentence per chunk
B.Add a reranking step after retrieval to select the most relevant chunks
C.Implement a query rewriting mechanism
D.Increase topK to 10 to provide more context
AnswerB

Reranking improves the relevance of the final context set.

Why this answer

Adding a reranking step after retrieval directly addresses the core issue: even with a low topK, the initial retrieval may return chunks that are semantically similar but not precisely relevant to the query. Reranking uses a cross-encoder model to score each retrieved chunk against the query, reordering them so that only the most contextually relevant chunks are passed to the LLM. This reduces the chance of the LLM receiving irrelevant context, thereby minimizing hallucination.

Exam trap

Oracle often tests the misconception that simply adjusting retrieval parameters (like chunk size or topK) can fix relevance issues, when the real solution is a dedicated reranking step that re-evaluates relevance with a more powerful model.

How to eliminate wrong answers

Option A is wrong because reducing chunk size to one sentence per chunk can fragment context and lose necessary supporting information, often making retrieval less coherent and potentially increasing hallucination. Option C is wrong because query rewriting improves the query itself but does not fix the problem of irrelevant chunks already retrieved; it addresses query ambiguity, not retrieval relevance. Option D is wrong because increasing topK to 10 would retrieve more chunks, which could introduce even more irrelevant context and worsen hallucination, not reduce it.
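
The rerank-after-retrieve idea can be sketched in a few lines. This is a minimal illustration, not OCI's or any library's implementation: `overlap_score` is a hypothetical toy scorer standing in for a cross-encoder, and `rerank` and the sample chunks are invented for the example.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words found in the chunk.

    Hypothetical stand-in for a cross-encoder, which would score the
    (query, chunk) pair with a trained model instead.
    """
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank(
    query: str,
    chunks: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_n: int = 2,
) -> List[Tuple[float, str]]:
    """Score every retrieved chunk against the query and keep the best top_n."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Three chunks as they might come back from a first-stage retriever (topK = 3).
retrieved = [
    "Our offices are closed on public holidays.",
    "To reset your password open account settings and choose reset password.",
    "Password policies are reviewed annually by the security team.",
]
best = rerank("how do I reset my password", retrieved, top_n=1)
```

Only the chunks that survive reranking would be placed in the prompt, so loosely related chunks never reach the LLM.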

798
MCQeasy

A user wants to invoke an OCI Generative AI endpoint from a cloud function. What is the required authentication method?

A.API signing key
B.User name and password
C.Session token
D.OCI certificate
AnswerA

API signing key is required for OCI API authentication.

Why this answer

OCI Generative AI endpoints require API signing keys for authentication because they are REST APIs whose requests must be signed with the OCI request signature scheme (Signature Version 1, which uses RSA-SHA256). A cloud function must send a signed Authorization header produced with a user's or service principal's OCI API signing key pair (private key for signing, public key uploaded to OCI) to prove identity and authorization. This is the standard method for programmatic access to OCI services, including Generative AI, and is enforced by the OCI Identity and Access Management (IAM) policy layer.

Exam trap

Oracle often tests the misconception that OCI always uses session tokens or OAuth2 for service-to-service calls, but for Generative AI and most OCI REST APIs, the required method is API signing key authentication, not token-based or certificate-based methods.

How to eliminate wrong answers

Option B is wrong because username and password are used for interactive console login (OCI IAM user password authentication) and are not supported for programmatic API calls from cloud functions; they would expose credentials in code and violate OCI security best practices. Option C is wrong because a session token is a temporary credential obtained via federation or token exchange (e.g., from an identity provider) and is typically used for CLI or SDK sessions, not for direct REST API signing from a cloud function without a token exchange flow. Option D is wrong because OCI certificate authentication (mTLS) is used for specific services like API Gateway or load balancer mutual TLS, not for standard OCI REST API endpoints like Generative AI, which rely on API signing keys.

799
Multi-Selectmedium

A machine learning engineer is designing a RAG pipeline in OCI to improve the accuracy of an LLM-based FAQ bot. Which TWO components are essential for the retrieval phase? (Select TWO.)

Select 2 answers
A.Document chunking
B.Tokenization before the generation step
C.Text generation model
D.A reranker model
E.Embedding model to convert chunks into vectors
AnswersA, E

Documents must be split into chunks for effective retrieval.

Why this answer

Document chunking is essential because it breaks large documents into smaller, manageable pieces that can be individually indexed and retrieved. Without chunking, the retrieval phase would either miss relevant context or return overly large documents that exceed the LLM's context window, reducing accuracy.

Exam trap

Oracle often tests the distinction between retrieval-phase components (chunking and embeddings) and generation-phase components (tokenization and the LLM itself), leading candidates to mistakenly include reranking as essential when it is only an optional refinement.
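
The two retrieval-phase steps, chunking and embedding, can be sketched as follows. This is an illustration under stated assumptions: `chunk_words` splits on whitespace with a fixed word overlap, and `embed` is a hypothetical bag-of-words counter standing in for a real embedding model such as the ones the service hosts.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 5, overlap: int = 2) -> List[str]:
    """Split text into word-based chunks that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(chunk: str, vocabulary: List[str]) -> List[int]:
    """Toy embedding: count of each vocabulary word in the chunk.

    Hypothetical stand-in for a real embedding model, which would return
    a dense vector capturing meaning rather than word counts.
    """
    words = chunk.lower().split()
    return [words.count(term) for term in vocabulary]


chunks = chunk_words("a b c d e f g h", chunk_size=5, overlap=2)
vectors = [embed(chunk, ["a", "d", "h"]) for chunk in chunks]
```

Each chunk gets its own vector, which is what a vector store indexes and searches at query time.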

800
Multi-Selectmedium

Which TWO deployment options are available for using fine-tuned models with OCI Generative AI service?

Select 2 answers
A.Bring Your Own Container (BYOC)
B.Serverless Endpoint
C.On-Demand Endpoint
D.Edge Deployment
E.Managed Dedicated Endpoint
AnswersC, E

On-demand endpoints are for base models but fine-tuned models can also be deployed via dedicated endpoints that use on-demand scaling.

Why this answer

The OCI Generative AI service provides two deployment options for fine-tuned models: On-Demand Endpoint and Managed Dedicated Endpoint. The On-Demand Endpoint (Option C) is a serverless, pay-per-token option that automatically scales, suitable for variable workloads. The Managed Dedicated Endpoint (Option E) provides a dedicated, single-tenant endpoint with guaranteed throughput and lower latency for production workloads.

Exam trap

Oracle often tests the distinction between 'serverless' as a general concept versus the specific named deployment options in OCI Generative AI, leading candidates to incorrectly select 'Serverless Endpoint' as a separate option when it is actually the underlying model for the On-Demand Endpoint.

801
MCQeasy

Which parameter controls the creativity and randomness of a model's output by adjusting the probability distribution before sampling the next token?

A.Frequency penalty
B.Max tokens
C.Temperature
D.Top-k
AnswerC

Temperature directly controls the randomness of token selection.

Why this answer

Temperature scales logits before softmax; higher values increase randomness.
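
A small worked example may make the scaling concrete. It is a sketch of the general technique (divide the logits by the temperature, then apply softmax), not a description of how any particular OCI model implements sampling; the logit values are made up.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Divide logits by the temperature, then normalize with softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # illustrative scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)  # more deterministic
flat = softmax_with_temperature(logits, temperature=2.0)   # more random
```

With the lower temperature the top token takes a larger share of the probability mass; with the higher temperature the distribution flattens and less likely tokens are sampled more often.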

802
MCQeasy

What is the role of the softmax function in the output layer of an LLM?

A.Apply attention
B.Tokenize input
C.Compute gradients
D.Convert logits to probabilities
AnswerD

Softmax normalizes logits into a probability distribution.

Why this answer

The softmax function in the output layer of an LLM converts the raw, unnormalized scores (logits) produced by the final linear layer into a probability distribution over the vocabulary. This allows the model to output a valid probability for each token, where all probabilities sum to 1, enabling sampling or greedy decoding for next-token prediction.

Exam trap

The trap here is that candidates may confuse the role of softmax with other transformer components like attention or tokenization, especially since all are critical to LLM operation, but only softmax directly converts logits to probabilities in the output layer.

How to eliminate wrong answers

Option A is wrong because attention is a mechanism within the transformer architecture (e.g., self-attention in the encoder/decoder blocks) that computes weighted sums of values based on queries and keys, not a function applied in the output layer. Option B is wrong because tokenization is a preprocessing step that splits input text into tokens (e.g., using BPE or WordPiece) before the model processes them, not a function of the output layer. Option C is wrong because gradient computation is part of the backpropagation algorithm during training, not an inference-time operation of the output layer; softmax itself is differentiable but its role is to produce probabilities, not compute gradients.

803
MCQeasy

A user has a prompt that exceeds the model's token limit. What is the best practice to handle this?

A.Summarize the earlier parts of the prompt and include the summary.
B.Increase the max tokens parameter in the API call.
C.Truncate the prompt and hope the model understands.
D.Split the input into multiple calls and merge results.
AnswerA

Correct: Summarization preserves context while reducing token count.

Why this answer

Option A is correct because when a prompt exceeds the model's token limit, the best practice is to summarize the earlier parts of the prompt and include the summary. This preserves the essential context without exceeding the token limit, as the model's context window is fixed (e.g., 4,096 tokens for GPT-3.5 or 8,192 for GPT-4). Summarization reduces token count while retaining key information, enabling the model to process the entire input within its constraints.

Exam trap

Oracle often tests the misconception that increasing the max tokens parameter can extend the input capacity, when in reality it only affects output length, not the fixed context window.

How to eliminate wrong answers

Option B is wrong because increasing the max tokens parameter does not expand the model's context window; it only controls the length of the generated response, not the input prompt limit. Option C is wrong because truncating the prompt arbitrarily removes potentially critical context, leading to incomplete or incorrect model understanding, as the model cannot infer missing information. Option D is wrong because splitting the input into multiple calls and merging results breaks the conversational context; the model has no memory across separate API calls, so the merged output would lack coherence and continuity.

804
MCQmedium

An organization wants to allow its data science group to use OCI Generative AI services but restrict access to a specific compartment. Which IAM policy statement correctly achieves this?

A.allow group data-scientists to use generative-ai-family in compartment GenAI-Prod
B.allow group data-scientists to manage generative-ai-family in tenancy
C.allow group data-scientists to read generative-ai-family in compartment GenAI-Prod
D.allow group data-scientists to use generative-ai-family where target.compartment.id = 'GenAI-Prod'
AnswerA

This grants the necessary permissions within the specified compartment only.

Why this answer

The 'allow group <group_name> to use generative-ai-family in compartment <compartment_name>' policy grants access to all GenAI resources within that compartment. The other options either miss the compartment scope or use incorrect verbs.

805
Multi-Selecteasy

Which TWO actions are required to enable a user to access OCI Generative AI service?

Select 2 answers
A.Create a dedicated AI cluster.
B.Enable the GenAI service in the region's service limits.
C.Subscribe to the GenAI service in the OCI Console.
D.Install the OCI SDK.
E.Ensure the user has the appropriate IAM policy.
AnswersC, E

The service must be enabled for the tenancy.

Why this answer

Option C is correct because to use OCI Generative AI service, a user must first subscribe to the service in the OCI Console. This subscription step activates the service for the tenancy, making it available for use. Without this explicit subscription, the service remains disabled and inaccessible, even if other prerequisites like IAM policies are in place.

Exam trap

The trap here is that candidates often confuse 'enabling a service' with adjusting service limits or creating infrastructure resources, when in OCI many AI services require an explicit subscription step in the Console before they can be used.

806
Multi-Selecthard

A LangChain ReAct agent is failing to correctly use a custom tool that requires a specific parameter format. The agent keeps calling the tool with incorrect parameters. Which THREE steps should the developer take to debug and fix the issue?

Select 3 answers
A.Check the model's token limit
B.Enable verbose logging in the AgentExecutor to see the agent's reasoning steps
C.Test the agent with a minimal example that only uses the problematic tool
D.Improve the tool's description and parameter schema in the tool definition
E.Increase the chunk size in the text splitter
AnswersB, C, D

Verbose mode shows the agent's thoughts and actions.

Why this answer

Option B is correct because enabling verbose logging in the AgentExecutor (by setting `verbose=True`) prints the full chain-of-thought reasoning steps the ReAct agent generates before calling a tool. This allows the developer to see exactly what parameters the LLM is constructing and why it deviates from the expected format, making it the first step in diagnosing the misconfiguration.

Exam trap

Oracle often tests the distinction between debugging agent reasoning (verbose logging) and unrelated hyperparameters (token limits, chunk sizes) to see if candidates understand the ReAct agent's internal execution flow versus general LLM or RAG configuration.

807
Multi-Selectmedium

A team is designing a production-grade LangChain agent that uses multiple tools, including a custom SQL query tool and a web search tool. They need to ensure the agent handles errors gracefully and logs all actions. Which three practices should they implement? (Choose THREE.)

Select 3 answers
A.Use AgentType.ZERO_SHOT_REACT_DESCRIPTION to simplify the agent's reasoning
B.Set max_iterations in AgentExecutor to limit the number of reasoning steps and avoid infinite loops
C.Wrap each tool's execution in try-except blocks to catch exceptions and return meaningful error messages
D.Set verbose=True in the AgentExecutor to show all reasoning steps to end users
E.Add a custom callback handler to log tool inputs and outputs for monitoring
AnswersB, C, E

Limiting iterations prevents the agent from getting stuck in a loop, a common production issue.

Why this answer

Option B is correct because setting `max_iterations` in `AgentExecutor` prevents infinite loops by capping the number of reasoning steps the agent can take. In production, an agent might repeatedly call tools without converging on a final answer, consuming resources and causing timeouts. This parameter directly enforces a hard stop, which is essential for reliability.

Exam trap

Oracle often tests the distinction between debugging features (like `verbose=True`) and production-grade logging (like custom callback handlers), leading candidates to mistakenly choose `verbose=True` as a valid logging practice for end users.
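
The three practices can be illustrated without any framework. The sketch below is plain Python, not the LangChain API: `safe_tool`, `run_agent`, `run_sql`, and `action_log` are hypothetical names, and the list-based log stands in for a custom callback handler.

```python
from typing import Callable, List, Optional, Tuple

# Stand-in for a callback handler: records every tool input and output.
action_log: List[Tuple[str, str, str]] = []


def safe_tool(name: str, func: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a tool so exceptions become error messages and calls are logged."""
    def wrapped(tool_input: str) -> str:
        action_log.append(("input", name, tool_input))
        try:
            output = func(tool_input)
        except Exception as exc:  # return the error to the agent instead of crashing
            output = f"Tool '{name}' failed: {exc}"
        action_log.append(("output", name, output))
        return output
    return wrapped


def run_agent(step: Callable[[int], Optional[str]], max_iterations: int = 3) -> str:
    """Call `step` until it returns an answer or the iteration cap is hit."""
    for iteration in range(max_iterations):
        answer = step(iteration)
        if answer is not None:
            return answer
    return "Stopped: iteration limit reached"


def run_sql(query: str) -> str:
    """Hypothetical SQL tool that only accepts SELECT statements."""
    if not query.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return "3 rows"


sql_tool = safe_tool("sql", run_sql)
ok_result = sql_tool("select * from orders")
error_result = sql_tool("drop table orders")
capped = run_agent(lambda iteration: None, max_iterations=3)
```

The failed call returns a readable message the agent can reason about, both calls are logged, and the loop stops after three steps instead of running indefinitely.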

808
MCQhard

A company is deploying a customer-facing chatbot using OCI Generative AI. They need to prevent the model from generating offensive or harmful content. Which feature should they implement?

A.Custom post-processing to scan each response.
B.Enable OCI Generative AI guardrails with content filtering.
C.Limit the user input length.
D.Use a smaller model that is less capable of generating harm.
AnswerB

Guardrails are designed to filter both input and output for safety.

Why this answer

OCI Generative AI guardrails provide built-in content filtering that automatically detects and blocks offensive or harmful content in both user inputs and model outputs. This is the recommended, native feature for safety compliance, eliminating the need for custom post-processing and ensuring consistent policy enforcement across all chatbot interactions.

Exam trap

Oracle often tests the misconception that limiting input or using a smaller model can prevent harmful outputs, when in fact only dedicated content filtering (guardrails) provides comprehensive, policy-driven safety for both inputs and outputs.

How to eliminate wrong answers

Option A is wrong because custom post-processing adds latency, requires manual maintenance, and duplicates functionality already provided natively by OCI guardrails; it also risks missing edge cases that the built-in filters handle. Option C is wrong because limiting user input length does not prevent the model from generating harmful content in its responses; it only restricts the input, not the output. Option D is wrong because using a smaller model does not guarantee safety—smaller models can still generate offensive content and may lack the nuanced understanding needed for accurate filtering, making guardrails the correct solution.

809
Multi-Selectmedium

Which TWO of the following are common applications of large language models in enterprise settings?

Select 2 answers
A.Summarizing lengthy legal documents.
B.Performing real-time signal processing for audio streams.
C.Generating boilerplate code from natural language descriptions.
D.Replacing relational databases for data storage.
E.Enhancing low-resolution images through super-resolution.
AnswersA, C

LLMs are effective for text summarization.

Why this answer

Option A is correct because large language models (LLMs) excel at abstractive summarization, which involves condensing lengthy legal documents into concise summaries while preserving key facts and legal reasoning. This is a common enterprise application for legal departments, as LLMs can process large volumes of text and generate coherent, context-aware summaries without requiring manual reading.

Exam trap

Oracle often tests the distinction between LLMs' text-based capabilities and specialized AI tasks (e.g., signal processing, image enhancement), leading candidates to mistakenly assume LLMs can handle any AI task due to their broad 'general intelligence' appearance.

810
MCQmedium

A developer is using the OCI GenAI Playground to test a summarisation model. They want the summary to be concise and less creative. Which combination of parameter adjustments would best achieve this?

A.Decrease temperature to 0.1 and increase frequency penalty
B.Set presence penalty to high and temperature to 0.5
C.Increase temperature to 0.9 and decrease frequency penalty to 0
D.Increase max tokens and set stop sequences
AnswerA

Low temperature makes output more focused and deterministic; higher frequency penalty reduces repetitive phrases, yielding concise summaries.

Why this answer

Decreasing temperature to 0.1 makes the model more deterministic and less creative, while increasing the frequency penalty discourages repetition, both of which help produce a concise, less creative summary. This combination directly aligns with the goal of reducing randomness and controlling output length.

Exam trap

Oracle often tests the misconception that increasing penalties or adjusting max tokens alone can control creativity, when in fact temperature is the primary parameter for randomness, and penalties serve different roles (frequency for repetition, presence for novelty).

How to eliminate wrong answers

Option B is wrong because setting presence penalty to high encourages the model to introduce new topics, which can increase creativity and verbosity, and a temperature of 0.5 still allows moderate randomness, neither of which supports a concise, less creative summary. Option C is wrong because increasing temperature to 0.9 increases randomness and creativity, and decreasing frequency penalty to 0 removes the penalty for repetition, both of which lead to more varied and potentially longer outputs. Option D is wrong because increasing max tokens allows longer outputs, and setting stop sequences only controls where generation ends, neither of which directly reduces creativity or ensures conciseness.
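
How a frequency penalty discourages repetition can be sketched as below. The formula used here (subtract the penalty times the number of times a token has already been generated) is one common formulation and is an assumption for illustration; the exact computation in a given OCI model may differ, and the tokens and logits are made up.

```python
from collections import Counter
from typing import Dict, List


def apply_frequency_penalty(
    logits: Dict[str, float], generated: List[str], penalty: float
) -> Dict[str, float]:
    """Lower each token's logit in proportion to how often it already appeared."""
    counts = Counter(generated)
    return {token: logit - penalty * counts[token] for token, logit in logits.items()}


# Illustrative next-token scores and the tokens generated so far.
logits = {"the": 2.0, "report": 1.5, "revenue": 1.4}
generated = ["the", "report", "the"]
adjusted = apply_frequency_penalty(logits, generated, penalty=0.5)
next_token = max(adjusted, key=adjusted.get)
```

Before the penalty the repeated token "the" scores highest; after it, the unused token "revenue" wins, which is the repetition-reducing effect described above.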

811
MCQmedium

A team wants to use OCI Generative AI Agents to build a question-answering system over documents stored in OCI Object Storage. They have created a knowledge base and are ready to test. Which API should they use to interact with the agent for multi-turn conversations?

A.Chat API
B.Embedding API
C.Sessions API
D.Generate API
AnswerC

The Sessions API provides multi-turn conversation management for OCI GenAI Agents.

Why this answer

The Sessions API is used to manage conversational sessions with OCI GenAI Agents, allowing multi-turn interactions. The Chat API is for direct LLM chat without agent capabilities. The Generate API is for single-turn text generation, and the Embedding API is for vector creation.

812
MCQmedium

A developer is building a RAG pipeline using LangChain and Oracle AI Vector Search. After loading and splitting PDF documents, they generate embeddings and store them in Oracle Database using OracleVS. Which method should they call on the vector store object to create a retriever that uses similarity search with a configurable number of results?

A.as_retriever()
B.from_texts()
C.max_marginal_relevance_search()
D.similarity_search()
AnswerA

as_retriever() creates a retriever that uses the vector store's search method.

Why this answer

The as_retriever() method on a vector store returns a retriever object that can be configured with search_kwargs like 'k'.

813
MCQmedium

A team is evaluating two embedding models for a similarity search task. Model A has a higher BERTScore on a reference dataset. Model B has a lower perplexity on the same dataset. Which model is likely better for retrieval?

A.Both are equally good for retrieval
B.Model A, because BERTScore directly measures semantic similarity of embeddings
C.Model B, because lower perplexity indicates better language modeling, which improves retrieval
D.Neither metric is relevant for retrieval tasks
AnswerB

BERTScore is a semantic similarity metric that evaluates the quality of embeddings for capturing meaning, which is crucial for retrieval.

Why this answer

For retrieval tasks, embedding quality is best measured by semantic similarity metrics like BERTScore, which correlate with how well embeddings capture meaning. Perplexity measures language model fluency, not embedding quality.

814
MCQhard

A financial services company deployed a fine-tuned model using OCI Generative AI Service to generate investment advice based on quarterly reports. The model was trained on 10,000 labeled examples and achieved high accuracy in testing. However, after three months in production, the model's outputs have become inconsistent and sometimes recommend investments based on outdated market conditions. The team has received multiple complaints from users about inaccurate advice. The model is deployed on a dedicated AI cluster with auto-scaling disabled. The OCI audit logs show no configuration changes. The team suspects data drift and wants to mitigate it without incurring high costs. They have a pipeline that can collect new labeled data monthly, but it takes two weeks to process. What should the team do?

A.Set up a monthly retraining schedule using the new labeled data as soon as it is available, and use a champion/challenger deployment to validate the new model before full rollout.
B.Decrease the temperature parameter to 0.1 to make outputs more deterministic.
C.Revert to the base model (Cohere Command) and use few-shot prompting with recent reports.
D.Enable auto-scaling on the dedicated AI cluster to handle increased load.
AnswerA

Monthly retraining with fresh data mitigates drift, and champion/challenger ensures safe deployment.

Why this answer

Option A is correct because it directly addresses data drift by establishing a regular retraining cycle with the new labeled data, which is the standard mitigation strategy for model degradation over time. The champion/challenger deployment pattern allows the team to validate the updated model's performance against the current production model before full rollout, ensuring no regression in accuracy. This approach balances cost efficiency (monthly retraining) with the operational constraint of a two-week data processing pipeline.

Exam trap

Oracle often tests the misconception that hyperparameter tuning (like temperature) or infrastructure scaling can fix data drift, when in reality only retraining with fresh, representative data addresses the root cause.

How to eliminate wrong answers

Option B is wrong because decreasing the temperature parameter only affects the randomness of token generation, not the underlying model's knowledge of market conditions; it cannot fix data drift or outdated recommendations. Option C is wrong because reverting to the base model and using few-shot prompting would lose all the domain-specific fine-tuning and would not scale to handle the volume of quarterly reports, nor does it address the root cause of data drift. Option D is wrong because enabling auto-scaling addresses throughput and latency issues, not model accuracy or data drift; the problem is inconsistent outputs due to outdated training data, not insufficient compute resources.

815
MCQeasy

Which model family is NOT currently available in OCI Generative AI service?

A.OpenAI GPT-4
B.Meta Llama
C.Anthropic Claude
D.Cohere
AnswerA

GPT-4 is not part of OCI Generative AI service.

Why this answer

OpenAI GPT-4 is not available in OCI Generative AI service because OCI's native generative AI offerings are built on open-source and partner models like Meta Llama, Anthropic Claude, and Cohere, but not on OpenAI's proprietary models. OCI Generative AI service provides access to models hosted on OCI, and OpenAI GPT-4 is only accessible via Azure OpenAI Service or direct OpenAI API, not through OCI's managed service.

Exam trap

The trap here is that candidates may assume OCI Generative AI service includes all major commercial models like GPT-4, but OCI only supports models from partners that have signed direct agreements with Oracle, excluding OpenAI due to its exclusive partnership with Microsoft Azure.

How to eliminate wrong answers

Option B is wrong because Meta Llama is available in OCI Generative AI service as a supported open-source model family, including Llama 2 and Llama 3 variants, which can be deployed via OCI's managed endpoints. Option C is wrong because Anthropic Claude is available in OCI Generative AI service, specifically Claude 3 models, as part of OCI's partnership with Anthropic for enterprise AI workloads. Option D is wrong because Cohere models, including Command and Embed, are available in OCI Generative AI service as a native offering, with Cohere being a key partner for OCI's AI services.

816
MCQeasy

Which LangChain memory type is best suited for a long-running conversation where token consumption must be minimized, and the gist of previous exchanges should be retained?

A.ConversationBufferMemory
B.VectorStoreMemory
C.ConversationBufferWindowMemory
D.ConversationSummaryMemory
AnswerD

Summary memory periodically summarizes the conversation, significantly reducing token count while maintaining context.

Why this answer

ConversationSummaryMemory (D) is best suited for long-running conversations where token consumption must be minimized because it periodically summarizes the conversation history, retaining the gist of previous exchanges in a compressed form. This avoids storing every raw message (as in BufferMemory) while still preserving context, making it ideal for cost-sensitive or token-limited LLM deployments.

Exam trap

Oracle often tests the distinction between 'retaining the gist' (summarization) versus 'retaining recent messages' (window) or 'retaining everything' (buffer), and the trap here is that candidates confuse ConversationBufferWindowMemory (which drops old context) with a memory that preserves the essence of all prior exchanges.

How to eliminate wrong answers

Option A (ConversationBufferMemory) is wrong because it stores every message in full, leading to unbounded token consumption that grows linearly with conversation length, making it unsuitable for long-running chats. Option B (VectorStoreMemory) is wrong because it is designed for semantic search over large document stores, not for compact summarization of conversation history; it stores embeddings and retrieves chunks, which incurs high token and compute overhead for chat context. Option C (ConversationBufferWindowMemory) is wrong because it only keeps a fixed window of recent messages, discarding older context entirely, so the gist of early exchanges is lost, which fails the requirement to retain the gist of previous exchanges.
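
The difference between the buffer, window, and summary strategies can be shown with a toy model of conversation memory. This is not LangChain code: the three functions are hypothetical, and `toy_summarize` merely truncates each turn, whereas a real summary memory would ask an LLM to write the summary.

```python
from typing import Callable, List


def buffer_context(turns: List[str]) -> List[str]:
    """Keep every turn verbatim (grows with the conversation)."""
    return list(turns)


def window_context(turns: List[str], k: int) -> List[str]:
    """Keep only the last k turns; older turns are dropped entirely."""
    return turns[-k:] if k > 0 else []


def summary_context(
    turns: List[str], summarize: Callable[[List[str]], str], keep_last: int = 1
) -> List[str]:
    """Replace older turns with one summary line and keep the latest turns."""
    split = max(len(turns) - keep_last, 0)
    older, recent = turns[:split], turns[split:]
    if not older:
        return list(recent)
    return [summarize(older)] + list(recent)


def toy_summarize(turns: List[str]) -> str:
    """Hypothetical summarizer: first three words of each turn."""
    return "summary: " + "; ".join(" ".join(turn.split()[:3]) for turn in turns)


def word_count(lines: List[str]) -> int:
    return sum(len(line.split()) for line in lines)


turns = [
    "user: my order 1234 arrived damaged yesterday",
    "bot: sorry to hear that, I can arrange a replacement",
    "user: yes please send a replacement to the same address",
]
full = buffer_context(turns)
recent_only = window_context(turns, k=1)
condensed = summary_context(turns, toy_summarize, keep_last=1)
```

The condensed context is shorter than the full buffer yet still carries a trace of the earlier turns, which the window version loses.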

817
MCQeasy

An organization needs to extract text from PDF documents and convert them into embeddings for a RAG pipeline using OCI. Which OCI service is best suited for extracting text from PDFs?

A.OCI Language
B.OCI Speech
C.OCI Vision
D.OCI Document Understanding
AnswerD

This service provides OCR and text extraction from documents.

Why this answer

OCI Document Understanding is purpose-built for extracting text, tables, and key-value pairs from PDFs and images using pre-trained AI models. It directly supports the text extraction step required to prepare documents for embedding generation in a RAG pipeline, unlike the other services which focus on different modalities or lack native PDF text extraction capabilities.

Exam trap

The trap here is that candidates may confuse OCI Vision's OCR capability with full document text extraction, overlooking that Document Understanding is the dedicated service for extracting structured content from PDFs in a RAG workflow.

How to eliminate wrong answers

Option A is wrong because OCI Language is designed for natural language processing tasks like sentiment analysis and entity extraction, not for extracting raw text from PDF documents. Option B is wrong because OCI Speech is specialized for transcribing audio and speech into text, not for processing PDF files. Option C is wrong because OCI Vision focuses on image analysis (object detection, image classification) and can perform OCR on images, but it is not optimized for extracting structured text from multi-page PDFs, whereas Document Understanding provides a dedicated document parsing pipeline.

818
Multi-Selectmedium

Which TWO statements about OCI Generative AI fine-tuning are true? (Choose two.)

Select 2 answers
A.Fine-tuning adjusts the model's weights based on custom data
B.Fine-tuning can only handle up to 10 examples
C.Fine-tuning permanently alters the base model in OCI
D.Fine-tuning is equivalent to providing few-shot examples in the prompt
E.Fine-tuning requires a dataset of input-output pairs
AnswersA, E

Supervised fine-tuning updates model parameters.

Why this answer

Fine-tuning in OCI Generative AI adjusts the model's weights using custom training data, which allows the model to learn domain-specific patterns and improve performance on targeted tasks. This process modifies the internal parameters of the base model, making option A correct because it directly describes the core mechanism of fine-tuning.

Exam trap

Oracle often tests the distinction between fine-tuning and few-shot prompting, trapping candidates who confuse the two by assuming they are equivalent or that fine-tuning permanently modifies the base model.

819
MCQeasy

Which component of the Transformer architecture allows each token to consider the relevance of every other token in the input sequence?

A.Multi-head attention
B.Self-attention
C.Feed-forward network
D.Positional encoding
AnswerB

Self-attention directly computes relevance weights between every pair of tokens in the input.

Why this answer

Self-attention computes attention scores between all pairs of tokens, enabling the model to capture dependencies across the entire sequence.
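
The pairwise scoring can be sketched in pure Python. This is a simplified illustration of scaled dot-product self-attention: it assumes the queries, keys, and values are the token vectors themselves (no learned projection matrices, no multiple heads), and the three token vectors are made up.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(tokens: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (attention weights, outputs) with Q = K = V = tokens."""
    dim = len(tokens[0])
    # Score every token against every other token (scaled dot product).
    scores = [
        [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        for query in tokens
    ]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all token vectors.
    outputs = [
        [sum(w * value[j] for w, value in zip(weight_row, tokens)) for j in range(dim)]
        for weight_row in weights
    ]
    return weights, outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)
```

Row i of `weights` shows how much token i attends to each token in the sequence, including itself, and every row sums to 1.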

820
MCQeasy

Based on the exhibit, what is the primary action the developer must take to successfully make the inference request?

A.Increase max_new_tokens to 5000 to get a longer response.
B.Ignore the error and retry the request.
C.Reduce max_new_tokens to 2000 to stay within the context length.
D.Switch to a model in a different region.
AnswerC

This reduces total tokens to 8000, within 8192 limit.

Why this answer

The error indicates that the combined length of the input prompt and the requested max_new_tokens exceeds the model's context length limit (8,192 tokens here). Reducing max_new_tokens to 2000 brings the total to 8,000 tokens, which fits within the context window and allows the inference request to succeed. This is a fundamental constraint in transformer-based LLMs: the prompt and the generated tokens together cannot exceed the fixed context size.

Exam trap

Oracle often tests the misconception that max_new_tokens can be increased arbitrarily to get longer responses, without considering the total context length constraint that includes the input prompt.

How to eliminate wrong answers

Option A is wrong because increasing max_new_tokens to 5000 would further exceed the context length, making the error worse. Option B is wrong because ignoring the error and retrying will produce the same failure; the request parameters must be adjusted to comply with the model's limits. Option D is wrong because the error is not related to regional availability; switching regions does not change the model's context length or token limits.
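The token arithmetic behind this question can be sketched in a few lines. The exhibit is not reproduced here, so the 6000-token prompt below is an assumption inferred from the "8000 total, 8192 limit" figures in the answer summary; the function names are illustrative and not part of any OCI SDK.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, context_length: int) -> bool:
    """Return True when the prompt plus the requested output fits in the context window."""
    return prompt_tokens + max_new_tokens <= context_length


def largest_allowed_max_new_tokens(prompt_tokens: int, context_length: int) -> int:
    """Largest max_new_tokens that still fits; 0 if the prompt alone is too long."""
    return max(0, context_length - prompt_tokens)


# Assumed values (the exhibit is missing): a 6000-token prompt and an 8192-token window.
PROMPT_TOKENS = 6000
CONTEXT_LENGTH = 8192

for requested in (5000, 2000):
    ok = fits_context(PROMPT_TOKENS, requested, CONTEXT_LENGTH)
    print(f"max_new_tokens={requested}: total={PROMPT_TOKENS + requested}, fits={ok}")
```

Under these assumed numbers, option A (5000) overshoots the window while option C (2000) fits, which matches the reasoning above.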

821
MCQhard

A company is deploying a RAG pipeline using OCI Data Science and OCI Generative AI. The pipeline uses a Cohere command model for generation and a Cohere embed model for retrieval. The team notices that the model occasionally produces hallucinated answers that are not supported by the retrieved context. Which strategy is MOST effective at reducing hallucinations?

A.Implement a faithfulness verification step that re-ranks retrieved passages based on alignment with the generated answer.
B.Increase the temperature parameter of the generation model.
C.Increase the number of retrieved chunks (k) to provide more context.
D.Use a larger generative model with more parameters.
AnswerA

A verification step can detect and mitigate unsupported claims.

Why this answer

Option A is correct because incorporating a faithfulness check that re-ranks retrieval results can directly filter out unsupported claims. Option B is wrong because increasing temperature may increase randomness and hallucinations. Option C is wrong because more retrieved chunks can introduce conflicting information.

Option D is wrong because a larger model does not guarantee faithfulness and increases cost.
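As a rough illustration of what a faithfulness verification step does, the sketch below flags answer sentences whose content words are poorly covered by every retrieved passage. This is a deliberately crude lexical-overlap stand-in, not the method any OCI service uses; a production check would more likely use an entailment model or an LLM judge. All function names and the 0.6 threshold are assumptions for illustration.

```python
import re


def content_words(text: str) -> set[str]:
    """Lowercased word tokens longer than three characters (a crude stop-word filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def support_score(sentence: str, passages: list[str]) -> float:
    """Best fraction of the sentence's content words found in any single passage."""
    words = content_words(sentence)
    if not words:
        return 1.0  # nothing to verify
    return max((len(words & content_words(p)) / len(words) for p in passages), default=0.0)


def unsupported_sentences(answer: str, passages: list[str], threshold: float = 0.6) -> list[str]:
    """Return the answer sentences whose support score falls below the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if support_score(s, passages) < threshold]


passages = ["The premium plan includes 24/7 phone support and a 99.9% uptime guarantee."]
answer = (
    "The premium plan includes 24/7 phone support. "
    "It also offers free hardware upgrades every year."
)
print(unsupported_sentences(answer, passages))
```

The second sentence is flagged because none of its content words appear in the retrieved passage; a pipeline could then drop it, regenerate, or return a fallback answer.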

822
MCQhard

An AI engineer is deploying a RAG pipeline using OCI Generative AI. They notice the generated answers sometimes include information not present in the retrieved documents. What is the MOST likely cause?

A.The chunking strategy splits sentences across chunks
B.The embedding model is too small to represent document semantics
C.The context window of the generation model is exceeded
D.The generation model is hallucinating because it does not rely solely on the retrieved context
AnswerD

Even with RAG, generation models can ignore or misuse the context, leading to hallucinations.

Why this answer

Option D is correct because the generation model in a RAG pipeline is designed to leverage retrieved context but can still produce outputs not grounded in that context, a phenomenon known as hallucination. This occurs when the model relies on its parametric knowledge or statistical patterns rather than strictly adhering to the retrieved documents, especially if the prompt does not enforce strict grounding or the model's training biases override the context.

Exam trap

The trap here is that candidates often confuse retrieval failures (e.g., poor chunking or embedding) with generation failures, assuming that if the retrieved documents are correct, the model will always use them faithfully, but the core issue is the generation model's tendency to hallucinate when not strictly constrained to the context.

How to eliminate wrong answers

Option A is wrong because splitting sentences across chunks affects retrieval quality and context coherence, but it does not directly cause the model to generate information absent from retrieved documents; it may lead to incomplete or fragmented context, not hallucination. Option B is wrong because the embedding model's size impacts the quality of semantic representation and retrieval accuracy, but a small embedding model would likely cause poor retrieval (missing relevant documents) rather than causing the generation model to invent information not in the retrieved set. Option C is wrong because exceeding the context window would cause truncation of input, potentially losing retrieved context, but the generation model would then operate on incomplete context, not necessarily invent new information; hallucination can occur even within the context window.
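One common mitigation implied by the explanation is a prompt that enforces strict grounding. The template below is a minimal sketch of that idea; the wording and the `build_grounded_prompt` helper are hypothetical, and a grounding instruction reduces hallucination without eliminating it.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that restricts the model to the numbered passages."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered passages below. "
        "Cite the passage number for each claim. "
        "If the passages do not contain the answer, reply exactly: I don't know.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days of purchase.", "Shipping takes 3 to 5 business days."],
)
print(prompt)
```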

823
Multi-Selecthard

A data scientist is evaluating an LLM's performance on a summarization task. They observe that the model produces fluent summaries but often misses key information. Which TWO metrics would best capture this issue? (Select TWO.)

Select 2 answers
A.BLEU score
B.Perplexity
C.Human evaluation with a rubric for completeness
D.ROUGE-L
E.BERTScore
AnswersC, D

Human judgment can directly assess whether key information is included.

Why this answer

ROUGE-L measures recall of the longest common subsequence, capturing information coverage. Human evaluation can assess completeness. BLEU emphasizes precision and fluency.

BERTScore measures semantic similarity but not directly the presence of key points. Perplexity measures model confidence, not recall.
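To make the recall point concrete, here is a small sketch of ROUGE-L recall and precision computed from the longest common subsequence (LCS) of whitespace-separated tokens. It is a simplified illustration (no stemming, no F-measure), and the example sentences are invented.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> tuple[float, float]:
    """Return (recall, precision) based on LCS over lowercased tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    return recall, precision


reference = "the outage began at noon and affected billing and login services"
candidate = "the outage began at noon"
recall, precision = rouge_l(reference, candidate)
print(f"recall={recall:.3f}, precision={precision:.3f}")
```

The candidate is fluent and every word it contains is correct (precision 1.0), yet it covers only 5 of the 11 reference tokens, so recall is low. That gap is the "misses key information" symptom described in the question.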

824
MCQhard

An architect is designing a multi-tenant application using OCI Generative AI. Each tenant has custom instructions and data. To minimize cost while maintaining isolation, which deployment approach is recommended?

A.Dedicated fine-tuned endpoint per tenant.
B.Shared base model with per-tenant system prompts and retrieval.
C.On-premises deployment of open-source models.
D.Single large fine-tuned model with conditional logic.
AnswerB

This approach uses a shared model with tenant-specific prompts and RAG, balancing cost and isolation.

Why this answer

Option B is correct because it leverages a shared base model with per-tenant system prompts and retrieval-augmented generation (RAG) to isolate custom instructions and data without the cost of dedicated endpoints. This approach minimizes compute overhead by reusing a single model instance while maintaining logical isolation through prompt engineering and vector-based retrieval, aligning with OCI's pay-as-you-go pricing model.

Exam trap

The trap here is that candidates often assume fine-tuning is necessary for customization, overlooking that system prompts and retrieval can achieve equivalent isolation at a fraction of the cost, which the exam tests by contrasting dedicated endpoints against shared-model strategies.

How to eliminate wrong answers

Option A is wrong because dedicating a fine-tuned endpoint per tenant multiplies infrastructure costs linearly with each tenant, defeating the cost-minimization goal. Option C is wrong because on-premises deployment of open-source models incurs fixed hardware and operational costs, lacks OCI's managed scaling benefits, and still requires per-tenant isolation mechanisms. Option D is wrong because a single large fine-tuned model with conditional logic introduces coupling between tenants, risking data leakage and making updates or rollbacks complex without true isolation.
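The shared-model pattern in option B can be sketched as follows. Everything here is hypothetical: the in-memory `TENANTS` store, the keyword-overlap retrieval, and the request dictionary stand in for a real vector store and a real call to a shared model endpoint. The point is only that isolation comes from scoping the system prompt and the retrieval to one tenant ID.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    system_prompt: str
    documents: list[str] = field(default_factory=list)


# Hypothetical tenant registry; a real system would use a database and a vector index.
TENANTS = {
    "acme": Tenant(
        "You are Acme's support assistant. Be concise.",
        ["Acme refunds are processed within 5 days."],
    ),
    "globex": Tenant(
        "You are Globex's HR assistant. Be formal.",
        ["Globex employees accrue 2 vacation days per month."],
    ),
}


def retrieve(tenant_id: str, query: str, k: int = 3) -> list[str]:
    """Naive keyword retrieval restricted to a single tenant's documents."""
    terms = set(query.lower().split())
    docs = TENANTS[tenant_id].documents
    ranked = sorted(docs, key=lambda d: len(terms & set(d.lower().split())), reverse=True)
    return ranked[:k]


def build_request(tenant_id: str, query: str) -> dict:
    """Combine the tenant's system prompt and tenant-scoped context for the shared model."""
    tenant = TENANTS[tenant_id]
    return {"system": tenant.system_prompt, "context": retrieve(tenant_id, query), "user": query}


request = build_request("acme", "How long do refunds take?")
print(request)
```

Because `retrieve` only ever reads the documents of the tenant ID it is given, another tenant's data cannot enter the prompt, while every tenant still shares one model.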

825
MCQeasy

Which component of the Transformer architecture allows the model to weigh the importance of different tokens in the input sequence when generating an output?

A.Positional encoding
B.Self-attention mechanism
C.Layer normalization
D.Feed-forward network
AnswerB

Self-attention computes query-key-value dot products to assign importance weights across tokens.

Why this answer

The self-attention mechanism is the core component of the Transformer architecture that computes attention scores between every pair of tokens in the input sequence. These scores determine how much each token should influence the representation of every other token, allowing the model to dynamically weigh the importance of different tokens when generating an output. This is achieved through scaled dot-product attention, where queries, keys, and values are derived from the input embeddings.

Exam trap

The exam often tests the misconception that positional encoding or layer normalization is responsible for weighting token importance, when in fact only the self-attention mechanism performs this dynamic weighting based on content relationships.

How to eliminate wrong answers

Option A is wrong because positional encoding adds information about the position of tokens in the sequence, not about their relative importance; it enables the model to use order but does not weigh token importance. Option C is wrong because layer normalization stabilizes training by normalizing activations across features, but it does not perform any weighting or attention between tokens. Option D is wrong because the feed-forward network applies a non-linear transformation to each token independently after attention, processing individual token representations without considering inter-token relationships.
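The scaled dot-product attention described above can be sketched in plain Python. As a simplifying assumption, the queries, keys, and values below are the raw token vectors themselves; in a real Transformer they come from learned linear projections of the embeddings, and the computation is batched across multiple heads.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this token's query against every token's key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))])
    return outputs, weights


# Three toy 2-dimensional token vectors (assumed; no learned projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, X, X)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Each row of `weights` sums to 1 and shows how much one token draws on every other token, which is the dynamic, content-based weighting that positional encoding, layer normalization, and the feed-forward network do not provide.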
