Knowledge + Practice

Oracle Cloud Infrastructure Generative AI Professional 1Z0-1127 (1Z0-1127) — Questions 151–225

500 questions total · 7pages · All types, answers revealed

Take a mock exam Exam hub

Page 3 of 7

151

Multi-Selecteasy

Which two are essential components of the Transformer architecture? (Select TWO)

Select 2 answers

A.Pooling layers

B.Recurrent connections

C.Self-attention mechanism

D.Feed-forward neural network

E.Convolutional layers

AnswersC, D

Correct: Core component of Transformers.

Why this answer

The self-attention mechanism is essential because it allows each token in the input sequence to attend to every other token, capturing long-range dependencies without the sequential bottleneck of RNNs. This mechanism computes attention scores using queries, keys, and values, enabling parallel processing and forming the core of the Transformer's ability to model context.

Exam trap

Oracle often tests the misconception that Transformers still use recurrence or convolution for sequence processing, when in fact they rely solely on self-attention and feed-forward networks.

Full explanation →

152

MCQmedium

Refer to the exhibit. What is the solution?

A.Use a different base model that supports fine-tuning.

B.Change the learning rate.

C.Increase the training epochs.

D.Use a different compartment.

AnswerA

The error indicates the base model does not support fine-tuning; switch to a supported model.

Why this answer

The exhibit indicates that the base model does not support fine-tuning, which is a prerequisite for adapting a large language model to a specific task or domain. Using a different base model that supports fine-tuning allows the model to be customized through supervised learning on task-specific data, enabling it to learn new patterns and improve performance. This is the correct solution because without fine-tuning capability, the model cannot be effectively adapted regardless of other hyperparameter adjustments.

Exam trap

Oracle often tests the distinction between hyperparameter tuning (learning rate, epochs) and fundamental model capability (fine-tuning support), leading candidates to mistakenly choose a hyperparameter adjustment when the core issue is that the model cannot be fine-tuned at all.

How to eliminate wrong answers

Option B is wrong because changing the learning rate only affects the optimization process during training, but if the base model does not support fine-tuning, no amount of learning rate adjustment will enable the model to be trained on new data. Option C is wrong because increasing the training epochs will not help if the model cannot be fine-tuned at all; epochs only matter when the model is actually being trained or fine-tuned. Option D is wrong because using a different compartment (a tenancy or organizational boundary in Oracle Cloud Infrastructure) does not change the underlying model's architecture or its ability to be fine-tuned; it only affects resource isolation and access control.

Full explanation →

153

MCQhard

A company deploys a large language model on a dedicated AI cluster with 4 nodes. The model requires 128 GB of memory per instance, but the nodes have only 64 GB each. During inference, the nodes experience out-of-memory errors. What is the best solution?

A.Enable model parallelism across nodes

B.Increase the number of nodes to 8

C.Upgrade to higher memory node shapes

D.Reduce the batch size in inference requests

AnswerA

Model parallelism distributes the model across nodes, enabling inference with the available memory.

Why this answer

Model parallelism splits the model's layers or parameters across multiple nodes, allowing the 128 GB model to be distributed across the 4 nodes (each with 64 GB) so that no single node exceeds its memory capacity. This is the best solution because it directly addresses the memory constraint without requiring additional hardware or sacrificing inference throughput, and it is a standard technique for deploying large language models on distributed AI clusters.

Exam trap

Oracle often tests the misconception that scaling out (more nodes) or scaling down (batch size) can fix memory constraints for large models, but the trap here is that the model's parameter memory is fixed and cannot be reduced by batch size changes, and adding more nodes without parallelism still leaves each node unable to host the full model.

How to eliminate wrong answers

Option B is wrong because increasing the number of nodes to 8 does not solve the fundamental issue: each node still has only 64 GB, and the model requires 128 GB per instance; without model parallelism, each node would still try to load the entire model and fail. Option C is wrong because upgrading to higher memory node shapes (e.g., 128 GB per node) would work but is often cost-prohibitive or unavailable, and the question asks for the best solution given the existing cluster; model parallelism is more efficient and scalable. Option D is wrong because reducing the batch size reduces per-request memory usage but does not reduce the model's parameter memory footprint (128 GB), so the model itself still cannot fit into a single node's 64 GB memory.

Full explanation →

154

MCQmedium

A company uses OCI Generative AI Service to build a chatbot for customer support. They notice that the model sometimes generates inappropriate responses. What is the MOST effective way to mitigate this without retraining the model?

A.Fine-tune the model with curated safe examples

B.Configure system instructions to define acceptable behavior

C.Reduce the temperature parameter to 0

D.Use the moderation API to filter responses

AnswerB

System instructions constrain the model's output at inference time without retraining.

Why this answer

Configuring system instructions is the most effective approach because it allows you to define the model's behavior and constraints at inference time without modifying the underlying model weights. In OCI Generative AI Service, system instructions act as a persistent prompt that guides the model's responses, enabling you to explicitly prohibit inappropriate content and enforce safety guidelines. This is a non-invasive, immediate mitigation that does not require the time, cost, or data preparation associated with retraining or fine-tuning.

Exam trap

Oracle often tests the distinction between inference-time controls (like system instructions) and training-time modifications (like fine-tuning), trapping candidates who assume that only retraining can fix behavioral issues, when in fact prompt-level constraints are the fastest and most practical solution for immediate mitigation.

How to eliminate wrong answers

Option A is wrong because fine-tuning requires retraining the model with curated datasets, which is time-consuming, resource-intensive, and contradicts the question's constraint of 'without retraining the model.' Option C is wrong because reducing the temperature to 0 makes the model deterministic and less creative, but it does not prevent inappropriate responses—it only reduces randomness, not the likelihood of generating harmful content based on learned patterns. Option D is wrong because OCI Generative AI Service does not have a built-in 'moderation API' like some other cloud providers; while you could implement a separate content filter, this would be an external post-processing step rather than a direct configuration of the model's behavior, and the question asks for the most effective method within the service itself.

Full explanation →

155

MCQmedium

A team is deploying a generative AI model using OCI Functions for serverless inference. They are experiencing cold start latency of over 10 seconds for the first invocation after idle periods. What is the best strategy to reduce cold start latency?

A.Migrate the inference to OCI Data Flow for better performance.

B.Use provisioned concurrency to keep a set number of function instances warm.

C.Reduce the function timeout to force faster execution.

D.Increase the memory allocation for the function.

AnswerB

Provisioned concurrency eliminates cold start by pre-warming instances.

Why this answer

Option B is correct because OCI Functions supports provisioned concurrency, which keeps a specified number of instances warm. Option A (increasing memory) can reduce cold start but not as effectively. Option C (reducing timeout) might cause failures.

Option D (using OCI Data Flow) is for data processing, not inference.

Full explanation →

156

Multi-Selecteasy

Which TWO are best practices for securing a generative AI endpoint on OCI? (Select TWO)

Select 2 answers

A.Enable OCI Logging for audit

B.Use a public endpoint with IP restrictions

C.Disable authentication for internal use

D.Store API keys in OCI Vault

E.Use a dedicated AI cluster with a private subnet

AnswersD, E

OCI Vault securely manages secrets and API keys.

Why this answer

Option D is correct because OCI Vault provides a secure, centralized service for storing and managing API keys used to authenticate requests to generative AI endpoints. Storing keys in Vault prevents hardcoding them in application code or configuration files, reducing the risk of exposure and enabling automated rotation and access control via IAM policies.

Exam trap

The trap here is that candidates often confuse logging (Option A) with a security control, or assume that IP restrictions (Option B) are sufficient for securing an AI endpoint, when in fact OCI emphasizes private endpoints and authentication as best practices.

Full explanation →

157

MCQmedium

A healthcare organization plans to deploy a RAG application on OCI that handles sensitive patient data. They require that all LLM inference and embedding processing happen within a controlled environment to avoid data leakage to public endpoints. Which OCI feature should they use?

A.OCI Data Labeling

B.OCI Vault

C.OCI Data Masking

D.OCI Dedicated AI Cluster

AnswerD

Dedicated AI Cluster provides isolated compute for AI workloads.

Why this answer

OCI Dedicated AI Cluster provides a private, isolated environment for AI workloads, ensuring data stays within the customer's tenancy. OCI Data Labeling is for labeling. OCI Data Masking is for masking but not for inference isolation.

OCI Vault manages keys, but doesn't isolate inference.

Full explanation →

158

MCQhard

A research team is using OCI Data Science and OCI GenAI to build a multilingual chatbot for customer service. They have training data in English, Spanish, and French. The model currently struggles with code-switching—users often mix languages in a single query (e.g., 'Quiero cancel my order'), and the model responds inconsistently, sometimes in English, sometimes mixing incorrectly. The team wants to improve performance on code-switching while maintaining fluency in each language. They have limited compute resources and cannot deploy separate models per language. Which approach should they take?

A.Train separate fine-tuned models for each language and route queries based on detected language.

B.Fine-tune a multilingual model on a combined dataset that includes code-switching examples.

C.Use language detection to route the query to a specific language model, then translate the response.

D.Use a multilingual embedding model for retrieval to improve context understanding.

AnswerB

This directly trains the model to handle mixed-language inputs and outputs.

Why this answer

Option C is correct because fine-tuning a multilingual model (e.g., Cohere Command with multilingual support) on a combined dataset that includes code-switching examples directly teaches the model to handle mixed-language inputs. Option A is wrong because multilingual embeddings improve retrieval but do not address generation fluency for code-switching. Option B is wrong because training separate models per language would prevent any code-switching capability.

Option D is wrong because language detection and routing is complex, may not handle mixed queries, and could lose cross-lingual context.

Full explanation →

159

MCQhard

During multi-turn conversation with an OCI GenAI model, the model repeats user messages from earlier turns. What is the most likely cause?

A.Low top-p

B.High temperature

C.Low presence penalty

D.High frequency penalty

AnswerC

Low presence penalty means the model is less penalized for repeating topics, leading to repetition.

Why this answer

A low presence penalty reduces the model's incentive to avoid repeating previously mentioned content. In multi-turn conversations, this can cause the model to echo user messages from earlier turns because the penalty is too weak to discourage repetition of tokens that have already appeared in the context window.

Exam trap

Oracle often tests the distinction between presence penalty (which penalizes any occurrence) and frequency penalty (which penalizes based on count), leading candidates to mistakenly think a high frequency penalty causes repetition when it actually prevents it.

How to eliminate wrong answers

Option A is wrong because low top-p limits the cumulative probability mass for token sampling, which reduces diversity but does not directly cause repetition of earlier user messages; it may instead make outputs more deterministic. Option B is wrong because high temperature increases randomness in token selection, which can lead to more creative or even nonsensical outputs, not specifically the repetition of prior user messages. Option D is wrong because a high frequency penalty actively discourages the model from using tokens that have already appeared, which would reduce repetition, not cause it.

Full explanation →

160

MCQhard

A security audit reveals that the RAG application exposes internal documents through the chatbot. The vector search index contains sensitive data. Which action should be taken FIRST to mitigate?

A.Reduce the number of retrieved chunks

B.Implement access control at the OpenSearch index level

C.Redact sensitive terms from documents before embedding

D.Use a different embedding model

AnswerB

Index-level security restricts which documents can be searched by user roles.

Why this answer

Implementing access control at the OpenSearch index level prevents unauthorized users from retrieving sensitive documents. Redaction reduces risk but is less comprehensive. Changing model or reducing chunks does not address the exposure.

Full explanation →

161

Multi-Selectmedium

Which TWO actions are best practices when deploying a RAG application using OCI OpenSearch and OCI Generative AI?

Select 2 answers

A.Embed every document chunk in real-time during query processing.

B.Implement a reranker to improve the relevance of retrieved documents.

C.Use very small chunk sizes (e.g., 50 tokens) to maximize granularity.

D.Monitor query latency and adjust the number of retrieved documents accordingly.

E.Set the LLM temperature to 1.5 to encourage diverse outputs.

AnswersB, D

Improves precision.

Why this answer

Option B is correct because implementing a reranker improves retrieval precision by re-scoring the top-k documents from the initial vector search using a cross-encoder model, which captures deeper semantic relevance than cosine similarity alone. In OCI OpenSearch, this is typically done via a post-processing step with OCI Generative AI or a dedicated reranking model, ensuring only the most contextually relevant chunks are passed to the LLM for generation.

Exam trap

Oracle often tests the misconception that real-time embedding (Option A) is efficient for RAG, when in fact pre-computed embeddings are standard, and that very small chunks (Option C) improve granularity, whereas they actually harm context coherence and retrieval quality.

Full explanation →

162

MCQeasy

A company wants to use OCI Generative AI service to generate marketing copy that adheres to brand guidelines. Which technique should they use?

A.Use model distillation

B.Use prompt engineering with a pre-trained model

C.Use knowledge distillation

D.Fine-tune the model with brand-specific data

AnswerD

Correct: Fine-tuning adjusts model weights to match brand style and guidelines.

Why this answer

Fine-tuning a pre-trained model with brand-specific data (Option D) is the correct approach because it adjusts the model's weights to align with the company's unique brand guidelines, tone, and vocabulary. This supervised learning process ensures the generated marketing copy consistently adheres to specific requirements, unlike prompt engineering which relies on ephemeral instructions that may not reliably enforce brand constraints.

Exam trap

Oracle often tests the distinction between prompt engineering (which is temporary and instruction-based) and fine-tuning (which permanently alters model behavior), leading candidates to choose prompt engineering because it seems simpler, but it fails to guarantee adherence to brand guidelines.

How to eliminate wrong answers

Option A is wrong because model distillation is a technique to compress a large model into a smaller, faster one, not to adapt outputs to brand guidelines. Option B is wrong because prompt engineering with a pre-trained model can guide outputs but does not permanently embed brand-specific rules; the model may still deviate from guidelines without fine-tuned weights. Option C is wrong because knowledge distillation transfers knowledge from a teacher model to a student model for efficiency, not for customizing outputs to brand-specific data.

Full explanation →

163

Multi-Selecthard

Which TWO steps are necessary to deploy a fine-tuned model on a dedicated AI cluster?

Select 2 answers

A.Create the dedicated AI cluster

B.Set up a VCN with public subnet

C.Upload the model artifacts to OCI Object Storage

D.Obtain a third-party license

E.Create a model deployment endpoint

AnswersA, E

The cluster must exist to host the model.

Why this answer

Options A and C are necessary. You must create the dedicated AI cluster and then create a model deployment endpoint pointing to that cluster. Uploading artifacts (B) is typically handled automatically; VCN setup (D) may be needed but not always; third-party license (E) is not required.

Full explanation →

164

MCQhard

A machine learning engineer is deploying a fine-tuned Llama 2 model on OCI Data Science model deployment. The deployment fails with an error: 'Model artifact exceeds the maximum allowed size of 10 GB.' The model files total 12 GB. What is the best approach to resolve this?

A.Store the model in Object Storage and reference it in the deployment configuration

B.Use a different model that is smaller than 10 GB

C.Increase the model deployment artifact size limit via a service request

D.Compress the model artifact to under 10 GB using gzip

AnswerA

Object Storage allows large models and is supported by model deployment.

Why this answer

Option A is correct because OCI Data Science model deployment has a hard limit of 10 GB for the model artifact uploaded directly. By storing the model in Object Storage and referencing it in the deployment configuration, you bypass this limit entirely, as the deployment service can load the model from Object Storage at runtime without requiring the artifact to be part of the deployment package.

Exam trap

Oracle often tests the misconception that you can increase service limits via a support ticket, but for model artifact size, the limit is architectural and not adjustable; candidates may also incorrectly assume compression solves the issue without considering decompression at runtime.

How to eliminate wrong answers

Option B is wrong because it suggests a workaround that may not be feasible; the engineer has already fine-tuned a specific Llama 2 model, and switching to a smaller model would require retraining and may not meet business requirements. Option C is wrong because the 10 GB artifact size limit is a hard platform constraint that cannot be increased via a service request; OCI does not allow raising this limit for model deployments. Option D is wrong because compressing the artifact with gzip does not reduce the actual size of the model files when decompressed; the deployment service would need to decompress them, and the uncompressed size would still exceed the 10 GB limit, causing the same error.

Full explanation →

165

MCQeasy

A company uses OCI Generative AI's chat endpoint with RAG for customer support. They have observed that the model sometimes generates answers that contradict the retrieved context. The retrieved chunks are correct and relevant, but the model ignores them. What configuration change should they implement first?

A.Fine-tune the generation model on customer support dialogues.

B.Increase the number of retrieved chunks from 3 to 5.

C.Add explicit instructions in the system prompt to base answers solely on provided context.

D.Use a different embedding model for retrieval.

AnswerC

A strong prompt can enforce grounding in the retrieved chunks.

Why this answer

Option D is correct because strengthening the system prompt to enforce grounding on the provided context is the most immediate fix. Option A may not help if the model ignores context. Option B is about fine-tuning, which is resource-intensive.

Option C addresses retrieval, not generation behavior.

Full explanation →

166

MCQhard

A company uses an LLM to generate product descriptions. The outputs are consistently too verbose and include irrelevant details. The prompt includes a simple instruction: 'Describe the product.' Which adjustment to the prompt is most likely to yield concise, relevant descriptions?

A.Set temperature to 0.

B.Increase max_tokens to 500.

C.Add constraints like 'Max 30 words. Focus on key features.'

D.Include a few examples of desired short descriptions.

AnswerC

Explicit constraints directly limit length and scope.

Why this answer

Option C is correct because adding explicit constraints like 'Max 30 words. Focus on key features.' directly instructs the LLM to limit verbosity and prioritize relevant details. This technique, known as prompt engineering with constraints, is the most effective way to control output length and content without altering model parameters or relying on examples that may not generalize.

Exam trap

Oracle often tests the misconception that adjusting model parameters (temperature or max_tokens) is the primary way to control output quality, when in fact prompt engineering with explicit constraints is a more direct and reliable method for achieving specific formatting or length requirements.

How to eliminate wrong answers

Option A is wrong because setting temperature to 0 makes the model deterministic (greedy decoding), which reduces randomness but does not inherently shorten or focus the output—it may still produce verbose descriptions. Option B is wrong because increasing max_tokens to 500 actually allows the model to generate longer responses, which is counterproductive to achieving concise descriptions. Option D is wrong because including a few examples (few-shot prompting) can guide the style but does not guarantee brevity; the model may still extrapolate irrelevant details or exceed the desired length without explicit constraints.

Full explanation →

167

MCQmedium

An e-commerce company fine-tuned a Cohere Command model on their product catalog to generate product descriptions. During inference, they notice the model outputs are too repetitive: it often repeats similar phrases across different products, and the descriptions lack diversity. The team wants to increase the variety of the generated text without sacrificing relevance. They are currently using temperature=0.8, top_p=0.9, frequency_penalty=0, and presence_penalty=0. Which parameter adjustment should they make to most effectively increase diversity?

A.Decrease temperature from 0.8 to 0.5.

B.Set frequency_penalty to a negative value (e.g., -0.5).

C.Increase max_tokens from 200 to 500.

D.Increase top_p from 0.9 to 0.95.

AnswerD

Higher top_p includes more tokens in the sampling pool, increasing diversity.

Why this answer

Increasing top_p from 0.9 to 0.95 expands the nucleus of tokens considered during sampling, allowing the model to select from a wider set of plausible next tokens. This directly increases output diversity while still maintaining relevance, as tokens outside the top 90% probability mass are now included. The current settings already have moderate temperature and no penalties, so broadening top_p is the most effective single adjustment to reduce repetitiveness.

Exam trap

Oracle often tests the misconception that increasing temperature always increases diversity, when in fact decreasing temperature reduces randomness, and the most effective lever for diversity in a fine-tuned model is often adjusting top-p or adding a positive frequency penalty.

How to eliminate wrong answers

Option A is wrong because decreasing temperature from 0.8 to 0.5 makes the model more deterministic, reducing randomness and likely increasing repetitiveness, which is the opposite of the desired outcome. Option B is wrong because setting frequency_penalty to a negative value (e.g., -0.5) encourages the model to repeat tokens, exacerbating the repetitiveness problem rather than solving it. Option C is wrong because increasing max_tokens from 200 to 500 only extends the length of generated text; it does not alter the sampling strategy, so the model will continue to repeat phrases within the longer output.

Full explanation →

168

MCQmedium

A financial institution uses OCI GenAI to power a customer support chatbot. The compliance team requires that responses are strictly consistent with regulatory guidelines and approved responses. The company has a curated set of question-answer pairs that cover common scenarios. They want to ensure that the chatbot never deviates from these approved answers. The data science team is considering various approaches to enforce this consistency. Which approach is most effective?

A.Few-shot prompting with three example responses in every query.

B.Fine-tuning the model on the curated dataset of question-answer pairs.

C.Using a large context window to include all regulatory guidelines in the prompt.

D.Setting a low temperature (0.1) to make outputs deterministic.

AnswerB

Fine-tuning adapts the model to mimic the approved responses, providing strong consistency.

Why this answer

Option B is correct because fine-tuning the model on the curated dataset of approved responses teaches the model to output similar responses for related questions, ensuring consistency. Option A is wrong because few-shot prompting may fail for unseen variations and does not guarantee strict adherence. Option C is wrong because using a large context window does not enforce specific content.

Option D is wrong because setting a low temperature reduces randomness but does not guarantee the model will choose approved responses.

Full explanation →

169

MCQmedium

A data scientist is designing a RAG system with a large vector database (hundreds of millions of documents) and requires high recall accuracy. Which vector search index type should be used in OCI Search with OpenSearch?

A.LSH (Locality Sensitive Hashing)

B.Flat (brute-force)

C.HNSW (Hierarchical Navigable Small World)

D.IVF (Inverted File Index)

AnswerC

HNSW offers a good balance of high recall and reasonable latency, suitable for large-scale vector search.

Why this answer

HNSW (Hierarchical Navigable Small World) provides excellent recall and speed for large datasets, making it ideal for high-accuracy requirements.

Full explanation →

170

MCQeasy

A data scientist wants to deploy a fine-tuned LLM on OCI for inference with low latency. Which OCI service should they use?

A.OCI Data Science Notebook Session

B.OCI Generative AI Service (Dedicated AI Cluster)

C.OCI Data Flow

D.OCI Functions

AnswerB

Dedicated AI Cluster is optimized for low-latency inference with reserved resources.

Why this answer

B is correct because OCI Generative AI Service with a Dedicated AI Cluster provides a managed, high-throughput, low-latency inference endpoint for fine-tuned LLMs. It leverages GPU-accelerated infrastructure and optimized serving stacks (e.g., vLLM, TensorRT-LLM) to minimize response times, making it ideal for production inference workloads.

Exam trap

The trap here is that candidates confuse development environments (Notebook Sessions) or general-purpose serverless compute (Functions) with purpose-built inference services, overlooking the need for GPU-accelerated, managed inference endpoints for low-latency LLM deployment.

How to eliminate wrong answers

Option A is wrong because OCI Data Science Notebook Session is an interactive development environment for prototyping and training, not a production-grade inference endpoint; it lacks auto-scaling, load balancing, and dedicated GPU serving for low-latency inference. Option C is wrong because OCI Data Flow is a serverless Apache Spark service designed for batch and stream data processing, not for real-time LLM inference. Option D is wrong because OCI Functions is a serverless compute service for short-lived, stateless functions (max 5-minute timeout) and does not support GPU acceleration or persistent model serving required for low-latency LLM inference.

Full explanation →

171

MCQmedium

A data scientist receives an error when calling the embed_text API: "InvalidRequest: input too long". What is the most likely cause and solution?

A.The model specified is not supported for embeddings; use a different model.

B.The input text exceeds the maximum token limit for the model; truncate the input.

C.The API request rate exceeds the tenancy limit; reduce the request rate.

D.The API key is invalid or expired; regenerate the key.

AnswerB

Embedding models have a fixed maximum input length.

Why this answer

Option C is correct because embedding models have a maximum token input length (e.g., 512 tokens); truncating the input resolves the error. Option A is incorrect because rate limiting returns a 429 status. Option B is incorrect because the API key is not related to input length.

Option D is incorrect because model availability returns a model not found error.

Full explanation →

172

Multi-Selecthard

Which THREE factors should be considered when choosing between fine-tuning a model and using a pre-trained model with prompt engineering? (Select three.)

Select 3 answers

A.Required response time

B.Size of available dataset

C.Internet connectivity

D.Available budget for compute resources

E.Need for domain-specific terminology

AnswersB, D, E

Fine-tuning requires a sufficiently large dataset; prompt engineering can work with few examples.

Why this answer

Option B is correct because the size of the available dataset is a critical factor: fine-tuning requires a sufficiently large, labeled dataset (typically thousands of examples) to adjust model weights effectively, while prompt engineering can work with zero or few examples. If the dataset is too small, fine-tuning risks overfitting and poor generalization, making prompt engineering the safer choice.

Exam trap

Oracle often tests the misconception that response time or internet connectivity are decisive factors, when in reality the core trade-off is between data availability and the need for deep domain adaptation versus lightweight, zero-shot customization.

Full explanation →

173

MCQhard

A healthcare startup is building a chatbot that retrieves patient treatment guidelines using OCI Generative AI Service and OCI OpenSearch. They require that all retrieved documents are from approved sources only and that the system can explain which source was used for each response. Which combination of features should they implement?

A.Add a metadata filter for source_type='approved' in the retrieval step and include document IDs in the context for the model.

B.Rely on the vector search's cosine similarity to rank approved sources higher.

C.Use prompt engineering to ask the model to ignore non-approved sources.

D.Reduce the top-K value to limit the number of retrieved documents.

AnswerA

Metadata filtering enforces source restriction; document IDs provide provenance.

Why this answer

Option A is correct because it directly addresses both requirements: a metadata filter on `source_type='approved'` ensures only approved documents are retrieved from OpenSearch, and including document IDs in the context allows the model to cite the specific source for each response. This approach enforces access control at the retrieval layer while providing traceability, which is essential for compliance in healthcare applications.

Exam trap

The trap here is that candidates may assume semantic similarity or prompt engineering alone can enforce access control, but in RAG systems, retrieval-layer filtering is the only reliable way to restrict document access before the model sees the content.

How to eliminate wrong answers

Option B is wrong because cosine similarity measures semantic relevance, not source approval status; approved and non-approved documents can be equally similar to a query, so ranking by similarity alone cannot guarantee that only approved sources are used. Option C is wrong because prompt engineering cannot reliably filter out non-approved sources; the model may still see and inadvertently use non-approved content in its context, and it has no inherent mechanism to verify source approval. Option D is wrong because reducing the top-K value limits the number of retrieved documents but does not enforce any approval criterion; non-approved documents can still appear in the top-K results if they are semantically similar.

Full explanation →

174

Multi-Selecthard

An enterprise is deploying a generative AI model that must comply with data residency regulations. Which two configurations should they implement? (Select TWO.)

Select 2 answers

A.Set up OCI IAM policies to prevent data egress from the region for the model's resources

B.Enable OCI Logging for all API calls

C.Use OCI Object Storage with cross-region replication for redundancy

D.Store encryption keys in an OCI Vault in a different region

E.Deploy the dedicated AI cluster in the region that meets data residency requirements

AnswersA, E

Correct: IAM policies can restrict access to resources from outside the region.

Why this answer

Option A is correct because OCI IAM policies can explicitly deny data egress from a specific region, ensuring that the generative AI model's resources (such as training data, model artifacts, and inference endpoints) remain within the region that satisfies data residency regulations. This is achieved by writing policy statements that restrict the movement of data across regional boundaries, which is a direct control for compliance.

Exam trap

The trap here is that candidates often confuse data residency enforcement with monitoring or key management, mistakenly selecting logging (Option B) or cross-region replication (Option C) as compliance controls, when only IAM policies and regional deployment directly prevent data movement.

Full explanation →

175

MCQhard

An enterprise wants to deploy a large language model for processing sensitive internal documents. They must ensure that data does not leave their OCI tenancy. Which OCI GenAI deployment option meets this requirement?

A.Using a third-party model via OCI Marketplace

B.Accessing models through OCI Console only

C.Dedicated AI Cluster with on-demand model hosting

D.Using the OCI GenAI API with the default endpoint

AnswerC

A dedicated cluster runs in your own tenancy, providing complete data isolation.

Why this answer

Option B is correct because a Dedicated AI Cluster provides isolated compute resources within the customer's tenancy, ensuring data stays within tenancy boundaries. Option A is wrong because the default API endpoint may use shared infrastructure. Option C is wrong because third-party models via Marketplace may not guarantee data isolation.

Option D is wrong because the console is just a management interface, not a deployment option.

Full explanation →

176

MCQeasy

A company wants to build a customer support chatbot using OCI Generative AI. They have a large number of historical support tickets. Which approach is most effective for leveraging this data to improve the chatbot's responses?

A.Use a pre-loaded prompt template from the OCI console.

B.Fine-tune the Cohere Command model on the historical tickets using OCI Data Science.

C.Increase the temperature parameter to 1.0 to encourage diverse responses.

D.Use zero-shot prompting with the base model and include few-shot examples in the prompt.

AnswerB

Fine-tuning on the company's own support tickets adapts the model to the specific language, context, and resolutions, significantly improving response quality.

Why this answer

Fine-tuning the Cohere Command model on the historical support tickets using OCI Data Science is the most effective approach because it adapts the model's weights to the specific domain language, terminology, and resolution patterns found in the company's data. This supervised learning process creates a specialized model that can generate accurate, context-aware responses for customer support queries, unlike generic prompting methods that lack deep domain adaptation.

Exam trap

Oracle often tests the misconception that increasing temperature or using few-shot examples can substitute for fine-tuning when adapting a model to proprietary domain data, but in reality only fine-tuning modifies model weights to deeply learn domain-specific patterns from large datasets.

How to eliminate wrong answers

Option A is wrong because pre-loaded prompt templates in the OCI console are generic and not trained on the company's specific historical ticket data, so they cannot capture domain-specific nuances or improve response accuracy beyond basic instruction following. Option C is wrong because increasing the temperature parameter to 1.0 maximizes randomness in token selection, which reduces coherence and factual reliability—exactly the opposite of what is needed for a customer support chatbot that requires consistent, accurate answers. Option D is wrong because zero-shot prompting with few-shot examples only provides a few static examples in the context window, which does not modify the model's underlying weights and cannot match the depth of learning achieved by fine-tuning on thousands of historical tickets.

Full explanation →

177

MCQmedium

A company is deploying a fine-tuned Cohere model on OCI Generative AI service for real-time inference. They need to ensure low latency even during demand spikes. Which configuration should they prioritize?

A.Enable model caching on the endpoint.

B.Use a dedicated AI cluster for the endpoint.

C.Use streaming responses.

D.Increase the max tokens parameter.

AnswerB

A dedicated AI cluster with autoscaling ensures consistent low latency under variable load.

Why this answer

A dedicated AI cluster provides isolated compute resources (GPUs) that are not shared with other tenants or workloads, ensuring consistent low latency even under demand spikes. This is critical for real-time inference because shared endpoints can experience resource contention and throttling during high traffic, while a dedicated cluster guarantees predictable performance.

Exam trap

Oracle often tests the misconception that caching or streaming alone can solve latency under load, when in fact only dedicated compute resources guarantee isolation and consistent performance during demand spikes.

How to eliminate wrong answers

Option A is wrong because model caching reduces latency for repeated requests by storing intermediate results, but it does not prevent resource contention during demand spikes; it only helps with cache hits, not with ensuring low latency under sustained high load. Option C is wrong because streaming responses improve perceived latency by sending tokens as they are generated, but they do not address the underlying compute resource availability or prevent queuing delays during spikes. Option D is wrong because increasing the max tokens parameter increases the maximum output length, which can actually increase latency per request and does nothing to handle demand spikes or resource contention.

Full explanation →

178

Multi-Selecthard

A company is deploying a generative AI model on OCI for an internal application that must comply with strict security policies. The model will be accessed by a limited group of users. Which three actions should the administrator take to ensure security? (Choose three.)

Select 3 answers

A.Expose the model endpoint to the internet for ease of access

B.Deploy the model in a private VCN subnet

C.Use IAM policies to restrict model endpoint access to specific users

D.Disable audit logging to minimize storage costs

E.Store model authentication keys in OCI Vault

AnswersB, C, E

A private subnet ensures the endpoint is not publicly accessible.

Why this answer

Deploying the model in a private VCN subnet ensures that the model endpoint is not exposed to the internet, which is a fundamental security requirement for compliance with strict security policies. By placing the model in a private subnet, all traffic must traverse through a bastion host, VPN, or FastConnect, providing network isolation and reducing the attack surface. This aligns with OCI's shared responsibility model where the customer controls network security.

Exam trap

The trap here is that candidates may think exposing the endpoint to the internet is acceptable if IAM policies are used, but network isolation (private subnet) is a separate and mandatory layer of defense that cannot be replaced by IAM alone.

Full explanation →

179

MCQmedium

Refer to the exhibit. The user requested a long story but the response is cut short. What is the most likely cause?

A.The model content filter blocked part of the output.

B.There was a network error during inference.

C.The max_tokens parameter is too low for the requested length.

D.The model is not capable of generating long stories.

AnswerC

max_tokens=100 restricts output length; finish_reason 'length' confirms this.

Why this answer

The finish_reason is 'length' indicating the output hit the max_tokens limit. The model stopped because it reached the token limit.

Full explanation →

180

MCQeasy

A developer wants to generate text embeddings using OCI Generative AI. Which model endpoint should they call?

A.POST /v1/generate

B.POST /v1/summarize

C.POST /v1/embed

D.POST /v1/chat

AnswerC

The embed endpoint returns vector embeddings for input text.

Why this answer

Option C is correct: The embed endpoint is for generating embeddings. Option A (generate) is for text generation. Option B (summarize) is for summarization.

Option D (chat) is for conversation.

Full explanation →

181

MCQmedium

An organization is experiencing low recall in their RAG system. They are using OCI OpenSearch as the vector store with cosine similarity. After reviewing the retrieved chunks, they notice that relevant documents are not being returned. Which configuration change is most likely to improve recall?

A.Use a deterministic ID generator for consistent chunk IDs.

B.Increase the chunk size to provide more context per chunk.

C.Reduce the chunk size to capture more granular information.

D.Switch similarity metric from cosine to Euclidean distance.

AnswerC

Smaller chunks increase the number of vectors and can help retrieve relevant passages that might be buried in larger chunks.

Why this answer

Option A is correct because reducing the chunk size increases the number of chunks and can capture more fine-grained information, improving recall at the cost of precision. Option B is wrong because increasing chunk size may reduce recall by missing details. Option C is wrong because switching to Euclidean distance does not inherently improve recall.

Option D is wrong because using a deterministic ID generator does not affect retrieval quality.

Full explanation →

182

MCQhard

A data scientist observes that their fine-tuned LLM performs well on training data but generates repetitive and dull responses in production. What is the most likely cause and best solution?

A.The model is overfitted; apply stronger regularization

B.The temperature is set too low; increase temperature during inference

C.The training data lacks diversity; add more varied examples

D.The model has too many layers; reduce model size

AnswerB

Low temperature makes outputs deterministic and repetitive; increasing it adds variability.

Why this answer

The model's repetitive and dull responses indicate that the temperature parameter is too low, causing the model to always select the most probable tokens, leading to deterministic and monotonous outputs. Increasing temperature during inference introduces randomness into token sampling, allowing for more diverse and creative responses. This is a common issue in production LLMs where low temperature settings optimized for training metrics fail to produce engaging real-world outputs.

Exam trap

Oracle often tests the misconception that poor production performance is always due to overfitting or data issues, when in fact inference-time hyperparameters like temperature are the direct cause of repetitive/dull outputs.

How to eliminate wrong answers

Option A is wrong because overfitting would cause poor generalization to new inputs, not specifically repetitive/dull outputs; regularization reduces overfitting but does not address the deterministic token selection caused by low temperature. Option C is wrong because while training data diversity affects model knowledge, the described symptom of repetitive outputs in production despite good training performance points to inference-time sampling issues, not data diversity. Option D is wrong because having too many layers might cause overfitting or computational inefficiency, but it does not directly cause repetitive or dull responses; reducing model size would not fix the temperature-related sampling behavior.

Full explanation →

183

MCQhard

A security administrator wrote the above IAM policy for a compartment named MyCompartment. Users in the GenerativeAIUsers group can successfully list dedicated AI clusters and models in MyCompartment, but when they try to create an inference endpoint using a model from a different compartment (SharedModels), they get an authorization error. What is the most likely missing policy statement?

A.ALLOW GROUP GenerativeAIUsers TO MANAGE generative-ai-models IN COMPARTMENT SharedModels

B.ALLOW GROUP GenerativeAIUsers TO USE generative-ai-family IN TENANCY

C.ALLOW GROUP GenerativeAIUsers TO USE generative-ai-dedicated-ai-clusters IN COMPARTMENT SharedModels

D.ALLOW GROUP GenerativeAIUsers TO USE generative-ai-models IN COMPARTMENT SharedModels

AnswerD

This allows them to use models from SharedModels compartment.

Why this answer

The error occurs because the user has permission to list models in MyCompartment but not to use a model from SharedModels when creating an inference endpoint. The missing policy must grant the USE permission on generative-ai-models in the SharedModels compartment, as creating an endpoint requires the ability to reference and use the model resource from that compartment. Option D correctly provides this permission.

Exam trap

Oracle often tests the distinction between 'read' and 'use' permissions, where candidates mistakenly think listing models (read) is sufficient to use them in another resource creation, but OCI requires the 'use' verb for referencing a resource across compartments.

How to eliminate wrong answers

Option A is wrong because it grants MANAGE permission, which is excessive; the user only needs USE permission to reference the model for creating an endpoint. Option B is wrong because it grants USE on the entire generative-ai-family at the tenancy level, which is too broad and not scoped to the specific model resource needed from SharedModels. Option C is wrong because it grants USE on generative-ai-dedicated-ai-clusters in SharedModels, but the error is about using a model, not a dedicated AI cluster.

Full explanation →

184

MCQhard

A global enterprise is deploying a generative AI application that requires high availability across multiple OCI regions. The application must automatically fail over to a secondary region if the primary region becomes unavailable. What is the recommended architecture to achieve this?

A.Deploy endpoints in two regions behind an OCI Load Balancer with cross-region failover

B.Deploy OCI Generative AI endpoints in two regions and use a global DNS round-robin

C.Use OCI Streaming to replicate requests between regions

D.Use DNS failover with a single endpoint in the primary region

AnswerA

OCI Load Balancer can route traffic to a backup region when primary is unhealthy.

Why this answer

Option A is correct because OCI Load Balancer supports cross-region failover by distributing traffic across backend sets in multiple regions, enabling automatic failover to a secondary region when the primary region becomes unavailable. This architecture ensures high availability for generative AI applications by leveraging health checks and failover policies at the load balancer level, which is the recommended approach for multi-region active-passive setups.

Exam trap

Oracle often tests the misconception that DNS-based solutions (like round-robin or simple failover) provide automatic failover with health checks, but in OCI, DNS failover requires manual intervention or additional services like Traffic Management Steering, whereas OCI Load Balancer natively supports automatic cross-region failover.

How to eliminate wrong answers

Option B is wrong because global DNS round-robin does not provide automatic failover; it distributes traffic statically and cannot detect regional outages, leading to continued traffic to an unavailable endpoint. Option C is wrong because OCI Streaming is a messaging service for real-time data ingestion and replication, not a traffic routing or failover mechanism for application endpoints. Option D is wrong because DNS failover with a single endpoint in the primary region lacks a secondary region for failover, offering no high availability if the primary region fails.

Full explanation →

185

Multi-Selecteasy

A data scientist is preparing to fine-tune a foundation model on OCI. Which two actions should they take to optimize costs? (Select TWO.)

Select 2 answers

A.Use the smallest model that meets accuracy requirements

B.Use a single OCPU shape to minimize per-hour cost

C.Use spot preemptible instances to save on compute

D.Monitor fine-tuning progress and stop early if validation loss plateaus

E.Store training data in Archive Storage to reduce storage costs

AnswersA, D

Correct: Smaller models require less compute and memory.

Why this answer

Option A is correct because using the smallest model that meets accuracy requirements directly reduces the number of parameters and computational operations required during fine-tuning. On OCI, larger models consume significantly more GPU memory and compute hours, so selecting the minimal viable model minimizes both training time and associated costs. This aligns with cost optimization best practices for generative AI workloads.

Exam trap

Oracle often tests the misconception that spot/preemptible instances are universally cost-effective for all AI workloads, but in OCI, they are not supported for interactive or stateful fine-tuning jobs, making Option C a classic distractor.

Full explanation →

186

Multi-Selecteasy

Which THREE OCI Generative AI service features help in controlling the cost of API calls? (Select three.)

Select 3 answers

A.Using a smaller model

B.Increasing temperature

C.Using stop sequences

D.Setting max_tokens limit

E.Enabling response streaming

AnswersA, C, D

Smaller models have lower per-token pricing.

Why this answer

Setting max_tokens limits output length, using a smaller model reduces cost per token, and using stop sequences can end generation early to save tokens.

Full explanation →

187

Multi-Selectmedium

Which TWO configurations are required to use a custom fine-tuned model on OCI Gen AI?

Select 2 answers

A.A security list

B.A dedicated AI cluster

C.Training data in Object Storage

D.An API key

E.A serverless endpoint

AnswersB, C

Required for fine-tuning and hosting custom models.

Why this answer

Options A and C are required. A dedicated AI cluster is needed for training and inference, and training data must be stored in Object Storage. Serverless endpoint (B) is optional, API key (D) is always required but not specific to custom models, and security list (E) is network configuration.

Full explanation →

188

MCQeasy

A company wants to build a retrieval-augmented generation (RAG) system using OCI Generative AI and a vector database. Which model type should they use to convert documents into vector embeddings?

A.Instruct model (e.g., cohere.command)

B.Image generation model

C.Embedding model (e.g., cohere.embed)

D.Base model (e.g., cohere.base)

AnswerC

Embedding models produce vector embeddings for similarity search.

Why this answer

Option C is correct because embedding models are specifically designed to generate vector representations of text for retrieval. Option A (instruct models) are for generation. Option B (base models) are for general text generation.

Option D (image models) are for images.

Full explanation →

189

MCQhard

A team is fine-tuning a large language model for a domain-specific Q&A application. After fine-tuning, they observe that the model performs well on the training distribution but struggles with out-of-distribution (OOD) questions. Which approach would best improve OOD robustness?

A.Include a diverse set of examples from related domains in the fine-tuning dataset.

B.Use early stopping based on training loss to avoid overfitting.

C.Reduce the model size to prevent overfitting to the training data.

D.Increase the learning rate during fine-tuning to adapt faster to new patterns.

AnswerA

Diverse data improves generalization and OOD performance.

Why this answer

Option C is correct because incorporating diverse data during fine-tuning helps the model generalize to OOD inputs. Option A is wrong because increasing learning rate may cause catastrophic forgetting. Option B is wrong because reducing model size reduces capacity.

Option D is wrong because early stopping on training loss may not help OOD.

Full explanation →

190

MCQhard

An enterprise is fine-tuning a Cohere model using OCI Generative AI for a domain-specific task. After training, the model shows high accuracy on validation data but poor performance on unseen test data. What is the most likely cause?

A.The training dataset was too small

B.The number of training epochs was too low

C.The model overfitted to training data

D.The learning rate was set too high

AnswerC

High validation accuracy but poor test accuracy is classic overfitting.

Why this answer

Option D is correct: Overfitting occurs when model learns training data too well. Option A (learning rate) is possible but overfitting is more indicative. Option B (dataset size) could cause underfitting, not overfitting.

Option C (epochs) too many can cause overfitting, but the symptom matches D.

Full explanation →

191

Multi-Selecthard

Which THREE factors are important when designing a multi-turn conversational agent using OCI Generative AI Agents?

Select 3 answers

A.Always generate the longest possible response to be thorough.

B.Manage the context window size to avoid truncating important earlier messages.

C.Implement guardrails to detect and filter sensitive topics or harmful intents.

D.Disable logging to reduce latency and cost.

E.Enable session management to maintain conversation history across turns.

AnswersB, C, E

If the context window is too small, the agent may lose track of earlier parts of the conversation.

Why this answer

Managing the context window size is critical because OCI Generative AI Agents have a fixed token limit for the conversation history. If the context window is exceeded, the agent truncates the oldest messages, which can remove essential context from earlier turns, leading to incoherent or incorrect responses. Proper management ensures that the most relevant history is retained without exceeding the model's maximum input length.

Exam trap

Oracle often tests the misconception that longer responses are better for thoroughness, when in fact they degrade performance and user experience, and that disabling logging is a harmless optimization, whereas it removes critical observability and debugging capabilities.

Full explanation →

192

MCQeasy

A developer is testing the OCI Generative AI API by sending a request to generate text using the Cohere Command R model. The request returns the following error: 'The model 'cohere.command-r-08-2024' is not available in this region. Please check the model availability in your region.' The developer is using the us-ashburn-1 region. What is the most likely cause of this error?

A.The request body format is incorrect.

B.The model is not deployed in the us-ashburn-1 region.

C.The model name is misspelled (e.g., 'cohere.command-r-08-2024' vs 'cohere.command-r-08-2024').

D.The API key used in the request is invalid.

AnswerB

Cohere Command R may not be available in all regions; check supported regions in OCI documentation.

Why this answer

The error message explicitly states that the model 'cohere.command-r-08-2024' is not available in the region. OCI Generative AI models are deployed regionally, and the Cohere Command R model is not available in the us-ashburn-1 (Ashburn) region. The developer must select a supported region, such as us-chicago-1, where this model is deployed.

Exam trap

Oracle often tests the misconception that model names must be perfectly spelled or that API keys are the cause of all errors, but here the trap is that candidates overlook regional availability and assume the error is due to a typo or authentication failure.

How to eliminate wrong answers

Option A is wrong because an incorrect request body format would typically result in a 400 Bad Request or validation error, not a model availability error. Option C is wrong because the model name in the error matches the one sent, so a misspelling would cause a different error (e.g., 'model not found'), not a region availability error. Option D is wrong because an invalid API key would result in a 401 Unauthorized or 403 Forbidden error, not a model availability error.

Full explanation →

193

MCQhard

Refer to the exhibit. What is the best action to resolve this error?

A.Decrease the temperature of the generation model

B.Increase the max_tokens parameter for generation

C.Reduce the number of retrieved documents

D.Use a smaller chunk size for documents

AnswerC

Reducing retrieved documents directly decreases the token count from that segment, bringing total under the limit.

Why this answer

The input exceeds the model's context length due to a high number of retrieved document tokens. Reducing the number of documents retrieved (or their size) is the most direct fix.

Full explanation →

194

MCQmedium

A data scientist deployed a fine-tuned Llama 2 7B model on OCI Model Deployment with a single VM.GPU.A10.1 shape. Users report average latency of 3 seconds per request, which is too high for the intended real-time application. The model is used for short text generation (max 128 tokens). The data scientist wants to reduce per-request latency without significant accuracy loss. Which action would be most effective?

A.Increase the number of workers per replica

B.Increase the max_tokens parameter for the model

C.Enable response streaming for the model endpoint

D.Apply 4-bit quantization using AWQ

AnswerD

Quantization reduces model size and inference time with minimal accuracy loss.

Why this answer

4-bit quantization using AWQ reduces the model's memory footprint and computational requirements by compressing weights to 4-bit integers, which directly decreases inference latency on the VM.GPU.A10.1 shape. This technique preserves most of the model's accuracy while enabling faster token generation, making it the most effective single action for reducing per-request latency in a real-time short text generation scenario.

Exam trap

The trap here is that candidates confuse throughput improvements (Option A) or perceived latency (Option C) with actual per-request latency reduction, or mistakenly think increasing max_tokens (Option B) would help, when in fact it worsens the problem.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers per replica does not reduce per-request latency; it only increases throughput by handling more concurrent requests, but each individual request still experiences the same inference time. Option B is wrong because increasing the max_tokens parameter would actually increase latency, as the model would generate more tokens per request, making the problem worse. Option C is wrong because enabling response streaming does not reduce the total time to generate the full response; it only sends tokens incrementally to the client, improving perceived latency but not actual end-to-end latency.

Full explanation →

195

MCQhard

You are a cloud architect at a global e-commerce company. The company is building a RAG-based product support chatbot using OCI Generative AI Service and OCI OpenSearch. The chatbot must answer customer questions in real-time by retrieving from a product knowledge base containing over 10 million documents. The current architecture uses a single vector index with all documents, and the LLM (Cohere Command R+) returns answers in English only. The team observes that queries from non-English customers often return irrelevant results, and the chatbot sometimes fails to generate answers within the 5-second SLA. The leadership wants to support 10 languages and reduce the average response time to under 3 seconds. You need to propose a solution that improves both relevance and latency. Which course of action should you take?

A.Increase the number of OCI OpenSearch nodes and upgrade the LLM to a faster variant.

B.Replace the embedding model with a multilingual model and partition the vector index by language to reduce search space.

C.Translate all non-English queries to English before retrieval and use an English-only embedding model.

D.Implement a caching layer for frequent queries and use a larger LLM for better accuracy.

AnswerB

Multilingual model improves relevance; partitioning improves latency.

Why this answer

Option B is correct because partitioning the vector index by language reduces the search space for each query, directly improving retrieval latency, while using a multilingual embedding model ensures that non-English queries are semantically matched to documents in their original language, improving relevance. This combination addresses both the 3-second SLA and the 10-language requirement without relying on translation, which introduces latency and potential loss of meaning.

Exam trap

The trap here is that candidates often assume translation is the simplest path to multilingual support, overlooking the latency and semantic drift it introduces, and fail to recognize that partitioning the index is a standard optimization for both relevance and speed in large-scale RAG systems.

How to eliminate wrong answers

Option A is wrong because simply scaling nodes and upgrading the LLM does not fix the root cause of irrelevant results for non-English queries—the embedding model remains English-only, so multilingual queries will still map poorly in vector space. Option C is wrong because translating all non-English queries to English before retrieval adds significant latency (often 200-500ms per translation) and can lose cultural or contextual nuances, making it unsuitable for a 3-second SLA and 10-language support. Option D is wrong because a caching layer only helps with repeated queries, not novel ones, and using a larger LLM increases inference latency, making it harder to meet the 3-second target; it also does not address the embedding mismatch for non-English content.

Full explanation →

196

MCQmedium

A user has attached an IAM policy granting access to the generative-ai-family resource type, but API calls to the Generative AI service return a 403 Forbidden error. What is the most likely cause?

A.The service is not enabled in the tenancy.

B.The policy does not include the 'use' verb for the resource type.

C.The user has not created a Dedicated AI Cluster.

D.The policy does not specify the correct compartment or resource.

AnswerD

Policies must include a target compartment; otherwise, access defaults to denial.

Why this answer

Option C is correct because OCI Generative AI resources are compartment-scoped, and if the policy does not specify the correct compartment, access is denied. Option A is incorrect because the policy allows the resource type. Option B is incorrect because Dedicated AI Cluster is not a prerequisite for managed inference.

Option D is incorrect because the policy already includes the service.

Full explanation →

197

MCQmedium

Based on the exhibit, which model is best suited for a conversational chatbot that needs to handle multi-turn dialogues?

A.cohere.embed

B.A model with embeddings capability

C.cohere.base

D.cohere.command

AnswerD

Has 'chat' capability, ideal for multi-turn dialogue.

Why this answer

Option A is correct because cohere.command has the 'chat' capability explicitly listed. Options B and C only have text-generation or embeddings. Option D (embed) is not for generation.

Full explanation →

198

MCQeasy

A startup is building a chatbot for customer support using OCI Generative AI Service. The chatbot needs to answer queries about product features based on a knowledge base of product documentation. Which configuration is most appropriate for this use case?

A.Use the Summarization task type to generate concise answers from the documentation.

B.Use a Cohere Command model with the knowledge base as context in a prompt, and enable retrieval-augmented generation (RAG) via OCI Generative AI Agents.

C.Fine-tune a Llama 2 70B model on the product documentation to create a custom model.

D.Use the Code Generation model to produce SQL queries that retrieve answers from a database.

AnswerB

This approach uses a foundation model with RAG to ground responses in the knowledge base, which is ideal for question answering.

Why this answer

Option B is correct because OCI Generative AI Agents with retrieval-augmented generation (RAG) allows the chatbot to dynamically retrieve relevant chunks from the product documentation knowledge base and inject them as context into a Cohere Command model prompt. This approach ensures answers are grounded in the latest documentation without requiring fine-tuning, and it scales efficiently as the knowledge base grows.

Exam trap

Oracle often tests the distinction between task-specific models (summarization, code generation) and the RAG architecture, leading candidates to mistakenly choose a simpler task type like summarization instead of recognizing the need for retrieval-augmented generation.

How to eliminate wrong answers

Option A is wrong because the Summarization task type is designed to condense a given text into a shorter summary, not to answer specific queries by retrieving and reasoning over a knowledge base; it lacks the retrieval component needed for question answering. Option C is wrong because fine-tuning a Llama 2 70B model on product documentation would be computationally expensive, requires significant labeled data, and does not easily accommodate updates to the documentation without retraining, making it impractical for a dynamic knowledge base. Option D is wrong because Code Generation models are specialized for generating code (e.g., SQL, Python), not for answering natural language questions from a knowledge base; using SQL queries would require a structured database schema, which is not the case for unstructured product documentation.

Full explanation →

199

MCQeasy

A retail company wants to generate product descriptions from attribute data. They have no prior AI experience. Which approach is most appropriate?

A.Use the Cohere Command model with carefully crafted prompts.

B.Train a custom model from scratch.

C.Fine-tune a model on a synthetic dataset.

D.Use the Cohere Embed model to generate embeddings and then decode.

AnswerA

Cohere Command can generate descriptions directly with simple prompts, requiring no additional training.

Why this answer

Using a pre-trained model with prompt engineering is the fastest and most cost-effective way to start.

Full explanation →

200

MCQmedium

A company is deploying a large language model for a customer service chatbot. The model needs to understand industry-specific jargon and maintain low latency. Which approach best balances these requirements?

A.Employ retrieval-augmented generation (RAG) with a general model

B.Rely solely on prompt engineering with a general model

C.Use a large general-purpose LLM with zero-shot prompting

D.Fine-tune a small open-source LLM on domain-specific data

AnswerD

Fine-tuning adapts the model to jargon and a smaller model keeps latency low.

Why this answer

Fine-tuning a small open-source LLM on domain-specific data is the best approach because it adapts the model to understand industry-specific jargon while keeping the model small enough to maintain low latency. Unlike larger models, a fine-tuned small model can run efficiently on local hardware, reducing inference time and avoiding the overhead of external API calls or large model sizes.

Exam trap

Oracle often tests the misconception that larger models always perform better or that RAG alone solves domain adaptation, ignoring the latency and efficiency trade-offs that make fine-tuning a smaller model the optimal choice for production systems with strict response time requirements.

How to eliminate wrong answers

Option A is wrong because retrieval-augmented generation (RAG) with a general model still relies on a general model that may not inherently understand industry-specific jargon, and the retrieval step adds latency, which conflicts with the low-latency requirement. Option B is wrong because relying solely on prompt engineering with a general model does not embed domain-specific knowledge into the model weights, so the model may still misinterpret or fail to generate accurate responses for niche jargon, and it often requires longer prompts that increase latency. Option C is wrong because a large general-purpose LLM with zero-shot prompting has high inference latency due to its size and lacks domain-specific training, making it unsuitable for both understanding jargon and meeting low-latency constraints.

Full explanation →

201

MCQmedium

A developer is using OCI Generative AI Service to generate product descriptions. The outputs are often too generic and lack brand-specific tone. The developer has a small set of 20 high-quality example descriptions. What is the most efficient approach to improve output quality?

A.Fine-tune a base model on the 20 examples.

B.Use few-shot prompting by including the 20 examples in the prompt.

C.Use a more detailed system prompt describing the brand tone.

D.Use chain-of-thought prompting to guide the model step by step.

AnswerB

Few-shot prompting leverages examples without retraining, ideal for small datasets.

Why this answer

Option B is correct because few-shot prompting is the most efficient approach when you have a small set of high-quality examples (20 in this case). It allows the model to infer the desired tone and style directly from the provided examples without requiring any training or fine-tuning, which would be inefficient and potentially ineffective with such a small dataset. In OCI Generative AI Service, few-shot prompting leverages the model's in-context learning capability to adapt its output to the brand-specific tone.

Exam trap

Oracle often tests the misconception that fine-tuning is always the best approach for customization, but candidates overlook the fact that with very small datasets (like 20 examples), few-shot prompting is more practical and efficient than fine-tuning.

How to eliminate wrong answers

Option A is wrong because fine-tuning a base model on only 20 examples is inefficient and unlikely to produce reliable results; fine-tuning typically requires hundreds to thousands of high-quality examples to avoid overfitting and to meaningfully adjust model weights. Option C is wrong because a more detailed system prompt describing the brand tone, while helpful, is less effective than providing concrete examples; the model may still produce generic outputs without specific stylistic references. Option D is wrong because chain-of-thought prompting is designed to improve reasoning and step-by-step logic, not to adapt tone or style; it does not address the core issue of generating brand-specific product descriptions.

Full explanation →

202

MCQeasy

A developer is using a large language model to generate code snippets. The model often produces code that is syntactically correct but functionally incorrect. What is the most effective way to improve the functional correctness of the generated code?

A.Provide few-shot examples of correct code in the prompt.

B.Increase the temperature parameter to generate more creative solutions.

C.Ask the model to only output syntactically valid code.

D.Set max_tokens to a very high value to allow the model more room to think.

AnswerA

Few-shot examples help the model understand the expected output.

Why this answer

Option A is correct because providing few-shot examples of correct code in the prompt directly demonstrates the desired functional behavior to the model. This technique, known as few-shot prompting, grounds the model's output in concrete examples, significantly improving the likelihood that the generated code will be functionally correct by aligning the model's pattern completion with the intended logic, not just syntax.

Exam trap

Oracle often tests the misconception that increasing model parameters like temperature or max_tokens can improve output quality, when in fact these parameters control randomness and length, not functional correctness, which is best addressed through prompt engineering techniques like few-shot learning.

How to eliminate wrong answers

Option B is wrong because increasing the temperature parameter makes the model's output more random and creative, which typically reduces functional correctness by increasing the chance of generating plausible but incorrect logic. Option C is wrong because asking the model to only output syntactically valid code does not address functional correctness; the model already generates syntactically valid code by default, and this instruction does not guide it toward correct logic. Option D is wrong because setting max_tokens to a very high value does not improve reasoning quality; it only allows longer outputs, which can actually increase the risk of generating more irrelevant or incorrect code without improving functional correctness.

Full explanation →

203

MCQhard

A team fine-tunes an embedding model for a legal document RAG system but observes low retrieval recall. Which technique is most likely to improve recall?

A.Use a smaller batch size

B.Use hard negative mining during training

C.Reduce the learning rate

D.Increase the number of fine-tuning epochs

AnswerB

Hard negatives force the model to differentiate between similar but irrelevant documents, improving retrieval discrimination.

Why this answer

Hard negative mining exposes the model to challenging negatives during training, which sharpens the embedding space and improves recall.

Full explanation →

204

MCQmedium

Refer to the exhibit. A developer runs the command and immediately tries to use the endpoint. The application fails with an error indicating the endpoint is not active. What is the most likely reason?

A.The model ID is not available in us-ashburn-1

B.The purpose parameter is misspelled

C.The endpoint is in provisioning state and not yet ready

D.The compartment ID is incorrect

AnswerC

Endpoints take time to provision; using them immediately fails.

Why this answer

The service endpoint creation is asynchronous; the endpoint is initially in a 'provisioning' state and will become active after a few minutes.

Full explanation →

205

MCQmedium

A team uses Cohere's `rerank` endpoint after initial retrieval to improve result quality. What is the main benefit of reranking?

A.It generates new embeddings for chunks

B.It combines multiple queries

C.It reorders chunks by relevance to the query

D.It reduces the number of retrieved chunks

AnswerC

Reranking improves the ordering so the most relevant appear first.

Why this answer

Reranking reorders the initially retrieved chunks by more accurately assessing relevance to the query, improving the quality of the top-k results presented to the LLM. It does not reduce the number of chunks, generate new embeddings, or combine queries.

Full explanation →

206

Multi-Selecteasy

Which TWO statements about large language model (LLM) capabilities are correct?

Select 2 answers

A.LLMs have a fixed context window that cannot be extended.

B.LLMs can perform zero-shot learning without any task-specific training.

C.LLMs understand and reason about code as well as natural language.

D.LLMs always produce factually accurate outputs.

E.LLMs require fine-tuning for every new task.

AnswersB, C

Zero-shot learning is a key capability of LLMs.

Why this answer

Option A is correct because LLMs can perform zero-shot learning without task-specific training, generalizing to unseen tasks. Option C is correct because LLMs like Codex are trained on code and understand programming languages. Option B is incorrect because context windows can be extended via techniques like sliding window or ALiBi.

Option D is incorrect due to hallucination risks. Option E is incorrect because few-shot prompting often suffices without fine-tuning.

Full explanation →

207

MCQmedium

A customer support company uses Cohere Command on OCI to answer user queries. They have enabled grounding with a knowledge base of product manuals. However, for about 20% of queries, the model provides incorrect product recommendations that are not in the manuals. The team has verified the knowledge base is up to date. What is the most likely cause and solution?

A.The model's temperature is too high, causing creative responses. Lower temperature to 0.

B.The model is hallucinating; switch to a larger model.

C.The query phrasing may not match the knowledge base; improve the retrieval system or use query rewriting.

D.The grounding settings are too restrictive; increase the number of retrieved documents.

AnswerC

Correct: Query mismatch causes retrieval of irrelevant content, leading to incorrect recommendations.

Why this answer

Option D is correct because query phrasing may not match the knowledge base, leading to retrieval of irrelevant documents. Improving retrieval or using query rewriting bridges the gap. Option A might help if temperature were high, but the core issue is retrieval.

Option B could introduce noise, and Option C may not solve the grounding issue.

Full explanation →

208

MCQeasy

A user wants to access the OCI Generative AI service programmatically. Which credential method is recommended for use in a production application running on OCI Compute?

A.API signing keys

B.Instance principal

C.User password and OCID

D.Resource principal

AnswerB

Instance principal dynamically obtains credentials via instance metadata service.

Why this answer

Instance principal authentication is the recommended method for production applications running on OCI Compute because it allows the application to authenticate with OCI services without managing or embedding any credentials. The OCI Compute instance assumes a dynamic group and IAM policy that grants it permissions, and the SDK automatically handles token exchange via the instance metadata service, eliminating the need for long-lived secrets.

Exam trap

Oracle often tests the distinction between instance principal (for Compute instances) and resource principal (for serverless or managed services), leading candidates to confuse the two or to incorrectly select API signing keys as the 'most secure' option without considering operational overhead.

How to eliminate wrong answers

Option A is wrong because API signing keys are long-lived secrets that must be securely stored and rotated, which adds operational overhead and risk in a production environment. Option C is wrong because user passwords and OCIDs are intended for interactive console login, not programmatic API access, and they cannot be used with the OCI SDK or CLI for service calls. Option D is wrong because resource principal is used for serverless functions like OCI Functions or for resources like OCI Object Storage buckets, not for a Compute instance running a custom application.

Full explanation →

209

MCQhard

A financial services firm needs to extract named entities from legal contracts using OCI Generative AI. They require high accuracy and must handle domain-specific terminology. Which approach is most effective?

A.Fine-tune a Cohere Command model using a dataset of annotated legal contracts.

B.Use the base Cohere Command model with zero-shot prompting for entity extraction.

C.Use the Cohere Chat model with system prompts describing the entities.

D.Use the Cohere Embed model to generate embeddings and then train a separate classifier.

AnswerA

Fine-tuning adapts the model to the domain and entity types.

Why this answer

Option A is correct because fine-tuning a Cohere Command model on a labeled dataset of legal contracts yields the best accuracy for domain-specific entities. Option B is incorrect because zero-shot extraction is less accurate for specialized terms. Option C is incorrect because using a generic embedding model would require a separate classifier and may underperform.

Option D is incorrect because prompt engineering alone cannot achieve high accuracy for complex entity extraction.

Full explanation →

210

Multi-Selectmedium

Which THREE of the following are likely causes if retrieval returns no results despite documents being indexed in an OCI OpenSearch vector store?

Select 3 answers

A.The embedding model dimension mismatch

B.The k-NN algorithm is misconfigured (e.g., k=0)

C.The query embedding is out of distribution

D.The database connection string is incorrect

E.The index is not fully built or refreshed

AnswersB, D, E

Misconfiguration like k=0 causes no candidates to be returned.

Why this answer

An unbuilt/refreshed index, a misconfigured k-NN algorithm, and an incorrect connection string are common causes of empty retrieval results.

Full explanation →

211

MCQmedium

A team is designing a RAG system for legal document review. They want to ensure that the retrieved chunks are contextually coherent and not truncated mid-sentence. Which chunking strategy should they use?

A.Recursive chunking based on sentence boundaries.

B.Token-level chunking.

C.Semantic chunking using document section headers.

D.Fixed-size character chunking with overlap.

AnswerA

Sentence boundary chunking ensures each chunk contains complete sentences, improving coherence.

Why this answer

Option B is correct because sentence-based chunking preserves semantic boundaries, avoiding mid-sentence truncation. Option A is wrong because fixed-size chunks often cut sentences. Option C is wrong because paragraph-level may be too large.

Option D is wrong because token-level is too fine-grained and loses context.

Full explanation →

212

MCQeasy

Which OCI Generative AI capability allows you to provide example input-output pairs to guide the model's behavior without fine-tuning?

A.Few-shot learning

B.Reinforcement Learning from Human Feedback (RLHF)

C.Prompt engineering

D.Fine-tuning

AnswerA

In-context learning with a few examples steers the model without retraining.

Why this answer

Option A is correct: Few-shot learning uses examples in the prompt. Option B (Fine-tuning) retrains the model. Option C (Prompt engineering) is broader.

Option D (RLHF) uses human feedback.

Full explanation →

213

MCQhard

Refer to the exhibit. A developer has set this policy to allow an OCI Data Science session to generate embeddings. However, the API call returns a 403 Forbidden. Which of the following is likely missing?

A.The policy needs a 'where request.region != ...' condition

B.The policy should include 'in tenancy' instead of compartment

C.The service requires 'manage' permission instead of 'use'

D.The dynamic group does not include the Data Science session

AnswerD

The session must be matched by a rule in the dynamic group for the policy to apply.

Why this answer

The policy is correctly written but it applies to the dynamic group 'RAGGroup'. If the Data Science session is not a member of that dynamic group, the policy has no effect.

Full explanation →

214

Multi-Selectmedium

Which two actions are required when deploying a custom fine-tuned model using the OCI Generative AI service? (Choose two.)

Select 2 answers

A.Configure an API Gateway for the model endpoint

B.Register the model in the OCI Data Science Model Catalog

C.Set up a load balancer for the deployment

D.Upload model artifacts to OCI Object Storage

E.Create a dedicated AI cluster

AnswersD, E

Model artifacts must be stored in Object Storage before deployment.

Why this answer

Option D is correct because deploying a custom fine-tuned model in OCI Generative AI requires the model artifacts (e.g., weights, configuration files) to be stored in OCI Object Storage. The service pulls the artifacts from a designated bucket during deployment. Option E is correct because a dedicated AI cluster must be created to host the model, providing the necessary compute resources for inference.

Exam trap

The trap here is that candidates confuse the OCI Data Science Model Catalog (used for ML model lifecycle management) with the Generative AI service's own model registration, leading them to incorrectly select Option B.

Full explanation →

215

Multi-Selecthard

Which THREE parameters can be adjusted to reduce repetition in generated text? (Choose three.)

Select 3 answers

A.presence_penalty

B.top_k

C.max_tokens

D.frequency_penalty

E.temperature

AnswersA, B, D

Penalizes tokens that have appeared at all.

Why this answer

Options A, C, and D help reduce repetition. B (temperature) increases randomness but can also cause repetition? Actually temperature can reduce repetition but not primarily. Typical repetition reduction uses frequency, presence, and top_k.

But top_k affects diversity. Correct: frequency, presence, and top_k. Temperature affects randomness.

So A, C, D are correct. B is not primarily for repetition.

Full explanation →

216

MCQeasy

Refer to the exhibit. What is the primary reason the response is incomplete?

A.The temperature is not set.

B.The model-id is incorrect.

C.The max-tokens limit is too low.

D.The prompt is too short.

AnswerC

Setting max-tokens to 100 restricts the output length, causing truncation.

Why this answer

The response is incomplete because the max-tokens limit is too low, causing the model to truncate its output before completing the full answer. When the token budget is exhausted, the generation stops mid-sentence or mid-thought, leaving the response unfinished regardless of prompt length or other parameters.

Exam trap

Oracle often tests the distinction between parameters that affect output quality (temperature, top_p) versus those that constrain output length (max_tokens, stop sequences), and the trap here is that candidates mistake a short prompt or missing temperature for the cause of truncation when the real culprit is the token budget.

How to eliminate wrong answers

Option A is wrong because the temperature parameter controls randomness in token selection, not the length or completeness of the response; a missing temperature would default to 1.0 and still allow full output. Option B is wrong because the model-id identifies which LLM to use (e.g., gpt-3.5-turbo or cohere.command-text-v14) and does not affect whether the response is truncated; an incorrect model-id would either fail to load or produce different output, not an incomplete one. Option D is wrong because a short prompt can still yield a complete response; prompt length influences context and relevance, but the max-tokens parameter is the direct limiter of output length.

Full explanation →

217

MCQeasy

A company has implemented a RAG-based chatbot using OCI Generative AI and OCI OpenSearch as the vector store. The chatbot answers questions about internal policies. The team uses a dense vector embedding model with 768 dimensions and the HNSW algorithm. The corpus contains 5 million documents. Users report that the chatbot takes 8-12 seconds to respond, and the answers are often not relevant, missing key policy details. Upon investigation, the team finds that the k-NN search returns results based solely on vector similarity, ignoring exact keyword matches that are critical for policy documents. Which course of action will most effectively improve both response time and relevance?

A.Implement hybrid search using a combination of match (keyword) and k-NN (vector) queries with boosting.

B.Increase the number of OpenSearch data nodes to 5 and use higher-memory instances.

C.Reduce the ef_search parameter to 100 and retrain the embedding model on domain-specific data.

D.Switch to OCI Generative AI's built-in vector store instead of OpenSearch.

AnswerA

Hybrid search enhances relevance by integrating keyword and semantic matching, and pre-filtering can reduce latency.

Why this answer

Hybrid search combines keyword and vector queries, improving relevance by including exact matches. It can also reduce the search space by filtering on keywords, thereby reducing latency. Increasing nodes (A) only addresses speed.

Reducing ef_search (C) may speed up but can reduce recall and does not fix relevance. Using OCI GenAI's built-in vector store (D) is not guaranteed to improve either.

Full explanation →

218

MCQhard

A company deploys a fine-tuned model on an OCI Generative AI dedicated AI cluster. After deployment, they observe high latency during peak hours. The cluster has only one replica. Which action would most effectively reduce latency without increasing cost unnecessarily?

A.Increase the number of replicas to 10.

B.Enable auto-scaling with a maximum of 3 replicas.

C.Switch to a larger base model.

D.Move to a serverless deployment model.

AnswerB

Auto-scaling adjusts to demand; a max of 3 provides headroom without waste.

Why this answer

Enabling auto-scaling with a maximum of 3 replicas (Option B) is the most effective action because it dynamically adds replicas during peak hours to handle increased load, reducing latency, while limiting the maximum to 3 prevents unnecessary cost overruns. This balances performance and cost, unlike a fixed large replica count or switching models, which either wastes resources or fails to address the root cause of insufficient compute capacity.

Exam trap

Oracle often tests the misconception that more replicas always reduce latency, but the trap here is that candidates may choose Option A (10 replicas) without considering cost efficiency, while the correct answer requires balancing performance with cost constraints via auto-scaling.

How to eliminate wrong answers

Option A is wrong because increasing replicas to 10 would significantly raise costs without proportional latency benefits, as the cluster likely doesn't need that many replicas during non-peak hours, leading to idle resource waste. Option C is wrong because switching to a larger base model would increase inference latency and cost due to higher computational requirements, exacerbating the problem rather than solving it. Option D is wrong because moving to a serverless deployment model on OCI Generative AI would introduce cold-start latency and unpredictable scaling behavior, and it may not support fine-tuned models or dedicated cluster features, potentially increasing latency and cost.

Full explanation →

219

MCQmedium

A company is deploying a large language model in a customer-facing chatbot. The model's responses must be both accurate and safe. Which combination of techniques should be employed?

A.Use only a system prompt instructing the model to be accurate and safe.

B.Use retrieval-augmented generation (RAG) for factual accuracy and a content safety filter for safe outputs.

C.Use a high temperature for creativity and a safety classifier for blocking toxic outputs.

D.Fine-tune the model on all historical chat logs and use a high temperature.

AnswerB

RAG improves accuracy; safety filter ensures safety.

Why this answer

Option B is correct because RAG grounds the model's responses in a verified external knowledge base, reducing hallucinations and improving factual accuracy, while a content safety filter (e.g., a classifier or guardrail) actively blocks toxic or unsafe outputs before they reach the user. This combination addresses both accuracy and safety independently, unlike a single system prompt which is easily bypassed.

Exam trap

Oracle often tests the misconception that a single technique (like a system prompt or fine-tuning) can simultaneously guarantee both accuracy and safety, when in practice they require separate, complementary mechanisms.

How to eliminate wrong answers

Option A is wrong because a system prompt alone is a static instruction that can be overridden by user input or model behavior, providing no enforcement mechanism for accuracy or safety. Option C is wrong because a high temperature increases randomness and creativity, which is counterproductive for accuracy and can amplify unsafe outputs; a safety classifier is a partial solution but does not address factual grounding. Option D is wrong because fine-tuning on all historical chat logs may introduce biases, errors, or unsafe patterns from the data, and a high temperature further degrades reliability.

Full explanation →

220

Multi-Selectmedium

Which TWO measures can help reduce the risk of generating toxic or unsafe content when using OCI Generative AI Service?

Select 2 answers

A.Use few-shot prompting with examples that demonstrate safe and appropriate responses.

B.Disable model monitoring and logging to reduce overhead.

C.Increase the temperature parameter to make output more deterministic.

D.Fine-tune the model on a large dataset without any safety filtering.

E.Enable the built-in content filtering features provided by OCI Generative AI Service.

AnswersA, E

Safe examples help steer the model toward desired behavior.

Why this answer

Few-shot prompting provides the model with explicit examples of safe, appropriate responses, which helps steer the model's behavior toward desired outputs and reduces the likelihood of generating toxic or unsafe content. This technique leverages in-context learning to align the model's responses with the provided examples, making it a practical measure for content safety.

Exam trap

Oracle often tests the misconception that increasing temperature or disabling monitoring improves safety, when in fact these actions increase randomness and reduce oversight, respectively.

Full explanation →

221

MCQeasy

What is the primary benefit of using a Dedicated AI Cluster over On-Demand serving for deploying generative AI models on OCI?

A.Higher throughput and lower latency due to reserved capacity

B.No need to manage model versions

C.Automatic scaling to zero when not in use

D.Lower cost for variable workloads

AnswerA

Reserved capacity minimizes resource contention, improving performance.

Why this answer

A Dedicated AI Cluster provides reserved compute capacity on OCI, ensuring consistent high throughput and low latency for generative AI inference workloads. Unlike On-Demand serving, which shares resources and can suffer from contention or cold starts, a Dedicated AI Cluster guarantees that GPU resources are always available for your model, eliminating variability in response times.

Exam trap

Oracle often tests the misconception that Dedicated AI Clusters are cheaper for variable workloads, when in fact their fixed-cost model makes them optimal for steady-state, high-volume inference, not spiky or unpredictable traffic.

How to eliminate wrong answers

Option B is wrong because managing model versions is a separate concern handled by model registries and deployment pipelines, not a benefit specific to Dedicated AI Clusters. Option C is wrong because Dedicated AI Clusters are always-on and do not scale to zero; automatic scaling to zero is a feature of serverless or On-Demand serving to reduce costs when idle. Option D is wrong because Dedicated AI Clusters incur fixed costs for reserved capacity, making them more expensive for variable workloads compared to On-Demand serving, which charges per-usage and can scale down.

Full explanation →

222

MCQmedium

A company notices that their OCI GenAI managed serving endpoint returns incomplete responses for long prompts. What is the most likely cause?

A.The model's context window is exceeded.

B.The max tokens parameter is set too high.

C.The top_p value is too low.

D.The temperature is set too high.

AnswerA

Models have a maximum input token limit; exceeding it truncates the input or output.

Why this answer

Option A is correct because incomplete responses typically indicate that the input prompt exceeds the model's context window. Other options affect output quality but not truncation.

Full explanation →

223

MCQmedium

Refer to the exhibit. A data scientist runs this inference request and receives a response that is incomplete and seems to stop mid-sentence. Which parameter should be adjusted to allow the model to generate longer outputs?

A.maxTokens

B.temperature

C.topP

D.frequencyPenalty

E.presencePenalty

AnswerA

maxTokens sets the maximum number of tokens to generate; increasing it yields longer outputs.

Why this answer

Option B is correct because maxTokens directly limits the number of tokens generated; increasing it allows the model to produce longer responses. Option A (temperature) affects randomness, not length. Option C (topP) affects token selection diversity.

Options D and E affect repetition penalties, not output length.

Full explanation →

224

MCQmedium

A team is fine-tuning an LLM on OCI Generative AI for a domain-specific task. They have a dataset of 10,000 labeled examples. What is a best practice to avoid catastrophic forgetting during fine-tuning?

A.Increase the learning rate to speed up adaptation.

B.Use only the new domain-specific data for fine-tuning.

C.Reduce the number of training epochs to the minimum.

D.Include a small percentage of general-domain data in the training mix.

AnswerD

General data acts as a regularizer to maintain base knowledge.

Why this answer

Option D is correct because catastrophic forgetting occurs when a fine-tuned model loses previously learned general knowledge. By including a small percentage (e.g., 5–10%) of general-domain data in the training mix, the model retains its broad capabilities while adapting to the new domain-specific task. This technique, often called 'replay' or 'experience replay,' is a standard practice in continual learning for LLMs.

Exam trap

Oracle often tests the misconception that fine-tuning should exclusively use the new dataset, whereas the best practice is to blend in general data to preserve prior knowledge.

How to eliminate wrong answers

Option A is wrong because increasing the learning rate can cause the model to overfit to the new domain data and accelerate forgetting, not prevent it. Option B is wrong because using only domain-specific data removes all exposure to general knowledge, which is the primary cause of catastrophic forgetting. Option C is wrong because reducing epochs to the minimum may prevent the model from learning the new task adequately, but it does not address the retention of general knowledge; the model can still forget if the new data dominates the gradient updates.

Full explanation →

225

Multi-Selecteasy

Which TWO are advantages of using LoRA for fine-tuning?

Select 2 answers

A.Requires less GPU memory

B.Guarantees higher accuracy

C.Reduces number of trainable parameters

D.Increases model size

E.Improves inference speed

AnswersA, C

Fewer trainable parameters means lower memory usage during training.

Why this answer

LoRA (Low-Rank Adaptation) reduces GPU memory requirements because it freezes the original model weights and injects trainable low-rank matrices into specific layers. This means only a tiny fraction of parameters need gradients and optimizer states, drastically lowering memory consumption during fine-tuning compared to full fine-tuning.

Exam trap

Oracle often tests the misconception that reducing trainable parameters automatically improves inference speed, but LoRA's memory and parameter savings apply only to training, not to inference latency.

Full explanation →

Page 3 of 7

All pages

Practice 1Z0-1127 by domain

Target a specific domain to shore up weak areas.

Fundamentals of Large Language Models Using OCI Generative AI Service Building LLM Applications with RAG and Vector Search Deploying and Managing Generative AI on OCI

See all domains with question counts →