Knowledge + Practice

Oracle Cloud Infrastructure Generative AI Professional 1Z0-1127 (1Z0-1127) — Questions 526–600

991 questions total · 14pages · All types, answers revealed

Take a mock exam Exam hub

Page 8 of 14

526

MCQeasy

A developer integrates OCI GenAI into a mobile app to provide product descriptions. The responses sometimes include explanations or questions instead of the requested format. The developer is using a simple prompt: 'Describe product X.' The app expects a single paragraph. Which corrective action should the developer take?

A.Add a structured prompt with format instructions and an example.

B.Lower the temperature to 0 to make responses deterministic.

C.Increase the max tokens to allow longer responses.

D.Switch to a different model with better language understanding.

AnswerA

Correct: Structured prompts effectively enforce output format.

Why this answer

Option A is correct because adding a structured prompt with format instructions and an example directly addresses the issue of the model producing off-format responses. By explicitly specifying the expected output (e.g., 'Provide a single paragraph description without questions or explanations') and including a few-shot example, the developer constrains the model's generation to match the desired format, leveraging prompt engineering to guide the LLM's behavior without changing model parameters or architecture.

Exam trap

Cisco often tests the misconception that parameter tuning (temperature, max tokens) or model selection can substitute for proper prompt engineering, when in fact format compliance is primarily achieved through explicit instructions and examples in the prompt.

How to eliminate wrong answers

Option B is wrong because lowering the temperature to 0 makes the output deterministic but does not enforce a specific format; the model may still produce explanations or questions in a deterministic but non-compliant manner. Option C is wrong because increasing max tokens allows longer responses but does not prevent the model from including unwanted content like explanations or questions; it may even exacerbate the issue by giving the model more room to deviate. Option D is wrong because switching to a different model does not guarantee format compliance; the core problem is the lack of explicit format instructions, not the model's language understanding capability, and any model will benefit from structured prompting.

Full explanation →

527

MCQhard

A company uses OCI Data Science to fine-tune an embedding model for a specialized domain. After fine-tuning, the model produces embeddings that are not aligned with the vector index used in OCI OpenSearch. What is the most likely cause?

A.The fine-tuning process modified the model architecture

B.The embedding dimension changed after fine-tuning

C.The fine-tuning dataset was too small

D.The vector index was built using a different distance metric than used during fine-tuning

AnswerB

If fine-tuning added/removed layers or changed the output size, the embedding dimension differs, causing index incompatibility.

Why this answer

Fine-tuning an embedding model can alter the dimensionality of the output embeddings if the model's architecture is modified (e.g., changing the final pooling layer or projection head). OCI OpenSearch vector indexes are built with a fixed dimension; if the fine-tuned model produces embeddings of a different size, they cannot be indexed or queried against the existing index, causing a misalignment. This is the most direct and common cause of such a mismatch.

Exam trap

Cisco often tests the misconception that fine-tuning only affects embedding quality or semantic alignment, when in fact it can also alter the vector dimension, which is a hard constraint for vector databases like OpenSearch.

How to eliminate wrong answers

Option A is wrong because fine-tuning typically adjusts weights without changing the model architecture; if the architecture were modified, it would be a deliberate design change, not a side effect of fine-tuning. Option C is wrong because a small dataset may lead to poor quality embeddings but does not change the embedding dimension or cause index misalignment. Option D is wrong because the distance metric (e.g., cosine, Euclidean) is a search-time parameter in OpenSearch, not a property of the embeddings themselves; mismatched metrics affect ranking but not the ability to index or query the vectors.

Full explanation →

528

MCQmedium

An enterprise is deploying a RAG application for compliance document analysis using OCI. They use OCI OpenSearch as the vector store and have millions of documents. Retrieval latency is critical. Currently, a single query takes over 2 seconds. The index uses a flat (brute-force) distance computation. They have considered using approximate nearest neighbor (ANN) algorithms but are unsure about the impact on recall. They need to reduce latency to under 500ms while maintaining high recall. What should they do?

A.Use a smaller embedding dimension by truncating the existing embeddings.

B.Reduce the number of shards in the OpenSearch index to improve parallelism.

C.Switch to an HNSW algorithm with an appropriate M and ef_search parameters.

D.Increase the top-k parameter to retrieve more candidates then filter.

AnswerC

HNSW provides sub-linear search time with good recall.

Why this answer

Option C is correct because switching to HNSW with appropriate parameters provides fast approximate search with configurable recall. Option A (reducing shards) may not achieve the required latency reduction. Option B (reducing dimensions) can degrade embedding quality.

Option D (increasing top-k) would increase latency.

Full explanation →

529

MCQhard

A healthcare company is using OCI Generative AI to analyze patient records and generate clinical summaries. The company must comply with HIPAA regulations, which require that all protected health information (PHI) be encrypted at rest and in transit, and that access be logged and audited. The current architecture uses an OCI Data Science model deployment with a public endpoint. The model is stored in an OCI Object Storage bucket that is publicly accessible for testing. The company is now moving to production. The compliance officer has flagged the following issues: (1) The model endpoint is publicly accessible. (2) The bucket containing the model is public. (3) No audit logs are enabled. The company wants to remediate these issues while maintaining the ability to invoke the model from on-premises applications via a secure connection. Which set of actions should the architect take?

A.Switch the model endpoint to a private subnet with a service gateway, change the bucket to be accessible only via pre-authenticated requests, and enable OCI Logging for the model deployment.

B.Keep the public endpoint but restrict access using IAM policies and source IP addresses, make the bucket private, and enable OCI Audit.

C.Switch the model endpoint to a private subnet with a service gateway, update the bucket policy to block all public access, enable OCI Audit service, and set up a VPN or FastConnect for on-premises access.

D.Use a public load balancer with SSL termination, restrict bucket access to the load balancer's OCID, and enable OCI Audit.

AnswerC

This ensures private endpoint, private bucket, audit logging, and secure on-premises connectivity.

Why this answer

Option C is correct because it addresses all three compliance issues: moving the model endpoint to a private subnet with a service gateway removes public exposure, making the bucket private with a policy that blocks all public access secures the model artifacts, and enabling OCI Audit provides the required logging. Additionally, setting up a VPN or FastConnect allows secure on-premises access without exposing the endpoint to the public internet, fully satisfying HIPAA encryption and audit requirements.

Exam trap

The trap here is that candidates often think IP restrictions or pre-authenticated requests are sufficient for HIPAA compliance, but HIPAA requires that PHI be encrypted at rest and in transit and that access be logged and audited—public endpoints and shared URLs violate the 'encryption in transit' and 'audit' requirements because they rely on internet-exposed paths and lack proper access controls.

How to eliminate wrong answers

Option A is wrong because pre-authenticated requests (PARs) still expose the bucket via a URL that can be shared, which does not meet HIPAA's requirement for access logging and audit; PARs are not a substitute for private bucket policies and audit logging. Option B is wrong because keeping the public endpoint even with IP restrictions is not sufficient for HIPAA compliance—public endpoints are inherently exposed to network-level attacks and do not satisfy the requirement for encryption at rest and in transit in a fully private manner; also, OCI Audit alone does not cover logging for the model deployment itself. Option D is wrong because a public load balancer with SSL termination still leaves the endpoint publicly accessible, and restricting bucket access to the load balancer's OCID does not prevent the bucket from being publicly listed or accessed via other paths; OCI Audit alone does not address the public endpoint issue.

Full explanation →

530

Multi-Selectmedium

A developer is using LangChain's RetrievalQA chain with a vector store. They want to improve the diversity of retrieved documents to avoid redundant information. Which TWO parameters or methods should they adjust?

Select 2 answers

A.Increase the 'chunk_size' in the text splitter

B.Set the 'k' parameter to a very high number

C.Use a different embedding model

D.Set the 'fetch_k' parameter to a value larger than 'k'

E.Enable MMR (maximum marginal relevance) in the retriever

AnswersD, E

fetch_k retrieves a larger initial set from which MMR can choose a diverse subset.

Why this answer

Option D is correct because setting 'fetch_k' to a value larger than 'k' allows the retriever to first fetch a larger pool of candidate documents (fetch_k) and then apply a diversity-promoting algorithm like MMR to select the final 'k' documents. This reduces redundancy by ensuring the returned set is not dominated by similar documents from the same region of the vector space.

Exam trap

Cisco often tests the misconception that simply increasing the number of retrieved documents (k) improves diversity, when in fact without MMR or a similar algorithm, more documents often means more redundancy.

Full explanation →

531

MCQhard

A company is using OCI GenAI with a Dedicated AI Cluster to serve a large language model for real-time chat applications. They notice high inference latency (average 2 seconds per response) and want to reduce it to under 500 milliseconds without significantly degrading the quality of responses. The cluster is configured with NVIDIA A100 GPUs. The model is the base Cohere Command model (52B parameters). They have explored increasing batch size, but that increases latency for interactive use cases. Which action should they take?

A.Deploy the model with inference optimization frameworks like vLLM, TensorRT, or ONNX Runtime.

B.Increase batch size to process multiple queries at once.

C.Swap the model to a smaller variant, such as Cohere Command Light (6B).

D.Enable model quantization (e.g., int8) to reduce memory and computation.

AnswerA

These frameworks optimize GPU utilization and reduce latency without changing the model.

Why this answer

Option A is correct because inference optimization frameworks like vLLM, TensorRT, and ONNX Runtime are specifically designed to reduce latency for large language models on NVIDIA A100 GPUs. These frameworks use techniques such as PagedAttention (vLLM), kernel fusion, and graph optimization to significantly lower per-request latency without degrading output quality, making them ideal for real-time chat applications where sub-500ms responses are required.

Exam trap

Cisco often tests the misconception that model quantization or smaller models are the only ways to reduce latency, but the trap here is that inference optimization frameworks can achieve dramatic latency reductions without sacrificing model quality or capability.

How to eliminate wrong answers

Option B is wrong because increasing batch size improves throughput but increases per-request latency, which is counterproductive for interactive use cases requiring low latency. Option C is wrong because swapping to a smaller model (Cohere Command Light, 6B) would reduce latency but also significantly degrade response quality and capability, which the question explicitly wants to avoid. Option D is wrong because enabling model quantization (e.g., int8) reduces memory and computation, which can lower latency, but it often introduces a trade-off in model accuracy and may not achieve the target latency on its own without combining with inference optimization frameworks; the question asks for the best single action, and optimization frameworks directly target latency reduction more effectively.

Full explanation →

532

MCQeasy

Which OCI Generative AI model family is specifically designed to convert text into vector embeddings for semantic search and clustering tasks?

A.Cohere Embed

B.Meta Llama 3

C.Cohere Command R

D.Cohere Rerank

AnswerA

Embed models produce vector embeddings for semantic search, clustering, and classification.

Why this answer

Cohere Embed models (e.g., embed-english-v3.0, embed-multilingual-v3.0) are explicitly designed for generating text embeddings. Cohere Command R and R+ are for generation, Meta Llama 3 is a general-purpose LLM, and Cohere Rerank is for re-ranking search results.

Full explanation →

533

MCQmedium

A data scientist is deploying a custom generative AI model using OCI Data Science. After deploying the model to an endpoint, they notice that inference requests are failing with a timeout error when the payload size exceeds 1 MB. What is the most likely cause and solution?

A.The load balancer is misconfigured; reconfigure the load balancer timeout settings.

B.The model server lacks sufficient memory; scale out to more instances.

C.The model is not optimized for large payloads; use AutoML to optimize the model.

D.The model deployment has a default payload size limit of ~1 MB; increase the payload limit in the deployment configuration.

AnswerD

OCI Data Science model deployments have a default request payload limit that can be increased.

Why this answer

The correct answer is D because OCI Data Science model deployments have a default payload size limit of approximately 1 MB. When inference requests exceed this limit, the load balancer or gateway times out the request. The solution is to increase the payload limit in the deployment configuration, which can be adjusted via the OCI console or API by modifying the `maximumRequestPayloadSize` setting.

Exam trap

The trap here is that candidates often confuse a payload size limit with a generic timeout or resource issue, leading them to choose load balancer reconfiguration (A) or scaling (B) instead of recognizing the explicit payload limit enforced by the deployment configuration.

How to eliminate wrong answers

Option A is wrong because the load balancer timeout settings are not the root cause; the timeout is a symptom of hitting the payload size limit, not a misconfiguration of the load balancer itself. Option B is wrong because insufficient memory would cause out-of-memory errors or slow inference, not a timeout specifically triggered by payload size exceeding 1 MB. Option C is wrong because AutoML optimizes model training and hyperparameters, not the runtime payload handling; the issue is a deployment configuration limit, not model optimization.

Full explanation →

534

MCQmedium

Your team is deploying a generative AI model for a clinical decision support system. The model must meet HIPAA compliance requirements. You have trained a model using OCI Data Science and now need to deploy it so that patient data is protected. The application requires real-time inference. Which set of actions should you take to ensure compliance while maintaining low latency?

A.Use OCI Functions with API Gateway and allow anonymous access

B.Deploy in a public subnet with HTTPS and enable OCI Audit

C.Use OCI Data Flow for batch inference and store results in Object Storage with SSE

D.Deploy in a private VCN subnet, use a service gateway, store keys in OCI Vault, and enable OCI Logging and OCI Audit

AnswerD

These actions address HIPAA requirements for access control, encryption, and auditing.

Why this answer

Option D is correct because deploying the model in a private VCN subnet ensures the inference endpoint is not exposed to the internet, meeting HIPAA's requirement for network isolation. Using a service gateway allows private connectivity to OCI services without traversing the internet, while storing encryption keys in OCI Vault enables customer-managed key control for data at rest. Enabling OCI Logging and OCI Audit provides the necessary audit trail for compliance, and the private subnet with service gateway keeps latency low by avoiding internet hops.

Exam trap

Oracle often tests the misconception that HTTPS encryption alone is sufficient for HIPAA compliance, but the trap here is that network isolation (private subnet) is mandatory for PHI, and public subnet exposure violates the HIPAA Security Rule even with encryption in transit.

How to eliminate wrong answers

Option A is wrong because OCI Functions with API Gateway and anonymous access bypasses authentication and authorization, violating HIPAA's access control requirements, and anonymous access exposes patient data to unauthorized users. Option B is wrong because deploying in a public subnet, even with HTTPS, exposes the inference endpoint to the internet, which is not permitted for protected health information (PHI) under HIPAA's security rule, and OCI Audit alone does not enforce network isolation. Option C is wrong because OCI Data Flow is a batch processing service, not suitable for real-time inference, and storing results in Object Storage with SSE does not address the need for low-latency, synchronous inference required by the clinical decision support system.

Full explanation →

535

MCQmedium

A company wants to create a chatbot that answers questions based on a large internal document set that is updated weekly. They have limited ML expertise. Which approach is recommended?

A.Fine-tune a model on the entire document set.

B.Train a custom model from scratch.

C.Include all documents in the system prompt.

D.Use retrieval-augmented generation (RAG) with a vector database.

AnswerD

Correct: RAG handles dynamic data without retraining.

Why this answer

Retrieval-Augmented Generation (RAG) with a vector database is the recommended approach because it allows the chatbot to answer questions based on a large, frequently updated document set without requiring model retraining. RAG retrieves relevant document chunks at query time using vector similarity search, then passes them as context to the LLM, ensuring up-to-date answers with minimal ML expertise.

Exam trap

Cisco often tests the misconception that fine-tuning is the only way to incorporate custom data, when in fact RAG is the preferred method for dynamic, large-scale document sets due to its cost-effectiveness, ease of updates, and lower ML expertise requirements.

How to eliminate wrong answers

Option A is wrong because fine-tuning a model on the entire document set would require significant ML expertise, computational resources, and would need to be repeated weekly to incorporate updates, making it impractical for a dynamically changing corpus. Option B is wrong because training a custom model from scratch is extremely resource-intensive, requires deep ML expertise, and is unnecessary when pre-trained LLMs can be leveraged with RAG. Option C is wrong because including all documents in the system prompt would exceed the LLM's context window limits (typically 4K-128K tokens), causing truncation, high latency, and increased cost, while also failing to scale with a large document set.

Full explanation →

536

MCQmedium

A company uses OCI GenAI to build a content moderation system that filters toxic language in user-generated comments. They have a small labeled dataset of 1,000 comments (500 toxic, 500 non-toxic) and need an efficient solution that balances accuracy, cost, and latency. They are considering different model options: fine-tuning a large LLM (e.g., Cohere Command), using a pre-trained LLM with prompting, fine-tuning a smaller BERT-based classifier, or building a rule-based system. The team has moderate ML experience and wants to deploy using OCI Data Science. Which approach is most efficient for this binary classification task?

A.Fine-tune a BERT-based classifier (e.g., 'bert-base-uncased') on the dataset.

B.Develop a rule-based system using regular expressions and keyword lists.

C.Use a pre-trained LLM with a toxic/non-toxic prompt.

D.Fine-tune the Cohere Command model on the labeled dataset.

AnswerA

BERT is efficient for classification, fine-tunes quickly on small data, and has low inference cost.

Why this answer

Fine-tuning a BERT-based classifier (e.g., 'bert-base-uncased') is the most efficient approach because BERT is specifically designed for text classification tasks, requiring far fewer computational resources and lower latency than large LLMs. With only 1,000 labeled samples, BERT can achieve high accuracy through transfer learning, while keeping inference costs minimal—ideal for a production content moderation system on OCI Data Science.

Exam trap

Oracle often tests the misconception that larger LLMs (like Cohere Command) are always superior for classification tasks, ignoring the practical constraints of small datasets, cost, and latency that make fine-tuned BERT models the optimal choice for binary classification.

How to eliminate wrong answers

Option B is wrong because rule-based systems using regex and keyword lists cannot generalize to nuanced toxic language (e.g., sarcasm, misspellings, or context-dependent toxicity) and require constant manual maintenance, leading to poor accuracy and high operational overhead. Option C is wrong because using a pre-trained LLM with prompting (e.g., Cohere Command) incurs high per-token inference costs and latency, and with only 1,000 examples, few-shot prompting may not reliably capture the specific toxicity patterns in the dataset. Option D is wrong because fine-tuning a large LLM like Cohere Command on a tiny dataset of 1,000 samples risks catastrophic forgetting and overfitting, while also being computationally expensive and slower for real-time moderation compared to a smaller BERT model.

Full explanation →

537

MCQmedium

A company wants to use OCI Generative AI service to automatically generate product descriptions for an e-commerce catalog. They have 10,000 products. What is the best approach to ensure high-quality, consistent descriptions?

A.Use a pre-trained summarization model.

B.Use a template-based generation with keyword insertion.

C.Use the built-in chat model with few-shot examples in the prompt.

D.Fine-tune a base model on a dataset of existing product descriptions.

AnswerD

Fine-tuning adapts the model to the specific domain and produces consistent outputs across many products.

Why this answer

Fine-tuning a base model on a dataset of existing product descriptions is the best approach because it adapts the model to the specific domain, style, and vocabulary of the e-commerce catalog. This ensures high-quality, consistent outputs across 10,000 products by learning the patterns and terminology from the company's own data, rather than relying on generic or template-based methods.

Exam trap

Cisco often tests the misconception that few-shot prompting (Option C) is sufficient for large-scale, consistent generation, when in reality it suffers from context window limits and lack of domain-specific adaptation, making fine-tuning the only viable option for production workloads with thousands of items.

How to eliminate wrong answers

Option A is wrong because a pre-trained summarization model is designed to condense existing text, not generate new product descriptions from scratch, and would produce inconsistent or irrelevant outputs for this task. Option B is wrong because template-based generation with keyword insertion lacks the flexibility and natural language understanding needed for 10,000 unique products, resulting in repetitive, low-quality descriptions that do not capture nuanced product features. Option C is wrong because using the built-in chat model with few-shot examples in the prompt can work for small-scale tasks but is not scalable or reliable for 10,000 products; the model may drift, exceed token limits, or fail to maintain consistent style and accuracy across such a large volume.

Full explanation →

538

Multi-Selecteasy

Which THREE are essential steps in the prompt engineering process for an LLM?

Select 3 answers

A.Test the prompt with a variety of input examples

B.Fine-tune the model on a domain corpus

C.Define the desired output format and constraints

D.Quantize the model to INT8

E.Iteratively refine the prompt based on model responses

AnswersA, C, E

Testing ensures robustness across different inputs.

Why this answer

Option A is correct because testing the prompt with a variety of input examples is essential to evaluate the LLM's generalization, robustness, and sensitivity to different phrasing or contexts. This step helps identify edge cases, biases, or inconsistencies in the model's responses before deployment.

Exam trap

Oracle often tests the distinction between prompt engineering (input-side optimization) and model modification (fine-tuning, quantization) to trap candidates who confuse these fundamentally different processes.

Full explanation →

539

MCQmedium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A.Use a larger foundation model with a longer context window and paste all documents into each prompt

B.Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store

C.Train a custom model from scratch on the policy documents each month

D.Fine-tune a base LLM on the policy documents monthly

AnswerB

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

RAG (Retrieval-Augmented Generation) allows the LLM to retrieve relevant document sections at inference time, so knowledge stays current without retraining. The other options either require expensive retraining for each update or lack document grounding.

Full explanation →

540

MCQeasy

A company needs to ensure that only authorized users can invoke an endpoint for a generative AI model. Which OCI feature should be used to control access?

A.Network security groups (NSGs)

B.VCN flow logs

C.OCI Web Application Firewall (WAF)

D.OCI Identity and Access Management (IAM) policies

AnswerD

Correct: IAM policies grant or deny access to specific resources like models and endpoints.

Why this answer

OCI Identity and Access Management (IAM) policies are the correct choice because they define who (users, groups, or service principals) can invoke which OCI resources, including generative AI model endpoints. IAM policies use resource-type and verb-based statements (e.g., 'allow group A to manage ai-service-family in compartment X') to enforce authorization at the API level, ensuring only authorized principals can call the model's inference endpoint.

Exam trap

The trap here is that candidates confuse network-level controls (NSGs, WAF) with identity-based access control, mistakenly thinking that restricting network traffic to the endpoint is sufficient for authorization, whereas OCI requires IAM policies to authenticate and authorize the caller's identity at the API layer.

How to eliminate wrong answers

Option A is wrong because Network Security Groups (NSGs) control network traffic at the subnet or VNIC level using stateful firewall rules (e.g., allow/deny TCP port 443), not user identity or API-level authorization. Option B is wrong because VCN flow logs capture metadata about network traffic (source IP, destination port, etc.) for auditing or troubleshooting, but they do not enforce access control. Option C is wrong because OCI Web Application Firewall (WAF) protects against HTTP-based attacks (e.g., SQL injection, XSS) and can filter by IP or request patterns, but it cannot authenticate or authorize individual users or service principals invoking the model endpoint.

Full explanation →

541

MCQmedium

A company needs to generate vector embeddings for a multilingual document set to support semantic search across English and French documents. Which embedding model should they use?

A.Cohere embed-english-v3.0

B.Cohere Command R

C.Cohere embed-multilingual-v3.0

D.Meta Llama 3 8B

AnswerC

This model supports multiple languages including French and English.

Why this answer

The embed-multilingual-v3.0 model supports multiple languages, including English and French.

Full explanation →

542

MCQhard

You are a cloud architect at a healthcare company that uses OCI Generative AI Service to analyze patient records and generate clinical summaries. The service is deployed in the Frankfurt region with a dedicated AI cluster. Recently, the compliance team flagged that some generated summaries contain hallucinated diagnoses not present in the source records. They demand immediate mitigation. The current setup uses the default model (cohere.command-r-08-2024) with temperature=0.7, top_p=0.9, and max_tokens=2048. The application sends the entire patient record as a single prompt. You have access to OCI Logging, monitoring metrics (latency, request count, token count, safety filter rejections), and the AI service's model fine-tuning capability. You must reduce hallucinations while minimizing latency increase. What is the most effective course of action?

A.Switch to cohere.command-light model for faster inference and add a post-processing step using a BERT-based NER model to validate entities.

B.Increase max_tokens to 4096 and use chunked processing with overlapping context windows to provide more context.

C.Enable the safety filter with strict content moderation and set up OCI Logging to audit all generations.

D.Reduce temperature to 0.2, top_p to 0.5, and fine-tune the model on a curated dataset of 5,000 clinical summaries with a learning rate of 0.00005 and batch size of 8.

AnswerD

Lower temperature/top_p yields more deterministic outputs; fine-tuning on domain-specific data directly reduces hallucinations.

Why this answer

Option D is correct because reducing temperature and top_p makes the model more deterministic, directly reducing the likelihood of hallucinated content. Fine-tuning on a curated dataset of 5,000 clinical summaries teaches the model domain-specific patterns and constraints, further minimizing hallucinations. This approach addresses the root cause without significantly increasing latency, as fine-tuning does not affect inference speed and lower sampling parameters add no computational overhead.

Exam trap

Cisco often tests the misconception that adding more context or post-processing steps can fix hallucinations, when in fact controlling model parameters and fine-tuning are the primary methods to reduce fabricated content in generative AI.

How to eliminate wrong answers

Option A is wrong because switching to cohere.command-light may reduce latency but does not address hallucinations; a BERT-based NER post-processing step adds latency and only validates entities, not the clinical accuracy of generated diagnoses. Option B is wrong because increasing max_tokens to 4096 and using chunked processing with overlapping context windows increases latency and token consumption, and providing more context does not prevent the model from fabricating information. Option C is wrong because enabling the safety filter with strict content moderation only blocks harmful or unsafe content, not hallucinations; OCI Logging audits generations but does not mitigate the problem.

Full explanation →

543

MCQmedium

A data scientist is fine-tuning a model on OCI Generative AI to generate code comments. They use a dataset of 10,000 examples. After fine-tuning, the model generates comments that are too similar to the training data and lack generalization. What is the most likely cause?

A.Incorrect tokenizer.

B.Insufficient training data.

C.Too many training epochs.

D.Too high learning rate.

AnswerC

Excessive epochs cause the model to memorize training data, reducing generalization.

Why this answer

When a fine-tuned model generates outputs that are too similar to the training data and lack generalization, it is a classic sign of overfitting. Overfitting occurs when the model is trained for too many epochs, causing it to memorize the training examples rather than learning the underlying patterns. In OCI Generative AI, the fine-tuning process adjusts model weights iteratively, and excessive epochs lead to poor performance on unseen data.

Exam trap

The trap here is that candidates often confuse overfitting (caused by too many epochs) with underfitting (caused by insufficient data or low learning rate), leading them to incorrectly select option B or D.

How to eliminate wrong answers

Option A is wrong because an incorrect tokenizer would cause tokenization errors or mismatched vocabulary, not overfitting or memorization of training data. Option B is wrong because insufficient training data typically leads to underfitting, not overfitting; with 10,000 examples, the dataset size is reasonable for fine-tuning. Option D is wrong because a too high learning rate usually causes training instability or divergence, not memorization; it would prevent the model from converging properly.

Full explanation →

544

MCQmedium

A healthcare company is deploying an OCI Generative AI service to summarize patient notes. They have recently moved from a managed serving endpoint to a dedicated AI cluster to ensure data privacy. The fine-tuned model is deployed on a dedicated cluster in the US West region. Users report that the summarization responses are now slower and occasionally timeout. The IT team checks the metrics: the cluster has 1 replica and CPU utilization is at 90%. The Object Storage bucket containing the model artifacts is in the same region. They have increased the timeout in their client configuration to 120 seconds, but still get timeouts. What should they do first to address the issue?

A.Move the Object Storage bucket to a local NVMe cache in the cluster.

B.Move the model back to a managed serving endpoint in a different region.

C.Increase the number of replicas in the dedicated cluster.

D.Increase the max tokens parameter in the API call.

AnswerC

Adding replicas provides more compute capacity to handle the load.

Why this answer

The dedicated AI cluster has only 1 replica and CPU utilization is at 90%, indicating that the single replica is overloaded and cannot handle the inference request volume. Increasing the number of replicas distributes the load, reduces latency, and prevents timeouts. This is the most direct and scalable fix for performance bottlenecks in a dedicated OCI Generative AI cluster.

Exam trap

The trap here is that candidates may focus on storage or client-side tuning (like timeout or token limits) instead of recognizing that a single overloaded replica is the root cause of performance degradation.

How to eliminate wrong answers

Option A is wrong because moving the Object Storage bucket to a local NVMe cache does not address the compute bottleneck; model artifacts are loaded into memory at deployment time, and runtime inference latency is driven by CPU/GPU load, not storage I/O. Option B is wrong because moving back to a managed serving endpoint would compromise the data privacy requirement that prompted the move to a dedicated cluster, and a different region could introduce additional latency. Option D is wrong because increasing the max tokens parameter would increase the output length, making the inference slower and worsening timeouts, not solving the underlying resource contention.

Full explanation →

545

MCQhard

An LLM is being used to answer customer queries about a product catalog. The answers are fluent but sometimes include plausible-sounding but incorrect product details. What is this phenomenon called, and which technique is most effective to mitigate it?

A.Knowledge cutoff; fine-tune the model on the catalog

B.Hallucination; use Retrieval-Augmented Generation (RAG) with the catalog indexed

C.Bias amplification; increase temperature

D.Overfitting; reduce the model size

AnswerB

Hallucination is the correct term; RAG is the standard mitigation.

Why this answer

Hallucination is the generation of false information; RAG grounds responses in retrieved factual documents, reducing hallucinations.

Full explanation →

546

MCQeasy

In the OCI Generative AI Playground, a developer wants to control how creative the model responses are. Which parameter should they adjust?

A.Presence penalty

B.Temperature

C.Max tokens

D.Stop sequences

AnswerB

Temperature directly controls randomness and creativity in model outputs.

Why this answer

Temperature controls randomness; higher values produce more creative outputs. The other parameters control other aspects of generation.

Full explanation →

547

Multi-Selectmedium

A company wants to use OCI Generative AI Agents to create a RAG-powered customer support system. Which THREE components are essential for the agent to work?

Select 3 answers

A.A knowledge base

B.A Dedicated AI Cluster

C.A data source (e.g., OCI Object Storage bucket)

D.A base LLM (e.g., Cohere Command R)

E.A fine-tuned embedding model

AnswersA, C, D

Why this answer

OCI Generative AI Agents require a knowledge base, a data source (like Object Storage), and a base LLM to generate responses. The other options are optional.

Full explanation →

548

MCQeasy

What is the primary purpose of chunking documents in a RAG pipeline?

A.To improve embedding quality

B.To speed up training

C.To reduce storage costs

D.To ensure each chunk fits within the model's context window

AnswerD

Models have token limits; chunking prevents truncation.

Why this answer

The primary purpose of chunking documents in a RAG pipeline is to divide large documents into smaller, manageable pieces that each fit within the model's context window. This ensures that the retrieval step can fetch relevant chunks without exceeding the token limit of the LLM, which would otherwise cause truncation or failure to process the input.

Exam trap

Cisco often tests the misconception that chunking improves embedding quality or reduces costs, but the core technical constraint is the LLM's fixed context window, which chunking directly addresses.

How to eliminate wrong answers

Option A is wrong because embedding quality depends on the embedding model and the semantic coherence of the chunk, not on chunking itself; poorly chosen chunk sizes can degrade quality, but chunking is not designed to improve it. Option B is wrong because chunking is a preprocessing step for retrieval and generation, not for training; RAG pipelines typically use pre-trained models and do not involve training on the chunks. Option C is wrong because chunking often increases storage overhead due to metadata and overlapping chunks, and the primary goal is functional (context window fit), not cost reduction.

Full explanation →

549

MCQmedium

A data scientist needs to fine-tune a Llama 3 model for a legal document classification task. They have a dataset of 10,000 labeled examples. Which fine-tuning technique available in OCI Generative AI is most suitable for efficiently adapting the model with limited computational overhead?

A.Full fine-tuning all model parameters

B.LoRA (Low-Rank Adaptation)

C.Prefix tuning

D.T-Few fine-tuning

AnswerD

T-Few is an efficient parameter-update technique designed for fine-tuning with limited compute, available in OCI GenAI.

Why this answer

T-Few fine-tuning is a parameter-efficient technique that updates only a small number of weights, making it suitable for fine-tuning large models with limited compute. It is the technique offered by OCI GenAI for fine-tuning.

Full explanation →

550

MCQmedium

A developer is building a code generation assistant and wants to minimize the number of API calls to the OCI Generative AI service. Which tokenization approach results in the lowest token count for a given code snippet?

A.WordPiece tokenizer

B.SentencePiece tokenizer with unigram LM

C.BPE tokenizer trained on code corpora

D.Character-level tokenization

AnswerC

BPE learns frequent subword patterns in code, reducing token count.

Why this answer

Option C is correct because BPE (Byte Pair Encoding) tokenizers trained specifically on code corpora learn subword units that align closely with programming language syntax (e.g., common keywords, operators, and variable patterns), resulting in fewer tokens for a given code snippet compared to general-purpose tokenizers. This reduces API calls by encoding more semantic meaning per token, directly minimizing token count.

Exam trap

Cisco often tests the misconception that any subword tokenizer (like WordPiece or SentencePiece) is equally effective for code, but the trap is that only BPE trained on code corpora optimizes for the repetitive, syntax-heavy nature of programming languages, while others over-segment or use general-language frequency distributions.

How to eliminate wrong answers

Option A is wrong because WordPiece tokenizer, designed for natural language (e.g., BERT), splits code into subwords based on frequency in general text, leading to higher token counts for code-specific patterns like indentation or operators. Option B is wrong because SentencePiece with unigram LM uses a probabilistic unigram model that often over-segments code into many small pieces (e.g., splitting 'print' into 'pr', 'int'), increasing token count. Option D is wrong because character-level tokenization produces the highest token count possible, as each character becomes a separate token, which is the opposite of minimizing API calls.

Full explanation →

551

MCQmedium

An OCI user observes that their embedding model returns vectors that are not normalized, and they want to compute cosine similarity between two text embeddings. What should they do?

A.Compute the Euclidean distance between the vectors

B.Compute the L1 norm of the difference

C.Normalize the vectors to unit length, then compute the dot product

D.Compute the dot product directly

AnswerC

Cosine similarity is dot product of normalized vectors. Normalizing ensures the result is in [-1,1] and reflects the cosine of the angle.

Why this answer

Cosine similarity measures the cosine of the angle between two vectors, which is equivalent to the dot product of the vectors after they have been normalized to unit length (L2 norm = 1). Option C correctly describes this process: first normalize each embedding vector to unit length, then compute the dot product. This is the standard approach because raw embedding vectors from models like OCI's AI services may not be unit vectors, and the dot product alone does not account for magnitude differences.

Exam trap

Cisco often tests the misconception that the dot product alone is equivalent to cosine similarity, but the trap is that this only holds if the vectors are already normalized to unit length, which is not guaranteed by default.

How to eliminate wrong answers

Option A is wrong because Euclidean distance measures the straight-line distance between vectors, which is sensitive to vector magnitude and does not directly compute cosine similarity. Option B is wrong because the L1 norm of the difference (Manhattan distance) is a different metric that does not capture angular similarity. Option D is wrong because computing the dot product directly on non-normalized vectors yields a value that is influenced by both the angle and the magnitudes of the vectors, not purely the cosine of the angle.

Full explanation →

552

MCQmedium

An AI developer is building a document Q&A application using LangChain and OCI Generative AI. They need to split large PDF documents into smaller chunks before embedding. Which text splitter should they use to ensure splits respect sentence boundaries while also controlling chunk size?

A.RecursiveCharacterTextSplitter

B.WebBaseLoader

C.PDFLoader

D.TokenTextSplitter

AnswerA

This splitter recursively splits text by a list of separators, keeping paragraphs and sentences intact while controlling chunk size.

Why this answer

RecursiveCharacterTextSplitter splits text recursively by separators (like \n\n, \n, period) to keep semantically related text together and respects sentence-like boundaries. TokenTextSplitter splits by tokens without regard for sentence boundaries. PDFLoader and WebBaseLoader are document loaders, not splitters.

Full explanation →

553

MCQmedium

A developer uses OCI Generative AI's chat endpoint with a system message placed after user messages. The model ignores the system message. What is the most likely reason?

A.The system message is too long

B.Temperature is set too high

C.The model has not been fine-tuned for instruction following

D.The system message is placed after user messages

AnswerD

The standard order is system first, then user; otherwise the model may misinterpret.

Why this answer

In OCI Generative AI's chat endpoint, the system message must be placed before user messages to establish the model's behavior and context. When placed after user messages, the model treats it as part of the conversation history rather than a directive, causing it to be ignored. This ordering is a fundamental requirement for the chat API's message structure.

Exam trap

Oracle often tests the specific API message ordering requirement, where candidates mistakenly attribute the failure to model limitations or hyperparameters rather than the structural placement of the system message.

How to eliminate wrong answers

Option A is wrong because the system message being too long would cause a token limit error or truncation, not silent ignoring. Option B is wrong because temperature controls randomness in output, not whether instructions are followed; a high temperature might produce varied responses but does not cause the model to ignore the system message. Option C is wrong because OCI Generative AI models are pre-trained for instruction following without requiring fine-tuning; the issue is purely about message ordering, not model capability.

Full explanation →

554

Multi-Selectmedium

Which TWO are required components to implement a basic RAG system using OCI services? (Choose two.)

Select 2 answers

A.OCI Object Storage

B.OCI Functions

C.OCI Data Flow

D.OCI Search with OpenSearch

E.OCI Document Understanding

AnswersD, E

Required as the vector database for similarity search.

Why this answer

A RAG system needs a way to parse documents into chunks (OCI Document Understanding) and a vector store to index and search embeddings (OCI Search with OpenSearch).

Full explanation →

555

MCQmedium

A prompt engineer is testing different versions of a prompt to improve accuracy on a classification task. Which practice is most appropriate for systematic refinement?

A.Always increase the number of few-shot examples

B.Run A/B tests on a representative evaluation set and measure accuracy

C.Manually review outputs of a few examples and adjust based on intuition

D.Change the model to a larger one

AnswerB

A/B testing with a labeled dataset provides quantitative evidence to guide refinement.

Why this answer

A/B testing with a holdout evaluation set allows objective comparison of prompt variants and measures performance improvements reliably.

Full explanation →

556

MCQeasy

A developer wants to call the OCI Generative AI service from a Python application running on an OCI Compute instance. Which method is the most secure for authenticating the API calls?

A.Use a resource principal

B.Use the OCI CLI with a config file containing credentials

C.Use instance principals with a dynamic group and policy

D.Use an API signing key stored on the instance

AnswerC

Instance principals allow secure authentication without storing secrets.

Why this answer

Option C is correct because instance principals allow the Compute instance to authenticate to OCI services without storing any credentials on the instance. By assigning a dynamic group and policy, the instance obtains a temporary security token from the OCI metadata service, which is the most secure method for programmatic access from within OCI.

Exam trap

The trap here is that candidates confuse resource principals (used for serverless functions) with instance principals (used for Compute instances), or they assume that storing credentials in a config file is acceptable because it is a common practice in non-OCI environments.

How to eliminate wrong answers

Option A is wrong because resource principals are used for OCI Functions or other OCI resources that need to make API calls, not for Compute instances. Option B is wrong because using the OCI CLI with a config file containing credentials stores long-lived user credentials on the instance, which is less secure and violates the principle of least privilege. Option D is wrong because storing an API signing key on the instance creates a persistent secret that could be compromised if the instance is breached, and it requires manual key rotation.

Full explanation →

557

MCQeasy

Which of the following is a decoder-only model architecture?

A.T5

B.GPT-3

C.BART

D.BERT

AnswerB

GPT-3 is decoder-only, using masked self-attention.

Why this answer

GPT is a decoder-only model. BERT is encoder-only. T5 is encoder-decoder.

Full explanation →

558

MCQmedium

A company needs to evaluate a text summarization model. They have reference summaries and want a metric that measures overlap of n-grams. Which metric is MOST appropriate?

A.BLEU

B.Perplexity

C.ROUGE

D.BERTScore

AnswerC

ROUGE measures recall of n-grams and is standard for summarization evaluation.

Why this answer

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the most appropriate metric because it measures the overlap of n-grams between the generated summary and reference summaries, directly aligning with the company's requirement. Unlike BLEU, which focuses on precision, ROUGE emphasizes recall, making it better suited for summarization tasks where capturing all key content from the reference is critical.

Exam trap

Cisco often tests the distinction between precision-focused metrics (BLEU) and recall-focused metrics (ROUGE), trapping candidates who assume BLEU is suitable for summarization because it also measures n-gram overlap, without recognizing that summarization evaluation prioritizes recall of reference content over precision of generated output.

How to eliminate wrong answers

Option A is wrong because BLEU (Bilingual Evaluation Understudy) primarily measures precision of n-gram overlap and was designed for machine translation, not summarization; it penalizes shorter outputs and does not emphasize recall of reference content. Option B is wrong because Perplexity measures how well a language model predicts a sequence, not n-gram overlap between summaries, and is used for evaluating language model fluency, not summarization quality. Option D is wrong because BERTScore uses contextual embeddings from BERT to compute semantic similarity via cosine similarity, not direct n-gram overlap, and while it captures meaning, it does not meet the specific requirement for n-gram-based overlap measurement.

Full explanation →

559

MCQmedium

A company's AI system uses RAG to answer customer questions. Users often get incomplete answers because the retrieved chunks do not contain all relevant information. Which step in the RAG pipeline is most likely the issue?

A.Retrieval top-k setting

B.Generation model temperature

C.Chunking strategy (chunk size and overlap)

D.Embedding model selection

AnswerC

If chunks are too small or have insufficient overlap, relevant information may be split, leading to incomplete retrieval.

Why this answer

Chunking determines how documents are split into pieces. If chunks are too small, key information may be split across chunks, causing incomplete retrieval. Adjusting chunk size and overlap can improve completeness.

Full explanation →

560

MCQeasy

An organization wants to deploy a generative AI chatbot using OCI Generative AI service. The chatbot must comply with data residency requirements by ensuring that all data processing occurs within a specific geographic region. What is the best practice to achieve this?

A.Use a dedicated AI cluster in the required region

B.Enable cross-region replication for disaster recovery

C.Configure a tenancy-wide policy to restrict region usage

D.Use IAM policies to block access from other regions

AnswerA

Dedicated AI clusters are region-specific and ensure data stays in that region.

Why this answer

Option A is correct because OCI Generative AI service allows you to provision a dedicated AI cluster within a specific region, ensuring all model inference and data processing remain within that geographic boundary. This dedicated cluster is isolated from other regions and complies with data residency requirements by design, as no data leaves the chosen region during processing.

Exam trap

The trap here is that candidates confuse data residency with access control or disaster recovery, thinking that IAM policies or replication settings can enforce geographic data boundaries, when in fact only the physical placement of the compute cluster guarantees data stays within a region.

How to eliminate wrong answers

Option B is wrong because cross-region replication is a disaster recovery feature that copies data to another region, which would violate data residency by moving data outside the required geographic region. Option C is wrong because tenancy-wide policies restrict where resources can be created, but they do not control where data processing occurs for an existing AI cluster; data could still be processed in a different region if the cluster is not explicitly placed. Option D is wrong because IAM policies block user access from other regions but do not prevent the AI service from processing data in a region other than the required one; data residency is about data location, not access control.

Full explanation →

561

MCQhard

A company is deploying a LangChain agent that uses a custom tool to query an external API. The agent must handle rate limits gracefully. Which approach should the developer implement?

A.Increase the tool's timeout to reduce request frequency

B.Deploy multiple agent instances to distribute requests

C.Ignore rate limits and rely on the LLM to slow down

D.Use the Tool's built-in RateLimiter wrapper with a specified max_requests_per_second

AnswerD

The RateLimiter wrapper automatically throttles calls to the tool.

Why this answer

LangChain provides a built-in RateLimiter tool wrapper that can be applied to any tool to enforce rate limits. Alternatively, custom retry logic can be added, but using the built-in wrapper is the recommended pattern.

Full explanation →

562

MCQhard

A team is building a code generation assistant and needs to choose between fine-tuning a base LLM or using in-context learning with a few examples. They have 500 high-quality code examples. The assistant must generate code for a wide variety of tasks. Which approach is BETTER and why?

A.Fine-tuning, because it reduces inference cost compared to providing examples each time

B.Fine-tuning, because it permanently encodes the examples into the model weights

C.In-context learning, because it allows the model to adapt to each task dynamically without risking catastrophic forgetting

D.In-context learning, because it requires no additional training infrastructure

AnswerC

In-context learning uses the model's existing knowledge and adapts via examples in the prompt, which is more flexible for diverse tasks with a small dataset.

Why this answer

Fine-tuning with 500 examples may lead to overfitting or catastrophic forgetting, especially when the tasks are diverse. In-context learning with a few examples per task is more flexible and leverages the model's pre-trained knowledge. The small dataset size makes fine-tuning risky.

Full explanation →

563

MCQeasy

A team is using OCI Generative AI Agents to build a customer support bot. The bot sometimes generates answers that contradict the knowledge base. What is the most likely cause?

A.The chunking strategy for the knowledge base does not capture enough context overlap.

B.The max tokens value is too low, truncating the response.

C.The temperature parameter is set too high, causing the model to hallucinate.

D.The model's repetition penalty is too high.

AnswerA

If chunks are too small or lack overlap, the model may not retrieve all relevant information, leading to inconsistencies.

Why this answer

Option A is correct because when the chunking strategy lacks sufficient context overlap, the retrieved chunks may omit critical surrounding information, causing the generative AI model to infer missing details incorrectly and produce answers that contradict the knowledge base. In OCI Generative AI Agents, the chunking strategy determines how documents are split into smaller pieces for retrieval; without adequate overlap, the model loses the semantic continuity needed to stay faithful to the source material.

Exam trap

Oracle often tests the misconception that hallucinations are always caused by temperature settings, when in fact retrieval quality issues like poor chunking are a more common root cause in RAG-based systems.

How to eliminate wrong answers

Option B is wrong because a low max tokens value truncates the response length but does not cause the model to generate contradictory content; it simply cuts off the output prematurely. Option C is wrong because while a high temperature parameter increases randomness and can lead to hallucinations, the question specifically states the bot contradicts the knowledge base, which is more directly tied to retrieval failures (chunking) than to generation randomness. Option D is wrong because a high repetition penalty discourages the model from repeating phrases, which might reduce fluency but does not cause contradictions with the knowledge base.

Full explanation →

564

MCQhard

An organization needs to deploy a fine-tuned model for real-time inference with strict latency requirements. They have provisioned a Dedicated AI Cluster with 2 model units. Which statement about this setup is accurate?

A.The cluster provides low-latency, dedicated inference for the fine-tuned model, and you are billed per model unit.

B.You must use the shared infrastructure for fine-tuned models; dedicated clusters are only for base models.

C.The cluster automatically scales model units based on request load.

D.The cluster can only host OCI’s built-in models, not custom fine-tuned models.

AnswerA

Dedicated clusters offer dedicated compute for low-latency inference, and costs are based on model units provisioned.

Why this answer

Dedicated AI Clusters provide low-latency dedicated inference, and model units are used to allocate capacity for running models, including custom fine-tuned ones. The cluster is specifically designed for hosting fine-tuned models with consistent performance.

Full explanation →

565

MCQmedium

A company uses OCI Generative AI to power a chatbot for customer support. They notice that the model's responses sometimes contain factual inaccuracies. Which strategy would best reduce hallucination?

A.Implementing Retrieval-Augmented Generation (RAG).

B.Increasing the temperature parameter.

C.Reducing the max token limit.

D.Fine-tuning the model on a larger general corpus.

AnswerA

RAG retrieves relevant facts from a knowledge base, grounding the output and reducing hallucination.

Why this answer

Retrieval-Augmented Generation (RAG) grounds the model's responses in retrieved factual information, directly reducing hallucination. Increasing temperature increases randomness, fine-tuning on a larger corpus may not fix factual accuracy, and reducing max tokens does not affect correctness.

Full explanation →

566

MCQhard

Refer to the exhibit. The dashboard shows latency grouped by modelId, but some points are missing for certain modelIds. Which of the following is the most likely reason?

A.The metric name is misspelled

B.The aggregation interval is too short

C.The modelIds with missing data may have been deleted or are inactive

D.The compartmentId is incorrect

AnswerC

Inactive or deleted models stop emitting metrics, leading to gaps in the time series.

Why this answer

Option C is correct because in OCI's Generative AI service, model deployments are associated with specific modelIds. If a modelId is deleted or its deployment is deactivated, the corresponding telemetry data (e.g., latency metrics) will no longer be reported, causing gaps in the dashboard. The dashboard aggregates metrics only for active modelIds, so missing points indicate that those modelIds are no longer in service.

Exam trap

The trap here is that candidates may confuse missing data due to inactive resources with configuration errors (e.g., metric name typos or compartment mismatches), but Cisco tests the understanding that metric gaps are often caused by resource lifecycle events rather than misconfiguration.

How to eliminate wrong answers

Option A is wrong because a misspelled metric name would cause all data points to be missing for all modelIds, not just selective gaps. Option B is wrong because a too-short aggregation interval would result in sparse or noisy data across all modelIds, not missing points for specific ones. Option D is wrong because an incorrect compartmentId would prevent any metrics from being displayed for the entire dashboard, not just for certain modelIds.

Full explanation →

567

MCQeasy

A company deploys a fine-tuned Llama 2 model using OCI Generative AI service. They want to ensure low-latency inference for a real-time chat application. Which deployment option should they use?

A.Batch inference job

B.OCI Functions

C.Dedicated AI cluster

D.Serverless endpoint (standard)

AnswerC

Dedicated AI clusters offer reserved capacity and low latency for real-time inference.

Why this answer

A dedicated AI cluster provides reserved compute resources (GPUs) for low-latency, real-time inference by eliminating resource contention. This is essential for a fine-tuned Llama 2 model in a chat application where consistent sub-second response times are required, unlike shared or serverless options that introduce cold starts or queuing delays.

Exam trap

The trap here is that candidates confuse 'serverless endpoint (standard)' with a low-latency option, not realizing that its shared infrastructure and potential cold starts make it unsuitable for real-time inference, while a dedicated cluster guarantees consistent performance.

How to eliminate wrong answers

Option A is wrong because batch inference jobs are designed for asynchronous, high-throughput processing of large datasets, not for real-time, low-latency chat interactions. Option B is wrong because OCI Functions is a serverless compute service with cold-start latency and limited GPU support, making it unsuitable for sustained, low-latency model inference. Option D is wrong because a serverless endpoint (standard) uses shared infrastructure that can experience variable latency due to multi-tenancy and scaling delays, which is not acceptable for real-time chat.

Full explanation →

568

Multi-Selectmedium

Which TWO techniques are commonly used to reduce the memory footprint of LLM inference?

Select 2 answers

A.Quantization

B.Increasing batch size

C.KV cache optimization

D.Gradient checkpointing

E.Using full precision (FP32)

AnswersA, C

Reduces memory by using lower precision weights.

Why this answer

Quantization reduces the memory footprint by lowering the precision of model weights and activations from FP32 to lower bit-widths like INT8 or FP16, which directly decreases the memory required to store and compute with the model. KV cache optimization reduces memory usage by efficiently managing the key-value cache during autoregressive decoding, often through techniques like shared memory, pruning, or compression, which is critical for long-context inference.

Exam trap

Oracle often tests the distinction between training and inference techniques, so candidates mistakenly apply gradient checkpointing (a training memory saver) to inference, or confuse batch size scaling with memory reduction.

Full explanation →

569

Multi-Selectmedium

Which TWO factors are most likely to cause hallucinations in LLMs?

Select 2 answers

A.High temperature

B.Short context window

C.Excessive fine-tuning

D.Low top-p

E.Inadequate training data

AnswersA, E

High temperature increases randomness, leading to less factual outputs.

Why this answer

A high temperature setting increases the randomness of token sampling, making the model more likely to generate plausible-sounding but factually incorrect or nonsensical outputs. This directly contributes to hallucinations by encouraging the model to deviate from the most probable, grounded responses.

Exam trap

Oracle often tests the misconception that low top-p or short context windows are primary causes of hallucinations, when in fact high temperature and insufficient training data are the two most direct factors that increase the likelihood of generating false or fabricated content.

Full explanation →

570

MCQeasy

Which OCI Generative AI model family is optimized for generating text embeddings that capture semantic meaning for tasks like clustering and classification?

A.Cohere Embed

B.Cohere Rerank

C.Meta Llama 3

D.Cohere Command R

AnswerA

Cohere Embed models are optimized for text embeddings, supporting tasks like clustering, classification, and search.

Why this answer

Cohere Embed models are specifically designed for embedding text into vectors. Cohere Command and Meta Llama are generative models, not embedding models.

Full explanation →

571

MCQhard

A prompt engineer is designing a system that must extract structured data from unstructured text. The model occasionally outputs extra text beyond the required JSON. Which parameter should be adjusted to enforce strict output format?

A.Increase the frequency penalty

B.Reduce the temperature to 0

C.Increase the top-p value

D.Set a stop sequence to the closing delimiter of the JSON (e.g., '}')

AnswerD

Stop sequences halt generation when the specified token or string is produced, ensuring no additional output after the JSON.

Why this answer

Stop sequences tell the model when to stop generating. By adding a stop sequence like '}' (end of JSON), the model will terminate after the JSON object, preventing extra text.

Full explanation →

572

Multi-Selectmedium

A company wants to implement a retrieval-augmented generation (RAG) chatbot using OCI Generative AI Agents. Which TWO services or components are required for this solution?

Select 2 answers

A.Cohere Embed model

B.Fine-tuned model

C.Knowledge Base

D.OCI Object Storage

E.Dedicated AI Cluster

AnswersC, D

The knowledge base indexes documents and enables retrieval.

Why this answer

Option C is correct because a Knowledge Base is the core repository that stores the documents or data sources the RAG chatbot retrieves from. OCI Generative AI Agents use a knowledge base to perform retrieval-augmented generation, where the agent first retrieves relevant chunks from the knowledge base and then passes them to the generative model to produce a grounded, context-aware response.

Exam trap

The trap here is that candidates often assume a dedicated AI cluster or a specific embedding model is mandatory for RAG, when in fact OCI Generative AI Agents abstract away these details and only require a knowledge base and a data source like Object Storage.

Full explanation →

573

Multi-Selecteasy

Which TWO techniques can help reduce bias in LLM outputs?

Select 2 answers

A.Setting temperature to 0

B.Using only English data

C.Using diverse training data

D.Increasing model size

E.Applying adversarial debiasing

AnswersC, E

Diverse data reduces representation bias.

Why this answer

Option C is correct because using diverse training data helps the model learn from a wide range of perspectives, reducing the risk of over-representing any single group or viewpoint. This directly mitigates bias by ensuring the training distribution is more representative of the real world, rather than skewed toward a dominant demographic or cultural norm.

Exam trap

Oracle often tests the misconception that lowering temperature or increasing model size can fix bias, when in reality these parameters affect randomness and capacity, not the underlying distributional fairness of the training data.

Full explanation →

574

MCQhard

A developer is using a Cohere Command model via OCI Generative AI. They want the model to generate responses strictly in JSON format for a specific task, but the model sometimes outputs additional explanatory text. Which prompt engineering technique is MOST effective?

A.Include a single example of a JSON output in the user message

B.Set the temperature to 0.0 to make the model deterministic

C.Use a stop sequence '}' to force the model to stop after the JSON

D.Add a preamble: 'You are a JSON generator. Output only valid JSON. Do not include any other text.'

AnswerD

The preamble acts as a system prompt, setting the role and strict output constraint. Combined with explicit instructions, this effectively suppresses extra text.

Why this answer

Using a system prompt to set the persona (e.g., 'You are a JSON generator') and including a step-by-step instruction reduces unwanted text. Cohere's preamble works like a system prompt to enforce constraints.

Full explanation →

575

MCQmedium

In OCI Generative AI, when using the Cohere Command model, which parameter is used to discourage the model from repeating the same phrases?

A.presence_penalty

B.frequency_penalty

C.temperature

D.top-k

AnswerB

Frequency penalty applies a penalty proportional to token frequency, reducing repetition.

Why this answer

Frequency penalty reduces the likelihood of tokens that have already appeared, directly targeting repetition of phrases.

Full explanation →

576

Multi-Selectmedium

A developer is using the OCI Generative AI Chat API to create a customer support bot. They want the bot to maintain a consistent personality and follow specific guidelines. Which TWO settings should they use?

Select 2 answers

A.System prompt

B.Preamble override

C.Temperature

D.Frequency penalty

E.Max tokens

AnswersA, B

System prompt defines the assistant's behavior and constraints.

Why this answer

A system prompt sets the bot's behavior and guidelines, and preamble override allows customizing the model's initial instructions. Temperature and max tokens do not define personality or guidelines.

Full explanation →

577

MCQmedium

A data scientist needs to fine-tune a large language model on a custom dataset of 10,000 prompt-completion pairs. They want to minimize cost while still updating the model effectively. Which fine-tuning technique is used by OCI Generative AI service?

A.Prefix tuning

B.T-Few fine-tuning

C.Adapter fine-tuning

D.LoRA fine-tuning

AnswerB

T-Few is the parameter-efficient fine-tuning method provided by OCI GenAI service.

Why this answer

OCI Generative AI uses T-Few, which updates only a small number of parameters via learned transformations, reducing computational cost while maintaining performance. Adapter, LoRA, and prefix tuning are general PEFT methods but not the specific technique offered by OCI.

Full explanation →

578

MCQeasy

Refer to the exhibit. Why did the embedding creation fail?

A.The input text is too short

B.The API call was not properly authenticated

C.The model ID is not available in the us-ashburn-1 region

D.The region is not enabled for the Generative AI service

AnswerB

The MissingAuthenticationError indicates no credentials were provided.

Why this answer

The 401 status code in the exhibit indicates an authentication failure. The API call requires a valid OAuth 2.0 token or API key in the Authorization header; without proper authentication, the embedding creation request is rejected regardless of input length, model availability, or region enablement.

Exam trap

Cisco often tests the distinction between authentication (401) and authorization (403) errors, and candidates mistakenly attribute a 401 to model or region issues instead of recognizing it as a credential problem.

How to eliminate wrong answers

Option A is wrong because there is no minimum input length requirement for embedding creation; even a single token can generate an embedding. Option C is wrong because the error code 401 is unrelated to model availability; a model ID unavailability would return a 404 or 400 error, not an authentication error. Option D is wrong because region enablement issues would produce a 403 or 400 error, not a 401; the 401 specifically indicates missing or invalid credentials.

Full explanation →

579

Multi-Selecthard

A developer is troubleshooting low recall in a vector search. Which THREE factors should be checked? (Choose three.)

Select 3 answers

A.Embedding model quality and relevance to domain

B.Chunk size and overlap strategy

C.Quality of the query embedding generation

D.The number of results returned (k) in the search

E.The LLM's temperature setting

AnswersA, B, C

A model not trained on similar data may produce poor embeddings.

Why this answer

Option A is correct because the embedding model's quality and domain relevance directly determine how well semantic relationships are captured. If the model is not fine-tuned on domain-specific data, it may fail to map similar concepts close together in the vector space, leading to low recall. For example, a general-purpose model may not distinguish between 'bank' as a financial institution versus a river bank in a legal document search.

Exam trap

Cisco often tests the misconception that retrieval parameters like k or generation parameters like temperature affect recall, when in fact recall is primarily determined by embedding quality, chunking strategy, and query embedding fidelity.

Full explanation →

580

Multi-Selectmedium

A data scientist is designing a prompt for code generation and needs to reduce the likelihood of the model generating incorrect or hallucinated code. Which two parameter adjustments are most effective? (Choose two.)

Select 2 answers

A.Set top-k to a high value (e.g., 100)

B.Set frequency_penalty to a moderate value (e.g., 0.5)

C.Increase max_tokens significantly

D.Set presence_penalty to 0

E.Set temperature to a low value (e.g., 0.1)

AnswersB, E

Frequency penalty discourages repetition of tokens, which can reduce hallucinated patterns.

Why this answer

Lowering temperature reduces randomness, making outputs more deterministic and less prone to hallucinations. Frequency penalty reduces repetitive mistakes. Top-k and presence penalty are less directly effective.

Full explanation →

581

MCQmedium

A developer is using the Cohere Command model for text generation and wants to ensure the output is deterministic for testing purposes. Which sampling strategy should they use?

A.Top-k sampling with k=50

B.Temperature sampling with temperature=0.7

C.Top-p (nucleus) sampling with p=0.9

D.Greedy decoding

AnswerD

Greedy decoding picks the most likely token at each step, making outputs deterministic.

Why this answer

Greedy decoding always selects the token with highest probability, producing the same output for a given input. Temperature, top-k, and top-p introduce randomness.

Full explanation →

582

MCQhard

A team is building a Retrieval-Augmented Generation (RAG) pipeline using OCI Generative AI. They need to store and retrieve document embeddings for semantic search. Which OCI service is most appropriate as the vector store?

A.OCI Search with OpenSearch

B.OCI Streaming

C.OCI Object Storage

D.OCI Autonomous Database with AI Vector Search

AnswerA

OpenSearch supports vector storage and k-NN search, making it ideal for RAG pipelines.

Why this answer

OCI Search with OpenSearch is the most appropriate vector store for a RAG pipeline because it natively supports storing and querying high-dimensional vector embeddings using the k-nearest neighbor (k-NN) algorithm. It integrates directly with OCI Generative AI to enable semantic search over ingested documents, providing the required similarity search capabilities for retrieval-augmented generation.

Exam trap

Cisco often tests the misconception that any database with vector support (like Autonomous Database) is the best choice, but the question specifically asks for the 'most appropriate' vector store in a RAG pipeline, where OpenSearch's dedicated vector search engine and direct OCI Generative AI integration make it the optimal answer.

How to eliminate wrong answers

Option B is wrong because OCI Streaming is a real-time data ingestion and messaging service designed for event streams, not for storing or querying vector embeddings. Option C is wrong because OCI Object Storage is a durable, scalable blob storage service for unstructured data, but it lacks native vector indexing and similarity search functionality. Option D is wrong because while OCI Autonomous Database with AI Vector Search does support vector operations, it is not the most appropriate choice for a dedicated vector store in a RAG pipeline; OCI Search with OpenSearch is purpose-built for vector search and offers better performance and simpler integration with OCI Generative AI.

Full explanation →

583

MCQeasy

Which Oracle AI Vector Search index type is designed for approximate nearest neighbor search and uses a navigable small world graph?

A.VECTOR

B.HNSW

C.IVF

D.BTREE

AnswerB

HNSW uses a multi-layer navigable small world graph for efficient ANN search.

Why this answer

HNSW (Hierarchical Navigable Small World) is a graph-based index for ANN search.

Full explanation →

584

MCQeasy

Which prompting technique involves providing the model with a small set of input-output examples within the prompt to guide its behavior?

A.Tree-of-thought prompting

B.Chain-of-thought prompting

C.Few-shot prompting

D.Zero-shot prompting

AnswerC

Correct: few-shot provides a few examples.

Why this answer

Few-shot prompting includes examples of desired input-output pairs, helping the model infer the task without fine-tuning.

Full explanation →

585

MCQmedium

A team is fine-tuning a foundation model on a large dataset stored in OCI Object Storage. They want to minimize data transfer costs. What is the best practice for locating the storage?

A.Place the bucket in the same region and availability domain as the fine-tuning job

B.Use OCI File Storage instead of Object Storage

C.Use a cross-region bucket to leverage geographically distributed data

D.Place the bucket in the same region as the fine-tuning job

AnswerD

Correct: Same-region transfer is free of charge.

Why this answer

Option D is correct because placing the Object Storage bucket in the same OCI region as the fine-tuning job eliminates cross-region data transfer charges. OCI charges egress fees when data moves between regions, but intra-region data transfer between services in the same region is free. This minimizes costs while keeping the data accessible for the fine-tuning workload.

Exam trap

Oracle often tests the misconception that specifying an availability domain (Option A) is necessary for cost optimization, when in fact Object Storage buckets are regional and availability domain selection is irrelevant for data transfer costs.

How to eliminate wrong answers

Option A is wrong because OCI Object Storage buckets are regional resources, not tied to a specific availability domain; specifying an availability domain is irrelevant and does not affect data transfer costs. Option B is wrong because OCI File Storage is a network-attached file system that incurs additional egress costs when accessed from compute instances in a different region or availability domain, and it does not inherently reduce data transfer costs compared to Object Storage. Option C is wrong because a cross-region bucket replicates data across regions, which incurs replication and egress costs, and accessing data from a different region than the fine-tuning job would still result in cross-region data transfer charges.

Full explanation →

586

MCQhard

A company is deploying a multi-language chatbot using OCI Generative AI Service. The chatbot must support English, Spanish, and French. The team finds that responses in Spanish are less accurate than in English. They have a small bilingual dataset. What is the best approach?

A.Use a multilingual base model (e.g., mT5) and fine-tune on the bilingual dataset (English and Spanish) using cross-lingual transfer learning.

B.Use prompt engineering with language-specific instructions in the system prompt.

C.Translate all user queries to English, process them, then translate responses back.

D.Train separate fine-tuned models for each language.

AnswerA

Cross-lingual transfer leverages English data to improve Spanish performance, and fine-tuning on bilingual data further boosts accuracy.

Why this answer

Option A is correct because fine-tuning a multilingual base model like mT5 on a small bilingual dataset leverages cross-lingual transfer learning, where knowledge from high-resource languages (English) improves performance on low-resource languages (Spanish). This approach is specifically designed for scenarios with limited data and directly addresses the accuracy gap without requiring separate models or translation pipelines.

Exam trap

The trap here is that candidates often overestimate the power of prompt engineering (Option B) for language-specific accuracy, underestimating that systematic linguistic errors require model adaptation through fine-tuning or transfer learning, not just instruction tuning.

How to eliminate wrong answers

Option B is wrong because prompt engineering with language-specific instructions does not adapt the model's internal representations; it merely provides contextual cues, which is insufficient to correct systematic inaccuracies in a specific language. Option C is wrong because translating queries to English and back introduces translation errors, latency, and loss of nuance, and does not improve the model's native understanding of Spanish. Option D is wrong because training separate fine-tuned models for each language is inefficient with a small bilingual dataset and fails to exploit cross-lingual transfer, leading to poor performance on the low-resource language.

Full explanation →

587

MCQhard

A company is using OCI Generative AI for a RAG-based code assistant. They index source code repositories into a vector store. Developers report that the assistant often suggests deprecated APIs or outdated code snippets, even though the latest code is in the repository. The index was built a week ago and has not been updated. They plan to set up incremental updates. However, they notice that even after re-indexing the latest commits, the issue persists. What is the most likely oversight?

A.The vector store is not configured to overwrite existing vectors for updated documents.

B.The retrieval top-k is set too low, missing some relevant snippets.

C.The chunking strategy splits code at function boundaries, losing import statements.

D.The embedding model is not fine-tuned on code; it was trained on natural language.

AnswerA

Without overwrite, old vectors persist even after re-indexing, causing retrieval of outdated code.

Why this answer

Option A is correct because if the vector store does not overwrite or update vectors for changed documents, old vectors remain, causing retrieval of outdated code. Option B (chunking at function boundaries) may cause missing imports but not specifically deprecation. Option C (embedding model not fine-tuned on code) might affect quality but not freshness.

Option D (low top-k) would affect recall, not freshness.

Full explanation →

588

MCQhard

Refer to the exhibit. A data scientist received this output after submitting a fine-tuning job. What is the most effective change to resolve the out-of-memory error?

A.Increase the sequence length.

B.Reduce the learning rate.

C.Decrease the number of fine-tuning epochs.

D.Increase the number of nodes in the cluster.

AnswerD

Correct: More nodes mean more total memory, alleviating OOM.

Why this answer

The out-of-memory error during fine-tuning indicates that the model's memory requirements exceed the available resources on the current node. Increasing the number of nodes in the cluster distributes the model parameters, gradients, and optimizer states across multiple GPUs or nodes, effectively increasing the total memory capacity and resolving the OOM error. This is a standard approach in distributed training frameworks like PyTorch DDP or FSDP, which OCI Data Science supports.

Exam trap

Oracle often tests the misconception that reducing epochs or learning rate can fix memory errors, when in fact memory errors are resource constraints that require scaling hardware (more nodes or GPUs) or reducing memory-intensive parameters like batch size or sequence length.

How to eliminate wrong answers

Option A is wrong because increasing the sequence length would increase the memory footprint per sample (due to larger attention matrices), making the OOM error worse, not better. Option B is wrong because reducing the learning rate affects training dynamics and convergence, not memory usage; it does not address the root cause of insufficient memory. Option C is wrong because decreasing the number of fine-tuning epochs reduces total training time but does not change the peak memory consumption per step, so the OOM error would still occur.

Full explanation →

589

Multi-Selecthard

A developer is evaluating OCI GenAI model families. Which three are correct characteristics of the available models? (Choose three.)

Select 3 answers

A.Llama models are open-source and available for fine-tuning

B.All models support real-time streaming of tokens

C.Cohere embedding models produce vector representations

D.OCI GenAI provides both hosted and dedicated deployment options

E.Cohere Command models are optimized for multilingual tasks

AnswersA, C, D

Meta's Llama models are open-source and supported by OCI GenAI for fine-tuning.

Why this answer

Llama models, such as Llama 2 and Llama 3, are open-source large language models originally developed by Meta. OCI GenAI provides them as pre-built models that developers can fine-tune using their own datasets, enabling customization for domain-specific tasks without training from scratch.

Exam trap

Oracle often tests the misconception that all models in a platform share the same capabilities, such as streaming or multilingual optimization, when in reality each model family (e.g., Llama, Cohere Command, Cohere Embed) has distinct design goals and feature sets.

Full explanation →

590

MCQhard

A company fine-tunes an LLM on internal support tickets. After deployment, the model hallucinates company-specific product names. What is the most effective mitigation?

A.Switch to a smaller model to reduce hallucination risk

B.Use prompt engineering to remind the model to be accurate

C.Implement RAG with a verified product database

D.Fine-tune further with more ticket data

AnswerC

RAG provides factual grounding, reducing hallucinations.

Why this answer

RAG (Retrieval-Augmented Generation) grounds the LLM's output in a verified product database, providing factual context that prevents hallucination of company-specific product names. Unlike fine-tuning, which only adjusts model weights and can still produce plausible but incorrect names, RAG retrieves exact records at inference time, ensuring accuracy for proprietary terminology.

Exam trap

Oracle often tests the misconception that fine-tuning alone can fix factual accuracy for domain-specific entities, when in reality RAG is required to ground outputs in a verifiable external knowledge source.

How to eliminate wrong answers

Option A is wrong because switching to a smaller model reduces capacity and often increases hallucination risk due to lower parameter count and less memorization ability. Option B is wrong because prompt engineering is a fragile, surface-level fix that cannot enforce factual accuracy for specific product names; the model may still generate plausible but incorrect names. Option D is wrong because further fine-tuning with more ticket data risks overfitting and does not guarantee elimination of hallucinated product names, as the model can still invent names not present in the training distribution.

Full explanation →

591

MCQeasy

What is a recommended practice to prevent the LLM from generating information not present in the retrieved context when building a RAG application?

A.Setting the temperature to 0.

B.Using a system message that says 'Use only the provided context to answer.'

C.Including few-shot examples in the prompt.

D.Increasing the topK value.

AnswerB

This instruction directly constrains the model to the context.

Why this answer

Option B is correct because explicitly instructing the LLM via a system message to 'Use only the provided context to answer' directly constrains the model to rely solely on the retrieved documents, reducing the risk of hallucination. This prompt engineering technique leverages the model's instruction-following capability to suppress parametric knowledge and enforce grounded generation.

Exam trap

Cisco often tests the misconception that lowering temperature or increasing retrieval size alone prevents hallucination, when in fact explicit prompt instructions are the primary mechanism to enforce context-only generation.

How to eliminate wrong answers

Option A is wrong because setting temperature to 0 makes the output deterministic (lowest randomness) but does not prevent the LLM from using its internal knowledge; it can still generate information not present in the retrieved context. Option C is wrong because few-shot examples improve output formatting or reasoning patterns but do not restrict the model to the retrieved context; the model may still hallucinate from its training data. Option D is wrong because increasing topK retrieves more documents, which can introduce irrelevant or noisy context, but does not control whether the LLM adheres to that context; the model may still generate ungrounded information.

Full explanation →

592

Multi-Selecthard

A data scientist is designing a RAG pipeline using LangChain and Oracle AI Vector Search. They want to ensure that the retrieved documents are diverse and not overly similar to each other. Which TWO approaches can achieve this?

Select 2 answers

A.Set fetch_k larger than k in the retriever's search_kwargs and use MMR

B.Use a smaller chunk_size to create more granular chunks

C.Use a higher chunk_overlap to increase redundancy

D.Use as_retriever with search_type='mmr'

E.Use similarity_search with a high k value

AnswersA, D

Why this answer

MMR (Maximum Marginal Relevance) is designed to balance relevance and diversity. Setting fetch_k > k in the retriever and using MMR can also improve diversity.

Full explanation →

593

MCQeasy

Refer to the exhibit. In this RAG pipeline, what is the role of the 'embedding_model' variable?

A.It converts text into vector representations for similarity search.

B.It applies guardrails to filter content.

C.It fine-tunes the model on the provided texts.

D.It generates text completions based on prompts.

AnswerA

Embeddings are used to index and retrieve relevant documents via vector similarity.

Why this answer

In a Retrieval-Augmented Generation (RAG) pipeline, the 'embedding_model' variable is responsible for converting input text (such as user queries or document chunks) into dense vector representations. These vectors are then used to perform similarity search against a vector database, enabling the retrieval of the most relevant context for the generative model. This is a core function of the embedding step in RAG, not a guardrail, fine-tuning, or text generation role.

Exam trap

Cisco often tests the distinction between the embedding model (for vectorization and retrieval) and the generative model (for text completion), so the trap here is confusing the embedding model's role with that of the LLM or guardrail components in a RAG pipeline.

How to eliminate wrong answers

Option B is wrong because applying guardrails to filter content is typically handled by separate content moderation or safety models, not by the embedding model which focuses on semantic encoding. Option C is wrong because fine-tuning the model on provided texts is a training process that adjusts model weights, whereas the embedding model in a RAG pipeline is used for inference (encoding) and is not fine-tuned during retrieval. Option D is wrong because generating text completions based on prompts is the role of the generative language model (e.g., the LLM in the RAG pipeline), not the embedding model which only produces vector representations.

Full explanation →

594

MCQhard

A healthcare startup is building a chatbot to answer patient inquiries using a large language model (LLM) deployed on OCI Data Science AI Quick Actions. The chatbot must comply with HIPAA regulations, so all patient data must remain within the OCI tenancy and never be sent to third-party APIs. The team has fine-tuned a Llama 2 7B model on de-identified medical records using OCI Data Science notebooks. The model is deployed as a managed endpoint via AI Quick Actions. Early testing shows that the chatbot sometimes generates responses containing specific patient names or dates of birth that were present in the fine-tuning dataset. Moreover, the model occasionally hallucinates medication dosages that are not medically accurate. Which course of action should the team take to address both issues while maintaining HIPAA compliance?

A.Deploy a rule-based post-processing script that checks each response against a list of known patient names and medication dosages, and rejects any response containing them.

B.Switch to a larger model (e.g., Llama 2 70B) to improve accuracy and reduce hallucinations, and apply output filtering to remove any detected PII from responses.

C.Increase the fine-tuning dataset size with more varied de-identified records to reduce overfitting, and apply a temperature setting of 0 to make outputs deterministic.

D.Re-fine-tune the model using differential privacy to limit memorization of training data, and implement retrieval-augmented generation (RAG) with a curated medical knowledge base to ground medication-related responses.

AnswerD

Differential privacy during training reduces the risk of memorizing private data, and RAG grounds responses in a trusted knowledge base, reducing hallucinations. This combination addresses both issues effectively.

Why this answer

Option D is correct because it addresses both memorization of PII and hallucination of medication dosages while maintaining HIPAA compliance. Differential privacy during fine-tuning limits the model's ability to memorize specific patient data, and retrieval-augmented generation (RAG) grounds responses in a curated medical knowledge base, reducing hallucinations without sending data outside the OCI tenancy.

Exam trap

Oracle often tests the misconception that simply filtering outputs or increasing model size can solve memorization and hallucination issues, when in fact only training-time techniques like differential privacy and inference-time grounding like RAG address the root causes.

How to eliminate wrong answers

Option A is wrong because a rule-based post-processing script cannot catch all variations of patient names or hallucinated dosages (e.g., misspellings, new names), and rejecting responses containing known names does not prevent the model from generating them in the first place. Option B is wrong because switching to a larger model (Llama 2 70B) does not inherently reduce memorization of training data or hallucinations; it may even increase both, and output filtering alone cannot guarantee removal of all PII without risking false positives or missing subtle leaks. Option C is wrong because increasing dataset size does not guarantee reduced overfitting or memorization, and setting temperature to 0 makes outputs deterministic but does not prevent the model from reproducing memorized PII or hallucinating dosages; it only removes randomness.

Full explanation →

595

MCQmedium

You deployed a generative AI model on OCI Model Deployment with autoscaling configured based on average CPU utilization. The model is a large language model that heavily utilizes the GPU. During peak hours, the scaling is too slow to keep up with demand, resulting in high latency for users. You want to improve the responsiveness of autoscaling. Which change should you make?

A.Decrease the target CPU utilization threshold for scale-out

B.Increase the maximum number of replicas in the autoscaling configuration

C.Use GPU utilization as the scaling metric instead of CPU utilization

D.Increase the cooldown period between scale-out events

AnswerC

GPU utilization directly correlates with inference load, enabling more responsive scaling.

Why this answer

Option C is correct because the model heavily utilizes GPU, not CPU. Autoscaling based on CPU utilization is irrelevant for GPU-bound workloads, leading to delayed scale-out. Using GPU utilization as the scaling metric directly reflects the actual resource bottleneck, enabling faster and more accurate scaling decisions.

Exam trap

The trap here is that candidates assume CPU utilization is always the correct scaling metric for any workload, overlooking that GPU-bound models require a metric that reflects the actual bottleneck.

How to eliminate wrong answers

Option A is wrong because decreasing the target CPU utilization threshold would cause scale-out to trigger at even lower CPU usage, but since the model is GPU-bound, CPU utilization remains low and irrelevant, so this change does not address the root cause. Option B is wrong because increasing the maximum number of replicas only sets an upper limit on scaling; it does not speed up the scaling decision or make it more responsive to demand. Option D is wrong because increasing the cooldown period between scale-out events would actually slow down scaling further, worsening latency during peak hours.

Full explanation →

596

MCQeasy

When building a RAG application for document retrieval, which chunking strategy is recommended to maximize retrieval accuracy?

A.Use fixed-size token chunks with no overlap

B.Use overlapping chunks with a sliding window

C.Use random splitting points

D.Use entire documents as single chunks

AnswerB

Overlap ensures contextual continuity between chunks.

Why this answer

Overlapping chunks with a sliding window ensure that context is preserved across chunk boundaries, which is critical for retrieval accuracy in RAG applications. When a query spans the boundary between two fixed-size chunks, the overlap captures the relevant context in both chunks, reducing the risk of missing key information. This strategy directly addresses the limitation of fixed-size token chunks that may split semantically related content.

Exam trap

Cisco often tests the misconception that fixed-size chunks are optimal for simplicity, but the trap is that candidates overlook how boundary splitting degrades retrieval accuracy in practice.

How to eliminate wrong answers

Option A is wrong because fixed-size token chunks with no overlap can split sentences or concepts across chunk boundaries, causing loss of context and reducing retrieval accuracy. Option C is wrong because random splitting points introduce unpredictability and are likely to break semantic units, leading to poor retrieval performance. Option D is wrong because using entire documents as single chunks ignores the practical limits of embedding model context windows and can result in information dilution, where relevant details are buried within a large vector representation.

Full explanation →

597

MCQeasy

An OCI GenAI practitioner wants to deploy a model that can generate code from natural language descriptions. Which type of model is most suitable?

A.T5

B.ResNet

C.BERT

D.GPT

AnswerD

GPT (decoder-only) excels at autoregressive text generation, ideal for code generation.

Why this answer

GPT (Generative Pre-trained Transformer) is the most suitable model for code generation from natural language because it is an autoregressive language model optimized for text generation tasks. Unlike encoder-only models, GPT generates coherent, contextually relevant sequences of tokens, making it ideal for producing code based on descriptive prompts.

Exam trap

Cisco often tests the distinction between encoder-only (BERT) and decoder-only (GPT) architectures, leading candidates to mistakenly choose BERT for generation tasks because they associate it with language understanding, not realizing it cannot generate sequences.

How to eliminate wrong answers

Option A is wrong because T5 is a text-to-text transformer that, while capable of generation, is primarily designed for translation, summarization, and classification tasks, not specifically optimized for autoregressive code generation like GPT. Option B is wrong because ResNet is a convolutional neural network (CNN) architecture for image recognition and computer vision tasks, not for natural language processing or code generation. Option C is wrong because BERT is an encoder-only transformer model designed for understanding tasks (e.g., classification, question answering) and cannot generate coherent sequences of text or code due to its bidirectional, non-autoregressive architecture.

Full explanation →

598

MCQmedium

A developer creates a prompt for a code generation task but the output often contains syntax errors. Which adjustment to the prompt is MOST likely to improve correctness?

A.Remove the output format specification to give the model freedom

B.Add a few examples of correct code for similar tasks in the prompt

C.Increase the temperature to 0.9 for more creative solutions

D.Set the max tokens to a very high value

AnswerB

Few-shot examples set a clear pattern for the model to follow, improving syntax.

Why this answer

Adding few-shot examples of correct code directly in the prompt provides the model with explicit patterns to follow, reducing syntactic drift. This technique, known as few-shot prompting, anchors the model's output to the demonstrated structure and syntax, which is far more effective than relying on the model's internal knowledge alone for code generation tasks.

Exam trap

Cisco often tests the misconception that increasing creativity (temperature) or removing constraints improves output quality, when in fact structured examples are the most reliable method for enforcing correctness in code generation tasks.

How to eliminate wrong answers

Option A is wrong because removing the output format specification removes crucial constraints that help the model adhere to a consistent structure, increasing the likelihood of malformed output. Option C is wrong because increasing temperature to 0.9 introduces more randomness, which exacerbates syntax errors by encouraging less predictable token sequences. Option D is wrong because setting max tokens to a very high value does not address the root cause of syntax errors; it only allows the model to generate longer outputs, which may contain even more errors.

Full explanation →

599

MCQhard

A developer is deploying a fine-tuned model using OCI Generative AI service. They want to use a custom container image for inference. Which statement is true?

A.Custom containers are only supported with OCI Data Science, not Generative AI.

B.You can upload a container image to OCI Container Registry and reference it when creating a dedicated AI cluster.

C.Custom containers are supported only for fine-tuning jobs, not inference.

D.Custom containers are not supported; only built-in models are available.

AnswerB

Correct: This is the documented approach for using custom inference containers.

Why this answer

Option B is correct because OCI Generative AI service allows you to bring your own custom container image for inference by uploading it to OCI Container Registry (OCIR) and referencing it when creating a dedicated AI cluster. This enables you to deploy fine-tuned models with custom inference logic, dependencies, or frameworks that are not available in the built-in serving containers.

Exam trap

The trap here is that candidates may confuse the scope of custom container support, assuming it is limited to OCI Data Science or only for training, when in fact OCI Generative AI explicitly supports custom containers for inference via dedicated AI clusters.

How to eliminate wrong answers

Option A is wrong because custom containers are supported with OCI Generative AI for inference, not only with OCI Data Science. Option C is wrong because custom containers are supported for inference, not just for fine-tuning jobs; fine-tuning uses built-in containers or custom training containers, but inference also supports custom containers. Option D is wrong because custom containers are indeed supported; you are not limited to only built-in models.

Full explanation →

600

MCQeasy

Which of the following best describes the role of the self-attention mechanism in a Transformer model?

A.It encodes the order of tokens in the sequence

B.It computes a weighted sum of all input token representations, where weights depend on pairwise compatibility between tokens

C.It applies a convolutional filter over local windows of tokens

D.It replaces the need for positional encoding by using recurrence

AnswerB

Self-attention calculates attention scores between every pair of tokens and uses them to aggregate information.

Why this answer

The self-attention mechanism computes a weighted sum of all input token representations, where the weights are determined by the pairwise compatibility (attention scores) between tokens. This allows each token to dynamically attend to every other token in the sequence, capturing global dependencies without the limitations of fixed local windows or recurrence.

Exam trap

Cisco often tests the misconception that self-attention inherently encodes positional information, when in fact it is permutation-invariant and relies on separate positional encodings to maintain sequence order.

How to eliminate wrong answers

Option A is wrong because encoding the order of tokens is the role of positional encoding, not the self-attention mechanism itself; self-attention is permutation-invariant and requires explicit positional information. Option C is wrong because applying a convolutional filter over local windows describes a CNN approach, not the global, pairwise weighting of self-attention in Transformers. Option D is wrong because self-attention does not replace positional encoding; it operates without recurrence, but positional encoding is still necessary to inject sequence order information into the model.

Full explanation →

Page 8 of 14

All pages

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Practice 1Z0-1127 by domain

Target a specific domain to shore up weak areas.

Prompt Engineering OCI Generative AI Service LLM Fundamentals LangChain and AI Application Development Fundamentals of Large Language Models Using OCI Generative AI Service Building LLM Applications with RAG and Vector Search Deploying and Managing Generative AI on OCI

See all domains with question counts →