CCNA Deploying and Managing Generative AI on OCI Questions


1
Multi-Select · easy

Which two metrics would you monitor to ensure a generative AI deployment on OCI is operating efficiently? (Choose two.)

Select 2 answers
A.Number of active users
B.Object Storage bucket size
C.Request throughput (requests per second)
D.Average inference latency
E.Model accuracy on validation set
Answers: C, D

Throughput indicates how many requests the system can handle.

Why this answer

Request throughput (requests per second) is a critical metric for monitoring the operational efficiency of a generative AI deployment on OCI because it directly measures the system's capacity to handle incoming inference requests. If throughput drops below expected levels, it points to a bottleneck in the compute resources (e.g., saturated GPUs) or the serving infrastructure, which can lead to degraded user experience and timeouts. Average inference latency complements throughput by measuring how long each request takes to complete; latency that rises while throughput stays flat is a typical sign that the deployment is saturated.

Exam trap

Oracle often tests the distinction between model quality metrics (like accuracy) and operational efficiency metrics (like latency and throughput), so the trap here is that candidates mistakenly select 'Model accuracy on validation set' because they confuse model performance with deployment performance.
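
As a rough illustration of the two operational metrics, the sketch below derives throughput and average latency from request start and end times. The RequestRecord type and the sample timings are assumptions made for this example and are not part of any OCI API.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float  # arrival time of the request, in seconds
    end_s: float    # time the response was returned, in seconds


def throughput_rps(records):
    """Completed requests divided by the observed wall-clock window."""
    window = max(r.end_s for r in records) - min(r.start_s for r in records)
    return len(records) / window


def average_latency_s(records):
    """Mean time between arrival and response."""
    return sum(r.end_s - r.start_s for r in records) / len(records)


# Hypothetical sample: four requests completed within a two-second window.
records = [
    RequestRecord(0.0, 0.5),
    RequestRecord(0.5, 1.5),
    RequestRecord(1.0, 2.0),
    RequestRecord(1.5, 2.0),
]

print(throughput_rps(records), average_latency_s(records))
```

Neither value depends on model accuracy, which is why accuracy belongs to model quality rather than deployment efficiency.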

2
MCQ · hard

You are deploying a generative AI solution on OCI for a healthcare client that requires strict data residency (data must remain in the EU) and low-latency inference. The solution uses a fine-tuned LLM model (7B parameters) stored in Object Storage in the Frankfurt region. You have set up an OCI Data Science model deployment endpoint with GPU shape VM.GPU.A10.1, using a single replica. During load testing with 50 concurrent users, you observe high latency (average 8 seconds per request) and occasional 504 gateway timeouts. The model deployment logs show no errors, and the model loads successfully. You have confirmed that the Object Storage bucket is in the same region and that the network latency between the client and the endpoint is minimal (under 5 ms). Which action should you take to reduce latency and eliminate timeouts?

A.Increase the model deployment endpoint timeout setting from 60 seconds to 300 seconds in the OCI console.
B.Upgrade the model deployment shape to VM.GPU.A100.4 and keep a single replica.
C.Increase the number of replicas to 3 and enable autoscaling based on CPU utilization.
D.Move the model deployment to the US East (Ashburn) region to leverage lower-cost GPU capacity and reduce latency.
Answer: C

Option C is correct because increasing the number of replicas to handle concurrent requests reduces queuing and improves throughput, while also enabling load balancing to avoid timeouts.

Why this answer

Option C is correct because the high latency and 504 timeouts with 50 concurrent users indicate that a single GPU replica is overwhelmed by the request queue. Increasing replicas to 3 distributes the load across multiple model server instances behind the same endpoint, while enabling autoscaling based on CPU utilization adds capacity dynamically during traffic spikes. This reduces the time requests spend queuing and addresses the timeouts without violating data residency requirements.

Exam trap

The trap here is that candidates often confuse increasing timeout (Option A) or upgrading GPU size (Option B) as solutions to concurrency issues, when in fact horizontal scaling via replicas is required to handle multiple simultaneous requests without violating data residency constraints.

How to eliminate wrong answers

Option A is wrong because increasing the timeout from 60 to 300 seconds only masks the symptom of slow inference; it does not address the root cause of insufficient compute capacity, and 504 timeouts will still occur if requests queue up beyond the timeout. Option B is wrong because upgrading to a larger GPU shape (A100.4) with a single replica may speed up each individual request but does not resolve the concurrency bottleneck; with 50 concurrent users, a single replica still queues requests, leading to high latency and potential timeouts. Option D is wrong because moving the deployment to US East violates the strict data residency requirement that data must remain in the EU, and it does not solve the concurrency issue; cross-region data transfer would also add latency.
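
The queuing argument can be made concrete with a back-of-the-envelope sketch. It assumes each replica serves one request at a time and that requests are spread evenly; the 0.5-second service time is a hypothetical figure, not a measurement from the scenario.

```python
import math


def worst_case_latency_s(concurrent_requests, replicas, service_time_s):
    """Latency seen by the last request when each replica serves one request at a time."""
    # With even distribution, the busiest replica handles ceil(N / R) requests in sequence.
    per_replica = math.ceil(concurrent_requests / replicas)
    return per_replica * service_time_s


# Hypothetical service time of 0.5 s per request, 50 concurrent requests.
print(worst_case_latency_s(50, 1, 0.5))  # single replica
print(worst_case_latency_s(50, 3, 0.5))  # three replicas
```

Real model servers may batch requests, so actual numbers will differ, but the direction holds: queuing delay shrinks roughly in proportion to the number of replicas, whereas a longer timeout changes nothing about the queue.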

3
Multi-Select · easy

Which TWO are valid methods to monitor the performance of a generative AI model deployed on OCI Data Science?

Select 2 answers
A.Use OCI Notifications to receive alerts on model drift
B.Use OCI Monitoring service to track custom metrics like latency and throughput
C.Use OCI Logging service to collect inference logs
D.Use OCI Events service to trigger retraining on low accuracy
E.Use OCI Audit service to review API call logs
Answers: B, C

Allows pushing custom metrics from the inference script.

Why this answer

Option B is correct because OCI Monitoring service allows you to define and track custom metrics such as inference latency (e.g., p50/p99 response times) and throughput (requests per second) for your generative AI model deployed on OCI Data Science. This enables real-time performance monitoring and alerting based on thresholds you set, which is essential for production AI workloads.

Exam trap

Oracle often tests the distinction between monitoring (OCI Monitoring), logging (OCI Logging), and notification/event services, so candidates mistakenly select OCI Notifications or OCI Events as monitoring tools when they are actually reactive or alerting services.

4
Multi-Select · medium

A company is deploying a large generative AI model on OCI using GPU compute instances. They want to optimize inference cost while maintaining acceptable latency. Which TWO strategies should they implement?

Select 2 answers
A.Enable provisioned concurrency on all models.
B.Select the smallest GPU instance type that meets latency requirements.
C.Increase the max-tokens parameter to generate longer responses.
D.Deploy the model on multiple large GPU instances to handle peak load.
E.Use an inference endpoint with auto-scaling to match demand.
Answers: B, E

Choosing appropriate instance size avoids paying for unused capacity.

Why this answer

Option B is correct because selecting the smallest GPU instance type that meets latency requirements directly reduces compute cost per inference without sacrificing user experience. This aligns with OCI's pay-as-you-go GPU pricing, where larger instances incur higher hourly costs. The key is to right-size the GPU based on model memory footprint and inference throughput, not to over-provision.

Exam trap

Oracle often tests the misconception that 'bigger GPU instances always improve performance' or that 'provisioned concurrency applies to all OCI services,' when in fact it is specific to serverless compute and irrelevant to GPU inference endpoints.
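
Right-sizing can be sketched as a simple selection over benchmark results. The shape names, hourly costs, and latency figures below are hypothetical placeholders, not OCI prices or measurements.

```python
def cheapest_shape(candidates, max_latency_ms):
    """Pick the lowest-cost candidate whose measured latency meets the target."""
    eligible = [c for c in candidates if c["p95_latency_ms"] <= max_latency_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["hourly_cost"])["name"]


# Hypothetical benchmark results for three candidate shapes.
candidates = [
    {"name": "small-gpu", "hourly_cost": 2.0, "p95_latency_ms": 900},
    {"name": "medium-gpu", "hourly_cost": 4.0, "p95_latency_ms": 400},
    {"name": "large-gpu", "hourly_cost": 12.0, "p95_latency_ms": 250},
]

print(cheapest_shape(candidates, max_latency_ms=500))
```

With a 500 ms target the medium shape is selected: the small one misses the target and the large one pays for headroom that is not needed.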

5
MCQ · medium

A team wants to use OCI Generative AI to generate synthetic data for training a model. They are concerned about the cost of API calls. Which pricing model would be most cost-effective for high-volume batch processing?

A.OCI Universal Credits with per-request charges
B.Monthly subscription with limited requests
C.Pay-as-you-go per request
D.Reserved capacity with a fixed monthly fee
Answer: D

Reserved capacity offers predictable pricing and lower per-request cost for high volume.

Why this answer

Option D is correct because reserved capacity with a fixed monthly fee provides the lowest per-request cost for high-volume batch processing. OCI Generative AI offers dedicated capacity pricing, which is ideal for predictable, large-scale workloads where you commit to a certain throughput, avoiding per-request charges that would accumulate significantly with high volume.

Exam trap

Oracle often tests the misconception that pay-as-you-go is always the cheapest for any workload, but the trap here is that high-volume batch processing benefits from reserved capacity's flat fee, which lowers per-request costs significantly compared to per-request pricing models.

How to eliminate wrong answers

Option A is wrong because OCI Universal Credits with per-request charges would be expensive for high-volume batch processing, as each API call incurs a separate cost, leading to unpredictable and high expenses. Option B is wrong because a monthly subscription with limited requests would cap the number of requests, making it unsuitable for high-volume batch processing where you need to generate large amounts of synthetic data without hitting a limit. Option C is wrong because pay-as-you-go per request is the most expensive model for high-volume workloads, as costs scale linearly with each API call, whereas reserved capacity offers a flat fee for better cost predictability.
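
The cost comparison comes down to a break-even calculation. The sketch below uses made-up prices (a fixed fee of 1000 and 2 per 1,000 requests, in arbitrary currency units) purely to show the arithmetic; actual OCI pricing should be taken from Oracle's price list.

```python
def pay_as_you_go_cost(requests, price_per_1k):
    """Cost that grows linearly with the number of requests."""
    return requests / 1000 * price_per_1k


def break_even_requests(fixed_monthly_fee, price_per_1k):
    """Monthly volume above which the fixed fee becomes the cheaper option."""
    return fixed_monthly_fee / price_per_1k * 1000


# Hypothetical prices, chosen only to illustrate the calculation.
FIXED_FEE = 1000
PRICE_PER_1K = 2

print(break_even_requests(FIXED_FEE, PRICE_PER_1K))
print(pay_as_you_go_cost(2_000_000, PRICE_PER_1K))
```

Under these assumed prices, any workload above 500,000 requests per month is cheaper on the fixed fee; at 2,000,000 requests the per-request model costs four times as much.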

6
MCQ · medium

An organization is deploying a large language model on OCI using a dedicated AI cluster. They need to minimize inference latency. Which configuration step is most critical?

A.Set up a load balancer across multiple regions
B.Configure the cluster to use high-bandwidth RDMA networking
C.Use a single VM shape to reduce network hops
D.Disable model parallelism to simplify setup
Answer: B

Correct: RDMA enables ultra-low-latency communication between nodes, essential for performance.

Why this answer

RDMA (Remote Direct Memory Access) lets one node read and write another node's memory without involving the remote CPU or the kernel network stack, which drastically reduces latency and CPU overhead. In a dedicated AI cluster on OCI, high-bandwidth RDMA networking (e.g., RoCE v2) is the most critical step to minimize inference latency because model parallelism and tensor parallelism across nodes depend on fast, low-latency interconnects. Without RDMA, even with optimized model parallelism, the network becomes the bottleneck, increasing per-token latency.

Exam trap

Oracle often tests the misconception that load balancing or simplifying the model architecture (e.g., disabling parallelism) reduces latency, when in fact the critical bottleneck for distributed inference is inter-node communication, which RDMA directly addresses.

How to eliminate wrong answers

Option A is wrong because a multi-region load balancer adds cross-region network latency and is designed for high availability and geographic distribution, not for minimizing inference latency within a single cluster. Option C is wrong because using a single VM shape does not reduce network hops; inference on large models requires multiple GPUs across nodes, and a single VM cannot host the full model, so network hops are inevitable. Option D is wrong because disabling model parallelism would force the entire model onto a single GPU, which is impossible for large models and would actually increase latency due to memory swapping or inability to load the model, not simplify setup.

7
MCQ · easy

Refer to the exhibit. Users in the group cannot create a new custom model deployment on a Dedicated AI Cluster. What is the most likely missing permission?

A.Manage ai-document-understanding
B.Manage ai-agents
C.Use of virtual-network-family for cluster networking
D.Manage instance-configurations
Answer: C

Dedicated AI Clusters require VCN networking permissions to provision networking resources.

Why this answer

Creating a custom model deployment on a Dedicated AI Cluster requires the user to have the 'Use of virtual-network-family for cluster networking' permission. This permission allows the user to specify and manage the virtual network (VCN) and subnet that the cluster uses for networking. Without it, the deployment fails because the cluster cannot be attached to the required network resources.

Exam trap

The trap here is that candidates often focus on AI-specific permissions (like ai-document-understanding or ai-agents) and overlook the underlying networking permission required for cluster-based deployments, assuming that cluster creation is purely an AI service operation.

How to eliminate wrong answers

Option A is wrong because 'Manage ai-document-understanding' is a permission for managing document understanding AI services, not for deploying custom models on a Dedicated AI Cluster. Option B is wrong because 'Manage ai-agents' controls permissions for AI agent resources, which are separate from model deployment on clusters. Option D is wrong because 'Manage instance-configurations' is related to compute instance configurations, not to the networking setup required for a Dedicated AI Cluster deployment.

8
MCQ · medium

A company has fine-tuned a large language model using OCI Generative AI service. When attempting to deploy the model to a dedicated endpoint, the deployment fails with an error indicating insufficient capacity. Which action should be taken to resolve this issue?

A.Delete existing endpoints to free capacity
B.Deploy the model to a different OCI region
C.Use a pre-built model instead of the fine-tuned model
D.Request a service limit increase for dedicated endpoints
Answer: D

OCI allows customers to request higher limits for resources like dedicated endpoints.

Why this answer

Option D is correct because capacity quotas can be increased by requesting a service limit increase. Option B is wrong because deploying to a different region may not address capacity issues in the current region. Option C is wrong because using a pre-built model would not leverage the fine-tuned model.

Option A is wrong because deleting existing endpoints should not be necessary when a limit increase can be requested.

9
Multi-Select · hard

Which THREE components are required to deploy a custom generative AI model on OCI Data Science model deployment?

Select 3 answers
A.A load balancer to distribute traffic
B.An inference script (e.g., score.py) to handle prediction requests
C.A model artifact containing the model files
D.An API signing key for authentication
E.A deployment configuration specifying resources and environment
Answers: B, C, E

Required to define how the model is called.

Why this answer

Option B is correct because OCI Data Science model deployment requires an inference script (typically score.py) to define how the model processes incoming prediction requests. This script is the entry point that loads the model artifact and executes inference logic, making it an essential component for serving predictions.

Exam trap

The trap here is that candidates confuse optional infrastructure components like load balancers or API keys with the mandatory deployment components, leading them to select A or D instead of recognizing that the inference script, model artifact, and deployment configuration are the three required elements.
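
To show what an inference script might look like, here is a minimal sketch that follows the load_model / predict convention commonly used in score.py files. The EchoModel class and the "prompt" payload field are stand-ins invented for this example; a real script would load the actual model files from the artifact.

```python
import json


class EchoModel:
    """Stand-in for a real model loaded from the artifact (hypothetical)."""

    def generate(self, prompt):
        return f"generated: {prompt}"


def load_model():
    """Load the model once so it can be reused across requests."""
    # A real script would deserialize the model files from the artifact here.
    return EchoModel()


def predict(data, model=load_model()):
    """Handle one prediction request; accepts raw JSON or an already-parsed dict."""
    payload = json.loads(data) if isinstance(data, (str, bytes)) else data
    return {"prediction": model.generate(payload["prompt"])}


print(predict('{"prompt": "hello"}'))
```

The script covers only the "how the model is called" part; the model files and the deployment configuration (shape, replicas, environment) are supplied separately.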

10
Multi-Select · easy

Which TWO methods can be used to invoke a generative AI model deployed on OCI?

Select 2 answers
A.Using OCI Notifications service
B.Using the OCI Console web interface
C.Sending HTTP requests to the model endpoint URL
D.Using OCI Events service
E.Using the OCI SDK (e.g., Python, Java)
Answers: C, E

Direct REST calls are standard.

Why this answer

Option C is correct because generative AI models deployed on OCI expose a RESTful endpoint that accepts HTTP requests (typically POST with JSON payloads) for inference. This is the standard method for programmatic access to the model, allowing integration with any HTTP client.

Exam trap

Oracle often tests the distinction between management-plane actions (like using the Console or Events to trigger deployments) and data-plane actions (like invoking the model via SDK or HTTP), leading candidates to confuse OCI services that manage resources with those that perform inference.
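
A data-plane call is ultimately an HTTP POST with a JSON body. The sketch below only builds such a request with the Python standard library and does not send it; the URL is a placeholder and the {"prompt": ...} payload shape is an assumption, since the real schema depends on the deployed model. Requests to OCI must also be signed, which the OCI SDK handles for you.

```python
import json
import urllib.request

# Placeholder URL; a real deployment exposes its own endpoint URL.
ENDPOINT_URL = "https://example.invalid/predict"


def build_inference_request(prompt):
    """Build (but do not send) a JSON POST request for the model endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_inference_request("Summarize this ticket.")
print(request.get_method(), request.full_url)
```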

11
MCQ · hard

A company wants to deploy a custom generative AI model for generating synthetic data for training other models. The model requires approximately 20GB of memory and must be accessible via a REST API with authentication. Additionally, the team needs to monitor for data drift over time. Which combination of OCI services best meets these requirements with minimal operational overhead?

A.OCI Compute with custom Docker container and Prometheus monitoring
B.OCI Data Science Model Deployment with OCI Monitoring and OCI Logging
C.OCI Functions with API Gateway for authentication
D.OCI Data Flow with OCI Data Catalog for model registry
Answer: B

Model Deployment supports large models, authentication, and integrates with Monitoring and Logging for drift detection.

Why this answer

Option B is correct because OCI Data Science Model Deployment provides a managed environment for hosting custom generative AI models with REST API endpoints and built-in authentication via OCI IAM. It integrates natively with OCI Monitoring and OCI Logging to track data drift and operational metrics without requiring additional infrastructure setup, minimizing operational overhead.

Exam trap

The trap here is that candidates may confuse OCI Functions (serverless) as suitable for long-running model inference, but its memory and timeout limits make it impractical for a 20GB model, while OCI Data Science Model Deployment is purpose-built for this scenario.

How to eliminate wrong answers

Option A is wrong because OCI Compute with a custom Docker container requires manual management of the host, scaling, and authentication, and Prometheus monitoring adds operational overhead for setup and maintenance, which contradicts the 'minimal operational overhead' requirement. Option C is wrong because OCI Functions is a serverless compute service designed for short-lived, stateless functions (execution time is capped at 5 minutes and per-function memory is limited to a few gigabytes at most), not for hosting a persistent 20GB generative AI model with a REST API. Option D is wrong because OCI Data Flow is a managed Apache Spark service for batch data processing, not for hosting real-time model inference endpoints, and OCI Data Catalog is for metadata management, not model registry or monitoring data drift.

12
MCQ · medium

A user runs the CLI command shown but receives only one model in the list, even though they know there are more models available in the compartment. What is the most likely reason?

A.The compartment does not contain the other models.
B.The other models are in a different region.
C.The CLI command requires the --all flag to list all models.
D.The user lacks permissions for other compartments.
Answer: C

Without --all, the CLI returns only the first page of results.

Why this answer

Option C is correct because the OCI CLI paginates results by default; using the --all flag retrieves all models. Options A and D are possible but the most common cause is pagination; B is less likely since the user is querying a specific compartment.
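
The pagination behavior can be illustrated with a small simulation. The list_models_page function below is a hypothetical stand-in for a paginated list call, with the page size set to 1 only to mirror the scenario; it is not the OCI CLI or SDK.

```python
def list_models_page(models, page_token=None, limit=1):
    """Hypothetical paginated list call: returns one page plus a next-page token."""
    start = page_token or 0
    page = models[start:start + limit]
    next_token = start + limit if start + limit < len(models) else None
    return page, next_token


def list_all_models(models, limit=1):
    """Follow next-page tokens until none remain, as an 'all pages' option would."""
    results, token = [], None
    while True:
        page, token = list_models_page(models, token, limit)
        results.extend(page)
        if token is None:
            return results


models = ["model-a", "model-b", "model-c"]
first_page, _ = list_models_page(models)
print(first_page)
print(list_all_models(models))
```

A caller who reads only the first page sees a single model, even though all three exist in the same compartment.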

13
MCQ · medium

A company is using OCI Generative AI service to generate product descriptions. They notice that the model sometimes generates biased content. Which approach should they take to mitigate bias while maintaining performance?

A.Fine-tune the model with a balanced, curated dataset that reduces bias
B.Use a larger model without fine-tuning
C.Post-process outputs to remove biased phrases
D.Switch to a different pre-built model from OCI
Answer: A

Fine-tuning allows adjusting model behavior.

Why this answer

Fine-tuning with a balanced, curated dataset directly addresses the root cause of bias by adjusting the model's internal weights to reduce reliance on biased patterns in the original training data. This approach preserves the model's generative performance for product descriptions because it retrains only on domain-specific, unbiased examples, unlike post-processing which merely filters outputs without correcting the underlying model behavior.

Exam trap

Oracle often tests the misconception that post-processing or model swapping is a sufficient fix for bias, when in fact only fine-tuning or retraining can address the root cause without sacrificing performance.

How to eliminate wrong answers

Option B is wrong because using a larger model without fine-tuning does not inherently reduce bias; larger models can amplify existing biases from their training data and may even introduce new biases due to increased complexity. Option C is wrong because post-processing outputs to remove biased phrases is a superficial fix that can degrade performance by altering the natural language flow and may miss subtle or context-dependent biases, while also adding latency. Option D is wrong because switching to a different pre-built model from OCI merely changes the source of bias without guaranteeing a reduction; all pre-built models inherit biases from their training corpora, and the new model may perform worse on the specific product description task.

14
MCQ · hard

An organization is deploying a generative AI model that requires GPU acceleration for inference. They are using OCI Data Science Model Deployment. The model is expected to handle variable traffic, with occasional spikes. Which scaling option should they configure to ensure cost-efficiency and responsiveness?

A.Use OCI Generative AI on-demand API with a serverless endpoint
B.Use CPU-only instances and rely on batching
C.Configure autoscaling with a minimum of 1 and maximum of 10 GPU instances
D.Deploy with a fixed number of 1 GPU instance
Answer: C

Autoscaling matches capacity to load.

Why this answer

Option C is correct because autoscaling with a minimum of 1 and maximum of 10 GPU instances allows the deployment to dynamically adjust capacity in response to variable traffic and spikes, ensuring cost-efficiency by scaling down during low demand and responsiveness by scaling up during peaks. OCI Data Science Model Deployment supports autoscaling policies that can be configured with GPU shapes, making it the optimal choice for a generative AI model requiring GPU acceleration.

Exam trap

Oracle often tests the misconception that serverless endpoints (Option A) are always the best for variable traffic, but candidates must recognize that OCI Generative AI on-demand API is a pre-built model service, not a custom model deployment, and thus does not support custom GPU scaling policies.

How to eliminate wrong answers

Option A is wrong because OCI Generative AI on-demand API with a serverless endpoint is a managed service for accessing pre-built models, not a custom model deployment, and it does not provide the granular control over GPU instances needed for the organization's own model. Option B is wrong because CPU-only instances lack the GPU acceleration required for inference of generative AI models, leading to unacceptable latency and throughput, even with batching. Option D is wrong because a fixed number of 1 GPU instance cannot handle variable traffic with occasional spikes, resulting in either over-provisioning during low traffic (wasting cost) or under-provisioning during spikes (causing performance degradation or failures).
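
The effect of minimum and maximum bounds can be sketched with a generic proportional scaling rule. This is an illustration of the idea only, not the algorithm OCI uses; the 60% utilization target is an assumed value.

```python
import math


def desired_replicas(current, utilization_pct, target_pct, min_replicas=1, max_replicas=10):
    """Scale in proportion to load, clamped to the configured bounds."""
    wanted = math.ceil(current * utilization_pct / target_pct)
    return max(min_replicas, min(max_replicas, wanted))


# Assumed target of 60% utilization per instance.
print(desired_replicas(current=2, utilization_pct=90, target_pct=60))   # moderate load
print(desired_replicas(current=4, utilization_pct=300, target_pct=60))  # spike, capped at max
print(desired_replicas(current=3, utilization_pct=5, target_pct=60))    # quiet period, floor at min
```

The minimum keeps one instance warm for responsiveness, and the maximum caps spend during spikes.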

15
Multi-Select · hard

Which THREE steps are required to deploy a custom generative AI model using OCI Data Science Model Deployment?

Select 3 answers
A.Fine-tune the model using OCI Generative AI service
B.Create a model artifact (e.g., pickle, ONNX) with inference code
C.Register the model in OCI Generative AI service
D.Upload the model artifact to an OCI Object Storage bucket
E.Create a model deployment using the OCI Data Science Model Deployment service
Answers: B, D, E

Model must be packaged with dependencies for serving.

Why this answer

Option B is correct because deploying a custom generative AI model via OCI Data Science Model Deployment requires packaging the model and its inference code into a standardized artifact format (e.g., pickle, ONNX). This artifact is the core input that the deployment runtime loads to serve predictions, making it an essential step in the workflow.

Exam trap

The trap here is confusing the OCI Generative AI service's managed model lifecycle (fine-tuning and registration) with the custom model deployment workflow in OCI Data Science, leading candidates to incorrectly select steps that belong to the managed service rather than the custom deployment pipeline.

16
Multi-Select · easy

Which TWO of the following are sources of training data for fine-tuning a model in OCI Generative AI?

Select 2 answers
A.OCI Object Storage bucket
B.OCI File Storage
C.OCI Database
D.Local file uploaded through the OCI Console
E.OCI Streaming
Answers: A, D

Object Storage is a common source for large datasets.

Why this answer

Option A is correct because OCI Object Storage is a supported source for training data when fine-tuning models in OCI Generative AI. The service can directly access data stored in Object Storage buckets via service-level integrations, allowing you to reference large datasets without local uploads.

Exam trap

Oracle often tests the distinction between storage services that are directly integrated with AI fine-tuning (Object Storage) versus general-purpose storage or data services (File Storage, Database, Streaming) that require additional middleware or are not supported at all.

17
Multi-Select · medium

Which TWO of the following are valid ways to consume OCI Generative AI models?

Select 2 answers
A.Using the OCI Console chat interface
B.Using the OCI CLI
C.Using the OCI Generative AI REST API
D.Using OCI SDK for Python
E.Using OCI Data Flow
Answers: A, C

The console provides a chat UI to interact with models.

Why this answer

A is correct because the OCI Console provides a built-in chat interface that allows users to interact directly with Generative AI models without writing any code. This interface is part of the OCI Generative AI service's web-based console, enabling prompt testing and model evaluation through a graphical user interface.

Exam trap

Oracle often tests the distinction between the service's own interfaces (the console and the REST API) and tools such as the SDKs and the CLI, which wrap the REST API rather than being separate consumption paths.

18
Multi-Select · hard

A team is fine-tuning a generative AI model on OCI using a custom dataset. The training job fails with an out-of-memory error. Which THREE actions should they take to resolve this issue?

Select 3 answers
A.Use gradient accumulation to simulate larger batch sizes.
B.Increase the learning rate to speed up training.
C.Use a larger GPU shape with more memory.
D.Reduce the batch size.
E.Increase the number of training epochs.
Answers: A, C, D

Gradient accumulation allows effective large batches with less memory.

Why this answer

Option A is correct because gradient accumulation allows the model to simulate the effect of a larger batch size without increasing memory usage. Instead of computing gradients over a single large batch, the optimizer accumulates gradients over several smaller batches before performing a weight update. This technique effectively decouples the batch size from memory consumption, enabling training on large models or high-resolution inputs that would otherwise cause an out-of-memory error.

Exam trap

Oracle often tests the misconception that increasing the learning rate or epochs can resolve memory errors, when in fact only actions that directly reduce per-step memory footprint (like reducing batch size, using gradient accumulation, or upgrading to a larger GPU) are effective.
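
Gradient accumulation can be checked on a toy problem. The sketch below uses a one-parameter model y = w * x with mean squared error and shows that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient. The model and data are made up for illustration; in a real training loop only one micro-batch's activations occupy GPU memory at a time.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the gradients of equally sized micro-batches, one micro-batch at a time."""
    chunks = [data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)]
    total = 0.0
    for chunk in chunks:
        # Each micro-batch contributes an equal share of the final gradient.
        total += mse_grad(w, chunk) / len(chunks)
    return total


# Toy data for y = 2x; with w = 1.0 the model is wrong, so the gradient is non-zero.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

print(mse_grad(1.0, data))             # full batch of 4
print(accumulated_grad(1.0, data, 2))  # two micro-batches of 2
```

The equality relies on the micro-batches having equal size; with uneven chunks each gradient would need to be weighted by its sample count.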

19
MCQ · easy

A developer wants to integrate OCI Generative AI into a web application. Which API authentication method is recommended for programmatic access?

A.Pre-authenticated request
B.API key-based signing
C.OAuth 2.0 client credentials
D.Username and password in the header
Answer: B

OCI uses API signing (based on RSA keys) for all REST API calls.

Why this answer

Option B is correct because OCI APIs require request signing using an API signing key (an RSA key pair) for programmatic access. The developer must generate a key pair, upload the public key to the OCI console, and then use the private key to sign each HTTP request using OCI's signature version 1 scheme (RSA-SHA256, based on the draft HTTP Signatures specification). This ensures authentication without transmitting secrets over the wire.

Exam trap

The trap here is that candidates may confuse OAuth 2.0 (familiar from other cloud platforms and web APIs) with OCI's requirement for API key-based signing, leading them to select OAuth 2.0 client credentials.

How to eliminate wrong answers

Option A is wrong because pre-authenticated requests (PARs) are used for temporary access to specific OCI Object Storage buckets or objects, not for authenticating API calls to OCI Generative AI services. Option C is wrong because OCI does not support OAuth 2.0 client credentials for direct API authentication; OCI uses IAM-based API signing keys or instance principals for programmatic access. Option D is wrong because sending username and password in the header is a basic authentication scheme that is insecure and not supported by OCI APIs; OCI requires cryptographic request signing.

20
Multi-Select · easy

A developer is troubleshooting an OCI Generative AI inference request that returns a 400 Bad Request error. Which three common causes could result in this error? (Choose three.)

Select 3 answers
A.Incorrect endpoint URL
B.Invalid API key in the request header
C.Missing required parameters in the request body
D.Exceeding the model's maximum token limit
E.Network connectivity issues
Answers: A, C, D

A wrong URL may cause the request to be malformed or routed incorrectly, resulting in a 400.

Why this answer

A 400 Bad Request error indicates the server cannot process the request due to client-side issues. An incorrect endpoint URL (A) is a common cause because the request is sent to the wrong OCI Generative AI service endpoint (e.g., using a chat endpoint for a text generation model), leading to a malformed request that the server rejects. Missing required parameters (C) in the request body, such as 'compartmentId' or 'modelId', also triggers a 400 error as the API cannot validate or process the inference without them.

Exceeding the model's maximum token limit (D) results in a 400 error because the input or output exceeds the model's configured token capacity, which the API validates before processing.

Exam trap

Oracle often tests the distinction between HTTP 4xx status codes, where candidates confuse 400 Bad Request (client-side malformed request) with 401 Unauthorized (invalid credentials) or 403 Forbidden (insufficient permissions), leading them to incorrectly select invalid API key as a cause for a 400 error.
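
Two of these causes can be caught on the client before the request is sent. The sketch below is a hypothetical pre-flight check: 'compartmentId' and 'modelId' come from the explanation above, while the "prompt" and "maxTokens" field names and the 4096-token limit are assumptions for illustration.

```python
REQUIRED_FIELDS = ("compartmentId", "modelId", "prompt")  # "prompt" is a hypothetical field
MAX_TOKENS = 4096  # hypothetical limit; the real value depends on the model


def validate_request(body, prompt_tokens):
    """Return a list of problems that would likely trigger a 400 response."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in body]
    # Input tokens plus requested output tokens must fit the model's limit.
    requested = prompt_tokens + body.get("maxTokens", 0)
    if requested > MAX_TOKENS:
        problems.append(f"token budget {requested} exceeds limit {MAX_TOKENS}")
    return problems


body = {"modelId": "example-model", "prompt": "Translate this text."}
print(validate_request(body, prompt_tokens=5000))
```

An invalid API key would not show up in a check like this; it is an authentication failure and surfaces as a 401, not a 400.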

21
MCQ · hard

Your organization has deployed a generative AI model for a multilingual translation service on OCI Model Deployment. The model is a 13B parameter transformer hosted on a single VM.GPU.A100.1 shape with 2 replicas. Recently, the service experiences intermittent timeouts when a burst of requests arrives. You have enabled autoscaling based on CPU utilization, but the scaling is too slow. After investigation, you find that the model inference time is highly variable due to different sequence lengths. You need to ensure the service can handle sudden spikes without timeouts. Which solution should you implement?

A.Implement a request queue (e.g., OCI Queue) to buffer requests and process them asynchronously
B.Increase the maximum number of replicas and prewarm additional replicas before expected traffic
C.Reduce the model size to a 7B parameter model to decrease inference time
D.Use autoscaling based on the number of messages in the request queue
Answer: A

Queuing decouples traffic spikes from the model, preventing timeouts.

Why this answer

Option A is correct because implementing a request queue (e.g., OCI Queue) decouples request ingestion from processing, allowing the service to buffer bursts of requests and process them asynchronously. This prevents timeouts by smoothing out the variable inference times caused by differing sequence lengths, as the queue absorbs spikes and the model processes at its own pace. Autoscaling based on CPU utilization is too slow for sudden spikes, but a queue provides immediate relief by not dropping requests.

Exam trap

The trap here is that candidates often assume autoscaling (option B or D) is sufficient for burst handling, but they overlook that autoscaling has inherent latency (minutes to provision new replicas), whereas a request queue provides immediate buffering to absorb spikes without dropping requests.

How to eliminate wrong answers

Option B is wrong because increasing the maximum number of replicas and prewarming them only helps if the scaling mechanism is fast enough to react; it does not address the root cause of variable inference times and still relies on autoscaling, which is too slow for sudden bursts. Option C is wrong because reducing the model size to a 7B parameter model would degrade translation quality and does not solve the intermittent timeout issue caused by variable sequence lengths; it might reduce average inference time but not eliminate spikes. Option D is wrong because autoscaling based on the number of messages in the request queue would still be reactive and subject to latency in provisioning new replicas, and it does not prevent timeouts during the scaling delay; the queue itself is the primary solution to buffer requests.
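The buffering idea behind Option A can be sketched in a few lines. This is a minimal, in-process illustration that uses Python's standard `queue` module as a stand-in for a managed queue such as OCI Queue; `serve_with_buffer` and `fake_translate` are hypothetical names, and a real deployment would put the queue and the worker in separate services.

```python
import queue
import threading

def serve_with_buffer(requests, infer, max_queue=1000):
    """Buffer a burst of requests and let one worker drain it at its own pace."""
    buffer = queue.Queue(maxsize=max_queue)  # bounded: a full queue blocks producers
    results = {}

    def worker():
        while True:
            item = buffer.get()
            if item is None:              # sentinel: no more work
                buffer.task_done()
                break
            request_id, text = item
            results[request_id] = infer(text)  # slow, variable-time step
            buffer.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    # The burst arrives all at once; enqueueing is fast even when inference is slow.
    for request_id, text in requests:
        buffer.put((request_id, text))
    buffer.put(None)

    buffer.join()   # wait until every buffered request has been processed
    thread.join()
    return results

def fake_translate(text):
    # Hypothetical stand-in for the model call.
    return text.upper()

burst = [(i, f"sentence {i}") for i in range(5)]
print(serve_with_buffer(burst, fake_translate))
```

The producer never waits on inference, so a spike only lengthens the queue instead of timing out callers. The trade-off is that callers must accept an asynchronous response.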

22
Multi-Selectmedium

Which TWO actions should be taken to monitor model drift in a deployed generative AI model? (Select TWO)

Select 2 answers
A.Compare inference statistics over time
B.Retrain the model weekly
C.Use OCI Data Labeling for new data
D.Set up alerts on accuracy metrics
E.Deploy multiple model versions
AnswersA, D

Tracking statistics like output length or sentiment can indicate drift.

Why this answer

Comparing inference statistics over time (Option A) is correct because model drift in generative AI is detected by monitoring changes in output distributions, token probabilities, or response patterns relative to baseline metrics. This allows you to identify when the model's behavior deviates from expected performance due to shifts in input data or underlying patterns.

Exam trap

Oracle often tests the distinction between monitoring actions (detecting drift) and remediation actions (retraining, labeling, deploying versions), so candidates mistakenly select retraining or labeling as monitoring steps.
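As a rough illustration of Options A and D together, the sketch below compares one inference statistic (output length in tokens) against a baseline window and raises a flag when the shift is large. The function names, the sample numbers, and the threshold of 3 baseline standard deviations are all assumptions chosen for the example, not OCI defaults.

```python
from statistics import mean, stdev

def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        return 0.0 if mean(recent) == mean(baseline) else float("inf")
    return abs(mean(recent) - mean(baseline)) / spread

def has_drifted(baseline, recent, threshold=3.0):
    """True when the shift exceeds the (assumed) alert threshold."""
    return drift_score(baseline, recent) > threshold

# Illustrative output lengths (tokens) logged per response.
baseline_lengths = [120, 130, 125, 118, 132, 127]
this_week_lengths = [210, 198, 225, 205, 215, 220]

if has_drifted(baseline_lengths, this_week_lengths):
    print("ALERT: output length has shifted away from the baseline")
```

The same comparison could be applied to any logged statistic, such as sentiment scores or refusal rates; detecting the shift is monitoring, while retraining is a separate remediation step.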

23
MCQhard

A generative AI model deployed on OCI Model Deployment is experiencing high tail latency. The model is a large language model that processes variable-length input sequences. Profiling shows that inference time varies significantly: short inputs (100 tokens) take 100ms, while long inputs (2000 tokens) take 2 seconds. The application requires consistent low latency (<500ms) for most requests. You want to reduce the variance in inference time without major changes to the model architecture. Which technique should you apply?

A.Implement dynamic batching that groups requests of similar lengths together before inference
B.Increase the number of replicas to distribute the load evenly
C.Reduce the model size by removing layers or using a smaller version
D.Deploy multiple model endpoints for different length ranges and route requests accordingly
AnswerA

Grouping by length reduces the overhead from padding and stabilizes inference time.

Why this answer

Dynamic batching groups requests of similar input lengths together, which reduces the variance in inference time by ensuring that each batch processes tokens of comparable size. This minimizes the padding overhead and keeps the per-request latency more predictable, directly addressing the high tail latency caused by variable-length sequences without altering the model architecture.

Exam trap

The trap here is that candidates often confuse horizontal scaling (Option B) with latency variance reduction, but scaling replicas does not address the root cause of variable inference time due to sequence length differences.

How to eliminate wrong answers

Option B is wrong because increasing the number of replicas distributes load but does not reduce the variance in inference time for individual requests; it may even increase tail latency due to additional network hops and synchronization overhead. Option C is wrong because reducing model size (e.g., removing layers or using a smaller version) constitutes a major architectural change, which the question explicitly prohibits, and it would degrade model quality. Option D is wrong because deploying multiple endpoints for different length ranges adds operational complexity and does not inherently reduce variance; it merely separates traffic, but each endpoint still processes variable-length inputs with high tail latency unless combined with dynamic batching.
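To see why grouping by length helps, the sketch below buckets requests by token count and compares how many tokens are processed when every sequence in a batch is padded to the longest one. The bucket width of 256 tokens and the helper names are assumptions for illustration; serving frameworks implement batching in their own ways.

```python
from collections import defaultdict

def bucket_by_length(requests, bucket_width=256):
    """Group (request_id, token_count) pairs into buckets of similar length."""
    buckets = defaultdict(list)
    for request_id, token_count in requests:
        buckets[token_count // bucket_width].append((request_id, token_count))
    return [buckets[key] for key in sorted(buckets)]

def padded_tokens(batch):
    """Tokens processed when every sequence is padded to the longest in the batch."""
    longest = max(count for _, count in batch)
    return longest * len(batch)

requests = [("a", 100), ("b", 2000), ("c", 120), ("d", 1900)]

mixed = padded_tokens(requests)  # one mixed batch, padded to 2000 tokens
grouped = sum(padded_tokens(batch) for batch in bucket_by_length(requests))

print(f"mixed batch: {mixed} tokens, length-bucketed: {grouped} tokens")
```

In the mixed batch the two short requests are padded to 2000 tokens and wait for the long ones to finish. In the bucketed version they share a batch padded to only 120 tokens, so short requests keep their short latency.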

24
MCQeasy

A developer receives the above error when trying to send a request to a model endpoint. What is the most likely reason?

A.The endpoint was deleted by an administrator
B.The network connection to OCI is down
C.The API key is invalid
D.The model is still being deployed
AnswerA

The specific error indicates the endpoint is deleted.

Why this answer

The error message indicates that the model endpoint is not found. In OCI Generative AI, when an administrator deletes an endpoint, subsequent requests to that endpoint's URL return a 404 Not Found error. This is the most likely reason because the endpoint resource no longer exists in the tenancy, and the request cannot be routed to any model.

Exam trap

The trap here is that candidates often confuse a 404 Not Found with network or authentication issues, but the specific error code directly points to the resource (endpoint) not existing, which is most commonly caused by deletion.

How to eliminate wrong answers

Option B is wrong because a network connection issue to OCI would typically result in a timeout or connection refused error, not a 404 Not Found. Option C is wrong because an invalid API key would cause a 401 Unauthorized or 403 Forbidden error, not a 404. Option D is wrong because if the model is still being deployed, the endpoint would return a 503 Service Unavailable or a provisioning status error, not a 404.

25
MCQeasy

A startup wants to minimize costs when using OCI Generative AI service for a chatbot application that experiences sporadic usage. Which deployment strategy is most cost-effective?

A.Use a pre-built model with a dedicated endpoint
B.Use the serverless on-demand API without dedicated endpoints
C.Provision a dedicated endpoint for low latency
D.Deploy the model on OCI Compute with autoscaling
AnswerB

Pay per request, no idle costs.

Why this answer

Option B is correct because serverless on-demand pricing charges only for usage, which is ideal for sporadic workloads. Option A is wrong because a pre-built model served from a dedicated endpoint still incurs the endpoint's hourly cost regardless of usage. Option C is wrong because a dedicated endpoint provisioned for low latency also incurs hourly costs while idle, so it is not cost-effective for sporadic traffic.

Option D is wrong because running the model on OCI Compute adds management overhead and infrastructure costs.

26
MCQeasy

A company has fine-tuned a custom Llama 3 model using OCI Data Science for a chatbot. They now need a production-grade inference endpoint with auto-scaling. Which OCI service should they use?

A.OCI Functions
B.OCI Data Science Model Deployment
C.OCI Generative AI Service
D.OCI Kubernetes Engine (OKE)
AnswerC

Correct: OCI Generative AI Service offers managed endpoints for fine-tuned models with scaling.

Why this answer

Option C is correct because OCI Generative AI Service provides a fully managed, production-grade inference endpoint with built-in auto-scaling for custom models like fine-tuned Llama 3. It abstracts infrastructure management, offers serverless deployment, and integrates with OCI Data Science for model import, making it the ideal choice for a chatbot requiring scalable inference.

Exam trap

Oracle often tests the misconception that OCI Data Science Model Deployment is the correct choice for any custom model deployment, but the trap here is that for production-grade, auto-scaling inference of a fine-tuned LLM, OCI Generative AI Service is the managed, purpose-built service that eliminates the operational complexity of manual scaling and infrastructure management.

How to eliminate wrong answers

Option A is wrong because OCI Functions is a serverless compute service for event-driven, stateless code snippets (functions) with a maximum timeout of 5 minutes, not suitable for hosting large language models like Llama 3 that require persistent GPU resources and long-running inference. Option B is wrong because OCI Data Science Model Deployment is designed for deploying custom models but requires manual configuration of auto-scaling policies and does not natively support the optimized inference infrastructure (e.g., dedicated GPU clusters) that OCI Generative AI Service provides for fine-tuned models. Option D is wrong because OCI Kubernetes Engine (OKE) is a container orchestration service that demands significant operational overhead for managing GPU nodes, scaling, and model serving infrastructure, whereas the question specifies a need for a production-grade inference endpoint with auto-scaling, which OCI Generative AI Service delivers as a managed service.

27
MCQhard

A company has deployed a generative AI endpoint using a custom fine-tuned model. They observe that the endpoint is returning 429 (Too Many Requests) errors during business hours. They need to handle this without losing requests. What should they implement?

A.Increase the endpoint's max tokens limit.
B.Implement client-side retry with exponential backoff.
C.Reduce the number of concurrent requests from the application.
D.Use a dedicated AI cluster with higher capacity.
AnswerB

Retry with backoff is the standard approach to handle 429 errors.

Why this answer

Option B is correct because implementing client-side retry with exponential backoff is the standard approach to handle HTTP 429 (Too Many Requests) errors without losing requests. When the OCI Generative AI endpoint returns a 429 status, the client can automatically retry the request after a delay that increases exponentially, reducing the load on the endpoint while ensuring all requests are eventually processed. This pattern is recommended by OCI and follows best practices for rate-limited APIs, as it allows the system to recover from transient capacity issues without manual intervention.

Exam trap

Oracle often tests the misconception that increasing capacity (Option D) or reducing concurrency (Option C) is the primary solution for rate limiting, when in fact the correct answer is a client-side retry mechanism that preserves request integrity.

How to eliminate wrong answers

Option A is wrong because increasing the max tokens limit does not address rate limiting; it only affects the maximum length of generated text per request, not the number of requests allowed per time window. Option C is wrong because reducing concurrent requests from the application would prevent some requests from being sent, effectively losing them rather than handling them gracefully; the goal is to avoid losing requests, not to drop them. Option D is wrong because using a dedicated AI cluster with higher capacity is a costly and potentially over-provisioned solution that does not address the immediate need to handle existing 429 errors without losing requests; it also does not provide a mechanism to retry failed requests.
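A minimal sketch of client-side retry with exponential backoff follows. `TooManyRequests` is a hypothetical exception standing in for however your HTTP client surfaces a 429, and the delay values are illustrative assumptions; the random jitter is a common addition that keeps many clients from retrying in lockstep.

```python
import random
import time

class TooManyRequests(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""

def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send(); on a 429, wait with exponentially growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # "full jitter" within the ceiling

# Usage: a fake endpoint that throttles the first two calls.
attempts = {"count": 0}

def flaky_send():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TooManyRequests()
    return "ok"

print(call_with_backoff(flaky_send, sleep=lambda seconds: None))
```

The request is only abandoned after `max_attempts` failures, which is what "without losing requests" means in practice: throttled calls are delayed rather than dropped.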

28
MCQmedium

A data scientist is using the OCI Generative AI service to generate text completions. The API calls are returning HTTP 400 errors with the message 'Invalid model parameters'. What is the most likely cause?

A.The API key is expired
B.The request exceeds the rate limit
C.The endpoint URL is incorrect
D.One or more model parameters (e.g., temperature, top_p) are outside the accepted range
AnswerD

Invalid parameters lead to client error 400.

Why this answer

The HTTP 400 error with 'Invalid model parameters' directly indicates that one or more of the parameters sent in the API request (such as temperature, top_p, max_tokens, or stop sequences) are outside the acceptable range defined by the OCI Generative AI service. Accepted ranges vary by model; for example, if the selected model accepts temperature and top_p values between 0 and 1, sending a value like 2.0 for temperature would trigger this error. The other options (expired key, rate limit, incorrect endpoint) would produce different HTTP status codes or error messages.

Exam trap

Oracle often tests the distinction between HTTP 4xx error codes and their specific meanings, so the trap here is that candidates may confuse a 400 Bad Request (parameter validation failure) with authentication (401) or rate-limiting (429) errors, especially when the error message is generic.

How to eliminate wrong answers

Option A is wrong because an expired API key would result in an HTTP 401 Unauthorized error, not a 400 Bad Request with 'Invalid model parameters'. Option B is wrong because exceeding the rate limit would return an HTTP 429 Too Many Requests error, not a 400 error. Option C is wrong because an incorrect endpoint URL would typically result in an HTTP 404 Not Found error or a connection failure, not a 400 error with a message about model parameters.
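One way to avoid this class of 400 error is to validate parameters on the client before sending the request. The sketch below assumes the 0 to 1 ranges used in the explanation above; actual limits depend on the model, so `ASSUMED_RANGES` is a placeholder to replace with the values documented for the model you call.

```python
# Assumed ranges for illustration only; check the documentation for your model.
ASSUMED_RANGES = {"temperature": (0.0, 1.0), "top_p": (0.0, 1.0)}

def find_invalid_parameters(params, ranges=ASSUMED_RANGES):
    """Return a message for every parameter that falls outside its range."""
    problems = []
    for name, (low, high) in ranges.items():
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name}={params[name]} is outside [{low}, {high}]")
    return problems

request_params = {"temperature": 2.0, "top_p": 0.9}
for problem in find_invalid_parameters(request_params):
    print("would be rejected:", problem)
```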

29
MCQmedium

A company is using OCI Generative AI service with a dedicated AI cluster for text generation. They notice that the latency is higher than expected. The cluster is in the Ashburn region, and users are distributed globally. What is the most effective way to reduce latency?

A.Enable the OCI Generative AI inference optimizer
B.Deploy dedicated AI clusters in regions closer to the users
C.Increase the number of nodes in the dedicated AI cluster
D.Use a content delivery network (CDN) to cache responses
AnswerB

Geographic proximity reduces network round-trip time.

Why this answer

Latency for globally distributed users is primarily driven by network distance and the speed of light. Deploying dedicated AI clusters in regions closer to the users reduces the physical distance data must travel, directly minimizing network round-trip time (RTT). This is the most effective architectural change because OCI's Generative AI service processes each request on the dedicated cluster and cannot bypass geographic latency through software optimizations alone.

Exam trap

The trap here is that candidates often confuse throughput improvements (scaling nodes or using an optimizer) with latency reduction, failing to recognize that geographic proximity is the only way to address network round-trip time for globally distributed users.

How to eliminate wrong answers

Option A is wrong because the OCI Generative AI inference optimizer is a software-level tuning feature that improves throughput and model efficiency, but it does not reduce network latency caused by geographic distance. Option C is wrong because increasing the number of nodes in the dedicated AI cluster improves parallel processing capacity and throughput, but does not reduce the per-request network latency for users far from the Ashburn region. Option D is wrong because a CDN caches static content (e.g., images, HTML), but Generative AI text responses are dynamic, unique per request, and cannot be cached to serve different users.
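The effect of distance can be estimated with simple arithmetic. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum) and uses round-number distances purely for illustration; real routes are longer than the straight-line distance and add switching delay on top.

```python
SPEED_IN_FIBER_KM_PER_S = 200_000  # assumed: roughly 2/3 of the speed of light in vacuum

def min_round_trip_ms(distance_km):
    """Lower bound on network round-trip time over fiber, in milliseconds."""
    return distance_km * 2 * 1000 / SPEED_IN_FIBER_KM_PER_S

# Illustrative distances between a user and the serving region.
for label, km in [("nearby region", 500), ("other continent", 10_000)]:
    print(f"{label}: at least {min_round_trip_ms(km):.0f} ms per round trip")
```

No amount of extra cluster nodes removes this floor, which is why moving the cluster closer to users is the effective fix for geographically driven latency.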

30
Multi-Selecteasy

A company is deploying a generative AI model for a real-time inference API. To ensure high availability and cost efficiency under variable load, which two configurations should they implement? (Choose two.)

Select 2 answers
A.Use a single replica with a larger GPU to handle all traffic
B.Deploy the model in a single availability domain to simplify management
C.Disable connection draining on the load balancer
D.Set the number of model deployment replicas to at least 2
E.Enable autoscaling based on average CPU utilization
AnswersD, E

Multiple replicas provide redundancy and high availability.

Why this answer

Option D is correct because deploying at least two replicas ensures high availability by eliminating a single point of failure; if one replica fails, the other can still serve inference requests. This is a standard best practice for production workloads on OCI, where model deployment replicas are distributed across fault domains to maintain service continuity.

Exam trap

Oracle often tests the misconception that a single, powerful GPU instance is sufficient for high availability, but the exam expects you to recognize that redundancy through multiple replicas and autoscaling are required for both availability and cost efficiency under variable load.
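The availability benefit of a second replica can be estimated with a short calculation. The sketch assumes replicas fail independently, which is an idealization (placing replicas in separate fault domains moves reality closer to it), and the 99% single-replica figure is an illustrative number rather than an OCI service level.

```python
def combined_availability(single_replica_availability, replicas):
    """Probability that at least one replica is up, assuming independent failures."""
    return 1 - (1 - single_replica_availability) ** replicas

for count in (1, 2, 3):
    print(f"{count} replica(s): {combined_availability(0.99, count):.6f}")
```

Under these assumptions, going from one replica to two cuts expected downtime by about a factor of one hundred, while autoscaling keeps the replica count (and cost) low when traffic is light.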

31
MCQeasy

A retail company uses OCI Generative AI to generate product descriptions. They observe the model occasionally produces biased content. Which technique should be applied to reduce bias in model outputs?

A.Increase the max_tokens parameter.
B.Apply prompt engineering with explicit instructions to avoid bias.
C.Reduce the model's inference temperature to 0.
D.Use a different random seed for each request.
AnswerB

Prompt engineering is the recommended approach to guide the model towards desired behavior and reduce bias.

Why this answer

Option B is correct because prompt engineering allows you to explicitly instruct the model to avoid biased content, such as by including directives like 'Ensure the description is neutral and unbiased' in the system or user prompt. This technique directly influences the model's output generation without altering its underlying parameters, making it a targeted and effective approach for reducing bias in OCI Generative AI models.

Exam trap

The trap here is that candidates often confuse parameter tuning (like temperature or max_tokens) with content-level controls, assuming that reducing randomness or increasing output length can mitigate bias, when in fact bias is a training data issue that requires explicit instruction via prompt engineering to override.

How to eliminate wrong answers

Option A is wrong because increasing max_tokens only extends the maximum length of the generated text, which does not address the content's bias—it may even allow more biased statements to be produced. Option C is wrong because reducing inference temperature to 0 makes the model deterministic and less creative, but it does not inherently remove bias; biased patterns in the training data can still be reproduced with high confidence. Option D is wrong because using a different random seed for each request only affects the randomness of sampling (when temperature > 0), not the underlying bias in the model's learned associations or outputs.
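In practice, the explicit instruction is usually prepended to every request. The sketch below shows one way to assemble such a prompt; the guardrail wording and the `build_description_prompt` helper are illustrative assumptions, and the resulting string would be passed to whichever text generation call your application uses.

```python
# Illustrative guardrail text; tune the wording for your own use case.
BIAS_GUARDRAIL = (
    "Write in neutral, inclusive language. Do not make assumptions about "
    "the age, gender, ethnicity, or ability of the customer."
)

def build_description_prompt(product_name, features):
    """Combine the bias instruction with the product details into one prompt."""
    feature_lines = "\n".join(f"- {feature}" for feature in features)
    return (
        f"{BIAS_GUARDRAIL}\n\n"
        f"Write a product description for: {product_name}\n"
        f"Key features:\n{feature_lines}"
    )

print(build_description_prompt("Trail Backpack", ["waterproof", "30 L capacity"]))
```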

32
MCQhard

An organization is deploying multiple generative AI models on a shared dedicated AI cluster. They need to isolate resource usage for each model to avoid interference. Which strategy is recommended?

A.Use separate fine-tuning jobs for each model
B.Configure multiple virtual clusters within the dedicated AI cluster using compartment quotas
C.Use OCI Resource Manager to allocate resources
D.Deploy each model on its own dedicated AI cluster
AnswerD

Correct: Each cluster has dedicated hardware, ensuring no resource contention.

Why this answer

Option D is correct because deploying each model on its own dedicated AI cluster provides complete hardware-level isolation, ensuring that resource usage (e.g., GPU memory, compute cycles) for one model does not interfere with another. In OCI Generative AI, dedicated AI clusters are single-tenant instances, so each model gets exclusive access to its allocated infrastructure, eliminating contention. This is the recommended strategy for strict isolation in shared environments.

Exam trap

Oracle often tests the misconception that logical isolation (e.g., compartment quotas or virtual clusters) is sufficient for performance isolation, when in fact hardware-level separation is required to prevent interference in shared AI clusters.

How to eliminate wrong answers

Option A is wrong because separate fine-tuning jobs do not isolate runtime inference resources; they only isolate training workloads, and models still share the same cluster during inference. Option B is wrong because virtual clusters with compartment quotas provide logical isolation via resource limits but do not prevent resource contention at the hardware level (e.g., GPU memory oversubscription). Option C is wrong because OCI Resource Manager is an infrastructure-as-code tool for provisioning resources, not a mechanism for runtime resource isolation between models.

33
Multi-Selecthard

A company is designing a generative AI solution on OCI that must comply with data privacy regulations. Which three best practices should they follow? (Choose three.)

Select 3 answers
A.Enable audit logging for all inference requests
B.Allocate a dedicated compartment for generative AI resources to apply specific IAM policies
C.Use dedicated AI clusters with private endpoints to keep data within the OCI network
D.Store all inference inputs and outputs in a public bucket for transparency
E.Use customer-managed keys (CMK) for encrypting model artifacts and inference data
AnswersA, C, E

Audit logs help demonstrate compliance with data privacy regulations.

Why this answer

Option A is correct because enabling audit logging for all inference requests is a fundamental data privacy best practice. It provides an immutable record of who accessed the generative AI service, what data was sent, and when, which is essential for compliance audits and detecting unauthorized access. OCI Audit service captures these events automatically when configured, ensuring traceability without storing the actual inference payloads.

Exam trap

The trap here is that candidates may confuse general resource management best practices (like compartments) with specific data privacy compliance requirements, or mistakenly think public buckets are acceptable for transparency when they actually create a severe data exposure risk.

34
MCQhard

A team has deployed a generative AI model using OCI Data Science model deployment. The endpoint is behind a load balancer. Users report that after 5 minutes of inactivity, the first request takes over 30 seconds to respond, while subsequent requests are fast. What is the most likely cause and solution?

A.The model deployment has an idle timeout that scales down to zero; configure a minimum number of instances or use a warm-up request
B.The load balancer is scaling based on CPU utilization; increase the CPU threshold
C.The VCN has a network latency issue; use a different availability domain
D.The inference code has a lazy initialization; pre-load the model in the deployment script
AnswerA

Idle timeout causes cold start; setting min replicas or health check warm-up solves it.

Why this answer

The described behavior—first request after 5 minutes of inactivity taking over 30 seconds, with subsequent requests fast—is a classic symptom of an idle timeout that scales the model deployment to zero instances. OCI Data Science model deployments support auto-scaling with an idle timeout (default 5 minutes) that can reduce the number of instances to zero when no requests are received. When a new request arrives, it must wait for a new instance to spin up, causing the delay.

The solution is to configure a minimum number of instances (e.g., 1) to keep the model warm, or use a warm-up request to prevent the idle timeout from triggering.

Exam trap

Oracle often tests the distinction between infrastructure-level idle timeouts (which cause cold starts after inactivity) and application-level lazy initialization (which causes a one-time delay after deployment), and candidates may confuse the 5-minute inactivity pattern with a code initialization issue rather than a scaling policy.

How to eliminate wrong answers

Option B is wrong because the load balancer scaling based on CPU utilization would cause performance degradation under high load, not a cold-start delay after inactivity; increasing the CPU threshold would not address the idle timeout issue. Option C is wrong because a VCN network latency issue would cause consistently slow responses, not a pattern where only the first request after inactivity is slow. Option D is wrong because lazy initialization in the inference code would cause a delay on the first request after deployment or code reload, not specifically after 5 minutes of inactivity; the 5-minute window matches the default idle timeout of the model deployment, not a code-level initialization.

35
MCQeasy

A developer wants to integrate generative AI capabilities into an application using REST API calls. Which OCI Generative AI service endpoint should they use for text generation?

A./completions
B./models
C./inference
D./chat
AnswerA

/completions is the endpoint for generating text completions.

Why this answer

Option A is correct because the OCI Generative AI service exposes a REST API endpoint at `/completions` specifically for text generation tasks. This endpoint accepts a prompt and returns a generated text completion, aligning directly with the developer's requirement to integrate generative AI capabilities via REST API calls.

Exam trap

The trap here is that candidates confuse the `/chat` endpoint (designed for conversational AI) with the `/completions` endpoint (designed for single-turn text generation), or assume `/inference` is a generic catch-all endpoint for all AI tasks.

How to eliminate wrong answers

Option B is wrong because `/models` is used to list or retrieve metadata about available generative AI models, not to perform text generation. Option C is wrong because `/inference` is not a valid endpoint in the OCI Generative AI REST API; the correct endpoint for inference-based text generation is `/completions`. Option D is wrong because `/chat` is an endpoint designed for conversational AI interactions (multi-turn chat), not for single-turn text generation tasks.

36
MCQmedium

A team deploys a generative AI model endpoint and notices intermittent 429 Too Many Requests errors. The endpoint is configured with auto-scaling using a dedicated AI cluster. What is the most likely cause?

A.The model's context window exceeded
B.Insufficient storage on the cluster
C.The auto-scaling policy is not aggressive enough
D.Rate limiting at the OCI API Gateway
AnswerC

Auto-scaling may not be scaling up quickly enough to handle traffic spikes, leading to throttling.

Why this answer

The 429 Too Many Requests error indicates that the endpoint is receiving more requests than it can handle. With auto-scaling enabled on a dedicated AI cluster, the most likely cause is that the auto-scaling policy is not aggressive enough to keep up with the request rate, meaning it scales up too slowly or has insufficient maximum instance limits to handle the traffic spike.

Exam trap

The trap here is that candidates often confuse gateway-level rate limiting (API Gateway) with server-side capacity issues (auto-scaling lag), but the dedicated AI cluster configuration points directly to insufficient scaling policy aggressiveness.

How to eliminate wrong answers

Option A is wrong because exceeding the model's context window would result in a 400 Bad Request or an input length error, not a 429 rate-limiting error. Option B is wrong because insufficient storage on the cluster would manifest as disk-full errors or model loading failures, not HTTP 429 responses which are specifically about request throttling. Option D is wrong because the question states the endpoint is configured with auto-scaling using a dedicated AI cluster, implying the endpoint is directly exposed without an OCI API Gateway in front; rate limiting at the API Gateway would be a separate layer and is not mentioned in the configuration.

37
Multi-Selectmedium

A company is designing a generative AI application using OCI Generative AI. Which two factors should be considered when selecting the appropriate model? (Choose two.)

Select 2 answers
A.Model's training data cutoff date
B.Availability in all OCI regions
C.Supported languages
D.Maximum token output limit
E.Built-in safety filters
AnswersA, D

The cutoff date indicates how recent the model's knowledge is.

Why this answer

The model's training data cutoff date determines the temporal scope of the model's knowledge. For generative AI applications requiring up-to-date information or compliance with data recency requirements, selecting a model with a cutoff date that aligns with the use case is critical. OCI Generative AI models have specific cutoff dates (e.g., June 2023 for certain models), and using a model with an older cutoff may produce outdated or factually incorrect responses.

Exam trap

The trap here is that candidates often confuse service-level features (like safety filters or regional availability) with model-specific selection criteria, leading them to pick options that are technically true but irrelevant to the core decision of choosing the right model for a generative AI application.

38
MCQmedium

A company wants to use OCI Generative AI to summarize customer support tickets. They need to ensure that the model does not output any sensitive information. Which technique should they implement?

A.Prompt engineering to instruct the model to exclude sensitive information.
B.Use a smaller model that is less likely to memorize data.
C.Enable content filtering on the endpoint.
D.Disable the use of training data in the endpoint configuration.
AnswerA

Carefully crafted prompts can guide the model to avoid leaking sensitive data.

Why this answer

Prompt engineering is the correct technique because it allows the company to explicitly instruct the generative AI model to exclude sensitive information from its outputs. By crafting a system prompt or user prompt with specific directives (e.g., 'Do not include any personally identifiable information, account numbers, or confidential data in your summary'), the model's behavior is directly controlled at inference time. This is a lightweight, flexible approach that does not require changing the model architecture or endpoint configuration, and it is the most direct way to enforce output constraints in OCI Generative AI.

Exam trap

Oracle often tests the misconception that disabling training data or using a smaller model can prevent sensitive output, when in fact prompt engineering is the primary technique for controlling model behavior at inference time in OCI Generative AI.

How to eliminate wrong answers

Option B is wrong because using a smaller model does not guarantee the exclusion of sensitive information; smaller models can still memorize and output sensitive data from their training set, and model size is unrelated to output filtering. Option C is wrong because content filtering on the endpoint typically blocks predefined categories (e.g., hate speech, violence) but is not designed to dynamically detect and remove sensitive business data like customer support ticket details. Option D is wrong because disabling the use of training data in the endpoint configuration (e.g., setting 'trainingDataConsent' to false) only prevents the model from being fine-tuned or retrained on the input data; it does not affect the model's output behavior during inference, so sensitive information can still appear in summaries.

39
MCQmedium

A developer is getting a 401 Unauthorized error when calling the OCI Generative AI inference API. What is the most likely cause?

A.The API endpoint has reached its rate limit
B.The request is missing or has an invalid authentication signature
C.The model does not support the requested parameters
D.The model is not deployed
AnswerB

401 Unauthorized specifically indicates authentication failure.

Why this answer

A 401 Unauthorized error specifically indicates a failure in authentication, not authorization or resource availability. The OCI Generative AI inference API requires every request to include a valid OCI request signature (Signature Version 1, in which selected request headers are signed with the caller's RSA private key using RSA-SHA256). If the request is missing the Authorization header or the signature is malformed (e.g., incorrect key ID, mismatched signing string, or expired timestamp), the API gateway rejects it with a 401 response.

Exam trap

Oracle often tests the distinction between HTTP status codes (401 vs 403 vs 429 vs 404) to see if candidates confuse authentication failures with authorization, rate limiting, or resource availability issues.

How to eliminate wrong answers

Option A is wrong because a rate limit exceeded (HTTP 429) would return a 'Too Many Requests' error, not a 401 Unauthorized. Option C is wrong because unsupported parameters typically result in a 400 Bad Request error, not a 401. Option D is wrong because a model that is not deployed would return a 404 Not Found or a 400 error, as the endpoint itself would be unreachable or the model ID invalid, not an authentication failure.
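The status-code distinctions that recur across these questions can be collected into a small lookup table. The mapping below simply restates the causes given in the explanations on this page; `likely_cause` is a hypothetical helper, and real services may return additional codes or more specific error bodies.

```python
# First-guess causes, as described in the explanations on this page.
LIKELY_CAUSE = {
    400: "malformed request or parameter outside the accepted range",
    401: "missing or invalid authentication signature",
    403: "authenticated, but insufficient permissions",
    404: "endpoint or model not found (for example, deleted)",
    429: "rate limit exceeded; retry with backoff",
    503: "service unavailable (for example, still deploying)",
}

def likely_cause(status_code):
    """Return the first thing to check for a given HTTP status code."""
    return LIKELY_CAUSE.get(status_code, "not covered by this table")

for code in (401, 429, 418):
    print(code, "->", likely_cause(code))
```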

40
MCQeasy

A data scientist needs to fine-tune a model on OCI Generative AI. Which of the following is a required parameter in the fine-tuning request?

A.hyperparameters
B.model_name
C.dataset_type
D.All of the above
AnswerD

All three (model_name, dataset_type, hyperparameters) are required for a fine-tuning request.

Why this answer

In OCI Generative AI, the fine-tuning request requires all three parameters: hyperparameters (to define training behavior like learning rate and epochs), model_name (to specify the base model being fine-tuned), and dataset_type (to indicate the format of the training data, such as 'TEXT' or 'MULTI_TURN'). Therefore, 'All of the above' is correct because each listed option is a mandatory field in the fine-tuning API call.

Exam trap

Oracle often tests the 'All of the above' pattern when each individual option is factually correct but candidates incorrectly assume only one is required, missing the comprehensive nature of the fine-tuning request.

How to eliminate wrong answers

Option A is wrong because hyperparameters are indeed required, but the question asks for 'a required parameter' and the correct answer includes all options, so selecting only A would be incomplete. Option B is wrong because model_name is required, but again, it is not the only required parameter. Option C is wrong because dataset_type is required, but the question expects the comprehensive answer that all three are necessary.

The trap is that each individual option is technically required, but the question is designed to test whether you know that all three are mandatory in the fine-tuning request.

41
MCQhard

A company deploys a large language model on a dedicated AI cluster with 4 nodes. The model requires 128 GB of memory per instance, but the nodes have only 64 GB each. During inference, the nodes experience out-of-memory errors. What is the best solution?

A.Enable model parallelism across nodes
B.Increase the number of nodes to 8
C.Upgrade to higher memory node shapes
D.Reduce the batch size in inference requests
AnswerA

Model parallelism distributes the model across nodes, enabling inference with the available memory.

Why this answer

Model parallelism splits the model's layers or parameters across multiple nodes, allowing the 128 GB model to be distributed across the 4 nodes (each with 64 GB) so that no single node exceeds its memory capacity. This is the best solution because it directly addresses the memory constraint without requiring additional hardware or sacrificing inference throughput, and it is a standard technique for deploying large language models on distributed AI clusters.

Exam trap

Oracle often tests the misconception that scaling out (more nodes) or scaling down (batch size) can fix memory constraints for large models, but the trap here is that the model's parameter memory is fixed and cannot be reduced by batch size changes, and adding more nodes without parallelism still leaves each node unable to host the full model.

How to eliminate wrong answers

Option B is wrong because increasing the number of nodes to 8 does not solve the fundamental issue: each node still has only 64 GB, and the model requires 128 GB per instance; without model parallelism, each node would still try to load the entire model and fail. Option C is wrong because upgrading to higher memory node shapes (e.g., 128 GB per node) would work but is often cost-prohibitive or unavailable, and the question asks for the best solution given the existing cluster; model parallelism is more efficient and scalable. Option D is wrong because reducing the batch size reduces per-request memory usage but does not reduce the model's parameter memory footprint (128 GB), so the model itself still cannot fit into a single node's 64 GB memory.
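
The memory arithmetic behind this answer can be sketched as follows. This is a rough Python illustration that assumes the model splits evenly across nodes and ignores activations, KV cache, and inter-node communication overhead; the function names are hypothetical.

```python
import math


def per_node_share_gb(model_gb: float, nodes: int) -> float:
    """Memory each node holds if the model is split evenly (an assumption)."""
    return model_gb / nodes


def min_nodes_needed(model_gb: float, node_gb: float) -> int:
    """Smallest node count whose combined memory can hold the model."""
    return math.ceil(model_gb / node_gb)


if __name__ == "__main__":
    model_gb, node_gb, nodes = 128, 64, 4  # figures from the question
    share = per_node_share_gb(model_gb, nodes)
    print(f"Per-node share with {nodes} nodes: {share} GB (limit {node_gb} GB)")
    print("Minimum nodes with parallelism:", min_nodes_needed(model_gb, node_gb))
```

With the question's numbers, each of the 4 nodes holds about 32 GB, which fits within 64 GB. Without parallelism every node would need the full 128 GB regardless of how many nodes are added.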

42
MCQmedium

A team is deploying a generative AI model using OCI Functions for serverless inference. They are experiencing cold start latency of over 10 seconds for the first invocation after idle periods. What is the best strategy to reduce cold start latency?

A.Migrate the inference to OCI Data Flow for better performance.
B.Use provisioned concurrency to keep a set number of function instances warm.
C.Reduce the function timeout to force faster execution.
D.Increase the memory allocation for the function.
AnswerB

Provisioned concurrency eliminates cold start by pre-warming instances.

Why this answer

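Option B is correct because OCI Functions supports provisioned concurrency, which keeps a specified number of function instances warm so the first invocation after an idle period does not pay the cold-start cost. Option D (increasing memory) can shorten cold starts somewhat but does not eliminate them. Option C (reducing the timeout) does not speed up initialization and may cause invocations to fail.

Option A (migrating to OCI Data Flow) is wrong because Data Flow is for data processing, not inference.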

43
Multi-Selecteasy

Which TWO are best practices for securing a generative AI endpoint on OCI? (Select TWO)

Select 2 answers
A.Enable OCI Logging for audit
B.Use a public endpoint with IP restrictions
C.Disable authentication for internal use
D.Store API keys in OCI Vault
E.Use a dedicated AI cluster with a private subnet
AnswersD, E

OCI Vault securely manages secrets and API keys.

Why this answer

Option D is correct because OCI Vault provides a secure, centralized service for storing and managing API keys used to authenticate requests to generative AI endpoints. Storing keys in Vault prevents hardcoding them in application code or configuration files, reducing the risk of exposure and enabling automated rotation and access control via IAM policies.

Exam trap

The trap here is that candidates often confuse logging (Option A) with a security control, or assume that IP restrictions (Option B) are sufficient for securing an AI endpoint, when in fact OCI emphasizes private endpoints and authentication as best practices.

44
MCQeasy

A company wants to use OCI Generative AI service to generate marketing copy that adheres to brand guidelines. Which technique should they use?

A.Use model distillation
B.Use prompt engineering with a pre-trained model
C.Use knowledge distillation
D.Fine-tune the model with brand-specific data
AnswerD

Correct: Fine-tuning adjusts model weights to match brand style and guidelines.

Why this answer

Fine-tuning a pre-trained model with brand-specific data (Option D) is the correct approach because it adjusts the model's weights to align with the company's unique brand guidelines, tone, and vocabulary. This supervised learning process ensures the generated marketing copy consistently adheres to specific requirements, unlike prompt engineering which relies on ephemeral instructions that may not reliably enforce brand constraints.

Exam trap

Oracle often tests the distinction between prompt engineering (which is temporary and instruction-based) and fine-tuning (which permanently alters model behavior), leading candidates to choose prompt engineering because it seems simpler, but it fails to guarantee adherence to brand guidelines.

How to eliminate wrong answers

Option A is wrong because model distillation is a technique to compress a large model into a smaller, faster one, not to adapt outputs to brand guidelines. Option B is wrong because prompt engineering with a pre-trained model can guide outputs but does not permanently embed brand-specific rules; the model may still deviate from guidelines without fine-tuned weights. Option C is wrong because knowledge distillation transfers knowledge from a teacher model to a student model for efficiency, not for customizing outputs to brand-specific data.

45
MCQhard

A machine learning engineer is deploying a fine-tuned Llama 2 model on OCI Data Science model deployment. The deployment fails with an error: 'Model artifact exceeds the maximum allowed size of 10 GB.' The model files total 12 GB. What is the best approach to resolve this?

A.Store the model in Object Storage and reference it in the deployment configuration
B.Use a different model that is smaller than 10 GB
C.Increase the model deployment artifact size limit via a service request
D.Compress the model artifact to under 10 GB using gzip
AnswerA

Object Storage allows large models and is supported by model deployment.

Why this answer

Option A is correct because OCI Data Science model deployment has a hard limit of 10 GB for the model artifact uploaded directly. By storing the model in Object Storage and referencing it in the deployment configuration, you bypass this limit entirely, as the deployment service can load the model from Object Storage at runtime without requiring the artifact to be part of the deployment package.

Exam trap

Oracle often tests the misconception that you can increase service limits via a support ticket, but for model artifact size, the limit is architectural and not adjustable; candidates may also incorrectly assume compression solves the issue without considering decompression at runtime.

How to eliminate wrong answers

Option B is wrong because it suggests a workaround that may not be feasible; the engineer has already fine-tuned a specific Llama 2 model, and switching to a smaller model would require retraining and may not meet business requirements. Option C is wrong because the 10 GB artifact size limit is a hard platform constraint that cannot be increased via a service request; OCI does not allow raising this limit for model deployments. Option D is wrong because compressing the artifact with gzip does not reduce the actual size of the model files when decompressed; the deployment service would need to decompress them, and the uncompressed size would still exceed the 10 GB limit, causing the same error.

46
MCQeasy

A data scientist wants to deploy a fine-tuned LLM on OCI for inference with low latency. Which OCI service should they use?

A.OCI Data Science Notebook Session
B.OCI Generative AI Service (Dedicated AI Cluster)
C.OCI Data Flow
D.OCI Functions
AnswerB

Dedicated AI Cluster is optimized for low-latency inference with reserved resources.

Why this answer

B is correct because OCI Generative AI Service with a Dedicated AI Cluster provides a managed, high-throughput, low-latency inference endpoint for fine-tuned LLMs. It leverages GPU-accelerated infrastructure and optimized serving stacks (e.g., vLLM, TensorRT-LLM) to minimize response times, making it ideal for production inference workloads.

Exam trap

The trap here is that candidates confuse development environments (Notebook Sessions) or general-purpose serverless compute (Functions) with purpose-built inference services, overlooking the need for GPU-accelerated, managed inference endpoints for low-latency LLM deployment.

How to eliminate wrong answers

Option A is wrong because OCI Data Science Notebook Session is an interactive development environment for prototyping and training, not a production-grade inference endpoint; it lacks auto-scaling, load balancing, and dedicated GPU serving for low-latency inference. Option C is wrong because OCI Data Flow is a serverless Apache Spark service designed for batch and stream data processing, not for real-time LLM inference. Option D is wrong because OCI Functions is a serverless compute service for short-lived, stateless functions (max 5-minute timeout) and does not support GPU acceleration or persistent model serving required for low-latency LLM inference.

47
Multi-Selecthard

An enterprise is deploying a generative AI model that must comply with data residency regulations. Which two configurations should they implement? (Select TWO.)

Select 2 answers
A.Set up OCI IAM policies to prevent data egress from the region for the model's resources
B.Enable OCI Logging for all API calls
C.Use OCI Object Storage with cross-region replication for redundancy
D.Store encryption keys in an OCI Vault in a different region
E.Deploy the dedicated AI cluster in the region that meets data residency requirements
AnswersA, E

Correct: IAM policies can restrict access to resources from outside the region.

Why this answer

Option A is correct because OCI IAM policies can explicitly deny data egress from a specific region, ensuring that the generative AI model's resources (such as training data, model artifacts, and inference endpoints) remain within the region that satisfies data residency regulations. This is achieved by writing policy statements that restrict the movement of data across regional boundaries, which is a direct control for compliance.

Exam trap

The trap here is that candidates often confuse data residency enforcement with monitoring or key management, mistakenly selecting logging (Option B) or cross-region replication (Option C) as compliance controls, when only IAM policies and regional deployment directly prevent data movement.

48
MCQmedium

A company is deploying a fine-tuned Cohere model on OCI Generative AI service for real-time inference. They need to ensure low latency even during demand spikes. Which configuration should they prioritize?

A.Enable model caching on the endpoint.
B.Use a dedicated AI cluster for the endpoint.
C.Use streaming responses.
D.Increase the max tokens parameter.
AnswerB

A dedicated AI cluster reserves isolated compute for the endpoint, giving consistent low latency under variable load.

Why this answer

A dedicated AI cluster provides isolated compute resources (GPUs) that are not shared with other tenants or workloads, ensuring consistent low latency even under demand spikes. This is critical for real-time inference because shared endpoints can experience resource contention and throttling during high traffic, while a dedicated cluster guarantees predictable performance.

Exam trap

Oracle often tests the misconception that caching or streaming alone can solve latency under load, when in fact only dedicated compute resources guarantee isolation and consistent performance during demand spikes.

How to eliminate wrong answers

Option A is wrong because model caching reduces latency for repeated requests by storing intermediate results, but it does not prevent resource contention during demand spikes; it only helps with cache hits, not with ensuring low latency under sustained high load. Option C is wrong because streaming responses improve perceived latency by sending tokens as they are generated, but they do not address the underlying compute resource availability or prevent queuing delays during spikes. Option D is wrong because increasing the max tokens parameter increases the maximum output length, which can actually increase latency per request and does nothing to handle demand spikes or resource contention.

49
Multi-Selecthard

A company is deploying a generative AI model on OCI for an internal application that must comply with strict security policies. The model will be accessed by a limited group of users. Which three actions should the administrator take to ensure security? (Choose three.)

Select 3 answers
A.Expose the model endpoint to the internet for ease of access
B.Deploy the model in a private VCN subnet
C.Use IAM policies to restrict model endpoint access to specific users
D.Disable audit logging to minimize storage costs
E.Store model authentication keys in OCI Vault
AnswersB, C, E

A private subnet ensures the endpoint is not publicly accessible.

Why this answer

Deploying the model in a private VCN subnet ensures that the model endpoint is not exposed to the internet, which is a fundamental requirement under strict security policies. With the model in a private subnet, traffic from outside the VCN must arrive through a bastion host, VPN, or FastConnect, which provides network isolation and reduces the attack surface. This aligns with OCI's shared responsibility model, in which the customer controls network security.

Exam trap

The trap here is that candidates may think exposing the endpoint to the internet is acceptable if IAM policies are used, but network isolation (private subnet) is a separate and mandatory layer of defense that cannot be replaced by IAM alone.

50
MCQhard

A security administrator wrote the above IAM policy for a compartment named MyCompartment. Users in the GenerativeAIUsers group can successfully list dedicated AI clusters and models in MyCompartment, but when they try to create an inference endpoint using a model from a different compartment (SharedModels), they get an authorization error. What is the most likely missing policy statement?

A.ALLOW GROUP GenerativeAIUsers TO MANAGE generative-ai-models IN COMPARTMENT SharedModels
B.ALLOW GROUP GenerativeAIUsers TO USE generative-ai-family IN TENANCY
C.ALLOW GROUP GenerativeAIUsers TO USE generative-ai-dedicated-ai-clusters IN COMPARTMENT SharedModels
D.ALLOW GROUP GenerativeAIUsers TO USE generative-ai-models IN COMPARTMENT SharedModels
AnswerD

This allows them to use models from SharedModels compartment.

Why this answer

The error occurs because the user has permission to list models in MyCompartment but not to use a model from SharedModels when creating an inference endpoint. The missing policy must grant the USE permission on generative-ai-models in the SharedModels compartment, as creating an endpoint requires the ability to reference and use the model resource from that compartment. Option D correctly provides this permission.

Exam trap

Oracle often tests the distinction between 'read' and 'use' permissions, where candidates mistakenly think listing models (read) is sufficient to use them in another resource creation, but OCI requires the 'use' verb for referencing a resource across compartments.

How to eliminate wrong answers

Option A is wrong because it grants MANAGE permission, which is excessive; the user only needs USE permission to reference the model for creating an endpoint. Option B is wrong because it grants USE on the entire generative-ai-family at the tenancy level, which is too broad and not scoped to the specific model resource needed from SharedModels. Option C is wrong because it grants USE on generative-ai-dedicated-ai-clusters in SharedModels, but the error is about using a model, not a dedicated AI cluster.
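
The verb reasoning above can be sketched in Python. OCI policy verbs are cumulative in the order inspect, read, use, manage; the helper names below are hypothetical, and the exact permissions each verb grants vary by resource type, so treat this as a simplified model rather than how OCI evaluates policies.

```python
# OCI policy verbs from least to most privileged; each includes the ones before it.
VERBS = ["inspect", "read", "use", "manage"]


def verb_satisfies(granted: str, required: str) -> bool:
    """True if the granted verb is at least as privileged as the required one."""
    return VERBS.index(granted.lower()) >= VERBS.index(required.lower())


def policy_statement(group: str, verb: str, resource: str, compartment: str) -> str:
    """Format a policy statement in the style used by the answer options."""
    return (
        f"ALLOW GROUP {group} TO {verb.upper()} {resource} "
        f"IN COMPARTMENT {compartment}"
    )


if __name__ == "__main__":
    print("read covers use?  ", verb_satisfies("read", "use"))
    print("manage covers use?", verb_satisfies("manage", "use"))
    print(policy_statement(
        "GenerativeAIUsers", "use", "generative-ai-models", "SharedModels"
    ))
```

The sketch shows why MANAGE (Option A) would also work but grants more than is needed, while a read-level grant would not be enough.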

51
MCQhard

A global enterprise is deploying a generative AI application that requires high availability across multiple OCI regions. The application must automatically fail over to a secondary region if the primary region becomes unavailable. What is the recommended architecture to achieve this?

A.Deploy endpoints in two regions behind an OCI Load Balancer with cross-region failover
B.Deploy OCI Generative AI endpoints in two regions and use a global DNS round-robin
C.Use OCI Streaming to replicate requests between regions
D.Use DNS failover with a single endpoint in the primary region
AnswerA

OCI Load Balancer can route traffic to a backup region when primary is unhealthy.

Why this answer

Option A is correct because OCI Load Balancer supports cross-region failover by distributing traffic across backend sets in multiple regions, enabling automatic failover to a secondary region when the primary region becomes unavailable. This architecture ensures high availability for generative AI applications by leveraging health checks and failover policies at the load balancer level, which is the recommended approach for multi-region active-passive setups.

Exam trap

Oracle often tests the misconception that DNS-based solutions (like round-robin or simple failover) provide automatic failover with health checks, but in OCI, DNS failover requires manual intervention or additional services like Traffic Management Steering, whereas OCI Load Balancer natively supports automatic cross-region failover.

How to eliminate wrong answers

Option B is wrong because global DNS round-robin does not provide automatic failover; it distributes traffic statically and cannot detect regional outages, leading to continued traffic to an unavailable endpoint. Option C is wrong because OCI Streaming is a messaging service for real-time data ingestion and replication, not a traffic routing or failover mechanism for application endpoints. Option D is wrong because DNS failover with a single endpoint in the primary region lacks a secondary region for failover, offering no high availability if the primary region fails.
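
The difference between health-checked failover and static distribution can be sketched generically. This Python snippet is not an OCI Load Balancer API; the region labels and the `pick_region` helper are hypothetical and only illustrate the selection logic that health checks make possible.

```python
from typing import Mapping, Optional, Sequence


def pick_region(priority: Sequence[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first region in priority order that passes its health check."""
    for region in priority:
        if healthy.get(region, False):
            return region
    return None  # no healthy region is available


if __name__ == "__main__":
    order = ["primary", "secondary"]
    print(pick_region(order, {"primary": True, "secondary": True}))
    print(pick_region(order, {"primary": False, "secondary": True}))
```

Round-robin, by contrast, would keep sending a share of requests to the unhealthy primary because it never consults health status.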

52
Multi-Selecteasy

A data scientist is preparing to fine-tune a foundation model on OCI. Which two actions should they take to optimize costs? (Select TWO.)

Select 2 answers
A.Use the smallest model that meets accuracy requirements
B.Use a single OCPU shape to minimize per-hour cost
C.Use spot preemptible instances to save on compute
D.Monitor fine-tuning progress and stop early if validation loss plateaus
E.Store training data in Archive Storage to reduce storage costs
AnswersA, D

Correct: Smaller models require less compute and memory.

Why this answer

Option A is correct because using the smallest model that meets accuracy requirements directly reduces the number of parameters and computational operations required during fine-tuning. On OCI, larger models consume significantly more GPU memory and compute hours, so selecting the minimal viable model minimizes both training time and associated costs. This aligns with cost optimization best practices for generative AI workloads.

Exam trap

Oracle often tests the misconception that spot/preemptible instances are universally cost-effective for all AI workloads, but in OCI, they are not supported for interactive or stateful fine-tuning jobs, making Option C a classic distractor.
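
Option D's advice to stop when validation loss plateaus can be sketched as a patience-based check. This is a generic early-stopping illustration, not an OCI feature; the `patience` and `min_delta` values are arbitrary assumptions.

```python
from typing import Sequence


def should_stop(val_losses: Sequence[float], patience: int = 3,
                min_delta: float = 1e-3) -> bool:
    """Stop if the last `patience` evaluations did not improve on the
    earlier best validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge a plateau
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta


if __name__ == "__main__":
    plateaued = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7]
    improving = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
    print("plateaued ->", should_stop(plateaued))
    print("improving ->", should_stop(improving))
```

Stopping at the plateau avoids paying for compute hours that no longer improve the model.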

53
MCQmedium

A data scientist deployed a fine-tuned Llama 2 7B model on OCI Model Deployment with a single VM.GPU.A10.1 shape. Users report average latency of 3 seconds per request, which is too high for the intended real-time application. The model is used for short text generation (max 128 tokens). The data scientist wants to reduce per-request latency without significant accuracy loss. Which action would be most effective?

A.Increase the number of workers per replica
B.Increase the max_tokens parameter for the model
C.Enable response streaming for the model endpoint
D.Apply 4-bit quantization using AWQ
AnswerD

Quantization reduces model size and inference time with minimal accuracy loss.

Why this answer

4-bit quantization using AWQ reduces the model's memory footprint and computational requirements by compressing weights to 4-bit integers, which directly decreases inference latency on the VM.GPU.A10.1 shape. This technique preserves most of the model's accuracy while enabling faster token generation, making it the most effective single action for reducing per-request latency in a real-time short text generation scenario.

Exam trap

The trap here is that candidates confuse throughput improvements (Option A) or perceived latency (Option C) with actual per-request latency reduction, or mistakenly think increasing max_tokens (Option B) would help, when in fact it worsens the problem.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers per replica does not reduce per-request latency; it only increases throughput by handling more concurrent requests, but each individual request still experiences the same inference time. Option B is wrong because increasing the max_tokens parameter would actually increase latency, as the model would generate more tokens per request, making the problem worse. Option C is wrong because enabling response streaming does not reduce the total time to generate the full response; it only sends tokens incrementally to the client, improving perceived latency but not actual end-to-end latency.
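
The memory effect of quantization can be estimated with simple arithmetic. The Python sketch below counts weight storage only, uses decimal gigabytes, and ignores quantization metadata (scales and zero points), KV cache, and activations; the function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 7e9  # the 7B model from the question
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"FP16 weights: ~{fp16:.1f} GB")
    print(f"4-bit weights: ~{int4:.1f} GB")
    print(f"Reduction: ~{fp16 / int4:.0f}x")
```

Because token generation is often limited by how fast weights can be read from GPU memory, a roughly fourfold reduction in weight size tends to lower per-token latency, although the actual speedup depends on the kernels and serving stack.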

54
MCQeasy

A user wants to access the OCI Generative AI service programmatically. Which credential method is recommended for use in a production application running on OCI Compute?

A.API signing keys
B.Instance principal
C.User password and OCID
D.Resource principal
AnswerB

Instance principal dynamically obtains credentials via instance metadata service.

Why this answer

Instance principal authentication is the recommended method for production applications running on OCI Compute because it allows the application to authenticate with OCI services without managing or embedding any credentials. The OCI Compute instance assumes a dynamic group and IAM policy that grants it permissions, and the SDK automatically handles token exchange via the instance metadata service, eliminating the need for long-lived secrets.

Exam trap

Oracle often tests the distinction between instance principal (for Compute instances) and resource principal (for serverless or managed services), leading candidates to confuse the two or to incorrectly select API signing keys as the 'most secure' option without considering operational overhead.

How to eliminate wrong answers

Option A is wrong because API signing keys are long-lived secrets that must be securely stored and rotated, which adds operational overhead and risk in a production environment. Option C is wrong because user passwords and OCIDs are intended for interactive console login, not programmatic API access, and they cannot be used with the OCI SDK or CLI for service calls. Option D is wrong because resource principal is intended for OCI-managed resources such as OCI Functions, not for a Compute instance running a custom application.

55
Multi-Selectmedium

Which two actions are required when deploying a custom fine-tuned model using the OCI Generative AI service? (Choose two.)

Select 2 answers
A.Configure an API Gateway for the model endpoint
B.Register the model in the OCI Data Science Model Catalog
C.Set up a load balancer for the deployment
D.Upload model artifacts to OCI Object Storage
E.Create a dedicated AI cluster
AnswersD, E

Model artifacts must be stored in Object Storage before deployment.

Why this answer

Option D is correct because deploying a custom fine-tuned model in OCI Generative AI requires the model artifacts (e.g., weights, configuration files) to be stored in OCI Object Storage. The service pulls the artifacts from a designated bucket during deployment. Option E is correct because a dedicated AI cluster must be created to host the model, providing the necessary compute resources for inference.

Exam trap

The trap here is that candidates confuse the OCI Data Science Model Catalog (used for ML model lifecycle management) with the Generative AI service's own model registration, leading them to incorrectly select Option B.

56
MCQhard

A company deploys a fine-tuned model on an OCI Generative AI dedicated AI cluster. After deployment, they observe high latency during peak hours. The cluster has only one replica. Which action would most effectively reduce latency without increasing cost unnecessarily?

A.Increase the number of replicas to 10.
B.Enable auto-scaling with a maximum of 3 replicas.
C.Switch to a larger base model.
D.Move to a serverless deployment model.
AnswerB

Auto-scaling adjusts to demand; a max of 3 provides headroom without waste.

Why this answer

Enabling auto-scaling with a maximum of 3 replicas (Option B) is the most effective action because it dynamically adds replicas during peak hours to handle increased load, reducing latency, while limiting the maximum to 3 prevents unnecessary cost overruns. This balances performance and cost, unlike a fixed large replica count or switching models, which either wastes resources or fails to address the root cause of insufficient compute capacity.

Exam trap

Oracle often tests the misconception that more replicas always reduce latency, but the trap here is that candidates may choose Option A (10 replicas) without considering cost efficiency, while the correct answer requires balancing performance with cost constraints via auto-scaling.

How to eliminate wrong answers

Option A is wrong because increasing replicas to 10 would significantly raise costs without proportional latency benefits, as the cluster likely doesn't need that many replicas during non-peak hours, leading to idle resource waste. Option C is wrong because switching to a larger base model would increase inference latency and cost due to higher computational requirements, exacerbating the problem rather than solving it. Option D is wrong because moving to a serverless deployment model on OCI Generative AI would introduce cold-start latency and unpredictable scaling behavior, and it may not support fine-tuned models or dedicated cluster features, potentially increasing latency and cost.
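
The trade-off between a fixed replica count and capped auto-scaling can be sketched numerically. This Python example is a generic scaling rule, not the OCI autoscaling API; the request rates and per-replica capacity are hypothetical numbers chosen for illustration.

```python
import math


def desired_replicas(load_rps: float, capacity_rps_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 3) -> int:
    """Replicas needed for the load, clamped to the configured min and max."""
    needed = math.ceil(load_rps / capacity_rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    capacity = 10  # hypothetical requests/second one replica can serve
    for label, load in (("off-peak", 5), ("peak", 25)):
        print(f"{label}: {desired_replicas(load, capacity)} replica(s)")
```

Under these assumed numbers the deployment runs one replica off-peak and three at peak, whereas a fixed count of 10 would leave most replicas idle most of the time.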

57
MCQeasy

What is the primary benefit of using a Dedicated AI Cluster over On-Demand serving for deploying generative AI models on OCI?

A.Higher throughput and lower latency due to reserved capacity
B.No need to manage model versions
C.Automatic scaling to zero when not in use
D.Lower cost for variable workloads
AnswerA

Reserved capacity minimizes resource contention, improving performance.

Why this answer

A Dedicated AI Cluster provides reserved compute capacity on OCI, ensuring consistent high throughput and low latency for generative AI inference workloads. Unlike On-Demand serving, which shares resources and can suffer from contention or cold starts, a Dedicated AI Cluster guarantees that GPU resources are always available for your model, eliminating variability in response times.

Exam trap

Oracle often tests the misconception that Dedicated AI Clusters are cheaper for variable workloads, when in fact their fixed-cost model makes them optimal for steady-state, high-volume inference, not spiky or unpredictable traffic.

How to eliminate wrong answers

Option B is wrong because managing model versions is a separate concern handled by model registries and deployment pipelines, not a benefit specific to Dedicated AI Clusters. Option C is wrong because Dedicated AI Clusters are always-on and do not scale to zero; automatic scaling to zero is a feature of serverless or On-Demand serving to reduce costs when idle. Option D is wrong because Dedicated AI Clusters incur fixed costs for reserved capacity, making them more expensive for variable workloads compared to On-Demand serving, which charges per-usage and can scale down.

58
MCQhard

A data science team is using OCI Data Science to fine-tune a model. They notice that training jobs are failing due to out-of-memory errors on the notebook session. What should they do to resolve this?

A.Enable autoscaling on the notebook session.
B.Use OCI Data Flow instead.
C.Switch to a larger notebook session shape.
D.Reduce the batch size in the training script.
AnswerC

A larger shape provides more memory, resolving OOM issues.

Why this answer

Out-of-memory errors during training on a notebook session indicate that the current shape's memory capacity is insufficient for the model or data being processed. Switching to a larger notebook session shape directly increases available RAM and compute resources, resolving the memory constraint without altering the training logic or infrastructure type.

Exam trap

Oracle often tests the misconception that autoscaling or reducing batch size can fix memory issues in a single-node notebook session, but the correct approach is to match the compute shape to the workload's memory requirements.

How to eliminate wrong answers

Option A is wrong because autoscaling adjusts the number of compute instances horizontally, not the memory of a single notebook session; it does not prevent OOM errors caused by insufficient per-instance memory. Option B is wrong because OCI Data Flow is a serverless Spark-based service for big data processing, not designed for fine-tuning deep learning models, and migrating would require rewriting the training pipeline. Option D is wrong because reducing batch size can mitigate memory usage but does not address the root cause of an undersized notebook session shape; it may also degrade training convergence or performance, and the question asks for a resolution to the failing jobs, not a workaround.

59
MCQhard

A financial services company is concerned about data privacy when using OCI Generative AI service for processing sensitive customer data. They want to ensure that their data is not used to improve the model and is encrypted at rest and in transit. Which combination of OCI features should they implement?

A.Use OCI Object Storage buckets with encryption to store prompts and responses
B.Provision a dedicated endpoint, configure data privacy opt-out, and use OCI Vault for encryption keys
C.Deploy the model in an OCI Data Science project with a private endpoint
D.Use the on-demand API with default encryption
AnswerB

Dedicated endpoints allow data isolation; Vault provides customer-managed keys; privacy opt-out prevents use for training.

Why this answer

Option B is correct because it addresses all three requirements: a dedicated endpoint ensures network isolation and encryption in transit, data privacy opt-out prevents OCI from using customer data for model improvement, and OCI Vault integration allows customers to manage their own encryption keys for data at rest, meeting the financial services company's strict data privacy and encryption needs.

Exam trap

Oracle often tests the misconception that encryption alone (at rest or in transit) is sufficient for data privacy, when in fact the critical requirement for preventing model improvement is the explicit data privacy opt-out mechanism, which is a separate control from encryption.

How to eliminate wrong answers

Option A is wrong because OCI Object Storage encryption only protects data at rest, not data in transit, and it does not prevent OCI from using prompts and responses for model training or improvement. Option C is wrong because deploying a model in an OCI Data Science project with a private endpoint provides network isolation but does not include a data privacy opt-out mechanism to prevent data from being used for model improvement, nor does it inherently enforce customer-managed encryption keys for the Generative AI service. Option D is wrong because the on-demand API with default encryption uses OCI-managed keys, which does not give the customer control over encryption keys, and it does not include a data privacy opt-out to prevent data usage for model improvement.

60
MCQeasy

Your team has deployed a fine-tuned GPT-2 model on OCI Model Deployment for a simple text generation API. The model performs text completion for short prompts (e.g., 50 tokens). The endpoint is working but response times are over 10 seconds for these short prompts. The model size is approximately 500MB and you used a VM.Standard.E3.Flex shape (2 OCPU, 16GB RAM). The deployment is in a single replica with no autoscaling. You have verified that the network latency is minimal (<5ms). The model was trained in OCI Data Science using a GPU shape, but during deployment you selected a CPU shape to reduce cost. The model is a transformer-based neural network. You've also confirmed that the deployment is healthy and there are no errors in the logs. The memory usage is within limits. What is the most likely cause of the high latency?

A.High network latency between the client and the model endpoint
B.Model is too large for the VM.Standard.E3.Flex shape
C.Insufficient CPU resources for the model size
D.Missing GPU acceleration for inference
AnswerD

GPU acceleration is essential for fast inference on neural network models like GPT-2.

Why this answer

Option D is correct because GPT-2 is a transformer-based neural network whose inference is dominated by large matrix multiplications, which GPUs execute far more efficiently thanks to their massively parallel architecture. Even though the model is only 500MB, a 2-OCPU CPU shape offers very little parallelism for the attention and feed-forward layers, and every generated token requires a forward pass through the model. A latency of about 10 seconds for a 50-token prompt is a classic symptom of missing GPU acceleration.

Exam trap

The trap here is that candidates might assume a 500MB model is 'small enough' for CPU inference, overlooking that transformer architecture—not model size—is the primary driver of latency, and that GPU acceleration is essential even for moderately sized transformer models.

How to eliminate wrong answers

Option A is wrong because the problem states that network latency is minimal (<5ms), so high network latency is not the cause. Option B is wrong because the model size of 500MB fits comfortably within the 16GB RAM of the VM.Standard.E3.Flex shape, and memory usage is confirmed to be within limits. Option C is wrong because while CPU resources are limited (2 OCPU), the core issue is not insufficient CPU resources per se, but rather that CPUs are architecturally unsuited for the parallel computations required by transformer models; even with more CPU cores, inference would still be significantly slower than with a GPU.

61
MCQeasy

A user sends an inference request with the JSON parameters shown. They notice the model is returning very short responses. What is the most likely cause?

A.maxTokens is set too high
B.topP is set too high
C.The modelId is incorrect
D.temperature is set too low
AnswerD

Low temperature reduces randomness, often leading to shorter, safer outputs.

Why this answer

Option D is correct because a low temperature value (close to 0) makes the model highly deterministic, reducing randomness and often leading to shorter, more conservative responses. In generative AI, temperature controls the probability distribution over tokens; lower values cause the model to favor the most likely tokens, which can result in repetitive or truncated outputs. The user's inference request likely includes a temperature setting that is too low, causing the model to produce very short responses.

Exam trap

Oracle often tests the misconception that maxTokens or topP control response length directly, when in fact temperature has a more subtle effect on output length by influencing token diversity and repetition.

How to eliminate wrong answers

Option A is wrong because maxTokens sets the maximum number of tokens the model can generate; setting it too high would allow longer responses, not shorter ones. Option B is wrong because topP (nucleus sampling) controls the cumulative probability threshold for token selection; setting it too high would include more diverse tokens, potentially leading to longer or more varied responses, not shorter ones. Option C is wrong because an incorrect modelId would typically cause an error or unexpected behavior (e.g., model not found), not consistently produce very short responses.
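
To make the temperature explanation concrete, here is a minimal sketch of the generic softmax-with-temperature formula. It is not OCI's implementation, and the logits are hypothetical values chosen for illustration. It shows only that a low temperature concentrates probability on the most likely token; any effect on response length is indirect.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.2)   # near-deterministic
high = softmax_with_temperature(logits, 1.5)  # flatter distribution

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With the low setting almost all probability mass lands on the first token, while the higher setting spreads it across all three.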

62
MCQhard

A machine learning engineer evaluates OCI Generative AI for a real-time content generation application. They need to meet an SLA of 99.9% availability. Which deployment architecture satisfies the requirement with the lowest cost?

A.Two dedicated AI clusters in different regions.
B.Two dedicated AI clusters in different availability domains.
C.Two dedicated AI clusters in the same availability domain.
D.Single dedicated AI cluster with a single replica.
AnswerB

Clusters in different ADs provide resilience against AD failures at moderate cost.

Why this answer

Option B is correct because deploying two dedicated AI clusters in different availability domains within a single region provides high availability (HA) to meet the 99.9% SLA while minimizing cost. OCI's dedicated AI clusters are regional resources, and placing replicas across availability domains protects against domain-level failures without the cross-region data transfer and egress costs incurred by multi-region deployments.

Exam trap

The trap here is that candidates often assume multi-region deployment is required for high availability, but OCI's 99.9% SLA can be achieved within a single region using availability domains, making the cross-region option unnecessarily expensive.

How to eliminate wrong answers

Option A is wrong because two dedicated AI clusters in different regions introduces unnecessary cross-region data transfer costs and higher latency, making it more expensive than a single-region HA solution. Option C is wrong because two dedicated AI clusters in the same availability domain does not protect against availability domain failures, thus failing to meet the 99.9% SLA requirement. Option D is wrong because a single dedicated AI cluster with a single replica provides no redundancy; any failure of that cluster or its underlying infrastructure would cause downtime, violating the 99.9% availability SLA.
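
The redundancy argument can be checked with simple arithmetic. The sketch below assumes the two clusters fail independently, which is only plausible when they sit in different availability domains. The 99% per-cluster figure is illustrative, not a published OCI number.

```python
def combined_availability(single, copies):
    """Availability of redundant clusters, assuming independent failures."""
    return 1 - (1 - single) ** copies

def downtime_minutes_per_year(availability):
    """Expected downtime in minutes over a 365-day year."""
    return (1 - availability) * 365 * 24 * 60

# Illustrative per-cluster availability (assumption, not an OCI figure).
single = 0.99
pair = combined_availability(single, 2)

print(f"one cluster:  {single:.4f}")
print(f"two clusters: {pair:.4f}")
print(f"99.9% allows about {downtime_minutes_per_year(0.999):.1f} minutes of downtime per year")
```

Two clusters in the same availability domain share a failure mode, so the independence assumption breaks down and the combined figure no longer applies.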

63
MCQeasy

A developer wants to deploy a custom generative AI model that was trained using OCI Data Science. Which service should they use to expose the model as an API endpoint?

A.OCI API Gateway
B.OCI Data Science Model Deployment
C.OCI Functions
D.OCI Generative AI service
AnswerB

This is purpose-built for deploying models as APIs.

Why this answer

B is correct because OCI Data Science Model Deployment is specifically designed to host and serve machine learning models as REST API endpoints. It directly deploys models trained in OCI Data Science, managing the underlying infrastructure, scaling, and providing a secure HTTPS endpoint for inference requests.

Exam trap

The trap here is that candidates confuse OCI Generative AI service (a managed service for pre-built models) with the ability to deploy custom models, or they assume API Gateway alone can serve a model without a backend compute service.

How to eliminate wrong answers

Option A is wrong because OCI API Gateway is a service for creating, managing, and securing API endpoints for backend services, but it does not host or run models; it would need a separate compute target like a model deployment behind it. Option C is wrong because OCI Functions is a serverless compute service for running stateless code snippets (functions) in response to events, not for hosting large generative AI models with persistent state and GPU requirements. Option D is wrong because OCI Generative AI service is a managed service that provides pre-built foundation models (like LLMs) from providers such as Cohere and Meta, not a platform to deploy custom models trained by the user.

64
MCQhard

A healthcare company is using OCI Generative AI to analyze patient records and generate clinical summaries. The company must comply with HIPAA regulations, which require that all protected health information (PHI) be encrypted at rest and in transit, and that access be logged and audited. The current architecture uses an OCI Data Science model deployment with a public endpoint. The model is stored in an OCI Object Storage bucket that is publicly accessible for testing. The company is now moving to production. The compliance officer has flagged the following issues: (1) The model endpoint is publicly accessible. (2) The bucket containing the model is public. (3) No audit logs are enabled. The company wants to remediate these issues while maintaining the ability to invoke the model from on-premises applications via a secure connection. Which set of actions should the architect take?

A.Switch the model endpoint to a private subnet with a service gateway, change the bucket to be accessible only via pre-authenticated requests, and enable OCI Logging for the model deployment.
B.Keep the public endpoint but restrict access using IAM policies and source IP addresses, make the bucket private, and enable OCI Audit.
C.Switch the model endpoint to a private subnet with a service gateway, update the bucket policy to block all public access, enable OCI Audit service, and set up a VPN or FastConnect for on-premises access.
D.Use a public load balancer with SSL termination, restrict bucket access to the load balancer's OCID, and enable OCI Audit.
AnswerC

This ensures private endpoint, private bucket, audit logging, and secure on-premises connectivity.

Why this answer

Option C is correct because it addresses all three compliance issues: moving the model endpoint to a private subnet with a service gateway removes public exposure, making the bucket private with a policy that blocks all public access secures the model artifacts, and enabling OCI Audit provides the required logging. Additionally, setting up a VPN or FastConnect allows secure on-premises access without exposing the endpoint to the public internet, fully satisfying HIPAA encryption and audit requirements.

Exam trap

The trap here is that candidates often think IP restrictions or pre-authenticated requests are sufficient for HIPAA compliance. HIPAA requires that PHI be encrypted at rest and in transit and that access be logged and audited, and the compliance officer flagged public exposure specifically: public endpoints and shareable PAR URLs still rely on internet-exposed paths and lack proper access controls.

How to eliminate wrong answers

Option A is wrong because pre-authenticated requests (PARs) still expose the bucket via a URL that can be shared, which does not meet HIPAA's requirement for access logging and audit; PARs are not a substitute for private bucket policies and audit logging. Option B is wrong because keeping the public endpoint even with IP restrictions is not sufficient for HIPAA compliance—public endpoints are inherently exposed to network-level attacks and do not satisfy the requirement for encryption at rest and in transit in a fully private manner; also, OCI Audit alone does not cover logging for the model deployment itself. Option D is wrong because a public load balancer with SSL termination still leaves the endpoint publicly accessible, and restricting bucket access to the load balancer's OCID does not prevent the bucket from being publicly listed or accessed via other paths; OCI Audit alone does not address the public endpoint issue.

65
MCQmedium

A data scientist is deploying a custom generative AI model using OCI Data Science. After deploying the model to an endpoint, they notice that inference requests are failing with a timeout error when the payload size exceeds 1 MB. What is the most likely cause and solution?

A.The load balancer is misconfigured; reconfigure the load balancer timeout settings.
B.The model server lacks sufficient memory; scale out to more instances.
C.The model is not optimized for large payloads; use AutoML to optimize the model.
D.The model deployment has a default payload size limit of ~1 MB; increase the payload limit in the deployment configuration.
AnswerD

OCI Data Science model deployments have a default request payload limit that can be increased.

Why this answer

The correct answer is D because OCI Data Science model deployments have a default payload size limit of approximately 1 MB. When inference requests exceed this limit, the load balancer or gateway times out the request. The solution is to increase the payload limit in the deployment configuration, which can be adjusted via the OCI console or API by modifying the `maximumRequestPayloadSize` setting.

Exam trap

The trap here is that candidates often confuse a payload size limit with a generic timeout or resource issue, leading them to choose load balancer reconfiguration (A) or scaling (B) instead of recognizing the explicit payload limit enforced by the deployment configuration.

How to eliminate wrong answers

Option A is wrong because the load balancer timeout settings are not the root cause; the timeout is a symptom of hitting the payload size limit, not a misconfiguration of the load balancer itself. Option B is wrong because insufficient memory would cause out-of-memory errors or slow inference, not a timeout specifically triggered by payload size exceeding 1 MB. Option C is wrong because AutoML optimizes model training and hyperparameters, not the runtime payload handling; the issue is a deployment configuration limit, not model optimization.
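
A client can catch oversized requests before sending them. The following is a minimal sketch that measures the JSON-encoded body against an assumed limit. The 1 MB constant comes from the figure quoted in the explanation above, and the helper names are hypothetical.

```python
import json

# Assumed limit, taken from the ~1 MB figure in the explanation above.
ASSUMED_LIMIT_BYTES = 1 * 1024 * 1024

def payload_size_bytes(payload):
    """Size of the JSON-encoded request body in bytes (UTF-8)."""
    return len(json.dumps(payload).encode("utf-8"))

def fits_limit(payload, limit_bytes=ASSUMED_LIMIT_BYTES):
    """True when the encoded payload is within the configured limit."""
    return payload_size_bytes(payload) <= limit_bytes

small = {"prompt": "Summarize this note."}
large = {"prompt": "x" * (2 * 1024 * 1024)}  # roughly 2 MB of text

print(fits_limit(small), fits_limit(large))
```

Checking the size up front turns an opaque timeout into an explicit client-side error that can be logged or handled by splitting the input.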

66
MCQmedium

Your team is deploying a generative AI model for a clinical decision support system. The model must meet HIPAA compliance requirements. You have trained a model using OCI Data Science and now need to deploy it so that patient data is protected. The application requires real-time inference. Which set of actions should you take to ensure compliance while maintaining low latency?

A.Use OCI Functions with API Gateway and allow anonymous access
B.Deploy in a public subnet with HTTPS and enable OCI Audit
C.Use OCI Data Flow for batch inference and store results in Object Storage with SSE
D.Deploy in a private VCN subnet, use a service gateway, store keys in OCI Vault, and enable OCI Logging and OCI Audit
AnswerD

These actions address HIPAA requirements for access control, encryption, and auditing.

Why this answer

Option D is correct because deploying the model in a private VCN subnet ensures the inference endpoint is not exposed to the internet, meeting HIPAA's requirement for network isolation. Using a service gateway allows private connectivity to OCI services without traversing the internet, while storing encryption keys in OCI Vault enables customer-managed key control for data at rest. Enabling OCI Logging and OCI Audit provides the necessary audit trail for compliance, and the private subnet with service gateway keeps latency low by avoiding internet hops.

Exam trap

Oracle often tests the misconception that HTTPS encryption alone is sufficient for HIPAA compliance, but the trap here is that network isolation (private subnet) is mandatory for PHI, and public subnet exposure violates the HIPAA Security Rule even with encryption in transit.

How to eliminate wrong answers

Option A is wrong because OCI Functions with API Gateway and anonymous access bypasses authentication and authorization, violating HIPAA's access control requirements, and anonymous access exposes patient data to unauthorized users. Option B is wrong because deploying in a public subnet, even with HTTPS, exposes the inference endpoint to the internet, which is not permitted for protected health information (PHI) under HIPAA's security rule, and OCI Audit alone does not enforce network isolation. Option C is wrong because OCI Data Flow is a batch processing service, not suitable for real-time inference, and storing results in Object Storage with SSE does not address the need for low-latency, synchronous inference required by the clinical decision support system.

67
MCQeasy

A company needs to ensure that only authorized users can invoke an endpoint for a generative AI model. Which OCI feature should be used to control access?

A.Network security groups (NSGs)
B.VCN flow logs
C.OCI Web Application Firewall (WAF)
D.OCI Identity and Access Management (IAM) policies
AnswerD

Correct: IAM policies grant or deny access to specific resources like models and endpoints.

Why this answer

OCI Identity and Access Management (IAM) policies are the correct choice because they define who (users, groups, or service principals) can invoke which OCI resources, including generative AI model endpoints. IAM policies use resource-type and verb-based statements (e.g., 'allow group A to manage ai-service-family in compartment X') to enforce authorization at the API level, ensuring only authorized principals can call the model's inference endpoint.

Exam trap

The trap here is that candidates confuse network-level controls (NSGs, WAF) with identity-based access control, mistakenly thinking that restricting network traffic to the endpoint is sufficient for authorization, whereas OCI requires IAM policies to authenticate and authorize the caller's identity at the API layer.

How to eliminate wrong answers

Option A is wrong because Network Security Groups (NSGs) control network traffic at the subnet or VNIC level using stateful firewall rules (e.g., allow/deny TCP port 443), not user identity or API-level authorization. Option B is wrong because VCN flow logs capture metadata about network traffic (source IP, destination port, etc.) for auditing or troubleshooting, but they do not enforce access control. Option C is wrong because OCI Web Application Firewall (WAF) protects against HTTP-based attacks (e.g., SQL injection, XSS) and can filter by IP or request patterns, but it cannot authenticate or authorize individual users or service principals invoking the model endpoint.

68
MCQeasy

A developer wants to call the OCI Generative AI service from a Python application running on an OCI Compute instance. Which method is the most secure for authenticating the API calls?

A.Use a resource principal
B.Use the OCI CLI with a config file containing credentials
C.Use instance principals with a dynamic group and policy
D.Use an API signing key stored on the instance
AnswerC

Instance principals allow secure authentication without storing secrets.

Why this answer

Option C is correct because instance principals allow the Compute instance to authenticate to OCI services without storing any credentials on the instance. By assigning a dynamic group and policy, the instance obtains a temporary security token from the OCI metadata service, which is the most secure method for programmatic access from within OCI.

Exam trap

The trap here is that candidates confuse resource principals (used for serverless functions) with instance principals (used for Compute instances), or they assume that storing credentials in a config file is acceptable because it is a common practice in non-OCI environments.

How to eliminate wrong answers

Option A is wrong because resource principals are used for OCI Functions or other OCI resources that need to make API calls, not for Compute instances. Option B is wrong because using the OCI CLI with a config file containing credentials stores long-lived user credentials on the instance, which is less secure and violates the principle of least privilege. Option D is wrong because storing an API signing key on the instance creates a persistent secret that could be compromised if the instance is breached, and it requires manual key rotation.
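
Option C might look like the sketch below when using the OCI Python SDK. It assumes the `oci` package is installed, the code runs on a Compute instance, and that instance belongs to a dynamic group which a policy allows to use the Generative AI service. The function name `build_inference_client` and its `service_endpoint` argument are illustrative.

```python
def build_inference_client(service_endpoint):
    """Create a Generative AI inference client signed by the instance principal."""
    # Imported here so the module loads even where the SDK is not installed.
    import oci

    # Fetches a short-lived token from the instance metadata service;
    # no user credentials or key files are stored on the instance.
    signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()

    # An empty config is passed because the signer supplies the identity.
    return oci.generative_ai_inference.GenerativeAiInferenceClient(
        config={},
        signer=signer,
        service_endpoint=service_endpoint,
    )
```

The caller passes the regional inference endpoint URL for the tenancy. Outside an OCI instance, creating the signer fails because no metadata service is available.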

69
MCQeasy

An organization wants to deploy a generative AI chatbot using OCI Generative AI service. The chatbot must comply with data residency requirements by ensuring that all data processing occurs within a specific geographic region. What is the best practice to achieve this?

A.Use a dedicated AI cluster in the required region
B.Enable cross-region replication for disaster recovery
C.Configure a tenancy-wide policy to restrict region usage
D.Use IAM policies to block access from other regions
AnswerA

Dedicated AI clusters are region-specific and ensure data stays in that region.

Why this answer

Option A is correct because OCI Generative AI service allows you to provision a dedicated AI cluster within a specific region, ensuring all model inference and data processing remain within that geographic boundary. This dedicated cluster is isolated from other regions and complies with data residency requirements by design, as no data leaves the chosen region during processing.

Exam trap

The trap here is that candidates confuse data residency with access control or disaster recovery, thinking that IAM policies or replication settings can enforce geographic data boundaries, when in fact only the physical placement of the compute cluster guarantees data stays within a region.

How to eliminate wrong answers

Option B is wrong because cross-region replication is a disaster recovery feature that copies data to another region, which would violate data residency by moving data outside the required geographic region. Option C is wrong because tenancy-wide policies restrict where resources can be created, but they do not control where data processing occurs for an existing AI cluster; data could still be processed in a different region if the cluster is not explicitly placed. Option D is wrong because IAM policies block user access from other regions but do not prevent the AI service from processing data in a region other than the required one; data residency is about data location, not access control.

70
MCQhard

Refer to the exhibit. The dashboard shows latency grouped by modelId, but some points are missing for certain modelIds. Which of the following is the most likely reason?

A.The metric name is misspelled
B.The aggregation interval is too short
C.The modelIds with missing data may have been deleted or are inactive
D.The compartmentId is incorrect
AnswerC

Inactive or deleted models stop emitting metrics, leading to gaps in the time series.

Why this answer

Option C is correct because in OCI's Generative AI service, model deployments are associated with specific modelIds. If a modelId is deleted or its deployment is deactivated, the corresponding telemetry data (e.g., latency metrics) will no longer be reported, causing gaps in the dashboard. The dashboard aggregates metrics only for active modelIds, so missing points indicate that those modelIds are no longer in service.

Exam trap

The trap here is that candidates may confuse missing data due to inactive resources with configuration errors (e.g., metric name typos or compartment mismatches), but Oracle tests the understanding that metric gaps are often caused by resource lifecycle events rather than misconfiguration.

How to eliminate wrong answers

Option A is wrong because a misspelled metric name would cause all data points to be missing for all modelIds, not just selective gaps. Option B is wrong because a too-short aggregation interval would result in sparse or noisy data across all modelIds, not missing points for specific ones. Option D is wrong because an incorrect compartmentId would prevent any metrics from being displayed for the entire dashboard, not just for certain modelIds.
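
The pattern described here, gaps for some modelIds but not others, can be spotted programmatically. This is a minimal sketch over hypothetical datapoints given as `(modelId, timestamp)` pairs. It does not call the OCI Monitoring API, and the model names are made up.

```python
def missing_timestamps(points, expected):
    """Map each modelId to the expected timestamps it has no datapoint for."""
    seen = {}
    for model_id, timestamp in points:
        seen.setdefault(model_id, set()).add(timestamp)
    return {
        model_id: [t for t in expected if t not in stamps]
        for model_id, stamps in seen.items()
    }

# Hypothetical one-minute intervals, in seconds from the start of the window.
expected = [0, 60, 120, 180]
points = [
    ("model-a", 0), ("model-a", 60), ("model-a", 120), ("model-a", 180),
    ("model-b", 0), ("model-b", 60),  # model-b stops reporting after 60s
]

gaps = missing_timestamps(points, expected)
print(gaps)
```

A model that stops emitting partway through the window, like `model-b` here, points to a lifecycle event rather than a query-wide misconfiguration, which would affect every series.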

71
MCQeasy

A company deploys a fine-tuned Llama 2 model using OCI Generative AI service. They want to ensure low-latency inference for a real-time chat application. Which deployment option should they use?

A.Batch inference job
B.OCI Functions
C.Dedicated AI cluster
D.Serverless endpoint (standard)
AnswerC

Dedicated AI clusters offer reserved capacity and low latency for real-time inference.

Why this answer

A dedicated AI cluster provides reserved compute resources (GPUs) for low-latency, real-time inference by eliminating resource contention. This is essential for a fine-tuned Llama 2 model in a chat application where consistent sub-second response times are required, unlike shared or serverless options that introduce cold starts or queuing delays.

Exam trap

The trap here is that candidates confuse 'serverless endpoint (standard)' with a low-latency option, not realizing that its shared infrastructure and potential cold starts make it unsuitable for real-time inference, while a dedicated cluster guarantees consistent performance.

How to eliminate wrong answers

Option A is wrong because batch inference jobs are designed for asynchronous, high-throughput processing of large datasets, not for real-time, low-latency chat interactions. Option B is wrong because OCI Functions is a serverless compute service with cold-start latency and limited GPU support, making it unsuitable for sustained, low-latency model inference. Option D is wrong because a serverless endpoint (standard) uses shared infrastructure that can experience variable latency due to multi-tenancy and scaling delays, which is not acceptable for real-time chat.

72
MCQmedium

A team is fine-tuning a foundation model on a large dataset stored in OCI Object Storage. They want to minimize data transfer costs. What is the best practice for locating the storage?

A.Place the bucket in the same region and availability domain as the fine-tuning job
B.Use OCI File Storage instead of Object Storage
C.Use a cross-region bucket to leverage geographically distributed data
D.Place the bucket in the same region as the fine-tuning job
AnswerD

Correct: Same-region transfer is free of charge.

Why this answer

Option D is correct because placing the Object Storage bucket in the same OCI region as the fine-tuning job eliminates cross-region data transfer charges. OCI charges egress fees when data moves between regions, but intra-region data transfer between services in the same region is free. This minimizes costs while keeping the data accessible for the fine-tuning workload.

Exam trap

Oracle often tests the misconception that specifying an availability domain (Option A) is necessary for cost optimization, when in fact Object Storage buckets are regional and availability domain selection is irrelevant for data transfer costs.

How to eliminate wrong answers

Option A is wrong because OCI Object Storage buckets are regional resources, not tied to a specific availability domain; specifying an availability domain is irrelevant and does not affect data transfer costs. Option B is wrong because OCI File Storage is a network-attached file system that incurs additional egress costs when accessed from compute instances in a different region or availability domain, and it does not inherently reduce data transfer costs compared to Object Storage. Option C is wrong because a cross-region bucket replicates data across regions, which incurs replication and egress costs, and accessing data from a different region than the fine-tuning job would still result in cross-region data transfer charges.

73
MCQhard

Refer to the exhibit. A data scientist received this output after submitting a fine-tuning job. What is the most effective change to resolve the out-of-memory error?

A.Increase the sequence length.
B.Reduce the learning rate.
C.Decrease the number of fine-tuning epochs.
D.Increase the number of nodes in the cluster.
AnswerD

Correct: More nodes mean more total memory, alleviating OOM.

Why this answer

The out-of-memory error during fine-tuning indicates that the model's memory requirements exceed the available resources on the current node. Increasing the number of nodes in the cluster raises the total memory available to the job: with a sharded strategy such as PyTorch FSDP, the model parameters, gradients, and optimizer states are split across GPUs or nodes, so each device holds only a fraction of them. Plain data-parallel training (PyTorch DDP) keeps a full copy of the model on every device, so extra nodes help there only by reducing the per-device batch. OCI Data Science supports both approaches.

Exam trap

Oracle often tests the misconception that reducing epochs or learning rate can fix memory errors, when in fact memory errors are resource constraints that require scaling hardware (more nodes or GPUs) or reducing memory-intensive parameters like batch size or sequence length.

How to eliminate wrong answers

Option A is wrong because increasing the sequence length would increase the memory footprint per sample (due to larger attention matrices), making the OOM error worse, not better. Option B is wrong because reducing the learning rate affects training dynamics and convergence, not memory usage; it does not address the root cause of insufficient memory. Option C is wrong because decreasing the number of fine-tuning epochs reduces total training time but does not change the peak memory consumption per step, so the OOM error would still occur.
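
A rough estimate shows why adding devices helps when training state is sharded. The sketch assumes about 16 bytes per parameter for weights, gradients, and Adam optimizer state in mixed precision, a commonly cited rule of thumb. It ignores activations and assumes FSDP-style even sharding. The 7B parameter count is hypothetical.

```python
def model_state_gib(num_params, bytes_per_param=16, shards=1):
    """Rough per-device memory (GiB) for weights, gradients, and optimizer state.

    Assumes the state is split evenly across `shards` devices and
    excludes activation memory, which depends on batch size and sequence length.
    """
    return num_params * bytes_per_param / shards / 2**30

params = 7_000_000_000  # hypothetical 7B-parameter model

for shards in (1, 4, 8):
    print(f"{shards} device(s): {model_state_gib(params, shards=shards):.1f} GiB each")
```

Epochs and learning rate appear nowhere in this estimate, which is why changing them leaves peak memory untouched.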

74
MCQmedium

You deployed a generative AI model on OCI Model Deployment with autoscaling configured based on average CPU utilization. The model is a large language model that heavily utilizes the GPU. During peak hours, the scaling is too slow to keep up with demand, resulting in high latency for users. You want to improve the responsiveness of autoscaling. Which change should you make?

A.Decrease the target CPU utilization threshold for scale-out
B.Increase the maximum number of replicas in the autoscaling configuration
C.Use GPU utilization as the scaling metric instead of CPU utilization
D.Increase the cooldown period between scale-out events
AnswerC

GPU utilization directly correlates with inference load, enabling more responsive scaling.

Why this answer

Option C is correct because the model heavily utilizes GPU, not CPU. Autoscaling based on CPU utilization is irrelevant for GPU-bound workloads, leading to delayed scale-out. Using GPU utilization as the scaling metric directly reflects the actual resource bottleneck, enabling faster and more accurate scaling decisions.

Exam trap

The trap here is that candidates assume CPU utilization is always the correct scaling metric for any workload, overlooking that GPU-bound models require a metric that reflects the actual bottleneck.

How to eliminate wrong answers

Option A is wrong because lowering the target CPU utilization threshold only makes scale-out trigger at lower CPU usage; on a GPU-bound model, CPU utilization stays low and does not track inference load, so the root cause remains. Option B is wrong because increasing the maximum number of replicas only raises the upper limit on scaling; it does not make the scaling decision faster or more responsive to demand. Option D is wrong because a longer cooldown period forces the autoscaler to wait longer between scale-out events, which slows scaling further and worsens latency during peak hours.
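
The effect of options B and D can be illustrated with a toy simulation. This is a simplified model and not how OCI implements autoscaling: the function name, the tick-based clock, and all parameter values are assumptions. It assumes the scaling metric stays above its threshold for the whole run and that each scale-out adds one replica.

```python
def simulate_scale_out(load_ticks: int, cooldown_ticks: int,
                       max_replicas: int, start_replicas: int = 1) -> list:
    """Return the replica count at each tick under sustained overload."""
    replicas = start_replicas
    last_scale_tick = None
    history = []
    for tick in range(load_ticks):
        # A scale-out is allowed only once the cooldown has elapsed.
        cooled_down = (last_scale_tick is None
                       or tick - last_scale_tick >= cooldown_ticks)
        if cooled_down and replicas < max_replicas:
            replicas += 1
            last_scale_tick = tick
        history.append(replicas)
    return history


short_cooldown = simulate_scale_out(load_ticks=10, cooldown_ticks=2, max_replicas=8)
long_cooldown = simulate_scale_out(load_ticks=10, cooldown_ticks=5, max_replicas=8)
higher_cap = simulate_scale_out(load_ticks=10, cooldown_ticks=5, max_replicas=16)

print("cooldown=2, max=8: ", short_cooldown)
print("cooldown=5, max=8: ", long_cooldown)
print("cooldown=5, max=16:", higher_cap)
```

In this model, the longer cooldown leaves fewer replicas running after the same number of ticks, and raising the replica cap changes nothing because the cap is never reached within the run.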

75
MCQhard

A developer is deploying a fine-tuned model using OCI Generative AI service. They want to use a custom container image for inference. Which statement is true?

A.Custom containers are only supported with OCI Data Science, not Generative AI.
B.You can upload a container image to OCI Container Registry and reference it when creating a dedicated AI cluster.
C.Custom containers are supported only for fine-tuning jobs, not inference.
D.Custom containers are not supported; only built-in models are available.
AnswerB

Correct: This is the documented approach for using custom inference containers.

Why this answer

Option B is correct because OCI Generative AI service allows you to bring your own custom container image for inference by uploading it to OCI Container Registry (OCIR) and referencing it when creating a dedicated AI cluster. This enables you to deploy fine-tuned models with custom inference logic, dependencies, or frameworks that are not available in the built-in serving containers.

Exam trap

The trap here is that candidates may confuse the scope of custom container support, assuming it is limited to OCI Data Science or only for training, when in fact OCI Generative AI explicitly supports custom containers for inference via dedicated AI clusters.

How to eliminate wrong answers

Option A is wrong because custom containers are supported with OCI Generative AI for inference, not only with OCI Data Science. Option C is wrong because custom containers are supported for inference, not just for fine-tuning jobs; fine-tuning uses built-in containers or custom training containers, but inference also supports custom containers. Option D is wrong because custom containers are indeed supported; you are not limited to only built-in models.
