1Z0-1127 Exam Questions and Answers

Domain 1: Fundamentals of Large Language Models

A company is deploying a large language model for a customer service chatbot. The model needs to understand industry-specific jargon and maintain low latency. Which approach best balances these requirements?

Employ retrieval-augmented generation (RAG) with a general model

Rely solely on prompt engineering with a general model

Use a large general-purpose LLM with zero-shot prompting

Fine-tune a small open-source LLM on domain-specific data

Fine-tuning adapts the model to jargon and a smaller model keeps latency low.

Why: Fine-tuning a small open-source LLM on domain-specific data is the best approach because fine-tuning teaches the model industry-specific jargon, while the small model size keeps inference latency low. A fine-tuned small model can run efficiently on modest or local hardware, avoiding both the extra inference time of a large model and the overhead of external API calls. By contrast, a large general-purpose model adds latency, prompt engineering alone does not teach the model unfamiliar jargon, and RAG adds a retrieval step to every request.

A data scientist observes that their fine-tuned LLM performs well on training data but generates repetitive and dull responses in production. What is the most likely cause and best solution?

The model is overfitted; apply stronger regularization

The temperature is set too low; increase temperature during inference

Low temperature makes outputs deterministic and repetitive; increasing it adds variability.

The training data lacks diversity; add more varied examples

The model has too many layers; reduce model size

Why: Repetitive, dull responses indicate that the temperature is set too low. At or near zero, sampling collapses toward always choosing the most probable token, which produces deterministic, monotonous output. Increasing the temperature at inference time flattens the token distribution and introduces randomness into sampling, allowing more diverse and creative responses. Because temperature is an inference-time setting rather than something learned in training, a model can perform well on training data and still produce flat output in production if it is served with a very low temperature.
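To make the temperature effect concrete, here is a minimal sketch of temperature-scaled softmax in Python. The logits are made-up numbers for three candidate tokens, and real serving stacks apply this scaling inside the decoder rather than in user code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into probabilities, scaled by temperature.

    Lower temperatures sharpen the distribution toward the top token;
    higher temperatures flatten it so other tokens are sampled more often.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 0.1 almost all of the probability lands on the top token, which is why very low settings keep reproducing the same phrasing.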

An organization wants to use an LLM to summarize legal documents. Which consideration is most important for ensuring accurate summaries?

Fine-tune the model on a curated legal corpus

Domain-specific fine-tuning teaches the model legal terminology and reasoning.

Use the largest available general-purpose model

Rely on zero-shot summarization with careful prompting

Pre-train a new model from scratch on legal texts

Why: Legal documents require precise understanding, so fine-tuning on a curated legal corpus is the most important step: it teaches the model legal terminology and reasoning patterns. Using the largest general-purpose model does not guarantee domain accuracy. Zero-shot summarization, even with careful prompting, may miss legal nuances. Pre-training a new model from scratch on legal texts is expensive and unnecessary when an existing model can be adapted.

A developer is building a code generation assistant. The model occasionally produces syntactically correct but semantically wrong code. Which technique directly addresses semantic correctness?

Expand the token vocabulary

Lower the temperature to 0

Apply RLHF using human-validated code examples

RLHF directly optimizes for desired outcomes like semantic correctness.

Increase beam search width

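Why: Reinforcement Learning from Human Feedback (RLHF) addresses semantic correctness because humans judge whether generated code actually does what was intended, and those judgments become the training signal. The model learns to prefer outputs that are not only syntactically valid but also logically correct and aligned with developer intent, reducing semantically wrong code. Lowering the temperature or widening beam search only changes how tokens are selected from the model's existing distribution, and expanding the vocabulary does not teach correctness.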

A company fine-tunes an LLM on internal support tickets. After deployment, the model hallucinates company-specific product names. What is the most effective mitigation?

Switch to a smaller model to reduce hallucination risk

Use prompt engineering to remind the model to be accurate

Implement RAG with a verified product database

RAG provides factual grounding, reducing hallucinations.

Fine-tune further with more ticket data

Why: RAG (Retrieval-Augmented Generation) grounds the LLM's output in a verified product database, providing factual context that prevents hallucination of company-specific product names. Unlike fine-tuning, which only adjusts model weights and can still produce plausible but incorrect names, RAG retrieves exact records at inference time, ensuring accuracy for proprietary terminology.

A team wants to evaluate an LLM's performance on a text classification task. Which metric is most appropriate for a balanced dataset?

BLEU score

Perplexity

Accuracy

Accuracy directly measures correct predictions, appropriate for balanced data.

ROUGE score

Why: Accuracy is the most appropriate metric for a text classification task on a balanced dataset because it directly measures the proportion of predictions with the correct label. With balanced classes, accuracy is a reliable, intuitive indicator that is not distorted by class imbalance. BLEU and ROUGE measure n-gram overlap between generated and reference text, and perplexity measures how well a model predicts a token sequence, so none of them scores class predictions.


Domain 2: Using OCI Generative AI Service

A company uses OCI Generative AI Service to build a chatbot for customer support. They notice that the model sometimes generates inappropriate responses. What is the MOST effective way to mitigate this without retraining the model?

Fine-tune the model with curated safe examples

Configure system instructions to define acceptable behavior

System instructions constrain the model's output at inference time without retraining.

Reduce the temperature parameter to 0

Use the moderation API to filter responses

Why: Configuring system instructions is the most effective approach because it allows you to define the model's behavior and constraints at inference time without modifying the underlying model weights. In OCI Generative AI Service, system instructions act as a persistent prompt that guides the model's responses, enabling you to explicitly prohibit inappropriate content and enforce safety guidelines. This is a non-invasive, immediate mitigation that does not require the time, cost, or data preparation associated with retraining or fine-tuning.

A developer wants to use OCI Generative AI Service to summarize long documents. Which endpoint should they use to send the document content?

/generate

/classify

/embed

/chat

The /chat endpoint accepts a conversation history, suitable for summarization tasks.

Why: Option D (/chat) is correct because the chat endpoint in OCI Generative AI Service accepts the document content as part of the chat context and supports multi-turn dialogue, so you can send the document and ask for a summary, then follow up, in the same conversation. Input size is still bounded by the chosen model's context window, so very long documents may need to be split. An embedding endpoint returns vectors and a classification endpoint returns labels, so neither produces a summary.

An organization deploys a fine-tuned model for legal document analysis using OCI Generative AI Service. They need to ensure that only authorized users in the 'LegalTeam' group can access the model endpoint. Which policy statement should be used?

Allow group LegalTeam to use generative-ai-model in compartment ABC

Use permission allows invoking the model for inference.

Allow group LegalTeam to manage generative-ai-family in compartment ABC

Allow group LegalTeam to read generative-ai-model in compartment ABC

Allow group LegalTeam to inspect generative-ai-model in compartment ABC

Why: Option A is correct because the 'use' verb on the 'generative-ai-model' resource type grants the LegalTeam group permission to invoke the model endpoint for inference, which is the minimum privilege required for accessing a deployed fine-tuned model in OCI Generative AI Service. The 'use' permission specifically allows calling the model for text generation or analysis without granting broader management or read capabilities.

A data scientist is using OCI Generative AI Service to generate product descriptions. They notice that the output often repeats phrases. Which parameter adjustment would MOST directly address this issue?

Increase the temperature

Increase the max tokens

Increase the frequency penalty

Frequency penalty penalizes tokens that have already appeared, reducing repetition.

Decrease the top-p value

Why: Option C is correct because the frequency penalty directly discourages repetition: it lowers the score of each token in proportion to how many times that token has already appeared in the generated text, so the more often a phrase repeats, the less likely it is to be chosen again. Increasing the temperature adds randomness but does not target repeated tokens, increasing max tokens only allows longer output, and decreasing top-p narrows sampling toward the most likely tokens, which can make repetition worse.
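A minimal sketch of how a frequency penalty is commonly applied, assuming the widely used formulation in which penalty × (number of times the token has appeared) is subtracted from that token's logit. OCI's internal implementation may differ; the tokens, scores, and the helper name `apply_frequency_penalty` are illustrative.

```python
from collections import Counter

def apply_frequency_penalty(logits, generated_tokens, penalty):
    """Lower each token's logit in proportion to how often it has appeared.

    logits: dict mapping candidate token -> raw score
    generated_tokens: tokens produced so far
    penalty: non-negative frequency penalty
    """
    counts = Counter(generated_tokens)  # unseen tokens count as 0
    return {tok: score - penalty * counts[tok] for tok, score in logits.items()}

# Hypothetical candidate scores while writing a product description.
logits = {"great": 3.0, "durable": 2.6, "stylish": 2.4}
history = ["great", "and", "great", "value"]
adjusted = apply_frequency_penalty(logits, history, penalty=0.5)
print(adjusted)
print(max(adjusted, key=adjusted.get))
```

"great" has already appeared twice, so its score drops by 1.0 and a different word becomes the top candidate; tokens that have not appeared are left unchanged.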

A company needs to integrate OCI Generative AI Service with an existing application that uses OCI IAM for authentication. They want to use resource principal to allow the application to call the service without storing API keys. Which step is REQUIRED?

Create an OCI API key for the application

Enable the Generative AI Service for resource principal in the tenancy

Assign the application to a group with admin privileges

Create a dynamic group and a policy granting access to the Generative AI Service

Dynamic group with matching rules and a policy are required for resource principal.

Why: Resource principal authentication in OCI requires the application to be represented by a dynamic group, which matches instances or resources based on defined rules. A policy must then grant that dynamic group access to the Generative AI Service. This avoids storing API keys by using OCI IAM's built-in resource principal token exchange.
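The two IAM pieces can be sketched as strings. This is a hypothetical helper: the dynamic group name `GenAIApps`, the `fnfunc` resource type (an application running in OCI Functions), the compartment OCID placeholder, and the `use generative-ai-family` grant are all assumptions to check against the OCI IAM documentation for your setup.

```python
def resource_principal_iam(dynamic_group, resource_type, compartment_ocid, compartment_name):
    """Build the two IAM pieces resource principal auth needs (hypothetical helper).

    Returns the dynamic-group matching rule and a policy statement that
    grants the dynamic group access to Generative AI in the compartment.
    """
    # Matching rule: which resources belong to the dynamic group.
    matching_rule = (
        f"ALL {{resource.type = '{resource_type}', "
        f"resource.compartment.id = '{compartment_ocid}'}}"
    )
    # Policy: what the dynamic group may do (verb and resource type are assumptions).
    policy = (
        f"Allow dynamic-group {dynamic_group} to use generative-ai-family "
        f"in compartment {compartment_name}"
    )
    return matching_rule, policy

rule, stmt = resource_principal_iam(
    dynamic_group="GenAIApps",
    resource_type="fnfunc",
    compartment_ocid="ocid1.compartment.oc1..exampleuniqueid",
    compartment_name="ABC",
)
print(rule)
print(stmt)
```

At runtime, code running as that resource can authenticate with the OCI Python SDK's `oci.auth.signers.get_resource_principals_signer()` instead of reading an API key from disk.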

A developer is using OCI Generative AI Service to generate code snippets. They want to ensure the output is as deterministic as possible for testing. Which combination of parameters should they use?

Temperature = 0, Top-p = 1

Temperature=0 makes output deterministic; top-p=1 disables nucleus sampling.

Temperature = 0.5, Top-p = 0.5

Temperature = 0, Top-p = 0

Temperature = 1, Top-p = 1

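Why: Setting Temperature = 0 makes decoding greedy: the model always selects the highest-probability token, so the same prompt yields the same output. Top-p = 1 keeps the full candidate pool, so nucleus sampling adds no further truncation or variation. Together these settings minimize stochastic variation and make outputs as repeatable as possible for testing, although small numerical differences on the serving hardware can occasionally still change an output.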
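The sketch below, a simplified decoder step in Python rather than OCI's implementation, shows why Temperature = 0 is repeatable: it reduces to an argmax, while any positive temperature samples from a distribution that top-p can only truncate. The token probabilities and helper names are made up for illustration.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}  # renormalize the kept mass

def next_token(probs, temperature, top_p, rng):
    """Pick the next token: greedy when temperature is 0, otherwise sample."""
    if temperature == 0:
        return max(probs, key=probs.get)  # argmax: same input -> same output
    # p ** (1/T), renormalized, is equivalent to dividing the logits by T.
    weighted = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    z = sum(weighted.values())
    candidates = top_p_filter({t: w / z for t, w in weighted.items()}, top_p)
    tokens, weights = zip(*candidates.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"return": 0.55, "yield": 0.30, "print": 0.15}
rng = random.Random(0)
greedy = {next_token(probs, 0, 1.0, rng) for _ in range(20)}
sampled = {next_token(probs, 1.0, 1.0, rng) for _ in range(200)}
print(greedy)
print(sampled)
```

Twenty greedy calls all return the same token, while sampling at temperature 1 produces several different tokens.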


Domain 3: Building LLM Applications with RAG and Vector Search

A developer is building a RAG application using Oracle Cloud Infrastructure (OCI) Document Understanding and OCI Generative AI. After chunking documents and generating embeddings, the developer observes that the retrieval step often returns chunks that are semantically unrelated to the query. Which action is MOST likely to improve retrieval relevance?

Switch from a dense embedding model to a sparse embedding model.

Adjust the chunk size and chunk overlap to better capture coherent passages.

Proper chunking helps preserve meaning and improves retrieval accuracy.

Increase the chunk size to capture more context.

Reduce the number of retrieved chunks (k) in the vector search.

Why: Adjusting the chunk size and chunk overlap is correct because chunk boundaries determine what each embedding represents. Chunks that cut a passage mid-thought, or merge unrelated passages, produce embeddings that match queries poorly; tuning size and overlap keeps coherent passages together. Simply increasing the chunk size may introduce more noise into each embedding. Reducing the number of retrieved chunks (k) does not make the top results more relevant and could drop relevant information. Switching from dense to sparse embeddings changes the embedding model but may not fix the underlying chunking issue.
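Here is a minimal word-based chunker with overlap, offered as an illustration only. Real pipelines usually count tokens rather than words and often prefer to break at sentence or paragraph boundaries; the function name and parameters are hypothetical.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into chunks of chunk_size words, where each chunk
    repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

text = "one two three four five six seven eight nine ten"
for c in chunk_words(text, chunk_size=4, overlap=1):
    print(c)
```

Because consecutive chunks share words at their boundaries, a short sentence that straddles a boundary can still appear intact in one chunk, provided the overlap is large enough.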

An organization stores its knowledge base in Oracle Autonomous Database and wants to build a RAG chatbot using OCI Generative AI. The chatbot must retrieve the most relevant documents based on user queries. Which indexing approach is BEST suited for efficient similarity search on text embeddings?

Create an ANN index on the embedding vector column.

ANN indexes enable fast approximate nearest neighbor search in vector databases.

Create a bitmap index on the embedding vector column.

Create an inverted index on the document text column.

Create a B-tree index on the document text column.

Why: Option A is correct because Approximate Nearest Neighbor (ANN) indexes are specifically designed for high-dimensional vector spaces, enabling efficient similarity search on embedding vectors. In Oracle Autonomous Database, ANN indexes (e.g., using IVF or HNSW algorithms) drastically reduce search latency compared to brute-force scans, which is critical for real-time RAG chatbot responses.
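A toy IVF-style index in Python shows the idea behind ANN search: vectors are grouped by their nearest centroid, and a query scans only the few closest groups instead of every vector. This is not how Oracle AI Vector Search is implemented. To keep the example short, centroids are sampled from the data rather than trained with k-means, and the vectors are random.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ToyIVFIndex:
    """Minimal IVF-style ANN index: vectors are bucketed by nearest centroid,
    and a query scans only the `nprobe` buckets closest to it."""

    def __init__(self, vectors, n_lists, seed=0):
        rng = random.Random(seed)
        # Simplification: sample centroids from the data instead of running k-means.
        self.centroids = rng.sample(vectors, n_lists)
        self.lists = [[] for _ in range(n_lists)]
        for i, v in enumerate(vectors):
            self.lists[self._nearest_lists(v, 1)[0]].append((i, v))

    def _nearest_lists(self, v, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: cosine(v, self.centroids[c]), reverse=True)
        return order[:n]

    def search(self, query, k, nprobe):
        # Only vectors in the nprobe closest buckets are scored.
        candidates = [item for c in self._nearest_lists(query, nprobe)
                      for item in self.lists[c]]
        candidates.sort(key=lambda item: cosine(query, item[1]), reverse=True)
        return [i for i, _ in candidates[:k]]

def exact_search(vectors, query, k):
    """Brute-force baseline: score every vector."""
    order = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]

rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
query = [rng.gauss(0, 1) for _ in range(8)]
index = ToyIVFIndex(data, n_lists=10)
print(index.search(query, k=5, nprobe=2))
print(exact_search(data, query, k=5))
```

Raising `nprobe` trades speed for recall; with `nprobe` equal to the number of lists, the index scans everything and returns the same result as the brute-force baseline.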

A company is deploying a RAG pipeline using OCI Data Science and OCI Generative AI. The pipeline uses a Cohere command model for generation and a Cohere embed model for retrieval. The team notices that the model occasionally produces hallucinated answers that are not supported by the retrieved context. Which strategy is MOST effective at reducing hallucinations?

Implement a faithfulness verification step that re-ranks retrieved passages based on alignment with the generated answer.

A verification step can detect and mitigate unsupported claims.

Increase the temperature parameter of the generation model.

Increase the number of retrieved chunks (k) to provide more context.

Use a larger generative model with more parameters.

Why: A faithfulness verification step is correct because checking the generated answer against the retrieved passages, and re-ranking those passages by how well they support it, can directly surface and filter out unsupported claims. Increasing the temperature adds randomness and can increase hallucinations. Retrieving more chunks can introduce conflicting or irrelevant information. A larger model does not guarantee faithfulness and increases cost.
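A minimal sketch of a faithfulness check follows. It uses lexical overlap as a crude stand-in for the entailment model or LLM judge a production system would more likely use; the helper names, stop-word list, and 0.6 threshold are assumptions for illustration.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "on", "with", "it"}

def _content_words(text):
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}

def support_score(sentence, passage):
    """Fraction of the sentence's content words that also appear in the passage."""
    words = _content_words(sentence)
    return len(words & _content_words(passage)) / len(words) if words else 1.0

def verify_answer(answer, passages, threshold=0.6):
    """Re-rank passages by how well they support the answer, and flag answer
    sentences that no passage supports above `threshold`."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    ranked = sorted(passages, key=lambda p: support_score(answer, p), reverse=True)
    unsupported = [s for s in sentences
                   if max((support_score(s, p) for p in passages), default=0.0) < threshold]
    return ranked, unsupported

passages = [
    "Employees accrue 20 days of paid leave per year.",
    "Unused leave can be carried over for up to 5 days.",
]
answer = "Employees accrue 20 days of paid leave per year. Leave requests need CEO approval."
ranked, unsupported = verify_answer(answer, passages)
print(unsupported)
```

The second sentence shares almost no content words with either passage, so it is flagged for removal or regeneration rather than shown to the user.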

A data scientist is building a RAG application that processes PDF invoices. The extraction step uses OCI Document Understanding to convert PDFs to text. The scientist then splits the text into chunks and generates embeddings using OCI Generative AI. However, the retrieval often misses critical fields like invoice numbers and dates. Which preprocessing step would MOST likely improve retrieval of these specific fields?

Increase the chunk size to include entire invoices.

Apply stemming and lemmatization to the text before chunking.

Tag each chunk with metadata such as invoice number, date, and vendor, and use metadata filtering during retrieval.

Metadata filtering enables precise retrieval based on structured fields.

Switch from dense embeddings to sparse embeddings for better exact match.

Why: Option C is correct because metadata tagging and filtering directly target specific fields like invoice numbers and dates. Identifiers such as an invoice number carry little semantic meaning, so embedding similarity alone often fails to match them. By attaching metadata (e.g., invoice number, date, vendor) to each chunk and filtering on those fields during retrieval, the RAG system can locate the relevant chunks precisely instead of relying solely on semantic similarity. OCI Document Understanding can extract these structured fields, and the vector store then combines exact metadata matching with dense-embedding similarity search.
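Metadata filtering followed by similarity ranking can be sketched in a few lines. The `Chunk` type, field names, and two-dimensional embeddings are hypothetical. In a real deployment the filter would normally be pushed down to the vector store (for example, as a `WHERE` clause or a search filter) rather than applied in application code.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def filtered_search(chunks, query_embedding, filters, k=3):
    """Keep only chunks whose metadata matches every filter exactly,
    then rank the survivors by cosine similarity to the query."""
    matches = [c for c in chunks
               if all(c.metadata.get(key) == value for key, value in filters.items())]
    matches.sort(key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return matches[:k]

chunks = [
    Chunk("Invoice INV-1001 total due 450.00", [0.9, 0.1], {"invoice_number": "INV-1001", "vendor": "Acme"}),
    Chunk("Invoice INV-1002 total due 980.00", [0.88, 0.12], {"invoice_number": "INV-1002", "vendor": "Acme"}),
    Chunk("Payment terms: net 30 days", [0.2, 0.9], {"invoice_number": "INV-1002", "vendor": "Acme"}),
]
hits = filtered_search(chunks, [1.0, 0.0], {"invoice_number": "INV-1002"})
print([h.text for h in hits])
```

The INV-1001 chunk is closer to the query vector than the payment-terms chunk, but the filter excludes it, which is the behavior needed when a user asks about one specific invoice.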

A developer is using OCI Generative AI to build a question-answering system over a large corpus of technical manuals. The developer uses the Cohere Embed model to generate embeddings and stores them in an OCI OpenSearch cluster. Queries are slow and the team needs to reduce latency. Which approach is BEST for improving search speed while maintaining acceptable accuracy?

Increase the embedding dimension for better representation.

Reduce the k value in the nearest neighbor search.

Fewer neighbors means less distance computation and faster retrieval.

Use exact nearest neighbor search instead of approximate.

Increase the index refresh interval to reduce write overhead.

Why: Reducing k is the best lever here because fewer neighbors have to be gathered, scored, and returned for each query. In approximate nearest neighbor (ANN) search, the amount of graph or list exploration generally grows with k, so a smaller k lowers query latency while keeping accuracy acceptable, provided the original k was larger than the application needs. The other options work against latency: a larger embedding dimension makes every distance computation more expensive, exact search scans every vector, and the index refresh interval affects indexing overhead rather than query speed.

A team is deploying a RAG system that uses OCI Generative AI to answer questions about internal HR policies. The system must comply with data residency requirements: all data processing must stay within a specific OCI region. The team uses OCI Data Science for orchestration. Which architecture BEST meets the data residency requirement?

Deploy the generative AI model endpoints within the same OCI region as the data and compute.

All components remain in the specified region, ensuring compliance.

Use OCI Generative AI endpoints in a different region but store data in the required region.

Use an external third-party LLM endpoint that guarantees data residency.

Store embeddings in a different region but run inference in the required region.

Why: Option A is correct because deploying the generative AI model endpoints within the same OCI region as the data and compute ensures that all data processing—including inference, embedding generation, and vector search—occurs entirely within the required region, satisfying data residency requirements. OCI Generative AI endpoints are region-specific and do not automatically route requests to other regions, so co-locating all components avoids any cross-region data transfer.


Domain 4: Deploying and Managing Generative AI on OCI

A company is deploying a generative AI service on OCI using the OCI Data Science service with a large language model (LLM) in a VCN. The model inference endpoint must be accessible only from a private subnet within the same VCN. Which networking component should be configured to enable this?

NAT Gateway

Dynamic Routing Gateway (DRG)

Internet Gateway

Service Gateway

Service gateway enables private subnet access to OCI services like Data Science.

Why: A Service Gateway enables private subnet resources to access OCI services (including the OCI Data Science model deployment endpoint) without traversing the internet. Since the inference endpoint must be accessible only from a private subnet within the same VCN, the Service Gateway provides the necessary private connectivity by routing traffic over the OCI network fabric, not through a NAT or internet gateway.

A data scientist is fine-tuning a generative AI model on OCI Data Science using a custom container with GPU resources. The training job fails with an out-of-memory error despite the GPU instance having sufficient memory. The job works fine on a smaller dataset. What is the most likely cause?

The training script has a memory leak

The GPU instance is not supported by OCI Data Science

The model is not compatible with the PyTorch version

The batch size is too large for the GPU memory

Large batch size can cause OOM errors; reducing batch size resolves it.

Why: The most likely cause is that the batch size is too large. The GPU may have plenty of total memory, but each training step must hold the model parameters, gradients, optimizer states, and the activations for the whole batch at once, and activation memory grows with batch size × sequence length. A larger dataset is more likely to contain long sequences; when each batch is padded to its longest example, a single long document can push one step past the available memory. Reducing the batch size lowers that peak, which explains why the job succeeds on the smaller dataset but fails on the larger one.
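A rough illustration of why dataset content, not just batch size, affects peak memory: with per-batch padding, the padded token count (batch size × longest sequence in the batch) is a crude proxy for activation memory. The sequence lengths are made up, and the proxy ignores parameters, gradients, and optimizer states.

```python
def padded_tokens(batch_lengths):
    """Tokens in one padded batch: every sequence is padded to the longest one."""
    return len(batch_lengths) * max(batch_lengths)

def peak_padded_tokens(lengths, batch_size):
    """Largest padded batch (in tokens) when examples are batched in order."""
    batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    return max(padded_tokens(b) for b in batches)

small_dataset = [120, 200, 150, 180, 90, 160, 140, 170]
large_dataset = small_dataset + [2048, 130, 110, 190]  # one long document
for bs in (8, 4, 2):
    print(bs, peak_padded_tokens(small_dataset, bs), peak_padded_tokens(large_dataset, bs))
```

One 2,048-token example raises the peak for its entire batch, and halving the batch size from 4 to 2 halves that peak.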

An organization wants to deploy a generative AI chatbot using OCI Generative AI service. The chatbot must comply with data residency requirements by ensuring that all data processing occurs within a specific geographic region. What is the best practice to achieve this?

Use a dedicated AI cluster in the required region

Dedicated AI clusters are region-specific and ensure data stays in that region.

Enable cross-region replication for disaster recovery

Configure a tenancy-wide policy to restrict region usage

Use IAM policies to block access from other regions

Why: Option A is correct because OCI Generative AI service allows you to provision a dedicated AI cluster within a specific region, ensuring all model inference and data processing remain within that geographic boundary. This dedicated cluster is isolated from other regions and complies with data residency requirements by design, as no data leaves the chosen region during processing.

A team has deployed a generative AI model using OCI Data Science model deployment. The endpoint is behind a load balancer. Users report that after 5 minutes of inactivity, the first request takes over 30 seconds to respond, while subsequent requests are fast. What is the most likely cause and solution?

The model deployment has an idle timeout that scales down to zero; configure a minimum number of instances or use a warm-up request

Idle timeout causes cold start; setting min replicas or health check warm-up solves it.

The load balancer is scaling based on CPU utilization; increase the CPU threshold

The VCN has a network latency issue; use a different availability domain

The inference code has a lazy initialization; pre-load the model in the deployment script

Why: The described behavior, a first request taking over 30 seconds after 5 minutes of inactivity while subsequent requests are fast, is the classic symptom of a cold start. Among the options, an idle timeout that scales the deployment down to zero instances best explains it: when a new request arrives, it must wait for an instance to start and load the model. The solution is to configure a minimum number of instances (e.g., 1) to keep the model warm, or to send periodic warm-up requests so the idle timeout never triggers. Lazy initialization would slow only the first request after the deployment starts, not the first request after every idle period, and neither a load-balancer CPU threshold nor network latency between availability domains would cause a 30-second delay tied to inactivity.

A company is using OCI Generative AI service with a dedicated AI cluster for text generation. They notice that the latency is higher than expected. The cluster is in the Ashburn region, and users are distributed globally. What is the most effective way to reduce latency?

Enable the OCI Generative AI inference optimizer

Deploy dedicated AI clusters in regions closer to the users

Geographic proximity reduces network round-trip time.

Increase the number of nodes in the dedicated AI cluster

Use a content delivery network (CDN) to cache responses

Why: Latency for globally distributed users is primarily driven by network distance and the speed of light. Deploying dedicated AI clusters in regions closer to the users reduces the physical distance data must travel, directly minimizing network round-trip time (RTT). This is the most effective architectural change because OCI's Generative AI service processes each request on the dedicated cluster and cannot bypass geographic latency through software optimizations alone.
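A back-of-the-envelope check on why proximity matters: light in optical fiber covers roughly 200 km per millisecond, so great-circle distance sets a floor on round-trip time that no amount of cluster capacity can remove. Real network paths are longer than the great-circle route, so actual latency is higher; the coordinates in the usage line are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # light in optical fiber travels at roughly 2/3 c, ~200,000 km/s

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points on Earth, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def min_rtt_ms(distance_km):
    """Physical lower bound on round-trip time over fiber (real routes are longer)."""
    return 2 * distance_km / FIBER_KM_PER_MS

# Approximate coordinates: Ashburn, Virginia to Sydney, Australia.
print(round(min_rtt_ms(great_circle_km(39.04, -77.49, -33.87, 151.21))))
```

Every request from a distant user pays this floor at least once, which is why adding nodes in Ashburn does not help those users while a closer region does.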

A machine learning engineer is deploying a fine-tuned Llama 2 model on OCI Data Science model deployment. The deployment fails with an error: 'Model artifact exceeds the maximum allowed size of 10 GB.' The model files total 12 GB. What is the best approach to resolve this?

Store the model in Object Storage and reference it in the deployment configuration

Object Storage allows large models and is supported by model deployment.

Use a different model that is smaller than 10 GB

Increase the model deployment artifact size limit via a service request

Compress the model artifact to under 10 GB using gzip

Why: Option A is correct because OCI Data Science model deployment has a hard limit of 10 GB for a model artifact uploaded directly. Storing the model in Object Storage and referencing it in the deployment configuration sidesteps that direct-upload limit, because the deployment service loads the model from Object Storage at runtime instead of requiring it inside the deployment package. Compressing with gzip rarely helps, because model weights are dense floating-point data that compress poorly, and switching to a smaller model abandons the fine-tuned model rather than deploying it.


Frequently asked questions

How many questions are on the 1Z0-1127 exam?

The 1Z0-1127 exam has 40 questions and must be completed in 90 minutes. The passing score is 65/1000.

What types of questions appear on the 1Z0-1127 exam?

Scenario-based questions covering exam objectives with detailed answer explanations.

How are 1Z0-1127 questions organised by domain?

The exam covers 4 domains: Fundamentals of Large Language Models, Using OCI Generative AI Service, Building LLM Applications with RAG and Vector Search, Deploying and Managing Generative AI on OCI. Questions are weighted by domain — higher-weight domains appear more on your actual exam.

Are these the actual 1Z0-1127 exam questions?

No. These are original exam-style practice questions written against the official Oracle 1Z0-1127 exam objectives. They are not copied from the real exam. Courseiva focuses on genuine understanding, not memorisation of braindumps.


Oracle · Free Practice Questions · Last reviewed May 2026


24 original exam-style questions organised by domain, each with the correct answer highlighted and a plain-English explanation of why it's right and why the others are wrong.

40 exam questions

90 min time limit

Pass: 65/1000

4 exam domains
