Google Cloud Generative AI Leader Generative AI Leader (Generative AI Leader) — Questions 301375

500 questions total · 7pages · All types, answers revealed

Page 4

Page 5 of 7

Page 6
301
Multi-Selectmedium

Which THREE steps are required to secure a generative AI pipeline that uses Vertex AI and involves sensitive customer data?

Select 3 answers
A.Use VPC Service Controls to create a perimeter around Vertex AI resources
B.Apply IAM roles with least privilege and use service accounts for the pipeline
C.Expose the prediction endpoint publicly with an API key
D.Enable data encryption at rest using Cloud KMS
E.Disable audit logging to reduce data exposure
AnswersA, B, D

VPC-SC prevents data from leaking outside the perimeter.

Why this answer

Options A, B, and E are correct. Data encryption at rest protects stored data; VPC Service Controls prevent data exfiltration; IAM with least privilege controls access. Option C (public endpoint with API key) is insecure.

Option D (disable audit logging) reduces security visibility.

302
MCQeasy

A startup wants to quickly prototype a gen AI application. Which Google Cloud service should they use first?

A.Vertex AI Workbench
B.Cloud TPUs
C.Gen AI Studio
D.Dataflow
AnswerC

Provides a low-code environment for quickly testing and iterating on gen AI models.

Why this answer

Gen AI Studio (now part of Vertex AI) provides a low-code/no-code interface for quickly prototyping generative AI applications using pre-trained models like PaLM 2 and Gemini. It allows startups to experiment with prompts, tune models, and deploy without managing infrastructure, making it the fastest path from idea to prototype.

Exam trap

The trap here is that candidates confuse Vertex AI Workbench (a general ML IDE) with Gen AI Studio (a generative AI prototyping tool), or assume that rapid prototyping requires custom hardware like TPUs, when Google explicitly designed Gen AI Studio for this purpose.

How to eliminate wrong answers

Option A is wrong because Vertex AI Workbench is a Jupyter-based development environment for building custom ML models, not a rapid prototyping tool for generative AI; it requires more setup and coding. Option B is wrong because Cloud TPUs are specialized hardware accelerators for training large models, not a service for quick prototyping—they involve significant configuration and cost. Option D is wrong because Dataflow is a serverless data processing service for batch and stream pipelines (e.g., ETL), unrelated to generative AI application prototyping.

303
MCQhard

Refer to the exhibit. A developer sees this error when trying to deploy a model from Vertex AI Model Registry. What is the most likely cause?

A.The region is not supported
B.The developer used the model display name instead of the full resource name
C.The model is not published
D.The model is in a different project
AnswerB

Display name is not a valid model reference; the full resource path is required.

Why this answer

The error occurs because Vertex AI Model Registry requires the full resource name (e.g., 'projects/{project}/locations/{region}/models/{model_id}') to deploy a model, not just the display name. The display name is a human-readable label that is not unique within a project, while the full resource name uniquely identifies the model version. Using the display name causes the API to fail with a 'not found' or 'invalid argument' error.

Exam trap

Google Cloud often tests the distinction between display names (non-unique, human-readable) and resource names (unique, API-required) in cloud services like Vertex AI, where candidates mistakenly assume display names can be used interchangeably with resource identifiers.

How to eliminate wrong answers

Option A is wrong because Vertex AI supports model deployment in all regions where the service is available, and the error message does not indicate a regional restriction. Option C is wrong because a model can be deployed from the registry even if it is not published to the public; publishing is only required for sharing with external users or making it available in the Model Garden. Option D is wrong because the error would reference a cross-project permission issue (e.g., 'permission denied' or 'resource not found in project'), not a display name mismatch.

304
MCQeasy

A company is using Vertex AI to deploy a text generation model for a chatbot. They want to reduce the response latency. Which configuration change is most effective?

A.Enable model quantization
B.Use a smaller model variant
C.Increase the number of GPUs
D.Use a larger batch size
AnswerB

Smaller models have faster inference, directly reducing latency.

Why this answer

Option B is correct because using a smaller model variant directly reduces the number of parameters and computational operations required per inference, which lowers latency. In Vertex AI, smaller models like `text-bison@002` have fewer layers and attention heads than larger counterparts, resulting in faster token generation without requiring hardware changes.

Exam trap

Google Cloud often tests the misconception that increasing compute resources (GPUs) or batch size always reduces latency, when in fact these optimizations target throughput, not per-request response time.

How to eliminate wrong answers

Option A is wrong because model quantization (e.g., reducing weights from FP32 to INT8) can reduce memory footprint and improve throughput, but it does not guarantee lower latency per request and may introduce accuracy trade-offs; it is not the most effective single change for latency reduction. Option C is wrong because increasing the number of GPUs can improve throughput for batch processing but does not reduce per-request latency; in fact, it may increase communication overhead and cost without speeding up individual inference. Option D is wrong because using a larger batch size increases throughput for concurrent requests but actually increases the latency for each individual request, as the model processes more sequences together before returning results.

305
MCQmedium

A company is using Vertex AI Agent Builder to create a travel booking agent. They want the agent to book flights and hotels dynamically. What action type should they use?

A.Dynamic call
B.Static call
C.Webhook
D.Notification
AnswerC

Webhooks allow dynamic external API calls for booking.

Why this answer

Option A is correct because webhooks allow the agent to call external APIs dynamically. Option B is wrong because static calls are for predefined responses. Option C is wrong because 'Dynamic call' is not a standard action type.

Option D is wrong because notifications are for sending updates, not performing bookings.

306
MCQhard

You are a generative AI architect at a social media company. You are tasked with building a content moderation system that uses a generative model to flag toxic comments. The system must have very low false positive rates (i.e., not flag harmless comments) to avoid user backlash, but it must catch nearly all toxic comments. You have a large dataset of labeled toxic and non-toxic comments. You plan to use a pre-trained LLM and fine-tune it for classification. During experimentation, you notice that the model's recall for toxic comments is high (95%) but its precision is low (60%), leading to many false positives. You need to improve precision without substantially reducing recall. Which approach should you try first?

A.Gather additional toxic comments from similar platforms to augment the training data.
B.Apply a higher weight to the toxic class in the loss function during fine-tuning.
C.Use a smaller pre-trained model that is inherently less sensitive to subtle toxic language.
D.Tune the classification threshold on a held-out validation set to a higher value (e.g., require higher probability to classify as toxic).
AnswerD

Increasing the threshold reduces false positives (improves precision) with some loss in recall, which can be fine-tuned.

Why this answer

Option D is correct because threshold tuning is a straightforward, post-hoc method to trade recall for precision by raising the decision threshold. Option A is incorrect because adding more toxic samples might increase recall but not necessarily precision. Option B is incorrect because a smaller model might have less capacity to distinguish nuance, worsening precision.

Option C is incorrect because class weighting can improve recall but may hurt precision.

307
Multi-Selecteasy

Which TWO Google Cloud services can be used together to implement a RAG (retrieval-augmented generation) pipeline? (Select 2)

Select 2 answers
A.Cloud SQL
B.Vertex AI Vector Search
C.Bigtable
D.Vertex AI PaLM API
E.Cloud Functions
AnswersB, D

Provides vector similarity search for retrieval.

Why this answer

Vertex AI Vector Search (option B) is correct because it provides a managed vector database for storing and querying embeddings, which is essential for the retrieval step in a RAG pipeline. It enables semantic similarity search over large datasets, allowing the system to fetch relevant context documents based on a user query.

Exam trap

Google Cloud often tests the misconception that any database (like Cloud SQL or Bigtable) can serve as a vector store for RAG, but they lack native vector indexing and similarity search, making them unsuitable for efficient retrieval at scale.

308
Multi-Selecteasy

Which TWO are components of the Vertex AI Generative AI Studio?

Select 2 answers
A.Dataflow
B.Model Garden
C.Pipeline templates
D.Cloud Functions
E.Prompt Editor
AnswersB, E

Model Garden is a component for discovering and selecting models.

Why this answer

Model Garden is a core component of Vertex AI Generative AI Studio that provides a curated repository of foundation models, including Google's PaLM and Gemini models, as well as third-party models. It allows users to discover, compare, and deploy these models directly within the studio environment, making it essential for generative AI workflows.

Exam trap

Google Cloud often tests the distinction between core generative AI studio components (like Model Garden and Prompt Editor) and broader GCP services (like Dataflow or Cloud Functions) that are not part of the studio, leading candidates to select familiar but incorrect options.

309
MCQhard

An AI team is building a customer support chatbot for a telecom company using a fine-tuned LLM on Vertex AI. The model performs well on common issues but fails to answer correctly for rare or novel problems, often providing plausible-sounding but incorrect solutions. The team has a large corpus of internal troubleshooting documents. They want to minimize incorrect answers while keeping latency low. Which approach should they take?

A.Switch to a larger base model (e.g., Gemini Ultra) without any retrieval.
B.Implement a retrieval-augmented generation (RAG) pipeline using Vertex AI Search to fetch relevant documents before generating answers.
C.Collect more data on rare issues and continue fine-tuning the model weekly.
D.Use a few-shot prompt with 10 examples of rare problems and solutions.
AnswerB

RAG dynamically retrieves relevant context, enabling accurate answers for rare issues.

Why this answer

Option B is correct because RAG uses the troubleshooting documents as a knowledge base, providing grounded answers for rare issues without retraining. Option A is wrong because more fine-tuning on common issues won't help with rare ones. Option C is wrong because a larger model may increase latency and cost without solving the grounding problem.

Option D is wrong because few-shot examples cannot cover all rare scenarios.

310
Multi-Selectmedium

Which TWO are benefits of using retrieval-augmented generation (RAG) over fine-tuning?

Select 2 answers
A.No need for training
B.Higher accuracy on all tasks
C.More up-to-date information
D.Reduced model size
E.Lower latency
AnswersA, C

RAG does not require fine-tuning; it works with the base model plus retrieval.

Why this answer

Option A is correct because RAG does not require any training or fine-tuning of the underlying model. It works by retrieving relevant documents from an external knowledge base at inference time and providing them as context to the model, which generates an answer based on that context. This eliminates the need for costly and time-consuming model retraining or parameter updates.

Exam trap

Google Cloud often tests the misconception that RAG reduces latency or model size, when in fact it increases system complexity and inference time due to the retrieval step, while fine-tuning keeps the model unchanged in size and latency.

311
MCQhard

A developer is building a chatbot for a medical application that discusses sensitive health topics. The chatbot consistently gets its outputs blocked. What should the developer do?

A.Disable the safety filter entirely to allow all topics.
B.Adjust the safety category thresholds to allow VIOLENCE and SEXUAL content since it's medical.
C.Increase the input token limit to 2000.
D.Review and refine the system instructions to avoid triggering safety filters, and consider using a different model endpoint that allows medical contexts.
AnswerD

Refining prompts and using appropriate endpoints can prevent unnecessary blocks.

Why this answer

Option C is correct because reviewing and refining system instructions to avoid triggering safety filters, and possibly using a different model endpoint that allows medical contexts, directly addresses the filter blocks. Option A (disable safety filter) violates policy. Option B (adjust thresholds to allow VIOLENCE/SEXUAL) is inappropriate for medical context and may violate guidelines.

Option D (increase token limit) does not help with safety blocks.

312
MCQeasy

A company wants to build a chatbot that answers questions using their internal knowledge base. Which approach is most suitable?

A.Use Retrieval-Augmented Generation (RAG)
B.Fine-tune a model on the knowledge base
C.Train a new model from scratch
D.Use zero-shot prompting with no context
AnswerA

RAG retrieves relevant context and generates answers, perfect for knowledge base Q&A.

Why this answer

Retrieval-Augmented Generation (RAG) combines retrieval of relevant documents from a knowledge base with generative responses, making it ideal for this use case.

313
MCQmedium

Refer to the exhibit. A data scientist runs the gcloud command and sees the model listed. However, when they try to deploy the model to an endpoint, they get an error: 'Model is not deployable'. What is the most likely reason?

A.The model is still in training and not yet ready.
B.The model was imported from a custom container but without a serving specification or artifact.
C.The model does not have the correct IAM permissions assigned to the deployment service account.
D.The region for the endpoint is different from the model's region.
AnswerB

A model must have a serving container or artifacts to be deployable.

Why this answer

Option B is correct because a model imported from a custom container must include a serving specification (e.g., a `predict` route) and an artifact (e.g., a saved model file) to be deployable. Without these, Vertex AI cannot determine how to serve predictions, resulting in the 'Model is not deployable' error. The `gcloud` command listing the model only confirms its registration, not its readiness for deployment.

Exam trap

Google Cloud often tests the misconception that a model listed in the registry is automatically deployable, but the trap here is that Vertex AI separates model registration from deployment readiness, requiring explicit serving configuration for custom containers.

How to eliminate wrong answers

Option A is wrong because if the model were still in training, it would not appear in the model list via `gcloud`; Vertex AI only registers a model after training completes. Option C is wrong because IAM permissions affect the deployment action itself (e.g., who can deploy), not the deployability status of the model; the error 'Model is not deployable' is a model-level validation, not an authorization failure. Option D is wrong because region mismatch between the endpoint and model would cause a resource-location error, not a 'Model is not deployable' error; Vertex AI enforces regional consistency but does not block deployment based on region alone.

314
Multi-Selecteasy

A company is prompt engineering a model for customer support. They want to reduce hallucination (false information) in responses. Which TWO techniques are most effective? (Choose two.)

Select 2 answers
A.Implement RAG to retrieve relevant documents for context
B.Provide 3 few-shot examples of conversations
C.Reduce max output tokens to 150
D.Add a system instruction: 'Only answer based on the provided context.'
E.Increase temperature to 1.2
AnswersA, D

RAG provides factual grounding, reducing hallucination.

Why this answer

Correct options: B and D. B (Retrieval-Augmented Generation) grounds the model in real data. D (specify in system instruction to only use provided facts) instructs the model to rely on context.

A (increase temperature) increases creativity, worsening hallucination. C (few-shot examples) helps format but not factuality. E (reduce max tokens) only limits length.

315
Multi-Selectmedium

A company is considering whether to use Vertex AI's Generative AI Studio. Which TWO are benefits?

Select 2 answers
A.It is always cheaper than using third-party APIs
B.It integrates seamlessly with Vertex AI Pipelines for MLOps
C.It generates outputs that are always more accurate than custom models
D.It provides built-in tools for prompt engineering and iterative testing
E.It requires no coding or machine learning expertise to use
AnswersB, D

Integration allows automating deployment, monitoring, and retraining.

Why this answer

Option B is correct because Vertex AI Generative AI Studio is designed to work natively with Vertex AI Pipelines, enabling users to incorporate generative models into end-to-end MLOps workflows for automation, monitoring, and retraining. This integration allows seamless orchestration of prompt tuning, model evaluation, and deployment within the same managed environment, reducing operational overhead.

Exam trap

Google Cloud often tests the misconception that 'no-code' tools eliminate the need for any ML expertise, but the trap here is that Generative AI Studio still requires understanding of prompt engineering, model evaluation, and cost trade-offs to avoid poor outputs or unexpected expenses.

316
Multi-Selectmedium

A development team is integrating a large language model into a healthcare application. They need to reduce the risk of generating harmful medical advice. Which THREE measures should they implement? (Choose three.)

Select 3 answers
A.Use a safety filter to block outputs containing harmful medical terminology.
B.Implement RAG to retrieve verified medical information from trusted sources.
C.Fine-tune the model on a curated dataset of medical textbooks.
D.Include a disclaimer in the system instruction that the model is not a doctor.
E.Set the temperature to a very high value to ensure diverse outputs.
AnswersA, B, C

Safety filters directly block harmful content at inference time.

Why this answer

Options A, B, and C are correct. A safety filter blocks outputs containing harmful terminology, fine-tuning on a curated medical dataset improves domain knowledge and safety, and RAG with trusted sources grounds outputs in verified information. Option D (high temperature) increases randomness and risk.

Option E (disclaimer) does not reduce the generation of harmful advice.

317
MCQeasy

A small business wants to use Vertex AI to analyze customer reviews and extract sentiment, product mentions, and overall themes. They have a small dataset of 500 reviews in a CSV file. The team is not experienced with machine learning and wants a pre-built solution that requires minimal coding. They want to start quickly and scale later. Which Google Cloud offering should they use?

A.Cloud Natural Language API for pre-trained sentiment and entity extraction.
B.Vertex AI Workbench to build a custom sentiment analysis model.
C.AutoML Natural Language to train a custom model on their data.
D.Vertex AI Gemini API with zero-shot prompting.
AnswerA

This is a pre-built API that requires no ML experience and can be used immediately.

Why this answer

Option C is correct. Vertex AI's Natural Language API offers pre-trained models for sentiment and entity extraction. Option A (Vertex AI Workbench) requires coding.

Option B (AutoML) requires labeling and training. Option D (Gemini API) would require prompt engineering and is not purpose-built for this task.

318
MCQhard

A healthcare organization needs a generative AI model to answer medical questions using proprietary clinical guidelines. They have a large dataset of doctor-patient interactions. Should they fine-tune a pre-trained model or use Retrieval-Augmented Generation (RAG)?

A.Use RAG to reduce inference costs by skipping model updates.
B.Use RAG to retrieve relevant guidelines during inference, avoiding frequent retraining.
C.Use prompt engineering to encode all guidelines into the system prompt.
D.Fine-tune the model on the clinical guidelines and interactions.
AnswerB

RAG dynamically pulls up-to-date guidelines, ensuring accuracy and compliance.

Why this answer

RAG is preferred because it can incorporate the latest guidelines without retraining, crucial for regulatory changes. Fine-tuning may cause overfitting to outdated interactions. Option B is wrong because fine-tuning requires continuous retraining.

Option C is wrong because prompt engineering alone cannot inject proprietary knowledge. Option D is wrong because RAG does not inherently reduce cost.

319
MCQmedium

A team deployed a custom generative AI model using KServe on Google Kubernetes Engine (GKE) with the above configuration. They notice that the model is taking longer than expected to respond. What is the most likely cause?

A.The CPU resource limits are too low
B.The model is crashing due to insufficient memory
C.The model requires more than 1 GPU for acceptable performance
D.The container image is too large and takes time to pull
AnswerC

Large generative models often need multiple GPUs for low latency.

Why this answer

The configuration specifies 1 GPU, but the model requires more than 1 GPU for acceptable performance. KServe on GKE allocates GPU resources based on the `limits` field; if the model's inference workload exceeds the memory bandwidth or compute capacity of a single GPU, latency increases due to queuing and serialization. This is the most likely cause of the slow response time, as GPU-bound models are sensitive to under-provisioning.

Exam trap

The trap here is that candidates assume slow responses always indicate a resource shortage like CPU or memory, but for GPU-accelerated models, the most common cause of high latency is insufficient GPU compute or memory bandwidth, not CPU or memory limits.

How to eliminate wrong answers

Option A is wrong because CPU resource limits affect non-GPU compute tasks, but the primary bottleneck for a GPU-accelerated model is GPU throughput, not CPU; low CPU limits would cause throttling only if the model has CPU-intensive preprocessing or postprocessing, which is not indicated. Option B is wrong because insufficient memory would cause the pod to be OOMKilled (crash) rather than just slow responses; the model is responding, so memory is sufficient. Option D is wrong because the container image pull happens during pod startup, not during inference; once the pod is running, image size does not affect response latency.

320
Multi-Selectmedium

Which TWO of the following are best practices for prompt engineering?

Select 2 answers
A.Provide context and examples in the prompt
B.Append random noise to prompts to improve creativity
C.Use clear and specific instructions
D.Always use the maximum possible number of tokens
E.Use negative prompts to discourage undesired outputs
AnswersA, C

Context and examples help the model understand the desired output.

Why this answer

Clear and specific instructions help guide the model, and providing context and examples improves output quality. Options B, D, and E are not recommended.

321
MCQhard

A research lab is using Vertex AI to generate high-resolution medical images (2560x1920) of cell structures using Imagen. They have fine-tuned the model on their own microscope images. The generated images are sharp but often contain repeating patterns (e.g., identical cell arrangements) that are not biologically plausible. The team suspects the model is overfitting to spatial patterns in the training data. They have already tried increasing the training dataset size and augmenting it with rotations and flips. What additional technique should they try within Vertex AI?

A.Switch to a different foundation model like Stable Diffusion.
B.Add regularization techniques such as dropout layers or data augmentation that randomly crops and blends patches.
C.Use a larger batch size during fine-tuning.
D.Further increase the resolution of training images to 5120x3840.
AnswerB

Regularization helps prevent overfitting to specific spatial patterns.

Why this answer

Option D is correct. Adding regularization via dropout or batch normalization during fine-tuning can reduce overfitting. Option A (higher resolution) may exacerbate overfitting.

Option B (larger batch size) can help generalization but not specifically for repeating patterns. Option C (different model) is not a parameter tuning approach.

322
MCQeasy

A data scientist needs to fine-tune a foundation model for a sentiment analysis task without managing infrastructure. Which Google Cloud service should they use?

A.Compute Engine
B.BigQuery ML
C.Cloud Run
D.Vertex AI Model Garden
AnswerD

Model Garden offers managed fine-tuning of foundation models without infrastructure overhead.

Why this answer

Vertex AI Model Garden is the correct service because it provides a curated hub of foundation models that can be fine-tuned with managed infrastructure, eliminating the need for the data scientist to provision or manage servers. It supports one-click deployment and fine-tuning workflows for sentiment analysis, directly addressing the requirement to avoid infrastructure management.

Exam trap

The trap here is that candidates often confuse BigQuery ML's ability to train models on tabular data with the capability to fine-tune large language models, but BigQuery ML does not support fine-tuning of foundation models for NLP tasks.

How to eliminate wrong answers

Option A is wrong because Compute Engine is an IaaS offering that requires the user to manually provision, configure, and manage virtual machines, which contradicts the requirement of not managing infrastructure. Option B is wrong because BigQuery ML is designed for creating and executing machine learning models using SQL queries on structured data in BigQuery, not for fine-tuning large foundation models for natural language tasks like sentiment analysis. Option C is wrong because Cloud Run is a serverless container platform for running stateless HTTP-driven applications, but it does not provide native support for fine-tuning foundation models; it would require the user to build and manage the fine-tuning pipeline themselves.

323
MCQmedium

A company fine-tunes a model using Vertex AI and notices the model's performance drops on the original training task (e.g., language understanding) after fine-tuning for a new task (e.g., summarization). What could be the cause?

A.Data leakage
B.Model quantization
C.Catastrophic forgetting
D.Underfitting
AnswerC

Fine-tuning on a narrow task can overwrite general knowledge, leading to performance degradation on the original task.

Why this answer

Catastrophic forgetting occurs when a neural network loses previously learned knowledge upon being fine-tuned on a new task. In this scenario, fine-tuning the model for summarization overwrites the weights responsible for language understanding, causing performance degradation on the original task. This is a well-known limitation of sequential fine-tuning in deep learning.

Exam trap

Google Cloud often tests the distinction between catastrophic forgetting and underfitting, as candidates may mistakenly think the model simply didn't learn the new task well, rather than recognizing that it forgot the original task due to weight overwriting.

How to eliminate wrong answers

Option A is wrong because data leakage refers to the inadvertent exposure of target information during training, which would typically inflate performance metrics rather than cause a drop on the original task. Option B is wrong because model quantization reduces numerical precision (e.g., from FP32 to INT8) to improve inference speed and memory efficiency, but it does not inherently cause performance loss on a previously learned task; any accuracy loss from quantization is generally uniform across tasks. Option D is wrong because underfitting means the model fails to capture patterns in the training data, resulting in poor performance on both the original and new tasks, not a selective drop on the original task after fine-tuning.

324
MCQhard

A developer uses the Vertex AI Python SDK to call a Gemini model for structured JSON output. However, the model often returns malformed JSON. Which parameter should the developer set in the generation configuration to enforce valid JSON output?

A.Set the temperature to a lower value (0.1) to reduce variation.
B.Set the 'response_mime_type' parameter to 'application/json'.
C.Include few-shot examples of the desired JSON format in the system prompt.
D.Switch to a smaller model to reduce complexity.
AnswerB

This parameter forces the model to output valid JSON, supported by Gemini.

Why this answer

Option B is correct because setting `response_mime_type` to `'application/json'` in the generation configuration instructs the Gemini API to constrain the model's output to valid JSON format. This parameter leverages the model's native structured output capability, ensuring the response adheres to JSON syntax without relying on post-processing or prompt engineering.

Exam trap

Google Cloud often tests the misconception that prompt engineering (e.g., few-shot examples or temperature tuning) can reliably enforce structured output, when in fact the correct approach is to use the API's native structured output parameter like `response_mime_type`.

How to eliminate wrong answers

Option A is wrong because lowering temperature reduces randomness but does not enforce structural constraints; the model can still produce malformed JSON due to token-level deviations. Option C is wrong because few-shot examples in the system prompt improve formatting consistency but do not guarantee valid JSON output, as the model may still generate syntax errors or deviate from the schema. Option D is wrong because switching to a smaller model reduces capacity and may increase the likelihood of malformed output, and model size does not address the need for structured output enforcement.

325
MCQhard

A healthcare startup fine-tunes a model to generate patient education materials. They want to ensure the model never gives medical advice, only information. They add a safety instruction, but the model sometimes still gives advice. What advanced technique should they apply?

A.Hard-code a list of prohibited phrases in a post-processing script
B.Add a secondary classifier to rewrite any detected advice into general information
C.Use semantic similarity to a 'medical advice' embedding and reject if close
D.Apply RLHF with a reward model that penalizes outputs containing medical advice
AnswerD

RLHF directly optimizes the model to avoid undesired behaviors based on human preferences.

Why this answer

Option C is correct because reinforcement learning from human feedback (RLHF) with a reward model that penalizes advice can steer the model away from that behavior. Option A is wrong because hard-coded rules may not cover all cases. Option B is wrong because embedding distance is not effective for controlling output content.

Option D is wrong because output filtering can block but does not prevent generation of advice in the first place.

326
MCQmedium

An e-commerce company uses a generative AI model to generate marketing copy. They notice that the model occasionally produces off-brand or inappropriate content. What is the best way to mitigate this?

A.Reduce the model's temperature
B.Increase the model's top-k sampling
C.Use a safety filter
D.Fine-tune the model on brand guidelines
AnswerD

Trains the model to adhere to brand style and content.

Why this answer

Option B is correct because fine-tuning on brand guidelines directly addresses brand-specific content issues. Option A (safety filter) is too broad. Option C (reducing temperature) affects creativity but not brand adherence.

Option D (increasing top-k sampling) increases diversity, not control.

327
Multi-Selecteasy

A developer wants to use the Gemini API to generate creative text. Which TWO parameters can they adjust to influence the output?

Select 2 answers
A.Color space
B.Audio sample rate
C.Top-k
D.Image size
E.Temperature
AnswersC, E

Top-k limits the vocabulary sampled.

Why this answer

Temperature and top-k are standard parameters to control randomness and creativity. Color space, image size, and audio sample rate are not relevant for text generation.

328
MCQeasy

A data scientist wants to quickly prototype a text generation application using Google's foundation models. Which Google Cloud service should they use?

A.Generative AI Studio
B.Cloud Natural Language API
C.Vertex AI Prediction
D.AI Platform Training
AnswerA

Generative AI Studio provides a no-code interface to prototype with foundation models.

Why this answer

Generative AI Studio is designed for rapid prototyping with foundation models. Vertex AI Prediction is for serving, AI Platform Training is for custom training, and Cloud Natural Language API is for analysis, not generation.

329
MCQeasy

A healthcare company wants to use Gemini to analyze patient records and summarize findings. Which data privacy practice is most critical when using the Gemini API on Vertex AI?

A.Fine-tune Gemini using PHI to improve accuracy.
B.Disable request-response logging in Vertex AI to ensure data is not stored.
C.Enable Vertex AI Data Governance to mask or redact PII before sending to the API.
D.Use the text-davinci-003 model instead of Gemini, as it is more private.
AnswerC

D is correct because Data Governance can automatically protect sensitive data.

Why this answer

Option C is correct because Vertex AI Data Governance allows you to configure data masking or redaction of personally identifiable information (PII) before the data is sent to the Gemini API, ensuring compliance with healthcare regulations like HIPAA. This is the most critical practice because it prevents PHI from being exposed to the model or stored in logs, directly addressing the core privacy requirement. Disabling logging alone (Option B) does not prevent PHI from being processed by the model, and fine-tuning with PHI (Option A) introduces significant compliance risks.

Exam trap

Google Cloud often tests the misconception that disabling logging is sufficient for data privacy, when in fact the critical step is preventing sensitive data from being sent to the API in the first place, which is achieved through data masking or redaction.

How to eliminate wrong answers

Option A is wrong because fine-tuning Gemini using PHI would require storing and processing that data in a training pipeline, which violates HIPAA and other data privacy regulations unless strict de-identification and contractual safeguards are in place; it also increases the attack surface for data breaches. Option B is wrong because disabling request-response logging in Vertex AI prevents storage of API interactions but does not prevent PHI from being sent to and processed by the Gemini model itself, leaving the data exposed during inference. Option D is wrong because text-davinci-003 is an OpenAI model, not available on Vertex AI, and it does not inherently offer better privacy controls; the comparison is irrelevant and the premise is false.

330
Multi-Selecthard

Which THREE of the following are potential risks when deploying generative AI?

Select 3 answers
A.Hallucinations
B.Memorization of sensitive training data
C.Bias and fairness issues
D.Increased model accuracy
E.Toxic or harmful content generation
AnswersA, C, E

Models can generate false or fabricated information.

Why this answer

Option A is correct because generative AI models, particularly large language models (LLMs), can produce plausible-sounding but factually incorrect or nonsensical outputs, known as hallucinations. This occurs due to the model's probabilistic nature and lack of true understanding, where it generates text based on learned patterns rather than verified facts.

Exam trap

Google Cloud often tests the distinction between risks and benefits, so the trap here is that candidates may mistakenly identify 'increased model accuracy' as a risk, when it is actually a performance improvement and not a deployment risk.

331
Multi-Selectmedium

A healthcare provider is planning to deploy generative AI for clinical note summarization. Which THREE actions are essential for regulatory compliance (e.g., HIPAA)?

Select 3 answers
A.Implement role-based access controls to limit who can view AI-generated notes.
B.Anonymize patient data before using it for model training or inference.
C.Allow clinicians to share AI-generated summaries with anyone in the organization.
D.Store raw patient data in model training logs for auditing.
E.Ensure data encryption at rest and in transit.
AnswersA, B, E

Access controls ensure only authorized users see sensitive data.

Why this answer

Option A is correct because role-based access controls (RBAC) are a core requirement under HIPAA's Security Rule (45 CFR § 164.312(a)(1)) to ensure that only authorized personnel can access electronic protected health information (ePHI). In the context of generative AI for clinical note summarization, RBAC prevents unauthorized viewing of AI-generated summaries that may contain sensitive patient data, thereby enforcing the minimum necessary standard.

Exam trap

Google Cloud often tests the misconception that sharing AI-generated summaries freely within an organization is acceptable under HIPAA, when in fact the minimum necessary rule strictly limits access to only those who need the information for their job functions.

332
MCQeasy

Which technique allows a model to incorporate real-time data from external APIs?

A.RAG with tool calling
B.Prompt engineering
C.Fine-tuning
D.Model pruning
AnswerA

Enables dynamic API access during generation.

Why this answer

RAG with tool calling enables the model to query external APIs in real-time to retrieve current data. Fine-tuning uses static data, prompt engineering alone doesn't fetch data, and model pruning reduces size.

333
Multi-Selectmedium

Which TWO strategies are effective for reducing latency in a generative AI chat application deployed on Vertex AI? (Select 2)

Select 2 answers
A.Deploy on TPU instead of GPU
B.Use streaming responses
C.Increase the max output tokens
D.Enable model quantization
E.Use larger batch sizes
AnswersB, D

Reduces perceived latency.

Why this answer

Option B is correct because streaming responses reduce perceived latency by sending tokens to the client as they are generated, rather than waiting for the full response. This leverages server-sent events (SSE) or chunked transfer encoding to deliver partial results immediately, improving user experience in chat applications.

Exam trap

Google Cloud often tests the distinction between reducing actual latency (e.g., model optimization) versus reducing perceived latency (e.g., streaming), and candidates mistakenly choose options that increase throughput (like larger batch sizes) without realizing they harm per-request latency.

334
MCQeasy

A startup is building a customer support chatbot using Vertex AI and wants to ground responses in their product documentation to reduce hallucinations. Which approach should they use?

A.Enable Vertex AI Grounding with a custom enterprise data store containing the documentation.
B.Use the Codey API for text generation.
C.Use the base model without any grounding to maximize flexibility.
D.Fine-tune the model on the documentation and deploy.
AnswerA

Grounding ties responses to specific documents, reducing hallucinations.

Why this answer

Vertex AI Grounding with a custom enterprise data store is the correct approach because it allows the chatbot to retrieve and cite specific chunks from the product documentation in real time, directly reducing hallucinations by constraining responses to verified content. This method uses the underlying grounding service to query a vector-based data store (powered by Vertex AI Search) and append source references to the model's output, ensuring factual accuracy without retraining.

Exam trap

Google Cloud often tests the misconception that fine-tuning is the best way to incorporate domain knowledge, but the trap here is that fine-tuning does not provide dynamic, verifiable grounding with citations, whereas Vertex AI Grounding with a custom data store does, making it the correct choice for reducing hallucinations in a retrieval-augmented generation use case.

How to eliminate wrong answers

Option B is wrong because the Codey API is designed for code generation tasks (e.g., code completion, chat), not for grounding responses in external documents; it lacks the retrieval-augmented generation (RAG) capabilities needed to reduce hallucinations from product documentation. Option C is wrong because using a base model without grounding maximizes flexibility but also maximizes the risk of hallucination, as the model relies solely on its training data and cannot verify facts against the documentation. Option D is wrong because fine-tuning the model on the documentation embeds the content into the model's weights, which is static, costly to update, and does not provide real-time citation or retrieval; it also risks overfitting and does not leverage Vertex AI's built-in grounding infrastructure for dynamic fact-checking.

335
MCQeasy

A prompt engineer wants to improve the model's adherence to a specific output format (e.g., always start with a greeting). Which technique should they try first?

A.Use a lower temperature to make the output more deterministic.
B.Fine-tune the model on many examples of the desired format.
C.Include a system instruction at the beginning of the prompt that specifies the desired format.
D.Modify the model's tokenizer to encode the format rules.
AnswerC

System instructions set global behavior and are the easiest first step.

Why this answer

Option C is correct because system instructions are the most direct and efficient method to enforce output formatting in large language models. By placing a clear directive at the beginning of the prompt (e.g., 'Always start your response with a greeting'), the model's attention mechanism is guided to prioritize this rule during generation, without requiring retraining or hyperparameter changes.

Exam trap

Google Cloud often tests the misconception that hyperparameter tuning (like temperature) can enforce structural output rules, when in fact it only controls randomness, not format adherence.

How to eliminate wrong answers

Option A is wrong because lowering temperature reduces randomness but does not enforce a specific structural rule like starting with a greeting; it only makes token selection more deterministic, which could still produce varied formats. Option B is wrong because fine-tuning is a resource-intensive process that requires a curated dataset and retraining, making it an overkill for a simple formatting constraint that can be achieved with a prompt instruction. Option D is wrong because modifying the tokenizer would alter how input text is split into tokens, not how the model adheres to output format rules; tokenizers have no mechanism to enforce generation constraints.

336
MCQhard

A team is training a custom foundation model using JAX on TPUs on Google Cloud. They encounter frequent Out of Memory (OOM) errors. Which action is most effective in resolving the OOM error?

A.Reduce the model size by decreasing the number of layers.
B.Increase the batch size to maximize TPU utilization.
C.Use mixed precision training (bfloat16) to reduce memory footprint.
D.Enable model parallelism using GSPMD to distribute the model across TPU cores.
AnswerD

Model parallelism directly addresses memory constraints by partitioning the model.

Why this answer

Option D is correct because OOM errors when training large foundation models on TPUs often stem from the model exceeding the memory of a single TPU core. GSPMD (Generalized SPMD) enables automatic model parallelism, sharding the model's parameters, gradients, and optimizer states across multiple TPU cores, thereby reducing per-core memory pressure without altering the model architecture or precision.

Exam trap

Google Cloud often tests the misconception that mixed precision (bfloat16) alone is sufficient to resolve OOM errors, when in fact for very large models the memory bottleneck is the model size itself, not just the precision, and model parallelism is required.

How to eliminate wrong answers

Option A is wrong because reducing the number of layers changes the model architecture and may degrade model quality; it is a workaround, not a systematic solution to memory management. Option B is wrong because increasing the batch size increases memory consumption for activations and gradients, exacerbating OOM errors rather than resolving them. Option C is wrong because while mixed precision training (bfloat16) halves the memory footprint of tensors, it does not address the fundamental issue of a model being too large to fit on a single TPU core; it only provides a constant-factor reduction and may still result in OOM for very large models.

337
MCQeasy

Refer to the exhibit. A developer sees this error when trying to call a Vertex AI endpoint for online prediction. What permission does the requesting identity need to be granted?

A.aiplatform.prediction.predict
B.aiplatform.endpoints.predict
C.aiplatform.endpoints.use
D.aiplatform.models.predict
AnswerB

The error explicitly states this permission is required.

Why this answer

The error occurs when calling a Vertex AI endpoint for online prediction, which requires the `aiplatform.endpoints.predict` permission. This permission is specifically scoped to the endpoint resource, allowing the identity to send prediction requests to a deployed model endpoint. The correct IAM role binding must include this permission for the requesting identity to successfully invoke the endpoint.

Exam trap

Google Cloud often tests the distinction between permissions scoped to endpoints versus models, and candidates mistakenly choose `aiplatform.models.predict` because they think prediction is always tied to the model, not the endpoint serving it.

How to eliminate wrong answers

Option A is wrong because `aiplatform.prediction.predict` is not a valid IAM permission in Vertex AI; the correct permission for prediction is scoped to the endpoint or model resource, not a generic 'prediction' service. Option C is wrong because `aiplatform.endpoints.use` does not exist as a permission; Vertex AI uses `aiplatform.endpoints.predict` for invoking predictions on endpoints. Option D is wrong because `aiplatform.models.predict` is a permission for calling prediction directly on a model resource, not on an endpoint, and the error specifically references an endpoint call, not a model call.

338
MCQmedium

A healthcare company is using Vertex AI to build a generative AI assistant that helps doctors draft clinical notes. The assistant uses a fine-tuned PaLM 2 model deployed on a private endpoint. Recently, doctors have reported that the assistant takes over 30 seconds to respond, causing workflow delays. Additionally, the monthly Vertex AI costs have increased by 40% without a proportional increase in usage. The model responses are generally accurate but sometimes include irrelevant details. The company wants to improve response time and cost while maintaining acceptable quality. A review of logs shows that most requests are for similar note types (e.g., progress notes, discharge summaries) and that the same prompt is used repeatedly with minor variations. What should the company do first?

A.Switch to a larger model (e.g., Gemini 1.5 Pro) to improve response quality and reduce irrelevant details
B.Increase the Vertex AI endpoint's maximum request quota to handle concurrent requests
C.Apply model quantization (e.g., INT8) to reduce model size and inference time
D.Implement response caching for common queries and batch process similar requests
AnswerD

Caching reduces redundant computations, and batching improves throughput, together cutting latency and cost.

Why this answer

Option B is correct because implementing caching and batching directly addresses latency and cost by reusing common responses and processing requests in groups. Option A (switching to a larger model) would increase latency and cost. Option C (increasing quota) does not improve performance or cost efficiency.

Option D (model quantization) might help latency but could reduce accuracy; it's also more complex than caching/batching as a first step.

339
MCQmedium

A startup is building a generative AI tool that helps users write code. They want to launch quickly but need to ensure the generated code is secure and does not introduce vulnerabilities. They have a small team of developers with some ML experience. The tool should be cloud-hosted. Which approach balances speed, security, and cost?

A.Deploy the tool without any security checks and rely on manual review
B.Train a custom code generation model from scratch on a large dataset
C.Use a pre-trained code model (e.g., Codey) and add a security filtering layer
D.Use a smaller model and restrict outputs to only simple code patterns
AnswerC

Leverages existing model, adds security checks, fast to deploy.

Why this answer

Option B is correct because using a pre-trained code model with a security filtering layer provides a good balance: quick start, built-in safety checks, and manageable cost. Option A (building from scratch) is too slow. Option C (manual review) doesn't scale.

Option D (restricting outputs) may reduce usefulness.

340
MCQhard

Refer to the exhibit. This is the IAM policy for a project containing a Vertex AI Agent Builder agent and a data store. The agent is unable to access the data store. What is the most likely cause?

A.The user needs more permissions
B.The agent needs a bigger quota
C.The agent service account needs the data store viewer role
D.The data store is not in the same region
AnswerC

The agent's service account must have access to the data store.

Why this answer

Option A is correct because the agent uses a service account that needs the data store viewer role. The exhibited policy grants admin to a user, not the agent's service account. Options B, C, and D are unlikely given the error context.

341
MCQmedium

A team is deploying a large language model for legal document summarization. They find the model occasionally omits critical legal clauses. Which improvement technique would be most effective?

A.Design a prompt that explicitly lists required sections
B.Increase the top_p value to 1.0
C.Fine-tune the model on legal summaries
D.Lower the temperature to 0.1
AnswerA

A structured prompt with requirements improves completeness.

Why this answer

Using prompt engineering with explicit instructions to include all clauses and possibly a checklist directly addresses omissions. Option A is wrong because fine-tuning would require labeled data of summaries with clauses. Option B is wrong because temperature reduction might make output less creative but doesn't enforce completeness.

Option D is wrong because it adds randomness, making omissions more likely.

342
MCQhard

A global e-commerce company uses Vertex AI Gemini API for real-time product description generation. They observe that sometimes the model generates text in a language other than the user's language, despite being prompted in English. They need to ensure output language consistency. Which approach is most effective?

A.Set the language parameter in the generation config to 'en'
B.Fine-tune the model on a dataset of English-only product descriptions
C.Configure a safety filter that blocks non-English text
D.Run a language detection model on the output and regenerate if not English
AnswerC

Vertex AI allows custom safety filters; blocking non-English text ensures output language consistency.

Why this answer

Option C is correct because using a safety filter with language detection blocks unintended languages. Option A (setting the generation config language parameter) is not directly available in Gemini API. Option B (fine-tuning for language detection) is overkill.

Option D (post-processing with translation) is reactive and adds latency.

343
MCQmedium

A developer is building a customer support chatbot using a large language model. The chatbot frequently generates plausible-sounding but incorrect answers to product questions. Which technique should be applied to improve factual accuracy?

A.Provide a few-shot example of correct answers in the prompt.
B.Use a higher temperature setting to encourage more creative responses.
C.Increase the model's context length to include more of the conversation history.
D.Enable Grounding with the company's product knowledge base.
AnswerD

Grounding retrieves live, verified data and injects it into the prompt, directly improving factual accuracy.

Why this answer

Option D is correct because Grounding (e.g., using Vertex AI Grounding with Search) retrieves relevant information from a trusted source in real time, reducing hallucination. Option A is wrong because increasing context length may include more irrelevant information and does not guarantee accuracy. Option B is wrong because higher temperature increases randomness, worsening hallucinations.

Option C is wrong because few-shot prompting can help but only if examples are accurate and relevant; it does not dynamically look up facts.

344
MCQhard

A retailer wants to generate personalized product descriptions using PaLM API. They have concerns about data privacy. What is the best practice to mitigate these concerns?

A.Train a custom model from scratch on proprietary data stored on-premise
B.Use the PaLM API directly with anonymized customer data
C.Encrypt all data in transit and at rest using customer-managed encryption keys
D.Enable data residency and use prompt engineering to avoid including personally identifiable information
AnswerD

Vertex AI allows data to stay in specific regions, and careful prompt design can generate personalized content without exposing raw PII.

Why this answer

Option D is correct because data residency ensures customer data is processed and stored within a specific geographic region, addressing regulatory compliance, while prompt engineering allows the retailer to avoid sending PII to the PaLM API entirely. This combination mitigates privacy risks without requiring custom model training or relying solely on encryption, which does not prevent the API from processing sensitive data.

Exam trap

Google Cloud often tests the misconception that encryption alone (Option C) is sufficient for data privacy, when in fact it does not prevent the API from accessing or processing the data, which is the core concern in this scenario.

How to eliminate wrong answers

Option A is wrong because training a custom model from scratch on proprietary data is cost-prohibitive, requires extensive ML expertise, and does not leverage the PaLM API's pre-trained capabilities, making it an inefficient solution for generating personalized descriptions. Option B is wrong because using the PaLM API directly with anonymized customer data still transmits data to Google's servers, and anonymization may not be irreversible or sufficient to prevent re-identification, violating privacy policies. Option C is wrong because encrypting data in transit (e.g., TLS 1.3) and at rest (e.g., AES-256) protects against unauthorized access but does not prevent the PaLM API from processing the data, meaning the retailer's privacy concerns about data exposure to the API remain unaddressed.

345
MCQhard

A healthcare startup is exploring GenAI for clinical note summarization. They have concerns about patient data privacy. Which Google Cloud approach best addresses privacy while still using powerful models?

A.Deploy open-source models on-premises
B.Use a third-party API with anonymization of patient data
C.Use Vertex AI with model customization (fine-tuning)
D.Use Vertex AI with data residency controls and no external data sharing
AnswerD

Vertex AI offers regional endpoints and commitments to not use customer data for training, addressing privacy while providing powerful models.

Why this answer

Vertex AI with data residency controls and no external data sharing ensures that patient data remains within specified geographic boundaries and is not used for model training or improvement, directly addressing healthcare privacy regulations like HIPAA. This approach leverages Google Cloud's powerful models while maintaining strict data governance, unlike options that risk data exposure or lack enterprise-grade controls.

Exam trap

The trap here is that candidates often assume fine-tuning (Option C) inherently provides privacy, but without explicit data residency and no-sharing policies, it fails to meet strict healthcare compliance requirements.

How to eliminate wrong answers

Option A is wrong because deploying open-source models on-premises, while offering data control, often lacks the advanced summarization capabilities and scalability of Vertex AI's foundation models, and still requires significant effort to ensure HIPAA compliance without Google's built-in privacy safeguards. Option B is wrong because using a third-party API, even with anonymization, introduces risks of data leakage or re-identification, and typically does not provide contractual guarantees against model training on patient data, violating many healthcare privacy policies. Option C is wrong because fine-tuning a model on Vertex AI without explicit data residency controls and no external data sharing may still allow Google to process data outside desired regions or use it for service improvements, failing to meet strict data privacy requirements.

346
MCQmedium

A team monitors their generative AI model on Vertex AI. They notice output quality declining. Which metric is most likely the root cause?

A.Input token count per request is increasing.
B.Output token count is decreasing.
C.Prediction latency is stable.
D.Error rate is less than 1%.
AnswerA

Growing inputs may push the model beyond optimal context length, reducing focus.

Why this answer

The increasing input token count suggests users are providing more context, which may exceed the model's effective context window or dilute relevant information, degrading quality. Latency and error rate are fine.

347
Multi-Selecthard

A financial institution is deploying a generative AI solution that generates investment advice. They must ensure fairness, avoid toxic outputs, and comply with regulations like GDPR. Which TWO strategies should they implement? (Choose two.)

Select 2 answers
A.Use Vertex AI Safety Attributes to filter harmful content in both input and output.
B.Set the model temperature to 0 to eliminate creativity and reduce bias.
C.Implement a human review process for any advice above a certain risk threshold.
D.Fine-tune the model exclusively on compliant financial documents.
E.Disable request logging to avoid storing sensitive data.
AnswersA, C

B is correct because it proactively blocks toxic content.

Why this answer

Options B and D are correct because using safety attributes to filter harm and implementing a human-in-the-loop for high-risk outputs are direct measures. Option A is wrong because disabling logging is against compliance. Option C is wrong because training only on compliant data is insufficient for every scenario.

Option E is wrong because decreasing temperature does not guarantee fairness.

348
MCQmedium

A content generation model for e-commerce product descriptions repeats the same phrases across multiple descriptions (e.g., 'high-quality', 'best-in-class'). The team wants more varied and engaging output. Which parameter adjustment is most appropriate?

A.Increase the frequency penalty parameter to 1.0.
B.Decrease the max output tokens to 50.
C.Increase the temperature parameter to 1.5.
D.Set the top-p value to a very small number like 0.1.
AnswerA

Frequency penalty specifically reduces the model's tendency to repeat tokens, improving lexical diversity.

Why this answer

Option B is correct because increasing the frequency penalty discourages the model from repeating tokens, directly reducing repetition. Option A is wrong because higher temperature increases randomness but may not specifically target repetition. Option C is wrong because focusing top-p only on a small set may increase repetition.

Option D is wrong because decreasing max tokens truncates output but doesn't reduce repetition.

349
Multi-Selecteasy

A data scientist is using Vertex AI's Generative AI Studio to experiment with prompt designs. Which THREE features are available in the studio?

Select 3 answers
A.Grounding configuration
B.Model parameter adjustments (temperature, top_p, etc.)
C.Automated hyperparameter tuning
D.Prompt templates
E.A/B testing of multiple prompt versions
AnswersA, B, D

Grounding can be set up in the studio.

Why this answer

Options A, B, and C are core features. Option D is wrong because automated hyperparameter tuning is not part of the studio. Option E is wrong because A/B testing requires deployment, not experimentation.

350
MCQmedium

Refer to the exhibit. A data scientist runs this command to upload a custom model to Vertex AI. What is the primary purpose of the --container-image-uri flag?

A.To indicate the model artifact location
B.To set the training container
C.To specify the base image for model serving
D.To define the prediction container
AnswerC

Defines the serving environment for predictions.

Why this answer

The --container-image-uri flag in the `gcloud ai models upload` command specifies the custom container image that Vertex AI will use to serve predictions. This is the base image for model serving, not for training, because Vertex AI uses this image to create the serving environment that hosts the model and handles prediction requests.

Exam trap

The trap here is that candidates confuse the --container-image-uri flag with the training container (Option B) because both involve custom containers, but Vertex AI separates training and serving containers, and this flag is exclusively for serving.

How to eliminate wrong answers

Option A is wrong because the model artifact location is specified via the --artifact-uri flag, not --container-image-uri. Option B is wrong because the training container is set during model training (e.g., via `gcloud ai custom-jobs`), not during model upload; --container-image-uri is for serving. Option D is wrong because while the flag does define the container used for predictions, the correct technical term in Vertex AI is 'serving container' or 'prediction container' is a misnomer; the flag sets the base image for the serving container, not the prediction container itself (which is built from this base image).

351
MCQhard

A company deployed a large language model on Vertex AI using the configuration shown in the exhibit. During peak usage, users report high latency. Which change is most likely to improve latency?

A.Remove the accelerator to simplify deployment.
B.Increase minReplicaCount to 3.
C.Switch to a GPU with more memory, such as NVIDIA_TESLA_A100.
D.Change machineType to n1-standard-4 to reduce cost.
AnswerB

More replicas ready at all times reduces cold-start and scaling latency.

Why this answer

Option A is correct because increasing the minimum number of replicas ensures that more instances are ready to serve traffic during bursts, reducing the time spent scaling up. Option B is wrong because the GPU (T4) is already suitable for inference; upgrading may not address the core latency issue. Option C is wrong because switching to a less powerful machine type (n1-standard-4) would likely increase latency.

Option D is wrong because removing the accelerator would significantly degrade performance for a large model.

352
MCQmedium

A financial services firm needs to generate synthetic data for training models while ensuring that no real customer data leaks. Which technique should they use?

A.Using the Vertex AI PII redaction service
B.Using a public foundation model without fine-tuning
C.Data masking before training
D.Differential privacy during fine-tuning
AnswerD

Differential privacy adds noise to protect individual data.

Why this answer

Differential privacy provides formal guarantees that individual data points cannot be reverse-engineered. Data masking alone may not be sufficient.

353
MCQeasy

A retail company with a large FAQ database wants to build a generative AI customer service chatbot that can answer questions accurately with up-to-date information. Which business strategy should they prioritize?

A.Use retrieval-augmented generation (RAG) with vector search on the FAQ database.
B.Train a new model from scratch using the FAQ data.
C.Fine-tune a foundational model on the entire FAQ dataset.
D.Use a general-purpose language model without any customization.
AnswerA

RAG retrieves current, relevant information from the database, providing accurate and fresh responses without model retraining.

Why this answer

Option A is correct because retrieval-augmented generation (RAG) with vector search allows the chatbot to dynamically retrieve the most relevant, up-to-date FAQ entries from a large database at inference time, grounding the generative model's responses in verified content without requiring retraining. This approach combines the flexibility of a pre-trained language model with the accuracy of real-time information retrieval, ensuring answers reflect the latest FAQ updates.

Exam trap

Google Cloud often tests the misconception that fine-tuning is the best way to inject domain knowledge, but the trap here is that fine-tuning cannot efficiently handle frequently changing data, whereas RAG provides a modular, update-friendly architecture that avoids retraining costs.

How to eliminate wrong answers

Option B is wrong because training a new model from scratch on FAQ data is computationally prohibitive, requires massive datasets and resources, and still cannot guarantee up-to-date answers without frequent retraining. Option C is wrong because fine-tuning a foundational model on the entire FAQ dataset risks catastrophic forgetting of general language capabilities and does not inherently handle dynamic updates; any FAQ change would require re-fine-tuning. Option D is wrong because a general-purpose language model without customization lacks domain-specific knowledge and cannot access the company's proprietary FAQ database, leading to hallucinated or outdated answers.

354
Multi-Selecthard

Which TWO techniques can help reduce latency for a real-time generative AI application? (Choose two.)

Select 2 answers
A.Use streaming responses to send tokens as generated.
B.Quantize the model to a lower precision.
C.Deploy more model replicas to handle load.
D.Enable prompt caching for repeated queries.
E.Batch multiple user requests together.
AnswersA, B

Streaming eliminates waiting for the full output, reducing perceived latency.

Why this answer

Streaming and model quantization directly reduce response time. Batching is for offline, and more deploy replicas can increase throughput but not necessarily reduce latency for a single request. Prompt caching can help if prompts repeat, but not generally.

355
MCQmedium

A software development team builds an internal code assistant using a generative model. The assistant writes Python functions that often contain security vulnerabilities such as SQL injection or command injection. The team wants to mitigate these vulnerabilities without adding a manual review step for every code snippet, as that would slow development. They have access to a static analysis security scanner API. Which approach best addresses the vulnerabilities while maintaining developer velocity?

A.Increase top-k sampling to generate a wider variety of code tokens.
B.After each generation, automatically run the code through the static analysis scanner, and if vulnerabilities are found, send the output back to the model for revision with the scanner's feedback.
C.Fine-tune the model on a corpus of secure code examples.
D.Add a system prompt: 'Do not generate code with security vulnerabilities.'
AnswerB

This iterative process catches and corrects security issues without manual intervention, keeping velocity high.

Why this answer

Option D is correct because post-generation automatic scanning with the security scanner catches vulnerabilities and can request regeneration with suggestions, maintaining speed. Option A is wrong because fine-tuning may not eliminate all vulnerabilities. Option B is wrong because top-k only affects output diversity, not security.

Option C is wrong because system prompts are not reliably followed for security.

356
MCQeasy

A developer uses a generative AI model with the system instruction shown. The response is correct but very brief. Which parameter adjustment could encourage more detail without losing accuracy?

A.Add 'Provide a detailed response' to the system instruction.
B.Set temperature to 0 to make output deterministic.
C.Set topK to 1 to focus on most likely tokens.
D.Increase temperature to 1.5 to encourage creativity.
AnswerA

System instructions can guide verbosity while maintaining accuracy.

Why this answer

Adding a length constraint in a system instruction (e.g., 'Provide detailed responses') is effective. Lower temperature may reduce creativity. Higher temperature could introduce errors.

Changing topK doesn't directly control length.

357
MCQmedium

A healthcare company is building a clinical decision support system using Gemini 1.5 Pro on Vertex AI. They need responses that are highly accurate and comply with medical regulations, including traceability to source documents. They have a large corpus of curated medical guidelines stored in PDFs in Cloud Storage. Their team has experience with both fine-tuning and prompt engineering. Which approach best ensures regulatory compliance and accuracy?

A.Use a combination of grounding to the medical guidelines and prompt engineering with system instructions specifying compliance requirements.
B.Use prompt engineering with system instructions and few-shot examples, but no grounding.
C.Use grounding to the medical guidelines but rely on prompt engineering only for compliance instructions.
D.Fine-tune the model on the medical guidelines corpus to internalize the knowledge.
AnswerA

Grounding ensures traceability to source documents, and prompt engineering enforces regulatory language, together meeting compliance.

Why this answer

Option D is correct because combining grounding (which ties answers to the actual guidelines) with prompt engineering (which enforces compliance requirements) provides traceability and accuracy. Option A (fine-tuning only) risks the model memorizing rather than citing sources, and updates require retraining. Option B (grounding only) may still allow the model to generate ungrounded responses if not properly constrained.

Option C (prompt engineering only) relies on the model's pre-trained knowledge, which is less reliable.

358
MCQhard

A generative AI model for chatbot responses sometimes produces toxic language. The team wants to reduce toxicity without significantly affecting the model's helpfulness. Which approach is best?

A.Increase the temperature parameter
B.Reduce the maximum output tokens
C.Fine-tune with a dataset of non-toxic responses and use RLHF
D.Apply a toxicity classifier as a post-processing filter
AnswerC

Fine-tuning combined with RLHF aligns model behavior effectively.

Why this answer

Fine-tuning with a curated dataset of non-toxic responses directly adjusts the model's weights to reduce the likelihood of generating toxic language, while RLHF (Reinforcement Learning from Human Feedback) further aligns the model with human preferences for helpfulness and safety. This combined approach addresses the root cause of toxicity in the model's behavior without the blunt trade-offs of other methods, preserving the model's utility.

Exam trap

Google Cloud often tests the misconception that post-processing filters (like toxicity classifiers) are sufficient for safety, when in fact they fail to address the model's learned behavior and can degrade helpfulness due to false positives, making fine-tuning with RLHF the superior alignment technique.

How to eliminate wrong answers

Option A is wrong because increasing the temperature parameter increases randomness in token selection, which can actually amplify the probability of generating toxic or nonsensical outputs, not reduce them. Option B is wrong because reducing the maximum output tokens limits response length but does not influence the content or safety of the generated tokens, leaving toxicity unchanged. Option D is wrong because applying a toxicity classifier as a post-processing filter only masks toxic outputs after generation, wasting computational resources and potentially blocking helpful responses that contain false-positive flagged terms, without fixing the underlying model behavior.

359
MCQeasy

A startup with limited budget wants to quickly test a generative AI use case for personalized email marketing. Which approach minimizes time-to-market and cost?

A.Hire a team of AI researchers to build a solution.
B.Develop a custom model from scratch.
C.Fine-tune a large open-source model on internal data.
D.Use a managed API like the PaLM API with prompt engineering.
AnswerD

Quick to implement, pay-per-use, no infrastructure management.

Why this answer

Option D is correct because using a managed API like the PaLM API with prompt engineering eliminates the need for infrastructure setup, model training, and data preparation. This approach leverages a pre-trained model via a simple REST API call, allowing the startup to iterate on prompts and achieve personalized email content in hours rather than weeks, minimizing both time-to-market and cost.

Exam trap

Google Cloud often tests the misconception that fine-tuning (Option C) is always the fastest and cheapest path for customization, but the trap here is that fine-tuning still requires significant compute and data preparation, whereas prompt engineering on a managed API is truly zero-infrastructure and pay-per-use, making it the optimal choice for a quick, low-cost test.

How to eliminate wrong answers

Option A is wrong because hiring a team of AI researchers is expensive and time-consuming, requiring salaries, compute resources, and months of development, which contradicts the limited budget and quick testing goal. Option B is wrong because developing a custom model from scratch demands vast amounts of labeled data, significant GPU/TPU compute, and deep expertise, making it cost-prohibitive and slow for a rapid proof-of-concept. Option C is wrong because fine-tuning a large open-source model still requires substantial compute for training (e.g., GPU hours for LoRA or full fine-tuning), data curation, and deployment overhead, which exceeds the minimal cost and speed constraints of a quick test.

360
MCQmedium

A healthcare company wants to use generative AI to summarize patient records but must comply with HIPAA. Which deployment option should they choose?

A.Use Vertex AI on Google Cloud with data residency
B.Use Google Workspace AI
C.Use an on-premises deployment of open-source model
D.Use a third-party API
AnswerC

Full control over data and compliance.

Why this answer

Option C is correct because an on-premises deployment of an open-source model ensures that all patient data remains within the organization's controlled infrastructure, never leaving the local network. This eliminates any risk of data transmission to external cloud services, which is critical for HIPAA compliance where protected health information (PHI) must be safeguarded against unauthorized access or breaches. On-premises solutions allow the organization to implement its own security controls, encryption, and audit trails without relying on a third-party's compliance posture.

Exam trap

The trap here is that candidates assume cloud providers like Google Cloud or AWS are automatically HIPAA-compliant with data residency, but they overlook the shared responsibility model and the need for a BAA, which still exposes data to the provider's infrastructure and potential third-party risks, making on-premises the only option that guarantees full data control.

How to eliminate wrong answers

Option A is wrong because Vertex AI on Google Cloud, even with data residency, still involves data processing on Google's infrastructure, which requires a Business Associate Agreement (BAA) and may not satisfy all HIPAA requirements if the organization cannot fully control data access or auditing. Option B is wrong because Google Workspace AI is a SaaS offering that processes data on Google's servers, and while it can be HIPAA-compliant with a BAA, it introduces shared responsibility and potential data exposure risks that an on-premises solution avoids. Option D is wrong because using a third-party API means sending PHI to an external service, which requires the third-party to be HIPAA-compliant and sign a BAA, but it still exposes data to network transmission and external processing, increasing the attack surface and compliance burden.

361
MCQhard

A global bank wants to deploy a generative AI assistant for employees across multiple European countries, each with strict data residency laws. Which deployment strategy is most compliant?

A.Deploy separate model instances in each country's cloud region.
B.Use a federated learning approach where data stays on-premises.
C.Deploy a single model in a US region and use data masking.
D.Use a third-party API that processes data outside Europe.
AnswerA

Ensures data never leaves the country, meeting local compliance requirements.

Why this answer

Option A is correct because deploying separate model instances in each country's cloud region ensures that data never crosses national borders, directly complying with strict data residency laws like the GDPR's data localization requirements. This strategy uses regional cloud infrastructure (e.g., AWS eu-central-1, Azure westeurope) to keep both training and inference data within the specific jurisdiction, avoiding any cross-border data transfer.

Exam trap

Google Cloud often tests the misconception that data masking or anonymization alone satisfies data residency laws, but the trap here is that data residency requires the data to physically remain within the jurisdiction, not just be obfuscated.

How to eliminate wrong answers

Option B is wrong because federated learning only keeps training data on-premises, but the model parameters or gradients must still be exchanged with a central server, which can violate data residency if that server is outside the country. Option C is wrong because deploying a single model in a US region and using data masking does not prevent the underlying data from being processed or stored in the US, which violates EU data residency laws like GDPR. Option D is wrong because using a third-party API that processes data outside Europe directly violates data residency requirements, as the data physically leaves the European Economic Area (EEA) without adequate safeguards.

362
Multi-Selecthard

Which THREE factors should you consider when selecting a foundation model from Model Garden? (Choose three.)

Select 3 answers
A.Number of model versions
B.The color of the model card
C.Model size
D.Model accuracy on benchmarks
E.Model license
AnswersC, D, E

Size impacts cost, latency, and deployment requirements.

Why this answer

Options A, B, and C are correct. Model license determines usage rights, accuracy benchmarks show performance, and model size affects cost and latency. Option D is not a primary selection factor, and option E is irrelevant.

363
MCQeasy

A business wants to build a generative AI application but has limited data science resources. What is the recommended path?

A.Use Vertex AI's AutoML and pre-built APIs to accelerate development
B.Hire a team of ML engineers to develop an in-house solution
C.Purchase a third-party generative AI SaaS product off-the-shelf
D.Build a custom model from scratch using TensorFlow
AnswerA

AutoML abstracts away model building complexity, and APIs provide ready-to-use functionality.

Why this answer

Option C is correct because Vertex AI's AutoML and pre-built APIs lower the barrier to entry for teams without deep machine learning expertise. Option A (building from scratch) requires extensive expertise. Option B (hiring a team) is costly and time-consuming.

Option D (buying a SaaS product) may not offer customization.

364
Multi-Selecthard

Which THREE are best practices for designing prompts for a generative AI model?

Select 3 answers
A.Provide few-shot examples for complex tasks
B.Include specific and clear instructions
C.Break the task into smaller steps
D.Use negative prompts to avoid undesired outputs
E.Always set temperature to 1.0 for creativity
AnswersA, B, C

Correct: Examples guide the model toward desired outputs.

Why this answer

Clear instructions, few-shot examples, and task decomposition improve the model's understanding and output quality. Negative prompting is less reliable, and fixed temperature is not a universal best practice.

365
Multi-Selectmedium

Which TWO techniques can help improve the factual accuracy of a language model's outputs? (Choose two.)

Select 2 answers
A.Decrease the max output tokens.
B.Increase the temperature parameter.
C.Fine-tune on a domain-specific curated dataset.
D.Implement retrieval-augmented generation (RAG).
E.Use top-k random sampling.
AnswersC, D

Fine-tuning adapts the model to domain facts.

Why this answer

Fine-tuning on a domain-specific curated dataset (C) directly adjusts the model's weights using high-quality, verified examples, teaching it to produce factually correct outputs for that domain. This reduces hallucinations by grounding the model in accurate, relevant data rather than relying solely on its pre-training distribution.

Exam trap

Google Cloud often tests the misconception that adjusting decoding parameters (like temperature, top-k, or max tokens) can improve factual accuracy, when in reality these only control output style, length, or randomness, not the correctness of the underlying information.

366
MCQeasy

A data scientist is using a large language model to generate product descriptions. The descriptions are often too verbose. Which parameter adjustment is most appropriate?

A.Decrease the top-k value.
B.Increase the max output tokens.
C.Decrease the temperature.
D.Increase the frequency penalty.
AnswerD

Frequency penalty reduces repetitive phrases, encouraging conciseness.

Why this answer

Option A is correct because increasing the frequency penalty discourages repetition and can make output more concise. Option B (decreasing temperature) reduces randomness but not verbosity. Option C (decreasing top-k) limits word choice but not length.

Option D (increasing max tokens) makes descriptions longer.

367
MCQhard

A model generates responses that frequently repeat phrases or words. Which parameter adjustment is most likely to fix this?

A.Increase top_k
B.Increase temperature
C.Increase repetition penalty
D.Increase max output tokens
AnswerC

Correct: Repetition penalty specifically reduces the likelihood of repeating tokens.

Why this answer

Increasing the repetition penalty directly discourages the model from selecting tokens that have already appeared in the generated sequence, thereby reducing repetitive phrases or words. This parameter works by subtracting a fixed penalty from the logits of previously generated tokens before applying the softmax function, making them less likely to be chosen again.

Exam trap

The trap here is that candidates often confuse repetition penalty with diversity-promoting parameters like temperature or top_k, mistakenly believing that increasing randomness or narrowing token selection will fix repetition, when in fact those adjustments can worsen the problem.

How to eliminate wrong answers

Option A is wrong because increasing top_k limits the sampling pool to the k most likely next tokens, which can actually increase repetition by narrowing the diversity of choices. Option B is wrong because increasing temperature flattens the probability distribution, making all tokens more equally likely, which can lead to more random and potentially more repetitive outputs, not less. Option D is wrong because increasing max output tokens only extends the length of the generated response; it does not address the underlying cause of repetition and may even exacerbate it by allowing more opportunities for the model to loop on repeated phrases.

368
MCQmedium

Refer to the exhibit. A sudden surge of traffic reaches 15,000 requests per second, but the endpoint can only handle 1,000 req/s per replica. What will happen to new requests?

A.They will be processed, and replicas will exceed maxReplicaCount.
B.They will be redirected to a different model.
C.They will receive HTTP 429 (Too Many Requests) errors.
D.They will be queued until capacity becomes available.
AnswerC

Once max replicas are reached, new requests get a 429 status code.

Why this answer

Option C is correct because when a surge of 15,000 requests per second hits an endpoint configured with a maxReplicaCount (e.g., 10 replicas at 1,000 req/s each = 10,000 req/s capacity), any excess requests beyond that capacity are rejected with an HTTP 429 (Too Many Requests) status code. This is standard behavior in autoscaling systems: once the replica count reaches its maximum limit, the service cannot scale further, and new requests are throttled to prevent overload.

Exam trap

The trap here is that candidates assume autoscaling can handle any traffic surge indefinitely, ignoring the hard limit of maxReplicaCount, and thus incorrectly choose Option A or D, failing to recognize that HTTP 429 is the standard throttling mechanism when capacity is exhausted.

How to eliminate wrong answers

Option A is wrong because the maxReplicaCount is a hard upper limit; replicas cannot exceed this configured value, so new requests are not processed beyond that capacity. Option B is wrong because traffic redirection to a different model is not a standard behavior for capacity overflow; it would require explicit routing rules or a load balancer configured for failover, which is not implied in the scenario. Option D is wrong because queuing is not the default behavior for HTTP-based endpoints in this context; while some systems support request queuing (e.g., with message brokers), the exhibit describes a direct endpoint handling, and HTTP 429 is the standard response for rate limiting per RFC 6585.

369
MCQeasy

Refer to the exhibit. A developer runs this command but forgets to specify the model name. What will happen?

A.The command will fail with an error
B.The command will prompt for a name
C.The model will be uploaded with a default name
D.The command will succeed but the model will be unlisted
AnswerA

Missing required --display-name causes an error.

Why this answer

In the context of the `gcloud ai models upload` command (or similar model deployment commands in Vertex AI), the model name is a required positional argument. If omitted, the CLI will fail with an error because it cannot proceed without a unique identifier to register the model in the model registry. The command does not default to any name or prompt interactively; it strictly validates required parameters before execution.

Exam trap

Google Cloud often tests the misconception that cloud CLI tools will either prompt for missing required parameters or apply a sensible default, when in reality they fail fast with a clear error to enforce explicit configuration.

How to eliminate wrong answers

Option B is wrong because the command does not prompt for a name; it expects the name as a positional argument in the initial command string, and if missing, it immediately returns a usage error. Option C is wrong because there is no default name mechanism; model names must be explicitly provided to avoid collisions and ensure traceability in the registry. Option D is wrong because the command will not succeed at all; it fails before any upload occurs, so no model is created in any state (listed or unlisted).

370
MCQeasy

You want to use a Google foundation model to generate text summaries of news articles. Which Vertex AI service should you use?

A.Vertex AI Prediction
B.Vertex AI Model Registry
C.Vertex AI Generative AI Studio
D.Vertex AI Feature Store
AnswerC

Generative AI Studio allows testing and using foundation models like text-bison@002.

Why this answer

Option B is correct because Vertex AI Generative AI Studio provides access to foundation models for text generation. Option A is wrong because Vertex AI Prediction is for deploying custom models. Option C is wrong because Model Registry manages model versions.

Option D is wrong because Feature Store is for ML features.

371
MCQmedium

A data scientist notices that a Gemini model generates inconsistent responses to similar prompts. What is the likely cause?

A.Model is not fine-tuned enough
B.The prompt is too short
C.The temperature setting is too low
D.The top_p or temperature parameters are set too high causing randomness
AnswerD

High temperature or top_p increases randomness and variability.

Why this answer

Option D is correct because high temperature (e.g., >1.0) or high top_p (e.g., >0.9) increases the randomness of token sampling, causing the model to select less probable tokens. This directly leads to inconsistent responses for similar prompts, as the model's output distribution becomes more uniform and less deterministic.

Exam trap

Google Cloud often tests the misconception that fine-tuning or prompt length is the primary cause of output inconsistency, when in fact the sampling parameters (temperature and top_p) directly control randomness and are the most common culprit.

How to eliminate wrong answers

Option A is wrong because fine-tuning adjusts the model's weights for a specific task, but it does not control the randomness of token generation; even a fully fine-tuned model will produce inconsistent outputs if sampling parameters are set too high. Option B is wrong because prompt length affects context and specificity, not the inherent randomness of the generation process; a short prompt can still yield consistent responses if temperature and top_p are low. Option C is wrong because a low temperature setting (e.g., 0.1) actually reduces randomness, making outputs more deterministic and consistent, not inconsistent.

372
Multi-Selectmedium

Which THREE of the following are common techniques to reduce harmful biases in generative AI models? (Choose three.)

Select 3 answers
A.Use reinforcement learning from human feedback (RLHF) with a reward model that penalizes biased or unfair outputs.
B.Curate diverse and balanced training datasets that overrepresent underrepresented groups.
C.Decrease the model's temperature parameter to make outputs more deterministic.
D.Apply adversarial training to remove protected attribute information from hidden representations.
E.Conduct a legal review of all generated outputs before release.
AnswersA, B, D

RLHF can shape model behavior to avoid biased generations.

Why this answer

A is correct because RLHF uses a reward model trained on human preferences to score model outputs, and explicitly penalizing biased or unfair outputs during fine-tuning directly reduces harmful biases. This technique aligns the model's behavior with human values by optimizing against a learned reward signal that captures bias-related concerns.

Exam trap

Google Cloud often tests the distinction between hyperparameter tuning (like temperature) and actual bias mitigation techniques, so candidates mistakenly think lowering temperature reduces bias when it only affects output randomness.

373
MCQmedium

A company uses Vertex AI PaLM for code generation. The code often contains security vulnerabilities. Which improvement should be applied?

A.Set top_k to 1
B.Include a security-focused system instruction
C.Use Codey model instead
D.Increase temperature to 0.8
AnswerB

Guides the model to prioritize security practices.

Why this answer

Including a security-focused system instruction (e.g., 'Write secure code that avoids SQL injection') directly guides the model to produce safer output. Increasing temperature worsens security, setting top_k to 1 reduces diversity but doesn't address security, and using Codey alone without instructions may not suffice.

374
MCQhard

Refer to the exhibit. A developer receives this error when trying to call a model for prediction. What is the most likely cause?

A.The project has exceeded its prediction quota.
B.The developer's service account lacks the required IAM role.
C.The model version has been deprecated.
D.The model is not deployed on an endpoint.
AnswerB

The 403 error is a standard permission denied response from IAM.

Why this answer

The error when calling a model for prediction most likely stems from the developer's service account lacking the required IAM role. In Google Cloud AI Platform, the 'aiplatform.user' or 'aiplatform.predictor' role is necessary to invoke prediction endpoints; without it, the API returns a permission-denied error. This is a common misconfiguration when service accounts are created without explicit roles attached.

Exam trap

Google Cloud often tests the misconception that quota limits are the default cause of prediction errors, but the trap here is that permission-denied errors are more frequently due to missing IAM roles rather than quota exhaustion, especially in multi-service-account environments.

How to eliminate wrong answers

Option A is wrong because exceeding the prediction quota would return a '429 RESOURCE_EXHAUSTED' or 'Quota exceeded' error, not a generic permission-denied error. Option C is wrong because a deprecated model version would still be accessible for predictions until it is deleted, and the error would typically indicate 'Model version not found' rather than an authorization failure. Option D is wrong because if the model is not deployed on an endpoint, the error would be 'Model not deployed' or 'Endpoint not found', not a permission error.

375
MCQhard

A large enterprise is deploying a multi-modal generative AI application that processes customer support emails (text) and attached screenshots (images). They need to run inference on over 10,000 requests per minute with strict latency requirements (p99 < 500ms). They have already selected Gemini 1.5 Pro as the model and deployed it on Vertex AI using a GPU-based endpoint with autoscaling. During testing, they observe that the p99 latency spikes to over 2 seconds during peak traffic. The application is stateless and requests are independent. The team has access to Cloud Observability and can modify the deployment configuration. Which course of action should the team take to meet the latency requirements while minimizing cost?

A.Increase the maximum number of replicas in the autoscaling configuration to handle spikes
B.Enable Vertex AI Model Caching and deploy the endpoint on a managed instance group with larger GPU nodes (e.g., A100 40GB)
C.Use preemptible VMs for the endpoint to get priority scheduling
D.Switch to a CPU-based ml.c5 instance to reduce GPU contention
AnswerB

Caching reduces computation for repeated prompts, and larger GPUs accelerate inference.

Why this answer

Option C is correct because enabling model caching reduces redundant computation for repeated prompts, and using dedicated VMs (MIGs) with higher GPU count per replica reduces per-request latency. Option A is wrong because adding more replicas may help throughput but not per-request latency. Option B is wrong because CPU-based serving would be much slower for Gemini.

Option D is wrong because preemptible VMs are not reliable for production latency.

Page 4

Page 5 of 7

Page 6

All pages