CCNA ML Models Ops Questions

75 of 191 questions · Page 2/3 · ML Models Ops topic · Answers revealed

76
MCQhard

A data scientist developed a model using custom training on Vertex AI. They want to automate the entire training-to-deployment process. Which service should they use?

A.Cloud Composer
B.Vertex AI Pipelines
C.Cloud Build
D.Cloud Functions
AnswerB

Vertex AI Pipelines is purpose-built for ML pipeline orchestration.

Why this answer

Vertex AI Pipelines is the correct choice because it provides a fully managed, serverless orchestration service specifically designed to automate ML workflows, including custom training, hyperparameter tuning, evaluation, and deployment. It integrates natively with Vertex AI services and supports Kubeflow Pipelines SDK or TFX for defining reproducible, end-to-end pipelines, making it the ideal solution for automating the entire training-to-deployment process.

Exam trap

The trap here is that candidates often confuse general-purpose orchestration (Cloud Composer) with ML-specific pipeline orchestration (Vertex AI Pipelines), overlooking that Vertex AI Pipelines provides built-in ML artifact tracking and native integration with Vertex AI training and prediction services.

How to eliminate wrong answers

Option A is wrong because Cloud Composer is a workflow orchestration service based on Apache Airflow, which is more general-purpose and requires custom operators or hooks to interact with Vertex AI, adding unnecessary complexity and not providing native ML pipeline capabilities. Option C is wrong because Cloud Build is a CI/CD service focused on building, testing, and deploying software artifacts (e.g., containers), not on orchestrating ML training workflows or managing model deployment steps like evaluation and versioning. Option D is wrong because Cloud Functions is a serverless compute service for event-driven, short-lived functions, which lacks the state management, sequencing, and artifact tracking needed for multi-step ML pipelines.

77
MCQmedium

A team uses Vertex AI Pipelines to automate retraining of a model every month. The pipeline includes data preprocessing, training, and deployment steps. After a recent update, the pipeline fails intermittently with a timeout error during the deployment step. What is the most likely cause?

A.The service account used by the pipeline lacks permissions to deploy the model
B.The trained model size has increased due to more data, causing the deployment step to time out
C.The pipeline is configured to run steps in parallel, leading to resource contention
D.BigQuery query quotas are being exceeded during data preprocessing
AnswerB

Larger models take longer to upload and deploy, potentially exceeding timeout limits.

Why this answer

Option B is correct because a model that has grown from training on more data takes longer to upload and deploy, so the deployment step can exceed its timeout on some runs. Option A (insufficient service account permissions) would cause a persistent permission error, not an intermittent timeout. Option D (BigQuery quotas) would affect the preprocessing step, not deployment.

Option C (parallel step execution) is a fixed property of the pipeline configuration, so it would cause consistent failures rather than intermittent ones.

78
MCQeasy

A team has trained a model using AutoML Tables. They want to deploy it for batch predictions on a schedule. What is the simplest approach?

A.Write a Cloud Function triggered by Cloud Scheduler
B.Export model to Cloud Storage and use Dataflow
C.Deploy to App Engine
D.Use Vertex AI Batch Prediction with a scheduled pipeline
AnswerD

Vertex AI Batch Prediction is the native, simplest way to perform batch predictions on a schedule.

Why this answer

Vertex AI Batch Prediction is the simplest approach because it is a managed service that directly supports batch predictions on AutoML Tables models without requiring additional infrastructure. By wrapping it in a scheduled Vertex AI pipeline, you can automate the entire workflow—triggering predictions on a schedule, handling input/output to Cloud Storage, and managing compute resources—all within the Vertex AI ecosystem, minimizing operational overhead.

Exam trap

Google Cloud often tests the misconception that you must export an AutoML model to use it outside Vertex AI, but the simplest path is to use Vertex AI's native batch prediction service, which avoids the overhead of custom infrastructure like Dataflow or Cloud Functions.

How to eliminate wrong answers

Option A is wrong because Cloud Functions are designed for lightweight, event-driven tasks and lack native support for AutoML Tables model serving; you would need to manually load the model and handle scaling, which adds complexity and is not the simplest approach. Option B is wrong because exporting the model to Cloud Storage and using Dataflow introduces unnecessary steps—Dataflow requires writing a custom pipeline to load the exported model and perform predictions, whereas Vertex AI Batch Prediction handles this natively. Option C is wrong because App Engine is a platform for hosting web applications, not designed for batch prediction workloads; it would require building a custom prediction service and managing scaling, which is more complex than using Vertex AI's built-in batch prediction.

79
MCQmedium

A data engineer deploys a TensorFlow model on Vertex AI using a custom container. After deployment, online prediction requests sometimes fail with a 500 error and the message 'Out of memory'. The model requires significant memory during inference. Which action should the engineer take to resolve this issue?

A.Reduce the batch size of prediction requests sent to the endpoint.
B.Increase the memory limit in the Vertex AI endpoint configuration.
C.Optimize the model by quantizing weights to reduce model size.
D.Use a machine type with higher CPU performance.
AnswerB

Configuring a higher memory machine type or increasing the memory limit in the container spec provides the needed resources.

Why this answer

Option B is correct because Vertex AI endpoints allow you to configure a machine type with a specific memory limit. When a custom container runs out of memory during inference, increasing the memory allocation (e.g., by selecting a machine type with more RAM, such as n1-highmem-8) directly addresses the 'Out of memory' error. This ensures the container has sufficient resources to handle the model's inference workload without crashing.

Exam trap

Google Cloud often tests the misconception that reducing batch size or optimizing the model (quantization) is the first step to fix runtime OOM errors, when in fact the immediate operational fix is to allocate more memory to the deployment.

How to eliminate wrong answers

Option A is wrong because reducing the batch size may reduce per-request memory usage, but the error occurs during inference of a single request or a small batch; the root cause is insufficient memory for the model itself, not request batching. Option C is wrong because quantizing weights reduces model size on disk and may lower memory footprint, but it is a model optimization technique that requires retraining or conversion and does not immediately resolve a runtime OOM error in a deployed container. Option D is wrong because higher CPU performance (e.g., more vCPUs) does not increase available memory; the OOM error is a memory issue, not a CPU bottleneck, and Vertex AI machine types with higher CPU often have the same or lower memory ratios.

80
MCQeasy

Your company deploys a classification model on Vertex AI for online predictions. The model is an XGBoost model trained on tabular data with 500 features. The endpoint uses a single n1-standard-4 node. After deployment, users report that predictions take 8-10 seconds on average, while the required SLA is under 2 seconds. You have already verified that the model is not large (under 100 MB) and the input data size is small. The endpoint does not scale automatically. Which action should you take to reduce latency to meet the SLA?

A.Change the machine type to n1-highcpu-4 to prioritize compute over memory.
B.Reduce the number of features by half.
C.Switch to a custom container that preloads the model into memory.
D.Enable autoscaling by setting min replicas to 2 and max replicas to 5.
AnswerD

Adding replicas offloads requests, reducing wait time and average latency.

Why this answer

Option D is correct because the current single node is overloaded; autoscaling distributes traffic across multiple nodes, which reduces the time each request spends waiting. Option A (CPU-optimized machine type) may not help if the bottleneck is not CPU. Option C (preloading the model) is already the default behavior on Vertex AI.

Option B (feature reduction) could degrade model accuracy and is not necessary.
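The effect of adding replicas on waiting time can be sketched with a toy queueing calculation. This is only an illustration: it assumes each replica behaves like an independent M/M/1 queue, that traffic is split evenly, and that the workload is 1.9 requests/s at 0.5 s of compute per request. None of those figures come from the question.

```python
def mean_response_time(arrival_rate: float, service_time: float, replicas: int = 1) -> float:
    """Mean time in system (waiting + service) per replica under an M/M/1 model.

    Assumes traffic is split evenly and each replica is an independent queue.
    """
    per_replica_rate = arrival_rate / replicas
    utilization = per_replica_rate * service_time
    if utilization >= 1.0:
        raise ValueError("replica is saturated; the queue grows without bound")
    return service_time / (1.0 - utilization)


# Hypothetical workload: 1.9 requests/s, 0.5 s of compute per request.
single = mean_response_time(1.9, 0.5, replicas=1)
double = mean_response_time(1.9, 0.5, replicas=2)
print(f"1 replica: {single:.1f} s, 2 replicas: {double:.2f} s")
```

Under these assumed numbers the single replica runs at 95% utilization and most of the latency is queueing, which is why a second replica helps far more than a faster CPU would.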

81
MCQeasy

A company has deployed a classification model on Vertex AI. They want to detect data drift in real-time for the model's input features. Which service should they use?

A.Cloud Monitoring
B.Cloud Data Loss Prevention
C.Cloud Logging
D.Vertex AI Model Monitoring
AnswerD

Vertex AI Model Monitoring continuously monitors feature distributions and alerts on drift.

Why this answer

Vertex AI Model Monitoring is the correct service because it is specifically designed to detect data drift and feature skew for models deployed on Vertex AI. It continuously monitors input features against a baseline distribution and alerts when drift exceeds a configured threshold, enabling real-time detection without requiring custom code.

Exam trap

The trap here is that candidates confuse general monitoring (Cloud Monitoring) with ML-specific drift detection, assuming any monitoring tool can detect data drift, when in fact Vertex AI Model Monitoring is the only service that performs statistical distribution comparison for model inputs.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring is a general-purpose observability service for metrics, uptime checks, and dashboards; it lacks built-in statistical drift detection for ML model features. Option B is wrong because Cloud Data Loss Prevention (DLP) is used for inspecting, classifying, and masking sensitive data, not for monitoring feature distributions or drift. Option C is wrong because Cloud Logging captures and stores log entries from services but does not perform statistical analysis or drift detection on model inputs.
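To make "statistical distribution comparison" concrete, here is a minimal sketch that scores drift for one categorical feature using Jensen-Shannon divergence. It is an illustration of the idea, not a description of how the managed service is implemented; the feature values and the `DRIFT_THRESHOLD` value are hypothetical.

```python
import math
from collections import Counter


def distribution(values):
    """Turn raw categorical values into a {category: probability} map."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    keys = set(p) | set(q)
    # Mixture distribution: the midpoint of p and q.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m), skipping categories where a has zero mass.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical baseline (training) and serving samples for a "channel" feature.
baseline = distribution(["web"] * 70 + ["mobile"] * 30)
serving = distribution(["web"] * 30 + ["mobile"] * 70)
DRIFT_THRESHOLD = 0.1  # hypothetical alerting threshold

score = js_divergence(baseline, serving)
drifted = score > DRIFT_THRESHOLD
```

A general-purpose metrics or logging tool would need this kind of comparison written and maintained by hand, which is the gap the ML-specific service fills.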

82
Multi-Selecteasy

A data engineering team is operationalizing a machine learning model for real-time inference. They need to monitor the model's performance in production. Which THREE types of monitoring should they implement? (Choose three.)

Select 3 answers
A.Model accuracy decay
B.Model re-training frequency
C.Training pipeline failures
D.Prediction latency
E.Input feature drift
AnswersA, D, E

Measures decline in prediction quality over time.

Why this answer

Model accuracy decay (A) is critical because in production, the model's predictive performance can degrade over time due to changes in the underlying data distribution or business logic. Monitoring accuracy decay allows the team to detect when the model no longer meets its performance baseline, triggering retraining or rollback. This is a standard practice in MLOps for maintaining model reliability.

Exam trap

Google Cloud often tests the distinction between monitoring the model's operational health (latency, drift, accuracy) versus managing the training lifecycle (retraining frequency, pipeline failures), leading candidates to confuse infrastructure monitoring with model performance monitoring.

83
MCQeasy

A company wants to monitor the performance of a deployed model in production. Which metric indicates that the model's predictions are degrading?

A.Increase in prediction error rate
B.Increase in prediction latency
C.Decrease in throughput
D.Increase in number of requests
AnswerA

Error rate reflects model accuracy.

Why this answer

An increase in prediction error rate directly indicates that the model's outputs are deviating from the expected or ground-truth values, signaling degradation in predictive performance. This metric captures the core concept of model drift, where the statistical properties of the input data or the relationship between features and labels change over time, leading to less accurate predictions. In production ML monitoring, tracking error rate (e.g., classification accuracy, RMSE) is the primary method to detect when a model needs retraining or updating.

Exam trap

Google Cloud often tests the distinction between operational metrics (latency, throughput) and model performance metrics (error rate), trapping candidates who confuse system health with prediction quality.

How to eliminate wrong answers

Option B is wrong because prediction latency measures the time taken for the model to return a prediction, which reflects infrastructure or model complexity issues, not the accuracy or degradation of the predictions themselves. Option C is wrong because throughput (requests per second) is a measure of system capacity and scalability, not a direct indicator of prediction quality or model drift. Option D is wrong because an increase in the number of requests indicates higher demand or usage, which does not imply that the model's predictions are becoming less accurate or degrading.
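As a rough sketch of tracking prediction error rate in production, the class below keeps a rolling window of labelled outcomes and compares the current rate with a baseline. The window size, the sample predictions, and `BASELINE_ERROR_RATE` are all made up for illustration, and it assumes ground-truth labels eventually arrive for each prediction.

```python
from collections import deque


class RollingErrorRate:
    """Track the error rate over the most recent `window` labelled predictions."""

    def __init__(self, window: int):
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual) -> None:
        # Store True for a wrong prediction, False for a correct one.
        self.outcomes.append(predicted != actual)

    def rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0


monitor = RollingErrorRate(window=4)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)

BASELINE_ERROR_RATE = 0.25  # hypothetical rate measured at deployment time
degraded = monitor.rate() > BASELINE_ERROR_RATE
```

Latency, throughput, and request counts never enter this calculation, which is the distinction the question is testing.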

84
MCQeasy

A company deploys a scikit-learn model on Vertex AI for online predictions. The model is packaged in a custom container with all dependencies. Users report high latency (over 5 seconds) for predictions. The model size is 2 GB. What is the most likely cause of the high latency?

A.Using online predictions instead of batch prediction
B.Not enabling GPU acceleration
C.Using a custom container with a large unoptimized model
D.Using a small machine type (e.g., n1-standard-2)
AnswerC

Large models in custom containers cause slow loading and inference; using a prebuilt container or optimizing the model would reduce latency.

Why this answer

Option C is correct because a 2 GB model served from a custom container without optimization (e.g., quantization, pruning, or ONNX conversion) is slow both to load and to evaluate. Each container instance must load the model into memory before it can serve, so a large, unoptimized model lengthens cold starts, and any serving code that reloads the model per request pays that cost on every call. Combined with slower inference, this can easily push latency past 5 seconds.

Exam trap

Google Cloud often tests the misconception that latency is always due to compute resources (CPU/GPU) or prediction type, when in fact the model's size and lack of optimization are the primary culprits in custom container deployments.

How to eliminate wrong answers

Option A is wrong because online predictions are designed for low-latency, real-time inference, and switching to batch prediction would not reduce latency for individual requests—batch prediction is for high-throughput, asynchronous jobs. Option B is wrong because GPU acceleration primarily speeds up matrix operations during inference, but the main bottleneck here is model size and loading overhead, not compute speed; a 2 GB model on CPU can still be fast if optimized. Option D is wrong because while a small machine type (e.g., n1-standard-2 with 2 vCPUs and 7.5 GB RAM) could contribute to latency, the most likely cause is the unoptimized model size; even a larger machine would still suffer from the same loading and inference delays if the model is not optimized.

85
Multi-Selecthard

A company is migrating ML workflows to Vertex AI Pipelines. They want to ensure best practices for pipeline reproducibility and debugging. Which THREE actions should they take? (Choose three.)

Select 3 answers
A.Set a random seed for all training components
B.Store all artifacts in Cloud Storage with versioned prefixes
C.Pin all dependencies in training images
D.Use dynamic pipeline parameters for each run
E.Use conditional execution based on previous component outputs
AnswersA, B, C

Random seeds ensure deterministic training results.

Why this answer

Setting a random seed for all training components ensures deterministic behavior, meaning that the same inputs will produce the same outputs across multiple runs. This is critical for debugging and reproducibility in Vertex AI Pipelines, as it eliminates stochastic variability that can mask bugs or make results irreproducible. Without a fixed seed, even identical code and data can yield different model weights or metrics, complicating root cause analysis.
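The effect of a fixed seed can be shown with a toy stand-in for a training component. This sketch uses only Python's standard `random` module; real frameworks have their own seeding calls, and `train_toy_model` is a hypothetical name.

```python
import random


def train_toy_model(seed: int, n_weights: int = 3) -> list[float]:
    """Stand-in for a training step: the 'weights' depend only on the seed."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    return [rng.gauss(0.0, 1.0) for _ in range(n_weights)]


run_a = train_toy_model(seed=42)
run_b = train_toy_model(seed=42)
run_c = train_toy_model(seed=7)

same_seed_matches = run_a == run_b       # identical seed, identical output
different_seed_matches = run_a == run_c  # different seed, different output
```

Passing the seed as an explicit component parameter, rather than hard-coding it, also records it with the run, which helps when replaying a failed pipeline.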

Exam trap

Google Cloud often tests the distinction between features that improve workflow flexibility (like dynamic parameters or conditional execution) and those that enforce reproducibility and debuggability, leading candidates to confuse operational convenience with best practices for deterministic pipelines.

86
MCQeasy

Refer to the exhibit. A Cloud Build step fails when pushing a Docker image to Artifact Registry. What is the missing IAM role for the Cloud Build service account?

A.roles/artifactregistry.writer
B.roles/containerregistry.admin
C.roles/storage.objectCreator
D.roles/cloudbuild.builds.editor
AnswerA

This role allows pushing images to Artifact Registry.

Why this answer

The Cloud Build service account needs the `roles/artifactregistry.writer` role to push Docker images to Artifact Registry. This role grants the necessary permissions to upload artifacts, including images, to the registry. Without it, the build step fails with an authorization error.

Exam trap

Google Cloud often tests the distinction between Artifact Registry and Container Registry roles, and the trap here is that candidates confuse `roles/containerregistry.admin` (for Container Registry) with the correct Artifact Registry role, or assume that Cloud Build's own editor role includes artifact push permissions.

How to eliminate wrong answers

Option B is wrong because `roles/containerregistry.admin` is for Container Registry (gcr.io), not Artifact Registry, and the question specifies Artifact Registry. Option C is wrong because `roles/storage.objectCreator` applies to Cloud Storage buckets, not Artifact Registry repositories. Option D is wrong because `roles/cloudbuild.builds.editor` allows managing Cloud Build builds but does not grant permissions to push artifacts to Artifact Registry.

87
MCQeasy

A data scientist trains a TensorFlow model using Vertex AI Training and wants to deploy it for online prediction. Which Vertex AI resource should the data scientist use to create an endpoint for serving predictions?

A.Vertex AI Batch Prediction Job
B.Vertex AI Endpoint
C.Vertex AI Feature Store
D.Vertex AI Model Registry
AnswerB

An endpoint is required to deploy a model for online predictions.

Why this answer

Option B is correct because a Vertex AI Endpoint is the resource that serves online predictions. Option A (Batch Prediction Job) is for batch predictions. Option C (Feature Store) is for managing features.

Option D (Model Registry) stores models but does not serve them.

88
MCQmedium

You have a batch prediction job on Vertex AI that processes millions of records. The job is failing with an out-of-memory error. What is the best way to resolve this?

A.Increase the minNodes and maxNodes for the batch prediction job
B.Split the input data into smaller files and run multiple batch prediction jobs
C.Enable autoscaling on the batch prediction job
D.Use a machine type with more memory for the batch prediction job
AnswerD

Increasing memory directly solves OOM.

Why this answer

Option D is correct because a batch prediction job on Vertex AI runs on a single machine (or a cluster of machines) and an out-of-memory (OOM) error indicates that the model or data processing exceeds the available RAM of the chosen machine type. Increasing the machine's memory directly addresses the root cause by providing more heap space for loading the model and processing large batches of predictions, without altering the job's parallelism or data partitioning.

Exam trap

The trap here is that candidates confuse scaling out (increasing nodes or autoscaling) with scaling up (increasing per-node resources), and assume that more nodes or splitting data will fix a memory exhaustion issue that is actually caused by insufficient RAM on each individual machine.

How to eliminate wrong answers

Option A is wrong because minNodes and maxNodes control the number of replicas for distributed prediction, not the memory per machine; increasing nodes spreads the workload but does not increase per-node memory, so OOM errors can still occur on each node. Option B is wrong because splitting input data into smaller files and running multiple jobs addresses data size but not the per-instance memory limit; if the model itself is large or each prediction requires significant memory, even smaller files can cause OOM on the same machine type. Option C is wrong because autoscaling adjusts the number of nodes based on load, not the memory capacity of each node; it can help with throughput but does not resolve a fundamental memory shortage on individual machines.
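The difference between scaling up and scaling out can be reduced to a small check. The sketch below assumes, for illustration only, that each replica must hold the full model plus one batch's working set in RAM; the sizes are hypothetical.

```python
def replica_fits(model_gb: float, batch_working_set_gb: float, node_ram_gb: float) -> bool:
    """Each replica loads the whole model plus one batch, whatever the replica count."""
    return model_gb + batch_working_set_gb <= node_ram_gb


# Hypothetical sizes: a 12 GB model with a 6 GB per-batch working set.
# The replica count n is deliberately unused: it never enters the per-node check.
small_node_ok = {n: replica_fits(12, 6, node_ram_gb=15) for n in (1, 4, 16)}
large_node_ok = replica_fits(12, 6, node_ram_gb=52)
```

With the 15 GB node the check fails at 1, 4, and 16 replicas alike, while the 52 GB node passes, which mirrors why more nodes or autoscaling do not cure a per-node memory shortage.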

89
Multi-Selectmedium

A data engineering team is building a CI/CD pipeline for machine learning models using Cloud Build and AI Platform. Which TWO practices are essential for ensuring reproducible and safe model deployments?

Select 2 answers
A.Use Cloud Functions to trigger retraining on new data arrival.
B.Tag each model version with the Git commit hash of the training code.
C.Run integration tests against the model on a staging endpoint before promoting to production.
D.Use the same environment for training and serving, possibly via custom containers.
E.Directly deploy from the development environment using gcloud commands.
AnswersB, C

Links model to exact code version for reproducibility.

Why this answer

Options B and C are correct. B ensures every model version is linked to the exact source code and training process that produced it; C ensures the model is validated before it reaches production. A is about triggering retraining, not about reproducibility or safe deployment; D might be useful but is not essential for reproducibility; E is an anti-pattern.
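Tagging a model version with the Git commit hash of its training code can be as simple as attaching a label at upload time. The helper below is a sketch: the label keys `git_commit` and `pipeline_run`, the function name, and the sample values are hypothetical, and the commit hash is passed in rather than read from a repository so the example stays self-contained.

```python
import re

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def model_version_labels(commit_sha: str, pipeline_run: str) -> dict[str, str]:
    """Build labels tying a model version to the code that trained it."""
    if not _FULL_SHA.fullmatch(commit_sha):
        raise ValueError("expected a full 40-character lowercase Git SHA")
    return {"git_commit": commit_sha, "pipeline_run": pipeline_run}


# Hypothetical commit hash and run name supplied by the CI system.
labels = model_version_labels("3f2a9c1e" * 5, "retrain-2024-06")
```

Rejecting abbreviated hashes is a design choice: a full SHA identifies one commit unambiguously, which is what makes the model version traceable later.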

90
MCQhard

A company uses Vertex AI to serve a model that requires GPU for inference. They want to minimize cost while handling variable traffic. Which strategy should they use?

A.Deploy the model to Cloud Functions with GPU
B.Use a Vertex AI Endpoint with GPU and configure auto-scaling to zero when idle
C.Use Vertex AI Batch Prediction with GPU
D.Use a Vertex AI Endpoint with GPU with a fixed number of replicas
AnswerB

Scaling to zero when idle reduces cost.

Why this answer

Option B is correct because Vertex AI Endpoints support GPU-accelerated inference with autoscaling, including the ability to scale down to zero replicas when there is no traffic. This minimizes cost by only incurring GPU charges during active inference, while still handling variable traffic through dynamic scaling.

Exam trap

Google Cloud often tests the misconception that serverless services like Cloud Functions can support GPU acceleration, when in reality GPU compute requires dedicated infrastructure like Vertex AI Endpoints or GKE.

How to eliminate wrong answers

Option A is wrong because Cloud Functions do not support GPU attachments; they are designed for lightweight, event-driven compute and cannot run GPU-accelerated inference. Option C is wrong because Vertex AI Batch Prediction is intended for offline, asynchronous processing of large datasets, not for serving real-time variable traffic with low latency. Option D is wrong because using a fixed number of replicas with GPU does not minimize cost; it keeps GPU instances running continuously regardless of traffic, leading to higher costs during idle periods.

91
MCQmedium

Your company uses Vertex AI Pipelines to automate model retraining. The pipeline has three steps: data extraction from BigQuery, feature engineering using Dataflow, and model training using a custom container on Vertex AI Training. Recently, the pipeline has been failing intermittently at the Dataflow step with a 'The job encountered a transient error. Please retry.' message. You have enabled pipeline retries with 3 attempts. However, the pipeline still fails after 3 retries. You check the logs and find that the Dataflow job requires more resources than the default worker configuration provides. Which change should you make to reduce the failure rate?

A.Increase the number of Dataflow workers to improve parallelism
B.Increase the number of retries in the pipeline to 5
C.Replace Dataflow with Dataproc to run the feature engineering step
D.Increase the Dataflow worker machine type to have more memory and CPU in the pipeline step configuration
AnswerD

More resources prevent the transient resource exhaustion errors.

Why this answer

Option D is correct because the pipeline fails due to insufficient resources (memory and CPU) in the default Dataflow worker configuration. By increasing the worker machine type (e.g., using a custom machine type with more vCPUs and memory), the Dataflow job can handle the feature engineering workload without hitting resource limits, reducing transient failures. This directly addresses the root cause identified in the logs, unlike retries or parallelism changes.

Exam trap

Google Cloud often tests the misconception that increasing parallelism (more workers) or retries will fix resource exhaustion errors, when the actual fix is to increase per-worker resources by selecting a larger machine type.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers improves parallelism but does not address the root cause of insufficient per-worker resources (memory/CPU); it may even increase resource contention. Option B is wrong because increasing retries from 3 to 5 does not fix the underlying resource constraint; the job will continue to fail on each retry if the worker configuration remains inadequate. Option C is wrong because replacing Dataflow with Dataproc is an unnecessary architectural change that introduces new operational complexity and does not solve the specific resource issue; the problem is with worker sizing, not the service itself.
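For a Dataflow job written with the Apache Beam Python SDK, worker sizing is normally passed as pipeline flags. The sketch below only assembles such a flag list; the helper name, the `n1-highmem-8` choice, and the worker cap are assumptions, and how the flags reach the job depends on how the pipeline component launches Dataflow.

```python
def dataflow_worker_args(machine_type: str, max_workers: int) -> list[str]:
    """Return Beam pipeline flags that size each Dataflow worker."""
    return [
        "--runner=DataflowRunner",
        f"--worker_machine_type={machine_type}",  # more memory and CPU per worker
        f"--max_num_workers={max_workers}",       # parallelism cap, a separate knob
    ]


# Hypothetical sizing: a high-memory machine type instead of the default.
args = dataflow_worker_args("n1-highmem-8", max_workers=10)
```

Keeping machine type and worker count as separate arguments reflects the point of the question: only the first one changes how much memory each worker has.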

92
MCQeasy

A data science team has built a model using scikit-learn. They want to operationalize it on Google Cloud without rewriting the code. Which approach should they take?

A.Export the model as a PMML file and use BigQuery ML
B.Use AI Platform Training to host the model directly
C.Package the model in a custom container and deploy to Vertex AI Endpoints
D.Convert the scikit-learn model to TensorFlow SavedModel format
AnswerC

Custom containers allow any framework without code changes.

Why this answer

Option C is correct because Vertex AI Endpoints support custom containers, allowing you to package your scikit-learn model with its dependencies (e.g., a Flask or FastAPI inference server) and deploy it without rewriting any code. This approach directly meets the requirement to operationalize the existing model on Google Cloud without modification.

Exam trap

Google Cloud often tests the misconception that AI Platform Training can host models directly, but it is strictly for training jobs, not serving endpoints; candidates confuse the training service with the prediction service.

How to eliminate wrong answers

Option A is wrong because PMML (Predictive Model Markup Language) is not natively supported by BigQuery ML; BigQuery ML uses SQL-based model creation and does not import PMML files for inference. Option B is wrong because AI Platform Training is designed for training jobs, not for hosting models as endpoints; hosting is done via AI Platform Prediction (now part of Vertex AI), but even then, scikit-learn models require a custom prediction routine or container, not direct hosting. Option D is wrong because converting a scikit-learn model to TensorFlow SavedModel format would require rewriting the model's inference logic and dependencies, contradicting the requirement to avoid code changes.

93
MCQhard

You have deployed a TensorFlow model on Vertex AI Endpoints with autoscaling. The model receives high traffic during peak hours, but you notice that inference latency increases significantly during cold starts. Which strategy would best minimize cold-start latency without incurring unnecessary cost?

A.Set minNodes to a value that handles baseline traffic, and use traffic splitting to gradually shift traffic to new replicas
B.Set minNodes to 0 and enable node auto-scaling
C.Increase maxNodes to allow more replicas during peak, and rely on Kubernetes Horizontal Pod Autoscaler
D.Use Cloud Functions with Cloud Run for the model inference to leverage serverless cold-start mitigation
AnswerA

Keeps baseline replicas warm; gradual traffic shift avoids sudden load.

Why this answer

Setting minNodes to a value that handles baseline traffic ensures that a minimum number of replicas are always warm, eliminating cold starts for baseline requests. Traffic splitting gradually shifts new traffic to newly created replicas, allowing them to warm up before receiving full load, which minimizes latency spikes without over-provisioning resources.

Exam trap

Google Cloud often tests the misconception that increasing maxNodes or relying on generic autoscaling (like HPA) solves cold starts, but the key is keeping a baseline of warm replicas via minNodes and using traffic splitting to warm new replicas gradually.

How to eliminate wrong answers

Option B is wrong because setting minNodes to 0 means no replicas are kept warm, so every scale-up event will trigger a cold start, increasing latency during traffic spikes. Option C is wrong because increasing maxNodes alone does not prevent cold starts; without a minimum number of warm replicas, new replicas still need to initialize, and relying on Kubernetes Horizontal Pod Autoscaler (which is not used by Vertex AI Endpoints) is irrelevant as Vertex AI uses its own autoscaling mechanism. Option D is wrong because Cloud Functions and Cloud Run are serverless compute services, not designed for hosting TensorFlow models with GPU/TPU support, and they introduce their own cold-start latency without addressing the specific issue of model inference cold starts on Vertex AI.

94
MCQmedium

A data scientist deploys a new version of a fraud detection model (model2) alongside the existing model (model1) on the same Vertex AI endpoint with a 70/30 traffic split. After 24 hours, the team notices that model2's predictions are significantly different from model1's, and the fraud detection rate has increased. What is the most likely explanation for the change in predictions?

A.Model2 was trained on data that leaked future information, causing unrealistic results.
B.Model2 is receiving corrupted input data due to a bug in the traffic routing.
C.The traffic split is misconfigured and sending all traffic to model2.
D.Model2 uses a different model artifact (fraud_detection_v2) that produces different predictions.
AnswerD

The environment variable MODEL_NAME points to different model versions, causing output differences.

Why this answer

Option D is correct because the most straightforward explanation for a significant change in predictions and an increased fraud detection rate is that model2 uses a different model artifact (fraud_detection_v2) that was designed to produce different outputs. In Vertex AI, deploying a new model version with a traffic split means both models receive the same input data, but each model artifact independently processes it. If model2's predictions differ substantially, it indicates the model artifact itself has been updated or replaced, not that there is a data or routing issue.

Exam trap

Google Cloud often tests the misconception that a traffic split or routing issue can cause prediction differences, when in fact the split only controls which model receives the request, not the content of the request or the model's internal logic.

How to eliminate wrong answers

Option A is wrong because data leakage would cause unrealistically high performance during training, but it does not explain why predictions differ between two models receiving the same live input data; both models would be affected if the input data itself contained leaked future information. Option B is wrong because corrupted input data due to a traffic routing bug would affect both models equally if they share the same endpoint and routing logic; Vertex AI's traffic split routes requests to the correct model based on the configured percentage, not by altering the input data. Option C is wrong because if the traffic split were misconfigured to send all traffic to model2, model1 would receive zero requests, but the question states a 70/30 split is in place and the team notices model2's predictions differ from model1's; a misconfiguration would not cause model2's predictions to change—it would simply change which model serves requests.

95
MCQhard

You run `gcloud ai models describe` and get the error above. The model was created successfully from a training job that completed without errors. The model ID is correct. What is the most likely cause?

A.The model was deleted or expired due to time-to-live settings.
B.The gcloud command is not authenticated to the correct project.
C.The model was created but not yet trained; training must complete before describe works.
D.The model was created in a different region (e.g., europe-west4) than the one specified in the command.
AnswerD

Model resources are regional; if created in another region, describe with wrong region fails.

Why this answer

Option D is correct because `gcloud ai models describe` defaults to the `us-central1` region unless overridden with the `--region` flag. If the model was created in a different region (e.g., `europe-west4`), the command will fail with a 'Model not found' error even though the model ID is correct. Vertex AI models are regional resources, so the region must match exactly.

Exam trap

Google Cloud often tests the misconception that Vertex AI models are global resources, but they are actually regional, and candidates forget to specify the `--region` flag or assume the default region matches the model's location.

How to eliminate wrong answers

Option A is wrong because the model was created successfully from a training job that completed without errors, and there is no mention of time-to-live settings being configured; deletion or expiration would typically produce a different error message. Option B is wrong because the error is not about authentication; if the project were wrong, the error would indicate 'Permission denied' or 'Project not found', not 'Model not found'. Option C is wrong because the model was created from a completed training job, meaning training already finished; the `describe` command works on the model resource itself, not on a training state.
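
Because the region is part of a model's full resource name, the same model ID under a different location points at a different resource. The sketch below assumes the usual `projects/.../locations/.../models/...` naming pattern; the project and model IDs are made-up placeholders, and the helper names are hypothetical.

```python
def model_resource_name(project: str, region: str, model_id: str) -> str:
    """Build the full resource name that identifies a regional model."""
    return f"projects/{project}/locations/{region}/models/{model_id}"


def region_of(resource_name: str) -> str:
    """Extract the region segment from a full model resource name."""
    parts = resource_name.split("/")
    well_formed = (
        len(parts) == 6
        and parts[0] == "projects"
        and parts[2] == "locations"
        and parts[4] == "models"
    )
    if not well_formed:
        raise ValueError(f"unexpected resource name: {resource_name!r}")
    return parts[3]


# Hypothetical IDs, for illustration only.
created = model_resource_name("my-project", "europe-west4", "1234567890")
queried = model_resource_name("my-project", "us-central1", "1234567890")
print(created == queried)  # False: same model ID, different resource
```

Passing the region explicitly (for example with the `--region` flag mentioned above) avoids depending on whatever default the local configuration happens to use.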

96
MCQmedium

A data science team wants to deploy a model that requires a custom container with specific NVIDIA CUDA version. They build the image and push to Artifact Registry. When deploying to Vertex AI, the model fails to load with an error: 'Failed to start container: invalid ELF header'. What is the most likely cause?

A.The container image was built for a different CPU architecture (e.g., ARM64) than the Vertex AI machine (x86_64)
B.The model file (saved as .pkl) is corrupted
C.The CUDA version in the container is incompatible with the GPU on the machine
D.The container does not have the necessary permissions to access the model file in Cloud Storage
AnswerA

Invalid ELF header indicates the binary is incompatible with the platform architecture.

Why this answer

Option A is correct because the image was built for the wrong architecture (e.g., building on an ARM Mac for an x86_64 deployment). Option B (corrupted model file) would cause model loading errors but not a container startup failure. Option C (CUDA version mismatch) would cause a different error.

Option D (missing Cloud Storage permissions) would cause a permission denied error.
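
One way to confirm an architecture mismatch is to read the target machine field from a binary's ELF header. The following is a minimal sketch using only the standard library; the function name and the small lookup table are illustrative assumptions, and only two machine codes are covered.

```python
import struct

ELF_MAGIC = b"\x7fELF"
# e_machine values from the ELF specification (subset).
MACHINES = {0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the target architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF binary")
    # Byte 5 (EI_DATA) is 1 for little-endian, 2 for big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return MACHINES.get(machine, f"unknown (0x{machine:x})")


# Hand-built 64-bit little-endian header with e_machine = 0xB7, for illustration.
sample = b"\x7fELF\x02\x01\x01\x00" + b"\x00" * 8 + b"\x02\x00\xb7\x00"
print(elf_machine(sample))  # aarch64
```

In practice the header bytes would come from a binary inside the image, for example `open(path, "rb").read(20)`.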

97
Multi-Selecteasy

A company is designing a CI/CD pipeline for their ML models using Cloud Build and Vertex AI. Which TWO practices should they adopt to ensure reliable and reproducible deployments?

Select 2 answers
A.Require manual approval for every model change before deployment
B.Store all model artifacts in a single Cloud Storage bucket without versioning
C.Use immutable container images with version tags for each model deployment
D.Include unit tests for data preprocessing and feature engineering code in the pipeline
E.Deploy every model version directly to production for immediate use
AnswersC, D

Immutable images ensure that the exact same environment is used across all deployments.

Why this answer

Options C and D are correct. C (immutable versioned images) ensures reproducibility. D (unit tests for preprocessing and feature engineering) ensures data consistency.

Option A (manual approval for all changes) slows down CI/CD. Option B (a single bucket without versioning) lets artifacts be overwritten, which undermines reproducibility. Option E (deploying every version directly to production) skips validation before release.

98
MCQhard

An e-commerce company uses Vertex AI to serve a real-time personalization model. The model is updated daily via a retraining pipeline that uploads a new version to the same endpoint. Recently, after a model update, the online prediction responses have been returning anomalous results (e.g., recommending irrelevant products). The previous version performed well. The team suspects that the new model is undertrained or has a bug. They have already checked the training code and the pipeline logs, which show no errors. The pipeline deploys the new model version to the endpoint by updating the traffic split to route 100% of traffic to the new version. Which course of action should the team take to quickly mitigate the issue while diagnosing the root cause?

A.Tweak the training hyperparameters and retrain immediately.
B.Increase the number of replicas on the endpoint to handle load.
C.Delete the current endpoint and recreate it with the previous model version.
D.Roll back the endpoint to the previous model version by setting traffic split to 0% for the new version.
AnswerD

Rolling back traffic instantly restores previous behavior while allowing debugging of the new version.

Why this answer

Option D is correct because rolling back traffic to the previous known-good version immediately restores correct predictions while the team investigates the new model. Option A (retraining with tweaked hyperparameters) takes time and may not fix the bug. Option B (more replicas) does not address the incorrect model output.

Option C (deleting the endpoint) is excessive and causes downtime.
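
A rollback of this kind amounts to rewriting the endpoint's traffic split, which maps each deployed model ID to a percentage summing to 100. The sketch below only computes the new mapping; the helper name and the deployed model IDs are hypothetical, and it assumes the previous version is still deployed on the endpoint.

```python
def rollback_split(current: dict[str, int], bad_id: str, good_id: str) -> dict[str, int]:
    """Return a new split that moves all of bad_id's traffic to good_id."""
    if bad_id not in current:
        raise KeyError(f"{bad_id} is not in the current split")
    updated = dict(current)
    updated[good_id] = updated.get(good_id, 0) + updated[bad_id]
    updated[bad_id] = 0
    if sum(updated.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    return updated


# Hypothetical deployed model IDs.
current = {"model-v1": 0, "model-v2": 100}
print(rollback_split(current, bad_id="model-v2", good_id="model-v1"))
# {'model-v1': 100, 'model-v2': 0}
```

Keeping the faulty version deployed at 0% leaves it available for direct debugging without serving live traffic.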

99
Multi-Selecthard

During a Vertex AI training pipeline, the training job fails with an error: 'Out of memory: Killed process'. The model is a large deep learning model using TensorFlow. Which THREE steps should the team take to resolve this issue?

Select 3 answers
A.Change to a distributed training strategy
B.Enable memory growth configuration in TensorFlow
C.Switch the training from GPU to TPU accelerator
D.Reduce the training batch size
E.Use a custom machine type with more memory
AnswersB, D, E

Memory growth allows TensorFlow to allocate memory on demand, avoiding early OOM.

Why this answer

Option B is correct because TensorFlow by default allocates all available GPU memory, which can cause out-of-memory (OOM) errors when other processes or the system itself need memory. Enabling memory growth with `tf.config.experimental.set_memory_growth` allows TensorFlow to allocate memory incrementally, reducing the risk of OOM kills. This is a direct mitigation for the 'Killed process' error caused by memory exhaustion.

Exam trap

Google Cloud often tests the misconception that distributed training automatically solves memory issues, but in reality, it distributes computation, not memory pressure, and can even increase per-node memory usage due to gradient synchronization buffers.

100
MCQhard

You have two versions of a classification model (v1 and v2) deployed on a Vertex AI Endpoint. You want to gradually roll out v2 to 10% of traffic, monitor performance, and if metrics are better, increase traffic to 100%. You have set up model monitoring for skew and drift. Which configuration should you use?

A.Use the Vertex AI Endpoint 'traffic_split' parameter to assign 10% of traffic to v2 and 90% to v1.
B.Deploy v2 to a separate endpoint and use a load balancer to route 10% of traffic.
C.Create a new deployment with v2 on the same endpoint and set the 'min_replica_count' to 1 for both versions.
D.Enable Vertex AI Model Monitoring on the endpoint and set up alerting for performance drop.
AnswerA

Traffic splitting is the standard method for canary deployments.

Why this answer

The Vertex AI Endpoint 'traffic_split' parameter allows you to direct a percentage of inference requests to different model versions deployed on the same endpoint. Setting 10% to v2 and 90% to v1 enables a gradual rollout while monitoring skew and drift, and you can adjust the split as needed. This is the native, supported method for canary deployments in Vertex AI, avoiding the complexity and latency of external load balancers.

Exam trap

The trap here is that candidates confuse infrastructure-level load balancing (Option B) with Vertex AI's built-in traffic splitting, or think that replica counts (Option C) control traffic distribution, when in fact traffic_split is the only parameter that directly controls request routing percentages.

How to eliminate wrong answers

Option B is wrong because deploying v2 to a separate endpoint and using an external load balancer adds unnecessary complexity, latency, and cost; Vertex AI Endpoints natively support traffic splitting without additional infrastructure. Option C is wrong because setting 'min_replica_count' to 1 for both versions does not control traffic distribution; it only ensures minimum instance availability, not the percentage of requests routed to each model. Option D is wrong because enabling Model Monitoring and alerting for performance drop is a monitoring step, not a configuration for traffic splitting; it does not direct 10% of traffic to v2.
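
The staged rollout described here can be pictured as a sequence of traffic split mappings. This is a minimal sketch in plain Python; the function name, the `"v1"`/`"v2"` keys, and the rollout stages are illustrative assumptions rather than values from any Vertex AI API.

```python
def canary_split(stable_id: str, canary_id: str, canary_percent: int) -> dict[str, int]:
    """Build a two-model traffic split whose percentages sum to 100."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {stable_id: 100 - canary_percent, canary_id: canary_percent}


# Hypothetical rollout stages: start at 10%, widen only if metrics hold up.
for percent in (10, 50, 100):
    print(canary_split("v1", "v2", percent))
```

Each stage would be applied only after the monitored metrics for the canary look at least as good as those of the stable version.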

101
MCQeasy

A company deploys a machine learning model on Vertex AI for online predictions. The model experiences intermittent spikes in traffic, causing latency increases. Which strategy should the company use to ensure consistent low latency during traffic spikes?

A.Enable autoscaling on the Vertex AI endpoint with appropriate min and max nodes.
B.Manually scale the deployed model to a larger machine type during peak hours.
C.Reduce the number of prediction nodes to minimize overhead.
D.Switch to batch prediction to handle all requests asynchronously.
AnswerA

Autoscaling automatically adjusts the number of nodes based on traffic, ensuring low latency during spikes while controlling cost.

Why this answer

Vertex AI endpoints support autoscaling, which dynamically adjusts the number of prediction nodes based on incoming traffic. By setting appropriate min and max nodes, the endpoint can scale up during traffic spikes to maintain low latency and scale down during low traffic to reduce costs. This ensures consistent performance without manual intervention.

Exam trap

Google Cloud often tests the misconception that manual scaling or switching to batch prediction is a valid solution for real-time latency spikes, when in fact autoscaling is the only automated, cost-effective method for handling intermittent traffic on Vertex AI endpoints.

How to eliminate wrong answers

Option B is wrong because manually scaling to a larger machine type during peak hours is reactive, not proactive, and cannot respond instantly to intermittent spikes; it also incurs higher costs during all peak hours rather than scaling only when needed. Option C is wrong because reducing the number of prediction nodes would decrease capacity, worsening latency during traffic spikes rather than improving it. Option D is wrong because batch prediction is designed for asynchronous, offline processing of large datasets and does not provide real-time, low-latency responses required for online predictions.
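
To see why min and max nodes matter, consider a generic target-tracking calculation: scale the replica count in proportion to observed versus target utilization, then clamp to the configured bounds. This is a simplified illustration, not Vertex AI's actual autoscaling algorithm; all names and numbers are assumptions.

```python
import math


def desired_replicas(
    current: int,
    observed_util: float,
    target_util: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_util <= 0:
        raise ValueError("target_util must be positive")
    wanted = math.ceil(current * observed_util / target_util)
    return max(min_replicas, min(max_replicas, wanted))


# Spike: 2 replicas at 90% utilization against a 60% target.
print(desired_replicas(2, 90.0, 60.0, min_replicas=2, max_replicas=10))  # 3
# Quiet period: the min bound keeps warm capacity available.
print(desired_replicas(4, 10.0, 60.0, min_replicas=2, max_replicas=10))  # 2
```

A max that is too low caps the response to a spike, while a min that is too low leaves little warm capacity when the spike begins.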

102
Multi-Selectmedium

Which TWO best practices should be followed when managing multiple model versions on Vertex AI Endpoints for a production system?

Select 2 answers
A.Always keep all historical versions deployed to enable fast rollback.
B.If two versions share the same endpoint, they must have exactly the same machine type.
C.Use traffic splitting to gradually shift traffic to a new version while monitoring performance.
D.Upload each model version as a new model resource and deploy to a separate endpoint for isolation.
E.Use the same endpoint for multiple versions and adjust min_replica_count, max_replica_count for each version.
AnswersC, D

Traffic splitting enables canary deployments and safe rollback.

Why this answer

Option C is correct because Vertex AI Endpoints support traffic splitting, allowing you to route a percentage of inference requests to a new model version while the rest goes to the existing version. This enables gradual rollout, monitoring of performance metrics (e.g., latency, error rate), and safe rollback without downtime. It is a best practice for production systems to validate a new version before fully cutting over.

Exam trap

Google Cloud often tests the misconception that multiple versions on the same endpoint must share identical infrastructure settings (like machine type), but Vertex AI allows heterogeneous configurations per version, and traffic splitting is the correct method for gradual rollouts, not keeping all versions or adjusting autoscaling parameters.

103
MCQmedium

An MLOps team wants to implement continuous deployment of ML models using Cloud Build and Vertex AI. They have a GitHub repository with training code. What should they use?

A.Deploy using Cloud Run
B.Vertex AI Pipelines integrated with Cloud Build
C.Cloud Functions to monitor GitHub
D.Cloud Build trigger with a custom step to run Vertex AI Training job and deploy
AnswerD

Cloud Build can be configured to trigger on GitHub pushes and run training/deployment steps.

Why this answer

Option D is correct: Cloud Build triggers can include custom steps to run Vertex AI Training jobs and then deploy the model. Option B is wrong because Vertex AI Pipelines is an orchestration service, not a CI/CD system; Cloud Build is the CI/CD tool. Option C is wrong because Cloud Functions is event-driven but not designed for CI/CD pipelines.

Option A is wrong because Cloud Run is for serverless containers, not for training and deploying ML models.

104
MCQhard

Your team manages a multi-model ensemble deployed on Vertex AI Endpoint. The ensemble consists of three models: a neural network (NN), a gradient boosted tree (GBT), and a logistic regression (LR). They are deployed as separate endpoints and traffic is split using a traffic split configuration. Recently, the overall accuracy dropped from 92% to 85%. Monitoring shows that the NN model's latency has increased significantly, causing it to miss timeouts and fall back to default predictions. The other two models are performing normally. The NN model is the most complex and handles the majority of the traffic. You need to restore accuracy quickly. What should you do first?

A.Increase the timeout for predictions on the NN endpoint to avoid fallback.
B.Enable fallback logic to use the GBT model when NN times out, ensuring no prediction is missed.
C.Temporarily reduce the traffic percentage to the NN model to 0% and redistribute to GBT and LR until the NN issue is resolved.
D.Relaunch the NN model with a larger machine type and more replicas to reduce latency.
AnswerC

This immediately stops the problematic model from serving and restores accuracy using the other models.

105
MCQmedium

What is the most likely cause of this error?

A.The JSONL file is missing some required fields.
B.The input images have a different number of channels than expected.
C.The instances are in JSON format instead of JSONL.
D.Each JSONL line contains a single image tensor without a batch dimension.
AnswerD

The model expects a batch dimension; each line should contain a batch of images.

Why this answer

The error occurs because each line in a JSONL file is expected to be a self-contained JSON object representing a single inference request. When the line contains a raw image tensor without a batch dimension, the model's serving framework (e.g., TensorFlow Serving or TorchServe) cannot perform batched inference, as it expects input tensors to have shape (batch_size, channels, height, width) or (batch_size, height, width, channels). The missing batch dimension causes a shape mismatch error during model execution.

Exam trap

Google Cloud often tests the distinction between data format errors (like JSON vs JSONL) and tensor shape errors, trapping candidates who confuse a missing batch dimension with a missing field or channel mismatch.

How to eliminate wrong answers

Option A is wrong because missing required fields would typically cause a parsing or validation error, not a tensor shape mismatch; the error described is specifically about the batch dimension, not missing fields. Option B is wrong because an incorrect number of channels would produce a channel mismatch error, not a missing batch dimension error; the model expects a specific channel count, but the error message would reference channel depth, not batch size. Option C is wrong because JSON format instead of JSONL would cause a file parsing error (e.g., expecting one JSON object per line but finding a JSON array), not a tensor shape error; the serving framework would fail to load the file entirely.

106
Multi-Selecthard

An MLOps team manages a pipeline that retrains an XGBoost classifier weekly using BigQuery data. The pipeline is orchestrated with Cloud Composer and deploys the new model to Vertex AI Endpoint if validation metrics (AUC > 0.9) are met. Over the past month, the deployed model's AUC has dropped from 0.95 to 0.88, despite the training pipeline consistently reporting AUC > 0.9. Which THREE steps should the team take to diagnose and fix this issue?

Select 3 answers
A.Review the training pipeline's hyperparameter tuning configuration to ensure it is not overfitting to stale data.
B.Add a canary deployment step where new model version receives a small percentage of traffic before full rollout.
C.Compare feature distributions between the training data and online serving data using Vertex AI Model Monitoring.
D.Retrain the model using a longer training history to include older data that may still be relevant.
E.Implement model validation on the deployed endpoint by logging predictions and comparing against actuals for a sample of traffic using Vertex Explainable AI.
AnswersB, C, E

Canary testing can catch performance issues early before the model is fully deployed.

107
MCQeasy

You are deploying a machine learning model to production using Vertex AI. The model requires GPU acceleration for low-latency predictions. You need to minimize costs while ensuring availability during a defined business hours window (8 AM to 6 PM). Which deployment strategy should you use?

A.Deploy to an endpoint with manual scaling, set min nodes to zero and max nodes to 10, and use a cron job to adjust during business hours.
B.Use a custom prediction routine (CPR) that dynamically requests GPUs from the cluster.
C.Deploy to a dedicated endpoint with a GPU machine and configure autoscaling.
D.Use Cloud Functions to invoke the model, and let Google Cloud manage the underlying GPU infrastructure.
AnswerA

Manual scaling allows setting min to zero, stopping all nodes outside hours, and auto-scheduling via cron or Cloud Scheduler to scale up before 8 AM and down after 6 PM, minimizing cost.

Why this answer

Option A is correct because it uses manual scaling with a cron job to set min nodes to zero outside business hours (8 AM–6 PM) and scale up to a maximum of 10 nodes during business hours, ensuring GPU availability when needed while minimizing costs by running zero instances when the model is not required. This approach directly addresses the requirement for low-latency GPU predictions during a defined window without paying for idle GPU resources outside that window.

Exam trap

Google Cloud often tests the misconception that autoscaling alone is sufficient for cost optimization, but the trap here is that autoscaling with a GPU machine typically requires a minimum of one replica, which still incurs 24/7 GPU costs, whereas manual scaling with a cron job to set min nodes to zero is the only way to completely eliminate GPU costs outside the defined business hours.

How to eliminate wrong answers

Option B is wrong because a custom prediction routine (CPR) is a way to package custom logic for serving predictions, not a deployment strategy for managing GPU scaling or scheduling; it does not inherently control when GPUs are requested or released based on a business hours window. Option C is wrong because deploying to a dedicated endpoint with a GPU machine and autoscaling will keep at least one instance running continuously (autoscaling typically has a minimum of 1 node), incurring costs 24/7 even when the model is not needed outside business hours. Option D is wrong because Cloud Functions does not support GPU acceleration; it is a serverless compute platform for lightweight, stateless functions and cannot attach GPUs for model inference.

108
Multi-Selecthard

A company trains a model using Cloud TPUs. The model is deployed to AI Platform Prediction using a custom container with TensorFlow. Which THREE considerations are most important when serving this model?

Select 3 answers
A.The model should be retrained using GPU to ensure identical performance on serving hardware.
B.The serving container must have the same TensorFlow version that was used during training to avoid compatibility issues.
C.The model should be quantized to reduce memory footprint before deployment.
D.The serving infrastructure must use GPU or CPU, as AI Platform Prediction does not support TPU serving.
E.The model must be exported as a TensorFlow SavedModel and packaged in a custom container with proper dependencies.
AnswersB, D, E

Version mismatch can cause errors or different behavior.

Why this answer

Option B is correct because TensorFlow models are tightly coupled to the specific version of TensorFlow used during training. Serving with a different version can lead to incompatibilities in graph serialization, op definitions, or checkpoint formats, causing runtime errors or silent prediction failures. AI Platform Prediction's custom container must therefore match the training environment's TensorFlow version to ensure the model loads and executes correctly.

Exam trap

Google Cloud often tests the misconception that hardware must match between training and serving, but the real requirement is software version compatibility, not hardware identity.

109
MCQmedium

A machine learning pipeline uses Vertex AI Pipelines. One component fails intermittently due to resource constraints. What is the best way to handle this?

A.Use retry policies in the component specification
B.Deploy the pipeline on a larger cluster
C.Increase the pipeline timeout
D.Use a different orchestrator
AnswerA

Retry policies handle intermittent failures by automatically retrying the component.

Why this answer

Option A is correct because Vertex AI Pipelines supports retry policies at the component level via the `retry` field in the component specification (YAML or Python). This allows the pipeline to automatically re-execute a failed component when the failure is transient (e.g., resource exhaustion), without manual intervention. Retry policies are the standard mechanism for handling intermittent failures in a serverless orchestration environment like Vertex AI Pipelines.

Exam trap

Google Cloud often tests the misconception that scaling up infrastructure (Option B) is the primary fix for intermittent failures, when in fact retry policies are the correct, cost-efficient solution for transient resource constraints in a managed pipeline service.

How to eliminate wrong answers

Option B is wrong because deploying the pipeline on a larger cluster does not address the intermittent nature of the failure; it only increases resource capacity, which may not be cost-effective and does not handle transient resource spikes. Option C is wrong because increasing the pipeline timeout does not resolve resource constraints; it only gives the component more time to run, which will still fail if resources are insufficient. Option D is wrong because using a different orchestrator (e.g., Kubeflow Pipelines, Argo) does not inherently fix resource constraints; the issue is with the component's resource allocation, not the orchestration engine itself.
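
The behavior a retry policy provides can be sketched in plain Python: re-run a step on a transient failure, waiting longer between attempts. This is an illustration of the idea only, not the pipeline SDK's own API; the function, its parameters, and the flaky step are hypothetical.

```python
import time


def run_with_retry(step, max_retries=3, backoff_seconds=1.0, backoff_factor=2.0, sleep=time.sleep):
    """Call step(); on RuntimeError, wait and retry up to max_retries times."""
    delay = backoff_seconds
    for attempt in range(max_retries + 1):
        try:
            return step()
        except RuntimeError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the failure
            sleep(delay)
            delay *= backoff_factor


# Hypothetical step that fails twice before succeeding.
attempts = {"count": 0}


def flaky_step():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("resource exhausted")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = run_with_retry(flaky_step, sleep=delays.append)
print(result, attempts["count"], delays)  # ok 3 [1.0, 2.0]
```

Retries only help when the failure is transient; a component that always exceeds its memory limit will fail on every attempt and needs more resources instead.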

110
MCQmedium

A retail company uses a machine learning model to predict inventory demand. The model is retrained weekly using Vertex AI Pipelines. Recently, the model's accuracy has degraded because the data distribution has shifted. Which action should you take to monitor and detect this drift automatically?

A.Enable Vertex AI Model Monitoring for the endpoint and configure alerting on feature drift
B.Set up alerts for when the model's mean absolute error exceeds a threshold on the evaluation dataset
C.Enable Cloud Logging for the prediction endpoint and search for error logs
D.Schedule a job to compare the distribution of incoming features with the training data using Cloud Dataflow
AnswerA

Model Monitoring automates drift detection.

Why this answer

Vertex AI Model Monitoring is purpose-built to automatically detect feature drift and prediction drift on deployed endpoints. By enabling it and configuring alerting on feature drift, you can proactively identify when the distribution of incoming features deviates from the training data, which directly addresses the root cause of accuracy degradation without manual intervention.

Exam trap

Google Cloud often tests the distinction between monitoring model performance metrics (like MAE) versus monitoring input data distributions (feature drift), and candidates mistakenly choose a performance-based alerting option because they think accuracy degradation is the only signal, ignoring that drift detection is the proactive mechanism to catch the root cause before accuracy drops.

How to eliminate wrong answers

Option B is wrong because setting alerts on mean absolute error (MAE) on the evaluation dataset only detects performance degradation after the fact, not the underlying data distribution shift; it also requires ground truth labels, which may not be available in real time. Option C is wrong because Cloud Logging for the prediction endpoint captures request/response logs and error messages, but it does not perform statistical drift analysis or compare feature distributions. Option D is wrong because scheduling a job with Cloud Dataflow to compare distributions is a custom, manual approach that lacks the automated, integrated monitoring and alerting capabilities of Vertex AI Model Monitoring, and it introduces unnecessary operational overhead.
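
Feature drift detection boils down to comparing the distribution of a feature at serving time with its distribution in the training data. One common distance for this is the Jensen-Shannon divergence; the sketch below computes it over two normalized histograms. The histograms and the 0.3 alert threshold are made-up values, and this is not a claim about how Model Monitoring is implemented internally.

```python
import math


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two normalized histograms."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")

    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical per-bin frequencies for one feature.
training = [0.5, 0.3, 0.2]
serving = [0.2, 0.3, 0.5]
THRESHOLD = 0.3  # illustrative alerting threshold

score = js_divergence(training, serving)
print(f"divergence={score:.3f}, alert={score > THRESHOLD}")
```

With base-2 logarithms the score ranges from 0 (identical distributions) to 1 (no overlap), which makes a fixed threshold easy to reason about.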

111
Multi-Selectmedium

Which THREE metrics should be monitored for a deployed machine learning model in production?

Select 3 answers
A.Number of replicas
B.Prediction error rate
C.Data drift detection
D.Training time
E.Prediction latency
AnswersB, C, E

Accuracy metric.

Why this answer

Prediction error rate (Option B) is a direct measure of model accuracy in production, reflecting how often the model's predictions deviate from actual outcomes. Monitoring this metric is essential for detecting model degradation, data quality issues, or concept drift that can silently reduce model performance over time.

Exam trap

Google Cloud often tests the distinction between operational metrics (like latency, error rate, drift) and development/infrastructure metrics (like training time, replica count) to see if candidates understand what is relevant for ongoing model monitoring versus model building or deployment scaling.

112
Multi-Selectmedium

A data science team has deployed a custom TensorFlow model on Vertex AI Prediction. They notice increasing prediction latency and a growing number of 503 errors during peak traffic hours. The model is served using a single regional endpoint with min replica count of 2 and max replica count of 10. Which TWO actions should the team take to address these issues?

Select 2 answers
A.Use a larger machine type (e.g., n1-highmem-8) instead of the current n1-standard-4 to improve per-replica throughput.
B.Enable autoscaling with a higher max replica count and configure a CPU utilization target of 60%.
C.Reduce the min replica count to 0 to allow the service to scale down to zero when not in use.
D.Deploy the model as a batch prediction job and move all online predictions to batch.
E.Switch to a global endpoint with automatic scaling to distribute traffic across multiple regions.
AnswersB, E

Increasing max replicas and tuning CPU utilization target helps handle peak load and reduce latency.

113
MCQhard

A company serves multiple models using Vertex AI endpoints. Each model has different latency and memory requirements. To minimize cost, the company wants to share underlying compute resources among models. Which approach should they use?

A.Deploy each model as a separate Cloud Run service and use a load balancer.
B.Use a single GKE cluster with multiple deployments and use Istio for routing.
C.Deploy all models to a single Vertex AI endpoint and configure traffic splitting.
D.Create separate endpoints for each model and use a load balancer to route traffic.
AnswerC

Vertex AI endpoints allow deploying multiple models behind one endpoint, sharing resources.

Why this answer

Vertex AI endpoints support traffic splitting, allowing you to deploy multiple models behind a single endpoint and route a percentage of traffic to each model. This enables resource sharing and cost optimization because the underlying compute infrastructure is shared among the models, unlike separate endpoints which would each require dedicated resources.

Exam trap

Google Cloud often tests the misconception that separate endpoints or services are required for different models, when in fact Vertex AI endpoints support multi-model deployment with traffic splitting to share resources and minimize cost.

How to eliminate wrong answers

Option A is wrong because deploying each model as a separate Cloud Run service and using a load balancer does not share underlying compute resources; each service runs in its own container instance, leading to higher cost and no direct model-level traffic splitting. Option B is wrong because using a single GKE cluster with multiple deployments and Istio for routing is overly complex for Vertex AI model serving, and it bypasses the managed Vertex AI endpoint capabilities that natively support traffic splitting and resource sharing. Option D is wrong because creating separate endpoints for each model and using a load balancer defeats the purpose of sharing compute resources; each endpoint would have its own dedicated resources, increasing cost and management overhead.

114
MCQeasy

Your team uses Vertex AI Feature Store to serve features for online predictions. A feature value changes frequently (e.g., user session clicks). Which type of feature should you use to ensure low-latency writes and reads?

A.Streaming feature
B.Batch feature
C.Feature view
D.Bigtable-backed feature
AnswerA

Streaming features are designed for low-latency, high-frequency updates and reads.

Why this answer

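Option A is correct because streaming features are meant for values that are updated at high frequency, such as user session clicks, and support low-latency writes and reads. Option B is wrong because batch features suit static or periodically refreshed data.

Option C is wrong because 'feature view' is not a feature type in Vertex AI. Option D is wrong because Bigtable is a database service, not a type of feature in Feature Store.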

115
MCQmedium

You deployed a model on Vertex AI Endpoints using a custom container. The model serves predictions but the latency is higher than expected. You suspect the container is not making full use of the CPU resources. What should you do to reduce latency?

A.Modify the container to use multi-threading or increase the number of workers in the prediction server (e.g., Gunicorn workers).
B.Enable response caching on the endpoint.
C.Change the machine type to a GPU-accelerated machine.
D.Increase the number of nodes by adjusting autoscaling limits.
AnswerA

Properly configuring concurrency allows each node to process multiple requests in parallel, reducing latency under load.

Why this answer

Option A is correct because high latency in a CPU-based custom container often stems from underutilizing available CPU cores. By increasing the number of workers (e.g., Gunicorn workers) or enabling multi-threading, you allow the prediction server to handle multiple requests concurrently, reducing queue time and improving throughput. This directly addresses the symptom of the container not making full use of CPU resources.
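As a rough sketch of the worker tuning mentioned above, the snippet below applies the starting heuristic from the Gunicorn documentation, (2 x cores) + 1. The function name is hypothetical, and the right value for a model server is an assumption to validate under load, since each worker process typically loads its own copy of the model.

```python
import multiprocessing

def recommended_workers(cpu_cores):
    """Gunicorn's documented starting heuristic: (2 x cores) + 1 workers."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return 2 * cpu_cores + 1

# gunicorn.conf.py-style settings: Gunicorn reads module-level names
# such as `workers` and `threads` from its Python config file.
workers = recommended_workers(multiprocessing.cpu_count())
threads = 1
```

On a 4-vCPU machine this heuristic suggests 9 workers; memory per worker may force a lower number in practice.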

Exam trap

Google Cloud often tests the misconception that scaling out (adding more nodes) or upgrading hardware (GPU) is the default fix for latency, when the real issue is often software-level concurrency configuration within the container.

How to eliminate wrong answers

Option B is wrong because response caching reduces latency only for repeated identical requests, not for the general case of underutilized CPU resources; it does not improve concurrent request handling. Option C is wrong because switching to a GPU-accelerated machine would only help if the model benefits from GPU parallelism (e.g., deep learning models), but the question states the container is not making full use of CPU resources, implying the bottleneck is software configuration, not hardware type. Option D is wrong because increasing the number of nodes via autoscaling adds more instances but does not fix the per-instance CPU underutilization; it may even increase cost without addressing the root cause of inefficient request handling within each container.

116
MCQhard

You are a data engineer at a financial services company. You have deployed a credit risk model on Vertex AI Endpoints using a custom container with a TensorFlow SavedModel. The model expects input features as a JSON object. Recently, the model has been returning high prediction latency and occasional 503 errors. You have enabled autoscaling with minNodes=2 and maxNodes=10. The model is CPU-only and uses n1-standard-4 machines. Monitoring shows that during peak hours, CPU utilization reaches 90% and memory is at 80%. The number of prediction requests per second peaks at 100. You suspect that the model is not scaling fast enough. Which action will most effectively reduce latency and eliminate 503 errors?

A.Increase maxNodes to 20 to allow more replicas during peak
B.Change the machine type to n1-standard-4 with a GPU (e.g., NVIDIA T4) and update the custom container to use GPU
C.Set minNodes to 5 to keep more replicas warm
D.Switch to n1-highmem-4 machines to provide more memory per node
AnswerB

GPU acceleration reduces per-request latency and can handle more requests per node.

Why this answer

Option B is correct because the high CPU utilization (90%) indicates that the model's inference is compute-bound. Offloading the computation to a GPU (NVIDIA T4) significantly accelerates TensorFlow model inference, reducing per-request latency and allowing each replica to handle more requests per second. This directly addresses the root cause of the 503 errors (requests timing out due to slow inference) and reduces the need for rapid scaling.

Exam trap

Google Cloud often tests the misconception that scaling out (increasing replicas) is always the solution to latency and 503 errors, when in fact the root cause may be per-replica performance (CPU vs. GPU) that scaling cannot fix.

How to eliminate wrong answers

Option A is wrong because increasing maxNodes to 20 does not address the fundamental bottleneck: each replica is CPU-bound at 90% utilization. More replicas would still be slow and may not scale quickly enough to handle sudden spikes, and they would increase cost without fixing latency. Option C is wrong because setting minNodes to 5 keeps more replicas warm but does not reduce the latency of each individual prediction; the replicas would still be CPU-bound, so 503 errors from slow inference would persist.

Option D is wrong because memory is only at 80%, not a bottleneck; switching to n1-highmem-4 provides more memory but does not accelerate the CPU-bound computation, so latency and 503 errors would remain.

117
Multi-Selecteasy

A company is deploying a machine learning model for fraud detection. The model is trained using TensorFlow and will be served on Vertex AI Prediction. The team wants to implement model monitoring to detect prediction drift. Which TWO actions should they take? (Choose 2)

Select 2 answers
A.Configure Vertex AI Model Monitoring to compare online prediction inputs against training data statistics.
B.Collect ground truth labels for all predictions to measure accuracy drift.
C.Set up a separate Cloud Monitoring alerting policy to watch for prediction errors.
D.Enable automatic model retraining in Vertex AI Model Monitoring when drift is detected.
E.Enable prediction drift monitoring to detect changes in model output distribution.
AnswersA, E

Comparing online inputs with training-data statistics detects feature drift, which is a common monitoring need.

Why this answer

Option A is correct because Vertex AI Model Monitoring can be configured to compare online prediction inputs against training data statistics to detect skew, which is a form of drift. This is a standard capability of Vertex AI Model Monitoring, where you specify a baseline dataset (typically training data) and the service automatically computes statistics on incoming prediction requests to identify distribution shifts.

Exam trap

Google Cloud often tests the distinction between monitoring for drift (which focuses on input/output distributions) versus monitoring for model accuracy (which requires ground truth labels), and candidates mistakenly think collecting ground truth is a prerequisite for drift detection.

118
Multi-Selecteasy

Which TWO are benefits of using Vertex AI Endpoints for model serving?

Select 2 answers
A.Batch prediction support out of the box.
B.Integrated monitoring for prediction latency and error rates.
C.Automatic scaling based on traffic.
D.Automatic model retraining when drift is detected.
E.Built-in support for A/B testing without any additional configuration.
AnswersB, C

Vertex AI endpoints integrate with Cloud Monitoring for operational metrics.

Why this answer

Vertex AI Endpoints provide integrated monitoring for prediction latency and error rates out of the box, enabling you to track model performance and detect anomalies without additional instrumentation. This is a core operational feature that helps maintain service-level objectives (SLOs) and quickly identify degradation in production.

Exam trap

Google Cloud often tests the distinction between features that are 'built-in' versus those that require separate services or additional configuration, so candidates mistakenly assume batch prediction or automatic retraining are part of Endpoints when they are actually separate Vertex AI components.

119
Multi-Selecthard

A team is deploying a complex model with multiple preprocessing steps. They want to ensure consistent preprocessing during training and serving. Which three approaches can achieve this? (Select 3)

Select 3 answers
A.Store preprocessing logic in a shared Python module
B.Use a separate preprocessing service called from the model
C.Use two separate pipelines for training and serving
D.Use Vertex AI Feature Transform Engine
E.Embed preprocessing logic in the model graph
AnswersA, D, E

A shared module ensures the same code is used in training and serving if properly versioned.

Why this answer

Option A is correct because storing preprocessing logic in a shared Python module ensures that the same code is used during both training and serving, eliminating drift between environments. This approach leverages version control and dependency management to guarantee consistency, which is critical for reproducibility in production ML pipelines.

Exam trap

Google Cloud often tests the misconception that a separate preprocessing service (Option B) is a good architectural pattern for consistency, when in fact it introduces a single point of failure and versioning complexity that undermines the goal of identical preprocessing.

120
MCQhard

A company has a batch prediction job that runs daily using AI Platform Batch Prediction. The job uses a TensorFlow model and processes 10 GB of data. Recently, the job started failing with the error 'The replica worker 0 exited with a non-zero exit code: Out of memory'. Which action should the team take to resolve this without rewriting the model?

A.Increase the number of workers (parallelism) to distribute the data across more machines.
B.Use a machine type with more memory, such as n1-highmem-8.
C.Reduce the batch size parameter in the prediction job configuration.
D.Optimize the model to use less memory by pruning or quantization.
AnswerB

Directly addresses the out-of-memory error by providing more RAM per worker.

Why this answer

The error 'Out of memory' on replica worker 0 indicates that the machine type assigned to the prediction job does not have enough RAM to load the model and process the 10 GB batch. Increasing the machine type to one with more memory (e.g., n1-highmem-8) directly addresses the memory constraint without requiring any code changes. This is the most straightforward fix because AI Platform Batch Prediction allows you to specify machine types in the job configuration, and the error is purely a resource allocation issue.

Exam trap

Google Cloud often tests the distinction between scaling horizontally (adding workers) and scaling vertically (increasing machine resources), where candidates mistakenly assume parallelism solves memory issues, but the error is per-worker memory exhaustion, not throughput.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers (parallelism) distributes the data across more machines but does not increase the memory per worker; each replica still has the same limited memory, so the out-of-memory error would persist on each worker. Option C is wrong because reducing the batch size parameter controls how many predictions are processed per step, which can reduce peak memory usage per request, but the error occurs during model loading or initial data processing, not during per-step prediction; the 10 GB dataset and model size still require sufficient base memory. Option D is wrong because while pruning or quantization could reduce model memory footprint, the question explicitly states 'without rewriting the model,' and these techniques require modifying the model architecture or retraining, which is a form of rewriting.

121
MCQmedium

You configured a model deployment monitor on your Vertex AI endpoint as shown. What will happen when the feature 'age' has a skew of 0.4?

A.An alert will be sent to admin@example.com
B.The endpoint will automatically roll back to a previous model version
C.No alert will be sent because the skew threshold is 0.2 for income
D.An alert will be sent only if both features exceed their thresholds
AnswerA

Skew 0.4 exceeds threshold 0.3 for age.

Why this answer

Option A is correct because the monitoring configuration shows an alert threshold of 0.3 for the feature 'age', and a skew of 0.4 exceeds that threshold. Vertex AI Model Monitoring will trigger the configured alert action, which in this case is sending an email to admin@example.com. The alert is based on the specific feature's threshold, not on any other feature's threshold.

Exam trap

Google Cloud often tests the misconception that alerts require multiple features to exceed thresholds or that the system can automatically roll back models, when in reality each feature is evaluated independently and only notifications are sent.

How to eliminate wrong answers

Option B is wrong because Vertex AI Model Monitoring does not automatically roll back model deployments; it only sends alerts based on configured actions, and auto-rollback is not a supported feature in this context. Option C is wrong because the 0.2 threshold applies to 'income', not 'age'; the skew for 'age' is 0.4, which exceeds its own threshold of 0.3, so an alert will be sent regardless of the 'income' feature's threshold. Option D is wrong because the alert is triggered per feature when its individual threshold is exceeded; there is no requirement for both features to exceed their thresholds simultaneously.
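The per-feature evaluation can be sketched as follows. This is a minimal illustration, not the Model Monitoring API: the function name is hypothetical, and the threshold values are illustrative because the exhibit's configuration is not reproduced here.

```python
def features_exceeding_threshold(observed_skew, thresholds):
    """Return the sorted features whose observed skew is above their own threshold."""
    return sorted(
        name
        for name, skew in observed_skew.items()
        if name in thresholds and skew > thresholds[name]
    )

# Illustrative values only; each feature is checked against its own threshold.
thresholds = {"age": 0.3, "income": 0.2}
alerts = features_exceeding_threshold({"age": 0.4, "income": 0.1}, thresholds)
```

Here only `age` is reported, even though `income` stays under its threshold, which mirrors why option D does not hold.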

122
MCQmedium

A Cloud Build pipeline is set up to train a model on Vertex AI. The build fails with the error: 'ERROR: (gcloud.ai-platform.jobs.submit.training) NOT_FOUND: The parent project does not exist.' The project ID and the service account are correctly configured. What is the most likely cause?

A.The region specified for the training job does not exist.
B.The training job requires a GPU, which is not available in the specified region.
C.The Cloud Build service account does not have the aiplatform.jobs.create permission on the project.
D.The training package is not uploaded to Cloud Storage before the pipeline runs.
AnswerC

Insufficient permissions can cause the project to appear as not found to the service account.

Why this answer

The error 'NOT_FOUND: The parent project does not exist' indicates that the Cloud Build service account lacks the necessary IAM permission to submit a training job to Vertex AI. Even though the project ID and service account are correctly configured, the Cloud Build service account must have the 'aiplatform.jobs.create' permission (or the 'Vertex AI User' role) on the project. Without this, the API call fails because the service account is not authorized to access the project resource.

Exam trap

Google Cloud often tests the misconception that a 'NOT_FOUND' error always means a missing resource (like a project ID or region), when in fact it can indicate an IAM permission issue where the service account is not authorized to see or use the project.

How to eliminate wrong answers

Option A is wrong because an invalid region would produce a different error, such as 'INVALID_ARGUMENT: Region not found' or 'PERMISSION_DENIED', not 'NOT_FOUND: The parent project does not exist'. Option B is wrong because GPU availability issues would result in a 'RESOURCE_EXHAUSTED' or 'ZONE_RESOURCE_POOL_EXHAUSTED' error, not a project-level not found error. Option D is wrong because a missing training package in Cloud Storage would cause a 'FILE_NOT_FOUND' or 'INVALID_ARGUMENT' error during job submission, not a project not found error.

123
MCQmedium

A company has deployed a machine learning model to AI Platform Prediction. The model uses a custom container with a TensorFlow SavedModel. After deployment, the prediction latency is higher than expected. Which action is most likely to reduce latency without significantly impacting model accuracy?

A.Convert the model to TensorFlow Lite and use a smaller model.
B.Increase the number of prediction nodes in the AI Platform Prediction cluster.
C.Enable XLA (Accelerated Linear Algebra) compilation on model loading.
D.Apply quantization to the model weights to reduce size.
AnswerC

XLA compiles and optimizes the TensorFlow graph, often improving latency without affecting accuracy.

Why this answer

Option C is correct because enabling XLA compilation when the model loads can optimize the computational graph for better performance with no accuracy loss. Options A and D risk reducing accuracy by shrinking or quantizing the model, and option B adds serving capacity without lowering the latency of each request.

124
Multi-Selectmedium

Which THREE Google Cloud services are typically used together in a production ML pipeline?

Select 3 answers
A.Cloud Storage
B.Cloud Functions
C.Vertex AI Training
D.Vertex AI Prediction
E.BigQuery
AnswersA, C, D

For storing training data, model artifacts, etc.

Why this answer

Cloud Storage is correct because it serves as the central artifact repository in a production ML pipeline on Google Cloud. It stores training data, model artifacts, and prediction inputs/outputs, enabling seamless integration with Vertex AI Training for model training and Vertex AI Prediction for serving. Without Cloud Storage, there is no durable, scalable, and cost-effective way to manage the large datasets and model binaries required for production ML workflows.

Exam trap

The trap here is that candidates confuse 'services used in an ML pipeline' with 'services that can be used somewhere in ML' — Cloud Functions and BigQuery are often used in ML workflows (e.g., triggering retraining or storing features), but they are not the three core services that are typically used together in a production ML pipeline for training, storing artifacts, and serving predictions.

125
MCQhard

Your Vertex AI model deployed on an endpoint is experiencing high tail latency during online predictions. The model uses a large embedding layer, and the input size varies. You have enabled automatic scaling with a minimum of 2 replicas and maximum of 10. What is the most likely cause of the latency spikes and the best first step to diagnose?

A.The model's SavedModel is too large due to the embedding layer; reduce embedding dimensions to lower latency.
B.The endpoint's target CPU utilization might be set too low, causing rapid scale-down and cold starts. Check Cloud Logging for scaling events.
C.The model uses a custom prediction routine that is not optimized; use tf.function to improve performance.
D.Enable model monitoring for online prediction and add a buffer to the endpoint's machine type.
AnswerB

If target utilization is low, replicas scale down quickly; cold starts on new requests cause latency. Logs show scaling.

Why this answer

High tail latency with variable input sizes and a large embedding layer often points to cold starts from aggressive scaling. When the target CPU utilization is set too low, the endpoint scales down quickly during lulls, and a subsequent burst of requests forces new replicas to spin up, causing latency spikes. Checking Cloud Logging for scaling events is the best first step because it directly reveals whether the endpoint is scaling down and then experiencing cold starts.

Exam trap

Google Cloud often tests the misconception that high tail latency is always due to model size or inference optimization, when in fact the most common cause in managed serving environments is autoscaling misconfiguration leading to cold starts.

How to eliminate wrong answers

Option A is wrong because reducing embedding dimensions would lower model accuracy and does not address the root cause of latency spikes from scaling dynamics; the model size is not the primary driver of tail latency in this scenario. Option C is wrong because while a custom prediction routine could be suboptimal, the question describes a standard model with a large embedding layer and variable input size, and the latency pattern (spikes) is more characteristic of cold starts than of per-request optimization issues; tf.function would help steady-state performance but not sudden spikes. Option D is wrong because model monitoring detects drift or anomalies but does not diagnose scaling-related latency, and adding a buffer to the machine type (e.g., increasing memory) does not fix the scaling policy that causes cold starts.

126
Multi-Selectmedium

Which THREE components are typically part of a Vertex AI Pipeline for automated model retraining and deployment?

Select 3 answers
A.Cloud Monitoring alerting component
B.Cloud Storage artifact storage component
C.Training component (e.g., CustomContainerTrainingJob)
D.Model evaluation component (e.g., evaluating on a test set)
E.Deployment component (e.g., deploying model to endpoint)
AnswersC, D, E

Training is the core step.

Why this answer

Option C is correct because a training component, such as a `CustomContainerTrainingJob`, is the core step in a Vertex AI Pipeline that executes the model training logic. It defines the container image, machine configuration, and hyperparameters, enabling automated retraining when triggered by a schedule or event.

Exam trap

Google Cloud often tests the distinction between pipeline components (which are executable tasks in the DAG) and supporting infrastructure (like Cloud Monitoring or Cloud Storage), leading candidates to select options that are related to the pipeline's operation but not actual components within the pipeline definition.

127
MCQhard

A healthcare startup is deploying a natural language processing (NLP) model for extracting medical entities from clinical notes. The model is a fine-tuned BERT model served on Vertex AI Prediction using a custom container. The team observes that prediction latency is around 500ms per request, but they need to handle up to 100 requests per second (QPS) with end-to-end latency under 200ms. The model currently runs on n1-standard-4 machines (4 vCPU, 15 GB memory). During load testing, CPU utilization reaches 90% and memory usage is 12 GB. The team is considering options to meet the requirements. Which action should they take?

A.Use a machine type with a GPU, such as n1-standard-4 with an NVIDIA Tesla T4 accelerator, and optimize the model with TensorRT.
B.Switch to n1-highmem-4 machines to provide more memory for the model.
C.Deploy the model using TensorFlow Serving with CPU-only nodes and increase the number of replicas.
D.Move the model to Cloud Run with automatic scaling to handle the QPS.
AnswerA

GPU accelerates BERT inference and TensorRT further optimizes latency.

Why this answer

Option A is correct because the bottleneck is CPU-bound inference (90% CPU utilization) with memory well within limits (12 GB of 15 GB). Adding a GPU (NVIDIA Tesla T4) and optimizing with TensorRT reduces per-request latency via hardware acceleration and graph optimizations, enabling sub-200ms inference at 100 QPS. This directly addresses the latency requirement without changing the machine family or scaling strategy.

Exam trap

Google Cloud often tests the misconception that scaling horizontally (more replicas or Cloud Run) solves latency problems, when the real issue is per-request compute bottleneck that requires hardware acceleration or model optimization.

How to eliminate wrong answers

Option B is wrong because memory is not the bottleneck (12 GB used out of 15 GB); increasing memory does not reduce CPU-bound inference latency. Option C is wrong because TensorFlow Serving on CPU-only nodes still relies on CPU compute; the CPUs are already saturated, so adding replicas raises throughput and cost but leaves each request's inference time near 500ms, well above the 200ms target. Option D is wrong because Cloud Run's automatic scaling handles QPS but does not reduce per-request latency; the model's inference time remains CPU-bound, and Cloud Run's cold starts and CPU-only instances would not meet the 200ms latency target.
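The distinction between throughput and per-request latency can be made concrete with Little's law (average in-flight requests = arrival rate x time in system). The sketch below uses the numbers from the question; the function names and the concurrency of 4 requests per replica are assumptions for illustration.

```python
import math

def in_flight_requests(qps, latency_ms):
    """Little's law: average concurrent requests = arrival rate x time in system."""
    return qps * latency_ms / 1000

def replicas_needed(qps, latency_ms, concurrency_per_replica):
    """Replicas required to hold that many in-flight requests (rounded up)."""
    return math.ceil(in_flight_requests(qps, latency_ms) / concurrency_per_replica)

current = in_flight_requests(100, 500)  # measured 500 ms per request
target = in_flight_requests(100, 200)   # required 200 ms ceiling
# Assumed concurrency of 4 per replica (one request per vCPU on n1-standard-4).
replicas_at_current_latency = replicas_needed(100, 500, 4)
```

More replicas can absorb the 50 concurrent requests implied by 100 QPS at 500ms, but each request still takes 500ms; only faster inference brings latency under 200ms.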

128
Multi-Selecthard

Which THREE steps are essential for implementing a continuous training pipeline with Vertex AI?

Select 3 answers
A.If the new model passes evaluation, deploy it to a production endpoint.
B.Manually approve each new model version before deployment.
C.Deploy the original model once and set it to auto-update.
D.Set up a trigger to start a training pipeline when new training data is available (e.g., via Cloud Storage events).
E.Include a step in the pipeline that evaluates the new model against a validation set.
AnswersA, D, E

Automated deployment upon passing evaluation completes the continuous pipeline.

Why this answer

A continuous training pipeline automates retraining, evaluation, and deployment when new data or model improvements arrive. Manual approval (option B) is optional, not essential. A one-time deployment set to auto-update (option C) is not continuous training.

The three essential steps are: trigger the pipeline when new training data is available (D), evaluate the new model against a validation set (E), and deploy it to production if it passes (A).
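The trigger, evaluate, and conditional-deploy flow can be sketched in plain Python. This is not a Vertex AI Pipelines definition: every name below is hypothetical, and the `train`, `evaluate`, and `deploy` callables stand in for real pipeline components.

```python
def should_deploy(candidate_metric, baseline_metric, min_improvement=0.0):
    """Promote the candidate only if it matches or beats the baseline by the margin."""
    return candidate_metric >= baseline_metric + min_improvement

def run_pipeline(new_data_available, train, evaluate, deploy, baseline_metric):
    """Trigger -> train -> evaluate -> conditional deploy. Returns the steps executed."""
    steps = []
    if not new_data_available:  # the trigger did not fire
        return steps
    model = train()
    steps.append("train")
    metric = evaluate(model)
    steps.append("evaluate")
    if should_deploy(metric, baseline_metric):
        deploy(model)
        steps.append("deploy")
    return steps
```

For example, `run_pipeline(True, lambda: "m", lambda m: 0.9, lambda m: None, 0.85)` runs all three steps, while a candidate scoring below the baseline stops after evaluation.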

129
MCQmedium

A team notices that the latency for online predictions from a Vertex AI endpoint has increased significantly over the past hour. The model is a large TensorFlow model deployed with automatic scaling (minReplicaCount=2, maxReplicaCount=10). The CPU utilization of the deployed instances is consistently above 85%. What is the most likely cause of the increased latency?

A.The network latency between the client and the endpoint has increased due to regional issues.
B.The model is deployed with GPU acceleration, but the instances are using incorrect CUDA drivers.
C.The model is too large for the instance memory, causing disk swapping.
D.The model is CPU-bound, and the current replicas are saturated, causing queuing.
AnswerD

High CPU utilization indicates the replicas are at capacity, leading to request queuing and higher latency.

Why this answer

The correct answer is D because the consistently high CPU utilization (above 85%) indicates that the existing replicas are saturated, unable to process incoming requests quickly enough. When all replicas are busy, new requests are queued, which directly increases latency. Automatic scaling can add more replicas up to maxReplicaCount=10, but if the scaling is slow or the traffic spike is sudden, queuing occurs first, causing the observed latency increase.
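Why saturation shows up as latency can be illustrated with a simple queueing model. The sketch below assumes an M/M/1 queue, which is a textbook simplification rather than a description of how Vertex AI replicas behave, and it treats CPU utilization as a stand-in for server utilization.

```python
def mean_response_time(service_time_ms, utilization):
    """M/M/1 mean time in system: service time divided by (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)
```

Under this model a request with a 100ms service time averages about 200ms at 50% utilization but over 600ms at 85%, because most of the time is spent waiting in the queue.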

Exam trap

Google Cloud often tests the distinction between symptoms of CPU saturation (queuing/latency) versus memory or GPU issues; the trap here is that candidates may incorrectly attribute latency to network or hardware driver problems when the clear indicator is sustained high CPU utilization on existing instances.

How to eliminate wrong answers

Option A is wrong because network latency between client and endpoint is not indicated by CPU utilization of deployed instances; regional network issues would affect all requests uniformly, not correlate with high CPU. Option B is wrong because incorrect CUDA drivers would cause GPU-related errors or failures, not consistently high CPU utilization; the model would likely fail to run or produce errors, not just increase latency. Option C is wrong because disk swapping due to insufficient memory would manifest as high disk I/O and memory pressure, not primarily high CPU utilization; the symptom described is CPU-bound, not memory-bound.

130
MCQhard

A company uses Vertex AI Feature Store for serving features. They have a high-throughput online serving requirement. Which configuration should they use?

A.Cloud Storage with high-memory instances
B.Bigtable as serving source
C.Firestore
D.Vertex AI Feature Store with online serving enabled
AnswerD

Vertex AI Feature Store is purpose-built for high-throughput online feature serving.

Why this answer

Vertex AI Feature Store with online serving enabled is the correct choice because it is specifically designed for low-latency, high-throughput retrieval of feature values for online predictions. It uses a managed Bigtable backend optimized for real-time serving, ensuring consistent performance under high request loads without requiring manual infrastructure management.

Exam trap

Google Cloud often tests the misconception that any low-latency database (like Bigtable or Firestore) can directly replace Vertex AI Feature Store, ignoring the managed orchestration, feature registry, and point-in-time lookup capabilities that are essential for consistent online serving in ML workflows.

How to eliminate wrong answers

Option A is wrong because Cloud Storage is a blob storage service with high latency and no indexing for real-time feature lookups, making it unsuitable for high-throughput online serving. Option B is wrong because Bigtable is a NoSQL database that can serve features, but it requires manual configuration, scaling, and integration with Vertex AI Feature Store, whereas the Feature Store provides a managed, optimized serving layer with built-in consistency and monitoring. Option C is wrong because Firestore is a document database designed for mobile and web apps with moderate throughput, not for the sub-millisecond latency and high concurrency required by ML feature serving at scale.

131
Multi-Selecteasy

A data engineering team is operationalizing a machine learning model for real-time fraud detection. The model must process transactions with sub-100ms latency and be highly available. Which TWO strategies should the team implement?

Select 2 answers
A.Deploy the model to multiple Google Cloud regions for failover.
B.Deploy the model to a single zone to minimize cross-zone latency.
C.Use Cloud Batch for asynchronous prediction.
D.Optimize the model by pruning or quantizing to reduce size.
E.Store the model in Cloud Storage and load it on each request.
AnswersA, D

Why this answer

Deploying the model to multiple Google Cloud regions ensures high availability and failover capability. If one region becomes unavailable, traffic can be routed to another region, maintaining sub-100ms latency by using regional load balancing and Cloud DNS. This aligns with the requirement for a highly available, low-latency fraud detection system.

Exam trap

Google Cloud often tests the misconception that single-zone deployment minimizes latency, but the real trade-off is between availability and negligible intra-region latency, making multi-region deployment the correct choice for high availability.

132
MCQmedium

A team uses Vertex AI AutoML Tables to train a model. They need to deploy the model for real-time predictions with high availability. Which deployment configuration should they use?

A.Export as a Cloud Function
B.Deploy to a Vertex AI Endpoint with 1 replica
C.Use a Vertex AI Batch Prediction job
D.Deploy to a Vertex AI Endpoint with multiple replicas and auto-scaling
AnswerD

Multiple replicas provide HA.

Why this answer

For real-time predictions with high availability, you need a deployment that can handle traffic spikes and failover. Deploying to a Vertex AI Endpoint with multiple replicas and auto-scaling ensures that the model is served from multiple instances, providing redundancy and the ability to scale up or down based on demand. This configuration meets the high-availability requirement by distributing load and automatically recovering from instance failures.

Exam trap

The trap here is that candidates often confuse batch prediction with real-time serving, or assume that a single replica is sufficient for high availability, not realizing that high availability requires redundancy and automatic scaling.

How to eliminate wrong answers

Option A is wrong because exporting as a Cloud Function is not a deployment method for Vertex AI AutoML Tables models; Cloud Functions are for serverless event-driven code, not for hosting ML model endpoints with real-time prediction capabilities. Option B is wrong because deploying to a Vertex AI Endpoint with only 1 replica provides no redundancy or high availability; if that single instance fails or becomes overloaded, predictions will be unavailable. Option C is wrong because a Vertex AI Batch Prediction job is designed for asynchronous, offline predictions on large datasets, not for real-time, low-latency serving.

133
Multi-Selectmedium

A team needs to optimize online prediction cost for a model that has unpredictable traffic spikes. Which TWO strategies are most effective?

Select 2 answers
A.Enable autoscaling with a low min_replica_count and high max_replica_count
B.Set up Model Monitoring to trigger scaling
C.Deploy the model on a single high-memory machine
D.Use a smaller model version
E.Use batch prediction during high traffic
AnswersA, D

Autoscaling provides elasticity, scaling from a low base to handle spikes.

Why this answer

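Option A is correct because autoscaling with a low min_replica_count and high max_replica_count allows the deployment to handle unpredictable traffic spikes by dynamically adjusting the number of replicas. This ensures cost efficiency during low traffic while providing capacity to scale out rapidly when demand surges, a key requirement for online prediction serving. Option D is also correct because a smaller model version needs less compute per prediction, so each replica can serve more requests or run on a cheaper machine type, which lowers the cost of every replica the autoscaler adds.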
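The cost effect of a low minimum and high maximum can be sketched with some arithmetic. This is a minimal illustration, not Vertex AI pricing: the per-replica capacity, the hourly price, and the traffic profile are all made-up assumptions.

```python
import math

def replicas_needed(requests_per_sec, capacity_per_replica, min_replicas, max_replicas):
    """Replicas an autoscaler would target for a given load, clamped to [min, max]."""
    wanted = math.ceil(requests_per_sec / capacity_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

def daily_cost(hourly_traffic, capacity_per_replica, min_replicas, max_replicas,
               price_per_replica_hour):
    """Sum replica-hours over hourly traffic samples and price them."""
    replica_hours = sum(
        replicas_needed(rps, capacity_per_replica, min_replicas, max_replicas)
        for rps in hourly_traffic
    )
    return replica_hours * price_per_replica_hour

# Hypothetical day: quiet for 22 hours, then a two-hour spike.
traffic = [5] * 22 + [200] * 2

# Autoscaled from 1 to 10 replicas versus always provisioned for the peak.
autoscaled = daily_cost(traffic, 20, min_replicas=1, max_replicas=10,
                        price_per_replica_hour=1.0)
fixed_peak = daily_cost(traffic, 20, min_replicas=10, max_replicas=10,
                        price_per_replica_hour=1.0)
print(autoscaled, fixed_peak)
```

Under these assumed numbers the autoscaled deployment pays for far fewer replica-hours than one sized permanently for the spike, while still reaching the same peak capacity.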

Exam trap

Google Cloud often tests the distinction between monitoring (observability) and scaling (infrastructure action), leading candidates to incorrectly select Model Monitoring as a scaling trigger.

134
MCQmedium

Refer to the exhibit. A data scientist deploys a model using this configuration. Users report that after a few hours of inactivity, the first prediction request takes over 30 seconds. What is the most likely cause?

A.The automatic scaling configuration allows scaling down to zero replicas, causing a cold start on the first request.
B.The network latency between the client and the endpoint is high due to regional distance.
C.The endpoint is misconfigured with the wrong regional endpoint.
D.The model is too large and exceeds the instance memory.
AnswerA

minReplicaCount: 0 permits scaling to zero, and after inactivity, the first request must wait for a new replica to start.

Why this answer

Option A is correct because the automatic scaling configuration that allows scaling down to zero replicas means that after a period of inactivity, all model replicas are terminated. When a new prediction request arrives, the endpoint must provision a new replica from scratch, which involves loading the model artifacts, initializing the inference container, and performing health checks. This cold start process typically takes 30 seconds or more, matching the reported behavior.

Exam trap

Google Cloud often tests the distinction between cold start latency (caused by scaling to zero) and persistent performance issues like network latency or resource exhaustion, so candidates must recognize that a delay only after inactivity points to replica provisioning, not a constant problem.

How to eliminate wrong answers

Option B is wrong because network latency due to regional distance would cause consistent high latency on every request, not just the first request after a period of inactivity. Option C is wrong because a misconfigured regional endpoint would result in persistent errors or high latency on all requests, not a delay only after inactivity. Option D is wrong because if the model exceeded instance memory, the endpoint would fail to serve predictions consistently or return out-of-memory errors, not exhibit a delay only on the first request after inactivity.

135
MCQeasy

A team has multiple versions of a model and wants to manage them centrally, including tracking metadata and promoting versions to production. Which tool should they use?

A.Cloud Storage
B.BigQuery
C.GitHub
D.Vertex AI Model Registry
AnswerD

Centralized model versioning and metadata.

Why this answer

Vertex AI Model Registry is designed for managing model versions, metadata, and deployment. Cloud Storage is storage only. BigQuery is for analytics. GitHub is for source code.

136
Multi-Selecteasy

Which TWO actions can help reduce prediction latency for a Vertex AI endpoint?

Select 2 answers
A.Increase the number of features
B.Optimize the model architecture to reduce size
C.Use a custom prediction container with optimized dependencies
D.Use a larger machine type with more vCPUs
E.Set min replicas to 0 to save cost
AnswersB, C

Smaller models predict faster.

Why this answer

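Optimizing the model architecture to reduce size directly decreases the computational load during inference, which lowers prediction latency. Smaller models require fewer floating-point operations (FLOPs) per prediction, enabling faster response times on Vertex AI endpoints. A custom prediction container with optimized dependencies helps for a related reason: a leaner serving stack adds less overhead around each model call.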

Exam trap

Google Cloud often tests the misconception that adding more compute resources (larger machine types) is the way to reduce latency. A larger machine raises cost without addressing the root cause of slow inference named in the correct options, which is model complexity and serving overhead.

137
Multi-Selectmedium

A company deploys an ML model using Vertex AI Pipelines. They want to ensure reproducibility and traceability. Which TWO practices should they implement?

Select 2 answers
A.Pin all dependency versions
B.Record dataset version using Vertex AI Dataset
C.Use custom containers for every step
D.Store pipeline run metadata in Vertex AI Experiments
E.Use Kubeflow Pipelines instead
AnswersA, D

Pinning versions ensures consistent environments across runs.

Why this answer

Pinning all dependency versions (Option A) ensures that every pipeline run uses the exact same library versions, eliminating variability from package updates. This is a fundamental practice for reproducibility because even a minor version bump can change model behavior or break code. In Vertex AI Pipelines, dependencies are typically specified in a `requirements.txt` or `Dockerfile`, and pinning them (e.g., `tensorflow==2.12.0`) guarantees consistent execution environments across runs. Storing pipeline run metadata in Vertex AI Experiments (Option D) covers traceability: each run's parameters, metrics, and artifacts are recorded so a result can be traced back to the run that produced it.
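As a small illustration of the pinning practice, the sketch below flags requirement lines that are not pinned with `==`. The function name and the sample file contents are hypothetical, and the pattern is deliberately simple (it does not handle URLs, environment markers, or hashes).

```python
import re

# Matches "name==version", optionally with extras such as "pkg[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+$")

def unpinned_requirements(requirements_text):
    """Return requirement lines that are not pinned to an exact version."""
    problems = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems

example = """
tensorflow==2.12.0
pandas>=1.5        # floating lower bound, not reproducible
scikit-learn
"""
print(unpinned_requirements(example))
```

A check like this could run in CI before a pipeline image is built, so an unpinned dependency is caught before it changes a run's environment.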

Exam trap

Google Cloud often tests the misconception that dataset versioning (Option B) is a core requirement for reproducibility in Vertex AI Pipelines, but the exam emphasizes that dependency pinning and experiment metadata storage are the two primary practices for ensuring reproducibility and traceability in ML pipelines.

138
MCQhard

A healthcare company uses Vertex AI to deploy a medical image classification model. The model is deployed on a private endpoint with automatic scaling (minReplicaCount=2, maxReplicaCount=10). The model uses a custom container with a GPU for inference. Recently, during peak business hours (9 AM - 5 PM), users report that prediction requests frequently time out after 60 seconds, and the error rate increases. The team checks Cloud Monitoring and observes that CPU utilization averages 40%, GPU utilization averages 30%, and the number of replicas stays at 2. There are no errors in the container logs. The model serves a few hundred requests per second during peak. The team suspects the issue is not resource saturation but something else. What should they do to resolve the problem?

A.Switch from online prediction to batch prediction using Vertex AI Batch Prediction.
B.Increase the minReplicaCount to 5 to ensure more replicas are always available.
C.Increase the request timeout setting on the load balancer to 120 seconds.
D.Optimize the prediction container to handle requests faster by reducing image pre-processing and using async I/O.
AnswerD

Improving request handling efficiency directly addresses the timeout. Likely the container is blocking on I/O or serialization.

Why this answer

Option D is correct because the symptoms—low CPU/GPU utilization, replicas stuck at 2, and timeouts—indicate that the container is taking too long to process each request, not that resources are saturated. Optimizing the container (e.g., reducing image pre-processing, using async I/O) reduces per-request latency, allowing the model to handle the same request rate within the 60-second timeout. This directly addresses the root cause without changing scaling or timeout settings.
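The async I/O idea can be sketched with `asyncio`. This is only an illustration of the pattern: `fetch_image` is a hypothetical stand-in for an I/O-bound step (for example, downloading an image), simulated here with a short sleep, and the real container's handler would look different.

```python
import asyncio

async def fetch_image(request_id):
    """Hypothetical I/O-bound step, simulated with a non-blocking sleep."""
    await asyncio.sleep(0.1)
    return f"image-{request_id}"

async def handle_request(request_id):
    # Awaiting yields control, so other requests progress while this one waits.
    image = await fetch_image(request_id)
    return f"prediction-for-{image}"

async def serve(request_ids):
    # All requests wait on I/O concurrently instead of one after another.
    return await asyncio.gather(*(handle_request(r) for r in request_ids))

results = asyncio.run(serve(range(5)))
print(results)
```

With blocking I/O the five waits would add up; here they overlap, which is why a container can show low CPU and GPU utilization and still benefit from this change.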

Exam trap

The trap here is that candidates assume low resource utilization means the system is under-provisioned (leading them to increase replicas or timeout), when in fact the bottleneck is per-request latency within the container, which autoscaling cannot fix.

How to eliminate wrong answers

Option A is wrong because switching to batch prediction would not solve real-time inference timeouts; batch prediction is for offline, non-latency-sensitive workloads and would break the real-time requirement. Option B is wrong because increasing minReplicaCount to 5 does not address the fact that existing replicas are underutilized (30-40% CPU/GPU) and requests are timing out due to slow processing, not lack of replicas. Option C is wrong because increasing the load balancer timeout to 120 seconds would only mask the symptom; the container still cannot process requests fast enough, and the underlying latency issue would persist, potentially causing cascading failures.

139
MCQeasy

A startup is deploying a machine learning model for real-time fraud detection. They need low latency and automatic scaling during peak hours. Which Google Cloud service should they use?

A.Cloud Functions
B.Batch Prediction on Vertex AI
C.Cloud AI Platform Prediction with custom containers
D.Vertex AI Endpoints
AnswerD

Vertex AI Endpoints provides managed online prediction with automatic scaling and low latency.

Why this answer

Vertex AI Endpoints is the managed service for online predictions with autoscaling, ideal for real-time low-latency requirements.

140
MCQhard

Your company uses Vertex AI Pipelines to automate the ML lifecycle. The pipeline includes training, evaluation, and deployment steps. You want to ensure that if a pipeline run fails due to a transient error (e.g., resource quota shortage), it automatically retries before marking the run as failed. What is the best way to implement this?

A.Configure Vertex AI Pipelines to automatically restart failed runs.
B.In the pipeline component code, implement retry logic using exponential backoff for specific exceptions.
C.Set a high timeout value for the pipeline so that transient errors resolve before timeout.
D.Use Cloud Tasks to schedule pipeline runs and retry upon failure.
AnswerB

Retrying within the component handles transient failures gracefully without failing the entire pipeline.

Why this answer

Vertex AI Pipelines does not restart a failed run on its own, so transient errors have to be handled at the step level. You can wrap each step's logic to catch transient errors and retry with exponential backoff, or use a retry mechanism in the container itself. A Kubeflow Pipelines retry policy can also be specified on a step. Modifying the pipeline component code is the most direct way.
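A minimal sketch of retry logic with exponential backoff, as it might be placed inside a component. `TransientError` and `flaky_step` are hypothetical names for illustration; in a real component you would catch the specific exception types that signal a transient condition.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, such as a quota shortage."""

def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func, retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the step fail
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))  # small jitter

# Usage: a step that fails twice, then succeeds.
calls = {"n": 0}

def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("quota temporarily exhausted")
    return "done"

delays = []  # record delays instead of actually sleeping
result = retry_with_backoff(flaky_step, sleep=delays.append)
print(result, calls["n"], len(delays))
```

Only the named transient exception is retried; any other error propagates immediately, so genuine bugs still fail the step quickly.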

141
MCQeasy

A startup is using Cloud Build to automate the training and deployment of their machine learning models. The workflow is defined in cloudbuild.yaml and includes steps to: 1) Run a training job on AI Platform Training, 2) Build a custom prediction container, 3) Deploy the container to Cloud Run for serving. The deployment step fails intermittently with the error: 'Cloud Run service already exists and is not owned by the calling user.' You need to fix this so that deployments are reliable. What should you do?

A.Ensure the Cloud Build service account has the 'run.services.update' permission on the Cloud Run service.
B.Delete the existing Cloud Run service manually before each build.
C.Use 'gcloud run deploy --replace' in the build step to force replace the existing service.
D.Use Cloud Run for Anthos instead of fully managed Cloud Run to avoid ownership issues.
AnswerA

The error suggests a permissions issue; granting the correct role to the Cloud Build service account resolves it.

142
MCQmedium

A user named Charlie needs to deploy a model to a Vertex AI Endpoint and also create training jobs. Which role should be assigned to Charlie?

A.roles/aiplatform.user
B.roles/owner
C.roles/aiplatform.modelUser
D.roles/editor
AnswerA

aiplatform.user allows creating models, deploying endpoints, and running training jobs.

Why this answer

Correct: A. aiplatform.user includes permissions to create models, deploy to endpoints, and run training jobs. Option C is wrong because modelUser is read-only for predictions. Option D is wrong because editor includes unrelated permissions. Option B is wrong because owner is too broad.

143
MCQmedium

A company has deployed a machine learning model on Vertex AI Prediction that serves real-time predictions for a customer-facing application. The model was trained using a custom container and is hosted on a single endpoint with a minimum number of nodes. Recently, the team noticed that during peak traffic, prediction latency increases significantly and some requests time out. The endpoint is configured with a baseline traffic split of 100% on the current model version. Which action should the team take to reduce latency and improve reliability?

A.Reduce the minimum number of nodes to zero to allow scale-to-zero during low traffic.
B.Place a Google Cloud Load Balancer in front of the Vertex AI endpoint to distribute requests across multiple endpoints.
C.Configure horizontal autoscaling with a higher maximum number of nodes and set a CPU utilization target.
D.Implement A/B testing by splitting traffic between two model versions to distribute load.
AnswerC

Autoscaling allows the endpoint to add nodes during high traffic, reducing latency and preventing timeouts.

Why this answer

Option C is correct because configuring horizontal autoscaling with a higher maximum number of nodes and a CPU utilization target allows Vertex AI Prediction to automatically add more nodes during peak traffic, distributing the inference load and reducing latency. This directly addresses the root cause—insufficient compute resources under high demand—without requiring architectural changes or sacrificing availability.
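How a CPU utilization target turns into a replica count can be sketched with a proportional scaling rule. This is an assumption for illustration (the same shape of rule used by common horizontal autoscalers); the source does not state the exact algorithm Vertex AI uses.

```python
import math

def desired_replicas(current_replicas, observed_cpu, target_cpu,
                     min_replicas, max_replicas):
    """Proportional rule: resize so average utilization approaches the target."""
    raw = math.ceil(current_replicas * observed_cpu / target_cpu)
    return max(min_replicas, min(max_replicas, raw))

# Two nodes at 90% CPU with a 60% target: scale out to three.
print(desired_replicas(2, observed_cpu=90, target_cpu=60,
                       min_replicas=2, max_replicas=10))

# Same load, but the maximum is capped at two: no headroom to add nodes.
print(desired_replicas(2, observed_cpu=90, target_cpu=60,
                       min_replicas=2, max_replicas=2))
```

The second call shows why the maximum matters: with a low ceiling the autoscaler cannot add capacity no matter how far utilization exceeds the target.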

Exam trap

The trap here is that candidates often confuse load balancing (Option B) with autoscaling, thinking that distributing requests across multiple endpoints is the same as adding more compute capacity, but Vertex AI endpoints are single resources that cannot be fronted by a load balancer to increase capacity—they require autoscaling to add nodes.

How to eliminate wrong answers

Option A is wrong because reducing the minimum number of nodes to zero would cause cold starts when traffic arrives, increasing latency rather than reducing it, and scale-to-zero is not suitable for a customer-facing application requiring real-time predictions. Option B is wrong because placing a Google Cloud Load Balancer in front of a single Vertex AI endpoint does not distribute requests across multiple endpoints—it would only add unnecessary network hops and complexity without solving the resource bottleneck. Option D is wrong because A/B testing splits traffic between model versions for evaluation purposes, not for load distribution; it does not increase the total compute capacity available to handle peak traffic.

144
MCQmedium

Your organization uses Vertex AI Feature Store to serve features for a real-time fraud detection model. The model is deployed on a Vertex AI endpoint. After a data pipeline update, the model's online predictions became inconsistent. What is the most likely cause?

A.The model's prediction server is running out of memory.
B.The feature store's online serving values are not synchronized with the batch feature values used during training.
C.The model was retrained with a different training dataset.
D.The online serving endpoint's model version was accidentally rolled back.
AnswerB

If the pipeline update changed how features are computed or stored, online serving might use out-of-sync values, leading to inconsistent predictions.

Why this answer

In Vertex AI Feature Store, batch feature values used during model training and online serving values are stored separately. If a data pipeline update changes the batch feature values but the online serving values are not updated or synchronized, the model will receive different feature values at inference time than it was trained on, leading to inconsistent predictions. This kind of training-serving skew is a common cause of inconsistent predictions after a pipeline change.

Exam trap

The trap here is that candidates may confuse a data pipeline update with a model retraining or version rollback, but the key is recognizing that feature store synchronization between batch and online stores is a distinct operational concern that directly causes prediction inconsistency.

How to eliminate wrong answers

Option A is wrong because running out of memory on the prediction server would cause errors or timeouts, not inconsistent predictions; the model would either fail or produce no output, not produce varying results. Option C is wrong because retraining with a different dataset would produce a new model version, but the question states predictions became inconsistent after a data pipeline update, not after a retraining event; a retrained model would be deployed as a new version, not cause inconsistency in the existing model's outputs. Option D is wrong because a rollback of the model version would revert to a previous consistent state, not introduce inconsistency; the predictions would be consistent with the older model version, not inconsistent.

145
MCQhard

You manage a large-scale machine learning system that recommends products to users. The model is a deep neural network trained on TensorFlow and deployed on Vertex AI Endpoint with global load balancing. The model receives over 10,000 requests per second. Recently, the team added a new feature: the user's current geographic location (latitude/longitude). After deploying the updated model, you notice that the average prediction latency has doubled, and the error rate has increased, particularly for requests from regions far from the model's primary training data (North America). You suspect the location feature is causing issues. What should you do to diagnose and mitigate the problem?

A.Remove the location feature from the model and retrain without it to restore performance.
B.Increase the number of replicas for the endpoint to handle the increased latency.
C.Switch to a regional endpoint in North America to reduce latency for the majority of users.
D.Examine the latency breakdown using Cloud Monitoring to see if the location feature is causing computationally expensive operations, then consider feature engineering like bucketing coordinates.
AnswerD

Understanding the latency source and engineering the feature properly can resolve the issue without sacrificing model accuracy.
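Bucketing coordinates, as mentioned in the correct option, can be sketched as mapping raw latitude/longitude to a coarse grid cell. The cell size and the id format below are arbitrary assumptions for illustration.

```python
import math

def bucket_coordinates(lat, lon, cell_degrees=1.0):
    """Map a latitude/longitude pair to a coarse grid-cell id string."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    row = math.floor((lat + 90.0) / cell_degrees)
    col = math.floor((lon + 180.0) / cell_degrees)
    return f"{row}_{col}"

print(bucket_coordinates(37.77, -122.42))  # one point
print(bucket_coordinates(37.20, -122.90))  # nearby point, same 1-degree cell
```

Nearby points collapse into one categorical value, which keeps the feature cheap to compute and gives regions with sparse training data a shared bucket instead of unseen raw values.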

146
MCQhard

A company is building a continuous training pipeline that retrains a model daily using new data from a feature store. The training data must include features computed up to the timestamp of each training run. Which architecture should be used to ensure time-consistent feature values without label leakage?

A.Train on a fixed window of the most recent features without considering timestamps.
B.Use Vertex AI Feature Store with point-in-time lookup enabled to retrieve features as of the training timestamp.
C.Store all features in a Cloud SQL database and perform a join at training time.
D.Use Pub/Sub to stream new features into Cloud Storage and train on the latest snapshot.
AnswerB

Point-in-time lookups ensure that for each training example, features are retrieved as they existed at the prediction time, preventing leakage.

Why this answer

Option B is correct because Vertex AI Feature Store's point-in-time lookup retrieves the exact feature values as they existed at the specified training timestamp, ensuring time-consistency and preventing label leakage. This mechanism avoids using future data that would not have been available at the time of prediction, which is critical for realistic model evaluation and production performance.
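The idea behind a point-in-time lookup can be sketched in a few lines: for a given timestamp, return the most recent feature value written at or before it, never a later one. This is a conceptual illustration with hypothetical data, not the Feature Store API.

```python
from bisect import bisect_right

def value_as_of(history, as_of):
    """Return the latest feature value with timestamp <= as_of, or None.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None

# Hypothetical feature history for one entity: (timestamp, avg_txn_amount).
history = [(100, 20.0), (200, 35.0), (300, 90.0)]

print(value_as_of(history, 250))  # value written at t=200
print(value_as_of(history, 50))   # nothing was known yet
```

A training example stamped at t=250 sees the value from t=200 and not the one from t=300, which is exactly the future data that would otherwise leak into training.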

Exam trap

Google Cloud often tests the misconception that simply using the most recent data or a snapshot is sufficient for time-consistency, but the key requirement is to retrieve features as of the exact training timestamp to prevent label leakage, which only point-in-time lookup guarantees.

How to eliminate wrong answers

Option A is wrong because training on a fixed window of the most recent features without considering timestamps can introduce label leakage by including future feature values relative to the label timestamp, and it ignores the temporal ordering required for time-series data. Option C is wrong because storing all features in Cloud SQL and performing a join at training time lacks point-in-time semantics, meaning the join may inadvertently use features from after the label timestamp, causing leakage and inconsistent feature values. Option D is wrong because using Pub/Sub to stream new features into Cloud Storage and training on the latest snapshot does not guarantee that features are retrieved as of the exact training timestamp; the snapshot may include data that arrived after the label was generated, leading to leakage.

147
MCQeasy

A team is using Kubeflow Pipelines on Google Kubernetes Engine to orchestrate ML workflows. They need to track parameters, metrics, and artifacts for each run. Which tool should they integrate?

A.Cloud Monitoring
B.Cloud Logging
C.BigQuery
D.Vertex ML Metadata
AnswerD

Vertex ML Metadata is designed to track ML artifacts, parameters, and metrics across pipeline runs.

Why this answer

Vertex ML Metadata is the correct choice because it is purpose-built for tracking parameters, metrics, and artifacts in ML workflows, and it integrates natively with Kubeflow Pipelines on Google Kubernetes Engine. It stores metadata for each pipeline run, enabling lineage tracking, comparison, and reproducibility of experiments.

Exam trap

Google Cloud often tests the distinction between general-purpose monitoring/logging tools and ML-specific metadata stores, so the trap here is that candidates may confuse Cloud Monitoring or Cloud Logging with a tool that can track ML metrics, when in fact they lack the structured schema and lineage capabilities required for ML workflow orchestration.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring is designed for infrastructure and application performance monitoring (e.g., CPU utilization, latency), not for tracking ML-specific parameters, metrics, and artifacts. Option B is wrong because Cloud Logging collects and stores log data (e.g., text logs from applications), not structured ML metadata like hyperparameters or model artifacts. Option C is wrong because BigQuery is a serverless data warehouse for analytical queries on large datasets, not a metadata store for ML pipeline runs.

148
MCQmedium

A retail company needs to generate product recommendations for millions of users every few hours. The model is a small scikit-learn model. Which prediction method should be used to minimize infrastructure cost while meeting the latency requirements?

A.Use Cloud Run to host the model and invoke it for each user request.
B.Export the model as a container and run on Google Kubernetes Engine with cluster autoscaling.
C.Deploy the model to a Vertex AI endpoint with a single replica for online predictions.
D.Use a Vertex AI batch prediction job that reads from BigQuery and writes results back to BigQuery or Cloud Storage.
AnswerD

Batch prediction is designed for such use cases and is cost-efficient for large datasets processed periodically.

Why this answer

Option D is correct because batch prediction is the most cost-effective approach for generating recommendations for millions of users every few hours. Vertex AI batch prediction jobs process large datasets in parallel without maintaining always-on infrastructure, and they can read from BigQuery and write results directly to BigQuery or Cloud Storage, minimizing compute costs while meeting the latency requirement of 'every few hours' (not real-time).

Exam trap

Google Cloud often tests the distinction between online (real-time) and batch (asynchronous) prediction patterns, and the trap here is that candidates assume 'predictions' always require a live endpoint, overlooking that batch jobs are the correct choice when latency requirements are in hours and the workload is massive and periodic.

How to eliminate wrong answers

Option A is wrong because Cloud Run invokes the model per user request, which would require millions of individual invocations every few hours, leading to high request-based costs and potential cold-start latency issues that are unnecessary for a batch workload. Option B is wrong because Google Kubernetes Engine with cluster autoscaling is overkill for a small scikit-learn model and introduces cluster management overhead and always-on node costs, even with autoscaling, making it more expensive than a serverless batch solution. Option C is wrong because a Vertex AI endpoint with a single replica is designed for online (real-time) predictions, which would be idle most of the time between the batch windows, incurring continuous compute costs for a single replica that is not needed for a scheduled batch job.

149
MCQhard

A financial services company uses Vertex AI to serve a fraud detection model. The model was trained on historical data that is updated daily. The team wants to automate retraining when data drift is detected. Which approach best operationalizes this requirement with minimal manual intervention?

A.Use Cloud Monitoring alerts on prediction latency to trigger a retraining pipeline.
B.Manually monitor model performance metrics in Vertex AI Experiments and retrain when accuracy drops.
C.Use scheduled Vertex AI Pipelines to retrain the model every night, then deploy automatically.
D.Enable Vertex AI Model Monitoring for feature drift and skew, then create a Cloud Function that triggers a Vertex AI Pipeline to retrain and deploy the model after validation.
AnswerD

This automates detection of data drift, triggers retraining only when needed, and includes validation before deployment.

Why this answer

Option D is correct because it uses Vertex AI Model Monitoring to automatically detect feature drift or skew, then triggers a Cloud Function that invokes a Vertex AI Pipeline to retrain and redeploy the model after validation. This approach minimizes manual intervention by automating both the detection of data drift and the subsequent retraining and deployment lifecycle.
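The detect-then-trigger flow can be sketched as follows. This is a simplified stand-in: it measures drift on one categorical feature with an L-infinity distance, and the threshold, sample data, and `trigger` callback are hypothetical. In the architecture described above, the monitoring service raises the alert and the trigger would start the retraining pipeline.

```python
from collections import Counter

def distribution(values):
    """Normalize category counts into a probability distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def l_infinity_distance(baseline, current):
    """Largest absolute difference in probability across all categories."""
    p, q = distribution(baseline), distribution(current)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def maybe_trigger_retraining(baseline, current, threshold, trigger):
    """Call trigger() only when drift exceeds the threshold; return the decision."""
    drifted = l_infinity_distance(baseline, current) > threshold
    if drifted:
        trigger()
    return drifted

# Hypothetical samples: payment channel at training time versus in production.
training = ["card"] * 80 + ["wire"] * 20
serving = ["card"] * 50 + ["wire"] * 50

launched = []
print(maybe_trigger_retraining(training, serving, threshold=0.2,
                               trigger=lambda: launched.append("retrain")))
```

Retraining starts only when the distribution has actually moved, which is the difference from a fixed nightly schedule.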

Exam trap

Google Cloud often tests the distinction between scheduled retraining (Option C) and event-driven retraining triggered by actual drift detection (Option D), where candidates mistakenly choose the simpler scheduled approach without recognizing that it ignores the requirement to retrain only when drift is detected.

How to eliminate wrong answers

Option A is wrong because prediction latency is unrelated to data drift; monitoring latency only detects performance issues, not changes in data distribution. Option B is wrong because manually monitoring metrics in Vertex AI Experiments requires human intervention and does not automate retraining, contradicting the requirement for minimal manual intervention. Option C is wrong because scheduled nightly retraining ignores whether data drift has actually occurred, leading to unnecessary retraining and potential deployment of models that are not improved, and it does not use drift detection as the trigger.

150
MCQmedium

Your team is responsible for operationalizing a series of machine learning models that are trained and deployed using Vertex AI Pipelines. The pipeline consists of several steps including data preprocessing, training with hyperparameter tuning, model evaluation, and deployment to an endpoint. Recently, the pipeline has been failing intermittently at the model evaluation step with an error indicating insufficient memory. The evaluation step uses a custom container with a memory limit of 4 GB. The training step uses 8 GB and completes successfully. You need to resolve the failure without drastically increasing costs. What should you do?

A.Increase the memory limit for the evaluation custom container to 8 GB to match the training step.
B.Optimize the evaluation code to use streaming or incremental processing to reduce peak memory usage.
C.Reduce the batch size used in the evaluation step to lower memory consumption.
D.Use a smaller machine type for the evaluation step to force lower memory usage.
AnswerB

Optimizing the code is a cost-effective long-term solution that addresses the root cause.
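Incremental evaluation can be sketched as computing the metric over a stream of records rather than loading the whole evaluation set into memory. The record source and the `predict` function below are hypothetical stand-ins.

```python
def iter_batches(records, batch_size):
    """Yield successive batches, holding only one batch in memory at a time."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def streaming_accuracy(records, predict, batch_size=1000):
    """Compute accuracy over an iterable of (features, label) pairs incrementally."""
    correct = total = 0
    for batch in iter_batches(records, batch_size):
        for features, label in batch:
            correct += int(predict(features) == label)
            total += 1
    return correct / total if total else 0.0

# Hypothetical data source: a generator, so records are produced on demand.
def example_records():
    for x in range(10):
        yield x, x % 2  # label is 1 for odd numbers

print(streaming_accuracy(example_records(), predict=lambda x: x % 2, batch_size=4))
```

Only running counters and one batch are kept at any moment, so peak memory depends on the batch size rather than on the size of the evaluation set.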
