CCNA Serve Scale Models Questions

75 of 95 questions · Page 1/2 · Serve Scale Models topic · Answers revealed

1
MCQeasy

Your team has deployed a scikit-learn model using a custom container on Vertex AI Prediction. The model receives about 100 requests per second, and the endpoint is configured with a single n1-standard-4 machine. You notice that response times are around 200 ms on average, but occasionally spike to over 10 seconds during traffic bursts. You have set the min replicas to 1 and max replicas to 10. Despite this, spikes still occur. What is the most likely cause and the best course of action?

A.The autoscaling is too slow to react; you should increase the max replicas to 20 and reduce the cooldown period.
B.The model is not optimized for parallel inference; you should enable batching in the custom container.
C.The machine type is insufficient for the model size; you should switch to a n1-highmem-8.
D.The container has a memory leak; you should restart the container periodically.
AnswerA

Reducing cooldown and increasing max replicas helps autoscaling respond faster to bursts.

Why this answer

Option A is correct because the spikes occur only during traffic bursts, which suggests that autoscaling reacts too slowly; raising max replicas and shortening the cooldown period lets the endpoint add capacity sooner. Option B (batching) could help throughput, but it does not address slow scale-out. Option C is not necessarily needed because average latency is acceptable.

Option D is unlikely to cause intermittent spikes.
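
A rough sizing estimate shows why burst capacity matters in this scenario. The sketch below applies Little's law (requests in flight = arrival rate × latency) to the 100 requests per second and 200 ms from the question. The per-replica concurrency of 4 and the 300 requests per second burst are assumptions for illustration only, not figures from the question or from Vertex AI.

```python
import math


def replicas_needed(requests_per_second: int, latency_ms: int,
                    concurrency_per_replica: int) -> int:
    """Estimate replicas with Little's law: in-flight = rate * latency."""
    in_flight = requests_per_second * latency_ms / 1000
    return max(1, math.ceil(in_flight / concurrency_per_replica))


# Steady state from the question: 100 rps at 200 ms average latency.
steady = replicas_needed(100, 200, concurrency_per_replica=4)

# Hypothetical burst of 300 rps (assumed value, not from the question).
burst = replicas_needed(300, 200, concurrency_per_replica=4)

print(steady, burst)
```

Under these assumptions the burst needs more replicas than the configured maximum of 10, so requests queue until capacity catches up.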

2
Drag & Dropmedium

Drag and drop the steps to implement a CI/CD pipeline for ML models using Cloud Build and Vertex AI in the correct order.

Why this order

First configure the trigger, define the pipeline, then commit to trigger training and deployment.

3
MCQeasy

A data science team has trained a TensorFlow model and wants to serve it online with minimal latency. Which Vertex AI deployment option should they use to ensure the model can handle traffic spikes without manual scaling?

A.Use Vertex AI Model Garden.
B.Deploy the model to a Vertex AI Endpoint with automatic scaling.
C.Use Vertex AI Batch Prediction for offline inference.
D.Deploy the model to a Compute Engine VM with a load balancer.
AnswerB

Autoscaling handles traffic spikes with low latency.

Why this answer

Vertex AI Endpoints with automatic scaling (option B) are designed for online serving with minimal latency and can automatically adjust the number of replicas based on traffic load, handling spikes without manual intervention. This is the correct choice for a TensorFlow model requiring real-time inference and elastic scaling.

Exam trap

Google Cloud often tests the misconception that any cloud deployment with a load balancer (like Compute Engine) provides automatic scaling, but the trap here is that Vertex AI Endpoints offer managed autoscaling natively, whereas Compute Engine VMs require additional infrastructure setup and do not automatically scale without configuring managed instance groups.

How to eliminate wrong answers

Option A is wrong because Vertex AI Model Garden is a repository of pre-built models and foundation models, not a deployment option for serving custom trained models with automatic scaling. Option C is wrong because Vertex AI Batch Prediction is for offline, asynchronous inference on large datasets, not for real-time online serving with low latency. Option D is wrong because deploying to a Compute Engine VM with a load balancer requires manual scaling configuration (e.g., managed instance groups) and lacks the integrated autoscaling, monitoring, and model versioning capabilities of Vertex AI Endpoints.

4
Multi-Selectmedium

Which TWO actions can help reduce prediction latency for a model deployed on Vertex AI Endpoint without changing the model architecture?

Select 2 answers
A.Increase the batch size of prediction requests.
B.Attach a GPU accelerator to the endpoint's machine type.
C.Quantize the model from FP32 to INT8.
D.Deploy the model in multiple regions and use global load balancing.
E.Use a smaller machine type to reduce complexity.
AnswersB, C

GPU reduces computation time for neural networks.

Why this answer

Options B and C are correct. Option B (GPU accelerator) can significantly speed up inference for deep learning models. Option C (model quantization) reduces model size and inference time.

Option A (increasing batch size) increases latency per request. Option D (multi-region deployment) reduces network latency but not prediction latency. Option E (smaller machine type) may increase latency.
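
To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It only illustrates the arithmetic of mapping floating-point weights to 8-bit integers; the function names are hypothetical, and real toolchains use their own calibration and kernels.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]   # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error bounded by the scale.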

5
Multi-Selectmedium

Which THREE factors should be considered when choosing between using Vertex AI Endpoints and Cloud Run for model serving? (Choose three.)

Select 3 answers
A.Built-in model monitoring
B.Complexity of model containerization
C.Cost per request
D.GPU support
E.Automatic scaling to zero
AnswersA, D, E

Vertex AI Endpoints integrates with Model Monitoring; Cloud Run requires custom implementation.

Why this answer

Options A, D, and E are key differentiators. Vertex AI Endpoints supports GPUs natively, while Cloud Run has limited GPU support. Cloud Run inherently scales to zero, whereas Vertex AI Endpoints does not always scale to zero easily.

Vertex AI Endpoints has built-in model monitoring; Cloud Run does not. Options B and C are less differentiating: both services have similar container requirements and cost structures.

6
MCQhard

A data scientist runs a batch prediction job on Vertex AI using a custom container. The job processes a large JSONL file (10 GB) and fails with an out-of-memory error. The machine type is n1-standard-4 (15 GB memory). Which action should be taken to resolve the error while minimizing cost?

A.Reduce the batch size in the prediction request.
B.Split the input data into smaller files and run multiple batch jobs.
C.Add a GPU accelerator to offload computation.
D.Use a machine type with more memory, such as n1-highmem-8 (52 GB).
AnswerD

Increasing memory directly solves out-of-memory errors.

Why this answer

Option D is correct because out-of-memory errors suggest the machine's memory is insufficient for the model or data size; switching to a high-memory machine type adds more memory. Option B is wrong because splitting the input data does not reduce per-instance memory pressure if the model itself is large. Option A is wrong because the batch size may need adjustment, but the primary issue is memory.

Option C is wrong because using a GPU does not increase memory.

7
MCQmedium

A company needs to serve a model for low-frequency inference requests (a few hundred per month) from multiple regions. The priority is simplicity and minimal cost without maintaining infrastructure. Which serving option should they choose?

A.Deploy a real-time Vertex AI Endpoint with min replicas set to 1.
B.Set up a Dataflow streaming pipeline to process requests.
C.Use Vertex AI Batch Prediction triggered as needed.
D.Use Cloud Run with serving container and scale to zero.
AnswerC

Batch prediction is serverless, pay-per-query, and ideal for infrequent large predictions.

Why this answer

Option C is correct because Vertex AI Batch Prediction runs on demand and is cost-effective for infrequent large batches. Option A is wrong because a real-time endpoint incurs per-hour cost even when idle. Option D is wrong because Cloud Run is better suited to online serving, not offline.

Option B is wrong because Dataflow is more complex and designed for streaming.

8
MCQhard

A company needs to serve a large Transformer model (5 GB) with strict latency requirements (< 50 ms) and throughput of 1000 requests per second. The model is in SavedModel format. They are considering deployment options on Google Cloud. Which approach best meets these requirements?

A.Deploy on Vertex AI Prediction using a single high-memory VM with a GPU (e.g., n1-highmem-32 with A100).
B.Deploy on Cloud Run with a GPU-enabled instance and increase concurrency.
C.Deploy on Vertex AI Prediction using model parallelism across multiple GPUs on a single VM.
D.Deploy on Vertex AI Prediction using distributed serving with TensorFlow Serving and model sharding across multiple VMs.
AnswerA

A single A100 can handle 5GB model with low latency and high throughput.

Why this answer

Option A is correct because a single high-memory VM with a powerful GPU (e.g., A100) can handle the model size and throughput with low latency, avoiding network overhead. Option C is wrong because model parallelism adds complexity and may not be needed for a 5 GB model on a single high-end GPU. Option D is wrong because distributed serving introduces network latency.

Option B is wrong because Cloud Run currently does not support GPU instances effectively.

9
MCQeasy

A company is deploying a machine learning model for real-time fraud detection. The model must respond to requests within 100ms. The model is a TensorFlow model and will be deployed on Google Kubernetes Engine (GKE). Which Google Cloud service should be used to serve the model to minimize latency?

A.Deploy the model on Cloud Run with minimum instances set to 1.
B.Deploy the model as a Cloud Function triggered by HTTP requests.
C.Deploy the model on Vertex AI Prediction with a custom container.
D.Deploy TensorFlow Serving on GKE with a LoadBalancer service.
AnswerD

TensorFlow Serving is optimized for low-latency serving and can be configured on GKE with a LoadBalancer for direct access, minimizing network hops.

Why this answer

Option D is correct because deploying TensorFlow Serving directly on GKE with a LoadBalancer service provides the lowest-latency path for real-time inference. TensorFlow Serving is optimized for high-performance model serving with batching and gRPC support, and GKE allows fine-grained control over node placement, autoscaling, and networking to meet the 100ms SLA. In contrast, serverless options like Cloud Run or Cloud Functions add cold-start latency and lack the low-level optimization for TensorFlow models.

Exam trap

The trap here is that candidates often assume Vertex AI Prediction is always the best choice for serving models, but for ultra-low-latency requirements (<100ms), a direct deployment on GKE with TensorFlow Serving avoids the overhead of a managed prediction platform.

How to eliminate wrong answers

Option A is wrong because Cloud Run, even with minimum instances set to 1, introduces additional latency from its HTTP request routing layer and does not natively support gRPC or TensorFlow Serving's optimized batching, making it harder to consistently meet 100ms. Option B is wrong because Cloud Functions have a maximum timeout of 60 seconds but suffer from cold-start delays (often 500ms-2s) and lack persistent GPU/TPU support, making them unsuitable for sub-100ms real-time inference. Option C is wrong because Vertex AI Prediction with a custom container adds overhead from Vertex AI's managed infrastructure (e.g., request routing, health checks, and autoscaling logic) that can introduce 10-50ms extra latency compared to a direct TensorFlow Serving deployment on GKE.

10
Multi-Selectmedium

Which THREE factors should you consider when deciding between online prediction and batch prediction on Vertex AI?

Select 3 answers
A.The type of machine learning model architecture (e.g., CNN vs RNN)
B.Cost per prediction: batch is often cheaper per request
C.Latency requirements (real-time vs. asynchronous)
D.Traffic pattern: sporadic vs. sustained load
E.Availability of GPU instances in the region
AnswersB, C, D

Batch prediction is typically more cost-effective for large volumes.

Why this answer

Latency requirements, cost structure, and data volume patterns are key factors. Instance availability is similar for both; model architecture does not dictate prediction type.

11
MCQhard

A model deployed on Vertex AI Endpoints returns predictions, but the performance metrics (e.g., AUC) degrade over time. The input data distribution is shifting. The team wants to detect and alert on this drift automatically. Which set of actions should they take?

A.Schedule a batch prediction job daily and compare with ground truth
B.Enable Vertex AI Model Monitoring for feature drift and set up alerts via Cloud Monitoring
C.Use Vertex AI Explainable AI to understand predictions
D.Implement custom logging in the serving container and use BigQuery for analysis
AnswerB

Model Monitoring automatically calculates drift metrics and can trigger alerts when drift exceeds thresholds.

Why this answer

Option B is correct because Vertex AI Model Monitoring can monitor for feature distribution drift and skew, and can be configured to send alerts via Cloud Monitoring. Option C is part of model interpretation, not drift detection. Option A requires ground truth labels, which may not be available immediately.

Option D is manual and not automated.
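
The idea behind drift detection can be sketched as a distance score between the training and serving feature distributions, compared against a threshold. The example below uses Jensen-Shannon divergence as one common distance measure. The histograms and the 0.1 threshold are made up for illustration, and this is not a description of how Vertex AI Model Monitoring is implemented.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drift_alert(training_hist, serving_hist, threshold=0.1):
    """Return (alert, score); the threshold is an assumed, tunable value."""
    score = js_divergence(training_hist, serving_hist)
    return score > threshold, score


training = [0.25, 0.25, 0.25, 0.25]   # hypothetical binned feature at training time
serving = [0.70, 0.10, 0.10, 0.10]    # hypothetical binned feature in production
alert, score = drift_alert(training, serving)
print(alert, round(score, 3))
```

A score of 0 means the two distributions are identical; the score approaches 1 as they stop overlapping.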

12
MCQmedium

A data science team deploys a PyTorch model using Vertex AI Prediction. The model requires GPU for inference, but they notice high costs and underutilized GPUs during off-peak hours. What is the most cost-effective solution?

A.Move the model to Cloud Functions
B.Use a GPU instance with a fixed number of replicas
C.Use a GPU instance with min replicas=0 and autoscaling
D.Switch to a CPU-only machine type
AnswerC

Scales down to zero when unused, saving costs.

Why this answer

Option C is correct because setting min replicas to 0 allows Vertex AI Prediction to scale down to zero instances during off-peak hours, eliminating GPU costs when no requests are being served. Combined with autoscaling, the deployment will spin up GPU-backed instances on demand only when traffic arrives, directly addressing the underutilization issue while maintaining low latency for inference requests.

Exam trap

Google Cloud often tests the misconception that autoscaling alone reduces costs, but the trap here is that without setting min replicas to 0, you still pay for idle GPU instances during off-peak hours, which is the exact problem described in the question.

How to eliminate wrong answers

Option A is wrong because Cloud Functions does not support GPU acceleration; it runs in a serverless environment limited to CPU-only execution, making it unsuitable for GPU-required inference. Option B is wrong because a fixed number of replicas (even with autoscaling) keeps at least one GPU instance running at all times, failing to eliminate costs during zero-traffic periods; min replicas must be 0 to achieve true cost savings. Option D is wrong because the model explicitly requires GPU for inference, and switching to CPU-only would break inference performance or make it infeasible due to model architecture or latency requirements.

13
MCQhard

A machine learning engineer notices that a model served on Vertex AI Endpoints returns predictions that are consistently 20% slower during the first request after idle (cold start). They are using automatic scaling with min replicas=1. What is the most likely cause and best solution?

A.Vertex AI endpoint warm-up time; set min replicas to 2 to always keep a warm instance
B.Model loading time is high; enable health checks to warm the instance
C.Network latency; deploy to a different region
D.Container initialization delay; use a smaller container image
AnswerA

Increasing min replicas ensures at least one warm instance is always available, reducing cold start latency.

Why this answer

Option A is correct because even with min replicas=1, an instance may be recycled due to idleness or updates, causing a cold start. Setting a higher min replicas (e.g., 2) ensures a warm instance is always available. Option D is incorrect because a smaller container image may not significantly reduce cold start time.

Option B is incorrect because health checks measure readiness but do not eliminate cold starts. Option C is incorrect because network latency would affect every request, not only the first request after idle.

14
Multi-Selecthard

A company deploys a model to Vertex AI Endpoint with autoscaling enabled. During a traffic spike, they observe high tail latency (99th percentile > 2s). Which TWO factors are most likely contributing to this latency?

Select 2 answers
A.The machine type is underpowered for the model.
B.The autoscaling target_cpu_utilization is set too low (e.g., 0.3).
C.The endpoint has too many traffic splits configured.
D.The min_replica_count is set too low, causing cold starts.
E.The model file is very large (e.g., 2GB), increasing model loading time.
AnswersD, E

Low min replicas lead to cold start delays during spikes.

Why this answer

Options D and E are correct. Option D: if min replicas is too low, new replicas must be created and loaded with the model, causing cold start latency. Option E: a large model file increases cold start time as new replicas load the model.

Option A (underpowered machine) would cause high average latency, not just tail latency. Option C (too many traffic splits) is unrelated. Option B (target CPU utilization set too low) would cause earlier scaling, reducing tail latency; a target that is too high would delay scaling.
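
The effect of the CPU utilization target can be illustrated with a generic proportional scaling rule, similar in spirit to the one used by the Kubernetes Horizontal Pod Autoscaler. This is a sketch under that assumption, not the documented Vertex AI algorithm, and the utilization figures are hypothetical.

```python
import math


def desired_replicas(current_replicas, observed_utilization_pct,
                     target_utilization_pct, min_replicas, max_replicas):
    """Proportional rule: scale so that utilization moves toward the target."""
    raw = math.ceil(
        current_replicas * observed_utilization_pct / target_utilization_pct
    )
    return max(min_replicas, min(max_replicas, raw))


# Same load (2 replicas at 60% CPU), two different targets.
low_target = desired_replicas(2, 60, 30, min_replicas=1, max_replicas=10)
high_target = desired_replicas(2, 60, 80, min_replicas=1, max_replicas=10)
print(low_target, high_target)
```

With the low target the rule asks for more replicas at the same load, which is why a low target triggers scaling earlier rather than later.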

15
Drag & Dropmedium

Drag and drop the steps to set up a feature store for ML features using Vertex AI Feature Store in the correct order.

Why this order

First define entity type and features, then ingest data, serve, and monitor.

16
MCQmedium

You are responsible for deploying a real-time recommendation model that uses a large embedding table (5 GB) and a small neural network. The model is served through a custom container on Vertex AI Prediction. The end-to-end latency requirement is under 200 ms. During load testing with 500 QPS, you observe that latency increases linearly with batch size. You are currently using a single replica with an n1-standard-8 machine and one T4 GPU. The embedding table is loaded entirely in GPU memory. However, CPU utilization is at 100% while GPU is at 30%. What is the best approach to meet the latency requirement at scale?

A.Increase the number of replicas and use a global load balancer to distribute traffic.
B.Use a custom container that partitions the embedding table across multiple GPUs within a single replica.
C.Switch to a TPU v2-8 pod slice to accelerate embedding lookups.
D.Use a machine type with more CPU cores to parallelize embedding lookups.
AnswerD

More CPU cores reduce contention and latency for embedding operations.

Why this answer

Option D is correct because CPU is the bottleneck; using a machine type with more CPU cores (e.g., n1-highcpu-16) allows parallel embedding lookups and reduces latency. Option B increases resources but not in the bottleneck area. Option A increases replicas, but each would still be CPU-bound.

Option C is expensive and may not improve latency if the model is not TPU-compatible.

17
MCQmedium

A machine learning team wants to perform A/B testing between two model versions (v1 and v2) on Vertex AI Endpoint. They need to gradually route 10% of traffic to v2 while monitoring performance. What is the most efficient way to achieve this?

A.Use a Cloud Load Balancer to route traffic based on a header.
B.Deploy both versions to the same endpoint and set traffic_split to 90% for v1 and 10% for v2.
C.Create two separate endpoints and use a weighted DNS round-robin.
D.Run batch predictions for v2 and log results separately.
AnswerB

Vertex AI Endpoint supports traffic splitting for A/B testing.

Why this answer

Option B is correct because Vertex AI Endpoint natively supports traffic splitting between model versions. Option C is wrong because creating separate endpoints adds complexity and cost. Option A is wrong because Cloud Load Balancing operates at the network level, not the model level.

Option D is wrong because batch prediction is not for real-time A/B testing.
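
A 90/10 traffic split can be pictured as weighted random routing over the deployed versions. The sketch below simulates that behavior locally; the version names and helper functions are hypothetical, and this does not call the Vertex AI API.

```python
import random


def validate_split(split):
    """A traffic split maps each version to a percentage that must total 100."""
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    return split


def pick_version(split, rng):
    """Route one request according to the configured percentages."""
    versions = list(split)
    weights = [split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


split = validate_split({"v1": 90, "v2": 10})
rng = random.Random(0)  # seeded so the simulation is repeatable
counts = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[pick_version(split, rng)] += 1
print(counts)
```

Shifting more traffic to v2 is then a matter of changing the percentages, which is what makes a gradual rollout straightforward.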

18
MCQmedium

After deploying a new version of a model to a Vertex AI Endpoint, the team notices that predictions are still returning results from the old version. The deployment command used a traffic split of 100% to the new version. What is the most likely cause?

A.The model artifact uploaded was identical to the old version.
B.The traffic split was not properly updated; the endpoint is still routing 100% to the old version.
C.The new model version failed health checks and was automatically rolled back.
D.The prediction client is caching the old model response.
AnswerB

If the traffic split command is not applied correctly, the old version continues to serve.

Why this answer

Option B is correct because the traffic split update may not have taken effect, for example if the command failed silently, so the endpoint keeps routing traffic to the old version. Option D is wrong because client-side caching is not a typical issue for Vertex AI Endpoint. Option C is wrong because the deployment succeeded; the traffic split may simply need an explicit update.

Option A is wrong because an identical model artifact would affect only the new version's output, not which version receives traffic.

19
MCQmedium

You are using Vertex AI continuous evaluation (model monitoring) for your deployed model. You receive an alert that the prediction distribution is significantly different from the training distribution. What should you do first?

A.Roll back the model to the previous version immediately.
B.Increase the alerting threshold to reduce false positives.
C.Analyze the input data to understand if there is a skew or drift.
D.Retrain the model using the latest data and redeploy.
AnswerC

Diagnosing the cause is the appropriate first step.

Why this answer

When a monitoring alert triggers, the first step is to investigate the root cause: check whether the input data has changed, whether retraining is needed, or whether there is a data pipeline issue. Rolling back or retraining without this analysis would be premature.

20
Multi-Selectmedium

A company is deploying a model for online predictions on Vertex AI. They want to minimize latency while also handling traffic spikes. Which TWO configurations should they choose?

Select 2 answers
A.Use GPU machine type
B.Enable autoscaling with min replicas=1
C.Disable autoscaling and use manual scaling
D.Use CPU machine type with more memory
E.Set a fixed number of replicas equal to peak load
AnswersA, B

GPUs accelerate inference, reducing latency.

Why this answer

Option A is correct because GPU machine types on Vertex AI provide significantly faster inference for deep learning models, reducing latency per prediction. Option B is correct because enabling autoscaling with min replicas=1 ensures the model can handle traffic spikes by dynamically adding replicas while keeping at least one instance running to avoid cold starts.

Exam trap

Google Cloud often tests the misconception that manual scaling or fixed replicas are better for latency, but the correct approach is autoscaling with a minimum replica count to balance cost and responsiveness.

21
MCQhard

A data scientist deployed a model to Vertex AI Prediction. When making a prediction request as shown in the exhibit, they receive a 400 error. What is the most likely cause?

A.The request JSON is malformed due to a missing comma between instances.
B.The model was trained on 2 features, but the request provides 3 features.
C.The endpoint path is incorrect; it should include the model version.
D.The request is sending 3 separate instances but the model expects only 1.
AnswerB

The error indicates the model expects 2 features per instance, but the request provides 3.

Why this answer

The 400 error indicates a malformed request, typically due to a mismatch between the input features the model expects and what is provided. Since the model was trained on 2 features but the request includes 3 features, Vertex AI rejects the prediction as invalid input shape mismatch. This is the most common cause of 400 errors in Vertex AI Prediction when the instance structure does not match the model's signature.

Exam trap

The trap here is that candidates confuse a 400 error with a routing or versioning issue (Option C) or assume JSON syntax errors (Option A), but the real cause is a feature count mismatch, which is a common pitfall when deploying models with different training and serving data schemas.

How to eliminate wrong answers

Option A is wrong because a missing comma between instances would cause a JSON parse error (e.g., 400 with 'Invalid JSON payload'), but the exhibit shows valid JSON syntax with commas present. Option C is wrong because the endpoint path does not require a model version; Vertex AI Prediction uses the endpoint resource name, and versioning is handled via traffic splitting or aliases, not in the URL path. Option D is wrong because Vertex AI Prediction supports batch prediction with multiple instances in a single request, and the model expects exactly 1 instance per request only if the model's serving signature specifies a fixed batch size of 1, which is not indicated here.
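
A feature-count mismatch like this can be caught on the client before the request is sent. The sketch below checks each entry of an "instances" list against the number of features the model expects. The payload values and the helper name are made up for illustration and are not taken from the exhibit.

```python
def validate_instances(instances, expected_features):
    """Return (index, problem) pairs for instances with the wrong feature count."""
    problems = []
    for i, instance in enumerate(instances):
        if len(instance) != expected_features:
            problems.append(
                (i, f"expected {expected_features} features, got {len(instance)}")
            )
    return problems


# Hypothetical request body: the first instance has one feature too many.
payload = {"instances": [[1.0, 2.0, 3.0], [4.0, 5.0]]}
problems = validate_instances(payload["instances"], expected_features=2)
print(problems)
```

An empty result means every instance has the expected shape and the request is safe to send.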

22
MCQmedium

A data science team deploys a custom container on Vertex AI Prediction for a PyTorch model. After deployment, the model returns predictions that are consistently off by a constant factor. The model performed correctly during local testing. What is the most likely cause?

A.The model is loaded in evaluation mode, but the training mode was used in testing.
B.The serving input function in the container is not applying the same normalization as during training.
C.The container is using a different PyTorch version than the training environment.
D.There is a bug in the custom container's prediction route.
AnswerB

A preprocessing mismatch, such as scaling by a different factor, leads to predictions that are off by a constant factor.

Why this answer

Option B is correct because a constant factor error typically indicates a preprocessing mismatch, such as different normalization. Option C is wrong because different PyTorch versions may cause other inconsistencies but not a constant factor. Option A is wrong because training vs. evaluation mode affects dropout and batch norm, not constant scaling.

Option D is possible but less specific than preprocessing.
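
The following toy example shows how a missing normalization step produces a constant-factor error. The linear "model", the mean of 0, and the standard deviation of 10 are stand-ins chosen for illustration; they are not taken from the question.

```python
def make_normalizer(mean, std):
    """Build the preprocessing step that training used."""
    def normalize(x):
        return (x - mean) / std
    return normalize


def model(x_normalized):
    # Stand-in for the trained network: a fixed linear function.
    return 3.0 * x_normalized


normalize = make_normalizer(mean=0.0, std=10.0)
raw_input = 20.0

correct = model(normalize(raw_input))  # same preprocessing as in training
skewed = model(raw_input)              # serving path that skips normalization
print(correct, skewed, skewed / correct)
```

Sharing one preprocessing function between the training code and the serving container is a simple way to avoid this class of bug.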

23
MCQhard

A data science team deploys a large language model (LLM) on Vertex AI Prediction using an NVIDIA A100 GPU. The end-to-end latency is acceptable, but the cost is high due to low GPU utilization. The model is stateless and requests are independent. Which strategy would most effectively reduce cost per prediction?

A.Migrate the model to Cloud TPU using TensorFlow to benefit from higher throughput.
B.Use a smaller GPU, such as NVIDIA T4, and increase the number of replicas to maintain throughput.
C.Reduce the number of min replicas to 0 and scale from 0 on each request.
D.Implement dynamic batching in the serving container to aggregate multiple requests into a single inference call.
AnswerD

Batching improves GPU utilization by processing multiple requests in parallel, lowering cost per inference.

Why this answer

Option D is correct because request batching increases throughput per GPU, reducing cost per prediction. Option B is wrong because a smaller GPU may not meet latency requirements. Option A is wrong because Cloud TPUs are not designed for this model and may increase cost.

Option C is wrong because scaling down replicas reduces capacity and may cause latency spikes.
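
The core of request batching is grouping queued requests so that one inference call serves several of them. Here is a minimal sketch with a stand-in model function. Production dynamic batching also uses a maximum wait time before flushing a partial batch, which is omitted here, and all names are hypothetical.

```python
def make_batches(pending_requests, max_batch_size):
    """Group queued requests into batches of at most max_batch_size."""
    return [
        pending_requests[i:i + max_batch_size]
        for i in range(0, len(pending_requests), max_batch_size)
    ]


def run_model(batch):
    # Stand-in for a single GPU inference call over the whole batch.
    return [x * 2 for x in batch]


def serve(pending_requests, max_batch_size):
    """Return all results plus the number of inference calls that were made."""
    results, calls = [], 0
    for batch in make_batches(pending_requests, max_batch_size):
        results.extend(run_model(batch))
        calls += 1
    return results, calls


results, calls = serve(list(range(10)), max_batch_size=4)
print(results, calls)
```

Ten requests are served with three inference calls instead of ten, which is where the utilization gain comes from.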

24
Multi-Selectmedium

A data science team has trained a large deep learning model using Vertex AI Workbench. They want to deploy it to Vertex AI Prediction for online serving. The model is stored in a custom container with a Python-based web server. Which TWO actions should the team take to ensure optimal performance and cost?

Select 2 answers
A.Configure the model to use a larger batch size for inference.
B.Request GPU machine types for the prediction nodes.
C.Set the container's health check path to '/predict'.
D.Use a global load balancer to distribute traffic across regions.
E.Enable autoscaling with a minimum number of replicas.
AnswersB, E

Deep learning models typically require GPUs for low-latency inference.

Why this answer

B is correct because deep learning models, especially large ones, benefit significantly from GPU acceleration for online inference due to their parallel processing capabilities. Vertex AI Prediction supports GPU machine types, and using them reduces latency and improves throughput for compute-intensive model serving, which is critical for optimal performance.

Exam trap

Google Cloud often tests the misconception that health check endpoints should be the same as the prediction endpoint, but in practice, health checks must be lightweight and separate to avoid false positives and resource exhaustion.

25
MCQmedium

A company deploys a custom TensorFlow model to Vertex AI Endpoint for online predictions. After deployment, prediction latency is consistently high (over 500ms) even under low traffic. The model is CPU-only and the default machine type (n1-standard-2) is used. Which action will most likely reduce prediction latency?

A.Increase the max_replica_count to 10 to allow more parallel requests.
B.Change the machine type to n1-highcpu-16 with a GPU accelerator.
C.Set min_replica_count to 3 to ensure always-on capacity.
D.Increase the batch size in the prediction request.
AnswerB

More CPU cores and GPU can reduce inference latency.

Why this answer

Option B is correct because using a machine type with more CPUs or adding a GPU accelerator can reduce inference time for compute-intensive models. Option A is wrong because increasing max replicas does not improve single-request latency. Option D is wrong because batch size affects throughput, not latency per request.

Option C is wrong because increasing min replicas reduces cold starts but not steady-state latency.

26
Multi-Selecthard

Which TWO options can help detect model performance degradation in production? (Choose two.)

Select 2 answers
A.Vertex AI Experiments on historical data
B.Cloud Logging for prediction errors
C.Cloud Monitoring custom metrics from serving logs
D.Vertex AI Model Monitoring (drift detection)
E.Using BigQuery to store predictions and compare with ground truth
AnswersD, E

Detects shifts in input distribution that often lead to performance degradation.

Why this answer

Options D and E are correct. Vertex AI Model Monitoring detects drift in input features, which can indicate performance degradation. Storing predictions in BigQuery and comparing them with ground truth labels directly measures performance.

Option C monitors infrastructure, not model performance. Option A is training-time. Option B logs errors but not degradation.
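
Comparing stored predictions with ground truth reduces to recomputing a metric on recent data and checking it against a baseline. The sketch below does this for accuracy; the baseline of 0.9, the tolerance of 0.05, and the sample values are assumptions for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground truth labels."""
    matched = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matched / len(labels)


def degraded(baseline_accuracy, predictions, labels, tolerance=0.05):
    """Flag degradation when accuracy falls below baseline minus tolerance."""
    current = accuracy(predictions, labels)
    return current < baseline_accuracy - tolerance, current


# Hypothetical logged predictions joined with labels that arrived later.
flag, current = degraded(0.9, [1, 0, 1, 1], [1, 0, 0, 0])
print(flag, current)
```

In practice the same check would run on a schedule over the most recent window of labeled predictions.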

27
Multi-Selecthard

A company trains a model using Vertex AI Training and then deploys it to Vertex AI Prediction. They notice that prediction requests fail with 'InvalidArgument: input tensor shape mismatch'. Which THREE are possible causes?

Select 3 answers
A.The model was exported in a different format than supported
B.The batch size in the request is too large
C.The input data types do not match the expected types (e.g., float vs int)
D.The input data has a different number of features than the model expects
E.The serving function does not include the same preprocessing as training
AnswersC, D, E

Data type mismatch causes shape or value errors.

Why this answer

Option C is correct because Vertex AI Prediction expects the input tensor data types to exactly match those used during model training. If the model was trained with float32 inputs but the prediction request sends int32 values, the serving infrastructure detects the mismatch and returns an 'InvalidArgument: input tensor shape mismatch' error, as TensorFlow Serving (which underlies Vertex AI Prediction) validates dtype consistency at the graph level.

Exam trap

Google Cloud often tests the misconception that 'shape mismatch' only refers to the number of features or dimensions, when in fact it also encompasses data type mismatches and preprocessing inconsistencies that alter the tensor structure before it reaches the model.

28
MCQeasy

You need to serve a TensorFlow model that has a cold start latency of 20 seconds. The model is used for a real-time application with unpredictable traffic, but occasional bursts require immediate responses. What is the best deployment strategy to minimize both cold start impact and cost?

A.Set min_replica_count to 1 to keep at least one instance always warm.
B.Use a larger machine type to reduce cold start time.
C.Set min_replica_count to 0 and rely on autoscaling to handle bursts.
D.Enable serving on Cloud Run for faster cold start.
AnswerA

One warm instance avoids cold start for initial traffic.

Why this answer

Setting a minimum number of replicas (min_replica_count) ensures that some instances are always warm, avoiding a cold start for the first requests. This balances cost and latency. A larger machine type does not remove the cold start, and a min_replica_count of 0 leaves the first requests of a burst waiting through the 20-second cold start.

29
MCQeasy

A company wants to serve a large XGBoost model that exceeds the 2GB limit for Vertex AI Prediction. What should they do?

A.Reduce model size by removing features
B.Compress the model using gzip and upload
C.Deploy the model on Cloud Run Functions
D.Use a custom container to serve the model
AnswerD

Custom containers have no size limit.

Why this answer

Vertex AI Prediction has a 2GB limit for the model artifact when using pre-built containers. A custom container bypasses this limit because you package the model and serving code into a Docker image, which can be arbitrarily large. This allows you to serve XGBoost models exceeding 2GB without size constraints imposed by the managed serving infrastructure.

Exam trap

Google Cloud often tests the misconception that compression (gzip) or feature reduction can circumvent hard platform limits, when in fact the correct solution is to use a custom container that bypasses the artifact size restriction entirely.

How to eliminate wrong answers

Option A is wrong because removing features reduces model accuracy and does not address the core issue of the 2GB artifact limit; Vertex AI still enforces the limit on the remaining model file. Option B is wrong because gzip compression is not transparent to Vertex AI's pre-built containers—the model must be decompressed at load time, and the 2GB limit applies to the uncompressed artifact, so compression does not bypass the restriction. Option C is wrong because Cloud Run Functions have a 2GB memory limit and are designed for stateless, short-lived functions, not for hosting large ML models; they lack GPU support and are unsuitable for XGBoost inference at scale.

30
MCQmedium

A team needs to serve a PyTorch model for production inference with strict latency requirements (p99 < 100ms). The model has dynamic control flow and uses custom kernels compiled with torch.jit. Which serving approach should they recommend?

A.Build a custom container with PyTorch JIT and deploy it on Vertex AI Prediction.
B.Convert the model to TensorFlow SavedModel and serve it on Vertex AI Prediction with TensorFlow Serving.
C.Use Cloud Functions with a PyTorch wrapper to handle inference requests.
D.Deploy the model on Vertex AI Prediction using the prebuilt PyTorch container.
AnswerA

Custom container allows fine-grained optimization and inclusion of custom kernels.

Why this answer

Option A is correct because a custom container with a PyTorch JIT server offers full control over model execution and avoids the overhead of generic servers. Option D is wrong because, although Vertex AI Prediction offers a prebuilt PyTorch container, a custom container is the better fit for dynamic control flow and custom kernels. Option B is wrong because TensorFlow Serving does not support PyTorch natively.

Option C is wrong because Cloud Functions are not suitable for real-time inference at scale with strict latency requirements.

31
MCQhard

A team deploys a TensorFlow model using a custom container to Vertex AI Endpoint. The container expects the saved model at the /model directory, but predictions fail with a 'model not found' error. The team used the default Vertex AI serving container in the past. What is the most likely cause?

A.The container does not have a GPU accelerator configured.
B.The model artifact must be downloaded from Cloud Storage and placed in /gcs.
C.The container reads from a fixed directory /model, but Vertex AI mounts the model at /tmp/model.
D.The model was saved in a different format (e.g., SavedModel vs. HDF5).
AnswerC

Custom containers must adapt to the Vertex AI model mount point.

Why this answer

Option C is correct because Vertex AI provides the model artifact at a location given by the environment variable AIP_STORAGE_URI (in this question, /tmp/model) rather than at a fixed /model directory. The custom container must read from this location or copy the model from it. Option D is wrong because the model format is not the issue.

Option B is wrong because Vertex AI does not require the model to be in a Cloud Storage bucket mounted at /gcs in this context. Option A is wrong because the error is about a file not being found, not about whether a GPU is attached.
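
The model-location explanation above can be illustrated with a short startup helper. This is a minimal sketch, assuming the serving code is written in Python; the function name `resolve_model_location` and the `/model` fallback are hypothetical and chosen only for illustration. The only name taken from the explanation is the AIP_STORAGE_URI environment variable.

```python
import os

# Hypothetical fallback used when the variable is absent (e.g. local testing).
DEFAULT_MODEL_DIR = "/model"


def resolve_model_location(env=None):
    """Return the model location from AIP_STORAGE_URI, or a local fallback."""
    env = os.environ if env is None else env
    location = env.get("AIP_STORAGE_URI", "").strip()
    # An empty or missing value falls back to the hardcoded directory.
    return location or DEFAULT_MODEL_DIR


if __name__ == "__main__":
    print("Loading model from:", resolve_model_location())
```

If the value turns out to be a Cloud Storage URI rather than a local path, the container would also need to copy the artifacts locally before loading them; that step is left out of this sketch.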

32
MCQeasy

An organization needs to serve a large model (10 GB) with low latency across multiple regions. Which Vertex AI feature best meets this requirement?

A.Private endpoints
B.Batch prediction
C.Model Monitoring
D.Global endpoints
AnswerD

Global endpoints automatically route to the closest region, providing low latency across regions.

Why this answer

Option D is correct because Vertex AI Global Endpoints automatically route traffic to the nearest region with capacity, reducing latency for geographically distributed users. Option B is for batch jobs, not real-time serving. Option A is for private access within a VPC, which does not address multi-region latency.

Option C is for monitoring, not serving.

33
Multi-Selecteasy

Which TWO actions can help reduce the latency of online prediction requests for a deep learning model served on Vertex AI?

Select 2 answers
A.Increase the number of CPU vCPUs per machine.
B.Set min_replica_count to 0 to avoid idle instances.
C.Use a GPU accelerator for the deployed model.
D.Decrease the number of replicas to reduce resource contention.
E.Enable request batching to process multiple inputs together.
AnswersC, E

GPU accelerates deep learning inference.

Why this answer

Using a GPU accelerator speeds up inference, and batching requests reduces overhead per request. Minimizing replicas doesn't help latency; increasing CPU doesn't always help if GPU is better.

34
MCQhard

You are a machine learning engineer at a financial technology company. You have deployed a complex ensemble model consisting of three sub-models (XGBoost, TensorFlow, and PyTorch) for real-time fraud detection. The model is served on Vertex AI online prediction with a custom container that orchestrates the three models sequentially. The endpoint currently uses n1-highmem-8 machines with no accelerators. You are experiencing high latency (avg 500ms) during peak trading hours (9:30 AM - 4:00 PM EST), exceeding the 200ms SLA. The container is CPU-bound, and memory usage is around 60%. The model weights total 500 MB. You have already tried increasing the batch size per request from 1 to 4, which reduced latency slightly but not enough. The traffic pattern is very spiky, with sudden bursts of up to 1000 requests per second. Your goal is to meet the latency SLA without significantly increasing cost. Which action should you take?

A.Add a NVIDIA T4 GPU accelerator to the existing machine type.
B.Reduce the min_replica_count to 0 to allow scaling down aggressively and add more replicas during spikes.
C.Increase the machine type to n1-highmem-16 with more vCPUs.
D.Switch the model to Vertex AI batch prediction and run predictions every hour.
AnswerA

GPU accelerates the deep learning parts, reducing total latency.

Why this answer

Adding a GPU accelerator (e.g., NVIDIA T4) to the instances can significantly speed up the TensorFlow and PyTorch components, which are deep learning models. The XGBoost part runs on CPU but the overall latency bottleneck is likely the deep learning models. GPU will accelerate inference of those models, reducing total latency.

Increasing CPUs will help only marginally as the main bottleneck is compute. Reducing min replicas may increase cold start and queue. Switching to batch prediction changes the model from real-time to batch, which does not meet the latency requirement.

35
MCQhard

A company uses Vertex AI Prediction with a custom container for a TensorFlow model. They notice that after deploying a new model version, requests still go to the old version. What is the most likely cause?

A.The custom container is not compatible with Vertex AI
B.The model is cached and needs cache invalidation
C.Traffic is not split to the new model version
D.The new model version was not deployed to the same endpoint
AnswerC

Traffic splitting must be adjusted to route to the new version.

Why this answer

In Vertex AI Prediction, when you deploy a new model version to an existing endpoint, you must explicitly allocate traffic to it. By default, the new version receives 0% traffic, so all requests continue to be served by the old version. The correct fix is to update the endpoint's traffic split, for example via the console or the `gcloud ai endpoints update` command with the `--traffic-split` flag.

Exam trap

Google Cloud often tests the misconception that deploying a new model version automatically replaces the old one, when in fact Vertex AI requires an explicit traffic split update to shift requests to the new version.

How to eliminate wrong answers

Option A is wrong because Vertex AI supports custom containers for TensorFlow models as long as they implement the required HTTP health check and prediction endpoints; incompatibility would cause deployment failure, not silent routing to an old version. Option B is wrong because Vertex AI does not cache model predictions at the endpoint level; caching is not a factor in traffic routing between model versions. Option D is wrong because deploying to the same endpoint is exactly what the user did; the issue is that the new version was deployed but not given any traffic share, not that it was deployed to a different endpoint.
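
The traffic-split update described above can be prepared with a small helper before it is passed to the command line. This is a minimal sketch, assuming the `--traffic-split` flag takes comma-separated `DEPLOYED_MODEL_ID=PERCENT` pairs whose percentages sum to 100; the helper name `build_traffic_split_flag` and the IDs in the usage note are hypothetical.

```python
def build_traffic_split_flag(split):
    """Format {deployed_model_id: percent} as a --traffic-split flag value."""
    if not split:
        raise ValueError("traffic split must not be empty")
    for model_id, percent in split.items():
        # Each share must be a whole percentage between 0 and 100.
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(f"invalid percentage for {model_id}: {percent!r}")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"percentages must sum to 100, got {total}")
    pairs = ",".join(f"{model_id}={percent}" for model_id, percent in split.items())
    return f"--traffic-split={pairs}"
```

For example, `build_traffic_split_flag({"old_id": 0, "new_id": 100})` produces a flag that shifts all traffic to the newly deployed model. Validating the sum up front catches a malformed split before any request is sent.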

36
Multi-Selecteasy

Which TWO options are best practices for reducing model serving latency on Vertex AI Endpoints? (Choose two.)

Select 2 answers
A.Use a larger machine type with more memory
B.Optimize the model using quantization or pruning
C.Deploy the model in the same region as the clients
D.Use batch prediction instead of online prediction
E.Enable model caching at the endpoint
AnswersB, C

Reduces model size and inference time, lowering latency with minimal accuracy impact.

Why this answer

Options B and C are correct. Optimizing the model (quantization or pruning) reduces compute time without major accuracy loss. Deploying in the same region as clients reduces network latency.

Option A increases cost but does not necessarily reduce latency. Option E is not a feature of Vertex AI Endpoints. Option D increases latency due to batch processing.

37
Multi-Selecthard

A company runs batch predictions on a large dataset using Vertex AI Batch Prediction. They want to reduce costs without significantly increasing processing time. Which three actions should they take? (Choose three.)

Select 3 answers
A.Use preemptible VMs for the batch prediction job.
B.Use a larger machine type to reduce the number of workers.
C.Use custom machine types with only the necessary resources (vCPU and memory).
D.Use TPUs instead of GPUs to accelerate processing.
E.Tune the batch size to maximize throughput per worker.
AnswersA, C, E

Preemptible VMs are significantly cheaper and suitable for batch jobs.

Why this answer

Options A, C, and E are correct. A uses preemptible VMs, which are cheaper. C uses custom machine types to avoid overprovisioning.

E tunes the batch size to maximize throughput per worker. Option B increases the machine size, which may increase the cost per worker. Option D uses TPUs, which are more expensive and may not be beneficial for all model types.

38
MCQhard

You are a machine learning engineer at a retail company. You have deployed a product recommendation model on Vertex AI Prediction using a custom container. The model is a TensorFlow SavedModel that computes embeddings using a large lookup table. The endpoint is configured with 2 replicas on n1-standard-4 (4 vCPU, 15 GB memory) machines. After deployment, you notice that the endpoint's memory usage grows over time, eventually reaching 90% and causing requests to fail with 503 errors. The container logs show no errors, but the memory usage graph shows a steady increase. The model loads the embedding table (5 GB) at startup. You suspect a memory leak. Which course of action should you take first to diagnose and resolve the issue?

A.Profile the container's memory usage locally with memory_profiler to find the leak, then fix the code.
B.Reduce the number of replicas to 1 to reduce memory contention.
C.Increase the machine memory to n1-standard-8 (30 GB).
D.Restart the endpoint every hour using a Cloud Scheduler job.
AnswerA

Identifies root cause for permanent fix.

Why this answer

Option A is correct because the steady memory growth despite a fixed 5 GB embedding table indicates a memory leak in the custom container code, not a capacity issue. Profiling locally with memory_profiler allows you to trace object allocations and identify the leak source before modifying the serving code, which is the most direct diagnostic step.

Exam trap

Google Cloud often tests the distinction between scaling up resources (Option C) and fixing the root cause (Option A), tempting candidates to choose a quick capacity increase instead of proper debugging.

How to eliminate wrong answers

Option B is wrong because reducing replicas to 1 does not address the memory leak; it only reduces total cluster memory, making the leak more severe per replica. Option C is wrong because increasing machine memory to n1-standard-8 (30 GB) merely postpones the failure by providing more headroom, but the leak will eventually consume that memory as well. Option D is wrong because restarting the endpoint every hour via Cloud Scheduler is a workaround that masks the symptom without fixing the underlying code defect, and it introduces request downtime during restarts.
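
The local profiling step can be approximated with the standard library. This is a minimal sketch that swaps in `tracemalloc` for memory_profiler so it runs without extra dependencies; `handle_request` and its module-level `_cache` are a hypothetical stand-in for a leaky prediction handler, not the actual serving code.

```python
import tracemalloc

_cache = []  # hypothetical leak: grows on every request and is never cleared


def handle_request(payload):
    """Stand-in for a prediction handler that accidentally retains memory."""
    _cache.append(bytearray(10_000))
    return len(payload)


def top_growth(n_requests=200, limit=3):
    """Return the source lines whose allocations grew most across the requests."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(n_requests):
        handle_request(b"x")
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts by absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth():
        print(stat)
```

The top entry points at the line that keeps allocating, which is where to look for an unbounded cache or list in the serving code.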

39
MCQhard

Refer to the exhibit. A data scientist deploys a new model version (model_v2) to an existing endpoint with 20% traffic. After a few days, they notice that model_v2's error rate is higher than model_v1's. They want to route all traffic back to model_v1 immediately. Which command achieves this with minimal disruption?

A.gcloud ai endpoints update my-endpoint --region=us-central1 --remove-deployed-model=model_v2
B.gcloud ai endpoints undepoly-model my-endpoint --region=us-central1 --model=model_v2
C.gcloud ai endpoints update my-endpoint --region=us-central1 --traffic-split=model_v1=100,model_v2=0
D.gcloud ai endpoints update-traffic my-endpoint --region=us-central1 --model=model_v1 --traffic-percentage=100
AnswerC

This command updates the traffic split to direct 100% traffic to model_v1 and 0% to model_v2, a zero-downtime change.

Why this answer

Option C is correct. The gcloud ai endpoints update command with --traffic-split sets the split to 100% for model_v1 and 0% for model_v2, routing all traffic to the stable model without redeploying. Option A removes the model, which may cause temporary unavailability.

Option B uses a misspelled command. Option D does not change the traffic split.

40
Multi-Selecteasy

A company is deploying a machine learning model for real-time inference on Vertex AI. Which TWO practices improve serving performance and reliability?

Select 2 answers
A.Use batch prediction for all requests.
B.Enable autoscaling to handle traffic variations.
C.Use manual scaling with a fixed number of replicas.
D.Deploy all models on the same machine type for consistency.
E.Set up model monitoring for prediction drift and data quality.
AnswersB, E

Autoscaling adjusts resources dynamically.

Why this answer

Option B is correct because Vertex AI's autoscaling dynamically adjusts the number of replicas based on incoming request traffic, ensuring low latency during spikes and cost savings during lulls. This is critical for real-time inference, where consistent response times are required and manual scaling would either over-provision or under-provision resources. Autoscaling uses metrics like CPU utilization or request count to scale up or down, directly improving serving performance and reliability.

Exam trap

Google Cloud often tests the distinction between batch and real-time serving, trapping candidates who think batch prediction can be used for low-latency inference, or who assume that manual scaling is more reliable than autoscaling for variable workloads.

41
Multi-Selecteasy

Which TWO are best practices for deploying models to Vertex AI Prediction? (Choose 2.)

Select 2 answers
A.Monitor prediction latency and error rates with Cloud Monitoring alerts.
B.Log all raw prediction inputs and outputs for every request for auditing.
C.Use a dedicated service account with minimal permissions for the endpoint.
D.Always deploy the model in the same environment as training to avoid incompatibility.
E.Use the default model version alias 'default' for all deployments to simplify updates.
AnswersA, C

Essential for detecting performance issues.

Why this answer

Options A and C are correct. Option D is wrong because the exact same environment may not be available. Option E is wrong because distinct version aliases should be used for easy rollback.

Option B is wrong because logging all inputs may cause privacy issues.

42
MCQhard

You are an ML engineer at a global e-commerce company. Your team has developed a deep learning model for product recommendation that runs on Vertex AI Prediction. The model is deployed on a single n1-highmem-2 instance (CPU only) with autoscaling enabled (min replicas=1, max replicas=10). During Black Friday, traffic spikes to 1000 requests per second (QPS), and you observe that latency increases from 50ms to over 5000ms, and many requests time out. You check the monitoring dashboard and see that CPU utilization is at 100% on the single instance, and autoscaling is not triggering quickly enough. The team has a budget for this service and wants to handle the spike without compromising latency. What should you do?

A.Switch to GPU instances (e.g., n1-standard-4 with T4) and set min replicas=2 with autoscaling up to 10
B.Increase min replicas to 5 to keep warm instances
C.Set min replicas=1 and max replicas=5 to control cost
D.Increase max replicas to 20 and keep CPU instances
AnswerA

GPUs accelerate inference, reducing per-request latency; warm instances handle spike.

Why this answer

Option A is correct because switching to GPU instances (n1-standard-4 with T4) offloads compute-intensive recommendation model inference to GPUs, significantly reducing per-request latency. Setting min replicas=2 ensures that at least two instances are always warm, reducing cold-start delays and allowing autoscaling to handle traffic spikes more responsively. This combination addresses both the CPU bottleneck and the slow scaling trigger.

Exam trap

Google Cloud often tests the misconception that simply increasing the number of CPU instances or adjusting autoscaling parameters can solve a CPU-bound latency problem, when the real fix is to change the compute architecture (e.g., GPU) to match the workload's computational profile.

How to eliminate wrong answers

Option B is wrong because increasing min replicas to 5 on CPU-only instances does not resolve the fundamental CPU bottleneck; the model still runs on CPU, so each request will still suffer high latency under load, and the cost increases without performance gain. Option C is wrong because setting max replicas to 5 limits the maximum capacity to only 5 CPU instances, which cannot handle 1000 QPS without severe latency, and min replicas=1 still risks cold-start delays. Option D is wrong because increasing max replicas to 20 on CPU instances only adds more CPU-bound nodes, which still cannot process requests fast enough per instance due to the CPU bottleneck, leading to continued high latency and timeouts.

43
MCQmedium

You deploy a PyTorch model to Vertex AI Online Prediction. After deployment, you observe that inference latency is approximately 300ms per request, but the desired SLA is under 100ms. The model uses a custom container with CPU only. Which action is most likely to reduce latency to the target?

A.Deploy the model on a machine with a GPU accelerator.
B.Switch from online prediction to batch prediction.
C.Increase the min_replica_count to ensure more instances are always available.
D.Use a smaller machine type with less CPU to reduce overhead.
AnswerA

GPU can accelerate PyTorch inference significantly, reducing latency.

Why this answer

Enabling GPU acceleration can significantly speed up inference for deep learning models. Adding more CPU instances may help with throughput but not per-request latency. Switching to batch prediction changes the use case, and using a smaller instance type might reduce latency if the model is small, but GPU is more impactful.

44
MCQeasy

A company needs to serve a model with strict latency requirements (<100ms). They are using Vertex AI Prediction with CPU. During testing, latency is 150ms. What should they do?

A.Enable batching to improve throughput
B.Use a smaller machine type with more replicas
C.Export the model to TensorFlow Lite
D.Switch to a GPU machine type
AnswerD

GPUs can reduce inference latency.

Why this answer

The model's latency of 150ms exceeds the 100ms requirement. Switching to a GPU machine type (Option D) is correct because GPUs are optimized for parallel computation, significantly reducing inference latency for many ML models, especially deep learning models, compared to CPUs. Vertex AI Prediction supports GPU machine types, and this change directly addresses the latency bottleneck without altering the model or its serving configuration.

Exam trap

The trap here is that candidates confuse throughput optimization (batching or scaling replicas) with latency reduction, failing to recognize that GPUs directly address compute-bound latency while CPU-based solutions cannot meet strict sub-100ms requirements for complex models.

How to eliminate wrong answers

Option A is wrong because batching improves throughput (requests per second) by grouping multiple inference requests, but it typically increases per-request latency due to queuing and processing delays, making it unsuitable for a strict sub-100ms latency requirement. Option B is wrong because using a smaller machine type with more replicas can improve throughput and availability but does not reduce per-request inference latency; smaller machines often have less compute power, potentially increasing latency. Option C is wrong because exporting the model to TensorFlow Lite is designed for edge or mobile deployment with limited resources, not for optimizing latency in a cloud-based Vertex AI Prediction serving environment; it would require significant model conversion and may not be compatible with all model architectures.

45
MCQeasy

A company deploys a model on Vertex AI Endpoints for real-time inference. They notice latency spikes during peak hours. Which action is most effective to reduce latency without sacrificing accuracy?

A.Enable autoscaling based on CPU utilization
B.Use a larger machine type
C.Reduce model size by pruning
D.Implement client-side caching
AnswerA

Autoscaling adds instances during load spikes, maintaining low latency without sacrificing accuracy.

Why this answer

Option A is correct because enabling autoscaling based on CPU utilization dynamically adjusts the number of instances to handle traffic spikes, reducing latency. Option B increases cost without addressing scaling elasticity. Option D may help, but not during peaks if requests are unique.

Option C can reduce accuracy.

46
MCQmedium

Your organization has a large production system that uses Vertex AI Prediction for an NLP model with a 2 GB memory footprint. The endpoint is configured with 5 replicas, each using an n1-standard-4 with a single T4 GPU. Recently, you observed an increase in 503 errors during peak hours. Cloud Monitoring shows that GPU utilization is consistently above 90% across all replicas, while CPU and memory are below 50%. You have already increased the max replicas to 10, but the errors persist because the increased replicas also become saturated. What should you do to resolve the issue?

A.Switch to a larger GPU such as V100 or A100 to increase per-replica throughput.
B.Implement request batching in the custom container to improve GPU utilization efficiency.
C.Enable model parallelism across multiple GPUs within each replica.
D.Use a high-memory machine type like n1-highmem-16 to reduce memory pressure.
AnswerD

High-memory machines address memory bottlenecks; memory is likely the real issue given GPU saturation.

Why this answer

Option D is correct because CPU bottlenecks cause high latency; switching to a machine type with more CPU cores (e.g., n1-highcpu-16) reduces CPU contention. Option A adds memory but not CPU. Option B uses more replicas but each already saturated.

Option C is irrelevant; batch processing is not in use.

47
MCQmedium

You are deploying a scikit-learn model for online predictions. The model size is 200 MB. You want to minimize latency and cost. Which serving option should you choose?

A.Deploy to Vertex AI online prediction using a prebuilt container for scikit-learn.
B.Use Cloud Run with a custom container.
C.Create a Kubernetes cluster on GKE and deploy the model there.
D.Export the model as a Cloud Function.
AnswerA

Vertex AI provides optimized containers and autoscaling for online prediction.

Why this answer

Vertex AI online prediction with a prebuilt scikit-learn container is suitable for this model. Vertex AI hosts the container and scales it, so there is no custom serving image to build. Cloud Functions with a 200 MB model might hit limits.

48
MCQmedium

A data scientist uses Vertex AI Workbench to train a model and then deploys it to an endpoint. They want to automate the retraining and redeployment pipeline when new data arrives. Which service should they use?

A.Cloud Composer
B.Vertex AI Pipelines
C.Cloud Scheduler
D.Cloud Functions
AnswerB

Vertex AI Pipelines is purpose-built for ML workflows, allowing easy automation of retraining and redeployment.

Why this answer

Option B is correct because Vertex AI Pipelines provides a serverless, managed pipeline orchestration service that can automate retraining and redeployment. Option A (Cloud Composer) is a workflow orchestration service but is more complex and not as integrated with Vertex AI. Option D (Cloud Functions) is event-driven but lacks pipeline capabilities.

Option C (Cloud Scheduler) is for scheduled jobs, not event-driven retraining.

49
MCQhard

You are troubleshooting a Vertex AI endpoint for a customer. The exhibit shows the endpoint configuration. The customer reports that Model A is experiencing high latency during peaks. Model B runs fine. What is the most likely cause?

A.Model A is not autoscaling properly due to minReplicaCount=1.
B.Model A's machine type has insufficient CPU and GPU for the load.
C.Dedicated endpoint is disabled, causing resource sharing between models.
D.The traffic split is unevenly balanced, causing Model A to receive more requests.
AnswerB

Model A uses n1-standard-4 with 1 GPU, while Model B uses n1-standard-8 with 2 GPUs.

Why this answer

Model A has only one GPU and fewer CPU cores compared to Model B. During high traffic, Model A's resources become a bottleneck. The traffic split is equal, so both get similar load, but Model A's hardware is weaker.

50
MCQeasy

A company needs to serve a model for real-time predictions with a strict latency SLA of 100ms at the 99th percentile. The model is lightweight and traffic patterns are highly variable with occasional spikes. Which deployment strategy best meets the SLA while controlling cost?

A.Deploy the model as a Cloud Run service with autoscaling to zero.
B.Deploy to Vertex AI Endpoint with manual scaling and a fixed number of replicas.
C.Use Vertex AI Batch Prediction.
D.Deploy to Vertex AI Endpoint with min_replica_count=3 and autoscaling enabled.
AnswerD

Min replicas provide baseline capacity to absorb spikes, and autoscaling adds replicas as needed.

Why this answer

Option D is correct because setting a minimum number of replicas ensures baseline capacity to handle initial spikes without cold start delays, while autoscaling handles larger spikes. Option C is wrong because batch prediction is not real-time. Option B is wrong because a fixed replica count leads to either over-provisioning or under-provisioning.

Option A is wrong because Cloud Run scaling to zero incurs cold starts and, with no accelerator, may not meet the latency SLA for ML models.

51
MCQhard

You manage a multi-tenant serving system on Vertex AI Prediction where multiple models are deployed in a single endpoint using model versioning. One particular model version (v2) is consuming excessive resources, causing latency spikes for other versions. You need to isolate this model to prevent interference. The models are all in TensorFlow SavedModel format. What is the best approach?

A.Shard the models across multiple replicas using a custom routing logic in the container.
B.Set resource limits on the container using Kubernetes resource requests/limits, but Vertex AI Prediction does not support that.
C.Use Vertex AI Model Registry to deploy v2 to a dedicated endpoint and update the model alias.
D.Create a separate endpoint for v2 and redirect traffic using a load balancer.
AnswerC

Dedicated endpoint ensures resource isolation.

Why this answer

Option C is correct because deploying v2 to a dedicated endpoint provides full resource isolation. Option D is similar but less direct, since it adds a load balancer that is not needed. Option B is not possible in Vertex AI Prediction.

Option A is complex and error-prone.

52
MCQmedium

A team wants to deploy two versions of a model (v1 and v2) on Vertex AI Endpoint to conduct an A/B test. They need to split traffic so that 10% of requests go to v2. Which configuration achieves this?

A.Deploy both versions on the same endpoint and use the `traffic_split` parameter to allocate 90% to v1 and 10% to v2.
B.Configure a global load balancer in front of two endpoints and set the weight.
C.Create two separate endpoints, one for each version, and have the client randomly select the endpoint.
D.Deploy v2 as a canary deployment and set the canary rollout to 10% in Cloud Deployment Manager.
AnswerA

Vertex AI endpoints support traffic splitting between deployed models.

Why this answer

Option A is correct because Vertex AI Endpoints support traffic splitting by allocating a percentage to each deployed model. Option D is wrong because a canary deployment rolls out gradually rather than holding a fixed split. Option C is wrong because separate endpoints cannot share a traffic split.

Option B is wrong because routing at a load balancer is not necessary.
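
What a 90/10 split means for individual requests can be illustrated with a small simulation. This is a minimal sketch in plain Python, not the Vertex AI routing implementation; `simulate_split`, the seed, and the request count are assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_split(traffic_split, n_requests=10_000, seed=0):
    """Route simulated requests to models in proportion to their percentages."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    models = list(traffic_split)
    weights = [traffic_split[m] for m in models]
    picks = rng.choices(models, weights=weights, k=n_requests)
    return Counter(picks)


counts = simulate_split({"v1": 90, "v2": 10})
share_v2 = counts["v2"] / sum(counts.values())
print(f"v2 received {share_v2:.1%} of simulated requests")
```

Each request is routed independently, so the observed share for v2 hovers around 10% rather than hitting it exactly. This matters when deciding how long an A/B test must run before the two versions can be compared.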

53
MCQeasy

A company has deployed a TensorFlow model on Vertex AI Prediction for real-time inference. They notice that during peak hours, the prediction latency increases significantly, and some requests time out. The model requires GPU acceleration. Which action should they take to reduce latency and avoid timeouts?

A.Enable autoscaling with min replicas set to the base load and max replicas set to handle peak load, and ensure GPU quota is sufficient.
B.Switch to a larger machine type with more vCPUs.
C.Increase the number of replicas in the Vertex AI Prediction endpoint statically to handle peak load.
D.Use Cloud Functions to invoke the model asynchronously.
AnswerA

Autoscaling adjusts replicas dynamically, and sufficient GPU quota prevents resource bottlenecks.

Why this answer

Option A is correct because enabling autoscaling with appropriate min and max replicas dynamically adjusts capacity to handle peak load, and ensuring sufficient GPU quota prevents resource constraints. Option C is wrong because statically increasing replicas wastes resources during low traffic and cannot adapt if the peak exceeds the fixed capacity. Option B is wrong because adding vCPUs does not address GPU-bound inference.

Option D is wrong because Cloud Functions is not designed for GPU-accelerated inference and introduces additional latency.

54
Multi-Selecthard

An ML engineer is deploying a large BERT-based natural language processing model for real-time inference on Vertex AI Prediction. The model has a large memory footprint (2GB) and experiences unpredictable traffic spikes up to 10x the baseline. The engineer needs to minimize latency and cost while handling spiky traffic. Which TWO actions should the engineer take? (Choose two.)

Select 2 answers
A.Configure the endpoint to use manual scaling with a fixed number of replicas equal to peak traffic.
B.Enable automatic scaling with a maximum of 3 replicas to limit cost.
C.Use a custom prediction routine with model quantization to reduce model size.
D.Set up model monitoring to detect prediction drift and retrain regularly.
E.Use a GPU machine type (NVIDIA T4) to accelerate inference.
AnswersC, E

Quantization reduces model size and inference latency, improving both cost and speed.

Why this answer

Option C is correct because model quantization reduces the memory footprint of a BERT model (e.g., from 2GB to ~500MB with INT8 quantization), which directly lowers inference latency and cost by enabling faster loading and more efficient use of hardware. This is critical for real-time inference with unpredictable traffic spikes, as smaller models scale more easily and reduce the need for excessive replicas.
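
The 2GB to ~500MB figure follows from bytes per weight: FP32 stores 4 bytes per parameter and INT8 stores 1. The sketch below works through that arithmetic. The parameter count is a hypothetical value chosen so the FP32 weights total 2 GB, and the estimate covers weights only; real quantized models often keep some layers in higher precision, so actual savings can be smaller.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_size_gb(num_params, dtype):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical parameter count chosen so that FP32 weights total 2 GB.
num_params = 500_000_000
fp32_gb = weights_size_gb(num_params, "fp32")
int8_gb = weights_size_gb(num_params, "int8")
print(f"FP32: {fp32_gb:.1f} GB, INT8: {int8_gb:.1f} GB")
```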

Exam trap

The trap here is that candidates often assume GPU acceleration (Option E) is enough on its own, or pick a replica cap (Option B) to control cost. The question asks to minimize both latency and cost, so the two selected actions work together: quantization (Option C) shrinks the 2GB model so it loads faster and needs fewer resources per replica, while the T4 GPU (Option E) accelerates inference.

55
Matchingmedium

Match each optimization algorithm to its characteristic.

Concepts
Matches

Stochastic gradient descent with constant learning rate

Adaptive moment estimation with per-parameter learning rates

Root mean square propagation, adapts learning rate per parameter

Adaptive gradient algorithm, reduces learning rate for frequent features

Accelerates SGD by adding a fraction of previous update

Why these pairings

Optimizers affect training convergence.

56
MCQhard

A company serves a PyTorch model using a custom container on Vertex AI Prediction. They notice that after a few hours, the endpoint returns 502 errors. The logs show 'Out of memory' errors. The container has a memory limit of 4GB, and the model loads a 3GB vocabulary file. What is the most likely cause and best fix?

A.Increase the container memory to 8GB.
B.Load the vocabulary file once at startup and reuse it.
C.Increase the number of replicas to distribute load.
D.Switch to Vertex AI Batch Prediction.
AnswerB

Prevents repeated loading, solving OOM.

Why this answer

The 502 errors and 'Out of memory' errors indicate that the container is running out of memory during inference. Since the model loads a 3GB vocabulary file, and the container has only 4GB of memory, loading this file repeatedly for each prediction request (e.g., inside the prediction handler) would quickly exhaust memory. The correct fix is to load the vocabulary file once at container startup and reuse it across all requests, which is a standard best practice for serving models with large static assets.

Exam trap

Google Cloud often tests the misconception that OOM errors are always solved by increasing memory, but the trap here is that the real issue is inefficient resource reuse—loading a large file per request—rather than insufficient total memory.

How to eliminate wrong answers

Option A is wrong because simply increasing memory to 8GB does not address the root cause—the vocabulary file is being loaded repeatedly, which will still cause memory bloat and eventual OOM errors, just at a higher threshold. Option C is wrong because increasing replicas distributes incoming traffic but does not fix the per-container memory leak caused by repeated vocabulary loading; each replica would still suffer the same OOM issue. Option D is wrong because switching to Batch Prediction is for offline, asynchronous processing, not for real-time serving, and does not solve the memory management problem within the container.
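
A minimal sketch of the load-once pattern described above, using only the standard library. The names `get_vocabulary` and `predict` are hypothetical, and the small dictionary stands in for the 3GB vocabulary file; a real container would typically do the load in its server's startup hook.

```python
import functools

LOAD_CALLS = 0  # counts how often the expensive load actually runs


@functools.lru_cache(maxsize=1)
def get_vocabulary():
    """Load the vocabulary once; later calls reuse the cached object."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Stand-in for reading the large vocabulary file from disk.
    return {"hello": 0, "world": 1}


def predict(tokens):
    """Hypothetical request handler: it never reloads the vocabulary."""
    vocab = get_vocabulary()
    return [vocab.get(token, -1) for token in tokens]


get_vocabulary()  # warm the cache at container startup
for _ in range(1000):
    predict(["hello", "unknown"])
print(LOAD_CALLS)  # prints 1
```

Because the handler only reads from the cached object, memory use stays flat no matter how many requests arrive.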

57
MCQmedium

A company deploys a model on Vertex AI Prediction with autoscaling enabled. They notice that during a traffic spike, new instances take several minutes to become available, causing high latency. What is the best solution?

A.Disable autoscaling and use a fixed number of replicas
B.Increase the max replicas setting
C.Decrease the machine type to reduce provisioning time
D.Set a higher min replicas to maintain a baseline of warm instances
AnswerD

Warm instances reduce latency during spikes.

Why this answer

Option D is correct because setting a higher min replicas ensures that a baseline number of instances are always warm and ready to serve traffic. During a traffic spike, new instances still take time to provision (cold start), but the warm instances handle the initial surge without latency spikes. This directly addresses the observed high latency during spikes.

Exam trap

Google Cloud often tests the misconception that increasing max replicas or decreasing machine type solves cold-start latency, when the real solution is maintaining a warm baseline via min replicas.

How to eliminate wrong answers

Option A is wrong because disabling autoscaling and using a fixed number of replicas eliminates elasticity, leading to either over-provisioning (cost) or under-provisioning (latency) during variable traffic. Option B is wrong because increasing max replicas only raises the ceiling for scaling out; it does not reduce the cold-start provisioning time for new instances during a spike. Option C is wrong because decreasing the machine type reduces compute capacity per instance, which can increase latency under load, and does not meaningfully reduce provisioning time (which is dominated by container image pull and model loading, not machine type).

58
MCQhard

Your team has deployed a PyTorch model using a custom container on Vertex AI Prediction. The model uses dynamic batching to combine incoming requests. You notice that the average latency is 150 ms, but the 99th percentile latency is 2 seconds. Cloud Monitoring shows that the CPU is idle much of the time, and GPU utilization is around 70%. The model is deployed on a single n1-standard-4 with a T4 GPU. You suspect the issue is related to request queuing. Which change would most effectively reduce tail latency?

A.Add a second replica to share the load.
B.Increase the batch timeout to allow larger batches to form, reducing the number of batches.
C.Decrease the batch size to reduce processing time per batch.
D.Implement a priority queue to handle high-priority requests first.
AnswerA

More replicas reduce queue depth and tail latency.

Why this answer

Option A is correct because adding a replica reduces the queue length per replica, which shortens the time requests spend waiting. Option B might increase tail latency, because a longer batch timeout makes requests wait longer before processing starts. Option C could reduce processing time per batch but does not address queuing delay.

Option D adds complexity and doesn't address root cause.
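
To see why queue depth drives tail latency, the sketch below simulates a first-come-first-served queue with one and then two replicas. It is a simplified model under stated assumptions: a fixed 150 ms service time, no batching, steady arrivals every 200 ms, and a hypothetical burst of 20 extra requests at the 10-second mark. All names are illustrative.

```python
import heapq
import math


def simulate_latencies(arrivals_ms, service_ms, replicas):
    """Serve requests in arrival order on whichever replica frees up first."""
    free_at = [0] * replicas  # time at which each replica becomes idle
    heapq.heapify(free_at)
    latencies = []
    for arrival in arrivals_ms:
        start = max(arrival, heapq.heappop(free_at))
        finish = start + service_ms
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Steady traffic every 200 ms plus a burst of 20 requests at t = 10 s.
arrivals = sorted([i * 200 for i in range(100)] + [10_000] * 20)

one = simulate_latencies(arrivals, service_ms=150, replicas=1)
two = simulate_latencies(arrivals, service_ms=150, replicas=2)
p99_one, p99_two = percentile(one, 99), percentile(two, 99)
print("p99 with 1 replica:", p99_one, "ms")
print("p99 with 2 replicas:", p99_two, "ms")
```

In this model the second replica roughly halves the time a request waits behind the burst, which is the effect the explanation describes.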

59
MCQhard

Your team is serving a large language model on Vertex AI using a custom container. The endpoint experiences intermittent 502 errors during traffic spikes. The autoscaling configuration uses a CPU utilization target of 60% and the model is deployed on n1-standard-4 instances. The model requires significant memory. Which combination of changes is most likely to resolve the issue?

A.Increase the target CPU utilization to 90% to allow more requests per instance.
B.Switch to a machine type with more memory, e.g., n1-highmem-8, and increase min_replica_count.
C.Enable canary traffic splitting to reduce load on the main endpoint.
D.Reduce the model batch size from 32 to 1 to lower memory per request.
AnswerB

High memory instances reduce memory contention, and more replicas absorb traffic spikes.

Why this answer

The 502 errors likely indicate the instances are overwhelmed or timing out. Increasing the machine type to a high-memory instance reduces memory pressure, and adding more replicas through a lower target scaling metric or higher min replicas provides capacity. Tuning batch size helps but is secondary.

GPU may not help if the issue is memory.

60
MCQmedium

A company deploys a model on Vertex AI Endpoint and expects high traffic spikes during promotional events. The current configuration uses manual scaling with 2 replicas. Which autoscaling configuration should they use to handle spikes while minimizing cost during normal traffic?

A.Keep manual scaling but increase replicas to 10.
B.Set min_replica_count=2 and max_replica_count=10 with no scaling metric.
C.Enable basic scaling with target_cpu_utilization=0.6 and set min_replica_count=2, max_replica_count=10.
D.Use custom metric scaling with a Cloud Monitoring metric for prediction latency.
AnswerC

Basic scaling adjusts replicas based on CPU load.

Why this answer

Option C is correct because basic scaling with a target metric (e.g., CPU utilization) automatically adjusts replicas based on load, reducing cost during low traffic and scaling up during spikes. Option B is wrong because, without a scaling metric, the endpoint has no signal to adapt to. Option A is wrong because manual scaling requires constant adjustment and pays for 10 replicas even during normal traffic.

Option D is wrong because custom metric scaling is possible but basic scaling is simpler and sufficient for CPU-bound models.
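
The sketch below shows how a utilization target can translate into a replica count. It uses the proportional rule familiar from the Kubernetes Horizontal Pod Autoscaler; treating Vertex AI as behaving this way is an assumption for illustration, and a real autoscaler also applies stabilization windows that are not modeled here. The function name and sample utilization values are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas, max_replicas):
    """Scale replicas by observed/target utilization, then clamp to limits."""
    wanted = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, wanted))


# Target of 60% CPU with min 2 and max 10 replicas, as in Option C.
normal = desired_replicas(2, 30, 60, min_replicas=2, max_replicas=10)
spike = desired_replicas(2, 95, 60, min_replicas=2, max_replicas=10)
extreme = desired_replicas(8, 95, 60, min_replicas=2, max_replicas=10)
print(normal, spike, extreme)  # prints 2 4 10
```

The minimum keeps a baseline during quiet periods, and the maximum caps spend during the largest spikes.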

61
MCQhard

A company uses Vertex AI Endpoints for model serving and wants to implement A/B testing between model versions. They need to gradually shift traffic from the old to the new version while monitoring performance. Which Vertex AI feature allows this with minimal operational overhead?

A.Using a custom load balancer with weighted backend services
B.Model Deployments with traffic splitting
C.Vertex AI Experiments for tracking
D.Cloud Run revisions with traffic migration
AnswerB

Why this answer

Option B is correct because Vertex AI Endpoints allow deploying multiple model versions to the same endpoint and setting a traffic split percentage that can be gradually adjusted. Option D is wrong because Cloud Run revisions are not a Vertex AI feature. Option A is possible but adds operational overhead.

Option C is wrong because Vertex AI Experiments is for tracking experiments, not serving traffic.

62
Multi-Selecthard

A team is serving a large language model (LLM) on Vertex AI using a custom container. They want to reduce tail latency. Which THREE strategies should they consider?

Select 3 answers
A.Increase the number of replicas.
B.Use dynamic batching to combine requests.
C.Implement response caching for common queries.
D.Quantize the model to INT8 to reduce computation.
E.Upgrade to a more powerful GPU type.
AnswersB, C, D

Improves GPU utilization and reduces per-request latency.

Why this answer

Dynamic batching (B) reduces tail latency by grouping multiple inference requests into a single batch, which improves GPU utilization and amortizes overhead across requests. This is particularly effective for LLMs because it allows the model to process more tokens per forward pass, reducing the per-request latency variance that contributes to tail latency.

Exam trap

The trap here is that candidates confuse scaling strategies (like increasing replicas or upgrading hardware) with latency-optimization techniques, failing to recognize that tail latency is primarily reduced by batching and caching, not by adding more compute resources.

63
Matchingmedium

Match each feature engineering technique to its description.

Concepts
Matches

Convert categorical variable into binary columns

Combine two or more features to capture interactions

Normalize numeric features to a standard range

Group continuous values into discrete intervals

Weight term frequency by inverse document frequency

Why these pairings

Feature engineering is essential for model performance.

64
Multi-Selectmedium

A team wants to serve a large PyTorch model (3 GB) for online predictions with low latency. Which THREE actions should they take?

Select 3 answers
A.Use a custom container that preloads the model into memory.
B.Use batch prediction instead of online prediction.
C.Use a machine type with a GPU accelerator.
D.Optimize the model using TorchScript or quantization.
E.Deploy in multiple regions with Cloud Load Balancing.
AnswersA, C, D

Preloading avoids loading model on each request, reducing latency.

Why this answer

Options A, C, and D are correct. Option A: a custom container that preloads the model avoids loading it on each request and reduces cold start latency. Option C: a GPU accelerator speeds up inference.

Option D: model optimization (TorchScript, quantization) reduces inference time. Option E (multi-region deployment) reduces network latency, not prediction latency. Option B (batch prediction) is not suitable for online serving.

65
MCQeasy

A company deploys a model on Vertex AI Prediction for real-time inference. Users report intermittent high latency during peak hours. The model is deployed on a single machine type with `min_replica_count=1` and `max_replica_count=5`. Autoscaling is enabled based on CPU utilization. What is the most likely cause of the latency spikes?

A.The model server is crashing under load due to memory issues.
B.Autoscaling based on CPU utilization does not react quickly to inference request spikes.
C.The load balancer is misconfigured and routes traffic unevenly.
D.The container image is not optimized for the model.
AnswerB

CPU utilization may lag behind request surges; Vertex AI recommends using target utilization or custom metrics for faster response.

Why this answer

Option B is correct because CPU utilization may not be a good proxy for inference load; the system may not scale up fast enough under sudden traffic bursts. Option A is wrong because Vertex AI automatically manages container health. Option C is wrong because Vertex AI endpoints automatically distribute traffic.

Option D is wrong because nothing in the scenario points to a problem with the container image.

66
MCQeasy

A company has deployed a computer vision model on Vertex AI Prediction using a custom container. The model processes high-resolution images and serves predictions to a mobile application. Recently, users have reported that predictions sometimes take over 10 seconds, and the application times out. The ML engineer's monitoring shows that the endpoint's CPU utilization is consistently high (above 85%) and that the request latency spikes during peak hours. The model is deployed on n1-standard-4 machines with automatic scaling set to minReplicaCount=1 and maxReplicaCount=5. The engineer has observed that the endpoint rarely scales beyond 2 replicas even during peak hours. What should the engineer do to reduce prediction latency?

A.Increase the maxReplicaCount to 20 to allow more instances during spikes.
B.Review the custom container's startup time and consider pre-warming or reducing model loading time.
C.Change the machine type to a higher CPU machine like n1-standard-8.
D.Set the minReplicaCount to 5 to ensure enough capacity at all times.
AnswerB

The endpoint rarely scales beyond 2 replicas due to slow container startup, causing CPU overload on existing instances. Reducing startup time or pre-warming enables faster scaling and lower latency.

Why this answer

Option B is correct because the root cause is likely that the custom container takes a long time to start (model loading), preventing the endpoint from scaling quickly. Pre-warming or reducing model loading time addresses this directly. Option A (increasing max replicas) does not solve the scaling delay.

Option C (upgrading machine type) may help but does not address the scaling speed. Option D (increasing min replicas) would be costly and still not handle sudden spikes if new replicas start slowly.

67
MCQhard

Your team deploys a multi-model endpoint on Vertex AI with two models: Model A (small, low latency) and Model B (large, high latency). You configure traffic splitting so that 90% goes to Model A and 10% to Model B. However, you notice that the latency for Model A increases when Model B receives traffic. What is the most likely cause?

A.Model A is being overloaded because autoscaling is based on aggregate traffic.
B.The traffic split is misconfigured, causing requests to be routed incorrectly.
C.The models are collocated on the same instances, leading to resource contention.
D.Model B's logging is generating too much output, slowing down the predictor.
AnswerC

Multi-model endpoints share replicas; Model B's work impacts Model A.

Why this answer

In a multi-model endpoint, all models share the underlying infrastructure. When Model B handles requests, it consumes resources (CPU/memory), causing contention that degrades Model A's latency. Collocation of models on the same instance is the issue.

68
MCQhard

Your model serving endpoint on Vertex AI is experiencing increased memory usage after a recent update. The model was converted from TensorFlow to TF Lite for faster inference. You notice that the endpoint's instances occasionally get killed due to out-of-memory (OOM) errors. What is the most likely cause?

A.The TF Lite model is larger in size than the original model.
B.The Vertex AI endpoint is not configured with enough CPU.
C.The number of inference threads in the TF Lite runtime is set too high, causing memory consumption.
D.The traffic to the endpoint has increased significantly.
AnswerC

TF Lite can use multiple threads; excessive threads increase memory.

Why this answer

TF Lite models can have different memory footprint depending on the number of threads used for inference. If the custom container or the runtime allocates many threads, memory usage can spike. The model conversion itself may not reduce memory; thread count is a key factor.

69
MCQeasy

A company is serving a model for their e-commerce website. They expect traffic to be low at night and very high during flash sales. They want to minimize costs while ensuring availability during spikes. Which autoscaling configuration should they use?

A.min_replica_count=5, max_replica_count=5, target_cpu=60
B.min_replica_count=1, max_replica_count=20, target_cpu=60
C.min_replica_count=10, max_replica_count=10, target_cpu=60
D.min_replica_count=0, max_replica_count=100, target_cpu=80
AnswerB

Scales from 1 to 20 based on load, cost-efficient.

Why this answer

Setting a high max_replica_count allows scaling to handle spikes, while a low min_replica_count saves cost during low traffic. CPU utilization target of 60% is reasonable.
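
A small worked comparison makes the cost argument concrete. The sketch below counts replica-hours over one hypothetical day (22 quiet hours needing 1 replica and 2 flash-sale hours needing 18) for the fixed-size options and the autoscaled one. The demand profile is invented for illustration, and the model assumes the autoscaler tracks demand perfectly within its limits.

```python
def replica_hours(demand, min_replicas, max_replicas):
    """Sum the replicas running each hour, clamped to the configured range."""
    return sum(max(min_replicas, min(max_replicas, need)) for need in demand)


def unmet_replica_hours(demand, max_replicas):
    """Replica-hours of demand that the configured maximum cannot cover."""
    return sum(max(0, need - max_replicas) for need in demand)


# Hypothetical day: 22 quiet hours needing 1 replica, 2 peak hours needing 18.
demand = [1] * 22 + [18] * 2

configs = {"A (5 to 5)": (5, 5), "B (1 to 20)": (1, 20), "C (10 to 10)": (10, 10)}
for name, (low, high) in configs.items():
    used = replica_hours(demand, low, high)
    unmet = unmet_replica_hours(demand, high)
    print(f"{name}: {used} replica-hours, {unmet} unmet")
```

Under these assumptions the autoscaled range uses 58 replica-hours and covers the peak, while the fixed configurations use 120 and 240 replica-hours and still fall short during the flash sale.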

70
MCQeasy

A company has deployed a fraud detection model on Vertex AI Prediction. After three months, the model's accuracy has degraded, and the business is losing money due to undetected fraud. What should the team implement to proactively detect such issues?

A.Enable Vertex AI Model Monitoring to track prediction drift and alert when metrics exceed thresholds.
B.Set up Cloud Logging to capture all prediction requests and responses for manual review.
C.Randomly shuffle the training data before retraining to improve robustness.
D.Schedule a monthly job to retrain the model with the latest data without monitoring.
AnswerA

Model Monitoring automatically analyzes input distributions and prediction quality over time.

Why this answer

Option A is correct because monitoring prediction drift is a key practice for maintaining model quality. Option B is wrong because logs alone do not automatically detect drift. Option D is wrong because scheduled retraining without monitoring does not detect degradation.

Option C is wrong because shuffling training data does not address drift.
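
To make "drift exceeds a threshold" concrete, the sketch below computes an L-infinity distance between the category shares of a training sample and a recent serving sample, one common distance measure for categorical features. The feature values, the `DRIFT_THRESHOLD` of 0.2, and the function names are hypothetical; a managed monitoring service computes and alerts on such scores for you.

```python
from collections import Counter


def category_shares(values):
    """Fraction of the sample that falls in each category."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


def l_infinity_distance(baseline, serving):
    """Largest absolute difference in category share between two samples."""
    p, q = category_shares(baseline), category_shares(serving)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


# Hypothetical payment-method feature at training time versus recent traffic.
training = ["card"] * 80 + ["wire"] * 20
recent = ["card"] * 50 + ["wire"] * 50

DRIFT_THRESHOLD = 0.2  # illustrative value, tuned per feature in practice
score = l_infinity_distance(training, recent)
if score > DRIFT_THRESHOLD:
    print(f"drift alert: score {score:.2f} exceeds {DRIFT_THRESHOLD}")
```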

71
MCQhard

A team deploys a real-time model using a custom container on Vertex AI Prediction. The container is large (5 GB) and cold starts are causing latency spikes. The endpoint is configured with `min_replica_count=0` to reduce cost. The team wants to keep the cost low while reducing cold starts. What is the best approach?

A.Set `min_replica_count=1` to keep at least one replica always warm.
B.Use a prebuilt container for the model framework to reduce image size.
C.Enable container memory optimization to reduce startup time.
D.Provision a Persistent Disk (SSD) for the container image to speed up download.
AnswerA

A single warm replica handles traffic immediately while autoscaling adds more.

Why this answer

Option A is correct because configuring a minimum number of always-on replicas (e.g., 1) eliminates cold starts for most traffic. Option C is wrong because it may not help when the container itself is large. Option B is wrong because a prebuilt container does not remove the cold start overhead of starting a replica from zero.

Option D is wrong because SSD can help but not eliminate cold start latency.

72
MCQeasy

A machine learning engineer wants to manage multiple model versions and facilitate collaboration across teams. The goal is to track model lineage, versioning, and approvals. Which Vertex AI service should they use?

A.Vertex AI Model Registry
B.Vertex AI ML Metadata
C.Vertex AI Feature Store
D.Vertex AI Vizier
AnswerA

Model Registry is designed for model versioning, lifecycle management, and collaboration.

Why this answer

Option A is correct because Model Registry provides versioning, approval tracking, and integration with Vertex AI Pipelines. Option C is wrong because Feature Store stores features, not models. Option B is wrong because ML Metadata is lower-level and less user-friendly.

Option D is wrong because Vizier is for hyperparameter tuning.

73
MCQmedium

You run the above command to deploy a new model version to an existing endpoint. After deployment, you observe that the endpoint's previous model version is still receiving 100% of traffic. What is the most likely reason for this?

A.The new model is still in the 'creating' state and hasn't been activated.
B.The model ID provided does not exist in the endpoint.
C.The --traffic-split flag is specified incorrectly; it should use model IDs, not '0-100'.
D.The min-replica-count is too high, preventing traffic splitting.
AnswerC

Correct syntax requires model IDs with percentages.

Why this answer

The traffic-split flag syntax is incorrect. The correct syntax for Vertex AI is --traffic-split=<model-id>=<percentage> for each model. Without correct model IDs, the flag is ignored, and no traffic split is applied, so the existing version continues to receive all traffic.
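
The sketch below checks a traffic-split string of the form described above, `<model-id>=<percentage>` pairs separated by commas. It is a hypothetical validator written for illustration, not the parser the command-line tool uses, and unlike the behavior described in the explanation it rejects malformed input instead of ignoring it. The model IDs in the example are made up.

```python
def parse_traffic_split(spec):
    """Parse 'ID=PERCENT,ID=PERCENT' into a dict and validate it."""
    split = {}
    for item in spec.split(","):
        model_id, sep, percent = item.partition("=")
        if not sep or not model_id or not percent.isdigit():
            raise ValueError(f"expected ID=PERCENT, got {item!r}")
        split[model_id] = int(percent)
    if sum(split.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return split


# Hypothetical deployed model IDs.
print(parse_traffic_split("1234=90,5678=10"))

try:
    parse_traffic_split("0-100")  # the malformed form from the question
except ValueError as error:
    print("rejected:", error)
```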

74
MCQeasy

A startup wants to deploy a small machine learning model for real-time predictions but has a very limited budget. Traffic is minimal and predictable. They want to avoid paying for idle resources. Which serving option is most cost-effective?

A.Deploy the model on a single Compute Engine VM with a GPU.
B.Use Vertex AI Batch Prediction for each prediction request.
C.Deploy the model as a Cloud Run service using a custom container.
D.Deploy the model to Vertex AI Endpoint with min_replica_count=0.
AnswerC

Cloud Run scales to zero and charges only when serving requests.

Why this answer

Option C is correct because Cloud Run with a custom container can scale to zero when idle, incurring no cost when not in use. Option D is wrong because Vertex AI Endpoint requires at least one replica (min_replica_count >= 1). Option B is wrong because batch prediction is not real-time.

Option A is wrong because a Compute Engine VM incurs cost 24/7, even when idle.

75
MCQhard

A model serving team notices that during a flash sale, a real-time recommendation model experiences sudden spikes in traffic, causing some requests to time out. The endpoint is configured with `min_replica_count=3`, `max_replica_count=10`, and autoscaling metric set to `target_utilization=0.6` on CPU. Despite this, autoscaling is too slow. What change will most improve the autoscaling responsiveness?

A.Add a custom metric based on GPU utilization, assuming the model uses GPU.
B.Increase the target CPU utilization to 0.8 to reduce the number of replicas and save cost.
C.Reduce `min_replica_count` to 1 to allow more aggressive scaling.
D.Change the autoscaling metric to 'average request count per replica' with an appropriate target.
AnswerD

Request count directly reflects load and scales more quickly than CPU.

Why this answer

Option D is correct because request count per replica is a direct measure of load, so it triggers autoscaling faster than CPU utilization. Option B is wrong because increasing the target utilization delays scale-out. Option A is wrong because GPU metrics are only relevant for GPU models, and the scenario does not say the model uses a GPU.

Option C is wrong because reducing min replicas may cause underprovisioning.
