
CCNA Serving and scaling models Questions


76

MCQ · medium

A team deploys a model on Vertex AI that uses a custom prediction routine (CPR) with a dependency on a native library. The container crashes with 'ImportError: libcudart.so.11.0: cannot open shared object file'. How should they resolve this?

A. Build a custom container image that includes the CUDA runtime library.

B. Submit the model for batch prediction to avoid the error.

C. Request a GPU machine type for the endpoint.

D. Use a Vertex AI pre-built container for PyTorch instead.

Answer: A

Ensures the library is available.

Why this answer

The error 'ImportError: libcudart.so.11.0: cannot open shared object file' indicates that the CUDA runtime library (version 11.0) is missing from the container environment. Since the custom prediction routine (CPR) depends on a native library that requires this CUDA runtime, the correct solution is to build a custom container image that includes the CUDA runtime library. This ensures the shared object is available at runtime, resolving the import error.

Exam trap

Google Cloud often tests the misconception that requesting a GPU machine type automatically provides the necessary CUDA libraries, but in reality, the CUDA runtime must be explicitly included in the container image, as the GPU machine type only provides the hardware and driver, not the user-space libraries.

How to eliminate wrong answers

Option B is wrong because submitting the model for batch prediction does not change the container environment; the same missing CUDA runtime library will cause the same ImportError during batch prediction. Option C is wrong because requesting a GPU machine type for the endpoint provides GPU hardware but does not install the CUDA runtime library into the container; the library must be present in the container image regardless of the underlying hardware. Option D is wrong because using a Vertex AI pre-built container for PyTorch does not guarantee inclusion of the specific CUDA runtime version 11.0 required by the native library; the pre-built container may have a different CUDA version or omit the library entirely.
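One way to confirm the fix before redeploying is to check, inside the container image, whether the dynamic loader can find the library. This is a minimal sketch, assuming Python is available in the image; the helper name `shared_library_available` is hypothetical and not part of any Vertex AI tooling.

```python
import ctypes


def shared_library_available(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the loader cannot find or open the shared object.
        return False


if __name__ == "__main__":
    # Library name taken from the ImportError in the question.
    print(shared_library_available("libcudart.so.11.0"))
```

Running this in the image that will be deployed, rather than on a development machine, is what makes the check meaningful, since the error is about the container environment.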


77

MCQ · easy

You are using Vertex AI Training to train a model and then automatically deploy the best candidate to a Vertex AI Prediction endpoint via the Vertex AI Model Registry. However, after deployment, you notice that the endpoint returns predictions for the new model, but they are significantly different from the evaluation metrics computed during training. The training scripts used TensorFlow with a serving input function. What is the most likely issue and how would you fix it?

A. The endpoint is using a different machine type affecting numerical precision; you should use the same machine type as training.

B. The serving input function's preprocessing steps do not match the training preprocessing; you should verify and align them.

C. The model registry deployed a different version; you should check the alias.

D. The model was saved with training-only metrics; you should retrain with evaluation metrics.

Answer: B

Preprocessing mismatch is a common cause for prediction discrepancies.

Why this answer

Option B is correct because the serving input function's preprocessing must match the training preprocessing exactly; any mismatch (training-serving skew) causes prediction errors. Option A is unlikely because numerical precision differences between machine types are minimal. Option C is possible but less likely given the consistency of the difference.

Option D is wrong because the saved model already includes evaluation metrics, and stored metrics do not change the predictions the model returns.
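A common way to keep training and serving preprocessing aligned is to route both through one shared function. The sketch below illustrates the idea in plain Python; the feature name `amount`, the constants, and the function names are hypothetical, and a real TensorFlow serving input function would wrap the same logic in graph operations.

```python
from typing import Dict, List

# Hypothetical statistics computed once on the training set.
FEATURE_MEAN = 50.0
FEATURE_STD = 10.0


def preprocess(raw: Dict[str, float]) -> List[float]:
    """Single source of truth for feature scaling."""
    return [(raw["amount"] - FEATURE_MEAN) / FEATURE_STD]


def training_example(raw: Dict[str, float]) -> List[float]:
    # The training pipeline calls the shared function.
    return preprocess(raw)


def serving_input(raw: Dict[str, float]) -> List[float]:
    # The serving path calls the same function, so the two cannot drift apart.
    return preprocess(raw)
```

If the two paths are implemented separately instead, a comparison like `training_example(x) == serving_input(x)` on a few sample records is a cheap regression check.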


78

Multi-Select · medium

A company wants to reduce costs for serving a model on Vertex AI Prediction without sacrificing availability. Which THREE strategies should they consider?

Select 3 answers

A. Use larger machine types to reduce the number of replicas

B. Switch to HTTP/2 to reduce network overhead

C. Enable automatic batching to improve throughput per instance

D. Use CPU instead of GPU for models that can run on CPU

E. Use min replicas=0 and enable autoscaling

Answers: C, D, E

Batching increases efficiency, reducing number of instances needed.

Why this answer

Option C is correct because enabling automatic batching on Vertex AI Prediction allows the model server to group multiple inference requests into a single batch, which increases throughput per instance and reduces the total number of compute resources needed. This directly lowers serving costs without sacrificing availability, as the batching is handled transparently by the Vertex AI Prediction infrastructure.

Exam trap

Google Cloud often tests the misconception that reducing replicas with larger machines is cost-effective, but the trap here is that larger machines increase per-unit cost and can lead to idle capacity, whereas autoscaling with min replicas=0 and batching optimizes cost without sacrificing availability.


79

Multi-Select · hard

Which THREE factors are critical when designing a model serving architecture for a global user base with strict latency SLAs? (Choose 3.)

Select 3 answers

A. Use batch prediction to process requests in bulk for efficiency.

B. Deploy the model in a single region to avoid data sovereignty issues.

C. Enable autoscaling with request-based metrics to handle traffic spikes.

D. Implement request caching for idempotent predictions when appropriate.

E. Use multi-region deployment with Vertex AI Endpoints in multiple locations.

Answers: C, D, E

Autoscaling ensures capacity matches demand.

Why this answer

Options C, D, and E are correct. Option A is wrong because batch processing adds latency. Option B is wrong because a single-region deployment cannot meet latency targets for a global user base.
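Request caching for idempotent predictions can be sketched with an in-process cache. This is an illustration only: `expensive_predict` is a hypothetical stand-in for a call to a deployed model, and a production setup would more likely use a shared cache in front of the endpoint.

```python
from functools import lru_cache

# Counts how often the (hypothetical) model is actually invoked.
CALLS = {"model": 0}


def expensive_predict(features: tuple) -> float:
    """Stand-in for a call to a deployed model."""
    CALLS["model"] += 1
    return sum(features) / len(features)


@lru_cache(maxsize=1024)
def cached_predict(features: tuple) -> float:
    # Features must be hashable (a tuple here) to serve as a cache key.
    return expensive_predict(features)
```

Caching is only safe when the same input always yields the same output, which is why the option limits it to idempotent predictions.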


80

MCQ · hard

You have a model that requires GPU for efficient inference. You deploy it on Vertex AI with a single NVIDIA T4 GPU accelerator and notice that the GPU utilization hovers around 30%. The endpoint has 10 replicas. What is the best way to improve cost efficiency while maintaining throughput?

A. Use a larger GPU like V100 to process requests faster.

B. Reduce the number of replicas to increase GPU utilization per instance.

C. Enable autoscaling to increase the number of replicas.

D. Switch to a CPU-only instance; the model can run on CPU.

Answer: B

Fewer replicas with same traffic will increase GPU utilization and reduce cost.

Why this answer

If GPU utilization is low, you can reduce the number of replicas or increase the batch size per request to fully utilize the GPU. Reducing replicas directly saves cost. Increasing batch size may also help but requires code changes.
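A rough sizing calculation shows how far the replica count could drop. The sketch assumes that total load stays constant and that utilization scales linearly with load per replica, which is a simplification; the 60% target is an illustrative value, not a figure from the question.

```python
def replicas_for_target(current_replicas: int, current_util_pct: int, target_util_pct: int) -> int:
    """Replicas needed to carry the same total load at a target utilization.

    Utilization values are whole percentages to keep the arithmetic exact.
    """
    total_load = current_replicas * current_util_pct
    # Integer ceiling division, never dropping below one replica.
    return max(1, -(-total_load // target_util_pct))


if __name__ == "__main__":
    # 10 replicas at 30% utilization, aiming for 60% per replica.
    print(replicas_for_target(10, 30, 60))
```

Under these assumptions, the 10 replicas in the question could be halved while carrying the same traffic, leaving headroom below full utilization for spikes.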


81

MCQ · easy

A company deploys a TensorFlow model on Vertex AI Prediction with a single node. During peak hours, inference latency increases. What should they do first to reduce latency?

A. Enable autoscaling for the deployment

B. Increase the machine type of the node

C. Decrease the min replicas to 0

D. Enable automatic batching of requests

Answer: A

Autoscaling adds nodes during peak traffic, reducing latency.

Why this answer

Enabling autoscaling for the deployment is the correct first step because it allows Vertex AI Prediction to dynamically adjust the number of replicas based on incoming traffic. During peak hours, autoscaling can add more nodes to distribute the inference load, directly reducing latency without requiring manual intervention or over-provisioning.

Exam trap

The trap here is that candidates often confuse improving throughput (batching or bigger machines) with reducing latency under load, but the first action should always be to add more replicas via autoscaling to handle concurrent requests, not to optimize a single node's performance.

How to eliminate wrong answers

Option B is wrong because increasing the machine type of the node (e.g., moving to a larger VM) may improve per-node throughput but does not address the root cause of insufficient capacity during traffic spikes; it also increases cost without guaranteeing latency reduction if the single node is already saturated. Option C is wrong because decreasing the min replicas to 0 would cause the deployment to scale down to zero during idle periods, but during peak hours it would still need to scale up from zero, causing cold-start latency and potentially failing to handle the initial burst of requests. Option D is wrong because enabling automatic batching of requests can improve throughput by grouping multiple inference requests into a single batch, but it does not reduce latency for individual requests—in fact, it may increase latency as requests wait for a batch to fill.


82

MCQ · medium

Refer to the exhibit. A team deploys a model with the above configuration. They observe that during traffic spikes, the endpoint does not scale up quickly enough, causing increased latency. The average CPU utilization never exceeds 50%. What is the most likely reason for the slow scaling?

A. The autoscaling metric is not configured

B. The minReplicaCount is too low

C. The accelerator is causing a bottleneck

D. The machineType does not have enough CPU

Answer: A

The strategy is 'manual', so autoscaling is not configured; changing to 'autoscaling' with a target metric would resolve the issue.

Why this answer

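Option A is correct. The configuration shows strategy: manual, meaning autoscaling is disabled. Without autoscaling, the endpoint does not add instances in response to load.

Option B is wrong because raising minReplicaCount leaves scaling manual. Option D is wrong because changing the machine type also leaves scaling manual, and CPU utilization never exceeds 50%, so CPU is not the constraint. Option C is wrong because nothing in the symptoms points to the accelerator; the endpoint simply does not add replicas.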


83

MCQ · hard

A company serves a scikit-learn model on Vertex AI Prediction but receives a 400 error with 'Prediction failed: Model evaluation error'. What is the most likely cause?

A. The input data format is incorrect

B. The model was trained with a different framework

C. The model uses a scikit-learn version not supported by Vertex AI

D. The endpoint is overloaded and timing out

Answer: C

Version mismatch causes evaluation failure.

Why this answer

Vertex AI Prediction supports specific versions of scikit-learn for serving models. If the model was trained with a version that is not in the supported list (e.g., 0.19, 0.20, 0.22, 0.23, 0.24, 1.0, 1.1), the prediction endpoint will fail with a 'Model evaluation error' because the underlying runtime cannot load the serialized model (e.g., pickle or joblib file). This is the most likely cause of a 400 error when the input format is otherwise correct.

Exam trap

Google Cloud often tests the misconception that a 400 error always indicates a client-side input format issue, but here the error message 'Model evaluation error' points to a server-side model loading failure due to version incompatibility, not the input data.

How to eliminate wrong answers

Option A is wrong because an incorrect input data format typically results in a different error message, such as 'Invalid input' or 'Prediction failed: Input parsing error', not 'Model evaluation error'. Option B is wrong because Vertex AI Prediction supports multiple frameworks (TensorFlow, PyTorch, XGBoost, scikit-learn) and will not throw a 'Model evaluation error' solely due to a different training framework; it would fail at model upload or deployment with an unsupported framework error. Option D is wrong because an overloaded endpoint or timeout would return a 429 (Too Many Requests) or 504 (Gateway Timeout) status code, not a 400 error with 'Model evaluation error'.


84

MCQ · medium

A machine learning engineer notices that the Vertex AI Prediction endpoint's error rate has increased over the past week. The model was retrained with new data and redeployed. Which step should the engineer take first to diagnose the issue?

A. Increase the number of replicas to reduce error rate.

B. Compare the input data distribution of recent requests to the training data distribution using Explainable AI.

C. Roll back to the previous model version immediately.

D. Check the Cloud Monitoring dashboard for latency and error codes, and review the model's prediction logs.

Answer: D

Monitoring and logs provide direct evidence to diagnose errors.

Why this answer

Option D is correct because reviewing Cloud Monitoring dashboards and logs provides immediate insight into error patterns and the root cause. Option C is premature without investigation. Option B is more advanced and requires setup.

Option A might temporarily reduce errors caused by overload but does not address the underlying cause.


85

MCQ · easy

Refer to the exhibit. A team deploys a model using Cloud Run. They notice that after scaling up, the new instances take about 90 seconds to become ready and serve requests. They want to reduce this startup time. Which configuration change is most likely to help?

A. Reduce the startupProbe initialDelaySeconds to 30

B. Change the container image to use a smaller base image

C. Reduce the memory limit to 4Gi

D. Increase the containerConcurrency to 100

Answer: B

A smaller base image reduces download and extraction time, speeding up startup.

Why this answer

Option B is correct. Using a smaller container image (e.g., a minimal base image) reduces pull and initialization time, directly lowering startup latency. Option D increases concurrency but does not affect startup.

Option A reduces the probe delay, but the instance may not be ready any earlier. Option C reduces memory but could cause OOM if the model requires more.


86

MCQ · medium

A team uses Vertex AI Prediction with a custom container. They want to perform canary deployments by sending 5% of traffic to a new model version. Which method should they use?

A. Create a new endpoint with manual traffic splitting

B. Deploy two separate endpoints and use a load balancer

C. Use Cloud Run for serving with gradual rollout

D. Use the Vertex AI Model Registry and configure traffic splitting on the endpoint

Answer: D

This is correct because you can deploy the new model version to the same endpoint with a small traffic split (e.g., 5%) using the traffic splitting feature.

Why this answer

Option D is correct because Vertex AI endpoints support traffic splitting between deployed models, allowing a controlled canary rollout; the split is configured on the endpoint, since Model Registry itself does not handle traffic splitting. Option A is wrong because a new endpoint is not needed: the split is set between models deployed to the same endpoint. Option B is wrong because two separate endpoints behind a load balancer duplicate what endpoint traffic splitting already provides.

Option C uses Cloud Run, which is not integrated with Vertex AI Prediction.


87

Multi-Select · medium

A model serving team is experiencing high latency in production. Which TWO actions should they take to diagnose the root cause? (Choose 2.)

Select 2 answers

A. Convert the model to a different framework that is faster.

B. Enable Cloud Trace to analyze request latency across services.

C. Check the endpoint's autoscaling metrics and cold start frequency.

D. Increase the number of replicas to reduce load per replica.

E. Set the logging verbosity to DEBUG in the container.

Answers: B, C

Cloud Trace provides detailed latency breakdowns.

Why this answer

Options B and C are correct. Option D is wrong because increasing replicas may mask the issue but does not diagnose it. Option A is wrong because converting to a different framework may not address the latency and does nothing to diagnose it.

Option E is wrong because log level changes do not provide granular latency analysis.


88

MCQ · medium

An ML engineer notices that predictions are taking longer than expected under moderate traffic. Reviewing the endpoint configuration, what is the most likely cause of the high latency?

A. Container logging is disabled, slowing down request processing.

B. The accelerator count is 0, meaning no GPU is used.

C. The machine type n1-standard-4 is underpowered for the model's compute needs.

D. Automatic scaling is set with a maxReplicaCount of 10, which creates overhead.

Answer: B

BERT models are computationally intensive and benefit greatly from GPU acceleration; without it, inference is CPU-bound and slow.

Why this answer

When the accelerator count is set to 0, the endpoint runs inference on the CPU only, which is significantly slower than GPU-accelerated inference for deep learning models. This is the most direct cause of high latency under moderate traffic, as the model's compute demands exceed CPU throughput.

Exam trap

Google Cloud often tests the misconception that CPU machine type is the primary cause of latency, when in fact the accelerator count being zero is the more direct and common misconfiguration for deep learning models.

How to eliminate wrong answers

Option A is wrong because disabling container logging reduces I/O overhead and, if anything, speeds up request processing rather than slowing it down. Option C is wrong because n1-standard-4 (4 vCPUs, 15 GB RAM) is a general-purpose machine type that is generally sufficient for moderate traffic; the primary bottleneck is the lack of GPU acceleration, not an underpowered CPU. Option D is wrong because a maxReplicaCount of 10 does not create overhead; automatic scaling with a higher maxReplicaCount allows more instances to handle load, reducing latency under traffic.


89

MCQ · easy

For a low-latency real-time serving requirement, which type of Vertex AI Endpoint is appropriate?

A. Regional endpoint

B. Public endpoint

C. Private endpoint with VPC network

D. Global endpoint

Answer: A

Regional endpoints are deployed in a specific region, allowing proximity to clients for low latency.

Why this answer

Option A is correct because a regional endpoint can be deployed in the same region as the clients to minimize network latency. Option C (private endpoint) is for security, not necessarily low latency. Option B (public endpoint) adds internet latency.

Option D (global endpoint) is optimized for multi-region traffic but may add slight overhead.


90

MCQ · hard

A data engineer is troubleshooting a Vertex AI Endpoint that serves a large BERT model. After deployment, many prediction requests fail with 'Out of Memory' errors. The machine type is n1-standard-8 (30 GB memory) with no accelerator. Which action will most likely resolve the issue?

A. Change the machine type to n1-highmem-16 (104 GB memory).

B. Use batch prediction instead of online prediction.

C. Add a GPU accelerator (e.g., NVIDIA T4) to offload computation.

D. Quantize the model from FP32 to INT8.

Answer: A

Increasing memory directly resolves out-of-memory errors.

Why this answer

Option A is correct because n1-standard-8 has only 30 GB of memory, which may be insufficient for a large BERT model (e.g., around 1.5 GB of parameters, but with intermediate tensors the footprint can exceed 30 GB). Upgrading to a high-memory machine type provides more memory. Option C is wrong because adding a GPU does not increase system memory.

Option D is wrong because model quantization reduces model size but not necessarily the memory spikes during inference. Option B is wrong because batch prediction is not meant for real-time serving, and OOM might still occur.
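The gap between parameter size and total memory use can be illustrated with rough arithmetic on the weights alone. The sketch assumes a model of about 340 million parameters (roughly BERT-large scale, an assumption not stated in the question) and ignores activations, framework overhead, and request buffers, which are what push real usage far above the weight size.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory for the weights only, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    params = 340_000_000  # assumed parameter count
    for dtype in ("fp32", "int8"):
        print(dtype, weight_memory_gb(params, dtype))
```

Quantizing to INT8 shrinks the weights to a quarter of their FP32 size, yet the weights are a small share of a footprint that exceeds 30 GB, which is consistent with the explanation that quantization alone may not remove the memory spikes.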


91

Multi-Select · medium

A company wants to deploy a model for real-time inference with high availability across multiple Google Cloud regions. The model is small and stateless. Which two steps should they take? (Choose two.)

Select 2 answers

A. Deploy the model to Vertex AI Prediction endpoints in multiple regions and use a global external HTTP(S) load balancer to route traffic to the nearest region.

B. Use Cloud Run with multi-region deployment and a global HTTP(S) load balancer.

C. Use Cloud Functions with a global HTTP(S) load balancer.

D. Use a single Vertex AI Prediction endpoint with multiple replicas across zones in the same region.

E. Deploy the model to a Vertex AI Prediction endpoint in a single region and use a global external HTTP(S) load balancer.

Answers: A, B

Multi-region endpoints with global load balancer provide HA and low latency.

Why this answer

Options A and B are correct. A deploys the model to Vertex AI Prediction endpoints in multiple regions behind a global load balancer, providing regional failover. B uses Cloud Run with multi-region deployment and a global load balancer, which also offers multi-region HA.

Option D is insufficient because a single region, even with replicas across zones, does not survive a regional outage. Option C is wrong because Cloud Functions is region-specific and not designed for latency-sensitive inference across regions. Option E is wrong because a single region does not provide cross-region HA.


92

MCQ · medium

A model deployed on Vertex AI Prediction repeatedly exits with code 137. What is the most likely cause?

A. The model has a disk I/O bottleneck.

B. The model is using too much CPU.

C. The container image is incompatible with the machine type.

D. The model is using more memory than allocated (4GB).

Answer: D

Memory limit reached, OOM killer terminates process.

Why this answer

Exit code 137 indicates that the container was killed by the Linux kernel's Out-Of-Memory (OOM) killer. In Vertex AI Prediction, each model deployment has a fixed memory allocation (default 4GB for custom containers). When the model's inference process exceeds this limit, the OOM killer terminates the container, resulting in exit code 137.

This is the most direct and common cause for this specific exit code in Vertex AI.

Exam trap

Google Cloud often tests the distinction between exit codes: candidates may confuse exit code 137 (OOM kill) with exit code 1 (generic error) or exit code 139 (segmentation fault), leading them to incorrectly attribute the issue to CPU or disk problems.

How to eliminate wrong answers

Option A is wrong because disk I/O bottlenecks typically cause slow performance or timeouts, not exit code 137 (SIGKILL from OOM). Option B is wrong because high CPU usage may cause throttling or latency, but does not trigger the OOM killer; exit code 137 is specifically memory-related. Option C is wrong because an incompatible container image would result in a different error, such as a crash loop with exit code 1 or 139 (segfault), not the OOM-specific exit code 137.
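The exit codes above follow the common shell convention that a process killed by a signal reports 128 plus the signal number, so 137 is 128 + 9 (SIGKILL) and 139 is 128 + 11 (SIGSEGV). A small sketch of that decoding, assuming a POSIX system; the helper name `describe_exit_code` is hypothetical.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code using the 128 + signal convention."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # The remainder does not correspond to a known signal number.
            return f"exited with status {code}"
    return f"exited with status {code}"


if __name__ == "__main__":
    for code in (1, 137, 139):
        print(code, describe_exit_code(code))
```

SIGKILL by itself does not prove an out-of-memory kill, so the decoded signal is best read together with the container's memory metrics.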


93

MCQ · medium

A team deploys a model using Vertex AI Endpoint with automatic scaling. They observe that during traffic spikes, new instances take a long time to become ready, causing high latency for some requests. What should they configure to reduce this startup time?

A. Increase the max replicas

B. Use a custom container with a smaller footprint

C. Enable predictive autoscaling

D. Set a higher target CPU utilization

Answer: B

Smaller containers pull and initialize faster, reducing the time to become ready.

Why this answer

Option B is correct because using a custom container with a smaller footprint (e.g., smaller base image, fewer dependencies) reduces the time to pull and initialize the container. Option A increases max replicas but does not speed up startup. Option D only changes when scaling is triggered (a higher target actually delays it), and the startup time of each new instance remains the same.

Option C is not a standard setting.


94

Multi-Select · hard

Which TWO statements are true about canary deployments for Vertex AI endpoints?

Select 2 answers

A. Canary deployments are only supported for custom containers, not prebuilt frameworks.

B. You can roll back a canary by resetting traffic to 0% for the new version.

C. You can use traffic splitting to gradually shift 1-100% of traffic to a new version.

D. Canary deployments require the use of Vertex AI Model Registry.

E. Once a canary receives 50% traffic, you cannot increase it further.

Answers: B, C

Traffic can be shifted back to old version easily.

Why this answer

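Traffic splitting supports a gradual rollout, and the split can be adjusted at any time, so there is no ceiling such as the 50% limit claimed in option E. Setting the new version's share back to 0% rolls the canary back (option B). A canary helps test a version before full rollout, and monitoring metrics can be used to drive an automated rollback.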
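A traffic split on an endpoint is expressed as a mapping from deployed model ID to a percentage, with the percentages summing to 100. The helper below is a plain-Python sketch of building such a mapping for a canary; the function name and the IDs `stable` and `canary` are hypothetical placeholders, and the code does not call the Vertex AI API.

```python
from typing import Dict


def canary_split(stable_id: str, canary_id: str, canary_percent: int) -> Dict[str, int]:
    """Build a two-way traffic split whose percentages sum to 100."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {stable_id: 100 - canary_percent, canary_id: canary_percent}


if __name__ == "__main__":
    print(canary_split("stable", "canary", 5))   # start of the canary
    print(canary_split("stable", "canary", 50))  # gradual increase
    print(canary_split("stable", "canary", 0))   # rollback
```

The same helper covers both statements in the answer: any value from 1 to 100 shifts traffic gradually, and 0 returns all traffic to the previous version.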


95

MCQ · medium

A financial services company deploys a fraud detection model on Vertex AI. The model must make predictions in under 100ms. After deployment, latency spikes to 300ms during peak hours. The model is a large ensemble with 500MB size. Which action is most likely to reduce latency?

A. Optimize the model using TensorFlow Lite and convert to a smaller format.

B. Switch to batch prediction to process requests asynchronously.

C. Reduce the machine type to a smaller instance.

D. Increase the number of replicas on the endpoint.

Answer: A

Reduces model size and inference time.

Why this answer

The primary cause of latency is the large model size (500MB) combined with real-time inference constraints. Optimizing the model with TensorFlow Lite reduces the model size and computational overhead, directly decreasing inference time. This addresses the root cause—model complexity—rather than scaling infrastructure around an inefficient model.

Exam trap

The trap here is that candidates often confuse scaling (replicas or instance size) with optimization, failing to recognize that model size and inference efficiency are the primary drivers of latency in real-time serving.

How to eliminate wrong answers

Option B is wrong because batch prediction processes requests asynchronously, which does not meet the sub-100ms real-time requirement; it is designed for offline, high-throughput scenarios, not low-latency serving. Option C is wrong because reducing the machine type to a smaller instance would decrease available CPU/memory, likely increasing latency further due to resource contention. Option D is wrong because increasing the number of replicas improves throughput and availability but does not reduce per-request latency; it may even add network overhead from load balancing.
