CCNA Serving and Scaling Models Questions

34 of 109 questions · Page 2/2 · Serving and Scaling Models · Answers revealed

76

MCQ · medium

A machine learning team deploys a PyTorch model for online prediction on Vertex AI using a custom container. They notice that the first few requests after scaling up experience high latency. What is the most likely cause and how should they mitigate it?

A. The endpoint is not configured for autoscaling; enable min_replica=0 to allow scale-to-zero.

B. The model file is corrupted; re-upload to Vertex AI Model Registry.

C. The container has a slow initialization; set initialDelaySeconds in the health probe to give more time before considering the pod ready.

D. Use a smaller machine type (n1-standard-2) to reduce startup overhead.

Answer: C

Giving the container more time to load the model reduces premature traffic and latency.

Why this answer

Option C is correct because the high latency on the first few requests after scaling up is a classic symptom of a slow container initialization. By setting `initialDelaySeconds` in the health probe, you allow the container more time to start up and become ready before it receives traffic, preventing premature routing that causes timeouts or retries. This is a common tuning parameter for custom containers on Vertex AI, where model loading or dependency initialization can take several seconds.

Exam trap

The trap here is that candidates confuse slow initialization with autoscaling misconfiguration, assuming that scale-to-zero or smaller machines would fix the latency, when in fact the root cause is the readiness probe timing.

How to eliminate wrong answers

Option A is wrong because the problem occurs after scaling up, not from a cold start with zero replicas; setting min_replica=0 would actually worsen latency by requiring full cold starts. Option B is wrong because a corrupted model file would cause persistent prediction failures or errors, not just high latency on the first few requests after scaling. Option D is wrong because using a smaller machine type (n1-standard-2) would increase startup overhead and latency, not reduce it, as it provides fewer CPU and memory resources for initialization.
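
To make the readiness point concrete, here is a minimal Python sketch of a container that reports itself unhealthy until its model has finished loading. The `ReadinessGate` class, the `slow_loader` function, and the 0.2-second delay are illustrative assumptions, not part of Vertex AI or any serving framework; a real container would expose `health_status()` through its HTTP health route.

```python
import threading
import time


class ReadinessGate:
    """Reports 'not ready' until the (simulated) model load has finished."""

    def __init__(self, load_model):
        self._load_model = load_model   # hypothetical callable that loads weights
        self._ready = threading.Event()
        self.model = None

    def start(self):
        # Load in the background so the health route can answer immediately.
        thread = threading.Thread(target=self._load, daemon=True)
        thread.start()
        return thread

    def _load(self):
        self.model = self._load_model()
        self._ready.set()

    def health_status(self):
        # 503 asks the platform to keep traffic away; 200 means safe to route.
        return 200 if self._ready.is_set() else 503


def slow_loader():
    time.sleep(0.2)                     # stand-in for loading a large checkpoint
    return "model"


gate = ReadinessGate(slow_loader)
status_before = gate.health_status()    # loading has not started yet
worker = gate.start()
worker.join()                           # wait for the simulated load to finish
status_after = gate.health_status()
print(status_before, status_after)
```

A probe delay and a gate like this are complementary: the delay avoids probing too early, while the gate means a probe can only succeed once the model is actually in memory.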

77

Multi-Select · hard

A team is building a batch prediction pipeline that processes raw data from Cloud Storage, performs complex preprocessing, and then runs predictions using a large model. The preprocessing step is compute-intensive and the prediction step is I/O-bound. Which TWO Google Cloud services should they combine to optimize cost and performance? (Choose 2)

Select 2 answers

A. Dataflow for preprocessing and writing results to Cloud Storage

B. Cloud Functions to preprocess data row by row

C. Cloud Run to serve the preprocessed data as an API

D. Vertex AI Batch Prediction with Cloud Storage source

E. Vertex AI Batch Prediction with BigQuery source

Answers: A, D

Dataflow can perform complex transforms at scale.

Why this answer

Dataflow is ideal for the compute-intensive preprocessing step because it can horizontally scale across many workers to handle complex transformations in parallel, and it writes results directly to Cloud Storage, which serves as the input source for Vertex AI Batch Prediction. Vertex AI Batch Prediction is optimized for I/O-bound inference workloads: it reads batches of data from Cloud Storage, runs predictions using the large model, and writes results back to Cloud Storage, all without requiring a persistent serving endpoint, which minimizes cost for offline predictions.

Exam trap

Google often tests the distinction between batch and online serving patterns, and the trap here is that candidates may choose Cloud Functions or Cloud Run for preprocessing because they are familiar serverless options, without realizing that Dataflow is purpose-built for large-scale, compute-intensive batch processing and that Vertex AI Batch Prediction is the correct service for offline inference at scale.

78

Multi-Select · easy

Which TWO of the following can be used as input sources for Vertex AI batch prediction jobs? (Choose 2)

Select 2 answers

A. Cloud Firestore

B. Cloud SQL

C. BigQuery

D. Cloud Spanner

E. Cloud Storage

Answers: C, E

BigQuery is a supported input source.

Why this answer

BigQuery is a supported input source for Vertex AI batch prediction jobs because Vertex AI can directly read data from BigQuery tables for batch predictions. This integration allows you to store your prediction requests in BigQuery and have Vertex AI write the predictions back to a BigQuery output table, streamlining the workflow for large-scale predictions without needing to export data to Cloud Storage first.

Exam trap

The trap here is that candidates often assume any Google Cloud database (like Firestore, Cloud SQL, or Spanner) can serve as a direct input source for batch predictions, but Vertex AI batch prediction only supports BigQuery and Cloud Storage as input sources, requiring data to be exported or staged in those services first.
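
The rule above can be sketched as a small pre-flight check that a pipeline might run before submitting a job. The function name `batch_input_source` and the example URIs are hypothetical; the only assumption carried over from Google Cloud is that Cloud Storage locations use the `gs://` scheme and BigQuery tables are referenced with `bq://`.

```python
# Map of URI schemes to the services the text names as valid input sources.
SUPPORTED_PREFIXES = {"gs://": "Cloud Storage", "bq://": "BigQuery"}


def batch_input_source(uri: str) -> str:
    """Return the service behind `uri`, or raise if it must be staged first."""
    for prefix, service in SUPPORTED_PREFIXES.items():
        if uri.startswith(prefix):
            return service
    raise ValueError(
        f"{uri!r} is not a supported input source; "
        "export the data to Cloud Storage or BigQuery first"
    )


print(batch_input_source("gs://example-bucket/inputs/requests.jsonl"))
print(batch_input_source("bq://example-project.example_dataset.requests"))
```

Data held in Firestore, Cloud SQL, or Spanner would fail this check, which mirrors the staging step the explanation describes.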

79

Multi-Select · medium

A company wants to use Vertex AI Vector Search for real-time product recommendations based on user embeddings. They need to update the index frequently with new product embeddings without significant downtime. Which TWO options should they consider? (Choose 2)

Select 2 answers

A. Maintain two separate indexes: one for building and one for serving, and swap them after building.

B. Use streaming updates to add new embeddings incrementally.

C. Use batch updates to rebuild the entire index daily.

D. Increase the number of replicas on the deployed index.

E. Use a GPU-powered endpoint for faster indexing.

Answers: A, B

This allows building a new index offline and swapping with minimal downtime.

Why this answer

Option A is correct because maintaining two separate indexes allows you to build a new index in the background while the existing index continues to serve queries. Once the new index is fully built and validated, you can swap it into production with minimal downtime. This is a common pattern for high-availability systems where index rebuilds are necessary.

Exam trap

This question tests the distinction between scaling serving capacity (replicas) and updating index content. Candidates often mistakenly choose D or E when the real need is for a zero-downtime update strategy.
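
The build-then-swap pattern from Option A can be illustrated with a toy router. This is a conceptual sketch only: `IndexRouter`, its methods, and the dictionaries standing in for indexes are all hypothetical and do not correspond to the Vector Search API.

```python
class IndexRouter:
    """Serves queries from one index while a replacement is built and checked."""

    def __init__(self, serving_index):
        self.serving = serving_index    # index currently answering queries
        self.staging = None             # index being built in the background

    def stage(self, new_index):
        self.staging = new_index

    def promote(self, validate):
        """Swap staging into serving only if it passes validation."""
        if self.staging is None:
            raise RuntimeError("nothing staged")
        if not validate(self.staging):
            return False                # keep serving the old index
        self.serving, self.staging = self.staging, None
        return True

    def query(self, key):
        return self.serving.get(key)


router = IndexRouter({"shoe": [0.1, 0.9]})
router.stage({"shoe": [0.1, 0.9], "hat": [0.7, 0.2]})

during_build = router.query("hat")      # old index still serving: no "hat" yet
promoted = router.promote(lambda idx: len(idx) >= 2)
after_swap = router.query("hat")
print(during_build, promoted, after_swap)
```

Queries never see a half-built index: they hit the old one until validation passes, then the new one.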

80

Multi-Select · medium

You are optimizing a model for deployment on Vertex AI using NVIDIA Triton Inference Server. Which TWO actions can you take to improve inference performance?

Select 2 answers

A. Increase the number of model replicas to the maximum.

B. Use TensorRT to quantize the model to FP16 or INT8.

C. Disable model caching to reduce memory usage.

D. Enable dynamic batching in Triton to aggregate requests.

E. Use a larger machine type with more vCPUs.

Answers: B, D

Quantization reduces model size and speeds up inference with minimal accuracy loss.

Why this answer

Option B is correct because TensorRT optimizes model inference by quantizing weights and activations to lower precision formats like FP16 or INT8, reducing memory bandwidth and computation time without significant accuracy loss. This is a standard technique for improving throughput on NVIDIA GPUs, especially when deploying with Triton Inference Server, which natively supports TensorRT-optimized model repositories.

Exam trap

Google often tests the misconception that simply adding more replicas or CPU resources will linearly improve inference performance, ignoring the GPU-bound nature of model serving and the importance of batching and precision optimization.
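
Option D's idea of aggregating requests can be shown with a simplified model of dynamic batching. This is not Triton's scheduler: the function `form_batches`, the arrival times, and both limits are assumptions chosen for illustration. A batch is closed when it is full or when the next request arrives after the oldest queued request has waited longer than the allowed delay.

```python
def form_batches(arrival_times_ms, max_batch_size, max_queue_delay_ms):
    """Group request arrival times into batches under size and delay limits."""
    batches = []
    current = []
    for t in sorted(arrival_times_ms):
        batch_full = len(current) == max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_queue_delay_ms
        if batch_full or waited_too_long:
            batches.append(current)     # dispatch what has been collected
            current = []
        current.append(t)
    if current:
        batches.append(current)         # dispatch the final partial batch
    return batches


arrivals = [0, 1, 2, 3, 10, 11, 30]
batches = form_batches(arrivals, max_batch_size=3, max_queue_delay_ms=5)
print(batches)  # [[0, 1, 2], [3], [10, 11], [30]]
```

Seven requests become four model executions here. The trade-off is that a longer queue delay yields fuller batches and higher throughput at the cost of added per-request latency.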

81

MCQ · medium

You have a TensorFlow model that you want to deploy on edge devices for real-time inference. The model was trained in Vertex AI. You need to convert it to a format suitable for on-device inference. Which approach should you use?

A. Export the model as a serialized TFX pipeline.

B. Export the model to a SavedModel and deploy it using Vertex AI Edge Manager.

C. Convert the model to TensorFlow Lite using the TensorFlow Lite converter.

D. Use Vertex AI Model Optimization to compile the model for edge devices.

Answer: C

TensorFlow Lite is optimized for on-device inference with low latency.

Why this answer

Option C is correct because TensorFlow Lite is specifically designed for on-device inference on edge devices, offering optimized performance and reduced model size. The TensorFlow Lite converter transforms a TensorFlow model (e.g., from a SavedModel) into the FlatBuffer format (.tflite), which is lightweight and compatible with mobile and embedded platforms. This directly addresses the requirement for real-time inference on edge devices.

Exam trap

A common misconception is that Vertex AI services like Edge Manager or Model Optimization directly produce a deployable edge format, when in fact TensorFlow Lite conversion is the required final step for on-device inference.

How to eliminate wrong answers

Option A is wrong because a TFX pipeline is a production ML workflow framework for orchestrating training, validation, and deployment, not a model format for on-device inference; serializing it does not produce a deployable edge model. Option B is wrong because Vertex AI Edge Manager is a service for managing and deploying models to edge devices, but it expects models in a compatible format like TensorFlow Lite; exporting to SavedModel alone is insufficient without conversion, and Edge Manager itself does not perform the conversion. Option D is wrong because Vertex AI Model Optimization focuses on techniques like pruning and quantization to improve model efficiency, but it does not compile the model into a format suitable for on-device inference; the output still requires conversion to TensorFlow Lite for edge deployment.

82

MCQ · easy

You need to serve multiple models on a single Vertex AI endpoint to reduce costs. How can you achieve this?

A. Use Cloud Run to serve each model separately.

B. Use Vertex AI Prediction with multi-model serving by deploying multiple models to one endpoint with traffic splits.

C. Package all models into a single container and deploy that container.

D. Deploy each model to its own endpoint and use a load balancer.

Answer: B

Multiple models can be deployed to a single endpoint, each receiving a portion of the traffic.

Why this answer

Option B is correct because Vertex AI Prediction supports multi-model serving, allowing you to deploy multiple models to a single endpoint and use traffic splits to route a percentage of requests to each model. This reduces costs by sharing underlying infrastructure (e.g., compute resources) across models, rather than provisioning separate endpoints or containers for each model.

Exam trap

The trap here is that candidates often confuse multi-model serving with containerization, assuming that bundling models into a single container (Option C) is equivalent to Vertex AI's native multi-model support, but this ignores the need for traffic splitting and independent model lifecycle management.

How to eliminate wrong answers

Option A is wrong because Cloud Run serves each model as a separate service, which does not consolidate models onto a single endpoint and incurs additional costs for individual scaling and networking. Option C is wrong because packaging all models into a single container violates the principle of model isolation, complicates updates, and does not leverage Vertex AI's native traffic-splitting mechanism for granular control. Option D is wrong because deploying each model to its own endpoint and using a load balancer increases operational overhead and cost, as each endpoint requires separate compute resources, defeating the purpose of cost reduction.

83

MCQ · easy

A data scientist wants to deploy a trained TensorFlow model to Vertex AI for online predictions. They need to serve predictions with low latency and want to leverage GPU acceleration. Which machine type should they select when creating the Vertex AI endpoint?

A. n1-standard-4 with 1 NVIDIA Tesla T4

B. n1-standard-4

C. e2-standard-4

D. n1-highmem-8

Answer: A

Attaching a GPU to an n1-standard machine enables GPU acceleration.

Why this answer

Option A is correct because the n1-standard-4 machine type supports attaching GPUs such as the NVIDIA Tesla T4, which provides GPU acceleration for low-latency online predictions. Vertex AI endpoints require a machine type that allows GPU attachment, and the n1-series is one of the few families that supports GPUs, while the T4 offers a good balance of cost and performance for inference workloads.

Exam trap

The trap here is that candidates may assume any machine type can be paired with a GPU, but only specific series support GPUs (such as N1 with an attached accelerator, or the accelerator-optimized A2 and G2 families). The E2 series does not, which leads to a wrong selection if the GPU requirement is overlooked.

How to eliminate wrong answers

Option B is wrong because n1-standard-4 without a GPU does not provide GPU acceleration, so it cannot meet the requirement for low-latency predictions with GPU. Option C is wrong because e2-standard-4 does not support attaching GPUs at all; the e2 series is designed for cost-optimized CPU-only workloads. Option D is wrong because n1-highmem-8, as stated, has no GPU attached. The machine type could host one, but without an accelerator in the specification it provides no GPU acceleration, and its extra memory is unnecessary for typical inference tasks.

84

MCQ · medium

You have a Vertex AI endpoint with autoscaling enabled. You notice that during traffic spikes, the endpoint takes a long time to scale up, causing prediction errors. What is the most effective solution?

A. Use a larger machine type to handle more requests per replica.

B. Reduce the target CPU utilization to trigger scaling earlier.

C. Increase the maximum number of replicas.

D. Increase the minimum number of replicas to maintain a larger buffer.

Answer: D

A higher minimum replica count keeps warm capacity ready, so spikes are absorbed before autoscaling has to add replicas.

Why this answer

Setting higher min replicas ensures a baseline of compute always available to absorb traffic spikes, reducing cold start latency. Increasing max replicas allows scaling higher but does not address the initial delay.
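
A rough way to reason about the size of that baseline is a back-of-the-envelope capacity calculation. The sketch below is an assumption-laden illustration: the function `replicas_needed`, the per-replica throughput of 50 requests per second, the 60% utilization target, and the traffic figures are all made up and would need to come from load tests on a real endpoint.

```python
import math


def replicas_needed(requests_per_second, per_replica_rps, target_utilization=0.6):
    """Replicas required so each one runs at or below the target utilization."""
    usable_rps = per_replica_rps * target_utilization
    return math.ceil(requests_per_second / usable_rps)


baseline = replicas_needed(40, per_replica_rps=50)    # steady-state traffic
spike = replicas_needed(200, per_replica_rps=50)      # expected peak traffic
print(baseline, spike)  # 2 7
```

With these numbers, a minimum of 2 replicas covers steady traffic but leaves 5 replicas to be provisioned during a spike. Raising the minimum toward the spike figure shrinks that gap at the cost of paying for idle capacity.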

85

MCQ · hard

You are deploying a PyTorch model on Vertex AI and want to use NVIDIA Triton Inference Server for optimal performance. You have built a custom container with Triton. Which serving configuration should you use?

A. Deploy the model on GKE with Triton and expose via Istio.

B. Use the prebuilt Vertex AI PyTorch prediction container and set environment variables to enable Triton.

C. Use Vertex AI Model Optimization to automatically convert the model to TensorRT and deploy with built-in server.

D. Upload your Triton container to Container Registry and specify it as the prediction container in Vertex AI Model.

Answer: D

Vertex AI allows custom containers for prediction; Triton can be included in the container.

Why this answer

Vertex AI supports custom containers for prediction. You can bring your own container with Triton installed, and Vertex AI will deploy it to the endpoint.

86

MCQ · easy

You deployed a model to a Vertex AI endpoint with minReplicas=0 and maxReplicas=5. After sending prediction requests, you notice the endpoint takes about 30 seconds to respond initially, but subsequent requests are fast. What is the most likely cause?

A. The model is too large for the machine type.

B. Cold start occurs because the endpoint scaled down to zero.

C. The VPC Service Controls are blocking the initial request.

D. The endpoint's autoscaling is misconfigured.

Answer: B

Correct. With minReplicas=0, the endpoint scales down to zero, leading to cold start latency.

Why this answer

Option B is correct because Vertex AI endpoints with minReplicas=0 scale down to zero when idle. The first request after a period of inactivity triggers a cold start, where the endpoint must provision a new VM instance and load the model, causing a ~30-second delay. Subsequent requests are fast because the instance remains warm and handles them without provisioning overhead.

Exam trap

Google often tests the distinction between cold start latency and persistent performance issues, so candidates may mistakenly attribute the initial delay to model size or network misconfiguration instead of recognizing the intentional scaling-to-zero behavior.

How to eliminate wrong answers

Option A is wrong because a model too large for the machine type would cause persistent latency or errors on every request, not just the first one after idle time. Option C is wrong because VPC Service Controls enforce network boundaries and would block all requests consistently, not just the initial one with a 30-second delay. Option D is wrong because the autoscaling configuration (minReplicas=0, maxReplicas=5) is correct for scaling to zero; the observed behavior is the expected cold start, not a misconfiguration.

87

MCQ · medium

You are using Vertex AI Vector Search to find nearest neighbors for a recommendation system. Your index is built on 10M embeddings and you need low-latency queries. You want to ensure that adding new embeddings does not require a full index rebuild. Which index type should you use?

A. Brute-force index (exact neighbor search)

B. ANN index with the 'streaming' update mode

C. ANN index with the 'batch' update mode

D. Tree-AH index

Answer: B

Streaming mode allows incremental updates without full index rebuild.

Why this answer

The ANN index with 'streaming' update mode is correct because it supports real-time insertion of new embeddings without requiring a full index rebuild, which is essential for low-latency recommendation systems. New vectors are upserted into the live index and become queryable shortly afterward, while the service consolidates them in the background, enabling continuous updates while maintaining query performance.

Exam trap

Google often tests the misconception that all ANN indexes support incremental updates. In Vertex AI Vector Search the update method is chosen when the index is created: only an index created for 'streaming' updates accepts incremental upserts, while a 'batch' index is refreshed through batch update jobs.

How to eliminate wrong answers

Option A is wrong because a brute-force index computes exact distances against all 10M embeddings, which is computationally prohibitive for low-latency queries. Option C is wrong because the 'batch' update mode refreshes the index through batch jobs rather than incremental upserts, which conflicts with the requirement to avoid full rebuilds. Option D is wrong because Tree-AH names the ANN algorithm (a shallow tree combined with asymmetric hashing), not an update mode; choosing it says nothing about whether new embeddings can be added incrementally.

88

MCQ · hard

A company uses Vertex AI Matching Engine for a product recommendation system. They need to update the index with new product embeddings every hour, but the index is used for online queries with low latency. Which index update strategy should they use?

A. Use streaming updates to insert new embeddings incrementally

B. Use a hybrid approach with batch for daily full rebuild and streaming for hourly

C. Use batch updates to replace the index every hour

D. Recreate the index from scratch each hour

Answer: A

Streaming updates allow incremental, real-time changes while the index serves queries.

Why this answer

Streaming updates in Vertex AI Matching Engine allow incremental insertion of new embeddings into an existing index without rebuilding it. This satisfies the requirement for hourly updates while maintaining low-latency online queries, as the index remains available and consistent during the update process.

Exam trap

Google often tests the misconception that batch updates are required for consistency or that streaming updates cannot handle frequent changes, leading candidates to choose hybrid or batch approaches when incremental streaming is both sufficient and optimal for low-latency online serving.

How to eliminate wrong answers

Option B is wrong because a hybrid approach with batch for daily full rebuild and streaming for hourly adds unnecessary complexity and cost; streaming updates alone suffice for hourly increments without needing a daily rebuild. Option C is wrong because batch updates replace the entire index, causing downtime or increased latency during the rebuild, which violates the low-latency online query requirement. Option D is wrong because recreating the index from scratch each hour is inefficient, time-consuming, and disrupts query availability, making it unsuitable for real-time serving.

89

MCQ · hard

You are using Vertex AI Matching Engine (Vector Search) to serve similarity search for an e-commerce product recommendation system. The index is updated daily with new product embeddings via a batch job. However, you notice that some new products are not appearing in the search results for up to 24 hours. You need to ensure that new products are discoverable within 1 hour of ingestion. What should you do?

A. Switch to a streaming update configuration for the index.

B. Deploy multiple index endpoints and use traffic splitting.

C. Increase the frequency of the batch update job to run every 30 minutes.

D. Use a brute-force index instead of an ANN index.

Answer: A

Streaming updates allow near real-time index updates, making new products searchable within minutes.

Why this answer

Option A is correct because Vertex AI Matching Engine supports streaming updates to the index, which allows new embeddings to be added in near real-time (typically within minutes) without requiring a full batch rebuild. By switching to a streaming update configuration, new product embeddings become searchable within the 1-hour requirement, as the index is updated incrementally as data arrives.

Exam trap

The trap here is that candidates may assume increasing batch frequency (Option C) is sufficient, but they overlook that batch updates in Vertex AI Matching Engine are designed for daily or less frequent rebuilds and cannot guarantee sub-hour freshness due to build and deployment overhead, whereas streaming updates are the intended solution for low-latency index updates.

How to eliminate wrong answers

Option B is wrong because deploying multiple index endpoints with traffic splitting does not address the latency of index updates; it only distributes query load across existing index versions, which still rely on the same batch update schedule. Option C is wrong because increasing the batch job frequency to every 30 minutes still introduces a delay of up to 30 minutes plus processing time, and batch updates are not designed for sub-hour freshness; they also require rebuilding the entire index, which is resource-intensive and may not meet the 1-hour SLA consistently. Option D is wrong because using a brute-force index (exact nearest neighbor) instead of an ANN index does not change the update mechanism; it only affects search accuracy and performance, not the freshness of data ingestion.

90

Multi-Select · hard

A fintech company needs to deploy a TensorFlow model for real-time fraud detection with strict latency SLO (p99 < 100ms). They expect variable traffic with spikes. They also want to minimize cold-start latency. Which two configurations should they use? (Choose 2)

Select 2 answers

A. Set min_replicas = 0 to allow scale-to-zero and save costs.

B. Use a GPU-enabled machine type (e.g., N1 with T4) to accelerate inference.

C. Set min_replicas = 3 to keep a baseline of warm instances.

D. Enable Vertex AI Model Optimization for automatic quantization.

E. Use batch prediction instead of online prediction.

Answers: B, C

GPUs can reduce inference latency for deep learning models.

Why this answer

Option B is correct because GPU-enabled machine types (e.g., N1 with T4) significantly accelerate TensorFlow model inference, which is critical for meeting the p99 < 100ms latency SLO. GPUs parallelize matrix operations common in deep learning models, reducing per-request latency even under variable traffic spikes.

Exam trap

A common misconception is that scale-to-zero (min_replicas = 0) is always cost-effective, but in latency-sensitive real-time inference, it introduces unacceptable cold-start delays, making baseline warm instances (min_replicas > 0) essential.

91

MCQ · easy

Which machine type is most suitable for a Vertex AI endpoint serving a GPU-accelerated model?

A. n1-standard-4 with attached GPU

B. e2-standard-4

C. n1-highmem-8

D. n1-standard-4

Answer: A

You need either a GPU machine type or a GPU attached to an N1 machine.

Why this answer

Option A is correct because Vertex AI endpoints require a machine type that supports GPU acceleration, and the n1-standard-4 with an attached GPU provides the necessary CPU-to-GPU balance for inference workloads. The n1 series supports GPU attachments via the `accelerator` configuration, enabling CUDA-based model serving, while the e2 series and n1-highmem-8 without GPU cannot leverage GPU acceleration.

Exam trap

A common misconception is that any n1 machine type inherently supports GPU acceleration, but the trap is that the GPU must be explicitly attached via the `accelerator` configuration. Options like n1-highmem-8 or n1-standard-4 without the `accelerator` are CPU-only, failing to meet the requirement for a GPU-accelerated model on Vertex AI.

How to eliminate wrong answers

Option B is wrong because e2-standard-4 does not support GPU attachments; the e2 machine series is CPU-only. Option C is wrong because n1-highmem-8, while part of the n1 series, is a high-memory configuration that is overkill for most GPU-accelerated models and does not include a GPU by default; without an attached GPU, it cannot accelerate model inference. Option D is wrong because n1-standard-4 without an attached GPU is a CPU-only instance that cannot utilize GPU acceleration, making it unsuitable for a GPU-accelerated model endpoint.

92

MCQ · medium

You have a Vertex AI endpoint with two deployed models: model A (champion) and model B (challenger). Traffic split is 90:10. You want to gradually increase model B's traffic to 50% over a week. What is the best way to update the traffic split?

A. Use gcloud ai models upload to overwrite model B with new settings.

B. Create a new endpoint and migrate traffic gradually using a load balancer.

C. Use the gcloud ai endpoints update command to change traffic split.

D. Delete model B and redeploy with a new traffic split.

Answer: C

Correct. The gcloud command can update traffic percentages.

Why this answer

The `gcloud ai endpoints update` command allows you to directly modify the traffic split of an existing Vertex AI endpoint without redeploying or recreating the endpoint. This is the intended method for gradually shifting traffic between deployed models, as it supports incremental updates to the `--traffic-split` parameter, enabling a controlled rollout from 90:10 to 50:50 over a week.

Exam trap

Google often tests the misconception that you must recreate or redeploy resources to change configuration, when in fact Vertex AI endpoints support live updates via the `update` command, avoiding unnecessary downtime and complexity.

How to eliminate wrong answers

Option A is wrong because `gcloud ai models upload` is used to upload a new model version, not to update traffic split settings on an existing endpoint; traffic splits are managed at the endpoint level, not the model level. Option B is wrong because creating a new endpoint and using a load balancer adds unnecessary complexity and cost, and Vertex AI endpoints natively support traffic splitting without external load balancers. Option D is wrong because deleting and redeploying model B would cause downtime for the 10% of traffic already served by model B, and the traffic split can be updated without redeployment.
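
The gradual rollout from 90:10 to 50:50 can be planned as a sequence of splits, one per update. The sketch below only computes the plan; the function `ramp_schedule`, the four-step cadence, and the deployed-model IDs `model-a` and `model-b` are hypothetical placeholders, and each resulting split would still have to be applied to the endpoint separately.

```python
def ramp_schedule(start_pct, end_pct, steps):
    """Evenly spaced challenger percentages, each paired with the champion share."""
    schedule = []
    for i in range(1, steps + 1):
        challenger = start_pct + round((end_pct - start_pct) * i / steps)
        # The two shares always sum to 100, as a traffic split requires.
        schedule.append({"model-a": 100 - challenger, "model-b": challenger})
    return schedule


plan = ramp_schedule(start_pct=10, end_pct=50, steps=4)
for split in plan:
    print(split)
```

Spacing the four updates across the week gives time to compare the challenger's metrics at each level before shifting more traffic to it.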

93

MCQ · medium

An engineer needs to deploy multiple models on a single Vertex AI endpoint with separate traffic allocations. What is the maximum number of deployed models that can be assigned traffic on one endpoint?

A. 2

B. Unlimited

C. 10

D. 5

Answer: D

Vertex AI allows up to 5 deployed models per endpoint with traffic splitting.

Why this answer

Vertex AI endpoints allow up to 5 deployed models to receive traffic simultaneously, with each model assigned a traffic percentage that sums to 100%. This limit ensures predictable routing and resource management, preventing overcommitment of the endpoint's underlying infrastructure.

Exam trap

Google often tests the misconception that Vertex AI endpoints support an unlimited number of deployed models or a higher number like 10, but the actual hard limit is 5, as defined in the Vertex AI quotas documentation.

How to eliminate wrong answers

Option A is wrong because the limit is not 2; Vertex AI supports up to 5 deployed models per endpoint, not just a pair for canary testing. Option B is wrong because the number is not unlimited; there is a hard cap of 5 to maintain endpoint stability and avoid excessive routing complexity. Option C is wrong because the limit is 5, not 10; this is a specific quota enforced by Vertex AI to control resource allocation and prevent performance degradation.

94

MCQ · hard

Your team has built a low-latency similarity search service using Vertex AI Matching Engine (Vector Search). The index is updated daily with new embeddings. You need to serve the latest index without downtime. What is the correct deployment strategy?

A. Use streaming updates to continuously update the index.

B. Create a new index version, deploy it to the same endpoint, and then update the endpoint to use the new index version.

C. Update the existing index in place by calling the index update API.

D. Delete the old index and create a new index each day, then deploy the new index to a new endpoint and update DNS.

Answer: B

Correct. This allows zero-downtime updates.

Why this answer

Option B is correct because Vertex AI Matching Engine supports deploying a new index version to the same endpoint without downtime. You create a new index version from the updated embeddings, deploy it to the existing endpoint, and then update the endpoint to use the new version. This allows traffic to seamlessly switch to the updated index once it is fully deployed, avoiding any service interruption.

Exam trap

Candidates may reach for streaming updates (Option A) because they keep an index fresh, but the scenario describes a daily batch refresh that must be cut over without interruption. Streaming addresses freshness between builds, not the versioned cutover asked for here, so versioned deployment (Option B) is the intended answer.

How to eliminate wrong answers

Option A is wrong for this scenario: streaming updates apply to indexes created for streaming and target near-real-time freshness, whereas the index here is produced by a daily batch job. Option C is wrong because an in-place update modifies the serving index directly, leaving no separately validated version to switch to or roll back from. Option D is wrong because deleting the old index and creating a new one each day, then deploying to a new endpoint and updating DNS, introduces unnecessary complexity and potential downtime during DNS propagation and endpoint creation.

95

MCQ · medium

Your team has deployed a model on Vertex AI endpoints. You need to monitor the prediction latency to ensure it meets a 99th percentile SLO of 500ms. You want to set up an alert if the latency exceeds this threshold. Which metric should you use?

A. The 99th percentile of the `prediction/online/response_latencies` metric.

B. The number of prediction requests that timeout.

C. Average prediction latency from the endpoint's logs.

D. The maximum prediction latency from the endpoint's monitoring dashboard.

Answer: A

This metric provides quantile data for latency, allowing you to monitor the 99th percentile.

Why this answer

Option A is correct because the `prediction/online/response_latencies` metric in Vertex AI provides a distribution of latency values, allowing you to query the 99th percentile directly. This aligns with the SLO requirement to monitor the tail latency, not the average or maximum, ensuring that the worst-case performance for 1% of requests stays under 500ms.

Exam trap

Google Cloud often tests the distinction between tail latency (percentiles) and central tendency (average) or extreme values (maximum), trapping candidates who confuse SLO monitoring with simple failure counts or averages.

How to eliminate wrong answers

Option B is wrong because the number of prediction requests that timeout is a count of failures, not a latency measurement; it does not capture the 99th percentile latency and would miss requests that complete but exceed 500ms. Option C is wrong because average prediction latency can mask high tail latencies; a low average could hide a significant number of requests exceeding 500ms, violating the SLO. Option D is wrong because the maximum prediction latency is a single extreme value, often an outlier due to cold starts or transient spikes, and does not represent the 99th percentile behavior required for the SLO.
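To make the difference between mean, maximum, and 99th percentile concrete, here is a minimal sketch in plain Python. The latency samples are invented for illustration, and the nearest-rank percentile used here is one common definition, not necessarily the one Cloud Monitoring applies to its distribution metrics.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    # Multiply before dividing so whole-number results stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in ms: 97 fast requests, 2 slow ones, 1 extreme outlier.
latencies_ms = [120] * 97 + [650, 700] + [4000]

SLO_MS = 500
mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)
max_ms = max(latencies_ms)

print(f"mean={mean_ms:.1f} ms (under SLO: {mean_ms < SLO_MS})")
print(f"p99={p99_ms} ms (under SLO: {p99_ms < SLO_MS})")
print(f"max={max_ms} ms")
```

In this made-up sample the mean looks healthy while the 99th percentile breaches the 500ms SLO, and the maximum is dominated by a single outlier, which is why the percentile is the value to alert on.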

96

MCQ · hard

A company is deploying multiple models on a single Vertex AI endpoint to reduce costs. Each model has different traffic patterns. Which configuration should they use?

A. Use Cloud Run to serve each model as a separate service.

B. Use a single endpoint with multiple deployed models and traffic allocation.

C. Deploy each model to separate endpoints and use a load balancer.

D. Use Vertex AI Matching Engine to serve models.

Answer: B

Multi-model serving allows deploying several models on one endpoint with traffic splitting.

Why this answer

Vertex AI endpoints support deploying multiple models behind a single endpoint with traffic splitting, which assigns a percentage of the endpoint's requests to each deployed model. This keeps one serving surface to manage instead of separate endpoints or services. Note that each deployed model still specifies its own machine resources, so the savings come from consolidation and right-sizing rather than from automatically shared compute. Traffic allocation can be adjusted to match changing model usage without redeploying.

Exam trap

The trap here is that candidates confuse Vertex AI endpoints with generic load balancing (Option C) or think separate services (Option A) are needed for different models, missing the cost-saving capability of traffic splitting on a single endpoint.

How to eliminate wrong answers

Option A is wrong because Cloud Run serves each model as a separate service, which does not consolidate models under a single endpoint and incurs additional networking and management overhead, failing to reduce costs as intended. Option C is wrong because deploying each model to separate endpoints and using a load balancer adds complexity and cost (multiple endpoints, load balancer charges) without leveraging Vertex AI's native traffic splitting, which is more efficient. Option D is wrong because Vertex AI Matching Engine is designed for vector similarity search (e.g., embeddings), not for serving multiple prediction models with traffic allocation on a single endpoint.
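As an illustration of what traffic allocation means, the sketch below models a traffic split as a mapping from deployed model ID to an integer percentage that must sum to 100, then simulates routing. The model IDs and the `validate_traffic_split` and `route` helpers are hypothetical; this is a toy simulation, not the Vertex AI SDK.

```python
import random

def validate_traffic_split(split):
    """Check that shares are non-negative integers that sum to 100."""
    if any(not isinstance(p, int) or p < 0 for p in split.values()):
        raise ValueError("each share must be a non-negative integer")
    if sum(split.values()) != 100:
        raise ValueError("traffic split must sum to 100")
    return split

def route(split, rng):
    """Pick one deployed model ID with probability proportional to its share."""
    ids = list(split)
    return rng.choices(ids, weights=[split[i] for i in ids], k=1)[0]

# Hypothetical deployed model IDs and shares.
traffic_split = validate_traffic_split({"model-a": 80, "model-b": 20})

rng = random.Random(0)  # seeded so the simulation is repeatable
counts = {model_id: 0 for model_id in traffic_split}
for _ in range(10_000):
    counts[route(traffic_split, rng)] += 1

print(counts)
```

The observed counts land close to the configured 80/20 shares; changing the mapping changes the routing without touching the models themselves.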

97

MCQ · hard

A company wants to deploy a TensorFlow model on edge devices for real-time inference without internet connectivity. Which Vertex AI service should they use to manage the deployment?

A. TensorFlow Lite Converter

B. Vertex AI Endpoint

C. Vertex AI Edge Manager

D. AI Platform Prediction (legacy)

Answer: C

Edge Manager specifically manages model deployment on edge devices.

Why this answer

Vertex AI Edge Manager allows you to deploy models to edge devices and manage them centrally.

98

MCQ · medium

You are using Vertex AI Matching Engine for similarity search. Your index has 10 million embeddings of 512 dimensions. The query latency requirement is under 10ms for 99th percentile. Which index type should you choose?

A. Brute-force index with cosine distance.

B. Approximate Nearest Neighbor (ANN) index using the ScaNN algorithm.

C. A custom distance-based index using Cloud SQL.

D. A tree-based index from scikit-learn deployed as a custom container.

Answer: B

ANN with ScaNN is designed for low-latency, high-scale similarity search.

Why this answer

Option B is correct because the ScaNN (Scalable Nearest Neighbors) algorithm is specifically designed for high-dimensional, large-scale similarity search with strict latency requirements. With 10 million 512-dimensional embeddings, an ANN index like ScaNN can achieve sub-10ms query latency at the 99th percentile by trading a small amount of recall for dramatic speed improvements, which is exactly what Vertex AI Matching Engine optimizes for.

Exam trap

The trap here is that candidates assume brute-force is the only 'accurate' option and underestimate how severely the curse of dimensionality degrades tree-based and exact methods at 512 dimensions, leading them to pick A or D despite the explicit latency constraint.

How to eliminate wrong answers

Option A is wrong because a brute-force index computes exact distances against all 10 million embeddings, which for 512-dimensional vectors would require O(10M * 512) operations per query, far exceeding the 10ms latency target even with optimized hardware. Option C is wrong because Cloud SQL is a relational database not designed for vector similarity search; it lacks native support for high-dimensional distance computations and would require full table scans, making sub-10ms latency impossible at this scale. Option D is wrong because scikit-learn's tree-based indices (e.g., KD-Tree, Ball Tree) degrade to near-linear search in high dimensions (curse of dimensionality), performing no better than brute force for 512 dimensions, and deploying as a custom container adds unnecessary overhead without addressing the fundamental algorithmic limitation.
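The brute-force cost mentioned above can be worked out with simple arithmetic. The sketch below counts multiply-adds and bytes scanned per query; the sustained throughput figure is a made-up assumption used only to turn the count into a rough time, so treat the resulting milliseconds as illustrative rather than a benchmark.

```python
NUM_VECTORS = 10_000_000
DIMENSIONS = 512
BYTES_PER_FLOAT32 = 4

# One multiply-add per dimension per stored vector for a dot-product style distance.
multiply_adds_per_query = NUM_VECTORS * DIMENSIONS
bytes_scanned_per_query = multiply_adds_per_query * BYTES_PER_FLOAT32

# Hypothetical sustained throughput; real hardware varies widely.
ASSUMED_MULTIPLY_ADDS_PER_SECOND = 100e9
estimated_ms = multiply_adds_per_query / ASSUMED_MULTIPLY_ADDS_PER_SECOND * 1000

print(f"{multiply_adds_per_query:,} multiply-adds per query")
print(f"{bytes_scanned_per_query / 1e9:.2f} GB of embeddings scanned per query")
print(f"~{estimated_ms:.1f} ms per query at the assumed throughput")
```

Even under this generous assumption, a single exact scan takes several times the 10ms budget, before accounting for memory bandwidth or concurrent queries.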

99

MCQ · medium

You are using Vertex AI Vector Search with an approximate nearest neighbor index. You need to update the index with new data every hour. The updates must be available for queries immediately. Which update method should you use?

A. Recreate the index every hour using a scheduled job.

B. Batch update by creating a new index and deploying it.

C. Streaming updates using the streaming API.

D. Use a brute-force index that supports real-time updates.

Answer: C

Streaming updates allow immediate visibility of new data in the index.

Why this answer

Option C is correct because Vertex AI Vector Search supports streaming updates via its streaming API, which allows you to insert, update, or delete vectors in real time. This ensures that new data is immediately available for approximate nearest neighbor (ANN) queries without requiring index recreation or redeployment, meeting the requirement for hourly updates with instant query availability.

Exam trap

A common misconception is that updating an approximate nearest neighbor (ANN) index requires recreating and redeploying it. However, Vertex AI Vector Search provides a streaming API that enables real-time insert, update, and delete operations, making new data immediately available for queries without the overhead of full index rebuilds.

How to eliminate wrong answers

Option A is wrong because recreating the entire index every hour is inefficient and introduces downtime during the rebuild process, failing the requirement for immediate query availability. Option B is wrong because batch updating by creating a new index and deploying it involves a delay for building and deploying the index, so updates are not available immediately for queries. Option D is wrong because Vertex AI Vector Search does not offer a brute-force index that supports real-time updates; brute-force indices are typically used for exact nearest neighbor search and are not designed for real-time streaming updates in this service.

100

MCQ · hard

You are using Vertex AI Vector Search for a product recommendation system. Your index is updated with new embeddings every hour. To minimize query latency while keeping the index fresh, what should you do?

A. Use streaming updates to insert new embeddings into the deployed index.

B. Rebuild the entire index hourly as a batch job and redeploy it.

C. Create a new index each hour and use traffic splitting to gradually shift traffic.

D. Use a brute-force index instead of ANN to ensure accuracy after updates.

Answer: A

Streaming updates provide low-latency freshness without full rebuild.

Why this answer

Streaming updates allow real-time insertion of new embeddings without rebuilding the entire index, maintaining low latency for queries.

101

MCQ · hard

A team is deploying a model that has strict latency requirements: p99 response time under 100 ms. The model is CPU-only and will receive up to 1000 QPS. They want to minimize cost while meeting the SLO. Which machine type and scaling configuration is most appropriate?

A. GPU-enabled machine with min_replicas=1 and max_replicas=2

B. n1-standard-8 with min_replicas=3 and max_replicas=3 (fixed)

C. n1-highmem-2 with min_replicas=2 and max_replicas=10

D. n1-standard-4 with min_replicas=1 and max_replicas=5, CPU utilization target 60%

Answer: D

Correct: n1-standard-4 provides moderate CPU; autoscaling on CPU utilization meets latency and cost goals.

Why this answer

Option D is correct because it uses a CPU-only machine (n1-standard-4) with autoscaling based on CPU utilization target of 60%, which balances cost and performance for a latency-sensitive, CPU-bound inference workload at 1000 QPS. The min_replicas=1 ensures a baseline capacity, while max_replicas=5 allows scaling to handle spikes without over-provisioning, keeping p99 under 100 ms.

Exam trap

Google often tests the misconception that GPU machines are always faster for ML inference, but for CPU-only models with strict latency SLOs, a properly scaled CPU instance with autoscaling is more cost-effective and meets performance requirements.

How to eliminate wrong answers

Option A is wrong because GPU-enabled machines are unnecessary and cost-prohibitive for a CPU-only model, and the low max_replicas=2 may not handle 1000 QPS within the latency SLO. Option B is wrong because a fixed 3-replica deployment (n1-standard-8) lacks autoscaling, leading to either over-provisioning (waste) or under-provisioning (SLO violations) under variable load, and the machine type is oversized for the workload. Option C is wrong because n1-highmem-2 is memory-optimized, not compute-optimized, and the wide scaling range (2 to 10) with no utilization target can cause thrashing or high cost without guaranteeing latency.
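A back-of-the-envelope way to sanity-check a replica range is to divide peak QPS by the throughput one replica can sustain at the CPU target. The sketch below does that; the per-replica throughput of 350 QPS is a purely hypothetical figure (the question does not give one) and would need to come from a load test of the actual model.

```python
import math

def replicas_needed(peak_qps, qps_per_replica_at_full_cpu, cpu_target_pct):
    """Replicas required so each one stays at or below the CPU utilization target."""
    usable_qps = qps_per_replica_at_full_cpu * cpu_target_pct / 100
    return math.ceil(peak_qps / usable_qps)

PEAK_QPS = 1000                # from the question
CPU_TARGET_PCT = 60            # from option D
ASSUMED_QPS_PER_REPLICA = 350  # hypothetical; measure this with a load test

needed = replicas_needed(PEAK_QPS, ASSUMED_QPS_PER_REPLICA, CPU_TARGET_PCT)
print(f"replicas needed at peak: {needed}")
```

Under that assumption the peak fits within max_replicas=5, while a slower model would need a higher ceiling; the point is that the maximum should be derived from measured throughput rather than guessed.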

102

MCQ · medium

You need to run batch predictions on a large dataset stored in BigQuery using a Vertex AI model. The dataset contains 10 million rows, and each prediction takes about 100ms. You want to minimize cost and execution time. What should you do?

A. Export the BigQuery data to CSV in GCS, then run a custom Dataflow pipeline to make predictions.

B. Use Vertex AI batch prediction with BigQuery as the source and sink.

C. Use Vertex AI online prediction and send all rows as separate requests.

D. Use a custom container running on Google Kubernetes Engine to perform inference.

Answer: B

Batch prediction natively supports BigQuery, is cost-effective, and scales automatically.

Why this answer

Vertex AI batch prediction natively supports BigQuery as both input and output, eliminating the need for data export or custom pipelines. For 10 million rows at 100ms each, batch prediction processes them in parallel across multiple machines, minimizing execution time while avoiding the per-node costs of online prediction or the overhead of managing Dataflow or GKE clusters.

Exam trap

Google Cloud exams often test the distinction between batch and online prediction, trapping candidates who overlook that batch prediction is purpose-built for large-scale, offline inference with native BigQuery integration, while online prediction is for real-time, low-latency use cases.

How to eliminate wrong answers

Option A is wrong because exporting to CSV and using Dataflow adds unnecessary complexity and cost; Vertex AI batch prediction can read directly from BigQuery, avoiding data movement and extra processing steps. Option C is wrong because online prediction is designed for low-latency, real-time requests on small payloads, and sending 10 million separate requests would be prohibitively expensive and slow due to per-request pricing and network overhead. Option D is wrong because running a custom container on GKE requires you to manage infrastructure, scaling, and fault tolerance, which is more costly and complex than using Vertex AI's managed batch prediction service.
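The numbers in the question make the case for parallel batch processing on their own. The following sketch works through the arithmetic; the worker count of 20 is a hypothetical value chosen for illustration, and the model ignores startup time and I/O overhead.

```python
ROWS = 10_000_000          # from the question
MS_PER_PREDICTION = 100    # from the question

def wall_clock_hours(rows, ms_per_row, parallel_workers):
    """Idealized wall-clock time, assuming perfect parallelism and no overhead."""
    total_seconds = rows * ms_per_row / 1000
    return total_seconds / parallel_workers / 3600

sequential_hours = wall_clock_hours(ROWS, MS_PER_PREDICTION, 1)
parallel_hours = wall_clock_hours(ROWS, MS_PER_PREDICTION, 20)  # 20 workers is hypothetical

print(f"one prediction at a time: {sequential_hours:.1f} hours")
print(f"20 parallel workers:      {parallel_hours:.1f} hours")
```

Processed one at a time, the job would take well over eleven days, which is why a service that fans the work out across machines matters more here than per-request latency.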

103

MCQ · medium

You have a Vertex AI endpoint serving a model with min replicas=2 and max replicas=10. You notice that during low traffic hours, the endpoint still runs 2 replicas, incurring costs. You want to reduce costs to zero when there is no traffic. What should you do?

A. Change min replicas to 0 and max replicas to 10.

B. Use a custom metric to trigger scaling down to zero.

C. Delete the endpoint when not in use and recreate it on demand.

D. Set max replicas to 0.

Answer: A

This enables scale-to-zero, allowing the endpoint to scale down to zero replicas when idle.

Why this answer

Setting min replicas to 0 allows Vertex AI to scale down to zero instances when there is no traffic, eliminating costs during idle periods. The endpoint will automatically scale up from 0 to handle incoming requests, while max replicas=10 ensures it can handle peak load. This is the standard approach for cost optimization in Vertex AI endpoints.

Exam trap

The trap here is that candidates assume min replicas must be at least 1 for the endpoint to be available, but Vertex AI supports scale-to-zero with min replicas=0, which is the correct way to eliminate idle costs.

How to eliminate wrong answers

Option B is wrong because custom metrics can trigger scaling but cannot override the min replicas constraint; with min replicas=2, the endpoint will never scale below 2 replicas regardless of the metric. Option C is wrong because deleting and recreating the endpoint on demand is impractical, introduces latency for cold starts, and violates best practices for production serving; Vertex AI endpoints are designed to be persistent. Option D is wrong because setting max replicas to 0 would prevent the endpoint from serving any traffic, effectively breaking the service, not just reducing costs.
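The role of the two bounds can be shown with a simplified model in which the autoscaler's desired replica count is clamped between the minimum and maximum. The `effective_replicas` helper is hypothetical and deliberately ignores how the real autoscaler computes demand.

```python
def effective_replicas(demand, min_replicas, max_replicas):
    """Clamp the replica count implied by traffic to the configured bounds."""
    return max(min_replicas, min(demand, max_replicas))

# Idle period (demand of 0 replicas) under the original and the proposed settings.
print(effective_replicas(0, min_replicas=2, max_replicas=10))   # floor of 2 keeps billing
print(effective_replicas(0, min_replicas=0, max_replicas=10))   # can reach zero

# Busy period: the minimum no longer matters, the maximum caps growth.
print(effective_replicas(7, min_replicas=0, max_replicas=10))
print(effective_replicas(25, min_replicas=0, max_replicas=10))

# A maximum of 0 leaves no capacity regardless of demand.
print(effective_replicas(7, min_replicas=0, max_replicas=0))
```

In this model only the minimum decides what happens when traffic stops, which is why a scaling metric alone (Option B) cannot push the count below the configured floor.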

104

MCQ · hard

An ML team wants to deploy multiple models (e.g., a recommender and a classifier) behind a single Vertex AI endpoint. The models have different resource requirements: the recommender needs GPU, the classifier needs high memory. How should they configure the endpoint?

A. Use Cloud Run for one model and Vertex AI for the other.

B. Use a single machine type that meets the highest requirements.

C. Deploy both models to the same endpoint with different machine types per deployed model.

D. Create separate endpoints for each model.

Answer: C

Vertex AI supports deploying multiple models with independent machine specifications.

Why this answer

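Vertex AI allows deploying multiple models on the same endpoint, each with its own machine type and resources, because the machine specification is set per deployed model rather than per endpoint. The endpoint's traffic split then determines what share of requests each deployed model receives.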

105

MCQ · easy

Which API is recommended for high-throughput, low-latency online prediction requests to Vertex AI endpoints?

A. Cloud Functions

B. REST API

C. Cloud Pub/Sub

D. gRPC API

Answer: D

gRPC provides better performance for online prediction due to binary serialization and streaming.

Why this answer

gRPC API is recommended for high-throughput, low-latency online prediction requests to Vertex AI endpoints because it uses HTTP/2 for multiplexed streaming, binary serialization (Protocol Buffers), and supports bidirectional streaming, which reduces latency and improves throughput compared to REST. Vertex AI's prediction service natively supports gRPC for real-time inference, making it the optimal choice for latency-sensitive applications.

Exam trap

Google often tests the misconception that REST API is the default or only way to interact with cloud services, but the trap here is that for high-throughput, low-latency online predictions, gRPC is explicitly recommended over REST due to its performance advantages with Protocol Buffers and HTTP/2.

How to eliminate wrong answers

Option A is wrong because Cloud Functions is a serverless compute service for event-driven code, not an API for making prediction requests; it can invoke Vertex AI endpoints via REST or gRPC but is not itself an API protocol. Option B is wrong because REST API uses HTTP/1.1 with JSON serialization, which introduces higher latency and larger payload sizes compared to gRPC's binary Protocol Buffers, making it suboptimal for high-throughput, low-latency scenarios. Option C is wrong because Cloud Pub/Sub is a message queue for asynchronous, decoupled messaging, not designed for synchronous, low-latency online predictions; it adds queuing delay and is intended for batch or event-driven workflows.
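The payload-size part of this argument can be illustrated without any cloud dependency. The sketch below encodes the same 512-value feature vector as JSON text and as packed 32-bit floats. The packed form is only a stand-in for a binary encoding such as Protocol Buffers, not the actual gRPC wire format, and the `instances` wrapper and the vector values are assumptions for the example.

```python
import json
import struct

# Hypothetical 512-dimensional feature vector.
feature_vector = [i / 512 for i in range(512)]

# Text encoding, as a JSON request body would carry it.
json_bytes = json.dumps({"instances": [feature_vector]}).encode("utf-8")

# Binary encoding: 4 bytes per value as little-endian float32.
# Note: float32 keeps less precision than the decimal text above.
binary_bytes = struct.pack(f"<{len(feature_vector)}f", *feature_vector)

print(f"JSON payload:   {len(json_bytes)} bytes")
print(f"binary payload: {len(binary_bytes)} bytes")
```

The binary form is a fixed 2048 bytes, while the JSON text is noticeably larger and also has to be parsed character by character on the server.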

106

MCQ · hard

You have a Vertex AI endpoint serving a model for real-time predictions. The endpoint is configured with minReplicaCount=2 and maxReplicaCount=10. Over the past week, you notice that the actual number of replicas rarely exceeds 2, but the average CPU utilization is around 85%. You want to reduce costs without impacting performance. What should you do?

A. Increase minReplicaCount to 5.

B. Decrease minReplicaCount to 1.

C. Increase maxReplicaCount to 20.

D. Decrease the CPU utilization target to 50%.

Answer: B

Since the number of replicas rarely exceeds 2, lowering min to 1 reduces the baseline cost, and the autoscaler can still scale up if needed.

Why this answer

Option B is correct because decreasing minReplicaCount to 1 allows the endpoint to scale down to a single replica when traffic is low, reducing compute costs. Since the actual replica count rarely exceeds 2, the current minReplicaCount=2 forces at least two replicas to run continuously, even when one would suffice. With average CPU utilization at 85%, the model is already efficiently handling load, so scaling down to one replica will not impact performance while saving costs.

Exam trap

The trap here is that candidates often assume increasing minReplicaCount or maxReplicaCount improves performance, but the question focuses on cost reduction without impacting performance, and the key insight is that the current minReplicaCount is unnecessarily high given the actual scaling behavior.

How to eliminate wrong answers

Option A is wrong because increasing minReplicaCount to 5 would force at least 5 replicas to run at all times, increasing costs without any performance benefit since the actual replica count rarely exceeds 2. Option C is wrong because increasing maxReplicaCount to 20 does not address the cost issue; the endpoint rarely scales beyond 2 replicas, so a higher maximum has no effect on current spending. Option D is wrong because decreasing the CPU utilization target to 50% would cause the autoscaler to add more replicas prematurely, increasing costs and potentially causing unnecessary scaling events, while the current 85% utilization indicates efficient resource usage.

107

MCQ · medium

An organization wants to deploy a TensorFlow model on edge devices such as smartphones and IoT devices for offline inference. Which format should they export the model to?

A. ONNX format

B. TensorFlow Lite (TFLite)

C. SavedModel format

D. HDF5 format

Answer: B

TFLite is the standard format for deploying models on mobile, embedded, and IoT devices.

Why this answer

TensorFlow Lite is designed for on-device inference on mobile and edge devices, with reduced model size and optimized performance.

108

MCQ · easy

You need to deploy a model to a Vertex AI endpoint that can scale down to zero when there are no requests to minimize costs. Which feature should you enable?

A. Deploy the model to a Compute Engine instance and use instance groups.

B. Use a custom metric for autoscaling.

C. Enable autoscaling with minReplicaCount=0.

D. Set maxReplicaCount to 0.

Answer: C

minReplicaCount=0 allows the endpoint to scale to zero when idle.

Why this answer

Option C is correct because Vertex AI endpoints support autoscaling with a `minReplicaCount` of 0, which allows the endpoint to scale down to zero instances when there are no incoming requests, thereby minimizing costs. This feature is specifically designed for serverless model serving, where the endpoint automatically scales up from zero when traffic arrives and scales down to zero during idle periods.

Exam trap

The trap here is that candidates confuse `minReplicaCount=0` with `maxReplicaCount=0`, thinking that setting the maximum to zero will scale down to zero, but in reality, `maxReplicaCount=0` disables the endpoint entirely, while `minReplicaCount=0` is the correct parameter to allow scaling to zero instances.

How to eliminate wrong answers

Option A is wrong because deploying to a Compute Engine instance with instance groups does not natively support scaling down to zero; instance groups require at least one running instance, and you would still incur costs for the underlying VMs even if they are idle. Option B is wrong because custom metrics for autoscaling can help scale based on custom signals, but they do not enable scaling to zero replicas unless the underlying autoscaler supports a `minReplicaCount` of 0, which is not a feature of custom metrics alone. Option D is wrong because setting `maxReplicaCount` to 0 would prevent any replicas from being deployed, making the endpoint unable to serve any requests; `maxReplicaCount` controls the upper limit, not the lower limit for scaling down.

109

MCQ · medium

You need to query a Vertex AI Vector Search index for nearest neighbors. The index is deployed on an endpoint. Which API method should you use to perform the query?

A. projects.locations.indexEndpoints.findNeighbors

B. projects.locations.indexes.match

C. projects.locations.indexes.query

D. projects.locations.endpoints.predict

Answer: A

Correct. The findNeighbors method is used to query a deployed index endpoint.

Why this answer

The correct API method to query a deployed Vertex AI Vector Search index for nearest neighbors is `projects.locations.indexEndpoints.findNeighbors`. This method is specifically designed for vector similarity search against an index endpoint, returning the nearest neighbors for a given query vector. The other options either target the wrong resource (indexes instead of indexEndpoints) or use methods intended for different purposes like model prediction.

Exam trap

The exam often tests the distinction between model prediction endpoints and vector search endpoints, so the trap here is confusing the `predict` method (for model inference) with the `findNeighbors` method (for vector similarity search), leading candidates to incorrectly select option D.

How to eliminate wrong answers

Option B is wrong because `projects.locations.indexes.match` is not a valid API method; the correct method for matching against an index is `findNeighbors` on the index endpoint. Option C is wrong because `projects.locations.indexes.query` does not exist; the query operation for vector search is performed via the index endpoint, not directly on the index resource. Option D is wrong because `projects.locations.endpoints.predict` is used for online prediction from a deployed model, not for querying a vector search index.
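For orientation, here is a rough sketch of what a `findNeighbors` request body might look like when built in Python. The field names (`deployedIndexId`, `queries`, `datapoint.featureVector`, `neighborCount`) follow my reading of the REST reference and should be verified against the current API documentation; the deployed index ID, the query vector, and the `build_find_neighbors_body` helper are hypothetical. The sketch only builds the JSON and does not send a request.

```python
import json

def build_find_neighbors_body(deployed_index_id, query_vector, neighbor_count):
    """Assemble a request body for one nearest-neighbor query (field names assumed)."""
    return {
        "deployedIndexId": deployed_index_id,
        "queries": [
            {
                "datapoint": {"featureVector": list(query_vector)},
                "neighborCount": neighbor_count,
            }
        ],
    }

# Hypothetical deployed index ID and a tiny 3-dimensional query vector.
body = build_find_neighbors_body("my_deployed_index", [0.1, 0.2, 0.3], 5)
print(json.dumps(body, indent=2))
```

The body is sent to the index endpoint resource rather than to the index itself, which matches the distinction the answer options are testing.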
