Knowledge + Practice

Google Cloud PMLE Serving and Scaling Questions

75 of 109 questions · Page 1/2 · PMLE Serving and Scaling topic · Answers revealed

1

MCQ · medium

You are deploying a PyTorch model for online predictions on Vertex AI. The model expects input tensors and performs GPU-accelerated inference. You want to minimize prediction latency and maximize throughput. Which approach should you use?

A. Package the model in a custom container without any inference server.

B. Deploy using a prebuilt PyTorch serving container with NVIDIA Triton Inference Server.

C. Use Vertex AI Model Optimization to quantize the model to FP16 and deploy using the optimized model.

D. Use batch prediction instead of online prediction to reduce latency.

Answer: B

Triton is optimized for GPU inference and can reduce latency and increase throughput.

Why this answer

Option B is correct because NVIDIA Triton Inference Server provides advanced features like dynamic batching, concurrent model execution, and GPU scheduling that maximize throughput and minimize latency for GPU-accelerated inference. Vertex AI's prebuilt PyTorch serving container with Triton is specifically designed to handle online prediction workloads efficiently, outperforming a plain custom container without an inference server.

Exam trap

The exam often tests the misconception that model optimization alone (e.g., quantization) is sufficient for low-latency serving; in fact, the inference server's request handling and batching capabilities are critical for minimizing latency and maximizing throughput in online predictions.

How to eliminate wrong answers

Option A is wrong because a custom container without any inference server lacks request batching, model queuing, and GPU utilization optimizations, leading to higher latency and lower throughput under concurrent requests. Option C is wrong because Vertex AI Model Optimization for FP16 quantization reduces model size and can improve throughput, but it does not address the serving infrastructure needed for low-latency online predictions; the deployment still requires an inference server like Triton to handle request management and GPU scheduling. Option D is wrong because batch prediction is designed for high-throughput, offline processing of large datasets and typically has higher latency per request due to job queuing and resource provisioning, making it unsuitable for minimizing prediction latency in online scenarios.
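
The dynamic batching idea mentioned above can be sketched in a few lines. This is a toy model, not Triton's implementation: the function name `form_batches` and the parameters `max_batch_size` and `max_delay_ms` are hypothetical, chosen only to show how requests that arrive close together share a single model execution.

```python
from typing import List


def form_batches(arrivals_ms: List[float], max_batch_size: int,
                 max_delay_ms: float) -> List[List[int]]:
    """Group request indices into batches (toy dynamic batcher).

    A batch is closed when it is full, or when the next request arrives
    more than max_delay_ms after the first request in the batch.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    window_start = 0.0
    for i, t in enumerate(arrivals_ms):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (t - window_start > max_delay_ms)
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        if not current:
            window_start = t  # first request of a new batch opens the window
        current.append(i)
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times (ms) for six requests.
arrivals = [0.0, 1.0, 2.0, 3.0, 20.0, 21.0]
batches = form_batches(arrivals, max_batch_size=3, max_delay_ms=5.0)
print(batches)  # [[0, 1, 2], [3], [4, 5]]
```

Six requests are served with three model executions instead of six, which is where the throughput gain comes from; the delay window bounds how much latency a request can pay while waiting for batch-mates.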

2

Multi-Select · medium

A company is deploying a model on Vertex AI for online predictions with strict latency SLOs. The model requires GPU acceleration. Which TWO configurations should they consider to meet the SLOs while optimizing cost?

Select 2 answers

A. Use n1-highmem-32 machine types without GPU

B. Set min_replica_count to handle base traffic and max_replica_count to handle spikes

C. Use GPU-enabled machine types such as n1-standard-4 with T4

D. Enable autoscaling with min_replica_count=0 and max_replica_count=10

E. Disable autoscaling and set a fixed number of replicas equal to peak load

Answers: B, C

Ensures always-on capacity for base load and ability to scale up.

Why this answer

Options B and C are correct. Setting min_replica_count to cover base traffic keeps warm capacity available at all times, while max_replica_count lets the endpoint absorb spikes without paying for peak capacity around the clock. A GPU-enabled machine type such as n1-standard-4 with a T4 is needed because the model requires GPU acceleration. Option A provides no GPU. Option D allows scale-to-zero, which lowers cost but introduces cold starts that can violate strict latency SLOs in production. Option E can meet the SLOs but pays for peak capacity at all times, so it does not optimize cost.

3

MCQ · easy

Which of the following is a benefit of using Vertex AI Endpoints with autoscaling and scale-to-zero?

A. It eliminates the need for a load balancer.

B. It reduces costs by scaling down to zero replicas when no requests are received.

C. It reduces model training time.

D. It automatically upgrades the model version.

Answer: B

Scale-to-zero minimizes cost for low-traffic endpoints.

Why this answer

Vertex AI Endpoints with autoscaling and scale-to-zero allow the number of serving replicas to dynamically adjust based on incoming traffic. When no requests are received, the endpoint can scale down to zero replicas, meaning you are not charged for idle compute resources. This directly reduces operational costs compared to maintaining a minimum number of always-on instances.

Exam trap

The exam often tests the misconception that autoscaling eliminates the need for a load balancer; in reality, the load balancer is a separate component that remains essential for request distribution even when scaling to zero.

How to eliminate wrong answers

Option A is wrong because Vertex AI Endpoints still require a load balancer (the built-in Google Cloud Load Balancer) to distribute incoming requests across replicas; autoscaling does not eliminate this need. Option C is wrong because model training time is a function of training infrastructure and algorithm, not of serving endpoint configuration like autoscaling. Option D is wrong because Vertex AI Endpoints do not automatically upgrade model versions; you must explicitly deploy a new model version or use a traffic split to route requests to a different version.
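
The cost argument behind scale-to-zero is simple arithmetic over replica-hours. The sketch below uses a made-up price (`HOURLY_RATE`) and a made-up traffic pattern (requests only between 09:00 and 17:00); neither comes from Vertex AI pricing.

```python
# Hypothetical price per replica-hour; not an actual Vertex AI rate.
HOURLY_RATE = 0.50

# Replicas needed in each hour of one day: traffic only from 09:00 to 17:00.
demand = [0] * 9 + [1] * 8 + [0] * 7

# An always-on endpoint keeps at least one replica running every hour.
always_on = [max(1, d) for d in demand]

# A scale-to-zero endpoint runs replicas only while there is demand.
scale_to_zero = list(demand)

cost_always_on = sum(always_on) * HOURLY_RATE
cost_scale_to_zero = sum(scale_to_zero) * HOURLY_RATE

print(cost_always_on)      # 12.0
print(cost_scale_to_zero)  # 4.0
```

The saving grows with the share of idle hours, which is why the option suits low-traffic endpoints; the trade-off is the cold start paid by the first request after an idle period.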

4

MCQ · medium

You are using Vertex AI batch prediction and your model requires preprocessing that involves joining two BigQuery tables. The preprocessing logic is complex and must be done before inference. How should you design the pipeline?

A. Write a Cloud Composer workflow that runs the preprocessing and then triggers the batch prediction job.

B. Use Dataflow to read from both BigQuery tables, perform the join and preprocessing, write the results to GCS, then run Vertex AI batch prediction with GCS source.

C. Use Vertex AI batch prediction with a custom container that includes logic to read and join tables on the fly.

D. Use BigQuery to create a materialized view that joins the tables and directly use that as the batch prediction source.

Answer: B

Dataflow handles the complex join and scales; batch prediction can read from GCS.

Why this answer

Option B is correct because Dataflow (Apache Beam) is designed for complex, stateful data processing like joining two BigQuery tables and performing custom preprocessing. It can read from BigQuery, execute the join logic, and write the preprocessed results to Cloud Storage (GCS). Vertex AI batch prediction then reads the preprocessed data from GCS, which is the recommended pattern for non-trivial transformations before inference, as it decouples preprocessing from prediction and avoids resource contention.

Exam trap

The exam often tests the misconception that batch prediction can handle live data transformations within the prediction container; the correct design is to preprocess data in a separate, scalable data processing service like Dataflow before feeding it to batch prediction.

How to eliminate wrong answers

Option A is wrong because Cloud Composer (Apache Airflow) is an orchestration tool, not a data processing engine; using it to run the preprocessing itself would be inefficient and error-prone, as it lacks native support for large-scale data joins and transformations. Option C is wrong because Vertex AI batch prediction with a custom container that reads and joins tables on the fly violates the principle of separation of concerns, leading to longer inference latency, higher memory usage, and potential timeouts during prediction, as batch prediction expects preprocessed input, not live database joins. Option D is wrong because the question describes preprocessing that is complex and goes beyond the join itself; a materialized view can precompute a join, but it supports only a restricted subset of SQL and cannot run the remaining preprocessing logic, so a separate processing step would still be needed.
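
To make the "join, preprocess, then write to GCS" step concrete, here is a minimal single-process sketch of the transformation logic. In the design described above this logic would run inside a Dataflow pipeline; the table contents, the field names (`user_id`, `country`, `amount_cents`), and the function names are all hypothetical. The output is JSON Lines, one instance per line, a common file format for batch prediction input.

```python
import json
from typing import Dict, Iterable, Iterator, List


def join_and_preprocess(users: Iterable[Dict], orders: Iterable[Dict]) -> Iterator[Dict]:
    """Inner-join orders to users on user_id and derive model features."""
    users_by_id = {u["user_id"]: u for u in users}
    for order in orders:
        user = users_by_id.get(order["user_id"])
        if user is None:
            continue  # inner join: drop orders with no matching user
        yield {
            "user_id": order["user_id"],
            "country": user["country"].lower(),           # normalise casing
            "amount_usd": order["amount_cents"] / 100,    # convert units
        }


def to_jsonl(instances: Iterable[Dict]) -> str:
    """Serialise instances as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(x, sort_keys=True) for x in instances)


# Hypothetical rows standing in for the two BigQuery tables.
users: List[Dict] = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "DE"}]
orders: List[Dict] = [{"user_id": 1, "amount_cents": 1250},
                      {"user_id": 3, "amount_cents": 500}]

jsonl = to_jsonl(join_and_preprocess(users, orders))
print(jsonl)  # {"amount_usd": 12.5, "country": "us", "user_id": 1}
```

Keeping the transformation in a plain function like this also makes it easy to unit test before wrapping it in a distributed pipeline.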

5

MCQ · easy

What is the primary purpose of Vertex AI Model Optimization?

A. To monitor model performance in production

B. To search for optimal hyperparameters

C. To optimize models for deployment by quantizing and compiling them

D. To train models faster using distributed training

Answer: C

Model Optimization reduces model size and improves inference speed.

Why this answer

Vertex AI Model Optimization automatically quantizes and compiles models to reduce latency and memory footprint for serving.

6

MCQ · easy

Which Vertex AI service is best suited for finding similar items in a large dataset based on embedding vectors, such as product recommendations or image similarity search?

A. Vertex AI Prediction Endpoint

B. Vertex AI Model Monitoring

C. Vertex AI Feature Store

D. Vertex AI Matching Engine

Answer: D

Correct: Matching Engine (Vector Search) is for ANN-based similarity search on embeddings.

Why this answer

Vertex AI Matching Engine is specifically designed for high-performance vector similarity search (also known as approximate nearest neighbor search) using embedding vectors. It scales to billions of vectors and is ideal for use cases like product recommendations and image similarity search, where you need to find the most similar items based on dense vector representations.

Exam trap

The exam often tests the distinction between serving predictions from a trained model (Prediction Endpoint) and performing vector similarity search (Matching Engine), leading candidates to confuse a general-purpose serving endpoint with a specialized vector database.

How to eliminate wrong answers

Option A is wrong because Vertex AI Prediction Endpoint serves model predictions via HTTP requests but does not provide built-in vector similarity search or indexing capabilities. Option B is wrong because Vertex AI Model Monitoring tracks prediction quality and data drift over time, not similarity search. Option C is wrong because Vertex AI Feature Store is a centralized repository for storing, serving, and sharing feature data, but it does not perform nearest neighbor search on embedding vectors.

7

MCQ · medium

Your Vertex AI endpoint is experiencing high latency during traffic spikes. You have set maxReplicas=10 and minReplicas=2. The CPU utilisation target is 60%. During spikes, the endpoint never scales beyond 4 replicas. What is the most likely reason?

A. The maxReplicas limit is set to 10, but the cooldown period is preventing rapid scaling.

B. You need to enable GPU acceleration.

C. The endpoint is using a legacy model framework.

D. The machine type is too small.

Answer: A

Correct. Cooldown periods can delay scaling decisions, especially for short spikes.

Why this answer

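A cooldown (stabilisation) period prevents rapid scaling: the autoscaler waits for utilisation to stay above the target for a sustained window before it adds replicas. If traffic spikes are very short, the utilisation spike does not persist long enough to trigger a full scale-up, so the endpoint stays well below maxReplicas.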
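
The effect of a spike that does not persist long enough can be illustrated with a toy simulation. This is not Vertex AI's autoscaling algorithm: the rule "scale up only when every sample in the last `window` samples asked for more replicas" and the window length are assumptions made for illustration. Load is expressed as total CPU demand in percent of one replica, and only scale-up is modelled.

```python
import math
from typing import List


def desired_replicas(load_pct: int, target_pct: int, min_r: int, max_r: int) -> int:
    """Replicas needed to keep per-replica utilisation at the target, clamped."""
    return max(min_r, min(max_r, math.ceil(load_pct / target_pct)))


def simulate(loads_pct: List[int], target_pct: int, min_r: int, max_r: int,
             window: int) -> List[int]:
    """Return the replica count after each sample (scale-up only)."""
    replicas = min_r
    recent: List[int] = []
    history: List[int] = []
    for load in loads_pct:
        recent.append(desired_replicas(load, target_pct, min_r, max_r))
        recent = recent[-window:]
        # Scale up only if the whole window agrees that more replicas are needed.
        if len(recent) == window and min(recent) > replicas:
            replicas = min(recent)
        history.append(replicas)
    return history


# Settings from the question: target 60%, minReplicas=2, maxReplicas=10.
short_spike = simulate([100, 100, 600, 100, 100], 60, 2, 10, window=3)
sustained = simulate([100, 600, 600, 600, 600], 60, 2, 10, window=3)
print(short_spike)  # [2, 2, 2, 2, 2]
print(sustained)    # [2, 2, 2, 10, 10]
```

The one-sample spike never changes the replica count, while the sustained load reaches the maximum only after the window has filled, which is the delay users experience as high latency during spikes.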

8

MCQ · hard

A company needs to perform real-time similarity search on a dataset of 10 million embedding vectors. They expect low latency (under 10ms) and high throughput. Which index type should they use in Vertex AI Vector Search?

A. Brute-force index

B. Hash-based index

C. Tree-based index

D. Approximate nearest neighbor (ANN) index with ScaNN

Answer: D

ANN with ScaNN provides fast approximate search, suitable for large-scale real-time search.

Why this answer

For large datasets requiring low latency, an approximate nearest neighbor (ANN) index is appropriate because it avoids comparing the query against every stored vector. Vertex AI Vector Search uses the ScaNN (Scalable Nearest Neighbors) algorithm for ANN.
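
For contrast, this is what the brute-force option does: score the query against every vector and sort. The sketch is plain Python with made-up two-dimensional vectors and hypothetical item IDs; it is exact, but its cost grows linearly with the dataset, which is the cost an ANN index is designed to avoid at the 10-million-vector scale.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_exact(query: Sequence[float], vectors: Dict[str, Sequence[float]],
                k: int) -> List[str]:
    """Exact nearest neighbours: scans every vector, O(N) per query."""
    ranked = sorted(vectors,
                    key=lambda item_id: cosine_similarity(query, vectors[item_id]),
                    reverse=True)
    return ranked[:k]


# Hypothetical embeddings keyed by item ID.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k_exact([1.0, 0.1], embeddings, k=2)
print(nearest)  # ['a', 'c']
```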

9

Multi-Select · medium

You are deploying a large deep learning model on Vertex AI endpoints. The model requires GPU acceleration and you want to minimize cold-start latency. Which TWO actions should you take? (Choose 2 correct answers)

Select 2 answers

A. Set minReplicaCount to 0 to allow scale-to-zero.

B. Use a custom container that loads the model during startup.

C. Increase maxReplicaCount to a high number.

D. Use batch prediction instead of online prediction.

E. Set minReplicaCount to 1 to always have at least one replica running.

Answers: B, E

Pre-loading the model reduces latency for the first prediction.

Why this answer

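Option B is correct because loading the model during container startup (for example, in the server process launched by the Dockerfile's ENTRYPOINT or CMD) ensures that the model is already in memory when the first prediction request arrives, which sharply reduces first-request latency. This is a standard practice for Vertex AI endpoints, where the container must be ready to serve immediately after scaling up. Option E is correct because setting minReplicaCount to 1 keeps at least one replica warm, so requests never wait for a replica to start from zero.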

Exam trap

The exam often tests the misconception that scale-to-zero (minReplicaCount=0) reduces latency, when in fact it increases cold-start latency; the correct approach is to keep at least one replica always warm (minReplicaCount=1) and pre-load the model during container startup.
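
Loading the model at startup rather than on the first request can be sketched as follows. The loader and the "model" here are stand-ins (`fake_load_model` just returns a doubling function); in a real container the same pattern would sit in the server's initialisation code before it starts accepting traffic.

```python
from typing import Callable, Dict

# Counts how many times the (expensive) load step runs.
calls: Dict[str, int] = {"load": 0}


def fake_load_model() -> Callable[[int], int]:
    """Stand-in for reading a large model file into memory."""
    calls["load"] += 1
    return lambda x: x * 2


class EagerPredictor:
    """Loads the model once in __init__, i.e. at process startup."""

    def __init__(self, loader: Callable[[], Callable[[int], int]]) -> None:
        self._model = loader()  # paid before any request is accepted

    def predict(self, x: int) -> int:
        return self._model(x)   # no load cost on the request path


predictor = EagerPredictor(fake_load_model)
loads_before_first_request = calls["load"]
results = [predictor.predict(3), predictor.predict(4)]

print(loads_before_first_request)  # 1
print(results)                     # [6, 8]
print(calls["load"])               # 1
```

The load happens exactly once and finishes before the first call to `predict`, so no request ever pays for it.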

10

MCQ · medium

You need to run a batch prediction job on Vertex AI using a model that requires custom preprocessing using a Python script. The preprocessing must be applied before inference. Which approach should you use?

A. Use Cloud Functions to preprocess each record individually.

B. Preprocess the data on a single VM and then upload to GCS.

C. Use Dataflow to preprocess the data and write the results to BigQuery or GCS, then launch a batch prediction job on the preprocessed data.

D. Include the preprocessing logic in the custom container used for batch prediction.

Answer: C

Dataflow can handle large-scale preprocessing and then feed the cleaned data to batch prediction.

Why this answer

Option C is correct because Dataflow (Apache Beam) is the recommended serverless service for distributed, scalable preprocessing of large datasets on Google Cloud. It can read raw data from GCS, apply custom Python preprocessing logic, and write the preprocessed results to GCS or BigQuery. The batch prediction job then reads the preprocessed data directly, avoiding the need to embed preprocessing in the prediction container or handle data on a single VM.

Exam trap

The exam often tests the misconception that preprocessing can be embedded in the prediction container (Option D) to simplify the pipeline; this violates the separation of concerns principle and leads to slower, less maintainable batch jobs.

How to eliminate wrong answers

Option A is wrong because Cloud Functions is designed for event-driven, lightweight processing of individual records, not for batch preprocessing of large datasets; its execution time and memory limits (for example, a 9-minute timeout on 1st gen functions) make it unsuitable for scalable batch preprocessing. Option B is wrong because preprocessing on a single VM creates a bottleneck, lacks fault tolerance, and does not scale horizontally for large datasets, violating best practices for production batch pipelines. Option D is wrong because including preprocessing logic in the custom container for batch prediction couples preprocessing with inference, increasing container complexity and startup time, and prevents reuse of the preprocessing pipeline for other downstream tasks.

11

MCQ · medium

A company deploys a model on Vertex AI Endpoints for real-time inference. They need to minimize latency for prediction requests that are identical to previous requests. Which approach should they use?

A. Use a regional load balancer with session affinity.

B. Implement a caching layer using Cloud Memorystore with request hashing.

C. Use Cloud CDN to cache prediction responses.

D. Enable prediction caching on Vertex AI Endpoints.

Answer: B

Cloud Memorystore (Redis) can cache prediction results keyed by a hash of the input request, reducing latency for duplicate requests.

Why this answer

Option B is correct because caching identical prediction requests using Cloud Memorystore with request hashing reduces latency by serving cached responses directly from an in-memory cache, avoiding redundant model inference. This approach is ideal for real-time inference where many requests are identical, as it bypasses the model endpoint entirely for cached requests, minimizing response time.

Exam trap

The trap here is that candidates may confuse Vertex AI's built-in features with external caching mechanisms, assuming 'prediction caching' is a native endpoint option when it is not, leading them to select option D.

How to eliminate wrong answers

Option A is wrong because a regional load balancer with session affinity distributes traffic based on client sessions, not request content, so it does not cache or reuse responses for identical requests, failing to reduce latency for repeated predictions. Option C is wrong because Cloud CDN caches static content at edge locations, but prediction responses from Vertex AI are dynamic and often require authentication or vary per request, making CDN caching unsuitable for real-time inference. Option D is wrong because Vertex AI Endpoints do not have a built-in 'prediction caching' feature; caching must be implemented externally, such as with Cloud Memorystore or Redis.
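
A minimal sketch of request hashing with a cache in front of the model is shown below. A plain dictionary stands in for Cloud Memorystore (Redis), and `fake_model` stands in for the call to the endpoint; the class and function names are hypothetical. The key is a SHA-256 hash of a canonical JSON encoding, so two requests with the same content but different key order hit the same cache entry.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Optional


class CachedPredictor:
    """Serve repeated requests from a cache keyed by a hash of the input."""

    def __init__(self, predict_fn: Callable[[Dict[str, Any]], Any],
                 cache: Optional[Dict[str, Any]] = None) -> None:
        self._predict_fn = predict_fn
        self._cache: Dict[str, Any] = {} if cache is None else cache

    @staticmethod
    def request_key(instance: Dict[str, Any]) -> str:
        # Canonical JSON (sorted keys, fixed separators) gives a stable hash.
        canonical = json.dumps(instance, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def predict(self, instance: Dict[str, Any]) -> Any:
        key = self.request_key(instance)
        if key not in self._cache:
            self._cache[key] = self._predict_fn(instance)  # cache miss
        return self._cache[key]


model_calls: List[Dict[str, Any]] = []


def fake_model(instance: Dict[str, Any]) -> Dict[str, int]:
    """Stand-in for the real endpoint call; records each invocation."""
    model_calls.append(instance)
    return {"score": len(instance)}


predictor = CachedPredictor(fake_model)
first = predictor.predict({"a": 1, "b": 2})
second = predictor.predict({"b": 2, "a": 1})  # same content, different key order

print(first, second)     # {'score': 2} {'score': 2}
print(len(model_calls))  # 1
```

A production cache would also need an expiry policy and invalidation when a new model version is deployed, since cached responses reflect the model that produced them.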

12

MCQ · hard

You have a Vertex AI endpoint with two deployed models: a champion (v1) and a challenger (v2). You set the traffic split to 90% v1 and 10% v2. After a week, you observe that v2 has better business metrics. You want to shift all traffic to v2 gradually over 3 days to avoid any risk. What should you do?

A. Deploy v2 to a new endpoint and update your clients to use the new endpoint.

B. Use Vertex AI Experiments to compare v1 and v2, then redeploy v2 with 100% traffic.

C. Update the traffic split configuration on the endpoint multiple times over the 3 days to gradually increase v2's percentage.

D. Delete v1 from the endpoint so that all traffic automatically goes to v2.

Answer: C

This is the correct method for gradual traffic shifting.

Why this answer

Option C is correct because Vertex AI endpoints support live traffic splitting between deployed models, allowing you to gradually shift traffic from v1 to v2 by updating the traffic split configuration multiple times over the 3-day period. This approach minimizes risk by enabling incremental rollouts and immediate rollback if issues arise, without requiring client-side changes or downtime.

Exam trap

The trap here is that candidates may assume deleting the old model or redeploying with 100% traffic is acceptable, but the question explicitly requires a gradual shift over 3 days to avoid risk, which only incremental traffic split updates can achieve.

How to eliminate wrong answers

Option A is wrong because deploying v2 to a new endpoint and updating clients introduces unnecessary complexity, potential downtime, and defeats the purpose of gradual traffic shifting; it also requires client-side changes, which is riskier and not aligned with the goal of avoiding risk. Option B is wrong because Vertex AI Experiments are used for offline model evaluation and comparison, not for live traffic management; redeploying v2 with 100% traffic would be an abrupt switch, not a gradual shift over 3 days. Option D is wrong because deleting v1 from the endpoint would immediately route 100% of traffic to v2, which is an abrupt change, not gradual, and violates the requirement to shift traffic gradually over 3 days to avoid risk.
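
One way to plan the gradual shift is to compute the sequence of traffic splits up front and apply one per day. The sketch below only generates the schedule; applying each split to the endpoint is left out. The keys `"v1"` and `"v2"` are placeholders for the deployed model IDs, and the linear ramp is an assumption, not a required policy.

```python
from typing import Dict, List


def rollout_schedule(start_new_pct: int, steps: int) -> List[Dict[str, int]]:
    """Linear ramp of the challenger's share from start_new_pct to 100.

    Each entry maps a deployed model to its traffic percentage, and the
    percentages in every entry sum to 100.
    """
    schedule: List[Dict[str, int]] = []
    for step in range(1, steps + 1):
        new_pct = start_new_pct + (100 - start_new_pct) * step // steps
        schedule.append({"v1": 100 - new_pct, "v2": new_pct})
    return schedule


# Starting from the 90/10 split in the question, shift over 3 daily steps.
plan = rollout_schedule(start_new_pct=10, steps=3)
for day, split in enumerate(plan, start=1):
    print(day, split)
# 1 {'v1': 60, 'v2': 40}
# 2 {'v1': 30, 'v2': 70}
# 3 {'v1': 0, 'v2': 100}
```

Because each step is a separate configuration update, the rollout can be paused or reversed at any step if v2's metrics degrade.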

13

MCQ · medium

You need to run batch predictions on 10 TB of text data stored in BigQuery using a custom container model hosted in Vertex AI. What is the most cost-effective and simple approach?

A. Use Vertex AI batch prediction with BigQuery source and sink.

B. Use Cloud Run jobs to read from BigQuery and write results back.

C. Export BigQuery data to GCS, then run a Dataflow pipeline to call the model's online prediction endpoint for each row.

D. Use Cloud Dataproc to spin up a Spark cluster and run the model inference in parallel.

Answer: A

Correct. Vertex AI batch prediction natively supports BigQuery as both source and sink.

Why this answer

Vertex AI batch prediction natively supports BigQuery as both source and sink, allowing you to run predictions on 10 TB of text data without any data movement or intermediate storage. This is the most cost-effective and simple approach because it eliminates the need for exporting data, managing infrastructure, or calling online endpoints, and it leverages Vertex AI's optimized batch inference infrastructure that scales automatically.

Exam trap

The exam often tests the misconception that you must export data from BigQuery to GCS before running batch predictions, when in fact Vertex AI batch prediction can directly read from and write to BigQuery, making the export step unnecessary and cost-inefficient.

How to eliminate wrong answers

Option B is wrong because Cloud Run jobs have no native integration with BigQuery or with Vertex AI models; you would have to write and maintain code to read and shard 10 TB of data, load the model, run inference, handle retries, and write results back, which is neither simple nor cost-effective compared with a managed batch prediction job. Option C is wrong because exporting data to GCS and then using Dataflow to call the online prediction endpoint for each row introduces unnecessary data movement, storage costs, and network latency; online endpoints are designed for low-latency single requests, not high-throughput batch processing, and this approach would be both slower and more expensive. Option D is wrong because Cloud Dataproc requires you to manage a Spark cluster, handle autoscaling, and write custom inference code, which adds operational complexity and cost for a task that Vertex AI batch prediction can handle natively with no infrastructure management.

14

MCQ · medium

You need to perform batch predictions on 10 TB of data stored in BigQuery using Vertex AI. The model requires some preprocessing that cannot be expressed in SQL. What is the most scalable approach?

A. Use a Cloud Function to preprocess each row and write to a new BigQuery table, then run batch prediction.

B. Use Dataflow to read from BigQuery, perform preprocessing, write results to GCS, then run Vertex AI batch prediction job with GCS source.

C. Use Vertex AI batch prediction with BigQuery source and include preprocessing logic in the model container.

D. Export BigQuery data to CSV, run a local Python script for preprocessing, then upload to GCS and start a batch prediction job.

Answer: B

Dataflow handles large-scale preprocessing and the pipeline integrates well with Vertex AI.

Why this answer

Option B is correct because Dataflow (Apache Beam) provides a fully managed, auto-scaling, serverless execution environment that can read from BigQuery, apply arbitrary Python/Java preprocessing logic (e.g., feature engineering, normalization) that cannot be expressed in SQL, and write the preprocessed results to Cloud Storage (GCS). Vertex AI batch prediction can then read from GCS as input, making this the most scalable approach for 10 TB of data without requiring custom model container changes or manual data movement.

Exam trap

The exam often tests the misconception that Cloud Functions can handle large-scale batch processing; Cloud Functions are designed for event-driven, short-lived tasks, not for processing terabytes of data in a batch pipeline.

How to eliminate wrong answers

Option A is wrong because Cloud Functions are built for short-lived, event-driven tasks with bounded execution time and memory (for example, a 9-minute timeout and up to 8 GB of memory on 1st gen functions), making them unsuitable for processing 10 TB of data row by row; they would require an impractical number of invocations and lack built-in parallelization for large-scale batch workloads. Option C is wrong because batch prediction with a BigQuery source passes raw rows straight to the model container and offers no separate preprocessing step before inference; the container would have to perform every transformation itself, which couples preprocessing to the model and violates separation of concerns. Option D is wrong because exporting 10 TB of data to CSV, running a local Python script (single machine, no distributed processing), and then uploading to GCS is not scalable: it creates a bottleneck at the local script, requires significant network transfer, and does not leverage managed services for parallel processing.

15

MCQ · hard

You are using Vertex AI Prediction with a custom container that requires a large model file (5 GB). Deployment takes 10 minutes to start. You want to reduce cold start latency. Which action would be MOST effective?

A. Compress the model file and decompress on startup.

B. Use a machine type with local SSD to speed up model loading.

C. Switch to batch prediction to avoid online cold start.

D. Set minReplicas to 1 to keep at least one instance always running.

Answer: D

Correct. By keeping an instance warm, you avoid cold start entirely for that instance.

Why this answer

Cold start latency occurs when a new instance is started and must load the 5 GB model from disk, which can take 10 minutes. Setting minReplicas to 1 ensures that at least one instance is always running and serving, so the model is already loaded in memory, completely avoiding cold start on prediction requests. While other options (compressing the model, using local SSD, or switching to batch prediction) may reduce load time or avoid online serving, they do not eliminate the cold start penalty as effectively as maintaining a warm replica.

16

MCQ · medium

A company runs batch predictions on Vertex AI every hour using a custom container. They want to reduce costs by minimizing idle time while ensuring the batch job completes within 10 minutes. Which endpoint configuration should they use?

A. Use Vertex AI online prediction with a load balancer in front to distribute requests.

B. Use Vertex AI batch prediction job with a custom service account and set machine_type to 'n1-standard-4' and batch_size to optimize throughput.

C. Create a Dataflow pipeline to read from BigQuery and write predictions to GCS, using the trained model as a side input.

D. Deploy the model to an endpoint with min_replicas=0 and max_replicas=10, then send batch requests to the endpoint.

Answer: B

Batch prediction jobs handle resource scaling automatically; choosing appropriate machine type and batch size ensures performance.

Why this answer

Option B is correct because Vertex AI batch prediction jobs are designed for high-throughput, asynchronous processing of large datasets without maintaining persistent infrastructure. By tuning `machine_type` and `batch_size`, you can minimize idle time and ensure the job completes within the 10-minute window, as the job only runs while actively processing and scales resources as needed.

Exam trap

The trap here is that candidates confuse online prediction with autoscaling (min_replicas=0) as a cost-saving measure for batch workloads, but online prediction endpoints still incur a minimum charge for the underlying infrastructure and are not optimized for asynchronous batch jobs.

How to eliminate wrong answers

Option A is wrong because online prediction with a load balancer is for real-time, low-latency serving, not batch jobs; it keeps endpoints running continuously, incurring costs even when idle, and does not address the requirement to minimize idle time. Option C is wrong because a Dataflow pipeline with a model as a side input is an alternative architecture for batch inference but is not a Vertex AI endpoint configuration; it introduces additional complexity and does not directly leverage Vertex AI's batch prediction job optimization. Option D is wrong because deploying a model with `min_replicas=0` and `max_replicas=10` is for online prediction with autoscaling; sending batch requests to an online endpoint still incurs per-request latency and scaling overhead, and the endpoint may remain provisioned for a period after requests, leading to idle costs.

17

MCQ · easy

You have a Vertex AI endpoint that serves a model for real-time predictions. You want to update the model to a new version with zero downtime. Which approach should you take?

A. Delete the endpoint and recreate it with the new model.

B. Deploy the new model version to the same endpoint and then set traffic to 100% for the new version.

C. Use Cloud Load Balancing to switch traffic between two endpoints.

D. Create a new endpoint and update the client application to point to the new endpoint.

Answer: B

This allows zero-downtime deployment; the old version remains available during transition.

Why this answer

Option B is correct because Vertex AI endpoints support canary deployments by allowing you to deploy a new model version to the same endpoint and then gradually shift traffic to it using the `traffic_split` parameter. Setting traffic to 100% for the new version after deployment ensures zero downtime, as the endpoint remains active and serves requests from the old version until the switch is complete.

Exam trap

The trap here is that candidates assume a new endpoint or load balancer is required for zero-downtime updates, but Vertex AI endpoints natively support traffic splitting between model versions on the same endpoint, making external components unnecessary.

How to eliminate wrong answers

Option A is wrong because deleting and recreating the endpoint causes downtime during the deletion and creation process, and the endpoint URL changes, requiring client updates. Option C is wrong because Cloud Load Balancing is an external traffic management layer that adds unnecessary complexity and latency; Vertex AI endpoints natively support traffic splitting without needing an external load balancer. Option D is wrong because creating a new endpoint changes the endpoint URL, which requires updating client applications, leading to potential downtime or misrouting during the transition.

18

Multi-Select · medium

You are deploying a model for real-time inference with strict latency requirements (<100ms P99). You want to autoscale based on custom metrics. Which TWO actions should you take? (Choose 2)

Select 2 answers

A. Use a regional endpoint to reduce network latency.

B. Configure the endpoint to use custom metrics from Cloud Monitoring.

C. Set a target value for the custom metric in the autoscaling policy.

D. Enable GPU acceleration for faster inference.

E. Set minReplicas to 0 to save cost.

Answers: B, C

Correct. Custom metrics can be used for autoscaling.

Why this answer

Option B is correct because Cloud Monitoring custom metrics allow you to define autoscaling based on signals that are directly relevant to your inference latency, such as request queue depth or model throughput. This enables the autoscaler to react to real-time demand more precisely than CPU or memory utilization alone, which is critical for meeting strict P99 latency targets. Option C is correct because the autoscaling policy needs a target value for that metric; the autoscaler adds or removes replicas to keep the observed value near the target.

Exam trap

The exam often tests the distinction between infrastructure-level optimizations (like regional endpoints or GPU acceleration) and autoscaling configuration actions, leading candidates to confuse network latency reduction with scaling metric selection.
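
How a target value turns a metric reading into a replica count can be shown with the proportional rule used by many target-based autoscalers (the Kubernetes Horizontal Pod Autoscaler, for example). Whether Vertex AI applies exactly this rule is an assumption here; the metric (queue depth per replica) and all numbers are hypothetical.

```python
import math


def target_tracking_replicas(current_replicas: int, metric_value: float,
                             target_value: float, min_replicas: int,
                             max_replicas: int) -> int:
    """Proportional rule: scale replicas by observed / target, then clamp."""
    raw = math.ceil(current_replicas * metric_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas, observed queue depth 30 per replica, target 20 per replica.
scale_up = target_tracking_replicas(4, 30, 20, min_replicas=1, max_replicas=10)

# Same reading, but the policy caps the endpoint at 5 replicas.
capped = target_tracking_replicas(4, 30, 20, min_replicas=1, max_replicas=5)

# Observed value below target: the rule recommends fewer replicas.
scale_down = target_tracking_replicas(4, 5, 20, min_replicas=1, max_replicas=10)

print(scale_up, capped, scale_down)  # 6 5 1
```

The choice of target is the tuning knob: a lower target adds replicas earlier and protects tail latency at higher cost.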

19

Multi-Select · medium

A company is deploying a complex model that requires GPU for inference. They want to use Vertex AI for serving. Which TWO steps are required to deploy the model with GPU support? (Choose 2)

Select 2 answers

A. Select a GPU-enabled machine type such as n1-standard-4 with 1 x NVIDIA Tesla T4.

B. Enable Vertex AI Model Optimization for automatic GPU compilation.

C. Deploy the model using a custom container that includes CUDA and cuDNN.

D. Increase the minimum replicas to at least 2 for GPU redundancy.

E. Use gRPC protocol for prediction requests to reduce latency.

Answers: A, C

GPU-enabled machine type is necessary for GPU inference.

Why this answer

Option A is correct because Vertex AI requires selecting a GPU-enabled machine type (e.g., n1-standard-4 with 1 x NVIDIA Tesla T4) when deploying a model for inference. This is done in the machine specification of the endpoint deployment, ensuring the GPU hardware is allocated for the serving container.

Exam trap

The exam often tests the misconception that GPU support is automatic or requires only a machine type selection, but the custom container with CUDA/cuDNN is equally mandatory to enable GPU acceleration.

20

MCQhard

You are deploying a PyTorch model on Vertex AI using a custom container with NVIDIA Triton Inference Server. The model is a large transformer that requires GPU. You want to optimize GPU utilization and reduce memory footprint. Which technique should you apply?

A.Enable dynamic batching in Triton.

B.Use CPU-only instances to avoid GPU memory issues.

C.Increase the number of GPU replicas.

D.Apply model quantization using TensorRT.

AnswerD

Quantization reduces model size and memory footprint, enabling better GPU utilization.

Why this answer

Option D is correct because model quantization using TensorRT reduces the precision of model weights (e.g., from FP32 to FP16 or INT8), which directly decreases GPU memory usage and can improve throughput by enabling faster arithmetic operations on compatible NVIDIA GPUs. This technique is specifically designed to optimize GPU utilization and memory footprint for large transformer models deployed with Triton Inference Server.

Exam trap

The exam often tests the distinction between throughput optimization techniques (like dynamic batching) and memory footprint reduction techniques (like quantization), leading candidates to mistakenly choose dynamic batching when the question specifically asks about reducing memory footprint.

How to eliminate wrong answers

Option A is wrong because dynamic batching improves throughput by grouping inference requests, but it does not reduce the memory footprint per model instance or optimize GPU utilization in terms of memory efficiency. Option B is wrong because CPU-only instances cannot run the large transformer model with acceptable latency or throughput, and the question explicitly requires GPU. Option C is wrong because increasing the number of GPU replicas scales horizontally, which increases total memory footprint and cost, rather than reducing memory footprint per replica or optimizing utilization of a single GPU.
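
The memory effect of lower precision can be estimated with simple arithmetic. The sketch below counts weight storage only (activations, caches, and runtime buffers are ignored), and the parameter count is a hypothetical example rather than a figure from the question.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gib(num_params, precision):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30

num_params = 1_000_000_000  # hypothetical 1B-parameter transformer
fp32_gib = weight_memory_gib(num_params, "fp32")
fp16_gib = weight_memory_gib(num_params, "fp16")
int8_gib = weight_memory_gib(num_params, "int8")
```

Under these assumptions FP16 halves the weight footprint and INT8 quarters it, whereas dynamic batching leaves the weight footprint unchanged.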

21

MCQhard

Your team is deploying a large recommendation model on Vertex AI endpoints using GPUs. You need to minimise latency while optimising cost. The model serves many similar requests from the same users within short time windows. Which additional service would best reduce latency and cost?

A.Switch to CPU-only instances to reduce cost.

B.Increase maxReplicas to handle the load without caching.

C.Set up a Cloud CDN in front of the endpoint.

D.Use Cloud Memorystore to cache prediction results.

AnswerD

Correct. Memorystore provides low-latency caching for prediction results, reducing repetitive model invocations.

Why this answer

Caching the results of identical prediction requests can reduce load on the model and improve latency. Cloud Memorystore (Redis) can be used to cache responses keyed by a hash of the request, and the serving code can check the cache before invoking the model.
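
The hash-then-check flow can be sketched as follows. This is a minimal illustration: a plain dictionary stands in for Redis, `fake_model` is a hypothetical stand-in for the real prediction call, and expiry (TTL) and invalidation are deliberately left out.

```python
import hashlib
import json

def cache_key(instance):
    """Hash a request into a stable key; sorted keys make field order irrelevant."""
    canonical = json.dumps(instance, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class CachedPredictor:
    def __init__(self, predict_fn, cache=None):
        self.predict_fn = predict_fn
        # A dict stands in for a Redis client in this sketch.
        self.cache = {} if cache is None else cache
        self.model_calls = 0

    def predict(self, instance):
        key = cache_key(instance)
        if key in self.cache:          # cache hit: skip the model entirely
            return self.cache[key]
        self.model_calls += 1          # cache miss: invoke the model and store
        result = self.predict_fn(instance)
        self.cache[key] = result
        return result

def fake_model(instance):
    # Hypothetical model: scores a request by its number of items.
    return {"score": len(instance["items"])}

predictor = CachedPredictor(fake_model)
first = predictor.predict({"user": "u1", "items": [1, 2, 3]})
second = predictor.predict({"items": [1, 2, 3], "user": "u1"})  # same request
```

The second call is served from the cache even though the fields arrive in a different order, so the model is invoked only once.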

22

MCQmedium

A company wants to cache predictions for identical requests to reduce latency and cost. They use Vertex AI Prediction with a custom container. Which GCP service should they use to implement prediction caching?

A.Cloud Bigtable

B.Cloud Memorystore for Redis

C.Cloud Storage

D.Cloud Firestore

AnswerB

Redis is an in-memory data store perfect for caching with low latency.

Why this answer

Cloud Memorystore (Redis) is ideal for caching because it provides low-latency key-value storage. The prediction request can be hashed to create a key.

23

Multi-Selectmedium

A company uses Vertex AI Matching Engine for real-time recommendations. They need to serve queries with low latency and support frequent updates. Which two configurations are appropriate? (Choose 2)

Select 2 answers

A.Store the index in Cloud Storage and query via Python

B.Enable streaming updates for the index

C.Use a brute-force index for exact results

D.Deploy the index to a Vertex AI Matching Engine endpoint

E.Use batch updates only

AnswersB, D

Streaming updates allow real-time insertion without downtime.

Why this answer

Vertex AI Matching Engine supports streaming updates, which allow real-time insertion, deletion, and modification of vectors without rebuilding the entire index. This is essential for use cases requiring frequent updates, such as real-time recommendation systems, because it maintains low latency for serving queries while keeping the index current.

Exam trap

The trap here is that candidates often confuse batch updates with streaming updates, assuming that batch updates can be made frequent enough to approximate real-time, but they fail to recognize that batch updates require full index rebuilds, which introduce significant latency and downtime for serving.

24

MCQhard

A data scientist wants to perform A/B testing between two model versions deployed on the same Vertex AI endpoint. They need to route 10% of traffic to the challenger model. Which approach should they use?

A.Use Vertex AI Experiments to compare models offline, then deploy the winner

B.Deploy the challenger model to a separate endpoint and use a load balancer to split traffic

C.Update the champion model with a new version and use model version aliases

D.Deploy both models to the same endpoint and set traffic_split to 90 for champion and 10 for challenger

AnswerD

Vertex AI traffic_split directly routes the specified percentage of requests to each model.

Why this answer

Vertex AI endpoints support traffic splitting directly, allowing you to route a percentage of requests to different model versions deployed on the same endpoint. By setting `traffic_split` to 90 for the champion and 10 for the challenger, the data scientist can perform online A/B testing without additional infrastructure. This is the simplest and most cost-effective approach, as it avoids managing separate endpoints or load balancers.

Exam trap

The exam often tests the misconception that separate endpoints or load balancers are required for A/B testing, when in fact Vertex AI endpoints provide built-in traffic splitting for this exact purpose.

How to eliminate wrong answers

Option A is wrong because Vertex AI Experiments is an offline evaluation tool for comparing model performance on historical data, not for live traffic splitting. Option B is wrong because deploying the challenger to a separate endpoint and using a load balancer adds unnecessary complexity and cost; Vertex AI endpoints natively support traffic splitting across model versions. Option C is wrong because updating the champion model with a new version and using model version aliases does not provide granular traffic splitting; aliases are for version management, not for routing a specific percentage of live traffic.
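
What a 90/10 split means for individual requests can be illustrated with a small routing sketch. How Vertex AI routes requests internally is not described here; the functions, the integer "bucket" input, and the model labels below are assumptions made purely for illustration.

```python
def validate_traffic_split(split):
    """A split maps a model label to an integer percentage summing to 100."""
    if any(pct < 0 for pct in split.values()):
        raise ValueError("percentages must be non-negative")
    if sum(split.values()) != 100:
        raise ValueError("percentages must sum to 100")

def route(split, bucket):
    """Pick a model for a request that falls in bucket 0..99."""
    validate_traffic_split(split)
    if not 0 <= bucket < 100:
        raise ValueError("bucket must be in [0, 100)")
    cumulative = 0
    for model_id, pct in split.items():
        cumulative += pct
        if bucket < cumulative:
            return model_id
    raise AssertionError("unreachable for a valid split")

split = {"champion": 90, "challenger": 10}
counts = {"champion": 0, "challenger": 0}
for bucket in range(100):  # real traffic would draw the bucket at random
    counts[route(split, bucket)] += 1
```

Sweeping every bucket once sends 90 requests to the champion and 10 to the challenger; with random buckets the proportions hold only on average.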

25

MCQeasy

Which Vertex AI service is designed for building and managing approximate nearest neighbor (ANN) indexes for similarity search at scale?

A.Vertex AI AutoML

B.Vertex AI Workbench

C.Vertex AI Prediction

D.Vertex AI Matching Engine (Vector Search)

AnswerD

Matching Engine (Vector Search) is for ANN similarity search on embeddings.

Why this answer

Vertex AI Matching Engine (now Vector Search) provides ANN indexes for similarity search, enabling fast vector similarity queries at scale.
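
For contrast, the sketch below shows exact (brute-force) nearest-neighbor search, which compares the query against every stored vector. An ANN index trades a little accuracy to avoid that full scan. The embeddings and item names are invented for the example, and nothing here reflects the service's internal implementation.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k):
    """Score every vector against the query and return the k best item ids."""
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Hypothetical two-dimensional embeddings.
embeddings = {
    "shoes": [1.0, 0.0],
    "boots": [0.9, 0.1],
    "laptop": [0.0, 1.0],
}
top2 = exact_top_k([1.0, 0.05], embeddings, 2)
```

The cost of this scan grows linearly with the number of vectors, which is what makes approximate indexes attractive at scale.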

26

Multi-Selectmedium

A company needs to reduce inference latency for their online prediction service on Vertex AI. Which two actions would help? (Choose 2)

Select 2 answers

A.Increase the maximum number of replicas

B.Deploy the model on a GPU-enabled machine

C.Enable model quantization via Vertex AI Model Optimization

D.Use a smaller machine type with less memory

E.Enable autoscaling with a lower target CPU utilization

AnswersB, C

GPUs accelerate compute-heavy models, reducing latency.

Why this answer

Option B is correct because deploying the model on a GPU-enabled machine significantly accelerates matrix operations and parallel computations inherent in deep learning inference, directly reducing per-request latency. Option C is correct because model quantization reduces the precision of model weights (e.g., from FP32 to INT8), which decreases memory footprint and speeds up computation, especially on compatible hardware like TPUs or GPUs.

Exam trap

The exam often tests the distinction between scaling for throughput (replicas, autoscaling) versus reducing per-request latency (hardware acceleration, model optimization), leading candidates to confuse horizontal scaling with performance optimization.

27

MCQmedium

A data science team has trained a custom TensorFlow model for real-time fraud detection. They need to deploy it on Vertex AI with minimal latency and support for multiple concurrent requests. The model requires a GPU for inference. Which machine type should they choose for the Vertex AI endpoint?

A.n2-standard-8

B.n1-standard-4 with NVIDIA T4

C.e2-standard-4

D.n1-highmem-4

AnswerB

n1-standard machines support GPU attachment, such as T4, which is suitable for inference.

Why this answer

Option B is correct because the n1-standard-4 with NVIDIA T4 provides the GPU acceleration required for real-time inference, while the n1 machine family supports GPU attachments on Vertex AI. The T4 GPU is optimized for low-latency inference workloads, and the n1-standard-4 offers sufficient CPU and memory for serving a custom TensorFlow model with multiple concurrent requests.

Exam trap

The exam often tests the misconception that any machine type can be used with a GPU on Vertex AI, but only specific families (n1, a2, g2) support GPU attachments, and the question explicitly requires a GPU for inference.

How to eliminate wrong answers

Option A is wrong because n2-standard-8 is a general-purpose machine type that does not support GPU attachments on Vertex AI; it lacks the necessary GPU capability for inference. Option C is wrong because e2-standard-4 is a cost-optimized machine type that does not support GPU attachments, making it unsuitable for GPU-required inference. Option D is wrong because n1-highmem-4, while part of the n1 family that can attach GPUs, is optimized for memory-intensive workloads rather than balanced compute and GPU inference, and it does not include a GPU by default; the question specifies a GPU is required, so the machine type must explicitly include or support a GPU like the T4.

28

Multi-Selecthard

A company is migrating from an on-premises ML serving infrastructure to Vertex AI. They have multiple models that need to be served from the same endpoint with different traffic percentages. They also need to monitor prediction quality. Which THREE actions should they take? (Choose 3)

Select 3 answers

A.Deploy multiple model versions on the same endpoint with traffic_split parameter.

B.Deploy each model as a separate endpoint and use Cloud Load Balancing.

C.Enable Vertex AI Model Monitoring to detect prediction drift.

D.Use Cloud Monitoring to create custom metrics based on business outcomes.

E.Export logs to BigQuery for manual analysis only.

AnswersA, C, D

Vertex AI supports this natively for A/B testing.

Why this answer

Option A is correct because Vertex AI endpoints support deploying multiple model versions and using the `traffic_split` parameter to distribute traffic percentages among them. This allows the company to serve different models from a single endpoint while controlling the proportion of requests each model receives, meeting the requirement for a unified serving infrastructure.

Exam trap

The trap here is that candidates may think separate endpoints with a load balancer (Option B) are required for traffic distribution, overlooking Vertex AI's built-in traffic splitting on a single endpoint, which is simpler and more aligned with the platform's design.

29

MCQmedium

An engineer deploys a model to a Vertex AI endpoint with minReplicas=1 and maxReplicas=3. The endpoint receives a sudden traffic spike, but it does not scale up beyond 1 replica. The CPU utilization target is 60%. What is the most likely cause?

A.The model is not deployed correctly.

B.The endpoint is configured with the wrong machine type.

C.The CPU utilization is below the target threshold, so the autoscaler does not add replicas.

D.The endpoint is using GPU which cannot autoscale.

AnswerC

If CPU utilization is below 60%, the autoscaler sees no need to scale up.

Why this answer

Option C is correct because Vertex AI's autoscaler uses CPU utilization as a metric to decide when to add replicas. If the CPU utilization remains below the 60% target threshold, the autoscaler will not trigger scale-up, even during a traffic spike. The endpoint is configured with minReplicas=1 and maxReplicas=3, but without exceeding the target, it stays at the minimum.

Exam trap

The trap here is that candidates assume any traffic spike automatically triggers scaling, but Vertex AI's autoscaler only scales based on the configured metric (CPU utilization), not request volume directly.

How to eliminate wrong answers

Option A is wrong because the model being deployed correctly is unrelated to autoscaling behavior; a misdeployment would typically cause prediction failures or errors, not a failure to scale. Option B is wrong because the machine type affects performance and cost, but does not directly prevent the autoscaler from adding replicas when CPU utilization exceeds the target. Option D is wrong because GPU-enabled endpoints can autoscale; Vertex AI supports autoscaling for GPU instances, though GPU metrics may require custom configuration.
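
The scaling decision can be illustrated with a generic target-tracking rule of the kind many autoscalers use. Treating this as the exact Vertex AI algorithm would be an assumption; the function and the utilization figures are hypothetical.

```python
import math

def desired_replicas(current, cpu_utilization, target, min_replicas, max_replicas):
    """Scale replicas in proportion to utilization/target, then clamp."""
    raw = math.ceil(current * cpu_utilization / target)
    return max(min_replicas, min(max_replicas, raw))

# Traffic spike on an I/O-bound model: CPU stays at 40%, below the 60% target.
below_target = desired_replicas(1, 40, 60, min_replicas=1, max_replicas=3)

# Same endpoint when the spike does drive CPU to 90%.
above_target = desired_replicas(1, 90, 60, min_replicas=1, max_replicas=3)

# The result never exceeds maxReplicas, however high utilization goes.
capped = desired_replicas(2, 95, 60, min_replicas=1, max_replicas=3)
```

With CPU below the target the rule keeps a single replica regardless of request volume, which matches the behavior described in the question.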

30

MCQeasy

A machine learning engineer needs to run batch predictions on 50 TB of data stored in BigQuery using a Vertex AI model. The model is a custom container. What is the most efficient way to set up the batch prediction job?

A.Create a Vertex AI batch prediction job with BigQuery source and BigQuery destination.

B.Use Dataflow to process the data and call the model via Vertex AI online prediction.

C.Export BigQuery data to CSV in GCS, then create a batch prediction job with GCS source.

D.Create a Cloud Function to iterate over BigQuery rows and call the endpoint.

AnswerA

Vertex AI batch prediction supports BigQuery directly for input and output.

Why this answer

Vertex AI batch prediction supports BigQuery as both the input source and the output destination, which is the most direct approach. Dataflow preprocessing is optional and only needed if the data must be transformed first.

31

Multi-Selecthard

You are designing a batch prediction pipeline using Vertex AI. The input data is 100 TB of images stored in Cloud Storage. The model is a custom TensorFlow model that expects TFRecord format. The pipeline must be cost-effective and run within a time window of 2 hours. Which THREE steps should you include?

Select 3 answers

A.Store batch prediction results in BigQuery.

B.Create a Vertex AI batch prediction job with input from GCS (TFRecord files).

C.Use Dataflow to read images and write TFRecord files to GCS.

D.Store batch prediction results in GCS.

E.Use Cloud Functions to convert images to TFRecord.

AnswersB, C, D

Batch prediction supports GCS input.

Why this answer

Option B is correct because Vertex AI batch prediction jobs natively accept TFRecord files stored in Cloud Storage as input, which aligns with the requirement for a custom TensorFlow model. This approach is cost-effective and can complete within 2 hours by leveraging Vertex AI's managed infrastructure, avoiding the need to spin up and manage compute resources manually.

Exam trap

The exam often tests the misconception that Cloud Functions can handle large-scale data processing tasks. Cloud Functions have strict timeout and memory limits, making them unsuitable for converting 100 TB of images to TFRecord format.
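
A quick back-of-the-envelope calculation shows why the conversion step needs a massively parallel service. The sketch assumes decimal units (1 TB = 1,000,000 MB), that the full 2-hour window is spent on a single pass over the data, and a hypothetical sustained rate of 50 MB/s per worker.

```python
import math

def required_throughput_gb_per_s(total_tb, window_hours):
    """Aggregate throughput needed to read total_tb within the window."""
    return total_tb * 1000 / (window_hours * 3600)

def workers_needed(total_tb, window_hours, per_worker_mb_per_s):
    """Minimum number of parallel workers at a given per-worker rate."""
    total_mb = total_tb * 1_000_000
    seconds = window_hours * 3600
    return math.ceil(total_mb / (seconds * per_worker_mb_per_s))

throughput = required_throughput_gb_per_s(100, 2)          # about 13.9 GB/s
workers = workers_needed(100, 2, per_worker_mb_per_s=50)   # hypothetical rate
```

At roughly 14 GB/s aggregate, the job needs hundreds of workers running in parallel under these assumptions, which fits a horizontally scaled pipeline rather than individual short-lived functions.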

32

MCQmedium

A data science team needs to serve multiple versions of the same ML model on Vertex AI Endpoints for A/B testing. They want to gradually shift traffic from the current 'champion' model to a new 'challenger' model. Which feature should they use?

A.Deploy the challenger to a separate endpoint and use a proxy to split traffic.

B.Use Cloud Load Balancing with weighted backend services.

C.Deploy both models to the same endpoint and use traffic splitting.

D.Use Vertex AI Experiments to manage model versions.

AnswerC

Vertex AI endpoints allow deploying multiple model versions and assigning traffic percentages to each, enabling gradual rollouts and A/B testing.

Why this answer

Vertex AI Endpoints natively support traffic splitting, allowing you to deploy multiple model versions (e.g., champion and challenger) to the same endpoint and assign a percentage of traffic to each. This enables gradual A/B testing without additional infrastructure, as the endpoint automatically routes requests based on the configured split. Option C is correct because it leverages this built-in feature, which is designed specifically for this use case.

Exam trap

The exam often tests the misconception that traffic splitting requires external load balancers or proxies, when in fact Vertex AI Endpoints provide this capability natively, and candidates may overlook the built-in feature in favor of more complex architectures.

How to eliminate wrong answers

Option A is wrong because deploying the challenger to a separate endpoint and using a proxy adds unnecessary complexity, latency, and management overhead; Vertex AI Endpoints already provide traffic splitting without external proxies. Option B is wrong because Cloud Load Balancing distributes traffic across backend services and has no knowledge of the models deployed on a Vertex AI Endpoint, so it cannot split traffic between model versions on the same endpoint; it would require separate endpoints and does not integrate with Vertex AI's model versioning. Option D is wrong because Vertex AI Experiments is a tool for tracking and comparing model training runs and hyperparameters, not for serving or routing live traffic; it has no mechanism to split traffic between deployed models.
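
A gradual shift can be planned as a sequence of traffic splits applied one step at a time. The helper below is a hypothetical planning aid, not a Vertex AI call, and the "champion" and "challenger" labels are placeholders for whatever identifiers the endpoint uses for its deployed models.

```python
def rollout_schedule(challenger_steps, champion="champion", challenger="challenger"):
    """Build one traffic split per step; challenger share must never decrease."""
    schedule = []
    previous = 0
    for pct in challenger_steps:
        if not 0 <= pct <= 100:
            raise ValueError("percentages must be between 0 and 100")
        if pct < previous:
            raise ValueError("challenger share must not decrease")
        schedule.append({champion: 100 - pct, challenger: pct})
        previous = pct
    return schedule

# Hypothetical rollout plan: 10% -> 25% -> 50% -> 100%.
steps = rollout_schedule([10, 25, 50, 100])
```

Each dictionary in `steps` sums to 100, so every stage is a valid split; in practice the team would check metrics between stages before applying the next one.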

33

Multi-Selecthard

A retail company deploys a new recommendation model alongside the current champion on Vertex AI Endpoints. They want to gradually shift traffic to the challenger while monitoring business metrics (conversion rate). Which two steps are required? (Choose 2)

Select 2 answers

A.Use Vertex AI Experiments to track the traffic split percentages.

B.Enable Cloud Memorystore to cache identical requests for both models.

C.Deploy the challenger model to the same endpoint as the champion with a separate deployed model.

D.Configure traffic split in the endpoint's traffic_split field (e.g., champion:90, challenger:10).

E.Use Cloud Monitoring to track custom metrics like conversion rate per model version.

AnswersC, D

Multiple models can be deployed on one endpoint with traffic allocation.

Why this answer

Option C is correct because Vertex AI Endpoints support deploying multiple model versions (champion and challenger) to the same endpoint, each as a separate deployed model. This allows the endpoint to serve both models simultaneously, enabling traffic splitting without requiring separate endpoints or infrastructure.

Exam trap

The exam often tests the distinction between monitoring (which is optional after deployment) and the actual configuration steps required to shift traffic; candidates mistakenly select Cloud Monitoring (Option E) as a required step, but the question specifically asks for steps to 'gradually shift traffic,' which is accomplished by deploying to the same endpoint and setting the traffic split, not by monitoring after the fact.

34

MCQeasy

What is the primary purpose of Vertex AI Edge Manager?

A.To run batch predictions on edge devices

B.To deploy and manage ML models on edge devices at scale

C.To convert models to TensorFlow Lite automatically

D.To train models on edge devices using federated learning

AnswerB

Correct: Edge Manager handles model deployment, monitoring, and lifecycle on edge devices.

Why this answer

Vertex AI Edge Manager is specifically designed to deploy, monitor, and manage ML models on edge devices at scale. It handles model packaging, over-the-air updates, and health monitoring across fleets of edge devices, which is distinct from simply running batch predictions or converting model formats.

Exam trap

The exam often tests the distinction between 'managing models at scale' (deployment, updates, monitoring) and 'running inference' or 'converting formats'; candidates confuse the operational management role with the execution or preprocessing steps.

How to eliminate wrong answers

Option A is wrong because batch predictions on edge devices are a use case, not the primary purpose; Vertex AI Edge Manager focuses on lifecycle management (deployment, updates, monitoring) rather than just executing predictions. Option C is wrong because model conversion to TensorFlow Lite is handled by tools like the TensorFlow Lite Converter or Vertex AI's model optimization services, not by Edge Manager itself. Option D is wrong because training on edge devices using federated learning is a separate paradigm (e.g., TensorFlow Federated) and is not a core function of Vertex AI Edge Manager, which manages already-trained models.

35

MCQeasy

Which Vertex AI feature allows you to reduce the size of a trained model to improve inference speed on edge devices without significant accuracy loss?

A.Vertex AI Model Optimization

B.Vertex AI Model Monitoring

C.Vertex AI Matching Engine

D.Vertex AI Continuous Training

AnswerA

Correct: Model Optimization offers quantization and compilation to reduce model size and speed up inference.

Why this answer

Vertex AI Model Optimization is the correct feature because it provides model quantization, pruning, and distillation techniques specifically designed to reduce model size and improve inference latency on edge devices. This service applies post-training quantization (e.g., FP32 to INT8) and structured weight pruning to shrink the model footprint while maintaining accuracy within acceptable thresholds, directly addressing the need for efficient deployment on resource-constrained hardware.

Exam trap

The exam often tests the distinction between 'optimization' (size/speed improvements) and 'monitoring' (observability), leading candidates to confuse Model Monitoring with performance tuning because both involve 'model performance' terminology.

How to eliminate wrong answers

Option B is wrong because Vertex AI Model Monitoring is used for detecting prediction drift, data skew, and feature attribution changes in deployed models, not for reducing model size or optimizing inference speed. Option C is wrong because Vertex AI Matching Engine is a vector similarity search service for large-scale embedding-based retrieval (e.g., recommendation systems), not a model compression or optimization tool. Option D is wrong because Vertex AI Continuous Training automates retraining pipelines based on new data or schedules, but it does not perform model size reduction or inference optimization.
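
Post-training quantization from FP32 to INT8 can be illustrated with a generic symmetric, per-tensor scheme. This is a textbook sketch, not the procedure any particular Vertex AI feature applies, and the weights are made-up values.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of four, and the rounding error per weight is bounded by half the scale factor, which is the sense in which accuracy loss stays small.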

36

Multi-Selectmedium

You are deploying a model on Vertex AI and need to ensure high availability and low latency. Which THREE configurations should you implement?

Select 3 answers

A.Choose a machine type with GPUs for compute-intensive models.

B.Enable logging and monitoring for the endpoint.

C.Use a custom endpoint with a static IP address.

D.Deploy with min_replicas=2 and max_replicas=10 across multiple zones.

E.Deploy to a single zone to reduce network latency.

AnswersA, B, D

GPUs reduce inference latency for deep learning models.

Why this answer

Option A is correct because GPUs are essential for compute-intensive models, such as deep neural networks, as they provide parallel processing capabilities that significantly reduce inference latency compared to CPUs. On Vertex AI, selecting a machine type with GPUs (e.g., n1-standard-4 with NVIDIA T4) ensures that the model can handle high-throughput requests with low latency, which is critical for real-time serving.

Exam trap

The trap here is that candidates often confuse static IP addresses with reliability, not realizing that Vertex AI endpoints already provide a stable DNS name with built-in load balancing, and that single-zone deployments are a common anti-pattern for high availability.

37

MCQeasy

You want to deploy a TensorFlow model to a Vertex AI endpoint and enable online predictions. The model requires GPU for inference. Which machine type should you select when deploying the model?

A.n1-standard-4

B.e2-standard-4

C.a2-highgpu-1g (with A100 GPU)

D.n1-highmem-8

AnswerC

Correct. This is a GPU-enabled machine type suitable for model inference.

Why this answer

Option C is correct because the a2-highgpu-1g machine type is specifically designed for GPU-accelerated workloads on Vertex AI, featuring an NVIDIA A100 GPU that meets the inference requirements of a TensorFlow model. Vertex AI online prediction endpoints require a machine type that supports GPU attachment, and the A2 series is the only option among the choices that provides a dedicated GPU for inference.

Exam trap

The trap here is that candidates often assume any standard machine type (like n1 or e2) can be used with a GPU by simply attaching one later, but the question specifically asks for the machine type to select, and only the A2 series provides an integrated GPU option for Vertex AI endpoints.

How to eliminate wrong answers

Option A is wrong because n1-standard-4 is a general-purpose machine type that does not include a GPU by default; while it can be attached with a GPU via a separate configuration, the question asks for the machine type to select, and n1-standard-4 alone lacks the GPU required for inference. Option B is wrong because e2-standard-4 is a cost-optimized machine type that does not support GPU attachment at all, making it unsuitable for GPU-dependent model inference. Option D is wrong because n1-highmem-8 is a memory-optimized machine type without a built-in GPU; although it can be paired with a GPU, the machine type itself does not provide the GPU, and the question expects a machine type that inherently includes GPU capability.

38

Multi-Selecthard

An organization is deploying a mission-critical model on Vertex AI Endpoints. They need to ensure high availability and meet a strict SLO of 99.9% uptime. Which THREE steps should they take? (Choose 3)

Select 3 answers

A.Use Cloud CDN to cache responses.

B.Set minReplicas to at least 2 to ensure redundancy within a region.

C.Use a single large instance instead of multiple small ones.

D.Deploy the endpoint in multiple regions.

E.Configure health checks to detect and replace unhealthy instances.

AnswersB, D, E

Multiple replicas in a region protect against instance failures.

Why this answer

To meet a 99.9% SLO, they should deploy across multiple regions for redundancy, set minimum replicas to ensure baseline capacity, and configure health checks to route traffic away from unhealthy instances.
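
The 99.9% figure translates into a concrete downtime budget, and redundancy can be reasoned about with a simple probability model. The sketch assumes a 30-day month and, importantly, that failures of replicas or regions are independent, which real systems only approximate. The 99% single-copy availability is a hypothetical input.

```python
def downtime_budget_minutes(slo, days=30):
    """Minutes of allowed downtime over the period for a given SLO."""
    return (1 - slo) * days * 24 * 60

def combined_availability(single, copies):
    """Availability when at least one of several independent copies is up."""
    return 1 - (1 - single) ** copies

budget = downtime_budget_minutes(0.999)       # about 43.2 minutes per 30 days
one_copy = combined_availability(0.99, 1)     # hypothetical single-copy figure
two_copies = combined_availability(0.99, 2)   # redundancy raises it to ~99.99%
```

Under the independence assumption, a second copy turns a deployment that would miss the SLO into one that clears it.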

39

Multi-Selecteasy

You need to deploy a model for online predictions with low latency. You want to ensure that the endpoint can handle traffic bursts without cold start. Which TWO configurations should you set? (Choose 2)

Select 2 answers

A.Set maxReplicas to a high number to handle bursts.

B.Set minReplicas to 1.

C.Deploy the model as a custom container.

D.Enable autoscaling with a target CPU utilisation of 30%.

E.Use a machine type with sufficient memory for the model.

AnswersB, E

Correct. This ensures at least one instance is always running, avoiding cold start.

Why this answer

To avoid cold start, keep at least one replica running at all times (minReplicas ≥ 1) and choose a machine type with enough memory for the model. Setting minReplicas to 0 would cause cold starts. Changing the CPU utilisation target alters when replicas are added, but it does not by itself prevent cold start.

40

MCQhard

A company is using Vertex AI Prediction with a custom container that performs preprocessing before inference. The preprocessing step is CPU-intensive and the inference step uses a GPU. They want to minimize prediction latency while optimizing cost. Which architecture should they use?

A.Use Cloud Run for preprocessing and send HTTP requests to a GPU-backed Vertex AI endpoint for inference.

B.Use two separate Vertex AI endpoints: one CPU-based for preprocessing, one GPU-based for inference, and chain them with Cloud Tasks.

C.Use Dataflow for preprocessing and then invoke the model, but Dataflow is not designed for real-time prediction.

D.Use a single GPU machine (e.g., n1-standard-4 with T4) and perform both preprocessing and inference on the same instance.

AnswerD

This minimizes latency by keeping all processing local, and you can choose a machine with sufficient CPU cores.

Why this answer

Splitting the work across a CPU-only node for preprocessing and a GPU node for inference separates concerns and allows independent scaling, but it adds a network round trip to every request. Running both steps on a single GPU machine avoids that round trip; choose a machine type with enough CPU cores for the preprocessing step.

41

MCQmedium

A company is deploying a new model version to an existing Vertex AI endpoint. They want to test the new version with 5% of traffic before fully rolling it out. What is the correct approach?

A.Create a new endpoint for the new version and update the client to call both endpoints.

B.Deploy the new version and set the minimum replicas to 0, then gradually increase.

C.Use Cloud Load Balancing to distribute traffic between two endpoints.

D.Deploy the new version as a separate model on the same endpoint and use the `traffic_split` parameter in the deployment request.

AnswerD

Correct: Vertex AI allows multiple deployed models on one endpoint with traffic split percentages.

Why this answer

Option D is correct because Vertex AI endpoints support traffic splitting between multiple deployed models. By deploying the new model version to the same endpoint and setting `traffic_split` to 5% for the new version and 95% for the existing version, the endpoint automatically routes a corresponding proportion of inference requests to each model without any client-side changes.

Exam trap

The trap here is that candidates may confuse traffic splitting with scaling or load balancing, assuming that adjusting replicas or using an external load balancer is required, when Vertex AI's native `traffic_split` is the simplest and correct method for canary deployments.

How to eliminate wrong answers

Option A is wrong because creating a new endpoint and updating clients to call both endpoints introduces unnecessary complexity, latency, and risk of client misconfiguration; Vertex AI endpoints natively support traffic splitting, making this approach redundant. Option B is wrong because setting minimum replicas to 0 does not control traffic distribution; it only affects autoscaling behavior, and gradually increasing replicas does not route a specific percentage of traffic to the new version. Option C is wrong because Cloud Load Balancing has no knowledge of Vertex AI model versions; splitting traffic with it would require two separate endpoints plus additional proxy logic, which defeats the purpose of Vertex AI's built-in traffic management.

42

Multi-Selecthard

Your team is using Vertex AI Prediction for a large-scale NLP model (PyTorch, custom ops). The model currently runs on CPU but you want to optimise inference cost and performance. Which THREE approaches should you consider? (Choose 3)

Select 3 answers

A.Deploy the model with a GPU machine type and use TensorRT optimisation.

B.Use Vertex AI Model Optimisation to automatically quantise and compile the model.

C.Integrate the model with NVIDIA Triton Inference Server for dynamic batching and model ensembles.

D.Convert the model to TensorFlow Lite and deploy on Vertex AI endpoint.

E.Switch to batch prediction to reduce cost.

AnswersA, B, C

Correct. GPU and TensorRT can improve throughput and latency.

Why this answer

Option A is correct because deploying the model with a GPU machine type (e.g., NVIDIA A100 or T4) and using TensorRT optimization can significantly accelerate inference for PyTorch models with custom ops. TensorRT performs layer fusion, precision calibration (FP16/INT8), and kernel auto-tuning, which reduces latency and improves throughput on GPU hardware. This directly addresses the goal of optimizing both cost and performance for large-scale NLP models.

Exam trap

The exam often tests the misconception that converting to a lighter framework (like TensorFlow Lite) or switching to batch prediction is a universal optimization, ignoring that custom ops and real-time latency requirements make those approaches invalid for this scenario.

43

Multi-Selectmedium

An organization wants to deploy a model on edge devices (e.g., Android phones) for offline inference. They trained a model using TensorFlow. Which THREE steps should they take to prepare and deploy the model?

Select 3 answers

A.Convert the model to TensorFlow Lite format.

B.Deploy the model to a Vertex AI endpoint for online inference.

C.Use Vertex AI Edge Manager to package and deploy the model.

D.Export the model to ONNX format.

E.Deploy the TFLite model to the edge devices.

AnswersA, C, E

TFLite is optimized for mobile and edge devices.

Why this answer

For edge deployment on Android, you need to convert the model to TensorFlow Lite, use Vertex AI Edge Manager to manage the deployment, and then deploy the TFLite model to the devices. Exporting to ONNX targets other runtimes, and deploying to a Vertex AI endpoint is for online serving, which does not work offline.

44

MCQeasy

You deploy a new version of a model to a Vertex AI endpoint and want to gradually shift traffic from the old version to the new version over 24 hours. The endpoint currently serves 100% traffic to the old version. What should you do?

A.Use Vertex AI Experiments to run an A/B test between the two versions.

B.Deploy the new version to a separate endpoint and update your client to use the new endpoint for a percentage of requests.

C.Update the endpoint to split traffic between the two model versions using the traffic split configuration.

D.Delete the old version and redeploy the new version with a different endpoint name, then update DNS.

AnswerC

Vertex AI traffic split allows gradual shifting of traffic between model versions on the same endpoint.

Why this answer

Option C is correct because Vertex AI endpoints support a built-in traffic split configuration that allows you to gradually shift traffic between model versions deployed to the same endpoint. By updating the endpoint's traffic split percentages (e.g., from 100% old / 0% new to 0% old / 100% new over 24 hours), you can achieve a smooth, controlled rollout without changing client code or managing multiple endpoints.

Exam trap

The exam often tests the misconception that traffic splitting requires separate endpoints or client-side logic, when in fact Vertex AI provides a native traffic split configuration on a single endpoint.

How to eliminate wrong answers

Option A is wrong because Vertex AI Experiments is designed for tracking and comparing model training runs, not for managing production traffic splits or A/B testing at the serving layer. Option B is wrong because deploying to a separate endpoint and updating the client to split requests manually introduces unnecessary complexity, client-side changes, and potential inconsistency; Vertex AI's traffic split feature handles this natively at the server side. Option D is wrong because deleting the old version and redeploying with a different endpoint name, then updating DNS, would cause a complete traffic cutover (not gradual) and disrupt service during the DNS propagation period, which can take minutes to hours.
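
A gradual 24-hour shift amounts to applying a sequence of traffic split updates on a timer. The sketch below only computes such a schedule; `ramp_schedule` is a hypothetical helper, the linear ramp is one possible choice among many, and applying each step to the endpoint is left to whatever scheduler and SDK call you use.

```python
def ramp_schedule(total_hours, steps):
    """Return (hour, old_pct, new_pct) tuples for a linear traffic ramp.

    The new version receives an equal increment at each step and reaches
    100% at `total_hours`.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    schedule = []
    for i in range(1, steps + 1):
        new_pct = round(100 * i / steps)
        hour = total_hours * i / steps
        schedule.append((hour, 100 - new_pct, new_pct))
    return schedule


# Four equal steps over 24 hours.
plan = ramp_schedule(total_hours=24, steps=4)
```

Each tuple corresponds to one update of the endpoint's traffic split; the old and new percentages in a tuple always sum to 100.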

45

MCQmedium

A company uses Vertex AI Vector Search for similarity search. They have a dataset of 10 million 512-dimensional vectors. Which index type should they choose for lowest latency at high recall?

A.Brute-force (flat) index

B.Approximate nearest neighbor (ANN) index with ScaNN

C.Tree-based index

D.Hashing-based index

AnswerB

ANN is designed for large-scale, low-latency search with high recall.

Why this answer

For a dataset of 10 million 512-dimensional vectors, a brute-force (flat) index would be far too slow for low-latency queries. Approximate Nearest Neighbor (ANN) with ScaNN (Scalable Nearest Neighbors) is specifically designed by Google for high-dimensional vector search, offering sub-linear query time while maintaining high recall through techniques like anisotropic quantization and tree-based partitioning. This makes it the optimal choice for balancing latency and recall at this scale.

Exam trap

The trap here is that candidates often assume brute-force is the only way to guarantee high recall, but the question explicitly asks for lowest latency at high recall, which is the exact trade-off that ANN indexes like ScaNN are designed to optimize.

How to eliminate wrong answers

Option A is wrong because a brute-force (flat) index computes exact distances to every vector, resulting in O(N) complexity per query, which is prohibitively slow for 10 million vectors and cannot achieve low latency. Option C is wrong because tree-based indexes (e.g., KD-trees, R-trees) suffer from the curse of dimensionality in high-dimensional spaces (512-D), where their performance degrades to near brute-force due to the sparsity of data. Option D is wrong because hashing-based indexes (e.g., LSH) typically require multiple hash tables to achieve high recall, leading to high memory usage and often lower recall compared to optimized ANN methods like ScaNN, especially for 512-dimensional vectors.
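
The latency argument can be made concrete with a toy comparison of how many distance computations each approach performs. This is a simplified partition-then-search illustration in pure Python, not ScaNN itself (it omits quantization entirely); the data, the centroids, and all function names are made up for the example.

```python
import math


def brute_force(query, vectors):
    """Exact search: score every vector. Returns (best_index, vectors_scored)."""
    best = min(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return best, len(vectors)


def build_partitions(vectors, centroids):
    """Assign each vector index to its nearest centroid."""
    parts = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        parts[nearest].append(i)
    return parts


def partitioned_search(query, vectors, centroids, parts, probes=1):
    """Approximate search: score centroids, then only the closest partitions."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in parts[c]]
    best = min(candidates, key=lambda i: math.dist(query, vectors[i]))
    return best, len(centroids) + len(candidates)


# Four well-separated clusters of 25 points each (100 vectors in total).
CENTROIDS = [(0, 0), (10, 0), (0, 10), (10, 10)]
VECTORS = [(cx + dx * 0.1, cy + dy * 0.1)
           for cx, cy in CENTROIDS for dx in range(5) for dy in range(5)]
PARTS = build_partitions(VECTORS, CENTROIDS)

query = (10.21, 10.29)
exact = brute_force(query, VECTORS)
approx = partitioned_search(query, VECTORS, CENTROIDS, PARTS, probes=1)
```

Here the partitioned search scores 4 centroids plus 25 candidates instead of all 100 vectors and still finds the same neighbor. With real data the clusters are not this clean, which is why recall can drop and why the number of probed partitions is a tuning knob.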

46

MCQmedium

An ML engineer needs to update a model deployed on a Vertex AI endpoint without downtime. They want to gradually shift traffic to the new version while monitoring for errors. What is the correct procedure?

A.Use a canary deployment by deploying to a separate endpoint and using a load balancer with weighted routing.

B.Deploy the new model to a new endpoint, then update DNS to point to the new endpoint.

C.Delete the old model and deploy the new one with the same endpoint.

D.Deploy the new model to the same endpoint with 0% traffic initially, then gradually increase traffic while monitoring.

AnswerD

This ensures no downtime and allows controlled rollout.

Why this answer

The correct procedure is to deploy the new model version to the same endpoint, initially with 0% traffic, then gradually increase its traffic allocation while monitoring.
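
The "increase while monitoring" loop can be sketched as a simple gating rule. This is an illustrative policy only: `next_canary_pct`, the 1% error threshold, and the 10-point step are assumptions, and the observed error rate would come from your own monitoring.

```python
def next_canary_pct(current_pct, error_rate, max_error_rate=0.01, step=10):
    """Decide the new version's next traffic percentage.

    If the observed error rate exceeds the threshold, roll back to 0%.
    Otherwise advance by `step`, capped at 100%.
    """
    if error_rate > max_error_rate:
        return 0  # roll back: all traffic returns to the old version
    return min(100, current_pct + step)
```

Starting from 0%, repeated healthy checks move the new version to 10%, 20%, and so on, while a single unhealthy check sends all traffic back to the old version, which is still deployed on the same endpoint.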

47

MCQhard

You need to deploy a TensorFlow model to edge devices for real-time inference with minimal latency. The model is currently trained on Vertex AI. Which approach should you use?

A.Convert the model to TensorFlow Lite using the TF Lite converter, then deploy to edge devices via Vertex AI Edge Manager.

B.Export the model as a SavedModel and deploy it to Vertex AI Edge Manager using the Edge Manager API.

C.Deploy the model to Vertex AI endpoint and use Cloud IoT Core to stream data to the cloud for inference.

D.Use Vertex AI Model Optimization to quantize the model to INT8 and then deploy as a web service on a Raspberry Pi.

AnswerA

TFLite is optimized for on-device inference; Edge Manager can deploy it to edge devices.

Why this answer

Option A is correct because TensorFlow Lite is specifically designed for on-device inference on edge devices, offering reduced model size and optimized performance for low-latency, real-time scenarios. Vertex AI Edge Manager provides a managed service to deploy, monitor, and update models on edge devices, making it the ideal combination for this use case.

Exam trap

The exam often tests the distinction between cloud-based serving (SavedModel, Vertex AI endpoints) and edge-optimized deployment (TF Lite, Edge Manager). The trap here is assuming that any Vertex AI deployment method works for edge devices without considering the need for model optimization and offline inference capability.

How to eliminate wrong answers

Option B is wrong because exporting as a SavedModel alone does not optimize the model for edge devices; SavedModel is a full TensorFlow format intended for serving on cloud or server infrastructure, not for resource-constrained edge hardware. Option C is wrong because streaming data to the cloud for inference introduces network latency and dependency on connectivity, which contradicts the requirement for minimal latency and real-time inference on edge devices. Option D is wrong because deploying as a web service on a Raspberry Pi does not leverage Vertex AI Edge Manager for lifecycle management, and while INT8 quantization helps, the approach lacks the managed deployment and monitoring capabilities needed for production edge deployments.

48

MCQmedium

You need to serve a large embedding model for similarity search with low latency. The model was trained to generate 256-dimensional embeddings. You plan to use Vertex AI Vector Search. Which index type should you choose to balance accuracy and performance for a dataset with 10 million vectors?

A.Tree-based index

B.Approximate nearest neighbor (ANN) index using ScaNN

C.Brute-force index

D.Hash-based index

AnswerB

ScaNN is designed for efficient large-scale similarity search with configurable accuracy.

Why this answer

Vertex AI Vector Search uses ScaNN (Scalable Nearest Neighbors) as its underlying ANN algorithm, which is specifically designed for high-dimensional embeddings (like 256-d) and large-scale datasets (10M vectors). ScaNN balances accuracy and performance by employing anisotropic quantization and tree-based partitioning, making it the optimal choice for low-latency similarity search without requiring exhaustive comparison.

Exam trap

The trap here is that candidates often confuse 'tree-based index' (Option A) with the tree-based partitioning used inside ScaNN, but a standalone tree index fails in high dimensions, whereas ScaNN combines tree partitioning with quantization to overcome the curse of dimensionality.

How to eliminate wrong answers

Option A is wrong because a pure tree-based index (e.g., KD-tree) suffers from the 'curse of dimensionality' at 256 dimensions, where performance degrades to near brute-force levels. Option C is wrong because a brute-force index computes exact distances for all 10M vectors, resulting in O(n) latency that is unacceptable for real-time serving. Option D is wrong because hash-based indexes (e.g., LSH) are typically used for approximate nearest neighbor search in lower dimensions or for specific distance metrics, but they are not natively supported as a primary index type in Vertex AI Vector Search, and they often require extensive tuning to match ScaNN's accuracy-latency trade-off.
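
"Balancing accuracy and performance" is usually measured as recall@k: the fraction of the true top-k neighbors that the approximate index returns. A minimal sketch, with `recall_at_k` as a hypothetical helper and made-up IDs:

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k neighbors present in the approximate top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must contain at least one neighbor")
    found = truth & set(approx_ids[:k])
    return len(found) / len(truth)


# Exact neighbors from a brute-force pass vs. results from an ANN index.
score = recall_at_k([1, 2, 3, 4], [1, 2, 9, 4], k=4)
```

In practice the exact neighbors come from a brute-force pass over a sample of queries, and the index settings are tuned until recall is acceptable at the target latency.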

49

Multi-Selecteasy

Which TWO of the following are benefits of using Vertex AI Matching Engine (Vector Search) over a brute-force nearest neighbor search? (Choose 2)

Select 2 answers

A.Simpler to implement than brute-force

B.Supports real-time updates without index rebuild

C.Lower query latency for large datasets

D.Reduced memory footprint compared to brute-force

E.Guaranteed exact nearest neighbor results

AnswersC, D

ANN trades off some accuracy for much faster search.

Why this answer

Vertex AI Matching Engine uses approximate nearest neighbor (ANN) algorithms like ScaNN (Scalable Nearest Neighbors) to index high-dimensional vectors. For large datasets, ANN dramatically reduces query latency by avoiding a full scan of all vectors, unlike brute-force search which must compute distances against every vector. This makes option C correct because ANN trades a negligible accuracy loss for orders-of-magnitude faster retrieval.

Exam trap

The exam often tests the misconception that approximate nearest neighbor search provides exact results. The trap here is that candidates confuse 'nearest neighbor' with 'exact nearest neighbor,' forgetting that ANN algorithms like ScaNN are inherently approximate.

50

MCQmedium

You are A/B testing a new model version (challenger) against the current version (champion) on Vertex AI. You want to gradually shift traffic from champion to challenger while measuring business metrics. Which approach should you use?

A.Deploy the challenger to a separate endpoint and use a load balancer to route a percentage of requests.

B.Use Cloud Armor to route traffic based on headers.

C.Deploy both models to the same endpoint and use the traffic split feature to allocate percentages.

D.Create a new endpoint for the challenger and gradually shift DNS records.

AnswerC

Vertex AI allows multiple models on one endpoint with traffic allocation.

Why this answer

Vertex AI endpoints support traffic splitting between deployed models. By updating the traffic percentage, you can gradually shift traffic and monitor performance.

51

Multi-Selecthard

Your team has deployed a model on Vertex AI endpoints and you are planning an A/B test to compare a new challenger model (v2) against the current champion (v1). The test should measure business metrics such as click-through rate. Which THREE steps should you take to set up the A/B test correctly? (Choose 3 correct answers)

Select 3 answers

A.Deploy the challenger model (v2) to the same endpoint as the champion (v1).

B.Modify your application to log which model version served each prediction.

C.Create a new endpoint for v2 and gradually shift DNS traffic.

D.Use Vertex AI Experiments to compare model performance.

E.Set up a traffic split between v1 and v2, e.g., 90% v1 and 10% v2.

AnswersA, B, E

Both models must be on the same endpoint to use traffic splitting.

Why this answer

Option A is correct because deploying both v1 and v2 to the same Vertex AI endpoint allows you to use the built-in traffic splitting feature. This enables you to route a percentage of requests to each model version without managing separate endpoints or DNS changes, which is the standard approach for A/B testing on Vertex AI.

Exam trap

The trap here is that candidates confuse Vertex AI Experiments (for training) with endpoint traffic splitting (for serving), and they incorrectly think creating separate endpoints with DNS shifting is a valid A/B testing method, when Vertex AI's native traffic splitting is the correct and simpler approach.
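
Logging which version served each prediction is what makes the business metric computable per arm. The sketch below assumes the application has already joined prediction logs with click events into records with hypothetical `model_version` and `clicked` fields; how the version is obtained (for example, from the deployed model ID returned with a prediction) depends on your client code.

```python
from collections import defaultdict


def ctr_by_version(events):
    """Compute click-through rate per model version from logged events."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for event in events:
        version = event["model_version"]
        shown[version] += 1
        if event["clicked"]:
            clicked[version] += 1
    return {version: clicked[version] / shown[version] for version in shown}


# Made-up log records for illustration.
events = [
    {"model_version": "v1", "clicked": True},
    {"model_version": "v1", "clicked": False},
    {"model_version": "v1", "clicked": False},
    {"model_version": "v1", "clicked": False},
    {"model_version": "v2", "clicked": True},
    {"model_version": "v2", "clicked": False},
]
rates = ctr_by_version(events)
```

With a 90/10 split the challenger arm receives far fewer impressions, so the comparison needs enough traffic before the difference in rates means anything.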

52

MCQmedium

A company needs to run batch predictions on 10 TB of data stored in Cloud Storage. The predictions should be written to BigQuery. Which approach should they use?

A.Export the model to Cloud Functions and trigger on file upload

B.Create a Vertex AI Batch Prediction job with GCS input and BigQuery output

C.Use Vertex AI Online Prediction with a batch job

D.Use Dataflow to read from GCS and write to BigQuery, calling the model for each record

AnswerB

Batch Prediction directly supports this configuration.

Why this answer

Vertex AI Batch Prediction natively supports reading input from Cloud Storage and writing predictions directly to BigQuery, making it the most efficient and fully managed solution for large-scale batch inference on 10 TB of data. This approach avoids the complexity of custom infrastructure or per-record model calls, leveraging Vertex AI's optimized batch processing pipeline.

Exam trap

The exam often tests the distinction between batch and online prediction modes. The trap here is that candidates may reach for Dataflow or Cloud Functions, not realizing that Vertex AI Batch Prediction natively supports BigQuery as a direct output destination.

How to eliminate wrong answers

Option A is wrong because Cloud Functions are designed for event-driven, lightweight processing and cannot handle 10 TB of data efficiently; exporting a model to Cloud Functions also lacks native batch prediction orchestration and BigQuery output support. Option C is wrong because Vertex AI Online Prediction is intended for real-time, low-latency inference on individual requests, not for batch jobs; there is no 'batch job' mode within online prediction. Option D is wrong because while Dataflow can read from GCS and write to BigQuery, calling the model for each record would require custom code and per-record inference, which is less efficient and more complex than using Vertex AI's built-in batch prediction with direct BigQuery output.
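
The source and destination choices for a batch prediction job can be sketched as a small validator that builds the job arguments. The keyword names below (`gcs_source`, `bigquery_source`, `gcs_destination_prefix`, `bigquery_destination_prefix`, `instances_format`, `predictions_format`) are assumed to mirror the batch prediction parameters of the Vertex AI Python SDK and should be checked against its reference; `batch_predict_kwargs`, the bucket, and the project names are hypothetical.

```python
def batch_predict_kwargs(job_display_name, instances_format,
                         gcs_source=None, bigquery_source=None,
                         gcs_destination_prefix=None,
                         bigquery_destination_prefix=None):
    """Build keyword arguments for a batch prediction job.

    Exactly one source and exactly one destination must be given.
    """
    if (gcs_source is None) == (bigquery_source is None):
        raise ValueError("provide exactly one of gcs_source or bigquery_source")
    if (gcs_destination_prefix is None) == (bigquery_destination_prefix is None):
        raise ValueError("provide exactly one destination prefix")

    kwargs = {
        "job_display_name": job_display_name,
        "instances_format": instances_format,
        # Assumed values: "bigquery" for a BigQuery sink, "jsonl" for GCS.
        "predictions_format": "bigquery" if bigquery_destination_prefix else "jsonl",
    }
    optional = {
        "gcs_source": gcs_source,
        "bigquery_source": bigquery_source,
        "gcs_destination_prefix": gcs_destination_prefix,
        "bigquery_destination_prefix": bigquery_destination_prefix,
    }
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


# Cloud Storage input, BigQuery output, as in this question.
job = batch_predict_kwargs(
    "nightly-scoring", "jsonl",
    gcs_source="gs://example-bucket/input/*.jsonl",
    bigquery_destination_prefix="bq://example-project.predictions",
)
```

The same helper covers the BigQuery-to-BigQuery case by passing `bigquery_source` instead of `gcs_source`.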

53

MCQmedium

You need to deploy a PyTorch model for online inference on Vertex AI but the model was trained using custom ops that are not natively supported. You want to use NVIDIA Triton Inference Server for optimisation. How should you proceed?

A.Convert the model to TFLite and deploy on an edge device.

B.Build a custom container with NVIDIA Triton Inference Server and deploy it to Vertex AI.

C.Export the model to ONNX and deploy using Vertex AI's built-in TensorFlow serving.

D.Use Vertex AI Model Optimisation to automatically quantise the model.

AnswerB

Correct. Custom containers allow using Triton with arbitrary models.

Why this answer

Vertex AI supports deploying models with NVIDIA Triton Inference Server. You can build a custom container with Triton and the model, then deploy it to a Vertex AI endpoint. This allows using Triton's optimisations.

54

MCQmedium

An ML engineer needs to run batch predictions on 10 TB of data stored in BigQuery using a TensorFlow model. The predictions must be written to BigQuery. Which service should they use?

A.Create a Dataflow pipeline to read from BigQuery, run the model using Python, and write results to BigQuery.

B.Export BigQuery data to GCS, run batch prediction on GCS, then load results back to BigQuery.

C.Use Vertex AI online prediction with batch requests.

D.Use Vertex AI Batch Prediction with BigQuery source and sink.

AnswerD

Correct: Vertex AI Batch Prediction supports BigQuery directly for both input and output.

Why this answer

Vertex AI Batch Prediction supports BigQuery as both source and sink, enabling direct batch prediction without additional infrastructure.

55

Multi-Selectmedium

An ML engineer needs to deploy a model to Vertex AI for online predictions and enable autoscaling to zero when not in use. Which THREE conditions must be met? (Choose 3)

Select 3 answers

A.Enable serverless serving by selecting the appropriate serving mode.

B.Use a GPU-enabled machine type for faster scaling.

C.Deploy the model using a custom container with WebSockets support.

D.Set `min_replica_count=0` in the endpoint deployment config.

E.Set `max_replica_count` to a value > 0.

AnswersA, D, E

Vertex AI offers serverless mode for CPU models that supports scale-to-zero.

Why this answer

Option A is correct because Vertex AI offers a serverless serving mode that automatically scales resources to zero when no requests are being processed. By selecting this mode, the ML engineer enables the endpoint to scale down completely during idle periods, eliminating costs for unused infrastructure. This is distinct from standard serving, which maintains a minimum number of replicas.

Exam trap

The exam often tests the misconception that setting `min_replica_count=0` alone is sufficient; candidates forget that `max_replica_count` must also be set to a positive value to allow scaling up from zero.

56

MCQmedium

A company needs to serve a high-throughput prediction service with strict latency requirements. They want to minimize cold starts and ensure consistent performance. Which endpoint configuration is most appropriate?

A.Set min_replicas to an estimated baseline and max_replicas to a higher number

B.Set min_replicas and max_replicas equal to a fixed number

C.Set min_replicas to 0 and max_replicas to a high number

D.Do not set min_replicas; let Vertex AI automatically determine

AnswerA

This ensures always-on capacity for baseline traffic and room to scale.

Why this answer

Setting min_replicas to an estimated baseline ensures that a minimum number of instances are always running, eliminating cold starts for baseline traffic. Setting max_replicas to a higher number allows the service to scale up to handle traffic spikes while maintaining consistent performance. This configuration balances cost and latency by avoiding the overhead of scaling from zero while still accommodating bursts.

Exam trap

The exam often tests the misconception that setting min_replicas to 0 is cost-effective; the trap here is that it ignores the strict latency requirement and the reality of cold start delays in model serving.

How to eliminate wrong answers

Option B is wrong because setting min_replicas and max_replicas equal to a fixed number prevents any autoscaling, leading to either over-provisioning (waste) or under-provisioning (latency spikes) under variable load. Option C is wrong because setting min_replicas to 0 means the service can scale down to zero, so the first requests after an idle period must wait for a replica to start (a cold start), which violates the strict latency requirement. Option D is wrong because not setting min_replicas and relying on Vertex AI's automatic determination may result in the service scaling to zero or having unpredictable baseline capacity, introducing cold starts and inconsistent performance.
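
The effect of the two bounds can be shown with a simplified capacity model. This is not how Vertex AI's autoscaler actually decides (it reacts to utilization metrics over time); `desired_replicas` and the per-replica throughput figure are assumptions made for illustration.

```python
import math


def desired_replicas(qps, qps_per_replica, min_replicas, max_replicas):
    """Replicas needed for the load, clamped to the configured bounds."""
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    needed = math.ceil(qps / qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

With a baseline of 2 and a ceiling of 10, idle periods still keep 2 warm replicas, moderate load scales in between, and spikes are capped at 10. With a minimum of 0, an idle service drops to no replicas, so the next request has to wait for one to start.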

57

MCQmedium

You are deploying a new version of a model to a Vertex AI endpoint that already has a champion model serving 100% of traffic. You want to gradually shift traffic to the new version while monitoring for errors. Which approach should you use?

A.Use Cloud Load Balancing with weighted backend services pointing to different endpoints.

B.Deploy the challenger to the same endpoint with initial traffic split, e.g., champion 90%, challenger 10%, and gradually adjust.

C.Delete the champion model and redeploy with the challenger as the new version.

D.Create a new endpoint for the challenger and use a load balancer to split traffic.

AnswerB

This is the correct method for A/B testing with traffic splitting in Vertex AI.

Why this answer

Vertex AI endpoints support traffic splitting between model versions deployed to the same endpoint. By deploying the challenger to the same endpoint and setting an initial split (e.g., champion 90%, challenger 10%), you can gradually shift traffic while monitoring for errors. This approach uses the endpoint's built-in traffic management, avoiding the complexity and latency of external load balancers.

Exam trap

The exam often tests the misconception that external load balancers are required for traffic splitting, when in fact Vertex AI endpoints provide native traffic management that is simpler and more appropriate for model versioning.

How to eliminate wrong answers

Option A is wrong because Cloud Load Balancing has no awareness of the model versions deployed within a single Vertex AI endpoint and so cannot split traffic between them; it would require separate endpoints and adds unnecessary overhead. Option C is wrong because deleting the champion model removes the ability to roll back or compare performance, violating the principle of gradual, safe deployment. Option D is wrong because creating a new endpoint for the challenger and using a load balancer to split traffic bypasses Vertex AI's native traffic splitting, which is simpler, more reliable, and designed for this exact use case.

58

MCQmedium

An application serving predictions from a Vertex AI endpoint receives many identical requests within a short time window. The team notices redundant computation and wants to cache responses to reduce latency and cost. What is the recommended solution?

A.Deploy the model on a larger machine type to handle duplicate requests faster.

B.Enable Vertex AI endpoint caching by setting the `enable_cache` flag.

C.Implement a cache layer using Cloud Memorystore for Redis, hashing prediction requests.

D.Use Cloud CDN in front of the endpoint.

AnswerC

Correct: Cloud Memorystore provides low-latency caching for identical requests.

Why this answer

Option C is correct because Vertex AI does not provide built-in request caching; instead, the recommended pattern is to implement an external cache like Cloud Memorystore for Redis. By hashing the prediction request payload and using it as a cache key, identical requests within the short time window can be served from Redis, eliminating redundant model inference and reducing both latency and cost.

Exam trap

The trap here is that candidates assume Vertex AI has a native caching feature (like `enable_cache`) because other Google Cloud services (e.g., Cloud CDN, Cloud Load Balancing) offer caching, but Vertex AI endpoints require an external cache layer like Memorystore for Redis.

How to eliminate wrong answers

Option A is wrong because deploying on a larger machine type increases throughput but does not eliminate redundant computation for identical requests; it still performs the same inference multiple times, wasting resources. Option B is wrong because Vertex AI endpoints do not support an `enable_cache` flag; this is a fictitious feature, and Vertex AI has no built-in request caching mechanism. Option D is wrong because Cloud CDN caches static content at the edge based on HTTP cache headers, but prediction requests are typically POST with dynamic payloads that are not cacheable by CDN, and CDN cannot inspect or hash request bodies for deduplication.
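
The hash-and-cache pattern can be sketched as follows. An in-memory dictionary stands in for Memorystore for Redis so the example is self-contained; `CachedPredictor`, `cache_key`, and the 60-second expiry are assumptions, and `predict_fn` represents whatever call your application makes to the endpoint.

```python
import hashlib
import json
import time


def cache_key(instance):
    """Hash a request payload; key order must not change the key."""
    canonical = json.dumps(instance, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CachedPredictor:
    """Serve identical requests from a cache within a short time window."""

    def __init__(self, predict_fn, ttl_seconds=60.0, clock=time.monotonic):
        self._predict_fn = predict_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # stand-in for Redis: key -> (expiry_time, value)
        self.model_calls = 0

    def predict(self, instance):
        key = cache_key(instance)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # cache hit: no inference performed
        self.model_calls += 1
        value = self._predict_fn(instance)
        self._store[key] = (now + self._ttl, value)
        return value
```

Serializing with sorted keys means `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` map to the same cache entry. The expiry bounds how long a stale prediction can be served after the model or its inputs change.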

59

Multi-Selecthard

You are designing a batch prediction pipeline using Vertex AI. The input data is 50 TB in CSV format on GCS. The model requires feature engineering that involves complex transformations (e.g., datetime parsing, one-hot encoding). Which TWO services or steps should you include in your pipeline?

Select 2 answers

A.Use Cloud Functions to transform each file individually.

B.Use Cloud SQL to store intermediate results.

C.Run Vertex AI batch prediction job with GCS source pointing to the processed TFRecord files.

D.Use Dataflow to read CSV, perform feature engineering, and write to GCS in TFRecord format.

E.Use Dataflow to read CSV, perform feature engineering, and write to BigQuery.

AnswersC, D

Batch prediction can read from GCS and use the trained model.

Why this answer

Option C is correct because Vertex AI batch prediction can read TFRecord files from Cloud Storage as an input format. By writing the processed data as TFRecords to GCS, you let the batch prediction service read and score the feature-engineered data directly.

Exam trap

The exam often tests the misconception that any cloud service can handle large-scale data processing. The trap here is that Cloud Functions and Cloud SQL are inappropriate for batch processing of 50 TB, so candidates must recognize the need for a distributed data processing service like Dataflow.
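
The per-record transformation that Dataflow would apply can be sketched in plain Python. The column names (`event_time`, `channel`), the timestamp format, and the category list are hypothetical; in a real pipeline a function like this would be applied to each parsed CSV row by the distributed runner, and the output would then be serialized to TFRecord.

```python
from datetime import datetime

# Hypothetical closed set of categories for one-hot encoding.
CHANNELS = ("web", "mobile", "store")


def engineer(row):
    """Turn one parsed CSV row into model features."""
    ts = datetime.strptime(row["event_time"], "%Y-%m-%d %H:%M:%S")
    features = {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday is 0
    }
    # One-hot encode the channel: exactly one column is 1 for known values.
    for channel in CHANNELS:
        features[f"channel_{channel}"] = 1 if row["channel"] == channel else 0
    return features


example = engineer({"event_time": "2024-01-01 13:45:00", "channel": "mobile"})
```

Because the function is stateless and works on one row at a time, it parallelizes cleanly across workers, which is what makes a distributed service a good fit for 50 TB of input.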

60

MCQmedium

A company uses Vertex AI Vector Search (Matching Engine) for a product recommendation system. The product embeddings are updated hourly. Which index update method should they use to ensure low latency for new items?

A.Batch rebuild the index every hour

B.Use streaming updates to add new embeddings incrementally

C.Create a new index each hour and swap endpoints

D.Use brute-force index to simplify updates

AnswerB

Correct: Streaming updates allow near-real-time ingestion of new vectors.

Why this answer

Option B is correct because Vertex AI Vector Search supports streaming updates, allowing new embeddings to be added incrementally without rebuilding the entire index. This ensures low latency for new items by making them searchable almost immediately after update, which is critical for hourly refresh cycles where batch rebuilds would introduce significant delay.

Exam trap

The trap here is that candidates often assume batch rebuilds are the only reliable method for consistency, overlooking that streaming updates in Vertex AI Vector Search are designed specifically for low-latency incremental ingestion without sacrificing search quality.

How to eliminate wrong answers

Option A is wrong because batch rebuilding the index every hour incurs high latency and computational cost, as the entire index must be reconstructed from scratch, delaying availability of new items. Option C is wrong because creating a new index each hour and swapping endpoints is inefficient and introduces downtime during the swap, plus it requires managing multiple index versions unnecessarily. Option D is wrong because brute-force indices do not simplify updates; they perform exhaustive linear scans, which are slow and unscalable for large embedding sets, and they lack the optimized approximate nearest neighbor (ANN) search that Vector Search provides.

61

MCQeasy

A machine learning engineer wants to deploy a trained model to Vertex AI for online predictions. Which Vertex AI resource is required to serve the model and provide an endpoint URL?

A.Vertex AI Pipeline

B.Vertex AI Model Registry

C.Vertex AI Feature Store

D.Vertex AI Endpoint

AnswerD

Correct. An endpoint is required to deploy a model and obtain a URL for online predictions.

Why this answer

Vertex AI Endpoint is the required resource to deploy a trained model for online predictions, as it provides a dedicated endpoint URL that accepts prediction requests and routes them to the model. Without an endpoint, the model cannot be accessed via HTTP/HTTPS for real-time inference, which is the core requirement for online serving.

Exam trap

The trap here is that candidates confuse the Model Registry (which stores and versions models) with the actual serving infrastructure, assuming that registering a model automatically creates an endpoint, when in fact a separate Endpoint resource must be created and the model must be deployed to it.

How to eliminate wrong answers

Option A is wrong because Vertex AI Pipeline is used for orchestrating and automating ML workflows (e.g., training, evaluation), not for serving models or providing an endpoint URL. Option B is wrong because Vertex AI Model Registry is a central repository for managing model versions and metadata, but it does not itself expose an endpoint for predictions; models must be deployed to an endpoint for serving. Option C is wrong because Vertex AI Feature Store is designed for storing, serving, and sharing feature data for training and prediction, not for hosting models or providing inference endpoints.

62

MCQeasy

Your Vertex AI endpoint receives many identical prediction requests (same input features). You want to cache responses to reduce latency and cost. Which Google Cloud service should you use?

A.Cloud Memorystore for Redis

B.Cloud CDN

C.Bigtable

D.Cloud Storage with object versioning

AnswerA

Redis provides low-latency caching of serialized prediction responses keyed by request hash.

Why this answer

Cloud Memorystore (Redis) is ideal for caching prediction results. By hashing the request input, you can store and retrieve cached responses, avoiding redundant inference.
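
The request-hashing step can be sketched as follows, assuming the input features are JSON-serializable. The function name `cache_key`, the `pred:` prefix, and the `model_version` parameter are illustrative choices, not part of any Google Cloud API; the resulting string would be used as the Redis key.

```python
import hashlib
import json


def cache_key(features: dict, model_version: str = "v1") -> str:
    """Derive a deterministic cache key from prediction input features."""
    # sort_keys makes the key independent of dictionary ordering, so
    # logically identical requests map to the same cache entry.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Including the model version keeps entries from an older model
    # from being served after a new model is deployed.
    return f"pred:{model_version}:{digest}"


a = cache_key({"age": 31, "country": "DE"})
b = cache_key({"country": "DE", "age": 31})  # same features, different order
c = cache_key({"age": 32, "country": "DE"})  # different features
```

Here `a` and `b` are the same key while `c` differs, which is the property that lets identical requests share one cached response.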

63

MCQeasy

Your company runs a high-traffic web application that serves the same machine learning model prediction for many identical requests (e.g., product recommendations for the same user profile). You want to reduce latency and load on the prediction endpoint by caching responses. Which Google Cloud service should you use?

A.Cloud CDN

B.Cloud Memorystore

C.Cloud Spanner

D.BigQuery

AnswerB

Memorystore (Redis) provides low-latency caching for prediction responses.

Why this answer

Cloud Memorystore (B) is correct because it provides a managed in-memory cache (Redis or Memcached) that can store the results of identical prediction requests, reducing latency and load on the prediction endpoint. By caching responses keyed on the user profile or request parameters, subsequent identical requests can be served directly from Memorystore with sub-millisecond latency, avoiding redundant model inference.

Exam trap

The trap here is confusing caching at the edge (Cloud CDN) with caching at the application layer (Memorystore). Not every cache service suits dynamic API responses: Cloud CDN does not cache responses to POST requests, and it lacks the fine-grained key-value semantics needed to look up a prediction by its input.

How to eliminate wrong answers

Option A (Cloud CDN) is wrong because it caches static content (e.g., images, CSS) at edge locations, not dynamic API responses for identical requests; it cannot cache POST request payloads or application-level prediction results without complex workarounds. Option C (Cloud Spanner) is wrong because it is a globally distributed relational database designed for transactional consistency and high availability, not for low-latency caching of ephemeral prediction responses. Option D (BigQuery) is wrong because it is a serverless data warehouse for analytical queries on large datasets, not a caching layer for real-time inference results.

64

Multi-Selectmedium

You need to deploy a model that requires a large amount of memory (over 200 GB) for inference. The model is a custom PyTorch model. Vertex AI endpoints have machine type limitations. Which TWO actions can you take to handle this memory requirement? (Choose 2 correct answers)

Select 2 answers

A.Use a machine type from the n1-highmem series, such as n1-highmem-32 (208 GB) or higher.

B.Use multiple replicas and split the model across them.

C.Use a custom container to load the model and optimize its memory footprint.

D.Use batch prediction instead of online prediction.

E.Deploy the model on Cloud Run with 32 GB memory.

AnswersA, C

n1-highmem provides high memory per CPU, with sizes up to 416 GB.

Why this answer

Option A is correct because the n1-highmem-32 machine type provides 208 GB of memory, which meets the requirement of over 200 GB. Vertex AI endpoints support this machine series, allowing you to deploy a custom PyTorch model with sufficient RAM for inference without modification.

Exam trap

The exam often tests the misconception that splitting a model across replicas is a valid strategy for memory constraints. In Vertex AI, each replica loads its own full copy of the model and serves requests independently, so replicas cannot pool memory to hold a single model.
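
The sizing arithmetic behind option A can be sketched as below. The 6.5 GB per vCPU figure is the Compute Engine specification for n1-highmem machines; the helper name `smallest_highmem_for` is hypothetical, and which of these sizes a Vertex AI endpoint accepts is an assumption to check against the current machine-type documentation.

```python
# n1-highmem machines provide 6.5 GB of memory per vCPU (Compute Engine spec).
GB_PER_VCPU = 6.5
# Compute Engine n1-highmem sizes; Vertex AI may support only a subset.
N1_HIGHMEM_VCPUS = [2, 4, 8, 16, 32, 64, 96]


def smallest_highmem_for(required_gb: float):
    """Return the smallest n1-highmem type with at least required_gb, or None."""
    for vcpus in N1_HIGHMEM_VCPUS:
        if vcpus * GB_PER_VCPU >= required_gb:
            return f"n1-highmem-{vcpus}"
    return None


choice = smallest_highmem_for(200)  # n1-highmem-32 offers 32 * 6.5 = 208 GB
```

Note that 208 GB leaves little headroom above a 200 GB requirement, which is one reason reducing the model's memory footprint (option C) pairs naturally with option A.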

65

Multi-Selecthard

A team wants to deploy a model on Vertex AI Edge Manager for offline inference on edge devices. Which three steps are required? (Choose 3)

Select 3 answers

A.Enable Vertex AI Model Monitoring

B.Package the model and deploy using Vertex AI Edge Manager

C.Convert the model to TensorFlow Lite or ONNX format

D.Upload the model to Vertex AI Model Registry

E.Create a Vertex AI Endpoint with HTTP endpoint

AnswersB, C, D

Edge Manager handles packaging and pushing to edge devices.

Why this answer

The model must be in a format compatible with edge devices (TFLite/ONNX), and Edge Manager requires packaging and deploying the model to the device.

66

MCQeasy

You are deploying a model to a Vertex AI endpoint and need to minimize latency for online predictions. Which machine type should you choose?

A.n1-standard-2 with NVIDIA Tesla T4

B.e2-standard-2

C.n1-standard-2

D.n1-highmem-2

AnswerA

GPUs accelerate inference, reducing latency for deep learning models.

Why this answer

GPU-enabled machines (e.g., n1-standard-2 with NVIDIA Tesla T4) accelerate compute-heavy models, reducing prediction latency significantly compared to CPU-only instances.

67

MCQeasy

You need to serve a model on an edge device with low latency and offline capability. Which approach should you use?

A.Export the model to TensorFlow Lite and use Vertex AI Edge Manager for deployment.

B.Use Cloud Run for on-device inference.

C.Deploy the model to a Vertex AI endpoint and rely on mobile connectivity.

D.Use AI Platform Prediction (not Vertex AI).

AnswerA

Correct. Edge Manager handles deployment to devices with TFLite or ONNX models.

Why this answer

TensorFlow Lite is specifically designed for on-device inference with low latency and offline capability, converting models into a lightweight format optimized for edge hardware. Vertex AI Edge Manager extends this by providing deployment, monitoring, and management of models on edge devices, ensuring they run efficiently without constant cloud connectivity.

Exam trap

The exam often tests the distinction between cloud-based inference services (like Vertex AI endpoints or Cloud Run) and edge-optimized solutions (like TensorFlow Lite with Edge Manager), trapping candidates who assume any Google Cloud ML service can be deployed offline.

How to eliminate wrong answers

Option B is wrong because Cloud Run is a serverless compute platform for containerized applications in the cloud, not designed for on-device inference; it requires network connectivity and cannot operate offline. Option C is wrong because deploying to a Vertex AI endpoint relies on mobile connectivity for inference requests, introducing latency and failing when offline, contradicting the low-latency and offline requirements. Option D is wrong because AI Platform Prediction (the predecessor to Vertex AI) is a cloud-based prediction service that also requires network connectivity and is not optimized for edge deployment or offline operation.

68

MCQmedium

You have a champion model serving 100% traffic on a Vertex AI endpoint. You want to deploy a challenger model and gradually shift 10% of traffic to it for A/B testing. What is the correct approach?

A.Use Cloud Run to deploy both models and use Cloud Endpoints for traffic splitting.

B.Deploy the challenger on the same endpoint and use the traffic split parameter to allocate 10% traffic to it.

C.Deploy the challenger on a separate endpoint and use Cloud Armor to split traffic.

D.Create a new endpoint for the challenger and route 10% of requests via a load balancer.

AnswerB

Correct. Vertex AI allows multiple deployed models on one endpoint with traffic percentages.

Why this answer

Vertex AI endpoints support traffic splitting by deploying multiple model versions and assigning traffic percentages. You deploy the challenger as a new deployed model on the same endpoint and set traffic split: champion 90%, challenger 10%.
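
The 90/10 split can be illustrated with a small simulation. This is only a sketch of weighted routing, not the Vertex AI SDK: the IDs `champion` and `challenger` and the helper names are hypothetical, and the one property borrowed from the real service is that the percentages are whole numbers that must total 100.

```python
import random


def validate_split(split: dict) -> None:
    """Traffic percentages must be non-negative integers that total 100."""
    if any(not isinstance(p, int) or p < 0 for p in split.values()):
        raise ValueError("percentages must be non-negative integers")
    if sum(split.values()) != 100:
        raise ValueError("percentages must sum to 100")


def route(split: dict, rng: random.Random) -> str:
    """Pick a deployed model ID with probability proportional to its share."""
    ids = list(split)
    return rng.choices(ids, weights=[split[i] for i in ids], k=1)[0]


split = {"champion": 90, "challenger": 10}  # hypothetical deployed model IDs
validate_split(split)
rng = random.Random(0)  # seeded so the simulation is repeatable
hits = sum(route(split, rng) == "challenger" for _ in range(10_000))
```

Over 10,000 simulated requests, `hits` lands near 1,000, matching the 10% share assigned to the challenger.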

69

MCQmedium

You have a Vertex AI endpoint with min_replica_count=2 and max_replica_count=10. You notice that during a traffic spike, the endpoint does not scale up quickly enough, causing increased latency. What should you do to improve autoscaling responsiveness?

A.Increase max_replica_count to 20.

B.Disable autoscaling and manually manage replicas.

C.Increase min_replica_count to 10.

D.Reduce the target CPU utilization percentage from default to a lower value.

AnswerD

Lower target utilization triggers scaling sooner, improving responsiveness.

Why this answer

Option D is correct because reducing the target CPU utilization percentage (e.g., from the default 60% to a lower value like 40%) causes the autoscaler to trigger scale-up actions sooner, as the threshold for adding replicas is reached at a lower CPU load. This improves responsiveness during traffic spikes by initiating scaling earlier, reducing latency. The endpoint's min_replica_count=2 and max_replica_count=10 remain unchanged, so the scaling range is preserved.

Exam trap

The exam often tests the misconception that increasing max_replica_count or min_replica_count improves scaling speed, when in fact the key lever is the target utilization threshold that controls autoscaler sensitivity.

How to eliminate wrong answers

Option A is wrong because increasing max_replica_count to 20 does not address the speed of scaling; it only expands the upper bound, which may help if the spike exceeds 10 replicas but does not make the autoscaler react faster. Option B is wrong because disabling autoscaling and manually managing replicas removes the ability to dynamically handle traffic spikes, leading to either over-provisioning or under-provisioning and increased latency. Option C is wrong because increasing min_replica_count to 10 forces a minimum of 10 replicas at all times, which wastes resources during low traffic and does not improve the autoscaler's responsiveness to sudden spikes; it only pre-allocates capacity.
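
To see why a lower target triggers scaling sooner, consider a simplified proportional rule of the kind many autoscalers use. This is an illustrative model under that assumption, not Vertex AI's exact algorithm, which also involves evaluation windows and other details; the function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(current, cpu_util_pct, target_pct, min_replicas, max_replicas):
    """Proportional scaling rule, clamped to the configured replica range."""
    raw = math.ceil(current * cpu_util_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))


# Two replicas running at 50% CPU, with min=2 and max=10 as in the question.
at_60 = desired_replicas(2, 50, 60, 2, 10)  # load is below the 60% target
at_40 = desired_replicas(2, 50, 40, 2, 10)  # load is above the 40% target
```

At the same 50% load, the 60% target keeps two replicas while the 40% target already asks for three, so capacity is added before the spike saturates the existing replicas.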

70

MCQmedium

A team is deploying a large PyTorch model for online inference. They want to use NVIDIA Triton Inference Server to optimize serving performance. How can they integrate Triton with Vertex AI?

A.Package the model with Triton in a custom container and deploy it to Vertex AI

B.Vertex AI automatically uses Triton for all PyTorch models

C.Deploy the model to GKE and use Vertex AI as a frontend

D.Use a prebuilt Vertex AI PyTorch container that includes Triton

AnswerA

Custom containers allow full control, including running Triton.

Why this answer

Vertex AI supports custom containers; you can build a Docker image with Triton Inference Server and deploy it as a model on Vertex AI.

71

MCQhard

A company uses Vertex AI for online predictions with a large ensemble model that requires GPU acceleration. They want to reduce inference latency by batching multiple requests into a single GPU inference call. What should they configure?

A.Use Vertex AI Model Optimization for automatic compilation.

B.Deploy the model with NVIDIA Triton Inference Server configured for dynamic batching.

C.Increase the number of GPU replicas to handle higher concurrency.

D.Enable model quantization using TensorRT.

AnswerB

Correct: Triton supports dynamic batching to improve GPU utilization and reduce per-request latency.

Why this answer

NVIDIA Triton Inference Server supports dynamic batching, which automatically groups multiple inference requests into a single GPU call. This reduces overhead and improves GPU utilization, directly addressing the need to lower latency for online predictions with a large ensemble model on Vertex AI.

Exam trap

The exam often tests the distinction between model-level optimizations (quantization, compilation) and runtime optimizations (batching), leading candidates to confuse techniques that improve single-request speed with those that improve throughput via request aggregation.

How to eliminate wrong answers

Option A is wrong because Vertex AI Model Optimization for automatic compilation focuses on model-level optimizations like pruning or quantization, not on batching runtime requests. Option C is wrong because adding GPU replicas spreads requests across more machines, which raises concurrency but never combines requests into a single inference call. Option D is wrong because model quantization using TensorRT reduces model size and speeds up computation per request, but it does not implement request batching at the inference server level.
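
The idea of dynamic batching can be sketched with a toy scheduler. This is a simplified model rather than Triton's actual implementation: the function `form_batches` and its two parameters are hypothetical names chosen to mirror the usual "maximum batch size" and "maximum queue delay" settings.

```python
def form_batches(arrivals_ms, max_batch_size, max_queue_delay_ms):
    """Group request arrival times into batches.

    A batch closes when it is full, or when a new request arrives more than
    max_queue_delay_ms after the first request in the batch.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        full = len(current) >= max_batch_size
        stale = bool(current) and t - current[0] > max_queue_delay_ms
        if full or stale:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Six requests: four arrive close together, two arrive much later.
batches = form_batches([0, 1, 2, 3, 50, 51], max_batch_size=4, max_queue_delay_ms=5)
```

The six requests collapse into two batches, so the GPU is invoked twice instead of six times; the queue delay bounds how long any request waits for companions.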

72

MCQmedium

A data scientist wants to deploy a model trained with PyTorch to a Vertex AI endpoint for online predictions. What is the recommended approach?

A.Package the model in a custom container with a web server (e.g., FastAPI) and deploy to Vertex AI.

B.Use Vertex AI's pre-built PyTorch container and upload the state dictionary.

C.Export the PyTorch model to a SavedModel format and deploy using Vertex AI's pre-built TensorFlow container.

D.Convert the model to TensorFlow.js and deploy to Cloud Functions.

AnswerA

Correct: Custom container allows full control over PyTorch serving environment.

Why this answer

Option A is the best choice among those listed. Packaging the trained PyTorch model in a custom container with a web server such as FastAPI (or Flask), which loads the model and exposes HTTP routes for predictions and health checks, gives full control over the inference environment and dependencies. Vertex AI also offers prebuilt PyTorch serving containers, but these expect a packaged TorchServe model archive rather than a raw state dictionary.

Exam trap

The exam often tests whether candidates know what artifact each serving path expects. A state dictionary holds only the weights, so it cannot be served without the model class definition and the code that handles requests.

How to eliminate wrong answers

Option B is wrong because uploading a state dictionary alone does not create a runnable serving environment: it contains weights only, and the prebuilt PyTorch container expects a model archive that bundles the model definition and a handler. Option C is wrong because exporting a PyTorch model to SavedModel format (TensorFlow's format) and using a TensorFlow container is not a recommended or straightforward path; it requires conversion via ONNX or other tools, introduces potential incompatibilities, and is not the standard deployment method for PyTorch.

Option D is wrong because converting to TensorFlow.js and deploying to Cloud Functions is intended for client-side or lightweight serverless inference, not for production-grade online predictions with GPU support or complex model serving requirements; Cloud Functions also have request timeout and memory limitations unsuitable for many PyTorch models.
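
Whatever web framework the custom container uses, its prediction route typically receives a JSON body with an `instances` list and returns a `predictions` list. The sketch below shows only that contract, with the framework and the PyTorch model left out so it stays dependency-free; `model_forward`, `handle_predict`, and `handle_health` are hypothetical names.

```python
import json


def model_forward(row):
    # Hypothetical stand-in for a loaded PyTorch model's forward pass.
    return sum(row)


def handle_predict(request_body: str) -> str:
    """Turn a prediction request body into a prediction response body."""
    payload = json.loads(request_body)
    instances = payload["instances"]  # one entry per example to score
    predictions = [model_forward(x) for x in instances]
    return json.dumps({"predictions": predictions})


def handle_health() -> int:
    """Health route: report HTTP 200 once the model is ready to serve."""
    return 200


response = handle_predict('{"instances": [[1, 2], [3, 4]]}')
```

In a FastAPI or Flask app, these two functions would sit behind the prediction and health routes configured for the container.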

73

MCQhard

An ML engineer is optimizing a large model for deployment on Vertex AI with GPU acceleration. They want to reduce model size and improve inference latency without significant accuracy loss. Which tool should they use?

A.Use gcloud CLI to prune the model.

B.Use Cloud TPU for faster inference.

C.Use Vertex AI Model Optimization with TensorRT.

D.Use TensorFlow.js converter to optimize the model for web.

AnswerC

Vertex AI Model Optimization uses TensorRT to quantize and compile models for NVIDIA GPUs, reducing latency and model size.

Why this answer

Option C is correct because Vertex AI Model Optimization with TensorRT is specifically designed to reduce model size and improve inference latency on NVIDIA GPUs by applying techniques like quantization, pruning, and graph optimizations. TensorRT optimizes the model for the target GPU architecture, enabling faster inference with minimal accuracy loss, which directly addresses the engineer's goals.

Exam trap

The exam often tests the misconception that any tool (like the gcloud CLI or the TensorFlow.js converter) can perform model pruning or latency reduction, when in fact each tool has a specific domain; Vertex AI Model Optimization with TensorRT is the only option that directly targets GPU-accelerated inference optimization on Vertex AI.

How to eliminate wrong answers

Option A is wrong because the gcloud CLI is a command-line tool for managing Google Cloud resources, not for model pruning; pruning requires specialized frameworks like TensorFlow Model Optimization Toolkit or TensorRT. Option B is wrong because Cloud TPU is a hardware accelerator for training and inference, but it does not reduce model size or optimize for GPU acceleration; it is a different hardware type and not a tool for model optimization. Option D is wrong because TensorFlow.js converter is used to convert models for web browser deployment, not for optimizing inference latency on GPU-accelerated Vertex AI deployments; it targets client-side JavaScript execution, not server-side GPU optimization.
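
The size reduction from lower-precision weights is simple arithmetic, sketched below. The 1-billion-parameter model is a hypothetical example, and the estimate covers weight storage only, ignoring activations and runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_gb(num_params: int, precision: str) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


params = 1_000_000_000  # hypothetical 1B-parameter model
sizes = {p: weight_size_gb(params, p) for p in BYTES_PER_PARAM}
```

Moving from FP32 to FP16 halves the weight footprint and INT8 halves it again, which is why quantization helps both memory use and latency on GPUs.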

74

MCQmedium

A team has deployed a model on Vertex AI and wants to cache frequent identical prediction requests to improve latency and reduce cost. Which Google Cloud service should they use?

A.Cloud Bigtable

B.Cloud CDN

C.Cloud Memorystore

D.Cloud SQL

AnswerC

Memorystore provides a managed Redis or Memcached instance for caching.

Why this answer

Cloud Memorystore (Redis) can be used as a cache. The application hashes the request and checks the cache before calling the model.
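
The check-then-call flow can be sketched as a cache-aside wrapper with expiry. The `TTLCache` class is an in-memory stand-in for a Redis client so the example runs without any service, and `fake_model` is a hypothetical placeholder for the call to the Vertex AI endpoint.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory stand-in for a Redis client (hypothetical, for illustration)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired entries count as misses
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def cached_predict(features, cache, predict_fn, ttl_seconds=300):
    """Check the cache first; call the model only on a miss."""
    payload = json.dumps(features, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(payload).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    prediction = predict_fn(features)
    cache.set(key, json.dumps(prediction), ttl_seconds)
    return prediction


calls = []


def fake_model(features):
    # Hypothetical model call; a real one would hit the prediction endpoint.
    calls.append(features)
    return {"score": features["x"] * 2}


cache = TTLCache()
first = cached_predict({"x": 21}, cache, fake_model)
second = cached_predict({"x": 21}, cache, fake_model)  # served from the cache
```

The expiry time is a design choice: a short TTL limits how long a stale prediction can be served after the model or the underlying features change.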

75

MCQmedium

A company wants to run batch predictions on millions of records stored in BigQuery. They need to preprocess the data (e.g., feature engineering) before feeding it to the model. Which approach is most scalable and cost-effective?

A.Use a large Dataproc cluster to preprocess and run batch predictions.

B.Preprocess inline in the batch prediction job using a custom container.

C.Use a custom Python script on a Compute Engine instance.

D.Preprocess with Cloud Dataflow, output to Cloud Storage, then submit a Vertex AI batch prediction job.

AnswerD

Dataflow provides scalable preprocessing, and Vertex AI batch prediction reads from Cloud Storage.

Why this answer

Option D is the most scalable and cost-effective because Cloud Dataflow (Apache Beam) provides serverless, auto-scaling preprocessing that handles large volumes of data efficiently, and Vertex AI batch predictions natively read from Cloud Storage, avoiding the need to manage infrastructure. This decouples preprocessing from prediction, allowing each to scale independently and minimizing costs by using ephemeral, pay-per-use resources.

Exam trap

The exam often tests the misconception that a single large cluster (Dataproc) or a single VM is sufficient for batch processing, when in fact serverless, auto-scaling services like Dataflow are more appropriate for large-scale, ephemeral preprocessing tasks.

How to eliminate wrong answers

Option A is wrong because Dataproc clusters require manual sizing, incur idle costs, and add operational overhead for a simple preprocessing task; it is overkill and less cost-effective than serverless options. Option B is wrong because preprocessing inline in a custom container for batch prediction tightly couples preprocessing with prediction, preventing independent scaling and making it harder to handle large-scale data transformations efficiently. Option C is wrong because a single Compute Engine instance cannot scale horizontally to process millions of records in a reasonable time, and managing failover, retries, and parallelization would require custom code, making it neither scalable nor cost-effective.
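
The preprocessing step in option D amounts to a per-record transform whose output is written as one JSON object per line. A minimal sketch follows, assuming JSON Lines input for the batch prediction job; the field names and the `to_instance` helper are hypothetical, and the pipeline wiring (for example, applying the function per element in an Apache Beam step) is omitted to keep the example dependency-free.

```python
import json
import math


def to_instance(record: dict) -> str:
    """Apply feature engineering to one record and emit one JSON Lines row."""
    features = {
        "log_amount": round(math.log1p(record["amount"]), 4),
        "is_weekend": int(record["day_of_week"] in ("Sat", "Sun")),
    }
    return json.dumps(features, sort_keys=True)


rows = [
    {"amount": 0.0, "day_of_week": "Mon"},
    {"amount": 99.0, "day_of_week": "Sat"},
]
jsonl = "\n".join(to_instance(r) for r in rows)
```

Because the transform is a pure function of a single record, it can be parallelized across workers without coordination, which is what lets the preprocessing stage scale independently of prediction.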
