How many Serving and Scaling Models questions are on the PMLE exam?

The Serving and Scaling Models domain is one of the weighted domains on the PMLE exam. The Courseiva question bank has 109 practice questions for this domain.

Free PMLE Serving and Scaling Models Practice Questions (2026)

Q: What does the Serving and Scaling Models domain cover on the PMLE exam?

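The Serving and Scaling Models domain is part of the PMLE exam blueprint published by Google Cloud. Judging by the questions listed on this page, it covers deploying models to Vertex AI endpoints, choosing machine types and GPUs, autoscaling and replica configuration, traffic splitting for A/B tests and gradual rollouts, batch prediction, Vertex AI Vector Search, serving with NVIDIA Triton Inference Server, prediction caching, and edge deployment.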

Q: How can I practice Serving and Scaling Models questions for PMLE?

Click any of the 109 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Serving and Scaling Models domain.

All PMLE Serving and Scaling Models questions (109)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A data scientist wants to deploy a trained TensorFlow model to Vertex AI for online predictions. They need to serve predictions with low latency and want to leverage GPU acceleration. Which machine type should they select when creating the Vertex AI endpoint?

You are deploying a new version of a model to a Vertex AI endpoint that already has a champion model serving 100% of traffic. You want to gradually shift traffic to the new version while monitoring for errors. Which approach should you use?

A company is using Vertex AI Prediction with a custom container that performs preprocessing before inference. The preprocessing step is CPU-intensive and the inference step uses a GPU. They want to minimize prediction latency while optimizing cost. Which architecture should they use?

You need to serve a large embedding model for similarity search with low latency. The model was trained to generate 256-dimensional embeddings. You plan to use Vertex AI Vector Search. Which index type should you choose to balance accuracy and performance for a dataset with 10 million vectors?

A machine learning engineer needs to run batch predictions on 50 TB of data stored in BigQuery using a Vertex AI model. The model is a custom container. What is the most efficient way to set up the batch prediction job?

You have a Vertex AI endpoint with min_replica_count=2 and max_replica_count=10. You notice that during a traffic spike, the endpoint does not scale up quickly enough, causing increased latency. What should you do to improve autoscaling responsiveness?

You are deploying a PyTorch model on Vertex AI using a custom container with NVIDIA Triton Inference Server. The model is a large transformer that requires GPU. You want to optimize GPU utilization and reduce memory footprint. Which technique should you apply?

A company wants to cache predictions for identical requests to reduce latency and cost. They use Vertex AI Prediction with a custom container. Which GCP service should they use to implement prediction caching?

You have a Vertex AI endpoint that serves a model for real-time predictions. You want to update the model to a new version with zero downtime. Which approach should you take?

You are using Vertex AI Vector Search with an approximate nearest neighbor index. You need to update the index with new data every hour. The updates must be available for queries immediately. Which update method should you use?

An ML team wants to deploy multiple models (e.g., a recommender and a classifier) behind a single Vertex AI endpoint. The models have different resource requirements: the recommender needs GPU, the classifier needs high memory. How should they configure the endpoint?

You need to run a batch prediction job on Vertex AI using a model that requires custom preprocessing using a Python script. The preprocessing must be applied before inference. Which approach should you use?

A company is deploying a model on Vertex AI for online predictions with strict latency SLOs. The model requires GPU acceleration. Which TWO configurations should they consider to meet the SLOs while optimizing cost?

You are designing a batch prediction pipeline using Vertex AI. The input data is 100 TB of images stored in Cloud Storage. The model is a custom TensorFlow model that expects TFRecord format. The pipeline must be cost-effective and run within a time window of 2 hours. Which THREE steps should you include?

An organization wants to deploy a model on edge devices (e.g., Android phones) for offline inference. They trained a model using TensorFlow. Which THREE steps should they take to prepare and deploy the model?

You deployed a model to a Vertex AI endpoint with minReplicas=0 and maxReplicas=5. After sending prediction requests, you notice the endpoint takes about 30 seconds to respond initially, but subsequent requests are fast. What is the most likely cause?

You have a champion model serving 100% traffic on a Vertex AI endpoint. You want to deploy a challenger model and gradually shift 10% of traffic to it for A/B testing. What is the correct approach?

You need to run batch predictions on 10 TB of text data stored in BigQuery using a custom container model hosted in Vertex AI. What is the most cost-effective and simple approach?

Your team is deploying a large recommendation model on Vertex AI endpoints using GPUs. You need to minimise latency while optimising cost. The model serves many similar requests from the same users within short time windows. Which additional service would best reduce latency and cost?

You want to deploy a TensorFlow model to a Vertex AI endpoint and enable online predictions. The model requires GPU for inference. Which machine type should you select when deploying the model?

Your Vertex AI endpoint is experiencing high latency during traffic spikes. You have set maxReplicas=10 and minReplicas=2. The CPU utilisation target is 60%. During spikes, the endpoint never scales beyond 4 replicas. What is the most likely reason?

You need to deploy a PyTorch model for online inference on Vertex AI but the model was trained using custom ops that are not natively supported. You want to use NVIDIA Triton Inference Server for optimisation. How should you proceed?

Your team has built a low-latency similarity search service using Vertex AI Matching Engine (Vector Search). The index is updated daily with new embeddings. You need to serve the latest index without downtime. What is the correct deployment strategy?

You need to serve a model on an edge device with low latency and offline capability. Which approach should you use?

You have a Vertex AI endpoint with two deployed models: model A (champion) and model B (challenger). Traffic split is 90:10. You want to gradually increase model B's traffic to 50% over a week. What is the best way to update the traffic split?

You are using Vertex AI Prediction with a custom container that requires a large model file (5 GB). Deployment takes 10 minutes to start. You want to reduce cold start latency. Which action would be MOST effective?

You need to query a Vertex AI Vector Search index for nearest neighbours. The index is deployed on an endpoint. Which API method should you use to perform the query?

You are deploying a model for real-time inference with strict latency requirements (<100ms P99). You want to autoscale based on custom metrics. Which TWO actions should you take? (Choose 2)

Your team is using Vertex AI Prediction for a large-scale NLP model (PyTorch, custom ops). The model currently runs on CPU but you want to optimise inference cost and performance. Which THREE approaches should you consider? (Choose 3)

You need to deploy a model for online predictions with low latency. You want to ensure that the endpoint can handle traffic bursts without cold start. Which TWO configurations should you set? (Choose 2)

A company deploys a model on Vertex AI Endpoints for real-time inference. They need to minimize latency for prediction requests that are identical to previous requests. Which approach should they use?

A data science team needs to serve multiple versions of the same ML model on Vertex AI Endpoints for A/B testing. They want to gradually shift traffic from the current 'champion' model to a new 'challenger' model. Which feature should they use?

An ML engineer is optimizing a large model for deployment on Vertex AI with GPU acceleration. They want to reduce model size and improve inference latency without significant accuracy loss. Which tool should they use?

Which Vertex AI service is designed for building and managing approximate nearest neighbor (ANN) indexes for similarity search at scale?

A company wants to run batch predictions on millions of records stored in BigQuery. They need to preprocess the data (e.g., feature engineering) before feeding it to the model. Which approach is most scalable and cost-effective?

Which of the following is a benefit of using Vertex AI Endpoints with autoscaling and scale-to-zero?

An engineer deploys a model to a Vertex AI endpoint with minReplicas=1 and maxReplicas=3. The endpoint receives a sudden traffic spike, but it does not scale up beyond 1 replica. The CPU utilization target is 60%. What is the most likely cause?

A company needs to perform real-time similarity search on a dataset of 10 million embedding vectors. They expect low latency (under 10ms) and high throughput. Which index type should they use in Vertex AI Vector Search?

Which API is recommended for high-throughput, low-latency online prediction requests to Vertex AI endpoints?

An organization wants to deploy a TensorFlow model on edge devices such as smartphones and IoT devices for offline inference. Which format should they export the model to?

A company is deploying multiple models on a single Vertex AI endpoint to reduce costs. Each model has different traffic patterns. Which configuration should they use?

An ML engineer needs to update a model deployed on a Vertex AI endpoint without downtime. They want to gradually shift traffic to the new version while monitoring for errors. What is the correct procedure?

A company wants to use Vertex AI Vector Search for real-time product recommendations based on user embeddings. They need to update the index frequently with new product embeddings without significant downtime. Which TWO options should they consider? (Choose 2)

An organization is deploying a mission-critical model on Vertex AI Endpoints. They need to ensure high availability and meet a strict SLO of 99.9% uptime. Which THREE steps should they take? (Choose 3)

Which TWO of the following can be used as input sources for Vertex AI batch prediction jobs? (Choose 2)

You are deploying a model to a Vertex AI endpoint and need to minimize latency for online predictions. Which machine type should you choose?

You need to perform batch predictions on 10 TB of data stored in BigQuery using Vertex AI. The model requires some preprocessing that cannot be expressed in SQL. What is the most scalable approach?

You are using Vertex AI Vector Search for a product recommendation system. Your index is updated with new embeddings every hour. To minimize query latency while keeping the index fresh, what should you do?

You have a Vertex AI endpoint serving a model with min replicas=2 and max replicas=10. You notice that during low traffic hours, the endpoint still runs 2 replicas, incurring costs. You want to reduce costs to zero when there is no traffic. What should you do?

You are A/B testing a new model version (challenger) against the current version (champion) on Vertex AI. You want to gradually shift traffic from champion to challenger while measuring business metrics. Which approach should you use?

You need to serve multiple models on a single Vertex AI endpoint to reduce costs. How can you achieve this?

You are deploying a PyTorch model on Vertex AI and want to use NVIDIA Triton Inference Server for optimal performance. You have built a custom container with Triton. Which serving configuration should you use?

You are using Vertex AI batch prediction and your model requires preprocessing that involves joining two BigQuery tables. The preprocessing logic is complex and must be done before inference. How should you design the pipeline?

You have a Vertex AI endpoint with autoscaling enabled. You notice that during traffic spikes, the endpoint takes a long time to scale up, causing prediction errors. What is the most effective solution?

You need to deploy a TensorFlow model to edge devices for real-time inference with minimal latency. The model is currently trained on Vertex AI. Which approach should you use?

Your Vertex AI endpoint receives many identical prediction requests (same input features). You want to cache responses to reduce latency and cost. Which Google Cloud service should you use?

You are using Vertex AI Vector Search to find nearest neighbors for a recommendation system. Your index is built on 10M embeddings and you need low-latency queries. You want to ensure that adding new embeddings does not require a full index rebuild. Which index type should you use?

You are optimizing a model for deployment on Vertex AI using NVIDIA Triton Inference Server. Which TWO actions can you take to improve inference performance?

You are deploying a model on Vertex AI and need to ensure high availability and low latency. Which THREE configurations should you implement?

You are designing a batch prediction pipeline using Vertex AI. The input data is 50 TB in CSV format on GCS. The model requires feature engineering that involves complex transformations (e.g., datetime parsing, one-hot encoding). Which THREE services or steps should you include in your pipeline?

A data science team has trained a custom TensorFlow model for real-time fraud detection. They need to deploy it on Vertex AI with minimal latency and support for multiple concurrent requests. The model requires a GPU for inference. Which machine type should they choose for the Vertex AI endpoint?

You deploy a new version of a model to a Vertex AI endpoint and want to gradually shift traffic from the old version to the new version over 24 hours. The endpoint currently serves 100% traffic to the old version. What should you do?

You have a Vertex AI endpoint serving a model for real-time predictions. The endpoint is configured with minReplicaCount=2 and maxReplicaCount=10. Over the past week, you notice that the actual number of replicas rarely exceeds 2, but the average CPU utilization is around 85%. You want to reduce costs without impacting performance. What should you do?

You are deploying a PyTorch model for online predictions on Vertex AI. The model expects input tensors and performs GPU-accelerated inference. You want to minimize prediction latency and maximize throughput. Which approach should you use?

Your team has deployed a model on Vertex AI endpoints. You need to monitor the prediction latency to ensure it meets a 99th percentile SLO of 500ms. You want to set up an alert if the latency exceeds this threshold. Which metric should you use?

You are using Vertex AI Matching Engine (Vector Search) to serve similarity search for an e-commerce product recommendation system. The index is updated daily with new product embeddings via a batch job. However, you notice that some new products are not appearing in the search results for up to 24 hours. You need to ensure that new products are discoverable within 1 hour of ingestion. What should you do?

Your company runs a high-traffic web application that serves the same machine learning model prediction for many identical requests (e.g., product recommendations for the same user profile). You want to reduce latency and load on the prediction endpoint by caching responses. Which Google Cloud service should you use?

You have a TensorFlow model that you want to deploy on edge devices for real-time inference. The model was trained in Vertex AI. You need to convert it to a format suitable for on-device inference. Which approach should you use?

You need to run batch predictions on a large dataset stored in BigQuery using a Vertex AI model. The dataset contains 10 million rows, and each prediction takes about 100ms. You want to minimize cost and execution time. What should you do?

You have a Vertex AI endpoint with two deployed models: a champion (v1) and a challenger (v2). You set the traffic split to 90% v1 and 10% v2. After a week, you observe that v2 has better business metrics. You want to shift all traffic to v2 gradually over 3 days to avoid any risk. What should you do?

You need to deploy a model to a Vertex AI endpoint that can scale down to zero when there are no requests to minimize costs. Which feature should you enable?

You are using Vertex AI Matching Engine for similarity search. Your index has 10 million embeddings of 512 dimensions. The query latency requirement is under 10ms for 99th percentile. Which index type should you choose?

You are deploying a large deep learning model on Vertex AI endpoints. The model requires GPU acceleration and you want to minimize cold-start latency. Which TWO actions should you take? (Choose 2 correct answers)

Your team has deployed a model on Vertex AI endpoints and you are planning an A/B test to compare a new challenger model (v2) against the current champion (v1). The test should measure business metrics such as click-through rate. Which THREE steps should you take to set up the A/B test correctly? (Choose 3 correct answers)

You need to deploy a model that requires a large amount of memory (over 200 GB) for inference. The model is a custom PyTorch model. Vertex AI endpoints have machine type limitations. Which TWO actions can you take to handle this memory requirement? (Choose 2 correct answers)

A machine learning engineer wants to deploy a trained model to Vertex AI for online predictions. Which Vertex AI resource is required to serve the model and provide an endpoint URL?

A company needs to serve a high-throughput prediction service with strict latency requirements. They want to minimize cold starts and ensure consistent performance. Which endpoint configuration is most appropriate?

A data scientist wants to perform A/B testing between two model versions deployed on the same Vertex AI endpoint. They need to route 10% of traffic to the challenger model. Which approach should they use?

A team is deploying a large PyTorch model for online inference. They want to use NVIDIA Triton Inference Server to optimize serving performance. How can they integrate Triton with Vertex AI?

A company needs to run batch predictions on 10 TB of data stored in Cloud Storage. The predictions should be written to BigQuery. Which approach should they use?

What is the primary purpose of Vertex AI Model Optimization (formerly Model Garden)?

A company uses Vertex AI Matching Engine for a product recommendation system. They need to update the index with new product embeddings every hour, but the index is used for online queries with low latency. Which index update strategy should they use?

A team has deployed a model on Vertex AI and wants to cache frequent identical prediction requests to improve latency and reduce cost. Which Google Cloud service should they use?

An engineer needs to deploy multiple models on a single Vertex AI endpoint with separate traffic allocations. What is the maximum number of deployed models that can be assigned traffic on one endpoint?

A company wants to deploy a TensorFlow model on edge devices for real-time inference without internet connectivity. Which Vertex AI service should they use to manage the deployment?

Which machine type is most suitable for a Vertex AI endpoint serving a GPU-accelerated model?

A company uses Vertex AI Vector Search for similarity search. They have a dataset of 10 million 512-dimensional vectors. Which index type should they choose for lowest latency at high recall?

A company needs to reduce inference latency for their online prediction service on Vertex AI. Which two actions would help? (Choose 2)

A team wants to deploy a model on Vertex AI Edge Manager for offline inference on edge devices. Which three steps are required? (Choose 3)

A company uses Vertex AI Matching Engine for real-time recommendations. They need to serve queries with low latency and support frequent updates. Which two configurations are appropriate? (Choose 2)

A company is deploying a new model version to an existing Vertex AI endpoint. They want to test the new version with 5% of traffic before fully rolling it out. What is the correct approach?

Which Vertex AI service is best suited for finding similar items in a large dataset based on embedding vectors, such as product recommendations or image similarity search?

An ML engineer needs to run batch predictions on 10 TB of data stored in BigQuery using a TensorFlow model. The predictions must be written to BigQuery. Which service should they use?

A company uses Vertex AI for online predictions with a large ensemble model that requires GPU acceleration. They want to reduce inference latency by batching multiple requests into a single GPU inference call. What should they configure?

A data scientist wants to deploy a model trained with PyTorch to a Vertex AI endpoint for online predictions. What is the recommended approach?

Which Vertex AI feature allows you to reduce the size of a trained model to improve inference speed on edge devices without significant accuracy loss?

An application serving predictions from a Vertex AI endpoint receives many identical requests within a short time window. The team notices redundant computation and wants to cache responses to reduce latency and cost. What is the recommended solution?

A team is deploying a model that has strict latency requirements: p99 response time under 100 ms. The model is CPU-only and will receive up to 1000 QPS. They want to minimize cost while meeting the SLO. Which machine type and scaling configuration is most appropriate?

What is the primary purpose of Vertex AI Edge Manager?

A company uses Vertex AI Vector Search (Matching Engine) for a product recommendation system. The product embeddings are updated hourly. Which index update method should they use to ensure low latency for new items?

A company is deploying a complex model that requires GPU for inference. They want to use Vertex AI for serving. Which TWO steps are required to deploy the model with GPU support? (Choose 2)

A team is building a batch prediction pipeline that processes raw data from Cloud Storage, performs complex preprocessing, and then runs predictions using a large model. The preprocessing step is compute-intensive and the prediction step is I/O-bound. Which TWO Google Cloud services should they combine to optimize cost and performance? (Choose 2)

An ML engineer needs to deploy a model to Vertex AI for online predictions and enable autoscaling to zero when not in use. Which THREE conditions must be met? (Choose 3)

Which TWO of the following are benefits of using Vertex AI Matching Engine (Vector Search) over a brute-force nearest neighbor search? (Choose 2)

A company is migrating from an on-premises ML serving infrastructure to Vertex AI. They have multiple models that need to be served from the same endpoint with different traffic percentages. They also need to monitor prediction quality. Which THREE actions should they take? (Choose 3)

A machine learning team deploys a PyTorch model for online prediction on Vertex AI using a custom container. They notice that the first few requests after scaling up experience high latency. What is the most likely cause and how should they mitigate it?

A company runs batch predictions on Vertex AI every hour using a custom container. They want to reduce costs by minimizing idle time while ensuring the batch job completes within 10 minutes. Which endpoint configuration should they use?

A retail company deploys a new recommendation model alongside the current champion on Vertex AI Endpoints. They want to gradually shift traffic to the challenger while monitoring business metrics (conversion rate). Which two steps are required? (Choose 2)

A fintech company needs to deploy a TensorFlow model for real-time fraud detection with strict latency SLO (p99 < 100ms). They expect variable traffic with spikes. They also want to minimize cold-start latency. Which two configurations should they use? (Choose 2)

Other PMLE exam domains

Automating and Orchestrating ML Pipelines · Collaborating Within and Across Teams to Manage Data and Models · Monitoring ML Solutions · Architecting Low-Code ML Solutions · Scaling Prototypes into ML Models · Collaborating to manage data and models · Solving business challenges with ML

Frequently asked questions

What does the Serving and Scaling Models domain cover on the PMLE exam?

The Serving and Scaling Models domain covers the key concepts tested in this area of the PMLE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PMLE domains, with no account required.

How many Serving and Scaling Models questions are in the PMLE question bank?

The Courseiva PMLE question bank contains 109 questions in the Serving and Scaling Models domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Serving and Scaling Models for PMLE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Serving and Scaling Models questions for PMLE?

Yes — the session launcher on this page draws questions exclusively from the Serving and Scaling Models domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
