Is Serving and Scaling Models hard on the PMLE?

Serving and Scaling Models is one of the core PMLE topics. Consistent practice with scenario-based questions is the best way to build confidence and score well on exam day.

PMLE Serving and Scaling Models Practice Questions

Q: How many PMLE Serving and Scaling Models questions are on the real exam?

The PMLE exam covers Serving and Scaling Models as part of the Google Professional Machine Learning Engineer blueprint. Courseiva has 20+ practice questions on this topic to help you prepare.

Q: Are these PMLE Serving and Scaling Models practice questions free?

Yes. All PMLE Serving and Scaling Models practice questions on Courseiva are free. No account or payment is required to start practising.

Sample Serving and Scaling Models Questions


A data scientist wants to deploy a trained TensorFlow model to Vertex AI for online predictions. They need to serve predictions with low latency and want to leverage GPU acceleration. Which machine configuration should they select when deploying the model to the Vertex AI endpoint?

A. n1-standard-4 with 1 NVIDIA Tesla T4

B. n1-standard-4

C. e2-standard-4

D. n1-highmem-8

Explanation: Option A is correct because it is the only choice that includes a GPU. The n1-standard-4 machine type supports attaching GPUs such as the NVIDIA Tesla T4, which provides acceleration for low-latency online predictions. Options B and D are CPU-only configurations, and the E2 family in option C does not support GPU attachment. The T4 also offers a good balance of cost and performance for inference workloads.
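As a rough sketch of what option A looks like in practice, the snippet below builds the resource settings for a deployed model as a plain Python dictionary. The field names follow the Vertex AI REST resource for dedicated resources (`machineSpec`, `acceleratorType`, `acceleratorCount`, `minReplicaCount`, `maxReplicaCount`). The helper name, the replica counts, and the simplified family check are assumptions made for illustration, and nothing here calls the API.

```python
# Simplified assumption for this sketch: only N1 machine types accept an
# attached GPU. Real compatibility rules cover more machine families.
GPU_ATTACHABLE_PREFIXES = ("n1-",)


def build_dedicated_resources(machine_type, accelerator_type=None,
                              accelerator_count=0, min_replicas=1,
                              max_replicas=1):
    """Hypothetical helper: return a dict shaped like dedicatedResources."""
    machine_spec = {"machineType": machine_type}
    if accelerator_type:
        # Reject machine families that this sketch treats as CPU-only.
        if not machine_type.startswith(GPU_ATTACHABLE_PREFIXES):
            raise ValueError(f"{machine_type} cannot take an attached GPU here")
        if accelerator_count < 1:
            raise ValueError("accelerator_count must be at least 1")
        machine_spec["acceleratorType"] = accelerator_type
        machine_spec["acceleratorCount"] = accelerator_count
    if max_replicas < min_replicas:
        raise ValueError("max_replicas must be >= min_replicas")
    return {
        "machineSpec": machine_spec,
        "minReplicaCount": min_replicas,
        "maxReplicaCount": max_replicas,
    }


# Option A: n1-standard-4 with one T4 attached.
option_a = build_dedicated_resources("n1-standard-4", "NVIDIA_TESLA_T4", 1)
print(option_a)
```

Passing `"e2-standard-4"` together with an accelerator raises a `ValueError` in this sketch, which mirrors why option C cannot satisfy the GPU requirement.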

You are deploying a new version of a model to a Vertex AI endpoint that already has a champion model serving 100% of traffic. You want to gradually shift traffic to the new version while monitoring for errors. Which approach should you use?

A. Use Cloud Load Balancing with weighted backend services pointing to different endpoints.

B. Deploy the challenger to the same endpoint with initial traffic split, e.g., champion 90%, challenger 10%, and gradually adjust.

C. Delete the champion model and redeploy with the challenger as the new version.

D. Create a new endpoint for the challenger and use a load balancer to split traffic.

Explanation: Option B is correct. Vertex AI endpoints support traffic splitting between multiple models deployed to the same endpoint. By deploying the challenger alongside the champion and setting an initial split (e.g., champion 90%, challenger 10%), you can gradually shift traffic while monitoring for errors. This approach uses the endpoint's built-in traffic management, avoiding the complexity and latency of external load balancers.
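The gradual shift can be sketched as a sequence of traffic-split maps. An endpoint's traffic split maps each deployed model ID to an integer percentage, and the percentages must sum to 100. The function name, the deployed model IDs, and the step sizes below are hypothetical. Applying each split to a real endpoint is left out of this sketch.

```python
def rollout_schedule(champion_id, challenger_id,
                     challenger_steps=(10, 25, 50, 100)):
    """Hypothetical helper: one traffic-split map per rollout stage."""
    schedule = []
    for challenger_pct in challenger_steps:
        if not 0 <= challenger_pct <= 100:
            raise ValueError("percentages must be between 0 and 100")
        split = {
            champion_id: 100 - challenger_pct,
            challenger_id: challenger_pct,
        }
        # Each stage must account for exactly 100% of traffic.
        assert sum(split.values()) == 100
        schedule.append(split)
    return schedule


# Illustrative deployed model IDs; real IDs are assigned at deploy time.
for stage in rollout_schedule("champion-id", "challenger-id"):
    print(stage)
```

In a real rollout you would apply one stage, watch error rates and latency for a while, and move to the next stage only if the challenger stays healthy.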

A company is using Vertex AI Prediction with a custom container that performs preprocessing before inference. The preprocessing step is CPU-intensive and the inference step uses a GPU. They want to minimize prediction latency while optimizing cost. Which architecture should they use?

A. Use Cloud Run for preprocessing and send HTTP requests to a GPU-backed Vertex AI endpoint for inference.

B. Use two separate Vertex AI endpoints: one CPU-based for preprocessing, one GPU-based for inference, and chain them with Cloud Tasks.

C. Use Dataflow for preprocessing and then invoke the model, but Dataflow is not designed for real-time prediction.

D. Use a single GPU machine (e.g., n1-standard-4 with T4) and perform both preprocessing and inference on the same instance.

Explanation: Option D is correct. Running preprocessing on a CPU-only service and then sending the preprocessed data to a GPU node for inference separates concerns and allows independent scaling, but every request pays an extra network round trip. Performing both steps on a single machine with CPU and GPU avoids that hop. Choose a machine type with enough vCPUs to handle the CPU-intensive preprocessing.
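A minimal sketch of the single-instance design follows, assuming the container receives the standard Vertex AI prediction request body (`{"instances": [...]}`) and returns `{"predictions": [...]}`. The preprocessing logic and the stand-in model function are hypothetical placeholders. The point is only that both steps run in the same process, with no network call between them.

```python
import json


def preprocess(instance):
    """CPU-bound step (hypothetical): scale raw values into [0, 1]."""
    return [value / 255.0 for value in instance]


def fake_gpu_model(batch):
    """Stand-in for the GPU inference call: returns each row's mean."""
    return [sum(row) / len(row) for row in batch]


def handle_predict(request_body, model_fn=fake_gpu_model):
    """Preprocess and run inference in one process, with no extra hop."""
    payload = json.loads(request_body)
    batch = [preprocess(instance) for instance in payload["instances"]]
    return json.dumps({"predictions": model_fn(batch)})


response = handle_predict(json.dumps({"instances": [[0, 255], [255, 255]]}))
print(response)  # {"predictions": [0.5, 1.0]}
```

In a real custom container, `fake_gpu_model` would be replaced by the loaded model's inference call, and `handle_predict` would sit behind the container's HTTP prediction route.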

You need to serve a large embedding model for similarity search with low latency. The model was trained to generate 256-dimensional embeddings. You plan to use Vertex AI Vector Search. Which index type should you choose to balance accuracy and performance for a dataset with 10 million vectors?

A. Tree-based index

B. Approximate nearest neighbor (ANN) index using ScaNN

C. Brute-force index

D. Hash-based index

Explanation: Option B is correct. Vertex AI Vector Search uses ScaNN (Scalable Nearest Neighbors) as its underlying ANN algorithm, which is designed for high-dimensional embeddings (like 256-d) and large-scale datasets (10M vectors). ScaNN balances accuracy and performance by combining anisotropic quantization with tree-based partitioning, so a query scans only a fraction of the index instead of comparing against every vector. A brute-force index gives exact results but does not meet the low-latency goal at this scale.
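A back-of-the-envelope calculation shows why exhaustive comparison is costly at this scale. The 5% figure for the share of partitions searched is a hypothetical setting chosen for illustration. The estimate counts only distance computations and ignores the cost of selecting partitions and any further savings from quantization.

```python
NUM_VECTORS = 10_000_000     # dataset size from the question
DIMENSIONS = 256             # embedding size from the question
LEAF_SEARCH_PERCENT = 5      # hypothetical share of partitions searched


def multiply_adds_per_query(vectors_scanned, dimensions):
    """Rough cost of comparing one query against the scanned vectors."""
    return vectors_scanned * dimensions


# Brute force compares the query with every vector in the index.
brute_force_cost = multiply_adds_per_query(NUM_VECTORS, DIMENSIONS)

# A partitioned ANN index scans only the vectors in the selected leaves.
vectors_scanned_ann = NUM_VECTORS * LEAF_SEARCH_PERCENT // 100
ann_cost = multiply_adds_per_query(vectors_scanned_ann, DIMENSIONS)

print(brute_force_cost)              # 2560000000
print(ann_cost)                      # 128000000
print(brute_force_cost // ann_cost)  # 20
```

Under these assumptions the ANN index does about one twentieth of the distance work per query, in exchange for a small chance of missing a true nearest neighbor.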

A machine learning engineer needs to run batch predictions on 50 TB of data stored in BigQuery using a Vertex AI model. The model is a custom container. What is the most efficient way to set up the batch prediction job?

A. Create a Vertex AI batch prediction job with BigQuery source and BigQuery destination.

B. Use Dataflow to process the data and call the model via Vertex AI online prediction.

C. Export BigQuery data to CSV in GCS, then create a batch prediction job with GCS source.

D. Create a Cloud Function to iterate over BigQuery rows and call the endpoint.

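Explanation: Option A is correct. Vertex AI batch prediction supports BigQuery as both the input source and the output destination, so the job can read the table and write predictions directly without an export step. Calling an online endpoint row by row, as in options B and D, is inefficient at 50 TB. Add Dataflow preprocessing only if the data needs to be transformed first.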
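To make option A concrete, here is a sketch of a batch prediction job body with BigQuery on both sides. The nested field names follow the Vertex AI REST resource for batch prediction jobs (`inputConfig`, `bigquerySource`, `outputConfig`, `bigqueryDestination`). The helper name and all project, dataset, table, and model identifiers are placeholders. Machine resources for the custom container are omitted for brevity, and the snippet does not submit the job.

```python
def build_batch_prediction_job(display_name, model_name,
                               bq_input_table, bq_output_dataset):
    """Hypothetical helper: job body reading from and writing to BigQuery."""
    for uri in (bq_input_table, bq_output_dataset):
        # BigQuery locations are passed as bq:// URIs.
        if not uri.startswith("bq://"):
            raise ValueError(f"expected a bq:// URI, got {uri!r}")
    return {
        "displayName": display_name,
        "model": model_name,
        "inputConfig": {
            "instancesFormat": "bigquery",
            "bigquerySource": {"inputUri": bq_input_table},
        },
        "outputConfig": {
            "predictionsFormat": "bigquery",
            "bigqueryDestination": {"outputUri": bq_output_dataset},
        },
    }


# All identifiers below are placeholders for illustration.
job = build_batch_prediction_job(
    display_name="example-batch-job",
    model_name="projects/PROJECT/locations/LOCATION/models/MODEL_ID",
    bq_input_table="bq://PROJECT.DATASET.INPUT_TABLE",
    bq_output_dataset="bq://PROJECT.OUTPUT_DATASET",
)
print(job["inputConfig"], job["outputConfig"])
```

Because both the source and the destination are BigQuery, no intermediate CSV export to Cloud Storage is involved, which is what makes option C unnecessary.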


How to master Serving and Scaling Models for PMLE

1. Baseline your knowledge

Start with 10 questions to gauge your current understanding of Serving and Scaling Models. This tells you whether you need a concept refresher or just practice.

2. Review every explanation

For each question — right or wrong — read the full explanation. Understanding why an answer is correct is more valuable than knowing the answer itself.

3. Focus on exam traps

Serving and Scaling Models questions on the PMLE frequently use trap wording. Look for subtle differences in answers that test your precision, not just general knowledge.

4. Reach 80% consistently

Do repeated sessions until you score 80%+ three times in a row. Then move to mixed-mode practice to test cross-topic recall under realistic conditions.

Frequently asked questions

How many PMLE Serving and Scaling Models questions are on the real exam?

The exact number varies per candidate. Serving and Scaling Models is tested as part of the Google Professional Machine Learning Engineer blueprint. Practicing with targeted Serving and Scaling Models questions ensures you can handle any format or difficulty that appears.

Are these PMLE Serving and Scaling Models practice questions free?

Yes. Courseiva provides free PMLE practice questions across all exam topics and domains. The platform includes topic-based practice, mock exams, missed-question review, bookmarked questions, and readiness tracking — no account required.

Is Serving and Scaling Models one of the harder PMLE topics?

Difficulty is subjective, but Serving and Scaling Models is a high-priority exam concept tested in multiple ways — direct recall, scenario analysis, and command-output interpretation. Consistent practice is the best way to build confidence.

