Serving and Scaling Models practice questions

Practise Serving and Scaling Models questions for the Google Professional Machine Learning Engineer (PMLE) exam: original exam-style scenarios with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security
20 questions · Domain: Serving and Scaling Models

What the exam tests

What to know about Serving and Scaling Models

Serving and Scaling Models questions test whether you can apply the concept in context, not just recognise a definition.

  • How the topic appears in realistic exam-style scenarios.
  • Which detail in the question changes the correct answer.
  • How to eliminate plausible but wrong options.
  • How to connect the question back to the wider exam objective.

Watch out for

Common Serving and Scaling Models exam traps

  • Answering from memory before reading the full scenario.
  • Missing a constraint such as cost, availability, security, scope or command context.
  • Choosing a broad answer when the question asks for the most specific fix.
  • Ignoring why the wrong options are tempting.

Practice set

Serving and Scaling Models questions

20 questions · select your answer, then reveal the explanation

A data scientist wants to deploy a trained TensorFlow model to Vertex AI for online predictions. They need to serve predictions with low latency and want to leverage GPU acceleration. Which machine type should they select when creating the Vertex AI endpoint?

You are deploying a new version of a model to a Vertex AI endpoint that already has a champion model serving 100% of traffic. You want to gradually shift traffic to the new version while monitoring for errors. Which approach should you use?
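As a study aid for the scenario above, a gradual rollout can be pictured as a sequence of traffic-split mappings whose percentages always sum to 100. The sketch below is plain Python and does not call the Vertex AI SDK; the model identifiers, the stage percentages, and the 1% error threshold are all hypothetical values chosen for illustration.

```python
def rollout_schedule(champion_id, challenger_id, steps=(10, 25, 50, 100)):
    """Return one traffic-split dict per rollout stage.

    Each dict maps a deployed-model identifier to an integer percentage,
    and the two percentages in a stage always sum to 100.
    """
    schedule = []
    for pct in steps:
        if not 0 <= pct <= 100:
            raise ValueError(f"stage percentage out of range: {pct}")
        schedule.append({champion_id: 100 - pct, challenger_id: pct})
    return schedule


def should_roll_back(error_count, request_count, max_error_rate=0.01):
    """Return True when the observed error rate exceeds the allowed rate."""
    if request_count == 0:
        return False  # no traffic yet, so nothing to judge
    return error_count / request_count > max_error_rate


# Hypothetical identifiers; real deployed-model IDs are assigned by the platform.
stages = rollout_schedule("champion", "challenger")
for split in stages:
    print(split)
```

Between stages, the monitoring step in the scenario corresponds to a check like `should_roll_back`: if it returns True, traffic goes back to the previous split instead of advancing.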

A company is using Vertex AI Prediction with a custom container that performs preprocessing before inference. The preprocessing step is CPU-intensive and the inference step uses a GPU. They want to minimize prediction latency while optimizing cost. Which architecture should they use?

You need to serve a large embedding model for similarity search with low latency. The model was trained to generate 256-dimensional embeddings. You plan to use Vertex AI Vector Search. Which index type should you choose to balance accuracy and performance for a dataset with 10 million vectors?

A machine learning engineer needs to run batch predictions on 50 TB of data stored in BigQuery using a Vertex AI model. The model is a custom container. What is the most efficient way to set up the batch prediction job?

You have a Vertex AI endpoint with min_replica_count=2 and max_replica_count=10. You notice that during a traffic spike, the endpoint does not scale up quickly enough, causing increased latency. What should you do to improve autoscaling responsiveness?
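To reason about why an endpoint might scale late, it can help to work through a generic target-tracking rule: desired replicas grow in proportion to observed utilisation divided by target utilisation, clamped to the configured bounds. The sketch below is an assumption-laden illustration of that general rule, not Vertex AI's documented autoscaling algorithm; the function name and the percentage values are hypothetical.

```python
def desired_replicas(current_replicas, observed_pct, target_pct,
                     min_replicas, max_replicas):
    """Generic target-tracking rule (an illustration, not a platform API).

    Utilisation values are integer percentages so the ceiling division
    below stays exact.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    # Integer ceiling of current_replicas * observed_pct / target_pct.
    raw = -(-(current_replicas * observed_pct) // target_pct)
    return max(min_replicas, min(max_replicas, raw))


# Same load (2 replicas at 90% utilisation), two different targets.
print(desired_replicas(2, 90, 60, min_replicas=2, max_replicas=10))
print(desired_replicas(2, 90, 40, min_replicas=2, max_replicas=10))
```

Under this rule, a lower target asks for more replicas at the same observed load, which is one way to see the trade-off between responsiveness and cost that the question is probing.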

You are deploying a PyTorch model on Vertex AI using a custom container with NVIDIA Triton Inference Server. The model is a large transformer that requires GPU. You want to optimize GPU utilization and reduce memory footprint. Which technique should you apply?

A company wants to cache predictions for identical requests to reduce latency and cost. They use Vertex AI Prediction with a custom container. Which GCP service should they use to implement prediction caching?
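The idea of caching predictions for identical requests can be sketched independently of any particular service. In the minimal example below, a plain dict stands in for whatever key-value store is chosen, and `CachedPredictor`, `cache_key`, and the model version string are hypothetical names introduced only for illustration. The key is a hash of the canonicalised request, so two payloads with the same content but different key order hit the same entry.

```python
import hashlib
import json


def cache_key(model_version, instance):
    """Build a deterministic key from the model version and request payload."""
    canonical = json.dumps(instance, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


class CachedPredictor:
    """Wrap a prediction function with a look-aside cache."""

    def __init__(self, predict_fn, model_version, store=None):
        self.predict_fn = predict_fn
        self.model_version = model_version
        # A dict is an in-process stand-in for an external key-value store.
        self.store = {} if store is None else store
        self.hits = 0
        self.misses = 0

    def predict(self, instance):
        key = cache_key(self.model_version, instance)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        result = self.predict_fn(instance)
        self.store[key] = result
        return result
```

Including the model version in the key means that deploying a new version does not serve stale predictions from the old one.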

You have a Vertex AI endpoint that serves a model for real-time predictions. You want to update the model to a new version with zero downtime. Which approach should you take?

You are using Vertex AI Vector Search with an approximate nearest neighbor index. You need to update the index with new data every hour. The updates must be available for queries immediately. Which update method should you use?

An ML team wants to deploy multiple models (e.g., a recommender and a classifier) behind a single Vertex AI endpoint. The models have different resource requirements: the recommender needs GPU, the classifier needs high memory. How should they configure the endpoint?

You need to run a batch prediction job on Vertex AI using a model that requires custom preprocessing using a Python script. The preprocessing must be applied before inference. Which approach should you use?

A company is deploying a model on Vertex AI for online predictions with strict latency SLOs. The model requires GPU acceleration. Which TWO configurations should they consider to meet the SLOs while optimizing cost?

You are designing a batch prediction pipeline using Vertex AI. The input data is 100 TB of images stored in Cloud Storage. The model is a custom TensorFlow model that expects TFRecord format. The pipeline must be cost-effective and run within a time window of 2 hours. Which THREE steps should you include?
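The 2-hour window in this scenario implies a minimum aggregate throughput, and working it out makes the scale of the problem concrete. The sketch below does that arithmetic, assuming decimal terabytes (1 TB = 10^12 bytes) and a purely hypothetical per-worker throughput of 200 MB/s; real worker throughput depends on the model, the hardware, and the input format.

```python
import math


def required_throughput(total_bytes, window_seconds):
    """Aggregate bytes per second needed to finish inside the window."""
    return total_bytes / window_seconds


def workers_needed(total_bytes, window_seconds, per_worker_bytes_per_s):
    """Smallest worker count that finishes in time at the assumed rate."""
    return math.ceil(total_bytes / (window_seconds * per_worker_bytes_per_s))


TOTAL_BYTES = 100 * 10**12        # 100 TB, decimal units (assumption)
WINDOW_SECONDS = 2 * 3600         # 2-hour window from the scenario
PER_WORKER = 200 * 10**6          # hypothetical 200 MB/s per worker

gb_per_s = required_throughput(TOTAL_BYTES, WINDOW_SECONDS) / 10**9
print(f"required throughput: {gb_per_s:.1f} GB/s")
print("workers needed:", workers_needed(TOTAL_BYTES, WINDOW_SECONDS, PER_WORKER))
```

With these assumed numbers the job needs roughly 13.9 GB/s in aggregate, or 70 workers, which is why the scenario stresses parallelism and cost together.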

An organization wants to deploy a model on edge devices (e.g., Android phones) for offline inference. They trained a model using TensorFlow. Which THREE steps should they take to prepare and deploy the model?

You deployed a model to a Vertex AI endpoint with minReplicas=0 and maxReplicas=5. After sending prediction requests, you notice the endpoint takes about 30 seconds to respond initially, but subsequent requests are fast. What is the most likely cause?

You have a champion model serving 100% traffic on a Vertex AI endpoint. You want to deploy a challenger model and gradually shift 10% of traffic to it for A/B testing. What is the correct approach?

You need to run batch predictions on 10 TB of text data stored in BigQuery using a custom container model hosted in Vertex AI. What is the most cost-effective and simple approach?

Your team is deploying a large recommendation model on Vertex AI endpoints using GPUs. You need to minimise latency while optimising cost. The model serves many similar requests from the same users within short time windows. Which additional service would best reduce latency and cost?

You want to deploy a TensorFlow model to a Vertex AI endpoint and enable online predictions. The model requires GPU for inference. Which machine type should you select when deploying the model?

Focused Serving and Scaling Models sessions

Start a Serving and Scaling Models only practice session

Every question in these sessions is drawn from the Serving and Scaling Models domain — nothing else.

Frequently asked questions

What does the PMLE exam test about Serving and Scaling Models?
Serving and Scaling Models questions test whether you can apply the concept in context, not just recognise a definition.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Serving and Scaling Models questions in a focused session?
Yes — the session launcher on this page draws every question from the Serving and Scaling Models domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other PMLE topics?
Use the topic links above to move to related areas, or go back to the PMLE question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the PMLE exam covers. They are not copied from any real exam or dump site.