How should I use these Serving and scaling models practice questions?

Read each scenario carefully and choose your answer before revealing the explanation. Then check why your choice was right or wrong. Repeat until the reasoning feels automatic.

Can I practise just Serving and scaling models questions in a focused session?

Yes — use the session launcher on this page to start a 10-, 20-, 30- or 50-question session drawn entirely from the Serving and scaling models domain.

PMLE · topic practice

Serving and scaling models practice questions

Practise Serving and scaling models questions for the Google Professional Machine Learning Engineer (PMLE) exam: original exam-style scenarios with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations, not to memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security

20 questions · Domain: Serving and scaling models

What the exam tests

What to know about Serving and scaling models

Serving and scaling models questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Watch out for

Common Serving and scaling models exam traps

▸ Answering from memory before reading the full scenario.
▸ Missing a constraint such as cost, availability, security, scope or command context.
▸ Choosing a broad answer when the question asks for the most specific fix.
▸ Ignoring why the wrong options are tempting.

Practice set

Serving and scaling models questions

20 questions · select your answer, then reveal the explanation

Question 1 · easy · multiple choice

A company deploys a TensorFlow model on Vertex AI Prediction with a single node. During peak hours, inference latency increases. What should they do first to reduce latency?

Trap 1: Increase the machine type of the node

Increasing machine type may help but does not address scaling under load.

Trap 2: Decrease the min replicas to 0

Reducing min replicas may cause cold starts and increase latency.

Trap 3: Enable automatic batching of requests

Batching increases latency as requests wait to be batched.

A. Enable autoscaling for the deployment
Correct: Autoscaling adds nodes during peak traffic, reducing latency.

B. Increase the machine type of the node
Why wrong: Increasing machine type may help but does not address scaling under load.

C. Decrease the min replicas to 0
Why wrong: Reducing min replicas may cause cold starts and increase latency.

D. Enable automatic batching of requests
Why wrong: Batching increases latency as requests wait to be batched.
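
The reasoning behind the autoscaling answer can be sketched with a small calculation. The function below is a simplified, hypothetical model of utilisation-based autoscaling, not Vertex AI's actual algorithm: it assumes the platform tries to bring average CPU utilisation back to a target by scaling the replica count proportionally, within configured bounds. The names `min_replicas` and `max_replicas` are illustrative stand-ins for the minimum and maximum replica counts set at deployment time, and the 60% target is an assumed example value.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed_utilization: float,
    target_utilization: float = 0.60,  # assumed example target, not a documented guarantee
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Estimate a replica count under a simplified proportional scaling rule.

    This is an illustrative model only: scale the current replica count by
    the ratio of observed to target utilisation, round up, then clamp the
    result to the [min_replicas, max_replicas] range.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")

    raw = math.ceil(current_replicas * observed_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # A single node running hot at peak: the rule asks for more replicas.
    print(desired_replicas(current_replicas=1, observed_utilization=0.95))
    # With max_replicas=1 (no autoscaling headroom) the node count cannot grow,
    # which matches the scenario where latency rises under load.
    print(desired_replicas(current_replicas=1, observed_utilization=0.95, max_replicas=1))
```

Under this model, a single-node deployment behaves like the second call: however high utilisation climbs, the replica count stays at one, so requests queue and latency rises. Raising the maximum replica count gives the scaler room to add capacity.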

A company deploys a TensorFlow model on Vertex AI Prediction with a single node. During peak hours, inference latency increases. What should they do first to reduce latency?

A data science team deploys a PyTorch model using Vertex AI Prediction. The model requires GPU for inference, but they notice high costs and underutilized GPUs during off-peak hours. What is the most cost-effective solution?

A company serves a scikit-learn model on Vertex AI Prediction but receives a 400 error with 'Prediction failed: Model evaluation error'. What is the most likely cause?

A company wants to serve a large XGBoost model that exceeds the 2GB limit for Vertex AI Prediction. What should they do?

A company deploys a model on Vertex AI Prediction with autoscaling enabled. They notice that during a traffic spike, new instances take several minutes to become available, causing high latency. What is the best solution?

A company uses Vertex AI Prediction with a custom container for a TensorFlow model. They notice that after deploying a new model version, requests still go to the old version. What is the most likely cause?

A company needs to serve a model with strict latency requirements (<100ms). They are using Vertex AI Prediction with CPU. During testing, latency is 150ms. What should they do?

A company is deploying a model for online predictions on Vertex AI. They want to minimize latency while also handling traffic spikes. Which TWO configurations should they choose?

A company trains a model using Vertex AI Training and then deploys it to Vertex AI Prediction. They notice that prediction requests fail with 'InvalidArgument: input tensor shape mismatch'. Which THREE are possible causes?

A company wants to reduce costs for serving a model on Vertex AI Prediction without sacrificing availability. Which THREE strategies should they consider?

A data science team has trained a TensorFlow model and wants to serve it online with minimal latency. Which Vertex AI deployment option should they use to ensure the model can handle traffic spikes without manual scaling?

A financial services company deploys a fraud detection model on Vertex AI. The model must make predictions in under 100ms. After deployment, latency spikes to 300ms during peak hours. The model is a large ensemble with 500MB size. Which action is most likely to reduce latency?

A team deploys a model on Vertex AI that uses a custom prediction routine (CPR) with a dependency on a native library. The container crashes with 'ImportError: libcudart.so.11.0: cannot open shared object file'. How should they resolve this?

A company is deploying a machine learning model for real-time inference on Vertex AI. Which TWO practices improve serving performance and reliability?

A team is serving a large language model (LLM) on Vertex AI using a custom container. They want to reduce tail latency. Which THREE strategies should they consider?

A model deployed on Vertex AI Prediction repeatedly exits with code 137. What is the most likely cause?

A company is deploying a machine learning model for real-time fraud detection. The model must respond to requests within 100ms. The model is a TensorFlow model and will be deployed on Google Kubernetes Engine (GKE). Which Google Cloud service should be used to serve the model to minimize latency?
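
For the exit code 137 scenario in the list above, it helps to know how such codes are usually read. On POSIX systems, an exit status above 128 conventionally means the process was terminated by signal number (status minus 128). The helper below is a minimal sketch of that decoding; `signal_from_exit_code` is a hypothetical name, and the sketch assumes a POSIX platform where Python's `signal.Signals` enum includes the relevant signals.

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Return the signal name implied by a shell-style exit code, if any.

    By POSIX shell convention, exit codes above 128 indicate termination
    by signal (exit_code - 128). Returns None when the code does not map
    to a signal known on this platform.
    """
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return None


if __name__ == "__main__":
    print(signal_from_exit_code(137))
    print(signal_from_exit_code(1))
```

Code 137 decodes to signal 9 (SIGKILL). In containerised serving, a repeated SIGKILL is commonly the kernel's out-of-memory killer at work, so memory limits and model size are a reasonable first place to look. The page itself does not state the answer to that question, so treat this as one likely reading rather than the confirmed explanation.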

Related PMLE topic practice pages

Scaling prototypes into ML models practice questions

Automating and orchestrating ML pipelines practice questions

Collaborating within and across teams to manage data and models practice questions

Architecting low-code ML solutions practice questions

Collaborating to manage data and models practice questions

Serving and scaling models practice questions

Monitoring ML solutions practice questions

Solving business challenges with ML practice questions

PMLE fundamentals practice questions

PMLE scenario practice questions

PMLE troubleshooting practice questions
