PMLE · topic practice

Serving and scaling models practice questions

Practise Serving and scaling models questions for the Google Professional Machine Learning Engineer (PMLE) exam: original exam-style scenarios with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security
20 questions · Domain: Serving and scaling models

What the exam tests

What to know about Serving and scaling models

Serving and scaling models questions test whether you can apply the concept in context, not just recognise a definition.

  • How the topic appears in realistic exam-style scenarios.
  • Which detail in the question changes the correct answer.
  • How to eliminate plausible but wrong options.
  • How to connect the question back to the wider exam objective.

Watch out for

Common Serving and scaling models exam traps

  • Answering from memory before reading the full scenario.
  • Missing a constraint such as cost, availability, security, scope or command context.
  • Choosing a broad answer when the question asks for the most specific fix.
  • Ignoring why the wrong options are tempting.

Practice set

Serving and scaling models questions

20 questions · select your answer, then reveal the explanation

A company deploys a TensorFlow model on Vertex AI Prediction with a single node. During peak hours, inference latency increases. What should they do first to reduce latency?

A data science team deploys a PyTorch model using Vertex AI Prediction. The model requires GPU for inference, but they notice high costs and underutilized GPUs during off-peak hours. What is the most cost-effective solution?

A company serves a scikit-learn model on Vertex AI Prediction but receives a 400 error with 'Prediction failed: Model evaluation error'. What is the most likely cause?

A company wants to serve a large XGBoost model that exceeds the 2GB limit for Vertex AI Prediction. What should they do?

A company deploys a model on Vertex AI Prediction with autoscaling enabled. They notice that during a traffic spike, new instances take several minutes to become available, causing high latency. What is the best solution?

A company uses Vertex AI Prediction with a custom container for a TensorFlow model. They notice that after deploying a new model version, requests still go to the old version. What is the most likely cause?

A company needs to serve a model with strict latency requirements (<100ms). They are using Vertex AI Prediction with CPU. During testing, latency is 150ms. What should they do?

A company is deploying a model for online predictions on Vertex AI. They want to minimize latency while also handling traffic spikes. Which TWO configurations should they choose?

A company trains a model using Vertex AI Training and then deploys it to Vertex AI Prediction. They notice that prediction requests fail with 'InvalidArgument: input tensor shape mismatch'. Which THREE are possible causes?
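
To make the shape-mismatch scenario concrete, here is a minimal sketch of a client-side pre-flight check that compares each request instance against the input shape the model expects. The helper names (`shape_of`, `mismatches`) and the expected shape `(3,)` are hypothetical and chosen only for illustration; they are not part of any Vertex AI API.

```python
from typing import List, Optional, Sequence, Tuple


def shape_of(value) -> Tuple[int, ...]:
    """Return the shape of a nested list, following the first element at each level."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def mismatches(
    instances: Sequence, expected: Sequence[Optional[int]]
) -> List[Tuple[int, Tuple[int, ...]]]:
    """List (index, actual_shape) for instances whose shape differs from `expected`.

    A None entry in `expected` stands for a dimension of any size.
    """
    bad = []
    for i, inst in enumerate(instances):
        actual = shape_of(inst)
        ok = len(actual) == len(expected) and all(
            e is None or e == a for e, a in zip(expected, actual)
        )
        if not ok:
            bad.append((i, actual))
    return bad


# Hypothetical request: the model expects three features per instance.
instances = [[0.1, 0.2, 0.3], [0.4, 0.5], [[0.6, 0.7, 0.8]]]
print(mismatches(instances, (3,)))  # [(1, (2,)), (2, (1, 3))]
```

The second instance is missing a feature and the third carries an extra nesting level, two ways the shape sent at prediction time can drift from the shape used in training.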

A company wants to reduce costs for serving a model on Vertex AI Prediction without sacrificing availability. Which THREE strategies should they consider?

You are an ML engineer at a global e-commerce company. Your team has developed a deep learning model for product recommendation that runs on Vertex AI Prediction. The model is deployed on a single n1-highmem-2 instance (CPU only) with autoscaling enabled (min replicas=1, max replicas=10). During Black Friday, traffic spikes to 1,000 queries per second (QPS), and you observe that latency increases from 50ms to over 5000ms, and many requests time out. You check the monitoring dashboard and see that CPU utilization is at 100% on the single instance, and autoscaling is not triggering quickly enough. The team has a budget for this service and wants to handle the spike without compromising latency. What should you do?
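
For capacity questions like this one, a rough back-of-the-envelope estimate can help. The sketch below applies Little's law (requests in flight = arrival rate × latency) to the figures in the scenario. The per-replica concurrency of 2 (one request per vCPU on an n1-highmem-2) and the 30% headroom are assumptions made for illustration, not measured values or Vertex AI defaults.

```python
import math


def replicas_needed(
    peak_qps: float,
    latency_s: float,
    concurrency_per_replica: int,
    headroom: float = 0.3,
) -> int:
    """Estimate replica count from Little's law plus a safety margin."""
    in_flight = peak_qps * latency_s              # average requests being served at once
    raw = in_flight / concurrency_per_replica     # replicas with zero spare capacity
    return max(1, math.ceil(raw * (1 + headroom)))


# Scenario figures: 1,000 QPS at the healthy 50 ms latency.
# Assumed: each replica serves 2 requests concurrently (hypothetical).
estimate = replicas_needed(peak_qps=1000, latency_s=0.05, concurrency_per_replica=2)
print(estimate)  # 33
```

Under these assumptions the estimate is well above the configured maximum of 10 replicas, which suggests the replica ceiling and the machine type both deserve attention, alongside how quickly autoscaling reacts.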

A data science team has trained a TensorFlow model and wants to serve it online with minimal latency. Which Vertex AI deployment option should they use to ensure the model can handle traffic spikes without manual scaling?

A financial services company deploys a fraud detection model on Vertex AI. The model must make predictions in under 100ms. After deployment, latency spikes to 300ms during peak hours. The model is a large ensemble with 500MB size. Which action is most likely to reduce latency?

A company serves a PyTorch model using a custom container on Vertex AI Prediction. They notice that after a few hours, the endpoint returns 502 errors. The logs show 'Out of memory' errors. The container has a memory limit of 4GB, and the model loads a 3GB vocabulary file. What is the most likely cause and best fix?

A team deploys a model on Vertex AI that uses a custom prediction routine (CPR) with a dependency on a native library. The container crashes with 'ImportError: libcudart.so.11.0: cannot open shared object file'. How should they resolve this?

A company is deploying a machine learning model for real-time inference on Vertex AI. Which TWO practices improve serving performance and reliability?

A team is serving a large language model (LLM) on Vertex AI using a custom container. They want to reduce tail latency. Which THREE strategies should they consider?

A model deployed on Vertex AI Prediction repeatedly exits with code 137. What is the most likely cause?

Exhibit

Refer to the exhibit.
```
Log entry:
{
  "severity": "ERROR",
  "message": "Model server process exited with code 137 (SIGKILL)",
  "container": {
    "memory_usage_mb": 4096,
    "memory_limit_mb": 4096
  },
  "@type": "type.googleapis.com/google.cloud.ml.v1.PredictionLog"
}
```
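
As a reading aid for the exhibit, the sketch below decodes the exit code and compares the two memory fields. It relies on the common shell and container-runtime convention that an exit code above 128 means the process was terminated by signal number (code − 128). The function names are hypothetical, and the log entry is copied from the exhibit.

```python
import json
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a process exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with error status {code}"


def memory_at_limit(entry: dict) -> bool:
    """Return True when reported memory usage has reached the container limit."""
    container = entry.get("container", {})
    usage = container.get("memory_usage_mb")
    limit = container.get("memory_limit_mb")
    return usage is not None and limit is not None and usage >= limit


entry = json.loads("""
{
  "severity": "ERROR",
  "message": "Model server process exited with code 137 (SIGKILL)",
  "container": {"memory_usage_mb": 4096, "memory_limit_mb": 4096}
}
""")

print(describe_exit_code(137))  # killed by SIGKILL
print(memory_at_limit(entry))   # True
```

A SIGKILL that arrives while memory usage equals the limit is the typical signature of the kernel's out-of-memory killer, although the log alone does not prove it.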

You are a machine learning engineer at a retail company. You have deployed a product recommendation model on Vertex AI Prediction using a custom container. The model is a TensorFlow SavedModel that computes embeddings using a large lookup table. The endpoint is configured with 2 replicas on n1-standard-4 (4 vCPU, 15 GB memory) machines. After deployment, you notice that the endpoint's memory usage grows over time, eventually reaching 90% and causing requests to fail with 503 errors. The container logs show no errors, but the memory usage graph shows a steady increase. The model loads the embedding table (5 GB) at startup. You suspect a memory leak. Which course of action should you take first to diagnose and resolve the issue?

A company is deploying a machine learning model for real-time fraud detection. The model must respond to requests within 100ms. The model is a TensorFlow model and will be deployed on Google Kubernetes Engine (GKE). Which Google Cloud service should be used to serve the model to minimize latency?

Focused Serving and scaling models sessions

Start a Serving and scaling models only practice session

Every question in these sessions is drawn from the Serving and scaling models domain — nothing else.

Related practice questions

Related PMLE topic practice pages

Move into related areas when this topic feels solid.

Frequently asked questions

What does the PMLE exam test about Serving and scaling models?
Serving and scaling models questions test whether you can apply the concept in context, not just recognise a definition.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Serving and scaling models questions in a focused session?
Yes — the session launcher on this page draws every question from the Serving and scaling models domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other PMLE topics?
Use the topic links above to move to related areas, or go back to the PMLE question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the PMLE exam covers. They are not copied from any real exam or dump site.