How many Serving and scaling models questions are on the PMLE exam?

The Serving and scaling models domain is one of the weighted domains on the PMLE exam. The Courseiva question bank has 95 practice questions for this domain.

Free PMLE Serving and scaling models Practice Questions (2026)

Q: What does the Serving and scaling models domain cover on the PMLE exam?

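The Serving and scaling models domain covers the concepts and skills tested in this area of the PMLE exam blueprint published by Google Cloud. Judging by the questions listed on this page, that includes online and batch prediction on Vertex AI, autoscaling and replica configuration, latency and cold-start tuning, custom containers, machine-type and GPU selection, traffic splitting for canary and A/B rollouts, and troubleshooting serving errors such as out-of-memory failures.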

Q: How can I practice Serving and scaling models questions for PMLE?

Click any of the 95 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Serving and scaling models domain.

All PMLE Serving and scaling models questions (95)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A company deploys a TensorFlow model on Vertex AI Prediction with a single node. During peak hours, inference latency increases. What should they do first to reduce latency?

A data science team deploys a PyTorch model using Vertex AI Prediction. The model requires GPU for inference, but they notice high costs and underutilized GPUs during off-peak hours. What is the most cost-effective solution?

A company serves a scikit-learn model on Vertex AI Prediction but receives a 400 error with 'Prediction failed: Model evaluation error'. What is the most likely cause?

A company wants to serve a large XGBoost model that exceeds the 2 GB limit for Vertex AI Prediction. What should they do?

A company deploys a model on Vertex AI Prediction with autoscaling enabled. They notice that during a traffic spike, new instances take several minutes to become available, causing high latency. What is the best solution?

A company uses Vertex AI Prediction with a custom container for a TensorFlow model. They notice that after deploying a new model version, requests still go to the old version. What is the most likely cause?

A company needs to serve a model with strict latency requirements (<100ms). They are using Vertex AI Prediction with CPU. During testing, latency is 150ms. What should they do?

A company is deploying a model for online predictions on Vertex AI. They want to minimize latency while also handling traffic spikes. Which TWO configurations should they choose?

A company trains a model using Vertex AI Training and then deploys it to Vertex AI Prediction. They notice that prediction requests fail with 'InvalidArgument: input tensor shape mismatch'. Which THREE are possible causes?

A company wants to reduce costs for serving a model on Vertex AI Prediction without sacrificing availability. Which THREE strategies should they consider?

You are an ML engineer at a global e-commerce company. Your team has developed a deep learning model for product recommendation that runs on Vertex AI Prediction. The model is deployed on a single n1-highmem-2 instance (CPU only) with autoscaling enabled (min replicas=1, max replicas=10). During Black Friday, traffic spikes to 1000 requests per second (QPS), and you observe that latency increases from 50ms to over 5000ms, and many requests time out. You check the monitoring dashboard and see that CPU utilization is at 100% on the single instance, and autoscaling is not triggering quickly enough. The team has a budget for this service and wants to handle the spike without compromising latency. What should you do?

A data science team has trained a TensorFlow model and wants to serve it online with minimal latency. Which Vertex AI deployment option should they use to ensure the model can handle traffic spikes without manual scaling?

A financial services company deploys a fraud detection model on Vertex AI. The model must make predictions in under 100ms. After deployment, latency spikes to 300ms during peak hours. The model is a large ensemble, 500 MB in size. Which action is most likely to reduce latency?

A company serves a PyTorch model using a custom container on Vertex AI Prediction. They notice that after a few hours, the endpoint returns 502 errors. The logs show 'Out of memory' errors. The container has a memory limit of 4GB, and the model loads a 3GB vocabulary file. What is the most likely cause and best fix?

A team deploys a model on Vertex AI that uses a custom prediction routine (CPR) with a dependency on a native library. The container crashes with 'ImportError: libcudart.so.11.0: cannot open shared object file'. How should they resolve this?

A company is deploying a machine learning model for real-time inference on Vertex AI. Which TWO practices improve serving performance and reliability?

A team is serving a large language model (LLM) on Vertex AI using a custom container. They want to reduce tail latency. Which THREE strategies should they consider?

A model deployed on Vertex AI Prediction repeatedly exits with code 137. What is the most likely cause?

You are a machine learning engineer at a retail company. You have deployed a product recommendation model on Vertex AI Prediction using a custom container. The model is a TensorFlow SavedModel that computes embeddings using a large lookup table. The endpoint is configured with 2 replicas on n1-standard-4 (4 vCPU, 15 GB memory) machines. After deployment, you notice that the endpoint's memory usage grows over time, eventually reaching 90% and causing requests to fail with 503 errors. The container logs show no errors, but the memory usage graph shows a steady increase. The model loads the embedding table (5 GB) at startup. You suspect a memory leak. Which course of action should you take first to diagnose and resolve the issue?

A company is deploying a machine learning model for real-time fraud detection. The model must respond to requests within 100ms. The model is a TensorFlow model and will be deployed on Google Kubernetes Engine (GKE). Which Google Cloud service should be used to serve the model to minimize latency?

A data science team has trained a large deep learning model using Vertex AI Workbench. They want to deploy it to Vertex AI Prediction for online serving. The model is stored in a custom container with a Python-based web server. Which TWO actions should the team take to ensure optimal performance and cost?

A data scientist deployed a model to Vertex AI Prediction. When making a prediction request as shown in the exhibit, they receive a 400 error. What is the most likely cause?

Drag and drop the steps to implement a CI/CD pipeline for ML models using Cloud Build and Vertex AI in the correct order.

Drag and drop the steps to set up a feature store for ML features using Vertex AI Feature Store in the correct order.

Match each feature engineering technique to its description.

Match each optimization algorithm to its characteristic.

A company deploys a model on Vertex AI Prediction for real-time inference. Users report intermittent high latency during peak hours. The model is deployed on a single machine type with `min_replica_count=1` and `max_replica_count=5`. Autoscaling is enabled based on CPU utilization. What is the most likely cause of the latency spikes?

A team needs to serve a PyTorch model for production inference with strict latency requirements (p99 < 100ms). The model has dynamic control flow and uses custom kernels compiled with torch.jit. Which serving approach should they recommend?

A data science team deploys a large language model (LLM) on Vertex AI Prediction using an NVIDIA A100 GPU. The end-to-end latency is acceptable, but the cost is high due to low GPU utilization. The model is stateless and requests are independent. Which strategy would most effectively reduce cost per prediction?

A company has deployed a fraud detection model on Vertex AI Prediction. After three months, the model's accuracy has degraded, and the business is losing money due to undetected fraud. What should the team implement to proactively detect such issues?

A team wants to deploy two versions of a model (v1 and v2) on Vertex AI Endpoint to conduct an A/B test. They need to split traffic so that 10% of requests go to v2. Which configuration achieves this?

A model serving team notices that during a flash sale, a real-time recommendation model experiences sudden spikes in traffic, causing some requests to time out. The endpoint is configured with `min_replica_count=3`, `max_replica_count=10`, and autoscaling metric set to `target_utilization=0.6` on CPU. Despite this, autoscaling is too slow. What change will most improve the autoscaling responsiveness?

A machine learning engineer wants to manage multiple model versions and facilitate collaboration across teams. The goal is to track model lineage, versioning, and approvals. Which Vertex AI service should they use?

A company needs to serve a model for low-frequency inference requests (a few hundred per month) from multiple regions. The priority is simplicity and minimal cost without maintaining infrastructure. Which serving option should they choose?

A team deploys a real-time model using a custom container on Vertex AI Prediction. The container is large (5 GB) and cold starts are causing latency spikes. The endpoint is configured with `min_replica_count=0` to reduce cost. The team wants to keep the cost low while reducing cold starts. What is the best approach?

Which TWO are best practices for deploying models to Vertex AI Prediction? (Choose 2.)

A model serving team is experiencing high latency in production. Which TWO actions should they take to diagnose the root cause? (Choose 2.)

Which THREE factors are critical when designing a model serving architecture for a global user base with strict latency SLAs? (Choose 3.)

A company deploys a model on Vertex AI Endpoints for real-time inference. They notice latency spikes during peak hours. Which action is most effective to reduce latency without sacrificing accuracy?

A team uses Vertex AI Prediction with a custom container. They want to perform canary deployments by sending 5% of traffic to a new model version. Which method should they use?

A machine learning engineer notices that a model served on Vertex AI Endpoints returns predictions that are consistently 20% slower during the first request after idle (cold start). They are using automatic scaling with min replicas=1. What is the most likely cause and best solution?

An organization needs to serve a large model (10 GB) with low latency across multiple regions. Which Vertex AI feature best meets this requirement?

A data scientist uses Vertex AI Workbench to train a model and then deploys it to an endpoint. They want to automate the retraining and redeployment pipeline when new data arrives. Which service should they use?

A model deployed on Vertex AI Endpoints returns predictions, but the performance metrics (e.g., AUC) degrade over time. The input data distribution is shifting. The team wants to detect and alert on this drift automatically. Which set of actions should they take?

For a low-latency real-time serving requirement, which type of Vertex AI Endpoint is appropriate?

A team deploys a model using Vertex AI Endpoint with automatic scaling. They observe that during traffic spikes, new instances take a long time to become ready, causing high latency for some requests. What should they configure to reduce this startup time?

A company uses Vertex AI Endpoints for model serving and wants to implement A/B testing between model versions. They need to gradually shift traffic from the old to the new version while monitoring performance. Which Vertex AI feature allows this with minimal operational overhead?

Which TWO options are best practices for reducing model serving latency on Vertex AI Endpoints? (Choose two.)

Which THREE factors should be considered when choosing between using Vertex AI Endpoints and Cloud Run for model serving? (Choose three.)

Which TWO options can help detect model performance degradation in production? (Choose two.)

Refer to the exhibit. A data scientist deploys a new model version (model_v2) to an existing endpoint with 20% traffic. After a few days, they notice that model_v2's error rate is higher than model_v1's. They want to route all traffic back to model_v1 immediately. Which command achieves this with minimal disruption?

Refer to the exhibit. A team deploys a model with the above configuration. They observe that during traffic spikes, the endpoint does not scale up quickly enough, causing increased latency. The average CPU utilization never exceeds 50%. What is the most likely reason for the slow scaling?

Refer to the exhibit. A team deploys a model using Cloud Run. They notice that after scaling up, the new instances take about 90 seconds to become ready and serve requests. They want to reduce this startup time. Which configuration change is most likely to help?

A company deploys a custom TensorFlow model to Vertex AI Endpoint for online predictions. After deployment, prediction latency is consistently high (over 500ms) even under low traffic. The model is CPU-only and the default machine type (n1-standard-2) is used. Which action will most likely reduce prediction latency?

A data scientist runs a batch prediction job on Vertex AI using a custom container. The job processes a large JSONL file (10 GB) and fails with an out-of-memory error. The machine type is n1-standard-4 (15 GB memory). Which action should be taken to resolve the error while minimizing cost?

A company needs to serve a model for real-time predictions with a strict latency SLA of 100ms at the 99th percentile. The model is lightweight and traffic patterns are highly variable with occasional spikes. Which deployment strategy best meets the SLA while controlling cost?

A machine learning team wants to perform A/B testing between two model versions (v1 and v2) on Vertex AI Endpoint. They need to gradually route 10% of traffic to v2 while monitoring performance. What is the most efficient way to achieve this?

A team deploys a TensorFlow model using a custom container to Vertex AI Endpoint. The container expects the saved model at the /model directory, but predictions fail with a 'model not found' error. The team used the default Vertex AI serving container in the past. What is the most likely cause?

A company deploys a model on Vertex AI Endpoint and expects high traffic spikes during promotional events. The current configuration uses manual scaling with 2 replicas. Which autoscaling configuration should they use to handle spikes while minimizing cost during normal traffic?

A startup wants to deploy a small machine learning model for real-time predictions but has a very limited budget. Traffic is minimal and predictable. They want to avoid paying for idle resources. Which serving option is most cost-effective?

A data engineer is troubleshooting a Vertex AI Endpoint that serves a large BERT model. After deployment, many prediction requests fail with 'Out of Memory' errors. The machine type is n1-standard-8 (30 GB memory) with no accelerator. Which action will most likely resolve the issue?

After deploying a new version of a model to a Vertex AI Endpoint, the team notices that predictions are still returning results from the old version. The deployment command used a traffic split of 100% to the new version. What is the most likely cause?

Which TWO actions can help reduce prediction latency for a model deployed on Vertex AI Endpoint without changing the model architecture?

A company deploys a model to Vertex AI Endpoint with autoscaling enabled. During a traffic spike, they observe high tail latency (99th percentile > 2s). Which TWO factors are most likely contributing to this latency?

A team wants to serve a large PyTorch model (3 GB) for online predictions with low latency. Which THREE actions should they take?

You deploy a PyTorch model to Vertex AI Online Prediction. After deployment, you observe that inference latency is approximately 300ms per request, but the desired SLA is under 100ms. The model uses a custom container with CPU only. Which action is most likely to reduce latency to the target?

Your team is serving a large language model on Vertex AI using a custom container. The endpoint experiences intermittent 502 errors during traffic spikes. The autoscaling configuration uses a CPU utilization target of 60% and the model is deployed on n1-standard-4 instances. The model requires significant memory. Which combination of changes is most likely to resolve the issue?

You need to serve a TensorFlow model that has a cold start latency of 20 seconds. The model is used for a real-time application with unpredictable traffic, but occasional bursts require immediate responses. What is the best deployment strategy to minimize both cold start impact and cost?

Your team deploys a multi-model endpoint on Vertex AI with two models: Model A (small, low latency) and Model B (large, high latency). You configure traffic splitting so that 90% goes to Model A and 10% to Model B. However, you notice that the latency for Model A increases when Model B receives traffic. What is the most likely cause?

You are deploying a scikit-learn model for online predictions. The model size is 200 MB. You want to minimize latency and cost. Which serving option should you choose?

A company is serving a model for their e-commerce website. They expect traffic to be low at night and very high during flash sales. They want to minimize costs while ensuring availability during spikes. Which autoscaling configuration should they use?

Your model serving endpoint on Vertex AI is experiencing increased memory usage after a recent update. The model was converted from TensorFlow to TF Lite for faster inference. You notice that the endpoint's instances occasionally get killed due to out-of-memory (OOM) errors. What is the most likely cause?

You are using Vertex AI continuous evaluation (model monitoring) for your deployed model. You receive an alert that the prediction distribution is significantly different from the training distribution. What should you do first?

You have a model that requires GPU for efficient inference. You deploy it on Vertex AI with a single NVIDIA T4 GPU accelerator and notice that the GPU utilization hovers around 30%. The endpoint has 10 replicas. What is the best way to improve cost efficiency while maintaining throughput?

Which TWO actions can help reduce the latency of online prediction requests for a deep learning model served on Vertex AI?

Which THREE factors should you consider when deciding between online prediction and batch prediction on Vertex AI?

Which TWO statements are true about canary deployments for Vertex AI endpoints?

You run the above command to deploy a new model version to an existing endpoint. After deployment, you observe that the endpoint's previous model version is still receiving 100% of traffic. What is the most likely reason for this?

You are troubleshooting a Vertex AI endpoint for a customer. The exhibit shows the endpoint configuration. The customer reports that Model A is experiencing high latency during peaks. Model B runs fine. What is the most likely cause?

You are a machine learning engineer at a financial technology company. You have deployed a complex ensemble model consisting of three sub-models (XGBoost, TensorFlow, and PyTorch) for real-time fraud detection. The model is served on Vertex AI online prediction with a custom container that orchestrates the three models sequentially. The endpoint currently uses n1-highmem-8 machines with no accelerators. You are experiencing high latency (avg 500ms) during peak trading hours (9:30 AM - 4:00 PM EST), exceeding the 200ms SLA. The container is CPU-bound, and memory usage is around 60%. The model weights total 500 MB. You have already tried increasing the batch size per request from 1 to 4, which reduced latency slightly but not enough. The traffic pattern is very spiky, with sudden bursts of up to 1000 requests per second. Your goal is to meet the latency SLA without significantly increasing cost. Which action should you take?

A company has deployed a TensorFlow model on Vertex AI Prediction for real-time inference. They notice that during peak hours, the prediction latency increases significantly, and some requests time out. The model requires GPU acceleration. Which action should they take to reduce latency and avoid timeouts?

A data science team deploys a custom container on Vertex AI Prediction for a PyTorch model. After deployment, the model returns predictions that are consistently off by a constant factor. The model performed correctly during local testing. What is the most likely cause?

A company needs to serve a large Transformer model (5 GB) with strict latency requirements (< 50 ms) and throughput of 1000 requests per second. The model is in SavedModel format. They are considering deployment options on Google Cloud. Which approach best meets these requirements?

A machine learning engineer notices that the Vertex AI Prediction endpoint's error rate has increased over the past week. The model was retrained with new data and redeployed. Which step should the engineer take first to diagnose the issue?

A company wants to deploy a model for real-time inference with high availability across multiple Google Cloud regions. The model is small and stateless. Which TWO steps should they take? (Choose two.)

A company runs batch predictions on a large dataset using Vertex AI Batch Prediction. They want to reduce costs without significantly increasing processing time. Which THREE actions should they take? (Choose three.)

Your team has deployed a scikit-learn model using a custom container on Vertex AI Prediction. The model receives about 100 requests per second, and the endpoint is configured with a single n1-standard-4 machine. You notice that response times are around 200 ms on average, but occasionally spike to over 10 seconds during traffic bursts. You have set the min replicas to 1 and max replicas to 10. Despite this, spikes still occur. What is the most likely cause and the best course of action?

You are using Vertex AI Training to train a model and then automatically deploy the best candidate to a Vertex AI Prediction endpoint via the Vertex AI Model Registry. However, after deployment, you notice that the endpoint returns predictions for the new model, but they are significantly different from the evaluation metrics computed during training. The training scripts used TensorFlow with a serving input function. What is the most likely issue and how would you fix it?

Your organization has a large production system that uses Vertex AI Prediction for an NLP model with a 2 GB memory footprint. The endpoint is configured with 5 replicas, each using an n1-standard-4 with a single T4 GPU. Recently, you observed an increase in 503 errors during peak hours. Cloud Monitoring shows that GPU utilization is consistently above 90% across all replicas, while CPU and memory are below 50%. You have already increased the max replicas to 10, but the errors persist because the increased replicas also become saturated. What should you do to resolve the issue?

You are responsible for deploying a real-time recommendation model that uses a large embedding table (5 GB) and a small neural network. The model is served through a custom container on Vertex AI Prediction. The end-to-end latency requirement is under 200 ms. During load testing with 500 QPS, you observe that latency increases linearly with batch size. You are currently using a single replica with an n1-standard-8 machine and one T4 GPU. The embedding table is loaded entirely in GPU memory. However, CPU utilization is at 100% while GPU is at 30%. What is the best approach to meet the latency requirement at scale?

Your team has deployed a PyTorch model using a custom container on Vertex AI Prediction. The model uses dynamic batching to combine incoming requests. You notice that the average latency is 150 ms, but the 99th percentile latency is 2 seconds. Cloud Monitoring shows that the CPU is idle much of the time, and GPU utilization is around 70%. The model is deployed on a single n1-standard-4 with a T4 GPU. You suspect the issue is related to request queuing. Which change would most effectively reduce tail latency?

You manage a multi-tenant serving system on Vertex AI Prediction where multiple models are deployed in a single endpoint using model versioning. One particular model version (v2) is consuming excessive resources, causing latency spikes for other versions. You need to isolate this model to prevent interference. The models are all in TensorFlow SavedModel format. What is the best approach?

An ML engineer is deploying a large BERT-based natural language processing model for real-time inference on Vertex AI Prediction. The model has a large memory footprint (2GB) and experiences unpredictable traffic spikes up to 10x the baseline. The engineer needs to minimize latency and cost while handling spiky traffic. Which TWO actions should the engineer take? (Choose two.)

An ML engineer notices that predictions are taking longer than expected under moderate traffic. Reviewing the endpoint configuration, what is the most likely cause of the high latency?

A company has deployed a computer vision model on Vertex AI Prediction using a custom container. The model processes high-resolution images and serves predictions to a mobile application. Recently, users have reported that predictions sometimes take over 10 seconds, and the application times out. The ML engineer's monitoring shows that the endpoint's CPU utilization is consistently high (above 85%) and that the request latency spikes during peak hours. The model is deployed on n1-standard-4 machines with automatic scaling set to minReplicaCount=1 and maxReplicaCount=5. The engineer has observed that the endpoint rarely scales beyond 2 replicas even during peak hours. What should the engineer do to reduce prediction latency?
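
Several of the questions above (for example, the A/B test that sends 10% of requests to v2) turn on how traffic is divided between deployed model versions. As a rough illustration, a Vertex AI endpoint expresses this as a traffic split that maps each deployed model's ID to an integer percentage, with the percentages summing to 100. The sketch below builds and validates such a mapping in plain Python. The function name `canary_split` and the IDs are hypothetical, and no Vertex AI API is called.

```python
from __future__ import annotations


def canary_split(stable_id: str, canary_id: str, canary_percent: int) -> dict[str, int]:
    """Build a traffic split: deployed-model ID -> integer percentage.

    The IDs are placeholders chosen for illustration; a real endpoint
    assigns its own deployed-model IDs.
    """
    if stable_id == canary_id:
        raise ValueError("stable and canary IDs must differ")
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    split = {stable_id: 100 - canary_percent, canary_id: canary_percent}
    # The percentages across all deployed models must add up to 100.
    if sum(split.values()) != 100:
        raise ValueError("traffic split must sum to 100")
    return split


# Send 10% of requests to the new version, as in the A/B test questions.
ab_test = canary_split("v1-deployed-id", "v2-deployed-id", 10)

# Rolling back is the same mapping with the canary share set to 0.
rollback = canary_split("v1-deployed-id", "v2-deployed-id", 0)

print(ab_test)
print(rollback)
```

Treating rollback as just another split (100 to the stable version, 0 to the new one) is one way to think about the "route all traffic back" scenarios. It is not a statement of which answer option a given question expects.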

Other PMLE exam domains

Scaling prototypes into ML models
Automating and orchestrating ML pipelines
Collaborating within and across teams to manage data and models
Architecting low-code ML solutions
Collaborating to manage data and models
Monitoring ML solutions
Solving business challenges with ML

Frequently asked questions

What does the Serving and scaling models domain cover on the PMLE exam?

The Serving and scaling models domain covers the key concepts tested in this area of the PMLE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PMLE domains — no account required.

How many Serving and scaling models questions are in the PMLE question bank?

The Courseiva PMLE question bank contains 95 questions in the Serving and scaling models domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Serving and scaling models for PMLE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
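
The progression rule above can be sketched as a small helper. This is only an illustration: the function name `next_session_size` is hypothetical, "consistently" is assumed to mean the last three sessions, and the longer session is fixed at 20 questions (the low end of the 20–30 range).

```python
def next_session_size(scores, threshold=0.80, window=3):
    """Suggest the next session length from past session accuracies.

    scores    -- accuracies of past sessions as fractions (0.0 to 1.0), oldest first
    threshold -- accuracy that must be exceeded (80% in the text above)
    window    -- number of recent sessions that counts as "consistently"
                 (an assumption; the text does not define it)
    """
    recent = scores[-window:]
    consistent = len(recent) == window and all(s > threshold for s in recent)
    # Stay on 10-question sessions until the recent scores are all above threshold.
    return 20 if consistent else 10


print(next_session_size([0.70, 0.85, 0.90, 0.82]))  # last three all above 0.80
print(next_session_size([0.90, 0.90]))              # too few sessions so far
```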

Can I practice only Serving and scaling models questions for PMLE?

Yes — the session launcher on this page draws questions exclusively from the Serving and scaling models domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
