Courseiva
Knowledge + Practice

Free IT certification practice questions with explained answers for CCNA, CompTIA, AWS, Azure, Google Cloud, and more.

Courseiva is a free IT certification practice platform offering original exam-style practice questions, detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics for Cisco, CompTIA, Microsoft, AWS, and other technology certifications.

© 2026 Courseiva. Courseiva is operated by JTNetSolutions Ltd. All rights reserved.

Courseiva is an independent certification practice platform and is not affiliated with, endorsed by, or sponsored by Cisco, Microsoft, AWS, CompTIA, Google, ISC2, ISACA, or any other certification vendor. Vendor names and certification marks are used only to identify the exams learners are preparing for.

PMLE Serving and Scaling Models • Complete Question Bank

PMLE Serving and Scaling Models — All Questions With Answers

Complete PMLE Serving and Scaling Models question bank: all 109 questions with answers and detailed explanations.

109 questions · Free · No signup
Question 1 (easy, multiple choice)

A data scientist wants to deploy a trained TensorFlow model to Vertex AI for online predictions. They need to serve predictions with low latency and want to leverage GPU acceleration. Which machine type should they select when creating the Vertex AI endpoint?

Question 2 (medium, multiple choice)

You are deploying a new version of a model to a Vertex AI endpoint that already has a champion model serving 100% of traffic. You want to gradually shift traffic to the new version while monitoring for errors. Which approach should you use?

Question 3 (hard, multiple choice)

A company is using Vertex AI Prediction with a custom container that performs preprocessing before inference. The preprocessing step is CPU-intensive and the inference step uses a GPU. They want to minimize prediction latency while optimizing cost. Which architecture should they use?

Question 4 (medium, multiple choice)

You need to serve a large embedding model for similarity search with low latency. The model was trained to generate 256-dimensional embeddings. You plan to use Vertex AI Vector Search. Which index type should you choose to balance accuracy and performance for a dataset with 10 million vectors?

Question 5 (easy, multiple choice)

A machine learning engineer needs to run batch predictions on 50 TB of data stored in BigQuery using a Vertex AI model. The model is a custom container. What is the most efficient way to set up the batch prediction job?

Question 6 (medium, multiple choice)

You have a Vertex AI endpoint with min_replica_count=2 and max_replica_count=10. You notice that during a traffic spike, the endpoint does not scale up quickly enough, causing increased latency. What should you do to improve autoscaling responsiveness?

Question 7 (hard, multiple choice)

You are deploying a PyTorch model on Vertex AI using a custom container with NVIDIA Triton Inference Server. The model is a large transformer that requires GPU. You want to optimize GPU utilization and reduce memory footprint. Which technique should you apply?

Question 8 (medium, multiple choice)

A company wants to cache predictions for identical requests to reduce latency and cost. They use Vertex AI Prediction with a custom container. Which GCP service should they use to implement prediction caching?

Question 9 (easy, multiple choice)

You have a Vertex AI endpoint that serves a model for real-time predictions. You want to update the model to a new version with zero downtime. Which approach should you take?

Question 10 (medium, multiple choice)

You are using Vertex AI Vector Search with an approximate nearest neighbor index. You need to update the index with new data every hour. The updates must be available for queries immediately. Which update method should you use?

Question 11 (hard, multiple choice)

An ML team wants to deploy multiple models (e.g., a recommender and a classifier) behind a single Vertex AI endpoint. The models have different resource requirements: the recommender needs GPU, the classifier needs high memory. How should they configure the endpoint?

Question 12 (medium, multiple choice)

You need to run a batch prediction job on Vertex AI using a model that requires custom preprocessing using a Python script. The preprocessing must be applied before inference. Which approach should you use?

Question 13 (medium, multi-select)

A company is deploying a model on Vertex AI for online predictions with strict latency SLOs. The model requires GPU acceleration. Which TWO configurations should they consider to meet the SLOs while optimizing cost?

Question 14 (hard, multi-select)

You are designing a batch prediction pipeline using Vertex AI. The input data is 100 TB of images stored in Cloud Storage. The model is a custom TensorFlow model that expects TFRecord format. The pipeline must be cost-effective and run within a time window of 2 hours. Which THREE steps should you include?

Question 15 (medium, multi-select)

An organization wants to deploy a model on edge devices (e.g., Android phones) for offline inference. They trained a model using TensorFlow. Which THREE steps should they take to prepare and deploy the model?

Question 16 (easy, multiple choice)

You deployed a model to a Vertex AI endpoint with minReplicas=0 and maxReplicas=5. After sending prediction requests, you notice the endpoint takes about 30 seconds to respond initially, but subsequent requests are fast. What is the most likely cause?

Question 17 (medium, multiple choice)

You have a champion model serving 100% traffic on a Vertex AI endpoint. You want to deploy a challenger model and gradually shift 10% of traffic to it for A/B testing. What is the correct approach?

Question 18 (medium, multiple choice)

You need to run batch predictions on 10 TB of text data stored in BigQuery using a custom container model hosted in Vertex AI. What is the most cost-effective and simple approach?

Question 19 (hard, multiple choice)

Your team is deploying a large recommendation model on Vertex AI endpoints using GPUs. You need to minimise latency while optimising cost. The model serves many similar requests from the same users within short time windows. Which additional service would best reduce latency and cost?

Question 20 (easy, multiple choice)

You want to deploy a TensorFlow model to a Vertex AI endpoint and enable online predictions. The model requires GPU for inference. Which machine type should you select when deploying the model?

Question 21 (medium, multiple choice)

Your Vertex AI endpoint is experiencing high latency during traffic spikes. You have set maxReplicas=10 and minReplicas=2. The CPU utilisation target is 60%. During spikes, the endpoint never scales beyond 4 replicas. What is the most likely reason?

Question 22 (medium, multiple choice)

You need to deploy a PyTorch model for online inference on Vertex AI but the model was trained using custom ops that are not natively supported. You want to use NVIDIA Triton Inference Server for optimisation. How should you proceed?

Question 23 (hard, multiple choice)

Your team has built a low-latency similarity search service using Vertex AI Matching Engine (Vector Search). The index is updated daily with new embeddings. You need to serve the latest index without downtime. What is the correct deployment strategy?

Question 24 (easy, multiple choice)

You need to serve a model on an edge device with low latency and offline capability. Which approach should you use?

Question 25 (medium, multiple choice)

You have a Vertex AI endpoint with two deployed models: model A (champion) and model B (challenger). Traffic split is 90:10. You want to gradually increase model B's traffic to 50% over a week. What is the best way to update the traffic split?

Question 26 (hard, multiple choice)

You are using Vertex AI Prediction with a custom container that requires a large model file (5 GB). Deployment takes 10 minutes to start. You want to reduce cold start latency. Which action would be MOST effective?

Question 27 (medium, multiple choice)

You need to query a Vertex AI Vector Search index for nearest neighbours. The index is deployed on an endpoint. Which API method should you use to perform the query?

Question 28 (medium, multi-select)

You are deploying a model for real-time inference with strict latency requirements (<100ms P99). You want to autoscale based on custom metrics. Which TWO actions should you take? (Choose 2)

Question 29 (hard, multi-select)

Your team is using Vertex AI Prediction for a large-scale NLP model (PyTorch, custom ops). The model currently runs on CPU but you want to optimise inference cost and performance. Which THREE approaches should you consider? (Choose 3)

Question 30 (easy, multi-select)

You need to deploy a model for online predictions with low latency. You want to ensure that the endpoint can handle traffic bursts without cold start. Which TWO configurations should you set? (Choose 2)

Question 31 (medium, multiple choice)

A company deploys a model on Vertex AI Endpoints for real-time inference. They need to minimize latency for prediction requests that are identical to previous requests. Which approach should they use?

Question 32 (medium, multiple choice)

A data science team needs to serve multiple versions of the same ML model on Vertex AI Endpoints for A/B testing. They want to gradually shift traffic from the current 'champion' model to a new 'challenger' model. Which feature should they use?

Question 33 (hard, multiple choice)

An ML engineer is optimizing a large model for deployment on Vertex AI with GPU acceleration. They want to reduce model size and improve inference latency without significant accuracy loss. Which tool should they use?

Question 34 (easy, multiple choice)

Which Vertex AI service is designed for building and managing approximate nearest neighbor (ANN) indexes for similarity search at scale?

Question 35 (medium, multiple choice)

A company wants to run batch predictions on millions of records stored in BigQuery. They need to preprocess the data (e.g., feature engineering) before feeding it to the model. Which approach is most scalable and cost-effective?

Question 36 (easy, multiple choice)

Which of the following is a benefit of using Vertex AI Endpoints with autoscaling and scale-to-zero?

Question 37 (medium, multiple choice)

An engineer deploys a model to a Vertex AI endpoint with minReplicas=1 and maxReplicas=3. The endpoint receives a sudden traffic spike, but it does not scale up beyond 1 replica. The CPU utilization target is 60%. What is the most likely cause?

Question 38 (hard, multiple choice)

A company needs to perform real-time similarity search on a dataset of 10 million embedding vectors. They expect low latency (under 10ms) and high throughput. Which index type should they use in Vertex AI Vector Search?

Question 39 (easy, multiple choice)

Which API is recommended for high-throughput, low-latency online prediction requests to Vertex AI endpoints?

Question 40 (medium, multiple choice)

An organization wants to deploy a TensorFlow model on edge devices such as smartphones and IoT devices for offline inference. Which format should they export the model to?

Question 41 (hard, multiple choice)

A company is deploying multiple models on a single Vertex AI endpoint to reduce costs. Each model has different traffic patterns. Which configuration should they use?

Question 42 (medium, multiple choice)

An ML engineer needs to update a model deployed on a Vertex AI endpoint without downtime. They want to gradually shift traffic to the new version while monitoring for errors. What is the correct procedure?

Question 43 (medium, multi-select)

A company wants to use Vertex AI Vector Search for real-time product recommendations based on user embeddings. They need to update the index frequently with new product embeddings without significant downtime. Which TWO options should they consider? (Choose 2)

Question 44 (hard, multi-select)

An organization is deploying a mission-critical model on Vertex AI Endpoints. They need to ensure high availability and meet a strict SLO of 99.9% uptime. Which THREE steps should they take? (Choose 3)

Question 45 (easy, multi-select)

Which TWO of the following can be used as input sources for Vertex AI batch prediction jobs? (Choose 2)

Question 46 (easy, multiple choice)

You are deploying a model to a Vertex AI endpoint and need to minimize latency for online predictions. Which machine type should you choose?

Question 47 (medium, multiple choice)

You need to perform batch predictions on 10 TB of data stored in BigQuery using Vertex AI. The model requires some preprocessing that cannot be expressed in SQL. What is the most scalable approach?

Question 48 (hard, multiple choice)

You are using Vertex AI Vector Search for a product recommendation system. Your index is updated with new embeddings every hour. To minimize query latency while keeping the index fresh, what should you do?

Question 49 (medium, multiple choice)

You have a Vertex AI endpoint serving a model with min replicas=2 and max replicas=10. You notice that during low traffic hours, the endpoint still runs 2 replicas, incurring costs. You want to reduce costs to zero when there is no traffic. What should you do?

Question 50 (medium, multiple choice)

You are A/B testing a new model version (challenger) against the current version (champion) on Vertex AI. You want to gradually shift traffic from champion to challenger while measuring business metrics. Which approach should you use?

Question 51 (easy, multiple choice)

You need to serve multiple models on a single Vertex AI endpoint to reduce costs. How can you achieve this?

Question 52 (hard, multiple choice)

You are deploying a PyTorch model on Vertex AI and want to use NVIDIA Triton Inference Server for optimal performance. You have built a custom container with Triton. Which serving configuration should you use?

Question 53 (medium, multiple choice)

You are using Vertex AI batch prediction and your model requires preprocessing that involves joining two BigQuery tables. The preprocessing logic is complex and must be done before inference. How should you design the pipeline?

Question 54 (medium, multiple choice)

You have a Vertex AI endpoint with autoscaling enabled. You notice that during traffic spikes, the endpoint takes a long time to scale up, causing prediction errors. What is the most effective solution?

Question 55 (hard, multiple choice)

You need to deploy a TensorFlow model to edge devices for real-time inference with minimal latency. The model is currently trained on Vertex AI. Which approach should you use?

Question 56 (easy, multiple choice)

Your Vertex AI endpoint receives many identical prediction requests (same input features). You want to cache responses to reduce latency and cost. Which Google Cloud service should you use?

Question 57 (medium, multiple choice)

You are using Vertex AI Vector Search to find nearest neighbors for a recommendation system. Your index is built on 10M embeddings and you need low-latency queries. You want to ensure that adding new embeddings does not require a full index rebuild. Which index type should you use?

Question 58 (medium, multi-select)

You are optimizing a model for deployment on Vertex AI using NVIDIA Triton Inference Server. Which TWO actions can you take to improve inference performance?

Question 59 (medium, multi-select)

You are deploying a model on Vertex AI and need to ensure high availability and low latency. Which THREE configurations should you implement?

Question 60 (hard, multi-select)

You are designing a batch prediction pipeline using Vertex AI. The input data is 50 TB in CSV format on GCS. The model requires feature engineering that involves complex transformations (e.g., datetime parsing, one-hot encoding). Which THREE services or steps should you include in your pipeline?

Question 61 (medium, multiple choice)

A data science team has trained a custom TensorFlow model for real-time fraud detection. They need to deploy it on Vertex AI with minimal latency and support for multiple concurrent requests. The model requires a GPU for inference. Which machine type should they choose for the Vertex AI endpoint?

Question 62 (easy, multiple choice)

You deploy a new version of a model to a Vertex AI endpoint and want to gradually shift traffic from the old version to the new version over 24 hours. The endpoint currently serves 100% traffic to the old version. What should you do?

Question 63 (hard, multiple choice)

You have a Vertex AI endpoint serving a model for real-time predictions. The endpoint is configured with minReplicaCount=2 and maxReplicaCount=10. Over the past week, you notice that the actual number of replicas rarely exceeds 2, but the average CPU utilization is around 85%. You want to reduce costs without impacting performance. What should you do?

Question 64 (medium, multiple choice)

You are deploying a PyTorch model for online predictions on Vertex AI. The model expects input tensors and performs GPU-accelerated inference. You want to minimize prediction latency and maximize throughput. Which approach should you use?

Question 65 (medium, multiple choice)

Your team has deployed a model on Vertex AI endpoints. You need to monitor the prediction latency to ensure it meets a 99th percentile SLO of 500ms. You want to set up an alert if the latency exceeds this threshold. Which metric should you use?

Question 66 (hard, multiple choice)

You are using Vertex AI Matching Engine (Vector Search) to serve similarity search for an e-commerce product recommendation system. The index is updated daily with new product embeddings via a batch job. However, you notice that some new products are not appearing in the search results for up to 24 hours. You need to ensure that new products are discoverable within 1 hour of ingestion. What should you do?

Question 67 (easy, multiple choice)

Your company runs a high-traffic web application that serves the same machine learning model prediction for many identical requests (e.g., product recommendations for the same user profile). You want to reduce latency and load on the prediction endpoint by caching responses. Which Google Cloud service should you use?

Question 68 (medium, multiple choice)

You have a TensorFlow model that you want to deploy on edge devices for real-time inference. The model was trained in Vertex AI. You need to convert it to a format suitable for on-device inference. Which approach should you use?

Question 69 (medium, multiple choice)

You need to run batch predictions on a large dataset stored in BigQuery using a Vertex AI model. The dataset contains 10 million rows, and each prediction takes about 100ms. You want to minimize cost and execution time. What should you do?

Question 70 (hard, multiple choice)

You have a Vertex AI endpoint with two deployed models: a champion (v1) and a challenger (v2). You set the traffic split to 90% v1 and 10% v2. After a week, you observe that v2 has better business metrics. You want to shift all traffic to v2 gradually over 3 days to avoid any risk. What should you do?

Question 71 (easy, multiple choice)

You need to deploy a model to a Vertex AI endpoint that can scale down to zero when there are no requests to minimize costs. Which feature should you enable?

Question 72 (medium, multiple choice)

You are using Vertex AI Matching Engine for similarity search. Your index has 10 million embeddings of 512 dimensions. The query latency requirement is under 10ms for 99th percentile. Which index type should you choose?

Question 73 (medium, multi-select)

You are deploying a large deep learning model on Vertex AI endpoints. The model requires GPU acceleration and you want to minimize cold-start latency. Which TWO actions should you take? (Choose 2 correct answers)

Question 74 (hard, multi-select)

Your team has deployed a model on Vertex AI endpoints and you are planning an A/B test to compare a new challenger model (v2) against the current champion (v1). The test should measure business metrics such as click-through rate. Which THREE steps should you take to set up the A/B test correctly? (Choose 3 correct answers)

Question 75 (medium, multi-select)

You need to deploy a model that requires a large amount of memory (over 200 GB) for inference. The model is a custom PyTorch model. Vertex AI endpoints have machine type limitations. Which TWO actions can you take to handle this memory requirement? (Choose 2 correct answers)

Question 76 (easy, multiple choice)

A machine learning engineer wants to deploy a trained model to Vertex AI for online predictions. Which Vertex AI resource is required to serve the model and provide an endpoint URL?

Question 77 (medium, multiple choice)

A company needs to serve a high-throughput prediction service with strict latency requirements. They want to minimize cold starts and ensure consistent performance. Which endpoint configuration is most appropriate?

Question 78 (hard, multiple choice)

A data scientist wants to perform A/B testing between two model versions deployed on the same Vertex AI endpoint. They need to route 10% of traffic to the challenger model. Which approach should they use?

Question 79 (medium, multiple choice)

A team is deploying a large PyTorch model for online inference. They want to use NVIDIA Triton Inference Server to optimize serving performance. How can they integrate Triton with Vertex AI?

Question 80 (medium, multiple choice)

A company needs to run batch predictions on 10 TB of data stored in Cloud Storage. The predictions should be written to BigQuery. Which approach should they use?

Question 81 (easy, multiple choice)

What is the primary purpose of Vertex AI Model Optimization (formerly Model Garden)?

Question 82 (hard, multiple choice)

A company uses Vertex AI Matching Engine for a product recommendation system. They need to update the index with new product embeddings every hour, but the index is used for online queries with low latency. Which index update strategy should they use?

Question 83 (medium, multiple choice)

A team has deployed a model on Vertex AI and wants to cache frequent identical prediction requests to improve latency and reduce cost. Which Google Cloud service should they use?

Question 84 (medium, multiple choice)

An engineer needs to deploy multiple models on a single Vertex AI endpoint with separate traffic allocations. What is the maximum number of deployed models that can be assigned traffic on one endpoint?

Question 85 (hard, multiple choice)

A company wants to deploy a TensorFlow model on edge devices for real-time inference without internet connectivity. Which Vertex AI service should they use to manage the deployment?

Question 86 (easy, multiple choice)

Which machine type is most suitable for a Vertex AI endpoint serving a GPU-accelerated model?

Question 87 (medium, multiple choice)

A company uses Vertex AI Vector Search for similarity search. They have a dataset of 10 million 512-dimensional vectors. Which index type should they choose for lowest latency at high recall?

Question 88 (medium, multi-select)

A company needs to reduce inference latency for their online prediction service on Vertex AI. Which two actions would help? (Choose 2)

Question 89 (hard, multi-select)

A team wants to deploy a model on Vertex AI Edge Manager for offline inference on edge devices. Which three steps are required? (Choose 3)

Question 90 (medium, multi-select)

A company uses Vertex AI Matching Engine for real-time recommendations. They need to serve queries with low latency and support frequent updates. Which two configurations are appropriate? (Choose 2)

Question 91 (medium, multiple choice)

A company is deploying a new model version to an existing Vertex AI endpoint. They want to test the new version with 5% of traffic before fully rolling it out. What is the correct approach?

Question 92 (easy, multiple choice)

Which Vertex AI service is best suited for finding similar items in a large dataset based on embedding vectors, such as product recommendations or image similarity search?

Question 93 (medium, multiple choice)

An ML engineer needs to run batch predictions on 10 TB of data stored in BigQuery using a TensorFlow model. The predictions must be written to BigQuery. Which service should they use?

Question 94 (hard, multiple choice)

A company uses Vertex AI for online predictions with a large ensemble model that requires GPU acceleration. They want to reduce inference latency by batching multiple requests into a single GPU inference call. What should they configure?

Question 95 (medium, multiple choice)

A data scientist wants to deploy a model trained with PyTorch to a Vertex AI endpoint for online predictions. What is the recommended approach?

Question 96 (easy, multiple choice)

Which Vertex AI feature allows you to reduce the size of a trained model to improve inference speed on edge devices without significant accuracy loss?

Question 97 (medium, multiple choice)

An application serving predictions from a Vertex AI endpoint receives many identical requests within a short time window. The team notices redundant computation and wants to cache responses to reduce latency and cost. What is the recommended solution?

Question 98 (hard, multiple choice)

A team is deploying a model that has strict latency requirements: p99 response time under 100 ms. The model is CPU-only and will receive up to 1000 QPS. They want to minimize cost while meeting the SLO. Which machine type and scaling configuration is most appropriate?

Question 99 (easy, multiple choice)

What is the primary purpose of Vertex AI Edge Manager?

Question 100 (medium, multiple choice)

A company uses Vertex AI Vector Search (Matching Engine) for a product recommendation system. The product embeddings are updated hourly. Which index update method should they use to ensure low latency for new items?

Question 101 (medium, multi-select)

A company is deploying a complex model that requires a GPU for inference. They want to use Vertex AI for serving. Which TWO steps are required to deploy the model with GPU support? (Choose 2)

Question 102 (hard, multi-select)

A team is building a batch prediction pipeline that processes raw data from Cloud Storage, performs complex preprocessing, and then runs predictions using a large model. The preprocessing step is compute-intensive and the prediction step is I/O-bound. Which TWO Google Cloud services should they combine to optimize cost and performance? (Choose 2)

Question 103 (medium, multi-select)

An ML engineer needs to deploy a model to Vertex AI for online predictions and enable autoscaling to zero when not in use. Which THREE conditions must be met? (Choose 3)

Question 104 (easy, multi-select)

Which TWO of the following are benefits of using Vertex AI Matching Engine (Vector Search) over a brute-force nearest neighbor search? (Choose 2)

Question 105 (hard, multi-select)

A company is migrating from an on-premises ML serving infrastructure to Vertex AI. They have multiple models that need to be served from the same endpoint with different traffic percentages. They also need to monitor prediction quality. Which THREE actions should they take? (Choose 3)

Question 106 (medium, multiple choice)

A machine learning team deploys a PyTorch model for online prediction on Vertex AI using a custom container. They notice that the first few requests after scaling up experience high latency. What is the most likely cause and how should they mitigate it?

Question 107 (medium, multiple choice)

A company runs batch predictions on Vertex AI every hour using a custom container. They want to reduce costs by minimizing idle time while ensuring the batch job completes within 10 minutes. Which endpoint configuration should they use?

Question 108 (hard, multi-select)

A retail company deploys a new recommendation model alongside the current champion on Vertex AI Endpoints. They want to gradually shift traffic to the challenger while monitoring business metrics (conversion rate). Which TWO steps are required? (Choose 2)

Question 109 (hard, multi-select)

A fintech company needs to deploy a TensorFlow model for real-time fraud detection with a strict latency SLO (p99 < 100 ms). They expect variable traffic with spikes. They also want to minimize cold-start latency. Which TWO configurations should they use? (Choose 2)
