CCNA Scaling ML Models Questions

57 questions · Scaling ML Models topic · All types, answers revealed

1
Multi-Select · medium

A company has a TensorFlow model that requires GPU for inference. They are deploying on Vertex AI. Which TWO configurations are necessary to ensure GPU is used?

Select 2 answers
A. Set the environment variable TF_GPU_ALLOCATOR=cuda_malloc_async.
B. Use the pre-built TensorFlow serving container, which automatically uses GPU if available.
C. Build a custom container with GPU drivers.
D. Select a machine type that includes a GPU (e.g., NVIDIA Tesla T4).
E. Set the accelerator type and count in the model deployment configuration.
Answers: D, E

Necessary to have GPU hardware available.

Why this answer

Option D is correct because Vertex AI only provides GPU hardware when you explicitly choose a GPU-capable machine configuration (e.g., n1-standard-4 with an attached NVIDIA Tesla T4). Without it, inference runs on CPU only, regardless of any other configuration. Option E is correct because the accelerator type and count must also be set in the deployment configuration; choosing the machine type alone does not attach a GPU to the deployed model.

Exam trap

Google Cloud often tests the misconception that simply using a pre-built container or setting environment variables is sufficient to enable GPU acceleration, when in fact you must both select a GPU-capable machine type and explicitly configure the accelerator in the deployment settings.
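
As a rough illustration of answers D and E together, the sketch below builds the kind of resource block a deployment request carries. The helper name `gpu_dedicated_resources` is hypothetical; the field names follow the `dedicatedResources.machineSpec` layout of the Vertex AI REST API as best understood here, and should be checked against the current API reference before use.

```python
def gpu_dedicated_resources(machine_type="n1-standard-4",
                            accelerator_type="NVIDIA_TESLA_T4",
                            accelerator_count=1,
                            min_replicas=1,
                            max_replicas=1):
    """Build a deployment resource block that requests a GPU (hypothetical helper)."""
    if accelerator_count < 1:
        # Without an accelerator count the replica would serve on CPU only.
        raise ValueError("accelerator_count must be at least 1 to get a GPU")
    return {
        "machineSpec": {
            "machineType": machine_type,            # answer D: GPU-capable machine
            "acceleratorType": accelerator_type,    # answer E: which accelerator
            "acceleratorCount": accelerator_count,  # answer E: how many
        },
        "minReplicaCount": min_replicas,
        "maxReplicaCount": max_replicas,
    }


resources = gpu_dedicated_resources()
print(resources["machineSpec"])
```

The point of the sketch is only that both pieces of information travel in the same deployment configuration; it does not call any Google Cloud service.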

2
MCQ · hard

A company deploys a model to Vertex AI Prediction with autoscaling enabled. During a flash sale, traffic spikes 10x, but the endpoint fails to scale fast enough, causing high latency. What is the most likely cause and solution?

A. The min_nodes setting is too low; increase min_nodes to handle baseline traffic
B. Switch to preemptible VMs to reduce cost and allow more instances
C. The model container is too large; rebuild with a smaller image
D. Use Cloud Functions to pre-warm instances before the sale
Answer: A

A higher min_nodes keeps more instances already running, so less capacity has to be provisioned during the spike.

Why this answer

The correct answer is A. With Vertex AI Prediction autoscaling, the `min_nodes` setting (called `minReplicaCount` on Vertex AI endpoints) defines the baseline number of instances that are always running. New instances take time to provision and load the model, so when traffic spikes 10x against a low baseline, the few running instances are overloaded until the autoscaler catches up, and latency climbs.

Increasing `min_nodes` ensures a sufficient baseline capacity to absorb the initial spike while the autoscaler scales up additional nodes.

Exam trap

Google Cloud often tests the misconception that autoscaling is instantaneous or that external services like Cloud Functions can directly pre-warm ML instances, when in reality the root cause is an insufficient baseline capacity (`min_nodes`) to handle the initial burst before the autoscaler catches up.

How to eliminate wrong answers

Option B is wrong because preemptible VMs are designed for cost savings on fault-tolerant workloads, but they can be terminated at any time by Google Cloud, which would exacerbate scaling instability and latency during a traffic spike, not solve it. Option C is wrong because the model container size primarily affects cold start time and deployment speed, not the autoscaler's ability to add instances during a traffic spike; a smaller image would not address the scaling latency issue. Option D is wrong because Cloud Functions cannot pre-warm Vertex AI Prediction instances; pre-warming is typically handled by configuring a higher `min_nodes` or using traffic splitting with canary deployments, not by an external serverless function.
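
The effect of baseline capacity during a spike can be sketched with simple arithmetic. Every number below (1,000 requests per second at peak, 50 requests per second per instance, a 120-second provisioning delay) is an invented assumption for illustration, not a Vertex AI figure, and `backlog_during_scale_up` is a hypothetical helper.

```python
def backlog_during_scale_up(spike_rps, min_replicas, rps_per_replica,
                            provision_seconds):
    """Requests that queue up before newly provisioned replicas come online."""
    baseline_capacity = min_replicas * rps_per_replica
    # Only traffic above what the always-on replicas can serve piles up.
    shortfall = max(0, spike_rps - baseline_capacity)
    return shortfall * provision_seconds


# Low baseline: 2 replicas can serve 100 rps of a 1,000 rps spike.
low_baseline = backlog_during_scale_up(1000, 2, 50, 120)
# Higher baseline: 20 replicas already cover the spike.
high_baseline = backlog_during_scale_up(1000, 20, 50, 120)
print(low_baseline, high_baseline)
```

Under these assumptions the low baseline accumulates a large queue while the autoscaler is still adding instances, which is what shows up as high latency.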

3
MCQ · hard

A team deployed a prototype classification model to Vertex AI Prediction. After a week, they notice the metrics shown in the exhibit. What is the most likely cause of the performance degradation and latency increase?

A. The prediction endpoint's autoscaling is too slow, causing requests to queue and time out.
B. The prediction requests are too large, exceeding the maximum request size limit for Vertex AI.
C. The training data does not represent the current production data distribution, causing the model to make incorrect predictions and requiring more computation.
D. The custom prediction container uses outdated libraries that are incompatible with Vertex AI's runtime.
Answer: C

Data distribution shift degrades accuracy and can increase latency if the model is uncertain.

Why this answer

The exhibit shows both accuracy degradation and increased latency. Option C is correct because when the production data distribution shifts away from the training data (data drift), the model makes more incorrect predictions, which can trigger additional computation (e.g., retries, fallback logic, or increased uncertainty estimation) and cause latency spikes. Vertex AI Prediction does not inherently add computation for wrong predictions, but the model's internal confidence thresholds or post-processing steps may consume extra resources when handling out-of-distribution inputs.

Exam trap

Google Cloud often tests the misconception that latency increase must be caused by infrastructure issues (autoscaling or request size) rather than model behavior, but the key clue is the simultaneous accuracy degradation, which points to data drift as the root cause.

How to eliminate wrong answers

Option A is wrong because autoscaling delays cause request queuing and timeouts, which would manifest as increased error rates and latency, but not as a degradation in prediction accuracy (metrics like precision/recall). Option B is wrong because exceeding the maximum request size limit (typically 1.5 MB for Vertex AI online prediction) would result in immediate 413 Payload Too Large errors, not a gradual performance degradation over a week. Option D is wrong because outdated libraries in a custom container would cause deployment failures or runtime errors (e.g., missing symbols, version conflicts), not a gradual accuracy drop; Vertex AI validates container compatibility at deployment time.
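
One common way to quantify the kind of distribution shift described above is the population stability index (PSI) between a training sample and a production sample of one feature. The sketch below is a plain-Python illustration, not the method Vertex AI Model Monitoring uses; the bin count, the `eps` floor, and the sample data are all assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` against `expected` for one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [i / 100 for i in range(100)]   # assumed training sample
shifted = [x + 0.5 for x in train]      # assumed drifted production sample
print(psi(train, train), psi(train, shifted))
```

Identical samples give a PSI of zero, and the shifted sample gives a clearly positive value; where to set an alert threshold is a judgment call for each feature.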

4
MCQ · easy

A company has developed a prototype fraud detection model using a small sample of transactions. The prototype runs on a single VM and uses a Random Forest classifier. They want to scale to the full dataset of 50 million transactions. The data is stored in BigQuery. The team wants to use Vertex AI for training. After moving the code to a custom training container and using Vertex AI Training with a single n1-standard-4 machine, the training job fails with an error: "Process terminated with exit code 1". The logs show: "java.lang.OutOfMemoryError: Java heap space". The model uses a scikit-learn RandomForest. Which course of action is most appropriate?

A. Use a distributed training strategy with multiple workers.
B. Increase the machine type to n1-highmem-8 to provide more memory.
C. Switch from Random Forest to a linear model to reduce memory usage.
D. Switch to a high-CPU machine type like n1-highcpu-16.
Answer: B

More memory alleviates the OOM error for in-memory Random Forest.

Why this answer

Option B is correct because increasing memory (n1-highmem) directly addresses the heap space error, as Random Forest memory usage scales with data size. Option D is wrong because high-CPU machines have less memory per core. Option A is wrong because scikit-learn does not natively support distributed training, and setting up a distributed Random Forest is complex.

Option C is wrong because switching to a linear model may degrade performance unnecessarily.
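
The memory reasoning behind choosing a high-memory machine can be sketched as a back-of-the-envelope estimate. The question does not state the number of features, so the 30 columns, the 8 bytes per value, and the 2x working-memory overhead below are assumptions; the memory sizes are the published figures for these N1 machine types.

```python
# Memory per machine type in GB (published N1 figures).
N1_MEMORY_GB = {
    "n1-standard-4": 15,
    "n1-highcpu-16": 14.4,
    "n1-highmem-8": 52,
}


def dataset_gb(rows, features, bytes_per_value=8):
    """Size of a dense float64 feature matrix held fully in memory."""
    return rows * features * bytes_per_value / 1e9


def machines_that_fit(required_gb, overhead_factor=2.0):
    """Machine types with room for the data plus assumed training overhead."""
    needed = required_gb * overhead_factor
    return [name for name, gb in N1_MEMORY_GB.items() if gb >= needed]


need = dataset_gb(50_000_000, 30)  # 30 features is an assumed value
fits = machines_that_fit(need)
print(need, fits)
```

With these assumptions only the high-memory type leaves enough headroom, and the high-CPU type is no better than the original machine.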

5
MCQ · hard

A company has a large-scale ML system that uses Vertex AI Pipelines to retrain models weekly. The pipeline includes a custom training job and a batch prediction step. After moving to production, they observe that batch prediction jobs often fail with 'Quota exceeded' errors. The project has sufficient CPU quota. What is the most likely cause?

A. The pipeline is exceeding the maximum number of concurrent pipeline runs.
B. The batch prediction job is requesting a specific accelerator type that has a separate quota limit.
C. The batch prediction job is using a machine type that is not available in the region.
D. The custom training job is consuming all available quota before the batch prediction job starts.
Answer: B

GPUs/TPUs have separate quotas; if exceeded, the job fails with quota exceeded.

Why this answer

The most likely cause is that the batch prediction job is requesting a specific accelerator type (e.g., GPU or TPU) that has a separate quota limit from CPU quota. In Vertex AI, accelerator quotas are distinct from general compute (CPU) quotas, and even if the project has sufficient CPU quota, the accelerator quota may be exhausted, causing 'Quota exceeded' errors.

Exam trap

Google Cloud often tests the misconception that all quota errors are related to CPU or memory, but the trap here is that accelerator types (GPUs/TPUs) have their own independent quota limits that are easily overlooked when CPU quota appears sufficient.

How to eliminate wrong answers

Option A is wrong because exceeding the maximum number of concurrent pipeline runs would result in pipeline submission failures or throttling, not batch prediction job failures with 'Quota exceeded' errors; Vertex AI Pipelines enforces concurrency limits separately. Option C is wrong because if a machine type is not available in the region, the error would be a resource availability error (e.g., 'Machine type not found'), not a quota exceeded error. Option D is wrong because the custom training job and batch prediction job run sequentially within the same pipeline; the training job completes before the batch prediction job starts, so it cannot consume quota during the batch prediction step.

6
Multi-Select · medium

A data science team has trained a custom model using Vertex AI and wants to deploy it for online predictions with low latency. Which TWO actions should they take to optimize performance?

Select 2 answers
A. Use Vertex AI Endpoints with traffic splitting for canary deployments.
B. Enable autoscaling with a large min replicas count to handle bursts.
C. Optimize the model by quantizing to FP16.
D. Use a custom prediction routine with pre-processing inside the container.
E. Use a machine type with GPU for inference.
Answers: C, D

Quantization reduces model size and inference latency, often with minimal accuracy loss.

Why this answer

Option C is correct because quantizing the model to FP16 reduces its memory footprint and computational requirements, directly lowering inference latency on compatible hardware (e.g., NVIDIA GPUs with Tensor Cores). This optimization is especially effective for online predictions where response time is critical, as it accelerates matrix operations without significantly sacrificing model accuracy.
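
The size reduction and rounding cost of FP16 can be seen with the standard library alone. This is only an illustration of half-precision storage using Python's `struct` half-float format; it is not how a serving framework converts a model, and the random weights are made up.

```python
import random
import struct


def to_fp16_bytes(weights):
    """Pack floats as IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(weights)}e", *weights)


def from_fp16_bytes(buf):
    """Unpack half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(1000)]  # made-up weights

fp32_bytes = struct.pack(f"<{len(weights)}f", *weights)
fp16_bytes = to_fp16_bytes(weights)
restored = from_fp16_bytes(fp16_bytes)

# Largest absolute rounding error introduced by the FP16 round trip.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(fp32_bytes), len(fp16_bytes), max_err)
```

The FP16 buffer is half the size of the FP32 one, and for weights in this range the rounding error stays small, which is why accuracy often changes little.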

Exam trap

Google Cloud often tests the misconception that scaling infrastructure (e.g., autoscaling or GPU selection) is the primary way to optimize latency, when in fact model-level changes (quantization) and architectural changes (custom routines) are more direct and cost-effective.

7
MCQ · hard

A team has successfully trained a deep learning model on Vertex AI using a custom container and distributed training with TensorFlow. They want to serve this model for online predictions with low latency. They deploy the model to Vertex AI Endpoint with a single n1-standard-4 machine. During load testing, they observe that the median latency is 200ms, but the 99th percentile latency spikes to 2 seconds. The model is a complex neural network that takes variable-length text as input. Which approach will best reduce tail latency while maintaining throughput?

A. Use autoscaling with a target CPU utilization of 70%.
B. Implement request batching to process multiple inputs per request.
C. Use a GPU machine type like n1-standard-4 with an attached GPU.
D. Increase the machine type to n1-highmem-8 to allocate more memory.
Answer: B

Batching reduces overhead and smooths out latency for variable-length inputs.

Why this answer

Option B is correct because batching multiple requests together amortizes overhead and reduces per-request latency variability, particularly for variable-length inputs. Option D is wrong because increasing memory does not address compute-bound latency spikes. Option C is wrong because a GPU might improve throughput but would not necessarily reduce tail latency caused by input variability.

Option A is wrong because autoscaling adds replicas over time but does not reduce per-request latency spikes.

8
MCQ · easy

A team has a trained TensorFlow model running locally and wants to deploy it for low-latency online predictions on Google Cloud. Which service should they use?

A. Vertex AI Prediction
B. AI Platform Training
C. Cloud Run
D. Cloud Functions
Answer: A

Vertex AI Prediction is purpose-built for low-latency online ML predictions.

Why this answer

Vertex AI Prediction is the correct choice because it is a fully managed service designed specifically for deploying trained ML models for online (real-time) prediction with low latency. It supports importing TensorFlow SavedModel artifacts and automatically scales the serving infrastructure, including GPU/TPU support, to handle request traffic while providing built-in monitoring and explainability features.

Exam trap

Google Cloud often tests the distinction between training and prediction services, and the trap here is that candidates may confuse AI Platform Training (which is for model training) with AI Platform Prediction (now part of Vertex AI), or assume that any serverless compute like Cloud Run or Cloud Functions can handle ML inference without considering the need for GPU/TPU support and optimized serving infrastructure.

How to eliminate wrong answers

Option B (AI Platform Training) is wrong because it is a service for training ML models, not for serving predictions; it does not provide the low-latency serving endpoints needed. Option C (Cloud Run) is wrong because while it can host custom containers, it lacks native ML model serving optimizations such as automatic GPU/TPU acceleration, model versioning, and request batching, and would require you to manually build and manage a prediction server. Option D (Cloud Functions) is wrong because it is a serverless compute platform for event-driven, short-lived functions with a maximum timeout of 9 minutes (1st gen) and no support for GPU/TPU, making it unsuitable for low-latency online predictions that need a large ML model kept loaded between requests.

9
MCQ · medium

A company uses Vertex AI for training. They have a large dataset stored in Cloud Storage and need to train a custom model using TensorFlow. The training job is failing with an out-of-memory error. What is the best first step?

A. Reduce model size.
B. Enable data sharding and reduce input pipeline parallelism.
C. Use a larger machine type.
D. Increase the batch size.
Answer: B

Reduces memory footprint of data loading.

Why this answer

Option B is correct because enabling data sharding and reducing input pipeline parallelism can lower the memory used by data loading. Option D is wrong because increasing batch size would increase memory usage. Option C is wrong as a first step, though it might be a later one, because a larger machine type increases cost.

Option A is wrong because reducing model size may degrade accuracy and is not a first step.
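
The idea of sharding an input stream so that each worker only ever holds a small slice in memory can be sketched in plain Python. This is an analogy to what an input pipeline such as `tf.data` does, not TensorFlow code; the helper names `shard` and `batched` and the toy record stream are assumptions.

```python
from itertools import islice


def shard(records, num_shards, index):
    """Lazily yield every num_shards-th record, starting at `index`."""
    return islice(records, index, None, num_shards)


def batched(records, batch_size):
    """Yield fixed-size batches so only one batch is materialized at a time."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Toy stream standing in for records read from Cloud Storage.
worker_0 = list(shard(range(10), num_shards=3, index=0))
worker_1 = list(shard(range(10), num_shards=3, index=1))
batches = list(batched(shard(range(10), 3, 2), batch_size=2))
print(worker_0, worker_1, batches)
```

Because both helpers are lazy, peak memory depends on the batch size rather than on the size of the full dataset.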

10
Multi-Select · easy

Which THREE factors should be considered when choosing a compute option for serving a deep learning model in production on Google Cloud? (Choose three.)

Select 3 answers
A. Integration with Vertex AI for model monitoring
B. Autoscaling capabilities to handle variable traffic
C. GPU or TPU requirements for model inference
D. The programming language used for training
E. The color of the team's logo
Answers: A, B, C

Monitoring integration is crucial for production.

Why this answer

A is correct because Vertex AI provides integrated model monitoring capabilities, including feature drift detection, prediction skew analysis, and outlier detection, which are essential for maintaining model performance in production. Without this integration, you would need to build custom monitoring pipelines, increasing operational complexity.

Exam trap

The trap here is that candidates might think the training language (D) matters for serving, but Google Cloud serving infrastructure is language-agnostic as long as the model is exported in a supported format, making this a common distractor.

11
Multi-Select · easy

An ML team is converting a prototype model to a production pipeline using Vertex AI. They want to ensure model versioning and lineage. Which two practices should they adopt? (Select TWO)

Select 2 answers
A. Use Vertex AI Model Registry to manage model versions.
B. Only keep the latest model version to save storage.
C. Store model artifacts in Cloud Storage with unique versioned directories.
D. Train models directly in production without tracking.
E. Use a separate GCP project for each model version.
Answers: A, C

Integrates with other Vertex AI services for lineage.

Why this answer

Options A and C are correct. Storing model artifacts in Cloud Storage with versioned directories and using Vertex AI Model Registry provide organized versioning and lineage tracking. Option B is wrong because keeping only the latest version loses history.

Option E is wrong because using a separate GCP project per version is unnecessary and complex. Option D is wrong because not tracking versions is poor practice.

12
MCQ · easy

A team prototypes a recommendation model using a Jupyter notebook on Vertex AI Workbench. They want to productionize the model with CI/CD. Which approach should they use to package the model for deployment?

A. Use Cloud Build to deploy the notebook directly as a prediction endpoint
B. Store the model in Cloud Source Repositories and deploy from there
C. Containerize the model and push to Artifact Registry, then deploy via Cloud Run
D. Upload the model to Vertex AI Model Registry and use it for deployment
Answer: D

Model Registry manages versions and deployment targets.

Why this answer

Vertex AI Model Registry is the central repository for managing ML models, enabling versioning, evaluation, and deployment to endpoints. This approach integrates with CI/CD pipelines via the Vertex AI SDK or Cloud Build, allowing automated model promotion and deployment without manual packaging. Option D directly leverages Vertex AI's native deployment workflow, which is the recommended path for productionizing models from Workbench.

Exam trap

Google Cloud often tests the misconception that any storage or code repository (like Cloud Source Repositories or Artifact Registry) can directly serve as a deployment mechanism, when in fact Vertex AI Model Registry is the required service for managing and deploying models within Vertex AI's ecosystem.

How to eliminate wrong answers

Option A is wrong because Cloud Build cannot deploy a Jupyter notebook directly as a prediction endpoint; notebooks contain code and dependencies that must be containerized or exported as a model artifact first. Option B is wrong because Cloud Source Repositories is a code hosting service, not a model deployment mechanism; storing code there does not create a deployable endpoint. Option C is wrong because while containerization and Artifact Registry are valid for custom serving, Vertex AI Model Registry provides built-in model versioning, evaluation, and endpoint management that aligns with Vertex AI's native CI/CD capabilities, making it the more direct and recommended approach for this scenario.

13
MCQ · hard

An organization runs a batch prediction job on Vertex AI for a large dataset (10 TB). The job is configured to use a cluster of 100 n1-standard-16 machines. Midway through, the job fails with 'Out of memory' errors. What is the most effective mitigation strategy?

A. Split the input data into smaller chunks and run multiple jobs.
B. Enable model parallelism within the prediction script.
C. Increase the number of machines to distribute data more.
D. Use a machine type with more memory per instance.
Answer: D

Directly addresses the OOM by providing more memory for each worker.

Why this answer

The 'Out of memory' error indicates that individual worker nodes are running out of RAM when processing their assigned data shards. Using a machine type with more memory per instance (e.g., n1-highmem-16) directly addresses the root cause by providing each node with sufficient memory to hold the model and its intermediate computations, without changing the data distribution or parallelism strategy.

Exam trap

The trap here is that candidates confuse scaling horizontally (adding more machines) with scaling vertically (increasing per-machine resources), assuming that distributing data further will fix memory exhaustion when the bottleneck is per-node RAM capacity, not data volume per node.

How to eliminate wrong answers

Option A is wrong because splitting the input data into smaller chunks and running multiple jobs does not increase the memory available per machine; it only reduces the data per job, and the same per-node memory constraint will still cause OOM errors if the model or batch size per node remains unchanged. Option B is wrong because model parallelism splits the model across devices, which is typically used for very large models that cannot fit on a single GPU/TPU, not for batch prediction jobs where the model is already loaded and the issue is data processing memory. Option C is wrong because increasing the number of machines distributes the data across more nodes, but each node still has the same 60 GB of RAM (n1-standard-16), so the per-node memory pressure remains identical and OOM errors will persist.

14
MCQ · easy

You are a Machine Learning Engineer at a financial services company. You have trained a large language model (LLM) using a custom container on Vertex AI Training. The model is used for sentiment analysis on financial news articles. You have deployed the model to a Vertex AI Endpoint for online prediction. However, during peak trading hours, users report high latency (> 5 seconds) and occasional timeout errors. The model is deployed on n1-highmem-8 machines with 1 replica. You monitor the endpoint and see that CPU utilization is high (> 90%) and memory is near capacity. The queries are relatively small text inputs. Which course of action should you take to reduce latency?

A. Deploy the model to multiple endpoints and use round-robin load balancing.
B. Use Vertex AI Prediction with GPU accelerators like NVIDIA Tesla T4.
C. Increase the machine type to n1-highmem-16 and keep 1 replica.
D. Reduce the batch size for predictions to lower memory usage.
Answer: B

GPUs excel at matrix operations common in LLMs, dramatically reducing inference latency per request.

Why this answer

Option B is correct because the high CPU utilization and memory pressure indicate that the CPU is the bottleneck for inference, not the model size or input volume. Switching to GPU accelerators like NVIDIA Tesla T4 offloads the computationally intensive matrix operations of the LLM to the GPU, drastically reducing per-query latency and freeing CPU resources for preprocessing and I/O. This directly addresses the root cause of >5-second latency during peak hours.

Exam trap

Google Cloud often tests the misconception that scaling up CPU resources (vertical scaling) is the solution for high-latency inference, when in fact the correct approach for deep learning models is to offload computation to specialized hardware like GPUs or TPUs.

How to eliminate wrong answers

Option A is wrong because deploying to multiple endpoints with round-robin load balancing does not reduce per-query latency; it only distributes the load across replicas, but each replica still suffers from the same CPU bottleneck and would likely still time out. Option C is wrong because increasing the machine type to n1-highmem-16 adds more CPU cores and memory, but the inference bottleneck is the CPU's inability to parallelize the LLM's matrix operations efficiently; a larger CPU instance still cannot match GPU throughput for deep learning inference. Option D is wrong because reducing batch size for predictions would actually increase the number of inference calls and overhead, potentially worsening latency; the model already receives small text inputs, so batching is not the issue.

15
Multi-Select · medium

Which TWO practices are important when scaling a prototype ML model to production on Google Cloud? (Choose two.)

Select 2 answers
A. Set up model monitoring for data drift and concept drift
B. Manually engineer features for each training iteration
C. Run the model on a single high-memory Compute Engine VM
D. Use proprietary libraries to maximize performance regardless of lock-in
E. Implement CI/CD pipelines for model training and deployment
Answers: A, E

Monitoring is essential for production model health.

Why this answer

Option A is correct because model monitoring for data drift and concept drift is essential in production ML on Google Cloud. Services like Vertex AI Model Monitoring automatically track feature distributions and prediction quality over time, alerting when the statistical properties of incoming data deviate from the training baseline. Without this, a model's accuracy can silently degrade as real-world data shifts, leading to poor business decisions.

Exam trap

Google Cloud often tests the misconception that production ML can rely on manual processes or single-instance deployments, whereas the correct approach emphasizes automation, monitoring, and scalability through managed services.

16
MCQ · medium

An ML engineer needs to run batch predictions on tens of petabytes of data using a trained model. The data is stored in Cloud Storage. Which service should they choose?

A. Cloud Dataflow with the model as a side input
B. Cloud Dataproc running Spark ML
C. Cloud Run with multiple revisions
D. Vertex AI Batch Prediction
Answer: D

Batch Prediction scales to petabytes and integrates with Cloud Storage.

Why this answer

Vertex AI Batch Prediction is the correct choice because it is a managed service specifically designed for high-throughput, large-scale batch inference on data stored in Cloud Storage. It automatically handles sharding, scaling, and resource management for tens of petabytes, without requiring the engineer to manage infrastructure or write custom distributed processing code.

Exam trap

Google Cloud often tests the distinction between batch inference and data processing pipelines, so the trap here is that candidates confuse Cloud Dataflow (a data processing tool) with a batch prediction service, not realizing that Vertex AI Batch Prediction is the dedicated service for running models on large static datasets.

How to eliminate wrong answers

Option A is wrong because Cloud Dataflow with the model as a side input is optimized for stream and batch data processing pipelines, not for running a trained model's predictions on petabytes of static data; side inputs are not designed for large model inference and would cause severe performance bottlenecks and memory issues. Option B is wrong because Cloud Dataproc running Spark ML requires the engineer to manually manage clusters, configure Spark jobs for inference, and handle scaling, which adds operational overhead and is less efficient than a purpose-built batch prediction service for petabyte-scale data. Option C is wrong because Cloud Run is a serverless container platform for request-driven, low-latency applications, not for batch processing of tens of petabytes; it has a maximum request timeout of 60 minutes and cannot handle the volume or duration required.

17
MCQ · medium

You are an ML engineer at a fintech company. You have a prototype credit risk model built using XGBoost that achieves high accuracy on historical data. The model is trained on a dataset with 500,000 rows and 50 features. The company wants to deploy this model to production to score loan applications in real-time. The production environment must handle a peak load of 100 requests per second with a latency under 200ms. You have decided to use Vertex AI for deployment. After deploying the model as a Vertex AI endpoint with a single n1-standard-4 machine, you notice that latency exceeds 500ms at peak load and some requests time out. You have verified that the model prediction itself (excluding network overhead) takes about 50ms on average. What should you do to meet the latency and throughput requirements?

A. Change the machine type to a GPU-accelerated machine like n1-standard-4 with a T4 GPU.
B. Prune the model to reduce size and improve prediction speed.
C. Enable autoscaling with a minimum of 2 replicas and use a larger machine type (e.g., n1-standard-8) to handle more concurrent requests.
D. Switch from online prediction to batch prediction using Vertex AI Batch Prediction.
Answer: C

Autoscaling increases replicas to handle load, and a larger machine can process more requests concurrently, reducing queueing time.

Why this answer

Option C is correct because the latency bottleneck is not the model inference time (50ms) but the inability of a single n1-standard-4 machine to keep up with 100 requests per second without queuing. By enabling autoscaling with a minimum of 2 replicas and upgrading to n1-standard-8, you increase both the number of replicas serving traffic and the CPU and memory available per replica, which cuts queue wait times and brings total latency back toward the 200ms target. This directly addresses the throughput and latency requirements without changing the model or switching to batch processing.

Exam trap

The trap here is that candidates assume latency issues are always due to model inference speed (leading them to choose GPU or model pruning), when in fact the bottleneck is often the lack of horizontal scaling to handle concurrent requests under load.

How to eliminate wrong answers

Option A is wrong because adding a GPU (e.g., T4) is unlikely to reduce latency here; single-request inference on a small tree-based XGBoost model gains little from a GPU, transferring data to the device can add overhead, and the bottleneck is queuing rather than compute per prediction. Option B is wrong because pruning the model (e.g., reducing tree depth or number of trees) would only marginally improve the 50ms prediction time, but the primary issue is queuing due to insufficient replicas to handle 100 requests per second, not the raw inference speed. Option D is wrong because batch prediction is designed for offline, asynchronous processing and cannot meet the real-time requirement of under 200ms latency per request; it also does not solve the concurrency problem for online scoring.
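
The queuing argument can be checked with a quick capacity estimate. The sketch assumes each vCPU serves one request at a time at the stated 50ms per prediction, which is a simplification; the vCPU counts (4 for n1-standard-4, 8 for n1-standard-8) are the published figures, and the helper names are hypothetical.

```python
def max_throughput_rps(replicas, vcpus_per_replica, service_ms):
    """Upper bound on requests per second if each vCPU handles one request at a time."""
    return replicas * vcpus_per_replica * 1000 / service_ms


def utilization(arrival_rps, replicas, vcpus_per_replica, service_ms):
    """Offered load divided by capacity; above 1.0 the queue grows without bound."""
    return arrival_rps / max_throughput_rps(replicas, vcpus_per_replica, service_ms)


# One n1-standard-4 (4 vCPUs) at 100 rps and 50 ms per prediction.
single = utilization(100, replicas=1, vcpus_per_replica=4, service_ms=50)
# Two n1-standard-8 replicas (8 vCPUs each) under the same load.
scaled = utilization(100, replicas=2, vcpus_per_replica=8, service_ms=50)
print(single, scaled)
```

Under this assumption the single machine is offered more work than it can finish, so requests queue and latency grows, while the scaled configuration runs well below saturation.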

18
MCQ · hard

A company has a prototype ML model that achieves 85% accuracy on historical data. In production, accuracy drops to 70% after two weeks due to data drift. They need an automated retraining pipeline with minimal manual oversight. Which solution is most cost-effective?

A. Use Cloud Functions to trigger a Dataflow job that trains the model using custom containers
B. Deploy the model on a GPU-equipped Compute Engine VM and run retraining every time new data arrives
C. Set up Vertex AI Model Monitoring to detect drift, which triggers a Cloud Function that submits a Vertex AI Training job with new data
D. Schedule a weekly Cloud Composer DAG that runs a new training job with all available data
Answer: C

Monitoring detects drift, automation triggers retraining with new data, cost-effective.

Why this answer

Option C is correct because it combines automated drift detection via Vertex AI Model Monitoring with a serverless retraining trigger (Cloud Function) that submits a Vertex AI Training job, minimizing manual oversight while only incurring costs when drift is detected. This avoids the expense of continuous retraining or always-on GPU instances, making it the most cost-effective solution for the described scenario.

Exam trap

The trap here is that candidates often choose scheduled retraining (Option D) as the simplest automation, overlooking the cost savings and precision of event-driven retraining triggered by actual drift detection, which is a key concept in the PMLE exam for scaling prototypes to production.

How to eliminate wrong answers

Option A is wrong because using Cloud Functions to trigger a Dataflow job for training with custom containers introduces unnecessary complexity and cost for batch processing of training data, whereas Vertex AI Training is purpose-built for ML model training and integrates seamlessly with drift detection. Option B is wrong because deploying a GPU-equipped Compute Engine VM for retraining every time new data arrives incurs high costs for idle GPU time and requires manual management of the VM lifecycle, contradicting the requirement for minimal manual oversight. Option D is wrong because scheduling a weekly Cloud Composer DAG to retrain with all available data ignores the cost savings of event-driven retraining triggered by actual drift, and may waste resources retraining when no drift has occurred.
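
The event-driven pattern in Option C can be sketched as a drift check that gates a retraining call. This is a minimal illustration, not the Vertex AI Model Monitoring API: the population stability index (PSI) is one common drift statistic and is chosen here as an assumption, and `psi`, `should_retrain`, the sample distributions, and the 0.2 threshold are hypothetical.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population stability index between two binned distributions.

    Both inputs are per-bin proportions that each sum to 1.
    """
    if len(expected) != len(actual):
        raise ValueError("distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


def should_retrain(train_bins: Sequence[float], serving_bins: Sequence[float],
                   threshold: float = 0.2) -> bool:
    """Return True only when drift exceeds the (hypothetical) threshold."""
    return psi(train_bins, serving_bins) > threshold


# Hypothetical feature distributions over four bins.
training = [0.25, 0.25, 0.25, 0.25]
stable = [0.24, 0.26, 0.25, 0.25]
drifted = [0.05, 0.15, 0.30, 0.50]

for name, bins in (("stable", stable), ("drifted", drifted)):
    print(name, "-> retrain" if should_retrain(training, bins) else "-> skip")
```

Because the retraining job is submitted only when the check returns `True`, compute is spent only when drift is actually observed, which is the cost argument behind Option C.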

19
MCQmedium

A team has deployed a model with autoscaling configured as shown. They notice that during off-peak hours, the endpoint consistently runs 3 instances instead of scaling down to 1. What is the most likely cause?

A.There is a sustained request rate that prevents scaling down.
B.The `enableAccessLogging` flag increases resource usage.
C.The `minReplicaCount` is set too high.
D.The model is too large to fit on a single instance.
AnswerA

Autoscaler keeps instances if load requires them, even if low.

Why this answer

The autoscaling configuration is likely based on a target metric (e.g., requests per second or CPU utilization). During off-peak hours, if there is a sustained but low request rate that still exceeds the scale-down threshold, the model will not reduce instances below the number needed to handle that load. The endpoint runs 3 instances because the sustained request rate prevents the scaling-down logic from triggering, even though the traffic is lower than peak.

Exam trap

Google Cloud often tests the misconception that scaling is purely based on instance count or model size, when in reality it is driven by sustained request rates and metric thresholds that prevent scale-down actions.

How to eliminate wrong answers

Option B is wrong because `enableAccessLogging` only controls whether endpoint access logs are written to Cloud Logging, which does not directly affect compute resource usage or scaling behavior. Option C is wrong because if `minReplicaCount` were set too high, the endpoint would always run at least that many instances, but the question states it runs 3 instances instead of scaling down to 1, implying the minimum is 1 and the autoscaler is choosing not to reduce further. Option D is wrong because model size affects instance memory and startup time, but it does not prevent scaling down; a large model can still run on a single instance if the instance type supports it.
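
The scale-down reasoning above can be approximated with a simplified replica calculation. This is a sketch under stated assumptions, not the actual Vertex AI autoscaling algorithm: `desired_replicas`, the per-replica target of 10 requests per second, and the traffic figures are all hypothetical.

```python
import math


def desired_replicas(request_rate: float, per_replica_target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep each replica at or below its target load."""
    needed = math.ceil(request_rate / per_replica_target)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


# A sustained off-peak load of 25 req/s still needs 3 replicas,
# even though the configured minimum is 1.
off_peak = desired_replicas(25, 10, min_replicas=1, max_replicas=10)
# Only a near-idle endpoint falls back to the minimum.
idle = desired_replicas(4, 10, min_replicas=1, max_replicas=10)
print(off_peak, idle)
```

Under this model the minimum replica count is a floor, not a target: the endpoint returns to it only when the load genuinely drops below what one replica can absorb.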

20
MCQmedium

An ML team is scaling a prototype to production. The data pipeline currently reads from Cloud Storage and transforms data with a custom Python script. They need to handle higher throughput and add monitoring. Which approach should they take?

A.Deploy the Python script on a large Compute Engine instance with a cron job
B.Migrate the pipeline to Apache Beam on Dataflow with Cloud Monitoring
C.Rewrite the pipeline to use Pub/Sub and Cloud Functions for processing
D.Use Cloud Composer to orchestrate the Python script at scale
AnswerB

Dataflow is serverless, auto-scales, and integrates with Cloud Monitoring for observability.

Why this answer

Apache Beam on Dataflow provides a unified programming model for batch and streaming data processing, enabling automatic scaling to handle higher throughput. Cloud Monitoring integrates natively with Dataflow to track pipeline metrics, latency, and error rates, addressing the monitoring requirement. This approach is purpose-built for production-grade data pipelines, unlike ad-hoc solutions.

Exam trap

Google Cloud often tests the distinction between orchestration (Cloud Composer) and execution (Dataflow), leading candidates to choose an orchestrator when a dedicated processing engine is required for scaling and monitoring.

How to eliminate wrong answers

Option A is wrong because deploying a Python script on a single large Compute Engine instance with a cron job does not provide horizontal scaling, fault tolerance, or built-in monitoring; it creates a single point of failure and cannot handle throughput spikes. Option C is wrong because rewriting the pipeline to use Pub/Sub and Cloud Functions is suitable for event-driven, lightweight processing but not for complex data transformations or high-throughput batch workloads; Cloud Functions have execution timeouts (9 minutes for 1st gen functions) and lack stateful processing capabilities. Option D is wrong because Cloud Composer (managed Apache Airflow) is an orchestration tool, not a data processing engine; it would still rely on the Python script's execution, inheriting its scaling and monitoring limitations without addressing the core transformation throughput.

21
Multi-Selectmedium

A team has trained a sentiment analysis model using PyTorch on Vertex AI Training. They now want to deploy it for online predictions with low latency. Which TWO actions should they take? (Choose 2)

Select 2 answers
A.Create multiple model versions for A/B testing.
B.Use a machine type with a GPU for faster inference.
C.Enable batch prediction instead of online prediction.
D.Convert the model to TensorFlow SavedModel format.
E.Package the model in a custom container with a web server (e.g., FastAPI).
AnswersB, E

GPUs can accelerate inference for deep learning models.

Why this answer

Option B is correct because GPU-accelerated inference significantly reduces latency for deep learning models like sentiment analysis, especially when using PyTorch, which has native CUDA support. Vertex AI Prediction supports GPU machine types (e.g., n1-standard-4 with NVIDIA T4) that can process batched requests faster than CPUs, directly addressing the low-latency requirement.

Exam trap

Google Cloud often tests the misconception that converting to TensorFlow SavedModel is required for Vertex AI, but the platform supports PyTorch natively via custom containers, making conversion an unnecessary and potentially error-prone step.

22
MCQhard

A large e-commerce company deploys a recommendation model on Vertex AI with autoscaling enabled. During Black Friday, traffic spikes rapidly. The autoscaler adds new instances, but new instances take several minutes to become ready (cold start). As a result, many requests time out. What should they do to mitigate this issue?

A.Use a larger machine type to reduce the number of instances needed.
B.Configure the autoscaler to use CPU utilization metric instead of request count.
C.Increase the health check grace period for new instances.
D.Set a higher minimum number of instances to handle the expected peak.
AnswerD

Pre-warms instances to absorb traffic spikes without cold start.

Why this answer

Option D is correct because setting a higher minimum number of instances ensures that a baseline capacity is always running and ready to serve traffic. This pre-warms instances, eliminating the cold-start latency during rapid traffic spikes, such as Black Friday, because new instances do not need to initialize from scratch.

Exam trap

The trap here is that candidates confuse scaling metrics or instance readiness with the fundamental need for pre-provisioned capacity, leading them to choose options that adjust autoscaling behavior without eliminating the cold-start latency.

How to eliminate wrong answers

Option A is wrong because using a larger machine type reduces the number of instances needed but does not address the cold-start delay; each new instance still takes minutes to become ready. Option B is wrong because switching to CPU utilization metric does not solve the cold-start problem; the autoscaler still adds instances that take time to initialize, and CPU utilization may not react as quickly to a sudden traffic surge as request count. Option C is wrong because increasing the health check grace period only delays when the load balancer considers an instance healthy, but the instance still takes the same time to become ready; requests will still time out during the cold-start window.
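
A rough way to see why pre-provisioned capacity matters is to estimate how many requests exceed warm capacity while new instances are still starting. The sketch below is a back-of-the-envelope model only: `unserved_during_cold_start` and every number in it (spike rate, per-replica throughput, cold-start duration) are hypothetical and are not taken from the question.

```python
def unserved_during_cold_start(spike_rps: float, per_replica_rps: float,
                               warm_replicas: int, cold_start_s: float) -> float:
    """Requests that exceed warm capacity before new replicas become ready."""
    shortfall_rps = max(0.0, spike_rps - warm_replicas * per_replica_rps)
    return shortfall_rps * cold_start_s


# Hypothetical spike: 1000 req/s, 100 req/s per replica, 180 s cold start.
for warm in (2, 5, 10):
    backlog = unserved_during_cold_start(1000, 100, warm, 180)
    print(f"min replicas={warm}: ~{backlog:.0f} requests over capacity")
```

Raising the minimum shrinks the shortfall directly, whereas changing the scaling metric or the health check grace period leaves the cold-start window untouched.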

23
Multi-Selectmedium

A data scientist needs to scale a prototype deep learning model to train on a massive dataset using multiple GPUs. Which three strategies are essential for efficient distributed training? (Select THREE)

Select 3 answers
A.Use a single large batch size across all workers.
B.Implement data parallelism.
C.Ensure that the input pipeline is not a bottleneck by using tf.data.Dataset with prefetching and parallel reads.
D.Use synchronous gradient updates.
E.Use asynchronous gradient updates to reduce communication overhead.
AnswersB, C, D

Scales training by splitting data across workers.

Why this answer

Options B, C, and D are correct. Data parallelism (B) is the foundation for scaling across GPUs. Synchronous gradient updates (D) are commonly used to maintain convergence quality. An optimized input pipeline (C) prevents I/O bottlenecks.

Option E is wrong because asynchronous updates can cause convergence issues and are not essential. Option A is wrong because using a single large batch size across all workers is not essential; per-worker batch size must be tuned.
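
Data parallelism with synchronous updates can be illustrated without any framework. The following toy sketch (a one-parameter linear model, with hypothetical helper names `gradient`, `shard`, and `synchronous_step`) shows that averaging per-worker gradients over equal-sized shards reproduces the single-worker gradient on the whole batch, which is why synchronous training preserves convergence behaviour.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def gradient(w: float, batch: List[Sample]) -> float:
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def shard(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Split the global batch into equal per-worker shards."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def synchronous_step(w: float, data: List[Sample], num_workers: int, lr: float) -> float:
    """Each worker computes a gradient on its shard; all gradients are averaged."""
    grads = [gradient(w, s) for s in shard(data, num_workers)]
    avg = sum(grads) / len(grads)  # stands in for an all-reduce
    return w - lr * avg


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # y = 3x, 8 samples
w_single = synchronous_step(0.0, data, num_workers=1, lr=0.01)
w_multi = synchronous_step(0.0, data, num_workers=4, lr=0.01)
print(w_single, w_multi)
```

The equivalence relies on the shards being the same size; with uneven shards the per-worker gradients would need to be weighted by shard size.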

24
MCQmedium

A startup has developed a prototype ML model using scikit-learn on a single machine. They now need to scale it to handle larger datasets and deploy it for real-time predictions. The team is small and wants minimal operational overhead. Which Google Cloud service should they use?

A.AI Platform Prediction
B.Vertex AI
C.Cloud Functions
D.Compute Engine with TensorFlow Serving
AnswerB

Vertex AI provides managed training, deployment, and autoscaling with minimal operational overhead.

Why this answer

Vertex AI (option B) is the correct choice because it provides a unified, fully managed MLOps platform that integrates model training, deployment, and scaling with minimal operational overhead. It supports scikit-learn models natively, offers auto-scaling for real-time predictions, and eliminates the need to manage infrastructure, making it ideal for a small team transitioning from a prototype.

Exam trap

Google Cloud often tests the misconception that any serverless option (like Cloud Functions) is suitable for ML inference, but the trap here is that Cloud Functions has severe resource and timeout limitations that make it impractical for real-time model serving, whereas Vertex AI is purpose-built for this workload.

How to eliminate wrong answers

Option A (AI Platform Prediction) is wrong because it is a legacy service that has been superseded by Vertex AI; while it could technically serve predictions, it lacks the unified workflow and newer features of Vertex AI, and using it would add unnecessary complexity and deprecation risk. Option C (Cloud Functions) is wrong because it is a serverless compute service designed for event-driven, short-lived tasks (1st gen functions are capped at a 9-minute timeout and have limited memory), not for hosting persistent ML models requiring real-time inference with low latency and large payloads. Option D (Compute Engine with TensorFlow Serving) is wrong because it requires manual setup, scaling, and maintenance of virtual machines, which contradicts the team's goal of minimal operational overhead; TensorFlow Serving also adds an extra layer of complexity for a scikit-learn model that could be served more simply via Vertex AI's built-in containers.

25
MCQmedium

A team deploys a PyTorch model on Vertex AI for online predictions. They notice that after deployment, the latency increases over time, especially during peak hours. The model is served using a custom container. What is the most likely cause?

A.The custom container does not have a health check, causing instances to be prematurely terminated.
B.The model is not using GPU even though a GPU machine is selected.
C.The model is too large for the machine's memory, causing swapping.
D.The prediction requests are not being batched, and the model inference code is not optimized for concurrency.
AnswerD

Without batching and concurrency, requests queue up, increasing latency under load.

Why this answer

Option D is correct because the latency increase over time, especially during peak hours, indicates that the model inference code is not handling concurrent requests efficiently. Without batching or optimized concurrency, each request is processed sequentially, causing a queue buildup under load. This is a common issue with custom containers on Vertex AI when the prediction handler is single-threaded or lacks async processing.

Exam trap

Google Cloud often tests the misconception that latency increases are always due to resource exhaustion (memory/CPU) rather than concurrency or request handling inefficiencies, leading candidates to pick Option C.

How to eliminate wrong answers

Option A is wrong because a missing health check would cause instances to be terminated and recreated, leading to intermittent failures or startup latency, not a gradual latency increase over time. Option B is wrong because selecting a GPU machine without using the GPU would result in underutilization but not necessarily increasing latency; the model would still run on CPU, and latency would be constant or high from the start. Option C is wrong because if the model were too large for memory, swapping would cause consistently high latency from the outset, not a gradual increase during peak hours.

26
MCQhard

A machine learning engineer is scaling a prototype natural language processing model that uses a transformer encoder. The prototype was trained on a small corpus on a single GPU. For production, they need to train on a much larger corpus using TPUs on Vertex AI. They convert the TensorFlow code to work with TPUStrategy. The training starts but after a few steps, the loss becomes NaN and training diverges. The learning rate scheduler uses a warm-up and then linear decay. The initial learning rate is 1e-4. The batch size per TPU core is 32, with 8 cores total (batch size 256). What is the most likely cause?

A.The batch size is too small for TPU.
B.The learning rate is too high for the batch size.
C.The learning rate schedule should be cosine instead of linear.
D.The warm-up steps are insufficient.
AnswerB

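The learning rate tuned on a single GPU is too aggressive for the new distributed configuration.

Why this answer

When scaling from a single GPU to 8 TPU cores, the global batch size increases from 32 to 256, yet the learning rate of 1e-4 was carried over unchanged from the prototype. Loss that turns NaN within a few steps is the typical symptom of update steps that are too large, so among the options the learning rate magnitude is the most likely cause. Note that the linear scaling rule normally recommends raising the learning rate in proportion to batch size, so it does not by itself explain the divergence; the point being tested is that hyperparameters must be re-tuned, not copied, when the distributed setup changes.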

Exam trap

Google Cloud often tests the misconception that TPU-specific issues (like batch size or hardware compatibility) are the root cause, when in fact the problem is a fundamental hyperparameter scaling error that applies to any distributed training setup.

How to eliminate wrong answers

Option A is wrong because TPUs are designed to handle large batch sizes efficiently; a batch size of 256 is well within typical TPU capabilities and is not the cause of NaN loss. Option C is wrong because the learning rate schedule (linear vs. cosine) is not the primary issue; the fundamental problem is the learning rate magnitude relative to the batch size, not the decay shape. Option D is wrong because insufficient warm-up steps might cause early instability but would not typically lead to persistent NaN loss after several steps; the core issue is the learning rate being too high for the increased batch size.
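
The batch arithmetic and the warm-up plus linear-decay schedule described in the question can be written out as follows. This is a sketch only: the question does not state the warm-up length or total step count, so `warmup_steps=100` and `total_steps=1000` are hypothetical values, as are the function names.

```python
def global_batch_size(per_core_batch: int, num_cores: int) -> int:
    """Global batch seen by one synchronous step across all cores."""
    return per_core_batch * num_cores


def lr_at_step(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


print(global_batch_size(32, 8))  # per-core batch 32 on 8 cores
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, peak_lr=1e-4, warmup_steps=100, total_steps=1000))
```

Keeping the peak rate and warm-up length as explicit parameters makes it straightforward to try a lower peak or a longer warm-up when the loss diverges after a change in batch size.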

27
MCQhard

A team uses Vertex AI Pipelines to automate training and deployment. They need to ensure that only models that pass a set of quality checks (e.g., accuracy > 0.9, latency < 100ms) are deployed to production. How should they implement this?

A.Manually review each model before promotion
B.Use Cloud Functions to deploy only if accuracy is reported in BigQuery
C.Set up Cloud Build triggers to deploy every model version
D.Add a Pipeline component that evaluates metrics and uses a conditional gate to deployment
AnswerD

Pipelines support conditional execution based on component outputs.

Why this answer

Vertex AI Pipelines can include a custom component that evaluates metrics and a conditional step that proceeds to deployment only if the thresholds are met. Option A is manual and does not scale. Option B checks only that accuracy has been reported, not that the quality thresholds are met. Option C deploys every model version with no quality gate.
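
The gating logic itself is small. The sketch below shows only the decision function that such an evaluation component might run; it is not pipeline SDK code, and `passes_quality_gate` and the metric key names `accuracy` and `latency_ms` are assumptions. The thresholds are the ones given in the question.

```python
from typing import Dict


def passes_quality_gate(metrics: Dict[str, float],
                        min_accuracy: float = 0.9,
                        max_latency_ms: float = 100.0) -> bool:
    """Return True only if every quality check passes."""
    accuracy = metrics.get("accuracy")
    latency = metrics.get("latency_ms")
    if accuracy is None or latency is None:
        return False  # fail closed: a missing metric blocks deployment
    return accuracy > min_accuracy and latency < max_latency_ms


candidates = {
    "model_a": {"accuracy": 0.93, "latency_ms": 80.0},
    "model_b": {"accuracy": 0.95, "latency_ms": 140.0},
    "model_c": {"accuracy": 0.88, "latency_ms": 60.0},
}
for name, metrics in candidates.items():
    print(name, "deploy" if passes_quality_gate(metrics) else "block")
```

In a pipeline, the boolean result would feed the conditional step that guards the deployment component, so a failing model never reaches production.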

28
Multi-Selecteasy

An ML team is deploying a model to Vertex AI for the first time. Which THREE are best practices for scaling from prototype to production?

Select 3 answers
A.Manually scale instances based on historical traffic patterns.
B.Store all features in a Feature Store for consistency.
C.Use a single large instance to simplify management.
D.Monitor model performance for drift and accuracy degradation.
E.Automate model retraining and deployment using Vertex AI Pipelines.
AnswersB, D, E

Feature Store ensures consistent feature computation across training and serving.

Why this answer

Storing all features in a Feature Store (Option B) ensures consistency between training and serving, preventing training-serving skew. Vertex AI Feature Store provides a centralized repository for feature values, enabling reuse, point-in-time lookups, and online serving with low latency, which is critical for production reliability.

Exam trap

Google Cloud often tests the misconception that manual scaling or single-instance architectures are simpler and more reliable, but the PMLE exam emphasizes automated, resilient, and consistent practices like autoscaling and feature stores for production ML workloads.

29
MCQeasy

A team has developed a prototype of a recommendation model using a small dataset on a single VM. They need to scale to a larger dataset for production training. They plan to use Vertex AI training with a custom container. What is the best practice for handling the increased data volume?

A.Increase the batch size to maximum.
B.Use TFRecord format and streaming reads.
C.Store all data in memory before training.
D.Use a single powerful VM with high memory.
AnswerB

Efficiently loads data in batches, leveraging Cloud Storage streaming.

Why this answer

Option B is correct because using TFRecord format with streaming reads allows efficient, scalable data loading from Cloud Storage, reducing memory pressure and improving I/O performance. Option A is wrong because increasing batch size to the maximum can cause memory issues and may not improve throughput. Option C is wrong because storing all data in memory is not scalable. Option D is wrong because a single powerful VM still has limits and is not cost-effective for large datasets.
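
The difference between streaming reads and loading everything into memory can be shown with a plain generator. This is a framework-free sketch of the idea rather than TFRecord or `tf.data` code; `stream_batches` and `fake_record_source` are hypothetical names, and the source stands in for records read one at a time from storage.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def stream_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield fixed-size batches lazily; only one batch is held at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_record_source(n: int) -> Iterator[int]:
    """Stand-in for records read sequentially from storage."""
    for i in range(n):
        yield i


for batch in stream_batches(fake_record_source(7), batch_size=3):
    print(batch)
```

Memory use is bounded by the batch size rather than the dataset size, which is the property that lets training scale to data that does not fit on one machine.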

30
MCQhard

A team trains a distributed TensorFlow model using the config above. After training, they deploy the model for online predictions. The model returns poor quality predictions. They suspect that the model was not trained correctly due to a configuration error. What is the most likely mistake?

A.The `scaleTier` is set to 'STANDARD_1' which only supports up to 3 workers.
B.The training job is using a custom container that does not match the requirements.
C.The model was exported incorrectly because the training job did not specify a `--model-export-path`.
D.The parameter server count should be at least equal to the worker count.
AnswerA

STANDARD_1 limits workers to 3; the actual job may have ignored the 10 worker setting.

Why this answer

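Option A is correct because the 'STANDARD_1' scale tier is intended for small-scale jobs and, per the option, supports at most 3 workers. The config requests 10 workers, which would either be ignored or cause an error. The training job may therefore have run with fewer workers than intended, producing a poorly trained model.

Option B is wrong because nothing indicates a custom container mismatch. Option C is wrong because the model directory setting is fine. Option D is wrong because the parameter server count is not required to match the worker count.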

31
Drag & Dropmedium

Drag and drop the steps to create and deploy a custom ML model on Vertex AI using a container in the correct order.

Why this order

First, build and push the container, then register the model, deploy to an endpoint, and finally test.

32
MCQmedium

A team has a prototype image classification model trained on a small dataset using TensorFlow Keras on a single GPU. They need to train on a larger dataset (1 million images) using a distributed strategy on Vertex AI with 8 GPUs. They implement a MirroredStrategy for data parallelism. During the first few epochs, the training speed does not improve significantly compared to a single GPU, and GPU utilization is low. The data is stored as JPEG files in Cloud Storage, and the input pipeline uses tf.data with map to decode images. What is the most likely cause?

A.The batch size per GPU is too large.
B.The MirroredStrategy is not properly configured.
C.The data loading from Cloud Storage is a bottleneck.
D.The model is too small for distributed training.
AnswerC

I/O bottleneck starves GPUs, causing low utilization.

Why this answer

Option C is correct because reading and decoding JPEG images from Cloud Storage can be I/O-bound, starving the GPUs and causing low utilization. Option A is wrong because a large batch size per GPU could cause memory issues but not low utilization. Option B is wrong because nothing in the scenario suggests that MirroredStrategy is misconfigured. Option D is wrong because even if the model is small, distributed training should still improve throughput if the pipeline is not bottlenecked.
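
A simple timing model makes the bottleneck visible. The sketch below assumes each training step consists of a data-loading phase and a compute phase, and that prefetching lets the two overlap; `gpu_utilization` and the millisecond figures are hypothetical and are not measurements from the scenario.

```python
def gpu_utilization(load_ms: float, compute_ms: float, overlapped: bool) -> float:
    """Fraction of each step the accelerator spends computing."""
    if overlapped:
        step_ms = max(load_ms, compute_ms)  # loading hides behind compute, or vice versa
    else:
        step_ms = load_ms + compute_ms      # accelerator idles while data loads
    return compute_ms / step_ms


# Hypothetical step: 80 ms to read and decode a batch, 20 ms of GPU compute.
print(gpu_utilization(80, 20, overlapped=False))
print(gpu_utilization(80, 20, overlapped=True))
# Parallel decoding that cuts load time to 20 ms removes the bottleneck.
print(gpu_utilization(20, 20, overlapped=True))
```

In this model, adding GPUs only shortens the compute phase, so utilization stays low until the loading phase is made faster or parallelized.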

33
Multi-Selecteasy

Which TWO actions are best practices when scaling a prototype ML model to production in Google Cloud?

Select 2 answers
A.Store and manage features in a feature store like Vertex AI Feature Store.
B.Test the model only on a small sample of the production data to save costs.
C.Set up monitoring and logging for model performance and data drift.
D.Manually scale inference instances based on historical traffic patterns.
E.Use one-hot encoding for all categorical features without considering cardinality.
AnswersA, C

Feature store ensures consistency and reuse across models.

Why this answer

Vertex AI Feature Store centralizes feature management, ensuring consistency between training and serving. This eliminates training-serving skew by providing a single source of truth for features, which is critical when scaling from prototype to production.

Exam trap

Google Cloud often tests the misconception that cost-saving shortcuts like limited testing or manual scaling are acceptable in production, when in fact reliability and monitoring are non-negotiable for ML systems at scale.

34
MCQmedium

A company has a TensorFlow model that uses custom operations compiled as .so files. They want to deploy it on Vertex AI for online predictions. The model runs correctly when loaded locally. However, on Vertex AI, the prediction fails with a 'Op type not registered' error. What is the most likely reason?

A.The model is using a deprecated TensorFlow version.
B.The custom ops are not included in the model directory.
C.The prediction request format is incorrect.
D.The custom ops were compiled for a different CPU architecture.
AnswerD

Incompatible instruction sets cause the op to fail to register.

Why this answer

Option D is correct because custom TensorFlow operations compiled as .so files are architecture-specific. If the local machine uses a different CPU architecture (e.g., x86_64 with AVX2) than the Vertex AI serving nodes (e.g., x86_64 without AVX2 or ARM), the dynamic library will fail to load, causing the 'Op type not registered' error. The model runs locally because the ops are available, but on Vertex AI the shared object cannot be loaded, so TensorFlow cannot register the custom kernels.

Exam trap

Google Cloud often tests the misconception that 'Op type not registered' is always due to missing files or version mismatches, but the real trap is that candidates overlook CPU architecture compatibility when deploying compiled custom ops to a cloud environment where the serving hardware may differ from the build environment.

How to eliminate wrong answers

Option A is wrong because a deprecated TensorFlow version would typically cause compatibility warnings or missing API errors, not an 'Op type not registered' error specifically for custom ops; the error indicates the op kernel is missing, not that the version is unsupported. Option B is wrong because if the custom ops were not included in the model directory, the model would fail to load entirely or produce a 'file not found' error, not an 'Op type not registered' error; the error occurs when the .so file is present but cannot be loaded due to architecture mismatch. Option C is wrong because an incorrect prediction request format would result in a 400 Bad Request or a deserialization error, not a TensorFlow runtime error about unregistered ops; the error is raised during model inference, not request parsing.

35
MCQmedium

A team uses Vertex AI Feature Store to serve features for online predictions. They notice that the online serving latency is high for certain features. The features are stored in a BigQuery source with high cardinality. What is the best practice to reduce latency?

A.Use batch prediction instead of online prediction.
B.Move the features to Cloud Storage and read them directly.
C.Increase the number of nodes in the feature store cluster.
D.Use feature store caching with a larger cache size.
AnswerD

Caching frequently accessed features reduces BigQuery calls and latency.

Why this answer

Option D is correct because caching reduces repeated reads from the BigQuery source. Option C (adding nodes) might help but does not directly address high cardinality. Option B (reading from Cloud Storage) would bypass Feature Store rather than integrate with it. Option A (batch prediction) is a workaround, not a best practice for online serving.

36
Multi-Selecthard

A company has a prototype ML model that predicts equipment failure. They want to deploy it to production using Vertex AI. The model must be retrained weekly with new data. They also need to monitor for data drift and model performance. Which THREE components should they include in their MLOps pipeline? (Choose 3)

Select 3 answers
A.A scheduled training pipeline that retrains the model weekly.
B.A manual QA step where data scientists approve each deployment.
C.A manual review of new data before it is used for training.
D.An automated trigger that redeploys the model when performance drops below a threshold.
E.A monitoring system that checks for data drift and triggers alerts.
AnswersA, D, E

Scheduled retraining is essential for keeping the model up-to-date.

Why this answer

Option A is correct because the requirement specifies weekly retraining, which is best implemented as a scheduled training pipeline in Vertex AI, for example with Cloud Scheduler or a recurring Vertex AI Pipelines run. This automates the retraining process without manual intervention, ensuring the model stays current with new data.

Exam trap

Google Cloud often tests the distinction between necessary manual oversight and fully automated MLOps practices, leading candidates to overestimate the need for human approval steps in a production pipeline that demands speed and scalability.

37
MCQhard

A team is scaling their prototype inference model to handle high-throughput requests with low latency. They use a custom container on Vertex AI Prediction. They notice that latency spikes occur under heavy load. What is the most effective strategy?

A.Enable auto-scaling with a higher minimum number of replicas.
B.Optimize model serving with batching and model warm-up.
C.Use a larger machine type with more CPUs.
D.Use a GPU-based machine.
AnswerB

Batching reduces overhead per request; warm-up avoids cold start.

Why this answer

Option B is correct because optimizing model serving with batching and model warm-up reduces per-request overhead and ensures consistent latency. Option C is wrong because adding CPUs may not help if the bottleneck is model inference computation. Option A is wrong because auto-scaling does not remove latency spikes; it adds replicas over time. Option D is wrong because a GPU may help but not specifically for latency spikes due to load variation.
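
The benefit of batching comes from amortizing fixed per-call overhead. The following sketch groups requests into micro-batches and compares total serving time under a simple cost model; `make_batches`, `total_serving_ms`, and the overhead and per-item costs are hypothetical assumptions, not measurements of any serving stack.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(requests: Sequence[T], max_batch_size: int) -> List[List[T]]:
    """Group queued requests into batches of at most max_batch_size."""
    return [list(requests[i:i + max_batch_size])
            for i in range(0, len(requests), max_batch_size)]


def total_serving_ms(num_requests: int, max_batch_size: int,
                     overhead_ms: float, per_item_ms: float) -> float:
    """Cost model: fixed overhead per model call plus a per-item cost."""
    calls = math.ceil(num_requests / max_batch_size)
    return calls * overhead_ms + num_requests * per_item_ms


print(make_batches(list(range(5)), max_batch_size=2))
# 64 queued requests, 10 ms fixed overhead per call, 1 ms per item.
print(total_serving_ms(64, 1, overhead_ms=10, per_item_ms=1))
print(total_serving_ms(64, 16, overhead_ms=10, per_item_ms=1))
```

Under these assumptions a batch size of 16 cuts the number of model calls from 64 to 4, which is where most of the latency reduction under load comes from.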

38
MCQeasy

A team just moved a model from prototype to production using Vertex AI. They notice prediction errors for certain inputs that were not present in training data. What should they do to detect such issues automatically?

A.Set up Vertex AI Experiments to compare predictions
B.Use BigQuery ML to analyze prediction requests
C.Enable Cloud Logging and set up alerts for error logs
D.Enable Vertex AI Model Monitoring to detect prediction anomalies
AnswerD

Model Monitoring automatically checks for drift and anomalies.

Why this answer

Option D is correct because Vertex AI Model Monitoring is specifically designed to detect prediction anomalies, such as data drift and feature skew, by comparing production prediction requests against the training data distribution. This allows the team to automatically identify inputs that deviate from the training data, even if those exact inputs were not present during training, without manual inspection.

Exam trap

Google Cloud often tests the distinction between monitoring for operational errors (e.g., HTTP errors) versus monitoring for model-specific issues (e.g., data drift), leading candidates to choose Cloud Logging (Option C) when the correct answer requires a dedicated ML monitoring service.

How to eliminate wrong answers

Option A is wrong because Vertex AI Experiments is used for tracking and comparing model training runs and hyperparameter tuning, not for monitoring production prediction requests or detecting anomalies in real-time. Option B is wrong because BigQuery ML is a tool for creating and executing machine learning models directly in BigQuery using SQL, not for analyzing prediction requests from a deployed Vertex AI model or detecting input anomalies. Option C is wrong because while Cloud Logging can capture error logs, it only reacts to explicit errors (e.g., 4xx/5xx HTTP responses) and cannot automatically detect prediction anomalies like data drift or feature skew that do not generate error logs.

39
Multi-Selecthard

Which TWO services are commonly used together to implement an end-to-end ML pipeline that automatically retrains and deploys models on Vertex AI? (Choose two.)

Select 2 answers
A.Cloud Dataflow
B.Vertex AI Pipelines
C.Cloud Composer
D.Cloud Source Repositories
E.Cloud Scheduler
AnswersB, E

Pipelines orchestrate the training and deployment steps.

Why this answer

Vertex AI Pipelines (B) is the correct choice because it provides a serverless, scalable orchestration service specifically designed to build, run, and manage ML pipelines on Vertex AI. It enables you to define a directed acyclic graph (DAG) of steps—including data preprocessing, training, evaluation, and deployment—and can be triggered automatically to retrain and deploy models. Cloud Scheduler (E) is commonly used together with Vertex AI Pipelines to schedule pipeline runs at fixed intervals or in response to time-based triggers, forming a complete end-to-end automated retraining and deployment workflow.

Exam trap

Google Cloud often tests the distinction between general-purpose orchestration tools (Cloud Composer) and ML-native pipeline services (Vertex AI Pipelines), leading candidates to pick Cloud Composer because of its familiarity with Airflow, even though Vertex AI Pipelines is the correct, integrated choice for end-to-end ML workflows on Vertex AI.

40
MCQeasy

An ML engineer runs this command to upload a model. The model artifact in Cloud Storage is a directory containing model.pkl and a custom preprocessing script. What will happen when he later deploys this model to an endpoint and sends a prediction request?

A.The prediction will succeed because the pre-built container automatically detects and uses the custom preprocessing script.
B.The prediction will succeed only if he also specifies a custom prediction routine.
C.The prediction will fail because the custom preprocessing script is not a standard scikit-learn serialized object.
D.The prediction will fail because the artifact URI must point to a single file not a directory.
AnswerC

The pre-built container only loads the model; custom preprocessing is not executed.

Why this answer

Option C is correct because the pre-built container for scikit-learn expects a single serialized model file (e.g., model.pkl) as the artifact. A directory containing a custom preprocessing script is not a standard scikit-learn serialized object, so the container cannot load or execute it, causing the prediction to fail.

Exam trap

Google Cloud often tests the misconception that a pre-built container can handle arbitrary directories or custom scripts, when in fact it strictly expects a single serialized model file.

How to eliminate wrong answers

Option A is wrong because the pre-built container does not automatically detect or use custom preprocessing scripts; it only loads a single model file. Option B is wrong because specifying a custom prediction routine would not fix the issue—the artifact must still be a single file, and the custom routine would need to be packaged differently (e.g., as a source distribution). Option D is wrong because the artifact URI can point to a directory; the failure is due to the directory containing a non-standard object, not because it is a directory.

41
Drag & Dropmedium

Drag and drop the steps to set up model monitoring for drift detection on Vertex AI in the correct order.

Steps
Order

Why this order

Deploy first, then enable monitoring, set thresholds, configure alerts, and review.

42
MCQmedium

A machine learning team has a prototype using a custom TensorFlow model trained on a small dataset stored in Cloud Storage. They want to scale the prototype to production with minimal code changes while ensuring the model can handle increased traffic and new data. The model currently loads data using tf.data.Dataset from CSV files. Which approach best meets these requirements?

A.Use Vertex AI Training with hyperparameter tuning and distributed training, then deploy the model to Vertex AI Prediction with autoscaling.
B.Deploy the model to AI Platform (Unified) Prediction with a custom container, and use AI Platform Training to retrain on larger datasets.
C.Migrate the model to BigQuery ML and use SQL for training and prediction to leverage BigQuery's scalability.
D.Package the model as a Cloud Run Function and use Cloud Scheduler to trigger retraining periodically.
AnswerA

Vertex AI provides seamless scaling with minimal code changes and supports tf.data.Dataset.

Why this answer

Vertex AI Prediction with autoscaling directly addresses the need to handle increased traffic without code changes, while Vertex AI Training with hyperparameter tuning and distributed training enables scaling to larger datasets with minimal modifications to the existing tf.data pipeline. This approach keeps the custom TensorFlow model intact and leverages managed infrastructure for both training and serving.

Exam trap

The trap here is that candidates may overcomplicate by choosing containerization (B) or a completely different platform (C), missing that Vertex AI Prediction natively supports TensorFlow models with autoscaling and minimal code changes.

How to eliminate wrong answers

Option B is wrong because it suggests using AI Platform (Unified) Prediction with a custom container, which is unnecessary and adds complexity; the existing model can be deployed directly without containerization, and the requirement is minimal code changes. Option C is wrong because migrating to BigQuery ML would require rewriting the model logic from TensorFlow to SQL, which is a significant code change and not suitable for a custom TensorFlow model. Option D is wrong because Cloud Run Functions are stateless and not designed for serving ML models with autoscaling for prediction traffic; Cloud Scheduler for retraining does not address the need for handling increased traffic or new data in a production serving path.

43
MCQeasy

A company has a prototype ML model that works well on historical data, but when deployed to production, the model performance degrades over time. The data distribution shifts gradually. Which strategy should they implement to maintain model accuracy?

A.Increase the regularization strength to prevent overfitting.
B.Increase the amount of training data by using more historical records.
C.Implement a retraining pipeline that periodically retrains the model on recent data.
D.Switch to a more complex model architecture to better capture patterns.
AnswerC

Periodic retraining with fresh data helps the model adapt to gradual distribution shifts.

Why this answer

Option C is correct because gradual data distribution shifts (concept drift) require the model to adapt to new patterns over time. A retraining pipeline that periodically retrains on recent data ensures the model remains aligned with the current production distribution, directly addressing the degradation caused by drift without relying on static historical data.
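
A toy illustration of why periodic retraining helps under gradual shift, assuming a deliberately trivial "model" (the mean of its training data) and synthetic data. The function names, the retraining interval, and the window size are hypothetical choices for this sketch, not Vertex AI settings.

```python
from statistics import mean


def mean_abs_error(predictions, targets):
    return mean(abs(p - t) for p, t in zip(predictions, targets))


def static_predictions(history, stream):
    """Fit once on historical data (here: the mean) and never update."""
    model = mean(history)
    return [model for _ in stream]


def retrained_predictions(history, stream, every=10, window=20):
    """Refit on the most recent `window` observations every `every` steps."""
    seen = list(history)
    model = mean(seen[-window:])
    predictions = []
    for step, value in enumerate(stream):
        if step > 0 and step % every == 0:
            model = mean(seen[-window:])  # periodic retraining on recent data
        predictions.append(model)
        seen.append(value)  # the true value arrives after the prediction
    return predictions


# Synthetic data: the target drifts upward by 0.2 per step in production.
history = [10.0] * 50
stream = [10.0 + 0.2 * i for i in range(100)]

static_mae = mean_abs_error(static_predictions(history, stream), stream)
retrained_mae = mean_abs_error(retrained_predictions(history, stream), stream)
print(f"static MAE: {static_mae:.2f}, retrained MAE: {retrained_mae:.2f}")
```

The static model's error keeps growing as the data moves away from the historical distribution, while the retrained model's error stays bounded by how stale its last training window is.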

Exam trap

Google Cloud often tests the misconception that overfitting or model complexity is the primary cause of production degradation, leading candidates to choose regularization or more complex architectures instead of recognizing that distribution shift requires data freshness.

How to eliminate wrong answers

Option A is wrong because increasing regularization strength reduces overfitting to historical noise but does not address the root cause—distribution shift—and may actually harm performance on new data by forcing the model to ignore legitimate new patterns. Option B is wrong because adding more historical records only reinforces the old distribution, making the model less responsive to recent shifts and potentially worsening drift. Option D is wrong because switching to a more complex model architecture increases capacity to fit data but does not solve the problem of stale training distribution; it may even overfit to outdated patterns and degrade faster under drift.

44
Matchingmedium

Match each ML acronym to its definition.

Concepts
Matches

Area Under the ROC Curve

Mean Squared Error

Tensor Processing Unit

Support Vector Machine

Principal Component Analysis

Why these pairings

These are standard ML acronyms used in Google Cloud ML exams.

45
Matchingmedium

Match each ML model interpretability method to its description.

Concepts
Matches

Game-theoretic approach to explain feature contributions

Local surrogate model to explain individual predictions

Ranking features by their impact on model output

Shows marginal effect of a feature on predictions

Measures decrease in performance when feature is shuffled

Why these pairings

Interpretability is key for trustworthy ML.

46
MCQeasy

A data scientist has trained a scikit-learn model locally and wants to deploy it to Vertex AI for online predictions with low latency. The model is a small RandomForestClassifier (100 MB). What is the recommended way to deploy this model?

A.Deploy the model on a Kubernetes cluster with Istio.
B.Package the model as a Docker container with a custom prediction routine.
C.Upload the model to Vertex AI Model Registry using the pre-built scikit-learn serving container.
D.Export the model as a TensorFlow SavedModel and use the pre-built TF serving container.
AnswerC

Vertex AI offers a pre-built container for scikit-learn that handles prediction out of the box.

Why this answer

Option C is correct because Vertex AI provides a pre-built container for scikit-learn that is optimized for serving predictions with low latency. For a small RandomForestClassifier (100 MB), this container handles model loading, request routing, and scaling automatically, eliminating the need for custom infrastructure. This is the recommended approach for deploying scikit-learn models to Vertex AI for online predictions.

Exam trap

Google Cloud often tests the misconception that any model must be containerized or converted to TensorFlow for deployment, but the correct answer leverages the platform's pre-built container for the specific framework, which is the simplest and most efficient path for small models.

How to eliminate wrong answers

Option A is wrong because deploying on a Kubernetes cluster with Istio adds unnecessary operational complexity and overhead for a small model that can be served directly via Vertex AI's managed infrastructure; it is not the recommended path for a simple scikit-learn model. Option B is wrong because packaging the model as a Docker container with a custom prediction routine is overkill when Vertex AI already offers a pre-built, optimized scikit-learn serving container that handles the prediction logic out of the box. Option D is wrong because exporting a scikit-learn model as a TensorFlow SavedModel is not a direct conversion; scikit-learn models are not natively compatible with TensorFlow Serving, and this would require significant re-engineering or use of ONNX, which is not the recommended path for a RandomForestClassifier.

47
MCQhard

A machine learning engineer is training a large-scale text classification model using a distributed strategy on TPUs. The training loss decreases normally but the validation loss starts increasing after a few epochs while training loss continues to decrease. The engineer suspects overfitting. Which technique is most appropriate to address this while scaling training?

A.Add dropout regularization.
B.Use early stopping with patience.
C.Reduce the learning rate.
D.Increase the batch size.
AnswerA

Reduces overfitting by randomly dropping units, effective in distributed settings.

Why this answer

Option A is correct because dropout regularization is a common technique to prevent overfitting in neural networks, and it can be applied in distributed training without major modifications. Option C is wrong because reducing the learning rate does not directly address overfitting. Option D is wrong because increasing the batch size can sometimes help generalization but is not a primary anti-overfitting method.

Option B is wrong because early stopping prevents further overfitting but does not address the cause during training.
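
For readers who want to see what dropout does mechanically, here is a minimal pure-Python sketch of inverted dropout. It is an illustration under simple assumptions (a flat list of activations, a caller-supplied random generator); in a real TensorFlow model this would normally be a Keras `Dropout` layer rather than hand-written code.

```python
import random


def dropout(activations, rate, training, rng):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and scale the survivors by 1 / (1 - rate); identity at inference."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 8
train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.5, training=False, rng=rng)
print(train_out)
print(infer_out)
```

Because each unit is dropped independently, the operation needs no coordination between workers, which is why it carries over to distributed training without changes.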

48
MCQeasy

An ML engineer needs to monitor a deployed model for data drift. They want to compare the distribution of incoming predictions against a baseline distribution. Which Vertex AI service should they use?

A.Vertex AI Feature Store
B.Vertex AI Model Monitoring
C.Vertex AI Experiments
D.Vertex AI Explainable AI
AnswerB

Designed for detecting drift and anomalies in prediction data.

Why this answer

Vertex AI Model Monitoring is the correct service because it is specifically designed to detect data drift and feature skew in deployed models. It continuously compares the distribution of incoming prediction requests against a baseline distribution (e.g., training data or a previous window) and alerts the engineer when statistically significant drift is detected, using metrics like Jensen-Shannon divergence or L-infinity distance.
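
The two distance measures named above can be computed by hand for a categorical feature. The sketch below uses only the standard library; the bucket counts are made up, and it does not reproduce how Vertex AI Model Monitoring bins features or chooses thresholds internally.

```python
import math


def normalize(counts):
    """Turn raw bucket counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def l_infinity(p, q):
    """Largest absolute difference between matching bucket frequencies."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical bucket counts for one feature: training baseline vs. serving.
baseline = normalize([50, 30, 20])
serving = normalize([20, 30, 50])
print(js_divergence(baseline, serving), l_infinity(baseline, serving))
```

Both measures are zero when the serving distribution matches the baseline and grow as the two diverge, so an alert reduces to comparing the score against a chosen threshold.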

Exam trap

Google Cloud often tests the distinction between monitoring (drift detection) and other MLOps components like feature stores or experiment tracking, so the trap here is that candidates may confuse 'monitoring' with 'storing features' or 'tracking experiments' because all are part of the ML lifecycle but serve different purposes.

How to eliminate wrong answers

Option A is wrong because Vertex AI Feature Store is a centralized repository for storing, managing, and serving feature values for training and serving, not for monitoring distributional shifts in predictions. Option C is wrong because Vertex AI Experiments is used for tracking and comparing machine learning experiments (e.g., hyperparameter tuning runs), not for real-time monitoring of deployed model predictions. Option D is wrong because Vertex AI Explainable AI provides feature attributions and explanations for model predictions, but does not perform statistical drift detection or baseline comparison.

49
MCQeasy

A machine learning engineer is exporting a trained model from Vertex AI Training to the Model Registry. Which artifact should they upload as the model artifact?

A.The saved model directory containing the model file(s) and any custom dependencies.
B.Only the model checkpoint file (.ckpt or .h5).
C.The entire training directory including training code and logs.
D.A zip file of the training source code.
AnswerA

This is the standard artifact expected by Vertex AI for deployment.

Why this answer

When exporting a trained model from Vertex AI Training to the Model Registry, the correct artifact is the saved model directory that contains the model file(s) (e.g., SavedModel format for TensorFlow, model.pkl for scikit-learn) along with any custom dependencies required for serving. This ensures the model can be deployed consistently to endpoints or batch predictions, as the Model Registry expects a self-contained artifact that includes both the model binary and its runtime dependencies.

Exam trap

Google Cloud often tests the distinction between training artifacts (checkpoints, code) and deployable model artifacts, trapping candidates who confuse a checkpoint (used for resuming training) with a final, serving-ready model.

How to eliminate wrong answers

Option B is wrong because a model checkpoint file (.ckpt or .h5) is an intermediate training state, not a final deployable artifact; it lacks the serialized graph and serving signatures needed for inference. Option C is wrong because uploading the entire training directory, including training code and logs, introduces unnecessary files and violates the Model Registry's expectation of a minimal, serving-ready artifact. Option D is wrong because a zip file of the training source code contains no model weights or architecture, making it useless for deployment.

50
MCQhard

Refer to the exhibit. A Machine Learning Engineer attempts to deploy a model to a Vertex AI Endpoint for online predictions but receives an error. What is the most likely cause of this error?

A.The model is not compatible with the selected machine type.
B.The machine type does not support GPU acceleration.
C.The min replica count is set to 0, which is not allowed for online prediction.
D.The endpoint is not in the same region as the model.
AnswerC

The error clearly states that min_replica_count must be at least 1.

Why this answer

Vertex AI online prediction endpoints require at least one replica to serve traffic. Setting `min_replica_count` to 0 is only valid for batch prediction, not for online prediction, because the endpoint must always have a running instance to handle incoming requests. The error occurs because the deployment request violates this constraint, causing the API to reject the configuration.

Exam trap

Google Cloud often tests the distinction between batch and online prediction configuration requirements, specifically that `min_replica_count = 0` is valid for batch but invalid for online, leading candidates to overlook this subtle but critical constraint.

How to eliminate wrong answers

Option A is wrong because Vertex AI automatically validates model compatibility with the selected machine type at deployment time; if there were an incompatibility, the error would be specific to that mismatch, not a generic deployment failure. Option B is wrong because GPU acceleration is optional and not required for online prediction; the error message would explicitly mention GPU-related issues if that were the cause. Option D is wrong because the error in the exhibit concerns the replica count; Vertex AI requires the model and the endpoint to be in the same region, but a mismatch would be reported as a location problem, not as an invalid `min_replica_count`.

51
MCQhard

An ML engineer is trying to upload a TensorFlow model to Vertex AI using the gcloud command shown. The model was trained using TensorFlow 2.11 and saved with model.save('model/'). The engineer sees the error. What is the most likely cause?

A.The container port should be 8080 instead of 8501.
B.The service account does not have permission to access the bucket.
C.The container image is for TensorFlow 2.11 but the model was saved with an older version.
D.The model was saved in a format other than SavedModel (e.g., HDF5) or the artifact path does not contain the expected directory structure.
AnswerD

The error explicitly states no saved_model.pb found, indicating the model is not in SavedModel format.

Why this answer

Option D is correct because the error indicates that Vertex AI cannot find the expected SavedModel artifacts (saved_model.pb and variables/ directory) at the specified path. When using model.save('model/') with TensorFlow 2.11, the default format is the SavedModel format, but the artifact path must point to the directory containing the saved_model.pb file, not a parent directory or a model saved in HDF5 format. The gcloud command likely references a path that does not contain the required SavedModel structure, causing the upload to fail.
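
A quick local check of the directory layout can catch this before uploading. The helper below is a hypothetical sketch that inspects a local export directory for the two entries named above; it does not call Vertex AI or Cloud Storage, and the temporary directory only stands in for a real export.

```python
import tempfile
from pathlib import Path


def missing_savedmodel_parts(model_dir):
    """Return the expected SavedModel entries that are absent from model_dir."""
    root = Path(model_dir)
    missing = []
    if not (root / "saved_model.pb").is_file():
        missing.append("saved_model.pb")
    if not (root / "variables").is_dir():
        missing.append("variables/")
    return missing


# Build a fake export under a temporary directory to show both outcomes.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp) / "model"
    (export_dir / "variables").mkdir(parents=True)
    (export_dir / "saved_model.pb").touch()
    parent_missing = missing_savedmodel_parts(tmp)         # points one level too high
    model_missing = missing_savedmodel_parts(export_dir)   # points at the export itself

print(parent_missing)
print(model_missing)
```

Pointing at the parent directory reports both entries as missing, which mirrors the situation in the question where the artifact path does not contain the expected structure.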

Exam trap

Google Cloud often tests the distinction between SavedModel and HDF5 formats, and candidates mistakenly assume that any model.save() call produces a valid SavedModel, overlooking that the artifact path must point to the correct directory structure with saved_model.pb.

How to eliminate wrong answers

Option A is wrong because the container port 8501 is the default for TensorFlow Serving's REST API, and Vertex AI's prediction container for TensorFlow models typically uses port 8501 for HTTP requests; port 8080 is used for custom containers, not for standard TensorFlow Serving images. Option B is wrong because the error message in the question does not mention permissions or access to a bucket; a bucket permission issue would produce a 403 or 401 error, not a model format error. Option C is wrong because TensorFlow 2.11 is fully backward-compatible with SavedModels saved by older versions, and the container image for TensorFlow 2.11 can serve models saved with any earlier TensorFlow 2.x version without issue.

52
Multi-Selecthard

A team is troubleshooting a Vertex AI Pipelines run that keeps failing at the model evaluation step. The pipeline includes steps: data preprocessing, training, evaluation, and deployment. Which THREE actions should they take to diagnose the issue?

Select 3 answers
A.Verify that the training step output is correctly linked as input to evaluation.
B.Run the evaluation code locally with the same input data.
C.Increase the memory of the evaluation step's machine.
D.Check the logs of the evaluation step in Cloud Logging.
E.Replace the evaluation step with a Vertex AI Model Evaluation service.
AnswersA, B, D

Mismatched outputs are a common pipeline failure cause.

Why this answer

Option A is correct because Vertex AI Pipelines relies on precise input/output artifact linking between steps. If the training step's output (e.g., a model artifact or evaluation metrics) is not correctly wired as the input to the evaluation step, the pipeline will fail due to missing or mismatched data. This is a common misconfiguration in Kubeflow Pipelines DSL, where step outputs must be explicitly passed as arguments to downstream components.

Exam trap

Google Cloud often tests the misconception that resource scaling (Option C) is the first diagnostic step for pipeline failures, when in reality, most failures in Vertex AI Pipelines stem from misconfigured artifact passing or code errors, not hardware limits.

53
MCQmedium

An ML engineer is scaling a prototype to production using Vertex AI Pipelines. The pipeline includes data validation, preprocessing, training, and deployment steps. They want to ensure that the pipeline can be reproduced and audited. What is the best practice?

A.Define the pipeline using Kubeflow Pipelines SDK and run it on Vertex AI Pipelines.
B.Use a Docker container with fixed tags and manually record runs.
C.Store all data and models in a single Cloud Storage bucket with no versioning.
D.Pin all library versions in a requirements.txt file.
AnswerA

Vertex AI Pipelines automatically tracks artifacts, parameters, and lineage.

Why this answer

Using a fully managed pipeline service like Vertex AI Pipelines automatically tracks artifacts, parameters, and lineage, ensuring reproducibility and auditability. Option B provides environment consistency but relies on manually recorded runs instead of built-in tracking. Option C keeps data and models without versioning, so earlier inputs and outputs cannot be traced or audited. Option D pins dependencies but does not cover pipeline orchestration or lineage.

54
MCQhard

A data science team has trained a TensorFlow model on-premises using a large dataset. When they try to deploy the model to Vertex AI for online predictions, the deployed model fails to start with a ‘MemoryError’. The model artifact is 2 GB, and the machine type is n1-standard-4 (15 GB RAM). What is the most likely cause?

A.The model is stored in a regional bucket and the Vertex AI endpoint is in a different region.
B.The machine type does not support TensorFlow models larger than 1 GB.
C.The model is too large for the machine's memory, causing an out-of-memory (OOM) error during loading.
D.The model file is corrupted or missing dependencies, causing a crash.
AnswerC

The 2 GB model may require more than 15 GB RAM during loading due to overhead and intermediate structures.

Why this answer

Option C is correct because the model artifact is 2 GB, and loading it into memory on an n1-standard-4 machine (15 GB RAM) can still cause a MemoryError. TensorFlow models often require additional memory for graph construction, intermediate tensors, and framework overhead, which can easily exceed the available RAM, especially when the model is loaded entirely into memory before serving.

Exam trap

Google Cloud often tests the misconception that a model fits in memory as long as its file size is below the machine's total RAM. The trap is that TensorFlow's memory footprint during loading and serving is significantly larger than the artifact size because of framework overhead and graph construction.

How to eliminate wrong answers

Option A is wrong because a regional bucket mismatch would cause a permission or access error, not a MemoryError; Vertex AI can access models from any regional bucket as long as the service account has proper permissions. Option B is wrong because there is no inherent machine type limitation that restricts TensorFlow models to 1 GB; the n1-standard-4 can handle larger models if sufficient memory is available. Option D is wrong because a corrupted file or missing dependencies would typically result in an ImportError or a crash with a different error message, not a MemoryError.

55
MCQeasy

An ML team is moving from a prototype Jupyter notebook to a production training pipeline. They want to ensure reproducibility. Which approach should they take?

A.Use interactive parameter tuning.
B.Use a container with fixed dependencies and record hyperparameters.
C.Export the notebook's output model directly.
D.Save the notebook as a .py file.
AnswerB

Captures environment and configuration for reproducibility.

Why this answer

Option B is correct because using a container with fixed dependencies and recording hyperparameters ensures that the training environment and configuration are captured, enabling exact reproduction. Option D is wrong because a .py file does not capture the full environment. Option C is wrong because exporting the notebook's output model directly lacks environment tracking.

Option A is wrong because interactive tuning is not reproducible.
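
Recording the configuration can be as simple as writing one structured record per run. The sketch below is one possible shape, assuming a hypothetical `build_run_record` helper and a placeholder image reference; it uses only the standard library and is not a Vertex AI API.

```python
import hashlib
import json
import platform


def build_run_record(hyperparameters, seed, image_uri):
    """Collect what is needed to rerun a training job with the same setup."""
    record = {
        "hyperparameters": hyperparameters,
        "seed": seed,
        "image_uri": image_uri,  # hypothetical pinned container image
        "python_version": platform.python_version(),
    }
    # Hash everything that defines the run so two runs are easy to compare.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["config_hash"] = hashlib.sha256(payload).hexdigest()
    return record


record = build_run_record(
    {"learning_rate": 0.01, "epochs": 5},
    seed=42,
    image_uri="example.com/train@sha256:<digest>",  # placeholder, not a real image
)
print(json.dumps(record, indent=2))
```

Referencing the image by digest rather than by a mutable tag is what keeps the environment fixed; the hash gives a short identifier for comparing runs.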

56
MCQmedium

A data scientist trains an XGBoost model on Vertex AI with a custom container. The model performs well on a held-out test set but fails to generalize in production. They suspect data leakage between training and validation. What is the best practice to prevent this?

A.Store and serve features using Vertex AI Feature Store with point-in-time correctness
B.Implement feature engineering in Vertex AI Pipelines to ensure temporal ordering
C.Store all features in BigQuery and join on timestamp during training and serving
D.Use Vertex AI AutoML instead of custom training
AnswerA

Feature Store provides consistent feature values for each timestamp, preventing leakage.

Why this answer

Option A is correct because Vertex AI Feature Store with point-in-time correctness ensures that for each training example, only feature values that were known at the time of the prediction (i.e., before the label occurred) are used. This prevents future data from leaking into the training set, which is the most common cause of poor generalization when temporal ordering matters. The Feature Store automatically retrieves the latest feature value as of a specified timestamp, eliminating the need for manual joins and windowing logic.
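
The point-in-time idea itself is small enough to sketch in plain Python. The `value_as_of` helper and the sample history are hypothetical and do not use the Feature Store API; they only show what "latest value as of a timestamp" means.

```python
from bisect import bisect_right


def value_as_of(history, timestamp):
    """Return the latest feature value recorded at or before `timestamp`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value was known yet.
    """
    times = [t for t, _ in history]
    index = bisect_right(times, timestamp)
    return history[index - 1][1] if index else None


# Hypothetical feature history: (event time, account balance).
history = [(1, 100), (5, 120), (9, 300)]

# A training example labeled at time 6 must see 120, not the later value 300.
print(value_as_of(history, 6))
```

A naive join that simply takes the most recent value in the table would return 300 for that example, which is exactly the leakage the question describes.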

Exam trap

Google Cloud often tests the misconception that simply using a pipeline or a data warehouse with timestamps is sufficient to prevent leakage, but the key is the automated enforcement of point-in-time correctness, which only a dedicated feature store with time-travel capabilities provides.

How to eliminate wrong answers

Option B is wrong because implementing feature engineering in Vertex AI Pipelines ensures reproducible workflows but does not inherently enforce temporal ordering or prevent data leakage; pipelines can still join future features if the data is not time-aware. Option C is wrong because storing all features in BigQuery and joining on timestamp during training and serving is a manual approach that is error-prone and does not guarantee point-in-time correctness; it requires careful windowing logic and can still leak future data if the join is not correctly scoped. Option D is wrong because using Vertex AI AutoML does not automatically solve data leakage; AutoML models are equally susceptible to leakage if the training data contains future information, and the user still needs to ensure temporal integrity of the input features.

57
MCQmedium

A data scientist trained a model on a single GPU but needs to train on multiple GPUs for a larger dataset. They observe that training time does not decrease linearly with additional GPUs. Which common issue is most likely?

A.Overfitting.
B.Model architecture too simple.
C.Learning rate too high.
D.Data pipeline bottleneck.
AnswerD

I/O or preprocessing bottleneck limits GPU utilization.

Why this answer

Option D is correct because a data pipeline bottleneck can starve GPUs, preventing linear speedup. Option A is wrong because overfitting relates to model performance, not training speed. Option C is wrong because learning rate affects convergence, not parallelism efficiency.

Option B is wrong because model architecture size does not directly cause non-linear speedup.
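
A back-of-the-envelope model makes the plateau visible. It assumes, as a simplification, that compute splits evenly across GPUs, that input loading overlaps with compute, and that communication overhead is ignored; the timings are invented for illustration.

```python
def step_time(num_gpus, compute_ms, input_ms):
    """Time per training step when compute is split across GPUs but the
    shared input pipeline must still deliver each batch."""
    return max(compute_ms / num_gpus, input_ms)


def speedup(num_gpus, compute_ms, input_ms):
    """Speedup relative to a single GPU under the same input pipeline."""
    baseline = step_time(1, compute_ms, input_ms)
    return baseline / step_time(num_gpus, compute_ms, input_ms)


# Hypothetical timings: 8 ms of compute and 2 ms of input work per step.
for gpus in (1, 2, 4, 8):
    print(gpus, speedup(gpus, compute_ms=8.0, input_ms=2.0))
```

Once the per-GPU compute time drops below the input time, adding GPUs no longer shortens the step, so the remedy is to speed up the input pipeline (for example with parallel reads and prefetching) rather than to add hardware.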
