PMLE Timed Hard Questions — 25 Difficult Questions

Question 1

A travel booking company has a real-time recommendation system that suggests hotels and flights to users. The model is served using TensorFlow Serving on a Google Kubernetes Engine (GKE) cluster with auto-scaling enabled. The cluster uses n1-standard-4 machine types. The team has set up Cloud Monitoring dashboards and alerts. Last week, during a major holiday promotion, the team noticed that the model's inference latency P99 increased from 150 ms to 450 ms over a 30-minute period, while the request throughput increased from 500 to 1,200 requests per second. CPU utilization across the cluster rose to 95%, but memory utilization remained at 60%. The model version and the serving infrastructure configuration have not changed since the last deployment. Which action should the team take to mitigate the latency issue?

Accepted Answer

Add more nodes to the GKE cluster to increase the total CPU resources available for serving. The latency spike is caused by CPU saturation (95% utilization) under increased load (500 to 1,200 RPS). Memory, at 60%, is not the constraint. Adding nodes directly increases the total CPU available, so the serving deployment can run more TensorFlow Serving replicas, or spread existing ones, without CPU contention. Because the model version and serving configuration have not changed, this is a load-driven capacity problem rather than a model or code regression, and a capacity fix is the most immediate remedy. Reducing the CPU request per pod, by contrast, only packs more pods onto the same saturated CPUs and adds no capacity.

Answer

Implement a feature engineering pipeline that compresses the input features to reduce data size and inference time.

Answer

Deploy a newer version of the model that uses a more efficient architecture to reduce computational complexity.

Answer

Increase the number of TensorFlow Serving instances by reducing the CPU request per pod in GKE to allow more pods per node.
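How many nodes to add can be estimated with back-of-the-envelope arithmetic. This minimal sketch rests on two assumptions: throughput scales linearly with CPU, and the cluster started with 10 nodes. The question does not give the node count, so 10 is hypothetical. The sketch derives per-node capacity from the observed 1,200 RPS at 95% CPU. It then returns the node count that brings average utilization down to a 60% target, which is also an assumed value.

```python
import math


def per_node_capacity(observed_rps: float, nodes: int, cpu_utilization: float) -> float:
    """RPS one node could serve at 100% CPU, assuming throughput scales linearly with CPU."""
    return observed_rps / (nodes * cpu_utilization)


def nodes_needed(target_rps: float, rps_per_node_at_full_cpu: float,
                 target_utilization: float = 0.6) -> int:
    """Smallest node count that keeps average CPU utilization at or below the target."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    effective_rps_per_node = rps_per_node_at_full_cpu * target_utilization
    return math.ceil(target_rps / effective_rps_per_node)


# Hypothetical starting point: 10 nodes at 95% CPU while serving 1,200 RPS.
capacity = per_node_capacity(observed_rps=1200, nodes=10, cpu_utilization=0.95)
print(nodes_needed(1200, capacity, target_utilization=0.6))
```

Leaving headroom below 100% matters because queueing delay, and with it P99 latency, climbs steeply as utilization nears saturation. The cluster autoscaler's maximum node count must also allow the result.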

Question 2

You are an ML engineer at a global e-commerce company. Your team has developed a deep learning model for product recommendation that runs on Vertex AI Prediction. The model is deployed on a single n1-highmem-2 instance (CPU only) with autoscaling enabled (min replicas=1, max replicas=10). During Black Friday, traffic spikes to 1000 requests per second (QPS), and you observe that latency increases from 50ms to over 5000ms, and many requests time out. You check the monitoring dashboard and see that CPU utilization is at 100% on the single instance, and autoscaling is not triggering quickly enough. The team has a budget for this service and wants to handle the spike without compromising latency. What should you do?

Accepted Answer

Switch to GPU instances (e.g., n1-standard-4 with T4) and set min replicas=2 with autoscaling up to 10. GPU instances, here an n1-standard-4 with an attached NVIDIA T4, offload the compute-intensive deep learning inference to the GPU. This can substantially reduce per-request latency and raise the throughput of each replica. Setting min replicas=2 keeps at least two replicas warm, so the endpoint can absorb the start of a spike while autoscaling provisions more replicas. Without that floor, a single saturated instance carries the load alone. Together these changes address both the CPU bottleneck and the slow scale-out. The latency actually achieved at 1,000 QPS still needs to be confirmed by load testing.

Answer

Increase min replicas to 5 to keep warm instances

Answer

Set min replicas=1 and max replicas=5 to control cost

Answer

Increase max replicas to 20 and keep CPU instances
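The deployment in the accepted answer can be expressed with the Vertex AI SDK (google-cloud-aiplatform). This is a minimal sketch, not a verified production setup. The resource names are placeholders, and the replica counts simply mirror the answer. The SDK call sits inside a function with a lazy import, so the configuration helper runs without the library or credentials.

```python
def gpu_deploy_config(min_replicas: int = 2, max_replicas: int = 10) -> dict:
    """Keyword arguments for aiplatform.Model.deploy: one T4 GPU per n1-standard-4 replica."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "machine_type": "n1-standard-4",
        "accelerator_type": "NVIDIA_TESLA_T4",
        "accelerator_count": 1,
        "min_replica_count": min_replicas,  # replicas kept warm before a spike
        "max_replica_count": max_replicas,  # ceiling for autoscaling
    }


def deploy_on_gpu(model_name: str, endpoint_name: str) -> None:
    """Deploy an uploaded model to an existing endpoint (needs google-cloud-aiplatform and credentials)."""
    from google.cloud import aiplatform  # lazy import so gpu_deploy_config runs anywhere

    model = aiplatform.Model(model_name)  # model resource name or ID (placeholder)
    endpoint = aiplatform.Endpoint(endpoint_name)  # endpoint resource name or ID (placeholder)
    # If the CPU deployment is still on the endpoint, a new deployment gets 0% traffic
    # unless traffic_percentage is set, so route all traffic to the GPU deployment here.
    model.deploy(endpoint=endpoint, traffic_percentage=100, **gpu_deploy_config())
```

After traffic has moved, the old CPU deployment can be undeployed so it stops incurring cost.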

Question 3

A financial services firm deploys a binary classification model for fraud detection. The model's precision is 0.95 and recall is 0.60 on the test set. After deployment, the fraud rate in production is 0.5% compared to 5% in the test set. The model shows good calibration on the test set (Brier score 0.02) but poor calibration in production (Brier score 0.15). What is the most likely explanation for the calibration degradation?

Accepted Answer

The relationship between features and the target has changed (concept drift), causing the model's probability estimates to be misaligned with the true probabilities. The model was well calibrated on a test set with a 5% fraud rate, but production has a 0.5% fraud rate. The model's scores estimate P(fraud | x) under the training base rate. When the base rate drops tenfold, the true posterior for the same x drops too, even if the feature profiles of fraud and non-fraud, P(x | y), are unchanged. This is known as prior probability shift. "Concept drift" is used here in the broad sense of a change in P(y | x), and the base-rate change is what drives it. The model therefore systematically overstates fraud probabilities in production, which raises the Brier score from 0.02 to 0.15. Its ranking of transactions may be largely intact, which is why recalibrating or prior-correcting the scores is a common first fix.

Answer

The distribution of input features has shifted significantly, causing the model to produce incorrect probabilities.

Answer

The model overfits to noise in the training data, leading to poor generalization.

Answer

The production data has a different class imbalance than the training data, causing the model to be biased toward the majority class.
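The base-rate effect in this question can be made concrete with Bayes' rule. This is a minimal sketch that assumes pure prior shift, meaning P(x | y) is identical in test and production. It re-weights a calibrated score from the 5% training prior to the 0.5% production prior. The Brier score helper shows the metric from the question. The example inputs are illustrative, not data from the question.

```python
def adjust_for_prior_shift(p: float, train_prior: float, new_prior: float) -> float:
    """Convert a probability calibrated under train_prior into one calibrated under new_prior.

    Assumes only the class prior changed (P(x | y) is the same in both settings).
    """
    positive = p * (new_prior / train_prior)
    negative = (1.0 - p) * ((1.0 - new_prior) / (1.0 - train_prior))
    return positive / (positive + negative)


def brier_score(probs, labels) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# A score of 0.5 under the 5% test-set base rate corresponds to roughly 0.087 at 0.5%.
print(round(adjust_for_prior_shift(0.5, train_prior=0.05, new_prior=0.005), 3))
```

The correction is monotonic, so it changes calibration but not the ordering of scores. That is consistent with calibration degrading while the model's ranking of transactions stays useful.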

Question 4

A logistics company uses Vertex AI AutoML Tables to predict delivery delays based on order attributes, weather data, and traffic data. The model is retrained weekly using a Vertex AI Pipeline that runs a BigQuery query to get training data, then triggers AutoML training. Recently, the pipeline fails with the error 'Dataset not found' when the AutoML training step starts. The BigQuery query runs successfully and outputs a table. Which is the most likely cause?

Accepted Answer

The BigQuery output table is not being passed as a Vertex AI Dataset resource. AutoML tabular training on Vertex AI reads its data through a Vertex AI Dataset resource, a managed metadata wrapper that points to the source data, rather than directly from a BigQuery table. The pipeline's query succeeds and writes a table. However, if that table is never registered as a Vertex AI tabular dataset and passed to the training step, AutoML training cannot find the dataset it was told to use, which produces the 'Dataset not found' error. Registration can be done with `aiplatform.TabularDataset.create()` in the SDK or with a dataset-creation component in the pipeline.

Answer

The AutoML training step is referencing a different dataset location.

Answer

The training data has been manually deleted from Cloud Storage.

Answer

The pipeline's IAM permissions are insufficient to access BigQuery.
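The missing registration step can be sketched with the Vertex AI SDK (google-cloud-aiplatform). This is a minimal sketch under stated assumptions. The display names are hypothetical. The prediction type defaults to "regression" on the assumption that the model predicts minutes of delay; use "classification" for a delayed/on-time label. The one-node-hour budget is an example value. Only the URI helper runs without the library and credentials.

```python
def bq_source_uri(project: str, dataset: str, table: str) -> str:
    """BigQuery URI in the bq://project.dataset.table form accepted by TabularDataset.create."""
    return f"bq://{project}.{dataset}.{table}"


def train_on_query_output(project: str, location: str, bq_dataset: str, bq_table: str,
                          target_column: str, prediction_type: str = "regression"):
    """Register the query's output table as a Vertex AI dataset, then train AutoML on it."""
    from google.cloud import aiplatform  # lazy import: needs google-cloud-aiplatform

    aiplatform.init(project=project, location=location)
    # The step the failing pipeline lacks: wrap the BigQuery table in a Dataset resource.
    dataset = aiplatform.TabularDataset.create(
        display_name="delivery-delay-training",  # hypothetical name
        bq_source=bq_source_uri(project, bq_dataset, bq_table),
    )
    job = aiplatform.AutoMLTabularTrainingJob(
        display_name="delivery-delay-automl",  # hypothetical name
        optimization_prediction_type=prediction_type,  # "regression" or "classification"
    )
    # Pass the Dataset resource, not the raw table name, to the training job.
    return job.run(dataset=dataset, target_column=target_column,
                   budget_milli_node_hours=1000)
```

Inside a Vertex AI Pipeline, the same idea applies: the dataset-creation step's output artifact, not the table name string, is what the AutoML training component consumes.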

Question 5

A financial services company has deployed a classification model on Vertex AI to detect fraudulent transactions. The model is monitored using Vertex AI Model Monitoring for skew and drift detection, and also logs predictions to BigQuery for analysis. After a month, the monitoring alerts show a significant drift in one feature (transaction_amount). Which TWO actions should the team take to diagnose and address this issue?

Accepted Answer

Compare the feature distribution in the training data with the recent serving data using statistical tests. The standard first step is to compare the training distribution of transaction_amount with recent serving data. A two-sample test such as Kolmogorov-Smirnov, or a distance measure such as Jensen-Shannon divergence, quantifies the shift and shows whether it is meaningful rather than noise. This diagnosis tells the team what changed and by how much before they choose a remediation such as retraining. Vertex AI Model Monitoring already computes a distribution distance for its alerts. Because predictions are also logged to BigQuery, the team can reproduce the comparison there and inspect which ranges of transaction_amount moved.

Answer

Increase the frequency of model monitoring checks to every hour.

Answer

Increase the sampling rate for prediction logging to ensure full data capture.

Answer

Reduce the alert threshold to minimize false positives.
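Both diagnostics mentioned in this question can be sketched in plain Python. In practice they would run on transaction_amount values exported from the training table and from the BigQuery prediction log. The bin edges are an assumption the team would choose. This is an illustration rather than the exact computation Vertex AI Model Monitoring performs. scipy.stats.ks_2samp returns the same KS statistic along with a p-value. Note that scipy.spatial.distance.jensenshannon returns the square root of the divergence computed here.

```python
import bisect
import math


def ks_statistic(a, b) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the ECDF difference only changes at sample points
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def histogram(values, edges):
    """Normalized bin frequencies; values outside the edges are clipped into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        counts[min(max(i, 0), len(counts) - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical histograms, 1 for disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; m_i > 0 whenever x_i > 0.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A large KS statistic with a small p-value, or a JS divergence above the monitoring threshold, confirms the alert. The histograms then show which amount ranges gained or lost mass.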

Question 6

You are designing an ML pipeline for a large-scale recommendation system that runs weekly retraining on historical user interaction data. The pipeline uses TensorFlow and is deployed on Google Cloud. The pipeline must be orchestrated and automated with minimal manual intervention. Which THREE options should you include in your design? (Choose three.)

Accepted Answer

Use Vertex AI Pipelines to define the ML pipeline as a Directed Acyclic Graph (DAG) of components. Vertex AI Pipelines is a managed, serverless service for orchestrating ML workflows defined as DAGs of components, written with the Kubeflow Pipelines or TFX SDK. It meets the requirement for automated, minimal-intervention weekly retraining. You define reusable components for data extraction, training, evaluation, and deployment, then trigger runs on a schedule (for example, with Cloud Scheduler) or from events. It also integrates natively with TensorFlow and other Google Cloud services.

Answer

Use BigQuery scheduled queries to run the training script on a schedule.

Answer

Use AI Platform Notebooks to schedule the training job on a recurring basis.
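What "a DAG of components" means can be shown with a small plain-Python illustration. It uses the standard library's graphlib module, not the Vertex AI or Kubeflow Pipelines API, and the step names are hypothetical. In Vertex AI Pipelines, the same dependencies arise from passing one component's outputs to another inside a KFP pipeline function. The orchestrator then runs each step only after its inputs exist.

```python
from graphlib import CycleError, TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical weekly retraining pipeline).
PIPELINE = {
    "extract_data": set(),
    "validate_data": {"extract_data"},
    "transform": {"extract_data"},
    "train": {"validate_data", "transform"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def execution_order(dag: dict) -> list:
    """Return an order in which every step runs after all of its dependencies.

    Raises graphlib.CycleError if the graph is not acyclic, which is why pipelines must be DAGs.
    """
    return list(TopologicalSorter(dag).static_order())


print(execution_order(PIPELINE))
```

validate_data and transform have no dependency on each other, so an orchestrator is free to run them in parallel. The topological order only guarantees that both finish before train starts.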

Question 7

A team uses Vertex AI Feature Store to serve features for real-time predictions. They notice that feature values are frequently updated from multiple source systems, leading to inconsistencies. They need to ensure that feature values are consistent across all serving endpoints. What should they do?

Accepted Answer

Use a streaming ingestion pipeline with exactly-once semantics. Streaming ingestion with exactly-once semantics applies each feature update precisely once, so retries and redeliveries from the source systems cannot create duplicate or missing updates. In practice this is paired with an event timestamp on every record, so that the most recent value wins when sources disagree. Together, these keep the online store, and therefore every serving endpoint, consistent in near real time, which directly addresses frequent updates arriving from multiple source systems.

Answer

Use batch ingestion with weekly updates to reduce update frequency

Answer

Increase the offline storage TTL to retain historical feature values

Answer

Implement a manual approval process for feature updates
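The two properties that make multi-source updates consistent can be sketched in plain Python. First, duplicate deliveries are dropped by event ID (effectively exactly-once). Second, conflicts are resolved by event time, with the latest value winning. OnlineStore is a hypothetical in-memory stand-in, not the Vertex AI Feature Store API. Breaking timestamp ties by event ID is a design assumption that makes the result independent of arrival order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureUpdate:
    event_id: str      # unique per source event; used to drop redeliveries
    entity_id: str
    feature: str
    value: float
    event_time: float  # seconds since epoch, set by the source system


class OnlineStore:
    """Hypothetical in-memory stand-in for an online feature store."""

    def __init__(self):
        self._seen = set()
        self._values = {}  # (entity_id, feature) -> (event_time, event_id, value)

    def apply(self, u: FeatureUpdate) -> bool:
        """Apply an update once; return False if this event was already applied."""
        if u.event_id in self._seen:
            return False
        self._seen.add(u.event_id)
        key = (u.entity_id, u.feature)
        current = self._values.get(key)
        # Latest event time wins; ties are broken by event_id so arrival order never matters.
        if current is None or (u.event_time, u.event_id) > current[:2]:
            self._values[key] = (u.event_time, u.event_id, u.value)
        return True

    def get(self, entity_id: str, feature: str):
        entry = self._values.get((entity_id, feature))
        return None if entry is None else entry[2]
```

Because the outcome depends only on the set of events, two serving replicas that consume the same stream in different orders end up with the same values.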

Question 8

A company uses Vertex AI Prediction with a custom container for a TensorFlow model. They notice that after deploying a new model version, requests still go to the old version. What is the most likely cause?

Accepted Answer

Traffic is not split to the new model version. On Vertex AI Prediction, deploying a new model to an endpoint that already serves a model does not necessarily move any traffic to it. For example, the Vertex AI SDK's `traffic_percentage` defaults to 0 when the endpoint already has deployed models, so all requests continue to go to the old version. The fix is to update the endpoint's traffic split, for example in the console or with `gcloud ai endpoints update` and its `--traffic-split` flag.

Answer

The custom container is not compatible with Vertex AI

Answer

The model is cached and needs cache invalidation

Answer

The new model version was not deployed to the same endpoint
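The `--traffic-split` flag of `gcloud ai endpoints update` takes comma-separated DEPLOYED_MODEL_ID=PERCENT pairs. The minimal helper below builds that value and checks that the percentages are non-negative and sum to 100. The deployed model IDs in the usage example are hypothetical; real ones are listed by `gcloud ai endpoints describe`.

```python
def traffic_split_arg(split: dict[str, int]) -> str:
    """Format a {deployed_model_id: percent} map as the value for --traffic-split."""
    if any(pct < 0 for pct in split.values()):
        raise ValueError("traffic percentages must be non-negative")
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    return ",".join(f"{model_id}={pct}" for model_id, pct in split.items())


# Hypothetical IDs: send all traffic to the newly deployed model (2222).
print("gcloud ai endpoints update ENDPOINT_ID --region=REGION "
      f"--traffic-split={traffic_split_arg({'1111': 0, '2222': 100})}")
```

For a gradual rollout, the same helper can produce intermediate splits such as 90/10 before the final switch.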

Question 9

A company uses Vertex AI Pipelines to orchestrate their ML training workflow. The pipeline includes a BigQuery ML training step, a model evaluation step, and a deployment step to Vertex AI Endpoints. The engineer notices that the pipeline fails intermittently due to a quota exceeded error on Vertex AI Endpoints during model deployment. What is the best long-term solution to prevent this failure?

Accepted Answer

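Add retry logic with exponential backoff to the deployment step in the pipeline. Because the quota error is intermittent, it reflects momentary contention for Vertex AI Endpoints quota. Retrying with exponentially increasing waits, plus random jitter, lets the deployment succeed once capacity frees up, without manual reruns. Requesting a permanent quota increase is less reliable, because increases are requested through the Quotas page of the Cloud console and are neither guaranteed nor immediate. Deploying with a custom container on Compute Engine does not address the quota limit and gives up the managed endpoint workflow. Running steps sequentially with longer waits only adds fixed delays and still fails whenever the quota is exhausted.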

Answer

Run the pipeline steps sequentially with longer wait times.

Answer

Switch to deploying models using a custom container on Compute Engine.

Answer

Request a permanent quota increase for Vertex AI Endpoints.
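Exponential backoff with full jitter can be sketched as a plain-Python wrapper around a deploy call. QuotaExceededError is a hypothetical stand-in for whatever exception the real deploy call raises on a quota error. The attempt count and delays are example values. The sleep and random functions are injectable so the schedule can be tested. KFP v2 also offers step-level retries through PipelineTask.set_retry(); check the KFP reference for its backoff options.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the quota error raised by the deployment call."""


def retry_with_backoff(fn, max_attempts=5, base_delay=2.0, max_delay=60.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn, retrying quota errors with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error so the pipeline step fails
            # Cap doubles each attempt (2 s, 4 s, 8 s, ...) up to max_delay;
            # full jitter sleeps a random time in [0, cap) to avoid synchronized retries.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
```

Jitter matters when several pipeline runs hit the quota at the same moment. Without it, they would all retry in lockstep and collide again.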

Question 10

An ML team uses Vertex AI Pipelines to automate model retraining. The pipeline includes a step that queries BigQuery to create a training dataset. The team notices that the pipeline fails intermittently with a '403 Exceeded rate limits' error. What is the most likely cause and solution?

Accepted Answer

The pipeline is issuing too many concurrent queries; use a BigQuery reservation to guarantee slot capacity. A 403 'Exceeded rate limits' (rateLimitExceeded) error means the project sent too many requests or jobs in a short period, typically too many concurrent queries. A BigQuery reservation gives the pipeline dedicated slot capacity. Its queries then no longer compete with other workloads in the project for shared on-demand capacity, so they complete more predictably. A reservation provides compute capacity but does not lift BigQuery's API request-rate limits, so the pipeline should also cap how many queries it submits at once and retry rate-limit errors with exponential backoff.

Answer

The training dataset is too large; partition the table and query only the latest partition

Answer

The pipeline step timeout is too short; increase the timeout to 30 minutes

Answer

The SQL query is inefficient; rewrite it using materialized views
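One way to keep a pipeline step from submitting too many BigQuery queries at once is to cap concurrency with a thread pool. This is a minimal sketch. FakeQueryRunner is a hypothetical stand-in that records peak concurrency; in real code, run_query might wrap `bigquery.Client().query(sql).result()`. The limit of 3 in the usage example is arbitrary, not a BigQuery quota value.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_queries(queries, run_query, max_concurrent=4):
    """Run queries with at most max_concurrent in flight; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_query, queries))


class FakeQueryRunner:
    """Hypothetical stand-in for a BigQuery call that records peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak = 0

    def __call__(self, sql):
        with self._lock:
            self._in_flight += 1
            self.peak = max(self.peak, self._in_flight)
        try:
            time.sleep(0.01)  # simulate query latency
            return f"done: {sql}"
        finally:
            with self._lock:
                self._in_flight -= 1


runner = FakeQueryRunner()
results = run_queries([f"q{i}" for i in range(12)], runner, max_concurrent=3)
print(runner.peak, results[0])
```

A fixed-size pool is the simplest cap. Combined with retries on rate-limit errors, it smooths out bursts that would otherwise trip the limit.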

Question 11

Your team has deployed a text classification model on Vertex AI Endpoints. You notice that the model's latency has increased significantly over the last week, but the request rate has remained stable. Which of the following is the most likely cause?

Accepted Answer

A change in the preprocessing logic that now includes a computationally expensive step. With a stable request rate, rising latency points to more work per request rather than more requests. Preprocessing may run on the serving path, for example in a custom container, in a custom prediction routine, or inside the exported model graph. In that case it executes before model inference on every request. A heavy operation such as complex regular expressions, large lookups, or a call to an external API therefore adds its cost to every prediction's response time.

Answer

A sudden increase in the number of prediction requests

Answer

The model was replaced with a larger version without updating the endpoint

Answer

A misconfiguration in the autoscaling policy
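A quick way to confirm that preprocessing, not inference, is responsible for added latency is to time each stage of the serving handler separately. This is a minimal plain-Python sketch. The stage names are hypothetical, and a real custom container would wrap its own preprocessing and predict calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock durations (in milliseconds) per named serving stage."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append((time.perf_counter() - start) * 1000.0)

    def mean_ms(self):
        return {name: sum(v) / len(v) for name, v in self.samples.items()}


# Hypothetical request handler: wrap each stage to see where time is spent.
timer = StageTimer()
with timer.stage("preprocess"):
    time.sleep(0.005)  # stand-in for the new, expensive preprocessing step
with timer.stage("predict"):
    pass  # stand-in for model inference
print(timer.mean_ms())
```

Logging these per-stage means alongside each request makes a preprocessing regression visible in Cloud Logging even when the request rate is flat.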

Question 12

You are an ML engineer at a large e-commerce company. Your team has developed a product recommendation model using TensorFlow and deployed it on Vertex AI Endpoints for real-time inference. The model is retrained weekly using a Vertex AI Pipeline that reads new user interaction data from BigQuery, trains the model, evaluates it, and deploys the new version to the endpoint with a traffic split: 10% to the new model and 90% to the previous champion model. Recently, the team noticed that the new model's online prediction latency has increased significantly (from 50ms to 200ms) after deployment, causing timeouts for some requests. The training code has not changed, and the model size is similar. The pipeline uses a custom container with the same TensorFlow Serving image as before. The deployment step uses the same machine type (n1-standard-4) for the endpoint. What is the most likely cause of the latency increase?

Accepted Answer

The pipeline now includes a data validation step that modifies the SavedModel's serving signature, adding an extra preprocessing operation. The added operation runs inside the serving graph on every request to the Vertex AI endpoint. That explains per-request latency rising from 50 ms to 200 ms even though the model architecture and size are unchanged. The training code, custom container, TensorFlow Serving image, and machine type are all the same, so the most plausible remaining difference is the exported serving graph itself. Comparing the serving_default signatures of the two SavedModels, for example with `saved_model_cli show`, would confirm it.

Answer

The endpoint is using a machine type that is not optimized for the new model's computation.

Answer

The new model has a significantly different architecture that requires more computation.

Answer

The new model is experiencing data skew because the training data distribution has changed.
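With a 10%/90% traffic split, latency can be compared per deployed model to confirm that the regression belongs to the new version only. This is a minimal sketch using nearest-rank percentiles. The latency samples would come from request logs and are hypothetical here. The 1.5x tolerance is an arbitrary example threshold.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_regressed(old_ms, new_ms, q=99, max_ratio=1.5) -> bool:
    """True if the new version's q-th percentile latency exceeds max_ratio x the old version's."""
    return percentile(new_ms, q) > max_ratio * percentile(old_ms, q)


# Hypothetical samples matching the question: ~50 ms for the champion, ~200 ms for the new model.
print(latency_regressed([50.0] * 100, [200.0] * 100))
```

A check like this could also run as an automated gate in the retraining pipeline, before the new model's traffic share is increased.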

Question 13

A machine learning engineer needs to share a trained model with the product team for integration. The model is stored in Cloud Storage, and the product team’s service account needs read access. The engineer wants to follow the principle of least privilege. Which IAM configuration should be used?

Accepted Answer

Grant the product team's service account the roles/storage.objectViewer role at the bucket level. The roles/storage.objectViewer role grants read-only access to objects (list and get) without permission to modify or delete them. Binding it on the specific bucket rather than on the project keeps the service account out of every other bucket in the project. Together, these make it the least-privilege way to give a service account durable read access to the model files. A signed URL, by contrast, expires and grants access to anyone holding it, which makes it a poor fit for a service-to-service integration.

Answer

Generate a signed URL with read access and share it with the product team.

Answer

Grant the product team's service account the roles/storage.objectAdmin role at the bucket level.

Answer

Grant the product team's service account the roles/storage.objectViewer role at the project level.
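A bucket-level grant can be made with `gcloud storage buckets add-iam-policy-binding gs://BUCKET --member=serviceAccount:EMAIL --role=roles/storage.objectViewer`, or with the google-cloud-storage client as sketched below. The bucket name and service account are placeholders. The pure add_member helper mirrors the read-modify-write on an IAM policy's bindings list, so the logic can be checked without credentials.

```python
def add_member(policy: dict, role: str, member: str) -> dict:
    """Add member to role in a JSON-style IAM policy, without duplicating existing entries."""
    bindings = policy.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return policy
    bindings.append({"role": role, "members": [member]})
    return policy


def grant_bucket_read(bucket_name: str, service_account_email: str) -> None:
    """Grant read-only object access on one bucket (needs google-cloud-storage and credentials)."""
    from google.cloud import storage  # lazy import so add_member runs anywhere

    bucket = storage.Client().bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        "role": "roles/storage.objectViewer",  # read-only: list and get objects
        "members": {f"serviceAccount:{service_account_email}"},
    })
    bucket.set_iam_policy(policy)
```

Scoping the binding to the bucket, rather than the project, is what keeps the grant least-privilege.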

Question 14

A healthcare company uses AutoML Tables to predict patient readmission risk. The dataset contains 500,000 rows and 200 features, including patient demographics, lab results, and medical history. The model accuracy is lower than expected. The engineer wants to improve performance using low-code techniques. Which THREE actions are most effective? (Choose THREE.)

Accepted Answer

Remove highly correlated features using AutoML Tables' built-in feature importance analysis. AutoML Tables reports feature importance for a trained model, which helps identify features that contribute little. Combined with a simple correlation check, this lets the engineer drop redundant, highly correlated features and retrain. Fewer redundant inputs reduce noise and multicollinearity, which can improve model quality. The work stays low-code because the engineer only changes which columns are included in the next training run.

Answer

Increase the training time budget to the maximum allowed.

Answer

Use a custom model architecture via AutoML Tables advanced options.

Question 15

A financial institution uses BigQuery ML to train a linear regression model to predict loan default risk. The model is trained on a dataset with 100 million rows and 50 features. During inference, the engineer uses the ML.PREDICT function. However, the query takes several minutes to run and times out frequently. The data is static and updated monthly. What is the most cost-effective and low-code solution to improve prediction latency?

Accepted Answer

Export the trained model as a SQL function using the EXPORT MODEL statement, then use it for predictions. The idea behind this answer is to turn the linear model into plain SQL so predictions no longer go through `ML.PREDICT`. Be aware that the `EXPORT MODEL` statement itself writes the model to Cloud Storage (for linear models, as a TensorFlow SavedModel) rather than creating a SQL function. For a linear regression, the low-code way to get a SQL expression is to read the learned coefficients with `ML.WEIGHTS` and compute the prediction as the intercept plus the weighted sum of the features. That expression runs as ordinary SQL in the same query, with no external pipeline, which is what makes the approach low-code and cost-effective.

Answer

Create a Dataflow pipeline to precompute predictions and store them in a separate table.

Answer

Use a materialized view to precompute the prediction features.

Answer

Increase the BigQuery compute capacity by reserving more slots.
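A linear regression's prediction is the intercept plus the sum of weight times feature, so it can be written as a SQL expression once the coefficients are known. BigQuery ML's ML.WEIGHTS function returns those coefficients. This minimal sketch makes three assumptions: all features are numeric, the weights apply to raw (unstandardized) values, and the feature names are valid column names. Categorical features would need their per-category weights handled separately. The names and numbers below are hypothetical.

```python
def linear_prediction_sql(weights: dict[str, float], intercept: float) -> str:
    """Build 'intercept + (w1 * f1) + ...' for numeric features (names assumed to be columns)."""
    terms = [repr(float(intercept))]
    for feature, w in weights.items():
        terms.append(f"({float(w)!r} * {feature})")
    return " + ".join(terms)


def predict(row: dict[str, float], weights: dict[str, float], intercept: float) -> float:
    """The same linear formula evaluated in Python, useful for spot-checking the SQL."""
    return intercept + sum(w * row[f] for f, w in weights.items())


weights = {"income": 0.3, "age": -0.1}  # hypothetical coefficients from ML.WEIGHTS
print(f"SELECT {linear_prediction_sql(weights, 1.2)} AS predicted_risk FROM loans")
```

Spot-checking a few rows against ML.PREDICT output is a sensible guard before relying on the generated expression.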

Question 16

Which TWO factors should you consider when choosing between BigQuery and Cloud Storage for storing training data? (Choose 2)

Accepted Answer

The format of the data: structured vs. unstructured. BigQuery is optimized for structured, tabular data with a schema, which it can load from formats such as CSV, Avro, and Parquet, and it lets you query and transform that data with SQL. Cloud Storage is a better fit for unstructured data with no fixed schema, such as images, videos, and raw text files. The choice therefore depends on whether the training data is tabular and benefits from relational querying, or is file-based and needs high-throughput reads by the training job.

Answer

The requirement for data encryption at rest.

Answer

The need for fine-grained access control at the row level.

Answer

The maximum size of the dataset (BigQuery limit 1 TB).

Question 17

A company trains a model using Vertex AI Training and then deploys it to Vertex AI Prediction. They notice that prediction requests fail with 'InvalidArgument: input tensor shape mismatch'. Which THREE are possible causes?

Accepted Answer

The input data types do not match the expected types (e.g., float vs int). Prediction requests must match the input dtypes and shapes declared in the model's serving signature. A request may supply values that cannot be converted to the declared type, for example strings where float32 is expected, or nest values so the tensor has the wrong rank. In either case, TensorFlow Serving, on which Vertex AI's prebuilt TensorFlow prediction containers are built, rejects the request with an InvalidArgument error. The exact message depends on which check fails, but a request that disagrees with the signature is a common cause of this class of error.

Answer

The model was exported in a different format than supported

Answer

The batch size in the request is too large
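Malformed requests can be caught before they reach the endpoint with a client-side check of the payload. This is a minimal sketch. It assumes each instance should be a flat list of 4 numeric features, a hypothetical expectation; the real shapes and dtypes come from the model's serving signature, for example via `saved_model_cli show`.

```python
def validate_instances(instances, expected_len: int):
    """Return a list of problems with the 'instances' of a Vertex AI-style prediction payload."""
    problems = []
    for i, row in enumerate(instances):
        if not isinstance(row, list):
            problems.append(f"instance {i}: expected a list, got {type(row).__name__}")
            continue
        if len(row) != expected_len:
            problems.append(f"instance {i}: expected {expected_len} values, got {len(row)}")
        for j, v in enumerate(row):
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(v, bool) or not isinstance(v, (int, float)):
                problems.append(f"instance {i}, value {j}: not numeric ({v!r})")
    return problems


payload = {"instances": [[0.1, 2.0, 3.5, 1.0], [0.1, "2.0", 3.5]]}
print(validate_instances(payload["instances"], expected_len=4))
```

Running this in the client, or in a pipeline test against sample requests, turns an opaque server-side InvalidArgument into a specific message about which instance and value are wrong.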

Question 18

A team uses Vertex AI Pipelines with CustomJob components that pull training code from a Cloud Source Repository. The pipeline fails with a 'Permission denied' error when trying to access the repository. The service account used by the pipeline has the 'Source Repository Viewer' role. What is the likely issue?

Accepted Answer

The 'Source Repository Viewer' role is insufficient; the service account needs 'Source Repository Reader' or higher. As the question frames it, the 'Source Repository Viewer' role only allows seeing repository metadata, not reading the source code. To clone or fetch from a Cloud Source Repository, the service account needs the Source Repository Reader role (roles/source.reader) or higher. That role includes the source.repos.get permission used for clone and fetch operations. Without it, the CustomJob's attempt to pull the code fails with 'Permission denied'.

Answer

The training code contains a dependency that is not available in the custom container

Answer

The pipeline is running in a different project than the repository; cross-project access is not supported

Answer

The repository URL is incorrectly formatted; use the SSH URL instead of HTTPS

Question 19

A company serves a scikit-learn model on Vertex AI Prediction but receives a 400 error with 'Prediction failed: Model evaluation error'. What is the most likely cause?

Accepted Answer

The model uses a scikit-learn version not supported by Vertex AI. Each of Vertex AI's prebuilt scikit-learn prediction containers targets a specific scikit-learn version. The model may have been serialized (as a pickle or joblib file) with a version that no available container supports. In that case, the serving runtime may fail to load or evaluate it, and the endpoint returns a 'Model evaluation error'. When the request format is otherwise correct, this training/serving version mismatch is the most likely cause of the 400 error. The fix is to check the list of supported prebuilt containers and retrain or re-export with a matching version.

Answer

The input data format is incorrect

Answer

The model was trained with a different framework

Answer

The endpoint is overloaded and timing out
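A version mismatch can be caught before deployment by recording scikit-learn's version at training time (sklearn.__version__) and checking it against the versions the serving containers support. This is a minimal sketch that compares major.minor versions only. The supported list in the example is hypothetical; the real one is in Vertex AI's prebuilt container documentation.

```python
def major_minor(version: str) -> tuple[int, int]:
    """Parse '1.3.2' or '1.3' into (1, 3)."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def pick_serving_version(training_version: str, supported: list[str]):
    """Return the supported version whose major.minor matches training, or None."""
    target = major_minor(training_version)
    for v in supported:
        if major_minor(v) == target:
            return v
    return None


SUPPORTED = ["1.0", "1.3", "1.5"]  # hypothetical list of container versions
print(pick_serving_version("1.3.2", SUPPORTED))   # matching container exists
print(pick_serving_version("0.19.2", SUPPORTED))  # no match: retrain with a supported version
```

Failing the pipeline when this returns None is cheaper than discovering the mismatch through 400 errors after deployment.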

Question 20

Your company uses a custom container for model serving on Vertex AI. After a recent update, the model returns predictions but they are clearly wrong (e.g., negative probabilities for a classification model). The logs show no errors. What is the most likely cause?

Accepted Answer

The preprocessing code in the container was updated but the model was not retrained on the new preprocessing. Predictions that come back without errors but are clearly wrong point to a mismatch between the preprocessing used in training and the preprocessing now running at inference. Suppose the update changed scaling, normalization, or feature engineering, and the model was not retrained on data processed the same way. The model then receives inputs from a different distribution than it learned from, and its outputs become meaningless. No runtime error is raised, because the container simply executes the code it was given. Literally negative probabilities also suggest that the updated code is returning raw scores (logits) or mishandling post-processing, since a sigmoid or softmax output cannot be negative. Either way, the fault lies in the code that changed with the update rather than in the serving framework.

Answer

The model file is corrupted

Answer

The model file was accidentally replaced with a different model

Answer

The container is using an incompatible version of the serving framework
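Outputs like these can be caught automatically with a sanity check on predictions, for example in a post-deployment smoke test. This is a minimal sketch that assumes each prediction is a list of class probabilities. The softmax helper shows that properly normalized outputs always pass, so a failing row usually means raw scores or broken post-processing.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def check_probabilities(rows, tol=1e-6):
    """Return indices of rows that are not valid probability vectors (in [0, 1], summing to 1)."""
    bad = []
    for i, row in enumerate(rows):
        in_range = all(-tol <= p <= 1 + tol for p in row)
        sums_to_one = math.isclose(sum(row), 1.0, abs_tol=tol)
        if not (in_range and sums_to_one):
            bad.append(i)
    return bad


predictions = [[0.2, 0.8], [-0.3, 1.3], softmax([-2.0, 3.0])]
print(check_probabilities(predictions))  # flags the row with a negative "probability"
```

Running such a check on a fixed set of sample inputs after every container update would catch this class of silent failure before users do.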