CCNA Automating and orchestrating ML pipelines Questions

51 questions · Automating and orchestrating ML pipelines · All types, answers revealed

1
MCQ · hard

A company uses Vertex AI Pipelines with prebuilt components for data processing, training, and deployment. They need to integrate a custom validation step written in Python. What is the correct way to include this as a component?

A. Package the code in a Docker container and reference it as a custom job
B. Define the step in the YAML pipeline definition using arbitrary Python commands
C. Create a custom component using the Vertex AI Pipelines SDK @component decorator
D. Use a Cloud Function as a pipeline step
E. Write a standalone Python script and call it using a Cloud Shell step
Answer: C

Standard method for custom components.

Why this answer

Option C is correct because the Kubeflow Pipelines (KFP) SDK used by Vertex AI Pipelines provides a `@component` decorator that turns a Python function into a pipeline component. The decorator generates the component specification, runs the function inside a container started from a base image, and wires the function's typed inputs and outputs into the pipeline orchestration engine. It is the idiomatic and recommended way to add custom validation logic without manually building Docker images or managing infrastructure.

Exam trap

The trap here is that candidates often confuse the `@component` decorator with a simple function wrapper and assume they can just write inline Python code in the pipeline YAML (Option B), not realizing that Vertex AI Pipelines requires each step to be a containerized component with explicit input/output definitions.

How to eliminate wrong answers

Option A is wrong because packaging code in a Docker container and referencing it as a custom job would create an independent job outside the pipeline DAG, losing the ability to pass inputs/outputs between pipeline steps and breaking the orchestration flow. Option B is wrong because Vertex AI Pipelines YAML definitions do not support arbitrary Python commands; they require prebuilt or custom component definitions with proper container specifications. Option D is wrong because Cloud Functions are event-driven serverless functions not designed for pipeline step integration; they lack native support for pipeline I/O, artifact tracking, and retry logic within Vertex AI Pipelines.

Option E is wrong because Cloud Shell is an interactive environment for ad-hoc commands, not a pipeline execution step; it cannot be used as a component in a Vertex AI Pipeline and would not support parameter passing or artifact management.
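As a rough illustration of option C, the sketch below wraps a small validation function as a lightweight component. It assumes the KFP v2 SDK (`kfp.dsl.component`); the function name `validate_schema`, its parameters, and the `python:3.10` base image are hypothetical and not taken from the question. The `kfp` import is deferred into a helper so the validation logic itself can be exercised without the SDK installed.

```python
def validate_schema(columns: list, required: list) -> bool:
    """Return True only if every required column is present."""
    missing = [name for name in required if name not in columns]
    if missing:
        # Raising makes the step fail, so downstream steps do not run.
        raise ValueError(f"missing required columns: {missing}")
    return True


def as_pipeline_component():
    """Wrap validate_schema as a KFP component (requires the kfp package)."""
    from kfp import dsl  # assumption: KFP v2 is installed where this is called

    # Hypothetical base image; the function body runs inside this container.
    return dsl.component(base_image="python:3.10")(validate_schema)
```

Calling `as_pipeline_component()` inside a pipeline module would yield an object usable as a step; the plain function stays easy to unit test on its own.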

2
MCQ · hard

A team uses Cloud Composer to orchestrate a complex ML pipeline with many tasks. They notice that the DAG parsing time is very high, causing delays in task scheduling. Which action would most effectively reduce DAG parsing time?

A. Remove all DAG files that are not currently needed from the bucket
B. Increase the parallelism of the Airflow scheduler
C. Optimize DAG files to avoid heavy top-level imports and database queries
D. Combine all DAGs into a single file
Answer: C

Top-level imports/queries are executed on every parse, so reducing them speeds up parsing.

Why this answer

Option C is correct because heavy top-level imports and database queries in DAG files are executed every time the scheduler parses the DAG, which happens frequently (default every 30 seconds). By moving imports inside Python callables or using lazy loading, the parsing time is drastically reduced, allowing the scheduler to process DAGs faster and trigger tasks without delay.

Exam trap

Google Cloud often tests the misconception that reducing the number of DAG files or increasing scheduler resources will fix parsing delays, when the real bottleneck is the top-level code execution inside each DAG file.

How to eliminate wrong answers

Option A is wrong because removing unused DAG files reduces clutter but does not address the root cause of high parsing time; the scheduler still parses all present DAG files, and if they contain heavy top-level code, parsing remains slow. Option B is wrong because increasing scheduler parallelism (e.g., the `parallelism` setting in Airflow's `[core]` configuration) only raises how many task instances can run concurrently; it does not make an individual DAG file cheaper to parse. Option D is wrong because combining all DAGs into a single file does not remove the expensive top-level code; the scheduler must still execute one very large file with all dependencies loaded at once, and a change to any one DAG forces every DAG in that file to be re-parsed.
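The idea behind option C can be sketched without Airflow itself. In the hypothetical task callable below, the dependency is imported inside the function body, so the cost is paid when the task runs rather than each time the file is parsed. The standard-library `statistics` module is only a stand-in for a genuinely heavy library, and `train_model` is an illustrative name.

```python
def train_model():
    """Task callable: the heavy import happens at run time, not parse time."""
    # Imported here instead of at module top level, so merely loading
    # this file (as a scheduler parse would) does not trigger the import.
    import statistics  # stand-in for a heavy dependency

    scores = [0.2, 0.4, 0.6]
    return statistics.mean(scores)
```

In a real DAG file this callable would be handed to an operator, while the module's top level would contain only cheap definitions.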

3
Multi-Select · hard

Which TWO strategies can help reduce the cost of running ML pipelines on Vertex AI?

Select 2 answers
A. Run hyperparameter tuning jobs with a large search space
B. Use Vertex AI managed datasets to reduce storage costs
C. Manually scale up resources during peak times and scale down during off-peak
D. Use preemptible VMs for training steps where possible
E. Use a larger machine type for training to complete faster
Answers: B, D

Managed datasets avoid duplication and reduce storage costs.

Why this answer

Options B and D are correct. Option B is correct because using managed datasets avoids duplicate copies of the data. Option D is correct because preemptible VMs are cheaper than standard VMs.

Option A is wrong because hyperparameter tuning over a large search space can increase cost due to the many trials. Option C is wrong because manual scaling is not cost-effective. Option E is wrong because larger machine types increase cost.

4
MCQ · medium

The exhibit shows a Cloud Build configuration. An ML engineer wants to automate the deployment of a model to Vertex AI after training. What is missing in this config to successfully deploy the model?

A. A step to upload the training image to Artifact Registry
B. A step to build the serving container image
C. A step to run unit tests
D. A step to create the Vertex AI Endpoint
Answer: B

The config only builds the training image; it needs a separate step to build and push the serving image.

Why this answer

The Cloud Build configuration shown is for training a model, but to deploy it to Vertex AI, a serving container image must be built and pushed to Artifact Registry. Vertex AI requires a custom serving container (or a prebuilt one) to host the model for predictions. Without a step to build the serving container image (e.g., using a Dockerfile that includes the model and serving dependencies), the deployment will fail because there is no runnable image to deploy to the endpoint.

Exam trap

Google Cloud often tests the distinction between training and serving containers, leading candidates to mistakenly think that the training image (or any image) is sufficient for deployment, when in fact a separate serving container is required.

How to eliminate wrong answers

Option A is wrong because uploading the training image to Artifact Registry is already implied or handled by the training step; the missing piece is the serving container image, not the training image. Option C is wrong because running unit tests, while good practice, is not a prerequisite for deploying a model to Vertex AI; the deployment process specifically requires a serving container image. Option D is wrong because creating the Vertex AI Endpoint can be done as part of the deployment step (e.g., via `gcloud ai endpoints create` or the Vertex AI SDK) and is not the missing piece; the fundamental gap is the absence of a serving container image build step.

5
MCQ · medium

Your team is developing a machine learning model for real-time fraud detection. The training pipeline runs on Vertex AI and uses BigQuery for feature engineering. Recently, the pipeline has been taking significantly longer to execute. Upon investigation, you find that the BigQuery query for feature extraction is being rerun every time the pipeline runs, even though the underlying data hasn't changed. The pipeline is scheduled to run every hour. You want to reduce cost and execution time without losing the ability to detect data drifts. Which approach should you take?

A. Implement a caching mechanism in the pipeline that stores the results of the BigQuery query and reuses them if the data hasn't changed.
B. Move the feature extraction to a separate scheduled query in BigQuery and load the results into a table that the pipeline reads from.
C. Reduce the pipeline frequency to once a day to minimize the number of runs.
D. Use a conditional pipeline that checks if the data has changed before running the feature extraction step.
Answer: B

This separates concerns and avoids redundant execution, while still allowing data drift detection via the pipeline.

Why this answer

Option B is correct because it decouples the feature extraction from the training pipeline by using a separate scheduled BigQuery query that writes results to a table. This eliminates redundant query execution on every pipeline run, reducing cost and execution time, while the scheduled query can be set to run at a frequency that still detects data drifts (e.g., hourly). The pipeline then reads from the precomputed table, avoiding repeated full scans of the source data.

Exam trap

Google Cloud often tests the misconception that caching or conditional checks are sufficient to reduce cost, when in fact the most efficient solution is to offload the repetitive computation to a separate scheduled job that writes to a table, avoiding any pipeline-level overhead.

How to eliminate wrong answers

Option A is wrong because implementing a caching mechanism that checks if data hasn't changed still requires an initial query or metadata check each run, and caching in the pipeline itself does not leverage BigQuery's native scheduled query capabilities, potentially missing data drift detection if the cache is stale. Option C is wrong because reducing pipeline frequency to once a day would significantly delay fraud detection, violating the real-time requirement and increasing the risk of missing drifts between runs. Option D is wrong because a conditional pipeline that checks for data changes before running the feature extraction step still incurs the overhead of a check query every hour, and if the check is lightweight, it may not accurately detect all data drifts (e.g., schema changes or new partitions), while still adding complexity without the cost savings of a scheduled query.

6
MCQ · medium

An ML engineer is using Cloud Build to trigger a Vertex AI Pipeline on every commit to a repository. The pipeline takes 2 hours. The engineer wants to only run the pipeline when changes are made to specific directories. How can this be achieved?

A. Use Cloud Composer to poll the repository periodically
B. Configure Cloud Build trigger with included file globs
C. Use a Cloud Function to evaluate changes and invoke the pipeline
D. Modify the pipeline to ignore unrelated changes
E. Add a conditional step in the pipeline to abort if no relevant changes
Answer: B

Native feature of Cloud Build triggers.

Why this answer

Cloud Build triggers support 'included file globs' and 'ignored file globs' to filter which file changes should invoke the trigger. By specifying glob patterns for the directories of interest, the trigger will only fire when commits modify files matching those patterns, avoiding unnecessary pipeline runs for unrelated changes.

Exam trap

The trap here is that candidates may think a pipeline-level conditional check (Option E) is sufficient, but they overlook that Cloud Build triggers can filter at the trigger level, avoiding any pipeline startup cost for irrelevant changes.

How to eliminate wrong answers

Option A is wrong because Cloud Composer is an orchestration service for workflows, not a polling mechanism for repository changes; it would add unnecessary complexity and latency. Option C is wrong because using a Cloud Function to evaluate changes and invoke the pipeline is an overengineered solution; Cloud Build triggers natively support file glob filtering without needing an intermediary. Option D is wrong because modifying the pipeline to ignore unrelated changes would still consume resources to start the pipeline and then abort, wasting time and cost.

Option E is wrong because adding a conditional step in the pipeline to abort if no relevant changes still requires the pipeline to start and run until the conditional check, incurring unnecessary execution time and cost.
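The filtering behaviour described for included and ignored file globs can be approximated in a few lines. This is only a conceptual sketch, not Cloud Build's implementation: it uses Python's `fnmatch`, whose `*` also matches `/`, so its matching rules differ from real trigger globs. The function name and the example paths are hypothetical.

```python
from fnmatch import fnmatch


def should_trigger(changed_files, included_globs, ignored_globs=()):
    """Return True if some changed file is not ignored and matches an included glob."""
    for path in changed_files:
        # Files matching an ignored glob never cause a trigger.
        if any(fnmatch(path, pattern) for pattern in ignored_globs):
            continue
        if any(fnmatch(path, pattern) for pattern in included_globs):
            return True
    return False
```

With `included_globs=["pipelines/**"]`, a commit touching only `docs/README.md` would not fire the trigger, so the two-hour pipeline never starts.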

7
MCQ · medium

A data-processing pipeline using Dataflow needs to incorporate a custom ML prediction step. The team wants to maintain fast processing and minimize latency. What is the optimal approach?

A. Write the data to Cloud Storage, trigger a Cloud Function to call the model, and write results back
B. Use a custom ParDo transform in Dataflow that calls Vertex AI Prediction API directly
C. Send data to a Pub/Sub topic and have a separate subscriber that runs predictions
D. Stream data through Cloud Functions that serve predictions and write to BigQuery
Answer: B

Inline calls within Dataflow are efficient and keep the pipeline linear.

Why this answer

Option B is correct because using a custom ParDo transform in Dataflow allows the pipeline to call the Vertex AI Prediction API synchronously within each worker, avoiding the overhead of external triggers, intermediate storage, or asynchronous messaging. This keeps the data in-memory and minimizes latency by processing predictions inline with the Dataflow streaming or batch pipeline.

Exam trap

Google Cloud often tests the misconception that adding external services like Cloud Functions or Pub/Sub improves modularity without considering the latency penalty, leading candidates to choose options that introduce unnecessary hops instead of keeping prediction inline within the Dataflow pipeline.

How to eliminate wrong answers

Option A is wrong because writing data to Cloud Storage and triggering a Cloud Function introduces significant I/O latency and additional orchestration overhead, breaking the low-latency requirement. Option C is wrong because sending data to Pub/Sub and having a separate subscriber decouples the prediction step, adding network round-trips and potential backpressure issues that increase end-to-end latency. Option D is wrong because streaming data through Cloud Functions for predictions and then writing to BigQuery creates a multi-hop architecture with cold-start risks and no native Dataflow optimization for parallelism or state management.

8
MCQ · easy

An ML engineer is designing a CI/CD pipeline for ML models using Cloud Build and Cloud Deploy. They want to automatically test model performance on a validation set before promoting to production. Which step should be included in the CI/CD pipeline?

A. Run unit tests on the training code
B. Use Cloud Composer to schedule evaluation
C. Deploy to production immediately after training
D. Train the model in the CI/CD pipeline
E. Run a Vertex AI Pipeline for model evaluation and register the model only if metrics exceed thresholds
Answer: E

Implements a quality gate.

Why this answer

Option E is correct because it directly integrates model evaluation into the CI/CD pipeline using Vertex AI Pipelines, which allows automated validation of model performance against predefined thresholds before promotion. This ensures that only models meeting quality criteria are deployed, aligning with MLOps best practices for gated promotions.

Exam trap

Google Cloud often tests the distinction between code testing (unit tests) and model validation (performance metrics), leading candidates to choose A because they conflate software testing with ML evaluation.

How to eliminate wrong answers

Option A is wrong because unit tests on training code verify code correctness but do not assess model performance on a validation set, which is the requirement. Option B is wrong because Cloud Composer is an orchestration tool for workflows, not a CI/CD step for automatic model evaluation before promotion; it would introduce scheduling latency rather than inline gating. Option C is wrong because deploying immediately after training bypasses validation, risking production degradation from underperforming models.

Option D is wrong because training the model in the CI/CD pipeline is possible but does not include the evaluation step needed to gate promotion; it focuses on the training process itself, not validation.
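The quality gate in option E boils down to comparing evaluation metrics against minimum thresholds before the model is registered. A minimal sketch of that check follows; the function name, metric names, and threshold values are illustrative assumptions rather than anything prescribed by Vertex AI.

```python
def passes_quality_gate(metrics: dict, thresholds: dict) -> bool:
    """True only if every thresholded metric is present and meets its minimum."""
    return all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in thresholds.items()
    )
```

In an evaluation step, the registration and promotion steps would run only when a call such as `passes_quality_gate({"auc": 0.93}, {"auc": 0.90})` returns `True`.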

9
Multi-Select · medium

A machine learning engineer is designing an ML pipeline on Vertex AI. The pipeline includes multiple steps: data validation, preprocessing, training, evaluation, and deployment. The engineer wants to ensure that if the data validation step fails due to schema mismatch, the pipeline stops immediately and does not proceed. Additionally, they want to reuse the preprocessed data from a previous successful run if the source data hasn't changed. Which two configurations should they use? (Choose two.)

Select 2 answers
A. Use a custom exit handler in the data validation step to abort the pipeline.
B. Set the 'on_failure' parameter of the data validation component to 'Stop'.
C. Use conditional branches to check the output of data validation before proceeding.
D. Set the 'cache' option for the preprocessing step to True.
E. Enable 'skip_if_successful' on the preprocessing step.
Answers: B, D

Setting on_failure='Stop' immediately stops the pipeline if the component fails.

Why this answer

Options B and D are correct. Option B: Setting on_failure='Stop' on the data validation component stops the pipeline immediately on failure. Option D: Enabling caching (cache=True) on the preprocessing step allows reuse of outputs when inputs are identical.

Option A is wrong because an exit handler runs cleanup work when the pipeline exits; it is not the mechanism for a step to abort the run. Option C is wrong because conditional branches add unnecessary complexity; the on_failure parameter is simpler. Option E is wrong because 'skip_if_successful' is not a standard parameter; caching is the correct way.
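Step caching of the kind described here generally works by keying a step's outputs on its inputs. The sketch below is a conceptual model only, not the Vertex AI implementation: the in-memory `_cache` dictionary, the function names, and the example URI are all hypothetical.

```python
import hashlib
import json

_cache = {}  # hypothetical in-memory store standing in for a metadata service


def cache_key(component_name: str, inputs: dict) -> str:
    """Derive a stable key from the component name and its inputs."""
    payload = json.dumps({"component": component_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(component_name, inputs, fn):
    """Return (result, cache_hit); fn runs only when the inputs are new."""
    key = cache_key(component_name, inputs)
    if key in _cache:
        return _cache[key], True
    result = fn(**inputs)
    _cache[key] = result
    return result, False
```

A second run with an identical source URI would reuse the stored output, while any change to the inputs produces a different key and forces the step to execute again.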

10
MCQ · easy

An ML engineer is using Vertex AI Pipelines with Kubeflow Pipelines SDK (KFP) to orchestrate a training and deployment workflow. They want to reuse a custom component across multiple pipelines. The component is defined in a Python file 'preprocess.py' that includes a function decorated with @kfp.components.create_component_from_func. How should they package this component for reuse?

A. Import the preprocess module and call create_component_from_func on the function, then use the resulting component in pipeline definitions.
B. Save the component as a YAML file using kfp.components.ComponentStore and load it in other pipelines.
C. Compile the pipeline that uses the component into a JSON file and upload it to Vertex AI.
D. Build a custom container image with the function and use it as a base image in other pipelines.
Answer: A

This allows the component to be defined once and reused.

Why this answer

Option A is correct because the recommended way to reuse a custom component defined via `@kfp.components.create_component_from_func` is to import the Python module containing the decorated function and call `create_component_from_func` on that function in each pipeline definition. This creates a reusable component object that can be used directly in the pipeline's `@dsl.pipeline` definition without additional packaging steps. The KFP SDK treats the function as the source of truth, and re-importing ensures the component logic is always current.

Exam trap

The trap here is that candidates may overthink the packaging step and assume a YAML file or container image is required for reuse, when the KFP SDK is designed to treat Python functions as first-class reusable components through simple module imports.

How to eliminate wrong answers

Option B is wrong because `kfp.components.ComponentStore` is meant for locating and loading existing component definitions, not for saving them. Components serialized as YAML are loaded with `kfp.components.load_component_from_file` or `kfp.components.load_component_from_url`, but saving a component as YAML is not the standard method for reusing a `create_component_from_func` component; that route is typically used for pre-built or container-based components. Option C is wrong because compiling a pipeline into JSON (or YAML) is for submitting the pipeline to Vertex AI, not for packaging a single component for reuse; the compiled artifact represents the entire pipeline, not an individual component. Option D is wrong because building a custom container image is unnecessary overhead for a lightweight Python function component; a custom image is needed only when the function requires non-standard dependencies, but the question does not indicate such a need, and the standard reuse method is direct import.

11
MCQ · easy

A data scientist wants to automate the retraining of a model when new data arrives in Cloud Storage. Which Google Cloud service is most appropriate for orchestrating this workflow?

A. Cloud Run
B. Vertex AI Predictions
C. Cloud Scheduler
D. Cloud Composer
E. Cloud Functions
Answer: D

Cloud Composer can orchestrate complex workflows triggered by data events.

Why this answer

Cloud Composer (D) is the most appropriate service for orchestrating a retraining workflow because it is a fully managed workflow orchestration service built on Apache Airflow. It allows you to define a Directed Acyclic Graph (DAG) that triggers model retraining when new data arrives in Cloud Storage, handling dependencies, scheduling, and monitoring across multiple steps such as data validation, training, and deployment.

Exam trap

The trap here is that candidates often confuse event-triggered compute services (like Cloud Functions) with full workflow orchestration, failing to recognize that retraining pipelines require multi-step dependency management, retries, and monitoring that only a dedicated orchestrator like Cloud Composer provides.

How to eliminate wrong answers

Option A (Cloud Run) is wrong because it is a serverless compute platform for running stateless containers, not a workflow orchestrator; it lacks native scheduling and dependency management for multi-step pipelines. Option B (Vertex AI Predictions) is wrong because it is a service for deploying models to serve predictions, not for orchestrating the retraining workflow triggered by new data. Option C (Cloud Scheduler) is wrong because it is a cron job service that triggers single actions at fixed times, not a workflow orchestrator that can handle event-driven triggers, conditional logic, and multi-step dependencies.

Option E (Cloud Functions) is wrong because it is a lightweight, event-driven compute service for single-purpose functions; while it can be triggered by Cloud Storage events, it cannot orchestrate complex multi-step pipelines with retries, branching, or monitoring.

12
Multi-Select · easy

Which TWO are benefits of using Vertex AI Pipelines for ML workflow orchestration over deploying custom Airflow DAGs in Cloud Composer? (Choose TWO.)

Select 2 answers
A. Managed infrastructure without manual configuration
B. Built-in scheduling capabilities
C. Automatic artifact lineage tracking
D. Native integration with Vertex AI services
E. Support for arbitrary Python code in steps
Answers: C, D

Vertex Pipelines automatically tracks metadata and artifacts.

Why this answer

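Option C is correct because Vertex AI Pipelines automatically captures and tracks artifact lineage (inputs, outputs, and their relationships) as part of the ML metadata store. This built-in lineage tracking is a key differentiator from custom Airflow DAGs, where you must manually implement artifact tracking using external tools or custom code. Option D is correct because Vertex AI Pipelines is part of the Vertex AI platform and works directly with its training, model, and endpoint services, whereas Airflow DAGs reach those services through operators or custom code.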

Exam trap

Google Cloud often tests the misconception that managed infrastructure and scheduling are unique to Vertex AI Pipelines, when in fact Cloud Composer also provides these features, so candidates must focus on the specific differentiators like native integration and automatic lineage tracking.

13
MCQ · medium

An ML team is using Vertex AI Pipelines to automate model training and deployment. They want to reuse components across multiple pipelines. What is the best practice for managing component code?

A. Define components inline in the pipeline definition
B. Embed component code in Cloud Composer DAGs
C. Copy the component definitions into each pipeline's YAML file
D. Use Cloud Functions to define components
E. Store components as container images in Artifact Registry and reference them from pipelines
Answer: E

Centralized, versioned, reusable.

Why this answer

Option E is correct because Vertex AI Pipelines natively supports reusable components by packaging them as container images stored in Artifact Registry. This allows teams to version, share, and reference components across multiple pipelines without duplicating code, ensuring consistency and reducing maintenance overhead. Container images encapsulate the component's runtime environment and logic, making them portable and independently deployable.

Exam trap

Google Cloud often tests the misconception that inline definitions or YAML duplication are acceptable for reuse, but the trap here is that candidates overlook the requirement for versioned, decoupled, and independently deployable components, which only container images in a registry can provide.

How to eliminate wrong answers

Option A is wrong because defining components inline in the pipeline definition tightly couples the component logic to a specific pipeline, preventing reuse across multiple pipelines and making versioning difficult. Option B is wrong because Cloud Composer DAGs are used for orchestrating Apache Airflow workflows, not for defining Vertex AI pipeline components; embedding component code in DAGs would violate separation of concerns and is not a supported pattern for Vertex AI Pipelines. Option C is wrong because copying component definitions into each pipeline's YAML file leads to code duplication, version drift, and increased maintenance burden, contradicting the goal of reusability.

Option D is wrong because Cloud Functions are event-driven serverless functions, not designed to define or host reusable pipeline components; they lack the containerized runtime and dependency management required by Vertex AI Pipelines.

14
MCQ · hard

A large e-commerce company uses Vertex AI Pipelines to orchestrate its recommendation model training. The pipeline has several parallel components: feature engineering, model training, and model evaluation. Recently, they noticed that the pipeline often fails due to resource exhaustion in the Vertex AI custom training job for the model training component. The training job consumes significant memory and occasionally exceeds the allocated memory limit, causing the pod to be OOMKilled. The team has already increased the memory to the maximum allowed for the chosen machine type. They need to prevent the pipeline from failing while still using the same machine type. Which approach should they take?

A. Split the training component into multiple smaller steps that process data in chunks to reduce peak memory usage.
B. Use a larger machine type with more memory to accommodate the peaks.
C. Add a memory check step before training that estimates memory usage and skips training if it exceeds the limit.
D. Implement a retry policy with exponential backoff for the training component, so it automatically retries on failure.
Answer: A

This reduces memory footprint and avoids exceeding the limit, allowing successful completion.

Why this answer

Option A is correct because splitting the training component into smaller steps that process data in chunks directly addresses the root cause of OOMKilled failures—peak memory usage exceeding the allocated limit. By reducing the memory footprint per step, the pipeline can stay within the maximum memory of the existing machine type without requiring a larger instance. This approach aligns with best practices for Vertex AI custom training jobs, where resource limits are fixed per machine type and cannot be exceeded.

Exam trap

Google Cloud often tests the misconception that retry policies or pre-checks can solve resource exhaustion, but the correct approach is to redesign the component to reduce peak memory usage, as retries do not fix the underlying OOM condition.

How to eliminate wrong answers

Option B is wrong because it suggests using a larger machine type, which contradicts the requirement to keep the same machine type; it also may increase cost unnecessarily without solving the underlying memory inefficiency. Option C is wrong because adding a memory check step that skips training on high memory usage would cause the pipeline to fail or produce no model, which does not prevent failure—it merely avoids it by not running the component. Option D is wrong because implementing a retry policy with exponential backoff does not address the resource exhaustion; the training job will repeatedly fail with OOMKilled on each retry, wasting time and compute resources without resolving the memory limit issue.
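The chunking idea behind option A can be shown with a small generator that holds only one chunk in memory at a time. This is a generic sketch under assumed names (`iter_chunks`, `running_mean`) and a toy aggregate; a real training step would stream batches to the model instead of summing numbers.

```python
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive chunks of at most chunk_size rows."""
    chunk: List[float] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final partial chunk


def running_mean(rows: Iterable[float], chunk_size: int = 1000) -> float:
    """Compute a mean while keeping only one chunk in memory at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(rows, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else 0.0
```

Peak memory then scales with `chunk_size` rather than with the full dataset, which is what keeps the job under the machine type's limit.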

15
MCQ · medium

Your team manages a production ML pipeline on Google Cloud that trains a fraud detection model every 6 hours using new transaction data. The pipeline steps are: (1) Cloud Function triggered by new files in Cloud Storage to validate data, (2) Dataflow job for feature engineering, (3) Vertex AI CustomJob for training, (4) Cloud Function to deploy the model to a Vertex AI endpoint after evaluation. You notice that the pipeline sometimes fails during the Dataflow job step with an error: 'Workflow failed. Causes: The job encountered a system error. Please try again later.' The error occurs sporadically, and retrying the pipeline manually usually succeeds. The team needs a reliable automated solution. What should you do?

A. Schedule the pipeline to run less frequently to reduce load on the Dataflow service.
B. Use Cloud Tasks to queue the Dataflow job and retry on failure.
C. Increase the number of Dataflow workers and use FlexRS to handle transient errors.
D. Orchestrate the pipeline using Cloud Composer with retry policies on the Dataflow operator.
Answer: D

Cloud Composer (Airflow) can manage the pipeline DAG with automatic retries and dependencies.

Why this answer

Option D is correct because Cloud Composer (Apache Airflow) provides native retry policies on its Dataflow operators, enabling automatic retries of the Dataflow job when it fails due to transient system errors. This addresses the sporadic failure pattern without manual intervention, ensuring the pipeline runs reliably every 6 hours.

Exam trap

The trap here is that candidates confuse scaling solutions (Option C) with fault-tolerance mechanisms, or they choose a generic queuing service (Option B) instead of a dedicated orchestrator with built-in retry policies for pipeline steps.

How to eliminate wrong answers

Option A is wrong because reducing pipeline frequency does not resolve transient system errors in Dataflow; it only delays processing and may cause data staleness. Option B is wrong because Cloud Tasks is a generic task queue that lacks native integration with Dataflow job lifecycle management and retry logic for pipeline-specific errors. Option C is wrong because increasing workers and using FlexRS improves resource availability but does not handle transient system errors that are unrelated to worker count or preemptibility; FlexRS is for cost savings on preemptible VMs, not for retry logic.
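Airflow operators accept arguments such as `retries`, `retry_delay`, and `retry_exponential_backoff`, which is what makes option D work without custom code. The behaviour those settings provide can be sketched in plain Python as below; `run_with_retries` and its parameters are hypothetical names, and treating `RuntimeError` as the transient failure is an assumption made for illustration.

```python
import time


def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run task(), retrying transient failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return task()
        except RuntimeError:
            if attempt >= retries:
                raise  # retries exhausted: surface the failure
            # Delay doubles on each retry: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

A sporadic system error that succeeds on a later attempt is absorbed automatically, which mirrors the manual reruns the team is doing today.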

16
MCQ · easy

The exhibit shows a Vertex AI PipelineJob submission command. The pipeline fails because the component cannot find the input data. What is the most likely cause?

A. The pipeline root path is incorrect
B. The pipeline name is misspelled
C. The input data path is not accessible by the Vertex AI Pipelines service account
D. The region does not support the component
Answer: C

The component likely expects a Cloud Storage path for data, and the service account lacks read permissions.

Why this answer

Option C is correct because the most likely cause of the pipeline failing to find input data is that the service account running the pipeline lacks the necessary permissions to access the specified input data path. By default, Vertex AI Pipelines runs with the Compute Engine default service account unless a custom service account is specified; if that account does not have the `roles/storage.objectViewer` role (or equivalent) on the bucket or object, the component will fail with a permission-denied error, even if the path is syntactically correct.

Exam trap

Google Cloud often tests the misconception that a misspelled pipeline name or incorrect pipeline root path is the cause of runtime data access failures, when in fact the service account's IAM permissions on the data source are the critical factor.

How to eliminate wrong answers

Option A is wrong because an incorrect pipeline root path would cause a failure to store pipeline artifacts or metadata, not a failure to find input data; the input data path is specified separately in the component's parameters. Option B is wrong because a misspelled pipeline name would cause the pipeline submission to fail at the API validation stage (e.g., an invalid name error), not during runtime when the component tries to access input data. Option D is wrong because the region not supporting the component would result in a resource or API availability error at submission time, not a runtime data access failure.

17
MCQeasy

A data scientist has trained a model using Vertex AI Training and wants to deploy it to a Vertex AI Endpoint for online predictions. Which orchestration service should be used to automate the deployment step after training completes?

A.Vertex AI Pipelines
B.App Engine
C.Cloud Functions
D.Cloud Build
AnswerA

Vertex AI Pipelines allows you to define a pipeline with training and deployment components, automating the workflow.

Why this answer

Vertex AI Pipelines is the correct orchestration service because it is purpose-built for automating and managing end-to-end ML workflows on Google Cloud. It allows you to define a pipeline that includes both the training step (using Vertex AI Training) and the subsequent deployment step (creating or updating a Vertex AI Endpoint) as a single, repeatable, and monitored workflow. This ensures that after training completes, the model is automatically deployed without manual intervention, leveraging the pipeline's ability to pass artifacts and trigger conditional logic.

Exam trap

Google Cloud often tests the distinction between general-purpose compute services (Cloud Functions, App Engine) and ML-specific orchestration tools (Vertex AI Pipelines), trapping candidates who think any serverless or CI/CD tool can handle the unique requirements of ML workflow automation.

How to eliminate wrong answers

Option B (App Engine) is wrong because it is a platform-as-a-service (PaaS) for building and hosting web applications, not an ML pipeline orchestrator; it lacks native integration with Vertex AI Training and Endpoint APIs for automated model deployment. Option C (Cloud Functions) is wrong because it is a serverless compute service for event-driven, single-purpose functions, not designed for orchestrating multi-step ML workflows with dependencies and artifact tracking. Option D (Cloud Build) is wrong because it is a CI/CD service primarily for building, testing, and deploying software artifacts (e.g., container images), not for orchestrating ML pipelines that involve training jobs and endpoint deployments with state management.

18
MCQeasy

An organization wants to implement continuous training for a model that serves predictions via Vertex AI Endpoints. Which approach best automates the retrain-deploy cycle?

A.Schedule a Vertex AI Pipeline to retrain and conditionally deploy
B.Use Vertex AI Model Registry to auto-deploy on new model upload
C.Manually retrain and deploy monthly
D.Use Cloud Composer to schedule retraining only
E.Use a Cloud Function to retrain the model and update the endpoint
AnswerA

Automates the full cycle.

Why this answer

Option A is correct because Vertex AI Pipelines can be scheduled to run a retraining workflow and include a conditional step that deploys the new model to the endpoint only if it passes validation (e.g., evaluation metrics meet a threshold). This fully automates the retrain-deploy cycle without manual intervention, leveraging the pipeline's orchestration capabilities.
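
The conditional step can be pictured as a simple gate between evaluation and deployment. The sketch below is plain Python rather than pipeline SDK code; `should_deploy`, `retrain_and_maybe_deploy`, and the metric names are hypothetical, and the threshold values are made up for the example.

```python
def should_deploy(metrics, thresholds):
    """True only if every thresholded metric is present and meets its minimum."""
    return all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in thresholds.items()
    )


def retrain_and_maybe_deploy(train, evaluate, deploy, thresholds):
    """Run train -> evaluate, then deploy only when the gate passes."""
    model = train()
    metrics = evaluate(model)
    if should_deploy(metrics, thresholds):
        deploy(model)
        return "deployed"
    return "skipped"


if __name__ == "__main__":
    deployed = []
    outcome = retrain_and_maybe_deploy(
        train=lambda: "model-v2",
        evaluate=lambda model: {"accuracy": 0.93, "recall": 0.81},
        deploy=deployed.append,
        thresholds={"accuracy": 0.90, "recall": 0.80},
    )
    print(outcome, deployed)
```

A scheduled run that produces a weaker model simply ends in the "skipped" branch, so the endpoint keeps serving the previous version.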

Exam trap

Google Cloud often tests the distinction between partial automation (e.g., only retraining or only deploying) and full end-to-end automation; the trap here is that candidates may choose an option that automates only one part of the cycle (like retraining with Cloud Composer or auto-deployment with Model Registry) and miss that the question requires both retraining and deployment to be automated in a single, orchestrated workflow.

How to eliminate wrong answers

Option B is wrong because Vertex AI Model Registry stores and versions models that have already been trained; uploading a new model does not trigger retraining, so the retrain step is not automated. Option C is wrong because manual retraining and deployment monthly is not automated and defeats the purpose of continuous training. Option D is wrong because Cloud Composer (Airflow) can schedule retraining, but it does not automatically deploy the model to the endpoint; deployment requires an additional step, so it does not fully automate the cycle.

Option E is wrong because a Cloud Function can trigger retraining and update an endpoint, but it lacks built-in orchestration for complex workflows like conditional deployment based on model evaluation, and it is less robust for managing dependencies and state compared to a pipeline.

19
MCQhard

A company has a Vertex AI pipeline that trains a model on streaming data from Pub/Sub. The pipeline is triggered by a Cloud Function when new data arrives. Recently, jobs have been failing with 'ResourceExhausted: Quota limit exceeded for regional CPUs in us-central1.' The team needs to ensure successful job execution while minimizing changes. Which approach should they take?

A.Request a quota increase from Google Cloud Support.
B.Change the pipeline to run in a different region with available quota.
C.Reduce the number of parallel pipeline runs by using a Cloud Tasks queue with rate limiting.
D.Configure the pipeline's training job to use preemptible VMs (which count toward a separate, usually higher quota).
AnswerD

Preemptible VMs have a separate quota and are cheaper.

Why this answer

Option D is correct because preemptible VMs count toward a separate, often higher quota for 'Preemptible CPUs' rather than the standard regional CPU quota. By configuring the training job to use preemptible VMs, the team can bypass the exhausted quota without requesting a limit increase or changing the pipeline architecture. This minimizes changes because the switch is a compute-resource setting on the training job itself; the Pub/Sub source, the Cloud Function trigger, and the pipeline steps stay as they are.

Exam trap

Google Cloud often tests the misconception that rate limiting (Option C) solves quota exhaustion, but the trap here is that quota limits are per-resource (e.g., regional CPUs) and rate limiting does not change the per-job resource consumption, so it only delays the inevitable failure.

How to eliminate wrong answers

Option A is wrong because requesting a quota increase from Google Cloud Support is a manual, time-consuming process that does not minimize changes and may not be approved quickly, especially if the quota is already at a high default limit. Option B is wrong because changing the pipeline to run in a different region introduces significant architectural changes, potential latency issues, and may require reconfiguring data sources like Pub/Sub topics and Cloud Functions, which contradicts the goal of minimizing changes. Option C is wrong because reducing the number of parallel pipeline runs with a Cloud Tasks queue addresses concurrency but does not resolve the underlying regional CPU quota exhaustion; the quota limit is still hit per run, and rate limiting only delays failures rather than preventing them.

20
Matchingmedium

Match each Google Cloud AI/ML service to its primary purpose.

Concepts
Matches

End-to-end ML platform for building, deploying, and managing models

Train high-quality custom ML models with minimal effort

Managed service for distributed training of ML models

Custom ASIC for accelerating ML training workloads

Create and execute ML models using SQL queries

Why these pairings

These are core Google Cloud AI/ML services tested in the PMLE exam.

21
Multi-Selectmedium

A company uses Cloud Scheduler to trigger Cloud Functions that submit Vertex AI training jobs. They want to ensure fault tolerance and minimize manual intervention. Which TWO practices should they implement?

Select 2 answers
A.Store training hyperparameters in Cloud Firestore for reproducibility.
B.Use Cloud Run jobs as an alternative execution environment.
C.Use Cloud Tasks with retries to handle failed triggers.
D.Implement a fallback that runs the job on Compute Engine if Vertex AI fails.
E.Set up Cloud Monitoring alerts on failed pipeline runs.
AnswersC, E

Cloud Tasks can schedule and retry HTTP requests to the Cloud Function, providing fault tolerance.

Why this answer

Option C is correct because Cloud Tasks provides built-in retry logic with exponential backoff, which can reliably handle transient failures when triggering Cloud Functions from Cloud Scheduler. By configuring a Cloud Tasks queue with retry parameters, the system automatically retries failed triggers without manual intervention, ensuring fault tolerance for Vertex AI training job submissions.
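
To see what exponential backoff looks like in practice, the sketch below computes a retry schedule from `min_backoff`, `max_backoff`, and `max_doublings`, which are field names in a Cloud Tasks queue's retry configuration. The schedule logic (double up to `max_doublings` times, then grow linearly, capped at `max_backoff`) is an assumption based on how the Cloud Tasks documentation describes these fields; the helper name and the numeric values are illustrative.

```python
def backoff_schedule(min_backoff, max_backoff, max_doublings, retries):
    """Return the wait (in seconds) before each of the first `retries` retries."""
    intervals = []
    interval = min_backoff
    linear_step = min_backoff * (2 ** max_doublings)
    for attempt in range(retries):
        intervals.append(min(interval, max_backoff))  # never exceed the cap
        if attempt < max_doublings:
            interval *= 2            # exponential phase
        else:
            interval += linear_step  # linear phase after the doublings
    return intervals


if __name__ == "__main__":
    # Illustrative values: 10 s minimum, 300 s cap, 3 doublings.
    print(backoff_schedule(min_backoff=10, max_backoff=300, max_doublings=3, retries=8))
```

The waits grow quickly at first and then level off at the cap, so a briefly unavailable Cloud Function is retried soon while a longer outage does not produce a flood of requests.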

Exam trap

Google Cloud often tests the distinction between fault tolerance (retry mechanisms) and other concerns like reproducibility or alternative compute; the trap here is that candidates may confuse storing hyperparameters (reproducibility) or switching to Compute Engine (fallback) with actual fault tolerance for trigger failures.

22
MCQeasy

An MLOps team wants to automate the retraining of a model each time new data arrives in a BigQuery table. What is the most efficient Google Cloud service to orchestrate this pipeline?

A.Cloud Composer with an Airflow DAG
B.Dataflow pipeline with a periodic trigger
C.Cloud Functions triggered by BigQuery events
D.Vertex AI Pipelines with a schedule trigger
AnswerD

Vertex AI Pipelines natively supports scheduled triggers and is the recommended service for ML pipeline orchestration.

Why this answer

Vertex AI Pipelines is purpose-built for orchestrating ML workflows, including model retraining. It integrates natively with BigQuery for data ingestion and supports schedule triggers, so retraining runs at a fixed cadence and each run picks up the data that has arrived since the previous one. That makes it the most efficient and managed option for this ML-specific task.

Exam trap

The trap here is that candidates assume BigQuery emits native row-level or table-level event notifications and therefore choose Cloud Functions or Dataflow, when Vertex AI Pipelines provides the most integrated and efficient orchestration for ML retraining workflows.

How to eliminate wrong answers

Option A is wrong because Cloud Composer (Airflow) is a general-purpose workflow orchestrator that adds unnecessary overhead and complexity for a simple retraining pipeline, and it is not optimized for ML-specific operations like model versioning and deployment. Option B is wrong because Dataflow is a stream/batch data processing service, not an orchestrator; a periodic trigger would require additional services (e.g., Cloud Scheduler) and does not natively handle model retraining or pipeline orchestration. Option C is wrong because BigQuery does not natively emit events when new data lands in a table, so there is no BigQuery event for a Cloud Function to be triggered by directly; this option reflects a misunderstanding of BigQuery's event capabilities.

23
MCQhard

Refer to the exhibit. An ML engineer runs this Vertex AI pipeline. After execution, the "train" task fails with a resource exhaustion error. The task consumes more memory than allocated. Which step should the engineer take to fix this issue without increasing the overall quota cost?

A.Add a 'memory' field to the train task specification.
B.Configure the 'train-exec' executor to use a machine type with higher memory.
C.Increase the memory of the train task to 32 GiB.
D.Set 'acceleratorType' to 'NVIDIA_TESLA_T4' on the train task.
AnswerB

The executor defines the machine type, and modifying it to use a higher-memory machine (e.g., n1-highmem-8) will provide more memory without changing other quota.

Why this answer

In Vertex AI pipelines, the executorLabel maps to a predefined executor that defines machine type. To increase memory for the train task, the engineer must modify the executor specification (e.g., 'machine_type: n1-highmem-*'). Increasing the task's memory directly is not supported; it's done via the executor.

Adding an accelerator does not address memory exhaustion.

24
MCQhard

A company uses Vertex AI Pipelines to train and deploy models. The pipeline has a step that runs a custom container. The step fails intermittently with a timeout error. Which approach should be taken to robustly handle this?

A.Switch to Kubeflow Pipelines
B.Set up a Cloud Composer DAG to monitor and rerun the pipeline
C.Reduce the size of the training data
D.Increase the timeout for the step in the pipeline definition
E.Use Cloud Functions to retry the step
AnswerD

Directly fixes the timeout issue.

Why this answer

Option D is correct because Vertex AI Pipelines (built on Kubeflow Pipelines) allows you to define a `timeout` parameter for each pipeline step. Increasing this timeout directly addresses the intermittent timeout error by giving the custom container more time to complete its work, without changing the pipeline architecture or introducing external monitoring components. This is the most robust and minimal-change solution for a step that occasionally exceeds its current time limit.

Exam trap

The trap here is that candidates may over-engineer the solution by choosing external retry mechanisms (Cloud Functions, Cloud Composer) or changing the pipeline framework, when the simplest and most correct fix is to adjust the step's timeout configuration within the pipeline definition itself.

How to eliminate wrong answers

Option A is wrong because Vertex AI Pipelines is already built on Kubeflow Pipelines; switching does not solve a timeout issue and would require re-architecting the pipeline. Option B is wrong because Cloud Composer (Apache Airflow) is an external orchestrator; adding it to monitor and rerun the pipeline adds complexity and latency, and does not fix the root cause of the step timing out. Option C is wrong because reducing training data size may degrade model quality and does not address the timeout—the step might still fail if the container itself is slow for other reasons.

Option E is wrong because Cloud Functions are stateless and event-driven; they cannot directly retry a step within a Vertex AI Pipeline. Retries should be configured natively in the pipeline definition (for example, with the KFP SDK's `set_retry` on a task), alongside the step's timeout.

25
Multi-Selecthard

Which THREE should be considered when setting up an automated retraining pipeline using Vertex AI Pipelines and Cloud Composer? (Choose THREE.)

Select 3 answers
A.Setting performance thresholds for new models to decide deployment
B.Including hyperparameter tuning in every retraining run
C.Optimizing resource allocation to control costs
D.Frequency of code commits to the repository
E.Monitoring for data drift to trigger retraining
AnswersA, C, E

Ensure new model is better than current.

Why this answer

Option A is correct because in an automated retraining pipeline, you must set performance thresholds (e.g., accuracy, precision, recall) for new models to decide whether to deploy them. Vertex AI Pipelines can evaluate model metrics against these thresholds and conditionally deploy only if the new model meets or exceeds the current production model's performance, preventing regressions.

Exam trap

Google Cloud often tests the misconception that hyperparameter tuning must be part of every retraining run, but in practice it is a separate, infrequent optimization step to avoid excessive compute costs and pipeline latency.

26
MCQmedium

A team wants to implement CI/CD for their ML models using Cloud Build. They have a pipeline that trains a model and deploys it. What is the best practice for triggering the pipeline when a new commit is pushed to the source repository?

A.Set up a Cloud Scheduler job to poll the repository periodically
B.Deploy a custom web service on App Engine to call Cloud Build API
C.Use Pub/Sub to notify Cloud Build of new commits
D.Configure a Cloud Build trigger on the source repository (e.g., Cloud Source Repositories, GitHub)
AnswerD

Cloud Build supports triggers that automatically start a build upon a push to the repository.

Why this answer

Option D is correct because Cloud Build natively supports triggers that automatically start a pipeline when a new commit is pushed to a connected source repository (e.g., Cloud Source Repositories, GitHub, Bitbucket). This is the simplest, most event-driven approach, requiring no polling, custom services, or additional messaging infrastructure. It directly maps the git push event to a build invocation, ensuring near-instantaneous pipeline execution.

Exam trap

The trap here is that candidates may overthink the solution and choose Pub/Sub (Option C) because they know Pub/Sub is used for event-driven architectures, but they miss that Cloud Build triggers already abstract this complexity away, making direct trigger configuration the best practice.

How to eliminate wrong answers

Option A is wrong because Cloud Scheduler polling is inefficient, introduces latency (minimum 1-minute intervals), and is not event-driven; it would waste resources and delay pipeline starts. Option B is wrong because deploying a custom web service on App Engine to call the Cloud Build API adds unnecessary complexity, cost, and maintenance overhead, and is not a best practice when native triggers exist. Option C is wrong because while Pub/Sub can be used to trigger builds, it requires an intermediary (e.g., a Cloud Function) to receive the commit notification and call the Cloud Build API, adding latency and complexity compared to a direct Cloud Build trigger.

27
MCQeasy

A data science team uses Vertex AI Pipelines to build a training pipeline. They notice that when the pipeline fails due to a transient error in a component, the entire pipeline restarts from the beginning, taking a long time. What is the best practice to handle transient errors efficiently?

A.Use Vertex AI Experiment to track runs and manually restart failed components.
B.Configure Vertex AI Pipelines to automatically restart from the last successful state by enabling checkpointing.
C.Wrap the component code in a try-except block and retry indefinitely.
D.Set the component's retry count to 3 in the pipeline definition.
AnswerB

Checkpointing allows the pipeline to resume from the last successful state, minimizing rerun time.

Why this answer

Option B is correct because Vertex AI Pipelines supports checkpointing, which allows a pipeline to resume from the last successful state after a transient failure, avoiding a full restart. This is the most efficient approach for handling transient errors in a managed pipeline service, as it minimizes wasted compute time and resources.
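
The idea of resuming from the last successful state can be shown with a toy runner that records completed steps and skips them on the next attempt. This is a conceptual sketch only, not how Vertex AI Pipelines is implemented; `run_with_resume`, `make_flaky`, and the step names are hypothetical.

```python
def make_flaky(fail_times, result):
    """Build a step that raises RuntimeError `fail_times` times, then returns."""
    state = {"calls": 0}

    def step():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise RuntimeError("simulated transient error")
        return result

    return step


def run_with_resume(steps, max_attempts=3):
    """Run steps in order; on failure, rerun but skip steps already completed.

    Returns a list with the step names executed in each attempt.
    """
    completed, outputs, history = set(), {}, []
    for _ in range(max_attempts):
        executed = []
        try:
            for name, fn in steps:
                if name in completed:
                    continue  # finished in an earlier attempt: do not rerun
                outputs[name] = fn()
                completed.add(name)
                executed.append(name)
            history.append(executed)
            return history  # every step has completed
        except RuntimeError:
            history.append(executed)  # keep progress, then try again
    return history


if __name__ == "__main__":
    steps = [
        ("prep", lambda: "data"),
        ("train", make_flaky(1, "model")),
        ("deploy", lambda: "endpoint"),
    ]
    print(run_with_resume(steps))
```

The second attempt starts at the failed "train" step instead of repeating "prep", which is the saving the explanation refers to.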

Exam trap

The trap here is that candidates often confuse simple retry logic (Option D) with stateful checkpointing, assuming that retrying a component a few times is sufficient, but they miss that checkpointing preserves the pipeline's progress across failures, which is critical for long-running pipelines.

How to eliminate wrong answers

Option A is wrong because Vertex AI Experiment is designed for tracking and comparing runs, not for automating recovery from transient errors; manually restarting failed components defeats the purpose of automation and is inefficient. Option C is wrong because wrapping component code in a try-except block with indefinite retries can lead to infinite loops, resource exhaustion, and does not leverage the pipeline's orchestration capabilities for stateful recovery. Option D is wrong because setting a retry count of 3 in the pipeline definition only retries the failed component from scratch, not from the last successful state, which still wastes time if the component has long-running steps.

28
MCQhard

A data science team is deploying a PyTorch model for real-time inference using Vertex AI Endpoints. The model requires a custom container with specific CUDA drivers and Python packages. They have created a Docker image and pushed it to Artifact Registry. The pipeline should automatically retrain the model every week and deploy the new version if it passes validation. However, the deployment step fails intermittently with the error 'The container image is not compatible with the machine type.' What is the most likely cause?

A.The service account does not have permission to pull the container from Artifact Registry.
B.The container image requires GPU support but the machine type specified in the endpoint is a CPU-only machine.
C.The container's health check endpoint is not responding correctly.
D.The model artifact size exceeds the maximum allowed for the machine type.
AnswerB

CUDA drivers require GPU machines; using a CPU machine causes compatibility error.

Why this answer

The error 'The container image is not compatible with the machine type' indicates a mismatch between the container's hardware requirements and the machine type selected for the Vertex AI Endpoint. Since the custom container requires specific CUDA drivers, it is built for GPU acceleration. If the endpoint is configured with a CPU-only machine type (e.g., n1-standard-4), the container will fail to run because the GPU drivers cannot initialize, triggering this incompatibility error.

Exam trap

Google Cloud often tests the distinction between deployment-time compatibility errors and runtime health check failures, tricking candidates into confusing a misconfigured machine type with a failing health probe.

How to eliminate wrong answers

Option A is wrong because a permission issue (e.g., missing artifactregistry.reader role) would produce an 'unauthorized' or 'access denied' error when pulling the image, not a compatibility error. Option C is wrong because a failing health check would cause the deployment to succeed initially but then report the container as unhealthy, not a pre-deployment compatibility error. Option D is wrong because Vertex AI has no per-machine-type artifact size limit; model size constraints are separate and would manifest as a resource-exhausted error, not a compatibility error.

29
Multi-Selectmedium

Which THREE actions should be taken to automate a machine learning pipeline using Cloud Build and Vertex AI?

Select 3 answers
A.Write a cloudbuild.yaml that builds a training container and submits a Vertex AI PipelineJob
B.Use Cloud Functions to retrain the model each time a build completes
C.Set up a Cloud Scheduler job to poll for new build artifacts
D.Define the training and deployment steps in a Vertex AI Pipeline and submit it from Cloud Build
E.Configure a Cloud Build trigger to run on commits to the source repository
AnswersA, D, E

Cloud Build uses build config to define steps, including submitting pipeline jobs.

Why this answer

Option A is correct because a cloudbuild.yaml can define steps that build a custom training container and then submit a Vertex AI PipelineJob that uses it. This directly automates the ML pipeline by using Cloud Build to trigger a Vertex AI pipeline, which is the recommended pattern for CI/CD of ML workflows.

Exam trap

Google Cloud often tests the distinction between event-driven triggers (Cloud Build triggers, Pub/Sub) and polling mechanisms (Cloud Scheduler, Cloud Functions) — the trap here is that candidates may think polling or separate functions are needed for automation, when in fact Cloud Build's native triggers and pipeline submission are the correct, integrated approach.

30
MCQhard

You are an ML engineer at a large e-commerce company. Your team has developed a product recommendation model using TensorFlow and deployed it on Vertex AI Endpoints for real-time inference. The model is retrained weekly using a Vertex AI Pipeline that reads new user interaction data from BigQuery, trains the model, evaluates it, and deploys the new version to the endpoint with a traffic split: 10% to the new model and 90% to the previous champion model. Recently, the team noticed that the new model's online prediction latency has increased significantly (from 50ms to 200ms) after deployment, causing timeouts for some requests. The training code has not changed, and the model size is similar. The pipeline uses a custom container with the same TensorFlow Serving image as before. The deployment step uses the same machine type (n1-standard-4) for the endpoint. What is the most likely cause of the latency increase?

A.The endpoint is using a machine type that is not optimized for the new model's computation.
B.The new model has a significantly different architecture that requires more computation.
C.The pipeline now includes a data validation step that modifies the SavedModel's serving signature, adding an extra preprocessing operation.
D.The new model is experiencing data skew because the training data distribution has changed.
AnswerC

A data validation step might have inadvertently added preprocessing ops, increasing latency.

Why this answer

Option C is correct because the pipeline now includes a data validation step that modifies the SavedModel's serving signature, adding an extra preprocessing operation. This additional operation runs during inference on Vertex AI Endpoints, increasing the per-request latency from 50ms to 200ms, even though the model architecture and size remain unchanged. The custom container and machine type are identical, so the latency increase must stem from a change in the serving graph itself.

Exam trap

Google Cloud often tests the concept that changes in the ML pipeline (like adding a data validation step) can alter the serving signature and increase latency, even when the model architecture and infrastructure remain unchanged, tricking candidates into focusing on hardware or data distribution instead.

How to eliminate wrong answers

Option A is wrong because the endpoint uses the same machine type (n1-standard-4) as before, so the machine is not the cause of the latency increase. Option B is wrong because the training code has not changed and the model size is similar, indicating the architecture is not significantly different. Option D is wrong because data skew affects prediction accuracy, not latency; it does not explain a 4x increase in inference time.

31
MCQmedium

The exhibit shows part of a Vertex AI Pipeline definition. The pipeline fails at the training step with an error: 'Missing required input: train_data'. What is the most likely cause?

A.The evaluation step expects a metric output but training does not produce it
B.The training step uses the wrong image tag
C.The container command for data_processing is incorrect
D.The data_processing step does not define any outputs
E.The pipeline is missing a deployment step
AnswerD

The pipeline must define an output from data_processing to feed into training.

Why this answer

The error 'Missing required input: train_data' indicates that the training step expects an input artifact named 'train_data', but no upstream step provides it. In Vertex AI Pipelines, a component's output must be explicitly defined and connected to the downstream component's input. Since the data_processing step does not define any outputs, it cannot produce the 'train_data' artifact, causing the training step to fail.
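
The wiring rule can be checked mechanically: every input of a step must point at an output that some upstream step actually declares. The sketch below uses a simplified, hypothetical dictionary layout (not the real pipeline spec format) and a made-up `find_missing_inputs` helper to reproduce the error from the question.

```python
def find_missing_inputs(tasks):
    """Return an error string for each input whose producer output is undeclared.

    `tasks` maps a task name to {"outputs": [names], "inputs": {name: (producer, output)}}.
    """
    errors = []
    for task_name, spec in tasks.items():
        for input_name, (producer, output_name) in spec.get("inputs", {}).items():
            declared = tasks.get(producer, {}).get("outputs", [])
            if output_name not in declared:
                errors.append(f"{task_name}: Missing required input: {input_name}")
    return errors


if __name__ == "__main__":
    pipeline = {
        # data_processing declares no outputs, as in the question.
        "data_processing": {"outputs": []},
        "training": {
            "outputs": ["model"],
            "inputs": {"train_data": ("data_processing", "train_data")},
        },
    }
    print(find_missing_inputs(pipeline))

    # Declaring the output on the upstream step resolves the error.
    pipeline["data_processing"]["outputs"].append("train_data")
    print(find_missing_inputs(pipeline))
```

The failure is reported against the training step even though the fix belongs in data_processing, which is why the error message can be misleading at first glance.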

Exam trap

Google Cloud often tests the distinction between runtime errors (e.g., container image issues) and graph validation errors (e.g., missing input/output connections), leading candidates to confuse a missing output definition with a container or command misconfiguration.

How to eliminate wrong answers

Option A is wrong because the error is about a missing input, not a missing metric output; the evaluation step's expectations are irrelevant to the training step's input requirement. Option B is wrong because an incorrect image tag would cause a container runtime error (e.g., 'ImagePullBackOff'), not a 'Missing required input' error, which is a pipeline graph validation issue. Option C is wrong because an incorrect container command for data_processing would cause that step to fail, but the error specifically points to the training step's missing input, not a failure in data_processing.

Option E is wrong because a missing deployment step would not cause a training step input error; deployment occurs after training and evaluation, and its absence would not affect the training step's input requirements.

32
MCQeasy

A company uses Cloud Composer to orchestrate their ML pipelines. They notice that tasks are being queued but not executed, causing delays. What is the most likely cause?

A.The Airflow web server is down
B.The DAG file is corrupted
C.The Cloud Storage bucket containing DAGs is not accessible
D.The Airflow worker resources are exhausted
AnswerD

If workers are busy or the cluster is under-provisioned, tasks will be queued.

Why this answer

When tasks are queued but not executed, it typically indicates that the Airflow workers have no available slots to pick up new tasks. In Cloud Composer, the Celery executor distributes tasks to workers; if all worker concurrency slots are saturated or the worker node pool is under-provisioned, tasks remain in the 'queued' state until a worker becomes free. This is the most likely cause given the symptom of tasks being queued without execution.
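
A back-of-the-envelope calculation shows how slot exhaustion leaves tasks queued. In the sketch below, `worker_concurrency` refers to the Airflow Celery setting of that name; the helper and the numbers are illustrative, and the model deliberately ignores other limits such as pools and DAG-level concurrency.

```python
def queued_tasks(runnable_tasks, workers, worker_concurrency):
    """Estimate how many runnable tasks must wait for a free worker slot."""
    slots = workers * worker_concurrency   # total tasks that can run at once
    running = min(runnable_tasks, slots)   # tasks that get a slot right away
    return {"slots": slots, "running": running, "queued": runnable_tasks - running}


if __name__ == "__main__":
    # Illustrative numbers: 3 workers with 12 slots each and 50 runnable tasks.
    print(queued_tasks(runnable_tasks=50, workers=3, worker_concurrency=12))
```

With 36 slots and 50 runnable tasks, 14 tasks stay in the queued state until running tasks finish or more worker capacity is added.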

Exam trap

The trap here is that candidates confuse the roles of Airflow components (web server, scheduler, worker) and assume a UI or DAG access issue causes queued tasks, when in reality the worker capacity is the bottleneck.

How to eliminate wrong answers

Option A is wrong because the Airflow web server is responsible for the UI and DAG parsing, not for executing tasks; if it were down, the UI would be inaccessible but tasks could still be queued and executed by workers. Option B is wrong because a corrupted DAG file would cause a parse error, preventing the DAG from being scheduled or appearing in the UI, not leaving tasks in a queued state. Option C is wrong because if the Cloud Storage bucket containing DAGs were not accessible, the DAGs would not be synced to the Airflow environment at all, resulting in missing DAGs rather than queued tasks.

33
MCQhard

The exhibit shows a Cloud Composer environment variable configuration. An ML pipeline DAG fails with an authentication error when trying to access Vertex AI. What is the most likely cause?

A.The Airflow worker does not have the proper scopes to access Vertex AI
B.The service account key in the environment variable is expired
C.The DAG file is missing a required Python library
D.The Cloud Composer environment is in a different project than Vertex AI
AnswerA

The environment variable 'GOOGLE_APPLICATION_CREDENTIALS' is set to a service account key path, but the worker VM may not have the necessary scopes.

Why this answer

The authentication error when accessing Vertex AI from Cloud Composer most likely occurs because the Airflow worker's service account lacks the necessary OAuth scopes or IAM permissions. Cloud Composer uses a worker service account to execute tasks; if this account does not have the `https://www.googleapis.com/auth/cloud-platform` scope or the `aiplatform.user` role, the Airflow worker cannot authenticate to Vertex AI APIs, resulting in a 403 or 401 error.

Exam trap

Google Cloud often tests the distinction between authentication (scopes/identity) and authorization (IAM roles), so candidates mistakenly blame cross-project configuration or missing libraries when the root cause is the worker's service account lacking the required OAuth scopes.

How to eliminate wrong answers

Option B is wrong because expired service account keys would cause a different error (e.g., 'invalid_grant' or 'expired key'), not a generic authentication error, and Cloud Composer typically uses a service account attached to the environment, not a key stored in an environment variable. Option C is wrong because a missing Python library would raise an ImportError or ModuleNotFoundError, not an authentication error. Option D is wrong because Cloud Composer and Vertex AI can be in different projects as long as the service account has cross-project IAM permissions; the error would be a permission denied, not an authentication failure.

34
Multi-Selectmedium

Which THREE are best practices for implementing CI/CD for ML pipelines on Google Cloud? (Choose THREE.)

Select 3 answers
A.Maintain separate environments for dev, staging, and production
B.Track all experiments and artifacts using Vertex ML Metadata
C.Use Cloud Build to automate testing, building, and deployment of pipeline components
D.Design pipelines with low-code components to reduce development time
E.Write unit tests for every training job
AnswersA, B, C

Prevents unintended changes to production.

Why this answer

Maintaining separate environments for dev, staging, and production is a core CI/CD best practice because it isolates changes, prevents accidental breakage in production, and allows thorough validation at each stage. On Google Cloud, this aligns with using distinct Vertex AI Pipelines instances or separate projects to enforce environment-specific configurations and access controls.

Exam trap

Google Cloud often tests the distinction between general software CI/CD practices and ML-specific CI/CD needs, trapping candidates who over-apply traditional unit testing or assume low-code tools are always best practices for production ML pipelines.

35
MCQhard

A company uses Vertex AI Pipelines with Kubeflow DSL for hyperparameter tuning. They notice that some trials fail due to OOM errors. How should they configure the pipeline to automatically handle this?

A.Use a larger machine type for the whole pipeline
B.Use Cloud Composer to catch failures and resubmit
C.Reduce the number of trials
D.Add a retry policy to the hyperparameter tuning step with backoff
E.Increase the memory for all trials in the pipeline definition
AnswerD

Retries failed trials automatically.

Why this answer

Option D is correct because Vertex AI Pipelines supports retry policies on individual pipeline steps, including hyperparameter tuning jobs. By adding a retry policy with exponential backoff, the pipeline can automatically re-run failed trials caused by transient OOM errors without manual intervention, while avoiding immediate retries that could overload resources.
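As a rough illustration of what a retry policy with exponential backoff does, the sketch below computes the wait before each retry. The parameter names are modelled on the retry settings exposed by the Kubeflow Pipelines SDK, but they are assumptions made for this example, and the function is plain Python rather than a call into any SDK.

```python
from typing import List


def backoff_delays(
    num_retries: int,
    backoff_duration: float,
    backoff_factor: float,
    backoff_max_duration: float,
) -> List[float]:
    """Return the wait, in seconds, before each retry attempt."""
    delays = []
    for attempt in range(num_retries):
        # The delay grows geometrically with each failed attempt...
        delay = backoff_duration * (backoff_factor ** attempt)
        # ...but is capped so a long failure streak cannot stall the run forever.
        delays.append(min(delay, backoff_max_duration))
    return delays


# Illustrative values only: 4 retries, starting at 60 s, doubling, capped at 300 s.
print(backoff_delays(num_retries=4, backoff_duration=60, backoff_factor=2, backoff_max_duration=300))
# → [60, 120, 240, 300]
```

The cap matters for long-running tuning steps: without it, a handful of consecutive failures would push the next attempt out by hours.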

Exam trap

Google Cloud often tests the misconception that retry policies are only for network requests or that OOM errors require permanent resource increases, when in fact transient OOMs in ML pipelines can be handled gracefully with step-level retries and backoff.

How to eliminate wrong answers

Option A is wrong because using a larger machine type for the whole pipeline is inefficient and costly; it does not target only the failing trials and may not resolve OOM errors if the issue is specific to certain hyperparameter configurations. Option B is wrong because Cloud Composer is an orchestration service for Apache Airflow workflows, not designed to catch and resubmit individual Vertex AI pipeline step failures; it adds unnecessary complexity and latency. Option C is wrong because reducing the number of trials limits the search space and may prevent finding the optimal hyperparameters, without addressing the root cause of OOM errors.

Option E is wrong because increasing memory for all trials in the pipeline definition is a blunt approach that wastes resources on trials that do not need extra memory, and it does not handle transient failures that may occur even with sufficient memory.

36
MCQmedium

An MLOps team is implementing a CI/CD pipeline for a TensorFlow model on Vertex AI. The model training job takes 2 hours and produces a SavedModel. The team wants to automatically trigger a new pipeline run whenever a change is pushed to the 'main' branch of their source repository. The pipeline should include training, evaluation, and if metrics exceed a threshold, deploy the model to a Vertex AI endpoint. Which trigger configuration should they use?

A.Use Eventarc to listen for Cloud Source Repository push events and invoke a Cloud Run service that starts the pipeline.
B.Use an Artifact Registry trigger to detect new model images and then start the pipeline.
C.Set up a Cloud Scheduler job that runs every 2 hours and triggers a Vertex AI Pipeline run.
D.Configure a Cloud Build trigger that watches the 'main' branch of Cloud Source Repositories; in the build config, use steps to run the pipeline via the Vertex AI API.
AnswerD

Cloud Build triggers are designed for source code events and can orchestrate ML pipelines.

Why this answer

Option D is correct because Cloud Build triggers can be configured to watch a specific branch (e.g., 'main') in Cloud Source Repositories and automatically execute a build configuration. Within that build config, you can use the `gcloud` or `curl` steps to invoke the Vertex AI Pipeline API, which starts the training, evaluation, and conditional deployment workflow. This directly matches the requirement for a branch-based push trigger that orchestrates the full ML pipeline.

Exam trap

Google Cloud often tests the distinction between event-driven triggers (Cloud Build for source code changes) and artifact-based triggers (Artifact Registry for new images), leading candidates to confuse the two when the requirement is to start a pipeline from a code push.

How to eliminate wrong answers

Option A is wrong because Eventarc is designed for event-driven, asynchronous invocations (e.g., from Cloud Storage or Pub/Sub), but it does not natively integrate with Cloud Source Repositories push events; Cloud Build triggers are the correct service for repository push events. Option B is wrong because an Artifact Registry trigger would fire only after a new model image is pushed, but the requirement is to trigger on a source code change (push to 'main'), not on a new artifact. Option C is wrong because a Cloud Scheduler job running every 2 hours is a time-based schedule, not a push-triggered event; it would not respond to code changes and would run even when no changes occur, wasting resources.
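The conditional deployment described in the question ("if metrics exceed a threshold, deploy") reduces to a small gate that runs after evaluation. The following is a minimal sketch under stated assumptions: the function name and metric names are hypothetical, and in a real pipeline this check would sit inside an evaluation component or a pipeline condition rather than a standalone function.

```python
from typing import Dict


def should_deploy(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Return True only if every thresholded metric is present and exceeds its threshold."""
    return all(
        name in metrics and metrics[name] > minimum
        for name, minimum in thresholds.items()
    )


# Hypothetical evaluation output and deployment policy.
evaluation = {"auc": 0.93, "accuracy": 0.90}
policy = {"auc": 0.90}
print(should_deploy(evaluation, policy))
```

Treating a missing metric as a failed check is a deliberate choice here: a broken evaluation step should block deployment rather than let it through silently.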

37
MCQeasy

A developer creates a Cloud Build trigger that runs a training pipeline whenever code is pushed to the main branch of the repository. The trigger is configured to use a source archive stored in Cloud Storage. After pushing code to main, the build fails with the error shown. What is the most likely cause of this failure?

A.The build configuration file is missing from the source archive.
B.The included files filter 'train/**' excludes all files outside the train directory, causing the build to have no source.
C.The source archive is not being updated when code is pushed, so the trigger tries to fetch an old or nonexistent object.
D.The service account does not have storage.objectViewer permission on the bucket.
AnswerC

The trigger points to a static archive; pushing new code does not update the archive, leading to missing source.

Why this answer

Option C is correct because the trigger is configured to use a source archive stored in Cloud Storage. When code is pushed to the main branch, the trigger attempts to fetch the archive from the specified Cloud Storage location. If the archive is not updated (e.g., via a separate upload or a Cloud Function that rebuilds the archive on push), the trigger will either fetch an old version or fail if the object does not exist.

The error indicates that the build cannot proceed because the source archive is stale or missing, not because of a missing config file or permission issue.

Exam trap

The trap here is that candidates assume the included files filter (Option B) causes the failure, but the error is about the source archive itself being outdated or missing, not about which files are included within it.

How to eliminate wrong answers

Option A is wrong because the error message does not indicate a missing build configuration file; a missing cloudbuild.yaml would produce a specific 'build configuration file not found' error, not a generic fetch failure. Option B is wrong because the included files filter 'train/**' only restricts which files are included in the build context, but it does not cause the source archive itself to be missing or stale; the error is about fetching the archive, not about empty source. Option D is wrong because if the service account lacked storage.objectViewer permission on the bucket, the error would be a 403 Forbidden or access denied, not a generic build failure related to source archive retrieval.

38
MCQeasy

You are responsible for maintaining an ML pipeline that runs daily on Vertex AI Pipelines. The pipeline preprocesses data, trains a model, and deploys it to an endpoint. Recently, the pipeline has been failing at the deployment step because the endpoint already exists and the deploy step tries to create a new endpoint instead of updating the existing one. The pipeline code is written using the Kubeflow Pipelines SDK. You need to modify the pipeline to resolve this issue with minimal changes. What should you do?

A.Change the pipeline to use a Cloud Function that triggers the deployment independently, bypassing Vertex AI Pipelines.
B.In the deployment component, add a check to verify if the endpoint exists, and if so, call the update endpoint method instead of create.
C.Set the deploy component's retry policy to infinite so it eventually succeeds.
D.Manually delete the existing endpoint before each pipeline run.
AnswerB

This directly fixes the deployment logic to handle existing endpoints.

Why this answer

Option B is correct because it addresses the root cause: the deployment component should check whether the endpoint exists and update it instead of creating a new one. Option A is wrong because using a Cloud Function bypasses the pipeline orchestration and adds unnecessary complexity. Option C is wrong because retrying will not fix the fundamental issue of trying to create an endpoint that already exists.

Option D is wrong because manual deletion defeats automation and is not a robust solution.
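The "check, then create or update" logic can be sketched as follows. `FakeEndpointClient` is a hypothetical in-memory stand-in written for this example, not a class from the Vertex AI SDK; a real deployment component would replace its three methods with the corresponding SDK calls.

```python
from typing import Dict, Optional


class FakeEndpointClient:
    """Hypothetical in-memory stand-in for an endpoint service (not a real SDK class)."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, str] = {}  # display name -> deployed model id

    def find(self, display_name: str) -> Optional[str]:
        return self._endpoints.get(display_name)

    def create(self, display_name: str, model_id: str) -> None:
        # Mirrors the failure in the question: creating a duplicate is an error.
        if display_name in self._endpoints:
            raise ValueError(f"endpoint {display_name!r} already exists")
        self._endpoints[display_name] = model_id

    def update(self, display_name: str, model_id: str) -> None:
        self._endpoints[display_name] = model_id


def deploy(client: FakeEndpointClient, display_name: str, model_id: str) -> str:
    """Create the endpoint on the first run and update it on every later run."""
    if client.find(display_name) is None:
        client.create(display_name, model_id)
        return "created"
    client.update(display_name, model_id)
    return "updated"


client = FakeEndpointClient()
print(deploy(client, "daily-model", "model-v1"))  # first daily run
print(deploy(client, "daily-model", "model-v2"))  # later runs reuse the endpoint
```

Because `deploy` is idempotent with respect to the endpoint, the daily pipeline can run unchanged whether or not the endpoint already exists.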

39
MCQeasy

A pharmaceutical company uses Vertex AI Pipelines with custom training containers. Recently, the pipeline has been failing with 'Container failed with exit code 137' (out of memory). The container runs with default memory limit. The team needs to fix this without changing the code. The project quota for CPU and memory is sufficient. What should the team do?

A.Add a resource hint to the container spec for more memory.
B.Set the 'machineType' field for the training task to a higher memory machine.
C.Increase the model parallelism by using multi-worker training.
D.Use a smaller dataset for training.
AnswerB

This directly provides more memory to the container without code changes.

Why this answer

Option B is correct because the container is running out of memory (exit code 137) with the default memory limit. In Vertex AI Pipelines, a custom training container that does not request specific resources runs with a default allocation, which is too small for this workload. By setting the 'machineType' field to a higher memory machine (e.g., n1-highmem-8), the container automatically receives more memory without requiring code changes.

This directly resolves the OOM issue while respecting the constraint of not modifying the code.

Exam trap

Google Cloud often tests the misconception that resource hints or environment variables can override default memory limits in Vertex AI Pipelines, but the correct mechanism is the 'machineType' field in the task specification, not hints or code changes.

How to eliminate wrong answers

Option A is wrong because Vertex AI Pipelines does not support resource hints in the container spec for custom training containers; resource allocation is controlled via the 'machineType' field, not hints. Option C is wrong because multi-worker training (model parallelism) distributes computation across workers but does not increase the memory available to a single container; it would require code changes to implement distributed training, which violates the 'without changing the code' constraint. Option D is wrong because using a smaller dataset may reduce memory usage but changes the training data, which is not a valid fix for an OOM error in a production pipeline; the problem is memory allocation, not dataset size.

40
Drag & Dropmedium

Drag and drop the steps to set up a batch prediction job using Vertex AI in the correct order.

Why this order

Prepare input data, register model, create job, submit, and retrieve results.

41
MCQhard

A company deploys a training pipeline on Vertex AI using custom containers. The pipeline includes a hyperparameter tuning job that uses Bayesian optimization. After several runs, they observe that the tuning job is not converging and the search space is large. They want to reduce the number of trials while still finding good hyperparameters. Which strategy should they use?

A.Increase the number of parallel trials to explore more points simultaneously.
B.Use Grid search instead of Bayesian optimization to systematically cover the search space.
C.Implement early stopping by using the 'early_stopping' flag in the hyperparameter tuning job.
D.Reduce the search space by applying feature selection and using prior knowledge.
AnswerD

A smaller search space requires fewer trials to find good hyperparameters.

Why this answer

Option D is correct because reducing the search space using prior knowledge directly decreases the number of trials needed. Option A is wrong because increasing parallel trials does not reduce the total number of trials. Option B is wrong because grid search generally requires more trials than Bayesian optimization.

Option C is wrong because early stopping reduces time per trial but does not reduce the number of trials.
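To see why narrowing the search space helps, it is useful to count the candidate configurations before and after applying prior knowledge. The sketch below uses a discretized space as a proxy; the hyperparameter names and values are hypothetical, and Bayesian optimization does not enumerate every point, so the counts only indicate how much room there is to explore.

```python
from math import prod
from typing import Dict, Sequence


def search_space_size(space: Dict[str, Sequence]) -> int:
    """Number of distinct configurations in a discretized search space."""
    return prod(len(values) for values in space.values())


# Hypothetical wide space: every value the team could imagine trying.
wide = {
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "batch_size": [16, 32, 64, 128, 256],
    "num_layers": [1, 2, 3, 4, 5, 6],
}

# Hypothetical narrowed space after ruling out values known to perform poorly.
narrow = {
    "learning_rate": [1e-4, 1e-3],
    "batch_size": [64, 128],
    "num_layers": [2, 3],
}

print(search_space_size(wide), search_space_size(narrow))
# → 150 8
```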

42
Multi-Selecteasy

Which TWO options are best practices for building ML pipelines on Vertex AI?

Select 2 answers
A.Use Cloud Functions to execute individual pipeline steps
B.Hardcode pipeline parameters in the component definitions
C.Use custom container components to encapsulate reusable logic
D.Always use the same compute environment for training and serving to ensure consistency
E.Leverage Vertex ML Metadata to track artifact lineage
AnswersC, E

Reusable components allow sharing across pipelines and reduce duplication.

Why this answer

Option C is correct because custom container components allow you to encapsulate reusable logic with specific dependencies, libraries, and environments, enabling consistent execution across pipeline steps. This is a best practice for building modular, maintainable ML pipelines on Vertex AI, as it decouples step logic from the pipeline orchestration and supports versioning and testing.

Exam trap

Google Cloud often tests the misconception that serverless functions like Cloud Functions are suitable for ML pipeline steps, but the trap is that ML steps require persistent state, longer timeouts, and specialized hardware, which Cloud Functions cannot provide.

43
Drag & Dropmedium

Drag and drop the steps to set up a BigQuery ML linear regression model for forecasting in the correct order.

Why this order

Start by preparing data, then create the model, evaluate, and predict.

44
Multi-Selectmedium

An ML team is designing an automated pipeline to retrain a recommendation model every day using new user interaction data stored in BigQuery. The pipeline must be cost-efficient, scalable, and require minimal manual intervention. Which two approaches should they consider?

Select 2 answers
A.Deploy a custom Kubernetes cron job on GKE to run the training script directly.
B.Use Cloud Composer (Airflow) to schedule the pipeline with a DAG.
C.Use Cloud Scheduler to publish a Pub/Sub message daily, which triggers a Cloud Function that starts the Vertex AI Pipeline.
D.Use Dataflow to continuously read from BigQuery and trigger training when new data arrives.
E.Use Vertex AI Pipelines to define the workflow and preemptible VMs for training to reduce cost.
AnswersC, E

This provides automated daily triggering with minimal overhead.

Why this answer

Option C is correct because Cloud Scheduler triggers a Pub/Sub message that invokes a Cloud Function, which starts a Vertex AI Pipeline. This serverless approach is cost-efficient (no idle compute), scales automatically, and requires minimal manual intervention. Option E is correct because Vertex AI Pipelines natively orchestrates ML workflows, and using preemptible VMs reduces training costs by up to 80% while maintaining scalability.
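Inside the Cloud Function, the first job is to turn the Pub/Sub message into pipeline parameters. The sketch below assumes the event shape used by Pub/Sub-triggered background functions, where the payload arrives base64-encoded under a `data` key; the function name and the `training_date` parameter are hypothetical, and the call that actually submits the pipeline run is left out.

```python
import base64
import json
from typing import Any, Dict


def parse_trigger_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the base64-encoded JSON payload of a Pub/Sub-style event."""
    raw = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(raw)


# Build an example event the way Cloud Scheduler's message would arrive.
payload = json.dumps({"training_date": "2024-01-01"}).encode("utf-8")
example_event = {"data": base64.b64encode(payload).decode("ascii")}

print(parse_trigger_event(example_event))
```

Keeping the parsing separate from the pipeline submission makes the function easy to unit test without any cloud credentials.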

Exam trap

Google Cloud often tests the distinction between batch scheduling (Cloud Scheduler) and continuous streaming (Dataflow), and candidates mistakenly choose Dataflow because they think 'new data' implies real-time, but the requirement is a daily retrain, not a streaming trigger.

45
MCQhard

An organization has multiple ML pipelines running on Vertex AI. They want to centralize monitoring and alerting for pipeline failures, including root cause analysis. Which combination of services should they use?

A.Cloud Trace + Cloud Debugger
B.Cloud Logging + Cloud Monitoring + Error Reporting
C.Cloud Operations for GKE + Stackdriver
D.Cloud Audit Logs + Cloud Functions
AnswerB

These services provide log aggregation, metrics, and error analysis for failures.

Why this answer

Option B is correct because Cloud Logging captures pipeline execution logs, Cloud Monitoring provides metrics and alerting on pipeline failures, and Error Reporting aggregates and analyzes errors with stack traces for root cause analysis. Together, they form a centralized observability stack that meets the requirement for monitoring, alerting, and root cause analysis of ML pipeline failures on Vertex AI.

Exam trap

The trap here is that candidates confuse Cloud Trace and Cloud Debugger (debugging tools) with the monitoring and logging services needed for failure detection and root cause analysis, or mistakenly think Cloud Audit Logs (compliance logs) are sufficient for pipeline error monitoring.

How to eliminate wrong answers

Option A is wrong because Cloud Trace is designed for latency analysis of distributed systems, not for monitoring pipeline failures or root cause analysis of errors, and Cloud Debugger inspects live application state without capturing historical failure data. Option C is wrong because Cloud Operations for GKE is specific to Google Kubernetes Engine workloads, not Vertex AI pipelines, and Stackdriver is the legacy name for what is now Cloud Operations, making this option outdated and misaligned with Vertex AI. Option D is wrong because Cloud Audit Logs record administrative actions and access logs, not pipeline execution errors or failures, and Cloud Functions alone cannot provide the centralized monitoring, alerting, and error analysis required.

46
MCQhard

A company is using Vertex AI Pipelines with reusable components. They observe that a component that performs hyperparameter tuning is failing intermittently with a 'ResourceExhausted' error. The component is configured with a small custom service account. What is the most likely cause?

A.The component code has a bug causing infinite recursion
B.The KFP executor is not properly configured
C.The service account does not have sufficient quotas or permissions to create the required number of trials or workers
D.The pipeline system memory is insufficient for the component
AnswerC

Hyperparameter tuning often spawns multiple trial jobs; quota limits on AI Platform training jobs or compute resources can cause this error.

Why this answer

The 'ResourceExhausted' error in Vertex AI Pipelines typically indicates that the component is trying to create more resources (e.g., trials or workers for hyperparameter tuning) than allowed by the assigned service account's quotas or permissions. A small custom service account often has restricted quotas for AI Platform services, such as the number of concurrent trials or training workers, leading to this failure.

Exam trap

Google Cloud often tests the misconception that 'ResourceExhausted' errors are always due to memory or code bugs, rather than understanding that Vertex AI enforces service-account-specific quotas for hyperparameter tuning resources.

How to eliminate wrong answers

Option A is wrong because infinite recursion would cause a stack overflow or timeout error, not a 'ResourceExhausted' error specific to resource quotas. Option B is wrong because the KFP executor is a generic pipeline runner; its configuration does not directly affect resource creation quotas for hyperparameter tuning jobs. Option D is wrong because pipeline system memory is a cluster-level resource, not the cause of a 'ResourceExhausted' error tied to service account quotas for creating trials or workers.

47
MCQhard

A large financial company uses a complex ML pipeline to detect fraudulent transactions. The pipeline consists of multiple steps: data ingestion from Pub/Sub, feature engineering using Dataflow, model training with Vertex AI, and deployment to an endpoint. They currently use Cloud Composer to orchestrate the pipeline with separate DAGs for each step. Recently, they have been experiencing failures in the Dataflow job due to schema changes in the incoming transactions, causing the pipeline to stall. The team manually fixes the schema and re-runs the pipeline, which is time-consuming. They want to improve the robustness of the pipeline. The pipeline is run on a schedule but also triggered by the arrival of new data. The team is considering moving to Vertex AI Pipelines to unify the workflow. They also want to automatically detect schema changes and handle them without manual intervention. Which approach should they take?

A.Keep using Cloud Composer but add retries with exponential backoff to the Dataflow task, and set up a Cloud Monitoring alert to notify the team if the task fails repeatedly
B.Migrate to Vertex AI Pipelines and add a pre-processing step that validates incoming data schema against a schema registry; if schema change is detected, the pipeline sends an alert and uses a default schema to continue processing
C.Use Cloud Scheduler to trigger the pipeline more frequently to reduce the impact of failures
D.Create a separate Dataflow pipeline to handle schema detection and run it before the main pipeline; if schema changes, send an email to the team
AnswerB

This provides automated handling of schema changes.

Why this answer

Option B is correct because it directly addresses the need for automated schema change detection and handling within a unified orchestration framework. By migrating to Vertex AI Pipelines, the team gains a managed, end-to-end ML workflow service that can include a pre-processing step to validate incoming data against a schema registry. When a schema change is detected, the pipeline can automatically apply a default schema and continue, eliminating manual intervention and reducing downtime.
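The validation and fallback step can be sketched in plain Python. Everything here is an assumption made for illustration: the expected schema, the field names, and the fallback policy (keep known fields, fill missing ones with `None`) are hypothetical, and a real pipeline would load the schema from a registry and send an alert when problems are found.

```python
from typing import Any, Dict, List

# Hypothetical expected schema; in practice this would come from a schema registry.
EXPECTED_SCHEMA: Dict[str, type] = {
    "transaction_id": str,
    "amount": float,
    "merchant": str,
}


def validate_record(record: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def apply_default_schema(record: Dict[str, Any], schema: Dict[str, type]) -> Dict[str, Any]:
    """Fallback: keep only the known fields and fill any missing ones with None."""
    return {field: record.get(field) for field in schema}


drifted = {"transaction_id": "t2", "amount": "12.5", "currency": "EUR"}
issues = validate_record(drifted, EXPECTED_SCHEMA)
if issues:
    print("schema drift detected:", issues)  # a real step would raise an alert here
    drifted = apply_default_schema(drifted, EXPECTED_SCHEMA)
print(drifted)
```

Note that the fallback keeps the pipeline running but does not repair bad values (the string `amount` passes through unchanged), so the alert remains essential for a proper fix later.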

Exam trap

The trap here is that candidates often think retries or alerts (Option A) are sufficient for handling failures, but the question explicitly requires automatic handling without manual intervention, which only a schema validation and fallback step can provide.

How to eliminate wrong answers

Option A is wrong because adding retries with exponential backoff does not solve the root cause of schema changes; it only retries the same failing operation, which will continue to fail until the schema is manually fixed, and Cloud Monitoring alerts still require manual intervention. Option C is wrong because increasing the frequency of pipeline runs via Cloud Scheduler does not address schema change failures; it would only cause more frequent failures and waste resources. Option D is wrong because creating a separate Dataflow pipeline for schema detection still requires manual email notification and manual re-run, and it does not integrate automated handling or a unified workflow like Vertex AI Pipelines provides.

48
Matchingmedium

Match each MLOps practice to its description.

Continuous integration and deployment for ML pipelines

Track and manage different model iterations

Monitor for changes in data or model performance over time

Schedule or trigger model retraining based on conditions

Compare model versions in production with traffic splitting

Why these pairings

MLOps ensures reliable and maintainable ML systems.

49
MCQmedium

An ML engineer is using Vertex AI Pipelines and wants to reuse a trained model across multiple pipeline runs without retraining each time. Which artifact management strategy should be used?

A.Store the model in BigQuery as a ML model
B.Use Cloud Functions to cache the model
C.Save the model to a Cloud Storage bucket and reference by path
D.Use Vertex AI ML Metadata to track and retrieve model artifacts
AnswerD

ML Metadata provides lineage and artifact tracking, enabling efficient reuse across pipelines.

Why this answer

Vertex AI ML Metadata is the correct artifact management strategy because it is purpose-built for tracking and retrieving model artifacts across pipeline runs. It stores metadata about models, datasets, and other artifacts in a lineage graph, enabling you to query and reuse a specific model version without retraining. This integrates natively with Vertex AI Pipelines, allowing you to pass model artifacts between components and retrieve them by ID or custom properties.

Exam trap

Google Cloud often tests the misconception that simply saving a model to Cloud Storage (Option C) is sufficient for artifact management, but the trap is that it ignores the need for metadata tracking, version lineage, and automated retrieval—features that Vertex AI ML Metadata provides as a managed service.

How to eliminate wrong answers

Option A is wrong because BigQuery is a data warehouse for structured data, not an artifact store for ML models; storing a model in BigQuery as an ML model (e.g., CREATE MODEL) is for in-database inference, not for retrieving a trained model artifact across pipelines. Option B is wrong because Cloud Functions are event-driven compute services, not a caching mechanism for model artifacts; they lack persistent storage and artifact versioning, and using them to cache models would be inefficient and unscalable. Option C is wrong because while saving a model to Cloud Storage and referencing by path is a common pattern, it is not a managed artifact management strategy—it lacks metadata tracking, version lineage, and automatic retrieval capabilities that Vertex AI ML Metadata provides, making it error-prone for reuse across multiple pipeline runs.

50
Multi-Selecthard

You are designing an ML pipeline for a large-scale recommendation system that runs weekly retraining on historical user interaction data. The pipeline uses TensorFlow and is deployed on Google Cloud. The pipeline must be orchestrated and automated with minimal manual intervention. Which THREE options should you include in your design? (Choose three.)

Select 3 answers
A.Use BigQuery scheduled queries to run the training script on a schedule.
B.Use Vertex AI Pipelines to define the ML pipeline as a Directed Acyclic Graph (DAG) of components.
C.Use AI Platform Notebooks to schedule the training job on a recurring basis.
D.Use Cloud Build and Cloud Functions to trigger the pipeline when new training data arrives in Cloud Storage.
E.Use Cloud Composer to orchestrate the pipeline steps, including data extraction, preprocessing, training, and deployment.
AnswersB, D, E

Vertex AI Pipelines is purpose-built for ML pipelines.

Why this answer

Vertex AI Pipelines (option B) is correct because it provides a managed, serverless orchestration service for building, testing, and deploying ML pipelines as Directed Acyclic Graphs (DAGs). This directly supports the requirement for automated, minimal-intervention weekly retraining by allowing you to define reusable components and schedule pipeline runs via Cloud Scheduler or event triggers, integrating natively with TensorFlow and Google Cloud services.

Exam trap

The trap here is confusing development tools (like Notebooks) or data-query services (like BigQuery scheduled queries) with production-grade orchestration services, leading candidates to select options that cannot handle multi-step pipeline dependencies or automated scheduling in a managed, scalable way.

51
MCQmedium

A team is using Cloud Composer to orchestrate ML workflows. They have a DAG that triggers a Vertex AI Training job, then a prediction deployment. The deployment step occasionally fails due to quota limits. What is the best way to handle this?

A.Increase the quota manually
B.Use Vertex AI Pipelines instead of Cloud Composer
C.Create a custom sensor to wait for quota to be available
D.Catch the exception in the DAG and send an alert
E.Implement exponential backoff retry in the DAG task
AnswerE

Retries with backoff handle transient failures.

Why this answer

Option E is correct because Cloud Composer (Apache Airflow) provides built-in retry mechanisms via task parameters such as `retries`, `retry_delay`, and `retry_exponential_backoff`. Exponential backoff in the DAG task is the best practice for handling transient quota errors: it automatically retries the deployment step with increasing delays, which eases pressure on the quota and raises the chance of success without manual intervention. This approach uses Airflow's native error handling and avoids unnecessary complexity or resource waste.
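A minimal sketch of the retry settings for the deployment task is shown below as a plain dictionary, so it runs without Airflow installed. The keys are standard Airflow task arguments; the variable name and the specific values are illustrative assumptions, and the dictionary would typically be passed as `default_args` to the DAG or unpacked into the deployment operator.

```python
from datetime import timedelta

# Retry settings for the deployment task (values are illustrative).
deploy_task_args = {
    "retries": 5,                              # give up after five failed retries
    "retry_delay": timedelta(minutes=1),       # base wait before the first retry
    "retry_exponential_backoff": True,         # lengthen the wait on each retry
    "max_retry_delay": timedelta(minutes=30),  # upper bound on any single wait
}

print(deploy_task_args["retries"], deploy_task_args["max_retry_delay"])
```

Setting `max_retry_delay` alongside the backoff flag keeps the longest wait bounded, which matters when the DAG has a deadline for the deployment step.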

Exam trap

The trap here is that candidates often confuse manual quota increases or switching tools as the primary solution, when the exam expects knowledge of Airflow's native retry mechanisms and the principle of handling transient errors automatically within the orchestration layer.

How to eliminate wrong answers

Option A is wrong because manually increasing quota is a reactive, non-scalable solution that does not address transient quota limits and may incur additional costs or require approval processes. Option B is wrong because switching to Vertex AI Pipelines does not inherently solve quota limit issues; it changes the orchestration tool but still relies on the same underlying Vertex AI services and quota constraints. Option C is wrong because creating a custom sensor to wait for quota availability is overly complex, introduces polling overhead, and does not leverage Airflow's built-in retry mechanisms; sensors are better suited for waiting on external conditions like file arrival, not for handling transient API errors.

Option D is wrong because catching the exception and sending an alert only notifies the team of failure without automatically recovering the task, leading to manual intervention and potential delays; it does not handle the transient nature of quota errors.
