
Scenario-based practice

Hard Difficulty Questions

Practise original exam-style scenarios for the AWS Certified Machine Learning Engineer Associate (MLA-C01) exam, covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.

Scenario questions: 20
Exam code: MLA-C01
Vendor: Amazon Web Services

Scenario guide

How to approach hard difficulty questions

These are the questions most candidates get wrong. They require connecting multiple concepts, reading tricky output, or knowing edge-case behaviour that isn't on most study cards. Practising them trains you to operate under uncertainty, a necessary skill on the real exam.

Quick answer

Hard difficulty questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Related practice questions

Related MLA-C01 topic practice pages

Scenario questions usually connect to one or more exam topics. Use these links to review the underlying concepts behind the scenario.

Practice set

Practice scenarios

Question 1 (hard, multi-select)

A company is running a SageMaker endpoint serving multiple models. They need to monitor for data drift and model quality. Which THREE actions are necessary? (Choose three.)

Question 2 (hard, multiple choice)

A data scientist trained a logistic regression model on a dataset with 100 features. After training, the training accuracy is 0.99 but validation accuracy is 0.75. Which action is MOST likely to reduce overfitting?

Question 3 (hard, multiple choice)

A data engineer is processing a large dataset in Amazon S3 with AWS Glue ETL. The dataset contains timestamps in multiple time zones. The engineer needs to create a feature for hour-of-day consistent across all records. Which approach ensures correctness?
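One way to reason about this scenario is that an hour-of-day feature is only comparable across records once every timestamp is expressed in the same zone. The sketch below is a plain-Python illustration, not Glue code: it assumes each timestamp is an ISO 8601 string carrying an explicit UTC offset, and the sample values and the `utc_hour` helper are hypothetical.

```python
from datetime import datetime, timezone


def utc_hour(iso_timestamp: str) -> int:
    """Return the hour of day in UTC for an ISO 8601 timestamp with an offset."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # A naive timestamp cannot be normalised without knowing its zone.
        raise ValueError("timestamp has no UTC offset")
    return parsed.astimezone(timezone.utc).hour


# Hypothetical records written in three different local zones.
samples = [
    "2024-03-10T09:30:00-05:00",
    "2024-03-10T15:30:00+01:00",
    "2024-03-10T23:30:00+09:00",
]
hours = [utc_hour(s) for s in samples]
print(hours)  # [14, 14, 14]: all three are the same instant, 14:30 UTC
```

Extracting the hour directly from the local strings would have produced 9, 15, and 23 for what is actually one moment in time.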

Question 4 (hard, multiple choice)

A dataset contains a numerical feature with extreme outliers. The outliers are genuine (not errors), and the ML model is a linear regression which is sensitive to outliers. Which data transformation should be applied to reduce the impact of outliers while preserving the data?
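A common candidate for this kind of scenario is a logarithmic transform, which compresses large values while keeping every record and its rank order. The sketch below is illustrative only: the values are made up, and it assumes the feature is non-negative so that `math.log1p` is defined.

```python
import math

# Hypothetical feature values with one genuine extreme outlier.
values = [12.0, 15.0, 18.0, 20.0, 25000.0]

# log1p(x) = log(1 + x); safe at zero, assumes x >= 0.
log_values = [math.log1p(v) for v in values]

raw_ratio = max(values) / min(values)
log_ratio = max(log_values) / min(log_values)

print(f"largest/smallest before: {raw_ratio:.1f}")
print(f"largest/smallest after:  {log_ratio:.1f}")
```

The outlier is still the largest value after the transform, but its distance from the rest of the data shrinks from thousands of times to a small multiple, so it pulls far less on a least-squares fit.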

Question 5 (hard, multi-select)

A team is preparing text data for a natural language processing (NLP) model. They have a corpus of customer reviews. Which THREE preprocessing steps are essential to reduce noise and improve model performance?

Question 6 (hard, multi-select)

A machine learning engineer is deploying a custom PyTorch model to a SageMaker endpoint for real-time inference. The model requires GPU acceleration. The engineer wants to minimize latency and cost. Which THREE actions should the engineer take? (Select THREE.)

Question 7 (hard, multiple choice)

A company wants to use a pre-trained NLP model from SageMaker JumpStart for sentiment analysis. Which step is required to make predictions?

Question 8 (hard, multiple choice)

A team is building a regression model on a dataset with missing values in multiple features. They decide to use a k-Nearest Neighbors (k-NN) imputer. The dataset has 100,000 rows and 50 features. Which step should the team take to ensure the imputation is efficient and accurate?
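Because k-NN imputation relies on distances between rows, one consideration worth keeping in mind here is feature scale: an unscaled feature with a large range can dominate the choice of neighbours. The sketch below illustrates that effect with the standard library only. The two-column rows (age, income) and both helper functions are hypothetical, and this is not presented as the keyed answer.

```python
import math


def min_max_scale(rows):
    """Rescale every column to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def nearest(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index]."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))


# Hypothetical rows: [age in years, income in dollars].
rows = [
    [25.0, 50_000.0],  # query row
    [60.0, 50_500.0],  # close in income, far in age
    [26.0, 58_000.0],  # close in age, moderately different income
    [40.0, 90_000.0],
]

raw_neighbor = nearest(rows, 0)                    # income dominates
scaled_neighbor = nearest(min_max_scale(rows), 0)  # both features count
print(raw_neighbor, scaled_neighbor)  # 1 2
```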

Question 9 (hard, multiple choice)

A data scientist is training a binary classifier on a highly imbalanced dataset (1:100 class ratio). The dataset contains 500,000 rows and 30 features. The data is stored in S3 in Parquet format. The data scientist wants to use SageMaker's built-in XGBoost algorithm. Which data preparation technique should the data scientist apply to best address the class imbalance without causing data leakage?
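The leakage concern in this scenario comes down to ordering: any statistic used to counter the imbalance should be derived from the training split only, after the split has been made. The sketch below shows one such approach, computing a class weight from training labels. The label list, the 20% validation share, and the `stratified_split` helper are all hypothetical. `scale_pos_weight` is a real XGBoost hyperparameter, but whether it is the intended answer is not stated in the question.

```python
import random


def stratified_split(labels, validation_percent, seed):
    """Split row indices per class so both splits keep the class ratio."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = len(idx) * validation_percent // 100
        val_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    return train_idx, val_idx


# Hypothetical labels with a 1:100 positive-to-negative ratio.
labels = [1] * 50 + [0] * 5000

# Step 1: split first, so validation rows never influence preparation.
train_idx, val_idx = stratified_split(labels, validation_percent=20, seed=7)

# Step 2: derive the weight from training labels only.
train_labels = [labels[i] for i in train_idx]
negatives = train_labels.count(0)
positives = train_labels.count(1)
scale_pos_weight = negatives / positives
print(scale_pos_weight)  # 100.0
```

The same ordering applies to oversampling: resampling before the split can place copies of the same minority row in both training and validation data.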

Question 10 (hard, multiple choice)

A data engineer is using Amazon SageMaker Processing to run a data preprocessing script on a dataset with 500 million rows. The script runs out of memory on a single ml.r5.24xlarge instance. The engineer needs to modify the processing job to handle the dataset size. Which approach is most cost-effective and scalable?

Question 11 (hard, multiple choice)

Refer to the exhibit. A data engineer deploys this Glue job via CloudFormation. When running, the job fails with a timeout after 2 hours. The job processes a large dataset and is expected to take 3 hours. Which change would resolve the issue?

Exhibit

Resources:
  MyGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation: s3://data-bucket/scripts/etl.py
        PythonVersion: "3"
      Role: arn:aws:iam::123456789012:role/GlueServiceRole
      DefaultArguments:
        "TempDir": "s3://data-bucket/temp"
      GlueVersion: "2.0"
      WorkerType: G.1X
      NumberOfWorkers: 10
      MaxRetries: 0
      Timeout: 120
Question 12 (hard, multi-select)

A data scientist is working with a dataset containing customer demographics and purchase history. The dataset includes categorical variables with high cardinality (e.g., ZIP code, product ID). The data scientist wants to perform feature engineering to improve model performance. Which THREE feature engineering techniques should the data scientist consider? (Choose three.)
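Two techniques often discussed for high-cardinality categoricals are frequency encoding and the hashing trick, both of which avoid creating one column per distinct value. The sketch below illustrates them with the standard library. The ZIP codes, the bucket count of 16, and both helper names are hypothetical, and these are examples of the category, not a statement of which three options the question keys.

```python
import hashlib
from collections import Counter


def frequency_encode(values):
    """Replace each category with its share of the rows."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


def hash_bucket(value, num_buckets):
    """Map a category to one of num_buckets slots, deterministically."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


# Hypothetical ZIP code column.
zip_codes = ["98101", "98101", "10001", "94105", "98101", "10001"]

freq = frequency_encode(zip_codes)
buckets = [hash_bucket(z, 16) for z in zip_codes]

print(freq[0])  # 0.5, because "98101" appears in 3 of 6 rows
```

Hashing caps the feature width at the bucket count at the price of occasional collisions, while frequency encoding yields a single numeric column but cannot distinguish categories that occur equally often.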

Question 13 (hard, multiple choice)

A data scientist is preprocessing time series data for a fraud detection model. The data includes transaction timestamps, amounts, and merchant IDs. The model should predict fraud within seconds of a transaction. The data scientist wants to avoid data leakage by not using future information to predict past events. Which data preparation practice should be implemented?
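The leakage described here can be avoided in two places: the train/test boundary and the features themselves. The sketch below shows a chronological split plus a per-merchant feature that looks only at earlier transactions. The transaction records, the cutoff, and the helper names are hypothetical.

```python
from datetime import datetime

# Hypothetical transactions, deliberately not in time order.
transactions = [
    {"ts": datetime(2024, 1, 1, 13, 0), "merchant": "m1", "amount": 40.0},
    {"ts": datetime(2024, 1, 1, 9, 0), "merchant": "m1", "amount": 20.0},
    {"ts": datetime(2024, 1, 1, 14, 0), "merchant": "m1", "amount": 900.0},
    {"ts": datetime(2024, 1, 1, 10, 0), "merchant": "m2", "amount": 30.0},
]


def chronological_split(rows, cutoff):
    """Train on everything before the cutoff, test on everything after."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    train = [r for r in ordered if r["ts"] < cutoff]
    test = [r for r in ordered if r["ts"] >= cutoff]
    return train, test


def prior_merchant_mean(rows):
    """Mean amount of EARLIER transactions at the same merchant, in time order."""
    totals, counts, features = {}, {}, []
    for row in sorted(rows, key=lambda r: r["ts"]):
        m = row["merchant"]
        # Read the running statistics before adding the current row.
        features.append(totals[m] / counts[m] if counts.get(m) else None)
        totals[m] = totals.get(m, 0.0) + row["amount"]
        counts[m] = counts.get(m, 0) + 1
    return features


train, test = chronological_split(transactions, datetime(2024, 1, 1, 12, 0))
features = prior_merchant_mean(transactions)
print(features)  # [None, None, 20.0, 30.0]
```

A random shuffle-and-split, or a merchant average computed over the whole dataset, would let information from later transactions reach earlier ones.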

Question 14 (hard, multi-select)

A company is preparing a large dataset for a SageMaker built-in XGBoost model. The dataset has missing values in both numeric and categorical features, and some categorical features have high cardinality. Which THREE data preparation steps should the company take to optimize model performance? (Choose three.)

Question 15 (hard, multi-select)

A data engineer is optimizing Amazon Athena queries on large datasets stored in S3 for machine learning data preparation. Which THREE practices improve query performance?

Question 16 (hard, multi-select)

A company is using an AWS Step Functions state machine to orchestrate a multi-step ML deployment. The workflow includes: training a model, evaluating it, registering the model, and deploying to a staging endpoint. They need to implement an approval gate before deploying to production. Which THREE components are necessary to achieve this? (Choose three.)

Question 17 (hard, multiple choice)

Refer to the exhibit. A company configures a SageMaker Model Monitor Data Quality monitoring schedule as shown. The schedule runs every hour. However, the team notices that the monitoring job fails intermittently with an AccessDenied error when accessing the S3 bucket for output. The IAM role SageMakerMonitorRole has permissions to write to s3://my-bucket/monitor-output. What is the MOST likely cause of the failure?

Exhibit

{
  "MonitoringScheduleName": "model-quality-monitor",
  "EndpointName": "my-endpoint",
  "MonitoringType": "DataQuality",
  "MonitoringScheduleConfig": {
    "ScheduleExpression": "cron(0 * * * ? *)",
    "MonitoringJobDefinition": {
      "BaselineConfig": {
        "BaseliningJobName": "baseline-job",
        "ConstraintsResource": {
          "S3Uri": "s3://my-bucket/baseline/constraints.json"
        },
        "StatisticsResource": {
          "S3Uri": "s3://my-bucket/baseline/statistics.json"
        }
      },
      "MonitoringInputs": [
        {
          "EndpointInput": {
            "EndpointName": "my-endpoint",
            "LocalPath": "/opt/ml/processing/input/endpoint",
            "S3DataDistributionType": "FullyReplicated",
            "S3InputMode": "File"
          }
        }
      ],
      "MonitoringOutputConfig": {
        "MonitoringOutputs": [
          {
            "S3Output": {
              "S3Uri": "s3://my-bucket/monitor-output",
              "LocalPath": "/opt/ml/processing/output",
              "S3UploadMode": "Continuous"
            }
          }
        ]
      },
      "MonitoringResources": {
        "ClusterConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m5.large",
          "VolumeSizeInGB": 20
        }
      },
      "RoleArn": "arn:aws:iam::123456789012:role/SageMakerMonitorRole"
    }
  }
}
Question 18 (hard, multiple choice)

Refer to the exhibit. A SageMaker Pipeline fails with 'Invalid output reference' at the TrainingStep. What is the most likely cause?

Exhibit

TrainingStep(
    name="TrainModel",
    step_args=train_args,
    depends_on=[tuning_step]
)
tuning_step = TuningStep(...) # produces multiple artifacts
Question 19 (hard, multiple choice)

A financial services company has a SageMaker pipeline that trains a fraud detection model daily. The pipeline consists of three steps: preprocessing (using a Spark script), training (XGBoost), and evaluation. The evaluation step calculates the F1 score and compares it to a threshold of 0.95. If the F1 score is below 0.95, the pipeline should fail and notify the team via email. The team implemented this using a Condition step that checks if the F1 score is greater than or equal to 0.95. If true, the pipeline proceeds to register the model; if false, the pipeline fails. However, the team notices that even when the F1 score is 0.94, the pipeline continues to the registration step. The evaluation script outputs the F1 score as a float with two decimal places in a JSON file. The Condition step uses the expression: $.evaluation.metrics.f1_score >= 0.95. What is the most likely cause of the issue?

Question 20 (hard, multiple choice)

An e-commerce company uses Amazon SageMaker to train a model that predicts click-through rates. The training data includes a timestamp column 'click_time' and a categorical feature 'device_type' (8 values). They notice that the model's performance degrades over time because the data distribution shifts. They want to ensure the training data represents the most recent behavior. The data is stored in a daily partitioned S3 bucket (e.g., s3://bucket/data/2024-01-01/). The total dataset size is 500 GB. Which approach should they take to prepare the training data while minimizing bias and cost?
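With date-partitioned storage like this, one way to favour recent behaviour without scanning all 500 GB is to read only the partitions inside a sliding window. The sketch below builds the list of daily prefixes for such a window. The window length, the end date, and the `recent_partition_prefixes` helper are hypothetical, and the base URI simply follows the example path in the question.

```python
from datetime import date, timedelta


def recent_partition_prefixes(base_uri, end_day, window_days):
    """Daily partition prefixes for the window ending on end_day, oldest first."""
    return [
        f"{base_uri}/{(end_day - timedelta(days=offset)).isoformat()}/"
        for offset in range(window_days - 1, -1, -1)
    ]


# Hypothetical 3-day window ending on 2024-01-03.
prefixes = recent_partition_prefixes("s3://bucket/data", date(2024, 1, 3), 3)
for prefix in prefixes:
    print(prefix)
# s3://bucket/data/2024-01-01/
# s3://bucket/data/2024-01-02/
# s3://bucket/data/2024-01-03/
```

Choosing whole days keeps every hour of the day and every `device_type` value represented in proportion to real traffic, which a filter on arbitrary row counts would not guarantee.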

These MLA-C01 practice questions are part of Courseiva's free Amazon Web Services certification practice question bank. Courseiva provides original exam-style MLA-C01 questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.