Knowledge + Practice

AWS Certified Machine Learning Specialty MLS-C01 (MLS-C01) — Questions 676–750

1755 questions total · 24 pages · All types, answers revealed

676

Multi-Selecthard

A data engineer is designing a data pipeline that uses Amazon Kinesis Data Streams to ingest real-time transaction data. The data must be processed in near real-time and stored in Amazon S3 for long-term analytics. The engineer wants to ensure data durability and exactly-once processing semantics. Which TWO actions should the engineer take? (Choose two.)

Select 2 answers

A.Use the Kinesis Producer Library (KPL) with exactly-once delivery.

B.Use AWS Glue streaming ETL with checkpointing.

C.Enable exactly-once delivery on Kinesis Data Firehose.

D.Use AWS Lambda with the Kinesis trigger and enable event source mapping with RetryAttempts set to 0.

E.Use Amazon SQS as the event source for downstream processing.

AnswersA, C

KPL provides exactly-once semantics when configured to do so.

Why this answer

Correct options: A and C. Using the Kinesis Producer Library (KPL) with exactly-once delivery ensures no duplicates. Enabling Kinesis Data Firehose's exactly-once delivery to S3 ensures data is written exactly once.

Option B (Glue) does not provide exactly-once for streaming. Option D (Lambda) can process records but does not guarantee exactly-once semantics without additional logic. Option E (SQS) is not part of Kinesis.

677

MCQeasy

A data scientist is using Amazon SageMaker to deploy a model for real-time inference. The model is a TensorFlow neural network. The scientist wants to use automatic scaling based on the number of incoming requests. Which service integration is required?

A.Amazon ECS with service auto scaling

B.Amazon SageMaker endpoint configured with Application Auto Scaling

C.AWS Lambda with provisioned concurrency

D.AWS Auto Scaling plans

AnswerB

SageMaker integrates with Application Auto Scaling to scale endpoints based on demand.

Why this answer

Amazon SageMaker endpoints natively integrate with Application Auto Scaling to adjust the number of instances based on a target metric, such as the number of incoming requests per instance. This allows the TensorFlow model to scale automatically in response to traffic, without needing additional orchestration services.

Exam trap

The trap here is that candidates may confuse SageMaker's built-in auto scaling with external services like ECS or Lambda, not realizing that SageMaker endpoints directly integrate with Application Auto Scaling for request-based scaling.

How to eliminate wrong answers

Option A is wrong because Amazon ECS with service auto scaling is used for container orchestration, not for scaling SageMaker endpoints; SageMaker manages its own infrastructure. Option C is wrong because AWS Lambda with provisioned concurrency is for serverless functions, not for deploying a TensorFlow neural network model for real-time inference via SageMaker. Option D is wrong because AWS Auto Scaling plans are a higher-level service for scaling multiple resources, but SageMaker endpoints require direct integration with Application Auto Scaling via a scaling policy, not a generic plan.
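The integration described above can be sketched as the two request payloads that Application Auto Scaling expects for a SageMaker endpoint variant. This is a minimal sketch under stated assumptions: the endpoint name, variant name, capacity bounds, target value, and policy name are placeholders, and the helper `build_scaling_requests` is hypothetical. The dictionaries are meant to be passed to the boto3 `application-autoscaling` client's `register_scalable_target` and `put_scaling_policy` calls.

```python
def build_scaling_requests(endpoint_name, variant_name,
                           min_capacity=1, max_capacity=4,
                           invocations_per_instance=100.0):
    """Return (target, policy) keyword arguments for Application Auto Scaling."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Capacity bounds for the variant (assumed values).
    target = {**common, "MinCapacity": min_capacity, "MaxCapacity": max_capacity}
    # Target-tracking policy on invocations per instance.
    policy = {
        **common,
        "PolicyName": "invocations-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


target, policy = build_scaling_requests("my-endpoint", "AllTraffic")
# With boto3 and AWS credentials available, the payloads could be applied as:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```

Building the payloads separately from the API calls keeps the sketch runnable without AWS access; the target value of 100 invocations per instance is illustrative only.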

678

MCQhard

A company is using Amazon SageMaker to train a model using a custom Docker container. The training script writes model artifacts to the `/opt/ml/model` directory. The training job completes successfully, but the model artifacts are not uploaded to the S3 output path specified in the training job. The company has verified that the SageMaker execution role has the necessary S3 permissions. The Docker container is built using a base image that is not one of the official SageMaker Docker images. What is the MOST likely reason for the failure to upload model artifacts?

A.The training script's entry point is not correctly specified in the container.

B.The custom container does not include the SageMaker training toolkit, which handles artifact uploads.

C.The output path in the training job configuration is incorrectly formatted.

D.The SageMaker execution role does not have s3:PutObject permission on the output bucket.

AnswerB

Correct: Without the toolkit, SageMaker does not automatically upload artifacts.

Why this answer

When using a custom Docker container, SageMaker expects the container to have a training entry point that follows the SageMaker toolkit conventions. If the container does not include the SageMaker training toolkit, SageMaker cannot automatically upload artifacts. Option B is correct.

Option A (entry point) is not the issue if the script runs. Option C (output path) is configured correctly. Option D (S3 permissions) is already verified.

679

MCQeasy

A data scientist wants to explore a large dataset stored in Amazon S3 using SQL queries without moving the data. The dataset is in CSV format and is updated daily with new partitions. Which AWS service should be used to directly query the data in S3?

A.Amazon Athena

B.Amazon Redshift Spectrum

C.Amazon EMR

D.AWS Glue

AnswerA

Athena is purpose-built for querying data in S3 with no infrastructure to manage.

Why this answer

Amazon Athena is a serverless interactive query service that allows querying data directly in S3 using standard SQL. Amazon Redshift Spectrum (option B) can also query S3 but requires a Redshift cluster. Amazon EMR (option C) requires cluster management.

AWS Glue (option D) is for ETL, not interactive querying.

680

MCQeasy

A data engineer needs to analyze large CSV files stored in Amazon S3 using SQL queries. The data is not frequently accessed, and cost is a primary concern. Which AWS service should be used to query the data directly in S3 without moving it?

A.Amazon Athena

B.Amazon EMR

C.Amazon Redshift Spectrum

D.AWS Glue

AnswerA

Athena is serverless and directly queries S3 using SQL with pay-per-query pricing.

Why this answer

Amazon Athena is a serverless query service that allows SQL queries directly on data in S3, ideal for ad-hoc analysis with a pay-per-query pricing model. Option C (Amazon Redshift Spectrum) also queries S3 but requires an existing Redshift cluster. Option D (AWS Glue) is for ETL.

Option B (Amazon EMR) requires cluster management and is more expensive for occasional queries.

681

MCQmedium

An IAM policy attached to a SageMaker execution role is shown. A training job executed with this role fails with an error that the role cannot access the S3 bucket. The training job uses input data from s3://my-bucket/train/data.csv and output to s3://my-bucket/output/. What is the most likely cause?

A.The training job does not have s3:GetObject permission for the input data

B.The training data is encrypted with SSE-KMS and the role lacks KMS permissions

C.The training job does not have s3:PutObject permission for the output location

D.The S3 bucket is in a different region than the training job

AnswerC

The output path 'output/' is not covered by the resource 'train/*', so PutObject fails.

Why this answer

Option C is correct because the error message indicates the role cannot access the S3 bucket, which typically occurs when the role lacks write permissions to the output location. The training job needs s3:PutObject permission to write the output artifacts (model, logs, etc.) to s3://my-bucket/output/. Without this permission, SageMaker fails to save the training results, resulting in an access error.

Exam trap

AWS often tests the distinction between read and write permissions in SageMaker S3 access, and the trap here is that candidates assume the error is about reading input data (Option A) when the actual failure is due to missing write permissions for the output location (Option C).

How to eliminate wrong answers

Option A is wrong because the error is about accessing the bucket, not specifically the input data; if s3:GetObject were missing, the error would likely be more specific to reading the input file, and the job would fail at the data loading stage, not with a general bucket access error. Option B is wrong because there is no mention of SSE-KMS encryption in the scenario; if the data were encrypted with KMS, the error would reference KMS permissions, not a generic S3 bucket access error. Option D is wrong because SageMaker training jobs can access S3 buckets in different regions as long as the bucket policy and IAM role allow cross-region access; the error message does not indicate a region mismatch, and SageMaker handles cross-region S3 access transparently.
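The resource mismatch behind this answer can be illustrated with a small pattern check. This is a sketch under assumptions: the policy exhibit is not reproduced here, so the resource pattern below is inferred from the answer summary ('train/*'), the output object key is hypothetical, and shell-style matching from `fnmatch` only approximates how IAM evaluates `*` wildcards.

```python
from fnmatch import fnmatchcase

# Assumed resource pattern; the actual exhibit is not shown in this document.
ALLOWED_RESOURCE = "arn:aws:s3:::my-bucket/train/*"


def is_covered(object_arn, resource_pattern=ALLOWED_RESOURCE):
    """Approximate IAM resource matching with shell-style wildcards."""
    return fnmatchcase(object_arn, resource_pattern)


# The input object falls under train/, the output artifact does not.
input_ok = is_covered("arn:aws:s3:::my-bucket/train/data.csv")
output_ok = is_covered("arn:aws:s3:::my-bucket/output/model.tar.gz")
print(input_ok, output_ok)
```

Under these assumptions, a statement scoped to `train/*` covers the input object but not anything written under `output/`.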

682

Multi-Selecteasy

A data engineering team needs to schedule a nightly ETL job that extracts data from an Amazon RDS for PostgreSQL instance, transforms it using Spark, and loads it into Amazon S3. The team wants to use AWS Glue for this task. Which components are required? (Select TWO.)

Select 2 answers

A.An AWS Glue ETL job with a Spark script.

B.An AWS Glue crawler to populate the Data Catalog.

C.An AWS Glue connection to the RDS database.

D.An AWS Glue development endpoint.

E.An AWS Glue notebook for data exploration.

AnswersA, C

The job performs the defined ETL logic.

Why this answer

Option C is correct because a connection to the PostgreSQL database is needed for extraction. Option A is correct because an AWS Glue ETL job with a Spark script performs the transformation. Option B is wrong because a crawler is for cataloging, not for ETL.

Option D is wrong because a development endpoint is for interactive development, not production scheduling. Option E is wrong because a notebook is for development, not for scheduled jobs.

683

Multi-Selectmedium

A data scientist is performing EDA on a dataset with 100 features. They want to reduce dimensionality by removing highly correlated features. Which TWO approaches are appropriate? (Choose TWO.)

Select 2 answers

A.Use feature importance from a random forest to select top features.

B.Remove features with low variance using VarianceThreshold.

C.Compute a correlation matrix and remove one feature from each pair with correlation >0.95.

D.Use Principal Component Analysis (PCA) and select components that explain 95% of variance.

E.Apply L1 regularization (Lasso) during model training to zero out coefficients of correlated features.

AnswersC, D

This directly removes redundant features.

Why this answer

Options C and D are correct. Option C: Removing one feature from each pair with correlation >0.95 directly reduces redundancy. Option D: Using PCA creates uncorrelated components.

Option A is wrong because feature importance from tree-based models is not specifically for removing correlated features. Option B is wrong because a variance threshold removes near-constant features, not correlated ones. Option E is wrong because L1 regularization is a modeling technique, not EDA.
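The correlation-matrix approach can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the helper names `pearson` and `drop_correlated`, the toy feature names, and the rule of keeping the first feature of each correlated pair are all assumptions made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.95):
    """Keep the first feature of each highly correlated pair, drop the later one."""
    kept, dropped = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


random.seed(0)
a = [random.gauss(0, 1) for _ in range(200)]
data = {
    "a": a,
    "b": [2 * v + random.gauss(0, 0.01) for v in a],  # near-duplicate of "a"
    "c": [random.gauss(0, 1) for _ in range(200)],    # independent noise
}
kept, dropped = drop_correlated(data)
print(kept, dropped)
```

With 100 real features the same idea is usually applied to the upper triangle of a full correlation matrix; which member of a pair to drop is a judgment call.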

684

Multi-Selecteasy

Which TWO of the following are valid Amazon SageMaker built-in algorithms for regression tasks? (Select TWO.)

Select 2 answers

A.BlazingText

B.XGBoost

C.Image Classification

D.Object Detection

E.Linear Learner

AnswersB, E

XGBoost supports regression.

Why this answer

XGBoost is a valid Amazon SageMaker built-in algorithm for regression tasks because it supports regression objectives such as 'reg:squarederror' and 'reg:logistic'. It is a gradient boosting framework that builds an ensemble of decision trees, making it suitable for both regression and classification problems. Linear Learner is the second correct choice: it fits linear models for regression as well as classification.

Exam trap

The trap here is that candidates often confuse algorithms that can be used for regression (like XGBoost and Linear Learner) with those that are exclusively for classification or computer vision tasks, leading them to select BlazingText or Image Classification incorrectly.

685

MCQhard

During exploratory data analysis on a dataset with 1 million rows, a data scientist notices that the distribution of the target variable is highly imbalanced (99% class A, 1% class B). Which technique should be applied to address this imbalance before model training?

A.Randomly undersample the majority class to match the minority class size

B.Apply standard scaling to all features

C.Use PCA to reduce dimensionality and oversample in principal component space

D.Use SMOTE to generate synthetic samples for the minority class

AnswerD

SMOTE creates synthetic examples to balance classes.

Why this answer

Option D is correct because SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the minority class, balancing the dataset. Option A is wrong because random undersampling can discard important data. Option B is wrong because scaling does not address imbalance.

Option C is wrong because PCA does not fix imbalance.
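The idea behind SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration only, not the algorithm as implemented in any particular library; the function name `smote_like`, the toy points, and the choice of k are assumptions.

```python
import math
import random


def smote_like(minority, n_synthetic, k=3, seed=0):
    """Create synthetic points by interpolating toward nearby minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_synthetic=6)
```

Each synthetic point lies on a segment between two real minority samples, which is why the technique adds variety rather than duplicating rows. Oversampling of this kind is normally applied to the training split only.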

686

MCQhard

A company is running a machine learning training job on Amazon SageMaker that reads training data from an S3 bucket. The job fails intermittently with an S3 throttling error. The data is partitioned across thousands of small files (average 100 KB). Which strategy is MOST effective to resolve the throttling issue?

A.Use Amazon Athena to query the data and output results to a new S3 location

B.Enable S3 Transfer Acceleration on the bucket

C.Combine the small files into larger files (e.g., 100 MB) using a preprocessing step

D.Increase the number of SageMaker training instances to distribute the load

AnswerC

Larger files reduce the number of GET requests, mitigating throttling.

Why this answer

Combining small files into larger files reduces the number of S3 GET requests, which reduces the chance of throttling. Increasing the number of instances (option D) would increase parallelism and could worsen throttling. Using S3 Transfer Acceleration (option B) improves transfer speed but does not reduce request rate.

Using Athena (option A) is for querying, not for training data access.
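The preprocessing step can start from a plan of which small objects to merge into each larger file. The sketch below shows only that grouping logic; the function name `plan_batches`, the object keys, and the listing are hypothetical, and the actual read and concatenate step is left out.

```python
def plan_batches(file_sizes, target_bytes=100 * 1024 * 1024):
    """Group (key, size) pairs into batches of roughly target_bytes each."""
    batches, current, current_size = [], [], 0
    for key, size in file_sizes:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical listing: 5,000 objects of 100 KB each.
listing = [(f"train/part-{i:05d}.csv", 100 * 1024) for i in range(5000)]
batches = plan_batches(listing)
print(len(listing), "objects ->", len(batches), "merged files")
```

In this example the training job would issue GET requests for a handful of merged files instead of thousands of small ones.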

687

MCQhard

A company is using a custom Docker container in SageMaker for training. The training job fails with 'ResourceLimitExceeded' error. Which action should the data scientist take?

A.Use a smaller instance type

B.Reduce the number of epochs

C.Request a limit increase for the instance type

D.Use a pre-built SageMaker container instead

AnswerC

Directly addresses the error.

Why this answer

The 'ResourceLimitExceeded' error in SageMaker indicates that the AWS account has reached a service quota for the specified instance type (e.g., ml.p3.2xlarge). This is a quota limit, not a performance or resource exhaustion issue within the training job itself. The correct action is to request a limit increase via the AWS Service Quotas console or AWS Support, which raises the maximum number of concurrent instances or total vCPUs allowed for that instance family.

Exam trap

AWS often tests the misconception that 'ResourceLimitExceeded' is a performance or memory error, leading candidates to choose instance downsizing or epoch reduction, when in fact it is a strict AWS account quota that must be raised through a formal request.

How to eliminate wrong answers

Option A is wrong because using a smaller instance type does not resolve a quota limit error; it only changes which quota is checked, and the smaller instance may still be subject to its own quota or may not meet the training job's memory/compute requirements. Option B is wrong because reducing the number of epochs addresses model convergence or training time, not the AWS service quota that limits the number or type of instances you can launch concurrently. Option D is wrong because switching to a pre-built SageMaker container does not affect instance quotas; the error is about resource limits at the AWS account level, not about container compatibility or image configuration.

688

MCQmedium

A company has a SageMaker endpoint that serves predictions for a mobile app. The endpoint is deployed on a single ml.m5.large instance. Recently, users have reported that the app sometimes returns outdated predictions. The data science team has confirmed that the model is updated daily by retraining with new data and creating a new endpoint configuration. However, the endpoint still returns predictions from the old model for some requests. The team has verified that the new endpoint configuration is associated with the endpoint and that the endpoint is in service. What is the most likely cause of this issue?

A.The old model artifacts are still being cached by the endpoint

B.The endpoint has multiple variants and the old variant still has a weight assigned

C.The mobile app is using a CDN that caches the predictions

D.The new endpoint configuration has not been deployed to the endpoint

AnswerB

If the old variant has a weight, it will continue to serve traffic. The new variant should get a weight of 1 and the old variant weight should be set to 0.

Why this answer

Option B is correct because if the old endpoint variant still has a weight greater than zero, some traffic will continue to be routed to the old model. Option D is wrong because if the new configuration were not active, the endpoint would not be in service. Option A is wrong because the model artifacts are stored in S3 and are not deleted during updates.

Option C is wrong because CloudFront does not cache SageMaker endpoint responses by default.
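Traffic is split across production variants in proportion to their weights, which is why a leftover weight on the old variant keeps it serving requests. A minimal sketch of that arithmetic, with hypothetical variant names and weights:

```python
def traffic_share(variant_weights):
    """Fraction of requests each production variant receives."""
    total = sum(variant_weights.values())
    return {name: weight / total for name, weight in variant_weights.items()}


# Hypothetical variants: both weighted, then the old one set to zero.
before = traffic_share({"old-model": 1.0, "new-model": 1.0})
after = traffic_share({"old-model": 0.0, "new-model": 1.0})
print(before)
print(after)
```

With equal weights, half of the requests still reach the old model; setting its weight to 0 routes all traffic to the new variant.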

689

MCQhard

A company runs a data lake on Amazon S3 with partitions by year/month/day. A machine learning team needs to read daily data from the last 30 days for model retraining. The data format is Parquet. The team uses Amazon Athena to query the data, but the queries are slow and scanning too much data. The team has already optimized the file sizes and compression. What additional step can reduce the amount of data scanned?

A.Remove the partition structure and store data as single large files.

B.Convert the Parquet files to JSON format for better query performance.

C.Use CSV format with Gzip compression.

D.Add more partition columns such as hour to reduce the scanned partitions.

AnswerD

More granular partitions allow queries to scan fewer files.

Why this answer

Option D is correct because partitioning on additional columns like hour can further prune partitions if queries filter by time ranges. Option B is wrong because converting to JSON would increase data size. Option C is wrong because converting to CSV would also increase size.

Option A is wrong because removing partitions would increase scanning.

690

Matchingmedium

Match each data format to its typical use in AWS ML.

Concepts

Matches

Tabular data for SageMaker built-in algorithms

Efficient binary format for SageMaker

Columnar storage for analytics

Semi-structured data, e.g., for Lambda

TensorFlow training data format

Why these pairings

Different formats are optimized for different tasks.

691

MCQeasy

A data scientist is training a deep learning model on a large dataset using Amazon SageMaker. The training job is taking too long and the scientist wants to reduce the training time by distributing the workload across multiple GPUs. Which SageMaker feature should be used to achieve this?

A.Use SageMaker's distributed training libraries

B.Use Amazon EMR to distribute the training

C.Use SageMaker Automatic Model Tuning

D.Use SageMaker Hyperparameter Tuning

AnswerA

SageMaker provides built-in distributed training libraries that can split the workload across multiple GPUs.

Why this answer

Option A is correct because SageMaker's distributed training libraries enable efficient distribution across multiple GPUs. Option D is wrong because Hyperparameter Tuning is for optimizing hyperparameters, not for distributed training. Option C is wrong because Automatic Model Tuning is for hyperparameter optimization.

Option B is wrong because Amazon EMR is for big data processing, not for deep learning training.

692

MCQhard

A company uses SageMaker Pipelines to automate model retraining. The pipeline fails intermittently at the Preprocess step with a 'ResourceLimitExceeded' error. The team uses an ml.m5.xlarge instance. What is the most likely cause?

A.The account has reached the limit for concurrent ml.m5.xlarge instances

B.The preprocessing script has a memory leak

C.The S3 bucket has insufficient permissions

D.The pipeline execution role is missing the PassRole permission

AnswerA

ResourceLimitExceeded typically means hitting a service limit like concurrent instances.

Why this answer

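The error indicates that a service quota has been reached. SageMaker enforces per-account quotas on the number of instances of each type that can run concurrently, and the failure is intermittent because it occurs only when other jobs are already using the quota. Option A is correct.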

Option B would cause different errors. Option C is unrelated. Option D would cause a SageMaker service error, not a resource limit.

693

MCQhard

A data scientist is using Amazon SageMaker to deploy a custom model container. The model is a large transformer that requires 16 GB of memory. The scientist wants to minimize inference latency. Which SageMaker hosting option should they choose?

A.Use a real-time endpoint with an instance that has sufficient memory.

B.Use an asynchronous inference endpoint.

C.Use SageMaker Serverless Inference.

D.Use a batch transform job.

AnswerA

Real-time endpoints provide low latency and can accommodate large models.

Why this answer

Option A is correct: multi-model endpoints allow multiple models to share resources, but for a single large model, a real-time endpoint with a suitable instance is best. Option C is wrong because serverless inference has memory limits (up to 6 GB) and may cold start. Option D is wrong because batch transform is for offline inference.

Option B is wrong because asynchronous inference introduces latency for processing requests.

694

Multi-Selecteasy

Which TWO metrics are appropriate for evaluating a binary classification model when the cost of false negatives is high? (Choose 2)

Select 2 answers

A.Recall

B.Specificity

C.Accuracy

D.F1 score

E.Precision

AnswersA, B

Recall = TP/(TP+FN), high recall minimizes false negatives.

Why this answer

Option A (Recall) and Option B (Specificity) are correct. Recall measures true positives, and specificity measures true negatives, both relevant when false negatives are costly. Precision (E) and F1 (D) are not directly focused on false negatives.

Accuracy (C) is misleading if the classes are imbalanced.

695

MCQeasy

A machine learning engineer is evaluating a binary classification model. The model has a high recall but low precision. Which of the following is the most likely consequence?

A.The model has many false positives.

B.The model has few false negatives.

C.The model misses many positive cases.

D.The model has few false positives.

AnswerA

Low precision means a high rate of false positives.

Why this answer

High recall means the model correctly identifies most positive cases (few false negatives), but low precision indicates that among the cases predicted as positive, many are actually negative. This directly implies a high number of false positives, as precision = TP/(TP+FP) and a low precision with high recall forces FP to be large relative to TP.

Exam trap

AWS often tests the precision-recall trade-off with questions that invite candidates to confuse false positives with false negatives, leading them to associate high recall with many false positives instead of few false negatives.

How to eliminate wrong answers

Option B is wrong because high recall implies few false negatives (FN is low), so this is a characteristic of the model, not a consequence of low precision. Option C is wrong because high recall means the model does NOT miss many positive cases; it captures most of them. Option D is wrong because low precision is defined by having many false positives, not few; few false positives would yield high precision.
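The two formulas can be checked with a quick calculation. The confusion counts below are hypothetical, chosen so that most positives are caught while many negatives are flagged as positive.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return precision, recall


# Hypothetical counts: 95 true positives, 285 false positives, 5 false negatives.
precision, recall = precision_recall(tp=95, fp=285, fn=5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here recall is 0.95 while precision is 0.25: only 5 positives are missed, but three out of four positive predictions are false alarms.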

696

Multi-Selecteasy

A data scientist is building a classification model and wants to evaluate its performance. Which TWO metrics are appropriate for a multi-class classification problem? (Choose 2)

Select 2 answers

A.Mean Absolute Error (MAE)

B.Recall

C.Precision

D.R-squared

E.Root Mean Square Error (RMSE)

AnswersB, C

Recall can be averaged across classes.

Why this answer

Both precision and recall can be extended to multi-class via micro/macro averaging. R-squared is for regression; RMSE is for regression; Mean Absolute Error is for regression.
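Macro averaging, one of the extensions mentioned above, computes the metric per class and then takes the unweighted mean. A minimal sketch with hypothetical labels and predictions; the helper names are illustrative and not taken from any library:

```python
def per_class_counts(y_true, y_pred, label):
    """Count TP, FP, FN when treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
print(round(macro_p, 3), round(macro_r, 3))
```

Micro averaging would instead pool the TP, FP, and FN counts across classes before dividing, which gives more influence to frequent classes.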

697

MCQmedium

A data scientist is exploring a dataset containing customer transactions. They want to create a feature that captures the average purchase amount per customer over the last 30 days. Which approach is most efficient in Amazon SageMaker Processing?

A.Use Amazon Athena SQL query with GROUP BY

B.Use PySpark with window functions in SageMaker Processing

C.Use a Python script with a for loop to calculate per customer

D.Use pandas groupby and rolling functions

AnswerB

Correct: PySpark window functions are optimized for large-scale grouped rolling aggregates.

Why this answer

Option B is correct because using PySpark in SageMaker Processing with window functions is efficient for grouped time-series aggregations. Option C is wrong because iterating over rows is inefficient in Python. Option A is wrong because SQL in Athena may be simpler but runs outside the SageMaker Processing job.

Option D is wrong because pandas may not scale to large datasets.

698

MCQeasy

A data analyst is exploring a dataset with a target variable that is highly imbalanced. The minority class represents only 1% of the data. Which technique should the analyst use to better understand the relationships between features and the minority class?

A.Apply SMOTE to the dataset before analysis.

B.Use random sampling to reduce the dataset size.

C.Scale the features using Min-Max scaling.

D.Use stratified sampling to create a balanced sample for analysis.

AnswerD

Stratified sampling preserves class proportions.

Why this answer

Option D is correct because stratified sampling ensures the minority class is proportionally represented in the sample, allowing meaningful analysis. Option B is wrong because random sampling may miss the minority class entirely. Option A is wrong because SMOTE is for generating synthetic data, not for exploratory analysis.

Option C is wrong because feature scaling does not address class imbalance.

699

MCQhard

A data engineer is troubleshooting an AWS Glue job that reads from and writes to the S3 bucket 'data-lake-bucket'. The job fails when trying to write to the 'sensitive/' prefix. The IAM policy attached to the Glue job's IAM role is shown in the exhibit. What is the MOST likely reason for the failure?

A.The IAM role does not have permission to read objects from the bucket

B.The IAM role has an explicit deny for s3:PutObject on the 'sensitive/' prefix

C.The IAM policy does not specify the bucket resource correctly

D.The IAM policy lacks a required condition for encryption

AnswerB

The Deny statement blocks write access to the sensitive prefix.

Why this answer

Option B is correct. Even though the first statement allows s3:PutObject on the entire bucket, the second statement explicitly denies s3:PutObject on the 'sensitive/' prefix. Explicit deny overrides any allow.

Option A is wrong because the policy allows GetObject. Option C is wrong because the policy covers the bucket. Option D is wrong because the failure comes from the explicit Deny statement, not from a missing encryption condition.
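The rule that an explicit deny overrides any allow can be sketched with a toy evaluator. The exhibit is not reproduced here, so the two statements below are assumptions that mirror the explanation (a broad Allow plus a narrow Deny); the evaluator is a simplification of IAM policy evaluation and the object keys are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumed statements; the real exhibit is not shown in this document.
POLICY = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
     "Resource": "arn:aws:s3:::data-lake-bucket/*"},
    {"Effect": "Deny", "Action": ["s3:PutObject"],
     "Resource": "arn:aws:s3:::data-lake-bucket/sensitive/*"},
]


def is_allowed(action, resource, policy=POLICY):
    """Simplified evaluation: any matching Deny wins, else any matching Allow."""
    matches = [s for s in policy
               if action in s["Action"] and fnmatchcase(resource, s["Resource"])]
    if any(s["Effect"] == "Deny" for s in matches):
        return False
    return any(s["Effect"] == "Allow" for s in matches)


bucket = "arn:aws:s3:::data-lake-bucket"
write_sensitive = is_allowed("s3:PutObject", f"{bucket}/sensitive/a.csv")
write_other = is_allowed("s3:PutObject", f"{bucket}/raw/a.csv")
read_sensitive = is_allowed("s3:GetObject", f"{bucket}/sensitive/a.csv")
print(write_sensitive, write_other, read_sensitive)
```

Under these assumed statements, writes under `sensitive/` are blocked even though the Allow covers the whole bucket, while reads and writes elsewhere still succeed.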

700

MCQeasy

A machine learning engineer is training a regression model to predict house prices using Amazon SageMaker. The dataset contains 10,000 samples and 50 numerical features. After training a linear regression model, the engineer notices that the training loss is low, but the validation loss is high. The engineer suspects overfitting. The dataset is already normalized. Which action should the engineer take to reduce overfitting?

A.Increase the learning rate to speed up convergence.

B.Reduce the number of features using PCA.

C.Add L2 regularization (weight decay) to the loss function.

D.Decrease the mini-batch size during training.

AnswerC

Correct: L2 regularization penalizes large weights and reduces overfitting.

Why this answer

Option C (add L2 regularization) is correct because it penalizes large weights and reduces overfitting in linear models. Option D (decrease batch size) can introduce noise but doesn't directly regularize. Option A (increase learning rate) may cause divergence.

Option B (reduce feature count by PCA) could help but may lose information; regularization is more direct.
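The shrinking effect of an L2 penalty can be seen in a one-feature example. For the loss sum((y - w*x)^2) + l2 * w^2, setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + l2). The data points and the penalty strength below are hypothetical.

```python
def fit_slope(xs, ys, l2=0.0):
    """Closed-form slope for y ~ w * x minimising sum((y - w*x)^2) + l2 * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
w_plain = fit_slope(xs, ys)           # no regularization
w_ridge = fit_slope(xs, ys, l2=10.0)  # L2 penalty pulls the weight toward zero
print(round(w_plain, 4), round(w_ridge, 4))
```

The penalized weight is smaller in magnitude than the unpenalized one; with 50 features the same pull toward zero limits how closely the model can fit noise in the training set.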

701

MCQmedium

A company uses Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error: "Unable to write to /opt/ml/output/data". The data scientist checks the container and finds that the /opt/ml directory is not writable. What is the MOST likely cause?

A.The Docker image is built from a base image that does not have the required libraries.

B.The container runs as a non-root user that lacks write permissions to /opt/ml.

C.The SageMaker training job is configured with insufficient memory.

D.The training script is not copying the model to /opt/ml/model.

AnswerB

SageMaker mounts volumes as root by default; if the container runs as a different user, it may not have write access.

Why this answer

SageMaker expects the training container to write output to /opt/ml/model and /opt/ml/output. If the container does not have write permissions to these directories, the training fails. The most likely cause is that the Docker image was built with a non-root user that does not have write permissions to /opt/ml.

The correct fix is to ensure the container runs as root or grants write permissions.

702

MCQhard

A data scientist is performing EDA on a time series dataset of daily sales. The data scientist observes a pattern that repeats every 7 days. Which characteristic of the time series is being observed?

A.Stationarity

B.Autocorrelation

C.Seasonality

D.Trend

AnswerC

Seasonality is a periodic pattern with a fixed frequency.

Why this answer

A pattern that repeats at a fixed frequency (every 7 days) is called seasonality. Option D is wrong because trend is a long-term increase or decrease. Option B is wrong because autocorrelation measures correlation with lagged values, not a repeating pattern.

Option A is wrong because stationarity refers to constant mean/variance over time.
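During EDA, a weekly seasonal pattern typically shows up as a peak in the autocorrelation at lag 7. The sketch below uses a hypothetical sales series built by repeating one week of values, so the numbers are illustrative only.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return num / denom


# Hypothetical daily sales: one weekly pattern repeated for 8 weeks.
week = [100, 90, 95, 110, 130, 180, 170]
sales = week * 8
lag7 = autocorrelation(sales, 7)
lag3 = autocorrelation(sales, 3)
print(round(lag7, 3), round(lag3, 3))
```

The value at lag 7 is strongly positive while the value at lag 3 is negative, which is the signature of a 7-day seasonal cycle rather than a trend.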

703

MCQeasy

A company is building a sentiment analysis model for customer reviews. The dataset includes 10,000 positive and 10,000 negative reviews. The data scientist splits the data into 70% training, 15% validation, and 15% test sets. After training, the model achieves 99% accuracy on training set but only 82% on validation set. What is the most likely issue?

A.There is data leakage from validation to training

B.The dataset is imbalanced

C.The model is underfitting

D.The model is overfitting

AnswerD

High training accuracy with significantly lower validation accuracy is a classic sign of overfitting.

Why this answer

Option D is correct because a large gap between training and validation accuracy indicates overfitting. Option B is wrong because the dataset is balanced. Option C is wrong because underfitting would show low training accuracy.

Option A is wrong because data leakage is not indicated by accuracy gap alone.

704

MCQmedium

An IAM policy is attached to a SageMaker execution role. A data scientist tries to create a training job using a custom algorithm stored in an ECR repository. The training job fails with an 'AccessDenied' error when pulling the Docker image from ECR. What is the missing permission?

A.ecr:GetDownloadUrlForLayer and ecr:BatchGetImage on the ECR repository

B.ecr:PutImage on the ECR repository

C.s3:GetObject on the ECR repository

D.sagemaker:CreateTrainingJob on the ECR resource

AnswerA

These permissions are required to pull a Docker image from ECR.

Why this answer

When SageMaker pulls a custom Docker image from ECR during training job creation, the execution role needs permissions to download the image layers. The required actions are ecr:GetDownloadUrlForLayer (to generate pre-signed URLs for each layer) and ecr:BatchGetImage (to retrieve image metadata and layer manifests). Without these, the 'AccessDenied' error occurs because SageMaker cannot authenticate or fetch the container image from the ECR repository.

Exam trap

The trap here is that candidates often confuse ECR pull permissions with S3 permissions (option C) or assume that the SageMaker CreateTrainingJob permission (option D) implicitly covers the ECR pull, when in fact the IAM role must explicitly grant the specific ECR read actions for image retrieval.

How to eliminate wrong answers

Option B is wrong because ecr:PutImage is used to push images into ECR, not to pull them; the training job only needs read access. Option C is wrong because s3:GetObject is an S3 permission, not an ECR permission; ECR uses its own API actions for image retrieval, not S3. Option D is wrong because sagemaker:CreateTrainingJob is a SageMaker API action that allows creating the training job itself, not the ECR pull operation; the error occurs at the ECR layer, not at the SageMaker API level.
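
A minimal sketch of an execution-role policy statement that grants the two pull actions named above, written as a Python dictionary and serialized to JSON. The repository ARN is hypothetical. The second statement (`ecr:GetAuthorizationToken`) is an assumption that goes beyond the question: it is commonly needed so the caller can authenticate to the registry before pulling.

```python
import json

# Hypothetical repository ARN; replace with the real account, region, and name.
REPOSITORY_ARN = "arn:aws:ecr:us-east-1:123456789012:repository/my-training-image"

ecr_pull_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PullTrainingImage",
            "Effect": "Allow",
            "Action": ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage"],
            "Resource": REPOSITORY_ARN,
        },
        {
            # Assumption beyond the question: registry login is usually required
            # too, and this action does not support resource-level scoping.
            "Sid": "EcrLogin",
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
        },
    ],
}

policy_json = json.dumps(ecr_pull_policy, indent=2)
print(policy_json)
```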


705

MCQeasy

During exploratory data analysis, a data scientist plots the distribution of a numerical feature and observes a heavy right skew. The feature has many outliers at the high end. Which transformation is most appropriate to reduce skewness?

A.Apply a log transformation to the feature.

B.Apply z-score normalization.

C.Apply one-hot encoding.

D.Apply min-max scaling.

AnswerA

Log transformation compresses high values and can make the distribution more symmetric.

Why this answer

A log transformation compresses the range of the data, reducing the impact of extreme values and pulling in the long tail of a right-skewed distribution. This makes the feature more normally distributed, which is often required for linear models and many statistical tests. It is the standard technique for handling positive-valued features with heavy right skew.

Exam trap

AWS often tests the distinction between scaling (which changes range) and transformation (which changes distribution shape), so the trap here is that candidates might pick min-max scaling or z-score normalization thinking they handle outliers, but they only rescale without fixing skewness.

How to eliminate wrong answers

Option B is wrong because z-score normalization (standardization) centers the data around zero with unit variance but does not change the shape of the distribution; it will still be skewed. Option C is wrong because one-hot encoding is used for categorical features, not for transforming numerical features to reduce skewness. Option D is wrong because min-max scaling rescales the feature to a fixed range (e.g., [0,1]) but does not alter the distribution's skewness; outliers remain outliers in the scaled range.
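
The difference between rescaling and transforming can be checked numerically. The sketch below, on a made-up right-skewed feature, computes skewness for the raw values, the z-scored values, and the log-transformed values. The `skewness` helper and the sample data are assumptions for illustration; `log1p` is used so that a value of 0 would still be defined.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Hypothetical right-skewed feature: mostly small values plus a few large ones.
feature = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 150, 900]

mean, std = statistics.fmean(feature), statistics.pstdev(feature)
z_scored = [(v - mean) / std for v in feature]   # rescales, keeps the shape
log_scaled = [math.log1p(v) for v in feature]    # log(1 + v), changes the shape

raw_skew = skewness(feature)
z_skew = skewness(z_scored)
log_skew = skewness(log_scaled)
print(f"raw={raw_skew:.2f}  z-score={z_skew:.2f}  log1p={log_skew:.2f}")
```

Z-scoring leaves the skewness unchanged because it is a linear rescaling, while the log transform pulls in the long tail.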


706

MCQmedium

A company is performing EDA on a dataset with 10,000 rows and 200 columns. They run a correlation matrix and find many high correlations (|r| > 0.9). What is the best approach to address multicollinearity before modeling?

A.Standardize all features

B.Calculate Variance Inflation Factor (VIF) and remove features with VIF > 10

C.Use Lasso regression with cross-validation

D.Apply Principal Component Analysis (PCA) to all features

AnswerB

VIF identifies highly correlated features for removal.

Why this answer

Option B is correct because VIF measures how strongly each feature is explained by the other features; removing features with high VIF reduces multicollinearity. Option A is wrong because scaling does not remove correlation. Option D is wrong because PCA creates orthogonal components but reduces interpretability.

Option C is wrong because Lasso can handle multicollinearity but may not be the best EDA step.


707

MCQhard

A company is using Amazon Forecast for demand forecasting. The data includes time series data for multiple items. The company wants to ensure that the forecast is updated daily as new data arrives. Which approach should be used to automate this process?

A.Use AWS Lambda to invoke the Forecast CreateDatasetImportJob and CreatePredictorBacktestExportJob APIs on a daily schedule triggered by Amazon CloudWatch Events.

B.Use Amazon Kinesis Data Streams to stream new data directly into Forecast.

C.Use Amazon SageMaker to retrain the model daily and replace the forecast endpoint.

D.Enable the 'AutoPredictor' feature in Forecast to automatically update the predictor when new data arrives.

AnswerA

This automates data import and retraining.

Why this answer

Option A is correct because an Amazon Forecast predictor can be updated with new data by calling the CreateDatasetImportJob API and then retraining the predictor, and a scheduled Lambda function automates those calls. Option B is wrong because Forecast does not support real-time streaming ingestion. Option C is wrong because it replaces Forecast with a custom SageMaker workflow instead of automating the existing Forecast pipeline.

Option D is wrong because Forecast does not update a predictor automatically when new data arrives; there is no built-in scheduler, so custom automation is needed.


708

MCQeasy

A data scientist is analyzing a dataset with a timestamp column. The goal is to identify seasonality and trends. Which visualization technique is most suitable?

A.Time series line plot of the target variable over time.

B.Box plot of the target variable grouped by day of week.

C.Scatter plot of the target variable vs. the timestamp.

D.Heatmap of correlation between all features.

AnswerA

Line plots are standard for time series data.

Why this answer

Option A is correct because a time series line plot is the standard way to visualize trends and seasonality over time. Option B (box plot) shows the distribution per group, Option C (scatter plot) is meant for relating two numerical variables, and Option D (heatmap) shows correlation between features.


709

MCQhard

A data engineer needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3 for ML training. The data is currently stored in HDFS and is compressible. The network bandwidth between the on-premises data center and AWS is 1 Gbps. The team needs to minimize the time to transfer and also wants to avoid any downtime for the on-premises system. Which solution meets these requirements?

A.Set up an AWS Direct Connect connection and use rsync to copy data to S3.

B.Enable S3 Transfer Acceleration on the bucket and use the AWS CLI to copy data.

C.Install the AWS DataSync agent on-premises, configure a task to transfer data to S3 with compression enabled.

D.Use AWS Snowball Edge devices to export the data and ship them to AWS.

AnswerC

DataSync is optimized for large data transfers with compression and parallelization.

Why this answer

Option C is correct because AWS DataSync can transfer data over the network efficiently using parallel streams and compression, and its agent can be installed on-premises without taking the cluster offline. Option D is wrong because Snowball Edge would require shipping, which takes longer and may not be faster than an optimized network transfer. Option A is wrong because Direct Connect provides a dedicated network connection but does not include the data transfer software; DataSync works over Direct Connect.

Option B is wrong because S3 Transfer Acceleration improves speed over the public internet but may not be as fast as DataSync with compression.
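
A back-of-the-envelope estimate helps explain why a network transfer is viable here. The sketch below assumes decimal units (1 TB = 10^12 bytes) and treats link utilization and compression ratio as hypothetical knobs; neither figure comes from the question.

```python
def transfer_hours(size_tb, link_gbps, utilization=1.0, compression_ratio=1.0):
    """Estimated hours to move `size_tb` terabytes over a `link_gbps` link.

    utilization: fraction of the link actually available (assumed).
    compression_ratio: original size divided by compressed size (assumed).
    """
    bits_on_wire = size_tb * 1e12 * 8 / compression_ratio
    seconds = bits_on_wire / (link_gbps * 1e9 * utilization)
    return seconds / 3600

print(f"ideal link:          {transfer_hours(10, 1):.1f} h")
print(f"80% utilization:     {transfer_hours(10, 1, utilization=0.8):.1f} h")
print(f"2:1 compression too: {transfer_hours(10, 1, 0.8, 2.0):.1f} h")
```

At full line rate, 10 TB over 1 Gbps is 80,000 seconds, or a little over 22 hours, which is well under the multi-day turnaround of shipping a physical device.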


710

MCQeasy

During training, a binary classification model has an AUC of 0.99 on the training set but only 0.72 on the validation set. Which of the following is the most likely cause?

A.Class imbalance in the training set.

B.Underfitting.

C.Overfitting.

D.Data leakage from validation to training.

AnswerC

Overfitting results in high training but lower validation AUC.

Why this answer

Option C is correct because a large gap between training and validation performance indicates overfitting. Option B is wrong because underfitting would show poor performance on both sets. Option D is wrong because data leakage would inflate both metrics.

Option A is wrong because class imbalance would affect both sets similarly.


711

MCQhard

A company is using Amazon SageMaker to deploy a model for real-time inference. The endpoint receives variable traffic and the company wants to optimize cost while maintaining responsiveness. Which scaling policy should be used?

A.Target tracking scaling based on invocation count

B.Simple scaling with a cooldown period

C.Scheduled scaling

D.Manual scaling

AnswerA

Automatically adjusts to traffic.

Why this answer

Option A is correct because target tracking scaling based on invocation count is the best approach for variable traffic, as it adjusts capacity to demand. Option D is wrong because manual scaling requires constant monitoring. Option B is wrong because simple scaling with fixed steps may not adapt well.

Option C is wrong because scheduled scaling is for predictable traffic.
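
A minimal sketch of what such a target tracking policy could look like, expressed as the keyword arguments for Application Auto Scaling's `put_scaling_policy` call. The endpoint name, variant name, target value, and cooldowns are hypothetical; only the dictionary is built here, so nothing is sent to AWS.

```python
def target_tracking_policy(endpoint_name, variant_name, invocations_per_instance):
    """Build put_scaling_policy arguments for a SageMaker endpoint variant."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Hypothetical target: average invocations per instance per minute.
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds; assumed values
            "ScaleOutCooldown": 60,
        },
    }

policy = target_tracking_policy("fraud-endpoint", "AllTraffic", 70)
print(policy["ResourceId"])
```

In practice the variant would first be registered as a scalable target, and a dictionary like this would then be passed to the Application Auto Scaling client, for example `client.put_scaling_policy(**policy)`.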


712

MCQeasy

A data scientist is building a regression model to predict house prices. The dataset contains features like 'number_of_rooms' (integer), 'sqft' (float), 'location' (categorical with 1000 unique values). Which feature engineering approach is BEST for the 'location' feature?

A.Remove the feature

B.Target encoding

C.One-hot encoding

D.Label encoding

AnswerB

Target encoding uses mean target per category, good for high cardinality.

Why this answer

Target encoding is the best approach for the 'location' feature because it has 1,000 unique categories, making one-hot encoding infeasible (would create 1,000 dummy columns) and label encoding inappropriate (imposes arbitrary ordinal relationships). Target encoding replaces each category with the mean of the target variable (house price) for that category, capturing the predictive signal of location while keeping the feature as a single numeric column. This balances model performance with dimensionality and avoids overfitting when regularized (e.g., with smoothing or cross-validation).

Exam trap

AWS often tests the trade-off between cardinality and encoding methods, and the trap here is that candidates default to one-hot encoding as the 'standard' categorical encoding without considering the practical infeasibility of high cardinality, or they choose label encoding thinking it is a simple numeric mapping, ignoring the ordinal assumption violation.

How to eliminate wrong answers

Option A is wrong because removing the 'location' feature discards a highly predictive signal — house prices are strongly influenced by location, and a model without it would likely underfit. Option C is wrong because one-hot encoding with 1,000 unique categories would create 999 dummy variables, drastically increasing dimensionality, memory usage, and risk of the curse of dimensionality, especially in regression models. Option D is wrong because label encoding assigns arbitrary integer labels (e.g., 1, 2, 3) to categories, implying an ordinal relationship that does not exist for location, which can mislead linear regression models into treating distant locations as numerically similar.
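
A minimal sketch of target encoding with smoothing, using made-up locations and prices. The function name, the `smoothing` parameter, and the toy data are assumptions for illustration. In a real pipeline the mapping would be fitted on training folds only, so that a row's own target does not leak into its encoded value.

```python
from collections import defaultdict
from statistics import fmean

def smoothed_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a blend of its own target mean and the global mean.

    encoded = (category_sum + smoothing * global_mean) / (count + smoothing)
    Categories with few rows are pulled toward the global mean.
    """
    global_mean = fmean(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }

# Hypothetical toy data: location labels and house prices (in thousands).
locations = ["north", "north", "north", "south", "south", "east"]
prices = [300.0, 320.0, 310.0, 200.0, 220.0, 900.0]

encoding = smoothed_target_encoding(locations, prices, smoothing=2.0)
encoded_column = [encoding[location] for location in locations]
print(encoding)
```

The single-row category ("east") is shrunk from its raw mean of 900 toward the global mean of 375, which is the regularizing effect that smoothing provides.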


713

MCQhard

A data engineering team is building a real-time data pipeline using Amazon Kinesis Data Streams with AWS Lambda for processing. The pipeline ingests clickstream data from a mobile app. The team notices that occasionally, a Lambda function fails due to a transient error, and the failed record is not retried, leading to data loss. The Lambda function is configured with a batch size of 100 and a maximum retry count of 0. The team wants to ensure that all records are processed successfully, even if transient failures occur. They also want to minimize the impact of poison pill records that could block processing. Which combination of actions should the team take to address this issue?

A.Set the maximum retry count to 5 and configure a dead-letter queue on the Lambda function to capture failed records after retries.

B.Switch to using Amazon Kinesis Data Firehose to buffer data and use AWS Lambda for transformation with built-in retry logic.

C.Set the maximum retry count to 5, configure an on-failure destination Amazon SQS queue, and set up a dead-letter queue on that SQS queue for poison pills.

D.Reduce the batch size to 1 and increase the Lambda function timeout to handle transient errors.

AnswerC

This provides retries and isolates poison pills without blocking the main stream.

Why this answer

Option C is correct because raising the maximum retry count lets Lambda retry batches that fail on transient errors, and routing batches that still fail to an on-failure SQS destination (with its own dead-letter queue for poison pills) isolates problematic records for separate analysis while the main processing continues. Option A is incorrect because a dead-letter queue configured on the Lambda function applies to asynchronous invocations, not to Kinesis event source mappings, so batches that exhaust their retries would not be captured. Option D is incorrect because reducing the batch size may help but does not solve the retry or poison pill problem.

Option B is incorrect because Kinesis Data Firehose is not suitable for real-time per-record processing.
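
A minimal sketch of the event source mapping settings described above, built as the keyword arguments for Lambda's `create_event_source_mapping` call. The ARNs and function name are hypothetical. `BisectBatchOnFunctionError` is an addition that the question does not mention: it splits a failing batch in half on retry, which can help narrow down a poison pill record.

```python
def build_event_source_mapping(stream_arn, function_name, failure_queue_arn):
    """Keyword arguments for a Kinesis-to-Lambda event source mapping."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": 100,
        # Retry transient failures instead of dropping the batch.
        "MaximumRetryAttempts": 5,
        # Assumption beyond the question: halve the batch on each retry.
        "BisectBatchOnFunctionError": True,
        # Send details of batches that exhaust their retries to SQS.
        "DestinationConfig": {"OnFailure": {"Destination": failure_queue_arn}},
    }

params = build_event_source_mapping(
    "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",  # hypothetical
    "process-clickstream",                                         # hypothetical
    "arn:aws:sqs:us-east-1:123456789012:clickstream-failures",     # hypothetical
)
print(params["DestinationConfig"])
```

A dictionary like this could be passed to the Lambda client as `client.create_event_source_mapping(**params)`; the dead-letter queue for the failure queue itself is configured separately, on the SQS side.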


714

MCQeasy

A company wants to use Amazon SageMaker to train a model using data that is updated daily. The training data is stored in an S3 bucket, and the team wants to automate the training process whenever new data arrives. Which AWS service should be used to trigger the SageMaker training job?

A.AWS Lambda triggered by S3 event notifications

B.Amazon CloudWatch Events

C.Amazon Simple Queue Service (SQS)

D.AWS Step Functions with a scheduled trigger

AnswerA

Lambda can be triggered by S3 events to start the training job.

Why this answer

Option A is correct because S3 event notifications can trigger a Lambda function, which can then start the SageMaker training job. Option B is wrong because CloudWatch Events can run on a schedule but, as described, is not directly triggered by S3 object creation. Option D is wrong because Step Functions can orchestrate the workflow, but a scheduled trigger does not react to new data arriving; an S3 event is still needed to start it.

Option C is wrong because SQS is a queue service; it cannot start a training job from S3 events without an additional consumer such as Lambda.


715

MCQeasy

A data scientist is training a random forest model on a dataset with 50 features. After training, the model achieves 98% accuracy on the training set but only 85% on the test set. Which technique is most appropriate to reduce the generalization error?

A.Apply Principal Component Analysis (PCA) to reduce dimensionality

B.Add more training data

C.Increase the number of trees in the forest

D.Reduce the maximum depth of each tree

AnswerD

Shallow trees are simpler and less likely to overfit, thus improving test accuracy.

Why this answer

The gap between training accuracy (98%) and test accuracy (85%) indicates overfitting. A random forest can overfit when its individual trees are grown too deep. Reducing the maximum depth of each tree limits model complexity and helps generalization.

Increasing the number of trees typically stabilizes predictions but does not address overly deep trees and increases computational cost, so reducing depth is more direct. Feature selection or PCA might help but are less direct than controlling tree complexity.


716

MCQhard

A financial services company is building a fraud detection model that requires joining real-time transaction data with a reference dataset of known fraudulent accounts stored in Amazon DynamoDB. The solution must minimize latency and be highly available. The reference dataset is updated frequently (every few minutes). Which architecture should the team use?

A.Use Amazon Athena to query the DynamoDB table and join with streaming data.

B.Use Amazon Kinesis Data Analytics to process the stream and join with a DynamoDB table.

C.Use AWS Glue streaming ETL to read from Kinesis and join with DynamoDB.

D.Use Amazon SageMaker to host a model that queries DynamoDB for each inference.

AnswerB

Kinesis Data Analytics supports real-time joins with DynamoDB using reference data.

Why this answer

Option B is correct because Kinesis Data Analytics can perform real-time SQL joins with a DynamoDB table using the reference data feature, providing low latency. Option C is wrong because Glue streaming ETL processes data in micro-batches, which adds latency. Option D is wrong because SageMaker hosting serves model inference; it is not a stream processing service.

Option A is wrong because Athena is for querying data in S3, not real-time streaming.


717

MCQeasy

Refer to the exhibit. The log shows the end of a successful SageMaker training job. However, the ML engineer cannot find the model artifacts in the specified S3 bucket. What is the most likely cause?

A.The IAM role used by the training job does not have permission to write to the S3 bucket.

B.The S3 bucket does not exist.

C.The model artifacts were uploaded to a different S3 path.

D.The training job did not have network access to S3.

AnswerA

Without s3:PutObject, the upload fails.

Why this answer

The training job completed successfully, meaning the SageMaker training container executed without errors. However, if the model artifacts are not found in the specified S3 bucket, the most likely cause is that the IAM role associated with the training job lacks the necessary s3:PutObject permission for that bucket. SageMaker uses the role's credentials to write the output; without write access, the artifacts are silently dropped or fail to upload, even though the training code itself may have run to completion.

Exam trap

AWS often tests the misconception that a successful training job log implies the model artifacts were successfully uploaded, when in fact the IAM role permissions are the gatekeeper for S3 write operations, and a missing permission can cause silent failures.

How to eliminate wrong answers

Option B is wrong because if the S3 bucket did not exist, SageMaker would raise a bucket-not-found error during the training job initialization, and the job would fail, not complete successfully. Option C is wrong because the model artifacts are written to the exact S3 path specified in the OutputDataConfig parameter of the training job; SageMaker does not randomly choose a different path. Option D is wrong because if the training job lacked network access to S3, it would fail to download training data or upload output, resulting in a job failure, not a successful completion with missing artifacts.


718

MCQhard

A data scientist is trying to read a CSV file from S3 bucket 'my-bucket' with key 'training/data.csv' using an IAM role with the attached policy shown in the exhibit. The read operation fails with an Access Denied error. What is the most likely cause?

A.The policy does not include the s3:ListBucket permission, which is required to access the object.

B.The object is encrypted with SSE-KMS and the role does not have kms:Decrypt permission.

C.The resource ARN in the first statement should be 'arn:aws:s3:::my-bucket/training' without the wildcard.

D.The policy explicitly denies s3:GetObject because of the second statement with the trailing slash.

AnswerA

To read an S3 object, the principal needs both s3:GetObject on the object and s3:ListBucket on the bucket (or at least the bucket-level permission to allow access). The policy only grants object-level permissions, not bucket-level ListBucket.

Why this answer

The s3:GetObject permission alone is insufficient to read an object from S3 when the request is made via the AWS Console or certain SDK operations that first list the bucket's contents. The s3:ListBucket permission is required for the ListObjects API call, which is often implicitly invoked to resolve the object key path. Without it, the read operation fails with an Access Denied error even if the GetObject permission is granted.

Exam trap

AWS often tests the subtle distinction between object-level permissions (GetObject) and bucket-level permissions (ListBucket), where candidates mistakenly assume that granting GetObject alone is sufficient for all read operations, ignoring that many S3 interactions implicitly require ListBucket to resolve the object path.

How to eliminate wrong answers

Option B is wrong because the question does not mention any encryption settings on the object, and the error is Access Denied, not a KMS-related permission error (which would typically return a 400 Bad Request with a KMS-specific message). Option C is wrong because the resource ARN 'arn:aws:s3:::my-bucket/training/*' correctly grants access to all objects under the 'training/' prefix; removing the wildcard would restrict access to a single object named 'training' (without a trailing slash), which is not the intended scope. Option D is wrong because the second statement with a trailing slash ('arn:aws:s3:::my-bucket/training/') does not explicitly deny s3:GetObject; it only grants s3:GetObject on objects with keys starting with 'training/' (the trailing slash is part of the prefix pattern, not a denial).


719

Multi-Selectmedium

Which THREE of the following are valid techniques for detecting outliers in a dataset during exploratory data analysis? (Select THREE.)

Select 3 answers

A.Z-score method: flag points with absolute Z-score > 3.

B.Linear regression residuals.

C.Isolation Forest algorithm.

D.K-means clustering.

E.Interquartile Range (IQR) method: flag points outside 1.5*IQR from quartiles.

AnswersA, C, E

Z-score is a standard outlier detection technique.

Why this answer

Z-score (Option A), Isolation Forest (Option C), and IQR (Option E) are all common outlier detection methods. Option B (linear regression residuals) is a modeling diagnostic rather than a standard EDA outlier detection technique. Option D (K-means) can be used for clustering but is not primarily an outlier detection method.
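
The two rule-based methods can be sketched in a few lines. The helpers and the sample data below are assumptions for illustration; the thresholds (|z| > 3 and 1.5 × IQR) are the ones stated in the options.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements: twenty ordinary readings and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10,
            11, 9, 10, 12, 10, 11, 9, 10, 10, 11, 100]

print("z-score:", zscore_outliers(readings))
print("IQR:    ", iqr_outliers(readings))
```

One caveat on the z-score rule: the extreme value inflates the mean and standard deviation it is compared against, so on very small samples a true outlier may not reach |z| > 3. The IQR rule is less sensitive to that effect.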


720

MCQhard

A company is deploying a machine learning model for real-time fraud detection. The model must have low latency (under 100 ms) and high throughput. The model is an ensemble of 5 gradient boosted trees (XGBoost), each 200 MB. Which deployment strategy is MOST suitable?

A.Use AWS Lambda to invoke each model sequentially.

B.Deploy each model as a separate SageMaker endpoint and use a load balancer.

C.Deploy the ensemble on a single GPU instance with large batch processing.

D.Use SageMaker multi-model endpoint on a compute-optimized instance.

AnswerD

Multi-model endpoints reduce overhead and scale well.

Why this answer

Option D is correct because a single multi-model endpoint hosts all five models on shared instances, which reduces overhead while providing scalability and low latency. Option A is wrong because Lambda cold starts and sequential invocation of five models add latency. Option B is wrong because five separate endpoints multiply hosting cost and operational overhead.

Option C is wrong because large batch processing trades latency for throughput, which conflicts with the sub-100 ms requirement.


721

MCQeasy

A data scientist is reviewing the training logs from a SageMaker training job. The logs show training and validation loss per epoch. Based on the exhibited logs, which statement is correct?

A.The model is not learning because the loss is not decreasing

B.The model is underfitting because both losses are high

C.The model is performing well because validation loss is stable

D.The model is overfitting because training loss decreases while validation loss does not

AnswerD

Classic overfitting: training loss improves, validation loss stagnates.

Why this answer

Option D is correct because the training logs show a classic overfitting pattern: training loss consistently decreases across epochs, indicating the model is memorizing the training data, while validation loss does not decrease (or may even increase), indicating poor generalization to unseen data. In SageMaker, monitoring both losses during training is critical to detect overfitting early, often prompting regularization or early stopping.

Exam trap

The trap here is that candidates see decreasing training loss and assume the model is learning well, ignoring the validation loss plateau or increase, which is the hallmark of overfitting.

How to eliminate wrong answers

Option A is wrong because the loss is decreasing (training loss goes down), so the model is learning; the issue is not lack of learning but divergence between training and validation performance. Option B is wrong because underfitting would show both training and validation losses remaining high and not decreasing, whereas here training loss decreases significantly. Option C is wrong because a stable validation loss alone does not indicate good performance if training loss is decreasing while validation loss is not improving—this divergence signals overfitting, not good generalization.


722

MCQhard

A data engineer runs the CLI command to download an object from S3. The bucket owner is 123456789012, and the engineer's IAM user has s3:GetObject permission on the bucket. The object was uploaded by a different AWS account. What is the MOST likely reason for the AccessDenied error?

A.The --expected-bucket-owner parameter is incorrect

B.The object is owned by a different AWS account, and the bucket owner has not been granted access

C.The bucket policy denies access to the engineer's IAM user

D.The IAM policy does not allow s3:GetObject for that specific key

AnswerB

Object ACLs or bucket policy must grant access to bucket owner.

Why this answer

By default, objects uploaded by another account are owned by the uploading account, and the bucket owner does not have access unless explicitly granted via bucket policy or ACL. The --expected-bucket-owner parameter only checks the bucket owner, not the object. The engineer's permissions are on the bucket, but the object is owned by another account, so the bucket owner needs additional permissions.


723

Multi-Selecteasy

A data engineer is building a data pipeline for a machine learning project using Amazon SageMaker. The raw data is stored in Amazon S3. Which TWO steps are essential to ensure data privacy and security before training? (Choose TWO.)

Select 2 answers

A.Create a bucket policy that restricts access to the data scientist's IAM role only

B.Enable versioning on the S3 bucket

C.Encrypt the data at rest using S3 server-side encryption

D.Use S3 Transfer Acceleration for faster uploads

E.Use Amazon SageMaker in a VPC and configure VPC endpoints to access S3 securely

AnswersC, E

Encryption protects data at rest.

Why this answer

Options C and E are correct because encrypting data at rest and keeping traffic between SageMaker and S3 on a private network path are essential. Option B (versioning) is optional and does not protect privacy. Option A is for access control, but encryption is more fundamental.

Option D is not a security measure.


724

MCQeasy

A data scientist is training a linear regression model on a dataset with 100 features. The model shows high variance on the test set. Which action is MOST likely to reduce overfitting?

A.Use a more complex model like XGBoost

B.Increase the number of training iterations

C.Apply L2 regularization (Ridge regression)

D.Add more feature engineering to increase model complexity

AnswerC

L2 regularization penalizes large coefficients, reducing overfitting.

Why this answer

Option C is correct because adding L2 regularization (ridge regression) penalizes large coefficients and reduces model complexity, which helps with overfitting. Option D (adding more features) would increase variance. Option B (increasing training iterations) doesn't address overfitting.

Option A (using a more complex model) would worsen overfitting.
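
The shrinking effect of an L2 penalty can be seen in the simplest possible case: one feature and no intercept, where ridge regression has a one-line closed form. The `ridge_slope` helper and the toy data are assumptions for illustration, not a full ridge implementation.

```python
def ridge_slope(xs, ys, l2=0.0):
    """Slope w that minimizes sum((y - w*x)^2) + l2 * w^2 (no intercept).

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + l2).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + l2
    return numerator / denominator

# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for penalty in (0.0, 1.0, 14.0):
    print(f"l2={penalty:>4}: w={ridge_slope(xs, ys, penalty):.3f}")
```

With no penalty the slope is the ordinary least squares value; as the penalty grows the coefficient is pulled toward zero, which is how ridge trades a little bias for lower variance.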


725

MCQhard

A data scientist is granted the IAM policy shown in the exhibit. The data scientist can query the 'data-lake-bucket' using Athena and get results. However, when the data scientist tries to run a CTAS (CREATE TABLE AS SELECT) query in Athena to write results to a new S3 location, the query fails. What is the most likely reason?

A.The policy does not grant athena:CreateTable permission.

B.The policy does not grant s3:PutObject permission on the bucket.

C.The policy does not grant permissions to the Glue Data Catalog.

D.The policy uses a wildcard for Athena actions, which is not allowed.

AnswerB

CTAS queries write output to S3, requiring s3:PutObject.

Why this answer

Option B is correct because the policy allows s3:GetObject and s3:ListBucket, but not s3:PutObject, which is required for CTAS queries. Option A is wrong because the policy uses resource-level permissions for S3. Option C is wrong because Athena does not require Glue Data Catalog permissions for CTAS if the table metadata is already stored.

Option D is wrong because the policy does not restrict Athena resource ARNs.


726

MCQmedium

During EDA, a data scientist finds that a numeric feature has many outliers. The feature will be used in a linear regression model. Which approach should the scientist take to handle the outliers?

A.Remove all rows with outlier values.

B.Apply a logarithmic transformation to the feature.

C.Standardize the feature using Z-score normalization.

D.Cap the feature values at the 1st and 99th percentiles.

AnswerD

Capping limits extreme values, reducing their influence.

Why this answer

Option D is correct because capping winsorizes outliers, reducing their impact while retaining every row. Option A is wrong because removing all outliers may lose information. Option B is wrong because a log transform reduces skew but does not remove outliers.

Option C is wrong because scaling does not mitigate outlier influence.
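
A minimal sketch of percentile capping (winsorizing) with the standard library. The `winsorize` helper and the sample data are assumptions for illustration; the percentile cut points are computed with the "inclusive" method so that they stay within the observed range.

```python
import statistics

def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the [lower_pct, upper_pct] percentile range."""
    # quantiles(n=100) returns 99 cut points: index 0 is the 1st percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical feature: values 1..100 plus one extreme measurement.
feature = list(range(1, 101)) + [10_000]
capped = winsorize(feature)

print("max before:", max(feature), " max after:", max(capped))
print("rows before:", len(feature), " rows after:", len(capped))
```

No rows are dropped; only the extreme values are pulled in to the chosen percentiles, which limits their leverage on a linear regression fit.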


727

MCQhard

A data engineer is designing a data lake on Amazon S3 that must support both batch and streaming analytics. The data comes in Parquet format and needs to be queryable by Amazon Athena. Which partitioning strategy will optimize query performance and reduce costs?

A.Partition by date and hour for time-based queries

B.Store data as CSV without partitioning for simplicity

C.Partition by device_id for granular access

D.Use a single partition for all data to simplify management

AnswerA

Common query patterns are time-filtered; this reduces data scanned.

Why this answer

Partitioning by date and hour allows Athena to prune partitions effectively for time-based queries, reducing data scanned. Option D is wrong because a single partition is not efficient. Option C is wrong because partitioning by a high-cardinality column like device_id creates many small partitions.

Option B is wrong because using CSV negates the benefits of columnar storage.
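
A minimal sketch of how objects might be laid out so that Athena can prune by date and hour, using Hive-style `key=value` path segments. The prefix, the partition key names (`dt`, `hour`), and the file name are hypothetical.

```python
from datetime import datetime, timezone

def partitioned_key(prefix, event_time, filename):
    """Build a Hive-style S3 key partitioned by date and hour."""
    return f"{prefix}/dt={event_time:%Y-%m-%d}/hour={event_time:%H}/{filename}"

event_time = datetime(2024, 3, 5, 7, 30, tzinfo=timezone.utc)
key = partitioned_key("events", event_time, "part-0000.parquet")
print(key)
```

With a table partitioned on these two keys, a query that filters on a single `dt` and `hour` only needs to read the objects under that one prefix rather than the whole dataset.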


728

Multi-Selecthard

A data scientist is analyzing a dataset with missing values. Which THREE methods are appropriate for handling missing data during EDA and preprocessing?

Select 3 answers

A.Remove rows with any missing values

B.Impute missing values with the mean of the column

C.Replace missing values with 0

D.Ignore missing values and proceed with modeling

E.Impute missing values with the median of the column

AnswersA, B, E

Listwise deletion is acceptable if missing is MCAR and few rows.

Why this answer

Option A (remove rows with missing values) is valid if the values are missing at random and affect few rows. Option B (impute with mean) is common for numeric data. Option E (impute with median) is robust to outliers.

Option C is wrong because using a constant 0 can introduce bias. Option D is wrong because ignoring missing values in models causes errors.
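
The difference between mean and median imputation is easiest to see on a column that contains an outlier. The `impute` helper and the sample column below are assumptions for illustration, with `None` standing in for a missing value.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical column with one missing entry and one extreme value.
column = [1.0, 2.0, None, 3.0, 100.0]

print("mean:  ", impute(column, "mean"))
print("median:", impute(column, "median"))
```

The mean is dragged upward by the extreme value, so the imputed entry looks nothing like a typical row, while the median stays close to the bulk of the data.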


729

MCQeasy

A machine learning team is using Amazon SageMaker to build a regression model. The target variable is heavily right-skewed with a long tail. Which data transformation should the team apply to the target variable before training?

A.One-hot encoding

B.Min-max scaling

C.Log transformation

D.Standardization (z-score)

AnswerC

Log transform reduces right skew and makes distribution more normal.

Why this answer

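A log transformation compresses the range of the target and makes the distribution more symmetric, which often improves regression performance. Predictions must then be converted back to the original scale with the inverse (exponential) transform.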
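
The effect can be checked on a toy example. The sketch below, using only the Python standard library, computes the skewness of a hypothetical right-skewed target before and after a `log1p` transform; the sample values and the `skewness` helper are assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: mean cubed z-score (0 for a symmetric distribution)."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical right-skewed target, e.g. prices with a long tail.
target = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# log1p(x) = log(1 + x) handles zeros safely; expm1 is its inverse.
log_target = [math.log1p(v) for v in target]

raw_skew = skewness(target)
log_skew = skewness(log_target)

# Map transformed values (or model predictions) back to the original scale.
recovered = [math.expm1(v) for v in log_target]
```

On this sample the skewness drops after the transform, and `expm1` recovers the original values.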

Full explanation →

730

MCQeasy

A data scientist is analyzing a dataset with numerical features and a binary target variable. The data scientist creates a pairplot and notices that one feature has a bimodal distribution when colored by the target class. What does this observation suggest?

A.The feature is irrelevant and should be removed.

B.The feature is likely predictive of the target.

C.The feature contains outliers that need to be removed.

D.The feature has missing values that need to be imputed.

AnswerB

Different distributions for each class indicate the feature can separate the classes.

Why this answer

Option B is correct because a bimodal distribution whose modes separate by class indicates the feature can help distinguish between classes. Option D is wrong because bimodality does not imply missing values. Option A is wrong because the observation suggests the feature is useful, not irrelevant.

Option C is wrong because bimodality is not an indication of outliers.

Full explanation →

731

Multi-Selecteasy

Which TWO AWS services can be used to deploy a machine learning model for serverless inference? (Choose 2.)

Select 2 answers

A.Amazon SageMaker Serverless Inference

B.AWS Lambda

C.Amazon EMR

D.Amazon ECS with Fargate

E.AWS Batch

AnswersA, B

Serverless inference option.

Why this answer

SageMaker Serverless Inference (Option A) and AWS Lambda (Option B) both support serverless inference. Option D (Amazon ECS with Fargate) still requires defining and managing clusters, services, and tasks. Option C (Amazon EMR) is for big data processing.

Option E (AWS Batch) is for batch computing.

Full explanation →

732

MCQeasy

A company uses Amazon Kinesis Data Streams to collect clickstream data. The data is consumed by a Lambda function that writes to DynamoDB. Occasionally, the Lambda function fails due to throttling from DynamoDB. How can the company resolve this issue without losing data?

A.Ignore the throttling errors and let Lambda retry.

B.Increase the number of shards in the Kinesis stream.

C.Use an Amazon SQS queue as a buffer between Kinesis and Lambda.

D.Decrease the batch size in the Lambda event source mapping.

AnswerD

Smaller batches reduce the write rate, avoiding throttling.

Why this answer

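Decreasing the batch size reduces the number of records per invocation, lowering the write load on DynamoDB. Increasing shards (Option B) would increase parallelism, potentially worsening throttling. Using SQS (Option C) would add latency.

Ignoring the throttling errors (Option A) leaves the write rate unchanged, so failures continue and records can be lost once they expire from the stream.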

Full explanation →

733

MCQmedium

A team is deploying a machine learning model to production using Amazon SageMaker. They want to automatically scale the endpoint based on the incoming request volume, and they also need to ensure that the endpoint can handle sudden bursts of traffic without dropping requests. Which scaling policy should they use?

A.Scheduled scaling policy for peak hours

B.Target tracking scaling policy based on the number of invocations

C.Simple scaling policy based on average latency

D.Manual scaling by monitoring CloudWatch alarms

AnswerB

Target tracking automatically adjusts capacity to maintain a target metric and can handle bursts.

Why this answer

Option B is correct because a target tracking scaling policy adjusts capacity automatically to hold the chosen metric at a specified target value, and it can absorb bursts by adding instances as soon as the metric rises above that target. Option C is wrong because a simple scaling policy based on average latency may not react to bursts quickly. Option A is wrong because a scheduled scaling policy is for predictable traffic patterns.

Option D is wrong because manual scaling is not automatic.
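
For reference, a target tracking policy for a SageMaker endpoint variant is defined through Application Auto Scaling. The sketch below only builds the request parameters as a plain dictionary. The endpoint name, variant name, target value, and cooldowns are hypothetical, and the helper `target_tracking_policy` is not part of any AWS SDK.

```python
def target_tracking_policy(endpoint_name, variant_name, invocations_per_instance):
    """Build keyword arguments for Application Auto Scaling's put_scaling_policy."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Hypothetical target: average invocations per instance per minute.
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": 60,   # assumed values, in seconds
            "ScaleInCooldown": 300,
        },
    }

policy = target_tracking_policy("my-endpoint", "AllTraffic", 70)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("application-autoscaling").put_scaling_policy(**policy)
# after the variant has been registered via register_scalable_target.
```

A short scale-out cooldown paired with a longer scale-in cooldown is one common way to react quickly to bursts while avoiding rapid oscillation; the right values depend on the workload.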

Full explanation →

734

MCQhard

A data scientist is working on a binary classification problem with a highly imbalanced dataset (1% positive class). They have applied oversampling using SMOTE and trained a logistic regression model. The model achieves 99% accuracy on the test set, but the recall for the positive class is only 5%. What is the most likely cause?

A.SMOTE was applied before splitting the data into training and test sets

B.The model is overfitting due to lack of regularization

C.Accuracy is not a suitable metric for imbalanced data

D.Logistic regression is inappropriate for imbalanced datasets

AnswerA

Applying SMOTE before the split leaks information between the training and test sets.

Why this answer

Option A is correct because if SMOTE was applied before splitting, synthetic samples leak information from the test set into the training set, leading to overoptimistic accuracy but poor generalization. Option D is wrong because logistic regression can handle imbalanced data, though it may not capture complex patterns. Option C is wrong because accuracy is indeed a poor metric for imbalanced data, but the low recall indicates a problem beyond metric choice.

Option B is wrong because while L2 regularization might help, a lack of it would not cause such a discrepancy between accuracy and recall.
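
The gap between 99% accuracy and 5% recall follows directly from the class balance. The sketch below works through one hypothetical confusion matrix that reproduces the numbers in the question; the counts are assumptions chosen for illustration.

```python
def accuracy_and_recall(tp, fn, fp, tn):
    """Compute accuracy and positive-class recall from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    return accuracy, recall

# Hypothetical test set: 10,000 samples, 1% positive (100 positives).
# The model finds only 5 of the 100 positives and raises 5 false alarms.
accuracy, recall = accuracy_and_recall(tp=5, fn=95, fp=5, tn=9895)
```

With these counts the accuracy is 0.99 and the recall is 0.05, because the 9,895 correctly classified negatives dominate the accuracy figure.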

Full explanation →

735

MCQmedium

A healthcare company is building a model to predict patient readmission within 30 days of discharge. The dataset includes 10,000 patient records with 200 features, including lab results, demographics, and historical admissions. The target variable is highly imbalanced: only 8% of patients are readmitted. The data scientist splits the data into 80% training and 20% test sets, ensuring the same proportion of readmissions in each. The scientist trains a logistic regression model and a random forest model. The logistic regression achieves 92% accuracy but recall of 10% for the readmitted class. The random forest achieves 90% accuracy but recall of 25%. The business requirement is to achieve at least 60% recall for readmissions while maintaining reasonable precision. The scientist also has access to a large collection of unlabeled patient records from other hospitals. Which strategy should the data scientist use to meet the business requirement?

A.Collect more labeled data from other hospitals.

B.Use SMOTE to oversample the minority class in the training set.

C.Use random undersampling of the majority class in the training set.

D.Switch to a deep neural network with more layers.

AnswerB

SMOTE creates synthetic samples to balance classes.

Why this answer

Option B is correct because using SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples for the minority class, which can improve recall. Option A is wrong because collecting more data may not be feasible and may not help if imbalance persists. Option C is wrong because undersampling reduces data and may lose information.

Option D is wrong because changing to a deep learning model may not help with limited data.
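
The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in a few lines. This is a simplified illustration using only the Python standard library, not the reference implementation; the helper name `smote_sketch`, the sample rows, and the parameter defaults are assumptions.

```python
import random

def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base` by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class rows taken from the TRAINING split only.
minority_train = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_rows = smote_sketch(minority_train, n_synthetic=6)
```

Each synthetic row lies on the line segment between two real minority rows. The oversampling is applied to the training set only, so the test set keeps its original class distribution.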

Full explanation →

736

Multi-Selecteasy

A company is building a machine learning pipeline on AWS. The pipeline includes data ingestion, preprocessing, training, and deployment. Which THREE AWS services can be used to orchestrate the pipeline? (Choose THREE.)

Select 3 answers

A.AWS Glue Workflows

B.Amazon CloudWatch

C.Amazon SageMaker Pipelines

D.AWS Lambda

E.AWS Step Functions

AnswersA, C, E

Glue Workflows can orchestrate ETL jobs.

Why this answer

Options A, C, and E are correct because AWS Glue Workflows, SageMaker Pipelines, and Step Functions provide orchestration capabilities. Option D is wrong because Lambda is a compute service, not a workflow orchestrator. Option B is wrong because CloudWatch is for monitoring.

Full explanation →

737

MCQmedium

A data scientist is training a deep learning model on a large dataset using Amazon SageMaker. The training job is taking too long. Which action would MOST likely reduce training time without sacrificing model accuracy?

A.Use a smaller instance type for training

B.Implement early stopping with a low patience value

C.Enable data parallelism across multiple GPUs

D.Reduce the number of epochs by half

AnswerC

Data parallelism distributes data across GPUs, reducing training time while preserving accuracy.

Why this answer

Option C is correct because enabling data parallelism across multiple GPUs distributes the training workload across several devices, allowing larger batch sizes and faster gradient computation per epoch. Amazon SageMaker's distributed training libraries (e.g., SageMaker Data Parallelism) use all-reduce algorithms to synchronize gradients efficiently, which reduces wall-clock training time without altering the model architecture or loss function, thus preserving accuracy.
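
The gradient synchronization step can be illustrated without any GPU. In the sketch below, two simulated workers compute gradients on their own shards and an averaging step stands in for all-reduce. The model, data, and helper names are assumptions for illustration, and this is not the SageMaker library's API.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x over `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)

# Hypothetical data following y = 3x, split evenly across 2 simulated workers.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]

w = 0.0
local_grads = [gradient(w, shard) for shard in shards]  # one per worker
synced_grad = all_reduce_mean(local_grads)
full_batch_grad = gradient(w, data)
```

With equally sized shards, the averaged gradient matches the gradient computed over the full batch, which is why the update rule is unchanged while the per-worker work shrinks.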

Exam trap

The trap here is that candidates may confuse reducing training time with reducing computational load (e.g., smaller instance or fewer epochs), but the question specifically requires maintaining accuracy, which distributed parallelism achieves by leveraging more hardware rather than cutting corners in the training process.

How to eliminate wrong answers

Option A is wrong because using a smaller instance type reduces computational resources (CPU/GPU memory and throughput), which increases per-iteration time and may force smaller batch sizes, potentially slowing convergence or even degrading accuracy due to underfitting. Option B is wrong because implementing early stopping with a low patience value risks halting training prematurely before the model has converged, which can sacrifice accuracy by underfitting the data. Option D is wrong because reducing the number of epochs by half arbitrarily cuts training short without regard to convergence criteria, likely resulting in an underfit model with lower accuracy.

Full explanation →

738

MCQhard

A data scientist is using Amazon SageMaker Debugger to monitor training jobs. The training loss is decreasing but then suddenly spikes. What is the most likely cause and how should it be addressed?

A.Gradient explosion; apply gradient clipping.

B.Overfitting; apply regularization.

C.Learning rate too low; increase learning rate.

D.Vanishing gradients; use ReLU activation.

AnswerA

Gradient clipping limits the gradient magnitude.

Why this answer

Option A is correct because a sudden spike in loss after a steady decrease often indicates a gradient explosion. Gradient clipping prevents this by capping the gradient magnitude. Option C is wrong because a learning rate that is too low causes slow convergence, not spikes.

Option B is wrong because overfitting shows decreasing training loss but increasing validation loss. Option D is wrong because vanishing gradients cause loss to plateau, not spike.
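
Clipping by global norm, one common form of gradient clipping, can be sketched as follows. The helper name, the example gradients, and the threshold are assumptions for illustration; deep learning frameworks ship their own equivalents.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so their L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already within bounds, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [30.0, 40.0]  # L2 norm of 50, far above the threshold
clipped = clip_by_global_norm(exploding, max_norm=5.0)
small = clip_by_global_norm([0.3, 0.4], max_norm=5.0)
```

The clipped gradient keeps its direction but has its norm reduced to the threshold, while gradients already below the threshold pass through unchanged.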

Full explanation →

739

MCQeasy

A company is training a deep learning model on Amazon SageMaker. The training job is failing with an out-of-memory error. Which SageMaker feature should the company use to resolve this issue without changing the instance type?

A.Use SageMaker distributed training with model parallelism

B.Use SageMaker Savings Plans

C.Enable SageMaker Managed Spot Training

D.Use SageMaker Debugger to monitor memory usage

E.Enable SageMaker Profiler to profile memory

AnswerA

Model parallelism splits the model across multiple instances, reducing memory per instance.

Why this answer

Option A is correct because SageMaker distributed training with model parallelism splits the model across devices, reducing per-device memory usage. Option C is wrong because Managed Spot Training can reduce costs but does not fix memory issues. Options D and E are wrong because Debugger and Profiler only monitor or profile memory; they do not reduce it.

Option B is wrong because Savings Plans are about cost, not memory.

Full explanation →

740

MCQmedium

Refer to the exhibit. A data scientist runs the AWS CLI command to create a SageMaker training job. The job fails immediately with 'ValidationException: Invalid instance type'. What is the most likely issue?

A.The IAM role ARN is invalid

B.The S3 bucket 'my-bucket' does not exist or the role lacks permissions

C.The instance type ml.m5.large does not support the XGBoost image

D.The training image URI is for a different AWS region

AnswerD

The account ID corresponds to us-east-1, but the CLI command is running in us-west-2.

Why this answer

The error 'ValidationException: Invalid instance type' occurs because the training image URI specified in the command points to an Amazon ECR repository in a different AWS region than where the SageMaker training job is being created. SageMaker validates that the image URI is accessible from the current region; if the URI references a region that does not contain the XGBoost image or the instance type is not supported in that region's ECR, the validation fails. The instance type itself (ml.m5.large) is valid for XGBoost, but the mismatch between the image's region and the job's region triggers the exception.

Exam trap

The trap here is that candidates assume 'Invalid instance type' always means the instance is unsupported by the algorithm, when in reality SageMaker uses this generic error for any validation failure related to the training job's resource configuration, including regional mismatches in the image URI.

How to eliminate wrong answers

Option A is wrong because an invalid IAM role ARN would cause an 'AccessDeniedException' or 'InvalidParameterValue' error, not a 'ValidationException' specifically about the instance type. Option B is wrong because a missing S3 bucket or insufficient permissions would result in a 'ClientError: 404 Not Found' or 'AccessDenied' during data download, not a validation error at job creation. Option C is wrong because ml.m5.large is a supported instance type for the XGBoost algorithm in SageMaker; the error is not about instance type compatibility with the image but about the image URI's regional mismatch.

Full explanation →

741

MCQmedium

A data scientist is using Amazon SageMaker to train a custom image classification model using a PyTorch script. The training job runs successfully but the model accuracy is lower than expected. The scientist wants to debug the training process by inspecting gradients and layer outputs. Which SageMaker feature should be used to capture this internal state during training?

A.Use SageMaker Experiments to track hyperparameters and metrics.

B.Use SageMaker Debugger to capture tensors and gradients.

C.Use SageMaker Profiler to profile system bottlenecks.

D.Use SageMaker Model Monitor to detect data drift.

AnswerB

SageMaker Debugger provides real-time monitoring of training metrics and internal state like gradients.

Why this answer

SageMaker Debugger (B) captures internal model state such as gradients and tensors during training, enabling analysis and debugging. SageMaker Profiler (C) focuses on system performance, not model internals. SageMaker Experiments (A) tracks trials and metrics but does not capture internal state.

SageMaker Model Monitor (D) detects data drift after deployment.

Full explanation →

742

MCQmedium

A data scientist uses Amazon SageMaker Data Wrangler to perform EDA on a large dataset stored in S3. The data scientist notices that the target variable is highly imbalanced. Which SageMaker Data Wrangler transform can be used to address this during data preparation?

A.Standard scaling

B.Principal component analysis (PCA)

C.SMOTE (Synthetic Minority Over-sampling)

D.One-hot encoding

AnswerC

SMOTE generates synthetic samples to balance the target distribution.

Why this answer

SMOTE is available in SageMaker Data Wrangler to generate synthetic samples for the minority class. Option D is wrong because one-hot encoding is for categorical features. Option A is wrong because standard scaling normalizes numeric features.

Option B is wrong because principal component analysis reduces dimensionality.

Full explanation →

743

MCQeasy

A data scientist is training a binary classifier using logistic regression. The dataset has 100 features and 1 million samples. After training, the model achieves AUC of 0.85 on the test set. The business wants to understand which features contribute most to predictions. Which technique should the data scientist use?

A.Use t-SNE to visualize feature importance

B.Use the coefficients of the logistic regression model as feature importance

C.Use a random forest model and its feature importance attribute

D.Use Principal Component Analysis (PCA) to find important components

AnswerB

Logistic regression coefficients indicate direction and magnitude of feature impact.

Why this answer

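The coefficients of a logistic regression model give the direction and magnitude of each feature's effect on the log-odds, so they serve as a natural measure of feature importance when the features are on comparable scales (for example, after standardization).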

Full explanation →

744

MCQmedium

A company is building a recommendation system for an e-commerce platform. They have user-item interaction data (clicks, purchases) and want to use matrix factorization. They plan to use Amazon SageMaker to train the model. Which dataset format is MOST appropriate for the built-in Factorization Machines algorithm?

A.Libsvm format with user_id and item_id as features

B.CSV file with user_id, item_id, and label columns

C.RecordIO-protobuf with user_id, item_id, and label fields

D.JSON lines file with user_id, item_id, and label fields

AnswerC

RecordIO-protobuf is the required format for SageMaker's built-in Factorization Machines.

Why this answer

The built-in Factorization Machines algorithm in Amazon SageMaker supports only the RecordIO-protobuf format (with Float32 tensors) for training. This binary format supports sparse data representation, which is critical for high-dimensional user-item interaction data, and enables faster I/O and reduced memory overhead compared to text-based formats.

Exam trap

The trap here is that candidates often assume libsvm or CSV are universally optimal for sparse data, but SageMaker's built-in Factorization Machines specifically requires RecordIO-protobuf for native sparse tensor support and maximum performance, not just any text-based sparse format.

How to eliminate wrong answers

Option A is wrong because libsvm format, while common for linear models and SVMs, is not supported for training by SageMaker's built-in Factorization Machines algorithm. Option B is wrong because CSV is not a supported training format for this algorithm; dense text rows are a poor fit for sparse data with millions of user-item pairs. Option D is wrong because JSON lines is not a supported training format either; using it would require converting the data to RecordIO-protobuf first or supplying a custom container.

Full explanation →

745

MCQhard

A data scientist is performing exploratory data analysis on a dataset with mixed data types: numerical, categorical, and text. They want to use Amazon SageMaker Data Wrangler to create a quick visualization dashboard. Which set of transformations should they apply in Data Wrangler to handle all data types appropriately?

A.Use the built-in analysis: summary statistics for numerical, word cloud for text, and frequency for categorical.

B.Convert all features to numerical using one-hot encoding and then create a scatter matrix.

C.Apply TF-IDF vectorization to text and then run k-means clustering.

D.Use PCA to reduce dimensionality and then visualize the first two components.

AnswerA

These are appropriate EDA visualizations for different data types.

Why this answer

Option A is correct because Data Wrangler's built-in analysis includes summary statistics for numerical features, word clouds for text, and frequency counts for categorical features. These are appropriate for initial EDA. Option D is incorrect because PCA is for dimensionality reduction, not EDA.

Option C is incorrect because TF-IDF is a feature engineering step and k-means clustering is a modeling step, not EDA. Option B is incorrect because one-hot encoding does not suit numerical or free-text features, so it cannot handle all data types appropriately.

Full explanation →

746

MCQmedium

A research institution is building a data lake to store genomics data. Each experiment generates multiple files totaling about 500 GB. The data is stored in Amazon S3 and needs to be processed by multiple machine learning (ML) training jobs running on Amazon SageMaker. The data has a high churn rate; after 30 days, most data becomes irrelevant and should be moved to Amazon S3 Glacier Deep Archive. The institution wants to minimize storage costs while maintaining data durability. Which S3 storage class should they use for the first 30 days?

A.Use S3 Intelligent-Tiering for all data, and set a lifecycle policy to transition to S3 Glacier Deep Archive after 30 days.

B.Use S3 One Zone-IA for all data, and set a lifecycle policy to transition to S3 Glacier Deep Archive after 30 days.

C.Use S3 Standard for all data, and set a lifecycle policy to transition to S3 Glacier Deep Archive after 30 days.

D.Use S3 Glacier Instant Retrieval for all data, and set a lifecycle policy to transition to S3 Glacier Deep Archive after 30 days.

AnswerA

Intelligent-Tiering automatically optimizes costs by moving data to lower-cost tiers when not accessed, and it provides high durability.

Why this answer

Option A is correct because S3 Intelligent-Tiering automatically moves data between access tiers based on usage, which is ideal for data with unknown or changing access patterns. Option C is wrong because S3 Standard is more expensive for data that may not be accessed frequently after the initial processing. Option B is wrong because S3 One Zone-IA does not replicate data across Availability Zones.

Option D is wrong because S3 Glacier Instant Retrieval is for long-lived, rarely accessed data requiring millisecond access; not cost-effective for the first 30 days.
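
The 30-day transition described in every option is expressed as an S3 lifecycle rule. The sketch below builds such a configuration as a plain dictionary. The rule ID, the empty prefix (meaning all objects), and the bucket name in the comment are hypothetical, and `deep_archive_lifecycle` is an illustrative helper, not an AWS SDK function.

```python
def deep_archive_lifecycle(days=30, prefix=""):
    """Build a lifecycle configuration that transitions objects to Deep Archive."""
    return {
        "Rules": [
            {
                "ID": f"to-deep-archive-after-{days}-days",
                "Filter": {"Prefix": prefix},  # empty prefix applies to all objects
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }

lifecycle = deep_archive_lifecycle(days=30)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-genomics-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle,
#   )
```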

Full explanation →

747

MCQeasy

A data engineer is designing a data lake on Amazon S3. The data comes from various sources, including IoT devices, web logs, and transactional databases. The engineer needs to organize the data in a way that supports efficient querying using Amazon Athena and allows for easy management of access permissions. Which S3 bucket structure is the most appropriate?

A.Store all data in a single prefix without any partitioning.

B.Use a prefix structure like s3://bucket/source/year/month/day/.

C.Store all data in separate S3 buckets for each source and date.

D.Use a prefix structure like s3://bucket/date/source/.

AnswerB

This structure enables partition pruning by source and time, optimizing Athena queries and allowing granular access control at the source level.

Why this answer

Option B is correct because partitioning by source, year, month, and day allows Athena to prune partitions, reducing scan costs and improving performance. Option A is wrong because storing all data in a flat structure forces full scans. Option C is wrong because separate buckets per source and date are unnecessary and hard to manage; prefix-based access controls can be applied at the source level within the partitioned structure.

Option D is wrong because using date as the first partition level is less intuitive for managing permissions by source.
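
A writer that follows the source/year/month/day layout might derive its prefix as in the sketch below. The bucket and source names are hypothetical, and `partition_prefix` is an illustrative helper.

```python
from datetime import date

def partition_prefix(bucket, source, day):
    """Return the s3://bucket/source/year/month/day/ prefix for one day of data."""
    return f"s3://{bucket}/{source}/{day:%Y}/{day:%m}/{day:%d}/"

prefix = partition_prefix("my-data-lake", "iot", date(2024, 3, 7))
```

Because the source is the first path component, an IAM policy can grant access to everything under a single source prefix, while time-filtered queries only need to read the matching year, month, and day prefixes.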

Full explanation →

748

MCQhard

A team is training a large deep learning model on Amazon SageMaker. The training job is taking too long and they want to reduce training time without changing the model architecture. Which action is MOST effective?

A.Switch to a compute-optimized instance like c5.4xlarge

B.Use a GPU instance (e.g., p3.2xlarge) for training

C.Increase the batch size and learning rate proportionally

D.Use SageMaker Automatic Model Tuning with hyperparameter optimization

AnswerB

GPUs dramatically speed up matrix operations common in deep learning.

Why this answer

Using a SageMaker managed training instance with GPU (e.g., p3.2xlarge) provides significant acceleration for deep learning models due to parallel processing.

Full explanation →

749

Multi-Selecteasy

Which TWO of the following are common techniques for handling missing values in a dataset during exploratory data analysis? (Select TWO.)

Select 2 answers

A.Apply feature scaling to normalize the data.

B.Remove rows or columns with missing values if they are few.

C.Use Principal Component Analysis (PCA) to reduce dimensionality.

D.Apply one-hot encoding to the missing values.

E.Impute missing values with the mean or median of the column.

AnswersB, E

Deletion is a valid approach when missing data is minimal.

Why this answer

Imputation with mean/median and removing rows/columns are standard techniques. Options A (feature scaling), C (PCA), and D (one-hot encoding) are not for handling missing values.

Full explanation →

750

MCQmedium

A machine learning engineer is analyzing a dataset and observes that the distribution of a continuous feature is heavily right-skewed. Which transformation is most likely to make the distribution approximately normal?

A.Square root transformation

B.Exponential transformation

C.Log transformation

D.Box-Cox transformation with lambda = 0

AnswerC

Log transformation is standard for right-skewed data.

Why this answer

Option C is correct because a log transformation compresses the right tail and is effective for right-skewed data. Square root (A) is less effective for heavy skew. Exponential (B) would worsen skew.

Box-Cox with lambda = 0 (D) is the member of the Box-Cox family that reduces to the log transform, so the direct log transformation is the more common and straightforward choice.
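
The relationship between Box-Cox and the log transform can be checked numerically. The sketch below implements the standard one-parameter Box-Cox formula for a positive value; the sample value and variable names are assumptions for illustration.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if lam == 0:
        return math.log(x)  # defined as the limit of (x**lam - 1) / lam
    return (x ** lam - 1) / lam

x = 50.0
as_log = box_cox(x, 0)
near_zero = box_cox(x, 1e-6)       # approaches log(x) as lam goes to 0
sqrt_like = box_cox(x, 0.5)        # lam = 0.5 is a shifted, scaled square root
```

As lambda approaches 0 the general formula converges to the natural log, while lambda = 0.5 compresses the tail less strongly, consistent with square root being the weaker correction for heavy skew.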

Full explanation →
