Knowledge + Practice

AWS Certified Machine Learning Specialty MLS-C01 (MLS-C01) — Questions 1726–1755

1755 questions total · 24 pages · All types, answers revealed

1726

MCQ · hard

A Glue job fails with an AccessDenied error when trying to write to the S3 bucket my-data-lake. The IAM policy attached to the job role is shown in the exhibit. What is the MOST likely reason for the failure?

A. The s3:ListBucket action is missing on the bucket level

B. The job role does not have permissions to decrypt the KMS key used for server-side encryption

C. The s3:PutObject action is not sufficient; the job needs s3:PutObjectAcl

D. The resource ARN for s3:PutObject should include a specific prefix

Answer: B

Writing to an SSE-KMS bucket requires kms:GenerateDataKey and kms:Decrypt permissions on the key, and both are missing from the policy.

Why this answer

The policy allows s3:PutObject on the bucket, so S3 write access itself is granted. However, if the bucket is encrypted with SSE-KMS, the job role also needs kms:GenerateDataKey to encrypt new objects and kms:Decrypt to read them or to complete multipart uploads. The policy does not include any KMS actions.

A bucket policy with an explicit deny could produce the same error, but missing KMS permissions are the more common cause.
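
The fix described above can be sketched as a policy document built in Python. This is only an illustration: the exhibit is not reproduced here, and the key ARN, account ID, and the helper name `build_glue_write_policy` are hypothetical placeholders.

```python
import json

# Hypothetical identifiers for illustration only.
BUCKET = "my-data-lake"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def build_glue_write_policy(bucket: str, kms_key_arn: str) -> dict:
    """Return an IAM policy document that allows SSE-KMS writes to a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # S3 permission the original policy is said to contain.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # KMS permissions the explanation identifies as missing.
                "Effect": "Allow",
                "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
                "Resource": kms_key_arn,
            },
        ],
    }


policy = build_glue_write_policy(BUCKET, KMS_KEY_ARN)
print(json.dumps(policy, indent=2))
```

The KMS key policy must also permit the role to use the key; that side is not shown in this sketch.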

1727

MCQ · hard

A data scientist is analyzing a dataset with missing values. The missing data mechanism is missing at random (MAR). Which imputation method is most appropriate to preserve relationships between variables?

A. Remove all rows with any missing values.

B. Use k-nearest neighbors imputation.

C. Use multiple imputation by chained equations (MICE).

D. Replace missing values with the mean of the column.

Answer: C

MICE models each variable with missing values conditional on others, suitable for MAR.

Why this answer

Option C is correct because multiple imputation by chained equations (MICE) handles MAR well by modeling each variable with missing values conditional on the others. Option A is wrong because dropping rows with missing data reduces sample size and can introduce bias. Option D is wrong because mean imputation underestimates variance and weakens relationships between variables.

Option B is wrong because KNN imputation produces a single imputed value per gap and does not model each variable conditionally on the others, so it may not be optimal for MAR.
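
The claim that mean imputation underestimates variance can be checked on a toy column. This is a minimal sketch with made-up values; it illustrates the weakness of mean imputation and is not an implementation of MICE.

```python
import statistics

# Toy column with missing entries marked as None (values are made up).
column = [2.0, 4.0, None, 8.0, None, 10.0]

observed = [x for x in column if x is not None]
mean_observed = statistics.mean(observed)

# Mean imputation: every missing entry receives the same value.
imputed = [mean_observed if x is None else x for x in column]

var_observed = statistics.variance(observed)  # sample variance of observed data
var_imputed = statistics.variance(imputed)    # sample variance after imputation

print(f"variance before: {var_observed:.2f}, after: {var_imputed:.2f}")
```

The imputed entries add no spread around the mean while increasing the sample size, so the variance shrinks.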

1728

MCQ · hard

Refer to the exhibit. A data scientist runs the above AWS CLI command to create a SageMaker training job using the built-in Linear Learner algorithm. The training job fails with an error. What is the most likely cause?

A. The S3 data type is AugmentedManifestFile, but Linear Learner requires RecordIO or CSV

B. The IAM role does not have sufficient permissions

C. The instance type ml.m5.large does not support the Linear Learner algorithm

D. The MaxRuntimeInSeconds is too short

Answer: A

Linear Learner does not support augmented manifest.

Why this answer

The command sets the S3 data type to `AugmentedManifestFile`, but Linear Learner expects its training data as RecordIO or CSV, not as an augmented manifest. Augmented manifests are intended for algorithms that support them, such as object detection.

The content type `application/x-recordio` indicates RecordIO data, so the mismatch is in the S3 data type, not the content type. Option A is correct.

Option B is wrong because an IAM role is supplied. Option C is wrong because the instance type is supported. Option D is wrong because the maximum runtime is adequate.

1729

MCQ · easy

A team stores raw data in S3 and uses a Glue Data Catalog for metadata. They want to allow data scientists to query the data with Amazon Athena using their existing IAM roles. What is the MINIMUM set of permissions required?

A. Grant the IAM role permissions for Athena, Glue, and S3 (read and write).

B. Grant the IAM role permissions for Athena actions, Glue Data Catalog actions, and S3 read access.

C. Grant the IAM role permissions for Athena and Amazon Redshift Spectrum.

D. Grant the IAM role permissions for Athena and Amazon Kinesis.

Answer: B

Athena requires GetTable, GetDatabase, etc. from Glue, and GetObject from S3.

Why this answer

Option B is correct because Athena needs permissions to run queries, to read table metadata from the Glue Data Catalog, and to read the underlying data from S3. Option A is wrong because write access to the data in S3 is not needed for querying. Option C is wrong because Redshift Spectrum is a separate service that plays no part in Athena queries.

Option D is wrong because Kinesis permissions are irrelevant.

1730

MCQ · easy

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 100 Mbps internet connection and a tight deadline of two weeks. Which AWS service should the engineer use to transfer the data most efficiently?

A. AWS Storage Gateway (Volume Gateway)

B. AWS Snowball Edge

C. Amazon S3 Transfer Acceleration

D. AWS DataSync over the internet

Answer: B

Snowball Edge provides physical shipping, bypassing bandwidth limitations.

Why this answer

Option B is correct. AWS Snowball Edge is a physical device that is shipped to AWS, so the transfer is not limited by the 100 Mbps link. Option A is wrong because AWS Storage Gateway is designed for ongoing hybrid storage, not bulk transfer.

Option C is wrong because Amazon S3 Transfer Acceleration improves speed but still runs over the internet, which is not enough for 50 TB in two weeks at 100 Mbps. Option D is wrong because AWS DataSync over the internet is limited by the same 100 Mbps connection and would be too slow.
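
The bandwidth argument can be verified with a quick calculation. This sketch assumes decimal terabytes and a link that is fully utilized with no protocol overhead, which is optimistic; the function name `transfer_days` is illustrative.

```python
def transfer_days(size_tb: float, link_mbps: float) -> float:
    """Days needed to move size_tb terabytes over a link of link_mbps megabits/s."""
    bits = size_tb * 1e12 * 8            # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6)   # megabits per second to bits per second
    return seconds / 86_400              # seconds to days


days = transfer_days(50, 100)
print(f"{days:.1f} days")
```

Even under these ideal assumptions the transfer takes more than six weeks, far past the two-week deadline.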

1731

MCQ · medium

A company is streaming data from IoT devices to Amazon Kinesis Data Firehose, which writes to an Amazon S3 bucket. The data is then processed by an AWS Glue ETL job and loaded into Amazon Redshift. The team notices that some records are missing in Redshift. They suspect data loss during the Firehose delivery. Which configuration parameter should be checked first?

A. The AWS KMS key used for encryption.

B. The CloudWatch error logging configuration.

C. The buffer interval (e.g., 60 seconds) and buffer size.

D. The compression format (GZIP, Snappy, etc.).

Answer: C

If the buffer interval is too long and the stream is stopped, buffered data may be lost if it is not flushed properly.

Why this answer

Firehose buffers incoming records before writing them to S3. If the buffer interval is too long and the stream ends, data may be lost if the buffer is not flushed. Option C (buffer interval and buffer size) is therefore the first setting to check.

Option A (KMS key) governs encryption, not delivery. Option B (error logging) only records errors and does not prevent loss. Option D (compression format) does not cause loss.

1732

Multi-Select · easy

A machine learning engineer is deploying a model using Amazon SageMaker. The model requires preprocessing steps (e.g., scaling, encoding) that were applied during training. Which TWO options can ensure the same preprocessing is applied at inference?

Select 2 answers

A. Implement preprocessing as an AWS Lambda function invoked before inference.

B. Deploy a separate preprocessing endpoint and call it before the model endpoint.

C. Retrain the model in each inference request with the preprocessing applied.

D. Create a Scikit-learn pipeline that includes preprocessing and the model, then deploy it.

E. Use SageMaker Inference Pipeline to chain a preprocessing container with the model container.

Answers: D, E

The pipeline ensures consistent transformation during training and inference.

Why this answer

Options D and E are correct. Scikit-learn pipelines bundle preprocessing and the model into a single object. SageMaker Inference Pipelines chain preprocessing and prediction containers.

Option A is wrong because a separate Lambda function may introduce inconsistencies with the training-time preprocessing. Option B is wrong because a separate endpoint adds complexity. Option C is wrong because retraining the model in each inference request is impractical.
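
The idea behind bundling preprocessing with the model can be sketched in plain Python. This is a toy stand-in, not the Scikit-learn or SageMaker API: the classes `Standardizer`, `ThresholdModel`, and `Pipeline` are hypothetical and the data is made up.

```python
import statistics


class Standardizer:
    """Learns mean and standard deviation at fit time and reuses them later."""

    def fit(self, xs):
        self.mean = statistics.mean(xs)
        self.std = statistics.pstdev(xs) or 1.0  # avoid division by zero
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]


class ThresholdModel:
    """Toy classifier: predicts 1 when the scaled feature is positive."""

    def fit(self, xs, ys):
        return self

    def predict(self, xs):
        return [1 if x > 0 else 0 for x in xs]


class Pipeline:
    """Bundles preprocessing and model so both are fitted and applied together."""

    def __init__(self, scaler, model):
        self.scaler, self.model = scaler, model

    def fit(self, xs, ys):
        self.model.fit(self.scaler.fit(xs).transform(xs), ys)
        return self

    def predict(self, xs):
        # Inference reuses the statistics learned during training.
        return self.model.predict(self.scaler.transform(xs))


pipeline = Pipeline(Standardizer(), ThresholdModel()).fit([10.0, 20.0, 30.0], [0, 0, 1])
predictions = pipeline.predict([12.0, 28.0])
print(predictions)
```

Because the scaler's statistics travel with the model, inference cannot drift from the transformation used in training.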

1733

Multi-Select · easy

Which TWO of the following are true about the bias-variance tradeoff?

Select 2 answers

A. Ensemble methods like bagging increase variance

B. Simple models tend to have high variance

C. High variance can cause overfitting

D. High bias can cause underfitting

E. High variance models are typically too simple

Answers: C, D

High variance means the model is very sensitive to training data, leading to overfitting.

Why this answer

Option C is correct because high variance leads to overfitting. Option D is correct because high bias leads to underfitting. Option A is wrong because ensemble methods such as bagging reduce variance.

Option B is wrong because simple models tend to have high bias, not high variance. Option E is wrong because high-variance models are typically too complex, not too simple.

1734

MCQ · hard

A team is building a model to predict customer churn. They have 50 features, including categorical variables with high cardinality (e.g., zip code with 10,000 unique values). Which feature engineering technique is most appropriate?

A. Binning zip codes into regions

B. Target encoding

C. Label encoding

D. One-hot encoding

Answer: B

Target encoding condenses high cardinality into one numeric feature.

Why this answer

Target encoding replaces each category with the mean of the target variable for that category, which handles high cardinality well. Option A is wrong because binning reduces cardinality but loses information. Option C is wrong because label encoding implies an ordering that zip codes do not have.

Option D is wrong because one-hot encoding would create 10,000 binary columns, causing high dimensionality.
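
Target encoding as described above can be sketched in a few lines. The data and the helper name `fit_target_encoding` are made up for illustration; in practice the encoding is usually fitted on training folds only (often with smoothing) to avoid leaking the target.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Map each category to the mean target observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


# Made-up training rows: zip code and churn label (1 = churned).
zip_codes = ["98101", "98101", "10001", "10001", "10001", "60601"]
churned = [1, 0, 1, 1, 0, 0]

encoding = fit_target_encoding(zip_codes, churned)
global_mean = sum(churned) / len(churned)

# Unseen categories fall back to the global mean.
encoded = [encoding.get(z, global_mean) for z in ["98101", "10001", "73301"]]
print(encoded)
```

However many distinct zip codes exist, the result is a single numeric column.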

1735

MCQ · hard

A data scientist is analyzing a dataset with 1 million rows and 50 features. The scientist wants to detect outliers in a numerical feature 'transaction_amount' which has a long right tail. The scientist suspects that outliers are due to data entry errors and should be removed. Which outlier detection method is MOST robust for this scenario?

A. Interquartile range (IQR) with multiplier 1.5

B. Mahalanobis distance

C. Z-score with threshold 3

D. DBSCAN clustering

Answer: A

IQR method is non-parametric and robust to skewness.

Why this answer

Option A is correct because the IQR method is robust to skewed distributions and does not assume normality. Option B is wrong because Mahalanobis distance assumes multivariate normality. Option C is wrong because the Z-score assumes normality.

Option D is wrong because DBSCAN is computationally expensive on 1 million rows and may not be practical for univariate outlier detection.
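
The difference between the two univariate rules can be seen on a small made-up sample. This sketch uses the standard library's `statistics.quantiles` for the quartiles; the helper name `iqr_outliers` and the amounts are illustrative.

```python
import statistics


def iqr_outliers(values, multiplier=1.5):
    """Return the values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up transaction amounts with one suspicious entry.
amounts = [10, 12, 11, 13, 12, 14, 11, 13, 12, 500]

outliers = iqr_outliers(amounts)

# Z-score of the same entry: the outlier inflates the mean and the
# standard deviation that it is measured against.
z_score = (500 - statistics.mean(amounts)) / statistics.stdev(amounts)

print(outliers, round(z_score, 2))
```

On this sample the IQR rule flags the 500 entry, while its Z-score stays below the threshold of 3 because the extreme value distorts the statistics used to judge it.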

1736

MCQ · medium

A machine learning engineer is building a pipeline using Amazon SageMaker Pipelines. The pipeline has multiple steps including data preprocessing, training, and evaluation. Which statement about SageMaker Pipelines is correct?

A. Steps in a pipeline must run sequentially.

B. Pipelines support caching of step outputs.

C. Pipelines can only use built-in algorithms.

D. Pipelines cannot have conditional branches.

Answer: B

Caching speeds up re-runs.

Why this answer

Option B is correct because SageMaker Pipelines supports caching of step outputs to avoid re-execution. Option A is wrong because pipelines support parallel execution of steps that do not depend on each other. Option C is wrong because pipelines can include custom scripts.

Option D is wrong because steps can be conditional.

1737

MCQ · medium

A data scientist is training a deep learning model on a GPU instance. The training loss is decreasing, but the validation loss starts increasing after a few epochs. Which action should the data scientist take to address this?

A. Reduce the batch size

B. Implement early stopping

C. Increase the learning rate

D. Add more layers to the model

Answer: B

Early stopping halts training when validation loss increases.

Why this answer

Option B is correct because early stopping stops training when validation loss starts increasing, preventing overfitting. Option A is wrong because reducing the batch size may increase noise. Option C is wrong because increasing the learning rate may cause divergence.

Option D is wrong because adding more layers increases complexity and overfitting.
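
A common way to implement early stopping is a patience counter on the validation loss. The sketch below is one such variant under assumed settings; the function name `early_stopping_epoch`, the patience value, and the loss values are illustrative.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: all epochs were run


# Made-up validation losses: improving, then rising as overfitting sets in.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.66, 0.71]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)
```

In a real training loop the weights from the best epoch (index 2 here) would typically be restored after stopping.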

1738

MCQ · easy

A company is using Amazon SageMaker to train a linear learner model for predicting customer lifetime value. The target variable is right-skewed with a long tail. The data scientist applies a log transformation to the target variable and trains the model. The model achieves a low root mean squared error (RMSE) on the log scale. However, when the predictions are exponentiated back to the original scale, the RMSE is much higher. Which step should the data scientist take to improve the model's performance on the original scale?

A. Increase the regularization strength

B. Remove outliers from the training data

C. Use a loss function that models the original distribution, such as Poisson or Tweedie

D. Use a deep learning model instead of linear learner

Answer: C

These loss functions handle skewed distributions better.

Why this answer

Option C (use a loss function like Poisson or Tweedie) is appropriate for non-negative skewed targets because it models the target on its original scale. Option A (increase regularization) may not help. Option B (remove outliers) may lose data.

Option D (use a different algorithm) may not address the issue.
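
One reason errors grow after exponentiation is that a model fitted on the log scale targets the mean of the logs, and exponentiating that yields a geometric mean, which is never larger than the arithmetic mean. A minimal sketch with made-up values:

```python
import math
import statistics

# Made-up right-skewed target values (e.g., customer lifetime value).
values = [1.0, 10.0, 100.0]

# The best constant prediction on the log scale is the mean of the logs.
log_scale_prediction = statistics.mean(math.log(v) for v in values)

# Exponentiating gives the geometric mean, not the arithmetic mean.
back_transformed = math.exp(log_scale_prediction)
arithmetic_mean = statistics.mean(values)

print(f"back-transformed: {back_transformed:.1f}, mean: {arithmetic_mean:.1f}")
```

The back-transformed prediction sits well below the arithmetic mean, so large values in the tail are systematically under-predicted on the original scale.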

1739

Multi-Select · hard

Which THREE of the following are valid strategies to reduce overfitting in a deep neural network? (Choose 3)

Select 3 answers

A. Increase the number of layers.

B. Use early stopping.

C. Increase the learning rate.

D. Add L2 regularization to the loss function.

E. Use dropout layers.

Answers: B, D, E

Early stopping prevents overfitting.

Why this answer

Option B is correct because early stopping prevents overfitting. Option D is correct because L2 regularization penalizes large weights. Option E is correct because dropout randomly drops units to prevent co-adaptation.

Option A is wrong because increasing model capacity increases overfitting. Option C is wrong because increasing the learning rate may cause divergence.

1740

MCQ · easy

A machine learning engineer is building a pipeline to preprocess data and train a model using Amazon SageMaker. The data is stored in Amazon S3 and the preprocessing step is computationally intensive. The engineer wants to minimize costs while ensuring that the preprocessing step does not fail due to instance termination. Which instance type should be used for the preprocessing step?

A. Reserved instances

B. On-demand instances

C. A larger instance type to speed up processing

D. Spot instances

Answer: B

On-demand instances are reliable and not terminated, ensuring the step completes.

Why this answer

Option B is correct because using on-demand instances guarantees that the instance will not be terminated during the preprocessing step. Option A is wrong because reserved instances require a long-term commitment. Option C is wrong because a larger instance type increases costs unnecessarily.

Option D is wrong because spot instances can be terminated, causing failures.

1741

MCQ · hard

Refer to the exhibit. A data scientist runs the AWS CLI command shown to explore the contents of an S3 bucket. The command returns an empty array. However, the data scientist knows there are objects larger than 1000 bytes in the bucket. What is the most likely reason for the empty result?

A. The query syntax is incorrect; backticks should not be used

B. The command should use list-objects instead of list-objects-v2

C. The --query parameter is not supported by list-objects-v2

D. The AWS CLI is not configured with the correct region for the bucket

Answer: D

If the bucket is in a region other than the one the CLI is configured for, the command may return no results.

Why this answer

The --query expression Contents[?Size > `1000`] is valid: JMESPath uses backticks to delimit literal values, so the numeric comparison is written correctly.

With the syntax ruled out, the remaining candidates are that the objects sit under a different prefix or that the bucket is in a different region. The most likely reason is that the command omits the --region parameter and the bucket is not in the CLI's default region. Option D is correct.

Option A is wrong because backticks are the correct JMESPath literal syntax. Option B is wrong because list-objects also returns a Contents array, so switching commands would not change the result. Option C is wrong because --query is a global CLI option that works with list-objects-v2.

1742

Multi-Select · medium

Which TWO steps are required to set up cross-account access to an Amazon S3 data lake for AWS Glue jobs running in a different AWS account? (Choose two.)

Select 2 answers

A. Add a bucket policy to the S3 bucket that grants access to the Glue service role from the other account.

B. Create an IAM role in the second account that the Glue job can assume, with permissions to read from the S3 bucket.

C. Create a cross-account Glue crawler in the source account.

D. Set up VPC peering between the two accounts' VPCs.

E. Ensure both accounts are in the same AWS organization.

Answers: A, B

The bucket policy allows cross-account access, and the IAM role gives the Glue job permission to use it.

Why this answer

To allow cross-account access, the S3 bucket policy must grant access to the Glue service role from the other account, and the Glue job must assume a role that has permissions to access the bucket. Option A (bucket policy) and Option B (IAM role in the second account) are correct. Option C (cross-account Glue crawler) is not needed.

Option D (VPC peering) is not required for S3 access. Option E (same AWS organization) is not a requirement for cross-account access.

1743

Drag & Drop · medium

Drag and drop the steps to train a model using Amazon SageMaker built-in algorithm in the correct order.

Why this order

Training involves data preparation, job creation, algorithm selection, input/output paths, and execution.

1744

Multi-Select · easy

Which TWO of the following are appropriate use cases for using Amazon SageMaker BlazingText? (Choose 2)

Select 2 answers

A. Text classification using supervised learning.

B. Time series forecasting.

C. Learning word embeddings from a large text corpus.

D. Classifying images.

E. Sequence-to-sequence translation.

Answers: A, C

BlazingText has supervised mode.

Why this answer

Option A is correct because BlazingText supports text classification with supervised mode. Option C is correct because BlazingText supports Word2Vec embeddings. Option B is wrong because time series forecasting is not supported.

Option D is wrong because image classification is not supported. Option E is wrong because sequence-to-sequence translation is not supported.

1745

MCQ · medium

A company is building a recommendation system for an e-commerce platform. The data includes user-item interactions and features such as user demographics and item categories. Which algorithm would be most appropriate for generating personalized recommendations?

A. XGBoost

B. Factorization Machines

C. k-means clustering

D. Principal Component Analysis (PCA)

Answer: B

Factorization Machines model pairwise feature interactions and work well with sparse data, making them suitable for recommendation systems.

Why this answer

Factorization Machines (FM) are specifically designed for recommendation tasks with sparse, high-dimensional data like user-item interactions. They model pairwise feature interactions (e.g., user demographics × item categories) using factorized parameters, enabling personalized recommendations even when many user-item pairs are unobserved. This makes FM far more effective than tree-based or clustering methods for collaborative filtering and feature-rich recommendation scenarios.
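
The pairwise-interaction idea can be made concrete with the standard second-order FM scoring formula: a bias, a linear term, and a sum over feature pairs weighted by the dot product of their latent vectors. The sketch below uses made-up parameters and a hypothetical function name `fm_predict`; it shows scoring only, not training, and is not the SageMaker implementation.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine score for one feature vector x.

    w0: global bias, w: per-feature weights, v: per-feature latent vectors.
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            # Interaction weight is the dot product of the two latent vectors.
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


# Made-up parameters: features are [user_A, user_B, item_X] as one-hot flags.
x = [1.0, 0.0, 1.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

score = fm_predict(x, w0, w, v)
print(score)
```

Because each interaction weight comes from shared latent vectors, a user-item pair never seen together in training still receives a meaningful score.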

Exam trap

The exam often tests whether candidates confuse general-purpose ML algorithms (like XGBoost or clustering) with specialized recommendation algorithms, expecting you to recognize that Factorization Machines are the only option designed for sparse interaction data and feature crosses.

How to eliminate wrong answers

Option A (XGBoost) is wrong because it is a tree-based ensemble method that struggles with sparse, high-cardinality categorical features common in recommendation data; it cannot efficiently learn latent interaction patterns between users and items without extensive feature engineering. Option C (k-means clustering) is wrong because it is an unsupervised clustering algorithm that groups users or items into clusters, but it cannot generate personalized recommendations that account for individual user-item interactions or feature crosses. Option D (PCA) is wrong because it is a dimensionality reduction technique that transforms features into uncorrelated principal components, losing interpretability and failing to model the pairwise feature interactions needed for personalized recommendations.

1746

Multi-Select · medium

A data engineer needs to design a data ingestion pipeline that ingests data from a MySQL database hosted on-premises into Amazon S3 for analytics. The pipeline must capture change data (CDC) and run continuously with low latency. Which two services should the data engineer use?

Select 2 answers

A. AWS Database Migration Service (DMS) with ongoing replication.

B. Amazon S3 as the target endpoint for DMS.

C. Amazon AppFlow.

D. AWS Glue ETL jobs scheduled at regular intervals.

E. Amazon Kinesis Data Streams.

Answers: A, B

DMS supports CDC and can write changes to S3 continuously.

Why this answer

Options A and B are correct. AWS DMS can capture ongoing changes from MySQL using CDC and replicate them to S3 as the target endpoint. Option C (AppFlow) is for SaaS applications, not for on-premises databases.

Option D (Glue ETL) is batch-oriented. Option E (Kinesis Data Streams) can receive CDC data from DMS but is not needed if DMS writes directly to S3.

1747

Multi-Select · medium

A data engineering team is designing a data lake on AWS. They need to store raw data in S3 and allow multiple analytics services to query the data. Which TWO services can be used to catalog and provide schema information for the data?

Select 2 answers

A. AWS Glue Data Catalog

B. Amazon Kinesis Data Streams

C. Amazon RDS

D. Amazon DynamoDB

E. Amazon Athena

Answers: A, E

Glue Data Catalog stores metadata and schemas.

Why this answer

Option A (AWS Glue Data Catalog) is a metadata catalog. Option E (Amazon Athena) uses the Glue Data Catalog as its schema store. Option B (Kinesis Data Streams) is a streaming service; Option C (RDS) is a database; Option D (DynamoDB) is not a catalog.

1748

MCQ · hard

A data scientist is performing EDA on a large dataset (10 TB) stored in S3. They need to compute summary statistics for each column. Which approach is most cost-effective and efficient?

A. Use an AWS Glue ETL job with PySpark to compute statistics

B. Use Amazon Athena with SQL queries

C. Download the dataset to an Amazon SageMaker Studio notebook and use pandas

D. Launch an Amazon EMR cluster and use Spark SQL

Answer: B

Athena is serverless, cost-effective, and efficient for ad-hoc queries.

Why this answer

Option B is correct because Amazon Athena uses a serverless query engine that scales automatically and charges per query based on data scanned, making it cost-effective for large datasets. Option A is wrong because AWS Glue Spark jobs have overhead and are more suited for complex ETL. Option C is wrong because downloading to SageMaker Studio may incur high data transfer costs and require significant local storage.

Option D is wrong because Amazon EMR requires provisioning clusters and is more expensive for simple statistics.

1749

MCQ · hard

Refer to the exhibit. A data scientist is trying to create a SageMaker training job but receives an access denied error. The IAM policy shown is attached to their role. What is the most likely reason for the error?

A. The policy only allows CreateTrainingJob when the training job status is 'Failed', which is never true initially

B. The Action is not allowed because 'CreateTrainingJob' is misspelled

C. There is an explicit deny in another policy

D. The Resource is set to '*' which does not include the specific training job ARN

Answer: A

Condition prevents creation.

Why this answer

Option A is correct because the IAM policy uses a `Condition` block with `sagemaker:TrainingJobStatus` set to `Failed`. When a `CreateTrainingJob` API call is made, the training job status is not yet set (it is `Creating` or `InProgress`), so the condition evaluates to false, and the request is denied. The policy only grants permission when the status equals `Failed`, which never occurs at creation time.

Exam trap

The exam often tests the nuance that IAM condition keys like `sagemaker:TrainingJobStatus` are evaluated against the current state of the resource at the time of the API call, and candidates mistakenly assume a wildcard resource or a missing action is the issue rather than a condition that never matches.

How to eliminate wrong answers

Option B is wrong because 'CreateTrainingJob' is the correct AWS API action name; there is no misspelling in the policy. Option C is wrong because while an explicit deny in another policy could cause an access denied error, the question asks for the 'most likely' reason, and the given policy's condition is a direct and obvious cause. Option D is wrong because the `Resource` element set to `'*'` in a SageMaker training job policy actually covers all training job ARNs, so it is not the source of the denial.

1750

MCQ · hard

A machine learning team is analyzing feature importance in a dataset with many categorical features. They plan to use a tree-based model. Which encoding method should they use to handle high-cardinality categorical features without creating too many dummy variables?

A. One-hot encoding

B. Label encoding

C. Target encoding

D. Frequency encoding

Answer: C

Target encoding replaces categories with the target mean, preserving information without increasing dimensionality.

Why this answer

Option C is correct because target encoding replaces categories with the mean of the target, which is efficient and works well with tree models. Option A is wrong because one-hot encoding creates many columns for high cardinality. Option B is wrong because label encoding imposes ordinality.

Option D is wrong because frequency encoding may not capture predictive information.

1751

MCQ · medium

A company is using AWS Glue to catalog metadata from various data sources. The crawler is configured to run daily. However, the catalog is not reflecting new partitions added to an S3 bucket during the day. What is the MOST likely cause?

A. The S3 bucket has insufficient permissions for the Glue crawler

B. The table schema has changed and the crawler does not update it

C. The crawler is not scheduled frequently enough to capture changes

D. The data format is not supported by AWS Glue

Answer: C

The crawler runs once a day, so it misses partitions added between runs.

Why this answer

Option C is correct because the crawler is configured to run daily, but new partitions are being added to the S3 bucket throughout the day. Since the crawler only runs once per day, it will not detect and catalog those new partitions until its next scheduled run. To capture changes sooner, the crawler should be scheduled more frequently or an event-driven trigger (e.g., using Amazon S3 Events and AWS Lambda) should be implemented.

Exam trap

The trap here is that candidates may assume the crawler automatically detects all changes in real time, but AWS Glue crawlers are batch-oriented and only discover new partitions during a crawl run, so scheduling frequency is critical.

How to eliminate wrong answers

Option A is wrong because if the S3 bucket had insufficient permissions for the Glue crawler, the crawler would fail entirely or produce errors, not selectively miss new partitions while still cataloging existing data. Option B is wrong because the question states that new partitions are not being reflected, not that the table schema has changed; Glue crawlers can update schemas by default unless configured otherwise, and schema changes would cause different symptoms (e.g., type mismatches). Option D is wrong because AWS Glue supports a wide range of data formats (CSV, JSON, Parquet, Avro, ORC, etc.), and if the format were unsupported, the crawler would fail to read the data entirely, not just miss new partitions.

1752

MCQ · hard

A data scientist is training a neural network using a custom loss function. The training process converges, but the model's performance on the validation set is poor. The data scientist suspects that the model is overfitting. Which action should the data scientist take to diagnose overfitting?

A. Plot the training and validation loss over epochs

B. Add more layers to the network

C. Increase the learning rate

D. Compute the confusion matrix on the training set

Answer: A

If training loss decreases while validation loss increases, it indicates overfitting.

Why this answer

Plotting the training and validation loss over epochs is the standard diagnostic technique for detecting overfitting. If the training loss continues to decrease while the validation loss plateaus or increases, it indicates that the model is memorizing the training data rather than generalizing. This visual comparison directly confirms overfitting, allowing the data scientist to take corrective action such as regularization or early stopping.

Exam trap

The exam often tests the misconception that improving training performance (e.g., by adding layers or increasing learning rate) is a valid diagnostic step, when in fact the correct approach is to compare training and validation metrics to detect overfitting.

How to eliminate wrong answers

Option B is wrong because adding more layers increases model capacity, which typically exacerbates overfitting rather than diagnosing it. Option C is wrong because increasing the learning rate can cause training instability or divergence, but it does not help identify whether overfitting is occurring. Option D is wrong because computing the confusion matrix on the training set only shows performance on training data, which is already expected to be high when overfitting; it provides no comparison to validation performance and thus cannot diagnose overfitting.

1753

MCQ · hard

A data scientist is using Amazon SageMaker to train a TensorFlow model on a dataset that includes sensitive personal information (PII). The data is stored in Amazon S3 with server-side encryption using AWS KMS (SSE-KMS). The training job fails with an Access Denied error when trying to read from S3. The data scientist has already verified that the SageMaker execution role has s3:GetObject permissions on the S3 bucket. What additional configuration is needed?

A. Add kms:Decrypt permission to the SageMaker execution role.

B. Add kms:Encrypt permission to the SageMaker execution role.

C. Add a bucket policy that grants s3:GetObject to the SageMaker role.

D. Configure a VPC endpoint for S3 and attach a policy.

Answer: A

SSE-KMS requires decrypt permission to read objects.

Why this answer

Option A is correct because SageMaker needs kms:Decrypt permission to read SSE-KMS encrypted objects. Option B is wrong because SageMaker does not need kms:Encrypt for reading. Option C is wrong because S3 bucket policy is not needed if role has permissions.

Option D is wrong because VPC endpoint policy is not the issue.

1754

MCQ · easy

A data engineer runs the AWS CLI command above to inspect a file in S3. They need to determine if the file was modified after a Glue ETL job processed it. What additional information could they obtain from this command?

A. The object's content type.

B. The object's storage class.

C. The object's last modified timestamp.

D. The object's ETag.

Answer: C

The LastModified field indicates when the object was last modified.

Why this answer

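Option C is correct because head-object returns the LastModified timestamp, which can be compared with the job completion time. Option A is wrong because the content type describes the object's format, not when it changed. Option B is wrong because the storage class describes how the object is stored, not when it changed.

Option D is wrong because the ETag reflects the object's content but does not say when it was modified.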
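
The comparison itself is a simple timestamp check. This sketch assumes the head-object response has already been parsed into a dictionary with a timezone-aware `LastModified` value, as an SDK such as boto3 would provide; the response contents, the job completion time, and the function name `modified_after` are hypothetical.

```python
from datetime import datetime, timezone


def modified_after(head_object_response: dict, job_completed_at: datetime) -> bool:
    """True if the object's LastModified is later than the job completion time."""
    return head_object_response["LastModified"] > job_completed_at


# Hypothetical values: a trimmed head-object style response and a job end time.
response = {
    "LastModified": datetime(2024, 3, 1, 12, 30, tzinfo=timezone.utc),
    "ContentLength": 1024,
    "ETag": '"example-etag"',
}
job_completed_at = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)

was_modified = modified_after(response, job_completed_at)
print(was_modified)
```

Both timestamps should carry time zone information so the comparison is unambiguous.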

1755

MCQ · medium

A data analyst is performing exploratory data analysis on a dataset with 100 features. The analyst wants to identify which features contribute most to the variance in the data. Which technique should the analyst use?

A. K-means clustering

B. Principal Component Analysis (PCA)

C. t-Distributed Stochastic Neighbor Embedding (t-SNE)

D. Linear Discriminant Analysis (LDA)

Answer: B

PCA decomposes the data into components that capture the maximum variance.

Why this answer

Option B is correct because PCA is a dimensionality reduction technique that identifies the directions (principal components) that maximize variance. Option A is wrong because K-means is clustering, not variance analysis. Option C is wrong because t-SNE is for visualization and does not provide variance contributions.

Option D is wrong because LDA is supervised and requires labels.
