MLS-C01 Practice Questions

Question 1

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?

Accepted Answer

The Kinesis stream has too few shards for the data volume. A 'ProvisionedThroughputExceededException' from Amazon Kinesis Data Streams indicates that requests exceed the provisioned capacity of the stream's shards. Each shard supports writes of up to 1 MB/s or 1,000 records/s and reads of up to 2 MB/s. When the clickstream volume outgrows these per-shard limits, requests against the stream are throttled, and the Lambda function that reads from it encounters this exception. Increasing the number of shards scales the stream's capacity to match the data volume.

Answer

The data retention period of the stream is too short.

Answer

The S3 bucket has insufficient write capacity.

Answer

The Lambda function's reserved concurrency is set too high.
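
The shard arithmetic behind Question 1 can be sketched as follows. This is a minimal illustration that uses only the per-shard write limits quoted in the accepted answer (1 MB/s and 1,000 records/s); the function name and the sample traffic figures are hypothetical, and a real sizing exercise would also account for read throughput and traffic spikes.

```python
import math

# Per-shard write limits quoted in the accepted answer for Question 1.
MAX_MB_PER_SECOND_PER_SHARD = 1.0
MAX_RECORDS_PER_SECOND_PER_SHARD = 1000


def shards_needed(mb_per_second: float, records_per_second: float) -> int:
    """Return the smallest shard count that satisfies both write limits.

    Hypothetical helper for illustration only; it is not part of any AWS SDK.
    """
    by_bytes = math.ceil(mb_per_second / MAX_MB_PER_SECOND_PER_SHARD)
    by_records = math.ceil(records_per_second / MAX_RECORDS_PER_SECOND_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_bytes, by_records)


if __name__ == "__main__":
    # Assumed traffic: 3.5 MB/s spread over 2,000 records/s.
    print(shards_needed(3.5, 2000))
```

Whichever limit is hit first drives the shard count, so a stream of many small records can need more shards than its byte rate alone suggests.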

Question 2

A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?

Accepted Answer

Use Amazon EMR with Spark to convert data to Parquet and store in S3, using spot instances for task nodes. This option is correct because converting gzip-compressed CSV to Parquet reduces storage size and improves query performance through columnar storage and predicate pushdown. Using spot instances for task nodes significantly lowers compute cost, and the 30-minute SLA is achievable with Spark on EMR processing 5-minute windows of data.

Answer

Use Amazon EMR with Spark to convert data to Parquet and use on-demand instances.

Answer

Use AWS Glue to convert data to gzip-compressed CSV and query with Athena.

Answer

Use Amazon EMR with Hive to transform data to compressed CSV and store in S3.

Question 3

A data science team is building a real-time fraud detection system. Transactions are streamed via Amazon Kinesis Data Streams, and a Lambda function performs feature engineering and invokes an Amazon SageMaker endpoint for predictions. The team notices that the Lambda function is timing out and causing data loss. Which solution should the team implement to process the stream reliably and at low latency?

Accepted Answer

Use Amazon Kinesis Data Analytics for Apache Flink to consume the stream, perform feature engineering, and invoke the SageMaker endpoint with exactly-once processing. This option is correct because Amazon Kinesis Data Analytics for Apache Flink provides a stateful, low-latency stream processing engine that can consume from Kinesis Data Streams, perform feature engineering in real time, and invoke SageMaker endpoints with exactly-once processing semantics. It eliminates Lambda timeouts and data loss by using a long-running, scalable application instead of a short-lived function.

Answer

Use the Kinesis Client Library (KCL) to process the stream in an Amazon EC2 instance, and store the predictions in Amazon DynamoDB.

Answer

Increase the Lambda function timeout to 15 minutes and allocate more memory to reduce processing time.

Answer

Configure Amazon Kinesis Firehose to deliver the stream to an Amazon S3 bucket, then trigger a Lambda function to process the data in batches.

Question 4

A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?

Accepted Answer

Configure the SageMaker training job to use Pipe mode, which streams data directly from S3 without downloading to the instance's local storage. Pipe mode in SageMaker streams training data directly from Amazon S3 to the training algorithm without first downloading it to the instance's local storage. This eliminates the data download step, significantly reducing startup time and overall training time for large datasets such as this 10 TB one. It is the most cost-effective because it avoids the need for larger instances or additional data transfer acceleration services.

Answer

Use S3 Transfer Acceleration to speed up the data transfer from S3 to the training instance.

Answer

Use larger EC2 instances with more vCPUs and memory to speed up the training process.

Answer

Enable Elastic Fabric Adapter (EFA) on the training instances to improve network throughput.

Question 5

A data engineer is building a data pipeline to process user clickstream data. The data arrives as JSON files in an S3 bucket. The pipeline must transform the JSON into Parquet format and partition by date and event type, then make the data available for Amazon Athena queries. The engineer needs a fully managed, serverless solution with minimal operational overhead. Which combination of AWS services should the engineer use?

Accepted Answer

Use S3 Event Notifications to trigger an AWS Lambda function that converts the JSON to Parquet and writes to a partitioned S3 location, then query with Athena. This option is correct because AWS Lambda triggered by S3 Event Notifications provides a fully serverless, event-driven architecture with minimal operational overhead for converting JSON to Parquet and partitioning by date and event type. Lambda can process each new JSON file as it arrives, perform the transformation in memory (using libraries like PyArrow or Pandas), and write the Parquet output to a partitioned S3 path, which Athena can then query directly. This approach avoids managing any clusters or job scheduling, aligning with the requirement for a fully managed, serverless solution.

Answer

Use Amazon EMR with Spark to read JSON, convert to Parquet, and partition, then query with Athena.

Answer

Use AWS Glue ETL jobs to read JSON from S3, transform to Parquet, and write to a partitioned S3 location, then use Athena.

Answer

Use Amazon Kinesis Firehose to ingest data and convert to Parquet, then write to S3, and query with Athena.
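
The partitioning step described in Question 5 can be sketched in plain Python. This is an illustration under stated assumptions: the record fields `timestamp` and `event_type`, the partition column name `dt`, and both helper names are hypothetical, and the Parquet encoding itself (for example with PyArrow, as the explanation mentions) is left out so the sketch stays dependency-free.

```python
import json
from collections import defaultdict
from datetime import datetime


def partition_prefix(record: dict) -> str:
    """Build a Hive-style partition prefix such as dt=2024-01-15/event_type=click.

    Assumes each record carries an ISO 8601 'timestamp' and an 'event_type'.
    """
    event_date = datetime.fromisoformat(record["timestamp"]).date()
    return f"dt={event_date.isoformat()}/event_type={record['event_type']}"


def group_by_partition(json_lines):
    """Group newline-delimited JSON records by their target partition prefix."""
    groups = defaultdict(list)
    for line in json_lines:
        record = json.loads(line)
        groups[partition_prefix(record)].append(record)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        '{"timestamp": "2024-01-15T10:30:00", "event_type": "click", "user": "u1"}',
        '{"timestamp": "2024-01-15T11:00:00", "event_type": "view", "user": "u2"}',
        '{"timestamp": "2024-01-16T09:05:00", "event_type": "click", "user": "u1"}',
    ]
    for prefix, records in group_by_partition(sample).items():
        print(prefix, len(records))
```

Each group would then be written as one Parquet object under its prefix, which lets Athena prune partitions when a query filters on date or event type.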

Question 6

A data engineering team is designing a data lake on AWS for machine learning workloads. The data includes structured, semi-structured, and unstructured data. The team needs to ensure that the data is cataloged, easily discoverable, and can be queried by Amazon Athena and Amazon EMR. The team also wants to enforce fine-grained access control at the column and row level for sensitive data. Which combination of AWS services should the team use? (Select TWO.)

Accepted Answer

AWS Lake Formation. AWS Lake Formation is correct because it provides a centralized service to build, secure, and manage data lakes on AWS. It enables fine-grained access control at the column and row level for sensitive data, which directly meets the requirement for enforcing such controls. Additionally, Lake Formation integrates with Amazon Athena and Amazon EMR for querying and processing the cataloged data.

Answer

AWS Identity and Access Management (IAM)

Answer

Amazon RDS for PostgreSQL

Answer

Amazon DynamoDB

Question 7

A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?

Accepted Answer

Create a Glue ETL job triggered by an S3 event notification via Lambda. This option is correct because it uses an S3 event notification to invoke a Lambda function, which then starts an AWS Glue ETL job only when new data arrives. This event-driven architecture keeps costs down by avoiding continuous or scheduled runs, and the job transforms the raw JSON into Parquet format as required.

Answer

Create a Glue crawler that runs continuously.

Answer

Schedule a Glue ETL job to run every hour.

Answer

Use Glue DataBrew to transform data and schedule it daily.
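
The Lambda function in the accepted answer to Question 7 might look roughly like the sketch below. It relies on the boto3 Glue client's `start_job_run` call and the standard S3 event notification layout; the job name `json-to-parquet`, the `--source_path` argument, and the optional `glue` parameter (added so the handler can be exercised without AWS credentials) are assumptions made for illustration.

```python
from urllib.parse import unquote_plus

# Hypothetical Glue job name; replace with the job that converts JSON to Parquet.
GLUE_JOB_NAME = "json-to-parquet"


def handler(event, context, glue=None):
    """Start one Glue job run for each object reported in an S3 event."""
    if glue is None:
        # Imported lazily so the module loads even where boto3 is absent.
        import boto3

        glue = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            # '--source_path' is an assumed job argument, not a Glue built-in.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return {"job_run_ids": run_ids}
```

Because the job starts only when an object lands in the bucket, no compute is billed while the bucket is idle.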

Question 8

A company uses AWS Glue to catalog data in S3. Data is partitioned by year, month, day. The Glue crawler runs daily but sometimes misses new partitions. What should be done to ensure all partitions are cataloged?

Accepted Answer

Enable partition indexing in the Glue table properties. This is the accepted answer because, according to the explanation given, enabling partition indexing in the Glue table properties allows the Glue Data Catalog to automatically discover and register new partitions as they are added to S3, without relying solely on crawler runs. The feature uses the Hive-style partition structure (e.g., year=2024/month=01/day=15) to index partitions, so that new partitions are still cataloged via the partition index even if the crawler misses a run.

Answer

Use a custom classifier to detect partition patterns.

Answer

Increase the crawler schedule to run every hour.

Answer

Configure the crawler to update all partitions on each run.

Question 9

A company needs to build a data lake on AWS for analytics. The data includes structured, semi-structured, and unstructured data. The solution must support schema-on-read, provide fine-grained access control, and be cost-effective for storing rarely accessed data. Which THREE services should be used? (Choose THREE)

Accepted Answer

AWS Glue Data Catalog for schema-on-read. AWS Glue Data Catalog is correct because it provides a centralized metadata repository that enables schema-on-read for data stored in Amazon S3. It allows you to define table schemas and partitions without transforming the underlying data, so analytics tools like Amazon Athena and Amazon EMR can query the data with the schema applied at read time.

Answer

Amazon Redshift for data warehousing.

Answer

Amazon EMR for data processing.

Question 10

A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?

Accepted Answer

The policy does not grant s3:PutObject on the bucket itself, which is needed for some write operations. The error 'Access Denied' when writing to S3 with SSE-S3 encryption typically occurs because the IAM policy lacks the `s3:PutObject` permission on the bucket resource itself. While the policy may grant `s3:PutObject` on the object ARN (`arn:aws:s3:::bucket/*`), some S3 write operations, especially those involving encryption headers or bucket-level checks, also require the permission on the bucket ARN (`arn:aws:s3:::bucket`). Without this, the request is denied even if the object-level permission exists.

Answer

The policy grants s3:PutObject on all buckets, not just the specific one.

Answer

The condition requires objects to be encrypted with SSE-KMS, but the job uses SSE-S3.

Answer

The condition requires objects to use SSE-S3, but the job uses SSE-KMS.

Question 11

A company runs a real-time fraud detection system using Amazon Kinesis Data Streams with 100 shards. Data is consumed by a custom Java application running on Amazon EC2 instances in an Auto Scaling group. The application processes records and writes results to a DynamoDB table. Over the past month, the application has experienced intermittent slowdowns and the DynamoDB write capacity has been fully utilized during peak hours. The team wants to improve throughput without losing the ability to reprocess failed records. The application currently uses the Kinesis Client Library (KCL) with DynamoDB as the lease table. The team is considering the following changes: A. Increase the number of EC2 instances to match the number of shards. B. Switch to using AWS Lambda as the consumer to handle scaling automatically. C. Increase the write capacity of the DynamoDB lease table to handle more workers. D. Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput. Which change should the team implement first to address the issue?

Accepted Answer

Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput. The primary bottleneck is DynamoDB write capacity being fully utilized during peak hours. Enhanced fan-out (option D) provides each consumer with a dedicated 2 MB/second read throughput per shard, eliminating the need for consumers to contend for the shared 2 MB/second per shard. This reduces the load on the DynamoDB lease table because workers no longer need to poll for records, which in turn lowers the write operations to the lease table and alleviates the DynamoDB write capacity issue.

Answer

Increase the write capacity of the DynamoDB lease table to handle more workers.

Answer

Switch to using AWS Lambda as the consumer to handle scaling automatically.

Answer

Increase the number of EC2 instances to match the number of shards.

Question 12

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?

Accepted Answer

For each highly correlated pair, remove one feature based on domain knowledge or higher correlation with target. This option is correct because highly correlated features (e.g., correlation > 0.95) introduce multicollinearity, which can destabilize coefficient estimates in linear models and reduce interpretability. Removing one feature from each correlated pair, chosen by domain knowledge or by its correlation with the target variable, preserves predictive power while reducing redundancy. This approach is more targeted than PCA, which transforms features into uncorrelated components but sacrifices interpretability and may not align with the binary target.

Answer

Apply PCA to all features to decorrelate them.

Answer

Standardize all features using StandardScaler.

Answer

Randomly drop half of the correlated features.
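
The pruning rule in the accepted answer to Question 12 can be sketched in dependency-free Python. This is one possible reading of the rule, not a definitive implementation: within each pair whose absolute Pearson correlation exceeds the threshold, it drops the feature that is less correlated with the target. The function names and the toy data are hypothetical, and the domain-knowledge criterion mentioned in the answer cannot be automated this way.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def drop_correlated(features, target, threshold=0.95):
    """Return the feature names kept after pruning highly correlated pairs.

    'features' maps a feature name to its column of values.
    """
    names = list(features)
    target_corr = {name: abs(pearson(features[name], target)) for name in names}
    dropped = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first in dropped or second in dropped:
                continue
            if abs(pearson(features[first], features[second])) > threshold:
                # Keep whichever feature tracks the target more closely.
                weaker = first if target_corr[first] < target_corr[second] else second
                dropped.add(weaker)
    return [name for name in names if name not in dropped]


if __name__ == "__main__":
    toy = {
        "height_cm": [150, 160, 170, 180, 190, 200],
        "height_noisy": [151, 158, 172, 179, 193, 198],
        "age": [30, 55, 22, 41, 60, 35],
    }
    print(drop_correlated(toy, target=[0, 0, 0, 1, 1, 1]))
```

With 1,000 features the pairwise loop is still cheap at 10,000 rows, though a vectorized correlation matrix would be the usual choice in practice.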

Question 13

Match each hyperparameter tuning strategy to its description.

Question 14

Match each AWS AI service to its capability.

Question 15

A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?

Accepted Answer

The imputed values may reduce the variance of the 'age' distribution. Imputing missing values with the median of the observed data artificially concentrates imputed values around the center of the distribution. This reduces the overall variance of the 'age' column because the imputed values do not reflect the natural spread of the data, potentially distorting downstream analyses like regression or clustering that rely on variance structure.

Answer

The imputation will introduce bias if the missing values are not random.

Answer

Imputation using median is computationally expensive for large datasets.

Answer

The imputed values will increase the variance of the feature, leading to overfitting.
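
The variance effect described in Question 15 is easy to check on a small example. The sketch below uses made-up ages with 30% of the values missing and compares the population variance of the observed values with that of the column after median imputation; the data and the helper name are illustrative assumptions.

```python
import statistics


def impute_median(values):
    """Replace each None with the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


# Hypothetical 'age' column: 3 of 10 values (30%) are missing.
ages = [22, 25, None, 31, 38, None, 44, 52, None, 67]

observed_ages = [a for a in ages if a is not None]
imputed_ages = impute_median(ages)

variance_before = statistics.pvariance(observed_ages)
variance_after = statistics.pvariance(imputed_ages)

if __name__ == "__main__":
    print(f"variance of observed values: {variance_before:.1f}")
    print(f"variance after imputation:   {variance_after:.1f}")
```

Every imputed row sits at the same central value, so the column's spread shrinks even though its median is unchanged.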

Top MLS-C01 questions

An e-commerce company uses a linear regression model to predict customer lifetime value (LTV). The model shows high variance on the test set, with training RMSE much lower than test RMSE. Which of the following is the MOST effective approach to reduce overfitting?

A data scientist is training a deep learning model using Amazon SageMaker. The training loss is decreasing, but the validation loss starts increasing after 10 epochs. The model is overfitting. Which TWO actions should the data scientist take to reduce overfitting? (Choose 2.)

During EDA, a data scientist notices that a feature has a high proportion of missing values (e.g., 70%). The feature is continuous and expected to be important based on domain knowledge. What is the best approach to handle this?

During EDA, a data scientist creates a scatter matrix of numerical features and notices that some features have a funnel-shaped pattern (variance increases with the mean). What is the appropriate transformation to stabilize variance?

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous feature?

Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?
