How many Data Engineering questions are on the MLS-C01 exam?

The Data Engineering domain is one of the weighted domains on the MLS-C01 exam. The Courseiva question bank has 374 practice questions for this domain, 300 of which are listed on this page.

How can I practice Data Engineering questions for MLS-C01?

Click any of the 300 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Data Engineering domain.

Free MLS-C01 Data Engineering Practice Questions (2026)

What does the Data Engineering domain cover on the MLS-C01 exam?

The Data Engineering domain covers the key concepts and skills tested in this area of the MLS-C01 exam blueprint published by Amazon Web Services.

MLS-C01 Data Engineering questions (showing 300 of 374)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by a Lambda function that writes records to an S3 bucket. Recently, the number of shards was increased from 2 to 4 to handle higher throughput. After the change, the Lambda function started processing records with increased latency and some records were being written out of order. What is the MOST likely cause?

A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be partitioned by year, month, and day. The delivery stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The data producer sends about 1 MB per second. The data is arriving in S3 but the partitions are not being created as expected. What is the MOST likely reason?
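The buffering numbers in the Firehose scenario above can be sanity-checked with a short sketch. It assumes the documented Firehose behavior that a buffer is delivered when either the size hint or the interval is reached, whichever comes first, and it assumes a steady ingest rate. The function name is hypothetical and is not part of any AWS SDK.

```python
def seconds_until_flush(buffer_size_mb, buffer_interval_s, ingest_rate_mb_per_s):
    """Return (seconds, trigger) for the first buffer flush at a steady ingest rate.

    Assumption: delivery happens when the buffer size OR the buffer interval
    is reached, whichever comes first.
    """
    if ingest_rate_mb_per_s <= 0:
        # No data arriving: only the interval can trigger a flush.
        return float(buffer_interval_s), "interval"
    time_to_fill = buffer_size_mb / ingest_rate_mb_per_s
    if time_to_fill < buffer_interval_s:
        return time_to_fill, "size"
    return float(buffer_interval_s), "interval"


# Values from the scenario: 5 MB buffer, 60 s interval, about 1 MB/s of input.
seconds, trigger = seconds_until_flush(5, 60, 1.0)
print(f"flush after about {seconds:.0f} s, triggered by buffer {trigger}")
```

Under these assumptions the size threshold is hit long before the 60-second interval, so the interval setting rarely decides when objects land in S3.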

An ML team is building a recommendation system. The training data includes user-item interactions stored in Amazon DynamoDB. The team wants to export this data to S3 in Parquet format for use with Amazon SageMaker. The export should be incremental (only new or changed records) and run daily. Which approach meets these requirements with MINIMAL operational overhead?

A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses a ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?

A company uses AWS Glue ETL jobs to process data from multiple sources. The job fails with the error: 'An error occurred while calling o123.pyWriteDynamicFrame. Insufficient memory.' The job runs on a G.1X worker type with 10 workers. What should be changed to resolve this error?

A company uses Amazon Redshift as a data warehouse. They need to load 50 TB of clickstream data from S3 into Redshift daily. The data arrives in 5-minute intervals as gzipped CSV files. The target table has a sort key and a distribution key. The load must complete within 2 hours. Which approach is MOST efficient?

A machine learning engineer needs to process a large dataset that does not fit on a single Amazon SageMaker notebook instance's EBS volume. The data is stored in S3. What is the MOST efficient way to access the data from the notebook?

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to a sink. The application is failing with an 'OutOfMemoryError'. The application has parallelism set to 4 and uses 1 Kinesis Processing Unit (KPU). What is the MOST likely cause and solution?

An organization stores sensitive customer data in S3. A data pipeline uses AWS Glue to transform the data and load it into Amazon Redshift. The security team requires that data be encrypted at rest in S3 and in transit between S3 and Glue, and between Glue and Redshift. Which configuration meets these requirements?

A data science team is building a real-time fraud detection system. Transactions are streamed via Amazon Kinesis Data Streams, and a Lambda function performs feature engineering and invokes an Amazon SageMaker endpoint for predictions. The team notices that the Lambda function is timing out and causing data loss. Which solution should the team implement to process the stream reliably and at low latency?

A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?

A data engineer is building a data pipeline to process user clickstream data. The data arrives as JSON files in an S3 bucket. The pipeline must transform the JSON into Parquet format and partition by date and event type, then make the data available for Amazon Athena queries. The engineer needs a fully managed, serverless solution with minimal operational overhead. Which combination of AWS services should the engineer use?

A team is using Amazon SageMaker to train a model on a dataset that is 500 GB in size, stored as CSV files in S3. The training job takes 2 hours using a single ml.p3.2xlarge instance. The team wants to reduce training time to under 30 minutes. The model architecture supports distributed training. Which solution will achieve this goal with the LEAST amount of code changes?

A company processes large streams of IoT sensor data using Amazon Kinesis Data Streams with 100 shards. Each sensor reading is about 1 KB. The data is consumed by an Amazon EMR cluster running Spark Streaming jobs. The team notices that the Spark Streaming job's processing time is gradually increasing, and the stream is falling behind. They suspect the issue is due to skewed data distribution across shards. Which approach should the team take to diagnose and resolve the issue?
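A rough way to reason about the shard scenario above is to compare aggregate capacity with how evenly records are spread. The sketch below assumes the provisioned-mode write limits of 1 MB/s and 1,000 records/s per shard. The ingest rate and per-shard counts are made-up illustration values, and both function names are hypothetical.

```python
import math

# Assumed per-shard write limits for a provisioned Kinesis data stream.
SHARD_WRITE_MB_PER_S = 1.0
SHARD_WRITE_RECORDS_PER_S = 1000


def min_shards(records_per_s, record_size_kb):
    """Smallest shard count that satisfies both the record and byte limits."""
    by_count = math.ceil(records_per_s / SHARD_WRITE_RECORDS_PER_S)
    by_bytes = math.ceil(records_per_s * record_size_kb / 1024 / SHARD_WRITE_MB_PER_S)
    return max(by_count, by_bytes)


def hottest_shard_share(records_per_shard):
    """Fraction of all records that landed on the busiest shard."""
    total = sum(records_per_shard)
    return max(records_per_shard) / total if total else 0.0


# Hypothetical load: 100,000 readings per second at 1 KB each.
print(min_shards(100_000, 1))
# Hypothetical per-shard counts showing one hot shard.
print(hottest_shard_share([10, 10, 80]))
```

A stream can have enough shards in aggregate and still fall behind if one partition key dominates, which is why the per-shard share matters as much as the total.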

A data engineering team is designing a data lake on AWS for machine learning workloads. The data includes structured, semi-structured, and unstructured data. The team needs to ensure that the data is cataloged, easily discoverable, and can be queried by Amazon Athena and Amazon EMR. The team also wants to enforce fine-grained access control at the column and row level for sensitive data. Which combination of AWS services should the team use? (Select TWO.)

A company is building a real-time anomaly detection system for network traffic logs. The logs are ingested via Amazon Kinesis Data Streams and processed with an Amazon SageMaker endpoint for inference. The team needs to ensure that the inference results are stored durably and can be replayed for model retraining. The system must handle at least 10,000 records per second with low latency. Which three AWS services should the team use to build this architecture? (Select THREE.)

A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?

A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?

A company uses AWS Glue to catalog data in S3. Data is partitioned by year, month, day. The Glue crawler runs daily but sometimes misses new partitions. What should be done to ensure all partitions are cataloged?

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Analytics for Apache Flink. The pipeline reads from a Kinesis data stream and writes to an S3 bucket. The job must recover quickly from failures without reprocessing large amounts of data. Which TWO configurations should be used? (Choose TWO.)

A company needs to build a data lake on AWS for analytics. The data includes structured, semi-structured, and unstructured data. The solution must support schema-on-read, provide fine-grained access control, and be cost-effective for storing rarely accessed data. Which THREE services should be used? (Choose THREE)

A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?

A company runs a real-time fraud detection system using Amazon Kinesis Data Streams with 100 shards. Data is consumed by a custom Java application running on Amazon EC2 instances in an Auto Scaling group. The application processes records and writes results to a DynamoDB table. Over the past month, the application has experienced intermittent slowdowns and the DynamoDB write capacity has been fully utilized during peak hours. The team wants to improve throughput without losing the ability to reprocess failed records. The application currently uses the Kinesis Client Library (KCL) with DynamoDB as the lease table. The team is considering the following changes:
A. Increase the number of EC2 instances to match the number of shards.
B. Switch to using AWS Lambda as the consumer to handle scaling automatically.
C. Increase the write capacity of the DynamoDB lease table to handle more workers.
D. Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput.
Which change should the team implement first to address the issue?

A retail company runs an e-commerce platform on AWS. They have a Data Engineering team that processes clickstream data using Amazon Kinesis Data Streams (KDS) with a shard count of 5. The data is consumed by an AWS Lambda function that transforms and loads the data into an Amazon S3 bucket partitioned by year/month/day/hour. Recently, the team has noticed that the Lambda function is experiencing throttling errors, and the KDS shard iterator age is increasing, indicating that the consumer cannot keep up with the incoming data rate. The team has already increased the Lambda reserved concurrency to 1000 and enabled batch window of 60 seconds. The metrics show that the Lambda function duration is well under the 5-minute timeout, and there are no errors in the transformation logic. The S3 write operations are not failing. Which course of action would MOST effectively resolve the issue without unnecessary cost or complexity?

Drag and drop the steps to create an Amazon SageMaker notebook instance in the correct order.

Drag and drop the steps to perform hyperparameter tuning using SageMaker Automatic Model Tuning in the correct order.

Drag and drop the steps to set up Amazon SageMaker Ground Truth for a labeling job in the correct order.

Drag and drop the steps to set up cross-validation in a SageMaker training job using the built-in XGBoost algorithm in the correct order.

Match each AWS service to its primary purpose in a machine learning pipeline.

Match each AWS security service to its function in ML.

Match each data format to its typical use in AWS ML.

Match each SageMaker optimization technique to its description.

A data engineering team needs to process streaming data from thousands of IoT devices. They want to aggregate data in 1-minute windows and store results in an S3 data lake for downstream analytics. Which architecture should they use?

A company uses AWS Glue ETL jobs to transform CSV data from an S3 bucket into Parquet. The jobs often fail with memory errors when processing large datasets. They want to minimize cost and improve reliability. What should they do?

A machine learning team needs to create a training dataset by joining two large datasets (10 TB and 5 TB) stored in S3. The join key is 'user_id'. They want to minimize data movement and cost. Which approach should they use?

A company uses an Amazon SageMaker notebook to train a model using data from an S3 bucket. The IAM role attached to the notebook has the following policy. What is the MOST specific change needed to allow the notebook to read from the bucket 'ml-data-123'?

A data engineer needs to transform raw clickstream data (JSON files) stored in S3 into a partitioned Parquet dataset for querying with Athena. The transformation includes cleaning, deduplication, and enrichment. The pipeline should run daily. Which solution is MOST cost-effective and requires the least operational overhead?

A company uses Kinesis Data Streams to ingest real-time sensor data. The data is consumed by a Lambda function that writes to DynamoDB. During peak hours, the Lambda function throws ProvisionedThroughputExceededException. The team wants to decouple the write operation and improve resilience. What should they do?

A data engineer needs to design a data pipeline that ingests CSV files from an SFTP server daily, transforms them, and loads them into Amazon Redshift. The files are typically 2-3 GB. Which combination of AWS services is MOST appropriate?

A team stores raw data in S3 and uses a Glue Data Catalog for metadata. They want to allow data scientists to query the data with Amazon Athena using their existing IAM roles. What is the MINIMUM set of permissions required?

A company is building a near-real-time dashboard using data from multiple sources. They need to aggregate millions of events per second with sub-second latency. The architecture must be fully managed and minimize operational overhead. Which service should they use for the aggregation layer?

A data engineer needs to set up a data lake on S3 that supports both batch and streaming ingestion. The data must be queryable by Athena, Redshift Spectrum, and EMR. Which TWO configurations are essential? (Choose two.)

A team wants to move data from an on-premises Oracle database to Amazon S3 for analytics. The pipeline must run daily and handle incremental updates. Which THREE services should they use together? (Choose three.)

A company uses Amazon Kinesis Data Streams to ingest clickstream data. They need to archive raw data to S3 every hour and also enable real-time processing with sub-second latency. Which TWO actions should they take? (Choose two.)

An IAM policy attached to a SageMaker notebook role is shown. The data engineer tries to run an Athena query on a table in the 'my_database' Glue database. The query fails with an access denied error. What is the MOST likely cause?

A data engineer runs the AWS CLI command above to inspect a file in S3. They need to determine if the file was modified after a Glue ETL job processed it. What additional information could they obtain from this command?

A data engineer created a CloudFormation template for a Glue ETL job as shown. The job processes 500 GB of data and takes 90 minutes to complete. However, the job fails after 60 minutes. What is the MOST likely cause?

A data science team needs to process streaming data from thousands of IoT devices and perform real-time anomaly detection. The data must be persisted in Amazon S3 for batch processing later. Which combination of AWS services should be used to meet these requirements?

A company uses Amazon Redshift for its data warehouse. The data engineering team notices that queries are slow and wants to improve performance without changing the schema. Which action is most likely to improve query performance?

A data pipeline uses AWS Glue to transform data from Amazon RDS to Amazon S3. The team wants to ensure that only new or updated records are processed in each run, minimizing cost and time. Which AWS Glue feature should be used?

A company is using Amazon SageMaker to train machine learning models. The training data is stored in Amazon S3, but the data includes personally identifiable information (PII) that must be anonymized before training. What is the most efficient way to anonymize the data?

A data engineer needs to move 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. Which AWS service should be used to transfer the data most efficiently?
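Several questions on this page hinge on how long a bulk transfer takes over a given link. The arithmetic can be sketched as below, assuming decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and, by default, a fully utilized link with no protocol overhead. The function name and the `utilization` parameter are hypothetical.

```python
def transfer_days(data_tb, link_mbps, utilization=1.0):
    """Days needed to move data_tb terabytes over a link_mbps link.

    Assumptions: decimal units and a constant effective rate of
    link_mbps * utilization.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86400


# 50 TB over 100 Mbps, as in the scenario above.
print(round(transfer_days(50, 100), 1))   # 46.3
# The same 50 TB over a 1 Gbps link.
print(round(transfer_days(50, 1000), 1))  # 4.6
```

Real transfers rarely sustain the full line rate, so passing a lower `utilization` (for example 0.8) gives a more conservative estimate.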

A team is building a data lake on Amazon S3 and using AWS Glue to catalog data. They notice that Glue crawlers are taking too long to update the catalog for a large dataset with millions of small files. Which approach will MOST improve crawler performance?

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format, and the company wants to convert it to Parquet for efficient querying. Which configuration should be used?

A data engineer needs to run a one-time ETL job to transform 500 GB of data from Amazon RDS to Amazon S3. The job should be cost-effective and require minimal infrastructure management. Which AWS service should be used?

A company uses Amazon DynamoDB as the primary data store for a real-time application. The data science team wants to analyze the data using Amazon Athena. What is the most efficient way to make the DynamoDB data available for Athena queries?

A company is designing a data pipeline that ingests streaming data from social media feeds. The data must be processed in real-time to detect trending topics, and results must be stored in Amazon DynamoDB for low-latency access. Which services should the company use? (Choose TWO.)

A data engineer needs to transform and move 2 TB of data from an Amazon RDS for PostgreSQL instance to Amazon S3 daily. The transformation includes filtering, joining with data in S3, and aggregating. Which AWS services can be used together to accomplish this with minimal operational overhead? (Choose THREE.)

A company wants to build a data lake on Amazon S3. The data lake should support both batch and real-time data ingestion. Which AWS services should be used for data ingestion? (Choose TWO.)

A data engineer wants to stream clickstream data from a web application to Amazon S3 for near-real-time analytics. Which AWS service should be used to ingest and buffer the data before landing in S3?

A machine learning team needs to process a large dataset stored in Amazon S3 using Apache Spark. They want to minimize cost and avoid managing infrastructure. Which AWS service should they use?

A company uses AWS Glue to run ETL jobs on a daily schedule. The jobs are failing intermittently with 'OutOfMemory' errors. The data volume has grown 5x over the past month. Which is the MOST cost-effective fix?

A data scientist needs to query a dataset stored as Parquet files in Amazon S3 using standard SQL without managing any infrastructure. Which service should they use?

A team wants to build a data pipeline that processes incoming JSON files from an S3 bucket and loads them into a Redshift table. The pipeline must handle schema evolution and data validation. Which combination of services would be MOST appropriate?

A company uses Amazon Kinesis Data Analytics for real-time anomaly detection on a stream of IoT sensor data. The application is experiencing high latency. The data volume has doubled. Which action would MOST effectively reduce latency?

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The company has a 1 Gbps internet connection. Which service would complete the transfer in the shortest time?

A company is building a data lake on Amazon S3. They need to enforce encryption at rest for all objects. Which combination of actions will achieve this? (Assume the bucket is versioned.)

A company uses Amazon EMR with Spark to process data daily. The job reads from S3 and writes to S3. Recently, the job started failing with 'S3AccessDenied' errors. The IAM role used by EMR has not changed. What is the MOST likely cause?

A company is designing a data pipeline to ingest data from multiple sources into an Amazon S3 data lake. The data must be encrypted at rest and in transit. Which TWO actions should be taken to meet these requirements?

A data engineering team uses AWS Glue to run ETL jobs. They notice that jobs are taking longer to complete as data volume grows. They want to optimize performance without increasing cost significantly. Which THREE strategies should they consider?

A company wants to analyze streaming data from IoT devices in near-real-time. They need to store raw data in Amazon S3 and also run SQL queries on the streaming data. Which TWO services should they use?

An IAM policy is attached to a group. A user in the group tries to read the object s3://data-lake-bucket/sensitive/file.txt from an IP address 192.168.1.1. What will happen?

A data engineer runs the CLI command to download an object from S3. The bucket owner is 123456789012, and the engineer's IAM user has s3:GetObject permission on the bucket. The object was uploaded by a different AWS account. What is the MOST likely reason for the AccessDenied error?

An S3 event notification is configured to trigger a Lambda function when new objects are created. The Lambda function processes the event JSON shown. Which field should the function use to read the new object from S3?
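The event JSON referenced in this question is not reproduced on this page. As a rough illustration of the standard S3 notification layout, the sketch below reads the bucket name and object key from each record and decodes the key, which S3 delivers URL-encoded. The sample event is a trimmed, made-up example, and the function name is hypothetical.

```python
from urllib.parse import unquote_plus


def object_locations(event):
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    locations = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (spaces become '+'), so decode before use.
        key = unquote_plus(record["s3"]["object"]["key"])
        locations.append((bucket, key))
    return locations


# Trimmed, hypothetical event for illustration only.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "raw/2026/click+stream%3D1.json"},
            },
        }
    ]
}

print(object_locations(sample_event))
```

Skipping the decoding step is a common source of "key not found" errors when object names contain spaces or special characters.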

A data engineer needs to ingest streaming data from an on-premises Kafka cluster into Amazon S3 with minimal operational overhead. Which AWS service should be used to stream the data into S3 without managing servers?

A company is using AWS Glue ETL jobs to process data stored in Amazon S3. The jobs currently run sequentially and take too long. The data engineer wants to reduce job duration without rewriting the code. Which action is most effective?

A data science team uses Amazon SageMaker to train models on a dataset stored in Amazon S3. The dataset is 2 TB and is accessed by multiple training jobs. The team notices that training jobs are slow due to high S3 GET request latency. Which solution would provide the fastest and most cost-effective data access?

A company runs a daily ETL job that reads data from Amazon RDS, transforms it using AWS Glue, and writes the results to Amazon S3. The job started failing yesterday with the error: 'Rate exceeded'. What is the most likely cause and solution?

A company wants to analyze historical data stored in Amazon S3 using Amazon Athena. The data is in CSV format and is partitioned by date. Which action will provide the best query performance and cost optimization?

A company uses AWS Lake Formation to manage permissions on a data lake stored in Amazon S3. A data analyst tries to query a table using Amazon Athena but receives an 'Access Denied' error. The analyst has SELECT permission on the table in Lake Formation. What is the most likely cause?

A data pipeline uses Amazon Kinesis Data Streams with a Lambda consumer to process clickstream data. The Lambda function sometimes times out because of spikes in traffic. The team wants to buffer the data before processing to handle spikes. Which approach is most effective?

A company runs a nightly AWS Glue ETL job that processes data from an Amazon Redshift cluster and writes to Amazon S3. The job fails intermittently with 'ERROR: cannot execute INSERT in a read-only transaction'. What is the most likely cause?

A company uses Amazon EMR to run Spark jobs on a large dataset stored in Amazon S3. The jobs are failing with 'OutOfMemoryError' in the executors. The data is not skewed. Which configuration change will most likely resolve the issue?

A data engineer is designing a data ingestion pipeline that will receive up to 5 GB of data per hour from thousands of IoT devices. The data must be stored in Amazon S3 and analyzed in near real-time. Which TWO services should be used together to meet these requirements? (Choose TWO.)

A company needs to transfer 10 TB of data from an on-premises data center to Amazon S3. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 5 days. Which TWO options are viable? (Choose TWO.)

A company uses Amazon Redshift for data warehousing. The data engineering team notices that query performance has degraded over time. Which THREE actions should the team take to improve performance? (Choose THREE.)

A data engineer is building a data pipeline that ingests streaming data from IoT devices. The data must be processed in near real-time and stored in Amazon S3 for further analysis. Which AWS service should be used to capture and process the streaming data before storing it in S3?

A machine learning team needs to preprocess large volumes of clickstream data stored in Amazon S3 before training a model. The preprocessing includes data cleaning, feature engineering, and normalization. The team wants to use a serverless solution that minimizes operational overhead. Which combination of services should the team use?

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time data. The data source is a Kinesis data stream, and the output is written to an S3 bucket. Recently, the processing latency has increased significantly. The team suspects that the Flink application is encountering backpressure. Which metric should the team monitor to confirm backpressure?

A data scientist wants to query a dataset stored in Amazon S3 using standard SQL without provisioning any servers. The dataset is in CSV format and is updated daily. Which AWS service should be used?

A company is building a data pipeline to process sensitive customer data. The pipeline uses AWS Glue for ETL and stores results in Amazon S3. The security team requires that all data be encrypted at rest in S3 using customer-managed AWS KMS keys. Additionally, the Glue job must be able to write encrypted data to S3. What should the data engineer do to meet these requirements?

A large e-commerce company is using Amazon DynamoDB as the source for real-time analytics. The data is streamed to Amazon Kinesis Data Streams using DynamoDB Streams and then processed by an AWS Lambda function. The Lambda function writes the data to an Amazon Elasticsearch Service cluster for search and visualization. Recently, the Lambda function has been failing with throttling errors from the Elasticsearch cluster. What is the MOST effective way to handle this?

A company is using Amazon S3 as a data lake. The data engineering team needs to catalog the schema of the data and make it available for querying with Amazon Athena. Which AWS Glue component should be used?

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 100 Mbps connection to AWS. The transfer must be completed within one week. Which approach should the engineer use?

A company is using Amazon Redshift for data warehousing. The data engineering team notices that queries are slow and the system is frequently writing to disk due to insufficient memory. Which type of workload management (WLM) configuration change would help reduce disk writes?

Which TWO AWS services can be used to move data from an on-premises database to Amazon S3 on a recurring schedule without writing custom code? (Choose 2.)

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose 3.)

Which TWO AWS services can be used to schedule and orchestrate a data pipeline that includes multiple steps such as data extraction, transformation, and loading? (Choose 2.)

A data engineer needs to analyze large CSV files stored in Amazon S3 using SQL queries. The data is not frequently accessed, and cost is a primary concern. Which AWS service should be used to query the data directly in S3 without moving it?

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be transformed before being stored in Amazon S3. The transformations include enrichment with reference data from Amazon DynamoDB. Which AWS service should be used to perform the transformation with minimal operational overhead?

A data engineering team is designing a data lake on Amazon S3. Raw data is ingested in JSON format and must be partitioned by year, month, and day. The team expects high query performance for recent data but infrequent queries for older data. The data is immutable. Which storage tier configuration minimizes costs while meeting performance requirements?

A data engineer is tasked with building a system to process a continuous stream of IoT sensor data. The data must be processed in near real-time, and the results must be stored in Amazon S3 partitioned by hour. Which AWS service is the most cost-effective and simplest to implement?

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are currently failing due to insufficient memory. The data volume varies, with occasional spikes. Which solution should be used to handle the variable memory requirements efficiently?

A data pipeline uses Amazon Kinesis Data Streams to ingest event data. The data is consumed by an AWS Lambda function, which writes to Amazon DynamoDB. The Lambda function is experiencing throttling errors, and the DynamoDB write capacity is underutilized. The events must be processed in order per shard. Which solution most effectively addresses the throttling?

A data scientist needs to run a one-time SQL query on a large dataset in Amazon S3. The dataset is stored in Parquet format and is about 500 GB. The query requires complex aggregations and joins. Which AWS service should be used to minimize cost and setup time?

A company is building a data lake on Amazon S3. Raw data is ingested from multiple sources in different formats (CSV, JSON, Parquet). The data must be cataloged and made queryable using Amazon Athena. The data schema may evolve over time. Which approach minimizes manual effort and supports schema evolution?

An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to Amazon S3. The data arrives at unpredictable rates, with occasional bursts. The company needs to ensure data is delivered within 60 seconds of ingestion, and the data must be partitioned by year/month/day/hour. Which configuration meets these requirements?

Which TWO AWS services can be used to transform data in transit before storing it in Amazon S3? (Choose TWO.)

A company is designing a data pipeline to analyze customer behavior. The pipeline must handle real-time streaming data and batch data. The data must be stored in a data lake on Amazon S3 and also made available for interactive queries. Which THREE services should be combined to build this pipeline? (Choose THREE.)

A data engineering team is migrating on-premises Hadoop workloads to AWS. The workloads include batch processing using Apache Spark and interactive SQL queries. The data is stored in HDFS. Which TWO AWS services should be used to replace HDFS and provide a scalable, durable storage layer? (Choose TWO.)

A data engineering team needs to ingest streaming data from thousands of IoT devices into a data lake on Amazon S3 for near-real-time analytics. The data must be partitioned by device ID and timestamp, and the team must minimize data loss during ingestion failures. Which solution is MOST appropriate?

115

A data scientist needs to query a 2 TB dataset stored in Amazon S3 using Amazon Athena. The data is in CSV format and is used for exploratory analysis. Queries are currently slow and expensive. Which action will improve query performance and reduce cost?

116

A company uses AWS Glue ETL jobs to process data from an Amazon RDS for MySQL database into Amazon S3. The job runs daily and takes 6 hours to complete. The team wants to reduce runtime and cost. The source table has 50 million rows and is updated continuously. Which combination of changes would be MOST effective?

117

A data pipeline uses Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by an AWS Lambda function that transforms and writes to Amazon DynamoDB. The Lambda function is throttled during traffic spikes, causing data to be reprocessed. Which solution should the team implement to handle the throttling without losing data?

118

A company wants to use Amazon SageMaker to train a model on a dataset stored in Amazon S3. The dataset is 100 GB and consists of millions of small JSON files. What should the data engineering team do to optimize training performance?

119

A financial services company needs to build a data lake on Amazon S3 that meets regulatory requirements for data retention and encryption. Data must be encrypted at rest and in transit, and access must be audited. The data lake will be queried by Amazon Athena and Amazon Redshift Spectrum. Which combination of actions should be taken?

120

A data engineering team is building a pipeline to process terabytes of log data daily using Amazon EMR with Spark. The data arrives in hourly batches and must be processed within 4 hours. The team needs to minimize cost. Which cluster configuration is MOST cost-effective?

121

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 1 Gbps connection to AWS. The transfer must be completed within 10 days. What is the MOST efficient approach?

122

A company runs a real-time fraud detection pipeline using Amazon Kinesis Data Analytics. The pipeline reads from a Kinesis data stream, performs sliding window aggregations, and writes results to a DynamoDB table. The application is experiencing high latency during peak hours. Which action would MOST effectively reduce latency?

123

A data engineering team is designing a data pipeline to process streaming data from social media feeds. The data must be deduplicated, enriched with customer information from a relational database, and stored in Amazon S3 in Parquet format. Which AWS services should the team use to build this pipeline? (Select TWO.)

124

A company uses AWS Glue Data Catalog to manage metadata for its data lake on Amazon S3. The data lake contains terabytes of data in CSV format. The data engineering team wants to improve query performance in Amazon Athena and reduce costs. Which actions should the team take? (Select THREE.)

125

A data engineering team needs to schedule a nightly ETL job that extracts data from an Amazon RDS for PostgreSQL instance, transforms it using Spark, and loads it into Amazon S3. The team wants to use AWS Glue for this task. Which components are required? (Select TWO.)

126

Refer to the exhibit. An IAM policy is attached to a data engineering role. The role is used by an AWS Glue ETL job that reads from 'raw/' and writes to 'processed/'. The job fails with an access denied error when trying to write to 'processed/'. What is the likely cause?

127

Refer to the exhibit. A data engineer examines the output of 'aws glue get-job-run' for a failed job. The job run state is FAILED, but ErrorMessage is empty. The job ran for 3600 seconds (1 hour) before failing. What is the MOST likely cause of the failure?

128

Refer to the exhibit. A CloudFormation template creates an S3 bucket. The data engineering team stores daily log files in this bucket and queries them using Amazon Athena. After 30 days, queries on logs older than 30 days start failing with 'Access Denied' errors. What is the MOST likely reason?

129

A company captures streaming data from IoT devices using Amazon Kinesis Data Streams. The data is consumed by a custom application that processes records in near real-time. Recently, the application has been falling behind, and the stream is showing increased 'iterator age' metrics in CloudWatch. Which action is MOST likely to reduce the iterator age?

130

A data engineer needs to build a pipeline that ingests CSV files from an S3 bucket, validates the schema, and loads the data into an Amazon Redshift cluster. The pipeline must handle schema evolution gracefully by adding new columns as they appear in the source files. Which combination of AWS services and configurations would meet these requirements with minimal operational overhead?

131

A machine learning team is preparing a dataset for model training. The data is stored in an Amazon S3 bucket with objects that are each approximately 100 MB in size. The team wants to use Amazon SageMaker for training. To optimize training performance, which data format and storage configuration should be used?

132

A company uses Amazon Kinesis Data Analytics for Apache Flink to process real-time clickstream data. The application uses event time and watermarks for windowed aggregations. The team notices that the output from tumbling windows is delayed, and many late records are being dropped. What is the MOST likely cause?

133

A research lab stores large genomic datasets in Amazon S3 Glacier Deep Archive. They need to run a one-time analysis on a subset of 10 PB of data. The analysis will use an Amazon EMR cluster with Amazon S3 as the data source. What is the MOST cost-effective and performant way to make the data available for the EMR cluster?

134

An ML engineer is using Amazon SageMaker to train a model on a dataset that contains personally identifiable information (PII). The data must be encrypted at rest and in transit. The company uses AWS KMS for key management. How should the engineer configure the SageMaker training job to meet these encryption requirements?

135

A company uses AWS Glue ETL jobs to transform data from Amazon RDS for MySQL to Amazon S3. The transformation includes aggregations and joins. The job runs daily and processes approximately 100 GB of data. Recently, the job started failing with memory errors on the worker nodes. Which approach would MOST effectively resolve the issue without changing the logic?

136

A data scientist needs to run a one-time training job on a 5 TB dataset stored in Amazon S3. The training algorithm requires random access to individual records. Which SageMaker input mode and data format combination would be MOST appropriate?

137

A team is building a data pipeline using Amazon Kinesis Data Firehose to deliver real-time clickstream data to an Amazon S3 bucket. The data must be partitioned by year, month, day, and hour. Which configuration should the team use to achieve this?

138

A company is building a data lake on Amazon S3 and wants to ensure that data is encrypted at rest using AWS KMS. Which TWO actions are required to achieve this? (Choose TWO.)

139

A company is using Amazon DynamoDB as a source for a machine learning pipeline. The data is exported nightly to Amazon S3 using DynamoDB Streams and an AWS Glue job. The Glue job reads the stream records, transforms them, and writes to S3 in Parquet format. The team notices that the Glue job is taking too long and consuming high DynamoDB read capacity. Which THREE actions would reduce the load on DynamoDB and improve performance? (Choose THREE.)

140

A data engineer is designing a data pipeline that uses Amazon Kinesis Data Streams to ingest sensor data. The data must be processed in real-time, and the results must be stored in Amazon DynamoDB. Which TWO AWS services can be used together to achieve this? (Choose TWO.)

141

An IAM policy is attached to a data engineering role that writes to an S3 bucket. The policy is shown in the exhibit. What is the effect of this policy?

142

An ML engineer runs the AWS CLI command shown in the exhibit on a file in S3. The engineer wants to use this file in a SageMaker training job. What does the output reveal about the data?

143

An AWS Glue job is failing with an error that it cannot access an S3 bucket. The IAM role attached to the Glue job is shown in the exhibit. What is the MOST likely cause of the failure?

144

A data engineer needs to process streaming data from an IoT fleet and store the results in Amazon S3 for analysis. The solution must be serverless and handle data that arrives at irregular intervals. Which AWS service should be used to ingest the data?

145

A machine learning team is building a real-time inference pipeline using Amazon SageMaker. The input data is located in an S3 bucket, and the team needs to transform the data before inference using a custom Python script. The transformation should run on a serverless infrastructure and must be triggered automatically when new data arrives in S3. Which combination of services should the team use?

146

A data engineer needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3 for ML training. The data is currently stored in HDFS and is compressible. The network bandwidth between the on-premises data center and AWS is 1 Gbps. The team needs to minimize the time to transfer and also wants to avoid any downtime for the on-premises system. Which solution meets these requirements?

147

A company uses Amazon Kinesis Data Streams for real-time clickstream analysis. The data is consumed by a Lambda function that enriches the records and stores them in Amazon S3. Recently, the Lambda function has been failing with throttling errors, and the consumer is falling behind. The team needs to increase the throughput of the consumer without changing the data format or the Lambda function code. What should the team do?

148

A financial services company is building a fraud detection model that requires joining real-time transaction data with a reference dataset of known fraudulent accounts stored in Amazon DynamoDB. The solution must minimize latency and be highly available. The reference dataset is updated frequently (every few minutes). Which architecture should the team use?

149

A data scientist needs to perform exploratory data analysis on a 100 GB CSV file stored in Amazon S3. The data is not sensitive. The scientist wants to use SQL queries to filter and aggregate the data without setting up a server or moving the data. Which service should be used?

150

A company runs a data lake on Amazon S3 with partitions by year/month/day. A machine learning team needs to read daily data from the last 30 days for model retraining. The data format is Parquet. The team uses Amazon Athena to query the data, but the queries are slow and scanning too much data. The team has already optimized the file sizes and compression. What additional step can reduce the amount of data scanned?

151

A company is using Amazon SageMaker to train a model on a dataset that is updated daily. The data is stored in an S3 bucket. The training pipeline uses AWS Step Functions to orchestrate data preprocessing and model training. The preprocessing step uses a SageMaker Processing job that reads data from S3, cleans it, and writes the output back to S3. The team notices that the preprocessing step often fails due to insufficient disk space on the processing instance. Which change should the team make to resolve this issue without increasing cost?

152

A team is building a data pipeline that ingests data from an Amazon S3 bucket, transforms it using AWS Glue, and loads it into Amazon Redshift for analysis. The Glue job runs on a schedule every hour. The team has noticed that the job takes longer than expected and sometimes fails due to memory issues. The data volume is variable, with occasional spikes. Which solution should the team implement to optimize the pipeline?

153

Which TWO AWS services can be used to transform data in a streaming fashion without using a persistent cluster? (Choose TWO.)

154

Which THREE factors should a data engineer consider when choosing between Amazon S3 and Amazon Redshift for storing large datasets used for machine learning? (Choose THREE.)

155

A company is using Amazon Kinesis Data Streams with a Lambda consumer. The Lambda function writes results to an S3 bucket. The team wants to ensure that each record is processed exactly once and in order. Which TWO configurations should the team implement? (Choose TWO.)

156

Refer to the exhibit. An ML engineer applies this bucket policy to an S3 bucket. The SageMaker execution role MySageMakerRole is used to train a model. The training data is located in s3://my-bucket/data/. The SageMaker training job fails with an access error. What is the most likely cause?

157

Refer to the exhibit. An ML engineer runs the above CLI command to inspect files in an S3 bucket. The training data consists of 200 CSV files, each 1 GB. The engineer plans to use Amazon SageMaker to train a model using this data. What should the engineer do to optimize training performance?

158

A company runs a real-time recommendation system that uses Amazon SageMaker endpoints for inference. The system ingests user activity data from a mobile app via Amazon API Gateway and AWS Lambda, which writes events to an Amazon Kinesis Data Stream. A second Lambda function consumes the stream, calls a SageMaker endpoint to generate recommendations, and stores the results in Amazon DynamoDB. The system has been working well, but recently the team noticed an increase in latency from the time a user action occurs to when the recommendation is stored. The SageMaker endpoint shows increased invocation latency but no throttling. CloudWatch metrics show that the Kinesis stream's IteratorAgeMilliseconds is increasing, indicating the consumer is falling behind. The Lambda consumer's duration is within limits, but the number of invocations is lower than expected. The team suspects the issue is with the event source mapping. Which course of action should the team take to reduce the latency?

159

A data engineering team needs to process streaming data from thousands of IoT devices. The data must be ingested with low latency and processed in near real-time to detect anomalies. Which AWS service should they use for ingestion?

160

A company is using AWS Glue to run ETL jobs that process data from an Amazon RDS for PostgreSQL database. The jobs are failing with connection timeouts. The security group for the RDS instance allows inbound traffic from the Glue job's security group. What is the most likely cause?

161

A data scientist needs to run ad-hoc SQL queries on a large dataset stored in Amazon S3 (Parquet format, 2 TB). The queries are interactive and require sub-second response times. Which service should they use?

162

A company is using Amazon Kinesis Data Firehose to load streaming data into Amazon S3. The data is in JSON format, and they want to convert it to Parquet before storage. What should they configure?

163

A data engineering team needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The data is currently stored in HDFS. Which service should they use for an efficient transfer?

164

An e-commerce company uses Amazon DynamoDB as the primary data store for user sessions. They want to run analytics on historical session data using Amazon Athena. What is the recommended approach to export DynamoDB data to S3 in a format optimized for Athena?

165

A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should they use for scheduling?

166

A company is using AWS Glue Data Catalog as the metadata store for their data lake. They have multiple AWS accounts and want to share the catalog across accounts. Which feature should they use?

167

A data engineer is designing a data pipeline that ingests 500 GB of data daily from an on-premises Oracle database to Amazon S3. The pipeline must minimize data loss and support change data capture (CDC). Which combination of services should they use?

168

Which TWO of the following are valid ways to reduce query costs in Amazon Athena? (Choose TWO.)

169

Which THREE of the following are best practices for optimizing performance of Amazon EMR clusters? (Choose THREE.)

170

Which TWO services can be used to transform data in transit within a Kinesis Data Firehose delivery stream? (Choose TWO.)

171

A financial services company uses Amazon Kinesis Data Streams with 50 shards to ingest real-time stock trade data. The data is consumed by a custom Java application running on Amazon EC2 instances. Recently, the application has been experiencing high latency, and CloudWatch metrics show that the average iterator age is increasing. The application uses the Kinesis Client Library (KCL) with DynamoDB for lease tracking. The EC2 instances are in an Auto Scaling group with a minimum of 2 and maximum of 10 instances, and the current CPU utilization is below 50%. The team wants to reduce latency without increasing costs significantly. What should they do?

172

A media company ingests video metadata from multiple sources into an Amazon S3 bucket. Each metadata record is a JSON file about 2 KB. They use AWS Glue ETL jobs to process these files and load them into Amazon Redshift for analytics. The jobs currently run hourly and take about 10 minutes to process all new files. However, the company is growing and expects the number of files to increase 100x. The data engineering team wants to minimize processing time and cost. The Glue job currently reads all files from the S3 bucket using a full scan. What should they do to optimize the pipeline?

173

A retail company uses Amazon Redshift for its data warehouse. The data engineering team runs ETL jobs that load data from multiple sources into Redshift daily. They notice that the load performance is slow and the cluster CPU utilization is high during the ETL window. The team wants to improve load performance without changing the cluster configuration. They currently load data using INSERT statements from a staging table. What should they do?

174

A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The data arrives in bursts and must be processed with minimal latency. Which AWS service is most appropriate for the ingestion layer?

175

A company is building a data pipeline using AWS Glue to transform data from Amazon RDS to Amazon S3. The pipeline runs daily and processes about 500 GB of data. The team notices that the job is taking longer than expected. Which change would MOST improve the job performance?

176

A data engineer is designing a data lake on Amazon S3 that must support both batch and streaming analytics. The data comes in Parquet format and needs to be queryable by Amazon Athena. Which partitioning strategy will optimize query performance and reduce costs?

177

A company uses AWS Lambda to process events from Amazon S3. The Lambda function transforms the data and writes results to another S3 bucket. Recently, the function has been failing due to timeout errors when processing large files. Which solution should the data engineer implement?

178

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 1 Gbps internet connection and wants to complete the transfer within 5 days. What is the MOST cost-effective and reliable solution?

179

A company uses Amazon EMR to run Spark jobs on a transient cluster that processes data from S3. The jobs are failing with 'OutOfMemory' errors. The data engineer has already increased the executor memory. Which additional configuration change would MOST likely resolve the issue?

180

A data engineering team needs to orchestrate a complex workflow that involves multiple AWS Glue jobs, Lambda functions, and S3 operations. The workflow must run on a schedule and allow monitoring of each step. Which AWS service should they use?

181

A company is using Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by a fleet of EC2 instances running a custom consumer application. The consumer is falling behind and the shard iterator age is increasing. Which TWO actions should the data engineer take to improve consumer performance? (Choose TWO.)

182

A data engineer is designing a data pipeline to process streaming data from Amazon Kinesis Data Streams and store the results in Amazon S3 in Parquet format. The data must be available for querying in Amazon Athena within minutes of arrival. Which THREE services should be used together? (Choose THREE.)

183

A company wants to centralize logging from multiple AWS accounts and on-premises servers. The logs must be stored cost-effectively and be searchable. Which TWO services should be used? (Choose TWO.)

184

A data engineer is troubleshooting an AWS Glue job that reads from an S3 bucket and writes to another S3 bucket. The job fails with an 'Access Denied' error when trying to write to the output bucket. The IAM policy attached to the Glue service role is shown. What is the MOST likely cause of the failure?

185

A data engineer runs the CLI command shown in the exhibit and sees that the bucket contains many small Parquet files (1 MB each) under the prefix. When querying this data with Athena, the query performance is poor and costs are high. Which approach would MOST improve performance and reduce cost?

186

A company runs a data pipeline using AWS Glue ETL jobs that process about 10 TB of data daily from Amazon S3. The jobs are triggered by a schedule and write results to a separate S3 bucket. Recently, the jobs have been taking longer to complete, and the data engineering team has observed that the number of files in the source bucket has increased significantly, from thousands to millions of small files (each about 100 KB). The Glue jobs are configured to use the 'Group Files' option, but performance is still poor. The team needs to improve the job performance without changing the source data generation process. Which course of action should the team take?

187

An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to an Amazon S3 bucket. The data is then queried using Amazon Athena. The marketing team wants to run daily reports that aggregate click events by product ID. However, the reports are slow because Athena scans the entire dataset each time. The data is partitioned by date (e.g., s3://bucket/clickstream/2023/01/01/). The product ID is a column within the data. The data engineering team wants to improve query performance without moving the data to another service. Which approach should the team take?

188

A startup is building a data pipeline that ingests data from multiple sources into an Amazon S3 data lake. The data includes CSV files from legacy systems, JSON from web APIs, and Avro from mobile apps. The data must be transformed into Parquet format and cataloged for querying with Amazon Athena. The pipeline must be serverless and minimize operational overhead. The team has decided to use AWS Glue for ETL and cataloging. However, they are concerned about the cost of running Glue jobs continuously. The data arrives in small batches every 10 minutes. Which approach should the team use to minimize cost while meeting the requirements?

189

A data engineer is building a streaming pipeline using Amazon Kinesis Data Streams and AWS Lambda. The Lambda function processes records and writes results to Amazon S3. The engineer notices that the Lambda function is experiencing throttling and some records are being dropped. Which TWO actions should the engineer take to improve the reliability of the pipeline?

190

A machine learning team is using Amazon SageMaker to train a model on a dataset stored in S3. The training job reads data from S3 using Pipe input mode, but the training is slow. The team wants to improve data throughput. Which THREE actions should they take?

191

A data engineer is designing a data lake on Amazon S3. The data is collected from IoT devices and is highly variable in volume. The engineer needs to ensure that the data is ingested reliably and can be processed in near real-time. Which AWS service should be used to ingest the data into the data lake?

192

A data engineer has attached the IAM policy shown in the exhibit to an IAM role used by an AWS Glue ETL job. The job reads from and writes to 'my-data-bucket'. The job is failing with an Access Denied error. What is the most likely cause?

193

A machine learning engineer is using Amazon SageMaker to train a model. The training dataset is 2 TB and is stored in Amazon S3. The engineer wants to reduce the training time by improving data loading performance. Which data ingestion mode should be used?

194

A data engineer needs to transform a large dataset stored in Amazon S3 using Apache Spark. The engineer wants to minimize costs and avoid managing infrastructure. Which AWS service should be used?

195

A data engineer is investigating a slow Athena query on a partitioned table. The table is partitioned by year, month, and day, and the data is stored in S3 with the prefix pattern 'raw/YYYY/MM/DD/'. The engineer runs the CLI command shown in the exhibit and sees that there are many small files. Which action would most improve query performance?

196

A data engineering team is building a real-time fraud detection pipeline. The pipeline ingests transaction data from an Amazon Kinesis Data Stream with 10 shards. Each shard produces about 500 records per second, each record is 2 KB. The data is processed by a Lambda function that runs for about 200 ms and then writes results to an Amazon DynamoDB table. The team notices that the Lambda function is experiencing a high number of throttles, and there are increasing numbers of records being retried. The Lambda function's reserved concurrency is set to 100. The DynamoDB table has 100 read capacity units and 100 write capacity units. Which change would most effectively reduce throttling and improve processing throughput?

197

A machine learning team is preparing a large dataset for training. The dataset consists of 10,000 CSV files, each about 100 MB, stored in Amazon S3. The team wants to transform the data using AWS Glue ETL jobs. The transformation involves filtering rows, adding new columns, and joining with a small reference table (100 KB). The team is concerned about job performance and cost. They currently have a Glue job with 10 DPU (Data Processing Units) and it takes about 2 hours to complete. The team wants to reduce the runtime and cost. Which approach should they take?

198

A data engineer is tasked with building a data pipeline that moves data from an on-premises database to Amazon S3 for analytics. The database is a MySQL instance that is 2 TB in size. The company has a 1 Gbps dedicated network connection to AWS (AWS Direct Connect). The data must be transferred once daily. The engineer needs to choose the most efficient and reliable service for this task. Which service should they use?

199

A data engineering team is using Apache Spark on Amazon EMR to process streaming data from Amazon Kinesis Data Streams. The Spark application uses structured streaming to read from Kinesis, perform transformations, and write to Amazon S3 in Parquet format. The team notices that the application is falling behind and the processing latency is increasing. The Kinesis stream has 5 shards, and the EMR cluster has 5 core nodes of type r5.xlarge. The Spark application is configured with 5 executors, each with 2 cores and 8 GB memory. The team wants to reduce processing latency. Which change would be most effective?

200

A data engineer needs to continuously ingest streaming data from thousands of IoT devices and store the raw data in Amazon S3 for archival processing. The data volume varies significantly throughout the day, and the solution must be serverless, scalable, and cost-effective. Which AWS service should be used to capture and buffer the streaming data before writing to S3?

201

A company is running a machine learning training job on Amazon SageMaker that reads training data from an S3 bucket. The job fails intermittently with an S3 throttling error. The data is partitioned across thousands of small files (average 100 KB). Which strategy is MOST effective to resolve the throttling issue?

202

A data scientist wants to explore a large dataset stored in Amazon S3 using SQL queries without moving the data. The dataset is in CSV format and is updated daily with new partitions. Which AWS service should be used to directly query the data in S3?

203

A company is building a data pipeline that ingests data from multiple sources into a centralized data lake on Amazon S3. The data must be transformed before it is available for analysis. The pipeline should be event-driven, automatically triggering transformation jobs when new data arrives. Which combination of AWS services should be used?

204

A data engineering team is designing a data lake on Amazon S3. They need to enforce encryption at rest for all data stored in the bucket. The security policy requires that the encryption keys be managed by the organization using AWS Key Management Service (KMS), and that the bucket must deny uploads of unencrypted objects. Which bucket policy should be applied?

205

A company uses Amazon RDS for its transactional database and needs to export a daily snapshot of a table to Amazon S3 in Parquet format for analytics. Which AWS service can perform this export without writing custom code?

206

A company is streaming data from thousands of devices using Amazon Kinesis Data Streams. The data is consumed by an AWS Lambda function that processes each record. The Lambda function is experiencing high error rates and throttling due to the volume of data. Which action would MOST effectively improve the processing throughput and reduce errors?

207

A data scientist needs to run complex ETL transformations on a large dataset stored in Amazon S3. The transformations are written in PySpark and require occasional access to Hive metastore. The solution should minimize operational overhead and allow the data scientist to focus on code development. Which AWS service should be used?

208

A company wants to perform real-time analytics on streaming data from clickstreams. The data needs to be ingested, processed, and made available for querying within seconds. Which AWS service should be used for the processing step?

209

A company is using AWS Glue to catalog metadata from various data sources. The crawler is configured to run daily. However, the catalog is not reflecting new partitions added to an S3 bucket during the day. What is the MOST likely cause?

210

A data engineer is designing a data pipeline that ingests data from a relational database into a data lake on Amazon S3. The data must be incrementally loaded daily. Which TWO AWS services can be used together to achieve this?

211

A company wants to use Amazon SageMaker to train a model using data stored in Amazon S3. The data is sensitive and must be encrypted at rest and in transit. Which THREE steps should be taken to ensure data security?

212

A data engineer needs to collect and analyze log data from multiple EC2 instances in real-time. The solution should be serverless and scalable. Which TWO AWS services should be used?

213

A company is using AWS Glue ETL jobs to transform data. The jobs are failing due to insufficient memory. The data processing involves complex joins and aggregations. Which THREE actions can improve job performance and reduce memory usage?

214

A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The solution must handle data that arrives in bursts and must be able to reprocess failed records automatically. Which combination of AWS services should the team use?

215

A data engineer is designing a data pipeline that transforms raw JSON files (each 50-200 KB) in Amazon S3 into Parquet format using AWS Glue. The pipeline must minimize data processing costs and handle a high volume of small files (millions per day). The engineer configures a Glue ETL job with Spark, but the job is slow and expensive because of the overhead of reading many small files. Which optimization should the engineer implement to reduce cost and improve performance?

216

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data contains personally identifiable information (PII) that must be redacted before storage. Which AWS service can be integrated with Kinesis Data Firehose to transform the data in real time?

217

A data engineering team needs to build a data lake on Amazon S3 that will be queried by Amazon Athena and Amazon Redshift Spectrum. The data will be ingested from multiple sources in various formats (CSV, JSON, Parquet). Which partitioning strategy will provide the best query performance for date-range queries?

218

A company has an AWS Glue ETL job that reads data from an Amazon RDS for MySQL table and writes to Amazon S3 in Parquet format. The job runs daily and processes 500 GB of data. Recently, the job has been failing with memory errors during the write phase. The data schema is wide (200 columns). Which change should a data engineer make to the Glue job to resolve the memory issue?

219

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 100 Mbps internet connection and a tight deadline of two weeks. Which AWS service should the engineer use to transfer the data most efficiently?

220

A data engineering team is building a real-time fraud detection system. Transactions are ingested via Amazon Kinesis Data Streams, and a machine learning model (deployed on Amazon SageMaker) scores each transaction. The team needs to store the raw transactions and the model's predictions in Amazon S3 for later analysis. Which architecture should the team use?

221

A company uses Amazon Redshift for its data warehouse. The data engineering team needs to load 10 TB of data from Amazon S3 into Redshift every night. The team wants to minimize the load time and use the fewest number of COPY commands. The data is in CSV format and is partitioned by date in S3. Which approach should the team take?

222

A data engineer needs to schedule an AWS Glue ETL job to run every hour. The job reads from an Amazon DynamoDB table and writes to Amazon S3. Which AWS service should the engineer use to trigger the Glue job on schedule?

223

A data engineering team is designing a data pipeline that processes streaming data from Amazon Kinesis Data Streams using AWS Lambda. The team notices that some records are being processed multiple times (duplicates). Which TWO steps should the team take to ensure exactly-once processing?

224

A company uses Amazon Athena to query a data lake in Amazon S3. The data is partitioned by year, month, day, and hour. The team notices that queries are slow and expensive. The team wants to improve performance and reduce costs. Which THREE actions should the team take?

225

A data engineer is building a data pipeline using AWS Glue. The pipeline reads data from Amazon S3, transforms it, and writes it back to S3 in a different format. The engineer needs to handle schema evolution (new columns added over time). Which TWO features of AWS Glue can help manage schema evolution?

226

A data engineer uses the IAM policy above for an AWS Lambda function that processes data in S3 and triggers an AWS Glue job. The Lambda function is unable to start the Glue job. What is the most likely cause?

227

A data engineer runs the AWS CLI command above to inspect an object in S3. The engineer wants to query this metadata (kafka-offset) using Amazon Athena to track processing progress. How can the engineer make this metadata available for Athena queries without modifying the existing data pipeline?

228

A data engineer configures an S3 event notification to trigger an AWS Lambda function when a new object is created in 'my-input-bucket'. The Lambda function processes the CSV file and writes results to 'my-output-bucket'. The engineer notices that the Lambda function is not triggered for some objects. Which step should the engineer take to diagnose the issue?

229

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Kinesis Data Analytics application that runs SQL queries. The application has been failing intermittently with 'ProvisionedThroughputExceededException' errors. Which action should be taken to resolve this issue?

230

A data engineering team is designing a data pipeline to process large CSV files (10-50 GB each) stored in Amazon S3. The pipeline must transform the data using AWS Glue and load it into Amazon Redshift for analytics. The team wants to minimize costs while ensuring the pipeline can handle peak loads. Which approach is the most cost-effective?

231

A company is using Amazon DynamoDB to store sensor data. The data is exported to Amazon S3 using DynamoDB Streams and AWS Lambda for long-term archival. Recently, the Lambda function has been failing due to 'ProvisionedThroughputExceededException' on the DynamoDB stream. What is the most likely cause?

232

A data scientist is building a training dataset from data stored in Amazon S3. The data consists of JSON files each containing a 'timestamp' field. The scientist wants to use AWS Glue to catalog the data and enable querying via Amazon Athena. However, Athena queries are returning zero results for time-range filters. What is the most likely cause?

233

A company is streaming data from IoT devices to Amazon Kinesis Data Firehose, which writes to an Amazon S3 bucket. The data is then processed by an AWS Glue ETL job and loaded into Amazon Redshift. The team notices that some records are missing in Redshift. They suspect data loss during the Firehose delivery. Which configuration parameter should be checked first?

234

A data engineer needs to set up a data pipeline that ingests data from an Amazon RDS MySQL database into Amazon S3. The pipeline should run daily and capture incremental changes (inserts, updates, deletes) from the source database. Which AWS service should be used as the data ingestion tool?

235

A company is building a data lake on Amazon S3 and wants to use AWS Glue to catalog the data. The data includes CSV, Parquet, and JSON files. The team wants to ensure that the Glue crawler can infer the schema correctly and update the Data Catalog when new partitions are added. Which crawler configuration should be used?

236

A company uses Amazon Kinesis Data Streams with a shard count of 5. The data producer sends 1000 records per second, each 1 KB in size. The consumer application reads from the stream using the Kinesis Client Library (KCL) and processes records. The consumer is experiencing high latency and falling behind. What is the most effective way to improve consumer throughput?

237

A company wants to store semi-structured data from IoT sensors in a cost-effective manner for occasional querying. The data is not updated once written. Which Amazon S3 storage class is the most cost-effective for this use case?

238

Which TWO configurations are required to enable AWS Glue to access data stored in a VPC? (Choose two.)

239

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose three.)

240

Which TWO steps are required to set up cross-account access to an Amazon S3 data lake for AWS Glue jobs running in a different AWS account? (Choose two.)

241

Refer to the exhibit. A company is using the Kinesis stream 'my-stream' with one shard. The producer is sending 1000 records per second, each 1 KB. The consumer is reading from the stream using the Kinesis Client Library (KCL). The consumer is able to process 500 records per second per shard. What is the most likely cause of the consumer falling behind?

242

Refer to the exhibit. A data engineer is troubleshooting an AWS Glue job that fails with an 'AccessDenied' error when trying to write to the S3 bucket 'my-data-lake'. The IAM policy attached to the Glue service role is shown. What is the missing permission?

243

Refer to the exhibit. A team deploys this CloudFormation stack. The Kinesis stream is created, but the Firehose delivery stream fails to create with a 'Resource handler returned message: Unable to assume role' error. What is the most likely cause?

244

A company is streaming real-time sensor data from IoT devices to Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that enriches the records with metadata from an Amazon DynamoDB table and writes the results to an Amazon S3 bucket. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' errors from DynamoDB. The data volume is variable, with occasional bursts. Which solution should a data engineer implement to resolve this issue without losing data?

245

An e-commerce company uses Amazon Redshift for analytics. The data engineering team needs to load daily sales data from an S3 bucket that receives new files every hour. The data must be loaded into Redshift with minimal impact on query performance during the day, and they need to handle late-arriving data (files that appear after the daily load). Which approach should they use?

246

A data scientist needs to train a machine learning model using a large dataset (500 GB) stored in an S3 bucket. The training will be performed on a SageMaker notebook instance. The data scientist wants to minimize data transfer costs and reduce training time. Which data ingestion approach should the data engineer recommend?

247

A company is using AWS Glue to run ETL jobs that transform data from multiple sources into a data lake on S3. The jobs are scheduled to run hourly. Recently, the jobs have been failing intermittently with 'MemoryError' exceptions. The data volume has grown over time. The data engineer needs to resolve this issue cost-effectively. Which action should be taken?

248

A data engineering team needs to set up a data pipeline that ingests streaming data from an Apache Kafka cluster running on Amazon EKS into an S3 data lake. The data must be stored in Parquet format, partitioned by date and event type. The team wants a fully managed solution with minimal operational overhead. Which solution should they choose?

249

A data scientist is training a deep learning model using a large dataset stored in S3. The training job runs on a SageMaker training instance with a GPU. The data engineer notices that the GPU utilization is low, and the training is I/O bound. The data is read directly from S3 using the SageMaker SDK. Which change should the data engineer recommend to improve GPU utilization?

250

A company uses Amazon DynamoDB as the primary data store for a real-time recommendation engine. The data engineering team needs to export a daily snapshot of the DynamoDB table to S3 for offline analytics. The table is large (10 TB) and has a high read/write throughput. Which method will export the data with the least impact on the production workload?

251

A data engineer is building a data pipeline that uses AWS Lambda to process records from an SQS queue and write results to an S3 bucket. The Lambda function processes each record individually and writes a separate file to S3. The team notices high latency and wants to reduce the number of S3 PUT requests to improve performance and reduce cost. Which approach should the data engineer take?

252

A company has a large number of small CSV files (hundreds of thousands) in an S3 bucket. A data engineer needs to run a SQL query on this data using Amazon Athena. The queries are currently slow and expensive. Which TWO actions will improve query performance and reduce cost?

253

A data engineer needs to design a data ingestion pipeline that ingests data from a MySQL database hosted on-premises into Amazon S3 for analytics. The pipeline must capture change data (CDC) and run continuously with low latency. Which TWO services should the data engineer use?

254

A company is using Amazon Redshift for data warehousing. The data engineering team observes that query performance degrades over time due to data skew. Which THREE strategies should the team implement to improve performance?

255

Refer to the exhibit. An IAM policy is attached to a data engineering team's role. The team needs to upload data to the 'confidential' prefix in the 'my-data-lake' bucket. However, they are receiving 'AccessDenied' errors. What is the likely cause?

256

Refer to the exhibit. A data engineer runs an Athena query and gets a failure. What is the most likely cause?

257

Refer to the exhibit. A data engineer has deployed this CloudFormation template. The Glue job 'my-etl-job' reads from the S3 bucket 'my-data-lake-bucket' and writes transformed data to another bucket. After 30 days, the data engineer notices that the Glue job fails with 'Input data not found' errors. What is the most likely cause?

258

A data engineer needs to extract data from an Amazon RDS for MySQL database into Amazon S3 for further processing. The data volume is 2 TB and the job must run daily within a 1-hour window. Which AWS service is most suitable for this task?

259

A company is building a data lake on Amazon S3. Data arrives from multiple sources in different formats (CSV, JSON, Parquet). The engineering team wants to query this data using Amazon Athena with minimal transformation. Which approach minimizes query cost and improves performance?

260

A data pipeline uses AWS Lambda to process small files (10-50 MB) from an S3 bucket and write results to DynamoDB. The Lambda function times out after 15 seconds for larger files. The team wants to handle files up to 100 MB without changing the Lambda code. Which solution is MOST cost-effective?

261

A data scientist needs to run a one-time query on 10 TB of data stored in S3 using Amazon Athena. The query scans 5 TB and returns a small result set. Which approach minimizes cost?

262

A company uses Amazon Kinesis Data Firehose to ingest streaming data and deliver it to an S3 bucket. The data is in JSON format with a timestamp field. The data science team wants to query the data using Athena with partitioning by year/month/day. How should the S3 data be organized?

263

An organization is migrating its on-premises Hadoop cluster to AWS. The cluster runs Spark jobs that process 50 TB of data daily. The data is stored in HDFS with 3x replication. Which storage option on AWS provides the best price-performance for this workload?

264

A company needs to ingest real-time clickstream data from thousands of web servers into AWS for near-real-time analytics. The data volume varies and can spike during promotions. Which service should be used to capture and buffer the data before processing?

265

A data engineer uses AWS Glue to run ETL jobs that transform data from JSON to Parquet. The job runs successfully but takes 30 minutes longer than expected. CloudWatch metrics show high memory utilization and disk spills. What is the most likely cause?

266

A company stores sensitive customer data in an S3 bucket. The security team requires that all data be encrypted at rest with a key that is automatically rotated every year. Which solution meets these requirements with the least operational overhead?

267

Which TWO options are valid ways to reduce the amount of data scanned by Amazon Athena queries, thereby reducing cost?

268

Which THREE AWS services can be used together to build a serverless data pipeline that ingests streaming data, transforms it, and loads it into Amazon Redshift for analysis?

269

Which TWO options are best practices for managing access to data stored in Amazon S3 for a data lake?

270

A data engineer is investigating why an Athena query against the my-data-lake bucket is slow. The query filters on year, month, and day. The exhibit shows the metadata of one Parquet file. What is the MOST likely cause of the slow query?

271

The Glue job my-glue-job fails after a few successful runs. The error log shows 'Job run exceeds max concurrent runs limit'. The CloudFormation template is shown in the exhibit. What change should be made to allow multiple runs to execute concurrently?

272

A Glue job fails with an AccessDenied error when trying to write to the S3 bucket my-data-lake. The IAM policy attached to the job role is shown in the exhibit. What is the MOST likely reason for the failure?

273

A data scientist needs to process a large volume of streaming data from IoT devices and store the results in Amazon S3 for further analysis. Which AWS service is most suitable for ingesting and processing this data in near real-time?

274

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are failing intermittently with timeouts. What is the most likely cause?

275

A company uses Amazon Kinesis Data Streams to collect clickstream data. The data is consumed by a Lambda function that writes to DynamoDB. Occasionally, the Lambda function fails due to throttling from DynamoDB. How can the company resolve this issue without losing data?

276

A company needs to perform complex transformations on large datasets stored in Amazon S3 using Apache Spark. They want to minimize operational overhead. Which AWS service should they use?

277

A company is migrating its on-premises Hadoop cluster to AWS. They have a large amount of historical data stored in HDFS. Which approach is the most efficient for transferring this data to Amazon S3?

278

A data engineer needs to automate the transformation of CSV files to Parquet format as soon as they are uploaded to an S3 bucket. The transformed files should be stored in another S3 bucket. Which solution is the most cost-effective and requires the least maintenance?

279

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. They notice that the data is delivered in 5-minute intervals even though they set the buffer interval to 60 seconds. What could be the cause?

280

A company needs to process sensitive data from multiple sources. They want to use AWS Glue to catalog and transform the data. Which feature should they use to ensure that sensitive columns are masked before the data is available for querying?

281

A company runs a critical ETL job using AWS Glue that writes to an Amazon Redshift cluster. The job occasionally fails due to insufficient disk space on the Redshift cluster. How can the company automate the process to prevent this failure?

282

Which TWO AWS services are suitable for real-time stream processing?

283

Which TWO data formats are columnar and optimized for analytics queries in Amazon S3?

284

Which THREE considerations are important when designing a data lake on Amazon S3?

285

An IAM policy attached to an AWS Glue job allows reading and writing to an S3 bucket and accessing Glue Data Catalog. The job fails with an access denied error when trying to create a table in the Data Catalog. What is the likely issue?

286

A data engineer runs the AWS CLI command shown and notices a zero-byte file in the results. What is the most likely cause of this zero-byte file?

287

A company uses AWS Glue jobs with job bookmarks enabled to process incremental data. They notice that the job processes all data each time instead of only new data. What is the most likely reason?

288

A data engineer needs to store streaming data from thousands of IoT devices for real-time analytics. Which AWS service is most suitable for ingesting and storing this data for subsequent processing by Amazon Kinesis Data Analytics?

289

A company is using AWS Glue to run ETL jobs that process data in an S3 data lake. The jobs are failing with out-of-memory errors when processing large files. Which configuration change should be made to resolve this issue?

290

A data scientist is training a deep learning model on a GPU instance. The training data is stored in S3 and is 50 GB. To reduce I/O bottlenecks, which storage option should be used to cache the data locally on the instance?

291

A company is using Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be transformed into Parquet format before delivery. Which approach should the data engineer use?

292

A company is running a data pipeline that uses Amazon EMR with Spark to process 100 TB of data daily. The pipeline must complete within 6 hours. Currently, it takes 8 hours. Which optimization will most likely reduce the runtime?

293

A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should be used to trigger the job?

294

A company is using Amazon Athena to query a data lake in S3. Queries are slow and expensive. The data is stored as JSON. Which action will improve query performance and reduce cost?

295

A data engineer is building a data pipeline that uses Amazon S3 to store raw data, AWS Lambda for transformation, and Amazon DynamoDB for serving. The Lambda function experiences high latency when writing to DynamoDB. Which action will most effectively reduce the latency?

296

A company needs to move 10 TB of data from an on-premises NAS to Amazon S3 over a 100 Mbps internet connection. The transfer must complete within 3 days. Which solution is the most appropriate?

297

A data engineering team is designing a data lake on AWS. They need to store raw data in S3 and allow multiple analytics services to query the data. Which TWO services can be used to catalog and provide schema information for the data?

298

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be processed and stored in S3 in near real-time. Which THREE services can be used together to achieve this?

299

A data engineer is designing a data pipeline that uses Amazon S3 events to trigger an AWS Lambda function for processing. The pipeline must handle high throughput with low latency. Which TWO configurations should be applied?

300

A data engineer is configuring an IAM policy to allow users to upload objects to an S3 bucket only if the objects are encrypted using SSE-S3. However, users are getting AccessDenied errors when uploading objects without specifying encryption. What is the most likely cause?
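
Two of the transfer questions above (50 TB in two weeks and 10 TB in three days, both over a 100 Mbps link) hinge on a quick bandwidth calculation. The sketch below works through that arithmetic only. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second), a fully utilized link, and no protocol overhead. The function name `transfer_days` and its `utilization` parameter are illustrative and do not come from any AWS tool.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Estimate days needed to move size_tb terabytes over a link_mbps link.

    Assumes decimal units and a constant effective rate of
    link_mbps * utilization (an idealized upper bound on throughput).
    """
    size_bits = size_tb * 1e12 * 8              # terabytes -> bits
    rate_bits_per_s = link_mbps * 1e6 * utilization
    return size_bits / rate_bits_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    # (dataset size in TB, deadline in days) taken from the two questions
    for size_tb, deadline_days in [(50, 14), (10, 3)]:
        days = transfer_days(size_tb, 100)
        verdict = "fits" if days <= deadline_days else "misses"
        print(f"{size_tb} TB over 100 Mbps: {days:.1f} days ({verdict} the {deadline_days}-day deadline)")
```

Under these idealized assumptions the 50 TB transfer takes about 46 days and the 10 TB transfer about 9 days, so neither deadline can be met over the network link alone; real-world throughput would be lower still.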

Practice all 300 Data Engineering questions

Other MLS-C01 exam domains

Machine Learning Implementation and Operations
Modeling
Exploratory Data Analysis

Frequently asked questions

What does the Data Engineering domain cover on the MLS-C01 exam?

The Data Engineering domain covers the key concepts tested in this area of the MLS-C01 exam blueprint published by Amazon Web Services. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all MLS-C01 domains — no account required.

How many Data Engineering questions are in the MLS-C01 question bank?

The Courseiva MLS-C01 question bank contains 300 questions in the Data Engineering domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Data Engineering for MLS-C01?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Data Engineering questions for MLS-C01?

Yes — the session launcher on this page draws questions exclusively from the Data Engineering domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
