Practice MLS-C01 Data Engineering questions with full explanations on every answer.
1. A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?
2. A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by a Lambda function that writes records to an S3 bucket. Recently, the number of shards was increased from 2 to 4 to handle higher throughput. After the change, the Lambda function started processing records with increased latency and some records were being written out of order. What is the MOST likely cause?
3. A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?
4. A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be partitioned by year, month, and day. The delivery stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The data producer sends about 1 MB per second. The data is arriving in S3 but the partitions are not being created as expected. What is the MOST likely reason?
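The buffering numbers in the Firehose question above can be checked with a little arithmetic. Firehose delivers a batch when either the buffer size or the buffer interval is reached, whichever comes first. The sketch below is only an illustration of that rule: the function name `first_flush_trigger` is hypothetical, and it ignores compression and any per-partition buffering.

```python
def first_flush_trigger(rate_mb_per_s: float,
                        buffer_size_mb: float,
                        buffer_interval_s: float) -> tuple[str, float]:
    """Return which buffering hint is reached first, and after how many seconds.

    Hypothetical helper for illustration; not part of any AWS SDK.
    """
    if rate_mb_per_s <= 0:
        # No incoming data: only the interval timer can fire.
        return ("interval", buffer_interval_s)
    seconds_to_fill = buffer_size_mb / rate_mb_per_s
    if seconds_to_fill < buffer_interval_s:
        return ("size", seconds_to_fill)
    return ("interval", buffer_interval_s)


# Values from the question: about 1 MB/s, 5 MB buffer, 60 s interval.
trigger, seconds = first_flush_trigger(1.0, 5.0, 60.0)
print(trigger, seconds)  # size 5.0
```

With these settings the size threshold is reached roughly every 5 seconds, so the 60-second interval never fires. That tells you how often objects land in S3, not how they are partitioned.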
5. An ML team is building a recommendation system. The training data includes user-item interactions stored in Amazon DynamoDB. The team wants to export this data to S3 in Parquet format for use with Amazon SageMaker. The export should be incremental (only new or changed records) and run daily. Which approach meets these requirements with MINIMAL operational overhead?
6. A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses a ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?
7. A company uses AWS Glue ETL jobs to process data from multiple sources. The job fails with the error: 'An error occurred while calling o123.pyWriteDynamicFrame. Insufficient memory.' The job runs on a G.1X worker type with 10 workers. What should be changed to resolve this error?
8. A company uses Amazon Redshift as a data warehouse. They need to load 50 TB of clickstream data from S3 into Redshift daily. The data arrives in 5-minute intervals as gzipped CSV files. The target table has a sort key and a distribution key. The load must complete within 2 hours. Which approach is MOST efficient?
9. A machine learning engineer needs to process a large dataset that does not fit on a single Amazon SageMaker notebook instance's EBS volume. The data is stored in S3. What is the MOST efficient way to access the data from the notebook?
10. A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to a sink. The application is failing with an 'OutOfMemoryError'. The application has parallelism set to 4 and uses 1 Kinesis Processing Unit (KPU). What is the MOST likely cause and solution?
11. An organization stores sensitive customer data in S3. A data pipeline uses AWS Glue to transform the data and load it into Amazon Redshift. The security team requires that data be encrypted at rest in S3 and in transit between S3 and Glue, and between Glue and Redshift. Which configuration meets these requirements?
12. A data science team is building a real-time fraud detection system. Transactions are streamed via Amazon Kinesis Data Streams, and a Lambda function performs feature engineering and invokes an Amazon SageMaker endpoint for predictions. The team notices that the Lambda function is timing out and causing data loss. Which solution should the team implement to process the stream reliably and at low latency?
13. A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?
14. A data engineer is building a data pipeline to process user clickstream data. The data arrives as JSON files in an S3 bucket. The pipeline must transform the JSON into Parquet format and partition by date and event type, then make the data available for Amazon Athena queries. The engineer needs a fully managed, serverless solution with minimal operational overhead. Which combination of AWS services should the engineer use?
15. A team is using Amazon SageMaker to train a model on a dataset that is 500 GB in size, stored as CSV files in S3. The training job takes 2 hours using a single ml.p3.2xlarge instance. The team wants to reduce training time to under 30 minutes. The model architecture supports distributed training. Which solution will achieve this goal with the LEAST amount of code changes?
16. A company processes large streams of IoT sensor data using Amazon Kinesis Data Streams with 100 shards. Each sensor reading is about 1 KB. The data is consumed by an Amazon EMR cluster running Spark Streaming jobs. The team notices that the Spark Streaming job's processing time is gradually increasing, and the stream is falling behind. They suspect the issue is due to skewed data distribution across shards. Which approach should the team take to diagnose and resolve the issue?
17. A data engineering team is designing a data lake on AWS for machine learning workloads. The data includes structured, semi-structured, and unstructured data. The team needs to ensure that the data is cataloged, easily discoverable, and can be queried by Amazon Athena and Amazon EMR. The team also wants to enforce fine-grained access control at the column and row level for sensitive data. Which combination of AWS services should the team use? (Select TWO.)
18. A company is building a real-time anomaly detection system for network traffic logs. The logs are ingested via Amazon Kinesis Data Streams and processed with an Amazon SageMaker endpoint for inference. The team needs to ensure that the inference results are stored durably and can be replayed for model retraining. The system must handle at least 10,000 records per second with low latency. Which three AWS services should the team use to build this architecture? (Select THREE.)
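The "10,000 records per second" requirement above can be turned into a rough shard count. The sketch below assumes a provisioned-mode stream and uses the per-shard write limits of 1,000 records per second and 1 MB per second (treated here as 1 MiB). The 1 KB average record size is an assumption, since the question does not state one, and `shards_needed` is a hypothetical helper name.

```python
import math

# Per-shard write limits for a provisioned Kinesis data stream.
SHARD_WRITE_RECORDS_PER_S = 1000
SHARD_WRITE_BYTES_PER_S = 1024 * 1024  # 1 MB/s, treated as 1 MiB here


def shards_needed(records_per_s: float, avg_record_bytes: float) -> int:
    """Estimate the minimum shard count for a given write load (hypothetical helper)."""
    by_count = math.ceil(records_per_s / SHARD_WRITE_RECORDS_PER_S)
    by_bytes = math.ceil(records_per_s * avg_record_bytes / SHARD_WRITE_BYTES_PER_S)
    # Both limits apply, so the stricter one decides.
    return max(by_count, by_bytes, 1)


# 10,000 records/s from the question; 1 KB per record is an assumed size.
print(shards_needed(10_000, 1024))  # 10
```

This is a lower bound that assumes partition keys spread records evenly. A skewed key distribution can throttle individual shards well before the aggregate limit is reached.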
19. A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?
20. A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?
21. A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?
22. A company uses AWS Glue to catalog data in S3. Data is partitioned by year, month, day. The Glue crawler runs daily but sometimes misses new partitions. What should be done to ensure all partitions are cataloged?
23. A data engineer is designing a streaming pipeline using Amazon Kinesis Data Analytics for Apache Flink. The pipeline reads from a Kinesis data stream and writes to an S3 bucket. The job must recover quickly from failures without reprocessing large amounts of data. Which TWO configurations should be used? (Choose TWO)
24. A company needs to build a data lake on AWS for analytics. The data includes structured, semi-structured, and unstructured data. The solution must support schema-on-read, provide fine-grained access control, and be cost-effective for storing rarely accessed data. Which THREE services should be used? (Choose THREE)
25. A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?
26. A company runs a real-time fraud detection system using Amazon Kinesis Data Streams with 100 shards. Data is consumed by a custom Java application running on Amazon EC2 instances in an Auto Scaling group. The application processes records and writes results to a DynamoDB table. Over the past month, the application has experienced intermittent slowdowns and the DynamoDB write capacity has been fully utilized during peak hours. The team wants to improve throughput without losing the ability to reprocess failed records. The application currently uses the Kinesis Client Library (KCL) with DynamoDB as the lease table. The team is considering the following changes:
A. Increase the number of EC2 instances to match the number of shards.
B. Switch to using AWS Lambda as the consumer to handle scaling automatically.
C. Increase the write capacity of the DynamoDB lease table to handle more workers.
D. Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput.
Which change should the team implement first to address the issue?
27. A retail company runs an e-commerce platform on AWS. They have a Data Engineering team that processes clickstream data using Amazon Kinesis Data Streams (KDS) with a shard count of 5. The data is consumed by an AWS Lambda function that transforms and loads the data into an Amazon S3 bucket partitioned by year/month/day/hour. Recently, the team has noticed that the Lambda function is experiencing throttling errors, and the KDS shard iterator age is increasing, indicating that the consumer cannot keep up with the incoming data rate. The team has already increased the Lambda reserved concurrency to 1000 and enabled batch window of 60 seconds. The metrics show that the Lambda function duration is well under the 5-minute timeout, and there are no errors in the transformation logic. The S3 write operations are not failing. Which course of action would MOST effectively resolve the issue without unnecessary cost or complexity?
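One thing to check in the scenario above is how many batches Lambda can process at the same time. For a Kinesis event source mapping, concurrency is bounded by the shard count multiplied by the parallelization factor, which ranges from 1 to 10 and defaults to 1. The sketch below illustrates that bound only. `max_concurrent_batches` is a hypothetical helper, not an AWS API.

```python
def max_concurrent_batches(shards, parallelization_factor=1, reserved_concurrency=None):
    """Upper bound on concurrent Lambda invocations for a Kinesis event source.

    Hypothetical helper: the bound is shards * parallelization factor,
    further capped by reserved concurrency when one is set.
    """
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    limit = shards * parallelization_factor
    if reserved_concurrency is not None:
        limit = min(limit, reserved_concurrency)
    return limit


# Values from the question: 5 shards, reserved concurrency of 1000.
print(max_concurrent_batches(5, 1, 1000))   # 5
print(max_concurrent_batches(5, 10, 1000))  # 50
```

With five shards, most of a reserved concurrency of 1000 is never used. The ceiling comes from the stream and the event source mapping, not from the function's concurrency setting.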
28. Drag and drop the steps to create an Amazon SageMaker notebook instance in the correct order.
29. Drag and drop the steps to perform hyperparameter tuning using SageMaker Automatic Model Tuning in the correct order.
30. Drag and drop the steps to set up Amazon SageMaker Ground Truth for a labeling job in the correct order.
31. Drag and drop the steps to set up cross-validation in a SageMaker training job using the built-in XGBoost algorithm in the correct order.
32. Match each AWS service to its primary purpose in a machine learning pipeline.
33. Match each AWS security service to its function in ML.
34. Match each data format to its typical use in AWS ML.
35. Match each SageMaker optimization technique to its description.
36. A data engineering team needs to process streaming data from thousands of IoT devices. They want to aggregate data in 1-minute windows and store results in an S3 data lake for downstream analytics. Which architecture should they use?
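The "1-minute windows" in the question above are tumbling windows: fixed-size, non-overlapping intervals where each event belongs to exactly one window. The sketch below shows the idea in plain Python, independent of any AWS service. The event tuple layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_s=60):
    """Average values per (window start, device) over fixed, non-overlapping windows.

    `events` is an iterable of (epoch_seconds, device_id, value) tuples;
    this layout is hypothetical and chosen only for the example.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (window_start, device) -> [total, count]
    for ts, device_id, value in events:
        # Floor the timestamp to the start of its window.
        window_start = int(ts // window_s) * window_s
        acc = sums[(window_start, device_id)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


events = [(0, "a", 1.0), (30, "a", 3.0), (60, "a", 5.0), (61, "b", 7.0)]
print(tumbling_window_averages(events))
# {(0, 'a'): 2.0, (60, 'a'): 5.0, (60, 'b'): 7.0}
```

A managed streaming engine does the same grouping continuously, and also handles late or out-of-order events, which this batch-style sketch does not.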
37. A company uses AWS Glue ETL jobs to transform CSV data from an S3 bucket into Parquet. The jobs often fail with memory errors when processing large datasets. They want to minimize cost and improve reliability. What should they do?
38. A machine learning team needs to create a training dataset by joining two large datasets (10 TB and 5 TB) stored in S3. The join key is 'user_id'. They want to minimize data movement and cost. Which approach should they use?
39. A company uses an Amazon SageMaker notebook to train a model using data from an S3 bucket. The IAM role attached to the notebook has the following policy. What is the MOST specific change needed to allow the notebook to read from the bucket 'ml-data-123'?
40. A data engineer needs to transform raw clickstream data (JSON files) stored in S3 into a partitioned Parquet dataset for querying with Athena. The transformation includes cleaning, deduplication, and enrichment. The pipeline should run daily. Which solution is MOST cost-effective and requires the least operational overhead?
41. A company uses Kinesis Data Streams to ingest real-time sensor data. The data is consumed by a Lambda function that writes to DynamoDB. During peak hours, the Lambda function throws ProvisionedThroughputExceededException. The team wants to decouple the write operation and improve resilience. What should they do?
42. A data engineer needs to design a data pipeline that ingests CSV files from an SFTP server daily, transforms them, and loads them into Amazon Redshift. The files are typically 2-3 GB. Which combination of AWS services is MOST appropriate?
43. A team stores raw data in S3 and uses a Glue Data Catalog for metadata. They want to allow data scientists to query the data with Amazon Athena using their existing IAM roles. What is the MINIMUM set of permissions required?
44. A company is building a near-real-time dashboard using data from multiple sources. They need to aggregate millions of events per second with sub-second latency. The architecture must be fully managed and minimize operational overhead. Which service should they use for the aggregation layer?
45. A data engineer needs to set up a data lake on S3 that supports both batch and streaming ingestion. The data must be queryable by Athena, Redshift Spectrum, and EMR. Which TWO configurations are essential? (Choose two.)
46. A team wants to move data from an on-premises Oracle database to Amazon S3 for analytics. The pipeline must run daily and handle incremental updates. Which THREE services should they use together? (Choose three.)
47. A company uses Amazon Kinesis Data Streams to ingest clickstream data. They need to archive raw data to S3 every hour and also enable real-time processing with sub-second latency. Which TWO actions should they take? (Choose two.)
48. An IAM policy attached to a SageMaker notebook role is shown. The data engineer tries to run an Athena query on a table in the 'my_database' Glue database. The query fails with an access denied error. What is the MOST likely cause?
49. A data engineer runs the AWS CLI command above to inspect a file in S3. They need to determine if the file was modified after a Glue ETL job processed it. What additional information could they obtain from this command?
50. A data engineer created a CloudFormation template for a Glue ETL job as shown. The job processes 500 GB of data and takes 90 minutes to complete. However, the job fails after 60 minutes. What is the MOST likely cause?
51. A data science team needs to process streaming data from thousands of IoT devices and perform real-time anomaly detection. The data must be persisted in Amazon S3 for batch processing later. Which combination of AWS services should be used to meet these requirements?
52. A company uses Amazon Redshift for its data warehouse. The data engineering team notices that queries are slow and wants to improve performance without changing the schema. Which action is most likely to improve query performance?
53. A data pipeline uses AWS Glue to transform data from Amazon RDS to Amazon S3. The team wants to ensure that only new or updated records are processed in each run, minimizing cost and time. Which AWS Glue feature should be used?
54. A company is using Amazon SageMaker to train machine learning models. The training data is stored in Amazon S3, but the data includes personally identifiable information (PII) that must be anonymized before training. What is the most efficient way to anonymize the data?
55. A data engineer needs to move 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. Which AWS service should be used to transfer the data most efficiently?
56. A team is building a data lake on Amazon S3 and using AWS Glue to catalog data. They notice that Glue crawlers are taking too long to update the catalog for a large dataset with millions of small files. Which approach will MOST improve crawler performance?
57. A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format, and the company wants to convert it to Parquet for efficient querying. Which configuration should be used?
58. A data engineer needs to run a one-time ETL job to transform 500 GB of data from Amazon RDS to Amazon S3. The job should be cost-effective and require minimal infrastructure management. Which AWS service should be used?
59. A company uses Amazon DynamoDB as the primary data store for a real-time application. The data science team wants to analyze the data using Amazon Athena. What is the most efficient way to make the DynamoDB data available for Athena queries?
60. A company is designing a data pipeline that ingests streaming data from social media feeds. The data must be processed in real-time to detect trending topics, and results must be stored in Amazon DynamoDB for low-latency access. Which services should the company use? (Choose TWO.)
61. A data engineer needs to transform and move 2 TB of data from an Amazon RDS for PostgreSQL instance to Amazon S3 daily. The transformation includes filtering, joining with data in S3, and aggregating. Which AWS services can be used together to accomplish this with minimal operational overhead? (Choose THREE.)
62. A company wants to build a data lake on Amazon S3. The data lake should support both batch and real-time data ingestion. Which AWS services should be used for data ingestion? (Choose TWO.)
63. A data engineer wants to stream clickstream data from a web application to Amazon S3 for near-real-time analytics. Which AWS service should be used to ingest and buffer the data before landing in S3?
64. A machine learning team needs to process a large dataset stored in Amazon S3 using Apache Spark. They want to minimize cost and avoid managing infrastructure. Which AWS service should they use?
65. A company uses AWS Glue to run ETL jobs on a daily schedule. The jobs are failing intermittently with 'OutOfMemory' errors. The data volume has grown 5x over the past month. Which is the MOST cost-effective fix?
66. A data scientist needs to query a dataset stored as Parquet files in Amazon S3 using standard SQL without managing any infrastructure. Which service should they use?
67. A team wants to build a data pipeline that processes incoming JSON files from an S3 bucket and loads them into a Redshift table. The pipeline must handle schema evolution and data validation. Which combination of services would be MOST appropriate?
68. A company uses Amazon Kinesis Data Analytics for real-time anomaly detection on a stream of IoT sensor data. The application is experiencing high latency. The data volume has doubled. Which action would MOST effectively reduce latency?
69. A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The company has a 1 Gbps internet connection. Which service would complete the transfer in the shortest time?
70. A company is building a data lake on Amazon S3. They need to enforce encryption at rest for all objects. Which combination of actions will achieve this? (Assume the bucket is versioned.)
71. A company uses Amazon EMR with Spark to process data daily. The job reads from S3 and writes to S3. Recently, the job started failing with 'S3AccessDenied' errors. The IAM role used by EMR has not changed. What is the MOST likely cause?
72. A company is designing a data pipeline to ingest data from multiple sources into an Amazon S3 data lake. The data must be encrypted at rest and in transit. Which TWO actions should be taken to meet these requirements?
73. A data engineering team uses AWS Glue to run ETL jobs. They notice that jobs are taking longer to complete as data volume grows. They want to optimize performance without increasing cost significantly. Which THREE strategies should they consider?
74. A company wants to analyze streaming data from IoT devices in near-real-time. They need to store raw data in Amazon S3 and also run SQL queries on the streaming data. Which TWO services should they use?
75. An IAM policy is attached to a group. A user in the group tries to read the object s3://data-lake-bucket/sensitive/file.txt from an IP address 192.168.1.1. What will happen?
76. A data engineer runs the CLI command to download an object from S3. The bucket owner is 123456789012, and the engineer's IAM user has s3:GetObject permission on the bucket. The object was uploaded by a different AWS account. What is the MOST likely reason for the AccessDenied error?
77. An S3 event notification is configured to trigger a Lambda function when new objects are created. The Lambda function processes the event JSON shown. Which field should the function use to read the new object from S3?
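The event JSON that the question above refers to is not reproduced on this page, so the sketch below uses a minimal sample event. Its bucket name, key, and size are made up. It follows the standard S3 notification layout, where each entry in `Records` carries the bucket under `s3.bucket.name` and the URL-encoded object key under `s3.object.key`.

```python
from urllib.parse import unquote_plus


def object_locations(event):
    """Return (bucket, key) pairs for every record in an S3 notification event."""
    locations = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Keys arrive URL-encoded (spaces as '+'), so decode before calling S3.
        key = unquote_plus(s3["object"]["key"])
        locations.append((bucket, key))
    return locations


# Hypothetical sample event; the values are illustrative only.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/2024/report+v1%3Afinal.json", "size": 1024},
            },
        }
    ]
}
print(object_locations(sample_event))
# [('example-bucket', 'incoming/2024/report v1:final.json')]
```

Skipping the decode step is a common source of "NoSuchKey" errors when object names contain spaces or special characters.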
78. A data engineer needs to ingest streaming data from an on-premises Kafka cluster into Amazon S3 with minimal operational overhead. Which AWS service should be used to stream the data into S3 without managing servers?
79. A company is using AWS Glue ETL jobs to process data stored in Amazon S3. The jobs currently run sequentially and take too long. The data engineer wants to reduce job duration without rewriting the code. Which action is most effective?
80. A data science team uses Amazon SageMaker to train models on a dataset stored in Amazon S3. The dataset is 2 TB and is accessed by multiple training jobs. The team notices that training jobs are slow due to high S3 GET request latency. Which solution would provide the fastest and most cost-effective data access?
81. A company runs a daily ETL job that reads data from Amazon RDS, transforms it using AWS Glue, and writes the results to Amazon S3. The job started failing yesterday with the error: 'Rate exceeded'. What is the most likely cause and solution?
82. A company wants to analyze historical data stored in Amazon S3 using Amazon Athena. The data is in CSV format and is partitioned by date. Which action will provide the best query performance and cost optimization?
83. A company uses AWS Lake Formation to manage permissions on a data lake stored in Amazon S3. A data analyst tries to query a table using Amazon Athena but receives an 'Access Denied' error. The analyst has SELECT permission on the table in Lake Formation. What is the most likely cause?
84. A data pipeline uses Amazon Kinesis Data Streams with a Lambda consumer to process clickstream data. The Lambda function sometimes times out because of spikes in traffic. The team wants to buffer the data before processing to handle spikes. Which approach is most effective?
85. A company runs a nightly AWS Glue ETL job that processes data from an Amazon Redshift cluster and writes to Amazon S3. The job fails intermittently with 'ERROR: cannot execute INSERT in a read-only transaction'. What is the most likely cause?
86. A company uses Amazon EMR to run Spark jobs on a large dataset stored in Amazon S3. The jobs are failing with 'OutOfMemoryError' in the executors. The data is not skewed. Which configuration change will most likely resolve the issue?
87. A data engineer is designing a data ingestion pipeline that will receive up to 5 GB of data per hour from thousands of IoT devices. The data must be stored in Amazon S3 and analyzed in near real-time. Which TWO services should be used together to meet these requirements? (Choose TWO.)
88. A company needs to transfer 10 TB of data from an on-premises data center to Amazon S3. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 5 days. Which TWO options are viable? (Choose TWO.)
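Transfer questions like the one above usually come down to a quick feasibility check: how long would the data take over the stated link? The sketch below does that arithmetic with decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second). It assumes the link is fully available unless a lower `utilization` is passed, which is optimistic in practice. `transfer_days` is a hypothetical helper name.

```python
def transfer_days(data_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Days needed to move `data_tb` terabytes over a `link_mbps` link.

    Hypothetical helper; ignores protocol overhead and retries.
    """
    bits = data_tb * 1e12 * 8                        # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86_400                          # seconds per day


# Values from the question: 10 TB over 100 Mbps, with a 5-day deadline.
print(round(transfer_days(10, 100), 2))  # 9.26
```

Even at full line rate the online transfer takes more than 9 days, which is well past the 5-day limit. The same function covers the other transfer scenarios in this list: 50 TB takes about 46.3 days at 100 Mbps and about 4.63 days at 1 Gbps.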
89. A company uses Amazon Redshift for data warehousing. The data engineering team notices that query performance has degraded over time. Which THREE actions should the team take to improve performance? (Choose THREE.)
90. A data engineer is building a data pipeline that ingests streaming data from IoT devices. The data must be processed in near real-time and stored in Amazon S3 for further analysis. Which AWS service should be used to capture and process the streaming data before storing it in S3?
91. A machine learning team needs to preprocess large volumes of clickstream data stored in Amazon S3 before training a model. The preprocessing includes data cleaning, feature engineering, and normalization. The team wants to use a serverless solution that minimizes operational overhead. Which combination of services should the team use?
92. A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time data. The data source is a Kinesis data stream, and the output is written to an S3 bucket. Recently, the processing latency has increased significantly. The team suspects that the Flink application is encountering backpressure. Which metric should the team monitor to confirm backpressure?
93. A data scientist wants to query a dataset stored in Amazon S3 using standard SQL without provisioning any servers. The dataset is in CSV format and is updated daily. Which AWS service should be used?
94. A company is building a data pipeline to process sensitive customer data. The pipeline uses AWS Glue for ETL and stores results in Amazon S3. The security team requires that all data be encrypted at rest in S3 using customer-managed AWS KMS keys. Additionally, the Glue job must be able to write encrypted data to S3. What should the data engineer do to meet these requirements?
95. A large e-commerce company is using Amazon DynamoDB as the source for real-time analytics. The data is streamed to Amazon Kinesis Data Streams using DynamoDB Streams and then processed by an AWS Lambda function. The Lambda function writes the data to an Amazon Elasticsearch Service cluster for search and visualization. Recently, the Lambda function has been failing with throttling errors from the Elasticsearch cluster. What is the MOST effective way to handle this?
96. A company is using Amazon S3 as a data lake. The data engineering team needs to catalog the schema of the data and make it available for querying with Amazon Athena. Which AWS Glue component should be used?
97. A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 100 Mbps connection to AWS. The transfer must be completed within one week. Which approach should the engineer use?
98. A company is using Amazon Redshift for data warehousing. The data engineering team notices that queries are slow and the system is frequently writing to disk due to insufficient memory. Which type of workload management (WLM) configuration change would help reduce disk writes?
99. Which TWO AWS services can be used to move data from an on-premises database to Amazon S3 on a recurring schedule without writing custom code? (Choose 2.)
100. Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose 3.)
101. Which TWO AWS services can be used to schedule and orchestrate a data pipeline that includes multiple steps such as data extraction, transformation, and loading? (Choose 2.)
102. A data engineer needs to analyze large CSV files stored in Amazon S3 using SQL queries. The data is not frequently accessed, and cost is a primary concern. Which AWS service should be used to query the data directly in S3 without moving it?
103. A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be transformed before being stored in Amazon S3. The transformations include enrichment with reference data from Amazon DynamoDB. Which AWS service should be used to perform the transformation with minimal operational overhead?
104. A data engineering team is designing a data lake on Amazon S3. Raw data is ingested in JSON format and must be partitioned by year, month, and day. The team expects high query performance for recent data but infrequent queries for older data. The data is immutable. Which storage tier configuration minimizes costs while meeting performance requirements?
105. A data engineer is tasked with building a system to process a continuous stream of IoT sensor data. The data must be processed in near real-time, and the results must be stored in Amazon S3 partitioned by hour. Which AWS service is the most cost-effective and simplest to implement?
106. A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are currently failing due to insufficient memory. The data volume varies, with occasional spikes. Which solution should be used to handle the variable memory requirements efficiently?
107. A data pipeline uses Amazon Kinesis Data Streams to ingest event data. The data is consumed by an AWS Lambda function, which writes to Amazon DynamoDB. The Lambda function is experiencing throttling errors, and the DynamoDB write capacity is underutilized. The events must be processed in order per shard. Which solution most effectively addresses the throttling?
108. A data scientist needs to run a one-time SQL query on a large dataset in Amazon S3. The dataset is stored in Parquet format and is about 500 GB. The query requires complex aggregations and joins. Which AWS service should be used to minimize cost and setup time?
109. A company is building a data lake on Amazon S3. Raw data is ingested from multiple sources in different formats (CSV, JSON, Parquet). The data must be cataloged and made queryable using Amazon Athena. The data schema may evolve over time. Which approach minimizes manual effort and supports schema evolution?
110. An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to Amazon S3. The data arrives at unpredictable rates, with occasional bursts. The company needs to ensure data is delivered within 60 seconds of ingestion, and the data must be partitioned by year/month/day/hour. Which configuration meets these requirements?
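To make the year/month/day/hour layout in the question above concrete, the sketch below builds the S3 key prefix for a given event time. It uses Hive-style `key=value` folder names, which is one common convention that Athena and Glue recognize as partitions. The question does not prescribe that style, and both the `clickstream` base prefix and the function name are assumptions.

```python
from datetime import datetime, timezone


def hourly_prefix(ts: datetime, base: str = "clickstream") -> str:
    """Build a Hive-style hourly partition prefix in UTC (hypothetical helper).

    `ts` must be timezone-aware; naive datetimes would be read as local time.
    """
    ts = ts.astimezone(timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


print(hourly_prefix(datetime(2024, 3, 7, 9, 30, tzinfo=timezone.utc)))
# clickstream/year=2024/month=03/day=07/hour=09/
```

Zero-padded components keep the prefixes in lexicographic order, so listing a month or a day returns its partitions chronologically.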
111. Which TWO AWS services can be used to transform data in transit before storing it in Amazon S3? (Choose TWO.)
112. A company is designing a data pipeline to analyze customer behavior. The pipeline must handle real-time streaming data and batch data. The data must be stored in a data lake on Amazon S3 and also made available for interactive queries. Which THREE services should be combined to build this pipeline? (Choose THREE.)
113. A data engineering team is migrating on-premises Hadoop workloads to AWS. The workloads include batch processing using Apache Spark and interactive SQL queries. The data is stored in HDFS. Which TWO AWS services should be used to replace HDFS and provide a scalable, durable storage layer? (Choose TWO.)
114. A data engineering team needs to ingest streaming data from thousands of IoT devices into a data lake on Amazon S3 for near-real-time analytics. The data must be partitioned by device ID and timestamp, and the team must minimize data loss during ingestion failures. Which solution is MOST appropriate?
115. A data scientist needs to query a 2 TB dataset stored in Amazon S3 using Amazon Athena. The data is in CSV format and is used for exploratory analysis. Queries are currently slow and expensive. Which action will improve query performance and reduce cost?
116. A company uses AWS Glue ETL jobs to process data from an Amazon RDS for MySQL database into Amazon S3. The job runs daily and takes 6 hours to complete. The team wants to reduce runtime and cost. The source table has 50 million rows and is updated continuously. Which combination of changes would be MOST effective?
117. A data pipeline uses Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by an AWS Lambda function that transforms and writes to Amazon DynamoDB. The Lambda function is throttled during traffic spikes, causing data to be reprocessed. Which solution should the team implement to handle the throttling without losing data?
118. A company wants to use Amazon SageMaker to train a model on a dataset stored in Amazon S3. The dataset is 100 GB and consists of millions of small JSON files. What should the data engineering team do to optimize training performance?
119. A financial services company needs to build a data lake on Amazon S3 that meets regulatory requirements for data retention and encryption. Data must be encrypted at rest and in transit, and access must be audited. The data lake will be queried by Amazon Athena and Amazon Redshift Spectrum. Which combination of actions should be taken?
120. A data engineering team is building a pipeline to process terabytes of log data daily using Amazon EMR with Spark. The data arrives in hourly batches and must be processed within 4 hours. The team needs to minimize cost. Which cluster configuration is MOST cost-effective?
121. A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 1 Gbps connection to AWS. The transfer must be completed within 10 days. What is the MOST efficient approach?
122. A company runs a real-time fraud detection pipeline using Amazon Kinesis Data Analytics. The pipeline reads from a Kinesis data stream, performs sliding window aggregations, and writes results to a DynamoDB table. The application is experiencing high latency during peak hours. Which action would MOST effectively reduce latency?
123. A data engineering team is designing a data pipeline to process streaming data from social media feeds. The data must be deduplicated, enriched with customer information from a relational database, and stored in Amazon S3 in Parquet format. Which AWS services should the team use to build this pipeline? (Select TWO.)
124A company uses AWS Glue Data Catalog to manage metadata for its data lake on Amazon S3. The data lake contains terabytes of data in CSV format. The data engineering team wants to improve query performance in Amazon Athena and reduce costs. Which actions should the team take? (Select THREE.)
125A data engineering team needs to schedule a nightly ETL job that extracts data from an Amazon RDS for PostgreSQL instance, transforms it using Spark, and loads it into Amazon S3. The team wants to use AWS Glue for this task. Which components are required? (Select TWO.)
126Refer to the exhibit. An IAM policy is attached to a data engineering role. The role is used by an AWS Glue ETL job that reads from 'raw/' and writes to 'processed/'. The job fails with an access denied error when trying to write to 'processed/'. What is the likely cause?
127Refer to the exhibit. A data engineer examines the output of 'aws glue get-job-run' for a failed job. The job run state is FAILED, but ErrorMessage is empty. The job ran for 3600 seconds (1 hour) before failing. What is the MOST likely cause of the failure?
128Refer to the exhibit. A CloudFormation template creates an S3 bucket. The data engineering team stores daily log files in this bucket and queries them using Amazon Athena. After 30 days, queries on logs older than 30 days start failing with 'Access Denied' errors. What is the MOST likely reason?
129A company captures streaming data from IoT devices using Amazon Kinesis Data Streams. The data is consumed by a custom application that processes records in near real-time. Recently, the application has been falling behind, and the stream is showing increased 'iterator age' metrics in CloudWatch. Which action is MOST likely to reduce the iterator age?
130A data engineer needs to build a pipeline that ingests CSV files from an S3 bucket, validates the schema, and loads the data into an Amazon Redshift cluster. The pipeline must handle schema evolution gracefully by adding new columns as they appear in the source files. Which combination of AWS services and configurations would meet these requirements with minimal operational overhead?
131A machine learning team is preparing a dataset for model training. The data is stored in an Amazon S3 bucket with objects that are each approximately 100 MB in size. The team wants to use Amazon SageMaker for training. To optimize training performance, which data format and storage configuration should be used?
132A company uses Amazon Kinesis Data Analytics for Apache Flink to process real-time clickstream data. The application uses event time and watermarks for windowed aggregations. The team notices that the output from tumbling windows is delayed, and many late records are being dropped. What is the MOST likely cause?
133A research lab stores large genomic datasets in Amazon S3 Glacier Deep Archive. They need to run a one-time analysis on a subset of 10 PB of data. The analysis will use an Amazon EMR cluster with Amazon S3 as the data source. What is the MOST cost-effective and performant way to make the data available for the EMR cluster?
134An ML engineer is using Amazon SageMaker to train a model on a dataset that contains personally identifiable information (PII). The data must be encrypted at rest and in transit. The company uses AWS KMS for key management. How should the engineer configure the SageMaker training job to meet these encryption requirements?
135A company uses AWS Glue ETL jobs to transform data from Amazon RDS for MySQL to Amazon S3. The transformation includes aggregations and joins. The job runs daily and processes approximately 100 GB of data. Recently, the job started failing with memory errors on the worker nodes. Which approach would MOST effectively resolve the issue without changing the logic?
136A data scientist needs to run a one-time training job on a 5 TB dataset stored in Amazon S3. The training algorithm requires random access to individual records. Which SageMaker input mode and data format combination would be MOST appropriate?
137A team is building a data pipeline using Amazon Kinesis Data Firehose to deliver real-time clickstream data to an Amazon S3 bucket. The data must be partitioned by year, month, day, and hour. Which configuration should the team use to achieve this?
138A company is building a data lake on Amazon S3 and wants to ensure that data is encrypted at rest using AWS KMS. Which TWO actions are required to achieve this? (Choose TWO.)
139A company is using Amazon DynamoDB as a source for a machine learning pipeline. The data is exported nightly to Amazon S3 using DynamoDB Streams and an AWS Glue job. The Glue job reads the stream records, transforms them, and writes to S3 in Parquet format. The team notices that the Glue job is taking too long and consuming high DynamoDB read capacity. Which THREE actions would reduce the load on DynamoDB and improve performance? (Choose THREE.)
140A data engineer is designing a data pipeline that uses Amazon Kinesis Data Streams to ingest sensor data. The data must be processed in real-time, and the results must be stored in Amazon DynamoDB. Which TWO AWS services can be used together to achieve this? (Choose TWO.)
141An IAM policy is attached to a data engineering role that writes to an S3 bucket. The policy is shown in the exhibit. What is the effect of this policy?
142An ML engineer runs the AWS CLI command shown in the exhibit on a file in S3. The engineer wants to use this file in a SageMaker training job. What does the output reveal about the data?
143An AWS Glue job is failing with an error that it cannot access an S3 bucket. The IAM role attached to the Glue job is shown in the exhibit. What is the MOST likely cause of the failure?
144A data engineer needs to process streaming data from an IoT fleet and store the results in Amazon S3 for analysis. The solution must be serverless and handle data that arrives at irregular intervals. Which AWS service should be used to ingest the data?
145A machine learning team is building a real-time inference pipeline using Amazon SageMaker. The input data is located in an S3 bucket, and the team needs to transform the data before inference using a custom Python script. The transformation should run on a serverless infrastructure and must be triggered automatically when new data arrives in S3. Which combination of services should the team use?
146A data engineer needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3 for ML training. The data is currently stored in HDFS and is compressible. The network bandwidth between the on-premises data center and AWS is 1 Gbps. The team needs to minimize the time to transfer and also wants to avoid any downtime for the on-premises system. Which solution meets these requirements?
147A company uses Amazon Kinesis Data Streams for real-time clickstream analysis. The data is consumed by a Lambda function that enriches the records and stores them in Amazon S3. Recently, the Lambda function has been failing with throttling errors, and the consumer is falling behind. The team needs to increase the throughput of the consumer without changing the data format or the Lambda function code. What should the team do?
148A financial services company is building a fraud detection model that requires joining real-time transaction data with a reference dataset of known fraudulent accounts stored in Amazon DynamoDB. The solution must minimize latency and be highly available. The reference dataset is updated frequently (every few minutes). Which architecture should the team use?
149A data scientist needs to perform exploratory data analysis on a 100 GB CSV file stored in Amazon S3. The data is not sensitive. The scientist wants to use SQL queries to filter and aggregate the data without setting up a server or moving the data. Which service should be used?
150A company runs a data lake on Amazon S3 with partitions by year/month/day. A machine learning team needs to read daily data from the last 30 days for model retraining. The data format is Parquet. The team uses Amazon Athena to query the data, but the queries are slow and scanning too much data. The team has already optimized the file sizes and compression. What additional step can reduce the amount of data scanned?
151A company is using Amazon SageMaker to train a model on a dataset that is updated daily. The data is stored in an S3 bucket. The training pipeline uses AWS Step Functions to orchestrate data preprocessing and model training. The preprocessing step uses a SageMaker Processing job that reads data from S3, cleans it, and writes the output back to S3. The team notices that the preprocessing step often fails due to insufficient disk space on the processing instance. Which change should the team make to resolve this issue without increasing cost?
152A team is building a data pipeline that ingests data from an Amazon S3 bucket, transforms it using AWS Glue, and loads it into Amazon Redshift for analysis. The Glue job runs on a schedule every hour. The team has noticed that the job takes longer than expected and sometimes fails due to memory issues. The data volume is variable, with occasional spikes. Which solution should the team implement to optimize the pipeline?
153Which TWO AWS services can be used to transform data in a streaming fashion without using a persistent cluster? (Choose 2.)
154Which THREE factors should a data engineer consider when choosing between Amazon S3 and Amazon Redshift for storing large datasets used for machine learning? (Choose 3.)
155A company is using Amazon Kinesis Data Streams with a Lambda consumer. The Lambda function writes results to an S3 bucket. The team wants to ensure that each record is processed exactly once and in order. Which TWO configurations should the team implement? (Choose 2.)
156Refer to the exhibit. An ML engineer applies this bucket policy to an S3 bucket. The SageMaker execution role MySageMakerRole is used to train a model. The training data is located in s3://my-bucket/data/. The SageMaker training job fails with an access error. What is the most likely cause?
157Refer to the exhibit. An ML engineer runs the above CLI command to inspect files in an S3 bucket. The training data consists of 200 CSV files, each 1 GB. The engineer plans to use Amazon SageMaker to train a model using this data. What should the engineer do to optimize training performance?
158A company runs a real-time recommendation system that uses Amazon SageMaker endpoints for inference. The system ingests user activity data from a mobile app via Amazon API Gateway and AWS Lambda, which writes events to an Amazon Kinesis Data Stream. A second Lambda function consumes the stream, calls a SageMaker endpoint to generate recommendations, and stores the results in Amazon DynamoDB. The system has been working well, but recently the team noticed an increase in latency from the time a user action occurs to when the recommendation is stored. The SageMaker endpoint shows increased invocation latency but no throttling. CloudWatch metrics show that the Kinesis stream's IteratorAgeMilliseconds is increasing, indicating the consumer is falling behind. The Lambda consumer's duration is within limits, but the number of invocations is lower than expected. The team suspects the issue is with the event source mapping. Which course of action should the team take to reduce the latency?
159A data engineering team needs to process streaming data from thousands of IoT devices. The data must be ingested with low latency and processed in near real-time to detect anomalies. Which AWS service should they use for ingestion?
160A company is using AWS Glue to run ETL jobs that process data from an Amazon RDS for PostgreSQL database. The jobs are failing with connection timeouts. The security group for the RDS instance allows inbound traffic from the Glue job's security group. What is the most likely cause?
161A data scientist needs to run ad-hoc SQL queries on a large dataset stored in Amazon S3 (Parquet format, 2 TB). The queries are interactive and require sub-second response times. Which service should they use?
162A company is using Amazon Kinesis Data Firehose to load streaming data into Amazon S3. The data is in JSON format, and they want to convert it to Parquet before storage. What should they configure?
163A data engineering team needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The data is currently stored in HDFS. Which service should they use for an efficient transfer?
164An e-commerce company uses Amazon DynamoDB as the primary data store for user sessions. They want to run analytics on historical session data using Amazon Athena. What is the recommended approach to export DynamoDB data to S3 in a format optimized for Athena?
165A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should they use for scheduling?
166A company is using AWS Glue Data Catalog as the metadata store for their data lake. They have multiple AWS accounts and want to share the catalog across accounts. Which feature should they use?
167A data engineer is designing a data pipeline that ingests 500 GB of data daily from an on-premises Oracle database to Amazon S3. The pipeline must minimize data loss and support change data capture (CDC). Which combination of services should they use?
168Which TWO of the following are valid ways to reduce query costs in Amazon Athena? (Choose 2)
169Which THREE of the following are best practices for optimizing performance of Amazon EMR clusters? (Choose 3)
170Which TWO services can be used to transform data in transit within a Kinesis Data Firehose delivery stream? (Choose 2)
171A financial services company uses Amazon Kinesis Data Streams with 50 shards to ingest real-time stock trade data. The data is consumed by a custom Java application running on Amazon EC2 instances. Recently, the application has been experiencing high latency, and CloudWatch metrics show that the average iterator age is increasing. The application uses the Kinesis Client Library (KCL) with DynamoDB for lease tracking. The EC2 instances are in an Auto Scaling group with a minimum of 2 and maximum of 10 instances, and the current CPU utilization is below 50%. The team wants to reduce latency without increasing costs significantly. What should they do?
172A media company ingests video metadata from multiple sources into an Amazon S3 bucket. Each metadata record is a JSON file about 2 KB. They use AWS Glue ETL jobs to process these files and load them into Amazon Redshift for analytics. The jobs currently run hourly and take about 10 minutes to process all new files. However, the company is growing and expects the number of files to increase 100x. The data engineering team wants to minimize processing time and cost. The Glue job currently reads all files from the S3 bucket using a full scan. What should they do to optimize the pipeline?
173A retail company uses Amazon Redshift for its data warehouse. The data engineering team runs ETL jobs that load data from multiple sources into Redshift daily. They notice that the load performance is slow and the cluster CPU utilization is high during the ETL window. The team wants to improve load performance without changing the cluster configuration. They currently load data using INSERT statements from a staging table. What should they do?
174A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The data arrives in bursts and must be processed with minimal latency. Which AWS service is most appropriate for the ingestion layer?
175A company is building a data pipeline using AWS Glue to transform data from Amazon RDS to Amazon S3. The pipeline runs daily and processes about 500 GB of data. The team notices that the job is taking longer than expected. Which change would MOST improve the job performance?
176A data engineer is designing a data lake on Amazon S3 that must support both batch and streaming analytics. The data comes in Parquet format and needs to be queryable by Amazon Athena. Which partitioning strategy will optimize query performance and reduce costs?
177A company uses AWS Lambda to process events from Amazon S3. The Lambda function transforms the data and writes results to another S3 bucket. Recently, the function has been failing due to timeout errors when processing large files. Which solution should the data engineer implement?
178A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 1 Gbps internet connection and wants to complete the transfer within 5 days. What is the MOST cost-effective and reliable solution?
179A company uses Amazon EMR to run Spark jobs on a transient cluster that processes data from S3. The jobs are failing with 'OutOfMemory' errors. The data engineer has already increased the executor memory. Which additional configuration change would MOST likely resolve the issue?
180A data engineering team needs to orchestrate a complex workflow that involves multiple AWS Glue jobs, Lambda functions, and S3 operations. The workflow must run on a schedule and allow monitoring of each step. Which AWS service should they use?
181A company is using Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by a fleet of EC2 instances running a custom consumer application. The consumer is falling behind and the shard iterator age is increasing. Which TWO actions should the data engineer take to improve consumer performance? (Choose TWO.)
182A data engineer is designing a data pipeline to process streaming data from Amazon Kinesis Data Streams and store the results in Amazon S3 in Parquet format. The data must be available for querying in Amazon Athena within minutes of arrival. Which THREE services should be used together? (Choose THREE.)
183A company wants to centralize logging from multiple AWS accounts and on-premises servers. The logs must be stored cost-effectively and be searchable. Which TWO services should be used? (Choose TWO.)
184A data engineer is troubleshooting an AWS Glue job that reads from an S3 bucket and writes to another S3 bucket. The job fails with an 'Access Denied' error when trying to write to the output bucket. The IAM policy attached to the Glue service role is shown. What is the MOST likely cause of the failure?
185A data engineer runs the above CLI command and sees that the bucket contains many small Parquet files (1 MB each) under the prefix. When querying this data with Athena, the query performance is poor and costs are high. Which approach would MOST improve performance and reduce cost?
186A company runs a data pipeline using AWS Glue ETL jobs that process about 10 TB of data daily from Amazon S3. The jobs are triggered by a schedule and write results to a separate S3 bucket. Recently, the jobs have been taking longer to complete, and the data engineering team has observed that the number of files in the source bucket has increased significantly, from thousands to millions of small files (each about 100 KB). The Glue jobs are configured to use the 'Group Files' option, but performance is still poor. The team needs to improve the job performance without changing the source data generation process. Which course of action should the team take?
187An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to an Amazon S3 bucket. The data is then queried using Amazon Athena. The marketing team wants to run daily reports that aggregate click events by product ID. However, the reports are slow because Athena scans the entire dataset each time. The data is partitioned by date (e.g., s3://bucket/clickstream/2023/01/01/). The product ID is a column within the data. The data engineering team wants to improve query performance without moving the data to another service. Which approach should the team take?
188A startup is building a data pipeline that ingests data from multiple sources into an Amazon S3 data lake. The data includes CSV files from legacy systems, JSON from web APIs, and Avro from mobile apps. The data must be transformed into Parquet format and cataloged for querying with Amazon Athena. The pipeline must be serverless and minimize operational overhead. The team has decided to use AWS Glue for ETL and cataloging. However, they are concerned about the cost of running Glue jobs continuously. The data arrives in small batches every 10 minutes. Which approach should the team use to minimize cost while meeting the requirements?
189A data engineer is building a streaming pipeline using Amazon Kinesis Data Streams and AWS Lambda. The Lambda function processes records and writes results to Amazon S3. The engineer notices that the Lambda function is experiencing throttling and some records are being dropped. Which TWO actions should the engineer take to improve the reliability of the pipeline?
190A machine learning team is using Amazon SageMaker to train a model on a dataset stored in S3. The training job reads data from S3 using Pipe input mode, but the training is slow. The team wants to improve data throughput. Which THREE actions should they take?
191A data engineer is designing a data lake on Amazon S3. The data is collected from IoT devices and is highly variable in volume. The engineer needs to ensure that the data is ingested reliably and can be processed in near real-time. Which AWS service should be used to ingest the data into the data lake?
192A data engineer has attached the above IAM policy to an IAM role used by an AWS Glue ETL job. The job reads from and writes to 'my-data-bucket'. The job is failing with an Access Denied error. What is the most likely cause?
193A machine learning engineer is using Amazon SageMaker to train a model. The training dataset is 2 TB and is stored in Amazon S3. The engineer wants to reduce the training time by improving data loading performance. Which data ingestion mode should be used?
194A data engineer needs to transform a large dataset stored in Amazon S3 using Apache Spark. The engineer wants to minimize costs and avoid managing infrastructure. Which AWS service should be used?
195A data engineer is investigating a slow Athena query on a partitioned table. The table is partitioned by year, month, and day, and the data is stored in S3 with the prefix pattern 'raw/YYYY/MM/DD/'. The engineer runs the above CLI command and sees that there are many small files. Which action would most improve query performance?
196A data engineering team is building a real-time fraud detection pipeline. The pipeline ingests transaction data from an Amazon Kinesis Data Stream with 10 shards. Each shard produces about 500 records per second, each record is 2 KB. The data is processed by a Lambda function that runs for about 200 ms and then writes results to an Amazon DynamoDB table. The team notices that the Lambda function is experiencing a high number of throttles, and there are increasing numbers of records being retried. The Lambda function's reserved concurrency is set to 100. The DynamoDB table has 100 read capacity units and 100 write capacity units. Which change would most effectively reduce throttling and improve processing throughput?
197A machine learning team is preparing a large dataset for training. The dataset consists of 10,000 CSV files, each about 100 MB, stored in Amazon S3. The team wants to transform the data using AWS Glue ETL jobs. The transformation involves filtering rows, adding new columns, and joining with a small reference table (100 KB). The team is concerned about job performance and cost. They currently have a Glue job with 10 DPU (Data Processing Units) and it takes about 2 hours to complete. The team wants to reduce the runtime and cost. Which approach should they take?
198A data engineer is tasked with building a data pipeline that moves data from an on-premises database to Amazon S3 for analytics. The database is a MySQL instance that is 2 TB in size. The company has a 1 Gbps dedicated network connection to AWS (AWS Direct Connect). The data must be transferred once daily. The engineer needs to choose the most efficient and reliable service for this task. Which service should they use?
199A data engineering team is using Apache Spark on Amazon EMR to process streaming data from Amazon Kinesis Data Streams. The Spark application uses structured streaming to read from Kinesis, perform transformations, and write to Amazon S3 in Parquet format. The team notices that the application is falling behind and the processing latency is increasing. The Kinesis stream has 5 shards, and the EMR cluster has 5 core nodes of type r5.xlarge. The Spark application is configured with 5 executors, each with 2 cores and 8 GB memory. The team wants to reduce processing latency. Which change would be most effective?
200A data engineer needs to continuously ingest streaming data from thousands of IoT devices and store the raw data in Amazon S3 for archival processing. The data volume varies significantly throughout the day, and the solution must be serverless, scalable, and cost-effective. Which AWS service should be used to capture and buffer the streaming data before writing to S3?
201A company is running a machine learning training job on Amazon SageMaker that reads training data from an S3 bucket. The job fails intermittently with an S3 throttling error. The data is partitioned across thousands of small files (average 100 KB). Which strategy is MOST effective to resolve the throttling issue?
202A data scientist wants to explore a large dataset stored in Amazon S3 using SQL queries without moving the data. The dataset is in CSV format and is updated daily with new partitions. Which AWS service should be used to directly query the data in S3?
203A company is building a data pipeline that ingests data from multiple sources into a centralized data lake on Amazon S3. The data must be transformed before it is available for analysis. The pipeline should be event-driven, automatically triggering transformation jobs when new data arrives. Which combination of AWS services should be used?
204A data engineering team is designing a data lake on Amazon S3. They need to enforce encryption at rest for all data stored in the bucket. The security policy requires that the encryption keys be managed by the organization using AWS Key Management Service (KMS), and that the bucket must deny uploads of unencrypted objects. Which bucket policy should be applied?
205A company uses Amazon RDS for its transactional database and needs to export a daily snapshot of a table to Amazon S3 in Parquet format for analytics. Which AWS service can perform this export without writing custom code?
206A company is streaming data from thousands of devices using Amazon Kinesis Data Streams. The data is consumed by an AWS Lambda function that processes each record. The Lambda function is experiencing high error rates and throttling due to the volume of data. Which action would MOST effectively improve the processing throughput and reduce errors?
207A data scientist needs to run complex ETL transformations on a large dataset stored in Amazon S3. The transformations are written in PySpark and require occasional access to a Hive metastore. The solution should minimize operational overhead and allow the data scientist to focus on code development. Which AWS service should be used?
208A company wants to perform real-time analytics on streaming data from clickstreams. The data needs to be ingested, processed, and made available for querying within seconds. Which AWS service should be used for the processing step?
209A company is using AWS Glue to catalog metadata from various data sources. The crawler is configured to run daily. However, the catalog is not reflecting new partitions added to an S3 bucket during the day. What is the MOST likely cause?
210A data engineer is designing a data pipeline that ingests data from a relational database into a data lake on Amazon S3. The data must be incrementally loaded daily. Which TWO AWS services can be used together to achieve this?
211A company wants to use Amazon SageMaker to train a model using data stored in Amazon S3. The data is sensitive and must be encrypted at rest and in transit. Which THREE steps should be taken to ensure data security?
212A data engineer needs to collect and analyze log data from multiple EC2 instances in real-time. The solution should be serverless and scalable. Which TWO AWS services should be used?
213A company is using AWS Glue ETL jobs to transform data. The jobs are failing due to insufficient memory. The data processing involves complex joins and aggregations. Which THREE actions can improve job performance and reduce memory usage?
214A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The solution must handle data that arrives in bursts and must be able to reprocess failed records automatically. Which combination of AWS services should the team use?
215A data engineer is designing a data pipeline that transforms raw JSON files (each 50-200 KB) in Amazon S3 into Parquet format using AWS Glue. The pipeline must minimize data processing costs and handle a high volume of small files (millions per day). The engineer configures a Glue ETL job with Spark, but the job is slow and expensive due to the overhead of reading many small files. Which optimization should the engineer implement to reduce cost and improve performance?
216A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data contains personally identifiable information (PII) that must be redacted before storage. Which AWS service can be integrated with Kinesis Data Firehose to transform the data in real time?
217A data engineering team needs to build a data lake on Amazon S3 that will be queried by Amazon Athena and Amazon Redshift Spectrum. The data will be ingested from multiple sources in various formats (CSV, JSON, Parquet). Which partitioning strategy will provide the best query performance for date-range queries?
218A company has an AWS Glue ETL job that reads data from an Amazon RDS for MySQL table and writes to Amazon S3 in Parquet format. The job runs daily and processes 500 GB of data. Recently, the job has been failing with memory errors during the write phase. The data schema is wide (200 columns). Which change should a data engineer make to the Glue job to resolve the memory issue?
219A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 100 Mbps internet connection and a tight deadline of two weeks. Which AWS service should the engineer use to transfer the data most efficiently?
220A data engineering team is building a real-time fraud detection system. Transactions are ingested via Amazon Kinesis Data Streams, and a machine learning model (deployed on Amazon SageMaker) scores each transaction. The team needs to store the raw transactions and the model's predictions in Amazon S3 for later analysis. Which architecture should the team use?
221A company uses Amazon Redshift for its data warehouse. The data engineering team needs to load 10 TB of data from Amazon S3 into Redshift every night. The team wants to minimize the load time and use the fewest number of COPY commands. The data is in CSV format and is partitioned by date in S3. Which approach should the team take?
222A data engineer needs to schedule an AWS Glue ETL job to run every hour. The job reads from an Amazon DynamoDB table and writes to Amazon S3. Which AWS service should the engineer use to trigger the Glue job on schedule?
223A data engineering team is designing a data pipeline that processes streaming data from Amazon Kinesis Data Streams using AWS Lambda. The team notices that some records are being processed multiple times (duplicates). Which TWO steps should the team take to ensure exactly-once processing?
224A company uses Amazon Athena to query a data lake in Amazon S3. The data is partitioned by year, month, day, and hour. The team notices that queries are slow and expensive. The team wants to improve performance and reduce costs. Which THREE actions should the team take?
225A data engineer is building a data pipeline using AWS Glue. The pipeline reads data from Amazon S3, transforms it, and writes it back to S3 in a different format. The engineer needs to handle schema evolution (new columns added over time). Which TWO features of AWS Glue can help manage schema evolution?
226A data engineer uses the IAM policy above for an AWS Lambda function that processes data in S3 and triggers an AWS Glue job. The Lambda function is unable to start the Glue job. What is the most likely cause?
227A data engineer runs the AWS CLI command above to inspect an object in S3. The engineer wants to query this metadata (kafka-offset) using Amazon Athena to track processing progress. How can the engineer make this metadata available for Athena queries without modifying the existing data pipeline?
228A data engineer configures an S3 event notification to trigger an AWS Lambda function when a new object is created in 'my-input-bucket'. The Lambda function processes the CSV file and writes results to 'my-output-bucket'. The engineer notices that the Lambda function is not triggered for some objects. Which step should the engineer take to diagnose the issue?
229A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Kinesis Data Analytics application that runs SQL queries. The application has been failing intermittently with 'ProvisionedThroughputExceededException' errors. Which action should be taken to resolve this issue?
230A data engineering team is designing a data pipeline to process large CSV files (10-50 GB each) stored in Amazon S3. The pipeline must transform the data using AWS Glue and load it into Amazon Redshift for analytics. The team wants to minimize costs while ensuring the pipeline can handle peak loads. Which approach is the most cost-effective?
231A company is using Amazon DynamoDB to store sensor data. The data is exported to Amazon S3 using DynamoDB Streams and AWS Lambda for long-term archival. Recently, the Lambda function has been failing due to 'ProvisionedThroughputExceededException' on the DynamoDB stream. What is the most likely cause?
232A data scientist is building a training dataset from data stored in Amazon S3. The data consists of JSON files each containing a 'timestamp' field. The scientist wants to use AWS Glue to catalog the data and enable querying via Amazon Athena. However, Athena queries are returning zero results for time-range filters. What is the most likely cause?
233A company is streaming data from IoT devices to Amazon Kinesis Data Firehose, which writes to an Amazon S3 bucket. The data is then processed by an AWS Glue ETL job and loaded into Amazon Redshift. The team notices that some records are missing in Redshift. They suspect data loss during the Firehose delivery. Which configuration parameter should be checked first?
234A data engineer needs to set up a data pipeline that ingests data from an Amazon RDS MySQL database into Amazon S3. The pipeline should run daily and capture incremental changes (inserts, updates, deletes) from the source database. Which AWS service should be used as the data ingestion tool?
235A company is building a data lake on Amazon S3 and wants to use AWS Glue to catalog the data. The data includes CSV, Parquet, and JSON files. The team wants to ensure that the Glue crawler can infer the schema correctly and update the Data Catalog when new partitions are added. Which crawler configuration should be used?
236A company uses Amazon Kinesis Data Streams with a shard count of 5. The data producer sends 1000 records per second, each 1 KB in size. The consumer application reads from the stream using the Kinesis Client Library (KCL) and processes records. The consumer is experiencing high latency and falling behind. What is the most effective way to improve consumer throughput?
237A company wants to store semi-structured data from IoT sensors in a cost-effective manner for occasional querying. The data is not updated once written. Which Amazon S3 storage class is the most cost-effective for this use case?
238Which TWO configurations are required to enable AWS Glue to access data stored in a VPC? (Choose two.)
239Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose three.)
240Which TWO steps are required to set up cross-account access to an Amazon S3 data lake for AWS Glue jobs running in a different AWS account? (Choose two.)
241Refer to the exhibit. A company is using the Kinesis stream 'my-stream' with one shard. The producer is sending 1000 records per second, each 1 KB. The consumer is reading from the stream using the Kinesis Client Library (KCL). The consumer is able to process 500 records per second per shard. What is the most likely cause of the consumer falling behind?
242Refer to the exhibit. A data engineer is troubleshooting an AWS Glue job that fails with an 'AccessDenied' error when trying to write to the S3 bucket 'my-data-lake'. The IAM policy attached to the Glue service role is shown. What is the missing permission?
243Refer to the exhibit. A team deploys this CloudFormation stack. The Kinesis stream is created, but the Firehose delivery stream fails to create with a 'Resource handler returned message: Unable to assume role' error. What is the most likely cause?
244A company is streaming real-time sensor data from IoT devices to Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that enriches the records with metadata from an Amazon DynamoDB table and writes the results to an Amazon S3 bucket. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' errors from DynamoDB. The data volume is variable, with occasional bursts. Which solution should a data engineer implement to resolve this issue without losing data?
245An e-commerce company uses Amazon Redshift for analytics. The data engineering team needs to load daily sales data from an S3 bucket that receives new files every hour. The data must be loaded into Redshift with minimal impact on query performance during the day, and they need to handle late-arriving data (files that appear after the daily load). Which approach should they use?
246A data scientist needs to train a machine learning model using a large dataset (500 GB) stored in an S3 bucket. The training will be performed on a SageMaker notebook instance. The data scientist wants to minimize data transfer costs and reduce training time. Which data ingestion approach should the data engineer recommend?
247A company is using AWS Glue to run ETL jobs that transform data from multiple sources into a data lake on S3. The jobs are scheduled to run hourly. Recently, the jobs have been failing intermittently with 'MemoryError' exceptions. The data volume has grown over time. The data engineer needs to resolve this issue cost-effectively. Which action should be taken?
248A data engineering team needs to set up a data pipeline that ingests streaming data from an Apache Kafka cluster running on Amazon EKS into an S3 data lake. The data must be stored in Parquet format, partitioned by date and event type. The team wants a fully managed solution with minimal operational overhead. Which solution should they choose?
249A data scientist is training a deep learning model using a large dataset stored in S3. The training job runs on a SageMaker training instance with a GPU. The data engineer notices that the GPU utilization is low, and the training is I/O bound. The data is read directly from S3 using the SageMaker SDK. Which change should the data engineer recommend to improve GPU utilization?
250A company uses Amazon DynamoDB as the primary data store for a real-time recommendation engine. The data engineering team needs to export a daily snapshot of the DynamoDB table to S3 for offline analytics. The table is large (10 TB) and has a high read/write throughput. Which method will export the data with the least impact on the production workload?
251A data engineer is building a data pipeline that uses AWS Lambda to process records from an SQS queue and write results to an S3 bucket. The Lambda function processes each record individually and writes a separate file to S3. The team notices high latency and wants to reduce the number of S3 PUT requests to improve performance and reduce cost. Which approach should the data engineer take?
252A company has a large number of small CSV files (hundreds of thousands) in an S3 bucket. A data engineer needs to run a SQL query on this data using Amazon Athena. The queries are currently slow and expensive. Which TWO actions will improve query performance and reduce cost?
253A data engineer needs to design a data ingestion pipeline that ingests data from a MySQL database hosted on-premises into Amazon S3 for analytics. The pipeline must capture change data (CDC) and run continuously with low latency. Which TWO services should the data engineer use?
254A company is using Amazon Redshift for data warehousing. The data engineering team observes that query performance degrades over time due to data skew. Which THREE strategies should the team implement to improve performance?
255Refer to the exhibit. An IAM policy is attached to a data engineering team's role. The team needs to upload data to the 'confidential' prefix in the 'my-data-lake' bucket. However, they are receiving 'AccessDenied' errors. What is the likely cause?
256Refer to the exhibit. A data engineer runs an Athena query and gets a failure. What is the most likely cause?
257Refer to the exhibit. A data engineer has deployed this CloudFormation template. The Glue job 'my-etl-job' reads from the S3 bucket 'my-data-lake-bucket' and writes transformed data to another bucket. After 30 days, the data engineer notices that the Glue job fails with 'Input data not found' errors. What is the most likely cause?
258A data engineer needs to extract data from an Amazon RDS for MySQL database into Amazon S3 for further processing. The data volume is 2 TB and the job must run daily within a 1-hour window. Which AWS service is most suitable for this task?
259A company is building a data lake on Amazon S3. Data arrives from multiple sources in different formats (CSV, JSON, Parquet). The engineering team wants to query this data using Amazon Athena with minimal transformation. Which approach minimizes query cost and improves performance?
260A data pipeline uses AWS Lambda to process small files (10-50 MB) from an S3 bucket and write results to DynamoDB. The Lambda function times out after 15 seconds for larger files. The team wants to handle files up to 100 MB without changing the Lambda code. Which solution is MOST cost-effective?
261A data scientist needs to run a one-time query on 10 TB of data stored in S3 using Amazon Athena. The query scans 5 TB and returns a small result set. Which approach minimizes cost?
262A company uses Amazon Kinesis Data Firehose to ingest streaming data and deliver it to an S3 bucket. The data is in JSON format with a timestamp field. The data science team wants to query the data using Athena with partitioning by year/month/day. How should the S3 data be organized?
263An organization is migrating its on-premises Hadoop cluster to AWS. The cluster runs Spark jobs that process 50 TB of data daily. The data is stored in HDFS with 3x replication. Which storage option on AWS provides the best price-performance for this workload?
264A company needs to ingest real-time clickstream data from thousands of web servers into AWS for near-real-time analytics. The data volume varies and can spike during promotions. Which service should be used to capture and buffer the data before processing?
265A data engineer uses AWS Glue to run ETL jobs that transform data from JSON to Parquet. The job runs successfully but takes 30 minutes longer than expected. CloudWatch metrics show high memory utilization and disk spills. What is the most likely cause?
266A company stores sensitive customer data in an S3 bucket. The security team requires that all data be encrypted at rest with a key that is automatically rotated every year. Which solution meets these requirements with the least operational overhead?
267Which TWO options are valid ways to reduce the amount of data scanned by Amazon Athena queries, thereby reducing cost?
268Which THREE AWS services can be used together to build a serverless data pipeline that ingests streaming data, transforms it, and loads it into Amazon Redshift for analysis?
269Which TWO options are best practices for managing access to data stored in Amazon S3 for a data lake?
270A data engineer is investigating why an Athena query against the my-data-lake bucket is slow. The query filters on year, month, and day. The exhibit shows the metadata of one Parquet file. What is the MOST likely cause of the slow query?
271The Glue job my-glue-job fails after a few successful runs. The error log shows 'Job run exceeds max concurrent runs limit'. The CloudFormation template is shown in the exhibit. What change should be made to allow multiple runs to execute concurrently?
272A Glue job fails with an AccessDenied error when trying to write to the S3 bucket my-data-lake. The IAM policy attached to the job role is shown in the exhibit. What is the MOST likely reason for the failure?
273A data scientist needs to process a large volume of streaming data from IoT devices and store the results in Amazon S3 for further analysis. Which AWS service is most suitable for ingesting and processing this data in near real-time?
274A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are failing intermittently with timeouts. What is the most likely cause?
275A company uses Amazon Kinesis Data Streams to collect clickstream data. The data is consumed by a Lambda function that writes to DynamoDB. Occasionally, the Lambda function fails due to throttling from DynamoDB. How can the company resolve this issue without losing data?
276A company needs to perform complex transformations on large datasets stored in Amazon S3 using Apache Spark. They want to minimize operational overhead. Which AWS service should they use?
277A company is migrating its on-premises Hadoop cluster to AWS. They have a large amount of historical data stored in HDFS. Which approach is the most efficient for transferring this data to Amazon S3?
278A data engineer needs to automate the transformation of CSV files to Parquet format as soon as they are uploaded to an S3 bucket. The transformed files should be stored in another S3 bucket. Which solution is the most cost-effective and requires the least maintenance?
279A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. They notice that the data is delivered in 5-minute intervals even though they set the buffer interval to 60 seconds. What could be the cause?
280A company needs to process sensitive data from multiple sources. They want to use AWS Glue to catalog and transform the data. Which feature should they use to ensure that sensitive columns are masked before the data is available for querying?
281A company runs a critical ETL job using AWS Glue that writes to an Amazon Redshift cluster. The job occasionally fails due to insufficient disk space on the Redshift cluster. How can the company automate the process to prevent this failure?
282Which TWO AWS services are suitable for real-time stream processing?
283Which TWO data formats are columnar and optimized for analytics queries in Amazon S3?
284Which THREE considerations are important when designing a data lake on Amazon S3?
285An IAM policy attached to an AWS Glue job allows reading and writing to an S3 bucket and accessing Glue Data Catalog. The job fails with an access denied error when trying to create a table in the Data Catalog. What is the likely issue?
286A data engineer runs the AWS CLI command shown and notices a zero-byte file in the results. What is the most likely cause of this zero-byte file?
287A company uses AWS Glue jobs with job bookmarks enabled to process incremental data. They notice that the job processes all data each time instead of only new data. What is the most likely reason?
288A data engineer needs to store streaming data from thousands of IoT devices for real-time analytics. Which AWS service is most suitable for ingesting and storing this data for subsequent processing by Amazon Kinesis Data Analytics?
289A company is using AWS Glue to run ETL jobs that process data in an S3 data lake. The jobs are failing with out-of-memory errors when processing large files. Which configuration change should be made to resolve this issue?
290A data scientist is training a deep learning model on a GPU instance. The training data is stored in S3 and is 50 GB. To reduce I/O bottlenecks, which storage option should be used to cache the data locally on the instance?
291A company is using Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be transformed into Parquet format before delivery. Which approach should the data engineer use?
292A company is running a data pipeline that uses Amazon EMR with Spark to process 100 TB of data daily. The pipeline must complete within 6 hours. Currently, it takes 8 hours. Which optimization will most likely reduce the runtime?
293A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should be used to trigger the job?
294A company is using Amazon Athena to query a data lake in S3. Queries are slow and expensive. The data is stored as JSON. Which action will improve query performance and reduce cost?
295A data engineer is building a data pipeline that uses Amazon S3 to store raw data, AWS Lambda for transformation, and Amazon DynamoDB for serving. The Lambda function experiences high latency when writing to DynamoDB. Which action will most effectively reduce the latency?
296A company needs to move 10 TB of data from an on-premises NAS to Amazon S3 over a 100 Mbps internet connection. The transfer must complete within 3 days. Which solution is the most appropriate?
297A data engineering team is designing a data lake on AWS. They need to store raw data in S3 and allow multiple analytics services to query the data. Which TWO services can be used to catalog and provide schema information for the data?
298A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be processed and stored in S3 in near real-time. Which THREE services can be used together to achieve this?
299A data engineer is designing a data pipeline that uses Amazon S3 events to trigger an AWS Lambda function for processing. The pipeline must handle high throughput with low latency. Which TWO configurations should be applied?
300A data engineer is configuring an IAM policy to allow users to upload objects to an S3 bucket only if the objects are encrypted using SSE-S3. However, users are getting AccessDenied errors when uploading objects without specifying encryption. What is the most likely cause?
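Questions 219 and 296 above both turn on whether a dataset can cross a 100 Mbps link before a deadline. The following is a minimal sketch of that arithmetic, not an official explanation. It assumes decimal units (1 TB = 10^12 bytes) and a fully utilized link with no protocol overhead, both of which are simplifications. The function name `transfer_days` is illustrative only.

```python
def transfer_days(size_tb: float, link_mbps: float) -> float:
    """Idealized number of days to move size_tb terabytes over a link_mbps link.

    Assumes 1 TB = 10**12 bytes, 1 Mbps = 10**6 bits/s, and 100% link utilization.
    """
    bits = size_tb * 1e12 * 8            # terabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # bits / (bits per second)
    return seconds / 86_400              # seconds -> days


# Figures taken from questions 219 and 296: (label, size in TB, deadline in days)
scenarios = [("Q219", 50, 14), ("Q296", 10, 3)]

for label, size_tb, deadline_days in scenarios:
    days = transfer_days(size_tb, 100)
    print(f"{label}: {days:.1f} days needed vs {deadline_days}-day deadline")
```

Under these assumptions the transfers need roughly 46 and 9 days respectively, so neither fits its deadline over the network alone. Real-world throughput is usually lower than the nominal link speed, which only widens the gap.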
The Data Engineering domain covers the key concepts tested in this area of the MLS-C01 exam blueprint published by Amazon Web Services. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all MLS-C01 domains — no account required.
The Courseiva MLS-C01 question bank contains 300 questions in the Data Engineering domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
The session launcher on this page draws questions exclusively from the Data Engineering domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.