MLS-C01 Data Engineering — All Questions With Answers

Question 1 (medium, multiple choice)

A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?

Question 2 (hard, multiple choice)

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by a Lambda function that writes records to an S3 bucket. Recently, the number of shards was increased from 2 to 4 to handle higher throughput. After the change, the Lambda function started processing records with increased latency and some records were being written out of order. What is the MOST likely cause?

Question 3 (easy, multiple choice)

A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?

Question 4 (medium, multiple choice)

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be partitioned by year, month, and day. The delivery stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The data producer sends about 1 MB per second. The data is arriving in S3 but the partitions are not being created as expected. What is the MOST likely reason?
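As a rough check on the numbers in this scenario, the sketch below estimates which buffering hint is reached first. It assumes Firehose delivers a batch as soon as either the buffer size or the buffer interval is reached, and that the producer sends a steady 1 MB per second; the function name is hypothetical.

```python
def first_flush_trigger(rate_mb_per_s: float, buffer_mb: float, interval_s: float):
    """Return (trigger, seconds) for whichever buffering hint is reached first."""
    seconds_to_fill = buffer_mb / rate_mb_per_s
    if seconds_to_fill < interval_s:
        return "size", seconds_to_fill
    return "interval", interval_s


# Values taken from the scenario: 1 MB/s input, 5 MB buffer, 60 s interval.
trigger, seconds = first_flush_trigger(rate_mb_per_s=1.0, buffer_mb=5.0, interval_s=60.0)
print(f"Flush triggered by buffer {trigger} after about {seconds:.0f} s")
```

Under these assumptions the size hint is reached after about 5 seconds, well before the 60-second interval.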

Question 5 (hard, multiple choice)

An ML team is building a recommendation system. The training data includes user-item interactions stored in Amazon DynamoDB. The team wants to export this data to S3 in Parquet format for use with Amazon SageMaker. The export should be incremental (only new or changed records) and run daily. Which approach meets these requirements with MINIMAL operational overhead?

Question 6 (easy, multiple choice)

A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses a ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?

Question 7 (medium, multiple choice)

A company uses AWS Glue ETL jobs to process data from multiple sources. The job fails with the error: 'An error occurred while calling o123.pyWriteDynamicFrame. Insufficient memory.' The job runs on a G.1X worker type with 10 workers. What should be changed to resolve this error?

Question 8 (hard, multiple choice)

A company uses Amazon Redshift as a data warehouse. They need to load 50 TB of clickstream data from S3 into Redshift daily. The data arrives in 5-minute intervals as gzipped CSV files. The target table has a sort key and a distribution key. The load must complete within 2 hours. Which approach is MOST efficient?
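To get a feel for the scale in this scenario, the following sketch works out the sustained throughput the load would need and how many 5-minute batches arrive per day. It assumes decimal units (1 TB = 10^12 bytes), which the question does not specify, and the helper names are illustrative.

```python
TB = 10**12  # decimal terabyte (assumption; the question does not say TB vs TiB)


def required_throughput_gb_per_s(total_bytes: int, window_hours: float) -> float:
    """Sustained rate in GB/s needed to move total_bytes within the window."""
    return total_bytes / (window_hours * 3600) / 10**9


def batches_per_day(interval_minutes: int) -> int:
    """Number of arrival intervals in a 24-hour day."""
    return 24 * 60 // interval_minutes


print(round(required_throughput_gb_per_s(50 * TB, 2), 2))  # 6.94
print(batches_per_day(5))  # 288
```

Roughly 7 GB/s sustained for two hours suggests that any approach loading files one at a time is unlikely to fit the window.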

Question 9 (easy, multiple choice)

A machine learning engineer needs to process a large dataset that does not fit on a single Amazon SageMaker notebook instance's EBS volume. The data is stored in S3. What is the MOST efficient way to access the data from the notebook?

Question 10 (medium, multiple choice)

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to a sink. The application is failing with an 'OutOfMemoryError'. The application has parallelism set to 4 and uses 1 Kinesis Processing Unit (KPU). What is the MOST likely cause and solution?

Question 11 (hard, multiple choice)

An organization stores sensitive customer data in S3. A data pipeline uses AWS Glue to transform the data and load it into Amazon Redshift. The security team requires that data be encrypted at rest in S3 and in transit between S3 and Glue, and between Glue and Redshift. Which configuration meets these requirements?

Question 12 (medium, multiple choice)

A data science team is building a real-time fraud detection system. Transactions are streamed via Amazon Kinesis Data Streams, and a Lambda function performs feature engineering and invokes an Amazon SageMaker endpoint for predictions. The team notices that the Lambda function is timing out and causing data loss. Which solution should the team implement to process the stream reliably and at low latency?

Question 13 (hard, multiple choice)

A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?

Question 14 (easy, multiple choice)

A data engineer is building a data pipeline to process user clickstream data. The data arrives as JSON files in an S3 bucket. The pipeline must transform the JSON into Parquet format and partition by date and event type, then make the data available for Amazon Athena queries. The engineer needs a fully managed, serverless solution with minimal operational overhead. Which combination of AWS services should the engineer use?

Question 15 (medium, multiple choice)

A team is using Amazon SageMaker to train a model on a dataset that is 500 GB in size, stored as CSV files in S3. The training job takes 2 hours using a single ml.p3.2xlarge instance. The team wants to reduce training time to under 30 minutes. The model architecture supports distributed training. Which solution will achieve this goal with the LEAST amount of code changes?

Question 16 (hard, multiple choice)

A company processes large streams of IoT sensor data using Amazon Kinesis Data Streams with 100 shards. Each sensor reading is about 1 KB. The data is consumed by an Amazon EMR cluster running Spark Streaming jobs. The team notices that the Spark Streaming job's processing time is gradually increasing, and the stream is falling behind. They suspect the issue is due to skewed data distribution across shards. Which approach should the team take to diagnose and resolve the issue?
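The skew described here can be illustrated offline. Kinesis Data Streams maps each partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below assumes 100 shards with evenly split hash ranges and a made-up workload in which one sensor is far busier than the rest; the key names and counts are hypothetical.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2**128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def records_per_shard(partition_keys, shard_count: int) -> Counter:
    """Count how many records land on each shard."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)


# Hypothetical workload: 1000 normal sensors plus one "hot" sensor.
keys = [f"sensor-{i}" for i in range(1000)] + ["sensor-hot"] * 5000
counts = records_per_shard(keys, shard_count=100)
hot_shard, hot_count = counts.most_common(1)[0]
print(f"busiest shard {hot_shard} received {hot_count} of {len(keys)} records")
```

Because every record with the same partition key lands on the same shard, a single hot key concentrates load on one shard no matter how many shards the stream has.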

Question 17 (medium, multi-select)

A data engineering team is designing a data lake on AWS for machine learning workloads. The data includes structured, semi-structured, and unstructured data. The team needs to ensure that the data is cataloged, easily discoverable, and can be queried by Amazon Athena and Amazon EMR. The team also wants to enforce fine-grained access control at the column and row level for sensitive data. Which combination of AWS services should the team use? (Select TWO.)

Question 18 (hard, multi-select)

A company is building a real-time anomaly detection system for network traffic logs. The logs are ingested via Amazon Kinesis Data Streams and processed with an Amazon SageMaker endpoint for inference. The team needs to ensure that the inference results are stored durably and can be replayed for model retraining. The system must handle at least 10,000 records per second with low latency. Which three AWS services should the team use to build this architecture? (Select THREE.)
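For the throughput requirement, a back-of-the-envelope shard estimate can help. The sketch below assumes a provisioned-mode stream with per-shard write limits of 1,000 records per second and about 1 MB per second, and a hypothetical average record size of 500 bytes, which the question does not give.

```python
import math

RECORDS_PER_SHARD_PER_S = 1_000    # per-shard write limit, records per second
BYTES_PER_SHARD_PER_S = 1_000_000  # per-shard write limit, roughly 1 MB per second


def min_shards(records_per_s: int, avg_record_bytes: int) -> int:
    """Smallest shard count that satisfies both the record and byte limits."""
    by_count = math.ceil(records_per_s / RECORDS_PER_SHARD_PER_S)
    by_bytes = math.ceil(records_per_s * avg_record_bytes / BYTES_PER_SHARD_PER_S)
    return max(by_count, by_bytes, 1)


# 10,000 records/s from the scenario; 500-byte records are an assumption.
print(min_shards(10_000, 500))  # 10
```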

Question 19 (easy, multiple choice)

A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?

Question 20 (medium, multiple choice)

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?

Question 21 (hard, multiple choice)

A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?

Question 22 (medium, multiple choice)

A company uses AWS Glue to catalog data in S3. Data is partitioned by year, month, day. The Glue crawler runs daily but sometimes misses new partitions. What should be done to ensure all partitions are cataloged?

Question 23 (medium, multi-select)

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Analytics for Apache Flink. The pipeline reads from a Kinesis data stream and writes to an S3 bucket. The job must recover quickly from failures without reprocessing large amounts of data. Which TWO configurations should be used? (Choose TWO.)

Question 24 (hard, multi-select)

A company needs to build a data lake on AWS for analytics. The data includes structured, semi-structured, and unstructured data. The solution must support schema-on-read, provide fine-grained access control, and be cost-effective for storing rarely accessed data. Which THREE services should be used? (Choose THREE)

Question 25 (hard, multiple choice)

A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?

Exhibit

Refer to the exhibit.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-lake/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
```

Question 26 (hard, multiple choice)

A company runs a real-time fraud detection system using Amazon Kinesis Data Streams with 100 shards. Data is consumed by a custom Java application running on Amazon EC2 instances in an Auto Scaling group. The application processes records and writes results to a DynamoDB table. Over the past month, the application has experienced intermittent slowdowns and the DynamoDB write capacity has been fully utilized during peak hours. The team wants to improve throughput without losing the ability to reprocess failed records. The application currently uses the Kinesis Client Library (KCL) with DynamoDB as the lease table. The team is considering the following changes:

A. Increase the number of EC2 instances to match the number of shards.
B. Switch to using AWS Lambda as the consumer to handle scaling automatically.
C. Increase the write capacity of the DynamoDB lease table to handle more workers.
D. Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput.

Which change should the team implement first to address the issue?

Question 27 (hard, multiple choice)

A retail company runs an e-commerce platform on AWS. Its data engineering team processes clickstream data using Amazon Kinesis Data Streams (KDS) with a shard count of 5. The data is consumed by an AWS Lambda function that transforms and loads the data into an Amazon S3 bucket partitioned by year/month/day/hour. Recently, the team has noticed that the Lambda function is experiencing throttling errors, and the KDS shard iterator age is increasing, indicating that the consumer cannot keep up with the incoming data rate. The team has already increased the Lambda reserved concurrency to 1000 and enabled a batch window of 60 seconds. The metrics show that the Lambda function duration is well under the 5-minute timeout, and there are no errors in the transformation logic. The S3 write operations are not failing. Which course of action would MOST effectively resolve the issue without unnecessary cost or complexity?
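One way to reason about this scenario is to estimate how many batches Lambda can process at once from a Kinesis stream. The sketch below assumes the standard event source mapping behavior: one concurrent batch per shard by default, raised by a parallelization factor between 1 and 10. The function name is hypothetical.

```python
def max_concurrent_batches(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent Lambda invocations for a Kinesis event source."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    return shard_count * parallelization_factor


print(max_concurrent_batches(5))      # 5
print(max_concurrent_batches(5, 10))  # 50
```

Under that assumption, a stream with 5 shards drives at most 5 concurrent invocations by default, so a reserved concurrency of 1000 on its own would not raise consumer throughput.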

Question 28 (medium, drag order)

Drag and drop the steps to create an Amazon SageMaker notebook instance in the correct order.

Question 29 (medium, drag order)

Drag and drop the steps to perform hyperparameter tuning using SageMaker Automatic Model Tuning in the correct order.

Question 30 (medium, drag order)

Drag and drop the steps to set up Amazon SageMaker Ground Truth for a labeling job in the correct order.

Question 31 (medium, drag order)

Drag and drop the steps to set up cross-validation in a SageMaker training job using the built-in XGBoost algorithm in the correct order.

Question 32 (medium, matching)

Match each AWS service to its primary purpose in a machine learning pipeline.

Matches

Build, train, and deploy ML models

ETL and data cataloging

Object storage for datasets and models

Serverless compute for preprocessing

Image and video analysis

Question 33 (medium, matching)

Match each AWS security service to its function in ML.

Matches

Manage access to AWS resources

Encryption key management

Audit API calls

Isolate network resources

Discover and protect sensitive data

Question 34 (medium, matching)

Match each data format to its typical use in AWS ML.

Matches

Tabular data for SageMaker built-in algorithms

Efficient binary format for SageMaker

Columnar storage for analytics

Semi-structured data, e.g., for Lambda

TensorFlow training data format

Question 35 (medium, matching)

Match each SageMaker optimization technique to its description.

Matches

Train across multiple GPUs or instances

Hyperparameter optimization with Bayesian search

Use spot instances for cost savings

Stream data directly from S3 for faster training

Monitor training and detect issues

Question 36 (medium, multiple choice)

A data engineering team needs to process streaming data from thousands of IoT devices. They want to aggregate data in 1-minute windows and store results in an S3 data lake for downstream analytics. Which architecture should they use?

Question 37 (hard, multiple choice)

A company uses AWS Glue ETL jobs to transform CSV data from an S3 bucket into Parquet. The jobs often fail with memory errors when processing large datasets. They want to minimize cost and improve reliability. What should they do?

Question 38 (easy, multiple choice)

A machine learning team needs to create a training dataset by joining two large datasets (10 TB and 5 TB) stored in S3. The join key is 'user_id'. They want to minimize data movement and cost. Which approach should they use?

Question 39 (hard, multiple choice)

A company uses an Amazon SageMaker notebook to train a model using data from an S3 bucket. The IAM role attached to the notebook has the following policy. What is the MOST specific change needed to allow the notebook to read from the bucket 'ml-data-123'?

Question 40 (easy, multiple choice)

A data engineer needs to transform raw clickstream data (JSON files) stored in S3 into a partitioned Parquet dataset for querying with Athena. The transformation includes cleaning, deduplication, and enrichment. The pipeline should run daily. Which solution is MOST cost-effective and requires the least operational overhead?

Question 41 (medium, multiple choice)

A company uses Kinesis Data Streams to ingest real-time sensor data. The data is consumed by a Lambda function that writes to DynamoDB. During peak hours, the Lambda function throws ProvisionedThroughputExceededException. The team wants to decouple the write operation and improve resilience. What should they do?

Question 42 (medium, multiple choice)

A data engineer needs to design a data pipeline that ingests CSV files from an SFTP server daily, transforms them, and loads them into Amazon Redshift. The files are typically 2-3 GB. Which combination of AWS services is MOST appropriate?

Question 43 (easy, multiple choice)

A team stores raw data in S3 and uses a Glue Data Catalog for metadata. They want to allow data scientists to query the data with Amazon Athena using their existing IAM roles. What is the MINIMUM set of permissions required?

Question 44 (hard, multiple choice)

A company is building a near-real-time dashboard using data from multiple sources. They need to aggregate millions of events per second with sub-second latency. The architecture must be fully managed and minimize operational overhead. Which service should they use for the aggregation layer?

Question 45 (hard, multi-select)

A data engineer needs to set up a data lake on S3 that supports both batch and streaming ingestion. The data must be queryable by Athena, Redshift Spectrum, and EMR. Which TWO configurations are essential? (Choose two.)

Question 46 (easy, multi-select)

A team wants to move data from an on-premises Oracle database to Amazon S3 for analytics. The pipeline must run daily and handle incremental updates. Which THREE services should they use together? (Choose three.)

Question 47 (medium, multi-select)

A company uses Amazon Kinesis Data Streams to ingest clickstream data. They need to archive raw data to S3 every hour and also enable real-time processing with sub-second latency. Which TWO actions should they take? (Choose two.)

Question 48 (medium, multiple choice)

An IAM policy attached to a SageMaker notebook role is shown. The data engineer tries to run an Athena query on a table in the 'my_database' Glue database. The query fails with an access denied error. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["glue:GetTable", "glue:GetDatabase"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"],
      "Resource": "*"
    }
  ]
}
```

Question 49 (easy, multiple choice)

A data engineer runs the AWS CLI command above to inspect a file in S3. They need to determine if the file was modified after a Glue ETL job processed it. What additional information could they obtain from this command?

Question 50 (hard, multiple choice)

A data engineer created a CloudFormation template for a Glue ETL job as shown. The job processes 500 GB of data and takes 90 minutes to complete. However, the job fails after 60 minutes. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

```yaml
Resources:
  GlueJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation: s3://my-bucket/scripts/etl.py
        PythonVersion: 3
      Role: arn:aws:iam::123456789012:role/GlueServiceRole
      MaxRetries: 0
      Timeout: 60
      NumberOfWorkers: 10
      WorkerType: G.1X
```
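A quick way to read the exhibit is to compare the job's expected runtime with its configured limit. The sketch below assumes the `Timeout` property of `AWS::Glue::Job` is expressed in minutes; the dictionary simply mirrors values from the template, and the helper name is hypothetical.

```python
# Values copied from the exhibit; Timeout is assumed to be in minutes.
job_properties = {"Timeout": 60, "MaxRetries": 0, "NumberOfWorkers": 10, "WorkerType": "G.1X"}
expected_runtime_minutes = 90  # runtime stated in the question


def exceeds_timeout(runtime_minutes: int, properties: dict) -> bool:
    """True when the job would still be running at the configured timeout."""
    return runtime_minutes > properties["Timeout"]


print(exceeds_timeout(expected_runtime_minutes, job_properties))  # True
```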

Question 51 (medium, multiple choice)

A data science team needs to process streaming data from thousands of IoT devices and perform real-time anomaly detection. The data must be persisted in Amazon S3 for batch processing later. Which combination of AWS services should be used to meet these requirements?

Question 52 (easy, multiple choice)

A company uses Amazon Redshift for its data warehouse. The data engineering team notices that queries are slow and wants to improve performance without changing the schema. Which action is most likely to improve query performance?

Question 53 (hard, multiple choice)

A data pipeline uses AWS Glue to transform data from Amazon RDS to Amazon S3. The team wants to ensure that only new or updated records are processed in each run, minimizing cost and time. Which AWS Glue feature should be used?

Question 54 (medium, multiple choice)

A company is using Amazon SageMaker to train machine learning models. The training data is stored in Amazon S3, but the data includes personally identifiable information (PII) that must be anonymized before training. What is the most efficient way to anonymize the data?

Question 55 (easy, multiple choice)

A data engineer needs to move 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. Which AWS service should be used to transfer the data most efficiently?
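The bandwidth constraint is easier to judge with the transfer time worked out. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilized link with no protocol overhead, so real transfers would take longer; the function name is illustrative.

```python
def transfer_days(size_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Days needed to move size_tb over a link of link_mbps at the given utilization."""
    bits = size_tb * 10**12 * 8
    seconds = bits / (link_mbps * 10**6 * utilization)
    return seconds / 86_400


print(f"{transfer_days(50, 100):.1f} days")   # 46.3 days at 100 Mbps
print(f"{transfer_days(50, 1000):.1f} days")  # 4.6 days at 1 Gbps
```

At 100 Mbps the network path takes over six weeks even in the best case, which is the kind of figure that usually decides between an online and an offline transfer.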

Question 56 (hard, multiple choice)

A team is building a data lake on Amazon S3 and using AWS Glue to catalog data. They notice that Glue crawlers are taking too long to update the catalog for a large dataset with millions of small files. Which approach will MOST improve crawler performance?

Question 57 (medium, multiple choice)

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format, and the company wants to convert it to Parquet for efficient querying. Which configuration should be used?

Question 58 (easy, multiple choice)

A data engineer needs to run a one-time ETL job to transform 500 GB of data from Amazon RDS to Amazon S3. The job should be cost-effective and require minimal infrastructure management. Which AWS service should be used?

Question 59 (medium, multiple choice)

A company uses Amazon DynamoDB as the primary data store for a real-time application. The data science team wants to analyze the data using Amazon Athena. What is the most efficient way to make the DynamoDB data available for Athena queries?

Question 60 (hard, multi-select)

A company is designing a data pipeline that ingests streaming data from social media feeds. The data must be processed in real-time to detect trending topics, and results must be stored in Amazon DynamoDB for low-latency access. Which services should the company use? (Choose TWO.)

Question 61 (medium, multi-select)

A data engineer needs to transform and move 2 TB of data from an Amazon RDS for PostgreSQL instance to Amazon S3 daily. The transformation includes filtering, joining with data in S3, and aggregating. Which AWS services can be used together to accomplish this with minimal operational overhead? (Choose THREE.)

Question 62 (easy, multi-select)

A company wants to build a data lake on Amazon S3. The data lake should support both batch and real-time data ingestion. Which AWS services should be used for data ingestion? (Choose TWO.)

Question 63 (easy, multiple choice)

A data engineer wants to stream clickstream data from a web application to Amazon S3 for near-real-time analytics. Which AWS service should be used to ingest and buffer the data before landing in S3?

Question 64 (medium, multiple choice)

A machine learning team needs to process a large dataset stored in Amazon S3 using Apache Spark. They want to minimize cost and avoid managing infrastructure. Which AWS service should they use?

Question 65 (hard, multiple choice)

A company uses AWS Glue to run ETL jobs on a daily schedule. The jobs are failing intermittently with 'OutOfMemory' errors. The data volume has grown 5x over the past month. Which is the MOST cost-effective fix?

Question 66 (easy, multiple choice)

A data scientist needs to query a dataset stored as Parquet files in Amazon S3 using standard SQL without managing any infrastructure. Which service should they use?

Question 67 (medium, multiple choice)

A team wants to build a data pipeline that processes incoming JSON files from an S3 bucket and loads them into a Redshift table. The pipeline must handle schema evolution and data validation. Which combination of services would be MOST appropriate?

Question 68 (hard, multiple choice)

A company uses Amazon Kinesis Data Analytics for real-time anomaly detection on a stream of IoT sensor data. The application is experiencing high latency. The data volume has doubled. Which action would MOST effectively reduce latency?

Question 69 (easy, multiple choice)

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The company has a 1 Gbps internet connection. Which service would complete the transfer in the shortest time?

Question 70 (medium, multiple choice)

A company is building a data lake on Amazon S3. They need to enforce encryption at rest for all objects. Which combination of actions will achieve this? (Assume the bucket is versioned.)

Question 71 (hard, multiple choice)

A company uses Amazon EMR with Spark to process data daily. The job reads from S3 and writes to S3. Recently, the job started failing with 'S3AccessDenied' errors. The IAM role used by EMR has not changed. What is the MOST likely cause?

Question 72 (medium, multi-select)

A company is designing a data pipeline to ingest data from multiple sources into an Amazon S3 data lake. The data must be encrypted at rest and in transit. Which TWO actions should be taken to meet these requirements?

Question 73 (hard, multi-select)

A data engineering team uses AWS Glue to run ETL jobs. They notice that jobs are taking longer to complete as data volume grows. They want to optimize performance without increasing cost significantly. Which THREE strategies should they consider?

Question 74 (easy, multi-select)

A company wants to analyze streaming data from IoT devices in near-real-time. They need to store raw data in Amazon S3 and also run SQL queries on the streaming data. Which TWO services should they use?

Question 75 (medium, multiple choice)

An IAM policy is attached to a group. A user in the group tries to read the object s3://data-lake-bucket/sensitive/file.txt from the IP address 192.168.1.1. What will happen?

Exhibit

Refer to the exhibit.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::data-lake-bucket/*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "s3:*"
      ],
      "Resource": "arn:aws:s3:::data-lake-bucket/sensitive/*",
      "Condition": {
        "StringNotEquals": {
          "aws:SourceIp": "10.0.0.0/8"
        }
      }
    }
  ]
}
```
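The policy in the exhibit can be traced with a toy evaluator. This is a simplified model, not the IAM engine: it assumes `StringNotEquals` is an exact string comparison, that `NotIpAddress` is the CIDR-aware operator, and that an explicit Deny overrides any Allow. All function names are hypothetical.

```python
import ipaddress


def string_not_equals(actual: str, expected: str) -> bool:
    """Toy model of StringNotEquals: an exact, case-sensitive string comparison."""
    return actual != expected


def not_ip_address(actual: str, cidr: str) -> bool:
    """Toy model of NotIpAddress: true when the address is outside the CIDR block."""
    return ipaddress.ip_address(actual) not in ipaddress.ip_network(cidr)


def is_allowed(source_ip: str, key: str, deny_condition) -> bool:
    """Evaluate the exhibit's two statements; an explicit Deny overrides the Allow."""
    allowed = True  # statement 1 allows s3:GetObject on every key in the bucket
    denied = key.startswith("sensitive/") and deny_condition(source_ip, "10.0.0.0/8")
    return allowed and not denied


print(is_allowed("192.168.1.1", "sensitive/file.txt", string_not_equals))  # False
print(is_allowed("10.1.2.3", "sensitive/file.txt", string_not_equals))     # False
print(is_allowed("10.1.2.3", "sensitive/file.txt", not_ip_address))        # True
```

In this model the string operator never matches a CIDR block, so the Deny applies even to addresses inside 10.0.0.0/8; only the IP-aware operator treats the value as a range.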

Question 76 (hard, multiple choice)

A data engineer runs the CLI command to download an object from S3. The bucket owner is 123456789012, and the engineer's IAM user has s3:GetObject permission on the bucket. The object was uploaded by a different AWS account. What is the MOST likely reason for the AccessDenied error?

Question 77 (easy, multiple choice)

An S3 event notification is configured to trigger a Lambda function when new objects are created. The Lambda function processes the event JSON shown. Which field should the function use to read the new object from S3?

Exhibit

Refer to the exhibit.

```json
{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "s3SchemaVersion": "1.0",
        "bucket": {
          "name": "my-bucket",
          "arn": "arn:aws:s3:::my-bucket"
        },
        "object": {
          "key": "data/file.csv",
          "size": 1024
        }
      }
    }
  ]
}
```
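The event in the exhibit can be walked with a few lines of Python. The sketch below pulls the bucket name and object key out of each record and URL-decodes the key, since S3 event notifications deliver keys URL-encoded. The function name is hypothetical, and the sample event is a trimmed copy of the exhibit.

```python
from urllib.parse import unquote_plus


def object_locations(event: dict) -> list:
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    locations = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (for example, spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        locations.append((bucket, key))
    return locations


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket", "arn": "arn:aws:s3:::my-bucket"},
                "object": {"key": "data/file.csv", "size": 1024},
            },
        }
    ]
}

print(object_locations(sample_event))  # [('my-bucket', 'data/file.csv')]
```

A handler would typically pass each pair to an S3 `GetObject` call; that step is left out here to keep the sketch free of external dependencies.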

Question 78 (easy, multiple choice)

A data engineer needs to ingest streaming data from an on-premises Kafka cluster into Amazon S3 with minimal operational overhead. Which AWS service should be used to stream the data into S3 without managing servers?

Question 79 (medium, multiple choice)

A company is using AWS Glue ETL jobs to process data stored in Amazon S3. The jobs currently run sequentially and take too long. The data engineer wants to reduce job duration without rewriting the code. Which action is most effective?

Question 80 (hard, multiple choice)

A data science team uses Amazon SageMaker to train models on a dataset stored in Amazon S3. The dataset is 2 TB and is accessed by multiple training jobs. The team notices that training jobs are slow due to high S3 GET request latency. Which solution would provide the fastest and most cost-effective data access?

Question 81 (medium, multiple choice)

A company runs a daily ETL job that reads data from Amazon RDS, transforms it using AWS Glue, and writes the results to Amazon S3. The job started failing yesterday with the error: 'Rate exceeded'. What is the most likely cause and solution?

Question 82easymultiple choice

Read the full Data Engineering explanation →

A company wants to analyze historical data stored in Amazon S3 using Amazon Athena. The data is in CSV format and is partitioned by date. Which action will provide the best query performance and cost optimization?

Question 83hardmultiple choice

Read the full Data Engineering explanation →

A company uses AWS Lake Formation to manage permissions on a data lake stored in Amazon S3. A data analyst tries to query a table using Amazon Athena but receives an 'Access Denied' error. The analyst has SELECT permission on the table in Lake Formation. What is the most likely cause?

Question 84mediummultiple choice

Read the full Data Engineering explanation →

A data pipeline uses Amazon Kinesis Data Streams with a Lambda consumer to process clickstream data. The Lambda function sometimes times out because of spikes in traffic. The team wants to buffer the data before processing to handle spikes. Which approach is most effective?

Question 85easymultiple choice

Read the full Data Engineering explanation →

A company runs a nightly AWS Glue ETL job that processes data from an Amazon Redshift cluster and writes to Amazon S3. The job fails intermittently with 'ERROR: cannot execute INSERT in a read-only transaction'. What is the most likely cause?

Question 86hardmultiple choice

Read the full Data Engineering explanation →

A company uses Amazon EMR to run Spark jobs on a large dataset stored in Amazon S3. The jobs are failing with 'OutOfMemoryError' in the executors. The data is not skewed. Which configuration change will most likely resolve the issue?

Question 87mediummulti select

Read the full Data Engineering explanation →

A data engineer is designing a data ingestion pipeline that will receive up to 5 GB of data per hour from thousands of IoT devices. The data must be stored in Amazon S3 and analyzed in near real-time. Which TWO services should be used together to meet these requirements? (Choose TWO.)

Question 88easymulti select

Read the full Data Engineering explanation →

A company needs to transfer 10 TB of data from an on-premises data center to Amazon S3. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 5 days. Which TWO options are viable? (Choose TWO.)
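The bandwidth constraint in this scenario can be checked with simple arithmetic. The sketch below is a rough estimate only: it assumes decimal units (1 TB = 10^12 bytes), a fully utilized link, and no protocol overhead, and the helper name `transfer_days` is hypothetical.

```python
def transfer_days(data_tb, link_mbps, utilization=1.0):
    """Estimate days needed to move data_tb terabytes over a link_mbps link.

    Assumes decimal units and ignores protocol overhead; utilization scales
    the usable share of the link (1.0 means the whole link is available).
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86400


# 10 TB over 100 Mbps, best case.
print(round(transfer_days(10, 100), 2))  # 9.26
```

Under these assumptions the online transfer alone takes roughly nine days, which is longer than the 5-day window stated in the question. Real-world throughput is usually lower than the nominal link speed, so the estimate is optimistic.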

Question 89hardmulti select

Read the full Data Engineering explanation →

A company uses Amazon Redshift for data warehousing. The data engineering team notices that query performance has degraded over time. Which THREE actions should the team take to improve performance? (Choose THREE.)

Question 90easymultiple choice

Read the full Data Engineering explanation →

A data engineer is building a data pipeline that ingests streaming data from IoT devices. The data must be processed in near real-time and stored in Amazon S3 for further analysis. Which AWS service should be used to capture and process the streaming data before storing it in S3?

Question 91mediummultiple choice

Read the full Data Engineering explanation →

A machine learning team needs to preprocess large volumes of clickstream data stored in Amazon S3 before training a model. The preprocessing includes data cleaning, feature engineering, and normalization. The team wants to use a serverless solution that minimizes operational overhead. Which combination of services should the team use?

Question 92hardmultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time data. The data source is a Kinesis data stream, and the output is written to an S3 bucket. Recently, the processing latency has increased significantly. The team suspects that the Flink application is encountering backpressure. Which metric should the team monitor to confirm backpressure?

Question 93easymultiple choice

Read the full Data Engineering explanation →

A data scientist wants to query a dataset stored in Amazon S3 using standard SQL without provisioning any servers. The dataset is in CSV format and is updated daily. Which AWS service should be used?

Question 94mediummultiple choice

Read the full Data Engineering explanation →

A company is building a data pipeline to process sensitive customer data. The pipeline uses AWS Glue for ETL and stores results in Amazon S3. The security team requires that all data be encrypted at rest in S3 using customer-managed AWS KMS keys. Additionally, the Glue job must be able to write encrypted data to S3. What should the data engineer do to meet these requirements?

Question 95hardmultiple choice

Read the full Data Engineering explanation →

A large e-commerce company is using Amazon DynamoDB as the source for real-time analytics. The data is streamed to Amazon Kinesis Data Streams using DynamoDB Streams and then processed by an AWS Lambda function. The Lambda function writes the data to an Amazon Elasticsearch Service cluster for search and visualization. Recently, the Lambda function has been failing with throttling errors from the Elasticsearch cluster. What is the MOST effective way to handle this?

Question 96easymultiple choice

Read the full Data Engineering explanation →

A company is using Amazon S3 as a data lake. The data engineering team needs to catalog the schema of the data and make it available for querying with Amazon Athena. Which AWS Glue component should be used?

Question 97mediummultiple choice

Read the full Data Engineering explanation →

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 100 Mbps connection to AWS. The transfer must be completed within one week. Which approach should the engineer use?

Question 98hardmultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Redshift for data warehousing. The data engineering team notices that queries are slow and the system is frequently writing to disk due to insufficient memory. Which type of workload management (WLM) configuration change would help reduce disk writes?

Question 99mediummulti select

Read the full Data Engineering explanation →

Which TWO AWS services can be used to move data from an on-premises database to Amazon S3 on a recurring schedule without writing custom code? (Choose TWO.)

Question 100hardmulti select

Read the full Data Engineering explanation →

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose THREE.)

Question 101easymulti select

Read the full Data Engineering explanation →

Which TWO AWS services can be used to schedule and orchestrate a data pipeline that includes multiple steps such as data extraction, transformation, and loading? (Choose TWO.)

Question 102easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to analyze large CSV files stored in Amazon S3 using SQL queries. The data is not frequently accessed, and cost is a primary concern. Which AWS service should be used to query the data directly in S3 without moving it?

Question 103mediummultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be transformed before being stored in Amazon S3. The transformations include enrichment with reference data from Amazon DynamoDB. Which AWS service should be used to perform the transformation with minimal operational overhead?

Question 104hardmultiple choice

Read the full Data Engineering explanation →

A data engineering team is designing a data lake on Amazon S3. Raw data is ingested in JSON format and must be partitioned by year, month, and day. The team expects high query performance for recent data but infrequent queries for older data. The data is immutable. Which storage tier configuration minimizes costs while meeting performance requirements?

Question 105easymultiple choice

Read the full Data Engineering explanation →

A data engineer is tasked with building a system to process a continuous stream of IoT sensor data. The data must be processed in near real-time, and the results must be stored in Amazon S3 partitioned by hour. Which AWS service is the most cost-effective and simplest to implement?

Question 106mediummultiple choice

Read the full Data Engineering explanation →

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are currently failing due to insufficient memory. The data volume varies, with occasional spikes. Which solution should be used to handle the variable memory requirements efficiently?

Question 107hardmultiple choice

Read the full Data Engineering explanation →

A data pipeline uses Amazon Kinesis Data Streams to ingest event data. The data is consumed by an AWS Lambda function, which writes to Amazon DynamoDB. The Lambda function is experiencing throttling errors, and the DynamoDB write capacity is underutilized. The events must be processed in order per shard. Which solution most effectively addresses the throttling?

Question 108easymultiple choice

Read the full Data Engineering explanation →

A data scientist needs to run a one-time SQL query on a large dataset in Amazon S3. The dataset is stored in Parquet format and is about 500 GB. The query requires complex aggregations and joins. Which AWS service should be used to minimize cost and setup time?

Question 109mediummultiple choice

Read the full Data Engineering explanation →

A company is building a data lake on Amazon S3. Raw data is ingested from multiple sources in different formats (CSV, JSON, Parquet). The data must be cataloged and made queryable using Amazon Athena. The data schema may evolve over time. Which approach minimizes manual effort and supports schema evolution?

Question 110hardmultiple choice

Read the full Data Engineering explanation →

An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to Amazon S3. The data arrives at unpredictable rates, with occasional bursts. The company needs to ensure data is delivered within 60 seconds of ingestion, and the data must be partitioned by year/month/day/hour. Which configuration meets these requirements?

Question 111easymulti select

Read the full Data Engineering explanation →

Which TWO AWS services can be used to transform data in transit before storing it in Amazon S3? (Choose TWO.)

Question 112mediummulti select

Read the full Data Engineering explanation →

A company is designing a data pipeline to analyze customer behavior. The pipeline must handle real-time streaming data and batch data. The data must be stored in a data lake on Amazon S3 and also made available for interactive queries. Which THREE services should be combined to build this pipeline? (Choose THREE.)

Question 113hardmulti select

Read the full Data Engineering explanation →

A data engineering team is migrating on-premises Hadoop workloads to AWS. The workloads include batch processing using Apache Spark and interactive SQL queries. The data is stored in HDFS. Which TWO AWS services should be used to replace HDFS and provide a scalable, durable storage layer? (Choose TWO.)

Question 114mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team needs to ingest streaming data from thousands of IoT devices into a data lake on Amazon S3 for near-real-time analytics. The data must be partitioned by device ID and timestamp, and the team must minimize data loss during ingestion failures. Which solution is MOST appropriate?

Question 115easymultiple choice

Read the full Data Engineering explanation →

A data scientist needs to query a 2 TB dataset stored in Amazon S3 using Amazon Athena. The data is in CSV format and is used for exploratory analysis. Queries are currently slow and expensive. Which action will improve query performance and reduce cost?

Question 116hardmultiple choice

Read the full Data Engineering explanation →

A company uses AWS Glue ETL jobs to process data from an Amazon RDS for MySQL database into Amazon S3. The job runs daily and takes 6 hours to complete. The team wants to reduce runtime and cost. The source table has 50 million rows and is updated continuously. Which combination of changes would be MOST effective?

Question 117mediummultiple choice

Read the full Data Engineering explanation →

A data pipeline uses Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by an AWS Lambda function that transforms and writes to Amazon DynamoDB. The Lambda function is throttled during traffic spikes, causing data to be reprocessed. Which solution should the team implement to handle the throttling without losing data?

Question 118easymultiple choice

Read the full Data Engineering explanation →

A company wants to use Amazon SageMaker to train a model on a dataset stored in Amazon S3. The dataset is 100 GB and consists of millions of small JSON files. What should the data engineering team do to optimize training performance?

Question 119hardmultiple choice

Read the full Data Engineering explanation →

A financial services company needs to build a data lake on Amazon S3 that meets regulatory requirements for data retention and encryption. Data must be encrypted at rest and in transit, and access must be audited. The data lake will be queried by Amazon Athena and Amazon Redshift Spectrum. Which combination of actions should be taken?

Question 120mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team is building a pipeline to process terabytes of log data daily using Amazon EMR with Spark. The data arrives in hourly batches and must be processed within 4 hours. The team needs to minimize cost. Which cluster configuration is MOST cost-effective?

Question 121easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 1 Gbps connection to AWS. The transfer must be completed within 10 days. What is the MOST efficient approach?

Question 122hardmultiple choice

Read the full Data Engineering explanation →

A company runs a real-time fraud detection pipeline using Amazon Kinesis Data Analytics. The pipeline reads from a Kinesis data stream, performs sliding window aggregations, and writes results to a DynamoDB table. The application is experiencing high latency during peak hours. Which action would MOST effectively reduce latency?

Question 123mediummulti select

Read the full Data Engineering explanation →

A data engineering team is designing a data pipeline to process streaming data from social media feeds. The data must be deduplicated, enriched with customer information from a relational database, and stored in Amazon S3 in Parquet format. Which AWS services should the team use to build this pipeline? (Select TWO.)

Question 124hardmulti select

Read the full Data Engineering explanation →

A company uses AWS Glue Data Catalog to manage metadata for its data lake on Amazon S3. The data lake contains terabytes of data in CSV format. The data engineering team wants to improve query performance in Amazon Athena and reduce costs. Which actions should the team take? (Select THREE.)

Question 125easymulti select

Read the full Data Engineering explanation →

A data engineering team needs to schedule a nightly ETL job that extracts data from an Amazon RDS for PostgreSQL instance, transforms it using Spark, and loads it into Amazon S3. The team wants to use AWS Glue for this task. Which components are required? (Select TWO.)

Question 126mediummultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. An IAM policy is attached to a data engineering role. The role is used by an AWS Glue ETL job that reads from 'raw/' and writes to 'processed/'. The job fails with an access denied error when trying to write to 'processed/'. What is the likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::example-bucket/raw/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::example-bucket/processed/*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}

Question 127hardmultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A data engineer examines the output of 'aws glue get-job-run' for a failed job. The job run state is FAILED, but ErrorMessage is empty. The job ran for 3600 seconds (1 hour) before failing. What is the MOST likely cause of the failure?

Exhibit (output of 'aws glue get-job-run'; not reproduced here)

Question 128hardmultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A CloudFormation template creates an S3 bucket. The data engineering team stores daily log files in this bucket and queries them using Amazon Athena. After 30 days, queries on logs older than 30 days start failing with 'Access Denied' errors. What is the MOST likely reason?

Exhibit

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "MyBucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "LifecycleConfiguration": {
          "Rules": [
            {
              "Id": "ArchiveRule",
              "Status": "Enabled",
              "Transition": {
                "StorageClass": "GLACIER",
                "TransitionInDays": 30
              },
              "ExpirationInDays": 365
            }
          ]
        }
      }
    }
  }
}
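To make the timeline in the exhibit concrete, the sketch below models where an object sits under a rule like `ArchiveRule` based only on its age. It is a simplification under stated assumptions: objects are assumed to start in the STANDARD storage class, lifecycle actions are treated as taking effect exactly on the configured day (in practice they run asynchronously), and the function name `lifecycle_state` is hypothetical.

```python
def lifecycle_state(age_days, transition_days=30, expiration_days=365):
    """Return the expected state of an object of the given age.

    Defaults mirror the exhibit: transition to GLACIER at 30 days,
    expiration at 365 days. Assumes objects are created in STANDARD.
    """
    if age_days >= expiration_days:
        return "EXPIRED"
    if age_days >= transition_days:
        return "GLACIER"
    return "STANDARD"


for age in (10, 45, 400):
    print(age, lifecycle_state(age))
# 10 STANDARD
# 45 GLACIER
# 400 EXPIRED
```

Objects reported as GLACIER here are archived and are not directly readable until they are restored, which is the detail the question turns on.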

Question 129mediummultiple choice

Read the full Data Engineering explanation →

A company captures streaming data from IoT devices using Amazon Kinesis Data Streams. The data is consumed by a custom application that processes records in near real-time. Recently, the application has been falling behind, and the stream is showing increased 'iterator age' metrics in CloudWatch. Which action is MOST likely to reduce the iterator age?

Question 130hardmultiple choice

Read the full Data Engineering explanation →

A data engineer needs to build a pipeline that ingests CSV files from an S3 bucket, validates the schema, and loads the data into an Amazon Redshift cluster. The pipeline must handle schema evolution gracefully by adding new columns as they appear in the source files. Which combination of AWS services and configurations would meet these requirements with minimal operational overhead?

Question 131easymultiple choice

Read the full Data Engineering explanation →

A machine learning team is preparing a dataset for model training. The data is stored in an Amazon S3 bucket with objects that are each approximately 100 MB in size. The team wants to use Amazon SageMaker for training. To optimize training performance, which data format and storage configuration should be used?

Question 132mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Analytics for Apache Flink to process real-time clickstream data. The application uses event time and watermarks for windowed aggregations. The team notices that the output from tumbling windows is delayed, and many late records are being dropped. What is the MOST likely cause?

Question 133hardmultiple choice

Read the full Data Engineering explanation →

A research lab stores large genomic datasets in Amazon S3 Glacier Deep Archive. They need to run a one-time analysis on a subset of 10 PB of data. The analysis will use an Amazon EMR cluster with Amazon S3 as the data source. What is the MOST cost-effective and performant way to make the data available for the EMR cluster?

Question 134easymultiple choice

Read the full Data Engineering explanation →

An ML engineer is using Amazon SageMaker to train a model on a dataset that contains personally identifiable information (PII). The data must be encrypted at rest and in transit. The company uses AWS KMS for key management. How should the engineer configure the SageMaker training job to meet these encryption requirements?

Question 135mediummultiple choice

Read the full Data Engineering explanation →

A company uses AWS Glue ETL jobs to transform data from Amazon RDS for MySQL to Amazon S3. The transformation includes aggregations and joins. The job runs daily and processes approximately 100 GB of data. Recently, the job started failing with memory errors on the worker nodes. Which approach would MOST effectively resolve the issue without changing the logic?

Question 136hardmultiple choice

Read the full Data Engineering explanation →

A data scientist needs to run a one-time training job on a 5 TB dataset stored in Amazon S3. The training algorithm requires random access to individual records. Which SageMaker input mode and data format combination would be MOST appropriate?

Question 137easymultiple choice

Read the full Data Engineering explanation →

A team is building a data pipeline using Amazon Kinesis Data Firehose to deliver real-time clickstream data to an Amazon S3 bucket. The data must be partitioned by year, month, day, and hour. Which configuration should the team use to achieve this?

Question 138mediummulti select

Read the full Data Engineering explanation →

A company is building a data lake on Amazon S3 and wants to ensure that data is encrypted at rest using AWS KMS. Which TWO actions are required to achieve this? (Choose TWO.)

Question 139hardmulti select

Read the full Data Engineering explanation →

A company is using Amazon DynamoDB as a source for a machine learning pipeline. The data is exported nightly to Amazon S3 using DynamoDB Streams and an AWS Glue job. The Glue job reads the stream records, transforms them, and writes to S3 in Parquet format. The team notices that the Glue job is taking too long and consuming high DynamoDB read capacity. Which THREE actions would reduce the load on DynamoDB and improve performance? (Choose THREE.)

Question 140easymulti select

Read the full Data Engineering explanation →

A data engineer is designing a data pipeline that uses Amazon Kinesis Data Streams to ingest sensor data. The data must be processed in real-time, and the results must be stored in Amazon DynamoDB. Which TWO AWS services can be used together to achieve this? (Choose TWO.)

Question 141mediummultiple choice

Read the full Data Engineering explanation →

An IAM policy is attached to a data engineering role that writes to an S3 bucket. The policy is shown in the exhibit. What is the effect of this policy?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::data-lake-bucket/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::data-lake-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    }
  ]
}
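The condition key in the exhibit, `s3:x-amz-server-side-encryption`, corresponds to the server-side encryption header sent with an upload. As a hedged sketch, the snippet below builds the arguments for an upload that carries that header with the value `aws:kms`. The helper name `build_put_object_kwargs` and the object key `raw/example.json` are illustrative assumptions; `ServerSideEncryption` and `SSEKMSKeyId` are parameters of the boto3 S3 `put_object` call.

```python
def build_put_object_kwargs(bucket, key, body, kms_key_id=None):
    """Build put_object arguments that request SSE-KMS encryption."""
    kwargs = {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Sent as the x-amz-server-side-encryption header on the request.
        "ServerSideEncryption": "aws:kms",
    }
    if kms_key_id is not None:
        # Optional: name a specific KMS key instead of the default one.
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs


kwargs = build_put_object_kwargs("data-lake-bucket", "raw/example.json", b"{}")
print(kwargs["ServerSideEncryption"])  # aws:kms

# With boto3 installed and credentials configured, the upload would be:
#   import boto3
#   boto3.client("s3").put_object(**kwargs)
```

A request built without the `ServerSideEncryption` argument would not carry the header at all, so it would not satisfy the `StringEquals` condition on the Allow statement.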

Question 142hardmultiple choice

Read the full Data Engineering explanation →

An ML engineer runs the AWS CLI command shown in the exhibit on a file in S3. The engineer wants to use this file in a SageMaker training job. What does the output reveal about the data?

Exhibit (AWS CLI command and output; not reproduced here)

Question 143easymultiple choice

Read the full Data Engineering explanation →

An AWS Glue job is failing with an error that it cannot access an S3 bucket. The IAM role attached to the Glue job is shown in the exhibit. What is the MOST likely cause of the failure?

Exhibit

Refer to the exhibit.

{
  "RoleName": "MLDataProcessingRole",
  "AssumeRolePolicyDocument": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "glue.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  },
  "AttachedManagedPolicies": [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AWSGlueServiceRole"
  ]
}

Question 144easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to process streaming data from an IoT fleet and store the results in Amazon S3 for analysis. The solution must be serverless and handle data that arrives at irregular intervals. Which AWS service should be used to ingest the data?

Question 145mediummultiple choice

Study the full Python automation breakdown →

A machine learning team is building a real-time inference pipeline using Amazon SageMaker. The input data is located in an S3 bucket, and the team needs to transform the data before inference using a custom Python script. The transformation should run on a serverless infrastructure and must be triggered automatically when new data arrives in S3. Which combination of services should the team use?

Question 146hardmultiple choice

Read the full Data Engineering explanation →

A data engineer needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3 for ML training. The data is currently stored in HDFS and is compressible. The network bandwidth between the on-premises data center and AWS is 1 Gbps. The team needs to minimize the time to transfer and also wants to avoid any downtime for the on-premises system. Which solution meets these requirements?

Question 147mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Streams for real-time clickstream analysis. The data is consumed by a Lambda function that enriches the records and stores them in Amazon S3. Recently, the Lambda function has been failing with throttling errors, and the consumer is falling behind. The team needs to increase the throughput of the consumer without changing the data format or the Lambda function code. What should the team do?

Question 148hardmultiple choice

Read the full Data Engineering explanation →

A financial services company is building a fraud detection model that requires joining real-time transaction data with a reference dataset of known fraudulent accounts stored in Amazon DynamoDB. The solution must minimize latency and be highly available. The reference dataset is updated frequently (every few minutes). Which architecture should the team use?

Question 149easymultiple choice

Read the full Data Engineering explanation →

A data scientist needs to perform exploratory data analysis on a 100 GB CSV file stored in Amazon S3. The data is not sensitive. The scientist wants to use SQL queries to filter and aggregate the data without setting up a server or moving the data. Which service should be used?

Question 150hardmultiple choice

Read the full Data Engineering explanation →

A company runs a data lake on Amazon S3 with partitions by year/month/day. A machine learning team needs to read daily data from the last 30 days for model retraining. The data format is Parquet. The team uses Amazon Athena to query the data, but the queries are slow and scanning too much data. The team has already optimized the file sizes and compression. What additional step can reduce the amount of data scanned?

Question 151mediummultiple choice

Read the full Data Engineering explanation →

A company is using Amazon SageMaker to train a model on a dataset that is updated daily. The data is stored in an S3 bucket. The training pipeline uses AWS Step Functions to orchestrate data preprocessing and model training. The preprocessing step uses a SageMaker Processing job that reads data from S3, cleans it, and writes the output back to S3. The team notices that the preprocessing step often fails due to insufficient disk space on the processing instance. Which change should the team make to resolve this issue without increasing cost?

Question 152mediummultiple choice

Read the full Data Engineering explanation →

A team is building a data pipeline that ingests data from an Amazon S3 bucket, transforms it using AWS Glue, and loads it into Amazon Redshift for analysis. The Glue job runs on a schedule every hour. The team has noticed that the job takes longer than expected and sometimes fails due to memory issues. The data volume is variable, with occasional spikes. Which solution should the team implement to optimize the pipeline?

Question 153easymulti select

Read the full Data Engineering explanation →

Which TWO AWS services can be used to transform data in a streaming fashion without using a persistent cluster? (Choose TWO.)

Question 154mediummulti select

Read the full Data Engineering explanation →

Which THREE factors should a data engineer consider when choosing between Amazon S3 and Amazon Redshift for storing large datasets used for machine learning? (Choose THREE.)

Question 155hardmulti select

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Streams with a Lambda consumer. The Lambda function writes results to an S3 bucket. The team wants to ensure that each record is processed exactly once and in order. Which TWO configurations should the team implement? (Choose TWO.)

Question 156hardmultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. An ML engineer applies this bucket policy to an S3 bucket. The SageMaker execution role MySageMakerRole is used to train a model. The training data is located in s3://my-bucket/data/. The SageMaker training job fails with an access error. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/data/*",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/MySageMakerRole"
      }
    },
    {
      "Effect": "Deny",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket",
      "Principal": "*"
    }
  ]
}

Question 157mediummultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. An ML engineer runs the CLI command shown in the exhibit to inspect files in an S3 bucket. The training data consists of 200 CSV files, each 1 GB. The engineer plans to use Amazon SageMaker to train a model using this data. What should the engineer do to optimize training performance?

Exhibit (AWS CLI command and output; not reproduced here)

Question 158hardmultiple choice

Read the full Data Engineering explanation →

A company runs a real-time recommendation system that uses Amazon SageMaker endpoints for inference. The system ingests user activity data from a mobile app via Amazon API Gateway and AWS Lambda, which writes events to an Amazon Kinesis Data Stream. A second Lambda function consumes the stream, calls a SageMaker endpoint to generate recommendations, and stores the results in Amazon DynamoDB. The system has been working well, but recently the team noticed an increase in latency from the time a user action occurs to when the recommendation is stored. The SageMaker endpoint shows increased invocation latency but no throttling. CloudWatch metrics show that the Kinesis stream's IteratorAgeMilliseconds is increasing, indicating the consumer is falling behind. The Lambda consumer's duration is within limits, but the number of invocations is lower than expected. The team suspects the issue is with the event source mapping. Which course of action should the team take to reduce the latency?

Question 159easymultiple choice

Read the full Data Engineering explanation →

A data engineering team needs to process streaming data from thousands of IoT devices. The data must be ingested with low latency and processed in near real-time to detect anomalies. Which AWS service should they use for ingestion?

Question 160mediummultiple choice

Read the full Data Engineering explanation →

A company is using AWS Glue to run ETL jobs that process data from an Amazon RDS for PostgreSQL database. The jobs are failing with connection timeouts. The security group for the RDS instance allows inbound traffic from the Glue job's security group. What is the most likely cause?

Question 161hardmultiple choice

Read the full Data Engineering explanation →

A data scientist needs to run ad-hoc SQL queries on a large dataset stored in Amazon S3 (Parquet format, 2 TB). The queries are interactive and require sub-second response times. Which service should they use?

Question 162easymultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Firehose to load streaming data into Amazon S3. The data is in JSON format, and they want to convert it to Parquet before storage. What should they configure?

Question 163mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The data is currently stored in HDFS. Which service should they use for an efficient transfer?

Question 164hardmultiple choice

Read the full Data Engineering explanation →

An e-commerce company uses Amazon DynamoDB as the primary data store for user sessions. They want to run analytics on historical session data using Amazon Athena. What is the recommended approach to export DynamoDB data to S3 in a format optimized for Athena?

Question 165easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should they use for scheduling?

Question 166mediummultiple choice

Read the full Data Engineering explanation →

A company is using AWS Glue Data Catalog as the metadata store for their data lake. They have multiple AWS accounts and want to share the catalog across accounts. Which feature should they use?

Question 167hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is designing a data pipeline that ingests 500 GB of data daily from an on-premises Oracle database to Amazon S3. The pipeline must minimize data loss and support change data capture (CDC). Which combination of services should they use?

Question 168mediummulti select

Read the full Data Engineering explanation →

Which TWO of the following are valid ways to reduce query costs in Amazon Athena? (Choose 2)

Question 169hardmulti select

Read the full Data Engineering explanation →

Which THREE of the following are best practices for optimizing performance of Amazon EMR clusters? (Choose 3)

Question 170easymulti select

Read the full Data Engineering explanation →

Which TWO services can be used to transform data in transit within a Kinesis Data Firehose delivery stream? (Choose 2)

Question 171hardmultiple choice

Read the full Data Engineering explanation →

A financial services company uses Amazon Kinesis Data Streams with 50 shards to ingest real-time stock trade data. The data is consumed by a custom Java application running on Amazon EC2 instances. Recently, the application has been experiencing high latency, and CloudWatch metrics show that the average iterator age is increasing. The application uses the Kinesis Client Library (KCL) with DynamoDB for lease tracking. The EC2 instances are in an Auto Scaling group with a minimum of 2 and maximum of 10 instances, and the current CPU utilization is below 50%. The team wants to reduce latency without increasing costs significantly. What should they do?

Question 172mediummultiple choice

Read the full Data Engineering explanation →

A media company ingests video metadata from multiple sources into an Amazon S3 bucket. Each metadata record is a JSON file about 2 KB. They use AWS Glue ETL jobs to process these files and load them into Amazon Redshift for analytics. The jobs currently run hourly and take about 10 minutes to process all new files. However, the company is growing and expects the number of files to increase 100x. The data engineering team wants to minimize processing time and cost. The Glue job currently reads all files from the S3 bucket using a full scan. What should they do to optimize the pipeline?

Question 173easymultiple choice

Read the full Data Engineering explanation →

A retail company uses Amazon Redshift for its data warehouse. The data engineering team runs ETL jobs that load data from multiple sources into Redshift daily. They notice that the load performance is slow and the cluster CPU utilization is high during the ETL window. The team wants to improve load performance without changing the cluster configuration. They currently load data using INSERT statements from a staging table. What should they do?

Question 174easymultiple choice

Read the full Data Engineering explanation →

A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The data arrives in bursts and must be processed with minimal latency. Which AWS service is most appropriate for the ingestion layer?

Question 175mediummultiple choice

Read the full Data Engineering explanation →

A company is building a data pipeline using AWS Glue to transform data from Amazon RDS to Amazon S3. The pipeline runs daily and processes about 500 GB of data. The team notices that the job is taking longer than expected. Which change would MOST improve the job performance?

Question 176hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is designing a data lake on Amazon S3 that must support both batch and streaming analytics. The data comes in Parquet format and needs to be queryable by Amazon Athena. Which partitioning strategy will optimize query performance and reduce costs?

Question 177easymultiple choice

Read the full Data Engineering explanation →

A company uses AWS Lambda to process events from Amazon S3. The Lambda function transforms the data and writes results to another S3 bucket. Recently, the function has been failing due to timeout errors when processing large files. Which solution should the data engineer implement?

Question 178mediummultiple choice

Read the full Data Engineering explanation →

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 1 Gbps internet connection and wants to complete the transfer within 5 days. What is the MOST cost-effective and reliable solution?
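The feasibility of a network transfer in this scenario comes down to simple arithmetic. The sketch below works through it, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a hypothetical `utilization` factor for the share of the link actually available to the transfer; neither assumption comes from the question itself.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move data_tb terabytes over a link_gbps link.

    utilization is an assumed fraction of the link available to the transfer.
    """
    bits = data_tb * 1e12 * 8                        # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                          # seconds per day


# 50 TB over 1 Gbps, first with the whole link, then with 80% of it.
full_link = transfer_days(50, 1.0)
shared_link = transfer_days(50, 1.0, utilization=0.8)

print(f"{full_link:.2f} days at 100% utilization")
print(f"{shared_link:.2f} days at 80% utilization")
```

Even with the link fully dedicated to the transfer, the estimate sits close to the 5-day limit, so any sustained contention on the connection pushes it past the deadline.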

Question 179hardmultiple choice

Read the full Data Engineering explanation →

A company uses Amazon EMR to run Spark jobs on a transient cluster that processes data from S3. The jobs are failing with 'OutOfMemory' errors. The data engineer has already increased the executor memory. Which additional configuration change would MOST likely resolve the issue?

Question 180easymultiple choice

Read the full Data Engineering explanation →

A data engineering team needs to orchestrate a complex workflow that involves multiple AWS Glue jobs, Lambda functions, and S3 operations. The workflow must run on a schedule and allow monitoring of each step. Which AWS service should they use?

Question 181mediummulti select

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by a fleet of EC2 instances running a custom consumer application. The consumer is falling behind and the shard iterator age is increasing. Which TWO actions should the data engineer take to improve consumer performance? (Choose TWO.)

Question 182hardmulti select

Read the full Data Engineering explanation →

A data engineer is designing a data pipeline to process streaming data from Amazon Kinesis Data Streams and store the results in Amazon S3 in Parquet format. The data must be available for querying in Amazon Athena within minutes of arrival. Which THREE services should be used together? (Choose THREE.)

Question 183easymulti select

Read the full Data Engineering explanation →

A company wants to centralize logging from multiple AWS accounts and on-premises servers. The logs must be stored cost-effectively and be searchable. Which TWO services should be used? (Choose TWO.)

Question 184mediummultiple choice

Read the full Data Engineering explanation →

A data engineer is troubleshooting an AWS Glue job that reads from an S3 bucket and writes to another S3 bucket. The job fails with an 'Access Denied' error when trying to write to the output bucket. The IAM policy attached to the Glue service role is shown. What is the MOST likely cause of the failure?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "*"
    }
  ]
}
```

Question 185hardmultiple choice

Read the full Data Engineering explanation →

A data engineer runs the above CLI command and sees that the bucket contains many small Parquet files (1 MB each) under the prefix. When querying this data with Athena, the query performance is poor and costs are high. Which approach would MOST improve performance and reduce cost?


Question 186hardmultiple choice

Read the full Data Engineering explanation →

A company runs a data pipeline using AWS Glue ETL jobs that process about 10 TB of data daily from Amazon S3. The jobs are triggered by a schedule and write results to a separate S3 bucket. Recently, the jobs have been taking longer to complete, and the data engineering team has observed that the number of files in the source bucket has increased significantly, from thousands to millions of small files (each about 100 KB). The Glue jobs are configured to use the 'Group Files' option, but performance is still poor. The team needs to improve the job performance without changing the source data generation process. Which course of action should the team take?

Question 187mediummultiple choice

Read the full Data Engineering explanation →

An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to an Amazon S3 bucket. The data is then queried using Amazon Athena. The marketing team wants to run daily reports that aggregate click events by product ID. However, the reports are slow because Athena scans the entire dataset each time. The data is partitioned by date (e.g., s3://bucket/clickstream/2023/01/01/). The product ID is a column within the data. The data engineering team wants to improve query performance without moving the data to another service. Which approach should the team take?

Question 188easymultiple choice

Read the full Data Engineering explanation →

A startup is building a data pipeline that ingests data from multiple sources into an Amazon S3 data lake. The data includes CSV files from legacy systems, JSON from web APIs, and Avro from mobile apps. The data must be transformed into Parquet format and cataloged for querying with Amazon Athena. The pipeline must be serverless and minimize operational overhead. The team has decided to use AWS Glue for ETL and cataloging. However, they are concerned about the cost of running Glue jobs continuously. The data arrives in small batches every 10 minutes. Which approach should the team use to minimize cost while meeting the requirements?

Question 189mediummulti select

Read the full Data Engineering explanation →

A data engineer is building a streaming pipeline using Amazon Kinesis Data Streams and AWS Lambda. The Lambda function processes records and writes results to Amazon S3. The engineer notices that the Lambda function is experiencing throttling and some records are being dropped. Which TWO actions should the engineer take to improve the reliability of the pipeline?

Question 190hardmulti select

Read the full Data Engineering explanation →

A machine learning team is using Amazon SageMaker to train a model on a dataset stored in S3. The training job reads data from S3 using Pipe input mode, but the training is slow. The team wants to improve data throughput. Which THREE actions should they take?

Question 191mediummultiple choice

Read the full Data Engineering explanation →

A data engineer is designing a data lake on Amazon S3. The data is collected from IoT devices and is highly variable in volume. The engineer needs to ensure that the data is ingested reliably and can be processed in near real-time. Which AWS service should be used to ingest the data into the data lake?

Question 192hardmultiple choice

Read the full Data Engineering explanation →

A data engineer has attached the IAM policy shown in the exhibit to an IAM role used by an AWS Glue ETL job. The job reads from and writes to 'my-data-bucket'. The job is failing with an Access Denied error. What is the most likely cause?

Exhibit

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-bucket/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "10.0.0.0/24"
        }
      }
    }
  ]
}
```

Question 193easymultiple choice

Read the full Data Engineering explanation →

A machine learning engineer is using Amazon SageMaker to train a model. The training dataset is 2 TB and is stored in Amazon S3. The engineer wants to reduce the training time by improving data loading performance. Which data ingestion mode should be used?

Question 194mediummultiple choice

Read the full Data Engineering explanation →

A data engineer needs to transform a large dataset stored in Amazon S3 using Apache Spark. The engineer wants to minimize costs and avoid managing infrastructure. Which AWS service should be used?

Question 195hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is investigating a slow Athena query on a partitioned table. The table is partitioned by year, month, and day, and the data is stored in S3 with the prefix pattern 'raw/YYYY/MM/DD/'. The engineer runs the above CLI command and sees that there are many small files. Which action would most improve query performance?


Question 196hardmultiple choice

Read the full Data Engineering explanation →

A data engineering team is building a real-time fraud detection pipeline. The pipeline ingests transaction data from an Amazon Kinesis Data Stream with 10 shards. Each shard produces about 500 records per second, each record is 2 KB. The data is processed by a Lambda function that runs for about 200 ms and then writes results to an Amazon DynamoDB table. The team notices that the Lambda function is experiencing a high number of throttles, and there are increasing numbers of records being retried. The Lambda function's reserved concurrency is set to 100. The DynamoDB table has 100 read capacity units and 100 write capacity units. Which change would most effectively reduce throttling and improve processing throughput?
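One way to reason about this scenario is to compare the incoming record rate with what each downstream component can absorb. The sketch below is only a rough capacity check under stated assumptions: every stream record becomes one standard (non-transactional) DynamoDB write, the item is the same size as the record, and the event source mapping uses a parallelization factor of 1 (one concurrent Lambda invocation per shard). None of these assumptions is stated in the question.

```python
import math

# Figures taken from the scenario.
SHARDS = 10
RECORDS_PER_SHARD_PER_SEC = 500
RECORD_KB = 2
RESERVED_CONCURRENCY = 100
TABLE_WCU = 100

# Assumption: parallelization factor of 1 on the event source mapping.
PARALLELIZATION_FACTOR = 1

incoming_per_sec = SHARDS * RECORDS_PER_SHARD_PER_SEC        # records/s across the stream
concurrent_invocations = SHARDS * PARALLELIZATION_FACTOR     # Lambda invocations in flight

# A standard write consumes 1 WCU per 1 KB of item size, rounded up.
wcu_per_item = math.ceil(RECORD_KB / 1)
required_wcu = incoming_per_sec * wcu_per_item
sustainable_writes_per_sec = TABLE_WCU // wcu_per_item

print(f"incoming records/s:        {incoming_per_sec}")
print(f"concurrent invocations:    {concurrent_invocations} (reserved: {RESERVED_CONCURRENCY})")
print(f"WCU needed for all writes: {required_wcu} (provisioned: {TABLE_WCU})")
print(f"writes/s the table allows: {sustainable_writes_per_sec}")
```

Under these assumptions the Lambda concurrency stays well below the reserved limit, while the table's provisioned write capacity is orders of magnitude short of the write rate, which is the kind of mismatch that shows up as throttled writes and retried batches.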

Question 197mediummultiple choice

Read the full Data Engineering explanation →

A machine learning team is preparing a large dataset for training. The dataset consists of 10,000 CSV files, each about 100 MB, stored in Amazon S3. The team wants to transform the data using AWS Glue ETL jobs. The transformation involves filtering rows, adding new columns, and joining with a small reference table (100 KB). The team is concerned about job performance and cost. They currently have a Glue job with 10 DPU (Data Processing Units) and it takes about 2 hours to complete. The team wants to reduce the runtime and cost. Which approach should they take?

Question 198easymultiple choice

Read the full Data Engineering explanation →

A data engineer is tasked with building a data pipeline that moves data from an on-premises database to Amazon S3 for analytics. The database is a MySQL instance that is 2 TB in size. The company has a 1 Gbps dedicated network connection to AWS (AWS Direct Connect). The data must be transferred once daily. The engineer needs to choose the most efficient and reliable service for this task. Which service should they use?

Question 199mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team is using Apache Spark on Amazon EMR to process streaming data from Amazon Kinesis Data Streams. The Spark application uses structured streaming to read from Kinesis, perform transformations, and write to Amazon S3 in Parquet format. The team notices that the application is falling behind and the processing latency is increasing. The Kinesis stream has 5 shards, and the EMR cluster has 5 core nodes of type r5.xlarge. The Spark application is configured with 5 executors, each with 2 cores and 8 GB memory. The team wants to reduce processing latency. Which change would be most effective?

Question 200mediummultiple choice

Read the full Data Engineering explanation →

A data engineer needs to continuously ingest streaming data from thousands of IoT devices and store the raw data in Amazon S3 for archival processing. The data volume varies significantly throughout the day, and the solution must be serverless, scalable, and cost-effective. Which AWS service should be used to capture and buffer the streaming data before writing to S3?

Question 201hardmultiple choice

Read the full Data Engineering explanation →

A company is running a machine learning training job on Amazon SageMaker that reads training data from an S3 bucket. The job fails intermittently with an S3 throttling error. The data is partitioned across thousands of small files (average 100 KB). Which strategy is MOST effective to resolve the throttling issue?

Question 202easymultiple choice

Read the full Data Engineering explanation →

A data scientist wants to explore a large dataset stored in Amazon S3 using SQL queries without moving the data. The dataset is in CSV format and is updated daily with new partitions. Which AWS service should be used to directly query the data in S3?

Question 203mediummultiple choice

Read the full Data Engineering explanation →

A company is building a data pipeline that ingests data from multiple sources into a centralized data lake on Amazon S3. The data must be transformed before it is available for analysis. The pipeline should be event-driven, automatically triggering transformation jobs when new data arrives. Which combination of AWS services should be used?

Question 204hardmultiple choice

Read the full Data Engineering explanation →

A data engineering team is designing a data lake on Amazon S3. They need to enforce encryption at rest for all data stored in the bucket. The security policy requires that the encryption keys be managed by the organization using AWS Key Management Service (KMS), and that the bucket must deny uploads of unencrypted objects. Which bucket policy should be applied?

Question 205easymultiple choice

Read the full Data Engineering explanation →

A company uses Amazon RDS for its transactional database and needs to export a daily snapshot of a table to Amazon S3 in Parquet format for analytics. Which AWS service can perform this export without writing custom code?

Question 206hardmultiple choice

Read the full Data Engineering explanation →

A company is streaming data from thousands of devices using Amazon Kinesis Data Streams. The data is consumed by an AWS Lambda function that processes each record. The Lambda function is experiencing high error rates and throttling due to the volume of data. Which action would MOST effectively improve the processing throughput and reduce errors?

Question 207mediummultiple choice

Read the full Data Engineering explanation →

A data scientist needs to run complex ETL transformations on a large dataset stored in Amazon S3. The transformations are written in PySpark and require occasional access to Hive metastore. The solution should minimize operational overhead and allow the data scientist to focus on code development. Which AWS service should be used?

Question 208easymultiple choice

Read the full Data Engineering explanation →

A company wants to perform real-time analytics on streaming data from clickstreams. The data needs to be ingested, processed, and made available for querying within seconds. Which AWS service should be used for the processing step?

Question 209mediummultiple choice

Read the full Data Engineering explanation →

A company is using AWS Glue to catalog metadata from various data sources. The crawler is configured to run daily. However, the catalog is not reflecting new partitions added to an S3 bucket during the day. What is the MOST likely cause?

Question 210hardmulti select

Read the full Data Engineering explanation →

A data engineer is designing a data pipeline that ingests data from a relational database into a data lake on Amazon S3. The data must be incrementally loaded daily. Which TWO AWS services can be used together to achieve this?

Question 211mediummulti select

Read the full Data Engineering explanation →

A company wants to use Amazon SageMaker to train a model using data stored in Amazon S3. The data is sensitive and must be encrypted at rest and in transit. Which THREE steps should be taken to ensure data security?

Question 212easymulti select

Read the full Data Engineering explanation →

A data engineer needs to collect and analyze log data from multiple EC2 instances in real-time. The solution should be serverless and scalable. Which TWO AWS services should be used?

Question 213hardmulti select

Read the full Data Engineering explanation →

A company is using AWS Glue ETL jobs to transform data. The jobs are failing due to insufficient memory. The data processing involves complex joins and aggregations. Which THREE actions can improve job performance and reduce memory usage?

Question 214mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The solution must handle data that arrives in bursts and must be able to reprocess failed records automatically. Which combination of AWS services should the team use?

Question 215hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is designing a data pipeline that transforms raw JSON files (each 50-200 KB) in Amazon S3 into Parquet format using AWS Glue. The pipeline must minimize data processing costs and handle a high volume of small files (millions per day). The engineer configures a Glue ETL job with Spark, but the job is slow and expensive due to overhead of reading many small files. Which optimization should the engineer implement to reduce cost and improve performance?

Question 216easymultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data contains personally identifiable information (PII) that must be redacted before storage. Which AWS service can be integrated with Kinesis Data Firehose to transform the data in real time?

Question 217mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team needs to build a data lake on Amazon S3 that will be queried by Amazon Athena and Amazon Redshift Spectrum. The data will be ingested from multiple sources in various formats (CSV, JSON, Parquet). Which partitioning strategy will provide the best query performance for date-range queries?

Question 218hardmultiple choice

Read the full Data Engineering explanation →

A company has an AWS Glue ETL job that reads data from an Amazon RDS for MySQL table and writes to Amazon S3 in Parquet format. The job runs daily and processes 500 GB of data. Recently, the job has been failing with memory errors during the write phase. The data schema is wide (200 columns). Which change should a data engineer make to the Glue job to resolve the memory issue?

Question 219easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 100 Mbps internet connection and a tight deadline of two weeks. Which AWS service should the engineer use to transfer the data most efficiently?

Question 220mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team is building a real-time fraud detection system. Transactions are ingested via Amazon Kinesis Data Streams, and a machine learning model (deployed on Amazon SageMaker) scores each transaction. The team needs to store the raw transactions and the model's predictions in Amazon S3 for later analysis. Which architecture should the team use?

Question 221hardmultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Redshift for its data warehouse. The data engineering team needs to load 10 TB of data from Amazon S3 into Redshift every night. The team wants to minimize the load time and use the fewest number of COPY commands. The data is in CSV format and is partitioned by date in S3. Which approach should the team take?

Question 222easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to schedule an AWS Glue ETL job to run every hour. The job reads from an Amazon DynamoDB table and writes to Amazon S3. Which AWS service should the engineer use to trigger the Glue job on schedule?

Question 223mediummulti select

Read the full Data Engineering explanation →

A data engineering team is designing a data pipeline that processes streaming data from Amazon Kinesis Data Streams using AWS Lambda. The team notices that some records are being processed multiple times (duplicates). Which TWO steps should the team take to ensure exactly-once processing?

Question 224hardmulti select

Read the full Data Engineering explanation →

A company uses Amazon Athena to query a data lake in Amazon S3. The data is partitioned by year, month, day, and hour. The team notices that queries are slow and expensive. The team wants to improve performance and reduce costs. Which THREE actions should the team take?

Question 225easymulti select

Read the full Data Engineering explanation →

A data engineer is building a data pipeline using AWS Glue. The pipeline reads data from Amazon S3, transforms it, and writes it back to S3 in a different format. The engineer needs to handle schema evolution (new columns added over time). Which TWO features of AWS Glue can help manage schema evolution?

Question 226hardmultiple choice

Read the full Data Engineering explanation →

A data engineer uses the IAM policy shown in the exhibit for an AWS Lambda function that processes data in S3 and triggers an AWS Glue job. The Lambda function is unable to start the Glue job. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-lake/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "arn:aws:glue:us-east-1:123456789012:job/my-etl-job"
    }
  ]
}
```

Question 227mediummultiple choice

Read the full Data Engineering explanation →

A data engineer runs the AWS CLI command above to inspect an object in S3. The engineer wants to query this metadata (kafka-offset) using Amazon Athena to track processing progress. How can the engineer make this metadata available for Athena queries without modifying the existing data pipeline?


Question 228hardmultiple choice

Read the full Data Engineering explanation →

A data engineer configures an S3 event notification to trigger an AWS Lambda function when a new object is created in 'my-input-bucket'. The Lambda function processes the CSV file and writes results to 'my-output-bucket'. The engineer notices that the Lambda function is not triggered for some objects. Which step should the engineer take to diagnose the issue?

Exhibit

Refer to the exhibit.

```
{
  "Records": [
    {
      "eventVersion": "2.0",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "s3SchemaVersion": "1.0",
        "bucket": {
          "name": "my-input-bucket",
          "arn": "arn:aws:s3:::my-input-bucket"
        },
        "object": {
          "key": "data/file.csv",
          "size": 1024,
          "eTag": "abc123"
        }
      }
    }
  ]
}
```

Question 229mediummultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Kinesis Data Analytics application that runs SQL queries. The application has been failing intermittently with 'ProvisionedThroughputExceededException' errors. Which action should be taken to resolve this issue?

Question 230hardmultiple choice

Read the full Data Engineering explanation →

A data engineering team is designing a data pipeline to process large CSV files (10-50 GB each) stored in Amazon S3. The pipeline must transform the data using AWS Glue and load it into Amazon Redshift for analytics. The team wants to minimize costs while ensuring the pipeline can handle peak loads. Which approach is the most cost-effective?

Question 231easymultiple choice

Read the full Data Engineering explanation →

A company is using Amazon DynamoDB to store sensor data. The data is exported to Amazon S3 using DynamoDB Streams and AWS Lambda for long-term archival. Recently, the Lambda function has been failing due to 'ProvisionedThroughputExceededException' on the DynamoDB stream. What is the most likely cause?

Question 232hardmultiple choice

Read the full Data Engineering explanation →

A data scientist is building a training dataset from data stored in Amazon S3. The data consists of JSON files each containing a 'timestamp' field. The scientist wants to use AWS Glue to catalog the data and enable querying via Amazon Athena. However, Athena queries are returning zero results for time-range filters. What is the most likely cause?

Question 233mediummultiple choice

Read the full Data Engineering explanation →

A company is streaming data from IoT devices to Amazon Kinesis Data Firehose, which writes to an Amazon S3 bucket. The data is then processed by an AWS Glue ETL job and loaded into Amazon Redshift. The team notices that some records are missing in Redshift. They suspect data loss during the Firehose delivery. Which configuration parameter should be checked first?

Question 234easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to set up a data pipeline that ingests data from an Amazon RDS MySQL database into Amazon S3. The pipeline should run daily and capture incremental changes (inserts, updates, deletes) from the source database. Which AWS service should be used as the data ingestion tool?

Question 235mediummultiple choice

Read the full Data Engineering explanation →

A company is building a data lake on Amazon S3 and wants to use AWS Glue to catalog the data. The data includes CSV, Parquet, and JSON files. The team wants to ensure that the Glue crawler can infer the schema correctly and update the Data Catalog when new partitions are added. Which crawler configuration should be used?

Question 236hardmultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Streams with a shard count of 5. The data producer sends 1000 records per second, each 1 KB in size. The consumer application reads from the stream using the Kinesis Client Library (KCL) and processes records. The consumer is experiencing high latency and falling behind. What is the most effective way to improve consumer throughput?

Question 237easymultiple choice

Read the full Data Engineering explanation →

A company wants to store semi-structured data from IoT sensors in a cost-effective manner for occasional querying. The data is not updated once written. Which Amazon S3 storage class is the most cost-effective for this use case?

Question 238mediummulti select

Read the full Data Engineering explanation →

Which TWO configurations are required to enable AWS Glue to access data stored in a VPC? (Choose two.)

Question 239hardmulti select

Read the full Data Engineering explanation →

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose three.)

Question 240mediummulti select

Read the full Data Engineering explanation →

Which TWO steps are required to set up cross-account access to an Amazon S3 data lake for AWS Glue jobs running in a different AWS account? (Choose two.)

Question 241hardmultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A company is using the Kinesis stream 'my-stream' with one shard. The producer is sending 1000 records per second, each 1 KB. The consumer is reading from the stream using the Kinesis Client Library (KCL). The consumer is able to process 500 records per second per shard. What is the most likely cause of the consumer falling behind?
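
The rates quoted in this scenario can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes consumer capacity scales linearly with the number of shards, which a real KCL application may not achieve.

```python
import math

def backlog_growth(producer_rps: int, consumer_rps_per_shard: int, shards: int) -> int:
    """Records per second by which the consumer falls behind (0 if it keeps up)."""
    return max(0, producer_rps - consumer_rps_per_shard * shards)

def shards_to_keep_up(producer_rps: int, consumer_rps_per_shard: int) -> int:
    """Smallest shard count at which consumer capacity matches the producer rate."""
    return math.ceil(producer_rps / consumer_rps_per_shard)

# Figures from the question: 1000 records/s in, 500 records/s out per shard, 1 shard.
print(backlog_growth(1000, 500, 1))
print(shards_to_keep_up(1000, 500))
```

With one shard the backlog grows by the difference between the two rates every second, so the consumer never catches up without more processing capacity.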

Question 242mediummultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A data engineer is troubleshooting an AWS Glue job that fails with an 'AccessDenied' error when trying to write to the S3 bucket 'my-data-lake'. The IAM policy attached to the Glue service role is shown. What is the missing permission?

Exhibit

IAM policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-lake/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/glue/*"
    }
  ]
}

Question 243hardmultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A team deploys this CloudFormation stack. The Kinesis stream is created, but the Firehose delivery stream fails to create with a 'Resource handler returned message: Unable to assume role' error. What is the most likely cause?

Exhibit

CloudFormation snippet:
"MyKinesisStream": {
  "Type": "AWS::Kinesis::Stream",
  "Properties": {
    "Name": "data-stream",
    "ShardCount": 2,
    "RetentionPeriodHours": 168,
    "StreamEncryption": {
      "EncryptionType": "KMS",
      "KeyId": "alias/aws/kinesis"
    }
  }
}

"MyFirehose": {
  "Type": "AWS::KinesisFirehose::DeliveryStream",
  "Properties": {
    "DeliveryStreamType": "KinesisStreamAsSource",
    "KinesisStreamSourceConfiguration": {
      "KinesisStreamARN": { "Fn::GetAtt": ["MyKinesisStream", "Arn"] },
      "RoleARN": "arn:aws:iam::123456789012:role/firehose-role"
    },
    "S3DestinationConfiguration": {
      "BucketARN": "arn:aws:s3:::my-bucket",
      "RoleARN": "arn:aws:iam::123456789012:role/firehose-role"
    }
  }
}

Question 244mediummultiple choice

Read the full Data Engineering explanation →

A company is streaming real-time sensor data from IoT devices to Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that enriches the records with metadata from an Amazon DynamoDB table and writes the results to an Amazon S3 bucket. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' errors from DynamoDB. The data volume is variable, with occasional bursts. Which solution should a data engineer implement to resolve this issue without losing data?

Question 245hardmultiple choice

Read the full Data Engineering explanation →

An e-commerce company uses Amazon Redshift for analytics. The data engineering team needs to load daily sales data from an S3 bucket that receives new files every hour. The data must be loaded into Redshift with minimal impact on query performance during the day, and they need to handle late-arriving data (files that appear after the daily load). Which approach should they use?

Question 246easymultiple choice

Read the full Data Engineering explanation →

A data scientist needs to train a machine learning model using a large dataset (500 GB) stored in an S3 bucket. The training will be performed on a SageMaker notebook instance. The data scientist wants to minimize data transfer costs and reduce training time. Which data ingestion approach should the data engineer recommend?

Question 247hardmultiple choice

Read the full Data Engineering explanation →

A company is using AWS Glue to run ETL jobs that transform data from multiple sources into a data lake on S3. The jobs are scheduled to run hourly. Recently, the jobs have been failing intermittently with 'MemoryError' exceptions. The data volume has grown over time. The data engineer needs to resolve this issue cost-effectively. Which action should be taken?

Question 248easymultiple choice

Read the full Data Engineering explanation →

A data engineering team needs to set up a data pipeline that ingests streaming data from an Apache Kafka cluster running on Amazon EKS into an S3 data lake. The data must be stored in Parquet format, partitioned by date and event type. The team wants a fully managed solution with minimal operational overhead. Which solution should they choose?

Question 249mediummultiple choice

Read the full Data Engineering explanation →

A data scientist is training a deep learning model using a large dataset stored in S3. The training job runs on a SageMaker training instance with a GPU. The data engineer notices that the GPU utilization is low, and the training is I/O bound. The data is read directly from S3 using the SageMaker SDK. Which change should the data engineer recommend to improve GPU utilization?

Question 250mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon DynamoDB as the primary data store for a real-time recommendation engine. The data engineering team needs to export a daily snapshot of the DynamoDB table to S3 for offline analytics. The table is large (10 TB) and has a high read/write throughput. Which method will export the data with the least impact on the production workload?

Question 251hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is building a data pipeline that uses AWS Lambda to process records from an SQS queue and write results to an S3 bucket. The Lambda function processes each record individually and writes a separate file to S3. The team notices high latency and wants to reduce the number of S3 PUT requests to improve performance and reduce cost. Which approach should the data engineer take?

Question 252easymulti select

Read the full Data Engineering explanation →

A company has a large number of small CSV files (hundreds of thousands) in an S3 bucket. A data engineer needs to run a SQL query on this data using Amazon Athena. The queries are currently slow and expensive. Which TWO actions will improve query performance and reduce cost? (Choose two.)

Question 253mediummulti select

Read the full Data Engineering explanation →

A data engineer needs to design a data ingestion pipeline that ingests data from a MySQL database hosted on-premises into Amazon S3 for analytics. The pipeline must capture change data (CDC) and run continuously with low latency. Which TWO services should the data engineer use? (Choose two.)

Question 254hardmulti select

Read the full Data Engineering explanation →

A company is using Amazon Redshift for data warehousing. The data engineering team observes that query performance degrades over time due to data skew. Which THREE strategies should the team implement to improve performance? (Choose three.)

Question 255mediummultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. An IAM policy is attached to a data engineering team's role. The team needs to upload data to the 'confidential' prefix in the 'my-data-lake' bucket. However, they are receiving 'AccessDenied' errors. What is the likely cause?

Exhibit

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-lake/*"
    },
    {
      "Effect": "Deny",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::my-data-lake/confidential/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalAccount": "123456789012"
        }
      }
    }
  ]
}
```

Question 256hardmultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A data engineer runs an Athena query and gets a failure. What is the most likely cause?

Question 257easymultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A data engineer has deployed this CloudFormation template. The Glue job 'my-etl-job' reads from the S3 bucket 'my-data-lake-bucket' and writes transformed data to another bucket. After 30 days, the data engineer notices that the Glue job fails with 'Input data not found' errors. What is the most likely cause?

Question 258easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to extract data from an Amazon RDS for MySQL database into Amazon S3 for further processing. The data volume is 2 TB and the job must run daily within a 1-hour window. Which AWS service is most suitable for this task?

Question 259mediummultiple choice

Read the full Data Engineering explanation →

A company is building a data lake on Amazon S3. Data arrives from multiple sources in different formats (CSV, JSON, Parquet). The engineering team wants to query this data using Amazon Athena with minimal transformation. Which approach minimizes query cost and improves performance?

Question 260hardmultiple choice

Read the full Data Engineering explanation →

A data pipeline uses AWS Lambda to process small files (10-50 MB) from an S3 bucket and write results to DynamoDB. The Lambda function times out after 15 seconds for larger files. The team wants to handle files up to 100 MB without changing the Lambda code. Which solution is MOST cost-effective?

Question 261easymultiple choice

Read the full Data Engineering explanation →

A data scientist needs to run a one-time query on 10 TB of data stored in S3 using Amazon Athena. The query scans 5 TB and returns a small result set. Which approach minimizes cost?

Question 262mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Firehose to ingest streaming data and deliver it to an S3 bucket. The data is in JSON format with a timestamp field. The data science team wants to query the data using Athena with partitioning by year/month/day. How should the S3 data be organized?
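
For context, one layout that Athena can treat as partitions is the Hive-style `key=value` prefix. The sketch below shows how a record's timestamp could map to such a prefix; the function name and the `events` base prefix are hypothetical, and the timestamp is assumed to be an ISO 8601 string with an explicit offset.

```python
from datetime import datetime

def partition_prefix(timestamp: str, base: str = "events") -> str:
    """Build a Hive-style year/month/day S3 prefix from an ISO 8601 timestamp."""
    dt = datetime.fromisoformat(timestamp)
    # Zero-padded month and day keep the prefixes in chronological sort order.
    return f"{base}/year={dt:%Y}/month={dt:%m}/day={dt:%d}/"

print(partition_prefix("2023-01-15T08:30:00+00:00"))
```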

Question 263hardmultiple choice

Read the full Data Engineering explanation →

An organization is migrating its on-premises Hadoop cluster to AWS. The cluster runs Spark jobs that process 50 TB of data daily. The data is stored in HDFS with 3x replication. Which storage option on AWS provides the best price-performance for this workload?

Question 264easymultiple choice

Read the full Data Engineering explanation →

A company needs to ingest real-time clickstream data from thousands of web servers into AWS for near-real-time analytics. The data volume varies and can spike during promotions. Which service should be used to capture and buffer the data before processing?

Question 265mediummultiple choice

Read the full Data Engineering explanation →

A data engineer uses AWS Glue to run ETL jobs that transform data from JSON to Parquet. The job runs successfully but takes 30 minutes longer than expected. CloudWatch metrics show high memory utilization and disk spills. What is the most likely cause?

Question 266hardmultiple choice

Read the full Data Engineering explanation →

A company stores sensitive customer data in an S3 bucket. The security team requires that all data be encrypted at rest with a key that is automatically rotated every year. Which solution meets these requirements with the least operational overhead?

Question 267mediummulti select

Read the full Data Engineering explanation →

Which TWO options are valid ways to reduce the amount of data scanned by Amazon Athena queries, thereby reducing cost?

Question 268hardmulti select

Read the full Data Engineering explanation →

Which THREE AWS services can be used together to build a serverless data pipeline that ingests streaming data, transforms it, and loads it into Amazon Redshift for analysis?

Question 269easymulti select

Read the full Data Engineering explanation →

Which TWO options are best practices for managing access to data stored in Amazon S3 for a data lake?

Question 270hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is investigating why an Athena query against the my-data-lake bucket is slow. The query filters on year, month, and day. The exhibit shows the metadata of one Parquet file. What is the MOST likely cause of the slow query?

Question 271mediummultiple choice

Read the full Data Engineering explanation →

The Glue job my-glue-job fails after a few successful runs. The error log shows 'Job run exceeds max concurrent runs limit'. The CloudFormation template is shown in the exhibit. What change should be made to allow multiple runs to execute concurrently?

Exhibit

Refer to the exhibit.

CloudFormation template snippet:

Resources:
  MyGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation: s3://my-bucket/scripts/etl.py
        PythonVersion: "3"
      DefaultArguments:
        --TempDir: s3://my-bucket/temp/
        --job-bookmark-option: job-bookmark-enable
      ExecutionProperty:
        MaxConcurrentRuns: 1
      MaxRetries: 0
      Name: my-glue-job
      Role: arn:aws:iam::123456789012:role/GlueServiceRole

Question 272hardmultiple choice

Read the full Data Engineering explanation →

A Glue job fails with an AccessDenied error when trying to write to the S3 bucket my-data-lake. The IAM policy attached to the job role is shown in the exhibit. What is the MOST likely reason for the failure?

Exhibit

Refer to the exhibit.

IAM policy attached to an AWS Glue job role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::my-data-lake/*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my-data-lake"
        }
    ]
}

Question 273easymultiple choice

Read the full Data Engineering explanation →

A data scientist needs to process a large volume of streaming data from IoT devices and store the results in Amazon S3 for further analysis. Which AWS service is most suitable for ingesting and processing this data in near real-time?

Question 274easymultiple choice

Read the full Data Engineering explanation →

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are failing intermittently with timeouts. What is the most likely cause?

Question 275easymultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Streams to collect clickstream data. The data is consumed by a Lambda function that writes to DynamoDB. Occasionally, the Lambda function fails due to throttling from DynamoDB. How can the company resolve this issue without losing data?

Question 276mediummultiple choice

Read the full Data Engineering explanation →

A company needs to perform complex transformations on large datasets stored in Amazon S3 using Apache Spark. They want to minimize operational overhead. Which AWS service should they use?

Question 277mediummultiple choice

Read the full Data Engineering explanation →

A company is migrating its on-premises Hadoop cluster to AWS. They have a large amount of historical data stored in HDFS. Which approach is the most efficient for transferring this data to Amazon S3?

Question 278mediummultiple choice

Read the full Data Engineering explanation →

A data engineer needs to automate the transformation of CSV files to Parquet format as soon as they are uploaded to an S3 bucket. The transformed files should be stored in another S3 bucket. Which solution is the most cost-effective and requires the least maintenance?

Question 279mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. They notice that the data is delivered in 5-minute intervals even though they set the buffer interval to 60 seconds. What could be the cause?

Question 280hardmultiple choice

Read the full Data Engineering explanation →

A company needs to process sensitive data from multiple sources. They want to use AWS Glue to catalog and transform the data. Which feature should they use to ensure that sensitive columns are masked before the data is available for querying?

Question 281hardmultiple choice

Read the full Data Engineering explanation →

A company runs a critical ETL job using AWS Glue that writes to an Amazon Redshift cluster. The job occasionally fails due to insufficient disk space on the Redshift cluster. How can the company automate the process to prevent this failure?

Question 282easymulti select

Read the full Data Engineering explanation →

Which TWO AWS services are suitable for real-time stream processing?

Question 283mediummulti select

Read the full Data Engineering explanation →

Which TWO data formats are columnar and optimized for analytics queries in Amazon S3?

Question 284hardmulti select

Read the full Data Engineering explanation →

Which THREE considerations are important when designing a data lake on Amazon S3?

Question 285hardmultiple choice

Read the full Data Engineering explanation →

An IAM policy attached to an AWS Glue job allows reading and writing to an S3 bucket and accessing Glue Data Catalog. The job fails with an access denied error when trying to create a table in the Data Catalog. What is the likely issue?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-lake/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetTable",
        "glue:GetDatabase"
      ],
      "Resource": "*"
    }
  ]
}

Question 286hardmultiple choice

Read the full Data Engineering explanation →

A data engineer runs the AWS CLI command shown and notices a zero-byte file in the results. What is the most likely cause of this zero-byte file?

Question 287mediummultiple choice

Read the full Data Engineering explanation →

A company uses AWS Glue jobs with job bookmarks enabled to process incremental data. They notice that the job processes all data each time instead of only new data. What is the most likely reason?

Exhibit

Refer to the exhibit.

Resources:
  MyGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation: s3://my-bucket/scripts/etl.py
      DefaultArguments:
        --TempDir: s3://my-bucket/temp/
        --job-bookmark-option: job-bookmark-enable
      MaxRetries: 0
      MaxConcurrentRuns: 3
      Role: arn:aws:iam::123456789012:role/GlueServiceRole

Question 288easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to store streaming data from thousands of IoT devices for real-time analytics. Which AWS service is most suitable for ingesting and storing this data for subsequent processing by Amazon Kinesis Data Analytics?

Question 289mediummultiple choice

Read the full Data Engineering explanation →

A company is using AWS Glue to run ETL jobs that process data in an S3 data lake. The jobs are failing with out-of-memory errors when processing large files. Which configuration change should be made to resolve this issue?

Question 290hardmultiple choice

Read the full Data Engineering explanation →

A data scientist is training a deep learning model on a GPU instance. The training data is stored in S3 and is 50 GB. To reduce I/O bottlenecks, which storage option should be used to cache the data locally on the instance?

Question 291mediummultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be transformed into Parquet format before delivery. Which approach should the data engineer use?

Question 292hardmultiple choice

Read the full Data Engineering explanation →

A company is running a data pipeline that uses Amazon EMR with Spark to process 100 TB of data daily. The pipeline must complete within 6 hours. Currently, it takes 8 hours. Which optimization will most likely reduce the runtime?

Question 293easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should be used to trigger the job?

Question 294mediummultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Athena to query a data lake in S3. Queries are slow and expensive. The data is stored as JSON. Which action will improve query performance and reduce cost?

Question 295hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is building a data pipeline that uses Amazon S3 to store raw data, AWS Lambda for transformation, and Amazon DynamoDB for serving. The Lambda function experiences high latency when writing to DynamoDB. Which action will most effectively reduce the latency?

Question 296easymultiple choice

Read the full Data Engineering explanation →

A company needs to move 10 TB of data from an on-premises NAS to Amazon S3 over a 100 Mbps internet connection. The transfer must complete within 3 days. Which solution is the most appropriate?
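
The feasibility of the network path in this scenario comes down to simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilized link; the function name and the `efficiency` parameter are hypothetical.

```python
def transfer_days(data_tb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_tb terabytes over a link of link_mbps megabits/s."""
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86_400                        # seconds per day

days = transfer_days(10, 100)
print(f"{days:.1f} days at full line rate")
```

Even at full line rate the estimate is several times longer than the 3-day window, and real-world throughput is usually lower than the nominal link speed.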

Question 297mediummulti select

Read the full Data Engineering explanation →

A data engineering team is designing a data lake on AWS. They need to store raw data in S3 and allow multiple analytics services to query the data. Which TWO services can be used to catalog and provide schema information for the data?

Question 298hardmulti select

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be processed and stored in S3 in near real-time. Which THREE services can be used together to achieve this?

Question 299mediummulti select

Read the full Data Engineering explanation →

A data engineer is designing a data pipeline that uses Amazon S3 events to trigger an AWS Lambda function for processing. The pipeline must handle high throughput with low latency. Which TWO configurations should be applied?

Question 300hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is configuring an IAM policy to allow users to upload objects to an S3 bucket only if the objects are encrypted using SSE-S3. However, users are getting AccessDenied errors when uploading objects without specifying encryption. What is the most likely cause?

Exhibit

Refer to the exhibit.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
```
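
How a `StringEquals` condition behaves when the request omits the key can be illustrated with a simplified model. This is a sketch only, not the IAM evaluation engine: the function name is hypothetical and the request context is modeled as a plain dictionary.

```python
def string_equals(condition: dict, request_context: dict) -> bool:
    """Simplified StringEquals: every key must be present and match exactly."""
    for key, expected in condition.items():
        # A key that is absent from the request does not match.
        if request_context.get(key) != expected:
            return False
    return True

CONDITION = {"s3:x-amz-server-side-encryption": "AES256"}

# Upload that sends the encryption header: the Allow statement applies.
print(string_equals(CONDITION, {"s3:x-amz-server-side-encryption": "AES256"}))
# Upload without the header: the Allow does not apply, so the request is implicitly denied.
print(string_equals(CONDITION, {}))
```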

Question 301mediummultiple choice

Read the full Data Engineering explanation →

An AWS Glue ETL job failed with the error 'Insufficient memory allocated for the job'. The job run details show AllocatedCapacity: 5, WorkerType: Standard, NumberOfWorkers: 5. Which change should be made to resolve the issue?

Question 302hardmultiple choice

Read the full Data Engineering explanation →

An S3 event notification triggers an AWS Lambda function when a new object is created. The Lambda function parses the event and processes the object. The function is failing with a timeout error for large objects. Which approach should be used to handle large objects efficiently?

Exhibit

Refer to the exhibit.

```json
{
    "Records": [
        {
            "eventVersion": "2.1",
            "eventSource": "aws:s3",
            "awsRegion": "us-east-1",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "s3SchemaVersion": "1.0",
                "bucket": {
                    "name": "my-data-lake",
                    "arn": "arn:aws:s3:::my-data-lake"
                },
                "object": {
                    "key": "data/2023/01/15/sample.json",
                    "size": 1024,
                    "eTag": "abc123"
                }
            }
        }
    ]
}
```
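
As a point of reference, a handler typically starts by pulling the bucket, key, and size out of each record in an event like the one above. The sketch below is a minimal, hypothetical helper (the name `objects_in_event` is not from the source); it decodes the key because object keys in S3 event notifications are URL-encoded.

```python
from urllib.parse import unquote_plus

def objects_in_event(event: dict) -> list:
    """Return bucket, decoded key, and size for each record in an S3 event."""
    found = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        found.append({
            "bucket": s3["bucket"]["name"],
            "key": unquote_plus(s3["object"]["key"]),  # keys arrive URL-encoded
            "size": s3["object"].get("size"),          # bytes; may be absent
        })
    return found

sample = {"Records": [{"s3": {
    "bucket": {"name": "my-data-lake"},
    "object": {"key": "data/2023/01/15/sample.json", "size": 1024},
}}]}
print(objects_in_event(sample))
```

Reading the size from the event lets a function decide how to handle an object before downloading it.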

Question 303mediummultiple choice

Read the full Data Engineering explanation →

A data engineer needs to transform large CSV files stored in Amazon S3 into Parquet format before loading into Amazon Redshift. The transformation logic is complex and requires custom Python code. Which AWS service should be used to perform this transformation with minimal operational overhead?

Question 304easymultiple choice

Read the full Data Engineering explanation →

A company is streaming clickstream data from a website to Amazon Kinesis Data Streams. The data is consumed by a Lambda function that enriches each record with geolocation information before writing to an S3 bucket. Recently, the Lambda function has been failing with throttling errors. What is the MOST likely cause?

Question 305hardmultiple choice

Read the full Data Engineering explanation →

A data scientist wants to run a one-time SQL query on a large dataset stored in Amazon S3 (CSV format, 2 TB) using Amazon Athena. The query involves joining this dataset with a smaller table stored in Amazon RDS. What is the MOST cost-effective and performant approach?

Question 306easymultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Firehose to load streaming data into an S3 bucket. The data schema evolves over time, with new columns added. The data must be queryable using Amazon Athena. What is the BEST way to handle schema changes?

Question 307mediummultiple choice

Read the full Data Engineering explanation →

A company runs a daily batch ETL job using AWS Glue that reads from Amazon RDS (MySQL), transforms the data, and writes to Amazon Redshift. The job takes 6 hours and processes 500 GB of data. Management wants to reduce the runtime. Which action would be MOST effective?

Question 308hardmultiple choice

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time sensor data. The application reads from a Kinesis data stream, performs windowed aggregations, and writes results to an S3 bucket. Recently, the application has been experiencing high latency and checkpoint failures. What is the MOST likely cause?

Question 309mediummultiple choice

Read the full Data Engineering explanation →

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3 with minimal latency (under 5 minutes) for real-time analytics. The data volume is approximately 10 MB per second. Which solution is MOST cost-effective and meets the latency requirement?

Question 310easymultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Redshift for data warehousing. The data engineering team needs to load data from multiple S3 buckets into Redshift daily. Each bucket contains files in different formats (CSV, JSON, Parquet). Which AWS service is BEST suited to automate this ingestion process?

Question 311hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is designing a data lake on Amazon S3. The data comes from various sources and must be stored in a way that supports both batch and real-time analytics. The engineer needs to partition the data to optimize query performance in Amazon Athena. Which partitioning strategy is MOST appropriate?

Question 312mediummulti select

Read the full Data Engineering explanation →

A company is using Amazon Kinesis Data Streams with 10 shards to ingest clickstream data. Each record is approximately 50 KB. The data is consumed by a Lambda function that writes to DynamoDB. The Lambda function is experiencing throttling errors. Which TWO actions should the data engineer take to resolve the issue? (Choose TWO.)

Question 313hardmulti select

Read the full Data Engineering explanation →

A data engineer is designing an ETL pipeline using AWS Glue to process data from Amazon S3 and load it into Amazon Redshift. The pipeline must handle incremental data loads and ensure data consistency. Which THREE features should the engineer use to achieve this? (Choose THREE.)

Question 314easymulti select

Read the full Data Engineering explanation →

A company stores IoT sensor data in Amazon S3 and uses Amazon Athena for ad-hoc queries. The data is partitioned by date, but queries are still slow and expensive. Which TWO actions can improve query performance and reduce cost? (Choose TWO.)

Question 315hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is troubleshooting an AWS Glue job that reads from and writes to the S3 bucket 'data-lake-bucket'. The job fails when trying to write to the 'sensitive/' prefix. The IAM policy attached to the Glue job's IAM role is shown in the exhibit. What is the MOST likely reason for the failure?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::data-lake-bucket/*"
    },
    {
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::data-lake-bucket/sensitive/*"
    }
  ]
}
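
The interaction between the two statements in this policy can be modeled with a small simulation. It is a simplified sketch, not the real IAM engine: it ignores conditions, principals, and other policy types, uses shell-style wildcards as a stand-in for IAM pattern matching, and the function name `evaluate` is hypothetical.

```python
from fnmatch import fnmatchcase

POLICY = [
    {"Effect": "Allow",
     "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
     "Resource": "arn:aws:s3:::data-lake-bucket/*"},
    {"Effect": "Deny",
     "Action": ["s3:PutObject"],
     "Resource": "arn:aws:s3:::data-lake-bucket/sensitive/*"},
]

def evaluate(action: str, resource: str, statements=POLICY) -> str:
    """Return 'Deny', 'Allow', or 'ImplicitDeny' for one request."""
    effects = [s["Effect"] for s in statements
               if any(fnmatchcase(action, a) for a in s["Action"])
               and fnmatchcase(resource, s["Resource"])]
    if "Deny" in effects:          # an explicit Deny always wins over any Allow
        return "Deny"
    return "Allow" if "Allow" in effects else "ImplicitDeny"

print(evaluate("s3:PutObject", "arn:aws:s3:::data-lake-bucket/sensitive/a.parquet"))
print(evaluate("s3:PutObject", "arn:aws:s3:::data-lake-bucket/raw/a.parquet"))
```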

Question 316mediummultiple choice

Read the full Data Engineering explanation →

A data engineer runs the AWS CLI command shown in the exhibit to find large log files in S3. The command returns an empty list, but the engineer knows there are files larger than 1 MB in that prefix. What is the MOST likely issue?

Question 317easymultiple choice

Read the full Data Engineering explanation →

A Lambda function is triggered by S3 events. The event payload shown in the exhibit is received by the Lambda function. The function is supposed to process the CSV file and load it into DynamoDB. However, the function fails because it cannot read the file. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": {
          "name": "my-bucket"
        },
        "object": {
          "key": "data/sample.csv"
        }
      }
    }
  ]
}
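
As a point of reference, a handler would typically pull the bucket and key out of a payload shaped like this exhibit before calling S3. The sketch below is a minimal illustration, not the function from the question: `s3_objects` is a hypothetical helper name. It decodes the key because S3 event notifications deliver object keys URL-encoded. It does not by itself identify why the read fails in the scenario.

```python
from urllib.parse import unquote_plus

# Payload copied from the exhibit.
EVENT = {
    "Records": [
        {
            "eventVersion": "2.1",
            "eventSource": "aws:s3",
            "awsRegion": "us-east-1",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "data/sample.csv"},
            },
        }
    ]
}


def s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 notifications (a space is sent as "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


print(s3_objects(EVENT))
```

The payload only names the object. Reading its contents is a separate `GetObject` call made with the function's execution role.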

Question 318easymultiple choice

Read the full Data Engineering explanation →

A company is building a data pipeline to process streaming data from IoT devices. The data must be ingested with low latency, transformed in real-time using custom logic, and stored in Amazon S3 partitioned by device ID and timestamp. Which combination of AWS services should the company use to meet these requirements?

Question 319mediummultiple choice

Read the full Data Engineering explanation →

A data scientist is using Amazon SageMaker to train a model. The training data is stored in Amazon S3 and is approximately 500 GB. The data scientist notices that the training job is taking a long time to start because the data is being copied to the training instance's storage. The data scientist wants to reduce the startup time for subsequent training jobs. Which action should the data scientist take?

Question 320hardmultiple choice

Read the full Data Engineering explanation →

A company is designing a data pipeline to process log files from multiple sources. The logs are written to Amazon S3 every hour. The data is then transformed using AWS Glue ETL jobs and loaded into Amazon Redshift for analysis. The company needs to ensure that the data is available for analysis within 30 minutes of being written to S3. Currently, the Glue job is triggered hourly, but the company wants to reduce the latency. Which solution should the company implement?

Question 321easymultiple choice

Read the full Data Engineering explanation →

A company stores sensitive customer data in Amazon S3. The company must ensure that data is encrypted at rest. The company also needs to manage the encryption keys using an AWS service that allows automatic rotation of keys. Which solution meets these requirements?

Question 322mediummultiple choice

Read the full Data Engineering explanation →

A company is building a data pipeline that ingests data from on-premises databases into Amazon S3 using AWS Database Migration Service (AWS DMS). The company wants to capture continuous changes from the source database and replicate them to S3 in near-real time. Which AWS DMS configuration should the company use?

Question 323hardmultiple choice

Read the full Data Engineering explanation →

A data engineering team is designing a data lake on Amazon S3. The data is ingested from multiple sources in JSON, CSV, and Parquet formats. The team needs to make the data available for analysis using Amazon Athena and Amazon Redshift Spectrum. The team wants to minimize data transformation costs and storage overhead. Which data storage approach should the team use?

Question 324easymultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by a custom application that runs on Amazon EC2 instances. The company notices that the consumer application is falling behind the producer, causing data to be throttled. Which action should the company take to improve the consumer's throughput?

Question 325mediummultiple choice

Read the full Data Engineering explanation →

A company is using AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The ETL jobs are failing intermittently with write timeout errors when writing to S3. The company wants to implement a retry mechanism for transient errors. What should the company do?

Question 326hardmultiple choice

Read the full Data Engineering explanation →

A company is migrating its on-premises Apache Hadoop cluster to AWS. The cluster processes large datasets using Spark jobs. The company wants to minimize operational overhead and use native AWS services. Which combination of services should the company use?

Question 327mediummulti select

Read the full Data Engineering explanation →

A company is building a data pipeline that uses Amazon Kinesis Data Streams to ingest real-time events. The pipeline then uses AWS Lambda to process the events and store results in Amazon DynamoDB. The company wants to ensure that the Lambda function can process all events without data loss and without duplicating processing. Which TWO configuration steps should the company take?

Question 328hardmulti select

Read the full Data Engineering explanation →

A company is using AWS Glue to catalog data stored in Amazon S3. The data is partitioned by year, month, day, and hour. The company runs hourly ETL jobs that add new partitions. The Glue crawler is scheduled to run every hour to update the Data Catalog. However, the crawler is taking longer than expected and is not completing before the next crawler run starts. Which THREE actions could the company take to resolve this issue?

Question 329easymulti select

Read the full Data Engineering explanation →

A company needs to move 50 TB of data from an on-premises data center to Amazon S3. The company has a limited internet bandwidth of 100 Mbps. The data transfer must be completed within 10 days. Which TWO services should the company use together to meet these requirements?
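
The numbers in this scenario can be checked with a quick back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and, by default, a fully utilized link. `transfer_days` and its `utilization` parameter are hypothetical names introduced for illustration.

```python
def transfer_days(data_tb, link_mbps, utilization=1.0):
    """Days needed to move data_tb terabytes over a link_mbps link (decimal units)."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86_400  # seconds per day


# 50 TB over a 100 Mbps link, as in the question.
print(round(transfer_days(50, 100), 1))
```

Even at full line rate, the online transfer takes roughly 46 days under these assumptions, which is well beyond the 10-day window.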

Question 330mediummultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A data engineer is creating an IAM policy for an AWS Glue ETL job that reads encrypted objects from an S3 bucket, transforms them, and writes the results back to the same bucket. The bucket uses SSE-KMS encryption with the KMS key specified. The ETL job is failing with an "Access Denied" error when trying to write data. What is the likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-data-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/abc123"
    }
  ]
}

Question 331hardmultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A data engineer runs the AWS CLI command to check an object in an S3 bucket. The bucket is part of a data lake and is configured with versioning enabled. However, the output shows "VersionId": null. What is the most likely reason for this?

Exhibit: [AWS CLI command and output not shown]

Question 332hardmultiple choice

Read the full Data Engineering explanation →

A company runs a streaming data pipeline using Amazon Kinesis Data Streams with 10 shards. The pipeline ingests sensor data from thousands of devices. Each device sends a JSON payload every 5 seconds. The payload size is approximately 2 KB. The data is consumed by a fleet of EC2 instances running a custom Java application that uses the Kinesis Client Library (KCL). Over the past week, the company has observed that the consumer application is experiencing increased latency, and the Kinesis stream's 'GetRecords.IteratorAgeMilliseconds' CloudWatch metric is consistently above 10 seconds. The company has verified that the EC2 instances have sufficient CPU and memory resources. The KCL application is configured with 10 workers, one per shard. The application processes each record by performing a simple transformation and writing to Amazon DynamoDB. The DynamoDB table has sufficient write capacity and is not throttling. The company wants to reduce the iterator age to under 2 seconds. Which action should the company take?

Question 333mediummultiple choice

Read the full Data Engineering explanation →

A company is streaming e-commerce events to Amazon Kinesis Data Streams. The data science team needs to join events from multiple shards in near real-time and then store the joined results in Amazon S3. Which solution would meet these requirements with the LEAST operational overhead?

Question 334easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to load data from a MySQL database to Amazon S3 daily. The database is 500 GB and the load window is 2 hours. The data must be extracted without impacting the source database performance. Which AWS service should be used to perform the extraction?

Question 335hardmultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The stream has 8 shards. A Lambda function processes each record and writes to Amazon DynamoDB. The Lambda function sometimes fails due to DynamoDB write throttling, causing duplicate processing of records after retries. The data engineering team needs to ensure exactly-once processing semantics for the DynamoDB writes. What should the team do?

Question 336easymultiple choice

Read the full Data Engineering explanation →

A data pipeline uses AWS Glue to crawl an S3 bucket and create a table in the AWS Glue Data Catalog. The data is in Parquet format with partitions by date. After a new partition is added to S3, the crawler runs but the new partition is not reflected in the table. What is the most likely cause?

Question 337mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team uses Amazon EMR with Spark to transform large datasets in S3. The team notices that the Spark jobs on the EMR cluster are failing with out-of-memory errors. The cluster uses instance types with moderate memory. Which configuration change would MOST effectively reduce memory pressure without increasing cost?

Question 338easymultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data is organized by year/month/day/hour. The team needs to ensure that all data is encrypted at rest in S3 using an AWS KMS customer managed key (CMK). Which configuration should the team implement?

Question 339mediummultiple choice

Read the full Data Engineering explanation →

A company runs a nightly batch job that reads data from Amazon RDS for PostgreSQL, transforms it using AWS Glue, and writes the output to Amazon S3 in Parquet format. The job used to take 2 hours to complete, but the data volume has grown and the job now takes 4 hours, exceeding the allowed window. The team needs to reduce the job duration without increasing cost. Which action is MOST effective?

Question 340hardmulti select

Read the full Data Engineering explanation →

A company uses Amazon S3 to store historical transaction data in CSV format. The data is partitioned by transaction_date. A data analyst runs Amazon Athena queries that frequently filter on customer_id and transaction_date. The queries are slow and expensive. The team needs to improve query performance and reduce cost. Which combination of actions should the team take? (Choose TWO.)

Question 341mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Streams to collect IoT sensor data. The stream has 4 shards. A consumer application reads from the stream using the Kinesis Client Library (KCL). The application processes records and stores them in Amazon DynamoDB. Recently, the data volume has increased, and the consumer is falling behind. Which action should the team take to increase the processing throughput?

Question 342hardmulti select

Read the full Data Engineering explanation →

A company uses Amazon Redshift to run analytics on sales data. The data is loaded daily from S3 using COPY commands. The team notices that the COPY command performance degrades over time due to table bloat. The team needs to maintain query performance and reduce storage costs. Which combination of maintenance operations should the team perform regularly? (Choose THREE.)

Question 343easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to transfer 50 TB of data from an on-premises HDFS cluster to Amazon S3. The data must be encrypted in transit and at rest. The on-premises network has a 1 Gbps connection to AWS. The transfer must complete within 5 days. Which solution is MOST cost-effective and meets the requirements?

Question 344mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Athena to analyze data stored in S3. The data is in CSV format and is partitioned by year/month/day. Queries that filter on a specific day are slow. The team wants to improve query performance without changing the data format. Which action should the team take?

Question 345hardmulti select

Read the full Data Engineering explanation →

A company uses AWS Glue to run ETL jobs on a daily basis. The jobs read from Amazon RDS and write to Amazon S3. The data volume has grown, and the jobs are taking longer to complete. The team wants to optimize the jobs for cost and performance. Which combination of techniques should the team implement? (Choose THREE.)

Question 346mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon EMR to run Spark jobs on a cluster with 10 core nodes of type r5.xlarge. The jobs are I/O intensive and read large amounts of data from S3. The team notices high network throughput but low CPU utilization. Which configuration change would improve job performance at the same cost?

Question 347easymultiple choice

Read the full Data Engineering explanation →

A company wants to build a data lake on Amazon S3. The data lake will store raw data in its original format and also transformed data in Parquet. The data is generated by various sources and must be cataloged for discovery. Which service should the company use to automatically discover, catalog, and make the data searchable?

Question 348hardmultiple choice

Read the full Data Engineering explanation →

Refer to the exhibit. A data engineer has attached this IAM policy to an IAM role used by an AWS Glue ETL job. The job reads from an S3 bucket (data-bucket) that is encrypted with SSE-KMS using the key arn:aws:kms:us-east-1:123456789012:key/abc123, transforms the data, and writes the result to a different S3 bucket (output-bucket) encrypted with a different KMS key (arn:aws:kms:us-east-1:123456789012:key/xyz789). When the job runs, it fails with an access denied error. What is the cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::data-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/abc123"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetTable",
        "glue:UpdateTable"
      ],
      "Resource": "arn:aws:glue:us-east-1:123456789012:catalog"
    }
  ]
}

Question 349mediummultiple choice

Read the full Data Engineering explanation →

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed before delivery using AWS Lambda. The Lambda function adds a timestamp field. The Firehose stream receives up to 10,000 records per second. The transformation currently takes 500 ms per record. What should the team do to ensure the transformation can keep up with the incoming data without data loss?

Question 350hardmultiple choice

Read the full Data Engineering explanation →

A company runs a critical data pipeline using Apache Spark on Amazon EMR. The pipeline reads data from Amazon S3, performs complex transformations, and writes results back to S3. The job runs every hour and must complete within 30 minutes. Recently, the job has been taking longer and occasionally failing due to executor losses. The team suspects memory pressure. Which action should the team take to improve stability and performance without increasing cost?

Question 351hardmultiple choice

Read the full Data Engineering explanation →

A company runs a real-time analytics platform that ingests IoT sensor data from millions of devices. The data is sent to Amazon Kinesis Data Streams with 16 shards. A custom Java application using the Kinesis Client Library (KCL) processes the data and writes aggregated results to Amazon DynamoDB. The application runs on a fleet of EC2 instances in an Auto Scaling group. Recently, the team noticed that some records are being processed multiple times, resulting in duplicate entries in DynamoDB. The application uses the DynamoDB PutItem API to write records. The team needs to eliminate duplicates without significantly increasing latency. Which solution should the team implement?

Question 352mediummultiple choice

Read the full Data Engineering explanation →

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS for PostgreSQL and load it into Amazon Redshift. The Glue job runs nightly and takes 6 hours to complete. The Redshift cluster is a single dc2.large node. The team needs to reduce the load time to under 3 hours. The data volume is 200 GB per night. The team is considering using Amazon Redshift Spectrum to query data directly from S3 instead of loading it. However, the data transformation logic is complex and requires multiple joins and aggregations that are currently performed in Glue. Which approach should the team recommend to meet the time requirement?

Question 353easymultiple choice

Read the full Data Engineering explanation →

A company uses Amazon S3 to store log files from various applications. The logs are in JSON format and are appended to existing files every few minutes. A data analyst wants to run SQL queries on the logs using Amazon Athena. However, queries return incomplete results because Athena does not support modifying data. The team needs to enable querying of the latest log data with minimal changes to the existing ingestion process. Which solution should the team implement?

Question 354mediummultiple choice

Read the full Data Engineering explanation →

A data engineering team is building a real-time clickstream analytics pipeline on AWS. They need to ingest millions of events per second from mobile apps and websites, process them with low latency, and store the results in Amazon S3 for downstream analysis. Which combination of AWS services should the team use to minimize operational overhead while meeting these requirements?

Question 355easymultiple choice

Read the full Data Engineering explanation →

A data engineer is designing a data lake on Amazon S3. The data comes from various sources, including IoT devices, web logs, and transactional databases. The engineer needs to organize the data in a way that supports efficient querying using Amazon Athena and allows for easy management of access permissions. Which S3 bucket structure is the most appropriate?

Question 356hardmultiple choice

Read the full Data Engineering explanation →

A company uses AWS Glue to run ETL jobs that transform data from Amazon RDS for MySQL to Amazon S3. The current job runs daily and takes 3 hours to process 100 GB of data. The company expects data volume to grow 10x in the next year. They need to reduce job runtime and cost. Which approach should they take?

Question 357easymultiple choice

Read the full Data Engineering explanation →

A data engineer is tasked with building a pipeline to process streaming data from IoT devices. The devices send data in JSON format every second. The pipeline must aggregate data in 5-minute windows and store the results in Amazon S3. The engineer needs to handle late-arriving data (up to 1 hour) and ensure exactly-once semantics. Which combination of AWS services should they use?

Question 358mediummultiple choice

Read the full Data Engineering explanation →

A research institution is building a data lake to store genomics data. Each experiment generates multiple files totaling about 500 GB. The data is stored in Amazon S3 and needs to be processed by multiple machine learning (ML) training jobs running on Amazon SageMaker. The data has a high churn rate; after 30 days, most data becomes irrelevant and should be moved to Amazon S3 Glacier Deep Archive. The institution wants to minimize storage costs while maintaining data durability. Which S3 storage class should they use for the first 30 days?

Question 359hardmultiple choice

Read the full Data Engineering explanation →

A company uses AWS Glue crawlers to populate the AWS Glue Data Catalog from Amazon S3. The data is partitioned by year/month/day/hour. The crawler runs every hour and adds new partitions. However, the data engineer notices that the crawler is taking longer to run as the number of partitions grows, and sometimes it misses new partitions. What is the most cost-effective and reliable way to address this?

Question 360easymultiple choice

Read the full Data Engineering explanation →

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 100 Mbps internet connection and the data must be transferred within 5 days. Which AWS service is best suited for this task?

Question 361mediummulti select

Read the full Data Engineering explanation →

A company is migrating on-premises data to AWS. The data includes both structured and unstructured files, totaling 200 TB. The company has a 1 Gbps dedicated network connection to AWS. They want to minimize migration time and cost. Which TWO AWS services or features should they use together? (Choose TWO.)

Question 362hardmulti select

Read the full Data Engineering explanation →

A data engineering team is designing a streaming data pipeline that ingests 10,000 events per second. Each event is 2 KB. The pipeline must process events with a latency of less than 1 second. The team is considering using Amazon Kinesis Data Streams with 10 shards. Which TWO additional configurations should the team implement to meet the latency requirement? (Choose TWO.)
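
The ingest figures here can be sanity-checked against the per-shard write limits of a provisioned Kinesis data stream (1 MB per second and 1,000 records per second per shard). The sketch below is an estimate only: it treats 1 KB as 1,024 bytes and 1 MB as 1,048,576 bytes, assumes evenly distributed partition keys, and uses a hypothetical helper name, `shards_needed`.

```python
import math

# Per-shard write limits for a provisioned stream.
SHARD_WRITE_BYTES_PER_SEC = 1024 * 1024  # 1 MB/s, taken here as 1 MiB
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(records_per_sec, record_bytes):
    """Minimum shard count that satisfies both the byte and the record write limit."""
    by_bytes = math.ceil(records_per_sec * record_bytes / SHARD_WRITE_BYTES_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(by_bytes, by_records)


# 10,000 events per second at 2 KB each, as in the question.
print(shards_needed(10_000, 2 * 1024))
```

Under these assumptions the byte limit, not the record limit, is the binding constraint: 10 shards cover the record rate, but the roughly 20 MB/s payload volume calls for about 20.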

Question 363mediummulti select

Read the full Data Engineering explanation →

A company uses AWS Glue to run ETL jobs. The data engineer wants to monitor job performance and troubleshoot failures. Which THREE AWS services or features should they use together? (Choose THREE.)

Question 364hardmulti select

Read the full Data Engineering explanation →

A data engineer is designing a data pipeline that uses Amazon Kinesis Data Streams to ingest real-time transaction data. The data must be processed in near real-time and stored in Amazon S3 for long-term analytics. The engineer wants to ensure data durability and exactly-once processing semantics. Which TWO actions should the engineer take? (Choose TWO.)

Question 365hardmultiple choice

Read the full Data Engineering explanation →

You are a data engineer at a fintech company. The company processes real-time stock market data from multiple exchanges. The data is ingested via Amazon Kinesis Data Streams with 50 shards. Each record is about 1 KB, and the ingestion rate is 5,000 records per second. The data is consumed by a Java application running on Amazon ECS that performs real-time analytics and stores results in Amazon DynamoDB. Recently, the application has been experiencing high latency, and some records are stuck in the shards for minutes before being consumed. The CloudWatch metrics show that the application's CPU utilization is low, but the iterator age is increasing. The application uses the Kinesis Client Library (KCL) with a single worker. What is the most likely cause and how should it be fixed?

Question 366hardmultiple choice

Read the full Data Engineering explanation →

A company runs an e-commerce platform that generates clickstream data in real-time. The data is ingested into Amazon Kinesis Data Streams (100 shards) and processed by AWS Lambda functions, which aggregate data in 1-minute windows and write the results to Amazon S3. The Lambda functions are triggered by the Kinesis stream using the event source mapping. Recently, the company noticed that some records are being processed multiple times, leading to duplicate data in S3. The Lambda function is idempotent, but the duplicates are causing downstream issues. The Lambda function's concurrency limit is 1000, and the batch size is 100. The average processing time per record is 200 ms. What is the most likely cause of the duplicates, and how should it be fixed?

Question 367mediummultiple choice

Read the full Data Engineering explanation →

A data engineer is responsible for managing a data lake on Amazon S3. The data lake contains CSV files from various sources, totaling 10 TB. The engineer needs to make this data queryable using Amazon Athena. However, Athena queries are currently taking a long time and scanning large amounts of data. The engineer has noticed that the CSV files are not partitioned, and there are no indexes. The engineer wants to improve query performance and reduce costs. The data is accessed frequently for the last 30 days, but older data is rarely queried. The engineer also wants to minimize the amount of data scanned by Athena. What should the engineer do?

Question 368mediummultiple choice

Read the full Data Engineering explanation →

A company uses AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The Glue job writes data to Redshift using the JDBC connection. Recently, the job has been failing with connection timeout errors when writing to Redshift. The Redshift cluster is a 2-node dc2.large cluster. The Glue job processes about 50 GB of data per run. The errors occur sporadically, and the job succeeds after a few retries. The data engineer needs to resolve the issue to prevent job failures. What should the engineer do?

Question 369easymulti select

Read the full Data Engineering explanation →

A data engineer is building a data pipeline that ingests streaming data from Amazon Kinesis Data Streams, transforms the data using AWS Lambda, and stores the results in Amazon S3. The engineer needs to ensure that each record is processed exactly once and in order. Which TWO approaches should the engineer consider? (Choose TWO.)

Question 370mediummulti select

Read the full Data Engineering explanation →

A company runs a data lake on Amazon S3 with AWS Glue for ETL. The data science team needs to train machine learning models on historical data, but they are concerned about data quality issues such as missing values, duplicates, and outliers. The team wants to build a data quality monitoring solution that automatically detects anomalies and alerts the data engineering team. Which THREE steps should the team take to implement this solution? (Choose THREE.)

Question 371hardmultiple choice

Read the full Data Engineering explanation →

A data engineering team is building a real-time data pipeline using Amazon Kinesis Data Streams with AWS Lambda for processing. The pipeline ingests clickstream data from a mobile app. The team notices that occasionally, a Lambda function fails due to a transient error, and the failed record is not retried, leading to data loss. The Lambda function is configured with a batch size of 100 and a maximum retry count of 0. The team wants to ensure that all records are processed successfully, even if transient failures occur. They also want to minimize the impact of poison pill records that could block processing. Which combination of actions should the team take to address this issue?

Question 372easymultiple choice

Read the full Data Engineering explanation →

A machine learning team is using Amazon SageMaker to train models on a large dataset stored in Amazon S3. The dataset is 5 TB in size and is partitioned by date. The team wants to minimize data transfer costs and reduce training time by caching frequently accessed data locally on the training instances. The training instances are EC2 instances with attached Amazon EBS volumes. The team is considering using SageMaker Pipe mode to stream data directly from S3, but they are concerned about network bandwidth. Which approach should the team use to optimize data loading for training?

Question 373mediummultiple choice

Read the full Data Engineering explanation →

A company is building a data pipeline to process streaming data from IoT devices. The data is ingested via Amazon Kinesis Data Streams. Each record is about 1 KB. The company wants to use AWS Lambda for real-time transformations and then store the results in Amazon DynamoDB. The expected throughput is 10,000 records per second. The Lambda function currently runs in about 200 ms. The company is concerned about Lambda concurrency limits and wants to ensure there are no throttling errors. The default concurrency limit for Lambda is 1,000. Which approach should the team take to handle the expected throughput without throttling?
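
The concurrency concern in this scenario follows from a simple rate-times-duration estimate. The sketch below is a rough model under stated assumptions: the 200 ms figure is treated as the duration of one invocation regardless of batch size, and `required_concurrency` is a hypothetical helper. With a Kinesis event source mapping, actual concurrency is additionally bounded by the number of shards multiplied by the parallelization factor, which this model does not capture.

```python
def required_concurrency(records_per_sec, duration_ms, batch_size=1):
    """Estimated concurrent invocations: (invocations per second) x (seconds per invocation)."""
    numerator = records_per_sec * duration_ms
    denominator = batch_size * 1000
    return -(-numerator // denominator)  # integer ceiling division


# One record per invocation versus batches of 100 records.
print(required_concurrency(10_000, 200))
print(required_concurrency(10_000, 200, batch_size=100))
```

Processing records one at a time would need about 2,000 concurrent executions, double the 1,000 limit quoted in the question, while batching 100 records per invocation brings the estimate down to about 20.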

Question 374hardmultiple choice

Read the full Data Engineering explanation →

A data engineer is setting up a data lake on Amazon S3 for a large retail company. The data includes customer transactions, inventory, and web logs. The company wants to use AWS Glue for ETL and Amazon Athena for ad-hoc queries. The data is partitioned by year, month, day, and hour. The engineer notices that Athena queries are slow and often scan large amounts of data even when only a specific hour is needed. The engineer has already enabled partitioning and used columnar formats like Parquet. What additional step should the engineer take to optimize query performance and reduce data scanned?