AWS Certified Data Engineer Associate (DEA-C01): Questions 76-150

1786 questions total · 24 pages · All types, answers revealed

Page 2 of 24
76
MCQ · hard

A company is designing a data ingestion pipeline for real-time analytics. The source is a relational database, and the target is Amazon Redshift. The pipeline must handle schema changes in the source database automatically. Which combination of services should be used?

A. Amazon S3 and Amazon Athena
B. AWS DMS and AWS Glue
C. AWS Glue and Amazon Redshift COPY
D. Amazon Kinesis Data Streams and AWS Lambda
Answer: B

DMS captures changes; Glue can detect and apply schema changes.

Why this answer

AWS DMS can continuously replicate data from the source database to S3 in CDC mode. AWS Glue can then detect schema changes and transform the data before loading into Redshift. Kinesis Data Streams is for streaming data, not database CDC.

Athena is for querying, not for handling schema evolution. Lambda alone would require custom code for schema detection.

77
MCQ · hard

A data engineer is troubleshooting an Amazon Redshift cluster that is not allowing connections from a specific IP range. The engineer verified that the cluster's security group allows inbound traffic from the IP range. What is the next step to resolve the issue?

A. Modify the Redshift cluster parameter group to enable public accessibility.
B. Verify that the cluster's security group is attached to the Redshift cluster.
C. Check the IAM role associated with the Redshift cluster.
D. Check the network ACL (NACL) associated with the Redshift cluster's subnet.
Answer: D

NACLs can block traffic at the subnet level even if security groups allow it.

Why this answer

Option D is correct because the network ACL (NACL) on the Redshift cluster's subnet can block traffic at the subnet level, independent of security groups. Option B is wrong because the engineer has already verified that the cluster's security group allows the traffic. Option C is wrong because IAM roles do not control network-level access.

Option A is wrong because the cluster parameter group does not control network access.

78
Multi-Select · medium

A company uses Amazon Redshift to store customer data. The security team requires that all queries are logged for auditing purposes. Which combination of steps should be taken to meet this requirement? (Select TWO.)

Select 2 answers
A. Enable AWS CloudTrail database audit logging.
B. Use AWS CloudTrail to log Redshift API calls.
C. Enable logging on the Redshift security group.
D. Enable VPC Flow Logs for the Redshift cluster.
E. Enable Amazon Redshift audit logging to an S3 bucket.
Answers: B, E

CloudTrail logs Redshift API calls; Redshift audit logging captures the SQL queries themselves.

Why this answer

Option B is correct because AWS CloudTrail can be configured to log Redshift API calls, such as CreateCluster, DeleteCluster, and ModifyCluster, which provides an audit trail of administrative actions. Option E is correct because Amazon Redshift supports native audit logging, including connection logs, user logs, and user activity logs (which record the queries that are run), and these can be delivered to an Amazon S3 bucket for long-term retention and analysis.

Exam trap

The trap here is that candidates confuse AWS CloudTrail's ability to log API calls with database-level query logging, leading them to incorrectly select CloudTrail as the sole solution, while overlooking the need for Redshift's native audit logging to capture actual SQL queries.

79
MCQ · medium

A data engineer is responsible for a real-time data pipeline that ingests clickstream data from a website into Amazon Kinesis Data Streams. The stream is processed by an AWS Lambda function that writes to an Amazon DynamoDB table for user session tracking. The Lambda function is idempotent and uses the DynamoDB PutItem API with a condition expression to avoid overwriting existing records. Over the past week, the engineer has observed an increase in DynamoDB write throttling (ProvisionedThroughputExceededException) during peak traffic hours. The DynamoDB table has on-demand capacity. The engineer checks the Lambda function's reserved concurrency and finds it set to 1000. The Kinesis stream has 10 shards. The Lambda function's batch size is set to 100. The engineer suspects that the retry behavior is causing duplicate writes and throttling. Which change should the engineer make to reduce throttling?

A. Increase the number of Kinesis shards to 20 to distribute the load.
B. Decrease the Lambda batch size to 10 to reduce the number of records processed per invocation.
C. Decrease the Lambda reserved concurrency to 500 to limit the number of concurrent invocations.
D. Use a DynamoDB Stream to trigger a second Lambda function that writes to the table.
Answer: B

Smaller batches reduce the number of concurrent writes to DynamoDB, lowering throttling.

Why this answer

Option B is correct. On-demand DynamoDB can scale, but it has a per-partition throughput limit. Reducing the Lambda batch size reduces the number of concurrent writes per shard, decreasing the chance of hitting partition limits.

Option A is wrong because increasing shards would increase concurrency, worsening throttling. Option C is wrong because decreasing reserved concurrency could cause Lambda throttling but not DynamoDB throttling. Option D is wrong because using a DynamoDB stream adds complexity and does not directly reduce write throttling.

80
MCQ · hard

A company uses Amazon DynamoDB as the primary data store for a gaming application. The application experiences sudden spikes in traffic. The data engineer notices that write requests are throttled during peak times. The partition keys are well-distributed. What should the data engineer do to reduce throttling?

A. Use DynamoDB global tables to distribute writes across regions.
B. Configure DynamoDB auto scaling to adjust write capacity automatically.
C. Increase the number of partition keys to improve write distribution.
D. Enable DynamoDB Accelerator (DAX) to cache write operations.
Answer: B

Auto scaling increases write capacity during spikes, reducing throttling.

Why this answer

Option B is correct because DynamoDB auto scaling adjusts capacity based on load, reducing throttling during spikes. Option D is wrong because DAX is a cache for reads, not writes. Option A is wrong because global tables improve latency and disaster recovery, not write capacity.

Option C is wrong because the partition keys are already well distributed; DynamoDB adds partitions automatically as throughput increases, but without auto scaling, throttling still happens.

81
Multi-Select · medium

A company uses Amazon DynamoDB as the primary data store for a web application. The application experiences high read latency. Which TWO actions can improve read performance?

Select 2 answers
A. Add a Global Secondary Index (GSI)
B. Enable DynamoDB Global Tables
C. Enable DynamoDB Accelerator (DAX)
D. Enable DynamoDB Streams
E. Increase the write capacity units
Answers: B, C

Global Tables allow reads from local regions, reducing latency.

Why this answer

Option C is correct because DynamoDB Accelerator (DAX) provides in-memory caching, reducing read latency. Option B is correct because Global Tables provide replicas in each Region that serve local reads, reducing latency for global users. Option A is wrong because a GSI adds a secondary index but does not reduce read latency for primary key lookups.

Option E is wrong because increasing write capacity does not affect read performance. Option D is wrong because DynamoDB Streams is for change capture, not reducing latency.

82
MCQ · medium

A company uses Amazon S3 to store raw data and AWS Glue to run ETL jobs that transform the data into analytics-ready tables. The Glue job reads from a source with a schema that changes frequently (new columns added). The engineer wants the Glue job to automatically adapt to schema changes without manual intervention. Which configuration should the engineer use?

A. Schedule a Glue crawler to run after each ETL job to update the Data Catalog.
B. Set the job to use schema-on-read by storing data in Parquet format.
C. Enable the 'Update schema' option in the Glue job's output target configuration.
D. Use Glue's partition indexes to automatically detect new columns.
Answer: C

This option automatically adds new columns to the target table.

Why this answer

Glue's 'Update schema' option in the job's output target allows the job to automatically incorporate new columns. Option A is wrong because a crawler after the job would add lag. Option B is wrong because schema-on-read is not automatic.

Option D is wrong because partition indexes speed up partition lookups in the catalog; they do not handle schema evolution.

83
MCQ · medium

A company uses AWS Lake Formation to manage data lake permissions. A data analyst cannot query a table in Athena, although the table appears in the catalog. The analyst has IAM permissions to run Athena. What is the MOST likely cause?

A. The Glue Data Catalog does not have the table registered.
B. The S3 bucket policy denies access to the analyst's IAM role.
C. The analyst lacks Lake Formation permissions on the table.
D. The Athena workgroup is not configured with the correct output location.
Answer: C

Lake Formation grants fine-grained permissions; the analyst needs SELECT.

Why this answer

Option C is correct because Lake Formation permissions are separate from IAM; the analyst needs SELECT permission granted through Lake Formation. Option B is wrong because even when the S3 bucket policy allows access, Lake Formation permissions still govern access to the table. Option A is wrong because the table appears in the catalog, so it is registered.

Option D is wrong because the Athena workgroup's output location is unrelated to table permissions.

84
MCQ · medium

A company runs a data pipeline that uses AWS Glue to process data from an Amazon DynamoDB table and write results to Amazon S3. The Glue job runs on a schedule every hour. Recently, the job started failing intermittently with 'ProvisionedThroughputExceededException' errors from DynamoDB. What is the BEST solution?

A. Use DynamoDB Accelerator (DAX) to reduce read latency.
B. Change the Glue job schedule to run every 2 hours.
C. Implement exponential backoff and retries in the Glue job for DynamoDB operations.
D. Increase the read capacity units of the DynamoDB table.
Answer: C

Exponential backoff handles throttling gracefully.

Why this answer

Option C is correct because implementing exponential backoff in the Glue job will handle throttling gracefully. Option D is wrong because increasing provisioned capacity increases cost and may not be necessary. Option A is wrong because using DynamoDB Accelerator does not affect throughput limits.

Option B is wrong because changing the schedule may not solve the issue.
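The retry pattern named in the answer can be sketched as follows. This is a minimal illustration, not Glue's or boto3's built-in retry logic: `ThrottledError` is a hypothetical stand-in for a throttling exception such as ProvisionedThroughputExceededException, and `call_with_backoff` with its default delays is an assumption made for the example.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from the data store."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay cap doubles on each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the cap so that
            # parallel workers do not all retry at the same moment.
            sleep(cap * rng())
```

The `sleep` and `rng` parameters are injectable only so the helper can be exercised without real waiting; in a job they would keep their defaults.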

85
Multi-Select · medium

A company is designing a data lake on Amazon S3. The security policy requires that all data be encrypted at rest using AWS KMS with automatic key rotation. Which options together meet these requirements? (Select THREE.)

Select 3 answers
A. Enable automatic key rotation on the KMS key.
B. Use SSE-KMS with an AWS managed key.
C. Set the default encryption on the S3 bucket to SSE-KMS with the CMK.
D. Use SSE-KMS with a customer-managed key (CMK).
E. Use SSE-C with a customer-provided key.
Answers: A, C, D

Automatic key rotation is an explicit requirement of the security policy.

Why this answer

Option A is correct because AWS KMS customer-managed keys (CMKs) support automatic key rotation, which, once enabled, rotates the key material annually by default. This satisfies the security policy requirement for automatic key rotation. SSE-KMS with a CMK (Option D) is also required because AWS managed keys (Option B) are rotated by AWS on a schedule the company cannot enable, disable, or adjust, and SSE-C (Option E) does not use KMS at all.

Setting default encryption on the S3 bucket to SSE-KMS with the CMK (Option C) ensures all objects are encrypted with that key, meeting the encryption-at-rest requirement.

Exam trap

The trap here is that candidates often assume AWS managed keys (aws/s3) give them control over key rotation, but AWS rotates those keys on its own schedule; only customer-managed CMKs let you enable and manage automatic rotation. The question requires selecting three options that together meet both the KMS and automatic rotation requirements.

86
MCQ · medium

A company uses Amazon DynamoDB as a source for an AWS Glue job. The job reads a large table using a DynamoDB export to S3 feature. The job is failing with 'ThrottlingException' from DynamoDB. What should the data engineer do to resolve this issue WITHOUT changing the job's logic?

A. Use DynamoDB Streams to capture changes and process them incrementally
B. Reduce the number of DynamoDB read segments in the Glue job
C. Use the DynamoDB export to S3 feature and read the exported data from S3
D. Increase the read capacity units (RCU) of the DynamoDB table
Answer: C

Export to S3 reads from the table without consuming RCU, avoiding throttling entirely.

Why this answer

Option C is correct because the DynamoDB export to S3 feature creates a point-in-time snapshot of the table data in S3 without consuming any read capacity units (RCUs) from the DynamoDB table. By reading the exported data from S3 instead of directly scanning the DynamoDB table, the Glue job avoids triggering ThrottlingException entirely, as the export operation uses the table's backup and restore mechanism, not the read path. This resolves the issue without altering the job's logic, as the job can be reconfigured to read from the S3 export location.

Exam trap

The trap here is that candidates often assume the only way to resolve DynamoDB throttling is to increase RCUs (Option D) or reduce parallelism (Option B), missing the fact that the export-to-S3 feature completely eliminates the need to read from DynamoDB during the Glue job, which is the most efficient and cost-effective solution without altering job logic.

How to eliminate wrong answers

Option A is wrong because using DynamoDB Streams to capture and process changes incrementally replaces the full read with a streaming approach, which violates the requirement not to change the job's logic. Option B is wrong because reducing the number of DynamoDB read segments lowers parallelism and may reduce throttling, but the job still reads directly from DynamoDB, consumes RCUs, and remains at risk of ThrottlingException. Option D is wrong because increasing the table's read capacity units (RCUs) raises the throughput limit without touching the job's logic, but it adds cost and still leaves the scan load on the table, so it is not the best fix.

87
Multi-Select · medium

A data engineer is designing a data lake on Amazon S3 that will be used for both batch processing with Amazon EMR and interactive queries with Amazon Athena. The data includes sensitive personally identifiable information (PII) that must be encrypted at rest. The company requires that the encryption keys be managed by the company and rotated every 90 days. Which TWO options should the engineer implement to meet these requirements? (Choose TWO.)

Select 2 answers
A. Use customer-provided keys (SSE-C) and store the keys in AWS Secrets Manager.
B. Configure a bucket policy to deny uploads that are not encrypted.
C. Enable S3 default encryption using SSE-KMS with the customer managed key.
D. Use AWS Key Management Service (KMS) to create a customer managed key with automatic yearly rotation.
E. Use S3 managed keys (SSE-S3) for server-side encryption.
Answers: C, D

Default encryption ensures all new objects are encrypted with the KMS key.

Why this answer

Option D is correct because AWS KMS allows you to create and manage your own customer managed keys, and you can set automatic key rotation every 90 days. Option C is correct because enabling S3 default encryption with SSE-KMS ensures all objects are encrypted using the KMS key. Option E is wrong because SSE-S3 uses Amazon-managed keys, not customer-managed keys.

Option A is wrong because SSE-C requires you to manage the keys yourself, and you cannot use KMS for rotation; also, SSE-C is not recommended for this scenario. Option B is wrong because bucket policies do not enable encryption; they can enforce encryption but do not manage keys.

88
Multi-Select · medium

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Streams. The data must be transformed in real-time and then stored in Amazon S3 for long-term retention. Which THREE services can be used together to achieve this?

Select 3 answers
A. Amazon Kinesis Data Analytics
B. Amazon Kinesis Data Firehose
C. AWS Glue
D. Amazon Athena
E. Amazon Kinesis Data Streams
Answers: A, B, E

Kinesis Data Analytics performs the real-time transformations.

Why this answer

Options A, B, and E are correct: Kinesis Data Streams ingests, Kinesis Data Analytics transforms, and Kinesis Data Firehose writes to S3. Option C (Glue) is batch-oriented, not real-time. Option D (Athena) is for querying, not streaming.

89
MCQ · hard

A company is using Amazon MSK (Managed Streaming for Apache Kafka) to ingest real-time data. They need to transform the data using custom Java code before writing to Amazon S3. The transformation must be fault-tolerant and exactly-once semantics are required. Which AWS service should be used?

A. Amazon EMR with Spark Streaming
B. AWS Lambda consumer for MSK
C. Kafka Connect with S3 Sink Connector
D. Kinesis Data Analytics for Apache Flink
Answer: C

Supports exactly-once and custom transformations.

Why this answer

Option C is correct because Kafka Connect with S3 Sink Connector supports exactly-once semantics and custom transformations via Single Message Transforms (SMTs) or custom connectors. Option D (Kinesis Data Analytics for Apache Flink) is for Flink, not Kafka. Option B (AWS Lambda) does not provide exactly-once semantics from MSK.

Option A (Amazon EMR) is for batch processing.

90
MCQ · medium

Refer to the exhibit. A data engineer is troubleshooting an AWS Lambda function that processes data from Amazon S3. The function is triggered by S3 events, but no logs appear in CloudWatch Logs. The engineer runs the AWS CLI command shown. What is the MOST likely reason for the missing logs?

A. The Lambda execution role does not have permissions to create log groups and write logs.
B. The Lambda function is configured to log to a different log group.
C. The Lambda function is not being invoked by S3 events.
D. The log retention policy is set to 7 days, causing logs to expire immediately.
Answer: A

The execution role is missing logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents.

Why this answer

Option A is correct because the log group shows 'storedBytes': 0, indicating no logs have been written. The most common cause is that the Lambda execution role lacks permission to create log streams and put log events. Option B is wrong because the function's log group exists.

Option D is wrong because retention does not prevent logs from being written. Option C is wrong because the CLI command shows the log group exists.
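The three CloudWatch Logs permissions named in the answer can be expressed as an IAM policy document. The sketch below builds one as a Python dictionary; the function name and the region, account ID, and function name values are placeholders chosen for illustration, not details from the exhibit.

```python
import json


def lambda_logging_policy(region, account_id, function_name):
    """Build an IAM policy document that lets a Lambda function write its logs."""
    log_group_arn = (
        f"arn:aws:logs:{region}:{account_id}:log-group:/aws/lambda/{function_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the function create its log group on first invocation.
                "Effect": "Allow",
                "Action": "logs:CreateLogGroup",
                "Resource": f"arn:aws:logs:{region}:{account_id}:*",
            },
            {
                # Lets the function create log streams and write log events.
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group_arn}:*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(lambda_logging_policy("us-east-1", "123456789012", "my-function"), indent=2))
```

Attaching a policy like this (or the AWS managed basic execution policy) to the execution role is what allows log data to appear in the log group.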

91
MCQ · medium

A media company ingests large video files (up to 100 GB each) from content creators via Amazon S3 multipart uploads. After upload, the company needs to transcode the videos into multiple formats using AWS Elemental MediaConvert. The current pipeline uses S3 event notifications to trigger an AWS Lambda function that starts a MediaConvert job. However, for very large files, the Lambda function times out (15-minute limit) before the upload completes because the event is sent when the multipart upload is initiated, not when it completes. How should the engineer fix this issue?

A. Increase the Lambda timeout to 30 minutes.
B. Use Amazon SQS to queue the event and have a Lambda function process it later.
C. Use AWS Step Functions to poll the S3 bucket for object existence and then start MediaConvert.
D. Change the S3 event notification to listen for s3:ObjectCreated:Put events instead of s3:ObjectCreated:*, which fires only after the object is fully written.
Answer: D

s3:ObjectCreated:Put triggers only when the object is complete, avoiding early invocation.

Why this answer

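Option D is correct: use S3 event notifications for s3:ObjectCreated:Put, which triggers only after the object is completely written. Then Lambda can start MediaConvert. Note that S3 reports objects uploaded in parts with the s3:ObjectCreated:CompleteMultipartUpload event type, so the notification must also cover that event type for multipart uploads. Option A (increase timeout) does not solve the root cause.

Option C (Step Functions) adds complexity. Option B (SQS) does not address the event timing.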
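S3 publishes distinct event types for object creation, and an object uploaded in parts is reported as s3:ObjectCreated:CompleteMultipartUpload once the upload finishes. As a hedged sketch, the notification configuration below subscribes a Lambda function to both the Put and the multipart-completion event types. The function name and the ARN passed to it are hypothetical; the dictionary follows the shape accepted by the S3 `PutBucketNotificationConfiguration` API.

```python
def completed_upload_notification(lambda_arn):
    """Build an S3 notification configuration for fully written objects only."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": [
                    # Single-request uploads.
                    "s3:ObjectCreated:Put",
                    # Multipart uploads, reported when the final
                    # CompleteMultipartUpload call succeeds.
                    "s3:ObjectCreated:CompleteMultipartUpload",
                ],
            }
        ]
    }
```

With boto3, a dictionary like this would be passed as the `NotificationConfiguration` argument of `put_bucket_notification_configuration`.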

92
MCQ · medium

A company uses AWS KMS to encrypt sensitive data in S3. The security team requires that the KMS key must be rotated automatically every year. Which key type should be used?

A. Asymmetric customer managed key
B. AWS managed key (aws/s3)
C. Custom key store backed by CloudHSM
D. Customer managed key with automatic rotation enabled
Answer: B

AWS managed keys automatically rotate every year, meeting the requirement.

Why this answer

Option B is correct because AWS managed keys rotate automatically every year by default, with no configuration. Option D also rotates annually once rotation is enabled, but it requires explicit setup and key management, whereas AWS managed keys need none. Option C is wrong because keys in a custom key store backed by CloudHSM do not support automatic rotation.

Option A is wrong because asymmetric keys are not automatically rotated.

93
MCQ · medium

A data pipeline uses AWS Glue to process data from Amazon S3. The job fails with an 'OutOfMemoryError' during the transformation phase. Which action should the data engineer take to resolve this issue?

A. Enable S3 server-side encryption.
B. Increase the number of partitions in the input data.
C. Change the data format from CSV to Parquet.
D. Increase the number of DPUs (Data Processing Units) for the Glue job.
Answer: D

More DPUs provide additional memory and compute resources to handle large transformations.

Why this answer

Increasing the number of DPUs allocated to the Glue job provides more memory and processing capacity, which directly addresses the OutOfMemoryError. Option B is wrong because it increases parallelism without adding memory. Option C is wrong because the error is not caused by the data format.

Option A is wrong because S3 encryption has no effect on the job's memory use.

94
MCQ · hard

A data engineer is designing a data pipeline that uses AWS Glue to process data from an RDS MySQL database. The pipeline must capture only incremental changes (inserts and updates) and run every hour. Which approach is most cost-effective and reliable?

A. Use Glue job bookmarks to track and process only new and updated records
B. Use AWS DMS with change data capture (CDC) to replicate changes to S3
C. Add a timestamp column and query rows where timestamp > last run
D. Perform a full table scan each hour and compare with previous snapshot
Answer: A

Bookmarks efficiently handle incremental loads.

Why this answer

Option A is correct because Glue job bookmarks track processed data and enable incremental processing without reprocessing full tables. Option B is wrong because CDC from DMS adds complexity and cost. Option D is wrong because full scans are inefficient and costly.

Option C is wrong because querying by timestamp may miss updates if timestamps are not indexed or updated.

95
MCQ · easy

A data engineer ran the above CLI command to describe an Amazon DynamoDB table named 'Orders'. The table has a key schema with 'OrderID' as the partition key and 'CustomerID' as the sort key. The table currently has no items. The engineer wants to add a new attribute 'OrderDate' and then query all orders for a specific customer within a date range. Which of the following actions is the MOST efficient approach to support this query pattern?

A. Modify the table's primary key to include 'OrderDate' as an additional sort key.
B. Use a Scan operation with a filter expression on 'CustomerID' and 'OrderDate' to retrieve the data.
C. Create a Local Secondary Index (LSI) with 'CustomerID' as partition key and 'OrderDate' as sort key.
D. Create a Global Secondary Index (GSI) with 'CustomerID' as partition key and 'OrderDate' as sort key.
Answer: D

GSI can be added at any time and supports efficient queries on CustomerID and OrderDate.

Why this answer

Option D is correct because a Global Secondary Index (GSI) allows querying on a different partition key ('CustomerID') and sort key ('OrderDate') without altering the base table's key schema. This supports efficient range queries on 'OrderDate' for a specific customer, as GSIs provide a separate index with its own provisioned throughput and can be created on existing tables with items. The base table's primary key remains unchanged, and the GSI enables the desired query pattern with low latency.

Exam trap

AWS often tests the distinction between LSIs and GSIs, specifically that LSIs require the same partition key as the base table, while GSIs allow a different partition key, which is a common point of confusion for candidates.

How to eliminate wrong answers

Option A is wrong because DynamoDB does not support modifying an existing table's primary key schema; you cannot add a sort key after table creation without recreating the table. Option B is wrong because a Scan operation reads every item in the table and then applies a filter, which is inefficient and costly for large tables, and does not leverage DynamoDB's indexing capabilities for range queries. Option C is wrong because a Local Secondary Index (LSI) must have the same partition key as the base table (here 'OrderID'), so it cannot use 'CustomerID' as the partition key; LSIs only allow querying with the base table's partition key and an alternate sort key.
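To make the GSI approach concrete, the sketch below builds the request parameters for adding the index and for querying it by customer and date range. The helper names, the index name, and the choice of string-typed attributes (for example ISO 8601 dates) are assumptions for illustration; the dictionaries follow the shapes of the DynamoDB `UpdateTable` and `Query` APIs.

```python
INDEX_NAME = "CustomerID-OrderDate-index"  # hypothetical index name


def gsi_create_params(table_name="Orders"):
    """Parameters for UpdateTable that add a GSI keyed on CustomerID and OrderDate."""
    return {
        "TableName": table_name,
        # Every attribute used in the new index key must be declared here.
        "AttributeDefinitions": [
            {"AttributeName": "CustomerID", "AttributeType": "S"},
            {"AttributeName": "OrderDate", "AttributeType": "S"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": INDEX_NAME,
                    "KeySchema": [
                        {"AttributeName": "CustomerID", "KeyType": "HASH"},
                        {"AttributeName": "OrderDate", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                    # A provisioned-mode table would also need
                    # "ProvisionedThroughput" here.
                }
            }
        ],
    }


def orders_in_range_params(customer_id, start_date, end_date, table_name="Orders"):
    """Parameters for Query that read one customer's orders within a date range."""
    return {
        "TableName": table_name,
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": "CustomerID = :c AND OrderDate BETWEEN :start AND :end",
        "ExpressionAttributeValues": {
            ":c": {"S": customer_id},
            ":start": {"S": start_date},
            ":end": {"S": end_date},
        },
    }
```

With boto3, these dictionaries would be unpacked into `client.update_table(**...)` and `client.query(**...)`. Storing dates as ISO 8601 strings keeps the BETWEEN comparison in chronological order.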

96
Multi-Select · easy

A data engineer is setting up an AWS Glue job to process data from an Amazon S3 bucket. The job fails with an 'Access Denied' error. Which TWO IAM permissions are MOST likely missing from the Glue job's IAM role?

Select 2 answers
A. s3:PutObject
B. kms:Decrypt
C. dynamodb:GetItem
D. glue:StartJobRun
E. s3:GetObject
Answers: A, E

Required to write output to S3.

Why this answer

Options A and E are correct. Glue needs s3:GetObject to read from S3 and s3:PutObject to write output. Option D is incorrect because glue:StartJobRun is not needed for the job itself.

Option B is incorrect because kms:Decrypt is needed only if S3 uses KMS encryption. Option C is incorrect because dynamodb:GetItem is not relevant unless accessing DynamoDB.

97
MCQ · hard

A company is using AWS Glue to run ETL jobs that write data to an Amazon S3 data lake. The jobs are failing with '503 Slow Down' errors. The data engineering team has already implemented retries. What is the BEST long-term solution?

A. Enable S3 Transfer Acceleration.
B. Use S3 multipart upload for all objects.
C. Increase the number of retries in the Glue job.
D. Implement a backoff strategy to reduce the request rate.
Answer: D

Reducing request rate helps avoid S3 503 errors.

Why this answer

The '503 Slow Down' error from Amazon S3 indicates that the request rate is too high and S3 is throttling the requests. The best long-term solution is to implement a backoff strategy (exponential backoff) to reduce the request rate, which allows the Glue job to automatically slow down and retry with increasing delays, aligning with S3's request rate limits and avoiding sustained throttling.

Exam trap

The trap here is that candidates often confuse '503 Slow Down' with a network or throughput issue and choose S3 Transfer Acceleration or multipart upload, when in fact the error is a throttling response from S3 that requires reducing the request rate via backoff, not increasing speed or parallelism.

How to eliminate wrong answers

Option A is wrong because S3 Transfer Acceleration is designed to speed up uploads over long distances using edge locations, but it does not reduce the request rate or resolve throttling caused by high request volumes. Option B is wrong because S3 multipart upload is a mechanism for uploading large objects in parts, which can improve throughput but does not address the root cause of excessive request rate leading to '503 Slow Down' errors. Option C is wrong because increasing the number of retries without reducing the request rate will likely continue to trigger throttling, as the same high request rate will persist after each retry, leading to repeated failures.

98
Multi-Select · easy

A company is ingesting large volumes of sensor data into Amazon S3. The data must be encrypted at rest using an AWS KMS customer managed key. Which TWO actions are required to enable server-side encryption with AWS KMS (SSE-KMS) on the S3 bucket?

Select 2 answers
A. Enable S3 Versioning on the bucket
B. Set the default encryption on the S3 bucket to AWS-KMS and specify the KMS key
C. Enable Amazon CloudWatch Logs for the bucket
D. Add a bucket policy that denies uploads without encryption
E. Ensure the IAM role/user has kms:Encrypt permission on the KMS key
Answers: B, E

This configures SSE-KMS for all objects.

Why this answer

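Option B is correct because setting the default encryption on the S3 bucket to AWS-KMS and specifying the KMS key ensures that all objects uploaded to the bucket are automatically encrypted with SSE-KMS using that customer managed key. This is the primary configuration step to enforce server-side encryption at rest with a customer managed key. Option E is correct because the IAM principal that uploads objects must also be allowed to use the KMS key; without key permissions, uploads fail with an access-denied error.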

Exam trap

The trap here is that candidates often think a bucket policy denying unencrypted uploads alone is sufficient to enable SSE-KMS, but it only enforces that uploads must include encryption headers—it does not automatically apply encryption, so the default encryption setting is also required.
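The difference between applying encryption and enforcing it can be seen by placing the two configurations side by side. The sketch below is illustrative: the helper names, bucket name, and key ARN are placeholders, while the dictionaries follow the shapes of the S3 `PutBucketEncryption` API and a standard bucket policy document.

```python
def default_sse_kms_config(kms_key_arn):
    """Default encryption: S3 applies SSE-KMS to new objects automatically."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


def deny_non_kms_uploads_policy(bucket_name):
    """Bucket policy: rejects uploads that do not request SSE-KMS.

    It encrypts nothing by itself; it only refuses non-compliant requests.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutSseKms",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }
```

The first dictionary is what actually causes objects to be encrypted; the second is an optional guardrail layered on top of it.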

99
MCQ · medium

A company is migrating an on-premises MongoDB database to Amazon DocumentDB. The migration must have minimal downtime. Which service should be used to perform the migration?

A. AWS Glue
B. AWS DataSync
C. AWS Database Migration Service (DMS)
D. Amazon S3 Transfer Acceleration
Answer: C

DMS supports MongoDB to DocumentDB migration with minimal downtime using change data capture.

Why this answer

AWS Database Migration Service (DMS) is the correct choice because it supports continuous replication from MongoDB to Amazon DocumentDB using change data capture (CDC), enabling near-zero downtime migrations. DMS can perform a full load of existing data and then apply ongoing changes from the source MongoDB oplog, keeping the target DocumentDB synchronized until the cutover.

Exam trap

The trap here is that candidates may confuse AWS DMS with AWS DataSync or AWS Glue, assuming any 'migration' or 'data transfer' service can handle live database replication, but only DMS provides the necessary CDC engine for heterogeneous database migrations with minimal downtime.

How to eliminate wrong answers

Option A is wrong because AWS Glue is a serverless data integration service for ETL (extract, transform, load) jobs, not designed for live database migration with minimal downtime; it lacks native CDC support for MongoDB to DocumentDB replication. Option B is wrong because AWS DataSync is optimized for moving large volumes of file data (e.g., NFS, SMB) to AWS storage services like S3 or EFS, not for heterogeneous database migrations or ongoing replication. Option D is wrong because Amazon S3 Transfer Acceleration is a feature that speeds up uploads to S3 buckets over long distances using edge locations; it has no capability to migrate or replicate a MongoDB database to DocumentDB.

100
Multi-Select · hard

A data engineer is designing a data lake on Amazon S3. The data must be immutable and support high-throughput streaming ingestion. Which THREE features should the engineer consider? (Select THREE.)

Select 3 answers
A. S3 Transfer Acceleration
B. S3 Lifecycle policies to transition data to Amazon S3 Glacier
C. S3 Multipart Upload API
D. S3 Object Lock in governance mode
E. S3 Cross-Region Replication (CRR)
Answers: B, C, D

Lifecycle policies automate data movement, cost-effectively managing the data lifecycle.

Why this answer

S3 Object Lock in governance mode (Option D) is correct because it enforces immutability by preventing objects from being deleted or overwritten for a specified retention period, which is essential for a data lake requiring immutable data. S3 Multipart Upload API (Option C) is correct because it enables high-throughput streaming ingestion by allowing large objects to be uploaded in parallel parts, improving throughput and resilience. S3 Lifecycle policies to transition data to Amazon S3 Glacier (Option B) is correct because it supports cost-effective storage management for immutable data that is rarely accessed, aligning with the data lake's lifecycle needs.

Exam trap

The trap here is that candidates often confuse S3 Transfer Acceleration (a speed optimization) with a feature that provides immutability or streaming support, leading them to select it incorrectly, while overlooking that S3 Object Lock and Multipart Upload directly address the core requirements of immutability and high-throughput ingestion.

101
MCQhard

A data engineer uses AWS Database Migration Service (DMS) to migrate an on-premises Oracle database to Amazon Aurora MySQL. The migration is successful, but the engineer notices that the target Aurora cluster has a higher CPU utilization than expected during the full load phase. What is the MOST likely cause?

A.The DMS task has LOB mode set to 'Full LOB mode', causing additional processing.
B.DMS is performing data validation during the full load phase.
C.DMS is reading from an Amazon Aurora read replica instead of the primary instance.
D.The DMS task is configured to use multiple parallel threads to load data, overwhelming the target instance.
AnswerD

Parallel threads increase throughput but also increase CPU usage.

Why this answer

Option D is correct because DMS loads data with multiple parallel threads to maximize throughput, which can cause high CPU on the target. Option C is wrong because DMS does not read from an Aurora read replica during full load; the source here is an on-premises Oracle database. Option A is wrong because LOB settings affect how large-object columns are handled, not overall CPU load.

Option B is wrong because validation occurs after the full load, not during it.

102
MCQmedium

A company needs to transform JSON data from an Amazon S3 bucket into Parquet format and load it into an Amazon Redshift cluster. The transformation includes joining with a reference table stored in Amazon RDS. Which AWS service is BEST suited for this task?

A.AWS Data Pipeline
B.AWS Glue ETL job
C.Amazon Athena
D.Amazon EMR with Spark
AnswerB

Glue ETL jobs can read from S3, connect to RDS via JDBC, transform, and write to Redshift efficiently.

Why this answer

Option B is correct because AWS Glue ETL jobs can read from S3, connect to RDS via JDBC, perform joins and transformations, and write to Redshift. Option C (Athena) can query S3 but cannot join with RDS natively. Option D (EMR with Spark) is possible but more complex to set up.

Option A (Data Pipeline) is older and less integrated.

103
MCQmedium

A company uses Amazon Kinesis Data Firehose to ingest application logs into an Amazon S3 bucket. The logs are in JSON format. The data engineering team wants to convert the logs from JSON to Parquet format before landing in S3. What is the most cost-effective way to achieve this?

A.Use Amazon Athena to query the JSON data and write results in Parquet format.
B.Configure the Firehose delivery stream to convert the data to Parquet using a schema from AWS Glue.
C.Use an AWS Lambda function to transform each record to Parquet and send to Firehose.
D.Use an AWS Glue ETL job to run on a schedule and convert JSON to Parquet in S3.
AnswerB

Firehose supports built-in conversion to Parquet/ORC using Glue schema.

Why this answer

Option B is correct because Kinesis Data Firehose can convert the input data format to Parquet using a schema from AWS Glue. Option C is incorrect because Lambda can do this but incurs additional compute cost. Option A is incorrect because Athena queries data that has already landed in S3 and does not help during ingestion.

Option D is incorrect because a scheduled Glue ETL job would add cost and latency.
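
As a rough illustration of the built-in conversion, the sketch below builds the `DataFormatConversionConfiguration` block that sits inside `ExtendedS3DestinationConfiguration` in a boto3 `create_delivery_stream` request. The role ARN, database, table, and Region values are placeholders, not values from the question.

```python
def parquet_conversion_config(role_arn: str, database: str, table: str, region: str) -> dict:
    """Build a JSON-in, Parquet-out conversion block for a Firehose S3 destination.

    The Glue table supplies the schema that Firehose uses to write Parquet.
    """
    return {
        "Enabled": True,
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
            "Region": region,
        },
        # Parse incoming JSON records.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # Serialize the parsed records as Parquet.
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }


# Placeholder identifiers, for illustration only.
config = parquet_conversion_config(
    role_arn="arn:aws:iam::123456789012:role/firehose-glue-access",
    database="logs_db",
    table="app_logs",
    region="us-east-1",
)
```

The resulting dictionary would be passed as the `DataFormatConversionConfiguration` key of the destination configuration; no separate compute is provisioned for the conversion.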

104
MCQhard

A data engineer is designing a data pipeline that ingests millions of small JSON files (1-10 KB each) from an S3 bucket into Amazon Redshift. The current approach uses a Lambda function triggered by S3 events to call the Redshift COPY command for each file. This is causing high latency and throttling. Which alternative is MOST cost-effective and efficient?

A.Use Amazon Kinesis Data Streams and a consumer to batch files before COPY
B.Use Amazon Kinesis Data Firehose to buffer and write larger files to S3, then use a scheduled COPY command
C.Increase the Lambda concurrency limit and memory
D.Use AWS Glue to merge files into larger Parquet files before loading
AnswerB

Firehose buffers small files into larger ones, reducing COPY frequency and cost.

Why this answer

Option B is correct because Kinesis Data Firehose can buffer the small records and write larger batches to S3, which a scheduled COPY then loads into Redshift. Option C still processes each file individually. Option A is built for streaming consumers and requires custom batching logic.

Option D adds complexity and cost.

105
MCQeasy

A company is storing large amounts of log data in Amazon S3. The data is accessed frequently for the first 30 days, then rarely after that. The company wants to automatically transition the data to a lower-cost storage class after 30 days. Which S3 feature should the data engineer use?

A.S3 Intelligent-Tiering
B.S3 Lifecycle policies
C.S3 Cross-Region Replication
D.S3 Batch Operations
AnswerB

Lifecycle policies can transition objects after a specified number of days.

Why this answer

Option B is correct because S3 Lifecycle policies automatically transition objects between storage classes. Option A is wrong because S3 Intelligent-Tiering monitors access patterns but may have monitoring costs. Option C is wrong because S3 replication is for copying data, not transitioning storage classes.

Option D is wrong because S3 Batch Operations is for bulk actions, not automatic transitions.

106
MCQmedium

A company is using Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to an Amazon S3 bucket. The team notices that the application is experiencing high latency during peak hours. The stream has 8 shards, and the application is configured with a parallelism of 4. Which action would most likely reduce the latency?

A.Decrease the batch size in the S3 sink.
B.Use a larger Kinesis Data Analytics application instance type.
C.Increase the parallelism of the Flink application to 8.
D.Increase the checkpointing interval to reduce overhead.
AnswerC

Matching parallelism to shard count ensures each shard is processed concurrently, reducing backpressure.

Why this answer

The correct answer is to increase the parallelism of the Flink application to match the number of shards (8). When parallelism is lower than the number of shards, each subtask must read from multiple shards, which can cause backpressure and latency during peak load. Option D (increasing the checkpointing interval) would reduce overhead but not address the parallelism mismatch.

Option B (using a larger instance type) could help but is less effective than matching parallelism. Option A (decreasing the batch size in the S3 sink) does not address the source-side bottleneck. Option C directly fixes the bottleneck.
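
The shard-to-parallelism mismatch can be checked with simple arithmetic. The sketch below assumes shards are spread as evenly as possible across source subtasks, which is a simplification of how a Flink Kinesis source assigns shards; the function name is illustrative.

```python
import math


def max_shards_per_subtask(shards: int, parallelism: int) -> int:
    """Return the largest number of shards any one subtask must read,
    assuming shards are distributed as evenly as possible."""
    if shards < 1 or parallelism < 1:
        raise ValueError("shards and parallelism must be positive")
    return math.ceil(shards / parallelism)


print(max_shards_per_subtask(8, 4))  # 2: each subtask reads two shards
print(max_shards_per_subtask(8, 8))  # 1: one shard per subtask
```

Raising parallelism beyond the shard count gives no further benefit under this model, because the extra source subtasks would have no shard to read.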

107
Multi-Selecteasy

A data engineer is monitoring an AWS Glue ETL job that processes data from Amazon DynamoDB to Amazon S3. The job is taking longer than expected. The engineer suspects that the job's parallelism is not optimal. Which THREE actions can improve the job's performance? (Choose THREE.)

Select 3 answers
A.Enable the 'groupFiles' option in the S3 sink to coalesce small files.
B.Decrease the 'dynamodb.splits' parameter to reduce the number of parallel readers.
C.Increase the 'MaxCapacity' (DPU) setting for the Glue job.
D.Disable job bookmark to avoid storing metadata.
E.Increase the 'dynamodb.throughput.read.percentage' parameter to allocate more read capacity.
AnswersA, C, E

Coalescing small files reduces the number of output files and improves write performance.

Why this answer

Option A is correct because the 'groupFiles' option reduces the number of small files written to S3, which lowers write overhead. Option C is correct because increasing the number of DPUs (data processing units) allocated to the job increases parallelism. Option E is correct because raising 'dynamodb.throughput.read.percentage' lets the job consume more of the table's read capacity, speeding up the DynamoDB read.

Option B is wrong because decreasing 'dynamodb.splits' reduces the number of parallel readers and therefore throughput. Option D is wrong because disabling job bookmarks may cause reprocessing and does not improve performance; it may actually degrade it.

108
MCQmedium

A data engineer is designing a data lake on S3 and needs to ensure that data is encrypted at rest using customer-managed KMS keys. The engineer also needs to audit all access to the KMS keys. Which combination of services should be used?

A.SSE-KMS with AWS CloudTrail
B.SSE-C with CloudWatch Logs
C.SSE-KMS with S3 Inventory
D.SSE-S3 with S3 server access logs
AnswerA

SSE-KMS uses customer-managed KMS keys; CloudTrail records KMS API calls for auditing.

Why this answer

Option A is correct. S3 server-side encryption with KMS (SSE-KMS) uses customer-managed KMS keys, and CloudTrail logs all KMS API calls, including Decrypt and GenerateDataKey. Option D is wrong because SSE-S3 uses keys managed by Amazon S3, not customer-managed keys, and S3 server access logs do not capture KMS operations.

Option C is wrong because S3 Inventory does not audit KMS access. Option B is wrong because SSE-C uses customer-provided keys rather than KMS keys.

109
Multi-Selectmedium

A data engineer is setting up an Amazon Redshift cluster for a new data warehouse. The engineer needs to ensure that the cluster can automatically recover from failures and maintain high availability. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers
A.Configure workload management (WLM) queues to prioritize critical queries.
B.Configure automated snapshots with a retention period of 1 day and enable cluster recreation from snapshots.
C.Enable Multi-AZ deployment.
D.Enable concurrency scaling.
E.Configure manual snapshots with cross-region copy.
AnswersB, C

Automated snapshots allow Redshift to automatically restore a cluster from the latest snapshot if the primary fails.

Why this answer

Option C (enable Multi-AZ) and Option B (automated snapshots with cluster recreation) are correct. Multi-AZ provides synchronous replication across AZs for high availability. Automated snapshots allow point-in-time recovery and cluster replacement in case of failure.

Option D is wrong because concurrency scaling improves query performance, not availability. Option E is wrong because manual snapshots require manual intervention. Option A is wrong because workload management (WLM) does not affect availability.

110
MCQeasy

Refer to the exhibit. A data engineer runs the command on an Amazon S3 bucket used for data lake storage. The engineer is concerned about accidental overwrites of objects. What does the output indicate?

A.Versioning is enabled, so previous versions of objects are preserved.
B.Old versions will be automatically deleted after a retention period.
C.Objects are encrypted at rest by default.
D.MFA Delete is disabled, meaning anyone can delete objects permanently.
AnswerA

Versioning keeps all versions of objects.

Why this answer

Option A is correct because the Status 'Enabled' means versioning is turned on for the bucket. Option D is wrong because MFA Delete is not required for versioning. Option B is wrong because versioning does not automatically expire old versions.

Option C is wrong because versioning does not encrypt objects.

111
MCQhard

Refer to the exhibit. A data engineer is using a Kinesis Data Stream with 2 shards. The producer uses a partition key that is the user ID (a UUID). The consumer is falling behind. Which change would improve throughput?

A.Switch to Kinesis Data Firehose
B.Increase the number of shards
C.Increase the retention period
D.Change the partition key to a constant value
AnswerB

More shards increase the read capacity for consumers.

Why this answer

Option B is correct because increasing the number of shards increases the stream's total read capacity, so consumers can process more data in parallel. Option A is wrong because Kinesis Data Firehose would add latency. Option D is wrong because the partition key is already random (UUID), which distributes data well; a constant key would send every record to one shard.

Option C is wrong because increasing the retention period does not affect throughput.

112
MCQmedium

A company runs an e-commerce platform that generates clickstream data from millions of users. The data is ingested into Amazon Kinesis Data Streams with a shard count of 10. The data is then consumed by a Kinesis Data Analytics application that runs SQL queries to aggregate metrics in real time. Recently, the application has been falling behind, and the stream's iterator age metric is increasing. The data volume has doubled over the past month. The application currently uses a single Kinesis Data Analytics application with parallelism of 1. Which action should the data engineer take to improve the processing rate and reduce the iterator age without losing data or causing duplicates?

A.Change the Kinesis Data Analytics application to use a Kinesis Data Firehose delivery stream as the source.
B.Reduce the retention period of the Kinesis Data Streams to 24 hours.
C.Increase the number of shards in the Kinesis Data Streams to 20.
D.Increase the parallelism of the Kinesis Data Analytics application to match the number of shards.
AnswerD

Higher parallelism allows concurrent processing of multiple shards.

Why this answer

Option D is correct because Kinesis Data Analytics (KDA) processes data from each shard in a stream using one or more parallel operators. With a parallelism of 1, the application uses only a single processing thread, which cannot keep up with the doubled data volume across 10 shards. By increasing parallelism to match the shard count (10), KDA can read from all shards concurrently, distributing the processing load and reducing the iterator age without data loss or duplicates, as KDA manages checkpointing and exactly-once semantics internally.

Exam trap

The trap here is that candidates often assume increasing shard count (Option C) is the only way to handle higher data volume, but they overlook that the processing application's parallelism must also scale to consume the additional shards, otherwise the bottleneck shifts to the consumer.

How to eliminate wrong answers

Option A is wrong because Kinesis Data Firehose is a delivery service that buffers and loads data into destinations like S3 or Redshift; it does not support real-time SQL analytics or reduce iterator age, and using it as a source would break the existing KDA SQL application. Option B is wrong because reducing the retention period from the default (24 hours or more) to 24 hours does not improve processing rate; it only causes data to expire sooner, potentially losing unprocessed records and increasing the risk of data loss without addressing the throughput bottleneck. Option C is wrong because increasing the shard count to 20 would double the stream's ingestion capacity, but the KDA application with parallelism of 1 would still process only one shard at a time, leaving the other 19 shards unprocessed and worsening the iterator age; the bottleneck is the application's parallelism, not the stream's shard count.

113
Multi-Selecthard

A company needs to ingest data from a MySQL database into Amazon S3 using AWS DMS. The data changes frequently and the requirement is to capture changes in near real-time. Which THREE configurations are necessary?

Select 3 answers
A.Create a VPC endpoint for S3.
B.Create an S3 target endpoint in DMS.
C.Enable binary logging (binlog) on the MySQL source database.
D.Create an AWS DMS replication instance.
E.Configure an S3 event notification to trigger DMS.
AnswersB, C, D

Needed to specify the S3 bucket.

Why this answer

Options B, C, and D are correct because the target endpoint must be S3, the MySQL source must have binary logging enabled, and DMS requires a replication instance. Option E is incorrect because S3 events are not needed for DMS. Option A is incorrect because a VPC endpoint is not required if using public S3.

114
Multi-Selecthard

A company uses Amazon S3 to store log files that are generated every hour. Each log file is about 1 GB. The logs must be stored for 5 years for compliance. The data engineer wants to minimize storage costs while ensuring that logs can be retrieved within 24 hours for the first year, and within 48 hours thereafter. Which THREE lifecycle actions should the engineer configure? (Choose THREE.)

Select 3 answers
A.Transition objects to S3 Standard after 30 days.
B.Transition objects to S3 Glacier Deep Archive after 1 year.
C.Set a retrieval window of 48 hours for Glacier Deep Archive.
D.Delete objects after 2 years to reduce storage costs.
E.Transition objects to S3 Standard-IA after 30 days.
AnswersB, C, E

Deep Archive is the cheapest storage class for archival data.

Why this answer

Option E is correct because Standard-IA is suitable for infrequently accessed data with immediate retrieval. Option B is correct because Glacier Deep Archive provides low-cost storage with standard retrieval within 12 hours. Option C is correct because Deep Archive retrieval within 48 hours meets the requirement.

Option A is incorrect because Standard is not cost-effective for long-term storage. Option D is incorrect because deletion before 5 years violates compliance.
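
The tiering schedule described in the question can be sketched as a small lookup. The 30-day and 1-year thresholds come from the scenario; expiring objects at exactly 5 × 365 days is an assumption added here for illustration, and the function name is hypothetical.

```python
def storage_class_for_age(age_days: int) -> str:
    """Map an object's age to the storage class implied by the lifecycle rules."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 30:
        return "STANDARD"        # frequently accessed, freshly written logs
    if age_days < 365:
        return "STANDARD_IA"     # rarely read, still immediately retrievable
    if age_days < 5 * 365:       # assumed compliance horizon of 5 years
        return "DEEP_ARCHIVE"    # cheapest tier, retrieval takes hours
    return "EXPIRED"


for age in (10, 30, 400):
    print(age, storage_class_for_age(age))
```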

115
MCQmedium

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data. The consumer application is falling behind and the iterator age is increasing. Which action would most effectively improve throughput?

A.Switch from Kinesis Data Streams to Kinesis Data Firehose
B.Decrease the batch size in the consumer
C.Enable enhanced fan-out for the consumer
D.Increase the number of shards in the stream
AnswerD

More shards increase read capacity and parallelism.

Why this answer

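Option D is correct because increasing the number of shards increases read capacity and parallelism. Option B is wrong because a smaller batch size means fewer records per read, which lowers throughput. Option C is wrong because enhanced fan-out gives each consumer dedicated read throughput, which mainly helps when several consumers share a stream rather than when one consumer is falling behind.

Option A is wrong because switching to Kinesis Data Firehose adds buffering latency.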

116
Multi-Selecteasy

A data engineer needs to monitor the performance of an RDS for PostgreSQL database. Which THREE CloudWatch metrics are most useful for this purpose?

Select 3 answers
A.CPUUtilization
B.DatabaseConnections
C.FreeStorageSpace
D.NetworkThroughput
E.ReadLatency / WriteLatency
AnswersA, B, E

Indicates compute load.

Why this answer

CPUUtilization is a critical metric for monitoring RDS for PostgreSQL because high CPU usage can indicate inefficient queries, insufficient instance size, or contention. Sustained high CPU can lead to performance degradation and increased query latency, making it essential for capacity planning and troubleshooting.

Exam trap

The trap here is that candidates often confuse storage metrics (like FreeStorageSpace) with performance metrics, or assume NetworkThroughput is a performance indicator, when in fact latency and CPU metrics directly reflect query execution health.

117
MCQhard

A company uses Amazon Kinesis Data Streams with a Lambda consumer. The Lambda function is failing with 'ProvisionedThroughputExceededException' when writing to a DynamoDB table. Which action should the data engineer take to resolve this without losing data?

A.Reduce the number of Kinesis shards to lower the ingestion rate.
B.Increase the DynamoDB table's read capacity.
C.Configure a dead-letter queue (DLQ) on the Lambda function and increase the DynamoDB write capacity.
D.Disable retries on the Lambda function to avoid throttling.
AnswerC

The DLQ captures failed records, and increasing write capacity reduces throttling. Together, they prevent data loss.

Why this answer

Option C is correct because the dead-letter queue (DLQ) on the Lambda function captures failed records for later reprocessing, preventing data loss, while increasing the DynamoDB write capacity addresses the throttling itself. Option D (disable retries) would lose data. Option B (increase read capacity) does not help because the throttling occurs on writes.

Option A (reduce shards) would reduce the ingestion rate but may not solve the issue and could cause data loss.

118
MCQmedium

A gaming company collects player event data from mobile devices. The data is sent to an Amazon API Gateway endpoint, which triggers an AWS Lambda function that writes the data to an Amazon DynamoDB table. The company wants to also store the data in Amazon S3 for historical analysis. The data volume is about 100 GB per day. The data engineer needs to design a solution to copy data from DynamoDB to S3 with minimal impact on the DynamoDB table. What should the data engineer do?

A.Enable DynamoDB Streams on the table and configure a Lambda function to write changes to S3.
B.Create a global secondary index on the table and export the index to S3.
C.Use AWS Glue to scan the DynamoDB table and write results to S3 every hour.
D.Use the DynamoDB Export to S3 feature to export the entire table daily.
AnswerA

Streams capture changes with low latency and minimal impact on the table.

Why this answer

Option A is correct. Using DynamoDB Streams with a Lambda function that writes to S3 is a common pattern for real-time replication with minimal impact. Option D is wrong because Export to S3 is a one-time or scheduled export, not continuous.

Option C is wrong because using Scan would consume read capacity and impact performance. Option B is wrong because adding a secondary index does not help with exporting data.

119
MCQeasy

A data engineer needs to ensure that an S3 bucket is not publicly accessible. Which S3 block public access setting should be applied to achieve this?

A.BlockPublicAcls (both new and existing)
B.IgnorePublicAcls
C.BlockPublicAcls (new ACLs)
D.BlockPublicPolicy
AnswerA

Blocks all public ACLs.

Why this answer

Option A is correct because BlockPublicAcls prevents public ACLs from being applied to the bucket and its objects. Option C is wrong because it covers only new ACLs. Option D is wrong because BlockPublicPolicy blocks public bucket policies rather than ACLs.

Option B is wrong because IgnorePublicAcls makes S3 ignore public ACLs instead of rejecting them. The combination of all four Block Public Access settings is recommended, but the question asks for a single setting.

120
MCQmedium

Refer to the exhibit. A data engineer has attached this KMS key policy to a customer-managed key. The policy is intended to allow the DataEngineer role to decrypt objects in S3 only when the request comes through S3. However, the role is unable to decrypt objects stored in an S3 bucket in the us-west-2 region. What is the most likely cause?

A.The key policy does not allow the role to use GenerateDataKey
B.The role does not have an IAM policy that allows kms:Decrypt
C.The condition restricts the permission to the us-east-1 region only
D.The role does not have permission to decrypt from S3
AnswerC

The kms:ViaService condition specifies s3.us-east-1.amazonaws.com, so it only works for S3 requests in us-east-1.

Why this answer

The condition in the policy restricts the permission to requests coming through S3 in us-east-1 only (s3.us-east-1.amazonaws.com). For buckets in us-west-2, the kms:ViaService value would be s3.us-west-2.amazonaws.com, so the condition fails. The key policy also does not allow decryption through other services, but the cause here is the region mismatch.
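
The region mismatch can be illustrated with a simplified model of the condition check. The sketch below assumes the policy uses a plain string-equality match on `kms:ViaService`; the exhibit itself is not reproduced here, and the function name is hypothetical.

```python
def via_service_allows(policy_via_service: str, bucket_region: str) -> bool:
    """Return True if an S3 request in bucket_region satisfies a
    string-equality condition on kms:ViaService."""
    # S3 calls KMS on the caller's behalf using a region-specific service name.
    request_via_service = f"s3.{bucket_region}.amazonaws.com"
    return request_via_service == policy_via_service


policy_value = "s3.us-east-1.amazonaws.com"
print(via_service_allows(policy_value, "us-east-1"))  # True
print(via_service_allows(policy_value, "us-west-2"))  # False
```

Under this model, supporting the us-west-2 bucket would mean listing s3.us-west-2.amazonaws.com in the condition as well.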

121
MCQhard

A company uses AWS Glue ETL to transform data from Amazon RDS for PostgreSQL to Amazon S3. The transformation includes joining several tables and aggregating millions of rows. The job runs successfully but takes over 2 hours. The data engineer wants to reduce runtime. Which action is MOST effective?

A.Enable Auto Scaling for the Glue job.
B.Use AWS Glue DynamicFrames instead of DataFrames.
C.Increase the number of DPUs for the Glue job.
D.Convert the source data to Parquet format.
AnswerC

More DPUs increase parallelism and reduce execution time.

Why this answer

Option C is correct because increasing the number of DPUs (Data Processing Units) in AWS Glue ETL jobs can significantly reduce runtime by parallelizing the transformation. Option A is wrong because Auto Scaling only adjusts workers up to the configured maximum and does not help if the job already uses its full DPU allocation. Option D is wrong because converting to Parquet can help with reading from S3, but the source is RDS, not S3.

Option B is wrong because using DynamicFrames instead of DataFrames does not directly improve the performance of reading from RDS.

122
MCQhard

A company ingests streaming data from multiple sources into a single Kinesis Data Streams stream. Each source produces records with a different schema. The data must be routed to different S3 prefixes based on the source. Which approach minimizes transformation overhead?

A.Use a single Kinesis Data Firehose with a Lambda transformation that reads schema metadata from DynamoDB to determine the S3 prefix.
B.Ingest all data into S3 and use AWS Glue ETL jobs to partition and route data to different prefixes.
C.Use separate Kinesis Data Streams for each source and configure separate Firehose delivery streams.
D.Use Kinesis Data Analytics to run SQL queries that route data to different Firehose streams.
AnswerA

This approach uses DynamoDB for schema metadata and Lambda for lightweight routing.

Why this answer

Option A is correct because using a DynamoDB table to store schema metadata and a Lambda transformation to determine each record's S3 prefix is efficient and scalable. Option C is wrong because it requires multiple Kinesis streams, increasing cost and complexity. Option B is wrong because ingesting into S3 and then using Glue to route adds latency.

Option D is wrong because using Kinesis Data Analytics for routing is overkill and not designed for this purpose.

123
MCQeasy

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by an AWS Lambda function that processes records and stores results in Amazon DynamoDB. Recently, the Lambda function has been failing with ProvisionedThroughputExceededException errors. Which action should the data engineer take to resolve this issue?

A.Enable auto scaling on the DynamoDB table to handle increased write capacity.
B.Reduce the number of shards in the Kinesis stream to lower the ingestion rate.
C.Increase the batch size in the Lambda event source mapping to process more records per invocation.
D.Configure the Lambda function to discard records that cause throttling errors.
AnswerA

Auto scaling adjusts throughput based on actual usage, preventing throttling.

Why this answer

Option A is correct because enabling DynamoDB auto scaling dynamically adjusts throughput to match demand. Option C is wrong because increasing the Lambda batch size would increase write requests per invocation, worsening the problem. Option B is wrong because reducing shards lowers the stream's capacity without addressing the DynamoDB write throttling.

Option D is wrong because discarding throttled records would lose data. Lambda's built-in retries alone would also keep failing until the table has enough write capacity.

124
MCQhard

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to an S3 bucket. The delivery stream is configured with a buffer interval of 60 seconds and a buffer size of 5 MB. The data arrives at an average rate of 2 MB per second. What is the expected time interval between S3 writes?

A.Approximately 2.5 seconds
B.Approximately 30 seconds
C.Approximately 60 seconds
D.Approximately 10 seconds
AnswerA

The buffer size of 5 MB fills in 2.5 seconds at 2 MB/s, triggering a write.

Why this answer

Option A is correct because Firehose writes when either the buffer size or the buffer interval is reached, whichever comes first. At 2 MB/s, the 5 MB buffer fills in 2.5 seconds, long before the 60-second interval elapses, so writes occur approximately every 2.5 seconds.

Options B, C, and D are incorrect because they are based on the interval or on miscalculations.
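
The "whichever comes first" rule reduces to a minimum of two times. The sketch below is a simplified model that assumes a constant arrival rate; the function and parameter names are illustrative.

```python
def seconds_until_flush(buffer_size_mb: float, buffer_interval_s: float,
                        rate_mb_per_s: float) -> float:
    """Estimate the time between deliveries: the buffer flushes when it fills
    or when the interval elapses, whichever happens first."""
    if rate_mb_per_s <= 0:
        # With no incoming data the size limit is never reached.
        return buffer_interval_s
    time_to_fill = buffer_size_mb / rate_mb_per_s
    return min(time_to_fill, buffer_interval_s)


print(seconds_until_flush(5, 60, 2))  # 2.5
```

At a much lower rate, such as 0.05 MB/s, the buffer would need 100 seconds to fill, so the 60-second interval would trigger the write instead.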

125
MCQhard

A healthcare company is building a data pipeline to ingest electronic health records (EHR) from hospitals. The data is sent as JSON files via SFTP to an on-premises server. The company wants to move this data to AWS using AWS Transfer Family (SFTP) and then process it with AWS Glue. Data sovereignty regulations require that all data remain within the EU (Frankfurt) region. The pipeline must detect when a new file arrives and start the Glue job automatically. The engineer has set up an AWS Transfer Family server in Frankfurt, and files are uploaded to an S3 bucket in the same region. However, the Glue job is not triggering automatically. The engineer needs to implement automated triggering. What should the engineer do?

A.Configure AWS Step Functions to poll the S3 bucket every minute and start the Glue job if new files exist.
B.Configure Amazon CloudWatch Events to trigger the Glue job on a schedule that checks for new files.
C.Use Amazon Simple Queue Service (SQS) to queue file metadata and have a Lambda function poll the queue to start the Glue job.
D.Set up an S3 event notification on the bucket to invoke an AWS Lambda function that starts the Glue job.
AnswerD

S3 event notifications can invoke Lambda immediately when a new file is uploaded.

Why this answer

Option D is correct: S3 event notifications can be configured to invoke Lambda when a new object is created, and Lambda can start the Glue job. Option B (CloudWatch Events on a schedule) polls on a timer instead of reacting to object creation. Option C (SQS) is not needed.

Option A (Step Functions polling every minute) adds unnecessary complexity.
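
A minimal sketch of such a Lambda handler follows. The Glue job name and the job argument names are hypothetical; `eu-central-1` is the Frankfurt Region named in the scenario. The boto3 import sits inside the handler so the event-parsing helper can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "ehr-json-etl"  # hypothetical job name


def extract_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        objects.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return objects


def lambda_handler(event, context):
    import boto3  # lazy import: only needed when running inside Lambda

    glue = boto3.client("glue", region_name="eu-central-1")
    run_ids = []
    for bucket, key in extract_objects(event):
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Hypothetical job arguments read by the Glue script.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}
```

The function's execution role would need permission to call `glue:StartJobRun`, and the bucket's notification configuration would need to target the function for object-created events.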

126
MCQmedium

A data engineering team uses AWS Lambda functions to process streaming data from Amazon Kinesis Data Streams and write the results to an S3 bucket. The S3 bucket is encrypted with SSE-KMS using a customer-managed key (CMK). The Lambda function's IAM role has permissions for kms:Decrypt and kms:GenerateDataKey on the CMK. However, the Lambda function fails with an 'Access Denied' error when writing to S3. The S3 bucket policy allows s3:PutObject from the Lambda function's IAM role. What is the most likely cause?

A.The Lambda function's execution role does not have permission to invoke the function.
B.The Lambda function's IAM role is missing the kms:Encrypt permission on the CMK.
C.The S3 bucket policy denies s3:PutObject from the Lambda function.
D.The Kinesis data stream is not encrypted, causing the Lambda function to fail.
AnswerB

Writing encrypted data requires kms:Encrypt.

Why this answer

Option B is correct. When writing to S3 with SSE-KMS, the Lambda function needs kms:Encrypt permission to encrypt the data. The role only has kms:Decrypt and kms:GenerateDataKey, missing kms:Encrypt.

Option D is wrong because Kinesis encryption is separate. Option C is wrong because the bucket policy allows PutObject. Option A is wrong because permission to invoke the function is unrelated to writing objects to S3.

127
Multi-Selecteasy

A data engineer needs to enforce that all data in an Amazon S3 bucket is encrypted at rest. Which of the following can be used to achieve this? (Choose TWO.)

Select 2 answers
A.Use AWS CloudTrail to monitor for unencrypted objects
B.Use VPC endpoints to restrict access
C.Configure a bucket policy to deny PutObject if encryption headers are missing
D.Enable default encryption on the S3 bucket using SSE-S3
E.Use AWS KMS to generate encryption keys for the bucket
AnswersC, D

This policy enforces encryption on uploads.

Why this answer

Options C and D are correct. A bucket policy can deny PutObject requests that lack encryption headers, and default encryption with SSE-S3 ensures objects are encrypted at rest. Option A is wrong because CloudTrail logs API calls but does not enforce encryption.

Option B is wrong because VPC endpoints provide network isolation, not encryption. Option E is wrong because KMS alone does not enforce encryption on S3.
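
As a rough illustration of the deny-on-missing-header bucket policy, the sketch below builds one as a Python dictionary. The helper name and the bucket name `example-data-lake-bucket` are assumptions for illustration only; the `Null` condition on `s3:x-amz-server-side-encryption` is one common way to match requests that omit the header.

```python
import json


def deny_unencrypted_uploads_policy(bucket_name: str) -> dict:
    """Build a bucket policy that rejects PutObject calls lacking an SSE header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # "Null": "true" matches requests where the header is absent.
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


# Hypothetical bucket name, used only to show the resulting document.
policy = deny_unencrypted_uploads_policy("example-data-lake-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON could then be attached to the bucket as its policy; default encryption remains a separate bucket setting.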

128
MCQmedium

A company uses AWS Glue DataBrew to clean and normalize data. The data contains sensitive columns that must be masked before being written to the output. Which DataBrew action should be applied?

A.Apply a Hash transform to the sensitive columns.
B.Apply an Encrypt transform to the sensitive columns.
C.Apply a Delete transform to remove the sensitive columns.
D.Apply a Mask transform to the sensitive columns.
AnswerD

Mask transform obfuscates data, e.g., showing only last 4 digits.

Why this answer

Option D is correct because DataBrew has a built-in 'Mask' transform that can obfuscate sensitive data. Option A is wrong because 'Hash' is for hashing, not masking. Option C is wrong because 'Delete' removes the column entirely.

Option B is wrong because 'Encrypt' is not a DataBrew transform; encryption is handled at the storage layer.

129
MCQmedium

A data engineer is setting up an Amazon S3 lifecycle policy to transition objects to S3 Glacier after 90 days and delete after 365 days. The objects are stored in the S3 Standard storage class. Which lifecycle rule configuration meets the requirements?

A.Transition to Glacier after 90 days and expire after 90 days
B.Transition to Glacier after 90 days and expire after 90 days
C.Transition to Glacier after 365 days and expire after 365 days
D.Transition to Glacier after 90 days and expire after 365 days
AnswerD

Correct timing for transition and deletion.

Why this answer

Option D is correct because the transition to Glacier should happen after 90 days, and expiration after 365 days. Options A and B are wrong because expiration after 90 days deletes the objects too early. Option C is wrong because transition to Glacier after 365 days is too late.
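
The 90-day transition and 365-day expiration can be sketched as a lifecycle configuration dictionary, in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The helper name and the rule ID `archive-then-expire` are assumptions; the empty prefix filter is one way to apply the rule to every object.

```python
def build_lifecycle_configuration(transition_days: int, expiration_days: int) -> dict:
    """Return a lifecycle configuration that archives objects, then expires them."""
    if expiration_days <= transition_days:
        # Expiring on or before the transition day would make the transition pointless.
        raise ValueError("expiration must come after the transition")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire",  # hypothetical rule name
                "Filter": {"Prefix": ""},     # empty prefix: applies to all objects
                "Status": "Enabled",
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expiration_days},
            }
        ]
    }


lifecycle = build_lifecycle_configuration(transition_days=90, expiration_days=365)
print(lifecycle["Rules"][0]["Transitions"], lifecycle["Rules"][0]["Expiration"])
```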

130
MCQhard

A company runs a batch ETL job on Amazon EMR every night. Recently, the job started failing with 'Out of Memory' errors in the Spark executors. The data volume has grown 20% in the past month. The cluster uses uniform instance groups with 5 core nodes of r5.xlarge (4 vCPU, 32 GB RAM). Which change should the data engineer implement to resolve the issue with minimal cost increase?

A.Increase the number of core nodes to 7.
B.Change instance type to r5.2xlarge (8 vCPU, 64 GB RAM) for all nodes.
C.Configure instance fleets to include r5.xlarge and r5.2xlarge instances.
D.Tune Spark memory configurations to reduce executor memory overhead.
AnswerC

Instance fleets allow cost-effective scaling by mixing types.

Why this answer

Option C is correct because using instance fleets allows the cluster to include both r5.xlarge and r5.2xlarge instances, enabling the Spark executors to use the larger instances for memory-intensive tasks while still leveraging the existing r5.xlarge nodes. This provides a cost-effective way to handle the 20% data growth by adding memory capacity without replacing the entire cluster or over-provisioning all nodes. Instance fleets also support Spot Instances, which can further reduce costs while addressing the Out of Memory errors.

Exam trap

The trap here is that candidates often assume increasing the number of nodes (Option A) or tuning Spark memory settings (Option D) can solve memory issues, but they fail to recognize that the root cause is insufficient memory per executor, which is best addressed by adding larger instances via instance fleets to minimize cost increase.

How to eliminate wrong answers

Option A is wrong because simply increasing the number of core nodes to 7 does not increase the memory per executor; it only adds more nodes with the same 32 GB RAM each, which may not resolve the Out of Memory errors if individual executors are hitting their limits due to data skew or large partitions. Option B is wrong because changing all nodes to r5.2xlarge (64 GB RAM) would double the memory per node but also double the cost for the entire cluster, which is not the minimal cost increase solution. Option D is wrong because tuning Spark memory configurations (e.g., reducing executor memory overhead) cannot create additional physical memory; it only reallocates existing memory, which will not resolve the Out of Memory errors if the total available memory is insufficient for the increased data volume.

131
MCQeasy

A company needs to ingest data from multiple on-premises databases into Amazon S3 for analytics. The databases include Oracle, MySQL, and PostgreSQL. The data must be continuously replicated with minimal latency. Which AWS service should be used?

A.AWS Database Migration Service (AWS DMS)
B.Amazon Kinesis Data Streams
C.AWS Snowball
D.AWS Glue
AnswerA

DMS can continuously replicate from multiple source databases to S3.

Why this answer

Option A is correct because AWS DMS supports heterogeneous database migrations and continuous replication to S3. Option D is wrong because Glue is batch-oriented. Option B is wrong because Kinesis Data Streams is for streaming data, not database replication.

Option C is wrong because Snowball is for large offline transfers.

132
MCQhard

A healthcare company uses Amazon RDS for PostgreSQL to store patient records. The database has a size of 1 TB and is running on a db.r5.large instance. The company requires that the database be highly available and have automated backups with point-in-time recovery (PITR) for the last 35 days. The operations team has configured Multi-AZ deployment and automated backups with a 35-day retention period. During a recent disaster simulation, the team attempted to restore the database to a point in time from 30 days ago. The restore operation failed because the backup was not available. On investigation, the team found that the automated backups were being deleted before the retention period ended. The team also noticed that the database has a large number of transaction logs generating a high volume of write activity. What is the most likely cause of the backups being deleted prematurely?

A.The RDS instance was deleted, which automatically deletes all automated backups.
B.The automated backup window was set to a time that conflicted with the database maintenance window.
C.The Multi-AZ deployment was not enabled during the backup process, causing backups to fail.
D.The database had manual snapshots that were deleted manually by the operations team.
AnswerA

When an RDS instance is deleted, automated backups are also deleted unless a final snapshot is taken.

Why this answer

Option A is correct because automated backups are retained based on the backup retention period. However, if the database instance is deleted, all automated backups are also deleted. Option D is wrong because manual snapshots are separate from automated backups.

Option B is wrong because the backup window does not affect retention. Option C is wrong because Multi-AZ automatically performs backups from the standby, but the retention is still enforced.

133
MCQeasy

A company needs to ingest data from an external FTP server into AWS S3. The FTP server is not accessible from the internet. Which AWS service should be used to securely transfer the data?

A.Kinesis Data Firehose
B.AWS Transfer Family with SFTP endpoint in a VPC
C.AWS DataSync
D.AWS Snowball Edge
AnswerB

Transfer Family supports SFTP and can be deployed in a VPC to access private FTP servers.

Why this answer

Option B is correct because AWS Transfer Family supports SFTP and can be used with a VPC endpoint to transfer data from an FTP server in a private network to S3. Option D is wrong because Snowball Edge is for large offline data transfers, not regular FTP transfers. Option C is wrong because DataSync is for moving data between on-premises storage and AWS, but it requires network access.

Option A is wrong because Kinesis Data Firehose is for streaming data, not FTP transfers.

134
MCQeasy

A data engineer is configuring an Amazon S3 lifecycle policy to transition objects to S3 Glacier Deep Archive after 90 days. The bucket receives new objects daily. The engineer wants to ensure that objects are not deleted before 90 days. Which lifecycle action should be used?

A.Expiration
B.Transition
C.NoncurrentVersionTransition
D.AbortIncompleteMultipartUpload
AnswerB

Transition moves objects to a different storage class.

Why this answer

Option B (Transition) is correct because the S3 Lifecycle Transition action moves objects between storage classes over time. To ensure objects are moved to S3 Glacier Deep Archive after 90 days without deletion, a Transition rule is configured to specify the target storage class and the number of days from object creation.

Exam trap

The trap here is confusing Expiration (which deletes objects) with Transition (which moves objects to another storage class), leading candidates to select Expiration when the goal is to retain objects for a minimum period before moving them to archival storage.

How to eliminate wrong answers

Option A (Expiration) is wrong because it permanently deletes objects after a specified number of days, which would remove them before they could be transitioned to Glacier Deep Archive. Option C (NoncurrentVersionTransition) is wrong because it applies only to noncurrent versions of versioned objects, not to current objects in a non-versioned or versioned bucket. Option D (AbortIncompleteMultipartUpload) is wrong because it only aborts incomplete multipart uploads after a specified number of days, not transitioning or deleting complete objects.

135
MCQeasy

A company wants to securely store database credentials used by a Lambda function. Which AWS service should be used to store and rotate the credentials automatically?

A.AWS CloudHSM
B.AWS Secrets Manager
C.AWS Key Management Service (KMS)
D.AWS Systems Manager Parameter Store
AnswerB

Secrets Manager is designed for storing secrets and supports automatic rotation of database credentials.

Why this answer

AWS Secrets Manager is designed for storing secrets and provides automatic rotation. Systems Manager Parameter Store can store secrets but does not natively support automatic rotation for database credentials. KMS is for encryption keys, not storing secrets.

CloudHSM is for hardware security modules.
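
A minimal sketch of how a Lambda function might read rotated credentials at run time, assuming the secret is stored as a JSON string. The function takes the client as a parameter so it can run here against a stub; in practice the client would be a boto3 Secrets Manager client, whose `get_secret_value` call returns the `SecretString` field used below. The secret name `prod/orders-db` and the key names inside the secret are hypothetical.

```python
import json


def load_db_credentials(secrets_client, secret_id: str) -> dict:
    """Fetch a secret and parse its JSON payload into a dictionary."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class StubSecretsClient:
    """Stand-in for a real client so the example runs without AWS access."""

    def get_secret_value(self, SecretId):
        payload = {"username": "app_user", "password": "example-only"}
        return {"SecretString": json.dumps(payload)}


# Hypothetical secret name; with boto3 the client would come from
# boto3.client("secretsmanager") instead of the stub.
creds = load_db_credentials(StubSecretsClient(), "prod/orders-db")
print(creds["username"])
```

Reading the secret on each cold start, rather than baking it into environment variables, lets the function pick up a rotated password without redeployment.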

136
MCQhard

A data engineering team needs to ingest streaming data from thousands of IoT devices. The data must be processed in near real-time and stored in Amazon S3 in Apache Parquet format partitioned by device_id and timestamp. Which combination of services should the team use to minimize operational overhead and cost?

A.Amazon Kinesis Data Streams, Amazon EC2 for processing, and Amazon S3 with lifecycle policies.
B.Amazon MSK (Kafka), AWS Glue Streaming, and Amazon S3.
C.Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and optionally AWS Lambda.
D.Amazon S3 Transfer Acceleration and AWS Lambda for event-driven transformation.
AnswerC

Kinesis provides serverless ingestion and Firehose handles delivery, Parquet conversion, and partitioning.

Why this answer

Option C is correct. Kinesis Data Streams ingests streaming data, Kinesis Data Firehose delivers data to S3 with built-in Parquet conversion and partitioning, and Lambda can optionally be used for lightweight transformations. Option A is wrong because EC2-based processing carries high operational overhead.

Option D is wrong because S3 Transfer Acceleration is for large file transfers, not streaming. Option B is wrong because Amazon MSK (Kafka) and Glue Streaming require more operational overhead.

137
MCQmedium

A company uses AWS Glue to run ETL jobs on data stored in S3. The data is encrypted with SSE-KMS. The Glue job fails with an 'AccessDenied' error when trying to read the data. What is the MOST likely cause?

A.The S3 bucket policy denies access to the Glue service role.
B.The AWS Glue Data Catalog does not have permission to the table.
C.The IAM role used by Glue does not have kms:Decrypt permission on the KMS key.
D.The Glue job's connection does not have the necessary permissions.
AnswerC

Glue needs kms:Decrypt to read SSE-KMS encrypted data.

Why this answer

Option C is correct because Glue jobs need kms:Decrypt permission on the KMS key to read encrypted data. Option A is wrong because the S3 bucket policy may allow access but KMS permission is separate. Option B is wrong because Glue Data Catalog permissions are for the catalog, not data access.

Option D is wrong because Glue connection permissions are for JDBC connections, not S3.
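
To make the separation between S3 and KMS permissions concrete, here is a sketch of an identity policy that grants both. The helper name, the bucket name, and the key ARN (including its account ID and key ID) are placeholders, not values from the question.

```python
def glue_read_policy(bucket_name: str, kms_key_arn: str) -> dict:
    """Build an IAM policy allowing reads of SSE-KMS encrypted objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # S3 permission: lets the role fetch the (encrypted) object.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # KMS permission: lets S3 decrypt the object on the role's behalf.
                "Sid": "DecryptObjects",
                "Effect": "Allow",
                "Action": "kms:Decrypt",
                "Resource": kms_key_arn,
            },
        ],
    }


# Placeholder identifiers for illustration only.
read_policy = glue_read_policy(
    "example-etl-input",
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)
print([s["Action"] for s in read_policy["Statement"]])
```

Without the second statement, the read fails with AccessDenied even though the first statement and the bucket policy both allow s3:GetObject.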

138
MCQhard

Refer to the exhibit. A data engineer runs a Glue job manually and receives a ThrottlingException. The engineer checks the job run history and sees a previous failure with the same error. What is the MOST likely cause of the throttling, and which solution is MOST appropriate?

A.Increase the number of DPUs for the job to reduce runtime.
B.Implement retry logic with exponential backoff in the script that calls start-job-run.
C.Use AWS Glue reserved capacity to guarantee API throughput.
D.Delete old job runs to reduce the number of entries in the job run history.
AnswerB

Exponential backoff handles API rate limits by retrying after a delay, reducing the chance of throttling.

Why this answer

Option B is correct because the error 'Rate exceeded' indicates the Glue API rate limit has been reached. The most likely cause is multiple concurrent job runs or too many API calls. Implementing retry logic with exponential backoff in the job or script that triggers the job will handle transient throttling.

Option A (increasing DPUs) does not affect API rate. Option D (deleting old job runs) might free up quota but is not the best practice. Option C (using reserved capacity) is not applicable for Glue API calls.
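
Retry with exponential backoff can be sketched in a few lines. This version is deliberately generic: `ThrottledError` is a hypothetical stand-in for whatever throttling exception the calling script receives, and `operation` is any zero-argument callable, such as a wrapper around the call that starts the job run. The delay values are assumptions, and random jitter is added so that concurrent callers do not retry in lockstep.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by an API client."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on throttling with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay ceiling doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; production callers would leave it at the default.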

139
MCQhard

A data engineer is using AWS DMS to migrate a 2 TB Oracle database to Amazon Aurora PostgreSQL. The migration is running in full load mode with ongoing replication. After the full load completes, the ongoing replication task shows a 'TargetMetadata' error: 'ERROR: duplicate key value violates unique constraint'. The engineer verifies that the target table already contains the data. What should the engineer do to resolve this issue?

A.Enable 'BatchApplyEnabled' and set 'TaskRecoveryTableEnabled' to false in the task settings.
B.Disable the unique constraint on the target table.
C.Truncate the target table and restart the full load.
D.Drop the indexes on the target table and recreate them after the migration.
AnswerA

Batch apply minimizes duplicate key errors, and disabling recovery table prevents re-application of already-applied changes.

Why this answer

Option A is correct because the error indicates that the task is trying to insert rows that already exist. Enabling 'BatchApplyEnabled' and setting 'TaskRecoveryTableEnabled' to false prevents duplicate handling errors during replication. Option C is wrong because truncating would lose data.

Option B is wrong because disabling constraints may cause data integrity issues. Option D is wrong because dropping indexes does not prevent duplicate key violations.

140
Multi-Selectmedium

A data engineer is troubleshooting a slow-running Amazon Athena query on a large dataset stored in S3. The query scans many small files. Which TWO actions can improve query performance?

Select 2 answers
A.Increase the number of files to increase parallelism
B.Disable S3 server-side encryption
C.Concatenate small files into larger files
D.Partition the data by a frequently filtered column
E.Convert files from CSV to JSON
AnswersC, D

Reduces file open overhead.

Why this answer

Option C is correct because compacting small files into larger ones reduces per-file overhead. Option D is correct because partitioning limits the data scanned. Option E is wrong because JSON, like CSV, is a row-based text format; converting to a columnar format such as Parquet would help, but converting to JSON does not.

Option A is wrong because more files increase overhead. Option B is wrong because disabling encryption does not improve query performance.
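
One way to picture compaction is as a planning step that groups small files into batches, each of which would be rewritten as a single larger file. The sketch below uses a simple greedy pass; the 128 MB target and the file names are assumptions for illustration, and the actual rewrite (for example with a Glue or Spark job) is out of scope.

```python
def plan_compaction(files, target_mb=128):
    """Group (name, size_mb) pairs into batches of roughly target_mb each."""
    batches, current, current_size = [], [], 0
    for name, size_mb in files:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size_mb > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size_mb
    if current:
        batches.append(current)  # flush the final, possibly partial batch
    return batches


# Hypothetical file listing: four small objects become two output files.
small_files = [("a.csv", 60), ("b.csv", 60), ("c.csv", 60), ("d.csv", 10)]
print(plan_compaction(small_files))
```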

141
MCQmedium

A media company ingests video files from content partners into an Amazon S3 bucket. Each video file is 10-50 GB. Upon upload, an AWS Lambda function is triggered to extract metadata (e.g., resolution, duration) and store it in DynamoDB. The company now wants to also generate a thumbnail image for each video. The thumbnail generation is CPU-intensive and can take up to 5 minutes per video. The Lambda function has a maximum execution time of 15 minutes. The company has noticed that some thumbnail generation tasks are timing out. What should the data engineer do to reliably generate thumbnails for all videos?

A.Provision an EC2 instance to run a script that polls S3 for new videos and generates thumbnails
B.Use AWS Glue with a Python shell job to generate thumbnails
C.Increase the Lambda timeout to 15 minutes and allocate more memory
D.Use AWS Batch to run a containerized thumbnail generation job triggered by S3 events
AnswerD

Batch is optimized for batch computing and can handle long-running jobs.

Why this answer

Option D is correct because AWS Batch is designed for long-running, compute-intensive jobs like video processing. Option C (increase Lambda timeout) is not a good practice for such heavy tasks. Option A (EC2 instance) requires manual management.

Option B (Glue) is for ETL, not video processing.

142
MCQeasy

A company needs to store application log files for 90 days for compliance. The logs are generated continuously and are rarely accessed after 30 days. The data engineer must minimize storage costs. Which storage solution should the engineer choose?

A.Amazon CloudWatch Logs with a retention policy of 90 days
B.Amazon S3 Glacier Deep Archive
C.Amazon EBS gp3 volumes attached to an EC2 instance
D.Amazon S3 Standard with a lifecycle policy to transition to S3 Standard-IA after 30 days and expire after 90 days
AnswerD

This minimizes cost by using cheaper storage for infrequently accessed data and deleting after compliance period.

Why this answer

Amazon S3 Standard with a lifecycle policy to transition to S3 Standard-IA after 30 days and expire after 90 days is correct because it aligns with the access pattern: logs are frequently accessed only in the first 30 days, then rarely accessed for the remaining 60 days. S3 Standard-IA offers lower storage costs for infrequently accessed data while still providing millisecond retrieval, and the lifecycle policy automates the transition and eventual deletion, minimizing costs without sacrificing availability.

Exam trap

The trap here is that candidates often choose CloudWatch Logs (Option A) because it is a familiar logging service, but they overlook that its cost model (per GB ingested, per GB stored, and per GB archived) can be significantly higher than S3 for long-term retention of large log volumes, and it lacks the automated tiering to lower-cost storage classes.

How to eliminate wrong answers

Option A is wrong because Amazon CloudWatch Logs is designed for real-time monitoring and log ingestion, not for long-term, cost-optimized archival storage; its retention policy only controls deletion, not tiered storage transitions, and costs can be higher than S3 for large volumes of rarely accessed logs. Option B is wrong because S3 Glacier Deep Archive is intended for data that is accessed at most once or twice a year and has retrieval times of 12 hours or more, making it unsuitable for logs that may need occasional access within 90 days; it also incurs minimum storage charges that make it cost-ineffective for short retention periods. Option C is wrong because EBS gp3 volumes attached to an EC2 instance incur compute costs even when idle, and managing log storage on block storage requires manual lifecycle management, leading to higher operational overhead and cost compared to a fully managed object storage solution.

143
MCQmedium

A company is using Amazon Kinesis Data Streams with a Lambda consumer to process clickstream data. The data rate is high and the Lambda function is falling behind, resulting in increased processing latency. What is the MOST effective way to improve throughput?

A.Increase the memory allocated to the Lambda function.
B.Increase the Lambda function timeout.
C.Use Kinesis Data Firehose instead of Lambda.
D.Increase the number of shards in the Kinesis stream.
AnswerD

More shards increase parallelism and throughput.

Why this answer

Option D is correct because increasing the number of shards increases the stream's capacity and allows more concurrent Lambda invocations. Option A is incorrect because increasing Lambda memory may help but not as much as more shards. Option B is incorrect because a longer timeout does not make the function process records any faster.

Option C is incorrect because Firehose would add a separate step.
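
A back-of-the-envelope shard estimate helps show why shard count drives throughput. The sketch assumes a provisioned-mode stream with the standard per-shard write limits of 1 MB/s and 1,000 records/s; the traffic figures in the usage lines are made up. With the default settings, Lambda runs one concurrent invocation per shard, so the shard count also bounds consumer parallelism.

```python
import math

# Per-shard write limits for a provisioned-mode stream.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate the minimum shard count for a given ingest rate."""
    by_bytes = math.ceil(mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # Whichever limit is hit first determines the shard count.
    return max(1, by_bytes, by_records)


# Hypothetical workloads: one bound by bytes, one bound by record count.
print(shards_needed(4.5, 2000))
print(shards_needed(0.5, 3500))
```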

144
Multi-Selecteasy

A company wants to enforce encryption in transit for data moving between an EC2 instance and an S3 bucket. Which TWO methods can achieve this? (Choose 2)

Select 2 answers
A.Add a bucket policy that denies requests without the aws:SecureTransport condition.
B.Use a VPC endpoint for S3.
C.Enable default SSE-S3 encryption on the bucket.
D.Use the HTTPS endpoint for S3 API calls.
E.Enable CloudTrail to monitor for non-encrypted requests.
AnswersA, D

This enforces HTTPS.

Why this answer

Options A and D are correct. Option A: an S3 bucket policy that denies requests failing the aws:SecureTransport condition ensures that only HTTPS requests are accepted. Option D: using the HTTPS endpoint encrypts the data in transit.

Option C is wrong because SSE-S3 is for at-rest encryption. Option B is wrong because VPC endpoints do not automatically enforce encryption. Option E is wrong because CloudTrail is for logging.
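
The aws:SecureTransport condition can be sketched as a bucket policy built in Python. The helper name and the bucket name `example-secure-bucket` are assumptions for illustration; the statement denies every S3 action on the bucket and its objects when the request did not arrive over TLS.

```python
def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that rejects any request not sent over HTTPS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both bucket-level and object-level operations.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" for plain HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name.
tls_policy = deny_insecure_transport_policy("example-secure-bucket")
print(tls_policy["Statement"][0]["Condition"])
```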

145
Multi-Selectmedium

Which TWO actions are recommended for securing data at rest in Amazon S3? (Choose two.)

Select 2 answers
A.Enable default encryption on the S3 bucket using SSE-S3 or SSE-KMS.
B.Use S3 Bucket Key to reduce KMS request costs.
C.Enable S3 Versioning to protect against accidental deletions.
D.Apply a bucket policy that denies PutObject requests without the x-amz-server-side-encryption header.
E.Configure cross-region replication to replicate data to another bucket.
AnswersA, D

Ensures all new objects are encrypted automatically.

Why this answer

Option A is correct because enabling default encryption on an S3 bucket using SSE-S3 or SSE-KMS ensures that all objects stored in the bucket are encrypted at rest automatically, even if the upload request does not include encryption headers. This satisfies the requirement for securing data at rest by applying server-side encryption to every object written to the bucket.

Exam trap

The trap here is that candidates often confuse data protection features like Versioning or replication with encryption controls, but the question specifically asks for securing data at rest, which requires encryption mechanisms such as default encryption or policy-enforced encryption headers.

146
MCQeasy

A data engineer is designing a disaster recovery strategy for an Amazon Redshift data warehouse. The RPO (Recovery Point Objective) is 1 hour, and the RTO (Recovery Time Objective) is 2 hours. Which approach meets these requirements with the least operational overhead?

A.Set up a secondary Redshift cluster in another region and use AWS DMS for continuous replication.
B.Configure automated cross-region snapshot copy to another region.
C.Export the data to S3 daily using UNLOAD and copy to another region.
D.Enable automated snapshots with a retention period of 1 day.
AnswerB

Cross-region snapshots protect against regional failures and can restore quickly.

Why this answer

Option B is correct because automated snapshots to S3 with cross-region copy provide up-to-the-hour recovery and fast restore. Option D is incorrect because automated snapshots are enabled by default but are not copied to another region. Options A and C have higher overhead and may not meet the RTO.

147
Multi-Selectmedium

A data engineer is troubleshooting an Amazon Redshift cluster that has experienced a node failure. The engineer needs to ensure that the cluster is highly available and can withstand a single node failure without downtime. Which TWO actions should the engineer take?

Select 2 answers
A.Enable automated snapshots with cross-region copy.
B.Enable concurrency scaling to handle increased read traffic.
C.Deploy the cluster as a single-node cluster for simplicity.
D.Place the cluster in a public subnet with an internet gateway.
E.Use a multi-node cluster with RA3 node types.
AnswersA, E

Cross-region snapshots allow recovery from region failures.

Why this answer

Options A and E are correct. A multi-node cluster with RA3 node types provides managed storage and resilience. Enabling cross-region snapshot copy ensures data can be restored in another region.

Option B is incorrect because concurrency scaling does not provide high availability. Option C is incorrect because single-node clusters are not highly available. Option D is incorrect because placing the cluster in a public subnet does not affect node failure.

148
MCQhard

A data engineer is designing a data ingestion pipeline for clickstream data that arrives in bursts, up to 100 MB/s, and must be processed with exactly-once semantics. The data must be stored in Amazon S3 partitioned by event date and hour. Which combination of services should the engineer use?

A.Amazon Kinesis Data Streams with AWS Lambda consumer writing to S3.
B.Amazon Kinesis Data Firehose with S3 destination and dynamic partitioning.
C.AWS Glue streaming ETL job reading from Amazon MSK and writing to S3.
D.Amazon Kinesis Data Streams with KCL application writing to S3.
AnswerB

Firehose handles bursts and supports partitioning with no custom code.

Why this answer

Amazon Kinesis Data Firehose can buffer and batch data to S3 with partitioning. Option B is correct. Option A is wrong because Lambda cold starts can cause latency.

Option C is wrong because it requires running an Amazon MSK cluster and a Glue streaming job, which adds operational overhead. Option D is wrong because KCL requires a custom application.
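
To show what "partitioned by event date and hour" looks like in S3, the sketch below derives a Hive-style key prefix from an event timestamp. The base prefix `clickstream` and the partition names `event_date` and `hour` are assumptions; the same layout could equally be produced by a delivery service's partitioning settings rather than by code.

```python
from datetime import datetime, timezone


def partition_prefix(event_time: datetime, base_prefix: str = "clickstream") -> str:
    """Return the S3 key prefix for an event, partitioned by UTC date and hour."""
    # Normalize to UTC so events land in the same partition regardless of
    # the producer's time zone. Pass timezone-aware datetimes.
    utc = event_time.astimezone(timezone.utc)
    return f"{base_prefix}/event_date={utc:%Y-%m-%d}/hour={utc:%H}/"


# Hypothetical event timestamp.
example_event = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
print(partition_prefix(example_event))
```

Query engines such as Athena can then prune partitions when a query filters on `event_date` or `hour`.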

149
MCQmedium

A data engineer needs to set up a data catalog for a new data lake in AWS Glue. The data resides in S3 in Parquet format. The engineer wants to ensure that the schema is automatically detected and updated when new columns are added to the data. Which configuration should the engineer use?

A.Add a partition index to the Glue Data Catalog table.
B.Configure the crawler's 'Schema updates' option to 'Update the table schema'.
C.Set the crawler's 'Database' output to a new database.
D.Enable partition indexing on the table.
AnswerB

This enables automatic schema detection and updates.

Why this answer

Option B is correct because Glue crawlers can be configured to update the table schema when new columns are detected. Option A is wrong because adding a partition index does not update the schema. Option C is wrong because changing the crawler's output database does not affect how schema changes are handled.

Option D is wrong because partition indexing does not update the schema.

150
MCQhard

A data team runs a daily AWS Glue ETL job that processes data from an Amazon Redshift cluster and writes results to Amazon S3. The job completes successfully but takes 2 hours longer than expected. The job uses the JDBC connection to Redshift. The Redshift cluster is 4 dc2.large nodes. The Glue job has 10 workers of type G.1X. Which change would MOST likely reduce the job duration?

A.Use Redshift Spectrum to query data directly from S3
B.Use the S3 staging option in the Glue connection to unload data from Redshift to S3 first
C.Increase the Redshift cluster size to 8 nodes
D.Increase the number of Glue workers to 20
AnswerB

UNLOAD is parallel and faster than JDBC; Glue can then read from S3.

Why this answer

The JDBC connection in AWS Glue reads data row-by-row from Redshift, which is slow for large datasets. By enabling the S3 staging option in the Glue connection, the job uses Redshift's UNLOAD command to export data to S3 in parallel, then Glue reads from S3. This bypasses the JDBC bottleneck and leverages Redshift's massively parallel processing (MPP) to export data much faster.

Exam trap

The trap here is that candidates assume the bottleneck is either Redshift compute (C) or Glue parallelism (D), when in fact the JDBC driver's single-threaded row-by-row fetch is the primary performance limiter.

How to eliminate wrong answers

Option A is wrong because Redshift Spectrum queries data directly from S3, but the source data is in Redshift, not S3; Spectrum does not help extract data from Redshift. Option C is wrong because the bottleneck is the JDBC connection, not Redshift compute capacity; adding more Redshift nodes would not speed up a single-threaded JDBC read. Option D is wrong because increasing Glue workers only helps if the job is CPU-bound or parallelizable; the JDBC read is I/O-bound and limited by the single connection, so more workers would not reduce the 2-hour delay.
