Knowledge + Practice

AWS Certified Data Engineer Associate (DEA-C01): Questions 901–975

1786 questions total · 24 pages · All types, answers revealed


Page 13 of 24

901

Multi-Select · medium

A company is building a data lake on Amazon S3. The data sources include relational databases, streaming data, and log files. The data engineer needs to ensure that the data ingestion pipeline can handle schema evolution, support both batch and streaming, and provide a unified metadata catalog. Which THREE services should the engineer use? (Choose three.)

Select 3 answers

A.AWS Glue

B.Amazon DynamoDB

C.Amazon Athena

D.Amazon S3

E.Amazon Kinesis Data Firehose

Answers: A, D, E

Provides schema discovery, catalog, and batch ETL.

Why this answer

Options A, D, and E are correct. AWS Glue provides a metadata catalog and ETL for batch. Kinesis Data Firehose handles streaming ingestion to S3.

S3 stores the data. Option B is wrong because DynamoDB is not used for data lake ingestion. Option C is wrong because Athena is a query service, not an ingestion service.


902

MCQ · medium

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started timing out. Which step should be taken to resolve this issue?

A.Increase the Lambda function timeout

B.Set the Lambda reserved concurrency to 1

C.Decrease the batch size in the event source mapping

D.Increase the number of shards in the Kinesis stream

Answer: A

Allows the function to run longer without timing out.

Why this answer

Option A is correct because increasing the Lambda timeout gives the function more time to process each batch of records. Option D is wrong because adding shards increases stream throughput but does not give an invocation more time to finish. Option B is wrong because it limits concurrency but does not extend processing time.

Option C is wrong because decreasing batch size reduces records per invocation but does not extend timeout.


903

MCQ · easy

A data engineer needs to transform CSV files to Parquet format using AWS Glue. The source data contains sensitive columns that must be masked. Which Glue feature should be used?

A.AWS Glue DataBrew

B.AWS Glue Studio

C.AWS Glue Crawler

D.AWS Glue Schema Registry

Answer: A

DataBrew provides visual data preparation with built-in masking.

Why this answer

Option A is correct because Glue DataBrew provides built-in transformations such as masking. Option B is wrong because Glue Studio is for building visual ETL jobs and is not specifically designed for masking. Option C is wrong because Glue crawlers catalog data; they do not transform it.

Option D is wrong because Glue Schema Registry manages schemas.


904

MCQ · easy

A data engineer notices that an Amazon Kinesis Data Firehose delivery stream is failing to deliver data to an Amazon S3 bucket. The CloudWatch metrics show 'DeliveryToS3.Success' is 0 and 'S3.BucketExists' is 1. What is the MOST likely cause?

A.The S3 bucket has an ACL that denies access to Firehose.

B.The Firehose delivery stream Lambda transformation function is failing.

C.The IAM role for Firehose lacks s3:PutObject permission.

D.The S3 bucket does not exist.

Answer: C

Write permission is required for delivery.

Why this answer

The metric 'S3.BucketExists' is 1, confirming the S3 bucket exists, so the issue is not bucket existence. With 'DeliveryToS3.Success' at 0, the failure is in the write operation. The IAM role assumed by Firehose must have the s3:PutObject permission to deliver data; lacking it would cause all delivery attempts to fail silently, matching the observed metrics.

Exam trap

The trap here is that candidates may confuse 'S3.BucketExists' with successful delivery, or assume a missing bucket is the issue when the metric clearly shows the bucket exists, leading them to overlook the IAM permission gap.

How to eliminate wrong answers

Option A is wrong because S3 bucket ACLs are not evaluated when the IAM role grants the s3:PutObject permission via a bucket policy or identity-based policy; ACLs are legacy and Firehose uses IAM for authorization. Option B is wrong because a failing Lambda transformation function would cause 'DeliveryToS3.Success' to be 0 only if the transformation is mandatory, but the metric 'S3.BucketExists' would still be 1, and the failure would be logged as 'Lambda.ExecutionErrors' or similar, not directly as a delivery failure. Option D is wrong because 'S3.BucketExists' is 1, which explicitly indicates the bucket exists, so the bucket not existing cannot be the cause.


905

Multi-Select · hard

A company uses AWS Glue to run ETL jobs that transform data from Amazon S3 (Parquet) into a denormalized format for Amazon Redshift. The Glue job uses the DynamicFrame API. The job is failing with a 'MemoryError' when performing a join operation. The data is skewed on the join key. Which THREE actions can reduce memory usage and improve job stability? (Choose THREE.)

Select 3 answers

A.Use a broadcast join if one of the tables is small enough.

B.Use a salted join key to distribute skewed keys across partitions.

C.Increase the number of DPUs for the Glue job.

D.Repartition the data on the join key before the join operation.

E.Split the transformation into multiple Glue job steps to reduce per-step memory.

Answers: A, B, E

Avoids shuffling small table.

Why this answer

Options A, B, and E are correct. Using a broadcast join for a small table avoids shuffling it, salting the join key distributes the skewed keys across partitions, and splitting the job into multiple steps reduces memory per stage. Option C is wrong because increasing DPUs may help but is not a targeted solution.

Option D is wrong because repartitioning on the join key does not fix skew and may increase overhead.
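The idea behind a salted join key can be sketched in plain Python, without Spark or Glue. This is only an illustration under assumptions: the function names, the sample rows, and the salt count of 4 are hypothetical, and a real Glue job would express the same steps with DataFrame operations. The large side gets a salt appended to each key, the small side is replicated once per salt, and the join runs on the (key, salt) pair so the hot key no longer lands in a single bucket.

```python
from collections import defaultdict

NUM_SALTS = 4  # assumed salt count; tune to the observed skew


def plain_join(big, small):
    """Hash join on the raw key: all rows for a key share one bucket."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return sorted((k, b, s) for k, b in big for s in lookup.get(k, []))


def salted_join(big, small, num_salts=NUM_SALTS):
    """Join on (key, salt) so a hot key is spread over num_salts buckets."""
    # Large side: attach a salt to every row (round-robin here; random also works).
    salted_big = [((key, i % num_salts), value) for i, (key, value) in enumerate(big)]
    # Small side: replicate each row once per salt so every salted key finds a match.
    lookup = defaultdict(list)
    for key, value in small:
        for salt in range(num_salts):
            lookup[(key, salt)].append(value)
    joined = sorted((k[0], b, s) for k, b in salted_big for s in lookup.get(k, []))
    # Count large-side rows per join bucket to show the effect on skew.
    bucket_sizes = defaultdict(int)
    for k, _ in salted_big:
        bucket_sizes[k] += 1
    return joined, dict(bucket_sizes)


big = [("hot", n) for n in range(8)] + [("cold", 100)]
small = [("hot", "H"), ("cold", "C")]
joined, buckets = salted_join(big, small)
print(max(buckets.values()))  # largest bucket after salting
```

The result is the same set of joined rows as the unsalted join, but the eight "hot" rows are split across four buckets instead of one. The cost is that the small side grows by a factor equal to the salt count.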


906

MCQ · hard

Refer to the exhibit. A data engineer is setting up an Amazon Kinesis Data Firehose delivery stream that writes to an S3 bucket named 'example-bucket'. The IAM role assumed by Firehose has the attached policy shown. When testing, the Firehose delivery stream fails with an access denied error. What is the most likely cause?

A.The S3 bucket has server-side encryption enabled that needs additional permissions.

B.The IAM role does not have permission to use AWS KMS keys.

C.The bucket policy denies access from the Firehose service principal.

D.The IAM policy is missing the s3:AbortMultipartUpload and s3:ListBucket actions.

Answer: D

Firehose uses multipart uploads and needs these permissions.

Why this answer

Option D is correct because Kinesis Data Firehose requires the s3:AbortMultipartUpload and s3:ListBucket permissions to write to S3. The policy only allows s3:PutObject and s3:GetObject. Option A is wrong because SSE is not required.

Option C is wrong because the bucket policy is not shown to be restrictive. Option B is wrong because encryption keys are not mentioned.
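As a rough illustration, the policy gap can be checked programmatically. The sketch below assumes the exhibit grants only s3:PutObject and s3:GetObject, as the explanation states; the helper name `missing_actions` is hypothetical, and the fuller action list follows the S3 permissions AWS documents for Firehose delivery roles.

```python
BUCKET = "example-bucket"  # bucket name taken from the question

# A policy resembling the exhibit as described: only PutObject and GetObject.
exhibit_like_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject"],
        "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
    }],
}

# A fuller policy covering the S3 actions a Firehose delivery role typically needs.
firehose_s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:AbortMultipartUpload",
            "s3:GetBucketLocation",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:ListBucketMultipartUploads",
            "s3:PutObject",
        ],
        # Bucket-level actions need the bucket ARN; object-level actions need the /* ARN.
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
    }],
}


def missing_actions(policy, required):
    """Return the required actions that no Allow statement in the policy grants."""
    granted = {
        action
        for statement in policy["Statement"]
        if statement["Effect"] == "Allow"
        for action in statement["Action"]
    }
    return sorted(set(required) - granted)


required = ["s3:AbortMultipartUpload", "s3:ListBucket"]
print(missing_actions(exhibit_like_policy, required))
print(missing_actions(firehose_s3_policy, required))
```

The helper ignores wildcards, Deny statements, and conditions, so it is a reading aid for this question and not a policy evaluator.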


907

MCQ · easy

A data engineer needs to transform JSON data from Amazon S3 into Parquet format using AWS Glue. The source files are in a bucket with thousands of small files. What is the best practice to optimize the Glue job performance?

A.Convert the JSON files to CSV before processing with Glue.

B.Enable 'Group small files' in the Glue job or use a DynamicFrame with coalesce.

C.Use an AWS Lambda function to pre-process the files.

D.Increase the number of DPUs to the maximum.

Answer: B

Grouping reduces the number of tasks and improves performance.

Why this answer

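Grouping small files into fewer partitions reduces per-file task overhead, so Option B is correct. Option D is wrong because increasing DPUs may not help when the bottleneck is the number of small files.

Option A is wrong because converting to CSV adds an extra step and does not reduce the file count. Option C is wrong because a Lambda pre-processing step adds custom code to maintain when Glue can group the files itself.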
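File grouping is configured through the S3 connection options passed when the DynamicFrame is created. The sketch below only builds that options dictionary; the helper name, the bucket path, and the 128 MB group size are assumptions for illustration, while `groupFiles` and `groupSize` are the Glue S3 connection option names.

```python
def grouped_s3_options(path: str, group_size_mb: int = 128) -> dict:
    """Build Glue S3 connection options that group small files into larger reads."""
    return {
        "paths": [path],
        "recurse": True,
        # Group files within each S3 partition into a single in-memory partition.
        "groupFiles": "inPartition",
        # Target group size in bytes, passed as a string.
        "groupSize": str(group_size_mb * 1024 * 1024),
    }


options = grouped_s3_options("s3://example-source-bucket/json/")  # hypothetical path
print(options["groupSize"])

# Inside a Glue job (awsglue is only available in the Glue runtime):
# frame = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3", connection_options=options, format="json")
```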


908

Multi-Select · hard

A company is migrating its data warehouse from on-premises to Amazon Redshift. The migration involves copying 50 TB of data from an S3 bucket to Redshift. The network bandwidth is limited to 1 Gbps. Which TWO approaches should the team use to complete the transfer within 7 days?

Select 2 answers

A.Use Amazon S3 Transfer Acceleration

B.Use AWS Direct Connect with 10 Gbps

C.Use AWS Snowball Edge to transfer the data to S3

D.Use AWS Lambda to copy data in parallel

E.Use Amazon Kinesis Data Firehose

Answers: A, C

S3 Transfer Acceleration can speed up uploads over the network.

Why this answer

C (Snowball Edge) allows physical transfer of the data, bypassing network limitations. A (S3 Transfer Acceleration) optimizes network transfer but may not be sufficient alone. B (Direct Connect) helps but is still limited by 1 Gbps.

E (Kinesis Data Firehose) is for streaming. D (Lambda) is not for bulk transfer.
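A quick back-of-the-envelope calculation shows how tight the 7-day window is on a 1 Gbps link. The sketch assumes decimal units (1 TB = 10^12 bytes) and a sustained utilization factor; the 60% figure is an illustrative assumption, not a measured value.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move data_tb terabytes over a link at the given utilization."""
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                        # seconds per day


full_rate = transfer_days(50, 1)                       # ideal, 100% of line rate
at_60_percent = transfer_days(50, 1, utilization=0.6)  # assumed realistic utilization
print(f"{full_rate:.2f} days at line rate, {at_60_percent:.2f} days at 60%")
```

At full line rate the transfer fits within the window, but it misses the deadline once sustained throughput drops to roughly two thirds of the link, which is why network-only options are treated as insufficient on their own.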


909

MCQ · easy

A company needs to ingest data from an on-premises database to Amazon S3 with minimal impact on the source database. The data volume is several TB. Which AWS service is best suited for this task?

A.AWS Direct Connect

B.AWS Snowball Edge

C.AWS Database Migration Service (DMS)

D.Amazon S3 Transfer Acceleration

Answer: C

DMS can migrate data from on-premises to S3 with minimal impact using CDC.

Why this answer

AWS DMS can perform a full load with minimal impact on the source database. Option A (Direct Connect) is a network connection, not an ingestion service; Option B (Snowball Edge) is for offline transfer; Option D (S3 Transfer Acceleration) only speeds up uploads to S3.


910

MCQ · easy

A company uses Amazon DynamoDB to store session data for a web application. The application experiences sudden spikes in traffic, causing occasional throttling errors. The data engineer needs to handle these spikes without over-provisioning capacity. What is the MOST cost-effective solution?

A.Set up a TTL (Time to Live) attribute to automatically delete old session data.

B.Enable DynamoDB Accelerator (DAX) to cache read requests.

C.Configure DynamoDB Auto Scaling to automatically adjust provisioned capacity.

D.Switch to DynamoDB on-demand mode.

Answer: C

Auto Scaling dynamically adapts to traffic patterns, preventing throttling and reducing cost.

Why this answer

DynamoDB Auto Scaling (option C) is the most cost-effective solution because it automatically adjusts the provisioned read/write capacity based on actual traffic patterns, handling sudden spikes without manual intervention or over-provisioning. This avoids paying for unused capacity during low-traffic periods while still accommodating bursts within the configured limits.

Exam trap

The trap here is that candidates often confuse DynamoDB on-demand mode (option D) as the default solution for unpredictable traffic, but the exam tests cost optimization—on-demand is premium-priced per request, while Auto Scaling with provisioned capacity is more cost-effective for workloads with variable but not extreme traffic patterns.

How to eliminate wrong answers

Option A is wrong because TTL (Time to Live) only deletes expired session data to reduce storage costs and stale items, but it does not address throttling errors caused by capacity limits during traffic spikes. Option B is wrong because DynamoDB Accelerator (DAX) is an in-memory cache that improves read performance and reduces read throttling, but it does not help with write throttling or capacity management for sudden spikes. Option D is wrong because DynamoDB on-demand mode automatically scales to handle any traffic level, but it is significantly more expensive for predictable or steady-state workloads compared to provisioned capacity with Auto Scaling, making it less cost-effective for this use case.


911

MCQ · medium

A data engineer needs to transform CSV files arriving in S3 into Parquet format and partition them by date. The transformation should be event-driven and run immediately after each file is uploaded. Which approach is most efficient?

A.Use S3 event notification to trigger an AWS Glue job

B.Use an S3 event notification to invoke a Lambda function that converts the file

C.Use an Amazon EMR cluster running Spark to process files as they arrive

D.Use Amazon Athena CREATE TABLE AS SELECT (CTAS) on a schedule

Answer: A

Glue jobs can be triggered by S3 events and efficiently convert to Parquet with partitioning.

Why this answer

Option A (use an S3 event notification to trigger an AWS Glue job) is correct because Glue can be triggered by S3 events and supports Parquet conversion and partitioning. Option B (AWS Lambda) has a 15-minute timeout and may not handle large files. Option D (Amazon Athena CTAS) runs on a schedule, so it is not event-driven and does not run on upload.

Option C (Amazon EMR) is overkill for this use case.
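In practice an S3 event usually reaches a Glue job through an intermediary, such as an EventBridge rule or a small function that calls the Glue StartJobRun API. The sketch below shows only the first half of that path: pulling the bucket and key out of an S3 event notification and turning them into job arguments. The function name, the `--source_path` argument, and the sample event are hypothetical.

```python
from urllib.parse import unquote_plus


def glue_arguments_from_s3_event(event: dict) -> list:
    """Build one Glue job-argument dict per object in an S3 event notification."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Glue job arguments are passed as a dict with "--"-prefixed names.
        runs.append({"--source_path": f"s3://{bucket}/{key}"})
    return runs


sample_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "example-landing-bucket"},
            "object": {"key": "incoming/2024-05-07/orders+1.csv"},
        }
    }]
}
runs = glue_arguments_from_s3_event(sample_event)
print(runs)
```

Each dictionary in `runs` could then be passed as the `Arguments` parameter of a StartJobRun call, one job run per uploaded file.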


912

MCQ · hard

A healthcare company processes patient records in near-real-time using Amazon Kinesis Data Streams. Each record contains sensitive personal health information (PHI). The data must be encrypted at rest and in transit. The company also needs to audit access to the data. The data engineer is designing the ingestion pipeline. Which combination of services and configurations meets these requirements?

A.Use Kinesis Data Firehose to deliver data to S3 with SSE-S3, and enable CloudTrail for S3.

B.Use Kinesis Data Streams with TLS and enable CloudTrail for auditing. Do not enable SSE.

C.Use Kinesis Data Streams with SSE-KMS and TLS, and enable CloudTrail for data events.

D.Use Kinesis Data Streams with SSE-KMS and TLS. Do not enable any auditing.

Answer: C

Provides encryption at rest and in transit, plus auditing.

Why this answer

Option C is correct. Kinesis Data Streams supports server-side encryption (SSE) using AWS KMS for at-rest encryption, and TLS for in-transit. CloudTrail can log Kinesis API calls for auditing.

Option B lacks encryption at rest. Option D lacks auditing. Option A is wrong because S3 does not replace Kinesis Data Streams for streaming.


913

MCQ · easy

A data engineer needs to ingest data from an external HTTP API into Amazon S3. The API returns JSON data for a list of users, updated hourly. The engineer wants to use a serverless solution with minimal operational overhead. Which AWS service should the engineer use?

A.Amazon Kinesis Data Firehose with a custom HTTP endpoint.

B.AWS Lambda function triggered by CloudWatch Events.

C.Amazon AppFlow with an HTTP connector on a scheduled flow.

D.AWS Glue ETL job triggered by EventBridge.

Answer: C

AppFlow is serverless and designed for API ingestion.

Why this answer

Option C is correct because Amazon AppFlow can connect to HTTP APIs and load data to S3 on a schedule. Option A is wrong because Kinesis Data Firehose requires a stream source. Option D is wrong because a Glue ETL job can do it but requires more configuration.

Option B is wrong because, although CloudWatch Events can schedule the function, the Lambda code that calls the API and writes to S3 still has to be written and maintained.


914

MCQ · easy

A data engineer needs to ingest on-premises CSV files into Amazon S3 every hour. The files are less than 1 GB each. Which service is the most cost-effective and requires the least operational overhead?

A.AWS DataSync

B.Amazon Kinesis Data Firehose

C.AWS Snowball Edge

D.AWS Database Migration Service (DMS)

Answer: A

DataSync automates scheduled transfers from on-premises to S3.

Why this answer

Option A is correct because AWS DataSync is designed for scheduled transfers from on-premises to S3 with minimal setup. Option C is wrong because Snowball Edge is for large, offline transfers. Option B is wrong because Kinesis Data Firehose is for streaming data.

Option D is wrong because DMS is for database migrations.


915

MCQ · medium

A data engineer needs to store JSON documents that are accessed by a key-value pattern. The workload requires single-digit millisecond latency at any scale. Which AWS service is most appropriate?

A.Amazon DocumentDB (with MongoDB compatibility)

B.Amazon RDS for PostgreSQL

C.Amazon DynamoDB

D.Amazon Neptune

Answer: C

Key-value and document store with consistent low latency.

Why this answer

Amazon DynamoDB is a key-value and document database that provides single-digit millisecond latency at any scale. RDS is relational, Neptune is a graph database, and DocumentDB is MongoDB-compatible but may not guarantee the same latency at extreme scale.


916

MCQ · medium

A data engineer is designing a data lake on Amazon S3. The data is accessed frequently for the first 30 days, then rarely after that. The engineer needs to minimize storage costs while ensuring data is available within minutes for the first 30 days and can be retrieved within 12 hours after that. Which lifecycle policy should be applied?

A.Transition to S3 One Zone-IA after 30 days.

B.Transition to S3 Standard-IA after 30 days.

C.Transition to S3 Glacier Deep Archive after 30 days.

D.Transition to S3 Glacier Flexible Retrieval after 30 days.

Answer: C

Cost-effective for rarely accessed data with 12-hour retrieval.

Why this answer

Option C is correct because S3 Glacier Deep Archive offers the lowest storage cost for data that is rarely accessed after 30 days, and its retrieval time (within 12 hours) matches the requirement. The lifecycle policy transitions objects from S3 Standard (or S3 Intelligent-Tiering) to S3 Glacier Deep Archive after 30 days, minimizing costs while meeting the 12-hour retrieval window.

Exam trap

AWS often tests the distinction between retrieval time and cost, leading candidates to choose S3 Glacier Flexible Retrieval (Option D) because it is a 'Glacier' tier, but they overlook that Deep Archive is cheaper and still meets the 12-hour retrieval requirement.

How to eliminate wrong answers

Option A is wrong because S3 One Zone-IA is designed for infrequently accessed data that can be recreated if lost; it offers immediate retrieval, which the requirement does not need, and it costs more than Glacier Deep Archive for long-term storage. Option B is wrong because S3 Standard-IA also offers immediate retrieval but is more expensive than Glacier Deep Archive for data that is rarely accessed after 30 days. Option D is wrong because S3 Glacier Flexible Retrieval, with retrieval times from minutes to hours (typically 1-5 minutes for expedited, 3-5 hours for standard), meets the 12-hour requirement but costs more than Glacier Deep Archive, so it is not the lowest-cost option.
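The lifecycle rule described above could look roughly like the following. This is a sketch of the configuration structure accepted by the S3 PutBucketLifecycleConfiguration API; the rule ID, the empty prefix (apply to the whole bucket), and the helper name are assumptions.

```python
def archive_rule(days: int, storage_class: str, prefix: str = "") -> dict:
    """Build a lifecycle configuration with a single transition rule."""
    return {
        "Rules": [{
            "ID": f"transition-after-{days}-days",    # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},             # empty prefix matches every object
            "Transitions": [{"Days": days, "StorageClass": storage_class}],
        }]
    }


lifecycle_configuration = archive_rule(30, "DEEP_ARCHIVE")
print(lifecycle_configuration["Rules"][0]["Transitions"])

# With boto3 this dict would be passed as:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake", LifecycleConfiguration=lifecycle_configuration)
```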


917

MCQ · easy

A team uses Amazon Kinesis Data Analytics to process streaming data. They notice that the application's output is delayed. Which AWS service can be used to monitor the application's performance and identify bottlenecks?

A.AWS CloudTrail

B.Amazon CloudWatch

C.Amazon Athena

D.AWS X-Ray

Answer: B

CloudWatch monitors Kinesis Data Analytics with metrics like MillisBehindLatest and CPU utilization.

Why this answer

Option B is correct because CloudWatch provides metrics and logs for Kinesis Data Analytics applications. Option A is wrong because CloudTrail records API calls, not performance metrics. Option C is wrong because Athena is an interactive query service.

Option D is wrong because X-Ray traces requests but is not the primary service for monitoring Kinesis Data Analytics performance.


918

MCQ · medium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format. The delivery stream is configured with a buffer size of 5 MB and a buffer interval of 60 seconds. However, the data engineer notices that S3 objects are being created with sizes much smaller than 5 MB. What is a likely cause?

A.The data is being compressed before delivery, reducing object size.

B.The incoming data rate is too low, causing the buffer interval to trigger before reaching the buffer size.

C.The data transformation lambda is splitting records into smaller ones.

D.The S3 bucket is configured with a lifecycle policy that splits objects.

Answer: B

Buffer interval triggers first.

Why this answer

Option B is correct because Kinesis Data Firehose delivers data to S3 when either the buffer size (5 MB) or the buffer interval (60 seconds) is reached, whichever occurs first. If the incoming data rate is low, the buffer interval will expire before accumulating 5 MB of data, resulting in smaller S3 objects.

Exam trap

The trap here is that candidates may assume the buffer size is a hard minimum that must be reached before delivery, but Firehose uses an 'or' condition between buffer size and buffer interval, so low data rate causes interval-based delivery of small objects.

How to eliminate wrong answers

Option A is wrong because compression reduces the size of data after buffering, but the buffer size limit is based on the uncompressed data; compression does not cause smaller objects to be created before the buffer interval triggers. Option C is wrong because a data transformation Lambda can modify records but does not inherently split records into smaller ones; it processes records as a batch and returns them, and any splitting would be a custom logic not default behavior. Option D is wrong because S3 lifecycle policies manage object transitions or deletions after objects are created; they do not split objects during delivery.
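The "whichever occurs first" rule can be made concrete with a small calculation. The sketch below estimates the object size for a steady incoming rate, using the 5 MB and 60-second settings from the question; the function name and the sample rates are illustrative assumptions, and sizes refer to uncompressed data.

```python
def expected_object_mb(rate_kb_per_s: float, buffer_mb: float = 5, interval_s: int = 60) -> float:
    """Approximate S3 object size for a steady rate under size-or-interval buffering."""
    accumulated_mb = rate_kb_per_s * interval_s / 1024  # data gathered in one interval
    # Delivery happens when either limit is hit, so the object is capped at buffer_mb.
    return min(buffer_mb, accumulated_mb)


low_rate = expected_object_mb(20)    # 20 KB/s: the interval expires first
high_rate = expected_object_mb(200)  # 200 KB/s: the size limit is reached first
print(low_rate, high_rate)
```

At 20 KB/s the buffer holds only about 1.2 MB when the 60-second interval fires, which matches the small objects described in the question.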


919

MCQ · medium

Refer to the exhibit. An IAM policy for an AWS Lambda function. The Lambda function is triggered by an S3 event (object created) and needs to read from a Kinesis stream. However, the function fails with access denied when trying to read from Kinesis. What is the most likely cause?

A.The Lambda function is not in the same region as the Kinesis stream

B.The Lambda function does not have permission to list S3 buckets

C.The Kinesis stream is encrypted with a customer managed KMS key, and the Lambda function lacks kms:Decrypt permission

D.The S3 bucket policy denies access to the Lambda function

Answer: C

If the stream uses SSE-KMS, Lambda needs kms:Decrypt on the key.

Why this answer

Option C is correct. The policy allows kinesis:DescribeStream, kinesis:GetShardIterator, and kinesis:GetRecords, which are the actions needed to read from the stream, but it includes no AWS KMS permissions.

If the stream is encrypted with a customer managed KMS key, the function also needs kms:Decrypt on that key. Without it, reads from the stream fail with an access denied error even though the Kinesis actions are allowed. The error occurs on the Kinesis read, so S3 permissions are not the cause.


920

MCQ · easy

A data engineer needs to audit data access events in Amazon S3. Which AWS service should be used to record and monitor API calls for S3 buckets?

A.AWS CloudTrail

B.AWS Config

C.Amazon Macie

D.Amazon GuardDuty

Answer: A

CloudTrail records API calls for auditing.

Why this answer

Option A is correct because AWS CloudTrail records API calls for auditing. Option D is wrong because Amazon GuardDuty is a threat detection service. Option B is wrong because AWS Config tracks resource configuration changes.

Option C is wrong because Amazon Macie discovers sensitive data.


921

Matching · medium

Match each AWS storage class to its description.


Concepts

Matches

Frequent access, low latency

Auto-moves data between tiers

Archive retrieval in minutes to hours

Lowest cost, 12-hour retrieval

Infrequent access, single AZ

Why these pairings

S3 storage classes balance cost and access frequency.


922

MCQ · hard

A company uses AWS Glue to transform data in an S3 data lake. The transformation logic requires joining two large datasets that are each hundreds of gigabytes. The Glue job runs out of memory. Which configuration change will most likely resolve this issue?

A.Repartition the data before the join.

B.Increase the number of DPUs for the Glue job.

C.Use a different file format like Parquet with compression.

D.Use the 'spark.sql.autoBroadcastJoinThreshold' setting to broadcast the smaller table.

Answer: B

More DPUs provide more memory and parallelism, helping the join fit in memory.

Why this answer

Increasing the number of DPUs provides more memory for the join operation. Glue automatically distributes data across workers, so more workers mean more total memory.


923

MCQ · hard

A company uses AWS Lake Formation to manage access to data in a data lake. The data engineer needs to grant a user the ability to query tables in the 'sales' database using Amazon Athena, but only when the user's IP address is within the corporate network (10.0.0.0/8). Which combination of actions should the data engineer take?

A.Grant Lake Formation permissions on the tables and attach an S3 bucket policy with aws:SourceIp condition

B.Grant Lake Formation permissions on the tables and attach an IAM policy to the user with aws:SourceIp condition

C.Use an S3 VPC endpoint and grant Lake Formation permissions on the tables

D.Grant Lake Formation permissions on the tables and configure a network ACL in the VPC

Answer: B

Correct combination.

Why this answer

Option B is correct because Lake Formation permissions control access to tables, and an IAM policy with a condition key `aws:SourceIp` can restrict Athena access to the corporate IP range. Option D is wrong because it relies on a network ACL instead of an IAM policy condition. Option A is wrong because S3 bucket policies do not restrict Athena queries through Lake Formation.

Option C is wrong because VPC endpoints alone do not enforce IP-based restrictions.
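A minimal sketch of the IAM side of this combination follows. It assumes the CIDR range from the question and a short, illustrative list of Athena actions; `Resource` is left as "*" only for brevity, and the helper function is a hypothetical local check of what the CIDR covers, not how IAM evaluates the condition.

```python
import ipaddress

CORPORATE_CIDR = "10.0.0.0/8"  # range given in the question

athena_ip_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "athena:StartQueryExecution",
            "athena:GetQueryExecution",
            "athena:GetQueryResults",
        ],
        "Resource": "*",  # a real policy would scope this to specific workgroups
        # Allow only when the request originates from the corporate range.
        "Condition": {"IpAddress": {"aws:SourceIp": CORPORATE_CIDR}},
    }],
}


def in_corporate_range(ip: str) -> bool:
    """Local illustration of which addresses fall inside the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(CORPORATE_CIDR)


print(in_corporate_range("10.20.30.40"), in_corporate_range("192.168.1.5"))
```

Lake Formation grants on the tables are still required; the IAM condition only narrows where the Athena calls may come from.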


924

Multi-Select · medium

A data engineer is designing a data lake on Amazon S3 that will be accessed by multiple AWS Glue ETL jobs. The engineer needs to ensure that the data is organized efficiently for querying and that sensitive columns are masked for certain users. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers

A.Use AWS Lake Formation to define column-level permissions for sensitive data.

B.Configure AWS Glue Data Catalog to automatically mask sensitive columns in table definitions.

C.Organize data in S3 using a partition structure like 'year=YYYY/month=MM/day=DD/region=XX/'.

D.Use S3 object tags to label sensitive data and apply bucket policies to restrict access.

E.Implement S3 lifecycle policies to transition sensitive data to S3 Glacier after 30 days.

Answers: A, C

Lake Formation provides column-level security to mask sensitive columns.

Why this answer

Option A is correct because AWS Lake Formation provides fine-grained access control at the column level, allowing you to mask or restrict sensitive columns (e.g., PII) for specific IAM roles or users without altering the underlying data in S3. This is achieved through Lake Formation’s column-level permissions and data filtering, which integrate directly with the AWS Glue Data Catalog and query engines like Athena and Redshift Spectrum.

Exam trap

The trap here is that candidates often confuse S3 object tags or bucket policies with fine-grained column-level access control, or assume the Glue Data Catalog can natively mask columns, when in fact only Lake Formation provides that capability.
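The Hive-style partition layout named in the question can be generated with a small helper. This is a sketch only; the function name and the sample values are hypothetical, and the key order follows the pattern 'year=YYYY/month=MM/day=DD/region=XX/' from the question.

```python
from datetime import date


def partition_prefix(day: date, region: str) -> str:
    """Build a Hive-style S3 prefix: year=YYYY/month=MM/day=DD/region=XX/."""
    # Zero-padded month and day keep prefixes sortable and consistent.
    return f"year={day:%Y}/month={day:%m}/day={day:%d}/region={region}/"


prefix = partition_prefix(date(2024, 5, 7), "eu")
print(prefix)
```

Query engines that understand this layout can prune partitions when a query filters on year, month, day, or region, which reduces the amount of data scanned.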


925

MCQ · easy

A data engineer is setting up an Amazon Kinesis Data Firehose delivery stream to load data into Amazon Redshift. The data is coming from an application that produces JSON records. The engineer needs to transform the data to match the Redshift table schema. Which approach is the MOST cost-effective and requires the least operational overhead?

A.Use AWS Glue as a transformation step between Firehose and Redshift, with a trigger on S3.

B.Use Kinesis Data Firehose with direct PUT to Redshift and rely on Redshift's COPY command to transform.

C.Configure a Lambda function in the Firehose delivery stream to transform records before delivery.

D.Use the Kinesis Client Library (KCL) to consume the stream, transform in an EC2 instance, and then load to Redshift.

Answer: C

Firehose supports Lambda for data transformation with minimal overhead.

Why this answer

Using Firehose's built-in Lambda transformation is the simplest and most cost-effective approach. Option B is wrong because delivering the raw JSON and relying on the COPY command alone does not transform the records to match the table schema. Option A is wrong because Glue adds extra cost and complexity.

Option D is wrong because KCL requires custom code and management.
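A Firehose transformation function receives base64-encoded records and must return each one with its `recordId`, a `result` status, and the re-encoded `data`. The sketch below follows that contract; the column list and the choice to keep only those columns are hypothetical stand-ins for whatever the Redshift table schema requires.

```python
import base64
import json

TABLE_COLUMNS = ["user_id", "event_type", "event_time"]  # hypothetical target columns


def lambda_handler(event, context):
    """Reshape each JSON record to the target columns before Firehose delivers it."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Keep only the columns the table expects, in a fixed order.
            row = {column: payload.get(column) for column in TABLE_COLUMNS}
            line = json.dumps(row) + "\n"  # newline-delimited JSON for loading
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, AttributeError):
            # Malformed input: hand the original record back marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records marked `ProcessingFailed` are not delivered to the destination as normal output, so malformed input does not break the load into Redshift.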


926

MCQ · easy

A startup uses Amazon S3 to store user-uploaded images. The images are accessed frequently for the first week after upload, but after that they are rarely accessed. The company wants to optimize storage costs without compromising availability. The data engineer must implement a lifecycle policy to transition objects to a more cost-effective storage class after 30 days. The objects must be retrievable within minutes. Which storage class should the engineer transition the objects to?

A.S3 One Zone-Infrequent Access

B.S3 Standard

C.S3 Standard-Infrequent Access

D.S3 Glacier Instant Retrieval

Answer: C

Cost-effective for infrequent access with rapid retrieval.

Why this answer

S3 Standard-Infrequent Access (S3 Standard-IA) is the correct choice because it offers a per-GB storage cost lower than S3 Standard while maintaining high durability (99.999999999%) and availability (99.9%), with retrieval times in milliseconds. The lifecycle policy transitions objects after 30 days, aligning with the access pattern where images are rarely accessed after the first week, and the requirement for retrieval within minutes is satisfied by S3 Standard-IA's instant access.

Exam trap

The trap here is that candidates confuse 'retrievable within minutes' with S3 Glacier Instant Retrieval, overlooking that S3 Standard-IA also provides immediate access and is more cost-effective for data that is rarely accessed but not archival, especially with a 30-day transition window that avoids the 90-day minimum storage charge of Glacier Instant Retrieval.

How to eliminate wrong answers

Option A is wrong because S3 One Zone-Infrequent Access stores data in a single Availability Zone, which lowers its designed availability (99.5%) and leaves the data exposed to the loss of that zone, so it does not meet the requirement of 'without compromising availability'. Option B is wrong because S3 Standard is the default storage class with higher storage costs, and transitioning to it would not optimize costs; it is intended for frequently accessed data, not for rarely accessed objects after 30 days. Option D is wrong because S3 Glacier Instant Retrieval, while offering millisecond retrieval, is designed for long-term archival with a minimum storage duration charge of 90 days, making it cost-ineffective for a 30-day transition and not aligned with the 'rarely accessed' pattern described.


927

MCQ · medium

A company is building a data lake on Amazon S3 and wants to ingest data from multiple AWS services (CloudTrail, VPC Flow Logs, and ALB logs). The data should be stored in a central S3 bucket with a common partitioning scheme. Which service can be used to collect and centralize this data with minimal configuration?

A.Use AWS Data Pipeline to copy logs from each source S3 bucket to the central bucket.

B.Use AWS Glue to crawl the logs from each source and write to a central S3 bucket.

C.Set up Amazon Kinesis Data Firehose to ingest logs from each service and write to S3.

D.Configure each source service to deliver logs directly to the central S3 bucket.

Answer: D

CloudTrail, VPC Flow Logs, and ALB can all deliver to S3 directly.

Why this answer

Option D is correct. CloudTrail, VPC Flow Logs, and ALB access logs can each be configured to deliver directly to Amazon S3, so pointing all three at the central bucket under a common prefix scheme centralizes the data with the least configuration. Option A is wrong because AWS Data Pipeline can copy data but requires additional setup. Option B is wrong because AWS Glue crawls and catalogs data rather than ingesting it.

Option C is wrong because placing Kinesis Data Firehose in front of each service adds delivery streams to configure when the services can already write to S3 directly.


928

MCQmedium

A data engineer needs to store semi-structured JSON data that is accessed infrequently but must be retrievable within minutes. The data is generated by IoT devices and each object is about 500 KB. The engineer wants the most cost-effective storage solution. Which AWS service should be used?

A.Amazon S3 Glacier Deep Archive

B.Amazon S3 Standard

C.Amazon S3 Standard-Infrequent Access (S3 Standard-IA)

D.Amazon Elastic Block Store (EBS)

AnswerC

Cost-effective for infrequent access with rapid retrieval.

Why this answer

Amazon S3 Standard-Infrequent Access (S3 Standard-IA) is the correct choice because it is designed for data that is accessed infrequently but requires rapid retrieval (within minutes). The 500 KB JSON objects from IoT devices fit the use case, and S3 Standard-IA offers lower storage costs than S3 Standard while maintaining the same low-latency retrieval performance, making it the most cost-effective option for this scenario.

Exam trap

The trap here is that candidates often confuse 'infrequent access' with 'archival' and choose Glacier Deep Archive, overlooking the retrieval time requirement of 'within minutes' which S3 Standard-IA satisfies but Glacier does not.

How to eliminate wrong answers

Option A is wrong because Amazon S3 Glacier Deep Archive is intended for long-term archival data, with standard retrievals that take up to about 12 hours, so it cannot meet the 'within minutes' requirement. Option B is wrong because Amazon S3 Standard is optimized for frequently accessed data with higher storage costs, making it less cost-effective for infrequently accessed IoT data. Option D is wrong because Amazon Elastic Block Store (EBS) is a block-level storage service designed for EC2 instances, not for storing semi-structured JSON objects as a standalone data store, and it incurs costs even when not in use.


929

MCQhard

A company is running a critical Amazon RDS for MySQL database. They need to implement a backup strategy that allows point-in-time recovery (PITR) with a recovery time objective (RTO) of 15 minutes and a recovery point objective (RPO) of 5 minutes. Which solution meets these requirements?

A.Enable automated backups with 1-day retention and enable Multi-AZ deployment

B.Use cross-Region automated backups and promote the replica

C.Take manual snapshots every 5 minutes and restore from the latest snapshot

D.Enable a Read Replica and promote it during a disaster

AnswerA

Automated backups provide transaction logs every 5 minutes; Multi-AZ failover is fast.

Why this answer

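Option A is correct because RDS automated backups upload transaction logs to Amazon S3 every 5 minutes, which supports point-in-time recovery within the 5-minute RPO, and a Multi-AZ failover typically completes within a couple of minutes, well inside the 15-minute RTO. Option B is wrong because cross-Region automated backups are replicated asynchronously to another Region, and recovery means restoring a new instance there, which adds time. Option C is wrong because manual snapshots cannot practically be taken every 5 minutes and do not provide point-in-time recovery.

Option D is wrong because a Read Replica replicates asynchronously, and promoting it does not provide point-in-time recovery.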


930

MCQmedium

A data engineer is configuring an Amazon Redshift cluster to encrypt data at rest. The company policy requires that encryption keys be stored in AWS CloudHSM. Which integration should the engineer use to meet this requirement?

A.Use AWS KMS with a customer managed key.

B.Configure Redshift to use an HSM for encryption.

C.Enable encryption using the AWS Redshift SSL/TLS feature.

D.Use Redshift automatic key rotation.

AnswerB

Redshift supports integration with CloudHSM for key storage.

Why this answer

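Amazon Redshift can use a hardware security module (HSM), including AWS CloudHSM, to generate and store the keys that encrypt the cluster. Option B is correct because configuring the cluster to use an HSM keeps the keys in CloudHSM, as the policy requires.

Option A is wrong because an AWS KMS customer managed key is managed in KMS rather than through Redshift's HSM integration. Option C is wrong because SSL/TLS protects data in transit, not data at rest. Option D is wrong because key rotation changes the keys periodically but does not determine where they are stored.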


931

MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. The transfer must be completed within one week. Which service should be used?

A.AWS DataSync

B.AWS Snowball

C.Amazon CloudFront

D.AWS Database Migration Service (DMS)

AnswerB

Physical device for large data transfers.

Why this answer

AWS Snowball is the correct choice because transferring 50 TB over a 100 Mbps network would take approximately 46 days (50 TB * 8 bits/byte / 100 Mbps / 86400 seconds/day), far exceeding the one-week deadline. Snowball provides a physical storage device that can be shipped to the on-premises location, allowing data to be loaded locally and shipped to AWS, bypassing network bandwidth constraints entirely.
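The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully saturated link with no protocol overhead; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_tb: float, link_mbps: float) -> float:
    """Days needed to move size_tb terabytes over a link_mbps link."""
    bits = size_tb * 1e12 * 8           # terabytes -> bits (decimal units)
    seconds = bits / (link_mbps * 1e6)  # bits / (bits per second)
    return seconds / SECONDS_PER_DAY


days = transfer_days(50, 100)
print(f"{days:.1f} days")  # about 46.3 days, well beyond a 7-day window
```

Real transfers rarely sustain the full line rate, so the actual duration would likely be longer than this lower bound.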

Exam trap

The trap here is that candidates may underestimate the time required for online transfer and choose AWS DataSync, failing to calculate that 50 TB at 100 Mbps takes over 46 days, not one week, and overlooking Snowball's physical shipping approach for offline data transfer.

How to eliminate wrong answers

Option A is wrong because AWS DataSync is designed for online data transfer over the network, and at 100 Mbps, it would take over 46 days to transfer 50 TB, which does not meet the one-week requirement. Option C is wrong because Amazon CloudFront is a content delivery network (CDN) for caching and distributing content to edge locations, not a data transfer service for ingesting large volumes of historical data into S3. Option D is wrong because AWS Database Migration Service (DMS) is specialized for migrating databases (e.g., relational, NoSQL) and does not support transferring HDFS files or large-scale file-based data to S3.


932

MCQmedium

A company stores sensitive data in Amazon S3 and requires that all data be encrypted at rest. The data is accessed by multiple AWS services. Which solution meets the encryption requirement with the LEAST operational overhead?

A.Use server-side encryption with AWS KMS (SSE-KMS)

B.Use client-side encryption with AWS KMS

C.Use server-side encryption with customer-provided keys (SSE-C)

D.Enable S3 default encryption with SSE-S3

AnswerD

Least overhead as AWS manages the keys.

Why this answer

Option D is correct because enabling S3 default encryption with SSE-S3 provides server-side encryption automatically without any key management overhead. Option A is wrong because SSE-KMS adds KMS keys, key policies, and permissions to manage. Option B is wrong because client-side encryption requires the application to encrypt data and manage keys before upload.

Option C is wrong because SSE-C requires supplying and managing your own keys with every request.


933

MCQmedium

A company uses AWS Glue to catalog data in Amazon S3. The data arrives in Parquet format, but the crawler fails to update the schema when new columns are added. What is the most likely cause?

A.The Glue Data Catalog is not configured to accept schema changes.

B.Parquet files do not support schema evolution.

C.The crawler configuration has 'Update the table definition' set to 'Ignore the change'.

D.The S3 bucket has versioning disabled.

AnswerC

If set to ignore changes, the crawler will not add new columns.

Why this answer

Option C is correct because the crawler's schema change policy determines what happens when it detects a change: it can update the table definition, add new columns only, or ignore the change. When it is set to ignore the change, new columns never reach the Data Catalog. Option A is wrong because the Glue Data Catalog has no setting that blocks schema changes; the crawler's policy controls table updates. Option B is wrong because Parquet supports schema evolution, including added columns.

Option D is wrong because S3 bucket versioning does not affect crawler schema updates.
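For reference, the crawler's behavior on schema changes is exposed in the Glue API as `SchemaChangePolicy`. The sketch below only builds the keyword arguments that could be passed to boto3's `glue.update_crawler`; it makes no AWS call, and the helper name and crawler name are placeholders.

```python
def crawler_update_args(crawler_name: str, apply_schema_changes: bool) -> dict:
    """Build keyword arguments for boto3's glue.update_crawler (no AWS call made)."""
    return {
        "Name": crawler_name,
        "SchemaChangePolicy": {
            # "LOG" records the change but leaves the table untouched;
            # "UPDATE_IN_DATABASE" rewrites the table definition.
            "UpdateBehavior": "UPDATE_IN_DATABASE" if apply_schema_changes else "LOG",
            "DeleteBehavior": "LOG",
        },
    }


args = crawler_update_args("example-parquet-crawler", apply_schema_changes=True)
print(args["SchemaChangePolicy"]["UpdateBehavior"])
```

With credentials configured, `boto3.client("glue").update_crawler(**args)` would apply the policy to an existing crawler of that name.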


934

MCQhard

A company runs a MySQL-compatible Amazon Aurora database for its e-commerce platform. The database experiences high write latency during peak hours. The application performs frequent INSERT and UPDATE operations on a table with 50 million rows. The DB instance is db.r5.large with 500 GB of Provisioned IOPS storage. A recent performance analysis shows that the average queue depth is consistently above 32 and write latency exceeds 50 ms. The company needs to reduce write latency without changing the application code. What should a data engineer do?

A.Enable Aurora Auto Scaling to automatically add reader instances.

B.Convert the cluster to Aurora Serverless v2 to automatically scale compute capacity.

C.Resize the DB instance to a larger instance type such as db.r5.xlarge.

D.Create a read replica and configure the application to offload read queries.

AnswerC

Increasing instance size provides more CPU and memory, reducing queue depth and write latency.

Why this answer

Option C is correct because the high queue depth (consistently above 32) and write latency exceeding 50 ms indicate that the current db.r5.large instance is CPU or I/O constrained for the write workload. Resizing to a larger instance type such as db.r5.xlarge increases the available vCPUs, memory, and network bandwidth, which directly reduces queue depth and write latency by allowing more concurrent write operations to be processed. This solution does not require application code changes and addresses the root cause of insufficient compute capacity for the frequent INSERT and UPDATE operations on the 50-million-row table.

Exam trap

The trap here is that candidates confuse read scaling solutions (Auto Scaling, read replicas) with write performance improvements, or assume that Aurora Serverless v2 automatically solves all performance issues without considering that write latency is often tied to instance size and storage configuration.

How to eliminate wrong answers

Option A is wrong because Aurora Auto Scaling adds reader instances to handle read traffic, not write traffic; write operations are always handled by the writer instance, so adding readers does not reduce write latency. Option B is wrong because converting to Aurora Serverless v2 changes the scaling model but does not guarantee lower write latency; it may even introduce cold start delays or scaling cooldown periods that could worsen latency during peak hours, and it does not directly address the queue depth issue caused by insufficient instance resources. Option D is wrong because creating a read replica and offloading read queries does not affect write latency; write operations still go to the writer instance, which remains the bottleneck.


935

MCQeasy

A data engineer needs to ingest data from an Amazon RDS for PostgreSQL database into Amazon S3 on a daily basis. The data volume is approximately 500 GB per day. Which service is most appropriate for this task?

A.AWS Database Migration Service (DMS) with continuous replication

B.Amazon Athena with federated query to RDS

C.Amazon EMR with Spark job

D.AWS Glue with a scheduled ETL job

AnswerD

AWS Glue can run scheduled ETL jobs to extract from RDS and load to S3.

Why this answer

Option D is correct because AWS Glue can run scheduled ETL jobs that extract data from RDS over a JDBC connection and write it to S3. Option A is wrong because AWS DMS with continuous replication is aimed at migration and ongoing change capture, not a once-daily batch load. Option B is wrong because Amazon Athena is a query service, not a data transfer service.

Option C is wrong because Amazon EMR requires provisioning and managing a cluster, which is overkill for this simple task.


936

Multi-Selectmedium

A company is using Amazon S3 to store sensitive financial data. They need to ensure that all objects are encrypted at rest. Which TWO methods can achieve this? (Choose TWO.)

Select 2 answers

A.Enable S3 Transfer Acceleration on the bucket.

B.Apply a bucket policy that denies PutObject without encryption.

C.Use SSE-KMS to encrypt objects with AWS KMS.

D.Use client-side encryption before uploading objects.

E.Enable default encryption on the S3 bucket using SSE-S3.

AnswersC, E

SSE-KMS provides server-side encryption with customer-managed keys.

Why this answer

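Options C and E are correct. SSE-KMS and default bucket encryption with SSE-S3 both encrypt objects at rest on the server side. Option A is wrong because S3 Transfer Acceleration speeds up transfers and has nothing to do with encryption.

Option B is wrong because a bucket policy that denies unencrypted PutObject requests only rejects non-compliant uploads; it does not encrypt anything itself. Option D is wrong because client-side encryption is performed by the application before upload, so S3 cannot ensure that every object is encrypted.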


937

MCQmedium

A company uses AWS DMS to migrate a 2 TB PostgreSQL database to Amazon Aurora PostgreSQL. The migration is taking longer than expected due to the initial load. Which AWS service can be used to accelerate the initial load by transferring the database files directly?

A.AWS Snowball

B.Amazon S3 Transfer Acceleration

C.AWS Direct Connect

D.Amazon Kinesis Data Firehose

AnswerA

Snowball allows physical transfer of data, which can be faster than network transfer for very large datasets.

Why this answer

Option A is correct because AWS Snowball moves large datasets on a physical device, bypassing network constraints. Option B is wrong because S3 Transfer Acceleration speeds up uploads to S3 over the internet, but the data still goes over the network. Option C is wrong because Direct Connect provides a dedicated network connection, but the data still traverses the network.

Option D is wrong because Kinesis Data Firehose is for streaming, not bulk data transfer.


938

MCQeasy

A data engineer needs to monitor Amazon DynamoDB table metrics to detect throttled requests. Which CloudWatch metric should the engineer set an alarm on?

A.ReadThrottleEvents

B.SuccessfulRequestLatency

C.ThrottledRequests

D.ConsumedWriteCapacityUnits

AnswerC

This metric directly indicates requests that were throttled.

Why this answer

Option C is correct because `ThrottledRequests` is the specific Amazon CloudWatch metric that tracks the number of requests to a DynamoDB table that are throttled due to exceeding the provisioned throughput capacity. This metric directly reflects throttling events, making it the appropriate choice for setting an alarm to detect throttled requests.

Exam trap

The trap here is that candidates confuse `ThrottledRequests` with `ReadThrottleEvents` or `WriteThrottleEvents`. Those metrics exist, but each covers only one side of the workload, so an alarm on one of them misses throttling on the other.

How to eliminate wrong answers

Option A is wrong because `ReadThrottleEvents` counts only read requests that exceed provisioned read capacity, so an alarm on it would miss throttled writes; `ThrottledRequests` covers throttled requests of any type. Option B is wrong because `SuccessfulRequestLatency` measures the latency of successful requests, not throttling events, and is used for performance monitoring rather than detecting throttled requests. Option D is wrong because `ConsumedWriteCapacityUnits` tracks the amount of write capacity consumed, not throttling events, and is used for capacity planning, not for alerting on throttled requests.


939

MCQmedium

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration is ongoing with continuous replication. The data engineer notices that some changes are not being captured in the target database. What is the MOST likely cause?

A.The VPC peering connection between on-premises and AWS is down.

B.The DMS task's table mapping is incorrectly configured.

C.DMS is not publishing task logs to CloudWatch Logs.

D.The source Oracle database is not configured to retain archived redo logs for a sufficient period.

AnswerD

DMS requires archived logs to capture changes; if logs are purged, changes are lost.

Why this answer

Option D is correct because DMS reads the source database's redo logs for change data capture (CDC); if archived redo logs are purged before DMS reads them, those changes are lost. Option A is wrong because a network outage would stall replication entirely rather than drop only some changes. Option B is wrong because an incorrect table mapping would cause whole tables to be missing, not individual changes.

Option C is wrong because publishing task logs to CloudWatch Logs affects troubleshooting visibility, not change capture.


940

MCQeasy

A company is using Amazon Kinesis Data Firehose to ingest clickstream data from a website into an S3 bucket. The data is then analyzed using Amazon Athena. Recently, the company noticed that Athena queries are returning incomplete results for the last 30 minutes of data. The Firehose delivery stream is configured to buffer data for 60 seconds or 5 MB before delivering to S3. The S3 bucket has a lifecycle policy that transitions objects to Amazon S3 Glacier after 30 days. The IAM role for Firehose has permissions to write to S3 and access a CloudWatch Logs group. The engineer checks the Firehose monitoring and sees that the delivery rate is healthy, but the 'S3.Bytes' metric shows a spike in the last hour. The 'BackupToS3.Bytes' metric is zero. What is the MOST likely cause of the missing data?

A.The lifecycle policy is transitioning data before Athena can query it.

B.The backup is enabled and data is being sent to the backup bucket instead.

C.The data is still being buffered in Firehose and has not yet been delivered to S3.

D.The IAM role for Firehose does not have permissions to write to the S3 bucket.

AnswerC

Firehose buffers data for up to 60 seconds; recent data may still be in transit and not yet queryable by Athena.

Why this answer

Option C is correct. The buffer settings (60 seconds or 5 MB) mean that Firehose may hold records for up to 60 seconds before writing them to S3, so the most recent records may not have been delivered yet.

Athena queries only see data that has been delivered. Option A is wrong because the lifecycle policy acts after 30 days and does not affect recent data. Option B is wrong because the 'BackupToS3.Bytes' metric is zero, so no data is going to a backup bucket.

Option D is wrong because missing write permissions would cause delivery errors, while the delivery rate is healthy.


941

Multi-Selecteasy

A company is designing a data lake on Amazon S3. The data ingestion pipeline must handle both structured and unstructured data. The data must be cataloged for easy discovery. Which THREE services should be included in the solution? (Choose THREE.)

Select 3 answers

A.Amazon S3

B.AWS Glue Data Catalog

C.Amazon Athena

D.Amazon RDS

E.Amazon Redshift

AnswersA, B, C

S3 is the core storage for data lakes.

Why this answer

Option A is correct because S3 is the storage layer. Option B is correct because the AWS Glue Data Catalog stores the metadata that makes the data discoverable, typically populated by Glue crawlers. Option C is correct because Athena can query the cataloged data.

Option D (RDS) is a relational database, not for a data lake. Option E (Redshift) is a data warehouse, not a data lake.


942

MCQeasy

A company wants to enforce that all data written to an S3 bucket is encrypted with a customer-managed AWS KMS key. The data engineer has created the KMS key and attached an S3 bucket policy. However, users are still able to upload objects without specifying the KMS key. What is the most likely cause?

A.The S3 bucket policy does not include a condition that denies s3:PutObject without the correct encryption

B.The S3 bucket has default encryption enabled with SSE-S3

C.The KMS key policy does not grant the users kms:Encrypt permission

D.The IAM role for the users does not have s3:PutObject permission

AnswerA

The bucket policy must have a deny condition.

Why this answer

Option A is correct because the bucket policy must explicitly deny s3:PutObject when the encryption headers do not specify the required KMS key. Option B is wrong because default encryption with SSE-S3 only applies an encryption setting to uploads that specify none; it neither causes nor prevents uploads that skip the KMS key. Option C is wrong because a missing kms:Encrypt (or kms:GenerateDataKey) permission would make uploads that use the key fail; it would not let uploads without the key succeed.

Option D is wrong because without s3:PutObject permission the users could not upload objects at all.
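A deny condition of the kind described could look like the following. This is a sketch, not a vetted production policy: the helper name, bucket name, and key ARN are placeholders, and the policy is only built as a Python dictionary and printed, not applied.

```python
import json


def require_kms_key_policy(bucket: str, kms_key_arn: str) -> dict:
    """Bucket policy that rejects uploads not encrypted with the given KMS key."""
    resource = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Reject uploads that do not request SSE-KMS at all
                "Sid": "DenyNonKmsEncryption",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": resource,
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            },
            {   # Reject uploads that name a different KMS key
                "Sid": "DenyWrongKmsKey",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": resource,
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn
                    }
                },
            },
        ],
    }


policy = require_kms_key_policy(
    "example-bucket",  # placeholder bucket name
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",  # placeholder key ARN
)
print(json.dumps(policy, indent=2))
```

Because the conditions use `StringNotEquals`, a request that omits the encryption headers entirely is also denied, which is what closes the gap described in the question.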


943

MCQeasy

A data engineer needs to transform JSON data from a Kinesis Data Stream into Parquet format and store it in an S3 data lake. The transformation includes simple field mapping and data type conversions. Which AWS service is the most cost-effective for performing this transformation in near-real-time?

A.Amazon Athena with CTAS (CREATE TABLE AS SELECT)

B.AWS Lambda function triggered by Kinesis Data Firehose

C.AWS Glue ETL job

D.Amazon EMR with Spark Streaming

AnswerB

Lambda can be invoked by Firehose for record transformation and can output Parquet; it is serverless and cost-effective for near-real-time.

Why this answer

Option B is correct because Kinesis Data Firehose can invoke a Lambda function to transform records before delivering them to S3, and Firehose's built-in record format conversion can write the output as Parquet. Option A is wrong because Amazon Athena is an interactive query service; CTAS runs as a batch query over data already in S3, not in near-real-time. Option C is wrong because an AWS Glue ETL job bills for Spark workers while it runs, which costs more than a Lambda transformation for simple field mapping.

Option D is wrong because Amazon EMR requires a managed cluster that is more complex and costly for simple transformations.
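A Firehose transformation function receives base64-encoded records and must return each `recordId` with a `result` and the re-encoded `data`. The handler below is a minimal sketch of that contract; the input fields (`uid`, `ts`, `amount`) and their target names are hypothetical, chosen only to illustrate field mapping and type conversion.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation: rename fields and convert types per record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            transformed = {
                # Hypothetical mapping: "uid" -> "user_id", "ts" -> "event_time"
                "user_id": str(payload["uid"]),
                "event_time": int(payload["ts"]),
                "amount": float(payload.get("amount", 0)),
            }
            data = base64.b64encode((json.dumps(transformed) + "\n").encode()).decode()
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (KeyError, ValueError, TypeError):
            # Return the original payload so Firehose can treat the record as failed
            output.append(
                {"recordId": record["recordId"], "result": "ProcessingFailed", "data": record["data"]}
            )
    return {"records": output}
```

The function emits JSON; the conversion to Parquet would be left to Firehose's record format conversion, which relies on a table schema defined in the Glue Data Catalog.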


944

MCQhard

A company is building a real-time analytics dashboard using Amazon Kinesis Data Streams and Amazon DynamoDB. The data engineer needs to ensure that the DynamoDB table can handle write spikes without throttling. Which approach is the most cost-effective?

A.Use DynamoDB Accelerator (DAX) to cache writes.

B.Use provisioned capacity with auto-scaling set to a maximum of 10,000 WCU.

C.Use an AWS Lambda function to buffer writes and batch them to DynamoDB.

D.Use DynamoDB on-demand capacity mode.

AnswerD

On-demand scales instantly and is cost-effective for unpredictable workloads.

Why this answer

DynamoDB on-demand capacity mode automatically scales to handle write spikes without requiring capacity planning or management, making it the most cost-effective choice for unpredictable workloads like a real-time analytics dashboard. It charges per request, so you only pay for the writes you actually use, avoiding over-provisioning costs.

Exam trap

The trap here is that candidates often confuse DAX's read caching with write handling, or assume that auto-scaling provisioned capacity can handle sudden spikes instantly, when in reality it scales gradually and can still throttle during rapid bursts.

How to eliminate wrong answers

Option A is wrong because DAX is an in-memory cache for reads, not writes; it does not prevent write throttling. Option B is wrong because provisioned capacity with auto-scaling can still throttle during sudden spikes due to the lag in scaling up, and setting a maximum of 10,000 WCU may be insufficient or wasteful. Option C is wrong because using Lambda to buffer writes adds latency and complexity, and while it can batch writes, it does not eliminate the risk of throttling if the batch rate exceeds the table's capacity.


945

MCQeasy

A data engineer needs to store time-series sensor data from thousands of IoT devices. The data is written once, read frequently for the last 24 hours, and rarely accessed after 30 days. Which storage solution is MOST cost-effective?

A.Amazon Redshift with automatic compression and distribution keys.

B.Amazon DynamoDB with TTL to expire data after 30 days.

C.Amazon Timestream with a 30-day retention policy.

D.Amazon S3 with lifecycle policies to transition to S3 Glacier after 30 days.

AnswerC

Timestream is designed for time-series data, with cost-effective tiered storage and built-in analytics.

Why this answer

Amazon Timestream is purpose-built for time-series data, offering automatic storage tiering where recent data resides in memory for fast queries and historical data is moved to a cost-optimized magnetic store. A 30-day retention policy aligns perfectly with the requirement to keep data accessible for frequent reads over the last 24 hours while automatically expiring older data, minimizing storage costs without manual intervention.

Exam trap

The trap here is that candidates often choose Amazon S3 with lifecycle policies (Option D) because they associate S3 with cost-effective storage, but they overlook the requirement for frequent reads of recent data, which S3 cannot serve with low latency without additional caching layers, and they miss that Timestream is the only AWS service natively designed for time-series data with automatic tiering and retention.

How to eliminate wrong answers

Option A is wrong because Amazon Redshift is a columnar data warehouse optimized for complex analytical queries on structured data, not for high-frequency time-series ingestion from thousands of IoT devices; its cost and overhead are excessive for simple sensor data storage. Option B is wrong because Amazon DynamoDB with TTL is designed for key-value and document workloads, not for time-series queries like range scans over the last 24 hours; TTL only deletes expired items but does not provide efficient time-based querying or automatic tiering, leading to higher read costs and complexity. Option D is wrong because Amazon S3 with lifecycle policies to transition to S3 Glacier after 30 days is cost-effective for archival but does not support low-latency frequent reads for the last 24 hours without additional services like S3 Select or Athena, and S3 is not optimized for high-write, time-ordered data ingestion from IoT devices.


946

MCQmedium

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job fails intermittently with a 'MemoryError'. What is the MOST likely cause?

A.The Glue job worker type is too small for the data volume

B.The Glue job uses too many DynamicFrames

C.The S3 output bucket is in a different region

D.The Kinesis stream has insufficient shards

AnswerA

Small worker type leads to out-of-memory errors when data volume exceeds capacity.

Why this answer

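The error indicates that the job runs out of memory. Choosing a larger worker type (for example, G.2X instead of G.1X) gives each worker more memory, and adding workers spreads the data across more executors; either change helps the job process larger volumes without memory errors.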


947

MCQmedium

A company uses AWS Glue DataBrew to clean and transform data for analytics. The source data is in Parquet format in Amazon S3. The transformation includes filtering rows and adding calculated columns. What is the MOST cost-effective way to run these transformations on a schedule?

A.Use Amazon EMR with Spark

B.Create a Glue DataBrew recipe and schedule the job using a cron expression

C.Create an AWS Lambda function triggered by S3 events

D.Use AWS Glue ETL with PySpark

AnswerB

DataBrew supports scheduling directly.

Why this answer

DataBrew jobs can be scheduled using a cron expression. This is the native way to run transformations automatically without additional services.


948

MCQeasy

The exhibit shows the output of describe-table for a DynamoDB table. The application is experiencing throttling errors when reading data. What is the MOST likely cause?

A.The sort key is not used correctly for queries.

B.The table has a hot partition due to the HASH key.

C.The table size is too large, causing slow reads.

D.The table's provisioned read capacity is too low.

AnswerD

5 RCUs is very low; if the application reads more than 5 RCUs, throttling occurs.

Why this answer

The describe-table output shows the table has provisioned read capacity set to 5, but the application is experiencing throttling errors. Throttling occurs when read requests exceed the provisioned read capacity units (RCUs). Increasing the read capacity or implementing retries with exponential backoff would resolve this.

The throttling is directly caused by insufficient provisioned read capacity for the workload.
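To see how quickly 5 RCUs can be exhausted, recall that one RCU covers one strongly consistent read per second of an item up to 4 KB, and an eventually consistent read costs half as much. The sketch below applies that rule; the item size and read rate are hypothetical, since the exhibit does not state the workload.

```python
import math


def required_rcus(item_size_bytes: int, reads_per_second: int,
                  strongly_consistent: bool = True) -> float:
    """Provisioned RCUs needed to sustain a steady read rate."""
    units_per_read = math.ceil(item_size_bytes / 4096)  # each 4 KB block costs one unit
    if not strongly_consistent:
        units_per_read /= 2  # eventually consistent reads cost half
    return units_per_read * reads_per_second


needed = required_rcus(item_size_bytes=6_000, reads_per_second=10)
print(needed, needed > 5)  # 20 RCUs needed, so 5 provisioned RCUs would throttle
```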

Exam trap

The trap here is that candidates may confuse throttling with performance issues like hot partitions or inefficient queries, but the describe-table output directly shows low provisioned read capacity, making insufficient capacity the most likely cause.

How to eliminate wrong answers

Option A is wrong because the sort key not being used correctly would cause inefficient queries (e.g., full table scans) but not necessarily throttling; throttling is a capacity issue, not a query pattern issue. Option B is wrong because a hot partition due to the HASH key would cause throttling on specific partitions, but the question states the application is experiencing throttling errors when reading data generally, not just on a single partition; the describe-table output does not indicate a hot partition. Option C is wrong because table size does not directly cause throttling; DynamoDB can handle large tables efficiently with proper partitioning, and throttling is based on provisioned capacity, not storage size.


949

MCQhard

A data engineer is designing a data pipeline that ingests CSV files from an FTP server to Amazon S3. The files arrive hourly and each file is about 500 MB. The engineer wants to minimize operational overhead and cost. Which approach is best?

A.Write a Python script in AWS Lambda using boto3 to download from FTP and upload to S3

B.Use AWS Snowball Edge to transfer files weekly

C.Use AWS Transfer for SFTP and point the endpoint to an S3 bucket

D.Deploy an Amazon EC2 instance with a cron job to run wget and aws s3 cp

AnswerC

Fully managed, no servers to manage, direct to S3.

Why this answer

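AWS Transfer for SFTP provides a fully managed SFTP endpoint that writes directly to S3, eliminating the need to manage servers. Option A is wrong because a Lambda function requires custom transfer code to write and maintain. Option D is wrong because an EC2 instance must be patched, monitored, and paid for continuously. Option B is wrong because Snowball Edge is for large offline transfers, not hourly files.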


950

MCQmedium

A data engineer needs to migrate an on-premises PostgreSQL database to Amazon RDS for PostgreSQL. The database is 2 TB and has a continuous stream of write operations. The migration should minimize downtime. Which AWS service should be used?

A.AWS DataSync

B.AWS Database Migration Service (DMS)

C.AWS Snowball Edge

D.AWS Glue

AnswerB

DMS supports ongoing replication and minimal downtime for database migrations.

Why this answer

Option B is correct because AWS DMS supports ongoing replication from on-premises PostgreSQL to RDS, allowing minimal downtime. Option A (DataSync) transfers files and objects between storage systems, not live databases. Option C (Snowball Edge) is for offline data transfer, not for ongoing replication.

Option D (Glue) is for ETL, not for database migration with replication.


951

Multi-Selectmedium

A data engineer is designing a data pipeline using AWS Step Functions to orchestrate multiple AWS Glue ETL jobs. The pipeline must handle failures and retries. Which TWO configurations should the engineer use to ensure the pipeline is resilient? (Choose two.)

Select 2 answers

A.Configure a dead-letter queue (DLQ) for the state machine

B.Configure the state machine to use a 'Catch' rule to handle specific errors and transition to a fallback state

C.Set the 'Retry' interval to a fixed value instead of exponential backoff

D.Define a 'Timeout' for each state to prevent the pipeline from hanging indefinitely

E.Use a 'Parallel' state to run multiple Glue jobs simultaneously

AnswersB, D

Catch rules handle errors gracefully.

Why this answer

Options B and D are correct. Option B: a Catch rule handles specific errors and routes the execution to a fallback state. Option D: a timeout on each state prevents stuck executions.

Option A is wrong because dead-letter queues belong to services such as Lambda and SQS, not to Step Functions state machines. Option C is wrong because a fixed retry interval is less resilient than exponential backoff, which spaces out retries after repeated failures. Option E is wrong because a Parallel state provides concurrency, not resilience.
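The Retry, Catch, and timeout settings all live on the Task state in Amazon States Language. The sketch below builds one such state as a Python dictionary; the job name, state names, timeout, and retry values are illustrative assumptions, not recommendations.

```python
import json


def glue_task_state(job_name: str, next_state: str, fallback_state: str) -> dict:
    """Task state that runs a Glue job with Retry, Catch, and a timeout."""
    return {
        "Type": "Task",
        # Optimized Glue integration that waits for the job run to finish
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "TimeoutSeconds": 3600,  # fail the state instead of hanging indefinitely
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,  # exponential backoff: 30 s, 60 s, 120 s
            }
        ],
        # Anything still failing after retries (including timeouts) goes to the fallback
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": fallback_state}],
        "Next": next_state,
    }


state = glue_task_state("example-etl-job", "NextJob", "NotifyFailure")
print(json.dumps(state, indent=2))
```

A `BackoffRate` of 1.0 would give the fixed interval that option C describes; values above 1 lengthen each successive wait.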


952

MCQhard

A company uses AWS Database Migration Service (DMS) to migrate an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration completes successfully, but the data engineer notices that some tables have fewer rows in the target than the source. Which DMS setting should be checked to ensure full data migration?

A.The LOB mode is set to 'Limited LOB mode' instead of 'Full LOB mode'.

B.The task logs show that some rows failed to apply due to data type conversion errors.

C.The 'Enable validation' option is turned off.

D.The 'Parallel Apply' feature is disabled, slowing down the migration.

AnswerB

Failed rows would be logged and can be reviewed.

Why this answer

Option B is correct because if some rows failed to apply due to data type conversion errors, those rows would be logged as errors and not written to the target, resulting in fewer rows. AWS DMS task logs capture these failures, and checking them is the direct way to identify rows that were skipped or rejected during migration. This is the most common cause of row count mismatches after a successful DMS task.

Exam trap

The trap here is that candidates often assume row count mismatches are always due to LOB settings or validation being off, but the most direct cause is data type conversion errors logged in the task logs, which DMS does not surface in the task status summary.

How to eliminate wrong answers

Option A is wrong because LOB mode settings (Limited vs. Full) affect how large objects are handled, not the total row count; even in Limited LOB mode, all rows are migrated, but LOB columns may be truncated if the LOB exceeds the max size. Option C is wrong because 'Enable validation' is a post-migration check that compares source and target data, but turning it off does not cause rows to be lost during migration; it only prevents validation reports from being generated.

Option D is wrong because 'Parallel Apply' affects the speed of applying changes to the target, not the completeness of data; disabling it may slow down the migration but does not cause rows to be omitted.

Full explanation →

953

MCQeasy

A financial services company uses AWS Glue ETL jobs to process credit card transaction data stored in Amazon S3. The data includes PII such as names and credit card numbers. The security team requires that all PII be masked before the data is written to the curated zone of the data lake. The data engineer has implemented a Glue job that reads from the raw zone, applies a custom transform to mask credit card numbers using a regular expression, and writes to the curated zone. However, during a recent audit, the security team discovered that some masked data still contained partial credit card numbers (e.g., showing the last four digits) when viewed by analysts who should only see masked data. The company's policy is that credit card numbers must be completely masked, showing only asterisks or a fixed string like "XXXX-XXXX-XXXX-XXXX". The Glue job uses a DynamicFrame and applies a Map transform with a Python function that replaces digits with 'X'. The data is stored in Parquet format. What should the data engineer do to ensure complete masking of credit card numbers?

A.Use an AWS Glue crawler to classify the data and apply a masking rule based on the classification.

B.Enable server-side encryption with AWS KMS on the curated S3 bucket.

C.Replace the custom Python Map transform with a built-in Glue Transform for data masking, such as the Mask transform available in Glue Studio.

D.Change the output format from Parquet to CSV and use a different write mode.

AnswerC

Built-in masking transforms are designed to handle common patterns and ensure complete masking.

Why this answer

Option C is correct because AWS Glue provides a built-in Mask transform that can be applied directly in Glue Studio or via the AWS Glue API. This transform is designed to reliably obfuscate sensitive data like credit card numbers by replacing them with a fixed string (e.g., 'XXXX-XXXX-XXXX-XXXX') or asterisks, ensuring complete masking regardless of input format. The custom Python Map transform in the current implementation is error-prone because it relies on a regular expression that may not catch all patterns or partial digits, whereas the Mask transform uses predefined logic to guarantee full masking.

Exam trap

The trap here is that candidates may assume any custom Python logic with a regex is sufficient for masking, but the exam tests the understanding that AWS Glue's built-in Mask transform provides a more reliable and policy-compliant solution for sensitive data obfuscation.

How to eliminate wrong answers

Option A is wrong because an AWS Glue crawler is used for schema discovery and classification, not for applying data masking rules; masking must be performed during ETL processing, not at the crawler level. Option B is wrong because enabling server-side encryption with AWS KMS protects data at rest but does not alter the content of the data; it does not mask or obfuscate credit card numbers, so analysts would still see partial digits. Option D is wrong because changing the output format from Parquet to CSV and using a different write mode has no effect on the masking logic; the custom Python Map transform would still produce the same incomplete masking, and CSV format does not inherently mask data.

Full explanation →

954

MCQmedium

A company is using Amazon S3 to store sensitive data. To meet compliance requirements, they need to automatically transition objects to S3 Glacier Deep Archive after 90 days and delete them after 7 years. What is the MOST cost-effective way to configure this?

A.Configure an S3 Lifecycle policy to transition objects to Glacier Deep Archive after 90 days and expire them after 7 years.

B.Manually move objects to Glacier Deep Archive and delete them using a script.

C.Use S3 Intelligent-Tiering to automatically move objects to Glacier Deep Archive and set expiration.

D.Enable S3 Object Lock with a retention period of 7 years and use a lifecycle policy to transition to Glacier Deep Archive.

AnswerA

Lifecycle policies provide automated transitions and expirations based on object age.

Why this answer

Option A is correct because a lifecycle policy can transition objects based on age and delete them after a specified period. Option B is wrong because manual moves and deletions are error-prone and not automated. Option C is wrong because S3 Intelligent-Tiering does not delete objects.

Option D is wrong because S3 Object Lock is for retention, not lifecycle management.
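A lifecycle rule of this kind might look like the sketch below, written in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID is a made-up name, and treating 7 years as 7 × 365 days (ignoring leap days) is an assumption.

```python
# Lifecycle configuration in the shape accepted by
# s3.put_bucket_lifecycle_configuration(LifecycleConfiguration=...).
SEVEN_YEARS_DAYS = 7 * 365  # leap days ignored (assumption)

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: apply to every object
            # Move objects to Glacier Deep Archive 90 days after creation.
            "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
            # Delete objects once they reach the retention limit.
            "Expiration": {"Days": SEVEN_YEARS_DAYS},
        }
    ]
}

print(lifecycle_configuration)
```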

Full explanation →

955

MCQeasy

A data engineer needs to restrict access to an S3 bucket so that only users from a specific AWS account can read objects. Which S3 bucket policy element should be used?

A.Action

B.Principal

C.Resource

D.Condition

AnswerB

Principal identifies the account or user.

Why this answer

Option B is correct. The Principal element specifies the account that is granted access. Option A is wrong because Action is the operation.

Option C is wrong because Resource is the bucket or its objects. Option D is wrong because Condition can be used but is not the primary element.
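To show where Principal sits relative to the other three elements, here is a minimal bucket policy sketch. The account ID `111122223333` and bucket name `example-bucket` are placeholders.

```python
import json

ACCOUNT_ID = "111122223333"  # placeholder account ID
BUCKET = "example-bucket"    # placeholder bucket name

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadFromOneAccount",
            "Effect": "Allow",
            # Principal: who is allowed (here, the whole account).
            "Principal": {"AWS": f"arn:aws:iam::{ACCOUNT_ID}:root"},
            # Action: which operation is allowed.
            "Action": "s3:GetObject",
            # Resource: which objects the statement covers.
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```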

Full explanation →

956

MCQmedium

A data engineer is designing a data pipeline that processes sensitive financial data. The data must be encrypted at rest and in transit. The pipeline uses Amazon Kinesis Data Streams to ingest data and AWS Lambda to process it. Which combination of actions ensures the data is encrypted in transit? (Select TWO.)

A.Enable TLS for Kinesis Data Streams.

B.Enable Encryption in Transit for the Lambda function's VPC configuration.

C.Enable Server-Side Encryption (SSE-S3) on the S3 bucket used for data storage.

D.Use AWS KMS to encrypt data at rest in Kinesis Data Streams.

E.Encrypt the Lambda function's CloudWatch Logs using KMS.

AnswerA, B

TLS encrypts data in transit between producers and Kinesis.

Why this answer

Option A is correct because enabling TLS for Kinesis Data Streams encrypts data in transit between producers and the stream. Option B is correct because enabling Encryption in Transit for Lambda's VPC configuration ensures TLS is used for Lambda's network connections. Option C is wrong because SSE-S3 encrypts data at rest in S3.

Option D is wrong because SSE-KMS encrypts data at rest in Kinesis. Option E is wrong because CloudWatch Logs encryption applies at rest.

Full explanation →

957

MCQmedium

Refer to the exhibit. A data engineer is troubleshooting write throttling on the Orders table. The table has a composite primary key (OrderID as partition key, CustomerID as sort key). The engineer notices that writes are throttled even though the write capacity is not fully utilized. What is the most likely cause?

A.The table is empty and has no items.

B.A global secondary index (GSI) is consuming write capacity.

C.The read capacity units are too low.

D.Writes are concentrated on a single partition key value.

AnswerD

Hot partition causes throttling even if total capacity is not exceeded.

Why this answer

D is correct because write throttling on an Amazon DynamoDB table occurs when requests exceed the throughput available to a specific partition, even if the overall table write capacity is underutilized. With a composite primary key where OrderID is the partition key, writes concentrated on a single OrderID value (i.e., a hot key) will hit that partition's limit of 1,000 WCU per second (the per-partition read limit is 3,000 RCU), causing throttling while other partitions remain idle.

Exam trap

AWS often tests the misconception that overall table capacity utilization is the sole indicator of throttling, but the trap here is that throttling can occur at the partition level due to hot keys, even when the table's total write capacity is underutilized.

How to eliminate wrong answers

Option A is wrong because an empty table does not cause write throttling; throttling is based on capacity consumption, not table size. Option B is wrong because a global secondary index (GSI) consumes write capacity from its own provisioned throughput, not from the base table's write capacity, and the question states the write capacity is not fully utilized. Option C is wrong because read capacity units are independent of write operations; low RCU would throttle reads, not writes.

Full explanation →

958

Multi-Selecthard

A company runs an Amazon RDS for PostgreSQL instance for an OLTP application. The database size is 500 GB. The company wants to minimize downtime during backups and ensure point-in-time recovery (PITR) for the last 7 days. Which TWO features should the company use? (Choose TWO.)

Select 2 answers

A.Enable Multi-AZ deployment for high availability.

B.Create a read replica in a different Availability Zone.

C.Enable automated backups with a retention period of 7 days.

D.Create daily manual snapshots and copy them to another region.

E.Enable Enhanced Monitoring to track backup progress.

AnswersA, C

Multi-AZ reduces downtime during automated backups by taking backups from the standby.

Why this answer

Automated backups with a 7-day retention provide PITR. Multi-AZ deployment ensures high availability and reduces downtime during backups. Manual snapshots are not needed.

Enhanced Monitoring is for performance. Read replicas are for read scaling, not backup.

Full explanation →

959

MCQhard

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is transformed using an AWS Lambda function. Some records fail transformation and are lost because the Lambda function throws an exception. The data engineer needs to capture the failed records for analysis without affecting the pipeline. What should the engineer do?

A.Configure the Firehose delivery stream to send failed records to a backup S3 bucket

B.Increase the buffer size of the Firehose stream

C.Disable the Lambda transformation and process all records in batch later

D.Modify the Lambda function to write failed records to Amazon DynamoDB

AnswerA

Firehose can be configured to send failed records to a backup S3 bucket.

Why this answer

Option A is correct because configuring a backup S3 bucket for failed records captures them without stopping the pipeline. Option D is wrong because storing failed records in DynamoDB adds complexity and is not a native Firehose feature. Option C is wrong because disabling transformation loses data quality.

Option B is wrong because increasing buffer size does not capture failed records.
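On the Lambda side, failed records are easier to capture when the function reports them instead of raising. Below is a sketch of a Firehose transformation handler that follows the documented response contract (`recordId`, `result`, `data`) and marks bad records as `ProcessingFailed`, which lets Firehose write them to its error output location in S3. The transformation itself (adding a `processed` field) is a made-up stand-in.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record; flag failures instead of raising."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical transformation
            line = json.dumps(payload) + "\n"
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, KeyError, TypeError):
            # Return the original data untouched and mark the record as failed
            # so that the rest of the batch is still delivered.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Catching the error per record means one malformed record no longer fails the whole invocation.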

Full explanation →

960

MCQeasy

A company uses Amazon Redshift for data warehousing. They notice that query performance has degraded over time. Which maintenance operation should be performed to improve performance?

A.Run the VACUUM command

B.Drop and recreate the table

C.Run the REINDEX command

D.Run the ANALYZE command

AnswerA

VACUUM re-sorts rows and reclaims disk space, improving query performance.

Why this answer

Option A is correct because VACUUM re-sorts rows and reclaims space, improving performance. Option D is wrong because ANALYZE updates statistics but does not physically reorganize data. Option C is wrong because REINDEX rebuilds indexes, but Redshift uses sort keys, not indexes.

Option B is wrong because DROP TABLE deletes data.

Full explanation →

961

MCQeasy

A company needs to transfer 20 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited and the transfer must complete within one week. Which service should the company use?

A.Amazon S3 Transfer Acceleration

B.AWS Snowball Edge

C.AWS Direct Connect with DataSync

D.AWS DataSync over a VPN connection

AnswerB

Snowball is a physical device that can transfer large data quickly.

Why this answer

Option B is correct because AWS Snowball Edge is designed for large-scale data transfers with limited bandwidth. Option D is wrong because DataSync over a VPN still depends on the limited network link and may not meet the deadline. Option C is wrong because the bandwidth is insufficient.

Option A is wrong because S3 Transfer Acceleration speeds up uploads but still uses the internet.
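The bandwidth argument can be checked with quick arithmetic. The sketch below computes the sustained throughput needed to move 20 TB in one week, assuming decimal terabytes and no protocol overhead or idle time.

```python
DATA_BYTES = 20 * 10**12           # 20 TB, decimal terabytes (assumption)
WINDOW_SECONDS = 7 * 24 * 60 * 60  # one week

# Convert bytes to bits, divide by the window, and express in Mbit/s.
required_mbps = DATA_BYTES * 8 / WINDOW_SECONDS / 10**6

print(f"Sustained throughput needed: {required_mbps:.1f} Mbit/s")
```

The result is roughly 265 Mbit/s running around the clock for the full week, which a limited link is unlikely to sustain.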

Full explanation →

962

MCQhard

A company uses Amazon EMR to run Spark jobs on a transient cluster. The jobs are submitted via a step in the cluster. The cluster is configured to auto-terminate after the last step completes. However, the cluster is not terminating even though the step shows as 'COMPLETED'. What could be the cause?

A.The cluster's root device size is too large.

B.The step failed with an error, but the status shows 'COMPLETED' due to a reporting bug.

C.The cluster is configured as a long-running cluster.

D.The step's 'ActionOnFailure' parameter is set to 'CONTINUE' and 'KeepClusterAliveOnFailure' is true.

AnswerD

These settings prevent auto-termination.

Why this answer

Option D is correct because if KeepClusterAliveOnFailure is set to true for a step, the cluster will not terminate even if the step succeeds. Option B is wrong because 'COMPLETED' indicates success. Option A is wrong because the root volume size does not affect termination.

Option C is wrong because the cluster is transient, not long-running.

Full explanation →

963

MCQeasy

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed into Parquet format and stored in Amazon S3. Which AWS service can perform the transformation in near real-time with minimal operational overhead?

A.Amazon EMR cluster running Spark Streaming

B.AWS Glue ETL job triggered by Kinesis stream

C.Amazon Kinesis Data Firehose with a transformation Lambda function

D.Amazon Kinesis Data Analytics for Apache Flink

AnswerC

Kinesis Data Firehose can invoke a Lambda function to convert data to Parquet and deliver to S3.

Why this answer

Option C is correct because Kinesis Data Firehose can transform data using Lambda functions and deliver to S3 in Parquet format with no servers to manage. Option D (Kinesis Data Analytics) is for stream analytics, not direct S3 delivery. Option B (Glue ETL) is batch-oriented.

Option A (EMR) requires cluster management.

Full explanation →

964

MCQmedium

A company stores sensitive data in an Amazon S3 bucket. A compliance requirement mandates that all data must be encrypted at rest with a key that is automatically rotated every year. The company also needs to maintain an audit trail of who used the key. Which solution meets these requirements?

A.Use AWS KMS customer managed keys (SSE-KMS) with automatic key rotation enabled.

B.Use customer-provided encryption keys (SSE-C) and rotate keys manually.

C.Use S3 managed keys (SSE-S3) and enable S3 server access logs.

D.Configure a bucket policy to enforce encryption using the 'aws:SecureTransport' condition.

AnswerA

KMS customer managed keys support automatic annual rotation and CloudTrail auditing.

Why this answer

Option A is correct because SSE-KMS uses AWS KMS customer managed keys, which support automatic annual rotation and provide CloudTrail audit logs for key usage. Option C is wrong because SSE-S3 does not provide a key usage audit trail. Option B is wrong because SSE-C requires the customer to manage keys and rotation.

Option D is wrong because bucket policies do not encrypt data.

Full explanation →

965

Multi-Selecthard

A data engineer is designing a data pipeline that ingests JSON data from Amazon Kinesis Data Streams and processes it using AWS Lambda. The Lambda function writes the processed data to an Amazon S3 bucket. The engineer needs to ensure at-most-once processing semantics. Which TWO configurations should the engineer implement? (Choose two.)

Select 2 answers

A.Use S3 PutObject with a unique object key (e.g., include a UUID) and overwrite set to false.

B.Use DynamoDB for checkpointing to track processed records.

C.Set the Lambda function's batch size to 100 to process records in larger batches.

D.Enable function-level retries in the Lambda function for transient errors.

E.Set the Lambda function's batch size to 1 and the batch window to 0.

AnswersA, E

Unique keys prevent overwriting and duplicates.

Why this answer

Options A and E are correct. E: Setting the Lambda batch size to 1 and the batch window to 0 ensures one record per invocation, so if the function fails, it will not retry the same record (since at-most-once means no retries). A: Writing each S3 object with a unique key prevents overwriting.

Option C is wrong because increasing batch size increases the chance of partial failures. Option D is wrong because enabling retries violates at-most-once. Option B is wrong because checkpointing is for at-least-once.
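The event source mapping settings described above could be expressed roughly as follows. The keys match the keyword arguments of boto3's `lambda.create_event_source_mapping`; the stream ARN and function name are placeholders, and setting `MaximumRetryAttempts` to 0 is one assumed way to express "no retries".

```python
# Keyword arguments for lambda_client.create_event_source_mapping(**mapping).
mapping = {
    # Placeholder ARN and function name.
    "EventSourceArn": "arn:aws:kinesis:us-east-1:111122223333:stream/example-stream",
    "FunctionName": "example-processor",
    "StartingPosition": "LATEST",
    "BatchSize": 1,                       # one record per invocation
    "MaximumBatchingWindowInSeconds": 0,  # do not wait to fill a batch
    "MaximumRetryAttempts": 0,            # assumed: drop a failed record, never retry
}

print(mapping)
```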

Full explanation →

966

Multi-Selecteasy

A company needs to audit access to their Amazon S3 buckets. Which TWO services can be used together to achieve this? (Choose two.)

Select 2 answers

A.Amazon Macie

B.Amazon S3 Inventory

C.Amazon CloudWatch Logs

D.AWS Config

E.AWS CloudTrail

AnswersC, E

CloudWatch Logs can store and monitor CloudTrail logs for access patterns.

Why this answer

CloudTrail records S3 API calls, and CloudWatch Logs can be used to store and monitor those logs. Config records configuration changes, not data access. S3 server access logs record object-level access, but the question asks for auditing access; CloudTrail with CloudWatch Logs is a common solution.

S3 Inventory provides metadata, not access logs.

Full explanation →

967

MCQmedium

A company uses AWS Glue to transform data in S3. The transformation job reads Parquet files, filters rows, and writes to another S3 bucket. The job takes longer than expected. Which change would MOST likely reduce the job execution time?

A.Use a single large file instead of multiple small files.

B.Reduce the number of partitions in the output data.

C.Convert the input files from Parquet to CSV format.

D.Increase the number of DPUs allocated to the Glue job.

AnswerD

More DPUs allow more parallel processing, reducing runtime.

Why this answer

Option D is correct because AWS Glue jobs can benefit from increased DPU allocation for parallel processing. Option C is wrong because converting to CSV would increase size and likely slow processing. Option B is wrong because reducing the number of partitions reduces parallelism, increasing time.

Option A is wrong because using a single file does not improve parallelism.

Full explanation →

968

MCQhard

A company uses AWS Glue to process data from Amazon RDS MySQL into Amazon S3. The Glue job uses a JDBC connection and runs on a schedule. Recently, the job has been failing with a 'Communications link failure' error. The RDS instance is in a private subnet. Which troubleshooting step should the data engineer take FIRST?

A.Check the Glue job's DPU allocation; increase if too low.

B.Review the Glue job script for data type mismatches.

C.Verify that the Glue job's VPC subnet and security group allow outbound traffic to RDS.

D.Increase the RDS instance's max_connections parameter.

AnswerC

Network connectivity is the first thing to check for link failures.

Why this answer

Option C is correct because the Glue job's network connectivity to RDS is the likely issue; checking security groups and route tables is the first step. Option D is wrong because max_connections is a database configuration, not a network issue. Option B is wrong because data type mismatches cause parsing errors, not connection failures.

Option A is wrong because Glue resource limits would cause different errors (e.g., 'Resource exceeded').

Full explanation →

969

Multi-Selecthard

A company is running a 10-node Amazon EMR cluster to process data from Amazon S3. The cluster is using Apache Spark for transformations. The data processing is taking longer than expected. Which THREE actions can improve the performance of the Spark jobs on EMR? (Choose THREE.)

Select 3 answers

A.Reduce the number of shuffle partitions.

B.Enable dynamic allocation of executors.

C.Disable speculative execution to reduce redundant tasks.

D.Use a larger instance type for core nodes.

E.Use EMRFS consistent view to ensure data consistency.

AnswersB, D, E

Dynamic allocation allows Spark to scale resources based on workload.

Why this answer

Options B, D, and E are correct. Using EMRFS consistent view (E) improves consistency and reduces errors. Enabling dynamic allocation (B) allows Spark to adjust executor resources based on workload.

Using a larger instance type (D) can provide more memory and CPU. Option C (disabling speculation) can improve performance for some jobs but is not always beneficial. Option A (reducing shuffle partitions) can help if there are too many small partitions, but it is not a reliable improvement: in other cases, such as data skew, increasing the number of partitions is what helps.
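The Spark settings discussed above map onto standard Spark properties. Here is a sketch of an EMR `spark-defaults` configuration object that sets them. The property names are standard Spark settings, but the values (especially the shuffle partition count) are illustrative and would need tuning per workload.

```python
import json

# EMR configuration object in the "Classification"/"Properties" shape.
spark_defaults = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            # Let Spark add and remove executors as the workload changes.
            "spark.dynamicAllocation.enabled": "true",
            # The external shuffle service keeps shuffle data available
            # when idle executors are removed.
            "spark.shuffle.service.enabled": "true",
            # Illustrative value only; raise or lower based on data volume and skew.
            "spark.sql.shuffle.partitions": "400",
            # Speculative execution is workload-dependent; shown disabled here.
            "spark.speculation": "false",
        },
    }
]

print(json.dumps(spark_defaults, indent=2))
```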

Full explanation →

970

MCQmedium

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to an Amazon S3 bucket. Recently, the application has been failing with 'ResourceNotFoundException' for the S3 bucket. What is the MOST likely cause?

A.The IAM role used by the application does not have s3:PutObject permission.

B.The S3 bucket ARN is incorrectly specified in the application configuration.

C.The Flink application code specifies the wrong AWS Region for the S3 bucket.

D.The S3 bucket has versioning disabled.

AnswerA

Without write permissions, the application cannot write to S3.

Why this answer

Option A is correct because Kinesis Data Analytics needs permissions to write to the S3 bucket; if the IAM role lacks s3:PutObject, the error occurs. Option B is wrong because the bucket name is used, not ARN. Option D is wrong because writing to an S3 bucket does not require versioning.

Option C is wrong because Flink application code does not configure S3 bucket region; it's determined by the bucket location.

Full explanation →

971

MCQmedium

A company is migrating its on-premises Oracle database to Amazon RDS for Oracle. The database is 2 TB in size and has a 24-hour maintenance window. The migration must have minimal downtime. Which AWS service should be used for the migration?

A.Amazon RDS native backup and restore

B.AWS Database Migration Service (DMS)

C.Amazon S3 Transfer Acceleration

D.AWS Snowball Edge

AnswerB

DMS supports minimal downtime migrations.

Why this answer

AWS Database Migration Service (DMS) is the correct choice because it supports homogeneous migrations (Oracle to RDS for Oracle) with minimal downtime using ongoing replication via Oracle LogMiner or Binary Reader. DMS can perform a full load of the 2 TB database and then continuously replicate changes from the source to the target during the 24-hour maintenance window, allowing a final cutover with only seconds of downtime.

Exam trap

The trap here is that candidates often confuse AWS Snowball Edge for any large data migration, but Snowball is designed for offline transfers where downtime is acceptable, not for minimal-downtime online migrations that require continuous replication.

How to eliminate wrong answers

Option A is wrong because Amazon RDS native backup and restore requires creating a backup file from the on-premises Oracle database and restoring it into RDS, which involves significant downtime for the backup and restore process, and does not support ongoing replication for minimal downtime. Option C is wrong because Amazon S3 Transfer Acceleration is a service for speeding up uploads to S3 over the internet, but it does not provide database migration capabilities, schema conversion, or ongoing replication needed for a live database migration. Option D is wrong because AWS Snowball Edge is a physical data transfer device for moving large volumes of data (e.g., 2 TB) offline, which introduces days of latency for shipping and cannot achieve minimal downtime; it also lacks the ability to capture and apply ongoing transactional changes during transit.

Full explanation →

972

MCQeasy

Refer to the exhibit. A data engineer runs this CLI command on an S3 bucket. The data is ingested from multiple sources. Which AWS service would be best to process these files in a single batch transformation?

A.AWS Lambda

B.Amazon Kinesis Data Analytics

C.Amazon Athena

D.AWS Glue

AnswerD

Glue can run batch ETL jobs on multiple files.

Why this answer

Option D is correct because AWS Glue can process multiple files of varying sizes in a batch ETL job. Option C is wrong because Athena is for ad-hoc queries, not transformations. Option B is wrong because Kinesis is for streaming.

Option A is wrong because Lambda has limits on execution time and memory for large files.

Full explanation →

973

Multi-Selecteasy

A company wants to audit API calls made to its Amazon S3 buckets. Which AWS services can be used to achieve this? (Choose TWO.)

Select 2 answers

A.IAM Access Analyzer

B.AWS Config

C.VPC Flow Logs

D.AWS CloudTrail

E.Amazon S3 server access logs

AnswersD, E

CloudTrail can log S3 data events.

Why this answer

Options D and E are correct. AWS CloudTrail can log data events for S3, and S3 server access logs record detailed request information. Option B is wrong because AWS Config tracks configuration changes, not API calls.

Option A is wrong because IAM Access Analyzer reviews resource policies, not API calls. Option C is wrong because VPC Flow Logs capture network traffic, not API calls.

Full explanation →

974

Multi-Selecthard

A data engineering team is designing a batch processing workflow using AWS Glue. The job reads from an S3 bucket, transforms data, and writes to another S3 bucket. The job runs daily and processes new data incrementally. Which THREE features should they use to optimize performance and cost?

Select 3 answers

A.Convert all input data to Parquet format before processing.

B.Enable Glue job autoscaling.

C.Manually increase the number of DPUs for each run.

D.Use predicate pushdown and column pruning in the script.

E.Enable job bookmarks to process only new data.

AnswersB, D, E

Adjusts resources to workload.

Why this answer

Options B, D, and E are correct: autoscaling for dynamic resource allocation (B), predicate pushdown and column pruning for reducing data scanned (D), and job bookmarks for incremental processing (E). Option C (increasing DPUs manually) is not optimal. Option A (converting to Parquet) may help but is not a Glue feature.

Full explanation →

975

Multi-Selecteasy

A company is designing a data ingestion pipeline for real-time IoT sensor data. The data volume peaks at 10,000 messages per second. The pipeline must process messages in order per sensor and persist raw data to Amazon S3 for archival. Which TWO services should be used together to meet these requirements? (Choose TWO.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.Amazon Kinesis Data Streams

C.AWS Lambda

D.Amazon AppFlow

E.Amazon Simple Queue Service (Amazon SQS)

AnswersA, B

Delivers streaming data to S3.

Why this answer

Options A and B are correct. Kinesis Data Streams preserves order within a shard, so using the sensor ID as the partition key keeps each sensor's records in order, and Kinesis Data Firehose delivers to S3. Option E is wrong because SQS does not guarantee order per sensor.

Option D is wrong because AppFlow is for SaaS integration. Option C is wrong because Lambda can process but not persist to S3 efficiently for archival.
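Per-sensor ordering follows from how partition keys map to shards: Kinesis hashes the partition key with MD5 into a 128-bit space, and each shard owns a range of that space. The sketch below imitates that mapping, assuming the shards split the hash space evenly (real streams may have uneven ranges after resharding). The sensor IDs are made up, and the shard estimate assumes the provisioned-mode write limit of 1,000 records per second per shard.

```python
import hashlib
import math

HASH_SPACE = 2**128  # MD5 output interpreted as a 128-bit integer


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumes the shards divide the hash space into equal ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


# Rough lower bound on shards for the peak rate in the question.
PEAK_RECORDS_PER_SECOND = 10_000
RECORDS_PER_SHARD_PER_SECOND = 1_000
min_shards = math.ceil(PEAK_RECORDS_PER_SECOND / RECORDS_PER_SHARD_PER_SECOND)

for sensor_id in ["sensor-1", "sensor-2", "sensor-1"]:  # hypothetical IDs
    print(sensor_id, "->", shard_for(sensor_id, min_shards))
```

Because the same sensor ID always hashes to the same shard, all of that sensor's records share one ordered sequence.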

Full explanation →
