Knowledge + Practice

AWS Certified Data Engineer Associate (DEA-C01): Questions 601–675

1786 questions total · 24 pages · All types, answers revealed

Page 9 of 24

601

MCQeasy

A data engineer needs to monitor the number of Amazon S3 PUT requests that result in a 403 AccessDenied error. Which CloudWatch metric and dimension should be used?

A.NumberOfObjects metric with the ObjectType dimension.

B.BucketSizeBytes metric with the StorageType dimension.

C.4xxErrors metric with the FilterId dimension set to '403'

D.AllRequests metric with the BucketName dimension.

AnswerC

The 4xxErrors metric counts client-error responses, which include 403 AccessDenied.

Why this answer

The correct answer is C because `4xxErrors` is the Amazon S3 request metric that counts HTTP 4xx status code responses, including 403 AccessDenied. S3 publishes request metrics per metrics configuration, identified by the `FilterId` dimension alongside `BucketName`. Note that a `FilterId` names a request-metrics filter (scoped by prefix, tag, or access point) rather than an HTTP status code, so the metric also includes other 4xx responses such as 404. Of the options listed, C is the only one that tracks client errors.

Exam trap

The trap here is that candidates treat `4xxErrors` as a dedicated 403 counter. It counts every 4xx response, and the `FilterId` dimension selects a request-metrics configuration rather than a status code, so the scope of that configuration determines which requests are measured.

How to eliminate wrong answers

Option A is wrong because `NumberOfObjects` with `ObjectType` dimension tracks the count of objects per storage class (e.g., Standard, Glacier), not error responses. Option B is wrong because `BucketSizeBytes` with `StorageType` dimension measures bucket storage size, not request errors. Option D is wrong because `AllRequests` with `BucketName` dimension counts all requests (including successful ones) but does not filter by HTTP status code, so it cannot isolate 403 errors.
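
As a rough illustration of the metric and dimensions discussed above, the sketch below builds the parameters for a CloudWatch GetMetricStatistics request. The bucket name `example-bucket` and the filter ID `EntireBucket` are assumptions for illustration; request metrics would have to be enabled on the bucket under that filter ID for any data to exist.

```python
from datetime import datetime, timedelta, timezone


def s3_4xx_query(bucket: str, filter_id: str, hours: int = 1) -> dict:
    """Build keyword arguments for a CloudWatch GetMetricStatistics call."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/S3",
        "MetricName": "4xxErrors",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            # FilterId is the ID of an S3 request-metrics configuration (assumed name).
            {"Name": "FilterId", "Value": filter_id},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # five-minute buckets
        "Statistics": ["Sum"],
    }


params = s3_4xx_query("example-bucket", "EntireBucket")
# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```

The returned sum covers all 4xx responses in scope, so isolating 403s specifically would need another source such as S3 server access logs.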

602

MCQeasy

Refer to the exhibit. An IAM policy is attached to a user. What is the security implication of this policy?

A.The policy only allows read access.

B.The policy is invalid because it uses asterisks.

C.The policy is too restrictive.

D.The policy grants excessive permissions, violating least privilege.

AnswerD

It grants full S3 access to all resources.

Why this answer

Option D is correct because the policy grants full S3 access to all resources, which violates the principle of least privilege. Option A is wrong because the policy allows far more than read access. Option B is wrong because wildcards (asterisks) are valid IAM policy syntax. Option C is wrong because the policy is overly permissive, not too restrictive.
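
The exhibit is not reproduced on this page, so the sketch below uses a hypothetical policy that matches the description "full S3 access to all resources". It shows one simple way to flag statements that combine a wildcard action with a wildcard resource; the function name and the policy are illustrative assumptions, not an AWS tool.

```python
def overly_broad_statements(policy: dict) -> list:
    """Return Allow statements whose actions and resources both use broad wildcards."""
    def as_list(value):
        return value if isinstance(value, list) else [value]

    flagged = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources
        if broad_action and broad_resource:
            flagged.append(stmt)
    return flagged


# Hypothetical policy standing in for the exhibit.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
flagged = overly_broad_statements(example_policy)
```

A least-privilege version would instead list specific actions (for example `s3:GetObject`) against specific bucket ARNs.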

603

MCQmedium

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The data is JSON formatted and includes a timestamp field. The company wants to partition the output in Amazon S3 by date and hour, and ensure exactly-once processing semantics. Which combination of configurations should be used?

A.Disable checkpointing and use the 'exactly_once' delivery option in Kinesis Data Streams.

B.Enable checkpointing in the AWS Glue streaming job and specify an S3 location for checkpoint data.

C.Use Amazon DynamoDB as a checkpoint store by configuring the Glue job with a DynamoDB connection.

D.Use Kinesis Client Library (KCL) checkpointing with a DynamoDB table.

AnswerB

Glue streaming jobs support checkpointing to S3 for exactly-once processing.

Why this answer

Option B is correct because AWS Glue streaming jobs require checkpointing to track the progress of data consumption from Kinesis Data Streams and to ensure exactly-once processing semantics. By enabling checkpointing and specifying an S3 location, Glue periodically saves the state of processed records, allowing it to resume from the last committed offset in case of failures, thus preventing duplicates or data loss.

Exam trap

The trap here is that candidates confuse the checkpointing mechanism of AWS Glue (which uses S3) with the Kinesis Client Library (KCL) pattern (which uses DynamoDB), leading them to select option D or C, even though Glue streaming jobs do not support DynamoDB for checkpointing.

How to eliminate wrong answers

Option A is wrong because disabling checkpointing removes the mechanism for tracking processed records, making exactly-once semantics impossible; in addition, Kinesis Data Streams has no 'exactly_once' delivery option, since it provides at-least-once delivery to consumers. Option C is wrong because AWS Glue streaming jobs do not support DynamoDB as a checkpoint store; they only support S3 for checkpoint data. Option D is wrong because Kinesis Client Library (KCL) checkpointing with DynamoDB is a pattern for custom applications, not for AWS Glue streaming jobs, which manage checkpointing internally via S3.
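
The question also asks for output partitioned by date and hour. As a minimal sketch, assuming the timestamp field holds epoch seconds in UTC and assuming a `date=`/`hour=` prefix layout (both are illustrative choices, not something the question specifies), the partition prefix could be derived like this:

```python
from datetime import datetime, timezone


def partition_prefix(epoch_seconds: int, base: str = "output") -> str:
    """Map a record timestamp to a date/hour partition prefix."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}/date={ts:%Y-%m-%d}/hour={ts:%H}/"


prefix = partition_prefix(1700000000)
```

In a Glue streaming job the same idea is usually expressed by adding date and hour columns to each micro-batch and writing with those columns as partition keys.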

604

MCQmedium

A data engineer is troubleshooting a failed AWS Glue ETL job that reads from and writes to the S3 bucket 'example-bucket'. The job's IAM role has the policy shown in the exhibit. The job fails with an Access Denied error when writing to a prefix 'output/'. Which permission is MISSING?

A.s3:PutObjectAcl

B.s3:GetBucketAcl

C.s3:ListBucket on the output prefix

D.s3:DeleteObject

AnswerD

Glue often deletes temporary files and may need DeleteObject permission.

Why this answer

Option D is correct because the exhibit shows no s3:DeleteObject permission. The policy allows PutObject and GetObject on 'example-bucket/*', which covers the 'output/' prefix, and ListBucket on the bucket itself, so none of those is the gap. Glue jobs commonly write temporary files and delete them during cleanup, so a missing delete permission can surface as Access Denied while writing output.

Option A is wrong because s3:PutObjectAcl is only needed when the job sets object ACLs. Option B is wrong because s3:GetBucketAcl is not needed for writing objects. Option C is wrong because s3:ListBucket is already granted on the bucket, which covers the output prefix.

605

MCQhard

A company uses Amazon Redshift for a data warehouse. They notice that queries are slow due to heavy data skew. Which optimization technique should be applied first?

A.Configure workload management (WLM) queues

B.Define sort keys on frequently filtered columns

C.Set an appropriate distribution style

D.Apply compression encodings to columns

AnswerC

Correct distribution style reduces data skew and improves query performance.

Why this answer

Data skew occurs when rows are distributed unevenly across Redshift slices, causing some nodes to process far more data than others. Setting an appropriate distribution style (e.g., KEY, EVEN, or ALL) redistributes the data to balance the workload, directly addressing the root cause of the slowness. This is the first optimization to apply because skew is a fundamental distribution issue that other tuning steps cannot fix.

Exam trap

The trap here is that candidates often confuse distribution skew with sort key optimization or compression, mistakenly believing that improving data organization on disk (sort keys) or reducing I/O (compression) will fix uneven data distribution across nodes.

How to eliminate wrong answers

Option A is wrong because WLM queues manage concurrency and memory allocation for query slots, not the physical distribution of data across nodes; they cannot fix performance degradation caused by data skew. Option B is wrong because sort keys optimize the order of data on disk to improve range-restricted scans and merge joins, but they do not redistribute data or alleviate skew across slices. Option D is wrong because compression encodings reduce storage footprint and I/O by compressing column data, but they have no effect on how rows are distributed across nodes or on query parallelism.
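
To make "skew" concrete, the sketch below computes a simple skew ratio from per-slice row counts: the fullest slice divided by the average slice. The row counts are made-up numbers, and the function is an illustration rather than a Redshift API; in practice the counts would come from the cluster's system tables.

```python
def skew_ratio(rows_per_slice: list) -> float:
    """Ratio of the fullest slice to the average slice; 1.0 means perfectly even."""
    if not rows_per_slice or sum(rows_per_slice) == 0:
        raise ValueError("need at least one non-empty slice")
    mean = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / mean


even = skew_ratio([250, 250, 250, 250])    # balanced distribution
skewed = skew_ratio([700, 100, 100, 100])  # one slice holds most rows
```

Because a query finishes only when its slowest slice finishes, a high ratio means one slice dominates run time no matter how the sort keys, compression, or WLM queues are configured.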

606

Multi-Selecteasy

A company is using AWS Glue ETL jobs to process data from Amazon S3 and write results back to S3. The jobs are failing intermittently with 'ThrottlingException' errors. Which TWO configurations would help reduce these errors?

Select 2 answers

A.Decrease the number of DPUs for the job.

B.Enable GZIP compression on the output data.

C.Add retry logic with exponential backoff in the job script.

D.Change the job type from Spark to Python shell.

E.Increase the number of DPUs for the job.

AnswersC, E

Retries handle transient throttling gracefully.

Why this answer

Option C is correct because retry logic with exponential backoff in the job script handles transient throttling. Option E is correct because increasing the number of DPUs may distribute the load and reduce throttling. Option A is wrong because decreasing the number of DPUs increases the chance of throttling. Option B is wrong because compression doesn't reduce API calls. Option D is wrong because the job type is not related to the throttling.
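
A minimal sketch of retry logic with exponential backoff is shown below. `ThrottlingError` is a hypothetical stand-in for whatever exception the client library raises on throttling, and the delay values are arbitrary assumptions.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for a service throttling error."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func, retrying on ThrottlingError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The random jitter spreads retries from parallel workers so they do not all hit the service again at the same instant.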

607

Multi-Selectmedium

A company stores sensitive data in Amazon S3. The data engineer needs to implement a solution that automatically detects and redacts PII in new objects as they are uploaded. Which TWO AWS services should be used together?

Select 2 answers

A.AWS Glue ETL

B.Amazon Macie

C.Amazon DynamoDB

D.Amazon Comprehend

E.AWS Glue Data Catalog

AnswersB, D

Detects PII in S3.

Why this answer

Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect sensitive data in Amazon S3. Amazon Comprehend is a natural language processing (NLP) service that can be used to detect and redact PII entities from text. Together, they enable automated detection and redaction of PII in newly uploaded S3 objects by triggering Macie to identify sensitive data and then using Comprehend to redact the PII.

Exam trap

AWS often tests the distinction between data discovery (Macie) and data processing/redaction (Comprehend), leading candidates to incorrectly select only Macie or to confuse Glue ETL as a redaction tool.
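
The redaction step can be sketched as follows. The entity list is hand-written sample data in the shape that Amazon Comprehend's PII detection returns (`Type`, `BeginOffset`, `EndOffset`); no AWS call is made here, and the sample text and offsets are assumptions for illustration.

```python
def redact(text: str, entities: list) -> str:
    """Replace each detected span with a [TYPE] placeholder."""
    # Work from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: ent["BeginOffset"]] + f"[{ent['Type']}]" + text[ent["EndOffset"]:]
    return text


sample = "Contact Jane Doe at jane@example.com"
entities = [
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36},
]
redacted = redact(sample, entities)
```

In a real pipeline, an S3 upload event would trigger the detection call and the redacted text would be written to a separate location.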

608

MCQmedium

A healthcare company uses AWS Glue to process patient data stored in Amazon S3. The data is encrypted at rest using SSE-KMS with a customer managed key. The Glue ETL job runs on a schedule and reads from an S3 bucket, transforms the data, and writes to another S3 bucket also encrypted with the same KMS key. Recently, the security team rotated the KMS key. After the rotation, the Glue job started failing with 'AccessDenied' errors when trying to read from the source bucket. The Glue job's IAM role has permissions to use the KMS key (kms:Decrypt, kms:GenerateDataKey). The S3 bucket policies allow the role to read/write. What is the MOST likely cause of the failure?

A.The KMS key rotation created a new backing key, but the Glue job's IAM role does not have permission to decrypt with the old backing key.

B.The Glue job's IAM role is missing the kms:Encrypt permission on the KMS key.

C.The Glue job is using the wrong encryption context when calling KMS.

D.The S3 bucket policy has a condition that requires the request to use the latest version of the KMS key.

AnswerA

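With automatic rotation, KMS retains the old backing keys under the same key, so earlier data stays decryptable. If the key was rotated manually by creating a new key, the old key may have been disabled or the role may lack kms:Decrypt on it. The key policy may also have been updated incorrectly during the rotation.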

Why this answer

When you rotate a customer managed KMS key, AWS KMS retains the old backing key to allow decryption of data encrypted before the rotation. However, the Glue job's IAM role must have permission to use the old backing key via the kms:Decrypt action. If the key policy or IAM policy does not explicitly allow decryption with the old backing key (or if the key policy was inadvertently updated to remove access to the old key material), the Glue job will fail with AccessDenied when reading SSE-KMS encrypted objects that were encrypted with the previous key version.

Exam trap

The trap here is that candidates assume KMS key rotation is seamless and never breaks existing access, but they overlook that the IAM role or key policy must still grant kms:Decrypt on the key resource, and that the old backing key remains in use for previously encrypted data.

How to eliminate wrong answers

Option B is wrong because the Glue job is failing on read (decrypt), not write; the error occurs when reading from the source bucket, so missing kms:Encrypt would only affect writes to the destination bucket. Option C is wrong because the encryption context is set by the S3 service when the object was uploaded; the Glue job does not control the encryption context used during decryption, and a mismatch would cause a different error (e.g., InvalidCiphertextException), not AccessDenied. Option D is wrong because S3 bucket policies cannot require the request to use the latest version of a KMS key; KMS key versioning is transparent to S3 policies, and there is no such condition key in S3 bucket policies.

609

Multi-Selecthard

Which THREE factors should a data engineer consider when choosing between Amazon S3 and Amazon DynamoDB for storing time-series data? (Choose three.)

Select 3 answers

A.Required query complexity (simple key lookups vs. range scans)

B.Application latency requirements

C.Cost per GB of storage

D.Data access patterns (random vs. sequential)

E.Total data volume

AnswersA, B, D

DynamoDB excels at key lookups; S3 is better for scans.

Why this answer

Options A, B, and D are correct because query complexity, latency requirements, and data access patterns are the key factors. Option C (cost per GB) is a consideration but not specific to time-series data. Option E (total data volume) is less relevant because both services scale.

610

MCQmedium

A company uses Amazon Redshift for its data warehouse. The data engineer notices that queries are running slower than expected. The system administrator reports that the cluster's disk space is 80% full. Which action should the engineer take to improve query performance?

A.Redesign the sort keys to optimize query performance.

B.Run the VACUUM command to reclaim space.

C.Add more nodes to the cluster to increase storage and compute capacity.

D.Enable concurrency scaling to handle more queries.

AnswerC

Adding nodes increases both storage and compute, improving performance.

Why this answer

When a Redshift cluster's disk space is 80% full, query performance degrades because Redshift relies on large sequential I/O operations, and high disk utilization forces more random I/O and increases the likelihood of spilling to disk. Adding nodes increases both storage capacity and compute resources, directly alleviating the I/O bottleneck and improving query throughput. This is the recommended scaling action when disk space exceeds 70-80% utilization.

Exam trap

The trap here is that candidates often confuse the symptom (slow queries) with a need for sort key optimization or vacuuming, when the root cause is insufficient storage capacity causing I/O bottlenecks, which only adding nodes can resolve.

How to eliminate wrong answers

Option A is wrong because redesigning sort keys optimizes data ordering and block pruning for specific query patterns, but it does not address the fundamental issue of insufficient storage capacity causing I/O contention. Option B is wrong because the VACUUM command reclaims space from deleted rows and sorts data, but it does not increase total disk capacity; with the disk 80% full, vacuuming may only recover a small amount of space and will not resolve the performance degradation caused by high disk utilization. Option D is wrong because concurrency scaling adds transient compute capacity to handle increased query concurrency, but it does not increase the primary cluster's storage or reduce disk space pressure; performance issues from disk fullness persist even with concurrency scaling enabled.

611

MCQhard

A company runs a nightly AWS Glue ETL job that writes results to an Amazon Redshift table using the JDBC connector. Recently, the job has been failing with the error 'ERROR: connection to server at ... failed: server closed the connection unexpectedly'. The Redshift cluster is in a private subnet with a VPC endpoint for S3. The Glue job runs in the same VPC with enhanced VPC routing enabled. Which is the most likely cause?

A.The JDBC driver is missing the 'redshift' compatibility mode setting.

B.SSL is not enabled on the Redshift cluster.

C.The Glue job's security group does not allow outbound traffic to the Redshift cluster.

D.AWS Glue does not support Redshift as a data source.

AnswerA

Setting 'redshift' compatibility in the JDBC URL (e.g., '?compatible=redshift') ensures proper handling of Redshift-specific features and prevents unexpected disconnections.

Why this answer

Option A is correct because Redshift compatibility mode in the JDBC driver is required for Glue to properly handle connections; without it, the driver may close the connection unexpectedly. Option B is wrong because Redshift does not enforce SSL by default unless configured. Option C is wrong because the Glue job is in the same VPC and uses enhanced VPC routing, so the connection should work. Option D is wrong because Glue supports Redshift as a data source.

612

MCQmedium

A company uses AWS Glue to process sensitive data stored in S3. The security team requires that all data be encrypted at rest using customer-managed KMS keys. The data engineers are encountering 'Access Denied' errors when running Glue ETL jobs. What is the most likely cause?

A.The Glue service role does not have kms:Decrypt and kms:Encrypt permissions for the KMS key.

B.The KMS key policy does not allow the AWS Glue service to use the key.

C.The Glue Data Catalog is encrypted with a different KMS key.

D.The S3 bucket policy denies access to the Glue service role.

AnswerA

Glue needs KMS permissions to decrypt objects from S3 and encrypt output.

Why this answer

Option A is correct because the Glue service role must have kms:Decrypt and kms:Encrypt permissions for the KMS key to read and write encrypted data in S3. Option B is incorrect because the key is used by the job's IAM role, and access for that role can be granted through the key policy or IAM policies. Option C is incorrect because the Data Catalog's encryption key protects catalog metadata, not the S3 objects the job reads and writes. Option D is incorrect because, although S3 bucket policies can restrict access, the error is more likely due to missing KMS permissions.

613

MCQeasy

A data engineer needs to store semi-structured JSON logs from AWS CloudTrail. The logs are append-only and rarely accessed after 90 days. Which storage solution is MOST cost-effective?

A.Amazon S3 Glacier Deep Archive

B.Amazon S3 Standard

C.Amazon EBS with cold HDD volumes

D.Amazon DynamoDB with on-demand capacity

AnswerA

Glacier Deep Archive offers the lowest cost for long-term archival data.

Why this answer

Amazon S3 Glacier Deep Archive is the most cost-effective storage solution for CloudTrail logs that are append-only and rarely accessed after 90 days. It offers the lowest storage cost among AWS options (approximately $0.00099 per GB/month) and is designed for data that is accessed at most once or twice per year, with retrieval times of 12–48 hours. Since the logs are rarely accessed after 90 days, the retrieval latency is acceptable, and the cost savings over S3 Standard (which costs ~$0.023 per GB/month) are substantial.

Exam trap

The trap here is that candidates may choose S3 Standard or DynamoDB because they assume CloudTrail logs need frequent querying, but the question explicitly states 'rarely accessed after 90 days,' making Glacier Deep Archive the correct cost-optimal choice despite its longer retrieval time.

How to eliminate wrong answers

Option B is wrong because Amazon S3 Standard is designed for frequently accessed data and costs significantly more than Glacier Deep Archive, making it cost-inefficient for data that is rarely accessed after 90 days. Option C is wrong because Amazon EBS with cold HDD volumes (sc1) is a block storage service intended for attached EC2 instances, not for storing append-only logs as a standalone object store; it also incurs per-GB costs and requires managing EC2 instances, leading to higher total cost and complexity. Option D is wrong because Amazon DynamoDB with on-demand capacity is a NoSQL database optimized for low-latency queries and high-frequency access, not for cost-effective archival of append-only logs; its storage cost ($0.25 per GB/month) is orders of magnitude higher than Glacier Deep Archive, and it is not designed for infrequent access patterns.
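
The price gap can be worked through with the per-GB figures quoted in the explanation above. The 1,000 GB volume is an assumed example, and actual prices vary by Region and change over time.

```python
# Approximate storage prices per GB-month, as quoted in the explanation.
PRICE_PER_GB_MONTH = {
    "S3 Glacier Deep Archive": 0.00099,
    "S3 Standard": 0.023,
    "DynamoDB": 0.25,
}


def monthly_cost(gb: float) -> dict:
    """Monthly storage cost in USD for each option, rounded to cents."""
    return {name: round(gb * price, 2) for name, price in PRICE_PER_GB_MONTH.items()}


costs = monthly_cost(1000)  # assumed volume: 1,000 GB of logs
```

This compares storage charges only; request, retrieval, and minimum-storage-duration charges are left out of the sketch.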

614

Multi-Selecthard

A company is building a data lake on S3. They have a large volume of CSV files (hundreds of GB) in a source bucket. They need to convert them to Parquet, partition by date, and ensure the data is encrypted at rest with SSE-KMS. The pipeline must be triggered automatically when new files arrive. Which THREE steps should be part of the solution? (Choose THREE.)

Select 3 answers

A.Configure S3 Event Notification to send events to an SQS queue

B.Use Amazon Kinesis Data Firehose to ingest new files

C.Create an AWS Glue ETL job that converts to Parquet and partitions by date

D.Use Amazon Athena CTAS query to convert files in batch

E.Configure the Glue job to use a KMS key for server-side encryption in S3

AnswersA, C, E

SQS can buffer events and trigger a Lambda or Step Functions workflow.

Why this answer

Option A (S3 Event Notification to SQS) can trigger the pipeline. Option C (Glue ETL job) can convert the files to Parquet and partition them by date. Option E (KMS key for server-side encryption) ensures SSE-KMS on the output. Option D (Athena CTAS) is not triggered by events. Option B (Kinesis Data Firehose) is for streaming records, not for files that already arrive in S3.
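
For the trigger step, a consumer of the SQS queue has to pull the bucket and key out of each S3 event notification. The sketch below parses a message body of that shape; the bucket and key values are made-up examples, and it assumes S3 delivers directly to SQS (no SNS wrapper in between).

```python
import json
from urllib.parse import unquote_plus


def object_locations(sqs_body: str) -> list:
    """Extract (bucket, key) pairs from an S3 event notification body."""
    event = json.loads(sqs_body)
    return [
        # Object keys arrive URL-encoded in the notification, so decode them.
        (rec["s3"]["bucket"]["name"], unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
    ]


# Hypothetical notification for a newly uploaded CSV file.
sample_body = json.dumps({
    "Records": [{"s3": {"bucket": {"name": "source-bucket"},
                        "object": {"key": "incoming/2024-01-15/sales+report.csv"}}}]
})
locations = object_locations(sample_body)
```

The resulting pairs could then be passed as arguments when starting the Glue job.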

615

MCQmedium

A data engineer is designing a data lake on Amazon S3. The data ingestion pipeline must handle both batch and streaming data. The engineer wants to use a single service to ingest both types of data. Which service should the engineer choose?

A.Amazon Athena

B.Amazon Kinesis Data Firehose

C.S3 Transfer Acceleration

D.AWS Glue

AnswerB

Firehose can ingest streaming data and deliver to S3 in near real-time; batch data can be sent via Firehose API.

Why this answer

Option B is correct because Amazon Kinesis Data Firehose can ingest streaming data and can also be used for batch delivery by putting records in batches. Option A is wrong because Amazon Athena queries data and does not ingest it. Option C is wrong because S3 Transfer Acceleration speeds up uploads but does not handle streaming. Option D is wrong because AWS Glue is primarily an ETL service for transforming data rather than a managed ingestion endpoint.

616

MCQhard

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data. A Lambda function processes each record. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' when writing results to a DynamoDB table. The data engineer has already increased the DynamoDB write capacity. What else can the engineer do to resolve the issue?

A.Increase the Lambda function memory.

B.Increase the DynamoDB read capacity units.

C.Decrease the Lambda batch size to 1.

D.Increase the number of shards in the Kinesis stream.

AnswerD

More shards distribute the load across more Lambda invocations.

Why this answer

Option D is correct because increasing the number of shards increases the number of concurrent Lambda invocations, reducing the batch size per invocation and easing pressure on DynamoDB. Option A is incorrect because increasing Lambda memory does not directly help with throttling. Option B is incorrect because increasing DynamoDB read capacity does not affect writes. Option C is incorrect because reducing the batch size may increase throttling frequency.

617

MCQhard

A data engineer is monitoring an Amazon Redshift cluster and notices that the 'WLM query wait time' metric is consistently high during peak hours. The cluster uses automatic WLM. The engineer wants to reduce query wait times without changing the cluster size. Which action is MOST effective?

A.Enable concurrency scaling.

B.Change WLM to manual mode and increase the number of queues.

C.Increase the maximum number of queries per queue.

D.Enable short query acceleration (SQA).

AnswerA

Concurrency scaling adds capacity to handle concurrent queries.

Why this answer

Option A is correct because enabling concurrency scaling adds transient cluster capacity to handle bursts. Option B is wrong because manual WLM requires tuning queues. Option C is wrong because increased concurrency without added resources may worsen contention. Option D is wrong because short query acceleration helps with short queries, not necessarily overall wait times.

618

MCQhard

A data engineer is troubleshooting a DMS task that is replicating data from an on-premises Oracle database to an RDS for MySQL instance. The task is failing with 'ORA-1555: snapshot too old' error. What is the best course of action?

A.Disable full supplemental logging on the source tables.

B.Increase the size of the redo logs on the source database.

C.Enable batch optimized apply on the DMS task.

D.Increase the UNDO tablespace size and set UNDO_RETENTION to a higher value.

AnswerD

This gives the CDC process enough undo to read consistent snapshots.

Why this answer

Option D is correct because the snapshot too old error occurs when UNDO segments are too small or retention is too short for the long-running read needed by DMS. Increasing the UNDO tablespace size and retention time allows the task to read consistent old data. Option A is wrong because disabling supplemental logging would lose CDC capability. Option B is wrong because increasing redo logs does not prevent snapshot too old errors. Option C is wrong because batch optimized apply changes how changes are applied to the target and does not address the root cause on the source.

619

MCQhard

A data engineering team is managing an Amazon Redshift cluster that is used for BI reporting. The cluster has a mix of large tables (some over 1 TB) and many smaller tables. The team notices that queries on a large fact table are slow. The fact table is distributed using KEY distribution on the customer_id column, which has high cardinality. The team wants to improve query performance. They have the option to change the distribution style and sort key. Which redesign should they implement?

A.Keep the distribution style as AUTO and set the sort key to customer_id.

B.Change the distribution style to ALL and set the sort key to customer_id.

C.Change the distribution style to KEY on a different column with high cardinality.

D.Change the distribution style to EVEN and set the sort key to a date column used in WHERE clauses.

AnswerD

EVEN distributes evenly; sort key on date improves query performance.

Why this answer

Option D is correct because using EVEN distribution ensures data is evenly distributed across all nodes, avoiding data skew that can occur with KEY distribution on a high-cardinality column like customer_id. Setting the sort key to a date column used in WHERE clauses enables range-restricted scans, significantly reducing the amount of data scanned for common BI queries that filter by date. This combination improves query performance by maximizing parallelism and minimizing I/O.

Exam trap

The trap here is that candidates often assume KEY distribution on a high-cardinality column is optimal for large tables, but they overlook that even high-cardinality keys can cause severe data skew if the distribution key values are not uniformly distributed across nodes, leading to poor query performance.

How to eliminate wrong answers

Option A is wrong because AUTO distribution may default to KEY on customer_id, which already causes data skew and slow performance, and setting the sort key to customer_id does not address the distribution imbalance. Option B is wrong because ALL distribution copies the entire table to every node, which is impractical for a 1 TB fact table due to excessive storage and maintenance overhead, and it does not improve scan efficiency for large tables. Option C is wrong because changing the KEY distribution to a different high-cardinality column does not guarantee even distribution and may still lead to skew; the core issue is that KEY distribution on a high-cardinality column does not inherently balance data across slices.

620

MCQmedium

A data engineer needs to transfer 10 TB of data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited to 100 Mbps, and the transfer must be completed within 48 hours. Which solution meets the requirements?

A.Use AWS DataSync to transfer data online

B.Use AWS Snowball Edge device to transfer data offline

C.Use S3 Transfer Acceleration over the internet

D.Set up AWS Direct Connect to increase bandwidth

AnswerB

Snowball Edge can transfer 10 TB offline within days.

Why this answer

Option B is correct because AWS Snowball Edge can transfer large data volumes offline within the time frame despite low bandwidth. Option A (AWS DataSync) still depends on network bandwidth. Option C (S3 Transfer Acceleration) improves speed but cannot push more than the 100 Mbps link allows. Option D (Direct Connect) would require setup time.
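
The bandwidth constraint can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a link that sustains its full rated 100 Mbps with no protocol overhead, which is optimistic.

```python
def transfer_hours(terabytes: float, megabits_per_second: float) -> float:
    """Hours needed to move the data at a fully utilised link speed."""
    bits = terabytes * 1e12 * 8                       # decimal terabytes to bits
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600


hours = transfer_hours(10, 100)  # roughly 222 hours, a little over 9 days
```

Even in this best case the online transfer takes several times longer than the 48-hour window, which rules out every option that relies on the existing link.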

621

Multi-Selecthard

Which THREE of the following are benefits of using Amazon DynamoDB Accelerator (DAX)? (Choose three.)

Select 3 answers

A.Offloads read traffic from the DynamoDB table.

B.Improves write throughput by batching writes.

C.Reduces read latency from single-digit milliseconds to microseconds.

D.Supports write-through caching to improve write performance.

E.Provides in-memory caching for DynamoDB tables.

AnswersA, C, E

DAX handles read requests, reducing load on the table.

Why this answer

Option A is correct because DAX acts as a read-through cache that offloads read traffic from the DynamoDB table, reducing the number of read requests that hit the underlying table and thus lowering the consumed read capacity units (RCUs). This allows the table to handle more concurrent reads without scaling up provisioned capacity.

Exam trap

The trap here is that candidates often assume DAX improves write performance. DAX does pass writes through to the table, but it does not accelerate or batch them; its benefit is on the read path.
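
The read-offloading idea can be illustrated with a tiny read-through cache. This is a conceptual sketch only, not the DAX client API: the class name is hypothetical, a plain dictionary stands in for the table, and there is no TTL or eviction.

```python
class ReadThroughCache:
    """Serve repeated reads from memory and count how often the table is hit."""

    def __init__(self, load_from_table):
        self._load = load_from_table
        self._items = {}
        self.table_reads = 0

    def get(self, key):
        if key not in self._items:
            # Cache miss: read from the backing table once and remember the item.
            self.table_reads += 1
            self._items[key] = self._load(key)
        return self._items[key]


table = {"user#1": {"name": "Ana"}}  # stand-in for a DynamoDB table
cache = ReadThroughCache(table.__getitem__)
first = cache.get("user#1")   # miss: goes to the table
second = cache.get("user#1")  # hit: served from memory
```

Only the first read reaches the table, which is how a cache in front of DynamoDB lowers consumed read capacity for hot keys.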

622

MCQeasy

A data engineer needs to transform a large dataset stored in Amazon S3 using Apache Spark. The engineer wants to minimize startup time and use a serverless approach. Which AWS service should the engineer use?

A.Amazon Redshift

B.Amazon EMR

C.AWS Glue

D.Amazon Athena

AnswerC

Serverless Spark with fast startup.

Why this answer

Option C is correct because AWS Glue provides a serverless Spark environment with fast startup. Option A is wrong because Amazon Redshift is a data warehouse, not a Spark environment. Option B is wrong because Amazon EMR requires cluster provisioning. Option D is wrong because Amazon Athena is for querying, not transforming with Spark.

623

MCQeasy

A data engineering team needs to ingest streaming data from an application into Amazon S3 for analytics. The data volume is moderate and the team wants the lowest operational overhead. Which AWS service should they use?

A.Amazon SQS

B.AWS Glue

C.Amazon Kinesis Data Streams

D.Amazon Kinesis Data Firehose

AnswerD

Fully managed, automatically writes streaming data to S3.

Why this answer

Option D is correct because Amazon Kinesis Data Firehose is a fully managed service for loading streaming data into S3 with no code required. Option A is wrong because Amazon SQS is a message queue, not optimized for streaming analytics destinations. Option B is wrong because AWS Glue is an ETL service that requires authoring and running jobs, which adds operational overhead. Option C is wrong because Amazon Kinesis Data Streams requires custom consumers.

Full explanation →

624

MCQeasy

A company wants to migrate its on-premises MySQL database to Amazon RDS for MySQL with minimal downtime. Which AWS service should be used for the migration?

A.AWS Database Migration Service (DMS)

B.AWS Schema Conversion Tool (SCT)

C.AWS DataSync

D.AWS Direct Connect

AnswerA

Supports minimal downtime via ongoing replication.

Why this answer

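Option A is correct because AWS Database Migration Service (DMS) supports continuous replication for minimal downtime. Option B is wrong because SCT helps with schema conversion but does not perform the migration. Option C is wrong because DataSync transfers files and objects, not live databases. Option D is wrong because Direct Connect only provides network connectivity.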

Full explanation →

625

MCQhard

A data engineer is migrating an on-premises Apache HBase workload to Amazon DynamoDB. The HBase table has a row key with composite structure: customer_id (10 chars) + timestamp (10 digits). The access pattern is to query by customer_id and retrieve the latest entries. How should the DynamoDB table be designed to optimize performance?

A.Create a table with partition key = customer_id and sort key = timestamp.

B.Use Amazon S3 with customer_id as prefix and timestamp as object name.

C.Create a table with partition key = concatenated customer_id and timestamp.

D.Create a table with partition key = timestamp and sort key = customer_id.

AnswerA

Allows querying by customer_id and sorting by timestamp to get latest entries.

Why this answer

Option A is correct because using customer_id as the partition key and timestamp as the sort key allows efficient queries for the latest entries per customer. Option B is incorrect because S3 is not suitable for low-latency queries. Option C is incorrect because a single concatenated partition key cannot support range queries efficiently.

Option D is incorrect because using timestamp as the partition key leads to hot partitions.
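The design in Option A can be illustrated with the parameters of a DynamoDB Query call. This is a minimal sketch, not the exam's reference solution: the table name `CustomerEvents` and the helper `latest_entries_query` are hypothetical, and the dictionary is only built here, not sent to AWS.

```python
def latest_entries_query(customer_id: str, limit: int = 10) -> dict:
    """Build keyword arguments for a DynamoDB Query on one customer."""
    return {
        "TableName": "CustomerEvents",  # hypothetical table name
        # Equality on the partition key selects a single customer.
        "KeyConditionExpression": "customer_id = :cid",
        "ExpressionAttributeValues": {":cid": {"S": customer_id}},
        # False reads the sort key (timestamp) in descending order,
        # so the newest entries come back first.
        "ScanIndexForward": False,
        "Limit": limit,
    }


params = latest_entries_query("CUST000001", limit=5)
# With boto3 installed and credentials configured, this could be passed as
# boto3.client("dynamodb").query(**params).
```

Because the timestamp is the sort key, "latest entries" becomes a bounded read within one partition key value rather than a scan.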

Full explanation →

626

MCQhard

A data engineering team is ingesting data from multiple sources into Amazon S3 using AWS Glue ETL jobs. The jobs are failing intermittently with the error: 'Task ran out of memory'. The input data size varies widely from 100 MB to 10 GB per job. Which configuration change would best mitigate this issue?

A.Increase the number of workers in the Glue job

B.Enable job bookmarking to process only incremental data

C.Reduce the batch size in the S3 source node

D.Change the job type from Spark to Python shell

AnswerB

Bookmarking reduces the data processed each run, lowering memory requirements.

Why this answer

Option B is correct because enabling job bookmarking in Glue allows the job to process only new data, reducing memory pressure. Option A is wrong because increasing the number of workers adds parallelism but may not address memory per worker. Option C is wrong because decreasing batch size may help but could slow down processing.

Option D is wrong because switching to Python shell would reduce capability.
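As a rough illustration of the bookmarking option, Glue reads the bookmark setting from the `--job-bookmark-option` job argument. The helper name below is hypothetical and the sketch only builds the argument dictionary; in a real job the script typically also needs a `transformation_ctx` on its sources and a `job.commit()` call for bookmarks to take effect.

```python
def glue_job_arguments(enable_bookmark: bool = True) -> dict:
    """Return default arguments that switch Glue job bookmarks on or off."""
    option = "job-bookmark-enable" if enable_bookmark else "job-bookmark-disable"
    return {"--job-bookmark-option": option}


args_on = glue_job_arguments()
args_off = glue_job_arguments(enable_bookmark=False)
```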

Full explanation →

627

Multi-Selecteasy

Which TWO actions are effective ways to monitor the health of an Amazon DynamoDB table? (Choose two.)

Select 2 answers

A.Use AWS S3 inventory to track table size.

B.Use EC2 instance status checks.

C.Enable DynamoDB Streams and process with Lambda to detect failures.

D.Set up Amazon CloudWatch alarms on ConsumedReadCapacityUnits.

E.Monitor the 'TableHealth' metric in CloudWatch.

AnswersC, D

Streams can be used for monitoring changes.

Why this answer

Options C and D are correct. CloudWatch alarms on metrics like ConsumedReadCapacityUnits and ThrottledRequests provide insight into table health, and DynamoDB Streams processed by Lambda can be used to monitor item-level changes. Option E is wrong because there is no 'TableHealth' metric.

Option A is wrong because S3 is not for DynamoDB monitoring. Option B is wrong because EC2 is not relevant.
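A CloudWatch alarm on ConsumedReadCapacityUnits might be described with parameters like the following. This is a sketch under assumptions: the table name, threshold, period, and the helper `read_capacity_alarm` are illustrative choices, not values from the question.

```python
def read_capacity_alarm(table_name: str, threshold: float) -> dict:
    """Build keyword arguments for a CloudWatch alarm on a DynamoDB table."""
    return {
        "AlarmName": f"{table_name}-consumed-read-capacity",
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ConsumedReadCapacityUnits",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": 60,            # seconds per datapoint (assumed)
        "EvaluationPeriods": 5,  # consecutive periods over threshold (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


alarm = read_capacity_alarm("Orders", 48000.0)
# With boto3 this could be passed as
# boto3.client("cloudwatch").put_metric_alarm(**alarm).
```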

Full explanation →

628

MCQhard

A company is migrating its on-premises data warehouse to Amazon Redshift. The daily batch load from the source database takes 6 hours using a single-node Redshift cluster. The engineer needs to reduce load time to under 2 hours without increasing cost significantly. Which strategy should the engineer adopt?

A.Use COPY with compression (gzip) to reduce data volume.

B.Use a VPC endpoint to improve network throughput to S3.

C.Change the table distribution style to EVEN to distribute data evenly.

D.Increase the number of nodes in the Redshift cluster and use parallel COPY from multiple files.

AnswerD

More nodes enable parallel data loading.

Why this answer

Option D is correct because increasing the node count provides more parallelism for the COPY command, reducing load time. Option A is wrong because compression shrinks the files but does not add the parallelism a single-node cluster lacks. Option C is wrong because distributing data across nodes helps queries, not loads.

Option B is wrong because VPC routing changes do not affect load performance.

Full explanation →

629

MCQhard

Refer to the exhibit. A data engineer is troubleshooting an IAM policy attached to a user. The user reports that they cannot upload objects to the S3 bucket 'data-lake-bucket' unless they explicitly specify the 'x-amz-server-side-encryption' header with value 'AES256'. The engineer wants to modify the policy to allow uploads without requiring encryption headers, but still enforce encryption on the bucket itself. Which change should the engineer make?

A.Remove the entire Deny statement.

B.Remove the Condition block from the Allow statement.

C.Change the Condition in the Allow statement to use aws:kms instead of AES256.

D.Set the bucket's default encryption to AES256 and keep the policy unchanged.

AnswerA

Removing the Deny allows uploads without encryption header; bucket default encryption can be used.

Why this answer

Option A is correct. The Deny statement with the StringNotEquals condition requires the encryption header to be exactly AES256; removing the Deny statement allows uploads without the encryption header, and bucket default encryption can enforce encryption. Option B is wrong because removing the condition from the Allow statement still leaves the encryption header required by the Deny.

Option C is wrong because changing to aws:kms still requires an encryption header. Option D is wrong because setting default encryption on the bucket does not override an explicit Deny.

Full explanation →

630

MCQeasy

A marketing analytics team needs to ingest customer transaction data from an on-premises PostgreSQL database into Amazon S3 for analysis. The data volume is about 10 GB daily, and the team wants to perform full refresh daily (truncate and load) into S3 as Parquet files. The company has a Direct Connect connection to AWS. The team needs a simple, managed solution that minimizes operational overhead. What should the team use?

A.Set up AWS Database Migration Service (DMS) to continuously replicate data to S3 in Parquet format.

B.Use Amazon EMR with a Spark job that reads from PostgreSQL and writes to S3.

C.Use an AWS Glue ETL job with a JDBC connection to the PostgreSQL database, extract data, and write to S3 in Parquet format.

D.Use AWS Data Pipeline with a SQLActivity to extract data and copy to S3.

AnswerC

Glue is serverless and can handle daily full refresh with minimal setup.

Why this answer

Option C is correct: AWS Glue can connect to on-premises PostgreSQL via JDBC using a connection, perform a full extract, and write to S3 as Parquet. Option A (DMS) is more suited for ongoing replication, not full refresh. Option D (Data Pipeline) requires more configuration.

Option B (EMR) is overkill.

Full explanation →

631

MCQhard

A company is using Amazon Redshift for analytics. The cluster has 20 nodes and the data is evenly distributed. Query performance has degraded over time. The data engineer suspects that table maintenance is needed. Which set of operations should be performed to improve query performance?

A.Run VACUUM and ANALYZE commands on all tables

B.Run VACUUM FULL on all tables

C.Run REINDEX on all tables

D.Run ALTER TABLE APPEND to reorganize data

AnswerA

VACUUM reclaims space and sorts rows; ANALYZE updates statistics for the optimizer.

Why this answer

Option A is correct because VACUUM reclaims space and sorts data, and ANALYZE updates statistics. Option B is wrong because VACUUM FULL on its own is more intensive and leaves the optimizer statistics stale. Option C is wrong because REINDEX is for indexes, not sort order.

Option D is wrong because ALTER TABLE APPEND does not help.

Full explanation →

632

Multi-Selecthard

A data engineer is designing a disaster recovery plan for an Amazon RDS for PostgreSQL database. The database is 500 GB and has a multi-AZ deployment. The recovery point objective (RPO) is 5 minutes, and the recovery time objective (RTO) is 2 hours. Which THREE actions should the engineer take to meet these objectives?

Select 3 answers

A.Enable Multi-AZ deployment for automatic failover.

B.Enable automated backups with a retention period of 1 day.

C.Take daily manual snapshots and export them to Amazon S3.

D.Disable automatic backups to reduce storage costs.

E.Configure a cross-region read replica for faster recovery in another region.

AnswersA, B, E

Multi-AZ provides automatic failover to standby in case of failure.

Why this answer

Options A, B, and E are correct. Multi-AZ (A) provides automatic failover. Automated backups (B) provide point-in-time recovery.

A cross-region read replica (E) can be promoted for faster recovery in another region. Option C is wrong because snapshot export to S3 is for archival, not fast recovery. Option D is wrong because disabling automated backups removes point-in-time recovery.

Full explanation →

633

MCQhard

A company runs an Amazon RDS for PostgreSQL database. To meet disaster recovery requirements, they set up a cross-Region read replica. The replica has been lagging by several minutes. Which action is MOST effective to reduce the replica lag?

A.Enable Multi-AZ on the primary database.

B.Increase the instance size (memory and CPU) of the replica.

C.Increase the instance size of the primary database.

D.Decrease the instance size of the replica to reduce cost.

AnswerB

A larger replica can apply changes faster.

Why this answer

Increasing the instance size (memory and CPU) of the cross-Region read replica is the most effective action because replica lag in Amazon RDS for PostgreSQL is often caused by the replica being unable to keep up with the volume of write-ahead log (WAL) data arriving from the primary. A larger replica instance provides more compute and memory resources to apply WAL changes faster, reducing the replay lag. This directly addresses the bottleneck at the replica side without impacting the primary database.

Exam trap

The trap here is that candidates often assume the primary database is the bottleneck and choose to scale it up, but the lag is caused by the replica's inability to apply changes quickly enough, making the replica's instance size the correct lever to adjust.

How to eliminate wrong answers

Option A is wrong because enabling Multi-AZ on the primary database provides high availability within a single Region but does not reduce cross-Region replica lag; it may even increase lag due to synchronous replication overhead on the primary. Option C is wrong because increasing the instance size of the primary database improves its write performance but does not help the replica apply WAL data faster; the bottleneck is on the replica side, not the primary. Option D is wrong because decreasing the instance size of the replica would reduce its CPU and memory resources, worsening the replica lag by making it even harder to keep up with WAL replay.

Full explanation →

634

MCQmedium

An organization wants to audit all API calls made to AWS services for compliance. Which AWS service should be used to capture and store these API calls?

A.AWS CloudTrail

B.AWS Config

C.Amazon VPC Flow Logs

D.Amazon CloudWatch Logs

AnswerA

CloudTrail records AWS API calls for auditing.

Why this answer

Option A is correct. AWS CloudTrail records API calls and stores them in S3. CloudWatch Logs is for log data, not API calls.

VPC Flow Logs capture network traffic. Config records resource configuration changes.

Full explanation →

635

MCQhard

A data engineer is reviewing an IAM policy that controls access to an S3 bucket. The policy is attached to a user group. The engineer notices that users are unable to download objects from the bucket. What is the likely cause?

A.The policy is attached to a user group instead of an IAM role.

B.The policy does not specify the correct bucket ARN.

C.The policy does not allow the s3:GetObject action.

D.The objects are encrypted using SSE-KMS, not SSE-S3.

AnswerD

The condition requires SSE-S3 (AES256), so SSE-KMS objects are denied.

Why this answer

Option D is correct because the policy condition requires that objects be encrypted with SSE-S3 (AES256). If the objects are encrypted with SSE-KMS or not encrypted, the request fails. Option C is wrong because the policy allows s3:GetObject.

Option B is wrong because the policy does not restrict bucket names. Option A is wrong because attaching the policy to a user group rather than a role is valid and is not the issue.

Full explanation →

636

Multi-Selectmedium

A company is designing a data lake on AWS using Amazon S3. The data includes sensitive customer information that must be encrypted at rest. The company requires that encryption keys be managed by AWS, but the keys must be rotated automatically every year. Which TWO options meet these requirements? (Choose TWO.)

Select 2 answers

A.Use Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3).

B.Use client-side encryption with an AWS KMS key.

C.Use Server-Side Encryption with Customer-Provided Keys (SSE-C).

D.Use Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS) with manual key rotation.

E.Use Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS) with automatic key rotation enabled.

AnswersA, E

SSE-S3 automatically rotates keys every year.

Why this answer

SSE-S3 uses Amazon S3-managed keys with automatic rotation every year. SSE-KMS with automatic key rotation enabled also meets the key rotation requirement; for customer managed KMS keys, automatic rotation is off until it is turned on. SSE-C does not use AWS-managed keys.

SSE-KMS with manual rotation is not automatic. Client-side encryption leaves the encryption work and key handling to the application rather than to S3.
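Turning on automatic rotation for a KMS key can be sketched as a small check-then-enable routine. The function `ensure_key_rotation` and the `FakeKms` stand-in are hypothetical; the two client methods mirror the KMS operations `GetKeyRotationStatus` and `EnableKeyRotation`, and a real boto3 KMS client could be passed in place of the fake.

```python
def ensure_key_rotation(kms_client, key_id: str) -> bool:
    """Enable automatic rotation if it is off. Return True if a change was made."""
    status = kms_client.get_key_rotation_status(KeyId=key_id)
    if status["KeyRotationEnabled"]:
        return False
    kms_client.enable_key_rotation(KeyId=key_id)
    return True


class FakeKms:
    """In-memory stand-in for a KMS client, used only for this illustration."""

    def __init__(self):
        self.enabled = False

    def get_key_rotation_status(self, KeyId):
        return {"KeyRotationEnabled": self.enabled}

    def enable_key_rotation(self, KeyId):
        self.enabled = True


fake = FakeKms()
first_call_changed = ensure_key_rotation(fake, "hypothetical-key-id")
second_call_changed = ensure_key_rotation(fake, "hypothetical-key-id")
```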

Full explanation →

637

MCQhard

A company has an Amazon Redshift cluster that stores petabytes of data. Queries are experiencing high disk usage due to large intermediate results. The data engineer needs to improve query performance without adding more nodes. Which action should the engineer take?

A.Set appropriate distribution keys to minimize data movement.

B.Configure workload management (WLM) queues to limit concurrency.

C.Apply column compression encoding to reduce data size.

D.Define sort keys on all columns used in WHERE clauses.

AnswerC

Compression reduces disk footprint and I/O, improving performance.

Why this answer

Option C is correct because compression reduces data size on disk and can improve query performance by reducing I/O. Option A is wrong because distribution keys affect data distribution, not intermediate results directly. Option B is wrong because WLM queues manage concurrency, not disk usage.

Option D is wrong because sort keys affect order, not intermediate result size.

Full explanation →

638

MCQmedium

A data engineer is ingesting data from an on-premises database to Amazon S3 using AWS DataSync. The data transfer is scheduled to run daily at midnight. The engineer notices that the transfer takes longer than expected and sometimes does not complete before the next scheduled task. What should the engineer do to ensure the transfer completes within the window?

A.Increase the bandwidth limit in the DataSync task settings.

B.Decrease the bandwidth limit to reduce network congestion.

C.Increase the schedule frequency to every 12 hours.

D.Use S3 Transfer Acceleration instead of DataSync.

AnswerA

Higher bandwidth limit allows faster data transfer.

Why this answer

Option A is correct because increasing the bandwidth limit allows DataSync to use more network capacity, speeding up the transfer. Option B is wrong because decreasing the limit would make the transfer slower. Option C is wrong because changing the schedule frequency does not solve the completion time issue.

Option D is wrong because changing to S3 Transfer Acceleration adds cost but may not help if the bottleneck is on-premises bandwidth.

Full explanation →

639

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data is generated at a high velocity and must be processed in near real-time. The pipeline must also handle bursty traffic. Which TWO AWS services should be combined to achieve this? (Choose TWO.)

Select 2 answers

A.Amazon S3

B.Amazon Kinesis Data Analytics

C.Amazon Simple Queue Service (SQS)

D.AWS Glue

E.Amazon Kinesis Data Streams

AnswersB, E

Can process streaming data in near real-time.

Why this answer

Option E (Kinesis Data Streams) is correct because it can ingest high-velocity streaming data and handle bursty traffic. Option B (Kinesis Data Analytics) is correct because it can perform real-time processing. Option C is wrong because SQS is for decoupling, not real-time streaming.

Option D is wrong because Glue is more batch-oriented. Option A is wrong because S3 is not real-time.

Full explanation →

640

MCQhard

A company runs a real-time analytics platform using Amazon Kinesis Data Streams with a shard count of 10. The data is consumed by an AWS Lambda function that writes to an Amazon DynamoDB table. The DynamoDB table has a partition key of 'user_id' and a sort key of 'timestamp'. The table is provisioned with 5000 RCUs and 5000 WCUs. Recently, the application experienced increased write latency and throttling errors (ProvisionedThroughputExceededException) on the DynamoDB table. The CloudWatch metrics show that ConsumedWriteCapacityUnits averages 4500 with occasional spikes to 6000. The Lambda function’s concurrency is set to 1000. The data engineer suspects the issue is due to hot partitions. Upon investigation, the engineer finds that a small number of users generate a disproportionately large amount of data. Which course of action would best resolve the throttling while minimizing cost?

A.Enable DynamoDB adaptive capacity and implement write sharding by adding a suffix to the partition key for high-volume users

B.Increase the provisioned WCUs to 10000 to handle spikes

C.Switch the DynamoDB table to on-demand capacity mode

D.Reduce the Lambda function concurrency to 100 to limit write requests

AnswerA

Adaptive capacity automatically manages partition throughput, and write sharding distributes writes across multiple partitions, reducing hot spots.

Why this answer

Option A is correct because the root cause is hot partitions caused by a small number of high-volume users. Enabling DynamoDB adaptive capacity allows the table to automatically adjust throughput to accommodate uneven access patterns, but the key fix is write sharding — adding a random or calculated suffix to the partition key for those high-volume users. This distributes writes across multiple physical partitions, eliminating the hot spot without requiring a global increase in provisioned capacity, thus resolving throttling while minimizing cost.
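Write sharding with a calculated suffix can be sketched as follows. This is a minimal illustration under assumptions: the shard count of 10, the hot-user set, the `#` separator, and both helper names are hypothetical choices, not part of the question.

```python
import hashlib

SHARD_COUNT = 10         # assumed number of suffixes per high-volume user
HOT_USERS = {"user-42"}  # assumed set of users known to write heavily


def sharded_partition_key(user_id: str, timestamp: str,
                          shard_count: int = SHARD_COUNT) -> str:
    """Return the partition key to write under, adding a suffix for hot users."""
    if user_id not in HOT_USERS:
        return user_id
    # A calculated suffix: hash the item identity so the same item always
    # maps to the same shard, while different items spread across shards.
    digest = hashlib.sha256(f"{user_id}:{timestamp}".encode()).hexdigest()
    return f"{user_id}#{int(digest, 16) % shard_count}"


def all_shard_keys(user_id: str, shard_count: int = SHARD_COUNT) -> list:
    """Return every partition key a reader must query for this user."""
    if user_id not in HOT_USERS:
        return [user_id]
    return [f"{user_id}#{n}" for n in range(shard_count)]


write_key = sharded_partition_key("user-42", "2024-01-01T00:00:00Z")
read_keys = all_shard_keys("user-42")
```

The trade-off is on the read side: fetching all data for a sharded user means querying each suffix and merging the results.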

Exam trap

AWS often tests the misconception that throttling is always solved by increasing total provisioned capacity or switching to on-demand, when the real issue is partition-level hot spots that require key design changes like write sharding.

How to eliminate wrong answers

Option B is wrong because simply increasing provisioned WCUs to 10000 does not address the hot partition issue; the throttling is due to uneven distribution of writes across partitions, not a lack of total capacity, so this would waste money without fixing the root cause. Option C is wrong because switching to on-demand capacity mode would handle spikes but at a significantly higher cost for sustained high write volumes, and it still does not solve the hot partition problem — on-demand tables can still throttle individual partitions if a single partition exceeds 1,000 WCUs (the per-partition throughput limit). Option D is wrong because reducing Lambda concurrency to 100 would limit the total write throughput, potentially causing data backlogs in Kinesis, and it does not address the uneven distribution of writes across DynamoDB partitions; the hot partition would still be throttled even with fewer concurrent writers.

Full explanation →

641

MCQeasy

A company is migrating its on-premises Oracle data warehouse to Amazon Redshift. The data engineering team needs to load data from Oracle to Redshift using AWS DMS (Database Migration Service). The source database is 2 TB in size. The team wants to minimize downtime and ensure data consistency during full load. Which approach should they take?

A.Use UNLOAD command to export data from Oracle to S3, then COPY into Redshift.

B.Use AWS DMS to perform a full load, then enable ongoing replication to capture changes.

C.Stop the source database, export data to flat files, upload to S3, and use COPY to load into Redshift.

D.Use COPY command directly from Oracle to Redshift over JDBC.

AnswerB

Minimizes downtime and ensures consistency.

Why this answer

Option B is correct because AWS DMS can perform a full load of the 2 TB Oracle database while simultaneously capturing ongoing changes via CDC (Change Data Capture). After the full load completes, DMS applies the cached changes to Redshift, ensuring data consistency with minimal downtime. This approach avoids stopping the source database and leverages DMS's native replication capabilities.

Exam trap

The trap here is that candidates may confuse the UNLOAD command (Redshift export) with an Oracle export tool, or assume that a full database stop is required for consistency, when DMS's CDC capability eliminates that need.

How to eliminate wrong answers

Option A is wrong because the UNLOAD command is an Amazon Redshift command for exporting data from Redshift to S3, not for exporting from Oracle; it cannot be used on the source Oracle database. Option C is wrong because stopping the source database to export flat files introduces unnecessary downtime, which contradicts the goal of minimizing downtime; DMS can achieve consistency without halting operations. Option D is wrong because the COPY command in Redshift cannot read directly from Oracle over JDBC; it only loads data from S3, DynamoDB, or other supported sources, not from a live JDBC connection.

Full explanation →

642

Multi-Selectmedium

A company is running a critical data pipeline using AWS Glue. The pipeline must be highly available and fault-tolerant. Which TWO strategies should the data engineer implement? (Choose TWO.)

Select 2 answers

A.Configure the Glue job to run in multiple Availability Zones.

B.Use a single instance type for all Glue workers.

C.Increase the number of concurrent runs for the Glue job.

D.Enable job retries with exponential backoff.

E.Disable job bookmarks to avoid reprocessing.

AnswersA, D

Multi-AZ provides redundancy.

Why this answer

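Configuring the Glue job to run in multiple Availability Zones ensures that if one AZ experiences a failure, the job can continue processing in another AZ, providing high availability and fault tolerance. This is a fundamental strategy for resilient data pipeline design in AWS. Enabling job retries with exponential backoff (Option D) lets a run that fails on a transient error be attempted again automatically instead of failing the pipeline.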
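Option D, job retries with exponential backoff, follows a general pattern that can be sketched in plain Python. How the retry is wired up in practice (a Glue job setting, an orchestrator, or calling code) is not stated in the question; `run_with_retries` and `flaky_job` are hypothetical names used only for illustration.

```python
import time


def run_with_retries(action, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call action(), retrying on failure with delays of 1x, 2x, 4x base_delay."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_job():
    """Stand-in for a job run that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "SUCCEEDED"


delays = []  # record delays instead of actually sleeping
result = run_with_retries(flaky_job, sleep=delays.append)
```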

Exam trap

The trap here is that candidates often confuse increasing concurrency (Option C) with fault tolerance, but concurrency only scales processing horizontally without providing redundancy against infrastructure failures.

Full explanation →

643

MCQmedium

A company is using AWS Database Migration Service (DMS) to migrate a 2 TB Oracle database to Amazon Aurora PostgreSQL. The migration must have minimal downtime. The source database is highly active with continuous writes. Which DMS migration type and additional configuration should the engineer use?

A.Use a CDC-only migration task to capture changes from the source.

B.Use a full load migration task and stop the source database before starting.

C.Use a full load migration task with task restart enabled.

D.Use a full load migration task followed by ongoing replication (CDC).

AnswerD

Full load migrates existing data, then CDC replicates new changes, minimizing downtime.

Why this answer

Option D is correct because full load + change data capture (CDC) allows an initial bulk load and then continuously replicates ongoing changes to keep the target in sync, minimizing downtime. Option A is wrong because a CDC-only task assumes the existing data is already present on the target, which isn't the case. Option B is wrong because a full-load-only task does not capture changes made during the migration, so the source has to be stopped, which means downtime.

Option C is wrong because task restart does not address ongoing changes.
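The migration type in Option D corresponds to the DMS `MigrationType` value `full-load-and-cdc`. The sketch below only assembles the task parameters; the task identifier, the ARN placeholders, the include-everything table mapping, and the helper name are assumptions made for illustration.

```python
import json


def dms_task_params(source_arn: str, target_arn: str, instance_arn: str) -> dict:
    """Build keyword arguments for a full load plus ongoing replication task."""
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            # "%" is the wildcard: include every schema and table (assumed scope).
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": "oracle-to-aurora",  # hypothetical name
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),  # the API expects a JSON string
    }


task = dms_task_params("arn:source-placeholder", "arn:target-placeholder",
                       "arn:instance-placeholder")
# With boto3 this could be passed as
# boto3.client("dms").create_replication_task(**task).
```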

Full explanation →

644

MCQeasy

A company is migrating an on-premises MySQL database to Amazon RDS for MySQL. The database is 500 GB in size. The migration must have minimal downtime and must be completed within a week. Which AWS service should the data engineer use to perform the migration?

A.Amazon S3 Transfer Acceleration.

B.AWS Snowball Edge.

C.AWS DataSync.

D.AWS Database Migration Service (DMS).

AnswerD

DMS supports ongoing replication for minimal downtime.

Why this answer

Option D is correct because AWS Database Migration Service (DMS) supports minimal downtime migration with ongoing replication from on-premises to RDS. Option A is incorrect because S3 is object storage, not for database migration. Option B is incorrect because Snowball is for large data transfer but involves shipping and is not suitable for minimal downtime.

Option C is incorrect because DataSync is for file storage, not databases.

Full explanation →

645

MCQeasy

A company stores sensitive customer data in Amazon S3. The security team requires that all data be encrypted at rest using server-side encryption with AWS KMS managed keys (SSE-KMS). Which S3 bucket policy condition will enforce this requirement?

A.s3:x-amz-server-side-encryption-aws-kms-key-id

B.s3:x-amz-server-side-encryption-customer-algorithm

C.s3:x-amz-server-side-encryption

D.s3:x-amz-server-side-encryption-aws-kms-key-id or s3:x-amz-server-side-encryption

AnswerA

This condition can enforce the use of a specific KMS key for encryption.

Why this answer

Option A is correct because the condition `s3:x-amz-server-side-encryption-aws-kms-key-id` enforces that objects uploaded to S3 must use a specific AWS KMS key for server-side encryption (SSE-KMS). This satisfies the security team's requirement that all data be encrypted at rest using SSE-KMS with AWS KMS managed keys, as it explicitly checks for the presence and value of the KMS key ID in the request.

Exam trap

The trap here is that candidates often confuse `s3:x-amz-server-side-encryption` (which only checks the encryption header value, not the key) with `s3:x-amz-server-side-encryption-aws-kms-key-id` (which enforces a specific KMS key), leading them to choose option C or D without realizing that SSE-S3 (AES256) would also satisfy the generic condition.

How to eliminate wrong answers

Option B is wrong because `s3:x-amz-server-side-encryption-customer-algorithm` is used to enforce server-side encryption with customer-provided encryption keys (SSE-C), not SSE-KMS. Option C is wrong because `s3:x-amz-server-side-encryption` only checks whether the `x-amz-server-side-encryption` header is present (e.g., with value `AES256` for SSE-S3 or `aws:kms` for SSE-KMS), but it does not enforce the use of a specific KMS key or even that KMS is used; it could be satisfied by SSE-S3. Option D is wrong because combining `s3:x-amz-server-side-encryption-aws-kms-key-id` with `s3:x-amz-server-side-encryption` is redundant and does not add enforcement for SSE-KMS; the key ID condition alone is sufficient, and the additional condition could allow SSE-S3 if not carefully combined with a deny rule.
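A bucket policy statement built on the `s3:x-amz-server-side-encryption-aws-kms-key-id` condition key might look like the sketch below. The bucket name, key ARN, statement ID, and helper name are placeholders, and the Deny-with-StringNotEquals shape is one common way to write such a rule, not necessarily the one the question has in mind.

```python
import json


def require_kms_key_policy(bucket: str, kms_key_arn: str) -> dict:
    """Build a bucket policy that denies uploads not using the given KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUploadsWithoutExpectedKmsKey",  # hypothetical Sid
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn
                }
            },
        }],
    }


policy = require_kms_key_policy(
    "example-bucket",  # placeholder bucket
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",  # placeholder key
)
policy_json = json.dumps(policy, indent=2)
```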

Full explanation →

646

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for streaming data from IoT devices. The devices send JSON messages every second. The engineer needs to ingest the data with low latency and store it in Amazon S3 in Parquet format. Which TWO services should the engineer use together?

Select 2 answers

A.AWS Lambda

B.Amazon Athena

C.Amazon Kinesis Data Streams

D.AWS Glue

E.Amazon Kinesis Data Firehose

AnswersC, E

Provides low-latency ingestion.

Why this answer

Options C and E are correct. Kinesis Data Streams provides low-latency ingestion, and Kinesis Data Firehose can convert data to Parquet and deliver to S3. Option D (Glue) is batch-oriented.

Option B (Athena) is for querying. Option A (Lambda) can be used but is not required for the conversion; Firehose can do it natively.

Full explanation →

647

MCQmedium

Refer to the exhibit. A DynamoDB table 'Orders' has a GSI 'CustomerDateIndex'. A developer tries to query the GSI for all orders of a customer between two dates. The query fails. What is the most likely reason?

A.The GSI does not include 'order_id' in the key schema

B.The 'customer_id' attribute is not a partition key in the GSI

C.The query uses a date format that is not lexicographically sortable as a string

D.The 'order_date' attribute is not a sort key in the GSI

AnswerC

String sort on dates requires ISO 8601 format.

Why this answer

Option C is correct because DynamoDB's Query operation requires the sort key to be lexicographically sortable when using comparison operators like BETWEEN. If the 'order_date' attribute is stored as a non-lexicographically sortable string format (e.g., 'MM-DD-YYYY' instead of 'YYYY-MM-DD'), the BETWEEN condition will fail to return correct results or may throw an error. The GSI's sort key must be in a format that supports string comparison for range queries to work properly.
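The effect of the date format on string comparison can be checked in plain Python, since a BETWEEN on a string sort key compares values as strings. The sample dates and the `between` helper are made up for illustration and do not come from the exhibit.

```python
def between(value: str, low: str, high: str) -> bool:
    """String range check, mirroring how a string sort key is compared."""
    return low <= value <= high


# ISO 8601 (YYYY-MM-DD): string order matches chronological order.
iso_inside = between("2024-02-15", "2024-01-01", "2024-03-01")
iso_outside = between("2023-02-15", "2024-01-01", "2024-03-01")

# MM-DD-YYYY: a date from 2023 falls "inside" a 2024 range,
# because the comparison starts with the month digits.
us_false_match = between("02-15-2023", "01-01-2024", "03-01-2024")
```

Here `iso_inside` is True and `iso_outside` is False, as expected, while `us_false_match` is True even though February 2023 is outside the intended range.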

Exam trap

AWS often tests the misconception that any string date format works for range queries, but the trap here is that DynamoDB requires lexicographically sortable strings for BETWEEN conditions, and non-ISO formats will silently fail or return incorrect results.

How to eliminate wrong answers

Option A is wrong because 'order_id' is not required in the GSI key schema for querying by customer and date range; the GSI only needs the partition key (customer_id) and sort key (order_date) for this query. Option B is wrong because 'customer_id' must be the partition key of the GSI for the query to work, and the question states the GSI is 'CustomerDateIndex', implying customer_id is the partition key. Option D is wrong because 'order_date' must be the sort key of the GSI for the BETWEEN query to work, and the index name 'CustomerDateIndex' suggests it is the sort key; the failure is due to date format, not missing sort key.

Full explanation →

648

MCQmedium

A data engineer is managing an Amazon RDS for PostgreSQL instance that serves as a source for change data capture (CDC) using AWS DMS. The DMS task is a full load followed by ongoing replication. The full load completed successfully, but the ongoing replication is failing with the error 'Value too long for character type'. The engineer has verified that the target database schema matches the source. The source table has a VARCHAR(256) column, and the target has VARCHAR(256) as well. However, some source rows contain values longer than 256 characters. What should the engineer do to resolve the issue?

A.Modify the DMS task to truncate data that exceeds the column length.

B.Rename the target column to match a different source column.

C.Change the target column to a CLOB data type.

D.Alter the target table column to a larger data type, such as VARCHAR(512).

AnswerD

Resolves the length mismatch.

Why this answer

Option D is correct because the error indicates that source data exceeds the column length. The source column definition may not enforce the length, or the data was inserted bypassing constraints. The engineer should alter the target column to a larger size to accommodate the actual data.

Option A is wrong because truncating data may cause data loss. Option B is wrong because DMS maps columns by name; renaming would cause mapping issues. Option C is wrong because the error is about column length, not the character type, so switching to a CLOB is unnecessary.

Full explanation →

649

MCQmedium

A company runs a critical application on Amazon RDS for MySQL. To ensure high availability and automatic failover, the database is deployed as a Multi-AZ DB instance. The application uses read-heavy workloads. Which additional configuration should be used to offload read traffic without impacting write performance?

A.Use the Multi-AZ standby instance for read queries.

B.Create one or more Read Replicas in different AZs.

C.Use Amazon ElastiCache to cache read queries.

D.Change to a Single-AZ deployment with a larger instance size.

AnswerB

Read Replicas can handle read traffic, and Multi-AZ ensures high availability for writes.

Why this answer

Option B is correct because Read Replicas can serve read traffic and reduce load on the primary instance, while Multi-AZ provides high availability. Option A (Multi-AZ standby for reads) does not offload reads; the standby of a Multi-AZ DB instance does not serve read traffic. Option D (Single-AZ with a larger instance) does not provide high availability.

Option C (ElastiCache) caches query results but requires application changes and does not act as a replica that can serve arbitrary or complex read queries.

Full explanation →

650

Multi-Selectmedium

A company is using AWS Glue ETL to transform and load data from Amazon S3 to Amazon Redshift. The data engineer notices that the job is taking longer than expected. Which TWO actions can improve the job performance?

Select 2 answers

A.Use Amazon Redshift Spectrum to query data directly.

B.Partition the source data in S3.

C.Increase the number of DPUs for the Glue job.

D.Enable S3 Transfer Acceleration.

E.Use a larger Redshift node type.

AnswersB, C

Partitioning reduces data scanned.

Why this answer

Options B and C are correct because partitioning the source data in S3 reduces the data scanned and increasing the number of DPUs improves parallelism. Option D is incorrect because S3 Transfer Acceleration speeds up uploads to S3 over long distances, not ETL. Option A is incorrect because Redshift Spectrum is for querying data in S3, not for speeding up a Glue ETL job.

Option E is incorrect because a larger Redshift node type does not affect Glue job performance.

Full explanation →

651

Multi-Selectmedium

A company runs a data processing pipeline on Amazon EMR. The pipeline reads data from S3, processes it with Spark, and writes results back to S3. The engineer notices that the cluster is underutilized and wants to reduce costs. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers

A.Use Spot instances for task nodes.

B.Configure the cluster to terminate after the job completes.

C.Change the master node to a larger instance type.

D.Enable EMRFS consistent view.

E.Increase the number of core nodes to improve parallelism.

AnswersA, B

Spot instances are cheaper than On-Demand.

Why this answer

Option A is correct because using Spot instances for task nodes in Amazon EMR can significantly reduce costs, as Spot instances are spare EC2 capacity offered at up to a 90% discount compared to On-Demand instances. Because task nodes do not store HDFS data and can be added or removed without affecting cluster stability, they are ideal candidates for Spot instances, allowing the engineer to lower expenses while maintaining processing capacity. Option B is correct because a cluster that terminates after the job completes stops accruing charges for idle capacity.

Exam trap

The trap here is that candidates may confuse cost optimization features like Spot instances and auto-termination with performance improvements or data consistency settings, leading them to select options that increase resources or enable features unrelated to cost reduction.
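Both cost levers can be sketched as request parameters for the EMR `run_job_flow` API. This is a minimal sketch under stated assumptions: the cluster name, log bucket, release label, instance types, counts, and role names are placeholders, and the function only builds the dictionary without calling AWS.

```python
def build_emr_request(name: str, log_uri: str) -> dict:
    """Build run_job_flow parameters for a transient, Spot-backed cluster."""
    def group(group_name: str, role: str, market: str, count: int) -> dict:
        return {
            "Name": group_name,
            "InstanceRole": role,
            "Market": market,
            "InstanceType": "m5.xlarge",  # assumed instance type
            "InstanceCount": count,
        }

    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                group("primary", "MASTER", "ON_DEMAND", 1),
                group("core", "CORE", "ON_DEMAND", 2),
                # Task nodes hold no HDFS data, so they run on Spot.
                group("task", "TASK", "SPOT", 4),
            ],
            # Terminate the cluster once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("nightly-spark", "s3://example-logs/emr/")
# With credentials configured, this could be submitted with:
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the primary and core nodes On-Demand is a design choice in this sketch, so that a Spot interruption only removes task capacity.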

Full explanation →

652

MCQmedium

A data engineering team is using Amazon DynamoDB to store time-series data for a monitoring application. The table has a primary key of device_id (partition key) and timestamp (sort key). The application queries data for a specific device over a time range. The team notices that read latency is high for devices that generate large amounts of data. They need to improve query performance. Which solution should they implement?

A.Enable DynamoDB Accelerator (DAX) for the table.

B.Change the application to use eventually consistent reads.

C.Create a global secondary index with device_id as partition key and a truncated timestamp as sort key to distribute writes.

D.Increase the read capacity units for the table.

AnswerC

Helps spread write load and improve read performance for time-range queries.

Why this answer

Option C is correct because creating a global secondary index (GSI) with device_id as the partition key and a truncated timestamp as the sort key helps distribute write traffic more evenly across partitions. This reduces hot partitions caused by devices that generate large amounts of data, thereby improving read latency by preventing throttling and reducing contention on a single partition.

Exam trap

The trap here is that candidates often mistake high read latency as a pure read-throughput issue and choose to increase RCUs or add DAX, overlooking the fact that the real bottleneck is write-side hot partitions causing throttling and increased latency for reads on that partition.

How to eliminate wrong answers

Option A is wrong because DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency for frequently accessed items, but it does not address the root cause of high read latency—hot partitions due to uneven write distribution. Option B is wrong because eventually consistent reads only reduce latency by returning data that may be slightly stale, but they do not solve the underlying partition-level performance issue caused by skewed write patterns. Option D is wrong because increasing read capacity units (RCUs) can help with throughput but does not mitigate the hot partition problem; if a single partition is overwhelmed, additional RCUs cannot be fully utilized due to DynamoDB's partition-level throughput limits.

Full explanation →

653

MCQmedium

Refer to the exhibit. A data engineer attaches this bucket policy to an S3 bucket. A developer tries to upload an object to the bucket using the AWS CLI with the command: `aws s3 cp file.txt s3://my-bucket/`. The upload fails. What is the most likely reason?

A.The CLI command does not specify the encryption header, so the request is denied by the policy

B.The developer used the wrong AWS region

C.The CLI command does not include the required KMS key ID

D.The IAM user does not have s3:PutObject permission

AnswerA

The policy denies requests without encryption header.

Why this answer

Option A is correct because the AWS CLI does not send an encryption header with `aws s3 cp` unless `--sse` is specified; S3 then applies its default SSE-S3 encryption on the server side. The policy denies PutObject if the encryption header is missing or is not SSE-KMS. Option D is wrong because the user may well have s3:PutObject; the explicit Deny in the bucket policy overrides it.

Option C is wrong because the policy does not require a specific KMS key ID. Option B is wrong because the denial comes from the policy's encryption condition, not from the Region.

Full explanation →

654

MCQeasy

A data engineer uses AWS CloudTrail to investigate a security incident. The engineer runs the command shown in the exhibit. What does the output indicate?

A.A file was downloaded from the S3 bucket.

B.A file was deleted from the S3 bucket.

C.A batch of files was listed from the S3 bucket.

D.A file was uploaded to the S3 bucket.

AnswerD

PutObject indicates an upload.

Why this answer

Option D is correct because the EventName is PutObject, and the ResourceName includes the object key 'sales_2024-07-01.csv', indicating that a file was uploaded. Option A is wrong because the event is a PutObject, not a GetObject. Option B is wrong because the event is an upload, not a deletion.

Option C is wrong because the event records a single object upload, not a listing of objects.

Full explanation →

655

Multi-Selecthard

A company uses AWS Glue to process large datasets. The Glue job occasionally fails with 'DiskFull' errors. Which TWO actions should the engineer take to resolve this issue? (Choose two.)

Select 2 answers

A.Increase the number of workers for the Glue job.

B.Store intermediate results in Amazon S3 instead of local disk.

C.Enable job bookmark to skip already processed data.

D.Use a Python shell job instead of Spark.

E.Use G.2X worker type which provides more disk space per worker.

AnswersA, E

More workers provide more aggregate disk space.

Full explanation →

656

Multi-Selectmedium

Which TWO of the following are valid approaches to implement fine-grained access control for Amazon DynamoDB items based on user attributes? (Choose 2.)

Select 2 answers

A.Enable row-level security in DynamoDB using AWS Lake Formation.

B.Configure a VPC endpoint with a bucket policy to restrict access to specific items.

C.Use Amazon Cognito identity pools with IAM roles that include conditions based on user attributes.

D.Store user-specific items in separate S3 buckets and use IAM policies to restrict bucket access.

E.Use IAM policies with condition keys like 'dynamodb:LeadingKeys' to restrict access to items with a specific partition key value.

AnswersC, E

Cognito can map user attributes to IAM roles with fine-grained policies.

Why this answer

Option C is correct because Amazon Cognito identity pools can be configured to assume IAM roles with fine-grained policies that use condition keys such as `dynamodb:LeadingKeys` or custom attribute-based conditions. This allows access to DynamoDB items to be restricted based on user-specific attributes (e.g., user ID) without hardcoding permissions per user.

Exam trap

The trap here is that candidates often confuse DynamoDB's fine-grained access control with S3 bucket policies or Lake Formation row-level security, mistakenly applying S3 or data lake concepts to DynamoDB item-level permissions.
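The pattern behind options C and E can be sketched as an IAM policy document built in Python. The table ARN and the list of actions are placeholder assumptions; `dynamodb:LeadingKeys` and the `${cognito-identity.amazonaws.com:sub}` policy variable are the documented building blocks for this kind of per-user restriction.

```python
import json


def build_leading_keys_policy(table_arn: str) -> dict:
    """Allow item access only where the partition key equals the caller's Cognito identity ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
                "Resource": table_arn,
                "Condition": {
                    "ForAllValues:StringEquals": {
                        # Resolved by IAM at request time to the caller's identity ID.
                        "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"]
                    }
                },
            }
        ],
    }


policy = build_leading_keys_policy(
    "arn:aws:dynamodb:us-east-1:111122223333:table/UserData"  # placeholder ARN
)
print(json.dumps(policy, indent=2))
```

A policy like this would be attached to the IAM role that the Cognito identity pool hands to authenticated users, so one role covers every user without per-user statements.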

Full explanation →

657

MCQeasy

A company needs to encrypt data in transit between an EC2 instance and an S3 bucket. Which method should be used?

A.Use HTTPS endpoints

B.Use plain HTTP

C.Use an IPsec VPN

D.Server-side encryption (SSE)

AnswerA

HTTPS encrypts data in transit using TLS.

Why this answer

HTTPS endpoints encrypt data in transit between EC2 and S3 using TLS/SSL, ensuring confidentiality and integrity over the public internet. S3 supports HTTPS natively on its REST endpoints, and the AWS SDKs default to HTTPS, making this the simplest and most secure method for encrypting data in motion.

Exam trap

The trap here is confusing encryption in transit (HTTPS) with encryption at rest (SSE), leading candidates to select server-side encryption even though it does not protect data during network transfer.

How to eliminate wrong answers

Option B is wrong because plain HTTP transmits data in cleartext, exposing it to interception and tampering, which violates encryption-in-transit requirements. Option C is wrong because an IPsec VPN encrypts traffic between networks but is unnecessary and overly complex for direct EC2-to-S3 communication, which can be secured via HTTPS without additional infrastructure. Option D is wrong because server-side encryption (SSE) protects data at rest within S3, not data in transit between EC2 and S3.

Full explanation →

658

MCQeasy

A data engineer needs to grant an IAM user read-only access to a specific KMS key for decrypting S3 objects. Which policy element should be used?

A.Attach an IAM policy to the user allowing kms:Decrypt

B.Add a statement to the KMS key policy allowing the IAM user to call kms:Decrypt

C.Add a bucket policy that grants kms:Decrypt

D.Use an SCP in AWS Organizations to allow kms:Decrypt

AnswerB

Key policies directly grant access to the key.

Why this answer

Option B is correct because a KMS key policy can grant an IAM user permissions on the key directly. Option A is wrong because an IAM policy grants KMS permissions only when the key policy also allows IAM policies to control access to the key. Option C is wrong because bucket policies do not grant KMS permissions.

Option D is wrong because SCPs set permission guardrails in AWS Organizations and do not grant permissions. Both A and B can be part of a working setup; the key policy is typically used for cross-account access or specific grants.

Full explanation →

659

MCQmedium

A data engineer is migrating an on-premises Apache Cassandra cluster to Amazon Keyspaces (for Apache Cassandra). The cluster has 10 TB of data. The migration must minimize application downtime. Which strategy should the engineer use?

A.Set up a dual-write pattern where the application writes to both the on-premises cluster and Keyspaces, then switch reads to Keyspaces once data is synchronized.

B.Export the data using the Cassandra COPY command and import it into Keyspaces using the COPY command.

C.Take a snapshot of the on-premises cluster and restore it to Keyspaces using the Keyspaces console.

D.Use AWS Database Migration Service (DMS) to continuously replicate data from the on-premises cluster to Keyspaces.

AnswerA

This minimizes downtime by keeping both systems in sync and allows for a gradual cutover.

Why this answer

Option A is correct because the dual-write pattern allows the application to write to both the on-premises Cassandra cluster and Amazon Keyspaces simultaneously, ensuring data consistency with minimal downtime. Once the existing data is backfilled and the systems are synchronized, reads can be switched to Keyspaces with near-zero application interruption. This approach avoids the downtime required for bulk export/import or snapshot restore.

Exam trap

The trap here is that candidates often assume AWS DMS can handle any database migration, but DMS does not support Cassandra as a source, making Option D a distractor for those who overestimate DMS's capabilities.

How to eliminate wrong answers

Option B is wrong because the Cassandra COPY command is a bulk export/import tool that requires the application to stop writes during the migration to ensure consistency, causing significant downtime for 10 TB of data. Option C is wrong because taking a snapshot of the on-premises cluster and restoring it to Keyspaces via the console is not supported; Keyspaces does not provide a native snapshot restore feature from on-premises snapshots. Option D is wrong because AWS DMS does not support Apache Cassandra as a source for continuous replication; DMS supports relational databases and some NoSQL databases but not Cassandra.

Full explanation →

660

MCQeasy

A company needs to ingest data from multiple SaaS applications into Amazon S3. The data sources provide REST APIs. Which AWS service can be used to build a fully managed data ingestion pipeline without writing custom code?

A.Amazon AppFlow

B.Amazon Kinesis Data Streams

C.AWS Lambda with custom code

D.AWS Glue with Python shell

AnswerA

AppFlow is a fully managed integration service for SaaS applications.

Why this answer

Option A is correct because Amazon AppFlow is a fully managed service for transferring data from SaaS applications to AWS. Option C is wrong because AWS Lambda requires custom code. Option D is wrong because AWS Glue is for ETL, not for direct SaaS API integration, and a Python shell job also requires custom code.

Option B is wrong because Amazon Kinesis Data Streams is for streaming data and has no SaaS connectors.

Full explanation →

661

MCQmedium

A company is using AWS Glue to run ETL jobs that process data from Amazon S3 and load it into Amazon Redshift. The jobs are failing with the error 'Unable to connect to Redshift cluster'. The Redshift cluster is in the same VPC as the Glue job. What is the MOST likely cause?

A.The Redshift cluster's security group does not allow inbound traffic from the Glue job's security group.

B.The IAM role associated with the Glue job does not have permission to access Redshift.

C.The Glue job is not configured to use the same VPC as the Redshift cluster.

D.The Redshift cluster is not publicly accessible and Glue is trying to connect from outside the VPC.

AnswerC

Glue jobs need VPC configuration to access resources in a private VPC.

Why this answer

Option C is correct because Glue jobs run outside the customer's VPC by default, and a job that is not configured to use the same VPC as the Redshift cluster cannot connect to it. Option A is incorrect because security group rules only matter once the job has a network path into the VPC; without the VPC configuration, the connection fails before those rules are evaluated. Option D is incorrect because the cluster does not need to be publicly accessible once the job runs inside the VPC.

Option B is incorrect because IAM permissions control authorization to AWS APIs, not network connectivity.

Full explanation →

662

Multi-Selecthard

A company uses Amazon Redshift for data warehousing. The security team requires that data be encrypted at rest using a customer-managed key (CMK) in AWS KMS, and that the key be rotated automatically every year. Additionally, the team wants to restrict access to the key to only the Redshift cluster and a security admin IAM role. Which steps should the company take? (Choose THREE.)

Select 3 answers

A.Add the security admin IAM role as a key user in the KMS key policy.

B.Alter the existing Redshift cluster to enable encryption with the CMK.

C.Enable automatic key rotation in the KMS key policy.

D.Disable automatic key rotation to comply with security policy.

E.Create a new Redshift cluster and specify the CMK for encryption.

AnswersA, C, E

Allows the admin to manage the key.

Why this answer

Options A, C, and E are correct. Option A adds the security admin role to the key policy. Option C enables automatic yearly rotation.

Option E enables encryption with a CMK by specifying the key when the cluster is created. Option B is wrong because, in this question's framing, enabling encryption on an existing cluster requires creating a snapshot and restoring it into a new cluster, not just altering the cluster. Option D is wrong because disabling rotation prevents the required automatic rotation.

Full explanation →

663

MCQhard

A company uses Amazon S3 to store log files from multiple applications. The logs are encrypted with AWS KMS (SSE-KMS). A data engineer needs to grant a new IAM user read-only access to the logs. The engineer attaches an S3 bucket policy that allows s3:GetObject and a KMS key policy that allows kms:Decrypt. However, the user still receives an 'Access Denied' error when trying to download an object. What is the MOST likely missing permission?

A.The user does not have s3:ListBucket permission on the bucket.

B.The user does not have s3:GetObjectVersion permission.

C.The user's IAM policy does not include kms:Decrypt permission.

D.The user does not have kms:GenerateDataKey permission.

AnswerC

Both the key policy and the IAM user policy must allow kms:Decrypt; the IAM policy is missing this action.

Why this answer

Option C is correct because downloading an SSE-KMS object requires kms:Decrypt, and in this scenario the user's IAM policy must allow the action in addition to the key policy. The key policy alone is not sufficient if the IAM user's policy denies or does not allow the action. Option A is incorrect because s3:ListBucket is for listing objects, not downloading them.

Option B is incorrect because s3:GetObjectVersion is for versioned buckets. Option D is incorrect because kms:GenerateDataKey is for encryption, not decryption.

Full explanation →

664

MCQmedium

A media company stores video files in Amazon S3 buckets organized by content type. The company has a requirement to automatically archive files that are older than 90 days to Amazon S3 Glacier Deep Archive to reduce costs. However, the company wants to retain the ability to restore files within 12 hours if needed. The data engineer creates an S3 Lifecycle policy to transition objects to Glacier Deep Archive after 90 days. After deploying the policy, the engineer notices that the storage costs have not decreased significantly. On reviewing the bucket metrics, the engineer sees that many objects are being deleted directly by users before the lifecycle policy takes effect. The company needs to enforce the lifecycle policy and prevent premature deletions. What should the data engineer do to enforce the lifecycle policy?

A.Apply an S3 Object Lock retention policy with a default retention mode of GOVERNANCE and a retention period of 90 days.

B.Enable S3 Versioning on the bucket and add a lifecycle rule to transition noncurrent versions to Glacier Deep Archive after 90 days.

C.Enable S3 Multi-Factor Authentication (MFA) Delete on the bucket to require MFA for all delete operations.

D.Use S3 Glacier Select to query the objects and restore them if needed.

AnswerB

Versioning preserves deleted objects as noncurrent versions, which can then be transitioned by lifecycle rules.

Why this answer

Option B is correct because enabling S3 Versioning ensures that deletions create delete markers instead of permanently deleting objects, and a lifecycle rule with a noncurrent version transition then moves those older versions to Deep Archive. Option C is wrong because MFA Delete adds a security step to delete operations but does not enforce the lifecycle. Option A is wrong because Object Lock prevents object deletion entirely, which may not be desired.

Option D is wrong because S3 Glacier Select is for querying archived data, not for lifecycle management.
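The versioning-plus-lifecycle approach can be sketched as the configuration passed to the S3 API. This is an illustrative sketch: the rule ID and empty prefix are assumptions, and the function only builds the dictionary without calling AWS.

```python
def build_lifecycle_configuration(noncurrent_days: int = 90) -> dict:
    """Lifecycle rule that archives noncurrent object versions to Deep Archive."""
    return {
        "Rules": [
            {
                "ID": "archive-noncurrent-versions",  # assumed rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "NoncurrentVersionTransitions": [
                    {
                        "NoncurrentDays": noncurrent_days,
                        "StorageClass": "DEEP_ARCHIVE",
                    }
                ],
            }
        ]
    }


lifecycle = build_lifecycle_configuration()
# With credentials configured, versioning and the rule could be applied with:
#   s3 = boto3.client("s3")
#   s3.put_bucket_versioning(
#       Bucket="my-bucket", VersioningConfiguration={"Status": "Enabled"})
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=lifecycle)
```

Once versioning is on, a user delete only adds a delete marker, so the previous version becomes noncurrent and is still picked up by the transition rule.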

Full explanation →

665

MCQhard

Refer to the exhibit. A data engineer ran the CLI command to check the configuration of an RDS instance named 'mydb'. Which statement accurately describes the current configuration?

A.The database is in the 'stopped' state

B.The database is a Single-AZ deployment and is not a read replica

C.The database is a read replica of another instance

D.The database is a Multi-AZ deployment

AnswerB

MultiAZ False and no ReadReplicaSourceDBInstanceIdentifier.

Why this answer

Option B is correct because the CLI command output shows 'DBInstanceStatus: available' and 'MultiAZ: False', indicating the database is running as a Single-AZ deployment. Additionally, the output has no 'ReadReplicaSourceDBInstanceIdentifier' field, which would be present if the instance were a read replica, so it is not one.

Exam trap

AWS often tests the misconception that a database with 'available' status must be a read replica or Multi-AZ, but the absence of the 'ReadReplicaSourceDBInstanceIdentifier' and 'MultiAZ: False' clearly identify it as a standalone Single-AZ instance.

How to eliminate wrong answers

Option A is wrong because the CLI output shows 'DBInstanceStatus: available', not 'stopped', so the database is running, not stopped. Option C is wrong because the output lacks a 'ReadReplicaSourceDBInstanceIdentifier' field, which is mandatory for a read replica; without it, the instance is a primary database, not a replica. Option D is wrong because the output explicitly shows 'MultiAZ: False', which means it is a Single-AZ deployment, not Multi-AZ.
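The same reading of the output can be expressed as a short check over one entry of the `DBInstances` list returned by `describe-db-instances`. The exhibit is not reproduced here, so the sample dictionaries below are hypothetical and contain only the fields discussed above; `describe_deployment` is an illustrative helper name.

```python
def describe_deployment(db_instance: dict) -> str:
    """Summarize status, AZ topology, and replica role from one DBInstances entry."""
    status = db_instance.get("DBInstanceStatus", "unknown")
    topology = "Multi-AZ" if db_instance.get("MultiAZ") else "Single-AZ"
    source = db_instance.get("ReadReplicaSourceDBInstanceIdentifier")
    role = f"read replica of {source}" if source else "not a read replica"
    return f"{status}, {topology}, {role}"


# Hypothetical entry matching the fields quoted in the explanation.
standalone = {
    "DBInstanceIdentifier": "mydb",
    "DBInstanceStatus": "available",
    "MultiAZ": False,
}
# Hypothetical replica for contrast.
replica = dict(standalone, ReadReplicaSourceDBInstanceIdentifier="mydb-primary")

print(describe_deployment(standalone))
print(describe_deployment(replica))
```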

Full explanation →

666

MCQmedium

A company runs a production Amazon RDS for MySQL database with Multi-AZ deployment. The database experiences high read latency during peak hours. The company wants to improve read performance with minimal application changes. Which solution should a data engineer recommend?

A.Create one or more read replicas in a different Availability Zone.

B.Use Amazon DynamoDB Accelerator (DAX) as a caching layer.

C.Increase the DB instance size to a larger instance class.

D.Enable Multi-AZ on the existing instance.

AnswerA

Read replicas offload read traffic from the primary instance.

Why this answer

Option A is correct because creating a read replica reduces load on the primary instance and improves read performance; Multi-AZ handles failover but does not offload reads. Option D is wrong because the instance is already Multi-AZ, and Multi-AZ does not offload reads. Option C is wrong because increasing instance size may help but is more expensive and may require downtime.

Option B is wrong because DynamoDB Accelerator (DAX) is for DynamoDB, not RDS.

Full explanation →

667

MCQhard

A data pipeline uses AWS Glue to read CSV files from an S3 bucket, transform them, and write Parquet back to S3. The pipeline runs daily and processes about 500 GB per run. The team wants to reduce costs without increasing runtime. Which approach is most effective?

A.Pre-convert the CSV files to Parquet in S3 using a separate process.

B.Enable job bookmarks to skip already processed data.

C.Increase the number of DPUs for the Glue job to improve parallelism.

D.Optimize the Glue script to select only required columns and filter rows early.

AnswerD

Reduces data scanned, lowering cost.

Why this answer

Option D is correct because selecting only the required columns and filtering rows early reduces the amount of data processed, lowering costs without increasing runtime. Option C is wrong because increasing DPUs raises the hourly cost and does not guarantee a proportional drop in runtime. Option A is wrong because converting to Parquet upfront would add an extra processing step and cost.

Option B is wrong because job bookmarks track processed data but do not reduce the cost of processing each day's new data.

Full explanation →

668

MCQhard

Refer to the exhibit. A data engineer runs the AWS CLI command to describe a Glue job. The job is expected to process new data incrementally using job bookmarks. However, the job reprocesses all data every time it runs. What is the MOST likely reason?

A.The job bookmark option is set to 'job-bookmark-enable' but should be 'job-bookmark-disable'.

B.The job's MaxRetries is set to 0, which disables bookmarks.

C.The ETL script does not use the 'transformation_ctx' parameter in its DynamicFrame transformations.

D.The Glue job's command name is 'glueetl', which does not support job bookmarks.

AnswerC

Without transformation_ctx, Glue cannot track bookmarks.

Why this answer

Option C is correct because job bookmarks only work when the ETL script passes the 'transformation_ctx' parameter to its DynamicFrame reads and transformations; without it, Glue cannot track what has already been processed and reprocesses all data on every run. The exhibit shows 'Command.Name' set to 'glueetl' and '--job-bookmark-option' set to 'job-bookmark-enable', both of which are correct, and it does not show the script, which leaves the missing 'transformation_ctx' as the most likely cause.

Option A is wrong because bookmarks are already enabled, and 'job-bookmark-disable' would turn them off. Option B is wrong because MaxRetries does not affect bookmarks. Option D is wrong because 'glueetl' (Spark) jobs do support job bookmarks.

Full explanation →

669

MCQhard

Refer to the exhibit. An IAM policy is attached to an IAM role used by an application. The application needs to read objects from 'my-bucket' that have the tag 'classification=public'. The application account is 123456789012. However, the application is getting 'Access Denied' errors. What is the most likely reason?

A.The Deny statement uses StringNotEquals, which incorrectly denies the application account.

B.The policy does not grant s3:ListBucket permission, so the application cannot list objects.

C.The object being accessed does not have the tag 'classification=public'.

D.The Deny statement blocks all access from accounts other than 123456789012, but the application is in that account.

AnswerC

Without the tag, the Allow condition fails, leading to implicit deny.

Why this answer

Option C is correct because the Allow statement carries an s3:ExistingObjectTag condition requiring 'classification=public'. If the object does not have that tag, the Allow does not apply, no other statement allows GetObject, and the request ends in an implicit deny.

Option A is wrong because StringNotEquals denies only when the source account is not 123456789012. The application is in that account, so the condition is false and the Deny does not apply. Option B is wrong because a missing s3:ListBucket permission prevents listing but not a direct GetObject on a known key.

Option D is wrong because it describes the Deny correctly but that is not a cause of failure: the Deny blocks other accounts only, and the application is in 123456789012.
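The evaluation order behind this answer (explicit deny, then allow, otherwise implicit deny) can be sketched with a toy evaluator. The exhibit is not reproduced here, so the two statements below are hypothetical stand-ins built from the description in the question, and the evaluator is a simplification, not IAM's actual engine.

```python
from fnmatch import fnmatch


def evaluate(statements: list, request: dict) -> str:
    """Return 'explicit deny', 'allow', or 'implicit deny' for a request."""
    allowed = False
    for stmt in statements:
        if not fnmatch(request["action"], stmt["action"]):
            continue  # statement does not cover this action
        if not stmt["condition"](request):
            continue  # condition not met, statement does not apply
        if stmt["effect"] == "Deny":
            return "explicit deny"  # an explicit deny always wins
        allowed = True
    return "allow" if allowed else "implicit deny"


statements = [
    {   # Allow reads only for objects tagged classification=public.
        "effect": "Allow",
        "action": "s3:GetObject",
        "condition": lambda r: r["object_tags"].get("classification") == "public",
    },
    {   # Deny everything requested from outside account 123456789012.
        "effect": "Deny",
        "action": "s3:*",
        "condition": lambda r: r["account"] != "123456789012",
    },
]

tagged = {"action": "s3:GetObject", "account": "123456789012",
          "object_tags": {"classification": "public"}}
untagged = dict(tagged, object_tags={})
outsider = dict(tagged, account="999999999999")

print(evaluate(statements, tagged))
print(evaluate(statements, untagged))
print(evaluate(statements, outsider))
```

The untagged request is the case from the question: the Deny does not fire, but no Allow applies either, so the result is an implicit deny.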

Full explanation →

670

MCQmedium

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job reads JSON records and writes Parquet to Amazon S3. Recently, the job started failing with 'Out of Memory' errors. Which change is MOST likely to resolve the issue?

A.Enable compression on the Kinesis stream.

B.Change the output format from Parquet to ORC.

C.Increase the number of DPUs allocated to the Glue job.

D.Reduce the streaming batch size in the Glue job configuration.

AnswerC

More DPUs provide more memory and CPU.

Why this answer

The 'Out of Memory' error in AWS Glue indicates that the job's allocated resources are insufficient for the data volume or processing complexity. Increasing the number of DPUs (Data Processing Units) directly increases the available memory and compute capacity, which is the most straightforward fix for OOM errors in Glue streaming jobs. Option C is correct because it addresses the root cause—resource exhaustion—by scaling the job horizontally.

Exam trap

The trap here is that candidates often confuse 'Out of Memory' with a data format or compression issue, leading them to choose options like A or B, when the real solution is to scale compute resources via DPUs.

How to eliminate wrong answers

Option A is wrong because enabling compression on the Kinesis stream reduces data transfer size but does not affect the memory footprint of the Glue job processing the data; the job still decompresses records into memory. Option B is wrong because changing the output format from Parquet to ORC does not reduce memory usage—both are columnar formats with similar memory profiles, and the error is not related to serialization efficiency. Option D is wrong because reducing the streaming batch size can help with latency but does not guarantee resolution of OOM errors; the job may still fail if individual records or transformations are memory-intensive, and the core issue is insufficient total memory allocation.

Full explanation →

671

MCQeasy

A company needs to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is about 100 GB per day. Which AWS service is BEST suited for this task?

A.Use AWS DataSync to copy the database files to S3.

B.Use Amazon Kinesis Data Firehose with a database connector.

C.Use AWS Database Migration Service (DMS) to replicate data to S3.

D.Use AWS Glue to extract data from Oracle and write to S3.

AnswerC

DMS supports continuous replication from Oracle to S3.

Why this answer

Option C is correct because AWS Database Migration Service (DMS) can continuously replicate data from Oracle to S3, and it supports full load and change data capture (CDC). Option A (AWS DataSync) is for file-based transfers, not database replication. Option B (Amazon Kinesis Data Firehose) is for streaming data, not database pull.

Option D (AWS Glue) is for ETL but does not natively support continuous CDC from Oracle.
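As an illustration of how a DMS task is told which Oracle tables to replicate, here is a minimal sketch that builds a table-mappings document with one selection rule. The schema name SALES and the helper name are hypothetical; the rule keys follow the DMS table-mapping JSON format. The resulting JSON string would typically be passed as the table mappings of a replication task whose migration type is full-load-and-cdc.

```python
import json


def selection_rule(rule_id: int, schema: str, table: str = "%") -> dict:
    """Build one DMS selection rule that includes matching tables."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": f"include-{schema}-{rule_id}",
        # "%" is the DMS wildcard, so this matches every table in the schema.
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
    }


# SALES is a hypothetical schema name used only for illustration.
table_mappings = {"rules": [selection_rule(1, "SALES")]}
table_mappings_json = json.dumps(table_mappings, indent=2)
print(table_mappings_json)
```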

672

Multi-Selecteasy

A data engineer is troubleshooting a failed AWS Glue job that reads from an S3 bucket and writes to an Amazon Redshift table. The error message indicates 'Access Denied'. Which TWO permissions are likely missing? (Choose TWO.)

Select 2 answers

A.s3:GetObject on the source S3 bucket

B.redshift:CopyFromS3 on the Redshift cluster

C.kms:Decrypt on a KMS key

D.ec2:DescribeInstances on the Redshift cluster

E.lambda:InvokeFunction on a Lambda function

AnswersA, B

Glue needs read access to objects in the S3 bucket.

Why this answer

Options A and B are required for the Glue job to read from S3 and write to Redshift. Option C (KMS) is unnecessary unless the data is encrypted with a KMS key. Option D (EC2) is not needed if the job uses a VPC endpoint.

Option E (Lambda) is not used by this job.

673

MCQmedium

Refer to the exhibit. An IAM policy is attached to an EC2 instance role that runs a data ingestion application. The application reads files from an S3 bucket 'data-lake-primary' and sends records to a Kinesis stream named 'clickstream'. The application is failing with an 'AccessDenied' error when trying to read from S3. What is the MOST likely cause?

A.The actions are specified incorrectly; they should be s3:GetObject and s3:PutObject only.

B.The policy is not attached to the EC2 instance role.

C.The Kinesis stream name is incorrect.

D.The policy does not include the s3:ListBucket permission.

AnswerD

Reading objects often requires ListBucket permission for the bucket.

Why this answer

Option D is correct. The policy allows s3:GetObject, and its resource ends with a trailing slash and wildcard (/*), so it covers the objects in the bucket, including objects under a subfolder prefix. However, s3:ListBucket applies to the bucket itself rather than to its objects, and the policy does not grant it, so the request fails when the application lists objects.

The most common cause is that the application is trying to list the bucket (s3:ListBucket), which is not allowed. Option A is wrong because the actions in the policy are specified correctly. Option B is wrong because the policy is attached to the role.

Option C is wrong because the stream name is correct and has no bearing on the S3 read.
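The distinction between bucket-level and object-level permissions can be sketched as a policy document. This is a minimal illustration, not the policy from the exhibit: the Sid values are hypothetical, and only the S3 statements are shown. Note that s3:ListBucket targets the bucket ARN, while s3:GetObject targets the object ARN pattern ending in /*.

```python
import json

BUCKET = "data-lake-primary"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Bucket-level action: the resource is the bucket ARN itself.
            "Sid": "ListBucketContents",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            # Object-level action: the resource matches keys inside the bucket.
            "Sid": "ReadObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

A frequent mistake is to attach s3:ListBucket to the /* resource; the statement is accepted but never matches a list request, so the call is still denied.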

674

MCQeasy

A company uses AWS Glue to process CSV files from an S3 bucket. The job fails intermittently with a 'SchemaDetectionError' for files that have inconsistent column counts. What is the most efficient way to handle this?

A.Use the 'mergeSchema' option when reading the DynamicFrame.

B.Convert all CSV files to Parquet format using a separate preprocessing job.

C.Define a fixed schema in the Glue job using 'apply_mapping' to map columns.

D.Set the job to 'ignore' schema mismatches in the job parameters.

AnswerA

'mergeSchema' allows Glue to handle schemas that evolve over time.

Why this answer

Option A is correct because the 'mergeSchema' option in AWS Glue's DynamicFrame reader automatically reconciles schema differences across files, including inconsistent column counts. When enabled, Glue merges all schemas encountered during the read, adding nulls for missing columns in files with fewer columns, preventing the 'SchemaDetectionError' without manual intervention.

Exam trap

The trap here is that candidates often confuse 'mergeSchema' with schema-on-read features in other tools or assume that 'apply_mapping' can fix schema mismatches, when in reality it only transforms already-resolved schemas.

How to eliminate wrong answers

Option B is wrong because converting to Parquet does not inherently solve schema inconsistency; Parquet also requires a consistent schema across files unless mergeSchema is explicitly enabled, and adding a preprocessing job is less efficient than handling it inline. Option C is wrong because 'apply_mapping' only remaps existing columns after the schema is resolved; it does not handle files with missing or extra columns that cause the initial schema detection to fail. Option D is wrong because AWS Glue does not have a job parameter to 'ignore' schema mismatches; the error occurs during schema detection, and ignoring it would lead to data corruption or job failure.
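The reconciliation behavior described above (taking the union of columns and filling nulls where a file has fewer columns) can be illustrated in plain Python. This is only a sketch of the idea using the standard library; it is not Glue's API or implementation, and the function name and sample data are hypothetical.

```python
import csv
import io


def merge_csv_files(files):
    """Read CSV texts with differing headers into rows sharing one merged schema."""
    columns = []  # union of all column names, in first-seen order
    parsed = []
    for text in files:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        parsed.extend(reader)
    # Columns missing from a row's source file are filled with None (null).
    rows = [{c: row.get(c) for c in columns} for row in parsed]
    return columns, rows


# Hypothetical sample: the second file has an extra "humidity" column.
file_a = "id,temp\n1,20.5\n"
file_b = "id,temp,humidity\n2,21.0,40\n"
columns, rows = merge_csv_files([file_a, file_b])
print(columns)
print(rows)
```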

675

Multi-Selectmedium

A data engineering team is designing a data lake on Amazon S3 for storing sensor data from IoT devices. The data is written in near real-time and needs to be queried using Amazon Athena. Which TWO configurations should the team implement to optimize query performance and minimize costs?

Select 2 answers

A.Compress the data using GZIP.

B.Use S3 Standard-IA storage class.

C.Store the data in Apache Parquet format.

D.Partition the data by date and sensor ID.

E.Enable Requester Pays on the S3 bucket.

AnswersC, D

Parquet is columnar and reduces scan size.

Why this answer

Apache Parquet is a columnar storage format that allows Athena to read only the columns needed for a query, drastically reducing I/O and scan costs. Combined with compression (like Snappy or GZIP), Parquet minimizes the amount of data scanned per query, which directly lowers Athena's cost (charged per TB scanned) and improves query performance through predicate pushdown and efficient encoding. Partitioning by date and sensor ID lets Athena skip every partition that does not match the query's filters, which reduces the data scanned even further.

Exam trap

AWS often tests the misconception that any compression (like GZIP alone) is sufficient for Athena optimization, but the trap is that without a columnar format like Parquet or ORC, compression alone does not enable column pruning or predicate pushdown, leading to higher scan costs and slower queries.
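The partitioning in option D can be sketched as a Hive-style key layout, which Athena can map to partition columns. This is a minimal sketch: the prefix, sensor ID, file name, and function name are hypothetical, and the partition column names date and sensor_id are assumptions chosen to match the question.

```python
from datetime import date


def partitioned_key(prefix: str, reading_date: date, sensor_id: str, filename: str) -> str:
    """Build an S3 object key using Hive-style key=value partition folders."""
    return (
        f"{prefix}/date={reading_date.isoformat()}"
        f"/sensor_id={sensor_id}/{filename}"
    )


# Hypothetical values used only for illustration.
key = partitioned_key("sensor-data", date(2024, 1, 15), "s-001", "part-0000.parquet")
print(key)  # sensor-data/date=2024-01-15/sensor_id=s-001/part-0000.parquet
```

With this layout, a query that filters on date and sensor_id only reads the Parquet files under the matching prefixes instead of scanning the whole data set.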
