Knowledge + Practice

AWS Certified Data Engineer Associate DEA-C01 (DEA-C01) — Questions 1576–1650

1786 questions total · 24 pages · All types, answers revealed



1576

Multi-Selecthard

A company is migrating a legacy on-premises ETL pipeline to AWS. The pipeline processes daily batch files from an FTP server. The data must be transformed using complex business logic before being loaded into Amazon Redshift. Which THREE AWS services should be used for this migration?

Select 3 answers

A.Amazon Athena

B.Amazon Redshift

C.AWS Glue

D.Amazon Kinesis Data Streams

E.AWS Transfer Family

AnswersB, C, E

Redshift is the target data warehouse.

Why this answer

Options B, C, and E are correct because AWS Transfer Family can receive files over FTP, AWS Glue can perform complex transformations, and Amazon Redshift is the target. Option D is wrong because Kinesis Data Streams is for streaming data, not daily batch files. Option A is wrong because Athena is for querying, not ETL.

Full explanation →

1577

MCQeasy

Refer to the exhibit. A data engineer sees this error log from an Amazon EC2 instance that is trying to access an S3 bucket in the us-west-2 region. The EC2 instance is in a VPC with a private subnet and no internet gateway. What is the MOST likely cause of this error?

A.The S3 bucket is in a different region than us-west-2.

B.The VPC does not have a VPC endpoint for S3.

C.The S3 bucket does not exist.

D.The IAM role attached to the EC2 instance does not have s3:GetObject permission.

AnswerB

Private subnet needs VPC endpoint to access S3.

Why this answer

Option B is correct. The EC2 instance is in a private subnet without an internet gateway, so it cannot reach S3 over the internet. A VPC endpoint (Gateway or Interface) is needed for private connectivity.

Option C is incorrect because the bucket exists (DNS resolves). Option D is incorrect because the error is a connection timeout, not a 403. Option A is incorrect because there is no indication of an incorrect region.

Full explanation →

1578

MCQeasy

A data engineer is designing a data lake on Amazon S3. The data lake will store raw data, transformed data, and curated datasets. The engineer needs to ensure that raw data is immutable (never overwritten or deleted) and that only authorized users can access the transformed data. Which combination of S3 features should the engineer use?

A.Use S3 Lifecycle policies to archive raw data to S3 Glacier and set bucket policies for transformed data.

B.Enable S3 Versioning and use S3 Access Points for each prefix.

C.Enable default encryption with SSE-KMS and use S3 bucket policies to restrict access.

D.Enable S3 Object Lock in compliance mode on the raw data prefix and use bucket policies to restrict access to transformed data prefix.

AnswerD

Object Lock in compliance mode prevents locked object versions from being overwritten or deleted; bucket policies control access.

Why this answer

Option D is correct. S3 Object Lock prevents objects from being deleted or overwritten; S3 bucket policies control access to specific prefixes. Option B is wrong because versioning alone does not prevent deletion; delete markers can still be placed.

Option A is wrong because lifecycle policies can delete objects. Option C is wrong because SSE-KMS encrypts data but does not prevent deletion.

Full explanation →

1579

MCQeasy

A data engineer needs to store semi-structured JSON data that is accessed infrequently but must be retrievable within 5 minutes. The data is immutable once stored. Which storage solution is MOST cost-effective?

A.Amazon S3 Glacier Deep Archive

B.Amazon S3 Standard

C.Amazon S3 One Zone-IA

D.Amazon S3 Glacier Instant Retrieval

AnswerC

S3 One Zone-IA is cost-effective for infrequently accessed data with low latency retrieval.

Why this answer

S3 One Zone-IA is cost-effective for infrequently accessed data that can be recreated, and it returns objects within milliseconds, well inside the 5-minute requirement. Option B is wrong because S3 Standard has a higher storage price and is intended for frequently accessed data.

Option D is wrong under this answer key: S3 Glacier Instant Retrieval targets data accessed about once per quarter and also returns objects in milliseconds, but it has higher per-GB retrieval charges and a longer minimum storage duration than One Zone-IA, even though its storage price is lower. Option A is wrong because S3 Glacier Deep Archive is the lowest-cost storage class but takes hours to restore (up to 12 hours for standard retrievals), which cannot meet a 5-minute requirement.

Full explanation →

1580

MCQeasy

A data engineer needs to ingest data from multiple SaaS applications (Salesforce, Marketo) into Amazon S3 for a data lake. The data volumes are moderate and the sync needs to be scheduled daily. Which AWS service is most appropriate for this task?

A.AWS Glue

B.Amazon AppFlow

C.AWS Database Migration Service (DMS)

D.Amazon Kinesis Data Firehose

AnswerB

Designed for SaaS data ingestion.

Why this answer

Amazon AppFlow is purpose-built for securely transferring data between SaaS applications (like Salesforce and Marketo) and AWS services (like S3). It supports scheduled, incremental data syncs with built-in connectors, making it the most appropriate choice for moderate-volume daily ingestion into a data lake.

Exam trap

The trap here is that candidates often confuse AWS Glue's ETL capabilities with direct SaaS ingestion, overlooking that Glue requires a custom connector or script to pull from APIs, whereas AppFlow provides native, managed connectors.

How to eliminate wrong answers

Option A is wrong because AWS Glue is an ETL service designed for batch data transformation and cataloging, not for direct ingestion from SaaS applications; it lacks native connectors for Salesforce or Marketo. Option C is wrong because AWS DMS is intended for migrating databases (e.g., Oracle, MySQL) to AWS, not for pulling data from SaaS APIs. Option D is wrong because Amazon Kinesis Data Firehose is optimized for streaming data ingestion (e.g., from IoT or logs) and does not provide native SaaS connectors or scheduled sync capabilities.

Full explanation →

1581

Matchingmedium

Match each AWS data migration tool to its primary function.


Concepts

Matches

Migrate databases with minimal downtime

Physical device for large data transfer

Online data transfer between on-prem and AWS

Fast uploads over long distances

Combine data across sources into views

Why these pairings

Different tools suit different migration scenarios.

Full explanation →

1582

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data must be transformed from JSON to Parquet format before delivery. Which approach should be used?

A.Configure Kinesis Data Firehose to invoke a Lambda function for data transformation.

B.Use an AWS Glue ETL job to read from S3 and write Parquet back to S3.

C.Use Amazon EMR to process the data and output Parquet.

D.Use Kinesis Data Analytics to convert the data to Parquet.

AnswerA

Firehose can call a Lambda function to transform records, including converting JSON to Parquet.

Why this answer

Kinesis Data Firehose can invoke a Lambda function as a transformation step before data is delivered to S3. This allows you to convert JSON records to Parquet format inline, without needing an intermediate storage or separate processing pipeline. The Lambda function receives batches of records, transforms them (e.g., using PyArrow or similar libraries), and returns them to Firehose for delivery.
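
As a rough illustration of the record contract such a transformation function follows, here is a minimal sketch, assuming a Python Lambda runtime. Firehose passes base64-encoded records and expects each one back with the same recordId, a result status, and base64-encoded data. The transform helper and the "processed" field are hypothetical placeholders; producing real Parquet output would need a library such as PyArrow, which is not shown here.

```python
import base64
import json


def transform(payload):
    """Hypothetical per-record transformation; replace with real logic."""
    payload["processed"] = True
    return payload


def lambda_handler(event, context):
    """Firehose transformation handler: decode, transform, re-encode each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            data = json.dumps(transform(payload)) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, TypeError):
            # Hand the original data back so Firehose treats it as a failed record.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming recordId must appear in the response, which is why failed records are returned rather than skipped.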

Exam trap

The trap here is that candidates may think they need a separate ETL service like Glue or EMR for format conversion, but Firehose's built-in Lambda integration is the simplest and most cost-effective way to transform data in-flight before delivery.

How to eliminate wrong answers

Option B is wrong because using an AWS Glue ETL job to read from S3 and write Parquet back to S3 introduces unnecessary latency and cost, and does not meet the requirement for transformation before delivery — it processes data after it is already stored. Option C is wrong because Amazon EMR is a heavy, cluster-based solution designed for large-scale batch or stream processing, not for lightweight, real-time transformation within a Firehose delivery stream. Option D is wrong because Kinesis Data Analytics is used for real-time analytics and SQL-based processing, not for converting data formats like JSON to Parquet; it cannot output Parquet directly to S3.

Full explanation →

1583

MCQmedium

Refer to the exhibit.

Exhibit:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/DataEngineer" },
      "Action": [ "kms:Decrypt", "kms:ReEncrypt*" ],
      "Resource": "*"
    }
  ]
}

A data engineer tries to encrypt data using the KMS key associated with this key policy and receives an access denied error. What is the cause?

A.The principal is an IAM role, which is not allowed

B.There is an explicit deny in the policy

C.The Resource element is set to "*", which is invalid for KMS key policies

D.The policy does not include kms:Encrypt action

AnswerD

Encrypt action is missing.

Why this answer

Option D is correct. The key policy grants Decrypt and ReEncrypt, but not Encrypt. Option C is wrong because "*" is a valid Resource in a key policy, where it refers to the KMS key the policy is attached to.

Option A is wrong because an IAM role is a valid principal. Option B is wrong because there is no explicit deny.
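
To make the missing-action reasoning concrete, here is a minimal sketch that checks a requested action against the Allow patterns from the exhibit. It is a deliberate simplification, not how IAM evaluates policies in full: it ignores principals, conditions, and explicit denies, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Action patterns copied from the exhibit's single Allow statement.
KEY_POLICY_ACTIONS = ["kms:Decrypt", "kms:ReEncrypt*"]


def is_action_allowed(action, allowed_patterns):
    """Return True if the action matches any Allow pattern.

    Action names are compared case-insensitively and '*' acts as a wildcard.
    """
    action = action.lower()
    return any(fnmatchcase(action, pattern.lower()) for pattern in allowed_patterns)


for requested in ("kms:Decrypt", "kms:ReEncryptFrom", "kms:Encrypt"):
    verdict = "allowed" if is_action_allowed(requested, KEY_POLICY_ACTIONS) else "denied"
    print(requested, verdict)
```

Because nothing in the policy matches kms:Encrypt, the request falls through to the implicit deny.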

Full explanation →

1584

MCQmedium

A company is streaming IoT sensor data to Amazon Kinesis Data Streams. The data is JSON with a schema that changes occasionally. They want to load the data into Amazon S3 in Parquet format partitioned by date and sensor_id. Which approach is MOST cost-effective and operationally efficient?

A.Use Amazon EMR to read from Kinesis Data Streams and write to S3 in Parquet format.

B.Use a Lambda function to transform records to Parquet and write to S3.

C.Use a custom Kinesis Client Library application on EC2 to buffer and write Parquet files to S3.

D.Use Amazon Kinesis Data Firehose with a schema from AWS Glue Data Catalog to convert to Parquet and enable dynamic partitioning by date and sensor_id.

AnswerD

Firehose natively supports Parquet conversion and dynamic partitioning, minimizing custom code and cost.

Why this answer

Option D is correct because Kinesis Data Firehose can directly convert JSON to Parquet using a Glue table schema and also supports dynamic partitioning (e.g., by date and sensor_id) without needing a separate transformation step. Option B (Lambda) adds cost and complexity. Option A (EMR) is overkill for this streaming case.

Option C (Kinesis Client Library + EC2) is custom and not managed. Option D is the most cost-effective and operationally efficient.

Full explanation →

1585

MCQhard

A data engineer is designing a data ingestion pipeline for JSON files landing in an Amazon S3 bucket. The pipeline must transform the data (e.g., flatten nested structures) and load it into Amazon Redshift. The transformation logic is complex and may evolve frequently. Which approach provides the MOST flexibility and ease of maintenance?

A.Use AWS Lambda functions to transform each file and load into Redshift.

B.Use the Amazon Redshift COPY command to load raw JSON directly.

C.Use AWS Glue ETL jobs to transform the data and load into Redshift.

D.Use Amazon Athena to query the raw data and insert into Redshift.

AnswerC

Glue ETL supports complex transformations and is easy to maintain.

Why this answer

Option C is correct because Glue ETL (Apache Spark) provides a flexible, code-based environment for complex transformations. Option D is incorrect because Athena is for querying, not ETL. Option A is incorrect because Lambda is limited by execution time and memory for large files.

Option B is incorrect because the Redshift COPY command loads data but does not apply complex transformations.
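
For readers unsure what "flatten nested structures" means in practice, here is a minimal sketch in plain Python. It is only an illustration of the idea: in a Glue job the same logic would normally be expressed with Spark or DynamicFrame transforms, and the function name, separator, and sample record are assumptions.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into a single level of column-like keys.

    Lists and scalar values are kept as-is; only nested dicts are expanded.
    """
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical nested record, similar to what might land in S3.
nested = {"id": 1, "user": {"name": "a", "geo": {"lat": 1.5}}}
print(flatten(nested))
```

Each nested path becomes one flat key (for example user_geo_lat), which maps naturally onto a Redshift column.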

Full explanation →

1586

MCQmedium

Refer to the exhibit. A data engineer is troubleshooting an AWS Lambda function that reads from an S3 bucket and writes to a Kinesis Data Stream. The Lambda function fails with an AccessDeniedException when calling the kinesis:PutRecords API. Which change is needed to the IAM policy?

A.Add s3:PutObject permission to the policy

B.Change the resource ARN for Kinesis to a wildcard

C.Change the resource ARN for Kinesis to include the correct stream name

D.Add kinesis:PutRecords permission to the policy

AnswerB

The policy already grants the action, so the denial points to the Resource element: it does not cover the stream the function writes to. Widening the resource covers that stream.

Why this answer

Option B is correct under this answer key. The exhibit policy allows kinesis:PutRecord and kinesis:PutRecords, but only on one stream ARN (arn:aws:kinesis:us-east-1:123456789012:stream/input-stream). An AccessDeniedException on PutRecords therefore indicates that the function is writing to a stream that ARN does not match, and a wildcard resource removes the mismatch.

Option D is wrong because kinesis:PutRecords is already in the policy. Option A is wrong because s3:PutObject has no bearing on a Kinesis API call. Option C, scoping the ARN to the correct stream name, addresses the same mismatch and is the tighter fix from a least-privilege standpoint; the key prefers the wildcard.

Full explanation →

1587

MCQeasy

A company wants to ensure that all S3 buckets are encrypted using server-side encryption. Which AWS service can be used to automatically remediate non-compliant buckets?

A.AWS CloudTrail

B.Amazon Inspector

C.AWS Trusted Advisor

D.AWS Config

AnswerD

AWS Config can evaluate compliance and automatically remediate resources.

Why this answer

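AWS Config can use managed rules like s3-bucket-server-side-encryption-enabled to check compliance and trigger auto-remediation via SSM Automation or Lambda. Option D is correct. Option A is wrong because CloudTrail records API activity but does not evaluate or remediate configuration. Option B is wrong because Amazon Inspector scans workloads for software vulnerabilities. Option C is wrong because Trusted Advisor reports recommendations but does not remediate resources.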

Full explanation →

1588

MCQeasy

A data engineer is configuring AWS Glue jobs to access data stored in Amazon S3. The data is encrypted using server-side encryption with AWS KMS (SSE-KMS). The Glue job needs to read and write data to the S3 bucket. Which IAM policy statement should be added to the Glue job's IAM role to allow it to use the KMS key?

A.{"Effect":"Allow","Action":["kms:Decrypt"],"Resource":"*"}

B.{"Effect":"Allow","Action":["kms:Decrypt","kms:GenerateDataKey"],"Resource":"*"}

C.{"Effect":"Allow","Action":["kms:Decrypt","kms:ReEncrypt"],"Resource":"*"}

D.{"Effect":"Allow","Action":["kms:Decrypt","kms:Encrypt"],"Resource":"*"}

AnswerB

These actions allow reading (Decrypt) and writing (GenerateDataKey) encrypted data.

Why this answer

To read and write data encrypted with SSE-KMS, AWS Glue needs both `kms:Decrypt` (to read existing encrypted data) and `kms:GenerateDataKey` (to create a new data key for writing encrypted data). `kms:GenerateDataKey` is required because S3 uses a data key to encrypt objects, and the caller must generate that key via KMS. Option B correctly includes both actions, allowing the Glue job to perform read and write operations on the SSE-KMS encrypted bucket.

Exam trap

The trap here is that candidates often assume `kms:Encrypt` is needed for writing encrypted data, but S3 SSE-KMS actually requires `kms:GenerateDataKey` because the encryption is done with a derived data key, not by calling `kms:Encrypt` directly.

How to eliminate wrong answers

Option A is wrong because it only grants `kms:Decrypt`, which allows reading encrypted data but not writing new encrypted objects; writing requires `kms:GenerateDataKey` to create the encryption key. Option C is wrong because `kms:ReEncrypt` is used for re-encrypting data under a different KMS key, which is not needed for standard S3 read/write operations with SSE-KMS. Option D is wrong because `kms:Encrypt` is used to encrypt plaintext data directly with a KMS key, but S3 SSE-KMS requires `kms:GenerateDataKey` (not `kms:Encrypt`) to obtain a data key for object-level encryption.

Full explanation →

1589

MCQhard

A company runs a critical application on Amazon RDS for MySQL that requires a Recovery Point Objective (RPO) of 5 minutes and a Recovery Time Objective (RTO) of 1 hour. The database is 500 GB. What is the MOST cost-effective disaster recovery solution that meets these requirements?

A.Deploy the database in a single Availability Zone and perform manual point-in-time restores.

B.Take automated snapshots daily and store them in Amazon S3.

C.Use a Multi-AZ deployment with automatic failover.

D.Create a cross-region read replica and promote it during a disaster.

AnswerC

Multi-AZ provides synchronous replication to a standby in another AZ, achieving RPO of seconds and RTO of minutes.

Why this answer

Multi-AZ RDS for MySQL provides synchronous standby replication to a second Availability Zone, enabling automatic failover with minimal data loss (typically zero) and RTO under 1 hour. This meets the RPO of 5 minutes and RTO of 1 hour without manual intervention, and is more cost-effective than a cross-region replica for a 500 GB database.

Exam trap

The trap here is that candidates often confuse Multi-AZ (synchronous, same-region, automatic failover) with cross-region read replicas (asynchronous, manual promotion), assuming both provide similar DR capabilities, but Multi-AZ is the only option that meets both RPO and RTO cost-effectively for a single-region requirement.

How to eliminate wrong answers

Option A is wrong because manual point-in-time restores from backups cannot achieve an RTO of 1 hour due to the time required to restore 500 GB from S3, and RPO depends on backup frequency, which is not guaranteed to be 5 minutes. Option B is wrong because daily automated snapshots provide an RPO of up to 24 hours, far exceeding the 5-minute requirement, and restoring from snapshots takes longer than 1 hour for a 500 GB database. Option D is wrong because a cross-region read replica uses asynchronous replication, which can introduce lag exceeding 5 minutes, and promoting it during a disaster requires manual steps that increase RTO beyond 1 hour; it is also more expensive due to cross-region data transfer costs.

Full explanation →

1590

MCQhard

A company ingests clickstream data into Amazon S3 via Kinesis Data Firehose. The data arrives in 20 MB files every 2 minutes. The data engineering team needs to transform nested JSON into a flat structure before loading into Amazon Redshift. Which approach is most cost-effective and scalable?

A.Create an AWS Glue ETL job that runs on a schedule, using dynamic frames to flatten the data and write to S3 in Parquet

B.Run an Amazon EMR cluster with Spark to flatten the data and write back to S3

C.Use AWS Lambda to transform each file as it arrives in S3

D.Use Amazon Redshift Spectrum to query the nested JSON directly and create a view

AnswerA

Glue's dynamic frames natively handle nested JSON and can run cost-effectively on a schedule.

Why this answer

Option A is correct because Glue dynamic frame transforms (for example, Relationalize or Unnest) can flatten nested JSON, and a scheduled job that processes data incrementally by time partition is efficient. Option D is wrong because Redshift Spectrum queries the raw data but does not transform it. Option B is wrong because EMR is overkill for simple flattening.

Option C is wrong because Lambda has a 15-minute timeout and may not handle large files.

Full explanation →

1591

Multi-Selectmedium

A company wants to use AWS Glue to transform data stored in Amazon S3. The data is partitioned by date and includes both CSV and Parquet files. The transformation should be optimized for cost and performance. Which THREE actions should the data engineer take? (Choose THREE.)

Select 3 answers

A.Run a crawler to update the schema before each job run.

B.Use partition pruning by filtering on the date column in the ETL script.

C.Use job bookmarks to process only new data.

D.Increase the number of DPUs to the maximum allowed.

E.Convert all files to Parquet format before processing.

AnswersB, C, E

Reduces data scanned.

Why this answer

Options B, C, and E are correct. Partition pruning reduces the amount of data scanned. Using Parquet improves performance and compression.

Using a job bookmark prevents reprocessing of old data. Option D is wrong because increasing the number of DPUs raises cost without making the job itself more efficient. Option A is wrong because a crawler run before every job is unnecessary when the schema is already known.

Full explanation →

1592

MCQmedium

Refer to the exhibit. A data engineer creates an AWS Glue job using this CloudFormation template. The job processes new data files in S3 and uses job bookmarks to track processed files. After initial success, the job runs again but processes all files again instead of only new ones. What is the most likely cause?

A.The job bookmark option is set to 'job-bookmark-disable'

B.The enable-metrics parameter is set to true

C.The MaxRetries parameter is set to 0

D.The S3 input path does not have a partitioning scheme or timestamp to identify new files

AnswerD

Job bookmarks rely on partition structure or file timestamps to track progress.

Why this answer

Option D is correct because, according to this answer key, job bookmarks need a partitioning scheme or timestamp in the S3 path to identify new files; without one, Glue cannot tell which files it has already processed. Option A is wrong because the bookmark option in the template is enabled, not disabled. Option C is wrong because max retries do not affect bookmark behavior.

Option B is wrong because metrics are unrelated to bookmarks.

Full explanation →

1593

Multi-Selectmedium

Which TWO statements are true about Amazon Redshift distribution styles? (Choose TWO.)

Select 2 answers

A.KEY distribution is always the best choice to minimize data skew.

B.AUTO distribution always selects EVEN distribution.

C.ALL distribution copies the entire table to every node.

D.Redshift automatically assigns a ROUND ROBIN distribution style by default.

E.EVEN distribution distributes rows across slices in a round-robin fashion.

AnswersC, E

ALL distribution is useful for small tables that are frequently joined.

Why this answer

Option E is correct: EVEN distribution assigns rows to slices in round-robin fashion regardless of their values, which spreads rows evenly. Option C is correct: ALL distribution copies the entire table to every node, which can improve join performance but uses more storage. Option A is wrong because KEY distribution can cause skew when key values are unevenly distributed.

Option B is wrong because AUTO distribution lets Redshift choose the style rather than always selecting EVEN. Option D is wrong because Redshift has no ROUND ROBIN distribution style; the styles are AUTO, EVEN, KEY, and ALL, and AUTO is the default.
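
The difference between round-robin placement and key-based placement can be seen in a toy simulation. This is a sketch under stated assumptions: the CRC32 hash stands in for Redshift's internal hash function, which is not documented here, and the row data and slice count are made up.

```python
import zlib
from collections import Counter


def even_slices(rows, num_slices):
    """EVEN style: assign rows to slices in turn, ignoring row values."""
    return [i % num_slices for i, _ in enumerate(rows)]


def key_slices(rows, num_slices, key):
    """KEY style: rows sharing a key value always land on the same slice."""
    return [zlib.crc32(str(row[key]).encode()) % num_slices for row in rows]


# Hypothetical skewed data: one customer accounts for most rows.
rows = [{"customer": "acme"}] * 8 + [{"customer": "globex"}] * 2

print("EVEN:", Counter(even_slices(rows, 4)))
print("KEY: ", Counter(key_slices(rows, 4, "customer")))
```

With skewed key values, the key-based placement piles most rows onto one slice, while the round-robin placement keeps slice sizes within one row of each other.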

Full explanation →

1594

MCQeasy

A media company is building a data pipeline to ingest user activity logs from multiple sources into Amazon S3. The logs are JSON files generated every minute. The company wants to use Amazon Athena to query the logs with minimal latency and cost. The current approach is to use Amazon Kinesis Data Firehose to deliver the logs to S3 with a prefix like 'logs/2024/01/01/00/file.json'. However, when running Athena queries, the team notices high query costs because Athena scans all files in the 'logs/' prefix even when querying for a specific date. What should the team do to reduce the amount of data scanned by Athena?

A.Create an Athena view that filters by date.

B.Increase the number of partitions by using a more granular prefix like 'logs/2024/01/01/00/00/'.

C.Convert the JSON files to Apache Parquet format using AWS Glue ETL jobs.

D.Create a Hive-style partition structure in S3 with keys like 'year=2024/month=01/day=01/hour=00/' and update the Glue Data Catalog accordingly.

AnswerD

Partition pruning allows Athena to scan only relevant directories, reducing costs.

Why this answer

Option D is correct because partitioning the data in S3 by date (e.g., year=2024/month=01/day=01/hour=00/) allows Athena to use partition pruning and scan only relevant partitions. Option C is wrong because converting to Parquet helps but alone does not solve the full scan issue. Option B is wrong because a more granular prefix without partition keys registered in the catalog still results in a full scan.

Option A is wrong because creating views does not change underlying storage or scan behavior.
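
A minimal sketch of how an object key in the Hive-style layout could be derived from an event timestamp. The function name and the "logs" root prefix are assumptions for illustration; only the key=value folder convention comes from the question.

```python
from datetime import datetime, timezone


def hive_partition_key(ts, filename, root="logs"):
    """Build an S3 object key using Hive-style key=value partition folders."""
    return (
        f"{root}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/{filename}"
    )


event_time = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
print(hive_partition_key(event_time, "file.json"))
```

Once the partitions are registered in the Glue Data Catalog, a query that filters on year, month, and day only reads objects under the matching folders.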

Full explanation →

1595

MCQmedium

A company runs a nightly ETL job using AWS Glue. The job reads data from a JDBC connection to an on-premises MySQL database. The job fails with an error indicating that the connection pool is exhausted. What is the most likely cause and solution?

A.The database is not reachable due to network issues. Check VPC and security groups.

B.The Glue job is hitting the AWS Glue connection pool limit. Increase the Glue connection pool size.

C.The database credentials are expired. Rotate the password in AWS Secrets Manager.

D.The Glue job is using too many executors, exhausting the database connections. Reduce the number of DPUs or increase the database max connections.

AnswerD

Glue can open multiple connections; reducing parallelism or scaling database helps.

Why this answer

Option D is correct because high parallelism in Glue creates many connections, overwhelming the database. Option A is wrong because network issues would cause timeouts, not pool exhaustion. Option B is wrong because Glue does not have connection pool limits.

Option C is wrong because expired credentials would cause authentication errors, not pool exhaustion.

Full explanation →

1596

MCQhard

A company is running a MySQL database on Amazon RDS. The database size is 2 TB, and the company needs to migrate it to Amazon Aurora MySQL with minimal downtime. Which migration strategy is most appropriate?

A.Create an Aurora MySQL read replica from the RDS instance, then promote it.

B.Use mysqldump to export the database and import it into Aurora.

C.Take a snapshot of the RDS instance and restore it as an Aurora cluster.

D.Use AWS Database Migration Service (DMS) with full load and ongoing replication.

AnswerA

This approach allows replication with minimal downtime, then promote to master.

Why this answer

Creating an Aurora MySQL read replica from the existing RDS MySQL instance allows the Aurora cluster to stay synchronized with the source using MySQL’s native binlog replication. Once the replica lag reaches zero, you can promote it to a standalone Aurora cluster with minimal downtime, typically just a few seconds to stop writes and redirect traffic. This approach avoids the lengthy export/import process and leverages Amazon’s managed replication for near-zero-downtime migration.

Exam trap

The trap here is that candidates often assume DMS is always the best for minimal downtime, but for MySQL-to-Aurora migrations, the native read-replica promotion is simpler, faster, and fully managed by AWS, making it the most appropriate choice for this specific scenario.

How to eliminate wrong answers

Option B is wrong because mysqldump exports data as SQL statements, which for a 2 TB database would take hours to export and even longer to import, causing significant downtime and potential consistency issues. Option C is wrong because, although an RDS for MySQL snapshot can be migrated to an Aurora MySQL cluster, writes made after the snapshot are not carried over, so the application must stop writing while 2 TB is migrated, which means extended downtime. Option D is wrong because while DMS with full load and ongoing replication can achieve minimal downtime, it adds unnecessary complexity and overhead compared to the simpler, native read-replica promotion method, which is the recommended AWS approach for MySQL-to-Aurora migrations.

Full explanation →

1597

Multi-Selecthard

A data engineer is designing a disaster recovery strategy for an Amazon RDS for MySQL database with Multi-AZ deployment. Which THREE actions should the engineer take to meet a Recovery Point Objective (RPO) of 5 minutes and a Recovery Time Objective (RTO) of 15 minutes? (Choose THREE.)

Select 3 answers

A.Enable automated backups with a retention period of 7 days.

B.Create a cross-region read replica to another AWS region.

C.Configure the DB instance to be Single-AZ for simplicity.

D.Export automated snapshots to an S3 bucket in a different region.

E.Enable Multi-AZ deployment for automatic failover.

AnswersA, B, E

Automated backups allow point-in-time recovery within the retention window, helping meet RPO.

Why this answer

Options A, B, and E provide fast failover and minimal data loss. Option E (Multi-AZ) ensures automatic failover to a standby in another AZ. Option B (cross-region read replica) can be promoted in minutes.

Option A (automated backups) provides point-in-time recovery. Option D (exporting snapshots to S3 in another region) is slower and may not meet the RTO. Option C (Single-AZ) increases risk.

Full explanation →

1598

MCQmedium

A company uses AWS Glue ETL jobs to transform data in S3. The job runs successfully but takes longer than expected. The data is in Parquet format and partitioned by date. Which change would most improve performance without increasing cost?

A.Repartition the data by a different column.

B.Convert Parquet to CSV for faster serialization.

C.Increase the number of DPUs for the job.

D.Enable pushdown predicates to filter partitions early.

AnswerD

Reduces data scanned, improving performance.

Why this answer

Option D is correct because enabling pushdown predicates filters partitions at the source, reducing data scanned. Option C is wrong because increasing DPUs increases cost. Option B is wrong because converting to CSV increases data size and scan time.

Option A is wrong because repartitioning may cause shuffling overhead.
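
The effect of a pushdown predicate can be illustrated without Glue: the filter is applied to partition metadata first, so files in excluded partitions are never listed or read. The sketch below is plain Python with hypothetical bucket paths; in a Glue job the equivalent is the push_down_predicate argument of create_dynamic_frame.from_catalog.

```python
def prune_partitions(partitions, predicate):
    """Return only the partitions whose values satisfy the predicate."""
    return [p for p in partitions if predicate(p["values"])]


# Hypothetical catalog entries for a table partitioned by date.
partitions = [
    {"values": {"date": "2024-01-01"}, "path": "s3://example-bucket/sales/date=2024-01-01/"},
    {"values": {"date": "2024-01-02"}, "path": "s3://example-bucket/sales/date=2024-01-02/"},
    {"values": {"date": "2024-01-03"}, "path": "s3://example-bucket/sales/date=2024-01-03/"},
]

# The predicate runs on partition values only, before any data file is opened.
selected = prune_partitions(partitions, lambda v: v["date"] >= "2024-01-02")
paths_to_read = [p["path"] for p in selected]
print(paths_to_read)
```

Filtering after the read would produce the same rows but would still pay to scan every partition, which is the cost the predicate avoids.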

Full explanation →

1599

MCQeasy

A media company stores video files in an Amazon S3 bucket. The bucket policy allows access only from a specific VPC. The company has enabled S3 Server Access Logs to monitor access. Recently, the security team found that some requests were coming from an IP address outside the allowed VPC. They suspect that the bucket policy may have an incorrect condition. What should they check first?

A.Verify that the bucket policy uses the 'aws:SourceVpc' condition key with the correct VPC ID.

B.Review the S3 Server Access Logs to identify the source IP addresses.

C.Ensure that the IAM role used by the application has the correct permissions.

D.Check if the bucket policy allows public access.

AnswerA

The 'aws:SourceVpc' condition key restricts access to requests originating from the specified VPC.

Why this answer

Option A is correct because the two condition keys are easy to confuse: 'aws:SourceVpce' restricts access to a specific VPC endpoint, not the VPC itself. To restrict to a VPC, the policy should use 'aws:SourceVpc' with the correct VPC ID. Option D is wrong because the bucket policy already restricts access; the issue is with the condition key used.

Option B is wrong because S3 Server Access Logs show the source IP, but the condition key is the root cause. Option C is wrong because IAM policies are evaluated in addition to bucket policies, but the bucket policy condition is likely the issue.

Full explanation →

1600

Multi-Selecteasy

A data engineer is designing a data ingestion pipeline that uses AWS Lambda to process records from a Kinesis Data Stream and write to DynamoDB. Which TWO strategies can help handle increased throughput and prevent data loss? (Choose TWO.)

Select 2 answers

A.Configure the Lambda event source mapping with a batch window and set the number of concurrent batches per shard

B.Use synchronous invocation of Lambda from the producer

C.Increase the number of shards in the Kinesis data stream

D.Configure a dead-letter queue (DLQ) for the Lambda function

E.Increase the Lambda function timeout

AnswersA, C

This improves throughput and handles spikes.

Why this answer

Options A and C are correct. A: Configuring the Lambda event source mapping with a batch window and concurrent batches per shard helps manage load. C: Increasing the number of shards allows higher throughput and Lambda concurrency.

B: Synchronous invocation from the producer would block and cause throttling. D: A DLQ captures failures but does not by itself prevent loss; retries should be configured. E: Increasing the Lambda timeout does not increase throughput.
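As a rough sketch of the event source mapping settings discussed above, the function below builds the keyword arguments for Lambda's CreateEventSourceMapping API. The ARNs, function name, and the specific numbers are hypothetical; only the parameter names come from the API.

```python
def kinesis_mapping_settings(stream_arn: str, function_name: str, failure_arn: str) -> dict:
    """Build keyword arguments for Lambda's CreateEventSourceMapping API."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": 500,
        # Batch window: wait up to 5 seconds to fill a batch.
        "MaximumBatchingWindowInSeconds": 5,
        # Concurrent batches per shard (allowed range is 1 to 10).
        "ParallelizationFactor": 4,
        # Retry a failed batch, splitting it to isolate bad records.
        "MaximumRetryAttempts": 3,
        "BisectBatchOnFunctionError": True,
        # Send details of batches that still fail to an SQS queue or SNS topic.
        "DestinationConfig": {"OnFailure": {"Destination": failure_arn}},
    }


settings = kinesis_mapping_settings(
    "arn:aws:kinesis:us-east-1:111111111111:stream/ingest",  # hypothetical
    "process-records",                                       # hypothetical
    "arn:aws:sqs:us-east-1:111111111111:ingest-failures",    # hypothetical
)
print(settings["ParallelizationFactor"])
```

With boto3 installed and credentials configured, these settings could be passed as `boto3.client("lambda").create_event_source_mapping(**settings)`.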

1601

MCQmedium

A data engineer needs to ensure that an S3 bucket is encrypted at rest using AWS KMS. The bucket policy must allow only a specific IAM role to access the bucket and enforce encryption in transit. Which combination of bucket policy statements should be used?

A.Use kms:ViaService: s3.*.amazonaws.com and aws:SecureTransport: true

B.Use s3:x-amz-server-side-encryption: AES256 and aws:SecureTransport: true

C.Use s3:x-amz-server-side-encryption-aws-kms-key-id and aws:SourceIp

D.Use kms:EncryptionContext: service:s3 and aws:SecureTransport: true

AnswerA

This enforces KMS encryption via S3 and TLS.

Why this answer

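Option A is correct because it pairs the kms:ViaService condition, which limits use of the KMS key to requests made through S3, with aws:SecureTransport to require TLS. Option B is wrong because it requires SSE-S3 (AES256), not KMS. Option C is wrong because it has no aws:SecureTransport condition, so it allows access without encryption in transit.

Option D is wrong because an encryption context condition of service:s3 does not tie use of the key to requests made through S3.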

1602

MCQmedium

A company uses AWS Glue for ETL jobs. The data engineer needs to ensure that the Glue job can access an S3 bucket in another account. What is the recommended approach?

A.Create an IAM role in the target account and have the Glue job assume that role

B.Assign an IAM role to the Glue job with permissions to access the bucket, and configure the bucket policy to allow the role

C.Configure the S3 bucket policy to allow the Glue job's IAM role and also set the Glue job's resource-based policy

D.Store AWS access keys for the target account in AWS Secrets Manager and have the Glue job retrieve them

AnswerB

This is the standard cross-account access method using IAM roles.

Why this answer

Option B is correct because the Glue job's IAM role must have permissions to access the S3 bucket, and the bucket policy must allow the role. Option A is wrong because an S3 bucket policy can grant cross-account access without requiring an IAM role in the other account. Option C is wrong because resource-based policies alone are not sufficient; the Glue job needs an IAM role with permissions.

Option D is wrong because long-lived access keys are unnecessary and harder to secure; Glue jobs authenticate with IAM roles.
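The bucket-owner side of this cross-account setup might look like the following sketch. The account IDs, role name, and bucket name are hypothetical; the Glue role would still need a matching identity policy in its own account.

```python
import json

# Hypothetical values: the Glue role lives in account 111111111111,
# the bucket is owned by a different account.
GLUE_ROLE_ARN = "arn:aws:iam::111111111111:role/glue-etl-role"
BUCKET = "shared-data-bucket"


def cross_account_bucket_policy(bucket: str, role_arn: str) -> dict:
    """Bucket policy letting one external role list the bucket and read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListFromGlueRole",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "AllowReadFromGlueRole",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


bucket_policy = cross_account_bucket_policy(BUCKET, GLUE_ROLE_ARN)
print(json.dumps(bucket_policy, indent=2))
```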

1603

MCQeasy

A data engineer needs to store JSON documents that are frequently updated and require ACID transactions. Which AWS database service is most appropriate?

A.Amazon Neptune

B.Amazon DocumentDB

C.Amazon DynamoDB

D.Amazon S3

AnswerC

DynamoDB supports JSON and ACID transactions.

Why this answer

Option C is correct because Amazon DynamoDB supports JSON documents and ACID transactions via DynamoDB Transactions. Option D is wrong because Amazon S3 is not a database and does not support transactions. Option B is wrong because Amazon DocumentDB is a MongoDB-compatible database but ACID transactions are limited.

Option A is wrong because Amazon Neptune is a graph database, not optimized for JSON documents.
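To make the transaction support concrete, here is a minimal sketch of a TransactWriteItems request in which two writes succeed or fail together. The table name, key scheme, and attribute names are hypothetical.

```python
def order_transaction(table: str, order_id: str, customer_id: str) -> dict:
    """Build a TransactWriteItems request: both writes apply, or neither does."""
    return {
        "TransactItems": [
            {
                # Insert the order document, but only if it does not exist yet.
                "Put": {
                    "TableName": table,
                    "Item": {
                        "pk": {"S": f"ORDER#{order_id}"},
                        "doc": {"M": {"status": {"S": "NEW"}, "total": {"N": "42.50"}}},
                    },
                    "ConditionExpression": "attribute_not_exists(pk)",
                }
            },
            {
                # Increment a counter on the customer item in the same transaction.
                "Update": {
                    "TableName": table,
                    "Key": {"pk": {"S": f"CUSTOMER#{customer_id}"}},
                    "UpdateExpression": "ADD order_count :one",
                    "ExpressionAttributeValues": {":one": {"N": "1"}},
                }
            },
        ]
    }


request = order_transaction("app-table", "1001", "77")
print(len(request["TransactItems"]))
```

With boto3 available, the request could be sent as `boto3.client("dynamodb").transact_write_items(**request)`.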

1604

MCQmedium

A data engineering team uses AWS Glue ETL jobs to process data from Amazon S3 and load it into an Amazon Redshift cluster. The cluster has a single node of type dc2.large. The team notices that the ETL jobs are failing intermittently with errors related to disk space. The Redshift cluster shows that the disk is nearly full. The team needs to resolve the disk space issue and ensure the ETL jobs can complete successfully without increasing costs significantly. Which solution should the team implement?

A.Convert the cluster to use RA3 node types (e.g., ra3.xlarge) with managed storage.

B.Load data into a staging table first, then perform a VACUUM and ANALYZE on the target tables.

C.Add more nodes to the Redshift cluster by resizing to a multi-node dc2.large configuration.

D.Set the table's distribution style to ALL for fact tables to avoid data redistribution during joins.

AnswerA

RA3 nodes separate compute from storage, allowing storage to scale independently and cost-effectively.

Why this answer

Option A (converting to an ra3.xlarge node) is correct because RA3 nodes use managed storage, allowing compute and storage to scale independently. This resolves disk space issues without over-provisioning. Option C (adding more dc2.large nodes) increases cost and still has fixed storage per node.

Option D (using diststyle ALL) may improve query performance but does not add disk space. Option B (loading into a staging table and using VACUUM) helps reclaim space but does not address the underlying insufficient storage capacity.

1605

MCQmedium

A company is using Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. The data is delivered in JSON format. The company wants to convert the data to Apache Parquet format before delivery to reduce storage costs and improve query performance. How can this be achieved?

A.Deliver data to S3 as JSON, then use Amazon Athena to convert to Parquet.

B.Use the AWS Glue Data Catalog to define a schema and configure Firehose to use it for Parquet conversion.

C.Write an AWS Lambda function to transform the data to Parquet and deliver it to S3.

D.Configure the Firehose stream to convert data to Parquet automatically without any additional setup.

AnswerB

This is the recommended approach for converting streaming data to Parquet in Firehose.

Why this answer

Option B is correct because Kinesis Data Firehose can convert the input data to Parquet or ORC format using a schema from the AWS Glue Data Catalog. Option D is incorrect because Firehose does not support a built-in transformation to Parquet without a schema. Option C is incorrect because Lambda can be used for custom transformations, but Firehose natively supports Parquet conversion using Glue.

Option A is incorrect because Athena is a query service, not a transformation service.
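The Glue-backed conversion can be sketched as the DataFormatConversionConfiguration block that sits inside a Firehose extended S3 destination. The role ARN, database, table, and Region below are hypothetical placeholders.

```python
def parquet_conversion_config(role_arn: str, database: str, table: str, region: str) -> dict:
    """Build a Firehose DataFormatConversionConfiguration for JSON to Parquet."""
    return {
        "Enabled": True,
        # Firehose reads the column schema from this Glue Data Catalog table.
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
            "Region": region,
        },
        # Incoming records are JSON.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # Outgoing objects are Snappy-compressed Parquet.
        "OutputFormatConfiguration": {
            "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
        },
    }


conversion = parquet_conversion_config(
    "arn:aws:iam::111111111111:role/firehose-delivery",  # hypothetical
    "analytics_db",                                      # hypothetical
    "clickstream",                                       # hypothetical
    "us-east-1",
)
print(sorted(conversion))
```

This dictionary would be supplied as the `DataFormatConversionConfiguration` key of `ExtendedS3DestinationConfiguration` when creating or updating the delivery stream.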

1606

MCQmedium

A company is migrating an on-premises MySQL database to Amazon RDS for MySQL. The database is 500 GB and has a 24/7 uptime requirement. The migration must minimize downtime. Which approach should be used?

A.Take a snapshot of the on-premises database, convert it to a volume, and restore to RDS.

B.Use AWS Database Migration Service (DMS) with ongoing replication to migrate the data.

C.Export the database using mysqldump and import it into RDS using mysql command.

D.Create an RDS MySQL read replica from the on-premises database using native replication.

AnswerB

DMS supports ongoing replication, minimizing downtime by allowing a final cutover after the initial load.

Why this answer

AWS DMS with ongoing replication (change data capture) allows you to perform a full load of the 500 GB database and then continuously replicate changes from the on-premises MySQL source to the Amazon RDS target. This minimizes downtime because you can cut over to RDS in seconds after the target is synchronized, rather than taking the source offline for an extended period.

Exam trap

The trap here is that candidates often choose mysqldump (Option C) because it is a familiar tool, but they overlook the requirement for minimal downtime and the fact that a 500 GB dump/import would take hours, violating the 24/7 uptime requirement.

How to eliminate wrong answers

Option A is wrong because taking a snapshot of an on-premises database and converting it to a volume is not a supported method for migrating to RDS; snapshots are native to AWS block storage and cannot be created directly from an on-premises database. Option C is wrong because a mysqldump export and mysql import of a 500 GB database takes hours, and the source must stop accepting writes (or be read-locked) for that period to keep the copy consistent, which violates the 24/7 uptime requirement. Option D is wrong because an RDS read replica cannot be created directly from an on-premises database; replicating from an external MySQL source into RDS requires seeding the target first and then configuring binary log replication manually, which adds more operational work and cutover risk than DMS.
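A full load followed by ongoing replication corresponds to the DMS migration type `full-load-and-cdc`. The sketch below builds the arguments for a CreateReplicationTask call; every ARN and the task identifier are hypothetical, and the table mapping simply includes all schemas and tables.

```python
import json


def full_load_and_cdc_task(task_id: str, source_arn: str, target_arn: str, instance_arn: str) -> dict:
    """Build keyword arguments for DMS CreateReplicationTask with ongoing replication."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all",
                "object-locator": {"schema-name": "%", "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Initial copy of existing data, then change data capture until cutover.
        "MigrationType": "full-load-and-cdc",
        # The API expects the mappings as a JSON string.
        "TableMappings": json.dumps(table_mappings),
    }


task = full_load_and_cdc_task(
    "mysql-to-rds",                                        # hypothetical
    "arn:aws:dms:us-east-1:111111111111:endpoint:SOURCE",  # hypothetical
    "arn:aws:dms:us-east-1:111111111111:endpoint:TARGET",  # hypothetical
    "arn:aws:dms:us-east-1:111111111111:rep:INSTANCE",     # hypothetical
)
print(task["MigrationType"])
```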

1607

MCQmedium

An IAM policy is attached to an IAM role used by an EC2 instance in the 10.0.0.0/8 VPC. The EC2 instance cannot read objects from the S3 bucket. What is the most likely cause?

A.The policy does not grant s3:ListBucket permission.

B.The S3 bucket has a bucket policy that denies public access, and the IAM policy alone is insufficient.

C.The bucket is encrypted with SSE-KMS and the role does not have kms:Decrypt permission.

D.The EC2 instance's public IP is not in the 10.0.0.0/8 range.

AnswerB

The IAM policy allows access, but if the bucket policy denies all access except from specific principals, the IAM role may still be denied. The bucket policy must explicitly allow the role.

Why this answer

Option B is correct. The exhibited policy is an IAM policy, not a bucket policy, and the scenario treats its aws:SourceIp condition for 10.0.0.0/8 as satisfied by the instance's private IP, so the IAM policy on its own should allow the read. The most likely blocker is therefore on the bucket side: a bucket policy that denies the request, since an explicit deny in a bucket policy overrides an allow in an IAM policy.

Option D (IP outside the range) is not supported by the scenario. Option C (missing kms:Decrypt) is irrelevant because nothing indicates that the bucket uses SSE-KMS.

Option A is wrong because s3:ListBucket is not required for GetObject.

1608

MCQmedium

A data engineer applies the above IAM policy to an IAM user. The user attempts to download an object from the bucket 'example-bucket' that is encrypted with SSE-S3 (AES256). Will the request succeed?

A.Yes, but only if the user also has s3:ListBucket permission.

B.No, because the policy requires the encryption to be specified in the request.

C.Yes, because the object is encrypted with SSE-S3 which uses AES256.

D.No, because the policy does not allow the s3:GetObject action for encrypted objects.

AnswerC

The condition matches the encryption algorithm.

Why this answer

Option C is correct because the condition requires the object to be encrypted with AES256, and SSE-S3 uses AES256. Option B is incorrect because the condition checks the encryption header, not the key type. Option D is incorrect because the condition is satisfied.

Option A is incorrect because s3:ListBucket is not required to download an object.

1609

MCQhard

A company uses AWS Glue to process data from Amazon S3. The data contains personally identifiable information (PII). The data engineer needs to automatically detect and mask PII fields before the data is loaded into Amazon Redshift. Which combination of AWS services should be used?

A.Amazon Macie and AWS Glue

B.Amazon CloudWatch Logs and AWS Lambda

C.Amazon S3 Object Lambda and AWS Glue

D.AWS IAM Access Analyzer and AWS Glue

AnswerA

Macie detects PII, Glue can mask it in the ETL job before writing to Redshift.

Why this answer

Amazon Macie discovers sensitive data, and then AWS Glue can apply transformations to mask the PII before loading into Redshift. CloudWatch Logs is for monitoring, not detection. IAM Access Analyzer is for analyzing resource policies.

S3 Object Lambda can redact data during retrieval but not during Glue ETL.

1610

Multi-Selectmedium

A company uses Amazon S3 to store data for analytics. The data engineer needs to ensure that the S3 bucket is protected against accidental deletion of objects. Which THREE actions should the engineer take? (Choose THREE.)

Select 3 answers

A.Enable server access logging for the S3 bucket.

B.Create an S3 bucket policy that explicitly denies the s3:DeleteObject action.

C.Configure a lifecycle policy to transition objects to Glacier.

D.Enable versioning on the S3 bucket.

E.Enable MFA Delete on the S3 bucket.

AnswersB, D, E

Prevents any user from deleting objects.

Why this answer

Option E is correct because MFA Delete adds an extra layer of protection. Option D is correct because versioning keeps multiple versions, allowing recovery. Option B is correct because a bucket policy denying s3:DeleteObject to all principals prevents any deletion.

Option C is wrong because a lifecycle transition to Glacier changes the storage class but does not protect objects from deletion. Option A is wrong because server access logs are for auditing, not prevention.
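Two of these protections can be sketched as plain configuration documents: a bucket policy that denies s3:DeleteObject, and the versioning configuration. The bucket name is hypothetical, and the sketch only builds the documents rather than applying them.

```python
import json

BUCKET = "analytics-data-bucket"  # hypothetical bucket name


def deny_delete_policy(bucket: str) -> dict:
    """Bucket policy that denies object deletion for every principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyObjectDeletion",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:DeleteObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# Versioning configuration as accepted by the PutBucketVersioning API.
# Adding "MFADelete": "Enabled" turns on MFA Delete; that change must be made
# with the bucket owner's root credentials and an MFA token.
versioning = {"Status": "Enabled"}

delete_policy = deny_delete_policy(BUCKET)
print(json.dumps(delete_policy, indent=2))
```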

1611

MCQeasy

A company needs to store JSON documents that are frequently read and written by a web application. The data must be highly available and durable across multiple Availability Zones. Which AWS database service meets these requirements?

A.Amazon RDS for PostgreSQL

B.Amazon S3

C.Amazon DynamoDB

D.Amazon ElastiCache for Redis

AnswerC

DynamoDB is a fully managed NoSQL database that supports JSON documents and offers multi-AZ durability.

Why this answer

Amazon DynamoDB is a fully managed NoSQL key-value and document database that provides single-digit millisecond performance at any scale. It stores JSON documents natively, supports frequent reads and writes, and offers built-in high availability and durability by automatically replicating data across multiple Availability Zones (AZs) in an AWS Region. This makes it the ideal choice for the described web application workload.

Exam trap

The trap here is that candidates often confuse Amazon S3's high durability and availability with database capabilities, overlooking that S3 is an object store with higher latency and no native query support, while DynamoDB is purpose-built for low-latency, high-throughput document storage with ACID transactions via DynamoDB Transactions.

How to eliminate wrong answers

Option A is wrong because Amazon RDS for PostgreSQL is a relational database that stores data in tables with a fixed schema, not as JSON documents natively, and while it can be deployed in a Multi-AZ configuration for high availability, it does not provide the same level of automatic, seamless scaling and native JSON document support as DynamoDB. Option B is wrong because Amazon S3 is an object storage service, not a database; it can store JSON files but is not designed for frequent, low-latency read/write operations from a web application, and it lacks features like atomic transactions and query capabilities that a database provides. Option D is wrong because Amazon ElastiCache for Redis is an in-memory cache, not a durable database; while it can store JSON documents using the RedisJSON module, data is primarily stored in memory and is not durable by default across AZs, making it unsuitable for the durability and persistence requirements of a primary data store.

1612

Multi-Selectmedium

A healthcare company stores sensitive patient data in an S3 bucket (bucket name: patient-data-prod). The security team requires that all data be encrypted in transit and at rest, and that access be logged for auditing. The company currently uses S3 default encryption with SSE-S3. An external auditor finds that some objects have been uploaded without encryption because the default encryption setting was not applied to objects uploaded before the setting was enabled. The company wants to prevent any future unencrypted uploads and ensure all existing objects are encrypted. Which combination of actions should the data engineer take? (Choose TWO.)

Select 2 answers

A.Use S3 Batch Operations to copy all existing objects in place with the 'aws:Replicate' operation to apply default encryption.

B.Enable S3 Object Ownership and set the bucket ACL to private.

C.Enable S3 default encryption on the bucket.

D.Create a bucket policy that denies s3:PutObject if the x-amz-server-side-encryption-aws-kms-key-id is not present.

E.Create a bucket policy that denies s3:PutObject if the x-amz-server-side-encryption header is not set to 'AES256'.

AnswersA, E

Batch Operations can apply encryption to existing objects by copying them in place.

Why this answer

To prevent unencrypted uploads, the bucket policy must deny PutObject requests that do not include the x-amz-server-side-encryption header with AES256 (Option E). To encrypt existing objects, S3 Batch Operations can copy them in place with the default encryption setting applied (Option A). Option E alone only covers new objects, not existing ones, which is why both actions are needed.

Option D is unnecessary because the bucket uses SSE-S3 (AES256), not a KMS key. Option C is already in place, and Option B does not affect encryption.
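A deny-unencrypted-uploads policy for this bucket could look like the sketch below. It uses the bucket name from the question; the two-statement layout (wrong header value, missing header) is one common pattern, offered here as an illustration rather than the only valid form.

```python
import json


def require_sse_s3_policy(bucket: str) -> dict:
    """Deny PutObject unless the request asks for SSE-S3 (AES256)."""
    resource = f"arn:aws:s3:::{bucket}/*"
    header_key = "s3:x-amz-server-side-encryption"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Header present but set to something other than AES256.
                "Sid": "DenyWrongEncryptionHeader",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": resource,
                "Condition": {"StringNotEquals": {header_key: "AES256"}},
            },
            {
                # Header missing from the request entirely.
                "Sid": "DenyMissingEncryptionHeader",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": resource,
                "Condition": {"Null": {header_key: "true"}},
            },
        ],
    }


sse_policy = require_sse_s3_policy("patient-data-prod")
print(json.dumps(sse_policy, indent=2))
```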

1613

MCQeasy

A company runs a MySQL database on Amazon RDS. The database size is 500 GB and is experiencing high read traffic. The team wants to improve read performance with minimal operational overhead. Which action should they take?

A.Create a read replica in the same region

B.Enable Multi-AZ deployment

C.Implement Amazon ElastiCache for caching

D.Upgrade to a larger instance class

AnswerA

Read replicas offload read traffic with minimal operational overhead.

Why this answer

Option A is correct because adding a read replica offloads read traffic from the primary with minimal overhead. Option B (Multi-AZ) is for high availability, not read performance. Option D (larger instance) increases cost but does not specifically address the read workload.

Option C (ElastiCache) adds caching but requires application changes.

1614

MCQmedium

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon Aurora MySQL. The migration is successful, but the ongoing replication task is experiencing high latency. Which configuration change is most likely to reduce latency?

A.Increase the size of the DMS replication instance.

B.Decrease the task's batch size and batch apply timeout.

C.Change the target endpoint to Amazon S3.

D.Enable Change Data Capture (CDC) from binary logs.

AnswerA

A larger instance provides more resources to process change data capture (CDC) faster.

Why this answer

Option A is correct because increasing the DMS replication instance size provides more CPU and memory, which can process changes faster. Option C (S3 as target) is not applicable because the target is Aurora. Option D (CDC from binary logs) applies to a MySQL source, and the source here is Oracle.

Option B (decrease batch size) would likely increase latency.

1615

MCQmedium

A data engineer is troubleshooting a Lambda function that reads from the Kinesis stream 'my-data-stream'. The Lambda function is able to read data but occasionally fails with 'KMS.AccessDeniedException'. What is the most likely cause?

A.The Lambda function's execution role does not have kms:Decrypt permission for the KMS key.

B.The retention period is too short; increase it.

C.The stream has too few shards; increase shard count.

D.The Lambda function is not authorized to consume from Kinesis streams.

AnswerA

Kinesis uses KMS for encryption; consumers need decrypt permission.

Why this answer

The stream uses KMS encryption. The Lambda function's IAM role likely lacks permission to decrypt using the KMS key. Option A is correct.

Option C is wrong because the stream has 1 shard, which is fine. Option D is wrong because Lambda can read from the stream. Option B is wrong because the retention period is 24 hours, which is fine.

1616

MCQmedium

A data engineer notices that an AWS Glue job writing to Amazon S3 in Parquet format creates many small files (less than 1 MB each). This leads to poor query performance in Amazon Athena. What is the BEST way to reduce the number of output files?

A.Enable 'groupFiles' in the Glue job's S3 target configuration.

B.Use 'coalesce(1)' at the end of the ETL script.

C.Use 'repartition(100)' to increase parallelism.

D.Configure an S3 lifecycle policy to delete small files.

AnswerA

Glue's groupFiles option merges small files during write.

Why this answer

Option A is correct because enabling 'groupFiles' in Glue coalesces small files into larger ones. Option B is wrong because using 'coalesce(1)' may cause a single-executor bottleneck or an out-of-memory error. Option C is wrong because 'repartition(100)' increases the number of partitions, making the problem worse.

Option D is wrong because S3 lifecycle policies only manage object lifecycle, not file creation.

1617

MCQmedium

A data engineer is troubleshooting an AWS Glue job that is failing with an Access Denied error when trying to read data from an S3 bucket. The IAM policy attached to the Glue job's IAM role is shown in the exhibit. What is the likely cause of the failure?

A.The policy does not include s3:GetObject or s3:PutObject permissions.

B.The policy does not include glue:StartJobRun permission.

C.The policy does not include s3:ListBucket permission on the bucket.

D.The policy does not include glue:GetJobRun permission.

AnswerC

Glue needs s3:ListBucket to enumerate objects in the bucket before reading.

Why this answer

Option C is correct because the policy only allows s3:GetObject on the objects in the bucket, not s3:ListBucket on the bucket itself. Glue needs s3:ListBucket permission to list objects in the bucket before reading them. Option A is incorrect because the policy does allow s3:GetObject and s3:PutObject.

Option B is incorrect because the policy allows glue:StartJobRun, and the issue is S3 access. Option D is incorrect because the policy allows glue:GetJobRun, and the issue is S3 access.
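The distinction between bucket-level and object-level permissions is easiest to see in the resource ARNs. The following sketch builds an identity policy with both; the bucket name is hypothetical and the exhibit's actual policy is not reproduced here.

```python
import json


def glue_s3_access_policy(bucket: str) -> dict:
    """Identity policy granting list access on the bucket and read/write on its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: the resource is the bucket ARN itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level actions: the resource is the objects under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


glue_policy = glue_s3_access_policy("etl-input-bucket")  # hypothetical bucket name
print(json.dumps(glue_policy, indent=2))
```

Attaching s3:ListBucket to the `/*` object ARN instead of the bucket ARN is a frequent mistake, and it leaves listing denied.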

1618

MCQeasy

A company uses Amazon Athena to query data in S3. Recently, queries have become slow. The data is stored as CSV files in a partitioned table. What is the most effective way to improve query performance?

A.Increase the number of nodes in the Athena query engine.

B.Convert the data to Parquet format and optimize partitioning.

C.Convert the data to JSON format.

D.Increase the size of the CSV files to reduce the number of files.

AnswerB

Parquet is columnar and compressed, improving scan efficiency.

Why this answer

Option B is correct because converting to Parquet and optimizing partitioning improves compression, columnar scanning, and partition pruning. Option D is wrong because increasing file size alone may not help; large CSV files still require full scans. Option C is wrong because converting to JSON would likely worsen performance.

Option A is wrong because Athena is serverless and manages its resources automatically; there are no nodes for the user to add.

1619

MCQmedium

A data engineer notices that an Amazon Redshift cluster is running low on disk space. The cluster has three nodes of type dc2.large. Which action will increase the available storage capacity?

A.Increase the number of nodes in the cluster.

B.Mount an Amazon S3 bucket as a file system to store data.

C.Change the volume type to Provisioned IOPS SSD (io1) to increase capacity.

D.Enable automatic compression on the tables.

AnswerA

Adding nodes increases total storage capacity.

Why this answer

Option A is correct because each dc2.large node comes with a fixed amount of local storage, so increasing the number of nodes adds storage capacity. Option C is incorrect because Redshift does not expose volume types such as Provisioned IOPS SSD.

Option B is incorrect because an S3 bucket cannot be attached to the cluster as a file system.

Option D is incorrect because compression reduces data size but does not add capacity.

1620

MCQmedium

Refer to the exhibit. A data engineer is creating an IAM policy for an application that sends data to a Kinesis stream and stores processed data in S3. The policy is attached to an IAM role used by an EC2 instance. The application fails to write to S3 with an access denied error. What is the cause?

A.The policy does not allow s3:ListBucket on the bucket.

B.The IAM role is not attached to the EC2 instance profile.

C.The policy does not allow kinesis:PutRecord on the stream.

D.The EC2 instance does not have an internet gateway to reach S3.

AnswerA

Some operations require ListBucket permission; without it, the SDK may fail.

Why this answer

Option A is correct according to the answer key. The policy allows s3:PutObject on the objects inside the bucket, which is normally sufficient to write an object, but it does not grant s3:ListBucket on the bucket itself. Some SDK operations check the bucket before writing, and without s3:ListBucket those calls fail with an access denied error.

Option B is wrong because the question states that the role is used by the EC2 instance. Option C is wrong because the failure occurs on the S3 write, not on the Kinesis put. Option D is wrong because a missing network path would produce a connection timeout rather than an access denied error.

1621

MCQmedium

A company is ingesting streaming data into Kinesis Data Streams. The consumer application experiences high latency due to a single shard bottleneck. What is the most effective way to reduce latency?

A.Increase the number of shards in the data stream.

B.Wait for automatic scaling to add shards.

C.Use the Kinesis Client Library (KCL) to process records.

D.Switch to Amazon Kinesis Data Firehose.

AnswerA

More shards increase parallelism and throughput, reducing latency.

Why this answer

Increasing the number of shards increases throughput and reduces latency. Waiting for autoscaling is passive, using KCL is for processing, and switching to Firehose changes the architecture.

1622

MCQmedium

A data engineer is designing a data lake on Amazon S3. Data is ingested from multiple sources in JSON format. The engineer needs to optimize query performance for Amazon Athena while minimizing storage costs. Which storage strategy should the engineer use?

A.Store data as CSV files in a single S3 bucket without prefixes.

B.Convert data to Parquet format and partition by date.

C.Store data as JSON files in a single prefix without partitioning.

D.Store compressed JSON files in Amazon S3 Glacier.

AnswerB

Parquet is columnar and compressed; partitioning improves query performance.

Why this answer

Parquet is a columnar format that reduces storage size and improves query performance in Athena. Partitioning by date further optimizes queries that filter by date. Option B is correct.

Option C: storing as raw JSON with no partitioning leads to higher costs and slower queries. Option D: using Glacier for hot data adds retrieval latency and is not suitable for frequent queries. Option A: storing CSV in a single bucket with no structure causes full scans.
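One way to perform the conversion is an Athena CREATE TABLE AS SELECT (CTAS) statement. The helper below only assembles the SQL text; the table names, S3 location, and partition column are hypothetical, and it assumes the partition column is the last column of the source table, as CTAS requires partition columns to come last in the SELECT list.

```python
def parquet_ctas(source_table: str, target_table: str, location: str, partition_column: str) -> str:
    """Build an Athena CTAS statement that rewrites a table as partitioned Parquet."""
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  external_location = '{location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        ")\n"
        # Assumes partition_column is the final column of source_table.
        f"AS SELECT * FROM {source_table}"
    )


sql = parquet_ctas(
    "events_json",                               # hypothetical source table
    "events_parquet",                            # hypothetical target table
    "s3://example-data-lake/events_parquet/",    # hypothetical location
    "dt",                                        # hypothetical date partition column
)
print(sql)
```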

1623

Multi-Selecthard

Which THREE considerations are important when designing a data pipeline that uses AWS Glue to process streaming data from Amazon Kinesis Data Streams? (Choose 3.)

Select 3 answers

A.Set the number of Glue workers to match the number of shards for optimal parallelism

B.Ensure the Kinesis stream has enough shards to handle the expected record rate

C.Configure checkpointing to prevent data loss on failure

D.Use batch window to accumulate data before processing

E.Convert data to Avro format for better compression

AnswersA, B, C

Each worker can consume one shard.

Why this answer

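A, B, and C are correct. A: Matching the number of workers to the number of shards provides parallelism. B: Sufficient shards are needed for throughput.

C: Checkpointing lets the job resume from its last processed position after a failure, preventing data loss. D (batch window) describes batch accumulation rather than streaming. E (data format) is not Glue-specific for streaming.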

1624

MCQeasy

A company runs a nightly batch processing pipeline using AWS Glue ETL jobs. The pipeline reads data from an Amazon S3 bucket, transforms it, and writes results to an Amazon Redshift cluster. Recently, the data volume has increased significantly, and some Glue jobs are failing with the error 'java.lang.OutOfMemoryError: Java heap space'. The data engineer needs to modify the job configuration to prevent these failures without changing the code. The job currently uses 10 DPUs and processes data in a single Spark DataFrame. Which of the following is the MOST effective solution?

A.Reduce the number of DPUs to 5 and increase the Spark executor memory by setting 'spark.executor.memory' in job parameters.

B.Increase the number of DPUs to 20 and enable job bookmarking for incremental processing.

C.Change the script to use DynamicFrame instead of DataFrame and disable the 'spark.sql.shuffle.partitions' configuration.

D.Add a 'coalesce(1)' operation before writing to Redshift to reduce the number of output files.

AnswerB

More DPUs increase total available memory; job bookmarking reduces data volumes by processing only new data.

Why this answer

Increasing DPUs from 10 to 20 provides more memory and compute resources, directly addressing the 'java.lang.OutOfMemoryError: Java heap space' caused by insufficient memory for the single DataFrame. Enabling job bookmarking allows incremental processing, which reduces the volume of data processed per run, further mitigating memory pressure without code changes.

Exam trap

The trap here is that candidates may think reducing DPUs or using coalesce reduces memory usage, but in reality, both actions increase memory pressure on individual executors, making OOM errors more likely.

How to eliminate wrong answers

Option A is wrong because reducing DPUs to 5 would decrease available memory, worsening the OOM error, and increasing 'spark.executor.memory' without more DPUs cannot compensate for the overall resource reduction. Option C is wrong because changing to DynamicFrame does not inherently reduce memory usage; disabling 'spark.sql.shuffle.partitions' may cause imbalanced partitions and does not address the heap space issue. Option D is wrong because 'coalesce(1)' forces all data into a single partition, which increases memory pressure on that executor and can trigger or worsen OOM errors.

1625

MCQhard

A company is ingesting streaming data from social media feeds using Amazon Kinesis Data Streams. The data volume peaks at 10,000 records per second, and each record is up to 1 KB. The company needs to archive the raw data in Amazon S3 in near real-time and also make it available for real-time analytics using Amazon Kinesis Data Analytics. What is the MOST efficient architecture to meet these requirements?

A.Use Kinesis Data Streams as the ingestion point. Use Kinesis Data Firehose to read from the stream, convert to Parquet, and write to S3. Use a Lambda function to send data to Kinesis Data Analytics.

B.Use Kinesis Data Streams as the ingestion point. Use a Lambda function to read from the stream, write to S3, and send data to Kinesis Data Analytics.

C.Use two Kinesis Data Streams: one for S3 delivery and one for Kinesis Data Analytics.

D.Use Kinesis Data Streams as the ingestion point. Use Kinesis Data Firehose to read from the stream and write to S3. Use Kinesis Data Analytics to read directly from the same stream.

AnswerD

Firehose can read from the stream and write to S3; Kinesis Data Analytics can read from the same stream for real-time analytics.

Why this answer

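Option D is correct because Kinesis Data Firehose can read from a Kinesis Data Stream and write to S3, while Kinesis Data Analytics reads from the same stream for real-time analytics. Option A is incorrect because the requirement is to archive the raw data, so Parquet conversion is unnecessary (and Firehose cannot convert to Parquet without a schema); the Lambda function is also avoidable because Kinesis Data Analytics can read the stream directly. Option B is incorrect because a custom Lambda function adds cost and complexity.

Option C is incorrect because two separate streams are wasteful; a single stream can serve both consumers.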
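As a rough sizing check for the stream in this scenario, the sketch below computes a minimum shard count from the per-shard write limits of provisioned mode (1 MB per second or 1,000 records per second). Treating 1 MB as 1,048,576 bytes and 1 KB as 1,024 bytes is an assumption, and the function name `min_shards` is illustrative only.

```python
import math

# Per-shard write limits for Kinesis Data Streams (provisioned mode).
# Assumption: 1 MB is treated as 1,048,576 bytes.
SHARD_MAX_BYTES_PER_SEC = 1024 * 1024
SHARD_MAX_RECORDS_PER_SEC = 1_000


def min_shards(records_per_sec: int, record_bytes: int) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_records = math.ceil(records_per_sec / SHARD_MAX_RECORDS_PER_SEC)
    by_bytes = math.ceil(records_per_sec * record_bytes / SHARD_MAX_BYTES_PER_SEC)
    return max(by_records, by_bytes)


# Peak load from the question: 10,000 records/s, each up to 1 KB.
peak_shards = min_shards(records_per_sec=10_000, record_bytes=1_024)
print(peak_shards)
```

Under these assumptions the peak load needs about 10 shards, and both Firehose and Kinesis Data Analytics can consume from that one stream.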

Full explanation →

1626

Multi-Selecteasy

Which TWO AWS services can be used to automatically back up an Amazon RDS for SQL Server DB instance? (Choose TWO.)

Select 2 answers

A.AWS Database Migration Service (DMS)

B.AWS Data Pipeline

C.Amazon RDS automated backups

D.Amazon S3

E.AWS Backup

AnswersC, E

RDS provides automated backups by default.

Why this answer

RDS automated backups are enabled by default and retain backups for up to 35 days. AWS Backup is a centralized backup service that can manage RDS backups with custom policies and retention. Options C and E are correct.

Option A: DMS is for migration, not backup. Option B: Data Pipeline can orchestrate backups but is not the primary automatic backup service. Option D: S3 is storage, not an automatic backup service.

Full explanation →

1627

MCQhard

A team manages an Amazon DynamoDB table with on-demand capacity. Recently, they noticed increased throttling errors during peak hours. The table has a Lambda trigger that processes changes and writes to an S3 bucket. Which design change would BEST reduce throttling?

A.Switch the table to provisioned capacity and enable auto-scaling.

B.Increase the write capacity units to handle the peak load.

C.Enable S3 bucket versioning to reduce the number of writes.

D.Implement DynamoDB Accelerator (DAX) to cache frequent reads.

AnswerD

DAX reduces read load on the table, lowering throttling.

Why this answer

Option D is correct because DynamoDB Accelerator (DAX) provides an in-memory cache that reduces the number of read requests hitting the table, which can alleviate throttling during peak hours. The question describes throttling errors, which are typically caused by exceeding the table's read or write capacity; DAX offloads read traffic, reducing the load on the table and thus decreasing throttling events.

Exam trap

The trap here is that candidates may assume throttling is always due to insufficient write capacity, but the question's context of a Lambda trigger writing to S3 can increase read traffic (e.g., via stream processing or re-reading items), making DAX a read-side solution that addresses the actual cause.

How to eliminate wrong answers

Option A is wrong because switching to provisioned capacity with auto-scaling does not address the root cause of throttling under on-demand capacity, which already scales automatically; throttling in on-demand mode is usually due to exceeding the table's per-partition throughput limits or burst capacity, not capacity mode. Option B is wrong because increasing write capacity units is not applicable to on-demand capacity, which does not use provisioned write capacity units; on-demand tables automatically scale, and throttling is not resolved by manually setting a capacity that doesn't exist in that mode. Option C is wrong because enabling S3 bucket versioning increases the number of writes (by storing multiple versions of objects) rather than reducing them, and it does not affect DynamoDB throttling.

Full explanation →

1628

MCQeasy

A company uses AWS Lake Formation to manage data lake permissions. A data engineer needs to grant an IAM role 'Read' access to a specific database and all its tables in the Data Catalog. What is the MOST efficient way to achieve this?

A.Grant 'Super' permission on the Data Catalog

B.Add the IAM role to the Lake Formation administrators group

C.Grant 'Select' on the database and select 'Include' to apply to all tables

D.Grant 'Describe' on the database and 'Select' on each table individually

AnswerC

This grants read access to all tables in one operation.

Why this answer

Lake Formation allows granting permissions on the database with 'Include' option to apply to all tables. This is the most efficient method for granting access to all tables.
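For readers who script permissions instead of using the console, a single grant covering every table in a database might look like the sketch below. It only builds the request; the role ARN and database name are hypothetical, and the commented call assumes boto3 with configured credentials.

```python
def build_grant_request(role_arn: str, database: str) -> dict:
    """Build keyword arguments for one Lake Formation grant on all tables."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "Table": {
                "DatabaseName": database,
                "TableWildcard": {},  # wildcard: every table in the database
            }
        },
        "Permissions": ["SELECT"],
    }


request = build_grant_request(
    role_arn="arn:aws:iam::111122223333:role/AnalystRole",  # hypothetical role
    database="sales_db",  # hypothetical database name
)

# With credentials configured, the request could be sent like this:
# import boto3
# boto3.client("lakeformation").grant_permissions(**request)
```

Using the table wildcard avoids issuing one grant per table, which is the efficiency the question is testing.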

Full explanation →

1629

Multi-Selectmedium

A company needs to ingest streaming data from an existing Amazon Kinesis Data Streams into Amazon S3 with partitioning by date. Which TWO services can accomplish this with minimal coding? (Choose two.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.AWS Glue Streaming

C.AWS Lambda

D.Amazon Kinesis Data Analytics

E.Amazon S3 Transfer Acceleration

AnswersA, C

Firehose can read from a Kinesis stream and deliver to S3 with partitioning.

Full explanation →

1630

MCQeasy

A data engineer needs to move data from an Amazon S3 bucket to an Amazon Redshift cluster on a daily schedule. The data is in CSV format and the target table already exists. Which AWS service should the engineer use to automate this task?

A.AWS Glue

B.Amazon Athena

C.Amazon EMR

D.Amazon Kinesis Data Analytics

AnswerA

Glue provides job scheduling and ETL capabilities.

Why this answer

Option A is correct because AWS Glue can run ETL jobs that copy data from S3 to Redshift on a schedule. Option B is wrong because Amazon Athena is a query service, not an ETL scheduler. Option C is wrong because Amazon EMR is a big data platform that requires more setup for simple scheduling.

Option D is wrong because Amazon Kinesis Data Analytics is for real-time streaming data.

Full explanation →

1631

MCQhard

A data engineering team is designing a data lake on Amazon S3. They need to store raw data in a format that supports schema evolution and is optimized for analytics with Amazon Athena. Which storage format should they use?

A.Parquet

B.CSV

C.Avro

D.JSON

AnswerA

Parquet is columnar, supports schema evolution, and is optimized for Athena.

Why this answer

Option A is correct because Parquet is a columnar format that supports schema evolution and is optimized for Athena. Option B (CSV) does not support schema evolution and is less efficient. Option C (Avro) is row-based and not as efficient for columnar queries.

Option D (JSON) is text-based and not optimized.

Full explanation →

1632

Multi-Selecthard

A data engineer is designing a streaming ingestion pipeline using Amazon Kinesis Data Streams with multiple consumers. The data must be processed by a Lambda function for real-time alerts and also stored in Amazon S3 for historical analysis. Which THREE components are needed to implement this architecture? (Choose THREE.)

Select 3 answers

A.Amazon Kinesis Data Firehose

B.Amazon Kinesis Data Streams

C.AWS Lambda function

D.Amazon Kinesis Data Analytics

E.Amazon SQS queue

AnswersA, B, C

Kinesis Data Firehose reads from the stream and delivers to S3.

Why this answer

Kinesis Data Streams is the ingestion layer. Lambda can be a consumer for real-time alerts. Kinesis Data Firehose can read from the stream and write to S3.

Options A, B, and C are correct.

Option D is wrong because Kinesis Data Analytics is for running SQL or Flink applications, which this pattern does not need. Option E is wrong because SQS is for decoupling and is not needed for this pattern.

Full explanation →

1633

MCQhard

A company runs a real-time analytics platform that ingests data from thousands of sensors via Amazon Kinesis Data Streams. Each sensor sends a JSON payload every second. The data is consumed by a fleet of EC2 instances running a custom consumer application. Recently, the consumer has been falling behind, with the iterator age exceeding 10 minutes. The company has already increased the number of shards to 100, but the problem persists. The consumer application is single-threaded per shard and uses the Kinesis Client Library (KCL). The CPU utilization on the EC2 instances is below 30%. What should the data engineer do to reduce the iterator age?

A.Increase the number of shards to 200

B.Use larger EC2 instances with more vCPUs

C.Modify the consumer to use multiple worker threads per shard

D.Replace the EC2 consumer with AWS Lambda functions

AnswerC

Increases processing parallelism within each shard.

Why this answer

Option C is correct because the consumer is not processing data fast enough due to being single-threaded per shard; with CPU utilization below 30%, the instances have headroom, so using multiple worker threads per shard can increase processing throughput. Option A (more shards) has already been tried without success. Option B (larger instances) is unlikely to help because CPU utilization is already low.

Option D (Lambda) may not handle the volume efficiently and adds complexity.
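The idea of adding worker threads inside a shard's record processor can be sketched in plain Python as follows. This is not KCL code: `handle_record` is a hypothetical stand-in for the per-record work, and the batch is made-up sample data.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_record(record: dict) -> str:
    """Hypothetical per-record work, e.g. a slow call to a downstream system."""
    return f"{record['sensor_id']}:{record['value']}"


def process_batch(records: list[dict], workers: int = 8) -> list[str]:
    """Process one shard's batch in parallel threads.

    pool.map returns results in input order, although the records
    themselves may be handled concurrently and finish out of order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_record, records))


batch = [{"sensor_id": f"s{i}", "value": i * 10} for i in range(5)]
results = process_batch(batch)
print(results)
```

The trade-off is that side effects are no longer applied strictly in shard order, so a design like this would normally checkpoint only after the whole batch has completed.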

Full explanation →

1634

MCQeasy

A data pipeline ingests streaming data from thousands of IoT devices into Kinesis Data Streams. The data must be transformed using a simple field mapping before being stored in S3. Which service should be used to perform the transformation with minimal operational overhead?

A.AWS Lambda function invoked by the Kinesis stream

B.AWS Glue ETL job

C.Kinesis Data Analytics

D.Kinesis Data Firehose with a Lambda transformation

AnswerD

Firehose can invoke a Lambda function for simple transformations before delivery.

Why this answer

Option D is correct because Kinesis Data Firehose can perform simple transformations using a Lambda function before delivering data to S3. Option B is wrong because Glue ETL is better suited to complex batch transformations. Option C is wrong because Kinesis Data Analytics is for real-time analytics, not simple field mapping.

Option A is wrong because Lambda alone would require managing the delivery to S3.
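A Firehose transformation function receives base64-encoded records and must return each one with its `recordId`, a `result` status, and re-encoded `data`. The sketch below shows a simple field mapping in that shape; the field names in `FIELD_MAP` are hypothetical, and JSON payloads are assumed.

```python
import base64
import json

# Hypothetical mapping from source field names to target field names.
FIELD_MAP = {"dev": "device_id", "temp": "temperature_c"}


def lambda_handler(event, context):
    """Rename fields in each Firehose record and return the transformed batch."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        mapped = {FIELD_MAP.get(key, key): value for key, value in payload.items()}
        # A trailing newline keeps the delivered S3 objects line-delimited.
        encoded = base64.b64encode((json.dumps(mapped) + "\n").encode("utf-8"))
        output.append(
            {
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            }
        )
    return {"records": output}
```

Firehose handles buffering, retries, and delivery to S3, so the function only has to map fields.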

Full explanation →

1635

MCQeasy

A company is using Amazon S3 to store critical data and needs to ensure that objects are automatically transitioned to S3 Glacier Deep Archive after 180 days to reduce costs. Which S3 lifecycle action should be configured?

A.Expiration

B.Transition

C.AbortIncompleteMultipartUpload

D.NoncurrentVersionTransition

AnswerB

Transition moves objects to another storage class based on age.

Why this answer

Option B is correct because the S3 lifecycle 'Transition' action is specifically designed to move objects between storage classes after a specified number of days. To reduce costs by moving objects to S3 Glacier Deep Archive after 180 days, you configure a lifecycle rule with a Transition action that targets the 'DEEP_ARCHIVE' storage class at the 180-day mark.

Exam trap

The trap here is that candidates often confuse 'Expiration' (deletion) with 'Transition' (storage class change), or incorrectly apply 'NoncurrentVersionTransition' when the question does not mention versioning or noncurrent versions.

How to eliminate wrong answers

Option A is wrong because 'Expiration' is used to permanently delete objects after a set period, not to transition them to a different storage class. Option C is wrong because 'AbortIncompleteMultipartUpload' is used to clean up incomplete multipart uploads after a specified number of days, not to transition objects between storage classes. Option D is wrong because 'NoncurrentVersionTransition' applies only to noncurrent versions of versioned objects, not to current versions, and the question does not specify versioning or noncurrent versions.
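A lifecycle rule with a Transition action for this scenario could be expressed as the configuration below. This is a minimal sketch: the rule ID and bucket name are hypothetical, and the commented call assumes boto3 with configured credentials.

```python
def deep_archive_rule(days: int = 180, prefix: str = "") -> dict:
    """Build a lifecycle configuration with one Transition rule to Deep Archive."""
    return {
        "Rules": [
            {
                "ID": f"to-deep-archive-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches all objects
                "Transitions": [{"Days": days, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    }


lifecycle = deep_archive_rule()

# With credentials configured, the rule could be applied like this:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket",  # hypothetical bucket
#     LifecycleConfiguration=lifecycle,
# )
```

An Expiration action would appear as a separate `Expiration` key in the rule, which is why the two are easy to tell apart in configuration even if they are confused on the exam.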

Full explanation →

1636

MCQeasy

A data engineer needs to ingest data from an external FTP server into S3 on a schedule. The FTP server is only accessible via VPN. Which AWS service is best suited for this task?

A.AWS Transfer Family

B.AWS Snowcone

C.AWS Glue with a Python shell

D.AWS DataSync

AnswerA

Supports FTP and integrates with VPN.

Why this answer

Option A is correct because AWS Transfer Family supports FTP over VPN. Option D is wrong because DataSync requires network connectivity but does not support the FTP protocol. Option C is wrong because Glue can read from FTP but requires custom connectors.

Option B is wrong because Snowcone is for offline data transfer.

Full explanation →

1637

MCQeasy

A startup is building a ride-sharing application that uses Amazon DynamoDB to store trip data. The table has a partition key of 'trip_id' and a sort key of 'status'. The application writes a new item when a trip starts and updates the status when the trip ends. The development team is experiencing high write latency during peak hours. The table is provisioned with 5,000 write capacity units (WCU) and 5,000 read capacity units (RCU). CloudWatch metrics show that WriteThrottleEvents are occurring frequently, but the consumed write capacity is never above 4,000 WCU. The team suspects that the issue is due to hot partitions. How should the data engineer resolve this issue?

A.Modify the application to add a random suffix to the partition key when writing items.

B.Enable DynamoDB Accelerator (DAX) to cache write operations.

C.Decrease the provisioned RCU to 2,000 to reduce costs.

D.Increase the provisioned WCU to 10,000 to handle the spikes.

AnswerA

Adding random suffix distributes writes across multiple partitions, reducing hot spots.

Why this answer

Option A is correct because adding a random suffix to the partition key distributes writes across partitions, avoiding hot spots. Option D is incorrect because increasing WCU does not solve the hot partition issue; throttling happens at the partition level, which is why writes are throttled even though consumed capacity stays below the provisioned 5,000 WCU. Option B is incorrect because DAX is a cache for reads, not writes.

Option C is incorrect because decreasing RCU is unrelated to write throttling.
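Write sharding with a random suffix can be sketched as below. The shard count of 10, the `#` separator, and the helper names are assumptions for illustration; readers have to query every suffix and merge the results, which is the cost of this pattern.

```python
import random

SHARD_COUNT = 10  # assumption: number of suffixes to spread writes across


def sharded_key(base_key: str, shard_count: int = SHARD_COUNT) -> str:
    """Return the partition key to use for a write, with a random suffix."""
    return f"{base_key}#{random.randrange(shard_count)}"


def all_shard_keys(base_key: str, shard_count: int = SHARD_COUNT) -> list[str]:
    """Return every possible suffixed key, for readers that must fan out."""
    return [f"{base_key}#{n}" for n in range(shard_count)]


write_key = sharded_key("trip-123")      # e.g. "trip-123#4"
read_keys = all_shard_keys("trip-123")   # "trip-123#0" through "trip-123#9"
```

When items must be read back individually, a suffix calculated from an attribute of the item is a common alternative to a purely random one, because the reader can recompute it.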

Full explanation →

1638

MCQmedium

A company uses Amazon Redshift for data warehousing. The data engineer notices that query performance has degraded over time. The tables are frequently updated with new data, and the data engineer suspects that the distribution style is causing data skew. Which distribution style should the data engineer use to minimize data skew?

A.KEY distribution on a column with high cardinality

B.AUTO distribution

C.ALL distribution

D.EVEN distribution

AnswerD

Distributes rows evenly, ideal for preventing skew.

Why this answer

Option D is correct because EVEN distribution distributes rows across slices evenly, minimizing skew when no good distribution key exists. Option A is wrong because KEY distribution can cause skew if the key is not unique. Option C is wrong because ALL distribution duplicates the table, which is not efficient for large tables.

Option B is wrong because AUTO lets Redshift choose, but may not minimize skew if the key is poorly chosen.

Full explanation →

1639

MCQeasy

A company needs to store archival logs that must be retained for 10 years. The logs are accessed infrequently, but when accessed, retrieval must occur within 12 hours. Which storage class is MOST cost-effective?

A.Amazon S3 Glacier Deep Archive

B.Amazon S3 Intelligent-Tiering

C.Amazon S3 Standard

D.Amazon S3 One Zone-Infrequent Access

AnswerA

Glacier Deep Archive provides the lowest storage cost with retrieval times up to 12 hours.

Why this answer

Option A is correct because S3 Glacier Deep Archive is the lowest-cost storage class with retrieval times within 12 hours. Option C is wrong because S3 Standard is expensive for long-term archival. Option B is wrong because S3 Intelligent-Tiering is designed for unknown access patterns, not pure archival.

Option D is wrong because S3 One Zone-IA stores data in a single Availability Zone, making it less resilient, and it costs more per GB than Glacier Deep Archive over a 10-year retention period.

Full explanation →

1640

MCQmedium

A data engineer is migrating an on-premises Apache Hive data warehouse to Amazon EMR. The warehouse contains partitioned tables stored in HDFS. The engineer wants to use Amazon S3 as the storage layer for the EMR cluster. What is the MOST important consideration for maintaining query performance on S3?

A.Ensure that the table partitions are organized in a way that minimizes S3 LIST requests

B.Configure EMR to use HDFS for storage instead of S3 for better performance

C.Use DynamoDB as the Hive metastore to improve metadata access

D.Use Amazon Redshift Spectrum to query the data directly from S3

AnswerA

S3 LIST operations are slower than HDFS; partitioning by common query filters and using partition projection can improve performance.

Why this answer

When using Amazon S3 as the storage layer for an EMR cluster, the most critical factor for query performance is minimizing S3 LIST requests. S3 LIST operations are significantly slower and more expensive than GET requests, and Hive/Spark queries on partitioned tables often issue LIST requests to discover partition locations. By organizing partitions with a common prefix (e.g., `year=2023/month=01/day=15/`) and using partition pruning, you reduce the number of LIST calls, directly improving query latency and reducing S3 API costs.

Exam trap

The trap here is that candidates may focus on metastore performance (Option C) or alternative query engines (Option D), missing the fundamental S3 performance bottleneck of LIST requests when querying partitioned data on EMR.

How to eliminate wrong answers

Option B is wrong because using HDFS for storage would negate the benefits of S3 (durability, scalability, cost) and is not the 'most important consideration' for maintaining query performance on S3; EMR can use S3 with optimizations like EMRFS and consistent view. Option C is wrong because DynamoDB is used as a Hive metastore for high availability and scalability, not to improve metadata access performance for S3 queries; the metastore choice does not directly address S3 LIST request overhead. Option D is wrong because Redshift Spectrum is a separate service for querying S3 data with Redshift, not an EMR optimization; it does not address the core performance issue of S3 LIST requests in an EMR context.
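To make the partition-pruning point concrete, the sketch below lists the only prefixes a date-filtered query would need to touch under a Hive-style layout. The bucket path and helper name are hypothetical; a real engine derives these locations from the metastore.

```python
from datetime import date, timedelta


def partition_prefixes(base: str, start: date, end: date) -> list[str]:
    """Return one Hive-style prefix per day in the inclusive range [start, end]."""
    day_count = (end - start).days + 1
    days = (start + timedelta(days=offset) for offset in range(day_count))
    return [
        f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"
        for d in days
    ]


prefixes = partition_prefixes(
    "s3://example-bucket/events",  # hypothetical table location
    date(2023, 1, 14),
    date(2023, 1, 16),
)
for prefix in prefixes:
    print(prefix)
```

A query filtered to these three days lists three prefixes instead of walking the whole table, which is where the savings in LIST requests come from.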

Full explanation →

1641

MCQmedium

A company wants to ingest data from SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for analytics. The data volume is moderate and updates occur frequently. Which AWS service is BEST suited for this task?

A.Amazon Kinesis Data Streams

B.Amazon AppFlow

C.AWS Database Migration Service (DMS)

D.AWS Glue

AnswerB

AppFlow supports many SaaS sources and can write to S3.

Why this answer

Option B is correct. AppFlow is designed to transfer data from SaaS applications to AWS services like S3. Option D is wrong because Glue is for ETL, not direct SaaS ingestion.

Option A is wrong because Kinesis is for streaming, not batch from SaaS. Option C is wrong because DMS is for database migrations.

Full explanation →

1642

MCQeasy

A data pipeline ingests daily CSV files from an FTP server into an Amazon S3 bucket. The files must be converted to Parquet format and partitioned by date for efficient querying using Amazon Athena. Which AWS service is most suitable for this transformation?

A.Amazon Kinesis Data Firehose

B.Amazon EMR

C.AWS Glue

D.AWS Lambda

AnswerC

Glue provides a serverless Spark environment that can transform CSV to Parquet and partition data efficiently.

Why this answer

Option C is correct because AWS Glue offers a serverless Spark environment with built-in transformation libraries and can automatically partition data. Option D is wrong because Lambda has a 15-minute timeout and 10 GB memory limit, making it unsuitable for large CSV-to-Parquet conversions. Option B is wrong because EMR requires cluster management.

Option A is wrong because Kinesis Data Firehose is for streaming, not batch.

Full explanation →

1643

MCQmedium

A data engineer needs to store clickstream data from a web application in Amazon S3. Each event is about 5 KB, and the application generates 1 million events per hour. The data is used for real-time analytics and also for batch processing. The engineer wants to minimize storage costs while ensuring that data is available for real-time queries as soon as it is written. Which storage class should the engineer use for the S3 bucket?

A.S3 Standard.

B.S3 Intelligent-Tiering.

C.S3 Standard-IA.

D.S3 Glacier Instant Retrieval.

AnswerA

Standard offers the best performance for frequently accessed data and no retrieval fees.

Why this answer

Option A is correct because S3 Standard offers low latency and high throughput for frequently accessed data, and it is suitable for real-time analytics. Option B is incorrect because S3 Intelligent-Tiering has a monitoring cost and is optimal for unknown access patterns, but for a known frequent access pattern, Standard is cheaper. Option C is incorrect because S3 Standard-IA has a retrieval fee and is not ideal for frequent access.

Option D is incorrect because S3 Glacier Instant Retrieval is for long-lived, rarely accessed data requiring millisecond retrieval; its retrieval fees make it more expensive than Standard when data is accessed frequently.

Full explanation →

1644

Multi-Selectmedium

A data engineer is designing a disaster recovery plan for an Amazon RDS for MySQL database. The database must be recoverable within 1 hour in a different AWS Region. Which TWO actions should the engineer take?

Select 2 answers

A.Create a cross-Region read replica.

B.Enable Multi-AZ deployment.

C.Enable automated backups with cross-Region copy.

D.Take manual snapshots and copy them to an S3 bucket in the other Region.

E.Use Amazon EventBridge to schedule snapshot copies.

AnswersA, C

A read replica can be promoted to a primary in another Region for disaster recovery.

Why this answer

A cross-Region read replica for Amazon RDS for MySQL provides a fully provisioned secondary database in a different AWS Region that can be promoted to a standalone primary in minutes, meeting the 1-hour recovery time objective (RTO). This approach ensures continuous replication from the source database, minimizing data loss and enabling rapid failover without manual snapshot management.

Exam trap

The trap here is that candidates often confuse Multi-AZ (high availability within a Region) with cross-Region disaster recovery, or they assume that scheduling snapshot copies via EventBridge is sufficient for fast recovery, ignoring the significant restore time required for snapshots.

Full explanation →

1645

MCQeasy

A data engineer needs to monitor the number of records processed by an AWS Glue ETL job and send an alert if the count drops below a threshold. Which AWS service should be used to create this custom metric?

A.Amazon S3

B.AWS Config

C.Amazon CloudWatch

D.AWS CloudTrail

AnswerC

CloudWatch can store custom metrics and trigger alarms.

Why this answer

Amazon CloudWatch is the correct service for creating custom metrics because it allows you to publish your own data points, such as the number of records processed by an AWS Glue ETL job. You can use the CloudWatch PutMetricData API or the AWS Glue job script to emit a custom metric, then set an alarm on that metric to trigger an alert when the count drops below a threshold.

Exam trap

The trap here is that candidates often confuse AWS CloudTrail with CloudWatch because both are monitoring-related, but CloudTrail is for auditing API calls, not for ingesting custom numerical metrics or setting alarms on them.

How to eliminate wrong answers

Option A is wrong because Amazon S3 is an object storage service and does not provide a mechanism to create or monitor custom metrics; it only stores data and logs access via server access logs or AWS CloudTrail. Option B is wrong because AWS Config is a service for evaluating and auditing resource configurations against rules, not for ingesting or alerting on custom operational metrics like record counts. Option D is wrong because AWS CloudTrail records API activity for auditing and governance, but it cannot be used to create custom metrics or set threshold-based alarms; it captures events, not numerical data points.
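Publishing the record count as a custom metric could look like the sketch below, which only builds the `PutMetricData` request. The namespace, metric name, dimension, and job name are hypothetical choices; the commented call assumes boto3 with configured credentials.

```python
def record_count_metric(job_name: str, records: int) -> dict:
    """Build keyword arguments for publishing a processed-record count."""
    return {
        "Namespace": "Custom/GlueJobs",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "RecordsProcessed",  # hypothetical metric name
                "Dimensions": [{"Name": "JobName", "Value": job_name}],
                "Value": float(records),
                "Unit": "Count",
            }
        ],
    }


metric = record_count_metric("nightly-orders-etl", 125_000)  # hypothetical job

# At the end of the Glue job script, with credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_data(**metric)
```

A CloudWatch alarm on this metric with a less-than comparison would then raise the alert when the count drops below the threshold.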

Full explanation →

1646

MCQmedium

A gaming company uses Amazon Redshift for analytics. The Redshift cluster stores user data that must be encrypted at rest using a customer-managed KMS key. The company has enabled audit logging using AWS CloudTrail. The security team wants to ensure that any attempt to disable or delete the KMS key is immediately detected and triggers an automated response. They have set up a CloudWatch Events rule that triggers an SNS notification when the KMS key is scheduled for deletion. However, they also want to prevent the key from being deleted accidentally. What should they do?

A.Enable automatic key rotation for the KMS key to ensure that even if the key is deleted, the data remains encrypted.

B.Add a statement to the KMS key policy that denies 'kms:ScheduleKeyDeletion' for all principals except the root user.

C.Attach an IAM policy to the Redshift cluster role that denies 'kms:ScheduleKeyDeletion'.

D.Set up a CloudTrail trail to monitor for 'ScheduleKeyDeletion' events and send an alert to the security team.

AnswerB

This prevents any IAM user or role from scheduling key deletion.

Why this answer

Option A is wrong because enabling key rotation does not prevent deletion. Option D is wrong because while CloudTrail can detect deletion, it does not prevent it. Option B is correct because a KMS key policy can explicitly deny the 'kms:ScheduleKeyDeletion' action for all principals except the account root, preventing accidental deletion.

Option C is wrong because using an IAM policy to deny deletion is less reliable if the key policy allows it; the key policy is the authoritative control.
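One possible shape for such a deny statement is sketched below. It is an illustration under assumptions, not a vetted policy: the account ID is a placeholder, the statement ID is made up, and the `ArnNotEquals` condition on `aws:PrincipalArn` is one way to carve out the account root.

```python
import json


def deny_key_deletion_statement(account_id: str) -> dict:
    """Build a key policy statement that denies scheduling key deletion
    for every principal except the account root (illustrative only)."""
    return {
        "Sid": "DenyScheduleKeyDeletionExceptRoot",  # hypothetical statement ID
        "Effect": "Deny",
        "Principal": {"AWS": "*"},
        "Action": "kms:ScheduleKeyDeletion",
        "Resource": "*",
        "Condition": {
            "ArnNotEquals": {
                "aws:PrincipalArn": f"arn:aws:iam::{account_id}:root"
            }
        },
    }


statement = deny_key_deletion_statement("111122223333")  # placeholder account
print(json.dumps(statement, indent=2))
```

The statement would be added alongside the key policy's existing statements, and any such change should be tested on a non-production key first.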

Full explanation →

1647

Multi-Selecthard

A data engineer is using Amazon Athena to query data stored in an S3 bucket. The queries are running slowly. Which THREE actions can improve query performance?

Select 3 answers

A.Partition the data on commonly filtered columns.

B.Convert the data to JSON format for better schema evolution.

C.Move the data to S3 Standard-IA storage class.

D.Convert the data to a columnar format such as Parquet or ORC.

E.Use compression (e.g., Snappy, Gzip) on the data files.

AnswersA, D, E

Partition pruning reduces amount of data scanned.

Why this answer

Options A, D, and E are correct. Partitioning reduces data scanned. Columnar formats like Parquet improve compression and query speed.

Compression reduces data size. Option C is incorrect because the S3 Standard-IA storage class affects cost (it adds retrieval fees), not query performance. Option B is incorrect because converting to JSON (text) would increase data size and slow queries.

Full explanation →

1648

MCQhard

A data engineer is building a streaming pipeline using Amazon Kinesis Data Streams. The data must be enriched with reference data from a DynamoDB table before being written to S3. The engineer wants to minimize latency. Which architecture is BEST?

A.Use AWS Glue streaming ETL to read from Kinesis, enrich, and write to S3.

B.Use Kinesis Data Analytics for Apache Flink to enrich and output to Firehose.

C.Use Kinesis Data Firehose with a Lambda function for enrichment.

D.Use a Lambda function to poll the stream, enrich, and write to Firehose.

AnswerB

Flink provides low-latency streaming enrichment with external sources.

Why this answer

Option B is correct because Kinesis Data Analytics for Apache Flink supports enrichment using external sources like DynamoDB with low latency via asynchronous I/O. Option C is wrong because Kinesis Data Firehose buffers records before delivery, so it does not support real-time enrichment. Option D is wrong because Lambda may have cold starts and limited concurrency.

Option A is wrong because Glue streaming ETL processes data in micro-batches and has higher latency.

Full explanation →

1649

MCQeasy

A data engineer needs to monitor the number of records processed by a Kinesis Data Firehose delivery stream and set an alarm if the count drops below a threshold. Which CloudWatch metric should be used?

A.IncomingRecords

B.PutRecord.Success

C.DeliveryToS3.Success

D.IncomingBytes

AnswerA

This metric counts the number of records sent to Firehose.

Why this answer

Option A is correct because 'IncomingRecords' counts records received by Firehose, which directly indicates processing volume. Option D is wrong because 'IncomingBytes' measures bytes, not records. Option C is wrong because 'DeliveryToS3.Success' measures successful deliveries, not record count.

Option B is wrong because 'PutRecord.Success' is a per-API call metric for Kinesis Data Streams, not Firehose.
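An alarm on the metric for a low record count might be defined as in the sketch below, which builds the `PutMetricAlarm` request only. The alarm name, stream name, period, and threshold are hypothetical; treating missing data as breaching is a design assumption so that a silent stream also triggers the alarm.

```python
def low_incoming_records_alarm(stream_name: str, threshold: int) -> dict:
    """Build keyword arguments for an alarm on low Firehose record volume."""
    return {
        "AlarmName": f"{stream_name}-low-incoming-records",  # hypothetical name
        "Namespace": "AWS/Firehose",
        "MetricName": "IncomingRecords",
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "Statistic": "Sum",
        "Period": 300,  # assumption: evaluate 5-minute totals
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # no data at all also counts as low
    }


alarm = low_incoming_records_alarm("clickstream-to-s3", threshold=1_000)

# With credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```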

Full explanation →

1650

MCQeasy

A data engineer is designing a data lake on Amazon S3. Which feature should be used to manage the lifecycle of objects and move them to cheaper storage classes automatically?

A.S3 Lifecycle policies

B.S3 Object Lock

C.S3 Storage Class Analysis

D.S3 Inventory

AnswerA

Automatically transitions objects to cheaper storage.

Why this answer

S3 Lifecycle policies automate transitioning objects between storage classes and can also expire objects. S3 Inventory is for reporting, S3 Storage Class Analysis is for analyzing access patterns, and Object Lock is for compliance.

Full explanation →
