Knowledge + Practice

AWS Certified Data Engineer Associate (DEA-C01): Questions 1201–1275

1786 questions total · 24 pages · All types, answers revealed



1201

Multi-Select · hard

A data engineer is designing a data ingestion pipeline that uses AWS DMS to migrate data from an on-premises Oracle database to Amazon S3 in Parquet format. The engineer needs to ensure that data is continuously replicated with minimal latency. Which THREE steps should the engineer take? (Choose three.)

Select 3 answers

A. Configure a DMS task with a transformation rule to convert to Parquet.

B. Specify an S3 bucket as the target endpoint with data format set to Parquet.

C. Use AWS Schema Conversion Tool (SCT) to convert the schema.

D. Enable change data capture (CDC) on the source database.

E. Perform a full load only, without CDC.

Answers: A, B, D

DMS can transform data to Parquet format.

Why this answer

Options A, B, and D are correct. Option D enables continuous replication through change data capture (CDC). Option A handles the conversion to Parquet in the DMS task.

Option B defines the S3 target endpoint with the data format set to Parquet. Option C is wrong because SCT is for schema conversion, not for DMS replication. Option E is wrong because a full load alone is not continuous replication.
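As a rough sketch of how these settings fit together, the Python below builds the request payloads that could be passed to boto3's DMS `create_endpoint` (as `S3Settings`) and `create_replication_task`. The bucket name, ARNs, and task identifier are hypothetical placeholders, and no AWS call is made. Note that in DMS the Parquet output format is set on the S3 target endpoint (`DataFormat`), while continuous replication comes from the task's migration type.

```python
import json


def build_s3_target_settings(bucket_name: str, role_arn: str) -> dict:
    """S3Settings for a DMS target endpoint that writes Parquet files."""
    return {
        "BucketName": bucket_name,          # hypothetical bucket
        "ServiceAccessRoleArn": role_arn,   # hypothetical role
        "DataFormat": "parquet",            # output format lives on the endpoint
    }


def build_task_request(source_arn: str, target_arn: str, instance_arn: str) -> dict:
    """Arguments for a task that does a full load and then ongoing CDC."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all",
                "object-locator": {"schema-name": "%", "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": "oracle-to-s3-parquet",  # hypothetical name
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load-and-cdc",  # CDC keeps replicating after the load
        "TableMappings": json.dumps(table_mappings),
    }
```

With boto3 available, these dictionaries would be passed as `S3Settings=...` and `**build_task_request(...)` respectively.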


1202

MCQ · easy

A data engineer needs to restrict access to an Amazon S3 bucket so that only objects encrypted with a specific AWS KMS key can be uploaded. Which S3 bucket policy condition should be used?

A. s3:x-amz-server-side-encryption-aws-kms-key-id

B. kms:ViaService

C. s3:x-amz-server-side-encryption

D. kms:EncryptionContext

Answer: A

This condition key allows you to specify a required KMS key ID for server-side encryption.

Why this answer

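Option A is correct. The s3:x-amz-server-side-encryption-aws-kms-key-id condition key is used to require a specific KMS key ID for server-side encryption. The s3:x-amz-server-side-encryption condition key (Option C) only checks which encryption type is requested, not which key is used. The kms:EncryptionContext condition key (Option D) is used to enforce encryption context, not the key itself.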
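A minimal sketch of such a bucket policy, written as a Python dictionary, might look like the following. The bucket name and key ARN are hypothetical; the statement denies any `s3:PutObject` request whose KMS key ID header does not match the required key.

```python
import json


def require_kms_key_policy(bucket: str, kms_key_arn: str) -> dict:
    """Bucket policy that rejects uploads not encrypted with the given KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutRequiredKmsKey",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    # Deny when the request's KMS key differs from the required one
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only
    policy = require_kms_key_policy(
        "example-bucket",
        "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
    )
    print(json.dumps(policy, indent=2))
```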


1203

MCQ · hard

Refer to the exhibit. A data engineer runs the AWS CLI command shown to encrypt a file using AWS KMS. The command succeeds. Later, the engineer tries to decrypt the file using the same key but without providing an encryption context. The decryption fails. What is the most likely reason?

A. The KMS key policy does not allow decryption.

B. The KMS key has been disabled.

C. The plaintext file was corrupted.

D. The encryption context must be provided during decryption.

Answer: D

KMS uses encryption context as AAD; it must match exactly.

Why this answer

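Option D is correct because the KMS encryption context is used as additional authenticated data (AAD): the same context that was supplied at encryption must be supplied for decryption. Option B is wrong because nothing indicates the key was disabled. Option A is wrong because the scenario gives no sign of a key policy problem; the failure is tied to the missing encryption context.

Option C is wrong because the encrypt command succeeded, and a damaged plaintext would not make decryption fail.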
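To illustrate why a missing context breaks decryption, here is a toy model in Python. It is not KMS and does not encrypt anything; it only binds a payload to a context with an HMAC tag, which mimics how additional authenticated data behaves. The function names and key are hypothetical.

```python
import hashlib
import hmac
import json


def _aad(context: dict) -> bytes:
    """Serialize the context deterministically so equal contexts match."""
    return json.dumps(context, sort_keys=True).encode("utf-8")


def seal(key: bytes, payload: bytes, context: dict) -> bytes:
    """Prefix the payload with a tag that covers both payload and context."""
    tag = hmac.new(key, _aad(context) + payload, hashlib.sha256).digest()
    return tag + payload


def open_sealed(key: bytes, blob: bytes, context: dict) -> bytes:
    """Return the payload only if the supplied context matches the original."""
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(key, _aad(context) + payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("encryption context mismatch")
    return payload
```

Opening the blob with the original context returns the payload, while opening it with an empty context raises `ValueError`, which parallels the failed decrypt in the scenario.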


1204

MCQ · easy

A data engineer needs to troubleshoot why an AWS Glue job is failing with an 'Insufficient Memory' error. The job processes a 10 GB dataset. Which step should the engineer take FIRST?

A. Switch from using Apache Spark to Python shell.

B. Repartition the data into more partitions within the job.

C. Change the job type from Python to Java.

D. Increase the number of DPUs allocated to the job.

Answer: D

More DPUs provide more memory and compute resources.

Why this answer

Option D is correct because increasing the DPU count gives the job more memory and compute. Option C is wrong because Glue supports Python and Scala, not Java. Option B is wrong because repartitioning may or may not help; memory is the immediate issue.

Option A is wrong because the job already runs on Apache Spark; a Python shell job would not handle the volume.


1205

Multi-Select · easy

A company wants to ensure that an IAM user can only launch Amazon EC2 instances of a specific instance type. Which THREE IAM policy elements are required to define this permission? (Choose THREE.)

Select 3 answers

A. Action

B. Principal

C. Effect

D. Resource

E. Condition

Answers: A, C, E

Action specifies ec2:RunInstances.

Why this answer

To restrict launches to a specific instance type, the policy statement needs Effect (Allow), Action (ec2:RunInstances), and Condition (to match the instance type). Option C (Effect) is needed to allow the request.

Option A (Action) names the operation being allowed. Option E (Condition) applies the instance type restriction. Option B (Principal) is not required because a policy attached to an IAM user applies to that user implicitly.

Option D (Resource) identifies what the action applies to, but it is the Condition, not the Resource, that pins the instance type. Correct: A, C, E.
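A hedged sketch of such a statement, built as a Python dictionary, is shown below. The instance type is a placeholder. The statement still carries a Resource element because identity-based policy statements normally include one, and a working RunInstances policy usually needs further statements for the other resource types involved in a launch (images, subnets, and so on), which are omitted here.

```python
def allow_instance_type_policy(instance_type: str) -> dict:
    """Identity-based policy that allows launching only one instance type."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",                 # Effect
                "Action": "ec2:RunInstances",      # Action
                # Scoped to instance resources; no Principal element, because
                # the policy is attached directly to the IAM user.
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {                     # Condition
                    "StringEquals": {"ec2:InstanceType": instance_type}
                },
            }
        ],
    }
```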


1206

MCQ · hard

A data engineer is designing a system to handle sensitive customer data in Amazon RDS for PostgreSQL. The compliance team requires that the data be encrypted at rest and that encryption keys be rotated every 90 days. Which solution meets these requirements?

A. Use AWS CloudHSM to store the encryption key and create a custom key rotation Lambda function

B. Enable RDS encryption with a customer managed KMS key, enable automatic key rotation, and manually rotate the key every 90 days

C. Enable Transparent Data Encryption (TDE) on the RDS instance

D. Enable RDS encryption with a customer managed KMS key and enable automatic key rotation

Answer: B

Manual rotation every 90 days satisfies the requirement.

Why this answer

Option B is correct because RDS encryption with a customer managed KMS key covers encryption at rest, and rotating the key manually every 90 days meets the rotation requirement on top of automatic rotation. Option C is wrong because RDS for PostgreSQL does not support Transparent Data Encryption. Option A is wrong because CloudHSM is not integrated with RDS.

Option D is wrong because KMS automatic rotation runs yearly by default, not every 90 days.


1207

MCQ · medium

A data engineer needs to design a data ingestion pipeline that ingests CSV files from an Amazon S3 bucket, transforms the data by adding a timestamp column, and loads it into an Amazon Redshift table. The pipeline should run automatically whenever a new file is uploaded to the S3 bucket. Which AWS service should be used to trigger the transformation?

A. AWS Lambda

B. AWS Step Functions

C. Amazon EventBridge

D. Amazon Simple Queue Service (SQS)

Answer: A

Lambda can be triggered directly by S3 events.

Why this answer

Option A is correct because S3 event notifications can invoke a Lambda function directly. Option D is incorrect because SQS only queues the event; a separate consumer would still be needed to run the transformation. Option C is incorrect because routing S3 events through EventBridge requires enabling EventBridge notifications on the bucket and adding a rule, which is extra setup compared with a direct S3 notification to Lambda.

Option B is incorrect because Step Functions would need an event source to start it; Lambda is simpler.
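For illustration, a Lambda handler invoked by an S3 notification could pull the bucket and key out of the event as sketched below. The transform-and-load step is left as a comment because it depends on the pipeline; the function names are hypothetical.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    objects = extract_s3_objects(event)
    # Hypothetical next step: add the timestamp column and load each file
    # into Redshift (for example by starting a Glue job or issuing a COPY).
    return {"processed": [f"s3://{bucket}/{key}" for bucket, key in objects]}
```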


1208

MCQ · hard

A company needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The network bandwidth is limited to 1 Gbps, and the transfer must complete within 10 days. The data is compressible. Which solution is MOST appropriate?

A. Use AWS DataSync over a Direct Connect connection.

B. Use Amazon S3 Transfer Acceleration with multipart uploads.

C. Use multiple parallel AWS CLI sync commands over the internet.

D. Use AWS Snowball Edge to physically ship the data.

Answer: D

Snowball Edge can transfer 50 TB in a few days, independent of network.

Why this answer

Option D is correct. AWS Snowball Edge is a physical device that can transfer large volumes quickly, bypassing network limitations. Option C is wrong because even a fully utilized 1 Gbps link needs about 111 hours (roughly 4.6 days) for 50 TB, and with protocol overhead and a shared link the transfer may run past the 10-day window.

Option A is wrong because DataSync also uses the network and is bound by the same bandwidth limit. Option B is wrong because S3 Transfer Acceleration speeds up transfers over the internet; it is not a substitute for bulk offline transfer.
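The transfer-time estimate can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes); the `utilization` parameter is a hypothetical knob for how much of the link the transfer actually gets.

```python
def transfer_hours(size_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600


if __name__ == "__main__":
    ideal = transfer_hours(50, 1)                  # full line rate
    shared = transfer_hours(50, 1, utilization=0.4)
    print(f"ideal:  {ideal:.1f} h ({ideal / 24:.1f} days)")
    print(f"at 40%: {shared:.1f} h ({shared / 24:.1f} days)")
```

At full line rate the estimate is about 111 hours; if the transfer only gets 40% of the link, it takes more than 11 days, which misses the deadline.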


1209

MCQ · medium

A data engineering team is designing a data lake on Amazon S3 with a folder structure that separates raw, transformed, and curated data. The team needs to implement lifecycle policies to minimize storage costs while ensuring that data in the 'raw' zone is retained for 90 days before being moved to Amazon S3 Glacier Deep Archive. Additionally, data in the 'curated' zone should be deleted after 365 days. What is the MOST cost-effective way to achieve these requirements?

A. Configure S3 Intelligent-Tiering on both prefixes with automatic archiving to Glacier Deep Archive after 90 days in raw and deletion after 365 days in curated.

B. Create separate lifecycle policies for each prefix: one to transition raw to S3 Glacier Deep Archive after 90 days, another to delete curated after 365 days.

C. Create a lifecycle policy that transitions raw zone data to S3 Standard-IA after 90 days and deletes curated zone data after 365 days.

D. Create a single lifecycle policy with two rules: one to transition raw zone objects to S3 Glacier Deep Archive after 90 days, and another to delete curated zone objects after 365 days.

Answer: D

This meets cost and retention requirements efficiently.

Why this answer

Option D is correct because a single S3 lifecycle policy can contain multiple rules, each applying to different prefixes. This allows you to transition raw zone objects to S3 Glacier Deep Archive after 90 days and delete curated zone objects after 365 days, minimizing storage costs without needing separate policies. S3 lifecycle policies are evaluated per object based on creation date, and using one policy reduces management overhead.

Exam trap

The trap here is that candidates might think separate lifecycle policies are required for different prefixes, but AWS allows multiple rules within a single policy, making it more cost-effective and easier to manage than creating separate policies.

How to eliminate wrong answers

Option A is wrong because S3 Intelligent-Tiering does not support automatic archiving to Glacier Deep Archive after a fixed number of days; it moves objects between access tiers based on usage patterns, not a scheduled transition, and it cannot enforce deletion after a specific period. Option B is wrong because while separate lifecycle policies can achieve the requirements, they are not the most cost-effective or efficient approach; a single policy with multiple rules is simpler and avoids potential policy conflicts or duplication of management. Option C is wrong because transitioning raw zone data to S3 Standard-IA after 90 days does not meet the requirement to move it to Glacier Deep Archive, which is the lowest-cost storage class for long-term archival; Standard-IA is more expensive than Glacier Deep Archive for data that is rarely accessed.
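A single lifecycle configuration with two prefix-scoped rules could be sketched as follows. The prefixes `raw/` and `curated/` and the rule IDs are assumptions about how the zones are laid out; the dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`, but no AWS call is made here.

```python
def lifecycle_configuration() -> dict:
    """One lifecycle configuration holding a rule per data lake zone."""
    return {
        "Rules": [
            {
                "ID": "raw-to-deep-archive",        # hypothetical rule name
                "Filter": {"Prefix": "raw/"},       # assumed prefix for the raw zone
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"}
                ],
            },
            {
                "ID": "expire-curated",             # hypothetical rule name
                "Filter": {"Prefix": "curated/"},   # assumed prefix for the curated zone
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            },
        ]
    }
```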


1210

MCQ · hard

A company uses AWS Lake Formation to manage permissions on a data lake. A data engineer creates a table in the Data Catalog and grants SELECT permission to a group of analysts. The analysts report they can see the table but get an AccessDenied error when querying it with Amazon Athena. What is the most likely cause?

A. The analysts' IAM role does not have permission to call the Athena API.

B. The table is not registered with Lake Formation as a resource.

C. The Athena workgroup is configured with a per-query result location that the analysts cannot write to.

D. The S3 bucket policy does not allow the analysts' IAM role.

Answer: A

Lake Formation grants database permissions, but IAM must allow Athena actions.

Why this answer

Option A is correct because Lake Formation integrates with Athena for fine-grained access control, and the analysts need both Lake Formation permissions and IAM permissions to call Athena. Option D is wrong because the S3 bucket policy is not the issue when Lake Formation is managing permissions.

Option C is wrong because workgroup settings are not the likely cause. Option B is wrong because the table is registered with Lake Formation.


1211

MCQ · medium

Refer to the exhibit. A data engineer runs two queries on an Athena table partitioned by 'ds'. Both queries scan the same amount of data. What does this indicate?

A. The partition column is not being used as a filter

B. The table does not have any partitions defined

C. The table is not partitioned

D. Partition pruning is working correctly

Answer: A

The filter on ds is not being pushed down, possibly due to data type mismatch.

Why this answer

Option A is correct because if the partition filter is not pushed down, Athena scans all partitions. Option B is wrong because the table has partitions defined on 'ds'. Option D is wrong because working partition pruning would reduce the data scanned.

Option C is wrong because the table is partitioned.


1212

MCQ · hard

The exhibit shows an IAM policy attached to a role used by an AWS Glue ETL job. The job reads from an S3 bucket and writes to another S3 bucket. However, the job fails with an access denied error when trying to write to the output bucket. What is the most likely cause?

A. The policy is missing permissions for AWS KMS to decrypt/encrypt objects

B. The policy does not allow glue:StartJobRun on the specific job

C. The policy only allows PutObject on the my-data-lake bucket, but the job writes to a different bucket

D. The policy does not include s3:ListBucket permission

Answer: C

The S3 permissions are scoped to my-data-lake/*; if output bucket is different, access is denied.

Why this answer

Option C is correct. The policy allows s3:PutObject on my-data-lake/*, so a write to a different bucket (e.g., my-output-bucket) is not covered and is denied. The error comes from missing permissions on the output bucket.

The resource ARN in the policy names only my-data-lake, so the role has no write access anywhere else. There is no issue with Glue actions.


1213

Multi-Select · easy

A company is using AWS Glue to catalog data in Amazon S3. The data is in CSV format with varying schemas. The Data Engineering team wants to ensure the Glue Data Catalog is updated automatically when new partitions are added to S3. Which TWO actions should be taken? (Choose two.)

Select 2 answers

A. Enable partition indexing on the Glue Data Catalog.

B. Set up an S3 event notification to trigger a Lambda function that updates the Glue Data Catalog.

C. Configure a scheduled AWS Glue crawler to run on a regular basis.

D. Run Amazon Athena queries with MSCK REPAIR TABLE to add partitions.

E. Use AWS Glue ETL jobs to write data and update the catalog simultaneously.

Answers: A, C

Partition indexing keeps queries over newly added partitions efficient.

Why this answer

A is correct because partition indexing in the Glue Data Catalog speeds up partition lookups and pruning as new partitions arrive. C is correct because a Glue crawler on a schedule will automatically discover new partitions. B is wrong because an S3 event notification that triggers a Lambda function requires custom update code and is not as efficient as crawler scheduling.

D is wrong because MSCK REPAIR TABLE must be run by hand or scheduled separately, so it does not keep the catalog updated automatically. E is wrong because relying on AWS Glue ETL jobs to update the catalog is not automatic for partitions added outside those jobs.


1214

MCQ · medium

Refer to the exhibit. A data engineer applies this S3 bucket policy to an S3 bucket. What is the effect of this policy?

A. Allows access only from specific IP addresses.

B. Allows only HTTPS requests to get and put objects, and denies HTTP requests.

C. Allows only GetObject actions over HTTPS.

D. Allows anonymous access to get and put objects over HTTP.

Answer: B

The condition enforces secure transport.

Why this answer

Option B is correct because the policy allows GetObject and PutObject only when the request uses HTTPS (secure transport), and explicitly denies all S3 actions when the request uses HTTP. Option D is wrong because the policy does not allow anonymous access over HTTP; it allows only secure transport. Option C is wrong because the policy allows both Get and Put.

Option A is wrong because the policy allows access from any IP as long as transport is secure.
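The exhibit itself is not reproduced here, so the following is only a sketch of the deny-HTTP half of such a policy, using the `aws:SecureTransport` condition key. The bucket name and statement ID are hypothetical.

```python
def deny_insecure_transport(bucket: str) -> dict:
    """Bucket policy statement that rejects any request made over plain HTTP."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # the bucket itself
                    f"arn:aws:s3:::{bucket}/*",    # every object in it
                ],
                # aws:SecureTransport is "false" when the request is not HTTPS
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
```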


1215

MCQ · easy

A company needs to ingest data from a self-managed Apache Kafka cluster running on EC2 into Amazon S3. The data must be delivered in near real-time. Which AWS service is BEST suited for this task?

A. Use Amazon MSK to replicate the Kafka cluster and then use a connector to S3.

B. Use Amazon S3 Transfer Acceleration to speed up the transfer from Kafka brokers to S3.

C. Use Amazon Kinesis Data Streams as an intermediary to buffer data before writing to S3.

D. Use an AWS Glue streaming ETL job that reads from the Kafka cluster and writes to S3.

Answer: D

Glue supports streaming from Kafka and can write to S3.

Why this answer

Option D is correct because an AWS Glue streaming ETL job can connect to a Kafka cluster, including a self-managed one, and write to S3 in near real-time. Option A is wrong because Amazon MSK (Managed Streaming for Kafka) is a managed Kafka service, not a solution for ingesting from self-managed Kafka to S3; replicating into MSK only adds a second cluster before the data reaches S3.

Option C is wrong because Kinesis Data Streams would require a separate connector to move data out of Kafka and another consumer to write to S3. Option B is wrong because S3 Transfer Acceleration is for speeding up uploads, not for ingestion from Kafka.


1216

MCQ · hard

A company runs an AWS Glue ETL job that reads data from Amazon S3, transforms it, and writes back to S3 in a different partition structure. The job uses the 'spark.sql.shuffle.partitions' option set to 200. After the job completes, the output has many small files. The data engineer wants to minimize the number of output files while maintaining job performance. Which action should the engineer take?

A. Use 'coalesce(n)' with n based on target file size (e.g., 128 MB) before writing.

B. Enable S3 multipart upload for the Glue job.

C. Increase the 'spark.sql.shuffle.partitions' to 500.

D. Reduce the 'spark.sql.shuffle.partitions' to 50.

Answer: A

Coalesce reduces partitions without a full shuffle, minimizing files.

Why this answer

Option A is correct because using 'coalesce' or 'repartition' with a number based on the target file size (e.g., 128 MB) reduces the file count without creating a single-partition bottleneck. Option D is wrong because lowering shuffle partitions reduces parallelism and may cause OOM. Option C is wrong because increasing shuffle partitions increases the number of files.

Option B is wrong because S3 multipart upload is automatic and does not affect file count.
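One way to pick `n` is to divide the expected output size by the target file size. The helper below is a sketch under that assumption; the function name is hypothetical, and the output size has to be estimated by the engineer (for example from the input size and compression ratio).

```python
import math


def target_partitions(total_bytes: int, target_file_bytes: int = 128 * 1024**2) -> int:
    """Number of output partitions so each file lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


if __name__ == "__main__":
    n = target_partitions(10 * 1024**3)   # roughly 10 GiB of output
    print(n)
    # In the Glue/Spark job this value would be used before the write, e.g.:
    #   df.coalesce(n).write.parquet(output_path)
```

For roughly 10 GiB of output and a 128 MiB target, this yields 80 partitions, so about 80 files instead of 200.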


1217

MCQ · hard

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by an AWS Lambda function that processes each record and writes to an Amazon DynamoDB table. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' from DynamoDB. The Lambda function uses the AWS SDK to batch write items in batches of 25. The DynamoDB table has on-demand capacity mode. The stream has 10 shards, and the Lambda function is configured with a batch size of 100 and 5 concurrent invocations per shard. What step should the team take to resolve the issue?

A. Switch the DynamoDB table from on-demand to provisioned capacity with a high write capacity unit (WCU) value.

B. Reduce the Lambda batch size to 25 and implement exponential backoff with jitter in the Lambda code.

C. Increase the number of Kinesis shards to 20 to reduce the load per shard.

D. Increase the Lambda function's reserved concurrency to allow more parallel executions.

Answer: B

Lowering batch size reduces the number of concurrent writes, and backoff helps handle transient throttling.

Why this answer

Option B is correct because DynamoDB on-demand mode can still throttle when traffic spikes faster than the table can scale. A stream with 10 shards and 5 concurrent invocations per shard can produce high write traffic. Reducing the batch size from 100 to 25 decreases the number of records processed per invocation, lowering the write rate, and exponential backoff with jitter spreads out retries when throttling does occur.

Option A is wrong because on-demand mode already scales automatically, though not instantaneously; switching to provisioned with high WCU might help but is not necessary and could be costly. Option D is wrong because increasing concurrency would worsen throttling. Option C is wrong because the error is DynamoDB throttling; adding shards does not reduce the write rate.
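The backoff part can be sketched in plain Python. The helper names, base delay, and cap below are illustrative assumptions, not values from the question; the sketch uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling.

```python
import random


def chunked(items: list, size: int = 25) -> list:
    """Split items into lists of at most `size` (BatchWriteItem accepts up to 25)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def backoff_delays(max_retries: int = 5, base: float = 0.05,
                   cap: float = 2.0, rng=None) -> list:
    """Full-jitter delays in seconds: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return [
        rng.uniform(0, min(cap, base * 2 ** attempt))
        for attempt in range(max_retries)
    ]
```

In the Lambda code, each throttled batch would sleep for the next delay before resubmitting its unprocessed items.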


1218

Multi-Select · hard

A data engineer is troubleshooting an AWS Glue job that reads from Amazon RDS MySQL and writes to Amazon S3. The job runs successfully but takes longer than expected. The engineer wants to optimize performance. Which THREE actions would improve job performance?

Select 3 answers

A. Increase the number of DPUs allocated to the Glue job.

B. Use a single JDBC connection per partition.

C. Increase the JDBC fetch size parameter.

D. Convert the output format from Parquet to CSV.

E. Use a pushdown predicate to filter data at the source.

Answers: A, C, E

More DPUs provide more parallelism.

Why this answer

Options A, C, and E are correct. Increasing DPUs provides more parallelism. Using pushdown predicates reduces data scanned.

Increasing the JDBC fetch size reduces round trips. Option B is wrong because a single connection per partition is already the default and may cause contention, so it does not improve performance. Option D is wrong because Parquet is already efficient; converting to CSV would make it worse.


1219

MCQ · hard

A company is using Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data includes a timestamp field. They want to partition the S3 objects by hour dynamically. The Firehose delivery stream is configured with a prefix like 'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/'. However, the objects are not being partitioned as expected; all files end up in a single partition. What is the MOST likely cause?

A. Dynamic partitioning is not enabled in the Firehose delivery stream configuration.

B. The buffer size is set too large, delaying file delivery.

C. The timestamp is in UTC but the prefix uses local time.

D. The IAM role for Firehose lacks permissions to write to S3 with dynamic prefixes.

Answer: A

Without enabling dynamic partitioning, the prefix is static and all data goes into one S3 prefix.

Why this answer

Option A is correct because dynamic partitioning requires a separate configuration (enabling dynamic partitioning and specifying the partitioning keys). The prefix with timestamp expressions is used for S3 prefix customization but not for dynamic partitioning. Option C (time zone) could cause the wrong hour but not a single partition.

Option B (buffer size) affects file size. Option D (IAM) would cause errors, not mispartitioning. The key is that dynamic partitioning must be explicitly enabled.
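To make the intended layout concrete, the sketch below builds the same Hive-style hourly prefix from a record timestamp in Python. It only illustrates the target key structure and is not Firehose configuration; the function name is hypothetical and it assumes timezone-aware datetimes.

```python
from datetime import datetime, timezone


def hourly_prefix(ts: datetime) -> str:
    """Hive-style S3 prefix for the hour containing ts (ts must be timezone-aware)."""
    ts = ts.astimezone(timezone.utc)   # normalize so partitions are in UTC
    return ts.strftime("year=%Y/month=%m/day=%d/hour=%H/")


if __name__ == "__main__":
    print(hourly_prefix(datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)))
```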


1220

Multi-Select · medium

Which TWO AWS services can be used to ingest streaming data from a mobile application into Amazon S3 for near-real-time analytics? (Choose 2.)

Select 2 answers

A. Amazon Kinesis Data Firehose

B. Amazon Kinesis Data Streams

C. Amazon DynamoDB Streams

D. AWS Glue

E. Amazon SQS

Answers: A, B

Firehose can ingest streaming data and deliver to S3 near real-time.

Why this answer

Amazon Kinesis Data Firehose can directly ingest streaming data and deliver it to S3. Amazon Kinesis Data Streams can also ingest the data, but requires a consumer to write to S3. AWS Glue is an ETL service, not an ingestion endpoint for mobile clients.

Amazon SQS is for message queuing. Amazon DynamoDB Streams captures changes in DynamoDB tables, not data from mobile apps.


1221

MCQ · easy

A data engineer is designing a data store for a real-time analytics application that requires sub-millisecond read and write latency. The data is key-value in nature and the workload is both read-heavy and write-heavy. Which AWS service is most suitable?

A. Amazon ElastiCache for Redis

B. Amazon DynamoDB

C. Amazon RDS for MySQL

D. Amazon S3

Answer: B

DynamoDB offers consistent single-digit millisecond latency at scale.

Why this answer

Amazon DynamoDB is the most suitable service because it is a fully managed NoSQL key-value and document database designed for single-digit millisecond latency at any scale, and with features like DynamoDB Accelerator (DAX) it can achieve sub-millisecond read latency. It supports both read-heavy and write-heavy workloads through its distributed architecture and auto-scaling capabilities, making it ideal for real-time analytics applications.

Exam trap

The trap here is that candidates often choose ElastiCache for Redis because of its sub-millisecond performance, overlooking that the question specifies a 'data store' for a write-heavy workload, which requires durability and persistence that Redis does not guarantee by default, whereas DynamoDB is a fully managed, durable database designed for such use cases.

How to eliminate wrong answers

Option A (Amazon ElastiCache for Redis) is wrong because while Redis provides sub-millisecond latency, it is an in-memory data store primarily designed for caching and not as a durable primary data store; data persistence is optional and can lead to data loss if not configured correctly, making it unsuitable for a write-heavy workload requiring durability. Option C (Amazon RDS for MySQL) is wrong because it is a relational database that incurs higher latency due to disk I/O and SQL query overhead, and it cannot consistently achieve sub-millisecond read and write latency for key-value workloads. Option D (Amazon S3) is wrong because it is an object storage service whose latency is typically in the tens to hundreds of milliseconds, far exceeding the sub-millisecond requirement.


1222

MCQ · easy

An organization uses AWS Lake Formation to manage a data lake in S3. A new data engineer needs to create a Glue ETL job that reads from a Lake Formation-managed table. The engineer has been granted SELECT permission on the table via Lake Formation. However, the job fails with an AccessDenied error. What is the MOST likely cause?

A. The IAM role used by the Glue job does not have Lake Formation permissions.

B. The S3 bucket policy does not allow the Glue job to access the data.

C. The table has not been registered with Lake Formation.

D. The Glue job is not running in the same VPC as Lake Formation.

Answer: A

The IAM role must have Lake Formation permissions to access the table.

Why this answer

Option A is correct because a Glue job runs under its own IAM role, and that role needs Lake Formation permissions on the table; the SELECT grant was given to the engineer, not to the job's role. Option B is wrong because access to the data is granted through Lake Formation, not through direct S3 bucket policies. Option C is wrong because the table is already managed by Lake Formation; there is no need to register it.

Option D is wrong because Lake Formation does not require the job to run in a particular VPC.


1223

Multi-Select · hard

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be stored in Amazon S3 in near real-time with minimal overhead. Which THREE steps should the data engineer take to achieve this? (Choose THREE.)

Select 3 answers

A. Configure a Lambda function to transform data before delivery to S3.

B. Create a Kinesis Data Firehose delivery stream that delivers data to S3.

C. Use Kinesis Data Analytics to process and store data in S3.

D. Enable S3 cross-region replication for the destination bucket.

E. Enable S3 compression (e.g., GZIP) in Firehose.

Answers: A, B, E

Why this answer

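Option A is correct because a Lambda function can be used as a data transformation step within a Kinesis Data Firehose delivery stream. This allows the clickstream data to be transformed (e.g., parsed, enriched, or reformatted) before being delivered to Amazon S3, enabling near real-time storage with minimal operational overhead. Option B is correct because a Firehose delivery stream can read from the Kinesis data stream and deliver to S3 without custom consumer code. Option E is correct because enabling GZIP compression in Firehose reduces the size of the objects written to S3.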

Exam trap

The trap here is that candidates may confuse Kinesis Data Analytics with a storage service or think that cross-region replication is needed for near real-time ingestion, when in fact Firehose with Lambda transformation and compression is the correct, minimal-overhead solution.
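A Firehose transformation Lambda receives base64-encoded records and must return each one with its `recordId`, a `result`, and re-encoded `data`. The sketch below assumes the clickstream records are JSON objects; the added `processed_at` field is a hypothetical enrichment.

```python
import base64
import json
from datetime import datetime, timezone


def lambda_handler(event, context):
    """Decode each Firehose record, enrich it, and hand it back for delivery."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical enrichment step
        payload["processed_at"] = datetime.now(timezone.utc).isoformat()
        # Newline-delimit so records stay separable inside the S3 object
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append(
            {
                "recordId": record["recordId"],   # must echo the incoming ID
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            }
        )
    return {"records": output}
```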


1224

MCQ · hard

A data engineer runs an AWS Glue Crawler that updates a table in the AWS Glue Data Catalog. The table is used by Amazon Athena queries. After the crawler runs, some queries start failing with the error 'HIVE_CANNOT_OPEN_SPLIT'. What is the most likely cause?

A. The crawler updated the schema and the partition metadata is inconsistent with the actual data.

B. The crawler does not have IAM permissions to read the S3 location.

C. The crawler created too many partitions, exceeding the Athena limit.

D. There are concurrent queries accessing the same table.

Answer: A

Schema changes can cause split errors.

Why this answer

Option A is correct because a schema change (e.g., a new column or a changed data type) can leave partition metadata inconsistent with the actual data, leading to split errors. Option B is wrong because missing S3 permissions would cause an access denied error. Option C is wrong because the crawler can handle partitions.

Option D is wrong because concurrent queries do not cause split errors.


1225

Multi-Select · easy

A data engineer is troubleshooting an Amazon EMR cluster that has been running for several days. The cluster uses Amazon S3 as the data source and HDFS for intermediate storage. The engineer notices that some tasks fail with 'Java heap space' errors. Which TWO actions should the engineer take to resolve this issue?

Select 2 answers

A. Enable EMRFS consistent view for S3.

B. Increase the number of containers per node.

C. Increase the maximum Java heap size for the task nodes (mapreduce.map.java.opts).

D. Increase the YARN memory overhead parameter (yarn.nodemanager.resource.memory-mb).

E. Decrease the YARN container size.

Answers: C, D

Increases memory available to the JVM.

Why this answer

Options C and D are correct. Increasing the maximum Java heap size reduces out-of-memory errors, and raising yarn.nodemanager.resource.memory-mb gives YARN more memory to allocate to containers on each node. Option A is wrong because EMRFS consistent view does not affect memory.

Option B is wrong because increasing the container count without increasing total memory may not help. Option E is wrong because decreasing the container size leaves each task with less memory, not more.
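As a rough illustration of the heap and YARN settings discussed above, the sketch below builds an EMR-style configuration list and checks that the JVM heap fits inside its container. The property `mapreduce.map.memory.mb`, the helper `heap_mb`, and every numeric value are assumptions chosen for illustration; they are not taken from the question and should be sized for the actual instance type.

```python
import json
import re


def heap_mb(java_opts: str) -> int:
    """Extract the -Xmx value (in MB) from a JVM options string."""
    match = re.search(r"-Xmx(\d+)([mMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx setting found")
    value, unit = int(match.group(1)), match.group(2).lower()
    return value * 1024 if unit == "g" else value


# Illustrative values only (assumptions, not recommendations).
emr_configurations = [
    {
        "Classification": "mapred-site",
        "Properties": {
            "mapreduce.map.memory.mb": "4096",        # container size for map tasks
            "mapreduce.map.java.opts": "-Xmx3276m",   # JVM heap inside that container
        },
    },
    {
        "Classification": "yarn-site",
        "Properties": {"yarn.nodemanager.resource.memory-mb": "24576"},
    },
]

mapred = emr_configurations[0]["Properties"]
container_mb = int(mapred["mapreduce.map.memory.mb"])
map_heap_mb = heap_mb(mapred["mapreduce.map.java.opts"])

# The heap must leave headroom for non-heap JVM memory inside the container.
assert map_heap_mb < container_mb

print(json.dumps(emr_configurations, indent=2))
```

Keeping the heap somewhat below the container size is a common convention; the exact ratio used here is only an example.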

1226

MCQeasy

A company stores time-series sensor data in Amazon S3. They need to query the data using SQL with minimal latency and no infrastructure management. Which service should they use?

A.Amazon Kinesis Data Analytics

B.Amazon Athena

C.Amazon Redshift

D.Amazon DynamoDB

AnswerB

Athena is serverless and directly queries S3 using SQL.

Why this answer

Amazon Athena is the correct choice because it is a serverless interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL without any infrastructure to manage. It is optimized for querying structured, semi-structured, and unstructured data stored in S3, making it ideal for time-series sensor data with minimal latency requirements.

Exam trap

The trap here is that candidates often confuse Amazon Athena with Amazon Redshift Spectrum, but the question explicitly requires 'no infrastructure management,' which eliminates Redshift; also, Kinesis Data Analytics is mistakenly chosen by those who think it can query static S3 data, but it is strictly for real-time streams.

How to eliminate wrong answers

Option A is wrong because Amazon Kinesis Data Analytics is designed for real-time stream processing using SQL or Apache Flink, not for querying static data already stored in S3; it requires a streaming data source and incurs ongoing processing costs. Option C is wrong because Amazon Redshift is a fully managed data warehouse that requires provisioning and managing clusters, which contradicts the 'no infrastructure management' requirement; it is also overkill for simple SQL queries on S3 data and incurs higher costs for idle compute. Option D is wrong because Amazon DynamoDB is a NoSQL key-value and document database, not designed for SQL queries on S3 data; it requires data to be loaded into tables and does not support direct querying of S3 objects.

1227

Multi-Selecthard

A company is using Amazon Kinesis Data Analytics (now part of Amazon Managed Service for Apache Flink) for streaming data processing. The application is experiencing high latency and the data engineer wants to improve performance. Which THREE actions should the engineer consider? (Choose three.)

Select 3 answers

A.Use a larger Kinesis data stream with more shards.

B.Decrease the buffer time in the Flink application to reduce latency.

C.Increase the Flink parallelism parameter in the application configuration.

D.Increase the Parallelism of the Flink application.

E.Decrease the checkpoint interval to reduce state size.

AnswersA, C, D

More shards provide higher throughput.

Why this answer

Options A, C, and D are correct. Increasing parallelism allows more concurrent processing. Using a larger Kinesis stream with more shards increases ingestion throughput.

Increasing the Flink parallelism parameter distributes the workload. Option B is incorrect because decreasing the buffer time may cause more frequent micro-batches, increasing overhead. Option E is incorrect because decreasing the checkpoint interval makes checkpoints more frequent, which increases latency.

1228

MCQhard

A data pipeline uses AWS Lambda to process records from an Amazon Kinesis Data Stream. The Lambda function is idempotent and runs once per record. Recently, the function started failing with 'ProvisionedThroughputExceededException' when writing to a DynamoDB table. Which action should the data engineer take to resolve this?

A.Decrease the Lambda function's batch size to process fewer records per invocation.

B.Increase the Lambda function's reserved concurrency.

C.Implement retry logic with exponential backoff in the Lambda function.

D.Increase the number of shards in the Kinesis stream.

AnswerC

Exponential backoff reduces the write rate when throttled, eventually succeeding.

Why this answer

Option C is correct because implementing exponential backoff with jitter is a best practice for handling throttling. Option A is wrong because reducing the batch size increases the number of Lambda invocations without lowering the overall write rate. Option B is wrong because increasing Lambda concurrency would increase the write rate, worsening the issue.

Option D is wrong because the Kinesis shard count does not change the DynamoDB write capacity.
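The retry approach described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `ThrottledError` exception as a stand-in for the DynamoDB throttling error; the function names, base delay, cap, and attempt count are all illustrative choices rather than values from the question.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from the data store."""


def backoff_delay(attempt, base=0.05, cap=5.0, rng=random.random):
    """Full-jitter delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(operation, max_attempts=6, sleep=time.sleep):
    """Run operation(), retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(backoff_delay(attempt))
```

Because the function in the question is idempotent, retrying a write that may already have succeeded is safe. The AWS SDKs also apply their own retries, so an explicit loop like this is mainly useful when those built-in retries are exhausted.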

1229

MCQeasy

A data engineer needs to store semi-structured JSON transaction logs for analytics. The logs are written once and rarely accessed. The storage must be cost-effective. Which AWS service should be used?

A.Amazon S3

B.Amazon DynamoDB

C.Amazon RDS

D.Amazon Redshift

AnswerA

S3 is cost-effective for infrequently accessed semi-structured data.

Why this answer

Amazon S3 is the correct choice because it provides highly durable, cost-effective object storage ideal for semi-structured JSON transaction logs that are written once and rarely accessed. S3's lifecycle policies can automatically transition such infrequently accessed data to S3 Glacier or S3 Glacier Deep Archive for even lower storage costs, making it the most economical option for this use case.

Exam trap

The trap here is that candidates may choose DynamoDB or Redshift because they support JSON natively, but they overlook the core requirement of cost-effective storage for rarely accessed data, which is best met by S3's low-cost object storage and lifecycle management features.

How to eliminate wrong answers

Option B (Amazon DynamoDB) is wrong because it is a NoSQL key-value and document database optimized for low-latency, high-throughput read/write operations, not for cost-effective archival storage of rarely accessed logs; storing large volumes of infrequently accessed JSON logs in DynamoDB would incur significant costs for provisioned throughput and storage. Option C (Amazon RDS) is wrong because it is a relational database service designed for transactional workloads with structured data and frequent queries, not for storing semi-structured JSON logs at low cost; it would require schema management and incur higher per-GB storage costs compared to S3. Option D (Amazon Redshift) is wrong because it is a petabyte-scale data warehouse optimized for complex analytical queries on structured and semi-structured data, not for simple, cost-effective archival storage; using Redshift for rarely accessed logs would be over-provisioned and expensive due to its compute and storage costs.

1230

Multi-Selecteasy

Which TWO are valid Amazon Redshift distribution styles? (Choose 2.)

Select 2 answers

A.HASH

B.ALL

C.AUTO

D.RANDOM

E.KEY

AnswersB, E

ALL distributes a copy of the table to all nodes.

Why this answer

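Options B and E are correct because ALL and KEY are valid distribution styles. Option A is wrong because HASH is not a distribution style name (KEY distribution hashes the distribution key, but the style is called KEY). Option D is wrong because RANDOM is not a distribution style.

Option C (AUTO) is not one of the expected answers, although Redshift does accept DISTSTYLE AUTO, which lets Redshift choose the distribution style for the table.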

1231

MCQmedium

A data engineer is designing a data lake on Amazon S3. The data is ingested from multiple sources in Parquet format, and the schema evolves over time. Which approach allows querying the data with Amazon Athena while supporting schema evolution?

A.Use AWS Glue Data Catalog with crawlers to automatically update the table schema.

B.Define Hive-style partitions in Athena and manually update the schema.

C.Use S3 Select to query the data directly without a schema.

D.Use Amazon Redshift Spectrum with external tables and update the schema manually.

AnswerA

Crawlers can detect schema changes and update the Data Catalog, which Athena uses.

Why this answer

AWS Glue Data Catalog with crawlers automatically infers and updates the table schema as new Parquet files with evolving schemas are ingested into S3. This allows Athena to query the data using the latest schema without manual intervention, making it the ideal solution for schema evolution in a data lake.

Exam trap

The trap here is that candidates may think S3 Select or Redshift Spectrum can handle schema evolution automatically, but they lack the schema inference and versioning capabilities that AWS Glue Data Catalog provides for Athena.

How to eliminate wrong answers

Option B is wrong because manually updating the schema in Athena is error-prone and does not scale with frequent schema changes; Hive-style partitions alone do not handle schema evolution. Option C is wrong because S3 Select operates on individual objects and returns data in CSV/JSON format, not Parquet, and it does not support schema evolution or table-level queries across multiple files. Option D is wrong because Redshift Spectrum requires manual schema updates for external tables and is not designed for automatic schema evolution like AWS Glue Data Catalog.

1232

MCQeasy

A company needs to centralize audit logs from multiple AWS accounts into a single S3 bucket. Which service should be used to aggregate these logs?

A.AWS Config

B.Amazon Kinesis Data Firehose

C.AWS CloudTrail

D.Amazon CloudWatch Logs

AnswerC

CloudTrail supports multi-account trail that aggregates logs into a single S3 bucket.

Why this answer

AWS CloudTrail can be configured to deliver logs from multiple accounts to a single S3 bucket using a trail in the management account. Option C is correct.

1233

MCQhard

A company runs an Apache Spark job on Amazon EMR that writes output to an S3 bucket. The job fails with the error 'S3AccessDeniedException' when writing the final output, but earlier stages succeed. The EMR cluster uses a service role and an instance profile. The S3 bucket policy allows access from the VPC only. What is the MOST likely cause?

A.The S3 bucket uses SSE-C encryption, and the EMR cluster does not have the encryption key.

B.The EMR service role does not have permissions to write to the S3 bucket.

C.The EMR cluster is not using a VPC endpoint for S3, so requests are denied by the bucket policy's VPC condition.

D.The S3 bucket is configured with 'Bucket owner enforced' setting for ACLs, and the EMR cluster's account is not the bucket owner.

AnswerC

The bucket policy restricts access to VPC, but since the Spark job runs on EMR, its requests originate from inside the VPC only if a VPC endpoint is used; otherwise, they come from public IPs.

Why this answer

The bucket policy restricts access to requests originating from the VPC, typically using a condition like `aws:SourceVpc`. If the EMR cluster does not use a VPC endpoint for S3 (either Gateway or Interface endpoint), traffic from the cluster to S3 traverses the public internet and does not match the VPC condition, causing the `S3AccessDeniedException`. Earlier stages may succeed if they use cached data or different paths, but the final write fails because it hits the bucket policy check.

Exam trap

The trap here is that candidates often assume the EMR service role (EMR_DefaultRole) is responsible for all S3 access, but in reality the instance profile (the EC2 instance role, EMR_EC2_DefaultRole by default) handles data plane operations, and the bucket policy's VPC condition is the key blocker when earlier stages succeed but final writes fail.

How to eliminate wrong answers

Option A is wrong because SSE-C encryption requires the client to provide the encryption key; if the key were missing, the error would be an encryption-related error (e.g., 'InvalidArgument' or 'AccessDenied' with a different message), not a generic 'S3AccessDeniedException'. Option B is wrong because the EMR service role is used for the cluster's service-level permissions (e.g., launching instances, reading logs), not for data access to S3; the instance profile (IAM role attached to EC2 instances) handles data read/write permissions, and the question states earlier stages succeed, indicating the instance profile has write permissions. Option D is wrong because the 'Bucket owner enforced' setting (S3 Object Ownership) controls ACLs and ownership of objects, not access permissions; it does not cause an 'S3AccessDeniedException' — it would affect who owns new objects, not whether the write is allowed.
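To make the VPC condition concrete, here is a sketch of a bucket policy that denies requests lacking the expected `aws:SourceVpc` value, together with a toy check of that single condition. The bucket name, the VPC ID, and the `is_denied` helper are hypothetical, and the helper is not the real IAM evaluation logic; it only models the one operator used here.

```python
import json

# Hypothetical identifiers for illustration.
BUCKET = "example-output-bucket"
VPC_ID = "vpc-0123456789abcdef0"

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideVpc",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            # aws:SourceVpc is only present on requests that arrive
            # through a VPC endpoint.
            "Condition": {"StringNotEquals": {"aws:SourceVpc": VPC_ID}},
        }
    ],
}


def is_denied(policy, request_context):
    """Toy check: True if any Deny statement's StringNotEquals condition matches."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Deny":
            continue
        conditions = statement.get("Condition", {}).get("StringNotEquals", {})
        if any(request_context.get(key) != value for key, value in conditions.items()):
            return True
    return False


print(json.dumps(bucket_policy, indent=2))
# A request without the key (no VPC endpoint in the path) is denied.
print(is_denied(bucket_policy, {}))
# A request carrying the expected VPC ID is not denied by this statement.
print(is_denied(bucket_policy, {"aws:SourceVpc": VPC_ID}))
```

A request that reaches S3 without passing through a VPC endpoint carries no `aws:SourceVpc` key, so the `StringNotEquals` test matches and the Deny applies.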

1234

MCQhard

A company runs a real-time analytics platform on AWS. Data is ingested from thousands of IoT devices into Amazon Kinesis Data Streams. A Lambda function consumes the stream, processes the data, and writes the results to an Amazon DynamoDB table. The DynamoDB table has a provisioned write capacity of 1000 WCU, and the read capacity is set to 200 RCU. Recently, the company noticed that the Lambda function is failing with ProvisionedThroughputExceededException on DynamoDB writes. The Lambda function is configured with a batch size of 100 and a concurrency limit of 10. The Kinesis shard count is 4. The number of devices has increased, but the data volume per device has remained the same. The company needs to resolve the write throttling without increasing the DynamoDB write capacity. Which action should the data engineer take?

A.Increase the number of Kinesis shards to 8.

B.Increase the Lambda concurrency limit to 20.

C.Increase the batch size to 200.

D.Reduce the batch size of the Lambda function to 10.

AnswerD

Smaller batches reduce write volume per invocation.

Why this answer

Reducing the batch size from 100 to 10 decreases the number of records processed per Lambda invocation, which reduces the burst of write requests to DynamoDB per invocation. This helps stay within the 1000 WCU limit without increasing capacity, as the same total throughput is spread across more invocations with smaller batches.

Exam trap

The trap here is that candidates assume increasing concurrency or shards will distribute the load better, but in reality, those actions increase the total write throughput, exacerbating throttling when DynamoDB capacity is fixed.

How to eliminate wrong answers

Option A is wrong because increasing Kinesis shards to 8 would increase the number of concurrent Lambda consumers, potentially amplifying the write throttling issue by generating more parallel writes to DynamoDB. Option B is wrong because increasing Lambda concurrency to 20 would allow more simultaneous invocations, each writing up to 100 records, which would increase the aggregate write rate and worsen ProvisionedThroughputExceededException. Option C is wrong because increasing the batch size to 200 would cause each Lambda invocation to attempt writing more records at once, creating larger spikes in write demand that exceed the 1000 WCU limit.

1235

Multi-Selectmedium

A company uses Amazon RDS for MySQL as a source for AWS DMS. The replication tasks are failing due to large transactions on the source. The team wants to reduce the impact of large transactions on DMS. Which THREE actions should the team take?

Select 3 answers

A.Increase the number of parallel threads on the source.

B.Increase the size of the source RDS instance and enable binary logging with ROW format.

C.Enable BatchApply in the DMS task settings.

D.Use the 'full load only' migration type.

E.Tune the DMS task to use a larger memory limit and adjust the transaction size.

AnswersB, C, E

Larger instance and proper logging help DMS capture changes.

Why this answer

Options B, C, and E are correct. BatchApply reduces apply time, tuning the memory limit and transaction size reduces memory pressure, and a larger source instance with ROW-format binary logging helps DMS capture changes. Option A is wrong because parallel threads on the source may increase CPU load.

Option D is wrong because switching to full load only is not a replication solution.

1236

Multi-Selecteasy

A data engineer needs to transform data in an S3 data lake using AWS Glue ETL. The data is in CSV format and needs to be converted to Parquet with partitioning by date. The engineer wants to minimize the number of files written to S3 to improve query performance. Which TWO configuration options should the engineer use? (Select TWO.)

Select 2 answers

A.Increase the number of workers in the Glue job to increase parallelism.

B.Use the coalesce method to reduce the number of output partitions.

C.Disable compression in the Parquet output.

D.Enable partition pruning in the Glue job by setting the 'partitionKeys' parameter.

E.Set the 'groupFiles' option to 'inPartition' in the DynamicFrame writer.

AnswersB, D

Coalesce reduces the number of partitions before writing, resulting in fewer files.

Why this answer

Option B (use coalesce) reduces the number of output files by merging partitions. Option D (set 'partitionKeys') lets queries skip irrelevant partitions, improving performance. Option A is wrong because increasing the number of workers may increase the number of output files.

Option C is wrong because disabling compression increases file size. Option E is wrong because the 'groupFiles' setting does not directly minimize the output file count.

1237

MCQhard

A company uses DynamoDB with provisioned capacity and experiences throttling on a table during peak hours. The data engineer notices that the table has a partition key with high cardinality and the workload is read-heavy. Which action would best resolve the throttling?

A.Enable DynamoDB Auto Scaling for the table.

B.Switch the table to on-demand capacity mode.

C.Increase the provisioned write capacity units.

D.Add a global secondary index (GSI) to distribute reads.

AnswerA

Auto Scaling adjusts capacity based on traffic, preventing throttling efficiently.

Why this answer

DynamoDB Auto Scaling adjusts the provisioned read capacity units (RCUs) based on actual traffic patterns, preventing throttling during peak hours without manual intervention. Since the table has high-cardinality partition keys and is read-heavy, throttling is likely due to insufficient RCUs, which Auto Scaling dynamically increases to match demand.

Exam trap

AWS often tests the misconception that adding a GSI or switching to on-demand is the default fix for throttling, but the correct answer requires identifying that the read-heavy workload needs RCU adjustments, not structural changes or mode switches.

How to eliminate wrong answers

Option B is wrong because switching to on-demand capacity mode would eliminate throttling but at a significantly higher cost for a read-heavy workload, and it does not leverage the existing provisioned capacity setup. Option C is wrong because increasing provisioned write capacity units (WCUs) does not address read throttling; the issue is read-heavy, so RCUs need adjustment, not WCUs. Option D is wrong because adding a GSI distributes reads across partitions but does not increase the table's total provisioned read capacity; it could even worsen throttling if the GSI's write capacity is not properly provisioned.
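For illustration, the target-tracking setup behind DynamoDB Auto Scaling can be sketched as the two parameter sets that Application Auto Scaling expects. The table name, policy name, capacity bounds, 70% target, and the `capacity_for_target` helper are assumptions made for this example; the dictionaries are only constructed here, not sent to AWS.

```python
TABLE_NAME = "example-table"  # hypothetical table name

# Shape of the arguments for registering read capacity as a scalable target.
scalable_target = {
    "ServiceNamespace": "dynamodb",
    "ResourceId": f"table/{TABLE_NAME}",
    "ScalableDimension": "dynamodb:table:ReadCapacityUnits",
    "MinCapacity": 200,    # illustrative bounds
    "MaxCapacity": 2000,
}

# Shape of a target-tracking policy on read capacity utilization.
scaling_policy = {
    "PolicyName": "read-target-tracking",  # hypothetical name
    "ServiceNamespace": scalable_target["ServiceNamespace"],
    "ResourceId": scalable_target["ResourceId"],
    "ScalableDimension": scalable_target["ScalableDimension"],
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
}


def capacity_for_target(consumed_rcu: int, target_percent: int) -> int:
    """Provisioned RCUs needed so consumed_rcu sits at target_percent utilization."""
    return -(-consumed_rcu * 100 // target_percent)  # integer ceiling division


# Example: 350 consumed RCUs at a 70% target implies 500 provisioned RCUs.
print(capacity_for_target(350, 70))
```

Target tracking reacts after utilization rises, so very sharp peaks can still see brief throttling before capacity catches up.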

1238

MCQeasy

A data engineer needs to ensure that data in transit between an Amazon RDS for PostgreSQL database and an application is encrypted. Which configuration should be used?

A.Use VPC peering to connect the application to the database

B.Enable SSL/TLS for the database connection

C.Enable encryption at rest for the RDS instance

D.Use IAM database authentication

AnswerB

SSL/TLS encrypts data in transit.

Why this answer

Option B is correct because SSL/TLS encryption secures data in transit between the application and the RDS database. Option C (encryption at rest) does not protect data in transit. Option A (VPC peering) is a network connectivity option, not encryption.

Option D (IAM database authentication) is for access control, not encryption in transit.

1239

MCQmedium

A company is ingesting log files from EC2 instances into CloudWatch Logs and then wants to deliver them to S3 for long-term storage and analysis. The data engineer needs to ensure the logs are delivered to S3 within 5 minutes of being generated. Which approach meets this requirement?

A.Configure a CloudWatch Logs metric filter and invoke a Lambda function to write to S3

B.Use CloudWatch Logs Insights to query logs and save results to S3

C.Use CloudWatch Logs subscription filter with Kinesis Data Firehose to deliver to S3

D.Use the CloudWatch Logs export to S3 feature

AnswerC

Firehose can deliver to S3 within minutes.

Why this answer

Option C (use a CloudWatch Logs subscription filter with Kinesis Data Firehose to deliver to S3) is correct because Firehose delivers to S3 with low latency, governed by its configurable buffer size and interval. Option D (CloudWatch Logs export to S3) is a batch process that can take hours. Option A (use a Lambda function to write logs to S3) is slower and less reliable.

Option B (use CloudWatch Logs Insights) is for querying, not delivery.

1240

MCQmedium

A company wants to enable automatic encryption for all new objects written to an S3 bucket. The bucket has existing objects that are unencrypted. Which solution meets these requirements with the least operational overhead?

A.Configure a lifecycle policy to transition objects to a new bucket with encryption

B.Enable default encryption on the bucket using SSE-S3

C.Use S3 server-side encryption with S3 managed keys (SSE-S3) and apply a bucket policy that denies writes without encryption

D.Use S3 Batch Operations to copy existing objects with SSE-S3

AnswerB

Default encryption encrypts all new objects automatically.

Why this answer

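Option B is correct because bucket default encryption automatically encrypts new objects. Option A is wrong because lifecycle policies transition objects between storage classes; they do not move objects to another bucket or encrypt them. Option C is wrong because a deny policy only enforces encryption at upload time and requires every writer to send the encryption headers, which adds overhead.

Option D is wrong because S3 Batch Operations is more complex and targets existing objects, which the requirement does not cover.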

1241

MCQhard

A financial services company ingests real-time stock trade data using Amazon Kinesis Data Streams with 10 shards. Each shard receives about 500 records per second, each record approximately 1 KB. The data is consumed by a single AWS Lambda function that transforms the data and writes to Amazon S3. The Lambda function is configured with 1024 MB memory and a timeout of 5 minutes. The company notices that the Lambda function is frequently throttled, and data ingestion lags behind. The Lambda function's CloudWatch metrics show that the iterator age is increasing, and the function's concurrency is maxed out at 1000. The data engineer needs to resolve the throttling issue without changing the Lambda function code. What should the data engineer do?

A.Increase the number of shards in the Kinesis data stream to increase parallelism.

B.Reduce the Lambda function memory to 512 MB to increase concurrency limit.

C.Decrease the batch size to 10 records to reduce processing time per invocation.

D.Increase the Lambda function memory to 2048 MB to improve processing speed.

AnswerA

More shards allow more Lambda concurrent executions, reducing iterator age.

Why this answer

Option A is correct. Increasing the number of shards increases the number of Lambda consumers, allowing more parallel processing and reducing the iterator age. Option D is wrong because increasing Lambda memory may not improve throughput if the function is CPU-bound; also, the issue is concurrency, not memory.

Option C is wrong because decreasing the batch size may increase overhead; the function is already maxing out concurrency. Option B is wrong because reducing memory would likely make the function slower.
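The figures in the scenario can be checked with a short calculation. This sketch assumes the standard per-shard write limits of 1,000 records per second and 1 MB per second, treats 1 MB as 1,048,576 bytes and 1 KB as 1,024 bytes, and uses a hypothetical helper name `stream_load`.

```python
SHARD_WRITE_LIMIT_RECORDS = 1000          # records per second per shard
SHARD_WRITE_LIMIT_BYTES = 1024 * 1024     # 1 MB per second per shard (assumed as MiB)


def stream_load(shards: int, records_per_shard: int, record_bytes: int) -> dict:
    """Summarize aggregate load and per-shard utilization of the write limits."""
    total_records = shards * records_per_shard
    return {
        "records_per_second": total_records,
        "bytes_per_second": total_records * record_bytes,
        "record_utilization": records_per_shard / SHARD_WRITE_LIMIT_RECORDS,
        "byte_utilization": records_per_shard * record_bytes / SHARD_WRITE_LIMIT_BYTES,
    }


# Scenario values: 10 shards, 500 records/s per shard, ~1 KB per record.
load = stream_load(shards=10, records_per_shard=500, record_bytes=1024)
print(load)
```

Under these assumptions each shard runs at roughly half of its write limits, so ingestion itself is not the constraint. The lag comes from the consumer side, where the number of shards bounds how many batches Lambda reads in parallel by default.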

1242

MCQeasy

A data engineer needs to automate the backup of an Amazon RDS for PostgreSQL database. Which AWS service can be used to schedule and manage the backups?

A.Amazon S3

B.AWS Lambda

C.AWS Backup

D.Amazon CloudWatch

AnswerC

AWS Backup provides centralized backup management for RDS.

Why this answer

Option C is correct because AWS Backup is a fully managed backup service that can automate backups of RDS databases. Option A is wrong because S3 is storage. Option D is wrong because CloudWatch is for monitoring.

Option B is wrong because Lambda can be used but is not the primary managed service for backup scheduling.

1243

MCQeasy

A company uses AWS DMS to replicate data from an on-premises Oracle database to Amazon RDS for MySQL. The full load completes successfully, but ongoing replication (CDC) is failing with a 'Failed to add supplemental logging' error. What should the data engineer do to resolve this issue?

A.Enable supplemental logging on the source Oracle database manually.

B.Recreate the DMS endpoint for the source database with a new connection.

C.Modify the target MySQL database to use a different engine version.

D.Increase the task log interval in the DMS task settings.

AnswerA

DMS requires supplemental logging for CDC.

Why this answer

Option A is correct because AWS DMS CDC requires supplemental logging on the source Oracle database to capture changes. The error indicates that DMS cannot enable it automatically, so the engineer must enable it manually. Option D is incorrect because increasing the task log interval does not affect supplemental logging.

Option C is incorrect because changing the target engine version is unrelated. Option B is incorrect because the error is not about source connectivity.

1244

Multi-Selecteasy

A company needs to store log files from multiple applications in a centralized location. The logs are written once and accessed rarely after 30 days. The company must retain logs for 5 years. Which TWO actions meet these requirements cost-effectively?

Select 2 answers

A.Configure a lifecycle policy to transition objects to S3 Glacier Deep Archive after 30 days

B.Configure a lifecycle policy to transition objects to S3 Glacier Flexible Retrieval after 30 days

C.Use S3 Intelligent-Tiering for automatic cost optimization

D.Use S3 One Zone-IA for the first 30 days, then delete

E.Store all logs in S3 Standard

AnswersA, C

Deep Archive is the lowest-cost storage class for long-term retention.

Why this answer

Option A is correct because S3 Glacier Deep Archive is the lowest-cost storage class for data that is accessed rarely, with retrieval times measured in hours (typically within 12 hours for standard retrievals), making it ideal for logs that are rarely accessed after 30 days. A lifecycle policy transitions objects from a higher-cost class (e.g., S3 Standard) to S3 Glacier Deep Archive after 30 days, meeting the 5-year retention requirement cost-effectively.

Exam trap

AWS often tests the distinction between S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive, where candidates mistakenly choose the former for rarely accessed data due to familiarity, ignoring the cost savings of the latter for deep archival use cases.
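A lifecycle rule of the kind described for Option A might look like the sketch below. The rule ID, the `logs/` prefix, and the expiration after five years are assumptions added for illustration; the question only states the 30-day access pattern and the 5-year retention period. The dictionary is only constructed here, not applied to a bucket.

```python
RETENTION_YEARS = 5
RETENTION_DAYS = RETENTION_YEARS * 365  # ignores leap days for simplicity

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-logs",            # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},   # hypothetical key prefix
            # Move objects to Deep Archive once they are rarely accessed.
            "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
            # Assumption: delete objects once the retention period has passed.
            "Expiration": {"Days": RETENTION_DAYS},
        }
    ]
}

rule = lifecycle_configuration["Rules"][0]
# The transition must happen before the objects expire.
assert rule["Transitions"][0]["Days"] < rule["Expiration"]["Days"]
print(rule["Expiration"]["Days"])
```

If logs must be kept longer than the stated minimum, the expiration block can simply be omitted.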

1245

MCQhard

A data engineer is ingesting XML data from an external API into Amazon S3. The engineer needs to transform the XML to JSON using AWS Glue. The XML structure is deeply nested. Which Apache Spark method should be used in the Glue ETL script?

A.Use the built-in AWS Glue 'xml' data source

B.Use Hadoop's XmlInputFormat

C.Use spark.read.format('xml') with Databricks XML library

D.Use the Spark SQL function from_xml()

AnswerC

This is the standard way to parse XML in Spark.

Why this answer

The Databricks XML library for Spark (spark-xml) can parse XML into a DataFrame, which can then be written out as JSON. The 'com.databricks.spark.xml' format is the standard choice. AWS Glue's built-in 'xml' option is limited, 'from_xml' is not a Spark method, and 'XmlInputFormat' belongs to Hadoop MapReduce.

1246

MCQeasy

Refer to the exhibit. An S3 event notification is configured to trigger an AWS Lambda function when objects are created in 'my-bucket'. The Lambda function processes the JSON file and writes results to Amazon DynamoDB. The function fails with a timeout error. Which action should the engineer take to resolve the issue?

A.Modify the S3 event notification to use a different event type

B.Grant the Lambda function permission to access DynamoDB

C.Change the trigger to Amazon SQS instead of S3

D.Increase the Lambda function timeout

AnswerD

Timeout error indicates the function needs more time.

Why this answer

Option D is correct. If the function times out, increasing the timeout gives it more time to complete. Option A is wrong because the event type is already correct.

Option B is wrong because the function can access DynamoDB with its existing permissions. Option C is wrong because the S3 trigger is working, and changing the event source to SQS does not address the timeout.

1247

MCQmedium

Refer to the exhibit. A data engineer created this IAM policy for a Lambda function that reads from a Kinesis stream and writes to an S3 bucket. The Lambda function fails with an 'AccessDenied' error when trying to write to S3. What is the missing permission?

A.s3:ListBucket on the bucket

B.s3:GetObject on the bucket

C.s3:PutObjectAcl on the bucket

D.s3:DeleteObject on the bucket

AnswerA

ListBucket is often required for PutObject operations to verify bucket existence.

Why this answer

Option A is correct because the Lambda function needs permission to list the bucket (s3:ListBucket) to verify the bucket exists and possibly to check for existing objects. The 's3:PutObject' action is allowed, but without 's3:ListBucket', the function may fail when trying to write if the key does not exist or for bucket-level operations. Option B is wrong because 's3:GetObject' is already present.

Option D is wrong because 's3:DeleteObject' is not needed. Option C is wrong because 's3:PutObjectAcl' is not required.

1248

MCQeasy

A data engineer notices that an Amazon Kinesis Data Firehose delivery stream is failing to deliver data to an Amazon S3 bucket. The engineer verifies that the S3 bucket exists and that the IAM role attached to the delivery stream has the necessary permissions. What is the MOST likely cause of the failure?

A.The delivery stream is configured to deliver to Amazon CloudWatch Logs.

B.The IAM role does not have permissions to write to the S3 bucket.

C.No data is being written to the Kinesis Data Firehose delivery stream.

D.The delivery stream is configured to deliver to Amazon Kinesis Data Streams instead of S3.

AnswerC

If no data is put into the stream, it cannot deliver to S3.

Why this answer

Option C is correct because if no data is written to the stream, Firehose has nothing to deliver. Option D is wrong because delivery streams typically use S3 as a destination, not Kinesis Data Streams. Option B is wrong because insufficient permissions would cause an access denied error, and the engineer has already verified the role's permissions.

Option A is wrong because CloudWatch Logs is for monitoring, not for storing delivery data.

1249

MCQmedium

A company has an Amazon RDS for MySQL DB instance with read replicas. The primary DB instance fails. What is the correct procedure to promote a read replica to become the new primary?

A.Modify the read replica to be a Multi-AZ deployment and failover will occur.

B.RDS automatically fails over to the read replica within 5 minutes.

C.Manually promote the read replica to a standalone DB instance.

D.Delete the primary and the read replica will automatically become the primary.

AnswerC

This is the correct procedure to make the read replica the new primary.

Why this answer

When an Amazon RDS for MySQL primary DB instance fails, read replicas do not automatically become the new primary. The correct procedure is to manually promote the read replica using the AWS Management Console, CLI, or API, which converts it into a standalone DB instance. After promotion, you must update your application endpoints to point to the new primary, as RDS does not handle this automatically.

Exam trap

The trap here is that candidates confuse read replicas with Multi-AZ standby instances, assuming automatic failover applies to both, but RDS read replicas require manual promotion and do not provide automatic failover.

How to eliminate wrong answers

Option A is wrong because modifying a read replica to be Multi-AZ does not trigger a failover; Multi-AZ is a separate feature for high availability within a single region, and read replicas are not part of the Multi-AZ failover mechanism. Option B is wrong because RDS does not automatically fail over to a read replica; automatic failover only occurs with Multi-AZ deployments, not with read replicas. Option D is wrong because deleting the primary DB instance does not cause the read replica to automatically become the primary; the read replica remains a read-only copy until manually promoted.
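
The manual promotion step described above can also be scripted. The following is a minimal sketch, assuming boto3's RDS client is used; the helper name `promotion_request` and the identifier `reporting-replica` are hypothetical, and the optional backup retention argument is an illustration rather than a requirement.

```python
from typing import Optional


def promotion_request(replica_id: str, backup_retention_days: Optional[int] = None) -> dict:
    """Build keyword arguments for the RDS PromoteReadReplica API call."""
    params = {"DBInstanceIdentifier": replica_id}
    if backup_retention_days is not None:
        # A positive value turns on automated backups for the promoted instance.
        params["BackupRetentionPeriod"] = backup_retention_days
    return params


# With AWS credentials configured, the call itself would look like this
# (the replica identifier is a hypothetical example):
#   import boto3
#   boto3.client("rds").promote_read_replica(**promotion_request("reporting-replica", 7))
```

Promotion only produces a standalone instance; pointing the application at the new endpoint remains a separate step.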

1250

Multi-Selecteasy

Which TWO AWS services can be used to schedule and orchestrate ETL workflows that involve multiple steps and dependencies? (Choose 2.)

Select 2 answers

A.AWS Batch

B.AWS Lambda

C.AWS Data Pipeline

D.AWS Step Functions

E.Amazon Managed Workflows for Apache Airflow (MWAA)

AnswersD, E

Step Functions can coordinate multiple AWS services into workflows.

Why this answer

D and E are correct. D: Step Functions can orchestrate AWS Glue jobs, Lambda functions, and other services as multi-step workflows with dependencies. E: Amazon Managed Workflows for Apache Airflow (MWAA) is designed for orchestration.

C (Data Pipeline) is an older service that is less commonly used for new workflows. A (Batch) is for batch computing, not orchestration. B (Lambda) runs individual functions.

1251

MCQmedium

A data engineering team is using Amazon EMR to process large datasets stored in Amazon S3. The cluster uses Spot Instances for cost savings. During processing, the team notices that tasks are failing due to Spot Instance interruptions. The team needs to make the EMR job resilient to Spot interruptions without increasing costs significantly. Which solution should they implement?

A.Use EMR instance fleets with a mix of Spot and On-Demand, but set the allocation strategy to 'lowest price'.

B.Increase the number of core nodes using On-Demand instances.

C.Use only Spot Instances but enable automatic termination and checkpointing.

D.Use EMR instance fleets with a mix of Spot and On-Demand, setting the allocation strategy to 'diversified' and using On-Demand for core nodes.

AnswerD

Diversified spreads risk; On-Demand core ensures stability.

Why this answer

Option D is correct because a mix of On-Demand and Spot capacity keeps the job running through interruptions. Specifically, using On-Demand for core nodes, which host HDFS data, and Spot for task nodes with a diversified allocation strategy reduces interruption impact while keeping most of the Spot savings. Option B is incorrect because increasing core nodes with On-Demand increases cost.

Option C is incorrect because using only Spot Instances increases failure risk. Option A is incorrect because a 'lowest price' allocation strategy concentrates Spot capacity in the cheapest pools instead of spreading it across pools.
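
The fleet layout described above can be sketched as an EMR instance fleet configuration. This is a minimal sketch, assuming the structure is passed as `Instances={"InstanceFleets": ...}` to boto3's `run_job_flow`; the instance types, capacities, and timeout are hypothetical values, and availability of the `diversified` strategy should be checked against the EMR release in use.

```python
def resilient_instance_fleets(
    on_demand_type: str = "m5.xlarge",
    spot_types: tuple = ("m5.xlarge", "m5a.xlarge", "r5.xlarge"),
) -> list:
    """Primary and core fleets on On-Demand, task fleet on diversified Spot."""
    return [
        {
            "Name": "primary",
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": on_demand_type}],
        },
        {
            # Core nodes hold HDFS data, so they stay on On-Demand capacity.
            "Name": "core",
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": 2,
            "InstanceTypeConfigs": [{"InstanceType": on_demand_type}],
        },
        {
            # Task nodes only run compute, so losing one does not lose data.
            "Name": "task",
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": 4,
            "InstanceTypeConfigs": [
                {"InstanceType": t, "WeightedCapacity": 1} for t in spot_types
            ],
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "AllocationStrategy": "diversified",
                    "TimeoutDurationMinutes": 10,
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                }
            },
        },
    ]


fleets = resilient_instance_fleets()
```

Listing several instance types in the task fleet is what gives the allocation strategy more than one Spot pool to draw from.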

1252

MCQeasy

A company wants to ensure that only encrypted connections are used when data is transferred to S3. Which policy condition should be used in an S3 bucket policy?

A.Condition: { Null: { s3:x-amz-server-side-encryption: true } }

B.Condition: { StringEquals: { s3:signatureversion: ["AWS4-HMAC-SHA256"] } }

C.Condition: { StringNotEquals: { aws:SourceIp: ["0.0.0.0/0"] } }

D.Condition: { Bool: { aws:SecureTransport: false } }

AnswerD

Denying requests where SecureTransport is false enforces HTTPS.

Why this answer

Option D is correct because aws:SecureTransport checks whether the request was sent over SSL/TLS; a Deny statement with this condition set to false rejects plain HTTP requests. Option C is wrong because aws:SourceIp checks the caller's IP address. Option A is wrong because s3:x-amz-server-side-encryption checks encryption at rest.

Option B is wrong because s3:signatureversion checks the signature version, not encryption in transit.
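
To make the aws:SecureTransport condition concrete, here is a minimal sketch of a bucket policy that denies non-TLS requests, built as a Python dictionary. The function name, the `Sid`, and the bucket name `example-bucket` are hypothetical; the condition key and its Deny semantics come from the explanation above.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> dict:
    """Bucket policy that denies every S3 request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and the objects inside it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # The Deny applies when the request did NOT use SSL/TLS.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy_json = json.dumps(deny_insecure_transport_policy("example-bucket"), indent=2)
```

The condition value is `false` because the statement is a Deny: it matches the requests to reject, not the ones to allow.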

1253

MCQmedium

Refer to the exhibit. The S3 bucket policy above is applied to the bucket "example-bucket". An IAM user attempts to upload an object to the bucket without specifying any encryption header. What is the outcome?

A.The upload succeeds but the object is not encrypted

B.The upload fails because GetObject requires encryption

C.The object is uploaded successfully with SSE-S3 encryption by default

D.The upload fails with an Access Denied error

AnswerD

The Deny statement blocks the upload.

Why this answer

Option D is correct because the Deny statement denies PutObject if the encryption header is not AES256. Since the user did not specify encryption, the header is absent, the StringNotEquals condition evaluates to true, and the request is denied. Option A is wrong because the explicit Deny overrides any Allow, so the upload never succeeds.

Option C is wrong because the request is rejected at authorization, before default encryption could be applied. Option B is wrong because the Deny statement applies to PutObject; the bucket policy does not require encryption for GetObject.
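
Why a missing header triggers the Deny can be illustrated with a toy model. This is a deliberately simplified sketch of how a negated string condition treats an absent key; it is not the actual IAM policy evaluation engine, and both function names are hypothetical.

```python
CONDITION_KEY = "s3:x-amz-server-side-encryption"


def string_not_equals(request_context: dict, key: str, expected: str) -> bool:
    """Simplified StringNotEquals: an absent key counts as 'not equal'."""
    return request_context.get(key) != expected


def put_is_denied(request_context: dict) -> bool:
    # The Deny statement matches whenever the header is anything other
    # than AES256, which includes the header being missing entirely.
    return string_not_equals(request_context, CONDITION_KEY, "AES256")


no_header = put_is_denied({})
with_aes256 = put_is_denied({CONDITION_KEY: "AES256"})
with_kms = put_is_denied({CONDITION_KEY: "aws:kms"})
```

Under this model only the request that carries `AES256` escapes the Deny; the request with no header is denied, and so is one that asks for `aws:kms`.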

1254

MCQmedium

A company is migrating an on-premises Apache Cassandra database to Amazon Keyspaces. The database has a table with a partition key of 'user_id' and a clustering column of 'timestamp'. The application frequently queries the last 10 records for a given user. Which table design in Keyspaces would provide the BEST query performance for this access pattern?

A.Partition key: random column, clustering column: none.

B.Partition key: timestamp, clustering column: user_id.

C.Partition key: user_id, clustering column: none.

D.Partition key: user_id, clustering column: timestamp (descending order).

AnswerD

This design groups all records for a user in one partition and sorts by timestamp descending, enabling efficient retrieval of the last 10 records.

Why this answer

Option D is correct because it preserves the original Cassandra table design with 'user_id' as the partition key and 'timestamp' as the clustering column in descending order. This allows Keyspaces to efficiently retrieve the last 10 records for a given user by performing a range query on the clustering column within a single partition, avoiding full table scans or cross-partition queries.

Exam trap

The trap here is that candidates may think a random partition key (Option A) or timestamp-based partition key (Option B) improves write distribution, but they overlook that the query pattern requires efficient reads within a single partition, which is best achieved by using the query filter column as the partition key and the sort column as the clustering key with the appropriate order.

How to eliminate wrong answers

Option A is wrong because using a random partition key with no clustering column would scatter data across partitions, requiring a full scan to find records for a specific user, which is highly inefficient. Option B is wrong because using 'timestamp' as the partition key would place each timestamp in a separate partition, making it impossible to query all records for a user without scanning multiple partitions, and the clustering column 'user_id' would not help retrieve the last 10 records per user efficiently. Option C is wrong because while 'user_id' as the partition key correctly groups data by user, having no clustering column means you cannot order records by timestamp, so retrieving the last 10 records would require fetching all records for that user and sorting them in application code, which is suboptimal.
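
The effect of the partition key and descending clustering order can be shown with a small in-memory model. This is a toy sketch, not Keyspaces client code: the class name and sample data are hypothetical, and the sort stands in for what `CLUSTERING ORDER BY (timestamp DESC)` does on disk.

```python
from collections import defaultdict


class UserEventsTable:
    """Toy model: one partition per user_id, rows kept newest-first inside it."""

    def __init__(self) -> None:
        # user_id -> list of (timestamp, payload) rows
        self._partitions = defaultdict(list)

    def insert(self, user_id: str, timestamp: int, payload: str) -> None:
        rows = self._partitions[user_id]
        rows.append((timestamp, payload))
        # Mirrors a descending clustering order: newest row first.
        rows.sort(key=lambda row: row[0], reverse=True)

    def latest(self, user_id: str, limit: int = 10) -> list:
        # Touches a single partition and reads only its first `limit` rows.
        return self._partitions[user_id][:limit]


table = UserEventsTable()
for ts in range(15):
    table.insert("user-1", ts, f"event-{ts}")
latest = table.latest("user-1")
```

Because the rows are already stored newest-first, "last 10 records" becomes a read of the first 10 rows in one partition, with no sorting at query time.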

1255

Multi-Selectmedium

A company needs to securely store and manage database credentials used by a data pipeline. Which AWS services can be used to store and rotate secrets automatically? (Choose TWO.)

Select 2 answers

A.AWS Systems Manager Parameter Store

B.AWS IAM

C.AWS Key Management Service (AWS KMS)

D.AWS CloudHSM

E.AWS Secrets Manager

AnswersA, E

Parameter Store can store secrets, but rotation requires custom setup.

Why this answer

Options A and E are correct. AWS Secrets Manager can rotate secrets automatically. AWS Systems Manager Parameter Store can store secrets securely, but automatic rotation requires custom logic or integration.

Option B is wrong because IAM manages identities and permissions; it is not a secrets store. Option C is wrong because KMS is for encryption keys, not secrets. Option D is wrong because CloudHSM is for hardware security modules.

1256

MCQmedium

A data engineer sees this AWS Glue table definition in the Data Catalog. The engineer wants to query this table with Amazon Athena, but the query returns zero rows. What is the MOST likely cause?

A.The data files are not in the specified S3 location.

B.The SerDe library is incorrect for CSV files.

C.The table format CSV is not supported by Athena.

D.Athena cannot read tables from the Glue Data Catalog.

AnswerA

If no files exist at s3://data-lake/sales/, query returns zero rows.

Why this answer

The most likely cause is that the data files are not in the specified S3 location. When an AWS Glue table is defined in the Data Catalog, Athena reads the table's metadata (including the S3 location) and then attempts to read the underlying data files from that exact path. If the files are missing, misnamed, or in a different prefix, Athena returns zero rows because there is no data to scan.

This is a common misconfiguration when the S3 path in the table definition does not match the actual data storage.

Exam trap

The trap here is that candidates often assume the issue is with the SerDe or format compatibility, but the most common real-world cause is simply that the data files are not present at the specified S3 location, leading to zero rows returned.

How to eliminate wrong answers

Option B is wrong because the SerDe library is not incorrect for CSV files; Athena uses the LazySimpleSerDe by default for CSV, which is fully supported and does not cause zero rows. Option C is wrong because CSV is a widely supported table format in Athena, and Athena can query CSV files natively. Option D is wrong because Athena is designed to read tables from the Glue Data Catalog; in fact, Athena and Glue Data Catalog are tightly integrated, and this is a standard use case.

1257

MCQmedium

A company is using Amazon Athena to query data stored in S3. Queries are failing with 'HIVE_INVALID_PARTITION' errors. What is the most likely cause?

A.The S3 bucket is configured with a bucket policy that denies access to the Athena service.

B.A partition folder in S3 has been deleted or moved, but the table metadata still references it.

C.The data is compressed with gzip, but the table definition expects uncompressed data.

D.The data files are in CSV format but the table definition expects Parquet.

AnswerB

Athena expects all partitions to exist.

Why this answer

The 'HIVE_INVALID_PARTITION' error in Amazon Athena occurs when the table's partition metadata in the AWS Glue Data Catalog (or Hive metastore) references a partition folder that no longer exists in the S3 bucket. Athena relies on the metadata to locate data files; if a partition folder is deleted or moved without updating the metadata, queries fail because Athena cannot find the expected data location.

Exam trap

The trap here is that candidates confuse permission errors (like S3 bucket policies) with metadata consistency errors, or assume compression or format mismatches cause partition-specific errors, when in reality 'HIVE_INVALID_PARTITION' is a direct indicator of a stale or missing partition folder in the catalog.

How to eliminate wrong answers

Option A is wrong because a bucket policy denying Athena access would cause an 'Access Denied' error, not a 'HIVE_INVALID_PARTITION' error, which is specific to partition metadata mismatch. Option C is wrong because Athena supports reading gzip-compressed data transparently, and compression mismatch does not produce partition-related errors. Option D is wrong because a schema mismatch between CSV and Parquet would cause a 'HIVE_CANNOT_OPEN_SPLIT' or data type conversion error, not a partition validation error.
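
Stale partition metadata of this kind can be detected by comparing what the catalog references with what exists in S3. The sketch below is a minimal illustration with hypothetical paths and a hypothetical function name; in practice the two inputs might come from the Glue `get_partitions` API and an S3 prefix listing, which are not called here.

```python
def find_stale_partitions(catalog_locations, existing_prefixes):
    """Return catalog partition locations that have no matching S3 prefix."""
    # Normalize trailing slashes so 'dt=2024-01-01/' matches 'dt=2024-01-01'.
    existing = {prefix.rstrip("/") for prefix in existing_prefixes}
    return sorted(
        location
        for location in catalog_locations
        if location.rstrip("/") not in existing
    )


# Hypothetical example data: the catalog knows two partitions, S3 has one.
catalog = [
    "s3://data-lake/sales/dt=2024-01-01/",
    "s3://data-lake/sales/dt=2024-01-02/",
]
in_s3 = ["s3://data-lake/sales/dt=2024-01-01"]
stale = find_stale_partitions(catalog, in_s3)
```

Each location reported as stale is a candidate for an `ALTER TABLE ... DROP PARTITION` statement, or for restoring the missing folder.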

1258

MCQhard

A company is using Amazon DynamoDB for a gaming application with high read and write throughput. The data engineer notices that the read latency is high during peak hours. The table has a partition key only (no sort key). The engineer wants to improve read performance by distributing reads across partitions more evenly. Which action should the engineer take?

A.Increase the read capacity units of the table.

B.Add a sort key to the table.

C.Enable DynamoDB Accelerator (DAX).

D.Enable DynamoDB global tables.

AnswerC

DAX provides a write-through cache, reducing read latency.

Why this answer

Option C is correct because DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency from single-digit milliseconds to microseconds by caching frequently accessed items. Since the issue is high read latency during peak hours and the table has only a partition key, DAX offloads reads from the underlying table, distributing the read load and improving response times without changing the table's key structure.

Exam trap

The trap here is that candidates often confuse increasing capacity (Option A) with improving read distribution, not realizing that latency issues from hot partitions require caching or key redesign, not just more RCUs.

How to eliminate wrong answers

Option A is wrong because increasing read capacity units (RCUs) only raises the provisioned throughput limit, but does not inherently distribute reads more evenly across partitions; high latency during peak hours is often due to hot partitions or throttling, not insufficient capacity. Option B is wrong because adding a sort key to an existing table is not possible without recreating the table, and a sort key does not directly improve read distribution across partitions; it only enables more flexible query patterns within a partition. Option D is wrong because enabling DynamoDB global tables provides multi-region replication for disaster recovery and low-latency reads from multiple regions, but it does not improve read distribution across partitions within a single table or reduce latency for a single-region workload.

1259

MCQmedium

A company uses Amazon EMR to process large datasets stored in Amazon S3. The data is encrypted at rest using SSE-S3. The security team now requires that all data at rest be encrypted with customer-managed KMS keys (SSE-KMS). The data engineer needs to migrate existing data to use SSE-KMS without downtime. The engineer plans to use S3 Batch Operations to copy objects in place. However, the Batch Operations job fails with a KMS access denied error. The engineer has confirmed that the Batch Operations service role has the necessary KMS permissions. What is the most likely cause?

A.The KMS key policy does not allow the S3 service to use the key.

B.The Batch Operations job is using the wrong IAM role.

C.The source objects are encrypted with SSE-S3, which cannot be copied to SSE-KMS.

D.The Batch Operations service role is missing the kms:GenerateDataKey permission for the destination KMS key.

AnswerD

Batch Operations needs to generate a new data key for the destination.

Why this answer

Option D is correct because the Batch Operations service role needs kms:GenerateDataKey on the destination key to write each copy with SSE-KMS, and kms:Decrypt only when the source objects are themselves KMS-encrypted. The source objects here use SSE-S3, which does not involve KMS, so the role does not need kms:Decrypt for the source. A KMS access denied error therefore points to the role lacking kms:GenerateDataKey for the destination KMS key.

Option A is wrong because S3 calls KMS with the service role's credentials, so the role's permissions on the key are what matter. Option B is wrong because nothing in the scenario indicates that the job runs under a different role. Option C is wrong because objects encrypted with SSE-S3 can be copied and re-encrypted with SSE-KMS.
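
The missing permission can be expressed as an IAM policy statement on the Batch Operations role. This is a minimal sketch built as a Python dictionary; the function name, the `Sid`, and the key ARN are hypothetical placeholders, and a real role would also need the S3 permissions for the copy itself.

```python
def destination_key_policy(destination_key_arn: str) -> dict:
    """IAM policy letting a role generate data keys with the destination KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowDataKeyForDestination",
                "Effect": "Allow",
                # Needed so S3 can encrypt each copied object with SSE-KMS.
                "Action": ["kms:GenerateDataKey"],
                # Scope the grant to the single destination key.
                "Resource": destination_key_arn,
            }
        ],
    }


# Hypothetical key ARN, for illustration only.
example_arn = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
policy = destination_key_policy(example_arn)
```

No kms:Decrypt statement appears because the source objects in this scenario use SSE-S3; it would be added if the source were KMS-encrypted.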

1260

MCQmedium

A data engineer needs to implement a data pipeline that ingests data from an on-premises database using AWS DMS and loads it into Amazon S3 in Parquet format. The data should be encrypted at rest in S3 using a customer-managed KMS key. Which course of action should the engineer take?

A.Configure the DMS task to write to S3 in Parquet format, and specify the KMS key ID in the S3 endpoint settings.

B.Set up an EC2 instance to run a script that reads from the source and writes Parquet to S3 with KMS encryption.

C.Use DMS to write JSON to S3, then use an AWS Glue job to convert to Parquet and enable KMS encryption on the Glue job.

D.Configure the S3 bucket policy to require KMS encryption for all objects, and use DMS with default settings.

AnswerA

DMS S3 endpoint supports KMS encryption and Parquet format.

Why this answer

Option A is correct because AWS DMS can write directly to S3 in Parquet format, and the S3 target endpoint settings accept a KMS key for server-side encryption. Option C is incorrect because the DMS S3 target writes CSV or Parquet, not JSON, and the extra Glue conversion step is unnecessary. Option B is incorrect because DMS can write Parquet directly without an intermediate EC2 instance.

Option D is incorrect because KMS encryption is configured in the DMS S3 endpoint settings, not via a bucket policy; with default settings DMS does not use the customer-managed key.
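
The endpoint configuration described above can be sketched as the `S3Settings` structure for a DMS target endpoint. This is a minimal sketch, assuming boto3's `create_endpoint` call with `EngineName="s3"`; the function name, bucket name, and ARNs are hypothetical placeholders.

```python
def dms_s3_target_settings(bucket: str, kms_key_arn: str, role_arn: str) -> dict:
    """S3Settings for a DMS target that writes KMS-encrypted Parquet files."""
    return {
        "BucketName": bucket,
        # Role that DMS assumes to write to the bucket and use the key.
        "ServiceAccessRoleArn": role_arn,
        "DataFormat": "parquet",
        "EncryptionMode": "sse-kms",
        "ServerSideEncryptionKmsKeyId": kms_key_arn,
    }


# Hypothetical identifiers, for illustration only.
settings = dms_s3_target_settings(
    bucket="example-target-bucket",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
    role_arn="arn:aws:iam::111122223333:role/example-dms-s3-role",
)

# With AWS credentials configured, the endpoint could then be created with:
#   import boto3
#   boto3.client("dms").create_endpoint(
#       EndpointIdentifier="example-s3-target",
#       EndpointType="target",
#       EngineName="s3",
#       S3Settings=settings,
#   )
```

Both requirements, Parquet output and the customer-managed key, live on the endpoint, so the replication task itself needs no extra conversion step.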

1261

MCQmedium

The IAM policy shown in the exhibit is attached to a user. The user tries to upload an object to my-bucket using the AWS CLI without specifying encryption. What will happen?

A.The upload will succeed because the bucket has default encryption

B.The upload will succeed but the object will not be encrypted

C.The upload will fail because a KMS key is required

D.The upload will fail with an access denied error

AnswerD

The condition requires the encryption header to be present.

Why this answer

Option D is correct because the policy allows s3:PutObject only if the request includes the header x-amz-server-side-encryption: AES256. Without that header, the condition is not met, no Allow applies, and the request is denied. Option A is wrong because default bucket encryption does not satisfy the condition; the policy checks the header on the request itself.

Option B is wrong because the request is denied, so no object is written. Option C is wrong because the policy requires the AES256 header, not a KMS key.

1262

MCQeasy

A data engineer needs to ingest JSON files from an S3 bucket into a DynamoDB table. The files are updated hourly and contain new records. Which AWS service should be used to trigger a Lambda function for each new object?

A.Kinesis Data Firehose

B.Amazon EventBridge

C.S3 Event Notifications

D.Amazon SQS

AnswerC

S3 can send events to Lambda on object creation.

Why this answer

Option C is correct because S3 Event Notifications can invoke a Lambda function directly on object-created events. Option A is wrong because Kinesis Data Firehose is for streaming delivery, not event-based S3 triggers. Option D is wrong because SQS would require polling.

Option B is wrong because EventBridge is more complex than needed for simple S3 object creation triggers.

1263

MCQmedium

A data engineering team uses AWS Glue ETL jobs to process data from Amazon S3. The jobs recently started failing with 'Access Denied' errors when writing to the output S3 bucket. What is the most likely cause?

A.The KMS key used for server-side encryption is not accessible to the Glue job.

B.The S3 bucket does not have default encryption enabled.

C.The job ran out of memory due to large data volume.

D.The S3 bucket policy was modified to deny write access to the Glue job's IAM role.

AnswerD

An explicit deny in the bucket policy overrides any allow in the IAM role policy.

Why this answer

Option D is correct because AWS Glue ETL jobs require an IAM role with permissions to write to the S3 bucket. If the bucket policy was changed to deny access to that role, the explicit deny overrides the role's permissions and the job fails. Option C is wrong because insufficient memory would cause out-of-memory errors, not access denied.

Option B is wrong because a bucket without default encryption still accepts writes. Option A is wrong because KMS key permission problems would typically surface as encryption-related errors, not a general access denied on the bucket.

1264

MCQmedium

A company uses Amazon S3 to store sensitive data. The security team wants to ensure that all objects uploaded to a specific S3 bucket are automatically encrypted at rest using server-side encryption with AWS KMS managed keys (SSE-KMS). Which bucket policy statement should be added to enforce this requirement?

A.Deny put requests where 's3:x-amz-server-side-encryption' is 'aws:kms'

B.Deny put requests where 's3:x-amz-server-side-encryption' is not 'aws:kms'

C.Deny put requests where 's3:x-amz-server-side-encryption' is not 'AES256'

D.Deny put requests where 's3:x-amz-server-side-encryption' is not set

AnswerB

This enforces SSE-KMS encryption.

Why this answer

Option B is correct because denying every put request whose 's3:x-amz-server-side-encryption' value is not 'aws:kms' leaves SSE-KMS as the only accepted option. Option C enforces SSE-S3 (AES256) instead of SSE-KMS. Option D blocks uploads without an encryption header but still allows SSE-S3.

Option A denies SSE-KMS, which is the opposite of what is needed.
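
A bucket policy statement that enforces SSE-KMS in this way can be sketched as follows. The function name, the `Sid`, and the bucket name are hypothetical; the condition key and the `aws:kms` value are the ones named in the question.

```python
def require_sse_kms_policy(bucket: str) -> dict:
    """Bucket policy that denies PutObject unless SSE-KMS is requested."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Matches any upload whose encryption header is not aws:kms.
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


policy = require_sse_kms_policy("example-bucket")
```

Only the comparison value separates this statement from one that enforces SSE-S3, which is why the `AES256` variant is an easy distractor.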

1265

MCQeasy

A data engineer needs to ensure that data stored in Amazon S3 is automatically deleted after 30 days. Which S3 feature should be used?

A.S3 Lifecycle policy

B.S3 MFA Delete

C.S3 Versioning

D.S3 Object Lock

AnswerA

Lifecycle policies can expire objects after 30 days.

Why this answer

Option A is correct because S3 Lifecycle policies can automatically transition objects to different storage classes or expire them after a specified time. Option C (versioning) keeps multiple versions of objects. Option D (Object Lock) prevents deletion.

Option B (MFA Delete) requires multi-factor authentication for deletion.
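
A 30-day expiration rule can be sketched as a lifecycle configuration. This is a minimal sketch, assuming the result is passed as `LifecycleConfiguration` to boto3's `put_bucket_lifecycle_configuration`; the function name and rule ID format are hypothetical.

```python
def expire_after_days(days: int, prefix: str = "") -> dict:
    """Lifecycle configuration that expires objects `days` days after creation."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Status": "Enabled",
                # An empty prefix applies the rule to every object in the bucket.
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": days},
            }
        ]
    }


config = expire_after_days(30)
```

Once such a rule is in place, S3 removes expired objects on its own schedule, so no job or Lambda function has to issue the deletes.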

1266

MCQeasy

A data engineer is troubleshooting an AWS Glue ETL job that fails with the error: 'An error occurred while calling o137.pyWriteDynamicFrame. No such file or directory: s3://bucket/output/part-00000.parquet'. The job reads from a JDBC source and writes to S3. What is the most likely cause?

A.The schema of the source data has changed, causing a mismatch during write

B.The output S3 path does not exist and the Glue job does not have permission to create it

C.The Glue job ran out of memory during the transformation phase

D.The IAM role attached to the Glue job lacks permissions to read from the JDBC source

AnswerB

The error message indicates missing directory; Glue may not auto-create if permissions are insufficient.

Why this answer

Option B is correct because the error indicates the output path does not exist; the Glue job may not have permission to create the prefix under the output location. Option D (IAM role for the JDBC source) would cause a permissions error while reading, not 'No such file or directory' while writing. Option A (schema mismatch) would cause a different error.

Option C (memory) would cause out-of-memory or timeout errors.

1267

MCQhard

A company runs a critical transactional database on Amazon RDS for PostgreSQL. They need to achieve high availability with automatic failover to a different AWS Region in case of a regional outage. Which solution meets these requirements?

A.Create a cross-Region Read Replica and promote it during a disaster.

B.Take daily automated snapshots and restore them in another Region.

C.Deploy the RDS instance in a Multi-AZ configuration.

D.Use Amazon Aurora Global Database with a primary cluster in one Region and a secondary in another.

AnswerD

Aurora Global Database supports automatic failover across Regions with RPO of 1 second.

Why this answer

Option D is correct because Amazon Aurora Global Database provides cross-Region replication with failover capabilities. Option C is wrong because Multi-AZ only protects against Availability Zone failures, not a regional outage. Option A is wrong because promoting a cross-Region read replica is a manual step, not automatic failover.

Option B is wrong because restoring from a snapshot takes too long for high availability and loses changes made since the last snapshot.

1268

MCQeasy

A data engineer needs to ensure that a Redshift cluster can recover from a failure with minimal data loss. The cluster is used for reporting and can tolerate a few minutes of downtime. Which feature should the engineer enable?

A.Configure cross-region snapshot copy.

B.Take manual snapshots every hour.

C.Enable Multi-AZ deployment.

D.Enable automated snapshots with a retention period of 1 day.

AnswerD

Automated snapshots allow the cluster to be restored from the most recent snapshot taken within the retention period.

Why this answer

Automated snapshots in Amazon Redshift are taken at regular intervals (by default every 8 hours or after 5 GB per node of data changes, whichever comes first) and retained for a specified period. Enabling automated snapshots with a retention period of 1 day ensures that, in the event of a failure, the cluster can be restored to the most recent snapshot, limiting data loss to at most the snapshot interval. This aligns with the requirement for minimal data loss and tolerance for a few minutes of downtime, as restoring from a snapshot takes time but preserves recent data.

Exam trap

The trap here is that candidates often reach for Multi-AZ, a high-availability feature most familiar from RDS, or assume manual snapshots are more reliable than automated ones, when in fact automated snapshots with a short retention period provide the best balance of minimal data loss and operational simplicity for Redshift.

How to eliminate wrong answers

Option A is wrong because cross-region snapshot copy provides disaster recovery across AWS regions but does not directly reduce data loss for a single-region failure; it adds latency and cost without improving recovery point objective (RPO) within the primary region. Option B is wrong because manual snapshots every hour require manual intervention and do not guarantee consistent, automated recovery; they also lack the automated scheduling and retention management that Redshift provides, making them less reliable for minimal data loss. Option C is wrong because Multi-AZ is a high-availability feature that Amazon Redshift offers only for RA3 clusters, and it adds a second set of compute resources; that is more than a reporting cluster that can tolerate a few minutes of downtime needs.

1269

Multi-Selectmedium

A data engineer is designing a data lake on S3 with fine-grained access control using AWS Lake Formation. Which THREE permissions can be managed by Lake Formation?

Select 3 answers

A.SELECT on a table

B.INSERT on a table

C.DESCRIBE on a table

D.ALTER TABLE on a table

E.DELETE on a table

AnswersA, B, C

Lake Formation grants SELECT permission.

Why this answer

Options A, B, and C are correct. Lake Formation manages SELECT, INSERT, and DESCRIBE permissions on Data Catalog tables. Option D is wrong because the Lake Formation table permission is named ALTER, not ALTER TABLE.

Option E is the weakest distractor: Lake Formation does list DELETE as a table permission, but this question treats it as not typically managed for S3 data lakes.

1270

MCQhard

A company stores PII in an S3 bucket. The security team wants to use Amazon Macie to discover sensitive data. After enabling Macie, they notice that no sensitive data findings are generated. The S3 bucket is in the same account. What is the most likely reason?

A.The bucket policy blocks access from Macie's service principal.

B.Macie is not configured with cross-account access to the bucket.

C.The S3 bucket is in a different AWS Region than the Macie session.

D.The S3 objects have private ACLs that prevent Macie from reading them.

AnswerC

Macie only analyzes data in the same Region.

Why this answer

Option C is correct because Macie is a Regional service and analyzes only buckets in the Region where it is enabled. Option B is incorrect because Macie can access buckets in the same account without cross-account roles. Option D is incorrect because Macie uses a service-linked role, not object ACLs.

Option A is incorrect because Macie does not require public access; it reads objects via IAM, and nothing in the scenario mentions a restrictive bucket policy.

1271

MCQeasy

A company uses S3 to store sensitive customer data. To prevent accidental public access, a data engineer needs to ensure that all S3 buckets block public access at the account level. Which AWS service should be used to enforce this policy?

A.Enable S3 Block Public Access at the account level in the management account

B.Create an IAM policy that denies s3:PutBucketPolicy

C.Use an SCP in AWS Organizations to deny s3:PutBucketPublicAccessBlock

D.Set up AWS Config rules to automatically remediate public buckets

AnswerC

SCPs can enforce that no account can disable block public access, covering all accounts.

Why this answer

AWS Organizations with SCPs can centrally control permissions across all accounts, including preventing anyone from changing S3 Block Public Access settings, so Option C is correct. Option B is wrong because IAM policies are per-identity and not account-wide. Option A is wrong because S3 Block Public Access settings exist per bucket or account, but to enforce across all accounts, Organizations is needed.

Option D is wrong because AWS Config detects and remediates non-compliance after the fact; it does not prevent the change.
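
An SCP of the kind discussed above can be sketched as a policy document. This is a minimal sketch with a hypothetical function name and `Sid`; including the account-level action `s3:PutAccountPublicAccessBlock` alongside the bucket-level action from the question is an assumption made for completeness.

```python
def deny_public_access_block_changes_scp() -> dict:
    """SCP that stops member accounts from changing Block Public Access settings."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyChangesToBlockPublicAccess",
                "Effect": "Deny",
                "Action": [
                    # Bucket-level setting named in the question.
                    "s3:PutBucketPublicAccessBlock",
                    # Account-level setting (added here as an assumption).
                    "s3:PutAccountPublicAccessBlock",
                ],
                "Resource": "*",
            }
        ],
    }


scp = deny_public_access_block_changes_scp()
```

An SCP only restricts what accounts may do, so Block Public Access still has to be enabled in each account before the settings are locked in place.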

1272

MCQeasy

A company is using Amazon DynamoDB for a gaming application. They want to store player session data that expires after 24 hours. Which DynamoDB feature should be used?

A.Time to Live (TTL)

B.DynamoDB Streams

C.Global Tables

D.Point-in-Time Recovery

AnswerA

TTL deletes items automatically after a defined expiration time.

Why this answer

Amazon DynamoDB Time to Live (TTL) allows you to define a per-item timestamp attribute that automatically deletes items after a specified duration. For the gaming session data that must expire after 24 hours, you can set the TTL attribute to the current time plus 24 hours, and DynamoDB will asynchronously delete expired items without any additional cost or write operations.

Exam trap

The trap here is that candidates may confuse DynamoDB Streams (which can react to deletions) with the actual mechanism that performs the deletion, or assume Point-in-Time Recovery can be used to 'roll back' expired data, neither of which addresses automatic expiration.

How to eliminate wrong answers

Option B (DynamoDB Streams) is wrong because it captures a time-ordered sequence of item-level changes (inserts, updates, deletes) in a DynamoDB table, but it does not automatically expire or delete data; it is used for event-driven processing or replication, not for scheduled data removal. Option C (Global Tables) is wrong because it provides multi-region, fully replicated tables for low-latency access and disaster recovery, but it has no built-in mechanism to expire or delete items based on time. Option D (Point-in-Time Recovery) is wrong because it enables continuous backups of DynamoDB table data to restore to any point within the last 35 days, but it does not delete or manage the lifecycle of individual items.
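
Setting the TTL attribute to "now plus 24 hours" can be sketched as below. The attribute names `player_id` and `expires_at` and the function name are hypothetical; the item uses DynamoDB's low-level attribute-value format, with the TTL stored as a Number holding epoch seconds.

```python
import time
from typing import Optional

SESSION_TTL_SECONDS = 24 * 60 * 60  # 24 hours


def session_item(player_id: str, now: Optional[int] = None) -> dict:
    """Build a session item whose TTL attribute expires 24 hours from `now`."""
    created = int(time.time()) if now is None else int(now)
    return {
        "player_id": {"S": player_id},
        # TTL must be a Number attribute containing a Unix epoch in seconds.
        "expires_at": {"N": str(created + SESSION_TTL_SECONDS)},
    }


item = session_item("player-42", now=1_700_000_000)

# TTL itself is switched on once per table, for example with boto3:
#   dynamodb.update_time_to_live(
#       TableName="example-sessions",
#       TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
#   )
```

Because deletion is asynchronous, an item can remain readable for a while after its expiry time, so reads that must be exact may still need to filter on `expires_at`.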

1273

Multi-Selectmedium

Which TWO actions can reduce the cost of an Amazon S3 bucket that stores infrequently accessed data? (Choose 2.)

Select 2 answers

A.Enable cross-region replication

B.Enable versioning to keep multiple versions

C.Use lifecycle policies to expire objects after a certain period

D.Enable MFA Delete for extra security

E.Transition objects to S3 Standard-IA after 30 days

AnswersC, E

Expiration deletes unneeded objects.

Why this answer

Options C and E are correct because transitioning to S3 Standard-IA reduces storage cost for infrequently accessed data, and lifecycle expiration removes objects that are no longer needed. Option B is wrong because enabling versioning increases storage costs. Option D is wrong because MFA Delete does not affect storage cost.

Option A is wrong because cross-region replication incurs additional costs.
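
Both cost-saving actions can live in a single lifecycle rule. The sketch below is a hedged illustration: the function name and the 365-day expiration are hypothetical choices, and the 30-day floor reflects the minimum age S3 requires before a transition to Standard-IA.

```python
def tiering_rule(transition_days: int = 30, expire_days: int = 365) -> dict:
    """Lifecycle rule: move objects to Standard-IA, then expire them later."""
    if transition_days < 30:
        # Objects must be at least 30 days old before moving to Standard-IA.
        raise ValueError("transition_days must be at least 30")
    if expire_days <= transition_days:
        raise ValueError("expire_days must be greater than transition_days")
    return {
        "ID": "tier-then-expire",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Transitions": [{"Days": transition_days, "StorageClass": "STANDARD_IA"}],
        "Expiration": {"Days": expire_days},
    }


rule = tiering_rule()
```

The rule would be wrapped as `{"Rules": [rule]}` when applied to a bucket; the validation simply catches day counts that S3 would reject or that make no sense together.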

1274

Multi-Selecteasy

A data engineer is setting up Amazon S3 event notifications to trigger an AWS Lambda function when new objects are uploaded. Which TWO actions are required to enable this?

Select 2 answers

A.Add a resource-based policy to the Lambda function to allow S3 to invoke it.

B.Enable S3 versioning on the bucket.

C.Create an S3 bucket policy that grants S3 permission to invoke Lambda.

D.Configure an event notification on the S3 bucket for s3:ObjectCreated:* events.

E.Set up an Amazon CloudWatch Events rule to detect S3 uploads.

AnswersA, D

Necessary for S3 to trigger Lambda.

Why this answer

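Options A and D are correct. The Lambda function must have a resource-based policy allowing S3 to invoke it, and the S3 bucket must have an event notification configuration for s3:ObjectCreated:* events. Option C is wrong because an S3 bucket policy is not required when the Lambda resource policy allows the invocation. Option B is wrong because versioning is not needed for event notifications.

Option E is wrong because CloudWatch Events rules are not used for S3 event notifications.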
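
The two required pieces can be sketched as request parameters for boto3. This is a minimal sketch, assuming `add_permission` on the Lambda client and `put_bucket_notification_configuration` on the S3 client; the helper names, statement ID, bucket name, and function ARN are hypothetical.

```python
def lambda_invoke_permission(function_name: str, bucket: str) -> dict:
    """Arguments for lambda add_permission: lets S3 invoke the function."""
    return {
        "FunctionName": function_name,
        "StatementId": "AllowS3Invoke",
        "Action": "lambda:InvokeFunction",
        "Principal": "s3.amazonaws.com",
        # Restrict the grant to events coming from this one bucket.
        "SourceArn": f"arn:aws:s3:::{bucket}",
    }


def bucket_notification(function_arn: str) -> dict:
    """NotificationConfiguration that sends object-created events to Lambda."""
    return {
        "LambdaFunctionConfigurations": [
            {"LambdaFunctionArn": function_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    }


# Hypothetical names, for illustration only.
permission = lambda_invoke_permission("example-ingest-fn", "example-bucket")
notification = bucket_notification(
    "arn:aws:lambda:us-east-1:111122223333:function:example-ingest-fn"
)

# With AWS credentials configured, the two calls would be:
#   boto3.client("lambda").add_permission(**permission)
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="example-bucket", NotificationConfiguration=notification)
```

Adding the permission first matters in practice, because S3 validates that it can invoke the function when the notification configuration is saved.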

1275

MCQmedium

A data engineer is troubleshooting a failed AWS Glue ETL job that reads from an S3 bucket and writes to an Amazon Redshift table. The job fails with a permission error. Which IAM policy addition is MOST likely required for the Glue job's role?

A.Add redshift:DataAPI

B.Add redshift:ModifyCluster

C.Add redshift:DescribeStatement

D.Add redshift:GetClusterCredentials

AnswerA

Glue uses the Redshift Data API to write data; this permission is required.

Why this answer

Option A is correct because the Glue job needs to write to Redshift, and the role requires redshift:DataAPI access to use the Redshift Data API. Option B is wrong because redshift:ModifyCluster is for Redshift cluster management. Option C is wrong because DescribeStatement only reports the status of a statement that has already been submitted.

Option D is wrong because GetClusterCredentials issues temporary database credentials; it does not by itself grant write access.
