AWS Certified Data Engineer Associate (DEA-C01): Questions 526–600

1786 questions total · 24 pages · All types, answers revealed

526

MCQhard

An e-commerce company uses AWS Glue to run ETL jobs that transform clickstream data from Amazon S3. The job reads Parquet files, performs aggregations, and writes the results to Amazon Redshift. The job runs successfully but takes longer than expected. The data volume is increasing. Which design change would MOST improve the job's performance?

A.Write the aggregated results to a single large file instead of multiple partitions.

B.Convert the Parquet files to CSV to simplify the schema.

C.Replace the Redshift target with Amazon Redshift Spectrum.

D.Increase the number of Glue worker nodes (DPUs) for the job.

AnswerD

More workers parallelize tasks and reduce runtime.

Why this answer

Increasing the number of Glue worker nodes (DPUs) directly scales the distributed processing capacity of the ETL job, allowing it to process larger volumes of Parquet data in parallel. This is the most straightforward way to reduce execution time when data volume is growing, as AWS Glue automatically partitions the workload across the additional workers.

Exam trap

The trap here is that candidates assume increasing DPUs always increases cost without considering that the job's runtime reduction often lowers total cost, and they mistakenly choose a data format or target change that does not address the core parallelism issue.

How to eliminate wrong answers

Option A is wrong because writing to a single large file eliminates parallelism in downstream reads and can cause bottlenecks in Redshift's COPY operation, which benefits from multiple files for concurrent loading. Option B is wrong because converting Parquet to CSV increases file size and I/O overhead due to lack of columnar compression and predicate pushdown, degrading performance. Option C is wrong because replacing Redshift with Redshift Spectrum would offload query processing to S3 but does not address the ETL job's performance bottleneck; the job still writes to Redshift, and Spectrum is a query engine, not a write target.

527

MCQmedium

An Amazon Kinesis Data Streams application is lagging behind. The data records are small (1 KB) and the shard count is 10. The consumer uses the KCL with default configuration. Which action will MOST effectively reduce the consumer lag?

A.Increase the number of KCL workers per shard (e.g., 2 workers per shard).

B.Use Enhanced Fan-Out to provide dedicated throughput.

C.Increase the number of shards to 20.

D.Reduce the record size by compressing the data.

AnswerA

More workers can process records concurrently, reducing lag.

Why this answer

Option A is correct because the KCL (Kinesis Client Library) uses a single worker per shard by default, and each worker processes records sequentially within that shard. Increasing the number of workers per shard (e.g., 2 workers) allows parallel processing of the same shard’s records, directly reducing consumer lag when records are small (1 KB) and the bottleneck is CPU or processing time per record, not throughput limits.

Exam trap

The trap here is that candidates often assume increasing shards (Option C) always reduces lag, but they miss that KCL workers are per-shard by default, so more shards only help if the shard is saturated with data, not when the consumer is slow at processing each record.

How to eliminate wrong answers

Option B is wrong because Enhanced Fan-Out provides dedicated throughput per consumer (up to 2 MB/s per shard per consumer), but the issue here is processing lag, not throttling or throughput limits—the default KCL already handles the 1 KB records easily, so dedicated throughput does not address the processing bottleneck. Option C is wrong because increasing shards to 20 would increase the number of parallel processing units, but each shard still has only one KCL worker by default, so the per-shard processing capacity remains unchanged; this would only help if the shard were overloaded with data, which is not the case with small records. Option D is wrong because compressing data reduces the size of records, but the records are already only 1 KB, and the bottleneck is processing time per record, not network or storage throughput; compression adds CPU overhead and does not reduce lag.

528

Multi-Selecthard

A company is ingesting streaming data from social media feeds using Amazon Kinesis Data Streams. The data is consumed by multiple applications: one for real-time sentiment analysis and another for archival to S3. The data must be processed in order for each social media post. Which TWO approaches meet the requirements? (Choose TWO.)

Select 2 answers

A.Use Amazon Kinesis Data Firehose to buffer and deliver to S3

B.Use Amazon SQS FIFO queues between the stream and consumers

C.Use a single shard in the Kinesis Data Streams and have all consumers read from that shard

D.Use a partition key that ensures related records go to the same shard

E.Use multiple shards and assign each consumer to a specific shard

AnswersC, D

Single shard guarantees ordering.

Why this answer

Option C is correct because a single shard preserves the order of all records in the stream. Option D is correct because a partition key that groups related records routes them to the same shard, which preserves their order. Option E (multiple shards with one consumer per shard) does not guarantee global ordering.

Option A (Firehose) does not guarantee ordering. Option B (SQS FIFO) can guarantee order but adds another service.
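To make the partition-key approach concrete, here is a minimal sketch of how a key ends up on a shard. Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and places the record on the shard whose hash key range contains that value. The even split of the hash space, the shard count of 4, and the `post-…` keys are assumptions made for illustration.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit integers


def hash_key(partition_key: str) -> int:
    """MD5-hash the partition key into a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: the shards split the hash space into equal, contiguous
    ranges. Streams that have been resharded can have uneven ranges.
    """
    range_size = HASH_SPACE // shard_count
    return min(hash_key(partition_key) // range_size, shard_count - 1)


# Hypothetical post IDs used as partition keys.
events = [("post-1", "created"), ("post-2", "created"), ("post-1", "liked")]

# Group events by the shard they would land on, keeping arrival order.
placement = {}
for post_id, action in events:
    placement.setdefault(shard_for(post_id, 4), []).append((post_id, action))
```

Because the mapping depends only on the key, every event for `post-1` lands on the same shard in arrival order, while different posts can spread across shards.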

529

MCQeasy

A data engineer needs to audit who accessed specific objects in an S3 bucket over the past 30 days. Which AWS service should be used?

A.AWS Config

B.Amazon CloudWatch Logs

C.Amazon S3 server access logs

D.AWS CloudTrail

AnswerC

These logs record access to objects.

Why this answer

Option C is correct. S3 server access logs record object-level access. Option D is wrong because CloudTrail logs API calls, but S3 server access logs provide detailed object access.

Option A is wrong because Config records resource changes, not access. Option B is wrong because CloudWatch Logs can store logs but not generate them.

530

MCQhard

Refer to the exhibit. A company has an S3 bucket 'my-data-lake' with the lifecycle policy shown. Objects under the 'logs/' prefix are being moved to GLACIER after 30 days and expire after 365 days. A data engineer notices that objects older than 365 days are still present in the bucket and are not being deleted. What is the most likely cause?

A.Lifecycle expiration does not apply to objects in GLACIER storage class

B.The rule status is disabled

C.The prefix filter does not match the objects

D.The expiration days count from the transition date, not the object creation date

AnswerA

Objects in GLACIER cannot be expired; they must be restored first.

Why this answer

Option A is correct because objects in GLACIER storage class can only be expired if they are restored first, or the expiration action must be applied to the current version. The policy does not specify a filter for current version, and GLACIER objects are not deletable by lifecycle without restoration. Option D is wrong because 365 days have passed.

Option C is wrong because the prefix is correct. Option B is wrong because the rule is enabled.
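The exhibit is not reproduced on this page. As a rough guess at what a rule matching the description looks like, the sketch below builds a lifecycle configuration in the shape accepted by the S3 `PutBucketLifecycleConfiguration` API. The rule ID and the sample object keys are hypothetical, and `rule_applies` is an illustrative helper for checking the rule status and prefix filter, not an AWS function.

```python
# Assumed reconstruction of the rule described in the question.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}


def rule_applies(rule: dict, object_key: str) -> bool:
    """True when the rule is enabled and its prefix filter matches the key."""
    prefix = rule.get("Filter", {}).get("Prefix", "")
    return rule["Status"] == "Enabled" and object_key.startswith(prefix)


rule = lifecycle_configuration["Rules"][0]
checks = {key: rule_applies(rule, key) for key in ("logs/2023/app.log", "data/app.log")}
```

Running the two checks rules out a disabled rule or a mismatched prefix before looking at other causes.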

531

Multi-Selecthard

A company is using Amazon Redshift for a data warehouse. The data engineer needs to improve query performance for a table that is frequently joined with other tables on a specific column. Which THREE actions would help improve join performance? (Choose THREE.)

Select 3 answers

A.Set the distribution style to KEY on the join column

B.Apply a SORTKEY on the join column

C.Use DISTKEY on the join column to co-locate data

D.Use DISTSTYLE ALL to replicate the table to all nodes

E.Change the column data type to a fixed-length CHAR

AnswersA, B, C

This is the same as option C; it co-locates data for joins.

Why this answer

Setting the distribution style to KEY on the join column (option A) ensures that rows with the same join key value are co-located on the same compute node. This allows Redshift to perform a collocated join, avoiding the expensive redistribution of data across the network during query execution, which significantly improves join performance.

Exam trap

The trap here is that candidates often confuse DISTSTYLE ALL (option D) as a required action for join performance, but the question asks for three specific actions, and DISTSTYLE ALL is a valid but separate optimization not listed among the correct three; the exam expects you to recognize that A, B, and C are the correct trio, with D being a distractor that is also correct in isolation but not part of the required set.

532

MCQeasy

A company wants to store data from thousands of IoT devices with varying data rates. The data must be stored in a schema-on-read fashion and support SQL queries. Which AWS service should be used?

A.Amazon RDS for MySQL

B.Amazon S3 with Amazon Athena

C.Amazon DynamoDB

D.Amazon Redshift

AnswerB

S3 provides scalable storage, and Athena enables SQL queries with schema-on-read.

Why this answer

Option B is correct because Amazon Athena allows querying data directly from S3 using SQL, supporting schema-on-read. Option C is wrong because DynamoDB is NoSQL and does not support SQL directly. Option A is wrong because RDS is schema-on-write.

Option D is wrong because Redshift is schema-on-write.

533

Multi-Selecteasy

Which TWO of the following are features of Amazon RDS Multi-AZ deployments? (Choose 2.)

Select 2 answers

A.Read replicas in the same region for offloading read traffic.

B.Automatic failover to a standby instance in case of an AZ failure.

C.A standby instance that is not accessible for reads or writes.

D.Automatic storage scaling based on usage.

E.Synchronous replication across AWS Regions.

AnswersB, C

Multi-AZ automatically fails over to the standby in another AZ.

Why this answer

Option B is correct because Multi-AZ provides automatic failover. Option C is correct because the standby instance, which runs in a different AZ, does not serve read or write traffic. Option A is wrong because read replicas are separate from Multi-AZ.

Option E is wrong because Multi-AZ replicates across Availability Zones within one Region, not across Regions. Option D is wrong because Multi-AZ does not automatically scale storage.

534

MCQmedium

Refer to the exhibit. An IAM policy allows kms:Decrypt and kms:GenerateDataKey on a specific KMS key. A data engineer is unable to upload an object to an S3 bucket that uses SSE-KMS with that key. What is the MOST likely missing permission?

A.kms:Decrypt permission on the key.

B.s3:PutObject permission on the bucket.

C.kms:Encrypt permission on the key.

D.kms:CreateGrant permission on the key.

AnswerB

The user needs S3 permission to upload objects.

Why this answer

Option B is correct because to upload an object with SSE-KMS, the user needs kms:GenerateDataKey, which is already allowed, but also s3:PutObject permission on the bucket. Option A is wrong because kms:Decrypt is already allowed. Option D is wrong because kms:CreateGrant is not required for uploading.

Option C is wrong because kms:Encrypt is not required; GenerateDataKey is sufficient.
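A minimal sketch of an identity policy that covers both sides of an SSE-KMS upload, written as a Python dictionary so it can be inspected or serialized with `json.dumps`. The bucket name, account ID, and key ID are placeholders, and `allowed_actions` is an illustrative helper rather than part of any AWS SDK.

```python
import json

BUCKET = "example-bucket"  # hypothetical bucket name
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"  # placeholder

upload_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # S3 side: permission to write objects into the bucket.
            "Sid": "AllowUpload",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            # KMS side: the two actions the question says are already granted.
            "Sid": "AllowKeyUse",
            "Effect": "Allow",
            "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
            "Resource": KEY_ARN,
        },
    ],
}


def allowed_actions(policy: dict) -> set:
    """Collect every action named in an Allow statement."""
    actions = set()
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        listed = statement["Action"]
        actions.update([listed] if isinstance(listed, str) else listed)
    return actions


policy_json = json.dumps(upload_policy, indent=2)
```

With only the second statement present, the KMS calls would succeed but the upload itself would still be rejected for lack of `s3:PutObject`.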

535

MCQhard

A company runs a data lake on Amazon S3 with AWS Lake Formation for access control. The data lake contains sensitive customer information. A data scientist needs to query the data using Amazon Athena. The data scientist has been granted SELECT permission on the database and tables via Lake Formation. However, when the data scientist runs a query in Athena, they receive an error: 'Access denied. Please check your permissions.' The IAM role used by Athena has the following permissions: s3:GetObject, s3:ListBucket, and lakeformation:GetDataAccess. The Lake Formation admin has verified that the data scientist is a member of a Lake Formation data lake location and has been granted 'Describe' and 'Select' permissions on the table. What is the most likely reason for the access denied error?

A.The data scientist is not assigned to the correct Lake Formation tag.

B.The S3 bucket policy does not grant the Athena IAM role access to the S3 location.

C.The Athena IAM role is missing lakeformation:GetEffectivePermissions permission.

D.The data scientist's IAM user lacks the necessary S3 permissions.

AnswerB

Lake Formation permissions are separate from S3 bucket policies; the bucket policy must allow the IAM role to read the data.

Why this answer

Option B is correct because Lake Formation requires the IAM role to have lakeformation:GetDataAccess permission, but the role also needs permissions to the underlying S3 location. If the S3 bucket policy does not allow the Athena IAM role to access the bucket, the request will be denied. The data scientist is granted permissions in Lake Formation, but the S3 bucket policy must also grant access to the IAM role.

Option C is wrong because the role already has lakeformation:GetDataAccess. Option A is wrong because the error is not about Lake Formation tag access. Option D is wrong because Athena uses the IAM role's permissions, not the user's IAM permissions directly.

536

MCQeasy

A company needs to ingest data from multiple SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for centralized analytics. The data volume is several GB per day. Which AWS service is most suitable for this ingestion?

A.Amazon Kinesis Data Firehose

B.AWS Glue

C.Amazon Athena

D.AWS Data Pipeline

AnswerB

Glue can connect to SaaS sources via JDBC and perform ETL to S3.

Why this answer

Option B (AWS Glue) is correct because Glue can connect to various data sources using crawlers and ETL jobs, and it supports JDBC connections to SaaS databases. Option C (Amazon Athena) is for querying, not ingestion. Option A (Amazon Kinesis Data Firehose) is for streaming, not batch from SaaS.

Option D (AWS Data Pipeline) is older and less flexible than Glue.

537

Multi-Selecthard

A data pipeline uses AWS Glue to process large CSV files. The team notices that some jobs fail with out-of-memory errors. Which TWO configuration changes can help mitigate this issue?

Select 2 answers

A.Reduce the number of DPUs to limit concurrency.

B.Increase the number of DPUs for the Glue job.

C.Enable Glue job autoscaling.

D.Convert input files from CSV to Parquet.

E.Enable job bookmarks.

AnswersB, C

More DPUs provide more memory.

Why this answer

Options B and C are correct: increasing DPUs and enabling autoscaling provide more memory. Option A (reducing DPUs) would worsen the problem. Option D (conversion to Parquet) may reduce memory but is not a direct configuration change for the Glue job.

Option E (job bookmarks) does not affect memory.

538

MCQmedium

A data engineer is designing a data store for a real-time leaderboard application that requires sub-millisecond read and write latency. The leaderboard stores scores for millions of users and needs to be sorted by score. Which AWS service should the engineer use?

A.Amazon RDS for PostgreSQL with an index on score

B.Amazon DynamoDB with a global secondary index on score

C.Amazon ElastiCache for Redis with a sorted set

D.Amazon Neptune with a graph model

AnswerC

Redis sorted sets provide O(log N) operations and sub-millisecond latency.

Why this answer

Option C is correct because ElastiCache for Redis with sorted sets provides sub-millisecond latency and sorted data. Option B (DynamoDB) is fast but not designed for sorted sets. Option A (RDS) is slower.

Option D (Neptune) is a graph database.

539

Multi-Selecthard

Which THREE steps are recommended for migrating an on-premises Oracle database to Amazon RDS for Oracle with minimal downtime? (Choose 3.)

Select 3 answers

A.Set up a VPN or Direct Connect between on-premises and AWS

B.Disable archiving on the source database

C.Use AWS Schema Conversion Tool (SCT) to convert the schema

D.Perform a full load migration without change data capture

E.Use AWS Database Migration Service (DMS) for ongoing replication

AnswersA, C, E

Secure connectivity is essential.

Why this answer

Options A, C, and E are correct because DMS supports ongoing replication, SCT assesses the schema, and setting up a VPN ensures secure connectivity. Option D is wrong because a full load without CDC does not minimize downtime. Option B is wrong because disabling archiving prevents point-in-time recovery.

540

MCQeasy

A company needs to ingest data from an on-premises Oracle database into Amazon Redshift for analytics. The data volume is 500 GB and the network bandwidth is limited. Which AWS service should be used for the initial one-time data migration?

A.AWS Snowball

B.AWS Direct Connect

C.Amazon S3 Transfer Acceleration

D.AWS Database Migration Service (DMS)

AnswerA

Snowball allows physical transfer of data, bypassing network limitations.

Why this answer

AWS Snowball is ideal for large data transfers over limited network bandwidth. It allows physical shipping of storage devices. AWS DMS can be used for ongoing replication, but for initial large volume with limited bandwidth, Snowball is more efficient.

S3 Transfer Acceleration speeds up transfers, but still relies on network. Direct Connect improves network but may still be insufficient for 500 GB.

541

Multi-Selecteasy

A data engineer is setting up Amazon S3 bucket policies for a data lake. The security team requires that all objects uploaded to the bucket be encrypted at rest using server-side encryption. Which TWO methods can enforce encryption at upload time?

Select 2 answers

A.Enable S3 Transfer Acceleration.

B.Enable AWS CloudTrail to monitor uploads.

C.Enable AWS KMS automatic key rotation.

D.Enable S3 default encryption on the bucket.

E.Create a bucket policy that denies PutObject if the x-amz-server-side-encryption header is missing.

AnswersD, E

Default encryption automatically encrypts objects.

Why this answer

Option D is correct because enabling S3 default encryption on the bucket automatically applies server-side encryption (SSE-S3 or SSE-KMS) to all objects uploaded without an encryption header, ensuring encryption at rest. Option E is correct because a bucket policy with a Deny effect on PutObject when the x-amz-server-side-encryption header is missing enforces encryption at upload time by rejecting unencrypted uploads, providing a complementary enforcement mechanism.

Exam trap

The trap here is that candidates often confuse default encryption (which applies encryption automatically but does not block unencrypted uploads) with a bucket policy that explicitly denies unencrypted uploads, thinking either alone is sufficient, when both are needed for full enforcement.
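As an illustration of the bucket-policy approach, the sketch below builds a Deny statement that uses the `Null` condition operator to reject `PutObject` requests lacking the `x-amz-server-side-encryption` header. The bucket name is hypothetical, and `would_deny` is a toy check for this one condition, not a model of how IAM evaluates policies in general.

```python
BUCKET = "example-data-lake"  # hypothetical bucket name

deny_unencrypted_uploads = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUploadsWithoutSseHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            # "true" means: match when the header is absent from the request.
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        }
    ],
}


def would_deny(policy: dict, request_headers: dict) -> bool:
    """Toy evaluation of Deny statements that carry a Null condition."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Deny":
            continue
        null_checks = statement.get("Condition", {}).get("Null", {})
        for condition_key, must_be_absent in null_checks.items():
            header = condition_key.split(":", 1)[1]
            is_absent = header not in request_headers
            if is_absent == (must_be_absent == "true"):
                return True
    return False


plain_upload = would_deny(deny_unencrypted_uploads, {})
encrypted_upload = would_deny(
    deny_unencrypted_uploads, {"x-amz-server-side-encryption": "AES256"}
)
```

Here `plain_upload` is `True` (the request is rejected) and `encrypted_upload` is `False`, which is the behavior the Deny statement is meant to enforce.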

542

MCQmedium

A data engineer applies the bucket policy shown in the exhibit to an S3 bucket. The bucket contains sensitive data that must be encrypted at rest and accessed only over HTTPS. Which of the following statements is true?

A.The policy allows both HTTP and HTTPS access.

B.The policy allows anonymous access to list objects in the bucket.

C.The policy enforces that all PutObject requests must include the x-amz-server-side-encryption header with value AES256.

D.The policy requires the use of AWS KMS for server-side encryption.

AnswerC

The Allow statement requires the condition s3:x-amz-server-side-encryption equals AES256 for PutObject.

Why this answer

Option C is correct because the bucket policy includes a condition that denies PutObject requests unless the `s3:x-amz-server-side-encryption` header is present and set to `AES256`. This enforces server-side encryption with S3-managed keys (SSE-S3) for all uploads, ensuring data at rest is encrypted.

Exam trap

AWS often tests the distinction between SSE-S3 (`AES256`) and SSE-KMS (`aws:kms`) in bucket policy conditions, and candidates may mistakenly think the policy requires KMS when it actually specifies AES256.

How to eliminate wrong answers

Option A is wrong because the policy includes a `Deny` statement that blocks requests when `aws:SecureTransport` is `false`, which effectively denies HTTP access and allows only HTTPS. Option B is wrong because the policy does not grant any `s3:ListBucket` permission to anonymous principals; it only denies requests that fail encryption or transport conditions, but does not allow anonymous listing. Option D is wrong because the policy requires the `x-amz-server-side-encryption` header with value `AES256`, which corresponds to SSE-S3, not AWS KMS (which would require `aws:kms`).
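The exhibit is not shown on this page. The sketch below is one plausible policy with the two behaviors the explanation describes: a Deny on requests where `aws:SecureTransport` is `false`, and a Deny on `PutObject` unless the encryption header equals `AES256`. The bucket name is hypothetical, and `is_denied` is a simplified stand-in for these two statements only.

```python
BUCKET_ARN = "arn:aws:s3:::example-sensitive-bucket"  # hypothetical bucket

assumed_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [BUCKET_ARN, f"{BUCKET_ARN}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {
            "Sid": "DenyNonSseS3Uploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"{BUCKET_ARN}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
            },
        },
    ],
}


def is_denied(action: str, over_https: bool, sse_header=None) -> bool:
    """Simplified reading of the two Deny statements above."""
    if not over_https:
        return True  # first statement: plain HTTP is always rejected
    if action == "s3:PutObject" and sse_header != "AES256":
        return True  # second statement: missing or non-AES256 header
    return False
```

Under this reading, an HTTPS upload with `sse_header="aws:kms"` is denied just like one with no header, which is why a policy of this shape enforces SSE-S3 rather than SSE-KMS.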

543

MCQmedium

A data engineer is designing a data ingestion pipeline to load millions of small JSON files from an on-premises FTP server into Amazon S3. The pipeline should minimize cost and operational overhead. Which approach is most suitable?

A.Use S3 Transfer Acceleration to upload files directly from the FTP server

B.Deploy AWS DataSync to transfer files from the FTP server to S3

C.Use AWS Snowball Edge to ship the data to AWS

D.Set up an AWS Direct Connect connection and use AWS CLI to copy files

AnswerB

AWS DataSync is designed for efficient data transfer from on-premises to AWS, handling small files well with minimal operational overhead.

Why this answer

Option B is correct because AWS DataSync can transfer data from on-premises to S3 efficiently, handling many small files. Option A is wrong because S3 Transfer Acceleration speeds up transfers over long distances but requires S3 API access. Option D is wrong because Direct Connect provides a dedicated network connection but still needs a transfer mechanism.

Option C is wrong because Snowball Edge is for large data volumes and incurs shipping delays.

544

MCQmedium

A data engineering team notices that an Amazon Kinesis Data Stream is frequently exceeding its shard write throughput limit, causing throttling. The team needs a long-term solution to handle variable write traffic without manual intervention. Which action should the team take?

A.Configure the Kinesis Client Library to throttle consumption.

B.Increase the number of shards manually during peak hours.

C.Use Amazon Kinesis Data Firehose to buffer records before delivery to the stream.

D.Implement a buffer using Amazon S3 and AWS Lambda that aggregates records and writes to Kinesis in batches.

AnswerD

This buffers writes and reduces throttling.

Why this answer

Option D is correct because using an S3 buffer with a Lambda function that batches records and writes to Kinesis can smooth out traffic spikes and reduce throttling. Option B is wrong because increasing shard count manually is not automatic. Option C is wrong because Kinesis Data Firehose is for delivery, not buffering for Kinesis streams.

Option A is wrong because the Kinesis Client Library is for consumers, not producers.

545

MCQmedium

A company is migrating its on-premises PostgreSQL database to Amazon RDS for PostgreSQL. The database is 5 TB in size and supports a critical application that requires less than 30 minutes of downtime. The company has a 1 Gbps network connection to AWS. The data engineering team plans to use AWS Database Migration Service (DMS) with change data capture (CDC) to keep the target in sync. During the full load phase, DMS is taking longer than expected, and the team is concerned about meeting the downtime window. Which action should the team take to speed up the full load?

A.Increase the compute capacity of the target RDS instance.

B.Enable DMS validation to ensure data integrity.

C.Use AWS Snowball Edge to transfer the data offline.

D.Create multiple DMS tasks to load different tables in parallel.

AnswerD

Parallel tasks increase throughput.

Why this answer

Creating multiple DMS tasks to load different tables in parallel (Option D) is the correct action because DMS performs full load sequentially within a single task. By splitting tables across multiple tasks, the team can parallelize the data transfer, utilizing the 1 Gbps network more efficiently and reducing the overall full load time to meet the 30-minute downtime window.

Exam trap

The trap here is that candidates assume increasing target instance size (Option A) will speed up the full load, but they overlook that DMS's single-task architecture is the primary bottleneck, not the target's write capacity.

How to eliminate wrong answers

Option A is wrong because increasing the compute capacity of the target RDS instance does not address the bottleneck of DMS's sequential full load process; the target can ingest data faster, but DMS still processes tables one at a time. Option B is wrong because enabling DMS validation adds overhead by comparing source and target records, which would slow down the full load further, not speed it up. Option C is wrong because AWS Snowball Edge is designed for offline data transfer over multiple days, not for a migration requiring less than 30 minutes of downtime; the 1 Gbps network connection is sufficient if parallelism is used.
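As a rough check on the scenario's numbers, the arithmetic below estimates how long 5 TB takes to cross a 1 Gbps link. It assumes decimal units (1 TB = 10^12 bytes) and ignores protocol overhead; the `utilization` parameter is an illustrative knob for this sketch, not a DMS setting.

```python
TB = 10 ** 12    # bytes, decimal terabyte (assumption)
GBPS = 10 ** 9   # bits per second


def transfer_hours(size_bytes: int, link_bits_per_second: int,
                   utilization: float = 1.0) -> float:
    """Hours needed to move size_bytes at the given share of link capacity."""
    seconds = size_bytes * 8 / (link_bits_per_second * utilization)
    return seconds / 3600


at_line_rate = transfer_hours(5 * TB, 1 * GBPS)        # about 11.1 hours
at_half_rate = transfer_hours(5 * TB, 1 * GBPS, 0.5)   # about 22.2 hours
```

Even at full line rate the copy takes roughly 11 hours, so the full load has to run while the source stays online and CDC captures changes; the 30-minute limit then applies to the final cutover, and a faster full load shortens the period during which changes accumulate.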

546

MCQeasy

Refer to the exhibit. A data engineer is troubleshooting an IAM policy attached to a user who cannot list objects in the S3 bucket 'example-bucket'. What is the most likely reason?

A.The bucket policy explicitly denies access to the user.

B.The resource ARN for the bucket is incorrect; it should be 'arn:aws:s3:::example-bucket/*'.

C.The policy includes s3:GetObject but not s3:ListObjects.

D.The policy does not include the s3:ListBucket action.

AnswerA

An explicit deny overrides the IAM policy.

Why this answer

Option A is correct because the exhibit shows a valid identity policy: it grants s3:ListBucket on the bucket ARN, which is what listing requires, along with s3:GetObject for reading objects. Since the IAM policy itself is correct, the most likely reason is that the bucket policy explicitly denies access, and an explicit deny overrides any allow.

Option D is wrong because the policy already includes s3:ListBucket. Option C is wrong because there is no s3:ListObjects action; listing is authorized by s3:ListBucket. Option B is wrong because s3:ListBucket applies to the bucket ARN ('arn:aws:s3:::example-bucket'), not to the object ARN ending in '/*'.
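Since the exhibit is not reproduced here, the following is an assumed example of an identity policy that is valid for listing and reading: `s3:ListBucket` is scoped to the bucket ARN and `s3:GetObject` to the object ARN. The helper functions are illustrative only; `final_decision` captures just the rule that an explicit deny wins over an allow.

```python
BUCKET_ARN = "arn:aws:s3:::example-bucket"

identity_policy = {
    "Version": "2012-10-17",
    "Statement": [
        # Listing is a bucket-level action, so it targets the bucket ARN.
        {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": BUCKET_ARN},
        # Reading is an object-level action, so it targets the object ARN.
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": f"{BUCKET_ARN}/*"},
    ],
}


def allowed_resources(policy: dict, action: str) -> list:
    """Resources on which the policy allows the given action."""
    return [
        statement["Resource"]
        for statement in policy["Statement"]
        if statement["Effect"] == "Allow" and statement["Action"] == action
    ]


def final_decision(identity_allows: bool, explicit_deny: bool) -> str:
    """An explicit deny in any applicable policy overrides an allow."""
    return "Deny" if explicit_deny or not identity_allows else "Allow"


can_list = bool(allowed_resources(identity_policy, "s3:ListBucket"))
outcome = final_decision(can_list, explicit_deny=True)
```

With a denying bucket policy in place, `outcome` is `"Deny"` even though the identity policy allows the listing.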

547

MCQhard

A company uses AWS Lake Formation to manage access to data in S3. A data analyst reports being unable to query a table in Amazon Athena, receiving an 'Access Denied' error. The analyst has SELECT permission on the table in Lake Formation. What additional configuration is MOST likely causing the issue?

A.Athena does not have permission to access the Glue Data Catalog

B.The IAM role used by Athena does not have S3 GetObject permission on the underlying data

C.The analyst does not have DESCRIBE permission

D.The table is not registered with Lake Formation

AnswerB

Lake Formation grants SELECT, but S3 bucket policies or IAM may still block access.

Why this answer

Option B is correct because Lake Formation enforces access at the S3 level via IAM; if the IAM role lacks S3 permissions, access is denied. Option D is wrong because the table is registered. Option C is wrong because the analyst has SELECT permission.

Option A is wrong because Athena permissions are typically granted via Lake Formation.

548

MCQhard

A data engineer is troubleshooting a daily batch ingestion pipeline that uses AWS Glue to read CSV files from Amazon S3 and write Parquet files to another S3 bucket. The job runs successfully but takes significantly longer than expected. The engineer notices that the input data is highly skewed with many small files. Which is the most effective optimization to reduce job duration?

A.Change the output format to JSON

B.Enable the 'groupFiles' option in the S3 source configuration

C.Increase the number of DPUs allocated to the job

D.Enable the 'use_glue_schema_registry' option

AnswerB

Grouping small files into larger splits reduces task overhead and improves performance.

Why this answer

Option B is correct because grouping small files into larger splits reduces the overhead of task creation and improves Spark performance. Option C is wrong because increasing the number of DPUs can increase parallelism but may not help if the issue is file count overhead; it could even be wasteful. Option A is wrong because using a different file format is not the issue; the output is Parquet, which is already efficient.

Option D is wrong because enabling the Glue Schema Registry does not address the small files problem.
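A short sketch of what file grouping looks like in practice. `groupFiles` and `groupSize` are S3 connection options in AWS Glue, with `groupSize` given in bytes as a string; the bucket path, file counts, and the 128 MB target are assumptions, and `estimated_groups` is a rough model that ignores the fact that grouping happens within each S3 partition.

```python
GROUP_SIZE_BYTES = 128 * 1024 * 1024  # assumed target size per group

# Options that a Glue script would typically pass as connection_options to
# glue_context.create_dynamic_frame.from_options(connection_type="s3", ...).
s3_source_options = {
    "paths": ["s3://example-input-bucket/daily/"],  # hypothetical location
    "recurse": True,
    "groupFiles": "inPartition",
    "groupSize": str(GROUP_SIZE_BYTES),
}


def estimated_groups(file_count: int, avg_file_bytes: int, group_bytes: int) -> int:
    """Rough number of read groups: total bytes divided by group size, rounded up."""
    total_bytes = file_count * avg_file_bytes
    return -(-total_bytes // group_bytes)  # ceiling division on integers


# Illustrative workload: 100,000 files of 64 KB each.
groups = estimated_groups(100_000, 64 * 1024, GROUP_SIZE_BYTES)
```

For this illustrative workload, the 100,000 input files collapse into 49 read groups, which is the reduction in per-file task overhead that the explanation refers to.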

549

MCQhard

A data engineer is troubleshooting a failed AWS Glue ETL job that reads from a JDBC source. The error log shows 'java.sql.SQLException: Connection timed out'. The job previously ran successfully. Which of the following is the MOST likely cause?

A.The JDBC connection string has incorrect credentials.

B.The source database schema has changed.

C.The Glue job's timeout setting is too low.

D.The security group for the source database no longer allows traffic from the Glue job's IP range.

AnswerD

A network connectivity issue causes a timeout.

Why this answer

The error 'Connection timed out' indicates a network-level failure, not an authentication or schema issue. Since the job previously ran successfully, the most likely cause is that the security group for the source database no longer allows inbound traffic from the Glue job's IP range. AWS Glue ETL jobs run in a VPC with elastic network interfaces, and the security group rules must permit traffic on the JDBC port (e.g., 5432 for PostgreSQL, 3306 for MySQL).

Exam trap

AWS often tests the distinction between authentication errors (wrong credentials) and network connectivity errors (timeout), and candidates may confuse the Glue job timeout setting with a network timeout.

How to eliminate wrong answers

Option A is wrong because incorrect credentials would produce an authentication error (e.g., 'Access denied for user'), not a timeout. Option B is wrong because a schema change would cause a data type mismatch or column-not-found error, not a connection timeout. Option C is wrong because the Glue job's timeout setting controls how long the job can run before being terminated, not the network connection timeout to the JDBC source.

550

MCQmedium

A retail company uses Amazon Kinesis Data Firehose to ingest clickstream data from its website into an Amazon S3 bucket. The data includes fields: user_id, event_type, timestamp, page_url. Recently, the data engineering team noticed that some records have malformed JSON (missing commas, extra brackets) causing delivery failures to S3. The Firehose delivery stream is configured to retry failed records for 300 seconds, after which the records are sent to an S3 bucket for failed records. The team wants to transform the data to correct malformed JSON before delivery to the main S3 bucket. They need a solution that does not require managing servers and can handle high throughput. What should the team do?

A.Configure an AWS Lambda function as a data transformation in Kinesis Data Firehose to correct malformed JSON.

B.Set up an Amazon EMR cluster with Apache Spark to process the data in micro-batches and fix JSON errors.

C.Use an AWS Glue streaming ETL job to read from Firehose and write corrected data to S3.

D.Use Amazon Kinesis Data Analytics with a SQL application to parse and fix JSON.

AnswerA

Firehose supports Lambda transformations for record-level processing; it scales automatically.

Why this answer

Option A is correct: a Lambda transformation in Firehose processes each buffered record and can fix JSON errors before delivery, with no servers to manage. Option D (Kinesis Data Analytics) is for real-time analytics, not record-level repair. Option C (Glue streaming) adds complexity and latency. Option B (EMR) requires cluster management.
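As a rough illustration of the Lambda transformation approach, the sketch below follows the Firehose transformation contract (base64 `data` in, `recordId` / `result` / `data` out). The `try_repair` helper is hypothetical and only handles one kind of damage (stray trailing brackets); a real repair routine would depend on how the records are actually malformed.

```python
import base64
import json


def try_repair(text):
    """Hypothetical repair step: parse the text, dropping up to two stray
    trailing brackets. Returns a dict, or None if the text is unrecoverable."""
    candidate = text.strip()
    for _ in range(3):
        try:
            parsed = json.loads(candidate)
            return parsed if isinstance(parsed, dict) else None
        except json.JSONDecodeError:
            if candidate.endswith(("}", "]")):
                candidate = candidate[:-1]  # drop one extra closing bracket and retry
            else:
                return None
    return None


def lambda_handler(event, context):
    """Firehose data-transformation handler: one output entry per input record."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        parsed = try_repair(raw)
        if parsed is None:
            # Firehose routes ProcessingFailed records to the error output location
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
        else:
            cleaned = json.dumps(parsed) + "\n"  # newline-delimited JSON in S3
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(cleaned.encode("utf-8")).decode("utf-8"),
            })
    return {"records": output}
```

Records that cannot be repaired are returned as `ProcessingFailed` rather than dropped, so they remain available for inspection.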

Full explanation →

551

Drag & Dropmedium

Arrange the steps to create an AWS Glue job that transforms data from Amazon S3 to Amazon Redshift in the correct order.

Steps

Order

Why this order

First, catalog the source data with a crawler. Then, prepare the ETL script. Configure the job with connections, run it, and finally verify the results in Redshift.

Full explanation →

552

MCQeasy

A data engineer needs to store semi-structured JSON logs from an application for up to 30 days, with infrequent access. Which storage solution is the most cost-effective?

A.Amazon S3 Glacier Deep Archive

B.Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA)

C.Amazon S3 Standard

D.Amazon S3 Standard-Infrequent Access (S3 Standard-IA)

AnswerD

Cost-effective for infrequently accessed data with rapid access needs.

Why this answer

Amazon S3 Standard-Infrequent Access (S3 Standard-IA) is the most cost-effective choice for storing semi-structured JSON logs for up to 30 days with infrequent access. It offers low storage cost (compared to S3 Standard) while providing low-latency retrieval and high durability (99.999999999%) across multiple Availability Zones, making it ideal for data that is accessed less frequently but needs immediate availability when requested.

Exam trap

The trap here is that candidates often choose S3 One Zone-IA (Option B) because single-AZ storage has a lower per-GB price, but they overlook that it keeps the logs in one Availability Zone and cannot recover them if that AZ is lost. Both infrequent-access classes bill a 30-day minimum storage duration and a per-GB retrieval fee, so the answer key's preference for S3 Standard-IA rests on multi-AZ resilience rather than on price.

How to eliminate wrong answers

Option A is wrong because Amazon S3 Glacier Deep Archive is designed for long-term archival (retrieval times of 12-48 hours) and has a minimum storage duration of 180 days, making it unsuitable for a 30-day retention period with infrequent but potentially immediate access needs. Option B is wrong because Amazon S3 One Zone-Infrequent Access stores data in a single Availability Zone, so it lacks the multi-AZ resilience expected for logs that may need to be recovered after a failure; its lower storage price does not offset that risk. Option C is wrong because Amazon S3 Standard is optimized for frequently accessed data with higher storage costs per GB, making it overpriced for logs that are accessed infrequently over a 30-day period.

Full explanation →

553

MCQeasy

A company stores its application logs in an Amazon S3 bucket. The logs are accessed frequently for the first 30 days, after which they are rarely accessed but must be retained for 7 years for compliance. The company wants to optimize storage costs while maintaining immediate retrieval availability for the first 30 days and the ability to retrieve logs within 12 hours after that. Which lifecycle policy should the data engineer configure?

A.Delete objects after 30 days to minimize storage costs.

B.Transition objects to S3 Standard-IA after 30 days and then to S3 Glacier Deep Archive after 1 year.

C.Transition objects to S3 One Zone-IA after 30 days and delete after 7 years.

D.Transition objects to S3 Glacier Flexible Retrieval after 30 days and delete after 7 years.

AnswerB

Objects stay in S3 Standard for the first 30 days, move to Standard-IA (still immediately retrievable), then to Deep Archive for cost-effective long-term retention.

Why this answer

Option B is correct because objects stay in S3 Standard during the first 30 days of frequent access, move to S3 Standard-IA (infrequently accessed data with immediate retrieval), and then move to S3 Glacier Deep Archive for long-term retention with retrieval within 12 hours. Option C is incorrect because S3 One Zone-IA is not resilient enough for compliance data. Option D is incorrect according to this key because S3 Glacier Flexible Retrieval, which retrieves in minutes to hours, costs more than Deep Archive over a 7-year retention period. Option A is incorrect because deleting after 30 days violates the retention requirement.
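The transitions in Option B can be sketched as an S3 lifecycle configuration. The snippet below only builds the configuration dictionary in the shape boto3's `put_bucket_lifecycle_configuration` expects; the rule ID and bucket name are placeholders, and no expiration rule is included because Option B does not specify one.

```python
def build_lifecycle_configuration(ia_after_days=30, deep_archive_after_days=365):
    """Build a lifecycle configuration: Standard -> Standard-IA -> Deep Archive."""
    return {
        "Rules": [
            {
                "ID": "logs-tiering",          # placeholder rule name
                "Filter": {"Prefix": ""},      # empty prefix applies to all objects
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": deep_archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_configuration()

# Applying it would look roughly like this (requires boto3 and credentials;
# "example-log-bucket" is a hypothetical bucket name):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-log-bucket", LifecycleConfiguration=config)
```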

Full explanation →

554

Multi-Selectmedium

A company uses AWS Glue to transform data in S3. The Glue job fails with memory errors. Which THREE actions can help resolve this?

Select 3 answers

A.Optimize the transformation to use pushdown predicates.

B.Use a larger worker type (e.g., G.2X).

C.Increase the number of DPUs.

D.Increase the job timeout.

E.Decrease the number of DPUs.

AnswersA, B, C

Pushdown predicates reduce data loaded into memory.

Why this answer

Options A, B, and C are correct. Optimizing transformations with pushdown predicates (A) reduces the data scanned and loaded into memory. Using a larger worker type (B) increases memory per worker. Increasing DPUs (C) adds more worker nodes. Option E is wrong because reducing DPUs would make the problem worse. Option D is wrong because increasing the timeout does not address memory issues.

Full explanation →

555

MCQeasy

A data engineer needs to grant an IAM user the ability to view Amazon CloudWatch Logs log groups and stream log events from a specific log group. Which IAM policy action should be used?

A.logs:DescribeLogGroups and logs:GetLogEvents

B.logs:PutLogEvents

C.logs:CreateLogGroup

D.logs:DeleteLogGroup

AnswerA

These allow listing and reading logs.

Why this answer

Option A is correct because logs:DescribeLogGroups and logs:GetLogEvents are the required actions. Option B is wrong because logs:PutLogEvents is for writing. Option C is wrong because logs:CreateLogGroup is for creation.

Option D is wrong because logs:DeleteLogGroup is for deletion.
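A minimal sketch of an identity policy that uses the two actions from Option A. The region, account ID, and log group name are placeholders; scoping `logs:DescribeLogGroups` to `*` and `logs:GetLogEvents` to the streams of one log group is an assumption about how the scenario's "specific log group" would be expressed.

```python
import json


def build_log_reader_policy(region, account_id, log_group_name):
    """Return an IAM policy document allowing log group listing and event reads."""
    log_group_arn = f"arn:aws:logs:{region}:{account_id}:log-group:{log_group_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListLogGroups",
                "Effect": "Allow",
                "Action": "logs:DescribeLogGroups",
                "Resource": "*",
            },
            {
                "Sid": "ReadEventsFromOneGroup",
                "Effect": "Allow",
                "Action": "logs:GetLogEvents",
                # GetLogEvents is authorized against log streams inside the group
                "Resource": f"{log_group_arn}:log-stream:*",
            },
        ],
    }


policy = build_log_reader_policy("us-east-1", "123456789012", "/example/app")
print(json.dumps(policy, indent=2))
```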

Full explanation →

556

Multi-Selecteasy

A data engineer needs to transfer 50 TB of data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited to 500 Mbps. Which TWO methods are appropriate for this transfer? (Choose TWO.)

Select 2 answers

A.Set up an AWS Direct Connect connection for higher bandwidth.

B.Order an AWS Snowball Edge device to physically ship the data.

C.Use S3 Transfer Acceleration to upload over the internet.

D.Use Amazon Kinesis Data Firehose to stream the data.

E.Use AWS DataSync to transfer data over the network.

AnswersB, E

Snowball is ideal for large datasets with low bandwidth.

Why this answer

Options B and E are correct. AWS Snowball Edge is a physical device for large data transfers over slow networks. AWS DataSync can transfer data over the network with optimization. Option C is wrong because S3 Transfer Acceleration speeds up transfers but still requires network bandwidth. Option A is wrong because AWS Direct Connect is a dedicated network connection, not a transfer method. Option D is wrong because Amazon Kinesis Data Firehose is for streaming data, not bulk transfer.

Full explanation →

557

MCQeasy

A data engineer notices that an Amazon S3 bucket policy is overly permissive. What is the best practice to restrict access while maintaining required permissions?

A.Grant full S3 access using a new IAM policy.

B.Write a new bucket policy that denies all actions.

C.Use an S3 blocklist to restrict access.

D.Attach the AWS managed policy AmazonS3ReadOnlyAccess to the IAM user.

AnswerD

This policy grants only read access to S3, which is more restrictive than the current overly permissive policy.

Why this answer

Option D is correct because the AWS managed policy 'AmazonS3ReadOnlyAccess' grants read-only access and is more restrictive than full access. Option B (deny all) would break applications. Option C (blocklist) is not a standard method. Option A (full access) is the opposite of restriction.

Full explanation →

558

MCQmedium

A company wants to migrate on-premises data to Amazon S3 using AWS DataSync. The data is 10 TB and the network bandwidth is 1 Gbps. The migration must be completed within 48 hours. What should the data engineer do to meet the deadline?

A.Use S3 Transfer Acceleration to speed up the transfer

B.Use AWS Snowball Edge to transfer the data physically

C.Use AWS DataSync with multiple agents and enable data compression

D.Request a bandwidth increase from the ISP

AnswerC

Multiple agents and compression maximize throughput to meet the deadline.

Why this answer

Option C (Use AWS DataSync with multiple agents and enable data compression) is correct because it maximizes throughput. Option B (Use Snowball Edge) is not necessary for 10 TB with 1 Gbps; the network transfer would take about 22 hours theoretically, and compression and multiple agents can help further. Option D (Increase bandwidth) is not always feasible. Option A (Use S3 Transfer Acceleration) improves upload speed but not as much as multiple agents and compression.
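The "about 22 hours" figure can be checked with simple arithmetic. The helper below is an illustrative calculation only; it assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second), and the `utilization` parameter is a hypothetical knob for the share of the link that is actually usable.

```python
def transfer_hours(data_tb, link_gbps, utilization=1.0):
    """Estimate hours needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8                            # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)     # bits / (bits per second)
    return seconds / 3600


ideal = transfer_hours(10, 1)          # full line rate
realistic = transfer_hours(10, 1, 0.8) # assuming 80% of the link is usable

print(f"ideal: {ideal:.1f} h, at 80% utilization: {realistic:.1f} h")
```

Both estimates fall inside the 48-hour window, which is why a physical device is unnecessary in this scenario.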

Full explanation →

559

MCQhard

Refer to the exhibit. A data engineer runs an AWS Glue job that fails with an 'Access Denied' error when writing to S3. The IAM role attached to the job has s3:PutObject permission on the output bucket. What additional configuration is most likely missing?

A.The Glue job is not configured to write to S3 with the correct prefix

B.The S3 bucket policy does not grant access to the Glue job's IAM role

C.The Glue job is running in a VPC without an S3 VPC endpoint

D.The S3 bucket is encrypted with AWS KMS and the IAM role lacks kms:Decrypt permission

AnswerB

Even if IAM allows, bucket policy can deny; this is a common misconfiguration.

Why this answer

Option B is correct. The error is 'Access Denied' on write even though the role has s3:PutObject, which points to the resource side: often the issue is that the S3 bucket policy does not allow the Glue job's IAM role. The explanation also notes that the job uses Glue version 3.0 (which includes Spark 3.1). Option D (KMS) is possible but not mentioned. Option C (VPC endpoint) might be needed if the job is in a VPC. Option A (output prefix) is not indicated by the error.

Full explanation →

560

MCQeasy

A company streams clickstream data from websites to Amazon Kinesis Data Streams. A Lambda function processes each record and writes it to Amazon S3. Recently, the function has been timing out under high load. Which solution should a data engineer implement to handle the increased throughput?

A.Increase the Lambda function's timeout value.

B.Increase the number of shards in the Kinesis data stream.

C.Increase the memory allocated to the Lambda function.

D.Configure Amazon S3 Event Notifications to trigger Lambda directly.

AnswerB

More shards increase parallelism and allow Lambda to process more records concurrently.

Why this answer

Option B is correct because increasing the number of shards in Kinesis Data Streams increases parallelism, allowing more Lambda invocations to process records concurrently. Option A is wrong because a longer timeout lets each invocation run longer but adds no processing capacity. Option C is wrong because increasing memory may help but does not address the root cause of limited shard count. Option D is wrong because S3 Event Notifications fire on object events in S3, not on records arriving in the stream.

Full explanation →

561

MCQeasy

A company is streaming clickstream data from a website into Amazon Kinesis Data Streams. The data must be transformed in near real-time and stored in Amazon S3 for analytics. Which AWS service should be used to transform the data as it is ingested?

A.AWS Lambda (streaming function)

B.Amazon EMR (Spark Streaming)

C.AWS Glue (ETL jobs)

D.Amazon Kinesis Data Analytics

AnswerD

Amazon Kinesis Data Analytics can process and transform streaming data in real-time using SQL or Apache Flink.

Why this answer

Option D is correct because Amazon Kinesis Data Analytics can process and transform streaming data in real-time using SQL or Apache Flink. Option C (AWS Glue ETL jobs) is batch-oriented, not intended for real-time streams. Option B (Amazon EMR) is for big data processing but requires more setup. Option A (AWS Lambda) can process Kinesis records but is less efficient for complex transformations on high-throughput streams.

Full explanation →

562

MCQmedium

A data engineer is tasked with designing a disaster recovery solution for a data lake stored in Amazon S3. The data lake contains sensitive customer data that must be replicated to a different AWS Region. The engineer needs to ensure that all objects, including those with encryption using SSE-KMS, are replicated. Which solution meets the requirements?

A.Use S3 Batch Operations to copy objects to the destination bucket.

B.Enable S3 Cross-Region Replication (CRR) with the appropriate KMS key and IAM role.

C.Use S3 Transfer Acceleration to copy objects across regions.

D.Use the AWS CLI s3 sync command scheduled in a cron job.

AnswerB

CRR supports SSE-KMS with proper configuration.

Why this answer

Option B is correct because S3 Cross-Region Replication can replicate objects with SSE-KMS if the KMS key is specified and the IAM role has the necessary permissions. Option A is wrong because S3 Batch Operations is for one-time bulk actions. Option D is wrong because the s3 sync CLI command is not automatic for ongoing replication. Option C is wrong because S3 Transfer Acceleration speeds up uploads but does not replicate.
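For reference, a replication rule that includes SSE-KMS objects might be shaped like the dictionary below, which follows the structure boto3's `put_bucket_replication` accepts. All ARNs and the rule ID are placeholders, and the sketch only builds the configuration; it does not call AWS.

```python
def build_replication_configuration(role_arn, destination_bucket_arn, replica_kms_key_arn):
    """Build a CRR configuration that also replicates SSE-KMS encrypted objects."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-sse-kms-objects",   # placeholder rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},                         # empty filter: all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                # Opt in to replicating objects encrypted with SSE-KMS
                "SourceSelectionCriteria": {
                    "SseKmsEncryptedObjects": {"Status": "Enabled"}
                },
                "Destination": {
                    "Bucket": destination_bucket_arn,
                    # KMS key in the destination Region used to encrypt replicas
                    "EncryptionConfiguration": {"ReplicaKmsKeyID": replica_kms_key_arn},
                },
            }
        ],
    }


config = build_replication_configuration(
    "arn:aws:iam::123456789012:role/example-replication-role",
    "arn:aws:s3:::example-dr-bucket",
    "arn:aws:kms:us-west-2:123456789012:key/example-key-id",
)
```

Versioning must be enabled on both the source and destination buckets before replication can be configured.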

Full explanation →

563

MCQmedium

A retail company uses AWS Glue to process daily sales data from multiple CSV files stored in Amazon S3. The Glue job runs a PySpark script that reads the files, performs joins, and writes the output as Parquet. Recently, the job has been failing with 'Out of Memory' errors. The data volume has grown from 10 GB to 50 GB per day. The Glue job uses 10 DPUs and the standard worker type. The data engineer needs to fix the job without rewriting the script. What should the data engineer do?

A.Split the input CSV files into smaller partitions.

B.Change the worker type to G.2X to get more memory per worker.

C.Decrease the number of DPUs to reduce memory contention.

D.Increase the number of DPUs for the Glue job to 20.

AnswerD

More DPUs provide more memory and parallel processing, solving OOM.

Why this answer

Option D is correct. Increasing the number of DPUs provides more memory and compute resources, addressing the OOM error. Option B is wrong according to this key because changing the worker type to G.2X may not be sufficient on its own, whereas increasing DPUs is a direct solution. Option A is wrong because splitting files does not reduce the memory needed for joins. Option C is wrong because decreasing DPUs would make the problem worse.

Full explanation →

564

MCQhard

A company needs to ingest real-time clickstream data from a web application into Amazon Redshift with minimal latency. The data volume is high and requires processing before loading. Which architecture is MOST appropriate?

A.AWS Glue ETL jobs scheduled every 5 minutes -> Redshift

B.S3 -> Lambda -> Redshift

C.DynamoDB Streams -> Lambda -> Redshift

D.Kinesis Data Streams -> Kinesis Data Firehose -> Redshift

AnswerD

Provides real-time ingestion with transformation capability.

Why this answer

Option D is correct because Kinesis Data Firehose can deliver streaming data to Redshift with built-in transformation via Lambda. Option B is wrong because S3 does not provide real-time ingestion. Option A is wrong because Glue is batch-oriented. Option C is wrong because DynamoDB Streams is for DynamoDB changes, not clickstream.

Full explanation →

565

MCQhard

Refer to the exhibit. A data engineer creates this KMS key policy. An IAM role in account 123456789012 is granted decrypt access to the key. However, when the DataAnalystRole tries to decrypt an S3 object encrypted with this key, the operation fails. What is the most likely reason?

A.The S3 bucket policy does not allow the role to call s3:GetObject

B.The KMS key is in a different region than the S3 bucket

C.The role does not have permission to call kms:DescribeKey

D.The KMS key policy does not grant kms:Decrypt permission to the role

AnswerA

Even with decrypt permission, the role needs s3:GetObject permission on the encrypted object.

Why this answer

The KMS key policy grants kms:Decrypt to the role, so the key itself is not the blocker. Reading an SSE-KMS object needs two separate permissions: kms:Decrypt on the key and s3:GetObject on the object. If the S3 bucket policy or the role's own policies do not allow s3:GetObject, the read fails even though decryption is permitted. A Region mismatch between the key (us-east-1) and the S3 object is another possibility, but the most likely reason based on typical exam scenarios is that the S3 bucket policy does not grant the necessary S3 permissions.

Full explanation →

566

MCQmedium

A company stores sensitive data in an Amazon S3 bucket. To comply with regulations, all data must be encrypted at rest using server-side encryption. The security team wants to ensure that any attempt to upload an unencrypted object is automatically denied. Which S3 bucket policy condition should be used?

A.s3:x-amz-server-side-encryption-aws-kms-key-id

B.s3:x-amz-acl

C.s3:x-amz-server-side-encryption

D.s3:x-amz-storage-class

AnswerC

Setting this condition to require 'AES256' enforces SSE-S3 encryption.

Why this answer

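The s3:x-amz-server-side-encryption condition key lets a bucket policy deny uploads whose encryption header is missing or does not match the required value, such as AES256 (SSE-S3). s3:x-amz-server-side-encryption-aws-kms-key-id is for enforcing a specific KMS key. s3:x-amz-acl controls access control lists, and s3:x-amz-storage-class controls the storage class; neither concerns encryption.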

Full explanation →

567

MCQhard

A company uses AWS Glue to run ETL jobs that process data from Amazon S3 and write results to Amazon Redshift. The Glue job uses the JDBC connection to Redshift. Recently, the job has been failing intermittently with the error: 'java.sql.SQLException: [Amazon](500310) Invalid operation: INSERT has more expressions than target columns;' The Glue job writes to a staging table in Redshift before performing a merge into the final table. The staging table schema matches the source data. The error occurs only on some days and affects different columns each time. The data engineer suspects that the source data occasionally contains extra columns due to a schema drift in the upstream data producer. Which approach should the data engineer take to handle this issue robustly?

A.Skip any records that have extra columns by adding a conditional check in the Glue script.

B.Use a Glue DynamicFrame and apply the resolveChoice method to make the schema consistent.

C.Manually update the Redshift staging table schema whenever the source data changes.

D.Use a Glue DynamicFrame and apply the dropFields method to remove extra columns before writing.

AnswerB

resolveChoice can handle schema drift by casting or dropping columns, making the job resilient.

Why this answer

Option B is correct because Glue DynamicFrames can automatically handle schema drift using the `resolveChoice` method, which allows you to specify how to handle columns that appear inconsistently across records (e.g., making them null, casting to a common type, or dropping them). This directly addresses the intermittent error caused by extra columns in the source data without requiring manual schema updates or fragile conditional logic.

Exam trap

The trap here is that candidates may confuse `dropFields` (which removes specific columns statically) with `resolveChoice` (which handles dynamic schema drift), leading them to choose Option D even though it cannot adapt to varying extra columns across different days.

How to eliminate wrong answers

Option A is wrong because skipping records with extra columns would result in data loss and does not address the root cause—the schema mismatch between the source and the staging table. Option C is wrong because manually updating the Redshift staging table schema whenever the source data changes is not scalable, error-prone, and defeats the purpose of an automated ETL pipeline. Option D is wrong because `dropFields` removes specific named columns statically at coding time, but the error occurs on different columns each day, so a dynamic approach like `resolveChoice` is needed.

Full explanation →

568

MCQeasy

A data engineer is migrating an on-premises PostgreSQL database to Amazon RDS for PostgreSQL. The database is 2 TB in size. The engineer needs to minimize downtime. Which AWS service should be used for the migration?

A.AWS Data Pipeline

B.AWS Database Migration Service (DMS)

C.AWS Snowball

D.Amazon S3

AnswerB

DMS supports continuous replication with minimal downtime.

Why this answer

AWS Database Migration Service (DMS) supports live migration with minimal downtime using change data capture, so Option B is correct. Option D: S3 is for object storage, not database migration. Option C: Snowball is for large offline data transfers, which would cause downtime. Option A: Data Pipeline is for data processing workflows, not direct database migration.

Full explanation →

569

Matchingmedium

Match each AWS database service to its primary use case.

Concepts

Matches

Relational database with managed operations

NoSQL key-value and document database

In-memory caching for low latency

Graph database for connected data

Time-series data for IoT and analytics

Why these pairings

AWS offers purpose-built databases.

Full explanation →

570

MCQmedium

A team is designing a data lake on S3 and needs to enforce encryption at rest. They want to use server-side encryption with a KMS key that they manage. Which encryption option should they configure on the S3 bucket?

A.SSE-KMS

B.Client-side encryption

C.SSE-S3

D.SSE-C

AnswerA

SSE-KMS uses KMS keys that the customer manages.

Why this answer

SSE-KMS is the correct choice because it provides server-side encryption using a customer-managed KMS key. This allows the team to enforce encryption at rest with their own key, giving them control over key rotation, access policies, and audit trails via AWS CloudTrail, which aligns with the requirement to manage the encryption key themselves.

Exam trap

The trap here is that candidates often confuse SSE-S3 with SSE-KMS, assuming both use customer-managed keys, but SSE-S3 uses AWS-managed keys and does not provide the customer with key management control or audit capabilities.

How to eliminate wrong answers

Option B (Client-side encryption) is wrong because it encrypts data before it is sent to S3, not at rest on the server side, and does not involve configuring encryption on the S3 bucket itself. Option C (SSE-S3) is wrong because it uses an AWS-managed key, not a customer-managed KMS key, so the team would not have control over key management. Option D (SSE-C) is wrong because it requires the customer to provide their own encryption keys in each request, and the bucket configuration does not manage the key; instead, the key is supplied per-object, which is not a bucket-level encryption setting.

Full explanation →

571

MCQmedium

A data engineering team notices that an AWS Glue ETL job fails intermittently with a 'ThrottlingException' error. The job reads from an Amazon S3 bucket and writes to an Amazon Redshift table. What is the MOST likely cause of this error?

A.The S3 bucket's request rate is exceeding the bucket's performance limits.

B.The Redshift cluster's write throughput is exceeding its provisioned capacity.

C.The Glue job is exceeding the maximum number of concurrent runs allowed.

D.The Glue job's allocated memory is insufficient for the data volume.

AnswerB

Redshift throttles writes when the cluster's I/O capacity is exceeded.

Why this answer

Option B is correct because a ThrottlingException when writing to Redshift typically indicates that the write throughput exceeds the cluster's capacity, causing API throttling. Option A is incorrect because S3 throttling would result in a different error (e.g., SlowDown). Option C is incorrect because Glue job throttling would be at the API level for job operations, not data operations. Option D is incorrect because insufficient memory would cause OutOfMemoryError, not ThrottlingException.

Full explanation →

572

MCQhard

A company uses Amazon Kinesis Data Firehose to deliver streaming log data to an Amazon S3 bucket. The delivery stream uses dynamic partitioning with a custom prefix. Recently, the delivery stream has been failing with the error 'InvalidArgumentException: The number of partitions exceeds the limit'. What is the likely cause?

A.The incoming data contains more distinct partition key values than the allowed limit.

B.The S3 bucket has a bucket policy that restricts the number of prefixes.

C.The buffer size and interval are set too low, causing many small files.

D.The data volume exceeds the maximum throughput of the delivery stream.

AnswerA

Firehose dynamic partitioning limits how many distinct partitions can be active at once.

Why this answer

Option A is correct because dynamic partitioning in Firehose enforces a quota on active partitions (500 per delivery stream by default), and each distinct partition key value being buffered counts toward it. Option D is wrong because data volume does not cause this error. Option B is wrong because an S3 bucket policy would cause different errors. Option C is wrong because buffer size does not affect partition count.

Full explanation →

573

MCQhard

A company is using AWS Database Migration Service (DMS) to migrate a 2 TB MySQL database to Amazon Aurora MySQL. The migration is taking longer than expected. The source database is in a different AWS region. Which change would MOST likely improve the migration speed?

A.Use a smaller DMS replication instance to reduce costs.

B.Use a Multi-AZ deployment for the DMS replication instance in the target region.

C.Increase the number of parallel tables being migrated.

D.Disable binary logging on the source MySQL database.

AnswerB

Multi-AZ can improve availability and performance by using a standby replica, which can help with cross-region data transfer.

Why this answer

Option B is correct according to this key because using a Multi-AZ DMS replication instance in the target region can improve performance by providing high availability and potentially better network throughput. Option C is incorrect because, although increasing the number of parallel tables can improve performance, the question asks for the MOST likely improvement. Option D is incorrect because disabling binary logging on the source database could affect the application and is not recommended. Option A is incorrect because using a smaller instance type would reduce performance.

Full explanation →

574

MCQmedium

A company is using Amazon Redshift Spectrum to query data in Amazon S3. The S3 bucket uses SSE-KMS encryption. The Redshift cluster has an IAM role that allows access to S3 and KMS. However, queries fail with an 'Access Denied' error. What is the most likely cause?

A.The Redshift cluster does not have the IAM role attached.

B.The external schema does not have the IAM role specified.

C.The IAM role does not have the kms:Decrypt permission.

D.The external table is not defined in the schema.

AnswerB

The schema must reference the IAM role for Redshift Spectrum to assume it.

Why this answer

When using Redshift Spectrum with SSE-KMS encrypted data in S3, the IAM role must be explicitly associated with the external schema via the `CREATE EXTERNAL SCHEMA` command using the `IAM_ROLE` parameter. Even if the cluster has the IAM role attached, Spectrum queries fail with 'Access Denied' if the role is not specified at the schema level, because Redshift needs to pass that role to S3 and KMS for each query execution. Option B correctly identifies this missing configuration as the most likely cause.

Exam trap

The trap here is that candidates assume attaching an IAM role to the Redshift cluster is sufficient for all Spectrum operations, but the DEA-C01 exam tests the specific requirement that the role must be declared in the external schema definition for Spectrum to use it.

How to eliminate wrong answers

Option A is wrong because the question states the Redshift cluster has an IAM role attached, so the role is present on the cluster; the issue is that it is not specified in the external schema. Option C is wrong because the IAM role is explicitly stated to allow access to KMS, and the 'Access Denied' error typically occurs before KMS permission checks if the role is not passed to Spectrum at all. Option D is wrong because the external table definition is irrelevant to the 'Access Denied' error; the error occurs at the schema or role association level, not due to missing table definitions.
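To make the schema-level role association concrete, the sketch below assembles a `CREATE EXTERNAL SCHEMA` statement with the `IAM_ROLE` parameter. The schema name, catalog database, and role ARN are placeholders, and plain string formatting is used only for illustration; it is not a safe way to build SQL from untrusted input.

```python
def build_external_schema_sql(schema_name, catalog_database, iam_role_arn):
    """Return a Redshift statement that binds an IAM role to an external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema_name}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


sql = build_external_schema_sql(
    "spectrum_logs",                                        # hypothetical schema name
    "example_catalog_db",                                   # hypothetical Glue database
    "arn:aws:iam::123456789012:role/example-spectrum-role", # placeholder role ARN
)
print(sql)
```

The role named here must also be attached to the cluster, so both associations are needed for Spectrum queries to succeed.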

Full explanation →

575

MCQhard

Refer to the exhibit. A data engineer reviews an Amazon S3 server access log entry for an object upload. The log shows a status of 200 and encryption status "AES256". The company policy requires that all data be encrypted with SSE-KMS. Which action should the engineer take to enforce this policy?

A.Attach an S3 bucket policy that denies s3:PutObject unless the request includes x-amz-server-side-encryption: aws:kms

B.Revoke the IAM role's s3:PutObject permission

C.Enable AWS CloudTrail data events to monitor future uploads

D.Enable S3 default encryption with SSE-KMS on the bucket

AnswerA

Enforces SSE-KMS.

Why this answer

Option A is correct because the log shows the object was encrypted with AES256 (SSE-S3), not KMS. The engineer should attach a bucket policy that denies PutObject unless the encryption header is aws:kms. Option B is wrong because revoking s3:PutObject would block every upload, not only those that skip SSE-KMS. Option C is wrong because CloudTrail does not prevent uploads. Option D is wrong because default encryption only applies to new objects without encryption headers, but the policy must deny SSE-S3.
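A bucket policy of the kind Option A describes could look like the following sketch. The bucket name is a placeholder and the statement ID is arbitrary; the condition key and the `aws:kms` value come from the question itself.

```python
import json


def build_require_sse_kms_policy(bucket_name):
    """Return a bucket policy denying uploads whose SSE header is not aws:kms."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    # Deny when the x-amz-server-side-encryption header is not aws:kms
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            }
        ],
    }


policy = build_require_sse_kms_policy("example-secure-bucket")
print(json.dumps(policy, indent=2))
```

With this statement in place, an upload that sends `AES256`, as in the logged request, would be rejected instead of succeeding with SSE-S3.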

Full explanation →

576

Multi-Selectmedium

A company uses AWS CloudTrail to log all API calls. The security team wants to ensure that log files are tamper-proof and cannot be deleted. Which TWO actions should the data engineer take? (Choose TWO.)

Select 2 answers

A.Enable CloudTrail log file validation

B.Enable S3 Object Lock on the S3 bucket

C.Enable MFA Delete on the S3 bucket

D.Enable S3 Versioning on the S3 bucket

E.Enable SSE-KMS encryption on the S3 bucket

AnswersA, B

Provides integrity verification to detect tampering.

Why this answer

Options A and B are correct. CloudTrail log file validation (A) provides integrity verification so tampering can be detected, and S3 Object Lock (B) prevents deletion. Option E is wrong because SSE-KMS encrypts but does not prevent deletion. Option C is wrong because MFA Delete is not set up by CloudTrail. Option D is wrong because versioning alone does not prevent deletion.

Full explanation →

577

MCQmedium

A company is using AWS Glue to process data from Amazon S3. The Glue job reads CSV files and writes Parquet files to a different S3 bucket. The job occasionally fails with 'java.lang.OutOfMemoryError: Java heap space'. The data size varies. Which change should the engineer make to avoid this error?

A.Increase the number of DPUs allocated to the Glue job.

B.Convert the CSV files to JSON format before processing.

C.Decrease the Spark shuffle partitions in the job script.

D.Increase the job timeout setting.

AnswerA

More DPUs provide more memory and compute resources.

Why this answer

The 'java.lang.OutOfMemoryError: Java heap space' error in AWS Glue indicates that the Spark executors ran out of memory while processing the data. Increasing the number of DPUs (Data Processing Units) allocated to the Glue job increases the total memory available across the cluster, allowing larger datasets to be processed without hitting the heap limit. Each DPU provides 4 vCPUs and 16 GB of memory, so adding more DPUs scales memory linearly.

Exam trap

The trap here is that candidates often confuse 'increasing DPUs' with 'increasing parallelism' and assume it only speeds up jobs, but in reality it also increases total memory, which directly mitigates heap space errors.

How to eliminate wrong answers

Option B is wrong because converting CSV to JSON does not reduce memory pressure; JSON is typically more verbose than CSV and would increase memory consumption. Option C is wrong because decreasing Spark shuffle partitions reduces parallelism and can cause each partition to hold more data, worsening memory issues and potentially increasing the risk of OutOfMemoryError. Option D is wrong because increasing the job timeout setting only extends the maximum runtime before the job is killed; it does not address memory constraints or prevent heap space errors.

Full explanation →

578

MCQeasy

A company needs to ingest data from a relational database into Amazon S3 for analytics. The database is an Amazon RDS MySQL instance. Which AWS service should be used for a one-time historical data load?

A.AWS Database Migration Service (DMS)

B.AWS Glue ETL

C.Amazon Athena

D.Amazon Kinesis Data Firehose

AnswerA

DMS supports full load from RDS to S3.

Why this answer

Option A is correct because AWS Database Migration Service (DMS) can perform a one-time full load from RDS MySQL to S3. Option B is incorrect because Glue ETL can do the load, but DMS is specialized for databases. Option C is incorrect because Athena queries data; it does not ingest it. Option D is incorrect because Kinesis Data Firehose is for streaming data, not database extraction.

Full explanation →

579

MCQhard

A data engineer needs to share a dataset stored in an Amazon S3 bucket with another AWS account. The dataset must remain encrypted at rest using AWS KMS. The data engineer creates a bucket policy that grants the other account access to the bucket. However, the other account reports that objects appear encrypted and they cannot decrypt them. What is the most likely cause?

A.The KMS key policy does not grant the other account the kms:Decrypt permission

B.The bucket policy does not grant the s3:GetObject permission

C.The other account must use the same KMS key to upload objects

D.The objects are encrypted with SSE-S3, which is not supported for cross-account access

AnswerA

Without decrypt permission on the KMS key, the other account cannot decrypt the objects even if they can download them.

Why this answer

When using SSE-KMS, the bucket policy alone is not enough; the KMS key policy must also grant the consuming account permission to use the key (kms:Decrypt). The bucket policy controls access to the S3 objects, but KMS key policy controls who can decrypt.
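As a rough illustration of the key-policy side, the sketch below builds the kind of statement the key owner might add to the KMS key policy. The account ID and `Sid` are hypothetical placeholders, and the exact set of actions a real workload needs may differ.

```python
import json

# Hypothetical ID of the consuming account (an assumption for illustration).
CONSUMER_ACCOUNT_ID = "111122223333"

# Statement the key owner could add to the KMS key policy.
key_policy_statement = {
    "Sid": "AllowConsumerAccountToDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": f"arn:aws:iam::{CONSUMER_ACCOUNT_ID}:root"},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",  # inside a key policy, "*" refers to the key itself
}

print(json.dumps(key_policy_statement, indent=2))
```

Principals in the consuming account typically also need an IAM policy in their own account that allows `kms:Decrypt` on the key's ARN; the key policy alone only delegates that decision to the other account.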

Full explanation →

580

Multi-Selectmedium

Which TWO options are valid ways to encrypt data at rest in Amazon S3? (Choose two.)

Select 2 answers

A.Client-Side Encryption

B.SSL/TLS Encryption

C.Server-Side Encryption with S3-Managed Keys (SSE-S3)

D.IAM Policy Encryption

E.Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS)

AnswersC, E

SSE-S3 is a server-side encryption option.

Why this answer

Server-Side Encryption with S3-Managed Keys (SSE-S3) is a valid method for encrypting data at rest in Amazon S3: it uses AES-256 to encrypt objects when they are written to S3 and decrypt them when they are accessed, with the encryption keys managed entirely by AWS. It requires no client-side effort beyond setting the `x-amz-server-side-encryption` header to `AES256`. Server-Side Encryption with AWS KMS keys (SSE-KMS) is equally valid and also encrypts on the server side, except that the keys are managed in AWS KMS and the header value is `aws:kms`.

Exam trap

The trap here is that candidates often confuse encryption in transit (SSL/TLS) or client-side encryption with data at rest encryption, or mistakenly think IAM policies can encrypt data, when only server-side encryption options (SSE-S3, SSE-KMS, SSE-C) are valid for encrypting data at rest in S3.

Full explanation →

581

MCQhard

A company has an Amazon Redshift cluster with a mix of frequently accessed hot data and rarely accessed cold data. They want to reduce storage costs without affecting query performance for the hot data. Which strategy is MOST effective?

A.Use RA3 nodes with managed storage to automatically offload cold data to Amazon S3.

B.Reduce the number of nodes and increase the number of slices.

C.Create external tables in Redshift Spectrum to query cold data in S3.

D.Use Dense Compute nodes and unload cold data to Amazon S3 manually.

AnswerA

RA3 nodes use managed storage that automatically moves cold data to S3, reducing local storage costs.

Why this answer

RA3 nodes with managed storage automatically separate compute and storage, offloading cold data to Amazon S3 while keeping hot data on local SSD for fast queries. This reduces storage costs without manual intervention or affecting hot data performance.

Exam trap

The trap here is that candidates may choose Redshift Spectrum (Option C) thinking it automatically offloads cold data, but Spectrum requires manual external table creation and does not integrate with the cluster's automatic storage tiering.

How to eliminate wrong answers

Option B is wrong because reducing nodes and increasing slices does not address cold data storage; it changes cluster configuration without reducing storage costs for cold data. Option C is wrong because creating external tables in Redshift Spectrum allows querying cold data in S3 but does not automatically offload cold data from the cluster; it requires manual data movement and schema management. Option D is wrong because Dense Compute nodes are compute-optimized and do not support managed storage offloading; manually unloading cold data to S3 adds operational overhead and does not leverage automatic tiering.

Full explanation →

582

MCQhard

A data engineer runs this CLI command. Which query is MOST efficient against this table?

A.Query the CustomerIndex GSI by CustomerID and OrderDate.

B.Scan the table to find all orders for a CustomerID.

C.Create a local secondary index on CustomerID.

D.Query by OrderID and filter by OrderDate.

AnswerA

The GSI is designed for this query pattern.

Why this answer

The CLI command likely created a global secondary index (GSI) named CustomerIndex on CustomerID and OrderDate. Querying this GSI directly is the most efficient because it uses the index's sort key to retrieve only the relevant items without scanning the entire table, minimizing read capacity consumption.
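Since the CLI command itself is not shown, the following is only a sketch of what a query against such an index could look like. It builds the keyword arguments one might pass to the boto3 DynamoDB client's `query` call; the table name `Orders`, the attribute types, and the sample values are assumptions.

```python
# Keyword arguments for a DynamoDB Query against the GSI.
# "Orders" and the literal values are hypothetical; CustomerIndex, CustomerID,
# and OrderDate are taken from the answer options.
query_kwargs = {
    "TableName": "Orders",
    "IndexName": "CustomerIndex",
    "KeyConditionExpression": "CustomerID = :cid AND OrderDate >= :start",
    "ExpressionAttributeValues": {
        ":cid": {"S": "C-1001"},
        ":start": {"S": "2024-01-01"},
    },
}

for name, value in query_kwargs.items():
    print(f"{name}: {value}")
```

With credentials configured, these arguments could be passed as `boto3.client("dynamodb").query(**query_kwargs)`. Because both conditions are on the index's key attributes, DynamoDB reads only the matching items instead of filtering after the fact.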

Exam trap

The trap here is that candidates often default to scanning or creating a local secondary index without recognizing that a GSI already exists and is purpose-built for the query pattern, leading to inefficient or invalid solutions.

How to eliminate wrong answers

Option B is wrong because scanning the entire table to find orders for a specific CustomerID is inefficient and costly, as it reads every item rather than using an index to locate the data directly. Option C is wrong because a local secondary index must share the base table's partition key (OrderID), so it cannot make CustomerID the lookup key, and a local secondary index can only be defined when the table is created, not added afterwards. Option D is wrong because a query keyed on OrderID does not match the access pattern of finding a customer's orders, and a filter on OrderDate is applied only after the items are read, so it still consumes read capacity for every item the query touches.

Full explanation →

583

MCQhard

A company is using Amazon DynamoDB with on-demand capacity for a gaming application that experiences unpredictable traffic spikes. The application consistently sees 'ProvisionedThroughputExceededException' errors during spikes. The data engineer needs to resolve this issue without changing the application code. What should the engineer do?

A.Switch the table to on-demand capacity mode

B.Enable DynamoDB Accelerator (DAX) to cache read requests

C.Increase the read capacity units

D.Enable auto scaling for the table with a higher maximum capacity

AnswerA

On-demand mode automatically scales to handle traffic spikes without throttling.

Why this answer

The application is already using on-demand capacity, but the error 'ProvisionedThroughputExceededException' indicates the table is actually in provisioned mode, not on-demand. Switching to on-demand capacity mode eliminates throttling by automatically scaling throughput to match traffic spikes, with no code changes required.

Exam trap

The trap here is that candidates assume the table is already on-demand because the question states 'on-demand capacity,' but the error message 'ProvisionedThroughputExceededException' reveals the table is actually in provisioned mode, testing whether you recognize the mismatch between the stated configuration and the error.

How to eliminate wrong answers

Option B is wrong because DynamoDB Accelerator (DAX) only caches read requests to reduce latency and read load, but it does not resolve write throttling or provisioned throughput exceptions, and the error occurs during spikes regardless of read caching. Option C is wrong because increasing read capacity units only addresses read throughput, not write throughput, and the error is generic to both reads and writes; also, it requires manual intervention and does not handle unpredictable spikes. Option D is wrong because enabling auto scaling with a higher maximum capacity still uses provisioned mode, which can throttle during rapid spikes before scaling triggers, and the question specifies the table is already on-demand (though the error suggests it is not), so auto scaling is unnecessary and would not eliminate throttling for unpredictable traffic.

Full explanation →

584

MCQmedium

A data engineer is using AWS Glue ETL to transform data from an S3 data lake. The job fails with a memory error. Which approach should be used to resolve this issue without major code changes?

A.Rewrite the ETL script in PySpark instead of Scala

B.Change the input file format from CSV to Parquet

C.Increase the number of DPUs allocated to the Glue job

D.Use Amazon EMR instead of AWS Glue

AnswerC

More DPUs provide more memory and parallelism.

Why this answer

Option C is correct because increasing the number of DPUs (Data Processing Units) allocated to the Glue job provides more memory and parallelism. Option A is wrong because rewriting the script in PySpark is a major code change. Option B is wrong because changing the input file format may not address the memory issue.

Option D is wrong because using a different service is unnecessary.

Full explanation →

585

MCQeasy

A company uses Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data delivery is delayed by up to 5 minutes. The engineer wants to reduce the delay to under 1 minute. Which parameter should be adjusted?

A.Enable error logging to CloudWatch.

B.Increase the buffer size in Kinesis Data Firehose.

C.Enable data compression.

D.Decrease the buffer interval in Kinesis Data Firehose.

AnswerD

Lower buffer interval triggers deliveries more frequently.

Why this answer

Option D is correct because lowering the buffer interval reduces the time Firehose waits before delivering a batch, thus reducing latency. Option A is wrong because error logging does not affect delivery delay. Option B is wrong because increasing the buffer size would increase the delay. Option C is wrong because compression has no effect on delivery timing.
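To make the buffering behaviour concrete, here is a small sketch. It assumes Firehose flushes when either the size threshold or the interval threshold is reached, whichever comes first; the ingest rate is a made-up figure, and the dictionaries mirror the shape of the `BufferingHints` setting.

```python
def seconds_until_flush(size_mb: float, interval_s: int, ingest_mb_per_s: float) -> float:
    """Time until a buffer flushes: whichever threshold is hit first."""
    if ingest_mb_per_s <= 0:
        return float(interval_s)
    return min(float(interval_s), size_mb / ingest_mb_per_s)

# Hypothetical settings before and after the change.
current = {"SizeInMBs": 5, "IntervalInSeconds": 300}
proposed = {"SizeInMBs": 5, "IntervalInSeconds": 60}

INGEST_MB_PER_S = 0.01  # assumed low-volume stream

delay_before = seconds_until_flush(current["SizeInMBs"], current["IntervalInSeconds"], INGEST_MB_PER_S)
delay_after = seconds_until_flush(proposed["SizeInMBs"], proposed["IntervalInSeconds"], INGEST_MB_PER_S)
print(delay_before, delay_after)
```

On a low-volume stream the size threshold is rarely reached, so the interval dominates the delay, which is why lowering it is the effective lever here.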

Full explanation →

586

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. The data is then processed by a scheduled AWS Glue ETL job that loads it into an Amazon Redshift table. Recently, the Glue job has been failing with the error: 'S3ServiceException: Access Denied'. The Firehose delivery stream is configured with a prefix and error logging to the same S3 bucket. The Glue job uses the same IAM role that has s3:GetObject and s3:ListBucket permissions on the bucket. What is the most likely cause?

A.The Glue job expects a different data format than what Firehose writes.

B.The Glue job's IAM role does not have s3:GetObjectVersion permission.

C.The Glue job is using the wrong IAM role that does not have permissions to the S3 bucket.

D.The S3 bucket has default encryption enabled with AWS KMS (SSE-KMS), and the Glue job's IAM role lacks kms:Decrypt permission.

AnswerD

SSE-KMS requires kms:Decrypt permission; missing it causes access denied when reading.

Why this answer

Option D is correct because Firehose uses SSE-S3 by default unless configured otherwise. If the S3 bucket has default encryption enabled with SSE-KMS, Firehose will use that encryption, but the Glue job's IAM role may lack kms:Decrypt permission for the KMS key. The error 'Access Denied' when reading from S3 often indicates encryption permission issues.

Option A is wrong because the error is about access, not data format or schema. Option B is wrong because s3:GetObject and s3:ListBucket are enough to read the current versions of objects when no encryption is involved. Option C is wrong because the question states that the Glue job uses the role that has the S3 permissions; what that role may lack is KMS permission.

Full explanation →

587

MCQeasy

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The delivery occasionally fails with 'Firehose is throttled'. What should be done to reduce throttling?

A.Enable compression on the Firehose delivery stream

B.Increase the buffer size and buffer interval

C.Decrease the buffer size to flush more frequently

D.Increase the number of shards in the Kinesis stream

AnswerB

Larger buffer reduces the number of write requests.

Why this answer

Throttling here occurs when Firehose issues write requests to the destination too frequently. Increasing the buffer size or interval lets Firehose accumulate more records before each delivery, so it writes fewer, larger objects and makes fewer write requests to S3.

Full explanation →

588

MCQmedium

A data engineering team is designing a data ingestion pipeline that will receive millions of small JSON files per hour from external partners via API. The files should be stored in Amazon S3 and then transformed into Parquet for querying. Which approach is MOST cost-effective and scalable?

A.Use Amazon Kinesis Data Firehose to buffer and deliver data to S3, then use AWS Glue to convert to Parquet.

B.Use AWS Lambda to process each file as it arrives and write to S3.

C.Use AWS Direct Connect to establish a dedicated network for file uploads.

D.Use Amazon EMR to process the files as they arrive in S3.

AnswerA

Firehose can ingest high throughput, buffer, and deliver to S3; Glue can run scheduled conversions.

Why this answer

Option A is correct because Kinesis Data Firehose can ingest streaming data, buffer it, and deliver it in batches to S3, and a Glue job can then convert it to Parquet. Option B is wrong because Lambda has a payload limit and is not cost-effective for millions of files. Option C is wrong because Direct Connect provides a dedicated network connection; it is not an ingestion mechanism for API traffic.

Option D is wrong because EMR is overkill and more expensive.

Full explanation →

589

MCQeasy

A company is using Amazon S3 to store sensitive data. They need to automatically transition objects to S3 Glacier Deep Archive after 90 days and delete them after 7 years. Which S3 lifecycle configuration action should be used?

A.Transition

B.AbortIncompleteMultipartUpload

C.Expiration

D.NoncurrentVersionExpiration

AnswerA

Transition moves objects to another storage class based on age.

Why this answer

Option A is correct because the S3 lifecycle 'Transition' action is specifically designed to move objects between storage classes. To automatically move objects to S3 Glacier Deep Archive after 90 days, you define a transition rule with a 'Days' value of 90 and a 'StorageClass' of 'DEEP_ARCHIVE'. This action directly meets the requirement for transitioning data to a colder storage tier.
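A lifecycle rule covering both requirements might look roughly like the sketch below. The rule ID is hypothetical, the empty prefix applies the rule to the whole bucket, and 7 years is approximated as 7 × 365 days.

```python
import json

SEVEN_YEARS_IN_DAYS = 7 * 365  # approximation that ignores leap days

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-delete",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},     # apply to every object in the bucket
            "Transitions": [
                {"Days": 90, "StorageClass": "DEEP_ARCHIVE"}
            ],
            "Expiration": {"Days": SEVEN_YEARS_IN_DAYS},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

The `Transitions` entry is the action the question asks about; the `Expiration` entry handles the later deletion, and the two coexist in one rule.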

Exam trap

The trap here is that candidates often confuse 'Expiration' with 'Transition', thinking that deleting objects after a period is the same as moving them to a colder storage class, but expiration deletes data while transition preserves it in a different tier.

How to eliminate wrong answers

Option B is wrong because 'AbortIncompleteMultipartUpload' is used to abort multipart uploads that are not completed within a specified number of days; it does not transition or delete objects. Option C is wrong because 'Expiration' is used to delete objects after a specified time period, but the question requires a transition to Glacier Deep Archive after 90 days, not deletion at that point; expiration would delete the objects prematurely. Option D is wrong because 'NoncurrentVersionExpiration' is used to delete noncurrent versions of versioned objects, not to transition or delete current objects based on age.

Full explanation →

590

Multi-Selectmedium

A data engineer is troubleshooting an AWS Glue job that fails with 'java.lang.OutOfMemoryError: Java heap space'. The job processes a large dataset. Which TWO configuration changes should the engineer consider to resolve this issue? (Choose TWO.)

Select 2 answers

A.Change the output format from Parquet to CSV.

B.Increase the Spark shuffle partitions configuration (spark.sql.shuffle.partitions).

C.Reduce the number of partitions in the source data.

D.Increase the number of DPUs allocated to the Glue job.

E.Disable job bookmarks to avoid incremental processing.

AnswersB, D

More partitions reduce data per partition, lowering memory usage.

Why this answer

Options B and D are correct. B: Increasing the Spark shuffle partitions reduces the amount of data shuffled per partition, reducing memory pressure. D: Increasing the number of DPUs provides more memory for the job. Option A is wrong because changing the output format does not directly address heap space. Option C is wrong because reducing the number of partitions in the source data may increase partition size. Option E is wrong because disabling job bookmarks may cause reprocessing but does not fix memory.
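The effect of the shuffle-partition setting can be seen with simple arithmetic. The sketch below assumes a hypothetical 100 GB shuffle spread evenly across partitions (real shuffles are often skewed) and compares Spark's default of 200 partitions for `spark.sql.shuffle.partitions` with a higher value.

```python
def mb_per_shuffle_partition(shuffle_mb: float, partitions: int) -> float:
    """Average data per shuffle partition, assuming an even distribution."""
    return shuffle_mb / partitions

SHUFFLE_MB = 100 * 1024  # hypothetical 100 GB shuffle

default_size = mb_per_shuffle_partition(SHUFFLE_MB, 200)  # Spark's default setting
tuned_size = mb_per_shuffle_partition(SHUFFLE_MB, 800)    # assumed tuned value

print(f"200 partitions: {default_size} MB each")
print(f"800 partitions: {tuned_size} MB each")
```

Smaller partitions mean each task holds less data in memory at once, which is the mechanism behind option B.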

Full explanation →

591

MCQeasy

A data engineer is setting up a data pipeline to ingest data from an Amazon RDS for MySQL database into Amazon S3 using AWS Glue ETL. The Glue job uses a JDBC connection to read from the MySQL database. The job runs successfully, but the engineer notices that the job is taking longer than expected. The MySQL database is 500 GB in size and the Glue job uses 10 workers of type G.1X. The engineer wants to improve the performance of the extraction phase. The database is actively used by other applications, so the engineer must minimize the impact on the source database. Which approach should the engineer take?

A.Partition the table by a numeric column, such as the primary key, and use the 'hashexpression' and 'hashpartitions' options in the Glue JDBC connection.

B.Use an incremental extraction strategy with a watermark column to reduce the amount of data read each time.

C.Create a read replica of the MySQL database and configure the Glue job to read from the replica.

D.Increase the number of Glue workers to 20 to increase parallelism.

AnswerA

Partitioning enables parallel reads, and each query fetches only a bounded chunk of the table rather than the whole table at once.

Why this answer

Option A is correct because partitioning the table on a key column (e.g., the primary key) allows Glue to read in parallel from multiple partitions, reducing the load on the database and improving performance. Option B is wrong because incremental loads are for ongoing changes, not the initial extraction. Option C is wrong because a read replica offloads the read traffic but still requires efficient partitioning; setting one up also adds cost and complexity. Option D is wrong because increasing workers without partitioning may overwhelm the database with connections.
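AWS Glue documents its JDBC parallel-read options as `hashexpression` (or `hashfield`) together with `hashpartitions`. The sketch below shows connection options of that shape; the host, table, column, and credential values are all hypothetical. The modulo helper only illustrates how rows could be divided among readers and is not Glue's actual query generation.

```python
# Hypothetical connection details; none of these names come from the question.
connection_options = {
    "url": "jdbc:mysql://example-host:3306/sales",
    "dbtable": "orders",
    "user": "glue_reader",
    "password": "<from-secrets-manager>",
    "hashexpression": "order_id",  # numeric column used to split the reads
    "hashpartitions": "10",        # number of parallel JDBC reads
}

def bucket_for(row_id: int, partitions: int) -> int:
    """Illustrative only: assign a row to one of N read partitions by modulo."""
    return row_id % partitions

# Spread 20 sample row IDs across 4 readers to show the even split.
buckets = {}
for row_id in range(1, 21):
    buckets.setdefault(bucket_for(row_id, 4), []).append(row_id)

print(buckets)
```

In a Glue script these options would typically be passed to `glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_options)`.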

Full explanation →

592

MCQmedium

A company uses AWS Glue ETL jobs to transform data from Amazon RDS to Amazon S3 daily. The job recently started failing with memory errors. The data volume has grown 3x in the past month. Which change should the data engineer make to resolve the issue?

A.Increase the size of the Amazon RDS instance

B.Switch the Glue job type from Python Shell to Spark

C.Partition the output data in Amazon S3 by date

D.Increase the number of DPUs allocated to the Glue job

AnswerD

More DPUs provide more memory to handle larger data volumes.

Why this answer

Option D is correct because increasing the number of DPUs (Data Processing Units) in the AWS Glue job provides more memory and compute resources, which can handle larger data volumes. Option A (increasing RDS instance size) may not help if the bottleneck is the Glue job. Option B (switching to Spark) is not directly relevant; Glue already uses Spark.

Option C (partitioning S3 output) improves query performance but does not fix memory errors during transformation.

Full explanation →

593

Multi-Selecthard

A company is streaming IoT sensor data from thousands of devices into Amazon Kinesis Data Firehose. The data is then delivered to Amazon S3 for long-term storage. Occasionally, some records fail to be delivered to S3. The company must capture and analyze these failed records. Which TWO actions should be taken? (Choose two.)

Select 2 answers

A.Configure an AWS Lambda function as a pre-processing step to catch and log failed records.

B.Use Amazon Kinesis Data Analytics to analyze the failed records in real time.

C.Send failed records to an Amazon Kinesis Data Stream for reprocessing.

D.Enable Amazon CloudWatch Logs for Kinesis Data Firehose to capture delivery errors.

E.Set up an S3 event notification to trigger a Lambda function to reprocess failed records.

AnswersA, D

Lambda can handle errors during transformation and log them.

Why this answer

A is correct because a Lambda function configured as a data transformation and error handler can process and log failed records. D is correct because enabling Kinesis Data Firehose error logging to CloudWatch allows monitoring of delivery failures. B is wrong because Kinesis Data Analytics is for real-time analytics, not for handling delivery failures. C is wrong because Kinesis Data Streams is a different service and does not directly capture Firehose failures. E is wrong because S3 is the destination, not the source for reprocessing.

Full explanation →

594

Multi-Selecthard

A company needs to enforce encryption at rest for all data stored in Amazon S3. The security team wants to ensure that no objects can be uploaded without encryption. Which THREE steps should be taken to meet this requirement?

Select 3 answers

A.Require all clients to use AWS CloudTrail for logging

B.Enable Amazon S3 Transfer Acceleration

C.Use AWS Key Management Service (KMS) to manage encryption keys

D.Create an S3 bucket policy that denies s3:PutObject if the x-amz-server-side-encryption header is not present

E.Enable default encryption on the S3 bucket using SSE-S3

AnswersC, D, E

SSE-KMS is a valid option for encryption at rest, and enforcing its use can be part of the policy.

Why this answer

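A bucket policy that denies s3:PutObject when the x-amz-server-side-encryption header is absent (option D) enforces encryption on upload. Default encryption with SSE-S3 (option E) or with keys managed in AWS KMS (option C) ensures encryption at rest. SSE-C is not recommended for most cases. Requiring HTTPS ensures encryption in transit, not at rest. CloudTrail (option A) is for auditing, and S3 Transfer Acceleration (option B) speeds up transfers without encrypting anything at rest.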
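A deny statement of this kind is commonly written with a `Null` condition on the encryption header. The sketch below builds such a policy; the bucket name and `Sid` are hypothetical.

```python
import json

BUCKET = "example-bucket"  # hypothetical bucket name

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            # "Null": "true" matches requests where the header is missing.
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

Because an explicit deny overrides any allow, uploads without the header are rejected regardless of what other permissions the caller holds.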

Full explanation →

595

Multi-Selecthard

A data engineer needs to transform data in Amazon S3 using AWS Glue. The job must handle schema evolution and partition pruning. Which THREE features should be used?

Select 3 answers

A.AWS Glue Data Catalog

B.AWS Glue job bookmarks

C.AWS Glue FindMatches transform

D.AWS Glue crawlers

E.Partition indexes

AnswersA, D, E

The Data Catalog stores schema and partition metadata.

Why this answer

Options A, D, and E are correct. Glue crawlers update the schema, partition indexes enable pruning, and the Data Catalog stores metadata. Option B is wrong because job bookmarks are for state tracking, not schema evolution. Option C is wrong because FindMatches is for deduplication.

Full explanation →

596

MCQhard

A data engineer is designing a data ingestion pipeline for a social media analytics platform. The pipeline must ingest tweets in real-time, perform sentiment analysis, and store results in Amazon S3. The sentiment analysis is compute-intensive and must be done as the data arrives. The estimated throughput is 10,000 tweets per second. Which architecture is most suitable?

A.Amazon SQS with AWS Lambda pollers to process tweets and store in S3.

B.Amazon EMR with Spark Streaming to process tweets and write to S3.

C.Amazon Kinesis Data Streams with Amazon Kinesis Data Analytics for sentiment analysis, then Kinesis Data Firehose to S3.

D.Amazon API Gateway with AWS Lambda to process each tweet and store in S3.

AnswerC

Scalable real-time stream processing.

Why this answer

Option C is the most suitable because Amazon Kinesis Data Streams ingests up to 1,000 records per second (or 1 MB per second) per shard and scales by adding shards, so 10,000 tweets per second needs at least 10 shards. Kinesis Data Analytics provides managed, low-latency stream processing for the compute-intensive sentiment analysis using SQL or Apache Flink. Kinesis Data Firehose then reliably buffers and writes the processed results to Amazon S3 without custom code, ensuring near-real-time delivery.
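Shard counts follow from the per-shard write limits, which AWS documents as 1,000 records per second or 1 MiB per second, whichever is reached first. The sketch below sizes a stream for the stated throughput; the average tweet sizes are assumptions, since the question does not give one.

```python
import math

RECORDS_PER_SHARD_PER_S = 1_000
BYTES_PER_SHARD_PER_S = 1024 * 1024  # 1 MiB per second per shard

def shards_needed(records_per_s: int, avg_record_bytes: int) -> int:
    """Minimum provisioned shards to stay under both write limits."""
    by_count = math.ceil(records_per_s / RECORDS_PER_SHARD_PER_S)
    by_bytes = math.ceil(records_per_s * avg_record_bytes / BYTES_PER_SHARD_PER_S)
    return max(by_count, by_bytes, 1)

small_tweets = shards_needed(10_000, 500)    # assumed 500-byte records
large_tweets = shards_needed(10_000, 2_000)  # assumed 2,000-byte records
print(small_tweets, large_tweets)
```

With small records the record-count limit is the binding constraint; once records grow, the byte limit takes over and the shard count rises accordingly.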

Exam trap

The trap here is that candidates often choose SQS+Lambda (Option A) for simplicity, underestimating the throughput ceiling and polling overhead, while overlooking Kinesis Data Analytics as the only AWS-managed service that natively supports real-time, compute-intensive stream processing without custom infrastructure.

How to eliminate wrong answers

Option A is wrong because Amazon SQS with Lambda pollers introduces polling latency and cannot efficiently handle 10,000 tweets per second; Lambda has a maximum concurrency limit and SQS batch sizes are capped at 10 messages, leading to throttling and backpressure. Option B is wrong because Amazon EMR with Spark Streaming is designed for large-scale batch and micro-batch processing, not for true real-time, per-record sentiment analysis at 10,000 TPS; it incurs startup overhead and is better suited for historical analysis. Option D is wrong because Amazon API Gateway with Lambda processes each tweet synchronously, which cannot sustain 10,000 requests per second without aggressive throttling and cold starts; it also lacks built-in stream buffering and ordering for real-time ingestion.

Full explanation →

597

MCQhard

A data engineer runs an AWS Glue crawler that is configured to crawl an S3 bucket named 'my-data-lake' and update the Glue Data Catalog. The crawler fails with an access denied error. The IAM role attached to the crawler has the policy shown in the exhibit. What is the likely cause of the failure?

A.The policy does not allow glue:CreateTable on the 'my-data-lake' database.

B.The policy does not allow s3:PutObject on the 'my-data-lake' bucket.

C.The policy does not allow logging to CloudWatch Logs.

D.The policy does not allow s3:ListBucket on the 'my-data-lake' bucket.

AnswerC

Glue crawlers require permissions to create log groups and streams and write logs; the policy lacks logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents.

Why this answer

Option C is correct because the crawler needs permission to write logs to CloudWatch Logs; the policy does not include logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents. Option A is wrong because the policy includes glue:CreateTable and glue:UpdateTable. Option B is wrong because the crawler does not need to write to S3. Option D is wrong because the policy includes s3:ListBucket on the bucket and s3:GetObject on the bucket contents.

Full explanation →

598

MCQeasy

A data engineer receives an alert that an AWS KMS key has been scheduled for deletion by mistake. What is the immediate action to prevent the key from being deleted?

A.Cancel the key deletion from the KMS console or API.

B.Create a new KMS key and re-encrypt the data.

C.Wait for the key to be deleted and restore it from backup.

D.Disable the key immediately to stop usage.

AnswerA

Canceling deletion restores the key.

Why this answer

Option A is correct because deletion of a KMS key in the pending-deletion state can be canceled, which returns the key to the Disabled state so it can be re-enabled. Option B is wrong because creating a new key does not restore the old one. Option C is wrong because a deleted key cannot be restored; the waiting period is 7 to 30 days, and the deletion can be canceled during that time. Option D is wrong because disabling the key does not cancel deletion.
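In code, the recovery is two calls: cancel the scheduled deletion, then re-enable the key, since a key whose deletion is canceled comes back disabled. The sketch below uses the boto3 KMS method names `cancel_key_deletion` and `enable_key`; the recording class is a stand-in so the example runs without AWS credentials, and the key ID is hypothetical.

```python
class RecordingKmsClient:
    """Stand-in for a KMS client that just records the calls it receives."""

    def __init__(self):
        self.calls = []

    def cancel_key_deletion(self, KeyId):
        self.calls.append(("cancel_key_deletion", KeyId))

    def enable_key(self, KeyId):
        self.calls.append(("enable_key", KeyId))


def recover_key(kms_client, key_id: str) -> None:
    """Cancel a pending deletion, then re-enable the key for use."""
    kms_client.cancel_key_deletion(KeyId=key_id)
    kms_client.enable_key(KeyId=key_id)


demo_client = RecordingKmsClient()
recover_key(demo_client, "1234abcd-hypothetical-key-id")
print(demo_client.calls)
```

Against a real account, the same function could be called with `boto3.client("kms")` in place of the stand-in.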

Full explanation →

599

MCQmedium

A logistics company ingests real-time GPS location data from thousands of delivery vehicles into Amazon Kinesis Data Streams. Each vehicle sends a JSON payload every 10 seconds containing vehicle_id, latitude, longitude, timestamp, and speed. The data must be stored in Amazon S3 for historical analysis, but the company wants to first aggregate the data per vehicle per minute (average speed, min/max coordinates) to reduce storage costs. The solution must be serverless and handle potential duplicate records without double-counting. What should the engineer do?

A.Use Amazon EMR with Spark Streaming to perform the aggregation and write to S3.

B.Use Amazon Kinesis Data Analytics for Apache Flink to aggregate data in a 1-minute tumbling window with deduplication logic, then output to Kinesis Data Firehose for delivery to S3.

C.Use Kinesis Data Firehose with a Lambda transformation to aggregate records in a 1-minute window.

D.Use an AWS Glue streaming ETL job with Spark Structured Streaming to aggregate and deduplicate.

AnswerB

Flink supports windowed aggregations and stateful deduplication; Firehose delivers to S3.

Why this answer

Option B is correct: Kinesis Data Analytics for Apache Flink can aggregate data in real time using tumbling windows and deduplication. Option A (EMR) is not serverless. Option C (Firehose with Lambda) would require per-record processing and state management. Option D (Glue streaming) adds latency.
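The windowing and deduplication logic can be sketched in plain Python to show what the streaming job has to compute. This is not Flink code: it assumes timestamps are epoch seconds and treats the pair (`vehicle_id`, `timestamp`) as the duplicate key, both of which are assumptions about the payload.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def aggregate_per_minute(records):
    """Deduplicate records, then aggregate per vehicle per 1-minute tumbling window."""
    seen = set()
    windows = defaultdict(list)
    for rec in records:
        dedup_key = (rec["vehicle_id"], rec["timestamp"])
        if dedup_key in seen:
            continue  # drop the duplicate so it is not double-counted
        seen.add(dedup_key)
        window_start = rec["timestamp"] - rec["timestamp"] % WINDOW_SECONDS
        windows[(rec["vehicle_id"], window_start)].append(rec)

    result = {}
    for key, recs in windows.items():
        result[key] = {
            "avg_speed": sum(r["speed"] for r in recs) / len(recs),
            "min_latitude": min(r["latitude"] for r in recs),
            "max_latitude": max(r["latitude"] for r in recs),
            "min_longitude": min(r["longitude"] for r in recs),
            "max_longitude": max(r["longitude"] for r in recs),
            "count": len(recs),
        }
    return result

sample = [
    {"vehicle_id": "v1", "timestamp": 0, "latitude": 1.0, "longitude": 2.0, "speed": 10},
    {"vehicle_id": "v1", "timestamp": 10, "latitude": 1.5, "longitude": 2.5, "speed": 20},
    {"vehicle_id": "v1", "timestamp": 10, "latitude": 1.5, "longitude": 2.5, "speed": 20},  # duplicate
    {"vehicle_id": "v1", "timestamp": 60, "latitude": 2.0, "longitude": 3.0, "speed": 30},
]
summary = aggregate_per_minute(sample)
print(summary)
```

A streaming engine has to keep the `seen` set and the open windows as managed state, which is why a stateful processor fits this requirement better than a per-record transformation.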

Full explanation →

600

MCQmedium

A data engineer is designing a data pipeline that ingests personally identifiable information (PII) into Amazon Redshift. The engineer needs to ensure that only authorized users can view the data, and that all queries are logged for auditing. Which combination of AWS services should the engineer use?

A.AWS CloudTrail and Amazon Redshift audit logging

B.AWS IAM Access Analyzer and Amazon Redshift audit logging

C.Amazon S3 access logs and AWS CloudTrail

D.AWS CloudTrail and Amazon CloudWatch Logs

AnswerD

CloudTrail logs API calls; CloudWatch Logs can capture Redshift audit logs.

Why this answer

Option D is correct because AWS CloudTrail logs Redshift API calls, and Amazon CloudWatch Logs can receive the Amazon Redshift audit logs that capture SQL queries. Option A is close, but Redshift audit logging is a feature of Redshift rather than a separate service; the logs it produces can be delivered to CloudWatch Logs. Option B is wrong because IAM Access Analyzer does not log queries. Option C is wrong because S3 access logs do not capture Redshift queries.

Full explanation →
