AWS Certified Data Engineer Associate (DEA-C01): Questions 151-225

1786 questions total · 24 pages · All types, answers revealed

Page 3 of 24

151
MCQeasy

A company wants to audit all changes to IAM policies in their AWS account. Which AWS service should be used to record these changes?

A.AWS Config
B.AWS CloudTrail
C.Amazon GuardDuty
D.Amazon Inspector
AnswerA

Config tracks configuration changes and can record IAM policy changes.

Why this answer

AWS Config records configuration changes to AWS resources, including IAM policies. CloudTrail records API calls, not configuration snapshots. GuardDuty is for threat detection.

Inspector is for vulnerability assessment.

152
MCQhard

A data engineer is designing a data lake on Amazon S3 for a healthcare organization that must comply with HIPAA regulations. The data includes protected health information (PHI) and must be encrypted at rest. The organization requires that all encryption keys be managed by AWS and rotated automatically every year. Additionally, the data must be replicated to another AWS Region for disaster recovery. Which combination of S3 features should the engineer use to meet these requirements?

A.Use SSE-S3 with S3 Same-Region Replication (SRR).
B.Use SSE-S3 with S3 Cross-Region Replication (CRR).
C.Use SSE-KMS with S3 Cross-Region Replication (CRR).
D.Use SSE-C with S3 Cross-Region Replication (CRR).
AnswerB

SSE-S3 provides AWS-managed keys with automatic rotation; CRR replicates to another region.

Why this answer

Option B is correct because SSE-S3 uses AWS-managed keys that are rotated automatically, and S3 Cross-Region Replication (CRR) replicates objects to another Region. Option A is incorrect because S3 Same-Region Replication (SRR) does not replicate to another Region. Option C is incorrect because SSE-KMS is normally chosen when the customer needs control over the key in AWS KMS, which the requirements do not call for.

Option D is incorrect because SSE-C uses customer-provided keys, not AWS-managed.

153
Multi-Selecteasy

A company is using AWS Glue to catalog data in Amazon S3. The data is stored in CSV format, but the schema is not consistent across all files. Which TWO actions can the company take to handle schema evolution and ensure the Glue Data Catalog is up to date? (Choose TWO.)

Select 2 answers
A.Configure the Glue crawler to update the table schema on each run.
B.Manually update the Glue Data Catalog tables whenever the schema changes.
C.Disable schema update in the crawler and add partitions manually.
D.Schedule the Glue crawler to run periodically to detect changes.
E.Require all data producers to use a single fixed schema.
AnswersA, D

This allows the crawler to automatically detect and apply schema changes.

Why this answer

Options A and D are correct. Configuring the Glue crawler to update the table schema on each run (A) lets the crawler modify the existing table definition when new columns are found. Running the crawler on a schedule (D) ensures that changes are captured periodically.

Option B (manual schema update) is not automated. Option C (disabling schema update) would leave the catalog out of date. Option E (a single fixed schema) is not practical for evolving data.

154
MCQhard

A data engineer runs an AWS Glue ETL job that reads from a large Amazon S3 source (several terabytes of CSV files) and writes transformed data to an S3 bucket in Parquet format. The job fails with the error shown in the exhibit. The job uses the Standard worker type with 10 workers (G.1X). The engineer needs to resolve the failure with minimal cost increase. What should the engineer do?

A.Increase the number of workers to 20 while keeping G.1X worker type.
B.Change the worker type to G.2X with 10 workers.
C.Change the worker type to G.4X with 10 workers.
D.Set the 'coalesce' parameter to reduce the number of output files.
AnswerB

G.2X provides double the memory (32 GB) per worker compared to G.1X (16 GB), resolving the heap space error with minimal cost increase.

Why this answer

Option B is correct because the OutOfMemoryError indicates that the Spark executors do not have enough memory; switching to G.2X workers doubles memory per worker (from 16 GB to 32 GB) without increasing the number of workers. Option A is wrong because adding G.1X workers spreads the data across more executors, but each executor still has the same limited memory, so the per-executor memory issue may persist. Option D is wrong because reducing the number of output files (coalesce) may not help if the issue is per-task memory.

Option C is wrong because using the G.2X worker type with more memory per worker is likely sufficient; G.4X may be unnecessary and more expensive.

155
Multi-Selecthard

A company uses AWS DMS to continuously replicate data from an on-premises SQL Server to Amazon Aurora MySQL. The replication lag is increasing. Which THREE actions can reduce the lag? (Choose three.)

Select 3 answers
A.Use parallel apply on the target endpoint.
B.Filter out unnecessary tables from replication.
C.Enable DMS validation.
D.Enable Multi-AZ for the DMS replication instance.
E.Increase the DMS replication instance size.
AnswersA, B, E

Parallel apply speeds up writes on the target.

156
MCQeasy

A data analyst needs to query a large Amazon S3 bucket containing CSV files using Amazon Athena. The bucket has millions of small files (less than 1 MB each). The analyst reports that queries are very slow and often time out. The data is partitioned by date and the partition columns are defined in the table. What is the most effective way to improve query performance?

A.Convert the files to Apache Parquet format using an AWS Glue ETL job.
B.Run a compaction job to consolidate small files into fewer larger files (e.g., 128 MB each).
C.Add more partitions by including hour and minute as partition keys.
D.Use S3 Select to push down filtering to S3 before Athena processes the data.
AnswerB

Consolidating small files reduces the overhead of listing and reading many objects, significantly improving Athena performance.

Why this answer

Option B is correct because many small files cause high overhead for listing and opening objects in Athena. Compacting small files into fewer larger files (e.g., 128 MB each) reduces the number of I/O operations and improves performance. Option A is wrong because file format conversion (e.g., to Parquet) helps but does not by itself address the small-file problem.

Option C is wrong because adding more partitions would increase overhead. Option D is wrong because S3 Select retrieves a subset of data from a single object; it does not optimize queries over many files.

157
MCQmedium

A company runs a data warehouse on Amazon Redshift. Queries are slow, and the team suspects data distribution is skewed. Which approach would best help identify distribution skew?

A.Check the STL_LOAD_ERRORS table for load failures
B.Query the SVV_TABLE_INFO table to see table size
C.Query the SVV_DISKUSAGE table to examine data distribution across slices
D.Review the WLM configuration in the parameter group
AnswerC

SVV_DISKUSAGE provides per-slice disk usage, helping identify skew.

Why this answer

Option C is correct because the SVV_DISKUSAGE table provides per-slice data distribution information, allowing you to identify skew by comparing the number of blocks allocated to each slice for a given table. In Amazon Redshift, data is distributed across slices based on the distribution key, and significant variation in block counts across slices indicates distribution skew, which can cause query performance degradation due to uneven workload distribution.

Exam trap

The trap here is that candidates confuse table-level metadata (SVV_TABLE_INFO) with slice-level distribution data (SVV_DISKUSAGE), assuming overall table size alone can reveal skew, when in fact only per-slice block counts expose uneven data distribution.

How to eliminate wrong answers

Option A is wrong because STL_LOAD_ERRORS records errors during COPY or INSERT operations, such as data type mismatches or malformed data, and has no relation to data distribution skew. Option B is wrong because SVV_TABLE_INFO shows overall table size, row count, and compression ratios, but it does not provide per-slice data distribution details needed to identify skew. Option D is wrong because WLM configuration in the parameter group manages query concurrency and memory allocation, not data distribution or skew detection.
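
The per-slice comparison described above can be sketched in a few lines. This is a minimal illustration, not Redshift tooling: the block counts are hypothetical sample data, the `skew_ratio` helper is invented for this example, and the SQL in the comment is only an assumed way to obtain such counts from SVV_DISKUSAGE.

```python
def skew_ratio(blocks_per_slice):
    """Return max/min block count across slices; values near 1.0 suggest even distribution."""
    counts = list(blocks_per_slice.values())
    smallest = min(counts)
    if smallest == 0:
        return float("inf")
    return max(counts) / smallest

# Hypothetical per-slice block counts, as might be produced by a query such as:
#   SELECT slice, COUNT(*) FROM svv_diskusage WHERE name = '<table>' GROUP BY slice;
balanced = {0: 100, 1: 98, 2: 102, 3: 100}
skewed = {0: 400, 1: 20, 2: 25, 3: 20}

print(f"balanced: {skew_ratio(balanced):.2f}")
print(f"skewed:   {skew_ratio(skewed):.2f}")
```

A ratio far above 1 means one slice holds many more blocks than another, which is the pattern the question calls distribution skew.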

158
MCQhard

An organization is using AWS Glue to process sensitive data. The data is stored in S3 with server-side encryption using AWS KMS (SSE-KMS). The Glue job fails with an error indicating that it cannot read the data. The IAM role used by Glue has the following policy. What is missing?

A.The s3:GetObject permission on the bucket
B.The kms:Decrypt permission on the KMS key
C.The kms:GenerateDataKey permission on the KMS key
D.The kms:ReEncrypt permission on the KMS key
AnswerB

Glue needs decrypt permission to read encrypted data.

Why this answer

Option B is correct because the Glue job needs kms:Decrypt permission to read data encrypted with SSE-KMS. Option A is wrong because s3:GetObject is present. Option C is wrong because kms:GenerateDataKey is for writing, not reading.

Option D is wrong because kms:ReEncrypt is not needed.

159
MCQmedium

A company uses Amazon Kinesis Data Streams to ingest real-time data. The compliance team requires that all data in the stream be encrypted at rest. Which configuration should be enabled?

A.Enable TLS encryption on the Kinesis stream
B.Enable server-side encryption using an AWS KMS key
C.Use client-side encryption in the producer application
D.Store the data in Amazon CloudWatch Logs instead
AnswerB

Kinesis supports SSE with KMS.

Why this answer

Option B is correct. Kinesis Data Streams supports server-side encryption with an AWS KMS key. Option A is wrong because TLS protects data in transit, not at rest.

Option C is wrong because client-side encryption is not a built-in stream configuration; it has to be implemented in the producer. Option D is wrong because moving the data to CloudWatch Logs does not encrypt the stream.

160
MCQeasy

A data engineer needs to ingest JSON data from an on-premises relational database into Amazon S3 every hour. Which AWS service should be used to set up a scheduled, incremental data transfer?

A.Amazon S3 Transfer Acceleration with a cron job.
B.AWS Database Migration Service (DMS) with S3 as target.
C.AWS Glue with a JDBC connection and a scheduled crawler.
D.Amazon Kinesis Data Firehose with a database source.
AnswerB

DMS supports scheduled, incremental transfers from databases to S3.

Why this answer

AWS DMS is purpose-built for migrating databases to AWS targets, including Amazon S3. It supports ongoing replication (change data capture) and scheduled full-load tasks, making it ideal for hourly incremental transfers from an on-premises relational database to S3 without custom scripting.

Exam trap

The trap here is that candidates confuse AWS Glue's ETL capabilities with DMS's managed database migration, assuming Glue's JDBC connections can handle incremental transfers, but Glue lacks built-in change data capture and requires custom logic for scheduled incremental loads.

How to eliminate wrong answers

Option A is wrong because S3 Transfer Acceleration only speeds up uploads over long distances via edge locations; it does not provide scheduling, incremental data capture, or database connectivity. Option C is wrong because AWS Glue crawlers are designed for schema discovery and metadata cataloging, not for scheduled incremental data transfer from a database to S3; Glue ETL jobs can do this but require custom code, whereas DMS is the managed service for database migration. Option D is wrong because Kinesis Data Firehose ingests streaming data from producers like Kinesis streams or direct PUT, not from a relational database via JDBC; it lacks built-in change data capture for incremental database loads.

161
MCQhard

A data engineer is troubleshooting an Amazon DynamoDB table that has frequent throttling exceptions for write requests. The table has auto scaling enabled. What is the most likely cause?

A.The partition key is causing a hot partition
B.The table's read capacity is set too low
C.The table's auto scaling is disabled
D.The table is using global tables without conflict resolution
AnswerA

Hot partitions throttle even if overall capacity is sufficient.

Why this answer

Hot partition is a common cause where a single partition key receives a disproportionate amount of writes, exhausting that partition's capacity. Auto scaling adjusts total capacity, not partition-level distribution.
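
A toy model can make the point concrete. The sketch below assumes a fixed per-partition write limit and a simple modulo mapping of keys to partitions; both are simplifications for illustration and do not reproduce DynamoDB's actual hashing or adaptive capacity behavior.

```python
from collections import Counter

# Assumption for this model: each partition accepts at most this many writes per second.
PER_PARTITION_WRITE_LIMIT = 1000

def throttled_writes(writes_by_key, num_partitions):
    """Count writes rejected when integer keys are mapped to partitions by modulo."""
    load = Counter()
    for key, writes in writes_by_key.items():
        load[key % num_partitions] += writes
    return sum(max(0, n - PER_PARTITION_WRITE_LIMIT) for n in load.values())

# Both workloads send 4,000 writes per second to a table with 4 partitions.
even_workload = {0: 1000, 1: 1000, 2: 1000, 3: 1000}
hot_workload = {0: 3400, 1: 200, 2: 200, 3: 200}

print(throttled_writes(even_workload, 4))
print(throttled_writes(hot_workload, 4))
```

In this model the total request rate is identical in both cases, yet only the workload concentrated on one key is throttled, which is why raising total capacity through auto scaling does not fix a hot key.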

162
Matchingmedium

Match each AWS data analytics service to its primary function.

Concepts
Matches

Serverless SQL query on S3

Business intelligence and dashboards

Data lake setup and access control

Real-time SQL on streaming data

Query data in S3 from Redshift

Why these pairings

Analytics services cover different needs.

163
MCQmedium

A company is running a data warehouse on Amazon Redshift. The data engineering team notices that query performance has degraded over time. They suspect that data distribution is causing excessive data movement between nodes. The table is joined frequently on the customer_id column. Which column should be chosen as the distribution key to optimize join performance?

A.AUTO distribution
B.customer_id
C.order_date
D.EVEN distribution
AnswerB

Distributing on the join column reduces data movement.

Why this answer

The correct answer is B (customer_id) because Redshift distributes data across nodes based on the distribution key. When two tables are joined on customer_id, using it as the distribution key ensures that matching rows from both tables are co-located on the same node, eliminating the need for data redistribution (broadcast or shuffle) during the join. This minimizes network traffic and reduces query latency, directly addressing the performance degradation caused by excessive data movement.

Exam trap

The trap here is that candidates may choose EVEN distribution (D) thinking it balances data evenly, but they overlook that it causes maximum data movement for joins, while AUTO distribution (A) seems safe but does not guarantee co-location for the specific join column.

How to eliminate wrong answers

Option A (AUTO distribution) is wrong because AUTO lets Redshift choose the distribution style based on table size and usage patterns, but it may not guarantee co-location for frequent joins on customer_id, potentially still causing data movement. Option C (order_date) is wrong because it is not the join column; using it as the distribution key would scatter customer_id values across nodes, forcing redistribution for every join on customer_id. Option D (EVEN distribution) is wrong because it distributes rows round-robin across nodes without considering join keys, which maximizes data movement during joins on customer_id and degrades performance.
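
The co-location argument can be illustrated with a small simulation. This is a sketch under stated assumptions: a modulo function stands in for Redshift's internal hash, the `customers` and `orders` rows are made-up sample data, and `rows_needing_movement` is a hypothetical helper, not a Redshift API.

```python
def node_for(key, num_nodes):
    """Map a distribution-key value to a node (modulo stands in for the real hash)."""
    return key % num_nodes

def rows_needing_movement(left_rows, right_rows, left_dist, right_dist, num_nodes):
    """Count left rows whose join partner on customer_id is stored on a different node."""
    right_location = {
        r["customer_id"]: node_for(r[right_dist], num_nodes) for r in right_rows
    }
    moved = 0
    for row in left_rows:
        if node_for(row[left_dist], num_nodes) != right_location[row["customer_id"]]:
            moved += 1
    return moved

customers = [{"customer_id": c} for c in range(8)]
orders = [{"customer_id": i % 8, "order_day": i // 8} for i in range(24)]

# Both tables distributed on the join column: every match is already local.
print(rows_needing_movement(orders, customers, "customer_id", "customer_id", 4))
# Orders distributed on a non-join column: most rows must be redistributed.
print(rows_needing_movement(orders, customers, "order_day", "customer_id", 4))
```

When both tables use the join column as the distribution key, no row has to leave its node; distributing on `order_day` instead forces most order rows to move before the join can run.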

164
Multi-Selecteasy

A company is designing a data lake on AWS using S3. The security team requires that all data be encrypted at rest and that encryption keys be rotated annually. Which services can be used to meet these requirements? (Choose TWO.)

Select 2 answers
A.AWS Secrets Manager
B.AWS Certificate Manager (ACM)
C.AWS Key Management Service (AWS KMS)
D.AWS CloudHSM
E.Amazon S3 managed keys (SSE-S3)
AnswersC, E

Allows customer-managed keys with automatic yearly rotation.

Why this answer

Options C and E are correct. SSE-S3 (E) provides encryption at rest with Amazon-managed keys that are rotated automatically. AWS KMS (C) allows customer-managed keys with manual rotation or automatic yearly rotation.

Option A is wrong because Secrets Manager is for secrets, not encryption keys. Option B is wrong because ACM is for TLS certificates. Option D is wrong because CloudHSM provides hardware security modules but is not directly integrated with S3 for automatic key rotation.

165
Multi-Selecteasy

A company uses Kinesis Data Firehose to deliver streaming data to S3. They need to transform the data by adding a timestamp and removing sensitive fields. Which TWO approaches can achieve this?

Select 2 answers
A.Use Kinesis Data Analytics to transform the stream
B.Use S3 Select to transform data at rest
C.Use AWS Glue ETL to process data after delivery to S3
D.Use Amazon Redshift Spectrum to transform data
E.Configure a Lambda function as a data transformation in Firehose
AnswersC, E

Glue can transform data after it is stored in S3.

Why this answer

Options C and E are correct: a Lambda transformation in Firehose (E) can modify records before delivery, and Glue ETL (C) can process the data after Firehose delivers it to S3. Option A is wrong because Kinesis Data Analytics is for analytics, not transformation. Option B is wrong because S3 Select is for retrieving subsets of data, not transformation.

Option D is wrong because Redshift Spectrum is for querying data in S3.

166
MCQhard

A data engineer created the IAM policy shown in the exhibit. The engineer then attempts to upload an object to 'my-bucket' using the AWS CLI with the command: aws s3 cp file.txt s3://my-bucket/ --sse aws:kms. The upload fails with an 'AccessDenied' error. What is the most likely cause?

A.The policy resource is incorrect
B.The policy requires SSE-S3 (AES256), but the command uses SSE-KMS
C.The policy does not allow the s3:PutObject action
D.The command is missing the --sse-customer-algorithm parameter
AnswerB

The condition mandates AES256, but the command uses aws:kms.

Why this answer

The IAM policy in the exhibit requires the `s3:x-amz-server-side-encryption` header to be set to `AES256`, which corresponds to SSE-S3. The AWS CLI command uses `--sse aws:kms`, which sets the header to `aws:kms` for SSE-KMS. This mismatch causes the request to fail the `s3:PutObject` condition check in the policy, resulting in an 'AccessDenied' error.

Exam trap

The trap here is that candidates may overlook the condition key in the policy and assume the error is due to a missing action or incorrect resource, rather than recognizing that the encryption header value must exactly match the policy's requirement.

How to eliminate wrong answers

Option A is wrong because the policy resource `arn:aws:s3:::my-bucket/*` correctly specifies the bucket and its objects, so the resource is not the issue. Option C is wrong because the policy does allow `s3:PutObject` via the `Effect: Allow` statement; the failure is due to the condition key mismatch, not a missing action.

Option D is wrong because `--sse-customer-algorithm` is used for SSE-C, not SSE-KMS or SSE-S3, and the command already specifies `--sse aws:kms` correctly for SSE-KMS.
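
The exhibit is not reproduced here, so the statement below is a reconstruction from the explanation (Allow, `s3:PutObject`, the `my-bucket/*` resource, and an `AES256` requirement); the `StringEquals` operator is an assumption. The `put_object_allowed` function is a toy evaluator written for this example and is not how IAM itself is implemented.

```python
# Reconstructed policy statement; the StringEquals operator is assumed.
policy_statement = {
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-bucket/*",
    "Condition": {"StringEquals": {"s3:x-amz-server-side-encryption": "AES256"}},
}

def put_object_allowed(statement, request_context):
    """Return True only if every StringEquals condition matches the request context."""
    if statement["Effect"] != "Allow":
        return False
    conditions = statement.get("Condition", {}).get("StringEquals", {})
    return all(request_context.get(key) == value for key, value in conditions.items())

# `--sse aws:kms` sends the header value aws:kms; `--sse AES256` sends AES256.
sse_kms_request = {"s3:x-amz-server-side-encryption": "aws:kms"}
sse_s3_request = {"s3:x-amz-server-side-encryption": "AES256"}

print(put_object_allowed(policy_statement, sse_kms_request))
print(put_object_allowed(policy_statement, sse_s3_request))
```

Because the Allow statement does not match the SSE-KMS request and no other statement grants the action, the request falls through to an implicit deny, which surfaces as 'AccessDenied'.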

167
MCQmedium

A data engineer needs to audit all access to an Amazon S3 bucket containing sensitive data. The audit must capture who accessed the bucket, from which IP address, and what actions were performed. Which AWS service should be enabled?

A.Enable S3 server access logging for the bucket.
B.Enable AWS CloudTrail with data events for the S3 bucket.
C.Use AWS Config to record S3 bucket-level changes.
D.Configure Amazon CloudWatch Logs to monitor S3 access.
AnswerB

CloudTrail data events capture detailed API activity.

Why this answer

Option B is correct because AWS CloudTrail data events log object-level API calls to S3, including the caller identity, source IP, and action. Option A is wrong because S3 server access logs provide similar info but are not as detailed or centralized. Option C is wrong because AWS Config tracks resource configuration changes, not API calls.

Option D is wrong because CloudWatch Logs can store logs but does not generate them.

168
MCQhard

A company uses AWS Lake Formation to manage data lake permissions. A data engineer needs to grant a group of analysts SELECT permission on a set of tables in the 'analytics' database, but only for columns that are not classified as 'PII'. Which approach should the engineer use?

A.Grant SELECT on the entire database and rely on analysts to avoid PII columns.
B.Create an IAM policy that denies access to PII columns.
C.Use an S3 bucket policy to restrict access to objects containing PII data.
D.Use Lake Formation tag-based access control (LF-TBAC) to grant SELECT on columns without the 'PII' tag.
AnswerD

LF-TBAC allows column-level permissions by matching tags on columns with tags on the grant.

Why this answer

Option D is correct because LF-TBAC allows column-level permissions based on tags. Option A (database-level grant) grants access to all columns. Option B (IAM policy) does not support column-level restrictions.

Option C (S3 bucket policy) is too coarse.

169
MCQeasy

Your organization uses Amazon Redshift for analytical workloads. You have noticed that queries are slow on a large fact table. The table is distributed by KEY on the customer_id column and sorted by transaction_date. The table is frequently updated with new records. To improve query performance, you decide to implement a distribution style that reduces data movement. Which action should you take?

A.Change the distribution style to ALL to put a copy of the table on every node.
B.Change the distribution style to AUTO to let Redshift choose the best distribution.
C.Change the distribution style to EVEN to distribute rows evenly across all nodes.
D.Change the sort key to include customer_id as well.
AnswerC

EVEN reduces data movement for large fact tables when join columns are not well-distributed.

Why this answer

Option C is correct because EVEN distribution distributes rows evenly across nodes, reducing data movement for queries that do not benefit from key distribution. Option A is wrong because ALL distribution duplicates the entire table on each node, which is inefficient for large tables. Option B is wrong because AUTO lets Redshift decide, but it may not choose the best style for this table.

Option D is wrong because changing the sort key does not reduce data movement.

170
MCQeasy

A data engineer needs to share a dataset from an S3 bucket in Account A with another AWS account (Account B). The data must remain encrypted at rest with KMS. Which steps are required?

A.Update the KMS key policy to allow Account B's root user
B.Create an IAM role in Account A and grant cross-account access
C.Update the S3 bucket policy and the KMS key policy to allow Account B
D.Update the S3 bucket policy to allow Account B's root user
AnswerC

Both are required for decryption and access.

Why this answer

To share S3 objects encrypted with a KMS key, you must grant the consuming account access to both the S3 bucket and the KMS key. Option A is wrong because a KMS key policy alone does not grant access to the S3 objects. Option B is wrong because a cross-account role is not necessary if the bucket policy and key policy are used.

Option C is correct as it includes both. Option D is wrong because a bucket policy alone does not grant KMS permissions.
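
The "both policies" requirement can be sketched as two statements and a simple check. Everything specific here is hypothetical: the account ID, the bucket name, and the `can_read_encrypted_object` helper, which only models the idea that reading needs both `s3:GetObject` and `kms:Decrypt`.

```python
ACCOUNT_B = "arn:aws:iam::111122223333:root"  # hypothetical account ID

bucket_policy_statement = {
    "Effect": "Allow",
    "Principal": {"AWS": ACCOUNT_B},
    "Action": ["s3:GetObject"],
    "Resource": "arn:aws:s3:::example-shared-bucket/*",  # hypothetical bucket
}

key_policy_statement = {
    "Effect": "Allow",
    "Principal": {"AWS": ACCOUNT_B},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",
}

def can_read_encrypted_object(statements, principal):
    """True only if the principal is allowed both s3:GetObject and kms:Decrypt."""
    allowed = set()
    for statement in statements:
        if statement["Effect"] == "Allow" and statement["Principal"].get("AWS") == principal:
            allowed.update(statement["Action"])
    return {"s3:GetObject", "kms:Decrypt"} <= allowed

print(can_read_encrypted_object([bucket_policy_statement, key_policy_statement], ACCOUNT_B))
print(can_read_encrypted_object([bucket_policy_statement], ACCOUNT_B))
```

With only one of the two statements in place, the check fails, mirroring why the single-policy options do not work.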

171
MCQhard

A data engineer is designing a data ingestion pipeline for real-time clickstream data from a website. The data must be stored in Amazon S3 in near-real time, and also be available for real-time analytics using Amazon Athena. The pipeline must handle occasional spikes of up to 10x the normal throughput. Which combination of services should the engineer use?

A.Amazon Simple Queue Service (SQS) with AWS Lambda to write to Amazon S3, and Amazon Athena for queries.
B.AWS Database Migration Service (DMS) to stream data to Amazon S3, and Amazon Athena for queries.
C.Amazon Kinesis Data Streams with AWS Lambda to write to Amazon S3, and Amazon Athena for queries.
D.Amazon Kinesis Data Firehose with AWS Glue for transformation, and Amazon Athena for queries.
AnswerC

Kinesis handles spikes, Lambda writes to S3, Athena queries.

Why this answer

Option C is correct because Kinesis Data Streams can handle high throughput spikes, Lambda can process and write to S3, and Athena can query directly. Option A is wrong because SQS is pull-based and does not support push to S3 directly; Lambda would need to poll, adding latency. Option B is wrong because DMS is for database migration, not real-time clickstream.

Option D is wrong because Glue is batch-oriented and not suitable for low-latency streaming.

172
MCQeasy

A company wants to ingest real-time clickstream data from a website into Amazon S3 with minimal code. The data should be delivered within 60 seconds of generation. Which AWS service should be used?

A.Amazon Kinesis Data Firehose
B.AWS Database Migration Service (DMS)
C.Amazon Kinesis Data Streams
D.Amazon S3 Transfer Acceleration
AnswerA

Firehose is designed for near-real-time streaming ingestion into S3 with minimal configuration.

Why this answer

Option A is correct because Amazon Kinesis Data Firehose is a fully managed service that can ingest streaming data and deliver it to S3 with near-real-time latency (typically under 60 seconds). Option B is wrong because AWS Database Migration Service is for database migration, not streaming ingestion. Option C is wrong because Amazon Kinesis Data Streams requires custom consumers and does not directly deliver to S3.

Option D is wrong because Amazon S3 Transfer Acceleration speeds up uploads but does not provide streaming ingestion capabilities.

173
Multi-Selecteasy

A data engineer is setting up a data pipeline using AWS Glue. The engineer wants to monitor job failures and receive notifications. Which TWO services can be used together for this purpose?

Select 2 answers
A.AWS Step Functions
B.Amazon CloudWatch
C.Amazon SNS
D.Amazon Kinesis Data Streams
E.Amazon SQS
AnswersB, C

Glue publishes job metrics to CloudWatch.

Why this answer

Amazon CloudWatch (B) is correct because it is the native monitoring service for AWS Glue, capturing job metrics, logs, and state changes. You can configure CloudWatch alarms to trigger on job failures, which then invoke Amazon SNS (C) to send notifications via email, SMS, or other endpoints. Together, they provide a complete monitoring and alerting solution without additional orchestration.

Exam trap

The trap here is that candidates may confuse AWS Step Functions (A) as a monitoring tool because it can orchestrate retries, but it does not natively send notifications and is not the primary service for monitoring Glue job failures.

174
MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a slow internet connection (100 Mbps). The data must be transferred within 2 weeks. Which service should the engineer recommend?

A.AWS DataSync
B.AWS Snowball Edge
C.Amazon Kinesis Data Firehose
D.AWS Glue ETL job with JDBC connection to Hadoop
AnswerB

Snowball Edge physically transfers data, bypassing network limitations.

Why this answer

The correct answer is AWS Snowball Edge, a physical device for large data transfers over slow networks. 50 TB at 100 Mbps would take over 46 days, exceeding the 2-week deadline. Option A (AWS DataSync) uses the network and would be too slow. Option C (Amazon Kinesis Data Firehose) is for streaming.

Option D (AWS Glue) is for ETL, not transfer.
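
The 46-day figure follows from simple arithmetic, sketched below. The calculation assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a link that runs at full rate the whole time, so real transfers would take longer.

```python
def transfer_days(terabytes, megabits_per_second):
    """Days needed to move the data at full, sustained line rate (decimal units)."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86_400

days = transfer_days(50, 100)
print(f"{days:.1f} days")
```

Since the result is more than three times the 14-day deadline, any option that depends on the existing network link can be ruled out.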

175
MCQmedium

A company is using Amazon Redshift for analytics and needs to ensure that all data is encrypted at rest. The current cluster does not have encryption enabled. What is the most efficient way to enable encryption?

A.Change the cluster parameter group to enable encryption
B.Modify the cluster configuration to enable encryption
C.Use AWS DMS to migrate data to a new encrypted cluster
D.Create a snapshot of the cluster and restore it to a new cluster with encryption enabled
AnswerD

This is the supported method to migrate to an encrypted cluster.

Why this answer

Redshift does not support enabling encryption on an existing cluster; a new encrypted cluster must be created and data migrated. Modifying the cluster configuration or parameter groups does not enable encryption. Creating a snapshot and restoring it to a new cluster with encryption enabled is the standard approach.

176
MCQeasy

A data engineer is designing a data lake on AWS using Amazon S3. The data consists of CSV files generated by IoT devices. The data is accessed by multiple analytics jobs, and the engineer needs to ensure that new files are immediately visible to all consumers after writing. What S3 consistency model applies?

A.Consistent reads require S3 Object Lock.
B.Strong consistency for all operations.
C.Eventual consistency for all operations.
D.Read-after-write consistency for new object PUTS.
AnswerD

S3 provides read-after-write consistency for new objects, so they are immediately visible.

Why this answer

Amazon S3 provides read-after-write consistency for PUTs of new objects, meaning new objects are immediately readable. Option A is wrong because no locking mechanism is needed for new objects to be visible; S3 Object Lock is a retention feature. Option C is wrong because eventual consistency historically applied to overwrite PUTs and DELETEs, not to PUTs of new objects.

Option B deserves care: since December 2020, S3 provides strong read-after-write consistency for all operations, so B also describes current behavior. The answer key selects D because it names the guarantee that specifically covers new-object PUTs.

177
MCQeasy

A data engineer needs to schedule a daily ETL job that runs on Amazon EMR. The job should be triggered automatically and send an email on failure. Which AWS service should the engineer use to orchestrate the job?

A.Amazon EventBridge
B.AWS Step Functions
C.Amazon Simple Queue Service (SQS)
D.Amazon CloudWatch Events
AnswerB

Orchestrates EMR steps and integrates with SNS.

Why this answer

Option B is correct because AWS Step Functions can orchestrate EMR steps and integrate with Amazon SNS for notifications. Option A is wrong because Amazon EventBridge can trigger but not orchestrate complex workflows. Option C is wrong because Amazon SQS is a message queue.

Option D is wrong because Amazon CloudWatch Events can trigger Lambda but not directly orchestrate EMR steps.

178
MCQhard

A gaming company uses Amazon DynamoDB to store player profiles and game state. The table has a partition key of 'player_id' and no sort key. The table is provisioned with 5,000 RCUs and 5,000 WCUs. The application performs frequent reads and writes to update player scores. Recently, the company introduced a new feature that allows players to form guilds. The guild data is stored in a separate DynamoDB table with a partition key of 'guild_id'. The application often needs to retrieve all members of a guild. The data engineer is encountering high latency when querying the guild table because the guilds can have up to 100 members. The engineer wants to reduce latency without changing the application architecture. What should the data engineer do?

A.Increase the provisioned read and write capacity for the guild table to 10,000 RCUs and 10,000 WCUs.
B.Create a global secondary index (GSI) on the guild table with partition key guild_id and sort key member_id.
C.Enable DynamoDB Streams on the guild table and process the stream to populate a separate read table.
D.Use DynamoDB Accelerator (DAX) to cache the results of the guild queries.
AnswerB

The GSI allows efficient retrieval of all members of a guild by querying on guild_id.

Why this answer

Option B is correct because adding a global secondary index (GSI) on the guild table with guild_id as the partition key and member_id as the sort key allows efficient queries for all members of a guild. Option A is wrong because increasing capacity may not solve the access pattern issue. Option C is wrong because DynamoDB Streams are for change data capture, not for query optimization.

Option D is wrong because DAX caches read results; if the query pattern is inefficient, DAX won't help much.

179
Multi-Selectmedium

Which TWO statements are true about Amazon S3 bucket policies and ACLs?

Select 2 answers
A.When both exist, bucket policies are evaluated before ACLs.
B.ACLs are a legacy access control mechanism that is still supported.
C.ACLs can grant permissions to all authenticated AWS users.
D.ACLs support conditions such as IP address restrictions.
E.Bucket policies can grant access to users in other AWS accounts.
AnswersB, E

ACLs are older but still functional.

Why this answer

Option B is correct because ACLs (Access Control Lists) are indeed a legacy access control mechanism that Amazon S3 continues to support for backward compatibility. While bucket policies and IAM policies are the modern, recommended approach, ACLs can still be used to grant basic read/write permissions to AWS accounts or predefined groups like AllUsers or AuthenticatedUsers.

Exam trap

The trap here is that candidates confuse ACLs with bucket policies, assuming ACLs support advanced conditions like IP restrictions or that bucket policies and ACLs are evaluated in a strict order, when in fact ACLs are simplistic and both are evaluated as an OR.

180
MCQmedium

A data engineer reviewed the S3 lifecycle policy shown in the exhibit. The engineer notices that objects under the 'logs/' prefix are being deleted after 365 days. The business requirement is to retain logs for at least 5 years. What should the engineer change in the lifecycle policy?

A.Change the prefix to 'logs/archive/'
B.Set the expiration days to 1825
C.Change the transition to GLACIER on day 365
D.Remove the expiration action
AnswerB

1825 days equals 5 years.

Why this answer

The business requirement is to retain logs for at least 5 years, which is 1,825 days (5 × 365). The current lifecycle policy sets expiration to 365 days, causing premature deletion. By setting the expiration days to 1,825, the S3 lifecycle policy will delete objects under the 'logs/' prefix only after 5 years, meeting the retention requirement.

Exam trap

The trap here is that candidates may confuse transition actions (which change storage class) with expiration actions (which delete objects), or incorrectly assume that changing the prefix or removing expiration will meet the retention requirement without adjusting the day count.

How to eliminate wrong answers

Option A is wrong because changing the prefix to 'logs/archive/' would only apply the lifecycle rules to a different subset of objects, not fix the retention period for the original 'logs/' prefix. Option C is wrong because transitioning to GLACIER on day 365 only changes the storage class for cost optimization; it does not extend the deletion timeline, so objects would still be deleted after 365 days. Option D is wrong because removing the expiration action entirely would mean objects are never automatically deleted, which may lead to indefinite storage and increased costs, not a 5-year retention.
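
As a rough illustration of the corrected rule, the sketch below builds a lifecycle configuration with the 1,825-day expiration. The exhibit is not reproduced here, so the rule ID and overall layout are assumptions; only the 'logs/' prefix and the day count come from the question. The dictionary follows the shape accepted by the S3 PutBucketLifecycleConfiguration API.

```python
# Hypothetical reconstruction of the corrected rule; the exhibit itself is not shown.
RETENTION_YEARS = 5
EXPIRATION_DAYS = RETENTION_YEARS * 365  # 1825, ignoring leap days

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-logs-after-5-years",  # assumed rule name
            "Filter": {"Prefix": "logs/"},      # prefix taken from the question
            "Status": "Enabled",
            "Expiration": {"Days": EXPIRATION_DAYS},
        }
    ]
}

print(lifecycle_configuration["Rules"][0]["Expiration"])
```

With boto3, a dictionary like this could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`; the bucket name would be whatever the exhibit uses.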

181
Multi-Selecthard

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The Flink application reads from a Kinesis Data Streams source, performs aggregations, and writes results to Amazon S3. The application is experiencing high checkpoint failures, and the processing lag is increasing. The data volume is 50 MB/s with an average record size of 1 KB. Which TWO actions would improve checkpoint reliability and reduce lag? (Choose TWO.)

Select 2 answers
A.Decrease the checkpoint interval to complete checkpoints faster.
B.Replace the S3 sink with Kinesis Data Firehose.
C.Decrease the parallelism of the Flink application.
D.Increase the checkpoint interval in the Flink configuration.
E.Increase the number of Kinesis Processing Units (KPUs) for the application.
AnswersD, E

Less frequent checkpoints reduce overhead.

Why this answer

Options D and E are correct. Increasing the checkpoint interval makes checkpoints less frequent and reduces their overhead, and adding KPUs gives the application more resources and parallelism to work through the backlog. Option C is wrong because reducing parallelism would worsen lag.

Option A is wrong because decreasing the checkpoint interval makes checkpoints more frequent, which increases failures. Option B is wrong because Kinesis Data Firehose is not a direct solution to checkpoint failures.

182
MCQeasy

A data engineer is monitoring an Amazon EMR cluster and notices that the cluster is running out of disk space on the core nodes. Which action can be taken to resolve this issue?

A.Reduce the retention period of data stored on HDFS
B.Change the core node instance type to a compute-optimized type
C.Increase the EBS volume size attached to core nodes
D.Use Spot Instances for core nodes
AnswerC

More EBS capacity directly adds disk space.

Why this answer

Option C is correct because increasing the size of the EBS volumes attached to core nodes provides more storage. Option B is wrong because changing the instance type to a compute-optimized type affects memory and CPU, not disk. Option D is wrong because Spot Instances are cheaper but do not add disk.

Option A is wrong because reducing data retention is not always feasible and may lose data.

183
MCQhard

A financial services company processes real-time stock trade data. They use Amazon Kinesis Data Streams with a shard count of 5, each shard receiving about 500 records per second. The consumer application uses the Kinesis Client Library (KCL) with DynamoDB for checkpointing. Lately, some records are being processed multiple times. What is the most likely cause?

A.The consumer application is crashing and restarting, causing re-processing of records.
B.The Kinesis stream's iterator age is exceeding the retention period.
C.The DynamoDB table used for checkpointing is throttling write requests.
D.The record size exceeds the 1 MB API limit, causing retries.
AnswerA

KCL reprocesses from last checkpoint after failure.

Why this answer

The Kinesis Client Library (KCL) uses DynamoDB to track checkpoint progress for each shard. If the consumer application crashes and restarts, the KCL will resume processing from the last committed checkpoint, which may be behind the actual processing point. This causes records that were already processed (but not yet checkpointed) to be re-processed, leading to duplicate processing.

Exam trap

The trap here is that candidates often confuse checkpoint throttling (Option C) with duplicate processing, but throttling would cause checkpoint failures and potential re-processing only if the application cannot recover, whereas the direct cause of duplicates is the gap between processing and checkpointing after a crash.

How to eliminate wrong answers

Option B is wrong because iterator age exceeding the retention period would cause data to expire and become unavailable, not cause duplicate processing. Option C is wrong because DynamoDB throttling on checkpoint writes would cause checkpoint failures and potential re-processing, but the question states checkpointing is occurring and the issue is duplicate processing, not checkpoint failures. Option D is wrong because the 1 MB limit applies to each record written to the stream (a single PutRecords request can carry up to 5 MB in total), and exceeding it would cause the producer's write to be rejected, not duplicate processing of records that were already written successfully.
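
The gap between processing and checkpointing can be illustrated with a small simulation. This is a toy model, not KCL code: `run_consumer`, `checkpoint_every`, and `crash_after` are hypothetical names, and a plain integer stands in for the checkpoint that KCL stores in DynamoDB.

```python
def run_consumer(records, checkpoint_every, crash_after):
    """Simulate a consumer that checkpoints every N records and crashes once."""
    processed = []   # every record the application handled, in order
    checkpoint = 0   # index of the next record to read after a restart
    position = 0
    crashed = False
    while position < len(records):
        processed.append(records[position])
        position += 1
        if position % checkpoint_every == 0:
            checkpoint = position      # durable progress marker
        if not crashed and len(processed) == crash_after:
            crashed = True
            position = checkpoint      # restart resumes from the last checkpoint
    return processed


handled = run_consumer(list(range(10)), checkpoint_every=4, crash_after=6)
duplicates = sorted({r for r in handled if handled.count(r) > 1})
print(handled)     # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9]
print(duplicates)  # [4, 5]
```

Records 4 and 5 were processed after the checkpoint at position 4 but before the crash, so they are handled a second time after the restart. This is why consumers built this way are usually written to be idempotent.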

184
MCQmedium

Your company has an Amazon S3-based data lake partitioned by year/month/day. An AWS Glue crawler runs daily to update the Data Catalog. A Spark job on Amazon EMR reads the latest partition and performs transformations. Recently, the Spark job has been failing with a 'FileNotFoundException' for a file that is expected to exist. You check the S3 bucket and see that the file exists. The job is configured to use S3 as the direct input source with EMRFS consistent view enabled. The IAM role for the EMR cluster has full S3 access. What is the most likely cause?

A.The Data Catalog partition metadata is out of sync because the crawler runs after the job.
B.The Glue crawler did not update the Data Catalog because of insufficient permissions.
C.The EMR cluster is experiencing eventual consistency issues despite EMRFS consistent view.
D.The file was deleted by a lifecycle policy after the job started.
AnswerC

Even with consistent view, there can be short propagation delays; the job might be reading a directory listing that hasn't updated.

Why this answer

Option C is correct because, even with EMRFS consistent view enabled, there can be a short propagation delay before newly written S3 objects are visible, so the Spark job may read a stale listing and report a file as missing even though it exists. Option A is wrong because the job reads from S3 directly, so the timing of the crawler's catalog update does not determine which files the job can see.

Option D is wrong because the file still exists in the bucket. Option B is wrong because the job reads from S3 directly, so a crawler permission problem would not produce a 'FileNotFoundException' for a file that exists.

185
Multi-Selectmedium

A company needs to ingest streaming data from thousands of IoT devices. The data must be processed in real-time and stored in Amazon S3. Which TWO services should be used together?

Select 2 answers
A.Amazon Kinesis Data Streams
B.Amazon Kinesis Data Firehose
C.AWS Glue
D.Amazon Simple Queue Service (SQS)
E.AWS Direct Connect
AnswersA, B

Provides real-time data ingestion.

Why this answer

Options A and B are correct. Kinesis Data Streams ingests streaming data, and Kinesis Data Firehose can deliver data from the stream to S3. Option D is wrong because SQS is a message queue, not a service for real-time stream processing.

Option C is wrong because Glue is batch-oriented. Option E is wrong because Direct Connect provides network connectivity and does not ingest or process data.

186
MCQeasy

A data engineer is troubleshooting an Amazon Redshift cluster that is running slowly. The cluster has 4 dc2.large nodes. The engineer runs a query that scans a large table and notices that the query uses only a single slice instead of all slices. The table is distributed with DISTSTYLE ALL. What is the most likely reason for the query using only one slice?

A.The query is running on the leader node instead of the compute nodes.
B.The table uses DISTSTYLE ALL, which stores the entire table on a single slice per node.
C.The workload management (WLM) queue is configured with a single query slot.
D.The table does not have a sort key defined.
AnswerB

DISTSTYLE ALL replicates the table to each node, but it is stored on one slice per node, limiting parallelism.

Why this answer

Option B is correct because DISTSTYLE ALL copies the entire table to every node, and within each node the copy is stored on a single slice. The number of slices depends on the node type; a dc2.large node has 2 slices. Because the table is replicated rather than distributed across slices, a query that scans it may use only one slice per node, which leaves the other slices underutilized.

Option D is wrong because the sort key affects data ordering, not slice utilization. Option C is wrong because WLM queues affect concurrency, not slice usage.

Option A is wrong because the scan runs on the compute nodes, not the leader node; it uses multiple nodes but only one slice per node.

187
MCQeasy

A data engineer needs to securely store database credentials used by a Lambda function. The solution must automatically rotate the credentials every 90 days. Which AWS service should the engineer use?

A.AWS CloudHSM
B.AWS Systems Manager Parameter Store
C.IAM Roles for Lambda
D.AWS Secrets Manager
AnswerD

Secrets Manager provides automatic rotation of secrets.

Why this answer

Option D is correct because AWS Secrets Manager supports automatic rotation of secrets. Option B is wrong because Parameter Store does not natively rotate secrets. Option A is wrong because CloudHSM is a hardware security module, not a secret store with rotation.

Option C is wrong because IAM Roles are for access to AWS services, not for storing database credentials.

188
Multi-Selecteasy

A company is building a data pipeline that ingests streaming data from IoT devices. The data must be stored in a durable, scalable, and cost-effective manner for batch processing. Which TWO AWS services should be used together?

Select 2 answers
A.Amazon ElastiCache
B.Amazon Kinesis Data Streams
C.Amazon Redshift
D.Amazon DynamoDB
E.Amazon S3
AnswersB, E

Ingests streaming data in real-time.

Why this answer

Options B and E are correct. Kinesis Data Streams ingests streaming data, and S3 stores the data durably and cost-effectively. Option C (Redshift) is for analytics, not raw storage.

Option D (DynamoDB) is for low-latency queries, not cost-effective bulk storage. Option A (ElastiCache) is a cache, not durable storage.

189
Multi-Selecteasy

A data engineer needs to migrate an on-premises MongoDB database to AWS. The migration must have minimal downtime and support automatic scaling. Which TWO AWS services should the engineer use for the target data store? (Choose TWO.)

Select 2 answers
A.Amazon Redshift
B.Amazon RDS for MySQL
C.Amazon DocumentDB (with MongoDB compatibility)
D.Amazon DynamoDB
E.Amazon S3
AnswersC, D

DocumentDB is compatible with MongoDB workloads.

Why this answer

Options C and D are correct. Amazon DocumentDB is MongoDB-compatible; Amazon DynamoDB is a NoSQL database that can handle document-like data and supports auto scaling. Option B (RDS for MySQL) is relational; Option A (Redshift) is for analytics; Option E (S3) is object storage, not a database.

190
MCQeasy

A data engineer is running an AWS Glue ETL job that reads from an Amazon RDS MySQL database and writes to Amazon S3. The job fails with a 'Communications link failure' error. The security group for the RDS instance allows inbound traffic from the Glue job's security group. What is the most likely cause of the failure?

A.The JDBC connection string in the Glue job does not include the database name.
B.The Glue job is using the wrong JDBC driver.
C.The Glue job's security group does not allow outbound traffic to the RDS security group on port 3306.
D.The IAM role used by the Glue job does not have rds:Connect permission.
AnswerC

Without outbound rule, the connection fails.

Why this answer

Option C is correct because AWS Glue ETL jobs run in a VPC that requires outbound security group rules to initiate connections to RDS. Even if the RDS security group allows inbound traffic from the Glue security group, the Glue security group must also have an outbound rule allowing traffic to the RDS security group on port 3306 (MySQL default port). Without this outbound rule, the TCP handshake from Glue to RDS fails, causing a 'Communications link failure'.

Exam trap

The trap here is that candidates assume only inbound rules matter for security groups, but outbound rules are equally critical for initiating connections from the client (Glue) to the server (RDS).

How to eliminate wrong answers

Option A is wrong because omitting the database name from the JDBC connection string would cause a different error (e.g., 'Unknown database' or connection rejection), not a 'Communications link failure', which indicates a network-level issue. Option B is wrong because AWS Glue automatically includes the correct JDBC driver for MySQL (compatible with Amazon RDS MySQL) when using the Glue connection type 'MySQL'; using the wrong driver would typically produce a class-not-found or driver-incompatibility error, not a communications link failure. Option D is wrong because IAM permissions for Glue jobs use actions like 'glue:GetConnection' and 'rds:DescribeDBInstances' to retrieve connection metadata, but there is no 'rds:Connect' IAM action; database authentication is handled via username/password in the Glue connection, not IAM.
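
The rule pairing described above can be sketched as a simplified model. This is not the EC2 API: `SecurityGroup` and `can_connect` are hypothetical names, and the model only captures the point that a new connection needs an outbound rule on the initiating side and an inbound rule on the receiving side.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityGroup:
    name: str
    inbound: set = field(default_factory=set)   # (peer group, port) pairs allowed in
    outbound: set = field(default_factory=set)  # (peer group, port) pairs allowed out


def can_connect(client, server, port):
    """A new connection needs client egress AND server ingress to allow it."""
    egress_ok = (server.name, port) in client.outbound
    ingress_ok = (client.name, port) in server.inbound
    return egress_ok and ingress_ok


glue_sg = SecurityGroup("glue-sg")
rds_sg = SecurityGroup("rds-sg", inbound={("glue-sg", 3306)})

before = can_connect(glue_sg, rds_sg, 3306)   # inbound rule alone is not enough
glue_sg.outbound.add(("rds-sg", 3306))        # add the missing outbound rule
after = can_connect(glue_sg, rds_sg, 3306)
print(before, after)  # False True
```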

191
MCQmedium

A company is ingesting data from multiple sources into S3 using AWS Glue. The data engineer notices that the Glue job is failing with an OutOfMemory error. Which step should be taken to resolve this issue?

A.Reduce the volume of incoming data
B.Configure the job to use a larger memory setting
C.Use a smaller file size for input
D.Increase the number of DPUs allocated to the Glue job
AnswerD

More DPUs provide more memory and compute.

Why this answer

Option D is correct because increasing the number of DPUs provides more memory. Option A is wrong because reducing data volume is not a solution. Option C is wrong because using a smaller file size may help but does not directly address memory.

Option B is wrong because Glue does not support memory configuration directly.

192
MCQhard

A company is using AWS Glue to process data stored in Amazon S3. The data includes personally identifiable information (PII) that must be masked before being written to a separate output bucket. Which AWS service or feature can be used to automatically detect and mask sensitive data in the Glue ETL job?

A.Configure CloudWatch Logs to filter and mask PII.
B.Use Amazon Macie to identify sensitive data and apply masking logic in the Glue job.
C.Use an IAM policy to restrict access to the PII columns.
D.Enable S3 Object Lock on the output bucket.
AnswerB

Amazon Macie can detect sensitive data, and the Glue job can use that information to mask it.

Why this answer

Option B is correct because Amazon Macie can identify sensitive data, and the Glue job can use those findings to apply masking logic. Option C is wrong because IAM policies control access, not data masking. Option D is wrong because S3 Object Lock prevents object deletion or modification; it does not mask data.

Option A is wrong because CloudWatch Logs does not mask data processed by the ETL job.

193
Multi-Selectmedium

A data engineering team is using AWS DMS to migrate a 2 TB Oracle database to Amazon RDS for PostgreSQL. The migration must have minimal downtime and needs to capture ongoing changes after the full load. Which THREE resources are required for this task? (Choose three.)

Select 3 answers
A.A DMS source endpoint configured for Oracle.
B.An AWS DMS replication instance.
C.An AWS Snowball Edge device for initial data transfer.
D.An Amazon S3 bucket for staging the data.
E.A DMS target endpoint configured for Amazon RDS PostgreSQL.
AnswersA, B, E

Connects to the source Oracle database.

Why this answer

A is correct because a source endpoint for Oracle is required. B is correct because a DMS replication instance is needed to run the migration tasks. E is correct because a target endpoint for RDS PostgreSQL is required.

D is wrong because an S3 bucket is not required unless using S3 as a target. C is wrong because a Snowball Edge device is for large offline transfers and is not required for a DMS migration of this kind.

194
MCQmedium

A company uses AWS Glue ETL to process data from Amazon S3 and write results to Amazon Redshift. The job fails with a memory error when processing large files. Which action should the data engineer take to resolve this issue?

A.Reduce the number of partitions in the Glue job.
B.Increase the number of DPUs allocated to the Glue job.
C.Switch to a smaller instance type in the Glue job configuration.
D.Use S3 Select to filter columns before reading into Glue.
AnswerB

More DPUs provide additional memory and compute resources.

Why this answer

Option B is correct because increasing the number of DPUs provides more memory for processing. Option A is wrong because reducing the number of partitions lowers parallelism and puts more data in each partition, which does not relieve memory pressure. Option D is wrong because S3 Select filters data server-side but does not address the job's memory shortfall.

Option C is wrong because using a smaller instance type would exacerbate the memory issue.

195
MCQmedium

A company uses AWS Glue to run ETL jobs on a schedule. Recently, a job failed with the error: 'AnalysisException: cannot resolve '`column_name`' given input columns: ...'. The job reads from an Amazon S3 source that has a schema defined in the AWS Glue Data Catalog. What is the MOST likely cause?

A.The schema of the source data has changed and is not reflected in the Data Catalog.
B.The source data file is corrupted and cannot be parsed.
C.The IAM role associated with the Glue job does not have permissions to read the S3 bucket.
D.The data type of the column in the source does not match the Data Catalog definition.
AnswerA

Schema evolution without updating catalog causes column resolution errors.

Why this answer

Option A is correct because the error indicates a missing column. This typically happens when the source data schema changes (e.g., a column is renamed or dropped) but the Data Catalog schema is not updated. Option C is incorrect because missing IAM permissions would cause a different error (e.g., AccessDenied).

Option B is incorrect because a corrupted file would cause a read error, not a schema resolution error. Option D is incorrect because the 'cannot resolve' error is about column names, not data types.

196
MCQmedium

A company is using Amazon S3 to store large amounts of archival data. The data is accessed infrequently but must be immediately retrievable when needed. Which storage class is the most cost-effective choice?

A.S3 Standard
B.S3 Standard-IA
C.S3 Glacier Deep Archive
D.S3 Intelligent-Tiering
AnswerB

Designed for infrequently accessed data with immediate retrieval.

Why this answer

S3 Standard-IA is designed for infrequently accessed data that requires rapid access. S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive have retrieval delays. S3 Standard is more expensive for infrequent access.

197
MCQmedium

A company wants to ingest data from a SaaS application into Amazon S3. The SaaS application supports streaming data via HTTP POST requests. The data volume is approximately 100 MB per hour, and the company needs to store the raw data in S3 for archival and later analysis. Which approach is the most cost-effective and operationally efficient?

A.Launch a t3.nano EC2 instance that runs a script to receive HTTP POST requests and write to S3.
B.Use Amazon Kinesis Data Firehose with HTTP endpoint as the source, and configure S3 as the destination.
C.Use Amazon Simple Queue Service (SQS) to queue the HTTP POST data and have an AWS Lambda function read from SQS and write to S3.
D.Use Amazon API Gateway to create a REST API that receives the data and triggers an AWS Lambda function to store it in S3.
AnswerB

Firehose is serverless, scales automatically, and directly delivers to S3 with no intermediate storage.

Why this answer

Option B is correct because Kinesis Data Firehose can directly accept HTTP POST requests via the Kinesis Agent or direct API calls, and it automatically delivers data to S3 with buffering, requiring no servers to manage. Option A is wrong because running an EC2 instance adds operational overhead and cost even if it's small. Option C is wrong because SQS also requires a consumer (e.g., Lambda) to move data to S3, adding complexity.

Option D is wrong because API Gateway and Lambda add unnecessary layers and cost for simple ingestion.

198
MCQhard

A data engineer is setting up an Amazon Kinesis Data Analytics application to process streaming data from a Kinesis data stream named "input-stream". The application uses a reference data source from an S3 bucket. The engineer has attached the IAM policy shown in the exhibit to the application's IAM role. When starting the application, the engineer receives an 'AccessDeniedException' error. Which additional permission is required?

A.kinesis:PutRecord on the input stream
B.s3:GetObject on the S3 bucket containing the reference data
C.kinesis:CreateStream on the input stream
D.kinesis:PutRecords on the input stream
AnswerB

The application needs to read reference data from S3, so GetObject is required.

Why this answer

The Kinesis Data Analytics application needs to read reference data from the S3 bucket, which requires the s3:GetObject permission on the bucket and its objects. The error 'AccessDeniedException' indicates the IAM role lacks this specific permission to retrieve the reference data file. Option B correctly adds the missing s3:GetObject action to allow the application to fetch the reference data from S3.

Exam trap

The trap here is that candidates often confuse the direction of data flow and assume the application needs write permissions (PutRecord/PutRecords) to the input stream, when in fact it only needs read permissions (kinesis:DescribeStream, kinesis:GetShardIterator, kinesis:GetRecords) and the missing permission is for the separate S3 reference data source.

How to eliminate wrong answers

Option A is wrong because kinesis:PutRecord is used to write data to a Kinesis stream, but the application reads from the input stream as a source, not writes to it; the error is not about writing. Option C is wrong because kinesis:CreateStream is an administrative action to create a new stream, which is irrelevant to an existing stream used as input. Option D is wrong because kinesis:PutRecords is for batch writing to a stream, not for reading or for accessing reference data from S3.

199
MCQeasy

A data engineer is ingesting CSV files from an Amazon S3 bucket into a Glue Data Catalog table. The files have headers, but some files have extra columns not present in the first file. The engineer wants the Glue crawler to automatically detect the schema. Which crawler configuration option should be used?

A.Configure the crawler to 'Inherit schema from table' and set the table name.
B.Configure the crawler to 'Create a single schema for each S3 path' and enable 'Merge tables'.
C.Configure the crawler to 'Create a single schema for each S3 path' without enabling 'Merge tables'.
D.Configure the crawler to 'Create a single schema for each S3 path' and set 'Each file as a separate table'.
AnswerB

This merges schemas from all files in the path.

Why this answer

Option B is correct because when CSV files have varying schemas (extra columns), the Glue crawler must be configured to 'Create a single schema for each S3 path' with 'Merge tables' enabled. This configuration instructs the crawler to union the schemas from all files in the S3 path, adding new columns as they appear, rather than creating separate tables for each schema variation.

Exam trap

The trap here is that candidates often assume 'Merge tables' is about combining multiple tables into one, when in fact it merges schemas from multiple files within the same S3 path into a single table definition.

How to eliminate wrong answers

Option A is wrong because 'Inherit schema from table' is not a valid Glue crawler configuration; crawlers do not inherit schemas from existing tables automatically. Option C is wrong because without enabling 'Merge tables', the crawler will create multiple tables for each distinct schema, not a single unified table. Option D is wrong because 'Each file as a separate table' would create a separate table per CSV file, which defeats the goal of having a single table with all columns merged.

200
MCQeasy

A company is designing a data pipeline that ingests data from an on-premises database to Amazon S3. The data contains personally identifiable information (PII) that must be masked before storage. Which AWS service can be used to mask the data in transit?

A.AWS Database Migration Service (DMS)
B.AWS Data Pipeline
C.Amazon Kinesis Data Firehose
D.AWS Glue
AnswerD

AWS Glue can run ETL jobs to mask PII before writing to S3.

Why this answer

Option D is correct because AWS Glue can transform and mask data during ETL jobs. Kinesis Data Firehose (Option C) is for streaming delivery.

Data Pipeline (Option B) orchestrates but does not have native masking. DMS (Option A) is for migration, not masking.

201
Multi-Selectmedium

A company is designing a data lake on Amazon S3. Which TWO strategies improve query performance for Amazon Athena?

Select 2 answers
A.Enable S3 Versioning on the bucket.
B.Use server-side encryption with AWS KMS (SSE-KMS).
C.Partition the data by frequently queried columns such as date or region.
D.Use columnar file formats like Parquet or ORC.
E.Store data in CSV format with header rows.
AnswersC, D

Partitioning prunes the data scanned.

Why this answer

Partitioning data by frequently queried columns (e.g., date or region) allows Athena to prune the data scanned by only reading the relevant partitions, reducing the amount of data scanned and improving query performance. This is a core optimization for Athena, which charges based on data scanned and performs better with less I/O.

Exam trap

The trap here is that candidates often confuse data management features (like versioning or encryption) with performance optimizations, or assume that simpler formats like CSV are sufficient for analytics, ignoring the significant performance benefits of partitioning and columnar storage.
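
Partition pruning can be illustrated with a short sketch. This is not how Athena is implemented; `prune` is a hypothetical helper that mimics the effect on Hive-style `column=value` prefixes, and the object keys are made up for the example.

```python
def prune(keys, **partition_filters):
    """Keep only keys whose Hive-style partition values match the filters."""
    wanted = {f"{col}={val}" for col, val in partition_filters.items()}
    return [k for k in keys if wanted.issubset(k.split("/"))]


keys = [
    "sales/date=2024-01-01/region=us/part-0.parquet",
    "sales/date=2024-01-01/region=eu/part-0.parquet",
    "sales/date=2024-01-02/region=us/part-0.parquet",
]

# Equivalent in spirit to: WHERE date = '2024-01-01' AND region = 'us'
scanned = prune(keys, date="2024-01-01", region="us")
print(scanned)  # ['sales/date=2024-01-01/region=us/part-0.parquet']
```

Only one of the three objects is read. A columnar format such as Parquet or ORC then reduces the scan further by reading only the columns the query references.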

202
MCQhard

A company uses AWS KMS to encrypt sensitive data stored in S3. To meet compliance requirements, they need to ensure that the encryption keys are automatically rotated every year. Which type of KMS key should they use?

A.Customer managed key with manual rotation
B.AWS managed key
C.Custom key store (CloudHSM) key
D.Customer managed key with automatic rotation enabled
AnswerD

Customer managed keys can have automatic rotation enabled, which rotates the key annually.

Why this answer

Option D is correct because a customer managed key with automatic rotation enabled is rotated by KMS annually. Option A is wrong because manual rotation depends on the customer rotating the key and is not automatic. Option C is wrong because custom key stores do not support automatic rotation.

Option B is wrong because AWS managed keys are created and managed by AWS, so the customer cannot control their rotation.

203
Multi-Selecthard

A company uses AWS KMS to encrypt data in Amazon Redshift. The data engineer needs to rotate the customer-managed KMS key annually. Which TWO actions must be taken to successfully rotate the key without data loss?

Select 2 answers
A.Create a new KMS key and update the Redshift cluster to use the new key
B.Keep the old KMS key enabled to allow decryption of existing encrypted data
C.Use AWS CloudTrail to verify the key rotation was successful
D.Store the new key in Amazon S3 for backup
E.Enable automatic KMS key rotation on the existing key
AnswersA, B

Needed to re-encrypt data with new key.

Why this answer

Option A is correct because the Redshift cluster must be configured to use the new key. Option B is correct because the old key must remain enabled to decrypt existing data. Option E is wrong because automatic rotation is optional.

Option C is wrong because CloudTrail logs rotation but is not required. Option D is wrong because Redshift does not use S3 for key storage.

204
MCQeasy

Refer to the exhibit. A data engineer applies this S3 bucket policy to the bucket 'example-bucket'. What is the effect of this policy?

A.PutObject requests are denied unless they include the x-amz-server-side-encryption header set to AES256
B.All PutObject requests are denied regardless of encryption
C.All PutObject requests are allowed only if they use SSE-KMS
D.The policy has no effect because it does not allow any action
AnswerA

The condition StringNotEquals denies if the header is not AES256, so only requests with AES256 are allowed.

Why this answer

The policy denies s3:PutObject if the encryption header is not set to AES256 (SSE-S3). It does not enforce a specific KMS key. It allows uploads with SSE-S3.

It denies uploads without encryption or with other encryption types.
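
The exhibit is not reproduced here, so the policy below is an assumed reconstruction based on the description above: a single Deny statement on s3:PutObject with a StringNotEquals condition on the encryption header. The `Sid` value and the `is_denied` helper are hypothetical, and the helper is a toy check for this one statement, not the real IAM evaluation engine.

```python
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyNonAES256Uploads",  # assumed statement ID
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
            },
        }
    ],
}


def is_denied(headers):
    """Toy evaluation of the single Deny statement above."""
    condition = bucket_policy["Statement"][0]["Condition"]["StringNotEquals"]
    required = condition["s3:x-amz-server-side-encryption"]
    # A missing header does not equal "AES256", so the Deny matches.
    return headers.get("x-amz-server-side-encryption") != required


print(is_denied({}))                                           # True
print(is_denied({"x-amz-server-side-encryption": "aws:kms"}))  # True
print(is_denied({"x-amz-server-side-encryption": "AES256"}))   # False
```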

205
Multi-Selectmedium

A company uses AWS Glue to run ETL jobs daily. The jobs consume data from an Amazon RDS for MySQL database and write results to Amazon S3. The company wants to minimize the impact on the source database during extraction. Which THREE actions should the data engineer take to achieve this? (Choose THREE.)

Select 3 answers
A.Schedule the Glue job to run during off-peak hours.
B.Configure the Glue job to connect to a read replica of the RDS instance.
C.Increase the number of Glue DPUs to process data faster.
D.Disable Glue job bookmarks to force full refresh.
E.Use a JDBC connection with a WHERE clause to extract only incremental data.
AnswersA, B, E

Runs when database load is naturally low.

Why this answer

Options A, B, and E are correct. Option B: Using a read replica avoids load on the primary database. Option E: Using a JDBC connection with a WHERE clause to extract only new or changed data reduces the amount of data read.

Option A: Scheduling Glue jobs during off-peak hours minimizes impact. Option C is wrong because increasing DPUs does not reduce database load; it may increase parallelism and actually increase load. Option D is wrong because disabling job bookmarks would cause full scans, increasing load.

206
MCQhard

A data engineer is troubleshooting an AWS Glue ETL job that fails with the error 'java.lang.OutOfMemoryError: Java heap space'. The job processes a large number of small files in Amazon S3. Which action would MOST effectively resolve the issue?

A.Enable S3 groupFiles option in the Glue job
B.Change the worker type to G.1X
C.Increase the number of workers in the Glue job
D.Use a G.2X worker type with more memory
AnswerA

Groups small files, reducing partitions and memory usage.

Why this answer

Option A is correct because grouping small files into fewer, larger splits reduces the number of Spark partitions and the associated memory overhead. Option C is wrong because increasing the number of workers increases parallelism but may not fix heap space per worker. Option D is wrong because a G.2X worker type with more memory per worker could help, but grouping files is more effective.

Option B is wrong because changing the worker type to G.1X does not address the root cause, which is too many small files.
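
To see why grouping helps, the sketch below packs many small files into groups of a target size so that one read task handles many files. It is a simplified greedy model, not Glue's actual grouping algorithm; the `group_files` helper, the file sizes, and the 128 MB target are assumptions. In a Glue script, grouping is requested through the `groupFiles` and `groupSize` connection options when reading from S3.

```python
def group_files(sizes_bytes, target_bytes):
    """Greedily pack file sizes into groups of roughly target_bytes."""
    groups, current, current_size = [], [], 0
    for size in sizes_bytes:
        # Start a new group once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical workload: 10,000 files of 64 KB each, grouped toward 128 MB.
small_files = [64 * 1024] * 10_000
groups = group_files(small_files, 128 * 1024 * 1024)
print(f"{len(small_files)} files -> {len(groups)} groups")
```

Fewer groups means fewer partitions for the driver and executors to track, which is where the heap pressure from small files tends to come from.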

207
MCQmedium

A data engineer is troubleshooting a failed AWS Glue job that reads from an Amazon RDS for MySQL table. The error message indicates 'java.sql.SQLException: No suitable driver'. What is the most likely cause?

A.The MySQL JDBC driver JAR is not included in the Glue job's dependencies.
B.The Glue job is using the wrong JDBC driver class name.
C.The Glue job's VPC subnet does not have a route to the RDS instance.
D.The RDS instance is not publicly accessible.
AnswerA

Glue needs the JDBC driver in its classpath to connect to MySQL.

Why this answer

Option A is correct because the MySQL JDBC driver JAR must be available on the job's classpath, for example through the Glue job's dependent JARs path. Option B is incorrect because the scenario points to a missing driver JAR rather than a wrong class name. Option C is incorrect because a missing route would cause a connection failure, not a driver error.

Option D is incorrect because public accessibility of the RDS instance does not affect driver loading.

208
Multi-Selectmedium

Which TWO actions are required to enforce encryption in transit for data being loaded into Amazon Redshift from Amazon S3? (Choose two.)

Select 2 answers
A.Configure Redshift to require SSL connections
B.Use client-side encryption for data in S3
C.Enable encryption at rest on Redshift cluster
D.Enable S3 server-side encryption
E.Use S3 VPC endpoints with HTTPS
AnswersA, E

Ensures data in transit to Redshift is encrypted.

Why this answer

To enforce encryption in transit, you must use S3 endpoints that support HTTPS (option E) and configure Redshift to require SSL (option A). Option C is about encryption at rest in Redshift, not in transit. Option D is about encryption at rest in S3.

Option B is about client-side encryption, which is not required for transit encryption.

209
Multi-Selectmedium

A data engineer is designing a pipeline to ingest daily CSV files from an SFTP server into Amazon S3. The files are large (up to 10 GB) and must be encrypted in transit. The pipeline should be fully managed and serverless where possible. Which TWO services should be used together to achieve this? (Choose TWO.)

Select 2 answers
A.Amazon Kinesis Data Firehose
B.AWS Lambda
C.AWS Glue
D.AWS Transfer Family
E.Amazon Athena
AnswersC, D

Glue can process CSV files in S3.

Why this answer

Options C and D are correct. AWS Transfer Family provides a fully managed SFTP service that can transfer files directly to S3 with encryption in transit. AWS Glue can then be used to process the CSV files in S3.

Option B is wrong because Lambda has a 15-minute timeout and may not handle large files. Option A is wrong because Kinesis Data Firehose is for streaming data, not batch file transfers. Option E is wrong because Athena is a query service, not a transfer service.

210
Multi-Selecthard

A company is using Amazon Redshift for its data warehouse. The data engineering team needs to improve query performance for a large fact table that is frequently joined with multiple dimension tables. Which THREE strategies should be considered?

Select 3 answers
A.Define sort keys on columns used in WHERE clauses.
B.Use DISTSTYLE EVEN to distribute data evenly.
C.Increase the number of nodes in the cluster.
D.Choose an appropriate distribution key based on join columns.
E.Apply columnar compression to reduce storage and I/O.
AnswersA, D, E

Improves filter efficiency.

Why this answer

Option A is correct because defining sort keys on columns used in WHERE clauses allows Amazon Redshift to use zone maps to skip large blocks of data that do not satisfy the filter condition, dramatically reducing the amount of data scanned. This is especially effective for large fact tables where selective filters can prune entire disk blocks, improving query performance without additional hardware.
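
The block-skipping effect can be illustrated with a toy model of zone maps: each block records the minimum and maximum of its values, and a range filter only reads blocks whose range overlaps the predicate. This is a conceptual sketch, not Redshift's internal implementation; the block size of 4, the sample column, and the helper names are assumptions.

```python
def build_zone_map(values, block_size):
    """Split values into blocks and record (min, max) per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the filter [low, high]."""
    return sum(1 for lo, hi in zone_map if hi >= low and lo <= high)


unsorted_col = [7, 1, 12, 3, 9, 15, 2, 10, 5, 14, 0, 8, 11, 4, 13, 6]
sorted_col = sorted(unsorted_col)  # what a sort key on this column achieves

# Filter equivalent to: WHERE col BETWEEN 4 AND 7
unsorted_scans = blocks_to_scan(build_zone_map(unsorted_col, 4), 4, 7)
sorted_scans = blocks_to_scan(build_zone_map(sorted_col, 4), 4, 7)
print(unsorted_scans, sorted_scans)
```

With the unsorted column every block's range overlaps the filter, while the sorted column lets all but one block be skipped.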

Exam trap

The trap here is that candidates often assume DISTSTYLE EVEN is always the best choice for performance, but for frequently joined fact tables, a distribution key aligned with the join columns is critical to avoid network-heavy data shuffling.

211
MCQmedium

A data engineer needs to store sensitive data in Amazon S3 and automatically classify the data using a managed service. The data is uploaded via an S3 bucket. Which AWS service can automatically detect and classify sensitive data?

A.Amazon Macie
B.AWS WAF
C.Amazon Inspector
D.AWS Shield
AnswerA

Macie automatically discovers and classifies sensitive data.

Why this answer

Option A is correct because Amazon Macie uses machine learning to discover and classify sensitive data in S3. Option D is incorrect because AWS Shield is for DDoS protection. Option C is incorrect because Amazon Inspector is for vulnerability assessment on EC2.

Option B is incorrect because AWS WAF is a web application firewall.

212
MCQhard

A data engineer is designing a data pipeline that ingests streaming data from Kinesis Data Streams, transforms it using AWS Lambda, and writes to S3. The Lambda function sometimes fails due to transient errors, and the engineer wants to ensure no data is lost. Which approach should be used?

A.Use the Kinesis Client Library to process records with checkpointing
B.Increase the Lambda function's timeout and memory
C.Use Kinesis Data Firehose as the delivery stream with Lambda for transformation and configure error handling with retries and a backup S3 bucket
D.Configure a dead-letter queue (DLQ) on Lambda to capture failed records
AnswerC

Firehose automatically retries on errors and can send failed records to a backup S3 bucket.

Why this answer

Option C (Use Kinesis Data Firehose as the delivery stream with Lambda for transformation and configure error handling with retries and a backup S3 bucket) is correct because Firehose provides built-in retry and error handling. Option B (Increase the Lambda function's timeout and memory) does not solve transient errors. Option D (Configure a DLQ) helps capture failures but does not automatically retry.

Option A (Use the Kinesis Client Library) requires more custom code.

213
Multi-Selecthard

A company stores sensitive data in Amazon S3. The security team requires encryption at rest and that the encryption keys are managed by the company using AWS KMS. The data is frequently accessed by multiple AWS services. Which THREE steps should be taken to meet these requirements?

Select 3 answers
A.Use client-side encryption with the KMS key before uploading to S3
B.Configure the KMS key policy to allow the necessary AWS services to use the key for decryption
C.Enable default encryption on the S3 bucket using SSE-S3
D.Create a bucket policy that denies s3:PutObject if the object is not encrypted with SSE-KMS
E.Enable default encryption on the S3 bucket using SSE-KMS
AnswersB, D, E

Services must have decrypt permissions to access the encrypted objects.

Why this answer

Option B is correct because the security team requires that encryption keys be managed by the company using AWS KMS, and that multiple AWS services can access the data. To allow those services to decrypt objects encrypted with a customer-managed KMS key, the KMS key policy must explicitly grant the necessary AWS services (e.g., AWS Lambda, Amazon Athena) permission to use the key for decryption (kms:Decrypt). Without this policy, even if the bucket is configured for SSE-KMS, the services will fail to read the encrypted objects.

Exam trap

AWS often tests the distinction between enforcing encryption (bucket policy) and enabling access to encrypted data (KMS key policy), leading candidates to overlook the KMS key policy step when multiple services need to decrypt objects.

214
MCQmedium

An AWS Glue job that performs data transformation on large Parquet files in Amazon S3 is taking a long time to complete. The job uses the default number of DPUs. Which change would most likely improve the job's performance?

A.Increase 'Max capacity' (number of DPUs) for the job.
B.Use 'coalesce' to reduce the number of output files.
C.Reduce the number of partitions in the source data.
D.Change the input format from Parquet to CSV.
AnswerA

More DPUs provide more compute resources.

Why this answer

Option A is correct because increasing DPUs adds parallelism and memory, speeding up processing. Option D is incorrect because Parquet is already efficient; CSV would be slower. Option C is incorrect because reducing partitions may cause OOM.

Option B is incorrect because using 'coalesce' to consolidate output into fewer, larger files can reduce parallelism.

215
MCQmedium

A data engineer runs the above AWS CLI command and receives the output. The object is part of an S3 Lifecycle policy that transitions objects to Glacier Instant Retrieval after 30 days. The object was created on January 1, 2023. Why is the object still in STANDARD_IA storage class?

A.The Lifecycle policy has a filter that excludes this object's prefix
B.Versioning is enabled and the current version is not the oldest
C.The object has not reached the transition age of 30 days yet
D.The metadata timestamp is used for lifecycle transitions instead of LastModified
AnswerC

The LastModified is Jan 2, so as of Jan 3, it is only 1 day old.

Why this answer

Option C is correct because the S3 Lifecycle rule transitions objects to Glacier Instant Retrieval after 30 days, but the object was created on January 1, 2023, and the current date (implied by the command output) is before January 31, 2023. The transition age is calculated from the object's LastModified date, not from any other timestamp, and the object must be at least 30 days old before S3 applies the transition. Since the object is only 20 days old (as of January 21, 2023, based on the output showing STANDARD_IA), it has not yet met the 30-day threshold.

Exam trap

The trap here is that candidates assume the object's current storage class (STANDARD_IA) means the 30-day transition to Glacier Instant Retrieval has already failed or been misconfigured, when in fact the object simply hasn't aged enough yet.

How to eliminate wrong answers

Option A is wrong because nothing in the AWS CLI command output indicates that a filter excludes this object's prefix; the object's age alone explains why the transition has not happened. Option B is wrong because versioning being enabled does not prevent lifecycle transitions; a Transition action applies to the current version, and that version's age is based on its own LastModified date, not on the oldest version. Option D is wrong because S3 Lifecycle transitions are based on the object's LastModified date, not any metadata timestamp; the LastModified field is the authoritative timestamp for age calculations.
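
The age check is simple date arithmetic, sketched below. The creation date comes from the question; the two evaluation dates are assumptions, because the CLI output is not reproduced here. The `is_transition_due` helper is illustrative and ignores details such as S3 rounding the transition time to the next midnight UTC.

```python
from datetime import date

TRANSITION_AFTER_DAYS = 30  # from the Lifecycle rule in the question


def is_transition_due(created: date, as_of: date) -> bool:
    """True once the object is at least TRANSITION_AFTER_DAYS old."""
    return (as_of - created).days >= TRANSITION_AFTER_DAYS


created = date(2023, 1, 1)
for as_of in (date(2023, 1, 21), date(2023, 1, 31)):  # assumed check dates
    age_days = (as_of - created).days
    print(as_of, age_days, is_transition_due(created, as_of))
```

Any evaluation date earlier than January 31, 2023 leaves the object below the 30-day threshold.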

216
Multi-Selecteasy

Which THREE AWS services can be used to centrally manage and govern data across multiple AWS accounts? (Select THREE.)

Select 3 answers
A.Amazon S3
B.AWS Control Tower
C.AWS Organizations
D.Amazon Redshift
E.AWS Lake Formation
AnswersB, C, E

Control Tower provides a governance framework for multi-account environments.

Why this answer

Options B, C, and E are correct. Option C (AWS Organizations) provides centralized management. Option E (AWS Lake Formation) can manage data lake permissions across accounts.

Option B (AWS Control Tower) offers a pre-configured environment for governance. Option A (Amazon S3) is a storage service, not central governance. Option D (Amazon Redshift) is a data warehouse, not a governance service.

217
MCQmedium

A company is using Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data must be transformed from JSON to Parquet format before landing in S3. The transformation logic is simple: convert the JSON schema to Parquet. Which approach meets the requirements with the least operational overhead?

A.Use the built-in data format conversion feature of Firehose with an AWS Glue Data Catalog table
B.Use an AWS Lambda function to transform records to Parquet before sending to Firehose
C.Use Amazon Kinesis Data Analytics to convert the stream to Parquet
D.Provision an Amazon EMR cluster to convert the data in micro-batches
AnswerA

Firehose can convert to Parquet automatically.

Why this answer

Option A is correct because Firehose can natively convert JSON to Parquet using a schema from AWS Glue Data Catalog, without custom code. Option B (Lambda) requires writing and maintaining code. Option C (Kinesis Data Analytics) is overkill.

Option D (Amazon EMR) adds cluster management overhead.

218
MCQeasy

Refer to the exhibit. A Lambda function named 'IngestionProcessor' is failing. The engineer checks CloudWatch Logs and sees the log group exists but storedBytes is 0. Why might the logs show no data?

A.The Lambda execution role does not have permission to write logs to CloudWatch
B.The Lambda function is configured with a dead letter queue
C.The Lambda function has not been invoked yet
D.The log group is encrypted with a KMS key and the Lambda function lacks decrypt permission
AnswerA

Without logs:CreateLogGroup, CreateLogStream, PutLogEvents, logs are not written.

Why this answer

The log group exists but contains no data (storedBytes is 0). Because the function is failing, it has already been invoked, so the most likely cause is that its execution role lacks the logs:CreateLogStream and logs:PutLogEvents permissions for CloudWatch Logs.
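
As a rough illustration, the sketch below shows an IAM statement granting the CloudWatch Logs actions named above, plus a small check for missing actions. The wildcard resource and the `missing_log_actions` helper are assumptions for illustration; the helper ignores wildcards in action names and Deny statements, so it is not a real policy evaluator.

```python
# Logging permissions for a Lambda execution role (sketch).
logging_statement = {
    "Effect": "Allow",
    "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
    ],
    # Broad resource for illustration only; scope it down in practice.
    "Resource": "arn:aws:logs:*:*:*",
}

REQUIRED = {"logs:CreateLogStream", "logs:PutLogEvents"}


def missing_log_actions(statements):
    """Return required log actions not granted by any Allow statement."""
    granted = set()
    for stmt in statements:
        if stmt.get("Effect") == "Allow":
            actions = stmt.get("Action", [])
            # Action may be a single string or a list of strings.
            granted.update([actions] if isinstance(actions, str) else actions)
    return REQUIRED - granted


print(missing_log_actions([logging_statement]))
```

A role whose statements grant only unrelated actions would report both required actions as missing, which matches the empty log group in the scenario.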

219
Multi-Selectmedium

A data engineer is designing a data pipeline that processes PII data using AWS Glue and stores results in S3. Which TWO actions should be taken to protect the data? (Choose 2)

Select 2 answers
A.Use S3 default encryption with SSE-S3 for the output bucket.
B.Store database credentials in AWS Secrets Manager and reference them in Glue connections.
C.Enable S3 object deletion protection by setting a retention policy.
D.Configure AWS Glue to use a KMS key for encrypting data written to S3.
E.Use HTTPS for all data transfer between Glue and S3.
AnswersB, D

Secrets Manager secures credentials.

Why this answer

Options B and D are correct. Option D: Glue can use a KMS key to encrypt data at rest. Option B: Secrets Manager securely stores database credentials.

Option A is wrong because SSE-S3 does not use KMS. Option C is wrong because a retention policy guards against deletion; it does not protect the confidentiality of PII. Option E is wrong because HTTPS is not a substitute for encryption at rest.

220
MCQhard

A data engineer is designing a data pipeline that ingests data from an on-premises system into Amazon S3 using AWS Transfer Family. The data must be encrypted at rest using a customer-managed key in AWS KMS. The S3 bucket policy must allow only encrypted connections. Which policy condition should be used?

A.aws:SecureTransport
B.kms:EncryptionContext
C.s3:x-amz-server-side-encryption-aws-kms-key-id
D.s3:x-amz-server-side-encryption
AnswerA

This condition enforces that connections use TLS.

Why this answer

Option A is correct because the condition aws:SecureTransport is used to enforce encrypted connections (TLS). Option C is wrong because s3:x-amz-server-side-encryption-aws-kms-key-id enforces the use of a specific KMS key, not encrypted connections. Option B is wrong because kms:EncryptionContext is for additional context, not connection encryption.

Option D is wrong because s3:x-amz-server-side-encryption enforces SSE, not in-transit encryption.
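
A minimal sketch of a bucket policy that uses `aws:SecureTransport` follows. The bucket name and statement ID are hypothetical; the pattern shown is a Deny on any request where the condition key is `false`, meaning the request did not arrive over TLS.

```python
import json

BUCKET = "example-ingest-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",  # illustrative name
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket and the objects inside it.
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Encryption at rest with the customer-managed KMS key is a separate control and would need its own configuration or policy statement.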

221
MCQeasy

A company is using Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data is delivered in 5-minute intervals. The company wants to reduce the delivery frequency to 1 minute to get data faster. Which parameter should be changed in the Firehose delivery stream configuration?

A.Reduce the buffer interval from 300 seconds to 60 seconds.
B.Increase the buffer size to trigger delivery sooner.
C.Enable dynamic partitioning to deliver data more frequently.
D.Enable compression to reduce data size and speed up delivery.
AnswerA

The buffer interval controls the maximum time between deliveries.

Why this answer

Option A is correct because the buffer interval determines how often data is delivered to the destination. Changing it from 300 seconds to 60 seconds will deliver data every minute. Option B (buffer size) affects delivery based on data volume, not time.

Option C (dynamic partitioning) does not affect delivery frequency. Option D (compression) does not affect frequency.
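
The interaction between the two buffering hints can be sketched as follows: delivery happens when either the size threshold or the interval is reached, whichever comes first. The `seconds_until_delivery` helper and the ingest rate are assumptions for illustration; in the Firehose API the hints are set as `SizeInMBs` and `IntervalInSeconds` under `BufferingHints`.

```python
def seconds_until_delivery(rate_mb_per_s, size_mb, interval_s):
    """Time until a flush, assuming a steady, positive ingest rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate_mb_per_s must be positive")
    time_to_fill = size_mb / rate_mb_per_s
    # Whichever buffering hint is satisfied first triggers delivery.
    return min(time_to_fill, float(interval_s))


# Hypothetical low-volume stream: 0.01 MB/s with a 5 MB buffer size.
before = seconds_until_delivery(0.01, size_mb=5, interval_s=300)
after = seconds_until_delivery(0.01, size_mb=5, interval_s=60)
print(before, after)
```

For a low-volume stream like this one the interval is the binding limit, which is why lowering it from 300 to 60 seconds changes the delivery frequency.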

222
MCQmedium

A data pipeline uses AWS Step Functions to orchestrate multiple Lambda functions for data transformation. The pipeline occasionally fails with a 'StateMachineExecutionLimitExceeded' error. What is the MOST likely cause?

A.The API Gateway endpoint used by Step Functions has a rate limit.
B.The Lambda functions have reached their concurrent execution limit.
C.The account has reached the maximum number of concurrent state machine executions.
D.The state machine definition has a syntax error causing infinite loops.
AnswerC

Step Functions has a limit on concurrent executions; increase the limit or reduce concurrency.

Why this answer

Option C is correct because Step Functions has a default limit on concurrent executions (e.g., 1 million per account per region). Option B is wrong because Lambda concurrency limits would produce a different error. Option A is wrong because API Gateway is not involved.

Option D is wrong because state machine definition does not affect execution limits.

223
MCQeasy

A company uses Amazon RDS for PostgreSQL. The data engineer needs to ensure that the database is automatically backed up and that backups are retained for 35 days. What is the simplest way to achieve this?

A.Use AWS Backup to schedule daily backups with a 35-day retention.
B.Enable automated backups with a retention period of 35 days in the RDS instance configuration.
C.Create a manual snapshot every day and delete them after 35 days using a script.
D.Enable automatic export of transaction logs to Amazon S3 and use S3 lifecycle policies.
AnswerB

RDS automated backups run daily and retain backups for the specified period, up to 35 days.

Why this answer

Amazon RDS for PostgreSQL allows you to enable automated backups directly in the instance configuration. By setting the backup retention period to 35 days, RDS automatically performs daily snapshots and retains transaction logs for point-in-time recovery within that window. This is the simplest method because it requires no external services or custom scripting.

Exam trap

The trap here is that candidates may overcomplicate the solution by choosing AWS Backup (Option A) or manual scripting (Option C), not realizing that RDS native automated backups already provide the simplest, fully managed way to achieve the required retention period.

How to eliminate wrong answers

Option A is wrong because AWS Backup is an additional service that adds complexity and cost; RDS native automated backups already support retention up to 35 days without needing AWS Backup. Option C is wrong because manual snapshots require custom scripting to create and delete daily, which is not the simplest approach and does not provide automated point-in-time recovery. Option D is wrong because automatic export of transaction logs to S3 is not a native RDS feature for PostgreSQL; RDS handles transaction logs internally for point-in-time recovery, and using S3 lifecycle policies would not replace the need for automated backups.

224
MCQhard

Refer to the exhibit. A Lambda function with this IAM policy is used to process records from a Kinesis stream and write to S3. The function is failing with access denied errors when writing to S3. What is the issue?

A.The function needs to use Kinesis Data Analytics for transformation.
B.The Kinesis stream ARN is incorrect.
C.The Lambda function does not have permission to read from the Kinesis stream.
D.The S3 bucket name in the policy does not match the actual bucket used by the function.
AnswerD

Common cause of access denied.

Why this answer

Option D is correct because the policy allows s3:PutObject on 'my-bucket/*' but the function might be writing to a different bucket, or the policy may be missing permissions on the bucket itself (e.g., s3:ListBucket). Option C is wrong because the policy includes GetRecords. Option B is wrong because the policy has Kinesis actions and the failure occurs on the S3 write.

Option A is wrong because Kinesis Data Analytics is not mentioned.

225
MCQmedium

A company uses Amazon Kinesis Data Analytics to process real-time data. The application needs to aggregate data over a 10-minute window. The team notices that late-arriving events are being dropped. Which configuration should they adjust?

A.Configure a Kinesis Firehose delivery stream to buffer the late events.
B.Increase the shard count of the source Kinesis stream.
C.Set the allowed_lateness parameter in the application's windowed aggregation.
D.Increase the RecordColumn count in the input stream mapping.
AnswerC

Controls how late events are accepted.

Why this answer

Option C is correct because Kinesis Data Analytics allows you to set a lateness tolerance (allowed_lateness) for streaming aggregations. Option D is wrong because RecordColumn is for schema. Option B is wrong because the shard count affects parallelism, not late events.

Option A is wrong because the application uses Kinesis Data Analytics, not Firehose.
