Knowledge + Practice

AWS Certified Data Engineer Associate (DEA-C01): Questions 451–525

1786 questions total · 24 pages · All types, answers revealed


Page 7 of 24

451

MCQ · hard

A data engineer is troubleshooting an access issue. A user has the IAM policy shown in the exhibit. The user attempts to upload an object to `s3://data-lake-bucket/confidential/report.pdf`. What will happen?

A.The upload will fail with an 'Access Denied' error.

B.The upload will succeed because the Deny statement is not valid without a condition.

C.The upload will succeed because the Allow statement is more specific than the Deny.

D.The upload will succeed because the user has s3:PutObject permission on the bucket.

Answer: A

The Deny statement explicitly denies all s3 actions on the confidential prefix, taking precedence over the Allow.

Why this answer

Option A is correct because an explicit Deny always overrides an Allow, so the upload is denied. Option B is incorrect because a Deny statement is valid without a Condition element. Option C is incorrect because policy evaluation does not weigh specificity; an explicit Deny wins no matter how specific the Allow is.

Option D is incorrect because, although the user has s3:PutObject allowed on the bucket, the Deny on the confidential prefix takes precedence.
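The precedence rule can be illustrated with a short Python sketch. This is a simplified model and not the real IAM evaluation engine: the exhibit is not reproduced here, so the two statements in `POLICY` are assumed stand-ins, and `is_allowed` is a hypothetical helper that only looks at `Effect`, `Action`, and `Resource` with wildcard matching.

```python
from fnmatch import fnmatchcase

# Assumed stand-in for the exhibit: an Allow on the bucket and a Deny on the prefix.
POLICY = [
    {"Effect": "Allow", "Action": "s3:PutObject",
     "Resource": "arn:aws:s3:::data-lake-bucket/*"},
    {"Effect": "Deny", "Action": "s3:*",
     "Resource": "arn:aws:s3:::data-lake-bucket/confidential/*"},
]

def is_allowed(policy, action, resource):
    """Simplified evaluation: a matching Deny wins, otherwise a matching Allow is required."""
    matches = [
        s for s in policy
        if fnmatchcase(action, s["Action"]) and fnmatchcase(resource, s["Resource"])
    ]
    if any(s["Effect"] == "Deny" for s in matches):
        return False  # an explicit Deny overrides every Allow
    return any(s["Effect"] == "Allow" for s in matches)  # no match means implicit deny

print(is_allowed(POLICY, "s3:PutObject",
                 "arn:aws:s3:::data-lake-bucket/confidential/report.pdf"))
print(is_allowed(POLICY, "s3:PutObject",
                 "arn:aws:s3:::data-lake-bucket/public/report.pdf"))
```

Under these assumed statements the first call is denied and the second is allowed, which mirrors why the upload to the confidential prefix fails.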


452

MCQ · hard

A data pipeline ingests streaming data from Kinesis Data Streams into S3 via Kinesis Data Firehose. Occasionally, small files are written to S3, increasing downstream processing costs. What is the most efficient way to reduce the number of small files?

A.Use a Lambda function to aggregate records before sending to Firehose.

B.Use the Kinesis Client Library (KCL) to write larger batches to S3 directly.

C.Run a daily AWS Glue job to concatenate small files.

D.Increase the Firehose buffering interval to 300 seconds and buffering size to 64 MB.

Answer: D

Firehose will buffer more data per file.

Why this answer

Option D is correct because increasing the buffering interval and size in Firehose batches more data into fewer, larger files. Option A is wrong because Lambda can aggregate records but adds complexity and cost. Option B is wrong because a KCL consumer is custom code that must run on compute the team manages (for example EC2), not a serverless setting change.

Option C is wrong because a Glue job runs after delivery; it compacts small files but does not prevent them from being written.
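A rough back-of-the-envelope sketch shows why larger buffering hints reduce the file count. It relies on the fact that Firehose delivers a batch when either the size hint or the interval hint is reached first. The ingest rate and the "before" settings below are hypothetical, and the model ignores compression and any per-partition effects.

```python
def approx_object_size_mb(ingest_mb_per_s, buffer_interval_s, buffer_size_mb):
    """Approximate object size: whichever buffering hint is reached first wins."""
    return min(buffer_size_mb, ingest_mb_per_s * buffer_interval_s)

def approx_objects_per_hour(ingest_mb_per_s, buffer_interval_s, buffer_size_mb):
    """Approximate number of S3 objects written per hour at a steady ingest rate."""
    size = approx_object_size_mb(ingest_mb_per_s, buffer_interval_s, buffer_size_mb)
    return ingest_mb_per_s * 3600 / size

RATE = 0.5  # MB per second, a hypothetical steady ingest rate

before = approx_objects_per_hour(RATE, buffer_interval_s=60, buffer_size_mb=5)
after = approx_objects_per_hour(RATE, buffer_interval_s=300, buffer_size_mb=64)
print(before, after)
```

With these assumed numbers the stream goes from 360 objects per hour to about 28, because each object grows from 5 MB to 64 MB.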


453

MCQ · hard

Refer to the exhibit. A data engineer runs the AWS CLI command to look up GetObject events. The output shows an event from the DataEngineer role. However, the engineer suspects that some GetObject requests are not being logged. What is the MOST likely reason?

A.The IAM role does not have permission to read the CloudTrail logs.

B.The CloudTrail trail is not configured to log data events.

C.The trail is not enabled in the us-east-1 region.

D.The S3 bucket is in a different region than the CloudTrail trail.

Answer: B

By default, CloudTrail does not log data events for S3 objects; they must be enabled.

Why this answer

Option B is correct because CloudTrail logs S3 object-level operations such as GetObject only when data events are enabled on the trail. Option A is wrong because the lookup returned an event, so the role can already read CloudTrail events. Option C is wrong because an event was returned from the queried Region, so logging is active there.

Option D is wrong because a bucket in a different Region would not cause missing logs as long as events are logged in the bucket's Region.


454

MCQ · hard

A company runs a data lake on AWS using S3 for storage and AWS Glue for ETL. The security team discovers that a contractor who left the company two months ago still has access to an S3 bucket containing sensitive data. The access was granted via an IAM user that was not deleted. The data engineer is asked to implement a solution to prevent future occurrences. The company uses AWS Organizations and has multiple accounts. The requirement is to automatically detect and remediate IAM users that have not been used for 90 days by disabling their access keys and notifying the security team. The solution must be least privilege and use AWS-native services. Which approach should the data engineer take?

A.Use AWS IAM Access Analyzer to generate findings for unused access and create an AWS Config managed rule to automatically disable the IAM user's access keys.

B.Use AWS CloudTrail to monitor IAM user activity and set up a CloudWatch alarm that triggers an SNS notification to the security team to manually disable the keys.

C.Use AWS Lake Formation to revoke the permissions of the IAM user and set up a scheduled Lambda function to check for unused IAM users.

D.Use AWS IAM Access Analyzer to generate findings for unused access and create an AWS Config custom rule with a Lambda function that automatically disables the access keys and sends a notification via SNS.

Answer: D

This automates detection and remediation.

Why this answer

Option D is correct. AWS IAM Access Analyzer can generate findings for unused access, and AWS Config with a custom rule can auto-remediate by invoking a Lambda function to disable keys and notify via SNS. Option A is wrong because a Config managed rule only evaluates compliance; it does not disable access keys by itself, and the option includes no notification to the security team.

Option B is wrong because CloudTrail and an alarm only notify; the keys would still have to be disabled manually. Option C is wrong because Lake Formation manages data lake permissions, not IAM users or their access keys.


455

MCQ · hard

A company has an Amazon DynamoDB table with on-demand capacity mode. The table stores session data for a web application. Recently, the application experienced throttling errors during a traffic spike. The team wants to prevent future throttling while optimizing costs. What should they do?

A.Implement a DynamoDB Accelerator (DAX) cluster

B.Enable DynamoDB auto scaling on the table

C.Switch to provisioned capacity with auto scaling

D.Increase the read and write capacity of the table

Answer: A

DAX provides in-memory caching to reduce read throttling.

Why this answer

A DynamoDB Accelerator (DAX) cluster provides an in-memory cache that absorbs read-heavy traffic spikes, reducing the number of read requests that reach the underlying DynamoDB table. Since the throttling errors occurred during a traffic spike and the table uses on-demand capacity, which already scales automatically for writes and reads, the bottleneck is likely read-heavy traffic overwhelming the table's throughput. DAX offloads reads from the table, preventing throttling without requiring any changes to capacity mode, and it is cost-effective because it reduces read capacity unit consumption.

Exam trap

The trap here is that candidates assume throttling in on-demand mode must be fixed by switching to provisioned capacity or enabling auto scaling, but they overlook that on-demand already scales automatically and the real solution is to reduce read load via caching with DAX.

How to eliminate wrong answers

Option B is wrong because DynamoDB auto scaling is only available for provisioned capacity mode, not on-demand mode; on-demand already scales automatically, so enabling auto scaling is not applicable. Option C is wrong because switching to provisioned capacity with auto scaling would introduce management overhead and potential cost inefficiency compared to on-demand, and it does not address the root cause of read throttling during spikes as effectively as caching. Option D is wrong because increasing read and write capacity is only possible in provisioned mode; in on-demand mode, you cannot manually increase capacity, and doing so would not prevent throttling caused by read-heavy spikes without incurring unnecessary costs.


456

Multi-Select · easy

A company is using Amazon RDS for MySQL and wants to automate backups for point-in-time recovery. Which TWO actions should be taken? (Choose TWO.)

Select 2 answers

A.Enable automated backups with a retention period.

B.Use AWS Backup to schedule backups.

C.Set the backup retention period to the desired number of days.

D.Enable Multi-AZ deployment.

E.Take manual snapshots daily.

Answers: A, C

Automated backups provide point-in-time recovery and are enabled by default.

Why this answer

Options A and C are correct. Automated backups are enabled by default, and the retention period can be set up to 35 days. Option E is wrong because manual snapshots are not automated and do not provide point-in-time recovery on their own.

Option D is wrong because Multi-AZ is for high availability, not backups. Option B is wrong because AWS Backup is an optional service but not required; RDS native backups suffice.
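As a minimal sketch, the retention setting can be expressed as the keyword arguments for boto3's `modify_db_instance` call. The helper `backup_settings` and the instance name are hypothetical; the block only builds and validates the parameters, so it runs without AWS credentials.

```python
def backup_settings(db_instance_id, retention_days):
    """Build keyword arguments for boto3's rds.modify_db_instance call."""
    if not 1 <= retention_days <= 35:
        # A value of 0 disables automated backups, and with them point-in-time recovery.
        raise ValueError("retention_days must be between 1 and 35")
    return {
        "DBInstanceIdentifier": db_instance_id,
        "BackupRetentionPeriod": retention_days,
        "ApplyImmediately": True,
    }

params = backup_settings("my-mysql-instance", 7)  # hypothetical instance name
# With credentials configured: boto3.client("rds").modify_db_instance(**params)
print(params)
```

Rejecting 0 in the helper is a design choice for this sketch, since a retention period of 0 turns automated backups off.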


457

Drag & Drop · medium

Order the steps to query data in Amazon Redshift Spectrum from an external table in Athena.


Why this order

Start by creating the external schema in Redshift, then the external table, grant permissions, run the query, and verify results.


458

MCQ · hard

Refer to the exhibit. A data engineer is running an AWS Glue job that reads from an S3 bucket encrypted with a customer-managed KMS key. The job fails with the error shown. What is the most likely cause?

A.The S3 bucket policy denies the kms:Decrypt action.

B.The IAM role used by the Glue job is missing the kms:Decrypt permission.

C.The Glue job does not have permission to call kms:GenerateDataKey.

D.The KMS key policy does not grant the Glue service principal access.

Answer: B

The error says no identity-based policy allows kms:Decrypt.

Why this answer

The error indicates that the AWS Glue job cannot access the S3 bucket because it lacks the necessary KMS permissions. Since the bucket is encrypted with a customer-managed KMS key, the IAM role assigned to the Glue job must include the kms:Decrypt permission to read the encrypted objects. Without this permission, the job fails when attempting to decrypt the data.

Exam trap

AWS often tests the distinction between kms:Decrypt and kms:GenerateDataKey, leading candidates to mistakenly choose the latter when the job is only reading data, not writing or generating new encryption keys.

How to eliminate wrong answers

Option A is wrong because the S3 bucket policy denying kms:Decrypt would cause a different error (e.g., Access Denied), but the error shown specifically points to a missing permission, not a denial. Option C is wrong because kms:GenerateDataKey is used for encrypting new data, not for reading existing encrypted objects; the job only needs kms:Decrypt to read the encrypted data. Option D is wrong because the KMS key policy does not need to grant the Glue service principal directly; the IAM role used by the Glue job is the entity that requires the kms:Decrypt permission, and the key policy must allow that role (or the account) to use the key.
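For illustration, an identity-based policy granting the missing permission might look like the sketch below, written as a Python dictionary and printed as JSON. The key ARN and the `Sid` are hypothetical placeholders, and the key policy would still need to allow the role or the account to use the key.

```python
import json

# Hypothetical key ARN; replace with the customer-managed key that encrypts the bucket.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"

glue_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDecryptForGlueReads",
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],  # reading encrypted objects only needs Decrypt
            "Resource": KEY_ARN,
        }
    ],
}

print(json.dumps(glue_role_policy, indent=2))
```

A job that also writes encrypted output would additionally need kms:GenerateDataKey, which is why the two actions are easy to confuse.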


459

Multi-Select · hard

A company uses AWS Glue to transform data from Amazon S3 into Parquet format. The job fails with an out-of-memory error for large files. Which TWO actions can resolve this issue? (Choose TWO.)

Select 2 answers

A.Change the input format from CSV to JSON.

B.Increase the number of DPUs allocated to the job.

C.Use the Glue streaming ETL feature.

D.Enable CloudWatch logs for detailed error analysis.

E.Split the input data into smaller files.

Answers: B, E

More DPUs provide more memory and processing power.

Why this answer

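Option B is correct because more DPUs give the job more memory and processing power. Option E is correct because smaller input files can be spread across more tasks, which increases parallelism and lowers per-task memory use. Option A is wrong because changing the input format from CSV to JSON does not reduce memory pressure.

Option C is wrong because Glue streaming ETL is meant for streaming sources, not large batch files. Option D is wrong because CloudWatch logs help with debugging but do not resolve the out-of-memory error.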


460

Multi-Select · medium

A data engineer is configuring an S3 bucket policy to allow cross-account access for a partner organization to write data to a specific prefix. The partner's AWS account ID is 111111111111. The engineer wants to ensure that only the partner can write, and that the partner cannot read or delete objects. Which policy statements should be included? (Choose TWO.)

Select 2 answers

A.{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::111111111111:user/PartnerUser"},"Action":"s3:PutObject","Resource":"arn:aws:s3:::mybucket/partner/*"}

B.{"Effect":"Allow","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::mybucket/partner/*","Condition":{"StringEquals":{"aws:SourceAccount":"111111111111"}}}

C.{"Effect":"Allow","Principal":{"AWS":"111111111111"},"Action":"s3:PutObject","Resource":"arn:aws:s3:::mybucket/partner/*"}

D.{"Effect":"Allow","Principal":{"AWS":"111111111111"},"Action":["s3:GetObject","s3:DeleteObject"],"Resource":"arn:aws:s3:::mybucket/partner/*"}

E.{"Effect":"Deny","Principal":{"AWS":"111111111111"},"NotAction":"s3:PutObject","Resource":"arn:aws:s3:::mybucket/partner/*"}

Answers: C, E

Grants write access to the prefix.

Why this answer

Options C and E are correct. Option C grants s3:PutObject on the prefix to the partner account. Option E explicitly denies the partner every action other than s3:PutObject.

Option D is wrong because it grants read and delete. Option A is wrong because the principal should be the partner's account, not a single IAM user in it. Option B is wrong because it opens the statement to any principal and relies on a condition alone, without an explicit deny to block other actions.
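The way the Allow in option C and the Deny with `NotAction` in option E combine can be sketched in Python. This is a deliberately simplified model for a request from the partner account against an object under `partner/`; `partner_can` is a hypothetical helper, not an AWS API.

```python
# Statement from option C: allow the partner account to write under the prefix.
ALLOW_PUT = {
    "Effect": "Allow",
    "Principal": {"AWS": "111111111111"},
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::mybucket/partner/*",
}

# Statement from option E: deny the partner everything except s3:PutObject.
DENY_EVERYTHING_ELSE = {
    "Effect": "Deny",
    "Principal": {"AWS": "111111111111"},
    "NotAction": "s3:PutObject",
    "Resource": "arn:aws:s3:::mybucket/partner/*",
}

def partner_can(action):
    """Simplified check for the partner account on objects under partner/."""
    denied = action != DENY_EVERYTHING_ELSE["NotAction"]  # Deny covers all but NotAction
    allowed = action == ALLOW_PUT["Action"]
    return allowed and not denied

for action in ("s3:PutObject", "s3:GetObject", "s3:DeleteObject"):
    print(action, partner_can(action))
```

Only s3:PutObject passes; reads and deletes are blocked by the explicit Deny even if another policy were to allow them.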


461

MCQ · hard

An IAM role 'DataLakeRole' has the above S3 bucket policy attached to an S3 bucket. The role is assumed by an AWS Glue job. The Glue job is failing with 'Access Denied' errors when trying to list objects in the bucket. Which action should be added to the policy to fix the issue?

A.Add s3:ListObjects action for the bucket ARN.

B.Add s3:ListBucket action for the bucket ARN (arn:aws:s3:::my-data-lake).

C.Add s3:GetObjectVersion action for the object ARN.

D.Add s3:ListBucket action for the object ARN (arn:aws:s3:::my-data-lake/*).

Answer: B

ListBucket is required to list objects in the bucket.

Why this answer

The Glue job is failing with 'Access Denied' when trying to list objects, which requires the s3:ListBucket permission on the bucket itself (not on objects). Option B correctly adds s3:ListBucket for the bucket ARN (arn:aws:s3:::my-data-lake), which grants permission to list the contents of the bucket. Without this action, even if other permissions exist, the ListObjectsV2 API call used by AWS Glue to enumerate objects will be denied.

Exam trap

The trap here is that candidates confuse s3:ListBucket, which applies to the bucket itself, with object-level permissions such as s3:GetObject, or reach for s3:ListObjects because it matches the API name. That leads them to pick Option D or A, not realizing that listing requires the bucket-level permission on a bucket ARN, not an object ARN.

How to eliminate wrong answers

Option A is wrong because s3:ListObjects is not a valid IAM action; the ListObjects and ListObjectsV2 API operations are authorized by s3:ListBucket. Option C is wrong because s3:GetObjectVersion is used to retrieve a specific version of an object, not to list objects, and does not address the 'list objects' failure. Option D is wrong because s3:ListBucket must be applied to the bucket ARN (arn:aws:s3:::my-data-lake), not to an object ARN (arn:aws:s3:::my-data-lake/*); a statement scoped to the object ARN does not match bucket-level list requests, so it would not grant permission to list the bucket's contents.
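The bucket ARN versus object ARN distinction can be sketched as follows. The exhibit is not reproduced here, so the s3:GetObject statement is an assumed example of what the existing policy might contain, and `is_bucket_arn` is a hypothetical helper used only to show which resource form each action expects.

```python
BUCKET_ARN = "arn:aws:s3:::my-data-lake"

statements = [
    # Listing is a bucket-level operation, so it targets the bucket ARN.
    {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": BUCKET_ARN},
    # Assumed existing statement: object reads target the object ARN pattern.
    {"Effect": "Allow", "Action": "s3:GetObject", "Resource": BUCKET_ARN + "/*"},
]

def is_bucket_arn(arn):
    """True when the ARN names a bucket rather than an object key pattern."""
    return "/" not in arn.split(":::", 1)[1]

for s in statements:
    level = "bucket" if is_bucket_arn(s["Resource"]) else "object"
    print(s["Action"], "->", level, "level resource")
```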


462

Multi-Select · hard

A company stores sensitive financial data in an Amazon Redshift cluster. The data engineer must ensure that all queries are logged for audit purposes and that the logs are stored in Amazon S3 with server-side encryption. Which THREE steps should the data engineer take to meet these requirements?

Select 3 answers

A.Configure audit logs to be stored in an Amazon S3 bucket.

B.Enable encryption on the Redshift cluster.

C.Enable AWS CloudTrail to log Redshift queries.

D.Enable audit logging on the Redshift cluster.

E.Enable default encryption on the S3 bucket using SSE-S3 or SSE-KMS.

Answers: A, D, E

Audit logs can be delivered to an S3 bucket.

Why this answer

To audit queries, enable audit logging on the cluster, deliver the logs to an S3 bucket, and make sure the bucket encrypts them. Option D is correct because audit logging captures the query logs. Option A is correct because the audit logs are delivered to an S3 bucket.

Option E is correct because enabling default encryption on the S3 bucket encrypts the logs. Option C is wrong because CloudTrail records Redshift API calls, not the SQL queries run on the cluster. Option B is wrong because enabling encryption on the Redshift cluster encrypts data at rest in the cluster, not the audit logs.


463

MCQ · medium

Refer to the exhibit. A data engineer deploys this CloudFormation template to create an AWS Glue job. The job fails on the first run with an error: 'AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/GlueServiceRole/... is not authorized to perform: s3:GetObject on resource: s3://my-bucket/scripts/etl.py'. What is the most likely cause?

A.The ExecutionProperty MaxConcurrentRuns is set to 1, preventing the job from running.

B.The IAM role associated with the Glue job does not have an S3 GetObject permission for the script location.

C.The MaxRetries is set to 0, so the job does not retry on failure.

D.The script location is incorrectly specified; it should be an S3 URI with bucket and key.

Answer: B

Glue needs s3:GetObject on the script.

Why this answer

Option B is correct because the IAM role (GlueServiceRole) does not have permission to read the script from S3, as the AccessDeniedException on s3:GetObject shows. Option D is wrong because the script location is already a valid S3 URI with bucket and key. Option A is wrong because setting MaxConcurrentRuns to 1 does not prevent a first run.

Option C is wrong because MaxRetries set to 0 only means no retries; the error is about access.


464

MCQ · easy

A data engineer is running an Amazon EMR cluster with Spark to process log files. The cluster uses instance fleets with m5.xlarge core nodes. The engineer observes that the Spark job is running slower than expected. CloudWatch metrics show that the cluster's CPU utilization is below 20% but memory utilization is near 90%. Which configuration change would most likely improve performance?

A.Use memory-optimized instances (r5.xlarge) for core nodes.

B.Increase the number of core nodes from 5 to 10.

C.Increase the number of Spark shuffle partitions.

D.Decrease the number of core nodes to reduce overhead.

Answer: A

r5 instances have higher memory-to-CPU ratio, reducing memory pressure and spills.

Why this answer

Option A is correct because high memory utilization with low CPU utilization indicates that the data does not fit in memory, causing spills to disk. A memory-optimized instance type (for example r5.xlarge) provides more memory per core. Option B is wrong because adding core nodes adds more CPU but does not change the memory available per node.

Option D is wrong because removing nodes reduces total memory. Option C is wrong because the bottleneck is memory, not the number of shuffle partitions.
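The memory-per-core difference is simple arithmetic. The sketch below uses the published sizes of the two instance types (4 vCPUs each, 16 GiB for m5.xlarge and 32 GiB for r5.xlarge); how much of that memory Spark executors actually receive depends on the cluster's YARN and Spark configuration, which is not modeled here.

```python
INSTANCE_SPECS = {
    "m5.xlarge": {"vcpus": 4, "memory_gib": 16},  # general purpose
    "r5.xlarge": {"vcpus": 4, "memory_gib": 32},  # memory optimized
}

def memory_per_vcpu(instance_type):
    """GiB of instance memory available per vCPU."""
    spec = INSTANCE_SPECS[instance_type]
    return spec["memory_gib"] / spec["vcpus"]

for name in INSTANCE_SPECS:
    print(name, memory_per_vcpu(name), "GiB per vCPU")
```

The r5.xlarge offers twice the memory for the same vCPU count, which matches a workload that is memory-bound while its CPUs sit mostly idle.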


465

MCQ · easy

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The job runs successfully but the data in S3 is missing some records that exist in the source. The engineer notices that the job uses a JDBC connection and the query extracts data based on a timestamp column. What is the MOST likely cause of the missing records?

A.The timestamp column includes time portion and the job is using an exclusive upper bound.

B.The S3 bucket lacks write permissions.

C.The JDBC connection uses connection pooling, causing some records to be dropped.

D.The Glue job is configured to only read from one table.

Answer: A

Records with the same timestamp as the upper bound may be excluded.

Why this answer

Option A is correct because, when the timestamp column includes a time portion, records whose timestamp equals the upper bound are missed if the job uses an exclusive upper bound. Option D is wrong because nothing in the scenario limits the job to one table, and that would not explain records missing from within a table. Option B is wrong because missing S3 write permissions would cause the job to fail, not to drop records.

Option C is wrong because connection pooling does not cause missing records.
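A small in-memory sketch shows how a boundary record can fall between two extraction windows. The rows, the window dates, and the `extract` helper are all hypothetical; the upper bound is always exclusive, and the only thing that varies is whether the next run's lower bound is inclusive.

```python
from datetime import datetime

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 23, 59, 59)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 0, 0, 0)},  # exactly on the boundary
    {"id": 3, "updated_at": datetime(2024, 1, 2, 0, 0, 1)},
]

def extract(rows, lower, upper, lower_inclusive):
    """Select row ids in a window whose upper bound is always exclusive."""
    return [
        r["id"] for r in rows
        if (r["updated_at"] >= lower if lower_inclusive else r["updated_at"] > lower)
        and r["updated_at"] < upper
    ]

boundary = datetime(2024, 1, 2)
run1 = extract(rows, datetime(2024, 1, 1), boundary, lower_inclusive=True)
# Exclusive on both ends: the row stamped exactly at the boundary is never read.
run2_buggy = extract(rows, boundary, datetime(2024, 1, 3), lower_inclusive=False)
# Half-open windows [lower, upper) pick it up in the following run.
run2_fixed = extract(rows, boundary, datetime(2024, 1, 3), lower_inclusive=True)
print(run1, run2_buggy, run2_fixed)
```

Using half-open windows consistently, inclusive at the start and exclusive at the end, is one common way to make sure consecutive runs neither skip nor duplicate boundary rows.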


466

MCQ · easy

Refer to the exhibit. A data engineer runs the AWS CLI command and observes the output. The stream has two shards. A producer sends a record with a partition key that hashes to 150000000000000000000000000000000000000. To which shard will the record be written?

A.shardId-000000000001

B.The record will be rejected because it does not match any shard

C.shardId-000000000000

D.The record will be written to both shards

Answer: A

The hash key falls within the range of the second shard.

Why this answer

Option A is correct because the hash key range for shardId-000000000000 is 0 to 113427455640312821154458202477256070484. The given hash 150000... is greater than the end of the first shard, so it falls into the second shard's range (113427... to 226854...). Option C is wrong because the hash is not in the first shard's range.

Option B is wrong because the hash is within the second shard's range, so the record is not rejected. Option D is wrong because shard hash key ranges do not overlap, so a record is written to exactly one shard.
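The range lookup can be checked with a few lines of Python. The first shard's end matches the value quoted above; the exact bounds of the second shard are an assumption (the exhibit is not reproduced here), derived by splitting the 128-bit key space into equal thirds. Kinesis maps partition keys to 128-bit integers with MD5; reading the digest as a big-endian integer in `hash_partition_key` is also an assumption of this sketch.

```python
import hashlib

THIRD = 2**128 // 3  # assumed split point; THIRD - 1 equals the quoted end of shard 0

SHARDS = {
    "shardId-000000000000": (0, THIRD - 1),
    "shardId-000000000001": (THIRD, 2 * THIRD - 1),  # assumed bounds
}

def shard_for_hash(hash_key):
    """Return the shard whose inclusive hash key range contains hash_key."""
    for shard_id, (start, end) in SHARDS.items():
        if start <= hash_key <= end:
            return shard_id
    return None

def hash_partition_key(partition_key):
    """Map a partition key to a 128-bit integer via MD5 (big-endian, assumed)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

print(shard_for_hash(150000000000000000000000000000000000000))
```

Because the ranges are inclusive and do not overlap, each hash key resolves to at most one shard.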


467

Multi-Select · easy

A data engineer needs to store streaming data from multiple sources into Amazon S3. The data should be organized by source, date, and hour. The engineer wants to minimize processing overhead. Which THREE S3 features should the engineer use to achieve this? (Choose THREE.)

Select 3 answers

A.S3 Inventory to list objects and their metadata.

B.S3 Object Lock to prevent object modifications.

C.S3 Batch Operations to rename objects after upload.

D.S3 Event Notifications to invoke Lambda functions for data processing.

E.S3 prefixes to create a folder structure (e.g., source=.../date=.../hour=...).

Answers: A, D, E

Inventory helps audit and manage the stored data.

Why this answer

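Option E is correct because S3 prefixes organize objects into a hierarchy by source, date, and hour. Option A is correct because S3 Inventory provides a list of objects and their metadata. Option D is correct because S3 Event Notifications trigger downstream processing.

Option C is wrong because S3 Batch Operations performs bulk actions on existing objects rather than organizing data as it arrives. Option B is wrong because S3 Object Lock is for retention, not organization.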
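A prefix layout of this kind can be generated when each object is written, so nothing has to be renamed later. The helper below is a hypothetical sketch; the source name, file name, and the choice of UTC for the date and hour are assumptions.

```python
from datetime import datetime, timezone

def object_key(source, event_time, filename):
    """Build a Hive-style key: source=.../date=.../hour=.../filename."""
    t = event_time.astimezone(timezone.utc)  # normalize so partitions are consistent
    return f"source={source}/date={t:%Y-%m-%d}/hour={t:%H}/{filename}"

key = object_key(
    "sensor-a",
    datetime(2024, 5, 17, 9, 30, tzinfo=timezone.utc),
    "batch-0001.json",
)
print(key)
```

Writing keys in `name=value` form also lets query engines that understand Hive-style partitions prune by source, date, and hour.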


468

MCQ · hard

A company runs a data pipeline using AWS Step Functions to orchestrate multiple AWS Lambda functions and AWS Glue jobs. The pipeline processes large CSV files from Amazon S3, transforms them, and loads them into Amazon Redshift. Recently, the pipeline has been failing intermittently with a 'StateMachineExecutionLimitExceeded' error. The error occurs when multiple pipeline runs are triggered simultaneously. The current execution limit for the state machine is 1000. The team expects up to 200 concurrent executions during peak hours. Which action should the team take to resolve the issue?

A.Increase the execution timeout for the state machine to 1 hour.

B.Increase the Lambda function concurrency limits to allow more parallel processing.

C.Implement a queue (e.g., Amazon SQS) to buffer the pipeline triggers and process them sequentially.

D.Request a service quota increase for the maximum number of state machine executions from AWS Support.

Answer: D

The current limit is 1000; a service quota increase raises the ceiling on concurrent executions.

Why this answer

Option D is correct because the error indicates the state machine execution limit has been reached. The team should request a limit increase from AWS Support. Option C is wrong because queuing triggers and processing them sequentially only reduces the number of concurrent executions; it does not raise the limit.

Option B is wrong because increasing Lambda concurrency limits does not affect Step Functions execution limits. Option A is wrong because the error is not about execution timeout; it is about exceeding the maximum number of concurrent executions.


469

MCQ · hard

A company uses Amazon Redshift for analytics. The data engineer notices that queries are slow and the system is experiencing high disk usage. The engineer suspects that the distribution style is suboptimal. Which action should the engineer take to improve query performance?

A.Convert all tables to use SORTKEY on the most frequently filtered column.

B.Increase the number of nodes in the cluster to distribute data across more slices.

C.Use the DISTSTYLE AUTO setting and analyze query patterns to let Redshift choose.

D.Set all tables to DISTSTYLE EVEN to distribute data evenly.

Answer: C

AUTO adapts distribution based on workload.

Why this answer

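Option C is correct because DISTSTYLE AUTO, combined with analyzing query patterns, lets Redshift choose a suitable distribution style for each table. Option D is wrong because forcing every table to DISTSTYLE EVEN changes all tables regardless of how they are joined, which may not be ideal. Option B is wrong because adding nodes adds storage and compute but does not correct a suboptimal distribution style.

Option A is wrong because it addresses sort keys, not distribution.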


470

MCQ · easy

A company is using Amazon Redshift for data warehousing. They need to ensure that all queries are logged for audit purposes. Which AWS service should be used to capture query logs?

A.AWS CloudTrail

B.Amazon S3

C.Amazon CloudWatch Logs

D.Amazon Athena

Answer: C

Redshift can send query logs to CloudWatch Logs.

Why this answer

Option C is correct because Amazon Redshift can publish audit logs to CloudWatch Logs. Option A is wrong because CloudTrail captures API activity, not query logs. Option B is wrong because S3 only stores logs that are delivered to it; it does not capture them.

Option D is wrong because Athena is a query service, not a logging service.


471

MCQ · easy

A data engineer is troubleshooting a slow Amazon Redshift query. The EXPLAIN plan shows a 'Seq Scan' on a large table. What is the most likely cause?

A.The cluster has too many nodes.

B.There are too many concurrent queries.

C.The table does not have a proper sort key defined.

D.The workload management (WLM) queue is misconfigured.

Answer: C

Without a sort key, Redshift performs a full table scan (Seq Scan) instead of a range-restricted scan.

Why this answer

Option C is correct because 'Seq Scan' indicates a full table scan, which typically occurs when there is no suitable index (or, in Redshift, no sort key). Option A (too many nodes) would not cause a Seq Scan. Option B (too many concurrent queries) could cause a slowdown but not a Seq Scan specifically.

Option D (WLM queue) affects concurrency, not scan type.


472

Multi-Select · medium

A company ingests IoT data into an S3 bucket using AWS IoT Core rules. The data is in JSON format, and each record is about 500 bytes. The data volume is 5 GB per day. The company wants to convert the data to Parquet format and partition it by year/month/day. Which TWO AWS services can be used together to achieve this with minimal operational overhead?

Select 2 answers

A.Amazon Athena CTAS query

B.AWS Glue ETL job triggered by S3 event

C.AWS Lambda function triggered by S3 event

D.Amazon EMR with Spark job

E.Amazon Kinesis Data Firehose with Parquet conversion

Answers: B, C

Glue can be triggered by S3 events (via Lambda or EventBridge) and perform the conversion and partitioning.

Why this answer

Options B and C are correct. S3 events can trigger Lambda, and Lambda can convert JSON to Parquet and write to partitioned paths. AWS Glue can also be triggered by S3 events via a workflow, and the combination of Lambda and Glue works because Glue can read from the landing zone and write partitioned Parquet.

Option E (Kinesis Data Firehose) can convert records to Parquet, but in this scenario the data already lands in S3 through IoT Core rules rather than through a delivery stream. Option D (EMR) requires cluster management. Option A (Athena CTAS) is primarily a query feature and would need separate scheduling to process newly arriving data.


473

Multi-Select · easy

A data engineer needs to ingest data from an Amazon RDS MySQL database into a data lake on Amazon S3. The engineer wants to perform an initial full load and then capture incremental changes. Which TWO AWS services can be combined to achieve this?

Select 2 answers

A.Amazon Kinesis Data Firehose

B.AWS Glue

C.Amazon S3

D.AWS Database Migration Service (DMS)

E.AWS Data Pipeline

Answers: C, D

S3 is the target for the data lake.

Why this answer

AWS DMS can perform a full load and then capture ongoing changes (CDC), and S3 can be the target, so options D and C are correct. Option B is wrong because Glue does not support CDC from MySQL directly.

Option A is wrong because Firehose does not connect to MySQL. Option E is wrong because Data Pipeline does not support CDC.


474

MCQ · easy

A company uses Amazon Athena to query data in S3. The security team wants to ensure that users can only query tables they have permissions to in the AWS Glue Data Catalog. Which service should be used to manage these permissions centrally?

A.AWS Lake Formation

B.AWS IAM

C.AWS CloudTrail

D.S3 bucket policies

Answer: A

Lake Formation provides fine-grained access control to Data Catalog resources.

Why this answer

Option A is correct because Lake Formation provides centralized permissions management for the Glue Data Catalog. Option B (IAM) is wrong because IAM policies can restrict Data Catalog API access but do not provide centralized, fine-grained table permissions the way Lake Formation does. Option D (S3 bucket policies) does not control table access.

Option C (CloudTrail) is for auditing, not access control.


475

MCQ · medium

A data engineer is migrating an on-premises PostgreSQL database to Amazon RDS for PostgreSQL. The database is 2 TB in size and has a tight migration window. Which migration approach minimizes downtime?

A.Use AWS Database Migration Service (DMS) with full load only.

B.Use pg_dump to export the database and pg_restore to import into RDS.

C.Create a read replica in RDS and promote it when ready.

D.Use AWS DMS with ongoing replication to capture changes during migration.

Answer: D

Ongoing replication syncs changes until cutover, minimizing downtime.

Why this answer

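Option D is correct because AWS DMS with ongoing replication keeps the target in sync until cutover, so downtime is minimal. Option B is a full dump and restore, which for a 2 TB database is slow and requires downtime. Option A is wrong because a full load alone does not capture changes made during the migration, so writes must stop while it runs.

Option C is wrong because an RDS read replica cannot be created directly from an on-premises source; replication would have to be set up manually.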


476

MCQ · hard

A company uses Kinesis Data Analytics for SQL-based real-time analytics on streaming data. They notice that the application is processing data slower than the incoming rate, causing increased latency. Which action is MOST likely to improve the throughput?

A.Increase the number of Kinesis Processing Units (KPUs) for the application

B.Increase the number of shards in the Kinesis data stream

C.Enable auto-scaling on the Kinesis data stream

D.Decrease the retention period of the Kinesis data stream

Answer: A

More KPUs increase parallelism and throughput.

Why this answer

Option A is correct because increasing the number of Kinesis Processing Units (KPUs) allows more parallelism. Option D is wrong because decreasing the retention period does not affect processing speed. Option C is wrong because enabling auto-scaling for the stream itself does not increase application parallelism.

Option B is wrong because increasing shards increases ingestion capacity, but the bottleneck is the analytics application.


477

MCQ · hard

A data engineer ran the command shown in the exhibit on the bucket 'my-data-lake'. The engineer then tries to delete an object version but receives an 'AccessDenied' error. The engineer has full S3 permissions via IAM. What is the most likely reason for the error?

A.Versioning is suspended on the bucket

B.MFA Delete is enabled, requiring multi-factor authentication

C.The bucket policy denies s3:DeleteObjectVersion

D.An S3 Object Lock retention policy is in effect

Answer: B

MFA Delete requires additional authentication to delete versions.

Why this answer

The command shown in the exhibit (likely `aws s3api put-bucket-versioning --bucket my-data-lake --versioning-configuration Status=Enabled,MFADelete=Enabled`) enables both versioning and MFA Delete on the bucket. When MFA Delete is enabled, any operation that permanently deletes an object version or changes the versioning state requires the request to include a multi-factor authentication token. Even though the engineer has full S3 permissions via IAM, the missing MFA token causes the 'AccessDenied' error.

This is a bucket-level setting that overrides IAM permissions for these specific operations.

Exam trap

The trap here is that candidates assume 'full S3 permissions via IAM' guarantees all operations succeed, but MFA Delete is a bucket-level condition that overrides IAM permissions for version deletion and versioning state changes, requiring explicit MFA authentication.

How to eliminate wrong answers

Option A is wrong because suspending versioning does not prevent deletion of existing object versions; it only stops new versions from being created, and the engineer would still be able to delete versions with appropriate IAM permissions. Option C is wrong because the engineer has full S3 permissions via IAM, and there is no indication of a bucket policy explicitly denying s3:DeleteObjectVersion; the error is not caused by a deny statement. Option D is wrong because an S3 Object Lock retention policy prevents deletion or overwrite of objects during the retention period, but the error message is 'AccessDenied' specifically due to missing MFA, not a retention-based block.

Full explanation →

478

MCQhard

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS for MySQL and load it into Amazon S3. The job runs daily and processes incremental changes using the JDBC connection. Recently, the job has been failing with a 'Communications link failure' error. The RDS instance is in a private subnet. Which step should the engineer take first to diagnose the issue?

A.Verify that the IAM role used by Glue has the correct permissions to access RDS.

B.Change the Glue job type from Spark to Python shell.

C.Check the security group and network ACL rules for the RDS instance and the Glue connection.

D.Check that the JDBC driver is compatible with the Glue version.

AnswerC

Network misconfiguration is the most common cause of link failure.

Why this answer

Option C is correct because a 'Communications link failure' points to network connectivity between Glue and RDS, so checking security groups, network ACLs, and subnets is the first step. Option A is wrong because the error is not about authentication or authorization. Option B is wrong because the job type (Spark vs. Python shell) does not affect connectivity.

Option D is a less likely first suspect because the job ran successfully before; a driver incompatibility would normally fail from the first run.

Full explanation →

479

MCQhard

A company uses Amazon Redshift for data warehousing. The security team requires that all data stored in Redshift be encrypted at rest. The current cluster is unencrypted. Which approach should the data engineer take to meet this requirement with minimal downtime?

A.Modify the cluster to enable encryption.

B.Unload data to S3 and reload into a new encrypted cluster.

C.Use the COPY command to load data into a new encrypted table.

D.Take a snapshot of the existing cluster and restore it to a new encrypted cluster.

AnswerD

Snapshot restore allows creating an encrypted cluster with minimal downtime.

Why this answer

Option D is correct because Redshift allows restoring a snapshot to a new encrypted cluster. The engineer can take a snapshot of the existing cluster, restore it to a new cluster with encryption enabled, and then redirect traffic to the new cluster. Option A is wrong because encryption cannot be enabled on an existing cluster.

Option B is wrong because unloading and reloading data would cause significant downtime. Option C is wrong because the COPY command loads data into tables and does not encrypt the cluster.

Full explanation →

480

MCQhard

A data engineer is designing a multi-Region disaster recovery solution for an Amazon DynamoDB table. The table must be available in a secondary Region with minimal data loss and automatic failover. Which feature should be used?

A.DynamoDB on-demand backup and restore in the secondary Region

B.DynamoDB global tables

C.DynamoDB point-in-time recovery (PITR)

D.DynamoDB cross-Region snapshot export to S3

AnswerB

Global tables replicate data across Regions and support automatic failover.

Why this answer

DynamoDB global tables provide a fully managed, multi-Region, multi-active database solution that replicates data automatically across selected AWS Regions. This ensures automatic failover with eventual consistency and minimal data loss, meeting the disaster recovery requirements for high availability and automatic failover without manual intervention.

Exam trap

The trap here is that candidates often confuse point-in-time recovery (PITR) with cross-Region disaster recovery, but PITR is a single-Region feature that does not provide automatic failover or multi-Region replication.

How to eliminate wrong answers

Option A is wrong because on-demand backup and restore is a manual process that requires user intervention to initiate a restore in the secondary Region, not providing automatic failover or minimal data loss in real time. Option C is wrong because point-in-time recovery (PITR) protects against accidental writes or deletes within a single Region by restoring to a point in time, but it does not replicate data across Regions or enable automatic failover. Option D is wrong because cross-Region snapshot export to S3 is a manual, batch-oriented process that exports table data to Amazon S3 in another Region, requiring manual import and setup for failover, and does not provide automatic, continuous replication or failover.

Full explanation →

481

MCQmedium

A company uses Amazon Athena to query data stored in an S3 bucket. The data is partitioned by year, month, day, and hour. The data engineer notices that queries are scanning a large amount of data even with a WHERE clause on the partition columns. What is the MOST likely cause?

A.The data has too many partitions, causing overhead.

B.The table does not have partitions defined in the AWS Glue Data Catalog.

C.The S3 bucket uses the S3 Glacier storage class.

D.The data files are compressed with GZIP.

AnswerB

Without partition definitions, Athena scans all data.

Why this answer

Option B is correct because if partitions are not defined in the table, Athena cannot perform partition pruning and scans every object under the table location. Option A is wrong because more partitions give the WHERE clause finer-grained pruning rather than more data to scan. Option C is wrong because the S3 storage class does not change how much data a query scans.

Option D is wrong because compressed files reduce scan size, not increase.
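As a rough illustration of what "defining partitions" means here, the sketch below generates `ALTER TABLE ... ADD PARTITION` statements for an hourly layout. The table name `events`, the bucket path, and the assumption that the partition columns are strings in a Hive-style `year=/month=/day=/hour=` layout are all hypothetical; the exhibit's actual table is not shown.

```python
from datetime import datetime, timedelta
from typing import List


def add_partition_statements(table: str, base_location: str,
                             start: datetime, hours: int) -> List[str]:
    """Build one ADD PARTITION statement per hour, starting at `start`."""
    statements = []
    for offset in range(hours):
        ts = start + timedelta(hours=offset)
        # Partition values are quoted because the columns are assumed to be strings.
        spec = f"year='{ts:%Y}', month='{ts:%m}', day='{ts:%d}', hour='{ts:%H}'"
        location = (f"{base_location}/year={ts:%Y}/month={ts:%m}"
                    f"/day={ts:%d}/hour={ts:%H}/")
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
            f"LOCATION '{location}'"
        )
    return statements


# Hypothetical table and location, for illustration only.
for stmt in add_partition_statements(
        "events", "s3://example-bucket/events", datetime(2024, 1, 15, 23), 2):
    print(stmt)
```

For data already laid out in Hive-style folders, `MSCK REPAIR TABLE` is another way to register the partitions in one pass.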

Full explanation →

482

MCQhard

A data engineer has attached the IAM policy shown in the exhibit to a role used by an AWS Glue ETL job. The job fails when trying to write to the S3 bucket 'example-bucket' with the error: 'Access Denied'. What is the MOST likely reason?

A.The IAM policy does not include the bucket ARN for write operations.

B.The IAM role's trust policy does not allow Glue to assume the role.

C.The S3 bucket policy denies the PutObject action for the role.

D.The IAM policy does not grant s3:PutObject permission.

AnswerC

A bucket policy can explicitly deny access even if IAM allows it.

Why this answer

Option C is correct. The IAM policy in the exhibit allows s3:PutObject on the bucket, so the denial has to come from somewhere other than the role's identity policy. An explicit Deny in the S3 bucket policy overrides an IAM Allow, which makes the bucket policy the most likely cause.

Option A is incorrect because the policy includes the bucket ARN. Option B is incorrect because a trust policy problem would stop Glue from assuming the role at all, so the job would fail before it reached the S3 write. Option D is incorrect because the policy includes s3:PutObject.

Full explanation →

483

Multi-Selectmedium

A company is building a data lake on Amazon S3. Data arrives from multiple sources in JSON, CSV, and Avro formats. The data must be transformed to Parquet and partitioned by date and source. Which TWO services can perform this transformation with minimal custom code? (Choose TWO.)

Select 2 answers

A.Amazon EMR with Spark

B.AWS Lake Formation

C.Amazon Athena CTAS queries

D.AWS Glue ETL jobs

E.Amazon Kinesis Data Firehose

AnswersA, D

EMR can run Spark for large-scale transformations.

Why this answer

Amazon EMR with Spark is correct because Spark natively supports reading JSON, CSV, and Avro formats and writing Parquet with built-in partitioning by date and source. You can achieve this with a concise PySpark or Scala script that loads the data, applies partitioning logic, and writes to S3, requiring minimal custom code beyond the transformation logic.

Exam trap

The trap here is that candidates often confuse AWS Lake Formation's data catalog and permission features with actual data transformation capabilities, or they assume Kinesis Data Firehose can transform existing S3 objects when it only processes streaming data in transit.

Full explanation →

484

MCQmedium

A social media company ingests user activity data from multiple sources using Amazon Kinesis Data Firehose. The data is delivered to Amazon S3 in near-real-time. The company wants to transform the data by adding a timestamp and masking email addresses before storing it in S3. The transformation should be applied to all records. What is the most cost-effective way to implement this transformation?

A.Use Amazon Athena to run a CTAS query that transforms the data and writes to a new location.

B.Use AWS Glue to schedule a batch job every 5 minutes to transform the data.

C.Use Amazon S3 Events to trigger a Lambda function whenever a new object is created.

D.Configure the Firehose delivery stream to invoke a Lambda function for data transformation.

AnswerD

Firehose supports built-in Lambda transformation for real-time processing.

Why this answer

Option D is correct. Kinesis Data Firehose can invoke a Lambda function to transform data on the fly. This is cost-effective because it runs only when data is flowing.

Option A is wrong because Athena is for querying data that is already stored, not for transforming records before delivery. Option B is wrong because Glue jobs are batch-oriented and add latency. Option C is wrong because S3 Events with Lambda adds complexity and cost.
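A minimal sketch of such a transformation function follows, assuming each Firehose record is a JSON object. The `processed_at` field name, the mask string, and the simple email pattern are illustrative choices, not part of the question. The handler follows the Firehose transformation contract: each returned record carries the original `recordId`, a `result` status, and base64-encoded `data`.

```python
import base64
import json
import re
from datetime import datetime, timezone

# Simple pattern for illustration; production email matching may need to be stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def transform(payload: dict, now: datetime) -> dict:
    """Mask email-like substrings in string values and add a timestamp."""
    out = {}
    for key, value in payload.items():
        if isinstance(value, str):
            value = EMAIL_RE.sub("***@***", value)
        out[key] = value
    out["processed_at"] = now.isoformat()  # hypothetical field name
    return out


def lambda_handler(event, context):
    now = datetime.now(timezone.utc)
    results = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            data = json.dumps(transform(payload, now)) + "\n"
            results.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Not valid JSON or not a JSON object: hand the record back unchanged
            # and mark it as failed.
            results.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": results}
```

The trailing newline keeps one JSON object per line in the delivered S3 objects, which tends to make them easier to query later.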

Full explanation →

485

Multi-Selecthard

A data engineer is designing a disaster recovery plan for an Amazon Redshift data warehouse. The cluster is in us-east-1 and must be recoverable in us-west-2 with minimal data loss. Which THREE actions should the engineer take? (Choose THREE)

Select 3 answers

A.Create manual snapshots and copy them to us-west-2

B.Deploy Redshift in a multi-AZ configuration

C.Enable Redshift concurrent scaling

D.Schedule automated snapshots with a retention period

E.Configure automated snapshot copy to us-west-2

AnswersA, D, E

Manual snapshots can be copied across regions.

Why this answer

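Options A, D, and E are correct. Cross-Region snapshot copy allows recovery in another Region. Scheduled automated snapshots with a retention period provide regular recovery points, and configuring automated snapshot copy sends them to us-west-2.

Manual snapshots are not taken automatically, but copying them to us-west-2 supplements the automated copies. Option B is wrong because a multi-AZ deployment provides high availability within a Region but not across Regions. Option C is wrong because concurrency scaling does not help disaster recovery.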

Full explanation →

486

MCQeasy

A data engineering team is ingesting streaming data from IoT devices using AWS IoT Core and needs to process the data in near real-time with minimal code. Which AWS service should they use to transform the data before storing it in Amazon S3?

A.Amazon Kinesis Data Analytics

B.AWS Glue

C.Amazon Redshift

D.Amazon Athena

AnswerA

Kinesis Data Analytics can run SQL queries on streaming data from IoT Core in near real-time.

Why this answer

Option A is correct because AWS IoT Core can route messages to Kinesis Data Analytics for SQL-based streaming transformations. Option B is wrong because Glue is for batch ETL, not real-time. Option C is wrong because Redshift is for warehousing.

Option D is wrong because Athena is for ad-hoc querying, not streaming.

Full explanation →

487

MCQhard

A company uses Amazon DynamoDB to store session data. The security team requires that all data be encrypted at rest using a customer-managed KMS key. The data engineer has enabled encryption with a KMS key, but discovers that old data remains encrypted with the previous AWS-managed key. How can the engineer re-encrypt all existing data with the new key?

A.Disable and re-enable encryption with the new KMS key

B.Use AWS Backup to back up the table and restore it with the new encryption key

C.Use the DynamoDB console to change the encryption key and select 'Apply to existing data'

D.Export the table to S3 using DynamoDB Export to S3, then import using DynamoDB Import from S3 with the new encryption key specified

AnswerD

Export/Import re-encrypts data.

Why this answer

Option D is correct because Export to S3 and Import from S3 can re-encrypt the data with the new key. Option A is wrong because enabling encryption with a new key only applies to new writes. Option C is wrong because DynamoDB does not support in-place re-encryption.

Option B is wrong because a backup-and-restore copy of the table does not re-encrypt with the new key; it uses the table's default key.

Full explanation →

488

MCQeasy

A company needs to ingest data from Amazon S3 into Amazon Redshift for analytics. The data arrives in CSV format with headers and may contain duplicate rows. Which Redshift command should be used to load the data while handling duplicates?

A.COPY command with the `REMOVEDUPLICATES` option

B.INSERT INTO ... SELECT DISTINCT from S3 via Spectrum

C.Use a staging table with COPY and then MERGE into the target table

D.CREATE TABLE AS SELECT DISTINCT from the S3 bucket

AnswerC

MERGE allows handling duplicates.

Why this answer

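Option C is correct because the approach is to `COPY` the files into a staging table and then use `MERGE` (an upsert pattern) to reconcile them with the target table. Option A is wrong because `COPY` has no `REMOVEDUPLICATES` option, and `COPY` alone does not remove duplicates. Option B is wrong because `INSERT INTO ... SELECT DISTINCT` does not reconcile incoming rows with rows already in the target table.

Option D is wrong because `CREATE TABLE AS` creates a new table instead of loading into the existing target, so it does not handle duplicates there either.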

Full explanation →

489

MCQhard

A data engineer at a financial services company manages an AWS Glue ETL pipeline that processes transaction data from Amazon S3 to Amazon Redshift for reporting. The pipeline runs every hour and uses a Glue job that reads Parquet files, performs transformations in Spark, and writes to Redshift using the JDBC connector. Recently, the job has been failing intermittently with the error: 'java.sql.BatchUpdateException: ERROR: null value in column "transaction_id" violates not-null constraint'. The data engineer has verified that the source Parquet files do contain non-null values for transaction_id. The job uses a DynamicFrame and applies a mapping to rename columns. The engineer also noticed that the failure occurs only during peak hours when there is high concurrency on Redshift. Which course of action should the engineer take to resolve this issue?

A.Add a filter in Glue to remove rows with null transaction_id.

B.Increase the Redshift WLM concurrency scaling to handle more queries.

C.Review the Glue job's mapping transformation to ensure transaction_id is correctly mapped and not dropped.

D.Increase the number of Glue workers to handle peak-hour load.

AnswerC

The mapping may have a bug that sets transaction_id to null.

Why this answer

Option C is correct. The error suggests that some rows are being written with null transaction_id. During high concurrency, Redshift might be rejecting the batch due to a transient issue, but the error is about null constraint.

The most likely cause is that the mapping is incorrectly dropping or nullifying the column. Option A is wrong because the source files are not the issue: they already contain non-null transaction_id values. Option B is wrong because increasing Redshift WLM concurrency could exacerbate the problem.

Option D is wrong because increasing Glue's worker count does not address the null value issue.

Full explanation →

490

MCQeasy

A company needs to ensure that data stored in Amazon RDS is encrypted at rest. Which action should the data engineer take?

A.Enable encryption at rest by modifying the existing RDS instance.

B.Encrypt the underlying EBS volumes using AWS KMS.

C.Create a new RDS instance with encryption enabled using AWS KMS.

D.Enable SSL/TLS for connections to the RDS instance.

AnswerC

Encryption at rest must be enabled at launch time for RDS.

Why this answer

Option C is correct because encryption at rest for RDS is enabled at instance creation time by enabling encryption with KMS. Option A is wrong because encryption cannot be switched on by modifying an existing unencrypted instance. Option B is wrong because encrypting the EBS volumes is not sufficient for RDS; the database must be encrypted.

Option D is wrong because SSL/TLS provides encryption in transit, which is different from encryption at rest.

Full explanation →

491

MCQhard

A data engineer reviews the Glue job configuration. The job fails when processing large datasets. The error message indicates out-of-memory in the executors. Which change to the job configuration will most directly address this issue?

A.Change the worker type from Standard to G.2X.

B.Increase the timeout from 30 to 60 minutes.

C.Increase the number of workers from 5 to 10.

D.Set MaxRetries to 3.

AnswerA

G.2X workers provide more memory per executor than Standard workers, directly addressing OOM.

Why this answer

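The job uses 5 workers of the Standard type. An executor out-of-memory error calls for more memory per executor, which the G.1X and G.2X worker types provide. Among the options, changing the worker type to G.2X (Option A) raises memory directly. Adding workers (Option C) spreads the data across more executors but leaves each executor's memory unchanged, and a longer timeout (Option B) or more retries (Option D) only repeats the same failure.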

Full explanation →

492

MCQmedium

A company is using an Amazon RDS for MySQL database for its e-commerce platform. During a recent flash sale, the database experienced high read traffic, causing slow query performance. The company needs a solution that offloads read traffic with minimal application changes. Which action should be taken?

A.Enable DynamoDB Accelerator (DAX) on the RDS instance.

B.Migrate the database to Amazon Aurora and enable Aurora Global Database.

C.Implement Amazon ElastiCache for Redis to cache database queries.

D.Create an Amazon RDS read replica in the same region.

AnswerD

Read replicas offload read traffic from the primary instance with minimal application changes.

Why this answer

Creating an Amazon RDS read replica in the same region offloads read traffic from the primary DB instance by directing read queries to a read-only copy. This requires minimal application changes—only modifying the database connection string to point read queries to the replica endpoint. RDS read replicas use MySQL's native asynchronous replication, making them ideal for scaling read-heavy workloads like flash sales.

Exam trap

The trap here is that candidates may choose ElastiCache (Option C) because it is a caching solution, but they overlook the explicit requirement for minimal application changes, which caching typically does not satisfy without code modifications.

How to eliminate wrong answers

Option A is wrong because DynamoDB Accelerator (DAX) is an in-memory cache for Amazon DynamoDB, not for RDS for MySQL; it cannot be enabled on an RDS instance. Option B is wrong because migrating to Aurora and enabling Aurora Global Database is designed for cross-region disaster recovery and global reads, not for offloading read traffic within a single region, and it requires significant application and migration effort. Option C is wrong because while ElastiCache for Redis can cache query results, it requires application code changes to implement caching logic (e.g., cache-aside pattern), which contradicts the requirement for minimal application changes.

Full explanation →

493

Multi-Selectmedium

Which TWO actions can help optimize Amazon S3 storage costs for a data lake? (Choose two.)

Select 2 answers

A.Enable S3 Replication to another region

B.Use S3 Intelligent-Tiering for unpredictable access patterns

C.Use S3 Select to retrieve only needed data

D.Enable S3 Transfer Acceleration

E.Implement S3 Lifecycle policies to transition objects to Glacier

AnswersB, E

Intelligent-Tiering automatically optimizes costs based on access patterns.

Why this answer

S3 Intelligent-Tiering automatically moves objects between two access tiers (frequent and infrequent) when access patterns change, with no retrieval fees and a small monthly monitoring fee. This is ideal for a data lake where access patterns are unpredictable, as it optimizes costs without requiring manual lifecycle rule adjustments.

Exam trap

The trap here is that candidates confuse cost optimization for storage (reducing stored data cost) with cost optimization for data transfer or retrieval, leading them to select options like S3 Select or Transfer Acceleration that address different cost dimensions.
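For the lifecycle option, a rule that transitions objects to Glacier can be sketched as a plain configuration dictionary. The `raw/` prefix, the 90-day threshold, and the rule ID are hypothetical values chosen for illustration.

```python
import json


def glacier_lifecycle_config(prefix: str, days: int, rule_id: str) -> dict:
    """Build a lifecycle configuration with a single Glacier transition rule."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},  # apply only to keys under this prefix
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


# Hypothetical values: archive objects under raw/ after 90 days.
config = glacier_lifecycle_config("raw/", 90, "archive-raw-after-90-days")
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, a dictionary of this shape is what `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)` expects.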

Full explanation →

494

MCQmedium

A company uses Amazon RDS for MySQL to store financial data. A compliance requirement mandates that all database connections must be encrypted. Which configuration step is necessary?

A.Set the RDS parameter enforce_ssl to 1.

B.Create the RDS DB instance in a private subnet.

C.Enable encryption for the RDS DB instance at creation time.

D.Configure the VPC security group to only allow traffic from certain IPs.

AnswerC

Encryption at rest and in transit can be enabled at creation; SSL/TLS can be enforced via parameter group.

Why this answer

Option C is correct because enabling encryption for RDS instances requires using a DB instance that supports encryption and enabling it at creation. Option D is wrong because VPC security groups control network access, not encryption. Option B is wrong because RDS does not support enforcing encryption at the subnet level.

Option A is wrong because RDS for MySQL does not have a parameter named enforce_ssl; TLS enforcement is configured in the DB parameter group (for MySQL, through `require_secure_transport`).

Full explanation →

495

Multi-Selecteasy

A company wants to ingest streaming data from social media feeds into AWS for real-time analytics. Which TWO services can directly ingest streaming data without writing custom code? (Choose TWO.)

Select 2 answers

A.AWS Glue

B.Amazon AppFlow

C.Amazon Kinesis Data Firehose

D.Amazon Kinesis Data Streams

E.Amazon S3 Transfer Acceleration

AnswersB, C

AppFlow can ingest data from SaaS applications (including social media) directly into AWS.

Why this answer

Amazon Kinesis Data Firehose can directly ingest streaming data and deliver to destinations. AWS Glue can stream from Kafka but not directly ingest from social media without custom connectors. Kinesis Data Streams requires producers to send data, not direct ingestion.

AppFlow can ingest from SaaS applications including social media.

Full explanation →

496

MCQeasy

A company uses Amazon S3 to store customer documents. The data engineer needs to ensure that all objects uploaded to a specific S3 bucket are automatically encrypted with a customer-managed AWS KMS key. What should the data engineer do?

A.Use pre-signed URLs for all uploads that include encryption parameters.

B.Create a bucket policy that denies uploads without encryption.

C.Enable S3 Versioning on the bucket.

D.Set default encryption on the bucket to use SSE-KMS with the customer-managed key.

AnswerD

Default encryption automatically encrypts all objects with the specified KMS key.

Why this answer

Setting a default encryption policy on the bucket with SSE-KMS ensures all objects are encrypted with the specified KMS key. Option B is wrong because bucket policies can enforce encryption but don't automatically encrypt. Option C is wrong because enabling S3 Versioning doesn't enforce encryption.

Option A is wrong because pre-signed URLs don't enforce encryption.
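The default encryption setting can be sketched as the configuration dictionary below. The key ARN is a placeholder, and turning on the S3 Bucket Key is an optional choice made here for illustration, not something the question requires.

```python
def default_sse_kms_config(kms_key_arn: str) -> dict:
    """Build a bucket default-encryption configuration that uses SSE-KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                # Optional: an S3 Bucket Key reduces the number of KMS requests.
                "BucketKeyEnabled": True,
            }
        ]
    }


# Placeholder ARN, not a real key.
encryption_config = default_sse_kms_config(
    "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"
)
```

With boto3, this shape is passed as `ServerSideEncryptionConfiguration` to `put_bucket_encryption`.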

Full explanation →

497

Multi-Selecthard

A company is migrating its on-premises data warehouse to Amazon Redshift. The data includes tables with up to 100 columns and 500 million rows. The migration involves a full load followed by incremental updates. The company needs to minimize downtime during the final cutover. Which THREE strategies should the data engineer use to facilitate the migration? (Choose THREE.)

Select 3 answers

A.Increase the number of WLM queues to allow more concurrent loads.

B.Use the COPY command to load data from Amazon S3.

C.Use columnar format (e.g., Parquet) for the data files in S3.

D.Run VACUUM and ANALYZE commands after loading the data.

E.Disable distribution keys on the target tables to simplify loading.

AnswersB, C, D

COPY is optimized for bulk data loading into Redshift.

Why this answer

Option B is correct because using the COPY command with S3 is the most efficient way to load large datasets into Redshift. Option C is correct because using a columnar format like Parquet speeds up data transfer and reduces costs. Option D is correct because running VACUUM and ANALYZE after loading optimizes table storage and query performance.

Option A is wrong because increasing WLM concurrency does not speed up data loading. Option E is wrong because disabling distribution keys results in inefficient data distribution, leading to performance issues.

Full explanation →

498

MCQmedium

A data engineer is designing a data pipeline that processes sensitive personal data. The data is ingested via Amazon Kinesis Data Firehose and stored in Amazon S3. The pipeline must ensure that the data is encrypted at rest and in transit. The engineer also needs to audit access to the data. Which combination of services meets these requirements?

A.AWS KMS for encryption at rest, Kinesis Data Analytics for in-transit encryption, and AWS CloudTrail for auditing.

B.AWS KMS for encryption at rest, Amazon CloudWatch Logs for auditing, and TLS for in-transit encryption.

C.S3 server-side encryption (SSE-S3) for at-rest encryption, HTTPS for in-transit encryption, and AWS CloudTrail for auditing.

D.S3 client-side encryption, AWS Config for auditing, and TLS for in-transit encryption.

AnswerC

SSE-S3 encrypts objects at rest, HTTPS encrypts data in transit, and CloudTrail logs S3 API operations for auditing.

Why this answer

Option C is correct because SSE-S3 provides encryption at rest, HTTPS ensures encryption in transit, and CloudTrail logs S3 API calls for auditing. Option B is incorrect because CloudWatch Logs is for monitoring, not auditing data access. Option D is incorrect because AWS Config tracks configuration, not data access.

Option A is incorrect because Kinesis Data Analytics is for processing, not encryption.

Full explanation →

499

MCQeasy

A company uses AWS Lambda to process events from an S3 bucket. The Lambda function writes transformed data to another S3 bucket. Occasionally, the Lambda invocation fails with 'ResourceNotFoundException'. What is the MOST likely cause?

A.The Lambda function timed out.

B.The destination S3 bucket does not exist or the Lambda function's IAM role lacks permissions.

C.The S3 event notification is misconfigured.

D.The source S3 bucket has versioning disabled.

AnswerB

ResourceNotFoundException indicates missing resource or access denial.

Why this answer

Option B is correct. The destination bucket may not exist or the Lambda function's IAM role lacks permissions to write to it. Option A is wrong because Lambda timeouts would cause 'Timeout' error.

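Option C is wrong because a misconfigured S3 event notification would keep the function from being invoked at all, rather than cause an occasional failure inside it. Option D is wrong because versioning on the source bucket is unrelated to this error; the source bucket exists and delivered the event.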

Full explanation →

500

MCQmedium

Refer to the exhibit. A data engineer runs the AWS CLI command and gets the output shown. The engineer wants to grant a data analyst read-only access to the 'sales_db' database in AWS Glue Data Catalog using IAM. Which IAM policy statement is required?

A.{"Effect": "Allow", "Action": "glue:GetTable", "Resource": "arn:aws:glue:us-east-1:123456789012:table/sales_db/*"}

B.{"Effect": "Allow", "Action": "glue:GetDatabase", "Resource": "arn:aws:glue:us-east-1:123456789012:database/sales_db"}

C.{"Effect": "Allow", "Action": "glue:GetDatabases", "Resource": "*"}

D.{"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::data-lake-sales/*"}

AnswerB

Grants read access to the specific database.

Why this answer

Option B is correct because to read a database, the user needs glue:GetDatabase. Option C is wrong because glue:GetDatabases is for listing databases. Option D is wrong because s3:GetObject is for data access.

Option A is wrong because glue:GetTable is for tables, not databases.

Full explanation →

501

Multi-Selectmedium

A data engineer needs to encrypt data at rest in Amazon S3 using server-side encryption with a customer-managed KMS key. Which TWO steps are required to ensure that the KMS key can be used for S3 object encryption?

Select 2 answers

A.Configure a VPC endpoint for KMS to allow S3 to access the key.

B.Set the S3 bucket policy to require SSE-KMS for all PutObject requests.

C.Add a statement in the KMS key policy that allows the S3 service to use the key.

D.Grant the IAM role that writes objects the kms:GenerateDataKey and kms:Decrypt permissions.

E.Create a service-linked role for S3 to access KMS.

AnswersC, D

The key policy must allow S3 to call GenerateDataKey and Decrypt.

Why this answer

Options C and D are correct. The KMS key policy must allow the key to be used for S3 encryption, and the IAM role that writes objects needs the kms:GenerateDataKey and kms:Decrypt permissions. Option A is not required because S3 does not use KMS via VPC.

Option E is not required because S3 does not need to assume a role for SSE-KMS. Option B is not required because a bucket policy that demands SSE-KMS enforces encryption on uploads but is not what makes the key usable.

Full explanation →

502

MCQhard

A data engineer is troubleshooting an Amazon Redshift cluster that is running out of disk space. The engineer runs STV_PARTITIONS and notices that some slices have significantly more data than others. What is the most likely cause and solution?

A.Poorly chosen sort keys; redefine sort keys

B.Data distribution skew due to uneven distribution style; change distribution style to EVEN or correct KEY

C.Some nodes are underutilized; add more nodes

D.Concurrency scaling is disabled; enable concurrency scaling

AnswerB

Uneven distribution can cause some slices to fill up faster.

Why this answer

B is correct because STV_PARTITIONS shows per-slice disk usage, and significant variation indicates data distribution skew. Uneven distribution causes some slices to fill faster, leading to premature disk-full errors. Changing the distribution style to EVEN (for tables without join keys) or correcting the KEY distribution style (using a high-cardinality, evenly distributed column) rebalances data across slices.

Exam trap

The trap here is that candidates confuse sort keys (which improve query performance via zone maps) with distribution keys (which control data placement across slices), leading them to incorrectly select sort key redefinition as the fix for disk space skew.

How to eliminate wrong answers

Option A is wrong because sort keys affect query performance (min/max zone maps and block pruning), not how data is distributed across slices; disk space skew is a distribution issue, not a sort key issue. Option C is wrong because adding nodes increases total cluster capacity but does not fix existing data skew; the problem is uneven data placement, not insufficient total nodes. Option D is wrong because concurrency scaling handles workload bursts by adding transient compute capacity, not disk space; it does not affect how data is stored on existing slices.
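One way to put a number on the skew is the ratio of the fullest slice to the average slice. The sketch below uses made-up per-slice figures; it does not query Redshift and does not mirror the actual STV_PARTITIONS column names.

```python
def skew_ratio(used_by_slice: dict) -> float:
    """Return max slice usage divided by average slice usage (1.0 means even)."""
    values = list(used_by_slice.values())
    average = sum(values) / len(values)
    return max(values) / average


# Hypothetical usage per slice, in arbitrary units.
usage = {0: 100, 1: 100, 2: 100, 3: 500}
print(skew_ratio(usage))  # slice 3 holds 2.5x the average
```

A ratio well above 1.0 suggests that one slice will fill up long before the cluster as a whole does, which is the symptom described in the question.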

Full explanation →

503

MCQmedium

A company uses AWS Glue to run ETL jobs that transform data from an Amazon S3 bucket (raw) to another S3 bucket (curated). The jobs run on a schedule and process data incrementally. The data engineer notices that the jobs are taking longer to complete each day, and the job metrics show that the number of DPUs (Data Processing Units) is underutilized. The engineer wants to improve job performance. What should the data engineer do?

A.Increase the number of DPUs allocated to the Glue job to enable more parallelism.

B.Switch from batch processing to streaming using AWS Glue Streaming.

C.Enable job bookmarks to skip already processed data more efficiently.

D.Decrease the number of DPUs to reduce resource contention.

AnswerA

More DPUs can speed up processing if the job is parallelizable.

Why this answer

Option A is correct because increasing the number of DPUs can reduce job duration if the workload is parallelizable. The underutilization suggests that more DPUs could be used. Option B is wrong because converting to streaming may not be appropriate for batch jobs. Option C is wrong because the issue is not about job bookmarks (which handle incremental processing). Option D is wrong because decreasing DPUs would increase processing time.

504

MCQeasy

A data engineering team needs to transform CSV files to Parquet format after they land in an S3 bucket. The transformation should be triggered automatically as soon as a new file arrives. Which AWS service is best suited for this task?

A.AWS Batch job submitted by S3 event

B.Amazon EMR cluster running continuously

C.AWS Lambda function triggered by S3 event

D.AWS Glue ETL job scheduled every 5 minutes

AnswerC

Lambda can be triggered immediately on S3 PUT events and perform the transformation.

Why this answer

Option C is correct because AWS Lambda can be triggered by S3 events and convert files to Parquet using libraries like PyArrow. Option A (AWS Batch) is for compute jobs but adds latency. Option B (Amazon EMR) is overkill. Option D (AWS Glue) is better suited to scheduled jobs than to event-driven ones.
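
As a rough illustration of the event-driven trigger, the sketch below shows how a Lambda handler might read the bucket and key from an S3 event notification and derive the output key. The `parquet_target` helper and the naming scheme are assumptions made for illustration, and the actual CSV-to-Parquet conversion is deliberately left out.

```python
from urllib.parse import unquote_plus


def parquet_target(event):
    """Return (bucket, source_key, target_key) for the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    source_key = unquote_plus(record["object"]["key"])
    stem = source_key[:-4] if source_key.lower().endswith(".csv") else source_key
    return bucket, source_key, stem + ".parquet"


def handler(event, context):
    bucket, source_key, target_key = parquet_target(event)
    # Downloading the CSV, converting it (for example with PyArrow) and
    # uploading the result are omitted from this sketch.
    return {"bucket": bucket, "source": source_key, "target": target_key}
```

Writing the output under a different prefix or bucket than the one that triggers the function avoids the function re-invoking itself.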

505

MCQmedium

Refer to the exhibit. An S3 bucket policy is shown. A data engineer using the DataEngineerRole tries to upload an object to s3://example-bucket/data/report.csv with SSE-S3 encryption. The upload fails. What is the most likely cause?

A.The resource ARN does not match the object.

B.The role does not have s3:PutObject permission.

C.The condition requires SSE-S3 encryption header, but the upload did not include it.

D.The principal is not authorized.

AnswerC

The condition requires the encryption header to be present and set to AES256.

Why this answer

Option C is correct because the condition s3:x-amz-server-side-encryption: AES256 requires SSE-S3, and the policy also requires that the encryption header be present in the request. Option A is wrong because the resource is data/*, which should match. Option B is wrong because the role has GetObject and PutObject permissions. Option D is wrong because the policy allows the role.

506

Multi-Selectmedium

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data engineer needs to transform the data before delivery. Which THREE options can be used to perform the transformation?

Select 3 answers

A.Amazon Athena queries

B.AWS Glue ETL job

C.Amazon Kinesis Data Firehose data format conversion (e.g., JSON to Parquet)

D.AWS Lambda function

E.Amazon Kinesis Data Firehose dynamic partitioning with Lambda

AnswersC, D, E

Firehose can convert data formats natively.

Why this answer

Options C, D, and E are correct. Lambda can transform records, Firehose can convert to Parquet/ORC, and Firehose can invoke Lambda for dynamic partitioning. Option A (Athena) is for querying. Option B (Glue) is not integrated with Firehose directly.
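
A Firehose transformation Lambda receives a batch of base64-encoded records and must return each one with the same `recordId` and a result status. The following is a minimal sketch of that contract; the `page` field and the upper-casing step are hypothetical stand-ins for a real transformation.

```python
import base64
import json


def handler(event, context):
    """Firehose data-transformation Lambda: upper-case a hypothetical 'page' field."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["page"] = str(payload.get("page", "")).upper()
        # Newline-delimit the JSON so downstream readers can split records.
        data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],  # must echo the incoming ID
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": data.decode("utf-8"),
        })
    return {"records": output}
```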

507

MCQhard

A data engineer is designing a real-time analytics pipeline for clickstream data. The source is Amazon Kinesis Data Streams, and the data must be stored in Amazon S3 in partitioned Parquet format with near-real-time latency. The engineer must also handle late-arriving data (up to 1 hour). Which combination of services meets these requirements?

A.Use AWS Glue streaming ETL to write to S3 in real-time.

B.Use Kinesis Data Analytics with a tumbling window to write to S3.

C.Use Kinesis Data Firehose with a Lambda transformation to write to S3, and a separate Lambda consumer to reprocess late data from the stream.

D.Use Kinesis Data Firehose to deliver to S3, and use S3 Batch Operations to process late data.

AnswerC

Firehose handles real-time delivery with partitioning; Lambda can reprocess late records.

Why this answer

Option C is correct because Kinesis Data Firehose can buffer data and deliver to S3 with custom partitioning, and a separate Lambda function can reprocess late data from the stream. Option A is wrong because Glue streaming ETL lacks built-in support for late data handling with Firehose partitioning. Option B is wrong because Kinesis Data Analytics does not partition data by time. Option D is wrong because S3 Batch Operations are for batch, not streaming.

508

Multi-Selecthard

Which THREE factors should be considered when selecting a data ingestion service for a high-volume, real-time streaming pipeline that requires exactly-once processing semantics? (Choose 3.)

Select 3 answers

A.Ability to replay records from a checkpoint

B.Support for idempotent record processing

C.Integration with Amazon S3 for checkpoint storage

D.Support for schema evolution

E.Ability to transform data in-flight

AnswersA, B, C

Replay allows recovery without duplication.

Why this answer

Exactly-once processing requires services that support idempotent writes and checkpoints. Kinesis Data Streams supports record replay and checkpointing. AWS Lambda can be used with Kinesis for idempotent processing. Amazon S3 provides idempotent PUT operations. Amazon DynamoDB can store checkpoints. Amazon Kinesis Data Firehose provides at-least-once delivery, not exactly-once. AWS Glue is batch-oriented.
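
The interplay of replay, checkpoints, and idempotent processing can be sketched in a few lines of Python. This is a toy model, not a Kinesis client: it assumes records carry increasing sequence numbers within one shard, and the in-memory `checkpoint` stands in for a durable checkpoint store.

```python
class IdempotentConsumer:
    """Toy consumer: replayed records at or below the checkpoint are skipped."""

    def __init__(self):
        self.checkpoint = -1  # highest sequence number fully processed
        self.totals = {}

    def process(self, records):
        for seq, key, amount in records:
            if seq <= self.checkpoint:
                continue  # already applied; a replay must not double-count
            self.totals[key] = self.totals.get(key, 0) + amount
            self.checkpoint = seq


consumer = IdempotentConsumer()
consumer.process([(1, "a", 10), (2, "b", 5)])
# After a failure, the stream is replayed from an earlier position.
consumer.process([(2, "b", 5), (3, "a", 1)])
print(consumer.totals)  # {'a': 11, 'b': 5}
```

Delivery here is still at-least-once; the effect is exactly-once only because the replayed record is recognized and ignored.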

509

MCQhard

A company stores data in Amazon S3 with server-side encryption using AWS KMS (SSE-KMS). The data engineer needs to give a third-party auditor read-only access to the encrypted objects. The auditor has an AWS account. Which strategy should be used?

A.Generate a presigned URL for each object the auditor needs to access.

B.Copy the objects to a new bucket encrypted with SSE-S3 and share that bucket.

C.Grant the auditor's IAM role permission to use the KMS key.

D.Update the S3 bucket policy to allow access from the auditor's account and update the KMS key policy to allow the auditor's account to decrypt.

AnswerD

Both policies are required for cross-account access with SSE-KMS.

Why this answer

Option D is correct because cross-account access to SSE-KMS encrypted objects requires both an S3 bucket policy allowing the auditor's account and a KMS key policy granting the auditor's account decrypt permissions. Option A is wrong because presigned URLs don't solve the cross-account KMS issue. Option B is wrong because copying objects with SSE-S3 changes encryption and may not be allowed. Option C is wrong because simply granting access to the KMS key is insufficient without S3 permissions.
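
The two policy pieces can be sketched as Python dictionaries ready to be serialized to JSON. The account ID `111122223333` and the bucket name `example-data-bucket` are placeholders, and the statements are a minimal illustration rather than a complete policy.

```python
import json

AUDITOR_PRINCIPAL = "arn:aws:iam::111122223333:root"  # placeholder account ID

# Statement for the S3 bucket policy: read-only access to objects.
bucket_statement = {
    "Sid": "AuditorReadOnly",
    "Effect": "Allow",
    "Principal": {"AWS": AUDITOR_PRINCIPAL},
    "Action": ["s3:GetObject"],
    "Resource": "arn:aws:s3:::example-data-bucket/*",  # placeholder bucket
}

# Statement for the KMS key policy: allow decryption with this key.
key_statement = {
    "Sid": "AuditorDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": AUDITOR_PRINCIPAL},
    "Action": ["kms:Decrypt"],
    "Resource": "*",  # in a key policy, "*" refers to the key itself
}

print(json.dumps({"Version": "2012-10-17", "Statement": [bucket_statement]}, indent=2))
print(json.dumps(key_statement, indent=2))
```

The auditor's own account still has to grant its role matching IAM permissions, since cross-account access needs to be allowed on both sides.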

510

Multi-Selectmedium

Which TWO actions can improve the performance of an AWS Glue ETL job that processes large datasets in Amazon S3? (Choose two.)

Select 2 answers

A.Increase the frequency of the Glue crawler.

B.Use a single Availability Zone for the S3 bucket.

C.Increase the number of DPUs allocated to the job.

D.Use columnar file formats like Parquet or ORC.

E.Use a single large file instead of many small files.

AnswersC, D

More DPUs increase parallelism and memory.

Why this answer

Option C (increase DPUs) and Option D (use columnar format) are correct. Option A (increase crawler frequency) does not affect ETL performance. Option B (single Availability Zone) does not improve performance. Option E (use a single file) may reduce parallelism.

511

MCQeasy

A data engineer needs to ensure that an Amazon Redshift cluster encrypts data at rest using a customer-managed AWS KMS key. Which configuration step is required?

A.Create a new cluster and select the default AWS managed key for encryption.

B.Create a new cluster and specify a customer-managed KMS key for encryption.

C.Use AWS CloudHSM to generate a key and attach it to the cluster.

D.Enable encryption on the existing cluster by modifying the cluster configuration.

AnswerB

Encryption must be set at cluster creation with a KMS key.

Why this answer

Option B is correct because Redshift supports encryption with KMS keys, but encryption cannot be enabled on an existing unencrypted cluster; a new encrypted cluster must be created. Option A is wrong because the default KMS key is not customer-managed. Option C is wrong because CloudHSM is for hardware-based key storage, not KMS integration. Option D is wrong because enabling encryption on an existing cluster is not supported.

512

Multi-Selecthard

A company runs an Amazon Redshift cluster for data warehousing. The data engineering team notices that the 'Amazon Redshift Data API' is timing out when executing long-running queries. The queries typically take more than 10 minutes to complete. The team wants to ensure that the queries can complete without timeout and that the results are retrievable. Which TWO steps should the team take? (Choose TWO.)

Select 2 answers

A.Set the 'QueryExecutionTimeout' parameter in the Data API call to 30 minutes.

B.Increase the 'timeout' parameter in the Redshift cluster configuration.

C.Use the 'GetStatementResult' operation to retrieve results after the query completes.

D.Set the 'max_execution_time' parameter in the Redshift parameter group to 30 minutes.

E.Use the 'StatementName' parameter to run the query asynchronously and poll for completion.

AnswersC, E

This is the correct way to get results after the statement finishes.

Why this answer

Option E is correct because the Data API has a timeout of 10 minutes for a single call; using the 'StatementName' parameter allows you to check the status of the query asynchronously, and the query continues to run even if the API call times out. Option C is correct because you can retrieve the results using the 'GetStatementResult' operation after the statement has completed. Option A is wrong because the Data API does not have a 'QueryExecutionTimeout' parameter. Option B is wrong because increasing the query timeout in the cluster does not affect the Data API timeout. Option D is wrong because Redshift does not have a 'max_execution_time' parameter; the relevant parameter is 'statement_timeout'.
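
The submit, poll, and fetch pattern might look like the sketch below. It assumes `client` is a boto3 `redshift-data` client (for example `boto3.client("redshift-data")`); the `run_and_fetch` helper, the statement name, and the polling interval are illustrative choices, not part of the API.

```python
import time


def run_and_fetch(client, sql, cluster_id, database, db_user, poll_seconds=5.0):
    """Submit a statement, poll until it finishes, then page through the results."""
    submitted = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
        StatementName="long-running-report",  # hypothetical label for the query
    )
    statement_id = submitted["Id"]

    # Poll the statement status instead of holding one call open.
    while True:
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"statement {statement_id} ended as {status}: {desc.get('Error')}")
        time.sleep(poll_seconds)

    # Results are paginated; follow NextToken until it is absent.
    records, token = [], None
    while True:
        kwargs = {"Id": statement_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_statement_result(**kwargs)
        records.extend(page["Records"])
        token = page.get("NextToken")
        if not token:
            return records
```

Passing the client in as a parameter keeps the function easy to exercise with a stub in tests.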

513

Multi-Selecteasy

A data engineer needs to audit data access in Amazon S3 for compliance. Which TWO services can be used to capture and analyze S3 access logs? (Choose TWO.)

Select 2 answers

A.Amazon CloudWatch Logs

B.AWS CloudTrail

C.S3 server access logs

D.Amazon Macie

E.AWS Config

AnswersB, C

Records S3 API calls.

Why this answer

Options B and C are correct. S3 server access logs capture detailed records of the requests made to a bucket, and AWS CloudTrail records API calls. Option A is wrong because Amazon CloudWatch is for monitoring metrics. Option D is wrong because Amazon Macie is for sensitive data discovery. Option E is wrong because AWS Config is for resource configuration.

514

MCQmedium

A data engineer is responsible for ingesting log files from a fleet of on-premises servers into Amazon S3 for central analysis. Each server generates log files that are rotated every hour, resulting in files of about 500 MB each. The total daily data volume is approximately 1 TB. The network connection between the on-premises data center and AWS is a 100 Mbps VPN. The engineer needs to ensure that all log files are transferred to S3 within 24 hours of generation without data loss. The engineer is considering using AWS DataSync. However, the initial setup shows that the transfer speed is insufficient to meet the 24-hour SLA. What should the engineer do to meet the requirement?

A.Use AWS CLI with multipart uploads and parallel threads to maximize throughput.

B.Contact the network team to upgrade the VPN bandwidth to at least 1 Gbps.

C.Order an AWS Snowball Edge device to transfer the initial data and then use DataSync for incremental changes.

D.Configure AWS DataSync to run on a schedule with incremental transfers and enable data compression.

AnswerD

Incremental transfers reduce the amount of data transferred each day; compression further reduces size, meeting the SLA.

Why this answer

Option D is correct because AWS DataSync supports scheduled incremental tasks; after the initial full sync, only new and changed files are transferred, which will fit within the bandwidth. Option A is wrong because parallel multipart uploads cannot push more data than the link carries, so they do not solve the network bottleneck. Option B is wrong because increasing bandwidth may not be feasible or cost-effective. Option C is wrong because ordering a Snowball Edge would take days to arrive and is not suitable for ongoing daily transfers.
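
The bandwidth constraint in this scenario can be checked with quick arithmetic. The sketch below uses the figures from the question (1 TB per day, 100 Mbps); the 0.8 efficiency factor is a hypothetical allowance for protocol overhead and link sharing, not a measured value.

```python
LINK_MBPS = 100
DAILY_BYTES = 1e12  # 1 TB, decimal


def hours_to_transfer(num_bytes, link_mbps, efficiency=1.0):
    """Hours needed to move num_bytes over a link of link_mbps megabits per second."""
    bytes_per_second = link_mbps * 1e6 / 8 * efficiency
    return num_bytes / bytes_per_second / 3600


print(round(hours_to_transfer(DAILY_BYTES, LINK_MBPS), 1))       # 22.2
print(round(hours_to_transfer(DAILY_BYTES, LINK_MBPS, 0.8), 1))  # 27.8
```

At full line rate the daily volume fits with under two hours to spare, so anything that shrinks the bytes on the wire, such as compression, is what keeps the transfer inside the 24-hour window.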

515

Multi-Selecthard

A data engineer is designing a data store for a real-time analytics application that requires sub-millisecond read and write latency. The data is accessed via a REST API. Which AWS services should the engineer consider? (Choose THREE.)

Select 3 answers

A.Amazon S3

B.Amazon RDS for MySQL

C.Amazon ElastiCache for Redis

D.Amazon DynamoDB

E.DynamoDB Accelerator (DAX)

AnswersC, D, E

Redis provides sub-millisecond latency.

Why this answer

Amazon ElastiCache for Redis is correct because it provides an in-memory data store that delivers sub-millisecond read and write latency, ideal for real-time analytics. Redis supports data structures like strings, hashes, and sorted sets, and can be accessed via REST API through a caching layer or directly with Redis commands. This makes it suitable for low-latency, high-throughput workloads where disk-based storage would introduce unacceptable delays.

Exam trap

The trap here is that candidates may overlook DAX as a separate service and assume DynamoDB alone provides sub-millisecond latency, but DynamoDB's base latency is typically 1-10 milliseconds for strongly consistent reads, and DAX is required to achieve sub-millisecond performance for read-heavy workloads.

516

Multi-Selectmedium

A company uses Amazon EMR to run Spark jobs on data stored in Amazon S3. The data engineer notices that the jobs are running slower than expected. The engineer suspects that the S3 storage class might be affecting performance. Which THREE factors can impact read performance from S3? (Choose three.)

Select 3 answers

A.Use of S3 Transfer Acceleration.

B.Use of S3 Select to retrieve only a subset of data.

C.Use of S3 Object Lock.

D.Average object size in S3 bucket.

E.Data stored in compressed format (e.g., GZIP, Snappy).

AnswersB, D, E

S3 Select reduces the amount of data transferred.

Why this answer

Options B, D, and E are correct. B: S3 Select can reduce data scanned. D: Larger object sizes improve throughput due to parallel requests. E: Data compression reduces network transfer. Option A is wrong because S3 Transfer Acceleration improves upload speed, not read performance. Option C is wrong because S3 Object Lock does not affect read performance.

517

MCQmedium

A financial services company uses Amazon Redshift for its data warehouse. The compliance team requires that all access to the database be logged, including the SQL queries executed, and that the logs be stored in a separate S3 bucket that is encrypted with a customer-managed KMS key. Additionally, the logs must be retained for 7 years. The data engineer has enabled audit logging on the Redshift cluster and configured it to deliver logs to an S3 bucket. However, the compliance team reports that the logs are not being delivered. The S3 bucket policy allows the Redshift service to write logs. What is the most likely reason for the failure?

A.The S3 bucket is in a different region than the Redshift cluster.

B.The S3 bucket has versioning enabled, which blocks log delivery.

C.The KMS key policy does not grant the Redshift service principal decrypt permissions.

D.The S3 bucket policy does not include a statement allowing the Redshift service principal to write objects.

AnswerD

Redshift requires explicit bucket policy for audit logging.

Why this answer

Option D is correct. Redshift audit logging requires that the S3 bucket have a specific bucket policy that allows the Redshift service principal to write logs. Without that policy, delivery fails. Option A is wrong because cross-region delivery is supported. Option B is wrong because versioning does not block log delivery. Option C is wrong because KMS key permissions are not the issue if the bucket policy is correct.

518

MCQmedium

A company stores sensitive data in an S3 bucket. To meet compliance requirements, they must ensure that all objects are encrypted at rest using server-side encryption with AWS KMS. Which bucket policy statement should be applied to deny uploads that do not use the required encryption?

A.{"Effect":"Deny","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::bucketname/*","Condition":{"StringNotEquals":{"s3:x-amz-server-side-encryption":"aws:kms"}}}

B.{"Effect":"Deny","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::bucketname/*","Condition":{"StringNotEquals":{"s3:x-amz-server-side-encryption":"AES256"}}}

C.{"Effect":"Deny","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::bucketname/*","Condition":{"StringNotEquals":{"s3:x-amz-server-side-encryption-aws-kms-key-id":"arn:aws:kms:us-east-1:123456789012:key/abc123"}}}

D.{"Effect":"Deny","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::bucketname/*","Condition":{"Null":{"s3:x-amz-server-side-encryption":"true"}}}

AnswerA

Correctly denies PutObject if the encryption is not SSE-KMS.

Why this answer

Option A is correct because the StringNotEquals condition on 's3:x-amz-server-side-encryption' with the value 'aws:kms' denies every PutObject request that is not encrypted with SSE-KMS. Option B compares against AES256, which is SSE-S3, not KMS. Option C checks the KMS key ID header rather than the encryption type. Option D only denies requests that omit the encryption header entirely, so an SSE-S3 upload would still be accepted.
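
How the StringNotEquals condition behaves can be modelled in a few lines of Python. This is a simplified model of a single condition, not the IAM policy engine; it relies on the documented behavior that a negated operator such as StringNotEquals also matches when the key is absent from the request. The `put_is_denied` helper is hypothetical.

```python
def string_not_equals(request_value, expected):
    """Model of StringNotEquals: true when the key is absent or differs."""
    return request_value is None or request_value != expected


def put_is_denied(headers):
    """True if a Deny on StringNotEquals 'aws:kms' would match this PutObject."""
    value = headers.get("x-amz-server-side-encryption")
    return string_not_equals(value, "aws:kms")


print(put_is_denied({"x-amz-server-side-encryption": "aws:kms"}))  # False
print(put_is_denied({"x-amz-server-side-encryption": "AES256"}))   # True
print(put_is_denied({}))                                           # True
```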

519

Multi-Selectmedium

A company is designing a data ingestion pipeline for real-time clickstream data. The data must be ingested with low latency (< 1 second) and then processed for real-time analytics. The processed data should be stored in Amazon S3 for batch analytics. Which THREE services should be used together?

Select 3 answers

A.Amazon Managed Streaming for Apache Kafka (MSK)

B.Amazon Kinesis Data Analytics

C.Amazon Kinesis Data Firehose

D.Amazon Kinesis Data Streams

E.AWS Glue ETL job

AnswersB, C, D

Performs real-time processing and analytics on streaming data.

Why this answer

Options B, C, and D are correct. Kinesis Data Streams ingests data with low latency. Kinesis Data Analytics processes streaming data in real-time. Kinesis Data Firehose delivers processed data to S3. Option A (Amazon MSK) is an alternative but not necessary if Kinesis is used. Option E (AWS Glue) is batch-oriented.

520

MCQmedium

A data engineer needs to encrypt data at rest in an Amazon Redshift cluster. The company requires that the encryption key be managed by the customer and rotated annually. Which solution meets these requirements?

A.Use S3 server-side encryption with customer-provided keys (SSE-C).

B.Use AWS Secrets Manager to store the encryption key and configure Redshift to reference it.

C.Use AWS KMS with automatic key rotation enabled.

D.Use AWS CloudHSM to create and manage the encryption key, and configure Redshift to use it.

AnswerD

CloudHSM provides customer-managed HSMs with key rotation capabilities.

Why this answer

Option D is correct because AWS CloudHSM provides dedicated hardware security modules for key management, allowing customer-managed keys with rotation. Option A is incorrect because S3 server-side encryption is not applicable to Redshift. Option B is incorrect because Redshift cannot use Secrets Manager for encryption keys. Option C is incorrect because AWS KMS with automatic rotation meets the rotation requirement but the key is AWS-managed unless using customer-managed KMS keys; however, CloudHSM is more appropriate for dedicated control.

521

Multi-Selectmedium

A company is using Amazon S3 to store sensitive data. They need to ensure that all objects are encrypted at rest. Which combination of actions should be taken? (Choose TWO.)

Select 2 answers

A.Enable S3 Versioning on the bucket.

B.Enable MFA Delete on the bucket.

C.Configure S3 Access Points with network policies.

D.Use a bucket policy to deny PutObject requests that do not include the x-amz-server-side-encryption header.

E.Enable default encryption on the S3 bucket.

AnswersD, E

Policy enforces encryption at upload time.

Why this answer

Option D is correct because a bucket policy that denies PutObject requests lacking the `x-amz-server-side-encryption` header enforces encryption at the time of upload, ensuring that any object written without explicit encryption headers is rejected. Option E is correct because enabling default encryption on the S3 bucket automatically applies server-side encryption (SSE-S3 or SSE-KMS) to any object uploaded without specifying encryption headers, providing a fallback that covers all objects. Together, these actions ensure that every object stored in the bucket is encrypted at rest, either by explicit client request or by default bucket settings.

Exam trap

The trap here is that candidates often confuse data protection features like Versioning or MFA Delete with encryption controls, or assume that network policies (Access Points) somehow enforce encryption, when in reality only explicit bucket policies and default encryption settings directly ensure objects are encrypted at rest.

522

MCQhard

A data pipeline uses Amazon Kinesis Data Streams with enhanced fan-out consumers. The team notices that one consumer falls behind and data accumulates. Which action will help this consumer catch up without affecting other consumers?

A.Increase the retention period of the stream.

B.Register an enhanced fan-out consumer for the slow consumer.

C.Increase the number of shards in the stream.

D.Use a Lambda consumer instead of an enhanced fan-out consumer.

AnswerB

Enhanced fan-out provides dedicated throughput per consumer, allowing the slow consumer to catch up without impacting others.

Why this answer

Registering a new enhanced fan-out consumer with its own dedicated read throughput allows it to catch up independently. Increasing shards affects all consumers, and increasing the retention period keeps records available longer but doesn't increase throughput.

523

MCQeasy

A data engineer needs to store semi-structured JSON logs from multiple sources in a centralized data store for querying using SQL. The logs are immutable and need to be retained for 90 days. Which AWS service should be used?

A.Amazon RDS for MySQL.

B.Amazon DynamoDB.

C.Amazon S3 with Amazon Athena.

D.Amazon ElastiCache for Redis.

AnswerC

S3 stores JSON logs, Athena enables SQL queries.

Why this answer

Amazon S3 with Amazon Athena is the correct choice because S3 provides durable, cost-effective storage for immutable semi-structured JSON logs, and Athena enables serverless SQL querying directly against the data in S3 without needing to load or transform it. This combination meets the 90-day retention requirement and supports querying semi-structured data using standard SQL via Athena's built-in JSON SerDe.

Exam trap

The trap here is that candidates may choose DynamoDB for its JSON support and querying flexibility, overlooking that it is not designed for cost-effective long-term retention of immutable logs and lacks native SQL querying, while S3 with Athena directly addresses both requirements.

How to eliminate wrong answers

Option A is wrong because Amazon RDS for MySQL is a relational database designed for structured data with predefined schemas, not optimized for storing large volumes of immutable semi-structured JSON logs, and it incurs higher costs for long-term retention. Option B is wrong because Amazon DynamoDB is a NoSQL key-value and document database that can store JSON, but it is not cost-effective for 90-day retention of immutable logs due to per-request pricing and storage costs, and it does not support ad hoc analytical SQL; PartiQL offers SQL-compatible syntax but remains bound to DynamoDB's key-based access patterns. Option D is wrong because Amazon ElastiCache for Redis is an in-memory cache designed for low-latency access to transient data, not for durable, long-term storage of immutable logs, and it does not support SQL querying.

524

Multi-Selectmedium

A company is designing a data lake on Amazon S3. The security team requires granular access control based on data classifications. Which TWO AWS services can be used together to implement attribute-based access control (ABAC) for objects in S3?

Select 2 answers

A.AWS Secrets Manager

B.AWS Lake Formation

C.Amazon S3 object tags

D.AWS Identity and Access Management (IAM)

E.AWS Key Management Service (KMS)

AnswersC, D

Object tags are used in IAM policy conditions to enable ABAC.

Why this answer

IAM policies support ABAC using tags. S3 object tags can be used as condition keys in IAM policies to control access based on object tags. Lake Formation also supports ABAC for data lake permissions.

KMS and Secrets Manager are not for access control.
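
An ABAC-style IAM policy statement that compares an object tag with a tag on the caller might look like the sketch below. The `classification` tag key and the bucket name are hypothetical; `s3:ExistingObjectTag/<key>` and `aws:PrincipalTag/<key>` are the condition keys being illustrated.

```python
import json

# Allow reads only when the object's tag matches the caller's principal tag.
abac_statement = {
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::example-data-lake/*",  # placeholder bucket
    "Condition": {
        "StringEquals": {
            "s3:ExistingObjectTag/classification": "${aws:PrincipalTag/classification}"
        }
    },
}

policy = {"Version": "2012-10-17", "Statement": [abac_statement]}
print(json.dumps(policy, indent=2))
```

With this shape, access changes by retagging objects or principals rather than by editing the policy.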

525

MCQmedium

A company has an Amazon S3 bucket with versioning enabled. They want to automatically delete noncurrent versions of objects after 30 days. Which lifecycle rule action should be used?

A.Expiration

B.NoncurrentVersionExpiration

C.NoncurrentVersionTransition

D.AbortIncompleteMultipartUpload

AnswerB

This action deletes noncurrent versions after a specified number of days.

Why this answer

The NoncurrentVersionExpiration lifecycle action is specifically designed to remove noncurrent object versions after a specified number of days. Since versioning is enabled and the requirement is to delete noncurrent versions after 30 days, this action directly meets the goal without affecting current versions or other lifecycle aspects.

Exam trap

The trap here is confusing NoncurrentVersionExpiration with Expiration, as candidates often mistakenly apply the standard Expiration action to delete old versions, not realizing it only affects the current version.

How to eliminate wrong answers

Option A is wrong because Expiration deletes the current version of an object (or marks it for deletion in non-versioned buckets), not noncurrent versions. Option C is wrong because NoncurrentVersionTransition moves noncurrent versions to a different storage class (e.g., S3 Glacier), but does not delete them. Option D is wrong because AbortIncompleteMultipartUpload only aborts incomplete multipart uploads that are older than a specified number of days, and has no effect on existing object versions.
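
A lifecycle configuration using this action can be sketched as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID and the empty prefix filter (meaning the whole bucket) are illustrative choices, and the API call itself is shown only as a comment.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-noncurrent-after-30-days",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix applies to every object
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}

# Applying it would look like this (requires boto3 and AWS credentials):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle_configuration
# )

rule = lifecycle_configuration["Rules"][0]
print(rule["NoncurrentVersionExpiration"]["NoncurrentDays"])  # 30
```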
