AWS Certified Data Engineer Associate DEA-C01 (DEA-C01) — Questions 976–1050

1786 questions total · 24 pages · All types, answers revealed

976

MCQ · hard

Refer to the exhibit. An IAM policy is attached to an IAM user. The user is trying to upload an object to 's3://data-lake-bucket/confidential/report.pdf' using the AWS CLI. The upload fails with an AccessDenied error. What is the reason for the failure?

A. The policy does not include the 's3:PutObject' action.

B. The resource ARN in the Allow statement does not cover the specific object.

C. The user does not have permission to access the bucket at all.

D. An explicit Deny statement overrides the Allow statement for the 'confidential/' prefix.

Answer: D

An explicit Deny always takes precedence over an Allow.

Why this answer

Option D is correct because an explicit Deny overrides any Allow. The Deny statement blocks all S3 actions on the 'confidential/' prefix, even though the Allow statement grants s3:PutObject.

Option A is wrong because the Allow statement grants s3:PutObject. Option B is wrong because the resource is specified correctly. Option C is wrong because the user has permissions on other parts of the bucket.
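
The evaluation order behind this answer can be sketched in a few lines of Python. The exhibit is not reproduced on this page, so the `POLICY` below is a hypothetical reconstruction from the explanation, and `evaluate` is a simplified stand-in for IAM's real logic (no conditions, no principals, case-sensitive matching).

```python
from fnmatch import fnmatchcase

# Hypothetical policy reconstructed from the explanation; the real exhibit is not shown.
POLICY = [
    {"Effect": "Allow", "Action": "s3:PutObject",
     "Resource": "arn:aws:s3:::data-lake-bucket/*"},
    {"Effect": "Deny", "Action": "s3:*",
     "Resource": "arn:aws:s3:::data-lake-bucket/confidential/*"},
]


def evaluate(policy, action, resource):
    """Return 'Allow' or 'Deny' using a simplified IAM evaluation order."""
    matching = [
        stmt["Effect"] for stmt in policy
        if fnmatchcase(action, stmt["Action"]) and fnmatchcase(resource, stmt["Resource"])
    ]
    if "Deny" in matching:   # an explicit Deny always wins
        return "Deny"
    if "Allow" in matching:  # otherwise a matching Allow grants access
        return "Allow"
    return "Deny"            # nothing matched: implicit deny


confidential = evaluate(
    POLICY, "s3:PutObject",
    "arn:aws:s3:::data-lake-bucket/confidential/report.pdf")
public = evaluate(
    POLICY, "s3:PutObject",
    "arn:aws:s3:::data-lake-bucket/public/report.pdf")
```

Under these assumptions the upload under 'confidential/' is denied while the same action elsewhere in the bucket is allowed, which is the behaviour the question describes.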

977

MCQ · medium

A data engineer is designing a pipeline to ingest change data capture (CDC) events from an Amazon RDS for PostgreSQL database into Amazon S3. The CDC events are captured using AWS DMS. The data must be available for querying within 5 minutes of the change. Which approach meets these requirements?

A. Export the database to S3 using pg_dump and then use AWS Glue to load into S3 in Parquet format.

B. Use AWS DMS to replicate data to Amazon Redshift, then unload to S3.

C. Use AWS DMS to replicate data directly to S3 in near real-time.

D. Use AWS DMS to replicate data to an SQS queue, then process with Lambda to write to S3.

Answer: C

DMS can write CDC data to S3 with low latency.

Why this answer

Option C is correct because DMS can stream CDC data to S3 in near real time, meeting the 5-minute SLA.

Option A is wrong because exporting with pg_dump and then using Glue adds latency. Option B is wrong because Redshift is a separate service and adds complexity. Option D is wrong because an intermediate queue and Lambda function are not needed for the ingestion.

978

Multi-Select · medium

A company uses Amazon DynamoDB as a key-value store for a high-traffic application. The table has a provisioned read capacity of 10,000 RCUs and write capacity of 5,000 WCUs. The application experiences occasional throttling during peak hours. Which TWO actions can reduce throttling without changing the application code? (Choose TWO.)

Select 2 answers

A. Enable DynamoDB Auto Scaling for both read and write capacity.

B. Add a DynamoDB Accelerator (DAX) cluster for read-heavy workloads.

C. Switch to on-demand capacity mode for the table.

D. Change the read consistency model from eventually consistent to strongly consistent.

E. Use DynamoDB Global Tables to distribute the workload across regions.

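Answers: A, B

Auto Scaling adjusts capacity based on traffic patterns.

Why this answer

Enabling Auto Scaling adjusts provisioned capacity automatically as traffic changes. Adding a DAX cluster serves repeated reads from an in-memory cache, which reduces read load on the table. Global Tables replicate the table to other Regions but do not raise the capacity of the existing table.

Switching to on-demand mode can help but may increase cost. Changing reads from eventually consistent to strongly consistent makes throttling worse, because a strongly consistent read consumes twice the read capacity of an eventually consistent one.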
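
The cost of option D can be worked out with a little arithmetic. The sketch below assumes the standard metering rule (one RCU per strongly consistent read of up to 4 KB, half that for an eventually consistent read); the function name and the 6000-byte item size are illustrative only.

```python
import math


def read_capacity_units(item_size_bytes, strongly_consistent):
    """RCUs consumed by one read of an item of the given size."""
    blocks = math.ceil(item_size_bytes / 4096)  # reads are metered in 4 KB blocks
    return blocks if strongly_consistent else blocks / 2


eventual = read_capacity_units(6000, strongly_consistent=False)
strong = read_capacity_units(6000, strongly_consistent=True)
```

For the same item, `strong` is twice `eventual`, so switching the consistency model consumes the provisioned 10,000 RCUs faster instead of relieving throttling.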

979

MCQ · hard

A data engineer is optimizing an Amazon Redshift cluster that runs a nightly ETL workload. The engineer notices that the query performance degrades over the week and improves after a VACUUM operation. Which action should the engineer take to automate this maintenance and minimize impact on performance?

A. Run VACUUM manually only when performance degrades significantly.

B. Disable auto vacuum and run a manual VACUUM every night after the ETL.

C. Schedule a VACUUM command using a query scheduler like Amazon EventBridge.

D. Drop and recreate the tables weekly to avoid unsorted data.

Answer: C

Scheduling automates the maintenance task.

Why this answer

Option C is correct because automating VACUUM with a scheduled query (e.g., via Amazon EventBridge or Redshift's built-in scheduler) ensures regular maintenance without manual intervention.

Option A is wrong because running VACUUM manually is not automated. Option B is wrong because disabling auto vacuum removes automation instead of adding it. Option D is wrong because dropping and recreating tables is disruptive.

980

MCQ · hard

A company uses AWS Lake Formation to manage access to data in a data lake. A new data engineer has been granted SELECT permission on a table but receives an 'AccessDeniedException' when querying via Amazon Athena. The table is registered in Lake Formation and the data is encrypted with SSE-KMS. Which of the following is the MOST likely cause?

A. The table's resource-based policy does not include the engineer's IAM role.

B. The S3 bucket policy denies access to the engineer's IAM role.

C. The AWS Glue Data Catalog has not been granted permission to the engineer's role.

D. The IAM role used by Athena does not have kms:Decrypt permission on the KMS key.

Answer: D

Athena must decrypt data with the KMS key, and the IAM role needs kms:Decrypt permission.

Why this answer

Option D is correct because Lake Formation integrates with AWS KMS for encrypted data; the IAM role used by Athena must have kms:Decrypt permission on the KMS key.

Option A is wrong because Lake Formation permissions are not passed via a resource-based policy on the table. Option B is wrong because S3 bucket policies can block access, but the primary issue with encrypted data is KMS permissions. Option C is wrong because the AWS Glue Data Catalog does not enforce data access by default.

981

MCQ · easy

A data engineer is troubleshooting a failed AWS Glue ETL job that reads from an S3 bucket and writes to an Amazon Redshift table. The job logs show a permission error. Which IAM policy change would resolve the issue?

A. Enable encryption on the S3 bucket using AWS KMS

B. Add s3:GetObject permission to the Glue job's IAM role

C. Add redshift:DataAPI access to the Glue job's IAM role

D. Attach an IAM role with redshift:GetClusterCredentials to the Redshift cluster

Answer: C

Glue needs permission to write to Redshift via the Data API or JDBC.

Why this answer

Option C is correct because the Glue job needs permission to write to Redshift.

Option A is wrong because KMS is not mentioned in the scenario. Option B is wrong because the job already reads from S3. Option D is wrong because it changes the Redshift cluster's role, not the Glue job's role where the error occurs.

982

MCQ · easy

A data engineer attached this IAM policy to a Lambda function used to transform data in S3. The function is unable to write output to the bucket. What is the most likely reason?

A. The resource ARN is missing the bucket-level ARN.

B. The policy does not allow the s3:DeleteObject action.

C. The policy does not allow the s3:ListBucket action on the bucket.

D. The policy does not allow the s3:PutObjectAcl action.

Answer: C

The function needs s3:ListBucket on the bucket itself, not only object-level permissions.

Why this answer

The policy allows GetObject and PutObject on objects, but not the bucket-level s3:ListBucket action that is required to list keys or check whether an object exists. s3:PutObject alone does not depend on s3:ListBucket, so the failure most likely comes from a list or existence check that the function performs before it writes.

983

MCQ · easy

An IAM policy includes the above resource ARN for CloudWatch Logs. A data engineer needs to allow a Lambda function to create log streams and put logs to the log group 'my-log-group'. However, the Lambda function is failing with access denied. What is the issue?

A. The region in the ARN does not match the Lambda function's region.

B. The Lambda function does not have an execution role.

C. The ARN does not include the log-stream portion.

D. The ARN is incorrectly formatted because of the wildcard.

Answer: C

To cover log streams, the ARN should be 'arn:aws:logs:us-east-1:123456789012:log-group:my-log-group:log-stream:*'.

Why this answer

Option C is correct because the ARN in the policy does not include the log-stream portion, so it does not match the log streams that the function creates and writes to. The ARN should be 'arn:aws:logs:us-east-1:123456789012:log-group:my-log-group:log-stream:*'.

Option A is wrong because the region is correct. Option B is wrong because the permissions come from the Lambda execution role, not from the function itself. Option D is wrong because the ARN format is valid.

984

Multi-Select · hard

A data engineer is troubleshooting slow query performance on an Amazon Redshift cluster. The cluster has 10 nodes and is using automatic distribution style. The engineer suspects that data distribution is causing excessive data movement. Which steps should the engineer take to diagnose and resolve the issue? (Choose THREE.)

Select 3 answers

A. Choose appropriate distribution keys for large tables

B. Use the EXPLAIN command to analyze query plans

C. Run the VACUUM command to reclaim space

D. Query the STL_DIST and STL_BCAST system tables

E. Increase the number of nodes in the cluster

Answers: A, B, D

Proper distribution keys minimize data movement.

Why this answer

Option A is correct because choosing appropriate distribution keys for large tables ensures that data is evenly distributed across the cluster slices, minimizing the need for data redistribution during joins and aggregations. Automatic distribution style may not always select the optimal key, leading to excessive data movement and slow query performance.

Exam trap

The trap here is that candidates often confuse VACUUM (which only reorganizes data within slices) with distribution optimization, or assume scaling out nodes automatically resolves distribution-related performance issues without addressing the underlying key choice.


985

MCQ · easy

A company uses Amazon S3 to store log files from multiple applications. The logs are written in JSON format. A data engineer wants to use Amazon Athena to query these logs. The logs are stored in a bucket with the following structure: 's3://logs/app1/date=2021-01-01/'. The engineer creates an Athena table with partitions. However, when querying, Athena returns zero results for partitions that exist. The engineer has run MSCK REPAIR TABLE to add partitions. What is the most likely cause of the issue?

A. The MSCK REPAIR TABLE command failed silently.

B. The partition key name in the table definition does not match the S3 folder naming convention.

C. The log files are in JSON format and Athena does not support JSON.

D. The log files need to be copied to a different bucket in the same region.

Answer: B

The folder prefix must match the partition key name; otherwise, MSCK REPAIR TABLE cannot detect partitions.

Why this answer

Option B is correct because Athena relies on the partition metadata stored in the Glue Data Catalog. If the partition folder structure does not match the table's partition definition (e.g., the folder is named 'date=2021-01-01' but the table's partition key is named 'dt'), MSCK REPAIR TABLE will not register the partitions. The partition key name must match the folder prefix.

Option A is wrong because MSCK REPAIR TABLE does add partitions if the structure matches. Option C is wrong because Athena supports JSON format. Option D is wrong because the data is already in the S3 bucket; there is no need to copy it.
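
A quick way to spot this kind of mismatch is to compare the key names in the S3 path with the partition keys declared on the table. The helper names below are hypothetical, and the check only covers Hive-style `key=value` folder names like the one in the question.

```python
def partition_keys_in_path(s3_path):
    """Extract Hive-style partition key names (key=value segments) from an S3 path."""
    segments = s3_path.rstrip("/").split("/")
    return [seg.split("=", 1)[0] for seg in segments if "=" in seg]


def mismatched_keys(s3_path, table_partition_keys):
    """Return partition keys found in the path that the table does not declare."""
    declared = {key.lower() for key in table_partition_keys}
    return [key for key in partition_keys_in_path(s3_path) if key.lower() not in declared]


path = "s3://logs/app1/date=2021-01-01/"
problem = mismatched_keys(path, ["dt"])   # table partitioned by 'dt'
clean = mismatched_keys(path, ["date"])   # table partitioned by 'date'
```

With a table partitioned by 'dt', the folder key 'date' is reported as unmatched; with a table partitioned by 'date', nothing is reported.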

986

MCQ · hard

An application uses the 'orders' DynamoDB table with the schema and provisioned throughput shown in the exhibit. The application frequently queries by customer_id (range key) without specifying the order_id (partition key). What is the most likely impact on performance?

A. Queries will require a full table scan, consuming significant read capacity.

B. Queries will be throttled because the table does not have a global secondary index.

C. Queries will be fast because the sort key is indexed.

D. Queries will cause hot partitions on the table.

Answer: A

Without the partition key, DynamoDB has to scan the entire table.

Why this answer

Option A is correct because a DynamoDB Query must specify the partition key; without it the application has to use a Scan, which reads the whole table and consumes read capacity for every item read.

Option B is wrong because the missing index does not itself cause throttling; the scans consume capacity, and throttling follows only if they exhaust it. Option C is wrong because the sort key orders items only within a partition and cannot be used efficiently without the partition key. Option D is wrong because the issue is not hot partitions but full scans.
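
The difference between a key lookup and a scan can be illustrated with a toy in-memory model. This is not the DynamoDB API; the dictionary, the sample items, and the function names are all made up to show how many items each access pattern has to read.

```python
from collections import defaultdict

# Toy model of the 'orders' table: order_id is the partition key,
# customer_id is the sort key. The sample data is invented.
table = defaultdict(list)
for order_id, customer_id in [("o1", "c1"), ("o2", "c2"), ("o3", "c1"), ("o4", "c3")]:
    table[order_id].append({"order_id": order_id, "customer_id": customer_id})


def query(order_id):
    """Key lookup: only the items in one partition are read."""
    items = table.get(order_id, [])
    return items, len(items)


def scan_by_customer(customer_id):
    """No partition key: every item is read, then filtered."""
    items_read = 0
    matches = []
    for items in table.values():
        for item in items:
            items_read += 1
            if item["customer_id"] == customer_id:
                matches.append(item)
    return matches, items_read


_, read_by_query = query("o1")
found, read_by_scan = scan_by_customer("c1")
```

The scan returns two matching orders but reads all four items to find them, and the cost grows with the table rather than with the result size.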

987

MCQ · medium

A company is using Amazon RDS for MySQL with Multi-AZ deployment. The database size is 2 TB and the workload is read-heavy. To improve read performance, which option should be used?

A. Use Amazon ElastiCache to cache database queries

B. Increase the instance size to 16xlarge

C. Create Read Replicas in the same or different regions

D. Enable Multi-AZ on additional instances

Answer: C

Read Replicas allow offloading read traffic.

Why this answer

Option C is correct because Read Replicas offload read traffic from the primary instance.

Option A is wrong because ElastiCache is for caching, not directly for MySQL read scaling. Option B is wrong because increasing instance size helps but is less efficient than adding replicas. Option D is wrong because Multi-AZ is for high availability, not read scaling.

988

MCQ · hard

A data engineer attaches the above IAM policy to an IAM user. The user tries to download an object from my-bucket using the AWS CLI without specifying SSE headers. The object is stored with SSE-S3. Will the download succeed?

A. No, because the object is encrypted and the user does not have decrypt permission.

B. No, because the request does not include the required encryption header.

C. Yes, because the object is encrypted with SSE-S3, which uses AES256.

D. Yes, because the policy allows s3:GetObject on the bucket.

Answer: B

The condition requires the request to include x-amz-server-side-encryption: AES256.

Why this answer

Option B is correct because the Allow statement applies only when its condition is met, and the condition requires the `x-amz-server-side-encryption: AES256` header in the request. A download sent without that header does not carry the condition key, so the condition evaluates to false, the Allow does not apply, and the request is denied by default. S3 itself does not need the header to read an SSE-S3 object; the failure comes from the policy, not from the encryption.

Exam trap

The trap here is that candidates reason that SSE-S3 decryption is transparent, so any `s3:GetObject` permission suffices, and overlook that this particular Allow is conditional on a request header that a plain download does not send.

How to eliminate wrong answers

Option A is wrong because SSE-S3 keys are managed by S3 and decryption is transparent; there is no separate decrypt permission to grant. Option C is wrong because how the object is encrypted at rest is not what the condition tests; the condition looks at the header in the request, and this request has none. Option D is wrong because the policy allows `s3:GetObject` only when the condition is satisfied, so the Allow alone does not guarantee success.
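
The role of the condition can be sketched as follows. The exhibit is not shown on this page, so `ALLOW_CONDITION` is assumed from the answer text, and `string_equals` is a simplified model of how a `StringEquals` condition treats a key that is absent from the request.

```python
# Assumed from the answer text; the actual exhibit is not reproduced here.
ALLOW_CONDITION = {"s3:x-amz-server-side-encryption": "AES256"}


def string_equals(condition, request_context):
    """Simplified StringEquals: a key missing from the request never matches."""
    return all(
        request_context.get(key) == expected
        for key, expected in condition.items()
    )


def allow_statement_applies(request_headers):
    """Map request headers to condition keys, then test the Allow's condition."""
    context = {}
    if "x-amz-server-side-encryption" in request_headers:
        context["s3:x-amz-server-side-encryption"] = (
            request_headers["x-amz-server-side-encryption"]
        )
    return string_equals(ALLOW_CONDITION, context)


plain_download = allow_statement_applies({})
upload_with_header = allow_statement_applies({"x-amz-server-side-encryption": "AES256"})
```

In this model the download without headers never satisfies the condition, so the Allow does not apply, while a request that carries the header (as an upload typically would) does satisfy it.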

989

MCQ · easy

A data engineer needs to grant an IAM user read-only access to an S3 bucket named 'data-lake'. Which IAM policy statement should be used?

A. {"Effect":"Allow","Action":["s3:PutObject","s3:DeleteObject"],"Resource":"arn:aws:s3:::data-lake/*"}

B. {"Effect":"Allow","Action":"s3:*","Resource":"*"}

C. {"Effect":"Allow","Action":["s3:ListBucket","s3:GetObject"],"Resource":["arn:aws:s3:::data-lake","arn:aws:s3:::data-lake/*"]}

D. {"Effect":"Allow","Action":"s3:ListBucket","Resource":"arn:aws:s3:::data-lake"}

Answer: C

This grants read-only access to the specific bucket and its objects.

Why this answer

Option C is correct because it allows s3:ListBucket on the bucket and s3:GetObject on its objects.

Option A is wrong because it allows write actions. Option B is wrong because it allows all S3 actions on all resources. Option D is wrong because it only allows ListBucket, not GetObject.
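
The read-only policy can also be written with one statement per resource level, which makes it explicit that `s3:ListBucket` targets the bucket ARN and `s3:GetObject` targets the object ARN. This is a sketch of an equivalent layout, not the exam's wording; `read_only_policy` is a hypothetical helper.

```python
import json


def read_only_policy(bucket):
    """Build a read-only S3 policy document for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # bucket-level action on the bucket ARN
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {   # object-level action on the object ARN
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = read_only_policy("data-lake")
policy_json = json.dumps(policy, indent=2)
```

Option C folds both actions and both ARNs into a single statement, which grants the same access; splitting them is only a readability choice.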

990

MCQ · medium

A data engineer is configuring S3 bucket policies to restrict access to a specific VPC. Which condition key should be used in the bucket policy to enforce that requests originate only from the desired VPC?

A. aws:VpcSourceIp

B. aws:SourceVpc

C. aws:RequestedRegion

D. aws:SourceIp

Answer: B

aws:SourceVpc restricts requests to a specific VPC.

Why this answer

Option B is correct because aws:SourceVpc is the condition key used to restrict access to a specific VPC.

Option A is wrong because aws:VpcSourceIp matches the source IP address of a request made through a VPC endpoint, not the VPC itself. Option C is wrong because aws:RequestedRegion is for region restriction. Option D is wrong because aws:SourceIp is for IP addresses.
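
A bucket policy that uses this key usually takes the form of a Deny with `StringNotEquals`. The sketch below builds such a document; the bucket name and VPC ID are placeholders, and `vpc_only_bucket_policy` is a hypothetical helper. Note that `aws:SourceVpc` is only present when the request arrives through a VPC endpoint.

```python
import json


def vpc_only_bucket_policy(bucket, vpc_id):
    """Deny every S3 action on the bucket unless the request comes from vpc_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideVpc",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"StringNotEquals": {"aws:SourceVpc": vpc_id}},
        }],
    }


# Placeholder names for illustration only.
bucket_policy = vpc_only_bucket_policy("example-bucket", "vpc-111bbb22")
bucket_policy_json = json.dumps(bucket_policy, indent=2)
```

Because the statement is a Deny, it also blocks requests that never pass through the VPC, such as console access, so it is worth testing on a non-critical bucket first.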

991

MCQ · medium

A data pipeline ingests CSV files from an S3 bucket into a Redshift table using the COPY command. Recently, files with inconsistent column delimiters (some use pipes, others use commas) have been arriving. The pipeline must handle both delimiters without manual intervention. What is the MOST efficient solution?

A. Configure the COPY command with a fixed delimiter (e.g., comma) and manually convert files with pipes before ingestion.

B. Create an AWS Lambda function triggered by S3 events that reads the first line of each file, detects the delimiter, and runs the COPY command with the appropriate DELIMITER option.

C. Use AWS Glue to crawl the S3 bucket and automatically detect the schema and delimiter before writing to Redshift.

D. Use Amazon Athena to query the files with the OpenCSVSerDe, which automatically detects delimiters, and then write the results to Redshift.

Answer: B

Lambda provides a lightweight, event-driven solution to dynamically detect and handle delimiters.

Why this answer

Option B is correct because a Lambda function that inspects the first line and sets the DELIMITER option dynamically handles both delimiters without manual intervention, and it is efficient because it runs once per object.

Option A is wrong because it requires manual conversion of files. Option C is wrong because AWS Glue is overkill for simple delimiter detection. Option D is wrong because it still requires manual schema redefinition.
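
The core of such a Lambda function is small. The sketch below shows only the delimiter detection and the COPY statement it would produce; reading the first line from S3 and submitting the SQL to Redshift are left out, and the table name, S3 URI, and role ARN are placeholders.

```python
def detect_delimiter(first_line, candidates=("|", ",")):
    """Pick the candidate delimiter that occurs most often in the first line."""
    return max(candidates, key=first_line.count)


def build_copy_sql(table, s3_uri, iam_role_arn, delimiter):
    """Build a Redshift COPY statement for a delimited text file."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"DELIMITER '{delimiter}';"
    )


# Placeholder values for illustration only.
delimiter = detect_delimiter("id|name|amount")
sql = build_copy_sql(
    "sales",
    "s3://example-bucket/incoming/file1.csv",
    "arn:aws:iam::123456789012:role/example-redshift-copy",
    delimiter,
)
```

Counting occurrences in the first line is a deliberately naive heuristic; a file whose first line contains quoted commas inside pipe-delimited fields could still be misclassified.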

992

MCQ · medium

A data pipeline uses AWS Glue to process data from Amazon S3 and write results to Amazon Redshift. The pipeline fails intermittently with the error 'S3ServiceException: Access Denied'. The IAM role used by Glue has permissions to read from the S3 bucket. What is the most likely cause of this error?

A. The S3 bucket is in a different AWS Region than the Glue job

B. S3 Server Access Logging is enabled and blocking requests

C. The S3 bucket policy denies access to the Glue job's IAM role

D. The S3 bucket has S3 Transfer Acceleration enabled

Answer: C

A bucket policy can explicitly deny access, overriding an IAM Allow.

Why this answer

Option C is correct because S3 bucket policies can explicitly deny access even if the IAM role allows it.

Option A is wrong because a Region difference does not by itself produce an Access Denied error. Option B is wrong because S3 Server Access Logging does not affect access permissions. Option D is wrong because S3 Transfer Acceleration is not related to Access Denied errors.

993

MCQ · medium

Refer to the exhibit. A data engineer is troubleshooting an AWS Glue ETL job that fails with an access denied error when writing to S3. The IAM role attached to the Glue job has the policy shown. What is the most likely cause of the error?

A. The Glue job is missing the s3:ListBucket permission on the bucket

B. The Glue job is writing to an S3 bucket that is not included in the Resource ARN

C. The S3 bucket is encrypted with AWS KMS and the policy does not include kms:Decrypt permissions

D. The Glue job does not have permission to call glue:StartJobRun

Answer: B

The policy only grants access to my-data-bucket, not other buckets.

Why this answer

Option B is correct because the policy only allows s3:PutObject on objects in my-data-bucket, so a write to any other bucket is denied.

Option A is wrong because s3:ListBucket is not needed for writing. Option C (KMS) is plausible, but nothing in the scenario indicates it. Option D is wrong because glue:StartJobRun is allowed.

994

MCQ · medium

A data engineer is troubleshooting a slow-running query on Amazon Redshift. The query scans a large table but returns few rows. Which diagnostic step should be taken first?

A. Use EXPLAIN to review the query plan.

B. Run ANALYZE on the table.

C. Check the concurrency scaling status.

D. Run VACUUM on the table.

Answer: A

Reveals how the query is executed, helps identify bottlenecks.

Why this answer

When a query scans a large table but returns few rows, the most likely cause is an inefficient query plan, such as a full table scan that cannot use the sort key and zone maps to skip blocks (Redshift has no conventional indexes). Using EXPLAIN first reveals the execution plan, allowing the engineer to identify whether the query is performing unnecessary sequential scans, missing filter pushdown, or using suboptimal join strategies. This diagnostic step should precede tuning actions like ANALYZE or VACUUM, which refresh table statistics and re-sort or reclaim storage rather than show how the query runs.

Exam trap

The trap here is that candidates often jump to performance-tuning commands like ANALYZE or VACUUM without first diagnosing the query plan, but the DEA-C01 exam emphasizes that EXPLAIN is the foundational step for identifying inefficient scan patterns before applying any corrective actions.

How to eliminate wrong answers

Option B is wrong because ANALYZE updates table statistics for the query optimizer, but if the query plan is already suboptimal (e.g., missing a WHERE clause filter), fresh statistics won't fix the root cause—EXPLAIN must be checked first. Option C is wrong because concurrency scaling handles increased query load by adding cluster capacity, but it does not improve the efficiency of a single slow query that scans many rows unnecessarily. Option D is wrong because VACUUM reclaims disk space and sorts rows for better compression, but it does not change the query execution path—a full table scan will remain a full table scan even after a vacuum.


995

Multi-Select · hard

A company is migrating a large Oracle database to Amazon Aurora PostgreSQL. The migration must have minimal downtime and preserve data consistency. Which THREE AWS services or features should be used?

Select 3 answers

A. Amazon RDS for Oracle as the target

B. AWS DataSync for initial load

C. AWS Schema Conversion Tool (SCT) for schema conversion

D. Amazon Aurora PostgreSQL as the target database

E. AWS Database Migration Service (DMS) for continuous replication

Answers: C, D, E

SCT converts Oracle schema to Aurora PostgreSQL compatible schema.

Why this answer

The AWS Schema Conversion Tool (SCT) is required to convert the source Oracle database schema (including stored procedures, functions, and data types) to a format compatible with Amazon Aurora PostgreSQL. Without SCT, the heterogeneous migration would fail due to incompatible SQL dialects and database objects.

Exam trap

The trap here is that candidates often confuse AWS DataSync (a file-transfer service) with database migration tools, or mistakenly think RDS for Oracle can serve as a migration target when the question explicitly specifies Aurora PostgreSQL.


996

MCQ · hard

A company is ingesting CSV files into Amazon S3. Each file contains a header row. The pipeline uses AWS Glue to crawl the S3 bucket and create a table in the AWS Glue Data Catalog. However, the crawler is including the header as data. What is the most likely cause?

A. The S3 bucket has versioning enabled

B. The crawler is configured to use a custom classifier that does not skip headers

C. The crawler is running in 'crawl all folders' mode

D. The CSV files are not compressed

Answer: B

Without a classifier that sets skip.header.line.count=1, headers are included.

Why this answer

By default, Glue crawlers do not skip headers unless configured to do so. Setting 'skip.header.line.count=1' in the table properties makes the first line of each file be skipped as a header. The crawler classifies the files correctly, but the table needs this property.

997

MCQ · easy

A company runs a data pipeline on AWS Glue that processes streaming data from Amazon Kinesis Data Streams and writes results to an Amazon Redshift cluster. The pipeline has been running smoothly, but recently the Glue job started failing with 'ResourceNotFoundException' for the Redshift table. What should the data engineer check first?

A. Verify that the target Redshift table exists and was not dropped or renamed.

B. Ensure the Redshift table schema matches the Glue job output.

C. Check the IAM role permissions for the Glue job to access Redshift.

D. Review security group rules for the Redshift cluster.

Answer: A

ResourceNotFoundException indicates the table is missing.

Why this answer

Option A is correct because the error indicates the table does not exist or was deleted.

Option B is wrong because schema changes could cause a type mismatch but not ResourceNotFoundException. Option C is wrong because IAM role issues would cause Access Denied, not ResourceNotFoundException. Option D is wrong because network issues would cause a timeout or a refused connection.

998

MCQ · medium

An IAM policy is attached to an AWS Glue job. The job needs to read from and write to S3 buckets, and also trigger other Glue jobs. The job is failing with an AccessDenied error when trying to write to a bucket named 'example-bucket'. What is the MOST likely cause?

A. The policy does not include s3:PutObject action.

B. The bucket name in the resource ARN does not match the actual bucket name.

C. The policy uses a resource ARN with a wildcard, which is not allowed.

D. The policy does not allow Glue actions.

Answer: B

The ARN uses 'example-bucket' but the actual bucket might have a different name.

Why this answer

Option B is correct because the policy allows both GetObject and PutObject on the specified bucket, so the issue is likely that the bucket name is misspelled or the ARN is incorrect.

Option A is wrong because the policy does allow PutObject. Option C is wrong because the policy uses the correct ARN format. Option D is wrong because the policy does allow Glue actions.

999

MCQ · hard

A company runs a nightly AWS Glue ETL job that reads from a JDBC source (PostgreSQL) and writes to S3 in Parquet format. The job takes over 6 hours, but the SLA requires completion within 4 hours. The source table has 500 million rows and is updated frequently. Which approach will most reliably reduce job duration?

A. Enable job bookmark and schedule the job to run more frequently.

B. Use multiple JDBC connections in parallel by setting 'hashexpression' and 'hashfield'.

C. Partition the source table by year and use pushdown predicates in the Glue job.

D. Increase the number of DPUs for the Glue job to 100.

Answer: C

This reduces the data scanned by filtering on partition columns.

Why this answer

Option C is correct because partitioning the source table by year and using pushdown predicates allows AWS Glue to read only the relevant partitions from PostgreSQL, drastically reducing the data scanned and transferred. This directly addresses the 500 million row volume and frequent updates by minimizing the JDBC read workload, which is the primary bottleneck in the 6-hour runtime.

Exam trap

The trap here is that candidates often assume increasing DPUs (Option D) or adding parallelism (Option B) will linearly speed up JDBC reads, but they fail to recognize that the bottleneck is the source database's I/O and network throughput, not Glue's compute capacity, and that predicate pushdown is the only option that reduces the data volume at the source.

How to eliminate wrong answers

Option A is wrong because job bookmarks track previously processed data to avoid reprocessing, but they do not reduce the initial full load or the per-run data volume; scheduling more frequently would only compound the problem by running incomplete jobs. Option B is wrong because parallel reads alone do not reduce the work: 'hashfield' and 'hashexpression' are the Glue options that choose how a JDBC table is split (normally one of them is set, together with 'hashpartitions' for the number of parallel reads), and even with them every one of the 500 million rows is still scanned and transferred. Option D is wrong because increasing DPUs to 100 may improve compute parallelism for transformations, but the bottleneck is the JDBC read from PostgreSQL, which is constrained by the source database's network and query capacity, not by Glue's compute resources; excessive DPUs can also cause throttling or connection limits.
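
For reference, the parallel-read options mentioned under Option B are plain key-value settings. The sketch below only builds that dictionary; the column name `order_id` and the partition count are illustrative, `parallel_jdbc_options` is a hypothetical helper, and the way the options are handed to Glue (described after the block) is my understanding of the Glue documentation rather than something stated on this page.

```python
def parallel_jdbc_options(hashfield, hashpartitions):
    """Build the extra options for a parallel JDBC read (values passed as strings)."""
    if hashpartitions < 1:
        raise ValueError("hashpartitions must be at least 1")
    return {"hashfield": hashfield, "hashpartitions": str(hashpartitions)}


# 'order_id' and 20 are illustrative values, not taken from the question.
options = parallel_jdbc_options("order_id", 20)
```

In a Glue script a dictionary like this would typically be passed as `additional_options` when creating a DynamicFrame from a catalog table. It splits the read across connections but does not shrink the amount of data read, which is why the explanation still prefers filtering at the source.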

1000

MCQ · easy

A company stores sensitive data in Amazon S3 and needs to ensure that data is encrypted at rest. Which AWS service can be used to manage the encryption keys?

A. AWS Key Management Service (KMS)

B. AWS Secrets Manager

C. AWS Identity and Access Management (IAM)

D. AWS Certificate Manager (ACM)

Answer: A

KMS is the service for managing encryption keys.

Why this answer

AWS KMS is the managed service for creating and controlling encryption keys used to encrypt data.

Option B is wrong because Secrets Manager is for secrets like passwords. Option C is wrong because IAM manages access, not keys. Option D is wrong because ACM manages TLS certificates, not data encryption keys.

1001

MCQ · medium

A company is using Amazon RDS for PostgreSQL with Multi-AZ deployment. The primary instance fails and a failover occurs. After the failover, the application cannot connect to the database. What is the MOST likely cause?

A. The database instance is in a 'stopped' state after failover.

B. The Multi-AZ failover requires manual intervention to complete.

C. The security group for the RDS instance was not updated during failover.

D. The application is using the old primary instance endpoint instead of the RDS CNAME.

Answer: D

The application should use the CNAME, which updates automatically after failover.

Why this answer

After a Multi-AZ failover in Amazon RDS for PostgreSQL, the DNS CNAME record is automatically updated to point to the new primary instance in the standby Availability Zone. If the application is configured with the old primary instance's endpoint (the specific IP or DNS name of the original instance) instead of the RDS CNAME (which remains constant), it will attempt to connect to the failed instance, which is no longer available. This is the most likely cause of connectivity loss because the CNAME is the stable connection point that follows the active primary.

Exam trap

The trap here is that candidates may assume security groups or instance state are the issue, but AWS explicitly tests the concept that the RDS CNAME is the correct connection target and that hardcoding endpoints leads to failover failures.

How to eliminate wrong answers

Option A is wrong because RDS Multi-AZ failover does not stop the database instance; the new primary is promoted and remains in an 'available' state. Option B is wrong because Multi-AZ failover is fully automated and requires no manual intervention to complete. Option C is wrong because security groups are associated with the RDS instance itself, not with a specific AZ or IP, and they remain unchanged during failover; the new primary inherits the same security group configuration.


1002

Multi-Select · medium

A data engineer is designing a data ingestion pipeline to load clickstream data from an Amazon S3 bucket into an Amazon Redshift cluster. The data arrives in 5-minute batches. Which TWO actions should the engineer take to ensure data consistency and avoid duplicates? (Select TWO.)

Select 2 answers

A.Define a SORTKEY on the target table to improve deduplication.

B.Disable workload management (WLM) to maximize resources.

C.Use the STL_LOAD_ERRORS system table to monitor and resolve load errors.

D.Load data into a single slice to maintain order.

E.Use a staging table and perform a MERGE operation to avoid duplicates.

AnswersC, E

Monitoring load errors helps catch and fix issues that could cause duplicates or missing data.

Why this answer

Options C and E are correct. The STL_LOAD_ERRORS system table helps identify and resolve load errors, and a staging table with a MERGE (upsert) pattern makes loads idempotent. Option B (disabling WLM) is unrelated to data consistency.

Option D (loading into a single slice) reduces load performance. Option A (defining a SORTKEY) improves query performance but does not prevent duplicates.
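
The staging-table pattern can be sketched in a few lines. This is a minimal illustration only: it uses Python's built-in SQLite and `INSERT OR REPLACE` as a stand-in for Redshift's MERGE, and the table names (`clicks`, `staging`) and columns are hypothetical.

```python
import sqlite3


def load_batch(conn, rows):
    """Load one batch via a staging table, then upsert it into the target."""
    cur = conn.cursor()
    cur.execute("DELETE FROM staging")  # start each load with an empty staging table
    cur.executemany("INSERT INTO staging (event_id, page) VALUES (?, ?)", rows)
    # SQLite stand-in for a Redshift MERGE: rows whose key already exists in
    # the target are replaced instead of being inserted a second time.
    cur.execute(
        "INSERT OR REPLACE INTO clicks (event_id, page) "
        "SELECT event_id, page FROM staging"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (event_id TEXT PRIMARY KEY, page TEXT)")
conn.execute("CREATE TABLE staging (event_id TEXT, page TEXT)")

batch = [("e1", "/home"), ("e2", "/cart")]  # hypothetical clickstream rows
load_batch(conn, batch)
load_batch(conn, batch)  # the same batch delivered a second time

row_count = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
print(row_count)
```

Replaying the same batch leaves the target row count unchanged, which is the idempotency property the staging approach is meant to provide.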

1003

MCQmedium

A company uses AWS Glue to process sensitive customer data stored in S3. The security team requires that all data be encrypted at rest using a customer-managed KMS key and that access to the key be auditable. Which solution meets these requirements?

A.Encrypt the data client-side before uploading to S3.

B.Configure the S3 bucket to use SSE-KMS with a customer-managed KMS key and enable CloudTrail for KMS events.

C.Enable default SSE-S3 encryption on the S3 bucket.

D.Use SSE-C with a customer-provided key.

AnswerB

SSE-KMS with customer-managed key provides encryption and auditability via CloudTrail.

Why this answer

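Option B is correct because SSE-KMS with a customer-managed key encrypts the data at rest, and CloudTrail records each use of the key, which makes access auditable. Option C (SSE-S3) uses keys managed by Amazon S3, so there is no customer-managed key to audit. Option A (client-side encryption) encrypts data before upload rather than applying server-side encryption at rest in S3 with the required KMS key.

Option D (SSE-C) uses a customer-provided key rather than a KMS key.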

1004

MCQhard

The exhibit shows the output of describe-table for a DynamoDB table. The table is used for a reporting job that queries by 'pk' and filters on 'sk' using a range condition. The job is running slowly. What is the most likely cause?

A.The table lacks a global secondary index (GSI).

B.The provisioned read capacity is too low.

C.The table uses provisioned throughput instead of on-demand.

D.The table needs a local secondary index (LSI) on 'sk'.

AnswerB

5 RCU is very low for reporting queries.

Why this answer

The table has only 5 read capacity units, which is likely too low for the reporting job. Auto scaling or increasing RCU would help. Indexes and LSI are not shown to be causing slowness.

1005

MCQhard

A company has a data lake in Amazon S3 with millions of objects. The security team wants to enforce that all objects are encrypted with a specific customer-managed KMS key. The data engineer configures an S3 bucket policy to deny PutObject if the encryption is not set to that key. However, some existing objects are not encrypted with that key. What is the most efficient way to remediate the existing objects?

A.Use S3 Cross-Region Replication to replicate objects to a new bucket with the correct encryption.

B.Write a script using the AWS SDK to iterate over all objects and re-upload them with the correct encryption.

C.Use S3 Batch Operations to copy objects in the same bucket with the new encryption settings.

D.Use S3 Object Lambda to dynamically encrypt objects on read.

AnswerC

Batch Operations can efficiently update encryption for large numbers of objects.

Why this answer

Option C is correct because S3 Batch Operations can copy objects in place with new encryption settings, efficiently updating millions of objects. Option B is wrong because a custom script that iterates over and re-uploads every object is inefficient for millions of objects. Option A is wrong because S3 Replication is for cross-region or cross-bucket copying.

Option D is wrong because S3 Object Lambda modifies data on read, not at rest.

1006

MCQmedium

A company is storing sensitive user data in an Amazon S3 bucket. The security team requires that all data be encrypted at rest using a customer-managed key stored in AWS KMS. The bucket policy must deny any PUT request that does not include the appropriate encryption header. Which bucket policy condition key should be used?

A.s3:x-amz-server-side-encryption-aws-kms-key-id

B.s3:x-amz-server-side-encryption

C.s3:x-amz-acl

D.aws:SourceArn

AnswerA

This condition key allows requiring a specific KMS key ID for encryption.

Why this answer

Option A is correct because the `s3:x-amz-server-side-encryption-aws-kms-key-id` condition key allows the bucket policy to enforce that PUT requests include a specific customer-managed KMS key ID in the `x-amz-server-side-encryption-aws-kms-key-id` header, ensuring encryption at rest with the required key. This directly meets the security team's requirement to deny PUT requests that lack the appropriate encryption header tied to a customer-managed KMS key.

Exam trap

The trap here is that candidates often confuse `s3:x-amz-server-side-encryption` (which only checks the encryption algorithm, not the key) with `s3:x-amz-server-side-encryption-aws-kms-key-id` (which checks the specific KMS key ID), leading them to pick option B when the requirement explicitly demands a customer-managed key.

How to eliminate wrong answers

Option B is wrong because `s3:x-amz-server-side-encryption` only checks whether the `x-amz-server-side-encryption` header is present (e.g., `AES256` or `aws:kms`), but it cannot enforce that a specific customer-managed KMS key ID is used; it would allow any KMS key, including AWS-managed keys. Option C is wrong because `s3:x-amz-acl` is used to control access control list (ACL) headers in requests, not encryption headers, so it is irrelevant to encryption enforcement. Option D is wrong because `aws:SourceArn` is a global condition key used to restrict requests based on the ARN of the source resource (e.g., an SNS topic or Lambda function), not to enforce encryption headers in S3 PUT requests.
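
As a rough sketch of how this condition key is typically used, the snippet below builds a Deny statement with a `StringNotEquals` condition. The bucket name and key ARN are placeholders, and the helper `deny_unless_kms_key` is a hypothetical name introduced only for this example.

```python
import json


def deny_unless_kms_key(bucket_name, kms_key_arn):
    """Build a bucket policy that denies PutObject unless the given KMS key is named."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyPutWithoutRequiredKmsKey",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    # A negated operator also matches when the header is absent,
                    # so uploads that omit the key header are denied as well.
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn
                    }
                },
            }
        ],
    }


# Placeholder values, not real resources.
policy = deny_unless_kms_key(
    "example-bucket",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(json.dumps(policy, indent=2))
```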

1007

MCQhard

A company uses Amazon Kinesis Data Streams with enhanced fan-out consumers. The stream has 5 shards. Each consumer reads from all shards. The total incoming data rate is 25 MB/s. What is the maximum read throughput per consumer if enhanced fan-out is enabled?

A.2 MB/s

B.10 MB/s

C.1 MB/s

D.5 MB/s

AnswerB

Each shard provides 2 MB/s read capacity with enhanced fan-out; 5 shards = 10 MB/s.

Why this answer

Option B is correct because enhanced fan-out provides each consumer with a dedicated 2 MB/s of read throughput per shard. With 5 shards, each consumer can read up to 5 × 2 MB/s = 10 MB/s. Option D (5 MB/s) is wrong because that is the stream's total write capacity at 1 MB/s per shard.

Option A (2 MB/s) is wrong because that is the read throughput of a single shard, not of all 5 shards. Option C (1 MB/s) is wrong because that is the per-shard write capacity.
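
The arithmetic can be checked with a short sketch. The constants are the per-shard figures quoted in the explanation; the function names are made up for illustration.

```python
WRITE_MB_PER_S_PER_SHARD = 1     # ingest limit per shard
EFO_READ_MB_PER_S_PER_SHARD = 2  # dedicated to each enhanced fan-out consumer


def efo_read_mb_per_s(shards):
    """Maximum read throughput for one enhanced fan-out consumer."""
    return shards * EFO_READ_MB_PER_S_PER_SHARD


def write_mb_per_s(shards):
    """Total write capacity of the stream."""
    return shards * WRITE_MB_PER_S_PER_SHARD


read_capacity = efo_read_mb_per_s(5)
write_capacity = write_mb_per_s(5)
print(read_capacity, write_capacity)  # 10 5
```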

1008

MCQmedium

A data engineer is troubleshooting a data pipeline that uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The engineer notices that the S3 bucket contains many small files (less than 1 MB). This is causing performance issues in downstream processing. What is the BEST way to reduce the number of small files?

A.Increase the buffer size to at least 128 MB in the Firehose delivery stream configuration.

B.Use an AWS Lambda function to transform the data before delivery.

C.Change the compression format from GZIP to Snappy.

D.Decrease the buffer interval in the Firehose delivery stream configuration.

AnswerA

Larger buffer size leads to fewer, larger files.

Why this answer

Option A is correct because increasing the buffer size (e.g., to 128 MB) causes Firehose to deliver fewer, larger files. Option D is incorrect because reducing the buffer interval would create more small files. Option C is incorrect because changing the compression algorithm affects the stored size of each file, not the number of files.

Option B is incorrect because a Lambda transformation does not inherently change buffering behavior.

1009

Multi-Selectmedium

A company is designing a data lake on Amazon S3 for analytics. The data includes sensitive personally identifiable information (PII). Which TWO actions should the company take to protect the data? (Choose TWO.)

Select 2 answers

A.Enable S3 Block Public Access.

B.Enable Requester Pays.

C.Enable S3 Transfer Acceleration.

D.Enable cross-region replication.

E.Enable default encryption with SSE-KMS.

AnswersA, E

Why this answer

Options A and E are correct. Option E is correct because default encryption with SSE-KMS encrypts the data at rest and provides key management. Option A is correct because S3 Block Public Access prevents accidental public exposure.

Option C is wrong because S3 Transfer Acceleration is for speed, not security. Option D is wrong because cross-region replication does not protect data in the source bucket. Option B is wrong because Requester Pays does not control access.

1010

MCQmedium

A data engineer runs the above CLI command to describe the DynamoDB table 'Orders'. The table has a partition key 'OrderID' and sort key 'CustomerID'. Which query operation is most efficient for retrieving all orders for a specific customer?

A.Query the table using CustomerID as the partition key

B.Scan the table and filter by CustomerID

C.Use GetItem with CustomerID as the key

D.Create a Global Secondary Index on CustomerID and query the index

AnswerD

A GSI allows efficient query by CustomerID alone.

Why this answer

Option D is correct because a Global Secondary Index (GSI) on CustomerID allows you to query efficiently using CustomerID as the partition key, avoiding a full table scan. Since the base table's primary key is (OrderID, CustomerID), you cannot directly query by CustomerID alone; a GSI provides an alternative access pattern optimized for this query.

Exam trap

The trap here is that candidates assume the sort key can be used as a query filter without an index, but DynamoDB requires the partition key for Query operations, and a Scan is often mistakenly chosen as a simpler alternative despite its performance cost.

How to eliminate wrong answers

Option A is wrong because CustomerID is the sort key, not the partition key, so a Query operation requires the partition key (OrderID) to be specified; you cannot query using only the sort key. Option B is wrong because a Scan reads every item in the table, which is inefficient and costly for large datasets, especially when a targeted query is possible. Option C is wrong because GetItem requires both the partition key and sort key to retrieve a single item; it cannot return multiple orders for a customer.
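
To make the difference concrete, here is a toy in-memory model of a Scan versus an index lookup. It is only a conceptual sketch, not how DynamoDB is implemented, and the sample items are invented.

```python
from collections import defaultdict

# Invented sample items keyed by (OrderID, CustomerID).
orders = [
    {"OrderID": "o-1", "CustomerID": "c-1", "total": 20},
    {"OrderID": "o-2", "CustomerID": "c-2", "total": 35},
    {"OrderID": "o-3", "CustomerID": "c-1", "total": 12},
]


def scan_by_customer(items, customer_id):
    """Scan-like access: every item is examined, then filtered."""
    examined = 0
    matches = []
    for item in items:
        examined += 1
        if item["CustomerID"] == customer_id:
            matches.append(item)
    return matches, examined


def build_customer_index(items):
    """GSI-like structure: items regrouped under CustomerID as the lookup key."""
    index = defaultdict(list)
    for item in items:
        index[item["CustomerID"]].append(item)
    return index


def query_index(index, customer_id):
    """Query-like access: only the items stored under the key are touched."""
    matches = index.get(customer_id, [])
    return matches, len(matches)


scan_matches, scan_examined = scan_by_customer(orders, "c-1")
gsi_matches, gsi_examined = query_index(build_customer_index(orders), "c-1")
print(scan_examined, gsi_examined)  # 3 2
```

Both paths return the same two orders, but the scan touches every item while the index lookup touches only the matching ones.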

1011

MCQhard

A company has an AWS Glue ETL job that reads data from an S3 bucket encrypted with SSE-S3. The job runs successfully, but the output written to another S3 bucket with SSE-KMS fails. The IAM role for the Glue job has s3:PutObject and kms:GenerateDataKey permissions. What is the most likely cause?

A.The IAM role is missing kms:Encrypt permission

B.The target S3 bucket policy denies s3:PutObject

C.The KMS key policy does not grant the Glue role kms:GenerateDataKey

D.The source bucket's encryption type is incompatible with the target

AnswerA

Writing with SSE-KMS requires kms:Encrypt.

Why this answer

Option A is correct: the Glue role needs kms:Encrypt permission to write with SSE-KMS. Option D is wrong because the source bucket's SSE-S3 encryption does not involve KMS and has no bearing on how the target is encrypted.

Option B is wrong because nothing in the scenario indicates that the S3 bucket policy is the issue. Option C is wrong because, although the KMS key policy must allow the role, the role already has GenerateDataKey; a missing Encrypt permission is more likely.

1012

MCQhard

Your company runs a data pipeline that ingests data from AWS Database Migration Service (DMS) into Amazon S3 in Parquet format. An AWS Glue ETL job then transforms the data and loads it into an Amazon Redshift cluster. The Glue job uses a JDBC connection to Redshift. Recently, the Glue job started failing with a 'communication failure' error when writing to Redshift. The Redshift cluster is in a VPC with public accessibility disabled. The Glue job runs in a VPC with a subnet that has a route to a NAT gateway. The security group for Redshift allows inbound traffic from the Glue job's security group. The Glue job's IAM role has the necessary permissions. What is the most likely cause?

A.The Glue job's IAM role does not have the redshift:DescribeClusters permission.

B.The Redshift cluster's public accessibility is disabled, but the Glue job is trying to connect over the internet.

C.The Glue job and Redshift cluster are in different VPCs that are not peered or connected via VPC Transit Gateway.

D.The NAT gateway is not configured to allow traffic to the Redshift cluster's subnet.

AnswerC

Without VPC peering or transit gateway, the Glue job cannot reach the Redshift cluster.

Why this answer

Option C is correct because even though the security group allows inbound traffic, the Glue job's VPC may not have a route to the Redshift cluster's VPC if they are in different VPCs. Option A is wrong because IAM permissions are not the issue. Option B is wrong because the Redshift cluster is in a VPC and not publicly accessible.

Option D is wrong because the NAT gateway is for outbound internet, not for connecting to Redshift within the same VPC.

1013

MCQhard

A company uses AWS Lake Formation to manage data lake permissions. The data engineer notices that a user with SELECT permission on a table can also query the underlying data in Amazon S3 directly. How can the engineer enforce that access to the S3 data is only through Lake Formation?

A.Use S3 Access Points with a policy that restricts access to only Lake Formation

B.Grant the user permissions only through Lake Formation and remove any IAM policies that allow direct S3 access to the data location

C.Enable S3 Block Public Access on the bucket

D.Change the S3 bucket policy to deny all access except from Lake Formation

AnswerB

This ensures that the user can only access data through Lake Formation, and direct S3 access is blocked.

Why this answer

Option B is correct. Registering the S3 location with Lake Formation and using the 'Lake Formation managed' option ensures that IAM policies do not grant direct S3 access. Lake Formation provides fine-grained access control, and by granting only Lake Formation permissions and revoking direct S3 bucket permissions, users cannot bypass Lake Formation. Options A and D change access point or bucket policies rather than removing the IAM permissions that make direct access possible.

Option C is wrong because S3 Block Public Access only blocks public access; it does not affect IAM principals that hold S3 permissions.

1014

MCQmedium

A company is ingesting log files from multiple EC2 instances into Amazon S3 using the CloudWatch agent. The logs are delivered to a CloudWatch Logs group, and a subscription filter sends them to a Lambda function for transformation, then to Firehose. The Firehose stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The logs are critical and must be available in S3 within 5 minutes. What is the most cost-effective way to reduce the delivery latency?

A.Replace Firehose with Amazon Kinesis Data Streams

B.Increase the buffer size to 10 MB

C.Increase the buffer interval to 120 seconds

D.Decrease the buffer interval to 10 seconds

AnswerD

Lower buffer interval reduces delivery latency.

Why this answer

Option D is correct because reducing the buffer interval from 60 to 10 seconds directly reduces latency without adding significant cost. Option B (increasing the buffer size) would not reduce latency and could increase it. Option C (increasing the buffer interval) worsens latency.

Option A (replacing Firehose with Kinesis Data Streams) adds cost and complexity.
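
A simplified model shows why the interval is the setting that matters here. It assumes Firehose flushes a buffer when either the size or the interval threshold is reached, whichever comes first; the ingest rate of 0.02 MB/s is a hypothetical value, and the function name is made up for this sketch.

```python
def seconds_until_delivery(buffer_size_mb, buffer_interval_s, rate_mb_per_s):
    """Estimate when a buffer is flushed: whichever threshold is reached first."""
    if rate_mb_per_s <= 0:
        return float(buffer_interval_s)
    seconds_to_fill = buffer_size_mb / rate_mb_per_s
    return min(float(buffer_interval_s), seconds_to_fill)


rate = 0.02  # MB/s, hypothetical log volume

current = seconds_until_delivery(5, 60, rate)          # 60 s interval, 5 MB buffer
shorter_interval = seconds_until_delivery(5, 10, rate)  # interval reduced to 10 s
bigger_buffer = seconds_until_delivery(10, 60, rate)    # buffer raised to 10 MB

print(current, shorter_interval, bigger_buffer)
```

At this low rate the buffer never fills before the interval expires, so only the shorter interval changes the estimated delivery time.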

1015

MCQmedium

A data engineering team notices that an AWS Glue ETL job, which processes hourly data from an S3 bucket, is taking progressively longer to run. The job reads Parquet files partitioned by date and hour. Which action is MOST likely to improve the job's performance?

A.Enable pushdown predicate filtering on the job's data source.

B.Convert Parquet files to CSV to improve read performance.

C.Increase the number of DPUs for the job.

D.Switch from Spark to Python shell for simpler processing.

AnswerA

Pushdown predicates filter data at the source, reducing data scanned.

Why this answer

Option A is correct because a pushdown predicate prunes partitions before the data is read, so the job scans only the date and hour partitions it needs. Option D is wrong because Glue Spark jobs already use distributed processing, which a Python shell job would give up. Option C is wrong because increasing the number of DPUs can help but is not the most direct fix for the described symptom.

Option B is wrong because converting to CSV would increase I/O and processing time.

1016

MCQhard

A company runs a nightly batch ETL job using AWS Glue to transform data from Amazon RDS for MySQL to Amazon S3. The job reads 100 tables and writes Parquet files partitioned by date. Recently, the job started failing with 'ThrottlingException' from the RDS database. The data volume has increased, and the Glue job is reading large tables without any filtering. The job uses a single Glue job with multiple Spark executors. The engineer needs to reduce the load on the RDS database while maintaining the same processing time. What should the engineer do?

A.Use AWS DMS to continuously replicate data to S3 and then run Glue on the S3 data.

B.Change the job to read all data from RDS into a staging table in S3 first, then transform.

C.Increase the number of DPUs to process data faster.

D.Modify the Glue job to use a JDBC connection with a WHERE clause to read only the latest partition.

AnswerD

Filtering reduces data read from RDS, reducing throttling.

Why this answer

Option D is correct because using a JDBC connection with a WHERE clause to filter data by date reduces the amount of data read from RDS, reducing throttling. Option C is wrong because increasing DPUs would increase parallelism and potentially increase load on RDS. Option A is wrong because DMS is a migration tool, not for regular ETL.

Option B is wrong because reading all data into S3 first does not reduce load on RDS; the initial read still causes throttling.

1017

MCQhard

Refer to the exhibit. A data engineer applies this bucket policy to an S3 bucket. A user within the 10.0.0.0/24 IP range attempts to upload an object to the bucket using an HTTP (non-HTTPS) request. What is the outcome?

A.The upload succeeds because the Allow statement grants permission.

B.The upload succeeds because the user's IP is allowed.

C.The upload fails because the user's IP is not in the allowed range for PutObject.

D.The upload fails because the request is not using HTTPS.

AnswerD

Explicit Deny for non-HTTPS requests.

Why this answer

Option D is correct because the Deny statement explicitly denies all S3 actions when SecureTransport is false, regardless of the Allow statement. An explicit Deny overrides an Allow. Options A and B are wrong because the Deny applies even though the Allow statement matches and the user's IP address is permitted.

Option C is wrong because the IP condition appears only on the Allow statement, and the user's address is within the allowed 10.0.0.0/24 range.
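
The precedence rule can be illustrated with a toy evaluator. The exhibit is not reproduced here, so the two statements below are a hypothetical reconstruction (an IP-restricted Allow for PutObject and a Deny for non-HTTPS requests), and the evaluator is a simplification rather than the real IAM engine.

```python
import ipaddress

ALLOWED_NETWORK = ipaddress.ip_network("10.0.0.0/24")


def allow_put_from_network(request):
    """Hypothetical Allow: PutObject from the 10.0.0.0/24 range."""
    return (
        request["action"] == "s3:PutObject"
        and ipaddress.ip_address(request["source_ip"]) in ALLOWED_NETWORK
    )


def deny_insecure_transport(request):
    """Hypothetical Deny: any S3 action sent without HTTPS."""
    return request["action"].startswith("s3:") and not request["secure_transport"]


STATEMENTS = [
    ("Allow", allow_put_from_network),
    ("Deny", deny_insecure_transport),
]


def evaluate(statements, request):
    """Explicit Deny wins; otherwise a matching Allow; otherwise implicit deny."""
    matched = [effect for effect, matches in statements if matches(request)]
    if "Deny" in matched:
        return "Deny"
    if "Allow" in matched:
        return "Allow"
    return "Deny"


http_upload = {"action": "s3:PutObject", "source_ip": "10.0.0.15", "secure_transport": False}
https_upload = {"action": "s3:PutObject", "source_ip": "10.0.0.15", "secure_transport": True}

http_result = evaluate(STATEMENTS, http_upload)
https_result = evaluate(STATEMENTS, https_upload)
print(http_result, https_result)  # Deny Allow
```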

1018

Multi-Selectmedium

A company uses AWS Glue to perform ETL on data stored in Amazon S3. The Glue job reads CSV files, converts them to Parquet, and partitions by date. The job runs daily and processes about 500 GB of data. The team wants to optimize costs and performance. Which three actions should the team take? (Select THREE.)

Select 3 answers

A.Increase the Spark shuffle partitions to 500.

B.Use column pruning to read only necessary columns in the Glue script.

C.Use G.1X or G.2X worker types for better performance.

D.Increase the number of DPUs for the job.

E.Write the output as JSON instead of Parquet to avoid compression overhead.

AnswersB, C, D

Reduces data scanned and improves performance.

Why this answer

Option B is correct because column pruning in AWS Glue scripts reduces the amount of data read from Amazon S3 by specifying only the columns needed for the ETL transformation. This minimizes I/O and network overhead, directly lowering costs and improving job performance, especially when processing large CSV files.

Exam trap

The trap here is that candidates often confuse increasing DPUs or shuffle partitions as a universal performance fix, but AWS Glue's cost optimization relies on reducing data processed (column pruning) and choosing appropriate worker types for the workload, not simply scaling resources.

1019

MCQeasy

A company needs to transfer 10 TB of historical data from an on-premises HDFS cluster to Amazon S3. The data is stored on a single 20 TB disk. The network link to AWS has a bandwidth of 1 Gbps. The transfer must be completed within 2 days. Which solution meets these requirements?

A.Use AWS Snowball Edge to transfer the data physically.

B.Use Amazon Kinesis Data Streams to stream data to S3.

C.Use AWS DMS to migrate data from HDFS to S3.

D.Use AWS CLI to copy data directly to S3 over the network.

AnswerA

Snowball Edge provides fast, reliable transfer for large datasets.

Why this answer

Option A is correct because AWS Snowball Edge can physically ship the data, bypassing network bandwidth limits. Option D is wrong because, although a 1 Gbps link can in theory transfer about 21.6 TB in 2 days, which is enough for 10 TB, network instability and competing traffic may cause delays. Option C is wrong because AWS DMS is for database migration, not HDFS.

Option B is wrong because Kinesis is for streaming, not bulk transfer.
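
The 21.6 TB figure can be reproduced with a quick calculation. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a fully utilized link; the function names are illustrative.

```python
def transferable_tb(link_gbps, days):
    """Theoretical data volume a link can move in the given number of days."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * days * 86400 / 1e12


def hours_to_transfer(data_tb, link_gbps):
    """Theoretical time to move the data at full line rate."""
    seconds = data_tb * 1e12 * 8 / (link_gbps * 1e9)
    return seconds / 3600


capacity_tb = transferable_tb(1, 2)
hours_needed = hours_to_transfer(10, 1)
print(round(capacity_tb, 1), round(hours_needed, 1))  # 21.6 22.2
```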

1020

MCQhard

A company runs a data ingestion pipeline that uses AWS Glue to read 500 GB of JSON files from an S3 bucket (s3://raw-data/) every hour. The Glue ETL job transforms the data and writes Parquet files to another S3 bucket (s3://processed-data/). The job is triggered by a time-based CloudWatch Events rule. Recently, the job has started taking over 2 hours to complete, causing delays in downstream processes. The data volume has been consistent, and no changes have been made to the job code or infrastructure. The S3 bucket 's3://raw-data/' receives new files continuously, but the Glue job reads all files in the bucket each run (no incremental processing). The engineer suspects that the job is reprocessing old data. Which action should the engineer take FIRST to reduce the job duration?

A.Enable Glue job bookmarking and configure the job to process only new data.

B.Increase the parallelism of the Spark job by repartitioning the data.

C.Add partition pruning by modifying the S3 path to include date-based partitions.

D.Increase the number of DPUs for the Glue job to 100.

AnswerA

Bookmarking tracks processed files, so subsequent runs only process new files, drastically reducing runtime.

Why this answer

Option A is correct because enabling job bookmarking in Glue allows the job to process only new files since the last run, dramatically reducing processing time. Option D (increasing DPUs) would help with resource constraints but does not address the root cause of reprocessing old data. Option B (increasing parallelism) may help but not as much as eliminating reprocessing.

Option C (partition pruning) assumes the data is partitioned, but the stem says all files are read; partitioning might not help if files are not organized by time. The most impactful first step is to enable bookmarking.
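
The idea behind bookmarking can be sketched as a set of already-processed keys. This is a conceptual model only, not Glue's actual bookmark implementation, and the file names are invented.

```python
def run_job(all_keys, bookmark):
    """Process only keys not seen in earlier runs, then record them."""
    new_keys = sorted(set(all_keys) - bookmark)
    bookmark.update(new_keys)
    return new_keys


bookmark = set()  # state that persists between runs

first_run = run_job(["a.json", "b.json", "c.json"], bookmark)
second_run = run_job(["a.json", "b.json", "c.json", "d.json"], bookmark)

print(len(first_run), len(second_run))  # 3 1
```

The second run handles only the newly arrived file, so the work per run tracks the new data rather than the full contents of the bucket.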

1021

MCQmedium

A data engineer is designing a data ingestion pipeline to load data from an on-premises Oracle database to Amazon S3. The pipeline should capture changes in near real-time (within minutes) and minimize impact on the source database. The source table has a 'last_modified' timestamp column. Which service combination would meet these requirements?

A.AWS DMS with a replication task in CDC mode, writing to S3 in Parquet format.

B.Amazon Kinesis Data Firehose with a Lambda function that queries Oracle.

C.AWS Data Pipeline with a periodic SQL query activity to copy full table snapshots.

D.AWS Glue with a JDBC connection to Oracle, running a crawler every 5 minutes.

AnswerA

DMS CDC captures minimal changes and writes to S3 with low latency.

Why this answer

Option A is correct because AWS Database Migration Service (DMS) with change data capture (CDC) can capture changes from Oracle in near real-time and write them to S3. Option D is wrong because Glue does not support CDC. Option C is wrong because Data Pipeline is batch-oriented.

Option B is wrong because Firehose does not connect to databases directly.

1022

MCQhard

A data engineer is designing a data ingestion pipeline for clickstream data from a mobile app. The data volume varies, with occasional spikes up to 10 MB/s. The pipeline must persist the raw data in Amazon S3 and make it available for near-real-time analytics via Amazon Athena. Which combination of services minimizes cost and operational overhead?

A.Amazon Kinesis Data Streams with Amazon Kinesis Data Analytics, then Amazon S3

B.Amazon SQS with an Auto Scaling group of EC2 instances writing to Amazon S3

C.Amazon Kinesis Data Streams with AWS Lambda for transformation, then Amazon S3

D.Amazon Kinesis Data Firehose with direct delivery to Amazon S3, then Amazon Athena

AnswerD

Firehose is fully managed, scales automatically, and delivers to S3.

Why this answer

Option D is correct because Firehose can buffer and deliver data to S3 with minimal configuration, and Athena can query directly from S3. Option C (Kinesis Data Streams + Lambda) incurs higher costs and management overhead. Option B (SQS + EC2) requires managing EC2 instances and auto-scaling.

Option A (Kinesis Data Streams + Kinesis Data Analytics) adds complexity and cost.

1023

Multi-Selecthard

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a streaming ingestion architecture? (Choose 3.)

Select 3 answers

A.Ability to deliver data directly to Amazon Redshift

B.Data retention period longer than 7 days

C.Automatic scaling of ingestion throughput

D.Lower cost per GB ingested

E.Need for custom data processing using AWS Lambda or KCL

AnswersA, C, E

Firehose can deliver to Redshift; KDS requires additional services.

Why this answer

KDS offers custom processing with shard-level control, while Firehose auto-scales and can deliver to S3, Redshift, etc. KDS requires consumer management; Firehose is managed.

1024

MCQmedium

A data engineer needs to set up a cross-account access for an S3 bucket so that users in Account B can read objects. The bucket in Account A has a bucket policy that grants access. What additional step is required?

A.Enable S3 object ACLs on the bucket.

B.Create an IAM role in Account B and attach a policy that allows s3:GetObject for the bucket.

C.Disable S3 Block Public Access settings on the bucket.

D.Set up an S3 Lifecycle policy to replicate objects to Account B.

AnswerB

Users in Account B need an IAM role or user with explicit permissions to access the bucket.

Why this answer

Option B is correct because cross-account access requires both a bucket policy in the bucket owner's account (Account A) and an IAM user or role in the accessing account (Account B) with permissions. Option C (disabling Block Public Access) is not needed if the bucket policy is not public. Option A (ACLs) is wrong because ACLs are legacy and not recommended.

Option D (lifecycle policy) is unrelated.

1025

MCQhard

A company is using Amazon EMR to process data stored in Amazon S3. The S3 bucket is configured with a bucket policy that denies access unless the request includes a specific tag. The EMR cluster's IAM role has s3:GetObject permission. However, the EMR job fails to read data from S3. What is the most likely cause?

A.The bucket policy is not attached to the EMR role.

B.The EMR cluster is not in the same account as the S3 bucket.

C.The IAM role does not have a condition that matches the required tag.

D.The EMR role does not have s3:GetObject permission.

AnswerC

The bucket policy requires a tag, and the role must have a matching condition.

Why this answer

The bucket policy denies access unless the request includes a specific tag. Even though the EMR cluster's IAM role has s3:GetObject permission, the IAM role does not have a condition key (e.g., aws:RequestTag) that matches the required tag. Therefore, the request is denied by the bucket policy, causing the EMR job to fail.

Exam trap

AWS often tests the interaction between IAM policies and S3 bucket policies, specifically that a bucket policy with a deny condition can override IAM permissions, and candidates mistakenly think the issue is missing IAM permissions rather than a missing condition in the request.

How to eliminate wrong answers

Option A is wrong because bucket policies are attached to the S3 bucket, not to IAM roles; the policy is already configured on the bucket. Option B is wrong because cross-account access is possible with proper permissions, and the question does not indicate a different account; the failure is due to the tag condition, not account mismatch. Option D is wrong because the question explicitly states the IAM role has s3:GetObject permission, so the failure is not due to missing permission.

1026

MCQeasy

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 on a nightly basis. The data volume is approximately 10 GB per night. The database is accessible over the internet. Which AWS service is MOST appropriate for this task?

A.AWS Glue ETL job with a JDBC connection

B.AWS DataSync

C.AWS Transfer Family

D.Amazon Kinesis Data Streams

Why this answer

AWS Database Migration Service (DMS) is designed for migrating databases to AWS and can continuously replicate changes. For nightly batch loads, DMS with a full load or CDC is ideal. Option B (AWS DataSync) is for file transfers, not databases.

Option A (AWS Glue) can connect to JDBC sources but is more suitable for ETL; DMS is simpler for database migration. Option D (Amazon Kinesis Data Streams) is for real-time streaming. Option C (AWS Transfer Family) is for file transfers over SFTP/FTPS.

1027

MCQhard

A company is using Amazon EMR with Kerberos authentication. They want to ensure that data in transit between EMR cluster nodes is encrypted. Which configuration should be applied?

A.Use VPC peering to connect the cluster nodes.

B.Configure the EMR cluster to use in-transit encryption.

C.Enable S3 server-side encryption for the cluster's output data.

D.Enable EBS encryption on the cluster instances.

AnswerB

In-transit encryption uses TLS to protect data between nodes.

Why this answer

Option B is correct because enabling in-transit encryption in Amazon EMR uses TLS to encrypt data between nodes. Option C is incorrect because S3 SSE encrypts data at rest. Option D is incorrect because EBS encryption encrypts data at rest.

Option A is incorrect because VPC peering does not provide encryption; it is a network connectivity feature.

1028

MCQmedium

A company uses Amazon S3 to store raw data files. An AWS Glue crawler creates metadata in the Data Catalog. The data engineer discovers that the crawler is not detecting new partitions after new data is added to the S3 bucket. What is the MOST likely cause?

A.The IAM role used by the crawler does not have kms:Decrypt permission for the KMS key that encrypts the new partitions.

B.The crawler configuration has 'Crawl all folders' disabled.

C.The S3 bucket has too many objects, exceeding the crawler's limit.

D.The crawler does not have S3 event notifications enabled.

AnswerA

Without decrypt permission, the crawler cannot read the data.

Why this answer

Option A is correct because if the new partitions sit under a prefix encrypted with SSE-KMS, the crawler cannot read them unless its IAM role (and the KMS key policy) allows kms:Decrypt. Option D is wrong because S3 event notifications are not required for crawling. Option C is wrong because the crawler can handle many objects.

Option B is wrong because, although crawler configuration affects how partitions are detected, the most likely cause when encryption is involved is a missing permission.

1029

MCQmedium

A data engineer is migrating an on-premises MongoDB database to Amazon DocumentDB. Which migration strategy minimizes downtime?

A.Take a snapshot of the MongoDB database and restore it to DocumentDB.

B.Use AWS Database Migration Service (AWS DMS) with full load only.

C.Export data using mongodump and import using mongorestore.

D.Use AWS DMS with full load and ongoing replication from MongoDB to DocumentDB.

AnswerD

Ongoing replication allows near-zero downtime cutover.

Why this answer

AWS DMS with full load and ongoing replication (change data capture) minimizes downtime by continuously synchronizing changes from the source MongoDB to the target DocumentDB after the initial full load, allowing a cutover with only a brief pause. This is the only option that supports near-zero downtime migration for live databases.

Exam trap

The trap here is that candidates assume any AWS DMS migration automatically minimizes downtime, but only the full load plus ongoing replication (CDC) option achieves near-zero downtime, while full load only still requires a write stop.

How to eliminate wrong answers

Option A is wrong because taking a snapshot and restoring it captures only a point-in-time copy, requiring the source database to be offline or read-only during the snapshot, causing downtime. Option B is wrong because AWS DMS full load only transfers the current data once, without capturing ongoing changes, so any writes during the migration are lost and downtime is needed to stop writes before cutover. Option C is wrong because mongodump and mongorestore are offline tools that require the source MongoDB to stop accepting writes during the export, resulting in significant downtime.

1030

MCQeasy

A data engineer needs to ingest streaming data from thousands of IoT devices and immediately process each record with minimal latency. Which AWS service should be used as the ingestion point?

A.AWS Lambda

B.Amazon S3

C.Amazon Kinesis Data Streams

D.AWS Glue

AnswerC

Kinesis Data Streams ingests streaming data with low latency and can be consumed by multiple applications.

Why this answer

Amazon Kinesis Data Streams is designed for real-time streaming data ingestion with low latency. AWS Glue is for batch ETL, S3 is object storage, and Lambda is compute but not an ingestion endpoint itself.

1031

MCQhard

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is transformed using an AWS Lambda function. Recently, the transformation errors have increased due to Lambda timeouts. The data engineer needs to diagnose and resolve the issue without losing data. What should the engineer do?

A.Increase the Lambda function timeout and ensure that failed records are sent to a backup S3 bucket

B.Enable Amazon CloudWatch Logs for the Lambda function to capture errors and store failed records in CloudWatch

C.Configure the Lambda function to write failed records to an Amazon SQS queue for later reprocessing

D.Modify the Lambda function to store failed records in Amazon S3 before processing

AnswerA

Increasing timeout reduces failures, and configuring a backup bucket prevents data loss.

Why this answer

Option A is correct because increasing the Lambda timeout gives the function more time to complete, and Firehose retries failed invocations; records that still fail are delivered to a backup S3 location rather than lost. Option B is wrong because CloudWatch Logs captures log output but does not store failed records for reprocessing. Option C is wrong because an SQS queue duplicates error handling that Firehose already provides, and a function that times out cannot reliably write to the queue.

Option D is wrong because copying records to S3 inside the function adds work to an invocation that is already timing out; Firehose handles failed records itself.
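
As a rough illustration of how failed records can be flagged rather than lost, the sketch below shows a transformation function that returns a per-record status. The event and response fields (`recordId`, `result`, `data`) follow the Firehose data-transformation contract; the transformation itself (upper-casing a hypothetical `message` field) is an assumption made only for the example, and the sketch does not address the timeout, which still needs a longer function timeout.

```python
import base64
import json


def transform(payload: dict) -> dict:
    """Hypothetical transformation: upper-case the 'message' field."""
    return {**payload, "message": payload["message"].upper()}


def handler(event, context=None):
    """Firehose transformation handler that reports a status per record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            data = json.dumps(transform(payload)).encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError, AttributeError):
            # Hand the original bytes back marked ProcessingFailed so that
            # Firehose can route the record to its S3 error output.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Marking a record `ProcessingFailed` keeps one bad payload from failing the whole batch, which is one reason the backup bucket matters.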

1032

MCQeasy

The exhibit shows the output of describing an Amazon Kinesis Data Stream. A producer is sending records but the consumer is not receiving all records. What is the most likely cause?

A.The stream has only one shard, causing write throttling

B.The stream is in ACTIVE status, which prevents reading

C.The retention period is too short

D.The hash key range is too wide

AnswerA

With one shard, write throughput is limited; exceeding it causes throttling and missed records.

Why this answer

The stream has only one shard, which provides a maximum throughput of 1 MB/s or 1000 records/s for writes. If the producer exceeds this, records will be throttled. The retention period is 24 hours, which is fine.

The stream status is ACTIVE. There is no indication of a faulty shard. The consumer might be slow, but the question asks for cause of not receiving all records; throttling due to insufficient shards is a common issue.
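
The per-shard write limits quoted above can be turned into a quick sizing check. This is a minimal sketch, assuming provisioned mode and a hypothetical workload; the function name and the sample numbers are illustrative only.

```python
import math

# Per-shard write limits for Kinesis Data Streams (provisioned mode).
MAX_MB_PER_SEC = 1.0
MAX_RECORDS_PER_SEC = 1000


def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_bytes = math.ceil(mb_per_sec / MAX_MB_PER_SEC)
    by_count = math.ceil(records_per_sec / MAX_RECORDS_PER_SEC)
    return max(1, by_bytes, by_count)


# Hypothetical workload: 2,500 records/s at 2 KB each.
print(shards_needed(2500, 2))  # prints 5
```

Here the byte limit (about 4.9 MB/s) dominates the record-count limit, so a single shard would throttle most of the writes.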

1033

Multi-Selecthard

A data engineer is troubleshooting an AWS Glue ETL job that reads from an S3 bucket and writes to a DynamoDB table. The job fails with an AccessDeniedException. The IAM role attached to the Glue job has the policy shown in the exhibit. Which TWO additional permissions are required to resolve the issue?

Select 2 answers

A.kms:Decrypt

B.dynamodb:DescribeTable

C.iam:PassRole

D.s3:ListBucket

E.glue:GetJobRun

AnswersC, E

Glue needs to pass the role to itself.

Why this answer

Option C is correct because the AWS Glue job must pass the IAM role to itself when it runs, which requires the `iam:PassRole` permission. Without this, the job cannot assume the role and will fail with an AccessDeniedException, even if the role has the necessary S3 and DynamoDB permissions.

Exam trap

The trap here is that candidates often focus on the S3 or DynamoDB permissions listed in the exhibit and overlook the fundamental requirement for Glue to pass the role to itself, which is a prerequisite for any Glue job execution.

1034

MCQeasy

A company needs to ingest data from a MySQL database into Amazon S3 in near real-time. The database is running on EC2. The data engineer wants to minimize the impact on the source database. Which service should be used?

A.AWS Database Migration Service (DMS) with ongoing replication

B.AWS Glue ETL job with a JDBC connection

C.Amazon RDS for MySQL with read replica

D.AWS Schema Conversion Tool (SCT)

AnswerA

DMS CDC uses binary logs to capture changes with minimal overhead.

Why this answer

AWS DMS with ongoing replication (change data capture) is the correct choice because it can continuously replicate changes from a MySQL source database to Amazon S3 with minimal performance impact. DMS uses a transactional log-based approach (MySQL binlog) to capture changes as they occur, avoiding heavy SELECT queries on the source. This enables near real-time ingestion without adding significant load to the production database.

Exam trap

The trap here is that candidates often confuse AWS Glue's batch JDBC capabilities with streaming ingestion, or assume that a read replica can directly feed data into S3 without an intermediary service like DMS or Kinesis.

How to eliminate wrong answers

Option B is wrong because AWS Glue ETL jobs with JDBC connections run batch queries that pull full table snapshots or large result sets, which can cause significant performance degradation on the source MySQL database and cannot achieve near real-time latency. Option C is wrong because Amazon RDS for MySQL with a read replica is a database migration or read scaling solution, not a data ingestion service to S3; it does not natively stream data to S3 without additional tooling. Option D is wrong because AWS Schema Conversion Tool (SCT) is designed for converting database schemas between different database engines (e.g., Oracle to Aurora), not for ingesting data into S3.

1035

MCQhard

A financial services company stores transaction data in Amazon RDS for PostgreSQL. The company requires that all changes to the database be logged for audit purposes, including before and after images of updated rows. Which feature should the data engineer enable?

A.Enable automated backups and export logs to Amazon S3

B.Enable Enhanced Monitoring and publish logs to CloudWatch Logs

C.Set up logical replication using pglogical or native publication/subscription

D.Enable Multi-AZ deployment and read replicas

AnswerC

Logical replication provides row-level changes with before and after images.

Why this answer

Option C is correct because logical replication, using either pglogical or native PostgreSQL publication/subscription, captures row-level changes (INSERT, UPDATE, DELETE) and can include both the old and new values of updated rows. This meets the audit requirement for before-and-after images, as logical replication decodes the write-ahead log (WAL) to produce a change stream that includes full row snapshots.

Exam trap

The trap here is that candidates confuse database-level logging features (like Enhanced Monitoring or automated backups) with row-level change data capture, assuming any logging mechanism will capture before-and-after images, when only logical replication (or triggers with audit tables) provides that granularity.

How to eliminate wrong answers

Option A is wrong because automated backups capture point-in-time snapshots of the entire database, not a continuous, row-level change stream with before-and-after images; exporting logs to S3 provides error logs or slow query logs, not row-level audit trails. Option B is wrong because Enhanced Monitoring collects OS-level metrics (CPU, memory, disk I/O) and publishes them to CloudWatch Logs, not database row changes. Option D is wrong because Multi-AZ deployment provides high availability via synchronous standby replication, and read replicas serve read traffic; neither logs individual row modifications or provides before-and-after images.

1036

MCQhard

A company uses Amazon EMR to process large datasets stored in Amazon S3. The cluster uses a transient configuration and stores intermediate data on HDFS. After a job fails due to a spot instance termination, the data engineer needs to rerun the job. What should the engineer do to minimize data loss and cost?

A.Use a long-running cluster with all on-demand instances to avoid interruptions.

B.Configure the cluster with a mix of spot and on-demand instances and set HDFS replication to 3.

C.Configure the cluster to use EMRFS and store intermediate data in S3.

D.Increase the HDFS replication factor to 5 and use only spot instances.

AnswerB

This balances cost and reliability; on-demand instances provide stability for HDFS NameNode and critical nodes.

Why this answer

Option B is correct because using a mix of spot and on-demand instances balances cost savings with fault tolerance, while setting HDFS replication to 3 ensures that intermediate data on HDFS survives the loss of a single node (e.g., a terminated spot instance). This allows the job to resume from the last checkpoint or reduce recomputation, minimizing both data loss and cost.

Exam trap

The trap here is that candidates assume storing intermediate data in S3 (EMRFS) is always better for durability, but they overlook that HDFS replication with spot/on-demand mix provides a cheaper, faster recovery for transient cluster workloads without incurring S3 write costs.

How to eliminate wrong answers

Option A is wrong because using all on-demand instances eliminates cost savings from spot instances and does not address the transient cluster design; a long-running cluster increases costs unnecessarily. Option C is wrong because storing intermediate data in S3 via EMRFS introduces higher latency and cost for transient data, and EMRFS does not natively support checkpointing for HDFS-dependent intermediate data. Option D is wrong because increasing HDFS replication to 5 consumes more storage and network bandwidth, and using only spot instances increases the risk of frequent failures without on-demand fallback, leading to higher recomputation costs.

1037

MCQeasy

A data engineer needs to audit all AWS KMS key usage in the account. Which AWS service should be used to record KMS API calls?

A.AWS CloudTrail

B.AWS Config

C.Amazon CloudWatch Logs

D.Amazon GuardDuty

AnswerA

CloudTrail records KMS API calls.

Why this answer

Option A is correct because AWS CloudTrail records API calls for KMS. Option C is wrong because CloudWatch Logs stores logs but does not itself record API calls. Option B is wrong because AWS Config records resource configuration changes, not API calls.

Option D is wrong because Amazon GuardDuty is for threat detection.

1038

MCQhard

A data pipeline uses Amazon Kinesis Data Firehose to ingest log data from web servers and deliver it to Amazon S3. The data is then transformed by an AWS Glue job before being loaded into Amazon Redshift. The pipeline must handle a sudden spike in log volume without data loss. Which configuration change is MOST appropriate?

A.Increase the AWS Glue job timeout and allocate more DPUs.

B.Configure Kinesis Data Firehose to back up all data to S3 in case of delivery failures.

C.Increase the number of nodes in the Redshift cluster to handle higher load.

D.Increase the S3 bucket size limit and enable versioning.

AnswerB

S3 backup for failed records ensures no data loss.

Why this answer

Option B is correct because enabling S3 backup for failed records in Firehose ensures that if delivery fails after retries, the data is stored in S3 and can be reprocessed later, preventing data loss. Option D is wrong because S3 has no bucket size limit to increase, and versioning does not protect data that was never delivered. Option A is wrong because increasing the Glue job timeout does not prevent data loss during ingestion.

Option C is wrong because adding Redshift nodes does not affect the ingestion pipeline.

1039

MCQeasy

A data engineer needs to store archival data that is rarely accessed but must be retained for 7 years. The data should be retrievable within 12 hours. Which Amazon S3 storage class is MOST cost-effective?

A.S3 Intelligent-Tiering

B.S3 Glacier Flexible Retrieval

C.S3 Standard

D.S3 Glacier Deep Archive

AnswerD

Lowest cost with 12-hour retrieval.

Why this answer

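Option D is correct because S3 Glacier Deep Archive is the cheapest storage class and supports retrieval within 12 hours. Option C (S3 Standard) is the most expensive choice for rarely accessed data. Option B (S3 Glacier Flexible Retrieval) retrieves faster than required and costs more.

Option A (S3 Intelligent-Tiering) adds a monitoring charge and moves objects to its deepest archive tiers only if those optional tiers are enabled.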

1040

MCQeasy

A data engineer needs to store semi-structured JSON files that are accessed infrequently but must be retrievable within minutes. The data should be stored cost-effectively. Which storage solution meets these requirements?

A.Amazon S3 Glacier Flexible Retrieval storage class.

B.Amazon S3 Glacier Deep Archive storage class.

C.Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class.

D.Amazon S3 Standard storage class.

AnswerC

S3 Standard-IA is cost-effective for infrequent access with millisecond retrieval.

Why this answer

Option C is correct because S3 Standard-IA is priced for infrequently accessed data and still returns objects in milliseconds, which comfortably meets the within-minutes requirement. Option D (S3 Standard) is priced for frequent access and costs more to store.

Option A (S3 Glacier Flexible Retrieval) is an archive class whose retrievals take minutes to hours depending on the retrieval option, so it does not reliably meet the requirement. Option B (S3 Glacier Deep Archive) is an archive class with retrieval times of up to 12 hours.

1041

Multi-Selectmedium

A data engineer is monitoring an Amazon Kinesis Data Analytics for Apache Flink application that processes streaming data. The application is falling behind (increasing 'MillisBehindLatest') and the CPU utilization of the Flink task managers is consistently above 80%. Which THREE actions should the engineer take to improve performance? (Choose THREE.)

Select 3 answers

A.Increase the number of shards in the Kinesis data stream.

B.Decrease the checkpoint interval to reduce state size.

C.Enable auto-scaling for the Flink application.

D.Decrease the number of task managers to reduce CPU contention.

E.Increase the Flink application's parallelism.

AnswersA, C, E

More shards allow higher ingestion rate.

Why this answer

Increasing the number of shards in the Kinesis data stream (Option A) directly increases the ingestion capacity and parallelism source for the Flink application. With more shards, the application can read data from more partitions concurrently, reducing the backlog indicated by 'MillisBehindLatest'. This is a fundamental scaling action for Kinesis-based Flink applications.

Exam trap

The trap here is that candidates often confuse decreasing checkpoint intervals with improving performance, not realizing that more frequent checkpoints increase CPU and I/O overhead, making the lag worse.

1042

MCQeasy

A data engineer is designing a data lake on Amazon S3. The data is accessed frequently for the first 30 days, then rarely accessed after 90 days, and must be archived after 1 year. Which S3 lifecycle policy configuration meets these requirements with the lowest cost?

A.Transition to S3 One Zone-IA after 30 days, then to S3 Glacier Deep Archive after 365 days.

B.Transition to S3 Glacier Flexible Retrieval after 90 days.

C.Transition to S3 Glacier Instant Retrieval after 30 days, then expire after 365 days.

D.Transition to S3 Standard-IA after 30 days, then to S3 Glacier Deep Archive after 365 days.

AnswerD

Standard-IA is cost-effective for infrequent access, and Deep Archive is the cheapest archival tier.

Why this answer

Option D is correct because it transitions to S3 Standard-IA after 30 days (cost-effective for infrequent access) and to S3 Glacier Deep Archive after 365 days (lowest-cost archival). Option A transitions to S3 One Zone-IA, which stores data in a single Availability Zone and is not recommended when durability matters. Option B transitions to S3 Glacier Flexible Retrieval, which costs more than Deep Archive.

Option C transitions to S3 Glacier Instant Retrieval, which is expensive for archival, and then expires the objects after 365 days instead of archiving them.
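
A lifecycle rule for the keyed answer (Standard-IA at 30 days, Glacier Deep Archive at 365 days) might look like the sketch below. The dictionary follows the shape of the S3 lifecycle configuration API; the rule ID is a made-up name, and the empty prefix is an assumption that the rule should cover the whole bucket.

```python
import json

# Lifecycle rule: Standard-IA at 30 days, Glacier Deep Archive at 365 days.
# The rule ID is a hypothetical name chosen for this example.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-then-archive",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix applies to every object
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3, a dictionary of this shape is what `put_bucket_lifecycle_configuration` takes as its `LifecycleConfiguration` argument.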

1043

Multi-Selecteasy

A data engineer is setting up a new Amazon Redshift cluster for a data warehouse. The engineer wants to ensure data durability and high availability. Which THREE features should the engineer consider? (Choose three.)

Select 3 answers

A.S3 Cross-Region Replication for Redshift data.

B.Cross-Region snapshot copy.

C.Multi-node cluster with data replication.

D.Multi-AZ deployment for automatic failover.

E.Automated snapshots to Amazon S3.

AnswersB, C, E

Cross-Region copies protect against region failures.

Why this answer

Options B, C, and E are correct. E: Automated snapshots to Amazon S3 provide point-in-time recovery. B: Cross-Region snapshot copy protects against regional failures.

C: Multi-node clusters replicate data within the cluster. Option D is wrong under this answer key, which assumes a Redshift cluster runs in a single Availability Zone; Multi-AZ deployments are available only for RA3 clusters. Option A is wrong because S3 Cross-Region Replication is for S3, not Redshift.

1044

MCQeasy

A retail company stores customer transaction data in an Amazon S3 bucket. The data is encrypted using server-side encryption with AWS KMS (SSE-KMS). The company uses an IAM role to allow an Amazon Athena query service to read the data. The data engineer creates a new Athena workgroup and attempts to run a query on the S3 bucket. The query fails with an access denied error. The IAM role has permissions to decrypt the KMS key and read from the bucket. The engineer checks the S3 bucket policy and finds that it does not explicitly allow access. What is the most likely cause of the failure?

A.The S3 bucket is in a different AWS account than the Athena workgroup.

B.The S3 bucket policy does not grant the required permissions to the Athena service principal.

C.The IAM role does not have permission to use the KMS key for encryption operations.

D.Athena does not support querying data encrypted with SSE-KMS.

AnswerB

The S3 bucket policy must explicitly allow the Athena service or the IAM role to access the bucket.

Why this answer

Option B is correct because, although the IAM role has permissions, the S3 bucket policy does not grant access to the Athena service or the role. Option C is wrong because the IAM role already has permission to use the KMS key. Option A is wrong because cross-account access is not mentioned; the role is in the same account.

Option D is wrong because Athena can query SSE-KMS encrypted data when it has the proper permissions.

1045

MCQmedium

A company stores sensitive data in Amazon S3. They need to ensure that all objects are encrypted at rest. Which approach meets this requirement with minimal effort?

A.Use client-side encryption before uploading

B.Enable default encryption on the S3 bucket with SSE-S3

C.Enable S3 Versioning and MFA Delete

D.Use a bucket policy to deny PutObject without encryption

AnswerB

Automatically encrypts all new objects with minimal effort.

Why this answer

Enabling S3 default encryption (SSE-S3) ensures all new objects are encrypted automatically. Bucket policies can enforce encryption but require more effort.
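
To show how little configuration default encryption needs, here is a minimal sketch of the setting as a Python dictionary. The structure mirrors the S3 bucket encryption API; the helper name `default_encryption` is hypothetical.

```python
import json


def default_encryption(algorithm: str = "AES256") -> dict:
    """Build a default-encryption configuration; "AES256" selects SSE-S3."""
    return {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": algorithm}}
        ]
    }


print(json.dumps(default_encryption(), indent=2))
```

With boto3, this dictionary is what `put_bucket_encryption` takes as `ServerSideEncryptionConfiguration`; a deny-based bucket policy would need a separate policy document on top of this.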

1046

MCQmedium

Refer to the exhibit. An IAM policy is attached to a user. The user cannot upload objects to the S3 bucket 'example-bucket' using the AWS CLI. What is the most likely cause?

A.The user is not using HTTPS for API calls

B.The policy does not allow s3:PutObject

C.The user is not in the same AWS region

D.The resource ARN does not include the bucket itself

AnswerA

The condition aws:SecureTransport requires HTTPS.

Why this answer

The IAM policy explicitly denies all actions unless the request uses HTTPS (via the `aws:SecureTransport` condition key). The AWS CLI uses HTTPS by default, so the denied upload implies the request was sent over plain HTTP, for example through an `http://` endpoint override. The `s3:PutObject` action is allowed in the policy, but the Deny statement with this condition overrides that permission when the request is not made over HTTPS.

Exam trap

AWS often tests the `aws:SecureTransport` condition key as a hidden denial, leading candidates to incorrectly assume the action is missing or the ARN is malformed.

How to eliminate wrong answers

Option B is wrong because the policy does allow `s3:PutObject` on the bucket; the issue is the condition key, not the action. Option C is wrong because IAM permissions do not depend on the region the user calls from unless a policy condition requires it, and an S3 ARN contains no region component. Option D is wrong because the resource ARN `arn:aws:s3:::example-bucket/*` correctly includes all objects within the bucket, which is the standard way to grant object-level permissions; the bucket itself is not needed for object uploads.
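
The interaction between an Allow and a conditional Deny can be sketched with a toy evaluator. The exhibit is not reproduced here, so the policy below is an assumed reconstruction, and the evaluator handles only single-string actions, single resources, and `Bool` conditions; it is not how IAM is implemented.

```python
from fnmatch import fnmatchcase

# Assumed policy: allow uploads, but deny everything sent without TLS.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Deny", "Action": "s3:*",
         "Resource": "arn:aws:s3:::example-bucket/*",
         "Condition": {"Bool": {"aws:SecureTransport": "false"}}},
    ],
}


def statement_applies(statement, action, resource, context):
    """Check action, resource, and Bool conditions for one statement."""
    if not fnmatchcase(action, statement["Action"]):
        return False
    if not fnmatchcase(resource, statement["Resource"]):
        return False
    bool_conditions = statement.get("Condition", {}).get("Bool", {})
    return all(
        str(context.get(key)).lower() == expected
        for key, expected in bool_conditions.items()
    )


def evaluate(policy, action, resource, context):
    """Explicit Deny wins; otherwise any matching Allow grants access."""
    decision = "ImplicitDeny"
    for statement in policy["Statement"]:
        if statement_applies(statement, action, resource, context):
            if statement["Effect"] == "Deny":
                return "Deny"
            decision = "Allow"
    return decision


target = "arn:aws:s3:::example-bucket/report.csv"
print(evaluate(POLICY, "s3:PutObject", target, {"aws:SecureTransport": False}))  # Deny
print(evaluate(POLICY, "s3:PutObject", target, {"aws:SecureTransport": True}))   # Allow
```

The same upload is denied over HTTP and allowed over HTTPS, which is why the missing-action and malformed-ARN distractors do not fit.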

1047

MCQeasy

A data engineer is designing a data lake on Amazon S3. The data includes customer PII that must be encrypted at rest. The company also requires that the encryption keys be rotated automatically every year. Which encryption solution should the engineer use?

A.SSE-KMS with automatic key rotation enabled

B.SSE-S3

C.SSE-C

D.Client-side encryption with AWS KMS

AnswerA

SSE-KMS allows you to use a customer managed key with automatic annual rotation, giving you control and auditability.

Why this answer

SSE-KMS with automatic key rotation enabled meets both requirements: it encrypts data at rest in S3 and allows the company to automatically rotate the customer master key (CMK) every year. AWS KMS supports automatic annual rotation for symmetric CMKs, which satisfies the compliance need without manual intervention.

Exam trap

The trap here is that candidates often assume SSE-S3 satisfies the rotation requirement, but SSE-S3 rotates keys on a schedule managed entirely by AWS and does not allow customer control over the rotation frequency, whereas SSE-KMS with automatic rotation meets the explicit annual requirement.

How to eliminate wrong answers

Option B (SSE-S3) is wrong because, while it encrypts data at rest, its keys are managed and rotated by S3 on a schedule the customer cannot define or audit. Option C (SSE-C) is wrong because it requires the customer to provide and manage their own encryption keys, and AWS does not handle key rotation, making it unsuitable for automated annual rotation. Option D (Client-side encryption with AWS KMS) is wrong because the application must encrypt data before sending it to S3, which adds complexity that server-side encryption with a rotating KMS key avoids.

1048

Multi-Selecteasy

A data engineer is designing a data pipeline that processes streaming data. The pipeline must be able to handle duplicate records and ensure exactly-once processing semantics. Which THREE AWS services or features should the engineer consider? (Choose three.)

Select 3 answers

A.Amazon EMR with Apache Flink for exactly-once semantics.

B.Amazon Kinesis Data Firehose with automatic retries.

C.Amazon Kinesis Data Streams with sequence numbers for deduplication.

D.Amazon DynamoDB Streams for change data capture.

E.Amazon Kinesis Data Analytics for Apache Flink with idempotent sinks.

AnswersA, C, E

Flink on EMR provides exactly-once processing via checkpointing.

Why this answer

Option C is correct because Kinesis Data Streams provides sequence numbers for records, enabling deduplication. Option E is correct because Kinesis Data Analytics for Apache Flink can use idempotent sinks to achieve exactly-once results. Option A is correct because Apache Flink on Amazon EMR provides exactly-once processing through checkpointing.

Option B is incorrect because Kinesis Data Firehose delivers at-least-once, not exactly-once. Option D is incorrect because DynamoDB Streams is a change data capture feature for DynamoDB tables, not a mechanism for deduplicating a general streaming pipeline.
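
The idea of an idempotent sink can be sketched in a few lines. This is a toy in-memory model, assuming that a shard ID plus sequence number identifies a delivered record; the class and field names are hypothetical, and it only covers redelivery to the consumer, not a producer retry that writes the same payload twice under different sequence numbers.

```python
class IdempotentSink:
    """Toy sink that ignores records it has already applied."""

    def __init__(self):
        self.seen = set()   # (shard_id, sequence_number) pairs already written
        self.rows = []      # payloads applied exactly once

    def write(self, shard_id: str, sequence_number: str, payload: dict) -> bool:
        """Apply the record once; return False when it is a duplicate."""
        key = (shard_id, sequence_number)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.rows.append(payload)
        return True


sink = IdempotentSink()
# Simulated at-least-once delivery: the second record arrives twice.
deliveries = [
    ("shard-0", "100", {"order": 1}),
    ("shard-0", "101", {"order": 2}),
    ("shard-0", "101", {"order": 2}),
]
for shard_id, seq, payload in deliveries:
    sink.write(shard_id, seq, payload)

print(len(sink.rows))  # prints 2
```

A real sink would keep the seen keys in durable storage so that the check survives a restart.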

1049

MCQmedium

A data engineer needs to audit all access to an S3 bucket containing sensitive customer data. The engineer must record the requester, timestamp, action, and whether the access was denied. Which AWS solution meets these requirements?

A.Use AWS Config to record S3 bucket-level configuration changes.

B.Enable VPC Flow Logs for the VPC where the bucket resides.

C.Enable AWS CloudTrail Data Events for the S3 bucket.

D.Enable S3 server access logs for the bucket, storing them in a different bucket.

AnswerD

S3 server access logs provide detailed records of all requests, including requester and access status.

Why this answer

Option D is correct because S3 server access logs capture detailed records of requests made to a bucket, including requester, timestamp, action, and response status. Option C is wrong under this key because, although CloudTrail data events also record object-level API calls, server access logs are designed for request-level auditing of a bucket and include the HTTP status of each request. Option B is wrong because VPC Flow Logs capture network traffic metadata but not application-level S3 operations.

Option A is wrong because AWS Config tracks resource configuration changes, not access requests.

1050

MCQhard

A company runs an e-commerce platform on AWS. The product catalog is stored in Amazon DynamoDB with a table that has a partition key of 'product_id' and a sort key of 'category'. The application frequently queries products by category and by product_id. Recently, the operations team noticed that read latency has increased significantly for queries that filter by category. The DynamoDB table has auto scaling enabled. The data engineer examines the CloudWatch metrics and sees that the ReadThrottleEvents metric is non-zero for the table, but the consumed read capacity is well below the provisioned limit. The table has a global secondary index (GSI) on the 'category' attribute. Which action is most likely to resolve the latency issue?

A.Switch the table to DynamoDB on-demand capacity mode.

B.Enable DynamoDB Accelerator (DAX) to cache read queries.

C.Increase the provisioned read capacity on the main table.

D.Redesign the GSI partition key to include a random suffix to distribute load across multiple partitions.

AnswerD

This prevents a hot partition on the GSI.

Why this answer

The issue is that the GSI on 'category' is experiencing hot partitions because 'category' has low cardinality, causing uneven data distribution. The non-zero ReadThrottleEvents on the GSI (not the main table) indicate throttling on the GSI's provisioned capacity, even though the main table's consumed read capacity is below its limit. Adding a random suffix to the GSI partition key distributes reads across multiple physical partitions, reducing hot spots and latency.

Exam trap

The trap here is that candidates assume throttling always relates to the base table's provisioned capacity, overlooking that GSIs have independent capacity and can throttle even when the base table is underutilized, especially when the GSI partition key is a low-cardinality attribute like 'category'.

How to eliminate wrong answers

Option A is wrong because switching to on-demand mode would not resolve the hot partition issue; it only eliminates the need to manage provisioned capacity but does not fix the underlying data skew that causes throttling on the GSI. Option B is wrong because DynamoDB Accelerator (DAX) caches read results but does not address the root cause of throttling on the GSI due to uneven partition access; it would only mask the symptom for cached queries. Option C is wrong because increasing provisioned read capacity on the main table does not affect the GSI's separate capacity; the throttling is occurring on the GSI, not the base table, and the consumed read capacity on the main table is already well below its limit.
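
The random-suffix technique can be sketched as follows. The suffix count of 10 and the `category#n` key format are assumptions for illustration; a real design would size the suffix range to the observed skew.

```python
import random

SHARD_COUNT = 10  # hypothetical number of suffixes per category


def sharded_category(category: str, rng=random) -> str:
    """Pick a GSI partition key value for a new item, e.g. 'books#7'."""
    return f"{category}#{rng.randrange(SHARD_COUNT)}"


def all_shard_keys(category: str) -> list:
    """List every key value a reader must query to cover one category."""
    return [f"{category}#{n}" for n in range(SHARD_COUNT)]


print(sharded_category("books"))
print(all_shard_keys("books"))
```

Writes spread across ten key values instead of one, and a read by category becomes ten smaller queries whose results are merged, which is the trade-off this design accepts.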
