Knowledge + Practice

AWS Certified Data Engineer Associate (DEA-C01): Questions 676–750

1786 questions total · 24 pages · All types, answers revealed

676

MCQ · hard

A company uses Kinesis Data Streams to ingest clickstream data. They notice that the data processing latency increases as the number of shards grows. What is the most likely cause and solution?

A. Reduce the number of shards or increase the number of consumers.

B. Increase the Kinesis Producer Library (KPL) batch size.

C. Use enhanced fan-out to allow multiple consumers per shard.

D. Increase the number of shards to handle more data.

Answer: A

Balancing shards and consumers ensures each shard is processed, reducing latency.

Why this answer

Option A is correct because when there are more shards than consumers, some shards are idle, leading to underutilization and increased latency. Option D is wrong because increasing shards would worsen the imbalance. Option C is wrong because enhanced fan-out is for multiple consumers reading the same shard, not for a shortage of consumers.

Option B is wrong because increasing the batch size might help throughput but does not fix the fundamental shard-consumer mismatch.

677

Drag & Drop · medium

Order the steps to migrate an on-premises database to Amazon RDS using AWS DMS.

Why this order

First, create the replication instance. Then configure endpoints, create the migration task, start it, and finally validate the migrated data.

678

MCQ · medium

A company stores sensitive customer data in an S3 bucket. The security team requires that all data be encrypted at rest using a customer-managed AWS KMS key. However, when a data engineer attempts to upload an object using the AWS CLI, the upload fails with an access denied error. The engineer has s3:PutObject permission on the bucket. Which additional permission is most likely missing?

A. kms:CreateKey

B. kms:Decrypt

C. s3:PutObjectAcl

D. kms:GenerateDataKey

Answer: D

Required to generate a data key for server-side encryption.

Why this answer

To upload an object with SSE-KMS, the IAM user or role must have kms:GenerateDataKey permission so that a data key can be generated for encryption. Option D is correct because without it, the upload fails. Option B is wrong because kms:Decrypt is for decryption, not for this upload.

Option A is wrong because kms:CreateKey is for creating keys, not using them. Option C is wrong because s3:PutObjectAcl is for ACLs, not encryption.
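
As a minimal sketch of the permissions discussed in this answer, the Python below builds an identity-based policy that pairs s3:PutObject with kms:GenerateDataKey. The function name, bucket name, and key ARN are hypothetical placeholders chosen for illustration; they do not come from the question.

```python
import json


def build_sse_kms_upload_policy(bucket_name: str, kms_key_arn: str) -> dict:
    """Return an IAM policy document allowing SSE-KMS uploads to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowObjectUpload",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # Needed so a data key can be generated for the upload
                "Sid": "AllowDataKeyGeneration",
                "Effect": "Allow",
                "Action": "kms:GenerateDataKey",
                "Resource": kms_key_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Both arguments are made-up example values
    policy = build_sse_kms_upload_policy(
        "example-bucket",
        "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    )
    print(json.dumps(policy, indent=2))
```

Scoping the KMS statement to a single key ARN, rather than "*", keeps the grant as narrow as the scenario needs.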

679

Multi-Select · medium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The delivery stream is failing with 'Insufficient capacity' errors. Which THREE actions should the data engineer take to resolve this issue? (Choose THREE.)

Select 3 answers

A. Enable S3 bucket versioning to handle concurrent writes.

B. Increase the buffer size and buffer interval in the Firehose delivery stream configuration.

C. Configure a CloudWatch alarm to monitor the error rate.

D. Request a service quota increase for Kinesis Data Firehose.

E. Increase the number of shards in the source Kinesis data stream.

Answers: B, D, E

Larger buffers reduce the frequency of writes, lowering capacity needs.

Why this answer

Options B, D, and E are correct. B: Increasing the buffer size and interval allows Firehose to batch more records, reducing the number of PUT requests. E: Increasing the number of shards in the source Kinesis stream provides more write capacity.

D: Requesting a service quota increase for Firehose can raise the default limits. Option A is wrong because S3 bucket versioning does not affect Firehose capacity. Option C is wrong because CloudWatch alarms only alert; they do not resolve capacity issues.

680

MCQ · easy

An e-commerce application uses Amazon ElastiCache for Redis to cache product catalog data. The cache currently uses lazy loading. The team wants to ensure that frequently accessed product data is always fresh. Which caching strategy should they implement?

A. Write-through caching

B. Set a TTL of 5 minutes for all cached items

C. Use database read replicas to serve data

D. Lazy loading with TTL

Answer: A

Write-through updates cache directly on writes, ensuring data is always fresh.

Why this answer

Write-through caching ensures that data is written to the cache simultaneously with the database, guaranteeing that frequently accessed product data is always fresh. This strategy eliminates stale reads by synchronously updating the cache on every write, which directly addresses the requirement for freshness without relying on expiration or lazy population.

Exam trap

The trap here is that candidates often assume lazy loading with a short TTL is sufficient for freshness, but the exam tests the understanding that only write-through (or write-behind) strategies guarantee synchronous cache updates without relying on expiration windows.

How to eliminate wrong answers

Option B is wrong because setting a TTL of 5 minutes does not guarantee freshness; data can still become stale within the TTL window, and frequently accessed items may be served from the cache even after they have been updated in the database. Option C is wrong because database read replicas serve stale data asynchronously and do not cache product data in ElastiCache, failing to meet the caching freshness requirement. Option D is wrong because lazy loading with TTL still allows stale data to be served until the TTL expires or a cache miss triggers a refresh, which does not ensure that frequently accessed data is always fresh.
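
The write-through strategy in this answer can be sketched in a few lines of Python. This is an illustration only: two in-memory dictionaries stand in for the database and for ElastiCache, and the class and method names are hypothetical.

```python
from typing import Optional


class WriteThroughCache:
    """In-memory stand-in: two dicts play the roles of database and cache."""

    def __init__(self) -> None:
        self.database = {}
        self.cache = {}

    def write(self, key: str, value: str) -> None:
        # Write-through: update the database and the cache in the same operation
        self.database[key] = value
        self.cache[key] = value

    def read(self, key: str) -> Optional[str]:
        # Serve from the cache when possible; fall back to the database on a miss
        if key in self.cache:
            return self.cache[key]
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value
```

Because every write touches the cache, a read that follows an update never returns the old value, which is the property a TTL alone cannot provide.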

681

MCQ · hard

A company uses AWS Glue to process JSON logs from S3. The logs have a nested structure and the schema evolves over time. The data engineer needs to ensure the Glue job can handle schema changes without failing. Which configuration should be used?

A. Manually update the table schema in the Glue Data Catalog before each run

B. Use Spark SQL with a static schema definition in the script

C. Set the job parameter '--enable-glue-datacatalog' and '--mergeDynamicColumns' to true

D. Enable AWS Glue Schema Registry and define a schema version

Answer: C

This allows Glue DynamicFrame to merge schema variations automatically.

Why this answer

Option C is correct because setting 'mergeDynamicColumns' to true in the Glue job parameters allows new columns to be added dynamically. Option D (Schema Registry) is for schema validation, not evolution. Option A (manual schema updates) is not automated.

Option B (Spark SQL with a static schema) does not solve evolution.

682

MCQ · medium

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is approximately 500 GB per day. The source database is behind a firewall that does not allow direct internet access. Which service should the engineer use to transfer the data securely?

A. AWS DataSync with a network path through AWS Direct Connect or VPN.

B. AWS Database Migration Service (AWS DMS) with ongoing replication from Oracle to S3.

C. Amazon S3 Transfer Acceleration with a public endpoint.

D. AWS Snowball Edge device for daily transfers.

Answer: A

DataSync is designed for scheduled transfers to S3.

Why this answer

Option A is correct because AWS DataSync can transfer data from on-premises to S3, and it supports private network connectivity via AWS Direct Connect or VPN. Option B is wrong because AWS DMS is for ongoing replication, not bulk daily transfers. Option C is wrong because S3 Transfer Acceleration requires internet access.

Option D is wrong because Snowball Edge is for offline transfer, not daily.

683

MCQ · medium

A company runs a SQL Server transactional database on Amazon RDS. They need to capture change data (inserts, updates, deletes) in near real-time and replicate them to an Amazon S3 data lake. Which AWS service is most suitable?

A. AWS Database Migration Service (DMS) with change data capture

B. AWS Glue DataBrew

C. Amazon Kinesis Data Streams with Kinesis Client Library

D. Amazon Redshift Spectrum

Answer: A

DMS supports ongoing replication with CDC and can write to S3.

Why this answer

AWS DMS with CDC continuously captures changes from the source and writes them to S3 in CSV or Parquet format. Kinesis Data Streams, Glue DataBrew, and Redshift Spectrum are not designed to capture changes from RDS and deliver them to S3 directly.

684

MCQ · hard

A data engineer at a media company is managing an Amazon RDS for MySQL database that stores user profiles and preferences. The database has been running on a db.r5.large instance with 500 GB of General Purpose SSD (gp2) storage. Recently, the application team has noticed increased query latency during peak hours. Amazon CloudWatch metrics show that the ReadIOPS metric is consistently peaking at 5,000 IOPS, which exceeds what the gp2 volume can sustain (1,500 IOPS baseline for 500 GB, with bursts up to 3,000 IOPS for short periods). The database is not CPU-bound, and memory utilization is moderate. The data engineer needs to resolve the I/O bottleneck with minimal cost increase. The company is open to changing the storage type or instance class, but wants to avoid over-provisioning. What should the data engineer do?

A. Change the storage type to General Purpose SSD (gp3) and set the provisioned IOPS to 5,000.

B. Enable Multi-AZ deployment to offload reads to the standby instance.

C. Change the storage type to Provisioned IOPS SSD (io1) and provision 5,000 IOPS.

D. Upgrade the instance to a db.r5.xlarge to get more memory and reduce I/O.

Answer: A

gp3 provides a baseline of 3,000 IOPS and can be scaled up to 5,000 at lower cost than io1.

Why this answer

Option A is correct because gp3 provides a baseline of 3,000 IOPS and 125 MB/s throughput at no additional cost, and IOPS can be increased independently of storage size. This would give 5,000 IOPS without the burst limitations of gp2. Option C is incorrect because moving to Provisioned IOPS (io1) would be more expensive for the same 5,000 IOPS.

Option D is incorrect because a larger instance class does not directly improve IOPS; it adds unnecessary memory cost. Option B is incorrect because Multi-AZ does not improve read IOPS performance; the standby is not used for reads.
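
The IOPS figures in this question can be checked with a short calculation. The sketch below assumes the commonly documented gp2 rule of 3 IOPS per GiB with a floor of 100 and a ceiling of 16,000, and takes the 3,000 IOPS gp3 baseline from the explanation above; the function and constant names are illustrative, and the limits should be confirmed against current AWS documentation.

```python
GP2_IOPS_PER_GIB = 3       # assumed gp2 scaling rule
GP2_MIN_IOPS = 100         # assumed gp2 floor
GP2_MAX_IOPS = 16_000      # assumed gp2 ceiling
GP3_BASELINE_IOPS = 3_000  # gp3 baseline quoted in the explanation


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume under the assumed 3 IOPS/GiB rule."""
    return min(max(size_gib * GP2_IOPS_PER_GIB, GP2_MIN_IOPS), GP2_MAX_IOPS)


def iops_shortfall(required_iops: int, available_iops: int) -> int:
    """IOPS still missing once the available baseline is subtracted."""
    return max(required_iops - available_iops, 0)


if __name__ == "__main__":
    baseline = gp2_baseline_iops(500)
    print("gp2 baseline:", baseline)
    print("gap on gp2:", iops_shortfall(5_000, baseline))
    print("extra IOPS to provision on gp3:", iops_shortfall(5_000, GP3_BASELINE_IOPS))
```

Under these assumptions the 500 GB gp2 volume has a 1,500 IOPS baseline, leaving a 3,500 IOPS gap against the 5,000 IOPS peak, whereas gp3 needs only 2,000 IOPS provisioned on top of its baseline.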

685

Multi-Select · medium

Which TWO practices improve the performance of AWS Glue ETL jobs? (Choose two.)

Select 2 answers

A. Use pushdown predicates to filter data at the source

B. Increase the number of DPUs to the maximum allowed

C. Use the smallest possible file size for input data

D. Enable AWS Glue job metrics and debug logging

E. Use column pruning to select only required columns

Answers: A, E

Filters data early, reducing data scanned.

Why this answer

Options A and E are correct. Using pushdown predicates filters data early, and using column pruning reduces the data shuffled. Option B is wrong because increasing DPUs beyond the recommended ratio can cause resource contention.

Option C is wrong because the smallest file size increases overhead. Option D is wrong because standard logs are sufficient and debug logs generate overhead.

686

MCQ · hard

A company is using Amazon S3 to store sensitive customer data. The security team requires that all data be encrypted in transit and at rest. Additionally, they want to prevent any accidental public access. Which combination of actions should the data engineer take?

A. Enable default encryption with SSE-S3, enforce HTTPS only via bucket policy, and enable S3 Block Public Access.

B. Enable default encryption with SSE-KMS, allow both HTTP and HTTPS, and set bucket ACLs to private.

C. Use client-side encryption, enforce HTTPS via bucket policy, and enable S3 Block Public Access.

D. Enable default encryption with SSE-S3, allow HTTP and HTTPS, and use bucket ACLs to block public access.

Answer: A

SSE-S3 encrypts at rest, bucket policy enforces HTTPS, Block Public Access prevents public access.

Why this answer

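Option A is correct because it covers all requirements. Option B is wrong because it allows plain HTTP, so data is not always encrypted in transit, and it relies on bucket ACLs rather than S3 Block Public Access. Option C enforces HTTPS and blocks public access, but client-side encryption depends on every client encrypting correctly and does not set default encryption on the bucket.

Option D doesn't enforce HTTPS and relies on ACLs instead of S3 Block Public Access.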
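
One common way to enforce HTTPS via bucket policy, as option A describes, is a Deny statement keyed on aws:SecureTransport. The Python below is a sketch that builds such a statement; the function name and bucket name are hypothetical, and the statement would still need to be placed inside a full bucket policy document.

```python
def build_https_only_statement(bucket_name: str) -> dict:
    """Return a bucket-policy statement that denies requests not sent over HTTPS."""
    return {
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            f"arn:aws:s3:::{bucket_name}",    # the bucket itself
            f"arn:aws:s3:::{bucket_name}/*",  # every object in it
        ],
        # aws:SecureTransport is "false" when the request did not use TLS
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }


if __name__ == "__main__":
    # "example-bucket" is a made-up name
    print(build_https_only_statement("example-bucket"))
```

An explicit Deny is used here because it overrides any Allow, so no other statement can reopen plain HTTP access.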

687

MCQ · medium

A company uses Amazon RDS for MySQL with Multi-AZ deployment. The database experiences high write latency during peak hours. The application uses InnoDB tables. Which action would reduce write latency without changing the application code?

A. Enable storage autoscaling on the DB instance

B. Add a read replica to offload writes

C. Enable Multi-AZ on the DB instance

D. Increase the DB instance class size

Answer: D

A larger instance class provides more resources, improving write throughput.

Why this answer

Increasing the DB instance class size (Option D) provides more CPU and memory resources, which directly improves the database's ability to handle high write loads by reducing contention and speeding up InnoDB transaction processing. This action requires no application code changes and is the most direct way to address write latency caused by resource constraints.

Exam trap

The trap here is that candidates often confuse read replicas with write scaling, assuming they can offload writes, when in fact they only handle SELECT queries and do not reduce write latency on the primary.

How to eliminate wrong answers

Option A is wrong because storage autoscaling only increases storage capacity when space is low, which does not address write latency caused by CPU or memory bottlenecks. Option B is wrong because read replicas are designed to offload read traffic, not write operations; writes still go to the primary instance, so write latency remains unchanged. Option C is wrong because Multi-AZ deployment provides high availability and automatic failover, but it does not improve write performance; in fact, synchronous replication to the standby can slightly increase write latency.

688

MCQ · easy

A company uses AWS KMS to encrypt data in Amazon S3. The security team wants to ensure that the KMS key can only be used from within the company's VPC. Which policy element should be added to the KMS key policy?

A. Set the Principal element to restrict access to the VPC.

B. Add a condition using aws:SourceIp to allow only IP addresses from the VPC.

C. Add a condition using aws:SourceVpc to allow only requests from the VPC.

D. Add a condition using kms:ViaService to allow only via VPC endpoints.

Answer: C

This condition restricts key usage to the specified VPC.

Why this answer

Option C is correct because using a condition with aws:SourceVpc restricts key usage to requests originating from a specific VPC. Option A is wrong because the Principal element specifies who can use the key, not where. Option B is wrong because aws:SourceIp is for IP addresses, not VPC.

Option D is wrong because kms:ViaService restricts usage to specific AWS services, not network location.
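
A key policy statement using the aws:SourceVpc condition might look like the sketch below. The function name, role ARN, VPC ID, and the choice of KMS actions are all hypothetical. Note also that aws:SourceVpc is only present in the request context when the call arrives through a VPC endpoint, so this assumes an interface endpoint for KMS exists in the VPC.

```python
def build_vpc_restricted_key_statement(role_arn: str, vpc_id: str) -> dict:
    """Return a key-policy statement allowing key use only from one VPC."""
    return {
        "Sid": "AllowUseFromVpcOnly",
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],  # illustrative choice
        "Resource": "*",  # in a key policy, "*" refers to the key itself
        "Condition": {"StringEquals": {"aws:SourceVpc": vpc_id}},
    }


if __name__ == "__main__":
    # Both arguments are made-up example values
    print(
        build_vpc_restricted_key_statement(
            "arn:aws:iam::111122223333:role/ExampleRole",
            "vpc-0example",
        )
    )
```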

689

Multi-Select · easy

A data engineer is designing a data lake on Amazon S3 that must comply with GDPR. The engineer needs to ensure that individuals can request deletion of their personal data. Which THREE AWS services can be used together to automate the deletion of specific records?

Select 3 answers

A. AWS Lambda

B. AWS Glue

C. Amazon S3 Batch Operations

D. Amazon S3 Select

E. Amazon DynamoDB

Answers: A, C, D

Can process deletion logic.

Why this answer

Option D (S3 Select) can query specific records, Option A (Lambda) can process the deletion, and Option C (S3 Batch Operations) can apply it across many objects. Option B (Glue) is for ETL, not for selective deletion. Option E (DynamoDB) is not for S3.

690

MCQ · hard

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream, performs a 1-minute tumbling window aggregation, and writes results to an S3 bucket. Recently, the application started experiencing checkpoint failures and increasing processing delay. Which action should the engineer take FIRST to diagnose the issue?

A. Increase the parallelism of the Flink application.

B. Monitor CPU and memory utilization of the Flink application using Amazon CloudWatch metrics.

C. Switch to the Kinesis Client Library (KCL) for checkpointing.

D. Increase the checkpoint interval to reduce checkpoint frequency.

Answer: B

Resource exhaustion is a common cause of checkpoint failures; monitoring helps identify if scaling is needed.

Why this answer

Option B is correct because checkpoint failures are often due to insufficient resources (CPU/memory) for the Flink job. Monitoring CPU and memory utilization via CloudWatch metrics directly helps identify resource bottlenecks. Option D (checkpoint interval) might help but is not diagnostic.

Option A (parallelism) is a tuning step. Option C (KCL) is not relevant for Flink. The first step is to check resource utilization.

691

MCQ · hard

A company has a multi-account strategy using AWS Organizations. The data engineering team needs to share a central S3 bucket across multiple accounts while maintaining fine-grained access control. Which solution should be used?

A. Use IAM roles in each account with cross-account access

B. Use Amazon CloudFront to serve the data

C. Use S3 access points with a policy per account

D. Create a bucket policy with principal ARNs for each account

Answer: C

Access points allow separate policies for each account.

Why this answer

Option C is correct. S3 access points support per-account policies and can be used with AWS Organizations to enforce policies. Option D is wrong because bucket policies become complex.

Option A is wrong because cross-account roles are not fine-grained at the object level. Option B is wrong because CloudFront is for content delivery.

692

MCQ · medium

A company uses Amazon EMR to process large datasets stored in Amazon S3. The data engineer notices that EMR tasks are failing with 'DiskOutOfSpace' errors. The cluster uses m5.xlarge instances with 1 EBS volume of 64 GB. What is the MOST cost-effective solution to resolve this issue?

A. Use a mix of on-demand and spot instances for core nodes.

B. Increase the EBS storage volume size for each instance and use spot instances for task nodes.

C. Switch to D2 instances which have more instance store volume.

D. Increase the number of task instances to distribute the workload.

Answer: B

More disk space solves the issue; spot instances reduce cost.

Why this answer

Option B is correct because increasing EBS storage per instance provides more disk space, and using spot instances for task nodes reduces cost. Option D is wrong because adding more instances may not address the per-instance disk space issue and increases cost. Option A is wrong because mixing on-demand and spot instances for core nodes does not add disk space per instance; spot instances reduce cost but may cause interruptions.

Option C is wrong because switching to D2 instances is more expensive and may not be needed.

693

Multi-Select · hard

A company uses Amazon EMR to process sensitive data. The data engineer needs to ensure that data in transit between EMR and S3 is encrypted. Which THREE configurations achieve this? (Choose THREE.)

Select 3 answers

A. Enable S3 Block Public Access on the bucket

B. Configure EMRFS to use server-side encryption with S3 (SSE-S3) and require HTTPS

C. Enable SSE-KMS on the S3 bucket

D. Use SSE-C with HTTPS for S3 communication

E. Configure EMR to use VPC endpoints for S3 with a policy that enforces HTTPS

Answers: B, D, E

EMRFS can enforce HTTPS for data transfer.

Why this answer

EMRFS with SSE-S3 and required HTTPS, SSE-C over HTTPS, and VPC endpoints with a policy that enforces HTTPS all keep data encrypted in transit. Option C is wrong because SSE-KMS encrypts at rest. Option A is wrong because S3 Block Public Access is unrelated to encryption in transit.

Options B, D, and E are correct.

694

Multi-Select · hard

A company is ingesting real-time financial transactions into Amazon Kinesis Data Streams. The data is then consumed by a Kinesis Data Analytics for Apache Flink application that calculates running totals. The application is experiencing high latency and checkpoint failures. Which TWO steps should the engineer take to improve performance and reliability? (Select TWO.)

Select 2 answers

A. Enable enhanced fan-out for the Flink application.

B. Reduce the batch size of records processed per checkpoint.

C. Increase the number of shards in the Kinesis data stream.

D. Increase the number of KPUs (Kinesis Processing Units) for the Flink application.

E. Decrease the checkpoint interval to reduce state size.

Answers: C, D

More shards increase parallelism, reducing latency and improving throughput.

Why this answer

Options C and D are correct. Increasing the number of shards increases throughput and parallelism, helping reduce latency. Increasing the number of KPUs (Kinesis Processing Units) for the Flink application provides more compute resources, addressing checkpoint failures.

Option E (decreasing the checkpoint interval) may increase checkpoint overhead. Option A (enhanced fan-out) is for multiple consumers, not for a single Flink job. Option B (reducing batch size) may not help with overall throughput.

695

Drag & Drop · medium

Arrange the steps to set up cross-region replication for an S3 bucket.

Why this order

First, enable versioning on source and destination. Then create the destination bucket, add a replication rule, and assign an IAM role for replication.

696

MCQ · medium

A company uses Amazon S3 to store historical financial records. A compliance policy requires that all objects be encrypted with a customer-managed key stored in AWS KMS. The bucket is already configured with SSE-S3. What is the LEAST disruptive way to change the encryption to SSE-KMS?

A. Add a bucket policy to enforce SSE-KMS.

B. Update the bucket's default encryption settings to SSE-KMS.

C. Copy all objects to a new bucket that has default encryption set to SSE-KMS.

D. Use S3 Batch Operations to apply SSE-KMS to all existing objects.

Answer: C

Copying objects to a new bucket with SSE-KMS default encryption will re-encrypt them with the new key and is straightforward.

Why this answer

Option C is correct because changing the default encryption settings of an existing bucket (SSE-S3 to SSE-KMS) does not retroactively encrypt objects that were already stored with SSE-S3. Copying all objects to a new bucket that has default encryption set to SSE-KMS ensures every object is encrypted with a customer-managed key, as the copy operation re-encrypts each object using the new bucket's default settings. This approach is the least disruptive because it avoids modifying the original bucket's configuration or policies, which could break existing applications or access patterns.

Exam trap

The trap here is that candidates assume updating default encryption settings (Option B) will retroactively encrypt existing objects, but S3 default encryption only applies to new uploads, not to objects already stored with a different encryption method.

How to eliminate wrong answers

Option A is wrong because adding a bucket policy to enforce SSE-KMS only affects future uploads and does not change the encryption of existing objects, leaving them non-compliant. Option B is wrong because updating the bucket's default encryption settings to SSE-KMS only applies to new objects; existing objects remain encrypted with SSE-S3 and are not retroactively re-encrypted. Option D is wrong because S3 Batch Operations can apply SSE-KMS to existing objects, but this process is more disruptive than copying to a new bucket, as it requires careful management of permissions, potential downtime, and does not guarantee a clean separation of old and new encryption configurations.

697

MCQ · hard

A media company uses Amazon Kinesis Data Firehose to ingest log data from web servers into Amazon S3. The data is then processed by AWS Glue jobs. The company wants to ensure that data is delivered to S3 within 5 minutes of ingestion. Currently, the Firehose delivery stream is configured with a buffer interval of 300 seconds and a buffer size of 5 MB. The log data arrives at a rate of 2 MB per second. The data engineer notices that some log files are delayed by up to 10 minutes. The company cannot change the buffer size due to downstream requirements. What should the data engineer do to meet the 5-minute delivery requirement?

A. Increase the buffer interval to 600 seconds to reduce the number of delivery attempts.

B. Increase the buffer size to 10 MB to ensure data is delivered in larger chunks.

C. Enable GZIP compression on the Firehose stream to reduce data size.

D. Decrease the buffer interval to 120 seconds.

Answer: D

Lower interval triggers delivery more often, reducing latency.

Why this answer

Option D is correct. Decreasing the buffer interval to 120 seconds will cause Firehose to deliver data more frequently, meeting the 5-minute SLA. Option A is wrong because increasing the buffer interval would increase the delay.

Option B is wrong because increasing the buffer size would also increase the delay, and the company cannot change the buffer size. Option C is wrong because compressing data does not reduce the buffer interval.

698

MCQ · medium

A company wants to ingest streaming data from thousands of IoT devices into Amazon S3 with minimal latency and then transform the data using Spark SQL. Which AWS service should be used for data ingestion?

A. Amazon EMR

B. AWS Glue

C. Amazon Athena

D. Amazon Kinesis Data Firehose

Answer: D

Kinesis Data Firehose can ingest streaming data and deliver it to S3 with near-real-time latency.

Why this answer

Amazon Kinesis Data Firehose is the best choice because it can ingest streaming data, buffer it, and deliver it to S3 with minimal latency. AWS Glue is for ETL jobs, not real-time ingestion. Amazon Athena is a query service.

Amazon EMR can process data but is not optimized for ingestion.

699

MCQ · hard

A large e-commerce company uses Amazon DynamoDB to store shopping cart data. The table has a partition key of 'user_id' and a sort key of 'item_id'. The application performs frequent updates to the 'quantity' attribute for items in a user's cart. Recently, the operations team noticed that write requests are being throttled during peak shopping hours. The table is provisioned with 10,000 write capacity units (WCUs) and uses DynamoDB Accelerator (DAX) for read caching. The data engineer suspects that the throttling is due to hot partitions. The application uses a single AWS SDK client configured with retries. After reviewing the Amazon CloudWatch metrics, the engineer sees that the WriteThrottleEvents metric spikes for a few partition keys. The table has a high number of partitions. What should the data engineer do to resolve the throttling issue with minimal application changes?

A. Increase the provisioned write capacity to 20,000 WCUs permanently.

B. Enable DynamoDB Global Tables to distribute writes across regions.

C. Add more nodes to the DAX cluster to offload write traffic.

D. Configure DynamoDB Auto Scaling with a maximum WCU setting of 20,000 and a target utilization of 70%.

Answer: D

Auto Scaling dynamically adjusts capacity based on traffic, reducing throttling without permanent overprovisioning.

Why this answer

Option D is correct because using DynamoDB Auto Scaling with a higher maximum WCU setting allows the table to scale up during peak demand, addressing hot partitions without code changes. Option A is wrong because increasing WCUs manually does not adapt to variable traffic. Option C is wrong because DAX is for reads, not writes.

Option B is wrong because Global Tables replicate data but do not increase write capacity.

700

MCQ · hard

A company stores sensitive data in Amazon S3. The security team requires that all data be encrypted at rest and that the encryption keys be stored in AWS CloudHSM. Which S3 encryption option should be used?

A. SSE-S3

B. SSE-KMS with an AWS managed key

C. SSE-KMS with a customer managed key

D. SSE-C

Answer: D

SSE-C allows the customer to provide their own encryption keys, which can be stored and managed in CloudHSM.

Why this answer

SSE-C allows customers to provide their own encryption keys, which can be stored in CloudHSM. SSE-S3 and SSE-KMS use AWS-managed keys or KMS keys, not CloudHSM. Option D is correct.

701

MCQ · easy

Refer to the exhibit. An IAM policy includes this statement. What access does it grant?

A. It denies GetObject access to the bucket from IP addresses in 10.0.0.0/8

B. It allows GetObject access to the bucket only from a specific VPC

C. It allows PutObject access to the bucket from any IP address

D. It allows GetObject access to the bucket only from IP addresses in 10.0.0.0/8

Answer: D

The policy allows access from the specified IP range.

Why this answer

Option D is correct because the policy allows GetObject from the specified IP range. Option A is wrong because the policy allows, not denies. Option B is wrong because the policy does not restrict access to a VPC.

Option C is wrong because the policy allows GetObject, not PutObject.

702

Multi-Select · medium

Which THREE are best practices for managing data in Amazon S3 for a data lake? (Choose three.)

Select 3 answers

A. Enable S3 Versioning to protect against accidental deletions.

B. Configure lifecycle policies to transition data to colder storage tiers.

C. Enable S3 Snapshot for point-in-time recovery.

D. Disable S3 server access logging to reduce costs.

E. Use bucket policies to restrict access based on IAM roles.

Answers: A, B, E

Versioning provides data protection.

Why this answer

Enabling S3 Versioning is a best practice for data lakes because it protects against accidental deletions or overwrites by preserving all versions of an object, including deletions (which are recorded as delete markers). This allows you to recover previous object states and is essential for data governance and auditability in a data lake environment.

Exam trap

The trap here is that candidates may confuse S3 Versioning with a non-existent 'S3 Snapshot' feature, or mistakenly think disabling server access logging is a cost-saving best practice, when in fact it undermines security auditing.

703

MCQ · hard

Refer to the exhibit. An S3 bucket policy allows the DataEngineerRole to get objects only if the request uses HTTPS. However, requests from this role are being denied even when using HTTPS. What is the MOST likely reason?

A. The IAM role does not have permission to use SSE-S3.

B. The condition key aws:SecureTransport is misspelled.

C. The bucket policy does not include a Deny statement for HTTP requests.

D. The IAM role's attached policy does not allow s3:GetObject on the bucket.

Answer: D

The bucket policy allows the role, but the role itself must also have an IAM policy that allows s3:GetObject.

Why this answer

Option D is correct because the condition key aws:SecureTransport evaluates to true when the request is made over HTTPS, so the bucket policy condition is satisfied, but the role's attached IAM policy must also allow the action. Option C is wrong because the bucket policy already allows the role over HTTPS, and a missing Deny statement for HTTP cannot cause HTTPS requests to be denied. Option A is wrong because SSE-S3 does not require additional permissions.

Option B is wrong because the condition key is correctly written.

704

Drag & Drop · medium

Arrange the steps to set up a streaming ETL pipeline using Amazon Kinesis Data Firehose to Amazon S3.

Why this order

First, create the Firehose stream. Then configure the source, set the S3 destination, optionally enable a Lambda transformation, and finally test the pipeline.

Full explanation →

705

MCQmedium

A company is using Amazon RDS for MySQL with Multi-AZ deployment. They notice that during a recent failover test, the application experienced a brief write outage. The application uses a connection string that points to the RDS instance endpoint. What is the MOST likely cause of the write outage?

A.The application is using a read replica endpoint, which does not support write operations.

B.The application is using the RDS instance endpoint instead of the cluster endpoint, so it does not automatically route to the standby after failover.

C.The application is connecting through a Network Load Balancer, which is not configured for cross-zone failover.

D.The application connection pool is exhausted because the failover caused all existing connections to drop simultaneously.

AnswerB

The instance endpoint name does not change, but its DNS record (a CNAME) is repointed to the new primary during failover; existing connections drop, and the application must reconnect once the record has updated.

Why this answer

Option B is correct because in a Multi-AZ RDS deployment, the instance endpoint always points to the current primary instance. During a failover, the DNS record for the instance endpoint is updated to point to the new primary, but existing connections to the old primary are dropped, and the DNS change can take time to propagate. The application's connection string using the instance endpoint means it does not automatically route to the standby during the failover transition, causing a brief write outage until the DNS update completes and the application reconnects.

In contrast, using a cluster endpoint (available for Aurora, not standard RDS) or implementing retry logic in the application would mitigate this.

Exam trap

The trap here is that candidates often confuse the RDS instance endpoint with the cluster endpoint used in Amazon Aurora, assuming that Multi-AZ automatically provides a seamless, zero-downtime failover for writes, when in fact the instance endpoint requires DNS propagation and connection re-establishment.

How to eliminate wrong answers

Option A is wrong because a read replica endpoint is used for read-only traffic; while it does not support writes, the scenario describes a write outage during failover, not a persistent inability to write, and the application is using the instance endpoint, not a read replica endpoint. Option C is wrong because a Network Load Balancer is not a standard component in an RDS Multi-AZ architecture; RDS handles failover internally via DNS, and NLB is not involved in routing to RDS instances. Option D is wrong because while failover does cause existing connections to drop, connection pool exhaustion is a symptom of poor application retry logic, not the root cause of the write outage; the primary issue is the DNS propagation delay and the application's use of the instance endpoint.

Full explanation →

706

MCQhard

A data streaming application uses Kinesis Data Streams with 10 shards. The data producer is throttled frequently. Which action should be taken to resolve this issue?

A.Decrease the data retention period

B.Use enhanced fan-out for consumers

C.Enable server-side encryption

D.Increase the number of shards

AnswerD

Each shard provides 1 MB/s write capacity, so more shards increase capacity.

Why this answer

Option D is correct because increasing the number of shards increases the stream's write capacity. Option A is wrong because decreasing the retention period does not affect write throttling. Option C is wrong because enabling encryption does not affect throttling.

Option B is wrong because enhanced fan-out is for consumers, not producers.
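
The capacity reasoning above can be sketched as a quick sizing calculation. This is a minimal illustration, not an AWS tool: it assumes provisioned mode, evenly distributed partition keys, and the per-shard write limits of 1 MB/s and 1,000 records/s. The function name and the workload numbers are hypothetical.

```python
import math

# Per-shard write limits for a provisioned Kinesis data stream.
MB_PER_SEC_PER_SHARD = 1.0
RECORDS_PER_SEC_PER_SHARD = 1000


def required_shards(ingest_mb_per_sec: float, records_per_sec: int) -> int:
    """Return the smallest shard count that satisfies both write limits.

    Assumes partition keys spread traffic evenly; a hot key can still
    throttle a single shard even when the total count looks sufficient.
    """
    by_bytes = math.ceil(ingest_mb_per_sec / MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    return max(1, by_bytes, by_records)


# Hypothetical workload: 14 MB/s and 9,000 records/s against 10 shards.
needed = required_shards(14.0, 9000)
print(f"shards needed: {needed}, currently provisioned: 10")
```

With these assumed numbers the byte rate is the binding limit, so the 10-shard stream would throttle producers until it is resharded.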

Full explanation →

707

MCQhard

Refer to the exhibit. A data engineer has attached this bucket policy to an S3 bucket. What is the effect of this policy?

A.It enforces server-side encryption for all objects written to the bucket.

B.It allows the DataLakeRole to read and write objects, but only over HTTPS.

C.It allows anonymous access to the bucket for HTTPS requests.

D.It denies all access to the bucket except for requests from the DataLakeRole.

AnswerB

The Allow statement grants GetObject and PutObject to the role, and its aws:SecureTransport condition limits that access to HTTPS requests.

Why this answer

Option B is correct because the bucket policy uses a condition key `aws:SecureTransport` set to `true`, which restricts access to HTTPS (TLS) connections only. The `Principal` is `DataLakeRole`, and the `Action` includes `s3:GetObject` and `s3:PutObject`, so the policy allows that role to read and write objects exclusively over HTTPS, enforcing encrypted data in transit.

Exam trap

AWS often tests the distinction between encryption in transit (HTTPS/TLS) and encryption at rest (SSE), leading candidates to confuse the `aws:SecureTransport` condition with server-side encryption requirements.

How to eliminate wrong answers

Option A is wrong because the policy does not reference `s3:x-amz-server-side-encryption` or any condition enforcing server-side encryption (SSE) at rest; it only enforces encryption in transit via `aws:SecureTransport`. Option C is wrong because the `Principal` is explicitly set to `DataLakeRole` (an IAM role ARN), not `"*"` or `{"AWS": "*"}`, so anonymous access is not granted. Option D is wrong because the policy includes an `Allow` effect for `DataLakeRole` under the HTTPS condition, but it does not contain a `Deny` statement for other principals or conditions; without an explicit `Deny`, other access may still be allowed by other policies (e.g., bucket ACLs or IAM policies), so it does not deny all other access.

Full explanation →

708

Multi-Selecthard

A company wants to implement least privilege access for its data lake on S3. Which THREE practices should be followed? (Choose THREE.)

Select 3 answers

A.Grant s3:* to all users for simplicity

B.Use S3 bucket policies for cross-account access

C.Use S3 access points to enforce network policies

D.Disable S3 Block Public Access to allow flexibility

E.Use IAM policies to grant specific permissions to users and roles

AnswersB, C, E

Bucket policies are appropriate for cross-account access.

Why this answer

Options B, C, and E are correct. Using IAM policies to grant specific permissions, applying bucket policies for cross-account access, and using S3 access points are least-privilege best practices. Option D is wrong because S3 Block Public Access should be enabled, not disabled.

Option A is wrong because granting s3:* is not least privilege.

Full explanation →

709

MCQeasy

A company uses Amazon S3 to store sensitive data. The security team requires that all data be encrypted at rest using a customer-managed key that is rotated annually. Which encryption option should be used?

A.SSE-KMS (Server-Side Encryption with AWS KMS).

B.SSE-S3 (Server-Side Encryption with S3-managed keys).

C.Client-side encryption.

D.SSE-C (Server-Side Encryption with Customer-Provided keys).

AnswerA

Allows customer-managed KMS key with annual rotation.

Why this answer

SSE-KMS is the correct choice because it allows you to use a customer-managed key (CMK) in AWS KMS, which you can configure to rotate automatically on an annual schedule. This satisfies the security team's requirement for encryption at rest with a key you control and rotate yearly, while still leveraging server-side encryption that integrates with S3's existing infrastructure.

Exam trap

The trap here is that candidates often confuse SSE-C with customer-managed keys, but SSE-C requires you to supply the key on every operation and does not support AWS-managed rotation, making it unsuitable for the 'rotated annually' requirement.

How to eliminate wrong answers

Option B (SSE-S3) is wrong because it uses S3-managed keys that are automatically rotated by AWS, not customer-managed keys, so you cannot control the rotation schedule or manage the key yourself. Option C (Client-side encryption) is wrong because it encrypts data before it reaches S3, which does not meet the requirement for server-side encryption at rest managed by AWS; it also places the key management burden entirely on the client, not the customer-managed key service. Option D (SSE-C) is wrong because it requires you to provide your own encryption key with each request, and AWS does not manage or rotate the key—you must handle key storage and rotation entirely outside of AWS, which contradicts the requirement for a customer-managed key that is rotated annually within AWS.

Full explanation →

710

MCQhard

Refer to the exhibit. A data engineer has attached this bucket policy to an S3 bucket named data-lake-bucket. The engineer wants to allow only GET requests from the corporate network (10.0.0.0/16) over HTTPS. However, users report that they cannot access objects even when connected to the corporate network. What is the issue?

A.The Deny statement should include a condition on the source IP.

B.The Allow statement should include a condition for SecureTransport.

C.The Allow statement should specify s3:GetObject instead of s3:GetObject.

D.The Deny statement blocks all requests that are not using HTTPS, including those from the corporate network.

AnswerD

Deny overrides Allow when condition is met.

Why this answer

Option D is correct because the Deny statement with `aws:SecureTransport` set to `false` blocks all HTTP requests. Since the Allow statement only permits GET requests from the corporate network (10.0.0.0/16) but does not require HTTPS, any request from that network that uses HTTP is denied by the explicit Deny. The Deny statement overrides the Allow, so even legitimate corporate users are blocked if they use HTTP.

Exam trap

AWS often tests the principle that an explicit Deny overrides any Allow, leading candidates to focus on fixing the Allow statement rather than recognizing that the Deny unconditionally blocks HTTP traffic from all sources, including the corporate network.

How to eliminate wrong answers

Option A is wrong because the Deny statement already includes a condition on `aws:SecureTransport`, not on source IP; adding a source IP condition would not fix the HTTPS enforcement issue. Option B is wrong because the Deny statement already enforces HTTPS through `aws:SecureTransport`; adding the same condition to the Allow statement would be redundant and would not change the fact that HTTP requests are denied. Option C is wrong because `s3:GetObject` is the correct action for GET requests; the typo 's3:GetObject' in the question is a red herring, and the actual policy uses the correct action.
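
The "explicit Deny overrides Allow" rule can be illustrated with a toy evaluator. This is a simplified sketch, not AWS's actual policy engine, and because the exhibit is not reproduced here the two statements below are an assumed reconstruction of the policy described in the question. All names (`evaluate`, `policy`, the request fields) are hypothetical.

```python
import ipaddress

CORP_NET = ipaddress.ip_network("10.0.0.0/16")

# Assumed shape of the policy: Allow GET from the corporate range,
# Deny anything that does not use HTTPS.
policy = [
    {
        "effect": "Allow",
        "actions": {"s3:GetObject"},
        "condition": lambda r: ipaddress.ip_address(r["source_ip"]) in CORP_NET,
    },
    {
        "effect": "Deny",
        "actions": {"s3:GetObject", "s3:PutObject"},
        "condition": lambda r: not r["secure_transport"],
    },
]


def evaluate(statements, request):
    """Explicit Deny wins; otherwise Allow if any statement matches; else implicit deny."""
    decision = "ImplicitDeny"
    for stmt in statements:
        if request["action"] not in stmt["actions"]:
            continue
        if not stmt["condition"](request):
            continue
        if stmt["effect"] == "Deny":
            return "Deny"  # an explicit Deny short-circuits any Allow
        decision = "Allow"
    return decision


http_corp = {"action": "s3:GetObject", "source_ip": "10.0.4.7", "secure_transport": False}
https_corp = {"action": "s3:GetObject", "source_ip": "10.0.4.7", "secure_transport": True}
https_other = {"action": "s3:GetObject", "source_ip": "203.0.113.9", "secure_transport": True}

for name, req in [("http_corp", http_corp), ("https_corp", https_corp), ("https_other", https_other)]:
    print(name, evaluate(policy, req))
```

In this model a corporate user on HTTP is denied even though the Allow matches, which is the behaviour the answer describes.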

Full explanation →

711

MCQhard

A data engineering team is responsible for an Amazon RDS for PostgreSQL instance that stores financial data. The database is 500 GB in size. The team needs to create a read replica in a different AWS Region for disaster recovery. The source database has automated backups enabled with a retention period of 7 days. The team initiates the cross-region read replica creation. After several hours, the replica status shows 'Replication Lag' of 30 minutes and is increasing. What should the team do to reduce the replication lag?

A.Modify the source DB instance to use a larger instance class.

B.Delete the replica and create a new one from a snapshot.

C.Increase the backup retention period to 35 days.

D.Enable Multi-AZ on the source database instance.

AnswerD

Multi-AZ provides a synchronous standby that reduces replication lag.

Why this answer

Option D is correct because enabling Multi-AZ on the source adds a synchronous standby; backups are then taken from the standby, which relieves the primary and can help reduce replication lag. Option A is incorrect because moving to a larger instance class may help but is not the primary solution here. Option B is incorrect because deleting and recreating the replica may not solve the underlying issue.

Option C is incorrect because increasing the backup retention period does not affect replication.

Full explanation →

712

MCQmedium

A company is using Amazon RDS for MySQL and needs to automate backups with a retention period of 35 days. They also want to be able to restore to any point within the retention period. Which configuration should be used?

A.Enable manual snapshots daily and retain for 35 days.

B.Set the backup retention period to 35 days and enable automatic backups.

C.Set the backup retention period to 7 days and create daily manual snapshots.

D.Disable automated backups and rely on Multi-AZ for recovery.

AnswerB

Automated backups allow point-in-time recovery within the retention period.

Why this answer

Amazon RDS for MySQL supports automated backups with a configurable retention period of up to 35 days. By setting the backup retention period to 35 days and enabling automatic backups, RDS automatically performs daily snapshots and transaction log backups, enabling point-in-time recovery (PITR) to any second within the retention window. This meets the requirement for both a 35-day retention and full PITR capability without manual intervention.

Exam trap

The trap here is that candidates often confuse manual snapshots (which are retained indefinitely but do not support PITR) with automated backups (which support PITR but have a maximum retention of 35 days), leading them to choose Option A or C, thinking manual snapshots can extend the PITR window.

How to eliminate wrong answers

Option A is wrong because manual snapshots are not automatically taken daily and do not support point-in-time recovery; they only provide a single point-in-time restore, not continuous PITR. Option C is wrong because setting the backup retention period to 7 days limits automated backups and PITR to only 7 days, and adding daily manual snapshots does not extend the PITR window beyond 7 days. Option D is wrong because disabling automated backups eliminates both automated snapshots and transaction log backups, making PITR impossible; Multi-AZ provides high availability but does not create backups or enable recovery to any point in time.

Full explanation →

713

MCQmedium

A data engineer needs to grant an IAM user read-only access to a specific prefix (folder) in an S3 bucket. The bucket contains sensitive data. Which S3 bucket policy statement achieves this?

A.{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::123456789012:user/DataEng"},"Action":"s3:GetObject","Resource":"arn:aws:s3:::mybucket"}

B.{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::123456789012:user/DataEng"},"Action":"s3:GetObject","Resource":"arn:aws:s3:::mybucket/sensitive/*"}

C.{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::123456789012:user/DataEng"},"Action":"s3:GetObject","Resource":"arn:aws:s3:::mybucket/*","Condition":{"StringLike":{"s3:prefix":"sensitive/"}}}

D.{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::123456789012:user/DataEng"},"Action":"s3:GetObject","Resource":"arn:aws:s3:::mybucket/*"}

AnswerB

Grants access only to objects under sensitive/ prefix.

Why this answer

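Option B is correct because it grants s3:GetObject only on objects under the sensitive/ prefix; all other prefixes remain implicitly denied. Option A is wrong because its Resource is the bucket ARN itself, and s3:GetObject is an object-level action, so the statement does not grant read access to any object. Option C is wrong because the s3:prefix condition key applies to s3:ListBucket requests, not s3:GetObject, so it does not scope object reads to the prefix.

Option D is wrong because it grants access to all objects in the bucket.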
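
As a rough sketch of the pattern in Option B, the statement can be generated from a bucket name and a prefix so that the Resource ARN always ends in `prefix/*`. The helper name `read_only_prefix_statement` is hypothetical; the principal, bucket, and prefix values are the ones used in the question.

```python
import json


def read_only_prefix_statement(principal_arn: str, bucket: str, prefix: str) -> dict:
    """Build an Allow statement for s3:GetObject scoped to one key prefix."""
    prefix = prefix.strip("/")  # tolerate "sensitive", "sensitive/" or "/sensitive/"
    return {
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "s3:GetObject",
        # Object-level actions need an object ARN: bucket/prefix/*
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
    }


stmt = read_only_prefix_statement(
    "arn:aws:iam::123456789012:user/DataEng", "mybucket", "sensitive/"
)
print(json.dumps(stmt, indent=2))
```

Scoping through the Resource ARN rather than a condition keeps the statement valid for `s3:GetObject`, which is why Option B works where Options A and C do not.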

Full explanation →

714

MCQeasy

A startup is building a real-time analytics application using Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics. The application processes clickstream data from a website. The data is also stored in Amazon S3 for historical analysis. The company uses an S3 bucket with a lifecycle policy that transitions objects to Amazon S3 Glacier Deep Archive after 30 days. The data engineering team has configured a Kinesis Data Firehose delivery stream to write data to the S3 bucket. The team notices that the data in S3 is not being transitioned to Glacier Deep Archive after 30 days. The lifecycle policy is correctly configured and has been verified. What is the most likely cause of this issue?

A.The S3 lifecycle rule is configured with a filter that does not match the prefix used by Kinesis Data Firehose.

B.The Glacier Deep Archive storage class requires a minimum 90-day storage period, so the lifecycle policy cannot transition objects after 30 days.

C.The S3 bucket is not enabled for S3 Intelligent-Tiering, which is required for lifecycle transitions to Glacier Deep Archive.

D.The S3 bucket does not have S3 Batch Operations enabled to invoke the lifecycle policy.

AnswerA

If the prefix filter doesn't match the objects' prefixes, the rule won't apply.

Why this answer

Option A is correct because Firehose writes objects under a key prefix: by default a UTC time-based YYYY/MM/dd/HH prefix, optionally preceded by a custom prefix configured on the delivery stream. A lifecycle rule with a prefix filter applies only to objects whose keys start with that prefix, so if the filter does not match the keys Firehose actually writes, the objects are never transitioned.

Option B is wrong because the minimum storage duration for Glacier Deep Archive (180 days, not 90) is a billing rule for objects already in that class; it does not prevent a lifecycle rule from transitioning objects into it after 30 days. Option C is wrong because S3 Intelligent-Tiering is a separate storage class, not a prerequisite for lifecycle transitions. Option D is wrong because S3 Batch Operations is unrelated to lifecycle rules, which S3 evaluates automatically.
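
The prefix mismatch can be checked with a few lines of code before touching the bucket. This is only a sketch of the matching logic: a lifecycle prefix filter is modelled as a plain "key starts with" test, and both the object key and the rule prefixes below are made-up examples.

```python
def rule_applies(rule_prefix: str, object_key: str) -> bool:
    """A lifecycle prefix filter matches keys that begin with the prefix."""
    return object_key.startswith(rule_prefix)


# Hypothetical key in the default Firehose layout (UTC YYYY/MM/dd/HH/...).
firehose_key = "2024/05/01/13/clickstream-1-2024-05-01-13-02-11-example"

# Hypothetical lifecycle filters to compare.
print(rule_applies("clickstream/", firehose_key))  # filter written for a custom prefix
print(rule_applies("2024/", firehose_key))         # filter matching the default layout
print(rule_applies("", firehose_key))              # empty prefix matches every key
```

Listing a handful of real keys and running them through a check like this is a quick way to confirm whether a rule's filter and the delivery stream's prefix agree.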

Full explanation →

715

MCQeasy

A data engineer needs to monitor the number of records processed by an AWS Glue ETL job. Which CloudWatch metric should the engineer use?

A.glue.driver.aggregate.elapsedTime

B.glue.driver.aggregate.numRecords

C.glue.driver.aggregate.bytesRead

D.glue.driver.aggregate.recordsRead

AnswerD

This metric tracks the number of records read by the job.

Why this answer

Option D is correct because AWS Glue publishes 'glue.driver.aggregate.recordsRead', which reports the number of records read from all data sources by completed Spark tasks. Option A is wrong because 'glue.driver.aggregate.elapsedTime' measures elapsed time. Option C is wrong because 'glue.driver.aggregate.bytesRead' measures bytes, not records.

Option B is wrong because 'glue.driver.aggregate.numRecords' is not a documented Glue job metric; the similarly named 'glue.driver.streaming.numRecords' applies only to streaming jobs.

Full explanation →

716

Multi-Selecthard

A data engineer is troubleshooting an Amazon Redshift cluster that is unable to access an S3 bucket for COPY operations. The cluster has an IAM role attached. Which of the following could be causing the failure? (Choose TWO.)

Select 2 answers

A.The VPC security group does not allow outbound HTTPS traffic

B.The S3 bucket policy denies access to the IAM role

C.The S3 bucket has default encryption enabled

D.The IAM role does not have the s3:GetObject permission

E.The KMS key used for encryption is not shared with Redshift

AnswersB, D

The bucket policy can override the IAM role permissions.

Why this answer

Options B and D are correct. The IAM role must have permission to the S3 bucket, and the bucket policy must allow the role. Option A is wrong because VPC security groups control network traffic, not S3 access.

Option C is wrong because encryption is not required for COPY. Option E is wrong because Redshift does not need KMS permissions unless using SSE-KMS.

Full explanation →

717

Matchingmedium

Match each AWS security service to its purpose in data protection.

Concepts

Matches

Managed encryption keys

User and role access control

Audit API activity

Discover and protect sensitive data

Web application firewall

Why these pairings

Security services protect data in AWS.

Full explanation →

718

Multi-Selectmedium

A company is designing a data store for IoT sensor data that is written once and never updated. The data must be stored with high durability and low cost. Which TWO AWS storage services are most suitable? (Choose TWO.)

Select 2 answers

A.Amazon ElastiCache

B.Amazon EBS

C.Amazon S3

D.Amazon DynamoDB

E.Amazon S3 Glacier Deep Archive

AnswersC, E

S3 provides 99.999999999% durability and low cost for infrequently accessed data.

Why this answer

Amazon S3 is correct because it provides 99.999999999% (11 9's) durability, is designed for write-once-read-many (WORM) workloads, and offers low-cost storage tiers suitable for IoT sensor data that is never updated. S3's object storage model and lifecycle policies allow automatic transition to colder storage, making it ideal for immutable data at scale.

Exam trap

The trap here is that candidates often choose DynamoDB (D) for its scalability and low latency, overlooking that the question emphasizes low cost and write-once immutability, where S3 and Glacier Deep Archive are orders of magnitude cheaper per GB stored.

Full explanation →

719

Multi-Selectmedium

A company runs a data lake on Amazon S3 with AWS Glue for ETL. The data is stored in Parquet format and partitioned by date. The data engineer notices that queries using Amazon Athena are scanning large amounts of data even when filtering on the partition column. Which TWO actions would improve query performance? (Choose TWO)

Select 2 answers

A.Use a different file format like Avro

B.Ensure that the WHERE clause uses the partition column correctly

C.Convert the data from Parquet to CSV for better compression

D.Increase the number of partitions by adding a second partition column

E.Enable predicate pushdown in Athena

AnswersB, E

Enables partition pruning.

Why this answer

Options B and E are correct. Partition pruning works only when the WHERE clause filters on the partition column directly, and predicate pushdown on a columnar format like Parquet further reduces the data scanned. Option A is wrong because switching to a different file format such as Avro does not help if partitioning is not leveraged.

Option C is wrong because converting to CSV would increase the data scanned. Option D is wrong because adding more partitions does not help if the filter is not applied to the partition column.

Full explanation →

720

MCQhard

A company uses Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data is in JSON format and contains a 'timestamp' field with a Unix epoch value. The company wants to partition the S3 objects by year, month, day, and hour based on the timestamp. What is the MOST efficient method to achieve this?

A.Use the dynamic partitioning feature of Kinesis Data Firehose with inline parsing to extract the timestamp and create the S3 prefix.

B.Configure a custom S3 prefix in Firehose using the 'YYYY/MM/dd/HH' format based on the current time.

C.Use an AWS Glue ETL job to read from Firehose, partition, and write to S3.

D.Use Amazon Athena to run a CTAS query that partitions the data by timestamp.

AnswerA

Dynamic partitioning allows extracting keys from the data and defining S3 prefixes.

Why this answer

Option A is correct because Firehose dynamic partitioning can extract the timestamp from each record, using inline parsing or a Lambda function, and build the S3 prefix from it. Option B is incorrect because a time-based custom prefix reflects when Firehose handles the record, not the 'timestamp' field inside the data. Option C is incorrect because a Glue ETL job adds latency and an extra processing step.

Option D is incorrect because an Athena CTAS query repartitions data only after it has landed in S3, which adds a separate batch step.
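
To make the target layout concrete, the sketch below derives a year/month/day/hour prefix from an epoch value. It is plain Python for illustration, not Firehose configuration, and the Hive-style `year=.../month=...` naming and the function name `partition_prefix` are assumptions rather than anything the question specifies.

```python
from datetime import datetime, timezone


def partition_prefix(epoch_seconds: int) -> str:
    """Map a Unix epoch (seconds) to a UTC year/month/day/hour S3 prefix."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime("year=%Y/month=%m/day=%d/hour=%H/")


record = {"user": "u-1", "timestamp": 1700000000}  # hypothetical JSON record
print(partition_prefix(record["timestamp"]))  # year=2023/month=11/day=14/hour=22/
```

The point of dynamic partitioning is that this mapping uses the event time carried in the record, so late-arriving records still land in the partition for the hour they describe.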

Full explanation →

721

MCQeasy

A company wants to ingest real-time data from a social media API into Amazon S3 for analysis. The API provides data as JSON records. Which AWS service is best suited for this ingestion?

A.AWS Glue

B.Amazon Kinesis Data Firehose

C.Amazon Simple Queue Service (SQS)

D.Amazon DataZone

AnswerB

Firehose is designed for streaming data ingestion into S3.

Why this answer

Option B is correct because Amazon Kinesis Data Firehose can capture and load streaming data into S3 with minimal latency. Option A is wrong because AWS Glue is an ETL service, not a real-time ingestion service. Option C is wrong because Amazon SQS is a message queue that is not designed for direct S3 loading.

Option D is wrong because Amazon DataZone is for data cataloging and governance, not ingestion.

Full explanation →

722

Multi-Selectmedium

Which TWO actions should a data engineer take to encrypt data at rest in an Amazon S3 bucket? (Select TWO.)

Select 2 answers

A.Enable S3 Transfer Acceleration on the bucket.

B.Use client-side encryption before uploading objects to S3.

C.Configure the bucket to use SSE-KMS.

D.Enable default encryption on the bucket using SSE-S3.

E.Attach a bucket policy that denies unencrypted PUT requests.

AnswersC, D

SSE-KMS encrypts objects at rest using AWS KMS keys.

Why this answer

Option C is correct because SSE-KMS (Server-Side Encryption with AWS Key Management Service) encrypts data at rest in S3 by using a KMS key to manage encryption keys. This provides envelope encryption, where a CMK generates a data key that encrypts the object, and the data key is then encrypted by the CMK. Option D is correct because enabling default encryption on an S3 bucket using SSE-S3 (AES-256) ensures that all objects uploaded without explicit encryption headers are automatically encrypted at rest by S3's managed key.

Exam trap

The trap here is that candidates confuse enforcing encryption (via bucket policies) with actually performing encryption, or they mistakenly think client-side encryption is a bucket-level action rather than a client-side responsibility.

Full explanation →

723

Multi-Selecthard

A company uses Amazon Redshift for its data warehouse. The cluster has multiple node types and is configured with automated snapshots. The company needs to ensure high availability and disaster recovery across AWS Regions. Which THREE actions should the company take to meet these requirements? (Choose THREE.)

Select 3 answers

A.Enable automated snapshots with a retention period of at least 1 day.

B.Restore a snapshot from the secondary Region in the event of a disaster.

C.Create manual snapshots on a daily basis and copy them to another Region.

D.Configure the cluster to use multiple Availability Zones (multi-AZ) for high availability.

E.Configure cross-Region snapshot copy to replicate snapshots to another Region.

AnswersB, D, E

Restoring from a cross-Region snapshot provides DR capability.

Why this answer

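Options B, D, and E are correct. Configuring cross-Region snapshot copy (E) replicates snapshots to a second Region, restoring from a copied snapshot there (B) recovers the cluster after a disaster, and a Multi-AZ deployment (D), where the node type supports it, provides high availability. Option A is wrong because automated snapshots are already configured, and snapshots kept only in the source Region do not protect against a Regional failure.

Option C is wrong because daily manual snapshots add operational work that cross-Region copy of the automated snapshots already covers.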

Full explanation →

724

MCQmedium

A company is using AWS Lake Formation to manage access to data in a data lake stored in Amazon S3. A data engineer notices that users with SELECT permissions on a table can still query the underlying S3 data directly using Athena. What is the most likely cause?

A.The S3 bucket policy allows full access to all principals

B.The users are using a version of Athena that does not support Lake Formation

C.The S3 bucket does not have server-side encryption enabled

D.Lake Formation does not support integration with Athena

AnswerB

Older Athena versions do not enforce Lake Formation permissions.

Why this answer

Option B is correct because Athena enforces Lake Formation permissions only when the engine supports the integration, which this explanation ties to Athena engine version 2 or later. Option A is wrong because Lake Formation can work alongside S3 bucket policies. Option C is wrong because the issue is about access control, not encryption.

Option D is wrong because Lake Formation does integrate with Athena.

Full explanation →

725

MCQmedium

A data engineer is designing a data lake on Amazon S3. The data includes sensitive customer information that must be encrypted at rest. Which combination of actions meets this requirement with minimal operational overhead?

A.Enable default encryption on the S3 bucket using SSE-S3

B.Encrypt objects client-side before uploading

C.Use an S3 bucket policy to deny writes without encryption

D.Use an S3 Lifecycle policy to transition to Glacier

AnswerA

Default encryption ensures all new objects are encrypted automatically.

Why this answer

SSE-S3 provides server-side encryption with Amazon S3-managed keys, which encrypts data at rest with minimal operational overhead because AWS handles key management, rotation, and encryption/decryption transparently. Enabling default encryption on the S3 bucket ensures that all objects written to the bucket are automatically encrypted without requiring any client-side changes or additional code, meeting the requirement with the least administrative effort.

Exam trap

The trap here is that candidates often confuse 'enforcing encryption via bucket policy' (Option C) with 'automatically encrypting data' — the policy only denies unencrypted writes but does not reduce operational overhead because the client must still implement encryption logic.

How to eliminate wrong answers

Option B is wrong because client-side encryption requires the data engineer to manage encryption keys and perform encryption/decryption in the application code, adding significant operational overhead compared to server-side encryption. Option C is wrong because an S3 bucket policy that denies writes without encryption only enforces encryption at upload time but does not itself encrypt the data; it relies on the client to provide encryption headers, which still requires client-side logic and does not reduce overhead. Option D is wrong because an S3 Lifecycle policy to transition to Glacier only moves data to a different storage class for cost optimization, it does not provide encryption at rest; Glacier itself supports encryption but the lifecycle policy does not enable or enforce it.

Full explanation →

726

MCQeasy

A company needs to migrate an on-premises 10 TB PostgreSQL database to Amazon RDS for PostgreSQL with minimal downtime. Which AWS service should be used for the migration?

A.AWS Storage Gateway

B.AWS Snowball Edge

C.AWS DataSync

D.AWS Database Migration Service (DMS)

AnswerD

DMS supports continuous replication.

Why this answer

Option D is correct because AWS DMS supports ongoing replication, which keeps the target in sync until cutover and minimizes downtime. Option A is wrong because AWS Storage Gateway provides hybrid storage access, not database migration. Option B is wrong because Snowball Edge is for large offline data transfers and does not support ongoing replication.

Option C is wrong because AWS DataSync transfers files and objects, not live database changes.

Full explanation →

727

Multi-Selecthard

A company runs a data processing pipeline using Amazon EMR with Spark. The pipeline reads from S3, processes data, and writes to S3. Recently, the job started failing with 'S3AccessDeniedException' even though the EMR role has appropriate S3 permissions. Which TWO actions should the data engineer take to resolve this issue? (Choose TWO.)

Select 2 answers

A.Enable S3 versioning on the bucket to allow multiple access methods.

B.Verify that the EMR service role has the necessary S3 permissions in IAM.

C.Disable S3 Block Public Access settings on the bucket.

D.Check the S3 bucket policy for explicit deny statements that may override the IAM role.

E.Ensure the EMR cluster is launched in a VPC with an S3 VPC endpoint.

AnswersB, D

The role that the cluster uses to reach S3 (by default the EC2 instance profile, EMR_EC2_DefaultRole) must have the required permissions.

Why this answer

Options B and D are correct. B: the role used by the cluster needs an IAM policy that grants the required S3 actions; verifying it confirms that the permissions are actually in place. D: S3 bucket policies can explicitly deny access, for example by source IP address or VPC endpoint, and an explicit Deny overrides the role's Allow.

Option A is wrong because S3 versioning does not affect access permissions. Option C is wrong because S3 Block Public Access does not affect IAM-based access. Option E is wrong because an S3 VPC endpoint is not required when the cluster can reach S3 over the public internet.

Full explanation →

728

MCQmedium

A data engineer is troubleshooting an Amazon RDS for MySQL instance that is experiencing high read latency. The instance is a Single-AZ db.r5.large with 100 GB of General Purpose (gp2) storage. Which action is most likely to reduce read latency?

A.Create a read replica and direct read queries to it.

B.Enable automatic backups with a 7-day retention.

C.Increase the allocated storage to 200 GB.

D.Convert the instance to a Multi-AZ deployment.

AnswerA

Offloads read traffic, reducing load on the primary.

Why this answer

Creating a read replica offloads SELECT queries from the primary instance, directly reducing the read load and thus read latency. Since the instance is Single-AZ and experiencing high read latency, a read replica distributes the read traffic without altering the existing storage or availability configuration.

Exam trap

The trap here is that candidates often confuse Multi-AZ with read replicas, assuming Multi-AZ provides read scaling, but Multi-AZ only provides a standby for failover, not a read endpoint.

How to eliminate wrong answers

Option B is wrong because enabling automatic backups with a 7-day retention does not reduce read latency; backups consume I/O and CPU resources, potentially increasing latency. Option C is wrong because increasing allocated storage to 200 GB on gp2 improves baseline IOPS (from 300 to 600) but does not directly address high read latency caused by read workload saturation; the issue is read demand, not storage throughput. Option D is wrong because converting to Multi-AZ provides high availability and failover support but does not reduce read latency; the standby in a Multi-AZ DB instance deployment does not serve read traffic, so read scaling still requires a read replica.
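The 300 and 600 IOPS figures follow from the gp2 baseline rule of 3 IOPS per GiB, with a floor of 100 and a cap of 16,000 per volume. A minimal sketch of that arithmetic, assuming a single volume (the function name is illustrative):

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for one gp2 volume: 3 IOPS per GiB, min 100, max 16,000."""
    return max(100, min(3 * size_gib, 16_000))


current = gp2_baseline_iops(100)   # the 100 GB volume in the question
doubled = gp2_baseline_iops(200)   # after increasing allocated storage
```

Doubling the storage doubles the baseline, but it does nothing to spread read queries across more database instances, which is what a read replica does.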

Full explanation →

729

MCQhard

A social media company ingests user activity data from multiple sources into Amazon S3. The data is in JSON format and includes fields: user_id, activity_type, timestamp, and metadata. The company wants to transform this data into a columnar format (Parquet) partitioned by date and activity_type for efficient querying with Amazon Athena. The pipeline must handle data that arrives up to 3 days late. Currently, a daily AWS Glue ETL job scans the entire S3 bucket for new files, transforms them, and writes to a separate output bucket. The job is taking longer as data volume grows, and the team wants to reduce processing time and cost. What should the engineer do?

A.Increase the number of DPUs for the Glue job to process data faster.

B.Use AWS Glue partition projection and schema inference to reduce scan time.

C.Replace AWS Glue with Amazon EMR and use Spark to process data in parallel.

D.Set up S3 event notifications to invoke an AWS Lambda function that triggers a Glue job for each new object, passing the object key so the job processes only that file.

AnswerD

This enables incremental processing, reduces scan time, and is cost-effective.

Why this answer

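Option D is correct: S3 event notifications invoke a Lambda function that starts a Glue job for each new file, so the pipeline processes data incrementally instead of rescanning the whole bucket. Late-arriving files are handled the same way whenever they land. Option A (more DPUs) raises cost without addressing the root cause, which is the full-bucket scan. Option C (EMR) adds operational complexity and does not remove the scan either.

Option B (partition projection) affects how partitions are resolved at query time; it does not help with transformation.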
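A minimal sketch of the event-driven approach follows. The Glue job name and the `--source_bucket` / `--source_key` argument names are hypothetical; the Glue client is injectable so the logic can be exercised without AWS access.

```python
import urllib.parse

GLUE_JOB_NAME = "json-to-parquet"  # hypothetical Glue job name


def build_job_arguments(event):
    """Return one Glue argument dict per object in an S3 event notification."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        runs.append({"--source_bucket": bucket, "--source_key": key})
    return runs


def handler(event, context, glue=None):
    """Lambda entry point: start one Glue job run per new object."""
    if glue is None:
        import boto3  # imported lazily so the helper above works without boto3
        glue = boto3.client("glue")
    run_ids = []
    for arguments in build_job_arguments(event):
        response = glue.start_job_run(JobName=GLUE_JOB_NAME, Arguments=arguments)
        run_ids.append(response["JobRunId"])
    return run_ids
```

The Glue script would then read only the object named in its arguments rather than listing the bucket.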

Full explanation →

730

MCQeasy

A company needs to ingest data from an on-premises SQL Server database into Amazon Redshift. The data volume is less than 1 TB and the network bandwidth is limited. Which AWS service should be used for the initial full load?

A.AWS Snowball Edge

B.AWS Database Migration Service (DMS)

C.Amazon S3 Transfer Acceleration

D.AWS Direct Connect

AnswerB

Designed for database migration with limited bandwidth.

Why this answer

Option B is correct because AWS DMS is designed for migrating databases to AWS, including to Redshift. Option A (AWS Snowball) is for large data volumes (petabytes) and not efficient for <1 TB. Option C (Amazon S3 Transfer Acceleration) speeds up uploads to S3 but not directly to Redshift.

Option D (AWS Direct Connect) is a network connection, not a migration service.

Full explanation →

731

MCQeasy

A company is using Amazon S3 to store log files. The security team requires that all data be encrypted in transit. Which of the following ensures encryption in transit for S3?

A.Use HTTPS (SSL/TLS) when accessing S3 endpoints.

B.Use Amazon S3 Transfer Acceleration.

C.Enable client-side encryption before uploading to S3.

D.Use server-side encryption with S3 managed keys (SSE-S3).

AnswerA

HTTPS encrypts data in transit between client and S3.

Why this answer

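Encryption in transit for S3 is achieved by using HTTPS (SSL/TLS) when accessing S3 endpoints, so Option A is correct. SSE-S3 (Option D) is at-rest encryption.

Client-side encryption (Option C) protects the object before upload and is likewise an at-rest control. S3 Transfer Acceleration (Option B) speeds up transfers; it is not an encryption control. VPC endpoints provide private connectivity but not encryption by default; you still need to use HTTPS.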
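Beyond using HTTPS on the client side, a bucket policy can reject any request that does not arrive over TLS by testing the `aws:SecureTransport` condition key. The sketch below builds such a policy document; the bucket name is a placeholder.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies every S3 action sent without TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket itself and the objects inside it.
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


policy = deny_insecure_transport_policy("example-log-bucket")  # placeholder name
print(json.dumps(policy, indent=2))
```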

Full explanation →

732

MCQeasy

A company stores its application logs in Amazon S3. The logs are generated daily and need to be retained for 3 years for compliance. The logs are accessed frequently for the first 30 days, occasionally for the next 6 months, and rarely after that. The data engineering team wants to minimize storage costs while ensuring that logs are available for retrieval within 12 hours for the first 6 months and within 48 hours after that. The team also wants to automatically delete logs after 3 years. Which lifecycle policy should the team implement?

A.Transition to S3 One Zone-IA after 30 days, delete after 6 months.

B.Transition to S3 Standard-IA after 30 days, delete after 3 years.

C.Transition to S3 Standard-IA after 30 days, to S3 Glacier after 6 months, delete after 3 years.

D.Transition to S3 Standard-IA after 30 days, to S3 Glacier Deep Archive after 6 months, delete after 3 years.

AnswerD

Meets cost and retrieval time requirements.

Why this answer

Option D is correct because it aligns with the access patterns and retrieval requirements: S3 Standard-IA after 30 days for occasional access with immediate retrieval, then S3 Glacier Deep Archive after 6 months for rare access, where standard retrieval completes within 12 hours and bulk retrieval within 48 hours, and deletion after 3 years. This minimizes storage costs while meeting the 48-hour retrieval window for older logs.

Exam trap

The trap here is that candidates may choose Option C (S3 Glacier) thinking it is the cheapest cold storage, but S3 Glacier Deep Archive is actually the lowest-cost option for data that is rarely accessed and can tolerate a 12-hour retrieval time, which still satisfies the 48-hour requirement.

How to eliminate wrong answers

Option A is wrong because it transitions to S3 One Zone-IA after 30 days, which keeps data in a single Availability Zone and is a poor fit for compliance logs, and it deletes after 6 months instead of 3 years. Option B is wrong because it keeps logs in S3 Standard-IA for the entire 3 years, which is more expensive than transitioning to a colder storage class after 6 months, and it does not meet the cost-minimization goal. Option C is wrong because it transitions to S3 Glacier after 6 months, which has a retrieval time of 1-5 minutes for expedited or 3-5 hours for standard; that easily meets the 48-hour requirement but is not the most cost-effective option, since S3 Glacier Deep Archive is cheaper and still meets the 48-hour retrieval window.
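The Standard-IA, then Deep Archive, then expire policy can be written as a lifecycle configuration in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. This is a sketch under stated assumptions: six months is taken as 180 days, three years as 1,095 days, and the rule ID is invented.

```python
def log_lifecycle_configuration() -> dict:
    """Lifecycle rules: Standard-IA at 30 days, Deep Archive at ~6 months, delete at 3 years."""
    return {
        "Rules": [{
            "ID": "tier-then-expire-logs",      # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},           # apply to every object in the bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},  # assumes 6 months = 180 days
            ],
            "Expiration": {"Days": 3 * 365},    # 1,095 days, leap days ignored
        }]
    }


config = log_lifecycle_configuration()
```

The dictionary would be passed as the `LifecycleConfiguration` argument together with the bucket name.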

Full explanation →

733

MCQhard

A data engineer is tasked with implementing data masking for a non-production environment. The source data contains credit card numbers stored in an Amazon RDS for PostgreSQL database. The engineer wants to automatically mask the credit card numbers when copying data to the non-production database. Which AWS service can be used to achieve this?

A.AWS Database Migration Service (DMS)

B.AWS Glue

C.AWS Lake Formation

D.Amazon Athena

AnswerA

DMS supports transformation rules that can mask columns during migration.

Why this answer

AWS DMS can transform data during migration using transformation rules. It can mask data by replacing columns with predefined values. Glue is for ETL, but DMS is purpose-built for database migrations with transformations.

Lake Formation is for data lake permissions. Athena is for querying S3 data.

Full explanation →

734

MCQeasy

A data engineer is troubleshooting a failed AWS Glue ETL job. The job reads from an S3 bucket and writes to an RDS MySQL database. The job fails with an 'Access Denied' error when trying to write to RDS. What is the most likely cause?

A.The IAM role associated with the Glue job does not have the necessary permissions to write to the RDS instance.

B.The Glue job is running in a VPC without a route to the internet.

C.The S3 bucket policy does not allow the Glue job to read the data.

D.The RDS instance is encrypted with a KMS key that the Glue job cannot access.

AnswerA

IAM role needs RDS write permissions.

Why this answer

The error 'Access Denied' when writing to RDS indicates that the AWS Glue job's IAM role lacks the necessary permissions (e.g., rds-db:connect, or specific database-level GRANTs) to perform write operations on the RDS MySQL instance. AWS Glue uses the attached IAM role to authenticate and authorize actions against AWS services, and without proper IAM policies allowing access to the RDS resource, the write attempt is denied.

Exam trap

The trap here is that candidates often confuse network connectivity issues (Option B) with authorization errors, but 'Access Denied' signals an authorization failure (a lack of permissions), not a network problem.

How to eliminate wrong answers

Option B is wrong because a missing route to the internet would cause a network connectivity timeout or 'connection refused' error, not an 'Access Denied' error, which is an authorization failure. Option C is wrong because the error occurs when writing to RDS, not when reading from S3; an S3 bucket policy issue would produce an S3-specific 'Access Denied' error during the read phase. Option D is wrong because if the RDS instance is encrypted with a KMS key that the Glue job cannot access, the error would typically be a 'KMS access denied' or 'encryption key unavailable' error, not a generic 'Access Denied' for writing to RDS.

Full explanation →

735

MCQhard

A data engineer is designing a data lake on Amazon S3. The compliance team requires that objects be automatically deleted after 7 years. Additionally, objects must be transitioned to Amazon S3 Glacier Instant Retrieval after 30 days to reduce costs. Which S3 lifecycle policy configuration meets these requirements?

A.Transition to Glacier Instant Retrieval after 30 days, then expire after 90 days.

B.Transition to Glacier Instant Retrieval after 30 days, then expire after 2555 days.

C.Transition to Glacier Deep Archive after 30 days, then expire after 7 years.

D.Transition to S3 Standard-IA after 30 days, then expire after 7 years.

AnswerB

2555 days is approximately 7 years.

Why this answer

Option B is correct because it transitions objects to Glacier Instant Retrieval after 30 days and then permanently deletes them after 2555 days (7 × 365, approximately 7 years). Option C is wrong because Glacier Deep Archive is not Glacier Instant Retrieval. Option D is wrong because it transitions to S3 Standard-IA, not Glacier Instant Retrieval.

Option A is wrong because it deletes after 90 days, not 7 years.
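The 2555-day figure is simply seven 365-day years; a calendar span of seven years is usually a day or two longer because of leap days. The sketch below shows the arithmetic and places it in a lifecycle rule; the rule ID and helper names are illustrative.

```python
from datetime import date


def years_to_days(years: int) -> int:
    """Convert years to days using 365-day years (leap days ignored)."""
    return years * 365


def retention_rule(transition_days: int, storage_class: str, retention_years: int) -> dict:
    """One lifecycle rule: a single transition followed by expiration."""
    return {
        "ID": "transition-then-expire",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Transitions": [{"Days": transition_days, "StorageClass": storage_class}],
        "Expiration": {"Days": years_to_days(retention_years)},
    }


rule = retention_rule(30, "GLACIER_IR", 7)

# For comparison: the exact day count of one particular 7-year calendar span.
calendar_days = (date(2031, 1, 1) - date(2024, 1, 1)).days
```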

Full explanation →

736

MCQeasy

A company wants to import data from an external FTP server into Amazon S3 on a daily basis. The data volumes are moderate. Which AWS service is MOST suitable for this task?

A.Amazon S3 Transfer Acceleration

B.AWS Transfer Family

C.AWS DataSync

D.AWS Glue with a JDBC connection

AnswerB

Transfer Family supports FTP/SFTP/FTPS and directly writes to S3.

Why this answer

Option B is correct because AWS Transfer Family provides fully managed support for SFTP, FTPS, and FTP, enabling direct transfer to S3. Option D is wrong because Glue cannot connect to FTP directly without a custom connector. Option C is wrong because DataSync is for moving data between on-premises and AWS, but it does not support FTP servers.

Option A is wrong because S3 Transfer Acceleration is for speeding up uploads to S3 from clients, not for fetching from FTP.

Full explanation →

737

MCQmedium

A data engineer is building a real-time data pipeline to ingest sensor data from IoT devices. The data is sent to AWS IoT Core, which publishes messages to a Kinesis Data Stream. Each message is about 1 KB in size. The data must be transformed (add a device location field) and then stored in Amazon S3 for long-term analytics. The engineer has set up a Lambda function to transform the records and write to S3. However, the engineer notices that the Lambda function is invoked thousands of times per second, causing high costs and occasional throttling. The Lambda function processes only one record at a time. The engineer wants to reduce the number of Lambda invocations and improve throughput. What should the engineer do?

A.Reduce the number of shards in the Kinesis stream to limit concurrency.

B.Increase the Lambda function's memory allocation to improve performance.

C.Replace the Lambda function with Amazon Kinesis Data Firehose and use its built-in transformation.

D.Configure the event source mapping to use a larger batch size and set a batch window.

AnswerD

Larger batch size and batch window reduce number of invocations.

Why this answer

Option D is correct because a larger batch size in the event source mapping lets Lambda process many records per invocation, and a batch window lets it wait for the batch to fill, which reduces invocations. Option B is wrong because increasing Lambda memory does not reduce invocations. Option C is wrong because Firehose can batch records, but adding a field would still require a Lambda function through Firehose's data transformation integration.

Option A is wrong because reducing shards lowers stream throughput and parallelism, which would cause throttling and backlogs.
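With a larger batch, the handler receives many Kinesis records in one event and should loop over all of them. The sketch below assumes each payload is JSON with a `device_id` field and uses a hypothetical in-memory lookup table for locations; the batching itself is set on the event source mapping through `BatchSize` and `MaximumBatchingWindowInSeconds`.

```python
import base64
import json

# Hypothetical lookup table; a real pipeline might read this from a database.
DEVICE_LOCATIONS = {"sensor-1": "warehouse-a"}


def transform_batch(event):
    """Decode every Kinesis record in the batch and add a device_location field."""
    transformed = []
    for record in event["Records"]:
        # Kinesis record payloads are base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        payload["device_location"] = DEVICE_LOCATIONS.get(payload.get("device_id"), "unknown")
        transformed.append(payload)
    return transformed


def handler(event, context):
    """Process the whole batch and build one newline-delimited JSON body."""
    transformed = transform_batch(event)
    body = "\n".join(json.dumps(item) for item in transformed)
    # The body could be written to S3 with a single PutObject per invocation.
    return {"record_count": len(transformed), "body": body}
```

Writing one object per batch rather than one per record also cuts the number of S3 requests.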

Full explanation →

738

Multi-Selectmedium

Which TWO actions should a data engineer take to optimize Amazon S3 query performance for Amazon Athena when dealing with large Parquet files? (Choose 2.)

Select 2 answers

A.Store data in a single large file without partitioning

B.Use GZIP compression on the Parquet files

C.Split large files into many small files

D.Optimize file sizes to be around 64 MB to 256 MB

E.Partition the data by frequently filtered columns

AnswersD, E

Optimal file size improves parallelism and performance.

Why this answer

D and E are correct: partitioning by frequently filtered columns reduces the data scanned, and, since the data is already in Parquet, keeping file sizes in the 64 MB to 256 MB range further improves parallelism. B (GZIP compression) is not a targeted optimization here because compression is already assumed in Parquet. C (many small files) actually hurts performance.

A (a single large file without partitioning) is not an optimization.

Full explanation →

739

MCQmedium

A data engineer notices that an AWS Glue ETL job is running slower than expected. The job reads from Amazon S3, joins two datasets, and writes the result back to S3. The job uses the default worker type (G.1X) and 10 DPUs. Which action is most likely to improve performance?

A.Increase the number of DPUs to 20

B.Repartition the data before the join operation

C.Use coalesce to reduce the number of output files

D.Change the worker type to G.2X

AnswerB

Optimizes parallelism and reduces shuffling.

Why this answer

Option B is correct because increasing the number of partitions (via repartition) can improve parallelism and performance when the data is skewed. Option A (increasing DPUs) may help but is not the most targeted fix. Option C (using coalesce) reduces partitions, which can hurt performance.

Option D (changing to G.2X) provides more memory per worker but may not address the core issue of partition skew.

Full explanation →

740

MCQhard

A data engineer needs to share a dataset stored in an S3 bucket with a partner AWS account. The partner should be able to read the data without needing to authenticate with the engineer's account. The engineer must not share any secret keys. Which approach should be used?

A.Write a bucket policy that grants access to the partner account's IAM role.

B.Generate presigned URLs and share them with the partner.

C.Make the bucket publicly readable.

D.Create an IAM user with access keys and share them with the partner.

AnswerA

Bucket policy can grant cross-account access securely.

Why this answer

Option A is correct because S3 bucket policies can grant cross-account access to a specific IAM role in the partner account. Option B is wrong because presigned URLs are temporary and must be generated and redistributed by the engineer. Option C is wrong because making the bucket public violates security.

Option D is wrong because sharing access keys is insecure and against best practices.
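A cross-account grant of this kind can be sketched as a bucket policy document. The bucket name, account ID, and role name below are placeholders; note that the partner's role also needs an IAM policy in its own account allowing the same actions.

```python
def cross_account_read_policy(bucket_name: str, partner_role_arn: str) -> dict:
    """Bucket policy letting one role in another account list and read objects."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    principal = {"AWS": partner_role_arn}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "PartnerListBucket",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
            {   # Reading is an object-level action, so it targets the objects.
                "Sid": "PartnerReadObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


policy = cross_account_read_policy(
    "example-shared-dataset",                           # placeholder bucket
    "arn:aws:iam::111122223333:role/PartnerAnalytics",  # placeholder role ARN
)
```

No access keys change hands: the partner authenticates as its own role, and the bucket policy decides what that role may read.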

Full explanation →

741

MCQhard

A healthcare organization uses AWS Lake Formation to manage a data lake in Amazon S3. The data lake contains sensitive patient information that must be encrypted at rest. The organization uses AWS KMS with a customer-managed key (CMK) for encryption. Recently, the security team noticed that a new IAM user was able to query the data lake using Amazon Athena without explicit permissions in Lake Formation. The data lake administrator suspects that the IAM user might have been granted access through an IAM policy that allows 'lakeformation:GetDataAccess' without proper resource restrictions. The organization wants to enforce that only Lake Formation permissions control access to the data lake, and IAM policies should not grant access directly. What should they do?

A.Change the KMS key policy to require that any request to decrypt data must come from the Lake Formation service role.

B.Revoke the 'lakeformation:GetDataAccess' permission from all IAM users and groups, and require that access be granted only through Lake Formation permissions.

C.Remove the IAM policy that grants 'lakeformation:GetDataAccess' from the specific user and ensure Lake Formation permissions are correctly set.

D.Add an S3 bucket policy that denies all principals except the Lake Formation service role.

AnswerB

This ensures that only Lake Formation permissions control data access.

Why this answer

Option B is correct because the 'lakeformation:GetDataAccess' permission is required for principals to access data through Lake Formation, and revoking it for all IAM users forces them to rely solely on Lake Formation permissions. Option C is wrong because IAM policies for Lake Formation actions (like GetDataAccess) can grant access to data lake resources; removing them from the specific user is not enough. Option D is wrong because S3 bucket policies would bypass Lake Formation's fine-grained access control.

Option A is wrong because KMS keys do not control data access permissions; they control encryption.

Full explanation →

742

MCQmedium

A data engineer needs to store semi-structured JSON data from IoT devices. The data is written once, read rarely, but must be queryable using SQL. The storage cost must be minimized. Which storage solution should the engineer choose?

A.Store JSON in Amazon Redshift as SUPER data type

B.Store JSON in an Amazon RDS for MySQL table

C.Store JSON documents in Amazon DynamoDB and use PartiQL for queries

D.Store JSON files in Amazon S3 and use Amazon Athena for queries

AnswerD

S3 provides low-cost storage; Athena enables SQL querying over JSON.

Why this answer

Amazon S3 provides the lowest-cost storage for data that is written once and rarely read, while Amazon Athena enables serverless SQL querying directly on JSON files stored in S3. This combination minimizes storage costs because S3 charges only for the data stored and retrieval, with no minimum fees or provisioning required, and Athena charges only for the data scanned per query. The workload's write-once, read-rarely pattern aligns perfectly with S3's durability and lifecycle policies, making it the most cost-effective choice.

Exam trap

The trap here is that candidates often choose DynamoDB (Option C) because it supports JSON natively and PartiQL provides SQL-like queries, but they overlook that DynamoDB's provisioned throughput and storage costs are significantly higher than S3 for write-once, read-rarely workloads, and that Athena on S3 is the serverless, cost-optimized solution for ad-hoc SQL queries on infrequently accessed data.

How to eliminate wrong answers

Option A is wrong because Amazon Redshift is a petabyte-scale data warehouse designed for high-performance analytics on structured and semi-structured data, but it incurs significant costs for provisioned clusters even when data is rarely queried, making it unsuitable for minimizing storage costs. Option B is wrong because Amazon RDS for MySQL is a relational database that requires provisioning and paying for a database instance 24/7, and storing JSON in a MySQL table incurs overhead for indexing and transactions that are unnecessary for write-once, read-rarely data. Option C is wrong because Amazon DynamoDB is a NoSQL key-value and document database optimized for low-latency, high-throughput workloads, but its storage costs are higher than S3 for rarely accessed data, and PartiQL queries on DynamoDB still consume read capacity units, leading to ongoing costs that exceed S3+Athena for infrequent queries.

Full explanation →

743

MCQmedium

Refer to the exhibit. A data engineer is configuring an IAM policy for an AWS Glue ETL job that reads data from the 'my-data-bucket' S3 bucket, transforms it, and writes the output back to the same bucket. The engineer wants to prevent accidental deletion of objects. Based on the policy, which statement is true about the Glue job's permissions?

A.The job can write objects but cannot read objects.

B.The job can read objects but cannot write objects.

C.The job can read and write, but may also delete objects.

D.The job can read and write objects, but cannot delete objects.

AnswerD

Get and Put allowed; Delete denied.

Why this answer

Option D is correct. The policy allows s3:GetObject and s3:PutObject, and explicitly denies s3:DeleteObject. Option A is wrong because the job can also read.

Option B is wrong because the job can also write. Option C is wrong because the explicit Deny on s3:DeleteObject overrides any Allow.

Full explanation →

744

MCQeasy

A data engineer is ingesting streaming data from an IoT fleet into Amazon S3 using Amazon Kinesis Data Firehose. The data arrives as JSON, but the downstream analytics require Parquet format. Which Firehose transformation should the engineer configure?

A.Use an S3 lifecycle policy to convert JSON to Parquet.

B.Configure a Lambda function as a data transformation in Firehose to convert JSON to Parquet.

C.Use S3 Batch Operations to convert existing JSON objects to Parquet.

D.Use Kinesis Data Analytics to convert the stream to Parquet before writing to S3.

AnswerB

Lambda can transform data format during delivery.

Why this answer

Option B is correct because Kinesis Data Firehose can convert JSON to Parquet using an AWS Lambda transformation. Option A is wrong because S3 lifecycle policies do not transform data format. Option D is wrong because Kinesis Data Analytics performs real-time analytics, not format conversion.

Option C is wrong because S3 Batch Operations process existing objects, not streaming ingestion.

Full explanation →

745

MCQeasy

A company wants to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is 500 GB per transfer. Which AWS service is most appropriate for this batch ingestion?

A.AWS Database Migration Service (DMS)

B.AWS Data Pipeline

C.Amazon Kinesis Data Firehose

D.AWS Glue

AnswerD

Glue can run scheduled crawlers and ETL jobs for batch ingestion.

Why this answer

Option D is correct because AWS Glue can run scheduled ETL jobs to extract data from JDBC sources like Oracle and write to S3. Option A (DMS) is for ongoing replication, not scheduled batch. Option C (Firehose) is for streaming, not batch.

Option B (Data Pipeline) is a legacy service.

Full explanation →

746

MCQhard

Refer to the exhibit. A data engineer is reviewing the configuration of an Amazon Redshift cluster. The engineer wants to ensure that the cluster can be restored to a point in time up to 35 days in the past. Based on the exhibit, what change is needed?

A.Increase the automated snapshot retention period to 35 days.

B.Change the cluster subnet group to a custom one.

C.Enable encryption on the cluster.

D.Increase the number of nodes to 6.

AnswerA

Current retention is 1 day.

Why this answer

Option A is correct. The AutomatedSnapshotRetentionPeriod is set to 1 day, which allows only 1 day of point-in-time recovery. To support 35 days, this value must be increased to 35.

Option C is incorrect because the cluster is already encrypted. Option D is incorrect because the number of nodes does not affect snapshot retention. Option B is incorrect because the subnet group does not affect retention.

Full explanation →

747

MCQmedium

A data engineer is building a data ingestion pipeline that reads JSON files from Amazon S3 and loads them into an Amazon Redshift table using COPY commands. The files are gzip compressed and contain nested JSON. The engineer wants to minimize transformation steps. Which approach should the engineer use?

A.Use Amazon Athena to query the JSON and INSERT INTO Redshift.

B.Use Kinesis Data Firehose to transform and load into Redshift.

C.Use the COPY command with the 'auto' option to ingest JSON directly.

D.Use AWS Glue ETL to flatten the JSON and write to S3 as CSV, then COPY from CSV.

AnswerC

COPY with 'auto' automatically parses JSON.

Why this answer

Option C is correct because Redshift COPY with the 'auto' option matches JSON field names to table column names and loads the files directly from S3, with no intermediate transformation step. Option D is wrong because Glue ETL is an unnecessary intermediate step. Option B is wrong because Kinesis Data Firehose is for streaming, not batch.

Option A is wrong because Athena requires schema definition and is not a direct load to Redshift.

Full explanation →

748

MCQeasy

A data engineer is troubleshooting an AWS Glue ETL job that uses a Python shell script to extract data from an Amazon RDS for PostgreSQL database and load it into an Amazon Redshift table. The job runs successfully, but the data engineer notices that the row count in Redshift is consistently lower than the row count in PostgreSQL. The job uses a SELECT * query without any filtering. The data engineer suspects that some rows are being dropped during the transfer. The job uses the psycopg2 library to connect to PostgreSQL and the psycopg2 connection is configured with autocommit=True. The Redshift table has no constraints that would reject rows. What is the most likely cause of the missing rows?

A.The SELECT * query includes columns with data types that are not supported by psycopg2.

B.The SSL/TLS connection to PostgreSQL is dropping packets.

C.The autocommit=True setting is causing incomplete transactions.

D.The Redshift table has a distribution key that causes some rows to be silently discarded.

AnswerA

Unsupported data types may cause rows to be skipped or nullified.

Why this answer

Option A is correct. Of the choices offered, the most plausible cause is that the SELECT * query returns columns whose data types or values are not handled correctly during the transfer, so the affected rows are skipped or nullified. A Python shell job runs on a single executor, so rows being split across executors is not a factor.

Option C is wrong because autocommit=True should not cause data loss. Option B is wrong because SSL/TLS is about encryption, not row count. Option D is wrong because a distribution key only determines how rows are placed across the cluster and never discards rows, and the Redshift table has no constraints that would reject them.

Full explanation →

749

Multi-Selecthard

Which THREE are valid considerations when troubleshooting data loss in an AWS Glue ETL job? (Choose three.)

Select 3 answers

A.Job bookmarks may be skipping new data if not configured properly.

B.Server-side encryption is disabled on the S3 bucket.

C.The job timeout is set too low.

D.Dynamic frame transformations may drop rows with errors.

E.The mapping of source columns to target columns may be incorrect.

AnswersA, D, E

Bookmarks control reprocessing.

Why this answer

Options A, D, and E are correct. Job bookmarks can skip data, incorrect mapping can lose columns, and dynamic frame transformations can drop rows if errors occur. Option B is wrong because disabling encryption does not cause data loss.

Option C is wrong because a timeout that is too low stops the job with a visible failure rather than silently dropping data.

Full explanation →

750

Multi-Selecteasy

A data engineer is setting up a data pipeline using Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed using an AWS Lambda function before delivery. Which THREE steps are required to configure this?

Select 3 answers

A.Create a Lambda@Edge function in the same Region.

B.Create an AWS Lambda function that transforms the data.

C.Attach an IAM role to the Firehose delivery stream that grants permission to invoke the Lambda function.

D.Configure an S3 event notification to trigger the Lambda function when new data arrives.

E.Configure the Kinesis Data Firehose delivery stream to use the Lambda function as a data transformation source.

AnswersB, C, E

A Lambda function is needed to perform the transformation logic.

Why this answer

Options B, C, and E are correct. The Lambda function must be created (B), the IAM role must allow Firehose to invoke Lambda (C), and Firehose must be configured to invoke the Lambda function (E). Option A is wrong because Lambda@Edge is for CloudFront.

Option D is wrong because S3 events are not used for Firehose transformation.
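A Firehose transformation function receives a batch of base64-encoded records and must return each one with the same `recordId`, a `result` status, and re-encoded `data`. The sketch below follows that contract; the transformation itself (adding a `processed` flag) is only a placeholder.

```python
import base64
import json


def handler(event, context):
    """Firehose data-transformation Lambda: return every record with a status."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # placeholder transformation
            line = json.dumps(payload) + "\n"  # newline-delimit records in S3
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, TypeError):
            # Malformed input: hand the original bytes back marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records returned as `ProcessingFailed` are treated by Firehose as unsuccessfully processed instead of being delivered as transformed data.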

Full explanation →
