Knowledge + Practice

AWS Certified Data Engineer Associate DEA-C01 (DEA-C01) — Questions 1426–1500

1786 questions total · 24 pages · All types, answers revealed

1426

MCQmedium

A data engineer needs to share an S3 bucket with another AWS account. They want to ensure that the objects in the bucket remain encrypted with SSE-KMS using a customer managed key. What additional step is required for cross-account access?

A.Modify the KMS key policy to grant the target account kms:Decrypt permission

B.Add an IAM policy in the target account to allow kms:Decrypt

C.Disable SSE-KMS encryption on the bucket

D.Add a bucket policy that grants the target account s3:GetObject

AnswerA

The KMS key policy must allow the target account to use the key for decryption.

Why this answer

When using SSE-KMS with a customer managed key, the key policy must grant the target account (or specific IAM roles or users in it) permission to use the key. An S3 bucket policy and IAM permissions in the target account are also required, but the key policy is the additional step specific to KMS. Option B is wrong because an IAM policy in the target account cannot grant access to a key whose key policy does not allow that account.

Option C is wrong because disabling encryption is not required and would violate the requirement to keep objects encrypted. Option D is wrong because a bucket policy alone does not grant KMS permissions.
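
As an illustration of the key policy step, the sketch below builds a key policy statement that lets another account use the key for decryption. The account ID 111122223333 and the Sid are placeholder assumptions, not values from the question, and a real key policy would also keep the key owner's administrative statements.

```python
import json

# Hypothetical target account ID, used only for illustration.
TARGET_ACCOUNT_ID = "111122223333"


def cross_account_decrypt_statement(account_id: str) -> dict:
    """Build a KMS key policy statement granting kms:Decrypt to another account."""
    return {
        "Sid": "AllowCrossAccountDecrypt",  # assumed label
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": ["kms:Decrypt"],
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }


statement = cross_account_decrypt_statement(TARGET_ACCOUNT_ID)
print(json.dumps(statement, indent=2))
```

The target account still has to allow kms:Decrypt and s3:GetObject in its own IAM policies; this statement covers only the key policy side.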

1427

Multi-Selectmedium

Which TWO options are valid methods to ingest on-premises relational database data into Amazon S3 for analytics? (Choose 2.)

Select 2 answers

A.AWS Snowball Edge

B.AWS Glue ETL job with JDBC connection to source

C.Amazon Kinesis Data Streams with Direct Put

D.AWS Database Migration Service (DMS) with S3 target

E.Amazon AppFlow

AnswersB, D

Glue can read from JDBC and write to S3.

Why this answer

AWS DMS can continuously replicate data to S3. AWS Glue ETL can connect to JDBC sources and write to S3. Both are valid ingestion methods.

1428

MCQeasy

A company needs to store files that are accessed by multiple EC2 instances in a VPC. The files must be concurrently accessible and durable. Which storage solution should the data engineer choose?

A.Amazon EC2 instance store

B.Amazon Simple Storage Service (Amazon S3)

C.Amazon Elastic Block Store (Amazon EBS)

D.Amazon Elastic File System (Amazon EFS)

AnswerD

EFS provides a shared, durable file system for EC2 instances.

Why this answer

Amazon EFS provides a fully managed, scalable, and elastic NFS file system that can be concurrently accessed by multiple EC2 instances across multiple Availability Zones. It is designed for high durability (11 nines of durability) and automatically replicates data across multiple AZs within a region, meeting the requirements for concurrent access and durability.

Exam trap

The trap here is that candidates often confuse Amazon EBS Multi-Attach with a general-purpose shared file system, but EBS Multi-Attach is limited to specific io1/io2 volumes, requires cluster-aware applications, and does not provide the POSIX file system semantics or cross-AZ durability that EFS offers.

How to eliminate wrong answers

Option A is wrong because EC2 instance store provides ephemeral block storage that is physically attached to the host; it is not durable (data is lost on instance stop/termination) and cannot be shared concurrently across multiple EC2 instances. Option B is wrong because Amazon S3 is an object storage service, not a file system; it does not support standard file-level locking or NFS/SMB protocols required for concurrent file access from multiple EC2 instances without additional gateways or software. Option C is wrong because Amazon EBS provides block-level storage volumes that can only be attached to a single EC2 instance at a time (except for multi-attach EBS io1/io2 volumes, which are limited to specific instance types and have strict constraints, not a general solution for concurrent file access).

1429

MCQmedium

The exhibit shows an AWS CLI command and its output. A data engineer wants to copy only objects larger than 10 MB from the S3 bucket to another bucket for processing. Which approach should be used to automate this task?

A.Use S3 replication rules to replicate objects above 10 MB

B.Use AWS CLI with a script to filter and copy objects

C.Use S3 Inventory to generate a list and then copy

D.Use AWS Lambda with S3 event notifications

AnswerB

The CLI can filter by size and copy objects using a script.

Why this answer

The command lists objects larger than 10 MB. To automate the copy, a script can use the AWS CLI --query parameter to filter the listing by size, iterate over the matching keys, and copy each one with aws s3 cp. S3 Batch Operations can also act on a list of objects, but it is not one of the options.

S3 replication is for continuous sync, not a one-time copy. Lambda with S3 event notifications is triggered only by new objects, not existing ones.

S3 Inventory provides object metadata but does not copy anything itself.
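
The filtering step could look like the sketch below. It assumes the listing has already been loaded as a list of dictionaries with `Key` and `Size` fields (the shape S3 list responses use). The sample keys and sizes are made up, and the copy itself is left to `aws s3 cp` or an equivalent call.

```python
TEN_MB = 10 * 1024 * 1024  # threshold in bytes


def keys_larger_than(objects, min_bytes=TEN_MB):
    """Return the keys of objects whose size is strictly above the threshold."""
    return [obj["Key"] for obj in objects if obj["Size"] > min_bytes]


# Made-up listing for illustration only.
sample_listing = [
    {"Key": "raw/a.csv", "Size": 4 * 1024 * 1024},
    {"Key": "raw/b.parquet", "Size": 25 * 1024 * 1024},
    {"Key": "raw/c.json", "Size": TEN_MB},  # exactly 10 MB, so not "larger"
]

to_copy = keys_larger_than(sample_listing)
print(to_copy)
```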

1430

MCQhard

A company runs a data pipeline that ingests user activity logs from an API gateway into an Amazon Kinesis Data Firehose delivery stream. The Firehose stream writes data to an S3 bucket. The data is then processed by a scheduled AWS Glue job that runs every hour. Recently, the company noticed that the data in S3 is incomplete: some logs from the API are missing. The Glue job processes all files in the S3 bucket. The Firehose stream has a buffer size of 5 MB and a buffer interval of 60 seconds. The API sends data at a rate of approximately 2 MB per minute. What should the company do to reduce data loss?

A.Decrease the buffer interval to 30 seconds.

B.Increase the Firehose buffer size to 10 MB.

C.Configure a Dead Letter Queue (DLQ) for the Firehose stream.

D.Enable data transformation with AWS Lambda to compress data.

AnswerC

A DLQ captures failed deliveries so data can be reprocessed.

Why this answer

Option C is correct because the buffering settings are not the cause of the loss. At 2 MB per minute the 5 MB buffer size is never reached, so Firehose delivers on the 60-second interval, and a failed delivery is retried before the data is written to the S3 bucket. Data is lost only when delivery fails permanently.

Changing the buffer size or interval alters delivery frequency and latency but does not prevent that loss. The best practice is to capture failed records, for example with S3 backup or a Dead Letter Queue, so that they can be reprocessed.

Option A is wrong because a shorter buffer interval only makes deliveries more frequent. Option B is wrong because at 2 MB per minute a larger buffer is still not filled within the interval, so delivery behavior does not change.

Option D is wrong because compressing data with a Lambda transformation does not prevent data loss.
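
Using the figures from the question (2 MB per minute, 5 MB buffer, 60-second interval), the arithmetic below checks which buffering condition is reached first. It is a simplified model that assumes a constant ingest rate. The function name is hypothetical.

```python
def firehose_flush_trigger(rate_mb_per_min, buffer_mb, interval_s):
    """Return which buffering condition is reached first and after how many seconds."""
    seconds_to_fill = buffer_mb / (rate_mb_per_min / 60.0)
    if seconds_to_fill <= interval_s:
        return ("size", seconds_to_fill)
    return ("interval", float(interval_s))


trigger, seconds = firehose_flush_trigger(2, 5, 60)
mb_per_delivery = 2 / 60.0 * seconds
print(trigger, seconds, round(mb_per_delivery, 2))

# Doubling the buffer size to 10 MB does not change which condition fires.
bigger_buffer_trigger, _ = firehose_flush_trigger(2, 10, 60)
print(bigger_buffer_trigger)
```

Filling 5 MB at this rate would take 150 seconds, so the 60-second interval fires first in both cases and each delivery carries about 2 MB.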

1431

MCQeasy

A data engineer needs to ingest streaming data from thousands of IoT devices into AWS for real-time processing. The data volume peaks at 5 GB/min. Which AWS service should be used as the ingestion endpoint?

A.Amazon Kinesis Data Streams

B.AWS Glue

C.Amazon S3

D.AWS Lambda

AnswerA

Kinesis Data Streams is built for real-time streaming data ingestion.

Why this answer

Amazon Kinesis Data Streams is designed for real-time streaming data ingestion. Option B is wrong because Glue is for ETL jobs. Option C is wrong because S3 is for object storage, not real-time streaming.

Option D is wrong because Lambda is compute, not an ingestion endpoint.
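
As a rough sizing sketch for the 5 GB/min peak: assuming the provisioned-mode write limit of 1 MB/s per shard and 1 GB = 1024 MB, the arithmetic below estimates the minimum shard count. It ignores the per-shard record-count limit and any headroom, so the result is a lower bound.

```python
import math

SHARD_WRITE_MB_PER_S = 1.0  # provisioned-mode write limit per shard


def min_shards(peak_gb_per_min):
    """Estimate the fewest shards that can absorb the peak write rate."""
    peak_mb_per_s = peak_gb_per_min * 1024 / 60
    return math.ceil(peak_mb_per_s / SHARD_WRITE_MB_PER_S)


print(min_shards(5))
```

The peak works out to about 85.3 MB/s, so the stream needs at least 86 shards under these assumptions.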

1432

Multi-Selectmedium

A company is building a data lake on Amazon S3 and needs to ingest data from multiple sources. Which of the following AWS services can be used to ingest and transform data in near real-time? (Select TWO.)

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Firehose

C.Amazon Athena

D.AWS Step Functions

E.Amazon Simple Queue Service (SQS)

AnswersA, B

Can be used for ETL jobs triggered by S3 events.

Why this answer

Correct options: A and B. Kinesis Data Firehose can ingest streaming data and perform basic transformations. AWS Glue can be triggered by S3 events for near real-time transformation.

Option C (Amazon Athena) is a query service, not an ingestion service. Option D (AWS Step Functions) is for orchestration. Option E (Amazon SQS) is a message queue.

1433

MCQeasy

A data engineer needs to store log files from multiple applications in a central S3 bucket. The logs must be stored cost-effectively for long-term retention (7 years). The logs are accessed infrequently after the first 30 days. Which storage class should the engineer use for objects older than 30 days?

A.S3 Glacier Deep Archive

B.S3 Standard

C.S3 One Zone-IA

D.S3 Standard-IA

AnswerD

Standard-IA is for infrequently accessed data with lower storage cost.

Why this answer

D is correct because S3 Standard-IA (Infrequent Access) is designed for data accessed less frequently but requires rapid access when needed, with a lower storage cost than S3 Standard and a 30-day minimum storage duration charge. After the first 30 days, logs are infrequently accessed, making Standard-IA the most cost-effective option that still provides millisecond first-byte latency for occasional retrieval needs over the 7-year retention period.

Exam trap

AWS often tests the misconception that any 'infrequent access' scenario automatically requires Glacier or Deep Archive, but the trap here is that the logs still need millisecond retrieval latency for occasional access, which Standard-IA provides while Glacier classes do not.

How to eliminate wrong answers

Option A is wrong because S3 Glacier Deep Archive is intended for data accessed at most once or twice per year with retrieval times of 12–48 hours, which is too slow for logs that may need occasional access within minutes after the first 30 days. Option B is wrong because S3 Standard is designed for frequently accessed data with no minimum storage duration, leading to higher costs for long-term retention of infrequently accessed logs. Option C is wrong because S3 One Zone-IA stores data in a single Availability Zone, which does not provide the durability and availability needed for critical log files that must survive an AZ failure, and it also has a 30-day minimum storage charge.
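
The 30-day transition and 7-year retention could be expressed as an S3 lifecycle configuration. The sketch below builds the rule as a Python dictionary in the shape the S3 lifecycle API accepts. The rule ID and the choice to expire objects after 7 × 365 days are assumptions made for illustration.

```python
RETENTION_DAYS = 7 * 365  # 7 years, ignoring leap days (an assumption)

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "logs-to-standard-ia",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix applies to every object
            # Move objects to Standard-IA once they are 30 days old.
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            # Delete objects when the retention period ends.
            "Expiration": {"Days": RETENTION_DAYS},
        }
    ]
}

rule = lifecycle_configuration["Rules"][0]
print(rule["Transitions"], rule["Expiration"])
```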

1434

MCQmedium

A company stores financial data in Amazon RDS for MySQL. They need to retain backups for 7 years to meet compliance. Which backup strategy meets this requirement?

A.Use read replicas to retain data

B.Take daily manual snapshots and delete after 7 years

C.Enable automated backups with a retention period of 7 years

D.Use the AWS Backup service with a 7-year retention policy

AnswerD

AWS Backup can manage snapshots with long retention.

Why this answer

AWS Backup is the correct service for long-term retention of RDS snapshots beyond the 35-day limit of automated backups. It allows you to create backup plans with retention policies up to 100 years, making it suitable for the 7-year compliance requirement. Manual snapshots can also be retained indefinitely, but AWS Backup provides centralized management and lifecycle policies.

Exam trap

The trap here is that candidates may assume automated backups can be configured for long retention periods, but AWS enforces a hard 35-day limit, making AWS Backup the most practical option for multi-year retention.

How to eliminate wrong answers

Option A is wrong because read replicas are used for read scaling and disaster recovery, not for backup retention; they do not provide point-in-time recovery or long-term retention. Option B is wrong because while manual snapshots can be retained indefinitely, taking daily manual snapshots is operationally inefficient and error-prone, and AWS Backup offers a more automated and managed solution with lifecycle policies. Option C is wrong because Amazon RDS automated backups have a maximum retention period of 35 days, which cannot be extended to 7 years.

1435

MCQhard

A company runs a production Amazon Redshift cluster with a 5-node ra3.4xlarge configuration. The data engineer observes that write operations are failing with 'Disk Full' errors on some nodes. The cluster has not reached its total capacity. What should the engineer do to resolve this issue?

A.Recreate the table with a different distribution style to avoid data skew.

B.Change the sort keys to distribute data evenly.

C.Enable compression on all tables.

D.Add more nodes to the cluster.

AnswerA

Choosing an appropriate DISTKEY distributes data evenly across nodes.

Why this answer

Redshift distributes data across nodes, but if data distribution is skewed, some nodes may run out of disk space. Recreating the table with a different distribution style (e.g., DISTKEY on a column with high cardinality) can balance the data. Option A is correct.

Option B: changing sort keys improves query performance, not data distribution. Option C: enabling compression reduces storage but may not fix skew. Option D: adding more nodes increases capacity but does not fix skew.
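
To see why a low-cardinality distribution key fills some nodes while others stay nearly empty, the simulation below places rows on 5 nodes by hashing the key. It is a simplified model: `zlib.crc32` stands in for the engine's internal hash, rows are placed per node rather than per slice, and the key values are invented.

```python
import zlib
from collections import Counter


def rows_per_node(keys, node_count):
    """Count rows per node when rows are placed by a hash of the key."""
    counts = Counter()
    for key in keys:
        # crc32 is a stand-in for the real distribution hash (an assumption).
        counts[zlib.crc32(str(key).encode("utf-8")) % node_count] += 1
    return [counts[node] for node in range(node_count)]


def skew_ratio(counts):
    """Largest node's row count divided by the mean row count."""
    return max(counts) / (sum(counts) / len(counts))


NODES = 5
low_cardinality = ["US", "EU", "APAC"] * 1000  # 3 distinct key values
high_cardinality = list(range(3000))           # 3000 distinct key values

low_skew = skew_ratio(rows_per_node(low_cardinality, NODES))
high_skew = skew_ratio(rows_per_node(high_cardinality, NODES))
print(f"low-cardinality skew:  {low_skew:.2f}")
print(f"high-cardinality skew: {high_skew:.2f}")
```

With only three distinct key values, at most three of the five nodes receive any rows, so the fullest node holds well above the average even though total capacity is not exhausted.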

1436

MCQeasy

A data engineering team is using AWS Glue to catalog data in an S3 data lake. They have a Glue crawler that runs daily to update the Data Catalog. Recently, they noticed that the crawler is taking longer to run and sometimes fails because of a timeout. The team suspects the issue is due to the large number of small files in the S3 bucket. They need to improve crawler performance and reliability. Which solution should they implement?

A.Configure the crawler to use a different classifier.

B.Use AWS Glue ETL to consolidate small files into larger ones before crawling.

C.Increase the crawler timeout to 24 hours.

D.Schedule the crawler to run more frequently to avoid large data accumulation.

AnswerB

Reduces number of files to scan.

Why this answer

Option B is correct because consolidating small files into larger ones (e.g., using AWS Glue ETL with a groupFiles or groupSize option, or a separate compaction job) reduces the number of objects the crawler must list and sample. This directly addresses the root cause: a high volume of small files increases metadata operations and can cause crawler timeouts. By reducing file count, the crawler can complete within the default 24-hour timeout and avoid failures.

Exam trap

The trap here is that candidates assume increasing the timeout or running the crawler more frequently will fix performance issues, but the real bottleneck is the sheer number of small files, which requires data compaction to resolve.

How to eliminate wrong answers

Option A is wrong because changing the classifier affects how the crawler interprets data format (e.g., JSON vs. Parquet), not the number of files or the performance bottleneck caused by small files. Option C is wrong because increasing the timeout to 24 hours does not solve the underlying issue of excessive small files; the crawler may still fail due to resource limits or S3 request throttling, and the default timeout is already 24 hours.

Option D is wrong because running the crawler more frequently would only accumulate more small files over time, worsening the problem and increasing the likelihood of timeouts.
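
The effect of compaction on object count can be sketched with a simple grouping plan. The example below greedily packs file sizes into batches of at most 128 MB. The 128 MB target and the 1 MB input files are assumptions for illustration, and a real job would read and rewrite the data rather than only plan the batches.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group files, in order, into batches no larger than target_mb."""
    batches, current, current_size = [], [], 0.0
    for size in file_sizes_mb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [1] * 1000  # one thousand 1 MB files (made-up input)
batches = plan_compaction(small_files)
print(len(small_files), "files ->", len(batches), "output files")
```

Under these assumptions the crawler would list 8 objects instead of 1000.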

1437

MCQhard

A company runs an Amazon RDS for MySQL database. The database experiences high write latency during peak hours. The data engineer notices that the WriteIOPS metric is consistently at the provisioned limit. Which action would most effectively reduce write latency without increasing costs?

A.Enable Multi-AZ deployment

B.Increase the provisioned IOPS on the existing RDS instance

C.Add a read replica to offload read traffic

D.Migrate to Amazon Aurora MySQL with appropriate instance size

AnswerD

Aurora's distributed storage can handle higher write throughput with lower latency and cost.

Why this answer

Migrating to Amazon Aurora MySQL with an appropriate instance size reduces write latency because Aurora’s distributed storage architecture provides up to five times the throughput of standard MySQL, without requiring additional IOPS provisioning. Aurora automatically scales storage I/O and uses a 6-replica quorum-based write model, which eliminates the bottleneck of hitting a fixed IOPS limit while keeping costs comparable to or lower than provisioned IOPS on RDS.

Exam trap

The trap here is that candidates often assume increasing provisioned IOPS (Option B) is the only way to fix write latency, overlooking that Aurora’s pay-per-request I/O model can provide higher throughput without a fixed cost increase, and that Multi-AZ (Option A) is a common distractor because it sounds like it improves performance but actually targets availability.

How to eliminate wrong answers

Option A is wrong because enabling Multi-AZ deployment provides high availability through synchronous standby replication, but it does not increase write throughput or reduce write latency; in fact, it can slightly increase write latency due to the synchronous commit to the standby. Option B is wrong because increasing provisioned IOPS directly increases costs, as you pay for the provisioned IOPS regardless of usage, and the question explicitly asks to reduce write latency without increasing costs. Option C is wrong because adding a read replica offloads read traffic, which does nothing to address write latency caused by hitting the WriteIOPS limit; write operations still hit the same primary instance with the same IOPS ceiling.

1438

Multi-Selecteasy

A data engineer is setting up Amazon S3 bucket policies for a data lake. Which TWO statements are true regarding S3 bucket policies? (Choose TWO.)

Select 2 answers

A.Bucket policies can grant access to accounts in other AWS Organizations

B.Bucket policies are the only way to control access to S3

C.Bucket policies can be applied to individual objects

D.The Principal element in a bucket policy is optional

E.Bucket policies are written in JSON format

AnswersA, E

Cross-account access can be granted via bucket policies.

Why this answer

Option A is correct because S3 bucket policies can grant cross-account access to principals in other AWS accounts, including accounts in different AWS Organizations, by naming the target account or role in the Principal element; access can also be limited to an organization with the aws:PrincipalOrgID condition key. This enables centralized data lake access management across organizational boundaries. Option E is correct because bucket policies, like other IAM policy documents, are written in JSON.

Exam trap

The trap here is that candidates often confuse bucket policies with IAM identity-based policies, mistakenly thinking the Principal element is optional in bucket policies (it is required), or that a bucket policy can be attached to an individual object (it cannot; it is attached to the bucket and scopes statements to objects through resource ARNs, prefixes, or tag conditions).
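
The requirement that every bucket policy statement names a principal can be checked mechanically. The helper below is a hypothetical lint function, not part of any AWS SDK. The bucket name and account ID in the sample policy are placeholders.

```python
def statements_missing_principal(policy):
    """Return the Sids (or positions) of statements that name no principal."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    return [
        stmt.get("Sid", f"statement-{index}")
        for index, stmt in enumerate(statements)
        if "Principal" not in stmt and "NotPrincipal" not in stmt
    ]


# Placeholder bucket name and account ID, for illustration only.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CrossAccountRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake/*",
        },
        {
            "Sid": "MissingPrincipal",  # valid in an IAM policy, not here
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake/*",
        },
    ],
}

missing = statements_missing_principal(sample_policy)
print(missing)
```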

1439

Multi-Selecthard

A company is migrating on-premises Apache Kafka clusters to Amazon MSK. The migration must be seamless with no data loss. The team is using MirrorMaker 2 to replicate data from on-premises Kafka to MSK. Which THREE configurations are necessary to ensure exactly-once semantics and minimal downtime? (Choose three.)

Select 3 answers

A.Set auto.create.topics.enable to false to prevent automatic topic creation.

B.Set offsets.topic.replication.factor to 3 for the consumer offsets topic.

C.Set replication.factor to 3 on the MSK cluster.

D.Enable TLS encryption between on-premises and MSK.

E.Configure MirrorMaker 2 to use exactly-once semantics.

AnswersB, C, E

Ensures offset data is durable and replicated.

Why this answer

B is correct because setting offsets.topic.replication.factor to 3 ensures consumer offsets are replicated. C is correct because setting replication.factor to 3 ensures data durability in MSK. E is correct because enabling exactly-once semantics in MirrorMaker 2 prevents duplicates.

A is wrong because auto.create.topics.enable should be true to allow topic creation. D is wrong because TLS encryption is optional and not required for exactly-once semantics.

1440

Multi-Selecteasy

A data engineering team is migrating a MySQL database to Amazon RDS for MySQL. They need to ensure high availability and automated failover. Which THREE configurations should they implement?

Select 3 answers

A.Enable Enhanced Monitoring.

B.Enable automated backups with a retention period.

C.Enable Multi-AZ deployment.

D.Configure a DB subnet group with subnets in at least two Availability Zones.

E.Create a read replica in a different region.

AnswersB, C, D

Automated backups enable recovery to any point within retention.

Why this answer

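Option B is correct because automated backups with a retention period enable point-in-time recovery (PITR) after data corruption or loss. Option C is correct because a Multi-AZ deployment maintains a synchronously replicated standby in another Availability Zone and fails over to it automatically. Option D is correct because the DB subnet group must include subnets in at least two Availability Zones so that RDS can place the standby.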

Exam trap

The trap here is that candidates often confuse read replicas (which are for read scaling and manual promotion) with Multi-AZ standby instances (which provide automatic failover), leading them to incorrectly select a cross-region read replica as a high-availability solution.

1441

MCQhard

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The data is then processed by a Kinesis Data Analytics application running SQL queries. The analytics application is falling behind and processing records with increasing latency. The stream has 4 shards, and the average record size is 5 KB. What is the MOST effective way to improve processing latency?

A.Increase the number of shards in the Kinesis stream to 8.

B.Enable enhanced fan-out on the Kinesis stream for the analytics application.

C.Increase the parallelism of the Kinesis Data Analytics application.

D.Increase the retention period of the Kinesis stream to 7 days.

AnswerC

More parallelism allows the application to process more records per unit time.

Why this answer

Option C is correct because increasing the parallelism of the Kinesis Data Analytics application (e.g., by increasing the number of in-application streams or ParallelismPerKPU) allows it to consume from the stream faster, reducing latency. Option A is wrong because nothing indicates that the shards are at their 1 MB/s write limit; the application, not the stream, is falling behind, so adding shards is unnecessary. Option B is wrong because enhanced fan-out is for consumers that need low latency, but does not help if the application is CPU-bound.

Option D is wrong because increasing the retention period does not affect processing speed.

1442

MCQeasy

A data engineer is troubleshooting a failed AWS Glue job that writes results to Amazon S3. The error log shows 'AccessDenied' when trying to list the bucket. Which IAM policy statement should the engineer add to the Glue job's role?

A.s3:ListBucket

B.s3:PutObject

C.s3:DeleteObject

D.s3:GetObject

AnswerA

Required to list objects in the bucket.

Why this answer

Option A is correct because listing a bucket requires the s3:ListBucket permission. Option B is wrong because s3:PutObject is for writing, not listing. Option C is wrong because s3:DeleteObject is for deleting objects and is not needed.

Option D is wrong because s3:GetObject is for reading objects, not listing them.
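
A common cause of this error is attaching s3:ListBucket to the object ARN instead of the bucket ARN. The sketch below builds a policy document with the two resource forms side by side. The bucket name is a placeholder and the object-level actions are an assumed minimum for a job that reads and writes results.

```python
def glue_s3_policy(bucket):
    """Build an IAM policy with bucket-level and object-level S3 permissions."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action: the resource is the bucket.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
            {
                # Reads and writes are object-level: the resource ends in /*.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


policy = glue_s3_policy("example-results-bucket")  # hypothetical bucket name
list_statement, object_statement = policy["Statement"]
print(list_statement["Resource"], object_statement["Resource"])
```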

1443

MCQeasy

A data engineer is monitoring an Amazon Kinesis Data Stream with a shard count of 10. The stream receives 5 MB/s of write traffic and 10 MB/s of read traffic. The engineer notices that writes are throttled with ProvisionedThroughputExceededException errors. Which action should the engineer take to resolve the throttling?

A.Increase the shard count to 20.

B.Decrease the shard count to 5.

C.Enable enhanced fan-out on the stream.

D.Configure auto-scaling on the stream.

AnswerA

Doubling shards doubles write capacity to 20 MB/s, eliminating throttling.

Why this answer

Option A is correct because each shard supports 1 MB/s of write capacity. With 10 shards, total write capacity is 10 MB/s, which exceeds the 5 MB/s the stream receives, so aggregate write capacity is sufficient. Read capacity is 2 MB/s per shard (20 MB/s total), and reads are 10 MB/s, so reads are fine.

The throttling is therefore likely due to uneven partition key distribution, which overloads individual shards. Increasing the shard count to 20 provides 20 MB/s of write capacity and spreads the key space across more shards. Option B is wrong because decreasing the shard count to 5 halves write capacity and makes throttling worse.

Option C is wrong because enabling enhanced fan-out increases read cost but does not affect write limits. Option D is wrong because a stream in provisioned mode does not scale automatically; the shard count must be updated manually.
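
The gap between aggregate capacity and per-shard limits can be illustrated with a small model. Kinesis hashes each partition key with MD5 and maps the result to a shard's hash key range. The sketch below simplifies that mapping to a modulo, and the partition keys and per-key rates are invented.

```python
import hashlib

SHARD_WRITE_LIMIT_MBPS = 1.0  # per-shard write limit


def shard_for(key, shard_count):
    """Map a partition key to a shard (modulo stands in for hash key ranges)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


def load_per_shard(traffic_mbps, shard_count):
    """Sum the write rate that lands on each shard."""
    load = [0.0] * shard_count
    for key, mbps in traffic_mbps.items():
        load[shard_for(key, shard_count)] += mbps
    return load


def throttled_shards(load):
    """Return the shards whose write rate exceeds the per-shard limit."""
    return [i for i, mbps in enumerate(load) if mbps > SHARD_WRITE_LIMIT_MBPS]


# Invented traffic: one hot key plus seven ordinary keys, 5 MB/s in total.
traffic = {"device-hot": 1.5}
traffic.update({f"device-{n}": 0.5 for n in range(7)})

load = load_per_shard(traffic, shard_count=10)
hot = throttled_shards(load)
print("total MB/s:", sum(load), "throttled shards:", hot)
```

The stream carries 5 MB/s against 10 MB/s of aggregate capacity, yet the shard that receives the hot key is over its 1 MB/s limit. More shards relieve shards that are overloaded by several keys, while a single key that exceeds 1 MB/s on its own also needs a finer-grained partition key.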

1444

MCQeasy

A company wants to move data from an Amazon RDS for MySQL database to Amazon Redshift for analytics. The data needs to be refreshed daily. Which AWS service is best suited for this?

A.AWS Database Migration Service (DMS)

B.AWS Glue

C.Amazon EMR

D.Amazon Athena

AnswerB

Can extract from RDS and load to Redshift with scheduling.

Why this answer

Option B is correct because AWS Glue can connect to RDS and Redshift, and schedule jobs. Option A (DMS) is for ongoing replication, not necessarily for daily batch. Option C (EMR) is for big data processing but overkill for simple transfer.

Option D (Athena) queries data in S3 and does not move it.

1445

Multi-Selecteasy

A company needs to ingest data from an on-premises Oracle database into Amazon S3 for analytics. The data volume is about 1 TB initially, with daily incremental updates of about 10 GB. Which TWO services can be combined to achieve this with minimal custom code?

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Streams

C.Amazon Athena

D.Amazon S3

E.AWS Database Migration Service (DMS)

AnswersD, E

Target for the ingested data.

Why this answer

Options D and E are correct because AWS DMS can perform full load and ongoing replication, and S3 is the target. Option A (Glue) can be used but often requires more custom code than DMS. Option B (Kinesis Data Streams) is for real-time streaming, not database migration.

Option C (Athena) is a query service, not data movement.

1446

MCQmedium

A data engineer is designing a data pipeline that ingests streaming data from an IoT fleet using Kinesis Data Streams and processes it with a Lambda function. The Lambda function often times out when the data volume spikes. What is the most scalable solution?

A.Reduce the batch size in the event source mapping.

B.Increase the Lambda function timeout to 15 minutes.

C.Increase the Lambda function memory and set reserved concurrency.

D.Increase the number of shards and use a Kinesis Data Analytics application for windowed aggregation before Lambda.

AnswerD

More shards increase parallelism, and pre-aggregation reduces Lambda load.

Why this answer

Option D is correct because increasing the shard count increases throughput and parallelism, and windowed aggregation in Kinesis Data Analytics reduces the volume each Lambda invocation must handle, so spikes are less likely to cause timeouts. Option A is wrong because reducing the batch size decreases throughput. Option B is wrong because increasing the Lambda timeout may not be enough for large spikes.

Option C is wrong because reserved concurrency limits scaling.
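
Windowed pre-aggregation reduces the number of records the downstream function has to handle. The sketch below shows a tumbling-window sum in plain Python. It only illustrates the idea and is not Kinesis Data Analytics code; the 60-second window and the sample events are assumptions.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_s=60):
    """Sum values per (window start, device) over fixed, non-overlapping windows."""
    totals = defaultdict(float)
    for timestamp_s, device_id, value in events:
        window_start = (timestamp_s // window_s) * window_s
        totals[(window_start, device_id)] += value
    return dict(totals)


# Made-up events: (timestamp in seconds, device id, reading).
events = [
    (0, "d1", 1.0),
    (30, "d1", 2.0),
    (59, "d2", 5.0),
    (60, "d1", 4.0),
]

aggregated = tumbling_window_sums(events)
print(aggregated)
```

Four input events collapse into three aggregated records here. During a spike with many events per device per window, the reduction is much larger.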

1447

MCQeasy

A company uses Amazon S3 as its data lake. A data engineer needs to enforce encryption of data at rest using server-side encryption with AWS KMS. Which S3 bucket property should be configured?

A.Default encryption

B.Server access logging

C.Versioning

D.Bucket policy

AnswerA

Default encryption enforces SSE-KMS on all objects.

Why this answer

Option A is correct because configuring default encryption on an S3 bucket ensures that all objects stored in the bucket are encrypted at rest using server-side encryption. When AWS KMS is specified as the encryption type, S3 automatically encrypts objects with a KMS key (SSE-KMS) upon upload, even if the upload request does not include encryption headers. This enforces encryption at rest without requiring changes to client applications.

Exam trap

The trap here is that candidates often confuse bucket policies (which can enforce encryption conditions) with default encryption (which actually applies encryption), leading them to select bucket policy as the answer when the question asks for the property that enforces encryption of data at rest.

How to eliminate wrong answers

Option B is wrong because server access logging records requests made to the bucket for auditing purposes, but it does not enforce or configure encryption of data at rest. Option C is wrong because versioning preserves, retrieves, and restores every version of every object in the bucket, but it has no effect on encryption settings. Option D is wrong because a bucket policy can deny unencrypted uploads using a condition key like `s3:x-amz-server-side-encryption`, but it does not itself configure the encryption mechanism; it only enforces a policy requirement, whereas default encryption directly applies encryption to all objects.
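
For reference, a default encryption setting for SSE-KMS has roughly the shape below, written as the Python dictionary that the S3 bucket encryption API accepts. The key ARN is a placeholder, not a real key.

```python
def default_sse_kms_configuration(kms_key_arn):
    """Build a bucket default-encryption configuration that applies SSE-KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


# Placeholder key ARN, for illustration only.
EXAMPLE_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

encryption_config = default_sse_kms_configuration(EXAMPLE_KEY_ARN)
default_rule = encryption_config["Rules"][0]["ApplyServerSideEncryptionByDefault"]
print(default_rule)
```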

1448

MCQeasy

Refer to the exhibit. A data engineer creates an Amazon Redshift table with the above DDL. The engineer runs a query to find all orders for a specific customer within a date range. Which statement about query performance is correct?

A.The query will be inefficient because the distribution key is not the same as the sort key.

B.The table should use DISTSTYLE EVEN to improve performance.

C.The query will benefit from both the distribution key and the sort key to minimize data scanned.

D.The sort key will not help because the query filters on customer_id first.

AnswerC

Distribution reduces data movement, sort key reduces data scanned.

Why this answer

Option C is correct because the DISTKEY on customer_id and SORTKEY on order_date optimize the query. Option A is wrong because the distribution key and sort key serve different purposes and need not match; the query benefits from both. Option B is wrong because distribution is already key-based, which suits this query.

Option D is wrong because the sort key on order_date still helps with the date range filter.

1449

Multi-Selectmedium

A data engineer is optimizing an Amazon RDS for MySQL database that experiences high write throughput. The engineer wants to improve write performance and reduce latency. Which TWO database-level configuration changes can help achieve this?

Select 2 answers

A.Use Provisioned IOPS (io1 or io2) storage.

B.Reduce the backup retention period to 1 day.

C.Increase the DB instance class to a larger size.

D.Create a Read Replica to offload writes.

E.Enable Multi-AZ for high availability.

AnswersA, C

Provisioned IOPS provides consistent low-latency writes.

Why this answer

Option A is correct because using Provisioned IOPS (io1/io2) storage provides consistent low-latency writes. Option C is correct because increasing the DB instance class provides more CPU and memory, which can improve write performance. Option B is wrong because reducing the retention period of automated backups frees up backup storage but doesn't improve write performance.

Option D is wrong because Read Replicas are for read scaling and cannot accept offloaded writes. Option E is wrong because Multi-AZ helps with availability but not write performance directly.

1450

Multi-Selectmedium

A company is ingesting real-time clickstream data into Amazon S3 using Amazon Kinesis Data Firehose. The data is semi-structured and the company wants to transform the data into Parquet format and partition it by year, month, day, and hour. Which TWO steps should be taken to achieve this? (Choose TWO.)

Select 2 answers

A.Set up an Amazon S3 event notification to trigger an AWS Lambda function that partitions the data after delivery.

B.Enable dynamic partitioning in Kinesis Data Firehose and specify the partition keys as year, month, day, hour extracted from the data.

C.Use an AWS Glue Crawler to infer the schema and automatically partition the data in S3.

D.Create an AWS Lambda function that transforms incoming records to Parquet and attach it to the Firehose delivery stream as a data transformation.

E.Configure Kinesis Data Firehose to convert the data to Parquet format using a schema from the AWS Glue Data Catalog.

AnswersB, D

Dynamic partitioning allows Firehose to write data into partitioned S3 prefixes.

Why this answer

Option B is correct because Kinesis Data Firehose's dynamic partitioning feature allows you to specify partition keys (year, month, day, hour) extracted from the incoming data, and Firehose will automatically create the corresponding S3 prefix structure (e.g., year=2024/month=01/day=15/hour=10/) during delivery. Option D is correct because to convert semi-structured data to Parquet format, you can attach an AWS Lambda function as a data transformation to Firehose, which converts each record to Parquet before delivery to S3.
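As a rough illustration of the prefix layout described above, the following sketch builds the same Hive-style path from a record timestamp. It is not Firehose configuration; `partition_prefix` is a hypothetical helper, and it assumes timestamps are timezone-aware and that partitions are expressed in UTC.

```python
from datetime import datetime, timezone


def partition_prefix(event_time: datetime) -> str:
    """Return a Hive-style S3 prefix (year=/month=/day=/hour=) for a timestamp.

    Assumption: event_time is timezone-aware; it is normalized to UTC first.
    """
    t = event_time.astimezone(timezone.utc)
    return f"year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"


# Example record timestamp (illustrative value only).
example = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
print(partition_prefix(example))
```

Zero-padded month, day, and hour values keep the prefixes lexicographically sortable, which is why the example path uses `month=01` rather than `month=1`.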

Exam trap

AWS often tests the misconception that dynamic partitioning alone handles format conversion, but in reality, dynamic partitioning only manages the S3 prefix structure, while Parquet conversion requires a separate Lambda transformation or the use of Firehose's built-in Parquet conversion with a compatible input format.

Full explanation →

1451

Multi-Selecteasy

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 for analytics. The data changes frequently and the engineer wants to capture both initial load and incremental changes with minimal latency. Which TWO AWS services should be used together? (Choose TWO.)

Select 2 answers

A.AWS Database Migration Service (DMS)

B.AWS Lambda

C.AWS Glue

D.AWS Transfer Family

E.Amazon Kinesis Data Streams

AnswersA, E

DMS can perform ongoing replication from Oracle to S3.

Why this answer

Option A and Option E are correct. AWS DMS can continuously replicate changes from Oracle to S3, and Amazon Kinesis Data Streams can provide low-latency streaming for near-real-time data. Option D (AWS Transfer Family) is for batch file transfer, not real-time.

Option C (AWS Glue) is for ETL, not replication. Option B (AWS Lambda) is for event-driven processing but not direct database replication.

Full explanation →

1452

MCQmedium

A company stores sensitive customer data in Amazon S3. The security team requires that all objects be encrypted at rest using server-side encryption with customer-provided keys (SSE-C). Which bucket policy condition will enforce this requirement?

A.s3:x-amz-server-side-encryption-aws-kms-key-id

B.s3:x-amz-server-side-encryption

C.s3:x-amz-server-side-encryption-customer-key

D.s3:x-amz-server-side-encryption-customer-algorithm

AnswerD

This condition key is present only when a request carries the SSE-C headers, so a policy can deny uploads where it is missing.

Why this answer

Option D is correct because `s3:x-amz-server-side-encryption-customer-algorithm` is the S3 condition key that corresponds to the `x-amz-server-side-encryption-customer-algorithm` header, which every SSE-C request must send. A bucket policy that denies `s3:PutObject` when this key is absent (a `Null` condition set to `true`) rejects any upload that does not use SSE-C, ensuring all objects are encrypted at rest using customer-provided keys.

Exam trap

AWS often tests the distinction between condition keys that enforce the encryption method (SSE-S3, SSE-KMS, SSE-C) and those that enforce specific parameters like the key ID. The customer-provided key is sent as a request header, but S3 does not expose it as a policy condition key, so the key-named option (C) is a tempting distractor.

How to eliminate wrong answers

Option A is wrong because `s3:x-amz-server-side-encryption-aws-kms-key-id` is used to enforce the use of a specific AWS KMS key ID for SSE-KMS, not SSE-C. Option B is wrong because `s3:x-amz-server-side-encryption` is used to enforce the encryption mode (e.g., AES256 or aws:kms) for SSE-S3 or SSE-KMS, but it does not enforce the use of customer-provided keys required for SSE-C. Option C is wrong because S3 does not define a condition key for the customer-provided key itself; the key material travels in a request header and is not available for policy evaluation.

Full explanation →

1453

Multi-Selecthard

Which THREE factors should a data engineer consider when choosing between AWS Glue and Amazon EMR for a data transformation job? (Choose three.)

Select 3 answers

A.The ability to output results to Amazon S3

B.The support for Apache Spark

C.The level of control over the execution environment and dependencies

D.The need for a serverless vs. cluster-based environment

E.The cost model: pay per DPU for Glue vs. per instance for EMR

AnswersC, D, E

EMR offers more control; Glue is less customizable.

Why this answer

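Options C, D, and E are correct. EMR gives more control over the execution environment and dependencies, Glue is serverless while EMR is cluster-based, and the two are billed differently (per DPU for Glue, per instance for EMR). Option A is not a deciding factor because both can output to S3. Option B is not a deciding factor because both support Apache Spark.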

Full explanation →

1454

MCQhard

A company has an Amazon DynamoDB table with a provisioned write capacity of 1000 WCU. During a flash sale, the write traffic spikes to 5000 WCU for 10 minutes. The table is not auto-scaled. Which action should the data engineer take to handle the spike without throttling?

A.Convert the table to on-demand capacity mode before the sale.

B.Set a CloudWatch alarm to increase provisioned capacity when write throttling occurs.

C.Use DynamoDB Accelerator (DAX) to cache writes.

D.Enable auto-scaling with a target utilization of 70% and a maximum capacity of 5000 WCU.

AnswerA

On-demand capacity mode serves the spike without relying on pre-provisioned WCU.

Why this answer

Option A is correct because on-demand capacity mode removes the fixed 1000 WCU ceiling, so the table can absorb the 5000 WCU burst without the engineer having to predict and provision capacity. Pre-provisioning enough WCU before the sale would also work, but it is not among the options.

DynamoDB Accelerator (DAX) is an in-memory cache that offloads read traffic, not write traffic. Writes through DAX still go to the table and consume write capacity, so DAX does not prevent write throttling.

Exam trap

The trap here is that DAX is often associated with performance improvement, so candidates may mistakenly think it handles write spikes, whereas DAX only caches reads and does not affect write capacity or throttling.

How to eliminate wrong answers

Option B is wrong because setting a CloudWatch alarm to increase provisioned capacity when write throttling occurs is reactive: throttling happens before the alarm triggers and capacity increases. Option C is wrong because DAX caches reads and does not add write capacity. Option D is wrong because auto-scaling reacts to sustained consumption and cannot respond instantly, so a sudden 10-minute spike is throttled before capacity catches up.

Full explanation →

1455

MCQeasy

A data engineer notices that an AWS Glue ETL job is failing with a 'MemoryError' when processing a large dataset. Which approach should the engineer take to resolve this issue?

A.Increase the number of DPUs for the job.

B.Change the source file format from Parquet to JSON.

C.Reduce the number of partitions in the source data.

D.Use S3 Select to filter data before processing.

AnswerA

More DPUs provide more memory and compute resources.

Why this answer

Option A is correct because increasing the number of DPUs (Data Processing Units) allocates more memory and processing capacity to the job, which can resolve memory errors for large datasets. Option D is incorrect because S3 Select does not help with memory in Glue jobs. Option C is incorrect because reducing the number of partitions may increase memory pressure.

Option B is incorrect because changing the file format to JSON typically increases memory usage.

Full explanation →

1456

MCQhard

A company uses AWS DMS to migrate a 2 TB Oracle database to Amazon RDS for PostgreSQL. The migration is taking longer than expected. The task status shows 'Full load in progress' with a low 'Table throughput (rows/s)'. Which action would MOST improve throughput?

A.Enable Multi-AZ on the DMS replication instance

B.Change the target table preparation mode to 'Do nothing'

C.Increase the number of parallel tasks in the DMS task settings

D.Increase the number of shards in the source database

AnswerC

Parallel tasks allow concurrent loading of tables, increasing throughput.

Why this answer

Using multiple parallel tasks splits the load across threads, improving throughput. Increasing the DMS replication instance size also helps, but parallel tasks are more effective for large datasets.
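For orientation, parallelism during a DMS full load is controlled in the task settings JSON; `MaxFullLoadSubTasks` under `FullLoadSettings` sets how many tables load concurrently. The sketch below only builds that JSON fragment. The value 16 is an illustrative assumption, not a recommendation, and the right number depends on the replication instance and the source database.

```python
import json

# Fragment of a DMS task settings document.
# Assumption: 16 is an example value chosen for illustration only.
task_settings = {
    "FullLoadSettings": {
        "MaxFullLoadSubTasks": 16,  # number of tables loaded in parallel
    }
}

settings_json = json.dumps(task_settings, indent=2)
print(settings_json)
```

Raising this value increases load on both the source and the replication instance, so it is usually adjusted together with the instance size.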

Full explanation →

1457

MCQmedium

A data engineer needs to set up a data pipeline that ingests CSV files from an S3 bucket, transforms them using AWS Glue, and loads the results into Amazon Redshift. The pipeline must handle schema evolution and data quality checks. Which combination of services is most appropriate?

A.Use S3 Events to trigger an AWS Lambda function that writes directly to Redshift

B.Use Amazon Athena to query data in S3 and insert results into Redshift via CTAS

C.Use Amazon Kinesis Data Firehose to transform and load data into Redshift

D.Use AWS Glue ETL jobs with Glue DataBrew for data quality and write to Redshift

AnswerD

Glue supports schema evolution and DataBrew provides data quality checks.

Why this answer

Option D is correct because Glue handles schema evolution and DataBrew provides data quality checks. Option A is wrong because Lambda is not ideal for large transforms. Option B is wrong because Athena cannot write to Redshift.

Option C is wrong because Kinesis Data Firehose is for streaming.

Full explanation →

1458

MCQmedium

A company uses AWS Glue ETL to transform data from Amazon RDS for MySQL to Amazon S3. The Glue job reads from a JDBC connection. The job runs once daily and processes all records, but the data volume is growing. Which change would improve performance and reduce costs?

A.Increase the number of DPUs for the Glue job

B.Switch to a Glue Python shell job

C.Use a higher JDBC fetch size

D.Enable Glue job bookmarking and set the job to process only new data

AnswerD

Bookmarking enables incremental loads.

Why this answer

Using incremental extraction with bookmarking allows Glue to process only new data instead of full scans, reducing time and cost.

Full explanation →

1459

MCQeasy

A company needs to ingest real-time clickstream data from a web application into Amazon S3 for analytics. The data must be available within minutes of generation. Which AWS service should be used to capture and deliver this streaming data?

A.Amazon RDS

B.Amazon Kinesis Data Firehose

C.AWS Glue

D.Amazon Simple Queue Service (SQS)

AnswerB

Correct: Kinesis Data Firehose captures streaming data and delivers it to S3 with low latency.

Why this answer

Option B (Amazon Kinesis Data Firehose) is correct because it can capture streaming data and deliver it to S3 with minimal latency (typically 60 seconds). Option C (AWS Glue) is a batch ETL service, not real-time. Option D (Amazon SQS) is a message queue, not designed for direct S3 delivery.

Option A (Amazon RDS) is a relational database.

Full explanation →

1460

MCQmedium

A data engineer needs to monitor the number of records processed by an Amazon Kinesis Data Analytics application and trigger an alarm if the count drops below a threshold over 5 minutes. Which CloudWatch metric should be used?

A.millisBehindLatest (from KinesisDataAnalytics)

B.IncomingRecords (from Kinesis Streams)

C.DPUCount (from Glue)

D.IncomingBytes (from Kinesis Firehose)

AnswerA

This metric indicates how far behind the application is; a drop in processing can be inferred.

Why this answer

Option A is correct because KinesisDataAnalytics publishes 'millisBehindLatest' for application progress. Option B is for Kinesis Streams. Option D is for Firehose.

Option C is for Glue.

Full explanation →

1461

MCQhard

Refer to the exhibit. This IAM policy is attached to a user who is trying to read the object s3://data-bucket/confidential/report.csv. The user's principal tag 'role' is set to 'analyst'. What will happen when the user attempts to read the object?

A.Denied because the Deny statement covers all actions under confidential

B.Allowed because there is an explicit Allow and no explicit Deny that matches

C.Denied because the condition in the Deny statement evaluates to true

D.Allowed because of the Allow statement for s3:GetObject

AnswerC

The condition StringNotEquals 'admin' is true for 'analyst', so Deny is applied.

Why this answer

Option C is correct because the Deny statement applies when the role tag is not 'admin'. The user's tag is 'analyst', so the StringNotEquals condition evaluates to true and access is denied. Option A is wrong because the Deny takes effect only through its condition, not because it unconditionally covers every action under the confidential prefix.

Option B is wrong because an explicit Deny does match this request. Option D is wrong because an explicit Deny overrides the Allow for s3:GetObject.
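The exhibit itself is not reproduced on this page, so the following is only a toy model of the evaluation logic described above: an explicit Deny with a StringNotEquals condition on a principal tag wins over an Allow. The statement layout, `evaluate`, and the policy contents are hypothetical simplifications, not the real IAM policy grammar.

```python
def evaluate(statements, action, resource, principal_tags):
    """Toy IAM evaluation: explicit Deny beats Allow; default is implicit deny."""
    decision = "ImplicitDeny"
    for stmt in statements:
        if action not in stmt["actions"]:
            continue
        if not resource.startswith(stmt["resource_prefix"]):
            continue
        cond = stmt.get("string_not_equals")
        if cond is not None:
            tag, value = cond
            # StringNotEquals is true when the tag differs (or is missing).
            if principal_tags.get(tag) == value:
                continue
        if stmt["effect"] == "Deny":
            return "Deny"
        decision = "Allow"
    return decision


# Hypothetical stand-in for the exhibit's policy.
policy = [
    {"effect": "Allow", "actions": ["s3:GetObject"],
     "resource_prefix": "data-bucket/"},
    {"effect": "Deny", "actions": ["s3:GetObject"],
     "resource_prefix": "data-bucket/confidential/",
     "string_not_equals": ("role", "admin")},
]

print(evaluate(policy, "s3:GetObject",
               "data-bucket/confidential/report.csv", {"role": "analyst"}))
```

With the tag set to `admin` the Deny condition no longer matches, and the Allow statement decides the request.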

Full explanation →

1462

MCQmedium

A data pipeline using AWS Glue jobs is failing with 'Insufficient capacity' errors for Spark executors. Which action should the data engineer take to resolve this?

A.Reduce the number of workers in the Glue job configuration.

B.Increase the job timeout value.

C.Disable Spark UI logging.

D.Increase the number of workers (DPUs) in the Glue job configuration.

AnswerD

Increasing workers adds more computing capacity, resolving the 'Insufficient capacity' error.

Why this answer

Option D is correct because the error indicates resource limits; increasing the number of workers (DPUs) can resolve capacity issues. Option A (reduce workers) would worsen the problem. Option B (increase timeout) does not address capacity.

Option C (disable logging) does not help.

Full explanation →

1463

Multi-Selecthard

A company is experiencing high costs from Amazon Redshift. The data engineer wants to optimize costs. Which THREE actions should the engineer take? (Choose THREE.)

Select 3 answers

A.Increase the frequency of automated snapshots.

B.Right-size the cluster based on workload analysis.

C.Increase the number of nodes to improve performance.

D.Purchase Reserved Instances for steady-state workloads.

E.Enable Concurrency Scaling and set up a usage limit.

AnswersB, D, E

Right-sizing ensures you only pay for needed resources.

Why this answer

Option B is correct because right-sizing the cluster based on workload analysis ensures that the provisioned resources (number and type of nodes) match the actual compute and storage demands. Over-provisioned clusters waste money on unused capacity, while under-provisioned clusters cause performance issues. Analyzing metrics like CPU utilization, disk usage, and query queue wait times helps identify the optimal node count and instance type, directly reducing costs.

Exam trap

The trap here is that candidates confuse cost optimization with performance improvement, leading them to select 'Increase the number of nodes' (Option C) thinking it will reduce costs by improving efficiency, when in fact it increases costs.

Full explanation →

1464

MCQhard

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 using AWS DMS. The change data capture (CDC) must be enabled to capture ongoing changes. Which additional AWS service is required to store the transaction logs for CDC?

A.Amazon RDS

B.Amazon S3

C.Amazon EBS

D.Amazon CloudWatch Logs

AnswerB

DMS CDC uses S3 to store Oracle transaction logs for ongoing replication.

Why this answer

Option B is correct because AWS DMS CDC requires Oracle transaction logs to be stored in an S3 bucket for replication. Option D (CloudWatch Logs) is for monitoring, not storage. Option A (Amazon RDS) is a database, not storage.

Option C (Amazon EBS) is block storage attached to EC2, not suitable for DMS CDC.

Full explanation →

1465

MCQhard

A data engineer is designing a streaming ingestion pipeline using Amazon Kinesis Data Streams. The stream has 10 shards, and the data volume is expected to grow by 50% over the next month. The engineer needs to ensure that the pipeline can scale without manual intervention. Which approach should be used?

A.Set up a CloudWatch Alarm to trigger a Lambda function to add shards

B.Use an Auto Scaling group to add more shards

C.Switch the Kinesis stream to on-demand capacity mode

D.Configure the stream to use a Lambda function that scales shards

AnswerC

On-demand mode automatically scales shards based on ingestion throughput.

Why this answer

Option C is correct because Kinesis Data Streams supports automatic scaling using the 'On-demand' capacity mode, which adjusts shards based on traffic. Option B is wrong because Auto Scaling groups are for EC2, not Kinesis. Option A is wrong because CloudWatch Alarms can trigger scaling, but scaling a Kinesis stream via API requires custom code; on-demand mode is simpler.

Option D is wrong because Lambda does not natively scale the stream.

Full explanation →

1466

Multi-Selecteasy

A data engineer needs to ingest data from multiple on-premises relational databases into Amazon S3 for analytics. The data must be transformed and loaded daily. Which THREE AWS services should the engineer use together to build this pipeline? (Choose THREE.)

Select 3 answers

A.AWS Glue

B.AWS Glue Data Catalog

C.Amazon Athena

D.AWS Database Migration Service (DMS)

E.Amazon Kinesis Data Streams

AnswersA, B, D

Performs ETL transformations on the data.

Why this answer

Options A, B, and D are correct. Option D: AWS DMS can migrate data from on-premises databases to S3. Option A: AWS Glue can transform the data.

Option B: AWS Glue Data Catalog can store metadata. Option E is wrong because Kinesis Data Streams is for real-time streaming, not batch ingestion. Option C is wrong because Athena is for querying, not for building the ingestion pipeline.

Full explanation →

1467

MCQeasy

A company wants to migrate on-premises data to Amazon S3 using AWS DataSync. The data is stored on an NFS file server and the total volume is 50 TB. The network bandwidth between the on-premises data center and AWS is 1 Gbps (gigabit per second). What is the primary factor that will determine the total time required for the initial data transfer?

A.The available network bandwidth between on-premises and AWS

B.The number of S3 buckets used as the destination

C.The average file size in the dataset

D.The IOPS (I/O operations per second) of the on-premises NFS server

AnswerA

With 50 TB and 1 Gbps, the theoretical minimum time is ~4.6 days; network bandwidth is the key constraint.

Why this answer

Option A is correct because the network bandwidth is the bottleneck for transferring 50 TB over a 1 Gbps link: 50 TB is 4 × 10^14 bits, which takes 4 × 10^5 seconds, or approximately 4.6 days, if the bandwidth is fully utilized. File metadata operations and the number of files also matter, but they are secondary. Option C is wrong because DataSync can handle large files efficiently, so average file size is not the primary factor. Option D is wrong because the NFS server performance is usually not the bottleneck compared to the network.

Option B is wrong because S3 can accept data at high throughput regardless of how many destination buckets are used.
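The transfer-time estimate can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a fully utilized link with no protocol overhead, so it gives a lower bound of a bit over four and a half days rather than a realistic duration.

```python
def transfer_days(total_bytes: int, link_bits_per_second: int) -> float:
    """Lower-bound transfer time in days for a fully utilized link."""
    seconds = total_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


# 50 TB over 1 Gbps, decimal units assumed.
days = transfer_days(50 * 10**12, 1 * 10**9)
print(f"{days:.2f} days")
```

Real transfers take longer because of metadata operations, verification, and competing traffic on the link.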

Full explanation →

1468

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. The delivery occasionally fails due to 'ThrottlingException' from S3. What should the team do to resolve this issue without losing data?

A.Enable S3 Transfer Acceleration on the destination bucket.

B.Disable error logging in Firehose to reduce API calls.

C.Configure Firehose to deliver data to Amazon DynamoDB instead.

D.Increase the Firehose buffer size and buffer interval to reduce the number of S3 PUT requests.

AnswerD

Larger buffers mean fewer writes, reducing throttling risk.

Why this answer

Option D is correct. Kinesis Firehose can buffer failed records and retry; you can also increase the buffer size or interval to reduce S3 PUT frequency. Option B is wrong because disabling error logging is not a solution.

Option A is wrong because S3 Transfer Acceleration is for speed, not throttling. Option C is wrong because switching the destination to DynamoDB abandons the required S3 target.
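To see why larger buffers reduce request volume, here is a simplified model of delivery frequency: a flush is assumed to happen when either the buffer size or the buffer interval is reached, whichever comes first. The function name and the throughput figures are illustrative assumptions, and the model ignores compression, retries, and any internal parallelism.

```python
def deliveries_per_hour(ingest_mb_per_s: float, buffer_mb: float,
                        buffer_interval_s: float) -> float:
    """Approximate S3 deliveries per hour under a size-or-time flush rule."""
    seconds_to_fill = buffer_mb / ingest_mb_per_s
    seconds_per_flush = min(seconds_to_fill, buffer_interval_s)
    return 3600 / seconds_per_flush


# Illustrative numbers: 0.5 MB/s of incoming data.
small = deliveries_per_hour(0.5, buffer_mb=1, buffer_interval_s=60)
large = deliveries_per_hour(0.5, buffer_mb=64, buffer_interval_s=300)
print(small, large)
```

Under these assumptions the larger buffer cuts deliveries from 1800 to about 28 per hour, at the cost of data arriving in S3 later.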

Full explanation →

1469

MCQmedium

A company uses Amazon Redshift for analytics. The data engineer notices that some queries are slow and the EXPLAIN plan shows a 'Seq Scan' on a large table. Which data store management action would most likely improve query performance?

A.Run the ANALYZE command to update table statistics.

B.Enable automatic compression on the table.

C.Define appropriate sort keys and distribution styles.

D.Run the VACUUM command to reclaim space.

AnswerC

Sort keys and distribution styles can reduce data scanning and improve join performance.

Why this answer

Option C is correct because using sort keys and distribution styles can significantly improve query performance by reducing scans and data shuffling. Option D is incorrect because VACUUM reclaims space but does not directly improve scan performance. Option A is incorrect because ANALYZE updates statistics but does not change the physical layout.

Option B is incorrect because compression is for storage, not query speed.

Full explanation →

1470

Multi-Selectmedium

A company uses Amazon S3 to store raw data and runs AWS Glue ETL jobs to transform it into Parquet. The data is then queried using Amazon Athena. Queries are slow and expensive due to high scan volumes. Which THREE design changes can improve query performance and reduce costs? (Select THREE.)

Select 3 answers

A.Increase the number of files by reducing file size to 1 MB

B.Convert the data to a columnar format like Parquet or ORC if not already

C.Compress the data using a splittable compression format like Snappy

D.Use bucketing on high-cardinality columns

E.Partition the data by commonly filtered columns such as date or region

AnswersB, C, E

Columnar formats store data by column, reducing I/O for queries that select few columns.

Why this answer

Option B is correct because columnar formats like Parquet or ORC store data by column rather than by row, allowing Athena to read only the columns needed for a query. This drastically reduces the amount of data scanned per query, directly lowering both latency and cost since Athena charges based on the volume of data read.

Exam trap

The trap here is that candidates may confuse bucketing with partitioning, or assume that increasing file count always improves parallelism, when in fact small files harm performance in distributed query engines like Athena.

Full explanation →

1471

MCQhard

A company uses AWS Glue ETL jobs to transform data in Amazon S3. The data is partitioned by date and hour. The job reads the latest hour's data, performs aggregations, and writes results to a separate S3 bucket. The job runs every hour and processes approximately 500 MB of input data. The team notices that the job takes longer than expected, often exceeding the 1-hour window. Which action would most effectively reduce the job's runtime?

A.Use a Python shell job instead of a Spark job.

B.Switch from using DynamicFrame to using Spark SQL for transformations.

C.Repartition the input data into more partitions before reading.

D.Increase the number of workers (DPUs) for the Glue job.

AnswerD

More workers increase parallelism, reducing runtime for the given data size.

Why this answer

The correct answer is to increase the number of workers. The job processes only 500 MB, so increasing worker count (DPUs) will improve parallelism. Option C is incorrect because the job processes only one hour's data, and repartitioning would add overhead.

Option B is incorrect because using Spark SQL does not inherently improve performance. Option A is incorrect because switching to a Python shell would not handle the transformation efficiently. Option D directly adds resources to speed up the job.

Full explanation →

1472

MCQeasy

A data engineer runs the command shown to check the encryption configuration of an S3 bucket. The output shows SSEAlgorithm: AES256. What does this mean?

A.The bucket uses SSE-S3 with Amazon S3-managed keys

B.The bucket uses SSE-KMS with a customer-managed key

C.The bucket uses SSE-C with customer-provided keys

D.The bucket does not have encryption enabled

AnswerA

AES256 indicates SSE-S3.

Why this answer

Option A is correct. AES256 refers to SSE-S3, where Amazon S3 manages the encryption keys using AES-256. Option B (SSE-KMS) would show 'aws:kms'.

Option C (SSE-C) would require the customer to provide keys. Option D (no encryption) is incorrect because encryption is enabled.
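The command from the exhibit is not shown here; assuming its output has the shape returned by `aws s3api get-bucket-encryption`, the `SSEAlgorithm` value can be mapped to an encryption type as in this sketch. `describe_default_encryption` and the sample response are illustrative, not taken from the exhibit.

```python
# Known SSEAlgorithm values for S3 default bucket encryption.
SSE_ALGORITHMS = {
    "AES256": "SSE-S3 (Amazon S3 managed keys)",
    "aws:kms": "SSE-KMS (AWS KMS keys)",
    "aws:kms:dsse": "DSSE-KMS (dual-layer encryption with AWS KMS keys)",
}


def describe_default_encryption(response: dict) -> str:
    """Translate the first default-encryption rule into a readable label."""
    rule = response["ServerSideEncryptionConfiguration"]["Rules"][0]
    algorithm = rule["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"]
    return SSE_ALGORITHMS.get(algorithm, f"unknown algorithm: {algorithm}")


# Sample response shaped like the get-bucket-encryption output (assumed).
sample = {
    "ServerSideEncryptionConfiguration": {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }
}
print(describe_default_encryption(sample))
```

SSE-C never appears here because customer-provided keys are supplied per request rather than configured as a bucket default.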

Full explanation →

1473

MCQmedium

A company is using Amazon Athena to query data in an S3 bucket. Queries are failing with the error 'HIVE_PATH_ALREADY_EXISTS'. The data is partitioned by year, month, day. What is the MOST likely cause?

A.A partition was manually added to the Glue Data Catalog that already exists

B.The data format in the partition is inconsistent with the table schema

C.The S3 location for the partition is empty

D.The IAM role used by Athena lacks s3:ListBucket permission on the bucket

AnswerA

Attempting to add a duplicate partition causes this error.

Why this answer

Option A is correct because the error occurs when a partition is already registered in the Glue Data Catalog and a new ALTER TABLE ADD PARTITION tries to add it again. Option B would cause a schema mismatch error. Option D would cause a permission error.

Option C would not produce a path-already-exists error.

Full explanation →

1474

MCQmedium

Refer to the exhibit. A data engineer queries AWS CloudTrail to investigate a PutObject event. What does the exhibit reveal about the object sensitive.csv?

A.The upload failed due to encryption mismatch.

B.The object was uploaded with server-side encryption using AWS KMS.

C.The object was not encrypted at rest.

D.The object was encrypted with SSE-S3.

AnswerB

x-amz-server-side-encryption: aws:kms indicates SSE-KMS.

Why this answer

Option B is correct because the CloudTrail event shows x-amz-server-side-encryption: aws:kms, indicating the object was uploaded with SSE-KMS. Option D is wrong because the event does not indicate SSE-S3 (would be AES256). Option C is wrong because the event shows the object is encrypted at rest.

Option A is wrong because the event shows the object was uploaded, not that encryption failed.

Full explanation →

1475

Multi-Selectmedium

A data engineer is migrating a large Oracle data warehouse to Amazon Redshift. The engineer needs to ensure optimal performance. Which TWO practices should the engineer follow?

Select 2 answers

A.Choose appropriate sort keys based on common query patterns.

B.Design the schema as a normalized star schema with row-based storage.

C.Manually define compression encodings for each column.

D.Stage data in Amazon S3 before loading into Redshift.

E.Use DISTKEY to distribute data evenly across nodes.

AnswersA, E

Sort keys reduce the amount of data scanned.

Why this answer

Option A is correct because Amazon Redshift uses sort keys to physically order data on disk, which allows the query optimizer to skip large blocks of data during scans via zone maps. Choosing sort keys based on common query patterns (e.g., range filters or frequent GROUP BY columns) dramatically reduces I/O and improves query performance, especially for large tables.

Exam trap

The trap here is that candidates often confuse Redshift's columnar storage with row-based storage and assume a normalized star schema is optimal, when in fact Redshift is designed for denormalized, columnar tables with explicit sort and distribution keys.

Full explanation →

1476

MCQmedium

A data engineer is configuring an S3 bucket for storing sensitive customer data. The bucket must be encrypted at rest using an AWS Key Management Service (KMS) key that is managed by the data engineering team. The team wants to ensure that only users with explicit permission can decrypt the data. Which S3 encryption option should be used?

A.SSE-KMS

B.Client-side encryption

C.SSE-S3

D.SSE-C

AnswerA

SSE-KMS uses a customer-managed KMS key, allowing fine-grained access control.

Why this answer

Option A is correct because SSE-KMS uses a customer-managed KMS key, allowing the team to control access and permissions for decryption. Option C (SSE-S3) uses Amazon S3-managed keys, which does not provide customer-controlled access. Option D (SSE-C) requires the customer to manage the encryption keys themselves, not using KMS.

Option B (client-side encryption) encrypts data before sending to S3, which is not an S3 server-side encryption option.

Full explanation →

1477

MCQeasy

A data engineer needs to ingest log files from multiple EC2 instances into Amazon S3. The logs are written to local disk on each instance. The engineer wants a simple agent-based solution that can collect, compress, and upload logs to S3 with minimal configuration. The solution must support incremental uploads (only new log lines) and handle log rotation. What should the engineer use?

A.Install and configure Amazon Kinesis Agent for Amazon CloudWatch to send logs to CloudWatch Logs, then use a subscription filter to export logs to S3.

B.Use AWS CLI cp command with --recursive in a cron job to copy logs to S3 every minute.

C.Install AWS DataSync agent on each EC2 instance to sync logs to S3 daily.

D.Use an S3 sync command from the AWS CLI scheduled every hour.

AnswerA

Kinesis Agent tails log files, compresses, and sends to CloudWatch Logs; export to S3 can be automated.

Why this answer

Option A is correct: Amazon Kinesis Agent for Amazon CloudWatch can tail log files, compress, and send to CloudWatch Logs, which can then be exported to S3. Option C (AWS DataSync) is for bulk transfers, not streaming. Option B (AWS CLI cp in a cron job) recopies whole files instead of uploading only new log lines.

Option D (S3 sync) is not real-time.

Full explanation →

1478

MCQmedium

A company uses Amazon Kinesis Data Streams to ingest clickstream data from web applications. The data is consumed by a Kinesis Data Analytics application that runs SQL queries. Recently, the data analytics application has been falling behind, and the 'MillisBehindLatest' metric for the consumer has been increasing steadily. The shard count is 4, and the average records per second per shard is 200, with an average record size of 1 KB. The provisioned shard limit for the account is 10. Which action will resolve the issue?

A.Enable enhanced fan-out on the Kinesis stream and subscribe the analytics application to it.

B.Reduce the checkpoint interval on the Kinesis Client Library (KCL) consumer to commit offsets more frequently.

C.Increase the number of shards in the Kinesis stream to 8.

D.Increase the provisioned write capacity of the Kinesis stream by requesting a shard limit increase.

AnswerC

More shards increase total read capacity, allowing the consumer to process data faster.

Why this answer

Option C is correct because the consumer is falling behind due to insufficient read capacity. Increasing the number of shards increases the total read capacity and allows the consumer to keep up. Option D is wrong because the write capacity is not the issue; the consumer is behind.

Option A is wrong because switching to enhanced fan-out does not address the shard count limitation; it improves dedicated throughput per consumer but the total throughput is still limited by shard count. Option B is wrong because the issue is not related to checkpointing.

Full explanation →

1479

MCQmedium

A data engineer is designing a data lake on S3 with sensitive data. The security policy mandates that data must be encrypted at rest and in transit, and that an inventory of all objects must be maintained for compliance. Which actions should be taken?

A.Enforce HTTPS via bucket policy, enable default SSE-S3 encryption, and enable S3 Inventory.

B.Use SSE-KMS encryption and enable CloudTrail for S3 events.

C.Enable S3 default encryption using SSE-S3 and enable S3 Inventory.

D.Enforce HTTPS using bucket policy and enable S3 Server Access Logging.

AnswerA

Covers in-transit, at-rest encryption, and inventory.

Why this answer

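Option A is correct because enforcing HTTPS (in transit) and SSE-S3 (at rest) together with S3 Inventory satisfies all three requirements. Option C misses in-transit encryption. Option D enforces HTTPS but provides neither at-rest encryption nor an inventory; Server Access Logging records requests, not a list of objects.

Option B uses SSE-KMS, which may not be required, and includes neither HTTPS enforcement nor an inventory.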
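As a rough illustration of the HTTPS-enforcement part of the correct answer, the sketch below builds a bucket policy that denies any request not made over TLS, using the `aws:SecureTransport` condition key. The bucket name `example-data-lake` and the helper name `https_only_policy` are placeholders invented for this example; default SSE-S3 encryption and S3 Inventory would be configured separately.

```python
import json


def https_only_policy(bucket: str) -> dict:
    """Build a bucket policy that rejects requests made without TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:SecureTransport evaluates to "false" for plain-HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = https_only_policy("example-data-lake")
print(json.dumps(policy, indent=2))
```

The policy is an explicit Deny, so it overrides any Allow granted elsewhere for requests that arrive over plain HTTP.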

1480

MCQmedium

A company uses Amazon RDS for MySQL to store application data. The security team requires that all database credentials be rotated automatically every 90 days. The data engineer needs to implement a solution that minimizes operational overhead. The database credentials are stored in AWS Secrets Manager. The application retrieves the credentials at startup and caches them for the duration of the session. The application is deployed on Amazon ECS with Fargate. Which solution should the data engineer implement to meet the rotation requirement with minimal overhead?

A.Store the credentials in AWS Systems Manager Parameter Store and use a scheduled job to update the password.

B.Use Secrets Manager's automatic rotation feature with a custom Lambda function that updates the RDS password.

C.Create a scheduled Lambda function that updates the password in Secrets Manager and manually updates the application configuration.

D.Configure IAM database authentication for the RDS instance and update the application to use IAM credentials.

AnswerB

Secrets Manager can rotate secrets automatically with a Lambda.

Why this answer

Option B is correct because Secrets Manager can automatically rotate secrets using a Lambda function that updates the RDS password. This minimizes overhead compared to manual rotation or changing IAM policies. Option D is incorrect because IAM database authentication is a separate feature that does not rotate the stored credentials automatically.

Option C is incorrect because manually updating the application configuration is operationally heavy. Option A is incorrect because Parameter Store does not have built-in rotation; it would require additional automation.

1481

Multi-Selectmedium

Which THREE storage classes in Amazon S3 are designed for infrequently accessed data with millisecond retrieval times? (Select THREE.)

Select 3 answers

A.S3 Glacier Flexible Retrieval

B.S3 One Zone-IA

C.S3 Glacier Deep Archive

D.S3 Intelligent-Tiering

E.S3 Standard-IA

AnswersB, D, E

One Zone-IA also provides millisecond retrieval for infrequently accessed data.

Why this answer

S3 One Zone-IA is designed for infrequently accessed data that requires millisecond retrieval times, but does not require the resilience of multiple Availability Zones. It stores data in a single AZ and offers the same low-latency performance as S3 Standard, making it suitable for non-critical, infrequently accessed data.

Exam trap

The trap here is that candidates often confuse S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive as having millisecond retrieval times, but these classes are designed for archival access with retrieval times measured in minutes or hours, not milliseconds.

1482

MCQmedium

A company is using Amazon RDS for MySQL with Multi-AZ deployment. The primary DB instance experiences a hardware failure, causing automatic failover to the standby. After the failover, the application reports that the database endpoint is unreachable for about 60 seconds. What is the MOST likely cause?

A.The standby instance took longer than expected to promote to primary.

B.The standby instance was not in a synchronized state and required a manual promotion.

C.The application was using the wrong endpoint and needed to be reconfigured.

D.The DNS record for the DB instance endpoint needed to update to point to the new primary.

AnswerD

DNS propagation causes the 60-second delay.

Why this answer

Option D is correct because during failover, the DNS record is updated to point to the standby instance, which takes about 60 seconds to propagate. Option A is wrong because failover itself typically takes 60-120 seconds, so a 60-second outage is not longer than expected. Option B is wrong because RDS automatically manages failover without manual promotion.

Option C is wrong because the endpoint name does not change on failover, so the application does not need to be reconfigured.

1483

MCQhard

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The Firehose delivery stream has a buffer size of 64 MB and a buffer interval of 300 seconds. The data volume is 1 GB per minute, and the average record size is 1 KB. The data must be delivered to S3 within 5 minutes of ingestion. The engineer notices that some files are being delivered after 10 minutes. What is the most likely cause?

A.The buffer size of 64 MB is too small for the data volume

B.The data is not compressed, causing larger file sizes

C.The buffer interval of 300 seconds is too long

D.The S3 bucket is throttling PUT requests due to high throughput

AnswerD

High PUT request rates can cause throttling, leading to retries and increased delivery time.

Why this answer

Option D is correct. With 1 KB records and a 64 MB buffer, it takes ~64,000 records to fill the buffer. At 1 GB/min (~1,000,000 records/min), the buffer fills in ~3.84 seconds, so the 300-second interval is not the bottleneck. However, flushing a buffer every few seconds produces a high PUT request rate, and if the S3 bucket throttles those requests, Firehose retries and delivery is delayed.

Option A is incorrect because a 64 MB buffer fills within seconds at this volume, so its size does not delay delivery. Option C is incorrect because, although 300 seconds is long, the data volume is high enough to trigger delivery much sooner. Option B is incorrect because compression would reduce size, not cause delays.
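The buffer arithmetic in the explanation can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using decimal units (1 KB = 1,000 bytes), which is the approximation the figures above appear to rely on; the variable names are chosen for this example.

```python
# Decimal units, matching the rounded figures in the explanation.
BUFFER_BYTES = 64 * 1000 * 1000       # 64 MB Firehose buffer size
RECORD_BYTES = 1000                   # 1 KB average record size
INGEST_BYTES_PER_MIN = 1000 ** 3      # 1 GB ingested per minute

records_per_buffer = BUFFER_BYTES // RECORD_BYTES
records_per_min = INGEST_BYTES_PER_MIN // RECORD_BYTES
# Seconds needed to accumulate one full buffer at the stated ingest rate.
fill_seconds = BUFFER_BYTES * 60 / INGEST_BYTES_PER_MIN

print(records_per_buffer)        # 64000
print(records_per_min)           # 1000000
print(round(fill_seconds, 2))    # 3.84
```

Because the size threshold is reached in under four seconds, the 300-second interval never gets the chance to trigger a flush at this volume.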

1484

MCQeasy

A company needs to ingest data from multiple SaaS applications (Salesforce, Marketo) and load it into Amazon Redshift. The data must be transformed before loading. Which AWS service should be used to build the ingestion pipelines?

A.AWS Database Migration Service (DMS)

B.AWS Data Pipeline

C.Amazon AppFlow

D.AWS Glue (crawlers and ETL jobs)

AnswerD

AWS Glue can connect to SaaS sources via JDBC and perform complex transformations.

Why this answer

Option D is correct because AWS Glue can connect to various sources via JDBC and perform ETL transformations. Option C (AppFlow) is for ingesting SaaS data but offers only simple field mappings rather than complex transformations. Option A (DMS) is for database migrations, not SaaS.

Option B (Data Pipeline) is older and less flexible.

1485

MCQeasy

A data engineer needs to ingest streaming data from an IoT fleet into Amazon S3 for near-real-time analytics. The data volume is approximately 5 GB per hour, and each event is less than 1 KB. Which AWS service should be used as the ingestion endpoint?

A.AWS IoT Core

B.AWS DataSync

C.Amazon AppFlow

D.Amazon Kinesis Data Streams

AnswerA

Designed for IoT device ingestion.

Why this answer

AWS IoT Core is purpose-built for ingesting data from IoT devices, supporting MQTT, HTTP, and WebSocket protocols. It can handle millions of devices and high-throughput, small-message payloads (each event <1 KB) and integrates directly with Amazon S3 via IoT Core rules, making it the ideal ingestion endpoint for near-real-time analytics on streaming IoT data.

Exam trap

The trap here is that candidates often default to Amazon Kinesis Data Streams for any streaming workload, overlooking that AWS IoT Core is the specialized, fully managed service designed specifically for IoT device ingestion, with native MQTT support and direct S3 integration via rules.

How to eliminate wrong answers

Option B (AWS DataSync) is wrong because it is designed for one-time or scheduled bulk data transfers between on-premises storage and AWS, not for continuous, near-real-time streaming ingestion from IoT devices. Option C (Amazon AppFlow) is wrong because it is a fully managed integration service for transferring data between SaaS applications (e.g., Salesforce, Slack) and AWS, not for ingesting IoT device telemetry streams. Option D (Amazon Kinesis Data Streams) is wrong because while it can ingest streaming data, it is a generic stream processing service that requires additional configuration (e.g., Kinesis Data Firehose) to write to S3, and it is not the dedicated IoT ingestion endpoint; AWS IoT Core is the recommended first-hop for IoT data.

1486

MCQhard

A data engineer is designing a data lake on Amazon S3. The data is ingested from multiple sources and stored in a partitioned structure under the 'landing' prefix. The engineer needs to ensure that only authorized applications can write to the 'landing' zone, while all AWS accounts in the organization can read the data. Which combination of S3 bucket policies and IAM policies should be used?

A.Use bucket ACLs to grant write access to the authorized IAM roles and read access to all authenticated users.

B.Use S3 Object Ownership to enforce bucket owner enforced. Grant write access via IAM roles.

C.Create a bucket policy with a Deny for all principals except the authorized IAM roles on the 'landing' prefix. Add a separate statement allowing read access to the organization.

D.Create an IAM policy that allows s3:PutObject only for the 'landing' prefix and attach it to the authorized roles. Allow read access via an S3 Access Point.

AnswerC

This explicitly restricts write access while allowing reads.

Why this answer

Option C is correct because it uses a bucket policy with an explicit Deny on the 'landing' prefix for all principals except the authorized IAM roles, ensuring only those roles can write. A separate Allow statement grants read access to the entire organization (e.g., via the `aws:PrincipalOrgID` condition key), which satisfies the requirement that all AWS accounts in the organization can read the data. This approach leverages S3 bucket policies for cross-account access control without relying on ACLs or IAM policies alone.

Exam trap

The trap here is that candidates often confuse IAM policies (which are identity-based and only apply within the same account) with resource-based policies (like S3 bucket policies) that are required for cross-account access, leading them to choose Option D or A without realizing the need for an explicit Deny or organization-wide condition key.

How to eliminate wrong answers

Option A is wrong because bucket ACLs do not support condition keys like `aws:PrincipalOrgID` and cannot restrict write access to specific IAM roles across accounts; they also grant read access to 'all authenticated users' (a deprecated concept that includes any authenticated AWS user, not just the organization). Option B is wrong because S3 Object Ownership with 'bucket owner enforced' only ensures the bucket owner retains object ownership, but does not by itself restrict write access to authorized roles or grant read access to the organization; it must be combined with a bucket policy. Option D is wrong because an IAM policy attached to roles only controls permissions within the same account and cannot grant cross-account read access to the entire organization; an S3 Access Point can simplify access but does not inherently allow all organization accounts to read without additional bucket policies or resource-based policies.
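The bucket policy described for this question might look roughly like the sketch below: one Deny statement that blocks `s3:PutObject` under `landing/` for every principal except the listed roles, and one Allow statement that grants reads to the organization through `aws:PrincipalOrgID`. The bucket name, role ARN, organization ID, and the helper name `landing_zone_policy` are all hypothetical values chosen for illustration.

```python
import json


def landing_zone_policy(bucket: str, writer_role_arns: list, org_id: str) -> dict:
    """Build a bucket policy: restricted writes to landing/, org-wide reads."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLandingWritesExceptAuthorizedRoles",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/landing/*",
                # The Deny applies only when the caller matches none of the listed roles.
                "Condition": {"ArnNotLike": {"aws:PrincipalArn": writer_role_arns}},
            },
            {
                "Sid": "AllowOrganizationRead",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Only principals that belong to this organization are allowed.
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
            },
        ],
    }


# All identifiers below are placeholders, not values from the question.
policy = landing_zone_policy(
    "example-data-lake",
    ["arn:aws:iam::111122223333:role/ingest-writer"],
    "o-exampleorgid",
)
print(json.dumps(policy, indent=2))
```

Note that the Deny statement only blocks unauthorized writers; the authorized roles still need an Allow for `s3:PutObject`, for example from their own IAM policies.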

1487

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for clickstream data. The data arrives in batches of 10-50 MB every 5 seconds. The engineer needs to buffer the data, perform simple transformations (e.g., add timestamp, remove PII), and land it in S3 within 10 minutes. Which TWO services should be combined? (Choose TWO.)

Select 2 answers

A.Amazon Simple Queue Service (SQS)

B.Amazon Kinesis Data Firehose

C.AWS Lambda

D.AWS Glue ETL

E.Amazon Kinesis Data Streams

AnswersB, C

Firehose can buffer and invoke Lambda for transformation, then deliver to S3.

Why this answer

Option B (Kinesis Data Firehose) can buffer the data and invoke Lambda for transformation before delivering to S3. Option C (Lambda) performs the transformation. Option A (SQS) is for decoupling, not streaming to S3.

Option E (Kinesis Data Streams) requires a custom consumer. Option D (Glue) is batch-oriented.
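To make the Firehose-plus-Lambda combination concrete, here is a minimal sketch of a transformation handler. It follows the Firehose transformation contract (each returned record carries `recordId`, `result`, and base64-encoded `data`), but the payload format (JSON objects) and the PII field names `email` and `ip_address` are assumptions made for this example.

```python
import base64
import json
from datetime import datetime, timezone

# Hypothetical PII field names; the question does not specify them.
PII_FIELDS = {"email", "ip_address"}


def lambda_handler(event, context):
    """Strip PII fields and add a timestamp to each Firehose record."""
    output = []
    for record in event["records"]:
        # Firehose delivers each payload base64-encoded; JSON is assumed here.
        payload = json.loads(base64.b64decode(record["data"]))
        cleaned = {k: v for k, v in payload.items() if k not in PII_FIELDS}
        cleaned["processed_at"] = datetime.now(timezone.utc).isoformat()
        # Newline-delimit records so the objects written to S3 are line-oriented.
        data = (json.dumps(cleaned) + "\n").encode("utf-8")
        output.append(
            {
                "recordId": record["recordId"],  # must echo the incoming ID
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            }
        )
    return {"records": output}
```

A production handler would typically also return `ProcessingFailed` for records it cannot parse, so that Firehose can route them to its error output instead of failing the whole batch.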

1488

MCQmedium

A data engineer runs the command shown. The consumer application is unable to read data older than 24 hours. What is the most likely cause?

A.The shard has reached its maximum sequence number.

B.The stream is encrypted with KMS, preventing access.

C.The retention period is set to 24 hours, so data older than 24 hours is deleted.

D.The stream is in ACTIVE status but not processing data.

AnswerC

Data retention is 24 hours; data beyond that is expired.

Why this answer

The stream's retention period is 24 hours, meaning data is automatically deleted after 24 hours. The consumer tries to read data older than 24 hours, which is no longer available.

1489

MCQeasy

A company has an S3 bucket that stores logs for compliance. The compliance team requires that objects are retained for 7 years and cannot be deleted or overwritten. Which S3 feature should be used?

A.Enable S3 Object Lock with retention mode COMPLIANCE and a retention period of 7 years

B.Enable MFA Delete on the bucket

C.Configure an S3 bucket policy that denies delete and overwrite actions

D.Enable S3 Versioning and configure a lifecycle policy to expire objects after 7 years

AnswerA

Object Lock with COMPLIANCE mode ensures objects cannot be deleted or overwritten for the retention period.

Why this answer

S3 Object Lock with retention mode COMPLIANCE prevents objects from being deleted or overwritten for the specified retention period. Versioning alone does not prevent deletion. MFA Delete prevents accidental deletion but not overwrite.

Lifecycle policies can expire objects but do not prevent deletion.

1490

MCQeasy

A data engineer needs to store large amounts of data that is accessed infrequently but must be retrieved immediately when needed. Which Amazon S3 storage class is most cost-effective?

A.S3 Intelligent-Tiering

B.S3 One Zone-IA

C.S3 Standard-IA

D.S3 Glacier Deep Archive

AnswerC

S3 Standard-IA is designed for infrequent access with millisecond retrieval.

Why this answer

Option C is correct because S3 Standard-IA is for infrequent access with immediate retrieval. Option D is wrong because S3 Glacier Deep Archive has retrieval delays. Option B is wrong because S3 One Zone-IA is for data that can be recreated.

Option A is wrong because S3 Intelligent-Tiering monitors access patterns but may not be the most cost-effective if the access pattern is known.

1491

MCQmedium

Refer to the exhibit. A data engineer is troubleshooting a Glue job that reads objects from this S3 bucket. The job runs successfully but produces no output. The Glue catalog table points to the same S3 path. What is the most likely cause?

A.The S3 key does not follow Hive-style partitioning (e.g., year=2024/month=01).

B.The object metadata is too large.

C.The StorageClass is not supported by Glue.

D.The ContentType is not supported by Glue.

AnswerA

Glue discovers partitions from Hive-style key prefixes.

Why this answer

Option A is correct because Glue expects a partition structure (e.g., year=2024/month=01/day=01) to automatically discover partitions. The current key has no partition markers. Option D is wrong because Glue can read application/octet-stream.

Option C is wrong because the storage class is STANDARD. Option B is wrong because object metadata is irrelevant to the Glue catalog.

1492

MCQhard

A company runs a data pipeline that ingests streaming data from an IoT fleet into Amazon Kinesis Data Streams (KDS) with 50 shards. A Lambda function processes records from the stream and writes them to an Amazon DynamoDB table for real-time analytics. The Lambda function is configured with a batch size of 100 and a maximum batching window of 60 seconds. Recently, the company has been seeing an increasing number of 'WriteProvisionedThroughputExceededException' errors from DynamoDB, causing Lambda to retry and eventually send records to a dead-letter queue (DLQ). The DynamoDB table is provisioned with 5000 read capacity units (RCU) and 5000 write capacity units (WCU). The average item size is 1 KB. The KDS stream receives an average of 8000 records per second, each 2 KB in size. The Lambda function performs a simple transformation and writes each record individually to DynamoDB. The company wants to reduce the throttling errors without increasing the DynamoDB WCU provision. Which course of action is most likely to achieve this?

A.Modify the Lambda function to use DynamoDB BatchWriteItem to write records in batches of 25.

B.Increase the Lambda function's reserved concurrency to 1000.

C.Increase the Lambda function timeout to 5 minutes to allow more time for retries.

D.Increase the Lambda batch size to 500 and reduce the batching window to 30 seconds.

AnswerA

BatchWriteItem reduces the number of write API calls, lowering the effective WCU consumption per record and reducing throttling.

Why this answer

Option A is correct because writing records in batches using DynamoDB's BatchWriteItem API reduces the number of write requests. Each batch still consumes WCU for every item it contains, but with fewer API calls there is less per-request overhead, which can reduce throttling without increasing WCU. Option D is wrong because increasing the Lambda batch size would cause more records to be processed per invocation, but if each record is still written individually, the number of write requests remains the same.

Option B is wrong because increasing Lambda concurrency would increase the number of concurrent invocations, potentially increasing throttling. Option C is wrong because increasing the Lambda timeout does not affect the rate of writes.
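As a sketch of the BatchWriteItem approach, the helper below groups items into chunks of 25 (the maximum number of put or delete requests a single BatchWriteItem call accepts) and builds the `RequestItems` structure for each call. The table name `iot-readings`, the item attributes, and the helper name are invented for this example.

```python
from typing import Iterator

# BatchWriteItem accepts at most 25 put/delete requests per call.
MAX_BATCH_ITEMS = 25


def batch_write_requests(table_name: str, items: list) -> Iterator[dict]:
    """Yield one BatchWriteItem request payload per chunk of up to 25 items."""
    for start in range(0, len(items), MAX_BATCH_ITEMS):
        chunk = items[start:start + MAX_BATCH_ITEMS]
        yield {
            "RequestItems": {
                table_name: [{"PutRequest": {"Item": item}} for item in chunk]
            }
        }


# Hypothetical items in DynamoDB's low-level attribute-value format.
items = [
    {"device_id": {"S": f"dev-{i}"}, "reading": {"N": str(i)}}
    for i in range(100)
]
requests = list(batch_write_requests("iot-readings", items))
print(len(requests))  # 4
```

Each payload could be passed to a boto3 DynamoDB client as `client.batch_write_item(**payload)`. Any entries returned under `UnprocessedItems` in the response would need to be retried, ideally with backoff.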

1493

Multi-Selectmedium

A data engineer is designing a disaster recovery plan for an Amazon RDS for MySQL database. The database must have a Recovery Point Objective (RPO) of less than 5 minutes and a Recovery Time Objective (RTO) of less than 30 minutes. Which TWO actions should the engineer take to meet these requirements?

Select 2 answers

A.Enable automated backups with a 1-day retention period and point-in-time recovery.

B.Enable Multi-AZ deployment.

C.Enable automated backups with a 5-minute retention period.

D.Create a cross-Region read replica.

E.Use a single-AZ instance with a standby in another Region.

AnswersA, B

Automated backups with point-in-time recovery allow restoring to any point within the retention period, achieving RPO of 5 minutes.

Why this answer

Options A and B are correct. Multi-AZ provides automatic failover to a standby in another AZ, meeting the RTO. Automated backups with point-in-time recovery provide an RPO of about 5 minutes.

Option C is wrong because automated backup retention is configured in days, not minutes. Option D is wrong because a cross-Region read replica must be promoted manually, which raises the RTO. Option E is wrong because a single-AZ instance does not provide automatic failover.

1494

Multi-Selecthard

Which THREE considerations are important when designing a DynamoDB table for high-traffic gaming leaderboards? (Choose three.)

Select 3 answers

A.Use strongly consistent reads for all queries

B.Enable DynamoDB Accelerator (DAX) for low-latency reads

C.Use Time to Live (TTL) to automatically expire old scores

D.Use DynamoDB Adaptive Capacity to handle uneven access patterns

E.Enable DynamoDB Streams for real-time updates

AnswersB, C, D

DAX provides caching for fast reads.

Why this answer

DynamoDB Accelerator (DAX) provides in-memory caching for DynamoDB tables, reducing read latency from single-digit milliseconds to microseconds. For high-traffic gaming leaderboards, where millions of players query scores concurrently, DAX offloads read traffic from the main table, preventing throttling and ensuring consistent low-latency responses for the most frequently accessed data.

Exam trap

The trap here is that candidates often confuse DynamoDB Streams with a read-acceleration feature, but Streams are strictly for change data capture and do not reduce query latency or handle high read throughput.

1495

MCQhard

A company is ingesting data from multiple on-premises databases into AWS using AWS Database Migration Service (DMS). The data must be continuously replicated with minimal downtime. However, the source databases do not support native CDC. What should the data engineer do to enable continuous replication?

A.Use Amazon Kinesis Data Streams with a custom producer to capture database changes.

B.Use Amazon Redshift Spectrum to directly query the on-premises databases.

C.Use AWS DMS with log-based CDC if the source databases support it; otherwise, use DMS with batch replication and schedule frequent refreshes.

D.Set up AWS Glue jobs to run every minute to extract and load the data.

AnswerC

DMS supports CDC via source database logs, and if not available, batch replication can approximate continuous sync.

Why this answer

Option C is correct because DMS can use the source engine's native, log-based CDC capabilities (e.g., Oracle LogMiner, MySQL binlog) when they are available, and can fall back to frequently scheduled batch reloads when they are not. Option D is wrong because Glue is not for real-time CDC from databases. Option A is wrong because Kinesis is for streaming data, not for pulling from databases.

Option B is wrong because Redshift Spectrum is a query feature, not a replication tool.

1496

MCQeasy

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an on-premises Oracle database to Amazon S3. The data is stored as CSV files. The downstream team requires the data to be in Apache Parquet format. Which change should the data engineer make to the DMS task?

A.Modify the DMS task to use Apache Parquet as the target table preparation mode.

B.Add an S3 lifecycle rule to convert CSV to Parquet.

C.Change the DMS task to use full load instead of continuous replication.

D.Configure a Lambda function to transform data after DMS writes to S3.

AnswerA

DMS can write directly in Parquet format.

Why this answer

Option A is correct because DMS can write Parquet directly to an S3 target; the output format is a setting on the task's S3 target endpoint. Option B is wrong because S3 lifecycle policies do not convert formats. Option D is wrong because DMS does not support Lambda transformations natively, so a Lambda step would add a separate pipeline to maintain.

Option C is wrong because switching to full load only would stop continuous replication.
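For writing Parquet directly from DMS, the relevant knob is the `DataFormat` field of the S3 target endpoint settings. The sketch below only assembles that settings dictionary; the bucket name, folder, role ARN, and helper name are placeholders, and the exact set of fields a real endpoint needs may differ.

```python
def parquet_s3_settings(bucket: str, folder: str, role_arn: str) -> dict:
    """Assemble S3 target endpoint settings that request Parquet output."""
    return {
        "BucketName": bucket,
        "BucketFolder": folder,
        "ServiceAccessRoleArn": role_arn,
        # The S3 target writes CSV unless the format is set to Parquet.
        "DataFormat": "parquet",
    }


# Placeholder values for illustration only.
settings = parquet_s3_settings(
    "example-dms-target",
    "oracle-cdc",
    "arn:aws:iam::111122223333:role/dms-s3-access",
)
print(settings["DataFormat"])  # parquet
```

With boto3, a dictionary like this would be supplied as the `S3Settings` argument when creating or modifying the DMS target endpoint.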

1497

Multi-Selectmedium

A company is using Amazon DynamoDB as a data store for a real-time application. The application reads a single item by primary key and occasionally updates it. The data engineer notices high read latency during peak hours. Which TWO actions would most effectively reduce read latency?

Select 2 answers

A.Increase the read capacity units for the table.

B.Enable DynamoDB global tables.

C.Add a local secondary index on the table.

D.Disable auto-scaling and set a fixed read capacity.

E.Enable DynamoDB Accelerator (DAX) for the table.

AnswersA, E

More capacity reduces throttling and latency during peaks.

Why this answer

Option A: Increasing read capacity units reduces throttling, which reduces latency. Option E: Enabling DynamoDB Accelerator (DAX) adds an in-memory cache, reducing read latency. Option D is wrong because disabling auto-scaling could make things worse.

Option B is wrong because global tables are for multi-Region replication, not for reducing latency under load. Option C is wrong because a local secondary index doesn't help single-item reads by primary key.

1498

MCQmedium

A company is designing a data lake on AWS and must comply with GDPR requirements. The company needs to implement data masking for personally identifiable information (PII) columns in Amazon Redshift. Which feature should be used?

A.Use Amazon RDS Proxy to intercept queries

B.Amazon S3 Object Lambda to mask data on the fly

C.Create views in Redshift that apply masking functions

D.AWS Lake Formation row-level security

AnswerC

Redshift views can apply masking functions to hide PII.

Why this answer

Option C is correct. Views in Redshift can apply masking functions to PII columns, so queries against the views see masked values. Option B is wrong because S3 Object Lambda is for S3.

Option D is wrong because Lake Formation row-level security does not mask column data in Redshift directly. Option A is wrong because RDS Proxy is for RDS, not Redshift.

1499

MCQmedium

A data engineer is troubleshooting a step function that orchestrates ETL jobs. The state machine fails with 'State Machine Execution Throttled' error. What should the engineer do to resolve this?

A.Reduce the number of steps in the state machine.

B.Set up a CloudWatch alarm to detect throttling and retry.

C.Adjust the API rate limits in the state machine definition.

D.Request a service quota increase for concurrent executions.

AnswerD

Increasing the limit resolves the throttling.

Why this answer

Option D is correct because the throttling is likely due to exceeding the default concurrent execution limit. Requesting a quota increase from AWS is the proper solution. Option A is wrong because reducing the number of steps does not affect concurrent execution limits.

Option B is wrong because CloudWatch alarms only monitor; they do not resolve throttling. Option C is wrong because the error is about execution throttling, not API throttling, so adjusting API rate limits is not relevant.

1500

MCQhard

A company uses Kinesis Data Firehose with a Lambda function for data transformation. The transformation is failing intermittently due to Lambda timeouts. The maximum record size is 1 MB. What is the most cost-effective way to reduce failures without losing data?

A.Use Kinesis Data Analytics to pre-process data before Firehose

B.Decrease the Firehose batch size to reduce the number of records per invocation

C.Configure the Firehose delivery stream to send failed records to an S3 dead-letter bucket

D.Increase the Lambda function timeout and memory allocation

AnswerD

Increasing timeout and memory reduces timeouts without losing data.

Why this answer

Option D is correct because increasing the Lambda timeout and memory reduces timeouts while remaining cost-effective. Option C is wrong because sending failed records to an S3 dead-letter bucket does not reduce failures; it only sets the failed records aside. Option B is wrong because decreasing batch size may increase costs.

Option A is wrong because using Kinesis Data Analytics adds unnecessary complexity and cost.
