Knowledge + Practice

AWS Certified Data Engineer Associate DEA-C01 (DEA-C01) — Questions 226–300

1786 questions total · 24 pages · All types, answers revealed

Page 4 of 24

226

MCQhard

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Streams with a shard count of 10. The incoming data rate is 1 MB/second. The consuming application uses the Kinesis Client Library (KCL) with a single worker. What is the most likely performance bottleneck?

A.The Lambda function invoked by the stream has a cold start issue

B.The data stream has insufficient write capacity

C.The single KCL worker cannot process all shards in parallel

D.The shard count is too low to handle the data rate

AnswerC

KCL workers should be scaled to match shard count for parallel processing.

Why this answer

Option C is correct because a single KCL worker must process all 10 shards on one host, which limits parallelism and throughput. Option D is wrong because the shard count is adequate for 1 MB/s (each shard can ingest 1 MB/s, so 10 shards accept up to 10 MB/s). Option B is wrong for the same reason: the stream's write capacity exceeds the incoming data rate.

Option A is wrong because the consumer uses KCL, not Lambda, so Lambda cold starts do not apply.
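
The capacity reasoning above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are invented for this example, and the 1 MB/s per-shard ingest figure is the one quoted in the explanation.

```python
import math

# Per-shard ingest limit quoted in the explanation (1 MB/s per shard).
SHARD_WRITE_MB_PER_S = 1.0


def write_capacity_mb_per_s(shard_count: int) -> float:
    """Total ingest capacity of a stream with the given number of shards."""
    return shard_count * SHARD_WRITE_MB_PER_S


def max_shards_per_worker(shard_count: int, worker_count: int) -> int:
    """Largest number of shards any one worker must handle if shards
    are spread as evenly as possible across the workers."""
    return math.ceil(shard_count / worker_count)


print(write_capacity_mb_per_s(10))   # ingest capacity of the 10-shard stream
print(max_shards_per_worker(10, 1))  # one worker handles every shard
print(max_shards_per_worker(10, 5))  # five workers share the shards
```

With a single worker, all 10 shards land on one host even though the stream itself has ample write headroom for a 1 MB/s feed.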

227

MCQeasy

A company wants to ingest real-time clickstream data from a website into Amazon S3 with a maximum latency of 60 seconds. The data volume peaks at 500 MB/s. Which service should they use to buffer and deliver the data to S3?

A.Amazon Kinesis Data Firehose

B.Amazon Simple Queue Service (SQS)

C.Amazon Kinesis Data Streams

D.AWS Lambda

AnswerA

Firehose is designed for streaming ingestion into S3 with configurable buffering.

Why this answer

Option A is correct because Kinesis Data Firehose can be configured to buffer incoming data for 60 seconds before delivering it to S3. Option C is wrong because Kinesis Data Streams stores data for up to 365 days but requires a consumer to write to S3. Option B is wrong because SQS is a message queue, not designed for streaming ingestion.

Option D is wrong because Lambda is not a buffer service.

228

MCQmedium

A company is using Kinesis Data Firehose to deliver data to an S3 bucket. The delivery stream is failing with 'S3 bucket access denied' errors. The bucket policy allows the Firehose service principal. What could be the issue?

A.The S3 bucket is in a different VPC

B.The S3 bucket uses SSE-KMS and Firehose does not have KMS permissions

C.The S3 bucket name contains invalid characters

D.The IAM role assigned to Firehose lacks s3:PutObject permission

AnswerD

The role must have S3 write permissions.

Why this answer

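Option D is correct because the bucket policy and the trust policy for the Firehose service principal are not enough on their own; the IAM delivery role that Firehose assumes must also have S3 write permissions such as s3:PutObject. Option B is wrong because SSE-KMS requires KMS permissions, not S3 access. Option C is wrong because the bucket name is part of the configuration, and an invalid name would not surface as an access-denied error.

Option A is wrong because S3 buckets do not reside in a VPC; VPC endpoints affect connectivity, not access-denied errors after a connection is made.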

229

MCQeasy

Refer to the exhibit. A data engineer runs this CloudWatch Logs Insights query on a log group but gets no results. What is the most likely reason?

A.The query syntax is invalid

B.The query limit of 20 is too low

C.The time range is not specified and defaults to the last 15 minutes

D.There are no log events containing the string 'ERROR' in the log group

AnswerD

The filter matches only lines with ERROR.

Why this answer

Option D is correct because the query filters for 'ERROR', and if the logs don't contain that string, no results are returned. Option B is wrong because the limit of 20 only caps the number of results returned; it cannot cause an empty result. Option A is wrong because the query is syntactically correct.

Option C is wrong because the query itself contains no time filter; if no error logs exist in the range scanned, the results are empty regardless.

230

MCQmedium

A company is using Amazon RDS for PostgreSQL and wants to minimize downtime during a major version upgrade. Which approach best meets this requirement?

A.Perform an in-place upgrade using the AWS Management Console.

B.Modify the DB instance class to a larger size to handle the upgrade.

C.Create a read replica of the current instance, upgrade the replica, and then promote it to primary.

D.Use pg_dump and pg_restore to migrate data to a new upgraded instance.

AnswerC

This minimizes downtime by switching over after upgrade.

Why this answer

Option C is correct because creating a read replica, upgrading it to the new major version, and then promoting it to primary minimizes downtime by allowing the replica to be upgraded while the original instance remains operational. The promotion process is fast, typically taking only a few seconds to redirect traffic, and avoids the longer downtime associated with in-place upgrades or full data migrations.

Exam trap

The trap here is that candidates often assume an in-place upgrade (Option A) is the simplest and fastest method, but they overlook the fact that major version upgrades in RDS PostgreSQL require a longer downtime window due to the need for a database restart and potential compatibility checks, making the read replica promotion strategy the superior choice for minimizing downtime.

How to eliminate wrong answers

Option A is wrong because an in-place major version upgrade for RDS PostgreSQL requires a database restart and can take significant time (often 10-30 minutes or more) depending on instance size and data volume, leading to unacceptable downtime. Option B is wrong because modifying the DB instance class to a larger size does not perform a version upgrade; it only changes compute and memory resources, leaving the PostgreSQL version unchanged. Option D is wrong because using pg_dump and pg_restore involves exporting the entire database to a file and then importing it into a new instance, which can take hours for large datasets and requires the source database to be unavailable or read-only during the process, resulting in extended downtime.

231

Multi-Selecthard

A company is migrating an on-premises Apache Hadoop cluster to Amazon EMR. The data is stored in HDFS and must be moved to Amazon S3. Which THREE considerations are important when designing the migration? (Choose THREE.)

Select 3 answers

A.S3 supports POSIX file system semantics

B.HDFS can be directly mounted as an S3 bucket

C.EMR can read data directly from S3 using EMRFS

D.S3 provides eventual consistency for overwrite PUTS and DELETES

E.Using S3 as the data store allows independent scaling of compute and storage

AnswersC, D, E

EMRFS allows EMR to access S3 as a filesystem.

Why this answer

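Option C is correct because Amazon EMR uses the EMR File System (EMRFS) to directly read and write data stored in Amazon S3, treating S3 as a scalable, durable data lake without needing to first copy data into HDFS. This allows EMR clusters to process data directly from S3, enabling decoupled compute and storage, which is also why Option E is correct. Option D reflects S3's historical consistency model; since December 2020, S3 provides strong read-after-write consistency for PUTs and DELETEs.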

Exam trap

The trap here is that candidates often assume S3 supports POSIX semantics or that HDFS can be directly mounted as an S3 bucket, confusing the object storage model with a traditional filesystem, leading them to select options A or B.

232

Multi-Selecteasy

Which TWO AWS services can be used to protect sensitive data stored in Amazon S3 by preventing accidental public access? (Choose two.)

Select 2 answers

A.AWS WAF

B.Amazon GuardDuty

C.AWS Trusted Advisor

D.Amazon S3 Block Public Access

E.AWS Key Management Service (KMS)

AnswersC, D

Trusted Advisor provides an S3 bucket permissions check and recommendations.

Why this answer

Amazon S3 Block Public Access (option D) is an S3 feature that blocks public access at the bucket or account level. AWS Trusted Advisor (option C) has a check for S3 bucket permissions that can identify publicly accessible buckets. Option A is wrong because AWS WAF protects web applications, not S3 bucket access settings.

Option B is wrong because GuardDuty is a threat detection service. Option E is wrong because KMS provides encryption at rest, not public access prevention.
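
As a rough illustration of what blocking public access looks like in configuration terms, the sketch below builds the four Block Public Access settings as a plain dictionary. The helper name and the bucket name in the comment are hypothetical; the AWS call is shown only as a comment so the snippet runs without credentials.

```python
def block_public_access_config() -> dict:
    """Return the four S3 Block Public Access settings, all enabled."""
    return {
        "BlockPublicAcls": True,        # reject new public ACLs
        "IgnorePublicAcls": True,       # ignore public ACLs already present
        "BlockPublicPolicy": True,      # reject new public bucket policies
        "RestrictPublicBuckets": True,  # restrict access to buckets with public policies
    }


config = block_public_access_config()
print(config)

# With boto3 installed and credentials configured, the settings could be applied as:
#   boto3.client("s3").put_public_access_block(
#       Bucket="example-bucket",  # hypothetical bucket name
#       PublicAccessBlockConfiguration=config,
#   )
```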

233

Multi-Selecthard

A data engineer is designing an Amazon Redshift data warehouse for a high-traffic analytics workload. The engineer needs to ensure fast query performance and minimize data movement. Which THREE design decisions should be made? (Choose THREE.)

Select 3 answers

A.Choose DISTSTYLE KEY for tables that are frequently joined.

B.Use the default distribution style for all tables.

C.Use DISTSTYLE ALL for all large fact tables.

D.Apply appropriate compression encodings to columns.

E.Define SORT KEYs on columns used in WHERE clauses.

AnswersA, D, E

KEY distribution collocates rows based on join keys, reducing data movement.

Why this answer

Options A, D, and E are correct. Distribution style KEY on join keys collocates data, sort keys on WHERE columns improve scan efficiency, and compression reduces I/O. Option C is wrong because ALL distribution stores a full copy of the table on every node, increasing storage and load time.

Option B is wrong because the default distribution style is AUTO, which may not be optimal.
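
To make the three design decisions concrete, here is a small sketch that assembles a table definition combining a distribution key, a sort key, and column encodings. The table and column names are invented for illustration, and the encoding choices are one plausible option rather than a recommendation.

```python
def build_sales_ddl() -> str:
    """Return a Redshift CREATE TABLE statement for a hypothetical sales table."""
    return """
CREATE TABLE sales (
    sale_id     BIGINT        ENCODE az64,
    customer_id BIGINT        ENCODE az64,
    sale_date   DATE          ENCODE raw,
    amount      DECIMAL(12,2) ENCODE az64
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
""".strip()


ddl = build_sales_ddl()
print(ddl)
```

Here customer_id stands in for a frequent join column and sale_date for a column commonly used in WHERE clauses.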

234

Multi-Selecteasy

A data engineer is setting up an Amazon RDS for MySQL database. The database must be highly available and fail over automatically in case of an AZ outage. Which TWO configurations should the engineer enable? (Choose TWO.)

Select 2 answers

A.Multi-AZ deployment

B.A DB subnet group with subnets in at least two Availability Zones

C.Enhanced Monitoring

D.Automated backups with a retention period of 30 days

E.Read replicas in a different Region

AnswersA, B

Multi-AZ creates a standby instance in a different Availability Zone for automatic failover.

Why this answer

Multi-AZ deployment (Option A) automatically provisions and maintains a synchronous standby replica in a different Availability Zone (AZ). In the event of an AZ outage, Amazon RDS automatically fails over to the standby, ensuring high availability with minimal downtime. This is the core mechanism for automatic failover in RDS for MySQL.

Exam trap

The trap here is that candidates often confuse read replicas or automated backups with high availability failover, but only Multi-AZ deployment provides automatic, synchronous failover within the same region.

235

MCQmedium

A data engineer is monitoring an AWS Glue ETL job that processes data from an S3 bucket and writes to a Redshift table. The job completes successfully but takes longer than expected. The engineer notices that the job uses 10 DPUs and the data size is 500 GB. The job runs in standard mode. Which change would MOST reduce job duration?

A.Increase the number of DPUs to 20.

B.Use a smaller worker type like G.1X.

C.Change the output format from Parquet to CSV.

D.Reduce the number of partitions in the data.

AnswerA

More DPUs provide more parallelism, reducing job execution time.

Why this answer

Option A is correct because increasing DPUs allows more parallelism if the job is distributed. Option D is wrong because reducing the number of partitions may increase data skew. Option B is wrong because using a smaller worker type reduces resources.

Option C is wrong because converting to CSV increases file size and processing time.

236

MCQhard

A company uses Amazon Kinesis Data Firehose to ingest log data from web servers into Amazon S3. The data is in JSON format and each record is approximately 2 KB. The delivery stream is configured to buffer incoming records for 60 seconds or 5 MB, whichever comes first. The company notices that the data in S3 is delayed by up to 5 minutes during peak hours. Which action would most effectively reduce the delivery latency?

A.Increase the buffer size to 10 MB to allow more records per delivery.

B.Decrease the buffer interval to 15 seconds.

C.Enable compression (GZIP) on the delivery stream.

D.Enable data transformation with AWS Lambda to convert JSON to Parquet.

AnswerB

Shorter buffer interval triggers more frequent deliveries, reducing latency.

Why this answer

The observed delay of up to 5 minutes during peak hours indicates that the buffer size threshold (5 MB) is rarely reached because each record is only ~2 KB, so the delivery stream relies on the buffer interval (60 seconds) to trigger delivery. By decreasing the buffer interval to 15 seconds, Kinesis Data Firehose will push data to S3 more frequently, directly reducing the maximum latency from 60 seconds to 15 seconds per batch, which eliminates the compounding delays caused by queuing during high-throughput periods.
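
The "whichever comes first" buffering rule can be sketched as a small calculation. This is a simplified model, not Firehose's actual implementation: the function name and the traffic rates are assumptions made for the example, while the 2 KB record size, 5 MB buffer, and 60-second and 15-second intervals come from the question.

```python
def seconds_until_flush(records_per_s: float, record_kb: float,
                        buffer_mb: float, interval_s: float) -> float:
    """Time until a buffer flushes: when the size threshold fills or the
    interval elapses, whichever happens first."""
    bytes_per_s = records_per_s * record_kb * 1024
    if bytes_per_s <= 0:
        return float(interval_s)  # nothing arriving, only the timer applies
    fill_s = buffer_mb * 1024 * 1024 / bytes_per_s
    return min(fill_s, float(interval_s))


# Low traffic (hypothetical 10 records/s): the interval is the trigger.
print(seconds_until_flush(10, 2, 5, 60))
print(seconds_until_flush(10, 2, 5, 15))
# High traffic (hypothetical 1000 records/s): the size threshold is the trigger.
print(seconds_until_flush(1000, 2, 5, 60))
```

When the size threshold is rarely reached, shortening the interval is the setting that changes how often data is delivered.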

Exam trap

The trap here is that candidates assume increasing buffer size or enabling compression will speed up delivery, but they fail to recognize that with small records, the buffer interval is the bottleneck, and only reducing that interval directly lowers latency.

How to eliminate wrong answers

Option A is wrong because increasing the buffer size to 10 MB would actually increase the time needed to fill the buffer, worsening the latency issue during peak hours when records are small and the buffer interval is the primary trigger. Option C is wrong because enabling GZIP compression reduces storage size and cost but does not affect the delivery frequency or buffer flush timing, so it has no impact on latency. Option D is wrong because converting JSON to Parquet via Lambda adds processing overhead and introduces additional latency from the transformation invocation, which would increase rather than reduce delivery delay.

237

MCQmedium

A company uses Amazon Redshift for analytics. The data engineer notices that queries are slow due to many small inserts. Which technique would improve write performance?

A.Use the COPY command to load data from Amazon S3.

B.Define DISTKEY and SORTKEY on the table.

C.Increase the number of nodes in the cluster.

D.Configure workload management (WLM) queues.

AnswerA

Bulk loading is more efficient than small inserts.

Why this answer

The COPY command is the recommended way to load data into Amazon Redshift because it performs bulk inserts in parallel across all nodes, leveraging the cluster's distributed architecture. Small individual INSERT statements cause high overhead due to transaction logging and commit processing, leading to slow write performance. By loading data from Amazon S3 using COPY, you bypass these per-row overheads and achieve optimal throughput.

Exam trap

The trap here is that candidates often confuse performance tuning for reads (DISTKEY/SORTKEY) or general scaling (adding nodes) with the specific write performance bottleneck caused by many small inserts, overlooking the COPY command as the primary solution for bulk data loading.

How to eliminate wrong answers

Option B is wrong because defining DISTKEY and SORTKEY improves query read performance by optimizing data distribution and sort order, but does not directly address the write performance issue caused by many small inserts. Option C is wrong because increasing the number of nodes adds compute and storage capacity, but does not solve the fundamental problem of per-insert overhead; small inserts will still be slow on a larger cluster. Option D is wrong because configuring workload management (WLM) queues manages concurrency and prioritizes queries, but does not reduce the overhead of individual small INSERT statements.

238

MCQhard

A data engineering team uses AWS Glue ETL jobs to process data from an S3 data lake and load it into an Amazon Redshift cluster. The security policy mandates that all data in transit between AWS Glue and Redshift must be encrypted using TLS. The team uses a JDBC connection. Currently, the connection is failing with an SSL-related error. Which configuration change should the team make to ensure encrypted connectivity?

A.Modify the Redshift security group to allow inbound traffic on port 5439 from the Glue subnet.

B.Update the JDBC connection string to include ssl=true and sslmode=require.

C.Enable server-side encryption on the S3 bucket using AWS KMS.

D.Set the Redshift cluster parameter group to require_ssl=ON.

AnswerB

Ensures the JDBC driver uses SSL encryption.

Why this answer

To enforce SSL for JDBC connections to Redshift, you must add the ssl=true parameter in the connection URL, so Option B is correct. Option A is wrong because the security group controls network access, not encryption.

Option C is wrong because enabling encryption on the S3 bucket does not affect the Glue-to-Redshift connection. Option D is wrong because the cluster-side setting does not change how the client connects; the JDBC driver requires explicit ssl=true in the URL to enable it.
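
A connection URL carrying the SSL parameters might be assembled as in the sketch below. The host and database names are placeholders, the helper is hypothetical, and the parameter names and values are taken from the answer option as written; the driver documentation is the authority on which sslmode values a given driver version accepts.

```python
from urllib.parse import urlencode


def redshift_jdbc_url(host: str, database: str, port: int = 5439, **params: str) -> str:
    """Build a JDBC URL, appending any extra parameters as a query string."""
    base = f"jdbc:redshift://{host}:{port}/{database}"
    query = urlencode(params)
    return f"{base}?{query}" if query else base


# Placeholder endpoint and database; port 5439 is the one named in the question.
url = redshift_jdbc_url("example-cluster.example.com", "dev",
                        ssl="true", sslmode="require")
print(url)
```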

239

MCQmedium

A data engineering team is responsible for ingesting streaming data from a fleet of IoT devices into Amazon S3 using Kinesis Data Firehose. The data volume spikes unpredictably, and the team has configured Kinesis Data Firehose with a buffer size of 5 MB and buffer interval of 60 seconds. During spikes, the team notices that the delivery to S3 is delayed, and some records are lost due to exceeding the service limits. The team needs to ensure no data loss and reduce delivery latency. What should the team do?

A.Implement an AWS Lambda function to pre-process the data and send it to Firehose in a throttled manner.

B.Increase the buffer size to 10 MB and buffer interval to 120 seconds to allow more data accumulation before delivery.

C.Use Amazon Kinesis Data Streams as the data source for Firehose to decouple ingestion and delivery.

D.Enable S3 Transfer Acceleration on the destination bucket.

AnswerB

Larger buffers reduce the frequency of delivery calls and help handle spikes.

Why this answer

Option B is correct because increasing the buffer size and interval reduces the number of delivery calls to S3 and allows Firehose to absorb larger spikes without throttling. Option C is wrong because using Kinesis Data Streams as a source adds a buffer but does not directly fix Firehose delivery issues. Option A is wrong because Lambda cannot directly increase Firehose throughput.

Option D is wrong because S3 Transfer Acceleration improves upload speed from clients, not Firehose delivery.

240

MCQhard

A company uses AWS Glue to process data from multiple S3 buckets. The Glue job runs daily and reads data from a bucket that contains millions of small files (each < 1 MB). The job has been running for hours and is often close to the 8-hour timeout limit. Which optimization would MOST reduce the job's runtime?

A.Pre-process the data to consolidate small files into larger files before the Glue job.

B.Convert the source data from CSV to Parquet format.

C.Increase the number of DPUs allocated to the Glue job.

D.Use a larger Spark shuffle partition size.

AnswerA

Fewer, larger files reduce the overhead of opening and reading files.

Why this answer

Small files cause overhead in reading and processing. Grouping them into larger files (e.g., with a compaction step) reduces the number of files and improves performance. Option C is wrong because increasing DPUs may help but not as much as file consolidation.

Option D is wrong because shuffle partition tuning does not remove the per-file overhead of listing and opening millions of small objects. Option B is wrong because Parquet helps, but if files are small, the benefit is limited.
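
One way to picture a consolidation step is as a planning pass that groups small files into batches of roughly a target size, with each batch later written out as one larger file. The sketch below covers only that planning logic; the function name and the 128 MB target are assumptions, and it does not read or write any S3 objects.

```python
def plan_compaction(file_sizes_mb: list[float], target_mb: float = 128.0) -> list[list[int]]:
    """Group file indexes, in order, into batches whose total size stays
    at or below target_mb (a single oversized file gets its own batch)."""
    groups: list[list[int]] = []
    current: list[int] = []
    current_size = 0.0
    for index, size in enumerate(file_sizes_mb):
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 1000 hypothetical files of 0.5 MB each collapse into a handful of batches.
plan = plan_compaction([0.5] * 1000)
print(len(plan), [len(batch) for batch in plan])
```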

241

Multi-Selecteasy

A company is migrating a MySQL database to Amazon RDS for MySQL. The database is 2 TB in size and the company can only afford minimal downtime. The migration must be secure and use AWS DMS. Which TWO configuration steps are required? (Choose TWO.)

Select 2 answers

A.Create a source endpoint pointing to the on-premises MySQL database.

B.Create a target endpoint pointing to the Amazon RDS instance.

C.Configure SSL/TLS encryption for the DMS endpoints.

D.Set up VPC peering between the on-premises network and the Amazon VPC.

E.Create a DMS replication instance.

AnswersA, B

A source endpoint is needed for DMS to connect to the source.

Why this answer

AWS DMS requires a source endpoint (source database connection) and a target endpoint (RDS instance). Option A is the source endpoint, and option B is the target endpoint. The other options are not counted as required here: SSL/TLS is optional but recommended, the replication instance is treated as provisioned by DMS, and VPC peering connects VPCs to each other, not an on-premises network to a VPC.

242

MCQhard

A company runs an Amazon EMR cluster that processes sensitive data stored in Amazon S3. The security team requires that all data in transit between the EMR cluster and S3 be encrypted. Which configuration ensures this requirement is met?

A.Enable in-transit encryption within the EMR cluster using EMRFS.

B.Enable server-side encryption with S3 managed keys (SSE-S3) on the S3 bucket.

C.Configure the S3 endpoint to use TLS and ensure the EMR cluster uses HTTPS for S3 access.

D.Use an S3 access point with a bucket policy that denies HTTP requests.

AnswerC

TLS encrypts data in transit between EMR and S3.

Why this answer

Option C is correct because using TLS on the S3 endpoint and HTTPS for all S3 requests encrypts the traffic between the cluster and S3. Option B is wrong because SSE-S3 encrypts data at rest, not in transit. Option A is wrong because encryption configured within the EMR cluster does not cover traffic between the cluster and S3.

Option D is wrong because S3 access points do not enforce encryption in transit by default.

243

MCQhard

A data engineer uses AWS Glue to catalog data from an S3 bucket. The data is partitioned by year, month, day. After adding new partitions, the Glue Crawler does not detect them. What is the MOST likely reason?

A.The crawler is configured to only add new partitions to existing tables, but the table schema has changed.

B.The crawler runs only once and does not schedule subsequent runs.

C.The IAM role lacks permission to write to the Glue Data Catalog.

D.The partition depth exceeds the crawler's default limit.

AnswerA

If the schema changed, the crawler may skip partitions; or if the crawler is set to not update new partitions, it won't add them.

Why this answer

Option A is correct because the Glue Crawler, by default, is configured to add new partitions only if the table schema remains unchanged. When new partitions are added to an S3 bucket, if the underlying data schema has changed (e.g., new columns, different data types), the crawler will not add those partitions to the existing table. This is a common safeguard to prevent schema drift from corrupting the cataloged table structure.

Exam trap

The trap here is that candidates often assume partition detection failures are due to permissions or depth limits, but AWS Glue's default schema-change protection is the subtle and less obvious cause.

How to eliminate wrong answers

Option B is wrong because even if the crawler runs only once, it should still detect new partitions during that single run; the issue is about detection failure, not scheduling frequency. Option C is wrong because if the IAM role lacked permission to write to the Glue Data Catalog, the crawler would fail entirely or produce an error, not silently skip new partitions. Option D is wrong because the default partition depth limit in AWS Glue Crawlers is 10 levels, and the given path (year/month/day) is only 3 levels deep, well within the limit.

244

Multi-Selectmedium

A company is designing a data lake on Amazon S3. The data engineering team needs to implement a lifecycle policy to manage costs. Which TWO actions should be taken to reduce storage costs?

Select 2 answers

A.Transition objects to S3 Glacier Deep Archive after 90 days.

B.Transition objects to S3 One Zone-IA after 30 days.

C.Enable S3 Intelligent-Tiering.

D.Transition objects to S3 Standard-IA after 30 days.

E.Delete incomplete multipart uploads after 7 days.

AnswersA, E

Deep Archive is lowest cost for rarely accessed data.

Why this answer

Option A is correct because transitioning objects to S3 Glacier Deep Archive after 90 days significantly reduces storage costs for data that is rarely accessed and can tolerate a retrieval time of 12 hours. This lifecycle policy is a standard cost-optimization strategy for data lakes where historical or cold data does not require immediate access.

Exam trap

The trap here is that candidates often choose S3 Intelligent-Tiering or S3 Standard-IA as cost-saving measures without considering that the question specifically asks for lifecycle policy actions to reduce costs, and that Glacier Deep Archive and deleting incomplete multipart uploads are the most direct and effective actions for a data lake scenario.
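
The two lifecycle actions discussed here can be expressed as a single rule. The sketch below builds that rule as a dictionary; the rule ID and the empty prefix filter are assumptions for illustration, and the AWS call appears only as a comment so the snippet runs offline.

```python
def lifecycle_configuration() -> dict:
    """Build a lifecycle configuration with a Deep Archive transition at
    90 days and cleanup of incomplete multipart uploads after 7 days."""
    return {
        "Rules": [
            {
                "ID": "archive-and-clean-up",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # assumption: apply to every object
                "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


lifecycle = lifecycle_configuration()
print(lifecycle)

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle,
#   )
```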

245

MCQeasy

A company uses Amazon DynamoDB as its primary data store for a web application. The application experiences high latency during peak hours. The data engineer notices that the table has a large number of items with the same partition key. Which DynamoDB feature should the engineer use to improve performance?

A.Redesign the partition key to use a composite key that includes a timestamp or random suffix.

B.Enable DynamoDB Accelerator (DAX) to cache read requests.

C.Create a global table to replicate data across multiple Regions.

D.Enable auto scaling on the table to increase write capacity.

AnswerA

A well-designed partition key prevents hot spots by distributing writes evenly.

Why this answer

The high latency is caused by a hot partition, where many items share the same partition key, overwhelming a single DynamoDB partition. Redesigning the partition key to include a timestamp or random suffix distributes the workload evenly across partitions, improving throughput and reducing latency. This directly addresses the root cause of the performance issue.
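
The random-suffix idea can be sketched in a few lines. This is a minimal illustration, assuming a '#' separator and a fixed number of suffixes; both choices, and the function names, are made up for the example rather than taken from DynamoDB itself.

```python
import random


def sharded_key(base_key: str, shard_count: int, rng=random) -> str:
    """Append a random suffix so writes for one logical key spread
    across shard_count distinct partition key values."""
    return f"{base_key}#{rng.randrange(shard_count)}"


def all_shard_keys(base_key: str, shard_count: int) -> list[str]:
    """Every partition key value a reader must query to reassemble
    the items written under base_key."""
    return [f"{base_key}#{i}" for i in range(shard_count)]


rng = random.Random(0)  # seeded only so the example is repeatable
print(sharded_key("user-123", 10, rng))
print(all_shard_keys("user-123", 3))
```

The trade-off is on the read side: fetching everything for one logical key now takes one query per suffix.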

Exam trap

The trap here is that candidates often confuse caching solutions (DAX) or scaling mechanisms (auto scaling) with the need to fix the data model itself, which is the only way to resolve a hot partition caused by a skewed partition key.

How to eliminate wrong answers

Option B is wrong because DynamoDB Accelerator (DAX) caches read requests, which can reduce read latency but does not solve the underlying hot partition issue caused by skewed write or read traffic on a single partition key. Option C is wrong because creating a global table replicates data across multiple Regions for disaster recovery or low-latency global access, but it does not distribute load within a single table's partitions. Option D is wrong because enabling auto scaling increases the table's provisioned capacity, but if the workload is concentrated on one partition, the partition's throughput limit (3000 RCU or 1000 WCU) will still be exceeded, causing throttling and high latency.

246

MCQmedium

A company uses AWS Glue DataBrew to clean and prepare data for machine learning. The source data is in an S3 bucket with server-side encryption using AWS KMS (SSE-KMS). The DataBrew project is set up with an IAM role that has permissions to read from the S3 bucket and use the KMS key. When the DataBrew job runs, it fails with an error indicating that it cannot access the data. The IAM role has the following policy: { 'Version': '2012-10-17', 'Statement': [ { 'Effect': 'Allow', 'Action': ['s3:GetObject', 's3:ListBucket'], 'Resource': ['arn:aws:s3:::my-bucket', 'arn:aws:s3:::my-bucket/*'] }, { 'Effect': 'Allow', 'Action': 'kms:Decrypt', 'Resource': 'arn:aws:kms:us-east-1:123456789012:key/my-key' } ] }. What is the most likely cause of the failure?

A.The IAM role is missing s3:PutObject permission on the DataBrew output bucket.

B.The IAM role is missing s3:ListBucket permission on the source bucket.

C.The S3 bucket is in a different region than the DataBrew project.

D.The IAM role is missing kms:GenerateDataKey permission for the KMS key.

AnswerA

DataBrew writes to its own S3 bucket for job outputs; missing write permission causes failure.

Why this answer

Option A is correct because DataBrew uses a separate S3 bucket for storing intermediate outputs and recipe results. The IAM role needs s3:PutObject permission on that bucket. The error typically manifests as access denied when DataBrew tries to write.

Option D is wrong because the role already has kms:Decrypt permission, which covers reading the encrypted source data. Option C is wrong because the scenario gives no indication of a Region mismatch. Option B is wrong because the role includes s3:ListBucket.
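
A quick way to spot the gap is to list the actions the policy grants and compare them with what a write to the output location needs. The sketch below does that for the policy quoted in the question. It is a deliberate simplification: it ignores resources, wildcards, conditions, and Deny statements, so it is not a substitute for real IAM policy evaluation, and the helper name is hypothetical.

```python
# The role policy from the question, rewritten as a Python dictionary.
role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
        },
        {
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/my-key",
        },
    ],
}


def allowed_actions(policy: dict) -> set[str]:
    """Collect every action named in an Allow statement."""
    actions: set[str] = set()
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        value = statement["Action"]
        actions.update([value] if isinstance(value, str) else value)
    return actions


# Writing job output needs s3:PutObject, which the policy never grants.
missing = {"s3:PutObject"} - allowed_actions(role_policy)
print(missing)
```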

247

MCQmedium

A company uses Kinesis Data Streams to ingest real-time clickstream data. The data is processed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' errors. Which action should be taken to resolve this issue?

A.Increase the Lambda function's memory allocation.

B.Increase the number of shards in the Kinesis stream.

C.Enable enhanced fan-out for the stream.

D.Reduce the batch size in the Lambda event source mapping.

AnswerB

More shards increase throughput capacity.

Why this answer

The 'ProvisionedThroughputExceededException' error indicates that the Lambda function is reading data from the Kinesis stream faster than the stream's shard-level throughput limits allow. Each shard in a Kinesis Data Stream supports up to 2 MB/s of read throughput and up to 5 read transactions per second. Increasing the number of shards distributes the read load across more shards, raising the total available read throughput and resolving the throttling.
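
The per-shard read limit translates into a simple sizing estimate. The sketch below uses the 2 MB/s per-shard figure from the explanation and assumes the read load is spread evenly across shards, which real traffic may not be; the function name and the sample rates are hypothetical.

```python
import math

# Per-shard read throughput quoted in the explanation.
SHARD_READ_MB_PER_S = 2.0


def shards_needed(read_mb_per_s: float) -> int:
    """Smallest shard count whose combined read throughput covers the
    given rate, assuming the load is spread evenly across shards."""
    return max(1, math.ceil(read_mb_per_s / SHARD_READ_MB_PER_S))


print(shards_needed(7))    # hypothetical 7 MB/s aggregate read rate
print(shards_needed(0.5))  # a light load still needs one shard
```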

Exam trap

The trap here is that candidates confuse ProvisionedThroughputExceededException (a read-side throttling error) with write-side throttling, leading them to choose options like reducing batch size or increasing memory, which do not address the shard-level read throughput limit.

How to eliminate wrong answers

Option A is wrong because increasing Lambda memory allocation improves compute performance (CPU, network) but does not affect the read throughput limits of the Kinesis stream shards, which is the root cause of the throttling. Option C is wrong because enhanced fan-out provides dedicated 2 MB/s read throughput per consumer per shard, which reduces contention between consumers but does not increase the total read capacity of the stream; the error is from exceeding shard-level limits, not from consumer contention. Option D is wrong because reducing the batch size decreases the number of records per invocation, but the Lambda function still reads from the same shard at the same rate; the throttling is caused by exceeding the shard's read throughput, not by batch size.

248

Multi-Selecthard

A data engineer is designing a data pipeline that ingests data from multiple sources into Amazon S3, then processes it with AWS Glue and loads it into Amazon Redshift. Which THREE practices should be implemented to ensure data quality?

Select 3 answers

A.Implement data validation checks at the ingestion stage

B.Use AWS Glue DataBrew for data profiling and schema enforcement

C.Compress data files to reduce storage costs

D.Use manual sampling to check data quality periodically

E.Set up Amazon CloudWatch alarms for pipeline failures and data anomalies

AnswersA, B, E

Early validation catches errors before processing.

Why this answer

Option A is correct because data validation at ingestion catches issues early. Option B is correct because schema enforcement prevents data type mismatches. Option E is correct because monitoring with CloudWatch allows proactive detection of failures.

Option D is wrong because manual sampling is not scalable. Option C is wrong because compressing data does not ensure quality.
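
A validation check at the ingestion stage can be as simple as comparing each incoming record against an expected schema before it is accepted. The sketch below shows one minimal form of that idea; the schema, field names, and function are invented for illustration and are not part of any AWS service.

```python
def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems found in a record; empty means it passed."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return errors


# Hypothetical schema for an order event.
ORDER_SCHEMA = {"order_id": str, "quantity": int, "price": float}

good = {"order_id": "A-1", "quantity": 2, "price": 9.99}
bad = {"order_id": "A-2", "quantity": "two"}

print(validate_record(good, ORDER_SCHEMA))
print(validate_record(bad, ORDER_SCHEMA))
```

Records that fail could be routed to a quarantine location instead of flowing on to later stages.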

249

MCQeasy

A company needs to store JSON documents that are accessed by a key-value pattern. The data is 500 GB and requires single-digit millisecond latency. Which AWS database is most suitable?

A.Amazon Redshift

B.Amazon DynamoDB

C.Amazon Neptune

D.Amazon RDS for MySQL

AnswerB

DynamoDB is a NoSQL key-value and document database with low latency.

Why this answer

Option B is correct because DynamoDB supports document store and key-value access with single-digit millisecond latency. Option A is wrong because Redshift is an analytical data warehouse. Option C is wrong because Neptune is a graph database.

Option D is wrong because RDS for MySQL is relational.

Full explanation →

250

MCQmedium

A company needs to monitor and record all changes to IAM policies in their AWS account. Which AWS service should be used?

A.Amazon CloudWatch Logs

B.Amazon GuardDuty

C.AWS CloudTrail

D.AWS IAM Access Analyzer

AnswerC

CloudTrail records all API calls, including IAM policy changes.

Why this answer

Option C is correct because AWS CloudTrail records API calls, including IAM policy changes. AWS Config records resource configuration changes and can evaluate compliance rules. Option A is wrong because CloudWatch Logs does not directly record IAM changes.

Option B is wrong because GuardDuty is for threat detection. Option D is wrong because IAM Access Analyzer analyzes resource policies for external access; it does not record changes.

Full explanation →

251

Multi-Selectmedium

A company wants to use AWS CloudTrail to monitor data events for an S3 bucket. Which TWO configurations are required to capture object-level API operations?

Select 2 answers

A.Configure the CloudTrail trail to log data events for the S3 bucket.

B.Enable management events in the CloudTrail trail.

C.Enable S3 server access logs on the bucket.

D.Create a CloudTrail trail in the same AWS Region as the S3 bucket.

E.Set up an Amazon CloudWatch Events rule to capture S3 events.

AnswersA, D

Data events capture object-level operations like GetObject, PutObject.

Why this answer

Option A and Option D are correct. CloudTrail must be enabled in the S3 bucket's Region, and data events for S3 must be specified to capture object-level operations. Option B is incorrect because management events capture bucket-level operations, not object-level.

Option C is incorrect because S3 server access logs are a separate logging feature that CloudTrail does not depend on. Option E is incorrect because CloudWatch Events is not required for CloudTrail to capture events.

Full explanation →

252

Multi-Selectmedium

A company uses Amazon Kinesis Data Firehose to ingest data into an S3 bucket. The data is in JSON format and the team wants to convert it to Parquet before storage. Which TWO configurations are required?

Select 2 answers

A.Use Kinesis Data Analytics to transform data to Parquet.

B.Create a Glue table with the schema of the data.

C.Configure a Lambda function to convert data on the fly.

D.Set up an Athena table to read the data.

E.Enable data format conversion in Firehose and set Output format to Parquet.

AnswersB, E

Firehose needs a schema for Parquet conversion.

Why this answer

Options B and E are correct: Firehose can convert to Parquet if you enable data format conversion and specify a Glue table (schema). Option A (Kinesis Data Analytics) is not needed. Option D (Athena) is for querying.

Option C (Lambda) can be used but is not required.
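The two required settings can be sketched as the format-conversion section of a Firehose S3 destination. This is a minimal illustration, not a complete delivery stream definition: the helper name `parquet_conversion_config` and the sample ARN, database, and table values are hypothetical, and the field names follow the Firehose `DataFormatConversionConfiguration` structure as commonly documented.

```python
def parquet_conversion_config(role_arn: str, database: str, table: str, region: str) -> dict:
    """Build the data format conversion section for a Firehose S3 destination.

    The Glue table supplies the schema; the serializer selects Parquet output.
    """
    return {
        "Enabled": True,  # turns on record format conversion
        "SchemaConfiguration": {
            "RoleARN": role_arn,        # role Firehose assumes to read the Glue table
            "DatabaseName": database,   # Glue database holding the table
            "TableName": table,         # Glue table that defines the schema
            "Region": region,
            "VersionId": "LATEST",
        },
        # Incoming records are JSON, so a JSON deserializer is used on input.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # Output is written as Parquet.
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }


# Hypothetical values, for illustration only.
config = parquet_conversion_config(
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
    database="example_db",
    table="example_events",
    region="us-east-1",
)
```

In practice a dictionary like this would sit inside the extended S3 destination configuration when the delivery stream is created; no Lambda function is involved in the conversion itself.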

Full explanation →

253

MCQmedium

A data engineering team uses Amazon EMR to run Spark jobs on a transient cluster. The jobs read data from S3 and write results back to S3. The team notices that jobs are taking longer than expected. Which configuration change is most likely to improve performance?

A.Use a larger instance type for the master node.

B.Enable HDFS as the intermediate storage and copy data from S3 to HDFS before processing.

C.Increase the number of core nodes to improve parallelism.

D.Enable EMRFS consistent view and use S3 as the direct input source.

AnswerD

EMRFS consistent view reduces retries due to S3 eventual consistency, improving performance.

Why this answer

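Option D is correct because EMRFS consistent view guards against inconsistent S3 listings; without it, jobs can spend time on retries. Option A is wrong because the master node coordinates the cluster and does not run the bulk of the Spark tasks, so a larger master instance adds little processing capacity. Option B is wrong because copying data from S3 to HDFS adds an extra step before processing begins.

Option C is wrong because adding core nodes increases parallelism but does not address retries caused by consistency issues.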

Full explanation →

254

MCQmedium

A company runs a production Amazon Redshift cluster. The data engineering team notices that queries are running slowly during peak hours. The cluster's CPU utilization is consistently above 80%. Which action should the engineer take to improve query performance?

A.Move some tables to Amazon Redshift Spectrum.

B.Re-distribute the tables using a different distribution key.

C.Enable concurrency scaling.

D.Perform an elastic resize to add more nodes.

AnswerD

Elastic resize adds nodes and CPU capacity quickly.

Why this answer

Option D is correct because elastic resize adds nodes with minimal downtime, addressing the CPU bottleneck. Option A is incorrect because Spectrum offloads scans to S3 but does not reduce cluster CPU usage for the existing workload. Option B is incorrect because distributing data on a different key may not reduce CPU.

Option C is incorrect because concurrency scaling only helps with many concurrent queries, not CPU.

Full explanation →

255

MCQhard

A company uses AWS Glue DataBrew for data preparation. The data source is an S3 bucket with millions of small CSV files (each < 1 MB). The DataBrew project takes a long time to load the sample data. What is the most likely cause and solution?

A.Use Amazon Athena to query the data instead of DataBrew

B.The DataBrew job is under-provisioned; increase the number of DPUs

C.The large number of small files causes S3 LIST overhead; concatenate files into larger files

D.Use AWS Glue ETL instead of DataBrew for this volume

AnswerC

S3 performance degrades with many small files; combining them reduces API calls.

Why this answer

Option C is correct because DataBrew samples data by reading files, and many small files cause high overhead due to S3 LIST and GET requests. Concatenating files into fewer larger files reduces this overhead. Option B (increase DPUs) does not help with the LIST overhead.

Option D (use Glue ETL) is a different service. Option A (use Athena) is for querying, not data preparation.
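The idea of concatenating many small CSV files into fewer larger ones can be sketched locally as follows. This is a minimal illustration under stated assumptions: the function name `concatenate_csv` is hypothetical, all input files are assumed to share the same header row, and the files are assumed to be on local disk (for S3, the same merge would run after downloading or inside a processing job).

```python
import csv
from pathlib import Path
from typing import Iterable


def concatenate_csv(parts: Iterable[Path], output: Path) -> int:
    """Merge CSV files that share a header into one file.

    Writes the header once and returns the number of data rows written.
    """
    rows_written = 0
    header_written = False
    with output.open("w", newline="") as out:
        writer = csv.writer(out)
        for part in sorted(parts):
            with part.open(newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Fewer, larger objects mean fewer LIST and GET requests when a tool samples the dataset, which is the overhead the explanation refers to.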

Full explanation →

256

Multi-Selecthard

A company uses Amazon Redshift for its data warehouse and needs to enforce column-level security on sensitive columns. Which TWO approaches can achieve this?

Select 2 answers

A.Apply an S3 bucket policy to the underlying data files.

B.Create views that expose only non-sensitive columns and grant access to the views.

C.Use Redshift Spectrum to query external tables and restrict columns via the external schema.

D.Use Redshift column-level security to grant or revoke permissions on specific columns.

E.Use Redshift row-level security policies to restrict column access.

AnswersB, D

Views can limit column visibility.

Why this answer

Options B and D are correct. Redshift column-level security allows defining access controls at the column level. Views can also restrict column access.

Option A is wrong because S3 bucket policies are unrelated. Option C is wrong because Redshift Spectrum queries external data and does not provide column-level control. Option E is wrong because row-level security does not restrict columns.

Full explanation →

257

Multi-Selecteasy

Which TWO methods can be used to enforce least-privilege access to an Amazon S3 bucket? (Choose two.)

Select 2 answers

A.Use IAM policies to grant specific permissions to users and roles.

B.Set bucket ACLs to allow full control to the bucket owner only.

C.Use an S3 bucket policy that explicitly denies actions not required.

D.Configure a VPC endpoint to restrict access to the bucket.

E.Generate pre-signed URLs for all access.

AnswersA, C

IAM policies allow granular permissions.

Why this answer

Options A and C are correct. IAM policies define user permissions, and bucket policies control bucket-level access. Option D (VPC endpoint) is for network security.

Option E (pre-signed URLs) grants temporary access but not least-privilege enforcement. Option B (ACLs) is legacy and not recommended.

Full explanation →

258

Multi-Selecteasy

A data analytics company uses Amazon Athena to query data stored in an S3 bucket. The data contains personally identifiable information (PII). The security team wants to ensure that only authorized users can access the data through Athena, and that the data is encrypted at rest in S3. Which combination of actions should the company take? (Choose two.)

Select 2 answers

A.Attach an IAM policy to users that grants Athena access and S3 read access to the bucket.

B.Use AWS Lake Formation to define data lake permissions.

C.Use AWS Kinesis to stream data to Athena.

D.Create an S3 Access Point with a restricted policy.

E.Enable server-side encryption (SSE-S3) on the S3 bucket.

AnswersA, E

Controls access to Athena and underlying data.

Why this answer

Options A and E are correct. Option A restricts Athena access via IAM policies. Option E enables encryption at rest.

Option B is wrong because Lake Formation is not required for basic access control; IAM policies suffice. Option C is wrong because Athena does not support Kinesis. Option D is wrong because S3 Access Points can provide granular access but are not necessary.

Full explanation →

259

MCQmedium

A data engineering team needs to load data from an on-premises Oracle database to Amazon S3 daily. The data volume is about 50 GB per day, and the network bandwidth is 100 Mbps. The team wants to minimize operational overhead and use AWS managed services. Which solution should they choose?

A.Use AWS Database Migration Service (DMS) to migrate the data to S3.

B.Use AWS DataSync to copy the database files directly to S3.

C.Use AWS Glue with a JDBC connection and schedule a crawler to load data into S3.

D.Use Amazon Kinesis Data Firehose to stream data from Oracle to S3.

AnswerA

DMS supports ongoing replication and scheduled migrations from Oracle to S3.

Why this answer

Option A is correct because AWS Database Migration Service (DMS) can continuously replicate or schedule data migration from on-premises databases to S3 with minimal setup. Option B is wrong because AWS DataSync is for file transfers, not database tables. Option C is wrong because AWS Glue can connect to JDBC sources but requires more configuration for scheduled loads.

Option D is wrong because Amazon Kinesis Data Firehose is for streaming data, not for batch database loads.
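The figures in the question (50 GB per day over a 100 Mbps link) also allow a quick feasibility check. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second); the function name `transfer_hours` and the `utilization` parameter are illustrative assumptions, not part of the question.

```python
def transfer_hours(data_gb: float, bandwidth_mbps: float, utilization: float = 1.0) -> float:
    """Estimate hours needed to move data_gb over a link of bandwidth_mbps.

    utilization is the fraction of the link actually available (1.0 = all of it).
    """
    bits = data_gb * 8 * 1000**3                      # decimal gigabytes to bits
    seconds = bits / (bandwidth_mbps * 1_000_000 * utilization)
    return seconds / 3600


# 50 GB over 100 Mbps at full utilization: 4,000 seconds, about 1.11 hours.
print(round(transfer_hours(50, 100), 2))
```

At roughly an hour per day even before compression, the daily load fits comfortably in the available bandwidth, so the choice between services comes down to operational overhead rather than transfer capacity.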

Full explanation →

260

MCQmedium

A company wants to centrally manage encryption keys for multiple AWS services and automatically rotate them every year. Which AWS service should be used?

A.AWS CloudHSM

B.AWS Certificate Manager (ACM)

C.AWS Secrets Manager

D.AWS Key Management Service (KMS)

AnswerD

KMS can automatically rotate customer managed keys yearly.

Why this answer

Option D is correct. AWS KMS supports automatic annual rotation of customer managed keys. Option A is wrong because CloudHSM is for hardware-based keys but does not provide automatic rotation.

Option B is wrong because ACM is for certificates. Option C is wrong because Secrets Manager is for secrets, not encryption keys.

Full explanation →

261

MCQmedium

A team uses Amazon Redshift for analytics. They notice that some queries are slow and the system shows high disk usage. The team wants to improve query performance without adding more nodes. Which action should they take first?

A.Run the VACUUM and ANALYZE commands on the tables.

B.Enable compression on all columns.

C.Redistribute the tables by changing the distribution key to a column with high cardinality.

D.Modify the workload management (WLM) queue to increase concurrency.

AnswerA

VACUUM reclaims space, ANALYZE updates statistics.

Why this answer

Option A is correct because VACUUM and ANALYZE reclaim space and update statistics, which can significantly improve query performance. Option B is wrong as a first step because changing compression encodings means rewriting existing column data. Option C is wrong as a first step because changing the distribution key redistributes the entire table.

Option D is wrong because WLM queues affect concurrency, not disk usage.

Full explanation →

262

Multi-Selectmedium

Which TWO actions can help improve query performance in Amazon Redshift? (Choose two.)

Select 2 answers

A.Use appropriate sort keys for tables.

B.Disable SSL encryption for connections.

C.Use VARCHAR instead of CHAR for fixed-length strings.

D.Apply compression encodings to columns.

E.Increase the number of nodes in the cluster.

AnswersA, D

Sort keys help the query optimizer scan less data.

Why this answer

Option A is correct because defining appropriate sort keys in Amazon Redshift enables the query optimizer to use zone maps to skip irrelevant data blocks during table scans, significantly reducing the amount of data read from disk. Sort keys also improve the effectiveness of merge joins and the performance of range-restricted queries by physically co-locating rows with similar sort key values on disk.

Exam trap

The trap here is that candidates often assume scaling out (adding nodes) always speeds up individual queries, but in Redshift, query performance is more dependent on data layout (sort keys, distribution, compression) than on cluster size, and adding nodes primarily benefits concurrent workloads rather than single-query latency.

Full explanation →

263

MCQhard

A data engineering team is designing a data lake on Amazon S3. They need to store raw data in its original format and transformed data in Parquet. The data is accessed by multiple analytics services, including Amazon Athena and Amazon Redshift Spectrum. Compliance requirements mandate that all data be encrypted at rest with AWS KMS and that the encryption keys be rotated every 90 days. Which S3 bucket configuration meets these requirements?

A.Use SSE-KMS with a customer-managed KMS key that has automatic key rotation enabled.

B.Use SSE-C with client-managed keys and rotate them manually.

C.Use a bucket policy to enforce encryption and rely on default S3 encryption.

D.Use SSE-S3 with default encryption enabled.

AnswerA

SSE-KMS with automatic rotation meets compliance requirements.

Why this answer

Option A is correct because SSE-KMS with a customer-managed KMS key allows you to enable automatic key rotation, which meets the 90-day rotation requirement. By default, AWS KMS rotates customer-managed keys annually, but you can set a custom rotation period between 90 and 2,560 days when you enable rotation. This setup ensures raw and Parquet data are encrypted at rest, and the key rotation satisfies compliance mandates.

Exam trap

The trap here is that candidates assume SSE-S3 or default encryption meets the rotation requirement because AWS rotates keys automatically, but they overlook the need for a configurable 90-day rotation period, which only SSE-KMS with a customer-managed key can support.

How to eliminate wrong answers

Option B is wrong because SSE-C requires you to manage and rotate encryption keys client-side, which adds operational overhead and does not integrate with AWS KMS for automated rotation; manual rotation every 90 days is possible but not automated, and it violates the requirement to use AWS KMS. Option C is wrong because relying on default S3 encryption (SSE-S3) uses S3-managed keys that cannot be rotated on a 90-day schedule; AWS rotates SSE-S3 keys on its own schedule, and you have no control over the rotation frequency. Option D is wrong because SSE-S3 does not support customer-controlled key rotation; it uses S3-managed keys with automatic rotation by AWS, but the rotation period is not configurable and does not meet the 90-day requirement.

Full explanation →

264

MCQeasy

Refer to the exhibit. A data engineer creates an external table in AWS Glue Data Catalog pointing to an S3 bucket that contains encrypted objects (SSE-S3). The CREATE TABLE statement fails with an error. What change should be made to fix the error?

A.Change the SERDE to 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'.

B.Add 'aws_iam_role' as a table property.

C.Include the KMS key ARN in the LOCATION.

D.Set 'has_encrypted_data' to 'true'.

AnswerD

The property tells the catalog that data is encrypted.

Why this answer

If the data is encrypted with SSE-S3, the table property 'has_encrypted_data' should be set to 'true'. Option D is correct. Option A is unnecessary.

Option B is not the cause. Option C is not required for SSE-S3.

Full explanation →

265

Multi-Selecthard

A company is using AWS KMS to encrypt data in Amazon S3. The security team wants to ensure that only specific IAM roles can decrypt the data. Which TWO steps should the data engineer take? (Choose two.)

Select 2 answers

A.Use the default AWS managed KMS key for S3 (aws/s3)

B.Use SSE-S3 encryption instead of KMS

C.Create a customer-managed KMS key with a key policy that grants kms:Decrypt only to the allowed IAM roles

D.Add an IAM policy to the role that requires MFA for kms:Decrypt

E.Configure the S3 bucket to use SSE-KMS with the customer-managed key

AnswersC, E

This restricts decryption to the specified roles.

Why this answer

To restrict decryption to specific roles, you need to create a customer-managed key, configure the key policy to allow only those roles to use kms:Decrypt, and use SSE-KMS when writing objects. Using the default AWS managed key (aws/s3) does not allow custom key policies. SSE-S3 uses S3-managed keys, so access cannot be restricted through a KMS key policy.

Requiring MFA for API calls does not restrict decryption to specific roles.
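A key policy of the kind described here might look like the following sketch. It is illustrative only: the helper name `decrypt_only_key_policy`, the account ID, and the role ARNs are hypothetical, and the administrative statement lists only a subset of the management actions a real key administrator would usually need.

```python
import json
from typing import List


def decrypt_only_key_policy(admin_role_arn: str, decrypt_role_arns: List[str]) -> dict:
    """Build a KMS key policy that lets only the listed roles call kms:Decrypt."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Key administration only; no cryptographic operations.
                "Sid": "AllowKeyAdministration",
                "Effect": "Allow",
                "Principal": {"AWS": admin_role_arn},
                "Action": ["kms:Describe*", "kms:Put*", "kms:Enable*", "kms:Disable*"],
                "Resource": "*",
            },
            {
                # Decryption is granted to the named roles and nobody else.
                "Sid": "AllowDecryptForSpecificRoles",
                "Effect": "Allow",
                "Principal": {"AWS": decrypt_role_arns},
                "Action": "kms:Decrypt",
                "Resource": "*",
            },
        ],
    }


# Hypothetical ARNs, for illustration only.
policy = decrypt_only_key_policy(
    admin_role_arn="arn:aws:iam::123456789012:role/example-key-admin",
    decrypt_role_arns=["arn:aws:iam::123456789012:role/example-analytics-role"],
)
print(json.dumps(policy, indent=2))
```

Because the policy contains no statement that delegates to the account root, IAM policies alone cannot grant kms:Decrypt on this key; only principals named in the key policy can use it.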

Full explanation →

266

Multi-Selecthard

A company ingests IoT sensor data into Amazon Kinesis Data Streams. The data must be enriched with device metadata from Amazon DynamoDB and then stored in Amazon S3 in Apache Parquet format. The solution must minimize latency and cost. Which THREE steps should a data engineer implement? (Choose three.)

Select 3 answers

A.Deliver the enriched data to Amazon Kinesis Data Firehose and enable Parquet conversion.

B.Configure an AWS Lambda function to read from the stream, enrich, and write to S3.

C.Use AWS Glue streaming ETL to enrich and convert data to Parquet.

D.Use Amazon EMR with Spark Streaming to process and store the data.

E.Perform a DynamoDB lookup in the Flink application for each record.

F.Use Amazon Kinesis Data Analytics for Apache Flink to enrich the stream with data from DynamoDB.

AnswersA, E, F

Kinesis Data Firehose can convert incoming data to Parquet and write to S3.

Why this answer

Option F: Using Kinesis Data Analytics for Flink allows stream enrichment with low latency. Option A: Kinesis Data Firehose can buffer data and convert it to Parquet before delivering to S3. Option E: The Flink application can look up DynamoDB data for enrichment.

Option B (Lambda) has invocation limits and higher cost per record. Option C (Glue) adds latency and cost for streaming. Option D (EMR) requires cluster management and is not as seamless.

Full explanation →

267

MCQmedium

A company wants to monitor and alert on unauthorized API calls in their AWS account. Which AWS service should be used to detect and notify on such events?

A.Amazon GuardDuty and AWS Security Hub

B.Amazon VPC Flow Logs and Amazon CloudWatch Logs

C.AWS Config and AWS Systems Manager

D.AWS CloudTrail and Amazon CloudWatch Events

AnswerD

CloudTrail logs API calls, and CloudWatch Events can trigger alerts on specific events.

Why this answer

Option D is correct because AWS CloudTrail logs API calls, and Amazon CloudWatch Events (or EventBridge) can be used to create rules that trigger notifications on specific API calls. Option C is wrong because AWS Config monitors resource configuration, not API calls. Option A is wrong because Amazon GuardDuty is a threat detection service that can detect API call anomalies but is not primarily designed for monitoring all unauthorized API calls.

Option B is wrong because VPC Flow Logs monitor network traffic.

Full explanation →

268

MCQhard

A data engineer runs the describe-stream command and sees the output above. The stream has a retention period of 24 hours. The engineer needs to ensure that consumers can replay data for up to 7 days. Which action is required?

A.Increase the number of shards to allow more data storage.

B.Delete the stream and recreate it with a longer retention period.

C.Use the IncreaseStreamRetentionPeriod API to set retention to 168 hours.

D.Create new consumer applications that read from the stream.

AnswerC

The API can increase retention up to 365 days.

Why this answer

Option C is correct because retention is increased by calling IncreaseStreamRetentionPeriod, and 7 days equals 168 hours, well within the 365-day maximum. Option A is wrong because changing shard count does not affect retention. Option B is wrong because retention can be changed on an existing stream, so deleting and recreating it is unnecessary.

Option D is wrong because starting new consumers does not change retention.

Full explanation →

269

MCQhard

A company uses Amazon Kinesis Data Streams with a shard count of 10 to ingest clickstream data. The data is consumed by a Lambda function that transforms the records and writes to Amazon S3. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. The average record size is 5 KB, and the incoming data rate is 15 MB/s. What is the most likely cause and solution?

A.Increase the number of shards in the Kinesis data stream to 15.

B.Decrease the batch size of the Lambda event source mapping.

C.Increase the Lambda function's memory allocation to 3008 MB.

D.Increase the Lambda function's reserved concurrency.

AnswerA

Each shard provides 1 MB/s write capacity; 15 shards would support 15 MB/s.

Why this answer

Option A is correct because with 10 shards, the total write capacity is 10 MB/s (1 MB/s per shard). The incoming rate is 15 MB/s, exceeding capacity. Increasing shards to 15 would provide 15 MB/s write capacity.

Option D is wrong because Lambda concurrency is not the issue; the error is about throughput exceeded. Option C is wrong because increasing Lambda memory does not affect Kinesis write limits. Option B is wrong because changing the batch size does not raise the stream's throughput limits.
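The capacity arithmetic in this explanation can be sketched as a small calculation. It assumes the per-shard write limits of 1 MB/s and 1,000 records per second; the function name `required_shards` is hypothetical, and 1 MB is treated as 1,024 KB when converting the data rate to a record rate.

```python
import math


def required_shards(ingest_mb_per_s: float, avg_record_kb: float) -> int:
    """Return the minimum shard count that satisfies both per-shard write limits."""
    records_per_s = ingest_mb_per_s * 1024 / avg_record_kb
    by_bytes = math.ceil(ingest_mb_per_s / 1.0)      # 1 MB/s per shard
    by_records = math.ceil(records_per_s / 1000)     # 1,000 records/s per shard
    return max(by_bytes, by_records)


# 15 MB/s of 5 KB records: bytes need 15 shards, record count needs only 4.
print(required_shards(15, 5))
```

For this workload the byte limit dominates, which is why 15 shards is the figure to reach; with much smaller records the record-count limit could become the binding constraint instead.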

Full explanation →

270

MCQeasy

A data engineer needs to ingest streaming data from a social media API into Amazon S3 for batch analytics. The data arrives at a rate of 500 records per second. Which service should be used to capture the stream?

A.Amazon Simple Notification Service (SNS)

B.Amazon Simple Queue Service (SQS)

C.Amazon Kinesis Data Streams

D.Amazon MQ

AnswerC

Kinesis Data Streams is designed for real-time streaming data ingestion.

Why this answer

Option C is correct because Kinesis Data Streams can capture high-throughput streaming data. Option A is wrong because SNS is for pub/sub notifications. Option B is wrong because SQS is for message queues, not for streaming analytics.

Option D is wrong because Amazon MQ is a managed message broker service.

Full explanation →

271

Multi-Selecthard

A data engineer needs to set up a data ingestion pipeline that reads from Amazon MSK (Managed Streaming for Kafka) and writes to Amazon S3 with transformations. The data is in Avro format and must be converted to Parquet. Which THREE components should be used together? (Choose THREE.)

Select 3 answers

A.AWS Lambda function to convert Avro to Parquet as a Firehose transformation

B.Amazon Athena to convert the data format

C.Amazon Kinesis Data Firehose delivery stream with MSK as source

D.Amazon MSK cluster as the data source

E.AWS Glue ETL job to read from MSK

AnswersA, C, D

Lambda can be used in Firehose to perform data transformation.

Why this answer

Options A, C, and D are correct. D: MSK is the source. C: Kinesis Data Firehose can consume from MSK and deliver to S3, with Lambda for transformation.

A: Lambda can be used as a transformation function within Firehose to convert Avro to Parquet. E: Glue is not directly integrated with MSK as a Firehose source. B: Athena is for querying, not part of the ingestion pipeline.

Full explanation →

272

MCQmedium

A company is streaming IoT data from thousands of devices into Amazon Kinesis Data Streams. The data must be transformed in real time before being stored in Amazon S3. Which service should be used to perform the transformation as the data streams through Kinesis?

A.AWS Glue

B.Amazon Kinesis Data Analytics for Apache Flink

C.Amazon EMR

D.AWS Lambda

AnswerB

Correctly processes streaming data in real time with Flink.

Why this answer

Option B is correct because Amazon Kinesis Data Analytics for Apache Flink can process streaming data in real time using Flink, which is ideal for transformations. Option D (Lambda) can process records but is better for lightweight transformations and may not handle complex stateful operations. Option A (Glue) is batch-oriented, not real-time.

Option C (EMR) is for big data processing but adds latency.

Full explanation →

273

Multi-Selecteasy

Which TWO AWS services can be used as sources for AWS Glue ETL jobs? (Choose two.)

Select 2 answers

A.Amazon Route 53

B.Amazon CloudFront

C.Amazon API Gateway

D.Amazon S3

E.Amazon RDS

AnswersD, E

S3 is a common source for Glue jobs.

Why this answer

Glue can read from S3 and JDBC sources like RDS and Redshift. Option C (API Gateway) is not a data store; Option B (CloudFront) is a CDN; Option A (Route 53) is DNS.

Full explanation →

274

Multi-Selectmedium

A company is using Amazon Redshift for data warehousing. They need to ensure that data is encrypted at rest and in transit. Which TWO configurations are required to meet these requirements?

Select 2 answers

A.Enable encryption on the Redshift cluster using AWS KMS.

B.Configure the Redshift cluster to require SSL connections.

C.Use AWS CloudHSM to manage encryption keys for Redshift.

D.Enable VPC Flow Logs on the Redshift subnet.

E.Enable EBS encryption on the Redshift cluster nodes.

AnswersA, B

KMS encrypts data at rest.

Why this answer

Options A and B are correct. To encrypt data at rest, enable encryption on the Redshift cluster using KMS (A). To encrypt data in transit, configure the cluster to require SSL (B).

Option C is incorrect because CloudHSM is not required; KMS-based cluster encryption already covers encryption at rest. Option D is incorrect because VPC Flow Logs do not encrypt. Option E is incorrect because EBS encryption is not directly applicable to Redshift cluster storage.

Full explanation →

275

Multi-Selecthard

A company runs an Amazon EMR cluster processing data from S3. The data engineer notices that the cluster's task nodes are underutilized while core nodes are fully utilized. Which TWO steps should the engineer take to improve resource utilization?

Select 2 answers

A.Consolidate multiple small tasks into larger tasks.

B.Increase the number of core nodes.

C.Add more task nodes using Spot Instances.

D.Reduce the number of core nodes and increase the number of task nodes.

E.Move HDFS data from EBS to instance store volumes.

AnswersB, C

More core nodes distribute processing load.

Why this answer

Option B is correct because increasing the number of core nodes adds more capacity for processing. Option C is correct because adding task nodes on Spot Instances can offload work from core nodes. Option E is incorrect because instance store is temporary and not suitable for HDFS.

Option A is incorrect because consolidating tasks may not help. Option D is incorrect because reducing core nodes would worsen utilization.

Full explanation →

276

MCQhard

A data engineer is troubleshooting an AWS Glue ETL job that fails with an access denied error when writing to an S3 bucket. The Glue job uses an IAM role, and the target bucket has a bucket policy attached. The bucket policy denies any request that does not use server-side encryption. What is the most likely cause of the failure?

A.The VPC endpoint policy for S3 is too restrictive.

B.The IAM role does not have s3:PutObject permission.

C.The Glue job is not using server-side encryption when writing to S3.

D.The S3 bucket uses S3 Block Public Access which denies all writes.

AnswerC

The bucket policy denies requests without encryption, causing access denied even if the role has PutObject permission.

Why this answer

Option C is correct because if the Glue job does not set the encryption header, the bucket policy will deny the request; with SSE-KMS, the role also needs the kms:GenerateDataKey permission on the key. Option A is wrong because nothing in the scenario points to a VPC endpoint policy; the stated bucket policy condition is about encryption. Option B is wrong because the bucket policy denies unencrypted requests even when the role has s3:PutObject permission.

Option D is wrong because S3 Block Public Access does not deny write access to authorized roles.
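A bucket policy of the kind described in this question might look like the sketch below. The helper name `require_sse_bucket_policy` and the bucket name are hypothetical; the `Null` condition on `s3:x-amz-server-side-encryption` is the commonly used way to reject uploads that omit the encryption header.

```python
import json


def require_sse_bucket_policy(bucket: str) -> dict:
    """Build a bucket policy that denies PutObject requests lacking an SSE header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # True when the request carries no server-side encryption header.
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
print(json.dumps(require_sse_bucket_policy("example-glue-output"), indent=2))
```

An explicit Deny like this overrides any Allow in the role's IAM policy, which is why a writer with s3:PutObject permission still receives an access denied error until it sends the encryption header.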

Full explanation →

277

MCQeasy

A company is using AWS Glue to catalog data stored in Amazon S3. The data is partitioned by year, month, and day. A data analyst reports that new partitions are not automatically discovered by the Glue crawler. The crawler runs on a schedule every hour. What is the MOST likely reason for the missing partitions?

A.The IAM role used by the crawler does not have permission to list the S3 bucket.

B.The Glue Data Catalog is not configured to use a Hive metastore.

C.The number of partitions exceeds the Glue catalog limit of 100,000.

D.The crawler schedule is set to run too frequently.

AnswerA

Without s3:ListBucket permission, the crawler cannot see new partitions.

Why this answer

Option A is correct because if the crawler's IAM role lacks s3:ListBucket permission, or the S3 bucket policy denies the role, the crawler cannot list objects and discover partitions. Option C is incorrect because the crawler can handle many partitions. Option D is incorrect because the schedule is hourly, which should be sufficient.

Option B is incorrect because the crawler can discover partitions in S3 without a Hive metastore connection.

Full explanation →

278

MCQmedium

Refer to the exhibit. A data engineer is running an AWS Glue job that reads data from an S3 source. The job fails with the error shown. What is the MOST likely cause?

A.The IAM role does not have s3:GetObject permission.

B.One of the source files is empty or corrupted.

C.The file is in JSON format but the schema expects Parquet.

D.The Glue job has insufficient memory allocated.

AnswerB

Empty file can return None when read, causing 'NoneType' has no attribute 'read'.

Why this answer

Option B is correct because an empty or corrupted file can return None when read, which produces the 'NoneType' read error. Option A is wrong because missing permissions would cause an access denied error, not a read attribute error. Option C is wrong because a wrong file format would cause a parsing error, not a NoneType error.

Option D is wrong because insufficient memory would cause an out-of-memory (OOM) error.

Full explanation →

279

Matchingmedium

Match each AWS service to its primary purpose in data engineering.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Serverless ETL and data catalog

Data warehousing and SQL analytics

Big data processing using Hadoop/Spark

Building and managing data lakes

Real-time streaming data ingestion

Why these pairings

These are core AWS services for data engineering workloads.

Full explanation →

280

MCQhard

A company runs a real-time analytics platform on Amazon ECS that ingests streaming data from Amazon Kinesis Data Streams, processes it, and stores results in Amazon DynamoDB. The data volume spikes unpredictably, causing DynamoDB to throttle write requests. The application uses on-demand capacity mode. The data engineer notices that the throttling occurs on a specific partition due to a hot key. The hot key is a customer ID that receives a disproportionate number of writes. The application cannot change the partition key design immediately. The engineer needs to reduce throttling while maintaining low latency. Which solution is most effective?

A.Switch to provisioned capacity with auto scaling and increase the write capacity units.

B.Implement a write buffer using Amazon SQS, and have consumers write to DynamoDB at a controlled rate.

C.Enable DynamoDB Accelerator (DAX) to cache the hot key writes.

D.Use DynamoDB Streams to trigger a Lambda function that retries throttled writes.

AnswerB

SQS decouples the producers from the writes, allowing batch processing and reducing throttling.

Why this answer

Option B is correct because buffering writes through Amazon SQS decouples the ingestion rate from DynamoDB's capacity, allowing consumers to write at a controlled pace. This directly mitigates throttling on the hot key without requiring a partition key redesign, and SQS provides low-latency, durable buffering suitable for real-time analytics.

Exam trap

The trap here is that candidates often assume on-demand capacity eliminates all throttling, but it does not protect against hot key skew; they may also confuse DAX's read caching with write buffering, or think retrying throttled writes is a viable solution rather than a reactive fix that increases latency.

How to eliminate wrong answers

Option A is wrong because switching to provisioned capacity with auto scaling does not solve the hot key issue; throttling occurs on a specific partition regardless of total capacity, and increasing write capacity units would not prevent a single partition from exceeding its 1,000 WCU limit. Option C is wrong because DAX is a caching layer for reads, not writes; it cannot buffer or absorb write throttling on a hot key. Option D is wrong because using DynamoDB Streams to retry throttled writes introduces latency and does not prevent throttling; it only retries failed writes, which can lead to backlog and increased latency, not a controlled rate.

Full explanation →

281

MCQhard

A company is using an Amazon RDS for PostgreSQL database to store personally identifiable information (PII). The security team wants to ensure that database administrators cannot view the plaintext PII data. Which solution should a data engineer implement?

A.Use IAM policies to restrict DBA access to the RDS instance

B.Enable Dynamic Data Masking in RDS to obfuscate PII for all users

C.Enable encryption at rest for the RDS instance using AWS KMS

D.Use client-side encryption with AWS KMS to encrypt PII before inserting into the database

AnswerD

Client-side encryption ensures data is encrypted before reaching the database, so DBAs cannot see the plaintext.

Why this answer

Using AWS KMS with client-side encryption ensures that data is encrypted before being sent to RDS, so database administrators cannot read the plaintext. Dynamic data masking in RDS is not natively supported; application-level masking would be needed. RDS encryption at rest protects data on disk but DBAs with access can still query plaintext.

Using IAM policies to restrict access does not prevent DBAs with database credentials from viewing data.

Full explanation →

282

Multi-Selectmedium

Which TWO services can be used to ingest streaming data into Amazon S3? (Choose two.)

Select 2 answers

A.Amazon Athena

B.AWS Glue

C.Amazon Kinesis Data Streams

D.AWS Database Migration Service (DMS)

E.Amazon Kinesis Data Firehose

AnswersC, E

Data Streams can be consumed and written to S3 via a consumer application.

Why this answer

Amazon Kinesis Data Streams is a real-time streaming service that can ingest and store streaming data, which can then be consumed and written to Amazon S3 using Kinesis Data Analytics or a custom consumer application. Amazon Kinesis Data Firehose is a fully managed service that can directly load streaming data into Amazon S3, Amazon Redshift, or Amazon OpenSearch Service (formerly Amazon Elasticsearch Service), with optional data transformation and compression.

Exam trap

The trap here is that candidates often confuse AWS Glue's ETL capabilities with real-time streaming ingestion, or mistakenly think Amazon Athena can ingest data because it queries S3, but neither service is designed for streaming data ingestion.

Full explanation →

283

MCQhard

A data engineer is designing a data lake on Amazon S3. The data is partitioned by year, month, day, and hour. The engineer needs to ensure that queries using Amazon Athena are cost-effective and performant. The data is written in Parquet format, and the total volume is 50 TB. Which approach minimizes query costs?

A.Use AWS Glue Data Catalog to catalog the data

B.Convert data to CSV format

C.Partition the data by year, month, day, and hour

D.Use S3 Intelligent-Tiering storage class

AnswerC

Partitioning allows Athena to scan only relevant partitions, reducing cost.

Why this answer

Option C is correct because partitioning by year, month, day, and hour allows Athena to use partition pruning, reading only the relevant S3 prefixes instead of scanning the entire 50 TB dataset. This drastically reduces the amount of data scanned per query, which directly lowers query costs (Athena charges per TB scanned). The existing Parquet format further optimizes performance through columnar storage and compression.

Exam trap

AWS often tests the misconception that simply cataloging data (Option A) or using a storage tier (Option D) directly improves query performance, when in fact only partitioning and efficient file formats reduce the data scanned by Athena.

How to eliminate wrong answers

Option A is wrong because using AWS Glue Data Catalog to catalog the data is a prerequisite for Athena to query the data, but it does not by itself reduce query costs or improve performance; it only provides schema and partition metadata. Option B is wrong because converting data to CSV format would increase the amount of data scanned (CSV is not columnar and lacks compression compared to Parquet), leading to higher query costs and slower performance. Option D is wrong because S3 Intelligent-Tiering is a storage class that optimizes storage costs based on access patterns, but it has no impact on Athena query costs or performance, which depend on data format and partitioning, not storage tier.
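The cost argument can be made concrete with a rough sketch of how partition pruning changes the amount of data Athena scans. It assumes the 50 TB is spread evenly across hourly partitions covering one year, and it uses an illustrative price of 5 USD per TB scanned; the function name, the retention period, and the price are assumptions for illustration, not values from the question.

```python
def estimated_scan_cost_usd(total_tb, total_hours, hours_queried, usd_per_tb=5.0):
    """Estimate query cost when only `hours_queried` hourly partitions are read.

    Assumes data is evenly distributed across partitions (an assumption).
    """
    if not 0 <= hours_queried <= total_hours:
        raise ValueError("hours_queried must be between 0 and total_hours")
    scanned_tb = total_tb * hours_queried / total_hours
    return scanned_tb * usd_per_tb


TOTAL_HOURS = 365 * 24  # assumed: one year of hourly partitions

# Full scan of the dataset (no partition filter in the WHERE clause).
full_scan = estimated_scan_cost_usd(50, TOTAL_HOURS, TOTAL_HOURS)

# Query filtered to a single day (24 hourly partitions).
one_day = estimated_scan_cost_usd(50, TOTAL_HOURS, 24)

print(f"full scan: {full_scan:.2f} USD, one day: {one_day:.2f} USD")
```

Under these assumptions a query filtered to one day scans roughly 1/365 of the data, which is why the partition filter matters more than the catalog or the storage class.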

Full explanation →

284

MCQhard

A company runs a nightly Amazon EMR job that processes data from S3 and writes results back to S3. The job fails with 'OutOfMemoryError' in the reduce phase. The cluster currently uses 5 m5.xlarge instances. Which cost-effective change should the data engineer make?

A.Increase the number of core nodes to 10.

B.Increase the number of reducers (mapreduce.job.reduces) and keep the same instance type.

C.Reduce the input data size by filtering early in the job.

D.Switch to r5.xlarge instances for more memory per instance.

AnswerB

More reducers reduce memory per reducer, preventing OOM.

Why this answer

Option B is correct because increasing the number of reducers spreads the reduce-phase data across more tasks, lowering the memory each reducer needs, while the existing m5.xlarge instances remain cost-effective. Option D is wrong because r5 instances are memory-optimized but more expensive. Option A is wrong because increasing the instance count adds cost and may not help if per-reducer memory is the issue.

Option C is wrong because reducing the input size may affect the completeness of the results.
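As a rough illustration of why more reducers lower the memory pressure on each one, the sketch below estimates how many reducers keep each task under a chosen data budget. The shuffle size and per-reducer budget are hypothetical numbers, and the calculation assumes keys are evenly distributed, which skewed data would violate.

```python
import math


def reducers_needed(shuffle_gb: float, max_gb_per_reducer: float) -> int:
    """Return the reducer count that keeps each reducer at or below the budget.

    Assumes an even key distribution (an assumption; skew breaks this).
    """
    if shuffle_gb < 0 or max_gb_per_reducer <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(shuffle_gb / max_gb_per_reducer))


# Hypothetical job: 120 GB of shuffled data, 1.5 GB budget per reducer.
reducers = reducers_needed(shuffle_gb=120, max_gb_per_reducer=1.5)
print(reducers)
```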

Full explanation →

285

MCQmedium

A financial services company uses Amazon Redshift for its data warehouse. The cluster has two nodes and is used for complex analytical queries. The company recently migrated from a single-node cluster to a two-node cluster to improve performance. After the migration, the data engineer notices that query performance has not improved as expected. Some queries are even slower than before. The engineer checks the workload management (WLM) queue configuration and sees that there is only one queue with a concurrency level of 5. The queries are mostly large scans and aggregations. The cluster's CPU utilization is low, but disk I/O is high. What should the data engineer do to improve query performance?

A.Apply compression to the tables to reduce the amount of data scanned.

B.Increase the concurrency level in the WLM queue to allow more queries to run simultaneously.

C.Add more nodes or upgrade to a larger node type to increase memory and reduce disk spills.

D.Change the distribution style of large tables to DISTSTYLE ALL to avoid data redistribution.

AnswerC

More memory reduces disk I/O by allowing intermediate results to stay in memory.

Why this answer

Option C is correct because high disk I/O and low CPU utilization indicate that the cluster is spilling to disk due to insufficient memory. Increasing the number of nodes or upgrading to a larger node type (e.g., dc2.large to dc2.8xlarge) increases memory. Option B is wrong because increasing concurrency would increase contention.

Option D is wrong because distribution style is unlikely to be the main issue. Option A is wrong because manual compression is not needed; Redshift automatically applies compression.

Full explanation →

286

MCQeasy

A data engineer is troubleshooting a failed Amazon Kinesis Data Firehose delivery stream. The stream is configured to deliver data to an Amazon S3 bucket. The error log shows: 'The destination S3 bucket's bucket policy does not allow the firehose to put objects.' What is the MOST likely issue?

A.The S3 bucket's ACL is configured to deny write access to the firehose.

B.The IAM role used by Firehose does not have the necessary permissions.

C.The S3 bucket policy does not include an Allow statement for the firehose to put objects.

D.The IAM role's trust policy does not allow Firehose to assume the role.

AnswerC

The bucket policy must explicitly grant s3:PutObject to the firehose's IAM role.

Why this answer

Option C is correct because the error states the bucket policy does not allow the firehose to put objects. The solution is to add an Allow statement in the bucket policy granting the firehose's IAM role permission to execute s3:PutObject. Option A is incorrect because the error is about bucket policy, not ACLs.

Option B is incorrect because the error is already about permissions. Option D is incorrect because the issue is at the S3 bucket policy level, not IAM role trust policy.
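A minimal sketch of the kind of Allow statement described above, built as a Python dictionary and printed as JSON. The bucket name, account ID, and role name are placeholders, and a real delivery stream usually needs additional S3 actions (for example listing the bucket and handling multipart uploads), so treat this as an illustration of the statement's shape rather than a complete policy.

```python
import json


def firehose_put_statement(bucket_name: str, firehose_role_arn: str) -> dict:
    """Build an Allow statement granting s3:PutObject to the Firehose role."""
    return {
        "Sid": "AllowFirehosePutObject",
        "Effect": "Allow",
        "Principal": {"AWS": firehose_role_arn},
        "Action": "s3:PutObject",
        "Resource": f"arn:aws:s3:::{bucket_name}/*",
    }


# Placeholder names; substitute the real bucket and role.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        firehose_put_statement(
            "example-delivery-bucket",
            "arn:aws:iam::111122223333:role/example-firehose-role",
        )
    ],
}

print(json.dumps(policy, indent=2))
```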

Full explanation →

287

MCQhard

An IAM policy is attached to a role assumed by authenticated users via Amazon Cognito. What does this policy allow?

A.Users can read and write items in the Orders table where the partition key matches their Cognito identity ID.

B.Users can read any item in the Orders table using GetItem and Query.

C.Users can scan the entire Orders table but only if they use a filter expression.

D.Users can read items in the Orders table only if the partition key matches their Cognito identity ID.

AnswerD

The LeadingKeys condition restricts based on the partition key equal to the Cognito sub.

Why this answer

Option D is correct because the policy uses the 'dynamodb:LeadingKeys' condition to restrict access to items where the partition key equals the Cognito identity ID (sub). This provides fine-grained access control. Option B is wrong because that condition limits reads to the caller's own items, not any item in the table.

Option C is wrong because the policy allows only GetItem and Query, not Scan. Option A is wrong because the policy does not allow writes.
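The policy exhibit is not reproduced on this page, so the following is a hypothetical reconstruction of what a policy matching the explanation could look like, expressed as a Python dictionary. The region, account ID, and exact action list are assumptions; the 'dynamodb:LeadingKeys' condition with the Cognito identity variable follows the pattern the explanation describes.

```python
import json

# Hypothetical policy consistent with the explanation above (not the actual exhibit).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # Read-only actions: no PutItem/UpdateItem/DeleteItem and no Scan.
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/Orders",
            "Condition": {
                "ForAllValues:StringEquals": {
                    # Partition key must equal the caller's Cognito identity ID.
                    "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"]
                }
            },
        }
    ],
}


def allowed_actions(policy_doc: dict) -> set:
    """Collect every action named in the policy's Allow statements."""
    actions = set()
    for statement in policy_doc["Statement"]:
        if statement["Effect"] == "Allow":
            actions.update(statement["Action"])
    return actions


print(json.dumps(policy, indent=2))
print(sorted(allowed_actions(policy)))
```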

Full explanation →

288

MCQeasy

A company uses Amazon DynamoDB as the primary data store for a web application. The application experiences occasional throttling on write requests. The data engineer needs to implement a solution that handles throttling gracefully without losing data. Which approach should the engineer use?

A.Increase the provisioned write capacity to a higher value

B.Use an Amazon SQS queue to buffer write requests before sending to DynamoDB

C.Implement exponential backoff in the application's write retry logic

D.Enable DynamoDB Accelerator (DAX) to cache writes

AnswerC

Exponential backoff is a best practice to handle throttling effectively.

Why this answer

Option C is correct because implementing exponential backoff in the application's write retry logic is the standard AWS-recommended approach for handling DynamoDB throttling (ProvisionedThroughputExceededException). Exponential backoff gradually increases the wait time between retries, reducing the retry rate and allowing the throttling condition to subside, while ensuring no write data is lost as long as the retries eventually succeed. This approach is lightweight, requires no additional AWS services, and aligns with best practices for building resilient applications against DynamoDB throttling.

Exam trap

The trap here is that candidates often confuse DAX as a write cache or assume SQS is the only way to buffer writes, but the question specifically asks for handling throttling gracefully without losing data, and exponential backoff is the direct, built-in mechanism for retrying throttled requests in DynamoDB.

How to eliminate wrong answers

Option A is wrong because simply increasing provisioned write capacity may reduce throttling but does not handle throttling gracefully when it occurs; it also incurs higher costs and does not address the root cause of occasional spikes. Option B is wrong because using an SQS queue to buffer write requests introduces eventual consistency and potential data loss if the queue messages expire or are not processed before the DynamoDB write; it also adds complexity and latency, and is not the standard pattern for handling DynamoDB throttling directly. Option D is wrong because DynamoDB Accelerator (DAX) is an in-memory cache for reads only, not writes; it cannot cache write requests or mitigate write throttling.
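A minimal sketch of exponential backoff with jitter, written without any AWS dependency so it can run on its own. `ThrottledError` and `write_fn` are hypothetical stand-ins for a throttling exception and a DynamoDB write call; the delay values are illustrative. Note that the AWS SDKs already retry throttled requests with backoff by default, so explicit logic like this mainly applies when that built-in retry budget is exhausted.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a DynamoDB throttling exception."""


def write_with_backoff(write_fn, max_attempts=5, base_delay=0.05, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call write_fn, retrying on throttling with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return write_fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # Give up; the caller decides how to avoid losing the write.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())  # Jitter: wait a random fraction of the delay.


# Demo: a write that is throttled twice and then succeeds.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ThrottledError("simulated throttle")
    return "ok"


sleeps = []
result = write_with_backoff(flaky_write, sleep=sleeps.append, rng=lambda: 1.0)
print(result, sleeps)
```

The `sleep` and `rng` parameters are injected only so the demo runs instantly and deterministically; production code would leave them at their defaults.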

Full explanation →

289

MCQhard

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon Redshift. The migration is successful, but after a few days, data in Redshift becomes inconsistent with the source due to ongoing changes. The company needs to keep Redshift synchronized with minimal latency. Which approach should the data engineer use?

A.Configure DMS with ongoing replication using change data capture (CDC).

B.Use Amazon Redshift COPY with S3 staging and AWS Lambda triggers.

C.Schedule a full DMS load every night.

D.Set up Amazon Redshift Spectrum to query the Oracle database directly.

AnswerA

CDC captures changes continuously and applies them to Redshift.

Why this answer

AWS DMS supports ongoing replication using change data capture (CDC), which captures incremental changes from the Oracle source (via Oracle LogMiner or Binary Reader) and applies them to Amazon Redshift in near real-time. This approach ensures that Redshift remains synchronized with the source database with minimal latency, meeting the requirement for ongoing consistency after the initial full load.

Exam trap

The trap here is that candidates may confuse Amazon Redshift Spectrum's ability to query external data with actual data replication, or assume that nightly batch loads (Option C) are sufficient for 'minimal latency' requirements, when DMS CDC is the only option that provides continuous, low-latency synchronization.

How to eliminate wrong answers

Option B is wrong because Amazon Redshift COPY with S3 staging and AWS Lambda triggers requires manual or event-driven extraction of data from Oracle, which introduces latency and complexity, and does not provide native CDC-based continuous replication. Option C is wrong because scheduling a full DMS load every night leaves Redshift up to 24 hours behind the source between loads and does not achieve minimal latency. Option D is wrong because Amazon Redshift Spectrum queries data stored in Amazon S3 through external tables and cannot query an Oracle database; it also does not replicate or synchronize data into Redshift, so it would not maintain a consistent local copy.

Full explanation →

290

MCQmedium

A company uses AWS Lake Formation to manage data lake permissions. The data lake contains sensitive customer data in the 'customer' database. The security team wants to ensure that only users with a specific tag 'access_level=analyst' can query the 'customer' table. Which combination of steps should the data engineer take to enforce this?

A.In Lake Formation, create an LF-tag 'access_level' with values 'analyst' and 'admin'. Grant 'SELECT' permission on the 'customer' table to the tag value 'analyst'. Associate the LF-tag with the 'customer' table.

B.Create an IAM policy that conditionally allows 'glue:GetTable' based on the tag 'access_level=analyst'.

C.Apply a bucket policy on the S3 location of the 'customer' table that allows access only if the request carries the tag 'access_level=analyst'.

D.Use Lake Formation column-level filters to restrict access to columns based on the tag 'access_level=analyst'.

AnswerA

This uses Lake Formation TBAC to restrict access based on the user's tag.

Why this answer

Option A is correct because Lake Formation LF-tags allow you to define metadata tags (key-value pairs) and grant permissions to those tags. By creating an LF-tag 'access_level' with values 'analyst' and 'admin', granting SELECT on the 'customer' table to the tag value 'analyst', and associating that LF-tag with the table, only principals who have the tag 'access_level=analyst' (or are granted via the tag) can query the table. This enforces tag-based access control at the Lake Formation permission layer, which is the intended mechanism for fine-grained, attribute-based access control in Lake Formation.

Exam trap

The trap here is that candidates often confuse IAM tag-based policies (Option B) or S3 bucket policies (Option C) with Lake Formation's native LF-tag mechanism, not realizing that LF-tags are a Lake Formation-specific construct that must be managed within Lake Formation itself, not at the IAM or S3 level.

How to eliminate wrong answers

Option B is wrong because an IAM policy conditionally allowing 'glue:GetTable' based on a tag controls access to the Glue Data Catalog API, but it does not enforce Lake Formation permissions on the underlying data; Lake Formation permissions override IAM policies for registered locations, and this approach would not prevent a user with the tag from querying the table if Lake Formation grants are not also configured. Option C is wrong because S3 bucket policies operate at the object storage layer and cannot evaluate Lake Formation LF-tags; they can use IAM tags via the 'aws:RequestTag' condition key, but this would require the request to carry the tag, which is not how Lake Formation principals are identified, and it would bypass Lake Formation's centralized permission model. Option D is wrong because column-level filters in Lake Formation restrict which columns a principal can see; they are a separate mechanism from LF-tags and do not by themselves grant table access based on the tag 'access_level=analyst'.

Full explanation →

291

MCQmedium

A company is using Amazon DynamoDB for a high-traffic web application. They notice increased read latency during peak hours. Which design change would best reduce read latency without increasing cost?

A.Increase read capacity units

B.Use DynamoDB global tables

C.Switch to strongly consistent reads

D.Enable DynamoDB Accelerator (DAX)

AnswerD

DAX is a caching layer that reduces read latency.

Why this answer

DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency from single-digit milliseconds to microseconds for eventually consistent reads, without requiring any changes to provisioned capacity. Since the question specifies reducing latency without increasing cost, DAX is ideal because it offloads read traffic from the underlying table, allowing you to potentially lower read capacity units (RCUs) while maintaining performance.

Exam trap

The trap here is that candidates often confuse increasing provisioned capacity (Option A) with reducing latency, but DynamoDB's internal latency is dominated by storage I/O and network round trips, not capacity units—DAX addresses the actual bottleneck by caching hot data in memory.

How to eliminate wrong answers

Option A is wrong because increasing read capacity units (RCUs) would directly increase cost, and while it can reduce throttling, it does not inherently reduce per-request latency caused by internal DynamoDB overhead or hot partitions. Option B is wrong because global tables are designed for multi-region replication and disaster recovery, not for reducing read latency within a single region; they would increase cost due to replication writes and cross-region traffic. Option C is wrong because switching to strongly consistent reads actually increases latency (as they require a quorum read from multiple storage nodes) and consumes twice the RCUs, thus increasing cost without improving performance.

Full explanation →

292

MCQmedium

A data engineer is designing a data store for real-time analytics on high-velocity clickstream data. The data must be stored in a schema-on-read format and support SQL queries with sub-second latency. Which service should be used?

A.Amazon Redshift

B.Amazon Kinesis Data Firehose to S3 with Athena

C.Amazon Kinesis Data Analytics

D.Amazon DynamoDB

AnswerB

Firehose streams data to S3, Athena queries with schema-on-read and partitioning for low latency.

Why this answer

Amazon Kinesis Data Firehose can ingest high-velocity clickstream data and deliver it to Amazon S3, where it is stored in a schema-on-read format (e.g., Parquet or ORC). Amazon Athena then allows SQL queries directly on the data in S3 with sub-second latency when using partitions, columnar formats, and optimizations like AWS Glue Catalog. This combination meets the requirements for real-time analytics without predefining a schema.

Exam trap

The trap here is that candidates confuse Amazon Kinesis Data Analytics (which processes streams but does not store data) with a storage solution, or they assume Amazon Redshift is suitable for real-time streaming without recognizing its schema-on-write requirement and higher latency for ad-hoc queries.

How to eliminate wrong answers

Option A is wrong because Amazon Redshift requires a predefined schema (schema-on-write) and is optimized for batch analytics, not sub-second latency on high-velocity streaming data without significant preprocessing. Option C is wrong because Amazon Kinesis Data Analytics processes streaming data in real time using SQL but does not store the data persistently in a schema-on-read format; it is for transient analytics, not a data store. Option D is wrong because Amazon DynamoDB is a NoSQL key-value and document database that does not support SQL queries natively (it uses PartiQL with limitations) and is schema-on-write, not schema-on-read, making it unsuitable for ad-hoc SQL analytics on clickstream data.

Full explanation →

293

Multi-Selectmedium

A data engineer needs to audit all access to an S3 bucket containing sensitive data. The engineer must capture who accessed the bucket, from which IP address, and what actions were performed. Which AWS services should be used together to meet this requirement? (Choose THREE.)

Select 3 answers

A.Amazon CloudWatch Logs

B.AWS Config

C.Amazon S3 server access logs

D.AWS CloudTrail

E.VPC Flow Logs

AnswersA, C, D

Can store and analyze logs from S3 and CloudTrail.

Why this answer

Options A, C, and D are correct. S3 server access logs record detailed information about requests. CloudTrail records API calls with identity and source IP.

CloudWatch Logs can ingest and analyze logs. Option E is wrong because VPC Flow Logs capture network traffic but not S3 API details. Option B is wrong because Config records resource configuration changes, not access.

Full explanation →

294

MCQhard

A company runs a time-series forecasting model that writes results to an S3 bucket every 5 minutes. A downstream ETL job reads this data, but sometimes fails because it encounters incomplete files (zero bytes). What is the MOST reliable way to ensure the ETL job only processes complete files?

A.Set an S3 Lifecycle policy to delete files smaller than 1 MB.

B.Use S3 Copy to move files to a 'processed' folder after the ETL job reads them.

C.Configure S3 Select to query the files and only return rows if the file is complete.

D.Use S3 Event Notifications to trigger a Lambda function that checks file size and then moves the file to a 'ready' prefix.

AnswerD

Lambda can verify completeness before moving, ensuring only complete files are processed.

Why this answer

Option D is correct because an S3 Event Notification fires only after an object has been fully written, so the Lambda function can check the object size and move only non-empty files to the 'ready' prefix that the ETL job reads. Option B is wrong because copying files after the ETL job reads them cannot detect completeness. Option C is wrong because S3 Select still reads incomplete files.

Option A is wrong because S3 Lifecycle policies manage object lifecycle, not completeness.
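A minimal sketch of the size check such a Lambda function might perform on the S3 event payload. It only selects which objects are complete and computes a destination key; the actual copy to the 'ready' prefix is left out so the example runs without AWS access. The prefix name, bucket name, and sample event are hypothetical.

```python
from urllib.parse import unquote_plus

READY_PREFIX = "ready/"  # hypothetical destination prefix read by the ETL job


def complete_objects(event: dict, min_size: int = 1) -> list:
    """Return (bucket, source_key, ready_key) for non-empty objects in the event."""
    ready = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        if s3["object"].get("size", 0) < min_size:
            continue  # Skip zero-byte (incomplete) files.
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        filename = key.rsplit("/", 1)[-1]
        ready.append((s3["bucket"]["name"], key, READY_PREFIX + filename))
    return ready


# Hypothetical event with one empty and one complete object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/empty.parquet", "size": 0}}},
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/run%3D42/forecast.parquet", "size": 2048}}},
    ]
}

print(complete_objects(sample_event))
```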

Full explanation →

295

MCQmedium

A data engineering team needs to ingest streaming data from thousands of IoT devices and store it in Amazon S3 for batch processing. The data arrives at a rate of 10 MB/s, with occasional spikes up to 50 MB/s. The data must be processed in near real-time with minimal latency. Which AWS service should be used for ingestion?

A.Amazon DynamoDB Streams

B.Amazon Kinesis Data Streams

C.Amazon SQS

D.Amazon S3

AnswerB

Designed for real-time data streaming with high throughput and S3 integration via Kinesis Firehose.

Why this answer

Option B is correct because Kinesis Data Streams can handle high throughput and is designed for real-time ingestion. Option C is wrong because SQS is for decoupled messaging, not high-throughput streaming; it also lacks built-in S3 integration. Option D is wrong because S3 is a storage service, not a streaming ingestion service.

Option A is wrong because DynamoDB Streams captures changes to DynamoDB tables, not external IoT data.
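To size a provisioned stream for this workload, a quick calculation helps. The sketch below assumes the per-shard write limits of 1 MB/s and 1,000 records/s; the function name and the choice to size for the peak rate are illustrative assumptions.

```python
import math


def required_shards(mb_per_sec: float, records_per_sec: float = 0,
                    shard_mb_per_sec: float = 1.0,
                    shard_records_per_sec: int = 1000) -> int:
    """Return the shard count needed to satisfy both per-shard write limits."""
    by_bytes = math.ceil(mb_per_sec / shard_mb_per_sec)
    by_records = math.ceil(records_per_sec / shard_records_per_sec)
    return max(1, by_bytes, by_records)


baseline_shards = required_shards(10)  # steady 10 MB/s
peak_shards = required_shards(50)      # spikes up to 50 MB/s

print(baseline_shards, peak_shards)
```

Sizing for the 50 MB/s peak avoids write throttling during spikes; on-demand capacity mode is an alternative when spikes are hard to predict.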

Full explanation →

296

MCQeasy

Refer to the exhibit. A data engineer creates an IAM policy for a service role used by AWS Glue. What does the condition in the policy enforce?

A.The role can use the KMS key from any AWS service

B.The role can only use the KMS key when the request comes from Glue

C.The role can only use the KMS key for decrypting data

D.The role can only use the KMS key when the request comes from S3

AnswerD

kms:ViaService limits to S3 endpoints.

Why this answer

The condition uses the kms:ViaService condition key to restrict KMS actions to requests made through Amazon S3, so Option D is correct. Option B is wrong because it names Glue, but the condition is for S3. Option A is wrong because the condition limits use of the key to a single service, not any service.

Option C is wrong because the condition restricts the calling service, not which KMS operations are allowed.
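The exhibit is not reproduced on this page, so the following is a hypothetical statement of the kind the explanation describes, built in Python. The key ARN, region, and action list are placeholders, and `request_matches` is a deliberately simplified stand-in that checks only the kms:ViaService value, not full IAM policy evaluation.

```python
def build_via_service_statement(key_arn: str, region: str) -> dict:
    """Build an Allow statement limited to KMS requests made through S3."""
    return {
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": key_arn,
        "Condition": {
            "StringEquals": {"kms:ViaService": f"s3.{region}.amazonaws.com"}
        },
    }


def request_matches(statement: dict, via_service: str) -> bool:
    """Simplified check of the kms:ViaService condition only (an illustration)."""
    expected = statement["Condition"]["StringEquals"]["kms:ViaService"]
    return via_service == expected


# Placeholder key ARN and region.
statement = build_via_service_statement(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id", "us-east-1"
)

print(request_matches(statement, "s3.us-east-1.amazonaws.com"))    # through S3
print(request_matches(statement, "glue.us-east-1.amazonaws.com"))  # through Glue
```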

Full explanation →

297

MCQeasy

A data engineer needs to store semi-structured JSON logs from multiple microservices in a cost-effective manner for later analysis using Amazon Athena. The logs are generated continuously, and the total volume is about 1 TB per day. The data must be queryable within minutes of arrival. Which storage solution is most appropriate?

A.Amazon DynamoDB table with JSON attribute

B.Amazon RDS for PostgreSQL table with JSON column

C.Amazon S3 bucket with partitioned folders

D.Amazon Redshift cluster with JSON ingestion

AnswerC

S3 is cost-effective, and Athena can query the data directly.

Why this answer

Amazon S3 with partitioned folders is the most appropriate solution because it provides a cost-effective, scalable storage layer for semi-structured JSON logs, and integrates natively with Amazon Athena for serverless querying. By partitioning the data by time (e.g., year/month/day/hour), Athena can use partition pruning to minimize scanned data, enabling queries within minutes of arrival. S3's low cost per GB and lifecycle policies further optimize storage for the 1 TB/day volume.

Exam trap

AWS often tests the misconception that a data warehouse (Redshift) or a NoSQL database (DynamoDB) is required for analytical queries on semi-structured data, when in fact S3 with Athena is the most cost-effective and scalable solution for serverless ad-hoc analysis on raw logs.

How to eliminate wrong answers

Option A is wrong because Amazon DynamoDB is optimized for key-value and document access patterns with low-latency reads/writes, not for ad-hoc analytical queries on large volumes of JSON logs; scanning 1 TB/day would be prohibitively expensive and slow, and it lacks native integration with Athena. Option B is wrong because Amazon RDS for PostgreSQL is a relational database designed for transactional workloads, not for storing and analyzing 1 TB/day of semi-structured logs; it would require manual partitioning, incur high storage costs, and cannot scale to petabyte-scale analytics efficiently. Option D is wrong because Amazon Redshift is a petabyte-scale data warehouse optimized for complex analytical queries, but it is overkill and more expensive than S3 for raw log storage; ingesting 1 TB/day of JSON logs into Redshift requires an ETL pipeline (e.g., COPY from S3) and incurs compute costs even when idle, whereas S3 with Athena is serverless and pay-per-query.
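The time-based layout mentioned above (year/month/day/hour) can be sketched as a small helper that builds a Hive-style object key for each log file. The `logs` root, the `service=` partition, and the file name are hypothetical choices; only the year/month/day/hour structure comes from the explanation.

```python
from datetime import datetime, timezone


def partitioned_key(service: str, event_time: datetime, filename: str,
                    root: str = "logs") -> str:
    """Build a Hive-style S3 key partitioned by service and UTC hour.

    Expects a timezone-aware datetime so the UTC conversion is unambiguous.
    """
    t = event_time.astimezone(timezone.utc)
    return (
        f"{root}/service={service}/"
        f"year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/{filename}"
    )


example_key = partitioned_key(
    "checkout",
    datetime(2024, 3, 5, 7, 30, tzinfo=timezone.utc),
    "part-0001.json.gz",
)
print(example_key)
```

With keys laid out this way, a query that filters on year, month, day, and hour lets Athena read only the matching prefixes.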

Full explanation →

298

MCQhard

A financial services company is building a real-time fraud detection system. Transaction data is ingested via Amazon Kinesis Data Streams and processed by an Amazon Kinesis Data Analytics for Apache Flink application that runs sliding window aggregations. The output is written to an Amazon S3 bucket for downstream analysis. The Flink application is configured with parallelism of 4 and checkpointing every minute. The company has noticed that the application is experiencing high latency and the checkpointing is frequently failing. The CloudWatch metrics show that the Flink application's CPU utilization is near 100% and the checkpoint duration is spiking to over 5 minutes. The data engineer needs to improve performance. Which action should the data engineer take?

A.Increase the number of shards in the source Kinesis stream to improve throughput.

B.Increase the parallelism of the Flink application to distribute the workload across more resources.

C.Increase the heap memory of the Flink application to handle larger state.

D.Decrease the checkpoint interval to 30 seconds to reduce the amount of state being checkpointed.

AnswerB

More parallelism can reduce CPU utilization and checkpoint time.

Why this answer

Option B is correct because increasing parallelism allows the workload to be distributed across more resources, reducing CPU pressure and checkpoint duration. Option D is wrong because decreasing the checkpoint interval would increase checkpoint frequency and likely worsen failures. Option A is wrong because increasing shards without increasing parallelism does not help when CPU is the bottleneck.

Option C is wrong because memory is not the primary issue; the metrics point to CPU saturation.

Full explanation →

299

Multi-Selectmedium

A company needs to protect sensitive data stored in Amazon S3 from unauthorized access. Which TWO actions should the data engineer take? (Choose two.)

Select 2 answers

A.Configure S3 bucket policies to require MFA for delete operations

B.Enable cross-region replication for all buckets

C.Set up an S3 Lifecycle policy to transition objects to Glacier

D.Enable S3 Block Public Access at the account level

E.Enable S3 Versioning on all buckets

AnswersA, D

MFA adds an extra layer of security for sensitive operations.

Why this answer

Enabling S3 Block Public Access at the account level prevents any public access. Using bucket policies with conditions to require MFA adds an extra layer of security. Versioning does not prevent unauthorized access.

Lifecycle policies manage storage, not security. Cross-region replication is for disaster recovery, not security.

Full explanation →

300

MCQmedium

A company uses Amazon DynamoDB to store session data for a web application. The application experiences throttling errors during peak traffic. The data engineer observes that the table's read capacity is consistently at 100% and the write capacity is at 20%. The engineer needs to resolve the throttling with minimal cost. Which solution should the engineer implement?

A.Increase the provisioned read capacity units for the table.

B.Enable DynamoDB auto scaling for read capacity.

C.Implement DynamoDB Accelerator (DAX) to cache read-heavy workloads.

D.Decrease the provisioned write capacity units to free up budget for reads.

AnswerC

DAX reduces read load on the table by caching, lowering required read capacity.

Why this answer

Option C is correct. Using DynamoDB Accelerator (DAX) caches frequent reads, reducing read capacity consumption. Option A is wrong because increasing read capacity units is more expensive than DAX and does not address cost.

Option B is wrong because auto scaling may not respond fast enough for sudden spikes. Option D is wrong because decreasing write capacity does not help read throttling.

Full explanation →
