AWS Certified Data Engineer Associate DEA-C01 Questions 301–375 | Page 5/24

301

MCQhard

A company uses Amazon Redshift for data warehousing. The security team requires that all data be encrypted at rest using a key managed by the company. Which Redshift encryption option should be used?

A.Enable encryption using AWS managed key (default)

B.Use SSL/TLS encryption

C.Use hardware security module (HSM)

D.Specify a customer managed KMS key when enabling encryption

AnswerD

Redshift allows you to specify a customer managed KMS key for encryption.

Why this answer

Redshift supports encryption at rest using KMS keys. To use a customer managed key, you specify its KMS key ID when enabling encryption, so Option D is correct.

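Option C is not the best fit: an HSM is a separate key management path that needs additional setup, whereas a customer managed KMS key meets the requirement directly.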

302

MCQhard

A company uses Amazon DynamoDB as the primary data store for a high-traffic application. Recently, read latency has increased significantly. The DynamoDB table has on-demand capacity mode. Which action is MOST effective to reduce read latency?

A.Add a DynamoDB Accelerator (DAX) cluster in front of the table

B.Switch the table to provisioned capacity mode with higher read capacity

C.Increase the read capacity units in the table's auto scaling settings

D.Enable DynamoDB Global Tables to distribute reads across regions

AnswerA

DAX caches reads, reducing latency.

Why this answer

Option A is correct because a DynamoDB Accelerator (DAX) cluster provides an in-memory cache, which reduces read latency significantly. Option B is wrong because on-demand capacity already scales automatically, so switching to provisioned capacity does not address latency. Option C is wrong because read capacity units and auto scaling settings do not apply in on-demand mode.

Option D is wrong because Global Tables do not reduce read latency for local reads.

303

MCQmedium

A company wants to grant cross-account access to an S3 bucket without using IAM roles. The data engineer needs to write a bucket policy that allows another AWS account to list objects. Which Principal should be specified in the bucket policy?

A.The AWS account ID that owns the bucket

B.The AWS account ID of the other account

C.The IAM user ARN in the other account

D.The root user of the other account

AnswerB

The Principal should be the other account's ID.

Why this answer

Option B is correct because the Principal should be the AWS account ID of the other account. Option C is wrong because a specific IAM user ARN would grant access only to that user. Option D is wrong because relying on the root user is not best practice.

Option A is wrong because the bucket policy is attached to the resource in the owning account, so naming that account as the Principal grants nothing to the other account.
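
As a rough illustration of the correct choice, a bucket policy that lets another account list a bucket might look like the sketch below. The bucket name `example-log-bucket` and the account ID `111122223333` are placeholders chosen for this example, not values from the question.

```python
import json

# Hypothetical values for illustration only.
BUCKET_NAME = "example-log-bucket"
OTHER_ACCOUNT_ID = "111122223333"


def build_list_policy(bucket_name: str, account_id: str) -> dict:
    """Return a bucket policy that lets another account list the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountList",
                "Effect": "Allow",
                # The Principal names the *other* account, not the bucket owner.
                "Principal": {"AWS": account_id},
                # s3:ListBucket applies to the bucket ARN, not to object ARNs.
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket_name}",
            }
        ],
    }


policy = build_list_policy(BUCKET_NAME, OTHER_ACCOUNT_ID)
print(json.dumps(policy, indent=2))
```

In practice, an identity in the other account would still need its own IAM permission for the call; the bucket policy only covers the resource side.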

304

MCQeasy

A data engineer is reviewing the lifecycle configuration of an S3 bucket. The bucket stores log files. The engineer wants to ensure that objects are deleted after 365 days. What is the current behavior?

A.Objects are deleted after 365 days.

B.Objects are transitioned to S3 Standard-IA immediately.

C.Objects are transitioned to S3 Glacier after 365 days.

D.Noncurrent versions of objects are deleted after 365 days.

AnswerA

The expiration rule sets deletion after 365 days.

Why this answer

Option A is correct because the expiration rule deletes objects after 365 days. Option B is wrong because objects transition to STANDARD_IA after 30 days, not immediately. Option C is wrong because there is no transition to Glacier.

Option D is wrong because noncurrent versions are not included.
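
The exhibit for this question is not reproduced in this text. A lifecycle configuration consistent with the explanation (transition to STANDARD_IA after 30 days, expiration after 365 days) could look like the sketch below. It is a reconstruction under that assumption; the rule ID and the empty prefix are invented for illustration.

```python
# Hypothetical reconstruction; the original exhibit is not shown in this text.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-logs",        # assumed rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},   # assumed: applies to every object
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            "Expiration": {"Days": 365},
        }
    ]
}


def describe(rule: dict) -> list:
    """Summarize what a lifecycle rule does to current object versions."""
    summary = []
    for transition in rule.get("Transitions", []):
        summary.append(
            f"transition to {transition['StorageClass']} after {transition['Days']} days"
        )
    if "Expiration" in rule:
        summary.append(f"expire after {rule['Expiration']['Days']} days")
    # Noncurrent versions are only affected when this key is present.
    if "NoncurrentVersionExpiration" not in rule:
        summary.append("noncurrent versions untouched")
    return summary


for line in describe(lifecycle_configuration["Rules"][0]):
    print(line)
```

Read this way, each answer option can be checked against one key of the rule: `Transitions` for options B and C, `Expiration` for option A, and the absent `NoncurrentVersionExpiration` for option D.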

305

MCQhard

A data engineer is designing a data lake on Amazon S3 that must comply with a regulatory requirement to prevent any data from being overwritten or deleted for 7 years after creation. Which S3 feature should be used?

A.S3 bucket policy that denies s3:DeleteObject

B.S3 bucket versioning with MFA Delete

C.S3 Object Lock with retention mode set to COMPLIANCE

D.S3 bucket versioning only

AnswerC

COMPLIANCE retention prevents any deletion or overwrite during the retention period.

Why this answer

Option C is correct because S3 Object Lock with retention mode COMPLIANCE prevents any deletion or overwrite for the specified period. Option B is wrong because MFA Delete requires a token but can still be disabled. Option A is wrong because bucket policies can be changed.

Option D is wrong because versioning alone does not prevent deletion; objects can still be deleted with delete markers.
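
A default retention rule for this scenario could be expressed roughly as follows. This is a minimal sketch of the configuration shape only; the helper name `build_object_lock_configuration` is made up for this example, and applying it to a real bucket is out of scope here.

```python
VALID_MODES = {"GOVERNANCE", "COMPLIANCE"}


def build_object_lock_configuration(mode: str, years: int) -> dict:
    """Build a default-retention Object Lock configuration (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown retention mode: {mode}")
    return {
        "ObjectLockEnabled": "Enabled",
        # COMPLIANCE mode is the one the question requires for the 7-year rule.
        "Rule": {"DefaultRetention": {"Mode": mode, "Years": years}},
    }


config = build_object_lock_configuration("COMPLIANCE", 7)
print(config)
```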

306

Multi-Selecthard

A data engineer is troubleshooting a Kinesis Data Streams consumer application that is falling behind. The stream has 10 shards and is receiving 5 MB/s of data. The consumer uses the Kinesis Client Library (KCL) with a single worker. The worker is processing all 10 shards but is experiencing high latency and checkpointing delays. Which THREE actions should the engineer take to improve consumer performance? (Select THREE.)

Select 3 answers

A.Increase the number of KCL workers to match the number of shards.

B.Enable enhanced fan-out for the consumer.

C.Decrease the checkpoint interval to reduce checkpointing overhead.

D.Increase the KCL maxRecords parameter to process more records per call.

E.Increase the number of shards in the stream.

AnswersA, B, D

Multiple workers can process shards in parallel, reducing per-worker load.

Why this answer

Option A (increase number of workers) allows shard distribution across multiple workers, improving parallelism. Option D (increase KCL maxRecords per call) reduces the number of API calls, improving throughput. Option B (enable enhanced fan-out) dedicates throughput to the consumer, reducing read throttling.

Option E is wrong because increasing shards would increase the load on the consumer. Option C is wrong because reducing the checkpoint interval would cause more frequent checkpointing, potentially increasing delays.
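
To make the effect of adding workers concrete, the arithmetic can be sketched as below. It assumes shard leases are balanced across workers and that the 5 MB/s of input is spread evenly over the 10 shards; neither assumption is stated in the question.

```python
import math

SHARDS = 10
INGEST_MB_PER_S = 5.0


def max_shards_per_worker(shards: int, workers: int) -> int:
    """Worst-case number of shards one worker handles when leases are balanced."""
    return math.ceil(shards / workers)


# Assumes partition keys spread the input evenly across shards.
per_shard_ingest = INGEST_MB_PER_S / SHARDS

for workers in (1, 5, 10):
    held = max_shards_per_worker(SHARDS, workers)
    print(
        f"{workers} worker(s): up to {held} shard(s) each, "
        f"{held * per_shard_ingest:.1f} MB/s to process per worker"
    )
```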

307

Multi-Selecthard

A company is migrating an on-premises Apache Hadoop cluster to Amazon EMR. The cluster uses HDFS for storage. Which THREE features of Amazon EMR help reduce storage costs compared to on-premises HDFS? (Choose THREE)

Select 3 answers

A.Leverage Amazon S3 storage classes like S3 Standard-IA for older data.

B.Use instance store volumes for intermediate data.

C.Enable automatic data compression in EMRFS.

D.Use EMR File System (EMRFS) to store data in Amazon S3.

E.Attach Amazon EBS volumes to cluster nodes for persistent storage.

AnswersA, C, D

S3 storage classes allow cost optimization for infrequently accessed data.

Why this answer

Option A is correct because Amazon S3 Standard-IA (Infrequent Access) offers lower storage costs than S3 Standard for data that is accessed less frequently, making it ideal for older or archival data in a Hadoop migration. By using S3 as the primary storage layer via EMRFS, you decouple compute from storage and avoid the replication overhead of HDFS (which typically uses 3x replication), significantly reducing storage costs.

Exam trap

The trap here is that candidates often confuse instance store volumes or EBS volumes as cost-saving alternatives, but the exam tests the understanding that S3-based storage with EMRFS is the primary mechanism for reducing storage costs in EMR by eliminating HDFS replication and enabling lifecycle management.

308

Multi-Selectmedium

A data engineer is using Amazon DynamoDB to store session data for a web application. The engineer wants to ensure that all data is encrypted at rest using an AWS managed key. Which TWO steps should the engineer take to achieve this? (Choose TWO.)

Select 2 answers

A.Enable server-side encryption with S3-managed keys (SSE-S3) on the DynamoDB table.

B.Disable encryption at rest to improve performance.

C.Specify an AWS KMS customer managed key for encryption if required.

D.Use client-side encryption before writing data to DynamoDB.

E.Create the DynamoDB table with encryption at rest enabled using an AWS managed key.

AnswersC, E

You can choose a customer managed key for encryption.

Why this answer

Option E is correct because creating the DynamoDB table with encryption at rest enabled using an AWS managed key directly fulfills the requirement; when KMS encryption is enabled without specifying a customer managed key, DynamoDB uses the AWS managed key (aws/dynamodb). Option C is accepted because DynamoDB encryption at rest also supports a customer managed key in AWS KMS, and the option's wording ('if required') only describes that flexibility. Choosing a customer managed key is not needed to satisfy the stated requirement of an AWS managed key.

Exam trap

The trap here is that candidates may confuse the encryption options across AWS services (e.g., applying S3-specific SSE-S3 to DynamoDB) or think client-side encryption is a valid substitute for server-side encryption at rest, when DynamoDB's native encryption at rest is the correct mechanism.

309

MCQeasy

A data engineer needs to ingest real-time clickstream data from a website into Amazon S3 for analytics. The data arrives as JSON records, each under 1 KB. The engineer wants to use a serverless solution with automatic scaling and minimal operational overhead. Which AWS service should be used as the ingestion endpoint?

A.Amazon S3 with presigned URLs

B.Amazon Kinesis Data Analytics

C.Amazon Kinesis Data Firehose

D.AWS Lambda function behind an API Gateway

AnswerC

Serverless, automatically scales, delivers to S3 with optional transformation.

Why this answer

Option C is correct because Kinesis Data Firehose is a serverless service that automatically scales and delivers streaming data to S3. Option D is wrong because Lambda can process records but does not serve as a persistent ingestion endpoint; it would require custom scaling. Option A is wrong because S3 is not a real-time ingestion endpoint.

Option B is wrong because Kinesis Data Analytics is for real-time analysis, not ingestion.

310

MCQeasy

Refer to the exhibit. A data engineer checks the versioning status of an S3 bucket and sees the above output. The bucket contains critical logs that must not be permanently deleted. What should the engineer do to enhance protection against accidental or malicious deletion?

A.Enable MFA Delete on the bucket

B.Enable versioning on the bucket

C.Enable cross-region replication

D.Configure a lifecycle policy to expire noncurrent versions

AnswerA

MFA Delete requires additional authentication to permanently delete versions, protecting against accidental or malicious deletion.

Why this answer

Enabling MFA Delete on the bucket requires multi-factor authentication to delete object versions, which adds protection. Versioning is already enabled, so that is not needed. Enabling Object Lock with retention mode is another option, but the question asks for enhancement using the current setup; MFA Delete is a direct enhancement.

A lifecycle policy does not prevent deletion. Cross-region replication is for disaster recovery, not deletion protection.

311

MCQhard

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format and each record is about 2 KB. The delivery stream is configured to buffer data for 60 seconds or 5 MB, whichever comes first. The team notices that the S3 objects are very small (around 1 MB) and numerous, causing high costs due to S3 PUT requests. Which configuration change should the team make to reduce the number of S3 objects?

A.Enable compression (GZIP) on the delivery stream.

B.Increase the buffer size to 50 MB and the buffer interval to 300 seconds.

C.Reduce the buffer interval to 30 seconds and keep buffer size at 5 MB.

D.Switch from Kinesis Data Firehose to Amazon Kinesis Data Streams and use a Lambda function to write to S3.

AnswerB

Larger buffer accumulates more data before writing, resulting in fewer, larger objects.

Why this answer

The correct answer is to increase the buffer size to a larger value (e.g., 50 MB) and increase the buffer interval to 300 seconds. Larger buffers produce fewer, larger S3 objects. Option C (reducing the buffer interval) would create even more objects.

Option A (compression) reduces object size but not count. Option D (switching to Kinesis Data Streams) changes the architecture. Option B directly addresses the issue.
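
The effect of the buffer settings can be estimated with a small calculation. The sketch below assumes a steady input rate and one object per flush, which is a simplification; the rate of roughly 1/60 MB/s is inferred from the 1 MB objects at a 60-second interval described in the question.

```python
SECONDS_PER_DAY = 86_400


def objects_per_day(throughput_mb_per_s: float, buffer_mb: float, interval_s: float) -> float:
    """Estimate S3 objects written per day for a steady input rate.

    A flush happens when either the size or the time threshold is reached first.
    """
    seconds_to_fill = buffer_mb / throughput_mb_per_s
    flush_period = min(interval_s, seconds_to_fill)
    return SECONDS_PER_DAY / flush_period


# ~1 MB objects at a 60-second interval imply about 1/60 MB/s (assumed steady).
rate = 1 / 60

before = objects_per_day(rate, buffer_mb=5, interval_s=60)
after = objects_per_day(rate, buffer_mb=50, interval_s=300)
print(f"before: {before:.0f} objects/day, after: {after:.0f} objects/day")
```

Under these assumptions the time limit, not the size limit, triggers every flush, which is why raising only the buffer size would not help: the interval has to increase as well.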

312

MCQhard

A data engineer is troubleshooting a Kinesis Data Streams application that is experiencing high latency. The stream has 2 shards. The application is using a single Kinesis Client Library (KCL) worker to process all shards. Which change will MOST likely reduce latency?

A.Increase the number of shards to 4.

B.Deploy multiple KCL workers to process shards in parallel.

C.Use a larger instance type for the Kinesis stream.

D.Decrease the number of shards to 1.

AnswerB

Multiple workers can process shards concurrently, reducing latency.

Why this answer

Option B is correct because using multiple KCL workers, one per shard, allows parallel processing of each shard, reducing latency. Option A is incorrect because increasing shard count would increase capacity but not necessarily reduce latency if the processing is bottlenecked by a single worker. Option D is incorrect because decreasing shard count would reduce parallelism.

Option C is incorrect because the KCL worker runs in the application, not in Kinesis.

313

Multi-Selectmedium

A company is using AWS Glue to run ETL jobs that read from Amazon S3 and write to Amazon Redshift. The jobs are failing intermittently with 'Out of Memory' errors. Which TWO actions should the data engineer take to resolve this issue? (Choose TWO.)

Select 2 answers

A.Switch the output to Amazon S3 instead of Redshift

B.Increase the number of DPUs allocated to the Glue job

C.Reduce the number of partitions in the input data

D.Increase the spark.sql.shuffle.partitions parameter

E.Enable job metrics in CloudWatch to monitor memory usage

AnswersB, E

More DPUs provide more memory.

Why this answer

Option B is correct because increasing the DPU count provides more memory and processing power. Option E is correct because enabling job metrics helps identify memory bottlenecks. Option D (increasing spark.sql.shuffle.partitions) may help but is not a primary solution.

Option C (reducing the number of input partitions) lowers parallelism, which could reduce memory use but may slow down the job. Option A (writing to S3 instead of Redshift) is unrelated to memory in Glue.

314

Multi-Selecthard

A financial services company stores sensitive transaction data in an Amazon S3 bucket. Compliance requires that all objects be encrypted using SSE-KMS and that the bucket be protected from accidental deletion. Which combination of actions meets these requirements? (Select TWO.)

A.Enable MFA Delete on the bucket

B.Enable S3 Block Public Access

C.Add a bucket policy that denies PutObject if the object is not encrypted with SSE-KMS

D.Enable S3 Versioning on the bucket

E.Set default encryption to SSE-S3

AnswersC, D

This ensures all uploads use SSE-KMS.

Why this answer

Options C and D are correct because enabling versioning protects against accidental deletion and a bucket policy denying unencrypted uploads enforces SSE-KMS. Option E (default encryption with SSE-S3) does not enforce SSE-KMS and does not encrypt existing objects. Option A (MFA Delete) is an additional protection but not required by the scenario.

Option B (Block Public Access) addresses public access, not deletion.
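
A deny statement of the kind described above is commonly written as in the following sketch. The bucket name is a placeholder, and the policy is shown only as a Python dictionary; it is not taken from the question.

```python
import json

BUCKET_NAME = "example-transactions-bucket"  # hypothetical bucket name


def build_sse_kms_policy(bucket_name: str) -> dict:
    """Return a bucket policy that rejects uploads not requesting SSE-KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutSseKms",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                # Object-level action, so the resource is every key in the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


sse_policy = build_sse_kms_policy(BUCKET_NAME)
print(json.dumps(sse_policy, indent=2))
```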

315

MCQeasy

A data engineer needs to store semi-structured JSON files that are accessed infrequently but must be retrievable within minutes. The data is immutable and must be stored cost-effectively. Which AWS service should the engineer use?

A.Amazon DynamoDB with on-demand capacity

B.Amazon EBS with gp3 volume

C.Amazon S3 with S3 Standard-IA storage class

D.Amazon RDS for PostgreSQL with JSONB data type

AnswerC

S3 is designed for object storage, supports JSON, and Standard-IA is cost-effective for infrequent access with millisecond retrieval.

Why this answer

Amazon S3 Standard-IA (Infrequent Access) is designed for data that is accessed less frequently but requires rapid retrieval when needed, with retrieval times in milliseconds. It offers lower storage costs than S3 Standard while maintaining high durability and availability, making it ideal for storing immutable semi-structured JSON files that must be retrievable within minutes. The service is cost-effective for infrequently accessed data because it charges a retrieval fee per GB, but the storage price is significantly lower than standard tiers.

Exam trap

The trap here is that candidates often confuse 'infrequently accessed' with 'archival' and choose Glacier or Deep Archive, but the requirement for retrieval within minutes eliminates those options, while DynamoDB or RDS seem plausible for JSON but are not cost-effective for immutable, infrequently accessed data.

How to eliminate wrong answers

Option A is wrong because Amazon DynamoDB with on-demand capacity is a NoSQL database optimized for high-frequency, low-latency queries and is not cost-effective for infrequently accessed, immutable JSON files; it charges per read/write request unit and storage, which would be wasteful for archival-like data. Option B is wrong because Amazon EBS with gp3 volume is a block storage service designed for EC2 instances and requires an attached compute instance to access data, adding unnecessary cost and complexity; it is not a standalone object storage solution for infrequently accessed files. Option D is wrong because Amazon RDS for PostgreSQL with JSONB data type is a relational database service that incurs ongoing compute and storage costs, even when idle, and is overkill for storing immutable JSON files that are only occasionally retrieved; it is designed for transactional workloads and complex queries, not cost-effective archival storage.

316

MCQmedium

A company uses Amazon EMR to run Spark jobs on a cluster of 20 nodes. The cluster stores intermediate data on Amazon S3 using EMRFS. The company's data engineering team notices that the Spark jobs are running slower than expected. Upon investigating, they find that the cluster is experiencing high network I/O and that the S3 storage costs have increased significantly. The team suspects that the Spark jobs are writing too much intermediate data to S3. The jobs are performing many shuffle operations. The team wants to optimize the job performance and reduce costs without modifying the Spark application code. What should the data engineer do?

A.Enable S3 server-side encryption on the S3 bucket to reduce storage costs.

B.Increase the size of the EBS root volumes on the cluster nodes to store more intermediate data locally.

C.Configure the EMR cluster to use instance store volumes for intermediate data instead of EMRFS.

D.Add more nodes to the cluster to distribute the shuffle load.

AnswerC

Instance store provides local ephemeral storage, reducing S3 dependency and network I/O.

Why this answer

Option A is wrong because enabling encryption on the bucket does not affect performance; the issue is the shuffle data. Option B is wrong because enlarging the EBS root volumes is not the standard way to hold shuffle data; EMR uses instance store or EMRFS. Option C is correct because using instance store for shuffle data reduces S3 I/O and cost.

Option D is wrong because increasing node count may increase cost and network I/O.

317

Multi-Selecteasy

A data engineer needs to transfer 50 TB of data from an on-premises data center to Amazon S3 over a 1 Gbps network. The transfer must be completed within one week. Which TWO AWS services can be used for this task? (Choose TWO.)

Select 2 answers

A.AWS Glue

B.AWS DataSync

C.AWS Snowball

D.Amazon S3 Transfer Acceleration

E.AWS Direct Connect

AnswersB, C

Designed for network-based bulk data transfer.

Why this answer

Option B (AWS DataSync) is correct because it can transfer large volumes over the network efficiently. Option C (AWS Snowball) is correct because it can be used for offline transfer if the network is insufficient. Option D is wrong because S3 Transfer Acceleration only speeds up internet transfers and does not necessarily achieve the required throughput.

Option E is wrong because Direct Connect is a network connection, not a transfer service. Option A is wrong because Glue is for ETL, not bulk transfer.
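
Whether the network option fits in one week can be checked with simple arithmetic. The sketch below uses decimal units (1 TB = 10^12 bytes); the 80% link utilization figure is an assumption for illustration, not a number from the question.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move data_tb terabytes over a link of link_gbps gigabits/s."""
    bits = data_tb * 1e12 * 8  # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400


ideal = transfer_days(50, 1)
realistic = transfer_days(50, 1, utilization=0.8)  # 80% is an assumed figure

print(f"ideal: {ideal:.2f} days, at 80% utilization: {realistic:.2f} days")
```

Both figures are under seven days, which is why a network transfer is feasible here, with an offline device as the fallback if the link is shared or less reliable than assumed.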

318

MCQhard

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a fleet of EC2 instances running a custom application that processes the records and writes to DynamoDB. The application is experiencing high latency and records are being processed slower than they are produced. The stream has 5 shards. Which action would MOST effectively improve processing speed?

A.Use the Kinesis Client Library (KCL) to automatically distribute shards among instances.

B.Increase the EC2 instance size to provide more CPU and memory.

C.Add more EC2 instances consuming from the same stream without changing shard count.

D.Increase the number of shards in the Kinesis stream.

AnswerD

More shards increase the stream's capacity and allow more parallel consumers.

Why this answer

Increasing the number of shards increases the throughput of the stream, allowing more parallel consumers. Option B is wrong because increasing instance size may not help if the bottleneck is the stream. Option C is wrong because adding more consumers without more shards doesn't help (each shard is processed by one consumer per application).

Option A is wrong because KCL handles shard distribution, but more shards are still needed.

319

Multi-Selectmedium

Which THREE of the following are valid storage classes in Amazon S3? (Choose THREE.)

Select 3 answers

A.S3 Standard

B.S3 Archive

C.S3 Intelligent-Tiering

D.S3 Cold

E.S3 One Zone-IA

AnswersA, C, E

S3 Standard is a general-purpose storage class.

Why this answer

S3 Standard is a valid storage class designed for frequently accessed data with low latency and high throughput. It offers 99.999999999% durability and 99.99% availability, making it suitable for a wide range of use cases like cloud applications, dynamic websites, and content distribution.

Exam trap

AWS often tests the distinction between valid S3 storage classes and fabricated names like 'S3 Archive' or 'S3 Cold', expecting candidates to recall the exact naming conventions (e.g., S3 Glacier, S3 Glacier Deep Archive) rather than generic terms.

320

Multi-Selectmedium

A data engineer is designing a disaster recovery strategy for an Amazon RDS for PostgreSQL database. The primary database is in us-east-1. Which TWO approaches provide cross-region disaster recovery?

Select 2 answers

A.Configure cross-region automated backups to copy to us-west-2.

B.Take a manual snapshot and copy it to us-west-2 daily.

C.Use Amazon S3 cross-region replication for the database export.

D.Enable Multi-AZ in us-east-1.

E.Create a cross-region read replica in us-west-2.

AnswersA, E

Backups are automatically copied and can be restored.

Why this answer

Cross-region read replica can be promoted to a primary in another region. Cross-region automated backups can be restored to a different region. Multi-AZ is within a region.

Snapshots are manual and region-specific unless copied.

321

Multi-Selectmedium

A data engineer needs to ensure that sensitive data stored in Amazon S3 is encrypted at rest. Which TWO options meet this requirement? (Choose TWO.)

Select 2 answers

A.Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS)

B.Server-Side Encryption with S3-Managed Keys (SSE-S3)

C.Using a VPC to restrict network access

D.Enabling MFA Delete on the S3 bucket

E.Client-Side Encryption with SSL/TLS

AnswersA, B

SSE-KMS encrypts objects at rest using KMS keys.

Why this answer

Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS) allows you to enforce encryption at rest for S3 objects using a customer-managed or AWS-managed KMS key. This option meets the requirement because the encryption is applied server-side by S3 before the data is written to disk, and the data is decrypted automatically when accessed with appropriate permissions. SSE-KMS also provides an audit trail via AWS CloudTrail for every key usage.

Exam trap

The trap here is that candidates often confuse encryption in transit (SSL/TLS) with encryption at rest, or they mistakenly think network controls like VPCs or access controls like MFA Delete provide data encryption, when they only address different security domains.

322

MCQmedium

A data engineer is migrating an on-premises Apache HBase workload to Amazon DynamoDB. The application requires strongly consistent reads and the ability to query by a composite key (partition key + sort key). Which DynamoDB table design should be used?

A.Create a table with a partition key and sort key, and use ConsistentRead parameter.

B.Use a secondary index with strongly consistent reads.

C.Use a local secondary index (LSI) with the composite key.

D.Create a global secondary index (GSI) with the composite key.

AnswerA

Table key provides composite key querying; ConsistentRead ensures strong consistency.

Why this answer

DynamoDB natively supports strongly consistent reads when you use the `ConsistentRead` parameter set to `true` on GetItem, Query, or Scan operations. By defining a table with a partition key and sort key, you can directly query by the composite key (partition key + sort key) with strong consistency, meeting both requirements without additional infrastructure.

Exam trap

The trap here is that candidates reach for a secondary index when the base table's composite key already supports the query. Strongly consistent reads are available on the base table and on local secondary indexes, but never on global secondary indexes.

How to eliminate wrong answers

Option B is wrong because a secondary index is unnecessary here, and strong consistency is not available on every index type: global secondary indexes support only eventually consistent reads. Option C is wrong because a local secondary index (LSI) provides an alternative sort key for the same partition key; it does not replace the base table's composite key, which already satisfies the query. Option D is wrong because a global secondary index (GSI) supports only eventually consistent reads and cannot be used for strongly consistent queries, regardless of the key schema.
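
A strongly consistent query against the base table's composite key could be assembled roughly as follows. The table name, attribute names, and values are hypothetical, and string-typed keys are assumed; the dictionary uses the parameter names of the low-level DynamoDB Query operation.

```python
def build_consistent_query(table_name, pk_name, pk_value, sk_name, sk_value):
    """Build keyword arguments for a low-level DynamoDB Query on the base table."""
    return {
        "TableName": table_name,
        # Both parts of the composite key are matched exactly.
        "KeyConditionExpression": "#pk = :pk AND #sk = :sk",
        "ExpressionAttributeNames": {"#pk": pk_name, "#sk": sk_name},
        # Assumes both key attributes are strings ("S").
        "ExpressionAttributeValues": {
            ":pk": {"S": pk_value},
            ":sk": {"S": sk_value},
        },
        # No IndexName is set, so the read targets the base table.
        "ConsistentRead": True,
    }


params = build_consistent_query(
    "Sessions", "user_id", "u-123", "created_at", "2024-01-01T00:00:00Z"
)
print(params)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").query(**params)
```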

323

MCQhard

A company is using Amazon ElastiCache for Redis to cache frequently accessed data. The cache hit ratio is low, and the engineering team suspects that the eviction policy is causing important data to be removed. Which eviction policy should be used to minimize eviction of the most frequently accessed keys?

A.allkeys-lru

B.allkeys-lfu

C.noeviction

D.volatile-lru

AnswerB

LFU evicts least frequently used keys, retaining popular ones.

Why this answer

Option B is correct because the LFU (Least Frequently Used) eviction policy evicts keys that are accessed least frequently, thus preserving frequently accessed keys. Option A is wrong because allkeys-lru evicts the least recently used keys, which may remove frequently accessed keys if they have not been accessed recently. Option D is wrong because volatile-lru only applies to keys with a TTL.

Option C is wrong because noeviction will return errors when memory is full.
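
The difference between the two policies can be seen in a toy simulation. This is a simplified model written for this example, not how Redis implements eviction (Redis uses approximations of both policies); the key names and the capacity of 2 are arbitrary.

```python
from collections import Counter, OrderedDict


def simulate_lru(accesses, capacity):
    """Return the keys left in a cache that evicts the least recently used key."""
    cache = OrderedDict()
    for key in accesses:
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used key
            cache[key] = True
    return set(cache)


def simulate_lfu(accesses, capacity):
    """Return the keys left in a cache that evicts the least frequently used key."""
    counts = Counter()
    cache = set()
    for key in accesses:
        if key not in cache:
            if len(cache) >= capacity:
                victim = min(cache, key=lambda k: counts[k])
                cache.remove(victim)
                del counts[victim]
            cache.add(key)
        counts[key] += 1
    return cache


# "hot" is read five times, then two one-off keys arrive.
accesses = ["hot"] * 5 + ["a", "b"]
lru_result = simulate_lru(accesses, capacity=2)
lfu_result = simulate_lfu(accesses, capacity=2)
print("LRU keeps:", sorted(lru_result))
print("LFU keeps:", sorted(lfu_result))
```

In this run the LRU cache drops `hot` because it is the oldest access, while the LFU cache keeps it because of its higher access count.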

324

MCQmedium

A company uses Amazon S3 to store images that are accessed by a web application. The application generates presigned URLs for users to download images. Recently, the application has been experiencing errors when generating presigned URLs for objects that were uploaded using multipart upload. The errors indicate that the presigned URL does not work. The data engineer needs to ensure that presigned URLs work for all objects, including those uploaded via multipart upload. What should the data engineer do?

A.Use a different signing algorithm when generating the presigned URL.

B.Ensure that the IAM user or role used to generate the presigned URL has s3:GetObject permission for the object.

C.Enable S3 Versioning on the bucket.

D.Re-upload the objects using single-part upload instead of multipart upload.

AnswerB

Permissions are required to generate a valid presigned URL.

Why this answer

Presigned URLs work regardless of how the object was uploaded; multipart upload only changes details such as the ETag, which is no longer a simple MD5 hash. A presigned URL carries the permissions of the identity that signed it, so the likely cause is that the IAM user or role used to generate the URL lacks s3:GetObject permission for the object.

The correct action is to ensure the IAM policy grants s3:GetObject access and that the URL is generated with a supported method (e.g., the AWS SDK). Option B is correct. Option C is wrong because enabling versioning does not affect presigned URL generation.

Option A is wrong because a different signing algorithm is unnecessary; SigV4 is the default. Option D is wrong because multipart upload does not affect presigned URLs.

325

Multi-Selecteasy

Which THREE actions can help improve read performance in Amazon DynamoDB? (Choose THREE.)

Select 3 answers

A.Use DynamoDB global tables to replicate data.

B.Use parallel scans to distribute read load across partitions.

C.Use strongly consistent reads for all queries.

D.Enable DynamoDB Accelerator (DAX) to cache reads.

E.Increase the read capacity units (RCU) for the table.

AnswersB, D, E

Parallel scans can improve scan performance.

Why this answer

Option B is correct because a parallel scan splits a Scan into segments that run concurrently, which improves scan performance. Option D is correct because DAX provides in-memory caching for reads. Option E is correct because increasing the read capacity units gives the table more read throughput.

Option A is wrong because global tables replicate data across Regions; they are not a way to improve read performance on a table. Option C is wrong because strongly consistent reads are slower than eventually consistent reads and consume twice the read capacity.
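
The capacity cost of the two read types can be worked out with a short calculation. It follows the documented rule that one read capacity unit covers one strongly consistent read per second, or two eventually consistent reads per second, for an item of up to 4 KB; the function name is made up for this sketch.

```python
import math


def read_capacity_units(item_kb: float, strongly_consistent: bool) -> float:
    """RCUs consumed by one read per second of an item of the given size."""
    blocks = math.ceil(item_kb / 4)  # reads are metered in 4 KB increments
    return blocks * (1.0 if strongly_consistent else 0.5)


for size_kb in (1, 4, 6):
    strong = read_capacity_units(size_kb, strongly_consistent=True)
    eventual = read_capacity_units(size_kb, strongly_consistent=False)
    print(f"{size_kb} KB item: strong={strong} RCU, eventual={eventual} RCU")
```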

326

MCQhard

A data engineer runs the above AWS CLI command to investigate who uploaded a file to an S3 bucket. The output shows the event was recorded. Which additional step is needed to confirm the identity of the user?

A.No additional step is needed; the 'Username' field already identifies the IAM user.

B.Use the 'MFA' field to check if multi-factor authentication was used.

C.View the 'accessKeyId' field in the CloudTrailEvent JSON.

D.Look up the 'sourceIPAddress' in the CloudTrailEvent.

AnswerA

The 'Username' field contains the full ARN of the IAM user who made the request.

Why this answer

Option A is correct because the 'Username' field shows the ARN of the IAM user, which is the identity that made the API call. Option C (access key ID) is not shown in the output. Option D (IP address) is not in the output.

Option B (MFA) is not recorded in this field.

Full explanation →

327

MCQeasy

A data engineer is ingesting streaming data from thousands of IoT devices into AWS. The data is JSON-formatted and must be stored in Amazon S3 for long-term analytics. Which service is most appropriate for real-time ingestion and routing to S3?

A.Amazon SQS

B.Amazon Kinesis Data Firehose

C.Amazon Kinesis Data Streams

D.AWS Glue

AnswerB

Kinesis Data Firehose can deliver streaming data directly to S3 without additional code.

Why this answer

Amazon Kinesis Data Firehose is the easiest way to load streaming data into S3. Kinesis Data Streams requires custom consumers; AWS Glue is batch ETL; Amazon SQS is not designed for direct S3 delivery.

Full explanation →

328

MCQhard

A healthcare company is ingesting patient data from a legacy system into an Amazon S3 data lake using AWS Glue. The legacy system produces CSV files with inconsistent schemas (columns may appear or disappear in different files). The data engineer needs to create a Glue ETL job that can handle schema evolution and transform the data into a standardized parquet format. The job should also be able to process new files as they arrive. Which approach should the data engineer use?

A.Use AWS Glue crawlers to create a schema in the Data Catalog and then use a standard Spark DataFrame for transformation.

B.Use AWS Glue DynamicFrames to read the CSV files and apply transformations using resolveChoice and applyMapping.

C.Use a Python shell job in Glue to manually parse each file and write to parquet.

D.Use a Glue ETL job with a static schema defined in the script and ignore files that don't match.

AnswerB

DynamicFrames support schema evolution.

Why this answer

Option B is correct because Glue DynamicFrames can handle schema evolution by allowing schema-on-read and resolving schema inconsistencies. Option A is wrong because crawlers are for cataloging, not ETL, and a standard Spark DataFrame expects a predefined schema. Option D is wrong because a static schema would fail with evolving schemas.

Option C is wrong because manually parsing each file in a Python shell job is error-prone and does not scale.

Full explanation →

329

Multi-Selecteasy

A data engineer is monitoring Amazon CloudWatch metrics for an Amazon Redshift cluster and notices high CPU utilization. The engineer wants to reduce CPU usage. Which TWO actions should the engineer take?

Select 2 answers

A.Enable concurrency scaling to offload read queries to additional clusters.

B.Increase the number of nodes in the cluster.

C.Optimize the table design by using sort keys and compression.

D.Run the VACUUM command on all tables.

E.Enable audit logging to monitor queries.

AnswersA, C

Offloads queries, reducing CPU on main cluster.

Why this answer

Options A and C are correct. Enabling concurrency scaling offloads queries, and using sort keys reduces the amount of data scanned. Option D is wrong because VACUUM does not reduce CPU usage significantly.

Option B is wrong because increasing the node count increases CPU capacity but does not reduce usage. Option E is wrong because enabling audit logging adds CPU overhead.

Full explanation →

330

Multi-Selectmedium

A data engineer needs to schedule a nightly ETL job that reads from an Amazon RDS database and writes to Amazon S3 in Parquet format. The solution must be serverless and minimize cost. Which TWO AWS services should be used? (Choose TWO.)

Select 2 answers

A.AWS Data Pipeline

B.AWS Lambda

C.Amazon Athena

D.Amazon S3

E.AWS Glue

AnswersD, E

S3 is the destination for the transformed data.

Why this answer

AWS Glue can run serverless ETL jobs. Amazon S3 is the destination. Lambda could trigger but not run the ETL itself; Data Pipeline is not serverless; Athena is query-only.

Full explanation →

331

MCQmedium

Refer to the exhibit. A data engineer sees this error in CloudWatch Logs from an AWS Glue ETL job. The job reads from an S3 location that contains both .parquet and .csv files. What is the most likely cause?

A.The S3 object was deleted during the job execution.

B.The IAM role does not have permission to read the S3 object.

C.The job is reading a CSV file that was incorrectly placed in the directory with .parquet extension.

D.The Glue job does not have enough memory to parse the Parquet file.

AnswerC

The file might have .parquet extension but be CSV, or the job is reading all files regardless of extension.

Why this answer

Option C is correct because the error indicates that the object is not a valid Parquet file. Since the job expects Parquet, it likely encountered a CSV file. Option A is wrong because the error specifically says invalid Parquet, not missing file.

Option D is wrong because insufficient memory would cause OOM errors. Option B is wrong because the error is about the object's format, not permissions.

Full explanation →

332

MCQhard

A company has a Glue ETL job that reads from an Amazon RDS for MySQL table and writes to Amazon S3. The job runs hourly and processes new records based on a 'last_modified' timestamp column. Recently, the job started missing some records because the timestamp in MySQL is stored with microsecond precision but Glue's job bookmark only tracks second precision. Which solution addresses this issue?

A.Use a job parameter to store the last processed timestamp with millisecond precision and query records greater than that value.

B.Increase the job frequency to every 30 minutes.

C.Run a full refresh of the table each time instead of incremental.

D.Modify the MySQL table to use a DATE data type instead of TIMESTAMP.

AnswerA

Custom job bookmark with higher precision.

Why this answer

Option A is correct because AWS Glue job bookmarks track timestamps with only second precision, so records with microsecond differences within the same second are missed. By using a custom job parameter to store the last processed timestamp with millisecond precision and querying records greater than that value, you bypass Glue's bookmark limitation and capture all new or modified records.

Exam trap

The trap here is that candidates assume Glue job bookmarks automatically handle all timestamp precisions, but the exam tests awareness that bookmarks default to second-level granularity and that custom logic is required for sub-second precision.

How to eliminate wrong answers

Option B is wrong because increasing job frequency does not address the precision mismatch; it only reduces the window for missed records but does not eliminate the root cause of second-level granularity. Option C is wrong because running a full refresh each time is inefficient and costly, and it does not solve the precision issue—it simply avoids incremental processing. Option D is wrong because changing the column to DATE data type would lose time-of-day information entirely, making incremental processing based on last_modified impossible.
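The custom-watermark idea in Option A can be sketched in plain Python. This is a minimal illustration, not Glue's bookmark API: the helper names `build_incremental_query` and `advance_watermark` and the table name `orders` are hypothetical, and the sketch assumes the watermark is persisted between runs (for example as a job parameter) with full sub-second precision.

```python
from datetime import datetime

def build_incremental_query(table, last_watermark):
    """Build a query that selects rows strictly newer than the watermark.

    The watermark is formatted with its fractional seconds so that rows
    written within the same second are not skipped.
    """
    ts = last_watermark.strftime("%Y-%m-%d %H:%M:%S.%f")
    return f"SELECT * FROM {table} WHERE last_modified > '{ts}'"

def advance_watermark(rows, last_watermark):
    """Return the rows newer than the watermark and the next watermark."""
    new_rows = [r for r in rows if r["last_modified"] > last_watermark]
    if not new_rows:
        return [], last_watermark
    return new_rows, max(r["last_modified"] for r in new_rows)

# Two rows that share the same second but differ in the fractional part.
rows = [
    {"id": 1, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 250000)},
    {"id": 2, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 750000)},
]
watermark = datetime(2024, 1, 1, 12, 0, 0, 250000)
new_rows, watermark = advance_watermark(rows, watermark)
print([r["id"] for r in new_rows])
print(build_incremental_query("orders", watermark))
```

A second-precision watermark would treat both rows as already processed; keeping the fractional part is what lets the second row through.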

Full explanation →

333

MCQhard

A company stores sensitive customer data in an Amazon S3 bucket with versioning enabled. A data engineer accidentally deleted the current version of an object. What is the quickest way to restore the object to its previous state without additional data transfer costs?

A.Use S3 Batch Operations to restore the object from the Recycle Bin.

B.Delete the delete marker that was created by the deletion.

C.Copy the previous version from the bucket to itself.

D.Use the S3 sync command to restore the previous version.

AnswerB

Deleting the delete marker restores the previous version as the current object without copying data.

Why this answer

With versioning enabled, deleting an object inserts a delete marker as the current version. To restore, you delete the delete marker, which makes the previous version current again. Option B is correct.

Option C copies the data, which adds time and cost. Option D is wrong because the sync command does not restore a previous version. Option A is not a valid operation; S3 has no Recycle Bin.

Full explanation →

334

MCQhard

A data pipeline uses AWS Glue ETL jobs to process data from Amazon RDS for MySQL to Amazon S3. Recently, the jobs have been failing with the error 'Communications link failure' during the connection phase. The RDS instance is in a private subnet, and the Glue job uses a VPC endpoint for S3. What is the most likely cause?

A.The RDS database has reached the maximum number of connections.

B.The Glue job does not have IAM permissions to decrypt the RDS database using AWS KMS.

C.The JDBC driver used by Glue is incompatible with the MySQL version.

D.The Glue job does not have a network path to the RDS instance because it is not attached to the same VPC subnet.

AnswerD

Glue jobs need an ENI in the same VPC to connect to RDS.

Why this answer

Option D is correct because a VPC endpoint for S3 only gives the Glue job access to S3; to reach RDS, the job must have network connectivity to the RDS subnet, typically via an ENI in the same VPC. Option B is wrong because missing KMS permissions would cause an access denied error, not a connection failure. Option A is wrong because a MySQL instance that has reached its connection limit returns a 'Too many connections' error rather than 'Communications link failure'.

Option C is less likely because 'Communications link failure' during the connection phase points to the network path, not the driver.

Full explanation →

335

MCQeasy

A company uses AWS Glue to process sensitive data stored in Amazon S3. The security team requires that all data in transit between AWS Glue and S3 be encrypted. Which configuration should be used to meet this requirement?

A.Use an S3 bucket policy that denies requests not using HTTPS.

B.Use an AWS KMS key to encrypt the data before uploading to S3.

C.Configure AWS Glue to use SSL by setting the 'ssl' parameter to 'true'.

D.Enable default encryption on the S3 bucket using SSE-S3.

AnswerA

This enforces encryption in transit for all requests.

Why this answer

Option A is correct because requiring HTTPS for all requests to the S3 bucket ensures that data in transit between AWS Glue and S3 is encrypted using TLS. By using an S3 bucket policy with a condition that denies requests where `aws:SecureTransport` is false, the company enforces encryption for all connections, including those from AWS Glue. This meets the security requirement without needing to modify Glue or S3 configurations beyond the bucket policy.

Exam trap

The trap here is that candidates often confuse encryption at rest (SSE-S3, SSE-KMS, client-side encryption) with encryption in transit (TLS/HTTPS), and may incorrectly assume that enabling default encryption or using KMS keys secures the data during transfer.

How to eliminate wrong answers

Option B is wrong because encrypting data with an AWS KMS key before uploading to S3 (client-side encryption) protects data at rest, not data in transit; the security team specifically requires encryption in transit. Option C is wrong because AWS Glue does not have an 'ssl' parameter; Glue uses HTTPS by default when connecting to S3, and this setting is not configurable via a simple parameter. Option D is wrong because enabling default encryption on the S3 bucket (SSE-S3) only encrypts data at rest, not data in transit between Glue and S3.
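As an illustration of the bucket policy described in Option A, the sketch below builds a deny statement keyed on `aws:SecureTransport`. The bucket name `example-glue-data-bucket` and the helper `deny_insecure_transport_policy` are placeholders, and a real policy would likely carry additional statements.

```python
import json

def deny_insecure_transport_policy(bucket_name):
    """Return a bucket policy that denies any request not made over HTTPS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both bucket-level and object-level operations.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }

policy = deny_insecure_transport_policy("example-glue-data-bucket")
print(json.dumps(policy, indent=2))
```

Using an explicit Deny means the rule applies to every principal, including the Glue job role, regardless of what other statements allow.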

Full explanation →

336

MCQhard

A company is using Amazon DynamoDB for an e-commerce application. The application experiences sudden spikes in traffic, causing throttling errors. The data engineer needs to handle the spikes cost-effectively. Which solution should be used?

A.Implement DynamoDB Accelerator (DAX) to cache reads.

B.Switch to DynamoDB on-demand capacity mode.

C.Use DynamoDB auto scaling with a target utilization of 70%.

D.Provision high read and write capacity units to handle peak traffic.

AnswerC

Auto scaling adjusts capacity dynamically based on traffic.

Why this answer

Option C is correct because DynamoDB auto scaling with a target utilization of 70% allows the table to dynamically adjust provisioned read/write capacity based on actual traffic patterns, handling sudden spikes without manual intervention while avoiding over-provisioning. This balances performance and cost by scaling up during spikes and scaling down during low traffic, preventing throttling errors cost-effectively.

Exam trap

The trap here is that candidates often confuse caching (DAX) with scaling, or assume on-demand mode is always the best for spikes without considering cost, when the question explicitly requires a cost-effective solution for sudden but intermittent traffic.

How to eliminate wrong answers

Option A is wrong because DynamoDB Accelerator (DAX) is an in-memory cache that only improves read latency and reduces read throttling, but it does not address write throttling or handle sudden spikes in write traffic, which is the primary issue here. Option B is wrong because DynamoDB on-demand capacity mode automatically scales to handle spikes but is significantly more expensive for predictable or moderate workloads, making it less cost-effective than auto scaling for this scenario. Option D is wrong because provisioning high read and write capacity units to handle peak traffic leads to over-provisioning and wasted cost during normal or low traffic periods, as you pay for the provisioned capacity regardless of actual usage.
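To make the 70% target utilization concrete, the arithmetic can be sketched as below. This is a simplified model of target tracking, not the service's actual algorithm: it assumes auto scaling aims to keep consumed capacity at the target percentage of provisioned capacity, and the function name `provisioned_for_target` is hypothetical.

```python
def provisioned_for_target(consumed_units, target_percent=70):
    """Capacity needed so that consumed_units equals target_percent of it.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    return -(-consumed_units * 100 // target_percent)

# Normal traffic versus a spike, both measured in consumed capacity units.
print(provisioned_for_target(700))
print(provisioned_for_target(2100))
```

Under this model the table keeps roughly 30% headroom above observed consumption, and the provisioned level follows traffic down again after the spike.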

Full explanation →

337

MCQmedium

An e-commerce company ingests clickstream data from their website into Amazon S3. The data is in JSON format, and each file is about 10 MB. They need to transform the data into a columnar format for analytics and load it into Amazon Redshift nightly. The transformation should be cost-effective and require minimal operational overhead. Which approach meets these requirements?

A.Use AWS Glue ETL job to convert to Parquet and load into Redshift.

B.Use Amazon Redshift COPY command to load JSON directly.

C.Use Amazon EMR with Spark to transform and load data.

D.Use AWS Lambda to transform each file and write to Redshift.

AnswerA

Serverless and minimal overhead.

Why this answer

AWS Glue ETL is the correct choice because it is a serverless, managed service that can efficiently convert JSON to Parquet (a columnar format optimized for Redshift) and load the data into Redshift with minimal operational overhead. The nightly batch processing of 10 MB files is well-suited for Glue's pay-per-use pricing, making it cost-effective without requiring infrastructure management.

Exam trap

The trap here is that candidates may choose Amazon EMR or Lambda because they are familiar with Spark or serverless functions, but they overlook the operational overhead of EMR and the execution limits of Lambda for batch workloads, while Glue provides a balanced, managed solution for this specific use case.

How to eliminate wrong answers

Option B is wrong because the Redshift COPY command can load JSON directly, but it does not transform the data into a columnar format like Parquet; it loads JSON as-is, which is less efficient for analytics and may require additional schema handling. Option C is wrong because Amazon EMR with Spark introduces significant operational overhead for managing clusters, tuning, and monitoring, which is unnecessary for a simple nightly transformation of small 10 MB files. Option D is wrong because AWS Lambda has a maximum execution timeout of 15 minutes and limited memory (up to 10 GB), making it unsuitable for batch processing multiple files or handling large datasets; it is designed for event-driven, short-lived tasks, not nightly ETL workloads.

Full explanation →

338

MCQhard

A data engineer is troubleshooting a failed AWS Glue job that reads from an Apache Hive metastore in an Amazon EMR cluster. The error message indicates 'ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException'. The Glue job uses a custom Python shell script. What is the most likely cause of this error?

A.Check the network connectivity between Glue and the EMR cluster.

B.Include the Hive JAR files in the 'Python library path' or use a Glue version with Hive support.

C.Modify the Python script to import the Hive libraries manually.

D.Update the IAM role to allow 'hive:Describe*' actions.

AnswerB

Glue needs Hive JARs in the classpath to connect to Hive metastore.

Why this answer

Option B is correct because the ClassNotFoundException for Hive classes indicates that the required Hive JARs are not available in the Glue job's classpath. The Glue job needs to include Hive JARs either via a library path or by using a Glue version that supports Hive connectivity. Option C is incorrect because the Python script itself does not need to be modified to include imports; the JARs must be provided.

Option A is incorrect because network connectivity would not cause a class-not-found error. Option D is incorrect because the error is not about IAM permissions.

Full explanation →

339

MCQeasy

A data engineer is tasked with transforming JSON data from an S3 bucket into Parquet format for efficient querying. The transformation should run on a schedule every hour. Which AWS service is best suited for this task?

A.AWS Lambda

B.Amazon Athena

C.AWS Glue

D.Amazon EMR

AnswerC

Glue provides managed ETL jobs that can be scheduled and support Parquet conversion.

Why this answer

Option C (AWS Glue) is correct because it offers scheduled ETL jobs that can read from S3, transform data, and write back in Parquet format. Option B (Amazon Athena) is a query service, not for scheduled transformations. Option D (Amazon EMR) is more complex and typically used for large-scale processing.

Option A (AWS Lambda) has a 15-minute timeout and is not ideal for hourly batch jobs.

Full explanation →

340

MCQmedium

A data engineer notices that an AWS Glue ETL job processing data from Amazon S3 to Amazon Redshift has been failing intermittently with the error 'S3ServiceException: SlowDown'. Which action is MOST likely to resolve this issue?

A.Increase the number of partitions in the Glue job to parallelize reads.

B.Switch from a Standard to a G.2X large Glue worker type.

C.Implement exponential backoff and retry logic in the Glue job.

D.Enable S3 Transfer Acceleration on the source bucket.

AnswerC

Exponential backoff reduces request rate and handles throttling gracefully.

Why this answer

Option C is correct because the 'SlowDown' error indicates request throttling by S3; implementing exponential backoff and retries reduces the request rate and lets throttled requests succeed on a later attempt. Option A is wrong because increasing partitions could increase the number of requests. Option B is wrong because switching to a larger worker type does not affect the S3 request rate.

Option D is wrong because S3 Transfer Acceleration speeds up long-distance transfers and does not raise S3 request rate limits.
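A minimal sketch of exponential backoff with jitter is shown below. It is generic Python rather than Glue or AWS SDK code: `SlowDownError` is a stand-in exception for an S3 throttling response, and `call_with_backoff` and its parameters are hypothetical names chosen for illustration.

```python
import random
import time

class SlowDownError(Exception):
    """Stand-in for an S3 'SlowDown' throttling error."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on SlowDownError with growing random delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except SlowDownError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the throttling error.
            # The delay ceiling doubles each attempt; jitter spreads retries out.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))

# Simulated operation that is throttled twice before succeeding.
calls = {"count": 0}
def flaky_read():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise SlowDownError()
    return "data"

delays = []
result = call_with_backoff(flaky_read, sleep=delays.append)
print(result, len(delays))
```

Passing `sleep` as a parameter is only a convenience so the retry schedule can be inspected without actually waiting.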

Full explanation →

341

MCQmedium

An organization needs to audit all access to their S3 buckets for compliance purposes. They want to log both successful and failed API calls. Which AWS service should be used?

A.Amazon CloudWatch Logs

B.AWS Config

C.AWS CloudTrail

D.VPC Flow Logs

AnswerC

CloudTrail logs API calls for auditing.

Why this answer

Option C is correct because AWS CloudTrail logs API activity in an AWS account, including S3 operations. Option A (Amazon CloudWatch) monitors metrics, not API calls. Option B (AWS Config) tracks resource configuration changes, not API calls.

Option D (VPC Flow Logs) captures network traffic, not API calls.

Full explanation →

342

MCQhard

A company runs an Amazon Redshift cluster for analytics. During peak hours, query performance degrades significantly. The data engineer notices that disk space usage is above 80% on many nodes. Which of the following is the MOST effective long-term solution to improve query performance?

A.Increase the workload management (WLM) queue slots.

B.Resize the cluster to include additional nodes.

C.Apply compression encoding to all columns.

D.Run the VACUUM command to reclaim space.

AnswerB

Adding nodes increases both storage and compute resources, directly addressing disk usage and performance.

Why this answer

Resizing the cluster to add more nodes increases total storage and compute capacity, reducing disk pressure and improving performance. Option D is wrong because vacuuming reclaims space but does not add capacity. Option C is wrong because compression helps but may not be sufficient.

Option A is wrong because adding WLM queue slots only addresses queries waiting for resources.

Full explanation →

343

Multi-Selectmedium

A data engineer is designing a data store for a real-time analytics application that requires sub-millisecond read and write latency for time-series data. The data volume is expected to grow to hundreds of terabytes. Which TWO AWS services should the engineer consider? (Choose TWO.)

Select 2 answers

A.Amazon Redshift

B.Amazon DynamoDB with Time-to-Live (TTL)

C.Amazon ElastiCache for Redis

D.Amazon RDS for PostgreSQL

E.Amazon Timestream

AnswersB, E

DynamoDB supports low-latency reads/writes and TTL for automatic expiration of old data.

Why this answer

Options B and E are correct. Amazon Timestream is a time-series database with fast query performance; Amazon DynamoDB with TTL handles time-series data with low-latency reads and writes and expires old items automatically. Option A (Redshift) is for analytics, not sub-millisecond access; Option D (RDS) is relational and slower for time-series; Option C (ElastiCache) is in-memory but limited by memory size for hundreds of terabytes.

Full explanation →

344

MCQmedium

A company uses AWS Glue ETL jobs to process data from an S3 data lake. The job reads data in CSV format, transforms it, and writes to Parquet. The job runs daily and takes 2 hours to complete. The data volume is increasing by 20% each month. The engineer wants to reduce the job runtime. Which action is most effective?

A.Increase the number of DPUs for the Glue job

B.Enable compression on the input CSV files

C.Switch from Python Shell to Spark ETL

D.Partition the input data in S3 by date and use partition pruning in the job

AnswerD

Partition pruning limits the data read to only relevant partitions, drastically reducing processing time.

Why this answer

Option D is correct because partitioning data by date (e.g., year/month/day) allows Glue to read only the new partitions incrementally, reducing data scanned. Option A (increasing DPUs) may help but not as much as reducing data volume. Option C (switching to Spark) does not apply because Glue ETL jobs already run on Spark.

Option B (compressing the input CSV files) reduces storage and transfer size but does not reduce the number of records the job must process.
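The effect of date partitioning can be sketched by listing only the prefixes a daily run needs. This is an illustrative helper, not Glue code: the bucket `example-bucket`, the `year=/month=/day=` layout, and the function `partition_prefixes` are assumptions. In a Glue job the same idea is usually expressed as a push-down predicate on the partition columns.

```python
from datetime import date, timedelta

def partition_prefixes(base_prefix, start, end):
    """Return one year=/month=/day= prefix per day in the range [start, end]."""
    prefixes = []
    day = start
    while day <= end:
        prefixes.append(
            f"{base_prefix}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        )
        day += timedelta(days=1)
    return prefixes

# A run that only needs the last three days reads three prefixes,
# no matter how much history has accumulated under the base prefix.
for prefix in partition_prefixes("s3://example-bucket/events",
                                 date(2024, 2, 28), date(2024, 3, 1)):
    print(prefix)
```

Because the job touches a fixed number of partitions per run, its runtime stays roughly tied to daily volume rather than to the total size of the data lake.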

Full explanation →

345

MCQmedium

A streaming application sends data to Amazon Kinesis Data Streams. The data must be enriched with reference data from an Amazon DynamoDB table in real-time. Which AWS service can be used to perform this enrichment with minimal latency?

A.Amazon Kinesis Data Analytics for Apache Flink

B.Amazon Kinesis Data Firehose with Lambda transformation

C.AWS Lambda function triggered by Kinesis Data Streams

D.AWS Glue streaming ETL

AnswerA

Flink can perform low-latency stream processing and join with DynamoDB.

Why this answer

Option A is correct because Kinesis Data Analytics for Apache Flink can process streaming data and join with DynamoDB in real-time. Option C is wrong because Lambda can enrich but may add latency due to cold starts. Option B is wrong because Kinesis Data Firehose buffers records before delivery and is not designed for low-latency enrichment from DynamoDB.

Option D is wrong because Glue streaming ETL processes data in micro-batches, which adds latency.

Full explanation →

346

MCQeasy

A data engineer is investigating why Amazon Athena queries on the 'my-data-lake' bucket are slow. The table is partitioned by year/month/day. The exhibit shows the objects in one partition. What is the MOST likely cause of poor query performance?

A.The files are too small, causing excessive read overhead

B.The files are not compressed

C.The partition columns are not appropriately chosen

D.The data format is CSV instead of Parquet

AnswerA

Many small files cause many S3 GET requests and slow performance.

Why this answer

Option A is correct because the exhibit shows tiny files (50 bytes), which cause high per-file request and metadata overhead and slow query performance. Option B is about compression, which cannot be determined from the exhibit. Option C is about partitioning, which is fine.

Option D is about format, but CSV is a standard, supported format.

Full explanation →

347

MCQeasy

A company uses Amazon RDS for MySQL to store application data. The database contains personally identifiable information (PII). The security team requires that all data be encrypted at rest using AWS KMS. The database is currently unencrypted. The data engineer needs to enable encryption without significant downtime. Which approach should the data engineer take?

A.Use AWS DMS to migrate data to a new encrypted RDS instance continuously.

B.Take a snapshot of the database, copy it with encryption enabled, and restore from the encrypted snapshot.

C.Create a read replica with encryption enabled and promote it to primary.

D.Modify the RDS instance and enable encryption in the configuration.

AnswerB

Standard procedure to enable encryption on existing RDS.

Why this answer

Option B is correct. To enable encryption on an existing RDS instance, you take a snapshot, copy it with encryption enabled, and restore from the encrypted snapshot. This process involves some downtime but is the standard way.

Option D is wrong because you cannot enable encryption by modifying an existing instance. Option C is wrong because you cannot create an encrypted read replica of an unencrypted instance. Option A is wrong because DMS requires a separate source and target; the target can be encrypted, but the setup is more complex.

Full explanation →

348

MCQeasy

A company runs an Amazon RDS for PostgreSQL database and wants to capture change data (inserts, updates, deletes) to stream into Amazon Kinesis Data Streams for real-time processing. Which AWS service should be used to capture the changes directly from the database?

A.Amazon RDS automated snapshots

B.AWS Glue ETL job scheduled to run every minute

C.Amazon Kinesis Agent

D.AWS Database Migration Service (DMS) with ongoing replication

AnswerD

DMS supports CDC and can stream changes to Kinesis.

Why this answer

AWS DMS with ongoing replication (change data capture) is the correct service because it can continuously capture insert, update, and delete operations from the PostgreSQL transaction logs (WAL) and stream them to a Kinesis Data Streams endpoint. This allows real-time processing without modifying the source database or requiring application-level triggers.

Exam trap

The trap here is that candidates confuse scheduled polling (Glue) or file-based agents (Kinesis Agent) with true CDC, failing to recognize that only DMS ongoing replication can stream row-level changes directly from the database transaction log in real time.

How to eliminate wrong answers

Option A is wrong because Amazon RDS automated snapshots are point-in-time backups of the entire database, not a mechanism to capture individual row-level changes in real time. Option B is wrong because an AWS Glue ETL job scheduled every minute introduces at least 60 seconds of latency and cannot capture every single change as it happens, making it unsuitable for true real-time streaming. Option C is wrong because Amazon Kinesis Agent is designed to stream log files (e.g., from EC2 instances) to Kinesis, not to connect directly to a database and read transactional changes from its WAL.

Full explanation →

349

MCQmedium

A company is running a production Amazon RDS for MySQL Multi-AZ DB instance. The database experiences a sudden spike in read requests, causing performance degradation. The company needs to improve read scalability with minimal application changes. Which solution should the data engineer recommend?

A.Implement DynamoDB Accelerator (DAX) in front of the database.

B.Enable Multi-AZ on the existing DB instance.

C.Increase the DB instance size to a larger instance class.

D.Create an Amazon RDS Read Replica and update the application to use it for read queries.

AnswerD

Read Replicas handle read traffic, offloading the primary instance and improving scalability.

Why this answer

Creating one or more Read Replicas distributes read traffic away from the primary DB instance, improving read scalability with minimal application changes (only the connection strings for read queries need to be updated). Option B is wrong because Multi-AZ is for high availability, not read scaling. Option C is wrong because increasing the instance class provides more resources but is a vertical scaling approach and may not be cost-effective.

Option A is wrong because DynamoDB Accelerator is for DynamoDB, not RDS.

Full explanation →

350

Multi-Selecteasy

A company is designing a data lake on Amazon S3. The data includes CSV files, Parquet files, and images. The data engineering team needs to catalog the metadata and enable SQL queries. Which TWO AWS services should be used together?

Select 2 answers

A.Amazon EMR

B.Amazon Redshift Spectrum

C.Amazon QuickSight

D.Amazon Athena

E.AWS Glue

AnswersD, E

Athena can directly query data in S3 using the Glue Data Catalog.

Why this answer

Amazon Athena is correct because it is a serverless interactive query service that can directly query data stored in Amazon S3 using standard SQL, without needing to load or transform data. AWS Glue is correct because it provides a fully managed data catalog (AWS Glue Data Catalog) that stores metadata about the data lake's schema, partitions, and locations, which Athena can use to discover and query the data efficiently.

Exam trap

The trap here is that candidates often confuse Amazon Redshift Spectrum (which requires a Redshift cluster) with Athena (which is serverless), or they think Amazon EMR is needed for SQL queries on S3, not realizing Athena provides a simpler, cluster-free solution.

Full explanation →

351

MCQmedium

A company uses AWS Lake Formation to manage permissions on a data lake stored in S3. A data scientist is unable to query a table in Amazon Athena, receiving an 'Access Denied' error. The data scientist has IAM permissions to call Athena and has been granted SELECT permission on the table in Lake Formation. What is the most likely cause?

A.The data scientist does not have DESCRIBE permission on the table.

B.The data is encrypted with SSE-KMS and the data scientist lacks kms:Decrypt permission.

C.The S3 bucket policy denies access to the data scientist's IAM role.

D.The S3 bucket containing the data is not registered as a Lake Formation location.

AnswerD

Prevents Lake Formation from granting S3 access.

Why this answer

Lake Formation can only grant access to S3 data whose location is registered with it. Option D is correct because if the underlying S3 location is not registered, Lake Formation cannot provide access to the data and the query fails. Option A is wrong because the data scientist has SELECT permission.

Option B is wrong because encryption is not the issue. Option C is wrong because, for registered locations, Lake Formation accesses S3 through its registered role rather than the data scientist's IAM role.

Full explanation →

352

MCQhard

A company uses AWS Glue to transform data stored in Amazon S3. During a run, the job fails with a 'OutOfMemoryError' in the Spark executor. The job processes 2 TB of parquet files using 10 DPUs. The data is evenly distributed across partitions. Which action would MOST likely resolve the issue without impacting the job logic?

A.Enable S3 request rate increase to speed up data reading.

B.Increase the number of DPUs allocated to the Glue job.

C.Repartition the data to a larger number of partitions.

D.Change the input format from Parquet to Snappy-compressed CSV.

AnswerB

More DPUs increase total memory available.

Why this answer

Option B is correct because increasing DPUs provides more memory and cores. Option C is wrong because repartitioning may not reduce memory per executor if the total number of DPUs is unchanged. Option D is wrong because a compressed input format reduces input size, but Spark uses memory for processing, not just storage.

Option A is wrong because increasing the S3 request rate does not help with executor memory.

Full explanation →

353

MCQeasy

A data engineer runs a Spark job on Amazon EMR that reads data from Amazon S3 and writes results back to S3. The job fails with an 'S3AccessDenied' error. The engineer verifies that the IAM role attached to the EMR cluster has s3:GetObject and s3:PutObject permissions on the relevant buckets. What is the MOST likely cause of the error?

A.S3 Transfer Acceleration is not enabled on the bucket.

B.EMRFS consistent view is not configured.

C.The S3 bucket is in a different AWS Region than the EMR cluster.

D.The IAM role does not have s3:ListBucket permission on the bucket.

AnswerD

EMR requires ListBucket permission to access objects in the bucket.

Why this answer

The IAM role attached to the EMR cluster must have the s3:ListBucket permission on the bucket to allow the Spark job to enumerate objects when reading from S3. Without this permission, even with s3:GetObject and s3:PutObject, the job fails with an 'S3AccessDenied' error because the S3 list operation is required for directory listing and file discovery.

Exam trap

The trap here is that candidates often assume GetObject and PutObject are sufficient for S3 read/write operations, overlooking that the ListBucket permission is required for directory listing and file discovery in Spark jobs.

How to eliminate wrong answers

Option A is wrong because S3 Transfer Acceleration is a feature for faster uploads over long distances and is not required for basic read/write operations; its absence does not cause an access denied error. Option B is wrong because EMRFS consistent view is a consistency mechanism for eventually consistent S3 buckets, not a permission or access control feature; its absence would not produce an S3AccessDenied error. Option C is wrong because while cross-region access can cause latency or additional costs, it does not inherently cause an access denied error as long as the IAM role has the correct permissions and the bucket policy allows cross-region access.

354

MCQmedium

A data engineer needs to ensure that an S3 bucket can only be accessed from a specific VPC. Which policy element should be used?

A.Use the condition key aws:VpcSourceIp in the bucket policy.

B.Use the condition key aws:SourceIp in the bucket policy.

C.Use the condition key aws:SourceVpce in the bucket policy.

D.Use the condition key aws:SourceVpc in the bucket policy.

AnswerD

This restricts access to requests from the specified VPC.

Why this answer

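Option D is correct because the condition key aws:SourceVpc restricts requests to those that originate from a specific VPC. Option C (aws:SourceVpce) limits access to a specific VPC endpoint, not the VPC itself. Option B (aws:SourceIp) matches specific public IP addresses.

Option A (aws:VpcSourceIp) matches the requester's private IP address for requests made through a VPC endpoint; it does not identify the VPC.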
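As a rough illustration of the aws:SourceVpc condition, the sketch below builds a bucket policy that denies requests arriving from anywhere other than one VPC. The bucket name, VPC ID, and helper name are hypothetical, and the key is only present on requests that reach S3 through a VPC endpoint, so this assumes such an endpoint exists.

```python
import json


def vpc_only_bucket_policy(bucket: str, vpc_id: str) -> dict:
    """Build a bucket policy that denies any request not arriving from vpc_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideVpc",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # the bucket itself
                    f"arn:aws:s3:::{bucket}/*",    # every object in it
                ],
                # aws:SourceVpc is populated only for requests via a VPC endpoint.
                "Condition": {"StringNotEquals": {"aws:SourceVpc": vpc_id}},
            }
        ],
    }


# Hypothetical names, for illustration only.
policy = vpc_only_bucket_policy("example-bucket", "vpc-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

A Deny with StringNotEquals is used here rather than an Allow so that other grants cannot widen access; whether that is appropriate depends on who else needs the bucket.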

355

MCQeasy

A data engineer receives an alert that a Kinesis Data Stream has a 'WriteProvisionedThroughputExceeded' error. The stream has 5 shards with 1 MB/s write capacity per shard. The producer application is sending data at 8 MB/s sustained. What should the engineer do to resolve the issue?

A.Reduce the record size to below 1 MB per record.

B.Enable enhanced fan-out on the stream.

C.Increase the number of shards from 5 to 10.

D.Use Kinesis Firehose as an intermediary to buffer data.

AnswerC

More shards increase the total write capacity, matching the 8 MB/s requirement.

Why this answer

The 'WriteProvisionedThroughputExceeded' error indicates that the total write throughput to the Kinesis Data Stream exceeds the provisioned capacity. With 5 shards, each offering 1 MB/s write capacity, the total write capacity is 5 MB/s. The producer is sending 8 MB/s, which is above this limit.

Increasing the number of shards to 10 raises the total write capacity to 10 MB/s, accommodating the sustained 8 MB/s throughput and resolving the throttling.
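The capacity arithmetic above can be checked with a few lines of Python. This is a minimal sketch that considers only the 1 MB/s per-shard byte limit given in the question; the function name is illustrative, and the per-shard records-per-second limit is ignored.

```python
import math


def shards_needed(write_mb_per_s: float, per_shard_mb_per_s: float = 1.0) -> int:
    """Smallest shard count whose combined write capacity covers the load."""
    return math.ceil(write_mb_per_s / per_shard_mb_per_s)


current_capacity = 5 * 1.0      # 5 shards at 1 MB/s each
required = shards_needed(8.0)   # sustained producer rate from the question
print(current_capacity, required)
```

Under these assumptions 8 shards is the minimum, so the 10 shards in option C cover the load with some headroom.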

Exam trap

The trap here is that candidates confuse write-side throttling with read-side limitations, leading them to choose enhanced fan-out (a read-side optimization) instead of scaling shards to increase write capacity.

How to eliminate wrong answers

Option A is wrong because reducing record size below 1 MB does not address the throughput limit; the error is about aggregate write throughput exceeding shard capacity, not individual record size limits. Option B is wrong because enhanced fan-out is a feature for increasing read throughput (up to 2 MB/s per shard per consumer) and does not affect write capacity or resolve write-side throttling. Option D is wrong because Kinesis Firehose is a delivery service that reads from a Kinesis stream; it cannot buffer data before it is written to the stream, so it does not solve the write throughput exceedance at the producer side.

356

MCQeasy

A data engineer needs to grant an IAM role read-only access to Amazon DynamoDB tables in a specific AWS account. Which IAM policy element should be used to restrict access to only the 'GetItem' and 'Query' actions?

A.Resource

B.Action

C.Effect

D.Condition

AnswerB

Action specifies the API actions like GetItem and Query.

Why this answer

The 'Action' element specifies the allowed API actions. 'Effect' is 'Allow' or 'Deny'. 'Resource' specifies the ARN. 'Condition' adds conditions. So Action is correct.
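To make the four elements concrete, here is a small sketch of a read-only policy statement built as a Python dictionary. The account ID and the wildcard table ARN are placeholders chosen for illustration, not values from the question.

```python
import json

# Hypothetical account ID; the Resource ARN pattern is an assumption for the example.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",                                   # Allow or Deny
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],    # which API calls
            "Resource": "arn:aws:dynamodb:*:123456789012:table/*",  # which tables
        }
    ],
}

print(json.dumps(read_only_policy, indent=2))
```

Only the Action list limits the role to GetItem and Query; Resource narrows where those actions apply, and a Condition block could be added for further constraints.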

357

MCQmedium

An e-commerce company uses Amazon DynamoDB as the primary data store for its product catalog. The table has a simple primary key (ProductID) and handles 10,000 writes per second during peak hours. Recently, the engineering team noticed increased write latency and throttled requests during peak times. The table's provisioned write capacity is set to 12,000 WCU. What is the most likely cause of the throttling?

A.The table has reached the maximum number of partitions

B.DynamoDB Accelerator (DAX) is not configured

C.Write traffic is unevenly distributed across partitions

D.A global secondary index is consuming write capacity

AnswerC

Uneven distribution can cause some partitions to throttle even if total capacity is adequate.

Why this answer

Option C is correct because DynamoDB partitions data by the primary key's hash value. If write traffic is unevenly distributed across partitions (e.g., a few ProductIDs receive most writes), those hot partitions can exceed their individual throughput limits (1,000 WCU per partition), causing throttling even when the table's total provisioned WCU of 12,000 is not fully utilized.

Exam trap

The trap here is that candidates assume throttling only occurs when total provisioned capacity is exceeded, overlooking the per-partition throughput limits that cause throttling on hot partitions even when the table's overall WCU is underutilized.

How to eliminate wrong answers

Option A is wrong because DynamoDB tables do not have a maximum number of partitions; partitions are added automatically as storage and throughput needs grow. Option B is wrong because DAX is an in-memory cache for reads, not writes; it does not affect write capacity or throttling. Option D is wrong because a global secondary index (GSI) on a provisioned table has its own write capacity, separate from the table's 12,000 WCU; an under-provisioned GSI can throttle writes to the base table, but the scenario does not mention a GSI or indicate that one is under-provisioned.

358

MCQmedium

A media company stores video metadata in Amazon RDS for PostgreSQL. The database is 500 GB and experiences high write traffic. The data engineer notices that the transaction log (WAL) is growing rapidly, causing storage issues. The company needs to retain backups for 30 days for compliance. The database is currently using automated backups with a retention period of 7 days. Which solution should the engineer implement to address the WAL growth while meeting compliance requirements?

A.Create manual snapshots daily and delete automated backups.

B.Change the instance type to a larger one with more storage.

C.Increase the backup retention period to 30 days.

D.Configure the database to stream WAL files to Amazon S3.

AnswerC

Automated backups manage WAL retention; RDS purges WAL older than retention period.

Why this answer

Option C is correct because increasing the backup retention period to 30 days directly addresses the compliance requirement while also managing WAL growth. In Amazon RDS for PostgreSQL, automated backups rely on WAL files to support point-in-time recovery (PITR). When the retention period is too short (e.g., 7 days), RDS may not purge old WAL segments aggressively enough, causing them to accumulate.

By extending the retention to 30 days, RDS properly manages WAL cleanup based on the new retention window, preventing unbounded WAL growth and meeting the 30-day compliance requirement.

Exam trap

The trap here is that candidates may think manual snapshots or streaming WAL to S3 are valid solutions, but RDS for PostgreSQL does not support direct WAL streaming to S3, and manual snapshots do not control WAL retention—only the automated backup retention period governs WAL cleanup in RDS.

How to eliminate wrong answers

Option A is wrong because creating manual snapshots daily and deleting automated backups removes the ability to perform point-in-time recovery (PITR) within the retention window, and manual snapshots do not manage WAL growth—WAL files are still retained for automated backup purposes until they are no longer needed. Option B is wrong because changing the instance type to a larger one with more storage only addresses the symptom (storage filling up) but does not stop the underlying WAL growth; it merely postpones the storage issue and increases cost without solving the root cause. Option D is wrong because streaming WAL files to Amazon S3 is not a native feature of Amazon RDS for PostgreSQL; RDS manages WAL internally and does not expose direct WAL streaming to S3—this option reflects a misunderstanding of RDS architecture.

359

MCQeasy

A data engineer needs to ingest data from an external partner's FTP server to Amazon S3. The data arrives once daily as a CSV file. Which AWS service should be used for this ingestion?

A.AWS DataSync

B.Amazon Kinesis Data Firehose

C.Amazon AppFlow

D.AWS Transfer Family

AnswerD

AWS Transfer Family provides managed FTP and SFTP support for S3.

Why this answer

Option D is correct because AWS Transfer Family supports SFTP and FTP-based transfers into S3. Option A is wrong because AWS DataSync is for transfers between storage systems, but it requires agent installation. Option C is wrong because Amazon AppFlow is for SaaS applications, not FTP.

Option B is wrong because Amazon Kinesis Data Firehose is for streaming data.

360

MCQmedium

A data engineer is troubleshooting a Kinesis Data Analytics application that processes streaming data. The application is falling behind, and the metric 'MillisBehindLatest' is consistently above 60000. The source Kinesis stream has 10 shards, and the application uses a Flink application with default parallelism. What is the MOST likely cause of the lag?

A.The sink (destination) is throttling writes.

B.The Flink application parallelism is set to 1.

C.The Kinesis stream has too few shards.

D.The retention period of the Kinesis stream is too short.

AnswerB

Default parallelism of 1 causes a single consumer to process all shards.

Why this answer

Option B is correct because the default parallelism in Flink is 1, which means only one task processes data from all 10 shards, causing a bottleneck. Option C is wrong because adding shards would allow more parallel reads, but the bottleneck is the application's parallelism, not the shard count. Option D is wrong because the retention period does not affect lag.

Option A is wrong because the Kinesis stream is the source, not the sink, and nothing in the scenario indicates that the destination is throttling.

361

MCQhard

A data engineer is designing a solution to move data from an on-premises Oracle database to Amazon S3 using AWS DMS. The engineer needs to ensure that data changes are replicated continuously with minimal latency. Which DMS configuration is most appropriate?

A.Use AWS SCT to convert the schema and then use DMS for full load

B.Use a full-load task with ongoing replication (CDC)

C.Use a full-load task that runs daily

D.Use Kinesis Data Streams to capture changes and write to S3

AnswerB

CDC captures changes continuously after initial load.

Why this answer

Option B is correct because DMS with CDC (change data capture) captures ongoing changes with low latency. Option C is wrong because a full load only performs a complete copy each time it runs, so a daily full load does not replicate changes continuously. Option A is wrong because SCT is for schema conversion, not data movement.

Option D is wrong because Kinesis is for streaming, but DMS is the right service for database replication.
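As a hedged sketch of what "full load with ongoing replication" looks like in practice, the snippet below assembles the parameters one might pass to the DMS `create_replication_task` API. The task identifier and all three ARNs are placeholders, and the table mapping simply includes every schema and table.

```python
import json

# Selection rule that includes all schemas and tables (illustrative only).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

task_params = {
    "ReplicationTaskIdentifier": "oracle-to-s3-cdc",                 # hypothetical name
    "SourceEndpointArn": "arn:aws:dms:region:account:endpoint:SRC",  # placeholder
    "TargetEndpointArn": "arn:aws:dms:region:account:endpoint:TGT",  # placeholder
    "ReplicationInstanceArn": "arn:aws:dms:region:account:rep:INST", # placeholder
    "MigrationType": "full-load-and-cdc",   # initial copy, then continuous changes
    "TableMappings": json.dumps(table_mappings),
}

# With real ARNs and credentials, the call would look like:
#   import boto3
#   boto3.client("dms").create_replication_task(**task_params)
print(task_params["MigrationType"])
```

The other accepted migration types are `full-load` and `cdc`; only `full-load-and-cdc` covers both the initial copy and the continuous replication the question asks for.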

362

MCQeasy

A company is using Amazon Kinesis Data Firehose to ingest data into Amazon S3. The data must be transformed from JSON to Parquet format before delivery. Which feature should be enabled on the Firehose delivery stream?

A.Amazon Kinesis Data Analytics

B.Amazon S3 event notifications

C.Format conversion (Parquet/ORC)

D.AWS Lambda transformation

AnswerC

Firehose natively supports converting JSON to Parquet or ORC.

Why this answer

Option C is correct because Firehose can convert the input data format to Parquet or ORC using its built-in format conversion feature. Option D is wrong because Lambda transformation is for custom code, not format conversion. Option B is wrong because S3 events are for notifications, not transformation.

Option A is wrong because Kinesis Data Analytics is for stream processing, not directly tied to Firehose.
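For orientation, the built-in format conversion is configured as a block inside the delivery stream's extended S3 destination settings. The sketch below builds that block as a dictionary; the helper name, the Glue database and table, and the role ARN are all hypothetical, and the schema is assumed to live in the AWS Glue Data Catalog.

```python
def parquet_conversion_config(database: str, table: str, role_arn: str, region: str) -> dict:
    """Build a DataFormatConversionConfiguration block for JSON -> Parquet."""
    return {
        "Enabled": True,
        # How incoming JSON records are parsed.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # How records are written out.
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
        # The Glue table that supplies the column schema.
        "SchemaConfiguration": {
            "DatabaseName": database,
            "TableName": table,
            "RoleARN": role_arn,
            "Region": region,
        },
    }


# Hypothetical names, for illustration only.
conversion = parquet_conversion_config(
    "analytics_db", "events", "arn:aws:iam::123456789012:role/firehose-role", "us-east-1"
)
print(sorted(conversion))
```

This dictionary would sit under `ExtendedS3DestinationConfiguration["DataFormatConversionConfiguration"]` when creating or updating the delivery stream.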

363

Multi-Selectmedium

A data engineer is configuring a data lake on Amazon S3 that contains sensitive customer information. The company requires that all access to this data be logged and monitored, and that any data shared with external partners must be anonymized before leaving the S3 bucket. Which combination of AWS services should the engineer use to meet these requirements? (Choose THREE.)

Select 3 answers

A.AWS WAF

B.AWS Lake Formation

C.AWS CloudTrail

D.AWS Direct Connect

E.Amazon Macie

AnswersB, C, E

Lake Formation provides fine-grained access control and can be used to enforce anonymization policies.

Why this answer

AWS Lake Formation (B) is correct because it provides fine-grained access control for data lakes on Amazon S3. It lets you define column-level and row-level security policies, so sensitive columns and rows can be excluded from what is shared with external partners. AWS CloudTrail (C) is correct because it records API activity, including S3 data events when they are enabled, which covers the logging and monitoring requirement. Amazon Macie (E) is correct because it discovers and classifies sensitive data in S3, identifying what must be anonymized before it is shared.

Exam trap

The trap here is that candidates often confuse AWS WAF (a web-layer security tool) with data-level security, or assume Direct Connect provides logging and monitoring, when in fact neither service addresses S3 data access logging or anonymization.

364

MCQeasy

A company needs to store streaming data from IoT devices with a retention period of 7 days for real-time analysis. Which AWS service is most suitable?

A.Amazon DynamoDB

B.Amazon Kinesis Data Firehose

C.Amazon Kinesis Data Streams

D.Amazon S3

AnswerC

Kinesis Data Streams supports real-time data ingestion with adjustable retention.

Why this answer

Amazon Kinesis Data Streams is the most suitable service because it is designed for real-time ingestion and processing of streaming data, such as IoT device telemetry, with a default retention period of 24 hours, extendable up to 365 days. The requirement for a 7-day retention period for real-time analysis aligns perfectly with Kinesis Data Streams' ability to retain data for exactly that duration, allowing consumers to process records in near real-time using the Kinesis Client Library (KCL) or AWS Lambda.
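Retention on a data stream is set in hours, so the 7-day requirement has to be converted before it is applied. A minimal sketch, assuming the 24-hour minimum and 365-day maximum stated above; the helper name and stream name are hypothetical.

```python
MIN_RETENTION_HOURS = 24          # default retention
MAX_RETENTION_HOURS = 365 * 24    # upper bound stated above


def retention_hours(days: int) -> int:
    """Convert a retention period in days to hours, checking the allowed range."""
    hours = days * 24
    if not MIN_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError(f"{days} days is outside the supported retention range")
    return hours


hours = retention_hours(7)
print(hours)

# With credentials configured, the value could be applied like this:
#   import boto3
#   boto3.client("kinesis").increase_stream_retention_period(
#       StreamName="iot-telemetry",   # hypothetical stream name
#       RetentionPeriodHours=hours,
#   )
```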

Exam trap

The trap here is that candidates often confuse Kinesis Data Firehose with Kinesis Data Streams, assuming Firehose can retain data for a period, but Firehose is a delivery service with no retention—data is immediately delivered to a destination, whereas Data Streams provides a durable buffer with configurable retention for real-time consumption.

How to eliminate wrong answers

Option A is wrong because Amazon DynamoDB is a NoSQL key-value and document database optimized for low-latency reads and writes, not for streaming data ingestion or temporary buffering with a retention period; it stores data indefinitely unless TTL is configured, and lacks native streaming ingestion capabilities. Option B is wrong because Amazon Kinesis Data Firehose is designed for loading streaming data into destinations like S3, Redshift, or OpenSearch, but it does not support custom retention periods or real-time processing by multiple consumers; data is delivered to the destination and not retained for 7 days. Option D is wrong because Amazon S3 is an object storage service with no built-in streaming ingestion or real-time processing; it is a destination for stored data, not a buffer for real-time analysis with a 7-day retention window.

365

Multi-Selectmedium

A data engineer needs to design a data ingestion pipeline that captures streaming data from mobile app events into Amazon S3 for analytics. The pipeline must support real-time processing of events and allow for schema evolution over time. Which AWS services should the engineer use? (Choose THREE.)

Select 3 answers

A.Amazon Kinesis Data Analytics

B.Amazon Kinesis Data Firehose

C.AWS Glue ETL jobs

D.Amazon Kinesis Data Streams

E.AWS AppFlow

AnswersA, B, D

Enables real-time processing and schema evolution.

Why this answer

Options A, B, and D are correct. Kinesis Data Streams (D) ingests streaming data. Kinesis Data Analytics (A) can perform real-time processing.

Kinesis Data Firehose (B) delivers data to S3. Option C (Glue ETL jobs) is batch-oriented and not real-time. Option E (AppFlow) is for SaaS data ingestion.

366

MCQhard

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to an S3 bucket. The application is consistently running out of memory and failing. The operator has already increased the Parallelism and TaskManager memory. What is the next BEST step to troubleshoot?

A.Change the processing mode from exactly-once to at-least-once

B.Reduce the number of shards in the source stream

C.Enable Apache Flink metrics in Amazon CloudWatch to monitor heap and checkpoint details

D.Increase the buffer timeout for the S3 sink

AnswerC

Detailed metrics help identify root cause of OOM.

Why this answer

Option C is correct because enabling Apache Flink metrics in CloudWatch allows monitoring of heap usage, checkpoint sizes, and backpressure, which can diagnose the memory issue. Option D (increasing the sink buffer timeout) adds latency but doesn't diagnose memory. Option A changes processing semantics but doesn't address memory.

Option B is unrelated to the memory problem.

367

MCQhard

A financial services company has an Amazon DynamoDB table named 'Transactions' with provisioned read capacity of 10,000 RCU and write capacity of 5,000 WCU. The table stores transaction records for the past 90 days. The application performs point reads by transaction ID (partition key) and range queries by customer ID and timestamp (GSI). Recently, the company started a new marketing campaign, causing a sudden spike in write traffic. The write capacity is now at 4,500 WCU, and the application is experiencing occasional throttling on writes. The data engineer needs to ensure that writes are not throttled during future campaigns, while keeping costs low. The table currently has auto scaling enabled with a maximum capacity of 10,000 WCU. Which solution should the engineer implement?

A.Switch the table to DynamoDB on-demand capacity mode.

B.Use DynamoDB Accelerator (DAX) to cache write requests.

C.Implement an Amazon SQS queue to buffer write requests and process them in batches.

D.Increase the maximum write capacity in the auto scaling configuration to 20,000 WCU.

AnswerA

On-demand mode scales instantly to handle any traffic spike, eliminating throttling.

Why this answer

Option A is correct. DynamoDB on-demand mode automatically scales to accommodate traffic spikes, eliminating throttling without manual intervention. Option D is wrong because increasing the maximum auto scaling setting still imposes a limit, and auto scaling may not react quickly enough to a sudden spike.

Option C is wrong because using SQS adds latency and complexity, which is not ideal for real-time writes. Option B is wrong because DAX is for reads, not writes.

368

Multi-Selecthard

A company is using AWS Glue DataBrew to clean and transform data from an S3 bucket. The data contains personally identifiable information (PII). The company wants to mask the PII columns before making the dataset available to analysts. Which THREE actions can the engineer perform using DataBrew to mask PII? (Choose THREE.)

Select 3 answers

A.Apply a 'Column masking' transformation to replace values with 'XXX'.

B.Apply a 'Column tokenization' transformation to replace values with tokens.

C.Apply a 'Column hashing' transformation using SHA-256.

D.Apply a 'Column delete' transformation to remove the PII columns entirely.

E.Apply a 'Column encryption' transformation to encrypt the column values.

AnswersA, C, D

Column masking is a built-in transformation that hides data.

Why this answer

Options A, C, and D are correct. DataBrew provides built-in transformations for masking: 'Column masking' replaces values with a fixed pattern, 'Column hashing' replaces them with a hash, and 'Column delete' removes the column. Option E is wrong because 'Column encryption' is not a DataBrew transformation; encryption is typically done at rest or with KMS.

Option B is wrong because there is no 'Column tokenization' transformation in DataBrew; tokenization would require a custom recipe step.

369

MCQmedium

A data engineer is responsible for a data warehouse on Amazon Redshift that stores 5 TB of data. The engineer needs to load 50 GB of new data daily from Amazon S3 into Redshift. The current load process uses the COPY command and takes 2 hours, which is within the maintenance window. However, the engineer wants to optimize the load time and reduce the impact on concurrent queries. The engineer notices that the tables are not distributed evenly across the slices. The cluster has 4 nodes of dc2.large. Which approach will best improve load performance?

A.Increase the cluster size to 8 nodes.

B.Change the distribution style of the tables to EVEN.

C.Use GZIP compression on the S3 files.

D.Add sort keys to the tables based on the load timestamp.

AnswerB

EVEN distribution ensures each slice gets an equal amount of data, improving parallelism.

Why this answer

Option B is correct because the COPY command distributes data across slices based on the table's distribution style. With dc2.large nodes, each node has 2 slices, so a 4-node cluster has 8 slices. If tables are not distributed evenly, some slices handle more data, causing bottlenecks.

Changing the distribution style to EVEN forces rows to be spread uniformly across all slices, maximizing parallelism during the COPY load and reducing load time.

Exam trap

The trap here is that candidates often assume adding more nodes (scaling out) always improves load performance, but the real bottleneck is slice-level data skew, which EVEN distribution directly fixes without additional cost.

How to eliminate wrong answers

Option A is wrong because increasing the cluster size to 8 nodes adds cost and complexity without addressing the root cause of uneven data distribution; the load time improvement would be marginal if slices are still unbalanced. Option C is wrong because using GZIP compression on S3 files reduces storage and transfer time, but the COPY command already decompresses data automatically; the bottleneck here is slice imbalance, not I/O or network bandwidth. Option D is wrong because adding sort keys based on load timestamp improves query performance for range-restricted scans, but does not affect how the COPY command distributes data across slices during the load process.

370

Multi-Selecthard

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time clickstream data. The application reads from a Kinesis stream and writes aggregated results to an Amazon S3 bucket. The company notices that the application is falling behind and the checkpoint duration is increasing. Which THREE actions should the data engineer take to improve performance? (Choose THREE.)

Select 3 answers

A.Decrease the number of shards in the source Kinesis stream.

B.Use multiple S3 prefixes in the output path to avoid throttling.

C.Increase the heap memory of the Flink application.

D.Increase the checkpoint interval to reduce checkpoint overhead.

E.Increase the parallelism of the Flink application.

AnswersB, D, E

Multiple prefixes increase S3 write performance.

Why this answer

Options B, D, and E are correct. Increasing parallelism (E) allows more parallel processing. Using multiple S3 prefixes (B) reduces S3 write throttling.

Increasing the checkpoint interval (D) reduces checkpoint overhead. Option C is wrong because a heap memory increase is not directly related to checkpoint duration. Option A is wrong because reducing the number of shards would decrease throughput.

371

MCQhard

A data team uses AWS Glue ETL jobs to process data from an S3 bucket (s3://data-lake-raw) and write results to another S3 bucket (s3://data-lake-processed). Both buckets are encrypted with SSE-KMS using the same KMS key (alias 'data-key'). The Glue job runs in the same account. The team recently enabled S3 Server Access Logging for the raw bucket, sending logs to a separate logging account. After enabling logging, the Glue job starts failing with 'AccessDenied' when reading from the raw bucket. The Glue job's IAM role has s3:GetObject permission on the raw bucket. Which additional permission is most likely missing?

A.s3:GetBucketLocation on the raw bucket.

B.kms:Decrypt on the KMS key (alias 'data-key').

C.s3:PutObject on the processed bucket.

D.kms:GenerateDataKey on the KMS key.

AnswerB

The Glue job needs permission to decrypt the objects using the KMS key.

Why this answer

Objects in a bucket encrypted with SSE-KMS can be read only by a principal that has both s3:GetObject on the object and kms:Decrypt on the KMS key that protects it. The Glue job's role has s3:GetObject on the raw bucket, so among the options the missing permission is kms:Decrypt on 'data-key'. Enabling S3 Server Access Logging does not by itself change how objects are read, but key policy or bucket policy changes made to support cross-account log delivery can remove access that the job previously had.

Option A is wrong because s3:GetBucketLocation is not what authorizes reading an encrypted object. Option C is wrong because the failure occurs when reading from the raw bucket, not when writing to the processed bucket. Option D is wrong because kms:GenerateDataKey is needed to write new SSE-KMS objects, not to read existing ones.
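The pairing of S3 and KMS permissions can be sketched as an IAM policy for the Glue role. This is an illustration only: the key ARN is a placeholder (IAM policies reference a key by ARN rather than by alias), and a real job would also need permissions for the processed bucket.

```python
import json

# Placeholder key ARN; the bucket name comes from the question.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"

glue_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadRawObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::data-lake-raw/*",
        },
        {
            "Sid": "DecryptRawObjects",
            "Effect": "Allow",
            "Action": "kms:Decrypt",   # required to read SSE-KMS objects
            "Resource": KEY_ARN,
        },
    ],
}

print(json.dumps(glue_read_policy, indent=2))
```

The key policy must also permit the role to use the key; an identity policy alone is not enough if the key policy does not delegate to IAM.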

372

MCQhard

A company uses Amazon Redshift for its data warehouse. The data engineer notices that queries are slow on a large table that is frequently filtered on a column 'transaction_date'. Which optimization technique best improves query performance?

A.Apply compression encoding to 'transaction_date'.

B.Set the sort key to 'transaction_date'.

C.Set the distribution key to 'transaction_date'.

D.Run VACUUM on the table.

AnswerB

Sort keys enable zone maps to skip irrelevant blocks.

Why this answer

Setting the sort key to 'transaction_date' organizes the table data physically by that column, which allows Redshift to use zone maps to skip blocks that don't match query filters. This dramatically reduces the amount of data scanned for range-restricted queries on 'transaction_date', improving query performance.

Exam trap

The trap here is that candidates confuse distribution keys (which optimize joins) with sort keys (which optimize filtering and range scans), leading them to pick distribution key as the answer for a single-table filter performance issue.

How to eliminate wrong answers

Option A is wrong because compression encoding reduces storage size and I/O but does not directly optimize query filtering on a column; it can even slow down scans if the column is frequently used in predicates. Option C is wrong because setting the distribution key to 'transaction_date' distributes rows across nodes based on that column, which can help with joins but does not improve the efficiency of range-restricted scans on a single table. Option D is wrong because VACUUM reclaims space and re-sorts data but does not improve query performance unless the table is already sorted on a key; without a sort key on 'transaction_date', VACUUM has no effect on filter performance.

373

MCQmedium

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that transforms each record and writes it to Amazon S3. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors when writing to S3. The team has already increased the Lambda function's memory and timeout. Which action should the team take to resolve the issue?

A.Use S3 Batch Operations to write data in batches.

B.Increase the number of shards in the Kinesis data stream.

C.Enable S3 Transfer Acceleration on the destination bucket.

D.Implement retries with exponential backoff in the Lambda function for S3 put operations.

AnswerD

This handles transient S3 throttling by retrying with backoff.

Why this answer

The 'ProvisionedThroughputExceededException' error indicates that the Lambda function is being throttled by S3 due to exceeding the bucket's request rate limits. Implementing retries with exponential backoff in the Lambda function for S3 put operations is the correct solution because it allows the function to gracefully handle transient throttling errors by waiting progressively longer between retries, which aligns with AWS's guidance for managing S3 request rate limits.
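Retrying with exponential backoff can be sketched without any AWS dependency. In the example below, `ThrottledError` and `flaky_put` are stand-ins invented for the illustration; in a real Lambda function the wrapped call would be the S3 put and the caught exception would be whatever throttling error the client raises.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error raised by the storage client."""


def put_with_backoff(put, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call put(), retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return put()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay cap doubles each attempt: 0.1, 0.2, 0.4, ... up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter spreads out retries


# Usage: a fake put that is throttled twice, then succeeds.
calls = {"n": 0}


def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError("slow down")
    return "ok"


result = put_with_backoff(flaky_put, sleep=lambda seconds: None)
print(result, calls["n"])
```

The jitter matters when many concurrent Lambda invocations are throttled at once, since it keeps them from retrying in lockstep. The AWS SDKs also ship configurable retry behavior, which may make a hand-written loop unnecessary.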

Exam trap

The trap here is that candidates confuse the source of the error (Kinesis vs. S3) and incorrectly assume that increasing Kinesis shards will fix the S3 throttling, or they mistake S3 Transfer Acceleration for a solution to rate limits when it only improves network latency.

How to eliminate wrong answers

Option A is wrong because S3 Batch Operations is designed for bulk processing of existing objects in S3, not for handling real-time streaming writes from Lambda, and it does not address the immediate throttling issue during individual put operations. Option B is wrong because increasing the number of shards in the Kinesis data stream would increase the parallelism of data ingestion into Lambda, but the error occurs when writing to S3, not when reading from Kinesis, so it would not resolve the S3 throttling. Option C is wrong because S3 Transfer Acceleration optimizes network transfer speed by using AWS edge locations, but it does not affect S3's internal request rate limits or throttle errors, which are based on bucket-level throughput capacity.

374

MCQhard

A company is using Amazon RDS for SQL Server with Multi-AZ. The database has a 500 GB data file and 100 GB log file. The application experiences high latency during peak hours. Monitoring shows high WriteIOPS on the primary. Which change will reduce latency without losing the ability to failover?

A.Reduce the log file size by changing recovery model

B.Increase the provisioned IOPS on the RDS instance

C.Create a Read Replica in a different Availability Zone

D.Switch to Multi-AZ with two readable standbys

AnswerB

Higher IOPS reduces write latency.

Why this answer

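Option B is correct because increasing the provisioned IOPS gives the primary more storage write throughput, which reduces write latency while leaving the Multi-AZ standby and automatic failover in place. Option A is wrong because the log file size is not the cause of the high WriteIOPS. Option C is wrong because Read Replicas offload reads, not writes.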

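Option D is wrong because the deployment already uses Multi-AZ, which provides failover rather than lower write latency, and the Multi-AZ option with two readable standbys (a Multi-AZ DB cluster) is offered for RDS for MySQL and PostgreSQL, not for SQL Server.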
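
As a rough sketch of what raising provisioned IOPS looks like in code, the helper below wraps the RDS `modify_db_instance` call. The function name and the instance identifier and IOPS value in the usage note are hypothetical placeholders; the target IOPS has to be valid for the instance's storage type and allocated size, which this sketch does not check.

```python
def raise_provisioned_iops(rds_client, instance_id, iops, apply_immediately=False):
    """Request a new provisioned IOPS value for an existing RDS instance.

    rds_client is expected to behave like boto3.client("rds").
    The Multi-AZ setting is not part of the request, so it is left unchanged.
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        Iops=iops,
        # False defers the change to the next maintenance window.
        ApplyImmediately=apply_immediately,
    )
```

A call might look like `raise_provisioned_iops(boto3.client("rds"), "orders-sqlserver", 12000)`, where both the identifier and the number are made up for illustration.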

375

MCQmedium

A company stores log files in Amazon S3. They want to automatically move logs older than 90 days to S3 Glacier Deep Archive to reduce costs. Which S3 feature should be used?

A.S3 Intelligent-Tiering

B.S3 Lifecycle configuration

C.S3 Replication

D.S3 Object Lock

AnswerB

Lifecycle policies can move objects to Glacier Deep Archive after 90 days.

Why this answer

S3 Lifecycle configuration allows you to define rules that automatically transition objects to colder storage classes, such as S3 Glacier Deep Archive, based on age. By setting a rule to move objects older than 90 days to S3 Glacier Deep Archive, you reduce storage costs without manual intervention. This is the correct feature for automating tier-based data lifecycle management.
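
A lifecycle rule of this kind can be expressed as a small configuration document. The sketch below builds one in Python and passes it to the S3 `put_bucket_lifecycle_configuration` API; the rule ID, the `logs/` prefix, and the helper names are assumptions made for illustration, since the question does not say how the log objects are keyed.

```python
def build_lifecycle_configuration(prefix="logs/", days=90):
    """Return a lifecycle configuration that moves objects under `prefix`
    to S3 Glacier Deep Archive once they are `days` days old."""
    return {
        "Rules": [
            {
                "ID": "archive-old-logs",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, config):
    """Attach the configuration to a bucket.

    s3_client is expected to behave like boto3.client("s3"). The call
    replaces any lifecycle configuration already set on the bucket.
    """
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

Once the rule is in place, S3 evaluates it automatically; no scheduled job or Lambda function is needed to move the objects.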

Exam trap

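The trap here is that candidates may confuse S3 Intelligent-Tiering with lifecycle policies. Intelligent-Tiering moves objects between access tiers based on how recently they were accessed, not on their age, and is designed for unpredictable access patterns rather than fixed retention schedules. Its optional Deep Archive Access tier applies only after at least 180 consecutive days without access, so it cannot satisfy a 90-day, age-based requirement.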

How to eliminate wrong answers

Option A is wrong because S3 Intelligent-Tiering moves data between access tiers based on changing access patterns, not on a fixed age-based schedule; its optional Deep Archive Access tier is triggered by a period without access (180 days at minimum), not by object age. Option C is wrong because S3 Replication copies objects across buckets or Regions for redundancy or compliance; it does not transition objects to colder storage classes. Option D is wrong because S3 Object Lock prevents objects from being deleted or overwritten for a specified retention period; it does not manage storage tier transitions.
