Knowledge + Practice

AWS Certified Data Engineer Associate DEA-C01 (DEA-C01) — Questions 1726–1786

1786 questions total · 24 pages · All types, answers revealed



1726

Multi-Select · medium

A data engineer is evaluating storage options for a new application that requires low-latency access to unstructured blobs (up to 5 TB each) with high throughput. The data will be accessed frequently for the first 30 days and then rarely. Which TWO storage solutions meet these requirements? (Choose TWO)

Select 2 answers

A.Amazon S3 with lifecycle policies

B.Amazon EBS with io2 Block Express volumes

C.Amazon EFS

D.Amazon RDS for PostgreSQL

E.Amazon FSx for Lustre

Answers: A, E

S3 can handle large objects and lifecycle policies automate transitions to cost-optimized storage.

Why this answer

Amazon S3 with lifecycle policies is correct because S3 provides low-latency access to unstructured blobs (up to 5 TB each) with high throughput, and lifecycle policies can automatically transition objects to colder storage classes (e.g., S3 Standard-IA or an S3 Glacier class) after 30 days, matching the access pattern of frequent then rare access. Amazon FSx for Lustre is the second correct answer because it is a high-throughput, low-latency file system that can be linked to an S3 bucket.
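As a rough illustration of the lifecycle rule described here, the sketch below builds the configuration document that boto3's `put_bucket_lifecycle_configuration` accepts. The function name, the `blobs/` prefix, and the choice of storage class are assumptions made for this example, not part of the question.

```python
def build_lifecycle_configuration(prefix: str = "",
                                  days_until_cold: int = 30,
                                  cold_storage_class: str = "STANDARD_IA") -> dict:
    """Build an S3 lifecycle configuration with a single transition rule.

    The prefix and storage class are illustrative assumptions; pick the
    class that matches how quickly the rarely accessed data must be read.
    """
    return {
        "Rules": [
            {
                "ID": f"transition-after-{days_until_cold}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days_until_cold, "StorageClass": cold_storage_class}
                ],
            }
        ]
    }


config = build_lifecycle_configuration(prefix="blobs/")
# To apply it (requires AWS credentials and a real bucket name):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="<your-bucket>", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review before it is applied to a bucket.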

Exam trap

The trap here is that candidates may confuse block storage (EBS) or file storage (EFS) with object storage (S3), or overlook that lifecycle policies are the key to handling the 'frequent then rare' access pattern, leading them to choose EBS or EFS for blob storage.


1727

MCQ · easy

A company uses Amazon S3 as a data lake. A data engineer needs to ensure that all objects uploaded to the 'incoming' prefix are automatically encrypted at rest using AWS KMS with a specific customer managed key. What is the simplest way to enforce this?

A.Enable S3 Transfer Acceleration to force encryption in transit.

B.Use a bucket policy that denies PutObject requests without the required encryption header.

C.Configure S3 Inventory to report on encryption status and alert on non-compliance.

D.Enable default encryption on the bucket with SSE-S3.

Answer: B

A bucket policy with a condition for s3:x-amz-server-side-encryption-aws-kms-key-id enforces the specific key.

Why this answer

Option B is correct because S3 bucket policies can enforce encryption using a condition key. Option A is wrong because S3 Transfer Acceleration speeds up transfers and does not affect encryption at rest. Option C is wrong because S3 Inventory only reports on encryption status; it does not enforce it.

Option D is wrong because default encryption with SSE-S3 uses S3 managed keys, not the required customer managed KMS key.
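The bucket-policy approach from Option B can be sketched as follows. This is a minimal example under stated assumptions: the bucket name, prefix, and key ARN are placeholders, and the helper name is invented for illustration. It relies on the `s3:x-amz-server-side-encryption-aws-kms-key-id` condition key mentioned in the answer summary.

```python
import json


def build_kms_enforcement_policy(bucket: str, prefix: str, kms_key_arn: str) -> dict:
    """Deny PutObject under a prefix unless the request names the required KMS key.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutRequiredKmsKey",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
                "Condition": {
                    # A negated operator also matches when the header is absent,
                    # so uploads that omit the key ID are denied as well.
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn
                    }
                },
            }
        ],
    }


policy = build_kms_enforcement_policy(
    bucket="example-data-lake",  # hypothetical bucket name
    prefix="incoming",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/example-key-id",  # placeholder
)
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON string is what would be attached to the bucket as its policy.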


1728

MCQ · medium

A company uses AWS Lambda to process records from an Amazon Kinesis Data Stream. Each record is about 50 KB. The Lambda function transforms the data and writes to Amazon DynamoDB. Recently, the Lambda function has been experiencing throttling and high error rates. The Kinesis stream has 10 shards. What is the most cost-effective solution to improve processing throughput?

A.Increase the number of shards in the Kinesis stream.

B.Increase the Parallelization Factor for the Lambda event source mapping.

C.Increase the memory allocated to the Lambda function.

D.Increase the Batch Window (MaximumBatchingWindowInSeconds) for the event source mapping.

Answer: D

Reduces invocation frequency.

Why this answer

Option D is correct because increasing the Batch Window (MaximumBatchingWindowInSeconds) allows the Lambda function to accumulate more records from the Kinesis stream before invoking the function, reducing the number of invocations and thus lowering the chance of throttling. This is the most cost-effective solution as it does not require additional shards, memory, or parallelization, and directly addresses the high error rates caused by excessive concurrent executions.

Exam trap

The trap here is that candidates often assume increasing shards or parallelization is the only way to improve throughput, but the question specifically asks for the most cost-effective solution, and increasing the batch window reduces invocation count without incurring additional costs.

How to eliminate wrong answers

Option A is wrong because increasing the number of shards would increase the number of concurrent Lambda invocations, potentially worsening throttling and increasing costs, not improving throughput cost-effectively. Option B is wrong because increasing the Parallelization Factor (which controls concurrent batches per shard) would also increase concurrency, leading to more throttling and higher costs, and is not a cost-effective fix. Option C is wrong because increasing memory allocated to the Lambda function does not directly address throttling or error rates caused by invocation frequency; it may improve per-invocation performance but at a higher cost without solving the root cause.


1729

MCQ · easy

Refer to the exhibit. A data engineer is configuring a Kinesis Data Firehose delivery stream. The stream is expected to receive bursts of 10 MB of data every 2 minutes. What is the maximum time it will take for data to be delivered to S3 during a burst?

A.300 seconds

B.60 seconds

C.1 second

D.600 seconds

Answer: A

The buffer interval is 300 seconds; even with bursts, the maximum time is the interval.

Why this answer

Option A is correct. The exhibit's configuration sets a buffer interval of 300 seconds (5 minutes) and a buffer size of 5 MB, and Firehose delivers to S3 as soon as either threshold is reached, whichever comes first.

A burst of 10 MB every 2 minutes averages 5 MB per minute, so the 5 MB size threshold is typically reached in about 1 minute and data is usually flushed well before the interval expires.

The question, however, asks for the maximum time. If the size threshold is not reached, the interval still forces a flush, so data never waits longer than the configured 300 seconds.
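The buffering arithmetic can be checked with a short calculation. The sketch below is only a simplified model: it assumes the burst is spread evenly over its period, and the function name and parameters are made up for this example.

```python
def seconds_until_flush(buffer_size_mb: float,
                        buffer_interval_s: float,
                        burst_mb: float,
                        burst_period_s: float) -> float:
    """Time until a flush when whichever buffer threshold is hit first wins.

    Simplifying assumption: data arrives evenly across each burst period.
    """
    if burst_mb <= 0:
        # No incoming data: only the interval can trigger a flush.
        return float(buffer_interval_s)
    seconds_to_fill = buffer_size_mb * burst_period_s / burst_mb
    return min(seconds_to_fill, float(buffer_interval_s))


# 5 MB buffer, 300 s interval, 10 MB arriving every 120 s.
typical = seconds_until_flush(5, 300, 10, 120)   # size threshold wins: 60.0
# With no data arriving, the interval is the upper bound.
upper_bound = seconds_until_flush(5, 300, 0, 120)  # 300.0
```

The size trigger explains the typical one-minute delivery, while the interval gives the 300-second maximum the question asks about.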


1730

Drag & Drop · medium

Arrange the steps to implement data encryption at rest for an Amazon Redshift cluster using AWS KMS.


Why this order

First, create the KMS key. Then launch a new encrypted cluster, specify the key, configure, and verify encryption.


1731

MCQ · medium

A company uses Amazon Redshift for data warehousing. The security team requires that all data loading into Redshift be encrypted in transit. Which configuration ensures this requirement is met?

A.Use a VPC security group to restrict access

B.Configure the Redshift cluster to require SSL connections

C.Use client-side encryption before loading data

D.Enable server-side encryption on the Redshift cluster

Answer: B

SSL encrypts data in transit between clients and Redshift.

Why this answer

Encryption in transit for Redshift is achieved by using SSL connections. Client-side encryption before loading does not encrypt the transmission. Server-side encryption is for at-rest.

VPC security groups control network access, not encryption.


1732

MCQ · easy

A data engineer is responsible for ingesting daily CSV files from an external partner into an Amazon S3 bucket. The partner uploads files to an AWS Transfer Family (SFTP) endpoint. Once a file is uploaded, an AWS Lambda function triggers an AWS Glue ETL job to transform the data and load it into an Amazon RDS database. Recently, some files have failed to trigger the Glue job because the Lambda function timed out while waiting for the Glue job to complete. The engineer needs to ensure that all files are processed reliably without manual intervention. What should the data engineer do?

A.Modify the Lambda function to send a message to an Amazon SQS queue after uploading, and create a separate Lambda function that reads from the queue and triggers the Glue job asynchronously.

B.Increase the Lambda function timeout to 15 minutes to accommodate longer Glue jobs.

C.Increase the Lambda function's reserved concurrency to allow multiple invocations.

D.Configure S3 event notifications to trigger the Glue job directly without Lambda.

Answer: A

Decoupling prevents timeout and ensures retries.

Why this answer

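Option A is correct because placing an SQS queue between the two steps decouples the Lambda function from the Glue job: the first function enqueues a message and exits, and a second function reads from the queue and starts the Glue job asynchronously, so neither function waits for the job to complete. Option B is wrong because increasing the Lambda timeout still ties the function to the job's duration, and 15 minutes is the Lambda maximum. Option C is wrong because the issue is not concurrency.

Option D is wrong because S3 event notifications cannot start a Glue job directly; their supported destinations are SNS, SQS, Lambda, and EventBridge.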
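A minimal sketch of the queue-driven function from Option A is shown below. It assumes a hypothetical message body containing `bucket` and `key` fields and a Glue client passed in by the caller; `start_job_run` returns as soon as the run is submitted, so the function never waits for the job to finish.

```python
import json


def handle_queue_messages(event: dict, glue_client, job_name: str) -> list:
    """Start one Glue job run per SQS message and return the run IDs.

    The message format ({"bucket": ..., "key": ...}) and the job argument
    names are assumptions made for this example.
    """
    run_ids = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        response = glue_client.start_job_run(
            JobName=job_name,
            Arguments={
                "--source_bucket": body["bucket"],
                "--source_key": body["key"],
            },
        )
        # The call returns immediately with the run ID; the job runs on its own.
        run_ids.append(response["JobRunId"])
    return run_ids
```

In a real Lambda handler, `glue_client` would be `boto3.client("glue")`. If the function raises, the message stays on the queue and is retried, which is the reliability property the answer relies on.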


1733

MCQ · easy

A data engineer needs to ingest data from an Amazon S3 bucket into Amazon Redshift for analytics. The data is in CSV format and the Redshift table already exists. Which service can be used to perform this ingestion with minimal configuration?

A.AWS Glue

B.Amazon Kinesis Data Firehose

C.Amazon Redshift COPY command

D.AWS Database Migration Service (DMS)

Answer: C

The COPY command loads data from S3 into Redshift efficiently.

Why this answer

Option C is correct because the Redshift COPY command can load data directly from S3 into an existing table. Option A is wrong because Glue is overkill for a simple load. Option B is wrong because Kinesis Data Firehose is for streaming, not batch.

Option D is wrong because DMS is for database migration.


1734

MCQ · hard

A company runs a data pipeline that uses Amazon EMR to process large datasets. The pipeline reads data from S3, processes it using Spark, and writes results back to S3. Recently, the pipeline has been failing with 'OutOfMemoryError' in the Spark executors. The EMR cluster is configured with 5 core nodes of type m5.xlarge (4 vCPU, 16 GB memory each). The Spark application uses dynamic allocation and default Spark configurations. The input data size is approximately 500 GB in Parquet format. What is the most cost-effective way to resolve the out-of-memory errors?

A.Increase the spark.executor.memory setting to 8 GB in the Spark configuration.

B.Change the core node instance type to r5.xlarge (32 GB memory) and keep 5 nodes.

C.Increase the number of core nodes to 10 to distribute the data across more executors.

D.Change the input data format from Parquet to ORC to reduce memory footprint.

Answer: B

Memory-optimized instances provide more memory per node, reducing OOM without increasing node count.

Why this answer

Option B is correct because the current cluster has limited memory per node (16 GB). By switching to memory-optimized instances like r5.xlarge (32 GB), each node has double the memory, reducing the chance of OOM. This is more cost-effective than adding more nodes because the total memory per node increases without increasing the number of instances.

Option A is wrong because raising spark.executor.memory does not add physical memory to the nodes; pushing executors toward the limit of what a 16 GB node can provide could cause YARN to kill containers. Option C is wrong because increasing the number of nodes adds more memory but also more cost; it might be more expensive than using fewer, larger nodes. Option D is wrong because Parquet is already efficient; changing to a different format may not solve memory issues.


1735

MCQ · medium

A company has an S3 bucket with millions of objects. The data engineer needs to identify which objects are not accessed for 90 days to move them to a lower-cost storage class. Which feature should be used?

A.S3 Storage Class Analysis

B.S3 Inventory

C.S3 Server Access Logs

D.S3 Event Notifications

Answer: A

It analyzes access patterns and provides recommendations for lifecycle transitions.

Why this answer

S3 Storage Class Analysis (SCA) is the correct feature because it monitors access patterns across objects and provides recommendations for transitioning data to lower-cost storage classes based on last-access dates. SCA can analyze objects that have not been accessed for 90 days and generate a report to inform lifecycle policy creation, directly addressing the requirement to identify objects for cost optimization.

Exam trap

The trap here is that candidates often confuse S3 Inventory (which lists objects) with S3 Storage Class Analysis (which analyzes access patterns), assuming that a list of objects is sufficient to determine access frequency, but Inventory lacks the temporal access data needed for this task.

How to eliminate wrong answers

Option B (S3 Inventory) is wrong because it provides a flat list of all objects and their metadata (e.g., size, storage class) but does not track access patterns or last-accessed timestamps, so it cannot identify objects unused for 90 days. Option C (S3 Server Access Logs) is wrong because it records detailed request-level logs (e.g., requester, operation, timestamp) but requires custom parsing and aggregation to derive last-access dates, and it does not natively provide a summary of objects not accessed for a specific period. Option D (S3 Event Notifications) is wrong because it triggers real-time events for object operations (e.g., PUT, POST, DELETE) but does not store historical access data or analyze access patterns over time, making it unsuitable for identifying long-unused objects.


1736

MCQ · hard

A data engineer is troubleshooting an issue where an Amazon Redshift query returns an error: 'ERROR: permission denied for relation table_name'. The user has been granted SELECT on the table. What is the most likely cause?

A.The user's session has timed out.

B.The user does not have CONNECT permission on the database.

C.The table is in a different schema than expected.

D.The user does not have USAGE permission on the schema.

Answer: D

Without USAGE on the schema, the user cannot access tables even with SELECT.

Why this answer

In Redshift, users need USAGE permission on the schema to access tables within it, so Option D is correct. Option B is unrelated to table-level access.

Option C would result in a not-found error rather than a permission error. Option A would surface as a connection or session error.


1737

MCQ · easy

A data engineer needs to store time-series data from IoT devices. The data is write-heavy and requires low-latency queries by device ID and timestamp. The data volume is expected to grow to terabytes. Which AWS database service is most suitable?

A.Amazon RDS for MySQL

B.Amazon ElastiCache for Redis

C.Amazon DynamoDB

D.Amazon Timestream

Answer: D

Timestream is designed for time-series data.

Why this answer

Amazon Timestream is purpose-built for time-series data, offering automatic tiered storage (in-memory for recent data and magnetic for historical) to handle write-heavy IoT workloads at scale. It supports low-latency queries by device ID and timestamp via its SQL-compatible query engine, making it the most suitable choice for terabytes of time-series data.

Exam trap

The trap here is that candidates often choose DynamoDB (Option C) because of its high write throughput and low-latency queries, but they overlook the lack of native time-series optimizations, leading to complex manual partitioning and TTL management that Timestream handles automatically.

How to eliminate wrong answers

Option A is wrong because Amazon RDS for MySQL is a relational database optimized for OLTP workloads with structured queries, not for the high-volume, write-heavy, time-series pattern that requires automatic data retention policies and time-based partitioning. Option B is wrong because Amazon ElastiCache for Redis is an in-memory cache designed for sub-millisecond read/write performance on hot data, but it cannot cost-effectively store terabytes of data and lacks native time-series query optimizations like downsampling and interpolation. Option C is wrong because Amazon DynamoDB is a key-value and document database that can handle high write throughput, but it does not have built-in time-series functions (e.g., time-based aggregation, retention policies) and requires manual partitioning and TTL management to handle time-series data efficiently at terabyte scale.


1738

MCQ · hard

A company has an Amazon RDS for MySQL database that is experiencing performance issues due to a large number of read requests. The application is read-heavy and can tolerate eventually consistent reads. Which action will reduce the load on the primary database with the least operational overhead?

A.Create a read replica in the same region

B.Use Amazon ElastiCache for caching

C.Enable Multi-AZ deployment

D.Increase the instance size of the primary DB

Answer: A

Offloads read traffic with minimal overhead.

Why this answer

Creating a read replica offloads read queries from the primary, reducing load. Multi-AZ is for high availability, not read scaling. Caching or increasing instance size may help but read replica is simplest for read-heavy workloads.


1739

Multi-Select · medium

A data engineer needs to ensure that data in an Amazon S3 bucket is not publicly accessible. Which TWO measures should the engineer implement? (Choose TWO.)

Select 2 answers

A.Attach a bucket policy that denies access to 'Principal': '*' unless specific conditions are met.

B.Create a lifecycle policy to delete objects after 30 days.

C.Enable S3 Block Public Access settings on the bucket.

D.Enable S3 Versioning on the bucket.

E.Enable default encryption on the bucket.

Answers: A, C

A bucket policy can deny all public access.

Why this answer

Blocking public access at the bucket level and using bucket policies to deny public access are effective controls. Option B is wrong because lifecycle policies do not control access. Option D is wrong because versioning does not control access.

Option E is wrong because encryption does not prevent public access.
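For the Block Public Access control, the four settings can be sketched as the configuration dictionary that boto3's `put_public_access_block` expects. The helper names below are invented for this example.

```python
def build_public_access_block() -> dict:
    """Return a configuration that turns on all four Block Public Access settings."""
    return {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }


def blocks_all_public_access(config: dict) -> bool:
    """Check that every Block Public Access setting is explicitly enabled."""
    required = (
        "BlockPublicAcls",
        "IgnorePublicAcls",
        "BlockPublicPolicy",
        "RestrictPublicBuckets",
    )
    return all(config.get(name) is True for name in required)


config = build_public_access_block()
# To apply it (requires AWS credentials and a real bucket name):
# boto3.client("s3").put_public_access_block(
#     Bucket="<your-bucket>", PublicAccessBlockConfiguration=config)
```

A check like `blocks_all_public_access` could be run against the output of `get_public_access_block` to confirm a bucket's current settings.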


1740

MCQ · easy

A data engineer is designing a data pipeline that ingests data from an on-premises database into Amazon S3 using AWS Database Migration Service (DMS). The data must be encrypted at rest in S3 using SSE-S3. The engineer also needs to track changes to the source database in real time. Which DMS configuration should the engineer use?

A.Use DMS with a snapshot of the source database.

B.Use DMS with ongoing replication (change data capture) enabled.

C.Use DMS with a full load task only.

D.Use DMS with a full load task and then stream to Amazon Kinesis.

Answer: B

CDC captures real-time changes.

Why this answer

Option B is correct because DMS with CDC captures ongoing changes. Option C is wrong because a full load only captures a point-in-time copy of the data. Option D is wrong because DMS supports CDC without needing Kinesis.

Option A is wrong because a snapshot of the source database is not real-time.


1741

MCQ · medium

A company is running an Amazon EMR cluster with Spark for data processing. The data engineer wants to automatically scale the core and task nodes based on the YARN memory and CPU utilization. Which scaling metric should the engineer use for the EMR managed scaling policy?

A.YARNMemoryAvailablePercentage

B.CPUUtilization

C.DiskIOPS

D.HDFSUtilization

Answer: A

EMR managed scaling uses YARN memory metrics.

Why this answer

Option A is correct because EMR managed scaling uses YARNMemoryAvailablePercentage and YARNContainersPending as the default metrics for scaling. Option B is incorrect because CPUUtilization is not a default metric for EMR managed scaling. Option C is incorrect because IOPS is not a metric for EMR managed scaling.

Option D is incorrect because HDFSUtilization is for HDFS, not YARN.


1742

MCQ · easy

A data engineer needs to store semi-structured JSON data that is accessed infrequently but requires millisecond retrieval latency. The data is immutable once written. Which AWS service is most cost-effective?

A.Amazon DynamoDB with on-demand capacity

B.Amazon ElastiCache for Redis

C.Amazon RDS for PostgreSQL with JSONB

D.Amazon S3 (Standard-IA) with S3 Select

Answer: D

S3 Select can retrieve subsets of JSON data efficiently, and Standard-IA is cost-effective for infrequent access.

Why this answer

Option D is correct because S3 with S3 Select can retrieve specific JSON fields with low latency for infrequent access, and Standard-IA reduces cost. Option A (DynamoDB) is for frequent access patterns. Option C (RDS) is for relational data.

Option B (ElastiCache) is for caching, not durable storage.


1743

MCQ · hard

A company uses Amazon DynamoDB to store session data for a web application. The application experiences occasional spikes in traffic, causing throttling on the table. The data engineer needs to implement a solution that handles traffic spikes without manual intervention and minimizes cost. What should the data engineer do?

A.Switch to provisioned capacity with a high fixed read/write capacity.

B.Implement DynamoDB Accelerator (DAX) to cache read requests.

C.Purchase DynamoDB reserved capacity.

D.Enable DynamoDB Auto Scaling for the table.

Answer: D

Auto Scaling adjusts capacity automatically to handle spikes and minimize cost.

Why this answer

DynamoDB Auto Scaling automatically adjusts read and write capacity based on traffic patterns, handling spikes without manual intervention. Option A is wrong because provisioned capacity with fixed values would either underprovision (causing throttling) or overprovision (wasting cost). Option B is wrong because DynamoDB Accelerator (DAX) is a caching layer that reduces read load but doesn't handle write spikes and adds cost.

Option C is wrong because reserved capacity offers discounts but doesn't handle spikes automatically.


1744

Multi-Select · easy

A data engineer is monitoring an Amazon Kinesis Data Stream used to ingest clickstream data. The engineer notices that the stream's 'WriteProvisionedThroughputExceeded' metric is frequently above zero. Which TWO actions could help mitigate this issue? (Choose TWO.)

Select 2 answers

A.Increase the number of shards in the stream.

B.Reduce the data retention period to free up capacity.

C.Decrease the number of shards to reduce overhead.

D.Implement a random prefix for the partition key to distribute data evenly.

E.Enable enhanced fan-out on the stream.

Answers: A, D

More shards increase total write capacity.

Why this answer

Options A and D are correct. Increasing the number of shards increases write capacity. Implementing a random prefix for partition keys distributes writes more evenly across shards.

Option B is wrong because reducing the retention period does not affect write throughput. Option C is wrong because decreasing shards reduces capacity. Option E is wrong because enabling enhanced fan-out increases read capacity, not write.
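The random-prefix idea from Option D can be illustrated with a small helper. This is a sketch under assumptions: the function name, the `-` separator, and the number of salt buckets are choices made for this example, and consumers that depend on strict per-key ordering would need to account for one natural key now spanning several shards.

```python
import random


def salted_partition_key(natural_key: str, salt_buckets: int = 10, rng=random) -> str:
    """Prefix the key with a random bucket number so writes spread across shards.

    Kinesis hashes the whole partition key, so different prefixes for the
    same natural key can land on different shards.
    """
    salt = rng.randrange(salt_buckets)
    return f"{salt}-{natural_key}"


# Example: the same hot key now maps to up to 10 distinct partition keys.
keys = {salted_partition_key("user-42") for _ in range(1000)}
```

The value returned would be passed as the `PartitionKey` of each put request, and a reader can recover the original key by splitting on the first `-`.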


1745

Multi-Select · medium

A data engineer is troubleshooting a Glue ETL job that reads from an S3 bucket and writes to a Redshift table. The job fails with a 'MemoryError' when processing a large dataset. Which TWO actions should the engineer take to resolve this issue? (Choose TWO.)

Select 2 answers

A.Increase the number of DPUs and set 'spark.sql.shuffle.partitions' to a higher value.

B.Increase the number of DPUs and set 'coalesce(1)' in the script.

C.Decrease the number of DPUs and increase 'spark.shuffle.partitions'.

D.Set the 'RedshiftTempDir' parameter to a larger S3 bucket.

E.Set the 'groupFiles' option to 'inPartition' in the S3 source configuration.

Answers: A, E

More DPUs and shuffle partitions distribute data across more executors, reducing per-executor memory load.

Why this answer

Option A is correct because increasing the number of DPUs (Data Processing Units) provides more memory and compute resources to the Glue job, directly addressing the MemoryError. Setting 'spark.sql.shuffle.partitions' to a higher value reduces the amount of data shuffled per partition, preventing out-of-memory errors during wide transformations like joins or aggregations. Option E also helps: setting 'groupFiles' to 'inPartition' lets each task read many small S3 files together, which reduces the number of tasks and the memory spent tracking them.

Exam trap

The trap here is that candidates confuse 'coalesce(1)' (which reduces parallelism) with a memory-saving technique, or mistakenly think decreasing DPUs or adjusting RedshiftTempDir can fix memory errors, when in fact memory errors require more resources and better partition management.


1746

MCQ · easy

A data engineer has an IAM policy attached to an IAM role used by an AWS Glue job. The Glue job needs to read from S3 bucket 'data-bucket' and write to the same bucket. The job fails with an access denied error when trying to write to S3. What is the issue?

A.The Glue job cannot assume the IAM role because of trust policy.

B.The policy does not include s3:ListBucket permission, which Glue may need.

C.The resource ARN for S3 is missing the bucket-level permission.

D.The actions for S3 are incorrect; s3:PutObject is not sufficient.

Answer: B

Glue may require ListBucket to navigate the bucket.

Why this answer

Option B is correct. The policy allows s3:PutObject on the objects in the bucket, but Glue jobs also need s3:ListBucket on the bucket itself for certain operations, such as listing objects. A missing permission to decrypt the KMS key is another common cause of access denied errors when a bucket uses SSE-KMS, but the exhibit does not mention KMS.

Option A is wrong because Glue can assume the role. Option C is wrong because the resource is correct. Option D is wrong because the actions are correct.
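To make the bucket-level versus object-level distinction concrete, here is a sketch of a policy document that grants both. The exhibit's actual policy is not reproduced in this page, so the statement layout and helper name below are assumptions for illustration only.

```python
def build_glue_s3_policy(bucket: str) -> dict:
    """Build an IAM policy with list access on the bucket and read/write on its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket applies to the bucket ARN (no trailing /*).
                "Sid": "ListTheBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object actions apply to the object ARN pattern.
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = build_glue_s3_policy("data-bucket")
```

The two statements use different resource ARNs on purpose: listing is a bucket operation, while reads and writes are object operations.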


1747

MCQ · easy

A startup is ingesting event data from a mobile app into an Amazon Kinesis Data Streams stream with 2 shards. Each shard can ingest up to 1 MB/s or 1000 records/s. The app sends about 800 records per second with an average record size of 1.5 KB. The data engineer notices that the stream is throttling some records, resulting in data loss. The engineer needs to ensure that all records are ingested without changing the application code. What should the data engineer do?

A.Reduce the average record size to below 1 KB by compressing data on the client side.

B.Switch from Kinesis Data Streams to Kinesis Data Firehose for ingestion.

C.Increase the number of shards in the Kinesis stream to 3.

D.Enable enhanced fan-out on the stream to provide dedicated throughput to each consumer.

Answer: C

Adding shards increases total ingestion capacity.

Why this answer

Option C is correct because the current throughput is 800 records/s × 1.5 KB = 1.2 MB/s, exceeding the 1 MB/s per shard limit. Adding a shard increases capacity. Option A is wrong because reducing the record size requires a change to the application code.

Option B is wrong because switching to Kinesis Data Firehose would change how the application sends data and does not address the stream's ingestion limit. Option D is wrong because enhanced fan-out is for consumers, not producers; the bottleneck is ingestion, not processing.


1748

Multi-Select · hard

A company is migrating a legacy data warehouse to Amazon Redshift. They need to choose a distribution style to minimize data movement during joins. Which THREE factors should they consider?

Select 3 answers

A.The size of the table (number of rows).

B.The join frequency with other tables on specific columns.

C.The number of columns in the table.

D.Whether the table is a fact or dimension table.

E.The data type of the distribution key column.

Answers: A, B, D

Large tables need careful distribution to avoid skew.

Why this answer

Option A is correct because the size of the table (number of rows) directly influences the distribution strategy. In Amazon Redshift, large tables benefit from a distribution style that evenly distributes rows across slices to avoid data skew, which can cause performance bottlenecks during joins. Choosing a distribution key that aligns with the join columns minimizes data movement, but the table size determines whether an ALL distribution (for small tables) or a KEY distribution (for large tables) is more appropriate to reduce shuffling.

Exam trap

The trap here is that candidates may overthink irrelevant table properties like column count or data types, while the core considerations for minimizing data movement are table size, join frequency, and table role (fact vs. dimension).


1749

MCQ · hard

A company uses Amazon DynamoDB for a gaming application. The table has a partition key of 'user_id' and a sort key of 'game_timestamp'. The application frequently queries by 'user_id' and filters by 'game_timestamp' within a specific date range. The queries are slow. The table has a global secondary index (GSI) on 'game_timestamp'. What is the most likely cause of the slow queries?

A.The GSI has insufficient read capacity.

B.The GSI is used instead of the base table for queries on 'user_id'.

C.A hot partition exists due to uneven access pattern on 'user_id'.

D.The sort key is not used in the query.

Answer: C

If a few 'user_id' values are accessed frequently, they create hot partitions, slowing queries.

Why this answer

Option C is correct. Querying by 'user_id' with a range condition on 'game_timestamp' uses the base table's partition key and sort key, which is normally efficient, so these queries do not go through the GSI. That makes Option A unlikely (the GSI's read capacity is not in the path of base-table queries) and Option B inconsistent with the access pattern in the stem. Option D is unlikely because the stem says the queries already filter on the sort key.

With the key design ruled out, the most likely cause is a hot partition: if 'user_id' access is skewed, a few partition key values receive most of the traffic and their queries slow down.


1750

MCQ · medium

A data engineer is building a data pipeline that ingests data from Amazon S3 into Amazon Redshift. The data is in CSV format and includes a timestamp column. The pipeline should load only new data incrementally. Which approach is most efficient?

A.Use the COPY command to load the entire bucket and rely on Redshift to deduplicate

B.Use the COPY command with a manifest file that lists only the new S3 objects

C.Use Amazon Redshift Spectrum to query the S3 data directly without loading

D.Use INSERT statements within a loop to load each new file

Answer: B

A manifest file allows incremental loading by specifying only new files.

Why this answer

Option B is correct because Redshift COPY can load from S3 incrementally by specifying a manifest file that lists only the new objects. Option A is wrong because COPY from an entire bucket reloads all data. Option D is wrong because INSERT is inefficient for large volumes.

Option C is wrong because Spectrum queries data without loading into Redshift tables.
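The manifest-based load can be sketched as two small helpers: one that builds the manifest JSON listing only the new objects, and one that builds the matching COPY statement. The table name, S3 URLs, and role ARN are placeholders, and how the pipeline tracks which objects are new is left outside this example.

```python
import json


def build_copy_manifest(new_object_urls: list) -> dict:
    """Build a Redshift COPY manifest that lists only the objects to load."""
    return {
        "entries": [{"url": url, "mandatory": True} for url in new_object_urls]
    }


def build_copy_statement(table: str, manifest_url: str, iam_role_arn: str) -> str:
    """Build a COPY statement that reads the file list from a manifest."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV MANIFEST;"
    )


manifest = build_copy_manifest([
    "s3://example-bucket/events/2024-01-02/part-0000.csv",  # placeholder objects
    "s3://example-bucket/events/2024-01-02/part-0001.csv",
])
manifest_json = json.dumps(manifest)
sql = build_copy_statement(
    "events",
    "s3://example-bucket/manifests/2024-01-02.manifest",
    "arn:aws:iam::111122223333:role/example-redshift-role",
)
```

The manifest JSON would be uploaded to the manifest URL before the COPY statement runs, so each load touches only the files named in it.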


1751

MCQ · easy

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by a custom consumer application that writes to Amazon S3 every 5 minutes. The consumer is falling behind and processing lag is increasing. Which action is MOST effective to reduce the lag?

A.Switch to Amazon Kinesis Data Firehose to deliver data directly to S3

B.Increase the batch size of records written to S3

C.Increase the number of shards in the Kinesis stream

D.Reduce the retention period of the stream

Answer: C

More shards increase parallelism and throughput, allowing the consumer to keep up.

Why this answer

The consumer is falling behind because the stream's throughput capacity is insufficient for the incoming data volume. Increasing the number of shards in the Kinesis stream directly increases the total read capacity (each shard provides 2 MB/s read throughput and 5 transactions/second), allowing the consumer to process more data in parallel and reduce lag.

Exam trap

The trap here is that candidates often confuse throughput scaling with batch size or delivery destination changes, but the only way to increase read throughput from a Kinesis stream is to increase the number of shards or use enhanced fan-out.

How to eliminate wrong answers

Option A is wrong because switching to Kinesis Data Firehose does not change the underlying stream's throughput; Firehose is a delivery service that still reads from the same shards, so it would not resolve the consumer's processing lag. Option B is wrong because increasing the batch size written to S3 only affects the write operation to S3, not the consumer's ability to read from the stream faster; the bottleneck is the consumer's read throughput, not the S3 write batch size. Option D is wrong because reducing the retention period does not increase read throughput; the default of 24 hours is already the minimum, and shorter retention would only make data expire sooner, which risks data loss and does not help the consumer catch up.
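
The per-shard arithmetic can be sketched as follows. The function name and the sample rates are made up for illustration; the 2 MB/s figure is the per-shard read throughput quoted in the explanation above.

```python
import math

READ_MB_PER_SHARD = 2.0  # per-shard read throughput quoted in the explanation


def shards_needed_for_read(consumer_read_mb_per_s):
    """Smallest shard count whose combined read capacity covers the given rate."""
    return max(1, math.ceil(consumer_read_mb_per_s / READ_MB_PER_SHARD))


# A consumer that must read 7 MB/s needs at least 4 shards (4 x 2 MB/s = 8 MB/s).
needed = shards_needed_for_read(7)
```

This only covers the read side; write limits and record-count limits would need the same kind of check before resharding.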

Full explanation →

1752

MCQmedium

A data engineer is ingesting data from an Amazon RDS for PostgreSQL database into Amazon S3 using AWS Glue. The Glue job reads the entire table each time it runs, which takes several hours. The team wants to reduce the job duration by reading only new or updated records. Which approach should the engineer adopt?

A.Enable job bookmarks in AWS Glue and use a column with timestamps as the bookmark key to read only incremental data.

B.Partition the table in the source database by date and read only the latest partition.

C.Increase the number of Glue workers to improve parallel reads.

D.Use Amazon Kinesis Data Streams to capture changes from PostgreSQL.

AnswerA

Glue bookmarks track processed records; using a timestamp column allows incremental reads.

Why this answer

Option A is correct because Glue job bookmarks track which data has already been processed; with a JDBC source, a column such as last_updated can serve as the bookmark key so that each run reads only rows beyond the previous run's high-water mark. Option C (increasing workers) adds parallelism but still reads the whole table.

Option B (partitioning the source table) requires changes to the source database and does not by itself track which records the job has already processed. Option D (using Kinesis) is real-time but overkill for periodic ingestion.
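
A minimal sketch of the bookmark settings follows, assuming the table is registered in the Glue Data Catalog and has a last_updated column. The helper name, database, and table are hypothetical; the block only builds the arguments, because the Glue runtime is not available outside a Glue job.

```python
# Job argument that turns bookmarks on for the job run.
JOB_ARGUMENTS = {"--job-bookmark-option": "job-bookmark-enable"}


def bookmark_read_kwargs(database, table_name, bookmark_column):
    """Keyword arguments for a bookmarked catalog read (hypothetical helper).

    Inside a Glue job these would be passed as
    glueContext.create_dynamic_frame.from_catalog(**kwargs), with job.init()
    called before the read and job.commit() after the write so that the
    bookmark state is saved.
    """
    return {
        "database": database,
        "table_name": table_name,
        # Bookmark state is stored per transformation_ctx, so it must be set.
        "transformation_ctx": f"read_{table_name}",
        "additional_options": {
            "jobBookmarkKeys": [bookmark_column],
            "jobBookmarkKeysSortOrder": "asc",
        },
    }


kwargs = bookmark_read_kwargs("sales_db", "orders", "last_updated")
```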

Full explanation →

1753

MCQeasy

A company uses Amazon Redshift for its data warehouse. The data engineering team loads data daily from Amazon S3 using COPY commands. Recently, the load performance has degraded because the S3 bucket contains many small files. The team needs to optimize the COPY operation to improve performance. Which approach should they take?

A.Use Redshift Spectrum to query the data directly from S3 without loading.

B.Increase the number of nodes in the Redshift cluster.

C.Use a manifest file that lists only the necessary files, and consolidate small files into larger ones before loading.

D.Enable automatic compression on the Redshift table.

AnswerC

Fewer large files improve COPY performance.

Why this answer

Option C is correct because the performance degradation is caused by the overhead of processing many small files during the COPY command. Consolidating small files into larger ones (e.g., 100 MB–1 GB each) reduces the number of S3 GET requests and the metadata overhead on Redshift, directly improving load throughput. Using a manifest file further optimizes by explicitly listing only the required files, avoiding unnecessary S3 list operations.

Exam trap

The trap here is that candidates often confuse scaling the cluster (Option B) with optimizing data ingestion, failing to recognize that the bottleneck is the number of S3 objects, not the cluster's compute capacity.

How to eliminate wrong answers

Option A is wrong because Redshift Spectrum queries data in place from S3 without loading it into Redshift tables, which does not optimize the COPY operation for loading data into the warehouse. Option B is wrong because increasing the number of nodes adds compute and storage capacity but does not address the root cause of many small files; the COPY command still suffers from the same per-file overhead regardless of cluster size. Option D is wrong because automatic compression (via the COPY command with the COMPUPDATE option) optimizes column encoding for storage efficiency, not the file-level I/O performance during the load process.
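
One way to plan the consolidation step is to group the small objects into batches of roughly a target size before merging them. The sketch below is a simple greedy grouping with made-up keys and sizes; it only plans the batches and does not read or write S3.

```python
def plan_batches(files, target_bytes):
    """Group (key, size_bytes) pairs, in order, into batches of about target_bytes."""
    batches, current, current_size = [], [], 0
    for key, size in files:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [("a.csv", 40), ("b.csv", 40), ("c.csv", 40), ("d.csv", 100), ("e.csv", 10)]
batches = plan_batches(small_files, target_bytes=100)
```

Each planned batch would then be concatenated into one larger object and listed in the manifest in place of the original small files.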

Full explanation →

1754

MCQeasy

A company uses Amazon S3 to store raw data and AWS Lambda to process files as they arrive. The Lambda function sometimes times out when processing large files. The team wants to improve reliability and scalability. Which approach should the team take?

A.Replace Lambda with AWS Batch and use S3 event notifications to trigger the batch job.

B.Use Amazon S3 event notifications to send events to an Amazon SNS topic, which triggers Lambda.

C.Increase the Lambda function timeout to 15 minutes and memory to 3 GB.

D.Use Amazon S3 event notifications to send events to an Amazon SQS queue, and then have Lambda poll the queue in batches.

AnswerD

Decoupling with SQS allows Lambda to process at its own pace.

Why this answer

Option D is correct because S3 event notifications to SQS decouple the producer and consumer, allowing Lambda to poll at its own pace and process in batches. Option B is wrong because SNS is push-based and can still cause timeouts. Option C is wrong because increasing the Lambda timeout is a temporary fix.

Option A is wrong because AWS Batch is for long-running batch jobs, not event-driven processing.
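
A Lambda handler for this pattern might look like the sketch below. It assumes each SQS message body holds one S3 event notification and that partial batch responses (ReportBatchItemFailures) are enabled on the event source mapping; process_object is a hypothetical stand-in for the real per-file work.

```python
import json
from urllib.parse import unquote_plus


def process_object(bucket, key):
    """Hypothetical placeholder for the real per-file processing."""
    if not key:
        raise ValueError("empty object key")


def handler(event, context):
    """Process a batch of SQS messages and report only the failed ones."""
    failures = []
    for message in event.get("Records", []):
        try:
            s3_event = json.loads(message["body"])
            for record in s3_event.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                # Object keys arrive URL-encoded in S3 event notifications.
                key = unquote_plus(record["s3"]["object"]["key"])
                process_object(bucket, key)
        except Exception:
            # Only this message returns to the queue for a retry.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}
```

Reporting failures per message means that one bad file does not force the whole batch to be retried.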

Full explanation →

1755

MCQeasy

A data engineer is troubleshooting a failed AWS Glue Crawler. The crawler logs show 'Insufficient permissions to access S3 bucket'. What should the engineer do to resolve this?

A.Grant the crawler's IAM user access to the bucket

B.Attach a VPC endpoint to the S3 bucket

C.Enable S3 default encryption on the bucket

D.Update the IAM role used by the crawler to include S3 read permissions

AnswerD

The role must have s3:GetObject and s3:ListBucket.

Why this answer

The AWS Glue Crawler uses an IAM role to access data sources. The error 'Insufficient permissions to access S3 bucket' indicates that the IAM role attached to the crawler lacks the necessary S3 read permissions (e.g., s3:GetObject, s3:ListBucket). Updating the IAM role's policy to include these permissions resolves the issue, as the crawler operates under that role, not under a specific IAM user.

Exam trap

The trap here is that candidates may confuse the crawler's execution context with an IAM user, leading them to choose Option A, but AWS Glue Crawlers always run under an IAM role, not a user.

How to eliminate wrong answers

Option A is wrong because AWS Glue Crawlers do not use an IAM user for execution; they use an IAM role. Granting access to an IAM user would not affect the crawler's permissions. Option B is wrong because a VPC endpoint enables private connectivity between a VPC and S3 but does not grant or modify IAM permissions; the error is about authorization, not network connectivity.

Option C is wrong because enabling S3 default encryption controls server-side encryption settings and does not affect IAM permission policies; the crawler still needs explicit read access regardless of encryption.
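
For illustration, a policy granting the two permissions named above could be built as in the sketch below. The bucket name is a placeholder, and a real crawler role would also need its usual Glue permissions, which are out of scope here.

```python
import json


def crawler_s3_read_policy(bucket):
    """IAM policy document granting list and read access to a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # GetObject applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy_json = json.dumps(crawler_s3_read_policy("example-data-lake"), indent=2)
```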

Full explanation →

1756

MCQmedium

A company runs a data pipeline using AWS Lambda to process records from an Amazon Kinesis Data Stream. Recently, the Lambda function has been experiencing high invocation errors and the stream is throttling. The function performs simple transformations and writes to Amazon S3. What is the most effective way to reduce throttling and errors?

A.Increase the Lambda function timeout.

B.Enable provisioned concurrency on the Lambda function.

C.Increase the number of shards in the Kinesis stream.

D.Increase the batch size in the Lambda event source mapping.

AnswerD

Larger batch sizes mean fewer invocations, reducing throttling and errors.

Why this answer

Increasing the batch size in the Lambda event source mapping allows each invocation to process more records from the Kinesis stream, reducing the number of total invocations. This lowers the rate at which Lambda polls the stream, which decreases the likelihood of hitting the Kinesis read throughput limits (5 transactions per second per shard) and reduces throttling errors. The simple transformations and S3 writes are likely I/O-bound, so larger batches improve throughput without increasing invocation concurrency.

Exam trap

The trap here is that candidates mistakenly believe throttling is caused by Lambda concurrency limits or cold starts, when in fact the root cause is the Kinesis stream's read throughput limit per shard, which is reduced by increasing the batch size in the event source mapping.

How to eliminate wrong answers

Option A is wrong because increasing the Lambda function timeout does not reduce the invocation rate or the number of concurrent executions; it only allows a single invocation to run longer, which does not address throttling caused by excessive polling or read throughput limits. Option B is wrong because provisioned concurrency pre-warms execution environments to reduce cold starts, but it does not reduce the number of invocations or the rate at which Lambda polls the Kinesis stream; it may even increase concurrency and exacerbate throttling. Option C is wrong because increasing the number of shards would increase the total read throughput capacity of the stream, but it does not reduce the per-shard invocation rate or the number of Lambda invocations; it could actually increase the total number of concurrent invocations, potentially worsening throttling if the batch size remains small.

Full explanation →

1757

MCQeasy

A company has a nightly batch job that processes 100 GB of data from an Amazon S3 bucket and loads it into an Amazon Redshift table. The job currently runs on an Amazon EMR cluster. Which service would reduce operational overhead while providing similar functionality?

A.AWS Database Migration Service

B.AWS Glue

C.Amazon Redshift Spectrum

D.Amazon Athena

AnswerB

Glue can run serverless ETL jobs on a schedule, reducing overhead.

Why this answer

AWS Glue is a serverless ETL service that can run scheduled jobs and reduce overhead compared to managing EMR clusters. Redshift Spectrum is for querying S3, not for ETL. Athena is for ad-hoc queries.

DMS is for database migration.

Full explanation →

1758

MCQmedium

A data engineer is running an Amazon Athena query that scans a large amount of data in Amazon S3, resulting in high costs. The data is stored in Parquet format in a partitioned table. Which strategy would be MOST effective in reducing the amount of data scanned?

A.Ensure the query includes a WHERE clause that filters on partition columns.

B.Convert the Parquet files to CSV format and apply GZIP compression.

C.Use S3 Intelligent-Tiering storage class to reduce storage costs.

D.Increase the number of partitions by adding more partition columns.

AnswerA

Partition pruning reduces the amount of data scanned.

Why this answer

Option A is correct because a WHERE clause on partition columns lets Athena prune partitions and scan only the relevant ones. Option B is incorrect because converting from Parquet to CSV would increase the data scanned; GZIP shrinks the files, but Athena must still read every row and column because CSV is not columnar. Option D is incorrect because adding partitions without filtering on them does not reduce the scan.

Option C is incorrect because S3 Intelligent-Tiering affects storage costs, not the amount of data Athena scans.
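
The effect of partition pruning can be pictured with a toy model. The partition values and sizes below are invented, and the function only mimics the idea of skipping partitions; it is not how Athena is implemented.

```python
def bytes_scanned(partitions, predicate=None):
    """Sum the sizes of the partitions that a query would have to read.

    partitions maps a partition value (here a date string) to its size in bytes.
    predicate stands in for a WHERE clause on the partition column.
    """
    return sum(
        size
        for value, size in partitions.items()
        if predicate is None or predicate(value)
    )


partitions = {"2024-01-01": 100, "2024-01-02": 150, "2024-01-03": 50}

full_scan = bytes_scanned(partitions)  # no filter on the partition column
pruned_scan = bytes_scanned(partitions, lambda dt: dt == "2024-01-02")
```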

Full explanation →

1759

MCQeasy

A data engineer is designing a data lake on Amazon S3 for storing raw sensor data. The data is append-only and accessed infrequently after 30 days. Compliance requires that data be retained for 7 years. Which S3 storage class is the MOST cost-effective for data older than 30 days?

A.S3 Standard-IA

B.S3 Glacier Deep Archive

C.S3 One Zone-IA

D.S3 Intelligent-Tiering

AnswerB

This is the lowest-cost storage class for long-term archival data with infrequent access.

Why this answer

B is correct because Amazon S3 Glacier Deep Archive is the most cost-effective storage class for data that is accessed infrequently and must be retained for long periods (7 years). For data older than 30 days, the retrieval time of 12 hours is acceptable given the append-only, infrequent access pattern, and the storage cost is significantly lower than other classes.

Exam trap

The trap here is that candidates often choose S3 Glacier Flexible Retrieval (not listed) or S3 Standard-IA, mistakenly thinking that faster retrieval is necessary for compliance data, when in fact the 12-hour retrieval time of Glacier Deep Archive is sufficient for infrequent access patterns and offers the lowest cost.

How to eliminate wrong answers

Option A is wrong because S3 Standard-IA is designed for infrequently accessed data but has higher storage costs than Glacier Deep Archive, making it less cost-effective for 7-year retention. Option C is wrong because S3 One Zone-IA does not provide the durability of 99.999999999% across multiple Availability Zones, which is critical for compliance-retained data, and its storage cost is higher than Glacier Deep Archive. Option D is wrong because S3 Intelligent-Tiering automatically moves data between tiers but incurs a monthly monitoring and automation fee per object, and it does not include a Deep Archive tier by default, so it would not achieve the lowest cost for data older than 30 days without manual configuration.
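
A lifecycle rule for this transition could be expressed as in the sketch below. The rule ID and the empty prefix are assumptions for the example; the dictionary has the shape that boto3's put_bucket_lifecycle_configuration accepts as its LifecycleConfiguration argument.

```python
def deep_archive_lifecycle(days_before_transition=30):
    """Lifecycle configuration that moves objects to Glacier Deep Archive."""
    return {
        "Rules": [
            {
                "ID": f"archive-after-{days_before_transition}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to every object
                "Transitions": [
                    {
                        "Days": days_before_transition,
                        "StorageClass": "DEEP_ARCHIVE",
                    }
                ],
            }
        ]
    }


lifecycle = deep_archive_lifecycle()
```

No expiration action is included, so objects stay in the archive tier until a retention policy decides otherwise.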

Full explanation →

1760

MCQeasy

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3. The data engineer wants to minimize operational overhead and avoid managing any servers. Which AWS service should the data engineer use?

A.Amazon Managed Streaming for Apache Kafka (Amazon MSK)

B.AWS Glue

C.Amazon Kinesis Data Analytics

D.Amazon Kinesis Data Streams

AnswerA

MSK is a fully managed Kafka service that can ingest data from on-premises Kafka.

Why this answer

Option A is correct. Amazon MSK is a fully managed Kafka service that can be used to ingest data from on-premises Kafka via mirroring or replication. Option C is wrong because Kinesis Data Analytics is for analyzing streaming data, not for ingestion.

Option D is wrong because Kinesis Data Streams is a different streaming service, not compatible with Kafka without custom connectors. Option B is wrong because Glue is for ETL, not for streaming ingestion from Kafka.

Full explanation →

1761

MCQhard

A company is running a production Amazon Aurora PostgreSQL database. The database experiences high write latency during peak hours. The data engineer suspects that the issue is due to a large number of small transactions. Which action would MOST effectively reduce write latency?

A.Enable parallel query for the database

B.Increase the instance size and use Provisioned IOPS storage

C.Enable Aurora Auto Scaling for read replicas

D.Enable Performance Insights to identify the bottleneck

AnswerB

Larger instances provide more CPU and memory, and Provisioned IOPS can reduce I/O latency, helping with write performance under high transaction loads.

Why this answer

Increasing the instance size and using Provisioned IOPS storage directly addresses high write latency by providing more CPU and memory resources to handle transaction processing, while Provisioned IOPS ensures consistent, low-latency I/O for write operations. This is the most effective action because small transactions create high I/O demand, and scaling up the instance with dedicated IOPS reduces contention and write queue depth.

Exam trap

The trap here is that candidates often confuse scaling read replicas (which only help read scaling) with solving write latency, or they mistake monitoring tools (like Performance Insights) for performance fixes, when the real solution is to provision more write capacity through larger instances and dedicated IOPS.

How to eliminate wrong answers

Option A is wrong because parallel query is designed for read-heavy analytical queries, not for reducing write latency from small transactions; it does not improve write throughput or I/O performance. Option C is wrong because Aurora Auto Scaling for read replicas only scales read capacity, not write capacity; write latency is a primary node issue and read replicas do not offload write operations. Option D is wrong because Performance Insights is a monitoring and diagnostic tool that helps identify bottlenecks but does not directly reduce write latency; it provides visibility but no performance improvement.

Full explanation →

1762

Multi-Selecthard

A company must encrypt all data at rest in their Amazon RDS for MySQL instance. Which THREE steps are required to achieve this? (Select THREE.)

Select 3 answers

A.Enable SSL/TLS for database connections

B.Use an AWS KMS key to encrypt the instance

C.Enable encryption at rest when creating the DB instance

D.Modify the DB parameter group to require encryption

E.Ensure that automated backups and snapshots are encrypted

AnswersB, C, E

KMS key is used for encryption at rest.

Why this answer

Options B, C, and E are correct. Option C (enable encryption at rest when creating the DB instance) is required. Option B (use an AWS KMS key) is needed to manage the encryption keys.

Option E (ensure that automated backups and snapshots are encrypted) is necessary because encrypted instances require encrypted backups. Option A (enable SSL/TLS) is for encryption in transit, not at rest. Option D (modify the DB parameter group) does not enable encryption.

Full explanation →

1763

Multi-Selecthard

A company is running a Redshift cluster and wants to improve query performance for a frequently used dashboard. Which THREE approaches are recommended?

Select 3 answers

A.Enable concurrency scaling

B.Apply column compression encoding

C.Define sort keys on columns used in WHERE clauses

D.Add more nodes to the cluster

E.Choose an appropriate distribution key for large tables

AnswersB, C, E

Reduces I/O and storage.

Why this answer

Option E is correct because distribution keys reduce data movement. Option C is correct because sort keys enable range-restricted scans. Option B is correct because compression reduces I/O.

Option D is wrong because adding nodes increases cost and may not be needed. Option A is wrong because concurrency scaling addresses concurrent queries, not single-query speed.

Full explanation →

1764

Multi-Selecteasy

Which TWO AWS services can be used to ingest streaming data into Amazon S3? (Choose two.)

Select 2 answers

A.Amazon S3 Transfer Acceleration

B.Amazon Managed Streaming for Apache Kafka (Amazon MSK)

C.Amazon Kinesis Data Firehose

D.Amazon Elastic Block Store (Amazon EBS)

E.AWS Snowball

AnswersB, C

MSK can stream data to S3 via Kafka Connect S3 sink.

Why this answer

Option B and Option C are correct. Kinesis Data Firehose can directly deliver streaming data to S3. Amazon MSK can be used with Kafka Connect to sink data into S3.

Option A is wrong because S3 Transfer Acceleration accelerates uploads but is not a streaming ingestion service. Option E is wrong because Snowball is for offline data transfer. Option D is wrong because EBS volumes are block storage attached to EC2.

Full explanation →

1765

Multi-Selecteasy

A data engineer is designing a real-time streaming pipeline to ingest clickstream data from a website into Amazon S3. The data must be transformed before storage. Which TWO AWS services can be used together to build this pipeline? (Choose TWO.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.AWS Glue

C.Amazon Kinesis Data Streams

D.Amazon S3 Transfer Acceleration

E.AWS Database Migration Service (DMS)

AnswersA, C

Delivers streaming data to S3 with transformation capabilities.

Why this answer

Option A and Option C are correct. Amazon Kinesis Data Streams (C) can ingest streaming data, and Amazon Kinesis Data Firehose (A) can deliver the data to S3 with optional transformation via Lambda. Option D (S3 Transfer Acceleration) is for accelerating uploads to S3, not for streaming.

Option E (AWS DMS) is for database migration. Option B (AWS Glue) is for batch ETL, not real-time.
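
The optional Lambda transformation could look roughly like the sketch below. It assumes each record is a JSON object, and the lowercasing of a "page" field is an invented example transform; the response shape (recordId, result, base64-encoded data) is what Firehose expects back from a transformation function.

```python
import base64
import json


def handler(event, context):
    """Transform each Firehose record and return it in the expected shape."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Example transform (an assumption): normalize the page path.
            payload["page"] = payload.get("page", "").lower()
            line = json.dumps(payload) + "\n"  # newline-delimited JSON in S3
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, AttributeError, TypeError):
            # Hand back the original data and mark the record as failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Every incoming recordId must appear exactly once in the response, which is why failed records are returned rather than dropped.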

Full explanation →

1766

MCQeasy

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration completes successfully, but the target database has inconsistent data. What should the team do to ensure data consistency?

A.Use 'Limited LOB mode' and set the maximum LOB size to a higher value.

B.Enable 'Full LOB mode' in the DMS task settings.

C.Restart the DMS task after truncating the target tables.

D.Configure the DMS task to use 'Full LOB mode' with parallel threads and enable 'BatchApply'.

AnswerD

This ensures all LOBs are migrated and applied efficiently.

Why this answer

Option D is correct because using Full LOB mode with parallel threads improves consistency and performance. Option B is wrong because Full LOB mode on its own can be slow, although it does not cause inconsistency. Option A is wrong because Limited LOB mode truncates LOBs that exceed the configured maximum size.

Option C is wrong because restarting the task is not a solution for inconsistency.

Full explanation →

1767

MCQmedium

A company uses AWS Glue ETL jobs to transform data from Amazon S3 to Amazon Redshift. The job reads JSON files, applies schema mapping, and writes to a Redshift table. Recently, the job started failing with memory errors. The data volume has increased tenfold. Which approach should a data engineer take to resolve this issue with minimal code changes?

A.Switch from Spark to Python Shell job type.

B.Implement batch processing with smaller file sizes.

C.Increase the number of DPUs allocated to the Glue job.

D.Use Redshift Spectrum to query data directly from S3.

AnswerC

Provides more resources for processing.

Why this answer

Option C is correct because increasing the number of DPUs (Data Processing Units) allocated to the AWS Glue job directly addresses the memory constraint caused by a tenfold increase in data volume. Glue ETL jobs run on Apache Spark, which distributes data processing across executors; more DPUs provide more memory and compute capacity, allowing the job to handle larger datasets without code changes.

Exam trap

The trap here is that candidates may assume memory errors always require code optimization (e.g., batching or partitioning), but the question explicitly asks for minimal code changes, making resource scaling the correct answer.

How to eliminate wrong answers

Option A is wrong because switching from Spark to Python Shell job type would reduce parallelism and memory capacity, as Python Shell runs on a single node with limited resources, making it unsuitable for large-scale data transformations. Option B is wrong because implementing batch processing with smaller file sizes would require significant code changes to split and manage files, contradicting the 'minimal code changes' requirement, and does not address the root cause of insufficient memory allocation. Option D is wrong because using Redshift Spectrum to query data directly from S3 bypasses the Glue ETL job entirely, which is a different architectural approach that does not resolve the memory error in the existing Glue job and may introduce new costs and complexity.

Full explanation →

1768

MCQeasy

A data engineer is tasked with setting up a data pipeline that moves data from an on-premises Oracle database to Amazon S3 every hour. The network bandwidth is limited, and the engineer needs to ensure data consistency. Which AWS service should the engineer use?

A.AWS DataSync.

B.Amazon Kinesis Data Firehose.

C.S3 Transfer Acceleration.

D.AWS Database Migration Service (DMS) with change data capture (CDC).

AnswerD

DMS supports continuous replication and ensures data consistency via CDC.

Why this answer

AWS DMS with CDC is the correct choice because it can continuously replicate ongoing changes from an on-premises Oracle database to Amazon S3 while ensuring data consistency. CDC captures only the incremental changes (inserts, updates, deletes) after an initial full load, minimizing the data transferred over limited bandwidth and maintaining transactional integrity.

Exam trap

The trap here is that candidates often confuse AWS DataSync (a file-transfer service) with database replication, or assume S3 Transfer Acceleration can solve bandwidth issues without addressing the need for change data capture and consistency from a live database.

How to eliminate wrong answers

Option A is wrong because AWS DataSync is designed for large-scale file and object transfers between on-premises storage and AWS, not for streaming database changes from a relational database like Oracle. Option B is wrong because Amazon Kinesis Data Firehose is a streaming ingestion service for real-time data into S3, but it cannot directly connect to an on-premises Oracle database or perform change data capture. Option C is wrong because S3 Transfer Acceleration only speeds up uploads to S3 over the public internet by using AWS edge locations; it does not handle database replication, CDC, or data consistency from an on-premises source.

Full explanation →

1769

Multi-Selectmedium

A company needs to enforce encryption at rest for all data stored in Amazon S3. Which of the following are valid methods to achieve this? (Choose TWO.)

Select 2 answers

A.Use Amazon S3 Transfer Acceleration.

B.Enable default bucket encryption using SSE-S3.

C.Enable S3 Versioning.

D.Use client-side encryption before uploading objects.

E.Use SSL/TLS for all S3 API calls.

AnswersB, D

Default bucket encryption ensures all objects are encrypted at rest with SSE-S3.

Why this answer

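Server-side encryption with S3 managed keys (SSE-S3) and client-side encryption are both methods for encryption at rest. Option A is wrong because S3 Transfer Acceleration speeds up transfers and does not encrypt stored data. Option C is wrong because S3 Versioning preserves object versions and does not encrypt them.

Option E is wrong because SSL/TLS protects data in transit, not at rest. Correct: B and D.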

Full explanation →

1770

MCQhard

A data engineer is troubleshooting an AWS Glue crawler that is not correctly inferring the schema of CSV files stored in Amazon S3. The files have headers, but the crawler is treating the header row as data. The crawler is configured with a custom classifier that has a CSV classifier with 'Column header' set to 'Use first row as header'. What is the most likely reason the crawler is not recognizing the header?

A.The CSV classifier's 'Quote symbol' setting does not match the files.

B.The CSV files have a varying number of columns across rows.

C.The CSV files have a different delimiter than the default comma.

D.The header row contains uppercase letters.

AnswerA

If the classifier expects a quote symbol but the files have none, the classifier may not apply, causing the crawler to treat header as data.

Why this answer

The CSV classifier may not be applied if the file does not match the classifier's 'Quote symbol' or 'Allow single column' settings. Option A is correct because if the classifier expects a quote symbol but the file has none, it may not match. Option C is wrong because nothing in the scenario indicates a non-default delimiter.

Option D is wrong because the crawler does not care whether the header contains uppercase letters. Option B is wrong because the number of columns is not a header recognition issue.

Full explanation →

1771

MCQmedium

Refer to the exhibit. A data engineer applies the following S3 bucket policy to an S3 bucket. What does this policy enforce?

A.Denies all uploads unless SSE-S3 is used

B.Allows only SSE-S3 encrypted uploads

C.Allows any type of server-side encryption

D.Requires that all objects uploaded to the bucket be encrypted with SSE-KMS

AnswerD

Denies PutObject if encryption header is not KMS.

Why this answer

The policy denies s3:PutObject if the object is not encrypted with SSE-KMS, so Option D is correct. Option A is wrong because the policy denies uploads that are not SSE-KMS; it does not require SSE-S3. Option B is wrong because it does not allow only SSE-S3 uploads.

Option C is wrong because it does not allow any type of server-side encryption.
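
The exhibit is not reproduced on this page. A policy with the effect described in the explanation might look like the sketch below, which is an assumption about its shape rather than the actual exhibit; the bucket name and statement ID are placeholders.

```python
import json


def require_sse_kms_policy(bucket):
    """Bucket policy that denies uploads not requesting SSE-KMS encryption."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    # Matches any upload whose encryption header is not aws:kms.
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


kms_policy_json = json.dumps(require_sse_kms_policy("example-bucket"), indent=2)
```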

Full explanation →

1772

MCQeasy

A data engineer needs to store encryption keys used for protecting data in Amazon S3 and automatically rotate them every year. Which service should be used?

A.AWS KMS

B.AWS CloudHSM

C.AWS Certificate Manager

D.AWS Secrets Manager

AnswerA

KMS provides automatic key rotation.

Why this answer

Option A is correct. AWS KMS supports automatic key rotation for customer managed keys. Option B is wrong because CloudHSM does not provide automatic rotation.

Option D is wrong because Secrets Manager is for secrets. Option C is wrong because ACM is for certificates.

Full explanation →

1773

Multi-Selectmedium

A data engineer is designing a data lake on Amazon S3 that will store sensitive financial data. The engineer needs to implement encryption at rest and ensure that only authorized users can access the data. Which TWO actions should the engineer take to meet these requirements? (Choose TWO.)

Select 2 answers

A.Configure a bucket policy that denies writes if the object is not encrypted.

B.Use server-side encryption with customer-provided keys (SSE-C).

C.Enable S3 Transfer Acceleration for the bucket.

D.Enable object-level access control lists (ACLs).

E.Create IAM policies that grant least privilege access to users.

AnswersA, E

Bucket policies can enforce encryption and control access.

Why this answer

Option A is correct because S3 bucket policies can be used to enforce encryption and control access. Option E is correct because IAM policies define user permissions. Option B is incorrect because SSE-C is not recommended as it requires managing your own keys.

Option C is incorrect because S3 Transfer Acceleration is for speed, not security. Option D is incorrect because ACLs are legacy and not recommended for access control.

Full explanation →

1774

Multi-Selecthard

A data engineer is building a data ingestion pipeline using AWS Glue. The source is an Amazon DynamoDB table, and the target is an Amazon S3 data lake in Parquet format. The pipeline must handle large volumes and ensure exactly-once processing. Which THREE features should the engineer use together to achieve this? (Choose THREE.)

Select 3 answers

A.Use Amazon Kinesis Data Streams to capture DynamoDB Streams changes.

B.Configure the Glue job to convert data to Parquet format.

C.Use Amazon S3 Object Lambda to transform data on the fly.

D.Enable job bookmarks in the Glue job to track processed items.

E.Use DynamoDB's export to S3 feature to get a full snapshot.

AnswersB, D, E

Parquet is columnar and efficient for analytics.

Why this answer

Options B, D, and E are correct. Option D: Glue job bookmarks track processed data to avoid duplicates. Option E: Using the DynamoDB export feature to S3 is efficient for large volumes and provides a consistent snapshot.

Option B: Converting to Parquet in the Glue job is a common pattern. Option A is wrong because Kinesis Data Streams is for streaming, not for DynamoDB bulk export. Option C is wrong because S3 Object Lambda is for modifying data on read, not for ingestion.

Full explanation →

1775

MCQeasy

A company wants to enforce that all data in Amazon S3 is encrypted at rest. They want to automatically reject any PUT request that does not include encryption headers. What S3 feature should they use?

A.Bucket policy with a condition for encryption headers

B.Default encryption

C.MFA Delete

D.S3 Block Public Access

AnswerA

A bucket policy can deny requests that lack the required encryption header, enforcing encryption.

Why this answer

S3 bucket policies can be used to deny requests that do not include the x-amz-server-side-encryption header. This enforces encryption. Option B is wrong because default encryption only encrypts objects that are uploaded without encryption headers; it does not reject unencrypted requests.

Option D is wrong because S3 Block Public Access is about public access, not encryption. Option C is wrong because MFA Delete is about deletion protection.
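
A minimal sketch of such a policy follows, assuming a placeholder bucket name and statement ID. The Null condition is true when the x-amz-server-side-encryption header is absent from the request, so the statement denies exactly those uploads.

```python
import json


def deny_unencrypted_uploads_policy(bucket):
    """Bucket policy that rejects PutObject requests without an encryption header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyMissingEncryptionHeader",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    # "true" means: the key is missing from the request.
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


header_policy_json = json.dumps(deny_unencrypted_uploads_policy("example-bucket"), indent=2)
```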

Full explanation →

1776

MCQmedium

A data engineer uses AWS Glue to process data from S3. The Glue job frequently fails with 'Out of Memory' errors. The job reads several large compressed files. What is the MOST effective way to resolve this issue without changing the code?

A.Increase the number of G.1X workers or use G.2X workers

B.Convert the compressed files to uncompressed format before processing

C.Repartition the data to fewer partitions

D.Increase the job timeout setting

AnswerA

More workers or higher memory workers provide more heap space for processing.

Why this answer

Option A is correct because adding G.1X workers increases the total memory available to the job, and G.2X workers provide more memory per worker. Option B (converting the files to an uncompressed format) may help but does not directly address memory. Option C (repartitioning) requires a code change.

Option D (increasing timeout) does not fix memory.

Full explanation →

1777

MCQeasy

A data engineer is setting up a data ingestion pipeline using Amazon Kinesis Data Firehose to deliver web server logs to Amazon S3. The logs are in JSON format and the engineer wants to convert them to Parquet format. The engineer has configured a Glue table for the schema. However, when testing, the Firehose delivery stream fails with 'Error converting to Parquet'. The engineer checks the Glue table schema and notices that it includes a column 'timestamp' of type 'string' in the format 'yyyy-MM-dd HH:mm:ss'. The logs have a 'timestamp' field in the same format. What is the MOST likely cause of the failure?

A.Firehose does not support Parquet conversion.

B.The Glue table schema does not match the data schema exactly.

C.The S3 bucket lacks write permissions for Firehose.

D.The 'timestamp' column must be of type 'timestamp' instead of 'string'.

AnswerB

Mismatch between schema and data causes conversion failure.

Why this answer

Option B is correct because Firehose requires the schema to match the data exactly. If the Glue table schema is not consistent with the data, conversion fails. Option A is wrong because Firehose supports Parquet conversion with a Glue schema.

Option C is wrong because S3 bucket permissions would cause a different error. Option D is wrong because string type is acceptable for conversion.

Full explanation →

1778

MCQmedium

A company is ingesting streaming data from thousands of IoT devices into Amazon Kinesis Data Streams. The data is processed by a Kinesis Data Analytics application. Recently, the application started reporting high iterator age (millisBehindLatest). Which action would BEST reduce the iterator age?

A.Decrease the data retention period of the Kinesis stream.

B.Increase the data retention period of the Kinesis stream.

C.Increase the record size limit in the Kinesis stream.

D.Increase the number of shards in the Kinesis stream.

AnswerD

More shards allow higher throughput and reduce the backlog, decreasing iterator age.

Why this answer

Option D is correct because increasing the number of shards increases throughput and reduces iterator age. Option B is incorrect because increasing retention does not affect processing speed. Option A is incorrect because decreasing retention may cause data loss.

Option C is incorrect as a larger record size could increase processing time.

Full explanation →

1779

MCQeasy

A company stores sensitive customer data in an S3 bucket. The data engineer needs to ensure that all data is encrypted at rest. Which S3 feature should be enabled?

A.S3 Versioning

B.S3 Block Public Access

C.Bucket policy requiring aws:SecureTransport

D.Default encryption

AnswerD

Default encryption automatically encrypts new objects.

Why this answer

Option D is correct because default encryption ensures all new objects are encrypted at rest, for example with SSE-S3 or SSE-KMS. Option B is wrong because S3 Block Public Access is a security feature for access control, not encryption. Option C is wrong because a bucket policy requiring aws:SecureTransport enforces encryption in transit, not at rest.

Option A is wrong because versioning does not encrypt data.

Full explanation →

1780

MCQeasy

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time and then stored in Amazon S3. Which AWS service should be used to perform the transformation?

A.AWS Glue

B.Amazon EMR

C.Amazon Kinesis Data Analytics

D.Amazon Athena

AnswerC

Kinesis Data Analytics provides real-time stream processing capabilities.

Why this answer

Option C is correct because Amazon Kinesis Data Analytics can process and transform streaming data in real-time using SQL or Apache Flink. Option A is wrong because AWS Glue is primarily a batch ETL service, not the intended tool for this real-time transformation. Option B is wrong because Amazon EMR is for big data processing, not real-time streaming transformation.

Option D is wrong because Amazon Athena is an interactive query service, not for real-time transformation.

Full explanation →

1781

MCQmedium

A company uses AWS DMS to migrate an on-premises PostgreSQL database to Amazon RDS for PostgreSQL. After initial load, ongoing replication is set up. The replication task shows 'Task status: failed with error: The specified LSN is not available in the source database logs.' What is the most likely cause?

A.DMS does not support PostgreSQL as a source for ongoing replication.

B.The source database's network security group blocks outbound traffic to DMS.

C.The full load was incomplete, preventing CDC from starting.

D.The source database's WAL retention period is too short, and required logs have been purged.

AnswerD

DMS uses WAL logs for CDC; if logs are purged, replication fails.

Why this answer

The error indicates that the required WAL logs have been removed or are not accessible. Option D is correct because DMS needs WAL logs for CDC, and a retention period that is too short lets them be purged before DMS reads them. Option B is wrong because network connectivity would cause a different error.

Option A is wrong because DMS supports PostgreSQL. Option C is wrong because ongoing replication does not need a full load.

Full explanation →

1782

MCQhard

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The delivery stream is configured with a buffer size of 5 MB and a buffer interval of 60 seconds. The team notices that the S3 objects are much smaller than 5 MB. What is the most likely explanation?

A.The incoming data volume is low, so the 60-second buffer interval triggers delivery before the 5 MB buffer is filled.

B.The S3 bucket has event notifications that split the objects.

C.The S3 bucket has a lifecycle policy that transitions objects to Glacier.

D.The delivery stream is using GZIP compression, which reduces the object size.

AnswerA

Low volume causes frequent small deliveries.

Why this answer

Option A is correct because if the incoming data rate is low, the buffer interval (60 seconds) expires before the buffer size (5 MB) is reached, causing small objects. Option D is wrong because compression would reduce size but is not the reason delivery happens before the buffer fills. Option B is wrong because S3 event notifications do not split objects.

Option C is wrong because an S3 lifecycle transition changes the storage class of objects, not their size.
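
The interaction between the two buffer thresholds can be sketched with a small calculation. This is an idealized model that assumes a steady input rate and ignores compression; the function name and parameters are hypothetical, not part of any AWS SDK.

```python
def firehose_flush(bytes_per_second, buffer_mb=5, interval_s=60):
    """Return (trigger, object_size_bytes) for a steady input rate.

    Delivery happens when either threshold is reached, whichever comes first.
    """
    buffer_bytes = buffer_mb * 1024 * 1024
    seconds_to_fill = buffer_bytes / bytes_per_second
    if seconds_to_fill <= interval_s:
        # The size threshold is hit first: objects are about buffer_mb large.
        return "size", buffer_bytes
    # The interval expires first: objects hold only what arrived in that time.
    return "interval", bytes_per_second * interval_s


low_volume = firehose_flush(10 * 1024)      # 10 KB/s
high_volume = firehose_flush(1024 * 1024)   # 1 MB/s
print(low_volume)
print(high_volume)
```

Under these assumptions, a 10 KB/s stream is flushed by the interval with roughly 600 KB per object, while a 1 MB/s stream is flushed by the size threshold.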

Full explanation →

1783

MCQeasy

Refer to the exhibit. A data engineer runs the above CLI command to find files smaller than 1000 bytes in a bucket. The command returns an empty array, but the engineer knows there are small files. What is the issue?

A.The prefix is incorrect; it should be 'logs/2023/01/01/'.

B.The bucket policy does not allow listing objects.

C.The query syntax is invalid; use a filter instead.

D.The Size is compared as a string, not an integer; remove quotes around '1000'.

AnswerD

JMESPath comparison requires numeric types.

Why this answer

Option D is correct because the Size attribute is a number, but the command compares it to a string '1000', which causes a type mismatch. Option A is wrong because the prefix is valid. Option C is wrong because the query syntax is correct.

Option B is wrong because bucket policy does not affect CLI query results if access is allowed.
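
The type mismatch can be illustrated without the exhibit. The sketch below is a hand-written stand-in for JMESPath's ordering comparison, which is defined only for numbers and yields null for any other operand type; the helper name and the sample listing are hypothetical.

```python
def jmespath_less_than(left, right):
    """Mimic JMESPath's `<`: numbers compare normally, anything else is null."""
    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if is_number(left) and is_number(right):
        return left < right
    return None  # null: the filter treats this as "no match"


# Hypothetical listing in the shape returned by an S3 object listing.
contents = [
    {"Key": "logs/a.json", "Size": 512},
    {"Key": "logs/b.json", "Size": 4096},
]

# Comparing against a string literal: every comparison is null, so nothing matches.
as_string = [o["Key"] for o in contents if jmespath_less_than(o["Size"], "1000")]

# Comparing against a number: the small object is found.
as_number = [o["Key"] for o in contents if jmespath_less_than(o["Size"], 1000)]

print(as_string)
print(as_number)
```

In a real `--query` expression, a numeric literal is normally written in backticks rather than quotes.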

Full explanation →

1784

MCQmedium

A data engineer is troubleshooting a slow Amazon Redshift query that joins several large tables. The query plan shows a large number of broadcasts. Which design change would most likely reduce the broadcast operations?

A.Change the SORT KEY on all tables to match the join column.

B.Change the DISTSTYLE to EVEN on all tables.

C.Change the DISTKEY on all tables to match the join column.

D.Change the DISTSTYLE to ALL on all large tables.

AnswerC

Matching DISTKEY on join columns ensures data is co-located, avoiding broadcasts.

Why this answer

Option C is correct because setting the DISTKEY on all tables to the join column ensures that rows with the same join key value are co-located on the same compute node. This allows Redshift to perform a collocated join, eliminating the need to broadcast entire tables across the network, which is the primary cause of the slow query.

Exam trap

The trap here is that candidates confuse SORT KEY (which optimizes data skipping and range scans) with DISTKEY (which controls data distribution for joins), leading them to pick Option A, even though broadcast reduction is purely a distribution concern.

How to eliminate wrong answers

Option A is wrong because changing the SORT KEY affects the order of data on disk and can improve range-restricted scans, but it does not influence data distribution across nodes; broadcast operations are caused by distribution mismatches, not sort order. Option B is wrong because changing DISTSTYLE to EVEN distributes rows randomly across nodes, which maximizes the chance that join keys are scattered, forcing Redshift to broadcast rows to satisfy the join. Option D is wrong because changing DISTSTYLE to ALL on large tables copies the entire table to every node, which reduces broadcasts but at the cost of massive storage and maintenance overhead, making it impractical for large tables and often degrading overall performance.
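
The co-location idea can be sketched in a few lines. This is only a toy model: the slice count, table contents, and CRC-based hash are assumptions for illustration and do not reflect Redshift's actual hashing.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical slice count


def slice_for(key) -> int:
    """Map a distribution-key value to a slice with a deterministic hash."""
    return zlib.crc32(str(key).encode()) % NUM_SLICES


def distribute(rows, key_column):
    """Place each row on the slice chosen by its distribution-key value."""
    placement = defaultdict(list)
    for row in rows:
        placement[slice_for(row[key_column])].append(row)
    return placement


orders = [{"customer_id": c, "order_id": i} for i, c in enumerate([1, 2, 3, 1, 2])]
customers = [{"customer_id": c} for c in [1, 2, 3]]

orders_by_slice = distribute(orders, "customer_id")
customers_by_slice = distribute(customers, "customer_id")

# Every order sits on the same slice as its matching customer row,
# so the join needs no rows moved between slices.
colocated = all(
    any(c["customer_id"] == o["customer_id"] for c in customers_by_slice[s])
    for s, rows in orders_by_slice.items()
    for o in rows
)
print(colocated)
```

Because both tables are distributed on the same column with the same hash, matching rows always land together; distributing either table on a different column would break that guarantee.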

Full explanation →

1785

MCQeasy

A data engineer needs to share a dataset from an S3 bucket in Account A with users in Account B. The dataset must remain encrypted at rest with an S3-managed key. What is the MOST secure way to grant cross-account access?

A.Make the bucket public and use bucket policies to allow only Account B users.

B.Create a bucket policy that grants cross-account access to an IAM role in Account B.

C.Use S3 object ACLs to grant access to Account B's root user.

D.Use an S3 VPC endpoint to allow Account B users through private IPs.

AnswerB

Bucket policy with cross-account IAM role is secure and follows best practices.

Why this answer

Option B is correct because a bucket policy granting access to the IAM role in Account B is the recommended secure method. Option A is insecure because it grants public access. Option C is incorrect because ACLs are legacy and less secure.

Option D is wrong because an S3 VPC endpoint only controls the network path to S3; it does not grant cross-account permissions.

Full explanation →

1786

MCQhard

A company uses Amazon Kinesis Data Analytics for real-time anomaly detection on clickstream data. The application uses a sliding window of 1 minute. The data engineer notices that the application is producing incorrect results because late-arriving records are not being handled properly. What should the data engineer do to ensure late records are included in the window calculations?

A.Use a Kinesis Data Firehose to buffer the data and then send to Kinesis Data Analytics.

B.Increase the watermark delay in the Kinesis Data Analytics application to allow more time for late records.

C.Increase the window size from 1 minute to 2 minutes.

D.Increase the retention period of the Kinesis stream to 7 days.

AnswerB

Watermark delay controls how long the application waits for late data.

Why this answer

Option B is correct because Kinesis Data Analytics allows configuring a higher watermark delay, which tells the application to wait longer for late-arriving records. Option C is wrong because increasing the window size changes the semantics. Option D is wrong because modifying the stream retention does not affect the application's window.

Option A is wrong because buffering through Firehose does not address the issue; it's about waiting for late data.
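
The effect of the watermark delay can be sketched with a small simulation. This is a simplified model, not the Flink or Kinesis Data Analytics API: it assumes fixed 60-second windows instead of a sliding window, integer event times in seconds, and a watermark equal to the largest event time seen minus the delay.

```python
def accepted_records(event_times, watermark_delay_s, window_s=60):
    """Return the event times (in arrival order) that still reach an open window."""
    max_seen = float("-inf")
    accepted = []
    for t in event_times:
        max_seen = max(max_seen, t)
        watermark = max_seen - watermark_delay_s
        window_end = (t // window_s + 1) * window_s
        # A window is treated as closed once the watermark reaches its end.
        if window_end > watermark:
            accepted.append(t)
    return accepted


# Arrival order; 55 and 58 arrive after newer records have been seen.
arrivals = [10, 50, 70, 55, 130, 58]

no_delay = accepted_records(arrivals, watermark_delay_s=0)
short_delay = accepted_records(arrivals, watermark_delay_s=30)
long_delay = accepted_records(arrivals, watermark_delay_s=90)
print(no_delay, short_delay, long_delay)
```

In this model, a larger delay keeps windows open longer, so more of the late records are counted, at the cost of results being emitted later.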

Full explanation →
