
AWS Certified Data Engineer Associate DEA-C01 (DEA-C01) — Questions 1351–1425

1786 questions total · 24 pages · All types, answers revealed



1351

MCQhard

A company ingests JSON data from an S3 bucket into a Glue ETL job. The data contains nested structures and arrays. The team wants to flatten the data into a tabular format for analysis in Athena. Which Glue transformation is appropriate?

A.Map

B.Relationalize

C.Filter

D.DropNullFields

AnswerB

Relationalize transforms nested JSON into relational tables suitable for querying.

Why this answer

The Relationalize transformation is specifically designed to flatten nested JSON into relational tables. Option A (Map) applies a function to each record; Option C (Filter) selects rows; Option D (DropNullFields) removes null fields.
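To make the idea of flattening concrete, here is a minimal pure-Python sketch. It is not the Glue `Relationalize` API; the function name `split_record`, the `_id` column, and the sample record are all hypothetical. It only illustrates the general shape of the result: nested structures become dotted columns on a root row, and each array becomes rows in a child table that point back to the parent.

```python
def split_record(record, record_id):
    """Return (root_row, child_tables) for one nested record.

    Nested dicts become dotted column names on the root row.
    Each list becomes a child table whose rows reference the parent id.
    """
    root_row = {"_id": record_id}
    child_tables = {}

    def walk(obj, prefix):
        for key, value in obj.items():
            name = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                walk(value, name)  # descend into nested structure
            elif isinstance(value, list):
                child_tables[name] = [
                    {"parent_id": record_id, "index": i, "value": item}
                    for i, item in enumerate(value)
                ]
            else:
                root_row[name] = value  # plain scalar column

    walk(record, "")
    return root_row, child_tables


# Hypothetical sample record, for illustration only.
sample = {"device": {"id": "d1", "loc": {"lat": 1.5}}, "readings": [10, 20]}
root, children = split_record(sample, record_id=1)
```

The root row and the `readings` child table could then be queried as two ordinary tables joined on the parent id.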


1352

MCQhard

A company uses Amazon Redshift for its data warehouse. The data engineering team notices that queries against a large fact table are slow. The table is distributed using DISTSTYLE EVEN and has multiple sort keys. After analyzing the query plans, they find that most queries filter on a specific column, 'customer_id'. Which change would most likely improve query performance for these filter operations?

A.Add a secondary sort key on 'customer_id'.

B.Change to DISTSTYLE KEY on the 'customer_id' column.

C.Change to DISTSTYLE EVEN with a different sort key.

D.Change to DISTSTYLE ALL for the fact table.

AnswerB

KEY distribution on the filtered column reduces data movement during queries.

Why this answer

Option B is correct because using DISTSTYLE KEY on 'customer_id' co-locates rows with the same customer_id on the same slice, reducing data shuffling for queries filtering on that column. Option C (EVEN distribution with a different sort key) does not help with filtering on 'customer_id'. Option D (ALL distribution) is for small dimension tables.

Option A (adding 'customer_id' as a secondary sort key) may not help, because a secondary column in a compound sort key gives little benefit when the filter is only on 'customer_id'.


1353

MCQmedium

A company uses AWS Glue to transform data in Amazon S3. The transformation logic is complex and involves multiple steps. The data engineer wants to implement a workflow that handles dependencies and retries on failure. Which AWS service should be used to orchestrate the Glue jobs?

A.AWS Step Functions

B.AWS Lambda

C.Amazon Managed Workflows for Apache Airflow (MWAA)

D.Amazon CloudWatch Events

AnswerA

Step Functions can orchestrate Glue jobs with retries and error handling.

Why this answer

Option A is correct because AWS Step Functions can orchestrate multiple Glue jobs with error handling and retries. Option C is wrong because Amazon MWAA (Airflow) can also orchestrate Glue jobs but is more complex to set up and operate. Option D is wrong because Amazon CloudWatch Events schedules jobs but does not handle dependencies.

Option B is wrong because AWS Lambda is not designed for orchestration of long-running jobs.
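As a rough illustration of dependencies and retries, the sketch below builds a two-step state machine definition in Amazon States Language as a Python dictionary. The job names (`clean-data-job`, `aggregate-data-job`), state names, and retry settings are hypothetical; the `glue:startJobRun.sync` resource is the Step Functions integration that waits for a Glue job run to finish.

```python
import json


def glue_task(job_name, next_state=None):
    """Build a Task state that runs a Glue job and retries on failure."""
    state = {
        "Type": "Task",
        # .sync makes the state wait until the Glue job run completes.
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 60,  # assumed wait before first retry
                "MaxAttempts": 2,       # assumed retry budget
                "BackoffRate": 2.0,
            }
        ],
    }
    if next_state:
        state["Next"] = next_state  # dependency: run this state next
    else:
        state["End"] = True
    return state


# Hypothetical two-step pipeline: the second job runs only after the first succeeds.
definition = {
    "Comment": "Sketch of a two-step Glue workflow",
    "StartAt": "CleanData",
    "States": {
        "CleanData": glue_task("clean-data-job", next_state="AggregateData"),
        "AggregateData": glue_task("aggregate-data-job"),
    },
}
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string is what would be supplied as the state machine definition when creating it.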


1354

MCQhard

A data engineer notices that an Amazon Redshift cluster’s storage usage is increasing rapidly due to many UPDATE and DELETE operations. The engineer needs to reclaim storage space and improve query performance. Which action should be taken?

A.Run VACUUM command

B.UNLOAD the table to S3 and reload

C.Increase cluster node count

D.Run ANALYZE command

AnswerA

VACUUM reclaims disk space and re-sorts rows.

Why this answer

The VACUUM command in Amazon Redshift reclaims disk space occupied by deleted or updated rows and re-sorts the data according to the table's sort keys. This directly addresses the storage increase from UPDATE/DELETE operations and improves query performance by restoring the physical order of rows, which reduces the number of blocks scanned.

Exam trap

The trap here is that candidates confuse ANALYZE with VACUUM, thinking updating statistics will also reclaim storage, when in fact ANALYZE only refreshes metadata for the query optimizer and has no effect on physical storage.

How to eliminate wrong answers

Option B is wrong because unloading the table to S3 and reloading is a heavy, manual process that does not reclaim space in place and can be avoided with a simple VACUUM; it also incurs additional S3 costs and time. Option C is wrong because increasing the cluster node count adds more storage and compute capacity but does not reclaim the existing wasted space from deleted rows, and it may not improve performance if the underlying data is fragmented. Option D is wrong because the ANALYZE command only updates table statistics for the query planner; it does not reclaim storage space or physically reorganize data affected by UPDATE/DELETE operations.


1355

MCQeasy

A company uses Amazon CloudWatch Logs to collect application logs from EC2 instances. The logs are exported to Amazon S3 for long-term storage. Recently, the export task failed with the error 'Access Denied'. What is the most likely cause of this failure?

A.The S3 bucket policy denies access from the CloudWatch Logs service.

B.The IAM role does not have s3:PutObject permission on the destination bucket.

C.The IAM role does not have s3:ListBucket permission.

D.The EC2 instances are in a VPC without a VPC endpoint for CloudWatch Logs.

AnswerB

Without PutObject, the export task cannot write logs to S3.

Why this answer

The export task from CloudWatch Logs to S3 uses an IAM role to write data to the destination bucket. If the role lacks the s3:PutObject permission, the S3 service will reject the request with an 'Access Denied' error. This is the most common cause because the export operation requires write access to the bucket.

Exam trap

The trap here is that candidates often confuse the permissions needed for exporting logs to S3 (which requires s3:PutObject on the IAM role) with the permissions needed for sending logs from EC2 to CloudWatch Logs (which requires CloudWatch Logs agent permissions and possibly a VPC endpoint).

How to eliminate wrong answers

Option A is wrong because the S3 bucket policy can deny access, but the question states the export task failed with 'Access Denied' from CloudWatch Logs, which typically indicates a missing permission in the IAM role rather than a bucket policy denial; a bucket policy denial would also produce an 'Access Denied' error but is less likely as the default configuration allows CloudWatch Logs to write if the role has permissions. Option C is wrong because s3:ListBucket permission is required for listing objects, not for writing new objects; the export task only needs to upload logs, so s3:PutObject is sufficient. Option D is wrong because a VPC endpoint for CloudWatch Logs is used for sending logs from EC2 to CloudWatch Logs, not for exporting logs from CloudWatch Logs to S3; the export task runs within the AWS CloudWatch Logs service, not from the EC2 instances.


1356

MCQeasy

A data engineer needs to store semi-structured data (JSON logs) from thousands of IoT devices. The data must be schema-less, highly scalable, and support low-latency queries by device ID and timestamp. Which AWS service should the engineer use?

A.Amazon RDS for PostgreSQL

B.Amazon Redshift

C.Amazon DynamoDB

D.Amazon S3

AnswerC

DynamoDB supports flexible schema, high throughput, and low-latency queries on partition key and sort key.

Why this answer

Option C is correct because Amazon DynamoDB is a NoSQL key-value and document database that supports schema-less design, high scalability, and low-latency queries. Option A is wrong because RDS is relational and schema-on-write. Option B is wrong because Redshift is a data warehouse for analytics, not low-latency point queries.

Option D is wrong because S3 is object storage, not a database, and queries against it would have higher latency.


1357

MCQhard

A data engineer is troubleshooting an AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but writes duplicate rows into Redshift. The source data is static and does not contain duplicates. Which configuration change is most likely to resolve this issue?

A.Enable the 'upsert' feature in the Redshift connection by setting 'update' to true.

B.Modify the job to use the 'postactions' option with a SQL statement that deletes duplicates before final insert.

C.Use partition pruning on the S3 source to reduce the number of files read.

D.Increase the number of DPUs (Data Processing Units) allocated to the job.

AnswerB

Using postactions to perform a MERGE or delete duplicates after staging can ensure idempotent writes.

Why this answer

The job runs successfully but writes duplicate rows because AWS Glue's Spark-based ETL jobs can retry tasks on failure, and when writing to Redshift using the JDBC connector, the default behavior is to append data without deduplication. Using the 'postactions' option with a SQL DELETE statement that removes duplicates before the final INSERT ensures that only unique rows remain, resolving the duplication without altering the source data.

Exam trap

The trap here is that candidates often assume duplicate rows come from the source data or a misconfiguration in the write mode, but the real cause is the default append behavior combined with Spark task retries, and the solution is to use post-write deduplication rather than changing the write mode or source processing.

How to eliminate wrong answers

Option A is wrong because enabling 'upsert' with 'update' to true is used for merging data based on a key, but it does not prevent duplicate rows from being inserted; it only updates existing rows if a key matches, and the source data has no duplicates, so this would not fix the issue. Option C is wrong because partition pruning on the S3 source reduces the number of files read but does not address the duplication caused by job retries or write behavior; it optimizes performance, not data integrity. Option D is wrong because increasing the number of DPUs allocates more compute resources to the job, which can improve performance but does not prevent duplicate writes; duplication is a logic or configuration issue, not a resource constraint.
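The staging-table pattern behind this answer can be sketched as a set of Glue connection options for a Redshift write. This is only an illustration: the table names, the key column, and the database name are hypothetical, and the dictionary would still need to be passed to a Glue Redshift write call in a real job. The `preactions` and `postactions` keys hold SQL that runs before and after the load.

```python
def redshift_write_options(database, target, staging, key):
    """Build connection options that load into a staging table, then merge."""
    pre_sql = (
        f"DROP TABLE IF EXISTS {staging}; "
        f"CREATE TABLE {staging} (LIKE {target});"
    )
    post_sql = (
        "BEGIN; "
        # Remove rows that are about to be re-inserted, so reruns stay idempotent.
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key}; "
        f"INSERT INTO {target} SELECT * FROM {staging}; "
        f"DROP TABLE {staging}; "
        "END;"
    )
    return {
        "database": database,
        "dbtable": staging,  # Glue writes to the staging table, not the target
        "preactions": pre_sql,
        "postactions": post_sql,
    }


# Hypothetical names, for illustration only.
options = redshift_write_options("analytics", "public.orders", "public.orders_stage", "order_id")
```

Because the delete and insert run in one transaction after the load, a retried job run replaces matching rows instead of appending them again.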


1358

MCQeasy

A company uses AWS Glue to process data. The security team requires that all data in transit between AWS Glue and Amazon S3 be encrypted using TLS. Which configuration should be used?

A.Enable S3 server-side encryption and use HTTPS endpoints

B.Configure a bucket policy to require aws:SecureTransport

C.Enable default encryption on the S3 bucket using SSE-KMS

D.Use an S3 VPC endpoint

AnswerA

Glue uses HTTPS, which includes TLS.

Why this answer

Option A is correct because AWS Glue uses TLS for data in transit by default. Option C is wrong because S3 default encryption protects data at rest, not in transit. Option D is wrong because an S3 VPC endpoint keeps traffic on the AWS network but does not by itself enforce encryption.

Option B is wrong because a bucket policy condition is not required for Glue to use TLS.


1359

MCQmedium

A company ingests streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time using custom Python code before being stored in Amazon S3. Which AWS service should be used to perform this transformation?

A.Amazon EMR

B.Amazon Kinesis Data Firehose

C.AWS Glue

D.Amazon Kinesis Data Analytics for Apache Flink

AnswerD

Kinesis Data Analytics for Apache Flink allows running Flink applications that can process streaming data with custom Python code.

Why this answer

Option D is correct because Amazon Kinesis Data Analytics for Apache Flink runs Apache Flink applications, which can include custom Python transformations via Flink's Python API. Option C is wrong because AWS Glue is a batch ETL service, not optimized for real-time stream processing. Option B is wrong because Amazon Kinesis Data Firehose transforms records only through Lambda functions; that can run custom Python, but it is not the best practice for complex streaming transformations, whereas Kinesis Data Analytics for Apache Flink is purpose-built for stream processing.

Option A is wrong because Amazon EMR is a big data platform that can process streams but requires more setup and is not the simplest managed service for this use case.


1360

MCQmedium

A company runs a Redshift cluster and notices that query performance has degraded over time. The data engineer suspects that table statistics are stale. What should the engineer do to improve query performance?

A.Rebuild the tables by using CREATE TABLE AS

B.Increase the number of slices in the cluster

C.Run the ANALYZE command on the tables

D.Run the VACUUM command on the tables

AnswerC

ANALYZE updates table statistics for the optimizer.

Why this answer

Stale table statistics cause the Redshift query optimizer to generate suboptimal execution plans, leading to degraded query performance. Running the ANALYZE command updates these statistics, allowing the optimizer to make better decisions about join order, distribution, and data scan strategies. This directly addresses the root cause of performance degradation over time.

Exam trap

The trap here is confusing the VACUUM command (which reorganizes physical storage) with the ANALYZE command (which updates query optimizer metadata), leading candidates to choose VACUUM when stale statistics are the actual culprit.

How to eliminate wrong answers

Option A is wrong because rebuilding tables with CREATE TABLE AS (CTAS) does not update statistics; it creates a new table that still requires an explicit ANALYZE to populate its statistics, and it is an unnecessarily heavy operation for fixing stale stats. Option B is wrong because increasing the number of slices in the cluster requires resizing the cluster (e.g., adding nodes or changing node types), which is a disruptive, costly operation that does not address stale statistics; query performance degradation from stale stats is not resolved by adding more slices. Option D is wrong because the VACUUM command reclaims disk space and sorts rows to maintain physical data organization, but it does not update table statistics; stale statistics persist after VACUUM, so the optimizer remains uninformed.


1361

MCQhard

A company runs a daily batch ETL job using AWS Glue. The job processes 500 GB of data from Amazon RDS to Amazon S3. The job currently uses a single DPU and takes 6 hours to complete. The team wants to reduce runtime to under 1 hour without increasing costs significantly. Which approach should they use?

A.Change the job type from Python to Spark.

B.Use multiple Glue jobs triggered sequentially.

C.Increase the RDS instance size to improve read throughput.

D.Use AWS Glue Spark job with 100 workers.

AnswerD

More workers enable parallelism, reducing runtime.

Why this answer

Option D is correct because increasing the number of workers (DPUs) allows parallel processing, reducing runtime. Option A is wrong because changing the job type affects resource usage, but increasing workers addresses runtime more directly. Option B is wrong because running multiple Glue jobs sequentially adds no parallelism, so the total runtime does not drop.

Option C is wrong because using a larger instance type for RDS may improve read throughput but is not a Glue optimization and could increase database cost.


1362

Multi-Selecthard

A company uses AWS DMS to replicate data from an Amazon RDS for MySQL database to Amazon S3. Which TWO configurations are required to enable continuous change data capture (CDC) from MySQL?

Select 2 answers

A.Ensure the S3 bucket is in the same AWS Region as the source database

B.Grant REPLICATION CLIENT and REPLICATION SLAVE privileges to the DMS user

C.Enable binary logging (binlog) on the MySQL source database

D.Enable versioning on the target S3 bucket

E.Configure the MySQL source to be Multi-AZ

AnswersB, C

Required for DMS to read binary logs.

Why this answer

Correct options: B and C. DMS CDC reads changes from the MySQL binary log, so binary logging (binlog) must be enabled on the source, and binlog retention must be long enough for DMS to read the changes. The S3 target endpoint's `DataFormat` setting (`parquet` or `csv`) controls the output format but is not a CDC requirement.

Option B is correct: the DMS user needs the `REPLICATION CLIENT` and `REPLICATION SLAVE` privileges. Option C is correct: binlog must be enabled. Option D is incorrect; S3 bucket versioning is not required.

Option E is incorrect; the source database does not need to be Multi-AZ for CDC. Option A is incorrect; the target S3 bucket does not need to be in the same Region as the source (though this is recommended).


1363

Multi-Selecthard

Which THREE factors should be considered when choosing a partition key for an Amazon DynamoDB table?

Select 3 answers

A.The partition key should be chosen to maximize the size of items in each partition.

B.If the table has a write-heavy workload, the partition key should distribute writes evenly.

C.The partition key should align with the most common query access pattern.

D.The partition key should be chosen to minimize read capacity unit consumption.

E.The partition key should have high cardinality to distribute data evenly.

AnswersB, C, E

Even write distribution prevents throttling.

Why this answer

Option B is correct because DynamoDB distributes data and request traffic across partitions based on the partition key. For write-heavy workloads, a partition key that evenly distributes writes prevents hot partitions, which can throttle requests and degrade performance. This ensures that no single partition exceeds its write capacity limit.

Exam trap

The trap here is that candidates may think maximizing item size (Option A) or minimizing RCU consumption (Option D) are primary factors, when in fact even distribution and access pattern alignment are the critical design principles for DynamoDB partition keys.
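The effect of cardinality on distribution can be seen with a small simulation. This is a sketch under stated assumptions: DynamoDB's internal hash function is not public, so MD5 stands in for it, and the four partitions and the key names are hypothetical.

```python
import hashlib
from collections import Counter


def partition_counts(keys, num_partitions=4):
    """Count how many items land on each partition for the given key values."""
    counts = Counter()
    for key in keys:
        # MD5 is a stand-in for the service's internal hash (assumption).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        counts[int(digest, 16) % num_partitions] += 1
    return counts


# High cardinality: 1000 distinct device IDs.
high_cardinality = [f"device-{i}" for i in range(1000)]
# Low cardinality: only two distinct values across 1000 items.
low_cardinality = ["status-ok" if i % 2 else "status-error" for i in range(1000)]

high = partition_counts(high_cardinality)
low = partition_counts(low_cardinality)
```

With only two distinct key values, the items can occupy at most two partitions no matter how many partitions exist, which is the hot-partition situation the explanation describes.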


1364

MCQhard

A data engineer is using Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is delivered in 5-minute intervals. However, the engineer notices that the data in S3 is often delayed by up to 30 minutes. Which configuration change would most likely reduce the delay?

A.Decrease the 'Buffer interval' from 300 seconds to 60 seconds.

B.Enable compression (GZIP) on the Firehose delivery stream.

C.Increase the 'Buffer size' from 5 MB to 50 MB.

D.Enable 'Dynamic partitioning' on the Firehose stream.

AnswerA

Shorter buffer interval triggers more frequent deliveries.

Why this answer

Option A is correct because Firehose buffers data by time or size and delivers when either limit is reached; decreasing the buffer interval forces more frequent deliveries. Option B is wrong because compression shrinks the delivered objects and does not make Firehose deliver sooner. Option C is wrong because it increases the buffer size, possibly increasing delay.

Option D is wrong because dynamic partitioning controls the S3 prefix that records are written to, not how often data is delivered.
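The "time or size, whichever comes first" rule can be written as a one-line calculation. This is a simplified model with a steady, hypothetical ingest rate; it ignores any other sources of delay and is only meant to show why lowering the interval helps a low-volume stream while raising the size does not.

```python
def seconds_until_delivery(rate_mb_per_s, buffer_size_mb, buffer_interval_s):
    """Estimate when a buffer flushes: size limit or time limit, whichever is first."""
    if rate_mb_per_s <= 0:
        return buffer_interval_s  # no data arriving: only the timer applies
    time_to_fill = buffer_size_mb / rate_mb_per_s
    return min(time_to_fill, buffer_interval_s)


# Hypothetical low-volume stream: 0.01 MB/s.
current = seconds_until_delivery(0.01, buffer_size_mb=5, buffer_interval_s=300)
shorter_interval = seconds_until_delivery(0.01, buffer_size_mb=5, buffer_interval_s=60)
larger_buffer = seconds_until_delivery(0.01, buffer_size_mb=50, buffer_interval_s=300)
```

In this model the 60-second interval is the only change that brings the flush time down.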


1365

MCQeasy

A company needs to transform JSON data from an S3 bucket into a structured format for Amazon Redshift. The transformation should be done serverlessly. Which service should be used?

A.AWS Glue

B.Amazon EMR

C.Amazon Athena

D.AWS Lambda

AnswerA

Glue provides serverless ETL capabilities.

Why this answer

Option A is correct because AWS Glue is a serverless ETL service that can transform data for Redshift. Option D is wrong because Lambda can process events but is not ideal for large-scale ETL. Option C is wrong because Athena is a query service, not a transformation service.

Option B is wrong because Amazon EMR typically requires provisioning and managing clusters.


1366

MCQmedium

A financial services company uses AWS Glue ETL jobs to process sensitive customer data stored in Amazon S3. The data is encrypted at rest with SSE-KMS using a customer-managed key. Recently, the security team discovered that the Glue job's IAM role has an overly permissive policy that allows the 'kms:Decrypt' action for all KMS keys in the account. The company wants to follow the principle of least privilege. The Glue job runs on a schedule and reads from a specific S3 bucket. The security team needs to update the IAM policy to restrict KMS decryption to only the specific key used for that bucket. What should they do?

A.Update the policy to allow 'kms:Decrypt' with a resource of 'arn:aws:kms:us-east-1:123456789012:key/*' to cover all keys in the account.

B.Update the policy to allow 'kms:Decrypt' with a resource of '*' to ensure the job can always decrypt data.

C.Update the policy to allow 'kms:Decrypt' only for the specific KMS key ARN used by the S3 bucket containing the customer data.

D.Remove the 'kms:Decrypt' action from the policy and rely on S3 bucket policies to grant decryption permissions.

AnswerC

This restricts decryption to only the required key.

Why this answer

Option C is correct because the IAM policy should grant 'kms:Decrypt' only for the specific KMS key ARN used by the target S3 bucket, following least privilege. Option B is wrong because using '*' as the resource is overly permissive. Option A is wrong because a wildcard key ARN (key/*) still matches every key in that account and Region.

Option D is wrong because 'kms:Decrypt' is required for the Glue job to access encrypted data; removing it would break the job.
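A least-privilege statement of this kind might look like the following sketch, built as a Python dictionary and serialized to JSON. The key ID in the ARN is a hypothetical placeholder; the account ID and Region are taken from the answer options.

```python
import json

# Hypothetical key ARN: replace the key ID with the bucket's actual key.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DecryptWithBucketKeyOnly",
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            # A single key ARN, with no wildcard, limits decryption to that key.
            "Resource": KEY_ARN,
        }
    ],
}
policy_json = json.dumps(policy, indent=2)
```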


1367

MCQmedium

A data engineer needs to allow an IAM user to rotate the secret in AWS Secrets Manager for an RDS database. Which IAM action should be included in the policy?

A.secretsmanager:RotateSecret

B.secretsmanager:PutSecretValue

C.secretsmanager:UpdateSecret

D.secretsmanager:GetSecretValue

AnswerA

This action allows rotating the secret.

Why this answer

The secretsmanager:RotateSecret action allows the user to initiate rotation of a secret, so Option A is correct. secretsmanager:GetSecretValue only retrieves the secret value; it does not rotate it.


1368

MCQeasy

A company is using an RDS for PostgreSQL instance and wants to minimize downtime during a major version upgrade. Which approach should be taken?

A.Create a read replica of the DB instance, upgrade the replica, and then promote it to the primary instance.

B.Use AWS Database Migration Service (DMS) to migrate data to a new upgraded instance.

C.Modify the DB instance and apply the upgrade immediately.

D.Take a snapshot of the DB instance and restore it as a new instance with the upgraded version.

AnswerA

Minimizes downtime by failing over to the upgraded replica.

Why this answer

Option A is correct because creating a read replica of the RDS for PostgreSQL instance, upgrading the replica to the new major version, and then promoting it to become the primary instance minimizes downtime by allowing the replica to be upgraded while the original primary remains fully operational. The promotion process is fast (typically seconds), and the only downtime is the brief cutover period when applications switch to the promoted replica. This approach leverages RDS's managed replication and avoids the longer downtime associated with direct in-place upgrades.

Exam trap

The trap here is that candidates often assume a snapshot-and-restore (Option D) is the fastest method because it seems like a simple copy, but they overlook the fact that the snapshot itself requires the instance to be operational and the restore creates a new instance that is not automatically kept in sync, leading to longer overall downtime compared to the replica promotion method.

How to eliminate wrong answers

Option B is wrong because AWS Database Migration Service (DMS) is designed for heterogeneous or homogeneous migrations with ongoing replication, but it introduces significant complexity and potential downtime during the full-load and change-data-capture phases; it is not the optimal approach for a simple major version upgrade of an existing RDS instance. Option C is wrong because modifying the DB instance and applying the upgrade immediately causes an in-place upgrade that typically results in several minutes of downtime (often 10–30 minutes or more) while the instance is stopped, upgraded, and restarted, which violates the goal of minimizing downtime. Option D is wrong because taking a snapshot and restoring it as a new instance with the upgraded version requires the source instance to be available during the snapshot (which can take time) and then the restore process creates a new instance that is not automatically synchronized with the original; this approach involves significant downtime for the snapshot creation and restore, and does not provide a seamless cutover.


1369

MCQeasy

A data engineer needs to store semi-structured JSON data from IoT devices. The data is written frequently and read occasionally. Which AWS service is MOST cost-effective for this use case?

A.Amazon ElastiCache for Redis

B.Amazon DynamoDB

C.Amazon RDS for MySQL

D.Amazon Redshift

AnswerB

DynamoDB handles high write volumes efficiently.

Why this answer

Option B is correct because DynamoDB is designed for high write throughput and low-latency reads. Option C (RDS) is relational and more expensive. Option A (ElastiCache) is in-memory and costly for large data.

Option D (Redshift) is for analytics.


1370

MCQeasy

A data engineer needs to ensure that an Amazon S3 bucket containing sensitive customer data is encrypted at rest. Which AWS service can be used to manage the encryption keys?

A.AWS Certificate Manager

B.AWS Secrets Manager

C.AWS CloudHSM

D.AWS Key Management Service (KMS)

AnswerD

KMS is the managed service for creating and controlling encryption keys used by S3 SSE-KMS.

Why this answer

AWS KMS is the service for managing encryption keys. S3 SSE-S3 uses S3-managed keys, while SSE-C uses customer-provided keys. CloudHSM is a hardware security module but not directly used for S3 encryption key management.


1371

MCQhard

A company uses Amazon DynamoDB with encryption at rest using an AWS managed KMS key. The security team requires that the encryption key be rotated every 90 days. What should the data engineer do to meet this requirement?

A.Switch to AWS CloudHSM to manage the encryption key

B.Create a scheduled AWS Lambda function to rotate the AWS managed key

C.Enable a custom encryption context in DynamoDB to trigger rotation

D.Use a customer managed KMS key and configure automatic rotation

AnswerD

Customer managed keys support automatic rotation with a configurable period.

Why this answer

Option D is correct. AWS managed keys are rotated automatically on a schedule set by AWS (currently about once a year), and that schedule is not configurable. To meet a 90-day requirement, the company must use a customer managed key and configure automatic rotation with a 90-day period (or rotate the key manually).

Option C is wrong because DynamoDB does not support custom encryption contexts. Option B is wrong because AWS managed keys cannot be rotated on demand. Option A is wrong because CloudHSM is not required.
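Configuring rotation on a customer managed key could be sketched as below. This assumes a boto3 version recent enough for the KMS `enable_key_rotation` call to accept `RotationPeriodInDays`, and it assumes the documented 90 to 2560 day range; the helper name `enable_rotation` is hypothetical. The client is passed in so the function can be exercised without AWS credentials.

```python
def enable_rotation(kms_client, key_id, period_days=90):
    """Turn on automatic rotation for a customer managed key.

    kms_client is expected to behave like boto3.client("kms") (assumption).
    """
    # Assumed valid range for a custom rotation period.
    if not 90 <= period_days <= 2560:
        raise ValueError("rotation period must be between 90 and 2560 days")
    return kms_client.enable_key_rotation(
        KeyId=key_id,
        RotationPeriodInDays=period_days,
    )
```

In a real environment this would be called with `boto3.client("kms")` and the ID or ARN of the customer managed key that protects the DynamoDB table.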


1372

MCQmedium

A data engineer notices that an AWS Glue ETL job is failing with an OutOfMemory error when processing a large dataset. The job uses a Standard worker type. Which action is MOST effective to resolve this issue without changing the job script?

A.Increase the number of workers

B.Switch to G.1X worker type

C.Change to G.2X worker type

D.Increase the number of DPUs per worker

AnswerD

Increasing DPUs per worker increases memory per worker, directly addressing OutOfMemory errors.

Why this answer

Option D is correct because increasing the number of DPUs per worker provides more memory per task, addressing OutOfMemory errors directly. Option B is wrong because G.1X worker type has less memory than Standard. Option A is wrong because increasing the number of workers does not increase memory per worker.

Option C is wrong because changing to G.2X may help but is not as direct as increasing DPUs per worker, and it may also increase cost unnecessarily.


1373

MCQmedium

Refer to the exhibit. The exhibit shows output from AWS CLI commands. Which key can be used to enable automatic annual rotation?

A.The second key (5678efgh-...)

B.Both keys

C.Neither key

D.The first key (1234abcd-...)

AnswerD

Customer managed keys can have automatic rotation enabled.

Why this answer

Option D is correct because the first key is a customer managed key, and only customer managed keys let the user enable automatic rotation. Option B is wrong because AWS managed keys are rotated automatically but their rotation cannot be configured by the user. Option C is wrong because both keys can be rotated, but only customer managed keys have user-controlled rotation.

Option A is wrong because the second key is AWS managed.


1374

MCQhard

A data engineer is designing a multi-region disaster recovery solution for Amazon RDS for PostgreSQL. The primary region must have a standby in a different Availability Zone, and the secondary region must have a readable replica that can be promoted in case of failure. Which configuration meets these requirements?

A.Use a single-AZ primary and enable automatic backups

B.Enable Multi-AZ in the primary region and create a cross-region read replica

C.Use a single-AZ primary and create a cross-region read replica

D.Enable Multi-AZ in both primary and secondary regions

AnswerB

Multi-AZ provides standby; cross-region replica provides DR.

Why this answer

Option B is correct because it meets both requirements: Multi-AZ in the primary region provides a synchronous standby in a different Availability Zone for high availability, and a cross-region read replica in the secondary region provides an asynchronous, readable copy that can be promoted to a standalone primary during a regional failure. This combination ensures both intra-region fault tolerance and inter-region disaster recovery.

Exam trap

The trap here is that candidates often confuse Multi-AZ (synchronous, for high availability within a region) with cross-region read replicas (asynchronous, for disaster recovery), and may incorrectly assume that Multi-AZ alone provides cross-region failover or that a single-AZ primary with a read replica satisfies the intra-region standby requirement.

How to eliminate wrong answers

Option A is wrong because a single-AZ primary with automatic backups does not provide a standby in a different Availability Zone, nor does it create a readable replica in a secondary region; backups are for point-in-time recovery, not for immediate failover or read scaling. Option C is wrong because a single-AZ primary lacks the required standby in a different Availability Zone within the primary region; the cross-region read replica only addresses the secondary region requirement. Option D is wrong because enabling Multi-AZ in both regions does not create a cross-region read replica; Multi-AZ in the secondary region provides a standby within that region but does not establish a readable replica that can be promoted from the primary region.


1375

MCQeasy

A data engineer is designing a data lake on Amazon S3. The data includes personally identifiable information (PII) that must be encrypted at rest. Which encryption option provides the most control over encryption keys?

A.Client-side encryption using Amazon S3 Encryption Client.

B.Server-side encryption with S3 managed keys (SSE-S3).

C.Server-side encryption with AWS KMS managed keys (SSE-KMS).

D.Server-side encryption with customer-provided keys (SSE-C).

AnswerC

Allows use of customer-managed KMS keys, giving more control.

Why this answer

Option C is correct because SSE-KMS allows the customer to manage and control KMS keys. Option B is incorrect because SSE-S3 uses Amazon-managed keys. Option D is incorrect because SSE-C uses customer-provided keys but requires managing the keys on the client side.

Option A is incorrect because client-side encryption is not server-side.

Full explanation →

1376

MCQmedium

Refer to the exhibit. A data engineer has attached this IAM policy to a user. The user reports being unable to upload files to my-bucket from an on-premises network with a public IP of 203.0.113.5. What is the issue?

A.The resource ARN does not include the bucket itself

B.The user's IP address is not within the allowed IP range

C.The user does not have s3:PutObject permission

D.The bucket requires server-side encryption

AnswerB

The condition only allows 10.0.0.0/16.

Why this answer

Option B is correct because the policy restricts access to the IP range 10.0.0.0/16, which is a private range that does not include the user's public IP of 203.0.113.5. Option C is wrong because the required actions are allowed. Option D is wrong because if encryption were required, the policy would contain a condition for it.

Option A is wrong because the resource includes the bucket.
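The source-IP check behind this answer can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch that assumes the policy condition matches on the range 10.0.0.0/16, as the answer summary states; the exhibit itself is not reproduced here, and the helper name `ip_allowed` is hypothetical.

```python
import ipaddress


def ip_allowed(source_ip: str, allowed_cidr: str) -> bool:
    """Return True if source_ip falls inside allowed_cidr."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(allowed_cidr)


# The user's public IP from the question, tested against the allowed range.
print(ip_allowed("203.0.113.5", "10.0.0.0/16"))  # False
# An address inside the private range passes the same check.
print(ip_allowed("10.0.12.34", "10.0.0.0/16"))   # True
```

Because 203.0.113.5 is outside 10.0.0.0/16, the condition does not match and the upload is not allowed.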

Full explanation →

1377

MCQeasy

A company uses Amazon S3 to store raw data and AWS Glue to run ETL jobs. The data is partitioned by date in the format 'year=YYYY/month=MM/day=DD'. A new data source started sending data with a different date format 'YYYY-MM-DD'. The Glue crawler is configured to create a single table for the entire bucket. The crawler runs daily, but it is not detecting the new partitions from the new data source. The existing partitions are in the format 'year=2024/month=05/day=10', while the new data is stored as '2024-05-10/' without the key-value structure. How should the engineer modify the data pipeline to include the new data?

A.Run the crawler with the 'Create partition indexes' option enabled.

B.Configure the crawler to add a custom classifier for date formats.

C.Modify the new data source to store data in the same Hive-style partition format as the existing data.

D.Convert the new data to Parquet format.

AnswerC

Consistent partition structure enables the crawler to detect partitions.

Why this answer

Option C is correct because the new data uses a different partition structure. The crawler expects Hive-style partitions (key=value). To include the new data, the engineer should either change the folder structure to match the existing one or configure a separate crawler for the new format.

Option A is wrong because partition indexes speed up lookups of partitions that are already in the catalog; they do not change how partitions are discovered. Option B is wrong because custom classifiers determine the format and schema of the files, not how folder names are interpreted as partitions. Option D is wrong because the file format (CSV, JSON, Parquet) is not the issue.
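Rewriting the new prefixes into the existing layout is a simple string transformation. The sketch below is illustrative only: the function name `to_hive_prefix` is hypothetical, and it assumes each incoming prefix is exactly a date in 'YYYY-MM-DD' form followed by a slash.

```python
from datetime import datetime


def to_hive_prefix(date_prefix: str) -> str:
    """Convert 'YYYY-MM-DD/' into 'year=YYYY/month=MM/day=DD/'."""
    # strptime also validates the date, so malformed prefixes raise ValueError.
    parsed = datetime.strptime(date_prefix.rstrip("/"), "%Y-%m-%d")
    return f"year={parsed:%Y}/month={parsed:%m}/day={parsed:%d}/"


print(to_hive_prefix("2024-05-10/"))  # year=2024/month=05/day=10/
```

Once objects land under key=value prefixes like these, they line up with the partition columns of the existing table.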

Full explanation →

1378

MCQhard

A company is migrating an on-premises Hadoop cluster to AWS. The data is stored in HDFS and needs to be accessible by both Amazon EMR and Amazon Redshift Spectrum. Which storage solution is most cost-effective and scalable?

A.Amazon FSx for HDFS

B.Amazon Simple Storage Service (S3)

C.Amazon Elastic Block Store (EBS)

D.Amazon Elastic File System (EFS)

AnswerB

S3 is highly scalable, durable, and can be queried by Redshift Spectrum and processed by EMR.

Why this answer

Option B is correct because Amazon S3 is a cost-effective, scalable object store that can be accessed by both EMR and Redshift Spectrum. Option C is wrong because EBS volumes are attached to EC2 instances and cannot be queried by Redshift Spectrum. Option D is wrong because EFS is a file system and is not as cost-effective for large-scale data.

Option A is wrong because an HDFS-compatible file system would be more expensive than S3, and Redshift Spectrum queries data in S3.

Full explanation →

1379

MCQmedium

A data engineer is designing a data lake on Amazon S3. The data includes personally identifiable information (PII) that must be encrypted at rest. Which combination of actions meets the encryption requirement with the least operational overhead?

A.Apply a bucket policy that denies access to unencrypted requests

B.Enable default encryption on the S3 bucket using SSE-S3

C.Use client-side encryption with AWS KMS

D.Use server-side encryption with AWS KMS (SSE-KMS)

AnswerB

SSE-S3 is simple and automatically encrypts objects.

Why this answer

Option B is correct because default encryption with S3 managed keys (SSE-S3) provides encryption at rest with minimal management. Option C is wrong because client-side encryption requires application changes. Option D is wrong because KMS adds key management overhead.

Option A is wrong because bucket policies do not encrypt data.

Full explanation →

1380

Multi-Selectmedium

A company is using AWS Lake Formation to manage permissions on a data lake. Which of the following are valid ways to grant access to a user or role? (Choose THREE.)

Select 3 answers

A.Grant permissions to a SAML or SCIM group

B.Grant permissions using tag-based access control (LF-Tags)

C.Grant permissions to an IAM user or role

D.Grant permissions to an AWS Organizations unit

E.Grant permissions via an S3 bucket policy

AnswersA, B, C

Lake Formation can integrate with SAML/SCIM for group-based access.

Why this answer

Options A, B, and C are correct. Lake Formation can grant directly to IAM users/roles, to SAML/SCIM groups, and through tag-based access control. Option D is wrong because AWS Organizations is for account management, not individual permissions.

Option E is wrong because S3 bucket policies are separate from Lake Formation.

Full explanation →

1381

MCQeasy

A company uses Amazon S3 Event Notifications to trigger a Lambda function that processes incoming files. Recently, the Lambda function has been timing out for large files (>100 MB). The data engineer wants to improve the pipeline to handle large files reliably. Which solution is the MOST scalable and cost-effective?

A.Use S3 Event Notification to send to an SQS queue, then have Lambda poll the queue

B.Use Amazon SNS to fan out the event to multiple Lambda functions

C.Use AWS Step Functions to orchestrate multiple Lambda functions for parallel processing

D.Increase the Lambda timeout to 15 minutes

AnswerA

SQS decouples and buffers events, allowing Lambda to process at a manageable rate.

Why this answer

Option A is correct because S3 Event Notifications can send to SQS, which decouples the producer and consumer, allowing Lambda to process at its own pace. Option D (increasing the timeout to 15 minutes, the Lambda maximum) may still not be enough for large files. Option C (Step Functions) adds complexity.

Option B (SNS) does not buffer.

Full explanation →

1382

MCQmedium

A data engineer is troubleshooting an Amazon Redshift cluster that is not responding to queries. The engineer suspects that the cluster may have been accidentally deleted. Which AWS service should be used to investigate the deletion?

A.AWS Config

B.AWS CloudTrail

C.Amazon CloudWatch Logs

D.AWS Trusted Advisor

AnswerB

CloudTrail logs API calls like DeleteCluster.

Why this answer

Option B is correct because AWS CloudTrail records DeleteCluster API calls. Option C is wrong because CloudWatch Logs stores logs but not API calls. Option A is wrong because, although AWS Config records resource changes and would show the deletion, CloudTrail is the more direct source for the API call itself.

Option D is wrong because AWS Trusted Advisor is for best-practice recommendations.

Full explanation →

1383

Multi-Selectmedium

A company uses Amazon DynamoDB for a gaming application. The application experiences throttling during peak hours. The table's read and write capacity is provisioned. Which TWO actions can reduce throttling?

Select 2 answers

A.Enable TTL (time to live) on the table to automatically delete old items

B.Enable DynamoDB auto scaling for the table

C.Increase the provisioned read capacity units (RCUs)

D.Implement DynamoDB Accelerator (DAX) to cache read requests

E.Add a DynamoDB Global Table for the table

AnswersB, D

Auto scaling adjusts provisioned capacity based on traffic.

Why this answer

DynamoDB auto scaling (Option B) automatically adjusts the provisioned read and write capacity based on actual traffic patterns, preventing throttling during peak hours without manual intervention. This is the correct action because it dynamically increases capacity when demand spikes and reduces it during low traffic, directly addressing the throttling issue.

Exam trap

The trap here is that candidates often confuse increasing provisioned capacity (Option C) as the only solution, but the exam tests whether you understand that auto scaling (Option B) is the correct managed approach, and that DAX (Option D) can reduce read throttling by caching, making both B and D valid together.

Full explanation →

1384

MCQmedium

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by an Amazon Kinesis Data Analytics for Apache Flink application that performs real-time analytics. The Flink application writes its results to an Amazon S3 bucket. The company has noticed that the Flink application is experiencing high checkpoint failure rates, causing delays. The CloudWatch metrics show that the checkpoint size is large and increasing. The data engineer needs to reduce the checkpoint size. Which action should the data engineer take?

A.Decrease the checkpoint interval to reduce the amount of state accumulated.

B.Reduce the parallelism of the Flink application.

C.Increase the state time-to-live (TTL) configuration to retain state longer.

D.Enable incremental checkpointing in the Flink application to only write changes since the last checkpoint.

AnswerD

Incremental checkpoints reduce size and improve performance.

Why this answer

Option D is correct because enabling incremental checkpointing in Flink reduces the amount of data written per checkpoint by only writing changes since the last checkpoint. Option B is wrong because reducing parallelism may increase load per operator. Option A is wrong because decreasing the checkpoint interval makes checkpoints more frequent; it does not shrink them.

Option C is wrong because increasing state TTL retains state longer, which does not reduce checkpoint size.

Full explanation →

1385

MCQmedium

A data engineer is monitoring an Amazon Kinesis Data Stream and notices that the 'WriteProvisionedThroughputExceeded' metric is frequently elevated. The stream has 5 shards and is used by multiple producers. What is the BEST action to resolve this issue?

A.Increase the consumer's processing speed to reduce lag.

B.Increase the number of shards in the Kinesis data stream.

C.Reduce the data retention period of the stream.

D.Implement exponential backoff and retries in the producer applications.

AnswerB

More shards provide higher write throughput.

Why this answer

Option B is correct because WriteProvisionedThroughputExceeded indicates that the write rate exceeds the shards' capacity. Increasing the number of shards increases the total write capacity. Option D is incorrect because retries do not resolve the root cause of insufficient capacity.

Option C is incorrect because reducing the retention period does not affect write throughput. Option A is incorrect because increasing the consumer's processing speed does not affect write throttling.
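A rough way to size the stream is to divide the expected write rate by the per-shard write limits of 1 MB/s and 1,000 records/s. The sketch below assumes provisioned mode and an evenly distributed partition key; the function name and the sample workload figures are hypothetical and are not taken from the question.

```python
import math

SHARD_WRITE_MB_PER_SEC = 1.0         # per-shard write limit in MB/s
SHARD_WRITE_RECORDS_PER_SEC = 1000   # per-shard write limit in records/s


def shards_needed(mb_per_sec: float, records_per_sec: int) -> int:
    """Estimate the shard count required to absorb a given write rate."""
    by_bytes = math.ceil(mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # Whichever limit is hit first determines the shard count.
    return max(by_bytes, by_records, 1)


# Hypothetical workload: 7.5 MB/s and 6,000 records/s.
print(shards_needed(7.5, 6000))  # 8
```

With that hypothetical workload, a 5-shard stream would be throttled on bytes even though the record rate alone would fit in 6 shards.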

Full explanation →

1386

MCQhard

A data engineer is troubleshooting a slow-running Amazon Redshift query. The query involves a large fact table with a distribution style of EVEN and a sort key on date. The table has 10 slices. The engineer notices that the query is performing a broadcast join with a small dimension table. Which change would most improve performance?

A.Remove the sort key and use a compound sort key on the join column

B.Change the dimension table to DISTSTYLE ALL

C.Increase the number of slices by resizing the cluster

D.Change the fact table to DISTSTYLE KEY on the join column

AnswerD

KEY distribution colocates matching rows, reducing the need for broadcast.

Why this answer

Option D is correct because changing the fact table to KEY distribution on the join column reduces data movement. Option B (DISTSTYLE ALL on the dimension table) would also avoid redistributing the dimension table at query time, but the intended fix here is to colocate the fact table on the join column. Option C is wrong because adding slices does not reduce the broadcast.

Option A is wrong because removing the sort key on date would hurt performance for range queries.

Full explanation →

1387

MCQmedium

A data engineering team is troubleshooting a failing AWS Glue ETL job that processes data from an S3 bucket. The job writes output to another S3 bucket. The job fails with an AccessDenied error when writing to the output bucket. The IAM role used by the job has the following policy attached: {"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:GetObject","s3:ListBucket"],"Resource":["arn:aws:s3:::input-bucket/*","arn:aws:s3:::input-bucket"]}]}. What is the most likely cause of the failure?

A.The ETL job is processing more than 10 TB of data.

B.The output bucket has a bucket policy that denies access to the IAM role.

C.The IAM role does not have s3:PutObject permission on the output bucket.

D.The IAM role used by the job does not exist.

AnswerC

The policy lacks s3:PutObject for the output bucket, causing the AccessDenied error.

Why this answer

Option C is correct because the IAM policy only grants GetObject and ListBucket permissions on the input bucket, but no permissions on the output bucket. Option A is wrong because there is no restriction on data size. Option B is wrong because there is no bucket policy mentioned.

Option D is wrong because the role exists.
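One way to close the gap is to add a statement that allows writes to the output bucket. The sketch below builds such a policy as a Python dictionary and prints it as JSON. The name `output-bucket` is a placeholder, since the question does not give the output bucket's name, and a real job may need further actions depending on how it writes.

```python
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read access, copied from the policy in the question.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::input-bucket/*",
                "arn:aws:s3:::input-bucket",
            ],
        },
        {
            # Added statement: write access to the (hypothetically named) output bucket.
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::output-bucket/*"],
        },
    ],
}

print(json.dumps(policy, indent=2))
```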

Full explanation →

1388

MCQeasy

A data engineer has this IAM policy attached to their user. They are trying to create an Amazon EMR cluster with a custom service role 'EMR_CustomRole'. What will happen?

A.The cluster creation will fail because elasticmapreduce:* is too broad.

B.The cluster creation will succeed because elasticmapreduce:* is allowed.

C.The cluster creation will fail with an 'Access Denied' error for iam:PassRole.

D.The cluster creation will succeed because PassRole is not required for EMR.

AnswerC

The policy restricts PassRole to only the default role, so passing a custom role is denied.

Why this answer

Option C is correct because the policy only allows iam:PassRole for the specific role 'EMR_DefaultRole', not for 'EMR_CustomRole'. The elasticmapreduce:* action allows creating clusters, but the PassRole check will fail. Option B (success because elasticmapreduce:* is allowed) is incorrect.

Option A (failure because elasticmapreduce:* is too broad) is false. Option D (PassRole not needed) is false.

Full explanation →

1389

MCQmedium

A company is using Amazon Kinesis Data Firehose to ingest log data from web servers into an Amazon S3 bucket. The data is then queried by Amazon Athena. The company has noticed that the Athena queries are slow and expensive. The data engineer wants to optimize the storage format to improve query performance and reduce costs. Which configuration change should the data engineer make to the Firehose delivery stream?

A.Increase the buffer interval to 600 seconds and buffer size to 128 MB to create larger files.

B.Change the output format to ORC and enable GZIP compression.

C.Enable S3 server access logs to track query patterns.

D.Enable data transformation in Firehose to convert JSON to Parquet format with Snappy compression.

AnswerD

Parquet is columnar and efficient for Athena.

Why this answer

Option D is correct because converting data to Parquet format and compressing it reduces storage space and improves query performance in Athena. Option B is wrong because storing as ORC is also good, but Parquet is more common with Athena. Option A is wrong because increasing the buffer size delays delivery.

Option C is wrong because enabling S3 server access logs adds cost and does not help query performance.

Full explanation →

1390

MCQeasy

A company wants to ingest data from thousands of IoT devices into AWS for real-time analytics. The data is in JSON format and each device sends about 1 KB every second. Which service should be used as the primary ingestion point?

A.AWS IoT Core

B.Amazon Kinesis Data Firehose

C.Amazon SQS

D.Amazon Kinesis Data Streams

AnswerD

Handles high-volume streaming data.

Why this answer

Option D is correct because Kinesis Data Streams can handle high-throughput real-time data from many sources. Option C (SQS) is for message queuing, not streaming analytics. Option B (Firehose) could be used but lacks the ability to have multiple consumers for real-time processing.

Option A (IoT Core) is specifically for IoT but not as general-purpose.

Full explanation →

1391

Multi-Selectmedium

A financial services company is designing a data store for transaction records that must be immutable and auditable. The data must be stored for 7 years. Which AWS services can be combined to meet these requirements? (Choose TWO.)

Select 2 answers

A.Amazon S3 Glacier Deep Archive

B.Amazon S3 with Object Lock enabled

C.Amazon EBS volume with snapshots

D.Amazon RDS with automated backups

E.Amazon DynamoDB with point-in-time recovery

AnswersA, B

Glacier Deep Archive is cost-effective for long-term archival.

Why this answer

Amazon S3 Glacier Deep Archive is correct because it provides the lowest-cost S3 storage class for long-term retention, which suits records that must be kept for 7 years. Amazon S3 with Object Lock enabled is correct because it enforces a write-once-read-many (WORM) model, preventing records from being deleted or overwritten for a specified retention period, ensuring immutability and auditability.

Exam trap

The trap here is that candidates often confuse backup solutions (like RDS automated backups or DynamoDB PITR) with immutable storage, but backups are deletable and do not enforce WORM, whereas S3 Object Lock provides true immutability required for audit compliance.

Full explanation →

1392

Multi-Selecteasy

A data engineer is migrating an on-premises Microsoft SQL Server database to Amazon RDS for SQL Server. The database is 2 TB in size and has a 4-hour maintenance window. The company needs to minimize downtime and ensure data consistency. Which TWO methods should the engineer use? (Choose TWO.)

Select 2 answers

A.Use AWS Database Migration Service (AWS DMS) with ongoing replication to minimize downtime.

B.Use SQL Server Management Studio (SSMS) export wizard to transfer data.

C.Take a native backup of the on-premises database and restore it to RDS.

D.Use AWS Schema Conversion Tool (AWS SCT) to convert the schema and migrate data.

E.Export the database to CSV files and use BULK INSERT to load into RDS.

AnswersA, C

DMS can perform a full load and then replicate changes, reducing downtime.

Why this answer

AWS DMS with ongoing replication (change data capture) is correct because it allows continuous synchronization from the on-premises SQL Server to Amazon RDS for SQL Server, minimizing downtime by keeping the target database up-to-date until the final cutover. This approach ensures data consistency by capturing and applying ongoing changes without requiring a long outage window.

Exam trap

The trap here is that candidates often assume native backup/restore alone is sufficient for minimal downtime, forgetting that it only handles the initial data load and does not capture changes made during the backup window without additional replication.

Full explanation →

1393

MCQhard

A company uses Amazon DynamoDB with on-demand capacity for a gaming leaderboard. The table has 100 GB of data and receives 10,000 write requests per second with spikes to 50,000. The application experiences throttling during spikes. Which action should be taken to reduce throttling without changing the application?

A.Write data to Amazon S3 and use S3 Select

B.Increase the provisioned read capacity units

C.Switch to provisioned capacity with Auto Scaling

D.Enable DynamoDB Accelerator (DAX)

AnswerD

DAX can offload read traffic and reduce write throttling.

Why this answer

Option D is correct because DynamoDB Accelerator (DAX) provides in-memory caching to absorb read spikes and reduce write throttling by offloading reads. Option B is wrong because increasing read capacity does not help write throttling. Option C is wrong because Auto Scaling is not available with on-demand mode.

Option A is wrong because S3 is not suitable for real-time writes.

Full explanation →

1394

MCQeasy

A company uses AWS Glue ETL jobs to transform data and load it into Amazon Redshift. The jobs are failing with 'Out of Memory' errors. What is the most cost-effective way to resolve this issue without changing the transformation logic?

A.Increase the number of G.1X workers in the Glue job configuration.

B.Use Amazon Redshift Spectrum to query data directly from S3 without transformation.

C.Change the worker type to G.2X and keep the same number of workers.

D.Switch the job from Python to Scala.

AnswerA

More workers increase parallelism and total memory.

Why this answer

Option A is correct. Increasing the number of workers (DPUs) adds parallelism and total memory without changing the logic. Option C is wrong because increasing the worker type is more expensive than adding workers.

Option D is wrong because switching the job from Python to Scala means rewriting the script and may not be compatible with the existing code. Option B is wrong because Redshift Spectrum is for querying S3, not for resolving ETL memory errors.

Full explanation →

1395

MCQmedium

A data engineer is ingesting streaming data from an IoT fleet into Amazon Kinesis Data Streams. The data must be transformed in real-time and loaded into an Amazon Redshift cluster. Which solution minimizes operational overhead?

A.Use Kinesis Data Firehose with a Lambda transformation function

B.Use AWS Glue ETL jobs running continuously

C.Use Kinesis Client Library (KCL) to consume and transform data, then write to Redshift using COPY

D.Use AWS Direct Connect to stream data directly into Redshift

AnswerA

Firehose handles buffering, transformation via Lambda, and direct delivery to Redshift.

Why this answer

Option A is correct because Kinesis Data Firehose can buffer and batch incoming data, invoke a Lambda function for transformation, and load directly into Redshift. Option C is wrong because KCL requires custom application management. Option B is wrong because Glue is batch-oriented.

Option D is wrong because Direct Connect is for dedicated network connections.

Full explanation →

1396

MCQhard

A financial services company is ingesting trade data from multiple exchanges via Amazon Kinesis Data Streams. Each shard receives data from multiple exchanges, and a consumer application (using KCL) processes the data. The company needs to ensure that trades from the same exchange are processed in order. However, the current implementation distributes records to shards using a random partition key, causing trades from the same exchange to be spread across shards and processed out of order. The team must enforce ordering per exchange without significantly reducing throughput. What should the team do?

A.Implement a custom sequence number in the application to reorder after processing.

B.Use a single shard for all data to guarantee order.

C.Use the exchange ID as the partition key when putting records into the stream.

D.Increase the number of shards to 10 per exchange.

AnswerC

Ensures same exchange goes to same shard, preserving order.

Why this answer

Option C is correct because using the exchange ID as the partition key ensures all trades from the same exchange go to the same shard, preserving order. Option D is wrong because increasing the shard count would further spread data and break ordering. Option B is wrong because using a single shard would preserve order but reduce throughput due to shard limits.

Option A is wrong because implementing a custom sequencer is complex and unnecessary.
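Kinesis Data Streams assigns a record to a shard by taking the MD5 hash of its partition key and finding the shard whose hash key range contains that 128-bit value. The sketch below imitates that mapping locally to show that a fixed key always lands on the same shard. It assumes the shards split the hash space into equal ranges, which may not hold for a stream that has been resharded; `shard_for_key` and the exchange names are illustrative, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # partition keys hash to 128-bit integers


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming equal hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Scale the hash into [0, shard_count) in proportion to its position.
    return hash_key * shard_count // HASH_SPACE


# Repeated keys resolve to the same shard, so per-exchange order is kept.
for exchange in ["NYSE", "NASDAQ", "LSE", "NYSE"]:
    print(exchange, "-> shard", shard_for_key(exchange, 4))
```

One trade-off of this design is that a single very busy exchange is limited to the throughput of one shard.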

Full explanation →

1397

MCQhard

A company uses Amazon Kinesis Data Analytics (now Managed Service for Apache Flink) to run a Flink application on streaming data. The application fails with 'OutOfMemoryError: Java heap space'. The data volume is 10 MB/s. What is the most likely cause and solution?

A.The data contains records larger than 1 MB; split records into smaller chunks.

B.Checkpointing is enabled too frequently; reduce checkpoint interval.

C.The Flink application is not suitable for 10 MB/s throughput; use Kinesis Data Firehose instead.

D.The application's Parallelism is too low; increase the number of Parallelism and KPUs.

AnswerD

Low parallelism causes data to accumulate in operator buffers, leading to OOM.

Why this answer

Option D is correct because insufficient parallelism or KPU allocation leads to OOM. Option B is wrong because checkpointing actually helps; it is not the cause of heap exhaustion.

Option C is wrong because Flink can handle 10 MB/s with proper resources. Option A is wrong because the record size or data format does not cause the OOM here.

Full explanation →

1398

Multi-Selectmedium

A company is using AWS Glue to run ETL jobs that transform data from S3 to Redshift. The jobs are failing intermittently with out-of-memory errors. Which THREE actions can help resolve this issue? (Choose THREE.)

Select 3 answers

A.Increase the number of DPUs allocated to the Glue job

B.Use S3 Select to filter data before reading into the Glue job

C.Use Spark's 'coalesce' function to reduce the number of partitions

D.Optimize the transformation logic to use less memory, for example by filtering early

E.Use a larger worker type, such as G.2X

AnswersA, D, E

More DPUs provide more memory and compute resources.

Why this answer

Options A, D, and E are correct. A: Increasing the number of DPUs provides more memory. E: Using a larger worker type (e.g., G.2X) increases memory per worker.

D: Optimizing the transformation logic to reduce memory usage helps. C: Using Spark's 'coalesce' reduces partitions but may not solve memory issues. B: Using S3 Select pushes down filtering but does not address memory.

Full explanation →

1399

MCQhard

A company runs a transactional database on Amazon RDS for PostgreSQL with Multi-AZ deployment. The database size is 2 TB and experiences moderate write load. The company recently enabled RDS Performance Insights and noticed a high number of 'TupleLock' wait events during peak hours. The development team reports that a batch update job runs every hour, updating millions of rows in a large table. The job takes longer than expected. The DBA suspects that excessive row-level locking is causing contention. The team wants to minimize lock contention without changing the application code. Which solution should be implemented?

A.Tune the autovacuum settings (e.g., autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold) to run more frequently and aggressively.

B.Increase the RDS instance size to a larger instance class with more vCPUs and memory.

C.Enable RDS Proxy to manage database connections and reduce connection overhead.

D.Implement table partitioning using the pg_partman extension to split the large table into smaller partitions.

AnswerA

This reduces dead tuple accumulation, which lowers lock contention and improves the batch job performance.

Why this answer

Option A (autovacuum tuning) is correct because frequent updates generate dead tuples, leading to increased lock contention. Tuning autovacuum ensures timely cleanup, reducing contention and wait events. Option B (increasing RDS instance size) may alleviate CPU/memory pressure but does not address lock contention.

Option C (enabling RDS Proxy) helps with connection pooling, not lock contention. Option D (using pg_partman for partitioning) reduces row-level contention by splitting the table but requires code changes to queries, which is prohibited by the stem.

Full explanation →

1400

MCQhard

A company has a 100 TB dataset stored on-premises in a Hadoop cluster. They want to ingest this data into Amazon S3 for processing with AWS Glue. The company has a limited time window and a slow internet connection. Which strategy is MOST appropriate?

A.Use AWS Snowball Edge to physically ship the data to AWS.

B.Use AWS DataSync over the existing internet connection.

C.Use Amazon S3 Transfer Acceleration to speed up the upload.

D.Use AWS Direct Connect to establish a high-bandwidth connection.

AnswerA

Snowball Edge can handle 100 TB offline, bypassing network limitations.

Why this answer

Option A is correct because AWS Snowball Edge is designed for large data transfers where network bandwidth is limited. Option D is wrong because setting up a Direct Connect link may not be feasible within a limited time window. Option B is wrong because DataSync transfers data over the network, so the slow connection remains the bottleneck.

Option C is wrong because S3 Transfer Acceleration also requires a network connection.
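A quick estimate shows why a network transfer is unattractive here. The question does not state the link speed, so the 100 Mbps figure below is purely an assumption for illustration, as is the function name; the calculation uses decimal terabytes and ignores protocol overhead.

```python
def transfer_days(dataset_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Estimate days needed to move dataset_tb terabytes over a link_mbps link."""
    bits = dataset_tb * 1e12 * 8                       # terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * utilization)   # bits / (bits per second)
    return seconds / 86400                             # seconds -> days


# 100 TB over an assumed 100 Mbps link at full utilization.
print(round(transfer_days(100, 100), 1))  # 92.6
```

Even at full utilization, the assumed link would need about three months, which is the kind of gap an offline device is meant to close.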

Full explanation →

1401

MCQmedium

A company runs an Amazon EMR cluster with Spark jobs that process data from Amazon S3. The data engineer receives an alert that one of the Spark jobs failed with an OutOfMemoryError. The job processes large files and uses the default Spark configurations. Which configuration change is MOST likely to resolve the issue?

A.Increase the spark.executor.memory configuration.

B.Increase the number of executors.

C.Disable dynamic resource allocation.

D.Decrease the number of cores per executor.

AnswerA

Increasing executor memory directly addresses the OutOfMemoryError.

Why this answer

Option A is correct because increasing spark.executor.memory gives each executor more memory to handle large data processing. Option D is wrong because reducing the number of cores per executor reduces parallelism without adding memory to the executor. Option C is wrong because dynamic allocation is enabled by default and helps, but the issue is executor memory.

Option B is wrong because increasing the number of executors without increasing memory per executor does not address the OOM per executor.

Full explanation →

1402

MCQmedium

A company uses AWS Glue to process data from multiple sources. The data is stored in an Amazon S3 data lake. The company needs to transform the data using a custom Python library that is not available in the default Glue environment. What is the MOST efficient way to make this library available to the Glue jobs?

A.Manually install the library on each node in the Glue cluster by editing the bootstrap script.

B.Upload the library as a .whl file to Amazon S3 and reference it in the Glue job's --additional-python-modules parameter.

C.Create a custom Docker image with the library and use it in AWS Glue for Ray.

D.Use a shell command in the Glue job script to run 'pip install <library>' before the job runs.

AnswerB

This is the recommended way to add custom libraries to Glue jobs.

Why this answer

Option B is correct because AWS Glue allows installing additional Python modules through the --additional-python-modules job parameter, which can reference a .whl file stored in S3. Option D is wrong because running 'pip install' from a shell command inside the job script is not the supported way to add packages. Option C is wrong because creating a custom Docker image is more complex than needed.

Option A is wrong because installing the library on every node is not supported; Glue manages the environment.
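The job parameter takes a comma-separated list of modules, and an entry can be an S3 path to a wheel. The sketch below only assembles the arguments dictionary that would be supplied as the job's default arguments; the bucket, key, and helper name are hypothetical, and the example does not call AWS.

```python
def build_default_arguments(modules: list[str]) -> dict[str, str]:
    """Build Glue job arguments that install extra Python modules at startup."""
    # Glue expects a single comma-separated string for this parameter.
    return {"--additional-python-modules": ",".join(modules)}


# Hypothetical wheel location; replace with a real bucket and key.
wheel_path = "s3://example-bucket/libs/my_library-1.0.0-py3-none-any.whl"

default_arguments = build_default_arguments([wheel_path])
print(default_arguments)
```

The resulting dictionary is what would typically be supplied in the job definition, for example through the job parameters section of the console.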

Full explanation →

1403

MCQeasy

A data engineering team needs to transform CSV files stored in Amazon S3 into Parquet format using AWS Glue. The files are partitioned by date and are updated hourly. Which AWS Glue feature should be used to automatically detect the schema and partition structure?

A.AWS Glue Crawler

B.AWS Glue DataBrew

C.AWS Lake Formation

D.Amazon Athena

AnswerA

Discovers schema and partitions automatically.

Why this answer

AWS Glue Crawler is the correct choice because it automatically scans data in S3, infers the schema (including data types), and detects the partition structure (e.g., date-based partitions like year/month/day) by examining the folder hierarchy. It then populates the AWS Glue Data Catalog with metadata, enabling ETL jobs to read the data without manual schema definition.

Exam trap

AWS often tests the distinction between tools that discover metadata (Crawler) versus tools that consume or transform data (Athena, DataBrew), leading candidates to pick Athena because it can query partitioned data, but it cannot automatically detect the partition structure without a pre-existing catalog.

How to eliminate wrong answers

Option B (AWS Glue DataBrew) is wrong because it is a visual data preparation tool for cleaning and normalizing data, not for automatic schema or partition detection. Option C (AWS Lake Formation) is wrong because it provides centralized security and governance for data lakes, but it does not perform schema discovery or partition detection itself. Option D (Amazon Athena) is wrong because it is a query engine that can read data from the Glue Data Catalog, but it does not automatically detect schemas or partitions; it relies on existing catalog metadata.
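
To illustrate what "detecting the partition structure from the folder hierarchy" means, here is a small sketch that reads Hive-style key=value folders from an S3 object key. It is not the crawler's actual logic; the function name and the sample key are assumptions made for the example.

```python
def partition_values(s3_key: str) -> dict:
    """Extract Hive-style partition columns (name=value folders) from an object key."""
    parts = {}
    for segment in s3_key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


# Hypothetical object key with date-based partitions.
print(partition_values("sales/year=2024/month=05/day=17/part-0000.csv"))
# → {'year': '2024', 'month': '05', 'day': '17'}
```

A crawler records this kind of information as partition columns in the Data Catalog, which is why downstream jobs do not need to parse paths themselves.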

1404

Multi-Selecthard

A data engineer is setting up an Amazon Redshift cluster for a data warehouse. The cluster will store historical sales data and support complex analytical queries. To optimize query performance and manage storage, the engineer needs to choose appropriate distribution styles and sort keys for a large fact table 'sales' and several dimension tables. Which TWO of the following design decisions are BEST practices?

Select 2 answers

A.Use interleaved sort keys on columns that are frequently used in filter predicates (e.g., date, region, product).

B.Use EVEN distribution for the fact table 'sales' to ensure an even data distribution across all nodes.

C.Use ALL distribution for the 'sales' fact table to replicate data to every node and avoid data movement.

D.Use a compound sort key with the most frequently filtered column first.

E.Choose AUTO distribution style for all tables and let Amazon Redshift automatically assign distribution.

AnswersA, B

Interleaved sort keys improve performance for queries filtering on multiple columns.

Why this answer

Option A is correct because interleaved sort keys in Amazon Redshift give equal weight to each column in the sort key, making them ideal for queries with filter predicates on multiple columns (e.g., date, region, product). This design optimizes zone maps and minimizes the amount of data scanned, significantly improving query performance for complex analytical workloads on large fact tables.

Exam trap

The trap here is that candidates often confuse EVEN distribution as a universal best practice for all fact tables, overlooking that KEY distribution on the join column is superior for star schema joins, and they may also incorrectly assume ALL distribution is suitable for large fact tables due to its join performance benefits, ignoring the prohibitive storage and write costs.

1405

MCQmedium

A company runs an Amazon RDS for PostgreSQL database for its e-commerce platform. The application team reports that write-intensive workloads are causing high latency and the database is experiencing storage bottlenecks. The database currently uses General Purpose SSD (gp2) storage. Which action would be MOST effective in improving write performance without changing the database instance class?

A.Create a read replica and offload writes to it.

B.Switch the storage type to Provisioned IOPS SSD (io1).

C.Enable Multi-AZ deployment for high availability.

D.Change the storage type to General Purpose SSD (gp3).

AnswerD

gp3 offers higher baseline IOPS and throughput than gp2, improving write performance.

Why this answer

D is correct because gp3 storage decouples performance from volume size: it provides a fixed baseline of IOPS and throughput and lets you provision additional IOPS and throughput independently, without needing to increase storage. gp2, by contrast, ties baseline IOPS to volume size (3 IOPS per GiB), so smaller volumes depend on burst credits that a sustained write-intensive workload can exhaust. Moving to gp3 directly addresses the high write latency and the storage bottleneck, and it does not require changing the instance class.

Exam trap

The trap here is that candidates often assume Provisioned IOPS (io1) is always the best choice for write performance, but the question specifically tests knowledge of gp3's superior baseline performance and cost efficiency for write-intensive workloads without requiring an instance class change.

How to eliminate wrong answers

Option A is wrong because a read replica cannot offload writes; it only handles read traffic, and writes must still go to the primary database, so it does not reduce write latency or storage bottlenecks. Option B is wrong because while io1 provides consistent IOPS, it is significantly more expensive than gp3 and does not offer the same baseline performance improvements for write-heavy workloads without also increasing storage; additionally, the question asks for the most effective action without changing the instance class, and gp3 is a more cost-effective and modern choice. Option C is wrong because Multi-AZ deployment provides high availability and automatic failover, but it does not improve write performance; in fact, synchronous replication to the standby can add slight latency to writes.
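
The size dependence of gp2 can be made concrete with a short calculation. This sketch assumes the published Amazon EBS figures (gp2: 3 IOPS per GiB with a floor of 100 and a ceiling of 16,000; gp3: a 3,000 IOPS baseline); RDS may apply different gp3 baselines at larger volume sizes, so treat the numbers as illustrative.

```python
GP3_BASELINE_IOPS = 3_000  # assumed EBS gp3 baseline, independent of volume size


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, clamped to the range 100..16,000 (assumed EBS figures)."""
    return max(100, min(3 * size_gib, 16_000))


for size in (100, 500, 1_000, 2_000):
    print(f"{size:>5} GiB  gp2 baseline={gp2_baseline_iops(size):>5}  gp3 baseline={GP3_BASELINE_IOPS}")
```

Under these assumptions a 100 GiB gp2 volume has a baseline of only 300 IOPS, while gp2 volumes above 1,000 GiB already exceed the gp3 baseline, which is where provisioning extra gp3 IOPS becomes the relevant lever.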

1406

Multi-Selecthard

A company is ingesting IoT sensor data into Amazon Kinesis Data Streams. Each sensor sends a JSON payload every second. The data must be transformed and aggregated in real-time before being stored in Amazon DynamoDB. Which THREE services should be used together in the pipeline? (Choose THREE.)

Select 3 answers

A.AWS Lambda

B.Amazon Kinesis Data Analytics

C.Amazon S3

D.Amazon Kinesis Data Streams

E.Amazon Kinesis Data Firehose

AnswersA, B, D

Writes results to DynamoDB.

Why this answer

Options A, B, and D are correct. Kinesis Data Streams (D) ingests the data. Kinesis Data Analytics (B) performs real-time transformations and aggregations.

Lambda (A) can be used to write the aggregated results to DynamoDB. Option E is wrong because Kinesis Data Firehose is for delivery to destinations such as S3 or Redshift, not DynamoDB. Option C is wrong because S3 is not needed for this pipeline.

1407

MCQhard

A data engineer is designing a data lake on S3 that must be encrypted at rest using customer-managed keys in AWS KMS. The security team requires that the key be used only for S3 operations and that the key be rotated every 180 days. Which solution meets these requirements?

A.Create a customer managed key with a key policy that grants usage only to S3, and enable automatic rotation with a 180-day period.

B.Use an AWS managed key (aws/s3) and enable automatic rotation.

C.Use an S3 bucket policy to enforce SSE-KMS with a CloudHSM key.

D.Use a custom key store backed by CloudHSM and rotate the key manually.

AnswerA

Customer managed keys allow custom rotation periods and key policies to restrict usage.

Why this answer

Option A is correct because customer managed keys can have a custom key rotation period (180 days) and a key policy can restrict usage to S3. Option B is wrong because AWS managed keys have a fixed annual rotation. Option D is wrong because custom key stores do not support automatic rotation.

Option C is wrong because S3 does not use CloudHSM directly for S3 encryption.
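
The two requirements in this question (S3-only usage and a 180-day rotation period) can be sketched as data. The example below builds a key policy statement that uses the kms:ViaService condition key and a parameter map assumed to mirror the KMS EnableKeyRotation request (KeyId, RotationPeriodInDays, with an assumed allowed range of 90 to 2560 days). The account ID, Region, key ID, and helper names are hypothetical, and nothing here calls AWS.

```python
import json


def s3_only_key_statement(account_id: str, region: str) -> dict:
    """Key policy statement that allows key usage only for requests made through S3."""
    return {
        "Sid": "AllowUseOnlyThroughS3",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey*", "kms:DescribeKey"],
        "Resource": "*",
        "Condition": {"StringEquals": {"kms:ViaService": f"s3.{region}.amazonaws.com"}},
    }


def rotation_request(key_id: str, days: int = 180) -> dict:
    """Parameters for enabling automatic rotation with a custom period (assumed 90..2560 days)."""
    if not 90 <= days <= 2560:
        raise ValueError("rotation period must be between 90 and 2560 days")
    return {"KeyId": key_id, "RotationPeriodInDays": days}


# Placeholder account, Region, and key ID for illustration only.
statement = s3_only_key_statement("111122223333", "us-east-1")
request = rotation_request("1234abcd-12ab-34cd-56ef-1234567890ab")
print(json.dumps(statement, indent=2))
print(request)
```

The condition restricts use of the key to requests that arrive via S3, and the rotation parameters are only possible because the key is customer managed.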

1408

Multi-Selectmedium

A company is using AWS Glue ETL to process data from Amazon RDS for MySQL to Amazon S3. The job runs daily and takes 2 hours to complete. The engineer wants to improve performance without increasing cost significantly. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers

A.Switch to a smaller worker type (e.g., G.1X instead of G.2X).

B.Use Spark DataFrames instead of DynamicFrames.

C.Enable 'Auto Scaling' in the Glue job configuration.

D.Add a partition column to the source table based on a date column.

E.Increase the number of Glue DPUs.

AnswersD, E

Partitioning allows Glue to read data in parallel.

Why this answer

Options D and E are correct. Partitioning the source table on a date column (D) lets Glue read the data in parallel. Increasing the number of DPUs (E) improves performance; it raises the hourly cost, but it is a common approach because a shorter runtime can offset part of the increase.

Option A is wrong because switching to a smaller worker type would slow performance. Option B is wrong because DynamicFrame is recommended for Glue.

Option C is wrong because AWS Glue does not support auto-scaling by default without additional configuration.

1409

Multi-Selecthard

A company is using AWS KMS with customer-managed keys to encrypt data in Amazon RDS. The security team wants to ensure that the key can be rotated automatically every year. Which THREE steps are required to achieve automatic key rotation?

Select 3 answers

A.Migrate the key to AWS CloudHSM.

B.Enable automatic key rotation in the KMS key configuration.

C.Use a symmetric KMS key.

D.Configure the RDS instance to use the KMS key for encryption.

E.Create a new KMS key and configure RDS to use it.

AnswersB, C, D

This is required to rotate the key automatically.

Why this answer

Options B, C, and D are correct. To enable automatic rotation for a customer-managed KMS key, you must enable rotation via the KMS console or API (B), ensure the key is a symmetric key (C) as asymmetric keys do not support automatic rotation, and configure the RDS instance to use the key (D) for encryption. Option E is incorrect because rotation can be enabled for existing keys, so a new key is not needed.

Option A is incorrect because CloudHSM is not involved.

1410

MCQhard

A data pipeline ingests JSON data from an S3 bucket using AWS Glue. The JSON files contain nested structures, and the team wants to flatten them for analysis in Amazon Athena. Which Glue transformation is most appropriate?

A.Filter

B.Join

C.Map

D.Relationalize

AnswerD

Flattens nested JSON into separate tables.

Why this answer

Option D is correct because Relationalize is specifically designed to flatten nested JSON into relational tables. Option A (Filter) removes records. Option B (Join) combines datasets.

Option C (Map) applies a function to each record.
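
The idea of flattening can be shown with a plain-Python sketch. This is not the Glue Relationalize implementation: it only flattens nested objects into dotted column names and leaves arrays untouched, whereas Relationalize also splits arrays into separate tables. The function name and sample record are assumptions.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into a single level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical nested record.
row = flatten({"id": 7, "customer": {"name": "Ana", "address": {"city": "Lima"}}})
print(row)
# → {'id': 7, 'customer.name': 'Ana', 'customer.address.city': 'Lima'}
```

The flat column names are what make the data queryable as an ordinary table in Athena.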

1411

Multi-Selecthard

Which TWO are valid approaches to troubleshoot a slow Amazon Redshift query? (Choose two.)

Select 2 answers

A.Check for table locks using STV_LOCKS.

B.Enable encryption on the cluster.

C.Use the EXPLAIN command to review the query execution plan.

D.Run VACUUM on the table.

E.Alter the table to change DISTSTYLE to KEY.

AnswersA, C

Locks can cause waits.

Why this answer

Options A and C are correct. Using EXPLAIN to review the query plan and checking for table locks are troubleshooting steps. Option D is wrong because VACUUM reclaims space; it is table maintenance rather than a way to diagnose a slow query.

Option E is wrong because DISTSTYLE is a design choice, not a troubleshooting step. Option B is wrong because enabling encryption does not affect query performance.

1412

MCQmedium

A data engineer uses AWS DMS to migrate a 2 TB PostgreSQL database to Amazon Aurora PostgreSQL. The migration task is set to full load + CDC. After the full load completes, the CDC phase starts but shows a high latency of 5 minutes. The source database has a low write load. What should the engineer do to reduce the CDC latency?

A.Decrease the batch size for the CDC task.

B.Disable the validation feature on the DMS task.

C.Increase the size of the DMS replication instance.

D.Enable logging for the DMS task to capture additional details.

AnswerC

More resources reduce latency.

Why this answer

Option C is correct because increasing the DMS replication instance size provides more memory and CPU for processing changes. Option B is wrong because disabling validation reduces overhead but does not address latency caused by resource constraints. Option A is wrong because the source has low write load, so batch size is not the bottleneck.

Option D is wrong because enabling task logging only captures diagnostic detail; it does not reduce CDC latency.

1413

MCQmedium

A company uses Kinesis Data Streams to ingest IoT data. The data volume varies, and occasionally the shard write throughput is exceeded, causing ProvisionedThroughputExceeded exceptions. The data engineer needs to handle these spikes without losing data. Which approach is most cost-effective and requires minimal code changes?

A.Implement custom retry logic using the Kinesis Client Library with exponential backoff

B.Increase the number of shards to handle peak throughput

C.Use Kinesis Data Firehose as a consumer with retries and buffer settings

D.Send data to an SQS queue first, then have a Lambda function write to Kinesis

AnswerC

Firehose can buffer data and retry, handling spikes with minimal code changes.

Why this answer

Option C (Use Kinesis Data Firehose as a consumer with retries and buffer settings) is correct because Firehose can buffer data and retry on failures, handling spikes. Option B (Increase the number of shards) is costly and requires manual scaling. Option A (custom retry logic using the Kinesis Client Library with exponential backoff) is a good practice but still requires custom code; Firehose is more managed.

Option D (Send data to SQS and then to Kinesis) adds complexity and cost.
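
For reference, the custom retry logic that the explanation calls "good practice but still custom code" looks roughly like the sketch below. The put function and the ThroughputExceeded exception are hypothetical stand-ins rather than Kinesis SDK calls, and the delay values are assumptions.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Stand-in for a throttling error raised by the hypothetical put function."""


def put_with_backoff(put, record, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call put(record), retrying throttled attempts with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return put(record)
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


# Demo: a fake writer that is throttled twice before succeeding.
calls = {"n": 0}


def flaky_put(record):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThroughputExceeded()
    return "ok"


result = put_with_backoff(flaky_put, {"id": 1}, sleep=lambda seconds: None)
print(result, calls["n"])  # → ok 3
```

Even this small version has to be written, tested, and tuned by the team, which is the code-change cost the question asks candidates to weigh.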

1414

MCQeasy

A company is streaming data from an application to Amazon Kinesis Data Streams. The data must be transformed in real time and then stored in Amazon S3 in Parquet format. Which AWS service should be used for the transformation step?

A.Amazon Kinesis Data Firehose with a Lambda transformation.

B.Amazon EMR running Apache Spark Streaming.

C.Amazon Kinesis Data Analytics for Apache Flink.

D.AWS Lambda with a Kinesis trigger.

AnswerC

Kinesis Data Analytics for Flink is designed for real-time stream processing and can perform complex transformations.

Why this answer

Option C is correct because Amazon Kinesis Data Analytics for Apache Flink is a serverless service that can run Apache Flink applications to perform real-time transformations on streaming data. Option D (AWS Lambda with a Kinesis trigger) can be used but is limited by execution time and is not optimized for complex streaming transformations. Option A (Amazon Kinesis Data Firehose with a Lambda transformation) can transform data using Lambda functions but is more suited for loading data.

Option B (Amazon EMR running Apache Spark Streaming) can process streams, but it requires provisioning and managing a cluster, which adds operational overhead for this use case.

1415

MCQhard

A data engineer is troubleshooting an AWS Lake Formation permissions issue. A user is able to query an Amazon Athena table but cannot see the underlying S3 data in the AWS Glue Data Catalog. The user has been granted SELECT permission on the table in Lake Formation. What is the most likely cause?

A.The user does not have DESCRIBE permission on the table in Lake Formation.

B.The data location is not registered with Lake Formation.

C.The S3 bucket policy does not grant the user access.

D.The user does not have the aws:SourceArn condition in the IAM policy.

AnswerA

SELECT permission allows querying but not viewing the table metadata; DESCRIBE is needed to see the table in the catalog.

Why this answer

In AWS Lake Formation, the ability to query a table via Athena (which requires SELECT permission) is separate from the ability to view the table's metadata in the Glue Data Catalog. To see the underlying S3 data location or table properties in the catalog, a user needs DESCRIBE permission on the table. Without DESCRIBE, the table appears invisible in the Glue console or API, even though SELECT queries succeed.

Exam trap

The trap here is that candidates assume SELECT permission is sufficient for all table interactions, overlooking that Lake Formation separates metadata visibility (DESCRIBE) from data access (SELECT).

How to eliminate wrong answers

Option B is wrong because registering the data location with Lake Formation is a prerequisite for granting permissions, but the user can already query the table, so the location must be registered. Option C is wrong because if the S3 bucket policy were blocking access, the Athena query would fail, not just the catalog visibility. Option D is wrong because the aws:SourceArn condition is a security best practice for cross-account access, but its absence does not cause the described symptom of a missing table in the catalog.

1416

MCQeasy

A data engineer needs to store JSON documents that are frequently read and written by a web application. The data has a flexible schema and requires low-latency queries on primary key lookups. Which AWS service is MOST suitable?

A.Amazon Redshift

B.Amazon S3

C.Amazon DynamoDB

D.Amazon RDS for MySQL

AnswerC

DynamoDB provides single-digit millisecond performance for key-value lookups and supports flexible schemas.

Why this answer

Option C is correct because DynamoDB is a NoSQL database designed for low-latency key-value lookups with flexible schema. Option D is wrong because RDS is relational and requires a fixed schema. Option B is wrong because S3 is object storage, not designed for low-latency primary key lookups.

Option A is wrong because Redshift is a data warehouse for analytics, not transactional workloads.

1417

MCQmedium

An AWS Glue job that processes streaming data from Amazon Kinesis Data Streams is failing intermittently with 'Failed to checkpoint' errors. The job uses checkpointing to an Amazon S3 bucket every 60 seconds. Which action should the engineer take to resolve the issue?

A.Increase the checkpoint interval to 120 seconds.

B.Move the checkpoint location to an Amazon DynamoDB table.

C.Decrease the Kinesis shard count to reduce throughput.

D.Disable checkpointing and rely on Kinesis iterator age.

AnswerA

Reduces the frequency of checkpoint writes, mitigating contention.

Why this answer

The 'Failed to checkpoint' error in AWS Glue streaming jobs typically occurs when the checkpoint operation exceeds the 60-second interval due to high throughput or large state size. Increasing the checkpoint interval to 120 seconds provides more time for the checkpoint to complete, reducing the likelihood of timeouts and allowing the job to stabilize without losing progress.

Exam trap

The trap here is that candidates may assume DynamoDB is always faster for checkpoints (Option B), but AWS Glue streaming jobs natively support only S3 for checkpointing, and DynamoDB is not a valid checkpoint location—this distracts from the simple fix of adjusting the interval.

How to eliminate wrong answers

Option B is wrong because moving the checkpoint location to DynamoDB does not address the root cause of checkpoint timeouts; DynamoDB has its own throughput limits and latency, which could introduce similar or worse failures. Option C is wrong because decreasing the Kinesis shard count reduces throughput capacity, which may cause data loss or increased iterator age, but does not fix the checkpoint timeout issue—it could even worsen it by increasing processing pressure on fewer shards. Option D is wrong because disabling checkpointing removes fault tolerance entirely; relying solely on Kinesis iterator age does not provide recovery from failures and can lead to data reprocessing or loss, violating the job's reliability requirements.

1418

MCQmedium

A data engineer runs an AWS Glue ETL job that transforms data in Amazon S3. The job fails with the error shown in the exhibit. Which action will MOST likely fix the issue?

A.Decrease the number of workers from 2 to 1.

B.Add an IAM policy that grants the Glue job permission to write to S3.

C.Increase the number of workers from 2 to 4.

D.Change the worker type from G.1X to G.2X.

AnswerD

G.2X provides more memory per worker, addressing the OOM error.

Why this answer

Option D is correct because the error indicates an out-of-memory error in the Spark executor. Increasing the worker type (e.g., from G.1X to G.2X) provides more memory per worker. Option C is wrong because increasing the number of workers does not increase memory per worker if the worker type is unchanged.

Option A is wrong because decreasing the number of workers reduces parallelism and does not help memory. Option B is wrong because the error is not related to IAM permissions.

1419

MCQeasy

A data engineer needs to transfer 50 TB of data from an on-premises HDFS cluster to Amazon S3. The on-premises network has a 1 Gbps link to AWS. Which AWS service should be used to perform the transfer efficiently?

A.AWS DataSync

B.Amazon S3 Transfer Acceleration

C.AWS Snowball Edge

D.AWS Direct Connect

AnswerA

DataSync can transfer large datasets over the network efficiently.

Why this answer

Option A is correct because AWS DataSync is designed for large-scale data transfers over the network. Option B is wrong because S3 Transfer Acceleration speeds up uploads but is not a transfer service. Option C is wrong because AWS Snowball Edge is for offline data transfer, but the network link is sufficient.

Option D is wrong because AWS Direct Connect is a network connection, not a data transfer service.
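
The claim that the network link is sufficient can be checked with a quick estimate. This sketch assumes decimal units (1 TB = 10^12 bytes) and treats link utilization as an adjustable assumption; the 80% figure is illustrative, not a measured value.

```python
def transfer_days(size_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link of link_gbps gigabits per second."""
    bits = size_tb * 1e12 * 8                          # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * utilization)   # effective bits per second
    return seconds / 86_400


print(round(transfer_days(50, 1), 2))       # full line rate → 4.63
print(round(transfer_days(50, 1, 0.8), 2))  # assumed 80% utilization → 5.79
```

At roughly five to six days, an online transfer is practical here, which is why an offline device is not required.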

1420

MCQmedium

A data engineer needs to ingest streaming data from thousands of devices sending JSON messages via HTTP POST. The data should be stored in Amazon S3 with minimal latency and also be available for real-time analytics. Which combination of services is MOST appropriate?

A.Amazon DynamoDB with DynamoDB Streams and Lambda.

B.Amazon SQS and AWS Lambda to write to S3.

C.AWS Lambda directly writing to S3 via API Gateway.

D.Amazon API Gateway, Amazon Kinesis Data Streams, and Kinesis Data Firehose.

AnswerD

API Gateway receives POST, sends to Kinesis for real-time analytics, and Firehose batches to S3.

Why this answer

Option D is correct because API Gateway ingests HTTP POST, sends to Kinesis Data Streams for real-time consumption, and Firehose delivers to S3. Option B (SQS and Lambda) is not ideal for real-time streaming. Option A (DynamoDB Streams) is for database changes.

Option C (Lambda writing directly to S3 via API Gateway) lacks real-time analytics capability.

1421

MCQmedium

A data engineer needs to store and analyze time-series data from IoT devices. The data volume is 10 GB per day, and the queries are mostly on the most recent 7 days of data. The engineer wants to minimize storage costs while retaining historical data for 1 year. Which combination of AWS services is most cost-effective?

A.Amazon Timestream

B.Amazon DynamoDB with TTL and S3 for archival

C.Amazon Redshift

D.Amazon RDS with MySQL

AnswerA

Timestream is cost-effective for time-series data with automatic storage tiering.

Why this answer

Amazon Timestream is purpose-built for time-series data, offering automatic tiering between in-memory (for recent 7 days) and magnetic stores (for historical data up to 1 year). This matches the query pattern (mostly recent 7 days) and retention requirement (1 year) while minimizing storage costs through its serverless, pay-per-query model. Timestream also supports time-series-specific functions like interpolation and smoothing, making it more efficient than general-purpose databases for this workload.

Exam trap

The trap here is that candidates often choose DynamoDB with TTL and S3 for archival (Option B) because it seems cost-effective, but they overlook the operational complexity and query latency of accessing historical data in S3, which violates the 'minimize storage costs while retaining historical data for 1 year' requirement without considering query patterns.

How to eliminate wrong answers

Option B (DynamoDB with TTL and S3 for archival) is wrong because DynamoDB is optimized for key-value and document workloads, not time-series analytics; TTL only deletes old data, but querying historical data from S3 requires additional services like Athena or Glue, increasing complexity and latency. Option C (Amazon Redshift) is wrong because Redshift is a columnar data warehouse designed for large-scale analytical queries on structured data, but it is over-provisioned and costly for 10 GB/day of time-series data, and its storage and compute are not optimized for time-series-specific operations like downsampling or retention policies. Option D (Amazon RDS with MySQL) is wrong because RDS is a relational database with fixed storage and compute, leading to higher costs for storing 3.65 TB of historical data (10 GB/day × 365 days) and poor query performance on time-series data without built-in time-series features like automatic retention or partitioning.
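
The storage figures in this question follow from simple arithmetic, sketched below with the numbers given in the scenario (10 GB per day, 1 year of retention, queries mostly on the last 7 days) and decimal units.

```python
DAILY_GB = 10          # ingest volume from the scenario
RETENTION_DAYS = 365   # one year of history
HOT_DAYS = 7           # window that most queries touch

total_gb = DAILY_GB * RETENTION_DAYS
hot_gb = DAILY_GB * HOT_DAYS
hot_share = hot_gb / total_gb

print(f"retained: {total_gb} GB ({total_gb / 1000:.2f} TB)")
print(f"frequently queried: {hot_gb} GB ({hot_share:.1%} of the total)")
```

Only about 2% of the retained data is queried often, which is the pattern that tiered storage is meant to exploit.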

1422

MCQeasy

A company is using Amazon S3 for data lake storage. They need to query the data directly using SQL without loading it into a database. Which AWS service should be used?

A.Amazon Redshift Spectrum

B.Amazon Athena

C.Amazon EMR

D.AWS Glue

AnswerB

Athena is a serverless query service for S3 data using SQL.

Why this answer

Amazon Athena is the correct choice because it is a serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL, without needing to load or transform the data into a database. Athena is built on the open-source Presto and Trino engines and supports querying structured and semi-structured data formats (e.g., CSV, JSON, Parquet, ORC) stored in S3, making it ideal for ad-hoc SQL queries on a data lake.

Exam trap

The trap here is that candidates often confuse AWS Glue's data cataloging and ETL capabilities with direct SQL querying, or they assume Redshift Spectrum is a standalone service rather than a feature requiring an existing Redshift cluster, leading them to pick a wrong answer that requires additional infrastructure or is not a query engine.

How to eliminate wrong answers

Option A is wrong because Amazon Redshift Spectrum is a feature of Amazon Redshift that allows querying data in S3 from within a Redshift data warehouse, but it requires an existing Redshift cluster and is not a standalone service for directly querying S3 data without a database. Option C is wrong because Amazon EMR is a big data platform that uses frameworks like Apache Spark, Hive, or Presto for querying S3 data, but it requires provisioning and managing clusters, which adds complexity and is not a serverless SQL-only solution. Option D is wrong because AWS Glue is a serverless data integration service primarily used for ETL (extract, transform, load) jobs and data cataloging, not for directly querying S3 data with SQL; while it can prepare data for Athena, it is not a query engine itself.

1423

Multi-Selectmedium

A company ingests IoT sensor data into Kinesis Data Streams. The data is then processed by a Lambda function that aggregates readings and writes to DynamoDB. The Lambda function is experiencing high error rates due to throttling. Which TWO actions would reduce throttling?

Select 2 answers

A.Increase the number of shards in the Kinesis stream.

B.Increase the batch size in the Lambda event source mapping.

C.Decrease the batch window in the Lambda event source mapping.

D.Configure DynamoDB to use on-demand capacity mode.

E.Increase the Lambda reserved concurrency to 1000.

AnswersB, D

Larger batches mean fewer invocations, reducing throttling.

Why this answer

Option B is correct because increasing the batch size in the Lambda event source mapping allows each invocation to process more records from the Kinesis stream, reducing the number of concurrent Lambda invocations and thus lowering the risk of throttling. Option D is correct because switching DynamoDB to on-demand capacity mode eliminates write capacity limits, preventing throttling on the DynamoDB side that can cause Lambda retries and backpressure.

Exam trap

The trap here is that candidates often assume increasing shards (Option A) always improves throughput, but in a Lambda-integrated Kinesis stream, more shards mean more concurrent invocations, which can actually increase throttling risk.
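
The effect of batch size can be sketched with a small handler and a count of invocations. The handler reads the base64-encoded payload that Kinesis places under record["kinesis"]["data"]; the payload fields (sensor_id, value), the helper names, and the record counts are assumptions, and the DynamoDB write is left out.

```python
import base64
import json
import math
from collections import defaultdict


def handler(event, context=None):
    """Average the readings per sensor for one batch of Kinesis records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in event["Records"]:
        # Each payload arrives base64-encoded; the field names are hypothetical.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        sums[payload["sensor_id"]] += payload["value"]
        counts[payload["sensor_id"]] += 1
    return {sensor: sums[sensor] / counts[sensor] for sensor in sums}


def invocations_needed(record_count: int, batch_size: int) -> int:
    """Invocations required to process record_count records at a given batch size."""
    return math.ceil(record_count / batch_size)


def make_record(sensor_id: str, value: float) -> dict:
    """Build a record shaped like a Kinesis event entry (test helper)."""
    data = json.dumps({"sensor_id": sensor_id, "value": value}).encode()
    return {"kinesis": {"data": base64.b64encode(data).decode()}}


event = {"Records": [make_record("s1", 20.0), make_record("s1", 22.0), make_record("s2", 5.0)]}
averages = handler(event)
print(averages)                           # → {'s1': 21.0, 's2': 5.0}
print(invocations_needed(10_000, 100))    # → 100
print(invocations_needed(10_000, 1_000))  # → 10
```

Because the handler aggregates a whole batch in one call, a tenfold larger batch cuts the number of invocations tenfold and also reduces the number of writes sent to DynamoDB.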

1424

Multi-Selecthard

A company uses AWS Glue to transform data stored in S3. The Glue job runs daily and processes data in the range of hundreds of GB. The data engineer wants to optimize the job for cost and performance. Which THREE actions should be taken? (Choose THREE.)

Select 3 answers

A.Store intermediate data in HDFS on Amazon EMR

B.Increase the number of DPUs for the job

C.Reduce the number of DPUs to save cost

D.Use columnar data formats such as Parquet

E.Partition the data by date or other high-cardinality columns

AnswersB, D, E

More DPUs can reduce runtime, improving cost if job runs shorter.

Why this answer

Option D is correct because using columnar formats like Parquet improves performance and reduces data scanned. Option E is correct because partitioning reduces data processed. Option B is correct because increasing the number of DPUs can speed up the job, but must be balanced with cost.

Option A is wrong because S3 is already the source; moving to HDFS is not relevant. Option C is wrong because reducing DPUs would increase runtime.

1425

MCQeasy

A company wants to use Amazon Redshift Spectrum to query data in Amazon S3. The data is in Parquet format and partitioned by date. Which step is required to enable Redshift Spectrum?

A.Load the data into Redshift tables using the COPY command.

B.Create an external schema and external table in the AWS Glue Data Catalog.

C.Create a separate Redshift Spectrum cluster.

D.Copy the data from S3 to Redshift-managed storage.

AnswerB

Redshift Spectrum uses the Glue Data Catalog to query data in S3.

Why this answer

Option B is correct because Redshift Spectrum requires an external schema and table defined in the AWS Glue Data Catalog or an external Hive metastore. Option D is wrong because the data is already in S3 and Spectrum queries it in place. Option A is wrong because loading data into Redshift is not required for Spectrum.

Option C is wrong because Spectrum does not require a separate cluster.
