AWS Certified Data Engineer Associate DEA-C01 (DEA-C01) — Questions 175

1786 questions total · 24 pages · All types, answers revealed

1
MCQmedium

A company uses AWS DMS to migrate an on-premises Oracle database to Amazon Aurora PostgreSQL. The migration is ongoing with continuous replication. The data engineer notices that the target Aurora database has a higher lag than expected. Which action would most likely reduce the lag?

A.Increase the size of the S3 bucket used for staging
B.Increase the number of parallel tasks in the DMS task settings
C.Enable Batch Optimized Apply on the DMS task
D.Disable validation of data on the target
AnswerB

More parallel tasks improve apply throughput.

Why this answer

Option B is correct because increasing the number of parallel tasks improves apply throughput on the target. Option A is wrong because the staging bucket does not affect replication lag. Option D is wrong because turning off validation reduces reliability and may help only slightly.

Option C is wrong because batch-optimized apply is not for Aurora.

2
Multi-Selectmedium

A data engineer needs to securely store database credentials used by an AWS Glue ETL job. Which THREE steps should the engineer take?

Select 3 answers
A.Hardcode the credentials in the Glue job script.
B.Store the credentials in AWS Secrets Manager.
C.Grant the Glue job's IAM role permission to read the secret.
D.Configure the Glue job to use the Secrets Manager connector to retrieve credentials.
E.Use AWS Systems Manager Parameter Store with a SecureString parameter.
AnswersB, C, D

Secrets Manager provides secure storage and rotation.

Why this answer

Options B, C, and D are correct. AWS Secrets Manager securely stores the credentials, the Glue job's IAM role needs permission to read the secret, and the job can then retrieve the credentials through the AWS Secrets Manager connector or directly from the service. Option A is wrong because hardcoding credentials is insecure.

Option E is wrong because Parameter Store does not support automatic rotation.
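
The permission step in option C can be sketched as a small policy document. This is a minimal illustration, assuming the Glue job's role only needs to read a single secret; the ARN below is hypothetical and the helper name `secret_read_policy` is not part of any AWS SDK.

```python
import json


def secret_read_policy(secret_arn: str) -> dict:
    """Build an identity policy that allows reading one specific secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                # Scope the permission to the single secret the job needs.
                "Resource": secret_arn,
            }
        ],
    }


# Hypothetical ARN, used only for illustration.
EXAMPLE_ARN = (
    "arn:aws:secretsmanager:us-east-1:111122223333:secret:glue/db-credentials-AbCdEf"
)
policy = secret_read_policy(EXAMPLE_ARN)
print(json.dumps(policy, indent=2))
```

Scoping `Resource` to one secret rather than `*` keeps the role close to least privilege.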

3
MCQhard

Refer to the exhibit. A data engineer is troubleshooting a permissions issue. The IAM role 'DataEngineerRole' is used by an AWS Glue job that needs to read data from an S3 bucket encrypted with a customer managed KMS key. The above key policy is attached to the KMS key. The Glue job fails with an AccessDenied error when trying to read the data. What is the MOST likely cause?

A.The key policy requires requests to originate from a VPC endpoint, but the Glue job is not using one.
B.The key policy denies requests that are not using HTTPS, but the Glue job is using HTTPS.
C.The key policy condition 'kms:ViaService' restricts KMS actions to only when they are made through S3, but AWS Glue calls KMS directly, not via S3.
D.The Glue job is running in a different AWS region than the S3 bucket.
AnswerC

Glue does not use S3 to make KMS calls; it calls KMS directly, so the condition fails.

Why this answer

The key policy includes a condition `kms:ViaService` that restricts KMS actions to only when they are made through the S3 service. However, AWS Glue does not call KMS via S3; it calls KMS directly to decrypt the S3 object's data key. Because the Glue job's KMS request does not originate from the S3 service, the condition fails, resulting in an AccessDenied error.

Exam trap

AWS often tests the nuance that `kms:ViaService` only applies when the KMS API call is made through the specified service's endpoint, not when a service like Glue calls KMS directly to decrypt an S3 object's key.

How to eliminate wrong answers

Option A is wrong because the key policy does not contain any VPC endpoint condition; the condition shown is `kms:ViaService`, not `aws:SourceVpce`. Option B is wrong because the key policy does not include an HTTPS condition, and even if it did, AWS Glue uses HTTPS by default, so this would not cause an AccessDenied error. Option D is wrong because cross-region access to an S3 bucket is allowed as long as the KMS key is in the same region as the bucket; the error is not region-related.

4
MCQeasy

A data engineer needs to capture change data capture (CDC) events from an Amazon RDS for PostgreSQL database and stream them to Amazon S3 in near real-time. Which AWS service should be used?

A.Amazon S3 Transfer Acceleration
B.Amazon Athena
C.AWS Database Migration Service (AWS DMS)
D.Amazon Kinesis Data Streams
AnswerC

DMS supports ongoing replication (CDC) from databases to S3.

Why this answer

Option C is correct because AWS DMS can continuously replicate changes from a source database to S3. Option A is wrong because S3 Transfer Acceleration speeds up uploads but does not capture CDC. Option B is wrong because Athena is a query service.

Option D is wrong because Kinesis Data Streams is for custom streaming applications and does not capture database changes by itself.

5
MCQhard

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an on-premises Oracle database to Amazon S3 in Parquet format. The replication is used for near-real-time analytics. Recently, the DMS task started failing with an error indicating insufficient memory. The source database is large (2 TB). What should a data engineer do to resolve this issue while minimizing changes to the existing architecture?

A.Change the target format to JSON to reduce memory usage.
B.Split the DMS task into multiple smaller tasks.
C.Use Change Data Capture (CDC) only, without full load.
D.Increase the DMS replication instance size.
AnswerD

Provides more memory.

Why this answer

The error indicates the DMS replication instance is running out of memory during continuous replication of a 2 TB Oracle database to S3 in Parquet format. Increasing the replication instance size (Option D) directly addresses the memory constraint by providing more RAM and processing capacity, which is necessary for handling large volumes of Change Data Capture (CDC) data and Parquet conversion overhead. This solution requires minimal architectural changes, as it only involves modifying the instance class of the replication instance; the task, endpoints, and target format stay the same.

Exam trap

The trap here is that candidates may think splitting tasks or changing formats reduces memory usage, but the root cause is insufficient instance resources, and AWS DMS tasks require adequate instance sizing for large-scale CDC workloads.

How to eliminate wrong answers

Option A is wrong because changing the target format to JSON would not reduce memory usage; JSON is typically larger than Parquet and would increase memory consumption during serialization, not decrease it. Option B is wrong because splitting the DMS task into multiple smaller tasks would increase complexity and overhead, potentially causing additional memory pressure from multiple connections and task management, and does not directly resolve the insufficient memory error. Option C is wrong because using CDC only without full load ignores the fact that the task is already failing during continuous replication (CDC phase), and the full load may have already completed; disabling full load does not address the memory issue in CDC processing.

6
MCQmedium

Refer to the exhibit. This log snippet is from a failed AWS Glue job. The job processes a large dataset in memory. What is the MOST likely cause of the OutOfMemoryError?

A.The Glue job is running with insufficient DPUs or worker type.
B.The input data is in an unsupported file format.
C.The job is attempting to join two tables with mismatched keys.
D.The job has too many partitions.
AnswerA

Insufficient resources cause out-of-memory.

Why this answer

Option A is correct because an OutOfMemoryError in Glue often indicates that the DPU or worker type allocation is insufficient for the data size. Option B is wrong because the error is not about file format. Option C is wrong because the error is a heap space failure, not a join key mismatch.

Option D is wrong because the error is not about partitioning.

7
MCQmedium

A company has a large volume of CSV files in S3 that need to be transformed into Parquet using AWS Glue. The files are partitioned by date. The engineer wants to minimize costs by processing only new files each day. Which approach should be used?

A.Use S3 partition discovery to automatically read new partitions.
B.Schedule the Glue job to run daily and process all files.
C.Enable job bookmarks in the Glue job.
D.Use S3 Event Notifications to trigger the Glue job on each new file.
AnswerC

Bookmarks track processed data and skip already processed files.

Why this answer

Option C is correct because a Glue job bookmark tracks processed data and skips already processed files, so each run handles only new ones. Option A is wrong because partitioning alone does not prevent reprocessing. Option B is wrong because it would reprocess all files every day.

Option D is wrong because an S3 event notification can trigger a job, but without bookmarks the job may still reprocess files.
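
Job bookmarks are switched on through a job argument. The sketch below only builds that argument map; `bookmark_arguments` is a hypothetical helper, and passing the result as the job's default arguments is assumed rather than shown.

```python
def bookmark_arguments(enabled: bool = True) -> dict:
    """Return the Glue job argument that turns job bookmarks on or off."""
    value = "job-bookmark-enable" if enabled else "job-bookmark-disable"
    return {"--job-bookmark-option": value}


# Arguments for a daily job that should skip files it has already processed.
# Inside the job script, sources also need a transformation_ctx and the
# script must call job.commit() for the bookmark state to be saved.
daily_job_args = bookmark_arguments(enabled=True)
print(daily_job_args)
```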

8
MCQeasy

A data engineering team needs to ingest real-time streaming data from thousands of IoT devices and transform the data before storing it in Amazon S3. Which AWS service is most suitable for performing the transformation step in near real-time?

A.AWS Lambda
B.Amazon Kinesis Data Analytics
C.Amazon Kinesis Data Firehose
D.Amazon S3
AnswerA

Lambda can process Kinesis stream records and transform them.

Why this answer

Option A is correct because AWS Lambda can run code in response to Kinesis Data Streams events and perform transformations before writing to S3. Option B is incorrect because Kinesis Data Analytics is for running SQL/Java on streams, not simple transforms. Option C is incorrect because Kinesis Data Firehose is for loading data to destinations with optional simple transformations via Lambda.

Option D is incorrect because Amazon S3 is storage, not a transformation service.

9
MCQeasy

A data engineer needs to ingest streaming data from thousands of IoT devices into Amazon S3 in near real-time. The data must be processed with minimal latency and stored in a columnar format for analytics. Which service should the engineer use to ingest the data?

A.Amazon Kinesis Data Analytics
B.Amazon Simple Queue Service (SQS)
C.Amazon Kinesis Data Streams with a Lambda consumer
D.Amazon Kinesis Data Firehose
AnswerD

Directly loads streaming data to S3 with transformation and columnar format support.

Why this answer

Option D is correct because Amazon Kinesis Data Firehose is a fully managed service for loading streaming data into S3, and it can convert data to columnar formats like Parquet and ORC. Option A (Kinesis Data Analytics) processes data but does not load to S3 directly. Option C (Kinesis Data Streams with a Lambda consumer) requires custom consumers.

Option B (SQS) is a message queue, not designed for streaming ingestion.

10
MCQmedium

A data engineer needs to ingest data from an on-premises Oracle database to Amazon S3 daily. The data volume is 500 GB per day, and the network bandwidth is 200 Mbps. The requirement is to minimize the impact on the source database and ensure data integrity. Which combination of AWS services should be used?

A.AWS Database Migration Service (DMS) with S3 as target
B.AWS Glue ETL jobs with JDBC connection
C.Amazon Kinesis Data Firehose with Oracle as source
D.AWS Data Pipeline with SQLActivity
AnswerA

AWS DMS minimizes source impact by using change data capture and supports S3 as a target.

Why this answer

Option A is correct because AWS DMS can perform ongoing replication with minimal impact on the source, and S3 is a supported target. Option B (AWS Glue with a JDBC connection) is a batch ETL approach that may put load on the source. Option D (AWS Data Pipeline) is older and less efficient.

Option C (Amazon Kinesis Data Firehose) is for streaming, not batch database loads.
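
A quick back-of-the-envelope check shows whether the daily volume fits the link at all. This sketch assumes decimal units (1 GB = 1000 MB), a fully available 200 Mbps link, and no protocol overhead, none of which the question states.

```python
def transfer_hours(gigabytes: float, megabits_per_second: float) -> float:
    """Hours needed to move a volume over a link at full utilization."""
    megabits = gigabytes * 1000 * 8  # GB -> MB -> megabits
    seconds = megabits / megabits_per_second
    return seconds / 3600


hours = transfer_hours(500, 200)
print(f"{hours:.2f} hours")  # 5.56 hours
```

Under these assumptions the 500 GB daily load needs roughly five and a half hours of a 24-hour window, so the network does not rule out a daily transfer.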

11
MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format and needs to be converted to Parquet. However, the conversion is failing. What is the most likely cause?

A.The data transformation Lambda function is not converting to Parquet
B.The schema is not defined in the AWS Glue Data Catalog
C.The data size exceeds the 1 MB limit per record
D.The delivery stream is configured to use Kinesis Data Streams as source
AnswerB

Parquet conversion requires a schema; without it, Firehose cannot convert.

Why this answer

Kinesis Data Firehose requires a schema to convert to Parquet. This schema can be provided by a Glue Data Catalog table. If the schema is not defined, the conversion fails.

Data size is not an issue for conversion. Kinesis Data Streams is not involved here. Lambda transformation can convert to Parquet but is not required.
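
The schema dependency shows up directly in the delivery stream's format conversion settings. The sketch below builds that settings block as a plain dictionary; the role ARN, database, and table names are hypothetical, and `parquet_conversion_config` is an illustrative helper rather than an SDK function.

```python
def parquet_conversion_config(
    role_arn: str, database: str, table: str, region: str
) -> dict:
    """Build record format conversion settings for JSON-to-Parquet delivery."""
    return {
        "Enabled": True,
        # The schema comes from a Glue Data Catalog table; without it the
        # conversion has nothing to map JSON fields onto.
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
            "Region": region,
            "VersionId": "LATEST",
        },
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }


# Hypothetical names, used only for illustration.
config = parquet_conversion_config(
    "arn:aws:iam::111122223333:role/firehose-delivery",
    "analytics_db",
    "clickstream_events",
    "us-east-1",
)
print(config["SchemaConfiguration"]["TableName"])
```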

12
MCQeasy

A company is running a production database on Amazon RDS for PostgreSQL. The database experiences high read traffic from multiple application servers. Which data store management strategy would reduce the load on the primary database instance?

A.Enable DynamoDB Accelerator (DAX) for the database.
B.Enable Multi-AZ deployment for automatic failover.
C.Create an RDS Read Replica in the same region.
D.Use Amazon ElastiCache to cache query results.
AnswerC

Read Replicas allow offloading read queries, reducing load on the primary.

Why this answer

Option C is correct because RDS Read Replicas offload read queries from the primary instance. Option B is incorrect because Multi-AZ is for high availability, not read scaling. Option D is incorrect because ElastiCache is a caching layer, not a replica.

Option A is incorrect because DynamoDB Accelerator is for DynamoDB, not RDS.

13
MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. The data is JSON and each record is about 2 KB. The delivery stream is configured to buffer incoming data to 5 MB or 60 seconds, whichever comes first. The data engineering team notices that the S3 bucket contains many small files (average 2 MB), which makes subsequent processing inefficient. They need to reduce the number of small files without increasing the latency beyond 5 minutes. Which solution should they implement?

A.Enable compression (GZIP) on the delivery stream.
B.Increase the buffer size to 50 MB and the buffer interval to 300 seconds.
C.Use a Lambda function to merge small files after delivery.
D.Decrease the buffer size to 1 MB and the buffer interval to 60 seconds.
AnswerB

Larger buffer creates larger files within latency limit.

Why this answer

Option B is correct because raising both buffering thresholds lets each flush collect more data. The delivered files average 2 MB, which is below the 5 MB size threshold, so the 60-second interval is what triggers each flush before the buffer fills. Extending the interval to 300 seconds lets about five times as much data accumulate per object at the same ingest rate, and the 50 MB buffer size keeps the size threshold from cutting the flush short. The 300-second interval keeps latency within the 5-minute requirement. This reduces the number of small files without requiring additional services or post-processing.

Exam trap

The trap here is that candidates often assume compression (Option A) reduces file count, but compression only reduces file size, not the number of files; the real issue is the flush frequency controlled by buffer size and interval.

How to eliminate wrong answers

Option A is wrong because enabling GZIP compression reduces the storage size of each file but does not change the buffer size or flush behavior; the delivery stream will still flush at 5 MB or 60 seconds, producing the same number of small files (just compressed). Option C is wrong because using a Lambda function to merge small files after delivery adds complexity, cost, and latency (Lambda invocation delays, S3 event processing), and does not address the root cause of premature flushes; it also violates the goal of not increasing latency beyond 5 minutes due to the merging overhead. Option D is wrong because decreasing the buffer size to 1 MB and keeping the buffer interval at 60 seconds would make the problem worse, producing even smaller files (average ~1 MB) and more frequent flushes, increasing the number of small files.
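
The effect of the two buffering thresholds can be estimated with a small calculation. This is a rough model, assuming a steady ingest rate inferred from the observed 2 MB objects every 60 seconds; real traffic is burstier, and `estimated_object_mb` is an illustrative helper.

```python
def estimated_object_mb(
    ingest_mb_per_s: float, buffer_mb: float, buffer_seconds: float
) -> float:
    """Approximate size of each delivered object.

    A flush happens when either threshold is reached, so the object size is
    the smaller of the size limit and the data that arrives in one interval.
    """
    return min(buffer_mb, ingest_mb_per_s * buffer_seconds)


rate = 2 / 60  # MB per second, inferred from 2 MB objects per 60 s flush

current = estimated_object_mb(rate, buffer_mb=5, buffer_seconds=60)
proposed = estimated_object_mb(rate, buffer_mb=50, buffer_seconds=300)
print(f"current: {current:.1f} MB, proposed: {proposed:.1f} MB")
```

Under this model the interval, not the size limit, bounds the object size in both configurations, which is why lengthening the interval is what makes the files larger.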

14
Multi-Selecthard

A company uses Amazon DynamoDB for a gaming application that requires single-digit millisecond read and write latencies. The application experiences throttling on the 'GameScores' table during peak hours. The table has a partition key of 'game_id' and a sort key of 'player_id'. The data engineer needs to improve performance without changing the table's provisioned capacity. Which THREE actions should the engineer take? (Choose THREE.)

Select 3 answers
A.Enable DynamoDB adaptive capacity to allow more throughput per partition.
B.Enable DynamoDB auto scaling to adjust capacity based on traffic patterns.
C.Add a Global Secondary Index (GSI) on a different partition key to offload reads.
D.Implement DynamoDB Accelerator (DAX) for caching frequent reads.
E.Increase the read capacity units (RCUs) to twice the peak observed value.
AnswersB, C, D

Auto scaling prevents throttling by adjusting capacity automatically.

Why this answer

Option B is correct because DynamoDB auto scaling automatically adjusts the provisioned read and write capacity based on actual traffic patterns, preventing throttling during peak hours without manual intervention. This allows the table to handle bursts while maintaining single-digit millisecond latencies, as long as the traffic stays within the auto scaling limits.

Exam trap

The trap here is that candidates may confuse adaptive capacity with auto scaling, or think that increasing provisioned capacity is the only solution, but the question explicitly prohibits changing provisioned capacity, making options that alter RCUs or WCUs incorrect.

15
Multi-Selecteasy

Which TWO features of Amazon S3 help protect data from accidental deletion or modification? (Choose two.)

Select 2 answers
A.Lifecycle policies
B.Default encryption
C.S3 MFA Delete
D.S3 Object Versioning
E.Cross-Region Replication
AnswersC, D

MFA Delete requires additional authentication for deletions.

Why this answer

Options C and D are correct because versioning allows recovery of previous versions and MFA Delete adds an extra layer of protection. Option E (cross-region replication) is for disaster recovery, not accidental deletion. Option B (default encryption) protects data at rest, not against deletion.

Option A (lifecycle policies) automates transitions and expirations, not protection.

16
MCQeasy

A data engineer is monitoring an Amazon Redshift cluster and notices that the disk space usage is increasing rapidly. The engineer wants to reclaim space from deleted rows. Which command should the engineer run?

A.VACUUM
B.ANALYZE
C.UNLOAD
D.COPY
AnswerA

VACUUM reclaims space from deleted rows.

Why this answer

Option A is correct because VACUUM reclaims space from deleted rows. Option B is wrong because ANALYZE updates statistics. Option D is wrong because COPY loads data.

Option C is wrong because UNLOAD exports data.

17
MCQmedium

A company uses AWS Glue to process sensitive customer data stored in S3. The data engineer must ensure that the Glue ETL jobs do not write any data to S3 buckets that lack encryption. Which approach meets this requirement?

A.Use AWS CloudTrail to monitor and alert on unencrypted writes
B.Attach an S3 bucket policy that denies s3:PutObject unless the request includes the x-amz-server-side-encryption header
C.Enable S3 default encryption on the bucket
D.Configure an IAM role for Glue with a policy that denies s3:PutObject without encryption
AnswerB

S3 bucket policies can enforce encryption on uploads.

Why this answer

Option B is correct because an S3 bucket policy with a condition key `s3:x-amz-server-side-encryption` or `s3:x-amz-server-side-encryption-aws-kms-key-id` can deny PutObject if encryption is not specified. Option D is wrong because IAM policies cannot enforce encryption at the bucket level as effectively. Option A is wrong because CloudTrail is for auditing, not prevention.

Option C is wrong because default encryption encrypts objects at rest but does not reject writes that omit encryption.
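
A bucket policy of the kind described in option B can be sketched as follows. The bucket name is hypothetical and `deny_unencrypted_puts` is an illustrative helper; the `Null` condition matches requests where the encryption header is absent.

```python
import json


def deny_unencrypted_puts(bucket_name: str) -> dict:
    """Build a bucket policy that rejects uploads lacking an SSE header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # "Null": "true" means the header is missing from the request.
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
bucket_policy = deny_unencrypted_puts("example-sensitive-data")
print(json.dumps(bucket_policy, indent=2))
```

Because the statement is an explicit Deny, it applies to every writer, including the Glue job's role, regardless of what that role's own policies allow.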

18
MCQmedium

A company uses AWS KMS to encrypt data in Amazon S3. The security team requires that all encryption keys be rotated automatically every year. Which key type should the company use?

A.Use a customer managed key with automatic rotation enabled
B.Use the default S3 managed encryption key (SSE-S3)
C.Use an AWS managed key (aws/s3)
D.Use a customer managed key with manual rotation
AnswerD

Customer managed keys can be rotated manually or automatically; automatic rotation can be set to yearly, so this option is correct.

Why this answer

AWS managed keys (AWS-managed KMS keys) are automatically rotated every year. Customer managed keys do not rotate automatically unless automatic rotation is enabled on the key. The correct answer is C because AWS managed keys satisfy the requirement of automatic yearly rotation without additional configuration.

19
Multi-Selectmedium

A data engineer is setting up a Redshift cluster and needs to ensure high availability. Which TWO actions should be taken?

Select 2 answers
A.Enable concurrency scaling.
B.Configure cross-region snapshot copy.
C.Enable automatic replication across Availability Zones.
D.Use a single-node cluster to reduce complexity.
E.Deploy a multi-node cluster with at least two compute nodes.
AnswersC, E

Cross-AZ replication ensures availability in case of AZ failure.

Why this answer

Options C and E are correct. Multi-node clusters with automatic replication provide high availability. Option D is wrong because single-node clusters are not HA.

Option A is wrong because concurrency scaling improves performance, not HA. Option B is wrong because cross-region snapshot copy is disaster recovery, not HA.

20
Multi-Selecthard

A company is ingesting Apache logs from multiple web servers into AWS. The logs are sent via Amazon CloudWatch Logs to a subscription filter that delivers to a Lambda function. The Lambda function parses the logs and writes to Amazon S3. However, there is a significant backlog. Which THREE actions can reduce the backlog?

Select 3 answers
A.Route the CloudWatch Logs subscription to an Amazon SQS queue first
B.Increase the Lambda function memory allocation
C.Increase the Lambda function reserved concurrency
D.Change the Lambda function runtime from Python to Node.js
E.Increase the Lambda function maximum concurrency (unreserved account concurrency)
AnswersB, C, E

More memory also increases CPU, speeding up processing.

Why this answer

Increasing the Lambda function's available concurrency, increasing its memory (which also increases CPU), and setting reserved concurrency all add processing capacity. Changing the runtime has only a minor effect, and SQS is not part of this delivery path.

21
Multi-Selectmedium

A company is designing a disaster recovery strategy for an Amazon RDS for SQL Server database. The database must be recoverable in another AWS region within 15 minutes of a regional outage. Which TWO actions should the data engineer take?

Select 2 answers
A.Enable Multi-AZ deployment on the primary instance.
B.Configure automated backups to copy to the recovery region.
C.Configure automated cross-region snapshots to be copied to the recovery region.
D.Create a cross-region read replica in the desired recovery region.
AnswersC, D

Cross-region snapshot copy allows restoring in another region.

Why this answer

Options C and D are correct. A cross-region read replica can be promoted to a standalone instance in another region, providing recovery within minutes. Automated cross-region snapshots can also be used to restore to another region.

Option A is incorrect because Multi-AZ is within a region, not cross-region. Option B is incorrect because cross-region automated backups are not supported; you need snapshots.

22
MCQhard

A company runs an Amazon DynamoDB table with on-demand capacity. A new reporting application performs frequent Scan operations on the table, causing occasional 'ProvisionedThroughputExceededException' errors. The operations team needs to resolve this with minimal cost. What should they do?

A.Increase the table's maximum read capacity by requesting a limit increase from AWS Support.
B.Switch the table to provisioned capacity and increase the read capacity units.
C.Enable DynamoDB Accelerator (DAX) to cache the Scan results.
D.Create a global secondary index (GSI) on the attributes used in the reporting queries.
AnswerD

GSI enables efficient queries, reducing Scans and avoiding partition-level throttling.

Why this answer

Option D is correct because on-demand tables do not have provisioned throughput, but the error indicates throttling due to per-partition throughput limits. Creating a GSI allows the reporting queries to use a more efficient query pattern, reducing scans and partition hot spots. Option B is incorrect because switching to provisioned capacity would require careful capacity planning and might increase cost.

Option A is incorrect because the error is not due to table-level limits but partition-level limits. Option C is incorrect because DAX is a caching layer that can reduce read load but does not address the root cause of inefficient Scan operations.

23
MCQhard

A data engineer notices that an S3 bucket policy allows access to a user from another AWS account, but the access is being denied. What could be the reason?

A.The bucket policy does not include KMS permissions
B.The other account's IAM user does not have permissions to access the bucket
C.S3 does not support cross-account access
D.The bucket is in a different region
AnswerB

Cross-account access requires IAM permissions in the other account.

Why this answer

For cross-account S3 access to succeed, both the bucket policy (resource-based policy) and the IAM user policy (identity-based policy) in the other account must grant the necessary permissions. Option B is correct because even if the bucket policy allows access from the other account, the IAM user in that account must have an explicit IAM policy that permits the S3 action (e.g., s3:GetObject) on the bucket. Without this, the request is denied by the other account's own IAM evaluation.

Exam trap

The trap here is that candidates assume a bucket policy alone is sufficient for cross-account access, forgetting that the requesting account's IAM user must also have explicit permissions, which is a classic AWS cross-account authorization nuance.

How to eliminate wrong answers

Option A is wrong because KMS permissions are only required if the bucket uses SSE-KMS encryption; the question does not mention encryption, and a missing KMS permission would cause a different error (e.g., AccessDenied with KMS key context). Option C is wrong because S3 fully supports cross-account access via bucket policies and ACLs, as documented in the AWS S3 User Guide. Option D is wrong because a bucket can be accessed from any region; region does not inherently block cross-account access.
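
The two-sided check behind option B can be modeled in a few lines. This is a deliberately simplified sketch of cross-account evaluation that ignores service control policies, permissions boundaries, ACLs, and KMS; `cross_account_allowed` is an illustrative function, not an AWS API.

```python
def cross_account_allowed(
    bucket_policy_allows: bool,
    identity_policy_allows: bool,
    explicit_deny: bool = False,
) -> bool:
    """Simplified cross-account decision: both sides must allow the action."""
    if explicit_deny:
        # An explicit Deny in any applicable policy always wins.
        return False
    return bucket_policy_allows and identity_policy_allows


# The scenario in the question: the bucket policy allows the user, but the
# user's own account has not granted the S3 action.
print(cross_account_allowed(bucket_policy_allows=True, identity_policy_allows=False))
# After the other account attaches an identity policy that allows the action.
print(cross_account_allowed(bucket_policy_allows=True, identity_policy_allows=True))
```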

24
MCQmedium

A company is using Amazon Redshift for its data warehouse. A data engineer notices that COPY commands from S3 are failing intermittently with 'S3ServiceException: Access Denied'. The IAM role used by Redshift has the correct permissions. What is the MOST likely cause?

A.The IAM role is not attached to the Redshift cluster.
B.The S3 bucket uses SSE-KMS encryption and the role lacks kms:Decrypt.
C.The IAM role name contains a typo in the COPY command.
D.The S3 bucket policy denies access to the Redshift cluster's IP addresses.
AnswerD

Bucket policies can override IAM permissions and cause Access Denied.

Why this answer

Option D is correct because S3 bucket policies may deny access even if the role allows it. Option A is wrong because the role is already attached. Option B is wrong because encryption would cause different errors.

Option C is wrong because if the role exists, it should work; the issue is likely external.

25
MCQeasy

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time and then loaded into an Amazon S3 bucket for long-term storage. Which AWS service should be used to perform the transformation and delivery to S3 with minimal operational overhead?

A.Amazon Kinesis Data Firehose
B.AWS Glue
C.Amazon EMR
D.Amazon Kinesis Data Analytics
AnswerA

Kinesis Data Firehose can subscribe to a Kinesis Data Stream, transform data, and automatically deliver to S3.

Why this answer

Option A is correct because Kinesis Data Firehose can subscribe to a Kinesis Data Stream, transform data using Lambda or its built-in transformations, and automatically deliver to S3 with no code required for delivery. Option D is wrong because Kinesis Data Analytics is for real-time analytics, not direct delivery to S3. Option B is wrong because AWS Glue is a batch ETL service, not real-time.

Option C is wrong because Amazon EMR is a managed Hadoop cluster that requires significant overhead.

26
MCQhard

A company has a multi-account AWS environment with a centralized data lake in the Security account. Data producers in other accounts use AWS Glue to write data to S3 buckets in the Security account. The Security account uses AWS Lake Formation to manage permissions. The data engineer is setting up cross-account access so that users in the Producer account can query the data using Athena in their own account. The engineer has registered the S3 buckets and Data Catalog tables in Lake Formation. The IAM roles in the Producer account have the necessary permissions. However, when a user in the Producer account tries to query the table, they get an AccessDenied error. The error message indicates that the principal is not authorized to perform lakeformation:GetTable on the resource. What is the most likely cause?

A.The Glue Data Catalog resource policy is missing a statement to allow cross-account access.
B.The S3 bucket policy does not allow the Producer account's IAM role to read the data.
C.The KMS key policy does not allow the Producer account's IAM role to decrypt objects.
D.The Lake Formation permissions in the Security account do not include a grant to the Producer account's IAM role.
AnswerD

Lake Formation must grant cross-account access to the external IAM role.

Why this answer

Option D is correct because Lake Formation requires that the Producer account's IAM role be granted cross-account permissions in Lake Formation. Option B is wrong because the error names lakeformation:GetTable, not an S3 action, so the S3 bucket policy is not the failing check. Option C is wrong for the same reason: a KMS key policy problem would surface as a decryption failure, not a Lake Formation authorization error.

Option A is wrong because the Glue Data Catalog resource policy may need cross-account access, but the missing Lake Formation grant is the primary issue.

27
MCQeasy

A company wants to audit all data access events in their S3 buckets, including who accessed objects and from which IP address. Which AWS service should be used to capture these events?

A.AWS CloudTrail with data events enabled
B.Amazon CloudWatch Logs
C.AWS Config
D.Amazon S3 Server Access Logs
AnswerA

CloudTrail can log S3 object-level operations and capture user identity and source IP.

Why this answer

Option A is correct because AWS CloudTrail data events log S3 object-level API calls such as GetObject and PutObject, including the caller identity and source IP address. Option D is wrong because S3 server access logs record object-level requests but are delivered on a best-effort basis and carry less detailed identity information than CloudTrail. Option B is wrong because CloudWatch Logs can store logs but does not capture them directly.

Option C is wrong because AWS Config records resource configuration changes, not data access.

28
MCQhard

A company uses Amazon Athena to query data in an S3 bucket. A data engineer notices that a query fails with the error: 'HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/path/file.parquet (Path does not exist)'. However, the file exists in S3. What is the most likely cause?

A.The file was uploaded using S3 multipart upload and is incomplete.
B.The table's metadata in the Glue Data Catalog is outdated.
C.The S3 bucket has a bucket policy that denies access to the Athena principal.
D.Another process deleted the file after Athena listed the files but before reading.
AnswerD

A delete that lands between Athena's file listing and its read can cause this.

Why this answer

Option D is correct because a concurrent delete can remove a file after Athena has listed it but before the split is read. Option B is incorrect because the error names a file path, not table metadata. Option C is incorrect because a bucket policy denial would cause a permission error rather than a missing path.

Option A is incorrect because S3 is strongly consistent and an incomplete multipart upload is never listed as an object; a delete, however, can still race with a running query.

29
MCQeasy

A data engineer needs to transform JSON data from Amazon S3 into Parquet format using AWS Glue. The data contains nested fields. Which Glue feature should the engineer use to define the schema and handle the nested structure?

A.Use the 'FindMatches' transform to identify duplicates.
B.Use the 'DropFields' transform to remove nested fields.
C.Use the 'Relationalize' transform in a Glue ETL script.
D.Use the 'Spigot' transform to write sample data.
AnswerC

Relationalize flattens nested JSON into relational tables.

Why this answer

Option C is correct because Glue's built-in 'Relationalize' transform converts nested JSON into flat tables. Option A is wrong because 'FindMatches' is for deduplication. Option D is wrong because 'Spigot' is for sampling.

Option B is wrong because 'DropFields' is for removing fields, not handling nesting.

30
Multi-Selectmedium

A data engineer is troubleshooting a slow Amazon Redshift query. The query plan shows a large number of 'DS_DIST_ALL_INNER' and 'DS_BCAST_INNER' operations. Which TWO actions would likely improve query performance?

Select 2 answers
A.Set DISTSTYLE to ALL for both tables.
B.Change the distribution style of large tables to KEY on the join column.
C.Increase the number of slices by resizing the cluster.
D.Define SORTKEYs on the join columns.
E.Drop and recreate the tables with the same DDL.
AnswersB, C

KEY distribution collocates data on the same slice, reducing redistribution.

Why this answer

Option B is correct because using DISTSTYLE KEY on the join columns can reduce data redistribution. Option C is correct because increasing the number of slices spreads the data and the work across more compute. Option D is incorrect because a SORTKEY helps with range-restricted scans, not with redistribution during joins.

Option A is incorrect because DISTSTYLE ALL on both tables would copy both tables to every node, which is inefficient for large tables. Option E is incorrect because dropping and recreating the tables with the same DDL is disruptive and leaves the distribution unchanged.

31
MCQhard

A company runs a data pipeline that ingests streaming data via Amazon Kinesis Data Streams, processes it with an AWS Lambda function, and stores results in Amazon DynamoDB. The Lambda function sometimes fails due to 'ProvisionedThroughputExceededException' on the DynamoDB table. Which combination of steps should a data engineer take to resolve this issue?

A.Enable DynamoDB auto scaling and configure a dead-letter queue for the Lambda function.
B.Increase the Lambda function timeout and enable batch windows.
C.Increase the number of Kinesis shards to reduce Lambda invocations.
D.Increase Lambda reserved concurrency and disable retries.
AnswerA

Auto scaling adjusts throughput; DLQ captures failed records for reprocessing.

Why this answer

Option A is correct because DynamoDB auto scaling adjusts capacity based on load, and adding a dead-letter queue (DLQ) for failed records prevents data loss and allows reprocessing. Option B is wrong because increasing the Lambda timeout does not address DynamoDB throttling. Option D is wrong because Lambda reserved concurrency may limit processing rather than help, and disabling retries risks losing records.

Option C is wrong because the Kinesis shard count does not affect DynamoDB throughput.

32
Multi-Selectmedium

Which TWO of the following are benefits of using Amazon DynamoDB Accelerator (DAX)? (Choose TWO.)

Select 2 answers
A.Improves write throughput
B.Provides microsecond read latency
C.Offloads read traffic from the DynamoDB table
D.Provides data durability across Availability Zones
E.Reduces storage costs
AnswersB, C

DAX caches reads in memory for low latency.

Why this answer

Amazon DynamoDB Accelerator (DAX) is an in-memory cache that sits between your application and DynamoDB, providing microsecond response times for read-heavy workloads. It achieves this by caching frequently accessed data in memory, reducing the latency from single-digit milliseconds to microseconds for eventually consistent reads.

Exam trap

The trap here is that candidates often confuse DAX's read acceleration with write performance improvements, or assume that a cache provides durability guarantees similar to the underlying database.

33
Multi-Selectmedium

A data engineer is designing a data storage solution for IoT sensor data that is ingested at high velocity. The data is time-series and needs to be queried by time range. Which TWO AWS services are suitable for this use case? (Choose TWO)

Select 2 answers
A.Amazon RDS
B.Amazon Redshift
C.Amazon Timestream
D.Amazon DynamoDB
E.Amazon S3
AnswersC, D

Timestream is a time-series database built for IoT and operational applications.

Why this answer

Amazon Timestream is a purpose-built time-series database that automatically scales to handle high-velocity IoT sensor data, with built-in time-based partitioning and query optimization for time-range queries. It supports SQL queries with time-range predicates and time-series functions (e.g., `BETWEEN`, `ago()`, `bin()`) and separates storage into a memory store for recent data and a magnetic store for historical data, enabling efficient querying by time range.

Exam trap

The trap here is that candidates often choose Amazon RDS or Redshift because they are familiar with SQL-based querying, overlooking that Timestream and DynamoDB are purpose-built for high-velocity time-series ingestion and time-range queries, while RDS and Redshift incur performance and cost penalties for such workloads.

34
Multi-Selecteasy

Which TWO AWS services can be used to transform data in an Amazon S3 data lake before loading into Amazon Redshift? (Choose 2.)

Select 2 answers
A.AWS Lambda
B.Amazon Athena
C.Amazon EMR
D.AWS Glue
E.Amazon Redshift Spectrum
AnswersD, E

Glue can transform data in S3 and load into Redshift.

Why this answer

AWS Glue can run ETL jobs to transform S3 data and load into Redshift. Amazon Redshift Spectrum can query S3 data directly and load results into Redshift tables. Amazon Athena is for querying, not loading.

Amazon EMR requires cluster management. AWS Lambda is for small data volumes.

35
MCQmedium

A data engineer needs to grant cross-account access to an S3 bucket. The engineer wants to use a role in the source account and assume that role from the target account. Which permissions are required?

A.Role in source account with trust policy allowing target account, and bucket policy granting access to the role
B.Bucket policy in source account allowing the target account's root user
C.Role in target account with trust policy allowing source account
D.IAM policy in target account allowing s3:GetObject on the bucket
AnswerA

This is the correct cross-account access pattern using role assumption.

Why this answer

The target account's principal needs sts:AssumeRole permission on the role ARN in the source account, and the role's trust policy must allow the target account. The source account's bucket policy must also grant access to the assumed role. Option A is correct.
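The two policy documents in this pattern can be sketched as plain Python dictionaries. The account IDs, role name, bucket name, and the chosen S3 actions are all hypothetical placeholders, not values from the question.

```python
import json

# Hypothetical placeholders for illustration only.
TARGET_ACCOUNT_ID = "222222222222"
ROLE_ARN = "arn:aws:iam::111111111111:role/ExampleBucketAccessRole"
BUCKET = "example-shared-bucket"

# Trust policy on the role in the source account: lets principals in the
# target account call sts:AssumeRole on this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{TARGET_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Bucket policy in the source account: grants the role access to the bucket.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": ROLE_ARN},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(bucket_policy, indent=2))
```

The caller in the target account still needs its own IAM permission for `sts:AssumeRole` on the role ARN; the trust policy alone is not sufficient.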

36
MCQeasy

A company is using Amazon S3 as a data lake. Data is ingested hourly from multiple sources. The data engineer needs to ensure that once an object is written to S3, it cannot be overwritten or deleted for 30 days. Which S3 feature should be used?

A.Use S3 Lifecycle policies to transition objects to Glacier after 30 days.
B.Enable S3 Versioning and MFA Delete.
C.Configure a bucket policy that denies s3:DeleteObject for all principals.
D.Enable S3 Object Lock with a retention period of 30 days.
AnswerD

Object Lock enforces write-once-read-many (WORM) protection.

Why this answer

Option D is correct because S3 Object Lock with a retention period prevents object versions from being overwritten or deleted until the period ends. Option B is wrong because Versioning with MFA Delete only guards against accidental deletion. Option C is wrong because denying s3:DeleteObject protects only against deletes, not overwrites.

Option A is wrong because Lifecycle policies manage storage transitions, not immutability.
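A default 30-day retention rule can be sketched as the configuration document that S3's `PutObjectLockConfiguration` API expects. The choice of `COMPLIANCE` mode here is an assumption, since the question does not name a mode, and the helper function itself is hypothetical.

```python
def object_lock_configuration(days: int, mode: str = "COMPLIANCE") -> dict:
    """Build a default-retention rule for an Object Lock enabled bucket.

    The mode default is an assumption; GOVERNANCE allows privileged users to
    bypass retention, COMPLIANCE does not.
    """
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError("mode must be GOVERNANCE or COMPLIANCE")
    if days <= 0:
        raise ValueError("days must be positive")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": days}},
    }


config = object_lock_configuration(30)
print(config)
```

Object Lock works on object versions, so the bucket must have versioning enabled; a new write creates a new version while the locked version stays protected.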

37
MCQeasy

A company needs to audit all API calls made in their AWS account, including actions performed by the root user. Which AWS service should be used?

A.VPC Flow Logs
B.AWS CloudTrail
C.Amazon CloudWatch Logs
D.AWS Config
AnswerB

CloudTrail records API activity for auditing.

Why this answer

Option B is correct because AWS CloudTrail records API calls for auditing, including actions performed by the root user. Option C is incorrect because CloudWatch Logs stores logs but does not capture API calls natively. Option D is incorrect because AWS Config tracks resource configuration changes, not API calls.

Option A is incorrect because VPC Flow Logs capture network traffic, not API calls.

38
MCQmedium

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3. The data volume is about 10 TB per day. The engineer wants to set up a managed Kafka connector. Which AWS service should they use?

A.AWS Database Migration Service
B.AWS Lambda with Kafka trigger
C.Amazon MSK Connect
D.Amazon Kinesis Data Streams
AnswerC

MSK Connect runs Kafka Connect workers, including S3 sink connectors.

Why this answer

Amazon MSK Connect is a fully managed Kafka Connect service that can run S3 sink connectors. Lambda is not a Kafka sink; Kinesis is a different streaming service; DMS is for databases.

39
MCQhard

A company is building a data lake on S3 and needs to ingest data from on-premises Oracle database. The data is 5 TB and changes incrementally. The ingestion must capture changes in near real-time (less than 1 minute latency) and be cost-effective. Which approach should be used?

A.Use AWS Database Migration Service (DMS) with ongoing replication to S3
B.Use Amazon Kinesis Data Firehose with an Oracle JDBC connector
C.Use AWS Glue to perform a full table export daily
D.Use AWS DataSync to sync the Oracle data files to S3
AnswerA

DMS supports CDC and can replicate changes to S3 with low latency.

Why this answer

Option A (AWS DMS with ongoing replication from Oracle to S3) is correct because DMS supports change data capture (CDC) with low latency. Option C (daily full export using AWS Glue) does not capture incremental changes in near real-time. Option B (Kinesis Data Firehose with an Oracle JDBC connector) is not directly possible, because Firehose does not read from Oracle on its own.

Option D (AWS DataSync) is for file transfers, not database CDC.

40
MCQmedium

A data engineer is managing an Amazon Redshift cluster used for analytics. The cluster has a single node of type dc2.large. The engineer notices that queries are slowing down as data volume grows. The cluster's disk space is at 70% usage. The engineer needs to improve query performance and accommodate future growth. The budget allows for moderate cost increase. Which action should the engineer take?

A.Add another dc2.large node to the cluster.
B.Resize the cluster to a single ds2.xlarge node.
C.Migrate the cluster to a single ra3.xlplus node with managed storage.
D.Enable concurrency scaling and maintain the current cluster.
AnswerA

Adding nodes increases both compute and storage capacity for better performance.

Why this answer

Redshift's dc2 nodes are dense compute nodes limited by their local storage. Adding a node increases both compute and storage capacity, improving performance and scalability. Option A is correct.

Option B: resizing to a single ds2.xlarge node increases storage but not necessarily compute proportionally; ds2 is a dense storage node type. Option C: switching to ra3 nodes with managed storage allows compute scaling independent of storage, but ra3 nodes have a higher cost. Option D: enabling concurrency scaling adds cost for additional clusters and does not address the single-node bottleneck.

41
MCQmedium

A data engineer is troubleshooting a nightly ETL job that extracts data from an Amazon RDS MySQL instance and loads it into an Amazon S3 bucket in Parquet format. The job runs on an Amazon EMR cluster and has been failing with the error 'Access Denied' when writing to S3. The IAM role attached to the EMR cluster has permissions for S3 PutObject. What is the MOST likely cause?

A.The S3 bucket uses SSE-KMS encryption and the EMR role lacks kms:GenerateDataKey permission.
B.The S3 bucket has a Lifecycle rule that expires objects too quickly.
C.The EMR cluster was terminated before the write operation completed.
D.The S3 bucket policy denies access to the EMR cluster's IAM role.
AnswerD

S3 bucket policies can explicitly deny access, overriding IAM allow.

Why this answer

Option D is correct because an explicit deny in an S3 bucket policy overrides an IAM allow; if the bucket policy denies the EMR cluster's role, the write fails even though the role has s3:PutObject. Option B is wrong because S3 Lifecycle rules do not affect write permissions. Option A is wrong because KMS permissions are needed only if the bucket uses SSE-KMS, which the scenario does not state.

Option C is wrong because EMR cluster termination would cause a different error.

42
MCQmedium

A company ingests JSON logs into Amazon S3 using Kinesis Data Firehose. The logs contain a timestamp field, but the delivery to S3 is delayed by up to 15 minutes during peak hours. The business requires near-real-time availability (under 2 minutes). Which configuration change should the data engineer make?

A.Increase the number of shards in the Kinesis Data Firehose stream
B.Increase the buffer size to 128 MB
C.Decrease the buffer interval to 60 seconds
D.Enable buffering hints in the Firehose delivery stream
AnswerC

Shorter buffer interval reduces delivery latency.

Why this answer

Option C is correct because reducing the buffer interval to 60 seconds forces Firehose to deliver data more frequently, reducing latency. Option B (increasing the buffer size) would not reduce latency, because a larger buffer takes longer to fill. Option A (increasing shards) does not apply to Firehose; shards belong to Kinesis Data Streams.

Option D (enabling buffering hints) is wrong because buffering hints are simply the buffer size and interval settings the delivery stream already has; latency improves only when the interval is lowered.
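The buffering settings discussed above map to the `BufferingHints` structure in the Firehose API. The sketch below only builds and checks that structure; the helper names and the 5 MB size default are assumptions for illustration, and nothing here calls AWS.

```python
MAX_INTERVAL_SECONDS = 900  # upper bound Firehose accepts for the buffer interval


def buffering_hints(interval_seconds: int, size_mb: int = 5) -> dict:
    """Build a BufferingHints structure (size default is an assumption)."""
    if interval_seconds < 0 or interval_seconds > MAX_INTERVAL_SECONDS:
        raise ValueError("interval_seconds is out of range")
    return {"IntervalInSeconds": interval_seconds, "SizeInMBs": size_mb}


def meets_latency_target(hints: dict, target_seconds: int) -> bool:
    """Firehose flushes when either threshold is hit, so the interval
    bounds how long a record can wait in the buffer."""
    return hints["IntervalInSeconds"] <= target_seconds


hints = buffering_hints(60)
print(meets_latency_target(hints, 120))                 # True
print(meets_latency_target(buffering_hints(900), 120))  # False
```

Because delivery is triggered by whichever threshold is reached first, raising `SizeInMBs` never shortens the wait; only a lower `IntervalInSeconds` does.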

43
MCQmedium

A data engineer notices that an Amazon Redshift cluster is experiencing slow query performance. The engineer suspects that tables are not properly sorted. Which diagnostic query should the engineer run to identify unsorted rows?

A.SELECT * FROM SVV_TABLE_INFO ORDER BY unsorted DESC;
B.SELECT * FROM PG_CATALOG;
C.SELECT * FROM STV_TBL_PERM;
D.SELECT * FROM STL_LOAD_ERRORS;
AnswerA

SVV_TABLE_INFO shows unsorted rows for each table.

Why this answer

The `SVV_TABLE_INFO` system view in Amazon Redshift provides metadata about each table, including the `unsorted` column which shows the percentage of unsorted rows. By ordering by `unsorted DESC`, the engineer can quickly identify tables with the highest proportion of unsorted data, which directly impacts query performance due to inefficient zone maps and scan pruning.

Exam trap

The trap here is that candidates may confuse `SVV_TABLE_INFO` with `STV_TBL_PERM` (which shows block counts) or `STL_LOAD_ERRORS` (which is for load debugging), missing that only `SVV_TABLE_INFO` exposes the `unsorted` column specifically designed for sort health analysis.

How to eliminate wrong answers

Option B is wrong because `PG_CATALOG` is a system schema containing PostgreSQL catalog tables (e.g., `pg_class`, `pg_attribute`), not a diagnostic view for unsorted rows; it lacks the `unsorted` metric. Option C is wrong because `STV_TBL_PERM` provides block-level storage information (e.g., number of blocks per slice) but does not include a column for unsorted row percentage. Option D is wrong because `STL_LOAD_ERRORS` logs errors from COPY and INSERT operations, such as data type mismatches or malformed CSV rows, and has no relevance to sort key efficiency.

44
MCQeasy

A company uses Amazon DynamoDB for a gaming application. They need to store player session data that expires after 24 hours. Which DynamoDB feature should they use to automatically delete expired items?

A.Time to Live (TTL)
B.DynamoDB auto scaling
C.DynamoDB Streams
D.Point-in-time recovery
AnswerA

TTL automatically deletes expired items based on a timestamp attribute.

Why this answer

DynamoDB Time to Live (TTL) is the correct feature because it lets you designate a per-item timestamp attribute (e.g., `expireAt`), and DynamoDB automatically deletes each item after that timestamp has passed. This is ideal for expiring session data after 24 hours without requiring custom code or scheduled jobs to scan and delete items, reducing cost and operational overhead.

Exam trap

The trap here is that candidates may confuse TTL with DynamoDB Streams, thinking streams can automatically delete items, but streams only notify of changes and require separate logic to perform deletions.

How to eliminate wrong answers

Option B (DynamoDB auto scaling) is wrong because it manages throughput capacity (read/write units) based on traffic, not item expiration or deletion. Option C (DynamoDB Streams) is wrong because it captures item-level changes (inserts, updates, deletes) in near real-time for downstream processing, but does not automatically delete items. Option D (Point-in-time recovery) is wrong because it provides continuous backups to restore a table to any point within the last 35 days, but does not handle automatic deletion of expired data.
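A TTL attribute holds a Unix epoch timestamp in seconds, so a 24-hour session expiry is simple arithmetic at write time. The sketch below is a minimal illustration; the attribute names `playerId`, `createdAt`, and `expireAt` and the helper function are hypothetical.

```python
import time

SESSION_LIFETIME_SECONDS = 24 * 60 * 60  # 24 hours


def session_item(player_id: str, now=None) -> dict:
    """Build a session item whose TTL attribute is 24 hours after creation."""
    created = int(time.time()) if now is None else int(now)
    return {
        "playerId": player_id,
        "createdAt": created,
        # TTL attribute: epoch seconds, stored as a Number in DynamoDB.
        "expireAt": created + SESSION_LIFETIME_SECONDS,
    }


item = session_item("player-123", now=1_700_000_000)
print(item["expireAt"])  # 1700086400
```

TTL deletion runs as a background process, so an item may remain readable for a while after `expireAt`; queries that must exclude expired sessions can filter on the attribute.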

45
Multi-Selecteasy

A data engineer needs to ingest data from a SaaS application that sends webhooks in JSON format. The data must be stored in S3 for batch analysis. Which AWS services can receive the webhooks and store the data in S3 with minimal custom code? (Choose TWO.)

Select 2 answers
A.AWS Lambda with S3 SDK
B.AWS Glue with a Python shell job
C.Amazon Kinesis Data Streams
D.Amazon API Gateway with S3 integration
E.Amazon API Gateway with Kinesis Data Firehose integration
AnswersD, E

API Gateway can directly write to S3.

Why this answer

Options D and E are correct because API Gateway can receive webhooks and route them to S3 or Firehose through service integrations. Option B is wrong because Glue cannot directly receive webhooks. Option C is wrong because Kinesis Data Streams requires custom producers.

Option A is wrong because Lambda needs custom code to receive the webhook and write it to S3.

46
MCQhard

A company is using Amazon Redshift for data warehousing. The data engineer notices that the STL_ALERT_EVENT_LOG table shows many 'missing statistics' alerts. What is the best course of action to address this issue?

A.Increase the WLM concurrency slots.
B.Run VACUUM on the tables.
C.Enable compression on the tables.
D.Run ANALYZE on the tables.
AnswerD

ANALYZE updates table statistics, resolving missing statistics alerts.

Why this answer

The STL_ALERT_EVENT_LOG table records alerts about query performance issues, including 'missing statistics' alerts. This indicates that the query optimizer lacks up-to-date table statistics, leading to suboptimal query plans. Running the ANALYZE command updates table statistics, enabling the optimizer to generate efficient execution plans.

Therefore, option D is the correct course of action.

Exam trap

The trap here is that candidates often confuse VACUUM (which reorganizes data) with ANALYZE (which updates statistics), assuming both are needed for query performance, but only ANALYZE directly resolves 'missing statistics' alerts.

How to eliminate wrong answers

Option A is wrong because increasing WLM concurrency slots does not address missing statistics; it only allows more queries to run simultaneously, which could worsen performance if statistics are outdated. Option B is wrong because VACUUM reclaims disk space and sorts rows but does not update table statistics; it is used for managing data storage, not query optimization. Option C is wrong because enabling compression reduces storage and I/O but does not provide the optimizer with the statistical metadata needed for efficient query planning.

47
Multi-Selecthard

A data engineer is designing a data transformation pipeline using AWS Glue. The source data is in Amazon S3 in Parquet format, and the transformed output must be written to another S3 bucket in Parquet format partitioned by year, month, day. The pipeline should handle incremental updates efficiently. Which three features should the engineer use? (Choose THREE.)

Select 3 answers
A.AWS Glue job bookmarks to track processed data
B.Use AWS Glue JobWatch for monitoring job progress
C.Use DynamicFrames instead of Spark DataFrames for schema handling
D.Enable partition pruning in the Glue job
E.Use Spark SQL for transformations
AnswersA, C, D

Job bookmarks enable incremental processing.

Why this answer

Options A, C, and D are correct. Bookmarks track processed data for incremental jobs. DynamicFrames are flexible for schema evolution.

Partition pruning helps read only relevant partitions. Option E (Spark SQL) is not specific to incremental processing. Option B (JobWatch) is not an AWS Glue feature.

48
MCQmedium

A company is using Amazon RDS for MySQL and needs to reduce read latency for a global user base. Which AWS feature should be implemented?

A.Multi-AZ deployment
B.Aurora Auto Scaling
C.Read Replicas
D.Cross-Region Replication
AnswerC

Read Replicas allow offloading read queries to reduce latency.

Why this answer

Option C is correct because Read Replicas offload read queries from the primary and reduce read latency; they can also be promoted to stand-alone instances. Option A is wrong because Multi-AZ is for high availability, not read scaling. Option B is wrong because Aurora Auto Scaling applies to Aurora and adjusts capacity, not read latency.

Option D is wrong because cross-region replication latency is high.

49
MCQhard

A company stores sensitive data in S3 and uses VPC Endpoints to restrict access. They want to ensure that data can only be accessed from their VPC. What configuration is required?

A.Configure VPC Flow Logs to monitor access
B.Add a bucket policy condition aws:SourceVpce
C.Enable S3 Block Public Access
D.Associate a security group with the S3 bucket
AnswerB

This condition restricts access to requests originating from the specified VPC endpoint.

Why this answer

S3 bucket policies can use the aws:SourceVpce condition to restrict access to a specific VPC endpoint, so Option B is correct. Option C is wrong because S3 Block Public Access is a broader setting that does not tie access to a VPC. Option A is wrong because VPC Flow Logs log traffic but do not restrict access.

Option D is wrong because security groups attach to network interfaces such as those of EC2 instances, not to S3 buckets.
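A bucket policy that uses this condition usually takes the form of a deny for any request that does not arrive through the endpoint. The sketch below builds such a policy document; the bucket name and endpoint ID are hypothetical placeholders.

```python
import json


def vpce_only_bucket_policy(bucket: str, vpce_id: str) -> dict:
    """Deny every S3 action unless the request came through the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Requests from any other endpoint, or from outside the VPC, are denied.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


policy = vpce_only_bucket_policy("example-sensitive-bucket", "vpce-0example1234567890")
print(json.dumps(policy, indent=2))
```

A deny this broad also blocks console and administrative access from outside the endpoint, so it is worth testing on a non-critical bucket first.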

50
MCQmedium

A company stores sensitive customer data in an Amazon S3 bucket. The security team requires that all data be encrypted at rest using a key that is automatically rotated every year. Which encryption solution should the data engineer use?

A.SSE-S3
B.SSE-KMS with a customer managed key
C.SSE-C
D.Client-side encryption
AnswerB

KMS can automatically rotate customer managed keys annually.

Why this answer

SSE-KMS with a customer managed key allows automatic annual key rotation (when enabled). Option B is correct. Option A: SSE-S3 uses S3-managed keys, so rotation is not customer-controlled.

Option C: SSE-C requires customer-provided keys, so rotation is manual. Option D: client-side encryption does not use AWS KMS for key rotation.

51
Multi-Selecteasy

A company uses AWS Glue to catalog and transform data in Amazon S3. The Glue ETL jobs are failing intermittently with 'ThrottlingException' errors. Which THREE actions can help mitigate this issue? (Select THREE.)

Select 3 answers
A.Implement exponential backoff and retry in the Glue job code.
B.Increase the number of DPUs for the Glue job.
C.Request a service quota increase for the Glue API.
D.Enable job bookmarking to avoid reprocessing old data.
E.Switch from PySpark to Spark SQL.
AnswersA, C, D

Exponential backoff retries throttled requests, reducing failure impact.

Why this answer

Options A, C, and D are correct. Implementing retry logic with exponential backoff handles transient throttling. Requesting a service quota increase helps when the Glue API limit is being reached.

Using Glue job bookmarking reduces reprocessing, thus lowering API calls. Option B (increasing DPUs) addresses resource issues but not API throttling. Option E (changing to Spark SQL) does not affect API calls.
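Retry with exponential backoff can be sketched in a few lines of plain Python. `ThrottlingError` is a stand-in exception defined here for illustration, not a Glue or boto3 class, and the delay values are arbitrary assumptions.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for a throttling error from an API call."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=20.0, sleep=time.sleep):
    """Call operation(), retrying throttled calls with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay ceiling doubles each attempt; random jitter spreads retries out.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Usage: an operation that is throttled twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottlingError("rate exceeded")
    return "ok"


result = call_with_backoff(flaky, sleep=lambda seconds: None)
print(result, attempts["count"])  # ok 3
```

The `sleep` parameter is injectable only so the example runs instantly; in a real job the default `time.sleep` would apply.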

52
MCQmedium

A company uses AWS Glue crawlers to populate the Data Catalog from data in Amazon S3. The crawler fails to update the schema when new columns are added to the CSV files. What is the most likely cause?

A.The S3 bucket has versioning enabled.
B.The crawler is configured to only crawl new partitions.
C.The IAM role for the crawler lacks permissions to read the new columns.
D.The crawler uses a custom classifier that defines a fixed schema.
AnswerD

Custom classifiers can override schema inference.

Why this answer

Option D is correct because a crawler configured with a custom classifier that defines a fixed schema will ignore new columns. Option B is wrong because crawling only new partitions does not by itself prevent schema updates. Option A is wrong because S3 versioning is unrelated to schema inference.

Option C is wrong because missing IAM permissions would cause a different error.

53
MCQmedium

A company runs a multi-AZ Amazon RDS for PostgreSQL instance. They need to run a one-time analytical query that will take several hours and consume significant I/O. The query should not impact the primary workload. What should the data engineer do?

A.Create a read replica of the RDS instance and run the query on the replica.
B.Run the query directly on the primary instance during off-peak hours.
C.Increase the instance size to handle the load.
D.Enable Multi-AZ and run the query on the standby instance.
AnswerA

Read replica offloads read traffic from the primary.

Why this answer

Option A is correct because creating a read replica of the RDS for PostgreSQL instance allows the analytical query to run on a separate database instance without affecting the primary workload. Read replicas in Amazon RDS use asynchronous replication from the source instance, so the replica can handle heavy I/O and long-running queries independently. This ensures the primary instance remains available for the production workload without performance degradation.

Exam trap

The trap here is that candidates often confuse the Multi-AZ standby instance with a read replica, assuming the standby can be used for queries, but in Amazon RDS, the standby is only for high availability and is not accessible for read operations.

How to eliminate wrong answers

Option B is wrong because running the query directly on the primary instance, even during off-peak hours, still consumes significant I/O and CPU resources on that instance, which can impact the primary workload and potentially cause performance issues or increased latency. Option C is wrong because increasing the instance size only adds more resources to the same single instance; the analytical query would still compete with the primary workload for I/O and memory, and scaling up does not isolate the workload. Option D is wrong because the standby instance in a Multi-AZ deployment is not directly accessible for read or write operations; it is a synchronous replica used only for automatic failover, and Amazon RDS does not allow connecting to the standby for queries.

54
MCQeasy

A data engineer is tasked with ingesting on-premises database snapshots (full load) into Amazon S3 on a daily basis. The database is PostgreSQL and the snapshot size is 50 GB. The network link is 1 Gbps. Which approach is the MOST time-efficient and cost-effective?

A.Use AWS Database Migration Service (DMS) with S3 as target.
B.Use AWS Snowball Edge to transfer the snapshot.
C.Use AWS CLI to copy the snapshot file directly to S3.
D.Write a Lambda function to run pg_dump and upload to S3.
AnswerA

DMS can perform full loads from on-premises PostgreSQL to S3 in a managed and scalable way.

Why this answer

Option A is correct because AWS DMS can perform a full load from on-premises PostgreSQL to S3 efficiently, and it is a managed service. Option C (CLI) would be manual and slow. Option B (Snowball) is for larger datasets or slow networks; here 50 GB at 1 Gbps takes ~7 minutes, so Snowball is overkill.

Option D (Lambda + pg_dump) adds custom code and complexity.
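The ~7 minute figure follows from converting gigabytes to gigabits and dividing by the link speed. The short calculation below assumes decimal units and a fully utilized link, which real transfers rarely achieve.

```python
def transfer_minutes(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time in minutes, ignoring protocol overhead."""
    size_gigabits = size_gb * 8          # 1 byte = 8 bits
    seconds = size_gigabits / link_gbps  # gigabits / (gigabits per second)
    return seconds / 60


print(round(transfer_minutes(50, 1), 1))  # 6.7
```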

55
MCQhard

A data engineer is designing a multi-region disaster recovery plan for an Amazon DynamoDB table. The table stores critical user profile data and must have a Recovery Point Objective (RPO) of less than 1 minute and a Recovery Time Objective (RTO) of less than 5 minutes. Which solution meets these requirements?

A.Configure DynamoDB Streams and a Lambda function to replicate data to another region.
B.Use DynamoDB on-demand backup and restore to another region.
C.Use DynamoDB global tables to replicate data to another region.
D.Enable point-in-time recovery (PITR) and restore to another region.
AnswerC

Global tables provide near-real-time replication and fast failover.

Why this answer

DynamoDB global tables provide multi-region active-active replication with replication latency that is typically within a second, meeting RPO < 1 minute and RTO < 5 minutes (by failing over to another region). Option B is wrong because on-demand backup and restore can take hours to restore. Option D is wrong because point-in-time recovery (PITR) restores to a new table, which can take minutes to hours.

Option A is wrong because cross-region replication using Streams and Lambda is custom and may have higher latency.

56
Multi-Selecteasy

A data engineer is building a data ingestion pipeline that uses AWS Lambda to process records from Amazon Kinesis Data Streams. The Lambda function writes the processed data to Amazon DynamoDB. Which TWO factors affect the maximum number of concurrent Lambda executions for this stream? (Choose TWO.)

Select 2 answers
A.DynamoDB table's read capacity units
B.Lambda function's memory allocation
C.Kinesis stream name
D.Batch size configured for the Lambda event source mapping
E.Number of shards in the Kinesis stream
AnswersD, E

Batch size determines how many records are sent per invocation, affecting concurrency.

Why this answer

Correct options: D and E. The number of shards determines the maximum number of concurrent batches (one per shard by default). The batch size (records per invocation) also affects concurrency because smaller batches lead to more invocations.

Option A is wrong because the DynamoDB table's read capacity does not affect Lambda concurrency. Option B is wrong because Lambda memory is not directly related to concurrency. Option C is wrong because the stream name does not affect concurrency.

57
MCQeasy

A data engineer needs to transform JSON data from an S3 bucket using AWS Glue. The JSON contains nested arrays and objects. Which Glue transform is best suited for flattening nested structures?

A.Unnest
B.ResolveChoice
C.Relationalize
D.Map
AnswerC

Relationalize flattens nested structures into separate DynamicFrames.

Why this answer

The Relationalize transform is specifically designed to flatten nested JSON structures (arrays and objects) into a set of related tables, making it ideal for this use case. It automatically handles complex nesting by creating separate DynamicFrames for each nested level and linking them via foreign keys, which is exactly what is needed when ingesting JSON with nested arrays and objects into a relational format.

Exam trap

The trap here is that candidates often confuse the generic Spark SQL function `explode` (or the concept of 'unnesting') with a named AWS Glue transform, leading them to select 'Unnest' even though it does not exist as a Glue transform and would require manual handling of multiple nesting levels.

How to eliminate wrong answers

Option A (Unnest) is wrong because AWS Glue does not have a built-in transform named 'Unnest'; this is a Spark SQL function (e.g., `explode`) but not a named Glue transform, and it would require manual handling of multiple nesting levels. Option B (ResolveChoice) is wrong because it is used to resolve schema ambiguities (e.g., when a column has mixed types like string and int) and does not flatten nested structures. Option D (Map) is wrong because it applies a function to each record in a DynamicFrame for row-wise transformations, but it does not inherently flatten nested arrays or objects—you would need to write custom logic to handle the nesting.
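To make the flattening idea concrete, here is a plain-Python sketch of what a relationalize-style transform does conceptually: nested objects become dotted column names, and nested arrays move into child tables linked back to the parent row. This is not the Glue API; the function name, the `_row_id`, `_parent_id`, and `_index` columns, and the table naming are all assumptions made for illustration.

```python
def relationalize_sketch(records, root="root"):
    """Flatten nested dicts into dotted columns; move lists into child tables."""
    tables = {root: []}

    def flatten(table, name, value, row):
        if isinstance(value, dict):
            # Nested object: fold its fields into the same row with dotted names.
            for key, inner in value.items():
                flatten(table, f"{name}.{key}", inner, row)
        elif isinstance(value, list):
            # Nested array: one child-table row per element, linked to the parent.
            child = f"{table}_{name}"
            for index, element in enumerate(value):
                item = element if isinstance(element, dict) else {"val": element}
                add_row(child, item, {"_parent_id": row["_row_id"], "_index": index})
        else:
            row[name] = value

    def add_row(table, obj, extra):
        rows = tables.setdefault(table, [])
        row = {"_row_id": len(rows), **extra}
        rows.append(row)
        for key, value in obj.items():
            flatten(table, key, value, row)

    for record in records:
        add_row(root, record, {})
    return tables


orders = [{
    "order": 1,
    "customer": {"name": "Ana", "address": {"city": "Lima"}},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}]
tables = relationalize_sketch(orders)
print(tables["root"])
print(tables["root_items"])
```

In a Glue job the built-in transform does this work on DynamicFrames and returns a collection of them; its actual column and table naming differs from this sketch.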

58
MCQmedium

A data engineer runs the describe-table command shown in the exhibit. The application frequently queries by CustomerID alone. Currently, these queries result in full table scans. Which action should the engineer take to improve query performance?

A.Change the sort key to OrderID
B.Create a local secondary index on CustomerID
C.Increase the read capacity units to 500
D.Create a global secondary index on CustomerID
AnswerD

A GSI allows efficient queries on CustomerID.

Why this answer

Queries by CustomerID alone currently scan the whole table, which means no key or index on the table supports that access pattern by itself. Creating a global secondary index (GSI) with CustomerID as its partition key lets the application run a Query against the index instead of a Scan. A GSI is a separate structure with its own key schema and its own read/write capacity, and it can be queried directly.

Exam trap

AWS often tests the misconception that increasing capacity (RCUs/WCUs) can fix query performance issues, but the trap here is that throughput and indexing are separate concerns — full table scans are a design problem, not a capacity problem.

How to eliminate wrong answers

Option A is wrong because changing the sort key to OrderID would not help queries by CustomerID alone: a sort key orders items within a partition and is not a way to look items up by a different attribute. Option B is wrong because a local secondary index (LSI) shares the table's partition key and only adds an alternate sort key, so it cannot serve a query that supplies CustomerID alone. Option C is wrong because increasing read capacity units only increases throughput, not query efficiency; full table scans still occur regardless of capacity, and the issue is the lack of an index, not insufficient capacity.
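The difference between a scan and an index query can be modeled in a few lines of plain Python. This is a toy model rather than DynamoDB itself; the table layout (keyed by OrderID, with CustomerID as an ordinary attribute) is an assumption, since the exhibit is not reproduced here.

```python
from collections import defaultdict

# Hypothetical table keyed by OrderID; CustomerID is a non-key attribute.
table = {
    f"order-{i}": {"OrderID": f"order-{i}", "CustomerID": f"cust-{i % 100}"}
    for i in range(10_000)
}


def scan_by_customer(customer_id):
    """Full scan: every item is examined, and matches are filtered afterwards."""
    examined = 0
    matches = []
    for item in table.values():
        examined += 1
        if item["CustomerID"] == customer_id:
            matches.append(item)
    return matches, examined


# The "index": a second structure keyed by CustomerID, kept alongside the table.
customer_index = defaultdict(list)
for item in table.values():
    customer_index[item["CustomerID"]].append(item)


def query_index(customer_id):
    """Index query: only the items stored under the requested key are read."""
    matches = customer_index.get(customer_id, [])
    return matches, len(matches)


scan_matches, scan_examined = scan_by_customer("cust-42")
index_matches, index_examined = query_index("cust-42")
```

Both calls return the same items, but the scan touches every item in the table while the index lookup touches only the matches, which is why adding capacity does not fix the underlying problem.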

59
Multi-Selecthard

A company uses AWS KMS to encrypt data in multiple services. They want to ensure that only specific IAM roles can decrypt data using a particular KMS key. Which THREE steps are necessary?

Select 3 answers
A.Attach an IAM policy to each role with kms:Decrypt permission
B.Enable IAM policies in the key policy
C.Enable automatic key rotation
D.Ensure the key policy allows kms:GenerateDataKey for the roles
E.Add a statement to the KMS key policy allowing kms:Decrypt for the IAM roles
AnswersA, D, E

The roles need an IAM policy that allows decrypt actions.

Why this answer

The KMS key policy must grant kms:Decrypt to the specific IAM roles (E), and each role must have an IAM policy allowing kms:Decrypt (A). The listed answer also includes granting kms:GenerateDataKey in the key policy (D), which the roles need when they encrypt new data under the key.

Option B is wrong because no separate step is needed to enable IAM policies: the default key policy already contains the statement that delegates access to IAM policies in the account. Option C is wrong because key rotation is not an access control.
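As a rough sketch, the two policy pieces described above might look like the following when built as Python dictionaries. The account ID, role names, and key ARN are placeholders invented for the example.

```python
import json

ACCOUNT_ID = "111122223333"                # placeholder account ID
ROLE_NAMES = ["AnalyticsRole", "EtlRole"]  # hypothetical role names
KEY_ARN = f"arn:aws:kms:us-east-1:{ACCOUNT_ID}:key/EXAMPLE-KEY-ID"  # placeholder


def key_policy_statement(account_id, role_names):
    """Key policy statement naming only the roles that may use the key."""
    return {
        "Sid": "AllowUseForSpecificRoles",
        "Effect": "Allow",
        "Principal": {
            "AWS": [f"arn:aws:iam::{account_id}:role/{name}" for name in role_names]
        },
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": "*",  # in a key policy, "*" refers to the key the policy is attached to
    }


def role_identity_policy(key_arn):
    """IAM policy attached to each role, scoped to the one key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "kms:Decrypt", "Resource": key_arn}
        ],
    }


statement = key_policy_statement(ACCOUNT_ID, ROLE_NAMES)
identity_policy = role_identity_policy(KEY_ARN)
print(json.dumps(statement, indent=2))
```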

60
Multi-Selecthard

A company runs a production Amazon RDS for PostgreSQL database. The database is experiencing performance degradation due to a high number of concurrent read queries. The data engineer needs to improve read performance without significantly increasing costs. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers
A.Create one or more read replicas in the same region.
B.Enable Multi-AZ deployment for automatic failover.
C.Increase the allocated storage size to improve IOPS.
D.Enable Performance Insights to identify slow queries.
E.Delete unnecessary indexes to reduce write overhead.
AnswersA, D

Read replicas handle read traffic, reducing load on the primary.

Why this answer

Option A is correct because read replicas offload read traffic from the primary instance. Option D is correct because Performance Insights helps identify the queries causing the bottleneck. Option B is incorrect because Multi-AZ provides failover and does not improve read performance.

Option C is incorrect because increasing storage size may not help with read performance. Option E is incorrect because deleting indexes reduces write overhead, not read load, and can slow down queries that relied on those indexes.

61
MCQhard

A data engineer is setting up an Amazon S3 bucket for storing sensitive financial data. The compliance team requires that all data be encrypted at rest using a customer-managed AWS KMS key. Additionally, the bucket must block public access. Which combination of settings should the engineer configure?

A.Enable default encryption with AWS-KMS and a customer managed key. Enable block public access settings.
B.Use S3 Object Ownership to enforce bucket owner enforced. Enable block public access.
C.Use S3 Bucket Keys to reduce KMS costs. Enable block public access.
D.Create a bucket policy that denies PutObject without encryption. Enable block public access.
AnswerA

This ensures all objects are encrypted with the specified KMS key and public access is blocked.

Why this answer

Option A is correct because enabling default encryption with AWS-KMS using a customer-managed key ensures that all objects uploaded to the S3 bucket are automatically encrypted at rest with the required key type. Additionally, enabling block public access settings prevents any public access to the bucket, satisfying the compliance team's requirements.

Exam trap

The trap here is that candidates may think a bucket policy denying unencrypted uploads is sufficient, but it does not enforce the use of a customer-managed KMS key, nor does it automatically encrypt objects that lack encryption headers.

How to eliminate wrong answers

Option B is wrong because S3 Object Ownership with bucket owner enforced controls object ownership and ACLs, but does not enforce encryption at rest with a customer-managed KMS key. Option C is wrong because S3 Bucket Keys reduce KMS request costs by using a bucket-level key, but they do not enforce encryption with a customer-managed key or block public access. Option D is wrong because a bucket policy that denies PutObject without encryption can enforce encryption, but it does not guarantee that the encryption uses a customer-managed KMS key; it could allow SSE-S3 or SSE-KMS with an AWS-managed key, and it does not configure default encryption.
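For reference, the two settings in option A correspond to two bucket-level configurations. The sketch below builds them as plain dictionaries in the shape accepted by the boto3 S3 client calls `put_bucket_encryption` and `put_public_access_block`; the key ARN is a placeholder, and no AWS call is made here.

```python
# Placeholder ARN for the customer managed key.
CMK_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

# Default encryption: new objects are encrypted with SSE-KMS under the named key.
encryption_configuration = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": CMK_ARN,
            }
        }
    ]
}

# Block Public Access: all four switches turned on.
public_access_block = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}
```

With boto3 these would be passed as `ServerSideEncryptionConfiguration` and `PublicAccessBlockConfiguration` respectively, together with the bucket name.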

62
MCQmedium

A data engineer is troubleshooting an issue where an AWS Glue ETL job fails when trying to read data from an S3 bucket encrypted with SSE-KMS. The job has an IAM role that includes `kms:Decrypt` permission. What is the most likely reason for the failure?

A.The IAM role does not have s3:GetObject permission
B.The KMS key policy does not allow the Glue job to use the key
C.The S3 bucket is in a different AWS region than the Glue job
D.The Glue job is not configured to use the KMS key for decryption
AnswerD

Glue needs the key ID or alias in the job parameters.

Why this answer

Option D is the listed answer: the explanation given is that Glue needs the KMS key to be specified in the job parameters or the connection. Option A is wrong because the Glue role likely has the permission. Option C is wrong because the bucket may be in the same region.

Option B is ruled out on the same grounds: the stated cause is that the key must be specified explicitly in the job configuration.

63
MCQmedium

A company has an Amazon RDS for PostgreSQL DB instance with a large table that is frequently updated. The data engineer needs to reduce storage costs by archiving old records that are no longer accessed. The archived records must be retained for 7 years due to compliance requirements. Which solution is MOST cost-effective?

A.Use RDS native backup and restore to keep a separate backup.
B.Export old records using pg_dump and store in S3 Glacier Deep Archive.
C.Enable storage autoscaling on the RDS instance.
D.Move old records to a separate table in the same RDS instance.
AnswerB

This offloads old data to low-cost archival storage.

Why this answer

Using pg_dump to export old records and store them in S3 Glacier Deep Archive is cost-effective because Glacier Deep Archive is the lowest-cost storage for long-term archival. Option C is wrong because enabling storage autoscaling doesn't archive data. Option D is wrong because archiving within RDS still incurs storage costs.

Option A is wrong because a backup copies the whole instance rather than removing old records, so the table's storage does not shrink, and backup storage costs more than Glacier Deep Archive for long-term retention.

64
MCQmedium

A gaming company ingests player event data from mobile games into Amazon Kinesis Data Streams. Each event is a small JSON payload (<1 KB). The data must be delivered to Amazon S3 for analytics, and the company wants to minimize storage costs by aggregating events into larger files (e.g., 100 MB per file). The current setup uses Kinesis Data Firehose with a buffer size of 10 MB and a buffer interval of 60 seconds, but the resulting files are very small (average 5 MB) because the data volume is low. The engineer needs to ensure that files are at least 100 MB to reduce the number of S3 objects and lower costs. What should the engineer do?

A.Use an AWS Glue streaming ETL job with a 100 MB file size threshold to write to S3.
B.Use an AWS Lambda function to buffer events in memory and write to S3 when buffer reaches 100 MB.
C.Increase the buffer size in Kinesis Data Firehose to 100 MB and increase the buffer interval to 300 seconds.
D.Use Amazon EMR with Spark Streaming to aggregate and write larger files to S3.
AnswerC

This allows Firehose to accumulate data until the buffer size or interval is reached, producing larger files.

Why this answer

Option C is correct: increase the buffer size to 100 MB and the buffer interval to 300 seconds. Firehose flushes when either limit is reached first, so raising both lets more data accumulate before each write. Option B (Lambda) would not aggregate efficiently. Option A (Glue) adds latency and cost.

Option D (EMR) is overkill.
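The effect of the two buffer settings can be estimated with simple arithmetic. The sketch below is a simplified model that assumes a constant ingest rate and ignores compression and format conversion; the function name is made up for the example, and the rate is derived from the 5 MB average per 60-second interval stated in the question.

```python
def estimated_object_mb(ingest_mb_per_s, buffer_size_mb, buffer_interval_s):
    """Firehose flushes when either the size or the interval limit is hit first."""
    return min(buffer_size_mb, ingest_mb_per_s * buffer_interval_s)


# About 5 MB arrives per 60-second interval in the current setup.
rate_mb_per_s = 5 / 60

current = estimated_object_mb(rate_mb_per_s, buffer_size_mb=10, buffer_interval_s=60)
proposed = estimated_object_mb(rate_mb_per_s, buffer_size_mb=100, buffer_interval_s=300)
```

Under this model the current settings give roughly 5 MB objects and the proposed settings roughly 25 MB, because the interval limit is still reached before the size limit. Reaching 100 MB at this data rate would therefore also depend on a longer interval or higher volume.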

65
MCQeasy

Refer to the exhibit. A data engineer runs this CLI command to check an object's metadata. The engineer wants to verify if the object is eligible for lifecycle transition to S3 Glacier based on its age. What additional information is needed?

A.The current date
B.The ETag value
C.The metadata archive flag
D.The ContentLength value
AnswerA

The object age is based on LastModified and current date.

Why this answer

Option A is correct because the command output already shows when the object was last modified, and lifecycle policies are based on object age. The engineer needs the current date to calculate the age. Option D is wrong because ContentLength is not relevant.

Option B is wrong because ETag is for integrity. Option C is wrong because metadata is shown but not needed for age calculation.
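The age calculation itself is a date subtraction. In the sketch below the LastModified value, the "current" date, and the 30-day threshold are all hypothetical, and the comparison is a simplification of how a lifecycle rule is evaluated.

```python
from datetime import datetime, timezone


def object_age_days(last_modified, now):
    """Whole days elapsed between the object's LastModified time and now."""
    return (now - last_modified).days


# Hypothetical values: LastModified as reported for the object, plus the current date.
last_modified = datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc)
now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)

TRANSITION_AFTER_DAYS = 30  # hypothetical lifecycle rule threshold

age_days = object_age_days(last_modified, now)
eligible = age_days >= TRANSITION_AFTER_DAYS
```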

66
MCQeasy

A company wants to use AWS Lake Formation to manage permissions on a data lake. What is the primary benefit of using Lake Formation for data security?

A.Automatically encrypts data at rest and in transit.
B.Replaces IAM for all data access policies.
C.Provides a unified view of data across all AWS regions.
D.Centralized fine-grained access control to data in S3, Redshift, and RDS.
AnswerD

Lake Formation provides column and row-level security.

Why this answer

Option D is correct because Lake Formation provides centralized fine-grained access control at the table, column, and row level. Option B is wrong because Lake Formation does not replace IAM. Option A is wrong because it does not encrypt data.

Option C is wrong because Lake Formation's security benefit is permission management, not a cross-Region view of data.

67
Multi-Selecteasy

A financial services company is ingesting real-time stock trade data into Amazon Kinesis Data Streams. The data is then processed by a Kinesis Data Analytics application for fraud detection. The company must ensure that the data is processed in the correct order for each stock symbol. Which TWO configuration steps should be taken? (Choose two.)

Select 2 answers
A.Use a random partition key to distribute the load evenly.
B.Increase the number of shards to reduce latency.
C.Use the stock symbol as the partition key when putting records into the stream.
D.Use AWS Lambda with DynamoDB Streams instead of Kinesis Data Analytics.
E.Configure the Kinesis Client Library (KCL) to process records in the order they arrive in each shard.
AnswersC, E

Partition key determines shard assignment; same symbol goes to same shard, preserving order.

Why this answer

C is correct because using the stock symbol as the partition key ensures records with the same symbol go to the same shard, preserving order. E is correct because the Kinesis Client Library (KCL) reads each shard in sequence-number order, so records are processed in order within a shard. B is wrong because increasing shards does not guarantee ordering.

A is wrong because using a random partition key would distribute records for one symbol across shards, breaking order. D is wrong because DynamoDB Streams is for CDC, not for streaming order.
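Why a fixed partition key preserves per-symbol order can be seen from how keys map to shards: Kinesis hashes the partition key with MD5 into a 128-bit value and routes the record to the shard whose hash key range contains it. The sketch below assumes the hash space is split evenly across shards, which is a simplification; `shard_for` is a made-up helper, not an SDK function.

```python
import hashlib


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")  # 128-bit integer
    range_size = 2**128 // shard_count
    return min(hash_key // range_size, shard_count - 1)


# Every trade for the same symbol lands on the same shard, in arrival order.
trades = ["AMZN", "MSFT", "AMZN", "GOOG", "AMZN"]
assignments = [(symbol, shard_for(symbol, shard_count=4)) for symbol in trades]
amzn_shards = {shard for symbol, shard in assignments if symbol == "AMZN"}
```

A random partition key would give each record its own hash value, so trades for one symbol could end up on different shards with no ordering guarantee between them.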

68
MCQhard

A company runs a streaming application on Amazon EC2 instances that writes data to an Amazon DynamoDB table (us-east-1). The data is later consumed by a reporting job that runs every hour. Recently, the reporting job has been failing with ProvisionedThroughputExceededException errors during peak hours. The DynamoDB table uses provisioned capacity with 1000 read capacity units (RCU) and 500 write capacity units (WCU). The reporting job performs scans and reads using eventually consistent reads. The application's write traffic is steady, but the reporting job's reads spike at the top of the hour. The data engineer needs to resolve the throughput exceptions without affecting the application's writes. Which solution should the data engineer implement?

A.Create a global secondary index (GSI) with enough RCU for the reporting job and have the job query the index instead of scanning the table.
B.Increase the table's RCU to 2000.
C.Create a read replica of the DynamoDB table in a different region.
D.Switch the table to on-demand capacity mode.
AnswerA

Using a GSI with dedicated capacity can isolate the reporting workload and avoid throttling.

Why this answer

A global secondary index (GSI) has its own provisioned read and write capacity, separate from the base table's. Creating a GSI with enough read capacity for the reporting job, and having the job query that index instead of scanning the table, moves the hourly read spike off the table's 1000 RCU.

Option A is correct. Option B: increasing the table's RCU to 2000 keeps the reporting reads on the table, so a larger spike can still be throttled. Option D: switching to on-demand eliminates most throttling but may increase costs and does not guarantee performance.

Option C: adding a read replica is not a feature for DynamoDB (only for RDS).
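To put the 1000 RCU figure in perspective, the read-capacity arithmetic can be sketched as follows. One RCU covers one strongly consistent read, or two eventually consistent reads, of up to 4 KB per second. The helper names are invented for the example, and the model ignores per-partition limits and burst capacity.

```python
import math


def rcus_for_read(item_size_bytes, eventually_consistent=True):
    """RCUs consumed by reading one item of the given size."""
    units = math.ceil(item_size_bytes / 4096)  # reads are billed in 4 KB steps
    return units / 2 if eventually_consistent else units


def max_read_kb_per_second(provisioned_rcu, eventually_consistent=True):
    """Upper bound on data read per second at the provisioned capacity."""
    reads_per_rcu = 2 if eventually_consistent else 1
    return provisioned_rcu * reads_per_rcu * 4


table_limit_kb = max_read_kb_per_second(1000)  # eventually consistent reads
small_item_cost = rcus_for_read(4096)
```

Under these assumptions the table tops out at 8000 KB per second of eventually consistent reads, which a scan-heavy job can exhaust quickly at the top of the hour.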

69
MCQmedium

A company uses Amazon DynamoDB as a data store for a real-time dashboard application. The application performs point lookups and range queries on a table that has a partition key and sort key. The table uses on-demand capacity mode. Recently, the application's response time has increased, and CloudWatch metrics show high 'ThrottledRequests' for the table. The application uses the AWS SDK with default retry settings. The data access pattern is read-heavy with occasional spikes. What is the most effective way to reduce throttling?

A.Switch the table to provisioned capacity and set the read capacity units to a high value.
B.Enable DynamoDB Accelerator (DAX) to cache frequently read items.
C.Increase the read capacity units to a higher value.
D.Implement exponential backoff with jitter in the application code.
AnswerD

Retries with backoff reduce the rate of requests during throttling, allowing the table to recover.

Why this answer

Option D is correct because DynamoDB on-demand mode accommodates traffic spikes but can throttle if the spike exceeds the previous peak by more than double. Implementing exponential backoff with jitter in the application allows retries to succeed without overwhelming the table. Option A is wrong because switching to provisioned capacity would require predicting the peak and may still throttle if the spike is higher.

Option B is wrong because DAX (DynamoDB Accelerator) is an in-memory cache that reduces read load but does not eliminate throttling on requests that miss the cache; it also adds cost and complexity. Option C is wrong because increasing read capacity units only applies to provisioned mode.
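A minimal sketch of exponential backoff with full jitter is shown below. `ThrottledError`, `call_with_backoff`, and `flaky_read` are stand-ins invented for the example rather than SDK names, and the delay values are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error returned by the data store."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Retry a throttled operation, sleeping a random time up to a doubling cap."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)


# Demo: an operation that is throttled twice and then succeeds.
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("throttled")
    return "item"


delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_read, sleep=delays.append)
```

The jitter matters because many clients throttled at the same moment would otherwise retry in lockstep and hit the table together again.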

70
MCQhard

A data pipeline uses AWS Glue to read from an Amazon S3 bucket containing millions of small CSV files (each < 1 MB). The ETL job is slow. Which optimization would most improve performance?

A.Write the ETL script using PySpark instead of Scala
B.Increase the number of Glue workers
C.Use the G.1X worker type for more memory
D.Use S3 file grouping to combine small files
AnswerD

Grouping small files reduces the number of partitions and improves Spark performance.

Why this answer

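Option D is correct: S3 file grouping lets Glue read many small files together as one group, which cuts the number of partitions and the per-file overhead. Converting the data to a columnar format like Parquet also reduces the number of files and improves read performance. Option B (more workers) helps, but file consolidation is more impactful. Option C (G.1X worker type) may help, but grouping files is key.

Option A is wrong because the choice between PySpark and Scala does not address the small files problem.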
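File grouping is switched on through the S3 connection options of the Glue read. The sketch below only builds the options dictionary and estimates the effect; the bucket path and file counts are hypothetical, and the 128 MB group size is an arbitrary choice for illustration.

```python
import math

GROUP_SIZE_BYTES = 128 * 1024 * 1024  # target size per group, chosen for illustration

# Options in the shape passed as connection_options when reading from S3 in Glue.
connection_options = {
    "paths": ["s3://example-bucket/raw/csv/"],  # hypothetical location
    "recurse": True,
    "groupFiles": "inPartition",         # group files within each S3 partition
    "groupSize": str(GROUP_SIZE_BYTES),  # passed as a string of bytes
}


def estimated_groups(file_count, avg_file_bytes, group_size_bytes):
    """Rough number of read groups once small files are combined."""
    return math.ceil(file_count * avg_file_bytes / group_size_bytes)


# Hypothetical workload: two million files of about 512 KB each.
groups = estimated_groups(2_000_000, 512 * 1024, GROUP_SIZE_BYTES)
```

In this hypothetical case two million per-file reads collapse to a few thousand groups, which is where most of the speedup comes from.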

71
MCQeasy

A company needs to ingest data from an on-premises Hadoop cluster into Amazon S3 for archival and analysis. The total data volume is 50 TB. The migration must be completed within one week. The on-premises network has a 1 Gbps connection to AWS. Which AWS service should be used?

A.AWS Transfer Family
B.AWS Snowball Edge
C.AWS Glue
D.AWS DataSync
AnswerB

Snowball can physically ship data for large transfers.

Why this answer

Option B is correct because AWS Snowball Edge is a physical device that moves large amounts of data without depending on the network link. Option D (DataSync) uses the network: 50 TB over a 1 Gbps connection takes roughly 4.6 days at full line rate, which leaves little margin within a week if the link is shared or sustained throughput is lower. Option C (Glue) is for ETL, not raw transfer.

Option A (Transfer Family) is for SFTP.
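The bandwidth reasoning can be checked with a short calculation. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and ignores protocol overhead; the 50% utilization figure is a hypothetical for a link shared with other traffic.

```python
def transfer_days(data_tb, link_gbps, utilization=1.0):
    """Days needed to move data_tb over a link at the given average utilization."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400


full_rate_days = transfer_days(50, 1)                   # link fully dedicated
half_rate_days = transfer_days(50, 1, utilization=0.5)  # hypothetical shared link
```

At full line rate the transfer takes about 4.6 days; at half of that it takes more than 9 days and misses the one-week deadline.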

72
MCQhard

A company is using AWS Glue to process data stored in an S3 bucket that is encrypted with SSE-KMS. The Glue job fails with an 'Access Denied' error when trying to read the data. The IAM role used by the Glue job has permissions to read from the S3 bucket and to use the KMS key. What is the most likely cause of the failure?

A.The S3 bucket is using SSE-S3 instead of SSE-KMS
B.The S3 bucket policy denies access to the Glue job
C.The KMS key is in a different AWS account
D.The IAM role is missing the kms:Decrypt permission
AnswerD

Glue jobs need kms:Decrypt to read SSE-KMS encrypted data.

Why this answer

Option D is correct because AWS Glue jobs need the kms:Decrypt permission on the KMS key to read data encrypted with SSE-KMS. The IAM role may have the S3 read permissions but lack the kms:Decrypt permission. Option A (SSE-S3) is not relevant as the bucket uses SSE-KMS.

Option B (bucket policy) could also be a cause, but the most common issue is missing kms:Decrypt. Option C (cross-account access) is less likely if the key is in the same account.

73
Multi-Selecthard

A data engineer is implementing a CDC (Change Data Capture) pipeline from a relational database to Amazon S3 using AWS Database Migration Service (DMS). Which TWO configurations are required for continuous replication?

Select 2 answers
A.Define transformation rules in the DMS task.
B.Enable binary logging on the source database.
C.Configure a VPC endpoint for DMS.
D.Enable 'Full load' and 'Ongoing replication' in the task.
E.Pre-create the target table in S3.
AnswersB, D

Binary logs are needed to capture changes for CDC.

Why this answer

Options B and D are correct. For ongoing replication, the DMS task must be configured for full load plus ongoing replication (D), and the source database must have continuous change capture enabled (e.g., binary logs for MySQL) (B). Option E is wrong because nothing has to be pre-created in S3; DMS writes the target files itself.

Option C is wrong because a VPC endpoint concerns network connectivity and is not what enables change capture. Option A is wrong because transformation rules are optional.
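On the DMS side, the "full load plus ongoing replication" choice is a single task setting. The sketch below builds the relevant values as plain Python data in the shape used when creating a replication task with boto3; the task identifier and the catch-all selection rule are hypothetical, and no AWS call is made.

```python
import json

# Selection rule that includes every table in every schema (hypothetical scope).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

task_parameters = {
    "ReplicationTaskIdentifier": "orders-to-s3",  # hypothetical name
    "MigrationType": "full-load-and-cdc",         # full load, then ongoing replication
    "TableMappings": json.dumps(table_mappings),  # passed as a JSON string
}
```

The source, target, and replication instance ARNs would be supplied alongside these values; change capture still has to be enabled on the source database itself.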

74
MCQmedium

Refer to the exhibit. A Glue ETL job failed. What is the most likely cause?

A.Some source files have a different schema than others.
B.The job bookmarks are misconfigured.
C.The source data has inconsistent partitioning.
D.The job ran out of memory due to insufficient DPUs.
AnswerA

Directly matches the error.

Why this answer

Option A is correct because the error 'Cannot merge incompatible schemas' indicates differing schemas in the source data. Option D is wrong because insufficient DPUs cause memory errors, not schema conflicts. Option C is wrong because partitioning affects how data is laid out and filtered, not whether schemas can be merged.

Option B is wrong because bookmarks track processed files, not schema issues.

75
MCQeasy

A data engineer needs to ensure that an Amazon Redshift cluster encrypts all data at rest. Which setting must be enabled when creating the cluster?

A.Enable automated snapshots
B.Enable encryption
C.Enable SSL/TLS
D.Enable VPC
AnswerB

Redshift supports encryption at rest.

Why this answer

Option B is correct. Redshift cluster encryption is enabled at creation. Option C is wrong because SSL/TLS is for in-transit encryption.

Option D is wrong because VPC is network isolation. Option A is wrong because automated snapshots are for backup.
