Knowledge + Practice

AWS Certified Data Engineer Associate (DEA-C01): Questions 1651–1725

1786 questions total · 24 pages · All types, answers revealed

1651

MCQmedium

A company has a Kinesis Data Firehose delivery stream that receives JSON data from IoT devices. The data is delivered to an S3 bucket. The company notices that the data in S3 is delayed by up to 30 minutes. The Firehose stream is configured with a buffer size of 1 MB and a buffer interval of 60 seconds. The incoming data rate is approximately 100 KB per second. The company needs to reduce the delivery latency to under 5 minutes. Which action should the company take?

A.Enable Lambda transformation to process data faster.

B.Increase the buffer interval to 300 seconds.

C.Change the compression format from GZIP to Snappy.

D.Decrease the buffer size to 256 KB.

AnswerD

Smaller buffer size causes more frequent deliveries, reducing latency.

Why this answer

Option D is correct because Firehose delivers a batch when either the buffer size or the buffer interval is reached, whichever comes first, so a smaller buffer size triggers more frequent deliveries. Of the four options, it is the only one that makes Firehose flush sooner.

Note that the stated settings do not fully explain the symptom. At 100 KB per second, a 1 MB buffer fills in about 10 seconds, and the 60-second buffer interval caps the wait in any case, so buffering alone should add at most about a minute. A delay of up to 30 minutes suggests another factor, such as a backlog.

Option A is wrong because a Lambda transformation adds processing time. Option B is wrong because increasing the buffer interval would increase latency. Option C is wrong because changing the compression format does not affect delivery latency.
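The buffering arithmetic for this question can be sketched in a few lines of Python. This is a simplified model, not Firehose's actual implementation: it assumes a steady incoming rate, treats 1 MB as 1024 KB, and the function name `seconds_until_flush` is made up for illustration.

```python
def seconds_until_flush(buffer_size_kb: float,
                        buffer_interval_s: float,
                        rate_kb_per_s: float) -> float:
    """Estimate the wait before a flush: whichever threshold is hit first.

    Simplified model (assumption): steady incoming rate, no retries or backlog.
    """
    if rate_kb_per_s <= 0:
        # With no incoming data, only the interval can trigger a flush.
        return buffer_interval_s
    time_to_fill = buffer_size_kb / rate_kb_per_s
    return min(time_to_fill, buffer_interval_s)


# Numbers from the question: 1 MB buffer, 60 s interval, about 100 KB/s.
current = seconds_until_flush(1024, 60, 100)
# Option D: buffer size reduced to 256 KB.
smaller = seconds_until_flush(256, 60, 100)

print(f"current settings: {current:.2f} s, smaller buffer: {smaller:.2f} s")
```

Under these assumptions both settings flush well inside the 5-minute target, which is why the 30-minute delay in the scenario points to something other than the buffering hints.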

1652

MCQmedium

Refer to the exhibit. A data engineer needs to connect to the Redshift cluster from an EC2 instance in the same VPC. The engineer can ping the EC2 instance but cannot connect to Redshift using the endpoint address and port 5439. What is the most likely cause?

A.The security group for the Redshift cluster does not allow inbound traffic on port 5439 from the EC2 instance.

B.The Redshift cluster is in a different VPC.

C.The Redshift cluster is not in an available state.

D.The Redshift cluster is publicly accessible and requires an internet gateway.

AnswerA

Security group rules must permit the connection.

Why this answer

The most likely cause is that the security group associated with the Redshift cluster does not have an inbound rule allowing TCP traffic on port 5439 from the security group or IP address of the EC2 instance. Since the engineer can ping the EC2 instance (ICMP works), but cannot connect to Redshift on port 5439, this points to a firewall or security group rule blocking the specific port, not a network reachability issue.

Exam trap

AWS often tests the distinction between ICMP reachability (ping) and TCP port-level connectivity, leading candidates to overlook security group rules when they see successful ping results.

How to eliminate wrong answers

Option B is wrong because the question states that the EC2 instance and the Redshift cluster are in the same VPC; a cluster in a different VPC would require VPC peering or a transit gateway. Option C is wrong because the question gives no indication of a cluster status problem. Option D is wrong because the Redshift cluster is in the same VPC as the EC2 instance, so public accessibility and an internet gateway are not required; traffic stays within the VPC and uses private IPs.
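The difference between ICMP reachability and TCP port-level connectivity can be checked directly. The sketch below uses only the Python standard library; the helper name `tcp_port_open` is hypothetical, and the endpoint shown in the usage comment is a placeholder, not a real cluster address.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage from the EC2 instance (placeholder endpoint):
# tcp_port_open("my-cluster.example.internal", 5439)
```

A successful ping combined with a `False` result here is consistent with a security group that does not allow inbound TCP on port 5439, although a timeout alone does not prove which control is blocking the traffic.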

1653

Multi-Selecthard

A data engineer is building a pipeline to ingest data from an on-premises Oracle database into Amazon S3. The pipeline must capture change data (CDC) in near real-time and handle schema changes. Which TWO AWS services should the engineer use?

Select 2 answers

A.AWS Glue Schema Registry

B.AWS Snowball Edge

C.Amazon AppFlow

D.Amazon Kinesis Data Streams with Kinesis Agent

E.AWS Database Migration Service (DMS) with CDC

AnswersA, E

Manages schema evolution for streaming data.

Why this answer

Options A and E are correct. AWS DMS with CDC captures ongoing changes from the source database, and AWS Glue Schema Registry handles schema evolution. Option B (Snowball Edge) is a batch transfer option. Option C (AppFlow) is for SaaS sources, not on-premises databases. Option D (Kinesis Data Streams with Kinesis Agent) is not designed for database CDC.

1654

Multi-Selecteasy

A company is using AWS Glue to catalog data in S3. The security team wants to ensure that only authorized users can access the Glue Data Catalog and that data lineage is tracked. Which AWS services can be used together to meet these requirements? (Choose TWO.)

Select 2 answers

A.AWS CloudTrail

B.AWS Glue DataBrew

C.Amazon Athena

D.AWS Lake Formation

E.Amazon Kinesis

AnswersB, D

Provides data lineage tracking.

Why this answer

Options B and D are correct. AWS Lake Formation provides fine-grained access control for the Data Catalog. AWS Glue DataBrew provides data lineage visualization. Option A is wrong because CloudTrail logs API calls but does not manage permissions. Option C is wrong because Athena is a query service, not a tool for access control or lineage. Option E is wrong because Kinesis is for streaming.

1655

Multi-Selecthard

A company is using AWS Lake Formation to manage a data lake. The data engineer needs to set up fine-grained access control so that users can only see specific columns in a table based on their IAM role. Which THREE steps should the data engineer take?

Select 3 answers

A.Ensure that the users query the table through a service integrated with Lake Formation, such as Athena.

B.Grant the IAM role SELECT permission on the table with column-level restrictions.

C.Create a view in the Data Catalog that exposes only the required columns.

D.Define a Lake Formation data permissions policy that includes column-level filtering.

E.Attach an S3 bucket policy to restrict access to the underlying data.

AnswersA, B, D

Lake Formation enforces permissions when queries are run through integrated services.

Why this answer

Options A, B, and D are correct. Column-level access in Lake Formation requires defining the data permissions policy in Lake Formation, granting SELECT with column restrictions to the IAM role, and having users query through a Lake Formation-integrated service such as Athena. Option C is wrong because a Data Catalog view is a separate object rather than a column-level permission on the table itself. Option E is wrong because S3 bucket policies apply to objects and are not used for column-level control.
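A column-restricted SELECT grant can be sketched as the request that would be passed to the Lake Formation `grant_permissions` API through boto3. The role ARN, database, table, and column names below are hypothetical, and the helper `column_select_grant` is illustrative only; the block builds the request without calling AWS.

```python
def column_select_grant(role_arn, database, table, columns):
    """Build keyword arguments for a column-restricted SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become visible to the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical role, database, table, and columns (assumptions for illustration).
request = column_select_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",
    "sales_db",
    "orders",
    ["order_id", "order_date"],
)

# With credentials configured, the grant would be sent as:
#   boto3.client("lakeformation").grant_permissions(**request)
print(request["Resource"]["TableWithColumns"]["ColumnNames"])
```

The restriction only takes effect for queries that go through a Lake Formation-integrated service, which is why option A is part of the answer.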

1656

Multi-Selecteasy

A company is evaluating Amazon DynamoDB for a new application. The application requires single-digit millisecond latency for read and write operations. Which TWO DynamoDB features should the company enable to achieve this? (Choose TWO.)

Select 2 answers

A.Use DAX with Write-Through caching.

B.Enable DynamoDB Global Tables.

C.Enable DynamoDB Streams.

D.Enable DynamoDB Accelerator (DAX).

E.Enable auto-scaling for read and write capacity.

AnswersA, D

Why this answer

Option A is correct because DAX with Write-Through caching ensures that every write to DynamoDB is also written to the DAX cache, so subsequent reads of the same item are served from the in-memory cache with single-digit millisecond latency. Option D is correct because DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that reduces read response times from single-digit milliseconds to microseconds, directly meeting the latency requirement for read operations.

Exam trap

The trap here is that candidates often confuse DynamoDB Accelerator (DAX) with DynamoDB Global Tables, assuming that multi-region replication improves local latency, when in fact DAX is the only listed feature that provides an in-memory cache, serving cached reads within a Region in microseconds.

1657

Multi-Selecteasy

Which TWO AWS services can be used to ingest data from an on-premise relational database into Amazon S3 on a one-time basis?

Select 2 answers

A.AWS Data Pipeline

B.AWS Database Migration Service (DMS)

C.AWS Glue

D.Amazon Simple Queue Service (SQS)

E.Amazon Kinesis Data Streams

AnswersB, C

DMS supports full load migrations to S3.

Why this answer

Options B and C are correct. AWS DMS can perform a one-time full load from a database to S3. AWS Glue can connect to JDBC sources and write to S3. Option A is wrong because Data Pipeline is a legacy service. Option D is wrong because SQS is a message queue. Option E is wrong because Kinesis Data Streams is for streaming.

1658

MCQmedium

A data pipeline uses Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data volume spikes occasionally, causing the Firehose buffer to fill up and leading to increased delivery latency. The latency must remain under 60 seconds. What should be done to minimize latency?

A.Enable GZIP compression on the Firehose delivery stream.

B.Increase the buffer size to 128 MB to accommodate larger batches.

C.Switch to Kinesis Data Streams with a Lambda consumer.

D.Reduce the buffer interval to 60 seconds.

AnswerD

This forces delivery every 60 seconds, meeting the latency requirement.

Why this answer

Reducing the buffer interval to 60 seconds ensures that Firehose delivers data to S3 at most every 60 seconds, directly capping latency even if the buffer size is not full. This aligns with the requirement to keep latency under 60 seconds, as Firehose delivers data when either the buffer interval or buffer size threshold is met first.

Exam trap

AWS often tests the misconception that increasing buffer size or enabling compression reduces latency, when in fact a larger buffer lets more data accumulate before delivery and compression does not change how often Firehose delivers.

How to eliminate wrong answers

Option A is wrong because enabling GZIP compression reduces data size but does not affect the buffer interval or delivery frequency; it may even increase latency due to compression overhead. Option B is wrong because increasing the buffer size to 128 MB would allow more data to accumulate before delivery, which would increase latency during spikes, not decrease it. Option C is wrong because switching to Kinesis Data Streams with a Lambda consumer introduces additional complexity and potential for increased latency due to Lambda invocation overhead and scaling limitations, and does not directly guarantee sub-60-second delivery to S3.

1659

Multi-Selectmedium

Which TWO actions can improve query performance on an Amazon Redshift cluster? (Choose two.)

Select 2 answers

A.Define appropriate sort keys

B.Increase the number of nodes

C.Use EVEN distribution style for all tables

D.Use columnar compression

E.Run VACUUM command regularly

AnswersA, D

Sort keys reduce the amount of data scanned.

Why this answer

Options A and D are correct because sort keys improve data organization, which reduces the data scanned, and columnar compression reduces I/O. Option B (increasing the number of nodes) adds cost. Option C (EVEN distribution for all tables) affects how data is distributed but does not directly improve query performance. Option E (VACUUM) reclaims space but does not directly improve query speed.

1660

MCQmedium

A data engineer notices that an AWS Glue job processing data from an Amazon S3 bucket frequently fails with 'OutOfMemoryError'. The job reads CSV files, applies transformations, and writes Parquet to another S3 bucket. The job has 10 workers of type G.1X. Which change is MOST likely to resolve the issue?

A.Change the worker type from G.1X to G.2X

B.Increase the number of workers to 20

C.Change the worker type from G.1X to G.8X

D.Enable the Spark UI to monitor memory and tune the job

AnswerA

G.2X provides 2x the memory of G.1X, directly addressing the OutOfMemoryError.

Why this answer

The G.1X worker type provides 16 GB of memory per worker. An OutOfMemoryError indicates that the job's memory requirements exceed this limit. Upgrading to G.2X doubles the memory per worker to 32 GB, directly addressing the memory shortage without changing the parallelism or incurring the overhead of additional workers.

Exam trap

The trap here is that candidates might think adding more workers (Option B) solves memory issues, but OutOfMemoryError is per-worker, not a cluster-wide shortage, so increasing parallelism does not fix the root cause.

How to eliminate wrong answers

Option B is wrong because increasing the number of workers to 20 does not increase the memory per worker; it only adds more workers, which can help with parallelism but not with per-worker memory exhaustion. Option C is wrong because G.8X provides 128 GB of memory per worker, which is excessive and likely unnecessary; the most cost-effective fix is G.2X. Option D is wrong because enabling the Spark UI only helps with monitoring and debugging, not with resolving the memory issue; it does not allocate additional memory.

1661

MCQeasy

A data engineer notices that an AWS Glue ETL job is failing intermittently with the error 'Connection refused'. The job reads from Amazon RDS for MySQL and writes to Amazon S3. What is the MOST likely cause?

A.The RDS instance has reached its maximum number of connections.

B.The security group for the RDS instance is not allowing inbound traffic from the Glue job's subnet.

C.The Glue job is using too many DPUs and hitting resource limits.

D.The IAM role associated with the Glue job lacks permissions to write to the S3 bucket.

AnswerB

The 'Connection refused' error typically occurs due to network or security group misconfiguration blocking access to the RDS instance.

Why this answer

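Option B is correct because the error indicates a network connectivity issue between the Glue job and the RDS database. Option A is incorrect because exhausting the connection limit typically produces a MySQL 'Too many connections' error rather than 'Connection refused'. Option C is incorrect because the error is about the connection, not resource limits. Option D is incorrect because the error is not about permissions, and it occurs on the database connection rather than on the S3 write.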

1662

MCQmedium

A company uses S3 as a data lake. They want to ingest on-premises relational database data daily with full-load snapshots. The data volume is 500 GB per day. The database is accessible over the internet. Which service should they use for this ingestion?

A.Kinesis Data Firehose

B.AWS Glue ETL job reading from JDBC

C.AWS Database Migration Service (DMS)

D.AWS Transfer Family

AnswerC

DMS supports full-load migration from on-premises databases to S3.

Why this answer

Option C is correct because AWS Database Migration Service (DMS) can perform full-load migrations from on-premises databases to S3. Option A is wrong because Kinesis Data Firehose is for streaming data, not database snapshots. Option B is wrong because Glue ETL can read from JDBC but is not optimized for large full-load snapshots. Option D is wrong because Transfer Family is for file transfers over SFTP, not database connections.

1663

MCQhard

A data engineer is designing a solution to securely store and rotate database credentials used by an application. The credentials should be automatically rotated every 90 days. Which AWS service should be used?

A.AWS Secrets Manager

B.AWS Systems Manager Parameter Store

C.AWS Key Management Service (KMS)

D.AWS Identity and Access Management (IAM)

AnswerA

Secrets Manager provides automatic rotation of secrets.

Why this answer

Option A is correct because AWS Secrets Manager can automatically rotate secrets, including database credentials. Option B (Parameter Store) can store secrets but does not have built-in automatic rotation. Option C (AWS KMS) manages encryption keys, not secrets. Option D (IAM) manages users and roles, not credential rotation.

1664

MCQmedium

A data engineer is troubleshooting an AWS Glue job that writes data to an S3 bucket. The IAM role attached to the Glue job has the policy shown in the exhibit. The job fails when writing to the 'secrets/' prefix but succeeds when writing to other prefixes. What is the reason for the failure?

A.The job does not have permission to write to the bucket at all.

B.The resource ARN in the Allow statement does not include the bucket itself.

C.The Deny statement is not effective because it is placed after the Allow.

D.The Deny statement explicitly denies PutObject to the secrets/ prefix.

AnswerD

Deny overrides Allow.

Why this answer

Option D is correct because the Deny statement explicitly denies s3:PutObject on the secrets/ prefix, and an explicit Deny overrides any Allow. Option A is wrong because the job can write to other prefixes. Option B is wrong because the resource is correctly specified for both statements. Option C is wrong because an explicit Deny takes effect regardless of statement order.
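The "explicit Deny overrides Allow" rule can be illustrated with a toy evaluator. The exhibit is not reproduced in the text, so the policy below is a hypothetical stand-in (bucket name included), and the `evaluate` function is a deliberately simplified sketch, not IAM's full evaluation logic.

```python
from fnmatch import fnmatchcase

# Hypothetical policy standing in for the exhibit (assumption).
POLICY = [
    {"Effect": "Allow", "Action": "s3:PutObject",
     "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Effect": "Deny", "Action": "s3:PutObject",
     "Resource": "arn:aws:s3:::my-bucket/secrets/*"},
]


def evaluate(policy, action, resource):
    """Simplified decision: explicit Deny wins, then Allow, else implicit deny."""
    matches = [s for s in policy
               if s["Action"] == action and fnmatchcase(resource, s["Resource"])]
    if any(s["Effect"] == "Deny" for s in matches):
        return "Deny"   # explicit deny overrides any matching allow
    if any(s["Effect"] == "Allow" for s in matches):
        return "Allow"
    return "Deny"       # nothing matched: implicit deny


secrets_write = evaluate(POLICY, "s3:PutObject",
                         "arn:aws:s3:::my-bucket/secrets/keys.parquet")
output_write = evaluate(POLICY, "s3:PutObject",
                        "arn:aws:s3:::my-bucket/output/part-0.parquet")
# Reversing the statement order does not change the outcome.
reordered = evaluate(list(reversed(POLICY)), "s3:PutObject",
                     "arn:aws:s3:::my-bucket/secrets/keys.parquet")

print(secrets_write, output_write, reordered)
```

The decision depends only on which statements match, not on where the Deny appears in the document, which is the point option C gets wrong.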

1665

Multi-Selecteasy

A data engineer is monitoring an Amazon RDS for PostgreSQL instance. The engineer wants to set up alerts for high CPU utilization and low free storage space. Which AWS services can be used together to achieve this? (Choose TWO.)

Select 2 answers

A.Amazon Simple Notification Service (SNS)

B.Amazon CloudWatch

C.AWS CloudTrail

D.AWS Config

E.Amazon Route 53

AnswersA, B

SNS delivers alarm notifications.

Why this answer

Amazon CloudWatch is the correct service because it can monitor RDS metrics such as CPUUtilization and FreeStorageSpace, and trigger alarms based on thresholds. Amazon SNS is correct because it can receive CloudWatch alarm notifications and deliver them via email, SMS, or other endpoints, enabling the data engineer to be alerted when high CPU or low storage conditions occur.

Exam trap

The trap here is that candidates often confuse AWS CloudTrail (audit logging) with CloudWatch (monitoring), or think AWS Config can monitor performance metrics instead of just configuration compliance.
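The CloudWatch alarm plus SNS notification pairing can be sketched as the requests that would be passed to CloudWatch's `put_metric_alarm` through boto3. The instance identifier, topic ARN, thresholds, and the helper name `rds_alarm` are all assumptions chosen for illustration; the block only builds the requests and does not call AWS.

```python
def rds_alarm(name, metric, threshold, comparison, db_instance_id, sns_topic_arn):
    """Build keyword arguments for a CloudWatch alarm on an RDS metric."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/RDS",
        "MetricName": metric,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "AlarmActions": [sns_topic_arn],  # SNS topic that delivers the alert
    }


# Hypothetical identifiers and thresholds (assumptions).
TOPIC = "arn:aws:sns:us-east-1:123456789012:rds-alerts"
alarms = [
    rds_alarm("rds-high-cpu", "CPUUtilization", 80.0,
              "GreaterThanThreshold", "my-postgres-db", TOPIC),
    # FreeStorageSpace is reported in bytes; alert below 10 GiB.
    rds_alarm("rds-low-storage", "FreeStorageSpace", 10 * 1024 ** 3,
              "LessThanThreshold", "my-postgres-db", TOPIC),
]

# With credentials configured, each alarm would be created as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
for alarm in alarms:
    print(alarm["AlarmName"], alarm["MetricName"], alarm["ComparisonOperator"])
```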

1666

MCQmedium

A data engineer manages an Amazon Redshift cluster that hosts a 10 TB data warehouse. The cluster uses a single node of type dc2.large (160 GB SSD). The engineer notices that the cluster's disk space is 95% full, and queries are running slowly. The engineer runs the STV_PARTITIONS view and sees that many slices have high 'tossed' counts. The engineer also runs VACUUM and ANALYZE commands, but the disk space does not improve. The engineer suspects that the cluster needs more storage. However, the company wants to minimize cost. Which action should the engineer take to resolve the disk space issue most cost-effectively?

A.Switch to a ra3.xlplus node with managed storage.

B.Replace the cluster with a single ds2.xlarge node.

C.Scale the cluster to a single dc2.8xlarge node.

D.Add another dc2.large node to the cluster to increase total storage.

AnswerB

ds2.xlarge provides 2 TB HDD storage at a lower cost than dc2 options, solving the disk space issue.

Why this answer

Option B is correct because dc2.large is a dense compute node with limited SSD storage; moving to ds2.xlarge provides more storage (2 TB HDD) at a lower cost per GB than scaling up to dc2.8xlarge. Option A is wrong because switching to RA3 nodes is costly, and managed storage may be overkill. Option C is wrong because dc2.8xlarge has 2.56 TB of SSD, which is more expensive than ds2.xlarge. Option D is wrong because adding more dc2.large nodes increases storage but also CPU and memory, which may be unnecessary and more expensive than a single ds2 node.

1667

MCQmedium

A data engineer is running a Spark job on Amazon EMR. The job reads from S3, processes data, and writes to S3. The job is taking longer than expected. The engineer notices that the job is spending a lot of time in the 'GC' (garbage collection) phase. Which configuration change is most likely to improve performance?

A.Increase the spark.executor.memory setting.

B.Increase the spark.sql.shuffle.partitions.

C.Decrease the number of executor cores.

D.Decrease the spark.executor.memoryOverhead.

AnswerA

More memory reduces GC overhead.

Why this answer

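Option A is correct because increasing executor memory gives each executor a larger heap, which reduces GC frequency. Option B is wrong because increasing spark.sql.shuffle.partitions changes the number of shuffle tasks but does not add heap memory. Option C is wrong because decreasing the number of executor cores reduces parallelism without adding heap memory. Option D is wrong because it reduces the memory available to the executor and may increase GC.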

1668

Multi-Selecthard

A company needs to implement a data encryption strategy for data in transit between an Amazon EC2 instance and an Amazon RDS for MySQL database. Which THREE actions should be taken?

Select 3 answers

A.Configure the RDS instance to require SSL connections

B.Set up VPC peering between the EC2 and RDS subnets

C.Enable encryption at rest using KMS on the RDS instance

D.Use a JDBC driver with the useSSL property set to true

E.Enable the rds.force_ssl parameter in the RDS parameter group

AnswersA, D, E

Requires SSL for connections.

Why this answer

Options A, D, and E are correct. Requiring SSL connections on the RDS instance, enabling the SSL parameter in the parameter group, and using a JDBC driver with the SSL property set ensure encryption in transit. Option B is wrong because VPC peering does not encrypt data. Option C is wrong because encryption at rest is a separate concern.

1669

MCQeasy

A data engineer needs to transform JSON data from an S3 bucket into Parquet format for efficient querying with Amazon Athena. The transformation must be serverless and event-driven. Which approach meets these requirements?

A.Use AWS Glue with a scheduled crawler to convert the data.

B.Use Amazon Athena to convert JSON to Parquet on the fly.

C.Use S3 event notifications to invoke a Lambda function that runs PySpark to convert the data.

D.Use Amazon EMR with a long-running cluster to process S3 data.

AnswerC

Lambda can run PySpark (e.g., using AWS Glue ETL library) and is triggered by S3 events, making it serverless and event-driven.

Why this answer

S3 event notifications trigger a Lambda function that uses PySpark to convert JSON to Parquet. Athena is query-only, Glue requires a job trigger, and EMR is not serverless.

1670

Multi-Selecthard

A company runs a real-time analytics platform using Amazon Kinesis Data Streams. The data is consumed by multiple consumers: one for real-time dashboard (using Lambda) and one for long-term storage (using Firehose to S3). The Kinesis stream has 10 shards. Each record is 1 KB, and the total incoming data rate is 5 MB/s. The Lambda consumer is falling behind and processing latency exceeds 10 seconds. Which TWO actions should be taken to resolve the issue?

Select 2 answers

A.Increase the Lambda function's memory allocation

B.Increase the number of shards to 20

C.Enable enhanced fan-out for the Lambda consumer

D.Switch to using Kinesis Client Library (KCL) instead of Lambda

E.Decrease the batch size in the Lambda event source mapping

AnswersB, C

More shards increase the total throughput of the stream, allowing Lambda to process more data in parallel.

Why this answer

Options B and C are correct. Increasing the number of shards to 20 doubles the stream's throughput capacity, allowing Lambda to process more records in parallel. Enhanced fan-out eliminates contention between consumers by giving each consumer dedicated read throughput. Option A (increasing Lambda memory) may help, but processing is still limited by shard throughput. Option D (switching to KCL) replaces the consumer framework without adding read throughput. Option E (decreasing the batch size) would increase the number of invocations, possibly worsening latency.
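The throughput reasoning can be made concrete with a rough capacity calculation. The sketch assumes the standard per-shard limits (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads) and, as a simplification, splits shared read throughput evenly across consumers; the function name `stream_capacity` is made up for illustration.

```python
# Per-shard limits assumed for this sketch.
SHARD_WRITE_MB_S = 1.0
SHARD_WRITE_RECORDS_S = 1000
SHARD_READ_MB_S = 2.0


def stream_capacity(shards: int, consumers: int, enhanced_fan_out: bool) -> dict:
    """Rough aggregate capacity of a stream for a given consumer setup."""
    read_per_consumer = shards * SHARD_READ_MB_S
    if not enhanced_fan_out:
        # Shared throughput: consumers compete for the same 2 MB/s per shard.
        # An even split is a simplification.
        read_per_consumer /= consumers
    return {
        "write_mb_s": shards * SHARD_WRITE_MB_S,
        "write_records_s": shards * SHARD_WRITE_RECORDS_S,
        "read_mb_s_per_consumer": read_per_consumer,
    }


# Scenario from the question: 10 shards, two consumers (Lambda and Firehose).
before = stream_capacity(shards=10, consumers=2, enhanced_fan_out=False)
# After options B and C: 20 shards with enhanced fan-out.
after = stream_capacity(shards=20, consumers=2, enhanced_fan_out=True)

print("before:", before)
print("after: ", after)
```

Because Lambda processes each shard with its own invocations, doubling the shard count also doubles the available parallelism, which this capacity model does not capture.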

1671

MCQmedium

Refer to the exhibit. An IAM policy is attached to a role used by an AWS Glue job. The job fails with an 'AccessDenied' error when trying to write to 's3://my-bucket/output/'. What is the most likely cause?

A.The resource ARN for S3 should include the bucket itself.

B.The glue:StartJobRun action is not allowed.

C.The policy does not grant s3:ListBucket permission.

D.The s3:GetObject action is missing.

AnswerC

Writing to S3 often requires ListBucket on the bucket.

Why this answer

Option C is correct because the policy only allows s3:PutObject on objects inside the bucket, not on the bucket itself, and the error may be due to the missing s3:ListBucket permission on the bucket. Option A is wrong because s3:PutObject is an object-level action, so an object-level resource ARN is appropriate for it. Option B is wrong because the Glue actions allow StartJobRun, and the job itself runs; only the output write fails. Option D is wrong because the failure occurs on a write, which uses s3:PutObject rather than s3:GetObject.

1672

MCQeasy

A company has CSV files in an S3 bucket that need to be converted to Parquet and loaded into a Redshift table daily. The transformation is a simple schema mapping without joins. Which AWS Glue feature is BEST suited for this task?

A.AWS Glue ETL job

B.AWS Glue DataBrew

C.AWS Glue Workflow

D.AWS Glue Crawler

AnswerA

Glue ETL jobs can read CSV, convert to Parquet, and write to Redshift.

Why this answer

Option A is correct because a Glue ETL job is the standard way to transform and load data. Option B (DataBrew) is a visual tool, but the question asks for a Glue feature suited to automated daily jobs. Option C (Workflow) orchestrates jobs but does not perform the transformation. Option D (Crawler) only catalogs data.

1673

MCQmedium

A company uses AWS Glue to process data in Amazon S3. The Glue job fails with an error indicating that the partition keys in the catalog do not match the actual S3 partition structure. What is the most likely cause?

A.The IAM role does not have permissions to read the S3 data

B.The data files are encrypted with SSE-KMS

C.The table name in the catalog is different from the one used in the job

D.The Glue Data Catalog partition metadata is outdated after the S3 structure changed

AnswerD

The catalog must be refreshed by running a crawler to reflect S3 changes.

Why this answer

Option D is correct because when the partition structure in S3 changes (for example, a new partition is added), the catalog must be updated by running a crawler. Option A is wrong because IAM permissions would cause a different error. Option B is wrong because encryption would cause decryption errors. Option C is wrong because a table name mismatch would result in a 'table not found' error.

1674

Multi-Selecthard

A data engineer is designing a data lake on Amazon S3 for analytics. The data includes sensitive PII that must be encrypted at rest. The company requires that the encryption keys be managed by the company's own hardware security module (HSM) and rotated every 90 days. Which TWO options meet these requirements? (Choose TWO.)

Select 2 answers

A.Use S3 server-side encryption with AWS KMS and an AWS managed key

B.Use S3 server-side encryption with customer-provided keys (SSE-C)

C.Use client-side encryption with keys stored in AWS Secrets Manager

D.Use S3 server-side encryption with AWS KMS (SSE-KMS) and a customer-managed key with imported key material from your HSM

E.Use S3 server-side encryption with S3 managed keys (SSE-S3)

AnswersB, D

SSE-C allows you to supply your own encryption keys, which you can rotate by re-encrypting objects.

Why this answer

Option B is correct because SSE-C allows you to provide your own encryption keys, which can be managed and rotated from your own HSM. The keys are used server-side by S3 to encrypt objects at rest, but S3 does not store the keys—you manage them entirely, meeting the requirement for key management on your own HSM with 90-day rotation.

Exam trap

The trap here is that candidates often assume only SSE-KMS can meet key management requirements, but they overlook that SSE-C directly supports customer-supplied keys from an HSM without any AWS key storage, and that SSE-KMS with imported key material also satisfies the HSM and rotation needs when properly configured.

1675

MCQeasy

A data engineer needs to store semi-structured JSON log files from multiple sources and query them using SQL. The data is rarely updated and access frequency is low. Which storage solution is MOST cost-effective?

A.Amazon Redshift with JSON ingestion and compression.

B.Amazon DynamoDB with JSON documents.

C.Amazon S3 with Amazon Athena for querying.

D.Amazon RDS for PostgreSQL with JSONB columns.

AnswerC

S3 provides cheap storage and Athena allows serverless SQL queries, ideal for low-frequency access.

Why this answer

Amazon S3 with Athena is the most cost-effective solution because the data is semi-structured JSON, rarely updated, and accessed infrequently. S3 provides low-cost storage for static data, and Athena uses a serverless, pay-per-query model, eliminating the need for a running cluster or provisioned capacity. This combination avoids the fixed costs of Redshift, DynamoDB, or RDS, making it ideal for low-frequency SQL querying of archival logs.

Exam trap

The trap here is that candidates often choose Redshift or RDS because they associate SQL querying with traditional databases, overlooking that Athena's serverless, pay-per-query model is far more cost-effective for infrequent access to static data stored in S3.

How to eliminate wrong answers

Option A is wrong because Amazon Redshift requires a provisioned cluster with ongoing compute costs, making it overkill and expensive for rarely accessed data; its JSON ingestion and compression do not offset the fixed infrastructure cost. Option B is wrong because Amazon DynamoDB is a NoSQL key-value store optimized for high-frequency, low-latency reads/writes, not for SQL-based ad-hoc querying of large JSON logs; its on-demand capacity mode still incurs per-request charges that are wasteful for infrequent access. Option D is wrong because Amazon RDS for PostgreSQL with JSONB columns requires a provisioned database instance with continuous compute and storage costs, and while JSONB supports indexing, it is not cost-effective for rarely queried, static log data compared to S3's pay-per-byte storage and Athena's pay-per-query model.

1676

MCQeasy

A data engineer is designing a data lake on Amazon S3. The data includes sensitive personally identifiable information (PII). Which combination of services would provide the most comprehensive data protection?

A.Use S3 Transfer Acceleration and enable versioning

B.Enable S3 server-side encryption with AWS KMS

C.Use Amazon CloudWatch Logs to monitor access and enable MFA Delete

D.Enable S3 Block Public Access and use Amazon Macie to discover and classify PII

AnswerD

Block Public Access prevents exposure; Macie identifies and alerts on PII.

Why this answer

Option D is correct because S3 Block Public Access prevents exposure and Macie identifies sensitive data. Option A is wrong because S3 Transfer Acceleration is for transfer speed. Option B is wrong because KMS only encrypts. Option C is wrong because CloudWatch does not protect data.

1677

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver data to an S3 bucket. The data must be delivered within 60 seconds of ingestion. Currently, the delivery takes 3 minutes due to large buffer sizes. How should the engineer adjust the Firehose configuration?

A.Decrease the buffer interval to 60 seconds.

B.Increase the buffer interval to 120 seconds.

C.Increase the buffer size to 128 MB.

D.Decrease the buffer size to 1 MB.

AnswerA

Lowering the buffer interval triggers delivery sooner, meeting the latency requirement.

Why this answer

Amazon Kinesis Data Firehose delivers data to S3 based on either a buffer size threshold or a buffer interval (in seconds), whichever is reached first. To ensure delivery within 60 seconds, you must decrease the buffer interval to 60 seconds, which forces Firehose to flush data to S3 every 60 seconds regardless of buffer size. The current 3-minute delay is caused by the buffer interval being larger than 60 seconds, so reducing it directly meets the requirement.
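The "whichever is reached first" rule can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Firehose's actual implementation: `flush_delay_seconds`, the 10 KB/s ingest rate, and the three example settings are hypothetical, while `SizeInMBs` and `IntervalInSeconds` are the field names used by Firehose buffering hints.

```python
def flush_delay_seconds(rate_bytes_per_s: float, size_mb: int, interval_s: int) -> float:
    """Approximate worst-case wait before a flush: size or time, whichever comes first."""
    seconds_to_fill = (size_mb * 1024 * 1024) / rate_bytes_per_s
    return min(seconds_to_fill, float(interval_s))


RATE = 10 * 1024  # hypothetical ingest rate: 10 KB/s

# BufferingHints-style settings; the specific values are illustrative only.
settings = {
    "current": {"SizeInMBs": 128, "IntervalInSeconds": 180},
    "smaller_buffer": {"SizeInMBs": 1, "IntervalInSeconds": 180},
    "shorter_interval": {"SizeInMBs": 128, "IntervalInSeconds": 60},
}

delays = {
    name: flush_delay_seconds(RATE, hints["SizeInMBs"], hints["IntervalInSeconds"])
    for name, hints in settings.items()
}
print(delays)
```

At this assumed rate, shrinking the buffer to 1 MB still leaves a wait of roughly 102 seconds, while a 60-second interval caps the wait at 60 seconds regardless of how slowly the buffer fills.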

Exam trap

The trap here is that candidates mistakenly think decreasing the buffer size alone will speed up delivery, but without adjusting the buffer interval, Firehose may still wait up to the default interval (e.g., 300 seconds) before flushing, so both parameters must be considered to meet a time-based requirement.

How to eliminate wrong answers

Option B is wrong because increasing the buffer interval to 120 seconds would make the delivery delay even longer (up to 2 minutes), not shorter, and fails to meet the 60-second requirement. Option C is wrong because increasing the buffer size to 128 MB does not reduce delivery time; it may actually increase latency since Firehose waits for more data to accumulate before flushing, and the buffer interval is the primary control for time-based delivery. Option D is wrong because decreasing the buffer size to 1 MB could cause more frequent flushes but does not guarantee delivery within 60 seconds if the buffer interval remains larger than 60 seconds; the buffer interval must be explicitly set to 60 seconds to enforce the time constraint.

Full explanation →

1678

Multi-Selectmedium

A data engineer is designing a data pipeline that uses AWS Glue to transform data stored in Amazon S3. The transformation logic must be written in Python and should handle schema evolution automatically. Which THREE features or configurations should the engineer use? (Select THREE.)

Select 3 answers

A.Schedule a Glue crawler to update the schema

B.Use `applyMapping` transformations

C.Use Spark SQL for transformations

D.Enable schema detection in the Glue job

E.Use DynamicFrames instead of DataFrames

AnswersB, D, E

Facilitates schema manipulation.

Why this answer

Correct options: B, D, E. AWS Glue DynamicFrames handle schema evolution automatically. Glue's schema detection can infer the schema from the data.

Using `applyMapping` allows for schema transformations. Option C (Spark SQL) does not handle schema evolution automatically. Option A (Glue crawlers) is for cataloging, not transformation.

Full explanation →

1679

MCQeasy

A logistics company uses AWS Glue to process GPS data from delivery trucks. The data is stored in Amazon S3 as JSON files. The Glue job reads the JSON files, converts them to Parquet, and writes them back to S3. The company notices that the Glue job takes too long to complete. The data engineer wants to improve the job's performance without changing the code. What should the data engineer do?

A.Increase the number of DPUs to 20.

B.Change the worker type to G.2X.

C.Change the worker type to G.1X.

D.Decrease the number of DPUs to 5 to reduce overhead.

AnswerB

G.2X workers have double the memory and compute, accelerating the transformation.

Why this answer

Option B is correct. The G.2X worker type provides more memory and CPU per worker, improving performance. Option A is wrong because increasing DPUs may help but is less efficient than using larger workers for memory-intensive tasks.

Option C is wrong because G.1X is the default worker type; upgrading to G.2X is better. Option D is wrong because decreasing DPUs would worsen performance.

Full explanation →

1680

MCQhard

Refer to the exhibit. A data engineer configured CloudTrail to log data events for an S3 bucket. However, the engineer notices that no data events are being logged for objects in the 'logs/' prefix. What is the most likely reason?

A.The S3 bucket policy does not allow CloudTrail to write logs

B.The data resource should specify the bucket ARN without a prefix

C.The prefix 'logs/' must not include a trailing slash

D.Data events are not supported for S3

AnswerA

CloudTrail needs a bucket policy granting s3:PutObject.

Why this answer

The data event selector in the exhibit uses the ARN 'arn:aws:s3:::my-bucket/logs/', which is a valid prefix format and already includes the trailing slash. Because the selector itself is correct, it does not explain the missing events.

The most likely cause is that the S3 bucket policy does not grant CloudTrail write access, so the trail cannot deliver its logs. Option A is therefore correct.

Option B is wrong because the prefix ARN is valid. Option C is wrong because the selector correctly includes the trailing slash. Option D is wrong because CloudTrail supports data events for S3.

Full explanation →

1681

MCQhard

A company runs a critical data pipeline using Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is compressed with GZIP and partitioned by year/month/day/hour. Recently, the delivery to S3 has been failing with 'Rate exceeded' errors. The Firehose delivery stream has a buffer size of 128 MB and buffer interval of 60 seconds. What is the most effective way to resolve this issue?

A.Transition objects to S3 Glacier after 30 days.

B.Decrease the buffer size to 64 MB and buffer interval to 30 seconds.

C.Increase the buffer size to 256 MB and buffer interval to 120 seconds.

D.Enable server-side encryption on the S3 bucket.

AnswerC

Larger buffers reduce the number of S3 PUT requests, alleviating throttling.

Why this answer

The 'Rate exceeded' error indicates that Kinesis Data Firehose is sending requests to S3 at a rate that exceeds the S3 bucket's request rate limits for PUT operations. Increasing the buffer size to 256 MB and the buffer interval to 120 seconds allows Firehose to accumulate more data before each S3 PUT request, reducing the number of requests per second and staying within S3's 3,500 PUT requests per second limit per prefix. This directly addresses the throttling issue without changing the data volume.

Exam trap

The trap here is that candidates mistakenly think reducing buffer size or interval will speed up delivery, but in reality, it increases request frequency and worsens S3 throttling, while increasing buffers is the correct way to reduce request rate.

How to eliminate wrong answers

Option A is wrong because transitioning objects to S3 Glacier after 30 days does not affect the rate of PUT requests to the S3 bucket; it only changes storage class after delivery, so it cannot resolve current delivery failures. Option B is wrong because decreasing the buffer size to 64 MB and buffer interval to 30 seconds would increase the frequency of S3 PUT requests, worsening the 'Rate exceeded' errors by exceeding the bucket's request rate limits even more. Option D is wrong because enabling server-side encryption on the S3 bucket does not change the request rate or throughput; it only encrypts objects at rest and has no impact on throttling of PUT operations.

Full explanation →

1682

MCQhard

A company is using AWS DMS to replicate data from an on-premises Oracle database to Amazon RDS for MySQL. The replication is working, but the target table has a different schema. Which DMS feature should be used to transform the source schema to match the target?

A.Use AWS Schema Conversion Tool (SCT)

B.Use AWS Glue ETL jobs

C.Use DMS transformation rules

D.Use AWS Lambda triggers

AnswerC

DMS transformation rules allow renaming tables, schemas, and columns during replication.

Why this answer

Option C is correct because DMS supports transformation rules that can change table names, schemas, and columns during migration. Option A is wrong because AWS SCT is a separate schema conversion tool, not a built-in DMS feature. Option B is wrong because Glue is a separate ETL service, not integrated with DMS.

Option D is wrong because Lambda can be used for custom transformations but adds complexity.
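As a rough illustration of what a DMS transformation rule looks like, the sketch below builds a table-mapping document with one selection rule and one schema rename. The schema names `HR` and `hr_target` and the helper function names are hypothetical; the key names (`rule-type`, `rule-target`, `object-locator`, `rule-action`, `value`) follow the DMS table-mapping JSON format.

```python
import json


def include_all_tables_rule(rule_id: int, source_schema: str) -> dict:
    """Selection rule: replicate every table in the source schema."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": f"include-{rule_id}",
        "object-locator": {"schema-name": source_schema, "table-name": "%"},
        "rule-action": "include",
    }


def rename_schema_rule(rule_id: int, source_schema: str, target_schema: str) -> dict:
    """Transformation rule: rename the schema on the target."""
    return {
        "rule-type": "transformation",
        "rule-id": str(rule_id),
        "rule-name": f"rename-schema-{rule_id}",
        "rule-target": "schema",
        "object-locator": {"schema-name": source_schema},
        "rule-action": "rename",
        "value": target_schema,
    }


# Hypothetical schema names, used only for illustration.
table_mappings = {
    "rules": [
        include_all_tables_rule(1, "HR"),
        rename_schema_rule(2, "HR", "hr_target"),
    ]
}
print(json.dumps(table_mappings, indent=2))
```

The resulting JSON string is what would typically be supplied as the table mappings of a replication task.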

Full explanation →

1683

MCQmedium

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an Oracle RDS instance to S3. The data is used for analytics. The replication lags behind the source by several hours. Which change would most likely reduce the lag?

A.Change the target endpoint from S3 to Kinesis Data Firehose.

B.Increase the source RDS instance storage to improve I/O.

C.Use a larger DMS replication instance (e.g., dms.c5.large instead of dms.t3.medium).

D.Change the target data format from CSV to Parquet.

AnswerC

More compute resources reduce lag.

Why this answer

Option C is correct because a larger replication instance provides more CPU and memory for DMS. Option A is wrong because the S3 endpoint cannot be changed to Kinesis without redesign. Option B is wrong because increasing source RDS storage doesn't directly impact DMS performance.

Option D is wrong because changing the target format to Parquet adds transformation overhead.

Full explanation →

1684

MCQeasy

A company is migrating its on-premises MySQL database to Amazon RDS for MySQL. They want to minimize downtime and ensure data consistency. Which AWS service should be used for the migration?

A.AWS S3 Transfer Acceleration

B.AWS Glue

C.AWS Database Migration Service (DMS)

D.AWS Snowball Edge

AnswerC

DMS supports continuous replication and minimal downtime for database migrations.

Why this answer

AWS Database Migration Service (DMS) is the correct choice because it is specifically designed for migrating databases to AWS with minimal downtime. DMS supports homogeneous migrations like MySQL to Amazon RDS for MySQL, and it uses ongoing replication (change data capture) to keep the source and target databases in sync during the migration, ensuring data consistency and allowing a cutover with only seconds of downtime.

Exam trap

The trap here is that candidates may confuse AWS DMS with AWS Glue, thinking both are for data migration, but Glue is for batch ETL and cannot perform live database replication with minimal downtime, while DMS is purpose-built for that task.

How to eliminate wrong answers

Option A is wrong because AWS S3 Transfer Acceleration is a service that speeds up uploads to Amazon S3 by using optimized network paths and edge locations; it has no capability to migrate or replicate a live MySQL database to RDS. Option B is wrong because AWS Glue is a serverless data integration service for ETL (extract, transform, load) jobs, primarily used for preparing and transforming data for analytics, not for ongoing database replication or minimizing downtime during a live database migration. Option D is wrong because AWS Snowball Edge is a physical data transport device used for large-scale data transfers over slow or unreliable networks, but it is not suitable for minimizing downtime in a live database migration as it involves shipping hardware and cannot perform continuous replication.

Full explanation →

1685

MCQhard

A data engineer is monitoring an Amazon Redshift cluster and notices that queries are taking longer than expected. The engineer checks the system tables and sees that many queries are waiting for 'WLM' resources. What is the most likely cause and recommended fix?

A.The table sort keys are poorly designed; recreate tables with better sort keys.

B.The distribution style is set to ALL; change to KEY distribution.

C.The WLM queue concurrency is set too low; increase the concurrency level.

D.The cluster is running low on disk space; resize the cluster.

AnswerC

Higher concurrency allows more simultaneous queries.

Why this answer

Option C is correct because WLM queue wait indicates concurrency throttling. Option A is wrong because sort keys improve scan efficiency, not concurrency. Option B is wrong because distribution style affects data movement, not queue wait.

Option D is wrong because disk space is unrelated to queue wait.

Full explanation →

1686

MCQhard

A company has an S3 data lake with millions of objects. A data engineer needs to provide a daily report of objects that are not accessed for 90 days. The engineer must minimize cost and impact on performance. Which approach should be used?

A.Enable S3 Inventory and query with Athena

B.Use S3 Select on each object to check last access metadata

C.Analyze S3 server access logs to find objects not accessed

D.Use S3 Storage Lens to generate a dashboard of object age and last access

AnswerD

Storage Lens provides built-in metrics at low cost.

Why this answer

Option D is correct because S3 Storage Lens provides cost-effective analytics including last access date. Option A is wrong because S3 Inventory creates daily lists but requires Athena queries and is more complex. Option B is wrong because S3 Select is for querying objects' content, not metadata.

Option C is wrong because S3 server access logs can be large and costly to query.

Full explanation →

1687

Multi-Selectmedium

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are failing intermittently with 'Out of Memory' errors. The team wants to resolve this issue without increasing costs significantly. Which TWO actions should the team take?

Select 2 answers

A.Increase the Spark memory overhead parameter in the Glue job configuration.

B.Use DynamicFrame instead of Spark DataFrame for transformations.

C.Increase the number of workers to maximum allowed.

D.Switch from a Spark job to a Python shell job.

E.Change the worker type from 'G.1x' to 'G.2x' to double memory per worker.

AnswersA, E

Allocates more memory per worker for Spark processing, reducing OOM errors.

Why this answer

The correct answers are A and E. Increasing Spark memory overhead per worker (option A) provides more memory for Spark operations. Using the G.2X worker type (option E) offers more memory per worker compared to G.1X without doubling cost.

Option B (using DynamicFrame) does not address memory. Option C (increasing the number of workers) increases cost linearly. Option D (using a Python shell job) is not suitable for large data.

Full explanation →

1688

Multi-Selecthard

A data engineer is designing a data lake on Amazon S3. The data is ingested from multiple sources and must be queryable using Amazon Athena. The engineer needs to optimize query performance and reduce costs. Which THREE actions would achieve this?

Select 3 answers

A.Store data in many small files to increase parallelism.

B.Partition the data by a commonly used filter column.

C.Use S3 Select instead of Athena for queries.

D.Compress data with a splittable compression format like Snappy.

E.Convert data to Apache Parquet or ORC format.

AnswersB, D, E

Partition pruning limits the data scanned.

Why this answer

Option B: Partitioning reduces data scanned. Option D: Compression reduces storage and data scanned. Option E: Using columnar formats like Parquet reduces data scanned.

Option A is wrong because the number of files should be optimized, not increased. Option C is wrong because S3 Select is for filtering within a single file, not for Athena.
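To make the partition-pruning idea concrete, here is a small sketch that lays out Hive-style partition prefixes and then selects only those matching a month filter. It is a simplified model of pruning, not Athena's implementation; the prefix `s3://example-data-lake/events` and both helper names are hypothetical.

```python
from datetime import date
from typing import List


def partition_prefix(table_prefix: str, day: date) -> str:
    """Build a Hive-style partition prefix such as .../year=2024/month=02/day=01/."""
    return f"{table_prefix}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def prune(prefixes: List[str], year: int, month: int) -> List[str]:
    """Keep only the partitions a query filtered on year and month would read."""
    wanted = f"/year={year}/month={month:02d}/"
    return [p for p in prefixes if wanted in p]


# Hypothetical table location and dates.
TABLE = "s3://example-data-lake/events"
all_partitions = [
    partition_prefix(TABLE, d)
    for d in (date(2024, 1, 31), date(2024, 2, 1), date(2024, 2, 2))
]

february = prune(all_partitions, 2024, 2)
print(february)
```

A query that filters on the partition columns only needs to read the two February prefixes, which is where the reduction in scanned data comes from.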

Full explanation →

1689

MCQhard

A company uses Amazon EMR to process large datasets stored in Amazon S3. The data is in Parquet format and partitioned by date. The EMR cluster uses Spark SQL for transformations. Recently, the job has been slow and some tasks are failing due to 'java.lang.OutOfMemoryError'. The cluster has 10 core nodes of type m5.xlarge. Which configuration change would MOST improve performance and stability?

A.Increase the number of Spark partitions using repartition(), but keep the same nodes.

B.Change the core node instance type to r5.xlarge (memory-optimized).

C.Increase the number of executor cores in the Spark configuration.

D.Enable Kryo serialization in the Spark configuration.

AnswerB

More memory per node helps OOM.

Why this answer

Option B is correct because using a memory-optimized instance type like r5.xlarge provides more memory per core. Option A is wrong because more partitions with the same resources can cause overhead. Option C is wrong because increasing executor cores without increasing memory can worsen memory issues.

Option D is wrong because Kryo serialization reduces memory for serialized objects, not OOM from processing.

Full explanation →

1690

MCQmedium

The IAM policy shown is attached to an IAM role. When a user assumes this role and tries to read an object in example-bucket that has no tags, what will happen?

A.The request will be denied because the object does not have the 'public' tag

B.The request will be allowed because the Allow statement grants access to all objects

C.The request will be allowed because there is no explicit Deny

D.The request will be denied because the Deny statement applies when the tag is missing

AnswerD

The Deny statement explicitly denies access when the tag is null.

Why this answer

Option D is correct. The Deny statement denies s3:GetObject if the object does not have the tag 'classification' (i.e., the tag is null). Since the object has no tags, the condition evaluates to true, and the action is denied.

The Allow statement only allows if the tag equals 'public', which is not the case. The explicit Deny overrides any Allow, so access is denied.
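The evaluation order described above can be modeled with a toy function. The exhibit policy is not reproduced here, so this sketch is based only on the explanation: a Deny when the `classification` tag is missing and an Allow when it equals `public`. The function name and return labels are hypothetical, and this is not how the IAM engine is actually implemented.

```python
from typing import Dict


def evaluate_get_object(tags: Dict[str, str]) -> str:
    """Toy model: explicit Deny is checked first, then Allow, else implicit deny."""
    classification = tags.get("classification")

    # Deny statement: its condition is true when the tag is absent.
    if classification is None:
        return "ExplicitDeny"

    # Allow statement: applies only when the tag equals 'public'.
    if classification == "public":
        return "Allow"

    # No statement matched, so the request falls back to the default deny.
    return "ImplicitDeny"


print(evaluate_get_object({}))
print(evaluate_get_object({"classification": "public"}))
print(evaluate_get_object({"classification": "internal"}))
```

An untagged object hits the explicit Deny before the Allow is ever considered, which matches the outcome described in the explanation.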

Full explanation →

1691

MCQhard

Your company runs a critical data processing pipeline that ingests data from multiple sources into an Amazon S3 bucket. An AWS Glue ETL job processes this data and writes the output to an Amazon Redshift cluster. The pipeline is triggered by an S3 event notification that invokes an AWS Lambda function, which starts the Glue job. Recently, you have observed that the Glue job occasionally fails with an AccessDenied error when trying to access the S3 bucket. The IAM role used by the Glue job has the following policy: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::input-bucket", "arn:aws:s3:::input-bucket/*" ] }, { "Effect": "Allow", "Action": [ "redshift:CopyData" ], "Resource": "*" } ] }. The S3 bucket has a bucket policy that allows access only from a specific VPC. The Glue job runs in a VPC with the appropriate VPC endpoints configured. The error occurs intermittently and sometimes retries succeed. What is the most likely cause and correct course of action?

A.Add a VPC endpoint for S3 and configure the bucket policy to allow access from the Glue job's VPC endpoint.

B.Ensure the Glue job's VPC configuration includes a NAT gateway to route traffic to S3.

C.Change the Lambda function to use a different IAM role with broader S3 permissions.

D.Modify the Glue job's IAM role to include s3:PutObject permission for the output bucket.

AnswerA

This ensures requests from Glue are routed through the VPC endpoint and comply with the bucket policy.

Why this answer

Option A is correct because the Glue job runs in a VPC, but if the S3 bucket policy requires requests to come from the VPC endpoint, the Glue job's requests must originate from that endpoint. However, Glue jobs running in a VPC do not automatically route S3 traffic through VPC endpoints; they go through the internet unless a VPC endpoint is explicitly used. The intermittent success might be due to other requests coming from the same IP range.

The correct action is to ensure the Glue job uses a VPC endpoint for S3 and that the bucket policy allows that endpoint. Option B is wrong because the Glue job already runs in the VPC. Option C is wrong because the Lambda function is not directly accessing S3.

Option D is wrong because the bucket policy is the issue, not the IAM policy.

Full explanation →

1692

MCQmedium

A data engineer reviews this IAM policy attached to an S3 bucket. What is the effect of this policy?

A.Denies PutObject when encryption is not SSE-KMS.

B.Denies all PutObject requests.

C.Allows PutObject only when the object is encrypted with SSE-KMS.

D.Allows PutObject only when the object is NOT encrypted with SSE-KMS.

AnswerA

A `Deny` with `StringNotEquals` blocks any upload whose encryption header is not `aws:kms`.

Why this answer

Option A is correct because the policy uses a `Deny` effect with a `StringNotEquals` condition on `s3:x-amz-server-side-encryption` for `aws:kms`. This means that if the encryption header is NOT `aws:kms` (i.e., not SSE-KMS), the request is denied. The net effect is that only SSE-KMS uploads escape the deny; any other encryption or no encryption triggers it.

Exam trap

The trap here is that candidates misread `StringNotEquals` as `StringEquals`, thinking the policy denies SSE-KMS encryption, when in fact it denies everything except SSE-KMS.

How to eliminate wrong answers

Option B is wrong because the policy does not deny all `PutObject` requests; it only denies those that do not meet the encryption condition. Option C is wrong because the policy uses a `Deny` effect, not an `Allow`; it does not explicitly allow `PutObject` with SSE-KMS, but rather denies everything else. Option D is wrong because it inverts the condition: uploads that are NOT encrypted with SSE-KMS are exactly the ones the policy denies.
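The exhibit policy is not reproduced on this page, so the sketch below is a hypothetical reconstruction based only on the explanation: a single `Deny` statement with a `StringNotEquals` condition on `s3:x-amz-server-side-encryption`. The bucket name, the `Sid`, and both helper functions are assumptions for illustration.

```python
import json


def deny_non_kms_uploads(bucket: str) -> dict:
    """Bucket policy that denies PutObject unless the SSE header is aws:kms."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            }
        ],
    }


def put_is_denied(policy: dict, sse_header: str) -> bool:
    """Toy check for a request that carries an SSE header: deny when it differs."""
    condition = policy["Statement"][0]["Condition"]["StringNotEquals"]
    return sse_header != condition["s3:x-amz-server-side-encryption"]


policy = deny_non_kms_uploads("example-bucket")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
print(put_is_denied(policy, "AES256"), put_is_denied(policy, "aws:kms"))
```

Reading the condition as "deny when the header differs from `aws:kms`" is what separates `StringNotEquals` from `StringEquals` here.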

Full explanation →

1693

MCQmedium

A company uses Amazon S3 to store sensitive documents. The security team has mandated that all objects must be encrypted at rest using server-side encryption with AWS KMS (SSE-KMS). Additionally, the company wants to ensure that any attempt to upload an unencrypted object is denied. A data engineer has configured a bucket policy that denies PutObject if the encryption header does not include x-amz-server-side-encryption: aws:kms. However, the engineer notices that some objects are still being stored without encryption. Upon investigation, the engineer suspects that the policy is not being evaluated correctly. What should the engineer do to ensure that all objects are encrypted with SSE-KMS?

A.Use an IAM policy to require encryption instead of a bucket policy.

B.Enable S3 Block Public Access settings.

C.Add a condition to the bucket policy that checks for aws:SourceVpce.

D.Enable default encryption on the S3 bucket with SSE-KMS.

AnswerD

Default encryption ensures all objects are encrypted, complementing the bucket policy.

Why this answer

Option D is correct because enabling default encryption on the S3 bucket with SSE-KMS ensures that any object uploaded without an explicit encryption header is automatically encrypted with SSE-KMS. This closes the gap where the bucket policy condition fails to catch uploads that omit the `x-amz-server-side-encryption` header entirely, as the policy only denies requests with an incorrect header but does not block requests that lack the header altogether. Default encryption applies server-side encryption at the bucket level, making it impossible to store an unencrypted object.
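A default-encryption setting of this kind can be sketched as the configuration document that the S3 API accepts. The structure below follows the `ServerSideEncryptionConfiguration` shape used by boto3's `put_bucket_encryption`; the key alias, the bucket name in the comment, and the helper name are hypothetical, and the actual API call is left commented out so the example runs without AWS credentials.

```python
def default_sse_kms_config(kms_key_id: str) -> dict:
    """Build a default-encryption configuration that applies SSE-KMS to new objects."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_id,
                },
                # Bucket Keys reduce the number of calls made to KMS.
                "BucketKeyEnabled": True,
            }
        ]
    }


config = default_sse_kms_config("alias/example-data-lake-key")  # hypothetical alias
print(config)

# Applying it would look roughly like this (requires boto3 and credentials):
# import boto3
# boto3.client("s3").put_bucket_encryption(
#     Bucket="example-bucket",
#     ServerSideEncryptionConfiguration=config,
# )
```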

Exam trap

The trap here is that candidates assume a bucket policy condition denying PutObject without `x-amz-server-side-encryption: aws:kms` will block all unencrypted uploads, but they overlook that the condition only matches when the header is present with a wrong value, not when the header is absent entirely.

How to eliminate wrong answers

Option A is wrong because an IAM policy applies only to the users and roles it is attached to rather than to the bucket itself, leaving gaps for anonymous or cross-account uploads; it does not enforce encryption for every request that reaches the bucket. Option B is wrong because S3 Block Public Access settings only prevent public access to objects and buckets, not encryption enforcement; they have no effect on whether objects are encrypted at rest. Option C is wrong because checking for `aws:SourceVpce` restricts access based on VPC endpoint origin, which is unrelated to encryption requirements and would not prevent unencrypted uploads from other sources.

Full explanation →

1694

MCQeasy

A data engineer is monitoring an Amazon Kinesis Data Analytics application that uses a SQL query to aggregate streaming data. The application is falling behind and the millisBehindLatest metric is increasing. Which action should the engineer take to improve performance?

A.Switch from SQL to Apache Flink for the analytics application

B.Increase the number of shards in the source Kinesis stream

C.Increase the Parallelism setting of the Kinesis Data Analytics application

D.Decrease the window duration of the SQL query

AnswerC

Higher parallelism increases processing capacity, reducing lag.

Why this answer

Increasing the Parallelism setting of the Kinesis Data Analytics application allows the SQL query to process data across more in-application streams and operators concurrently, directly addressing the lag indicated by the rising millisBehindLatest metric. This action scales the compute resources allocated to the application without changing the source stream or the query logic, making it the most direct way to improve throughput for a SQL-based Kinesis Data Analytics application.

Exam trap

The trap here is that candidates often confuse scaling the source (shards) with scaling the processing engine (parallelism), assuming that more data input automatically fixes processing lag, when in fact the bottleneck is the application's compute capacity.

How to eliminate wrong answers

Option A is wrong because switching from SQL to Apache Flink is a fundamental architectural change that is not required to address performance tuning; the question specifically states the application uses SQL, and Flink would require rewriting the application entirely, not a simple performance fix. Option B is wrong because increasing the number of shards in the source Kinesis stream increases the ingestion capacity but does not directly improve the processing speed of the Kinesis Data Analytics application; if the application is already falling behind due to insufficient compute, more shards will only increase the backlog. Option D is wrong because decreasing the window duration of the SQL query reduces the amount of data aggregated per window, which may reduce latency but does not increase the overall processing parallelism or throughput; it could even cause more frequent window triggers, potentially worsening the lag.

Full explanation →

1695

MCQhard

A company uses Amazon Redshift for data warehousing. They notice that queries are running slowly, and the STL_LOAD_ERRORS table shows many 'Parse error' entries. The data is loaded from Amazon S3 using COPY commands. What is the MOST likely cause of the parse errors?

A.The source data files have a different schema or delimiter than what is specified in the COPY command.

B.The Redshift cluster does not have enough compute nodes to process the data.

C.The IAM role used by Redshift does not have permission to decrypt the S3 objects.

D.The source data files are compressed using an unsupported compression format.

AnswerA

Schema mismatch leads to parse errors.

Why this answer

Option A is correct because parse errors during COPY typically indicate that the source data does not match the target table schema (e.g., data type mismatch or delimiter issues). Option B is incorrect because insufficient compute resources would cause performance issues but not parse errors. Option C is incorrect because data encryption issues would cause access denied errors.

Option D is incorrect because an unsupported compression format would cause decompression errors, not parse errors.

Full explanation →

1696

MCQhard

A logistics company ingests GPS tracking data from thousands of vehicles into Amazon S3 via AWS Direct Connect. Each vehicle sends a message every 5 seconds, resulting in about 200,000 messages per second. Each message is about 200 bytes. The company uses AWS Glue to transform the data into a parquet format and load it into Amazon Redshift for real-time analytics. However, the Glue jobs are failing due to memory issues and the data is not being loaded into Redshift quickly enough. The company needs to reduce the latency of data availability in Redshift. Which action should the data engineer take?

A.Use Amazon Kinesis Data Analytics to process the data in real-time and write to Redshift directly.

B.Increase the size of the Redshift cluster to improve load performance.

C.Use Amazon Kinesis Data Firehose to ingest the data directly into S3 and then use Redshift Spectrum to query the data without loading.

D.Increase the number of DPUs and allocate more memory to the Glue job.

AnswerC

Firehose can handle high throughput and Redshift Spectrum reduces load time.

Why this answer

Option C is correct because Amazon Kinesis Data Firehose can ingest high-throughput streaming data and deliver it to S3 in near-real-time, and Redshift Spectrum can then query that data in place, removing the load step and reducing latency. Option A is wrong because Kinesis Data Analytics adds processing overhead. Option B is wrong because a larger Redshift cluster does not address the failing Glue jobs upstream of the load.

Option D is wrong because increasing Glue DPUs may not fix the memory issues and doesn't reduce latency significantly.

Full explanation →

1697

Multi-Selectmedium

Which TWO actions can help improve the read performance of an Amazon DynamoDB table that is experiencing throttling? (Choose two.)

Select 2 answers

A.Enable DynamoDB Accelerator (DAX) for write-heavy workloads.

B.Increase the provisioned read capacity units (RCUs) for the table.

C.Use eventually consistent reads instead of strongly consistent reads.

D.Add a global secondary index (GSI) with a different partition key.

E.Enable auto-scaling with a lower minimum capacity.

AnswersB, D

More RCUs allow higher throughput.

Why this answer

Options B and D are correct. Increasing read capacity directly addresses throttling, and adding a GSI distributes read load. Option A is wrong because it targets write-heavy workloads, not read throttling.

Option C reduces latency but not throttling. Option E is for cost reduction.

Full explanation →

1698

MCQeasy

A company uses Amazon EMR to run Spark jobs on a transient cluster. The jobs process data from S3 and write results back to S3. The team wants to reduce costs by optimizing the cluster. Which action should the team take?

A.Use Spot Instances for the task nodes.

B.Increase the number of core nodes and use larger instance types.

C.Enable EMRFS consistent view.

D.Terminate the cluster after each job and manually restart it for the next job.

AnswerA

Spot instances are cheaper than on-demand.

Why this answer

Using Spot Instances for task nodes in a transient EMR cluster significantly reduces compute costs because Spot Instances are spare AWS EC2 capacity offered at up to 90% discount compared to On-Demand. Since transient clusters are terminated after job completion, the risk of Spot Instance interruptions is mitigated—the job can simply be retried on a new cluster if needed. This directly addresses the cost optimization goal without sacrificing job functionality.

Exam trap

The trap here is that candidates confuse cost optimization with performance tuning, leading them to choose larger instances (Option B) or consistency features (Option C), when the real cost lever for transient workloads is leveraging Spot pricing for ephemeral compute.

How to eliminate wrong answers

Option B is wrong because increasing the number of core nodes and using larger instance types increases costs, not reduces them, and core nodes host HDFS which is unnecessary for a transient cluster that reads/writes directly to S3. Option C is wrong because EMRFS consistent view is a feature to handle S3 eventual consistency for listing and renaming, not a cost optimization mechanism—it adds overhead without reducing spend. Option D is wrong because EMR transient clusters already terminate automatically after the job completes; manually restarting is redundant and does not further reduce costs, and it introduces operational overhead.

Full explanation →

1699

MCQhard

A company runs a data pipeline using AWS Glue ETL jobs to process daily files from an S3 bucket. The files are in CSV format and range from 1 GB to 10 GB. The Glue job runs successfully for small files but fails with an 'Out of Memory' error for files larger than 5 GB. The job uses a single G.1X DPU (16 GB memory). The company needs to process these large files without changing the existing ETL script. Which solution should the company implement?

A.Convert the input files from CSV to Parquet format to reduce memory usage.

B.Use the Optimus format in AWS Glue to compress data.

C.Use Amazon EMR with Spark instead of AWS Glue.

D.Increase the number of DPUs and use the G.2X worker type to provide more memory per worker.

AnswerD

More DPUs and G.2X provide additional memory.

Why this answer

Option D is correct because adding DPUs and switching to the G.2X worker type (32 GB of memory per worker, versus 16 GB for G.1X) gives each worker more memory, which resolves the out-of-memory error without script changes. Option A is wrong because converting to Parquet may reduce data size but does not guarantee that the memory issue is resolved, and it may require script changes. Option B is wrong because using a different file format or compression may not address the memory issue.

Option C is wrong because moving to Spark on Amazon EMR requires rewriting the script.
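A minimal sketch of the capacity change, assuming the job is started through the Glue `start_job_run` API, which accepts `WorkerType` and `NumberOfWorkers` overrides. The job name and worker count are hypothetical.

```python
# Memory per worker for the two worker types discussed above.
WORKER_MEMORY_GB = {"G.1X": 16, "G.2X": 32}


def job_run_overrides(job_name: str, worker_type: str, number_of_workers: int) -> dict:
    """Build keyword arguments for glue.start_job_run without touching the script."""
    if worker_type not in WORKER_MEMORY_GB:
        raise ValueError(f"unexpected worker type: {worker_type}")
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be positive")
    return {
        "JobName": job_name,
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


def total_memory_gb(worker_type: str, number_of_workers: int) -> int:
    """Aggregate worker memory for a given configuration."""
    return WORKER_MEMORY_GB[worker_type] * number_of_workers


overrides = job_run_overrides("daily-csv-etl", "G.2X", 10)  # hypothetical job name
print(overrides, total_memory_gb("G.2X", 10), "GB")
```

With a real client, the dictionary would be passed as `glue.start_job_run(**overrides)`.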

Full explanation →

1700

MCQeasy

A data engineer runs the above SQL commands on an Amazon Redshift cluster. The table 'users' is created with DISTSTYLE EVEN. What is the effect of the DISTSTYLE EVEN on query performance?

A.It stores all data on a single node for fast local queries.

B.It ensures data is evenly distributed across all nodes to prevent data skew.

C.It reduces data movement during queries by co-locating data based on user_id.

D.It improves join performance when joining on user_id.

AnswerB

EVEN distribution spreads rows evenly, avoiding skew.

Why this answer

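Option B is correct because DISTSTYLE EVEN distributes rows across the node slices in round-robin fashion, regardless of the values in any column, which is beneficial when there is no clear join or aggregation key. Option A is incorrect because EVEN spreads rows over every node rather than storing them on a single node. Option C is incorrect because EVEN distribution ignores user_id, so it neither co-locates rows nor reduces data movement; that is what DISTSTYLE KEY does.

Option D is incorrect because EVEN distribution does not optimize joins on user_id: matching rows can land on different slices and must be redistributed at query time.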
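The difference between round-robin and key-based placement can be sketched with a small simulation. This is a toy model, not Redshift's actual implementation: the modulo on `user_id` stands in for the hash Redshift applies to a distribution key, and the skewed sample data is invented for the example.

```python
from collections import Counter


def distribute_even(rows, num_slices):
    """Round-robin placement: row position decides the slice, values are ignored."""
    counts = Counter()
    for position, _row in enumerate(rows):
        counts[position % num_slices] += 1
    return counts


def distribute_by_key(rows, num_slices):
    """Key placement: rows with the same user_id always share a slice."""
    counts = Counter()
    for row in rows:
        counts[row["user_id"] % num_slices] += 1
    return counts


# 90 rows for one heavy user plus one row each for ten other users.
rows = [{"user_id": 1}] * 90 + [{"user_id": uid} for uid in range(2, 12)]

even = distribute_even(rows, 4)
keyed = distribute_by_key(rows, 4)
print("EVEN:", dict(even))
print("KEY: ", dict(keyed))
```

In this toy data the round-robin placement puts the same number of rows on every slice, while the key-based placement concentrates the heavy user's rows on one slice.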

Full explanation →

1701

MCQhard

Refer to the exhibit. An IAM policy is attached to an AWS Glue ETL job. The job reads from the Kinesis stream 'input-stream' and writes to S3 bucket 'data-lake-bucket'. The job fails with an access denied error. Which missing permission is most likely the cause?

A.kinesis:PutRecord permission on a wildcard stream ARN

B.kinesis:DescribeStream permission

C.s3:PutObject permission on a specific prefix

D.s3:ListBucket permission on the bucket

AnswerD

Glue needs ListBucket to verify bucket existence and structure.

Why this answer

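Option D is correct. The policy includes s3:PutObject but not s3:ListBucket, which Glue also requires when writing to S3. s3:ListBucket is granted on the bucket ARN itself rather than on the object ARN. Option A is wrong because the job reads from the stream rather than putting records into it, and the stream ARN in the policy is specific.

Option B is wrong because kinesis:DescribeStream is not required for writing. Option C is wrong because the policy already grants s3:PutObject and does not restrict it to a prefix.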
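Since the exhibit is not reproduced here, the following is only a sketch of what the S3 portion of such a policy might look like once s3:ListBucket is added. The bucket name comes from the question; the statement layout is an assumption.

```python
import json


def s3_write_statements(bucket: str) -> list:
    """Statements a job typically needs to write to a bucket."""
    return [
        {
            # ListBucket is a bucket-level action, so it targets the bucket ARN.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket}",
        },
        {
            # PutObject is an object-level action, so it targets keys in the bucket.
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        },
    ]


policy = {"Version": "2012-10-17", "Statement": s3_write_statements("data-lake-bucket")}
print(json.dumps(policy, indent=2))
```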

Full explanation →

1702

MCQmedium

A data engineering team uses AWS Glue ETL jobs to process data daily. They notice that job run times are increasing as data volume grows. Which action will most effectively improve performance without changing the code?

A.Use a smaller instance type for the Glue job.

B.Enable job bookmark to skip previously processed data.

C.Split the data into more files in S3.

D.Increase the number of DPUs for the Glue job.

AnswerD

More DPUs increase parallelism and can significantly reduce run time.

Why this answer

Increasing the number of DPUs (Data Processing Units) allocated to the job provides more parallelism and memory, speeding up processing without code changes.

Full explanation →

1703

MCQhard

A data engineer is troubleshooting an Amazon Redshift cluster that is experiencing slow query performance. The engineer notices that the disk space is heavily utilized and queries are spilling to disk. What is the most cost-effective solution to improve performance?

A.Run VACUUM command to reclaim space

B.Change distribution style to KEY

C.Resize the cluster to a larger node type or add nodes

D.Apply compression encoding to tables

AnswerC

Adding memory and disk reduces spilling to disk.

Why this answer

When queries spill to disk due to heavy disk utilization, the root cause is insufficient memory or compute capacity relative to the workload. Resizing the cluster (adding nodes or moving to a larger node type) directly increases available memory and CPU, reducing or eliminating disk spill and improving query performance. This is the most cost-effective solution because it scales resources proportionally without requiring manual tuning or schema changes.

Exam trap

The trap here is that candidates confuse disk space management (VACUUM, compression) with memory/query execution issues, leading them to choose storage optimization options when the real bottleneck is insufficient compute resources.

How to eliminate wrong answers

Option A is wrong because VACUUM reclaims space from deleted rows but does not increase memory or reduce disk spill; it only reorganizes existing data. Option B is wrong because changing distribution style (e.g., to KEY) optimizes data redistribution for joins but does not address insufficient memory or disk spill. Option D is wrong because applying compression encoding reduces storage footprint and I/O, but does not increase memory or compute capacity to prevent queries from spilling to disk.

Full explanation →

1704

Multi-Selectmedium

A company uses AWS Glue to catalog data stored in Amazon S3. The data is in Parquet format and partitioned by date. The company wants to improve query performance in Amazon Athena and reduce costs. Which THREE actions should the company take? (Choose THREE.)

Select 3 answers

A.Convert the data to JSON format for better schema evolution.

B.Use Glue DataBrew to clean the data before querying.

C.Partition the data by date so Athena can use partition pruning.

D.Ensure the data is in a columnar format like Parquet or ORC.

E.Compress the data using a codec like Snappy or Gzip.

AnswersC, D, E

Partition pruning limits the amount of data scanned per query.

Why this answer

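Options C, D, and E are correct. Partitioning by date (C) lets Athena prune partitions, so only the partitions a query filters on are read. A columnar format such as Parquet or ORC (D) lets Athena read only the columns a query references. Compression (E) reduces both storage and the bytes scanned, and Athena bills by bytes scanned.

Option A is wrong because converting to JSON, a row-based text format, would hurt performance and increase the data scanned. Option B is wrong because Glue DataBrew is a data preparation tool, not a query performance optimization.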
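To see how the three techniques compound, here is a back-of-the-envelope estimate of bytes scanned. It is a deliberately simplified model that assumes all partitions and columns are the same size; the dataset size, partition count, column count, and compression ratio are invented for illustration.

```python
def estimated_scan_bytes(total_bytes: float,
                         partitions_read: int, partitions_total: int,
                         columns_read: int, columns_total: int,
                         compression_ratio: float = 1.0) -> float:
    """Estimate bytes scanned, assuming uniform partition and column sizes."""
    partition_fraction = partitions_read / partitions_total  # partition pruning
    column_fraction = columns_read / columns_total           # columnar projection
    return total_bytes * partition_fraction * column_fraction * compression_ratio


# Hypothetical table: 1 TB uncompressed, one partition per day for a year,
# 50 columns; the query reads 7 days and 5 columns of data compressed to 25%.
full_scan = 10**12
pruned = estimated_scan_bytes(full_scan, 7, 365, 5, 50, compression_ratio=0.25)
print(f"{pruned / full_scan:.4%} of the uncompressed table is scanned")
```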

Full explanation →

1705

MCQhard

A company is designing a data pipeline using Amazon Kinesis Data Streams. The data includes personally identifiable information (PII). The security team requires that data be encrypted at rest using a customer-managed KMS key. How should the data engineer configure the Kinesis stream?

A.Configure the Kinesis stream to use AWS CloudHSM for encryption.

B.Enable server-side encryption on the Kinesis stream and specify the customer-managed KMS key.

C.Store the encrypted data in S3 and use Kinesis to stream the S3 object keys.

D.Use client-side encryption in the producer application to encrypt data before sending to Kinesis.

AnswerB

Kinesis supports server-side encryption with KMS.

Why this answer

Option B is correct because Kinesis Data Streams supports server-side encryption with AWS KMS, and the stream can be configured to use a customer-managed key. Option D is incorrect because client-side encryption must be implemented by the producer, not configured on the stream. Option A is incorrect because CloudHSM is not directly supported for Kinesis encryption.

Option C is incorrect because Kinesis does not use S3 for storage.
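A minimal sketch of enabling server-side encryption, assuming the Kinesis `start_stream_encryption` API. The client is passed in so the function runs without AWS credentials; the stream name, key alias, and the recording stand-in client are hypothetical.

```python
def enable_stream_encryption(kinesis_client, stream_name: str, kms_key_id: str) -> dict:
    """Turn on KMS server-side encryption for an existing stream."""
    params = {
        "StreamName": stream_name,
        "EncryptionType": "KMS",
        "KeyId": kms_key_id,  # customer-managed key: ID, ARN, or alias
    }
    kinesis_client.start_stream_encryption(**params)
    return params


class RecordingClient:
    """Stand-in for boto3.client('kinesis') that only records calls."""

    def __init__(self):
        self.calls = []

    def start_stream_encryption(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
sent = enable_stream_encryption(client, "pii-stream", "alias/pii-stream-key")
print(sent)
```

In a real deployment the stand-in would be replaced by `boto3.client("kinesis")`.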

Full explanation →

1706

MCQmedium

A company is streaming clickstream data from a website into Amazon Kinesis Data Streams. The data is then consumed by a Lambda function that transforms the records and writes them to an S3 bucket in Parquet format. Recently, the Lambda function has been timing out and the S3 bucket is not receiving all expected records. The Kinesis stream has a shard count of 10 and the Lambda function's reserved concurrency is set to the default. Which change would MOST likely resolve the issue?

A.Decrease the batch window from the default 300 seconds to 60 seconds.

B.Configure the Kinesis stream to directly write to S3 using a delivery stream.

C.Increase the Lambda function's reserved concurrency.

D.Increase the batch size from the default 100 to 500 records per invocation.

AnswerD

Larger batch size means fewer invocations, reducing overhead and allowing the function to process more records before timeout.

Why this answer

The Lambda function is timing out because it cannot keep up with the incoming data rate. Increasing the batch size (Option D) allows each invocation to process more records, reducing the number of invocations and the per-invocation overhead. Option C is wrong because increasing concurrency would help but may exceed account limits.

Option A is wrong because decreasing the batch window would increase invocation frequency. Option B is wrong because Kinesis Data Streams cannot write to S3 directly.

Full explanation →

1707

MCQhard

A data engineer is running an AWS Glue ETL job that converts CSV files to Parquet. The job fails with the error shown in the exhibit. The input files are about 500 MB each. The job uses 5 workers of type G.1X (16 GB memory each). What is the MOST likely cause?

A.The output Parquet file size is too large for the executor memory

B.The data is highly skewed causing a single partition to receive too much data

C.The Spark driver does not have enough memory to handle the schema inference

D.The input CSV files are corrupt or have inconsistent schema

AnswerA

Writing a large file requires memory proportional to file size; splitting into smaller files can help.

Why this answer

Option A is correct because the error shows an out-of-memory (OOM) failure in the write task, which typically occurs when writing large files: Spark tries to write a Parquet file that exceeds the executor memory. Option D is wrong because corrupt files or an inconsistent schema would cause different errors.

Option C is wrong because schema inference happens while reading, not writing. Option B is wrong because data skew would cause OOM in the shuffle, not in the write.

Full explanation →

1708

MCQmedium

A company uses Amazon Redshift for data warehousing. The data team notices that queries are slow due to high disk usage on the cluster. They need to free up space without deleting any data. What should they do?

A.Change the table's sort keys

B.Run a deep copy to re-sort and reclaim space

C.Run VACUUM command

D.Add more nodes to the cluster

AnswerB

Deep copy reorganizes data and reclaims disk space effectively.

Why this answer

Option B is correct because a deep copy recreates the table and repopulates it with a bulk insert, which re-sorts the data and reclaims disk space. Option C (VACUUM) reclaims space from deleted rows, but it is slower than a deep copy when a table has a large unsorted region. Option D (adding nodes) adds cost.

Option A (changing sort keys) requires a table redesign.

Full explanation →

1709

MCQmedium

A company is using AWS Lake Formation to manage access to a data lake in S3. They want to grant a data analyst access to specific columns in a table, but not to the entire table. Which Lake Formation feature should be used?

A.Row-level security (cell-level filtering)

B.IAM policies on the S3 bucket

C.Column-level filtering

D.Tag-based access control (TBAC)

AnswerC

Column-level filtering restricts access to specific columns.

Why this answer

Option C is correct because Lake Formation column-level filtering allows granting access to specific columns. Option A (row-level security) filters rows, not columns. Option D (tag-based access control) uses tags to control access and is not column-level filtering.

Option B (IAM policies on the S3 bucket) is wrong because IAM policies are not column-specific.
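A column-restricted grant might look like the sketch below, which assumes the Lake Formation `grant_permissions` API and its `TableWithColumns` resource. The principal ARN, database, table, and column names are hypothetical, and the client is a recording stand-in so the example runs offline.

```python
def grant_column_select(lakeformation_client, principal_arn: str,
                        database: str, table: str, columns: list) -> dict:
    """Grant SELECT on specific columns of a catalog table."""
    request = {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),  # only these columns become visible
            }
        },
        "Permissions": ["SELECT"],
    }
    lakeformation_client.grant_permissions(**request)
    return request


class RecordingClient:
    """Stand-in for boto3.client('lakeformation') that only records calls."""

    def __init__(self):
        self.calls = []

    def grant_permissions(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
request = grant_column_select(
    client,
    "arn:aws:iam::111122223333:role/data-analyst",  # hypothetical principal
    "sales_db", "orders", ["order_id", "order_date", "total"],
)
print(request["Resource"]["TableWithColumns"]["ColumnNames"])
```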

Full explanation →

1710

MCQmedium

A data engineer is designing a data pipeline that ingests customer data from an on-premises database into Amazon S3. The data contains personally identifiable information (PII). The company policy requires that all PII be masked before it is stored in S3. The pipeline uses AWS DMS for migration and AWS Glue for transformation. The engineer needs to ensure that the masking is applied consistently and that no unmasked data is written to S3. The engineer has set up DMS to replicate data to an S3 bucket, and then a Glue job reads from S3, applies masking, and writes to another S3 bucket. However, there is a risk that unmasked data in the first S3 bucket could be accessed before the Glue job runs. What should the engineer do to mitigate this risk?

A.Configure DMS to apply masking transformations before writing to S3 using DMS's built-in transformation rules.

B.Block all access to the first S3 bucket except for the Glue job's IAM role.

C.Use Amazon Kinesis Data Firehose to stream data directly to Glue for real-time masking.

D.Set an S3 Lifecycle policy on the first bucket to delete objects after 1 hour.

AnswerD

This limits the time unmasked data is available.

Why this answer

Option D is correct because an S3 Lifecycle policy with expiration can automatically delete objects from the first bucket after a short time, reducing the window of exposure. Option A is wrong because DMS does not have native masking capabilities. Option B is wrong because blocking all access would prevent the Glue job from reading.

Option C is wrong because Kinesis is not part of the pipeline.

Full explanation →

1711

Multi-Selecthard

A company is running a critical application that generates millions of small JSON files every hour in an S3 bucket. A data engineer needs to process these files in near real-time using AWS Glue. The engineer wants to minimize the latency between file arrival and Glue job start. Which TWO actions should the engineer take?

Select 2 answers

A.Increase the Glue job's batch window to 600 seconds.

B.Increase the number of DPUs for the Glue job to accelerate processing.

C.Pre-process the files to consolidate them into larger files before the Glue job runs.

D.Use Amazon S3 event notifications to trigger an AWS Lambda function that starts the Glue job upon file arrival.

AnswersC, D

Fewer larger files reduce Glue job overhead and improve throughput.

Why this answer

Using S3 event notifications to trigger a Lambda function that starts the Glue job (Option D) reduces latency compared to scheduled jobs. Consolidating small files into larger ones before Glue processing (Option C) reduces per-file overhead. Option B is wrong because increasing DPUs does not reduce start latency.

Option A is wrong because a longer batch window increases latency.
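The event-driven trigger can be sketched as a function that a Lambda handler would call for each S3 notification. It assumes the standard S3 event record layout and the Glue `start_job_run` API; the job name, the `--source_bucket` and `--source_key` argument names, and the fake client are hypothetical.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "process-json-files"  # hypothetical job name


def start_runs_for_event(event: dict, glue_client, job_name: str = GLUE_JOB_NAME) -> list:
    """Start one Glue job run per object reported in an S3 event notification."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


class FakeGlue:
    """Stand-in for boto3.client('glue') so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def start_job_run(self, **kwargs):
        self.calls.append(kwargs)
        return {"JobRunId": f"jr_{len(self.calls)}"}


sample_event = {"Records": [{"s3": {"bucket": {"name": "incoming-bucket"},
                                    "object": {"key": "2024/file+one.json"}}}]}
glue = FakeGlue()
print(start_runs_for_event(sample_event, glue))
```

With millions of small files per hour, starting one run per object would likely be impractical, which is one reason consolidating files first pairs naturally with this trigger.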

Full explanation →

1712

MCQhard

A data engineer needs to grant a data scientist access to query a Glue Data Catalog database but must prevent the data scientist from seeing the underlying S3 data locations. Which approach should be used?

A.Use a Glue resource policy to restrict access to the database

B.Grant the data scientist IAM permissions to access the Glue Data Catalog and the underlying S3 data

C.Create a VPC endpoint for Glue and S3 to restrict network access

D.Use AWS Lake Formation to grant SELECT permission on the database and tables without granting S3 access

AnswerD

Lake Formation can grant access to the Data Catalog and data without giving direct S3 access, and it can hide the S3 locations.

Why this answer

Option D is correct. Lake Formation can grant SELECT permission on the database and tables, optionally narrowed with column-level and row-level filters, without giving the data scientist any direct S3 access, so the data scientist can query the data without needing to see the S3 path. Option B is wrong because granting IAM access to the underlying S3 data would expose the locations.

Option C is wrong because a VPC endpoint restricts network access but does not hide locations. Option A is wrong because Glue resource policies cannot hide S3 locations.

Full explanation →

1713

MCQeasy

A company receives streaming clickstream data from its website. The data must be ingested with low latency and transformed in real time before being stored in Amazon S3. Which AWS service combination is most suitable for this use case?

A.Amazon S3 with S3 Object Lambda

B.Amazon Kinesis Data Streams with Amazon Kinesis Data Analytics

C.Amazon Kinesis Data Firehose with AWS Lambda for transformation

D.AWS Glue jobs triggered by Amazon S3 events

AnswerB

Kinesis Data Streams provides low-latency ingestion and Kinesis Data Analytics enables real-time transformations.

Why this answer

Option B is correct because Amazon Kinesis Data Streams ingests streaming data with low latency, and Kinesis Data Analytics can perform real-time transformations using SQL or Apache Flink. Option C is wrong because Kinesis Data Firehose is a delivery service that can transform data, but its buffering adds latency. Option D is wrong because Glue jobs triggered by S3 events run as batch ETL after the data has already landed in S3.

Option A is wrong because Amazon S3 is a storage service, not a tool for real-time transformation.

Full explanation →

1714

MCQeasy

A company uses AWS Glue to run ETL jobs daily. The data is stored in S3 as Parquet files partitioned by date. Recently, jobs have failed with the error 'No such file or directory' for certain partitions. What is the MOST likely cause?

A.The schema has changed and Glue cannot parse the data.

B.A partition folder was deleted or not created by the upstream process.

C.The files are compressed with an unsupported codec.

D.The IAM role does not have s3:GetObject permission.

AnswerB

Missing partition leads to 'No such file or directory'.

Why this answer

Option B is correct because if a partition folder is missing, the Glue job fails when it tries to read that path. Option D is wrong because a lack of permissions would produce an access denied error. Option A is wrong because schema changes typically cause type mismatches, not missing files.

Option C is wrong because compression issues cause read errors, not missing files.

Full explanation →

1715

Multi-Selecteasy

A data engineer is setting up a data pipeline using AWS DMS to migrate data from an on-premises database to Amazon RDS for MySQL. The data must be encrypted in transit. Which TWO options can the engineer use? (Choose TWO.)

Select 2 answers

A.Use VPC peering between on-premises and AWS

B.Enable SSL encryption on the DMS endpoint

C.Set up a VPN connection between on-premises and AWS

D.Use KMS to encrypt the DMS connection

E.Use a VPC endpoint for DMS

AnswersB, C

SSL encrypts the connection.

Why this answer

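Options B and C are correct: DMS supports SSL/TLS for encrypting endpoint connections (B), and a VPN creates an encrypted tunnel between on-premises and AWS (C). Option A is wrong because VPC peering connects two VPCs; it does not link an on-premises network to AWS.

Option D is wrong because DMS does not use KMS for transit encryption; KMS keys protect data at rest. Option E is wrong because a VPC endpoint provides private connectivity to the service and does not by itself encrypt the database connection.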

Full explanation →

1716

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for real-time clickstream data from a website. The data must be ingested with low latency (seconds) and made available for multiple consumer applications, including a dashboard that refreshes every minute and a machine learning model that processes data in near-real-time. The engineer needs to choose a streaming ingestion service. Which TWO services meet these requirements? (Select TWO.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.Amazon Managed Streaming for Apache Kafka (Amazon MSK)

C.Amazon Kinesis Data Streams

D.Amazon Simple Queue Service (SQS)

E.Amazon S3

AnswersB, C

MSK is a fully managed Kafka service that provides low-latency streaming and supports multiple consumer groups.

Why this answer

Option C (Kinesis Data Streams) provides low-latency ingestion and allows multiple consumers to process data independently via enhanced fan-out. Option B (Amazon MSK) is also a streaming platform that supports multiple consumers with low latency. Option A (Kinesis Data Firehose) is designed for loading data into destinations with buffering, not for multiple consumers.

Option D (SQS) is a message queue with at-least-once delivery but is not a streaming platform and does not support replay efficiently. Option E (S3) is object storage, not a streaming ingestion service.

Full explanation →

1717

MCQhard

A data engineer is designing a streaming pipeline that ingests data from an Amazon Kinesis Data Stream (with 5 shards) into Amazon S3. The data must be transformed using a complex stateful operation that cannot be done in a Lambda function (limited to 15 minutes). The engineer needs a solution that can maintain state across multiple records. Which service should be used?

A.Amazon EMR running Spark Structured Streaming

B.Amazon Kinesis Data Firehose with Lambda transformation

C.AWS Glue streaming ETL job

D.Amazon Kinesis Data Analytics for Apache Flink

AnswerD

Flink supports stateful stream processing, exactly what is needed.

Why this answer

Option D is correct because Kinesis Data Analytics for Apache Flink supports stateful processing with checkpointing. Option B (Firehose with Lambda transformation) is stateless. Option C (Glue streaming ETL) processes the stream in Spark micro-batches and is a weaker fit for complex stateful operations.

Option A (EMR) requires cluster management.

Full explanation →

1718

MCQhard

A financial services company ingests real-time stock trade data from multiple exchanges into Amazon Kinesis Data Streams. Each trade record is a JSON object with fields: trade_id, symbol, price, quantity, timestamp. The stream has 5 shards. The data is consumed by an AWS Lambda function that aggregates trades per symbol every minute and writes the results to an Amazon DynamoDB table for a real-time dashboard. Recently, the dashboard has been showing outdated data, and the Lambda function is experiencing high error rates. The CloudWatch logs show 'ProvisionedThroughputExceededException' errors from DynamoDB. The DynamoDB table has 10 read capacity units (RCU) and 10 write capacity units (WCU). The average trade volume is 5,000 trades per second across all symbols, and there are 100 symbols. The Lambda function is configured with a batch size of 100 and a 1-minute window. The data volume is expected to double in the next month. As a data engineer, what is the most appropriate course of action?

A.Switch the storage from DynamoDB to Amazon S3 and use Amazon Athena for the dashboard

B.Increase the number of Kinesis shards to 10 to increase Lambda concurrency

C.Increase the DynamoDB write capacity units to 100 and enable auto scaling

D.Use Amazon Kinesis Data Firehose to deliver data to S3 and use Amazon QuickSight for the dashboard

AnswerC

Resolves the write throttling error and auto scaling handles future growth.

Why this answer

Option C is correct because the DynamoDB table is throttling due to insufficient write capacity. With 100 symbols aggregated once per minute, the steady write rate is only about 100 writes per minute (one per symbol), but the aggregated results can arrive in bursts, and the 'ProvisionedThroughputExceededException' errors show that 10 WCU is too low to absorb them.

Increasing WCU to 100 resolves the immediate issue; auto scaling handles future growth. Option B (increase shards) addresses Lambda concurrency but not DynamoDB throttling. Option A (switch to S3 and Athena) changes the architecture and loses real-time capabilities.

Option D (Firehose to S3 with QuickSight) is a delivery path to S3, not a real-time dashboard.
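The capacity arithmetic can be made concrete with a short sketch. It relies on one standard write being billed as one WCU per second for an item of up to 1 KB; the item size and the assumption that all 100 per-symbol writes land within a single second are illustrative, not stated in the question.

```python
import math


def required_wcu(writes: int, seconds: float, item_size_kb: float = 1.0) -> int:
    """WCU needed to absorb `writes` standard writes spread over `seconds`."""
    wcu_per_write = math.ceil(item_size_kb)  # each started 1 KB costs one WCU
    writes_per_second = math.ceil(writes / seconds)
    return writes_per_second * wcu_per_write


# 100 symbols written once per minute, spread evenly over the minute.
steady = required_wcu(writes=100, seconds=60)
# The same 100 writes flushed together within one second.
burst = required_wcu(writes=100, seconds=1)
print(f"steady: {steady} WCU, burst: {burst} WCU")
```

Under these assumptions the evenly spread load fits well inside the existing 10 WCU, while the one-second burst needs the full 100 WCU, which matches the reasoning behind the answer.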

Full explanation →

1719

Multi-Selecteasy

A data engineer is migrating a legacy data warehouse to Amazon Redshift. The engineer needs to load data from multiple sources efficiently. Which THREE services can be used to load data into Redshift? (Choose THREE.)

Select 3 answers

A.Use the COPY command to load from Amazon DynamoDB.

B.Use Kinesis Data Firehose to deliver data directly to Redshift.

C.Use AWS DMS to replicate data continuously.

D.Use S3 Transfer Acceleration.

E.Use the COPY command to load from Amazon S3.

AnswersA, B, E

COPY can load data from DynamoDB tables.

Why this answer

Options A, B, and E are correct: the COPY command loads from Amazon DynamoDB (A) and from Amazon S3 (E), and Kinesis Data Firehose delivers data directly to Redshift (B). Option D is wrong because S3 Transfer Acceleration speeds up uploads to S3 and does not load data into Redshift. Option C is not one of the three keyed answers: AWS DMS does support Redshift as a target, but it is a migration and replication service rather than a direct load mechanism.

Full explanation →

1720

MCQeasy

A company wants to enforce that all data in an S3 bucket is encrypted at rest using AWS KMS. Which bucket policy condition key should be used?

A.s3:x-amz-acl with value bucket-owner-full-control

B.s3:x-amz-server-side-encryption with value aws:kms

C.s3:x-amz-server-side-encryption with value AES256

D.aws:SourceIp with value 10.0.0.0/8

AnswerB

Enforces SSE-KMS.

Why this answer

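The condition key 's3:x-amz-server-side-encryption' with the value 'aws:kms' enforces SSE-KMS encryption. The related key 's3:x-amz-server-side-encryption-aws-kms-key-id' is used to require a specific KMS key ID. Option C is wrong because the value AES256 would enforce SSE-S3 rather than SSE-KMS.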
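One common way to apply this condition key is a bucket policy that denies any upload whose encryption header is not `aws:kms`. The sketch below builds such a policy; the bucket name and statement ID are hypothetical.

```python
import json


def require_sse_kms_policy(bucket: str) -> dict:
    """Bucket policy that rejects PutObject requests not using SSE-KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutSseKms",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


policy = require_sse_kms_policy("example-secure-bucket")
print(json.dumps(policy, indent=2))
```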

Full explanation →

1721

MCQhard

A company ingests streaming data from social media APIs into Kinesis Data Streams. Each record is approximately 5 KB. The data must be enriched with geolocation information from a DynamoDB table before being stored in S3. The enrichment process takes about 200 ms per record. Which architecture minimizes latency and cost?

A.Use an EC2 instance running a custom application to consume from Kinesis, enrich, and write to S3

B.Use AWS Glue ETL jobs running continuously on the stream

C.Use Kinesis Data Analytics to perform enrichment with SQL

D.Use Kinesis Data Firehose with a Lambda function that queries DynamoDB

AnswerD

Firehose with Lambda can perform enrichment per record.

Why this answer

Option D is correct because using Kinesis Data Firehose with a Lambda function for enrichment is efficient for moderate-size records. Option C is wrong because Kinesis Data Analytics is for real-time analytics, not enrichment. Option A is wrong because EC2 adds significant operational overhead.

Option B is wrong because AWS Glue ETL is primarily a batch service, and a job that runs continuously is billed for the whole time it runs.

Full explanation →

1722

MCQmedium

A data engineer is troubleshooting a Kinesis Data Analytics application that processes streaming data. The application is falling behind and has a high 'MillisBehindLatest' metric. The application uses a parallelism of 2. The source stream has 4 shards. What is the MOST likely cause and solution?

A.The application is using a JSON format; switch to Avro.

B.The source stream has too many shards; decrease to 2.

C.The application parallelism is too low; increase it to 4.

D.The output destination is slow; change to a faster sink.

AnswerC

With 4 shards, parallelism should be at least 4 to process all shards concurrently.

Why this answer

The 'MillisBehindLatest' metric indicates the application is not keeping up with the incoming data. With a source stream of 4 shards and a parallelism of only 2, the application cannot process data from all shards concurrently, leading to backpressure. Increasing parallelism to match the shard count (4) allows each shard to be processed by a separate task, reducing lag.

Exam trap

The trap here is that candidates may assume increasing parallelism always improves performance, but the key insight is that parallelism must match or exceed the number of source shards to avoid a concurrency bottleneck, not just be arbitrarily high.

How to eliminate wrong answers

Option A is wrong because changing the serialization format from JSON to Avro reduces data size but does not address the fundamental throughput mismatch between shard count and parallelism; the bottleneck is concurrency, not serialization efficiency. Option B is wrong because reducing the number of shards would decrease the source stream's throughput capacity, potentially causing data loss or throttling; the correct approach is to scale application parallelism to match the existing shards. Option D is wrong because a slow output sink would cause backpressure that manifests as increased 'MillisBehindLatest', but the question states the application is falling behind, and the most direct cause given the parallelism of 2 versus 4 shards is insufficient processing concurrency, not sink performance.

Full explanation →

1723

Multi-Selectmedium

A data engineer needs to protect sensitive data in an S3 bucket. Which TWO AWS services can be used to detect and prevent accidental public access?

Select 2 answers

A.AWS Config

B.AWS Trusted Advisor

C.AWS CloudTrail

D.S3 Block Public Access

E.Amazon Macie

AnswersB, D

Trusted Advisor checks for S3 buckets that have public read/write access.

Why this answer

AWS Trusted Advisor (Option B) checks for S3 buckets with public access. S3 Block Public Access (Option D) can be enabled at the account or bucket level to prevent public access. Option C is wrong because CloudTrail records API calls but does not prevent public access.

Option E is wrong because Macie discovers sensitive data, not public access. Option A is wrong because Config can evaluate rules, but Trusted Advisor and Block Public Access are more direct.

Full explanation →

1724

MCQmedium

A company is using Amazon RDS for MySQL with Multi-AZ deployment. The database experiences intermittent slowdowns during peak hours. The company's DevOps team suspects that the primary instance is overwhelmed. Which action should the team take to distribute the read load without changing the application code?

A.Increase the instance size of the RDS instance.

B.Create a read replica and modify the connection string to point to the replica for read queries.

C.Enable Multi-AZ on the existing instance.

D.Configure DynamoDB Accelerator (DAX) in front of the RDS instance.

AnswerB

Read replicas offload read traffic from the primary instance.

Why this answer

Creating a read replica and modifying the connection string to point to the replica for read queries (Option B) offloads read traffic from the primary RDS instance without requiring application code changes. This directly addresses the intermittent slowdowns during peak hours by distributing the read load, leveraging MySQL’s native replication to keep the replica synchronized. The key constraint is 'without changing the application code,' which is satisfied by simply updating the connection string in the application configuration.

Exam trap

The trap here is that candidates confuse Multi-AZ with read replicas, assuming Multi-AZ can distribute read traffic, but in RDS for MySQL, the standby in a Multi-AZ deployment is not accessible for reads—it only provides failover support.

How to eliminate wrong answers

Option A is wrong because increasing the instance size scales vertically, which does not distribute the read load; it only provides more resources to a single instance, which may still be overwhelmed during peak hours and does not leverage Multi-AZ or read replicas. Option C is wrong because Multi-AZ is already enabled (as stated in the question) and its purpose is high availability and failover, not read load distribution; the standby instance in Multi-AZ cannot serve read traffic. Option D is wrong because DynamoDB Accelerator (DAX) is an in-memory cache for Amazon DynamoDB, not for Amazon RDS for MySQL; it cannot be placed in front of an RDS instance and would require significant application code changes to integrate.

Full explanation →

1725

Matchingmedium

Match each AWS monitoring tool to its primary use.


Concepts

Matches

Metrics, logs, and alarms

API call history and auditing

Trace and analyze distributed applications

Event-driven automation

Resource configuration tracking

Why these pairings

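Each description corresponds to one AWS monitoring service: Amazon CloudWatch covers metrics, logs, and alarms; AWS CloudTrail records API call history for auditing; AWS X-Ray traces and analyzes distributed applications; Amazon EventBridge drives event-driven automation; and AWS Config tracks resource configuration. Together these tools help maintain data pipelines.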

Full explanation →
