AWS Certified Data Engineer Associate (DEA-C01): Questions 826-900

1786 questions total · 24 pages · All types, answers revealed

Page 12 of 24

826
MCQeasy

A data pipeline uses AWS Glue to process data from an S3 data lake. The pipeline fails intermittently with a 'ThrottlingException' when writing to a DynamoDB table. What is the MOST likely cause?

A.The DynamoDB table's write capacity is insufficient for the workload.
B.The network connection between Glue and DynamoDB is unstable.
C.The Glue job's timeout setting is too low.
D.The Glue job does not have sufficient IAM permissions to write to DynamoDB.
AnswerA

ThrottlingException indicates the write capacity is exceeded; increasing capacity or using auto-scaling resolves it.

Why this answer

A ThrottlingException from DynamoDB indicates that the request rate to the table has exceeded the provisioned write capacity. AWS Glue jobs can generate high-throughput writes, and if the DynamoDB table's write capacity units (WCUs) are not sufficient to handle the burst, DynamoDB will throttle the requests. This is the most direct cause of the intermittent failure described.

Exam trap

The trap here is that candidates may confuse ThrottlingException with permission errors (Option D) or network issues (Option B), but AWS specifically tests the understanding that DynamoDB throttling is a capacity management mechanism, not a connectivity or authorization problem.

How to eliminate wrong answers

Option B is wrong because network instability between Glue and DynamoDB would typically result in connection timeouts or retryable network errors, not a specific ThrottlingException which is an application-level error from DynamoDB's API. Option C is wrong because a Glue job's timeout setting controls how long the job can run before being terminated, not how it handles individual API throttling errors; a timeout would cause a different error (e.g., 'Timeout exceeded'). Option D is wrong because insufficient IAM permissions would result in an AccessDeniedException, not a ThrottlingException; the error message directly indicates capacity limits, not authorization failures.
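The capacity argument above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming standard (non-transactional) writes, where one WCU covers one write per second for an item of up to 1 KB. The workload figures and the provisioned value are hypothetical, chosen only for illustration.

```python
import math


def required_wcus(writes_per_second: float, item_size_bytes: int) -> int:
    """Estimate the WCUs needed for standard (non-transactional) writes.

    One WCU covers one write per second for an item up to 1 KB;
    larger items consume one WCU per started 1 KB.
    """
    wcus_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return math.ceil(writes_per_second * wcus_per_write)


# Hypothetical burst from a Glue job: 500 writes/s of 2.5 KB items.
burst_wcus = required_wcus(500, 2560)
provisioned_wcus = 1000  # hypothetical table setting

# If the burst needs more WCUs than are provisioned, throttling is expected.
print(burst_wcus, burst_wcus > provisioned_wcus)
```

When the estimate exceeds the provisioned value, raising the write capacity or enabling auto scaling is the remedy the explanation describes.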

827
MCQeasy

A data engineer configured an AWS Glue job that reads from an S3 bucket and writes to an Amazon Redshift table. The job runs successfully, but the data in Redshift is missing some records that exist in S3. The engineer suspects the issue may be related to the job's bookmarks. What should the engineer do to ensure all records are processed?

A.Update the IAM role to grant additional S3 read permissions.
B.Reset the job bookmark using the AWS Glue API.
C.Increase the number of workers in the Glue job.
D.Disable job bookmarks in the Glue job configuration.
AnswerD

Disabling bookmarks forces reprocessing of all data.

Why this answer

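Option D is correct because disabling job bookmarks forces the Glue job to reprocess all data on every run, which will include the missing records. Option A is wrong because updating the IAM role does not affect bookmarks; the job already reads from S3 successfully. Option C is wrong because increasing the number of workers does not address the bookmark issue.

Option B is the close distractor. AWS Glue does provide a ResetJobBookmark API, which clears the stored bookmark state so that the next run reprocesses everything, but bookmarks remain enabled and later runs can skip records again. The answer key treats disabling bookmarks as the way to ensure that all records are always processed.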

828
Multi-Selectmedium

A company is building a data lake on Amazon S3 and needs to ingest data from multiple sources. The ingestion must be automated and handle schema changes. Which THREE services can be used together to achieve this? (Choose THREE.)

Select 3 answers
A.Amazon Redshift
B.AWS Glue Crawler
C.Amazon Kinesis Data Firehose
D.Amazon EMR
E.AWS Lambda
AnswersB, C, E

Glue Crawler can discover schema and update the Data Catalog.

Why this answer

Options B, C, and E are correct: an AWS Glue crawler can discover schemas and track schema changes, Kinesis Data Firehose can ingest streaming data, and Lambda can transform data. Option A (Amazon Redshift) is a data warehouse, not a data lake ingestion service. Option D (Amazon EMR) is for big data processing, not ingestion automation.

829
MCQeasy

A company stores IoT sensor data in S3 as JSON files. They need to convert the data to Parquet format for efficient querying with Amazon Athena. Which AWS service can perform this transformation with minimal effort?

A.Kinesis Data Firehose
B.Amazon Athena
C.AWS Glue ETL job
D.AWS Lambda
AnswerC

Glue ETL can convert JSON to Parquet.

Why this answer

Option C is correct because an AWS Glue ETL job can easily convert JSON to Parquet. Option B is wrong because Athena is primarily a query engine, not a transformation service. Option D is wrong because Lambda is for small, event-driven transformations.

Option A is wrong because Kinesis Data Firehose is for streaming data, not for files already stored in S3.

830
MCQhard

A data engineering team uses Amazon Redshift for analytics. They notice that queries on a large fact table are slow. The table is distributed using DISTSTYLE ALL. Which design change would most likely improve query performance?

A.Change DISTSTYLE to EVEN to distribute rows evenly across slices.
B.Increase the number of nodes in the Redshift cluster.
C.Change the table to use a SORTKEY on the most frequently filtered column.
D.Change DISTSTYLE to KEY on a column used in frequent joins.
AnswerD

KEY distribution collocates rows on the same node, reducing data movement during joins.

Why this answer

DISTSTYLE ALL copies the entire table to every node, which is inefficient for large fact tables because it wastes storage and network bandwidth during data loading and query execution. Changing to DISTSTYLE KEY on a column used in frequent joins collocates related rows on the same slice, reducing the need to broadcast or redistribute data across the network during joins, which directly improves query performance.

Exam trap

The trap here is that candidates often assume adding a SORTKEY (Option C) is the universal performance fix, but for large fact tables the dominant bottleneck is data distribution and join collocation, not scan efficiency.

How to eliminate wrong answers

Option A is wrong because DISTSTYLE EVEN distributes rows randomly across slices, which can still cause significant data movement during joins and does not leverage join key locality, often leading to slower queries on large fact tables. Option B is wrong because simply increasing the number of nodes adds more slices and parallelism but does not address the root cause of inefficient data distribution; it may even worsen the overhead of broadcasting the ALL-distributed table. Option C is wrong because adding a SORTKEY improves the efficiency of range-restricted scans and ORDER BY operations, but it does not reduce the network shuffling required during joins, which is the primary bottleneck for a large fact table with DISTSTYLE ALL.
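The join collocation argument can be illustrated with a toy model. This is only a sketch under stated assumptions: the slice count, table contents, and the use of Python's `hash` are all hypothetical stand-ins, and Redshift's real placement logic is internal to the service. It counts how many fact rows would have to move to meet their matching dimension row under KEY-style and EVEN-style placement.

```python
from collections import defaultdict

NUM_SLICES = 4  # hypothetical cluster size


def distribute_by_key(rows, key, num_slices=NUM_SLICES):
    """KEY-style placement: the slice is a function of the join column's value."""
    slices = defaultdict(list)
    for row in rows:
        slices[hash(row[key]) % num_slices].append(row)
    return slices


def distribute_even(rows, num_slices=NUM_SLICES):
    """EVEN-style placement: round-robin, ignoring column values."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices


def rows_needing_movement(fact_slices, dim_slices, key):
    """Count fact rows whose matching dimension row sits on another slice."""
    dim_location = {
        row[key]: slice_id for slice_id, rows in dim_slices.items() for row in rows
    }
    return sum(
        1
        for slice_id, rows in fact_slices.items()
        for row in rows
        if dim_location[row[key]] != slice_id
    )


# Hypothetical tables: 8 customers and 100 orders joined on customer_id.
customers = [{"customer_id": i} for i in range(8)]
orders = [{"order_id": n, "customer_id": (n * 3) % 8} for n in range(100)]

moved_key = rows_needing_movement(
    distribute_by_key(orders, "customer_id"),
    distribute_by_key(customers, "customer_id"),
    "customer_id",
)
moved_even = rows_needing_movement(
    distribute_even(orders), distribute_even(customers), "customer_id"
)
print(moved_key, moved_even)
```

In this toy data set, KEY placement leaves every order on the same slice as its customer, while round-robin placement separates half of them, which is the data movement the explanation refers to.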

831
Multi-Selecteasy

A data engineer needs to ingest JSON files from an Amazon S3 bucket into an Amazon DynamoDB table. The files are uploaded every hour. Which THREE services can be used together to build this ingestion pipeline?

Select 3 answers
A.AWS Step Functions
B.Amazon SQS
C.Amazon DynamoDB Streams
D.Amazon S3 Event Notifications
E.AWS Lambda
AnswersB, D, E

SQS can decouple S3 events from Lambda for reliability.

Why this answer

Amazon SQS is correct because it decouples the ingestion pipeline, allowing S3 Event Notifications to send messages to an SQS queue when new JSON files arrive. AWS Lambda can then poll the SQS queue to process the files and write to DynamoDB, ensuring reliable, asynchronous ingestion without data loss.

Exam trap

The trap here is that candidates often confuse DynamoDB Streams (for capturing table changes) with the ingestion pipeline itself, or incorrectly assume Step Functions is needed for simple event-driven workflows, when SQS+Lambda is the standard serverless pattern for this use case.
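The S3-to-SQS-to-Lambda pattern described above can be sketched as follows. This is a minimal, hedged example: it assumes S3 Event Notifications are delivered directly to the queue (so each SQS message body is the S3 notification JSON), the function names are hypothetical, and the DynamoDB write is left as a stub rather than a real call.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an SQS batch of S3 event notifications."""
    objects = []
    for message in sqs_event.get("Records", []):
        notification = json.loads(message["body"])
        # S3 test events carry no "Records" list, so default to empty.
        for record in notification.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event, context):
    """Hypothetical Lambda entry point; the DynamoDB write is only a stub."""
    for bucket, key in extract_s3_objects(event):
        print(f"would load s3://{bucket}/{key} into DynamoDB")
```

Keeping the parsing in a separate function makes it easy to unit test without AWS access; the actual read from S3 and write to DynamoDB would go where the stub prints.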

832
Multi-Selecthard

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application is experiencing high latency and checkpoint failures. Which THREE actions should the data engineer take to improve performance and reliability? (Choose three.)

Select 3 answers
A.Increase the parallelism of the Flink application
B.Configure the application to use event time processing instead of processing time
C.Increase the checkpoint interval to reduce the frequency of checkpoints
D.Decrease the parallelism to reduce resource contention
E.Disable checkpointing to avoid checkpoint failures
AnswersA, B, C

Higher parallelism improves throughput.

Why this answer

Options A, B, and C are correct. Option A: Increasing parallelism improves throughput. Option C: Increasing the checkpoint interval reduces checkpoint failures.

Option B: Using event time helps with out-of-order data. Option D is wrong because decreasing parallelism reduces throughput. Option E is wrong because disabling checkpointing hurts reliability.

833
MCQeasy

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is then consumed by a custom application for real-time analytics. Recently, the application has been experiencing high latency. The operations team suspects the shard count is insufficient. How should the team increase the shard count of the existing stream?

A.Use the UpdateShardCount API to increase the shard count for the stream.
B.Delete the existing stream and create a new one with a higher shard count.
C.Manually split a shard using the SplitShard API on each existing shard.
D.Modify the PutRecord calls to include a new shard key that distributes data across more shards.
AnswerA

UpdateShardCount correctly increases shards.

Why this answer

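Option A is correct because the UpdateShardCount API increases the shard count of an existing stream in a single call. Option C is wrong because splitting every shard by hand with SplitShard is possible but tedious and error-prone; UpdateShardCount performs the splits for you. Option D is wrong because changing the partition key in PutRecord calls only changes how records map to existing shards; it cannot change the shard count.

Option B is wrong because deleting and recreating the stream loses the data still in the stream and interrupts ingestion.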
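A resharding call along these lines might look like the sketch below. It assumes a boto3-style Kinesis client is passed in; `scale_stream` and the stream name are hypothetical. The guard reflects the documented limit, as I understand it, that a single UpdateShardCount call cannot more than double or go below half of the current shard count.

```python
def scale_stream(kinesis_client, stream_name: str, current_shards: int, target_shards: int):
    """Request a new shard count for an existing stream via UpdateShardCount."""
    # Assumed limit: one call cannot more than double, or go below half of,
    # the current shard count.
    if target_shards > 2 * current_shards or target_shards < current_shards / 2:
        raise ValueError("target shard count is outside the range allowed in one call")
    return kinesis_client.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target_shards,
        ScalingType="UNIFORM_SCALING",
    )


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   scale_stream(boto3.client("kinesis"), "clickstream", 4, 8)
```

Passing the client in as a parameter keeps the function testable with a stub and avoids importing boto3 at module load.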

834
MCQmedium

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data must be transformed (e.g., enrich with user location) before being stored in Amazon S3. Which architecture is MOST efficient for this transformation?

A.Use AWS Glue to run a streaming ETL job.
B.Use Amazon EMR to consume the stream using Spark Streaming.
C.Use AWS Lambda to process each record from the stream and write to S3.
D.Use Amazon Kinesis Data Analytics to transform the stream and output to Amazon Kinesis Data Firehose, which writes to S3.
AnswerD

Kinesis Data Analytics can run SQL on the stream, and Firehose delivers to S3 in batches.

Why this answer

Option D is correct because Kinesis Data Analytics can run SQL on the stream for enrichment, then output to Firehose for delivery to S3. Option C is wrong because Lambda can process each record but may be slower and more expensive for high throughput. Option A is wrong because a Glue streaming ETL job can perform the transformation but runs as a Spark job that processes micro-batches, which is heavier than needed here.

Option B is wrong because EMR is more complex than needed.

835
MCQmedium

A company is using AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job fails intermittently with a 'MemoryError' when the stream has a sudden spike in data volume. Which configuration change would best prevent this error?

A.Increase the number of DPUs (Data Processing Units) for the Glue job.
B.Store intermediate results in Amazon RDS.
C.Use a batch transformation instead of streaming.
D.Increase the number of shards in the Kinesis data stream.
AnswerA

More DPUs provide more memory to handle spikes.

Why this answer

Option A is correct because increasing the number of DPUs in the AWS Glue job provides more memory for processing spikes. Option D is wrong because Kinesis shard count affects throughput, not memory. Option C is wrong because batch processing does not help streaming jobs.

Option B is wrong because RDS is unrelated to Glue memory.

836
MCQeasy

A company uses AWS Glue ETL jobs to transform data in Amazon S3. The data arrives in JSON format but needs to be converted to Parquet for efficient querying. Which AWS Glue feature should be used to infer the schema and generate transformation code?

A.Amazon S3 Select
B.Amazon Athena
C.Amazon Kinesis Data Analytics
D.AWS Glue crawlers
AnswerD

Crawlers populate the Data Catalog with schema information used by Glue ETL jobs.

Why this answer

Option D is correct because AWS Glue crawlers populate the Data Catalog with table metadata and schema, and the ETL job can use that schema to convert JSON to Parquet. Option B is wrong because Athena is for querying, not schema inference. Option A is wrong because S3 Select operates on individual files.

Option C is wrong because Kinesis Data Analytics is for streaming.

837
MCQeasy

A company wants to transform data in Amazon S3 using SQL queries without provisioning servers. The transformations are ad-hoc and run occasionally. Which service should be used?

A.AWS Glue
B.Amazon Redshift Spectrum
C.Amazon EMR
D.Amazon Athena
AnswerD

Athena is serverless and supports SQL queries directly on S3 data.

Why this answer

Amazon Athena allows querying S3 data with standard SQL without server management. AWS Glue is for scheduled ETL jobs. Amazon EMR requires cluster provisioning.

Amazon Redshift Spectrum queries S3 but requires a Redshift cluster.

838
MCQmedium

A company is using AWS DMS to migrate a 5 TB SQL Server database to Amazon Aurora PostgreSQL. The migration is using full load plus CDC. After the full load completes, the ongoing replication task is failing with errors related to large transactions on the source. The team needs to ensure that CDC continues without falling behind. What should the team do?

A.Use Amazon Kinesis Data Streams as an intermediate target for CDC.
B.Increase the DMS replication instance size to provide more memory and CPU.
C.Modify the DMS task settings to increase MaxFileSize and decrease the CommitRate.
D.Disable foreign key constraints on the target Aurora database.
AnswerC

Allows DMS to break down large transactions into smaller commits.

Why this answer

Option C is correct because increasing MaxFileSize and reducing CommitRate in the DMS task settings allows DMS to handle large transactions by committing more frequently and using larger log files. Option B is wrong because increasing the instance size helps but does not directly address large transaction handling. Option D is wrong because disabling foreign keys is a schema change, not a CDC fix.

Option A is wrong because routing CDC through Kinesis Data Streams adds an intermediate hop without changing how large source transactions are handled, and the changes would still have to be applied to Aurora.

839
MCQmedium

A data engineer needs to audit all access to an S3 bucket for compliance. They want to capture object-level operations such as GetObject and PutObject, as well as bucket-level operations like ListBucket. Which AWS service should be used?

A.Amazon CloudWatch Logs
B.S3 server access logs
C.AWS CloudTrail management events
D.AWS Config
AnswerB

S3 server access logs provide detailed records about requests made to a bucket, including object-level and bucket-level operations.

Why this answer

S3 server access logs record object-level and bucket-level operations. CloudTrail can also record S3 API calls, but by default it logs bucket-level operations only; object-level logging requires enabling data events. Option C is wrong because CloudTrail management events do not include object-level operations.

Option A is wrong because CloudWatch Logs alone does not capture S3 access. Option D is wrong because Config records resource configuration changes, not API calls.

840
MCQhard

A data engineering team uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. They notice that the application's checkpointing is failing intermittently, causing data reprocessing. The application uses a large state. Which configuration change should the team make to improve checkpoint reliability?

A.Disable checkpointing to avoid failures.
B.Switch the state backend from in-memory to RocksDB.
C.Increase the parallelism of the application.
D.Increase the checkpointing interval.
AnswerD

Longer intervals reduce checkpoint frequency and associated failures.

Why this answer

Increasing the checkpointing interval reduces the frequency of checkpoint operations, giving the system more time to complete each checkpoint before the next one starts. This alleviates backpressure and resource contention, which is critical when dealing with large state, as checkpointing large state is I/O and CPU intensive and can fail if intervals are too tight.

Exam trap

The trap here is that candidates often confuse improving state backend performance (RocksDB) with fixing checkpoint reliability, when the root cause is checkpoint timing pressure, not state storage efficiency.

How to eliminate wrong answers

Option A is wrong because disabling checkpointing eliminates fault tolerance entirely, which would cause data loss on failure and is not a valid reliability improvement. Option B is wrong because switching to RocksDB improves state storage efficiency and reduces memory pressure, but it does not directly address checkpoint failures caused by overly frequent checkpointing; RocksDB can even increase checkpoint duration due to disk I/O. Option C is wrong because increasing parallelism distributes workload but also increases the number of concurrent checkpoint operations and network overhead, potentially worsening checkpoint failures when state is large.

841
MCQmedium

Refer to the exhibit. A data engineer runs the above CLI command and sees the output. The security team requires that the RDS instance not be accessible from the internet. Which change should the engineer make?

A.Change the storage type to io1 for better performance.
B.Modify the DB instance to set PubliclyAccessible to false.
C.Enable Multi-AZ deployment to improve security.
D.Update the VPC security group to deny inbound traffic from 0.0.0.0/0.
AnswerB

This removes the public IP address.

Why this answer

Option B is correct because setting PubliclyAccessible to false removes the public IP address. Option C (Multi-AZ) is for high availability, not network exposure. Option A (storage type) does not affect accessibility.

Option D (security group) can restrict traffic, but the public IP address is still assigned.

842
MCQmedium

A data engineer is designing a data lake on Amazon S3 and needs to ensure that objects are automatically encrypted at rest using server-side encryption with AWS KMS. Which bucket policy statement achieves this?

A.Deny PutObject requests where the x-amz-server-side-encryption header is not set to aws:kms.
B.Deny PutObject requests that do not include the x-amz-server-side-encryption header.
C.Deny PutObject requests where the x-amz-server-side-encryption header is not set to AES256.
D.Allow PutObject requests only if the x-amz-server-side-encryption header is set to AES256.
AnswerA

Enforces SSE-KMS encryption.

Why this answer

Option A is correct because it enforces server-side encryption with AWS KMS (SSE-KMS) by denying any PutObject request that does not include the `x-amz-server-side-encryption` header set to `aws:kms`. This bucket policy ensures that all objects written to the S3 bucket are automatically encrypted at rest using AWS KMS, meeting the requirement for mandatory encryption with a specific key management service.

Exam trap

The trap here is that candidates often confuse the encryption header values (`aws:kms` vs `AES256`) and mistakenly choose an option that enforces SSE-S3 (AES256) instead of SSE-KMS, or they pick a Deny statement that only checks for the presence of the header without validating its specific value.

How to eliminate wrong answers

Option B is wrong because it only denies PutObject requests that omit the `x-amz-server-side-encryption` header; it does not enforce the value `aws:kms`, so a request with the header set to `AES256` (SSE-S3) would still be allowed, which falls short of the requirement for KMS encryption. Option C is wrong because it denies PutObject requests where the header is not set to `AES256`, which would enforce SSE-S3 instead of SSE-KMS, directly contradicting the requirement for AWS KMS encryption. Option D is wrong because it allows PutObject requests only if the header is set to `AES256`, which again enforces SSE-S3, not SSE-KMS, and an Allow statement alone does not block requests that omit the header entirely, leaving a gap for unencrypted uploads.
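A Deny statement of the kind described in Option A could be generated as in the sketch below. The bucket name, the `Sid`, and both function names are hypothetical. The `is_denied` helper is a deliberately simplified local model of the `StringNotEquals` check (a missing header counts as a mismatch), not a reproduction of full IAM policy evaluation.

```python
import json

SSE_HEADER = "x-amz-server-side-encryption"


def sse_kms_only_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies uploads not requesting SSE-KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonKmsUploads",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {f"s3:{SSE_HEADER}": "aws:kms"}
                },
            }
        ],
    }


def is_denied(policy: dict, request_headers: dict) -> bool:
    """Simplified model: deny when the header is absent or not the required value."""
    condition = policy["Statement"][0]["Condition"]["StringNotEquals"]
    return request_headers.get(SSE_HEADER) != condition[f"s3:{SSE_HEADER}"]


policy = sse_kms_only_policy("example-data-lake")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

Under this model, a request with no encryption header and a request with `AES256` are both denied, while a request with `aws:kms` passes, which matches the distinction the answer options are testing.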

843
MCQhard

A data pipeline using Amazon Kinesis Data Streams is experiencing high consumer lag. The stream has 10 shards. The consumer is an AWS Lambda function that processes each record and writes to Amazon DynamoDB. What is the MOST likely cause of the lag?

A.The Lambda function's reserved concurrency is set too low
B.The DynamoDB table's write capacity is throttling writes
C.The number of shards is insufficient for the data volume
D.The Lambda function is not authorized to read from Kinesis
AnswerA

Low concurrency limits parallel processing of shards.

Why this answer

Option A is correct because if the Lambda function's concurrency limit is reached, it cannot process all 10 shards simultaneously, causing lag. Option C is wrong because increasing shards would increase parallelism but is not the root cause. Option D is wrong because a function that is not authorized to read from Kinesis would fail to consume any records rather than fall behind.

Option B is wrong because DynamoDB write capacity could be a bottleneck but is less likely than Lambda concurrency.

844
MCQeasy

An e-commerce company wants to capture clickstream data from its website and store it in Amazon S3 for analytics. The data arrives continuously and the company needs near-real-time processing. Which solution is most appropriate?

A.AWS Data Pipeline
B.AWS Snowball Edge
C.Amazon Kinesis Data Firehose
D.Amazon S3 Transfer Acceleration
AnswerC

Firehose captures streaming data and delivers to S3 with low latency.

Why this answer

Option C is correct because Kinesis Data Firehose is a fully managed service for loading streaming data into S3 with near-real-time delivery. Option D is wrong because S3 Transfer Acceleration is for faster uploads, not streaming. Option A is wrong because Data Pipeline is for batch processing.

Option B is wrong because Snowball Edge is an offline transfer device.

845
MCQhard

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is compressed with GZIP and partitioned by year, month, day, and hour. The delivery stream is configured to buffer up to 5 MB or 60 seconds. Some records are missing from S3. What is the most likely cause?

A.The S3 bucket does not have sufficient permissions
B.The data compression format is incompatible with S3
C.The Lambda transformation function timed out and records were skipped
D.The partition key configuration is incorrect
AnswerC

Firehose drops records if the transformation Lambda exceeds the timeout.

Why this answer

Option C is correct because Kinesis Data Firehose transformation Lambda functions have a 5-minute timeout; if the transformation takes longer, records are dropped. Option B is wrong because GZIP compression is supported. Option D is wrong because partitioning does not cause data loss.

Option A is wrong because S3 permissions would cause delivery failures, not silent missing records.

846
MCQhard

Refer to the exhibit. A data engineer is configuring an IAM policy for a Lambda function that writes transformed data to S3. The function writes to both 'example-bucket/data/' and 'example-bucket/public/'. The policy is intended to enforce server-side encryption with SSE-S3 for all objects written to the 'public/' prefix, while allowing all operations on other prefixes. However, the Lambda function is failing with an AccessDenied error when writing to 'example-bucket/public/'. What is the most likely cause?

A.The policy denies DeleteObject on 'public/'.
B.The policy denies PutObject on 'public/' unconditionally.
C.The policy does not allow GetObject for 'public/'.
D.The Lambda function is not setting the 'x-amz-server-side-encryption' header to 'AES256' when writing to 'public/'.
AnswerD

The Deny condition requires AES256 encryption.

Why this answer

Option D is correct because the Deny statement uses a condition that denies PutObject if the encryption is not AES256. If the Lambda function does not request SSE-S3, the condition matches and the request is denied. Option B is wrong because the Deny on PutObject is conditional, and the policy otherwise allows PutObject on all resources.

Option A is wrong because DeleteObject is not denied. Option C is wrong because GetObject is allowed.

847
MCQmedium

A company is using an Amazon RDS for MySQL database for an e-commerce application. During a sales event, the database experiences high read traffic, causing slow query performance. The company wants to reduce the read load on the primary database without changing the application code. Which solution meets these requirements?

A.Enable Multi-AZ on the RDS instance.
B.Create an Amazon RDS read replica and direct read traffic to it.
C.Increase the instance size of the RDS database.
D.Deploy Amazon ElastiCache to cache query results.
AnswerB

Read replicas handle read-only traffic, reducing load on the primary.

Why this answer

Option B is correct because an RDS read replica offloads read queries from the primary DB without application changes. Option D is wrong because ElastiCache requires code changes to cache queries. Option A is wrong because Multi-AZ is for high availability, not read scaling.

Option C is wrong because increasing instance size helps but may require downtime and is less efficient.

848
Multi-Selecteasy

A data engineer is designing a data ingestion pipeline for real-time clickstream data. Which TWO services can be used to ingest the data into Amazon Kinesis Data Streams?

Select 2 answers
A.Amazon S3
B.Kinesis Producer Library (KPL)
C.Kinesis Data Firehose
D.AWS SDK
E.AWS Glue
AnswersB, D

KPL is designed to send data to Kinesis Data Streams efficiently.

Why this answer

Options B and D are correct. The Kinesis Producer Library (KPL) is a library for producers to send data to Kinesis Data Streams. The AWS SDK can also be used directly.

Option C is wrong because Kinesis Data Firehose is a downstream consumer, not a producer. Option E is wrong because AWS Glue is an ETL service, not a producer. Option A is wrong because S3 is a destination, not a producer.

849
Multi-Selectmedium

A data engineer is setting up Amazon CloudWatch alarms for an Amazon Redshift cluster. The engineer wants to be alerted when the disk space usage exceeds 80% for more than 5 minutes and when the CPU utilization exceeds 90% for more than 10 minutes. Which TWO CloudWatch metrics and conditions should the engineer use? (Choose two.)

Select 2 answers
A.Metric: CPUUtilization; Condition: > 90 for 10 minutes
B.Metric: DatabaseConnections; Condition: > 500 for 5 minutes
C.Metric: NetworkReceiveThroughput; Condition: > 1 GB for 10 minutes
D.Metric: WLMQueueLength; Condition: > 100 for 5 minutes
E.Metric: PercentageDiskSpace; Condition: > 80 for 5 minutes
AnswersA, E

This alarm triggers on CPU usage.

Why this answer

Options A and E are correct. Option E: PercentageDiskSpace metric with a threshold of 80, sustained for 5 minutes. Option A: CPUUtilization metric with a threshold of 90, sustained for 10 minutes.

Option D is wrong because WLMQueueLength measures queued queries, not disk usage. Option C is wrong because NetworkReceiveThroughput is not relevant. Option B is wrong because DatabaseConnections is not about CPU.
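The two alarm conditions can be expressed as parameter sets in the shape that CloudWatch's PutMetricAlarm accepts. This is a sketch under assumptions: the helper name, alarm names, and cluster identifier are hypothetical, and a 60-second period is assumed so that "for 5 minutes" becomes five evaluation periods. Note that the quiz option writes the disk metric as PercentageDiskSpace, while the metric published in the AWS/Redshift namespace is named PercentageDiskSpaceUsed, which is what the sketch uses.

```python
def redshift_alarm(name, metric, threshold, minutes, cluster_id, period_seconds=60):
    """Build keyword arguments in the shape CloudWatch's PutMetricAlarm expects."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/Redshift",
        "MetricName": metric,
        "Dimensions": [{"Name": "ClusterIdentifier", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        # "For N minutes" becomes N consecutive periods at a 60-second period.
        "EvaluationPeriods": (minutes * 60) // period_seconds,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


CLUSTER = "analytics-cluster"  # hypothetical cluster identifier
disk_alarm = redshift_alarm("disk-over-80", "PercentageDiskSpaceUsed", 80, 5, CLUSTER)
cpu_alarm = redshift_alarm("cpu-over-90", "CPUUtilization", 90, 10, CLUSTER)

# With boto3 installed and credentials configured, each dict could be passed as:
#   boto3.client("cloudwatch").put_metric_alarm(**disk_alarm)
print(disk_alarm["EvaluationPeriods"], cpu_alarm["EvaluationPeriods"])
```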

850
MCQhard

A company has multiple AWS accounts and wants to centrally manage permissions and access to data lakes. They have enabled AWS Organizations and want to use a single set of policies that apply to all accounts. Which policy type should be used at the organization level?

A.IAM policies
B.KMS key policies
C.S3 bucket policies
D.Service control policies (SCPs)
AnswerD

SCPs centrally manage permissions across all accounts in an organization.

Why this answer

Option D is correct because service control policies (SCPs) are used in AWS Organizations to centrally manage permissions across accounts. Option A (IAM policies) are attached to IAM users/roles within an account, not across accounts. Option C (S3 bucket policies) are specific to S3 buckets.

Option B (KMS key policies) control access to KMS keys.

851
Multi-Selectmedium

Which TWO actions should a data engineer take to protect sensitive data in an Amazon S3 bucket from being accessed by unauthorized users? (Select TWO.)

Select 2 answers
A.Create a VPC endpoint for S3
B.Enable S3 server access logging
C.Add a bucket policy with a Deny effect for unauthorized principals
D.Enable AWS CloudTrail for the bucket
E.Enable S3 Block Public Access
AnswersC, E

A Deny policy explicitly denies access.

Why this answer

Options C and E are correct. Option E (S3 Block Public Access) prevents public access to the bucket. Option C (bucket policy with Deny effect) explicitly denies access to unauthorized users.

Option B (S3 server access logging) is for auditing, not prevention. Option D (CloudTrail) is for logging, not prevention. Option A (VPC endpoint) is for network connectivity, not access control.

852
MCQeasy

A company needs to ingest data from an on-premises MySQL database into Amazon S3 for analytics. The database is 2 TB in size. The company has a low-bandwidth internet connection (10 Mbps). They need to perform an initial full load and then incremental updates every hour. Which approach should they use?

A.Use Kinesis Data Firehose to stream data from MySQL to S3.
B.Use AWS Database Migration Service (DMS) to perform the full load and ongoing replication.
C.Use AWS Glue ETL jobs to extract data and load into S3.
D.Use AWS Snowball Edge to transfer the initial full load, then use AWS DataSync for incremental updates.
AnswerB

DMS supports full load and CDC with low bandwidth.

Why this answer

Option B is correct because AWS Database Migration Service (DMS) supports full load and ongoing replication, and can be used with limited bandwidth. Option D is wrong because Snowball Edge is for offline transfer, not ongoing replication. Option C is wrong because Glue ETL is not optimized for continuous replication.

Option A is wrong because Kinesis Data Firehose is for streaming data, not database replication.

853
MCQmedium

Refer to the exhibit. A data engineer is troubleshooting a Kinesis Data Streams consumer that is falling behind. The stream has 2 shards and is receiving data at a rate of 2 MB/s. The consumer is an AWS Lambda function with a batch size of 100 records. What should the engineer do to improve consumer throughput?

A.Decrease the Lambda batch size to 10 records
B.Increase the retention period of the stream to 168 hours
C.Increase the number of shards in the stream to 4
D.Increase the memory allocation of the Lambda function
AnswerC

More shards increase parallelism and throughput for both producers and consumers.

Why this answer

Option C is correct because the stream is at its write limit: each shard supports 1 MB/s of input, so 2 shards provide exactly the 2 MB/s being ingested, leaving no headroom, and the consumer may need more read capacity. Increasing the number of shards increases both write and read capacity. Option D (increasing Lambda memory) may help but is not as effective.

Option B (increasing retention) does not help throughput. Option A (decreasing batch size) reduces throughput.
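The shard arithmetic behind this answer can be checked with a short calculation. The sketch below assumes the standard per-shard limits of 1 MB/s for writes and 2 MB/s for shared reads; the function names and the headroom parameter are hypothetical.

```python
import math

WRITE_MB_PER_SHARD = 1.0  # ingest limit per shard
READ_MB_PER_SHARD = 2.0   # shared-throughput read limit per shard


def write_utilization(ingest_mb_per_s: float, shards: int) -> float:
    """Fraction of the stream's total write capacity currently in use."""
    return ingest_mb_per_s / (shards * WRITE_MB_PER_SHARD)


def shards_needed(ingest_mb_per_s: float, headroom: float = 0.0) -> int:
    """Shards required for the ingest rate plus a fractional headroom."""
    return math.ceil(ingest_mb_per_s * (1 + headroom) / WRITE_MB_PER_SHARD)


# Scenario from the question: 2 MB/s arriving on 2 shards.
before = write_utilization(2.0, 2)
after = write_utilization(2.0, 4)
target = shards_needed(2.0, headroom=1.0)  # 100% headroom, a hypothetical choice
print(before, after, target)
```

At 2 shards the stream is fully utilized; at 4 shards it runs at half capacity. Because Lambda by default runs one concurrent invocation per shard, doubling the shards also doubles the number of batches processed in parallel.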

854
MCQmedium

A data engineer is tasked with reducing costs for an Amazon Redshift cluster. The cluster is used for both ETL workloads and BI reporting. The engineer observes that the cluster is over-provisioned during off-peak hours. Which action would be MOST effective in reducing costs while maintaining performance during peak hours?

A.Switch to RA3 node types for managed storage.
B.Enable concurrency scaling to automatically add cluster capacity during peak hours.
C.Purchase Reserved Instances for the cluster.
D.Reduce the number of nodes in the cluster.
AnswerB

Concurrency scaling adds transient clusters only when needed, reducing cost during off-peak hours.

Why this answer

Option B is correct because concurrency scaling adds additional capacity on demand and is cost-effective for variable workloads. Option C is incorrect because Reserved Instances require a commitment and suit steady-state workloads, not variable ones. Option D is incorrect because reducing node count may impact performance during peak hours.

Option A is incorrect because RA3 nodes are for managed storage, not cost reduction for variable workloads.

855
MCQeasy

A data engineer needs to transform JSON data from an S3 bucket into Parquet format and load it into Amazon Redshift. The transformation must be performed incrementally as new data arrives. Which AWS service is BEST suited for this task?

A.Use AWS Lambda to transform the data on the fly and write to Redshift.
B.Use Amazon EMR with Apache Spark to transform the data and load it into Redshift.
C.Use AWS Glue to create an ETL job that runs on a schedule or trigger.
D.Use Amazon Kinesis Data Firehose to transform and load data into Redshift in real time.
AnswerC

AWS Glue is a serverless ETL service that can handle incremental transformations and load data into Redshift efficiently.

Why this answer

Option C is correct because AWS Glue provides a serverless ETL service that can run jobs triggered by S3 events to transform data incrementally and load it into Redshift. Option B (Amazon EMR) is better suited to large-scale big data processing and requires cluster management. Option A (AWS Lambda) can be used for simple transformations but may hit time limits for complex transformations.

Option D (Amazon Kinesis Data Firehose) is for streaming data, not for batch transformation of existing S3 objects.

856
MCQhard

A company is ingesting streaming data from multiple sources using Amazon Kinesis Data Streams. The data is then processed by an AWS Lambda function that transforms the records and writes them to an Amazon S3 bucket. The Lambda function is failing intermittently with timeout errors. The average record size is 5 KB, and the shard count is 2. What is the MOST likely cause of the timeout errors?

A.The Lambda function timeout is set too low for the processing time required.
B.The Kinesis data retention period is too short, causing data to be lost before processing.
C.The Lambda function's reserved concurrency is set too low, causing throttling.
D.The Lambda function is receiving too many records per invocation, exceeding the 6 MB payload limit.
AnswerA

The default Lambda timeout is 3 seconds, which may not be sufficient for processing each batch of records and writing to S3.

Why this answer

Option A is correct because the default Lambda timeout is 3 seconds, which may not be sufficient for processing and writing records, especially if the transformation logic is complex or the S3 write takes longer than expected. Option B is incorrect because Kinesis Data Streams has a default retention period of 24 hours, and a short retention period would cause data loss rather than timeouts. Option D is incorrect because the batch size per Lambda invocation can be adjusted, but the default is 100 records, which at 5 KB each is only 500 KB, well within the 6 MB payload limit.

Option C is incorrect because Lambda concurrency limits affect scaling, not individual invocation timeouts.
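
The payload estimate in the explanation is simple arithmetic, sketched below. It uses the figures from the question (100 records, 5 KB each, 6 MB limit) and ignores encoding and metadata overhead, which is a simplifying assumption; the function name is illustrative.

```python
# Invocation payload limit named in the question, expressed in KB.
LAMBDA_PAYLOAD_LIMIT_KB = 6 * 1024


def batch_payload_kb(batch_size: int, avg_record_kb: float) -> float:
    """Approximate size of one batch, ignoring encoding and metadata overhead."""
    return batch_size * avg_record_kb


if __name__ == "__main__":
    payload = batch_payload_kb(100, 5)
    print(f"{payload} KB of {LAMBDA_PAYLOAD_LIMIT_KB} KB")
    print("within limit:", payload < LAMBDA_PAYLOAD_LIMIT_KB)
```

At roughly 500 KB per batch, the payload is far below the limit, which is why the payload-size option can be ruled out.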

857
Multi-Selecthard

A company uses Amazon RDS for MySQL as a source for AWS DMS to replicate data to S3. The replication task is failing with 'OutOfMemory' errors on the DMS instance. The source table has 10 million rows with large BLOB columns. Which THREE changes would most likely resolve the issue?

Select 3 answers
A.Set the LOB column settings to 'Limited LOB mode' and specify a max LOB size.
B.Disable logging for the DMS task to free memory.
C.Enable Full LOB mode to handle LOBs more efficiently.
D.Increase the DMS replication instance size to a compute-optimized class.
E.Increase the number of parallel threads in the task settings.
AnswersA, D, E

Limited LOB mode avoids loading entire LOBs into memory.

Why this answer

Option A is correct because setting LOB columns to 'Limited LOB mode' with a specified max LOB size prevents DMS from loading entire LOBs into memory. Instead, DMS truncates LOBs to the specified size, reducing memory consumption and avoiding OutOfMemory errors when replicating large BLOB columns from MySQL to S3.

Exam trap

The trap here is that candidates often assume Full LOB mode is always the safest choice for large objects, but it actually increases memory usage and can cause OutOfMemory errors, whereas Limited LOB mode with a max size is the correct memory-saving approach.
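
As a rough illustration of what Limited LOB mode looks like in practice, the sketch below builds the `TargetMetadata` portion of a DMS task-settings document. The helper name and the 32 KB cap are assumptions chosen for the example; `LobMaxSize` is expressed in kilobytes, and LOBs larger than the cap are truncated.

```python
import json


def limited_lob_settings(max_lob_kb: int) -> dict:
    """Task-settings fragment that enables Limited LOB mode with a size cap (KB)."""
    return {
        "TargetMetadata": {
            "SupportLobs": True,         # replicate LOB columns at all
            "FullLobMode": False,        # do not fetch whole LOBs piece by piece
            "LimitedSizeLobMode": True,  # truncate LOBs to LobMaxSize
            "LobMaxSize": max_lob_kb,    # cap in kilobytes (illustrative value)
        }
    }


if __name__ == "__main__":
    print(json.dumps(limited_lob_settings(32), indent=2))
```

The cap should be chosen from the actual size distribution of the BLOB columns, since anything above it is truncated on the target.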

858
MCQmedium

A data engineer is troubleshooting a slow-running query on an Amazon Redshift cluster. The query involves joining two large tables. The engineer notices that the query plan shows a large number of distribution and broadcast operations. Which design change would most likely improve query performance?

A.Change the distribution style of both tables to ALL
B.Change the distribution style of both tables to KEY on the join column
C.Change the distribution style of both tables to EVEN
D.Add a sort key on the join column
AnswerB

KEY distribution on the join column ensures matching rows are on the same node, reducing redistribution.

Why this answer

Option B is correct because changing the distribution style of both tables to KEY on the join column ensures that rows with the same join key value are co-located on the same node. This eliminates the need for expensive broadcast or redistribution operations during the join, as Redshift can perform the join locally on each slice without moving data across the network.

Exam trap

The trap here is that candidates often confuse distribution and sort keys, thinking a sort key on the join column will reduce data movement, when in fact only distribution key alignment eliminates broadcast/redistribution operations in the query plan.

How to eliminate wrong answers

Option A is wrong because setting both tables to ALL distribution replicates the entire table to every node, which increases storage and maintenance overhead, and does not address the root cause of excessive data movement during joins; it can also degrade performance for large tables due to increased load and memory pressure. Option C is wrong because EVEN distribution distributes rows round-robin across nodes, which does not co-locate join keys and forces Redshift to redistribute or broadcast rows during the join, exacerbating the problem. Option D is wrong because adding a sort key on the join column improves the efficiency of range-restricted scans and merge joins but does not reduce the number of distribution or broadcast operations; the query plan's large number of such operations indicates a distribution mismatch, not a sorting issue.

859
MCQeasy

A data engineer is monitoring an Amazon EMR cluster and notices that one core node is running out of disk space. The cluster is running a Spark job that processes large Parquet files. What should the engineer do to prevent the issue?

A.Terminate the core node and replace it with a larger instance type
B.Use Spark's in-memory processing to avoid writing intermediate data to disk
C.Enable Snappy compression for intermediate data
D.Increase the number of core nodes
AnswerC

Compression reduces disk usage for intermediate data.

Why this answer

Option C is correct because enabling Snappy compression for intermediate data reduces the volume of data written to disk during Spark shuffle operations, directly addressing the disk space issue on the core node. Snappy provides a good balance between compression ratio and speed, minimizing I/O overhead while conserving storage. This is a standard tuning practice in Amazon EMR for Spark jobs that process large Parquet files.

Exam trap

The trap here is that candidates may confuse increasing cluster capacity (options A or D) with optimizing data handling, whereas the exam tests the understanding that compression of intermediate data directly reduces disk usage without requiring hardware changes.

How to eliminate wrong answers

Option A is wrong because terminating the core node and replacing it with a larger instance type is disruptive and does not prevent the recurrence of disk space issues; it only temporarily increases capacity without addressing the root cause of excessive intermediate data. Option B is wrong because Spark's in-memory processing cannot fully avoid writing intermediate data to disk during shuffle operations, as spill-to-disk is inherent when memory is insufficient; relying solely on in-memory processing does not prevent disk exhaustion. Option D is wrong because increasing the number of core nodes distributes the storage load but does not reduce the amount of intermediate data written per node; it may delay but not prevent disk space issues if the data volume per node remains high.

860
MCQeasy

Refer to the exhibit. A data engineer runs this AWS Glue Data Catalog DDL statement to create a table. The CSV files in 's3://my-bucket/sales/' use a pipe delimiter (|) instead of a comma. What change is needed to correctly read the data?

A.Change the 'field.delim' property to '|'.
B.Change the LOCATION to read from a subfolder.
C.Add a partition projection configuration.
D.Run a crawler to detect the schema automatically.
AnswerA

The delimiter must match the actual file format.

Why this answer

Option A is correct. The SERDEPROPERTIES specify 'field.delim' = ',' but the files use pipe. Changing the delimiter to '|' will allow correct parsing.

Option C is wrong because partition projection is not needed. Option D is wrong because the table already exists, so a crawler is not required. Option B is wrong because the location is correct.

861
MCQeasy

A data engineer needs to transform JSON data into CSV format using AWS Glue. The transformation is simple and must be executed on a schedule. Which Glue component is MOST suitable?

A.Glue Crawler
B.Glue Data Catalog
C.Glue Development Endpoint
D.Glue ETL job
AnswerD

Glue ETL jobs run transformations and can be scheduled.

Why this answer

Option D is correct. A Glue ETL job can be scheduled to run a script that transforms data. Option A is wrong because crawlers only catalog data.

Option B is wrong because the Data Catalog is a metadata repository. Option C is wrong because a development endpoint is for interactive development, not production scheduling.

862
Multi-Selectmedium

An e-commerce company is building a near-real-time dashboard to monitor customer clickstream data. The data is ingested via Amazon Kinesis Data Streams, transformed using AWS Lambda, and stored in Amazon S3. The team needs to query the data using Amazon Athena. Which THREE steps should be taken to optimize cost and performance? (Choose three.)

Select 3 answers
A.Use AWS Glue Data Catalog to store the table metadata.
B.Store the data in JSON format for flexibility.
C.Convert the data to Apache Parquet or ORC format.
D.Compress the data using gzip or snappy.
E.Partition the data by date in S3 (e.g., year/month/day).
AnswersC, D, E

Columnar formats reduce data scanned and improve compression.

Why this answer

C is correct because converting data to Parquet or ORC reduces storage size and improves query performance. D is correct because compressing data reduces storage costs and the amount of data scanned. E is correct because partitioning by date lets Athena skip partitions a query does not need, reducing the amount of data scanned.

B is wrong because JSON is not optimal; columnar formats are better. A is wrong because the Glue Data Catalog is needed for Athena, but it is not an optimization step; it's a prerequisite.
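
A date-partitioned layout can be sketched as a small key-building helper. The Hive-style `year=/month=/day=` naming and the `clickstream` base prefix are assumptions for illustration; the question only specifies partitioning by year, month, and day.

```python
from datetime import datetime


def partition_prefix(base: str, ts: datetime) -> str:
    """Build a Hive-style date partition prefix for an S3 object key."""
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


if __name__ == "__main__":
    # Hypothetical event timestamp and base prefix.
    print(partition_prefix("clickstream", datetime(2024, 3, 7, 14, 30)))
```

A query that filters on `year`, `month`, and `day` then reads only the objects under the matching prefixes.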

863
MCQeasy

A small startup is building a data pipeline to ingest customer orders from a web application into Amazon Redshift for analytics. The orders are written to an Amazon RDS MySQL database. The startup wants to replicate the orders to Redshift in near-real time (within 5 minutes) with minimal operational overhead. The data volume is low, averaging 100 new orders per minute. The startup has a single data engineer who is also responsible for other tasks. What is the simplest solution?

A.Use AWS Glue with a scheduled job every 5 minutes to copy data from MySQL to Redshift
B.Use Amazon EMR with Spark streaming to read from MySQL and write to Redshift
C.Use an AWS Lambda function to query MySQL every minute and insert into Redshift
D.Use AWS Database Migration Service (DMS) with continuous replication
AnswerD

DMS is purpose-built for database replication and easy to set up.

Why this answer

Option D is correct because AWS DMS can continuously replicate from MySQL to Redshift with minimal setup and low overhead. Option C (Lambda) requires custom code. Option A (Glue) is batch-oriented and may not meet the 5-minute latency.

Option B (EMR) is overkill.

864
Multi-Selecteasy

A data engineer needs to securely store database credentials for an RDS instance. Which TWO AWS services can be used?

Select 2 answers
A.AWS KMS
B.AWS Secrets Manager
C.AWS IAM
D.AWS CloudFormation
E.AWS Systems Manager Parameter Store
AnswersB, E

Secrets Manager is designed for managing secrets, including automatic rotation.

Why this answer

AWS Systems Manager Parameter Store can securely store secrets like database credentials. AWS Secrets Manager is designed specifically for secrets management and automatic rotation. Option D is wrong because CloudFormation is for infrastructure as code.

Option A is wrong because KMS is a key management service, not a secret store (though it can encrypt secrets stored elsewhere). Option C is wrong because IAM is for identity management.

865
Matchingmedium

Match each AWS Glue component to its role.

Concepts
Matches

Scans data sources and populates catalog

Central metadata repository

Transform and load data

Orchestrates multiple jobs and crawlers

Interactive development environment

Why these pairings

Glue components work together for ETL.

866
MCQhard

A data engineer creates an IAM policy as shown in the exhibit. The engineer then attaches this policy to an IAM role used by an application that uploads objects to the S3 bucket 'my-bucket'. When the application uploads an object without specifying server-side encryption, what happens?

A.The object is uploaded with SSE-S3 encryption by default.
B.The upload fails with a 403 Access Denied error.
C.The object is uploaded without encryption.
D.The object is uploaded with SSE-C encryption.
AnswerB

The condition is not met, so the request is denied.

Why this answer

Option B is correct because the policy requires the s3:x-amz-server-side-encryption header to be set to AES256. If the request does not include that header, the condition is not met, and the request is denied with a 403 Access Denied error. Option A is wrong because the object is not uploaded.

Option D is wrong because the condition requires AES256, not SSE-C. Option C is wrong because the condition is not met if the header is absent; the request is denied.

867
MCQmedium

A company uses AWS Glue to catalog data in Amazon S3. The data includes personally identifiable information (PII). The security team requires that PII be masked when queried by users who are not data owners. Which AWS service should be used to enforce this requirement?

A.Use Amazon Macie to automatically redact PII from S3 objects.
B.Use IAM policies with condition keys to restrict access based on tags.
C.Use AWS Lake Formation to define column-level security and data masking.
D.Use Amazon S3 Object Lambda to transform data on the fly.
AnswerC

Lake Formation provides column-level permissions and dynamic masking.

Why this answer

Option C is correct because AWS Lake Formation provides fine-grained access control and column-level masking for data cataloged in the Glue Data Catalog. Option D is wrong because S3 Object Lambda modifies data at the S3 API level, not at the query level. Option B is wrong because IAM policies cannot mask data.

Option A is wrong because Macie discovers and classifies PII but does not enforce access controls.

868
MCQhard

A data engineer is responsible for a data pipeline that uses Amazon S3 as a data lake, AWS Glue for ETL, and Amazon Athena for ad-hoc queries. The pipeline ingests CSV files from an external partner via SFTP into an S3 bucket. The files are then processed by a Glue job that converts them to Parquet and writes to a separate S3 bucket partitioned by date. The Glue job runs daily and is triggered by a scheduled CloudWatch Events rule. Recently, the data engineer noticed that some days the Glue job fails because of memory errors, and on those days the Athena queries that rely on the data return incomplete results. The engineer needs to ensure that the pipeline is resilient and that Athena queries always see a complete view of the data, even if the Glue job fails mid-run. The engineer also needs to minimize re-processing of data. Which course of action should the engineer take?

A.Increase the number of workers and the worker type to G.2X to handle the memory errors, and enable job retries.
B.Replace the Glue job with an AWS Lambda function that processes the CSV files and writes Parquet to S3, and use S3 Event Notifications to trigger the function.
C.Modify the Glue job to use job bookmarks for incremental processing and write the Parquet output to a temporary location, then use an S3 copy operation to move the data into the final partitioned location only after the job completes successfully.
D.Use Athena partition projection to automatically discover partitions and set up a retry mechanism using AWS Step Functions.
AnswerC

Bookmarks prevent reprocessing; atomic move ensures Athena sees complete data.

Why this answer

Option C is correct. Using Glue job bookmarks enables incremental processing and the ability to resume from the last successful checkpoint. Staging the data in a temporary location and moving it into the final location only after the job succeeds ensures that Athena sees only complete data.

Option A is wrong because increasing worker capacity does not prevent partial writes. Option B is wrong because using Lambda for the conversion is less scalable and more error-prone. Option D is wrong because partition projection does not solve the atomicity issue.

869
MCQmedium

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that is failing to deliver data to an Amazon S3 bucket. The stream is configured with a Lambda transformation function. The CloudWatch logs show that the Lambda function is timing out. Which action should the engineer take to resolve the issue?

A.Reduce the Firehose buffer interval.
B.Increase the Lambda function timeout setting.
C.Decrease the Lambda function's batch size in Firehose.
D.Increase the memory allocated to the Lambda function.
AnswerB

Extending timeout allows more time for processing.

Why this answer

Option B is correct because increasing the Lambda timeout allows the function to complete processing. Option D is wrong because the issue is timeout, not memory. Option C is wrong because the stream configuration does not control the Lambda timeout.

Option A is wrong because reducing the buffer interval may increase invocation frequency but does not fix the timeout.

870
MCQmedium

A company uses Amazon S3 to store historical stock market data as CSV files. They run daily Amazon Athena queries to generate reports. Recently, the finance team reported that queries are timing out and costs have increased significantly. The data engineering team notices that the S3 bucket contains thousands of small files (average 100 KB) due to a misconfigured ingestion pipeline. They need to improve query performance and reduce costs without changing the existing reporting schedule. The team has access to AWS Glue and can create new tables. Which solution should they implement?

A.Partition the data by date and create a new Athena table with partitions.
B.Use S3 Select to filter rows within each file before Athena processes them.
C.Increase the Athena query timeout to 30 minutes.
D.Use AWS Glue ETL to read the CSV files, convert them to Parquet, and write them back to S3 in fewer, larger files.
AnswerD

Consolidates small files and uses columnar format to reduce scan size.

Why this answer

Option D is correct because converting the thousands of small CSV files into fewer, larger Parquet files using AWS Glue ETL directly addresses the root cause of poor Athena performance and high costs. Parquet is a columnar format that reduces the amount of data scanned per query, and larger files minimize the overhead of S3 LIST and GET operations, improving throughput. This solution does not change the reporting schedule and leverages existing Glue capabilities to create new optimized tables.

Exam trap

The trap here is that candidates often assume partitioning (Option A) is a universal performance fix, but they overlook that partitioning does not address the 'small files problem' which is a distinct performance killer in Athena due to S3 request overhead and file open costs.

How to eliminate wrong answers

Option A is wrong because partitioning by date does not solve the problem of thousands of tiny files; while partitioning can help prune scanned data, the overhead of reading many small files per partition still causes high latency and cost due to excessive S3 API calls. Option B is wrong because S3 Select operates at the object level to filter rows within a single file, but it does not consolidate files or change the file format; Athena would still need to process thousands of small files, and S3 Select cannot be used directly within Athena queries to replace table scans. Option C is wrong because increasing the query timeout does not reduce the amount of data scanned or the number of S3 requests; it merely allows the query to run longer without addressing the performance bottleneck or cost issue.

871
Multi-Selectmedium

A company is using AWS Glue to process data from an Amazon S3 data lake. The Glue job runs daily and transforms data into multiple output formats. Which TWO actions can the company take to optimize the Glue job's performance and reduce costs? (Choose TWO.)

Select 2 answers
A.Increase the number of DPUs allocated to the job.
B.Reduce the number of DPUs (Data Processing Units) allocated to the job.
C.Disable job bookmarking to force full reprocessing every run.
D.Increase the job timeout to allow more time for processing.
E.Enable job bookmarking to process only new data.
AnswersA, E

More DPUs can speed up processing, reducing runtime and possibly cost.

Why this answer

Options A and E are correct. Using job bookmarking (E) ensures that only new data is processed, reducing processing time and cost. Using a larger number of DPUs (A) can improve performance for data-intensive jobs; it raises the cost per hour, but if the job finishes faster, the overall cost may be lower.

Option B (using a smaller number of DPUs) would reduce performance. Option D (increasing the job timeout) does not optimize performance or cost. Option C (disabling job bookmarking) would reprocess all data, increasing cost.

872
MCQhard

A company ingests IoT sensor data into an S3 bucket. Daily, a Lambda function reads new objects, processes them, and writes results to a DynamoDB table. Recently, the Lambda function started timing out after 15 minutes. The data volume has increased, and the function processes records one by one. Which solution would improve performance without significant cost increase?

A.Replace Lambda with an AWS Glue ETL job.
B.Increase the Lambda function timeout to 30 minutes.
C.Use S3 Batch Operations to invoke the Lambda function in parallel for each object.
D.Increase the DynamoDB write capacity units.
AnswerC

S3 Batch Operations processes objects concurrently, drastically reducing processing time.

Why this answer

Option C is correct because using S3 Batch Operations invokes a Lambda function for each object in parallel, handling large volumes efficiently. Option B is wrong because 15 minutes is already the maximum Lambda timeout, and a longer timeout would not address the root cause of sequential processing. Option A is wrong because Glue jobs have a startup overhead and may cost more.

Option D is wrong because increasing DynamoDB write capacity does not speed up the Lambda processing.

873
MCQhard

A company ingests millions of small files (1-10 KB) into Amazon S3 every hour. These files are then processed by AWS Glue ETL jobs. The Glue jobs are slow because of the overhead of reading many small files. Which strategy will most effectively improve Glue job performance?

A.Enable Glue job bookmark.
B.Increase the number of DPUs for the Glue job.
C.Use S3 Select to filter data before Glue reads it.
D.Use a Lambda function to merge small files into larger ones before Glue processes them.
AnswerD

Merging files reduces the number of objects, speeding up Glue's list and read operations.

Why this answer

Grouping small files into larger ones (e.g., by merging in a preprocessing step) reduces the number of file read operations and improves Glue's efficiency. Using S3 Select or increasing DPUs helps but doesn't address the root cause.
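
The merging step can be planned with a simple greedy grouping, sketched below. The function name, the `(key, size)` input shape, and the target size are all hypothetical; a real preprocessing step would list the objects, concatenate each group, and write one larger object per group.

```python
from typing import Iterable, List, Tuple


def plan_merges(files: Iterable[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily group (key, size) pairs so each group stays near target_bytes."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in files:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical 4 KB objects grouped toward a 10 KB target.
    small_files = [(f"sensor/{i}.json", 4_000) for i in range(5)]
    for group in plan_merges(small_files, target_bytes=10_000):
        print(group)
```

Five input objects become three output groups here; with a realistic target in the hundreds of megabytes, millions of small objects collapse into far fewer reads for the Glue job.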

874
Multi-Selectmedium

A company is using Amazon Kinesis Data Streams to process real-time stock trade data. The data is consumed by a Lambda function that calculates moving averages and stores results in Amazon DynamoDB. The Lambda function is failing with 'ProvisionedThroughputExceededException' on the DynamoDB table. The table has on-demand capacity. Which TWO actions should the engineer take to resolve this issue?

Select 2 answers
A.Add a dead-letter queue and configure the Lambda function to retry on failure with exponential backoff.
B.Decrease the batch window to 0 seconds to process records immediately.
C.Increase the Lambda function's reserved concurrency to process more shards.
D.Increase the batch size of the Kinesis event source mapping for the Lambda function.
AnswersA, D

Retries with backoff help handle throttling gracefully.

Why this answer

On-demand DynamoDB can handle bursts but may throttle if the traffic is too high. Increasing batch size reduces write frequency. Adding a retry mechanism with exponential backoff handles throttling gracefully.

Option C is wrong because increasing Lambda concurrency increases write pressure on the table. Option B is wrong because decreasing the batch window to 0 seconds invokes the function more frequently, which increases the write request rate.
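
Retrying with exponential backoff can be sketched in a few lines. `ThrottledError` is a stand-in for the throttling exception a real client would raise, and the delay parameters are illustrative; the sketch uses random jitter so that retries from many invocations do not line up.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error returned by the data store."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller or a DLQ handle it
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_write():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ThrottledError()
        return "written"

    print(call_with_backoff(flaky_write), "after", attempts["count"], "attempts")
```

Once retries are exhausted the exception propagates, which is the point at which a dead-letter queue or on-failure destination would capture the failed batch.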

875
MCQmedium

A company is migrating an on-premises Hadoop cluster to AWS. The cluster processes large files in CSV format using Apache Spark. Which data store should be used as the primary storage for the data lake to optimize cost and performance?

A.Amazon EMR File System (EMRFS) backed by HDFS
B.Amazon RDS for MySQL
C.Amazon EBS volumes attached to the EMR cluster
D.Amazon S3
AnswerD

S3 provides unlimited storage, high durability, and integrates with EMR via EMRFS.

Why this answer

Option D is correct because Amazon S3 is the best choice for data lake storage due to its durability, scalability, and cost-effectiveness. Option C is wrong because EBS is block storage for EC2 instances, not suitable for large-scale data lakes. Option A is wrong because EMRFS is a connector for S3, not a separate storage layer backed by HDFS.

Option B is wrong because RDS is relational and not designed for large file storage.

876
Multi-Selecthard

Which THREE factors should a data engineer consider when choosing between Amazon RDS and Amazon DynamoDB for a new application? (Choose three.)

Select 3 answers
A.Whether the workload requires serverless scaling.
B.Whether the data model is relational or key-value.
C.Whether the data must be encrypted at rest by default.
D.Whether the application requires VPC isolation.
E.Whether the application needs to scale horizontally for high throughput.
AnswersA, B, E

DynamoDB is serverless; RDS requires manual scaling.

Why this answer

Options A, B, and E are correct. A: DynamoDB is serverless; RDS requires provisioning. B: relational vs NoSQL (key-value) data model.

E: DynamoDB scales horizontally; RDS scales vertically. D is irrelevant because both can use VPC. C is not a primary factor.

877
MCQeasy

A data engineer needs to set up a new Amazon RDS for MySQL database for a web application. The application experiences variable read traffic and requires low read latency. The engineer needs to minimize downtime during maintenance and provide read scalability. Which configuration meets these requirements?

A.Multi-AZ db.r5.large instance with two Read Replicas
B.Multi-AZ db.r5.large instance
C.Single-AZ db.r5.large instance
D.Single-AZ db.r5.xlarge instance
AnswerA

Multi-AZ provides failover, and Read Replicas provide read scalability.

Why this answer

Option A is correct because a Multi-AZ deployment provides high availability and automatic failover to minimize downtime during maintenance, while adding two Read Replicas offloads read traffic from the primary instance, reducing read latency and enabling read scalability. The db.r5.large instance size is sufficient for the variable read workload, and Read Replicas can be promoted to standalone instances if needed.

Exam trap

The trap here is that candidates often assume Multi-AZ alone provides read scalability, but Multi-AZ only provides high availability and failover, not read offloading—Read Replicas are required for read scaling.

How to eliminate wrong answers

Option B is wrong because a Multi-AZ instance alone provides high availability and failover but does not offer read scalability or reduce read latency for variable read traffic, as all reads still hit the primary instance. Option C is wrong because a Single-AZ instance lacks high availability, meaning any maintenance or failure causes downtime, and it provides no read scalability. Option D is wrong because a Single-AZ db.r5.xlarge instance, while larger, still lacks high availability and read scalability; scaling vertically does not address variable read traffic efficiently and does not minimize downtime during maintenance.

878
MCQmedium

A company is using Amazon S3 to store sensitive data. The security team requires that all objects be encrypted using server-side encryption with AWS KMS (SSE-KMS) and that the bucket policy denies any PutObject request that does not include the required encryption header. Which bucket policy condition should be added?

A.s3:x-amz-server-side-encryption-aws-kms-key-id
B.s3:x-amz-server-side-encryption
C.kms:EncryptionContext
D.aws:SecureTransport
AnswerA

This condition enforces the use of a specific KMS key.

Why this answer

Option A is correct because s3:x-amz-server-side-encryption-aws-kms-key-id can be used to enforce a specific KMS key. Option B is wrong because s3:x-amz-server-side-encryption only enforces SSE-S3 or SSE-KMS, not a specific key. Option C is wrong because kms:EncryptionContext is for KMS, not S3.

Option D is wrong because aws:SecureTransport is for in-transit encryption.

879
MCQmedium

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that transforms and loads it into Amazon S3. Recently, the team noticed that the Lambda function is failing with throttling errors (HTTP 429) from the Kinesis API. Which configuration change should the team make to resolve this issue?

A.Disable retries on the Lambda function and configure a dead-letter queue for failed records.
B.Replace Kinesis Data Streams with Amazon DynamoDB Streams for ingestion.
C.Reduce the batch size and increase the number of shards in the Kinesis stream to increase parallelism.
D.Increase the batch size in the Lambda event source mapping to reduce the number of invocations.
AnswerC

Reducing batch size lowers records per invocation, and more shards increase parallelism, reducing throttling.

Why this answer

The correct answer is to reduce the batch size and increase the number of shards. The Lambda function is experiencing throttling because it is trying to process too many records per invocation. Reducing the batch size lowers the number of records per invocation, and increasing shards increases parallelism.

Option D is incorrect because increasing the batch size would worsen throttling. Option B is incorrect because using a DynamoDB stream is a different ingestion mechanism and doesn't address the Kinesis throttling. Option A is incorrect because disabling retries would cause data loss.

Option C directly addresses the throttling by reducing load per invocation and increasing parallelism.

880
MCQhard

A data engineer is migrating an on-premises Oracle database to Amazon RDS for Oracle. The database is 5 TB in size and has a 1 Gbps network connection. The migration must be completed within 48 hours. Which service should be used?

A.AWS DataSync.
B.Amazon S3 Transfer Acceleration.
C.AWS Snowball Edge.
D.AWS Database Migration Service (DMS).
AnswerD

Online migration over network, capable of migrating 5 TB within 48 hours.

Why this answer

AWS DMS is the correct choice because it is designed for migrating databases to AWS with minimal downtime, and it can handle a 5 TB Oracle database over a 1 Gbps network within 48 hours. DMS supports ongoing replication to keep the source and target in sync, and it can use Oracle-specific features like supplemental logging and change data capture (CDC) to reduce migration time. The 1 Gbps connection provides sufficient bandwidth to transfer 5 TB in under 12 hours at full utilization, leaving ample time for setup and validation.
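The transfer-time estimate can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits/s); the function name and the utilization parameter are illustrative and do not come from the question.

```python
def transfer_hours(size_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Estimate the hours needed to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8                          # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)   # effective bits per second
    return seconds / 3600


hours_full = transfer_hours(5, 1)                      # link fully utilized
hours_half = transfer_hours(5, 1, utilization=0.5)     # only half the link available
print(f"{hours_full:.1f} h at 100%, {hours_half:.1f} h at 50%")
```

Even at half utilization the estimate stays well inside the 48-hour window, which is why an online migration remains practical here.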

Exam trap

The trap here is that candidates might choose Snowball Edge (option C) thinking 5 TB is too large for a 1 Gbps connection within 48 hours, but they overlook that the bandwidth is sufficient (5 TB at 1 Gbps takes ~11 hours), making an online migration via DMS the correct and more practical choice.

How to eliminate wrong answers

Option A is wrong because AWS DataSync is designed for moving large volumes of file data (e.g., NFS, SMB) to Amazon S3 or EFS, not for database migrations; it cannot handle Oracle-specific schema, stored procedures, or ongoing replication. Option B is wrong because Amazon S3 Transfer Acceleration is a feature for speeding up uploads to S3 buckets over long distances using AWS edge locations, but it does not migrate databases or support Oracle database engines. Option C is wrong because AWS Snowball Edge is a physical device for offline data transfer when network bandwidth is insufficient (e.g., less than 1 Gbps or limited time), but here the 1 Gbps connection is adequate to transfer 5 TB within 48 hours, making an online migration via DMS more efficient and less complex.

881
MCQhard

Refer to the exhibit. A data engineer applies this bucket policy to an S3 bucket named my-data-bucket. The bucket contains sensitive data. The company's security team reports that data was accessed from an IP address outside the allowed range. What is the MOST likely reason that the policy failed to block the unauthorized access?

A.The Deny statement's condition on SecureTransport overrides the IP condition.
B.The policy has a syntax error in the Condition element.
C.The Deny statement does not restrict access based on IP address; it only denies non-HTTPS requests.
D.The bucket policy does not apply to requests made from within the same AWS account.
AnswerC

The Deny only applies to non-SecureTransport, not to IP addresses outside the allowed range.

Why this answer

Option C is correct because the Deny statement in the policy only denies requests that are not using HTTPS (SecureTransport: false). It does not include any condition to restrict access based on IP address. Therefore, a request made from an IP outside the allowed range but using HTTPS would not be denied by this policy, allowing unauthorized access to the sensitive data.

Exam trap

The trap here is that candidates assume a Deny statement with any condition will block all unauthorized access, but in reality, each condition must be explicitly specified to deny the intended requests.

How to eliminate wrong answers

Option A is wrong because SecureTransport and IP address conditions are independent; a Deny statement with SecureTransport does not override an IP condition—it simply does not evaluate IP at all. Option B is wrong because there is no syntax error indicated in the exhibit; the policy is syntactically valid but logically incomplete. Option D is wrong because bucket policies apply to all principals, including requests made from within the same AWS account, unless explicitly scoped otherwise.

882
MCQeasy

A company wants to encrypt data at rest in Amazon S3 using server-side encryption. They need to manage the encryption keys themselves and rotate them annually. Which S3 encryption option should they use?

A.SSE-KMS
B.SSE-S3
C.SSE-C
D.Client-side encryption
AnswerC

SSE-C allows the customer to provide their own encryption keys and manage them.

Why this answer

SSE-C allows the customer to provide and manage their own encryption keys. SSE-S3 uses AWS-managed keys, and SSE-KMS uses AWS KMS keys but with AWS managing the key material. Option A is wrong because SSE-KMS still involves AWS management of the key material.

Option B is wrong because SSE-S3 does not allow customer-managed keys. Option D is wrong because client-side encryption is not server-side.

883
MCQhard

A data engineer is designing a data warehouse using Amazon Redshift. The workload includes complex queries that join large tables. The engineer notices that queries are slow due to disk-based operations. Which configuration change would MOST improve query performance?

A.Define appropriate sort keys on the large tables.
B.Increase the number of slices per node by choosing a different node type.
C.Choose an appropriate distribution style (e.g., KEY or ALL) for the tables.
D.Enable compression on all columns.
AnswerC

Proper distribution minimizes data movement across nodes, reducing disk I/O for joins.

Why this answer

Option C is correct because choosing an appropriate distribution style (KEY or ALL) minimizes data movement between nodes during query execution. In Amazon Redshift, disk-based operations often result from large volumes of data being redistributed across the network for joins. By colocating related data on the same slices via KEY distribution or replicating small tables with ALL distribution, you reduce the need for broadcast or redistribution, which directly alleviates disk-based spills and improves query performance.

Exam trap

The trap here is that candidates often confuse sort keys (which improve scan efficiency) with distribution keys (which reduce data movement), leading them to choose sort keys when the real bottleneck is disk-based operations from join-related data shuffling.

How to eliminate wrong answers

Option A is wrong because sort keys primarily optimize data skipping and range-restricted scans, not the data movement or disk spills caused by large joins. Option B is wrong because increasing the number of slices per node (by choosing a different node type) does not inherently reduce disk-based operations; it may even increase network shuffling if distribution is not optimized, and the bottleneck is often data redistribution, not slice count. Option D is wrong because compression reduces storage size and I/O for scans, but it does not address the root cause of disk-based operations during joins, which is excessive data movement and intermediate result spills.

884
MCQhard

An IAM policy is attached to a user who tries to upload an object to the S3 bucket example-bucket using the AWS CLI without specifying the --server-side-encryption flag. What will happen?

A.The upload fails with an AccessDenied error.
B.The upload succeeds and the object is encrypted with SSE-S3 by default.
C.The upload fails because the user does not have permission to use KMS.
D.The upload succeeds because the policy allows s3:PutObject.
AnswerA

The condition is not satisfied, so the upload is denied.

Why this answer

The IAM policy denies s3:PutObject unless the request includes the `x-amz-server-side-encryption` header with a value of `AES256`. Since the user did not specify `--server-side-encryption` in the AWS CLI command, the request lacks this required header, causing S3 to evaluate the policy and return an AccessDenied error. The upload fails before any default encryption setting on the bucket is applied.

Exam trap

AWS often tests the misconception that bucket default encryption automatically satisfies an IAM policy requiring encryption headers, but in reality, the policy condition is evaluated first and the request is denied if the header is missing, regardless of the bucket's default encryption setting.

How to eliminate wrong answers

Option B is wrong because the bucket's default encryption (SSE-S3) only applies when the PutObject request does not include an encryption header and the policy does not explicitly require one; here the policy requires the header, so the request is denied before default encryption can take effect. Option C is wrong because the policy does not mention KMS at all; the error is due to the missing `x-amz-server-side-encryption` header, not any KMS permission issue. Option D is wrong because the policy condition `s3:x-amz-server-side-encryption` is not satisfied, so the `s3:PutObject` action is effectively denied despite the user having the action allowed in the policy.

885
MCQeasy

A data engineer is building a pipeline to ingest JSON files from Amazon S3 into Amazon Redshift. The files are 100 MB each and arrive every 5 minutes. Which service is BEST suited for this ingestion?

A.AWS Glue ETL job
B.Amazon Redshift COPY command
C.AWS Lambda with Redshift Data API
D.Amazon Kinesis Data Firehose with Redshift destination
AnswerB

COPY is optimized for loading large data from S3.

Why this answer

Amazon Redshift COPY command efficiently loads large files from S3 in parallel. It is the most direct and performant approach for bulk loading into Redshift.
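As a rough illustration, a COPY statement for JSON files can be assembled as below. This is a sketch only: the table name, S3 prefix, and IAM role ARN are hypothetical placeholders, and the 'auto' JSON option assumes the JSON field names match the table's column names.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads JSON files under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON 'auto';"
    )


# All three arguments are made-up example values.
statement = build_copy_statement(
    table="events_raw",
    s3_prefix="s3://example-bucket/incoming/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

Pointing COPY at a prefix rather than a single file lets Redshift load all matching files in parallel across slices.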

886
MCQeasy

A data engineer is designing a data lake on Amazon S3. The data is ingested from multiple sources and needs to be partitioned by year, month, day, and event type for efficient querying with Amazon Athena. Which S3 key prefix structure is most appropriate?

A.s3://bucket/events/2024-01-01/event_type=data.parquet
B.s3://bucket/2024/01/01/event_type/events/data.parquet
C.s3://bucket/event_type=events/year=2024/month=01/day=01/data.parquet
D.s3://bucket/event_type=events/year=2024/month=01/day=01/data.parquet
AnswerC

This uses Hive-style partitioning with partition column names, which Athena supports.

Why this answer

Option C uses Hive-style partitioning (event_type=events/year=2024/month=01/day=01), which Athena and other query engines natively support. This structure allows Athena to perform partition pruning, reading only the relevant directories based on WHERE clause filters, significantly reducing data scanned and improving query performance.
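A small helper shows how such keys might be generated by a writer. This is a sketch under assumptions: the function name is hypothetical, and the partition order (event_type, then year, month, day) simply mirrors the key in the question.

```python
from datetime import date


def hive_key(event_type: str, day: date, filename: str) -> str:
    """Build a Hive-style (key=value) S3 object key for one partition."""
    return (
        f"event_type={event_type}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


key = hive_key("events", date(2024, 1, 1), "data.parquet")
print(key)
```

Zero-padding the month and day keeps the prefixes lexicographically sortable and consistent with the layout shown in the question.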

Exam trap

AWS often tests the distinction between Hive-style partitioning (key=value) and flat or date-only prefixes, where candidates mistakenly choose a structure that does not support partition pruning or is incompatible with Athena's partition discovery.

How to eliminate wrong answers

Option A is wrong because it embeds the date as a single prefix (2024-01-01) and places event_type as a filename suffix, which does not create separate partition directories; Athena cannot prune partitions efficiently without explicit partition columns. Option B is wrong because it uses a date-only hierarchy (year/month/day) but does not include event_type as a partition column, forcing full scans when filtering by event type. Option D is identical to C and is also correct, but the question expects the most appropriate structure; since both C and D are the same, the intended correct answer is C (the first occurrence).

887
MCQhard

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that sends data to an Amazon S3 bucket. The delivery stream has a buffer size of 5 MB and a buffer interval of 60 seconds. The data ingestion rate is 2 MB per second. The engineer notices that S3 objects are created every 60 seconds but each object is only about 2 MB. What should the engineer do to reduce the number of small S3 objects?

A.Increase the buffer size to 10 MB.
B.Decrease the buffer interval to 30 seconds.
C.Reduce the buffer size to 2 MB.
D.Switch to Kinesis Data Streams and use a Lambda function to write to S3.
AnswerA

Larger buffer size means more data accumulates before an S3 write.

Why this answer

Option A is correct because increasing the buffer size to 10 MB will allow the stream to buffer more data before writing to S3, resulting in larger objects. Option B is wrong because decreasing the buffer interval would create objects more frequently, making the problem worse. Option C is wrong because reducing the buffer size would create even smaller objects.

Option D is wrong because switching to Kinesis Data Streams does not solve the buffering issue.

888
MCQmedium

Refer to the exhibit. An IAM policy is attached to an IAM user. The user is trying to download an object from the S3 bucket 'example-bucket' from an IP address 10.1.1.1, but the request is denied. What is the most likely reason?

A.The policy does not allow the s3:GetObject action.
B.The policy has a syntax error.
C.There is an explicit deny statement elsewhere that overrides the allow.
D.The user's IP address does not match the condition in the policy.
AnswerD

The condition restricts access to IP range 10.0.0.0/16; 10.1.1.1 is outside.

Why this answer

The policy uses a condition to allow access only from the 10.0.0.0/16 IP range. The user's IP 10.1.1.1 is outside that range, so the condition fails and access is denied (implicit deny). Option D is correct.

Option A is wrong because the policy allows GetObject. Option B is wrong because the policy is valid. Option C is wrong because there is no explicit deny.
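The CIDR check can be verified with Python's standard ipaddress module. This is only an illustration of the range arithmetic, not of how IAM evaluates the condition internally.

```python
import ipaddress

allowed = ipaddress.ip_network("10.0.0.0/16")    # range named in the policy condition
source_ip = ipaddress.ip_address("10.1.1.1")     # address the user connects from

in_range = source_ip in allowed
print(in_range)  # False: 10.0.0.0/16 spans 10.0.0.0 through 10.0.255.255
```

A /16 mask fixes the first two octets, so any address starting with 10.1 falls outside the range.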

889
MCQmedium

A data engineer is troubleshooting an AWS Glue ETL job that fails intermittently with the error 'Rate exceeded.' The job reads from an Amazon RDS for MySQL source and writes to Amazon S3. What is the MOST likely cause of this error?

A.The Glue job is using Amazon Kinesis Data Streams as a source, which has a shard throughput limit.
B.The number of Glue job workers or parallel queries is exceeding the maximum connections or IOPS of the RDS instance.
C.The Amazon S3 bucket has a bucket policy that limits the number of objects written per second.
D.The IAM role attached to the Glue job does not have sufficient permissions to read from RDS.
AnswerB

This is the typical cause of rate exceeded errors when reading from RDS.

Why this answer

Option B is correct because the 'Rate exceeded' error in AWS Glue when reading from RDS typically indicates that the number of connections or queries per second exceeds the RDS instance's maximum limits. Option A is wrong because this job reads from RDS, not from Kinesis Data Streams, so shard throughput limits do not apply. Option D is wrong because insufficient IAM permissions would cause an access denied error, not rate exceeded.

Option C is wrong because Amazon S3 does not have a rate exceeded error for writes; it would be a 503 SlowDown error.

890
MCQmedium

A data engineer needs to share a dataset stored in Amazon S3 with another AWS account. The bucket policy currently grants access only to the owning account. What is the simplest way to grant cross-account access?

A.Add a bucket policy that grants access to the other account's IAM role
B.Set the object ACL to public-read
C.Use an S3 access control list (ACL) to grant access to the other account
D.Create an IAM role in the other account and attach a policy to it
AnswerA

A bucket policy can specify a principal from another account.

Why this answer

Option A is correct because a bucket policy with a principal ARN of the other account's IAM role is the standard way to grant cross-account access. Option B is wrong because that would make the object public, which is not recommended. Option D is wrong because creating a role and attaching a policy in the other account does not grant access on its own; the bucket policy must still allow the cross-account principal.

Option C is wrong because an ACL grants basic permissions but is less flexible.

891
Multi-Selecthard

A company is using Amazon S3 for a data lake. The data engineer needs to ensure that all new objects are automatically encrypted with a customer-managed KMS key and that the bucket policy enforces encryption. Which THREE steps should be taken? (Choose THREE.)

Select 3 answers
A.Configure default encryption on the bucket to use SSE-KMS.
B.Add a bucket policy that denies PutObject if the x-amz-server-side-encryption header is not set to aws:kms.
C.Create a customer-managed KMS key.
D.Use a lifecycle policy to apply encryption to existing objects.
E.Enable AWS CloudTrail to monitor encryption.
AnswersA, B, C

Ensures new objects are encrypted with SSE-KMS by default.

Why this answer

Option A is correct because configuring default encryption on the S3 bucket to use SSE-KMS ensures that any object uploaded without an explicit encryption header is automatically encrypted with the specified customer-managed KMS key. This satisfies the requirement for automatic encryption of all new objects.

Exam trap

The trap here is that candidates often confuse lifecycle policies (which manage object transitions) with encryption enforcement, or they think CloudTrail can enforce encryption rather than just audit it.
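A bucket policy of the kind described in option B might look like the following. This is a sketch only: the bucket name and Sid are hypothetical, and the policy is built as a Python dictionary purely so it can be serialized to JSON.

```python
import json

BUCKET = "example-data-lake"  # hypothetical bucket name

deny_unencrypted_puts = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyPutWithoutSseKms",           # illustrative statement ID
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                # Matches when the header is absent or has any value other than aws:kms.
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        }
    ],
}

policy_json = json.dumps(deny_unencrypted_puts, indent=2)
print(policy_json)
```

Because the statement denies uploads that omit the header, clients writing to such a bucket need to request SSE-KMS explicitly on each PutObject call.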

892
Matchingmedium

Match each AWS data compression format to its typical use case.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts
Matches

General-purpose, good compression ratio

Fast compression/decompression for real-time

Columnar storage with built-in compression

Optimized for Hive and large-scale analytics

High compression ratio, slower speed

Why these pairings

Compression formats affect storage and performance in AWS.

893
MCQeasy

A data engineer is configuring an S3 bucket for a data lake. The engineer runs the command shown in the exhibit. What does the output indicate about the bucket?

A.Versioning is enabled on the bucket.
B.The bucket retains only the latest version of each object.
C.Versioning is suspended on the bucket.
D.MFA Delete is enabled for the bucket.
AnswerA

Status: Enabled means versioning is active.

Why this answer

Option A is correct because Status: Enabled means versioning is enabled. Option B is wrong because the output does not indicate that only the latest version is retained. Option C is wrong because versioning is enabled, not suspended.

Option D is wrong because MFADelete is disabled.

894
Multi-Selecthard

A company wants to monitor and alert on any IAM user creation in their AWS account. Which THREE services should be used together to achieve this? (Choose three.)

Select 3 answers
A.Amazon Simple Notification Service (SNS)
B.AWS CloudTrail
C.Amazon CloudWatch Logs
D.Amazon CloudWatch Events (EventBridge)
E.AWS Config
AnswersB, C, D

Records API calls including IAM user creation.

Why this answer

AWS CloudTrail captures IAM user creation API calls (option B). Amazon CloudWatch Logs can be the target for CloudTrail logs (option C). Amazon CloudWatch Events (EventBridge) can create a rule to trigger an alert (option D).

Option E (AWS Config) is for configuration compliance, not API monitoring. Option A (SNS) is for sending notifications, but the question asks for monitoring and alerting, and EventBridge can directly trigger SNS or Lambda.
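An EventBridge rule for this scenario would typically use an event pattern like the one below. This is a sketch: the matcher function is a deliberately simplified stand-in for EventBridge's matching logic (exact-value lists only), and the sample event is a made-up, trimmed-down example.

```python
# Event pattern for IAM CreateUser calls recorded by CloudTrail.
create_user_pattern = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["iam.amazonaws.com"],
        "eventName": ["CreateUser"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must be present; a list means 'any of'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical event, reduced to the fields the pattern inspects.
sample_event = {
    "source": "aws.iam",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "iam.amazonaws.com", "eventName": "CreateUser"},
}

is_match = matches(create_user_pattern, sample_event)
print(is_match)
```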

895
Multi-Selectmedium

A data engineer is designing a data pipeline that ingests streaming data from an IoT device fleet. The data must be processed in near real-time and stored in Amazon S3 for long-term analytics. Which TWO AWS services should the engineer use together to achieve this?

Select 2 answers
A.Amazon Athena
B.AWS Glue
C.Amazon Kinesis Data Firehose
D.Amazon Kinesis Data Streams
E.Amazon Simple Queue Service (SQS)
AnswersC, D

Delivers streaming data to S3.

Why this answer

Option D is correct because Kinesis Data Streams ingests streaming data. Option C is correct because Kinesis Data Firehose can deliver data from the stream to S3. Option E is wrong because SQS is for message queues, not real-time streaming.

Option B is wrong because Glue is for batch ETL, not real-time. Option A is wrong because Athena is a query service, not a delivery service.

896
MCQmedium

A data engineer is designing a data lake on Amazon S3. The data is accessed frequently for the first 30 days, then rarely after that. Which lifecycle policy is MOST cost-effective?

A.Transition to S3 Standard-Infrequent Access (Standard-IA) after 30 days.
B.Transition to S3 One Zone-IA after 30 days.
C.Transition to S3 Glacier Deep Archive after 30 days.
D.Keep in S3 Standard for 90 days, then delete.
AnswerA

Standard-IA is cost-effective for infrequently accessed data with low latency.

Why this answer

Transitioning to S3 Standard-IA after 30 days reduces costs for infrequently accessed data while retaining low latency. Option C is wrong because S3 Glacier Deep Archive has retrieval times not suitable for rare but possible access. Option B is wrong because One Zone-IA is less durable.

Option D is wrong because keeping in Standard is more expensive.
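The chosen rule could be expressed as a lifecycle configuration like the one below. This is a sketch: the rule ID is hypothetical, and the dictionary follows the shape that boto3's put_bucket_lifecycle_configuration accepts for its LifecycleConfiguration argument.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "to-standard-ia-after-30-days",    # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},                # empty prefix: applies to the whole bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
            ],
        }
    ]
}

transition = lifecycle_configuration["Rules"][0]["Transitions"][0]
print(transition)
```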

897
MCQhard

A data engineer is troubleshooting an AWS Glue ETL job that fails with an 'Access Denied' error when trying to write to an S3 bucket. The IAM role used by the job has the policy shown in the exhibit. The bucket 'my-bucket' uses S3 default encryption with AWS KMS. What is the most likely missing permission?

A.s3:GetObjectVersion
B.glue:GetObject
C.s3:ListBucketMultipartUploads
D.s3:PutObjectAcl
E.kms:GenerateDataKey and kms:Decrypt
AnswerE

KMS permissions are necessary to encrypt and decrypt objects when default encryption uses KMS.

Why this answer

Option E is correct because when S3 default encryption uses KMS, the IAM role must have kms:GenerateDataKey and kms:Decrypt permissions on the KMS key. The policy in the exhibit does not include any KMS actions. Option A is wrong because s3:GetObjectVersion applies to reading specific object versions, not to writing objects.

Option B is wrong because glue:GetObject is not a valid action. Option C is wrong because s3:ListBucketMultipartUploads only lists in-progress multipart uploads and is not required for the write. Option D is wrong because s3:PutObjectAcl sets object ACLs; the write itself needs s3:PutObject, which is already granted.

898
MCQhard

A data engineer runs the AWS CLI command shown in the exhibit to list objects in an S3 bucket. The command returns only two objects even though the bucket contains thousands of objects under the prefix. What should the engineer do to retrieve the next batch of objects?

A.Increase the --max-items value to a larger number.
B.Use the --starting-token parameter with the value from the NextToken field.
C.Use the --page-size parameter to request more items per API call.
D.Change the --prefix to a more specific value.
AnswerB

The NextToken is used with --starting-token to get the next page of results.

Why this answer

The AWS CLI `list-objects-v2` command paginates results by default. When the output is truncated, the response includes a `NextToken` field. To retrieve the next batch, the engineer must use the `--starting-token` parameter with the value from that `NextToken` field, which tells the CLI to resume listing from where it left off.
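The token mechanism can be illustrated with a small in-memory simulation. This is a sketch only: the function and key names are hypothetical, and the token here is a plain integer index, whereas real CLI tokens are opaque strings that should be passed back unchanged.

```python
from typing import List, Optional, Tuple


def list_page(keys: List[str], max_items: int,
              starting_token: Optional[str] = None) -> Tuple[List[str], Optional[str]]:
    """Return one page of keys plus the token for the next page (None when done)."""
    start = int(starting_token) if starting_token is not None else 0
    end = start + max_items
    next_token = str(end) if end < len(keys) else None
    return keys[start:end], next_token


# Hypothetical bucket contents under one prefix.
all_keys = [f"logs/part-{i:04d}.json" for i in range(5)]

first_page, token = list_page(all_keys, max_items=2)
second_page, token2 = list_page(all_keys, max_items=2, starting_token=token)
last_page, token3 = list_page(all_keys, max_items=2, starting_token=token2)
print(first_page, second_page, last_page)
```

Each call returns a fixed-size batch, and only the token from the previous response moves the listing forward; the final page returns no token.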

Exam trap

The trap here is confusing `--page-size` (which controls API call size but not pagination) with `--starting-token` (which actually advances the pagination cursor), leading candidates to incorrectly choose option C.

How to eliminate wrong answers

Option A is wrong because `--max-items` controls the maximum number of items returned per paginated output, not the total number of items retrieved; increasing it would still only return a single page of up to that many items, not the next batch. Option C is wrong because `--page-size` controls the number of items requested per underlying API call (e.g., `ListObjectsV2`), but the CLI automatically handles pagination; changing it does not retrieve the next batch—it only affects the size of each API request. Option D is wrong because changing the `--prefix` would filter to a different set of objects, not retrieve the next batch of objects under the original prefix.

899
MCQhard

A company uses Amazon Kinesis Data Streams to ingest IoT sensor data. The data is processed by an AWS Lambda function that transforms the records and writes to an Amazon S3 bucket. Recently, the Lambda function has been failing with 'Rate exceeded' errors for the S3 PUT API calls. The data volume is 10 MB/s with average record size 2 KB. What should be done to resolve this issue?

A.Add a random prefix to the S3 object key to distribute writes across multiple prefixes
B.Switch to Amazon Kinesis Data Firehose to write to S3
C.Increase the Lambda function's reserved concurrency
D.Increase the number of Kinesis shards
AnswerA

Random prefixes increase the number of S3 partitions, raising the PUT request limit.

Why this answer

Option A is correct because S3 supports roughly 3,500 PUT requests per second per prefix. With 10 MB/s and 2 KB records, that's 5,000 records/s, which exceeds the limit if each record is written as its own object. Increasing the number of S3 prefixes distributes writes across multiple partitions.

Option C (increase Lambda concurrency) would worsen the issue. Option D (increase Kinesis shards) doesn't address S3 throttling. Option B (use Firehose) may help but is a bigger change; adding prefixes is simpler.
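The request-rate arithmetic can be sketched as follows. The calculation assumes decimal units (1 MB = 1,000 KB), one PUT request per record, and a per-prefix limit of 3,500 PUT requests per second; the function names are illustrative.

```python
import math


def puts_per_second(throughput_mb_s: float, record_kb: float) -> float:
    """PUT requests per second if every record becomes its own S3 object."""
    return throughput_mb_s * 1000 / record_kb


def prefixes_needed(request_rate: float, per_prefix_limit: int = 3500) -> int:
    """Minimum number of prefixes needed to keep each one under the limit."""
    return math.ceil(request_rate / per_prefix_limit)


rate = puts_per_second(10, 2)
prefix_count = prefixes_needed(rate)
print(rate, prefix_count)
```

Under these assumptions two prefixes are the bare minimum; spreading writes across more prefixes leaves headroom for bursts.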

900
Multi-Selectmedium

A company uses AWS Glue to process data from Amazon S3. The Glue job fails with a 'SchemaDetectionException'. The data engineer wants to ensure the schema is correctly inferred. Which TWO actions should the engineer take? (Choose two.)

Select 2 answers
A.Use the Glue Data Catalog as the source for schema definition.
B.Add a column with a default value to the data.
C.Increase the number of Glue DPUs to speed up processing.
D.Convert all input files to Parquet format.
E.Set the 'groupFiles' option to 'inPartition' to combine small files.
AnswersA, E

Data Catalog provides a predefined schema.

Why this answer

Options A and E are correct. Option A uses the Glue Data Catalog to store a consistent schema. Option E groups small files within each partition so the job reads them together during schema inference.

Option B is wrong because adding a separate column does not help schema detection. Option D is wrong because converting to Parquet may change the schema but does not guarantee correct inference. Option C is wrong because increasing parallelism does not affect schema detection.
