
CCNA Data Ingestion and Transformation Questions



301

Multi-Select · hard

A company uses AWS Glue to run ETL jobs that transform data from Amazon S3 (Parquet) into a denormalized format for Amazon Redshift. The Glue job uses the DynamicFrame API. The job is failing with a 'MemoryError' when performing a join operation. The data is skewed on the join key. Which THREE actions can reduce memory usage and improve job stability? (Choose THREE.)

Select 3 answers

A. Use a broadcast join if one of the tables is small enough.

B. Use a salted join key to distribute skewed keys across partitions.

C. Increase the number of DPUs for the Glue job.

D. Repartition the data on the join key before the join operation.

E. Split the transformation into multiple Glue job steps to reduce per-step memory.

Answers: A, B, E

A broadcast join avoids shuffling the large table.

Why this answer

Options A, B, and E are correct. Salting the join key distributes the skewed keys across partitions, a broadcast join for a small table reduces shuffle, and splitting the job into multiple steps reduces memory per stage. Option C is wrong because increasing DPUs may help but is not a targeted solution.

Option D is wrong because repartitioning on the join key does not fix skew and may increase overhead.
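The salting idea can be illustrated without Spark. The following is a minimal pure-Python sketch, not Glue code: the function names, the salt count of 4, and the sample rows are all assumptions made for illustration. It shows that a single hot key is spread over several salted keys while the join result stays the same.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical salt factor chosen for this sketch


def salt_large_side(rows, num_salts=NUM_SALTS, seed=0):
    """Attach a random salt to every join key on the large, skewed side."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts=NUM_SALTS):
    """Replicate each small-side row once per salt so every salted key can match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def join(large, small):
    """Plain hash join on the (key, salt) pair."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    return [(key[0], lv, sv) for key, lv in large for sv in lookup.get(key, [])]


# Illustrative data: one hot key with 1000 rows and one cold key with 1 row.
large = [("hot", i) for i in range(1000)] + [("cold", 0)]
small = [("hot", "H"), ("cold", "C")]

salted_large = salt_large_side(large)
salted_small = explode_small_side(small)
joined = join(salted_large, salted_small)

# Rows per salted key: the hot key is now split across up to NUM_SALTS groups.
rows_per_salted_key = Counter(key for key, _ in salted_large)
```

In a real Spark or Glue job the same pattern would be applied to DataFrame columns, and the salt count would be tuned to the observed skew.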


302

MCQ · hard

Refer to the exhibit. A data engineer is setting up an Amazon Kinesis Data Firehose delivery stream that writes to an S3 bucket named 'example-bucket'. The IAM role assumed by Firehose has the attached policy shown. When testing, the Firehose delivery stream fails with an access denied error. What is the most likely cause?

A. The S3 bucket has server-side encryption enabled that needs additional permissions.

B. The IAM role does not have permission to use AWS KMS keys.

C. The bucket policy denies access from the Firehose service principal.

D. The IAM policy is missing the s3:AbortMultipartUpload and s3:ListBucket actions.

Answer: D

Firehose uses multipart uploads and needs these permissions.

Why this answer

Option D is correct because Kinesis Data Firehose requires the s3:AbortMultipartUpload and s3:ListBucket permissions to write to S3, and the policy only allows s3:PutObject and s3:GetObject. Option A is wrong because server-side encryption is not required.

Option C is wrong because the bucket policy is not shown to be restrictive. Option B is wrong because encryption keys are not mentioned.

303

MCQ · easy

A data engineer needs to transform JSON data from Amazon S3 into Parquet format using AWS Glue. The source files are in a bucket with thousands of small files. What is the best practice to optimize the Glue job performance?

A. Convert the JSON files to CSV before processing with Glue.

B. Enable 'Group small files' in the Glue job or use a DynamicFrame with coalesce.

C. Use an AWS Lambda function to pre-process the files.

D. Increase the number of DPUs to the maximum.

Answer: B

Grouping reduces the number of tasks and improves performance.

Why this answer

Option B is correct because grouping small files into fewer partitions reduces per-file task overhead. Option D is wrong because increasing DPUs may not help with small files.

Option A is wrong because converting to CSV adds an extra step, and Parquet is the more efficient format for Glue. Option C is wrong because a Lambda pre-processing stage adds another component when Glue can group the files itself.
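As a rough sketch of what file grouping looks like in practice, the helper below builds the S3 connection options that a Glue job could pass when reading many small files. The helper name, the bucket path, and the 128 MiB group size are assumptions for illustration; `groupFiles` and `groupSize` are the Glue S3 connection options for grouping.

```python
def grouped_s3_options(paths, group_size_bytes=128 * 1024 * 1024):
    """Build S3 connection options that ask Glue to group small input files.

    groupFiles="inPartition" groups files within each S3 partition;
    groupSize is the target group size in bytes, passed as a string.
    """
    return {
        "paths": list(paths),
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": str(group_size_bytes),
    }


# Hypothetical source prefix used only for this example.
options = grouped_s3_options(["s3://example-source-bucket/json/"])

# Inside a Glue job (not executed here) the options would be used roughly as:
# frame = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options=options,
#     format="json",
# )
```

The right group size depends on the worker type and file layout, so the value here should be treated as a starting point rather than a recommendation.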


304

Multi-Select · hard

A company is migrating its data warehouse from on-premises to Amazon Redshift. The migration involves copying 50 TB of data from an S3 bucket to Redshift. The network bandwidth is limited to 1 Gbps. Which TWO approaches should the team use to complete the transfer within 7 days?

Select 2 answers

A. Use Amazon S3 Transfer Acceleration

B. Use AWS Direct Connect with 10 Gbps

C. Use AWS Snowball Edge to transfer the data to S3

D. Use AWS Lambda to copy data in parallel

E. Use Amazon Kinesis Data Firehose

Answers: A, C

S3 Transfer Acceleration can speed up uploads over the network.

Why this answer

C (Snowball Edge) allows physical transfer of the data, bypassing network limitations. A (S3 Transfer Acceleration) optimizes network transfer but may not be sufficient alone. B (Direct Connect) helps but is still limited by 1 Gbps.

E (Kinesis Data Firehose) is for streaming. D (Lambda) is not for bulk transfer.

305

MCQ · easy

A company needs to ingest data from an on-premises database to Amazon S3 with minimal impact on the source database. The data volume is several TB. Which AWS service is best suited for this task?

A. AWS Direct Connect

B. AWS Snowball Edge

C. AWS Database Migration Service (DMS)

D. Amazon S3 Transfer Acceleration

Answer: C

DMS can migrate data from on-premises to S3 with minimal impact using CDC.

Why this answer

AWS DMS can perform a full load with minimal impact on the source. Option D (S3 Transfer Acceleration) is for speeding up uploads; Option A (Direct Connect) is a network connection; Option B (Snowball Edge) is for offline transfer.

306

MCQ · medium

A data engineer needs to transform CSV files arriving in S3 into Parquet format and partition them by date. The transformation should be event-driven and run immediately after each file is uploaded. Which approach is most efficient?

A. Use S3 event notification to trigger an AWS Glue job

B. Use an S3 event notification to invoke a Lambda function that converts the file

C. Use an Amazon EMR cluster running Spark to process files as they arrive

D. Use Amazon Athena CREATE TABLE AS SELECT (CTAS) on a schedule

Answer: A

Glue jobs can be triggered by S3 events and efficiently convert to Parquet with partitioning.

Why this answer

Option A (use an S3 event notification to trigger an AWS Glue job) is correct because Glue can be triggered by S3 events and supports Parquet conversion and partitioning. Option B (AWS Lambda) has a 15-minute timeout and may not handle large files. Option D (Amazon Athena CTAS) is not event-driven and does not automatically run on upload.

Option C (Amazon EMR) is overkill for this use case.

307

MCQ · hard

A healthcare company processes patient records in near-real-time using Amazon Kinesis Data Streams. Each record contains sensitive personal health information (PHI). The data must be encrypted at rest and in transit. The company also needs to audit access to the data. The data engineer is designing the ingestion pipeline. Which combination of services and configurations meets these requirements?

A. Use Kinesis Data Firehose to deliver data to S3 with SSE-S3, and enable CloudTrail for S3.

B. Use Kinesis Data Streams with TLS and enable CloudTrail for auditing. Do not enable SSE.

C. Use Kinesis Data Streams with SSE-KMS and TLS, and enable CloudTrail for data events.

D. Use Kinesis Data Streams with SSE-KMS and TLS. Do not enable any auditing.

Answer: C

Provides encryption at rest and in transit, plus auditing.

Why this answer

Option C is correct. Kinesis Data Streams supports server-side encryption (SSE) using AWS KMS for at-rest encryption and TLS for in-transit encryption. CloudTrail can log Kinesis API calls for auditing.

Option B lacks encryption at rest. Option D lacks auditing. Option A is wrong because S3 does not replace Kinesis Data Streams for streaming.

308

MCQ · easy

A data engineer needs to ingest data from an external HTTP API into Amazon S3. The API returns JSON data for a list of users, updated hourly. The engineer wants to use a serverless solution with minimal operational overhead. Which AWS service should the engineer use?

A. Amazon Kinesis Data Firehose with a custom HTTP endpoint.

B. AWS Lambda function triggered by CloudWatch Events.

C. Amazon AppFlow with an HTTP connector on a scheduled flow.

D. AWS Glue ETL job triggered by EventBridge.

Answer: C

AppFlow is serverless and designed for API ingestion.

Why this answer

Option C is correct because Amazon AppFlow can connect to HTTP APIs and load data to S3 on a schedule. Option A is wrong because Kinesis Data Firehose requires a stream source. Option D is wrong because a Glue ETL job can do it but requires more configuration.

Option B is wrong because Lambda requires custom code and a separate scheduling rule, which adds operational overhead.

309

MCQ · easy

A data engineer needs to ingest on-premises CSV files into Amazon S3 every hour. The files are less than 1 GB each. Which service is the most cost-effective and requires the least operational overhead?

A. AWS DataSync

B. Amazon Kinesis Data Firehose

C. AWS Snowball Edge

D. AWS Database Migration Service (DMS)

Answer: A

DataSync automates scheduled transfers from on-premises to S3.

Why this answer

Option A is correct because AWS DataSync is designed for scheduled transfers from on-premises to S3 with minimal setup. Option C is wrong because Snowball Edge is for large, offline transfers. Option B is wrong because Kinesis Data Firehose is for streaming data.

Option D is wrong because DMS is for database migrations.

310

MCQ · medium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format. The delivery stream is configured with a buffer size of 5 MB and a buffer interval of 60 seconds. However, the data engineer notices that S3 objects are being created with sizes much smaller than 5 MB. What is a likely cause?

A. The data is being compressed before delivery, reducing object size.

B. The incoming data rate is too low, causing the buffer interval to trigger before reaching the buffer size.

C. The data transformation Lambda is splitting records into smaller ones.

D. The S3 bucket is configured with a lifecycle policy that splits objects.

Answer: B

Buffer interval triggers first.

Why this answer

Option B is correct because Kinesis Data Firehose delivers data to S3 when either the buffer size (5 MB) or the buffer interval (60 seconds) is reached, whichever occurs first. If the incoming data rate is low, the buffer interval will expire before accumulating 5 MB of data, resulting in smaller S3 objects.
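The "whichever occurs first" rule can be checked with a few lines of arithmetic. The function below is a deliberately simplified model, assuming a constant ingest rate and ignoring compression and transformation; the function name and the sample rates are illustrative only.

```python
MB = 1024 * 1024


def expected_object_bytes(rate_bytes_per_sec, buffer_size_bytes=5 * MB, buffer_interval_sec=60):
    """Estimate the delivered object size under a constant ingest rate.

    Delivery happens when either the size threshold or the interval is hit,
    so the object holds whichever amount is reached first.
    """
    accumulated_in_interval = rate_bytes_per_sec * buffer_interval_sec
    return min(buffer_size_bytes, accumulated_in_interval)


# Low rate (10 KiB/s): the 60-second interval fires long before 5 MB accumulates.
low_rate_object = expected_object_bytes(10 * 1024)

# High rate (1 MiB/s): the 5 MB size threshold is reached first.
high_rate_object = expected_object_bytes(1 * MB)
```

Under these assumptions the low-rate stream produces objects of roughly 0.6 MB, which matches the symptom described in the question.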

Exam trap

The trap here is that candidates may assume the buffer size is a hard minimum that must be reached before delivery, but Firehose uses an 'or' condition between buffer size and buffer interval, so low data rate causes interval-based delivery of small objects.

How to eliminate wrong answers

Option A is wrong because compression reduces the size of data after buffering, but the buffer size limit is based on the uncompressed data; compression does not cause smaller objects to be created before the buffer interval triggers. Option C is wrong because a data transformation Lambda can modify records but does not inherently split records into smaller ones; it processes records as a batch and returns them, and any splitting would be a custom logic not default behavior. Option D is wrong because S3 lifecycle policies manage object transitions or deletions after objects are created; they do not split objects during delivery.


311

MCQ · medium

Refer to the exhibit. An IAM policy for an AWS Lambda function. The Lambda function is triggered by an S3 event (object created) and needs to read from a Kinesis stream. However, the function fails with access denied when trying to read from Kinesis. What is the most likely cause?

A. The Lambda function is not in the same region as the Kinesis stream

B. The Lambda function does not have permission to list S3 buckets

C. The Kinesis stream is encrypted with a customer managed KMS key, and the Lambda function lacks kms:Decrypt permission

D. The S3 bucket policy denies access to the Lambda function

Answer: C

If the stream uses SSE-KMS, Lambda needs kms:Decrypt on the key.

Why this answer

The policy allows kinesis:DescribeStream, kinesis:GetShardIterator, and kinesis:GetRecords, which are the actions needed to read from the stream, so the Kinesis permissions themselves are not the problem. The error occurs on the Kinesis read, so the S3 options do not apply.

The policy does not include any KMS permissions. If the stream is encrypted with a customer managed KMS key, the Lambda execution role also needs kms:Decrypt on that key, and reads fail with access denied without it. This is a common scenario.

312

MCQ · hard

A company uses AWS Glue to transform data in an S3 data lake. The transformation logic requires joining two large datasets that are each hundreds of gigabytes. The Glue job runs out of memory. Which configuration change will most likely resolve this issue?

A. Repartition the data before the join.

B. Increase the number of DPUs for the Glue job.

C. Use a different file format like Parquet with compression.

D. Use the 'spark.sql.autoBroadcastJoinThreshold' setting to broadcast the smaller table.

Answer: B

More DPUs provide more memory and parallelism, helping the join fit in memory.

Why this answer

Increasing the number of DPUs provides more memory for the join operation. Glue automatically distributes data across workers, so more workers mean more total memory.


313

MCQ · easy

A data engineer is setting up an Amazon Kinesis Data Firehose delivery stream to load data into Amazon Redshift. The data is coming from an application that produces JSON records. The engineer needs to transform the data to match the Redshift table schema. Which approach is the MOST cost-effective and requires the least operational overhead?

A. Use AWS Glue as a transformation step between Firehose and Redshift, with a trigger on S3.

B. Use Kinesis Data Firehose with direct PUT to Redshift and rely on Redshift's COPY command to transform.

C. Configure a Lambda function in the Firehose delivery stream to transform records before delivery.

D. Use the Kinesis Client Library (KCL) to consume the stream, transform in an EC2 instance, and then load to Redshift.

Answer: C

Firehose supports Lambda for data transformation with minimal overhead.

Why this answer

Using Firehose's built-in Lambda transformation is the simplest and most cost-effective approach. Option B is wrong because the records still need to be transformed before Firehose can load them into the Redshift table. Option A is wrong because Glue adds extra cost and complexity.

Option D is wrong because KCL requires custom code and management.

314

MCQ · medium

A company is building a data lake on Amazon S3 and wants to ingest data from multiple AWS services (CloudTrail, VPC Flow Logs, and ALB logs). The data should be stored in a central S3 bucket with a common partitioning scheme. Which service can be used to collect and centralize this data with minimal configuration?

A. Use AWS Data Pipeline to copy logs from each source S3 bucket to the central bucket.

B. Use AWS Glue to crawl the logs from each source and write to a central S3 bucket.

C. Set up Amazon Kinesis Data Firehose to ingest logs from each service and write to S3.

D. Configure each source service to deliver logs directly to the central S3 bucket.

Answer: D

CloudTrail, VPC Flow Logs, and ALB can all deliver to S3 directly.

Why this answer

Option D is correct because CloudTrail, VPC Flow Logs, and ALB logs can each be configured to deliver directly to an S3 bucket, so pointing every service at the same central bucket with a common prefix layout requires the least configuration.

Option B is wrong because AWS Glue can crawl and catalog the logs but does not ingest them. Option A is wrong because AWS Data Pipeline can copy data but requires additional setup. Option C is wrong because Kinesis Data Firehose adds a delivery component that these services do not need.

315

MCQ · easy

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. The transfer must be completed within one week. Which service should be used?

A. AWS DataSync

B. AWS Snowball

C. Amazon CloudFront

D. AWS Database Migration Service (DMS)

Answer: B

Physical device for large data transfers.

Why this answer

AWS Snowball is the correct choice because transferring 50 TB over a 100 Mbps network would take approximately 46 days (50 TB * 8 bits/byte / 100 Mbps / 86400 seconds/day), far exceeding the one-week deadline. Snowball provides a physical storage device that can be shipped to the on-premises location, allowing data to be loaded locally and shipped to AWS, bypassing network bandwidth constraints entirely.
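The 46-day figure can be reproduced with a short calculation. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a link that is fully used for the whole transfer; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_tb, bandwidth_mbps):
    """Days needed to move size_tb terabytes over a bandwidth_mbps link at full utilization."""
    total_bits = size_tb * 1e12 * 8           # terabytes -> bytes -> bits
    bits_per_second = bandwidth_mbps * 1e6    # megabits per second -> bits per second
    return total_bits / bits_per_second / SECONDS_PER_DAY


# 50 TB over 100 Mbps, the scenario in this question.
days_at_100_mbps = transfer_days(50, 100)
```

Real transfers rarely sustain the full line rate, so the actual duration would be longer than this estimate.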

Exam trap

The trap here is that candidates may underestimate the time required for online transfer and choose AWS DataSync, failing to calculate that 50 TB at 100 Mbps takes over 46 days, not one week, and overlooking Snowball's physical shipping approach for offline data transfer.

How to eliminate wrong answers

Option A is wrong because AWS DataSync is designed for online data transfer over the network, and at 100 Mbps, it would take over 46 days to transfer 50 TB, which does not meet the one-week requirement. Option C is wrong because Amazon CloudFront is a content delivery network (CDN) for caching and distributing content to edge locations, not a data transfer service for ingesting large volumes of historical data into S3. Option D is wrong because AWS Database Migration Service (DMS) is specialized for migrating databases (e.g., relational, NoSQL) and does not support transferring HDFS files or large-scale file-based data to S3.


316

MCQ · medium

A company uses AWS Glue to catalog data in Amazon S3. The data arrives in Parquet format, but the crawler fails to update the schema when new columns are added. What is the most likely cause?

A. The Glue Data Catalog is not configured to accept schema changes.

B. Parquet files do not support schema evolution.

C. The crawler configuration has 'Update the table definition' set to 'Ignore the change'.

D. The S3 bucket has versioning disabled.

Answer: C

If set to ignore changes, the crawler will not add new columns.

Why this answer

Option C is correct because a crawler whose schema change policy is set to 'Ignore the change' does not update the table definition when new columns appear. Option B is wrong because Parquet supports schema evolution. Option D is wrong because S3 bucket versioning does not affect crawler schema updates.

Option A is wrong because the Glue Data Catalog is only the target of the update; the behavior is controlled by the crawler configuration.

317

MCQ · easy

A data engineer needs to ingest data from an Amazon RDS for PostgreSQL database into Amazon S3 on a daily basis. The data volume is approximately 500 GB per day. Which service is most appropriate for this task?

A. AWS Database Migration Service (DMS) with continuous replication

B. Amazon Athena with federated query to RDS

C. Amazon EMR with Spark job

D. AWS Glue with a scheduled ETL job

Answer: D

AWS Glue can run scheduled ETL jobs to extract from RDS and load to S3.

Why this answer

Option D is correct because AWS Glue can run scheduled ETL jobs to extract data from RDS and write to S3. Option A is wrong because AWS DMS is for database migration and continuous replication, not for daily batch jobs. Option B is wrong because Amazon Athena is a query service, not for data transfer.

Option C is wrong because Amazon EMR is for big data processing and is overkill for this simple task.

318

MCQ · medium

A company uses AWS DMS to migrate a 2 TB PostgreSQL database to Amazon Aurora PostgreSQL. The migration is taking longer than expected due to the initial load. Which AWS service can be used to accelerate the initial load by transferring the database files directly?

A. AWS Snowball

B. Amazon S3 Transfer Acceleration

C. AWS Direct Connect

D. Amazon Kinesis Data Firehose

Answer: A

Snowball allows physical transfer of data, which can be faster than network transfer for very large datasets.

Why this answer

Option A is correct because AWS Snowball can be used to transfer large datasets physically, bypassing network constraints. Option B is wrong because S3 Transfer Acceleration speeds up uploads to S3 over the internet, but the data still goes over the network. Option C is wrong because Direct Connect provides a dedicated network connection, but the data still traverses the network.

Option D is wrong because Kinesis Data Firehose is for streaming, not bulk data transfer.

319

MCQ · medium

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration is ongoing with continuous replication. The data engineer notices that some changes are not being captured in the target database. What is the MOST likely cause?

A. The VPC peering connection between on-premises and AWS is down.

B. The DMS task's table mapping is incorrectly configured.

C. DMS is not publishing task logs to CloudWatch Logs.

D. The source Oracle database is not configured to retain archived redo logs for a sufficient period.

Answer: D

DMS requires archived logs to capture changes; if logs are purged, changes are lost.

Why this answer

Option D is correct because DMS uses the source database's transaction logs (redo logs) for CDC; if they are not retained, DMS cannot read the changes. Option A is wrong because VPC peering affects network connectivity, not CDC. Option C is wrong because publishing task logs to CloudWatch Logs affects troubleshooting visibility, not change capture.

Option B is wrong because an incorrect table mapping would cause missing tables, not missing changes.

320

MCQ · easy

A company is using Amazon Kinesis Data Firehose to ingest clickstream data from a website into an S3 bucket. The data is then analyzed using Amazon Athena. Recently, the company noticed that Athena queries are returning incomplete results for the last 30 minutes of data. The Firehose delivery stream is configured to buffer data for 60 seconds or 5 MB before delivering to S3. The S3 bucket has a lifecycle policy that transitions objects to Amazon S3 Glacier after 30 days. The IAM role for Firehose has permissions to write to S3 and access a CloudWatch Logs group. The engineer checks the Firehose monitoring and sees that the delivery rate is healthy, but the 'S3.Bytes' metric shows a spike in the last hour. The 'BackupToS3.Bytes' metric is zero. What is the MOST likely cause of the missing data?

A. The lifecycle policy is transitioning data before Athena can query it.

B. The backup is enabled and data is being sent to the backup bucket instead.

C. The data is still being buffered in Firehose and has not yet been delivered to S3.

D. The IAM role for Firehose does not have permissions to write to the S3 bucket.

Answer: C

Firehose buffers data for up to 60 seconds; recent data may still be in transit and not yet queryable by Athena.

Why this answer

Option C is correct. The buffer settings (60 seconds or 5 MB) mean that data may be held in the buffer for up to 60 seconds before being written to S3. For the last 30 minutes, some data may still be in the buffer and not yet delivered.

Athena queries only see data that has been delivered. Option A (lifecycle policy) would not affect recent data. Option D (IAM permissions) would cause errors, not missing data.

Option B (backup) is unrelated.

321

Multi-Select · easy

A company is designing a data lake on Amazon S3. The data ingestion pipeline must handle both structured and unstructured data. The data must be cataloged for easy discovery. Which THREE services should be included in the solution? (Choose THREE.)

Select 3 answers

A. Amazon S3

B. AWS Glue Data Catalog

C. Amazon Athena

D. Amazon RDS

E. Amazon Redshift

Answers: A, B, C

S3 is the core storage for data lakes.

Why this answer

Option A is correct because S3 is the storage layer. Option B is correct because Glue Crawlers can catalog data. Option C is correct because Athena can query the cataloged data.

Option D (RDS) is a relational database, not for a data lake. Option E (Redshift) is a data warehouse, not a data lake.


322

MCQ · easy

A data engineer needs to transform JSON data from a Kinesis Data Stream into Parquet format and store it in an S3 data lake. The transformation includes simple field mapping and data type conversions. Which AWS service is the most cost-effective for performing this transformation in near-real-time?

A. Amazon Athena with CTAS (CREATE TABLE AS SELECT)

B. AWS Lambda function triggered by Kinesis Data Firehose

C. AWS Glue ETL job

D. Amazon EMR with Spark Streaming

Answer: B

Lambda can be invoked by Firehose for record transformation and can output Parquet; it is serverless and cost-effective for near-real-time.

Why this answer

Option B is correct because Kinesis Data Firehose can perform data transformation using a Lambda function before delivering to S3, and it supports converting JSON to Parquet using its built-in capability or Lambda. Option C is wrong because AWS Glue is a serverless Spark ETL service that is heavier and more costly than needed for simple field mapping. Option D is wrong because Amazon EMR is a managed Hadoop cluster that is more complex and costly for simple transformations.

Option A is wrong because Amazon Athena is an interactive query service, not a transformation engine.

323

MCQ · medium

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job fails intermittently with a 'MemoryError'. What is the MOST likely cause?

A. The Glue job worker type is too small for the data volume

B. The Glue job uses too many DynamicFrames

C. The S3 output bucket is in a different region

D. The Kinesis stream has insufficient shards

Answer: A

Small worker type leads to out-of-memory errors when data volume exceeds capacity.

Why this answer

The error suggests the job runs out of memory. Choosing a larger worker type allocates more memory per worker and helps the job process larger data volumes without memory errors.

324

MCQ · medium

A company uses AWS Glue DataBrew to clean and transform data for analytics. The source data is in Parquet format in Amazon S3. The transformation includes filtering rows and adding calculated columns. What is the MOST cost-effective way to run these transformations on a schedule?

A. Use Amazon EMR with Spark

B. Create a Glue DataBrew recipe and schedule the job using a cron expression

C. Create an AWS Lambda function triggered by S3 events

D. Use AWS Glue ETL with PySpark

Answer: B

DataBrew supports scheduling directly.

Why this answer

DataBrew jobs can be scheduled using a cron expression. This is the native way to run transformations automatically without additional services.


325

MCQ · hard

A data engineer is designing a data pipeline that ingests CSV files from an FTP server to Amazon S3. The files arrive hourly and each file is about 500 MB. The engineer wants to minimize operational overhead and cost. Which approach is best?

A. Write a Python script in AWS Lambda using boto3 to download from FTP and upload to S3

B. Use AWS Snowball Edge to transfer files weekly

C. Use AWS Transfer for SFTP and point the endpoint to an S3 bucket

D. Deploy an Amazon EC2 instance with a cron job to run wget and aws s3 cp

Answer: C

Fully managed, no servers to manage, direct to S3.

Why this answer

AWS Transfer for SFTP provides a fully managed SFTP service that writes directly to S3, eliminating the need to manage servers. Lambda with boto3 is code-heavy; EC2 requires management; Snowball Edge is for large offline transfers.

326

MCQ · hard

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is transformed using an AWS Lambda function. Some records fail transformation and are lost because the Lambda function throws an exception. The data engineer needs to capture the failed records for analysis without affecting the pipeline. What should the engineer do?

A. Configure the Firehose delivery stream to send failed records to a backup S3 bucket

B. Increase the buffer size of the Firehose stream

C. Disable the Lambda transformation and process all records in batch later

D. Modify the Lambda function to write failed records to Amazon DynamoDB

Answer: A

Firehose can be configured to send failed records to a backup S3 bucket.

Why this answer

Option A is correct because configuring a backup S3 bucket for failed records captures them without stopping the pipeline. Option D is wrong because storing in DynamoDB adds complexity and is not a native Firehose feature. Option C is wrong because disabling transformation loses data quality.

Option B is wrong because increasing buffer size does not capture failed records.
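On the Lambda side, a transformation function can mark individual records as failed instead of throwing for the whole batch. The sketch below follows the Firehose transformation response shape (`recordId`, `result`, `data`); the `user_id` field and the mapping logic are hypothetical and stand in for whatever the real transformation does.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record, flagging bad ones instead of raising."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical mapping: keep user_id and cast it to a string.
            transformed = {"user_id": str(payload["user_id"])}
            data = base64.b64encode((json.dumps(transformed) + "\n").encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, KeyError, TypeError):
            # Return the original data unchanged so Firehose can route it as a failed record.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Catching errors per record keeps one malformed record from failing the entire invocation, which is what causes records to be lost in the scenario above.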


327

MCQ · easy

A company needs to transfer 20 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited and the transfer must complete within one week. Which service should the company use?

A. Amazon S3 Transfer Acceleration

B. AWS Snowball Edge

C. AWS Direct Connect with DataSync

D. AWS DataSync over a VPN connection

Answer: B

Snowball is a physical device that can transfer large data quickly.

Why this answer

Option B is correct because AWS Snowball Edge is designed for large-scale data transfers with limited bandwidth. Option D is wrong because DataSync over a VPN connection may not meet the deadline. Option C is wrong because the available bandwidth is insufficient.

Option A is wrong because S3 Transfer Acceleration speeds up uploads but still uses the internet.

328

MCQ · easy

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed into Parquet format and stored in Amazon S3. Which AWS service can perform the transformation in near real-time with minimal operational overhead?

A. Amazon EMR cluster running Spark Streaming

B. AWS Glue ETL job triggered by Kinesis stream

C. Amazon Kinesis Data Firehose with a transformation Lambda function

D. Amazon Kinesis Data Analytics for Apache Flink

Answer: C

Kinesis Data Firehose can invoke a Lambda function to convert data to Parquet and deliver to S3.

Why this answer

Option C is correct because Kinesis Data Firehose can transform data using Lambda functions and deliver to S3 in Parquet format with no servers to manage. Option D (Kinesis Data Analytics for Apache Flink) is aimed at stream analytics and requires a custom application. Option B (Glue ETL) is batch-oriented.

Option A (EMR) requires cluster management.

329

MCQ · medium

A company uses AWS Glue to transform data in S3. The transformation job reads Parquet files, filters rows, and writes to another S3 bucket. The job takes longer than expected. Which change would MOST likely reduce the job execution time?

A. Use a single large file instead of multiple small files.

B. Reduce the number of partitions in the output data.

C. Convert the input files from Parquet to CSV format.

D. Increase the number of DPUs allocated to the Glue job.

Answer: D

More DPUs allow more parallel processing, reducing runtime.

Why this answer

Option D is correct because AWS Glue jobs can benefit from increased DPU allocation for parallel processing. Option C is wrong because converting to CSV would increase size and likely slow processing. Option B is wrong because reducing the number of partitions reduces parallelism, increasing time.

Option A is wrong because using a single file does not improve parallelism.

330

MCQhard

A company uses AWS Glue to process data from Amazon RDS MySQL into Amazon S3. The Glue job uses a JDBC connection and runs on a schedule. Recently, the job has been failing with a 'Communications link failure' error. The RDS instance is in a private subnet. Which troubleshooting step should the data engineer take FIRST?

A.Check the Glue job's DPU allocation; increase if too low.

B.Review the Glue job script for data type mismatches.

C.Verify that the Glue job's VPC subnet and security group allow outbound traffic to RDS.

D.Increase the RDS instance's max_connections parameter.

AnswerC

Network connectivity is the first thing to check for link failures.

Why this answer

Option C is correct because the Glue job's network connectivity to RDS is the likely issue; checking the subnet, security groups, and route tables is the first step. Option D is wrong because max_connections is a database configuration setting, not a network issue. Option B is wrong because data type mismatches cause parsing errors, not connection failures.

Option A is wrong because Glue resource limits would cause different errors (e.g., 'Resource exceeded').

331

Multi-Selecthard

A company is running a 10-node Amazon EMR cluster to process data from Amazon S3. The cluster is using Apache Spark for transformations. The data processing is taking longer than expected. Which THREE actions can improve the performance of the Spark jobs on EMR? (Choose THREE.)

Select 3 answers

A.Reduce the number of shuffle partitions.

B.Enable dynamic allocation of executors.

C.Disable speculative execution to reduce redundant tasks.

D.Use a larger instance type for core nodes.

E.Use EMRFS consistent view to ensure data consistency.

AnswersB, D, E

Dynamic allocation allows Spark to scale resources based on workload.

Why this answer

Options B, D, and E are correct. Using EMRFS consistent view (E) improves consistency and reduces errors. Enabling dynamic allocation (B) allows Spark to adjust executor resources based on workload.

Using a larger instance type (D) can provide more memory and CPU. Option C (disabling speculation) can improve performance for some jobs but is not always beneficial. Option A (reducing shuffle partitions) can help if there are too many small partitions, but increasing the partition count can be the better choice when data is skewed, so it is not a reliable improvement.

332

MCQmedium

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to an Amazon S3 bucket. Recently, the application has been failing with 'ResourceNotFoundException' for the S3 bucket. What is the MOST likely cause?

A.The IAM role used by the application does not have s3:PutObject permission.

B.The S3 bucket ARN is incorrectly specified in the application configuration.

C.The Flink application code specifies the wrong AWS Region for the S3 bucket.

D.The S3 bucket has versioning disabled.

AnswerA

Without write permissions, the application cannot write to S3.

Why this answer

Option A is correct because Kinesis Data Analytics needs permissions to write to the S3 bucket; if the IAM role lacks s3:PutObject, the error occurs. Option B is wrong because the bucket name is used, not the ARN. Option D is wrong because writing to an S3 bucket does not require versioning.

Option C is wrong because the Flink application code does not configure the S3 bucket's Region; it is determined by the bucket location.

333

MCQeasy

Refer to the exhibit. A data engineer runs this CLI command on an S3 bucket. The data is ingested from multiple sources. Which AWS service would be best to process these files in a single batch transformation?

A.AWS Lambda

B.Amazon Kinesis Data Analytics

C.Amazon Athena

D.AWS Glue

AnswerD

Glue can run batch ETL jobs on multiple files.

Why this answer

Option D is correct because AWS Glue can process multiple files of varying sizes in a batch ETL job. Option C is wrong because Athena is for ad-hoc queries, not transformations. Option B is wrong because Kinesis Data Analytics is for streaming.

Option A is wrong because Lambda has limits on execution time and memory for large files.

334

Multi-Selecthard

A data engineering team is designing a batch processing workflow using AWS Glue. The job reads from an S3 bucket, transforms data, and writes to another S3 bucket. The job runs daily and processes new data incrementally. Which THREE features should they use to optimize performance and cost?

Select 3 answers

A.Convert all input data to Parquet format before processing.

B.Enable Glue job autoscaling.

C.Manually increase the number of DPUs for each run.

D.Use predicate pushdown and column pruning in the script.

E.Enable job bookmarks to process only new data.

AnswersB, D, E

Adjusts resources to workload.

Why this answer

Options B, D, and E are correct: job bookmarks for incremental processing, autoscaling for dynamic resource allocation, and predicate pushdown for reducing data scanned. Option C (increasing DPUs manually) is not optimal. Option A (converting to Parquet) may help but is not a Glue feature.

335

Multi-Selecteasy

A company is designing a data ingestion pipeline for real-time IoT sensor data. The data volume peaks at 10,000 messages per second. The pipeline must process messages in order per sensor and persist raw data to Amazon S3 for archival. Which TWO services should be used together to meet these requirements? (Choose TWO.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.Amazon Kinesis Data Streams

C.AWS Lambda

D.Amazon AppFlow

E.Amazon Simple Queue Service (Amazon SQS)

AnswersA, B

Delivers streaming data to S3.

Why this answer

Options A and B are correct. Kinesis Data Streams preserves order within a shard, so using the sensor ID as the partition key keeps each sensor's messages in order, and Kinesis Data Firehose delivers to S3. Option E is wrong because SQS does not guarantee order per sensor.

Option D is wrong because AppFlow is for SaaS integration. Option C is wrong because Lambda can process records but is not an efficient way to persist raw data to S3 for archival.

336

MCQmedium

A data engineer is designing a pipeline to ingest change data capture (CDC) events from an Amazon RDS for PostgreSQL database into Amazon S3. The CDC events are captured using AWS DMS. The data must be available for querying within 5 minutes of the change. Which approach meets these requirements?

A.Export the database to S3 using pg_dump and then use AWS Glue to load into S3 in Parquet format.

B.Use AWS DMS to replicate data to Amazon Redshift, then unload to S3.

C.Use AWS DMS to replicate data directly to S3 in near real-time.

D.Use AWS DMS to replicate data to an SQS queue, then process with Lambda to write to S3.

AnswerC

DMS can write CDC to S3 with low latency.

Why this answer

Option C is correct because DMS can stream CDC data to S3 in near real-time, meeting the 5-minute SLA. Option D is wrong because an SQS queue and a Lambda function are not needed for the ingestion; they only add an extra hop. Option A is wrong because exporting with pg_dump and then using Glue adds latency.

Option B is wrong because Redshift is a separate service and adds complexity.

337

MCQeasy

A data engineer attached this IAM policy to a Lambda function used to transform data in S3. The function is unable to write output to the bucket. What is the most likely reason?

A.The resource ARN is missing the bucket-level ARN.

B.The policy does not allow the s3:DeleteObject action.

C.The policy does not allow the s3:ListBucket action on the bucket.

D.The policy does not allow the s3:PutObjectAcl action.

AnswerC

To write objects, the function needs ListBucket permission on the bucket itself.

Why this answer

The policy allows GetObject and PutObject on objects, but not the s3:ListBucket action required to check existence or list objects. The function likely needs ListBucket to write or verify.

338

MCQeasy

An IAM policy includes the above resource ARN for CloudWatch Logs. A data engineer needs to allow a Lambda function to create log streams and put logs to the log group 'my-log-group'. However, the Lambda function is failing with access denied. What is the issue?

A.The region in the ARN does not match the Lambda function's region.

B.The Lambda function does not have an execution role.

C.The ARN does not include the log-stream portion.

D.The ARN is incorrectly formatted because of the wildcard.

AnswerC

The correct ARN for log streams should be 'arn:aws:logs:us-east-1:123456789012:log-group:my-log-group:log-stream:*'.

Why this answer

Option C is correct because the resource ARN does not include the log-stream portion; to cover the log streams in the group, the ARN should be 'arn:aws:logs:us-east-1:123456789012:log-group:my-log-group:log-stream:*'. Option D is wrong because the ARN format is valid.

Option B is wrong because the function does have an execution role; the problem lies in the permissions that role grants. Option A is wrong because the region is correct.
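
The difference between the log-group ARN and the log-stream ARN pattern can be made concrete with a small sketch. The helper names below are hypothetical; the region, account ID, and log group name are the ones quoted in the explanation.

```python
def log_group_arn(region: str, account_id: str, log_group: str) -> str:
    """ARN that identifies the log group itself."""
    return f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}"


def log_stream_arn_pattern(region: str, account_id: str, log_group: str) -> str:
    """ARN pattern that matches every log stream inside the log group."""
    return f"{log_group_arn(region, account_id, log_group)}:log-stream:*"


if __name__ == "__main__":
    print(log_group_arn("us-east-1", "123456789012", "my-log-group"))
    print(log_stream_arn_pattern("us-east-1", "123456789012", "my-log-group"))
```

A policy statement that should allow writing to streams would list the second form as its resource.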

339

MCQmedium

A data pipeline ingests CSV files from an S3 bucket into a Redshift table using the COPY command. Recently, files with inconsistent column delimiters (some use pipes, others use commas) have been arriving. The pipeline must handle both delimiters without manual intervention. What is the MOST efficient solution?

A.Configure the COPY command with a fixed delimiter (e.g., comma) and manually convert files with pipes before ingestion.

B.Create an AWS Lambda function triggered by S3 events that reads the first line of each file, detects the delimiter, and runs the COPY command with the appropriate DELIMITER option.

C.Use AWS Glue to crawl the S3 bucket and automatically detect the schema and delimiter before writing to Redshift.

D.Use Amazon Athena to query the files with the OpenCSVSerDe, which automatically detects delimiters, and then write the results to Redshift.

AnswerB

Lambda provides a lightweight, event-driven solution to dynamically detect and handle delimiters.

Why this answer

Option B is correct because using a Lambda function to inspect the first line and set the DELIMITER parameter dynamically handles multiple delimiters without manual intervention, and it is efficient because it runs once per object. Option A is wrong because it requires manually converting files. Option C is wrong because AWS Glue is overkill for simple delimiter detection.

Option D is wrong because it still requires manual schema redefinition.
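
The detection step described for the Lambda approach can be sketched in plain Python. This is a minimal illustration, not a full handler: the function names, table name, bucket path, and role ARN are hypothetical, and actually running the statement against Redshift is left out.

```python
def detect_delimiter(first_line: str, candidates: tuple = (",", "|")) -> str:
    """Return the candidate delimiter that occurs most often in the first line."""
    counts = {d: first_line.count(d) for d in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no known delimiter found in first line")
    return best


def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str, delimiter: str) -> str:
    """Build a Redshift COPY statement that uses the detected delimiter."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"DELIMITER '{delimiter}';"
    )


if __name__ == "__main__":
    # Hypothetical file header and identifiers, for illustration only.
    delim = detect_delimiter("user_id|event|ts")
    print(build_copy_statement(
        "events",
        "s3://example-bucket/incoming/file1.csv",
        "arn:aws:iam::123456789012:role/example-copy-role",
        delim,
    ))
```

Counting occurrences is a deliberately simple heuristic; a quoted field that contains the other delimiter could mislead it, so a production function would likely need a more careful parser.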

340

MCQmedium

Refer to the exhibit. A data engineer is troubleshooting an AWS Glue ETL job that fails with an access denied error when writing to S3. The IAM role attached to the Glue job has the policy shown. What is the most likely cause of the error?

A.The Glue job is missing the s3:ListBucket permission on the bucket

B.The Glue job is writing to an S3 bucket that is not included in the Resource ARN

C.The S3 bucket is encrypted with AWS KMS and the policy does not include kms:Decrypt permissions

D.The Glue job does not have permission to call glue:StartJobRun

AnswerB

The policy only grants access to my-data-bucket, not other buckets.

Why this answer

Option B is correct because the policy only allows s3:PutObject on objects in my-data-bucket, so a write to any other bucket is denied. Option C (KMS) is plausible but not indicated. Option D (glue:StartJobRun) is allowed.

Option A (s3:ListBucket) is not needed for writing.

341

MCQhard

A company is ingesting CSV files into Amazon S3. Each file contains a header row. The pipeline uses AWS Glue to crawl the S3 bucket and create a table in the AWS Glue Data Catalog. However, the crawler is including the header as data. What is the most likely cause?

A.The S3 bucket has versioning enabled

B.The crawler is configured to use a custom classifier that does not skip headers

C.The crawler is running in 'crawl all folders' mode

D.The CSV files are not compressed

AnswerB

Without a classifier that sets skip.header.line.count=1, headers are included.

Why this answer

By default, Glue crawlers do not skip headers unless configured to do so. Adding an explicit 'skip.header.line.count=1' in the table properties fixes it. The crawler classifies correctly but needs the property.

342

MCQmedium

An IAM policy is attached to an AWS Glue job. The job needs to read from and write to S3 buckets, and also trigger other Glue jobs. The job is failing with an AccessDenied error when trying to write to a bucket named 'example-bucket'. What is the MOST likely cause?

A.The policy does not include s3:PutObject action.

B.The bucket name in the resource ARN does not match the actual bucket name.

C.The policy uses a resource ARN with a wildcard, which is not allowed.

D.The policy does not allow Glue actions.

AnswerB

The ARN uses 'example-bucket' but the actual bucket might have a different name.

Why this answer

Option B is correct because the policy allows both GetObject and PutObject on the specified bucket, so the issue is likely that the bucket name is misspelled or the ARN is incorrect. Option D is wrong because the policy does allow Glue actions. Option C is wrong because wildcards are allowed in resource ARNs and the policy uses the correct ARN format.

Option A is wrong because the policy does allow PutObject.

343

MCQhard

A company runs a nightly AWS Glue ETL job that reads from a JDBC source (PostgreSQL) and writes to S3 in Parquet format. The job takes over 6 hours, but the SLA requires completion within 4 hours. The source table has 500 million rows and is updated frequently. Which approach will most reliably reduce job duration?

A.Enable job bookmark and schedule the job to run more frequently.

B.Use multiple JDBC connections in parallel by setting 'hashexpression' and 'hashfield'.

C.Partition the source table by year and use pushdown predicates in the Glue job.

D.Increase the number of DPUs for the Glue job to 100.

AnswerC

This reduces the data scanned by filtering on partition columns.

Why this answer

Option C is correct because partitioning the source table by year and using pushdown predicates allows AWS Glue to read only the relevant partitions from PostgreSQL, drastically reducing the data scanned and transferred. This directly addresses the 500 million row volume and frequent updates by minimizing the JDBC read workload, which is the primary bottleneck in the 6-hour runtime.

Exam trap

The trap here is that candidates often assume increasing DPUs (Option D) or adding parallelism (Option B) will linearly speed up JDBC reads, but they fail to recognize that the bottleneck is the source database's I/O and network throughput, not Glue's compute capacity, and that predicate pushdown is the only option that reduces the data volume at the source.

How to eliminate wrong answers

Option A is wrong because job bookmarks track previously processed data to avoid reprocessing, but they do not reduce the initial full load or the per-run data volume; scheduling more frequently would only compound the problem by running incomplete jobs. Option B is wrong because parallel JDBC reads in AWS Glue are configured with 'hashfield' or 'hashexpression' (one or the other) together with 'hashpartitions', and even then, parallelism alone cannot overcome the I/O bottleneck of scanning 500 million rows without filtering. Option D is wrong because increasing DPUs to 100 may improve compute parallelism for transformations, but the bottleneck is the JDBC read from PostgreSQL, which is constrained by the source database's network and query capacity, not by Glue's compute resources; excessive DPUs can also cause throttling or connection limits.

344

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline to load clickstream data from an Amazon S3 bucket into an Amazon Redshift cluster. The data arrives in 5-minute batches. Which TWO actions should the engineer take to ensure data consistency and avoid duplicates? (Select TWO.)

Select 2 answers

A.Define a SORTKEY on the target table to improve deduplication.

B.Disable workload management (WLM) to maximize resources.

C.Use the STL_LOAD_ERRORS system table to monitor and resolve load errors.

D.Load data into a single slice to maintain order.

E.Use a staging table and perform a MERGE operation to avoid duplicates.

AnswersC, E

Monitoring load errors helps catch and fix issues that could cause duplicates or missing data.

Why this answer

Options C and E are correct. Using the STL_LOAD_ERRORS table helps identify errors, and using a staging table with a MERGE/upsert pattern ensures idempotent loads. Option B (disabling WLM) is unrelated.

Option D (loading into a single slice) reduces performance. Option A (using SORTKEY) improves query performance but not consistency.
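
The staging-table pattern can be illustrated locally. The sketch below uses SQLite's `INSERT OR REPLACE` as a stand-in for a Redshift MERGE, so it shows the idea of an idempotent load rather than Redshift syntax; the table and column names are hypothetical.

```python
import sqlite3


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database with a target table and a staging table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (event_id TEXT PRIMARY KEY, page TEXT)")
    conn.execute("CREATE TABLE staging (event_id TEXT, page TEXT)")
    return conn


def load_batch(conn: sqlite3.Connection, rows: list) -> int:
    """Load one batch via the staging table and return the target row count."""
    conn.execute("DELETE FROM staging")
    conn.executemany("INSERT INTO staging (event_id, page) VALUES (?, ?)", rows)
    # Upsert keyed on event_id: existing rows are replaced, not duplicated.
    conn.execute(
        "INSERT OR REPLACE INTO target (event_id, page) "
        "SELECT event_id, page FROM staging"
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]


if __name__ == "__main__":
    conn = make_connection()
    batch = [("e1", "/home"), ("e2", "/cart")]
    print(load_batch(conn, batch))  # first load
    print(load_batch(conn, batch))  # replayed batch, row count unchanged
```

Because the merge is keyed on a unique event identifier, replaying the same 5-minute batch leaves the target unchanged, which is the property the explanation calls idempotent.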

345

MCQhard

A company uses Amazon Kinesis Data Streams with enhanced fan-out consumers. The stream has 5 shards. Each consumer reads from all shards. The total incoming data rate is 25 MB/s. What is the maximum read throughput per consumer if enhanced fan-out is enabled?

A.2 MB/s

B.10 MB/s

C.1 MB/s

D.5 MB/s

AnswerB

Each shard provides 2 MB/s read capacity with enhanced fan-out; 5 shards = 10 MB/s.

Why this answer

Option B is correct because enhanced fan-out provides each consumer with dedicated 2 MB/s per shard read throughput. With 5 shards, each consumer can read up to 10 MB/s. Option D (5 MB/s) is wrong because that is the total write capacity of the 5 shards.

Option A (2 MB/s) is wrong because that is the read throughput of a single shard. Option C (1 MB/s) is wrong because that is the per-shard write capacity.
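
The arithmetic behind this question is short enough to write out. The sketch below uses the per-shard figures quoted in the explanation (2 MB/s read per consumer with enhanced fan-out, 1 MB/s write); the function names are hypothetical.

```python
EFO_READ_MB_PER_SHARD = 2  # dedicated read throughput per consumer, per shard
WRITE_MB_PER_SHARD = 1     # write (ingest) limit per shard


def efo_read_per_consumer(shards: int) -> int:
    """Maximum read throughput in MB/s for one enhanced fan-out consumer."""
    return shards * EFO_READ_MB_PER_SHARD


def total_write_capacity(shards: int) -> int:
    """Total write capacity of the stream in MB/s."""
    return shards * WRITE_MB_PER_SHARD


if __name__ == "__main__":
    print(efo_read_per_consumer(5))  # 10
    print(total_write_capacity(5))   # 5
```

The same two functions also reproduce the distractors: 5 MB/s is the stream's write capacity, and 2 MB/s is what a single shard gives one consumer.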

346

MCQmedium

A company is ingesting log files from multiple EC2 instances into Amazon S3 using the CloudWatch agent. The logs are delivered to a CloudWatch Logs group, and a subscription filter sends them to a Lambda function for transformation, then to Firehose. The Firehose stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The logs are critical and must be available in S3 within 5 minutes. What is the most cost-effective way to reduce the delivery latency?

A.Replace Firehose with Amazon Kinesis Data Streams

B.Increase the buffer size to 10 MB

C.Increase the buffer interval to 120 seconds

D.Decrease the buffer interval to 10 seconds

AnswerD

Lower buffer interval reduces delivery latency.

Why this answer

Option D is correct because reducing the buffer interval from 60 to 10 seconds directly reduces latency without adding significant cost. Option B (increase buffer size) would increase latency. Option C (increase buffer interval) worsens latency.

Option A (change to Kinesis Data Streams) adds cost and complexity.
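
Firehose flushes its buffer when either the size threshold or the interval threshold is reached, whichever comes first. A simplified model of that rule is sketched below; it ignores Lambda processing and S3 delivery time, and the function name and the sample ingest rates are assumptions for illustration.

```python
def seconds_until_flush(interval_s: float, size_mb: float, rate_mb_per_s: float) -> float:
    """Time until the buffer flushes: whichever threshold is hit first."""
    if rate_mb_per_s <= 0:
        # With no incoming data, only the interval threshold can apply.
        return float(interval_s)
    time_to_fill = size_mb / rate_mb_per_s
    return min(float(interval_s), time_to_fill)


if __name__ == "__main__":
    # Low-volume stream (assumed 0.05 MB/s): the interval dominates.
    print(seconds_until_flush(60, 5, 0.05))  # 60.0
    print(seconds_until_flush(10, 5, 0.05))  # 10.0
    # Raising the size threshold never makes the flush happen sooner.
    print(seconds_until_flush(60, 10, 0.05))  # 60.0
```

For a low-volume stream the interval is the binding limit, which is why shortening it is the lever that reduces delivery latency.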

347

MCQhard

A company runs a nightly batch ETL job using AWS Glue to transform data from Amazon RDS for MySQL to Amazon S3. The job reads 100 tables and writes Parquet files partitioned by date. Recently, the job started failing with 'ThrottlingException' from the RDS database. The data volume has increased, and the Glue job is reading large tables without any filtering. The job uses a single Glue job with multiple Spark executors. The engineer needs to reduce the load on the RDS database while maintaining the same processing time. What should the engineer do?

A.Use AWS DMS to continuously replicate data to S3 and then run Glue on the S3 data.

B.Change the job to read all data from RDS into a staging table in S3 first, then transform.

C.Increase the number of DPUs to process data faster.

D.Modify the Glue job to use a JDBC connection with a WHERE clause to read only the latest partition.

AnswerD

Filtering reduces data read from RDS, reducing throttling.

Why this answer

Option D is correct because using a JDBC connection with a WHERE clause to filter data by date reduces the amount of data read from RDS, reducing throttling. Option C is wrong because increasing DPUs would increase parallelism and potentially increase load on RDS. Option A is wrong because DMS is a migration tool, not for regular ETL.

Option B is wrong because reading all data into S3 first does not reduce load on RDS; the initial read still causes throttling.

348

Multi-Selectmedium

A company uses AWS Glue to perform ETL on data stored in Amazon S3. The Glue job reads CSV files, converts them to Parquet, and partitions by date. The job runs daily and processes about 500 GB of data. The team wants to optimize costs and performance. Which three actions should the team take? (Select THREE.)

Select 3 answers

A.Increase the Spark shuffle partitions to 500.

B.Use column pruning to read only necessary columns in the Glue script.

C.Use G.1X or G.2X worker types for better performance.

D.Increase the number of DPUs for the job.

E.Write the output as JSON instead of Parquet to avoid compression overhead.

AnswersB, C, D

Reduces data scanned and improves performance.

Why this answer

Option B is correct because column pruning in AWS Glue scripts reduces the amount of data read from Amazon S3 by specifying only the columns needed for the ETL transformation. This minimizes I/O and network overhead, directly lowering costs and improving job performance, especially when processing large CSV files.

Exam trap

The trap here is that candidates often confuse increasing DPUs or shuffle partitions as a universal performance fix, but AWS Glue's cost optimization relies on reducing data processed (column pruning) and choosing appropriate worker types for the workload, not simply scaling resources.

349

MCQeasy

A company needs to transfer 10 TB of historical data from an on-premises HDFS cluster to Amazon S3. The data is stored on a single 20 TB disk. The network link to AWS has a bandwidth of 1 Gbps. The transfer must be completed within 2 days. Which solution meets these requirements?

A.Use AWS Snowball Edge to transfer the data physically.

B.Use Amazon Kinesis Data Streams to stream data to S3.

C.Use AWS DMS to migrate data from HDFS to S3.

D.Use AWS CLI to copy data directly to S3 over the network.

AnswerA

Snowball Edge provides fast, reliable transfer for large datasets.

Why this answer

Option A is correct because AWS Snowball Edge can physically ship the data, bypassing network bandwidth limits. Option D is wrong because, although a 1 Gbps link sustained for 2 days can in theory move about 21.6 TB, which is more than the 10 TB required, network instability and other traffic may cause delays. Option C is wrong because AWS DMS is for database migration, not HDFS.

Option B is wrong because Kinesis is for streaming, not bulk transfer.
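
The 21.6 TB figure can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a fully utilized link unless a lower utilization is passed in; the function names are hypothetical.

```python
def transferable_tb(link_gbps: float, days: float) -> float:
    """Terabytes a link can move in the given number of days at full utilization."""
    bytes_per_second = link_gbps * 1e9 / 8
    seconds = days * 24 * 3600
    return bytes_per_second * seconds / 1e12


def hours_needed(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Hours needed to move data_tb over the link at the given utilization (0 to 1)."""
    bytes_per_second = link_gbps * 1e9 / 8 * utilization
    return data_tb * 1e12 / bytes_per_second / 3600


if __name__ == "__main__":
    print(transferable_tb(1, 2))      # theoretical ceiling for 2 days at 1 Gbps
    print(hours_needed(10, 1))        # 10 TB at full line rate
    print(hours_needed(10, 1, 0.5))   # 10 TB if only half the link is usable
```

Passing a utilization below 1.0 shows how quickly the margin shrinks when the link is shared with other traffic.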

350

MCQhard

A company runs a data ingestion pipeline that uses AWS Glue to read 500 GB of JSON files from an S3 bucket (s3://raw-data/) every hour. The Glue ETL job transforms the data and writes Parquet files to another S3 bucket (s3://processed-data/). The job is triggered by a time-based CloudWatch Events rule. Recently, the job has started taking over 2 hours to complete, causing delays in downstream processes. The data volume has been consistent, and no changes have been made to the job code or infrastructure. The S3 bucket 's3://raw-data/' receives new files continuously, but the Glue job reads all files in the bucket each run (no incremental processing). The engineer suspects that the job is reprocessing old data. Which action should the engineer take FIRST to reduce the job duration?

A.Enable Glue job bookmarking and configure the job to process only new data.

B.Increase the parallelism of the Spark job by repartitioning the data.

C.Add partition pruning by modifying the S3 path to include date-based partitions.

D.Increase the number of DPUs for the Glue job to 100.

AnswerA

Bookmarking tracks processed files, so subsequent runs only process new files, drastically reducing runtime.

Why this answer

Option A is correct because enabling job bookmarking in Glue allows the job to process only new files since the last run, dramatically reducing processing time. Option D (increasing DPUs) would help with resource constraints but does not address the root cause of reprocessing old data. Option B (increasing parallelism) may help but not as much as eliminating reprocessing.

Option C (using partition pruning) assumes the data is partitioned, but the stem says all files are read; partitioning might not help if files are not organized by time. The most impactful first step is to enable bookmarking.
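
The effect of bookmarking can be shown with a conceptual sketch. This is not how AWS Glue implements job bookmarks internally; it only models the idea of remembering which objects earlier runs have handled, and every name in it is hypothetical.

```python
def run_job(all_keys: list, bookmark: set) -> list:
    """Process only keys not seen in earlier runs, then record them in the bookmark."""
    new_keys = sorted(k for k in all_keys if k not in bookmark)
    # A real job would read and transform the new objects here.
    bookmark.update(new_keys)
    return new_keys


if __name__ == "__main__":
    bookmark = set()
    bucket = ["raw/file-001.json", "raw/file-002.json"]
    print(run_job(bucket, bookmark))  # both files on the first run

    bucket.append("raw/file-003.json")
    print(run_job(bucket, bookmark))  # only the new file on the second run
```

Without the bookmark, every run would scan the full, ever-growing list, which matches the slowdown described in the question.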

351

MCQmedium

A data engineer is designing a data ingestion pipeline to load data from an on-premises Oracle database to Amazon S3. The pipeline should capture changes in near real-time (within minutes) and minimize impact on the source database. The source table has a 'last_modified' timestamp column. Which service combination would meet these requirements?

A.AWS DMS with a replication task in CDC mode, writing to S3 in Parquet format.

B.Amazon Kinesis Data Firehose with a Lambda function that queries Oracle.

C.AWS Data Pipeline with a periodic SQL query activity to copy full table snapshots.

D.AWS Glue with a JDBC connection to Oracle, running a crawler every 5 minutes.

AnswerA

DMS CDC captures minimal changes and writes to S3 with low latency.

Why this answer

Option A is correct because AWS Database Migration Service (DMS) with change data capture (CDC) can capture changes from Oracle in near real-time and write to S3. Option D is wrong because Glue does not support CDC. Option C is wrong because Data Pipeline is batch-oriented.

Option B is wrong because Firehose does not connect to databases directly.

352

MCQhard

A data engineer is designing a data ingestion pipeline for clickstream data from a mobile app. The data volume varies, with occasional spikes up to 10 MB/s. The pipeline must persist the raw data in Amazon S3 and make it available for near-real-time analytics via Amazon Athena. Which combination of services minimizes cost and operational overhead?

A.Amazon Kinesis Data Streams with Amazon Kinesis Data Analytics, then Amazon S3

B.Amazon SQS with an Auto Scaling group of EC2 instances writing to Amazon S3

C.Amazon Kinesis Data Streams with AWS Lambda for transformation, then Amazon S3

D.Amazon Kinesis Data Firehose with direct delivery to Amazon S3, then Amazon Athena

AnswerD

Firehose is fully managed, scales automatically, and delivers to S3.

Why this answer

Option D is correct because Firehose can buffer and deliver data to S3 with minimal configuration, and Athena can query directly from S3. Option C (Kinesis Data Streams + Lambda) incurs higher costs and management overhead. Option B (SQS + EC2) requires managing EC2 instances and auto-scaling.

Option A (Kinesis Data Streams + Kinesis Data Analytics) adds complexity and cost.

353

Multi-Selecthard

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a streaming ingestion architecture? (Choose 3.)

Select 3 answers

A.Ability to deliver data directly to Amazon Redshift

B.Data retention period longer than 7 days

C.Automatic scaling of ingestion throughput

D.Lower cost per GB ingested

E.Need for custom data processing using AWS Lambda or KCL

AnswersA, C, E

Firehose can deliver to Redshift; KDS requires additional services.

Why this answer

KDS offers custom processing with shard-level control, while Firehose auto-scales and can deliver to S3, Redshift, etc. KDS requires consumer management; Firehose is managed.

354

MCQeasy

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 on a nightly basis. The data volume is approximately 10 GB per night. The database is accessible over the internet. Which AWS service is MOST appropriate for this task?

A.AWS Glue ETL job with a JDBC connection

B.AWS DataSync

C.AWS Transfer Family

D.Amazon Kinesis Data Streams

Why this answer

AWS Database Migration Service (DMS) is designed for migrating databases to AWS and can continuously replicate changes. For nightly batch loads, DMS with a full load or CDC is ideal. Option B (AWS DataSync) is for file transfers, not databases.

Option A (AWS Glue) can connect to JDBC sources but is more suitable for ETL; DMS is simpler for database migration. Option D (Amazon Kinesis Data Streams) is for real-time streaming. Option C (AWS Transfer Family) is for file transfers over SFTP/FTPS.

355

MCQeasy

A data engineer needs to ingest streaming data from thousands of IoT devices and immediately process each record with minimal latency. Which AWS service should be used as the ingestion point?

A.AWS Lambda

B.Amazon S3

C.Amazon Kinesis Data Streams

D.AWS Glue

AnswerC

Kinesis Data Streams ingests streaming data with low latency and can be consumed by multiple applications.

Why this answer

Amazon Kinesis Data Streams is designed for real-time streaming data ingestion with low latency. AWS Glue is for batch ETL, S3 is object storage, and Lambda is compute but not an ingestion endpoint itself.

356

MCQhard

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is transformed using an AWS Lambda function. Recently, the transformation errors have increased due to Lambda timeouts. The data engineer needs to diagnose and resolve the issue without losing data. What should the engineer do?

A.Increase the Lambda function timeout and ensure that failed records are sent to a backup S3 bucket

B.Enable Amazon CloudWatch Logs for the Lambda function to capture errors and store failed records in CloudWatch

C.Configure the Lambda function to write failed records to an Amazon SQS queue for later reprocessing

D.Modify the Lambda function to store failed records in Amazon S3 before processing

AnswerA

Increasing timeout reduces failures, and configuring a backup bucket prevents data loss.

Why this answer

Option A is correct because increasing the Lambda timeout gives the function more time to complete, and data is not lost if the function fails because Firehose can retry the invocation or send failed records to a backup S3 bucket. Option B is wrong because CloudWatch Logs captures logs but does not store failed records. Option C is wrong because a custom SQS path inside the function is unnecessary; Firehose handles failed records internally.

Option D is wrong because the transformation function is not the place to persist records; Firehose handles delivery of failed records to S3.

357

MCQeasy

The exhibit shows the output of describing an Amazon Kinesis Data Stream. A producer is sending records but the consumer is not receiving all records. What is the most likely cause?

A.The stream has only one shard, causing write throttling

B.The stream is in ACTIVE status, which prevents reading

C.The retention period is too short

D.The hash key range is too wide

AnswerA

With one shard, write throughput is limited; exceeding it causes throttling and missed records.

Why this answer

The stream has only one shard, which provides a maximum throughput of 1 MB/s or 1000 records/s for writes. If the producer exceeds this, records will be throttled. The retention period is 24 hours, which is fine.

The stream status is ACTIVE. There is no indication of a faulty shard. The consumer might be slow, but the question asks for cause of not receiving all records; throttling due to insufficient shards is a common issue.
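
The per-shard write limits quoted above (1 MB/s or 1000 records/s) can be turned into a quick check. The sketch assumes records are spread evenly across shards, which a skewed partition key would violate; the function name and the sample rates are hypothetical.

```python
MAX_WRITE_MB_PER_SHARD = 1.0
MAX_WRITE_RECORDS_PER_SHARD = 1000


def exceeds_write_limits(shards: int, mb_per_s: float, records_per_s: float) -> bool:
    """True if the producer rate is above the stream's aggregate write limits."""
    over_bytes = mb_per_s > shards * MAX_WRITE_MB_PER_SHARD
    over_records = records_per_s > shards * MAX_WRITE_RECORDS_PER_SHARD
    return over_bytes or over_records


if __name__ == "__main__":
    # One shard, producer sending 1.5 MB/s: writes would be throttled.
    print(exceeds_write_limits(1, 1.5, 400))   # True
    # Same producer against two shards fits within the limits.
    print(exceeds_write_limits(2, 1.5, 400))   # False
```

Either limit alone is enough to trigger throttling, so a stream of many small records can be throttled even when its byte rate is low.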


358

MCQeasy

A company needs to ingest data from a MySQL database into Amazon S3 in near real-time. The database is running on EC2. The data engineer wants to minimize the impact on the source database. Which service should be used?

A.AWS Database Migration Service (DMS) with ongoing replication

B.AWS Glue ETL job with a JDBC connection

C.Amazon RDS for MySQL with read replica

D.AWS Schema Conversion Tool (SCT)

AnswerA

DMS CDC uses binary logs to capture changes with minimal overhead.

Why this answer

AWS DMS with ongoing replication (change data capture) is the correct choice because it can continuously replicate changes from a MySQL source database to Amazon S3 with minimal performance impact. DMS uses a transactional log-based approach (MySQL binlog) to capture changes as they occur, avoiding heavy SELECT queries on the source. This enables near real-time ingestion without adding significant load to the production database.

Exam trap

The trap here is that candidates often confuse AWS Glue's batch JDBC capabilities with streaming ingestion, or assume that a read replica can directly feed data into S3 without an intermediary service like DMS or Kinesis.

How to eliminate wrong answers

Option B is wrong because AWS Glue ETL jobs with JDBC connections run batch queries that pull full table snapshots or large result sets, which can cause significant performance degradation on the source MySQL database and cannot achieve near real-time latency. Option C is wrong because Amazon RDS for MySQL with a read replica is a database migration or read scaling solution, not a data ingestion service to S3; it does not natively stream data to S3 without additional tooling. Option D is wrong because AWS Schema Conversion Tool (SCT) is designed for converting database schemas between different database engines (e.g., Oracle to Aurora), not for ingesting data into S3.


359

MCQhard

A data pipeline uses Amazon Kinesis Data Firehose to ingest log data from web servers and deliver it to Amazon S3. The data is then transformed by an AWS Glue job before being loaded into Amazon Redshift. The pipeline must handle a sudden spike in log volume without data loss. Which configuration change is MOST appropriate?

A.Increase the AWS Glue job timeout and allocate more DPUs.

B.Configure Kinesis Data Firehose to back up all data to S3 in case of delivery failures.

C.Increase the number of nodes in the Redshift cluster to handle higher load.

D.Increase the S3 bucket size limit and enable versioning.

AnswerB

S3 backup for failed records ensures no data loss.

Why this answer

Option B is correct because enabling S3 backup for failed records in Firehose ensures that if delivery fails after retries, the data is stored in S3 and can be reprocessed later, preventing data loss. Option A is wrong because increasing the Glue job timeout and DPUs does not prevent data loss during ingestion. Option C is wrong because adding Redshift nodes does not affect the ingestion pipeline.

Option D is wrong because S3 has no bucket size limit to increase, and versioning does not prevent data loss during ingestion.


360

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline that uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The engineer wants to ensure that the data is organized in a directory structure by year, month, day, and hour. Which TWO configurations should the engineer set on the Firehose delivery stream? (Choose TWO.)

Select 2 answers

A.Enable dynamic partitioning

B.Set a custom prefix with '!{timestamp:yyyy}/!{timestamp:MM}/!{timestamp:dd}/!{timestamp:HH}/'

C.Use an AWS Lambda function to write to S3 with the desired prefix

D.Enable format conversion to Parquet

E.Configure an S3 bucket with versioning enabled

AnswersA, B

Dynamic partitioning allows Firehose to partition data based on keys.

Why this answer

Correct options: A and B. Option A is correct because enabling dynamic partitioning allows Firehose to use partition keys from the data. Option B is correct because the custom prefix configuration specifies the directory structure with expressions like '!{timestamp:yyyy}/!{timestamp:MM}/...'.

Option C is wrong because a Lambda function is not required for partitioning. Option D is wrong because data format conversion is separate from partitioning. Option E is wrong because versioning does not change the key layout; S3 objects have keys, not directories, and the prefix defines the path.
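To make the prefix in option B concrete, here is a minimal local sketch of how such an expression could expand for one arrival time. It is not Firehose code: the helper render_prefix and the token table are hypothetical, only the four tokens from option B are handled, and UTC is assumed.

```python
from datetime import datetime, timezone

# Map the Java-style date tokens used in the prefix to strftime codes.
# Only the four tokens that appear in option B are handled here.
_TOKEN_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def render_prefix(prefix: str, arrival: datetime) -> str:
    """Expand !{timestamp:TOKEN} expressions for one arrival time (UTC assumed)."""
    arrival = arrival.astimezone(timezone.utc)
    for token, code in _TOKEN_TO_STRFTIME.items():
        prefix = prefix.replace("!{timestamp:" + token + "}", arrival.strftime(code))
    return prefix


PREFIX = "!{timestamp:yyyy}/!{timestamp:MM}/!{timestamp:dd}/!{timestamp:HH}/"
example = render_prefix(PREFIX, datetime(2024, 3, 7, 9, 30, tzinfo=timezone.utc))
print(example)  # 2024/03/07/09/
```

An object delivered at that time would land under the key prefix 2024/03/07/09/, which is the year/month/day/hour layout the question asks for.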


361

MCQhard

Refer to the exhibit. A data engineer runs the describe-stream command and sees this output. The application is writing records to the stream but is experiencing high write latency. The average record size is 50 KB, and the write rate is 1500 records per second. What is the MOST likely cause of the latency?

A.The application is running in a different AWS region.

B.The application is exceeding the DynamoDB provisioned throughput.

C.The Kinesis stream is throttling the application because of a hot shard.

D.The stream does not have enough shards to handle the write throughput.

AnswerD

2 shards provide 2 MB/s write capacity; the application requires 75 MB/s.

Why this answer

Option D is correct. Each shard can ingest up to 1 MB/s or 1,000 records/s. With 2 shards, the stream accepts up to 2,000 records/s, which covers the record rate, but only 2 MB/s in total. At 50 KB per record, the application writes 1,500 × 50 KB = 75 MB/s, far exceeding that 2 MB/s write capacity.

The shards are overloaded. Option A is wrong because nothing in the scenario indicates a cross-region setup, and region placement would not explain this throughput gap. Option B is wrong because DynamoDB provisioned throughput is not relevant to Kinesis writes.

Option C is wrong because a hot shard means uneven key distribution, whereas here the total load exceeds the capacity of the whole stream.
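The arithmetic in this explanation can be checked with a short sketch. The function shards_needed is a hypothetical helper, and it takes 1 MB as 1,000 KB, as the explanation does; it only applies the two per-shard write limits quoted above.

```python
import math

# Per-shard write limits quoted in the explanation.
SHARD_WRITE_KB_PER_SEC = 1000       # 1 MB/s, taking 1 MB = 1000 KB (assumption)
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(records_per_sec: int, avg_record_kb: float) -> int:
    """Return the minimum shard count that satisfies both write limits."""
    by_bytes = math.ceil(records_per_sec * avg_record_kb / SHARD_WRITE_KB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(by_bytes, by_records, 1)


required = shards_needed(records_per_sec=1500, avg_record_kb=50)
print(required)  # 75
```

The byte limit, not the record limit, drives the result: the record rate alone would need only 2 shards, while the data volume needs 75.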


362

MCQmedium

A data engineer is designing a data ingestion pipeline for real-time clickstream data using Amazon Kinesis Data Streams. The data must be transformed using AWS Lambda and then stored in Amazon S3 in Parquet format. Which Lambda event source mapping configuration should be used to minimize the number of Lambda invocations while ensuring data is processed within 60 seconds?

A.Set batch size to 100 records and disable batch window

B.Set batch size to 10000 records and batch window to 60 seconds

C.Set batch size to 100 records and batch window to 0 seconds

D.Set batch size to 10000 records and batch window to 5 seconds

AnswerB

Large batch size and 60-second window reduce invocations.

Why this answer

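Option B is correct because a large batch size combined with a longer batch window (the maximum is 300 seconds) reduces the number of invocations, and a 60-second window still meets the 60-second processing requirement. Option A (small batch size, no batch window) increases invocations. Option C (small batch size, 0-second window) also increases invocations.

Option D (5-second batch window) invokes the function far more often than the 60-second requirement demands.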
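As a rough illustration of option B, the sketch below builds the two batching settings as a plain dictionary. The key names follow the BatchSize and MaximumBatchingWindowInSeconds parameters of the Lambda event source mapping API; the helper build_mapping_settings and the limit constants are assumptions for this example, and nothing here calls AWS.

```python
# Upper bounds assumed for Kinesis event sources (the explanation cites 300 seconds).
MAX_BATCH_SIZE = 10_000
MAX_BATCH_WINDOW_SECONDS = 300


def build_mapping_settings(batch_size: int, batch_window_seconds: int) -> dict:
    """Validate and return batching settings for a Kinesis event source mapping."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    if not 0 <= batch_window_seconds <= MAX_BATCH_WINDOW_SECONDS:
        raise ValueError(
            f"batch_window_seconds must be between 0 and {MAX_BATCH_WINDOW_SECONDS}"
        )
    return {
        "BatchSize": batch_size,
        "MaximumBatchingWindowInSeconds": batch_window_seconds,
    }


# Option B: the largest batch, with a window equal to the 60-second requirement.
settings = build_mapping_settings(batch_size=10_000, batch_window_seconds=60)
print(settings)
```

Lambda is invoked when the batch fills or the window expires, whichever comes first, so a larger window only lowers the invocation count when traffic is too light to fill the batch.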


363

MCQhard

A data engineer is designing a streaming pipeline that ingests IoT sensor data from 10,000 devices. Each device sends a 1 KB message every second. The data must be processed in near real-time and stored in S3 for analytics. Which combination of services provides the most cost-effective solution?

A.AWS Data Pipeline with periodic S3 copy.

B.Amazon Kinesis Data Streams with Kinesis Data Firehose delivery to S3.

C.Amazon MSK (Managed Streaming for Kafka) with Kafka Connect S3 sink.

D.Amazon SQS FIFO queue with Lambda consumers writing to S3.

AnswerB

Handles high throughput, Firehose batches to S3.

Why this answer

Option B is correct because Kinesis Data Streams ingests high-throughput data, and Firehose delivers batches to S3 with optional Lambda transformations. Option A is wrong because AWS Data Pipeline is batch-oriented. Option C is wrong because Amazon MSK requires more management.

Option D is wrong because SQS FIFO is not designed for high-throughput streaming.


364

MCQhard

A team is designing a data ingestion pipeline to load JSON files from an Amazon S3 bucket into Amazon Redshift. The files arrive every 5 minutes, and each file is between 10 MB and 50 MB. The team wants to minimize the time between file arrival and data availability in Redshift. Which approach should the team use?

A.Schedule an AWS Glue job to run every 5 minutes to load the data.

B.Use S3 Event Notifications to trigger an AWS Lambda function that runs the COPY command to load data into Redshift.

C.Use Amazon Redshift Spectrum to query the data directly from S3 without loading.

D.Configure Amazon Kinesis Data Firehose to stream data from S3 to Redshift.

AnswerB

Lambda responds quickly to S3 events and runs COPY for efficient bulk loading.

Why this answer

Option B is correct: S3 Event Notifications invoke a Lambda function that runs the Redshift COPY command as soon as each file arrives, which gives near-real-time ingestion with minimal latency. Option A (scheduled AWS Glue job) runs on a fixed five-minute schedule rather than on file arrival, so data can wait for the next run. Option C (Redshift Spectrum) queries data in S3 without loading it, so the data is never available in Redshift tables for fast querying.

Option D (Kinesis Data Firehose) can load into Redshift but is built for streaming records, not for batch files that already sit in S3.


365

MCQmedium

A data engineering team needs to ingest CSV files from an S3 bucket into a Redshift cluster on a daily basis. The files are large (up to 100 GB each). Which approach is MOST cost-effective and efficient?

A.Use Amazon Kinesis Data Firehose to load data into Redshift.

B.Use AWS Data Pipeline to copy data from S3 to Redshift.

C.Use a Lambda function to read each file and insert rows individually.

D.Use the Redshift COPY command with an IAM role to load data directly from S3.

AnswerD

COPY command is optimized for bulk loading large datasets from S3 into Redshift.

Why this answer

Using the COPY command with an IAM role is the recommended way to load large datasets into Redshift efficiently. Option A (Kinesis Data Firehose) is for streaming; Option B (AWS Data Pipeline) adds cost and complexity; Option C (Lambda with row-by-row inserts) runs into Lambda's time and memory limits.
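For reference, a COPY statement of the kind described here might be assembled as follows. This is only a sketch: the helper build_copy_statement, the table name, the bucket path, and the role ARN are all hypothetical placeholders, and the string is built locally without connecting to Redshift.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement for CSV files under an S3 prefix.

    Plain string formatting is used for readability; values must come from
    trusted configuration, not from user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV;"
    )


# All three arguments below are illustrative placeholders.
statement = build_copy_statement(
    table="sales_daily",
    s3_uri="s3://example-bucket/daily/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

Pointing COPY at a prefix rather than a single object lets Redshift load multiple files in parallel, which is why splitting very large files is commonly recommended.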


366

Multi-Selecthard

A data engineering team is designing a near-real-time data ingestion pipeline for IoT sensor data. The data must be processed and stored in Amazon S3, with transformations applied before storage. The team needs to handle potential duplicates and ensure exactly-once processing semantics. Which TWO AWS services should be used together? (Choose TWO.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.Amazon Simple Queue Service (SQS)

C.Amazon Kinesis Data Analytics for Apache Flink

D.Amazon Kinesis Data Streams

E.AWS Database Migration Service (DMS)

AnswersC, D

Flink can provide exactly-once semantics with checkpointing.

Why this answer

Options C and D are correct. Kinesis Data Streams provides ordered data and integrates with Kinesis Data Analytics for Apache Flink, which supports exactly-once semantics. Option A (Firehose) provides at-least-once delivery.

Option B (SQS) is for decoupling but does not provide exactly-once processing. Option E (DMS) is for database replication.


367

MCQmedium

A data engineer needs to ingest streaming data from thousands of IoT devices into AWS for near-real-time analytics. The data volume varies significantly and can spike unpredictably. The engineer wants to minimize operational overhead and ensure that data is durably stored as soon as it arrives. Which AWS service combination should the engineer use?

A.Use Amazon S3 Transfer Acceleration with S3 Event Notifications to trigger AWS Lambda for processing.

B.Use Amazon Kinesis Data Firehose to ingest data into Amazon S3 and use AWS Lambda to transform data during delivery.

C.Use Amazon Simple Queue Service (SQS) to buffer the streaming data and configure an Auto Scaling group of EC2 instances to poll and process the data.

D.Use Amazon Kinesis Data Streams to ingest the data and AWS Lambda to process records in real-time with automatic scaling.

AnswerD

Kinesis Data Streams provides durable, scalable, low-latency ingestion; Lambda can process each shard in parallel and scales automatically.

Why this answer

Option D is correct because Kinesis Data Streams provides durable, scalable ingestion for streaming data, and Lambda can process records in near-real-time with automatic scaling. Option A is wrong because S3 Transfer Acceleration is for accelerating uploads to S3, not for streaming ingestion. Option B is wrong because Kinesis Data Firehose is designed for loading streaming data into destinations like S3 but does not offer sub-second latency and has buffering delays.

Option C is wrong because SQS is a message queue that decouples producers and consumers but does not natively support streaming data partitioning or replay.


368

MCQmedium

A data engineer needs to ingest data from an Amazon RDS for MySQL database into Amazon S3 on a daily basis. The data volume is about 50 GB per day. The engineer wants to minimize the impact on the source database. Which AWS service should be used?

A.AWS Glue with a JDBC connection

B.Amazon Athena Federated Query

C.AWS Database Migration Service (DMS)

D.AWS DataSync

AnswerC

DMS is optimized for database migrations with minimal impact.

Why this answer

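Option C is correct because AWS DMS can perform full load and ongoing replication with minimal impact on the source. Option A is wrong because Glue's JDBC connection runs queries against the source and can impact it. Option B is wrong because Athena does not read from RDS directly; a federated query needs a connector and still runs against the source database.

Option D is wrong because DataSync is for file storage transfers, not database ingestion.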


369

Multi-Selectmedium

A company is ingesting IoT sensor data from thousands of devices using Amazon Kinesis Data Streams. The data is consumed by a Lambda function that transforms and writes to Amazon S3. The company notices that occasionally records are dropped. The data engineer needs to identify the cause and prevent data loss. Which TWO actions should the data engineer take? (Choose TWO.)

Select 2 answers

A.Enable CloudWatch Logs on the Kinesis stream to log all records.

B.Decrease the Lambda batch size to process records more frequently.

C.Add an Amazon SQS queue between Kinesis and Lambda to buffer records.

D.Increase the number of shards in the Kinesis data stream.

E.Configure a dead-letter queue on the Lambda function to capture failed records.

AnswersD, E

More shards provide higher throughput, reducing throttling.

Why this answer

Options D and E are correct. Option D: increasing the number of shards raises the stream's throughput capacity. Option E: configuring a dead-letter queue (DLQ) for the Lambda function captures failed records.

Option A is wrong because CloudWatch Logs does not prevent data loss. Option B is wrong because the batch size is not the cause of the dropped records. Option C is wrong because SQS is not placed between Kinesis and Lambda; Lambda polls the stream directly.


370

MCQhard

A data engineer is troubleshooting a Lambda function that reads from a Kinesis Data Stream, processes records, and writes to a Kinesis Data Firehose delivery stream. The Firehose delivery stream is configured to deliver data to an S3 bucket. The Lambda function is failing with an access denied error. The IAM policy attached to the Lambda execution role is shown in the exhibit. Which permission is missing?

A.kinesis:PutRecord on the Kinesis stream

B.firehose:DescribeDeliveryStream on the Firehose delivery stream

C.logs:CreateLogGroup and logs:CreateLogStream on the CloudWatch log group

D.s3:PutObjectAcl on the S3 bucket

AnswerB

The Lambda function needs to describe the Firehose delivery stream to obtain the endpoint before putting records; the policy only allows PutRecord and PutRecordBatch but not DescribeDeliveryStream.

Why this answer

Option B is correct because the Lambda function needs permission to describe the Firehose delivery stream to get the endpoint; kinesis:DescribeStream is already allowed for the Kinesis stream, but firehose:DescribeDeliveryStream is missing for the Firehose resource. Option A is wrong because the Kinesis actions are allowed.

Option C is wrong because the error concerns access to Firehose, not CloudWatch Logs. Option D is wrong because s3:PutObject is already allowed.


371

MCQeasy

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time processing. Each device sends JSON payloads of about 2 KB at a rate of 1 message per second. The data must be processed with a durable, ordered stream per device. Which service should the company use as the ingestion layer?

A.Amazon Simple Queue Service (Amazon SQS) with a FIFO queue.

B.Amazon Kinesis Data Streams.

C.Amazon Simple Notification Service (Amazon SNS) with a Lambda subscriber.

D.Amazon Kinesis Data Firehose with Direct Put.

AnswerB

Provides ordered, durable streaming.

Why this answer

Option B is correct because Kinesis Data Streams provides ordered data per shard, and using the device ID as the partition key keeps each device's records on one shard. Option A is wrong because an SQS FIFO queue, although ordered within a message group, is a queue rather than a durable stream that consumers can replay. Option C is wrong because SNS is pub/sub and doesn't support ordered streams.

Option D is wrong because Firehose is for delivery, not ordered processing.
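Per-device ordering follows from how a partition key selects a shard: Kinesis hashes the key with MD5 into a 128-bit value, and each shard owns a range of that hash space. The sketch below imitates this locally, assuming the shards split the range evenly; shard_for_key and the device IDs are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming equal hash-key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # Scale the 128-bit hash into [0, shard_count).
    return hash_key * shard_count // HASH_SPACE


# Hypothetical device IDs used as partition keys on a 4-shard stream.
devices = ["device-001", "device-002", "device-003"]
assignments = {device: shard_for_key(device, shard_count=4) for device in devices}
print(assignments)
```

Because the mapping is deterministic, every record from one device lands on the same shard, and records within a shard are read in the order they were written.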


372

MCQmedium

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data is then processed by an AWS Lambda function that transforms the records and writes them to an Amazon S3 bucket. Recently, the Lambda function has been timing out and the S3 bucket is not receiving all expected data. The Kinesis stream is not throttling and has sufficient shards. Which step should the company take to resolve this issue?

A.Increase the Lambda function's reserved concurrency.

B.Increase the Lambda function's timeout and memory allocation.

C.Increase the number of shards in the Kinesis stream.

D.Enable enhanced fan-out on the Kinesis stream to reduce latency.

AnswerB

Increasing timeout and memory can prevent timeouts and improve processing speed.

Why this answer

Option B is correct because increasing the Lambda function's timeout and memory allows it to handle larger or more frequent records within the execution duration. Option A is wrong because there is no mention of Lambda hitting concurrency limits. Option C is wrong because increasing shards does not address Lambda timeouts.

Option D is wrong because Kinesis read throughput is not the bottleneck.


373

MCQhard

A CloudFormation template defines an AWS Glue job. The job fails during execution with the error 'Unable to locate script: s3://scripts-bucket/etl-script.py'. The S3 bucket 'scripts-bucket' exists and the script file is present. What is the most likely cause?

A.The script location path is incorrect; it should include the bucket's region.

B.The IAM role for the Glue job does not have s3:GetObject permission on the scripts bucket.

C.The Glue job requires Python version 2, but the script uses Python 3 syntax.

D.The S3 bucket is in a different AWS region than the Glue job.

AnswerB

Glue needs to read the script from S3.

Why this answer

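Option B is correct: the Glue job's IAM role likely lacks s3:GetObject permission for the scripts bucket, so Glue cannot read the script even though it exists. Option A is wrong because the script location already uses the correct s3:// syntax and does not need a region.

Option C is wrong because Glue supports Python 3. Option D is wrong because nothing in the scenario indicates that the bucket is in a different region.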


374

MCQmedium

A company uses AWS Data Pipeline to copy data from DynamoDB to S3 daily. Recently, the pipeline started failing with 'ThrottlingException' errors. The DynamoDB table has on-demand capacity. Which action should be taken to resolve the issue?

A.Increase the write capacity units of the DynamoDB table.

B.Replace Data Pipeline with AWS Glue using a DynamoDB connector.

C.Configure the pipeline to use a retry strategy with exponential backoff.

D.Disable the pipeline's retry logic and increase the timeout.

AnswerC

Retries with backoff alleviate throttling by slowing down requests.

Why this answer

Option C is correct because ThrottlingException errors in AWS Data Pipeline when reading from DynamoDB indicate that the pipeline's read requests are exceeding the table's available throughput. Since the table uses on-demand capacity, which can handle spikes but has a per-second throughput limit, implementing exponential backoff in the pipeline's retry strategy allows it to reduce request rate upon throttling, aligning with AWS SDK best practices for handling DynamoDB throttling.

Exam trap

The trap here is that candidates assume on-demand capacity eliminates all throttling, but it only handles traffic spikes within a per-second limit, so throttling can still occur with sustained high read rates, and the correct fix is to implement exponential backoff in the pipeline's retry strategy rather than modifying capacity or switching tools.

How to eliminate wrong answers

Option A is wrong because DynamoDB on-demand capacity does not use provisioned write capacity units; increasing write capacity units is irrelevant and would require switching to provisioned mode, which is unnecessary. Option B is wrong because replacing Data Pipeline with AWS Glue using a DynamoDB connector does not inherently resolve throttling; Glue also uses the same DynamoDB read APIs and would face the same throttling issue without proper retry handling. Option D is wrong because disabling retry logic and increasing the timeout would cause the pipeline to fail permanently on the first throttling error, as it would not retry the request, and a longer timeout does not prevent throttling.
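The retry strategy in option C can be sketched in a few lines. This is a generic client-side illustration, not AWS Data Pipeline configuration: ThrottledError, call_with_backoff, and the delay values are hypothetical, and the sleep function is injectable so the logic can be exercised without waiting.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error; hypothetical, not an AWS SDK class."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps parallel workers from retrying in lockstep.
            sleep(random.uniform(0, ceiling))


# Usage: a fake read that is throttled twice before it succeeds.
calls = {"count": 0}
waits = []


def flaky_read():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ThrottledError("simulated throttling")
    return "ok"


result = call_with_backoff(flaky_read, sleep=waits.append)
print(result, len(waits))  # ok 2
```

Each throttled attempt lengthens the possible wait before the next one, which lowers the request rate until the table can absorb it again.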


375

MCQmedium

A real-time analytics application uses Amazon Kinesis Data Streams. The consumer application falls behind, causing increased latency. Which action would MOST effectively improve throughput?

A.Reduce the RecordMaxBufferedTime parameter in the Firehose delivery stream.

B.Increase the number of shards in the data stream.

C.Increase the batch size in the Kinesis Producer Library.

D.Use enhanced fan-out to dedicate a shard per consumer.

AnswerB

More shards increase parallelism and throughput capacity.

Why this answer

Increasing the number of shards in the Kinesis Data Stream directly increases the stream's read capacity (each shard supports up to 2 MB/s read and 5 transactions per second for shared throughput). This allows the consumer application to process more data in parallel, reducing the backlog and latency. The question specifies a consumer application falling behind, which is a read-throughput bottleneck, and scaling shards is the most effective way to address it.

Exam trap

The trap here is that candidates confuse producer-side optimizations (like KPL batch size or Firehose buffering) with consumer-side throughput issues, or they assume enhanced fan-out alone solves a shard-scaling problem without recognizing that the root cause is insufficient shard count for the consumer's processing rate.

How to eliminate wrong answers

Option A is wrong because RecordMaxBufferedTime is a producer-side buffering setting (it belongs to the Kinesis Producer Library, not to a Firehose delivery stream) that controls how long records are held before being sent; it does not affect Kinesis Data Streams consumer throughput or latency. Option C is wrong because increasing the batch size in the Kinesis Producer Library (KPL) improves write efficiency by aggregating records, but the problem is with the consumer falling behind, not the producer. Option D is wrong because enhanced fan-out provides dedicated 2 MB/s read throughput per consumer per shard, but it does not increase the total number of shards; it helps mainly when several consumers compete for the same shard, so with a single saturated consumer the core issue of insufficient shard count remains.

