A company runs a data pipeline that ingests user activity logs from an API gateway into an Amazon Kinesis Data Firehose delivery stream. The Firehose stream writes data to an S3 bucket. The data is then processed by a scheduled AWS Glue job that runs every hour. Recently, the company noticed that the data in S3 is incomplete: some logs from the API are missing. The Glue job processes all files in the S3 bucket. The Firehose stream has a buffer size of 5 MB and a buffer interval of 60 seconds. The API sends data at a rate of approximately 2 MB per minute. What should the company do to reduce data loss?
A DLQ captures failed deliveries so data can be reprocessed.
Why this answer
Option C is correct because the missing logs are not caused by buffering. Firehose flushes its buffer when either the buffer size or the buffer interval is reached, whichever comes first. At roughly 2 MB per minute, the 5 MB buffer never fills within the 60-second interval, so Firehose delivers about 2 MB to S3 every minute on the interval timer.
Data loss occurs only when a delivery fails permanently. Firehose retries a failed delivery to S3 for up to 24 hours; if the failure outlasts that window, the records are dropped unless something captures them. Increasing the buffer interval makes deliveries less frequent and adds latency, but it does nothing for records whose delivery has already failed.
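The buffering arithmetic can be checked with a small sketch. It assumes a constant ingest rate and models only the "size or interval, whichever comes first" rule; the function name `first_flush_trigger` is hypothetical and is not part of any AWS SDK.

```python
def first_flush_trigger(size_mb: float, interval_s: float, rate_mb_per_min: float):
    """Return (trigger, seconds_until_flush, mb_flushed) for a constant ingest rate."""
    rate_mb_per_s = rate_mb_per_min / 60.0
    # Time needed to accumulate size_mb at the given rate.
    seconds_to_fill = size_mb / rate_mb_per_s if rate_mb_per_s > 0 else float("inf")
    if seconds_to_fill <= interval_s:
        # The buffer fills before the timer expires.
        return "size", seconds_to_fill, size_mb
    # The timer expires first; whatever has accumulated is flushed.
    return "interval", interval_s, rate_mb_per_s * interval_s


# Values from the question: 5 MB buffer, 60 s interval, about 2 MB per minute.
trigger, seconds, flushed_mb = first_flush_trigger(5, 60, 2)
print(trigger, seconds, round(flushed_mb, 2))
```

With the values from the question, the buffer would need about 150 seconds to reach 5 MB, so the 60-second interval fires first and each delivery carries roughly 2 MB.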
The best practice for preventing data loss is to capture failed deliveries, by enabling S3 backup or using a dead-letter queue (DLQ), so that they can be reprocessed. Option A is wrong because a larger buffer size changes nothing here: at 2 MB per minute, the 60-second interval already triggers every delivery before 5 MB accumulates. Option B is wrong because compression reduces object size, not delivery failures.
Option D is wrong because adding a Lambda function does not by itself prevent data loss.
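As a rough illustration of what capturing failures could look like in configuration, the sketch below builds the dictionary that boto3's `create_delivery_stream` accepts as `ExtendedS3DestinationConfiguration`. The ARNs, the `errors/` prefix, and the log group and stream names are hypothetical placeholders. Whether `S3BackupMode` is what the answer means by "S3 backup" is an assumption, as is the helper name `build_s3_destination_config`.

```python
def build_s3_destination_config(role_arn: str, bucket_arn: str, backup_bucket_arn: str) -> dict:
    """Build an ExtendedS3DestinationConfiguration dict (no AWS call is made here)."""
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Buffering hints from the question; they are not the cause of the loss.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        # Prefix applied to failed records written to S3 (hypothetical value).
        "ErrorOutputPrefix": "errors/",
        # Delivery error logging, useful for finding out why deliveries fail.
        "CloudWatchLoggingOptions": {
            "Enabled": True,
            "LogGroupName": "/firehose/activity-logs",  # hypothetical
            "LogStreamName": "S3Delivery",              # hypothetical
        },
        # Assumption: source-record backup to a second bucket.
        "S3BackupMode": "Enabled",
        "S3BackupConfiguration": {"RoleARN": role_arn, "BucketARN": backup_bucket_arn},
    }


config = build_s3_destination_config(
    "arn:aws:iam::123456789012:role/firehose-role",  # hypothetical ARN
    "arn:aws:s3:::activity-logs",                    # hypothetical ARN
    "arn:aws:s3:::activity-logs-backup",             # hypothetical ARN
)
print(sorted(config))
```

The dictionary could then be passed as `ExtendedS3DestinationConfiguration=config` to `boto3.client("firehose").create_delivery_stream(...)`. The sketch stops short of that call so it runs without AWS credentials.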