CCNA Data Ingestion Transformation Questions

75 of 610 questions · Page 2/9 · Data Ingestion Transformation topic · Answers revealed

76
MCQmedium

A company uses Amazon Kinesis Data Analytics to process real-time data. The application needs to aggregate data over a 10-minute window. The team notices that late-arriving events are being dropped. Which configuration should they adjust?

A.Configure a Kinesis Firehose delivery stream to buffer the late events.
B.Increase the shard count of the source Kinesis stream.
C.Set the allowed_lateness parameter in the application's windowed aggregation.
D.Increase the RecordColumn count in the input stream mapping.
AnswerC

Controls how late events are accepted.

Why this answer

Option C is correct because Kinesis Data Analytics allows you to set a lateness tolerance (allowed_lateness) for streaming aggregations. Option A is wrong because the application uses Kinesis Data Analytics, not Firehose. Option B is wrong because the shard count affects parallelism, not late events.

Option D is wrong because RecordColumn defines the input schema and has no effect on late events.
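To make the lateness rule concrete, here is a minimal sketch of how an allowed-lateness setting decides whether a late event is kept. It assumes semantics similar to Apache Flink's event-time windows (a window expires once the watermark passes its last timestamp plus the allowed lateness); the function names and millisecond values are illustrative only and are not taken from the question.

```python
WINDOW_MS = 10 * 60 * 1000  # 10-minute tumbling window, as in the question


def window_end(event_time_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Exclusive end timestamp of the tumbling window that contains the event."""
    return (event_time_ms // size_ms + 1) * size_ms


def is_dropped(event_time_ms: int, watermark_ms: int, allowed_lateness_ms: int) -> bool:
    """True if the event's window has already expired at the current watermark."""
    last_ts = window_end(event_time_ms) - 1  # last timestamp belonging to the window
    return last_ts + allowed_lateness_ms <= watermark_ms


# An event stamped at minute 9 arrives when the watermark is already at minute 11.
event, watermark = 9 * 60_000, 11 * 60_000
print(is_dropped(event, watermark, allowed_lateness_ms=0))           # no tolerance
print(is_dropped(event, watermark, allowed_lateness_ms=2 * 60_000))  # 2-minute tolerance
```

With no tolerance the event misses its window; with a two-minute tolerance it is still accepted.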

77
MCQhard

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Streams with a shard count of 10. The incoming data rate is 1 MB/second. The consuming application uses the Kinesis Client Library (KCL) with a single worker. What is the most likely performance bottleneck?

A.The Lambda function invoked by the stream has a cold start issue
B.The data stream has insufficient write capacity
C.The single KCL worker cannot process all shards in parallel
D.The shard count is too low to handle the data rate
AnswerC

KCL workers should be scaled to match shard count for parallel processing.

Why this answer

Option C is correct because a single KCL worker must process all 10 shards itself, limiting throughput. Option A is wrong because the consumer uses the KCL, not Lambda, so Lambda cold starts do not apply. Option B is wrong because 10 shards provide far more write capacity than the 1 MB/s incoming rate requires.

Option D is wrong because the shard count is adequate for 1 MB/s (each shard can ingest 1 MB/s).
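The effect of worker count on per-worker load can be sketched with a simple round-robin model. This is only an illustration: the real KCL balances shard leases through a lease table, and `assign_shards` is a hypothetical helper, not a KCL API.

```python
from typing import Dict, List


def assign_shards(shard_count: int, worker_count: int) -> Dict[int, List[int]]:
    """Spread shard indexes across workers round-robin (a simplified model)."""
    assignment: Dict[int, List[int]] = {worker: [] for worker in range(worker_count)}
    for shard in range(shard_count):
        assignment[shard % worker_count].append(shard)
    return assignment


single = assign_shards(10, 1)  # the setup described in the question
scaled = assign_shards(10, 5)  # the same stream with five workers

# Number of shards the busiest worker has to handle in each setup.
print(max(len(shards) for shards in single.values()))
print(max(len(shards) for shards in scaled.values()))
```

With one worker, every shard lands on the same process; adding workers shrinks the number of shards each one has to keep up with.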

78
MCQeasy

A company wants to ingest real-time clickstream data from a website into Amazon S3 with a maximum latency of 60 seconds. The data volume peaks at 500 MB/s. Which service should they use to buffer and deliver the data to S3?

A.Amazon Kinesis Data Firehose
B.Amazon Simple Queue Service (SQS)
C.Amazon Kinesis Data Streams
D.AWS Lambda
AnswerA

Firehose is designed for streaming ingestion into S3 with configurable buffering.

Why this answer

Option A is correct because Kinesis Data Firehose can buffer incoming data on a 60-second interval and deliver it to S3. Option B is wrong because SQS is a message queue, not designed for streaming ingestion. Option C is wrong because Kinesis Data Streams stores data for up to 365 days but requires a consumer to write to S3.

Option D is wrong because Lambda is not a buffer service.

79
MCQhard

A company uses Amazon Kinesis Data Firehose to ingest log data from web servers into Amazon S3. The data is in JSON format and each record is approximately 2 KB. The delivery stream is configured to buffer incoming records for 60 seconds or 5 MB, whichever comes first. The company notices that the data in S3 is delayed by up to 5 minutes during peak hours. Which action would most effectively reduce the delivery latency?

A.Increase the buffer size to 10 MB to allow more records per delivery.
B.Decrease the buffer interval to 15 seconds.
C.Enable compression (GZIP) on the delivery stream.
D.Enable data transformation with AWS Lambda to convert JSON to Parquet.
AnswerB

Shorter buffer interval triggers more frequent deliveries, reducing latency.

Why this answer

The observed delay of up to 5 minutes during peak hours indicates that the buffer size threshold (5 MB) is rarely reached because each record is only ~2 KB, so the delivery stream relies on the buffer interval (60 seconds) to trigger delivery. By decreasing the buffer interval to 15 seconds, Kinesis Data Firehose will push data to S3 more frequently, directly reducing the maximum latency from 60 seconds to 15 seconds per batch, which eliminates the compounding delays caused by queuing during high-throughput periods.

Exam trap

The trap here is that candidates assume increasing buffer size or enabling compression will speed up delivery, but they fail to recognize that with small records, the buffer interval is the bottleneck, and only reducing that interval directly lowers latency.

How to eliminate wrong answers

Option A is wrong because increasing the buffer size to 10 MB would actually increase the time needed to fill the buffer, worsening the latency issue during peak hours when records are small and the buffer interval is the primary trigger. Option C is wrong because enabling GZIP compression reduces storage size and cost but does not affect the delivery frequency or buffer flush timing, so it has no impact on latency. Option D is wrong because converting JSON to Parquet via Lambda adds processing overhead and introduces additional latency from the transformation invocation, which would increase rather than reduce delivery delay.
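The "whichever comes first" rule can be modeled in a few lines. The sketch below is a simplified model, not Firehose's actual implementation: it assumes a steady arrival rate, and the record rates used are made-up values chosen only to show each trigger.

```python
from typing import Tuple


def first_flush_trigger(records_per_s: float, record_kb: float,
                        buffer_mb: float, interval_s: float) -> Tuple[str, float]:
    """Return which buffering hint fires first and after how many seconds."""
    throughput_kb_per_s = records_per_s * record_kb
    if throughput_kb_per_s <= 0:
        return ("interval", interval_s)  # nothing arrives, only the timer can fire
    seconds_to_fill = (buffer_mb * 1024) / throughput_kb_per_s
    if seconds_to_fill < interval_s:
        return ("size", seconds_to_fill)
    return ("interval", interval_s)


# 2 KB records with a 5 MB / 60 s buffer, at a low and a high (hypothetical) rate.
print(first_flush_trigger(10, 2, 5, 60))    # the timer fires before 5 MB accumulates
print(first_flush_trigger(1000, 2, 5, 60))  # the size limit fires first
print(first_flush_trigger(10, 2, 5, 15))    # a shorter interval flushes sooner
```

When the size threshold is not being reached, the interval is the only setting that changes how long a record waits in the buffer.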

80
MCQmedium

A data engineering team is responsible for ingesting streaming data from a fleet of IoT devices into Amazon S3 using Kinesis Data Firehose. The data volume spikes unpredictably, and the team has configured Kinesis Data Firehose with a buffer size of 5 MB and buffer interval of 60 seconds. During spikes, the team notices that the delivery to S3 is delayed, and some records are lost due to exceeding the service limits. The team needs to ensure no data loss and reduce delivery latency. What should the team do?

A.Implement an AWS Lambda function to pre-process the data and send it to Firehose in a throttled manner.
B.Increase the buffer size to 10 MB and buffer interval to 120 seconds to allow more data accumulation before delivery.
C.Use Amazon Kinesis Data Streams as the data source for Firehose to decouple ingestion and delivery.
D.Enable S3 Transfer Acceleration on the destination bucket.
AnswerB

Larger buffers reduce the frequency of delivery calls and help handle spikes.

Why this answer

Option B is correct because increasing the buffer size and interval reduces the number of delivery calls and allows Firehose to handle larger spikes without throttling. Option A is wrong because Lambda cannot directly increase Firehose throughput. Option C is wrong because using Kinesis Data Streams as a source adds a buffer but does not directly fix Firehose delivery issues.

Option D is wrong because S3 Transfer Acceleration improves upload speed from clients, not Firehose delivery.

81
MCQhard

A company uses AWS Glue to process data from multiple S3 buckets. The Glue job runs daily and reads data from a bucket that contains millions of small files (each < 1 MB). The job has been running for hours and is often close to the 8-hour timeout limit. Which optimization would MOST reduce the job's runtime?

A.Pre-process the data to consolidate small files into larger files before the Glue job.
B.Convert the source data from CSV to Parquet format.
C.Increase the number of DPUs allocated to the Glue job.
D.Use a larger Spark shuffle partition size.
AnswerA

Fewer, larger files reduce the overhead of opening and reading files.

Why this answer

Small files cause overhead in reading and processing. Grouping them into larger files (e.g., by using S3 batch operations or a compaction step) reduces the number of files and improves performance. Option C is wrong because increasing DPUs may help but not as much as file consolidation.

Option B is wrong because Parquet helps, but if the files are small the benefit is limited. Option D is wrong because Glue already runs on Spark, and changing the shuffle partition size does not reduce the cost of opening millions of files.

82
MCQhard

A data engineer uses AWS Glue to catalog data from an S3 bucket. The data is partitioned by year, month, day. After adding new partitions, the Glue Crawler does not detect them. What is the MOST likely reason?

A.The crawler is configured to only add new partitions to existing tables, but the table schema has changed.
B.The crawler runs only once and does not schedule subsequent runs.
C.The IAM role lacks permission to write to the Glue Data Catalog.
D.The partition depth exceeds the crawler's default limit.
AnswerA

If the schema changed, the crawler may skip partitions; or if the crawler is set to not update new partitions, it won't add them.

Why this answer

Option A is correct because the Glue Crawler, by default, is configured to add new partitions only if the table schema remains unchanged. When new partitions are added to an S3 bucket, if the underlying data schema has changed (e.g., new columns, different data types), the crawler will not add those partitions to the existing table. This is a common safeguard to prevent schema drift from corrupting the cataloged table structure.

Exam trap

The trap here is that candidates often assume partition detection failures are due to permissions or depth limits, but AWS Glue's default schema-change protection is the subtle and less obvious cause.

How to eliminate wrong answers

Option B is wrong because even if the crawler runs only once, it should still detect new partitions during that single run; the issue is about detection failure, not scheduling frequency. Option C is wrong because if the IAM role lacked permission to write to the Glue Data Catalog, the crawler would fail entirely or produce an error, not silently skip new partitions. Option D is wrong because the default partition depth limit in AWS Glue Crawlers is 10 levels, and the given path (year/month/day) is only 3 levels deep, well within the limit.

83
Multi-Selectmedium

A company uses Amazon Kinesis Data Firehose to ingest data into an S3 bucket. The data is in JSON format and the team wants to convert it to Parquet before storage. Which TWO configurations are required?

Select 2 answers
A.Use Kinesis Data Analytics to transform data to Parquet.
B.Create a Glue table with the schema of the data.
C.Configure a Lambda function to convert data on the fly.
D.Set up an Athena table to read the data.
E.Enable data format conversion in Firehose and set Output format to Parquet.
AnswersB, E

Firehose needs a schema for Parquet conversion.

Why this answer

Options B and E are correct: Firehose can convert to Parquet if you enable data format conversion and specify a Glue table (schema). Option A (Kinesis Data Analytics) is not needed. Option D (Athena) is for querying.

Option C (Lambda) can be used but is not required.
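As a rough illustration of how the two required pieces fit together, the sketch below builds the destination settings as a Python dictionary in the shape boto3's Firehose `create_delivery_stream` expects for `ExtendedS3DestinationConfiguration`. Every ARN, database name, and table name is a placeholder invented for this example, and the block only constructs the dictionary; it does not call AWS.

```python
# Placeholder identifiers: replace with real values before use.
ROLE_ARN = "arn:aws:iam::123456789012:role/example-firehose-role"
BUCKET_ARN = "arn:aws:s3:::example-destination-bucket"

extended_s3_config = {
    "RoleARN": ROLE_ARN,
    "BucketARN": BUCKET_ARN,
    "DataFormatConversionConfiguration": {
        "Enabled": True,  # turn on record format conversion
        # The Glue table supplies the schema Firehose needs to write Parquet.
        "SchemaConfiguration": {
            "RoleARN": ROLE_ARN,
            "DatabaseName": "example_db",
            "TableName": "example_events",
            "Region": "us-east-1",
            "VersionId": "LATEST",
        },
        # Incoming records are JSON.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # Output objects are written as Parquet.
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    },
}

conversion = extended_s3_config["DataFormatConversionConfiguration"]
print(conversion["Enabled"], list(conversion["OutputFormatConfiguration"]["Serializer"]))
```

The dictionary would be passed as the `ExtendedS3DestinationConfiguration` argument; no Lambda function appears anywhere in it.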

84
MCQhard

A company uses AWS Glue DataBrew for data preparation. The data source is an S3 bucket with millions of small CSV files (each < 1 MB). The DataBrew project takes a long time to load the sample data. What is the most likely cause and solution?

A.Use Amazon Athena to query the data instead of DataBrew
B.The DataBrew job is under-provisioned; increase the number of DPUs
C.The large number of small files causes S3 LIST overhead; concatenate files into larger files
D.Use AWS Glue ETL instead of DataBrew for this volume
AnswerC

S3 performance degrades with many small files; combining them reduces API calls.

Why this answer

Option C is correct because DataBrew samples data by reading files, and many small files cause high overhead due to S3 LIST and GET requests. Concatenating files into fewer larger files reduces this overhead. Option B (increase DPUs) does not help with the LIST overhead.

Option D (use Glue ETL) is a different service. Option A (use Athena) is for querying, not data preparation.

85
MCQmedium

A data engineering team needs to load data from an on-premises Oracle database to Amazon S3 daily. The data volume is about 50 GB per day, and the network bandwidth is 100 Mbps. The team wants to minimize operational overhead and use AWS managed services. Which solution should they choose?

A.Use AWS Database Migration Service (DMS) to migrate the data to S3.
B.Use AWS DataSync to copy the database files directly to S3.
C.Use AWS Glue with a JDBC connection and schedule a crawler to load data into S3.
D.Use Amazon Kinesis Data Firehose to stream data from Oracle to S3.
AnswerA

DMS supports ongoing replication and scheduled migrations from Oracle to S3.

Why this answer

Option A is correct because AWS Database Migration Service (DMS) can continuously replicate or schedule data migration from on-premises databases to S3 with minimal setup. Option B is wrong because AWS DataSync is for file transfers, not database tables. Option C is wrong because AWS Glue can connect to JDBC sources but requires more configuration for scheduled loads.

Option D is wrong because Amazon Kinesis Firehose is for streaming data, not for batch database loads.

86
Multi-Selecthard

A company ingests IoT sensor data into Amazon Kinesis Data Streams. The data must be enriched with device metadata from Amazon DynamoDB and then stored in Amazon S3 in Apache Parquet format. The solution must minimize latency and cost. Which THREE steps should a data engineer implement? (Choose three.)

Select 3 answers
A.Deliver the enriched data to Amazon Kinesis Data Firehose and enable Parquet conversion.
B.Configure an AWS Lambda function to read from the stream, enrich, and write to S3.
C.Use AWS Glue streaming ETL to enrich and convert data to Parquet.
D.Use Amazon EMR with Spark Streaming to process and store the data.
E.Perform a DynamoDB lookup in the Flink application for each record.
F.Use Amazon Kinesis Data Analytics for Apache Flink to enrich the stream with data from DynamoDB.
AnswersA, E, F

Kinesis Data Firehose can convert incoming data to Parquet and write to S3.

Why this answer

Option F: Using Kinesis Data Analytics for Apache Flink allows stream enrichment with low latency. Option A: Kinesis Data Firehose can buffer data and convert it to Parquet before delivering to S3. Option E: The Flink application can look up DynamoDB data for enrichment.

Option B (Lambda) has invocation limits and higher cost per record. Option C (Glue) adds latency and cost for streaming. Option D (EMR) requires cluster management and is not as seamless.

87
MCQhard

A data engineer runs the describe-stream command and sees the output above. The stream has a retention period of 24 hours. The engineer needs to ensure that consumers can replay data for up to 7 days. Which action is required?

A.Increase the number of shards to allow more data storage.
B.Delete the stream and recreate it with a longer retention period.
C.Use the IncreaseStreamRetentionPeriod API to set retention to 168 hours.
D.Create new consumer applications that read from the stream.
AnswerC

The API can increase retention up to 365 days.

Why this answer

Option C is correct because you increase retention on an existing stream by calling IncreaseStreamRetentionPeriod, and 7 days (168 hours) is within the 365-day maximum. Option A is wrong because changing shard count does not affect retention. Option B is wrong because retention can be changed in place, so the stream does not need to be deleted and recreated.

Option D is wrong because starting new consumers does not change retention.
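A minimal sketch of the call follows, assuming the boto3 Kinesis client, whose `increase_stream_retention_period` method takes `StreamName` and `RetentionPeriodHours`. The helper name, the stream name, and the `RecordingClient` stand-in are hypothetical; the stand-in only records the request so the example runs without AWS credentials.

```python
def extend_retention(kinesis_client, stream_name: str, days: int) -> int:
    """Raise a stream's retention to the given number of days; returns the hours sent."""
    if not 1 <= days <= 365:
        raise ValueError("retention must be between 1 and 365 days")
    hours = days * 24
    kinesis_client.increase_stream_retention_period(
        StreamName=stream_name,
        RetentionPeriodHours=hours,
    )
    return hours


class RecordingClient:
    """Stand-in for a real Kinesis client: stores requests instead of sending them."""

    def __init__(self):
        self.calls = []

    def increase_stream_retention_period(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
print(extend_retention(client, "example-stream", days=7))  # hours requested
print(client.calls)
```

Against a real account, the same helper would be called with `boto3.client("kinesis")` in place of the stand-in.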

88
MCQhard

A company uses Amazon Kinesis Data Streams with a shard count of 10 to ingest clickstream data. The data is consumed by a Lambda function that transforms the records and writes to Amazon S3. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. The average record size is 5 KB, and the incoming data rate is 15 MB/s. What is the most likely cause and solution?

A.Increase the number of shards in the Kinesis data stream to 15.
B.Decrease the batch size of the Lambda event source mapping.
C.Increase the Lambda function's memory allocation to 3008 MB.
D.Increase the Lambda function's reserved concurrency.
AnswerA

Each shard provides 1 MB/s write capacity; 15 shards would support 15 MB/s.

Why this answer

Option A is correct because with 10 shards, the total write capacity is 10 MB/s (1 MB/s per shard). The incoming rate is 15 MB/s, exceeding capacity. Increasing shards to 15 would provide 15 MB/s write capacity.

Option B is wrong because the error is on the producer side, not the consumer. Option C is wrong because increasing Lambda memory does not affect Kinesis write limits. Option D is wrong because Lambda concurrency is not the issue; the error is about throughput exceeded.
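The shard arithmetic can be checked with a short calculation. The sketch assumes the per-shard write limits of 1 MB/s and 1,000 records/s and treats 1 MB as 1,024 KB; `shards_needed` is a hypothetical helper name.

```python
import math

SHARD_WRITE_MB_PER_S = 1.0        # per-shard ingest limit cited in the explanation
SHARD_WRITE_RECORDS_PER_S = 1000  # per-shard record-count limit


def shards_needed(rate_mb_per_s: float, avg_record_kb: float) -> int:
    """Smallest shard count that satisfies both the byte and the record limits."""
    by_bytes = math.ceil(rate_mb_per_s / SHARD_WRITE_MB_PER_S)
    records_per_s = rate_mb_per_s * 1024 / avg_record_kb
    by_records = math.ceil(records_per_s / SHARD_WRITE_RECORDS_PER_S)
    return max(by_bytes, by_records)


# 15 MB/s of 5 KB records, as in the question.
print(shards_needed(15, 5))
```

At this record size the byte limit, not the record-count limit, is what determines the shard count.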

89
MCQeasy

A data engineer needs to ingest streaming data from a social media API into Amazon S3 for batch analytics. The data arrives at a rate of 500 records per second. Which service should be used to capture the stream?

A.Amazon Simple Notification Service (SNS)
B.Amazon Simple Queue Service (SQS)
C.Amazon Kinesis Data Streams
D.Amazon MQ
AnswerC

Kinesis Data Streams is designed for real-time streaming data ingestion.

Why this answer

Option C is correct because Kinesis Data Streams can capture high-throughput streaming data. Option A is wrong because SNS is for pub/sub notifications. Option B is wrong because SQS is for message queues, not for streaming analytics.

Option D is wrong because MQ is for message brokers.

90
Multi-Selecthard

A data engineer needs to set up a data ingestion pipeline that reads from Amazon MSK (Managed Streaming for Kafka) and writes to Amazon S3 with transformations. The data is in Avro format and must be converted to Parquet. Which THREE components should be used together? (Choose THREE.)

Select 3 answers
A.AWS Lambda function to convert Avro to Parquet as a Firehose transformation
B.Amazon Athena to convert the data format
C.Amazon Kinesis Data Firehose delivery stream with MSK as source
D.Amazon MSK cluster as the data source
E.AWS Glue ETL job to read from MSK
AnswersA, C, D

Lambda can be used in Firehose to perform data transformation.

Why this answer

Options A, C, and D are correct. D: MSK is the source. C: Kinesis Data Firehose can consume from MSK and deliver to S3, with Lambda for transformation.

A: Lambda can be used as a transformation function within Firehose to convert Avro to Parquet. E: Glue is not directly integrated with MSK as a Firehose source. B: Athena is for querying, not part of the ingestion pipeline.

91
MCQmedium

A company is streaming IoT data from thousands of devices into Amazon Kinesis Data Streams. The data must be transformed in real time before being stored in Amazon S3. Which service should be used to perform the transformation as the data streams through Kinesis?

A.AWS Glue
B.Amazon Kinesis Data Analytics for Apache Flink
C.Amazon EMR
D.AWS Lambda
AnswerB

Correctly processes streaming data in real time with Flink.

Why this answer

Option B is correct because Amazon Kinesis Data Analytics for Apache Flink can process streaming data in real time using Flink, which is ideal for transformations. Option D (Lambda) can process records but is better for lightweight transformations and may not handle complex stateful operations. Option A (Glue) is batch-oriented, not real-time.

Option C (EMR) is for big data processing but adds latency.

92
Multi-Selecteasy

Which TWO AWS services can be used as sources for AWS Glue ETL jobs? (Choose two.)

Select 2 answers
A.Amazon Route 53
B.Amazon CloudFront
C.Amazon API Gateway
D.Amazon S3
E.Amazon RDS
AnswersD, E

S3 is a common source for Glue jobs.

Why this answer

Glue can read from S3 and JDBC sources like RDS and Redshift. Option A (Route 53) is DNS; Option B (CloudFront) is a CDN; Option C (API Gateway) is not a data store.

93
MCQmedium

Refer to the exhibit. A data engineer is running an AWS Glue job that reads data from an S3 source. The job fails with the error shown. What is the MOST likely cause?

A.The IAM role does not have s3:GetObject permission.
B.One of the source files is empty or corrupted.
C.The file is in JSON format but the schema expects Parquet.
D.The Glue job has insufficient memory allocated.
AnswerB

Empty file can return None when read, causing 'NoneType' has no attribute 'read'.

Why this answer

Option B is correct because an empty or corrupted file can cause a read error. Option A is wrong because missing permissions would cause an access denied error, not a read attribute error. Option C is wrong because a wrong file format would cause a parsing error, not a NoneType error.

Option D is wrong because insufficient memory would cause OOM error.

94
Matchingmedium

Match each AWS service to its primary purpose in data engineering.

Concepts
Matches

Serverless ETL and data catalog

Data warehousing and SQL analytics

Big data processing using Hadoop/Spark

Building and managing data lakes

Real-time streaming data ingestion

Why these pairings

These are core AWS services for data engineering workloads.

95
Multi-Selectmedium

Which TWO services can be used to ingest streaming data into Amazon S3? (Choose two.)

Select 2 answers
A.Amazon Athena
B.AWS Glue
C.Amazon Kinesis Data Streams
D.AWS Database Migration Service (DMS)
E.Amazon Kinesis Data Firehose
AnswersC, E

Data Streams can be consumed and written to S3 via a consumer application.

Why this answer

Amazon Kinesis Data Streams is a real-time streaming service that can ingest and store streaming data, which can then be consumed and written to Amazon S3 using Kinesis Data Analytics or a custom consumer application. Amazon Kinesis Data Firehose is a fully managed service that can directly load streaming data into Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service, with optional data transformation and compression.

Exam trap

The trap here is that candidates often confuse AWS Glue's ETL capabilities with real-time streaming ingestion, or mistakenly think Amazon Athena can ingest data because it queries S3, but neither service is designed for streaming data ingestion.

96
MCQmedium

A data engineering team needs to ingest streaming data from thousands of IoT devices and store it in Amazon S3 for batch processing. The data arrives at a rate of 10 MB/s, with occasional spikes up to 50 MB/s. The data must be processed in near real-time with minimal latency. Which AWS service should be used for ingestion?

A.Amazon DynamoDB Streams
B.Amazon Kinesis Data Streams
C.Amazon SQS
D.Amazon S3
AnswerB

Designed for real-time data streaming with high throughput and S3 integration via Kinesis Firehose.

Why this answer

Option B is correct because Kinesis Data Streams can handle high throughput and is designed for real-time ingestion. Option A is wrong because DynamoDB Streams captures changes to DynamoDB tables, not external IoT data. Option C is wrong because SQS is for decoupled messaging, not high-throughput streaming; it also lacks built-in S3 integration.

Option D is wrong because S3 is a storage service, not an ingestion endpoint.

97
MCQhard

A financial services company is building a real-time fraud detection system. Transaction data is ingested via Amazon Kinesis Data Streams and processed by an Amazon Kinesis Data Analytics for Apache Flink application that runs sliding window aggregations. The output is written to an Amazon S3 bucket for downstream analysis. The Flink application is configured with parallelism of 4 and checkpointing every minute. The company has noticed that the application is experiencing high latency and the checkpointing is frequently failing. The CloudWatch metrics show that the Flink application's CPU utilization is near 100% and the checkpoint duration is spiking to over 5 minutes. The data engineer needs to improve performance. Which action should the data engineer take?

A.Increase the number of shards in the source Kinesis stream to improve throughput.
B.Increase the parallelism of the Flink application to distribute the workload across more resources.
C.Increase the heap memory of the Flink application to handle larger state.
D.Decrease the checkpoint interval to 30 seconds to reduce the amount of state being checkpointed.
AnswerB

More parallelism can reduce CPU utilization and checkpoint time.

Why this answer

Option B is correct because increasing parallelism allows the workload to be distributed across more resources, reducing CPU pressure and checkpoint duration. Option A is wrong because increasing shards without increasing parallelism may not help if CPU is the bottleneck. Option C is wrong because memory is not the primary issue; the metrics point to CPU saturation.

Option D is wrong because reducing the checkpoint interval would increase checkpoint frequency and likely worsen the failures.

98
Multi-Selecthard

A data engineer is troubleshooting a Kinesis Data Streams consumer application that is falling behind. The stream has 10 shards and is receiving 5 MB/s of data. The consumer uses the Kinesis Client Library (KCL) with a single worker. The worker is processing all 10 shards but is experiencing high latency and checkpointing delays. Which THREE actions should the engineer take to improve consumer performance? (Select THREE.)

Select 3 answers
A.Increase the number of KCL workers to match the number of shards.
B.Enable enhanced fan-out for the consumer.
C.Decrease the checkpoint interval to reduce checkpointing overhead.
D.Increase the KCL maxRecords parameter to process more records per call.
E.Increase the number of shards in the stream.
AnswersA, B, D

Multiple workers can process shards in parallel, reducing per-worker load.

Why this answer

Option A (increase number of workers) allows shard distribution across multiple workers, improving parallelism. Option B (enable enhanced fan-out) dedicates throughput to the consumer, reducing read throttling. Option D (increase KCL maxRecords per call) reduces the number of API calls, improving throughput.

Option C is wrong because reducing the checkpoint interval would cause more frequent checkpointing, potentially increasing delays. Option E is wrong because increasing shards would increase the load on the consumer.

99
MCQeasy

A data engineer needs to ingest real-time clickstream data from a website into Amazon S3 for analytics. The data arrives as JSON records, each under 1 KB. The engineer wants to use a serverless solution with automatic scaling and minimal operational overhead. Which AWS service should be used as the ingestion endpoint?

A.Amazon S3 with presigned URLs
B.Amazon Kinesis Data Analytics
C.Amazon Kinesis Data Firehose
D.AWS Lambda function behind an API Gateway
AnswerC

Serverless, automatically scales, delivers to S3 with optional transformation.

Why this answer

Option C is correct because Kinesis Data Firehose is a serverless service that automatically scales and delivers streaming data to S3. Option A is wrong because S3 is not a real-time ingestion endpoint. Option B is wrong because Kinesis Data Analytics is for real-time analysis, not ingestion.

Option D is wrong because Lambda can process records but does not serve as a persistent ingestion endpoint; it would require custom scaling.

100
MCQhard

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format and each record is about 2 KB. The delivery stream is configured to buffer data for 60 seconds or 5 MB, whichever comes first. The team notices that the S3 objects are very small (around 1 MB) and numerous, causing high costs due to S3 PUT requests. Which configuration change should the team make to reduce the number of S3 objects?

A.Enable compression (GZIP) on the delivery stream.
B.Increase the buffer size to 50 MB and the buffer interval to 300 seconds.
C.Reduce the buffer interval to 30 seconds and keep buffer size at 5 MB.
D.Switch from Kinesis Data Firehose to Amazon Kinesis Data Streams and use a Lambda function to write to S3.
AnswerB

Larger buffer accumulates more data before writing, resulting in fewer, larger objects.

Why this answer

The correct answer is to increase the buffer size to a larger value (e.g., 50 MB) and increase the buffer interval to 300 seconds. Larger buffers produce fewer, larger S3 objects. Option C (reducing buffer interval) would create even more objects.

Option A (compression) reduces object size but not count. Option D (switching to Kinesis Data Streams) changes the architecture without addressing the object count. Option B directly addresses the issue.
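A back-of-the-envelope estimate shows how the buffering hints drive object count. This is a sketch under simplifying assumptions: a steady throughput of about 1 MB per minute (inferred from the roughly 1 MB objects in the question), a single delivery stream, and one S3 object per flush.

```python
def objects_per_hour(throughput_mb_per_s: float, buffer_mb: float, interval_s: float) -> float:
    """Estimate S3 objects written per hour when a flush happens on size or time."""
    seconds_to_fill = buffer_mb / throughput_mb_per_s
    seconds_per_object = min(seconds_to_fill, interval_s)  # whichever comes first
    return 3600 / seconds_per_object


throughput = 1 / 60  # about 1 MB per minute (assumed from the ~1 MB objects)

print(objects_per_hour(throughput, buffer_mb=5, interval_s=60))    # current settings
print(objects_per_hour(throughput, buffer_mb=50, interval_s=300))  # settings in option B
```

Under these assumptions the interval is what triggers every flush, so lengthening it from 60 to 300 seconds cuts the hourly object count to one fifth.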

101
MCQhard

A data engineer is troubleshooting a Kinesis Data Streams application that is experiencing high latency. The stream has 2 shards. The application is using a single Kinesis Client Library (KCL) worker to process all shards. Which change will MOST likely reduce latency?

A.Increase the number of shards to 4.
B.Deploy multiple KCL workers to process shards in parallel.
C.Use a larger instance type for the Kinesis stream.
D.Decrease the number of shards to 1.
AnswerB

Multiple workers can process shards concurrently, reducing latency.

Why this answer

Option B is correct because using multiple KCL workers, one per shard, allows parallel processing of each shard, reducing latency. Option A is incorrect because increasing shard count would increase capacity but not necessarily reduce latency if the processing is bottlenecked by a single worker. Option C is incorrect because a Kinesis stream has no instance type; the KCL worker runs in the application, not in Kinesis.

Option D is incorrect because decreasing shard count would reduce parallelism.

102
Multi-Selectmedium

A company is using AWS Glue to run ETL jobs that read from Amazon S3 and write to Amazon Redshift. The jobs are failing intermittently with 'Out of Memory' errors. Which TWO actions should the data engineer take to resolve this issue? (Choose TWO.)

Select 2 answers
A.Switch the output to Amazon S3 instead of Redshift
B.Increase the number of DPUs allocated to the Glue job
C.Reduce the number of partitions in the input data
D.Increase the spark.sql.shuffle.partitions parameter
E.Enable job metrics in CloudWatch to monitor memory usage
AnswersB, E

More DPUs provide more memory.

Why this answer

Option B is correct because increasing the DPU count provides more memory and processing power. Option E is correct because enabling job metrics helps identify memory bottlenecks. Option D (increasing spark.sql.shuffle.partitions) may help but is not a primary solution.

Option C (reducing input partitions, and with them parallelism) could reduce memory but may slow down the job. Option A (switching the output from Redshift to S3) is unrelated to memory in Glue.

103
Multi-Selecteasy

A data engineer needs to transfer 50 TB of data from an on-premises data center to Amazon S3 over a 1 Gbps network. The transfer must be completed within one week. Which TWO AWS services can be used for this task? (Choose TWO.)

Select 2 answers
A.AWS Glue
B.AWS DataSync
C.AWS Snowball
D.Amazon S3 Transfer Acceleration
E.AWS Direct Connect
AnswersB, C

Designed for network-based bulk data transfer.

Why this answer

Option B (AWS DataSync) is correct because it can transfer large volumes over the network efficiently. Option C (AWS Snowball) is correct because it can be used for offline transfer if the network is insufficient. Option D is wrong because S3 Transfer Acceleration speeds up internet transfers but does not necessarily achieve the required throughput.

Option E is wrong because Direct Connect is a network connection, not a transfer service. Option A is wrong because Glue is for ETL, not bulk transfer.
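Whether the network path fits the one-week deadline comes down to simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits/s); the `utilization` parameter and the 60% figure are hypothetical, included only to show how shared or lossy links change the result.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move data_tb terabytes over a link at the given utilization."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400


print(round(transfer_days(50, 1), 2))                   # link fully dedicated to the transfer
print(round(transfer_days(50, 1, utilization=0.6), 2))  # only 60% of the link available
```

At full line rate the transfer fits within the week, but if only about 60% of the link is usable it no longer does, which is why an offline option remains a valid fallback.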

104
MCQhard

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a fleet of EC2 instances running a custom application that processes the records and writes to DynamoDB. The application is experiencing high latency and records are being processed slower than they are produced. The stream has 5 shards. Which action would MOST effectively improve processing speed?

A.Use the Kinesis Client Library (KCL) to automatically distribute shards among instances.
B.Increase the EC2 instance size to provide more CPU and memory.
C.Add more EC2 instances consuming from the same stream without changing shard count.
D.Increase the number of shards in the Kinesis stream.
AnswerD

More shards increase the stream's capacity and allow more parallel consumers.

Why this answer

Increasing the number of shards increases the throughput of the stream, allowing more parallel consumers. Option B is wrong because increasing instance size may not help if the bottleneck is the stream. Option C is wrong because adding more consumers without more shards doesn't help (each shard supports one consumer per application).

Option A is wrong because KCL handles shard distribution, but more shards are needed.

105
MCQeasy

A data engineer is ingesting streaming data from thousands of IoT devices into AWS. The data is JSON-formatted and must be stored in Amazon S3 for long-term analytics. Which service is most appropriate for real-time ingestion and routing to S3?

A.Amazon SQS
B.Amazon Kinesis Data Firehose
C.Amazon Kinesis Data Streams
D.AWS Glue
AnswerB

Kinesis Data Firehose can deliver streaming data directly to S3 without additional code.

Why this answer

Amazon Kinesis Data Firehose is the easiest way to load streaming data into S3. Kinesis Data Streams requires custom consumers; AWS Glue is batch ETL; Amazon SQS is not designed for direct S3 delivery.

106
MCQhard

A healthcare company is ingesting patient data from a legacy system into an Amazon S3 data lake using AWS Glue. The legacy system produces CSV files with inconsistent schemas (columns may appear or disappear in different files). The data engineer needs to create a Glue ETL job that can handle schema evolution and transform the data into a standardized parquet format. The job should also be able to process new files as they arrive. Which approach should the data engineer use?

A.Use AWS Glue crawlers to create a schema in the Data Catalog and then use a standard Spark DataFrame for transformation.
B.Use AWS Glue DynamicFrames to read the CSV files and apply transformations using resolveChoice and applyMapping.
C.Use a Python shell job in Glue to manually parse each file and write to parquet.
D.Use a Glue ETL job with a static schema defined in the script and ignore files that don't match.
AnswerB

DynamicFrames support schema evolution.

Why this answer

Option B is correct because Glue DynamicFrames can handle schema evolution by allowing schema-on-read and resolving schema inconsistencies. Option A is wrong because crawlers are for cataloging, not ETL, and a standard Spark DataFrame expects one consistent schema.

Option D is wrong because a static schema would fail with evolving schemas.

107
Multi-Selectmedium

A data engineer needs to schedule a nightly ETL job that reads from an Amazon RDS database and writes to Amazon S3 in Parquet format. The solution must be serverless and minimize cost. Which TWO AWS services should be used? (Choose TWO.)

Select 2 answers
A.AWS Data Pipeline
B.AWS Lambda
C.Amazon Athena
D.Amazon S3
E.AWS Glue
AnswersD, E

S3 is the destination for the transformed data.

Why this answer

AWS Glue can run serverless ETL jobs. Amazon S3 is the destination. Lambda could trigger but not run the ETL itself; Data Pipeline is not serverless; Athena is query-only.

108
MCQhard

A company has a Glue ETL job that reads from an Amazon RDS for MySQL table and writes to Amazon S3. The job runs hourly and processes new records based on a 'last_modified' timestamp column. Recently, the job started missing some records because the timestamp in MySQL is stored with microsecond precision but Glue's job bookmark only tracks second precision. Which solution addresses this issue?

A.Use a job parameter to store the last processed timestamp with millisecond precision and query records greater than that value.
B.Increase the job frequency to every 30 minutes.
C.Run a full refresh of the table each time instead of incremental.
D.Modify the MySQL table to use a DATE data type instead of TIMESTAMP.
AnswerA

Custom job bookmark with higher precision.

Why this answer

Option A is correct because AWS Glue job bookmarks track timestamps with only second precision, so records with microsecond differences within the same second are missed. By using a custom job parameter to store the last processed timestamp with millisecond precision and querying records greater than that value, you bypass Glue's bookmark limitation and capture all new or modified records.

Exam trap

The trap here is that candidates assume Glue job bookmarks automatically handle all timestamp precisions, but the exam tests awareness that bookmarks default to second-level granularity and that custom logic is required for sub-second precision.

How to eliminate wrong answers

Option B is wrong because increasing job frequency does not address the precision mismatch; it only reduces the window for missed records but does not eliminate the root cause of second-level granularity. Option C is wrong because running a full refresh each time is inefficient and costly, and it does not solve the precision issue—it simply avoids incremental processing. Option D is wrong because changing the column to DATE data type would lose time-of-day information entirely, making incremental processing based on last_modified impossible.
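The custom-watermark idea can be sketched in plain Python. This is only an illustration under stated assumptions: the rows, the `select_new_records` helper, and the in-memory filter are hypothetical, and in a real Glue job the stored watermark would typically be passed in as a job parameter and applied in the source query.

```python
from datetime import datetime


def select_new_records(records, last_processed):
    """Return records modified strictly after the watermark, plus the next watermark."""
    new = [r for r in records if r["last_modified"] > last_processed]
    watermark = max((r["last_modified"] for r in new), default=last_processed)
    return new, watermark


# Hypothetical rows; two of them fall within the same second.
rows = [
    {"id": 1, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 100_000)},
    {"id": 2, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 900_000)},
    {"id": 3, "last_modified": datetime(2024, 1, 1, 12, 0, 1, 250_000)},
]

# Watermark saved by the previous run, with sub-second precision kept.
previous = datetime(2024, 1, 1, 12, 0, 0, 500_000)
new_rows, next_watermark = select_new_records(rows, previous)
print([r["id"] for r in new_rows], next_watermark.isoformat())
```

Because the watermark keeps its fractional seconds, record 2 is picked up even though it shares a whole second with the already-processed record 1.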

109
MCQmedium

An e-commerce company ingests clickstream data from their website into Amazon S3. The data is in JSON format, and each file is about 10 MB. They need to transform the data into a columnar format for analytics and load it into Amazon Redshift nightly. The transformation should be cost-effective and require minimal operational overhead. Which approach meets these requirements?

A.Use AWS Glue ETL job to convert to Parquet and load into Redshift.
B.Use Amazon Redshift COPY command to load JSON directly.
C.Use Amazon EMR with Spark to transform and load data.
D.Use AWS Lambda to transform each file and write to Redshift.
AnswerA

Serverless and minimal overhead.

Why this answer

AWS Glue ETL is the correct choice because it is a serverless, managed service that can efficiently convert JSON to Parquet (a columnar format optimized for Redshift) and load the data into Redshift with minimal operational overhead. The nightly batch processing of 10 MB files is well-suited for Glue's pay-per-use pricing, making it cost-effective without requiring infrastructure management.

Exam trap

The trap here is that candidates may choose Amazon EMR or Lambda because they are familiar with Spark or serverless functions, but they overlook the operational overhead of EMR and the execution limits of Lambda for batch workloads, while Glue provides a balanced, managed solution for this specific use case.

How to eliminate wrong answers

Option B is wrong because the Redshift COPY command can load JSON directly, but it does not transform the data into a columnar format like Parquet; it loads JSON as-is, which is less efficient for analytics and may require additional schema handling. Option C is wrong because Amazon EMR with Spark introduces significant operational overhead for managing clusters, tuning, and monitoring, which is unnecessary for a simple nightly transformation of small 10 MB files. Option D is wrong because AWS Lambda has a maximum execution timeout of 15 minutes and limited memory (up to 10 GB), making it unsuitable for batch processing multiple files or handling large datasets; it is designed for event-driven, short-lived tasks, not nightly ETL workloads.

110
MCQeasy

A data engineer is tasked with transforming JSON data from an S3 bucket into Parquet format for efficient querying. The transformation should run on a schedule every hour. Which AWS service is best suited for this task?

A.AWS Lambda
B.Amazon Athena
C.AWS Glue
D.Amazon EMR
AnswerC

Glue provides managed ETL jobs that can be scheduled and support Parquet conversion.

Why this answer

Option C (AWS Glue) is correct because it offers scheduled ETL jobs that can read from S3, transform data, and write back in Parquet format. Option B (Amazon Athena) is a query service, not for scheduled transformations. Option D (Amazon EMR) is more complex and typically used for large-scale processing.

Option A (AWS Lambda) has a 15-minute timeout and is not ideal for hourly batch jobs.

111
MCQmedium

A company uses AWS Glue ETL jobs to process data from an S3 data lake. The job reads data in CSV format, transforms it, and writes to Parquet. The job runs daily and takes 2 hours to complete. The data volume is increasing by 20% each month. The engineer wants to reduce the job runtime. Which action is most effective?

A.Increase the number of DPUs for the Glue job
B.Enable compression on the input CSV files
C.Switch from Python Shell to Spark ETL
D.Partition the input data in S3 by date and use partition pruning in the job
AnswerD

Partition pruning limits the data read to only relevant partitions, drastically reducing processing time.

Why this answer

Option D is correct because partitioning data by date (e.g., year/month/day) allows Glue to read only the new partitions incrementally, reducing data scanned. Option A (increasing DPUs) may help but not as much as reducing data volume. Option C (using Spark) is already used by Glue.

Option B (compression) is wrong because the output is already Parquet, which is compressed, and compressing the CSV input does not reduce the number of records the job must process.
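To make partition pruning concrete, the sketch below builds the date-based prefixes a daily job would need to read. The bucket name, the Hive-style `year=/month=/day=` layout, and the helper name are assumptions for illustration; in a Glue job the same effect is usually achieved with a predicate on the partition columns rather than by listing prefixes by hand.

```python
from datetime import date, timedelta


def partition_prefixes(base, start, end):
    """Build Hive-style date prefixes for an inclusive date range."""
    days = (end - start).days + 1
    return [
        f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"
        for d in (start + timedelta(days=n) for n in range(days))
    ]


# Hypothetical bucket; the job reads two days instead of the whole data lake.
prefixes = partition_prefixes(
    "s3://example-bucket/events", date(2024, 2, 28), date(2024, 2, 29)
)
for p in prefixes:
    print(p)
```

Reading only these prefixes keeps the input size roughly constant per run, even as the total data volume grows each month.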

112
MCQmedium

A streaming application sends data to Amazon Kinesis Data Streams. The data must be enriched with reference data from an Amazon DynamoDB table in real-time. Which AWS service can be used to perform this enrichment with minimal latency?

A.Amazon Kinesis Data Analytics for Apache Flink
B.Amazon Kinesis Data Firehose with Lambda transformation
C.AWS Lambda function triggered by Kinesis Data Streams
D.AWS Glue streaming ETL
AnswerA

Flink can perform low-latency stream processing and join with DynamoDB.

Why this answer

Option A is correct because Kinesis Data Analytics for Apache Flink can process streaming data and join with DynamoDB in real-time. Option C is wrong because Lambda can enrich but may add latency due to cold starts. Option B is wrong because Kinesis Data Firehose does not support real-time enrichment from DynamoDB.

Option D is wrong because Glue is batch-oriented.

113
MCQhard

A company uses AWS Glue to transform data stored in Amazon S3. During a run, the job fails with an 'OutOfMemoryError' in the Spark executor. The job processes 2 TB of Parquet files using 10 DPUs. The data is evenly distributed across partitions. Which action would MOST likely resolve the issue without impacting the job logic?

A.Enable S3 request rate increase to speed up data reading.
B.Increase the number of DPUs allocated to the Glue job.
C.Repartition the data to a larger number of partitions.
D.Change the input format from Parquet to Snappy-compressed CSV.
AnswerB

More DPUs increase total memory available.

Why this answer

Option B is correct because increasing DPUs provides more memory and cores. Option C is wrong because repartitioning may not reduce memory per executor if the total number of DPUs is unchanged. Option D is wrong because compressing reduces input size, but Spark uses memory for processing, not just storage.

Option A is wrong because increasing the S3 request rate does not help with executor memory.

114
MCQeasy

A data engineer needs to ingest data from an external partner's FTP server to Amazon S3. The data arrives once daily as a CSV file. Which AWS service should be used for this ingestion?

A.AWS DataSync
B.Amazon Kinesis Data Firehose
C.Amazon AppFlow
D.AWS Transfer Family
AnswerD

AWS Transfer Family provides managed FTP and SFTP support for S3.

Why this answer

Option D is correct because AWS Transfer Family supports SFTP and FTP-based transfers into S3. Option A is wrong because AWS DataSync is for transfers between storage systems and requires agent installation. Option C is wrong because Amazon AppFlow is for SaaS applications, not FTP.

Option B is wrong because Amazon Kinesis Data Firehose is for streaming data.

115
MCQeasy

A company is using Amazon Kinesis Data Firehose to ingest data into Amazon S3. The data must be transformed from JSON to Parquet format before delivery. Which feature should be enabled on the Firehose delivery stream?

A.Amazon Kinesis Data Analytics
B.Amazon S3 event notifications
C.Format conversion (Parquet/ORC)
D.AWS Lambda transformation
AnswerC

Firehose natively supports converting JSON to Parquet or ORC.

Why this answer

Option C is correct because Firehose can convert the input data format to Parquet or ORC using its built-in format conversion feature. Option D is wrong because Lambda transformation is for custom code, not format conversion. Option B is wrong because S3 events are for notifications, not transformation.

Option A is wrong because Kinesis Data Analytics is for stream processing, not directly tied to Firehose.

116
Multi-Selectmedium

A data engineer needs to design a data ingestion pipeline that captures streaming data from mobile app events into Amazon S3 for analytics. The pipeline must support real-time processing of events and allow for schema evolution over time. Which AWS services should the engineer use? (Choose THREE.)

Select 3 answers
A.Amazon Kinesis Data Analytics
B.Amazon Kinesis Data Firehose
C.AWS Glue ETL jobs
D.Amazon Kinesis Data Streams
E.AWS AppFlow
AnswersA, B, D

Enables real-time processing and schema evolution.

Why this answer

Options A, B, and D are correct. Kinesis Data Streams ingests streaming data. Kinesis Data Analytics can perform real-time processing.

Kinesis Data Firehose delivers data to S3. Option C (Glue) is batch-oriented and not real-time. Option E (AppFlow) is for SaaS data ingestion.

117
Multi-Selecthard

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time clickstream data. The application reads from a Kinesis stream and writes aggregated results to an Amazon S3 bucket. The company notices that the application is falling behind and the checkpoint duration is increasing. Which THREE actions should the data engineer take to improve performance? (Choose THREE.)

Select 3 answers
A.Decrease the number of shards in the source Kinesis stream.
B.Use multiple S3 prefixes in the output path to avoid throttling.
C.Increase the heap memory of the Flink application.
D.Increase the checkpoint interval to reduce checkpoint overhead.
E.Increase the parallelism of the Flink application.
AnswersB, D, E

Multiple prefixes increase S3 write performance.

Why this answer

Options B, D, and E are correct. Increasing parallelism allows more parallel processing. Using multiple S3 prefixes reduces S3 write throttling.

Increasing the checkpoint interval reduces overhead. Option C is wrong because a heap memory increase is not directly related to checkpoint duration. Option A is wrong because reducing the number of shards would decrease throughput.

118
MCQmedium

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that transforms each record and writes it to Amazon S3. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors when writing to S3. The team has already increased the Lambda function's memory and timeout. Which action should the team take to resolve the issue?

A.Use S3 Batch Operations to write data in batches.
B.Increase the number of shards in the Kinesis data stream.
C.Enable S3 Transfer Acceleration on the destination bucket.
D.Implement retries with exponential backoff in the Lambda function for S3 put operations.
AnswerD

This handles transient S3 throttling by retrying with backoff.

Why this answer

The 'ProvisionedThroughputExceededException' error indicates that the Lambda function is being throttled by S3 due to exceeding the bucket's request rate limits. Implementing retries with exponential backoff in the Lambda function for S3 put operations is the correct solution because it allows the function to gracefully handle transient throttling errors by waiting progressively longer between retries, which aligns with AWS's guidance for managing S3 request rate limits.

Exam trap

The trap here is that candidates confuse the source of the error (Kinesis vs. S3) and incorrectly assume that increasing Kinesis shards will fix the S3 throttling, or they mistake S3 Transfer Acceleration for a solution to rate limits when it only improves network latency.

How to eliminate wrong answers

Option A is wrong because S3 Batch Operations is designed for bulk processing of existing objects in S3, not for handling real-time streaming writes from Lambda, and it does not address the immediate throttling issue during individual put operations. Option B is wrong because increasing the number of shards in the Kinesis data stream would increase the parallelism of data ingestion into Lambda, but the error occurs when writing to S3, not when reading from Kinesis, so it would not resolve the S3 throttling. Option C is wrong because S3 Transfer Acceleration optimizes network transfer speed by using AWS edge locations, but it does not affect S3's internal request rate limits or throttle errors, which are based on bucket-level throughput capacity.
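Retries with exponential backoff can be sketched without any AWS dependency. In the example below, `ThrottledError`, `with_backoff`, and `flaky_put` are hypothetical stand-ins: a real Lambda function would catch the throttling error raised by its S3 client, and the AWS SDKs also ship configurable retry behaviour that may be preferable in practice.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error raised by the storage client."""


def with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying throttling errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise                                  # out of attempts: surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))              # random wait up to the growing cap


# Hypothetical put operation that is throttled twice before succeeding.
calls = {"n": 0}


def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ThrottledError("slow down")
    return "stored"


result = with_backoff(flaky_put, sleep=lambda seconds: None)  # skip real sleeping in the demo
print(result, calls["n"])
```

The wait cap doubles on each attempt, and the random jitter keeps many concurrent invocations from retrying at the same instant.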

119
Multi-Selecteasy

A data engineer is designing a serverless data ingestion pipeline that uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed using AWS Lambda before being written to S3. Which two steps are required to enable this transformation? (Select TWO.)

Select 2 answers
A.Set up an S3 event notification to trigger the Lambda function on object creation.
B.Configure a Lambda function as a data transformation source in the Firehose delivery stream.
C.Ensure the Lambda function returns the transformed data in the format required by Firehose.
D.Subscribe the Lambda function to the CloudWatch Logs log group for the Firehose stream.
E.Have the Lambda function write the transformed data directly to the S3 bucket.
AnswersB, C

This enables Firehose to invoke Lambda for transformation.

Why this answer

Option B is correct because Amazon Kinesis Data Firehose can be configured to invoke a Lambda function as a data transformation source. This allows Firehose to pass incoming records to the Lambda function, which processes and returns the transformed records before they are delivered to the S3 destination. Option C is correct because the Lambda function must return data in the specific format that Firehose expects, including a record ID, result status, and base64-encoded data, otherwise the transformation will fail.

Exam trap

The trap here is that candidates often confuse post-delivery transformations (using S3 event notifications) with in-stream transformations (using Firehose's built-in Lambda integration), leading them to select Option A instead of the correct Firehose-specific configuration.
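The response shape that Firehose expects from a transformation function (a record ID, a result status, and base64-encoded data) can be illustrated with a minimal handler. The `processed` field and the sample event are assumptions made up for this sketch; the handler is invoked locally here rather than by Firehose.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the shape Firehose expects."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True            # hypothetical transformation
            data = base64.b64encode((json.dumps(payload) + "\n").encode()).decode()
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, TypeError):
            # Malformed input: return the original data and flag the failure.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


# Local invocation with a made-up event: one valid record, one malformed.
sample = {"records": [
    {"recordId": "1", "data": base64.b64encode(b'{"user": "a"}').decode()},
    {"recordId": "2", "data": base64.b64encode(b"not json").decode()},
]}
response = lambda_handler(sample, None)
print([r["result"] for r in response["records"]])
```

Every incoming `recordId` is echoed back exactly once, which is what lets Firehose match transformed records to the originals.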

120
Multi-Selecthard

A company uses Amazon Kinesis Data Firehose to deliver data to an S3 bucket. The data contains personally identifiable information (PII) that must be redacted before storage. Which THREE actions can achieve this requirement? (Choose THREE.)

Select 3 answers
A.Use AWS Glue ETL to read from the S3 bucket and write redacted data to another S3 bucket.
B.Use Amazon Athena to query the data and redact PII on the fly.
C.Use Amazon Macie to discover and automatically redact PII before storage.
D.Use AWS Database Migration Service (AWS DMS) to replicate data and apply transformations.
E.Use an AWS Lambda function as a transformation in the Firehose delivery stream.
AnswersA, C, E

Glue can process and redact PII in a batch job after data is in S3.

Why this answer

Option A is correct because AWS Glue ETL can read data from the source S3 bucket, apply transformations (including redacting PII fields using custom scripts or built-in transforms), and write the cleaned data to a target S3 bucket. This decouples the redaction from the ingestion pipeline and allows for complex, schema-aware transformations using Apache Spark or Python.

Exam trap

The trap here is that candidates may assume Macie can automatically redact PII during ingestion, but Macie is a discovery and classification service, not a data transformation service; it cannot modify data in flight or in place without additional automation.

121
MCQeasy

A data engineer is using AWS Glue to run an ETL job that reads data from Amazon DynamoDB and writes to Amazon Redshift. The job fails with a 'ThroughputExceededException' error. What is the most likely cause?

A.The Glue job has a timeout setting that is too low
B.The Redshift cluster's concurrency scaling is insufficient
C.The DynamoDB table's read capacity is insufficient for the Glue job's read rate
D.The S3 bucket where Glue writes temporary data does not have proper permissions
AnswerC

Glue reads from DynamoDB and may exceed provisioned read capacity, causing throttling.

Why this answer

Option C is correct because DynamoDB throttles requests when read/write capacity is exceeded. The Glue job may be reading too fast for the table's provisioned capacity. Option B is wrong because Redshift concurrency scaling does not cause this error.

Option A is wrong because a Glue job timeout would produce a different error. Option D is wrong because S3 permissions would produce a different error.

122
MCQhard

A data pipeline uses AWS Glue ETL to process data from an S3 bucket and write results to a Redshift cluster. The job fails with a 'DiskFull' error on the Glue worker nodes. What is the best way to resolve this issue?

A.Increase the number of Glue DPUs or use G.1X worker type.
B.Decrease the number of partitions in the output.
C.Use a different file format like Parquet to reduce storage.
D.Increase the job timeout setting.
AnswerA

More DPUs or larger workers provide additional disk and memory.

Why this answer

Increasing the number of DPUs or using the G.1X worker type provides more disk space, and a larger worker type such as G.2X provides more still. Option D (increasing the timeout) does not help; Option C (a different file format such as Parquet) might reduce data size but does not address the root cause.

123
MCQhard

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that is experiencing high error rates when writing to an S3 bucket. The error logs indicate 'AccessDenied' errors. The S3 bucket policy allows access from the Firehose service, but the errors persist. What is the most likely cause?

A.The S3 bucket has a lifecycle policy that is deleting objects too quickly
B.The IAM role assumed by Firehose does not have the s3:PutObject permission
C.The S3 bucket has default encryption enabled
D.The S3 bucket uses an AWS KMS key for encryption and Firehose does not have kms:Decrypt permission
AnswerB

Firehose requires the IAM role to have S3 write permissions.

Why this answer

Option B is correct because Firehose assumes an IAM role to write to S3; if that role lacks the s3:PutObject permission, AccessDenied occurs even if the bucket policy allows access. Option C (default encryption) would cause different errors. Option D (KMS key) applies only if SSE-KMS is used.

Option A (lifecycle policy) is unrelated.

124
Drag & Dropmedium

Arrange the steps to implement a data lake on Amazon S3 with AWS Lake Formation.

Why this order

Start by creating the S3 bucket. Then register it in Lake Formation, set up administrators, define the schema in Glue Catalog, and finally grant access to users.

125
Multi-Selecteasy

Which TWO AWS services can be used to transform data in transit during ingestion? (Choose 2.)

Select 2 answers
A.Amazon S3 Transfer Acceleration
B.Amazon Kinesis Data Firehose with Lambda transformation
C.AWS Glue ETL
D.Amazon Athena
E.AWS Data Pipeline
AnswersB, C

Firehose can call Lambda to transform records.

Why this answer

AWS Glue ETL can transform data before writing to target. Kinesis Data Firehose can invoke Lambda for transformation. Both transform data in transit.

126
MCQeasy

A company uses AWS Database Migration Service (DMS) to migrate an on-premises PostgreSQL database to Amazon RDS for PostgreSQL. The migration is ongoing and uses change data capture (CDC). The engineer notices that the target database is falling behind the source by several hours. What is the MOST likely cause?

A.The target table has disabled parallel apply.
B.A large full load is being transferred, which delays the start of CDC.
C.The replication instance is under-provisioned and needs to be larger.
D.Change data capture has been disabled on the source database.
AnswerB

Full load must complete before CDC begins, causing a backlog.

Why this answer

Option B is correct because a large full load can delay the start of CDC, causing the target to fall behind. Option C is incorrect because increasing the replication instance size typically improves performance. Option D is wrong because CDC requires an active transaction log; disabling it would stop replication.

Option A is wrong because parallel apply can actually speed up replication.

127
Multi-Selectmedium

Which TWO AWS services can be used to ingest streaming data into Amazon S3 with minimal code? (Choose two.)

Select 2 answers
A.AWS Lambda
B.Amazon Kinesis Data Firehose
C.Amazon Managed Streaming for Apache Kafka (MSK) with S3 sink connector
D.AWS Database Migration Service (DMS)
E.AWS DataSync
AnswersB, C

Firehose is serverless and delivers streaming data to S3 without code.

Why this answer

Option B (Kinesis Data Firehose) and Option C (Amazon MSK with S3 sink connector) are correct. Option A (Lambda) requires code. Option D (DMS) is for databases.

Option E (DataSync) is for batch file transfers.

128
Multi-Selectmedium

A company is using Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by an AWS Lambda function that enriches records and writes to Amazon S3. The Lambda function is experiencing high error rates due to records exceeding the 256 KB payload limit. Which TWO actions should the team take to resolve this issue?

Select 2 answers
A.Increase the Lambda function timeout.
B.Enable compression on the producer side before sending records to Kinesis.
C.Use the Kinesis Producer Library (KPL) to aggregate multiple small records into a single larger record.
D.Switch from Kinesis Data Streams to Kinesis Data Firehose.
E.Increase the number of shards in the Kinesis stream.
AnswersB, C

Compression reduces record size below the 256 KB limit.

Why this answer

The correct answers are B and C. Enabling compression reduces record size. Using the Kinesis Producer Library (KPL) aggregates multiple records into a single larger record while staying within limits.

Option E (increasing shards) does not reduce record size. Option D (using Kinesis Firehose) changes architecture but does not solve the payload limit. Option A (increasing Lambda timeout) does not address size.
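Producer-side compression can be sketched with the standard library alone. The payload below is a hypothetical, highly repetitive batch of clickstream events, so it compresses far better than typical production data would; the 256 KB figure is the limit quoted in the question.

```python
import gzip
import json

LIMIT_BYTES = 256 * 1024  # payload limit quoted in the question

# Hypothetical oversized payload: many repetitive clickstream events.
events = [
    {"page": f"/product/{i % 50}", "action": "view", "session": "abc123"}
    for i in range(6000)
]
raw = json.dumps(events).encode("utf-8")
compressed = gzip.compress(raw)             # producer side, before sending

print(len(raw) > LIMIT_BYTES, len(compressed) <= LIMIT_BYTES)

# The consumer reverses the step before processing the records.
restored = json.loads(gzip.decompress(compressed))
```

The consumer must be aware of the compression, so both sides need to agree on the encoding before it is enabled.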

129
MCQmedium

A data engineer needs to transform CSV files arriving in an S3 bucket into Parquet format and store them in another S3 bucket. The transformation is simple and on-demand, triggered by data arrival. Which solution is the MOST cost-effective and requires the least operational overhead?

A.Use Amazon EMR with Spark streaming
B.Use Amazon Athena to create a new table with Parquet format
C.Use AWS Glue ETL jobs scheduled to run every hour
D.Use S3 Events to trigger an AWS Lambda function that transforms the data
AnswerD

Lambda is event-driven, cost-effective, and serverless.

Why this answer

Option D is correct because S3 Events can trigger a Lambda function to perform the transformation when a new object is created. Option C is wrong because Glue jobs have startup time and cost more. Option A is wrong because EMR requires cluster management.

Option B is wrong because Athena is for querying, not transforming.

130
MCQhard

A company uses Amazon Kinesis Data Firehose to deliver log data to Amazon S3. The data is transformed by a Lambda function that adds a timestamp field. Recently, the Firehose delivery stream has been failing with 'Lambda invocation failed' errors. The Lambda function's CloudWatch Logs show that the function is timing out. What is the MOST likely cause?

A.The Lambda function lacks permission to write to CloudWatch Logs.
B.The Firehose buffer size is too large, causing the Lambda function to receive too many records.
C.The Lambda function timeout is set to 1 minute, which is adequate.
D.The Lambda function is running out of memory.
AnswerB

Large buffer size leads to large batches that exceed Lambda timeout.

Why this answer

Option B is correct because Firehose sends batches of records to Lambda; if the batch size is too large, the function may exceed its timeout. Option A is wrong because the Lambda function is being invoked, so permissions are fine. Option D is wrong because the error is about Lambda invocation, not memory.

Option C is wrong because a 1-minute timeout is typical and may be insufficient for large batches.

131
Drag & Dropmedium

Order the steps to troubleshoot a failed AWS Glue job that reads from JDBC and writes to S3.

Why this order

Start with logs to identify errors, then check connectivity, IAM permissions, test connection, and review script.

132
MCQhard

A financial services company ingests real-time stock trade data from multiple exchanges into Amazon Kinesis Data Streams. Each trade record is a JSON object containing fields: trade_id, symbol, price, quantity, and timestamp. The data is consumed by an AWS Lambda function that performs data validation and enrichment, then writes the processed records to an Amazon DynamoDB table for low-latency querying. Recently, the Lambda function has been timing out and failing to process all records. The Lambda function is configured with a 5-second timeout and 128 MB memory. The average record size is 2 KB, and the stream receives about 1000 records per second. The Lambda function's concurrency limit is 1000. Which set of actions should the data engineer take to resolve the issue without losing data?

A.Increase the Lambda function timeout to 60 seconds and memory to 1024 MB. Set the batch size to 100 records and enable parallelization factor of 10.
B.Increase the number of shards in the Kinesis data stream to 20 and keep the Lambda configuration unchanged.
C.Replace the Lambda function with a Kinesis Data Firehose delivery stream that writes directly to DynamoDB using a Lambda transformation.
D.Increase the Lambda function timeout to 60 seconds and memory to 1024 MB. Set the batch size to 100 records.
AnswerA

This combination increases processing capacity and prevents timeouts.

Why this answer

Option A is correct because increasing the Lambda timeout and memory addresses the processing bottleneck, while setting the batch size to 100 and enabling a parallelization factor of 10 allows each shard to process up to 10 concurrent batches, dramatically increasing throughput to handle 1000 records/sec (each shard can process 10 batches of 100 records concurrently, yielding 1000 records/sec per shard if the stream has at least 1 shard). This combination ensures no data loss by keeping up with the ingestion rate without exceeding the Lambda concurrency limit of 1000.

Exam trap

The trap here is that candidates often overlook the parallelization factor setting, assuming that increasing batch size and Lambda resources alone will suffice, but without parallelization, each shard can only process one batch at a time, creating a throughput bottleneck that leads to data loss.

How to eliminate wrong answers

Option B is wrong because simply increasing shards to 20 does not resolve the Lambda timeout issue; the function still has only 5 seconds and 128 MB, so it will continue to fail even with more shards. Option C is wrong because Kinesis Data Firehose cannot write directly to DynamoDB; it only supports destinations like S3, Redshift, Elasticsearch, and Splunk, and using a Lambda transformation would still require sufficient timeout and memory. Option D is wrong because increasing timeout and memory alone without enabling parallelization factor means each shard can only process one batch at a time, which at 100 records per batch would only handle 100 records per second per shard, insufficient for the 1000 records/sec load.
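The throughput reasoning above is simple arithmetic, sketched below. It rests on assumptions that the question does not state: a single shard, and each 100-record batch taking about one second to process. The helper name is illustrative.

```python
def records_per_second(shards, batch_size, parallelization_factor, batch_seconds):
    """Upper bound on throughput: concurrent batches times batch size over batch duration."""
    concurrent_batches = shards * parallelization_factor
    return concurrent_batches * batch_size / batch_seconds


# Assumptions: one shard, and each 100-record batch takes 1 second to process.
without_pf = records_per_second(shards=1, batch_size=100, parallelization_factor=1, batch_seconds=1.0)
with_pf = records_per_second(shards=1, batch_size=100, parallelization_factor=10, batch_seconds=1.0)
print(without_pf, with_pf)
```

Under these assumptions only the configuration with a parallelization factor of 10 reaches the 1000 records/sec ingestion rate; slower batches would lower both figures proportionally.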

133
Multi-Selectmedium

A data engineering team uses AWS Glue to extract, transform, and load (ETL) data from Amazon RDS for MySQL to Amazon S3. The job runs daily and processes incremental data. The team notices that the job is taking longer than expected. Which TWO actions can improve the job performance? (Choose two.)

Select 2 answers
A.Change the worker type to Standard (single node).
B.Use pushdown predicates to filter data at the source.
C.Add more transformations to the ETL script to clean data.
D.Increase the number of DPUs for the Glue job.
E.Disable compression on the output data to reduce CPU usage.
AnswersB, D

Reduces the data scanned and transferred from RDS.

Why this answer

B is correct because pushdown predicates filter data at the source, reducing the amount of data read and transferred. D is correct because increasing the number of DPUs (Data Processing Units) allocates more resources to the Glue job, speeding it up. A is wrong because a single node (Standard) reduces parallelism.

C is wrong because adding more transformations increases processing time. E is wrong because writing uncompressed output to S3 increases I/O.

134
Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for IoT sensor data. The sensors send JSON messages every second, and the data must be stored in Amazon S3 in near real-time (within 5 minutes). The engineer also needs to transform the data by adding a timestamp and filtering out malformed records. Which TWO services should be used together?

Select 2 answers
A.AWS Glue
B.Amazon Athena
C.Amazon Simple Queue Service (SQS)
D.AWS IoT Core
E.Amazon Kinesis Data Firehose
AnswersD, E

IoT Core can receive sensor messages and route to Firehose.

Why this answer

IoT Core can ingest the sensor data, and Kinesis Data Firehose can buffer it and write to S3, invoking a Lambda function to transform records in flight. Option A is wrong because Glue is for batch ETL, not real-time. Option B is wrong because Athena is for querying.

Option C is wrong because SQS is not needed.
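
A minimal sketch of the Lambda transformation that Firehose would invoke, assuming records arrive as base64-encoded JSON objects. The `ingested_at` field name is a hypothetical choice for this example; the `recordId`/`result`/`data` response shape is the one Firehose expects from a transformation function.

```python
import base64
import json
from datetime import datetime, timezone


def lambda_handler(event, context):
    """Firehose transformation: add a timestamp, drop malformed JSON records."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Malformed input: "Dropped" tells Firehose to discard the record.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        # "ingested_at" is a hypothetical field name chosen for this sketch.
        payload["ingested_at"] = datetime.now(timezone.utc).isoformat()
        line = json.dumps(payload) + "\n"  # newline-delimited JSON in S3
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": base64.b64encode(line.encode("utf-8")).decode("utf-8")})
    return {"records": output}
```

Returning `Dropped` rather than `ProcessingFailed` is a design choice here: dropped records are simply filtered out, while failed records would be routed to the error output prefix.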

135
MCQeasy

Refer to the exhibit. A company uses S3 Event Notifications to trigger an AWS Lambda function whenever a new object is uploaded to an S3 bucket. The Lambda function processes the file and moves it to a different bucket. Recently, the function has been failing intermittently. The engineer checks the Lambda CloudWatch logs and sees the above event. What is the MOST likely cause of the intermittent failures?

A.The event JSON is malformed; the 'EventSource' should be 's3.amazonaws.com'.
B.The S3 bucket name contains a hyphen, which is not allowed.
C.The S3 event notifications are not guaranteed to be delivered exactly once, causing duplicate processing.
D.The event is missing the 'object:versionId' field.
AnswerC

At-least-once delivery can cause issues.

Why this answer

Option C is correct because S3 Event Notifications are asynchronous and may be duplicated or delivered out of order, causing race conditions. Option A is wrong because the event format is correct. Option B is wrong because the bucket name is valid; hyphens are allowed in bucket names.

Option D is wrong because the event has all required fields.
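
One common way to tolerate duplicate deliveries is to make the handler idempotent. The sketch below is illustrative only: `process_once` and the in-memory `seen` set are hypothetical, and a real function would need a durable store, since Lambda execution environments do not share memory.

```python
def process_once(event, seen, handle):
    """Call handle(bucket, key) only for S3 records not processed before."""
    handled = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # bucket + key + sequencer identifies one specific event for an object.
        marker = (s3["bucket"]["name"],
                  s3["object"]["key"],
                  s3["object"].get("sequencer"))
        if marker in seen:
            continue  # duplicate delivery, skip it
        handle(marker[0], marker[1])
        seen.add(marker)
        handled.append(marker)
    return handled


# Hypothetical event, delivered twice.
event = {"Records": [{"s3": {"bucket": {"name": "incoming-bucket"},
                             "object": {"key": "file.csv", "sequencer": "0055AED6"}}}]}
seen, moved = set(), []
process_once(event, seen, lambda bucket, key: moved.append(key))
process_once(event, seen, lambda bucket, key: moved.append(key))
print(moved)
```

With the duplicate skipped, the second invocation does not try to move an object that the first invocation already moved.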

136
MCQmedium

A data engineer is designing a pipeline that ingests JSON logs from an application into Amazon S3. The logs contain a timestamp field. The pipeline must partition the data by date in S3 (e.g., year=2024/month=10/day=01). Which approach minimizes transformation effort?

A.Use Amazon Kinesis Data Firehose with dynamic partitioning
B.Use AWS Glue crawlers to infer schema and create partitions
C.Use AWS Lambda to process each object and copy to the appropriate prefix
D.Use Amazon Athena to create partitions on the existing data
AnswerA

Firehose can dynamically partition data based on the timestamp and deliver to S3 partitioned prefixes.

Why this answer

Option A is correct because Kinesis Data Firehose with dynamic partitioning can automatically partition data based on the timestamp field and deliver it to date-based S3 prefixes, with no separate transformation step.

Option B is wrong because Glue crawlers only catalog schema and partition metadata; they do not write objects into partitioned prefixes. Option C is wrong because Lambda would require custom code. Option D is wrong because Athena is a query service.
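
The mapping from a record's timestamp to its S3 prefix can be illustrated in a few lines. This is only a sketch of the idea: Firehose derives partition keys from its own configuration rather than from Python code, and `partition_prefix` is a hypothetical helper that assumes ISO 8601 timestamps.

```python
from datetime import datetime


def partition_prefix(timestamp):
    """Map an ISO 8601 timestamp to a Hive-style date prefix."""
    ts = datetime.fromisoformat(timestamp)
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


print(partition_prefix("2024-10-01T12:34:56"))  # year=2024/month=10/day=01/
```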

137
MCQhard

An e-commerce company uses AWS Glue to process clickstream data from its website. The data is stored in Amazon S3 in partitioned Parquet format by date and hour. A recent increase in traffic has caused the Glue job to fail with 'Java heap space' errors. The job runs with 10 DPUs and uses Spark's default configurations. The data engineer needs to resolve the memory issue without modifying the ETL script. What should the data engineer do?

A.Decrease the Spark configuration 'spark.sql.shuffle.partitions' to 50.
B.Change the worker type to G.1X.
C.Increase the Spark configuration 'spark.sql.shuffle.partitions' to 500.
D.Increase the number of DPUs to 20.
AnswerC

More shuffle partitions reduce the size of data per partition, mitigating memory issues.

Why this answer

Option C is correct. Increasing Spark's shuffle partitions reduces the amount of data handled per partition, preventing heap space errors. Option A is wrong because reducing partitions may increase memory pressure.

Option B is wrong because changing the worker type to G.1X may not help if the issue is shuffle-related. Option D is wrong because increasing DPUs may not directly address the heap space issue if the problem is within a single executor.
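
The effect of the shuffle partition count can be seen with simple arithmetic. The sketch assumes a hypothetical 100 GB shuffle spread evenly across partitions (real data is often skewed); 200 is Spark's default for `spark.sql.shuffle.partitions`.

```python
def mb_per_shuffle_partition(shuffle_mb, partitions):
    """Average shuffle data per partition, assuming an even distribution."""
    return shuffle_mb / partitions


shuffle_mb = 100 * 1024  # hypothetical 100 GB shuffle
for partitions in (50, 200, 500):
    print(partitions, mb_per_shuffle_partition(shuffle_mb, partitions))
```

Going from 200 to 500 partitions cuts the average partition from 512 MB to about 205 MB in this example, while dropping to 50 raises it to 2 GB per partition.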

138
MCQeasy

A data engineer is using AWS Glue to perform ETL on data stored in an S3 bucket. The source data is in CSV format with a header row, and the target is a set of Parquet files partitioned by date. The engineer notices that the Glue job is reading all files in the source prefix, including temporary files that should be ignored. What is the MOST efficient way to exclude these temporary files?

A.Change the source format from CSV to Parquet.
B.Set up an S3 event notification to trigger a Lambda function that moves temporary files.
C.Use an S3 prefix exclusion pattern in the Glue job's source path.
D.Create a custom classifier in the Glue Data Catalog.
AnswerC

Glue supports S3 include/exclude patterns to filter files.

Why this answer

Using an S3 prefix exclusion pattern in the Glue job's S3 path is the most efficient way to exclude files. Option A is wrong because changing to Parquet does not exclude files. Option B is wrong because Lambda would add complexity.

Option D is wrong because a custom classifier is for schema inference.
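
How an exclusion pattern filters a file listing can be sketched with Python's `fnmatch` module. This only approximates the idea: Glue's own glob semantics may differ in detail, and the keys and patterns below are hypothetical.

```python
from fnmatch import fnmatchcase


def keep_keys(keys, exclusions):
    """Return the keys that match none of the exclusion glob patterns."""
    return [k for k in keys
            if not any(fnmatchcase(k, pattern) for pattern in exclusions)]


keys = [
    "sales/2024-10-01.csv",
    "sales/_temporary/part-0000.csv",
    "sales/2024-10-02.csv.tmp",
]
kept = keep_keys(keys, ["*/_temporary/*", "*.tmp"])
print(kept)
```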

139
Multi-Selecthard

A company ingests streaming data from social media feeds into Amazon Kinesis Data Streams. The data is consumed by an AWS Lambda function that transforms and writes to Amazon S3. Recently, the Lambda function started timing out and dropping records. The data volume has tripled. Which actions should the data engineer take to resolve this? (Choose TWO.)

Select 2 answers
A.Increase the number of shards in the Kinesis data stream
B.Replace Lambda with Amazon Kinesis Data Firehose for the transformation
C.Increase the Lambda function timeout to 15 minutes
D.Set a reserved concurrency on the Lambda function
E.Increase the memory allocated to the Lambda function
AnswersA, E

More shards increase throughput capacity.

Why this answer

Options A and E are correct because increasing the number of shards increases throughput, and increasing Lambda memory can improve processing speed and reduce timeouts. Option B (replacing Lambda with Data Firehose) switches to a different service; it could be an alternative but is not a direct fix for the current architecture. Option C (Lambda timeout) may not be sufficient without more resources.

Option D (reserved concurrency) may cause throttling.

140
MCQhard

A data pipeline ingests streaming data from Kinesis Data Streams into S3 via Kinesis Data Firehose. Occasionally, small files are written to S3, increasing downstream processing costs. What is the most efficient way to reduce the number of small files?

A.Use a Lambda function to aggregate records before sending to Firehose.
B.Use the Kinesis Client Library (KCL) to write larger batches to S3 directly.
C.Run a daily AWS Glue job to concatenate small files.
D.Increase the Firehose buffering interval to 300 seconds and buffering size to 64 MB.
AnswerD

Firehose will buffer more data per file.

Why this answer

Option D is correct because increasing the buffering interval and size in Firehose batches more data into fewer files. Option A is wrong because Lambda can aggregate but adds complexity and cost. Option B is wrong because KCL runs on EC2, not serverless.

Option C is wrong because Glue ETL runs after delivery, not preventing small files.
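
A back-of-the-envelope estimate shows why larger buffer settings produce fewer, bigger objects. The sketch assumes a steady ingest rate and uses hypothetical numbers; Firehose flushes when either the size or the interval threshold is reached, whichever comes first.

```python
def approx_object_mb(rate_mb_per_s, buffer_mb, buffer_seconds):
    """Approximate size of each delivered object at a steady ingest rate."""
    # Whichever limit is hit first (size or time) triggers the flush.
    return min(buffer_mb, rate_mb_per_s * buffer_seconds)


rate = 0.05  # hypothetical ingest rate: 50 KB/s
small = approx_object_mb(rate, buffer_mb=5, buffer_seconds=60)
large = approx_object_mb(rate, buffer_mb=64, buffer_seconds=300)
print(small, large)
```

At this assumed rate the interval is the limiting factor in both cases, so raising it from 60 to 300 seconds makes each object roughly five times larger.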

141
Drag & Dropmedium

Order the steps to query data in Amazon Redshift Spectrum from an external table in Athena.

Why this order

Start by creating the external schema in Redshift, then the external table, grant permissions, run the query, and verify results.

142
Multi-Selecthard

A company uses AWS Glue to transform data from Amazon S3 into Parquet format. The job fails with an out-of-memory error for large files. Which TWO actions can resolve this issue? (Choose TWO.)

Select 2 answers
A.Change the input format from CSV to JSON.
B.Increase the number of DPUs allocated to the job.
C.Use the Glue streaming ETL feature.
D.Enable CloudWatch logs for detailed error analysis.
E.Split the input data into smaller files.
AnswersB, E

More DPUs provide more memory and processing power.

Why this answer

Option B increases memory, and Option E increases parallelism. Option A is not relevant because changing the input format does not reduce memory use. Option C is for streaming.

Option D is for debugging, not resolving OOM.

143
MCQmedium

Refer to the exhibit. A data engineer deploys this CloudFormation template to create an AWS Glue job. The job fails on the first run with an error: 'AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/GlueServiceRole/... is not authorized to perform: s3:GetObject on resource: s3://my-bucket/scripts/etl.py'. What is the most likely cause?

A.The ExecutionProperty MaxConcurrentRuns is set to 1, preventing the job from running.
B.The IAM role associated with the Glue job does not have an S3 GetObject permission for the script location.
C.The MaxRetries is set to 0, so the job does not retry on failure.
D.The script location is incorrectly specified; it should be an S3 URI with bucket and key.
AnswerB

Glue needs s3:GetObject on the script.

Why this answer

Option B is correct because the IAM role (GlueServiceRole) does not have permission to read the script from S3. Option A is wrong because MaxConcurrentRuns does not affect the first run. Option C is wrong because MaxRetries is 0, meaning no retries, but the error is about access.

Option D is wrong because the script location is correct.

144
MCQeasy

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The job runs successfully but the data in S3 is missing some records that exist in the source. The engineer notices that the job uses a JDBC connection and the query extracts data based on a timestamp column. What is the MOST likely cause of the missing records?

A.The timestamp column includes time portion and the job is using an exclusive upper bound.
B.The S3 bucket lacks write permissions.
C.The JDBC connection uses connection pooling, causing some records to be dropped.
D.The Glue job is configured to only read from one table.
AnswerA

Records with the same timestamp as the upper bound may be excluded.

Why this answer

Option A is correct because if the timestamp column includes time, records with the same timestamp as the boundary may be missed if the job uses an exclusive upper bound. Option B is wrong because missing S3 bucket permissions would cause the job to fail, not drop records. Option C is wrong because connection pooling does not cause missing records.

Option D is wrong because the job can read from different tables.
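
The boundary problem is easy to reproduce with plain Python datetimes. The sketch uses hypothetical rows and a hypothetical midnight boundary between two daily runs; it shows a record that sits exactly on the boundary being skipped when both runs exclude it, and one way to close the gap with half-open windows.

```python
from datetime import datetime

rows = [
    datetime(2024, 10, 1, 23, 59, 59),
    datetime(2024, 10, 2, 0, 0, 0),   # exactly on the boundary
    datetime(2024, 10, 2, 0, 0, 1),
]
boundary = datetime(2024, 10, 2)

# Buggy: run 1 uses an exclusive upper bound, run 2 an exclusive lower bound.
run1 = [r for r in rows if r < boundary]
run2_buggy = [r for r in rows if r > boundary]
missed = [r for r in rows if r not in run1 + run2_buggy]

# Fixed: half-open windows [lower, upper), so the next run includes the boundary.
run2_fixed = [r for r in rows if r >= boundary]
print(missed, len(run1) + len(run2_fixed))
```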

145
MCQeasy

Refer to the exhibit. A data engineer runs the AWS CLI command and observes the output. The stream has two shards. A producer sends a record with a partition key that hashes to 150000000000000000000000000000000000000. To which shard will the record be written?

A.shardId-000000000001
B.The record will be rejected because it does not match any shard
C.shardId-000000000000
D.The record will be written to both shards
AnswerA

The hash key falls within the range of the second shard.

Why this answer

Option A is correct because the hash key range for shardId-000000000000 is 0 to 113427455640312821154458202477256070484. The given hash 150000... is greater than the end of the first shard, so it falls into the second shard's range (113427... to 226854...). Option C is wrong because the hash is not in the first shard's range.

Option B is wrong because the hash is not outside both ranges, so the record is not rejected. Option D is wrong because each hash key maps to exactly one shard.
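
The lookup can be sketched in Python. Kinesis maps a partition key to a 128-bit integer with MD5 and picks the shard whose hash key range contains it. The first shard's range comes from the explanation above; the exact end of the second range is not shown here, so the value below is an assumption (same width as the first range), and `find_shard` is a hypothetical helper.

```python
import hashlib

FIRST_END = 113427455640312821154458202477256070484
SHARDS = [
    ("shardId-000000000000", 0, FIRST_END),
    # ASSUMED end: same width as the first range (it starts with 226854...).
    ("shardId-000000000001", FIRST_END + 1, 2 * (FIRST_END + 1) - 1),
]


def hash_key(partition_key):
    """128-bit integer that Kinesis derives from a partition key (MD5)."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def find_shard(hash_value, shards):
    """Return the ID of the shard whose range contains hash_value, if any."""
    for shard_id, start, end in shards:
        if start <= hash_value <= end:
            return shard_id
    return None


QUESTION_HASH = 15 * 10**37  # 1.5e38, the hash value from the question
print(find_shard(QUESTION_HASH, SHARDS))  # shardId-000000000001
```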

146
Multi-Selectmedium

A company ingests IoT data into an S3 bucket using AWS IoT Core rules. The data is in JSON format, and each record is about 500 bytes. The data volume is 5 GB per day. The company wants to convert the data to Parquet format and partition it by year/month/day. Which TWO AWS services can be used together to achieve this with minimal operational overhead?

Select 2 answers
A.Amazon Athena CTAS query
B.AWS Glue ETL job triggered by S3 event
C.AWS Lambda function triggered by S3 event
D.Amazon EMR with Spark job
E.Amazon Kinesis Data Firehose with Parquet conversion
AnswersB, C

Glue can be triggered by S3 events (via Lambda or EventBridge) and perform the conversion and partitioning.

Why this answer

Options B and C are correct. S3 events can trigger Lambda, and Lambda can convert JSON to Parquet and write to partitioned paths. AWS Glue can also be triggered by S3 events via a workflow, and the combination of Lambda and Glue works because Glue can read from the landing zone and write partitioned Parquet.

Option E (Kinesis Data Firehose) can convert to Parquet but cannot partition by year/month/day easily. Option D (EMR) requires cluster management. Option A (Athena) is for querying, not transformation.

147
Multi-Selecteasy

A data engineer needs to ingest data from an Amazon RDS MySQL database into a data lake on Amazon S3. The engineer wants to perform an initial full load and then capture incremental changes. Which TWO AWS services can be combined to achieve this?

Select 2 answers
A.Amazon Kinesis Data Firehose
B.AWS Glue
C.Amazon S3
D.AWS Database Migration Service (DMS)
E.AWS Data Pipeline
AnswersC, D

S3 is the target for the data lake.

Why this answer

AWS DMS can perform the full load and ongoing change data capture (CDC), and S3 can be the target. Option A is wrong because Firehose doesn't connect to MySQL.

Option B is wrong because Glue does not support CDC from MySQL directly. Option E is wrong because Data Pipeline does not support CDC.

148
MCQhard

A company uses Kinesis Data Analytics for SQL-based real-time analytics on streaming data. They notice that the application is processing data slower than the incoming rate, causing increased latency. Which action is MOST likely to improve the throughput?

A.Increase the number of Kinesis Processing Units (KPUs) for the application
B.Increase the number of shards in the Kinesis data stream
C.Enable auto-scaling on the Kinesis data stream
D.Decrease the retention period of the Kinesis data stream
AnswerA

More KPUs increase parallelism and throughput.

Why this answer

Option A is correct because increasing the number of Kinesis Processing Units (KPUs) allows more parallelism. Option B is wrong because increasing shards increases ingestion capacity, but the bottleneck is the analytics application. Option C is wrong because enabling auto-scaling for the stream itself does not increase application parallelism.

Option D is wrong because decreasing the retention period does not affect processing speed.

149
MCQhard

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS for MySQL and load it into Amazon S3. The job runs daily and processes incremental changes using the JDBC connection. Recently, the job has been failing with a 'Communications link failure' error. The RDS instance is in a private subnet. Which step should the engineer take first to diagnose the issue?

A.Verify that the IAM role used by Glue has the correct permissions to access RDS.
B.Change the Glue job type from Spark to Python shell.
C.Check the security group and network ACL rules for the RDS instance and the Glue connection.
D.Check that the JDBC driver is compatible with the Glue version.
AnswerC

Network misconfiguration is the most common cause of link failure.

Why this answer

Option C is correct because network connectivity between Glue and RDS is likely the issue, and checking security groups and network ACLs is the first step. Option A is wrong because the error is not about authentication or IAM permissions. Option B is wrong because the job type (Spark vs Python shell) does not affect connectivity.

Option D is wrong because the job ran successfully before it recently started failing, which makes a driver incompatibility an unlikely first suspect.

150
Multi-Selectmedium

A company is building a data lake on Amazon S3. Data arrives from multiple sources in JSON, CSV, and Avro formats. The data must be transformed to Parquet and partitioned by date and source. Which TWO services can perform this transformation with minimal custom code? (Choose TWO.)

Select 2 answers
A.Amazon EMR with Spark
B.AWS Lake Formation
C.Amazon Athena CTAS queries
D.AWS Glue ETL jobs
E.Amazon Kinesis Data Firehose
AnswersA, D

EMR can run Spark for large-scale transformations.

Why this answer

Amazon EMR with Spark is correct because Spark natively supports reading JSON, CSV, and Avro formats and writing Parquet with built-in partitioning by date and source. You can achieve this with a concise PySpark or Scala script that loads the data, applies partitioning logic, and writes to S3, requiring minimal custom code beyond the transformation logic.

Exam trap

The trap here is that candidates often confuse AWS Lake Formation's data catalog and permission features with actual data transformation capabilities, or they assume Kinesis Data Firehose can transform existing S3 objects when it only processes streaming data in transit.
