Knowledge + Practice

CCNA Data Ingestion Transformation Questions

75 of 610 questions · Page 3/9 · Data Ingestion Transformation topic · Answers revealed

151

MCQ · medium

A social media company ingests user activity data from multiple sources using Amazon Kinesis Data Firehose. The data is delivered to Amazon S3 in near-real-time. The company wants to transform the data by adding a timestamp and masking email addresses before storing it in S3. The transformation should be applied to all records. What is the most cost-effective way to implement this transformation?

A. Use Amazon Athena to run a CTAS query that transforms the data and writes to a new location.

B. Use AWS Glue to schedule a batch job every 5 minutes to transform the data.

C. Use Amazon S3 Events to trigger a Lambda function whenever a new object is created.

D. Configure the Firehose delivery stream to invoke a Lambda function for data transformation.

Answer: D

Firehose supports built-in Lambda transformation for real-time processing.

Why this answer

Option D is correct. Kinesis Data Firehose can invoke a Lambda function to transform records in flight, before they are delivered to S3. This is cost-effective because the function runs only when data is flowing.

Option A is wrong because Athena is for querying data already in S3, not for transforming records before they are stored. Option B is wrong because Glue jobs are batch-oriented and add latency. Option C is wrong because S3 Events with Lambda adds complexity and cost.
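The Firehose-invoked Lambda transformation described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes each record is a JSON object, and the `processed_at` field name, the masking string, and the email pattern are all hypothetical choices. The record contract (`recordId`, `result`, base64-encoded `data`) is the one Firehose expects from a transformation function.

```python
import base64
import json
import re
from datetime import datetime, timezone

# Simple illustrative pattern (an assumption); production matching may need to be stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(value):
    """Recursively replace email-like substrings in strings, dicts, and lists."""
    if isinstance(value, str):
        return EMAIL_RE.sub("***@***", value)
    if isinstance(value, dict):
        return {k: mask_emails(v) for k, v in value.items()}
    if isinstance(value, list):
        return [mask_emails(v) for v in value]
    return value


def lambda_handler(event, context):
    """Transform every record in a Firehose batch and return one result per record."""
    output = []
    for record in event["records"]:
        try:
            payload = mask_emails(json.loads(base64.b64decode(record["data"])))
            # "processed_at" is a hypothetical field name for the added timestamp.
            payload["processed_at"] = datetime.now(timezone.utc).isoformat()
            line = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line).decode("utf-8"),
            })
        except (ValueError, TypeError):
            # Malformed input: hand the original data back marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every input record must come back with the same `recordId`; records marked `ProcessingFailed` are treated by Firehose as unsuccessfully processed rather than delivered to the main destination.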

152

MCQ · easy

A data engineering team is ingesting streaming data from IoT devices using AWS IoT Core and needs to process the data in near real-time with minimal code. Which AWS service should they use to transform the data before storing it in Amazon S3?

A. Amazon Kinesis Data Analytics

B. AWS Glue

C. Amazon Redshift

D. Amazon Athena

Answer: A

Kinesis Data Analytics can run SQL queries on streaming data from IoT Core in near real-time.

Why this answer

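Option A is correct because Kinesis Data Analytics can apply SQL-based transformations to streaming data in near real time; AWS IoT Core rules can route messages to Kinesis Data Streams or Kinesis Data Firehose, which Kinesis Data Analytics reads as a source. Option B is wrong because Glue is for batch ETL, not real-time. Option C is wrong because Redshift is for warehousing.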

Option D is wrong because Athena is for ad-hoc querying, not streaming.

153

MCQ · easy

A company needs to ingest data from Amazon S3 into Amazon Redshift for analytics. The data arrives in CSV format with headers and may contain duplicate rows. Which Redshift command should be used to load the data while handling duplicates?

A. COPY command with the `REMOVEDUPLICATES` option

B. INSERT INTO ... SELECT DISTINCT from S3 via Spectrum

C. Use a staging table with COPY and then MERGE into the target table

D. CREATE TABLE AS SELECT DISTINCT from the S3 bucket

Answer: C

MERGE allows handling duplicates.

Why this answer

Option C is correct: load the files into a staging table with `COPY`, then use `MERGE` to upsert into the target table so that duplicates are not inserted. Option A is wrong because `COPY` has no `REMOVEDUPLICATES` option, and `COPY` alone does not remove duplicates. Option B (INSERT INTO ... SELECT DISTINCT) does not handle duplicates.

Option D (CREATE TABLE AS) does not handle duplicates.

154

MCQ · hard

A data engineer reviews the Glue job configuration. The job fails when processing large datasets. The error message indicates out-of-memory in the executors. Which change to the job configuration will most directly address this issue?

A. Change the worker type from Standard to G.2X.

B. Increase the timeout from 30 to 60 minutes.

C. Increase the number of workers from 5 to 10.

D. Set MaxRetries to 3.

Answer: A

G.2X workers provide more memory per worker (32 GB) than Standard workers (16 GB), directly addressing OOM.

Why this answer

The job uses 5 workers of Standard type. To increase memory, the engineer should use G.1X or G.2X worker types which provide more memory per worker, or increase the number of workers. Among options, changing worker type to G.2X directly increases memory.

155

Multi-Select · easy

A company wants to ingest streaming data from social media feeds into AWS for real-time analytics. Which TWO services can directly ingest streaming data without writing custom code? (Choose TWO.)

Select 2 answers

A. AWS Glue

B. Amazon AppFlow

C. Amazon Kinesis Data Firehose

D. Amazon Kinesis Data Streams

E. Amazon S3 Transfer Acceleration

Answers: B, C

AppFlow can ingest data from SaaS applications (including social media) directly into AWS.

Why this answer

Amazon Kinesis Data Firehose can directly ingest streaming data and deliver to destinations. AWS Glue can stream from Kafka but not directly ingest from social media without custom connectors. Kinesis Data Streams requires producers to send data, not direct ingestion.

AppFlow can ingest from SaaS applications including social media.

156

MCQ · easy

A company uses AWS Lambda to process events from an S3 bucket. The Lambda function writes transformed data to another S3 bucket. Occasionally, the Lambda invocation fails with 'ResourceNotFoundException'. What is the MOST likely cause?

A. The Lambda function timed out.

B. The destination S3 bucket does not exist or the Lambda function's IAM role lacks permissions.

C. The S3 event notification is misconfigured.

D. The source S3 bucket has versioning disabled.

Answer: B

ResourceNotFoundException indicates missing resource or access denial.

Why this answer

Option B is correct. The destination bucket may not exist or the Lambda function's IAM role lacks permissions to write to it. Option A is wrong because Lambda timeouts would cause 'Timeout' error.

Option C is wrong because S3 event notifications are reliable. Option D is wrong because the source bucket exists since it triggered the event.

157

MCQ · medium

A company uses AWS Glue to run ETL jobs that transform data from an Amazon S3 bucket (raw) to another S3 bucket (curated). The jobs run on a schedule and process data incrementally. The data engineer notices that the jobs are taking longer to complete each day, and the job metrics show that the number of DPUs (Data Processing Units) is underutilized. The engineer wants to improve job performance. What should the data engineer do?

A. Increase the number of DPUs allocated to the Glue job to enable more parallelism.

B. Switch from batch processing to streaming using AWS Glue Streaming.

C. Enable job bookmarks to skip already processed data more efficiently.

D. Decrease the number of DPUs to reduce resource contention.

Answer: A

More DPUs can speed up processing if the job is parallelizable.

Why this answer

Option A is correct because increasing the number of DPUs can reduce job duration if the workload is parallelizable. The underutilization suggests that more DPUs could be used. Option D is wrong because decreasing DPUs would increase time.

Option C is wrong because the issue is not about job bookmarks (which handle incremental processing). Option B is wrong because converting to streaming may not be appropriate for batch jobs.

158

MCQ · easy

A data engineering team needs to transform CSV files to Parquet format after they land in an S3 bucket. The transformation should be triggered automatically as soon as a new file arrives. Which AWS service is best suited for this task?

A. AWS Batch job submitted by S3 event

B. Amazon EMR cluster running continuously

C. AWS Lambda function triggered by S3 event

D. AWS Glue ETL job scheduled every 5 minutes

Answer: C

Lambda can be triggered immediately on S3 PUT events and perform the transformation.

Why this answer

Option C is correct because AWS Lambda can be triggered by S3 events and convert files to Parquet using libraries like PyArrow. Option D (AWS Glue) is better for scheduled jobs, not event-driven. Option B (Amazon EMR) is overkill.

Option A (AWS Batch) is for compute jobs but adds latency.

159

Multi-Select · medium

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data engineer needs to transform the data before delivery. Which THREE options can be used to perform the transformation?

Select 3 answers

A. Amazon Athena queries

B. AWS Glue ETL job

C. Amazon Kinesis Data Firehose data format conversion (e.g., JSON to Parquet)

D. AWS Lambda function

E. Amazon Kinesis Data Firehose dynamic partitioning with Lambda

Answers: C, D, E

Firehose can convert data formats natively.

Why this answer

Options C, D, and E are correct. Lambda can transform records, Firehose can convert to Parquet/ORC, and Firehose can invoke Lambda for dynamic partitioning. Option B (Glue) is not integrated with Firehose directly.

Option A (Athena) is for querying.

160

MCQ · hard

A data engineer is designing a real-time analytics pipeline for clickstream data. The source is Amazon Kinesis Data Streams, and the data must be stored in Amazon S3 in partitioned Parquet format with near-real-time latency. The engineer must also handle late-arriving data (up to 1 hour). Which combination of services meets these requirements?

A. Use AWS Glue streaming ETL to write to S3 in real-time.

B. Use Kinesis Data Analytics with a tumbling window to write to S3.

C. Use Kinesis Data Firehose with a Lambda transformation to write to S3, and a separate Lambda consumer to reprocess late data from the stream.

D. Use Kinesis Data Firehose to deliver to S3, and use S3 Batch Operations to process late data.

Answer: C

Firehose handles real-time delivery with partitioning; Lambda can reprocess late records.

Why this answer

Option C is correct because Kinesis Data Firehose can buffer data and deliver to S3 with custom partitioning, and a separate Lambda function can reprocess late data from the stream. Option B is wrong because Kinesis Data Analytics does not partition data by time. Option D is wrong because S3 Batch Operations are for batch, not streaming.

Option A is wrong because Glue streaming ETL lacks built-in support for late data handling with Firehose partitioning.

161

Multi-Select · hard

Which THREE factors should be considered when selecting a data ingestion service for a high-volume, real-time streaming pipeline that requires exactly-once processing semantics? (Choose 3.)

Select 3 answers

A. Ability to replay records from a checkpoint

B. Support for idempotent record processing

C. Integration with Amazon S3 for checkpoint storage

D. Support for schema evolution

E. Ability to transform data in-flight

Answers: A, B, C

Replay allows recovery without duplication.

Why this answer

Exactly-once processing requires services that support idempotent writes and checkpoints. Kinesis Data Streams supports record replay and checkpointing. AWS Lambda can be used with Kinesis for idempotent processing.

Amazon S3 provides idempotent PUT operations. Amazon DynamoDB can store checkpoints. Amazon Kinesis Data Firehose provides at-least-once delivery, not exactly-once.

AWS Glue is batch-oriented.
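
How replay from a checkpoint combines with idempotent processing can be sketched in plain Python. This is a toy model under stated assumptions, not any AWS client API: records are hypothetical `(sequence_number, amount)` tuples, and `replay_from` and `IdempotentSink` are illustrative names. The point is that at-least-once redelivery after a restart does not change the result when each record is applied at most once.

```python
def replay_from(stream, checkpoint):
    """Return the records whose sequence number comes after the checkpoint."""
    return [r for r in stream if checkpoint is None or r[0] > checkpoint]


class IdempotentSink:
    """Applies each (sequence_number, amount) record at most once."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def apply(self, record):
        seq, amount = record
        if seq in self.seen:
            # Duplicate delivery: ignore it so the total is unaffected.
            return False
        self.seen.add(seq)
        self.total += amount
        return True


stream = [(1, 10), (2, 20), (3, 30)]
sink = IdempotentSink()

# First run: all three records are applied, but the consumer stops
# before saving a checkpoint later than sequence number 1.
for record in stream:
    sink.apply(record)
saved_checkpoint = 1

# Restart: records 2 and 3 are delivered again from the checkpoint.
redelivered = replay_from(stream, saved_checkpoint)
applied_again = [sink.apply(r) for r in redelivered]
```

In this sketch the deduplication state lives in memory; a real pipeline would need to keep it in durable storage alongside the checkpoint.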

162

Multi-Select · medium

Which TWO actions can improve the performance of an AWS Glue ETL job that processes large datasets in Amazon S3? (Choose two.)

Select 2 answers

A. Increase the frequency of the Glue crawler.

B. Use a single Availability Zone for the S3 bucket.

C. Increase the number of DPUs allocated to the job.

D. Use columnar file formats like Parquet or ORC.

E. Use a single large file instead of many small files.

Answers: C, D

More DPUs increase parallelism and memory.

Why this answer

Option C (increase DPUs) and Option D (use columnar format) are correct. Option E (use a single file) may reduce parallelism. Option A (increase crawler frequency) does not affect ETL performance.

Option B (single Availability Zone) does not improve performance.

163

MCQ · medium

A data engineer is responsible for ingesting log files from a fleet of on-premises servers into Amazon S3 for central analysis. Each server generates log files that are rotated every hour, resulting in files of about 500 MB each. The total daily data volume is approximately 1 TB. The network connection between the on-premises data center and AWS is a 100 Mbps VPN. The engineer needs to ensure that all log files are transferred to S3 within 24 hours of generation without data loss. The engineer is considering using AWS DataSync. However, the initial setup shows that the transfer speed is insufficient to meet the 24-hour SLA. What should the engineer do to meet the requirement?

A. Use AWS CLI with multipart uploads and parallel threads to maximize throughput.

B. Contact the network team to upgrade the VPN bandwidth to at least 1 Gbps.

C. Order an AWS Snowball Edge device to transfer the initial data and then use DataSync for incremental changes.

D. Configure AWS DataSync to run on a schedule with incremental transfers and enable data compression.

Answer: D

Incremental transfers reduce the amount of data transferred each day; compression further reduces size, meeting the SLA.

Why this answer

Option D is correct because AWS DataSync supports scheduled incremental transfers; after the initial full sync, only new and changed files are transferred, and compression reduces the bytes sent over the link. Option C is wrong because ordering a Snowball Edge would take days to arrive and is not suitable for ongoing daily transfers. Option B is wrong because increasing bandwidth may not be feasible or cost-effective.

Option A is wrong because parallel multipart uploads still share the same 100 Mbps link, so they do not solve the network bottleneck.

164

Multi-Select · medium

A company is designing a data ingestion pipeline for real-time clickstream data. The data must be ingested with low latency (< 1 second) and then processed for real-time analytics. The processed data should be stored in Amazon S3 for batch analytics. Which THREE services should be used together?

Select 3 answers

A. Amazon Managed Streaming for Apache Kafka (MSK)

B. Amazon Kinesis Data Analytics

C. Amazon Kinesis Data Firehose

D. Amazon Kinesis Data Streams

E. AWS Glue ETL job

Answers: B, C, D

Performs real-time processing and analytics on streaming data.

Why this answer

Options B, C, and D are correct. Kinesis Data Streams ingests data with low latency. Kinesis Data Analytics processes streaming data in real-time.

Kinesis Data Firehose delivers processed data to S3. Option E (AWS Glue) is batch-oriented. Option A (Amazon MSK) is an alternative but not necessary if Kinesis is used.

165

MCQ · hard

A data pipeline uses Amazon Kinesis Data Streams with enhanced fan-out consumers. The team notices that one consumer falls behind and data accumulates. Which action will help this consumer catch up without affecting other consumers?

A. Increase the retention period of the stream.

B. Register an enhanced fan-out consumer for the slow consumer.

C. Increase the number of shards in the stream.

D. Use a Lambda consumer instead of an enhanced fan-out consumer.

Answer: B

Enhanced fan-out provides dedicated throughput per consumer, allowing the slow consumer to catch up without impacting others.

Why this answer

Registering a new enhanced fan-out consumer with its own dedicated read throughput allows it to catch up independently. Increasing shards affects all consumers, and increasing the retention period gives the consumer more time before records expire but doesn't increase throughput.

166

MCQ · hard

An e-commerce company uses AWS Glue to run ETL jobs that transform clickstream data from Amazon S3. The job reads Parquet files, performs aggregations, and writes the results to Amazon Redshift. The job runs successfully but takes longer than expected. The data volume is increasing. Which design change would MOST improve the job's performance?

A. Write the aggregated results to a single large file instead of multiple partitions.

B. Convert the Parquet files to CSV to simplify the schema.

C. Replace the Redshift target with Amazon Redshift Spectrum.

D. Increase the number of Glue worker nodes (DPUs) for the job.

Answer: D

More workers parallelize tasks and reduce runtime.

Why this answer

Increasing the number of Glue worker nodes (DPUs) directly scales the distributed processing capacity of the ETL job, allowing it to process larger volumes of Parquet data in parallel. This is the most straightforward way to reduce execution time when data volume is growing, as AWS Glue automatically partitions the workload across the additional workers.

Exam trap

The trap here is that candidates assume increasing DPUs always increases cost without considering that the job's runtime reduction often lowers total cost, and they mistakenly choose a data format or target change that does not address the core parallelism issue.

How to eliminate wrong answers

Option A is wrong because writing to a single large file eliminates parallelism in downstream reads and can cause bottlenecks in Redshift's COPY operation, which benefits from multiple files for concurrent loading. Option B is wrong because converting Parquet to CSV increases file size and I/O overhead due to lack of columnar compression and predicate pushdown, degrading performance. Option C is wrong because replacing Redshift with Redshift Spectrum would offload query processing to S3 but does not address the ETL job's performance bottleneck; the job still writes to Redshift, and Spectrum is a query engine, not a write target.

167

Multi-Select · hard

A company is ingesting streaming data from social media feeds using Amazon Kinesis Data Streams. The data is consumed by multiple applications: one for real-time sentiment analysis and another for archival to S3. The data must be processed in order for each social media post. Which TWO approaches meet the requirements? (Choose TWO.)

Select 2 answers

A. Use Amazon Kinesis Data Firehose to buffer and deliver to S3

B. Use Amazon SQS FIFO queues between the stream and consumers

C. Use a single shard in the Kinesis Data Streams and have all consumers read from that shard

D. Use a partition key that ensures related records go to the same shard

E. Use multiple shards and assign each consumer to a specific shard

Answers: C, D

Single shard guarantees ordering.

Why this answer

Option C is correct because using a single shard ensures ordering for all records. Option D is correct because using a partition key that groups related records ensures they go to the same shard, preserving order. Option E (multiple shards) does not guarantee global ordering.

Option A (Firehose) does not guarantee ordering. Option B (SQS FIFO) can guarantee order but adds another service.
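
Why a partition key keeps related records on one shard can be illustrated with a small sketch. Kinesis Data Streams maps the MD5 hash of the partition key into a 128-bit hash key space that is divided among shards; the code below assumes, for illustration only, that the space is split evenly across shards, and `shard_for_key` is a hypothetical helper rather than an AWS API.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Integer arithmetic keeps the result in the range 0..shard_count-1.
    return hash_value * shard_count // HASH_SPACE


# Every event for the same post ID lands on the same shard,
# so those events are read back in the order they were written.
first = shard_for_key("post-123", 4)
second = shard_for_key("post-123", 4)
```

Using the post ID as the partition key therefore preserves per-post ordering while still spreading different posts across shards.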

168

MCQ · easy

A company needs to ingest data from multiple SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for centralized analytics. The data volume is several GB per day. Which AWS service is most suitable for this ingestion?

A. Amazon Kinesis Data Firehose

B. AWS Glue

C. Amazon Athena

D. AWS Data Pipeline

Answer: B

Glue can connect to SaaS sources via JDBC and perform ETL to S3.

Why this answer

Option B (AWS Glue) is correct because Glue can connect to various data sources using crawlers and ETL jobs, and it supports JDBC connections to SaaS databases. Option C (Amazon Athena) is for querying, not ingestion. Option A (Amazon Kinesis Data Firehose) is for streaming, not batch from SaaS.

Option D (AWS Data Pipeline) is older and less flexible than Glue.

169

Multi-Select · hard

A data pipeline uses AWS Glue to process large CSV files. The team notices that some jobs fail with out-of-memory errors. Which TWO configuration changes can help mitigate this issue?

Select 2 answers

A. Reduce the number of DPUs to limit concurrency.

B. Increase the number of DPUs for the Glue job.

C. Enable Glue job autoscaling.

D. Convert input files from CSV to Parquet.

E. Enable job bookmarks.

Answers: B, C

More DPUs provide more memory.

Why this answer

Options B and C are correct: increasing DPUs and enabling autoscaling provide more memory. Option A (reducing DPUs) would worsen the problem. Option D (conversion to Parquet) may reduce memory but is not a direct configuration change for the Glue job.

Option E (job bookmarks) does not affect memory.

170

MCQ · easy

A company needs to ingest data from an on-premises Oracle database into Amazon Redshift for analytics. The data volume is 500 GB and the network bandwidth is limited. Which AWS service should be used for the initial one-time data migration?

A. AWS Snowball

B. AWS Direct Connect

C. Amazon S3 Transfer Acceleration

D. AWS Database Migration Service (DMS)

Answer: A

Snowball allows physical transfer of data, bypassing network limitations.

Why this answer

AWS Snowball is ideal for large data transfers over limited network bandwidth. It allows physical shipping of storage devices. AWS DMS can be used for ongoing replication, but for initial large volume with limited bandwidth, Snowball is more efficient.

S3 Transfer Acceleration speeds up transfers, but still relies on network. Direct Connect improves network but may still be insufficient for 500 GB.

171

MCQ · medium

A data engineer is designing a data ingestion pipeline to load millions of small JSON files from an on-premises FTP server into Amazon S3. The pipeline should minimize cost and operational overhead. Which approach is most suitable?

A. Use S3 Transfer Acceleration to upload files directly from the FTP server

B. Deploy AWS DataSync to transfer files from the FTP server to S3

C. Use AWS Snowball Edge to ship the data to AWS

D. Set up an AWS Direct Connect connection and use AWS CLI to copy files

Answer: B

AWS DataSync is designed for efficient data transfer from on-premises to AWS, handling small files well with minimal operational overhead.

Why this answer

Option B is correct because AWS DataSync can transfer data from on-premises to S3 efficiently, handling many small files. Option A is wrong because S3 Transfer Acceleration speeds up transfers over long distances but requires S3 API access. Option D is wrong because Direct Connect provides dedicated network but still needs a transfer mechanism.

Option C is wrong because Snowball Edge is for large data volumes and incurs shipping delays.

172

MCQ · hard

A data engineer is troubleshooting a daily batch ingestion pipeline that uses AWS Glue to read CSV files from Amazon S3 and write Parquet files to another S3 bucket. The job runs successfully but takes significantly longer than expected. The engineer notices that the input data is highly skewed with many small files. Which is the most effective optimization to reduce job duration?

A. Change the output format to JSON

B. Enable the 'groupFiles' option in the S3 source configuration

C. Increase the number of DPUs allocated to the job

D. Enable the 'use_glue_schema_registry' option

Answer: B

Grouping small files into larger splits reduces task overhead and improves performance.

Why this answer

Option B is correct because grouping small files into larger splits reduces the overhead of task creation and improves Spark performance. Option C is wrong because increasing the number of DPUs can increase parallelism but may not help if the issue is file count overhead; it could even be wasteful. Option A is wrong because using a different file format is not the issue; the output is Parquet, which is already efficient.

Option D is wrong because the 'use_glue_schema_registry' option does not address the small files problem.
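
The file-grouping setting can be sketched as a connection-options dictionary for an S3 source. This is a minimal sketch: `s3_source_options` is a hypothetical helper, the bucket path in the usage note is a placeholder, and the 128 MB group size is an assumed starting point, not a recommendation. `groupFiles` and `groupSize` are the Glue S3 connection options for reading files in larger groups, and `groupSize` is given in bytes as a string.

```python
def s3_source_options(paths, group_size_bytes=128 * 1024 * 1024):
    """Build S3 connection options that ask Glue to group small input files."""
    return {
        "paths": list(paths),
        "recurse": True,
        # Group files within each S3 partition into larger read tasks.
        "groupFiles": "inPartition",
        # Target size of each group in bytes, passed as a string.
        "groupSize": str(group_size_bytes),
    }


# Placeholder path; replace with the real input location.
options = s3_source_options(["s3://example-raw-bucket/input/"])
```

Inside a Glue job, a dictionary like this would be passed as `connection_options` to `glueContext.create_dynamic_frame.from_options` with `connection_type="s3"` and `format="csv"`.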

173

MCQ · medium

A retail company uses Amazon Kinesis Data Firehose to ingest clickstream data from its website into an Amazon S3 bucket. The data includes fields: user_id, event_type, timestamp, page_url. Recently, the data engineering team noticed that some records have malformed JSON (missing commas, extra brackets) causing delivery failures to S3. The Firehose delivery stream is configured to retry failed records for 300 seconds, after which the records are sent to an S3 bucket for failed records. The team wants to transform the data to correct malformed JSON before delivery to the main S3 bucket. They need a solution that does not require managing servers and can handle high throughput. What should the team do?

A. Configure an AWS Lambda function as a data transformation in Kinesis Data Firehose to correct malformed JSON.

B. Set up an Amazon EMR cluster with Apache Spark to process the data in micro-batches and fix JSON errors.

C. Use an AWS Glue streaming ETL job to read from Firehose and write corrected data to S3.

D. Use Amazon Kinesis Data Analytics with a SQL application to parse and fix JSON.

Answer: A

Firehose supports Lambda transformations for record-level processing; it scales automatically.

Why this answer

Option A is correct: Lambda transformation in Firehose can process each record and fix JSON errors. Option D (Kinesis Data Analytics) is for real-time analytics, not record-level transformation. Option C (Glue streaming) adds complexity and latency.

Option B (EMR) requires cluster management.

174

Drag & Drop · medium

Arrange the steps to create an AWS Glue job that transforms data from Amazon S3 to Amazon Redshift in the correct order.

Why this order

First, catalog the source data with a crawler. Then, prepare the ETL script. Configure the job with connections, run it, and finally verify the results in Redshift.

175

Multi-Select · medium

A company uses AWS Glue to transform data in S3. The Glue job fails with memory errors. Which THREE actions can help resolve this?

Select 3 answers

A. Optimize the transformation to use pushdown predicates.

B. Use a larger worker type (e.g., G.2X).

C. Increase the number of DPUs.

D. Increase the job timeout.

E. Decrease the number of DPUs.

Answers: A, B, C

Pushdown predicates reduce data loaded into memory.

Why this answer

Options A, B, and C are correct. Increasing DPUs (C) adds more worker nodes. Using a larger worker type (B) increases memory per worker.

Optimizing transformations like using pushdown predicates (A) reduces data scanned. Option E is wrong because reducing DPUs would make the problem worse. Option D is wrong because increasing timeout does not address memory issues.

176

MCQ · medium

A company wants to migrate on-premises data to Amazon S3 using AWS DataSync. The data is 10 TB and the network bandwidth is 1 Gbps. The migration must be completed within 48 hours. What should the data engineer do to meet the deadline?

A. Use S3 Transfer Acceleration to speed up the transfer

B. Use AWS Snowball Edge to transfer the data physically

C. Use AWS DataSync with multiple agents and enable data compression

D. Request a bandwidth increase from the ISP

Answer: C

Multiple agents and compression maximize throughput to meet the deadline.

Why this answer

Option C (Use AWS DataSync with multiple agents and enable data compression) is correct because it maximizes throughput. Option B (Use Snowball Edge) is not necessary for 10 TB with 1 Gbps; the transfer would take about 22 hours theoretically, and compression and multiple agents can help. Option D (Increase bandwidth) is not always feasible.

Option A (Use S3 Transfer Acceleration) improves upload speed but not as much as multiple agents and compression.
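
The "about 22 hours" figure can be checked with a short calculation. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); `transfer_hours` and its `efficiency` parameter are hypothetical names, and the 80% efficiency value is an illustrative assumption about protocol overhead, not a measured number.

```python
def transfer_hours(data_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move data_bytes over a link at the given usable fraction."""
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


TEN_TB = 10 * 10**12   # bytes, decimal terabytes
ONE_GBPS = 10**9       # bits per second

ideal_hours = transfer_hours(TEN_TB, ONE_GBPS)            # full line rate
realistic_hours = transfer_hours(TEN_TB, ONE_GBPS, 0.8)   # assumed 80% usable
```

At full line rate the transfer needs a little over 22 hours, and even at the assumed 80% efficiency it stays under 28 hours, which is within the 48-hour deadline.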

177

MCQ · hard

Refer to the exhibit. A data engineer runs an AWS Glue job that fails with an 'Access Denied' error when writing to S3. The IAM role attached to the job has s3:PutObject permission on the output bucket. What additional configuration is most likely missing?

A. The Glue job is not configured to write to S3 with the correct prefix

B. The S3 bucket policy does not grant access to the Glue job's IAM role

C. The Glue job is running in a VPC without an S3 VPC endpoint

D. The S3 bucket is encrypted with AWS KMS and the IAM role lacks kms:Decrypt permission

Answer: B

Even if IAM allows, bucket policy can deny; this is a common misconfiguration.

Why this answer

Option B is correct because the error shows 'Access Denied' when writing, and the job uses Glue version 3.0 (which includes Spark 3.1), which may require the S3 bucket to have a bucket policy that allows the Glue service principal or a VPC endpoint. Often the issue is that the S3 bucket policy does not allow the Glue job's IAM role. Option D (KMS) is possible but not mentioned.

Option C (VPC) might be needed if the job is in a VPC. Option A (output prefix) is less likely because the role already has s3:PutObject on the output bucket.

178

MCQ · easy

A company streams clickstream data from websites to Amazon Kinesis Data Streams. A Lambda function processes each record and writes it to Amazon S3. Recently, the function has been timing out under high load. Which solution should a data engineer implement to handle the increased throughput?

A. Increase the Lambda function's timeout value.

B. Increase the number of shards in the Kinesis data stream.

C. Increase the memory allocated to the Lambda function.

D. Configure Amazon S3 Event Notifications to trigger Lambda directly.

Answer: B

More shards increase parallelism and allow Lambda to process more records concurrently.

Why this answer

Option B is correct because increasing the number of shards in Kinesis Data Streams increases parallelism, allowing more Lambda invocations to process records concurrently. Option A is wrong because a longer timeout only lets each invocation run longer; it does not increase throughput. Option C is wrong because Lambda already processes records; increasing memory may help but does not address the root cause of limited shard count.

Option D is wrong because S3 does not support streaming directly.

179

MCQ · easy

A company is streaming clickstream data from a website into Amazon Kinesis Data Streams. The data must be transformed in near real-time and stored in Amazon S3 for analytics. Which AWS service should be used to transform the data as it is ingested?

A. AWS Lambda (streaming function)

B. Amazon EMR (Spark Streaming)

C. AWS Glue (ETL jobs)

D. Amazon Kinesis Data Analytics

Answer: D

Amazon Kinesis Data Analytics can process and transform streaming data in real-time using SQL or Apache Flink.

Why this answer

Option D is correct because Amazon Kinesis Data Analytics can process and transform streaming data in real-time using SQL or Apache Flink. Option C (AWS Glue) is a batch ETL service, not for real-time streams. Option B (Amazon EMR) is for big data processing but requires more setup.

Option A (AWS Lambda) can process Kinesis records but is less efficient for complex transformations on high-throughput streams.

180

MCQ · medium

A retail company uses AWS Glue to process daily sales data from multiple CSV files stored in Amazon S3. The Glue job runs a PySpark script that reads the files, performs joins, and writes the output as Parquet. Recently, the job has been failing with 'Out of Memory' errors. The data volume has grown from 10 GB to 50 GB per day. The Glue job uses 10 DPUs and the standard worker type. The data engineer needs to fix the job without rewriting the script. What should the data engineer do?

A. Split the input CSV files into smaller partitions.

B. Change the worker type to G.2X to get more memory per worker.

C. Decrease the number of DPUs to reduce memory contention.

D. Increase the number of DPUs for the Glue job to 20.

Answer: D

More DPUs provide more memory and parallel processing, solving OOM.

Why this answer

Option D is correct. Increasing the number of DPUs provides more memory and compute resources, addressing the OOM error. Option B is wrong because changing worker type to G.2X may not be sufficient if the issue is simply memory; but increasing DPUs is a direct solution.

Option A is wrong because splitting files does not reduce the memory needed for joins. Option C is wrong because decreasing DPUs would make the problem worse.

181

MCQhard

A company needs to ingest real-time clickstream data from a web application into Amazon Redshift with minimal latency. The data volume is high and requires processing before loading. Which architecture is MOST appropriate?

A.AWS Glue ETL jobs scheduled every 5 minutes -> Redshift

B.S3 -> Lambda -> Redshift

C.DynamoDB Streams -> Lambda -> Redshift

D.Kinesis Data Streams -> Kinesis Data Firehose -> Redshift

AnswerD

Provides real-time ingestion with transformation capability.

Why this answer

Option D is correct because Kinesis Data Firehose can deliver streaming data to Redshift with built-in transformation via Lambda. Option B is wrong because S3 does not provide real-time ingestion. Option A is wrong because Glue is batch-oriented.

Option C is wrong because DynamoDB Streams is for DynamoDB changes, not clickstream.

182

Matchingmedium

Match each AWS database service to its primary use case.

Concepts

Matches

Relational database with managed operations

NoSQL key-value and document database

In-memory caching for low latency

Graph database for connected data

Time-series data for IoT and analytics

Why these pairings

AWS offers purpose-built databases.

183

MCQhard

A company is using AWS Database Migration Service (DMS) to migrate a 2 TB MySQL database to Amazon Aurora MySQL. The migration is taking longer than expected. The source database is in a different AWS region. Which change would MOST likely improve the migration speed?

A.Use a smaller DMS replication instance to reduce costs.

B.Use a Multi-AZ deployment for the DMS replication instance in the target region.

C.Increase the number of parallel tables being migrated.

D.Disable binary logging on the source MySQL database.

AnswerB

Multi-AZ can improve availability and performance by using a standby replica, which can help with cross-region data transfer.

Why this answer

Option B is correct because using a Multi-AZ DMS replication instance in the target region can improve performance by providing high availability and potentially better network throughput. Option C is incorrect because increasing the number of parallel tables can improve performance, but the question asks for the MOST likely improvement. Option D is incorrect because disabling binary logging on the source database could affect application performance and is not recommended.

Option A is incorrect because using a smaller instance type would reduce performance.

184

MCQmedium

A company is using AWS Glue to process data from Amazon S3. The Glue job reads CSV files and writes Parquet files to a different S3 bucket. The job occasionally fails with 'java.lang.OutOfMemoryError: Java heap space'. The data size varies. Which change should the engineer make to avoid this error?

A.Increase the number of DPUs allocated to the Glue job.

B.Convert the CSV files to JSON format before processing.

C.Decrease the Spark shuffle partitions in the job script.

D.Increase the job timeout setting.

AnswerA

More DPUs provide more memory and compute resources.

Why this answer

The 'java.lang.OutOfMemoryError: Java heap space' error in AWS Glue indicates that the Spark executors ran out of memory while processing the data. Increasing the number of DPUs (Data Processing Units) allocated to the Glue job increases the total memory available across the cluster, allowing larger datasets to be processed without hitting the heap limit. Each DPU provides 4 vCPUs and 16 GB of memory, so adding more DPUs scales memory linearly.
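
The linear scaling described above is easy to check with arithmetic. The sketch below assumes the 16 GB per DPU figure quoted in the explanation; the function name is hypothetical, and the memory actually usable by Spark executors is lower than this raw total because of JVM and framework overhead.

```python
GB_PER_DPU = 16  # figure quoted in the explanation for a standard DPU


def total_memory_gb(dpus: int) -> int:
    """Raw memory across the whole job, before Spark/JVM overhead."""
    if dpus < 1:
        raise ValueError("a job needs at least one DPU")
    return dpus * GB_PER_DPU


for dpus in (10, 20):
    print(f"{dpus} DPUs -> {total_memory_gb(dpus)} GB")
```

Doubling the DPU count doubles the raw memory pool, which is why the answer treats it as a direct response to heap-space errors.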

Exam trap

The trap here is that candidates often confuse 'increasing DPUs' with 'increasing parallelism' and assume it only speeds up jobs, but in reality it also increases total memory, which directly mitigates heap space errors.

How to eliminate wrong answers

Option B is wrong because converting CSV to JSON does not reduce memory pressure; JSON is typically more verbose than CSV and would increase memory consumption. Option C is wrong because decreasing Spark shuffle partitions reduces parallelism and can cause each partition to hold more data, worsening memory issues and potentially increasing the risk of OutOfMemoryError. Option D is wrong because increasing the job timeout setting only extends the maximum runtime before the job is killed; it does not address memory constraints or prevent heap space errors.

185

MCQeasy

A company needs to ingest data from a relational database into Amazon S3 for analytics. The database is an Amazon RDS MySQL instance. Which AWS service should be used for a one-time historical data load?

A.AWS Database Migration Service (DMS)

B.AWS Glue ETL

C.Amazon Athena

D.Amazon Kinesis Data Firehose

AnswerA

DMS supports full load from RDS to S3.

Why this answer

Option A is correct because AWS Database Migration Service (DMS) can perform a one-time full load from RDS MySQL to S3. Option B is incorrect because Glue ETL can do it but DMS is specialized for databases. Option D is incorrect because Kinesis Data Firehose is for streaming, not database extraction.

Option C is incorrect because Athena queries data; it does not ingest it.

186

MCQmedium

A data engineer is using AWS Glue ETL to transform data from an S3 data lake. The job fails with a memory error. Which approach should be used to resolve this issue without major code changes?

A.Rewrite the ETL script in PySpark instead of Scala

B.Change the input file format from CSV to Parquet

C.Increase the number of DPUs allocated to the Glue job

D.Use Amazon EMR instead of AWS Glue

AnswerC

More DPUs provide more memory and parallelism.

Why this answer

Option C is correct because increasing the number of DPUs (Data Processing Units) allocated to the Glue job provides more memory and parallelism. Option B is wrong because changing the input file format may not address memory issues. Option A is wrong because rewriting the script in PySpark is a major code change.

Option D is wrong because using a different service is unnecessary.

187

MCQeasy

A company uses Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data delivery is delayed by up to 5 minutes. The engineer wants to reduce the delay to under 1 minute. Which parameter should be adjusted?

A.Enable error logging to CloudWatch.

B.Increase the buffer size in Kinesis Data Firehose.

C.Enable data compression.

D.Decrease the buffer interval in Kinesis Data Firehose.

AnswerD

Lower buffer interval triggers deliveries more frequently.

Why this answer

Option D is correct because lowering the buffer interval reduces the time Firehose waits before delivering a batch, thus reducing latency. Option B is wrong because increasing the buffer size would increase the delay. Option C is wrong because compression has no effect on delivery timing.

Option A is wrong because error logging does not affect delivery delay.
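
Firehose flushes its buffer when either the size threshold or the interval threshold is reached, whichever comes first. The sketch below models that rule in plain Python; the function and parameter names are hypothetical and this is not the Firehose API.

```python
def should_flush(buffered_mb: float, elapsed_s: float,
                 size_limit_mb: float, interval_limit_s: float) -> bool:
    """Flush when either buffering threshold is reached, whichever is first."""
    return buffered_mb >= size_limit_mb or elapsed_s >= interval_limit_s


# Low-volume stream: only 0.5 MB has accumulated after 60 seconds.
print(should_flush(0.5, 60, size_limit_mb=5, interval_limit_s=300))  # False
print(should_flush(0.5, 60, size_limit_mb=5, interval_limit_s=60))   # True
```

With a 300-second interval a low-volume stream waits the full five minutes; lowering the interval to 60 seconds bounds the buffering delay at about one minute.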

188

MCQeasy

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The delivery occasionally fails with 'Firehose is throttled'. What should be done to reduce throttling?

A.Enable compression on the Firehose delivery stream

B.Increase the buffer size and buffer interval

C.Decrease the buffer size to flush more frequently

D.Increase the number of shards in the Kinesis stream

AnswerB

Larger buffer reduces the number of write requests.

Why this answer

Throttling occurs when the write rate exceeds the buffer limit. Increasing the buffer size or interval allows Firehose to accumulate more records before writing, reducing the number of write requests to S3.

189

MCQmedium

A data engineering team is designing a data ingestion pipeline that will receive millions of small JSON files per hour from external partners via API. The files should be stored in Amazon S3 and then transformed into Parquet for querying. Which approach is MOST cost-effective and scalable?

A.Use Amazon Kinesis Data Firehose to buffer and deliver data to S3, then use AWS Glue to convert to Parquet.

B.Use AWS Lambda to process each file as it arrives and write to S3.

C.Use AWS Direct Connect to establish a dedicated network for file uploads.

D.Use Amazon EMR to process the files as they arrive in S3.

AnswerA

Firehose can ingest high throughput, buffer, and deliver to S3; Glue can run scheduled conversions.

Why this answer

Option A is correct because Kinesis Data Firehose can ingest streaming data, buffer it, and deliver it in batches to S3, and then a Glue job can convert it to Parquet. Option B is wrong because Lambda has a payload limit and is not cost-effective for millions of files. Option C is wrong because Direct Connect provides a dedicated network connection, not API ingestion.

Option D is wrong because EMR is overkill and more expensive.

190

MCQeasy

A data engineer is setting up a data pipeline to ingest data from an Amazon RDS for MySQL database into Amazon S3 using AWS Glue ETL. The Glue job uses a JDBC connection to read from the MySQL database. The job runs successfully, but the engineer notices that the job is taking longer than expected. The MySQL database is 500 GB in size and the Glue job uses 10 workers of type G.1X. The engineer wants to improve the performance of the extraction phase. The database is actively used by other applications, so the engineer must minimize the impact on the source database. Which approach should the engineer take?

A.Partition the table by a numeric column, such as the primary key, and use the 'hashex' or 'hashpar' partitioning option in the Glue JDBC connection.

B.Use an incremental extraction strategy with a watermark column to reduce the amount of data read each time.

C.Create a read replica of the MySQL database and configure the Glue job to read from the replica.

D.Increase the number of Glue workers to 20 to increase parallelism.

AnswerA

Partitioning enables parallel reads, with each connection reading a smaller chunk of the table.

Why this answer

Option A is correct because partitioning the table on a key column (e.g., primary key) allows Glue to read in parallel from multiple partitions, reducing the load on the database and improving performance. Option D is wrong because increasing workers without partitioning may cause the database to be overwhelmed with connections. Option C is wrong because using a read replica offloads the read traffic but still requires efficient partitioning; also, setting up a read replica adds cost and complexity.

Option B is wrong because incremental loads are for ongoing changes, not the initial full extraction.
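
Parallel JDBC extraction amounts to splitting the key range into chunks that workers read independently. The sketch below shows the idea in plain Python; `split_key_range` is a hypothetical helper rather than a Glue API, it assumes a dense integer primary key, and the `sales` table and `id` column are made up for illustration.

```python
def split_key_range(min_id: int, max_id: int, partitions: int) -> list[tuple[int, int]]:
    """Split [min_id, max_id] into contiguous, non-overlapping inclusive ranges."""
    if partitions < 1 or max_id < min_id:
        raise ValueError("need partitions >= 1 and max_id >= min_id")
    total = max_id - min_id + 1
    partitions = min(partitions, total)   # never create empty ranges
    base, extra = divmod(total, partitions)
    ranges, start = [], min_id
    for i in range(partitions):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges


for low, high in split_key_range(1, 1_000_000, 4):
    print(f"SELECT * FROM sales WHERE id BETWEEN {low} AND {high}")
```

Each generated query touches a bounded slice of the table, so the number of partitions also caps the number of concurrent connections opened against the source.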

191

MCQmedium

A company uses AWS Glue ETL jobs to transform data from Amazon RDS to Amazon S3 daily. The job recently started failing with memory errors. The data volume has grown 3x in the past month. Which change should the data engineer make to resolve the issue?

A.Increase the size of the Amazon RDS instance

B.Switch the Glue job type from Python Shell to Spark

C.Partition the output data in Amazon S3 by date

D.Increase the number of DPUs allocated to the Glue job

AnswerD

More DPUs provide more memory to handle larger data volumes.

Why this answer

Option D is correct because increasing the number of DPUs (Data Processing Units) in the AWS Glue job provides more memory and compute resources, which can handle larger data volumes. Option A (increasing RDS instance size) may not help if the bottleneck is the Glue job. Option B (switching to Spark) is not directly relevant; Glue already uses Spark.

Option C (partitioning S3 output) improves query performance but does not fix memory errors during transformation.

192

Multi-Selecthard

A company is streaming IoT sensor data from thousands of devices into Amazon Kinesis Data Firehose. The data is then delivered to Amazon S3 for long-term storage. Occasionally, some records fail to be delivered to S3. The company must capture and analyze these failed records. Which TWO actions should be taken? (Choose two.)

Select 2 answers

A.Configure an AWS Lambda function as a pre-processing step to catch and log failed records.

B.Use Amazon Kinesis Data Analytics to analyze the failed records in real time.

C.Send failed records to an Amazon Kinesis Data Stream for reprocessing.

D.Enable Amazon CloudWatch Logs for Kinesis Data Firehose to capture delivery errors.

E.Set up an S3 event notification to trigger a Lambda function to reprocess failed records.

AnswersA, D

Lambda can handle errors during transformation and log them.

Why this answer

D is correct because enabling Kinesis Data Firehose error logging to CloudWatch allows monitoring of delivery failures. A is correct because configuring a Lambda function as a data transformation and error handler can process and log failed records. E is wrong because S3 is the destination, not the source for reprocessing.

C is wrong because Kinesis Data Streams is a different service and does not directly capture Firehose failures. B is wrong because Kinesis Data Analytics is for real-time analytics, not for handling delivery failures.
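
A Firehose transformation Lambda returns every record with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and anything the function prints ends up in CloudWatch Logs. The handler below is a minimal sketch that assumes each record is a JSON object; the `processed` field is a hypothetical enrichment.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record; mark unparseable ones as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical enrichment
            data = base64.b64encode((json.dumps(payload) + "\n").encode()).decode()
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, TypeError) as exc:
            # Log the failure and hand the original data back untouched.
            print(f"record {record['recordId']} failed: {exc}")
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}
```

Returning the original payload for failed records keeps the raw data available for later inspection instead of silently dropping it.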

193

Multi-Selecthard

A data engineer needs to transform data in Amazon S3 using AWS Glue. The job must handle schema evolution and partition pruning. Which THREE features should be used?

Select 3 answers

A.AWS Glue Data Catalog

B.AWS Glue job bookmarks

C.AWS Glue FindMatches transform

D.AWS Glue crawlers

E.Partition indexes

AnswersA, D, E

The Data Catalog stores schema and partition metadata.

Why this answer

Options A, D, and E are correct. Glue crawlers update the schema, partition indexes speed up partition pruning, and the Data Catalog stores metadata. Option B is wrong because job bookmarks are for state tracking, not schema evolution.

Option C is wrong because FindMatches is for deduplication.

194

MCQhard

A data engineer is designing a data ingestion pipeline for a social media analytics platform. The pipeline must ingest tweets in real-time, perform sentiment analysis, and store results in Amazon S3. The sentiment analysis is compute-intensive and must be done as the data arrives. The estimated throughput is 10,000 tweets per second. Which architecture is most suitable?

A.Amazon SQS with AWS Lambda pollers to process tweets and store in S3.

B.Amazon EMR with Spark Streaming to process tweets and write to S3.

C.Amazon Kinesis Data Streams with Amazon Kinesis Data Analytics for sentiment analysis, then Kinesis Data Firehose to S3.

D.Amazon API Gateway with AWS Lambda to process each tweet and store in S3.

AnswerC

Scalable real-time stream processing.

Why this answer

Option C is the most suitable because Amazon Kinesis Data Streams can ingest up to 1,000 records per second (or 1 MB/s) per shard and scales by adding shards, so at least 10 shards cover 10,000 tweets per second, and Kinesis Data Analytics provides built-in, low-latency stream processing for compute-intensive sentiment analysis using SQL or Apache Flink. Kinesis Data Firehose then reliably buffers and writes the processed results to Amazon S3 without custom code, ensuring near-real-time delivery.

Exam trap

The trap here is that candidates often choose SQS+Lambda (Option A) for simplicity, underestimating the throughput ceiling and polling overhead, while overlooking Kinesis Data Analytics as the only AWS-managed service that natively supports real-time, compute-intensive stream processing without custom infrastructure.

How to eliminate wrong answers

Option A is wrong because Amazon SQS with Lambda pollers introduces polling latency and cannot efficiently handle 10,000 tweets per second; Lambda has a maximum concurrency limit and SQS batch sizes are capped at 10 messages, leading to throttling and backpressure. Option B is wrong because Amazon EMR with Spark Streaming is designed for large-scale batch and micro-batch processing, not for true real-time, per-record sentiment analysis at 10,000 TPS; it incurs startup overhead and is better suited for historical analysis. Option D is wrong because Amazon API Gateway with Lambda processes each tweet synchronously, which cannot sustain 10,000 requests per second without aggressive throttling and cold starts; it also lacks built-in stream buffering and ordering for real-time ingestion.
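
Sizing the stream for the stated throughput is simple arithmetic. The sketch assumes the provisioned-mode write limits of 1,000 records per second and 1 MB per second per shard; the average tweet sizes used below are made-up figures for illustration.

```python
import math

RECORDS_PER_SHARD = 1_000        # write limit, records per second per shard
BYTES_PER_SHARD = 1024 * 1024    # write limit, 1 MB per second per shard


def shards_needed(records_per_s: int, avg_record_bytes: int) -> int:
    """Shards required to satisfy both the record-count and the byte limit."""
    by_count = math.ceil(records_per_s / RECORDS_PER_SHARD)
    by_bytes = math.ceil(records_per_s * avg_record_bytes / BYTES_PER_SHARD)
    return max(by_count, by_bytes)


print(shards_needed(10_000, 1024))   # record-count limit dominates
print(shards_needed(10_000, 2048))   # byte limit dominates
```

Whichever limit is hit first determines the shard count, so larger payloads need more shards even at the same record rate.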

195

MCQhard

A data engineer runs an AWS Glue crawler that is configured to crawl an S3 bucket named 'my-data-lake' and update the Glue Data Catalog. The crawler fails with an access denied error. The IAM role attached to the crawler has the policy shown in the exhibit. What is the likely cause of the failure?

A.The policy does not allow glue:CreateTable on the 'my-data-lake' database.

B.The policy does not allow s3:PutObject on the 'my-data-lake' bucket.

C.The policy does not allow logging to CloudWatch Logs.

D.The policy does not allow s3:ListBucket on the 'my-data-lake' bucket.

AnswerC

Glue crawlers require permissions to create log groups and streams and write logs; the policy lacks logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents.

Why this answer

Option C is correct because the crawler needs permission to write logs to CloudWatch Logs; the policy does not include logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents. Option D is wrong because the policy includes s3:ListBucket on the bucket and s3:GetObject on the bucket contents. Option A is wrong because the policy includes glue:CreateTable and glue:UpdateTable.

Option B is wrong because the crawler does not need to write to S3.
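
The three missing permissions named in the answer can be expressed as a single policy statement. The sketch below builds it as a Python dictionary; the resource ARN pattern is an assumption (it targets log groups under `/aws-glue/`), so it should be adjusted to the account's own conventions.

```python
import json

logs_statement = {
    "Effect": "Allow",
    "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
    ],
    # Assumed ARN pattern for Glue log groups; narrow or widen as needed.
    "Resource": "arn:aws:logs:*:*:log-group:/aws-glue/*",
}

policy = {"Version": "2012-10-17", "Statement": [logs_statement]}
print(json.dumps(policy, indent=2))
```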

196

MCQmedium

A logistics company ingests real-time GPS location data from thousands of delivery vehicles into Amazon Kinesis Data Streams. Each vehicle sends a JSON payload every 10 seconds containing vehicle_id, latitude, longitude, timestamp, and speed. The data must be stored in Amazon S3 for historical analysis, but the company wants to first aggregate the data per vehicle per minute (average speed, min/max coordinates) to reduce storage costs. The solution must be serverless and handle potential duplicate records without double-counting. What should the engineer do?

A.Use Amazon EMR with Spark Streaming to perform the aggregation and write to S3.

B.Use Amazon Kinesis Data Analytics for Apache Flink to aggregate data in a 1-minute tumbling window with deduplication logic, then output to Kinesis Data Firehose for delivery to S3.

C.Use Kinesis Data Firehose with a Lambda transformation to aggregate records in a 1-minute window.

D.Use an AWS Glue streaming ETL job with Spark Structured Streaming to aggregate and deduplicate.

AnswerB

Flink supports windowed aggregations and stateful deduplication; Firehose delivers to S3.

Why this answer

Option B is correct: Kinesis Data Analytics for Apache Flink can aggregate data in real time using tumbling windows and deduplication. Option C (Firehose with Lambda) would require per-record processing and state management. Option D (Glue streaming) adds latency.

Option A (EMR) is not serverless.
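
The logic that the Flink application would implement can be sketched in plain Python: drop duplicates by a record key, then aggregate per vehicle per one-minute tumbling window. This is not Flink code; it assumes epoch-second timestamps and treats `(vehicle_id, timestamp)` as the deduplication key, which the question does not specify.

```python
from collections import defaultdict


def aggregate_per_minute(records):
    """Deduplicate, then average speed per (vehicle_id, window_start) window."""
    seen = set()
    speeds = defaultdict(list)
    for r in records:
        key = (r["vehicle_id"], r["timestamp"])
        if key in seen:          # duplicate delivery: skip to avoid double-counting
            continue
        seen.add(key)
        window_start = r["timestamp"] - r["timestamp"] % 60  # 60-second tumbling window
        speeds[(r["vehicle_id"], window_start)].append(r["speed"])
    return {k: sum(v) / len(v) for k, v in speeds.items()}


records = [
    {"vehicle_id": "v1", "timestamp": 0, "speed": 40.0},
    {"vehicle_id": "v1", "timestamp": 10, "speed": 60.0},
    {"vehicle_id": "v1", "timestamp": 10, "speed": 60.0},  # duplicate
    {"vehicle_id": "v1", "timestamp": 70, "speed": 30.0},
]
print(aggregate_per_minute(records))
```

A real streaming job would keep the `seen` state bounded (for example by expiring keys once their window closes) rather than holding every key in memory.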

197

MCQmedium

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The data is JSON formatted and includes a timestamp field. The company wants to partition the output in Amazon S3 by date and hour, and ensure exactly-once processing semantics. Which combination of configurations should be used?

A.Disable checkpointing and use the 'exactly_once' delivery option in Kinesis Data Streams.

B.Enable checkpointing in the AWS Glue streaming job and specify an S3 location for checkpoint data.

C.Use Amazon DynamoDB as a checkpoint store by configuring the Glue job with a DynamoDB connection.

D.Use Kinesis Client Library (KCL) checkpointing with a DynamoDB table.

AnswerB

Glue streaming jobs support checkpointing to S3 for exactly-once processing.

Why this answer

Option B is correct because AWS Glue streaming jobs require checkpointing to track the progress of data consumption from Kinesis Data Streams and to ensure exactly-once processing semantics. By enabling checkpointing and specifying an S3 location, Glue periodically saves the state of processed records, allowing it to resume from the last committed offset in case of failures, thus preventing duplicates or data loss.

Exam trap

The trap here is that candidates confuse the checkpointing mechanism of AWS Glue (which uses S3) with the Kinesis Client Library (KCL) pattern (which uses DynamoDB), leading them to select option D or C, even though Glue streaming jobs do not support DynamoDB for checkpointing.

How to eliminate wrong answers

Option A is wrong because disabling checkpointing removes the mechanism for tracking processed records, making exactly-once semantics impossible; in addition, Kinesis Data Streams has no 'exactly_once' delivery option, since it delivers records at least once. Option C is wrong because AWS Glue streaming jobs do not support DynamoDB as a checkpoint store; they only support S3 for checkpoint data. Option D is wrong because Kinesis Client Library (KCL) checkpointing with DynamoDB is a pattern for custom applications, not for AWS Glue streaming jobs, which manage checkpointing internally via S3.
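
The date-and-hour partitioning that the question asks for comes down to deriving an S3 prefix from each record's timestamp. A minimal sketch, assuming ISO 8601 timestamps and a Hive-style `year=/month=/day=/hour=` layout (the question names neither):

```python
from datetime import datetime


def partition_prefix(timestamp: str) -> str:
    """Build a Hive-style partition prefix from an ISO 8601 timestamp."""
    ts = datetime.fromisoformat(timestamp)
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


print(partition_prefix("2024-03-05T14:30:00"))
```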

198

Multi-Selecthard

A company is building a data lake on S3. They have a large volume of CSV files (hundreds of GB) in a source bucket. They need to convert them to Parquet, partition by date, and ensure the data is encrypted at rest with SSE-KMS. The pipeline must be triggered automatically when new files arrive. Which THREE steps should be part of the solution? (Choose THREE.)

Select 3 answers

A.Configure S3 Event Notification to send events to an SQS queue

B.Use Amazon Kinesis Data Firehose to ingest new files

C.Create an AWS Glue ETL job that converts to Parquet and partitions by date

D.Use Amazon Athena CTAS query to convert files in batch

E.Configure the Glue job to use a KMS key for server-side encryption in S3

AnswersA, C, E

SQS can buffer events and trigger a Lambda or Step Functions workflow.

Why this answer

Option A (S3 Event Notification to SQS) can trigger the pipeline. Option C (Glue ETL job) can convert to Parquet and partition by date. Option E (KMS key configured on the Glue job) ensures SSE-KMS.

Option D (Athena CTAS) is not triggered by events. Option B (Kinesis Data Firehose) is for streaming.

199

MCQmedium

A data engineer is designing a data lake on Amazon S3. The data ingestion pipeline must handle both batch and streaming data. The engineer wants to use a single service to ingest both types of data. Which service should the engineer choose?

A.Amazon Athena

B.Amazon Kinesis Data Firehose

C.S3 Transfer Acceleration

D.AWS Glue

AnswerB

Firehose can ingest streaming data and deliver to S3 in near real-time; batch data can be sent via Firehose API.

Why this answer

Option B is correct because Amazon Kinesis Data Firehose can ingest streaming data and can also be used for batch delivery by putting records in batches. Option D is wrong because AWS Glue is primarily a batch ETL service. Option A is wrong because Amazon Athena queries data; it does not ingest it.

Option C is wrong because S3 Transfer Acceleration speeds up uploads but does not handle streaming.

200

MCQeasy

A data engineering team needs to ingest streaming data from an application into Amazon S3 for analytics. The data volume is moderate and the team wants the lowest operational overhead. Which AWS service should they use?

A.Amazon SQS

B.AWS Glue

C.Amazon Kinesis Data Streams

D.Amazon Kinesis Data Firehose

AnswerD

Fully managed, automatically writes streaming data to S3.

Why this answer

Option D is correct because Amazon Kinesis Data Firehose is a fully managed service for loading streaming data into S3 with no code required. Option C is wrong because Amazon Kinesis Data Streams requires custom consumers. Option B is wrong because AWS Glue is for batch ETL, not real-time.

Option A is wrong because Amazon SQS is a message queue, not optimized for streaming analytics destinations.

201

MCQhard

A data engineering team is ingesting data from multiple sources into Amazon S3 using AWS Glue ETL jobs. The jobs are failing intermittently with the error: 'Task ran out of memory'. The input data size varies widely from 100 MB to 10 GB per job. Which configuration change would best mitigate this issue?

A.Increase the number of workers in the Glue job

B.Enable job bookmarking to process only incremental data

C.Reduce the batch size in the S3 source node

D.Change the job type from Spark to Python shell

AnswerB

Bookmarking reduces the data processed each run, lowering memory requirements.

Why this answer

Option B is correct because enabling job bookmarking in Glue allows the job to process only new data, reducing memory pressure. Option A is wrong because increasing the number of workers adds parallelism but may not address memory per worker. Option D is wrong because switching to Python shell would reduce capability.

Option C is wrong because decreasing batch size may help but could slow down processing.

202

MCQhard

A company is migrating its on-premises data warehouse to Amazon Redshift. The daily batch load from the source database takes 6 hours using a single-node Redshift cluster. The engineer needs to reduce load time to under 2 hours without increasing cost significantly. Which strategy should the engineer adopt?

A.Use COPY with compression (gzip) to reduce data volume.

B.Use a VPC endpoint to improve network throughput to S3.

C.Change the table distribution style to EVEN to distribute data evenly.

D.Increase the number of nodes in the Redshift cluster and use parallel COPY from multiple files.

AnswerD

More nodes enable parallel data loading.

Why this answer

Option D is correct because increasing the node count provides more parallelism for the COPY command, reducing load time. Option A is wrong because compression reduces storage, not load time. Option C is wrong because distributing data across nodes helps queries, not loads.

Option B is wrong because VPC routing changes do not affect performance.

203

MCQeasy

A marketing analytics team needs to ingest customer transaction data from an on-premises PostgreSQL database into Amazon S3 for analysis. The data volume is about 10 GB daily, and the team wants to perform full refresh daily (truncate and load) into S3 as Parquet files. The company has a Direct Connect connection to AWS. The team needs a simple, managed solution that minimizes operational overhead. What should the team use?

A.Set up AWS Database Migration Service (DMS) to continuously replicate data to S3 in Parquet format.

B.Use Amazon EMR with a Spark job that reads from PostgreSQL and writes to S3.

C.Use an AWS Glue ETL job with a JDBC connection to the PostgreSQL database, extract data, and write to S3 in Parquet format.

D.Use AWS Data Pipeline with a SQLActivity to extract data and copy to S3.

AnswerC

Glue is serverless and can handle daily full refresh with minimal setup.

Why this answer

Option C is correct: AWS Glue can connect to on-premises PostgreSQL via JDBC using a connection, perform a full extract, and write to S3 as Parquet. Option A (DMS) is more suited for ongoing replication, not full refresh. Option D (Data Pipeline) requires more configuration.

Option B (EMR) is overkill.

204

MCQmedium

A data engineer is ingesting data from an on-premises database to Amazon S3 using AWS DataSync. The data transfer is scheduled to run daily at midnight. The engineer notices that the transfer takes longer than expected and sometimes does not complete before the next scheduled task. What should the engineer do to ensure the transfer completes within the window?

A.Increase the bandwidth limit in the DataSync task settings.

B.Decrease the bandwidth limit to reduce network congestion.

C.Increase the schedule frequency to every 12 hours.

D.Use S3 Transfer Acceleration instead of DataSync.

AnswerA

Higher bandwidth limit allows faster data transfer.

Why this answer

Option A is correct because increasing the bandwidth limit allows DataSync to use more network capacity, speeding up the transfer. Option C is wrong because changing the schedule frequency does not solve the completion time issue. Option D is wrong because changing to S3 Transfer Acceleration adds cost but may not help if the bottleneck is on-premises bandwidth.

Option B is wrong because decreasing the limit would make it slower.

205

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data is generated at a high velocity and must be processed in near real-time. The pipeline must also handle bursty traffic. Which TWO AWS services should be combined to achieve this? (Choose TWO.)

Select 2 answers

A.Amazon S3

B.Amazon Kinesis Data Analytics

C.Amazon Simple Queue Service (SQS)

D.AWS Glue

E.Amazon Kinesis Data Streams

AnswersB, E

Can process streaming data in near real-time.

Why this answer

Option E (Kinesis Data Streams) is correct because it can ingest high-velocity streaming data and handle bursty traffic. Option B (Kinesis Data Analytics) is correct because it can perform real-time processing. Option C is wrong because SQS is for decoupling, not real-time streaming.

Option D is wrong because Glue is more batch-oriented. Option A is wrong because S3 is not real-time.

206

MCQmedium

A company is using AWS Database Migration Service (DMS) to migrate a 2 TB Oracle database to Amazon Aurora PostgreSQL. The migration must have minimal downtime. The source database is highly active with continuous writes. Which DMS migration type and additional configuration should the engineer use?

A.Use a CDC-only migration task to capture changes from the source.

B.Use a full load migration task and stop the source database before starting.

C.Use a full load migration task with task restart enabled.

D.Use a full load migration task followed by ongoing replication (CDC).

AnswerD

Full load migrates existing data, then CDC replicates new changes, minimizing downtime.

Why this answer

Option D is correct because full load + change data capture (CDC) allows an initial bulk load and then continuously replicates ongoing changes to keep the target in sync, minimizing downtime. Option B is wrong because a full load alone does not capture changes made during the migration, and stopping the source database to prevent them means downtime. Option A is wrong because a CDC-only task assumes the existing data is already present on the target, which isn't the case.

Option C is wrong because task restart does not address ongoing changes.
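
In the DMS API the three task types map to the `MigrationType` values `full-load`, `cdc`, and `full-load-and-cdc`. The sketch below only assembles the arguments one might pass to boto3's `create_replication_task`; every ARN and identifier is a placeholder, the table mapping is an illustrative include-everything rule, and no AWS call is made.

```python
import json

# Select every table in every schema; a real migration would usually be narrower.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

task_args = {
    "ReplicationTaskIdentifier": "oracle-to-aurora",           # placeholder
    "SourceEndpointArn": "arn:aws:dms:...:endpoint:SOURCE",    # placeholder
    "TargetEndpointArn": "arn:aws:dms:...:endpoint:TARGET",    # placeholder
    "ReplicationInstanceArn": "arn:aws:dms:...:rep:INSTANCE",  # placeholder
    "MigrationType": "full-load-and-cdc",  # bulk load, then ongoing replication
    "TableMappings": json.dumps(table_mappings),
}
print(task_args["MigrationType"])
```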

207

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for streaming data from IoT devices. The devices send JSON messages every second. The engineer needs to ingest the data with low latency and store it in Amazon S3 in Parquet format. Which TWO services should the engineer use together?

Select 2 answers

A.AWS Lambda

B.Amazon Athena

C.Amazon Kinesis Data Streams

D.AWS Glue

E.Amazon Kinesis Data Firehose

AnswersC, E

Provides low-latency ingestion.

Why this answer

Options C and E are correct. Kinesis Data Streams provides low-latency ingestion, and Kinesis Data Firehose can convert data to Parquet and deliver to S3. Option D (Glue) is batch-oriented.

Option B (Athena) is for querying. Option A (Lambda) can be used but is not required for the conversion; Firehose can do it natively.

Practice this question →

208

Multi-Selectmedium

A company is using AWS Glue ETL to transform and load data from Amazon S3 to Amazon Redshift. The data engineer notices that the job is taking longer than expected. Which TWO actions can improve the job performance?

Select 2 answers

A.Use Amazon Redshift Spectrum to query data directly.

B.Partition the source data in S3.

C.Increase the number of DPUs for the Glue job.

D.Enable S3 Transfer Acceleration.

E.Use a larger Redshift node type.

AnswersB, C

Partitioning reduces data scanned.

Why this answer

Options B and C are correct because partitioning the source data in S3 and increasing the number of DPUs reduce the data scanned and improve parallelism. Option D is incorrect because S3 Transfer Acceleration is for uploads, not ETL. Option A is incorrect because Redshift Spectrum is for querying, not Glue ETL.

Option E is incorrect because a larger node type for Redshift does not affect Glue job performance.

Practice this question →

209

Multi-Selecthard

A company uses AWS Glue to process large datasets. The Glue job occasionally fails with 'DiskFull' errors. Which TWO actions should the engineer take to resolve this issue? (Choose two.)

Select 2 answers

A.Increase the number of workers for the Glue job.

B.Store intermediate results in Amazon S3 instead of local disk.

C.Enable job bookmark to skip already processed data.

D.Use a Python shell job instead of Spark.

E.Use G.2X worker type which provides more disk space per worker.

AnswersA, E

More workers provide more aggregate disk space.

Practice this question →

210

MCQeasy

A company needs to ingest data from multiple SaaS applications into Amazon S3. The data sources provide REST APIs. Which AWS service can be used to build a fully managed data ingestion pipeline without writing custom code?

A.Amazon AppFlow

B.Amazon Kinesis Data Streams

C.AWS Lambda with custom code

D.AWS Glue with Python shell

AnswerA

AppFlow is a fully managed integration service for SaaS applications.

Why this answer

Option A is correct because Amazon AppFlow is a fully managed service to transfer data from SaaS applications to AWS. Option C is wrong because AWS Lambda requires custom code. Option D is wrong because AWS Glue is for ETL, not for direct API integration.

Option B is wrong because Amazon Kinesis Data Streams is for streaming data.

Practice this question →

211

MCQmedium

A company is using AWS Glue to run ETL jobs that process data from Amazon S3 and load it into Amazon Redshift. The jobs are failing with the error 'Unable to connect to Redshift cluster'. The Redshift cluster is in the same VPC as the Glue job. What is the MOST likely cause?

A.The Redshift cluster's security group does not allow inbound traffic from the Glue job's security group.

B.The IAM role associated with the Glue job does not have permission to access Redshift.

C.The Glue job is not configured to use the same VPC as the Redshift cluster.

D.The Redshift cluster is not publicly accessible and Glue is trying to connect from outside the VPC.

AnswerC

Glue jobs need VPC configuration to access resources in a private VPC.

Why this answer

Option C is correct because Glue jobs run outside the customer's VPC by default; if the job is not configured to use the same VPC as the Redshift cluster, it cannot connect to it. Option A is incorrect because the Redshift cluster is in the same VPC, so security group rules should allow traffic if properly configured. Option D is incorrect because the cluster does not need to be publicly accessible once the job connects from inside the VPC.

Option B is incorrect because IAM roles govern permissions for AWS API calls, not network connectivity.

Practice this question →

212

MCQhard

A data pipeline uses AWS Glue to read CSV files from an S3 bucket, transform them, and write Parquet back to S3. The pipeline runs daily and processes about 500 GB per run. The team wants to reduce costs without increasing runtime. Which approach is most effective?

A.Pre-convert the CSV files to Parquet in S3 using a separate process.

B.Enable job bookmarks to skip already processed data.

C.Increase the number of DPUs for the Glue job to improve parallelism.

D.Optimize the Glue script to select only required columns and filter rows early.

AnswerD

Reduces data scanned, lowering cost.

Why this answer

Option D is correct because column pruning and predicate pushdown reduce the amount of data scanned, lowering costs without increasing runtime. Option C is wrong because increasing DPUs may increase cost and affect runtime unpredictably. Option A is wrong because converting to Parquet upfront would add an extra processing step and cost.

Option B is wrong because job bookmarks track processed data but do not reduce cost per run.

Practice this question →

213

MCQhard

Refer to the exhibit. A data engineer runs the AWS CLI command to describe a Glue job. The job is expected to process new data incrementally using job bookmarks. However, the job reprocesses all data every time it runs. What is the MOST likely reason?

A.The job bookmark option is set to 'job-bookmark-enable' but should be 'job-bookmark-disable'.

B.The job's MaxRetries is set to 0, which disables bookmarks.

C.The ETL script does not use the 'transformation_ctx' parameter in its DynamicFrame transformations.

D.The Glue job's command name is 'glueetl', which does not support job bookmarks.

AnswerC

Without transformation_ctx, Glue cannot track bookmarks.

Why this answer

Option C is correct because job bookmarks only track state for sources and transformations that pass a 'transformation_ctx' parameter. Without it, Glue has no key under which to store bookmark state, so every run reprocesses all data. The exhibit shows 'Command.Name' set to 'glueetl' and '--job-bookmark-option' set to 'job-bookmark-enable', both of which are correct, so the job definition is not the problem.

Option A is wrong because bookmarks are already enabled; switching to 'job-bookmark-disable' would turn them off. Option B is wrong because MaxRetries does not affect bookmarks.

Option D is wrong because Spark ('glueetl') jobs do support job bookmarks. The exhibit does not show the script, but a missing 'transformation_ctx' in the DynamicFrame calls is the common cause when the job definition looks right.
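One way to catch this kind of omission before deploying is a quick text check over the ETL script. The sketch below is a naive, hypothetical lint helper (not part of Glue) that flags `create_dynamic_frame.from_*` calls with no `transformation_ctx` argument; the database and table names in the sample strings are made up, and the parenthesis matching ignores parentheses inside string literals.

```python
import re

# Matches the start of a DynamicFrame read such as create_dynamic_frame.from_catalog(
READ_CALL = re.compile(r"create_dynamic_frame\.from_\w+\s*\(")


def calls_missing_transformation_ctx(script: str) -> list[str]:
    """Return the text of each DynamicFrame read call that lacks transformation_ctx."""
    missing = []
    for match in READ_CALL.finditer(script):
        depth, i = 1, match.end()
        # Walk forward to the parenthesis that closes this call.
        while i < len(script) and depth > 0:
            if script[i] == "(":
                depth += 1
            elif script[i] == ")":
                depth -= 1
            i += 1
        call = script[match.start():i]
        if "transformation_ctx" not in call:
            missing.append(call)
    return missing


# A read that bookmarks can track, and one they cannot.
tracked = 'src = glueContext.create_dynamic_frame.from_catalog(database="logs", table_name="raw", transformation_ctx="src")'
untracked = 'src = glueContext.create_dynamic_frame.from_catalog(database="logs", table_name="raw")'

print(calls_missing_transformation_ctx(tracked))         # []
print(len(calls_missing_transformation_ctx(untracked)))  # 1
```

In a real script, bookmark state is also only saved when the job calls `job.init(...)` at the start and `job.commit()` at the end, so those two calls are worth checking as well.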

Practice this question →

214

MCQmedium

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job reads JSON records and writes Parquet to Amazon S3. Recently, the job started failing with 'Out of Memory' errors. Which change is MOST likely to resolve the issue?

A.Enable compression on the Kinesis stream.

B.Change the output format from Parquet to ORC.

C.Increase the number of DPUs allocated to the Glue job.

D.Reduce the streaming batch size in the Glue job configuration.

AnswerC

More DPUs provide more memory and CPU.

Why this answer

The 'Out of Memory' error in AWS Glue indicates that the job's allocated resources are insufficient for the data volume or processing complexity. Increasing the number of DPUs (Data Processing Units) directly increases the available memory and compute capacity, which is the most straightforward fix for OOM errors in Glue streaming jobs. Option C is correct because it addresses the root cause—resource exhaustion—by scaling the job horizontally.

Exam trap

The trap here is that candidates often confuse 'Out of Memory' with a data format or compression issue, leading them to choose options like A or B, when the real solution is to scale compute resources via DPUs.

How to eliminate wrong answers

Option A is wrong because enabling compression on the Kinesis stream reduces data transfer size but does not affect the memory footprint of the Glue job processing the data; the job still decompresses records into memory. Option B is wrong because changing the output format from Parquet to ORC does not reduce memory usage—both are columnar formats with similar memory profiles, and the error is not related to serialization efficiency. Option D is wrong because reducing the streaming batch size can help with latency but does not guarantee resolution of OOM errors; the job may still fail if individual records or transformations are memory-intensive, and the core issue is insufficient total memory allocation.

Practice this question →

215

MCQeasy

A company needs to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is about 100 GB per day. Which AWS service is BEST suited for this task?

A.Use AWS DataSync to copy the database files to S3.

B.Use Amazon Kinesis Data Firehose with a database connector.

C.Use AWS Database Migration Service (DMS) to replicate data to S3.

D.Use AWS Glue to extract data from Oracle and write to S3.

AnswerC

DMS supports continuous replication from Oracle to S3.

Why this answer

Option C is correct because AWS Database Migration Service (DMS) can continuously replicate data from Oracle to S3, and it supports full load and change data capture (CDC). Option A (AWS DataSync) is for file-based transfers, not database replication. Option B (Amazon Kinesis Data Firehose) is for streaming data, not database pull.

Option D (AWS Glue) is for ETL but does not natively support continuous CDC from Oracle.

Practice this question →

216

MCQmedium

Refer to the exhibit. An IAM policy is attached to an EC2 instance role that runs a data ingestion application. The application reads files from an S3 bucket 'data-lake-primary' and sends records to a Kinesis stream named 'clickstream'. The application is failing with an 'AccessDenied' error when trying to read from S3. What is the MOST likely cause?

A.The actions are specified incorrectly; they should be s3:GetObject and s3:PutObject only.

B.The policy is not attached to the EC2 instance role.

C.The Kinesis stream name is incorrect.

D.The policy does not include the s3:ListBucket permission.

AnswerD

Reading objects often requires ListBucket permission for the bucket.

Why this answer

Option D is correct. The policy allows s3:GetObject on the objects in the bucket (the resource ends in /*), so reading a known key, including one in a subfolder, would work. The most likely cause of the error is that the application lists the bucket before reading, and listing requires s3:ListBucket on the bucket itself, which the policy does not grant.

Option C is wrong because the stream name is correct, and the failure occurs on the S3 read rather than on Kinesis. Option A is wrong because the actions are allowed.

Option B is wrong because the policy is attached to the role.
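The distinction between bucket-level and object-level permissions is easier to see written out. The following is a minimal sketch of the S3 part of such a policy, built as a Python dictionary; the helper name is hypothetical, and the exhibit's actual policy is not reproduced here.

```python
import json


def build_s3_read_policy(bucket: str) -> dict:
    """Return an IAM policy document that allows listing and reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action: the resource is the bucket ARN.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading is an object-level action: the resource is the object ARN pattern.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


print(json.dumps(build_s3_read_policy("data-lake-primary"), indent=2))
```

Putting s3:ListBucket on the `/*` resource would not take effect, because that action is evaluated against the bucket ARN.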

Practice this question →

217

MCQeasy

A company uses AWS Glue to process CSV files from an S3 bucket. The job fails intermittently with a 'SchemaDetectionError' for files that have inconsistent column counts. What is the most efficient way to handle this?

A.Use the 'mergeSchema' option when reading the DynamicFrame.

B.Convert all CSV files to Parquet format using a separate preprocessing job.

C.Define a fixed schema in the Glue job using 'apply_mapping' to map columns.

D.Set the job to 'ignore' schema mismatches in the job parameters.

AnswerA

'mergeSchema' allows Glue to handle schemas that evolve over time.

Why this answer

Option A is correct because the 'mergeSchema' option in AWS Glue's DynamicFrame reader automatically reconciles schema differences across files, including inconsistent column counts. When enabled, Glue merges all schemas encountered during the read, adding nulls for missing columns in files with fewer columns, preventing the 'SchemaDetectionError' without manual intervention.

Exam trap

The trap here is that candidates often confuse 'mergeSchema' with schema-on-read features in other tools or assume that 'apply_mapping' can fix schema mismatches, when in reality it only transforms already-resolved schemas.

How to eliminate wrong answers

Option B is wrong because converting to Parquet does not inherently solve schema inconsistency; Parquet also requires a consistent schema across files unless mergeSchema is explicitly enabled, and adding a preprocessing job is less efficient than handling it inline. Option C is wrong because 'apply_mapping' only remaps existing columns after the schema is resolved; it does not handle files with missing or extra columns that cause the initial schema detection to fail. Option D is wrong because AWS Glue does not have a job parameter to 'ignore' schema mismatches; the error occurs during schema detection, and ignoring it would lead to data corruption or job failure.

Practice this question →

218

MCQhard

A company uses Kinesis Data Streams to ingest clickstream data. They notice that the data processing latency increases as the number of shards grows. What is the most likely cause and solution?

A.Reduce the number of shards or increase the number of consumers.

B.Increase the Kinesis Producer Library (KPL) batch size.

C.Use enhanced fan-out to allow multiple consumers per shard.

D.Increase the number of shards to handle more data.

AnswerA

Balancing shards and consumers ensures each shard is processed, reducing latency.

Why this answer

Option A is correct because when there are more shards than consumers can keep up with, some shards are not read promptly, and records wait longer before they are processed. Option D is wrong because increasing shards would worsen the imbalance. Option C is wrong because enhanced fan-out gives each of several consumers dedicated throughput; it does not help when there are too few consumers for the shard count.

Option B is wrong because increasing batch size might help throughput but not the fundamental shard-consumer mismatch.

Practice this question →

219

MCQhard

A company uses AWS Glue to process JSON logs from S3. The logs have a nested structure and the schema evolves over time. The data engineer needs to ensure the Glue job can handle schema changes without failing. Which configuration should be used?

A.Manually update the table schema in the Glue Data Catalog before each run

B.Use Spark SQL with a static schema definition in the script

C.Set the job parameter '--enable-glue-datacatalog' and '--mergeDynamicColumns' to true

D.Enable AWS Glue Schema Registry and define a schema version

AnswerC

This allows Glue DynamicFrame to merge schema variations automatically.

Why this answer

Option C is correct because setting 'mergeDynamicColumns' to true in Glue job parameters allows new columns to be added dynamically. Option D (schema registry) is for schema validation, not evolution. Option A (manual schema updates) is not automated.

Option B (Spark SQL) does not solve evolution.

Practice this question →

220

MCQmedium

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is approximately 500 GB per day. The source database is behind a firewall that does not allow direct internet access. Which service should the engineer use to transfer the data securely?

A.AWS DataSync with a network path through AWS Direct Connect or VPN.

B.AWS Database Migration Service (AWS DMS) with ongoing replication from Oracle to S3.

C.Amazon S3 Transfer Acceleration with a public endpoint.

D.AWS Snowball Edge device for daily transfers.

AnswerA

DataSync is designed for scheduled transfers to S3.

Why this answer

Option A is correct because AWS DataSync can transfer data from on-premises to S3, and it supports private network connectivity via AWS Direct Connect or VPN. Option B is wrong because AWS DMS is for ongoing replication, not bulk daily transfers. Option C is wrong because S3 Transfer Acceleration requires internet.

Option D is wrong because Snowball Edge is for offline transfer, not daily.

Practice this question →

221

MCQmedium

A company runs a SQL Server transactional database on Amazon RDS. They need to capture change data (inserts, updates, deletes) in near real-time and replicate them to an Amazon S3 data lake. Which AWS service is most suitable?

A.AWS Database Migration Service (DMS) with change data capture

B.AWS Glue DataBrew

C.Amazon Kinesis Data Streams with Kinesis Client Library

D.Amazon Redshift Spectrum

AnswerA

DMS supports ongoing replication with CDC and can write to S3.

Why this answer

AWS DMS with CDC continuously captures changes from the source and writes them to S3 in CSV or Parquet format. Glue DataBrew, Kinesis Data Streams with the KCL, and Redshift Spectrum are not designed for CDC from RDS to S3 directly.

Practice this question →

222

Multi-Selectmedium

Which TWO practices improve the performance of AWS Glue ETL jobs? (Choose two.)

Select 2 answers

A.Use pushdown predicates to filter data at the source

B.Increase the number of DPUs to the maximum allowed

C.Use the smallest possible file size for input data

D.Enable AWS Glue job metrics and debug logging

E.Use column pruning to select only required columns

AnswersA, E

Filters data early, reducing data scanned.

Why this answer

Options A and E are correct. Using column pruning reduces the data shuffled, and using pushdown predicates filters data early. Option B is wrong because increasing DPUs beyond the recommended ratio can cause resource contention.

Option C is wrong because the smallest file size increases overhead. Option D is wrong because standard logs are sufficient and debug logs generate overhead.
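As a small illustration of pushdown filtering, the sketch below assembles the predicate string a Glue script might pass as the `push_down_predicate` argument of `glueContext.create_dynamic_frame.from_catalog`. The helper and the partition column names (`year`, `month`) are hypothetical; the predicate only helps when the catalog table is partitioned on those columns.

```python
def build_pushdown_predicate(partitions: dict[str, str]) -> str:
    """Join partition filters into a Spark SQL style predicate string."""
    clauses = [f"{column} == '{value}'" for column, value in partitions.items()]
    return " and ".join(clauses)


predicate = build_pushdown_predicate({"year": "2024", "month": "06"})
print(predicate)  # year == '2024' and month == '06'
```

Column pruning is a separate step: after the read, a call such as `select_fields` on the DynamicFrame keeps only the columns the job needs.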

Practice this question →

223

MCQhard

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream, performs a 1-minute tumbling window aggregation, and writes results to an S3 bucket. Recently, the application started experiencing checkpoint failures and increasing processing delay. Which action should the engineer take FIRST to diagnose the issue?

A.Increase the parallelism of the Flink application.

B.Monitor CPU and memory utilization of the Flink application using Amazon CloudWatch metrics.

C.Switch to the Kinesis Client Library (KCL) for checkpointing.

D.Increase the checkpoint interval to reduce checkpoint frequency.

AnswerB

Resource exhaustion is a common cause of checkpoint failures; monitoring helps identify if scaling is needed.

Why this answer

Option B is correct because checkpoint failures are often due to insufficient resources (CPU/memory) for the Flink job. Monitoring CPU and memory utilization via CloudWatch metrics directly helps identify resource bottlenecks. Option D (checkpoint interval) might help but is not diagnostic.

Option A (parallelism) is a tuning step. Option C (KCL) is not relevant for Flink. The first step is to check resource utilization.
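A first diagnostic pass could pull recent CPU and heap figures from CloudWatch. The sketch below only builds the request parameters for boto3's `get_metric_statistics`; the helper name and application name are hypothetical, and the namespace (`AWS/KinesisAnalytics`), dimension (`Application`), and metric names (`cpuUtilization`, `heapMemoryUtilization`) are stated as assumptions to verify against the metrics shown in the console for the application.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(app_name: str, metric_name: str, minutes: int = 60) -> dict:
    """Build get_metric_statistics parameters for one Flink application metric."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/KinesisAnalytics",  # assumed namespace
        "MetricName": metric_name,
        "Dimensions": [{"Name": "Application", "Value": app_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }


# Hypothetical application name; metric names are assumptions to confirm.
queries = [
    build_metric_query("clickstream-aggregator", name)
    for name in ("cpuUtilization", "heapMemoryUtilization")
]
```

Each dictionary could then be unpacked into `cloudwatch.get_metric_statistics(**query)`; sustained values near 100 percent would point toward scaling rather than checkpoint tuning.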

Practice this question →

224

Multi-Selecthard

A company is ingesting real-time financial transactions into Amazon Kinesis Data Streams. The data is then consumed by a Kinesis Data Analytics for Apache Flink application that calculates running totals. The application is experiencing high latency and checkpoint failures. Which TWO steps should the engineer take to improve performance and reliability? (Select TWO.)

Select 2 answers

A.Enable enhanced fan-out for the Flink application.

B.Reduce the batch size of records processed per checkpoint.

C.Increase the number of shards in the Kinesis data stream.

D.Increase the number of KPUs (Kinesis Processing Units) for the Flink application.

E.Decrease the checkpoint interval to reduce state size.

AnswersC, D

More shards increase parallelism, reducing latency and improving throughput.

Why this answer

Options C and D are correct. Increasing the number of shards increases throughput and parallelism, helping reduce latency. Increasing the number of KPUs (Kinesis Processing Units) for the Flink application provides more compute resources, addressing checkpoint failures.

Option E (decreasing checkpoint interval) may increase checkpoint overhead. Option A (using enhanced fan-out) is for multiple consumers, not for a single Flink job. Option B (reducing batch size) may not help with overall throughput.

Practice this question →

225

MCQhard

A media company uses Amazon Kinesis Data Firehose to ingest log data from web servers into Amazon S3. The data is then processed by AWS Glue jobs. The company wants to ensure that data is delivered to S3 within 5 minutes of ingestion. Currently, the Firehose delivery stream is configured with a buffer interval of 300 seconds and a buffer size of 5 MB. The log data arrives at a rate of 2 MB per second. The data engineer notices that some log files are delayed by up to 10 minutes. The company cannot change the buffer size due to downstream requirements. What should the data engineer do to meet the 5-minute delivery requirement?

A.Increase the buffer interval to 600 seconds to reduce the number of delivery attempts.

B.Increase the buffer size to 10 MB to ensure data is delivered in larger chunks.

C.Enable GZIP compression on the Firehose stream to reduce data size.

D.Decrease the buffer interval to 120 seconds.

AnswerD

Lower interval triggers delivery more often, reducing latency.

Why this answer

Option D is correct. Decreasing the buffer interval to 120 seconds will cause Firehose to deliver data more frequently, meeting the 5-minute SLA. Option A is wrong because increasing the buffer interval would increase delay.

Option B is wrong because increasing the buffer size would also increase delay, and the scenario rules out changing the buffer size. Option C is wrong because compressing data does not reduce the buffer interval.
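Firehose flushes a buffer when either the size threshold or the interval threshold is reached, whichever comes first. The arithmetic can be sketched as below; the function is a hypothetical helper, the 0.01 MB/s rate is an assumed quiet-period figure, and the model ignores delivery and retry time.

```python
def seconds_until_flush(buffer_mb: float, interval_s: float, rate_mb_per_s: float) -> float:
    """Return how long data waits in the buffer before Firehose flushes it."""
    if rate_mb_per_s <= 0:
        # No incoming data: only the interval can trigger a flush.
        return interval_s
    # Whichever threshold is reached first triggers delivery.
    return min(buffer_mb / rate_mb_per_s, interval_s)


# Quiet period (assumed 0.01 MB/s): the interval is the binding limit.
print(seconds_until_flush(5, 300, 0.01))  # 300
print(seconds_until_flush(5, 120, 0.01))  # 120

# At the stated 2 MB/s the 5 MB size threshold is reached first.
print(seconds_until_flush(5, 300, 2))     # 2.5
```

Under this simple model, lowering the interval mainly shortens the wait during low-traffic periods, when the buffer would otherwise sit until the full interval elapses.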

Practice this question →
