CCNA Data Ingestion and Transformation Questions

451

MCQ · hard

A company runs a daily batch ETL job using AWS Glue. The job processes 500 GB of data from Amazon RDS to Amazon S3. The job currently uses a single DPU and takes 6 hours to complete. The team wants to reduce runtime to under 1 hour without increasing costs significantly. Which approach should they use?

A. Change the job type from Python to Spark.

B. Use multiple Glue jobs triggered sequentially.

C. Increase the RDS instance size to improve read throughput.

D. Use AWS Glue Spark job with 100 workers.

Answer: D

More workers enable parallelism, reducing runtime.

Why this answer

Option D is correct because increasing the number of workers (DPUs) lets Spark process the data in parallel, which reduces runtime. Because Glue bills per DPU-hour, a shorter run on more workers need not cost much more. Option A is wrong because changing the job type alone does not add workers. Option B is wrong because jobs triggered sequentially run one after another, so they add no parallelism.

Option C is wrong because using a larger instance type for RDS may improve read throughput but is not a Glue optimization and could increase database cost.

452

Multi-Select · hard

A company uses AWS DMS to replicate data from an Amazon RDS for MySQL database to Amazon S3. Which TWO configurations are required to enable continuous change data capture (CDC) from MySQL?

Select 2 answers

A. Ensure the S3 bucket is in the same AWS Region as the source database

B. Grant REPLICATION CLIENT and REPLICATION SLAVE privileges to the DMS user

C. Enable binary logging (binlog) on the MySQL source database

D. Enable versioning on the target S3 bucket

E. Configure the MySQL source to be Multi-AZ

Answers: B, C

Required for DMS to read binary logs.

Why this answer

Correct options: B and C. DMS CDC reads changes from the MySQL binary log, so binary logging (binlog) must be enabled on the source, and binlog retention must be long enough for DMS to read the changes before they are purged. The S3 target endpoint's `DataFormat` (`parquet` or `csv`) can be set as needed but is not a requirement for CDC.

Option B is correct: the DMS user needs the `REPLICATION CLIENT` and `REPLICATION SLAVE` privileges to read the binary log. Option C is correct: binlog must be enabled. Option A is incorrect; the target S3 bucket does not need to be in the same Region as the source database (though this is recommended).

Option D is incorrect; S3 bucket versioning is not required. Option E is incorrect; the source database does not need to be Multi-AZ for CDC.

453

MCQ · easy

A company needs to transform JSON data from an S3 bucket into a structured format for Amazon Redshift. The transformation should be done serverlessly. Which service should be used?

A. AWS Glue

B. Amazon EMR

C. Amazon Athena

D. AWS Lambda

Answer: A

Glue provides serverless ETL capabilities.

Why this answer

Option A is correct because AWS Glue is a serverless ETL service that can transform data for Redshift. Option D is wrong because Lambda can process events but is not ideal for large-scale ETL. Option C is wrong because Athena is a query service, not a transformation service.

Option B is wrong because EMR, as traditionally deployed, requires provisioning and managing a cluster.

454

MCQ · easy

A company uses Amazon S3 Event Notifications to trigger a Lambda function that processes incoming files. Recently, the Lambda function has been timing out for large files (>100 MB). The data engineer wants to improve the pipeline to handle large files reliably. Which solution is the MOST scalable and cost-effective?

A. Use S3 Event Notification to send to an SQS queue, then have Lambda poll the queue

B. Use Amazon SNS to fan out the event to multiple Lambda functions

C. Use AWS Step Functions to orchestrate multiple Lambda functions for parallel processing

D. Increase the Lambda timeout to 15 minutes

Answer: A

SQS decouples and buffers events, allowing Lambda to process at a manageable rate.

Why this answer

Option A is correct because S3 Event Notifications can send to SQS, which decouples the producer and consumer, allowing Lambda to process at its own pace. Option D (increasing the timeout to 15 minutes, the Lambda maximum) may still not be enough for very large files. Option C (Step Functions) adds complexity.

Option B (SNS) does not buffer events.

455

MCQ · medium

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by an Amazon Kinesis Data Analytics for Apache Flink application that performs real-time analytics. The Flink application writes its results to an Amazon S3 bucket. The company has noticed that the Flink application is experiencing high checkpoint failure rates, causing delays. The CloudWatch metrics show that the checkpoint size is large and increasing. The data engineer needs to reduce the checkpoint size. Which action should the data engineer take?

A. Decrease the checkpoint interval to reduce the amount of state accumulated.

B. Reduce the parallelism of the Flink application.

C. Increase the state time-to-live (TTL) configuration to retain state longer.

D. Enable incremental checkpointing in the Flink application to only write changes since the last checkpoint.

Answer: D

Incremental checkpoints reduce size and improve performance.

Why this answer

Option D is correct because enabling incremental checkpointing in Flink reduces the amount of data written per checkpoint by only writing changes since the last checkpoint. Option B is wrong because reducing parallelism may increase load per operator. Option A is wrong because decreasing the checkpoint interval makes checkpoints more frequent but does not shrink the state each one must capture.

Option C is wrong because increasing state TTL retains state longer, which tends to make checkpoints larger rather than smaller.

456

MCQ · medium

A company is using Amazon Kinesis Data Firehose to ingest log data from web servers into an Amazon S3 bucket. The data is then queried by Amazon Athena. The company has noticed that the Athena queries are slow and expensive. The data engineer wants to optimize the storage format to improve query performance and reduce costs. Which configuration change should the data engineer make to the Firehose delivery stream?

A. Increase the buffer interval to 600 seconds and buffer size to 128 MB to create larger files.

B. Change the output format to ORC and enable GZIP compression.

C. Enable S3 server access logs to track query patterns.

D. Enable data transformation in Firehose to convert JSON to Parquet format with Snappy compression.

Answer: D

Parquet is columnar and efficient for Athena.

Why this answer

Option D is correct because converting data to Parquet format and compressing it reduces storage space and improves query performance in Athena. Option B is wrong because storing as ORC is also good but Parquet is more common with Athena. Option A is wrong because increasing the buffer interval and size delays delivery and leaves the storage format unchanged.

Option C is wrong because enabling S3 server access logs adds cost and does not help query performance.
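
As a rough illustration of the JSON-to-Parquet conversion discussed in this question, the fragment below builds record format conversion settings as a plain Python dictionary, in the shape that boto3's Firehose client accepts inside an extended S3 destination configuration. The role ARN, database, and table names are hypothetical placeholders, and the schema is assumed to come from an AWS Glue Data Catalog table.

```python
# Sketch of Firehose record format conversion settings (JSON in, Parquet out).
# The ARN, database, and table passed in below are hypothetical placeholders.
def parquet_conversion_config(role_arn: str, database: str, table: str) -> dict:
    return {
        "Enabled": True,
        # Firehose reads the target schema from a Glue Data Catalog table.
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
        },
        # Incoming records are parsed as JSON.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # Outgoing files are written as Snappy-compressed Parquet.
        "OutputFormatConfiguration": {
            "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
        },
    }


config = parquet_conversion_config(
    "arn:aws:iam::123456789012:role/example-firehose-role",
    "example_logs_db",
    "example_web_logs",
)
```

The dictionary would be supplied as the `DataFormatConversionConfiguration` of the delivery stream's S3 destination; nothing here calls AWS.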

457

MCQ · easy

A company wants to ingest data from thousands of IoT devices into AWS for real-time analytics. The data is in JSON format and each device sends about 1 KB every second. Which service should be used as the primary ingestion point?

A. AWS IoT Core

B. Amazon Kinesis Data Firehose

C. Amazon SQS

D. Amazon Kinesis Data Streams

Answer: D

Handles high-volume streaming data.

Why this answer

Option D is correct because Kinesis Data Streams can handle high-throughput real-time data from many sources. Option C (SQS) is for message queuing, not streaming analytics. Option B (Firehose) could be used but lacks the ability to have multiple consumers for real-time processing.

Option A (IoT Core) is specifically for IoT but not as general-purpose.

458

MCQ · easy

A company uses AWS Glue ETL jobs to transform data and load it into Amazon Redshift. The jobs are failing with 'Out of Memory' errors. What is the most cost-effective way to resolve this issue without changing the transformation logic?

A. Increase the number of G.1X workers in the Glue job configuration.

B. Use Amazon Redshift Spectrum to query data directly from S3 without transformation.

C. Change the worker type to G.2X and keep the same number of workers.

D. Switch the job from Python to Scala.

Answer: A

More workers increase parallelism and total memory.

Why this answer

Option A is correct. Increasing the number of workers (DPUs) adds parallelism and total memory without changing the transformation logic. Option C is wrong because a G.2X worker is billed at twice the DPUs of a G.1X worker, so switching every worker to G.2X doubles the job's hourly cost.

Option D is wrong because rewriting the job in Scala changes the code and does not add memory. Option B is wrong because Redshift Spectrum queries data in S3; it does not fix memory errors in an ETL job.

459

MCQ · medium

A data engineer is ingesting streaming data from an IoT fleet into Amazon Kinesis Data Streams. The data must be transformed in real-time and loaded into an Amazon Redshift cluster. Which solution minimizes operational overhead?

A. Use Kinesis Data Firehose with a Lambda transformation function

B. Use AWS Glue ETL jobs running continuously

C. Use Kinesis Client Library (KCL) to consume and transform data, then write to Redshift using COPY

D. Use AWS Direct Connect to stream data directly into Redshift

Answer: A

Firehose handles buffering, transformation via Lambda, and direct delivery to Redshift.

Why this answer

Option A is correct because Kinesis Data Firehose can buffer and batch incoming data, invoke a Lambda function for transformation, and load directly into Redshift. Option C is wrong because KCL requires custom application management. Option B is wrong because Glue is primarily batch-oriented, and keeping jobs running continuously adds cost and management.

Option D is wrong because Direct Connect is for dedicated network connections.
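
To make the Firehose-plus-Lambda pattern concrete, here is a minimal sketch of a transformation handler. It follows the record contract Firehose uses for data transformation (base64-encoded `data`, and a `recordId` and `result` returned for every record), but the field names `device_id` and `temperature` and the `source` tag are assumptions invented for this example.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose data-transformation handler: decode, reshape, re-encode."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: keep two fields and tag the source.
            transformed = {
                "device_id": payload["device_id"],
                "temperature": payload["temperature"],
                "source": "iot-fleet",
            }
            line = json.dumps(transformed) + "\n"
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, KeyError, TypeError):
            # Hand the record back unchanged and flag it as failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Every incoming `recordId` appears exactly once in the response, which is what lets Firehose match transformed records to the originals.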

460

MCQ · hard

A financial services company is ingesting trade data from multiple exchanges via Amazon Kinesis Data Streams. Each shard receives data from multiple exchanges, and a consumer application (using KCL) processes the data. The company needs to ensure that trades from the same exchange are processed in order. However, the current implementation distributes records to shards using a random partition key, causing trades from the same exchange to be spread across shards and processed out of order. The team must enforce ordering per exchange without significantly reducing throughput. What should the team do?

A. Implement a custom sequence number in the application to reorder after processing.

B. Use a single shard for all data to guarantee order.

C. Use the exchange ID as the partition key when putting records into the stream.

D. Increase the number of shards to 10 per exchange.

Answer: C

Ensures same exchange goes to same shard, preserving order.

Why this answer

Option C is correct because using the exchange ID as the partition key ensures all trades from the same exchange go to the same shard, preserving order. Option D is wrong because increasing the shard count would further spread data and break ordering. Option B is wrong because using a single shard would preserve order but reduce throughput due to shard limits.

Option A is wrong because implementing a custom sequencer is complex and unnecessary.
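
The routing behind this can be simulated locally. Kinesis maps a partition key to a 128-bit hash key using MD5, and each shard owns a range of hash keys. The sketch below assumes the shards split that hash space into equal ranges, which is a simplification; the exchange names and shard count are made up for the example.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit integers


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash-key range holds this key.

    Assumes the stream's shards divide the hash space into equal ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) * shard_count // HASH_SPACE


# Hypothetical trades as (exchange, sequence-within-exchange) pairs.
trades = [("NYSE", 1), ("LSE", 1), ("NYSE", 2), ("NYSE", 3), ("LSE", 2)]

by_shard = {}
for exchange, seq in trades:
    by_shard.setdefault(shard_for(exchange, 8), []).append((exchange, seq))
```

Because the same key always hashes to the same shard, each exchange's trades stay in arrival order within one shard. One trade-off of this design: with only a few distinct exchange IDs, a very busy exchange can concentrate load on a single shard.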

461

MCQ · hard

A company uses Amazon Kinesis Data Analytics (now Managed Service for Apache Flink) to run a Flink application on streaming data. The application fails with 'OutOfMemoryError: Java heap space'. The data volume is 10 MB/s. What is the most likely cause and solution?

A. The data contains records larger than 1 MB; split records into smaller chunks.

B. Checkpointing is enabled too frequently; reduce checkpoint interval.

C. The Flink application is not suitable for 10 MB/s throughput; use Kinesis Data Firehose instead.

D. The application's parallelism is too low; increase Parallelism and the number of KPUs.

Answer: D

Low parallelism causes data to accumulate in operator buffers, leading to OOM.

Why this answer

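Option D is correct: insufficient Parallelism or KPU allocation leaves too little heap for the workload, which leads to OOM. Option B is wrong because checkpointing is not the cause of the heap exhaustion; it persists state for recovery.

Option C is wrong because Flink can handle 10 MB/s with proper resources. Option A is wrong because record size is not the likely cause of the heap exhaustion here.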

462

Multi-Select · medium

A company is using AWS Glue to run ETL jobs that transform data from S3 to Redshift. The jobs are failing intermittently with out-of-memory errors. Which THREE actions can help resolve this issue? (Choose THREE.)

Select 3 answers

A. Increase the number of DPUs allocated to the Glue job

B. Use S3 Select to filter data before reading into the Glue job

C. Use Spark's 'coalesce' function to reduce the number of partitions

D. Optimize the transformation logic to use less memory, for example by filtering early

E. Use a larger worker type, such as G.2X

Answers: A, D, E

More DPUs provide more memory and compute resources.

Why this answer

Options A, D, and E are correct. A: Increasing the number of DPUs provides more total memory. E: Using a larger worker type (e.g., G.2X) increases memory per worker.

D: Optimizing the transformation logic to reduce memory usage helps. C: Using Spark's 'coalesce' reduces the number of partitions, so each task holds more data, and may not solve memory issues. B: Using S3 Select pushes down filtering but does not address memory directly.

463

MCQ · hard

A company has a 100 TB dataset stored on-premises in a Hadoop cluster. They want to ingest this data into Amazon S3 for processing with AWS Glue. The company has a limited time window and a slow internet connection. Which strategy is MOST appropriate?

A. Use AWS Snowball Edge to physically ship the data to AWS.

B. Use AWS DataSync over the existing internet connection.

C. Use Amazon S3 Transfer Acceleration to speed up the upload.

D. Use AWS Direct Connect to establish a high-bandwidth connection.

Answer: A

Snowball Edge can handle 100 TB offline, bypassing network limitations.

Why this answer

Option A is correct because AWS Snowball Edge is designed for large data transfers where network bandwidth is limited. Option D is wrong because provisioning a Direct Connect link typically takes longer than a limited time window allows. Option B is wrong because DataSync still transfers the data over the existing slow connection.

Option C is wrong because S3 Transfer Acceleration also depends on the existing network connection.
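
A quick back-of-the-envelope calculation shows why network-based options struggle here. The question does not state the link speed, so the 100 Mbit/s figure below is purely an assumption, and the function ignores protocol overhead.

```python
def transfer_days(terabytes: float, megabits_per_second: float,
                  utilization: float = 1.0) -> float:
    """Days needed to move `terabytes` (decimal TB) over a link of the given speed."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6 * utilization)
    return seconds / 86_400


# Assumed link speeds; neither value comes from the question.
days_at_100_mbps = transfer_days(100, 100)
days_at_1_gbps = transfer_days(100, 1000)
```

Under these assumptions, 100 TB takes roughly three months at a fully used 100 Mbit/s and over nine days even at 1 Gbit/s, which is why shipping a device is attractive when the window is short.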

464

MCQ · medium

A company uses AWS Glue to process data from multiple sources. The data is stored in an Amazon S3 data lake. The company needs to transform the data using a custom Python library that is not available in the default Glue environment. What is the MOST efficient way to make this library available to the Glue jobs?

A. Manually install the library on each node in the Glue cluster by editing the bootstrap script.

B. Upload the library as a .whl file to Amazon S3 and reference it in the Glue job's --additional-python-modules parameter.

C. Create a custom Docker image with the library and use it in AWS Glue for Ray.

D. Use a shell command in the Glue job script to run 'pip install <library>' before the job runs.

Answer: B

This is the recommended way to add custom libraries to Glue jobs.

Why this answer

Option B is correct because AWS Glue lets a job install additional Python modules through the --additional-python-modules parameter, which accepts a .whl file stored in S3. Option D is wrong because running pip from a shell command inside the job script is not the supported way to add libraries. Option C is wrong because building a custom image is far more complex than needed.

Option A is wrong because installing the library on every node is not supported; Glue manages the environment.
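
As a small sketch of how the parameter might be supplied, the helper below assembles job arguments as a Python dictionary. The helper name and the S3 path are hypothetical; the resulting dictionary is the kind of value passed as a job's default arguments (for example, `DefaultArguments` in boto3's `create_job`), and nothing here calls AWS.

```python
def glue_default_arguments(modules: list) -> dict:
    """Build Glue job arguments that install extra Python modules.

    `modules` may mix package specifiers and S3 paths to .whl files;
    Glue expects them as one comma-separated string.
    """
    return {"--additional-python-modules": ",".join(modules)}


# Hypothetical wheel location; replace with the real bucket and file name.
args = glue_default_arguments(
    ["s3://example-bucket/libs/custom_lib-1.0.0-py3-none-any.whl"]
)
```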

465

MCQ · easy

A data engineering team needs to transform CSV files stored in Amazon S3 into Parquet format using AWS Glue. The files are partitioned by date and are updated hourly. Which AWS Glue feature should be used to automatically detect the schema and partition structure?

A. AWS Glue Crawler

B. AWS Glue DataBrew

C. AWS Lake Formation

D. Amazon Athena

Answer: A

Discovers schema and partitions automatically.

Why this answer

AWS Glue Crawler is the correct choice because it automatically scans data in S3, infers the schema (including data types), and detects the partition structure (e.g., date-based partitions like year/month/day) by examining the folder hierarchy. It then populates the AWS Glue Data Catalog with metadata, enabling ETL jobs to read the data without manual schema definition.

Exam trap

AWS often tests the distinction between tools that discover metadata (Crawler) versus tools that consume or transform data (Athena, DataBrew), leading candidates to pick Athena because it can query partitioned data, but it cannot automatically detect the partition structure without a pre-existing catalog.

How to eliminate wrong answers

Option B (AWS Glue DataBrew) is wrong because it is a visual data preparation tool for cleaning and normalizing data, not for automatic schema or partition detection. Option C (AWS Lake Formation) is wrong because it provides centralized security and governance for data lakes, but it does not perform schema discovery or partition detection itself. Option D (Amazon Athena) is wrong because it is a query engine that can read data from the Glue Data Catalog, but it does not automatically detect schemas or partitions; it relies on existing catalog metadata.
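
The idea of reading partition columns from the folder hierarchy can be illustrated in a few lines. This is not the crawler's actual algorithm, only a sketch of the concept for Hive-style `name=value` folders; the object key is a made-up example.

```python
def partitions_from_key(key: str) -> dict:
    """Extract Hive-style partition columns (name=value) from an S3 object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


# Hypothetical object key with date-based partition folders.
example = partitions_from_key("sales/year=2024/month=05/day=17/part-0000.csv")
```

A crawler does considerably more than this, including inferring column types from file contents and writing the result to the Data Catalog.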

466

Multi-Select · hard

A company is ingesting IoT sensor data into Amazon Kinesis Data Streams. Each sensor sends a JSON payload every second. The data must be transformed and aggregated in real-time before being stored in Amazon DynamoDB. Which THREE services should be used together in the pipeline? (Choose THREE.)

Select 3 answers

A. AWS Lambda

B. Amazon Kinesis Data Analytics

C. Amazon S3

D. Amazon Kinesis Data Streams

E. Amazon Kinesis Data Firehose

Answers: A, B, D

Writes results to DynamoDB.

Why this answer

Options A, B, and D are correct. Kinesis Data Streams ingests the data. Kinesis Data Analytics performs real-time transformations and aggregations.

Lambda can be used to write the aggregated results to DynamoDB. Option E is wrong because Kinesis Data Firehose delivers to destinations such as S3 or Redshift, not DynamoDB. Option C is wrong because S3 is not needed for this pipeline.

467

Multi-Select · medium

A company is using AWS Glue ETL to process data from Amazon RDS for MySQL to Amazon S3. The job runs daily and takes 2 hours to complete. The engineer wants to improve performance without increasing cost significantly. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers

A. Switch to a smaller worker type (e.g., G.1X instead of G.2X).

B. Use Spark DataFrames instead of DynamicFrames.

C. Enable 'Auto Scaling' in the Glue job configuration.

D. Add a partition column to the source table based on a date column.

E. Increase the number of Glue DPUs.

Answers: D, E

Partitioning allows Glue to read data in parallel.

Why this answer

Options D and E are correct. Partitioning the source table on a date column lets Glue split the read into parallel queries. Increasing the number of DPUs improves performance; it raises the hourly cost, but it is a common approach and a shorter run offsets part of that cost.

Option A is wrong because a smaller worker type would slow the job. Option B is wrong because DynamicFrames are the recommended abstraction in Glue, and switching to DataFrames does not address the read bottleneck.

Option C is wrong because Auto Scaling is not enabled by default and, on its own, does not parallelize the read from the source.

468

MCQ · hard

A data pipeline ingests JSON data from an S3 bucket using AWS Glue. The JSON files contain nested structures, and the team wants to flatten them for analysis in Amazon Athena. Which Glue transformation is most appropriate?

A. Filter

B. Join

C. Map

D. Relationalize

Answer: D

Flattens nested JSON into separate tables.

Why this answer

Option D is correct because Relationalize is specifically designed to flatten nested JSON into relational tables. Option C (Map) applies a function to each record. Option A (Filter) removes records.

Option B (Join) combines datasets.
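
To show what flattening a nested record means, here is a plain-Python sketch. It is not the Glue Relationalize API, and it covers only nested objects; Relationalize also splits arrays out into separate tables, which this example does not attempt. The sample record is invented.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into one level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent path along.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical nested record.
row = flatten({"id": 1, "user": {"name": "a", "geo": {"lat": 1.0, "lon": 2.0}}})
```

The result has one flat column per leaf value, which is the shape a SQL engine such as Athena can query without nested-field syntax.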

469

MCQ · medium

A data engineer uses AWS DMS to migrate a 2 TB PostgreSQL database to Amazon Aurora PostgreSQL. The migration task is set to full load + CDC. After the full load completes, the CDC phase starts but shows a high latency of 5 minutes. The source database has a low write load. What should the engineer do to reduce the CDC latency?

A. Decrease the batch size for the CDC task.

B. Disable the validation feature on the DMS task.

C. Increase the size of the DMS replication instance.

D. Enable logging for the DMS task to capture additional details.

Answer: C

More resources reduce latency.

Why this answer

Option C is correct because increasing the DMS replication instance size provides more memory and CPU for processing changes. Option B is wrong because disabling validation reduces overhead but does not relieve a resource constraint on the replication instance. Option A is wrong because the source has a low write load, so batch size is not the bottleneck.

Option D is wrong because logging captures diagnostic detail but does not reduce latency.

470

MCQ · medium

A company uses Kinesis Data Streams to ingest IoT data. The data volume varies, and occasionally the shard write throughput is exceeded, causing ProvisionedThroughputExceeded exceptions. The data engineer needs to handle these spikes without losing data. Which approach is most cost-effective and requires minimal code changes?

A. Implement custom retry logic using the Kinesis Client Library with exponential backoff

B. Increase the number of shards to handle peak throughput

C. Use Kinesis Data Firehose as a consumer with retries and buffer settings

D. Send data to an SQS queue first, then have a Lambda function write to Kinesis

Answer: C

Firehose can buffer data and retry, handling spikes with minimal code changes.

Why this answer

Option C (Use Kinesis Data Firehose as a consumer with retries and buffer settings) is correct because Firehose can buffer data and retry on failures, handling spikes. Option B (Increase the number of shards) is costly and requires manual scaling. Option A (custom retry logic using the Kinesis Client Library with exponential backoff) is a good practice but still requires custom code; Firehose is more managed.

Option D (Send data to SQS and then to Kinesis) adds complexity and cost.

471

MCQ · easy

A company is streaming data from an application to Amazon Kinesis Data Streams. The data must be transformed in real time and then stored in Amazon S3 in Parquet format. Which AWS service should be used for the transformation step?

A. Amazon Kinesis Data Firehose with a Lambda transformation.

B. Amazon EMR running Apache Spark Streaming.

C. Amazon Kinesis Data Analytics for Apache Flink.

D. AWS Lambda with a Kinesis trigger.

Answer: C

Kinesis Data Analytics for Flink is designed for real-time stream processing and can perform complex transformations.

Why this answer

Option C is correct because Amazon Kinesis Data Analytics for Apache Flink is a serverless service that can run Apache Flink applications to perform real-time transformations on streaming data. Option D (AWS Lambda) can be used but is limited by execution time and is not optimized for complex streaming transformations. Option A (Amazon Kinesis Data Firehose) can transform data using Lambda functions but is more suited for loading data.

Option B (Amazon EMR running Spark Streaming) can process streams but requires provisioning and managing a cluster.

472

MCQ · medium

A data engineer needs to ingest streaming data from thousands of devices sending JSON messages via HTTP POST. The data should be stored in Amazon S3 with minimal latency and also be available for real-time analytics. Which combination of services is MOST appropriate?

A. Amazon DynamoDB with DynamoDB Streams and Lambda.

B. Amazon SQS and AWS Lambda to write to S3.

C. AWS Lambda directly writing to S3 via API Gateway.

D. Amazon API Gateway, Amazon Kinesis Data Streams, and Kinesis Data Firehose.

Answer: D

API Gateway receives POST, sends to Kinesis for real-time analytics, and Firehose batches to S3.

Why this answer

Option D is correct because API Gateway ingests the HTTP POST requests, sends them to Kinesis Data Streams for real-time consumption, and Firehose delivers to S3. Option B (SQS) is not ideal for real-time streaming. Option A (DynamoDB Streams) is for database changes.

Option C (Lambda + S3) lacks real-time analytics capability.

473

Multi-Select · medium

A company ingests IoT sensor data into Kinesis Data Streams. The data is then processed by a Lambda function that aggregates readings and writes to DynamoDB. The Lambda function is experiencing high error rates due to throttling. Which TWO actions would reduce throttling?

Select 2 answers

A. Increase the number of shards in the Kinesis stream.

B. Increase the batch size in the Lambda event source mapping.

C. Decrease the batch window in the Lambda event source mapping.

D. Configure DynamoDB to use on-demand capacity mode.

E. Increase the Lambda reserved concurrency to 1000.

Answers: B, D

Larger batches mean fewer invocations, reducing throttling.

Why this answer

Option B is correct because increasing the batch size in the Lambda event source mapping allows each invocation to process more records from the Kinesis stream, reducing the number of concurrent Lambda invocations and thus lowering the risk of throttling. Option D is correct because switching DynamoDB to on-demand capacity mode eliminates write capacity limits, preventing throttling on the DynamoDB side that can cause Lambda retries and backpressure.

Exam trap

The trap here is that candidates often assume increasing shards (Option A) always improves throughput, but in a Lambda-integrated Kinesis stream, more shards mean more concurrent invocations, which can actually increase throttling risk.
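
The effect of batch size on invocation count can be sketched with simple arithmetic. The record rate and batch sizes below are made-up numbers, and the model deliberately ignores the batch window and the per-shard concurrency rules that also shape real behavior.

```python
import math


def invocations_per_second(records_per_second: int, batch_size: int) -> int:
    """Invocations needed each second to keep up, if every batch is full."""
    return math.ceil(records_per_second / batch_size)


# Hypothetical workload: 10,000 records arriving per second.
small_batches = invocations_per_second(10_000, 100)
large_batches = invocations_per_second(10_000, 1_000)
```

In this simplified model, a tenfold larger batch needs a tenth as many invocations for the same record rate.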

474

Multi-Select · hard

A company uses AWS Glue to transform data stored in S3. The Glue job runs daily and processes data in the range of hundreds of GB. The data engineer wants to optimize the job for cost and performance. Which THREE actions should be taken? (Choose THREE.)

Select 3 answers

A. Store intermediate data in HDFS on Amazon EMR

B. Increase the number of DPUs for the job

C. Reduce the number of DPUs to save cost

D. Use columnar data formats such as Parquet

E. Partition the data by date or other high-cardinality columns

Answers: B, D, E

More DPUs can reduce runtime, improving cost if job runs shorter.

Why this answer

Option D is correct because using columnar formats like Parquet improves performance and reduces data scanned. Option E is correct because partitioning reduces the data processed. Option B is correct because increasing the number of DPUs can speed up the job, but must be balanced with cost.

Option A is wrong because the data already lives in S3; moving intermediate data to HDFS on EMR is not relevant. Option C is wrong because reducing DPUs would increase runtime.

475

Multi-Select · medium

Which TWO options are valid methods to ingest on-premises relational database data into Amazon S3 for analytics? (Choose 2.)

Select 2 answers

A. AWS Snowball Edge

B. AWS Glue ETL job with JDBC connection to source

C. Amazon Kinesis Data Streams with Direct Put

D. AWS Database Migration Service (DMS) with S3 target

E. Amazon AppFlow

Answers: B, D

Glue can read from JDBC and write to S3.

Why this answer

AWS DMS can continuously replicate data to S3. AWS Glue ETL can connect to JDBC sources and write to S3. Both are valid ingestion methods.

476

MCQ · medium

The exhibit shows an AWS CLI command and its output. A data engineer wants to copy only objects larger than 10 MB from the S3 bucket to another bucket for processing. Which approach should be used to automate this task?

A. Use S3 replication rules to replicate objects above 10 MB

B. Use AWS CLI with a script to filter and copy objects

C. Use S3 Inventory to generate a list and then copy

D. Use AWS Lambda with S3 event notifications

Answer: B

The CLI can filter by size and copy objects using a script.

Why this answer

The command lists objects larger than 10 MB. To automate copying, a script can use the AWS CLI with the --query parameter to filter the listing by size, then iterate over the filtered list and copy each object with aws s3 cp. S3 Batch Operations can also perform an action on a list of objects, but it is not one of the options.

Option A is wrong because S3 replication is for continuous sync, not a one-time copy. Option D is wrong because Lambda with S3 events triggers only on new objects, not existing ones.

Option C is wrong because S3 Inventory provides metadata but does not copy objects itself.
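
The same filter-then-copy idea can be sketched in Python with boto3 instead of the AWS CLI (a swapped-in tool, not the one the answer names). The bucket names are whatever the caller supplies, and treating "10 MB" as 10 × 1024 × 1024 bytes is an assumption, since the exhibit is not shown.

```python
TEN_MB = 10 * 1024 * 1024  # assumed interpretation of "10 MB"


def keys_larger_than(objects: list, min_bytes: int = TEN_MB) -> list:
    """Return keys of objects whose Size exceeds min_bytes.

    `objects` is a list shaped like the "Contents" entries of an S3 listing.
    """
    return [obj["Key"] for obj in objects if obj["Size"] > min_bytes]


def copy_large_objects(source_bucket: str, dest_bucket: str) -> None:
    """List the source bucket page by page and copy every large object."""
    import boto3  # imported here so the filter above runs without AWS access

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket):
        for key in keys_larger_than(page.get("Contents", [])):
            s3.copy({"Bucket": source_bucket, "Key": key}, dest_bucket, key)
```

Keeping the size filter separate from the AWS calls makes it easy to check the selection logic on a sample listing before copying anything.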

477

MCQ · easy

A data engineer needs to ingest streaming data from thousands of IoT devices into AWS for real-time processing. The data volume peaks at 5 GB/min. Which AWS service should be used as the ingestion endpoint?

A. Amazon Kinesis Data Streams

B. AWS Glue

C. Amazon S3

D. AWS Lambda

Answer: A

Kinesis Data Streams is built for real-time streaming data ingestion.

Why this answer

Option A is correct because Amazon Kinesis Data Streams is designed for real-time streaming data ingestion. Option C is wrong because S3 is for object storage, not real-time streaming. Option D is wrong because Lambda is compute, not an ingestion endpoint.

Option B is wrong because Glue is for ETL jobs.

478

Multi-Select · medium

A company is building a data lake on Amazon S3 and needs to ingest data from multiple sources. Which of the following AWS services can be used to ingest and transform data in near real-time? (Select TWO.)

Select 2 answers

A. AWS Glue

B. Amazon Kinesis Data Firehose

C. Amazon Athena

D. AWS Step Functions

E. Amazon Simple Queue Service (SQS)

Answers: A, B

Can be used for ETL jobs triggered by S3 events.

Why this answer

Correct options: A and B. Kinesis Data Firehose can ingest streaming data and perform basic transformations. AWS Glue can be triggered by S3 events for near real-time transformation.

Option C (Amazon Athena) is a query service, not an ingestion service. Option E (Amazon SQS) is a message queue. Option D (AWS Step Functions) is for orchestration.

479

Multi-Select · hard

A company is migrating on-premises Apache Kafka clusters to Amazon MSK. The migration must be seamless with no data loss. The team is using MirrorMaker 2 to replicate data from on-premises Kafka to MSK. Which THREE configurations are necessary to ensure exactly-once semantics and minimal downtime? (Choose three.)

Select 3 answers

A. Set auto.create.topics.enable to false to prevent automatic topic creation.

B. Set offsets.topic.replication.factor to 3 for the consumer offsets topic.

C. Set replication.factor to 3 on the MSK cluster.

D. Enable TLS encryption between on-premises and MSK.

E. Configure MirrorMaker 2 to use exactly-once semantics.

Answers: B, C, E

Ensures offset data is durable and replicated.

Why this answer

C is correct because setting replication.factor to 3 ensures data durability in MSK. E is correct because enabling exactly-once semantics in MirrorMaker 2 prevents duplicates. B is correct because setting offsets.topic.replication.factor to 3 ensures consumer offsets are replicated.

A is wrong because auto.create.topics.enable should be true to allow topic creation. D is wrong because TLS is optional and not required for exactly-once semantics.

480

MCQ · hard

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The data is then processed by a Kinesis Data Analytics application running SQL queries. The analytics application is falling behind and processing records with increasing latency. The stream has 4 shards, and the average record size is 5 KB. What is the MOST effective way to improve processing latency?

A. Increase the number of shards in the Kinesis stream to 8.

B. Enable enhanced fan-out on the Kinesis stream for the analytics application.

C. Increase the parallelism of the Kinesis Data Analytics application.

D. Increase the retention period of the Kinesis stream to 7 days.

Answer: C

More parallelism allows the application to process more records per unit time.

Why this answer

Option C is correct because increasing the parallelism of the Kinesis Data Analytics application (e.g., by increasing the number of in-application streams or ParallelismPerKPU) allows it to consume from the stream faster, reducing latency. Option A is wrong because 5 KB records are well within the 1 MB/s per-shard write limit, so adding shards is unnecessary. Option B is wrong because enhanced fan-out is for consumers that need low-latency reads and does not help if the application is CPU-bound.

Option D is wrong because increasing the retention period does not affect processing speed.


481

MCQeasy

A company wants to move data from an Amazon RDS for MySQL database to Amazon Redshift for analytics. The data needs to be refreshed daily. Which AWS service is best suited for this?

A.AWS Database Migration Service (DMS)

B.AWS Glue

C.Amazon EMR

D.Amazon Athena

AnswerB

Can extract from RDS and load to Redshift with scheduling.

Why this answer

Option B is correct because AWS Glue can connect to both RDS and Redshift and run jobs on a schedule. Option A (DMS) is for ongoing replication, not necessarily for daily batch loads. Option D (Athena) queries data in S3.

Option C (EMR) is for big data processing and is overkill for a simple transfer.


482

Multi-Selecteasy

A company needs to ingest data from an on-premises Oracle database into Amazon S3 for analytics. The data volume is about 1 TB initially, with daily incremental updates of about 10 GB. Which TWO services can be combined to achieve this with minimal custom code?

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Streams

C.Amazon Athena

D.Amazon S3

E.AWS Database Migration Service (DMS)

AnswersD, E

Target for the ingested data.

Why this answer

Options D and E are correct because AWS DMS can perform the full load and ongoing replication, and S3 is the target. Option B (Kinesis Data Streams) is for real-time streaming, not database migration. Option A (Glue) can be used but often requires more custom code than DMS.

Option C (Athena) is a query service, not a data movement service.


483

Multi-Selectmedium

A company is ingesting real-time clickstream data into Amazon S3 using Amazon Kinesis Data Firehose. The data is semi-structured and the company wants to transform the data into Parquet format and partition it by year, month, day, and hour. Which TWO steps should be taken to achieve this? (Choose TWO.)

Select 2 answers

A.Set up an Amazon S3 event notification to trigger an AWS Lambda function that partitions the data after delivery.

B.Enable dynamic partitioning in Kinesis Data Firehose and specify the partition keys as year, month, day, hour extracted from the data.

C.Use an AWS Glue Crawler to infer the schema and automatically partition the data in S3.

D.Create an AWS Lambda function that transforms incoming records to Parquet and attach it to the Firehose delivery stream as a data transformation.

E.Configure Kinesis Data Firehose to convert the data to Parquet format using a schema from the AWS Glue Data Catalog.

AnswersB, D

Dynamic partitioning allows Firehose to write data into partitioned S3 prefixes.

Why this answer

Option B is correct because Kinesis Data Firehose's dynamic partitioning feature allows you to specify partition keys (year, month, day, hour) extracted from the incoming data, and Firehose will automatically create the corresponding S3 prefix structure (e.g., year=2024/month=01/day=15/hour=10/) during delivery. Option D is correct because to convert semi-structured data to Parquet format, you can attach an AWS Lambda function as a data transformation to Firehose, which converts each record to Parquet before delivery to S3.

Exam trap

AWS often tests the misconception that dynamic partitioning alone handles format conversion, but in reality, dynamic partitioning only manages the S3 prefix structure, while Parquet conversion requires a separate Lambda transformation or the use of Firehose's built-in Parquet conversion with a compatible input format.
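
The prefix layout described above (year=2024/month=01/day=15/hour=10/) can be sketched in a few lines. This is only an illustration of the Hive-style naming, not Firehose's own implementation; the function name `hive_prefix` and the sample timestamp are assumptions made for the example.

```python
from datetime import datetime, timezone


def hive_prefix(event_time: datetime) -> str:
    """Build a Hive-style S3 prefix (year=/month=/day=/hour=) from an event time."""
    t = event_time.astimezone(timezone.utc)  # normalize to UTC before partitioning
    return f"year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"


# Hypothetical clickstream event timestamp
example = hive_prefix(datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc))
print(example)  # year=2024/month=01/day=15/hour=10/
```

Zero-padded values keep the prefixes sortable and match the example path in the explanation.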


484

Multi-Selecteasy

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 for analytics. The data changes frequently and the engineer wants to capture both initial load and incremental changes with minimal latency. Which TWO AWS services should be used together? (Choose TWO.)

Select 2 answers

A.AWS Database Migration Service (DMS)

B.AWS Lambda

C.AWS Glue

D.AWS Transfer Family

E.Amazon Kinesis Data Streams

AnswersA, E

DMS can perform ongoing replication from Oracle to S3.

Why this answer

Options A and E are correct. AWS DMS can continuously replicate changes from Oracle to S3, and Amazon Kinesis Data Streams can provide low-latency streaming for near-real-time data. Option D (AWS Transfer Family) is for batch file transfer, not real-time replication.

Option C (AWS Glue) is for ETL, not replication. Option B (AWS Lambda) is for event-driven processing, not direct database replication.


485

Multi-Selecthard

Which THREE factors should a data engineer consider when choosing between AWS Glue and Amazon EMR for a data transformation job? (Choose three.)

Select 3 answers

A.The ability to output results to Amazon S3

B.The support for Apache Spark

C.The level of control over the execution environment and dependencies

D.The need for a serverless vs. cluster-based environment

E.The cost model: pay per DPU for Glue vs. per instance for EMR

AnswersC, D, E

EMR offers more control; Glue is less customizable.

Why this answer

Options C, D, and E are correct. Option A is not a deciding factor because both services can output to S3. Option B is not a deciding factor because both services support Apache Spark.


486

MCQhard

A company uses AWS DMS to migrate a 2 TB Oracle database to Amazon RDS for PostgreSQL. The migration is taking longer than expected. The task status shows 'Full load in progress' with a low 'Table throughput (rows/s)'. Which action would MOST improve throughput?

A.Enable Multi-AZ on the DMS replication instance

B.Change the target table preparation mode to 'Do nothing'

C.Increase the number of parallel tasks in the DMS task settings

D.Increase the number of shards in the source database

AnswerC

Parallel tasks allow concurrent loading of tables, increasing throughput.

Why this answer

Using multiple parallel tasks splits the load across threads, improving throughput. Increasing the DMS replication instance size also helps, but parallel tasks are more effective for large datasets.
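
As a rough sketch of what raising full-load parallelism looks like in task settings: the fragment below assumes the relevant knob is `MaxFullLoadSubTasks` under `FullLoadSettings` (the number of tables loaded in parallel), and the value 16 is an arbitrary illustration, not a recommendation from the question.

```python
import json

# Assumed fragment of a DMS task-settings document; only the parallelism
# setting is shown, and 16 is an illustrative value.
task_settings = {
    "FullLoadSettings": {
        "MaxFullLoadSubTasks": 16,  # tables loaded concurrently during full load
    }
}

settings_json = json.dumps(task_settings, indent=2)
print(settings_json)
```

The resulting JSON string is what would be supplied as the task's settings; the replication instance still needs enough CPU and memory to sustain the extra concurrent loads.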


487

MCQmedium

A company uses AWS Glue ETL to transform data from Amazon RDS for MySQL to Amazon S3. The Glue job reads from a JDBC connection. The job runs once daily and processes all records, but the data volume is growing. Which change would improve performance and reduce costs?

A.Increase the number of DPUs for the Glue job

B.Switch to a Glue Python shell job

C.Use a higher JDBC fetch size

D.Enable Glue job bookmarking and set the job to process only new data

AnswerD

Bookmarking enables incremental loads.

Why this answer

Using incremental extraction with bookmarking allows Glue to process only new data instead of full scans, reducing time and cost.


488

MCQeasy

A company needs to ingest real-time clickstream data from a web application into Amazon S3 for analytics. The data must be available within minutes of generation. Which AWS service should be used to capture and deliver this streaming data?

A.Amazon RDS

B.Amazon Kinesis Data Firehose

C.AWS Glue

D.Amazon Simple Queue Service (SQS)

AnswerB

Correct: Kinesis Data Firehose captures streaming data and delivers it to S3 with low latency.

Why this answer

Option B (Amazon Kinesis Data Firehose) is correct because it can capture streaming data and deliver it to S3 with minimal latency (typically 60 seconds). Option C (AWS Glue) is a batch ETL service, not real-time. Option D (Amazon SQS) is a message queue, not designed for direct S3 delivery.

Option A (Amazon RDS) is a relational database.


489

MCQhard

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 using AWS DMS. The change data capture (CDC) must be enabled to capture ongoing changes. Which additional AWS service is required to store the transaction logs for CDC?

A.Amazon RDS

B.Amazon S3

C.Amazon EBS

D.Amazon CloudWatch Logs

AnswerB

DMS CDC uses S3 to store Oracle transaction logs for ongoing replication.

Why this answer

Option B is correct because, per this question, AWS DMS CDC stores the Oracle transaction logs in an S3 bucket for replication. Option D (CloudWatch Logs) is for monitoring, not storage. Option A (Amazon RDS) is a database, not storage.

Option C (Amazon EBS) is block storage attached to EC2, not suitable for DMS CDC.


490

MCQhard

A data engineer is designing a streaming ingestion pipeline using Amazon Kinesis Data Streams. The stream has 10 shards, and the data volume is expected to grow by 50% over the next month. The engineer needs to ensure that the pipeline can scale without manual intervention. Which approach should be used?

A.Set up a CloudWatch Alarm to trigger a Lambda function to add shards

B.Use an Auto Scaling group to add more shards

C.Switch the Kinesis stream to on-demand capacity mode

D.Configure the stream to use a Lambda function that scales shards

AnswerC

On-demand mode automatically scales shards based on ingestion throughput.

Why this answer

Option C is correct because Kinesis Data Streams supports automatic scaling through the on-demand capacity mode, which adjusts capacity based on traffic. Option B is wrong because Auto Scaling groups are for EC2, not Kinesis. Option A is wrong because a CloudWatch alarm can trigger scaling, but resharding a stream through the API requires custom code; on-demand mode is simpler.

Option D is wrong because Lambda does not natively scale the stream.


491

Multi-Selecteasy

A data engineer needs to ingest data from multiple on-premises relational databases into Amazon S3 for analytics. The data must be transformed and loaded daily. Which THREE AWS services should the engineer use together to build this pipeline? (Choose THREE.)

Select 3 answers

A.AWS Glue

B.AWS Glue Data Catalog

C.Amazon Athena

D.AWS Database Migration Service (DMS)

E.Amazon Kinesis Data Streams

AnswersA, B, D

Performs ETL transformations on the data.

Why this answer

Options A, B, and D are correct. Option D: AWS DMS can migrate data from on-premises databases to S3. Option A: AWS Glue can transform the data.

Option B: AWS Glue Data Catalog can store metadata. Option E is wrong because Kinesis Data Streams is for real-time streaming, not batch ingestion. Option C is wrong because Athena is for querying, not for building the ingestion pipeline.


492

MCQeasy

A company wants to migrate on-premises data to Amazon S3 using AWS DataSync. The data is stored on an NFS file server and the total volume is 50 TB. The network bandwidth between the on-premises data center and AWS is 1 Gbps (gigabit per second). What is the primary factor that will determine the total time required for the initial data transfer?

A.The available network bandwidth between on-premises and AWS

B.The number of S3 buckets used as the destination

C.The average file size in the dataset

D.The IOPS (I/O operations per second) of the on-premises NFS server

AnswerA

With 50 TB and 1 Gbps, the theoretical minimum time is ~4.6 days; network bandwidth is the key constraint.

Why this answer

Option A is correct because the network bandwidth is the bottleneck for transferring 50 TB over a 1 Gbps link; the transfer takes approximately 4.6 days if the bandwidth is fully utilized, although file metadata operations and the number of files also matter. Option C is wrong because DataSync can handle large files efficiently. Option D is wrong because NFS server performance is usually not the bottleneck compared to the network.

Option B is wrong because S3 object storage can accept data at high throughput regardless of the number of buckets.
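
The transfer-time estimate can be reproduced with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a fully utilized link; the function name `transfer_days` is introduced only for this example.

```python
def transfer_days(size_bytes: float, link_bits_per_sec: float) -> float:
    """Ideal transfer time in days, ignoring protocol and metadata overhead."""
    seconds = size_bytes * 8 / link_bits_per_sec  # bytes -> bits, then divide by rate
    return seconds / 86_400  # seconds per day


ideal_days = transfer_days(50e12, 1e9)  # 50 TB over 1 Gbps
print(f"{ideal_days:.2f} days")  # 4.63 days
```

Real transfers take longer because a link is rarely saturated end to end, which is why the figure is a lower bound.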


493

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. The delivery occasionally fails due to 'ThrottlingException' from S3. What should the team do to resolve this issue without losing data?

A.Enable S3 Transfer Acceleration on the destination bucket.

B.Disable error logging in Firehose to reduce API calls.

C.Configure Firehose to deliver data to Amazon DynamoDB instead.

D.Increase the Firehose buffer size and buffer interval to reduce the number of S3 PUT requests.

AnswerD

Larger buffers mean fewer writes, reducing throttling risk.

Why this answer

Option D is correct. Kinesis Data Firehose buffers and retries failed deliveries, and increasing the buffer size or buffer interval reduces the frequency of S3 PUT requests. Option B is wrong because disabling error logging is not a solution.

Option A is wrong because S3 Transfer Acceleration is for speed, not throttling. Option C is wrong because DynamoDB is not the target.


494

MCQhard

A company uses AWS Glue ETL jobs to transform data in Amazon S3. The data is partitioned by date and hour. The job reads the latest hour's data, performs aggregations, and writes results to a separate S3 bucket. The job runs every hour and processes approximately 500 MB of input data. The team notices that the job takes longer than expected, often exceeding the 1-hour window. Which action would most effectively reduce the job's runtime?

A.Use a Python shell job instead of a Spark job.

B.Switch from using DynamicFrame to using Spark SQL for transformations.

C.Repartition the input data into more partitions before reading.

D.Increase the number of workers (DPUs) for the Glue job.

AnswerD

More workers increase parallelism, reducing runtime for the given data size.

Why this answer

The correct answer is to increase the number of workers. The job processes about 500 MB per run, and adding workers (DPUs) increases parallelism. Option C is incorrect because the job processes only one hour's data, and repartitioning would add overhead.

Option B is incorrect because using Spark SQL does not inherently improve performance. Option A is incorrect because switching to a Python shell job would not handle the transformation efficiently. Option D directly adds resources to speed up the job.


495

MCQeasy

A data engineer needs to ingest log files from multiple EC2 instances into Amazon S3. The logs are written to local disk on each instance. The engineer wants a simple agent-based solution that can collect, compress, and upload logs to S3 with minimal configuration. The solution must support incremental uploads (only new log lines) and handle log rotation. What should the engineer use?

A.Install and configure Amazon Kinesis Agent for Amazon CloudWatch to send logs to CloudWatch Logs, then use a subscription filter to export logs to S3.

B.Use AWS CLI cp command with --recursive in a cron job to copy logs to S3 every minute.

C.Install AWS DataSync agent on each EC2 instance to sync logs to S3 daily.

D.Use an S3 sync command from the AWS CLI scheduled every hour.

AnswerA

Kinesis Agent tails log files, compresses, and sends to CloudWatch Logs; export to S3 can be automated.

Why this answer

Option A is correct: Amazon Kinesis Agent for Amazon CloudWatch can tail log files, compress them, and send them to CloudWatch Logs, from which they can be exported to S3. Option C (AWS DataSync) is for bulk transfers, not streaming. Option B (AWS CLI cp in a cron job) is a manual approach that recopies files rather than uploading only new lines.

Option D (S3 sync) is not real-time.


496

MCQhard

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The Firehose delivery stream has a buffer size of 64 MB and a buffer interval of 300 seconds. The data volume is 1 GB per minute, and the average record size is 1 KB. The data must be delivered to S3 within 5 minutes of ingestion. The engineer notices that some files are being delivered after 10 minutes. What is the most likely cause?

A.The buffer size of 64 MB is too small for the data volume

B.The data is not compressed, causing larger file sizes

C.The buffer interval of 300 seconds is too long

D.The S3 bucket is throttling PUT requests due to high throughput

AnswerD

High PUT request rates can cause throttling, leading to retries and increased delivery time.

Why this answer

Option D is correct. With 1 KB records and a 64 MB buffer, it takes ~64,000 records to fill the buffer. At 1 GB/min (~1,000,000 records/min), the buffer fills in ~3.84 seconds, so the 300-second interval is not the bottleneck. At that flush rate, however, the S3 bucket can throttle PUT requests, and Firehose retries then delay delivery.

Option A is incorrect because the 64 MB buffer fills within seconds at this volume, so buffer size does not delay delivery. Option C is incorrect because, although 300 seconds is long, the data volume is high enough to trigger delivery much sooner. Option B is incorrect because compression would reduce size, and its absence does not cause delays.
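
The buffer arithmetic in the explanation can be checked directly. This sketch uses decimal units (1 MB = 10^6 bytes) to match the ~64,000-record figure; the variable names are chosen for this example only.

```python
BUFFER_BYTES = 64 * 1_000_000          # 64 MB buffer size
RECORD_BYTES = 1_000                   # 1 KB average record
INGEST_BYTES_PER_MIN = 1_000_000_000   # 1 GB per minute

records_per_buffer = BUFFER_BYTES // RECORD_BYTES        # 64,000 records
records_per_min = INGEST_BYTES_PER_MIN // RECORD_BYTES   # 1,000,000 records
fill_seconds = records_per_buffer / records_per_min * 60
flushes_per_min = 60 / fill_seconds

print(f"buffer fills in {fill_seconds:.2f} s, about {flushes_per_min:.1f} flushes per minute")
```

Because the size threshold is hit every few seconds, the 300-second interval never comes into play, which points to a downstream cause such as S3 throttling.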


497

MCQeasy

A company needs to ingest data from multiple SaaS applications (Salesforce, Marketo) and load it into Amazon Redshift. The data must be transformed before loading. Which AWS service should be used to build the ingestion pipelines?

A.AWS Database Migration Service (DMS)

B.AWS Data Pipeline

C.Amazon AppFlow

D.AWS Glue (crawlers and ETL jobs)

AnswerD

AWS Glue can connect to SaaS sources via JDBC and perform complex transformations.

Why this answer

Option D is correct because AWS Glue can connect to various sources via JDBC and perform ETL transformations. Option C (AppFlow) ingests SaaS data but offers only simple field mappings rather than complex transformations. Option A (DMS) is for database migrations, not SaaS.

Option B (Data Pipeline) is older and less flexible.


498

MCQeasy

A data engineer needs to ingest streaming data from an IoT fleet into Amazon S3 for near-real-time analytics. The data volume is approximately 5 GB per hour, and each event is less than 1 KB. Which AWS service should be used as the ingestion endpoint?

A.AWS IoT Core

B.AWS DataSync

C.Amazon AppFlow

D.Amazon Kinesis Data Streams

AnswerA

Designed for IoT device ingestion.

Why this answer

AWS IoT Core is purpose-built for ingesting data from IoT devices, supporting MQTT, HTTP, and WebSocket protocols. It can handle millions of devices and high-throughput, small-message payloads (each event <1 KB) and integrates directly with Amazon S3 via IoT Core rules, making it the ideal ingestion endpoint for near-real-time analytics on streaming IoT data.

Exam trap

The trap here is that candidates often default to Amazon Kinesis Data Streams for any streaming workload, overlooking that AWS IoT Core is the specialized, fully managed service designed specifically for IoT device ingestion, with native MQTT support and direct S3 integration via rules.

How to eliminate wrong answers

Option B (AWS DataSync) is wrong because it is designed for one-time or scheduled bulk data transfers between on-premises storage and AWS, not for continuous, near-real-time streaming ingestion from IoT devices. Option C (Amazon AppFlow) is wrong because it is a fully managed integration service for transferring data between SaaS applications (e.g., Salesforce, Slack) and AWS, not for ingesting IoT device telemetry streams. Option D (Amazon Kinesis Data Streams) is wrong because while it can ingest streaming data, it is a generic stream processing service that requires additional configuration (e.g., Kinesis Data Firehose) to write to S3, and it is not the dedicated IoT ingestion endpoint; AWS IoT Core is the recommended first-hop for IoT data.


499

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for clickstream data. The data arrives in batches of 10-50 MB every 5 seconds. The engineer needs to buffer the data, perform simple transformations (e.g., add timestamp, remove PII), and land it in S3 within 10 minutes. Which TWO services should be combined? (Choose TWO.)

Select 2 answers

A.Amazon Simple Queue Service (SQS)

B.Amazon Kinesis Data Firehose

C.AWS Lambda

D.AWS Glue ETL

E.Amazon Kinesis Data Streams

AnswersB, C

Firehose can buffer and invoke Lambda for transformation, then deliver to S3.

Why this answer

Option B (Kinesis Data Firehose) can buffer the data and invoke Lambda for transformation before delivering to S3. Option C (Lambda) can perform the transformation. Option A (SQS) is for decoupling, not streaming to S3.

Option E (Kinesis Data Streams) requires a custom consumer. Option D (Glue) is batch-oriented.
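
A minimal sketch of the Lambda side of this pairing follows. It uses the Firehose transformation contract (base64-encoded `data`, and a response carrying `recordId`, `result`, and `data` per record); the PII field names and the `processed_at` field are hypothetical, chosen only to illustrate "remove PII, add timestamp".

```python
import base64
import json
from datetime import datetime, timezone

PII_FIELDS = {"email", "ip_address"}  # hypothetical field names for illustration


def lambda_handler(event, context):
    """Strip assumed PII fields and add a timestamp to each Firehose record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            for field in PII_FIELDS:
                payload.pop(field, None)
            payload["processed_at"] = datetime.now(timezone.utc).isoformat()
            line = json.dumps(payload) + "\n"  # newline-delimited JSON in S3
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, TypeError, AttributeError):
            # Unparseable or non-object payload: hand it back marked as failed
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}
```

Records returned as `ProcessingFailed` are not silently dropped; Firehose treats them as unsuccessfully processed, so the original payload is passed back unchanged.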


500

MCQmedium

A data engineer runs the command shown. The consumer application is unable to read data older than 24 hours. What is the most likely cause?

A.The shard has reached its maximum sequence number.

B.The stream is encrypted with KMS, preventing access.

C.The retention period is set to 24 hours, so data older than 24 hours is deleted.

D.The stream is in ACTIVE status but not processing data.

AnswerC

Data retention is 24 hours; data beyond that is expired.

Why this answer

The stream's retention period is 24 hours, meaning data is automatically deleted after 24 hours. The consumer tries to read data older than 24 hours, which is no longer available.


501

MCQmedium

Refer to the exhibit. A data engineer is troubleshooting a Glue job that reads objects from this S3 bucket. The job runs successfully but produces no output. The Glue catalog table points to the same S3 path. What is the most likely cause?

A.The S3 key does not follow Hive-style partitioning (e.g., year=2024/month=01).

B.The object metadata is too large.

C.The StorageClass is not supported by Glue.

D.The ContentType is not supported by Glue.

AnswerA

Glue discovers partitions from Hive-style key prefixes.

Why this answer

Option A is correct because Glue expects a partition structure (e.g., year=2024/month=01/day=01) to automatically discover partitions. The current key has no partition markers. Option D is wrong because Glue can read application/octet-stream.

Option C is wrong because the storage class is STANDARD. Option B is wrong because object metadata is irrelevant to the Glue catalog.
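
To make the distinction concrete, the sketch below extracts Hive-style `name=value` segments from an S3 key. Both sample keys are hypothetical (the exhibit is not reproduced here), and the parser only illustrates the naming convention, not how Glue itself resolves partitions.

```python
def partitions_from_key(key: str) -> dict:
    """Return the name=value partition segments found in an S3 object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # skip the final segment (the file name)
        name, sep, value = segment.partition("=")
        if sep and name and value:
            parts[name] = value
    return parts


# Hypothetical keys for illustration
hive_style = partitions_from_key("logs/year=2024/month=01/day=01/part-0000.parquet")
plain = partitions_from_key("logs/2024/01/01/part-0000.parquet")
print(hive_style)  # {'year': '2024', 'month': '01', 'day': '01'}
print(plain)       # {}
```

A key laid out like the second one carries no partition names, so nothing in the path maps to the partition columns of a catalog table.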


502

MCQhard

A company is ingesting data from multiple on-premises databases into AWS using AWS Database Migration Service (DMS). The data must be continuously replicated with minimal downtime. However, the source databases do not support native CDC. What should the data engineer do to enable continuous replication?

A.Use Amazon Kinesis Data Streams with a custom producer to capture database changes.

B.Use Amazon Redshift Spectrum to directly query the on-premises databases.

C.Use AWS DMS with log-based CDC if the source databases support it; otherwise, use DMS with batch replication and schedule frequent refreshes.

D.Set up AWS Glue jobs to run every minute to extract and load the data.

AnswerC

DMS supports CDC via source database logs, and if not available, batch replication can approximate continuous sync.

Why this answer

Option C is correct because DMS can use the source engine's log-based CDC capabilities (e.g., Oracle LogMiner, MySQL binlog) where available; where they are not, batch replication with frequent scheduled refreshes approximates continuous sync. Option D is wrong because Glue is not for real-time CDC from databases. Option A is wrong because Kinesis is for streaming data, not for pulling from databases.

Option B is wrong because Redshift is a target, not a replication tool.


503

MCQeasy

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an on-premises Oracle database to Amazon S3. The data is stored as CSV files. The downstream team requires the data to be in Apache Parquet format. Which change should the data engineer make to the DMS task?

A.Modify the DMS task to use Apache Parquet as the target table preparation mode.

B.Add an S3 lifecycle rule to convert CSV to Parquet.

C.Change the DMS task to use full load instead of continuous replication.

D.Configure a Lambda function to transform data after DMS writes to S3.

AnswerA

DMS can write directly in Parquet format.

Why this answer

Option A is correct because DMS can write directly to the S3 target in Parquet format (on the S3 target endpoint this is the `DataFormat` setting). Option B is wrong because S3 lifecycle policies do not convert formats. Option D is wrong because DMS does not support Lambda transformations natively.

Option C is wrong because switching to full load only would stop continuous replication.


504

MCQhard

A company uses Kinesis Data Firehose with a Lambda function for data transformation. The transformation is failing intermittently due to Lambda timeouts. The maximum record size is 1 MB. What is the most cost-effective way to reduce failures without losing data?

A.Use Kinesis Data Analytics to pre-process data before Firehose

B.Decrease the Firehose batch size to reduce the number of records per invocation

C.Configure the Firehose delivery stream to send failed records to an S3 dead-letter bucket

D.Increase the Lambda function timeout and memory allocation

AnswerD

Increasing timeout and memory reduces timeouts without losing data.

Why this answer

Option D is correct because increasing the Lambda timeout and memory reduces timeouts while remaining cost-effective. Option C is wrong because routing failed records to a dead-letter bucket does not reduce the timeouts; the records still fail transformation. Option B is wrong because decreasing the batch size may increase costs.

Option A is wrong because using Kinesis Data Analytics adds unnecessary complexity and cost.


505

MCQeasy

A data engineer needs to ingest data from an Amazon S3 bucket into an Amazon Redshift table on a daily schedule. The data is in CSV format and the schema matches. Which service is simplest for this batch ingestion?

A.Amazon Redshift COPY command

B.AWS Glue ETL job with JDBC connection

C.AWS Data Pipeline

D.Amazon Athena CREATE TABLE AS SELECT

AnswerA

Direct and optimized for loading from S3.

Why this answer

Redshift COPY command loads from S3 efficiently. Glue and Data Pipeline also work but are more complex. Athena is for querying, not loading.


506

MCQmedium

A data engineer is using Amazon EMR to transform large datasets stored in S3. The cluster runs once a day and takes 3 hours. The engineer notices that the cluster is idle for 30 minutes at the start while waiting for resources. What is the most cost-effective way to reduce the idle time?

A.Increase the instance type size

B.Use Spot Instances for all nodes

C.Configure a larger initial core instance count and enable managed scaling

D.Purchase Reserved Instances for the cluster

AnswerC

More core nodes reduce the time to allocate resources, and managed scaling adjusts during the job.

Why this answer

Option C is correct because using a cluster with managed scaling and a larger initial core node count can reduce the time spent waiting for resource allocation. Option B (Spot Instances) may cause interruptions. Option D (Reserved Instances) is for long-term use, not transient clusters.

Option A (increasing instance size) may reduce processing time but not idle time.


507

MCQmedium

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The on-premises network has a 1 Gbps link to AWS. The transfer must complete within 5 days. Which solution is MOST cost-effective and meets the requirements?

A.Use Amazon S3 Transfer Acceleration to speed up the transfer over the internet.

B.Use AWS DataSync to transfer the data over the existing network link.

C.Use AWS Snowball Edge to physically transfer the data.

D.Use AWS Direct Connect to establish a dedicated network connection.

AnswerC

Snowball Edge is a physical device that can transfer large amounts of data quickly and cost-effectively, especially when network bandwidth is limited.

Why this answer

Option C is correct because AWS Snowball Edge is a physical device that can move large amounts of data faster than a constrained network link. Over a 1 Gbps link, 50 TB takes about 4.6 days at the theoretical maximum, and actual throughput will be lower due to overhead. Snowball Edge can transfer 50 TB in a few days and is cost-effective for large data volumes. Option B (AWS DataSync) is efficient for online transfers but may not meet the 5-day deadline over 1 Gbps.

Option A (Amazon S3 Transfer Acceleration) speeds up transfers but is still limited by network bandwidth. Option D (AWS Direct Connect) would require additional setup and cost, and the transfer would still be limited by the 1 Gbps link.
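
A short calculation shows why the online transfer is marginal against the 5-day deadline. It assumes decimal units, and the 80% effective link utilization is an assumed figure for illustration, not a value given in the question.

```python
def transfer_days(size_bytes: float, link_bits_per_sec: float, utilization: float = 1.0) -> float:
    """Transfer time in days at a given fraction of the nominal link rate."""
    seconds = size_bytes * 8 / (link_bits_per_sec * utilization)
    return seconds / 86_400


DEADLINE_DAYS = 5
at_line_rate = transfer_days(50e12, 1e9)        # theoretical best case
at_80_percent = transfer_days(50e12, 1e9, 0.8)  # assumed effective utilization

print(f"line rate: {at_line_rate:.2f} days, 80% utilization: {at_80_percent:.2f} days")
print("meets deadline at 80%:", at_80_percent <= DEADLINE_DAYS)
```

Even a modest shortfall from line rate pushes the online transfer past five days, which is the margin the explanation relies on.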


508

Multi-Selectmedium

A data engineer is building a batch ETL pipeline using AWS Glue. The source data is in Amazon RDS for MySQL. The pipeline must run daily and process only new and modified records since the last run. The engineer needs to implement change data capture (CDC) efficiently. Which THREE steps should the engineer take? (Choose THREE.)

Select 3 answers

A.Configure the Glue job to use a JDBC connection with a SQL query that reads from the binary log.

B.Ingest the RDS data into Kinesis Data Streams and then use Glue.

C.Set a job bookmark in Glue to track processed records.

D.Enable binary logging (binlog) on the RDS MySQL instance.

E.Use a full table scan with a WHERE clause on a timestamp column.

AnswersA, C, D

Glue can read CDC from binlog.

Why this answer

Options A, C, and D are correct. Enable binary logging on the RDS MySQL instance for CDC (D), configure the Glue JDBC connection to read the changes (A), and set a job bookmark in Glue to track processed records (C). Option E is wrong because scanning the entire table daily is inefficient.

Option B is wrong because Kinesis Data Streams is for real-time streaming, not batch CDC.


509

MCQmedium

A data engineer needs to ingest JSON files from an on-premises SFTP server into Amazon S3. The files are uploaded daily and each file is up to 500 MB. The solution must be serverless and minimize cost. Which service should the engineer use?

A.Amazon Kinesis Data Firehose.

B.AWS DataSync with an on-premises agent.

C.Amazon S3 Transfer Acceleration.

D.AWS Transfer Family (SFTP) endpoint.

AnswerD

Fully managed SFTP service that writes directly to S3.

Why this answer

Option D is correct because AWS Transfer Family supports SFTP and writes files directly to S3, with no servers to manage. Option B is wrong because AWS DataSync requires an agent on-premises. Option C is wrong because S3 Transfer Acceleration is for speeding up transfers to S3, not for ingesting from SFTP.

Option A is wrong because Kinesis Data Firehose is for streaming data, not file-based SFTP transfers.


510

MCQhard

A data engineering team is troubleshooting a slow AWS Glue ETL job that reads from an Amazon DynamoDB table and writes to Amazon S3 in Parquet format. The job processes 50 GB of data. Which action would most effectively improve job performance?

A.Use S3 Select to push down filters

B.Reduce the batch size in the DynamoDB connector

C.Increase the number of DPUs

D.Change output to JSON format to reduce overhead

AnswerC

More DPUs increase parallelism and can speed up the job.

Why this answer

Increasing the number of DPUs (Data Processing Units) allocated to the Glue job provides more parallelism and memory, which can significantly speed up processing. Using G.1X worker type with more memory can also help. Using S3 Select is for filtering within S3, not for DynamoDB.

Changing to JSON format may reduce performance due to larger file size. Reducing batch size could slow down the job.
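
As a rough sketch of what "more DPUs" means in practice, the helper below builds the arguments that boto3's `glue.start_job_run` accepts for worker sizing. The job name is hypothetical, and the DPU-per-worker figures assume the G.1X and G.2X worker types only.

```python
# DPUs per worker for two common Glue worker types (assumption: only these two are used).
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2}


def glue_run_args(job_name: str, workers: int, worker_type: str = "G.1X") -> dict:
    """Build keyword arguments for boto3's glue.start_job_run."""
    if worker_type not in DPU_PER_WORKER:
        raise ValueError(f"unsupported worker type in this sketch: {worker_type}")
    if workers < 1:
        raise ValueError("workers must be positive")
    return {"JobName": job_name, "WorkerType": worker_type, "NumberOfWorkers": workers}


def total_dpus(args: dict) -> int:
    """Total DPUs implied by a set of run arguments."""
    return args["NumberOfWorkers"] * DPU_PER_WORKER[args["WorkerType"]]


# Hypothetical job name; with real credentials this would be passed as
# boto3.client("glue").start_job_run(**args)
args = glue_run_args("dynamodb-to-parquet", 10)
print(args, total_dpus(args))
```

Because Glue bills by DPU-hours, doubling the workers only pays off when the runtime drops roughly in proportion.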

Practice this question →

511

MCQhard

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data arrives as JSON via AWS IoT Core, and must be stored in Amazon S3 in partitioned Parquet format. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which combination of services should the engineer use?

A.Amazon Kinesis Data Streams with AWS Lambda for transformation and Amazon S3.

B.Amazon Simple Queue Service (SQS) with AWS Lambda for transformation and Amazon S3.

C.AWS Glue streaming jobs consuming from Amazon Kinesis Data Streams and writing to Amazon S3.

D.Amazon Kinesis Data Firehose with data transformation via AWS Lambda, delivering to Amazon S3.

AnswerD

Firehose supports Parquet conversion and partitioning; Lambda handles transformation.

Why this answer

Amazon Kinesis Data Firehose is the correct choice because it can directly ingest streaming data from AWS IoT Core, use a built-in AWS Lambda function to transform JSON to Parquet, and deliver the data to Amazon S3 with automatic partitioning. It also supports buffering and retry logic to handle late-arriving data (up to 1 hour) and provides exactly-once delivery to S3 when configured with the appropriate error handling and idempotent transformations.

Exam trap

The trap here is that candidates often choose Kinesis Data Streams with Lambda (Option A) because they think it offers more control, but they overlook that Firehose provides a managed, exactly-once, partitioned Parquet delivery pipeline with built-in late-arriving data handling, which is the exact requirement in the question.

How to eliminate wrong answers

Option A is wrong because Amazon Kinesis Data Streams with AWS Lambda requires custom code to manage checkpointing, partitioning, and exactly-once semantics, and does not natively support Parquet conversion or S3 delivery without additional complexity. Option B is wrong because Amazon SQS does not guarantee exactly-once processing (standard queues offer at-least-once, FIFO queues offer exactly-once but lack native streaming integration with IoT Core and Parquet transformation). Option C is wrong because AWS Glue streaming jobs consume from Kinesis Data Streams, not directly from IoT Core, and they do not provide built-in exactly-once delivery to S3; they rely on checkpointing that can lead to duplicates or data loss in failure scenarios.

Practice this question →

512

MCQeasy

A company wants to ingest streaming data from IoT devices into Amazon S3 using Amazon Kinesis Data Firehose. The data must be transformed from JSON to Parquet format before landing in S3. What is the SIMPLEST way to achieve this?

A.Configure Kinesis Data Firehose with a built-in Parquet converter.

B.Use an AWS Lambda function as a data transformation in Kinesis Data Firehose to convert JSON to Parquet.

C.Use Kinesis Data Firehose to deliver data directly to S3 in JSON format and run a nightly Glue job to convert to Parquet.

D.Use Kinesis Data Analytics to convert the data to Parquet before sending to Firehose.

AnswerB

Lambda can transform the data and convert to Parquet before delivery.

Why this answer

Option B is correct because Kinesis Data Firehose can invoke an AWS Lambda function for data transformation, and the function can convert JSON to Parquet. Option A is wrong because Firehose does not natively support Parquet conversion without a transformation. Option C is wrong because Firehose can transform the data in flight without a separate Glue job.

Option D is wrong because Firehose can deliver to S3 directly, so Kinesis Data Analytics is unnecessary.

Practice this question →

513

MCQhard

A data engineer is troubleshooting a slow AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job processes 10 GB of CSV data. The engineer notices that the job runs with a single DPU and takes longer than expected. Which change would MOST likely improve performance?

A.Replace Redshift with Amazon Redshift Spectrum.

B.Change the input format to Parquet and enable predicate pushdown.

C.Use a JDBC connection to read data directly from S3.

D.Increase the number of DPUs and configure the job to use the S3 list implementation for parallel reads.

AnswerD

More DPUs allow parallel processing, and S3 list implementation improves file discovery.

Why this answer

Option D is correct because increasing DPUs and using the S3 list implementation can parallelize reading. Option A is wrong because Redshift Spectrum queries data in place, but this job is ETL, not querying. Option B is wrong because Parquet is more efficient, but the bottleneck is more likely parallelism.

Option C is wrong because JDBC connections are for databases, not S3.

Practice this question →

514

MCQeasy

A startup is building a data pipeline to ingest user activity logs from a mobile app. The logs are sent in real-time via HTTP POST requests. The data volume is low (a few hundred requests per second) but can spike to a few thousand during promotions. The team wants to store the logs in Amazon S3 for analysis. They also need to be able to query the data using Amazon Athena with minimal latency. The data must be transformed from JSON to Parquet and partitioned by date. The team is considering using Amazon API Gateway with AWS Lambda to receive the logs and write to S3. However, they are concerned about Lambda cold starts and the complexity of handling spikes. Which alternative solution should they choose?

A.Use Amazon API Gateway with AWS Lambda that sends logs to Amazon SQS, then a separate Lambda reads from SQS and writes to S3

B.Use Amazon Kinesis Data Firehose with an HTTP endpoint as source, enable Parquet conversion, and deliver to S3 with dynamic partitioning

C.Use Amazon Kinesis Data Streams with AWS Lambda to process and write to S3

D.Use Amazon EMR with Spark Streaming to ingest logs from a custom endpoint

AnswerB

Firehose handles ingestion, transformation, and partitioning with automatic scaling.

Why this answer

Option B is correct because Kinesis Data Firehose can receive records over HTTP (via API Gateway or directly through the Firehose API), automatically buffers data, converts it to Parquet, and writes to S3 partitioned by date. This handles spikes without custom code. Option A (Lambda + SQS) adds complexity and still faces cold starts.

Option C (Kinesis Data Streams + Lambda) still requires Lambda and has cold start issues. Option D (EMR) is overkill.
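
To make "partitioned by date" concrete, here is a minimal sketch of how a Hive-style S3 prefix can be derived from a record's event time, which is the layout Athena can use for partition pruning. The `logs/` base prefix and the `year=/month=/day=` key names are assumptions for illustration, not something the question specifies.

```python
from datetime import datetime, timezone


def date_partition_prefix(epoch_seconds: float, base: str = "logs/") -> str:
    """Return a Hive-style S3 prefix (UTC) for the given event time."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}year={day:%Y}/month={day:%m}/day={day:%d}/"


print(date_partition_prefix(1700000000))
```

Using UTC for the partition key avoids records near midnight landing in different partitions depending on where the code runs.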

Practice this question →

515

MCQmedium

A company needs to ingest data from multiple SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for analytics. The data sources have different schemas and update frequencies. Which AWS service should be used to build this ingestion pipeline with minimal code?

A.AWS Data Pipeline

B.AWS Glue

C.Amazon Kinesis Data Firehose

D.Amazon AppFlow

AnswerD

AppFlow is designed to ingest data from SaaS applications to S3 with minimal code.

Why this answer

Option D is correct because Amazon AppFlow is purpose-built for ingesting data from SaaS applications such as Salesforce and Marketo into S3 with minimal code. Option A is wrong because AWS Data Pipeline orchestrates data movement between AWS services and has no purpose-built SaaS connectors. Option B is wrong because AWS Glue has connectors for some SaaS applications but requires more setup than AppFlow.

Option C is wrong because Kinesis Data Firehose is for streaming data, not for batch extraction from SaaS APIs.

Practice this question →

516

MCQeasy

A company wants to ingest data from multiple SaaS applications into Amazon S3 using a fully managed service that supports schema discovery and transformation. Which AWS service should they use?

A.Amazon Kinesis Data Firehose

B.Amazon AppFlow

C.AWS Glue

D.AWS Data Pipeline

AnswerB

Fully managed service for SaaS data ingestion with schema discovery.

Why this answer

Option B is correct because Amazon AppFlow is a fully managed integration service that supports SaaS sources, schema discovery, and data transformation. Option A (Amazon Kinesis Data Firehose) is for streaming data. Option C (AWS Glue) is for ETL but not for SaaS ingestion directly.

Option D (AWS Data Pipeline) is not fully managed for SaaS.

Practice this question →

517

MCQhard

A company uses Amazon Kinesis Data Firehose to ingest JSON logs from multiple sources into an S3 data lake. The data is then consumed by Amazon Athena for analysis. Recently, some queries have been failing with the error 'HIVE_BAD_DATA: Field xyz's type is an unsupported type'. The firehose delivery stream transforms the data using a Lambda function that converts timestamps to Unix epoch. What is the MOST likely cause of the query failure?

A.Some records contain timestamps that were not converted to epoch, so Athena infers the column as a string.

B.The data is in JSON format instead of Parquet.

C.The S3 partitions are not registered in the Glue Data Catalog.

D.The IAM role for Firehose does not have permission to write to S3.

AnswerA

Inconsistent data types in a column cause Athena to default to string, leading to type mismatch when queried.

Why this answer

The error indicates that the data type detected by Athena's schema inference does not match the actual data. The Lambda function converts timestamps to Unix epoch (a number), so if some records are not converted, the column holds mixed types and Athena may infer it as a string. Option B is wrong because the JSON data format is likely compatible.

Option C is wrong because unregistered partitions do not cause this error. Option D is wrong because missing S3 permissions would cause a different error.
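
One way to avoid mixed types is to have the transformation function reject any record whose timestamp cannot be converted, instead of passing it through unchanged. The sketch below follows the Firehose data-transformation contract (base64 `data`, a `recordId`, and a `result` of `Ok` or `ProcessingFailed`); the `timestamp` field name and the accepted ISO 8601 input format are assumptions.

```python
import base64
import json
from datetime import datetime, timezone


def to_epoch(value):
    """Return epoch seconds for a number or an ISO 8601 string."""
    if isinstance(value, bool):
        raise ValueError("boolean is not a timestamp")
    if isinstance(value, (int, float)):
        return int(value)
    parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
    return int(parsed.timestamp())


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["timestamp"] = to_epoch(payload["timestamp"])  # hypothetical field name
            data = base64.b64encode((json.dumps(payload) + "\n").encode()).decode()
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (KeyError, TypeError, ValueError):
            # Unconvertible records are flagged as failed so they never reach the table.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}
```

Records marked `ProcessingFailed` are kept out of the main output, so every delivered row carries a numeric timestamp.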

Practice this question →

518

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for real-time user activity logs. The logs are generated by a web application and must be ingested into Amazon S3 with minimal latency (under 1 minute). The logs also need to be queried in Amazon Athena. The engineer considers using Amazon Kinesis Data Firehose. Which TWO configurations are required to achieve near-real-time delivery to S3? (Choose TWO.)

Select 2 answers

A.Set the BufferIntervalInSeconds to 60 seconds.

B.Enable S3 compression (e.g., GZIP) on the delivery stream.

C.Enable Amazon CloudWatch error logging for the delivery stream.

D.Enable data format conversion to Parquet using AWS Glue.

E.Set the BufferSizeInMBs to 1 MB.

AnswersA, E

Controls how often data is delivered.

Why this answer

Options A and E are correct. Set 'BufferIntervalInSeconds' to 60 seconds to flush every minute, and 'BufferSizeInMBs' to 1 MB so that small batches are delivered quickly. Option B is wrong because enabling S3 compression does not affect delivery frequency.

Option C is wrong because enabling error logging does not reduce latency. Option D is wrong because converting to Parquet is a transformation, not a delivery trigger.
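
The two buffer settings interact: Firehose delivers a batch as soon as either threshold is reached. The sketch below models that rule for a steady ingest rate. Note that in the boto3 API these settings appear under `BufferingHints` as `SizeInMBs` and `IntervalInSeconds`; the ingest rates used here are made-up numbers.

```python
def buffering_hints(size_mb: int, interval_s: int) -> dict:
    """Shape of the BufferingHints structure used by Firehose S3 destinations."""
    return {"SizeInMBs": size_mb, "IntervalInSeconds": interval_s}


def seconds_until_flush(rate_mb_per_s: float, hints: dict) -> float:
    """Time until delivery: whichever of size or interval is reached first."""
    if rate_mb_per_s <= 0:
        return float(hints["IntervalInSeconds"])
    time_to_fill = hints["SizeInMBs"] / rate_mb_per_s
    return min(time_to_fill, hints["IntervalInSeconds"])


hints = buffering_hints(size_mb=1, interval_s=60)
print(seconds_until_flush(0.01, hints))  # slow trickle: the interval fires first
print(seconds_until_flush(0.5, hints))   # faster stream: the size threshold fires first
```

With these settings the worst-case wait is bounded by the 60-second interval, which is what keeps delivery under one minute.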

Practice this question →

519

MCQeasy

A company uses AWS Glue ETL jobs to transform data stored in Amazon S3. The job reads data in Parquet format, applies transformations, and writes the output back to S3 in Parquet format. The team wants to improve the job's performance and reduce costs. Which action is MOST effective?

A.Change the input format from Parquet to CSV to simplify parsing.

B.Coalesce the input data into a single large file before processing.

C.Use column pruning and predicate pushdown to read only necessary columns and filter data early.

D.Increase the number of workers to maximum allowed.

AnswerC

Reduces the amount of data processed, improving performance and reducing costs.

Why this answer

Option C is correct: use column pruning and predicate pushdown. Reading only necessary columns and filtering early reduces data scanned and processing time. Option A (switching to CSV) would increase data size and slow performance.

Option B (using a single large file) reduces parallelism and may harm performance. Option D (increasing worker count) increases cost and may not be needed.
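
The effect of pruning and pushdown can be seen with a toy column-oriented table. This is only an illustrative model in plain Python, not the Glue or Parquet API, and the table contents are invented; it counts how many values must be read to answer a query.

```python
def scan(table: dict, columns: list, predicate_column: str, predicate) -> tuple:
    """Return (rows, values_read) for a query over a column-oriented table."""
    # Predicate pushdown: evaluate the filter on one column before touching the others.
    keep = [i for i, value in enumerate(table[predicate_column]) if predicate(value)]
    values_read = len(table[predicate_column])
    # Column pruning: fetch only the requested columns, and only for surviving rows.
    rows = [{c: table[c][i] for c in columns} for i in keep]
    values_read += len(keep) * len([c for c in columns if c != predicate_column])
    return rows, values_read


orders = {
    "order_id": [1, 2, 3, 4],
    "region": ["EU", "US", "EU", "US"],
    "amount": [10.0, 20.0, 30.0, 40.0],
    "notes": ["a", "b", "c", "d"],
}
rows, values_read = scan(orders, ["order_id", "amount"], "region", lambda r: r == "EU")
full_scan = sum(len(col) for col in orders.values())
print(rows, values_read, full_scan)
```

In this toy case the query touches half as many values as a full scan; a row-oriented format such as CSV cannot skip columns this way.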

Practice this question →

520

MCQhard

Refer to the exhibit. A data engineer runs this AWS Glue job but it fails with an error that the table 'orders' does not exist in the 'sales_db' database. The engineer has verified that the table exists in the AWS Glue Data Catalog. What is the most likely cause of the error?

A.The IAM role used by the Glue job does not have permission to read the Data Catalog

B.The Glue job has job bookmark enabled and is skipping the table

C.The script uses 'create_dynamic_frame.from_catalog' incorrectly

D.The S3 path 's3://data-lake/raw/' does not exist

AnswerA

The job needs glue:GetTable permission to access the table metadata.

Why this answer

Option A is correct because the Glue job needs permissions to access the Data Catalog. The IAM role associated with the job must have 'glue:GetTable' permission. Option B is wrong because the job bookmark setting does not affect table discovery.

Option C is wrong because the script is syntactically correct. Option D is wrong because the error is about the table not existing, not about the S3 path.
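
A minimal sketch of an IAM policy statement that grants the catalog read access discussed above is shown below. The region and account ID are placeholders, and including `glue:GetDatabase` and `glue:GetPartitions` alongside `glue:GetTable` is an assumption about what a typical read job needs.

```python
import json


def catalog_read_policy(region: str, account_id: str, database: str, table: str) -> dict:
    """Build an IAM policy document allowing read access to one catalog table."""
    base = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
            "Resource": [
                f"{base}:catalog",
                f"{base}:database/{database}",
                f"{base}:table/{database}/{table}",
            ],
        }],
    }


# Placeholder region and account ID; the database and table names come from the question.
policy = catalog_read_policy("us-east-1", "123456789012", "sales_db", "orders")
print(json.dumps(policy, indent=2))
```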

Practice this question →

521

MCQhard

Refer to the exhibit. A data engineer runs an AWS Glue ETL job that writes output to an S3 bucket. The job fails with the error shown. What is the most likely cause?

A.The IAM role used by the Glue job lacks the s3:PutObject permission for the output bucket

B.The Glue job attempted to write data in an unsupported format

C.The S3 bucket does not exist

D.The output file name contains invalid characters

AnswerA

The error explicitly states the role is not authorized to perform s3:PutObject.

Why this answer

Option A is correct because the error message indicates the GlueServiceRole does not have s3:PutObject permission on the bucket. Option B is wrong because Parquet format is not the issue. Option C is wrong because the bucket exists (the error shows it in the ARN).

Option D is wrong because the error is about permissions, not the file name.

Practice this question →

522

MCQhard

A data engineer is ingesting data from a third-party API into Amazon S3 using AWS Lambda. The API returns a JSON payload of up to 10 MB per request. The Lambda function runs every minute. Occasionally, the function times out after 15 seconds. What is the most likely cause?

A.The Lambda function is in a VPC without a NAT gateway, causing network timeouts.

B.The Lambda function's timeout setting is too short.

C.The Lambda function's memory is too low, causing slow processing.

D.The API response size exceeds the Lambda invocation payload limit.

AnswerB

Default timeout is 3 seconds; increase to handle larger payloads.

Why this answer

Option B is correct because Lambda has a default timeout of 3 seconds that can be raised to as much as 15 minutes. A function that times out at 15 seconds has its timeout set to 15 seconds, and increasing it resolves the issue.

Option A is wrong because VPC configuration can cause delays but would not specifically produce this timeout. Option C is wrong because more memory may help, but the timeout setting is the immediate cause. Option D is wrong because the 10 MB response is fetched by the function itself rather than passed as an invocation payload.
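
The timeout can be raised through the `Timeout` parameter of boto3's `update_function_configuration`. The helper below only builds and validates those arguments; the function name and the 60-second value are hypothetical choices for illustration.

```python
DEFAULT_TIMEOUT_SECONDS = 3   # Lambda default
MAX_TIMEOUT_SECONDS = 900     # 15 minutes, the Lambda maximum


def timeout_update_args(function_name: str, timeout_seconds: int) -> dict:
    """Build keyword arguments for lambda.update_function_configuration."""
    if not 1 <= timeout_seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError(f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds")
    return {"FunctionName": function_name, "Timeout": timeout_seconds}


# Hypothetical function name; with real credentials this would be passed as
# boto3.client("lambda").update_function_configuration(**args)
args = timeout_update_args("api-to-s3-ingest", 60)
print(args)
```

Since the function runs every minute, a timeout just under 60 seconds keeps one slow run from overlapping the next.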

Practice this question →

523

MCQeasy

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time analytics. Which AWS service is best suited for this purpose?

A.Amazon S3

B.AWS Lambda

C.Amazon RDS

D.Amazon Kinesis Data Streams

AnswerD

It is designed for real-time streaming data ingestion.

Why this answer

Amazon Kinesis Data Streams is designed for real-time streaming data ingestion from many sources. Option A is wrong because S3 is for object storage, not real-time streaming. Option B is wrong because Lambda processes data but is not a primary ingestion service.

Option C is wrong because RDS is a relational database, not for streaming ingestion.

Practice this question →

524

MCQhard

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis stream with 10 shards and writes to an S3 bucket. The application is experiencing high latency. Analysis shows that the application is not keeping up with the incoming data rate. Which action would MOST effectively reduce latency?

A.Increase the number of shards in the Kinesis stream

B.Increase the Parallelism of the Flink application

C.Enable exactly-once delivery to S3

D.Use a larger Kinesis Data Analytics application (increase KPU)

AnswerB

Higher parallelism allows more concurrent processing.

Why this answer

Increasing the parallelism of the Flink application allows it to process more data concurrently. This directly addresses the processing bottleneck.
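
Parallelism and KPUs are linked: in Managed Service for Apache Flink (the current name for this service), the KPUs allocated to an application follow from its `Parallelism` and `ParallelismPerKPU` settings. The sketch below shows that arithmetic under the assumption that the result is rounded up and that any extra KPU charged for orchestration is ignored.

```python
import math


def kpus_for(parallelism: int, parallelism_per_kpu: int = 1) -> int:
    """KPUs implied by Parallelism / ParallelismPerKPU, rounded up."""
    if parallelism < 1 or parallelism_per_kpu < 1:
        raise ValueError("both settings must be positive")
    return math.ceil(parallelism / parallelism_per_kpu)


print(kpus_for(10))      # one task per KPU
print(kpus_for(20, 4))   # four tasks share each KPU
```

Raising `Parallelism` therefore also raises the KPU count unless `ParallelismPerKPU` is increased with it.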

Practice this question →

525

MCQeasy

A data engineer needs to ingest data from a relational database (MySQL) into Amazon S3 for analytics. The database is 500 GB and the job must run daily with incremental updates. Which AWS service is BEST suited for this task?

A.Amazon EMR with Apache Sqoop.

B.Amazon Kinesis Data Firehose with a database source.

C.AWS Database Migration Service (DMS) with a replication task.

D.AWS Glue ETL job with a JDBC connection.

AnswerC

DMS supports continuous replication and can write to S3.

Why this answer

Option C is correct because AWS Database Migration Service (DMS) supports ongoing replication from MySQL to S3 with change data capture (CDC), enabling incremental updates. Option A is wrong because Amazon EMR is overkill for simple database ingestion. Option B is wrong because Amazon Kinesis is for streaming data, not database snapshots.

Option D is wrong because AWS Glue can run batch jobs but does not natively support CDC as seamlessly as DMS.
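
For reference, a DMS task that performs an initial load and then applies ongoing changes uses the `full-load-and-cdc` migration type. The sketch below builds arguments for boto3's `create_replication_task`; every ARN and the task identifier are placeholders, and the wildcard table mapping is an assumption that all schemas and tables should be included.

```python
import json


def cdc_task_args(task_id: str, source_arn: str, target_arn: str, instance_arn: str,
                  schema: str = "%", table: str = "%") -> dict:
    """Build keyword arguments for dms.create_replication_task."""
    mappings = {"rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-tables",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
    }]}
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load-and-cdc",  # initial copy, then ongoing changes
        "TableMappings": json.dumps(mappings),
    }


# Placeholder identifiers; with real credentials this would be passed as
# boto3.client("dms").create_replication_task(**args)
args = cdc_task_args("mysql-to-s3", "arn:source", "arn:target", "arn:instance")
print(args["MigrationType"])
```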

Practice this question →
