CCNA Data Ingestion Transformation Questions


1
MCQeasy

A data engineer needs to capture change data capture (CDC) events from an Amazon RDS for PostgreSQL database and stream them to Amazon S3 in near real-time. Which AWS service should be used?

A.Amazon S3 Transfer Acceleration
B.Amazon Athena
C.AWS Database Migration Service (AWS DMS)
D.Amazon Kinesis Data Streams
AnswerC

DMS supports ongoing replication (CDC) from databases to S3.

Why this answer

Option C is correct because AWS DMS can continuously replicate changes from a source database to S3. Option A is wrong because S3 Transfer Acceleration speeds up uploads but does not capture CDC. Option B is wrong because Athena is a query service.

Option D is wrong because Kinesis Data Streams is for custom streaming applications and does not capture database changes on its own.

2
MCQhard

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an on-premises Oracle database to Amazon S3 in Parquet format. The replication is used for near-real-time analytics. Recently, the DMS task started failing with an error indicating insufficient memory. The source database is large (2 TB). What should a data engineer do to resolve this issue while minimizing changes to the existing architecture?

A.Change the target format to JSON to reduce memory usage.
B.Split the DMS task into multiple smaller tasks.
C.Use Change Data Capture (CDC) only, without full load.
D.Increase the DMS replication instance size.
AnswerD

Provides more memory.

Why this answer

The error indicates the DMS replication instance is running out of memory during continuous replication of a 2 TB Oracle database to S3 in Parquet format. Increasing the replication instance size (Option D) directly addresses the memory constraint by providing more RAM and processing capacity, which is necessary for handling large volumes of Change Data Capture (CDC) data and Parquet conversion overhead. This solution requires minimal architectural changes, as it only involves modifying the instance class of the replication instance; the task, endpoints, and target stay the same.

Exam trap

The trap here is that candidates may think splitting tasks or changing formats reduces memory usage, but the root cause is insufficient instance resources, and AWS DMS tasks require adequate instance sizing for large-scale CDC workloads.

How to eliminate wrong answers

Option A is wrong because changing the target format to JSON would not reduce memory usage; JSON is typically larger than Parquet and would increase memory consumption during serialization, not decrease it. Option B is wrong because splitting the DMS task into multiple smaller tasks would increase complexity and overhead, potentially causing additional memory pressure from multiple connections and task management, and does not directly resolve the insufficient memory error. Option C is wrong because using CDC only without full load ignores the fact that the task is already failing during continuous replication (CDC phase), and the full load may have already completed; disabling full load does not address the memory issue in CDC processing.

3
MCQmedium

A company has a large volume of CSV files in S3 that need to be transformed into Parquet using AWS Glue. The files are partitioned by date. The engineer wants to minimize costs by processing only new files each day. Which approach should be used?

A.Use S3 partition discovery to automatically read new partitions.
B.Schedule the Glue job to run daily and process all files.
C.Enable job bookmarks in the Glue job.
D.Use S3 Event Notifications to trigger the Glue job on each new file.
AnswerC

Bookmarks track processed data and skip already processed files.

Why this answer

Option C is correct because a Glue job bookmark tracks processed data and skips already processed files, so each run handles only new ones. Option B is wrong because it would reprocess all files every day. Option A is wrong because partition discovery alone does not prevent reprocessing.

Option D is wrong because an S3 event can trigger a job, but without bookmarks the job may still reprocess existing files.
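The bookmark idea can be illustrated with a small sketch in plain Python. This is not Glue's implementation, only the underlying pattern of remembering what has been processed; the helper name `select_new_files` and the S3 keys are hypothetical.

```python
def select_new_files(all_keys, processed_keys):
    """Return the keys that have not been processed yet, in sorted order."""
    return sorted(key for key in all_keys if key not in processed_keys)


# Hypothetical date-partitioned keys; two were handled by earlier runs.
processed = {"logs/date=2024-01-01/a.csv", "logs/date=2024-01-02/b.csv"}
listing = processed | {"logs/date=2024-01-03/c.csv"}

new_files = select_new_files(listing, processed)

# After a successful run, persist the state so the next run skips these files.
processed.update(new_files)
```

A second call with the updated state returns an empty list, which is the behavior that keeps daily runs cheap.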

4
MCQeasy

A data engineering team needs to ingest real-time streaming data from thousands of IoT devices and transform the data before storing it in Amazon S3. Which AWS service is most suitable for performing the transformation step in near real-time?

A.AWS Lambda
B.Amazon Kinesis Data Analytics
C.Amazon Kinesis Data Firehose
D.Amazon S3
AnswerA

Lambda can process Kinesis stream records and transform them.

Why this answer

Option A is correct because AWS Lambda can run code in response to Kinesis Data Streams events and perform transformations before writing to S3. Option B is incorrect because Kinesis Data Analytics is for running SQL/Java on streams, not simple transforms. Option C is incorrect because Kinesis Data Firehose is for loading data to destinations with optional simple transformations via Lambda.

Option D is incorrect because Amazon S3 is storage, not a transformation service.

5
MCQeasy

A data engineer needs to ingest streaming data from thousands of IoT devices into Amazon S3 in near real-time. The data must be processed with minimal latency and stored in a columnar format for analytics. Which service should the engineer use to ingest the data?

A.Amazon Kinesis Data Analytics
B.Amazon Simple Queue Service (SQS)
C.Amazon Kinesis Data Streams with a Lambda consumer
D.Amazon Kinesis Data Firehose
AnswerD

Directly loads streaming data to S3 with transformation and columnar format support.

Why this answer

Option D is correct because Amazon Kinesis Data Firehose is a fully managed service for loading streaming data into S3, and it can convert data to columnar formats like Parquet and ORC. Option C (Kinesis Data Streams with a Lambda consumer) requires custom consumer code. Option A (Kinesis Data Analytics) processes data but does not load to S3 directly.

Option B (SQS) is a message queue, not designed for streaming ingestion.

6
MCQmedium

A data engineer needs to ingest data from an on-premises Oracle database to Amazon S3 daily. The data volume is 500 GB per day, and the network bandwidth is 200 Mbps. The requirement is to minimize the impact on the source database and ensure data integrity. Which combination of AWS services should be used?

A.AWS Database Migration Service (DMS) with S3 as target
B.AWS Glue ETL jobs with JDBC connection
C.Amazon Kinesis Data Firehose with Oracle as source
D.AWS Data Pipeline with SQLActivity
AnswerA

AWS DMS minimizes source impact by using change data capture and supports S3 as a target.

Why this answer

Option A is correct because AWS DMS can perform ongoing replication with minimal impact, and S3 as a target is supported. Option B (AWS Glue) is a batch ETL tool but may put load on the source. Option D (AWS Data Pipeline) is older and less efficient.

Option C (Amazon Kinesis Data Firehose) is for streaming, not batch database loads.

7
MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format and needs to be converted to Parquet. However, the conversion is failing. What is the most likely cause?

A.The data transformation Lambda function is not converting to Parquet
B.The schema is not defined in the AWS Glue Data Catalog
C.The data size exceeds the 1 MB limit per record
D.The delivery stream is configured to use Kinesis Data Streams as source
AnswerB

Parquet conversion requires a schema; without it, Firehose cannot convert.

Why this answer

Kinesis Data Firehose requires a schema to convert to Parquet. This schema can be provided by a Glue Data Catalog table. If the schema is not defined, the conversion fails.

Data size is not an issue for conversion. Kinesis Data Streams is not involved here. Lambda transformation can convert to Parquet but is not required.

8
Multi-Selecthard

A company is ingesting Apache logs from multiple web servers into AWS. The logs are sent via Amazon CloudWatch Logs to a subscription filter that delivers to a Lambda function. The Lambda function parses the logs and writes to Amazon S3. However, there is a significant backlog. Which THREE actions can reduce the backlog?

Select 3 answers
A.Route the CloudWatch Logs subscription to an Amazon SQS queue first
B.Increase the Lambda function memory allocation
C.Increase the Lambda function reserved concurrency
D.Change the Lambda function runtime from Python to Node.js
E.Increase the Lambda function maximum concurrency (unreserved account concurrency)
AnswersB, C, E

More memory also increases CPU, speeding up processing.

Why this answer

Increasing memory (B) also increases CPU, so each invocation finishes sooner. Raising reserved concurrency (C) and the account-level concurrency limit (E) lets more invocations run in parallel. Changing the runtime (D) has only a minor effect, and SQS (A) is not used in this path.

9
MCQeasy

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time and then loaded into an Amazon S3 bucket for long-term storage. Which AWS service should be used to perform the transformation and delivery to S3 with minimal operational overhead?

A.Amazon Kinesis Data Firehose
B.AWS Glue
C.Amazon EMR
D.Amazon Kinesis Data Analytics
AnswerA

Kinesis Data Firehose can subscribe to a Kinesis Data Stream, transform data, and automatically deliver to S3.

Why this answer

Option A is correct because Kinesis Data Firehose can subscribe to a Kinesis Data Stream, transform data using Lambda or its built-in transformations, and automatically deliver to S3 with no code required for delivery. Option D is wrong because Kinesis Data Analytics is for real-time analytics, not direct delivery to S3. Option B is wrong because AWS Glue is a batch ETL service, not real-time.

Option C is wrong because Amazon EMR is a managed Hadoop cluster that requires significant overhead.

10
MCQeasy

A data engineer needs to transform JSON data from Amazon S3 into Parquet format using AWS Glue. The data contains nested fields. Which Glue feature should the engineer use to define the schema and handle the nested structure?

A.Use the 'FindMatches' transform to identify duplicates.
B.Use the 'DropFields' transform to remove nested fields.
C.Use the 'Relationalize' transform in a Glue ETL script.
D.Use the 'Spigot' transform to write sample data.
AnswerC

Relationalize flattens nested JSON into relational tables.

Why this answer

Option C is correct because Glue's built-in transform 'Relationalize' converts nested JSON into flat tables. Option A is wrong because 'FindMatches' is for deduplication. Option D is wrong because 'Spigot' is for sampling.

Option B is wrong because 'DropFields' is for removing fields, not handling nesting.

11
Multi-Selecteasy

Which TWO AWS services can be used to transform data in an Amazon S3 data lake before loading into Amazon Redshift? (Choose 2.)

Select 2 answers
A.AWS Lambda
B.Amazon Athena
C.Amazon EMR
D.AWS Glue
E.Amazon Redshift Spectrum
AnswersD, E

Glue can transform data in S3 and load into Redshift.

Why this answer

AWS Glue can run ETL jobs to transform S3 data and load into Redshift. Amazon Redshift Spectrum can query S3 data directly and load results into Redshift tables. Amazon Athena is for querying, not loading.

Amazon EMR requires cluster management. AWS Lambda is for small data volumes.

12
MCQmedium

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3. The data volume is about 10 TB per day. The engineer wants to set up a managed Kafka connector. Which AWS service should they use?

A.AWS Database Migration Service
B.AWS Lambda with Kafka trigger
C.Amazon MSK Connect
D.Amazon Kinesis Data Streams
AnswerC

MSK Connect runs Kafka Connect workers, including S3 sink connectors.

Why this answer

Amazon MSK Connect is a fully managed Kafka Connect service that can run S3 sink connectors. Lambda is not a Kafka sink; Kinesis is a different streaming service; DMS is for databases.

13
MCQhard

A company is building a data lake on S3 and needs to ingest data from on-premises Oracle database. The data is 5 TB and changes incrementally. The ingestion must capture changes in near real-time (less than 1 minute latency) and be cost-effective. Which approach should be used?

A.Use AWS Database Migration Service (DMS) with ongoing replication to S3
B.Use Amazon Kinesis Data Firehose with an Oracle JDBC connector
C.Use AWS Glue to perform a full table export daily
D.Use AWS DataSync to sync the Oracle data files to S3
AnswerA

DMS supports CDC and can replicate changes to S3 with low latency.

Why this answer

Option A (Use AWS DMS with ongoing replication from Oracle to S3) is correct because DMS supports change data capture (CDC) with low latency. Option C (Full export using AWS Glue) does not capture incremental changes in near real-time. Option B (Use Kinesis Data Firehose with Oracle as source) is not directly possible without a connector.

Option D (Use AWS DataSync) is for file transfers, not database CDC.

14
MCQmedium

A company ingests JSON logs into Amazon S3 using Kinesis Data Firehose. The logs contain a timestamp field, but the delivery to S3 is delayed by up to 15 minutes during peak hours. The business requires near-real-time availability (under 2 minutes). Which configuration change should the data engineer make?

A.Increase the number of shards in the Kinesis Data Firehose stream
B.Increase the buffer size to 128 MB
C.Decrease the buffer interval to 60 seconds
D.Enable buffering hints in the Firehose delivery stream
AnswerC

Shorter buffer interval reduces delivery latency.

Why this answer

Option C is correct because reducing the buffer interval to 60 seconds forces Firehose to deliver data more frequently, reducing latency. Option B (increasing buffer size) would increase latency. Option A (increasing shards) does not apply to Firehose directly; shards are for Kinesis Data Streams.

Option D is wrong because buffering hints are simply the setting that holds the buffer size and interval; enabling them does not by itself shorten delivery.

15
Multi-Selecteasy

A data engineer needs to ingest data from a SaaS application that sends webhooks in JSON format. The data must be stored in S3 for batch analysis. Which AWS services can receive the webhooks and store the data in S3 with minimal custom code? (Choose TWO.)

Select 2 answers
A.AWS Lambda with S3 SDK
B.AWS Glue with a Python shell job
C.Amazon Kinesis Data Streams
D.Amazon API Gateway with S3 integration
E.Amazon API Gateway with Kinesis Data Firehose integration
AnswersD, E

API Gateway can directly write to S3.

Why this answer

Option D and Option E are correct because API Gateway can receive webhooks and route them to S3 or Firehose through service integrations. Option B is wrong because Glue cannot directly receive webhooks. Option C is wrong because Kinesis Data Streams requires custom producers.

Option A is wrong because Lambda can receive the payload but cannot store it in S3 without custom code.

16
Multi-Selecthard

A data engineer is designing a data transformation pipeline using AWS Glue. The source data is in Amazon S3 in Parquet format, and the transformed output must be written to another S3 bucket in Parquet format partitioned by year, month, day. The pipeline should handle incremental updates efficiently. Which three features should the engineer use? (Choose THREE.)

Select 3 answers
A.AWS Glue job bookmarks to track processed data
B.Use AWS Glue JobWatch for monitoring job progress
C.Use DynamicFrames instead of Spark DataFrames for schema handling
D.Enable partition pruning in the Glue job
E.Use Spark SQL for transformations
AnswersA, C, D

Enables incremental processing.

Why this answer

Options A, C, and D are correct. Bookmarks track processed data for incremental jobs. DynamicFrames are flexible for schema evolution.

Partition pruning helps read only relevant partitions. Option E (Spark SQL) is not specific to incremental processing. Option B (JobWatch) is not an AWS Glue feature.

17
Multi-Selecteasy

A company uses AWS Glue to catalog and transform data in Amazon S3. The Glue ETL jobs are failing intermittently with 'ThrottlingException' errors. Which THREE actions can help mitigate this issue? (Select THREE.)

Select 3 answers
A.Implement exponential backoff and retry in the Glue job code.
B.Increase the number of DPUs for the Glue job.
C.Request a service quota increase for the Glue API.
D.Enable job bookmarking to avoid reprocessing old data.
E.Switch from PySpark to Spark SQL.
AnswersA, C, D

Exponential backoff retries throttled requests, reducing failure impact.

Why this answer

Options A, C, and D are correct. Implementing retry logic with exponential backoff handles transient throttling. Requesting a service quota increase for Glue API calls helps when the limit itself is being reached.

Using Glue job bookmarking reduces reprocessing, thus lowering API calls. Option B (increasing DPUs) addresses resource issues but not API throttling. Option E (changing to Spark SQL) does not affect API calls.
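The retry pattern behind Option A might look like the sketch below. It is a generic illustration rather than Glue-specific code: `ThrottlingError` is a stand-in exception, and `call_with_backoff` and its parameters are hypothetical names chosen for this example.

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a throttling error raised by an API client (hypothetical)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fn(), retrying on ThrottlingError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay ceiling on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: sleep a random time below the ceiling so that many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the function can be exercised in tests without real waiting.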

18
MCQmedium

A company uses AWS Glue crawlers to populate the Data Catalog from data in Amazon S3. The crawler fails to update the schema when new columns are added to the CSV files. What is the most likely cause?

A.The S3 bucket has versioning enabled.
B.The crawler is configured to only crawl new partitions.
C.The IAM role for the crawler lacks permissions to read the new columns.
D.The crawler uses a custom classifier that defines a fixed schema.
AnswerD

Custom classifiers can override schema inference.

Why this answer

Option D is correct because if the crawler is configured with a custom classifier that expects a fixed schema, it will ignore new columns. Option B is wrong because crawling only new partitions does not prevent schema updates. Option A is wrong because S3 versioning is not related to schema inference.

Option C is wrong because IAM permissions apply to objects, not columns, and a permissions problem would cause a different error.

19
MCQeasy

A data engineer is tasked with ingesting on-premises database snapshots (full load) into Amazon S3 on a daily basis. The database is PostgreSQL and the snapshot size is 50 GB. The network link is 1 Gbps. Which approach is the MOST time-efficient and cost-effective?

A.Use AWS Database Migration Service (DMS) with S3 as target.
B.Use AWS Snowball Edge to transfer the snapshot.
C.Use AWS CLI to copy the snapshot file directly to S3.
D.Write a Lambda function to run pg_dump and upload to S3.
AnswerA

DMS can perform full loads from on-premises PostgreSQL to S3 in a managed and scalable way.

Why this answer

Option A is correct because AWS DMS can perform a full load from on-premises PostgreSQL to S3 efficiently, and it's a managed service. Option C (CLI) would be manual and slow. Option B (Snowball) is for larger datasets or slow networks; here 50 GB at 1 Gbps takes ~7 minutes, so Snowball is overkill.

Option D (Lambda + pg_dump) adds custom code and complexity.

20
Multi-Selecteasy

A data engineer is building a data ingestion pipeline that uses AWS Lambda to process records from Amazon Kinesis Data Streams. The Lambda function writes the processed data to Amazon DynamoDB. Which TWO factors affect the maximum number of concurrent Lambda executions for this stream? (Choose TWO.)

Select 2 answers
A.DynamoDB table's read capacity units
B.Lambda function's memory allocation
C.Kinesis stream name
D.Batch size configured for the Lambda event source mapping
E.Number of shards in the Kinesis stream
AnswersD, E

Batch size determines how many records are sent per invocation, affecting concurrency.

Why this answer

Correct options: D and E. The number of shards (E) determines the maximum number of concurrent batches (one per shard). The batch size (D, records per invocation) also affects concurrency because smaller batches lead to more invocations.

Option A is wrong because DynamoDB read capacity does not affect Lambda concurrency. Option B is wrong because Lambda memory is not directly related to concurrency. Option C is wrong because the stream name does not affect concurrency.
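The relationship can be sketched as simple arithmetic. The sketch assumes the usual Kinesis event source mapping behavior, where each shard is processed by one invocation at a time unless the mapping's parallelization factor is raised; the function names are hypothetical.

```python
import math


def max_concurrent_invocations(shards, parallelization_factor=1):
    """Upper bound on simultaneous invocations for one Kinesis event source."""
    return shards * parallelization_factor


def invocations_needed(records, batch_size):
    """Number of invocations required to drain a given number of records."""
    return math.ceil(records / batch_size)


# Example: a 4-shard stream with the default factor of 1.
ceiling = max_concurrent_invocations(4)
# Draining 1,000 records per shard in batches of 100 takes 10 invocations.
per_shard = invocations_needed(1000, 100)
```

Under these assumptions the shard count sets the ceiling, while the batch size mainly changes how many invocations are needed to work through the same records.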

21
MCQeasy

A data engineer needs to transform JSON data from an S3 bucket using AWS Glue. The JSON contains nested arrays and objects. Which Glue transform is best suited for flattening nested structures?

A.Unnest
B.ResolveChoice
C.Relationalize
D.Map
AnswerC

Relationalize flattens nested structures into separate DynamicFrames.

Why this answer

The Relationalize transform is specifically designed to flatten nested JSON structures (arrays and objects) into a set of related tables, making it ideal for this use case. It automatically handles complex nesting by creating separate DynamicFrames for each nested array and linking them via join keys, which is exactly what is needed when ingesting JSON with nested arrays and objects into a relational format.

Exam trap

The trap here is that candidates often confuse the generic Spark SQL function `explode` (or the concept of 'unnesting') with a named AWS Glue transform, leading them to select 'Unnest' even though it does not exist as a Glue transform and would require manual handling of multiple nesting levels.

How to eliminate wrong answers

Option A (Unnest) is wrong because AWS Glue does not have a built-in transform named 'Unnest'; this is a Spark SQL function (e.g., `explode`) but not a named Glue transform, and it would require manual handling of multiple nesting levels. Option B (ResolveChoice) is wrong because it is used to resolve schema ambiguities (e.g., when a column has mixed types like string and int) and does not flatten nested structures. Option D (Map) is wrong because it applies a function to each record in a DynamicFrame for row-wise transformations, but it does not inherently flatten nested arrays or objects—you would need to write custom logic to handle the nesting.
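To make the flattening idea concrete, here is a simplified sketch in plain Python. It is not the Glue transform and only mimics the general shape of its output (dotted column names for nested objects, a child table per array linked by an id); the function name `relationalize` and the sample record are assumptions for illustration.

```python
def relationalize(record, root="root"):
    """Flatten one nested record into {table_name: [rows]} (simplified)."""
    tables = {root: []}
    next_id = 0

    def flatten(obj, prefix, row, table):
        nonlocal next_id
        for key, value in obj.items():
            name = prefix + key
            if isinstance(value, dict):
                # Nested objects become dotted columns on the same row.
                flatten(value, name + ".", row, table)
            elif isinstance(value, list):
                # Arrays move to a child table; the parent keeps a join id.
                next_id += 1
                join_id = next_id
                row[name] = join_id
                child = f"{table}_{name}"
                rows = tables.setdefault(child, [])
                for index, item in enumerate(value):
                    child_row = {"id": join_id, "index": index}
                    if isinstance(item, dict):
                        flatten(item, f"{name}.val.", child_row, child)
                    else:
                        child_row[f"{name}.val"] = item
                    rows.append(child_row)
            else:
                row[name] = value

    top = {}
    flatten(record, "", top, root)
    tables[root].append(top)
    return tables


result = relationalize({"id": 7, "user": {"name": "a"}, "tags": ["x", "y"]})
```

Here `result["root"]` holds one flat row, and `result["root_tags"]` holds one row per array element, each carrying the join id stored in the parent's `tags` column.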

22
MCQmedium

A gaming company ingests player event data from mobile games into Amazon Kinesis Data Streams. Each event is a small JSON payload (<1 KB). The data must be delivered to Amazon S3 for analytics, and the company wants to minimize storage costs by aggregating events into larger files (e.g., 100 MB per file). The current setup uses Kinesis Data Firehose with a buffer size of 10 MB and a buffer interval of 60 seconds, but the resulting files are very small (average 5 MB) because the data volume is low. The engineer needs to ensure that files are at least 100 MB to reduce the number of S3 objects and lower costs. What should the engineer do?

A.Use an AWS Glue streaming ETL job with a 100 MB file size threshold to write to S3.
B.Use an AWS Lambda function to buffer events in memory and write to S3 when buffer reaches 100 MB.
C.Increase the buffer size in Kinesis Data Firehose to 100 MB and increase the buffer interval to 300 seconds.
D.Use Amazon EMR with Spark Streaming to aggregate and write larger files to S3.
AnswerC

This allows Firehose to accumulate data until the buffer size or interval is reached, producing larger files.

Why this answer

Option C is correct: Increase the buffer size to 100 MB and the buffer interval to 300 seconds to allow more data to accumulate before writing. Option B (Lambda) would not aggregate efficiently. Option A (Glue) adds latency and cost.

Option D (EMR) is overkill.
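A rough estimate of the resulting object size can be sketched as follows, assuming Firehose flushes when either the size threshold or the interval is reached, whichever comes first, and that data arrives at a steady rate. The rate of about 5 MB per minute is inferred from the scenario's 5 MB files at a 60-second interval; the function name is hypothetical.

```python
def estimated_object_mb(ingest_mb_per_min, buffer_size_mb, buffer_interval_s):
    """Approximate size of each delivered object under a steady ingest rate."""
    accumulated = ingest_mb_per_min * buffer_interval_s / 60
    return min(buffer_size_mb, accumulated)


rate = 5  # MB per minute, inferred from the scenario

current = estimated_object_mb(rate, 10, 60)       # today's configuration
proposed = estimated_object_mb(rate, 100, 300)    # the configuration in Option C
longest = estimated_object_mb(rate, 128, 900)     # a 900-second interval
```

Under these assumptions the proposed settings produce objects several times larger than today's, but at the stated volume the interval is still reached before 100 MB accumulates, so hitting that target would also depend on higher throughput or a later compaction step.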

23
Multi-Selecteasy

A financial services company is ingesting real-time stock trade data into Amazon Kinesis Data Streams. The data is then processed by a Kinesis Data Analytics application for fraud detection. The company must ensure that the data is processed in the correct order for each stock symbol. Which TWO configuration steps should be taken? (Choose two.)

Select 2 answers
A.Use a random partition key to distribute the load evenly.
B.Increase the number of shards to reduce latency.
C.Use the stock symbol as the partition key when putting records into the stream.
D.Use AWS Lambda with DynamoDB Streams instead of Kinesis Data Analytics.
E.Configure the Kinesis Client Library (KCL) to process records in the order they arrive in each shard.
AnswersC, E

Partition key determines shard assignment; same symbol goes to same shard, preserving order.

Why this answer

C is correct because using the stock symbol as the partition key ensures records with the same symbol go to the same shard, preserving order. E is correct because configuring the Kinesis Client Library (KCL) to process records in sequence-number order ensures records are processed in order within a shard. B is wrong because increasing shards does not guarantee ordering.

A is wrong because using a random partition key would distribute records for one symbol across shards, breaking order. D is wrong because DynamoDB Streams is for CDC on DynamoDB tables, not for ordering stream records.
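Why a stable partition key preserves per-symbol order can be shown with a short sketch. Kinesis maps each partition key to a 128-bit integer using an MD5 hash and routes the record to the shard whose hash key range contains it; the even split of the range across shards below is a simplifying assumption, and `shard_for_key` is a hypothetical helper.

```python
import hashlib


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)          # 128-bit integer
    range_size = 2 ** 128 // shard_count  # width of each shard's range
    # Clamp so values in the leftover tail fall into the last shard.
    return min(hash_value // range_size, shard_count - 1)


# Every trade for the same symbol lands on the same shard.
first = shard_for_key("AMZN", 4)
second = shard_for_key("AMZN", 4)
```

Because the mapping is deterministic, all records for one symbol share a shard and keep their relative order, whereas a random key would scatter them.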

24
MCQhard

A data pipeline uses AWS Glue to read from an Amazon S3 bucket containing millions of small CSV files (each < 1 MB). The ETL job is slow. Which optimization would most improve performance?

A.Write the ETL script using PySpark instead of Scala
B.Increase the number of Glue workers
C.Use the G.1X worker type for more memory
D.Use S3 file grouping to combine small files
AnswerD

Grouping small files reduces the number of partitions and improves Spark performance.

Why this answer

Using Amazon S3 file grouping or converting to a columnar format like Parquet reduces the number of files and improves read performance. Increasing workers (Option B) helps, but file consolidation is more impactful. Using the G.1X worker type (Option C) may help, but grouping files is key.

Switching the script language between PySpark and Scala (Option A) does not address the small files problem.
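The effect of grouping can be sketched with a greedy packing routine in plain Python. In Glue the behavior is typically configured through the `groupFiles` and `groupSize` connection options rather than hand-written code; the function below is only a hypothetical illustration of the idea.

```python
def group_files(sizes_bytes, target_bytes):
    """Greedily pack file sizes into groups of at most target_bytes each.

    A single file larger than the target still gets its own group.
    """
    groups, current, current_size = [], [], 0
    for size in sizes_bytes:
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Four 400-byte files with a 1,000-byte target become two read tasks.
groups = group_files([400, 400, 400, 400], 1000)
```

With millions of files under 1 MB, packing them toward a target in the hundred-megabyte range cuts the number of read tasks by orders of magnitude, which is where most of the speedup comes from.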

25
MCQeasy

A company needs to ingest data from an on-premises Hadoop cluster into Amazon S3 for archival and analysis. The total data volume is 50 TB. The migration must be completed within one week. The on-premises network has a 1 Gbps connection to AWS. Which AWS service should be used?

A.AWS Transfer Family
B.AWS Snowball Edge
C.AWS Glue
D.AWS DataSync
AnswerB

Snowball can physically ship data for large transfers.

Why this answer

Option B is correct because AWS Snowball Edge is a physical device that can transfer large amounts of data faster than over the network given the bandwidth limitation. Option D (DataSync) can use the network but may not complete within a week due to bandwidth constraints. Option C (Glue) is for ETL, not raw transfer.

Option A (Transfer Family) is for SFTP.
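The bandwidth argument can be checked with a quick calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and treats link utilization as a tunable guess; the function name and the 60% figure are illustrative assumptions, not values from the question.

```python
def transfer_days(data_tb, link_gbps, utilization=1.0):
    """Days needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400


ideal = transfer_days(50, 1)                # link fully dedicated to the copy
at_60_percent = transfer_days(50, 1, 0.6)   # link shared with other traffic
```

At full line rate the copy takes about 4.6 days, so the one-week deadline is only at risk if a substantial share of the link is unavailable; at 60% utilization it stretches to about 7.7 days, which is the situation in which the physical device becomes the safer choice.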

26
Multi-Selecthard

A data engineer is implementing a CDC (Change Data Capture) pipeline from a relational database to Amazon S3 using AWS Database Migration Service (DMS). Which TWO configurations are required for continuous replication?

Select 2 answers
A.Define transformation rules in the DMS task.
B.Enable binary logging on the source database.
C.Configure a VPC endpoint for DMS.
D.Enable 'Full load' and 'Ongoing replication' in the task.
E.Pre-create the target table in S3.
AnswersB, D

Binary logs are needed to capture changes for CDC.

Why this answer

Options B and D are correct. For continuous replication, the task must include ongoing replication, here configured as 'Full load' plus 'Ongoing replication' (D), and the source database must have change logging enabled (e.g., binary logs for MySQL) (B). Option E is wrong because there is no target table to pre-create in S3; DMS writes the output files itself.

Option C is wrong because a VPC endpoint is not what enables continuous replication. Option A is wrong because transformation rules are optional.

27
MCQmedium

Refer to the exhibit. A Glue ETL job failed. What is the most likely cause?

A.Some source files have a different schema than others.
B.The job bookmarks are misconfigured.
C.The source data has inconsistent partitioning.
D.The job ran out of memory due to insufficient DPUs.
AnswerA

Directly matches the error.

Why this answer

Option A is correct because the error 'Cannot merge incompatible schemas' indicates differing schemas in the source data. Option D is wrong because insufficient DPUs cause memory errors, not schema conflicts. Option C is wrong because partitioning affects how data is laid out and filtered, not its schema.

Option B is wrong because bookmarks track processed files, not schema issues.

28
MCQhard

A company is designing a data ingestion pipeline for real-time analytics. The source is a relational database, and the target is Amazon Redshift. The pipeline must handle schema changes in the source database automatically. Which combination of services should be used?

A.Amazon S3 and Amazon Athena
B.AWS DMS and AWS Glue
C.AWS Glue and Amazon Redshift COPY
D.Amazon Kinesis Data Streams and AWS Lambda
AnswerB

DMS captures changes, Glue can detect and apply schema changes.

Why this answer

AWS DMS can continuously replicate data from the source database to S3 in CDC mode. AWS Glue can then detect schema changes and transform the data before loading into Redshift. Kinesis Data Streams is for streaming data, not database CDC.

Athena is for querying, not for handling schema evolution. Lambda alone would require custom code for schema detection.

29
MCQmedium

A company uses Amazon S3 to store raw data and AWS Glue to run ETL jobs that transform the data into analytics-ready tables. The Glue job reads from a source with a schema that changes frequently (new columns added). The engineer wants the Glue job to automatically adapt to schema changes without manual intervention. Which configuration should the engineer use?

A.Schedule a Glue crawler to run after each ETL job to update the Data Catalog.
B.Set the job to use schema-on-read by storing data in Parquet format.
C.Enable the 'Update schema' option in the Glue job's output target configuration.
D.Use Glue's partition indexes to automatically detect new columns.
AnswerC

This option automatically adds new columns to the target table.

Why this answer

Glue's 'Update schema' option in the job's output target allows the job to automatically incorporate new columns. Option A is wrong because a crawler after the job would add lag. Option B is wrong because schema-on-read is not automatic.

Option D is wrong because partition indexes speed up partition lookups in the Data Catalog; they do not detect new columns or handle schema evolution.

30
Multi-Selectmedium

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Streams. The data must be transformed in real-time and then stored in Amazon S3 for long-term retention. Which THREE services can be used together to achieve this?

Select 3 answers
A.Amazon Kinesis Data Analytics
B.Amazon Kinesis Data Firehose
C.AWS Glue
D.Amazon Athena
E.Amazon Kinesis Data Streams
AnswersA, B, E

Performs real-time transformations.

Why this answer

Options A, B, and E are correct: Kinesis Data Streams ingests, Kinesis Data Analytics transforms, and Kinesis Data Firehose writes to S3. Option C (Glue) is batch-oriented, not real-time. Option D (Athena) is for querying, not streaming.

31
MCQhard

A company is using Amazon MSK (Managed Streaming for Apache Kafka) to ingest real-time data. They need to transform the data using custom Java code before writing to Amazon S3. The transformation must be fault-tolerant and exactly-once semantics are required. Which AWS service should be used?

A.Amazon EMR with Spark Streaming
B.AWS Lambda consumer for MSK
C.Kafka Connect with S3 Sink Connector
D.Kinesis Data Analytics for Apache Flink
AnswerC

Supports exactly-once and custom transformations.

Why this answer

Option C is correct because Kafka Connect with S3 Sink Connector supports exactly-once semantics and custom transformations via Single Message Transforms (SMTs) or custom connectors. Option D (Kinesis Data Analytics for Apache Flink) is for Flink, not Kafka. Option B (AWS Lambda) does not provide exactly-once semantics from MSK.

Option A (Amazon EMR) is for batch processing.

32
MCQmedium

A media company ingests large video files (up to 100 GB each) from content creators via Amazon S3 multipart uploads. After upload, the company needs to transcode the videos into multiple formats using AWS Elemental MediaConvert. The current pipeline uses S3 event notifications to trigger an AWS Lambda function that starts a MediaConvert job. However, for very large files, the Lambda function times out (15-minute limit) before the upload completes because the event is sent when the multipart upload is initiated, not when it completes. How should the engineer fix this issue?

A.Increase the Lambda timeout to 30 minutes.
B.Use Amazon SQS to queue the event and have a Lambda function process it later.
C.Use AWS Step Functions to poll the S3 bucket for object existence and then start MediaConvert.
D.Change the S3 event notification to listen for s3:ObjectCreated:Put events instead of s3:ObjectCreated:*, which fires only after the object is fully written.
AnswerD

s3:ObjectCreated:Put triggers only when the object is complete, avoiding early invocation.

Why this answer

Option D is correct: use S3 event notifications for s3:ObjectCreated:Put, which triggers only after the object is completely written. Lambda can then start the MediaConvert job. Option A (increase timeout) does not solve the root cause.

Option B (SQS) does not address the event timing. Option C (Step Functions) adds complexity.

33
Multi-Selecteasy

A company is ingesting large volumes of sensor data into Amazon S3. The data must be encrypted at rest using an AWS KMS customer managed key. Which TWO actions are required to enable server-side encryption with AWS KMS (SSE-KMS) on the S3 bucket?

Select 2 answers
A.Enable S3 Versioning on the bucket
B.Set the default encryption on the S3 bucket to AWS-KMS and specify the KMS key
C.Enable Amazon CloudWatch Logs for the bucket
D.Add a bucket policy that denies uploads without encryption
E.Ensure the IAM role/user has kms:Encrypt permission on the KMS key
AnswersB, E

This configures SSE-KMS for all objects.

Why this answer

Option B is correct because setting the default encryption on the S3 bucket to AWS-KMS and specifying the KMS key ensures that all objects uploaded to the bucket are automatically encrypted with SSE-KMS using that customer managed key. This is the primary configuration step to enforce server-side encryption at rest with a customer managed key.

Exam trap

The trap here is that candidates often think a bucket policy denying unencrypted uploads alone is sufficient to enable SSE-KMS, but it only enforces that uploads must include encryption headers—it does not automatically apply encryption, so the default encryption setting is also required.
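
As a rough illustration of the default encryption setting described above, the sketch below builds the configuration document that S3 expects for SSE-KMS. The key ARN and the helper name `default_kms_encryption` are hypothetical; the dictionary shape follows the `ServerSideEncryptionConfiguration` parameter of the S3 `put_bucket_encryption` API, and the actual API call is left as a comment because it needs AWS credentials.

```python
def default_kms_encryption(kms_key_arn: str) -> dict:
    """Build a default-encryption configuration that applies SSE-KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


# Hypothetical customer managed key ARN, used only for illustration.
config = default_kms_encryption(
    "arn:aws:kms:eu-central-1:111122223333:key/EXAMPLE-KEY-ID"
)

# With boto3 and valid credentials, this could be applied along these lines:
#   boto3.client("s3").put_bucket_encryption(
#       Bucket="example-bucket",
#       ServerSideEncryptionConfiguration=config,
#   )
```

With this setting in place, objects uploaded without explicit encryption headers are still encrypted with the specified key.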

34
MCQmedium

A company needs to transform JSON data from an Amazon S3 bucket into Parquet format and load it into an Amazon Redshift cluster. The transformation includes joining with a reference table stored in Amazon RDS. Which AWS service is BEST suited for this task?

A.AWS Data Pipeline
B.AWS Glue ETL job
C.Amazon Athena
D.Amazon EMR with Spark
AnswerB

Glue ETL jobs can read from S3, connect to RDS via JDBC, transform, and write to Redshift efficiently.

Why this answer

Option B is correct because AWS Glue ETL jobs can read from S3, connect to RDS via JDBC, perform joins and transformations, and write to Redshift. Option C (Athena) can query S3 but cannot join with RDS natively. Option D (EMR) is possible but more complex to set up.

Option A (Data Pipeline) is older and less integrated.

35
MCQmedium

A company uses Amazon Kinesis Data Firehose to ingest application logs into an Amazon S3 bucket. The logs are in JSON format. The data engineering team wants to convert the logs from JSON to Parquet format before landing in S3. What is the most cost-effective way to achieve this?

A.Use Amazon Athena to query the JSON data and write results in Parquet format.
B.Configure the Firehose delivery stream to convert the data to Parquet using a schema from AWS Glue.
C.Use an AWS Lambda function to transform each record to Parquet and send to Firehose.
D.Use an AWS Glue ETL job to run on a schedule and convert JSON to Parquet in S3.
AnswerB

Firehose supports built-in conversion to Parquet/ORC using Glue schema.

Why this answer

Option B is correct because Kinesis Data Firehose can convert the input data format to Parquet using a schema from AWS Glue. Option C is incorrect because Lambda can do this but incurs additional compute cost. Option A is incorrect because Athena queries raw data and would not help with ingestion.

Option D is incorrect because Glue ETL would add cost and latency.

36
MCQmedium

A company is using Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to an Amazon S3 bucket. The team notices that the application is experiencing high latency during peak hours. The stream has 8 shards, and the application is configured with a parallelism of 4. Which action would most likely reduce the latency?

A.Decrease the batch size in the S3 sink.
B.Use a larger Kinesis Data Analytics application instance type.
C.Increase the parallelism of the Flink application to 8.
D.Increase the checkpointing interval to reduce overhead.
AnswerC

Matching parallelism to shard count ensures each shard is processed concurrently, reducing backpressure.

Why this answer

The correct answer is to increase the parallelism of the Flink application to match the number of shards (8). When parallelism is lower than the number of shards, each source subtask has to read from more than one shard, causing backpressure and latency. Option D (increasing the checkpointing interval) would reduce overhead but not address the parallelism mismatch.

Option B (using a larger instance type) could help but is less effective than matching parallelism. Option A (decreasing batch size) is not applicable to Flink. Option C directly fixes the bottleneck.
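
The shard-to-parallelism mismatch can be pictured with a small sketch. It assumes shards are spread round-robin across source subtasks, which is a simplification and not necessarily how the Flink Kinesis connector assigns them; the function name is hypothetical.

```python
def shards_per_subtask(shard_count: int, parallelism: int) -> list[int]:
    """Count how many shards each subtask reads under round-robin assignment."""
    counts = [0] * parallelism
    for shard in range(shard_count):
        counts[shard % parallelism] += 1
    return counts


# 8 shards with parallelism 4: every subtask reads two shards.
before = shards_per_subtask(8, 4)

# 8 shards with parallelism 8: every subtask reads exactly one shard.
after = shards_per_subtask(8, 8)
```

Under this assumption, doubling parallelism halves the number of shards each subtask must keep up with.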

37
MCQhard

Refer to the exhibit. A data engineer is using a Kinesis Data Stream with 2 shards. The producer uses a partition key that is the user ID (a UUID). The consumer is falling behind. Which change would improve throughput?

A.Switch to Kinesis Data Firehose
B.Increase the number of shards
C.Increase the retention period
D.Change the partition key to a constant value
AnswerB

More shards increase the read capacity for consumers.

Why this answer

Option B is correct because increasing the number of shards increases both the ingestion capacity and the read capacity available to consumers. Option A is wrong because Kinesis Data Firehose would add latency. Option D is wrong because the partition key is already random (UUID), which distributes data well; a constant key would send every record to a single shard.

Option C is wrong because increasing the retention period does not affect throughput.
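
To see why a UUID partition key spreads load while a constant key does not, the sketch below mimics how Kinesis maps a partition key to a shard: the key is hashed with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The even split of the hash range across shards and the helper name `shard_for_key` are assumptions made for illustration.

```python
import hashlib
import uuid
from collections import Counter


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index via its 128-bit MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Assumption: shards split the 128-bit hash key range evenly.
    return hash_value * shard_count // 2**128


# 1000 distinct UUID-style keys (generated deterministically for repeatability).
uuid_keys = [str(uuid.UUID(int=i)) for i in range(1000)]
uuid_spread = Counter(shard_for_key(key, 2) for key in uuid_keys)

# The same number of records, all using one constant key.
constant_spread = Counter(shard_for_key("constant", 2) for _ in range(1000))
```

In this model the UUID keys land on both shards, while the constant key concentrates all 1000 records on one shard, which is why adding shards only helps when the key distributes well.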

38
MCQmedium

A company runs an e-commerce platform that generates clickstream data from millions of users. The data is ingested into Amazon Kinesis Data Streams with a shard count of 10. The data is then consumed by a Kinesis Data Analytics application that runs SQL queries to aggregate metrics in real time. Recently, the application has been falling behind, and the stream's iterator age metric is increasing. The data volume has doubled over the past month. The application currently uses a single Kinesis Data Analytics application with parallelism of 1. Which action should the data engineer take to improve the processing rate and reduce the iterator age without losing data or causing duplicates?

A.Change the Kinesis Data Analytics application to use a Kinesis Data Firehose delivery stream as the source.
B.Reduce the retention period of the Kinesis Data Streams to 24 hours.
C.Increase the number of shards in the Kinesis Data Streams to 20.
D.Increase the parallelism of the Kinesis Data Analytics application to match the number of shards.
AnswerD

Higher parallelism allows concurrent processing of multiple shards.

Why this answer

Option D is correct because Kinesis Data Analytics (KDA) processes data from each shard in a stream using one or more parallel operators. With a parallelism of 1, the application uses only a single processing thread, which cannot keep up with the doubled data volume across 10 shards. By increasing parallelism to match the shard count (10), KDA can read from all shards concurrently, distributing the processing load and reducing the iterator age without data loss or duplicates, as KDA manages checkpointing and exactly-once semantics internally.

Exam trap

The trap here is that candidates often assume increasing shard count (Option C) is the only way to handle higher data volume, but they overlook that the processing application's parallelism must also scale to consume the additional shards, otherwise the bottleneck shifts to the consumer.

How to eliminate wrong answers

Option A is wrong because Kinesis Data Firehose is a delivery service that buffers and loads data into destinations like S3 or Redshift; it does not support real-time SQL analytics or reduce iterator age, and using it as a source would break the existing KDA SQL application. Option B is wrong because reducing the retention period from the default (24 hours or more) to 24 hours does not improve processing rate; it only causes data to expire sooner, potentially losing unprocessed records and increasing the risk of data loss without addressing the throughput bottleneck. Option C is wrong because increasing the shard count to 20 would double the stream's ingestion capacity, but the KDA application with parallelism of 1 would still process only one shard at a time, leaving the other 19 shards unprocessed and worsening the iterator age; the bottleneck is the application's parallelism, not the stream's shard count.

39
Multi-Selecthard

A company needs to ingest data from a MySQL database into Amazon S3 using AWS DMS. The data changes frequently and the requirement is to capture changes in near real-time. Which THREE configurations are necessary?

Select 3 answers
A.Create a VPC endpoint for S3.
B.Create an S3 target endpoint in DMS.
C.Enable binary logging (binlog) on the MySQL source database.
D.Create an AWS DMS replication instance.
E.Configure an S3 event notification to trigger DMS.
AnswersB, C, D

Needed to specify the S3 bucket.

Why this answer

Options B, C, and D are correct because the database must have binary logging enabled, DMS requires a replication instance, and the target endpoint should be S3. Option E is incorrect because S3 events are not needed for DMS. Option A is incorrect because a VPC endpoint is not required if using public S3.

40
MCQmedium

A gaming company collects player event data from mobile devices. The data is sent to an Amazon API Gateway endpoint, which triggers an AWS Lambda function that writes the data to an Amazon DynamoDB table. The company wants to also store the data in Amazon S3 for historical analysis. The data volume is about 100 GB per day. The data engineer needs to design a solution to copy data from DynamoDB to S3 with minimal impact on the DynamoDB table. What should the data engineer do?

A.Enable DynamoDB Streams on the table and configure a Lambda function to write changes to S3.
B.Create a global secondary index on the table and export the index to S3.
C.Use AWS Glue to scan the DynamoDB table and write results to S3 every hour.
D.Use the DynamoDB Export to S3 feature to export the entire table daily.
AnswerA

Streams capture changes with low latency and minimal impact on the table.

Why this answer

Option A is correct. Using DynamoDB Streams with a Lambda function that writes to S3 is a common pattern for real-time replication with minimal impact. Option D is wrong because Export to S3 is a one-time or scheduled export, not continuous.

Option C is wrong because using Scan would consume read capacity and impact performance. Option B is wrong because adding a secondary index does not help with exporting data.

41
MCQhard

A company uses AWS Glue ETL to transform data from Amazon RDS for PostgreSQL to Amazon S3. The transformation includes joining several tables and aggregating millions of rows. The job runs successfully but takes over 2 hours. The data engineer wants to reduce runtime. Which action is MOST effective?

A.Enable Auto Scaling for the Glue job.
B.Use AWS Glue DynamicFrames instead of DataFrames.
C.Increase the number of DPUs for the Glue job.
D.Convert the source data to Parquet format.
AnswerC

More DPUs increase parallelism and reduce execution time.

Why this answer

Option C is correct because increasing the number of DPUs (Data Processing Units) in AWS Glue ETL jobs can significantly reduce runtime by parallelizing the transformation. Option A is wrong because Auto Scaling only adjusts capacity up to the configured maximum and may not help if DPUs are already maxed. Option D is wrong because converting to Parquet can help with reading from S3, but the source is RDS, not S3.

Option B is wrong because using DynamicFrames instead of DataFrames does not directly improve the performance of reading from RDS.

42
MCQhard

A company ingests streaming data from multiple sources into a single Kinesis Data Streams stream. Each source produces records with a different schema. The data must be routed to different S3 prefixes based on the source. Which approach minimizes transformation overhead?

A.Use a single Kinesis Data Firehose with a Lambda transformation that reads schema metadata from DynamoDB to determine the S3 prefix.
B.Ingest all data into S3 and use AWS Glue ETL jobs to partition and route data to different prefixes.
C.Use separate Kinesis Data Streams for each source and configure separate Firehose delivery streams.
D.Use Kinesis Data Analytics to run SQL queries that route data to different Firehose streams.
AnswerA

This approach uses DynamoDB for schema metadata and Lambda for lightweight routing.

Why this answer

Option A is correct because using a DynamoDB table to store schema metadata and a Lambda function to determine the appropriate S3 prefix for each record is efficient and scalable. Option C is wrong because it requires multiple Kinesis streams, increasing cost and complexity. Option B is wrong because ingesting into S3 and then using Glue to route adds latency.

Option D is wrong because using Kinesis Data Analytics for routing is overkill and not designed for this purpose.

43
MCQhard

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to an S3 bucket. The delivery stream is configured with a buffer interval of 60 seconds and a buffer size of 5 MB. The data arrives at an average rate of 2 MB per second. What is the expected time interval between S3 writes?

A.Approximately 2.5 seconds
B.Approximately 30 seconds
C.Approximately 60 seconds
D.Approximately 10 seconds
AnswerA

The buffer size of 5 MB fills in 2.5 seconds at 2 MB/s, triggering a write.

Why this answer

Option A is correct because with 2 MB/s, the buffer size of 5 MB will be reached in 2.5 seconds, but the buffer interval is 60 seconds. Firehose writes when either condition is met first. Since the buffer size is reached much earlier, writes will occur approximately every 2.5 seconds.

Options B, C, and D are incorrect because they are based on the interval or on miscalculations.
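
The arithmetic behind the 2.5-second figure can be checked with a short sketch. It assumes a steady arrival rate and models only the "whichever threshold is hit first" rule described above; the function name is hypothetical.

```python
def seconds_between_flushes(
    buffer_size_mb: float, buffer_interval_s: float, rate_mb_per_s: float
) -> float:
    """Return the time until the first buffering threshold is reached."""
    time_to_fill_buffer = buffer_size_mb / rate_mb_per_s
    # Delivery happens when either the size or the interval threshold is met.
    return min(time_to_fill_buffer, buffer_interval_s)


# Values from the question: 5 MB buffer, 60 s interval, 2 MB/s arrival rate.
interval = seconds_between_flushes(5, 60, 2)  # 2.5

# At a much lower rate, the 60-second interval becomes the limiting factor.
slow_interval = seconds_between_flushes(5, 60, 0.05)  # 60
```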

44
MCQhard

A healthcare company is building a data pipeline to ingest electronic health records (EHR) from hospitals. The data is sent as JSON files via SFTP to an on-premises server. The company wants to move this data to AWS using AWS Transfer Family (SFTP) and then process it with AWS Glue. Data sovereignty regulations require that all data remain within the EU (Frankfurt) region. The pipeline must detect when a new file arrives and start the Glue job automatically. The engineer has set up an AWS Transfer Family server in Frankfurt, and files are uploaded to an S3 bucket in the same region. However, the Glue job is not triggering automatically. The engineer needs to implement automated triggering. What should the engineer do?

A.Configure AWS Step Functions to poll the S3 bucket every minute and start the Glue job if new files exist.
B.Configure Amazon CloudWatch Events to trigger the Glue job on a schedule that checks for new files.
C.Use Amazon Simple Queue Service (SQS) to queue file metadata and have a Lambda function poll the queue to start the Glue job.
D.Set up an S3 event notification on the bucket to invoke an AWS Lambda function that starts the Glue job.
AnswerD

S3 event notifications can invoke Lambda immediately when a new file is uploaded.

Why this answer

Option D is correct: S3 event notifications can be configured to invoke Lambda when a new object is created, and Lambda can start the Glue job. Option B (CloudWatch Events on a schedule) checks at fixed intervals instead of reacting to object creation. Option C (SQS) is not needed.

Option A (Step Functions) adds unnecessary complexity.

45
MCQeasy

A company needs to ingest data from multiple on-premises databases into Amazon S3 for analytics. The databases include Oracle, MySQL, and PostgreSQL. The data must be continuously replicated with minimal latency. Which AWS service should be used?

A.AWS Database Migration Service (AWS DMS)
B.Amazon Kinesis Data Streams
C.AWS Snowball
D.AWS Glue
AnswerA

DMS can continuously replicate from multiple source databases to S3.

Why this answer

Option A is correct because AWS DMS supports heterogeneous database migrations and continuous replication to S3. Option D is wrong because Glue is batch-oriented. Option B is wrong because Kinesis Data Streams is for streaming data, not database replication.

Option C is wrong because Snowball is for large offline transfers.

46
MCQeasy

A company needs to ingest data from an external FTP server into AWS S3. The FTP server is not accessible from the internet. Which AWS service should be used to securely transfer the data?

A.Kinesis Data Firehose
B.AWS Transfer Family with SFTP endpoint in a VPC
C.AWS DataSync
D.AWS Snowball Edge
AnswerB

Transfer Family supports SFTP and can be deployed in a VPC to access private FTP servers.

Why this answer

Option B is correct because AWS Transfer Family supports SFTP and can be used with a VPC endpoint to transfer data from an FTP server in a private network to S3. Option D is wrong because Snowball Edge is for large offline data transfers, not regular FTP transfers. Option C is wrong because DataSync is for moving data between on-premises storage and AWS, but it requires network access.

Option A is wrong because Kinesis Data Firehose is for streaming data, not FTP transfers.

47
MCQhard

A data engineering team needs to ingest streaming data from thousands of IoT devices. The data must be processed in near real-time and stored in Amazon S3 in Apache Parquet format partitioned by device_id and timestamp. Which combination of services should the team use to minimize operational overhead and cost?

A.Amazon Kinesis Data Streams, Amazon EC2 for processing, and Amazon S3 with lifecycle policies.
B.Amazon MSK (Kafka), AWS Glue Streaming, and Amazon S3.
C.Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and optionally AWS Lambda.
D.Amazon S3 Transfer Acceleration and AWS Lambda for event-driven transformation.
AnswerC

Kinesis provides serverless ingestion and Firehose handles delivery, Parquet conversion, and partitioning.

Why this answer

Option C is correct. Kinesis Data Streams ingests streaming data, Kinesis Data Firehose delivers data to S3 with built-in Parquet conversion and partitioning, and optionally uses Lambda for lightweight transformations. Option A is wrong because EC2-based processing is high overhead.

Option D is wrong because S3 Transfer Acceleration is for large file transfers, not streaming. Option B is wrong because MSK (Kafka) and Glue Streaming require more operational overhead.

48
MCQhard

Refer to the exhibit. A data engineer runs a Glue job manually and receives a ThrottlingException. The engineer checks the job run history and sees a previous failure with the same error. What is the MOST likely cause of the throttling, and which solution is MOST appropriate?

A.Increase the number of DPUs for the job to reduce runtime.
B.Implement retry logic with exponential backoff in the script that calls start-job-run.
C.Use AWS Glue reserved capacity to guarantee API throughput.
D.Delete old job runs to reduce the number of entries in the job run history.
AnswerB

Exponential backoff handles API rate limits by retrying after a delay, reducing the chance of throttling.

Why this answer

Option B is correct because the error 'Rate exceeded' indicates the Glue API rate limit has been reached. The most likely cause is multiple concurrent job runs or too many API calls. Implementing retry logic with exponential backoff in the job or script that triggers the job will handle transient throttling.

Option A (increasing DPUs) does not affect API rate. Option D (deleting old job runs) might free up quota but is not the best practice. Option C (using reserved capacity) is not applicable for Glue API calls.
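
Retry with exponential backoff can be sketched as follows. This is a generic illustration rather than Glue-specific code: `ThrottlingError`, `call_with_backoff`, and `flaky_start_job_run` are hypothetical stand-ins, and a real script would catch the throttling error raised by its AWS SDK around the start-job-run call instead.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for an API throttling error ('Rate exceeded')."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on throttling with capped, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # Give up after the final attempt.
            # Delay ceiling doubles each attempt: 1 s, 2 s, 4 s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait a random time up to the ceiling to avoid retry bursts.
            sleep(random.uniform(0, ceiling))


# Simulated call that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_start_job_run():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottlingError("Rate exceeded")
    return "jr_example"


waits = []  # Records the delays instead of actually sleeping.
result = call_with_backoff(flaky_start_job_run, sleep=waits.append)
```

Passing `sleep=waits.append` only keeps the demonstration fast; by default the helper really waits between attempts.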

49
MCQmedium

A media company ingests video files from content partners into an Amazon S3 bucket. Each video file is 10-50 GB. Upon upload, an AWS Lambda function is triggered to extract metadata (e.g., resolution, duration) and store it in DynamoDB. The company now wants to also generate a thumbnail image for each video. The thumbnail generation is CPU-intensive and can take up to 5 minutes per video. The Lambda function has a maximum execution time of 15 minutes. The company has noticed that some thumbnail generation tasks are timing out. What should the data engineer do to reliably generate thumbnails for all videos?

A.Provision an EC2 instance to run a script that polls S3 for new videos and generates thumbnails
B.Use AWS Glue with a Python shell job to generate thumbnails
C.Increase the Lambda timeout to 15 minutes and allocate more memory
D.Use AWS Batch to run a containerized thumbnail generation job triggered by S3 events
AnswerD

Batch is optimized for batch computing and can handle long-running jobs.

Why this answer

Option D is correct because AWS Batch is designed for long-running, compute-intensive jobs like video processing. Option C (increase Lambda timeout) is not a good practice for such heavy tasks. Option A (EC2 instance) requires manual management.

Option B (Glue) is for ETL, not video processing.

50
MCQmedium

A company is using Amazon Kinesis Data Streams with a Lambda consumer to process clickstream data. The data rate is high and the Lambda function is falling behind, resulting in increased processing latency. What is the MOST effective way to improve throughput?

A.Increase the memory allocated to the Lambda function.
B.Increase the Lambda function timeout.
C.Use Kinesis Data Firehose instead of Lambda.
D.Increase the number of shards in the Kinesis stream.
AnswerD

More shards increase parallelism and throughput.

Why this answer

Option D is correct because increasing the number of shards increases the stream's capacity and allows more concurrent Lambda invocations. Option A is incorrect because increasing Lambda memory may help but not as much as more shards. Option B is incorrect because a longer timeout does not add concurrency; the root cause is the shard count.

Option C is incorrect because Firehose would add a separate step.

51
MCQhard

A data engineer is designing a data ingestion pipeline for clickstream data that arrives in bursts, up to 100 MB/s, and must be processed with exactly-once semantics. The data must be stored in Amazon S3 partitioned by event date and hour. Which combination of services should the engineer use?

A.Amazon Kinesis Data Streams with AWS Lambda consumer writing to S3.
B.Amazon Kinesis Data Firehose with S3 destination and dynamic partitioning.
C.AWS Glue streaming ETL job reading from Amazon MSK and writing to S3.
D.Amazon Kinesis Data Streams with KCL application writing to S3.
AnswerB

Firehose handles bursts and supports partitioning with no custom code.

Why this answer

Amazon Kinesis Data Firehose can buffer and batch data to S3 with partitioning. Option B is correct. Option A is wrong because Lambda cold starts can cause latency.

Option C is wrong because Glue is for batch ETL, not real-time. Option D is wrong because KCL requires custom application.

52
Multi-Selecteasy

A company is using AWS Glue to catalog data in Amazon S3. The data is stored in CSV format, but the schema is not consistent across all files. Which TWO actions can the company take to handle schema evolution and ensure the Glue Data Catalog is up to date? (Choose TWO.)

Select 2 answers
A.Configure the Glue crawler to update the table schema on each run.
B.Manually update the Glue Data Catalog tables whenever the schema changes.
C.Disable schema update in the crawler and add partitions manually.
D.Schedule the Glue crawler to run periodically to detect changes.
E.Require all data producers to use a single fixed schema.
AnswersA, D

This allows the crawler to automatically detect and apply schema changes.

Why this answer

Options A and D are correct. Enabling schema update in the Glue crawler (A) allows the crawler to update the existing table schema when new columns are found. Running the crawler on a schedule (D) ensures that changes are captured periodically.

Option B (manual schema update) is not automated. Option E (using a single schema) is not practical for evolving data. Option C (disabling schema update) would not update the catalog.

53
MCQhard

A data engineer runs an AWS Glue ETL job that reads from a large Amazon S3 source (several terabytes of CSV files) and writes transformed data to an S3 bucket in Parquet format. The job fails with the error shown in the exhibit. The job uses the Standard worker type with 10 workers (G.1X). The engineer needs to resolve the failure with minimal cost increase. What should the engineer do?

A.Increase the number of workers to 20 while keeping G.1X worker type.
B.Change the worker type to G.2X with 10 workers.
C.Change the worker type to G.4X with 10 workers.
D.Set the 'coalesce' parameter to reduce the number of output files.
AnswerB

G.2X provides double the memory (32 GB) per worker compared to G.1X (16 GB), resolving the heap space error with minimal cost increase.

Why this answer

Option B is correct because the OutOfMemoryError indicates that the Spark executors do not have enough memory; switching to G.2X workers doubles memory per worker (from 16 GB to 32 GB) without increasing the number of workers, which is more cost-effective than increasing the number of workers with G.1X. Option A is wrong because increasing the number of workers with G.1X may not resolve the memory issue per executor; it spreads the data across more executors but each still has limited memory. Option D is wrong because reducing the number of files (coalesce) may not help if the issue is per-task memory.

Option C is wrong because the G.2X worker type with more memory per worker is likely sufficient; G.4X may be unnecessary and more expensive.

54
Multi-Selecthard

A company uses AWS DMS to continuously replicate data from an on-premises SQL Server to Amazon Aurora MySQL. The replication lag is increasing. Which THREE actions can reduce the lag? (Choose three.)

Select 3 answers
A.Use parallel apply on the target endpoint.
B.Filter out unnecessary tables from replication.
C.Enable DMS validation.
D.Enable Multi-AZ for the DMS replication instance.
E.Increase the DMS replication instance size.
AnswersA, B, E

Parallel apply speeds up writes on the target.

55
MCQeasy

A data engineer needs to ingest JSON data from an on-premises relational database into Amazon S3 every hour. Which AWS service should be used to set up a scheduled, incremental data transfer?

A.Amazon S3 Transfer Acceleration with a cron job.
B.AWS Database Migration Service (DMS) with S3 as target.
C.AWS Glue with a JDBC connection and a scheduled crawler.
D.Amazon Kinesis Data Firehose with a database source.
AnswerB

DMS supports scheduled, incremental transfers from databases to S3.

Why this answer

AWS DMS is purpose-built for migrating databases to AWS targets, including Amazon S3. It supports ongoing replication (change data capture) and scheduled full-load tasks, making it ideal for hourly incremental transfers from an on-premises relational database to S3 without custom scripting.

Exam trap

The trap here is that candidates confuse AWS Glue's ETL capabilities with DMS's managed database migration, assuming Glue's JDBC connections can handle incremental transfers, but Glue lacks built-in change data capture and requires custom logic for scheduled incremental loads.

How to eliminate wrong answers

Option A is wrong because S3 Transfer Acceleration only speeds up uploads over long distances via edge locations; it does not provide scheduling, incremental data capture, or database connectivity. Option C is wrong because AWS Glue crawlers are designed for schema discovery and metadata cataloging, not for scheduled incremental data transfer from a database to S3; Glue ETL jobs can do this but require custom code, whereas DMS is the managed service for database migration. Option D is wrong because Kinesis Data Firehose ingests streaming data from producers like Kinesis streams or direct PUT, not from a relational database via JDBC; it lacks built-in change data capture for incremental database loads.

56
Multi-Selecteasy

A company uses Kinesis Data Firehose to deliver streaming data to S3. They need to transform the data by adding a timestamp and removing sensitive fields. Which TWO approaches can achieve this?

Select 2 answers
A.Use Kinesis Data Analytics to transform the stream
B.Use S3 Select to transform data at rest
C.Use AWS Glue ETL to process data after delivery to S3
D.Use Amazon Redshift Spectrum to transform data
E.Configure a Lambda function as a data transformation in Firehose
AnswersC, E

Glue can transform data after it is stored in S3.

Why this answer

Option C and Option E are correct because Lambda transformation can modify records, and Glue ETL can process data from Firehose's S3 destination. Option A is wrong because Kinesis Data Analytics is for analytics, not transformation. Option B is wrong because S3 Select is for retrieving subsets of data, not transformation.

Option D is wrong because Redshift Spectrum is for querying data in S3.
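
A Firehose data-transformation Lambda for this scenario might look like the sketch below. It follows the record contract Firehose uses for transformation functions (base64-encoded `data`, and a response carrying `recordId`, `result`, and `data` for each record). The sensitive field names, the `processed_at` field, and the sample event are assumptions made for illustration, and the records are assumed to be JSON objects.

```python
import base64
import json
from datetime import datetime, timezone

# Hypothetical names of fields to strip; adjust to the real schema.
SENSITIVE_FIELDS = {"email", "ssn"}


def lambda_handler(event, context):
    """Remove sensitive fields and add a timestamp to each Firehose record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        for field in SENSITIVE_FIELDS:
            payload.pop(field, None)
        payload["processed_at"] = datetime.now(timezone.utc).isoformat()
        # Newline-delimit the JSON so objects in S3 are easy to split.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}


# Local check with a hand-built event (not a captured Firehose payload).
sample = {"page": "/home", "email": "user@example.com"}
sample_event = {"records": [{
    "recordId": "rec-1",
    "data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("utf-8"),
}]}
response = lambda_handler(sample_event, None)
```

Every incoming `recordId` is returned exactly once, which Firehose requires in order to match transformed records to their originals.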

57
MCQhard

A data engineer is designing a data ingestion pipeline for real-time clickstream data from a website. The data must be stored in Amazon S3 in near-real time, and also be available for real-time analytics using Amazon Athena. The pipeline must handle occasional spikes of up to 10x the normal throughput. Which combination of services should the engineer use?

A.Amazon Simple Queue Service (SQS) with AWS Lambda to write to Amazon S3, and Amazon Athena for queries.
B.AWS Database Migration Service (DMS) to stream data to Amazon S3, and Amazon Athena for queries.
C.Amazon Kinesis Data Streams with AWS Lambda to write to Amazon S3, and Amazon Athena for queries.
D.Amazon Kinesis Data Firehose with AWS Glue for transformation, and Amazon Athena for queries.
AnswerC

Kinesis handles spikes, Lambda writes to S3, Athena queries.

Why this answer

Option C is correct because Kinesis Data Streams can handle high throughput spikes, Lambda can process and write to S3, and Athena can query directly. Option A is wrong because SQS is pull-based and does not support push to S3 directly; Lambda would need to poll, adding latency. Option B is wrong because DMS is for database migration, not real-time clickstream.

Option D is wrong because Glue is batch-oriented and not suitable for low-latency streaming.

58
MCQeasy

A company wants to ingest real-time clickstream data from a website into Amazon S3 with minimal code. The data should be delivered within 60 seconds of generation. Which AWS service should be used?

A.Amazon Kinesis Data Firehose
B.AWS Database Migration Service (DMS)
C.Amazon Kinesis Data Streams
D.Amazon S3 Transfer Acceleration
AnswerA

Firehose is designed for near-real-time streaming ingestion into S3 with minimal configuration.

Why this answer

Option A is correct because Amazon Kinesis Data Firehose is a fully managed service that can ingest streaming data and deliver it to S3 with near-real-time latency (typically under 60 seconds). Option B is wrong because AWS Database Migration Service is for database migration, not streaming ingestion. Option C is wrong because Amazon Kinesis Data Streams requires custom consumers and does not directly deliver to S3.

Option D is wrong because Amazon S3 Transfer Acceleration speeds up uploads but does not provide streaming ingestion capabilities.

59
MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a slow internet connection (100 Mbps). The data must be transferred within 2 weeks. Which service should the engineer recommend?

A.AWS DataSync
B.AWS Snowball Edge
C.Amazon Kinesis Data Firehose
D.AWS Glue ETL job with JDBC connection to Hadoop
AnswerB

Snowball Edge physically transfers data, bypassing network limitations.

Why this answer

The correct answer is Option B (AWS Snowball Edge), a physical device for moving large data sets when the network is too slow. Transferring 50 TB at 100 Mbps would take over 46 days, exceeding the 2-week deadline. Option A (AWS DataSync) uses the network and would be too slow. Option D (AWS Glue) is for ETL, not bulk transfer.

Option C (Amazon Kinesis Data Firehose) is for streaming ingestion, not bulk transfer of historical data.
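
The 46-day figure follows from simple arithmetic. Here is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes), a fully utilized link, and no protocol overhead; the function name is illustrative only:

```python
def transfer_days(terabytes: float, megabits_per_second: float) -> float:
    """Estimate the days needed to move `terabytes` over a link of the given speed."""
    bits = terabytes * 1e12 * 8                    # decimal TB -> bits
    seconds = bits / (megabits_per_second * 1e6)   # Mbps -> bits per second
    return seconds / 86_400                        # 86,400 seconds per day


days = transfer_days(50, 100)
print(f"{days:.1f} days")  # 46.3 days, well past the 14-day deadline
```

Real transfers rarely sustain the full line rate, so the practical figure would likely be higher.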

60
Multi-Selecthard

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The Flink application reads from a Kinesis Data Streams source, performs aggregations, and writes results to Amazon S3. The application is experiencing high checkpoint failures, and the processing lag is increasing. The data volume is 50 MB/s with an average record size of 1 KB. Which TWO actions would improve checkpoint reliability and reduce lag? (Choose TWO.)

Select 2 answers
A.Decrease the checkpoint interval to complete checkpoints faster.
B.Replace the S3 sink with Kinesis Data Firehose.
C.Decrease the parallelism of the Flink application.
D.Increase the checkpoint interval in the Flink configuration.
E.Increase the number of Kinesis Processing Units (KPUs) for the application.
AnswersD, E

Less frequent checkpoints reduce overhead.

Why this answer

Options D and E are correct. Increasing the checkpoint interval makes checkpoints less frequent, which reduces their overhead, and adding KPUs gives the application more resources and parallelism to work through the backlog. Option C is wrong because reducing parallelism would worsen lag.

Option A is wrong because a shorter checkpoint interval triggers checkpoints more often, adding overhead and making failures more likely. Option B is wrong because Kinesis Data Firehose is not a direct solution to checkpoint failures.

61
MCQhard

A financial services company processes real-time stock trade data. They use Amazon Kinesis Data Streams with a shard count of 5, each shard receiving about 500 records per second. The consumer application uses the Kinesis Client Library (KCL) with DynamoDB for checkpointing. Lately, some records are being processed multiple times. What is the most likely cause?

A.The consumer application is crashing and restarting, causing re-processing of records.
B.The Kinesis stream's iterator age is exceeding the retention period.
C.The DynamoDB table used for checkpointing is throttling write requests.
D.The record size exceeds the 1 MB API limit, causing retries.
AnswerA

KCL reprocesses from last checkpoint after failure.

Why this answer

The Kinesis Client Library (KCL) uses DynamoDB to track checkpoint progress for each shard. If the consumer application crashes and restarts, the KCL will resume processing from the last committed checkpoint, which may be behind the actual processing point. This causes records that were already processed (but not yet checkpointed) to be re-processed, leading to duplicate processing.

Exam trap

The trap here is that candidates often confuse checkpoint throttling (Option C) with duplicate processing, but throttling would cause checkpoint failures and potential re-processing only if the application cannot recover, whereas the direct cause of duplicates is the gap between processing and checkpointing after a crash.

How to eliminate wrong answers

Option B is wrong because iterator age exceeding the retention period would cause data to expire and become unavailable, not cause duplicate processing. Option C is wrong because DynamoDB throttling on checkpoint writes would surface as checkpoint failures, and the question gives no indication of throttling on the checkpoint table; the symptom described is duplicate processing. Option D is wrong because a record that exceeds the size limit is rejected when the producer writes it, so it would show up as producer-side write failures or retries, not as duplicate processing of records the consumer has already read.
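
The gap between processing and checkpointing can be illustrated with a toy simulation. This is a sketch only: it does not use the KCL, and `run_consumer`, `checkpoint_every`, and `crash_after` are hypothetical names chosen for the example. It assumes the crash happens before the last record is reached.

```python
def run_consumer(records, checkpoint_every, crash_after):
    """Simulate a consumer that checkpoints every N records and crashes once.

    Returns the records in the order they were processed, duplicates included.
    """
    processed = []
    checkpoint = 0  # index of the next record to read after a restart

    # First run: process until the simulated crash.
    for i in range(checkpoint, len(records)):
        if len(processed) == crash_after:
            break  # crash: progress since the last checkpoint is lost
        processed.append(records[i])
        if (i + 1) % checkpoint_every == 0:
            checkpoint = i + 1  # commit progress

    # Restart: resume from the last committed checkpoint.
    for i in range(checkpoint, len(records)):
        processed.append(records[i])
    return processed


seen = run_consumer(list(range(10)), checkpoint_every=4, crash_after=6)
duplicates = sorted({r for r in seen if seen.count(r) > 1})
print(duplicates)  # [4, 5]
```

Records 4 and 5 were processed after the last checkpoint (at record 3) and before the crash, so they are processed again on restart. This is why consumers of at-least-once streams are usually written to be idempotent.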

62
Multi-Selectmedium

A company needs to ingest streaming data from thousands of IoT devices. The data must be processed in real-time and stored in Amazon S3. Which TWO services should be used together?

Select 2 answers
A.Amazon Kinesis Data Streams
B.Amazon Kinesis Data Firehose
C.AWS Glue
D.Amazon Simple Queue Service (SQS)
E.AWS Direct Connect
AnswersA, B

Provides real-time data ingestion.

Why this answer

Options A and B are correct. Kinesis Data Streams ingests streaming data, and Kinesis Data Firehose can deliver data from the stream to S3. Option D is wrong because SQS is a message queue, not a service for real-time stream processing.

Option C is wrong because Glue is batch-oriented. Option E is wrong because Direct Connect provides network connectivity, not data ingestion.

63
MCQmedium

A company is ingesting data from multiple sources into S3 using AWS Glue. The data engineer notices that the Glue job is failing with an OutOfMemory error. Which step should be taken to resolve this issue?

A.Reduce the volume of incoming data
B.Configure the job to use a larger memory setting
C.Use a smaller file size for input
D.Increase the number of DPUs allocated to the Glue job
AnswerD

More DPUs provide more memory and compute.

Why this answer

Option D is correct because increasing the number of DPUs provides more memory. Option A is wrong because reducing the incoming data volume is not a practical solution. Option C is wrong because smaller input files may help but do not directly address the memory shortage.

Option B is wrong because Glue does not support memory configuration directly.

64
Multi-Selectmedium

A data engineering team is using AWS DMS to migrate a 2 TB Oracle database to Amazon RDS for PostgreSQL. The migration must have minimal downtime and needs to capture ongoing changes after the full load. Which THREE resources are required for this task? (Choose three.)

Select 3 answers
A.A DMS source endpoint configured for Oracle.
B.An AWS DMS replication instance.
C.An AWS Snowball Edge device for initial data transfer.
D.An Amazon S3 bucket for staging the data.
E.A DMS target endpoint configured for Amazon RDS PostgreSQL.
AnswersA, B, E

Connects to the source Oracle database.

Why this answer

A is correct because a source endpoint for Oracle is required. B is correct because a DMS replication instance is needed to run the migration tasks. E is correct because a target endpoint for RDS PostgreSQL is required.

D is wrong because an S3 bucket is not required unless S3 is the target. C is wrong because a Snowball device is for large offline transfers and is not required for a DMS task.

65
MCQmedium

A company uses AWS Glue ETL to process data from Amazon S3 and write results to Amazon Redshift. The job fails with a memory error when processing large files. Which action should the data engineer take to resolve this issue?

A.Reduce the number of partitions in the Glue job.
B.Increase the number of DPUs allocated to the Glue job.
C.Switch to a smaller instance type in the Glue job configuration.
D.Use S3 Select to filter columns before reading into Glue.
AnswerB

More DPUs provide additional memory and compute resources.

Why this answer

Option B is correct because increasing the number of DPUs provides more memory for processing. Option A is wrong because fewer partitions make each partition larger, which raises memory pressure per task and reduces overall throughput. Option D is wrong because S3 Select filters data server-side but does not resolve the memory shortage in Glue.

Option C is wrong because using a smaller instance type would exacerbate the memory issue.

66
MCQmedium

A company wants to ingest data from a SaaS application into Amazon S3. The SaaS application supports streaming data via HTTP POST requests. The data volume is approximately 100 MB per hour, and the company needs to store the raw data in S3 for archival and later analysis. Which approach is the most cost-effective and operationally efficient?

A.Launch a t3.nano EC2 instance that runs a script to receive HTTP POST requests and write to S3.
B.Use Amazon Kinesis Data Firehose with HTTP endpoint as the source, and configure S3 as the destination.
C.Use Amazon Simple Queue Service (SQS) to queue the HTTP POST data and have an AWS Lambda function read from SQS and write to S3.
D.Use Amazon API Gateway to create a REST API that receives the data and triggers an AWS Lambda function to store it in S3.
AnswerB

Firehose is serverless, scales automatically, and directly delivers to S3 with no intermediate storage.

Why this answer

Option B is correct because Kinesis Data Firehose can accept records through direct API calls or the Kinesis Agent, and it automatically delivers data to S3 with buffering, requiring no servers to manage. Option A is wrong because running an EC2 instance adds operational overhead and cost even if the instance is small. Option C is wrong because SQS also requires a consumer (e.g., Lambda) to move data to S3, adding complexity.

Option D is wrong because API Gateway and Lambda add unnecessary layers and cost for simple ingestion.

67
MCQhard

A data engineer is setting up an Amazon Kinesis Data Analytics application to process streaming data from a Kinesis data stream named "input-stream". The application uses a reference data source from an S3 bucket. The engineer has attached the IAM policy shown in the exhibit to the application's IAM role. When starting the application, the engineer receives an 'AccessDeniedException' error. Which additional permission is required?

A.kinesis:PutRecord on the input stream
B.s3:GetObject on the S3 bucket containing the reference data
C.kinesis:CreateStream on the input stream
D.kinesis:PutRecords on the input stream
AnswerB

The application needs to read reference data from S3, so GetObject is required.

Why this answer

The Kinesis Data Analytics application needs to read reference data from the S3 bucket, which requires the s3:GetObject permission on the bucket and its objects. The error 'AccessDeniedException' indicates the IAM role lacks this specific permission to retrieve the reference data file. Option B correctly adds the missing s3:GetObject action to allow the application to fetch the reference data from S3.

Exam trap

The trap here is that candidates often confuse the direction of data flow and assume the application needs write permissions (PutRecord/PutRecords) to the input stream, when in fact it only needs read permissions (kinesis:DescribeStream, kinesis:GetShardIterator, kinesis:GetRecords) and the missing permission is for the separate S3 reference data source.

How to eliminate wrong answers

Option A is wrong because kinesis:PutRecord is used to write data to a Kinesis stream, but the application reads from the input stream as a source, not writes to it; the error is not about writing. Option C is wrong because kinesis:CreateStream is an administrative action to create a new stream, which is irrelevant to an existing stream used as input. Option D is wrong because kinesis:PutRecords is for batch writing to a stream, not for reading or for accessing reference data from S3.
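
The exhibit is not reproduced here, so the following is only a sketch of what the missing statement might look like, built as a Python dictionary and printed as JSON. The bucket name, object key, and `Sid` are hypothetical placeholders.

```python
import json

# Hypothetical bucket and key; the exhibit's real names are not shown here.
REFERENCE_BUCKET = "example-reference-bucket"
REFERENCE_KEY = "reference/lookup.csv"

reference_data_statement = {
    "Sid": "ReadReferenceData",
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    # s3:GetObject is an object-level action, so the resource is an object ARN.
    "Resource": f"arn:aws:s3:::{REFERENCE_BUCKET}/{REFERENCE_KEY}",
}

print(json.dumps(reference_data_statement, indent=2))
```

Scoping the resource to the single reference object, rather than the whole bucket, keeps the role close to least privilege.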

68
MCQeasy

A data engineer is ingesting CSV files from an Amazon S3 bucket into a Glue Data Catalog table. The files have headers, but some files have extra columns not present in the first file. The engineer wants the Glue crawler to automatically detect the schema. Which crawler configuration option should be used?

A.Configure the crawler to 'Inherit schema from table' and set the table name.
B.Configure the crawler to 'Create a single schema for each S3 path' and enable 'Merge tables'.
C.Configure the crawler to 'Create a single schema for each S3 path' without enabling 'Merge tables'.
D.Configure the crawler to 'Create a single schema for each S3 path' and set 'Each file as a separate table'.
AnswerB

This merges schemas from all files in the path.

Why this answer

Option B is correct because when CSV files have varying schemas (extra columns), the Glue crawler must be configured to 'Create a single schema for each S3 path' with 'Merge tables' enabled. This configuration instructs the crawler to union the schemas from all files in the S3 path, adding new columns as they appear, rather than creating separate tables for each schema variation.

Exam trap

The trap here is that candidates often assume 'Merge tables' is about combining multiple tables into one, when in fact it merges schemas from multiple files within the same S3 path into a single table definition.

How to eliminate wrong answers

Option A is wrong because 'Inherit schema from table' is not a valid Glue crawler configuration; crawlers do not inherit schemas from existing tables automatically. Option C is wrong because without enabling 'Merge tables', the crawler will create multiple tables for each distinct schema, not a single unified table. Option D is wrong because 'Each file as a separate table' would create a separate table per CSV file, which defeats the goal of having a single table with all columns merged.
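
The idea of taking the union of columns across files can be sketched in a few lines. This illustrates the concept only and is not the crawler's actual algorithm; `merged_header` and the sample file contents are made up for the example.

```python
import csv
import io


def merged_header(csv_texts):
    """Union the header columns of several CSV documents, keeping first-seen order."""
    columns = []
    for text in csv_texts:
        header = next(csv.reader(io.StringIO(text)))  # first row is the header
        for name in header:
            if name not in columns:
                columns.append(name)
    return columns


file_a = "id,name\n1,alice\n"
file_b = "id,name,country\n2,bob,DE\n"
print(merged_header([file_a, file_b]))  # ['id', 'name', 'country']
```

Rows from files that lack the extra column would simply have no value for it in the unified table.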

69
Multi-Selectmedium

A company uses AWS Glue to run ETL jobs daily. The jobs consume data from an Amazon RDS for MySQL database and write results to Amazon S3. The company wants to minimize the impact on the source database during extraction. Which THREE actions should the data engineer take to achieve this? (Choose THREE.)

Select 3 answers
A.Schedule the Glue job to run during off-peak hours.
B.Configure the Glue job to connect to a read replica of the RDS instance.
C.Increase the number of Glue DPUs to process data faster.
D.Disable Glue job bookmarks to force full refresh.
E.Use a JDBC connection with a WHERE clause to extract only incremental data.
AnswersA, B, E

Runs when database load is naturally low.

Why this answer

Options A, B, and E are correct. Option A: Scheduling Glue jobs during off-peak hours minimizes impact. Option B: Using a read replica avoids load on the primary database.

Option E: Using a JDBC connection with a WHERE clause to extract only new or changed data reduces the amount of data read. Option C is wrong because increasing DPUs does not reduce database load; it may increase parallelism and actually increase load. Option D is wrong because disabling job bookmarks would force full scans, increasing load.
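
Incremental extraction with a WHERE clause can be sketched as follows. The example uses Python's built-in sqlite3 as a stand-in for the MySQL source, and the `orders` table, `updated_at` column, and watermark value are all hypothetical.

```python
import sqlite3

# sqlite3 stands in for the MySQL source; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")],
)


def extract_incremental(connection, watermark):
    """Return only rows changed after the previous run's watermark."""
    cursor = connection.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    )
    return cursor.fetchall()


rows = extract_incremental(conn, "2024-01-01")
print(rows)  # [(2, '2024-01-02'), (3, '2024-01-03')]
```

Each run would store the highest `updated_at` it saw and pass it as the next watermark, so the source is never fully rescanned.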

70
Multi-Selectmedium

A data engineer is designing a pipeline to ingest daily CSV files from an SFTP server into Amazon S3. The files are large (up to 10 GB) and must be encrypted in transit. The pipeline should be fully managed and serverless where possible. Which TWO services should be used together to achieve this? (Choose TWO.)

Select 2 answers
A.Amazon Kinesis Data Firehose
B.AWS Lambda
C.AWS Glue
D.AWS Transfer Family
E.Amazon Athena
AnswersC, D

Glue can process CSV files in S3.

Why this answer

Options C and D are correct. AWS Transfer Family provides a fully managed SFTP service that can transfer files directly to S3 with encryption in transit. AWS Glue can then be used to process the CSV files in S3.

Option B is wrong because Lambda has a 15-minute timeout and may not handle large files. Option A is wrong because Kinesis Data Firehose is for streaming data, not batch file transfers. Option E is wrong because Athena is a query service, not a transfer service.

71
MCQhard

A data engineer is designing a data pipeline that ingests streaming data from Kinesis Data Streams, transforms it using AWS Lambda, and writes to S3. The Lambda function sometimes fails due to transient errors, and the engineer wants to ensure no data is lost. Which approach should be used?

A.Use the Kinesis Client Library to process records with checkpointing
B.Increase the Lambda function's timeout and memory
C.Use Kinesis Data Firehose as the delivery stream with Lambda for transformation and configure error handling with retries and a backup S3 bucket
D.Configure a dead-letter queue (DLQ) on Lambda to capture failed records
AnswerC

Firehose automatically retries on errors and can send failed records to a backup S3 bucket.

Why this answer

Option C (Use Kinesis Data Firehose as the delivery stream with Lambda for transformation and configure error handling with retries and a backup S3 bucket) is correct because Firehose provides built-in retry and error handling. Option B (Increase the Lambda function's timeout and memory) does not solve transient errors. Option D (Configure a DLQ) helps capture failures but does not automatically retry.

Option A (Use the Kinesis Client Library) requires more custom code.
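
The retry-then-backup pattern can be sketched in plain Python. This is an illustration of the idea, not Firehose's implementation; `deliver_with_retries` and `flaky_upper` are hypothetical names, and the backup list stands in for a backup S3 bucket.

```python
def deliver_with_retries(records, transform, max_attempts=3):
    """Apply `transform` to each record, retrying failures.

    Returns (delivered, backup): transformed records, and the raw records
    that still failed after `max_attempts` tries.
    """
    delivered, backup = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                delivered.append(transform(record))
                break
            except Exception:
                if attempt == max_attempts:
                    backup.append(record)  # keep the raw record instead of dropping it
    return delivered, backup


calls = {}


def flaky_upper(record):
    """Hypothetical transform: fails once for 'b' and always fails for 'c'."""
    calls[record] = calls.get(record, 0) + 1
    if record == "c" or (record == "b" and calls[record] == 1):
        raise RuntimeError("transient error")
    return record.upper()


print(deliver_with_retries(["a", "b", "c"], flaky_upper))  # (['A', 'B'], ['c'])
```

The transient failure on 'b' is absorbed by a retry, while 'c' lands in the backup list, so no record is lost.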

72
MCQmedium

A company is using Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data must be transformed from JSON to Parquet format before landing in S3. The transformation logic is simple: convert the JSON schema to Parquet. Which approach meets the requirements with the least operational overhead?

A.Use the built-in data format conversion feature of Firehose with an AWS Glue Data Catalog table
B.Use an AWS Lambda function to transform records to Parquet before sending to Firehose
C.Use Amazon Kinesis Data Analytics to convert the stream to Parquet
D.Provision an Amazon EMR cluster to convert the data in micro-batches
AnswerA

Firehose can convert to Parquet automatically.

Why this answer

Option A is correct because Firehose can natively convert JSON to Parquet using a schema from AWS Glue Data Catalog, without custom code. Option B (Lambda) requires writing and maintaining code. Option C (Kinesis Data Analytics) is overkill.

Option D (EMR) adds cluster management overhead.

73
MCQeasy

Refer to the exhibit. A Lambda function named 'IngestionProcessor' is failing. The engineer checks CloudWatch Logs and sees the log group exists but storedBytes is 0. Why might the logs show no data?

A.The Lambda execution role does not have permission to write logs to CloudWatch
B.The Lambda function is configured with a dead letter queue
C.The Lambda function has not been invoked yet
D.The log group is encrypted with a KMS key and the Lambda function lacks decrypt permission
AnswerA

Without logs:CreateLogGroup, CreateLogStream, PutLogEvents, logs are not written.

Why this answer

The log group exists but storedBytes is 0, meaning no log events have been written. Because the function is reported as failing, it is being invoked, so the most likely cause is that its execution role lacks the logs:CreateLogStream and logs:PutLogEvents permissions for CloudWatch Logs.

74
MCQeasy

A company is using Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data is delivered in 5-minute intervals. The company wants to shorten the delivery interval to 1 minute to get data faster. Which parameter should be changed in the Firehose delivery stream configuration?

A.Reduce the buffer interval from 300 seconds to 60 seconds.
B.Increase the buffer size to trigger delivery sooner.
C.Enable dynamic partitioning to deliver data more frequently.
D.Enable compression to reduce data size and speed up delivery.
AnswerA

The buffer interval controls the maximum time between deliveries.

Why this answer

Option A is correct because the buffer interval determines how often data is delivered to the destination. Changing it from 300 seconds to 60 seconds will deliver data every minute. Option B (buffer size) affects delivery based on data volume, not time.

Option D (compression) does not affect frequency. Option C (dynamic partitioning) does not affect delivery frequency.
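
Firehose flushes its buffer when either the size threshold or the interval threshold is reached, whichever comes first. A minimal sketch of that rule, with a hypothetical `should_flush` helper and made-up threshold values:

```python
def should_flush(buffered_bytes, seconds_since_flush, size_limit_bytes, interval_seconds):
    """Flush when either buffering threshold is reached, whichever comes first."""
    return buffered_bytes >= size_limit_bytes or seconds_since_flush >= interval_seconds


MB = 1024 * 1024
# Low-volume stream: the 5 MB size limit is not hit, so the interval decides.
print(should_flush(1 * MB, 60, 5 * MB, 300))  # False: 300 s interval not reached
print(should_flush(1 * MB, 60, 5 * MB, 60))   # True: 60 s interval reached
```

On a low-volume stream the interval is the binding threshold, which is why lowering it from 300 to 60 seconds is what speeds up delivery.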

75
MCQhard

Refer to the exhibit. A Lambda function with this IAM policy is used to process records from a Kinesis stream and write to S3. The function is failing with access denied errors when writing to S3. What is the issue?

A.The function needs to use Kinesis Data Analytics for transformation.
B.The Kinesis stream ARN is incorrect.
C.The Lambda function does not have permission to read from the Kinesis stream.
D.The S3 bucket name in the policy does not match the actual bucket used by the function.
AnswerD

Common cause of access denied.

Why this answer

Option D is correct because the policy allows s3:PutObject on 'my-bucket/*', but the function might be writing to a different bucket, or the policy might be missing bucket-level permissions (e.g., s3:ListBucket on the bucket itself). Option C is wrong because the policy includes GetRecords. Option B is wrong because the policy has the Kinesis actions and the errors occur on the S3 write, not on the stream read.

Option A is wrong because Kinesis Data Analytics is not part of this pipeline and would not resolve a permissions error.
