CCNA Data Ingestion and Transformation Questions — Page 9 of 9

601

Multi-Select (easy)

Which TWO AWS services can be used to ingest streaming data into Amazon S3? (Choose two.)

Select 2 answers

A. Amazon S3 Transfer Acceleration

B. Amazon Managed Streaming for Apache Kafka (Amazon MSK)

C. Amazon Kinesis Data Firehose

D. Amazon Elastic Block Store (Amazon EBS)

E. AWS Snowball

Answers: B, C

MSK can stream data to S3 via Kafka Connect S3 sink.

Why this answer

Option B and Option C are correct. Kinesis Data Firehose (C) delivers streaming data directly to S3. Amazon MSK (B) can sink data into S3 through Kafka Connect with an S3 sink connector.

Option A is wrong because S3 Transfer Acceleration speeds up uploads but is not a streaming ingestion service. Option D is wrong because EBS volumes are block storage attached to EC2 instances. Option E is wrong because Snowball is for offline data transfer.

602

Multi-Select (easy)

A data engineer is designing a real-time streaming pipeline to ingest clickstream data from a website into Amazon S3. The data must be transformed before storage. Which TWO AWS services can be used together to build this pipeline? (Choose TWO.)

Select 2 answers

A. Amazon Kinesis Data Firehose

B. AWS Glue

C. Amazon Kinesis Data Streams

D. Amazon S3 Transfer Acceleration

E. AWS Database Migration Service (DMS)

Answers: A, C

Delivers streaming data to S3 with transformation capabilities.

Why this answer

Option A and Option C are correct. Amazon Kinesis Data Streams (C) ingests the streaming data, and Amazon Kinesis Data Firehose (A) delivers it to S3 with optional transformation via Lambda. Option D (S3 Transfer Acceleration) accelerates uploads to S3 and is not a streaming service.

Option E (AWS DMS) is for database migration. Option B (AWS Glue) is primarily a batch ETL service, not the real-time path this question asks for.
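
The Lambda transformation step that Firehose can invoke before writing to S3 might look like the sketch below. The record envelope (`records`, `recordId`, base64-encoded `data`, and a `result` status) follows the shape Firehose uses for transformation functions; the field names `user_id` and `page` and the lower-casing rule are hypothetical, chosen only to illustrate a clickstream clean-up.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and hand it back for delivery to S3."""
    output = []
    for record in event["records"]:
        # Firehose passes each payload base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]))

        # Hypothetical transformation: keep two fields and lower-case the page path.
        transformed = {
            "user_id": payload.get("user_id"),
            "page": str(payload.get("page", "")).lower(),
        }

        # Newline-delimit the JSON so the S3 object holds one event per line.
        data = base64.b64encode((json.dumps(transformed) + "\n").encode("utf-8"))
        output.append(
            {
                "recordId": record["recordId"],  # must echo the incoming ID
                "result": "Ok",  # other statuses: "Dropped", "ProcessingFailed"
                "data": data.decode("utf-8"),
            }
        )
    return {"records": output}
```

Every incoming `recordId` has to appear in the response, which is why the sketch builds one output entry per input record rather than filtering the list.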

603

MCQ (medium)

A company uses AWS Glue ETL jobs to transform data from Amazon S3 to Amazon Redshift. The job reads JSON files, applies schema mapping, and writes to a Redshift table. Recently, the job started failing with memory errors. The data volume has increased tenfold. Which approach should a data engineer take to resolve this issue with minimal code changes?

A. Switch from Spark to Python Shell job type.

B. Implement batch processing with smaller file sizes.

C. Increase the number of DPUs allocated to the Glue job.

D. Use Redshift Spectrum to query data directly from S3.

Answer: C

Provides more resources for processing.

Why this answer

Option C is correct because increasing the number of DPUs (Data Processing Units) allocated to the AWS Glue job directly addresses the memory constraint caused by a tenfold increase in data volume. Glue ETL jobs run on Apache Spark, which distributes data processing across executors; more DPUs provide more memory and compute capacity, allowing the job to handle larger datasets without code changes.

Exam trap

The trap here is that candidates may assume memory errors always require code optimization (e.g., batching or partitioning), but the question explicitly asks for minimal code changes, making resource scaling the correct answer.

How to eliminate wrong answers

Option A is wrong because switching from Spark to Python Shell job type would reduce parallelism and memory capacity, as Python Shell runs on a single node with limited resources, making it unsuitable for large-scale data transformations. Option B is wrong because implementing batch processing with smaller file sizes would require significant code changes to split and manage files, contradicting the 'minimal code changes' requirement, and does not address the root cause of insufficient memory allocation. Option D is wrong because using Redshift Spectrum to query data directly from S3 bypasses the Glue ETL job entirely, which is a different architectural approach that does not resolve the memory error in the existing Glue job and may introduce new costs and complexity.

604

Multi-Select (hard)

A data engineer is building a data ingestion pipeline using AWS Glue. The source is an Amazon DynamoDB table, and the target is an Amazon S3 data lake in Parquet format. The pipeline must handle large volumes and ensure exactly-once processing. Which THREE features should the engineer use together to achieve this? (Choose THREE.)

Select 3 answers

A. Use Amazon Kinesis Data Streams to capture DynamoDB Streams changes.

B. Configure the Glue job to convert data to Parquet format.

C. Use Amazon S3 Object Lambda to transform data on the fly.

D. Enable job bookmarks in the Glue job to track processed items.

E. Use DynamoDB's export to S3 feature to get a full snapshot.

Answers: B, D, E

Parquet is columnar and efficient for analytics.

Why this answer

Options B, D, and E are correct. Option D: Glue job bookmarks track processed data to avoid duplicates. Option E: The DynamoDB export to S3 feature is efficient for large volumes and provides a consistent snapshot.

Option B: Converting to Parquet in the Glue job is a common pattern. Option A is wrong because Kinesis Data Streams is for streaming, not for DynamoDB bulk export. Option C is wrong because S3 Object Lambda modifies data on read and is not an ingestion mechanism.

605

MCQ (medium)

A data engineer uses AWS Glue to process data from S3. The Glue job frequently fails with 'Out of Memory' errors. The job reads several large compressed files. What is the MOST effective way to resolve this issue without changing the code?

A. Increase the number of G.1X workers or use G.2X workers

B. Convert the compressed files to uncompressed format before processing

C. Repartition the data to fewer partitions

D. Increase the job timeout setting

Answer: A

More workers or higher memory workers provide more heap space for processing.

Why this answer

Option A is correct because adding G.1X workers increases the total memory available to the job, and moving to G.2X workers increases the memory available to each worker. Option B (converting the files to an uncompressed format) may help but does not directly address memory. Option C (repartitioning) requires a code change.

Option D (increasing the timeout) does not fix memory.
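
The difference between adding workers and choosing a larger worker type can be made concrete with a little arithmetic. The sketch below assumes the published sizes of 16 GB per G.1X worker and 32 GB per G.2X worker; the helper names and the job name `clickstream-etl` are hypothetical, and the dictionary only mirrors the keyword arguments that boto3's `glue.start_job_run` accepts without calling AWS.

```python
# Memory per worker for the two Glue worker types named in the question.
WORKER_MEMORY_GB = {"G.1X": 16, "G.2X": 32}


def total_memory_gb(worker_type, number_of_workers):
    """Aggregate memory across all workers of the given type."""
    return WORKER_MEMORY_GB[worker_type] * number_of_workers


def job_run_arguments(job_name, worker_type, number_of_workers):
    """Keyword arguments in the shape boto3's glue.start_job_run accepts."""
    return {
        "JobName": job_name,
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


print(total_memory_gb("G.1X", 10))  # 160
print(total_memory_gb("G.1X", 20))  # 320, more workers, same memory per worker
print(total_memory_gb("G.2X", 10))  # 320, same worker count, double memory per worker
print(job_run_arguments("clickstream-etl", "G.2X", 10))
```

Both changes are job configuration rather than script edits. If the input uses a non-splittable codec such as gzip, each file is read by a single task, so the per-worker figure is likely to matter more than the total.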

606

MCQ (easy)

A data engineer is setting up a data ingestion pipeline using Amazon Kinesis Data Firehose to deliver web server logs to Amazon S3. The logs are in JSON format and the engineer wants to convert them to Parquet format. The engineer has configured a Glue table for the schema. However, when testing, the Firehose delivery stream fails with 'Error converting to Parquet'. The engineer checks the Glue table schema and notices that it includes a column 'timestamp' of type 'string' in the format 'yyyy-MM-dd HH:mm:ss'. The logs have a 'timestamp' field in the same format. What is the MOST likely cause of the failure?

A. Firehose does not support Parquet conversion.

B. The Glue table schema does not match the data schema exactly.

C. The S3 bucket lacks write permissions for Firehose.

D. The 'timestamp' column must be of type 'timestamp' instead of 'string'.

Answer: B

Mismatch between schema and data causes conversion failure.

Why this answer

Option B is correct because Firehose requires the schema to match the data exactly. If the Glue table schema is not consistent with the data, conversion fails. Option A is wrong because Firehose supports Parquet conversion using a Glue table schema.

Option C is wrong because S3 bucket permissions would cause a different error. Option D is wrong because string type is acceptable for conversion.

607

MCQ (easy)

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time and then stored in Amazon S3. Which AWS service should be used to perform the transformation?

A. AWS Glue

B. Amazon EMR

C. Amazon Kinesis Data Analytics

D. Amazon Athena

Answer: C

Kinesis Data Analytics provides real-time stream processing capabilities.

Why this answer

Option C is correct because Amazon Kinesis Data Analytics can process and transform streaming data in real time using SQL or Apache Flink. Option A is wrong because AWS Glue is a batch ETL service, not real-time. Option B is wrong because Amazon EMR is for big data processing, not real-time streaming transformation.

Option D is wrong because Amazon Athena is an interactive query service, not for real-time transformation.

608

MCQ (medium)

A company uses AWS DMS to migrate an on-premises PostgreSQL database to Amazon RDS for PostgreSQL. After initial load, ongoing replication is set up. The replication task shows 'Task status: failed with error: The specified LSN is not available in the source database logs.' What is the most likely cause?

A. DMS does not support PostgreSQL as a source for ongoing replication.

B. The source database's network security group blocks outbound traffic to DMS.

C. The full load was incomplete, preventing CDC from starting.

D. The source database's WAL retention period is too short, and required logs have been purged.

Answer: D

DMS uses WAL logs for CDC; if logs are purged, replication fails.

Why this answer

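The error indicates that the required WAL logs have been removed or are not accessible. Option D is correct because DMS needs the WAL logs for CDC, and a retention period that is too short lets them be purged before DMS reads them. Option B is wrong because a network connectivity problem would cause a different error.

Option A is wrong because DMS supports PostgreSQL as a source. Option C is wrong because ongoing replication does not depend on a full load, and the error points to missing logs rather than an incomplete load.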

609

MCQ (easy)

Refer to the exhibit. A data engineer runs the above CLI command to find files smaller than 1000 bytes in a bucket. The command returns an empty array, but the engineer knows there are small files. What is the issue?

A. The prefix is incorrect; it should be 'logs/2023/01/01/'.

B. The bucket policy does not allow listing objects.

C. The query syntax is invalid; use a filter instead.

D. The Size is compared as a string, not an integer; remove quotes around '1000'.

Answer: D

JMESPath comparison requires numeric types.

Why this answer

Option D is correct because the Size attribute is a number, but the command compares it to the string '1000', which causes a type mismatch. Option A is wrong because the prefix is valid. Option C is wrong because the query syntax is correct.

Option B is wrong because a bucket policy does not affect CLI query results if access is allowed.
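
The type mismatch can be illustrated without the original exhibit, which is not reproduced here. The sketch below is plain Python that mimics one JMESPath rule, namely that ordering comparisons are defined for numbers only and evaluate to null otherwise, so a filter comparing a numeric `Size` to a string keeps nothing. It is not the `jmespath` library, and the object listing is hypothetical.

```python
def ordering_less_than(left, right):
    """Mimic a JMESPath '<': defined for numbers only, otherwise null (None)."""
    numeric = (int, float)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(left, bool) or isinstance(right, bool):
        return None
    if isinstance(left, numeric) and isinstance(right, numeric):
        return left < right
    return None


def filter_smaller_than(objects, threshold):
    """Keep objects whose Size compares as less than the threshold."""
    return [obj for obj in objects if ordering_less_than(obj["Size"], threshold)]


# Hypothetical listing standing in for the bucket contents.
contents = [
    {"Key": "a.log", "Size": 512},
    {"Key": "b.log", "Size": 4096},
]

print(filter_smaller_than(contents, "1000"))  # [] because the threshold is a string
print(filter_smaller_than(contents, 1000))    # [{'Key': 'a.log', 'Size': 512}]
```

In an actual `--query` expression the numeric literal is normally written in backticks, which is what makes JMESPath treat it as a number rather than a string.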

610

MCQ (hard)

A company uses Amazon Kinesis Data Analytics for real-time anomaly detection on clickstream data. The application uses a sliding window of 1 minute. The data engineer notices that the application is producing incorrect results because late-arriving records are not being handled properly. What should the data engineer do to ensure late records are included in the window calculations?

A. Use a Kinesis Data Firehose to buffer the data and then send to Kinesis Data Analytics.

B. Increase the watermark delay in the Kinesis Data Analytics application to allow more time for late records.

C. Increase the window size from 1 minute to 2 minutes.

D. Increase the retention period of the Kinesis stream to 7 days.

Answer: B

Watermark delay controls how long the application waits for late data.

Why this answer

Option B is correct because Kinesis Data Analytics allows configuring a higher watermark delay, which tells the application to wait longer for late-arriving records. Option C is wrong because increasing the window size changes the semantics of the calculation. Option D is wrong because modifying the stream retention does not affect the application's window.

Option A is wrong because ordering is not the issue; the application needs to wait for late data, and buffering through Firehose does not provide that.
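
The effect of the watermark delay can be seen in a small simulation. This is a simplified model rather than the Kinesis Data Analytics or Apache Flink API: it uses fixed 60-second buckets instead of a sliding window, treats event times as integer seconds, and discards a record once the watermark has passed the end of its window. The function name and the sample arrival sequence are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 60


def count_per_window(event_times, watermark_delay):
    """Count events per 60-second window, dropping records behind the watermark.

    event_times is listed in arrival order. The watermark trails the largest
    event time seen so far by watermark_delay seconds; a record is discarded
    when its window has already closed (window end <= watermark).
    """
    counts = Counter()
    max_seen = float("-inf")
    for t in event_times:
        watermark = max_seen - watermark_delay
        window_start = (t // WINDOW_SECONDS) * WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            continue  # too late: its window was already finalised
        counts[window_start] += 1
        max_seen = max(max_seen, t)
    return dict(counts)


# Event times in seconds; 58 and 55 arrive after later events have been seen.
arrivals = [10, 50, 65, 58, 130, 55]

print(count_per_window(arrivals, 0))   # {0: 2, 60: 1, 120: 1}
print(count_per_window(arrivals, 10))  # {0: 3, 60: 1, 120: 1}
print(count_per_window(arrivals, 90))  # {0: 4, 60: 1, 120: 1}
```

With no delay both late records are lost from the first window; a 10-second delay recovers the one that is slightly late, and a 90-second delay recovers both, at the cost of results being emitted later.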
