CCNA Data Ingestion Transformation Questions

376
MCQ · hard

A data engineer is designing a data ingestion pipeline for real-time financial transactions. The pipeline must ensure exactly-once processing semantics and must handle duplicate records that may occur due to retries. Which combination of AWS services can achieve exactly-once processing?

A. Amazon Kinesis Data Streams with Amazon Kinesis Data Analytics for Apache Flink
B. Amazon SQS with AWS Lambda
C. Amazon MSK with AWS Lambda
D. Amazon Kinesis Data Firehose with AWS Lambda
Answer: A

Flink supports exactly-once processing with KDS.

Why this answer

Option A is correct because Kinesis Data Streams assigns each record a sequence number, and Kinesis Data Analytics for Apache Flink can combine Flink's checkpoint-based exactly-once semantics with idempotent sinks. Option B (SQS with Lambda) and Option D (Kinesis Data Firehose with Lambda) both provide at-least-once delivery, so retries can produce duplicates.

Option C (MSK with Lambda) can achieve exactly-once but requires more configuration and is not as straightforward.
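
Exactly-once results depend on a sink that tolerates replayed records. The following is a minimal sketch of that idea, assuming a hypothetical in-memory sink keyed by shard ID and sequence number; the class and field names are illustrative and are not part of any AWS or Flink API.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentSink:
    """Hypothetical sink that ignores records it has already stored."""
    rows: dict = field(default_factory=dict)

    def write(self, shard_id: str, sequence_number: str, payload: dict) -> bool:
        key = (shard_id, sequence_number)
        if key in self.rows:
            return False  # replayed record: writing it again has no effect
        self.rows[key] = payload
        return True


sink = IdempotentSink()
# The second delivery simulates a retry of the first record.
deliveries = [
    ("shard-0", "001", {"txn": "a", "amount": 10}),
    ("shard-0", "001", {"txn": "a", "amount": 10}),
    ("shard-0", "002", {"txn": "b", "amount": 25}),
]
written = [sink.write(*delivery) for delivery in deliveries]
total = sum(row["amount"] for row in sink.rows.values())
```

Because the retried record maps to the same key, the stored total reflects each transaction once even though it was delivered twice.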

377
Multi-Select · medium

A data engineering team is designing a data ingestion pipeline for a social media analytics platform. The pipeline must handle up to 100,000 events per second with less than 1 second processing latency. Which TWO services should be used together to meet these requirements?

Select 2 answers
A. AWS Glue streaming ETL
B. Amazon Kinesis Data Firehose
C. Amazon SQS
D. Amazon Kinesis Data Analytics for Apache Flink
E. Amazon Kinesis Data Streams
Answers: D, E

Flink can process streaming data with sub-second latency.

Why this answer

E (Kinesis Data Streams) provides the high-throughput ingestion layer, and D (Kinesis Data Analytics for Apache Flink) provides low-latency stream processing. C (SQS) is a message queue, not optimized for high-throughput streams. B (Firehose) buffers records before delivery, which adds latency.

A (Glue streaming ETL) processes data in micro-batches, so it is not suited to sub-second latency.

378
Multi-Select · medium

A data engineering team is building a pipeline to transform CSV files uploaded to Amazon S3 into Parquet format using AWS Glue. The transformation must be serverless and handle files that arrive at irregular intervals. Which TWO actions should the team take? (Choose two.)

Select 2 answers
A. Configure an Amazon EMR cluster with Apache Spark for on-demand transformation.
B. Use an AWS Glue ETL job to convert CSV to Parquet.
C. Use AWS Data Pipeline to schedule a periodic transformation.
D. Use Amazon Redshift Spectrum to convert files during query execution.
E. Set up an S3 event notification to invoke an AWS Lambda function that triggers the Glue job.
Answers: B, E

Glue ETL jobs are serverless and can transform data formats.

Why this answer

Option B: an AWS Glue ETL job is serverless and can convert CSV to Parquet. Option E: an S3 event notification can invoke a Lambda function that starts the Glue job whenever a file arrives. Option A (Amazon EMR) is cluster-based and requires cluster management.

Option C (AWS Data Pipeline) is not serverless. Option D (Amazon Redshift Spectrum) queries data in place but does not transform file formats.

379
MCQ · easy

A company needs to ingest data from multiple SaaS sources (e.g., Salesforce, Marketo) into Amazon S3 for analytics. Which AWS service is designed for this purpose?

A. AWS Transfer Family
B. AWS Glue
C. Amazon AppFlow
D. AWS DataSync
Answer: C

AppFlow is purpose-built for SaaS data ingestion.

Why this answer

Option C is correct because Amazon AppFlow is a fully managed service for securely transferring data from SaaS applications to AWS. Option B (Glue) can connect to JDBC sources but not directly to many SaaS APIs. Option D (DataSync) is for file-based transfers.

Option A (Transfer Family) is for SFTP, FTPS, and FTP file transfers.

380
MCQ · hard

A financial services company ingests stock trade data from multiple exchanges into an Amazon S3 bucket (trade-bucket). Each exchange sends a CSV file every 5 minutes. The data must be transformed into Parquet format and partitioned by exchange and date (trade_date) for efficient querying using Amazon Athena. The pipeline must handle late-arriving data (files up to 2 hours late) and ensure exactly-once processing to avoid duplicates. Currently, a scheduled AWS Glue ETL job runs every hour, reads new CSV files, converts them to Parquet, and writes to an output bucket. However, the team is experiencing data duplication: if the job fails midway, upon retry it reprocesses all files in the input folder, causing duplicates in the output. Additionally, the job takes too long because it scans all files each run. The engineer must redesign the pipeline to eliminate duplicates and improve efficiency. What should the engineer do?

A. Use AWS Glue Workflows to orchestrate the job and add a condition to check for duplicates before writing.
B. Set up an S3 event notification to invoke an AWS Lambda function that starts a Glue job with a parameter containing the S3 object key of the new file; modify the Glue job to process only that file and use the file key to avoid duplicates.
C. Modify the Glue job to move processed CSV files to an archive folder after successful transformation, and process only unprocessed files.
D. Replace Glue with Amazon EMR and use Spark Structured Streaming with checkpointing to process files incrementally.
Answer: B

This ensures each file is processed exactly once, and the job runs only on new files, improving efficiency.

Why this answer

Option B is the best approach: an S3 event notification triggers Lambda, which starts a Glue job and passes the file key; the job processes only that file and uses the key to avoid reprocessing. Option C (archive folder) still processes every file in the input folder and may cause duplicates if a file arrives after the job starts. Option A (Glue Workflows) does not solve the file-level tracking.

Option D (EMR) is overkill and still scans the entire input folder.
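
The per-file trigger can be sketched as a small Lambda handler. This is an illustration under stated assumptions, not a reference implementation: the job name `csv-to-parquet` and the argument names `--source_bucket` and `--source_key` are hypothetical, and the Glue client is injected so the sketch runs without AWS access.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "csv-to-parquet"  # hypothetical job name


def build_job_runs(event: dict) -> list:
    """Turn an S3 event notification into one Glue job request per object."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        runs.append({
            "JobName": GLUE_JOB_NAME,
            "Arguments": {"--source_bucket": bucket, "--source_key": key},
        })
    return runs


def handler(event, context, glue_client=None):
    """Lambda entry point; glue_client is optional so the sketch runs offline."""
    runs = build_job_runs(event)
    if glue_client is not None:
        for run in runs:
            glue_client.start_job_run(**run)
    return runs


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "trade-bucket"},
                "object": {"key": "nyse/2024-01-02/trades+0905.csv"}}}
    ]
}
requests = handler(sample_event, None)
```

In a deployed function the client would typically be `boto3.client("glue")`. One way to make retries harmless is for the job to derive its output path from the source key, so a rerun overwrites the same object instead of adding a new one.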

381
Multi-Select · easy

A company is building a data lake on Amazon S3 and needs to ingest data from various on-premises sources. Which TWO AWS services can be used to transfer data securely over the internet?

Select 2 answers
A. AWS Snowcone
B. AWS DataSync
C. Amazon Kinesis Data Firehose
D. AWS Direct Connect
E. AWS CLI
Answers: B, E

DataSync can transfer data over the internet.

Why this answer

Options B and E are correct because AWS DataSync can transfer data over the internet with encryption in transit, and the AWS CLI uploads to S3 over HTTPS. Option D is incorrect because Direct Connect is a dedicated physical connection, not an internet transfer. Option A is incorrect because Snowcone is a physical device.

Option C is incorrect because Kinesis Data Firehose is for streaming, not file transfer.

382
Multi-Select · medium

A company is building a data lake on Amazon S3. They need to ingest data from multiple sources, including relational databases, streaming data, and log files. Which THREE AWS services can be used to ingest data into the data lake?

Select 3 answers
A. Amazon Kinesis Data Firehose
B. AWS Database Migration Service (DMS)
C. Amazon Athena
D. Amazon Redshift Spectrum
E. AWS Glue
Answers: A, B, E

Ingests streaming data into S3.

Why this answer

Options A, B, and E are correct: Kinesis Data Firehose for streaming data, AWS DMS for relational databases, and AWS Glue for batch ETL. Option C is wrong because Athena is a query service, not an ingestion service.

Option D is wrong because Redshift Spectrum is for querying S3 data from Redshift.

383
Multi-Select · easy

A data engineer needs to ingest streaming data from an e-commerce application into Amazon S3 for near-real-time analytics. The solution must handle variable throughput and allow reprocessing of failed records. Which TWO AWS services should the engineer use? (Choose two.)

Select 2 answers
A. Amazon Kinesis Data Streams
B. Amazon SQS
C. Amazon DynamoDB Streams
D. Amazon Kinesis Data Firehose
E. Amazon RDS
Answers: A, D

Kinesis Data Streams ingests and stores streaming data in shards for real-time processing.

Why this answer

Options A and D are correct: Amazon Kinesis Data Streams ingests streaming data reliably and retains it so failed records can be reprocessed, and Amazon Kinesis Data Firehose delivers it to S3 with retry logic. Option B (Amazon SQS) is for message queues, not streaming ingestion. Option E (Amazon RDS) is a database.

Option C (Amazon DynamoDB Streams) is for CDC from DynamoDB, not general streaming. AWS Lambda, which is not among the options, can process records but is not the primary ingestion service for high-throughput streaming.

384
MCQ · medium

Refer to the exhibit. A data engineer is attaching this IAM policy to an IAM role used by an AWS Glue job. The job reads from a Kinesis Data Streams stream and writes transformed data to an S3 bucket. When the job runs, it fails with an AccessDenied error for the Kinesis stream. What is the MOST likely cause?

A. The stream ARN in the policy is incorrect.
B. The IAM policy is missing the 'kinesis:DescribeStream' action.
C. The Glue job does not have permissions to call 'kinesis:PutRecord'.
D. The S3 bucket policy blocks the PutObject action from the Glue role.
Answer: B

Required to read stream metadata.

Why this answer

Option B is correct because the policy lacks 'kinesis:DescribeStream', which Glue needs to read stream metadata before it can consume records. Option D is wrong because the error is reported for the Kinesis stream, not for S3. Option A is wrong because the resource ARN is correctly formatted.

Option C is wrong because the job only reads from the stream; reading relies on actions such as 'kinesis:GetRecords' and 'kinesis:DescribeStream', not 'kinesis:PutRecord'.
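
The exhibit is not reproduced on this page, so the policy below is an invented stand-in rather than the original. It sketches how a missing read action could be spotted; the account ID, stream name, and bucket name are placeholders, and the check deliberately ignores wildcards and Deny statements.

```python
# Read actions a stream consumer commonly needs (assumed set for this sketch).
REQUIRED_READ_ACTIONS = {
    "kinesis:DescribeStream",
    "kinesis:GetShardIterator",
    "kinesis:GetRecords",
}

# Hypothetical policy: stream reads are allowed, but DescribeStream is absent.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kinesis:GetShardIterator", "kinesis:GetRecords"],
            "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::example-output-bucket/*",
        },
    ],
}


def allowed_actions(policy_doc: dict) -> set:
    """Collect the literal action names from Allow statements."""
    actions = set()
    for statement in policy_doc["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        listed = statement["Action"]
        actions.update([listed] if isinstance(listed, str) else listed)
    return actions


missing = REQUIRED_READ_ACTIONS - allowed_actions(policy)
```

Here `missing` contains only `kinesis:DescribeStream`, which matches the failure described in the question.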

385
MCQ · easy

An organization needs to ingest data from on-premises databases into AWS S3 for archival purposes. The data volume is several TB per day, and the network has moderate bandwidth. Which AWS service is BEST suited for this bulk data transfer?

A. AWS Direct Connect
B. Amazon S3 Transfer Acceleration
C. AWS Snowball
D. AWS DataSync
Answer: D

DataSync automates and accelerates moving data from on-premises to AWS.

Why this answer

Option D is correct because AWS DataSync is designed for large-scale online data transfers from on-premises storage to AWS. Option C is wrong because Snowball is for offline transfer, not a recurring daily load. Option A is wrong because Direct Connect is a dedicated network connection, not a data transfer service.

Option B is wrong because S3 Transfer Acceleration speeds up uploads over the internet but is not specifically designed for on-premises bulk transfer.

386
MCQ · medium

A company uses AWS DMS to migrate data from Oracle to Aurora MySQL. During the ongoing replication, the target table shows duplicate primary key errors. What is the most likely cause?

A. DMS is using 'Limited LOB mode' and truncating LOB data, causing row mismatches.
B. The source table has a trigger that inserts additional rows.
C. The target table has an auto-increment column, and DMS is inserting explicit values that conflict.
D. The DMS task is configured with 'Parallel apply' threads that cause race conditions.
Answer: C

DMS inserts values for the PK, but auto-increment may also generate values, causing duplicates.

Why this answer

Option C is correct: duplicate primary key errors often occur when the target table has an auto-increment column whose generated values conflict with the explicit values DMS inserts. Option A is wrong because limited LOB mode truncates large objects but does not create duplicate keys.

Option B is wrong because rows inserted by a source trigger are replicated as ordinary changes. Option D is less likely because parallel apply threads cause duplicates only if they are not properly configured.

387
MCQ · hard

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The jobs run nightly and take 3 hours to complete. The data volume is growing by 20% each month. The engineer needs to reduce job runtime and cost. The source RDS is a db.r5.large instance. Which approach would be MOST effective?

A. Reduce the number of DPUs to lower cost and accept longer runtime.
B. Increase the number of Glue workers and choose a G.1X or G.2X worker type.
C. Create a read replica of the RDS instance and point the Glue job to the replica.
D. Enable S3 Transfer Acceleration on the destination bucket.
Answer: B

More workers increase parallelism.

Why this answer

Option B is correct because increasing the number of Glue workers (DPUs) directly increases parallelism and reduces runtime, and you can choose a worker type with more memory if needed. Option C is wrong because an RDS read replica does not speed up Glue processing. Option D is wrong because S3 Transfer Acceleration is for client uploads to S3, not Glue processing.

Option A is wrong because reducing DPUs would increase runtime.

388
MCQ · medium

A financial company needs to ingest real-time stock trade data from multiple sources and store it in Amazon S3 for compliance. The data must be delivered within 1 minute of the trade occurring. The data volume is approximately 10,000 records per second, with occasional spikes to 50,000 records per second. The engineer has set up Amazon Kinesis Data Streams with 10 shards and a Kinesis Data Firehose delivery stream that reads from the Kinesis stream and writes to S3. However, during spikes, the Firehose delivery stream falls behind, causing data to be delayed beyond the 1-minute SLA. What should the engineer do to meet the SLA without over-provisioning?

A. Increase the buffer size in Kinesis Data Firehose from 1 MB to 5 MB to batch more data per delivery.
B. Use Amazon SQS as a buffer between Kinesis and Firehose to absorb spikes.
C. Replace Firehose with an AWS Lambda function that writes directly to S3 for lower latency.
D. Increase the number of shards in the Kinesis data stream to handle peak throughput and enable auto-scaling.
Answer: D

More shards increase read capacity; auto-scaling adjusts during spikes.

Why this answer

Option D is correct: increase the shard count to handle peak throughput, and enable automatic scaling for Kinesis Data Streams. Option A (increase buffer size) would increase latency. Option C (use Lambda) may not handle the throughput.

Option B (use SQS) would not reduce latency.
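
A rough sizing check shows why 10 shards are not enough during spikes. The sketch below uses the per-shard write limit of 1,000 records per second; the question does not give a record size, so the separate 1 MB per second per shard limit is not evaluated here, and the function name is illustrative.

```python
import math

RECORDS_PER_SHARD_PER_SEC = 1_000  # Kinesis Data Streams write limit per shard


def shards_needed(records_per_sec: int) -> int:
    """Minimum shard count for a given record rate (record-count limit only)."""
    return math.ceil(records_per_sec / RECORDS_PER_SHARD_PER_SEC)


baseline_shards = shards_needed(10_000)  # steady-state rate from the question
peak_shards = shards_needed(50_000)      # spike rate from the question
```

The baseline rate fits exactly into the existing 10 shards, while the spike rate calls for about 50, which is why scaling the stream with demand avoids provisioning for the peak at all times.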

389
Multi-Select · medium

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The data must be transformed and stored in Amazon S3 for batch analytics. The engineer wants to use AWS Lambda for transformation. Which TWO configurations are required? (Choose two.)

Select 2 answers
A. Configure the Lambda function to write to S3 via Kinesis Data Firehose.
B. Configure the Kinesis stream to send records directly to S3.
C. Set up an SQS queue as a destination for Lambda errors.
D. Create an event source mapping from the Kinesis stream to the Lambda function.
E. Assign an IAM role to Lambda with permissions to read from Kinesis and write to S3.
Answers: D, E

Event source mapping enables Lambda to poll from Kinesis.

Why this answer

Options D and E are correct. Option D: Lambda needs an event source mapping to poll records from the Kinesis stream. Option E: Lambda needs an IAM role with permissions to read from Kinesis and write to S3.

Option B is wrong because Kinesis Data Streams cannot deliver records directly to S3; a consumer must write them. Option C is wrong because an SQS queue for failed records is optional, not required. Option A is wrong because the Lambda function can write to S3 directly; Firehose is not required.
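
With an event source mapping in place, Lambda receives batches whose payloads are base64-encoded under `Records[].kinesis.data`. A minimal sketch of the transformation step follows, assuming JSON click events with a hypothetical `page` field; the S3 write is left out so the example runs without AWS access.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record and apply a simple transformation."""
    rows = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        payload["page"] = payload["page"].lower()  # hypothetical transformation
        rows.append(payload)
    # A deployed function would write `rows` to S3 here using its IAM role.
    return rows


sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(
            json.dumps({"page": "/Home", "user": "u1"}).encode("utf-8")
        ).decode("utf-8")}}
    ]
}
transformed = handler(sample_event, None)
```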

390
MCQ · medium

A company needs to ingest real-time sensor data from thousands of IoT devices into Amazon S3, with a latency of less than 1 minute. The data must be transformed (e.g., convert to Parquet) before landing in S3. Which combination of services is MOST cost-effective?

A. Amazon Kinesis Data Streams to AWS Lambda to S3.
B. Amazon Kinesis Data Streams to Amazon Kinesis Data Analytics to S3.
C. Amazon Kinesis Data Streams to AWS Glue streaming ETL to S3.
D. Amazon Kinesis Data Streams to Amazon Kinesis Data Firehose to S3.
Answer: D

Firehose can transform and deliver with low latency, cost-effective for high throughput.

Why this answer

Option D is correct because Kinesis Data Streams ingests data in real time, and Kinesis Data Firehose can buffer, transform (e.g., convert to Parquet), and deliver to S3 with low latency. Option B is wrong because Kinesis Data Analytics is meant for stream analytics, not for simple format conversion. Option C is wrong because Glue streaming ETL runs micro-batches on continuously provisioned workers, which typically costs more for this workload.

Option A is wrong because Lambda can transform but may not scale cost-effectively for thousands of devices.

391
MCQ · medium

A data engineer is designing a data ingestion pipeline to load data from an on-premises Oracle database into Amazon Redshift. The pipeline must capture changes (inserts, updates, deletes) with low latency and minimal impact on the source database. Which combination of AWS services should the engineer use?

A. AWS Database Migration Service (DMS) to Amazon S3, then COPY into Redshift
B. Amazon Kinesis Data Streams with a custom producer on Oracle
C. AWS Glue with JDBC connection to Oracle, writing to Redshift
D. AWS Lambda reading from Oracle logs and writing to Redshift
Answer: A

DMS supports ongoing replication from Oracle to S3 with CDC, and Redshift can COPY from S3.

Why this answer

Option A is correct because AWS DMS with ongoing replication (change data capture) can capture changes from Oracle with minimal impact and replicate them to S3, from where they are loaded into Redshift with COPY. Option C is wrong because AWS Glue with a JDBC connection is batch-oriented and does not support real-time CDC natively. Option D is wrong because Lambda can process events but is not designed for continuous CDC from a database.

Option B is wrong because Kinesis Data Streams requires custom producers and does not directly integrate with Oracle for CDC.

392
Multi-Select · hard

A company uses a Kinesis Data Firehose delivery stream to load data into an S3 bucket. The data is in JSON format and must be converted to Parquet before landing in S3. Which steps are required to achieve this? (Choose THREE.)

Select 3 answers
A. Configure the Firehose delivery stream to enable data format conversion to Parquet.
B. Create a table in the AWS Glue Data Catalog with the schema.
C. Store the schema in Amazon DynamoDB.
D. Set the Firehose's schema mapping to reference the Glue table.
E. Use Kinesis Data Analytics to convert the data.
Answers: A, B, D

Firehose has built-in conversion capability.

Why this answer

Options A, B, and D are correct because Firehose can convert JSON to Parquet using a schema from a Glue table, and that table must be registered in the Data Catalog and referenced in the delivery stream's schema configuration. Option C is wrong because Firehose does not read schemas from DynamoDB. Option E is wrong because Kinesis Data Analytics is for stream processing, not conversion.
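
The three steps meet in the delivery stream's format conversion settings. The fragment below sketches the `DataFormatConversionConfiguration` structure that sits inside the extended S3 destination configuration; the database name, table name, and role ARN are placeholders, not values from the question.

```python
# Sketch of a Firehose record format conversion block (placeholder values).
data_format_conversion = {
    "Enabled": True,  # turn on format conversion
    "SchemaConfiguration": {
        # Glue Data Catalog table that supplies the schema
        "DatabaseName": "example_db",
        "TableName": "example_events",
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
    },
    # Incoming records are JSON
    "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
    # Delivered objects are Parquet
    "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
}

output_serializers = list(
    data_format_conversion["OutputFormatConfiguration"]["Serializer"]
)
```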

393
MCQ · hard

A data engineer runs an AWS Glue ETL job that writes to a table in the AWS Glue Data Catalog. The job fails occasionally with the error 'Resource Not Found' for the table. The table exists. What is a likely cause?

A. The job is using an outdated version of the table schema.
B. Multiple Glue jobs are writing to the same table concurrently.
C. The table location in S3 is incorrect.
D. The Glue job name contains special characters.
Answer: A

Schema version mismatch can cause 'Resource Not Found'.

Why this answer

Option A is correct because a Glue job may hold a version of the table schema that becomes stale if the schema is updated after the job starts; the job may be referencing an old version that no longer exists. Option C is wrong because table location is irrelevant if the table exists. Option B is wrong because concurrent jobs do not cause 'Resource Not Found' errors.

Option D is wrong because the job name does not affect table access.

394
MCQ · medium

A company uses Amazon S3 to store raw data and needs to transform it into Parquet format for analytics. The transformation job runs daily on a schedule. Which AWS service is BEST suited for this task?

A. Amazon Redshift
B. Amazon EMR
C. AWS Lambda
D. AWS Glue
Answer: D

Glue is serverless, supports Parquet, and can be scheduled with triggers.

Why this answer

Option D is correct because AWS Glue is a serverless ETL service that can run scheduled jobs to convert data to Parquet. Option C is wrong because Lambda has a 15-minute timeout. Option B is wrong because EMR requires managing clusters.

Option A is wrong because Redshift is a data warehouse, not a file conversion service.

395
MCQ · medium

A data engineer runs an AWS Glue job that reads from a JDBC connection to a PostgreSQL database. The job fails with a 'Connection timed out' error. The Glue job runs in a VPC with the appropriate security group. What is the most likely cause?

A. The network ACL associated with the Glue job's subnet is blocking outbound traffic.
B. The Glue job does not have permission to access the database.
C. The security group does not allow inbound traffic from the Glue job.
D. The database credentials are incorrect.
Answer: A

Network ACLs can block traffic.

Why this answer

Option A is correct because a network ACL can block outbound traffic from the Glue job's subnet to the database, which shows up as a timeout. Option B is wrong because the error is a network error, not an authorization error. Option C is wrong because the question states that the security group is configured appropriately.

Option D is wrong because incorrect credentials would produce an authentication failure, not a connection timeout.

396
Multi-Select · hard

Which THREE factors should be considered when choosing between AWS Glue and Amazon EMR for data transformation? (Choose three.)

Select 3 answers
A. Glue automatically stores data in S3 after transformation.
B. EMR allows fine-grained control over cluster configuration and software.
C. EMR supports real-time stream processing with Spark Streaming.
D. Glue is serverless, reducing operational overhead.
E. Glue integrates natively with the Glue Data Catalog for schema management.
Answers: B, D, E

EMR provides flexibility to install custom software and tune clusters.

Why this answer

Option B is correct because Amazon EMR provides full control over cluster configuration, including the ability to customize software, install libraries, and tune Spark, Hadoop, or Hive parameters. This fine-grained control is essential for complex or specialized data transformation pipelines that require specific versions or custom configurations.

Exam trap

The trap here is that candidates may confuse Glue's automatic schema discovery with automatic data storage, or assume EMR is the only option for streaming, when in fact both services support streaming but with different levels of control and operational overhead.

397
MCQ · medium

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by an AWS Lambda function that writes to Amazon DynamoDB. The Lambda function is seeing high error rates due to DynamoDB write throttling. Which action should be taken to reduce throttling?

A. Use Amazon Kinesis Data Firehose instead of Kinesis Data Streams
B. Add an Amazon SQS queue between Lambda and DynamoDB
C. Increase the Lambda function memory
D. Enable auto scaling on the DynamoDB table
Answer: D

Auto scaling adjusts write capacity to handle spikes and reduce throttling.

Why this answer

Enabling DynamoDB auto scaling increases write capacity automatically when needed. Using Kinesis Data Firehose would change the architecture but does not address throttling directly. Increasing Lambda memory does not help with DynamoDB throttling.

Using SQS would add a queue but does not increase DynamoDB capacity.

398
Multi-Select · hard

A data engineer is troubleshooting a slow-running AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job processes 500 GB of CSV data daily. The engineer wants to improve performance. Which THREE actions should the engineer take? (Choose three.)

Select 3 answers
A. Use a JDBC connection with a higher batch size for writing to Redshift.
B. Partition the input data in S3 by date or category.
C. Switch to a single-node Redshift cluster to reduce latency.
D. Increase the number of DPUs allocated to the Glue job.
E. Reduce the number of input files by combining them into larger files.
Answers: A, B, D

Larger batch sizes reduce round trips and improve write throughput.

Why this answer

Options A, B, and D are correct. Increasing the number of DPUs provides more parallelism. Partitioning the input data improves read performance.

Using a JDBC connection with appropriate batch size improves write performance. Option E is wrong because reducing the number of files may reduce parallelism. Option C is wrong because switching to a single-node cluster reduces parallelism.

399
MCQ · easy

A company is using AWS Glue to run ETL jobs that transform data from Amazon DynamoDB to Amazon S3. The DynamoDB table has a large number of items (over 10 million) and is heavily used by production applications. The Glue job reads the entire DynamoDB table each time it runs, causing increased read capacity consumption and affecting production performance. The team wants to reduce the impact on the source DynamoDB table while still keeping the S3 data up-to-date. What should the team do?

A. Use DynamoDB Streams and AWS Lambda to capture changes and write them to S3, then run incremental Glue jobs.
B. Increase the DynamoDB read capacity units to handle the Glue job's read load.
C. Use the DynamoDB console to export the table to S3 in Parquet format.
D. Reduce the parallelism of the Glue job to lower the read throughput.
Answer: A

Captures only changes, reducing read impact.

Why this answer

Option A is correct because using DynamoDB Streams with AWS Lambda allows incremental change data capture (CDC), reducing the need to read the entire table. Option B is wrong because increasing read capacity might help but the job still reads the whole table. Option D is wrong because reducing parallelism increases job duration but still reads everything.

Option C is wrong because exporting to S3 via the console is a one-time operation, not incremental.

400
Multi-Select · medium

A data engineer is designing a near-real-time streaming pipeline to ingest clickstream data from a web application. The data must be enriched with user metadata from a DynamoDB table before being stored in S3. Which combination of AWS services should the engineer use? (Choose TWO.)

Select 2 answers
A. Amazon Kinesis Data Firehose
B. Amazon Kinesis Data Analytics for Apache Flink
C. Amazon Kinesis Data Streams
D. AWS Lambda with DynamoDB Accelerator (DAX)
E. AWS Glue Streaming ETL
Answers: B, C

Performs stream enrichment with DynamoDB lookups.

Why this answer

Options B and C are correct because Kinesis Data Streams ingests the clickstream, and Kinesis Data Analytics for Apache Flink can enrich streams with DynamoDB lookups. Option D is wrong because Lambda can enrich but is limited by concurrency. Option A is wrong because Firehose can transform via Lambda but cannot perform DynamoDB enrichment directly.

Option E is wrong because Glue Streaming ETL runs in micro-batches and is a weaker fit for per-event enrichment.

401
MCQ · easy

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that ingests JSON log data from web servers. The stream is configured to transform records with an AWS Lambda function and deliver to an Amazon S3 bucket. Recently, the stream has been failing with 'InvalidData' errors. Which action should the engineer take to resolve the issue?

A. Verify the S3 bucket policy allows Firehose to write.
B. Increase the buffer size and interval in the Firehose delivery stream.
C. Change the data format to CSV in the Firehose configuration.
D. Check the CloudWatch Logs for the Lambda function to identify transformation errors.
Answer: D

Lambda errors are logged in CloudWatch and can reveal why transformation fails.

Why this answer

The 'InvalidData' error in Kinesis Data Firehose typically indicates that the Lambda function used for data transformation is failing or returning malformed records. By checking the CloudWatch Logs for the Lambda function, the engineer can identify specific transformation errors, such as incorrect JSON parsing, missing fields, or exceptions, which cause Firehose to reject the records. This is the most direct troubleshooting step because Firehose relies on the Lambda function to return valid transformed data in the expected format.

Exam trap

The trap here is that candidates often confuse 'InvalidData' errors with S3 permission issues or buffer configuration problems, but the error specifically points to a failure in the data transformation step, not the delivery destination or batching settings.

How to eliminate wrong answers

Option A is wrong because if the S3 bucket policy were the issue, the error would be a permission or access denied error, not 'InvalidData'. Option B is wrong because increasing buffer size or interval would not resolve data transformation errors; it only affects how data is batched before delivery. Option C is wrong because changing the data format to CSV would not fix transformation errors; Firehose expects the Lambda function to return data in the same format as the input (JSON) unless explicitly configured otherwise, and the 'InvalidData' error is unrelated to the output format.
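
For context, Firehose expects a transformation function to return every record with its original `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and base64-encoded `data`. The handler below is a minimal sketch of that contract; the added `processed` field is a hypothetical enrichment.

```python
import base64
import json


def handler(event, context):
    """Transform Firehose records, marking unparseable ones as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical enrichment
            line = json.dumps(payload) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
            })
        except ValueError:
            # Invalid base64 or JSON: return the original data flagged as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


sample_event = {"records": [
    {"recordId": "1", "data": base64.b64encode(b'{"path": "/home"}').decode("utf-8")},
    {"recordId": "2", "data": base64.b64encode(b"not json").decode("utf-8")},
]}
response = handler(sample_event, None)
first_payload = json.loads(base64.b64decode(response["records"][0]["data"]))
```

Logging the exception inside the `except` branch is what makes the failure visible in CloudWatch Logs during troubleshooting.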

402
Multi-Select · medium

A company is designing a data ingestion pipeline for real-time sensor data from thousands of devices. The data must be processed with low latency and stored in Amazon S3. Which TWO services would be appropriate for this use case? (Choose TWO.)

Select 2 answers
A. AWS Glue
B. AWS DataSync
C. Amazon Athena
D. Amazon Kinesis Data Firehose
E. Amazon Kinesis Data Streams
Answers: D, E

Firehose can deliver streaming data to S3 with buffering.

Why this answer

Option E (Kinesis Data Streams) is correct for ingestion, and Option D (Kinesis Data Firehose) is correct for delivery to S3. Option A (AWS Glue) is for batch ETL, not real-time. Option C (Athena) is for ad hoc querying.

Option B (DataSync) is for batch data transfer.

403
Multi-Select · hard

A data engineer is designing a data ingestion pipeline that uses AWS DMS to migrate data from an on-premises Oracle database to Amazon S3 in Parquet format. The engineer needs to ensure that data is continuously replicated with minimal latency. Which THREE steps should the engineer take? (Choose three.)

Select 3 answers
A. Configure a DMS task with a transformation rule to convert to Parquet.
B. Specify an S3 bucket as the target endpoint with data format set to Parquet.
C. Use AWS Schema Conversion Tool (SCT) to convert the schema.
D. Enable change data capture (CDC) on the source database.
E. Perform a full load only, without CDC.
Answers: A, B, D

DMS can transform data to Parquet format.

Why this answer

Options A, B, and D are correct. Option D enables continuous replication using CDC. Option A uses a transformation to convert to Parquet.

Option B uses a target endpoint for S3 with Parquet format. Option C is wrong because SCT is for schema conversion, not for DMS replication. Option E is wrong because a full load only is not continuous replication.

404
MCQ · medium

A data engineer needs to design a data ingestion pipeline that ingests CSV files from an Amazon S3 bucket, transforms the data by adding a timestamp column, and loads it into an Amazon Redshift table. The pipeline should run automatically whenever a new file is uploaded to the S3 bucket. Which AWS service should be used to trigger the transformation?

A. AWS Lambda
B. AWS Step Functions
C. Amazon EventBridge
D. Amazon Simple Queue Service (SQS)
Answer: A

Lambda can be triggered directly by S3 events.

Why this answer

Option A is correct because S3 event notifications can invoke a Lambda function directly. Option D is incorrect because SQS can receive S3 event notifications but only queues them; a separate consumer would still have to run the transformation. Option C is incorrect because EventBridge only routes S3 events to a target; something such as Lambda must still run the transformation.

Option B is incorrect because Step Functions would need an event source; Lambda is simpler.

405
MCQ · hard

A company needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The network bandwidth is limited to 1 Gbps, and the transfer must complete within 10 days. The data is compressible. Which solution is MOST appropriate?

A. Use AWS DataSync over a Direct Connect connection.
B. Use Amazon S3 Transfer Acceleration with multipart uploads.
C. Use multiple parallel AWS CLI sync commands over the internet.
D. Use AWS Snowball Edge to physically ship the data.
Answer: D

Snowball Edge can transfer 50 TB in a few days, independent of network.

Why this answer

Option D is correct. AWS Snowball Edge is a physical device that can transfer large volumes quickly, bypassing network limitations. Option C is wrong because even at the full 1 Gbps line rate 50 TB takes about 111 hours (roughly 4.6 days), and sustained throughput over the internet is usually well below line rate, which puts the 10-day deadline at risk.

Option A is wrong because DataSync also uses the network and is bound by the same 1 Gbps limit. Option B is wrong because S3 Transfer Acceleration speeds up transfers over the internet; it does not add bandwidth or provide an offline bulk path.
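
The transfer-time estimate can be checked with a short calculation. This is a sketch that assumes decimal units (1 TB = 10^12 bytes); the `utilization` parameter and the 40% figure are illustrative assumptions, not measurements.

```python
def transfer_days(size_tb, link_gbps, utilization=1.0):
    """Days needed to move size_tb (decimal terabytes) over a network link."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86400


print(round(transfer_days(50, 1) * 24))     # 111 hours at full line rate
print(round(transfer_days(50, 1, 0.4), 1))  # 11.6 days at an assumed 40% utilization
```

At full line rate the transfer fits within the 10-day window, so the case against a network transfer rests on how much of the link is actually available.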

406
MCQhard

The exhibit shows an IAM policy attached to a role used by an AWS Glue ETL job. The job reads from an S3 bucket and writes to another S3 bucket. However, the job fails with an access denied error when trying to write to the output bucket. What is the most likely cause?

A.The policy is missing permissions for AWS KMS to decrypt/encrypt objects
B.The policy does not allow glue:StartJobRun on the specific job
C.The policy only allows PutObject on the my-data-lake bucket, but the job writes to a different bucket
D.The policy does not include s3:ListBucket permission
AnswerC

The S3 permissions are scoped to my-data-lake/*; if output bucket is different, access is denied.

Why this answer

The policy allows s3:PutObject only on my-data-lake/*. If the job writes to a different output bucket (e.g., my-output-bucket), the resource ARN does not cover it and the write is denied.

The Glue actions in the policy are not the problem; the role simply lacks write permission on the output bucket.
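
The resource-scoping problem can be illustrated with a simplified matcher. This sketch only approximates IAM wildcard matching with `fnmatchcase` and ignores conditions, explicit denies, and bucket policies. The function name and the object key are hypothetical.

```python
from fnmatch import fnmatchcase


def allows_put(policy_resources, bucket, key):
    """True if any Resource pattern covers the ARN of the object being written."""
    arn = f"arn:aws:s3:::{bucket}/{key}"
    return any(fnmatchcase(arn, pattern) for pattern in policy_resources)


resources = ["arn:aws:s3:::my-data-lake/*"]
print(allows_put(resources, "my-data-lake", "out/part-0.parquet"))      # True
print(allows_put(resources, "my-output-bucket", "out/part-0.parquet"))  # False
```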

407
Multi-Selecteasy

A company is using AWS Glue to catalog data in Amazon S3. The data is in CSV format with varying schemas. The Data Engineering team wants to ensure the Glue Data Catalog is updated automatically when new partitions are added to S3. Which TWO actions should be taken? (Choose two.)

Select 2 answers
A.Enable partition indexing on the Glue Data Catalog.
B.Set up an S3 event notification to trigger a Lambda function that updates the Glue Data Catalog.
C.Configure a scheduled AWS Glue crawler to run on a regular basis.
D.Run Amazon Athena queries with MSCK REPAIR TABLE to add partitions.
E.Use AWS Glue ETL jobs to write data and update the catalog simultaneously.
AnswersA, C

Partition indexes keep queries on new partitions efficient; a scheduled crawler adds the partitions automatically.

Why this answer

A is correct because a partition index in the Glue Data Catalog keeps partition lookups and pruning efficient as partitions are added. C is correct because a Glue crawler configured with a schedule automatically discovers new partitions. B is wrong because an S3 event notification that triggers a Lambda function requires custom update code and is less efficient than a scheduled crawler.

D is wrong because MSCK REPAIR TABLE has to be run manually in Athena each time partitions are added. E is wrong because updating the catalog from Glue ETL jobs is not automatic for partitions that arrive in S3 outside those jobs.

408
MCQeasy

A company needs to ingest data from a self-managed Apache Kafka cluster running on EC2 into Amazon S3. The data must be delivered in near real-time. Which AWS service is BEST suited for this task?

A.Use Amazon MSK to replicate the Kafka cluster and then use a connector to S3.
B.Use Amazon S3 Transfer Acceleration to speed up the transfer from Kafka brokers to S3.
C.Use Amazon Kinesis Data Streams as an intermediary to buffer data before writing to S3.
D.Use an AWS Glue streaming ETL job that reads from the Kafka cluster and writes to S3.
AnswerD

Glue supports streaming from Kafka and can write to S3.

Why this answer

Option D is correct because an AWS Glue streaming ETL job can read directly from an Apache Kafka cluster, including a self-managed one, and write to S3 in near real time. A Kafka Connect S3 sink connector running on EC2 would also work, but it is an open-source component rather than an AWS service.

Option A is wrong because Amazon MSK is a managed Kafka service; replicating the self-managed cluster into MSK adds a second cluster and is not itself a way to deliver data to S3. Option B is wrong because S3 Transfer Acceleration speeds up uploads to S3 and does not ingest from Kafka.

Option C is wrong because Kinesis Data Streams would require a separate connector to move data out of Kafka and a consumer to write to S3.

409
MCQhard

A company runs an AWS Glue ETL job that reads data from Amazon S3, transforms it, and writes back to S3 in a different partition structure. The job uses the 'spark.sql.shuffle.partitions' option set to 200. After the job completes, the output has many small files. The data engineer wants to minimize the number of output files while maintaining job performance. Which action should the engineer take?

A.Use 'coalesce(n)' with n based on target file size (e.g., 128 MB) before writing.
B.Enable S3 multipart upload for the Glue job.
C.Increase the 'spark.sql.shuffle.partitions' to 500.
D.Reduce the 'spark.sql.shuffle.partitions' to 50.
AnswerA

Coalesce reduces partitions without a full shuffle, minimizing files.

Why this answer

Option A is correct because calling 'coalesce' (or 'repartition') with a partition count derived from the target file size (e.g., 128 MB) reduces the number of output files without funneling everything through a single partition. Option D is wrong because lowering shuffle partitions reduces parallelism for the whole job and may cause out-of-memory errors. Option C is wrong because increasing shuffle partitions increases the number of files.

Option B is wrong because S3 multipart upload is automatic and does not affect file count.
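
One way to choose the partition count for 'coalesce' is to divide the expected output size by the target file size. The helper below is a minimal sketch; the function name and the 10 GiB figure are illustrative assumptions, and the actual output size would have to be estimated for the job.

```python
import math


def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Partition count so that each output file is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


n = target_partitions(10 * 1024**3)  # hypothetical 10 GiB of output
print(n)  # 80
# In the Spark script this value would feed: df.coalesce(n).write.parquet(path)
```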

410
Multi-Selecthard

A data engineer is troubleshooting an AWS Glue job that reads from Amazon RDS MySQL and writes to Amazon S3. The job runs successfully but takes longer than expected. The engineer wants to optimize performance. Which THREE actions would improve job performance?

Select 3 answers
A.Increase the number of DPUs allocated to the Glue job.
B.Use a single JDBC connection per partition.
C.Increase the JDBC fetch size parameter.
D.Convert the output format from Parquet to CSV.
E.Use a pushdown predicate to filter data at the source.
AnswersA, C, E

More DPUs provide more parallelism.

Why this answer

Options A, C, and E are correct. Increasing DPUs provides more parallelism. Using a pushdown predicate reduces the data read from the source.

Increasing the JDBC fetch size reduces round trips. Option B is wrong because a single connection per partition is already the default behavior, so it does not improve performance. Option D is wrong because Parquet is already efficient; converting to CSV would make it worse.

411
MCQhard

A company is using Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data includes a timestamp field. They want to partition the S3 objects by hour dynamically. The Firehose delivery stream is configured with a prefix like 'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/'. However, the objects are not being partitioned as expected; all files end up in a single partition. What is the MOST likely cause?

A.Dynamic partitioning is not enabled in the Firehose delivery stream configuration.
B.The buffer size is set too large, delaying file delivery.
C.The timestamp is in UTC but the prefix uses local time.
D.The IAM role for Firehose lacks permissions to write to S3 with dynamic prefixes.
AnswerA

Without enabling dynamic partitioning, the prefix is static and all data goes into one S3 prefix.

Why this answer

Option A is correct because dynamic partitioning requires separate configuration (enabling dynamic partitioning and specifying the partitioning keys). The prefix with timestamp expressions is used for S3 prefix customization but not for dynamic partitioning. Option C (time zone) could cause the wrong hour but not a single partition.

Option B (buffer size) affects file size. Option D (IAM) would cause errors, not mispartitioning. The key is that dynamic partitioning must be explicitly enabled.

412
Multi-Selectmedium

Which TWO AWS services can be used to ingest streaming data from a mobile application into Amazon S3 for near-real-time analytics? (Choose 2.)

Select 2 answers
A.Amazon Kinesis Data Firehose
B.Amazon Kinesis Data Streams
C.Amazon DynamoDB Streams
D.AWS Glue
E.Amazon SQS
AnswersA, B

Firehose can ingest streaming data and deliver to S3 near real-time.

Why this answer

Amazon Kinesis Data Firehose can directly ingest streaming data and deliver to S3. Amazon Kinesis Data Streams can also ingest data, but requires a consumer to write to S3. AWS Glue is an ETL service, not an ingestion endpoint for mobile clients.

Amazon SQS is for message queuing. Amazon DynamoDB Streams captures changes in DynamoDB, not from mobile apps.

413
MCQeasy

An organization uses AWS Lake Formation to manage a data lake in S3. A new data engineer needs to create a Glue ETL job that reads from a Lake Formation-managed table. The engineer has been granted SELECT permission on the table via Lake Formation. However, the job fails with an AccessDenied error. What is the MOST likely cause?

A.The IAM role used by the Glue job does not have Lake Formation permissions.
B.The S3 bucket policy does not allow the Glue job to access the data.
C.The table has not been registered with Lake Formation.
D.The Glue job is not running in the same VPC as Lake Formation.
AnswerA

The IAM role must have Lake Formation permissions to access the table.

Why this answer

Option A is correct because Glue jobs need an IAM role that has Lake Formation permissions to access the table; the error indicates the job's IAM role lacks the necessary permissions. Option B is wrong because access to the underlying S3 data is granted through Lake Formation, not through direct S3 bucket policies. Option C is wrong because the table already exists and SELECT has been granted on it, so registration is not the issue.

Option D is wrong because Lake Formation is not tied to a VPC and does not require a VPC endpoint.

414
Multi-Selecteasy

A data engineer needs to transform data in an S3 data lake using AWS Glue ETL. The data is in CSV format and needs to be converted to Parquet with partitioning by date. The engineer wants to minimize the number of files written to S3 to improve query performance. Which TWO configuration options should the engineer use? (Select TWO.)

Select 2 answers
A.Increase the number of workers in the Glue job to increase parallelism.
B.Use the coalesce method to reduce the number of output partitions.
C.Disable compression in the Parquet output.
D.Enable partition pruning in the Glue job by setting the 'partitionKeys' parameter.
E.Set the 'groupFiles' option to 'inPartition' in the DynamicFrame writer.
AnswersB, D

Coalesce reduces the number of partitions before writing, resulting in fewer files.

Why this answer

Option B (use coalesce) reduces the number of output files by merging partitions. Option D (enable partition pruning) helps queries skip irrelevant partitions, improving performance. Option A is wrong because increasing the number of workers may increase the number of output files.

Option C is wrong because disabling compression increases file size. Option E is wrong because 'groupFiles' groups small input files when reading, so it does not directly reduce the number of files written.

415
MCQmedium

A company is ingesting log files from EC2 instances into CloudWatch Logs and then wants to deliver them to S3 for long-term storage and analysis. The data engineer needs to ensure the logs are delivered to S3 within 5 minutes of being generated. Which approach meets this requirement?

A.Configure a CloudWatch Logs metric filter and invoke a Lambda function to write to S3
B.Use CloudWatch Logs Insights to query logs and save results to S3
C.Use CloudWatch Logs subscription filter with Kinesis Data Firehose to deliver to S3
D.Use the CloudWatch Logs export to S3 feature
AnswerC

Firehose can deliver to S3 within minutes.

Why this answer

Option C (CloudWatch Logs subscription filter with Kinesis Data Firehose) is correct because Firehose delivers to S3 as soon as its buffer interval elapses, and that interval can be set to 60 seconds. Option D (CloudWatch Logs export to S3) is a batch process that can take hours. Option A (a metric filter that invokes a Lambda function to write to S3) is slower and less reliable.

Option B (CloudWatch Logs Insights) is for querying, not delivery.

416
MCQhard

A financial services company ingests real-time stock trade data using Amazon Kinesis Data Streams with 10 shards. Each shard receives about 500 records per second, each record approximately 1 KB. The data is consumed by a single AWS Lambda function that transforms the data and writes to Amazon S3. The Lambda function is configured with 1024 MB memory and a timeout of 5 minutes. The company notices that the Lambda function is frequently throttled, and data ingestion lags behind. The Lambda function's CloudWatch metrics show that the iterator age is increasing, and the function's concurrency is maxed out at 1000. The data engineer needs to resolve the throttling issue without changing the Lambda function code. What should the data engineer do?

A.Increase the number of shards in the Kinesis data stream to increase parallelism.
B.Reduce the Lambda function memory to 512 MB to increase concurrency limit.
C.Decrease the batch size to 10 records to reduce processing time per invocation.
D.Increase the Lambda function memory to 2048 MB to improve processing speed.
AnswerA

More shards allow more Lambda concurrent executions, reducing iterator age.

Why this answer

Option A is correct. Increasing the number of shards increases the number of concurrent Lambda invocations reading from the stream, allowing more parallel processing and reducing the iterator age. Option D is wrong because increasing Lambda memory helps only if the function is CPU- or memory-bound; the issue here is concurrency, not memory.

Option C is wrong because decreasing batch size may increase overhead; the function is already maxing concurrency. Option B is wrong because reducing memory would likely make the function slower.
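
The link between shard count and Lambda parallelism can be sketched with a little arithmetic. For a Kinesis event source, Lambda processes one batch per shard at a time by default, and the parallelization factor setting (1 to 10) multiplies that. The helper names are illustrative, and 1 KB per record is taken from the scenario.

```python
def stream_load(shards, records_per_shard_per_s, record_kb):
    """Total records/s and KB/s entering the stream."""
    records = shards * records_per_shard_per_s
    return records, records * record_kb


def max_concurrent_batches(shards, parallelization_factor=1):
    """Upper bound on batches Lambda processes at once for one stream."""
    return shards * parallelization_factor


print(stream_load(10, 500, 1))     # (5000, 5000)
print(max_concurrent_batches(10))  # 10
print(max_concurrent_batches(20))  # 20 after doubling the shard count
```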

417
MCQhard

A data engineer is ingesting XML data from an external API into Amazon S3. The engineer needs to transform the XML to JSON using AWS Glue. The XML structure is deeply nested. Which Apache Spark method should be used in the Glue ETL script?

A.Use the built-in AWS Glue 'xml' data source
B.Use Hadoop's XmlInputFormat
C.Use spark.read.format('xml') with Databricks XML library
D.Use the Spark SQL function from_xml()
AnswerC

This is the standard way to parse XML in Spark.

Why this answer

Spark's Databricks XML library (spark-xml) can parse XML and convert to DataFrame; then write as JSON. The 'com.databricks.spark.xml' format is the standard. AWS Glue's built-in 'xml' option is limited; 'from_xml' is not a Spark method; 'XmlInputFormat' is Hadoop MR.

418
MCQeasy

Refer to the exhibit. An S3 event notification is configured to trigger an AWS Lambda function when objects are created in 'my-bucket'. The Lambda function processes the JSON file and writes results to Amazon DynamoDB. The function fails with a timeout error. Which action should the engineer take to resolve the issue?

A.Modify the S3 event notification to use a different event type
B.Grant the Lambda function permission to access DynamoDB
C.Change the trigger to Amazon SQS instead of S3
D.Increase the Lambda function timeout
AnswerD

Timeout error indicates the function needs more time.

Why this answer

Option D is correct. If the function times out, increasing the timeout gives it more time to complete. Option A is wrong because the event type is already correct.

Option B is wrong because the failure is a timeout, not an access error, so DynamoDB permissions are not the problem. Option C is wrong because switching the trigger from S3 to SQS does not make the function finish any sooner.

419
MCQmedium

Refer to the exhibit. A data engineer created this IAM policy for a Lambda function that reads from a Kinesis stream and writes to an S3 bucket. The Lambda function fails with an 'AccessDenied' error when trying to write to S3. What is the missing permission?

A.s3:ListBucket on the bucket
B.s3:GetObject on the bucket
C.s3:PutObjectAcl on the bucket
D.s3:DeleteObject on the bucket
AnswerA

ListBucket is often required for PutObject operations to verify bucket existence.

Why this answer

Option A is correct because the Lambda function needs permission to list the bucket (s3:ListBucket) to verify that the bucket exists and possibly to check for existing objects. The 's3:PutObject' action is allowed, but without 's3:ListBucket' the function may fail on bucket-level operations performed as part of the write. Option B is wrong because 's3:GetObject' is already present.

Option D is wrong because 's3:DeleteObject' is not needed. Option C is wrong because 's3:PutObjectAcl' is not required.

420
MCQeasy

A data engineer needs to ingest JSON files from an S3 bucket into a DynamoDB table. The files are updated hourly and contain new records. Which AWS service should be used to trigger a Lambda function for each new object?

A.Kinesis Data Firehose
B.Amazon EventBridge
C.S3 Event Notifications
D.Amazon SQS
AnswerC

S3 can send events to Lambda on object creation.

Why this answer

Option C is correct because S3 Event Notifications can invoke Lambda on object-created events such as PutObject. Option A is wrong because Kinesis Data Firehose is for streaming delivery, not event-based S3 triggers. Option D is wrong because SQS would add a queue that has to be polled.

Option B is wrong because EventBridge is more complex for simple S3 object creation triggers.

421
MCQeasy

A data engineer is troubleshooting an AWS Glue ETL job that fails with the error: 'An error occurred while calling o137.pyWriteDynamicFrame. No such file or directory: s3://bucket/output/part-00000.parquet'. The job reads from a JDBC source and writes to S3. What is the most likely cause?

A.The schema of the source data has changed, causing a mismatch during write
B.The output S3 path does not exist and the Glue job does not have permission to create it
C.The Glue job ran out of memory during the transformation phase
D.The IAM role attached to the Glue job lacks permissions to read from the JDBC source
AnswerB

The error message indicates missing directory; Glue may not auto-create if permissions are insufficient.

Why this answer

Option B is correct because the error indicates the output path does not exist; the Glue job may not have permission to create it if the bucket or prefix does not exist. Option D (IAM role for the JDBC source) would cause a permissions error on read, not 'No such file or directory'. Option A (schema mismatch) would cause a different error.

Option C (memory) would cause out-of-memory or timeout errors.

422
MCQhard

A company uses AWS Glue to transform data in Amazon S3. The transformation logic is written in Python and references several libraries that are not included in the default Glue environment. Which approach should the data engineer use to make these libraries available?

A.Include a requirements.txt file in the Glue job script and run pip install during job initialization.
B.Package the libraries in an AWS Lambda layer and attach it to the Glue job.
C.Upload the libraries as a .zip file to an S3 bucket and reference them in the Glue job's Python library path.
D.Use the --additional-python-modules parameter in the Glue job.
AnswerC

Glue Python shell jobs allow adding custom Python modules from S3.

Why this answer

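Option C is correct because AWS Glue can load extra Python modules from S3 paths set in the job's Python library path. Option D is not the intended answer, although '--additional-python-modules' is a valid job parameter in AWS Glue 2.0 and later and installs modules with pip. Option B is wrong because Lambda layers belong to Lambda and cannot be attached to a Glue job.

Option A is wrong because a Glue job does not read a requirements.txt file from the script or run pip install on its own during initialization.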

423
MCQmedium

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time analytics. The data volume is variable and can spike unpredictably. The solution must be serverless and minimize operational overhead. Which AWS service should be used for ingestion?

A.Use Amazon Kinesis Data Firehose to load streaming data directly into Amazon S3.
B.Use Amazon SQS to queue messages and process them in batches.
C.Use Amazon Kinesis Data Streams to ingest and process data in real time.
D.Use AWS IoT Core to ingest data and route it to Amazon DynamoDB.
AnswerC

Amazon Kinesis Data Streams is a serverless streaming data service that can handle variable and high-throughput data from many sources, making it ideal for IoT data ingestion.

Why this answer

Option C is correct because Amazon Kinesis Data Streams is a serverless streaming data service that can handle variable and high-throughput data from many sources, making it ideal for IoT data ingestion. Option B (Amazon SQS) is for message queuing, not real-time streaming analytics. Option A (Amazon Kinesis Data Firehose) is for loading streaming data into data stores, but it does not provide the same real-time processing capabilities as Kinesis Data Streams.

Option D (AWS IoT Core) is a managed cloud service that lets connected devices easily and securely interact with cloud applications and other devices, but it is not primarily a streaming ingestion service for analytics.

424
Multi-Selecteasy

A data engineer needs to ingest data from a SaaS application (Salesforce) into Amazon S3 on a daily basis. Which TWO AWS services can be used for this purpose? (Choose TWO.)

Select 2 answers
A.AWS DataSync
B.Amazon Kinesis Data Streams
C.AWS Transfer Family
D.AWS Glue
E.Amazon AppFlow
AnswersD, E

Glue can connect to Salesforce via JDBC and write to S3.

Why this answer

Options D and E are correct. Amazon AppFlow (E) natively integrates with Salesforce and can write to S3. AWS Glue (D) can also connect via JDBC.

Option B is wrong because Kinesis Data Streams is for real-time streaming, not batch from SaaS. Option C is wrong because Transfer Family is for file transfers over SFTP, FTPS, and FTP.

Option A is wrong because DataSync is for file/object storage, not SaaS.

425
MCQhard

A company is using AWS Lake Formation to manage permissions on data in Amazon S3. They need to ingest data from an external source into a new database 'sales_db' and a table 'transactions' using AWS Glue. The IAM role used by Glue must have the minimal permissions to create the database and table in the Data Catalog and write data to the S3 location. Which combination of permissions should be granted?

A.IAM policy with `glue:CreateDatabase`, `glue:CreateTable`, and `s3:PutObject`
B.IAM policy with `lakeformation:GrantPermissions` on the database and table
C.IAM policy with `s3:GetObject` and `s3:PutObject` on the target location
D.Lake Formation permissions: `CREATE_DATABASE` on the catalog, `CREATE_TABLE` on `sales_db`, and S3 location permission
AnswerD

Lake Formation controls Data Catalog operations; S3 write is also needed.

Why this answer

Option D is correct because Lake Formation permissions are required to create databases and tables in the Data Catalog. IAM permissions alone are insufficient; the Glue role must have the Lake Formation `CREATE_DATABASE` and `CREATE_TABLE` permissions plus permission on the S3 location. Option A relies on IAM alone and lacks Lake Formation permissions.

Option B is too broad. Option C covers only S3 object access and cannot create the database or table.

426
MCQhard

Refer to the exhibit. A data engineer has configured an S3 event notification to send an event to an SQS queue when objects are created in the 'incoming/' prefix. The engineer wants to trigger an AWS Lambda function to process the object. However, the Lambda function is not being invoked. What is the most likely cause?

A.The Lambda function lacks permission to read from the S3 bucket.
B.Lambda is not configured as an event source for the SQS queue.
C.The SQS queue does not exist or is in a different account.
D.The S3 event notification filter prefix is incorrect.
AnswerB

Lambda must poll the SQS queue to be triggered.

Why this answer

Option B is correct. The event notification sends to SQS, not directly to Lambda. To invoke Lambda, the queue must be configured as an event source for Lambda.

Option D is wrong because the prefix is correct. Option A is wrong because S3 read permission is needed but is not the primary cause; the Lambda function is not triggered by SQS events unless Lambda polls the queue. Option C is wrong because the queue exists and is configured.
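
Once the queue is configured as an event source for the function, each SQS message body carries the S3 notification as a JSON string. The handler logic below is a minimal sketch of unwrapping it; the function name is an illustrative assumption.

```python
import json
from urllib.parse import unquote_plus


def s3_keys_from_sqs_event(event):
    """Unwrap S3 notifications delivered to Lambda through an SQS queue."""
    keys = []
    for message in event.get("Records", []):
        body = json.loads(message["body"])  # the S3 notification is the SQS body
        # Messages without "Records" (for example a test event) are skipped.
        for record in body.get("Records", []):
            keys.append(unquote_plus(record["s3"]["object"]["key"]))
    return keys
```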

427
MCQhard

A company uses AWS DMS to migrate an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration completes, but the target table has more rows than the source. Which is the MOST likely cause?

A.DMS used binary replication which included extra metadata rows.
B.Oracle and PostgreSQL handle case sensitivity differently.
C.DMS performed a full load instead of ongoing replication.
D.Target tables lack unique constraints, causing DMS to insert duplicate rows.
AnswerD

Without unique constraints, DMS may re-apply changes and create duplicates.

Why this answer

Option D is correct. DMS may re-apply changes from the transaction log if the target tables have no unique constraints, leading to duplicate rows. Option C is wrong because a full load alone copies the source rows once; the re-applied changes come from CDC.

Option B is wrong because case-sensitivity differences between Oracle and PostgreSQL would cause missing rows, not extras. Option A is wrong because binary replication is not relevant.

428
MCQhard

A company is using Amazon Kinesis Data Streams with a Lambda consumer to process real-time events. The Lambda function is triggered by a DynamoDB stream to update a counter. Recently, the counter has been inaccurate due to duplicate processing. What is the most likely cause?

A.The DynamoDB stream is configured with 'TRIM_HORIZON' iterator type
B.The Lambda function's reserved concurrency is too low
C.The Lambda function is not idempotent and is being retried on failures
D.The Kinesis stream has undergone a shard rebalance
AnswerC

Retries cause duplicate updates if the function is not idempotent.

Why this answer

Option C is correct because Lambda functions may be invoked multiple times for the same record if they fail or time out, leading to duplicates if the operation is not idempotent. Option A is wrong because TRIM_HORIZON only sets where reading starts; delivery to Lambda is at-least-once regardless of the iterator type. Option D is wrong because Kinesis shard rebalancing does not cause duplicates.

Option B is wrong because Lambda concurrency limits would cause throttling, not duplicates.
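
A common remedy is to make the update idempotent by remembering which event IDs have already been applied. The class below is an in-memory sketch of that idea only; in a real pipeline the set of seen IDs would need durable storage (for example a conditional write), and the class and method names are hypothetical.

```python
class IdempotentCounter:
    """Counter that ignores events it has already applied."""

    def __init__(self):
        self.count = 0
        self._seen = set()

    def apply(self, event_id, delta=1):
        """Apply the event once; return False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.count += delta
        return True


counter = IdempotentCounter()
counter.apply("evt-1")
counter.apply("evt-1")  # retry of the same record is ignored
counter.apply("evt-2")
print(counter.count)  # 2
```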

429
MCQhard

A data engineer runs a weekly AWS Glue ETL job that processes data from Amazon DynamoDB to Amazon S3. The job reads the entire table every time, which is slow and expensive. The job needs to process only items that changed since the last run. Which solution should the engineer implement?

A.Use DynamoDB Scan with a LastEvaluatedKey to paginate and store the last scanned key to resume next time
B.Enable DynamoDB Streams and process change events with AWS Lambda to write to S3
C.Add a Global Secondary Index (GSI) on a timestamp attribute and query only new records
D.Use AWS Database Migration Service (DMS) with ongoing replication from DynamoDB to S3
AnswerB

Streams capture item-level changes, enabling incremental loads.

Why this answer

Option B is correct because DynamoDB Streams captures changes (inserts, updates, deletes) and can be read by AWS Lambda to write to S3, enabling incremental processing. Option A (Scan with LastEvaluatedKey) still scans the entire table. Option C (GSI) does not capture changes automatically.

Option D (DMS) is overkill and less seamless than DynamoDB Streams; DMS is typically used with DynamoDB as a target rather than a source.

430
MCQmedium

A data engineer needs to transform JSON data into Parquet format using AWS Glue. The input data has nested fields. Which Glue feature should be used to flatten the nested structure?

A.Relationalize transform
B.DropNullFields transform
C.FindMatches transform
D.Map transform
AnswerA

Relationalize transforms nested JSON into flat tables.

Why this answer

Option A is correct because AWS Glue's 'Relationalize' transform flattens nested JSON into a relational format. Option C is wrong because 'FindMatches' is for deduplication. Option B is wrong because 'DropNullFields' removes null fields.

Option D is wrong because 'Map' is for simple field mappings.
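
To show what flattening means here, the plain-Python sketch below turns nested fields into dot-separated column names. It is an illustration of the idea, not the Glue 'Relationalize' API; it leaves arrays untouched, whereas Relationalize moves arrays into separate tables. The sample record is hypothetical.

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dicts into a single level with dotted field names."""
    flat = {}
    for name, value in record.items():
        path = f"{parent}{sep}{name}" if parent else name
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


print(flatten({"id": 1, "user": {"name": "a", "geo": {"lat": 2.5}}}))
# {'id': 1, 'user.name': 'a', 'user.geo.lat': 2.5}
```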

431
MCQeasy

A data engineer needs to schedule an AWS Glue ETL job to run every hour and process new data that arrives in an S3 bucket. The job should only process files that have been added since the last run. Which approach should the engineer use to track which files have been processed?

A.Configure S3 Event Notifications to trigger the Glue job on each new object creation.
B.Enable job bookmarks in the Glue ETL job.
C.Store the last processed timestamp in a DynamoDB table and query it at the start of the job.
D.Use S3 Inventory to list all objects and filter by last modified date in the job.
AnswerB

Glue job bookmarks automatically track the state of data processed and only process new data.

Why this answer

Option B is correct because Glue job bookmarks automatically track the last processed files in S3 and only process new data in subsequent runs. Option A is wrong because S3 Event Notifications can trigger the job but do not track processed files; the job would need to handle deduplication. Option C is wrong because a DynamoDB table would require custom code to track state.

Option D is wrong because checking last modified timestamp manually is error-prone and not built-in.
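
Conceptually, a bookmark is saved state that lets each run skip what earlier runs already handled. The sketch below imitates that with a last-modified watermark. It is only an illustration of the idea, since Glue manages bookmark state internally, and the function name and sample listing are hypothetical.

```python
def select_new_files(listing, bookmark_ts):
    """Return keys modified after the bookmark and the updated bookmark.

    listing: iterable of (key, last_modified_epoch_seconds) pairs.
    """
    new = sorted((key, ts) for key, ts in listing if ts > bookmark_ts)
    new_bookmark = max((ts for _, ts in new), default=bookmark_ts)
    return [key for key, _ in new], new_bookmark


listing = [("a.csv", 10), ("b.csv", 20), ("c.csv", 30)]
keys, bookmark = select_new_files(listing, 15)
print(keys, bookmark)  # ['b.csv', 'c.csv'] 30
```

Running the function again with the returned bookmark yields no files until newer objects appear.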

432
MCQeasy

A company wants to schedule a nightly batch job to copy data from an on-premises PostgreSQL database to Amazon S3. The solution must minimize operational overhead. Which AWS service should be used?

A.AWS Glue
B.AWS Data Pipeline
C.Amazon EMR
D.AWS Database Migration Service (AWS DMS) with ongoing replication
AnswerA

AWS Glue can run scheduled ETL jobs to read from PostgreSQL and write to S3 with minimal overhead.

Why this answer

AWS Glue can connect to JDBC sources like PostgreSQL using a crawler or ETL job, and write to S3. Data Pipeline and DMS are more complex or intended for replication, not simple scheduled batch copies.

433
Multi-Selecthard

A data engineer is troubleshooting an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but 5% of records are missing after the load. The engineer suspects data consistency issues. Which THREE actions could help diagnose and resolve the problem? (Choose THREE.)

Select 3 answers
A.Use the Redshift COPY command with a manifest file to load data.
B.Increase the number of DPUs for the Glue job.
C.Enable Glue job bookmarks to track processed files.
D.Use a staging table in Redshift with a transaction to commit.
E.Review the job's CloudWatch Logs for any error messages.
AnswersA, C, E

Manifest file ensures all files are loaded.

Why this answer

Option A is correct because using the Redshift COPY command with a manifest file ensures that only the exact files listed in the manifest are loaded, eliminating the risk of partial or duplicate reads from S3. This is a common pattern to guarantee data consistency when the Glue job may not reliably track which files have been processed, especially in scenarios with concurrent writes or retries.

Exam trap

The trap here is that candidates often assume performance tuning (increasing DPUs) or database-level transactions (staging tables) can fix data ingestion gaps, when the actual problem is incomplete or inconsistent file discovery from the source (S3).
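The manifest pattern described for Option A can be sketched as follows. This is a minimal illustration, not the job's actual code: the bucket name and object keys are hypothetical, and the `entries` / `url` / `mandatory` layout is the one used by Redshift COPY manifests.

```python
import json


def build_manifest(bucket, keys):
    """Build a Redshift COPY manifest listing exactly the files to load."""
    return {
        "entries": [
            # mandatory=True asks COPY to fail if a listed file is missing
            {"url": f"s3://{bucket}/{key}", "mandatory": True}
            for key in keys
        ]
    }


# Hypothetical bucket and object keys, for illustration only.
manifest = build_manifest(
    "example-bucket", ["out/part-0000.csv", "out/part-0001.csv"]
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Because the manifest names every file explicitly, a missing or unexpected object shows up as a load error rather than as silently absent rows.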

434
Multi-Selecteasy

A data engineer is designing a data ingestion pipeline for streaming social media data. The data must be ingested with low latency (seconds) and stored in Amazon S3 for long-term analytics. The engineer also needs to perform real-time aggregations. Which TWO services should the engineer use? (Choose two.)

Select 2 answers
A.Amazon Kinesis Data Firehose
B.AWS Glue ETL
C.Amazon S3
D.Amazon Kinesis Data Analytics
E.Amazon Kinesis Data Streams
AnswersD, E

Performs real-time analytics on streaming data.

Why this answer

Options D and E are correct. Kinesis Data Streams (E) provides low-latency ingestion, and Kinesis Data Analytics (D) performs real-time aggregations. Option A is wrong because Kinesis Data Firehose buffers records before delivery, so it has higher latency.

Option C is wrong because S3 is storage, not stream processing. Option B is wrong because Glue ETL is batch-oriented.

435
MCQmedium

A data engineer is using AWS Glue ETL to transform a large dataset in S3. The job processes 2 TB of data daily and currently runs for 6 hours. The engineer wants to reduce runtime without changing the transformation logic. What is the best approach?

A.Reduce the number of DPUs to minimize overhead.
B.Use the Spark UI to analyze bottlenecks and rewrite code.
C.Increase the number of Glue DPUs or enable auto-scaling.
D.Switch from Spark to Python shell.
AnswerC

More DPUs provide parallel processing and reduce runtime.

Why this answer

Increasing DPUs or enabling auto-scaling adds parallelism and can reduce runtime without touching the transformation logic, so Option C is correct. Option A is wrong because reducing DPUs would slow the job.

Option D is wrong because a Python shell job is a different engine and may not be compatible with the existing Spark logic. Option B is wrong because the Spark UI is a monitoring tool, and rewriting code would change the transformation logic.

436
MCQeasy

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3. The data volume is about 500 GB initially, with daily incremental updates of 10 GB. The pipeline must minimize operational overhead. Which AWS service should be used for the initial and incremental loads?

A.AWS Database Migration Service (DMS) with change data capture (CDC) to Amazon S3.
B.AWS Glue with a JDBC connection and incremental crawl.
C.Amazon Kinesis Data Firehose with a custom producer.
D.AWS Data Pipeline with a SQL activity and HiveCopyActivity.
AnswerA

DMS supports full load and CDC with low overhead.

Why this answer

AWS DMS with CDC is the correct choice because it supports continuous replication from Oracle to Amazon S3 with minimal overhead. It handles both the initial 500 GB full load and ongoing 10 GB daily increments via change data capture, without requiring custom code or complex pipeline management.

Exam trap

The trap here is that candidates often choose AWS Glue for its serverless nature, but Glue's incremental crawl only updates the Data Catalog, not the data itself, and it cannot capture row-level changes from a database without full reloads.

How to eliminate wrong answers

Option B is wrong because AWS Glue with an incremental crawl is designed for cataloging schema changes, not for capturing row-level changes from a database; it would require full table scans for each incremental load, which is inefficient for 10 GB daily updates. Option C is wrong because Amazon Kinesis Data Firehose requires a custom producer to stream data from Oracle, which adds operational overhead and does not natively support CDC or initial bulk loads from a database. Option D is wrong because AWS Data Pipeline with a SQL activity and HiveCopyActivity is a legacy service that lacks native CDC support for Oracle, requiring custom scripting for incremental loads and increasing operational complexity.

437
MCQeasy

A company needs to transform JSON data from Amazon Kinesis Data Streams into Parquet format and store it in Amazon S3. The transformation includes simple field mappings and type conversions. Which approach is most cost-effective and serverless?

A.Use Amazon EC2 instances running Apache Spark Streaming
B.Use Amazon Kinesis Data Firehose with an AWS Lambda function for transformation and output to Parquet
C.Use Amazon SageMaker for transformation
D.Use an AWS Glue ETL job triggered by a Kinesis stream
AnswerB

Firehose can invoke Lambda per record and convert to Parquet.

Why this answer

Kinesis Data Firehose with built-in Lambda transformation can convert JSON to Parquet efficiently. Glue is heavier and more expensive for simple transforms; EC2 is not serverless; SageMaker is ML-focused.
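A Firehose transformation Lambda of the kind Option B describes might look like the sketch below. The field names (`ts`, `event_time`, `amount`) are hypothetical; the `records` / `recordId` / `result` / `data` envelope is the contract Firehose uses for transformation functions. The Parquet conversion itself is assumed to be handled by Firehose's record format conversion, not by this function.

```python
import base64
import json


def lambda_handler(event, context):
    """Apply a simple field mapping and type conversion to each Firehose record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical mapping: rename 'ts' to 'event_time', cast 'amount' to float.
        transformed = {
            "event_time": payload.get("ts"),
            "amount": float(payload.get("amount", 0)),
        }
        data = base64.b64encode(
            (json.dumps(transformed) + "\n").encode("utf-8")
        ).decode("utf-8")
        output.append(
            {"recordId": record["recordId"], "result": "Ok", "data": data}
        )
    return {"records": output}


# Local usage example with a hand-built event (no AWS calls involved).
sample = {"ts": "2024-01-01T00:00:00Z", "amount": "12.5"}
event = {
    "records": [
        {
            "recordId": "1",
            "data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("utf-8"),
        }
    ]
}
result = lambda_handler(event, None)
print(result["records"][0]["result"])
```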

438
Multi-Selecthard

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for real-time data ingestion? (Choose three.)

Select 3 answers
A.The need for a fully managed delivery destination
B.Whether the application requires custom data processing logic
C.The ability to compress data before storage
D.The latency requirements for data delivery
E.The maximum throughput supported per shard
AnswersA, B, D

Firehose can directly deliver to S3, Redshift, etc., while Streams requires a consumer.

Why this answer

Options A, B, and D are correct. Kinesis Data Streams requires custom consumers; Firehose is fully managed. Streams provide sub-second latency; Firehose has buffer intervals.

Streams allow custom processing; Firehose has limited transformation options. Option E is wrong because both scale. Option C is wrong because both support compression for destinations.

439
MCQhard

A data engineer is building a real-time data pipeline using Amazon Kinesis Data Streams with a Lambda consumer. The data volume is 2 MB/s with average record size of 5 KB. The Lambda function processes records and writes to DynamoDB. Occasionally, the Lambda function fails with 'ProvisionedThroughputExceededException' on DynamoDB. What is the best way to handle this?

A.Replace Lambda with the Kinesis Client Library (KCL) running on EC2.
B.Increase the Lambda function's reserved concurrency to process more records in parallel.
C.Use DynamoDB Streams to capture the records and process them asynchronously.
D.Configure a Lambda destination on failure to send records to an SQS dead-letter queue, and implement retry logic in Lambda.
AnswerD

This handles transient failures without data loss.

Why this answer

Implementing a dead-letter queue and retry logic with exponential backoff is best practice, so Option D is correct. Option B is wrong because increasing Lambda concurrency may worsen throttling.

Option A is wrong because KCL adds complexity. Option C is wrong because DynamoDB Streams is for change data capture, not for throttling errors.
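The retry logic with exponential backoff can be sketched in plain Python. This is an illustration under stated assumptions: `ThrottledError` and `flaky_write` are hypothetical stand-ins for the DynamoDB throttling exception and the real write call, and the delay values are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error such as ProvisionedThroughputExceededException."""


def write_with_backoff(write_fn, item, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call write_fn(item), retrying throttled writes with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return write_fn(item)
        except ThrottledError:
            if attempt == max_attempts - 1:
                # Out of attempts: re-raise so the record can go to the dead-letter queue.
                raise
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))


# Simulated writer that is throttled twice before succeeding.
calls = {"n": 0}


def flaky_write(item):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ThrottledError()
    return "ok"


result = write_with_backoff(flaky_write, {"id": 1}, sleep=lambda seconds: None)
print(result, calls["n"])  # ok 3
```

Re-raising after the last attempt is what lets the failure destination (the SQS dead-letter queue in Option D) receive records that could not be written.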

440
MCQeasy

A company wants to ingest real-time streaming data from thousands of IoT devices into AWS for immediate processing. Which service is designed for ingesting large volumes of streaming data with low latency?

A.Amazon Simple Storage Service (S3)
B.AWS Database Migration Service (DMS)
C.Amazon Simple Queue Service (SQS)
D.Amazon Kinesis Data Streams
AnswerD

Kinesis Data Streams is designed for real-time streaming data ingestion with durable storage and replay capability.

Why this answer

Amazon Kinesis Data Streams is purpose-built for real-time streaming data ingestion at scale. Option C (SQS) is a message queue, not optimized for streaming; Option A (S3) is object storage; Option B (DMS) is for database migration.

441
MCQhard

A company uses AWS Glue to run ETL jobs that process data from Amazon S3 and load into Amazon Redshift. The jobs have recently started failing with 'Out of Memory' errors. The data volume has increased 3x in the past month. Which is the MOST effective solution to resolve this issue without redesigning the job?

A.Use Amazon Athena instead of Glue for the transformation.
B.Increase the number of Glue workers (DPUs) for the job.
C.Rewrite the job to use Spark SQL instead of PySpark.
D.Increase the number of partitions in the input S3 data.
AnswerB

More workers provide more memory and CPU to handle increased data volume.

Why this answer

Option B is correct because increasing the number of workers (DPUs) provides more memory and processing capacity. Option D (increasing S3 partitions) may help with parallelism but not directly with memory. Option C (using Spark SQL) is not a direct fix.

Option A (switching to Athena) changes the architecture.

442
MCQmedium

A data engineer needs to load data from an on-premises Oracle database to Amazon S3 daily. The table is 500 GB and grows by 50 MB per day. The load must capture only new and changed rows since the last run. Which solution is MOST cost-effective and requires the least maintenance?

A.Write a custom Python script on EC2 to query the Oracle redo logs and upload to S3
B.Export the entire table to CSV daily using a script and upload to S3
C.Use AWS Glue ETL job with a JDBC connection and a timestamp filter
D.Use AWS Database Migration Service (DMS) with ongoing replication (CDC)
AnswerD

DMS supports CDC and can capture only changes, minimizing cost and effort.

Why this answer

Option D is correct because AWS DMS with change data capture (CDC) can continuously replicate changes with minimal setup. Option B (full export) is wasteful. Option C (Glue with JDBC) does not natively support CDC.

Option A (custom script) increases maintenance.

443
MCQhard

Refer to the exhibit. A CloudFormation stack outputs the Glue job name and S3 bucket names. The Glue job transforms CSV files from the raw bucket to Parquet in the processed bucket. However, the Glue job is failing with an error that it cannot write to the processed bucket. What is the most likely cause?

A.The Glue job does not have permission to write to the processed bucket
B.The raw data bucket is in a different region
C.The Glue job is not using the correct worker type
D.The Glue job is using an incorrect file format
AnswerA

Missing s3:PutObject on processed-bucket.

Why this answer

The Glue job's IAM role likely does not have s3:PutObject permission on the processed bucket. The CloudFormation outputs show the bucket names, but the role may only have read access to raw. The error is write-related, not read.

444
Multi-Selecthard

A company is ingesting data from multiple sources into Amazon S3 using AWS Glue. The data is then transformed using Apache Spark on Amazon EMR. The data engineer wants to reduce the cost of storing and processing data by compressing the ingested files. Which THREE file formats support compression and are commonly used with Spark? (Choose THREE.)

Select 3 answers
A.ORC
B.Parquet
C.JSON
D.CSV
E.Avro
AnswersA, B, E

ORC is a columnar format that supports compression and is optimized for Hive/Spark.

Why this answer

Options A, B, and E are correct: ORC, Parquet, and Avro all support compression and are commonly used with Spark. Option C (JSON) is wrong: JSON files can be compressed, but JSON is a verbose text format and is less efficient, and the question asks for formats commonly used with Spark for analytics.

Option D is wrong because CSV can be compressed but is not efficient for Spark processing.

445
MCQhard

The Glue job attempts to read data from 's3://my-data-bucket/input/' and write to 's3://my-data-bucket/output/'. It also tries to update a table in the Glue Data Catalog. The job fails with an access denied error. What is the MOST likely cause?

A.The IAM role is missing the 's3:ListBucket' permission on the bucket.
B.The 'glue:UpdateTable' action is not allowed on the specific table.
C.The policy is missing a condition key for the S3 bucket.
D.The resource ARN does not include the bucket itself; it only covers objects.
AnswerA

Glue needs ListBucket to read the list of objects in the prefix.

Why this answer

The policy allows GetObject and PutObject on the bucket's objects, but the Glue job also needs 's3:ListBucket' permission to list objects in the bucket. Without ListBucket, the job cannot enumerate files, so Option A is correct. Option D is wrong because the resource is correct ('/*' includes all objects).

Option B is wrong because the action 'glue:UpdateTable' is allowed. Option C is wrong because there is no condition key issue.
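The distinction between bucket-level and object-level permissions can be made concrete with a small sketch. The policy from the question is not reproduced in this section, so the shape below is an assumption; only the bucket name comes from the question, and the Glue Data Catalog statement is omitted.

```python
import json

BUCKET = "my-data-bucket"  # bucket name taken from the question

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Bucket-level action: the resource is the bucket ARN itself.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
        {
            # Object-level actions: the resource covers objects in the bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
    ],
}

print(json.dumps(policy, indent=2))
```

A policy containing only the second statement would match the failure described: objects can be read and written by key, but the input prefix cannot be listed.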

446
MCQmedium

A retail company uses AWS Glue ETL jobs to process sales data from an S3 data lake. The source data is partitioned by year/month/day in CSV format. The Glue job reads the latest day's data, performs transformations (e.g., cleaning, aggregating), and writes the results to a separate S3 bucket. The job runs on a schedule every day at 2 AM. Recently, the job has been failing intermittently with the error 'AnalysisException: Path does not exist: s3://source-bucket/year=2024/month=02/day=30/'. The engineer verifies that the folder 'day=30' does not exist because February has only 29 days in 2024. The job is reading data from a hardcoded path. The company expects the job to handle variable days per month automatically. What should the engineer do to fix the issue?

A.Modify the script to use Spark SQL with manual partition pruning based on current date
B.Add a try-catch block in the script to skip missing partitions
C.Increase the job's retry count and set a timeout
D.Use a Glue crawler to populate the Data Catalog and use dynamic frame from_catalog with partition predicates
AnswerD

The crawler discovers existing partitions, and dynamic frame reads only available partitions.

Why this answer

Option D is correct because using a Glue crawler to update the partition metadata and then using dynamic frame with from_catalog allows Glue to automatically discover all existing partitions. This eliminates the need for hardcoded paths. Option B (try-catch) is a workaround but not a proper solution.

Option C (increase retries) does not fix the root cause. Option A (use Spark SQL with manual partition pruning) still requires knowing which partitions exist.
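The partition predicate used with the catalog-based read can be derived from the run date instead of a hardcoded path. The sketch below only builds the predicate string; `partition_predicate` is a hypothetical helper, and it assumes the crawler registered `year`, `month`, and `day` as string partition columns. In a Glue script the string would typically be passed as the `push_down_predicate` argument of `create_dynamic_frame.from_catalog`.

```python
from datetime import date, timedelta


def partition_predicate(run_date):
    """Return a predicate selecting the previous day's year/month/day partition."""
    # Date arithmetic handles month lengths and leap years.
    target = run_date - timedelta(days=1)
    return (
        f"year == '{target:%Y}' and month == '{target:%m}' "
        f"and day == '{target:%d}'"
    )


# A run on 1 March 2024 targets 29 February, never a non-existent day=30.
print(partition_predicate(date(2024, 3, 1)))
# year == '2024' and month == '02' and day == '29'
```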

447
MCQhard

A company ingests JSON data from an S3 bucket into a Glue ETL job. The data contains nested structures and arrays. The team wants to flatten the data into a tabular format for analysis in Athena. Which Glue transformation is appropriate?

A.Map
B.Relationalize
C.Filter
D.DropNullFields
AnswerB

Relationalize transforms nested JSON into relational tables suitable for querying.

Why this answer

The Relationalize transformation is specifically designed to flatten nested JSON into relational tables. Option D (DropNullFields) removes nulls; Option A (Map) applies a function; Option C (Filter) selects rows.
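To show what flattening means here, the following is a plain-Python illustration of turning nested structures into dotted column names. It is not the Glue implementation: `flatten` is a hypothetical helper, and unlike Relationalize it leaves arrays in place rather than pivoting them into separate tables.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested structs, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            # Scalars (and, in this sketch, arrays) are copied as-is.
            flat[name] = value
    return flat


row = {"id": 1, "user": {"name": "a", "geo": {"lat": 1.5}}, "tags": ["x", "y"]}
flat_row = flatten(row)
print(flat_row)
# {'id': 1, 'user.name': 'a', 'user.geo.lat': 1.5, 'tags': ['x', 'y']}
```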

448
MCQmedium

A company uses AWS Glue to transform data in Amazon S3. The transformation logic is complex and involves multiple steps. The data engineer wants to implement a workflow that handles dependencies and retries on failure. Which AWS service should be used to orchestrate the Glue jobs?

A.AWS Step Functions
B.AWS Lambda
C.Amazon Managed Workflows for Apache Airflow (MWAA)
D.Amazon CloudWatch Events
AnswerA

Step Functions can orchestrate Glue jobs with retries and error handling.

Why this answer

Option A is correct because AWS Step Functions can orchestrate multiple Glue jobs with error handling and retries. Option C is wrong because Amazon MWAA (Airflow) can also orchestrate but is more complex. Option D is wrong because Amazon CloudWatch Events schedules jobs but does not handle dependencies.

Option B is wrong because AWS Lambda is not designed for orchestration of long-running jobs.
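A two-step workflow with dependencies and retries could be expressed as a Step Functions state machine definition along these lines. This is a sketch: the job names (`clean-job`, `aggregate-job`), state names, and retry settings are hypothetical, while `arn:aws:states:::glue:startJobRun.sync` is the service integration that waits for a Glue job run to finish.

```python
import json


def glue_task(job_name, next_state=None):
    """Build a Task state that runs a Glue job and retries it on failure."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 60,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
    }
    if next_state:
        state["Next"] = next_state  # dependency: run the next job afterwards
    else:
        state["End"] = True
    return state


# Hypothetical job names, for illustration only.
definition = {
    "StartAt": "Clean",
    "States": {
        "Clean": glue_task("clean-job", next_state="Aggregate"),
        "Aggregate": glue_task("aggregate-job"),
    },
}

print(json.dumps(definition, indent=2))
```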

449
MCQhard

A data engineer is troubleshooting an AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but writes duplicate rows into Redshift. The source data is static and does not contain duplicates. Which configuration change is most likely to resolve this issue?

A.Enable the 'upsert' feature in the Redshift connection by setting 'update' to true.
B.Modify the job to use the 'postactions' option with a SQL statement that deletes duplicates before final insert.
C.Use partition pruning on the S3 source to reduce the number of files read.
D.Increase the number of DPUs (Data Processing Units) allocated to the job.
AnswerB

Using postactions to perform a MERGE or delete duplicates after staging can ensure idempotent writes.

Why this answer

The job runs successfully but writes duplicate rows because AWS Glue's Spark-based ETL jobs can retry tasks on failure, and when writing to Redshift using the JDBC connector, the default behavior is to append data without deduplication. Using the 'postactions' option with a SQL DELETE statement that removes duplicates before the final INSERT ensures that only unique rows remain, resolving the duplication without altering the source data.

Exam trap

The trap here is that candidates often assume duplicate rows come from the source data or a misconfiguration in the write mode, but the real cause is the default append behavior combined with Spark task retries, and the solution is to use post-write deduplication rather than changing the write mode or source processing.

How to eliminate wrong answers

Option A is wrong because enabling 'upsert' with 'update' to true is used for merging data based on a key, but it does not prevent duplicate rows from being inserted; it only updates existing rows if a key matches, and the source data has no duplicates, so this would not fix the issue. Option C is wrong because partition pruning on the S3 source reduces the number of files read but does not address the duplication caused by job retries or write behavior; it optimizes performance, not data integrity. Option D is wrong because increasing the number of DPUs allocates more compute resources to the job, which can improve performance but does not prevent duplicate writes; duplication is a logic or configuration issue, not a resource constraint.
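The idea of making the write idempotent through a staging table can be demonstrated locally. The sketch below uses SQLite as a stand-in for Redshift, with hypothetical table names; in a Glue job, comparable SQL statements would be supplied through the connection's `preactions` / `postactions` options rather than executed like this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE sales_staging (id INTEGER, amount REAL)")


def load_batch(rows):
    """Load rows through a staging table so re-running the same batch is idempotent."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM sales_staging")
        conn.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)
        # Remove target rows that the batch is about to re-insert.
        conn.execute(
            "DELETE FROM sales WHERE id IN (SELECT id FROM sales_staging)"
        )
        conn.execute("INSERT INTO sales SELECT id, amount FROM sales_staging")


batch = [(1, 10.0), (2, 20.0)]
load_batch(batch)
load_batch(batch)  # simulated retry of the same batch

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2
```

A plain append would have left four rows after the retry; the delete-then-insert step keeps the target at two.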

450
MCQmedium

A company ingests streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time using custom Python code before being stored in Amazon S3. Which AWS service should be used to perform this transformation?

A.Amazon EMR
B.Amazon Kinesis Data Firehose
C.AWS Glue
D.Amazon Kinesis Data Analytics for Apache Flink
AnswerD

Kinesis Data Analytics for Apache Flink allows running Flink applications that can process streaming data with custom Python code.

Why this answer

Option D is correct because Amazon Kinesis Data Analytics for Apache Flink runs Apache Flink applications, which can include custom Python transformations via Flink's Python API. Option C is wrong because AWS Glue is a batch ETL service, not optimized for real-time stream processing. Option B is wrong because Amazon Kinesis Data Firehose can only transform records through Lambda functions; that is possible for simple cases, but Kinesis Data Analytics for Apache Flink is purpose-built for stream processing and is the better fit for complex streaming transformations.

Option A is wrong because Amazon EMR is a big data platform that can process streams but requires more setup and is not the simplest managed service for this use case.
