Courseiva

Free IT certification practice questions with explained answers for CCNA, CompTIA, AWS, Azure, Google Cloud, and more.

Courseiva is a free IT certification practice platform offering original exam-style practice questions, detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics for Cisco, CompTIA, Microsoft, AWS, and other technology certifications.

© 2026 Courseiva. Courseiva is operated by JTNetSolutions Ltd. All rights reserved.

Courseiva is an independent certification practice platform and is not affiliated with, endorsed by, or sponsored by Cisco, Microsoft, AWS, CompTIA, Google, ISC2, ISACA, or any other certification vendor. Vendor names and certification marks are used only to identify the exams learners are preparing for.

DEA-C01 · Free, no signup

Data Ingestion and Transformation

Practice DEA-C01 Data Ingestion and Transformation questions with full explanations on every answer.

610 questions

DEA-C01 Domains

Data Ingestion and Transformation
Data Operations and Support
Data Security and Governance
Data Store Management

DEA-C01 Data Ingestion and Transformation questions (showing 300 of 610)

1

A data engineer needs to ingest streaming data from an IoT fleet into Amazon S3 for near-real-time analytics. The data volume is approximately 5 GB per hour, and each event is less than 1 KB. Which AWS service should be used as the ingestion endpoint?

2

A company uses AWS Glue ETL jobs to transform data from Amazon S3 to Amazon Redshift. The job reads JSON files, applies schema mapping, and writes to a Redshift table. Recently, the job started failing with memory errors. The data volume has increased tenfold. Which approach should a data engineer take to resolve this issue with minimal code changes?

3

A financial services company processes real-time stock trade data. They use Amazon Kinesis Data Streams with a shard count of 5, each shard receiving about 500 records per second. The consumer application uses the Kinesis Client Library (KCL) with DynamoDB for checkpointing. Lately, some records are being processed multiple times. What is the most likely cause?

4

A data engineering team needs to transform CSV files stored in Amazon S3 into Parquet format using AWS Glue. The files are partitioned by date and are updated hourly. Which AWS Glue feature should be used to automatically detect the schema and partition structure?

5

An e-commerce company ingests clickstream data from their website into Amazon S3. The data is in JSON format, and each file is about 10 MB. They need to transform the data into a columnar format for analytics and load it into Amazon Redshift nightly. The transformation should be cost-effective and require minimal operational overhead. Which approach meets these requirements?

6

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an on-premises Oracle database to Amazon S3 in Parquet format. The replication is used for near-real-time analytics. Recently, the DMS task started failing with an error indicating insufficient memory. The source database is large (2 TB). What should a data engineer do to resolve this issue while minimizing changes to the existing architecture?

7

A data engineer needs to ingest data from multiple SaaS applications (Salesforce, Marketo) into Amazon S3 for a data lake. The data volumes are moderate and the sync needs to be scheduled daily. Which AWS service is most appropriate for this task?

8

A company uses AWS Lambda to process records from an Amazon Kinesis Data Stream. Each record is about 50 KB. The Lambda function transforms the data and writes to Amazon DynamoDB. Recently, the Lambda function has been experiencing throttling and high error rates. The Kinesis stream has 10 shards. What is the most cost-effective solution to improve processing throughput?

9

A data engineer is designing a data ingestion pipeline for a social media analytics platform. The pipeline must ingest tweets in real time, perform sentiment analysis, and store results in Amazon S3. The sentiment analysis is compute-intensive and must be done as the data arrives. The estimated throughput is 10,000 tweets per second. Which architecture is most suitable?

10

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. The transfer must be completed within one week. Which service should be used?
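As a rough feasibility check on the figures in this question, the sketch below estimates how long 50 TB would take over a 100 Mbps link. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second), a fully utilized link, and no protocol overhead. The function name `transfer_days` and the `utilization` parameter are hypothetical and only serve the illustration.

```python
def transfer_days(data_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Estimate the days needed to move data_tb terabytes over a link_mbps link.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Mbps = 1e6 bits/s) and
    ignores protocol overhead; utilization scales the usable bandwidth.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    days = transfer_days(50, 100)
    print(f"{days:.1f} days at full utilization")
```

Under these assumptions the estimate comes out well above seven days, so the network path alone would not fit the one-week window that the question sets.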

11

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format. The delivery stream is configured with a buffer size of 5 MB and a buffer interval of 60 seconds. However, the data engineer notices that S3 objects are being created with sizes much smaller than 5 MB. What is a likely cause?

12

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that transforms each record and writes it to Amazon S3. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors when writing to S3. The team has already increased the Lambda function's memory and timeout. Which action should the team take to resolve the issue?

13

An e-commerce company uses AWS Glue to run ETL jobs that transform clickstream data from Amazon S3. The job reads Parquet files, performs aggregations, and writes the results to Amazon Redshift. The job runs successfully but takes longer than expected. The data volume is increasing. Which design change would MOST improve the job's performance?

14

A data engineer needs to ingest JSON data from an on-premises relational database into Amazon S3 every hour. Which AWS service should be used to set up a scheduled, incremental data transfer?

15

A company is using AWS Glue to process data from Amazon S3. The Glue job reads CSV files and writes Parquet files to a different S3 bucket. The job occasionally fails with 'java.lang.OutOfMemoryError: Java heap space'. The data size varies. Which change should the engineer make to avoid this error?

16

A data engineering team uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. They notice that the application's checkpointing is failing intermittently, causing data reprocessing. The application uses a large state. Which configuration change should the team make to improve checkpoint reliability?

17

A data engineer is designing a serverless data ingestion pipeline that uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed using AWS Lambda before being written to S3. Which two steps are required to enable this transformation? (Select TWO.)

18

A company uses AWS Glue to perform ETL on data stored in Amazon S3. The Glue job reads CSV files, converts them to Parquet, and partitions by date. The job runs daily and processes about 500 GB of data. The team wants to optimize costs and performance. Which three actions should the team take? (Select THREE.)

19

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job reads JSON records and writes Parquet to Amazon S3. Recently, the job started failing with 'Out of Memory' errors. Which change is MOST likely to resolve the issue?

20

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data arrives as JSON via AWS IoT Core, and must be stored in Amazon S3 in partitioned Parquet format. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which combination of services should the engineer use?

21

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3. The data volume is about 500 GB initially, with daily incremental updates of 10 GB. The pipeline must minimize operational overhead. Which AWS service should be used for the initial and incremental loads?

22

A company has a Glue ETL job that reads from an Amazon RDS for MySQL table and writes to Amazon S3. The job runs hourly and processes new records based on a 'last_modified' timestamp column. Recently, the job started missing some records because the timestamp in MySQL is stored with microsecond precision but Glue's job bookmark only tracks second precision. Which solution addresses this issue?

23

A data engineer is ingesting CSV files from an Amazon S3 bucket into a Glue Data Catalog table. The files have headers, but some files have extra columns not present in the first file. The engineer wants the Glue crawler to automatically detect the schema. Which crawler configuration option should be used?

24

A company is building a data lake on Amazon S3. Data arrives from multiple sources in JSON, CSV, and Avro formats. The data must be transformed to Parquet and partitioned by date and source. Which TWO services can perform this transformation with minimal custom code? (Choose TWO.)

25

A data engineer is troubleshooting an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but 5% of records are missing after the load. The engineer suspects data consistency issues. Which THREE actions could help diagnose and resolve the problem? (Choose THREE.)

26

A company uses AWS Glue to process CSV files from an S3 bucket. The job fails intermittently with a 'SchemaDetectionError' for files that have inconsistent column counts. What is the most efficient way to handle this?

27

A data pipeline uses Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data volume spikes occasionally, causing the Firehose buffer to fill up and leading to increased delivery latency. The latency must remain under 60 seconds. What should be done to minimize latency?

28

A company runs a nightly AWS Glue ETL job that reads from a JDBC source (PostgreSQL) and writes to S3 in Parquet format. The job takes over 6 hours, but the SLA requires completion within 4 hours. The source table has 500 million rows and is updated frequently. Which approach will most reliably reduce job duration?

29

A company uses AWS Data Pipeline to copy data from DynamoDB to S3 daily. Recently, the pipeline started failing with 'ThrottlingException' errors. The DynamoDB table has on-demand capacity. Which action should be taken to resolve the issue?

30

A data engineer needs to transform JSON data from an S3 bucket using AWS Glue. The JSON contains nested arrays and objects. Which Glue transform is best suited for flattening nested structures?
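To make "flattening nested structures" concrete, here is a minimal plain-Python sketch that turns nested objects and arrays into a single level of dotted keys. It is not the AWS Glue API and does not reproduce how any Glue transform handles arrays; the function name `flatten` and the dotted-key convention are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into one dict with dotted keys.

    Illustrative convention only: list elements are keyed by their index.
    """
    out: dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            out.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            out.update(flatten(child, f"{prefix}.{index}" if prefix else str(index)))
    else:
        out[prefix] = value  # leaf value: record it under the accumulated key
    return out


if __name__ == "__main__":
    record = {"id": 1, "user": {"name": "a"}, "tags": ["x", "y"]}
    print(flatten(record))
```

Keying array elements by index keeps everything in one row; a relational target would more likely move arrays into a separate child table, which this sketch does not attempt.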

31

A company ingests IoT sensor data into Kinesis Data Streams. The data is then processed by a Lambda function that aggregates readings and writes to DynamoDB. The Lambda function is experiencing high error rates due to throttling. Which TWO actions would reduce throttling?

32

A company uses Amazon RDS for MySQL as a source for AWS DMS to replicate data to S3. The replication task is failing with 'OutOfMemory' errors on the DMS instance. The source table has 10 million rows with large BLOB columns. Which THREE changes would most likely resolve the issue?

33

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that ingests JSON log data from web servers. The stream is configured to transform records with an AWS Lambda function and deliver to an Amazon S3 bucket. Recently, the stream has been failing with 'InvalidData' errors. Which action should the engineer take to resolve the issue?

34

A data engineer is setting up an Amazon Kinesis Data Analytics application to process streaming data from a Kinesis data stream named "input-stream". The application uses a reference data source from an S3 bucket. The engineer has attached the IAM policy shown in the exhibit to the application's IAM role. When starting the application, the engineer receives an 'AccessDeniedException' error. Which additional permission is required?

35

A company runs an e-commerce platform that generates clickstream data from millions of users. The data is ingested into Amazon Kinesis Data Streams with a shard count of 10. The data is then consumed by a Kinesis Data Analytics application that runs SQL queries to aggregate metrics in real time. Recently, the application has been falling behind, and the stream's iterator age metric is increasing. The data volume has doubled over the past month. The application currently uses a single Kinesis Data Analytics application with parallelism of 1. Which action should the data engineer take to improve the processing rate and reduce the iterator age without losing data or causing duplicates?

36

A financial services company ingests real-time stock trade data from multiple exchanges into Amazon Kinesis Data Streams. Each trade record is a JSON object containing fields: trade_id, symbol, price, quantity, and timestamp. The data is consumed by an AWS Lambda function that performs data validation and enrichment, then writes the processed records to an Amazon DynamoDB table for low-latency querying. Recently, the Lambda function has been timing out and failing to process all records. The Lambda function is configured with a 5-second timeout and 128 MB memory. The average record size is 2 KB, and the stream receives about 1000 records per second. The Lambda function's concurrency limit is 1000. Which set of actions should the data engineer take to resolve the issue without losing data?

37

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The data is JSON formatted and includes a timestamp field. The company wants to partition the output in Amazon S3 by date and hour, and ensure exactly-once processing semantics. Which combination of configurations should be used?

38

A data engineer is troubleshooting an AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but writes duplicate rows into Redshift. The source data is static and does not contain duplicates. Which configuration change is most likely to resolve this issue?

39

A company is designing a data ingestion pipeline to load CSV files from an SFTP server into Amazon S3. The files are generated hourly and range from 10 MB to 500 MB. Which AWS service should be used to orchestrate the transfer with minimal operational overhead?

40

A company uses Amazon Kinesis Data Firehose to ingest log data from web servers into Amazon S3. The data is in JSON format and each record is approximately 2 KB. The delivery stream is configured to buffer incoming records for 60 seconds or 5 MB, whichever comes first. The company notices that the data in S3 is delayed by up to 5 minutes during peak hours. Which action would most effectively reduce the delivery latency?

41

A company is ingesting real-time clickstream data into Amazon S3 using Amazon Kinesis Data Firehose. The data is semi-structured and the company wants to transform the data into Parquet format and partition it by year, month, day, and hour. Which TWO steps should be taken to achieve this? (Choose TWO.)

42

Arrange the steps to create an AWS Glue job that transforms data from Amazon S3 to Amazon Redshift in the correct order.

43

Arrange the steps to implement a data lake on Amazon S3 with AWS Lake Formation.

44

Order the steps to query data in Amazon Redshift Spectrum from an external table in Athena.

45

Order the steps to troubleshoot a failed AWS Glue job that reads from JDBC and writes to S3.

46

Match each AWS service to its primary purpose in data engineering.

47

Match each AWS data migration tool to its primary function.

48

Match each AWS database service to its primary use case.

49

Match each AWS security service to its purpose in data protection.

50

A company is streaming clickstream data from a website into Amazon Kinesis Data Streams. The data is then consumed by a Lambda function that transforms the records and writes them to an S3 bucket in Parquet format. Recently, the Lambda function has been timing out and the S3 bucket is not receiving all expected records. The Kinesis stream has a shard count of 10 and the Lambda function's reserved concurrency is set to the default. Which change would MOST likely resolve the issue?

51

A data engineer is using AWS Glue to perform ETL on data stored in an S3 bucket. The source data is in CSV format with a header row, and the target is a set of Parquet files partitioned by date. The engineer notices that the Glue job is reading all files in the source prefix, including temporary files that should be ignored. What is the MOST efficient way to exclude these temporary files?

52

A company uses Amazon Kinesis Data Firehose to ingest JSON logs from multiple sources into an S3 data lake. The data is then consumed by Amazon Athena for analysis. Recently, some queries have been failing with the error 'HIVE_BAD_DATA: Field xyz's type is an unsupported type'. The Firehose delivery stream transforms the data using a Lambda function that converts timestamps to Unix epoch. What is the MOST likely cause of the query failure?

53

A data engineer is designing a data ingestion pipeline to load data from an on-premises Oracle database to Amazon S3. The pipeline should capture changes in near real time (within minutes) and minimize impact on the source database. The source table has a 'last_modified' timestamp column. Which service combination would meet these requirements?

54

A company uses AWS Glue to process data from multiple S3 buckets. The Glue job runs daily and reads data from a bucket that contains millions of small files (each < 1 MB). The job has been running for hours and is often close to the 8-hour timeout limit. Which optimization would MOST reduce the job's runtime?

55

A data engineer is setting up an Amazon Kinesis Data Firehose delivery stream to load data into Amazon Redshift. The data is coming from an application that produces JSON records. The engineer needs to transform the data to match the Redshift table schema. Which approach is the MOST cost-effective and requires the least operational overhead?

56

A company uses Amazon S3 to store raw data and AWS Glue to run ETL jobs that transform the data into analytics-ready tables. The Glue job reads from a source with a schema that changes frequently (new columns added). The engineer wants the Glue job to automatically adapt to schema changes without manual intervention. Which configuration should the engineer use?

57

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a fleet of EC2 instances running a custom application that processes the records and writes to DynamoDB. The application is experiencing high latency and records are being processed slower than they are produced. The stream has 5 shards. Which action would MOST effectively improve processing speed?

58

A company is building a data lake on Amazon S3 and wants to ingest data from multiple AWS services (CloudTrail, VPC Flow Logs, and ALB logs). The data should be stored in a central S3 bucket with a common partitioning scheme. Which service can be used to collect and centralize this data with minimal configuration?

59

A company is running a critical application that generates millions of small JSON files every hour in an S3 bucket. A data engineer needs to process these files in near real time using AWS Glue. The engineer wants to minimize the latency between file arrival and Glue job start. Which TWO actions should the engineer take?

60

A data engineer needs to ingest data from an Amazon RDS MySQL database into a data lake on Amazon S3. The engineer wants to perform an initial full load and then capture incremental changes. Which TWO AWS services can be combined to achieve this?

61

A company is using Amazon Kinesis Data Streams to process real-time stock trade data. The data is consumed by a Lambda function that calculates moving averages and stores results in Amazon DynamoDB. The Lambda function is failing with 'ProvisionedThroughputExceededException' on the DynamoDB table. The table has on-demand capacity. Which TWO actions should the engineer take to resolve this issue?

62

A data engineer is designing a data ingestion pipeline for IoT sensor data. The sensors send JSON messages every second, and the data must be stored in Amazon S3 in near real time (within 5 minutes). The engineer also needs to transform the data by adding a timestamp and filtering out malformed records. Which THREE services should be used together?

63

The Glue job attempts to read data from 's3://my-data-bucket/input/' and write to 's3://my-data-bucket/output/'. It also tries to update a table in the Glue Data Catalog. The job fails with an access denied error. What is the MOST likely cause?

64

The command returns an empty result, but you know there are objects in the 'logs/' prefix larger than 1000 bytes. What is the MOST likely reason?

65

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that transforms and loads it into Amazon S3. Recently, the team noticed that the Lambda function is failing with throttling errors (HTTP 429) from the Kinesis API. Which configuration change should the team make to resolve this issue?

66

A company uses AWS Glue ETL jobs to transform data in Amazon S3. The data is partitioned by date and hour. The job reads the latest hour's data, performs aggregations, and writes results to a separate S3 bucket. The job runs every hour and processes approximately 500 MB of input data. The team notices that the job takes longer than expected, often exceeding the 1-hour window. Which action would most effectively reduce the job's runtime?

67

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 on a nightly basis. The data volume is approximately 10 GB per night. The database is accessible over the internet. Which AWS service is MOST appropriate for this task?

68

A company is using Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to an Amazon S3 bucket. The team notices that the application is experiencing high latency during peak hours. The stream has 8 shards, and the application is configured with a parallelism of 4. Which action would most likely reduce the latency?

69

A team is designing a data ingestion pipeline to load JSON files from an Amazon S3 bucket into Amazon Redshift. The files arrive every 5 minutes, and each file is between 10 MB and 50 MB. The team wants to minimize the time between file arrival and data availability in Redshift. Which approach should the team use?

70

A company uses AWS Glue ETL jobs to transform data stored in Amazon S3. The job reads data in Parquet format, applies transformations, and writes the output back to S3 in Parquet format. The team wants to improve the job's performance and reduce costs. Which action is MOST effective?

71

A data engineer is ingesting data from an Amazon RDS for PostgreSQL database into Amazon S3 using AWS Glue. The Glue job reads the entire table each time it runs, which takes several hours. The team wants to reduce the job duration by reading only new or updated records. Which approach should the engineer adopt?

72

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format and each record is about 2 KB. The delivery stream is configured to buffer data for 60 seconds or 5 MB, whichever comes first. The team notices that the S3 objects are very small (around 1 MB) and numerous, causing high costs due to S3 PUT requests. Which configuration change should the team make to reduce the number of S3 objects?
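The "60 seconds or 5 MB, whichever comes first" rule in this question can be modeled with a short sketch. It is a deliberately simplified model: one buffer filled at a constant rate, with no compression, format conversion, or partitioning, so real delivery streams may behave differently. The function name `first_flush` and the example rates are assumptions for illustration.

```python
from __future__ import annotations

MIB = 1024 * 1024


def first_flush(rate_bytes_per_s: float,
                buffer_bytes: int = 5 * MIB,
                interval_s: int = 60) -> tuple[str, float]:
    """Return which buffering hint fires first and the approximate object size.

    Simplified model: a single buffer filled at a constant ingest rate.
    """
    seconds_to_fill = buffer_bytes / rate_bytes_per_s
    if seconds_to_fill <= interval_s:
        return "size", float(buffer_bytes)
    # The interval elapses before the buffer fills, so the object is smaller.
    return "interval", rate_bytes_per_s * interval_s


if __name__ == "__main__":
    for label, rate in [("low volume", MIB / 60), ("high volume", MIB)]:
        trigger, size = first_flush(rate)
        print(f"{label}: {trigger} hint fires first, about {size / MIB:.1f} MiB per object")
```

In this model, objects of about 1 MB with a 60-second interval correspond to an ingest rate of roughly 1 MB per minute, where the time hint always fires before the size hint is reached.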

73

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a slow internet connection (100 Mbps). The data must be transferred within 2 weeks. Which service should the engineer recommend?

74

A company is using Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by an AWS Lambda function that enriches records and writes to Amazon S3. The Lambda function is experiencing high error rates due to records exceeding the 256 KB payload limit. Which TWO actions should the team take to resolve this issue?

75

A data engineering team is building a data lake on Amazon S3. They need to ingest data from multiple sources: (1) streaming IoT data, (2) daily CSV exports from an on-premises system via SFTP, and (3) change data capture (CDC) from an Amazon Aurora database. Which THREE services should the team use to ingest these data sources?

76

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are failing intermittently with 'Out of Memory' errors. The team wants to resolve this issue without increasing costs significantly. Which TWO actions should the team take?

77

A data engineer is ingesting streaming data from an IoT fleet into Amazon S3 using Amazon Kinesis Data Firehose. The data arrives as JSON, but the downstream analytics require Parquet format. Which Firehose transformation should the engineer configure?

78

A company uses AWS Glue ETL to process data from Amazon S3 and write results to Amazon Redshift. The job fails with a memory error when processing large files. Which action should the data engineer take to resolve this issue?

79

A data engineer is designing a real-time analytics pipeline for clickstream data. The source is Amazon Kinesis Data Streams, and the data must be stored in Amazon S3 in partitioned Parquet format with near-real-time latency. The engineer must also handle late-arriving data (up to 1 hour). Which combination of services meets these requirements?

80

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an on-premises Oracle database to Amazon S3. The data is stored as CSV files. The downstream team requires the data to be in Apache Parquet format. Which change should the data engineer make to the DMS task?

81

A data engineer is building a data ingestion pipeline that reads JSON files from Amazon S3 and loads them into an Amazon Redshift table using COPY commands. The files are gzip compressed and contain nested JSON. The engineer wants to minimize transformation steps. Which approach should the engineer use?

82

A company is migrating its on-premises data warehouse to Amazon Redshift. The daily batch load from the source database takes 6 hours using a single-node Redshift cluster. The engineer needs to reduce load time to under 2 hours without increasing cost significantly. Which strategy should the engineer adopt?

83

A data engineer needs to ingest data from an external HTTP API into Amazon S3. The API returns JSON data for a list of users, updated hourly. The engineer wants to use a serverless solution with minimal operational overhead. Which AWS service should the engineer use?

84

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that is failing to deliver data to an Amazon S3 bucket. The stream is configured with a Lambda transformation function. The CloudWatch logs show that the Lambda function is timing out. Which action should the engineer take to resolve the issue?

85

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS for MySQL and load it into Amazon S3. The job runs daily and processes incremental changes using the JDBC connection. Recently, the job has been failing with a 'Communications link failure' error. The RDS instance is in a private subnet. Which step should the engineer take first to diagnose the issue?

86

A data engineer is designing a data ingestion pipeline for streaming social media data. The data must be ingested with low latency (seconds) and stored in Amazon S3 for long-term analytics. The engineer also needs to perform real-time aggregations. Which TWO services should the engineer use? (Choose two.)

87

A company is building a data lake on Amazon S3. The data sources include relational databases, streaming data, and log files. The data engineer needs to ensure that the data ingestion pipeline can handle schema evolution, support both batch and streaming, and provide a unified metadata catalog. Which THREE services should the engineer use? (Choose three.)

88

A data engineer is troubleshooting a slow-running AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job processes 500 GB of CSV data daily. The engineer wants to improve performance. Which THREE actions should the engineer take? (Choose three.)

89

Refer to the exhibit. A data engineer is configuring an IAM policy for an AWS Glue ETL job that reads data from the 'my-data-bucket' S3 bucket, transforms it, and writes the output back to the same bucket. The engineer wants to prevent accidental deletion of objects. Based on the policy, which statement is true about the Glue job's permissions?

90

Refer to the exhibit. A data engineer has configured an S3 event notification to send an event to an SQS queue when objects are created in the 'incoming/' prefix. The engineer wants to trigger an AWS Lambda function to process the object. However, the Lambda function is not being invoked. What is the most likely cause?

91

Refer to the exhibit. An AWS Glue ETL job is failing with an OutOfMemoryError. The job reads from Amazon S3 and performs a GROUP BY on a large dataset. Which change should the data engineer make to resolve this error?

92

A data engineer needs to ingest streaming data from thousands of IoT devices into AWS for real-time processing. The data volume peaks at 5 GB/min. Which AWS service should be used as the ingestion endpoint?

93

A company uses AWS DMS to migrate an on-premises PostgreSQL database to Amazon RDS for PostgreSQL. After initial load, ongoing replication is set up. The replication task shows 'Task status: failed with error: The specified LSN is not available in the source database logs.' What is the most likely cause?

94

A data engineer is designing a data ingestion pipeline for clickstream data that arrives in bursts, up to 100 MB/s, and must be processed with exactly-once semantics. The data must be stored in Amazon S3 partitioned by event date and hour. Which combination of services should the engineer use?

95

A company needs to ingest CSV files from an FTP server into Amazon S3 daily. The files are typically 50 MB each, and the process should be fully managed with minimal operational overhead. Which AWS service should be used?

96

A data engineer is using AWS Glue ETL to transform a large dataset in S3. The job processes 2 TB of data daily and currently runs for 6 hours. The engineer wants to reduce runtime without changing the transformation logic. What is the best approach?

97

A company uses Amazon Kinesis Data Analytics (now Managed Service for Apache Flink) to run a Flink application on streaming data. The application fails with 'OutOfMemoryError: Java heap space'. The data volume is 10 MB/s. What is the most likely cause and solution?

98

A data engineer needs to transform JSON data from Amazon S3 into Parquet format using AWS Glue. The source files are in a bucket with thousands of small files. What is the best practice to optimize the Glue job performance?

99

A company uses AWS DMS to migrate data from Oracle to Aurora MySQL. During the ongoing replication, the target table shows duplicate primary key errors. What is the most likely cause?

100

A data engineer is building a real-time data pipeline using Amazon Kinesis Data Streams with a Lambda consumer. The data volume is 2 MB/s with average record size of 5 KB. The Lambda function processes records and writes to DynamoDB. Occasionally, the Lambda function fails with 'ProvisionedThroughputExceededException' on DynamoDB. What is the best way to handle this?
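As a rough sizing check for this scenario, the stream figures can be converted into DynamoDB write capacity. This is a minimal sketch, assuming one standard (non-transactional) item write per Kinesis record, decimal units (1 MB = 1000 KB), and the standard rule that one write capacity unit covers one write per second for an item of up to 1 KB. The function name `required_wcu` is hypothetical.

```python
import math

def required_wcu(stream_mb_per_s: float, avg_item_kb: float) -> int:
    """Estimate write capacity units needed to absorb the stream.

    Assumes one DynamoDB item write per stream record and
    1 WCU per started 1 KB of item size per write.
    """
    items_per_s = stream_mb_per_s * 1000 / avg_item_kb  # records per second
    wcu_per_item = math.ceil(avg_item_kb)               # item size rounded up to whole KB
    return math.ceil(items_per_s * wcu_per_item)

# 2 MB/s of 5 KB records is 400 writes/s at 5 WCU each.
print(required_wcu(stream_mb_per_s=2, avg_item_kb=5))  # 2000
```

A table provisioned below this estimate would be expected to throttle under sustained load, which is the situation the question describes.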

101

A data engineer needs to ingest data from a SaaS application (Salesforce) into Amazon S3 on a daily basis. Which TWO AWS services can be used for this purpose? (Choose TWO.)

102

A company is using AWS Glue ETL to process data from Amazon RDS for MySQL to Amazon S3. The job runs daily and takes 2 hours to complete. The engineer wants to improve performance without increasing cost significantly. Which TWO actions should the engineer take? (Choose TWO.)

103

A data engineer is designing a streaming ingestion pipeline using Amazon Kinesis Data Streams with multiple consumers. The data must be processed by a Lambda function for real-time alerts and also stored in Amazon S3 for historical analysis. Which THREE components are needed to implement this architecture? (Choose THREE.)

104

A data engineer has an IAM policy attached to an IAM role used by an AWS Glue job. The Glue job needs to read from S3 bucket 'data-bucket' and write to the same bucket. The job fails with an access denied error when trying to write to S3. What is the issue?

105

A data engineer is troubleshooting a Lambda function that reads from the Kinesis stream 'my-data-stream'. The Lambda function is able to read data but occasionally fails with 'KMS.AccessDeniedException'. What is the most likely cause?

106

A CloudFormation template defines an AWS Glue job. The job fails during execution with the error 'Unable to locate script: s3://scripts-bucket/etl-script.py'. The S3 bucket 'scripts-bucket' exists and the script file is present. What is the most likely cause?

107

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time and then stored in Amazon S3. Which AWS service should be used to perform the transformation?

108

A data engineering team needs to load data from an on-premises Oracle database to Amazon S3 daily. The data volume is about 50 GB per day, and the network bandwidth is 100 Mbps. The team wants to minimize operational overhead and use AWS managed services. Which solution should they choose?

109

A company uses Amazon Kinesis Data Streams with a shard count of 10 to ingest clickstream data. The data is consumed by a Lambda function that transforms the records and writes to Amazon S3. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. The average record size is 5 KB, and the incoming data rate is 15 MB/s. What is the most likely cause and solution?
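The figures in this scenario can be compared against per-shard write limits. The sketch below assumes a provisioned-mode stream with a write limit of 1 MB/s and 1,000 records/s per shard, and decimal units (1 MB = 1000 KB). The helper `shards_needed` is a hypothetical name used only for illustration.

```python
import math

# Assumed per-shard write limits for a provisioned-mode stream.
SHARD_WRITE_MB_PER_S = 1.0
SHARD_WRITE_RECORDS_PER_S = 1000

def shards_needed(ingest_mb_per_s: float, avg_record_kb: float) -> int:
    """Return the shard count that satisfies both the byte and record limits."""
    records_per_s = ingest_mb_per_s * 1000 / avg_record_kb
    by_bytes = math.ceil(ingest_mb_per_s / SHARD_WRITE_MB_PER_S)
    by_records = math.ceil(records_per_s / SHARD_WRITE_RECORDS_PER_S)
    return max(by_bytes, by_records)

current_shards = 10
print("write capacity (MB/s):", current_shards * SHARD_WRITE_MB_PER_S)  # 10.0
print("shards needed:", shards_needed(ingest_mb_per_s=15, avg_record_kb=5))  # 15
```

Under these assumptions the byte limit, not the record limit, is the binding constraint: 15 MB/s needs 15 shards, while 3,000 records/s would need only 3.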

110

A data engineer needs to ingest data from an external partner's FTP server to Amazon S3. The data arrives once daily as a CSV file. Which AWS service should be used for this ingestion?

111

A company uses AWS Glue to transform data in S3. The transformation job reads Parquet files, filters rows, and writes to another S3 bucket. The job takes longer than expected. Which change would MOST likely reduce the job execution time?

112

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to an S3 bucket. The delivery stream is configured with a buffer interval of 60 seconds and a buffer size of 5 MB. The data arrives at an average rate of 2 MB per second. What is the expected time interval between S3 writes?
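The buffering arithmetic here can be worked through directly. This is a simplified sketch, assuming a constant arrival rate and that Firehose flushes as soon as either the buffer size or the buffer interval is reached, whichever comes first. The function name is hypothetical.

```python
def seconds_to_flush(rate_mb_per_s: float, buffer_size_mb: float,
                     buffer_interval_s: float) -> float:
    """Time until the first buffering threshold is hit at a constant rate."""
    time_to_fill = buffer_size_mb / rate_mb_per_s  # seconds to reach the size limit
    return min(time_to_fill, buffer_interval_s)

# 5 MB buffer at 2 MB/s fills in 2.5 s, well before the 60 s interval.
print(seconds_to_flush(rate_mb_per_s=2, buffer_size_mb=5, buffer_interval_s=60))  # 2.5
```

At a much lower rate, such as 0.01 MB/s, the same function returns 60, because the interval is reached before the buffer fills.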

113

A company wants to ingest real-time data from a social media API into Amazon S3 for analysis. The API provides data as JSON records. Which AWS service is best suited for this ingestion?

114

A data engineer needs to transform JSON data into Parquet format using AWS Glue. The input data has nested fields. Which Glue feature should be used to flatten the nested structure?

115

A company uses Amazon Kinesis Data Streams with enhanced fan-out consumers. The stream has 5 shards. Each consumer reads from all shards. The total incoming data rate is 25 MB/s. What is the maximum read throughput per consumer if enhanced fan-out is enabled?
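The read-side arithmetic can be sketched as follows, assuming the documented enhanced fan-out allowance of 2 MB/s per shard for each registered consumer. The constant and function names are hypothetical.

```python
# Assumed enhanced fan-out read allowance: 2 MB/s per shard, per consumer.
EFO_READ_MB_PER_S_PER_SHARD = 2

def efo_read_throughput_mb_per_s(shard_count: int) -> int:
    """Maximum read throughput available to ONE enhanced fan-out consumer."""
    return shard_count * EFO_READ_MB_PER_S_PER_SHARD

print(efo_read_throughput_mb_per_s(5))  # 10
```

Because the allowance is dedicated per consumer, adding more enhanced fan-out consumers does not reduce this figure for the existing ones.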

116

A company needs to ingest streaming data from thousands of IoT devices. The data must be processed in real-time and stored in Amazon S3. Which TWO services should be used together?

117

A data engineer needs to transform data in Amazon S3 using AWS Glue. The job must handle schema evolution and partition pruning. Which THREE features should be used?

118

Which TWO AWS services can be used to ingest data from an on-premises relational database into Amazon S3 on a one-time basis?

119

A data engineer is ingesting streaming data from thousands of IoT devices into AWS. The data is JSON-formatted and must be stored in Amazon S3 for long-term analytics. Which service is most appropriate for real-time ingestion and routing to S3?

120

A company runs a SQL Server transactional database on Amazon RDS. They need to capture change data (inserts, updates, deletes) in near real-time and replicate them to an Amazon S3 data lake. Which AWS service is most suitable?

121

A data engineer is designing a data pipeline that ingests CSV files from an FTP server to Amazon S3. The files arrive hourly and each file is about 500 MB. The engineer wants to minimize operational overhead and cost. Which approach is best?

122

A company needs to transform JSON data from Amazon Kinesis Data Streams into Parquet format and store it in Amazon S3. The transformation includes simple field mappings and type conversions. Which approach is most cost-effective and serverless?

123

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3. The data volume is about 10 TB per day. The engineer wants to set up a managed Kafka connector. Which AWS service should they use?

124

A company is ingesting CSV files into Amazon S3. Each file contains a header row. The pipeline uses AWS Glue to crawl the S3 bucket and create a table in the AWS Glue Data Catalog. However, the crawler is including the header as data. What is the most likely cause?

125

A data engineer needs to ingest data from an Amazon S3 bucket into an Amazon Redshift table on a daily schedule. The data is in CSV format and the schema matches. Which service is simplest for this batch ingestion?

126

A company is ingesting streaming data from a fleet of weather sensors. Each sensor sends a JSON payload every second. The data is used for real-time dashboarding and also archived to S3. The pipeline should handle sudden bursts of data without data loss. Which architecture meets these requirements?

127

A data engineer is ingesting XML data from an external API into Amazon S3. The engineer needs to transform the XML to JSON using AWS Glue. The XML structure is deeply nested. Which Apache Spark method should be used in the Glue ETL script?

128

A company is ingesting large volumes of sensor data into Amazon S3. The data must be encrypted at rest using an AWS KMS customer managed key. Which TWO actions are required to enable server-side encryption with AWS KMS (SSE-KMS) on the S3 bucket?

129

A data engineer is designing a data ingestion pipeline for real-time clickstream data. The data must be available for both real-time analytics and batch processing. The engineer wants to use Amazon Kinesis Data Streams. Which THREE components should be included in the architecture?

130

A company is ingesting Apache logs from multiple web servers into AWS. The logs are sent via Amazon CloudWatch Logs to a subscription filter that delivers to a Lambda function. The Lambda function parses the logs and writes to Amazon S3. However, there is a significant backlog. Which THREE actions can reduce the backlog?

131

Refer to the exhibit, which shows an IAM policy for an AWS Lambda function. The Lambda function is triggered by an S3 event (object created) and needs to read from a Kinesis stream. However, the function fails with an access denied error when trying to read from Kinesis. What is the most likely cause?

132

Refer to the exhibit. A CloudFormation stack outputs the Glue job name and S3 bucket names. The Glue job transforms CSV files from the raw bucket to Parquet in the processed bucket. However, the Glue job is failing with an error that it cannot write to the processed bucket. What is the most likely cause?

133

Refer to the exhibit. A Lambda function named 'IngestionProcessor' is failing. The engineer checks CloudWatch Logs and sees the log group exists but storedBytes is 0. Why might the logs show no data?

134

A company needs to ingest real-time clickstream data from a web application into Amazon S3 for analytics. The data must be available within minutes of generation. Which AWS service should be used to capture and deliver this streaming data?

135

A data engineer is tasked with transforming JSON data from an S3 bucket into Parquet format for efficient querying. The transformation should run on a schedule every hour. Which AWS service is best suited for this task?

136

A company uses Kinesis Data Streams to ingest IoT data. The data volume varies, and occasionally the shard write throughput is exceeded, causing ProvisionedThroughputExceededException errors. The data engineer needs to handle these spikes without losing data. Which approach is most cost-effective and requires minimal code changes?

137

A company wants to migrate on-premises data to Amazon S3 using AWS DataSync. The data is 10 TB and the network bandwidth is 1 Gbps. The migration must be completed within 48 hours. What should the data engineer do to meet the deadline?
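A quick feasibility check for this kind of deadline is to compute the transfer time from data size and link speed. This is a back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits/s). The `utilization` parameter is a hypothetical allowance for protocol overhead and other traffic sharing the link.

```python
def transfer_hours(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8                       # payload size in bits
    effective_bps = link_gbps * 1e9 * utilization   # usable bits per second
    return bits / effective_bps / 3600

print(f"{transfer_hours(10, 1):.1f} h at full line rate")        # 22.2 h
print(f"{transfer_hours(10, 1, utilization=0.8):.1f} h at 80%")  # 27.8 h
```

Both estimates fall inside the 48-hour window, so under these assumptions the link itself is not the limiting factor.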

138

A data engineer is designing a data pipeline that ingests streaming data from Kinesis Data Streams, transforms it using AWS Lambda, and writes to S3. The Lambda function sometimes fails due to transient errors, and the engineer wants to ensure no data is lost. Which approach should be used?

139

A company needs to ingest data from multiple SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for centralized analytics. The data volume is several GB per day. Which AWS service is most suitable for this ingestion?

140

A company is ingesting log files from EC2 instances into CloudWatch Logs and then wants to deliver them to S3 for long-term storage and analysis. The data engineer needs to ensure the logs are delivered to S3 within 5 minutes of being generated. Which approach meets this requirement?

141

A company is building a data lake on S3 and needs to ingest data from an on-premises Oracle database. The data is 5 TB and changes incrementally. The ingestion must capture changes in near real-time (less than 1 minute latency) and be cost-effective. Which approach should be used?

142

A data engineer needs to transform CSV files arriving in S3 into Parquet format and partition them by date. The transformation should be event-driven and run immediately after each file is uploaded. Which approach is most efficient?

143

A data engineer is designing a data ingestion pipeline that uses AWS Lambda to process records from a Kinesis Data Stream and write to DynamoDB. Which TWO strategies can help handle increased throughput and prevent data loss? (Choose TWO.)

144

A company is using AWS Glue to run ETL jobs that transform data from S3 to Redshift. The jobs are failing intermittently with out-of-memory errors. Which THREE actions can help resolve this issue? (Choose THREE.)

145

A data engineer needs to set up a data ingestion pipeline that reads from Amazon MSK (Amazon Managed Streaming for Apache Kafka) and writes to Amazon S3 with transformations. The data is in Avro format and must be converted to Parquet. Which THREE components should be used together? (Choose THREE.)

146

A data engineer needs to ingest streaming data from thousands of IoT devices into Amazon S3 in near real-time. The data must be processed with minimal latency and stored in a columnar format for analytics. Which service should the engineer use to ingest the data?

147

A company uses AWS Glue ETL jobs to transform data from Amazon RDS to Amazon S3 daily. The job recently started failing with memory errors. The data volume has grown 3x in the past month. Which change should the data engineer make to resolve the issue?

148

A data engineer needs to design a data ingestion pipeline that streams change data capture (CDC) events from an on-premises SQL Server database to Amazon S3 with low latency. The pipeline must handle schema changes and ensure exactly-once delivery semantics. Which combination of AWS services should the engineer use?

149

A data engineer is troubleshooting an AWS Glue ETL job that fails with the error: 'An error occurred while calling o137.pyWriteDynamicFrame. No such file or directory: s3://bucket/output/part-00000.parquet'. The job reads from a JDBC source and writes to S3. What is the most likely cause?

150

A company ingests JSON logs into Amazon S3 using Kinesis Data Firehose. The logs contain a timestamp field, but the delivery to S3 is delayed by up to 15 minutes during peak hours. The business requires near-real-time availability (under 2 minutes). Which configuration change should the data engineer make?

151

A data engineer runs a weekly AWS Glue ETL job that processes data from Amazon DynamoDB to Amazon S3. The job reads the entire table every time, which is slow and expensive. The job needs to process only items that changed since the last run. Which solution should the engineer implement?

152

A company needs to ingest data from an external API that returns CSV files daily. The files range from 100 MB to 2 GB. The data should be landed in Amazon S3 and then transformed using AWS Glue. Which ingestion method is most cost-effective and requires the least operational overhead?

153

A data engineer notices that an AWS Glue ETL job is running slower than expected. The job reads from Amazon S3, joins two datasets, and writes the result back to S3. The job uses the default worker type (G.1X) and 10 DPUs. Which action is most likely to improve performance?

154

A company ingests streaming data from social media feeds into Amazon Kinesis Data Streams. The data is consumed by an AWS Lambda function that transforms and writes to Amazon S3. Recently, the Lambda function started timing out and dropping records. The data volume has tripled. Which actions should the data engineer take to resolve this? (Choose TWO.)

155

A data engineer needs to design a data ingestion pipeline that captures streaming data from mobile app events into Amazon S3 for analytics. The pipeline must support real-time processing of events and allow for schema evolution over time. Which AWS services should the engineer use? (Choose THREE.)

156

A company uses AWS Glue to run ETL jobs daily. The data engineer wants to reduce costs by optimizing the job configuration. Which TWO actions will help reduce costs? (Choose TWO.)

157

A data engineer is designing a data transformation pipeline using AWS Glue. The source data is in Amazon S3 in Parquet format, and the transformed output must be written to another S3 bucket in Parquet format partitioned by year, month, and day. The pipeline should handle incremental updates efficiently. Which THREE features should the engineer use? (Choose THREE.)

158

Refer to the exhibit. A data engineer is troubleshooting an AWS Glue ETL job that fails with an access denied error when writing to S3. The IAM role attached to the Glue job has the policy shown. What is the most likely cause of the error?

159

Refer to the exhibit. A data engineer runs an AWS Glue job that fails with an 'Access Denied' error when writing to S3. The IAM role attached to the job has s3:PutObject permission on the output bucket. What additional configuration is most likely missing?

160

Refer to the exhibit. A data engineer is troubleshooting a Kinesis Data Streams consumer that is falling behind. The stream has 2 shards and is receiving data at a rate of 2 MB/s. The consumer is an AWS Lambda function with a batch size of 100 records. What should the engineer do to improve consumer throughput?

161

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job fails intermittently with a 'MemoryError'. What is the MOST likely cause?

162

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The delivery occasionally fails with 'Firehose is throttled'. What should be done to reduce throttling?

163

A company uses AWS DMS to migrate a 2 TB Oracle database to Amazon RDS for PostgreSQL. The migration is taking longer than expected. The task status shows 'Full load in progress' with a low 'Table throughput (rows/s)'. Which action would MOST improve throughput?

164

A data engineer is building a pipeline to ingest JSON files from Amazon S3 into Amazon Redshift. The files are 100 MB each and arrive every 5 minutes. Which service is BEST suited for this ingestion?

165

A company uses AWS Glue DataBrew to clean and transform data for analytics. The source data is in Parquet format in Amazon S3. The transformation includes filtering rows and adding calculated columns. What is the MOST cost-effective way to run these transformations on a schedule?

166

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis stream with 10 shards and writes to an S3 bucket. The application is experiencing high latency. Analysis shows that the application is not keeping up with the incoming data rate. Which action would MOST effectively reduce latency?

167

A company uses AWS Lake Formation to manage data lake permissions. A data engineer needs to grant an IAM role 'Read' access to a specific database and all its tables in the Data Catalog. What is the MOST efficient way to achieve this?

168

A company uses AWS Glue ETL to transform data from Amazon RDS for MySQL to Amazon S3. The Glue job reads from a JDBC connection. The job runs once daily and processes all records, but the data volume is growing. Which change would improve performance and reduce costs?

169

Which TWO options are valid methods to ingest on-premises relational database data into Amazon S3 for analytics? (Choose TWO.)

170

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a streaming ingestion architecture? (Choose THREE.)

171

Which TWO AWS services can be used to transform data in transit during ingestion? (Choose TWO.)

172

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is then consumed by a custom application for real-time analytics. Recently, the application has been experiencing high latency. The operations team suspects the shard count is insufficient. How should the team increase the shard count of the existing stream?

173

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is approximately 500 GB per day. The source database is behind a firewall that does not allow direct internet access. Which service should the engineer use to transfer the data securely?

174

A company uses AWS Glue to transform data stored in Amazon S3. During a run, the job fails with an 'OutOfMemoryError' in the Spark executor. The job processes 2 TB of Parquet files using 10 DPUs. The data is evenly distributed across partitions. Which action would MOST likely resolve the issue without impacting the job logic?

175

A company is designing a data ingestion pipeline for real-time IoT sensor data. The data volume peaks at 10,000 messages per second. The pipeline must process messages in order per sensor and persist raw data to Amazon S3 for archival. Which TWO services should be used together to meet these requirements? (Choose TWO.)

176

A data engineer is building a batch ETL pipeline using AWS Glue. The source data is in Amazon RDS for MySQL. The pipeline must run daily and process only new and modified records since the last run. The engineer needs to implement change data capture (CDC) efficiently. Which THREE steps should the engineer take? (Choose THREE.)

177

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The Flink application reads from a Kinesis Data Streams source, performs aggregations, and writes results to Amazon S3. The application is experiencing high checkpoint failures, and the processing lag is increasing. The data volume is 50 MB/s with an average record size of 1 KB. Which TWO actions would improve checkpoint reliability and reduce lag? (Choose TWO.)

178

Refer to the exhibit. A data engineer is attaching this IAM policy to an IAM role used by an AWS Glue job. The job reads from a Kinesis Data Streams stream and writes transformed data to an S3 bucket. When the job runs, it fails with an AccessDenied error for the Kinesis stream. What is the MOST likely cause?

179

Refer to the exhibit. A data engineer runs the describe-stream command on a Kinesis data stream. The stream has two shards. The engineer wants to increase the shard count to 4 using the UpdateShardCount API. What will be the resulting shard distribution?

180

Refer to the exhibit. A company uses S3 Event Notifications to trigger an AWS Lambda function whenever a new object is uploaded to an S3 bucket. The Lambda function processes the file and moves it to a different bucket. Recently, the function has been failing intermittently. The engineer checks the Lambda CloudWatch logs and sees the above event. What is the MOST likely cause of the intermittent failures?

181

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The jobs run nightly and take 3 hours to complete. The data volume is growing by 20% each month. The engineer needs to reduce job runtime and cost. The source RDS is a db.r5.large instance. Which approach would be MOST effective?

182

A data engineer uses AWS DMS to migrate a 2 TB PostgreSQL database to Amazon Aurora PostgreSQL. The migration task is set to full load + CDC. After the full load completes, the CDC phase starts but shows a high latency of 5 minutes. The source database has a low write load. What should the engineer do to reduce the CDC latency?

183

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time processing. Each device sends JSON payloads of about 2 KB at a rate of 1 message per second. The data must be processed with a durable, ordered stream per device. Which service should the company use as the ingestion layer?

184

A company uses Amazon EMR to process large datasets stored in Amazon S3. The data is in Parquet format and partitioned by date. The EMR cluster uses Spark SQL for transformations. Recently, the job has been slow and some tasks are failing due to 'java.lang.OutOfMemoryError'. The cluster has 10 core nodes of type m5.xlarge. Which configuration change would MOST improve performance and stability?

185

A data engineer is designing a data ingestion pipeline for real-time user activity logs. The logs are generated by a web application and must be ingested into Amazon S3 with minimal latency (under 1 minute). The logs also need to be queried in Amazon Athena. The engineer considers using Amazon Kinesis Data Firehose. Which TWO configurations are required to achieve near-real-time delivery to S3? (Choose TWO.)

186

A company uses AWS Glue to run ETL jobs that transform data from Amazon S3 (Parquet) into a denormalized format for Amazon Redshift. The Glue job uses the DynamicFrame API. The job is failing with a 'MemoryError' when performing a join operation. The data is skewed on the join key. Which THREE actions can reduce memory usage and improve job stability? (Choose THREE.)

187

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time analytics. Which AWS service is best suited for this purpose?

188

A data engineer needs to transform CSV files arriving in an S3 bucket into Parquet format and store them in another S3 bucket. The transformation is simple and on-demand, triggered by data arrival. Which solution is the MOST cost-effective and requires the least operational overhead?

189

A company uses Kinesis Data Analytics for SQL-based real-time analytics on streaming data. They notice that the application is processing data slower than the incoming rate, causing increased latency. Which action is MOST likely to improve the throughput?

190

An organization needs to ingest data from on-premises databases into Amazon S3 for archival purposes. The data volume is several TB per day, and the network has moderate bandwidth. Which AWS service is BEST suited for this bulk data transfer?

191

A company uses AWS Glue for ETL jobs. The data engineer needs to ensure that the Glue job can access an S3 bucket in another account. What is the recommended approach?

192

A data streaming application uses Kinesis Data Streams with 10 shards. The data producer is throttled frequently. Which action should be taken to resolve this issue?

193

A company needs to transform JSON data from an S3 bucket into a structured format for Amazon Redshift. The transformation should be done serverlessly. Which service should be used?

194

A company is ingesting data from multiple sources into S3 using AWS Glue. The data engineer notices that the Glue job is failing with an OutOfMemoryError. Which step should be taken to resolve this issue?

195

A company needs to ingest real-time clickstream data from a web application into Amazon Redshift with minimal latency. The data volume is high and requires processing before loading. Which architecture is MOST appropriate?

196

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data is generated at a high velocity and must be processed in near real-time. The pipeline must also handle bursty traffic. Which TWO AWS services should be combined to achieve this? (Choose TWO.)

197

A company uses AWS Glue to transform data stored in S3. The Glue job runs daily and processes data in the range of hundreds of GB. The data engineer wants to optimize the job for cost and performance. Which THREE actions should be taken? (Choose THREE.)

198

A data engineer needs to transfer 50 TB of data from an on-premises data center to Amazon S3 over a 1 Gbps network. The transfer must be completed within one week. Which TWO AWS services can be used for this task? (Choose TWO.)

199

A company uses Kinesis Data Streams to ingest clickstream data. They notice that the data processing latency increases as the number of shards grows. What is the most likely cause and solution?

200

A data engineer needs to ingest JSON files from an S3 bucket into a DynamoDB table. The files are updated hourly and contain new records. Which AWS service should be used to trigger a Lambda function for each new object?

201

A company uses AWS Glue ETL jobs to transform data in S3. The job runs successfully but takes longer than expected. The data is in Parquet format and partitioned by date. Which change would most improve performance without increasing cost?

202

A data engineer is designing a streaming pipeline that ingests IoT sensor data from 10,000 devices. Each device sends a 1 KB message every second. The data must be processed in near real-time and stored in S3 for analytics. Which combination of services provides the most cost-effective solution?

203

A company runs a nightly ETL job using AWS Glue. The job reads data from a JDBC connection to an on-premises MySQL database. The job fails with an error indicating that the connection pool is exhausted. What is the most likely cause and solution?

204

A data engineer needs to transform CSV files to Parquet format using AWS Glue. The source data contains sensitive columns that must be masked. Which Glue feature should be used?

205

A data pipeline ingests streaming data from Kinesis Data Streams into S3 via Kinesis Data Firehose. Occasionally, small files are written to S3, increasing downstream processing costs. What is the most efficient way to reduce the number of small files?

206

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an Amazon RDS for Oracle instance to S3. The data is used for analytics. The replication lags behind the source by several hours. Which change would most likely reduce the lag?

207

A data engineer needs to ingest data from an external FTP server into S3 on a schedule. The FTP server is only accessible via VPN. Which AWS service is best suited for this task?

208

A data engineer is designing a near-real-time streaming pipeline to ingest clickstream data from a web application. The data must be enriched with user metadata from a DynamoDB table before being stored in S3. Which combination of AWS services should the engineer use? (Choose TWO.)

209

A company uses a Kinesis Data Firehose delivery stream to load data into an S3 bucket. The data is in JSON format and must be converted to Parquet before landing in S3. Which steps are required to achieve this? (Choose THREE.)

210

A data engineer needs to ingest data from a SaaS application that sends webhooks in JSON format. The data must be stored in S3 for batch analysis. Which AWS services can receive the webhooks and store the data in S3 with minimal custom code? (Choose TWO.)

211

Refer to the exhibit. A data engineer is configuring an IAM policy for a Lambda function that writes transformed data to S3. The function writes to both 'example-bucket/data/' and 'example-bucket/public/'. The policy is intended to enforce server-side encryption with SSE-S3 for all objects written to the 'public/' prefix, while allowing all operations on other prefixes. However, the Lambda function is failing with an AccessDenied error when writing to 'example-bucket/public/'. What is the most likely cause?

212

Refer to the exhibit. A data engineer runs the above CLI command to find files smaller than 1000 bytes in a bucket. The command returns an empty array, but the engineer knows there are small files. What is the issue?

213

Refer to the exhibit. A data engineer deploys this CloudFormation template to create an AWS Glue job. The job fails on the first run with an error: 'AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/GlueServiceRole/... is not authorized to perform: s3:GetObject on resource: s3://my-bucket/scripts/etl.py'. What is the most likely cause?

214

A company uses AWS Glue to run ETL jobs daily. The data is stored in S3 as Parquet files partitioned by date. Recently, jobs have failed with the error 'No such file or directory' for certain partitions. What is the MOST likely cause?

215

A data engineering team needs to ingest streaming data from thousands of IoT devices and store it in Amazon S3 for batch processing. The data arrives at a rate of 10 MB/s, with occasional spikes up to 50 MB/s. The data must be processed in near real-time with minimal latency. Which AWS service should be used for ingestion?

216

A company runs a daily batch ETL job using AWS Glue. The job processes 500 GB of data from Amazon RDS to Amazon S3. The job currently uses a single DPU and takes 6 hours to complete. The team wants to reduce runtime to under 1 hour without increasing costs significantly. Which approach should they use?

217

A company needs to transfer 10 TB of historical data from an on-premises HDFS cluster to Amazon S3. The data is stored on a single 20 TB disk. The network link to AWS has a bandwidth of 1 Gbps. The transfer must be completed within 2 days. Which solution meets these requirements?
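The deadline in this scenario comes down to simple bandwidth arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second), and the 80% utilization figure is a hypothetical allowance for protocol overhead, not a number from the question.

```python
def transfer_hours(data_bytes: float, link_bps: float, utilization: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link of link_bps bits per second."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = (data_bytes * 8) / (link_bps * utilization)
    return seconds / 3600


TB = 10**12    # decimal terabyte (assumption)
GBPS = 10**9   # decimal gigabit per second (assumption)

best_case = transfer_hours(10 * TB, 1 * GBPS)
# 0.8 is a hypothetical utilization factor chosen for illustration
realistic = transfer_hours(10 * TB, 1 * GBPS, utilization=0.8)

print(f"{best_case:.1f} h at full line rate, {realistic:.1f} h at 80% utilization")
```

Both figures can then be compared against the 48-hour window stated in the question.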

218

A data engineer notices that an AWS Glue job writing to Amazon S3 in Parquet format creates many small files (less than 1 MB each). This leads to poor query performance in Amazon Athena. What is the BEST way to reduce the number of output files?

219

A company uses AWS Glue to process data from Amazon RDS MySQL into Amazon S3. The Glue job uses a JDBC connection and runs on a schedule. Recently, the job has been failing with a 'Communications link failure' error. The RDS instance is in a private subnet. Which troubleshooting step should the data engineer take FIRST?

220

A data engineer needs to ingest real-time clickstream data from a website into Amazon S3 for analytics. The data arrives as JSON records, each under 1 KB. The engineer wants to use a serverless solution with automatic scaling and minimal operational overhead. Which AWS service should be used as the ingestion endpoint?

221

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration is ongoing with continuous replication. The data engineer notices that some changes are not being captured in the target database. What is the MOST likely cause?

222

A company runs an AWS Glue ETL job that reads data from Amazon S3, transforms it, and writes back to S3 in a different partition structure. The job uses the 'spark.sql.shuffle.partitions' option set to 200. After the job completes, the output has many small files. The data engineer wants to minimize the number of output files while maintaining job performance. Which action should the engineer take?

223

A data engineer is designing a data ingestion pipeline to load JSON files from Amazon S3 into Amazon Redshift. Which TWO methods can be used to load the data efficiently?

224

A company is building a data lake on Amazon S3. They need to ingest data from multiple sources, including relational databases, streaming data, and log files. Which THREE AWS services can be used to ingest data into the data lake?

225

A data engineer is troubleshooting an AWS Glue job that reads from Amazon RDS MySQL and writes to Amazon S3. The job runs successfully but takes longer than expected. The engineer wants to optimize performance. Which THREE actions would improve job performance?

226

Refer to the exhibit. A data engineer is creating an IAM policy for an application that sends data to a Kinesis stream and stores processed data in S3. The policy is attached to an IAM role used by an EC2 instance. The application fails to write to S3 with an access denied error. What is the cause?

227

Refer to the exhibit. A data engineer runs the AWS CLI command to describe a Glue job. The job is expected to process new data incrementally using job bookmarks. However, the job reprocesses all data every time it runs. What is the MOST likely reason?

228

Refer to the exhibit. A data engineer is running an AWS Glue job that reads data from an S3 source. The job fails with the error shown. What is the MOST likely cause?

229

A company uses AWS Glue ETL jobs to transform data in Amazon S3. The data arrives in JSON format but needs to be converted to Parquet for efficient querying. Which AWS Glue feature should be used to infer the schema and generate transformation code?

230

A data engineer is ingesting streaming data from an IoT fleet into Amazon Kinesis Data Streams. The data must be transformed in real-time and loaded into an Amazon Redshift cluster. Which solution minimizes operational overhead?

231

A company ingests clickstream data into Amazon S3 via Kinesis Data Firehose. The data arrives in 20 MB files every 2 minutes. The data engineering team needs to transform nested JSON into a flat structure before loading into Amazon Redshift. Which approach is most cost-effective and scalable?

232

A data engineer needs to stream change data capture (CDC) events from an Amazon RDS for PostgreSQL database to Amazon S3 in near real-time. Which AWS service should be used?

233

A company uses AWS Glue to process data in Amazon S3. The Glue job fails with an error indicating that the partition keys in the catalog do not match the actual S3 partition structure. What is the most likely cause?

234

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Streams with a shard count of 10. The incoming data rate is 1 MB/second. The consuming application uses the Kinesis Client Library (KCL) with a single worker. What is the most likely performance bottleneck?

235

A company needs to ingest data from multiple on-premises databases into Amazon S3 for analytics. The databases include Oracle, MySQL, and PostgreSQL. The data must be continuously replicated with minimal latency. Which AWS service should be used?

236

A data engineer is building a data pipeline that ingests data from Amazon S3 into Amazon Redshift. The data is in CSV format and includes a timestamp column. The pipeline should load only new data incrementally. Which approach is most efficient?

237

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is compressed with GZIP and partitioned by year, month, day, and hour. The delivery stream is configured to buffer up to 5 MB or 60 seconds. Some records are missing from S3. What is the most likely cause?

238

Which TWO AWS services can be used to ingest streaming data into Amazon S3? (Choose two.)

239

Which TWO practices improve the performance of AWS Glue ETL jobs? (Choose two.)

240

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for real-time data ingestion? (Choose three.)

241

Refer to the exhibit. A data engineer has attached this IAM policy to an AWS Glue job role. The Glue job fails when trying to write transformed data to an S3 bucket located in a different AWS account. What is the most likely reason?

242

Refer to the exhibit. A data engineer runs the AWS CLI command and observes the output. The stream has two shards. A producer sends a record with a partition key that hashes to 150000000000000000000000000000000000000. To which shard will the record be written?
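Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and writes the record to the shard whose hash key range contains that value. The exhibit is not reproduced on this page, so the sketch below assumes the two shards split the key space evenly; the helper names are hypothetical and the real ranges should be read from the CLI output.

```python
import hashlib
from typing import List, Optional, Tuple

MAX_HASH_KEY = 2**128 - 1  # upper bound of the 128-bit hash key space


def even_shard_ranges(shard_count: int) -> List[Tuple[int, int]]:
    """Split the hash key space into contiguous ranges (assumes an even split)."""
    size = (MAX_HASH_KEY + 1) // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        end = MAX_HASH_KEY if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def partition_key_hash(partition_key: str) -> int:
    """MD5 of the partition key, interpreted as a 128-bit unsigned integer."""
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")


def shard_index_for_hash(hash_value: int, ranges: List[Tuple[int, int]]) -> Optional[int]:
    """Index of the range containing hash_value, or None if it is out of range."""
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    return None


ranges = even_shard_ranges(2)  # assumption: two evenly split shards
hash_from_question = 150000000000000000000000000000000000000
print(shard_index_for_hash(hash_from_question, ranges))
```

With a different split (for example after a reshard), only the `ranges` list changes; the lookup stays the same.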

243

Refer to the exhibit. A data engineer creates an AWS Glue job using this CloudFormation template. The job processes new data files in S3 and uses job bookmarks to track processed files. After initial success, the job runs again but processes all files again instead of only new ones. What is the most likely cause?

244

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed into Parquet format and stored in Amazon S3. Which AWS service can perform the transformation in near real-time with minimal operational overhead?

245

A data engineer needs to load data from an on-premises Oracle database to Amazon S3 daily. The table is 500 GB and grows by 50 MB per day. The load must capture only new and changed rows since the last run. Which solution is MOST cost-effective and requires the least maintenance?

246

A company uses AWS Glue to process JSON logs from S3. The logs have a nested structure and the schema evolves over time. The data engineer needs to ensure the Glue job can handle schema changes without failing. Which configuration should be used?

247

A data engineer is designing a data ingestion pipeline for clickstream data. The data arrives in batches of 10-50 MB every 5 seconds. The engineer needs to buffer the data, perform simple transformations (e.g., add timestamp, remove PII), and land it in S3 within 10 minutes. Which TWO services should be combined? (Choose TWO.)

248

A company needs to ingest data from multiple SaaS applications (Salesforce, Marketo) into Amazon S3 for analytics. The data volume is moderate (~100 GB per day). The pipeline must handle schema changes, deduplicate records, and provide low latency (under 1 hour). Which THREE services should be used? (Choose THREE.)

249

A data engineer needs to transform CSV files in S3 to Parquet format using a serverless solution. The files are large (up to 5 GB each) and arrive irregularly. Which TWO services can accomplish this with minimal operational overhead? (Choose TWO.)

250

A company uses Amazon S3 Event Notifications to trigger a Lambda function that processes incoming files. Recently, the Lambda function has been timing out for large files (>100 MB). The data engineer wants to improve the pipeline to handle large files reliably. Which solution is the MOST scalable and cost-effective?

251

A data engineer is designing a streaming pipeline that ingests data from an Amazon Kinesis Data Stream (with 5 shards) into Amazon S3. The data must be transformed using a complex stateful operation that cannot be done in a Lambda function (limited to 15 minutes). The engineer needs a solution that can maintain state across multiple records. Which service should be used?

252

A company needs to ingest data from a MySQL database into Amazon S3 in near real-time. The database is running on EC2. The data engineer wants to minimize the impact on the source database. Which service should be used?

253

A data engineer uses AWS Glue to process data from S3. The Glue job frequently fails with 'Out of Memory' errors. The job reads several large compressed files. What is the MOST effective way to resolve this issue without changing the code?

254

A company is building a data lake on S3. They have a large volume of CSV files (hundreds of GB) in a source bucket. They need to convert them to Parquet, partition by date, and ensure the data is encrypted at rest with SSE-KMS. The pipeline must be triggered automatically when new files arrive. Which THREE steps should be part of the solution? (Choose THREE.)

255

A company is streaming clickstream data from a website into Amazon Kinesis Data Streams. The data must be transformed in near real-time and stored in Amazon S3 for analytics. Which AWS service should be used to transform the data as it is ingested?

256

A data engineer needs to ingest data from an on-premises Oracle database to Amazon S3 daily. The data volume is 500 GB per day, and the network bandwidth is 200 Mbps. The requirement is to minimize the impact on the source database and ensure data integrity. Which combination of AWS services should be used?

257

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The Firehose delivery stream has a buffer size of 64 MB and a buffer interval of 300 seconds. The data volume is 1 GB per minute, and the average record size is 1 KB. The data must be delivered to S3 within 5 minutes of ingestion. The engineer notices that some files are being delivered after 10 minutes. What is the most likely cause?
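Firehose flushes a buffer when either the size threshold or the interval is reached, whichever comes first. The rough calculation below, which assumes 1 GB = 1024 MB and uses hypothetical function names, shows how quickly the buffer in this scenario would fill at the stated ingest rate.

```python
def seconds_to_fill_buffer(buffer_mb: float, ingest_mb_per_s: float) -> float:
    """Time for incoming data to reach the buffer size threshold."""
    return buffer_mb / ingest_mb_per_s


def expected_flush_seconds(buffer_mb: float, interval_s: float, ingest_mb_per_s: float) -> float:
    """A flush happens at whichever limit is hit first: size or interval."""
    return min(seconds_to_fill_buffer(buffer_mb, ingest_mb_per_s), interval_s)


ingest_mb_per_s = 1024 / 60  # 1 GB per minute, assuming 1 GB = 1024 MB
flush_after = expected_flush_seconds(64, 300, ingest_mb_per_s)

print(f"Buffer flush expected after about {flush_after:.2f} s")
```

Under these assumptions the size threshold is reached long before the 300-second interval, which suggests looking beyond the buffering hints when explaining the late files.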

258

A data engineering team needs to transform CSV files to Parquet format after they land in an S3 bucket. The transformation should be triggered automatically as soon as a new file arrives. Which AWS service is best suited for this task?

259

A company uses AWS Glue ETL jobs to process data from an S3 data lake. The job reads data in CSV format, transforms it, and writes to Parquet. The job runs daily and takes 2 hours to complete. The data volume is increasing by 20% each month. The engineer wants to reduce the job runtime. Which action is most effective?

260

A company uses Amazon Kinesis Data Streams to ingest IoT sensor data. The data is processed by an AWS Lambda function that transforms the records and writes to an Amazon S3 bucket. Recently, the Lambda function has been failing with 'Rate exceeded' errors for the S3 PUT API calls. The data volume is 10 MB/s with average record size 2 KB. What should be done to resolve this issue?

261

A company needs to ingest data from multiple SaaS applications (Salesforce, Marketo) and load it into Amazon Redshift. The data must be transformed before loading. Which AWS service should be used to build the ingestion pipelines?

262

A data engineer is using Amazon EMR to transform large datasets stored in S3. The cluster runs once a day and takes 3 hours. The engineer notices that the cluster is idle for 30 minutes at the start while waiting for resources. What is the most cost-effective way to reduce the idle time?

263

A company uses AWS Glue DataBrew for data preparation. The data source is an S3 bucket with millions of small CSV files (each < 1 MB). The DataBrew project takes a long time to load the sample data. What is the most likely cause and solution?

264

A company ingests IoT data into an S3 bucket using AWS IoT Core rules. The data is in JSON format, and each record is about 500 bytes. The data volume is 5 GB per day. The company wants to convert the data to Parquet format and partition it by year/month/day. Which TWO AWS services can be used together to achieve this with minimal operational overhead?

265

A company runs a real-time analytics platform using Amazon Kinesis Data Streams. The data is consumed by multiple consumers: one for real-time dashboard (using Lambda) and one for long-term storage (using Firehose to S3). The Kinesis stream has 10 shards. Each record is 1 KB, and the total incoming data rate is 5 MB/s. The Lambda consumer is falling behind and processing latency exceeds 10 seconds. Which TWO actions should be taken to resolve the issue?

266

A company is designing a data ingestion pipeline for real-time clickstream data. The data must be ingested with low latency (< 1 second) and then processed for real-time analytics. The processed data should be stored in Amazon S3 for batch analytics. Which THREE services should be used together?

267

A financial services company ingests real-time stock trade data from multiple exchanges into Amazon Kinesis Data Streams. Each trade record is a JSON object with fields: trade_id, symbol, price, quantity, timestamp. The stream has 5 shards. The data is consumed by an AWS Lambda function that aggregates trades per symbol every minute and writes the results to an Amazon DynamoDB table for a real-time dashboard. Recently, the dashboard has been showing outdated data, and the Lambda function is experiencing high error rates. The CloudWatch logs show 'ProvisionedThroughputExceededException' errors from DynamoDB. The DynamoDB table has 10 read capacity units (RCU) and 10 write capacity units (WCU). The average trade volume is 5,000 trades per second across all symbols, and there are 100 symbols. The Lambda function is configured with a batch size of 100 and a 1-minute window. The data volume is expected to double in the next month. As a data engineer, what is the most appropriate course of action?

268

A retail company uses AWS Glue ETL jobs to process sales data from an S3 data lake. The source data is partitioned by year/month/day in CSV format. The Glue job reads the latest day's data, performs transformations (e.g., cleaning, aggregating), and writes the results to a separate S3 bucket. The job runs on a schedule every day at 2 AM. Recently, the job has been failing intermittently with the error 'AnalysisException: Path does not exist: s3://source-bucket/year=2024/month=02/day=30/'. The engineer verifies that the folder 'day=30' does not exist because February has only 29 days in 2024. The job is reading data from a hardcoded path. The company expects the job to handle variable days per month automatically. What should the engineer do to fix the issue?
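One way to avoid paths for days that do not exist is to derive the partition prefix from a real calendar date instead of hardcoding it. The sketch below is a minimal illustration in plain Python, not the job's actual code: the function names are hypothetical, and it assumes the job should read the partition for the day before the run date.

```python
from datetime import date, timedelta


def partition_prefix(bucket: str, day: date) -> str:
    """Build the year=/month=/day= prefix for a calendar date."""
    return f"s3://{bucket}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def previous_day_prefix(bucket: str, run_date: date) -> str:
    """Prefix for the day before run_date (assumed to be the latest complete day)."""
    return partition_prefix(bucket, run_date - timedelta(days=1))


# A run on 1 March 2024 resolves to the last real day of February.
print(previous_day_prefix("source-bucket", date(2024, 3, 1)))
# -> s3://source-bucket/year=2024/month=02/day=29/
```

Because the date arithmetic is done by `datetime`, month lengths and leap years are handled without any special cases in the job.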

269

A startup is building a data pipeline to ingest user activity logs from a mobile app. The logs are sent in real-time via HTTP POST requests. The data volume is low (a few hundred requests per second) but can spike to a few thousand during promotions. The team wants to store the logs in Amazon S3 for analysis. They also need to be able to query the data using Amazon Athena with minimal latency. The data must be transformed from JSON to Parquet and partitioned by date. The team is considering using Amazon API Gateway with AWS Lambda to receive the logs and write to S3. However, they are concerned about Lambda cold starts and the complexity of handling spikes. Which alternative solution should they choose?

270

A company is streaming IoT data from thousands of devices into Amazon Kinesis Data Streams. The data must be transformed in real time before being stored in Amazon S3. Which service should be used to perform the transformation as the data streams through Kinesis?

271

A data engineer is designing a data ingestion pipeline for clickstream data from a mobile app. The data volume varies, with occasional spikes up to 10 MB/s. The pipeline must persist the raw data in Amazon S3 and make it available for near-real-time analytics via Amazon Athena. Which combination of services minimizes cost and operational overhead?

272

A company wants to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is 500 GB per transfer. Which AWS service is most appropriate for this batch ingestion?

273

A company is using Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data must be transformed from JSON to Parquet format before landing in S3. The transformation logic is simple: convert the JSON schema to Parquet. Which approach meets the requirements with the least operational overhead?

274

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that is experiencing high error rates when writing to an S3 bucket. The error logs indicate 'AccessDenied' errors. The S3 bucket policy allows access from the Firehose service, but the errors persist. What is the most likely cause?

275

A company needs to ingest data from multiple SaaS sources (e.g., Salesforce, Marketo) into Amazon S3 for analytics. Which AWS service is designed for this purpose?

276

A company is ingesting log files from multiple EC2 instances into Amazon S3 using the CloudWatch agent. The logs are delivered to a CloudWatch Logs group, and a subscription filter sends them to a Lambda function for transformation, then to Firehose. The Firehose stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The logs are critical and must be available in S3 within 5 minutes. What is the most cost-effective way to reduce the delivery latency?

277

A data engineer is designing a data ingestion pipeline for real-time financial transactions. The pipeline must ensure exactly-once processing semantics and must handle duplicate records that may occur due to retries. Which combination of AWS services can achieve exactly-once processing?

278

A company needs to ingest data from an on-premises Hadoop cluster into Amazon S3 for archival and analysis. The total data volume is 50 TB. The migration must be completed within one week. The on-premises network has a 1 Gbps connection to AWS. Which AWS service should be used?

279

A company is using AWS Glue to run ETL jobs that read from Amazon S3 and write to Amazon Redshift. The jobs are failing intermittently with 'Out of Memory' errors. Which TWO actions should the data engineer take to resolve this issue? (Choose TWO.)

280

A company is ingesting streaming data from social media feeds using Amazon Kinesis Data Streams. The data is consumed by multiple applications: one for real-time sentiment analysis and another for archival to S3. The data must be processed in order for each social media post. Which TWO approaches meet the requirements? (Choose TWO.)

281

A company is designing a data lake on Amazon S3. The data ingestion pipeline must handle both structured and unstructured data. The data must be cataloged for easy discovery. Which THREE services should be included in the solution? (Choose THREE.)

282

A company runs a real-time analytics platform that ingests data from thousands of sensors via Amazon Kinesis Data Streams. Each sensor sends a JSON payload every second. The data is consumed by a fleet of EC2 instances running a custom consumer application. Recently, the consumer has been falling behind, with the iterator age exceeding 10 minutes. The company has already increased the number of shards to 100, but the problem persists. The consumer application is single-threaded per shard and uses the Kinesis Client Library (KCL). The CPU utilization on the EC2 instances is below 30%. What should the data engineer do to reduce the iterator age?

283

A media company ingests video files from content partners into an Amazon S3 bucket. Each video file is 10-50 GB. Upon upload, an AWS Lambda function is triggered to extract metadata (e.g., resolution, duration) and store it in DynamoDB. The company now wants to also generate a thumbnail image for each video. The thumbnail generation is CPU-intensive and can take up to 5 minutes per video. The Lambda function has a maximum execution time of 15 minutes. The company has noticed that some thumbnail generation tasks are timing out. What should the data engineer do to reliably generate thumbnails for all videos?

284

A small startup is building a data pipeline to ingest customer orders from a web application into Amazon Redshift for analytics. The orders are written to an Amazon RDS MySQL database. The startup wants to replicate the orders to Redshift in near-real time (within 5 minutes) with minimal operational overhead. The data volume is low, averaging 100 new orders per minute. The startup has a single data engineer who is also responsible for other tasks. What is the simplest solution?

285

A data engineer needs to ingest streaming data from thousands of IoT devices into AWS for near-real-time analytics. The data volume varies significantly and can spike unpredictably. The engineer wants to minimize operational overhead and ensure that data is durably stored as soon as it arrives. Which AWS service combination should the engineer use?

286

A data engineer is troubleshooting a Lambda function that reads from a Kinesis Data Stream, processes records, and writes to a Kinesis Data Firehose delivery stream. The Firehose delivery stream is configured to deliver data to an S3 bucket. The Lambda function is failing with an access denied error. The IAM policy attached to the Lambda execution role is shown in the exhibit. Which permission is missing?

287

A company wants to migrate on-premises data to Amazon S3 using AWS DataSync. The data is stored on an NFS file server and the total volume is 50 TB. The network bandwidth between the on-premises data center and AWS is 1 Gbps (gigabit per second). What is the primary factor that will determine the total time required for the initial data transfer?

288

A data engineer runs an AWS Glue ETL job that reads from a large Amazon S3 source (several terabytes of CSV files) and writes transformed data to an S3 bucket in Parquet format. The job fails with the error shown in the exhibit. The job uses 10 workers of the G.1X worker type. The engineer needs to resolve the failure with minimal cost increase. What should the engineer do?

289

A company wants to ingest data from a SaaS application into Amazon S3. The SaaS application supports streaming data via HTTP POST requests. The data volume is approximately 100 MB per hour, and the company needs to store the raw data in S3 for archival and later analysis. Which approach is the most cost-effective and operationally efficient?

290

A data engineer needs to transform JSON data from a Kinesis Data Stream into Parquet format and store it in an S3 data lake. The transformation includes simple field mapping and data type conversions. Which AWS service is the most cost-effective for performing this transformation in near-real-time?

291

A data engineer runs an AWS Glue crawler that is configured to crawl an S3 bucket named 'my-data-lake' and update the Glue Data Catalog. The crawler fails with an access denied error. The IAM role attached to the crawler has the policy shown in the exhibit. What is the likely cause of the failure?

292

A company is using AWS Database Migration Service (DMS) to migrate a 2 TB Oracle database to Amazon Aurora PostgreSQL. The migration must have minimal downtime. The source database is highly active with continuous writes. Which DMS migration type and additional configuration should the engineer use?

293

A data engineer needs to schedule an AWS Glue ETL job to run every hour and process new data that arrives in an S3 bucket. The job should only process files that have been added since the last run. Which approach should the engineer use to track which files have been processed?

294

A data engineer is designing a data ingestion pipeline for real-time clickstream data from a website. The data must be ingested with low latency (seconds) and made available for multiple consumer applications, including a dashboard that refreshes every minute and a machine learning model that processes data in near-real-time. The engineer needs to choose a streaming ingestion service. Which TWO services meet these requirements? (Select TWO.)

295

A data engineer is troubleshooting a Kinesis Data Streams consumer application that is falling behind. The stream has 10 shards and is receiving 5 MB/s of data. The consumer uses the Kinesis Client Library (KCL) with a single worker. The worker is processing all 10 shards but is experiencing high latency and checkpointing delays. Which THREE actions should the engineer take to improve consumer performance? (Select THREE.)

296

A data engineer needs to transform data in an S3 data lake using AWS Glue ETL. The data is in CSV format and needs to be converted to Parquet with partitioning by date. The engineer wants to minimize the number of files written to S3 to improve query performance. Which TWO configuration options should the engineer use? (Select TWO.)

297

A company runs an e-commerce platform that generates clickstream data from user interactions on their website. The data is sent as JSON objects via HTTP POST to an API Gateway endpoint, which triggers a Lambda function that writes each record to a Kinesis Data Stream (100 shards). A second Lambda function consumes the stream, transforms the data (enriches with geolocation from a DynamoDB table), and writes to a Kinesis Data Firehose delivery stream that delivers Parquet files to an S3 data lake every 5 minutes. The system has been working for months, but recently the Firehose delivery stream started showing 'DeliveryFailed' errors for a subset of records. The errors point to 'InvalidData' from the Lambda transformation. The engineer reviews the Lambda transformation code and notices that the geolocation lookup occasionally fails because the DynamoDB table has a throttling issue. The engineer needs to handle these failures gracefully so that records that fail enrichment are still delivered to S3 with a null geolocation field, without blocking other records. Which course of action should the engineer take?

298

A data engineer is responsible for ingesting log files from a fleet of on-premises servers into Amazon S3 for central analysis. Each server generates log files that are rotated every hour, resulting in files of about 500 MB each. The total daily data volume is approximately 1 TB. The network connection between the on-premises data center and AWS is a 100 Mbps VPN. The engineer needs to ensure that all log files are transferred to S3 within 24 hours of generation without data loss. The engineer is considering using AWS DataSync. However, the initial setup shows that the transfer speed is insufficient to meet the 24-hour SLA. What should the engineer do to meet the requirement?

299

A data engineer is setting up a data pipeline to ingest data from an Amazon RDS for MySQL database into Amazon S3 using AWS Glue ETL. The Glue job uses a JDBC connection to read from the MySQL database. The job runs successfully, but the engineer notices that the job is taking longer than expected. The MySQL database is 500 GB in size and the Glue job uses 10 workers of type G.1X. The engineer wants to improve the performance of the extraction phase. The database is actively used by other applications, so the engineer must minimize the impact on the source database. Which approach should the engineer take?

300

A company uses AWS Glue to process data from multiple sources. The data is stored in an Amazon S3 data lake. The company needs to transform the data using a custom Python library that is not available in the default Glue environment. What is the MOST efficient way to make this library available to the Glue jobs?

Practice all 300 Data Ingestion and Transformation questions

Other DEA-C01 exam domains

Data Operations and Support · Data Security and Governance · Data Store Management

Frequently asked questions

What does the Data Ingestion and Transformation domain cover on the DEA-C01 exam?

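The Data Ingestion and Transformation domain is one area of the DEA-C01 exam blueprint published by Amazon Web Services. The practice questions on this page touch on streaming ingestion with Amazon Kinesis Data Streams and Kinesis Data Firehose, database replication with AWS DMS, bulk transfer with AWS DataSync, and transformation with AWS Glue, AWS Lambda, and Amazon EMR. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all DEA-C01 domains, with no account required.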

How many Data Ingestion and Transformation questions are in the DEA-C01 question bank?

The Courseiva DEA-C01 question bank contains 300 questions in the Data Ingestion and Transformation domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Data Ingestion and Transformation for DEA-C01?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Data Ingestion and Transformation questions for DEA-C01?

Yes — the session launcher on this page draws questions exclusively from the Data Ingestion and Transformation domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
