Courseiva
Knowledge + Practice

Free IT certification practice questions with explained answers for CCNA, CompTIA, AWS, Azure, Google Cloud, and more.

Courseiva is a free IT certification practice platform offering original exam-style practice questions, detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics for Cisco, CompTIA, Microsoft, AWS, and other technology certifications.

© 2026 Courseiva. Courseiva is operated by JTNetSolutions Ltd. All rights reserved.

Courseiva is an independent certification practice platform and is not affiliated with, endorsed by, or sponsored by Cisco, Microsoft, AWS, CompTIA, Google, ISC2, ISACA, or any other certification vendor. Vendor names and certification marks are used only to identify the exams learners are preparing for.

DEA-C01 Data Ingestion and Transformation • Complete Question Bank

DEA-C01 Data Ingestion and Transformation — All Questions With Answers

Complete DEA-C01 Data Ingestion and Transformation question bank: all questions with answers and detailed explanations.

500 questions • Free • No signup

Question 1 • easy • multiple choice

A data engineer needs to ingest streaming data from an IoT fleet into Amazon S3 for near-real-time analytics. The data volume is approximately 5 GB per hour, and each event is less than 1 KB. Which AWS service should be used as the ingestion endpoint?

Question 2 • medium • multiple choice

A company uses AWS Glue ETL jobs to transform data from Amazon S3 to Amazon Redshift. The job reads JSON files, applies schema mapping, and writes to a Redshift table. Recently, the job started failing with memory errors. The data volume has increased tenfold. Which approach should a data engineer take to resolve this issue with minimal code changes?

Question 3 • hard • multiple choice

A financial services company processes real-time stock trade data. They use Amazon Kinesis Data Streams with a shard count of 5, each shard receiving about 500 records per second. The consumer application uses the Kinesis Client Library (KCL) with DynamoDB for checkpointing. Lately, some records are being processed multiple times. What is the most likely cause?

Question 4 • easy • multiple choice

A data engineering team needs to transform CSV files stored in Amazon S3 into Parquet format using AWS Glue. The files are partitioned by date and are updated hourly. Which AWS Glue feature should be used to automatically detect the schema and partition structure?

Question 5 • medium • multiple choice

An e-commerce company ingests clickstream data from their website into Amazon S3. The data is in JSON format, and each file is about 10 MB. They need to transform the data into a columnar format for analytics and load it into Amazon Redshift nightly. The transformation should be cost-effective and require minimal operational overhead. Which approach meets these requirements?

Question 6 • hard • multiple choice

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an on-premises Oracle database to Amazon S3 in Parquet format. The replication is used for near-real-time analytics. Recently, the DMS task started failing with an error indicating insufficient memory. The source database is large (2 TB). What should a data engineer do to resolve this issue while minimizing changes to the existing architecture?

Question 7 • easy • multiple choice

A data engineer needs to ingest data from multiple SaaS applications (Salesforce, Marketo) into Amazon S3 for a data lake. The data volumes are moderate and the sync needs to be scheduled daily. Which AWS service is most appropriate for this task?

Question 8 • medium • multiple choice

A company uses AWS Lambda to process records from an Amazon Kinesis Data Stream. Each record is about 50 KB. The Lambda function transforms the data and writes to Amazon DynamoDB. Recently, the Lambda function has been experiencing throttling and high error rates. The Kinesis stream has 10 shards. What is the most cost-effective solution to improve processing throughput?

Question 9 • hard • multiple choice

A data engineer is designing a data ingestion pipeline for a social media analytics platform. The pipeline must ingest tweets in real-time, perform sentiment analysis, and store results in Amazon S3. The sentiment analysis is compute-intensive and must be done as the data arrives. The estimated throughput is 10,000 tweets per second. Which architecture is most suitable?

Question 10 • easy • multiple choice

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. The transfer must be completed within one week. Which service should be used?
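
A quick back-of-the-envelope check helps with sizing questions like this one. The sketch below estimates how long the transfer would take over the network, assuming decimal units (1 TB = 10^12 bytes), a fully dedicated link, and no protocol overhead. The function name and the `efficiency` parameter are illustrative, not part of any AWS tool.

```python
def transfer_days(data_tb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate the days needed to move data_tb terabytes over a link_mbps link."""
    bits = data_tb * 1e12 * 8                         # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)   # usable bits per second
    return seconds / 86_400                           # seconds -> days


days = transfer_days(50, 100)
print(f"{days:.1f} days")  # about 46.3 days under the stated assumptions
```

Under these assumptions the network transfer alone would take roughly 46 days, well beyond a one-week window.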

Question 11 • medium • multiple choice

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format. The delivery stream is configured with a buffer size of 5 MB and a buffer interval of 60 seconds. However, the data engineer notices that S3 objects are being created with sizes much smaller than 5 MB. What is a likely cause?
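
A buffer that flushes on size or interval, whichever is reached first, produces objects whose size depends on the ingest rate. The rough model below ignores compression and every other delivery setting; the function name and the sample rates are hypothetical.

```python
def expected_object_mb(ingest_mb_per_s: float,
                       buffer_mb: float = 5.0,
                       buffer_s: float = 60.0) -> float:
    """Approximate object size when the buffer flushes on size or time, whichever comes first."""
    return min(buffer_mb, ingest_mb_per_s * buffer_s)


for rate in (0.01, 0.05, 0.5):
    size = expected_object_mb(rate)
    print(f"{rate} MB/s -> about {size:.2f} MB per object")
```

At low ingest rates the interval elapses before the size threshold is reached, so the objects come out smaller than the configured buffer size.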

Question 12 • medium • multiple choice

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that transforms each record and writes it to Amazon S3. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors when writing to S3. The team has already increased the Lambda function's memory and timeout. Which action should the team take to resolve the issue?

Question 13 • hard • multiple choice

An e-commerce company uses AWS Glue to run ETL jobs that transform clickstream data from Amazon S3. The job reads Parquet files, performs aggregations, and writes the results to Amazon Redshift. The job runs successfully but takes longer than expected. The data volume is increasing. Which design change would MOST improve the job's performance?

Question 14 • easy • multiple choice

A data engineer needs to ingest JSON data from an on-premises relational database into Amazon S3 every hour. Which AWS service should be used to set up a scheduled, incremental data transfer?

Question 15 • medium • multiple choice

A company is using AWS Glue to process data from Amazon S3. The Glue job reads CSV files and writes Parquet files to a different S3 bucket. The job occasionally fails with 'java.lang.OutOfMemoryError: Java heap space'. The data size varies. Which change should the engineer make to avoid this error?

Question 16 • hard • multiple choice

A data engineering team uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. They notice that the application's checkpointing is failing intermittently, causing data reprocessing. The application uses a large state. Which configuration change should the team make to improve checkpoint reliability?

Question 17 • easy • multi select

A data engineer is designing a serverless data ingestion pipeline that uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed using AWS Lambda before being written to S3. Which two steps are required to enable this transformation? (Select TWO.)

Question 18 • medium • multi select

A company uses AWS Glue to perform ETL on data stored in Amazon S3. The Glue job reads CSV files, converts them to Parquet, and partitions by date. The job runs daily and processes about 500 GB of data. The team wants to optimize costs and performance. Which three actions should the team take? (Select THREE.)

Question 19 • medium • multiple choice

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job reads JSON records and writes Parquet to Amazon S3. Recently, the job started failing with 'Out of Memory' errors. Which change is MOST likely to resolve the issue?

Question 20 • hard • multiple choice

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data arrives as JSON via AWS IoT Core, and must be stored in Amazon S3 in partitioned Parquet format. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which combination of services should the engineer use?

Question 21 • easy • multiple choice

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3. The data volume is about 500 GB initially, with daily incremental updates of 10 GB. The pipeline must minimize operational overhead. Which AWS service should be used for the initial and incremental loads?

Question 22 • hard • multiple choice

A company has a Glue ETL job that reads from an Amazon RDS for MySQL table and writes to Amazon S3. The job runs hourly and processes new records based on a 'last_modified' timestamp column. Recently, the job started missing some records because the timestamp in MySQL is stored with microsecond precision but Glue's job bookmark only tracks second precision. Which solution addresses this issue?

Question 23 • easy • multiple choice

A data engineer is ingesting CSV files from an Amazon S3 bucket into a Glue Data Catalog table. The files have headers, but some files have extra columns not present in the first file. The engineer wants the Glue crawler to automatically detect the schema. Which crawler configuration option should be used?

Question 24 • medium • multi select

A company is building a data lake on Amazon S3. Data arrives from multiple sources in JSON, CSV, and Avro formats. The data must be transformed to Parquet and partitioned by date and source. Which TWO services can perform this transformation with minimal custom code? (Choose TWO.)

Question 25 • hard • multi select

A data engineer is troubleshooting an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but 5% of records are missing after the load. The engineer suspects data consistency issues. Which THREE actions could help diagnose and resolve the problem? (Choose THREE.)

Question 26 • easy • multiple choice

A company uses AWS Glue to process CSV files from an S3 bucket. The job fails intermittently with a 'SchemaDetectionError' for files that have inconsistent column counts. What is the most efficient way to handle this?

Question 27 • medium • multiple choice

A data pipeline uses Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data volume spikes occasionally, causing the Firehose buffer to fill up and leading to increased delivery latency. The latency must remain under 60 seconds. What should be done to minimize latency?

Question 28 • hard • multiple choice

A company runs a nightly AWS Glue ETL job that reads from a JDBC source (PostgreSQL) and writes to S3 in Parquet format. The job takes over 6 hours, but the SLA requires completion within 4 hours. The source table has 500 million rows and is updated frequently. Which approach will most reliably reduce job duration?
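
One general technique for shortening a large single-table extract is to split the key space into ranges that can be read in parallel. Whether that is the intended answer here is not stated. The sketch below is plain Python and not a Glue API. It assumes a dense integer key, and the function name and bounds are hypothetical.

```python
def split_ranges(min_id: int, max_id: int, partitions: int) -> list:
    """Split [min_id, max_id] into contiguous, non-overlapping inclusive ranges."""
    total = max_id - min_id + 1
    base, extra = divmod(total, partitions)
    ranges, start = [], min_id
    for i in range(partitions):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        if size == 0:
            break
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each range could back one "WHERE id BETWEEN lo AND hi" read.
print(split_ranges(1, 10, 3))  # [(1, 4), (5, 7), (8, 10)]
```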

Question 29 • medium • multiple choice

A company uses AWS Data Pipeline to copy data from DynamoDB to S3 daily. Recently, the pipeline started failing with 'ThrottlingException' errors. The DynamoDB table has on-demand capacity. Which action should be taken to resolve the issue?

Question 30 • easy • multiple choice

A data engineer needs to transform JSON data from an S3 bucket using AWS Glue. The JSON contains nested arrays and objects. Which Glue transform is best suited for flattening nested structures?
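
To make "flattening" concrete, the sketch below turns nested objects and arrays into a single-level mapping with dotted keys. It is plain Python for illustration only, not a Glue transform, and a Glue transform may represent arrays differently. The function name and the sample record are hypothetical.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = obj  # leaf value: drop the trailing dot
    return flat


record = {"device": {"id": "a1", "tags": ["x", "y"]}, "temp": 21.5}
print(flatten(record))
```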

Question 31 • medium • multi select

A company ingests IoT sensor data into Kinesis Data Streams. The data is then processed by a Lambda function that aggregates readings and writes to DynamoDB. The Lambda function is experiencing high error rates due to throttling. Which TWO actions would reduce throttling?

Question 32 • hard • multi select

A company uses Amazon RDS for MySQL as a source for AWS DMS to replicate data to S3. The replication task is failing with 'OutOfMemory' errors on the DMS instance. The source table has 10 million rows with large BLOB columns. Which THREE changes would most likely resolve the issue?

Question 33 • easy • multiple choice

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that ingests JSON log data from web servers. The stream is configured to transform records with an AWS Lambda function and deliver to an Amazon S3 bucket. Recently, the stream has been failing with 'InvalidData' errors. Which action should the engineer take to resolve the issue?

Question 34 • hard • multiple choice

A data engineer is setting up an Amazon Kinesis Data Analytics application to process streaming data from a Kinesis data stream named "input-stream". The application uses a reference data source from an S3 bucket. The engineer has attached the IAM policy shown in the exhibit to the application's IAM role. When starting the application, the engineer receives an 'AccessDeniedException' error. Which additional permission is required?

Exhibit

Refer to the exhibit.

{
  "Effect": "Allow",
  "Action": [
    "kinesis:DescribeStream",
    "kinesis:GetShardIterator",
    "kinesis:GetRecords",
    "kinesis:ListShards"
  ],
  "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/input-stream"
}
Question 35 • medium • multiple choice

A company runs an e-commerce platform that generates clickstream data from millions of users. The data is ingested into Amazon Kinesis Data Streams with a shard count of 10. The data is then consumed by a Kinesis Data Analytics application that runs SQL queries to aggregate metrics in real time. Recently, the application has been falling behind, and the stream's iterator age metric is increasing. The data volume has doubled over the past month. The application currently uses a single Kinesis Data Analytics application with parallelism of 1. Which action should the data engineer take to improve the processing rate and reduce the iterator age without losing data or causing duplicates?

Question 36 • hard • multiple choice

A financial services company ingests real-time stock trade data from multiple exchanges into Amazon Kinesis Data Streams. Each trade record is a JSON object containing fields: trade_id, symbol, price, quantity, and timestamp. The data is consumed by an AWS Lambda function that performs data validation and enrichment, then writes the processed records to an Amazon DynamoDB table for low-latency querying. Recently, the Lambda function has been timing out and failing to process all records. The Lambda function is configured with a 5-second timeout and 128 MB memory. The average record size is 2 KB, and the stream receives about 1000 records per second. The Lambda function's concurrency limit is 1000. Which set of actions should the data engineer take to resolve the issue without losing data?

Question 37 • medium • multiple choice

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The data is JSON formatted and includes a timestamp field. The company wants to partition the output in Amazon S3 by date and hour, and ensure exactly-once processing semantics. Which combination of configurations should be used?

Question 38 • hard • multiple choice

A data engineer is troubleshooting an AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but writes duplicate rows into Redshift. The source data is static and does not contain duplicates. Which configuration change is most likely to resolve this issue?

Question 39 • easy • multiple choice

A company is designing a data ingestion pipeline to load CSV files from an SFTP server into Amazon S3. The files are generated hourly and range from 10 MB to 500 MB. Which AWS service should be used to orchestrate the transfer with minimal operational overhead?

Question 40 • hard • multiple choice

A company uses Amazon Kinesis Data Firehose to ingest log data from web servers into Amazon S3. The data is in JSON format and each record is approximately 2 KB. The delivery stream is configured to buffer incoming records for 60 seconds or 5 MB, whichever comes first. The company notices that the data in S3 is delayed by up to 5 minutes during peak hours. Which action would most effectively reduce the delivery latency?

Question 41 • medium • multi select

A company is ingesting real-time clickstream data into Amazon S3 using Amazon Kinesis Data Firehose. The data is semi-structured and the company wants to transform the data into Parquet format and partition it by year, month, day, and hour. Which TWO steps should be taken to achieve this? (Choose TWO.)
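
The year/month/day/hour layout described here can be pictured as a key prefix derived from each record's timestamp. The sketch below builds a Hive-style prefix in plain Python. The `clickstream` base path, the `key=value` naming, and the use of UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone


def partition_prefix(epoch_seconds: float, base: str = "clickstream") -> str:
    """Build a year/month/day/hour prefix (UTC) for a record timestamp."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


print(partition_prefix(0))       # clickstream/year=1970/month=01/day=01/hour=00/
print(partition_prefix(90_000))  # 25 hours later: day=02, hour=01
```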

Question 42 • medium • drag order

Arrange the steps to create an AWS Glue job that transforms data from Amazon S3 to Amazon Redshift in the correct order.

Question 43 • medium • drag order

Arrange the steps to implement a data lake on Amazon S3 with AWS Lake Formation.

Question 44 • medium • drag order

Order the steps to query data with Amazon Redshift Spectrum through an external table defined in the Athena data catalog.

Question 45 • medium • drag order

Order the steps to troubleshoot a failed AWS Glue job that reads from JDBC and writes to S3.

Question 46 • medium • matching

Match each AWS service to its primary purpose in data engineering.

Matches

Serverless ETL and data catalog

Data warehousing and SQL analytics

Big data processing using Hadoop/Spark

Building and managing data lakes

Real-time streaming data ingestion

Question 47 • medium • matching

Match each AWS data migration tool to its primary function.

Matches

Migrate databases with minimal downtime

Physical device for large data transfer

Online data transfer between on-prem and AWS

Fast uploads over long distances

Combine data across sources into views

Question 48 • medium • matching

Match each AWS database service to its primary use case.

Matches

Relational database with managed operations

NoSQL key-value and document database

In-memory caching for low latency

Graph database for connected data

Time-series data for IoT and analytics

Question 49 • medium • matching

Match each AWS security service to its purpose in data protection.

Matches

Managed encryption keys

User and role access control

Audit API activity

Discover and protect sensitive data

Web application firewall

Question 50 • medium • multiple choice

A company is streaming clickstream data from a website into Amazon Kinesis Data Streams. The data is then consumed by a Lambda function that transforms the records and writes them to an S3 bucket in Parquet format. Recently, the Lambda function has been timing out and the S3 bucket is not receiving all expected records. The Kinesis stream has a shard count of 10 and the Lambda function's reserved concurrency is set to the default. Which change would MOST likely resolve the issue?

Question 51 • easy • multiple choice

A data engineer is using AWS Glue to perform ETL on data stored in an S3 bucket. The source data is in CSV format with a header row, and the target is a set of Parquet files partitioned by date. The engineer notices that the Glue job is reading all files in the source prefix, including temporary files that should be ignored. What is the MOST efficient way to exclude these temporary files?

Question 52 • hard • multiple choice

A company uses Amazon Kinesis Data Firehose to ingest JSON logs from multiple sources into an S3 data lake. The data is then consumed by Amazon Athena for analysis. Recently, some queries have been failing with the error 'HIVE_BAD_DATA: Field xyz's type is an unsupported type'. The Firehose delivery stream transforms the data using a Lambda function that converts timestamps to Unix epoch. What is the MOST likely cause of the query failure?

Question 53 • medium • multiple choice

A data engineer is designing a data ingestion pipeline to load data from an on-premises Oracle database to Amazon S3. The pipeline should capture changes in near real-time (within minutes) and minimize impact on the source database. The source table has a 'last_modified' timestamp column. Which service combination would meet these requirements?

Question 54 • hard • multiple choice

A company uses AWS Glue to process data from multiple S3 buckets. The Glue job runs daily and reads data from a bucket that contains millions of small files (each < 1 MB). The job has been running for hours and is often close to the 8-hour timeout limit. Which optimization would MOST reduce the job's runtime?

Question 55 • easy • multiple choice

A data engineer is setting up an Amazon Kinesis Data Firehose delivery stream to load data into Amazon Redshift. The data is coming from an application that produces JSON records. The engineer needs to transform the data to match the Redshift table schema. Which approach is the MOST cost-effective and requires the least operational overhead?

Question 56 • medium • multiple choice

A company uses Amazon S3 to store raw data and AWS Glue to run ETL jobs that transform the data into analytics-ready tables. The Glue job reads from a source with a schema that changes frequently (new columns added). The engineer wants the Glue job to automatically adapt to schema changes without manual intervention. Which configuration should the engineer use?

Question 57 • hard • multiple choice

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a fleet of EC2 instances running a custom application that processes the records and writes to DynamoDB. The application is experiencing high latency and records are being processed slower than they are produced. The stream has 5 shards. Which action would MOST effectively improve processing speed?

Question 58 • medium • multiple choice

A company is building a data lake on Amazon S3 and wants to ingest data from multiple AWS services (CloudTrail, VPC Flow Logs, and ALB logs). The data should be stored in a central S3 bucket with a common partitioning scheme. Which service can be used to collect and centralize this data with minimal configuration?

Question 59 • hard • multi select

A company is running a critical application that generates millions of small JSON files every hour in an S3 bucket. A data engineer needs to process these files in near real-time using AWS Glue. The engineer wants to minimize the latency between file arrival and Glue job start. Which TWO actions should the engineer take?

Question 60 • easy • multi select

A data engineer needs to ingest data from an Amazon RDS MySQL database into a data lake on Amazon S3. The engineer wants to perform an initial full load and then capture incremental changes. Which TWO AWS services can be combined to achieve this?

Question 61 • medium • multi select

A company is using Amazon Kinesis Data Streams to process real-time stock trade data. The data is consumed by a Lambda function that calculates moving averages and stores results in Amazon DynamoDB. The Lambda function is failing with 'ProvisionedThroughputExceededException' on the DynamoDB table. The table has on-demand capacity. Which TWO actions should the engineer take to resolve this issue?

Question 62 • medium • multi select

A data engineer is designing a data ingestion pipeline for IoT sensor data. The sensors send JSON messages every second, and the data must be stored in Amazon S3 in near real-time (within 5 minutes). The engineer also needs to transform the data by adding a timestamp and filtering out malformed records. Which THREE services should be used together?

Question 63 • hard • multiple choice

The Glue job attempts to read data from 's3://my-data-bucket/input/' and write to 's3://my-data-bucket/output/'. It also tries to update a table in the Glue Data Catalog. The job fails with an access denied error. What is the MOST likely cause?

Exhibit

Refer to the exhibit. You have the following IAM policy attached to an IAM role used by an AWS Glue job:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::my-data-bucket/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:UpdateTable"
            ],
            "Resource": "*"
        }
    ]
}
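
When reading a policy like this, it helps to check each action and resource pair against the statements one at a time. The sketch below is a simplified matcher, not the real IAM evaluation logic: it handles only `Allow` statements and `*` wildcards, and ignores conditions, explicit denies, and case-insensitive action names. The probed actions are examples chosen for illustration, not a statement of what the job actually calls.

```python
from fnmatch import fnmatchcase

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:PutObject"],
         "Resource": "arn:aws:s3:::my-data-bucket/*"},
        {"Effect": "Allow",
         "Action": ["glue:GetTable", "glue:UpdateTable"],
         "Resource": "*"},
    ],
}


def as_list(value):
    """Policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Return True if any Allow statement matches both action and resource."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        action_ok = any(fnmatchcase(action, a) for a in as_list(stmt["Action"]))
        resource_ok = any(fnmatchcase(resource, r) for r in as_list(stmt["Resource"]))
        if action_ok and resource_ok:
            return True
    return False


# Object-level ARN matches "my-data-bucket/*"; the bare bucket ARN does not.
print(is_allowed(POLICY, "s3:GetObject", "arn:aws:s3:::my-data-bucket/input/a.csv"))
print(is_allowed(POLICY, "s3:ListBucket", "arn:aws:s3:::my-data-bucket"))
print(is_allowed(POLICY, "glue:GetDatabase", "*"))
```
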
Question 64 • easy • multiple choice

The command returns an empty result, but you know there are objects in the 'logs/' prefix larger than 1000 bytes. What is the MOST likely reason?

Exhibit

aws s3api list-objects-v2 --bucket my-bucket --prefix logs/ --query "Contents[?Size > '1000'].Key" --output text
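
In a JMESPath expression, single quotes produce a raw string literal while backticks produce a JSON literal such as a number, so the two forms compare `Size` against different types. As a cross-check that does not depend on the query syntax, the same filter can be applied client-side to the response. The sketch below runs on a hand-written dict shaped like a `list-objects-v2` response; the keys and sizes are made up.

```python
def keys_larger_than(response: dict, min_bytes: int) -> list:
    """Return the keys of objects whose integer Size exceeds min_bytes."""
    return [obj["Key"]
            for obj in response.get("Contents", [])
            if obj["Size"] > min_bytes]


# Hypothetical response data for illustration only.
sample_response = {
    "Contents": [
        {"Key": "logs/a.log", "Size": 512},
        {"Key": "logs/b.log", "Size": 2048},
        {"Key": "logs/c.log", "Size": 1000},
    ]
}

print(keys_larger_than(sample_response, 1000))  # ['logs/b.log']
```
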
Question 65 • medium • multiple choice

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that transforms and loads it into Amazon S3. Recently, the team noticed that the Lambda function is failing with throttling errors (HTTP 429) from the Kinesis API. Which configuration change should the team make to resolve this issue?

Question 66 • hard • multiple choice

A company uses AWS Glue ETL jobs to transform data in Amazon S3. The data is partitioned by date and hour. The job reads the latest hour's data, performs aggregations, and writes results to a separate S3 bucket. The job runs every hour and processes approximately 500 MB of input data. The team notices that the job takes longer than expected, often exceeding the 1-hour window. Which action would most effectively reduce the job's runtime?

Question 67 • easy • multiple choice

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 on a nightly basis. The data volume is approximately 10 GB per night. The database is accessible over the internet. Which AWS service is MOST appropriate for this task?

Question 68 • medium • multiple choice

A company is using Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to an Amazon S3 bucket. The team notices that the application is experiencing high latency during peak hours. The stream has 8 shards, and the application is configured with a parallelism of 4. Which action would most likely reduce the latency?

Question 69 • hard • multiple choice

A team is designing a data ingestion pipeline to load JSON files from an Amazon S3 bucket into Amazon Redshift. The files arrive every 5 minutes, and each file is between 10 MB and 50 MB. The team wants to minimize the time between file arrival and data availability in Redshift. Which approach should the team use?

Question 70 • easy • multiple choice

A company uses AWS Glue ETL jobs to transform data stored in Amazon S3. The job reads data in Parquet format, applies transformations, and writes the output back to S3 in Parquet format. The team wants to improve the job's performance and reduce costs. Which action is MOST effective?

Question 71 • medium • multiple choice

A data engineer is ingesting data from an Amazon RDS for PostgreSQL database into Amazon S3 using AWS Glue. The Glue job reads the entire table each time it runs, which takes several hours. The team wants to reduce the job duration by reading only new or updated records. Which approach should the engineer adopt?

Question 72 • hard • multiple choice

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format and each record is about 2 KB. The delivery stream is configured to buffer data for 60 seconds or 5 MB, whichever comes first. The team notices that the S3 objects are very small (around 1 MB) and numerous, causing high costs due to S3 PUT requests. Which configuration change should the team make to reduce the number of S3 objects?

Question 73 • easy • multiple choice

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a slow internet connection (100 Mbps). The data must be transferred within 2 weeks. Which service should the engineer recommend?
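
A quick back-of-the-envelope check helps with scenarios like this one. The sketch below estimates how long the transfer would take over the network alone. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s) and a link that is fully available to the transfer; the function name `transfer_days` is illustrative and not part of any AWS tool.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12    # bytes, decimal units (assumption)
MBPS = 10**6   # bits per second


def transfer_days(data_bytes: float, link_bps: float, utilization: float = 1.0) -> float:
    """Estimate the days needed to move data_bytes over a network link.

    utilization is the fraction of the link the transfer can actually use.
    """
    seconds = (data_bytes * 8) / (link_bps * utilization)
    return seconds / SECONDS_PER_DAY


# Scenario from the question: 50 TB over a 100 Mbps connection.
days = transfer_days(50 * TB, 100 * MBPS)
print(f"{days:.1f} days at full utilization")  # about 46.3 days
```

Even with the whole link dedicated to the transfer, the estimate is far beyond the two-week window, so the bandwidth arithmetic is usually the first thing to check in questions of this kind.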

Question 74 • medium • multi select

A company is using Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by an AWS Lambda function that enriches records and writes to Amazon S3. The Lambda function is experiencing high error rates due to records exceeding the 256 KB payload limit. Which TWO actions should the team take to resolve this issue?

Question 75 • hard • multi select

A data engineering team is building a data lake on Amazon S3. They need to ingest data from multiple sources: (1) streaming IoT data, (2) daily CSV exports from an on-premises system via SFTP, and (3) change data capture (CDC) from an Amazon Aurora database. Which THREE services should the team use to ingest these data sources?

Question 76 • medium • multi select

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are failing intermittently with 'Out of Memory' errors. The team wants to resolve this issue without increasing costs significantly. Which TWO actions should the team take?

Question 77 • easy • multiple choice

A data engineer is ingesting streaming data from an IoT fleet into Amazon S3 using Amazon Kinesis Data Firehose. The data arrives as JSON, but the downstream analytics require Parquet format. Which Firehose transformation should the engineer configure?

Question 78 • medium • multiple choice

A company uses AWS Glue ETL to process data from Amazon S3 and write results to Amazon Redshift. The job fails with a memory error when processing large files. Which action should the data engineer take to resolve this issue?

Question 79 • hard • multiple choice

A data engineer is designing a real-time analytics pipeline for clickstream data. The source is Amazon Kinesis Data Streams, and the data must be stored in Amazon S3 in partitioned Parquet format with near-real-time latency. The engineer must also handle late-arriving data (up to 1 hour). Which combination of services meets these requirements?

Question 80 • easy • multiple choice

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an on-premises Oracle database to Amazon S3. The data is stored as CSV files. The downstream team requires the data to be in Apache Parquet format. Which change should the data engineer make to the DMS task?

Question 81 • medium • multiple choice

A data engineer is building a data ingestion pipeline that reads JSON files from Amazon S3 and loads them into an Amazon Redshift table using COPY commands. The files are gzip compressed and contain nested JSON. The engineer wants to minimize transformation steps. Which approach should the engineer use?

Question 82 • hard • multiple choice

A company is migrating its on-premises data warehouse to Amazon Redshift. The daily batch load from the source database takes 6 hours using a single-node Redshift cluster. The engineer needs to reduce load time to under 2 hours without increasing cost significantly. Which strategy should the engineer adopt?

Question 83 • easy • multiple choice

A data engineer needs to ingest data from an external HTTP API into Amazon S3. The API returns JSON data for a list of users, updated hourly. The engineer wants to use a serverless solution with minimal operational overhead. Which AWS service should the engineer use?

Question 84 • medium • multiple choice

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that is failing to deliver data to an Amazon S3 bucket. The stream is configured with a Lambda transformation function. The CloudWatch logs show that the Lambda function is timing out. Which action should the engineer take to resolve the issue?

Question 85 • hard • multiple choice

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS for MySQL and load it into Amazon S3. The job runs daily and processes incremental changes using the JDBC connection. Recently, the job has been failing with a 'Communications link failure' error. The RDS instance is in a private subnet. Which step should the engineer take first to diagnose the issue?

Question 86 • easy • multi select

A data engineer is designing a data ingestion pipeline for streaming social media data. The data must be ingested with low latency (seconds) and stored in Amazon S3 for long-term analytics. The engineer also needs to perform real-time aggregations. Which TWO services should the engineer use? (Choose two.)

Question 87 • medium • multi select

A company is building a data lake on Amazon S3. The data sources include relational databases, streaming data, and log files. The data engineer needs to ensure that the data ingestion pipeline can handle schema evolution, support both batch and streaming, and provide a unified metadata catalog. Which THREE services should the engineer use? (Choose three.)

Question 88 • hard • multi select

A data engineer is troubleshooting a slow-running AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job processes 500 GB of CSV data daily. The engineer wants to improve performance. Which THREE actions should the engineer take? (Choose three.)

Question 89 • medium • multiple choice

Refer to the exhibit. A data engineer is configuring an IAM policy for an AWS Glue ETL job that reads data from the 'my-data-bucket' S3 bucket, transforms it, and writes the output back to the same bucket. The engineer wants to prevent accidental deletion of objects. Based on the policy, which statement is true about the Glue job's permissions?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-bucket/*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-data-bucket/*"
    }
  ]
}
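
The way Allow and Deny statements combine can be sketched in a few lines. The following is a deliberately simplified model of IAM policy evaluation, not the real evaluation engine: it ignores conditions, principals, resource-based policies, and case-insensitive action matching, and the function name `evaluate` and the sample object keys are hypothetical. It only illustrates that an explicit Deny wins over any Allow, and that anything not explicitly allowed is implicitly denied.

```python
from fnmatch import fnmatchcase

# The policy from the exhibit, reproduced as a Python dict.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-data-bucket/*",
        },
        {
            "Effect": "Deny",
            "Action": ["s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-data-bucket/*",
        },
    ],
}


def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def evaluate(policy: dict, action: str, resource: str) -> str:
    """Simplified evaluation: explicit Deny > Allow > implicit deny."""
    decision = "ImplicitDeny"
    for statement in policy["Statement"]:
        action_match = any(fnmatchcase(action, a) for a in _as_list(statement["Action"]))
        resource_match = any(fnmatchcase(resource, r) for r in _as_list(statement["Resource"]))
        if action_match and resource_match:
            if statement["Effect"] == "Deny":
                return "ExplicitDeny"  # a matching Deny always wins
            decision = "Allow"
    return decision


obj = "arn:aws:s3:::my-data-bucket/raw/file.csv"  # hypothetical object key
print(evaluate(POLICY, "s3:GetObject", obj))     # Allow
print(evaluate(POLICY, "s3:PutObject", obj))     # Allow
print(evaluate(POLICY, "s3:DeleteObject", obj))  # ExplicitDeny
print(evaluate(POLICY, "s3:ListBucket", "arn:aws:s3:::my-data-bucket"))  # ImplicitDeny
```

Note that `s3:ListBucket` is evaluated against the bucket ARN rather than an object ARN, and this policy grants nothing at the bucket level.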
Question 90 • hard • multiple choice

Refer to the exhibit. A data engineer has configured an S3 event notification to send an event to an SQS queue when objects are created in the 'incoming/' prefix. The engineer wants to trigger an AWS Lambda function to process the object. However, the Lambda function is not being invoked. What is the most likely cause?

Exhibit

aws s3api get-bucket-notification-configuration --bucket my-data-bucket
{
  "QueueConfigurations": [
    {
      "Id": "sqs-notify",
      "QueueArn": "arn:aws:sqs:us-east-1:123456789012:MyQueue",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            {"Name": "prefix", "Value": "incoming/"}
          ]
        }
      }
    }
  ]
}
Question 91 • medium • multiple choice

Refer to the exhibit. An AWS Glue ETL job is failing with an OutOfMemoryError. The job reads from Amazon S3 and performs a GROUP BY on a large dataset. Which change should the data engineer make to resolve this error?

Exhibit

Error Log:
[ERROR] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 6, ip-10-0-0-12.ec2.internal, executor 1): java.lang.OutOfMemoryError: Java heap space
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.<init>(UnsafeRow.java:42)
Question 92 • easy • multiple choice

A data engineer needs to ingest streaming data from thousands of IoT devices into AWS for real-time processing. The data volume peaks at 5 GB/min. Which AWS service should be used as the ingestion endpoint?

Question 93 • medium • multiple choice

A company uses AWS DMS to migrate an on-premises PostgreSQL database to Amazon RDS for PostgreSQL. After initial load, ongoing replication is set up. The replication task shows 'Task status: failed with error: The specified LSN is not available in the source database logs.' What is the most likely cause?

Question 94 • hard • multiple choice

A data engineer is designing a data ingestion pipeline for clickstream data that arrives in bursts, up to 100 MB/s, and must be processed with exactly-once semantics. The data must be stored in Amazon S3 partitioned by event date and hour. Which combination of services should the engineer use?

Question 95 • easy • multiple choice

A company needs to ingest CSV files from an FTP server into Amazon S3 daily. The files are typically 50 MB each, and the process should be fully managed with minimal operational overhead. Which AWS service should be used?

Question 96 • medium • multiple choice

A data engineer is using AWS Glue ETL to transform a large dataset in S3. The job processes 2 TB of data daily and currently runs for 6 hours. The engineer wants to reduce runtime without changing the transformation logic. What is the best approach?

Question 97 • hard • multiple choice

A company uses Amazon Kinesis Data Analytics (now Managed Service for Apache Flink) to run a Flink application on streaming data. The application fails with 'OutOfMemoryError: Java heap space'. The data volume is 10 MB/s. What is the most likely cause and solution?

Question 98 • easy • multiple choice

A data engineer needs to transform JSON data from Amazon S3 into Parquet format using AWS Glue. The source files are in a bucket with thousands of small files. What is the best practice to optimize the Glue job performance?

Question 99 • medium • multiple choice

A company uses AWS DMS to migrate data from Oracle to Aurora MySQL. During the ongoing replication, the target table shows duplicate primary key errors. What is the most likely cause?

Question 100 • hard • multiple choice

A data engineer is building a real-time data pipeline using Amazon Kinesis Data Streams with a Lambda consumer. The data volume is 2 MB/s with average record size of 5 KB. The Lambda function processes records and writes to DynamoDB. Occasionally, the Lambda function fails with 'ProvisionedThroughputExceededException' on DynamoDB. What is the best way to handle this?

Question 101 • easy • multi select

A data engineer needs to ingest data from a SaaS application (Salesforce) into Amazon S3 on a daily basis. Which TWO AWS services can be used for this purpose? (Choose TWO.)

Question 102 • medium • multi select

A company is using AWS Glue ETL to process data from Amazon RDS for MySQL to Amazon S3. The job runs daily and takes 2 hours to complete. The engineer wants to improve performance without increasing cost significantly. Which TWO actions should the engineer take? (Choose TWO.)

Question 103 • hard • multi select

A data engineer is designing a streaming ingestion pipeline using Amazon Kinesis Data Streams with multiple consumers. The data must be processed by a Lambda function for real-time alerts and also stored in Amazon S3 for historical analysis. Which THREE components are needed to implement this architecture? (Choose THREE.)

Question 104 • easy • multiple choice

A data engineer has an IAM policy attached to an IAM role used by an AWS Glue job. The Glue job needs to read from S3 bucket 'data-bucket' and write to the same bucket. The job fails with an access denied error when trying to write to S3. What is the issue?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::data-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "arn:aws:glue:us-east-1:123456789012:job/etl-job"
    }
  ]
}
```
Question 105 • medium • multiple choice

A data engineer is troubleshooting a Lambda function that reads from the Kinesis stream 'my-data-stream'. The Lambda function is able to read data but occasionally fails with 'KMS.AccessDeniedException'. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
$ aws kinesis describe-stream --stream-name my-data-stream
{
  "StreamDescription": {
    "StreamName": "my-data-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-data-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "340282366920938463463374607431768211455"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49640281912345678901234567890123456789012345678901234567890",
          "EndingSequenceNumber": null
        }
      }
    ],
    "EnhancedMonitoring": [],
    "EncryptionType": "KMS",
    "KeyId": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    "RetentionPeriodHours": 24,
    "StreamCreationTimestamp": "2024-01-01T00:00:00Z"
  }
}
```
Question 106 • hard • multiple choice

A CloudFormation template defines an AWS Glue job. The job fails during execution with the error 'Unable to locate script: s3://scripts-bucket/etl-script.py'. The S3 bucket 'scripts-bucket' exists and the script file is present. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
Resources:
  MyGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Name: etl-job
      Role: !GetAtt GlueServiceRole.Arn
      Command:
        Name: glueetl
        ScriptLocation: s3://scripts-bucket/etl-script.py
        PythonVersion: "3"
      DefaultArguments:
        "--extra-py-files": "s3://libs-bucket/dependencies.zip"
        "--enable-metrics": "true"
      MaxRetries: 0
      WorkerType: Standard
      NumberOfWorkers: 10
      Timeout: 60
```
Question 107 • easy • multiple choice

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time and then stored in Amazon S3. Which AWS service should be used to perform the transformation?

Question 108 • medium • multiple choice

A data engineering team needs to load data from an on-premises Oracle database to Amazon S3 daily. The data volume is about 50 GB per day, and the network bandwidth is 100 Mbps. The team wants to minimize operational overhead and use AWS managed services. Which solution should they choose?

Question 109 • hard • multiple choice

A company uses Amazon Kinesis Data Streams with a shard count of 10 to ingest clickstream data. The data is consumed by a Lambda function that transforms the records and writes to Amazon S3. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. The average record size is 5 KB, and the incoming data rate is 15 MB/s. What is the most likely cause and solution?
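
The per-shard limits make this scenario easy to check numerically. Kinesis Data Streams accepts up to 1 MB/s or 1,000 records/s of writes per shard and serves 2 MB/s of reads per shard to standard (shared) consumers. The sketch below compares those limits with the stated workload; it assumes 1 MB = 1,000 KB for simplicity, and the helper name `shards_needed_for_writes` is illustrative.

```python
import math

WRITE_MB_PER_S_PER_SHARD = 1.0
WRITE_RECORDS_PER_S_PER_SHARD = 1000
READ_MB_PER_S_PER_SHARD = 2.0  # shared across standard consumers


def shards_needed_for_writes(incoming_mb_per_s: float, avg_record_kb: float) -> int:
    """Smallest shard count that satisfies both per-shard write limits."""
    records_per_s = incoming_mb_per_s * 1000 / avg_record_kb
    by_bytes = math.ceil(incoming_mb_per_s / WRITE_MB_PER_S_PER_SHARD)
    by_records = math.ceil(records_per_s / WRITE_RECORDS_PER_S_PER_SHARD)
    return max(by_bytes, by_records)


shards = 10
write_capacity = shards * WRITE_MB_PER_S_PER_SHARD  # aggregate write limit, MB/s
read_capacity = shards * READ_MB_PER_S_PER_SHARD    # aggregate shared read limit, MB/s
needed = shards_needed_for_writes(incoming_mb_per_s=15, avg_record_kb=5)

print(write_capacity, read_capacity, needed)  # 10.0 20.0 15
```

With 5 KB records, the byte limit rather than the record-count limit is the binding constraint here, and the 15 MB/s workload exceeds what 10 shards can accept.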

Question 110 • easy • multiple choice

A data engineer needs to ingest data from an external partner's FTP server to Amazon S3. The data arrives once daily as a CSV file. Which AWS service should be used for this ingestion?

Question 111 • medium • multiple choice

A company uses AWS Glue to transform data in S3. The transformation job reads Parquet files, filters rows, and writes to another S3 bucket. The job takes longer than expected. Which change would MOST likely reduce the job execution time?

Question 112 • hard • multiple choice

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to an S3 bucket. The delivery stream is configured with a buffer interval of 60 seconds and a buffer size of 5 MB. The data arrives at an average rate of 2 MB per second. What is the expected time interval between S3 writes?
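
Firehose flushes its buffer when either the size threshold or the interval is reached, whichever comes first. The sketch below works through that rule for a steady arrival rate. It is an idealized model: the buffering settings are hints and real delivery timing can vary, and the function name `seconds_until_flush` is hypothetical.

```python
def seconds_until_flush(rate_mb_per_s: float, buffer_size_mb: float, buffer_interval_s: float):
    """Return (seconds, trigger) for whichever buffering threshold is hit first."""
    seconds_to_fill = buffer_size_mb / rate_mb_per_s
    if seconds_to_fill <= buffer_interval_s:
        return seconds_to_fill, "size"
    return float(buffer_interval_s), "interval"


# Scenario from the question: 2 MB/s into a 5 MB / 60 s buffer.
print(seconds_until_flush(2, 5, 60))     # (2.5, 'size')

# A much slower stream (illustrative value) hits the interval first.
print(seconds_until_flush(0.01, 5, 60))  # (60.0, 'interval')
```

At high arrival rates the size threshold dominates, so raising only the interval would have no effect on how often objects are written.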

Question 113 • easy • multiple choice

A company wants to ingest real-time data from a social media API into Amazon S3 for analysis. The API provides data as JSON records. Which AWS service is best suited for this ingestion?

Question 114 • medium • multiple choice

A data engineer needs to transform JSON data into Parquet format using AWS Glue. The input data has nested fields. Which Glue feature should be used to flatten the nested structure?

Question 115 • hard • multiple choice

A company uses Amazon Kinesis Data Streams with enhanced fan-out consumers. The stream has 5 shards. Each consumer reads from all shards. The total incoming data rate is 25 MB/s. What is the maximum read throughput per consumer if enhanced fan-out is enabled?
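
The difference between shared and enhanced fan-out reads comes down to who gets the 2 MB/s per shard. A minimal sketch, assuming standard consumers split the shared throughput evenly (in practice the split depends on polling behavior) and using a hypothetical consumer count of 3:

```python
READ_MB_PER_S_PER_SHARD = 2.0


def read_mb_per_s_per_consumer(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput available to each consumer."""
    stream_total = shards * READ_MB_PER_S_PER_SHARD
    if enhanced_fan_out:
        # Each registered consumer gets its own 2 MB/s per shard.
        return stream_total
    # Standard consumers share the same 2 MB/s per shard.
    return stream_total / consumers


print(read_mb_per_s_per_consumer(5, 3, enhanced_fan_out=True))             # 10.0
print(round(read_mb_per_s_per_consumer(5, 3, enhanced_fan_out=False), 2))  # 3.33
```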

Question 116 • medium • multi select

A company needs to ingest streaming data from thousands of IoT devices. The data must be processed in real-time and stored in Amazon S3. Which TWO services should be used together?

Question 117 • hard • multi select

A data engineer needs to transform data in Amazon S3 using AWS Glue. The job must handle schema evolution and partition pruning. Which THREE features should be used?

Question 118 • easy • multi select

Which TWO AWS services can be used to ingest data from an on-premises relational database into Amazon S3 on a one-time basis?

Question 119 • easy • multiple choice

A data engineer is ingesting streaming data from thousands of IoT devices into AWS. The data is JSON-formatted and must be stored in Amazon S3 for long-term analytics. Which service is most appropriate for real-time ingestion and routing to S3?

Question 120 • medium • multiple choice

A company runs a SQL Server transactional database on Amazon RDS. They need to capture change data (inserts, updates, deletes) in near real-time and replicate them to an Amazon S3 data lake. Which AWS service is most suitable?

Question 121 • hard • multiple choice

A data engineer is designing a data pipeline that ingests CSV files from an FTP server to Amazon S3. The files arrive hourly and each file is about 500 MB. The engineer wants to minimize operational overhead and cost. Which approach is best?

Question 122 • easy • multiple choice

A company needs to transform JSON data from Amazon Kinesis Data Streams into Parquet format and store it in Amazon S3. The transformation includes simple field mappings and type conversions. Which approach is most cost-effective and serverless?

Question 123 • medium • multiple choice

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3. The data volume is about 10 TB per day. The engineer wants to set up a managed Kafka connector. Which AWS service should they use?

Question 124 • hard • multiple choice

A company is ingesting CSV files into Amazon S3. Each file contains a header row. The pipeline uses AWS Glue to crawl the S3 bucket and create a table in the AWS Glue Data Catalog. However, the crawler is including the header as data. What is the most likely cause?

Question 125 • easy • multiple choice

A data engineer needs to ingest data from an Amazon S3 bucket into an Amazon Redshift table on a daily schedule. The data is in CSV format and the schema matches. Which service is simplest for this batch ingestion?

Question 126 • medium • multiple choice

A company is ingesting streaming data from a fleet of weather sensors. Each sensor sends a JSON payload every second. The data is used for real-time dashboarding and also archived to S3. The pipeline should handle sudden bursts of data without data loss. Which architecture meets these requirements?

Question 127 • hard • multiple choice

A data engineer is ingesting XML data from an external API into Amazon S3. The engineer needs to transform the XML to JSON using AWS Glue. The XML structure is deeply nested. Which Apache Spark method should be used in the Glue ETL script?

Question 128 • easy • multi select

A company is ingesting large volumes of sensor data into Amazon S3. The data must be encrypted at rest using an AWS KMS customer managed key. Which TWO actions are required to enable server-side encryption with AWS KMS (SSE-KMS) on the S3 bucket?

Question 129 • medium • multi select

A data engineer is designing a data ingestion pipeline for real-time clickstream data. The data must be available for both real-time analytics and batch processing. The engineer wants to use Amazon Kinesis Data Streams. Which THREE components should be included in the architecture?

Question 130 • hard • multi select

A company is ingesting Apache logs from multiple web servers into AWS. The logs are sent via Amazon CloudWatch Logs to a subscription filter that delivers to a Lambda function. The Lambda function parses the logs and writes to Amazon S3. However, there is a significant backlog. Which THREE actions can reduce the backlog?

Question 131 • medium • multiple choice

Refer to the exhibit, which shows an IAM policy for an AWS Lambda function. The Lambda function is triggered by an S3 event (object created) and needs to read from a Kinesis stream. However, the function fails with an access denied error when trying to read from Kinesis. What is the most likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::data-lake-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream"
    }
  ]
}
Question 132 • hard • multiple choice

Refer to the exhibit. A CloudFormation stack outputs the Glue job name and S3 bucket names. The Glue job transforms CSV files from the raw bucket to Parquet in the processed bucket. However, the Glue job is failing with an error that it cannot write to the processed bucket. What is the most likely cause?

Exhibit

aws cloudformation describe-stacks --stack-name DataPipeline --query 'Stacks[0].Outputs' --output json
[
  {
    "OutputKey": "GlueJobName",
    "OutputValue": "transform-csv-to-parquet"
  },
  {
    "OutputKey": "S3BucketRaw",
    "OutputValue": "raw-data-bucket"
  },
  {
    "OutputKey": "S3BucketProcessed",
    "OutputValue": "processed-data-bucket"
  }
]
Question 133 • easy • multiple choice

Refer to the exhibit. A Lambda function named 'IngestionProcessor' is failing. The engineer checks CloudWatch Logs and sees the log group exists but storedBytes is 0. Why might the logs show no data?

Exhibit

aws logs describe-log-groups --log-group-name-prefix /aws/lambda/IngestionProcessor --output json
{
  "logGroups": [
    {
      "logGroupName": "/aws/lambda/IngestionProcessor",
      "creationTime": 1630000000000,
      "metricFilterCount": 0,
      "arn": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/IngestionProcessor:*",
      "storedBytes": 0
    }
  ]
}
Question 134 • easy • multiple choice

A company needs to ingest real-time clickstream data from a web application into Amazon S3 for analytics. The data must be available within minutes of generation. Which AWS service should be used to capture and deliver this streaming data?

Question 135 • easy • multiple choice

A data engineer is tasked with transforming JSON data from an S3 bucket into Parquet format for efficient querying. The transformation should run on a schedule every hour. Which AWS service is best suited for this task?

Question 136 • medium • multiple choice

A company uses Kinesis Data Streams to ingest IoT data. The data volume varies, and occasionally the shard write throughput is exceeded, causing ProvisionedThroughputExceededException errors. The data engineer needs to handle these spikes without losing data. Which approach is most cost-effective and requires minimal code changes?

Question 137 • medium • multiple choice

A company wants to migrate on-premises data to Amazon S3 using AWS DataSync. The data is 10 TB and the network bandwidth is 1 Gbps. The migration must be completed within 48 hours. What should the data engineer do to meet the deadline?

Question 138 • hard • multiple choice

A data engineer is designing a data pipeline that ingests streaming data from Kinesis Data Streams, transforms it using AWS Lambda, and writes to S3. The Lambda function sometimes fails due to transient errors, and the engineer wants to ensure no data is lost. Which approach should be used?

Question 139 • easy • multiple choice

A company needs to ingest data from multiple SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for centralized analytics. The data volume is several GB per day. Which AWS service is most suitable for this ingestion?

Question 140 • medium • multiple choice

A company is ingesting log files from EC2 instances into CloudWatch Logs and then wants to deliver them to S3 for long-term storage and analysis. The data engineer needs to ensure the logs are delivered to S3 within 5 minutes of being generated. Which approach meets this requirement?

Question 141 • hard • multiple choice

A company is building a data lake on S3 and needs to ingest data from an on-premises Oracle database. The data is 5 TB and changes incrementally. The ingestion must capture changes in near real-time (less than 1 minute latency) and be cost-effective. Which approach should be used?

Question 142 • medium • multiple choice

A data engineer needs to transform CSV files arriving in S3 into Parquet format and partition them by date. The transformation should be event-driven and run immediately after each file is uploaded. Which approach is most efficient?

Question 143 • easy • multi select

A data engineer is designing a data ingestion pipeline that uses AWS Lambda to process records from a Kinesis Data Stream and write to DynamoDB. Which TWO strategies can help handle increased throughput and prevent data loss? (Choose TWO.)

Question 144 • medium • multi select

A company is using AWS Glue to run ETL jobs that transform data from S3 to Redshift. The jobs are failing intermittently with out-of-memory errors. Which THREE actions can help resolve this issue? (Choose THREE.)

Question 145 • hard • multi select

A data engineer needs to set up a data ingestion pipeline that reads from Amazon MSK (Managed Streaming for Kafka) and writes to Amazon S3 with transformations. The data is in Avro format and must be converted to Parquet. Which THREE components should be used together? (Choose THREE.)

Question 146 • easy • multiple choice

A data engineer needs to ingest streaming data from thousands of IoT devices into Amazon S3 in near real-time. The data must be processed with minimal latency and stored in a columnar format for analytics. Which service should the engineer use to ingest the data?

Question 147 • medium • multiple choice

A company uses AWS Glue ETL jobs to transform data from Amazon RDS to Amazon S3 daily. The job recently started failing with memory errors. The data volume has grown 3x in the past month. Which change should the data engineer make to resolve the issue?

Question 148 • hard • multiple choice

A data engineer needs to design a data ingestion pipeline that streams change data capture (CDC) events from an on-premises SQL Server database to Amazon S3 with low latency. The pipeline must handle schema changes and ensure exactly-once delivery semantics. Which combination of AWS services should the engineer use?

Question 149easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is troubleshooting an AWS Glue ETL job that fails with the error: 'An error occurred while calling o137.pyWriteDynamicFrame. No such file or directory: s3://bucket/output/part-00000.parquet'. The job reads from a JDBC source and writes to S3. What is the most likely cause?

Question 150mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company ingests JSON logs into Amazon S3 using Kinesis Data Firehose. The logs contain a timestamp field, but the delivery to S3 is delayed by up to 15 minutes during peak hours. The business requires near-real-time availability (under 2 minutes). Which configuration change should the data engineer make?

Question 151hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer runs a weekly AWS Glue ETL job that processes data from Amazon DynamoDB to Amazon S3. The job reads the entire table every time, which is slow and expensive. The job needs to process only items that changed since the last run. Which solution should the engineer implement?

Question 152easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest data from an external API that returns CSV files daily. The files range from 100 MB to 2 GB. The data should be landed in Amazon S3 and then transformed using AWS Glue. Which ingestion method is most cost-effective and requires the least operational overhead?

Question 153mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer notices that an AWS Glue ETL job is running slower than expected. The job reads from Amazon S3, joins two datasets, and writes the result back to S3. The job uses the default worker type (G.1X) and 10 DPUs. Which action is most likely to improve performance?

Question 154hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company ingests streaming data from social media feeds into Amazon Kinesis Data Streams. The data is consumed by an AWS Lambda function that transforms and writes to Amazon S3. Recently, the Lambda function started timing out and dropping records. The data volume has tripled. Which actions should the data engineer take to resolve this? (Choose TWO.)

Question 155mediummulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to design a data ingestion pipeline that captures streaming data from mobile app events into Amazon S3 for analytics. The pipeline must support real-time processing of events and allow for schema evolution over time. Which AWS services should the engineer use? (Choose THREE.)

Question 156easymulti select
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to run ETL jobs daily. The data engineer wants to reduce costs by optimizing the job configuration. Which TWO actions will help reduce costs? (Choose TWO.)

Question 157hardmulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data transformation pipeline using AWS Glue. The source data is in Amazon S3 in Parquet format, and the transformed output must be written to another S3 bucket in Parquet format, partitioned by year, month, and day. The pipeline should handle incremental updates efficiently. Which THREE features should the engineer use? (Choose THREE.)

Question 158mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer is troubleshooting an AWS Glue ETL job that fails with an access denied error when writing to S3. The IAM role attached to the Glue job has the policy shown. What is the most likely cause of the error?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-data-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "*"
    }
  ]
}
Question 159hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer runs an AWS Glue job that fails with an 'Access Denied' error when writing to S3. The IAM role attached to the job has s3:PutObject permission on the output bucket. What additional configuration is most likely missing?

Exhibit

aws glue get-job-run --job-name my-job --run-id abc123

{
  "JobRun": {
    "Id": "abc123",
    "Attempt": 0,
    "PreviousRunId": "xyz456",
    "Trigger": "SCHEDULED",
    "JobName": "my-job",
    "StartedOn": "2025-03-15T10:00:00Z",
    "LastModifiedOn": "2025-03-15T10:05:30Z",
    "CompletedOn": "2025-03-15T10:05:30Z",
    "JobRunState": "FAILED",
    "PredecessorRuns": [],
    "AllocatedCapacity": 10,
    "ExecutionTime": 300,
    "Timeout": 2880,
    "MaxCapacity": 10,
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10,
    "LogGroupName": "/aws-glue/jobs/error",
    "GlueVersion": "3.0"
  }
}
Question 160mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer is troubleshooting a Kinesis Data Streams consumer that is falling behind. The stream has 2 shards and is receiving data at a rate of 2 MB/s. The consumer is an AWS Lambda function with a batch size of 100 records. What should the engineer do to improve consumer throughput?

Exhibit

aws kinesis describe-stream --stream-name my-stream

{
  "StreamDescription": {
    "StreamName": "my-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "113427455640312821154458202477256070485"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49640292012345678901234567890123456789012345678901234567890"
        }
      },
      {
        "ShardId": "shardId-000000000001",
        "HashKeyRange": {
          "StartingHashKey": "113427455640312821154458202477256070485",
          "EndingHashKey": "226854911280625642308916404954512140970"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49640292012345678901234567890123456789012345678901234567891"
        }
      }
    ],
    "EnhancedMonitoring": [],
    "EncryptionType": "KMS",
    "KeyId": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    "RetentionPeriodHours": 24,
    "ShardCount": 2
  }
}
Question 161mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job fails intermittently with a 'MemoryError'. What is the MOST likely cause?

Question 162easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data pipeline uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The delivery occasionally fails with 'Firehose is throttled'. What should be done to reduce throttling?

Question 163hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS DMS to migrate a 2 TB Oracle database to Amazon RDS for PostgreSQL. The migration is taking longer than expected. The task status shows 'Full load in progress' with a low 'Table throughput (rows/s)'. Which action would MOST improve throughput?

Question 164easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is building a pipeline to ingest JSON files from Amazon S3 into Amazon Redshift. The files are 100 MB each and arrive every 5 minutes. Which service is BEST suited for this ingestion?

Question 165mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue DataBrew to clean and transform data for analytics. The source data is in Parquet format in Amazon S3. The transformation includes filtering rows and adding calculated columns. What is the MOST cost-effective way to run these transformations on a schedule?

Question 166hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis stream with 10 shards and writes to an S3 bucket. The application is experiencing high latency. Analysis shows that the application is not keeping up with the incoming data rate. Which action would MOST effectively reduce latency?

Question 167easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Lake Formation to manage data lake permissions. A data engineer needs to grant an IAM role 'Read' access to a specific database and all its tables in the Data Catalog. What is the MOST efficient way to achieve this?

Question 168mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue ETL to transform data from Amazon RDS for MySQL to Amazon S3. The Glue job reads from a JDBC connection. The job runs once daily and processes all records, but the data volume is growing. Which change would improve performance and reduce costs?

Question 169mediummulti select
Read the full Data Ingestion and Transformation explanation →

Which TWO options are valid methods to ingest on-premises relational database data into Amazon S3 for analytics? (Choose TWO.)

Question 170hardmulti select
Read the full Data Ingestion and Transformation explanation →

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a streaming ingestion architecture? (Choose THREE.)

Question 171easymulti select
Read the full Data Ingestion and Transformation explanation →

Which TWO AWS services can be used to transform data in transit during ingestion? (Choose TWO.)

Question 172easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is then consumed by a custom application for real-time analytics. Recently, the application has been experiencing high latency. The operations team suspects the shard count is insufficient. How should the team increase the shard count of the existing stream?

Question 173mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is approximately 500 GB per day. The source database is behind a firewall that does not allow direct internet access. Which service should the engineer use to transfer the data securely?

Question 174hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to transform data stored in Amazon S3. During a run, the job fails with an 'OutOfMemoryError' in the Spark executor. The job processes 2 TB of Parquet files using 10 DPUs. The data is evenly distributed across partitions. Which action would MOST likely resolve the issue without impacting the job logic?

Question 175easymulti select
Read the full Data Ingestion and Transformation explanation →

A company is designing a data ingestion pipeline for real-time IoT sensor data. The data volume peaks at 10,000 messages per second. The pipeline must process messages in order per sensor and persist raw data to Amazon S3 for archival. Which TWO services should be used together to meet these requirements? (Choose TWO.)

Question 176mediummulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer is building a batch ETL pipeline using AWS Glue. The source data is in Amazon RDS for MySQL. The pipeline must run daily and process only new and modified records since the last run. The engineer needs to implement change data capture (CDC) efficiently. Which THREE steps should the engineer take? (Choose THREE.)

Question 177hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The Flink application reads from a Kinesis Data Streams source, performs aggregations, and writes results to Amazon S3. The application is experiencing high checkpoint failures, and the processing lag is increasing. The data volume is 50 MB/s with an average record size of 1 KB. Which TWO actions would improve checkpoint reliability and reduce lag? (Choose TWO.)

Question 178mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer attaches this IAM policy to an IAM role used by an AWS Glue job. The job reads from a Kinesis data stream and writes transformed data to an S3 bucket. When the job runs, it fails with an AccessDenied error for the Kinesis stream. What is the MOST likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"
    }
  ]
}
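As an aid to reading the policy above, the following is a minimal sketch that lists which Kinesis actions the policy grants and compares them with the actions a stream consumer typically calls. The READ_ACTIONS set is an assumption chosen for illustration, and the helper granted_actions is hypothetical; it handles only exact action names, not wildcards.

```python
import json

# The policy from the exhibit, reproduced so the sketch is self-contained.
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["s3:PutObject", "s3:GetObject"],
     "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Effect": "Allow",
     "Action": ["kinesis:PutRecord", "kinesis:PutRecords"],
     "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"}
  ]
}
""")

# Assumption: actions a consumer commonly needs in order to read a stream.
READ_ACTIONS = {
    "kinesis:DescribeStream",
    "kinesis:GetShardIterator",
    "kinesis:GetRecords",
}


def granted_actions(policy_doc, prefix):
    """Collect exact action names with the given prefix from Allow statements.

    Wildcards such as "kinesis:*" are not expanded in this sketch.
    """
    actions = set()
    for statement in policy_doc["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        listed = statement["Action"]
        if isinstance(listed, str):
            listed = [listed]
        actions.update(a for a in listed if a.startswith(prefix))
    return actions


granted = granted_actions(policy, "kinesis:")
missing = READ_ACTIONS - granted
print("granted:", sorted(granted))
print("missing for reads:", sorted(missing))
```

Under these assumptions the policy grants only producer-side (write) actions, so every read action in the assumed set is reported as missing.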
Question 179hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer runs the describe-stream command on a Kinesis data stream. The stream has two shards. The engineer wants to increase the shard count to 4 using the UpdateShardCount API. What will be the resulting shard distribution?

Exhibit

$ aws kinesis describe-stream --stream-name my-stream

{
  "StreamDescription": {
    "StreamName": "my-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "ParentShardId": null,
        "AdjacentParentShardId": null,
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "113427455640312821154458202477256070485"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49604076157786479506863437267615871614090182630735036418",
          "EndingSequenceNumber": null
        }
      },
      {
        "ShardId": "shardId-000000000001",
        "HashKeyRange": {
          "StartingHashKey": "113427455640312821154458202477256070485",
          "EndingHashKey": "226854911280625642308916404954512140970"
        }
      }
    ],
    "EnhancedMonitoring": []
  }
}
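To make the hash-key arithmetic concrete, here is a rough sketch that splits each range from the exhibit into two halves. It assumes that doubling the shard count with uniform scaling divides every open shard into two children of roughly equal hash-key width; the helper split_evenly is hypothetical and only models the arithmetic, not the API call itself.

```python
def split_evenly(start, end):
    """Split the inclusive hash-key range [start, end] into two halves."""
    mid = (start + end) // 2
    return [(start, mid), (mid + 1, end)]


# Hash-key ranges copied from the exhibit, one tuple per shard.
exhibit_ranges = [
    (0, 113427455640312821154458202477256070485),
    (113427455640312821154458202477256070485,
     226854911280625642308916404954512140970),
]

# Assumption: each parent shard yields exactly two child shards.
children = [child for r in exhibit_ranges for child in split_evenly(*r)]

for start, end in children:
    print(f"{start} .. {end}")
```

Python integers are arbitrary precision, so the 128-bit hash keys can be handled without any special library.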
Question 180easymultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A company uses S3 Event Notifications to trigger an AWS Lambda function whenever a new object is uploaded to an S3 bucket. The Lambda function processes the file and moves it to a different bucket. Recently, the function has been failing intermittently. The engineer checks the Lambda CloudWatch logs and sees the event shown in the exhibit. What is the MOST likely cause of the intermittent failures?

Exhibit

{
  "Events": [
    {
      "EventID": "1",
      "EventVersion": "1.0",
      "EventSource": "aws:s3",
      "AwsRegion": "us-east-1",
      "EventName": "ObjectCreated:Put",
      "UserIdentity": {
        "principalId": "AWS:AIDAEXAMPLE"
      },
      "RequestParameters": {
        "sourceIPAddress": "192.0.2.1"
      },
      "ResponseElements": {
        "x-amz-request-id": "EXAMPLE123"
      },
      "S3": {
        "s3SchemaVersion": "1.0",
        "bucket": {
          "name": "source-bucket",
          "arn": "arn:aws:s3:::source-bucket"
        },
        "object": {
          "key": "data/file.csv",
          "size": 1024,
          "eTag": "abc123",
          "sequencer": "0055AED6DCD90281E5"
        }
      }
    }
  ]
}
Question 181hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The jobs run nightly and take 3 hours to complete. The data volume is growing by 20% each month. The engineer needs to reduce job runtime and cost. The source RDS is a db.r5.large instance. Which approach would be MOST effective?

Question 182mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer uses AWS DMS to migrate a 2 TB PostgreSQL database to Amazon Aurora PostgreSQL. The migration task is set to full load + CDC. After the full load completes, the CDC phase starts but shows a high latency of 5 minutes. The source database has a low write load. What should the engineer do to reduce the CDC latency?

Question 183easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time processing. Each device sends JSON payloads of about 2 KB at a rate of 1 message per second. The data must be processed with a durable, ordered stream per device. Which service should the company use as the ingestion layer?

Question 184hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon EMR to process large datasets stored in Amazon S3. The data is in Parquet format and partitioned by date. The EMR cluster uses Spark SQL for transformations. Recently, the job has been slow and some tasks are failing due to 'java.lang.OutOfMemoryError'. The cluster has 10 core nodes of type m5.xlarge. Which configuration change would MOST improve performance and stability?

Question 185mediummulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline for real-time user activity logs. The logs are generated by a web application and must be ingested into Amazon S3 with minimal latency (under 1 minute). The logs also need to be queried in Amazon Athena. The engineer considers using Amazon Kinesis Data Firehose. Which TWO configurations are required to achieve near-real-time delivery to S3? (Choose TWO.)

Question 186hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to run ETL jobs that transform data from Amazon S3 (Parquet) into a denormalized format for Amazon Redshift. The Glue job uses the DynamicFrame API. The job is failing with a 'MemoryError' when performing a join operation. The data is skewed on the join key. Which THREE actions can reduce memory usage and improve job stability? (Choose THREE.)

Question 187easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time analytics. Which AWS service is best suited for this purpose?

Question 188mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to transform CSV files arriving in an S3 bucket into Parquet format and store them in another S3 bucket. The transformation is simple and on-demand, triggered by data arrival. Which solution is the MOST cost-effective and requires the least operational overhead?

Question 189hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Kinesis Data Analytics for SQL-based real-time analytics on streaming data. They notice that the application is processing data slower than the incoming rate, causing increased latency. Which action is MOST likely to improve the throughput?

Question 190easymultiple choice
Read the full Data Ingestion and Transformation explanation →

An organization needs to ingest data from on-premises databases into Amazon S3 for archival purposes. The data volume is several TB per day, and the network has moderate bandwidth. Which AWS service is BEST suited for this bulk data transfer?

Question 191mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue for ETL jobs. The data engineer needs to ensure that the Glue job can access an S3 bucket in another account. What is the recommended approach?

Question 192hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data streaming application uses Kinesis Data Streams with 10 shards. The data producer is throttled frequently. Which action should be taken to resolve this issue?

Question 193easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to transform JSON data from an S3 bucket into a structured format for Amazon Redshift. The transformation should be done serverlessly. Which service should be used?

Question 194mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company is ingesting data from multiple sources into S3 using AWS Glue. The data engineer notices that the Glue job is failing with an OutOfMemory error. Which step should be taken to resolve this issue?

Question 195hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest real-time clickstream data from a web application into Amazon Redshift with minimal latency. The data volume is high and requires processing before loading. Which architecture is MOST appropriate?

Question 196mediummulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data is generated at a high velocity and must be processed in near real-time. The pipeline must also handle bursty traffic. Which TWO AWS services should be combined to achieve this? (Choose TWO.)

Question 197hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to transform data stored in S3. The Glue job runs daily and processes data in the range of hundreds of GB. The data engineer wants to optimize the job for cost and performance. Which THREE actions should be taken? (Choose THREE.)

Question 198easymulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to transfer 50 TB of data from an on-premises data center to Amazon S3 over a 1 Gbps network. The transfer must be completed within one week. Which TWO AWS services can be used for this task? (Choose TWO.)
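The timing constraint in this scenario can be checked with simple arithmetic. The sketch below estimates how long a transfer takes over a network link; the function name transfer_days and the efficiency parameter are hypothetical, and decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) are assumed.

```python
def transfer_days(terabytes, gbps, efficiency=1.0):
    """Estimate transfer time in days.

    efficiency is the assumed fraction of the link that is actually
    usable (1.0 means the full line rate, which is optimistic).
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (gbps * 1e9 * efficiency)
    return seconds / 86400


ideal = transfer_days(50, 1)
realistic = transfer_days(50, 1, efficiency=0.8)
print(f"50 TB at full 1 Gbps: {ideal:.2f} days")
print(f"50 TB at 80% of 1 Gbps: {realistic:.2f} days")
```

At full line rate the estimate is about 4.6 days, so the one-week window is feasible over the network only if the link is mostly dedicated to the transfer.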

Question 199hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Kinesis Data Streams to ingest clickstream data. They notice that the data processing latency increases as the number of shards grows. What is the most likely cause and solution?

Question 200easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest JSON files from an S3 bucket into a DynamoDB table. The files are updated hourly and contain new records. Which AWS service should be used to trigger a Lambda function for each new object?

Question 201mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue ETL jobs to transform data in S3. The job runs successfully but takes longer than expected. The data is in Parquet format and partitioned by date. Which change would most improve performance without increasing cost?

Question 202hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a streaming pipeline that ingests IoT sensor data from 10,000 devices. Each device sends a 1 KB message every second. The data must be processed in near real-time and stored in S3 for analytics. Which combination of services provides the most cost-effective solution?
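For sizing purposes, the aggregate load in this scenario can be worked out as follows. This is a back-of-the-envelope sketch that assumes the commonly documented per-shard write limits for a provisioned Kinesis data stream (1 MB/s and 1,000 records/s); the function shards_needed is hypothetical, and the result says nothing about which service combination is cheapest.

```python
import math

# Assumed per-shard write limits for a provisioned stream.
SHARD_MB_PER_SEC = 1.0
SHARD_RECORDS_PER_SEC = 1000


def shards_needed(devices, record_kb, records_per_sec_per_device):
    """Return the minimum shard count that satisfies both write limits."""
    records_per_sec = devices * records_per_sec_per_device
    mb_per_sec = records_per_sec * record_kb / 1024
    by_bytes = math.ceil(mb_per_sec / SHARD_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    return max(by_bytes, by_records)


# 10,000 devices, 1 KB per message, one message per second each.
print(shards_needed(10_000, 1, 1))
```

With the numbers from the question, both limits point to roughly 10 shards before any headroom for bursts is added.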

Question 203mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company runs a nightly ETL job using AWS Glue. The job reads data from a JDBC connection to an on-premises MySQL database. The job fails with an error indicating that the connection pool is exhausted. What is the most likely cause and solution?

Question 204easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to transform CSV files to Parquet format using AWS Glue. The source data contains sensitive columns that must be masked. Which Glue feature should be used?

Question 205hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data pipeline ingests streaming data from Kinesis Data Streams into S3 via Kinesis Data Firehose. Occasionally, small files are written to S3, increasing downstream processing costs. What is the most efficient way to reduce the number of small files?

Question 206mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Database Migration Service (DMS) to continuously replicate data from an Amazon RDS for Oracle instance to S3. The data is used for analytics. The replication lags behind the source by several hours. Which change would most likely reduce the lag?

Question 207easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest data from an external FTP server into S3 on a schedule. The FTP server is only accessible via VPN. Which AWS service is best suited for this task?

Question 208mediummulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a near-real-time streaming pipeline to ingest clickstream data from a web application. The data must be enriched with user metadata from a DynamoDB table before being stored in S3. Which combination of AWS services should the engineer use? (Choose TWO.)

Question 209hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company uses a Kinesis Data Firehose delivery stream to load data into an S3 bucket. The data is in JSON format and must be converted to Parquet before landing in S3. Which steps are required to achieve this? (Choose THREE.)

Question 210easymulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest data from a SaaS application that sends webhooks in JSON format. The data must be stored in S3 for batch analysis. Which AWS services can receive the webhooks and store the data in S3 with minimal custom code? (Choose TWO.)

Question 211hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer is configuring an IAM policy for a Lambda function that writes transformed data to S3. The function writes to both 'example-bucket/data/' and 'example-bucket/public/'. The policy is intended to enforce server-side encryption with SSE-S3 for all objects written to the 'public/' prefix, while allowing all operations on other prefixes. However, the Lambda function is failing with an AccessDenied error when writing to 'example-bucket/public/'. What is the most likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::example-bucket/public/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
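The behavior of the Deny statement above can be modeled in a few lines. This is a simplified sketch, not the IAM evaluation engine: it assumes that a negated operator such as StringNotEquals also matches when the condition key is absent from the request, and the function deny_applies is hypothetical.

```python
def deny_applies(request_headers):
    """Model the Deny condition for a PutObject under the public/ prefix.

    StringNotEquals is treated as true when the encryption header is
    missing or has any value other than "AES256".
    """
    value = request_headers.get("x-amz-server-side-encryption")
    return value != "AES256"


# No encryption header sent: the Deny matches and the write is rejected.
print(deny_applies({}))
# SSE-S3 requested explicitly: the Deny does not match.
print(deny_applies({"x-amz-server-side-encryption": "AES256"}))
# A different encryption type: the Deny matches.
print(deny_applies({"x-amz-server-side-encryption": "aws:kms"}))
```

In this model a PutObject request succeeds under public/ only when the caller explicitly sends the AES256 encryption header with the request.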
Question 212easymultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer runs the CLI command shown in the exhibit to find files smaller than 1000 bytes in a bucket. The command returns an empty array, but the engineer knows there are small files. What is the issue?

Exhibit

aws s3api list-objects-v2 --bucket my-bucket --prefix logs/2023/01/01/ --query "Contents[?Size < '1000']"
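The comparison in the query above can be illustrated without calling AWS. The sketch below models, in plain Python, how a JMESPath ordering comparison behaves when one operand is a string: the result is null, which a filter treats as false. The object keys and the helper jmespath_lt are hypothetical; in JMESPath itself a number literal is written with backticks rather than single quotes.

```python
def jmespath_lt(left, right):
    """Model of JMESPath '<': defined only when both operands are numbers."""
    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if is_number(left) and is_number(right):
        return left < right
    return None  # null in JMESPath, treated as false by a filter


# Hypothetical listing standing in for the "Contents" array.
contents = [
    {"Key": "logs/2023/01/01/a.log", "Size": 512},
    {"Key": "logs/2023/01/01/b.log", "Size": 4096},
]

# Size compared against the string '1000': nothing matches.
with_string = [o["Key"] for o in contents if jmespath_lt(o["Size"], "1000")]
# Size compared against the number 1000: the small file matches.
with_number = [o["Key"] for o in contents if jmespath_lt(o["Size"], 1000)]

print(with_string)
print(with_number)
```

The first list is empty even though a 512-byte object exists, which mirrors the symptom described in the question.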
Question 213mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer deploys this CloudFormation template to create an AWS Glue job. The job fails on the first run with an error: 'AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/GlueServiceRole/... is not authorized to perform: s3:GetObject on resource: s3://my-bucket/scripts/etl.py'. What is the most likely cause?

Exhibit

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "GlueJob": {
      "Type": "AWS::Glue::Job",
      "Properties": {
        "Command": {
          "Name": "glueetl",
          "ScriptLocation": "s3://my-bucket/scripts/etl.py"
        },
        "DefaultArguments": {
          "--enable-metrics": "true",
          "--job-language": "python"
        },
        "ExecutionProperty": {
          "MaxConcurrentRuns": 1
        },
        "MaxRetries": 0,
        "Name": "my-glue-job",
        "Role": "arn:aws:iam::123456789012:role/GlueServiceRole"
      }
    }
  }
}
Question 214easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to run ETL jobs daily. The data is stored in S3 as Parquet files partitioned by date. Recently, jobs have failed with the error 'No such file or directory' for certain partitions. What is the MOST likely cause?

Question 215mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineering team needs to ingest streaming data from thousands of IoT devices and store it in Amazon S3 for batch processing. The data arrives at a rate of 10 MB/s, with occasional spikes up to 50 MB/s. The data must be processed in near real-time with minimal latency. Which AWS service should be used for ingestion?

Question 216hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company runs a daily batch ETL job using AWS Glue. The job processes 500 GB of data from Amazon RDS to Amazon S3. The job currently uses a single DPU and takes 6 hours to complete. The team wants to reduce runtime to under 1 hour without increasing costs significantly. Which approach should they use?

Question 217easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to transfer 10 TB of historical data from an on-premises HDFS cluster to Amazon S3. The data is stored on a single 20 TB disk. The network link to AWS has a bandwidth of 1 Gbps. The transfer must be completed within 2 days. Which solution meets these requirements?

Question 218mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer notices that an AWS Glue job writing to Amazon S3 in Parquet format creates many small files (less than 1 MB each). This leads to poor query performance in Amazon Athena. What is the BEST way to reduce the number of output files?

Question 219hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to process data from Amazon RDS for MySQL into Amazon S3. The Glue job uses a JDBC connection and runs on a schedule. Recently, the job has been failing with a 'Communications link failure' error. The RDS instance is in a private subnet. Which troubleshooting step should the data engineer take FIRST?

Question 220easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest real-time clickstream data from a website into Amazon S3 for analytics. The data arrives as JSON records, each under 1 KB. The engineer wants to use a serverless solution with automatic scaling and minimal operational overhead. Which AWS service should be used as the ingestion endpoint?

Question 221mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration is ongoing with continuous replication. The data engineer notices that some changes are not being captured in the target database. What is the MOST likely cause?

Question 222hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company runs an AWS Glue ETL job that reads data from Amazon S3, transforms it, and writes back to S3 in a different partition structure. The job uses the 'spark.sql.shuffle.partitions' option set to 200. After the job completes, the output has many small files. The data engineer wants to minimize the number of output files while maintaining job performance. Which action should the engineer take?

Question 223easymulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline to load JSON files from Amazon S3 into Amazon Redshift. Which TWO methods can be used to load the data efficiently?

Question 224mediummulti select
Read the full Data Ingestion and Transformation explanation →

A company is building a data lake on Amazon S3. They need to ingest data from multiple sources, including relational databases, streaming data, and log files. Which THREE AWS services can be used to ingest data into the data lake?

Question 225hardmulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer is troubleshooting an AWS Glue job that reads from Amazon RDS MySQL and writes to Amazon S3. The job runs successfully but takes longer than expected. The engineer wants to optimize performance. Which THREE actions would improve job performance?

Question 226mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer is creating an IAM policy for an application that sends data to a Kinesis stream and stores processed data in S3. The policy is attached to an IAM role used by an EC2 instance. The application fails to write to S3 with an access denied error. What is the cause?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "kinesis:PutRecord",
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"
    }
  ]
}
Question 227hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer runs the AWS CLI command to describe a Glue job. The job is expected to process new data incrementally using job bookmarks. However, the job reprocesses all data every time it runs. What is the MOST likely reason?

Exhibit

Refer to the exhibit.

aws glue get-job --job-name my-etl-job
{
  "Job": {
    "Name": "my-etl-job",
    "Role": "arn:aws:iam::123456789012:role/GlueServiceRole",
    "Command": {
      "Name": "glueetl",
      "ScriptLocation": "s3://my-bucket/scripts/my-script.py"
    },
    "DefaultArguments": {
      "TempDir": "s3://my-bucket/temp/",
      "--job-bookmark-option": "job-bookmark-enable",
      "--extra-py-files": "s3://my-bucket/libs/my-lib.zip"
    },
    "MaxRetries": 0,
    "AllocatedCapacity": 10,
    "Timeout": 2880,
    "MaxCapacity": 10,
    "GlueVersion": "3.0"
  }
}
Question 228mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer is running an AWS Glue job that reads data from an S3 source. The job fails with the error shown. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

2019-11-15T10:00:00Z ERROR: Task failed: 'NoneType' object has no attribute 'read'
Question 229easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue ETL jobs to transform data in Amazon S3. The data arrives in JSON format but needs to be converted to Parquet for efficient querying. Which AWS Glue feature should be used to infer the schema and generate transformation code?

Question 230mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is ingesting streaming data from an IoT fleet into Amazon Kinesis Data Streams. The data must be transformed in real-time and loaded into an Amazon Redshift cluster. Which solution minimizes operational overhead?

Question 231hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company ingests clickstream data into Amazon S3 via Kinesis Data Firehose. The data arrives in 20 MB files every 2 minutes. The data engineering team needs to transform nested JSON into a flat structure before loading into Amazon Redshift. Which approach is most cost-effective and scalable?

Question 232easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to capture change data capture (CDC) events from an Amazon RDS for PostgreSQL database and deliver them to Amazon S3 in near real-time. Which AWS service should be used?

Question 233mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to process data in Amazon S3. The Glue job fails with an error indicating that the partition keys in the catalog do not match the actual S3 partition structure. What is the most likely cause?

Question 234hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Streams with a shard count of 10. The incoming data rate is 1 MB/second. The consuming application uses the Kinesis Client Library (KCL) with a single worker. What is the most likely performance bottleneck?

Question 235easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest data from multiple on-premises databases into Amazon S3 for analytics. The databases include Oracle, MySQL, and PostgreSQL. The data must be continuously replicated with minimal latency. Which AWS service should be used?

Question 236mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is building a data pipeline that ingests data from Amazon S3 into Amazon Redshift. The data is in CSV format and includes a timestamp column. The pipeline should load only new data incrementally. Which approach is most efficient?

Question 237hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is compressed with GZIP and partitioned by year, month, day, and hour. The delivery stream is configured to buffer up to 5 MB or 60 seconds. Some records are missing from S3. What is the most likely cause?

Question 238easymulti select
Read the full Data Ingestion and Transformation explanation →

Which TWO AWS services can be used to ingest streaming data into Amazon S3? (Choose two.)

Question 239mediummulti select
Read the full Data Ingestion and Transformation explanation →

Which TWO practices improve the performance of AWS Glue ETL jobs? (Choose two.)

Question 240hardmulti select
Read the full Data Ingestion and Transformation explanation →

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for real-time data ingestion? (Choose three.)

Question 241mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer has attached this IAM policy to an AWS Glue job role. The Glue job fails when trying to write transformed data to an S3 bucket located in a different AWS account. What is the most likely reason?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"
    },
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-function"
    }
  ]
}
Question 242easymultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer runs the AWS CLI command and observes the output. The stream has two shards. A producer sends a record with a partition key that hashes to 150000000000000000000000000000000000000. To which shard will the record be written?

Exhibit

Refer to the exhibit.

$ aws kinesis describe-stream --stream-name my-data-stream
{
  "StreamDescription": {
    "StreamName": "my-data-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-data-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "113427455640312821154458202477256070484"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49614364192093460283261643761561152160826180970340319234"
        }
      },
      {
        "ShardId": "shardId-000000000001",
        "HashKeyRange": {
          "StartingHashKey": "113427455640312821154458202477256070485",
          "EndingHashKey": "226854911280625642308916404954512140969"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49614364192093460283261643761561152160826180970340319235"
        }
      }
    ]
  }
}
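
As a rough illustration of how a hash value is matched against the shard ranges in the exhibit, the sketch below performs the range lookup in plain Python. Kinesis Data Streams derives the hash by reading the MD5 digest of the partition key as a 128-bit integer. The names `SHARDS`, `hash_of`, and `shard_for_hash` are made up for this example and are not part of any AWS SDK.

```python
import hashlib

# Hash key ranges copied from the exhibit: (shard id, start, end), both ends inclusive.
SHARDS = [
    ("shardId-000000000000", 0, 113427455640312821154458202477256070484),
    ("shardId-000000000001",
     113427455640312821154458202477256070485,
     226854911280625642308916404954512140969),
]


def hash_of(partition_key: str) -> int:
    """Return the MD5 digest of the partition key as an unsigned 128-bit integer."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def shard_for_hash(hash_key: int, shards=SHARDS) -> str:
    """Return the id of the shard whose hash key range contains hash_key."""
    for shard_id, start, end in shards:
        if start <= hash_key <= end:
            return shard_id
    raise ValueError("hash key falls outside every listed range")


record_hash = 150000000000000000000000000000000000000  # value given in the question
print(shard_for_hash(record_hash))
```

Only the hash value from the question is looked up here. An arbitrary key passed through `hash_of` may land outside the two ranges listed in the exhibit, in which case the lookup raises `ValueError`.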
Question 243mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer creates an AWS Glue job using this CloudFormation template. The job processes new data files in S3 and uses job bookmarks to track processed files. After initial success, the job runs again but processes all files again instead of only new ones. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "MyGlueJob": {
      "Type": "AWS::Glue::Job",
      "Properties": {
        "Name": "transform-job",
        "Role": "arn:aws:iam::123456789012:role/GlueServiceRole",
        "Command": {
          "Name": "glueetl",
          "ScriptLocation": "s3://bucket/scripts/transform.py",
          "PythonVersion": "3"
        },
        "DefaultArguments": {
          "--job-bookmark-option": "job-bookmark-enable",
          "enable-metrics": "true"
        },
        "MaxRetries": 0
      }
    }
  }
}
Question 244easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed into Parquet format and stored in Amazon S3. Which AWS service can perform the transformation in near real-time with minimal operational overhead?

Question 245mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to load data from an on-premises Oracle database to Amazon S3 daily. The table is 500 GB and grows by 50 MB per day. The load must capture only new and changed rows since the last run. Which solution is MOST cost-effective and requires the least maintenance?

Question 246hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to process JSON logs from S3. The logs have a nested structure and the schema evolves over time. The data engineer needs to ensure the Glue job can handle schema changes without failing. Which configuration should be used?

Question 247mediummulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline for clickstream data. The data arrives in batches of 10-50 MB every 5 seconds. The engineer needs to buffer the data, perform simple transformations (e.g., add timestamp, remove PII), and land it in S3 within 10 minutes. Which TWO services should be combined? (Choose TWO.)

Question 248hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest data from multiple SaaS applications (Salesforce, Marketo) into Amazon S3 for analytics. The data volume is moderate (~100 GB per day). The pipeline must handle schema changes, deduplicate records, and provide low latency (under 1 hour). Which THREE services should be used? (Choose THREE.)

Question 249easymulti select
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to transform CSV files in S3 to Parquet format using a serverless solution. The files are large (up to 5 GB each) and arrive irregularly. Which TWO services can accomplish this with minimal operational overhead? (Choose TWO.)

Question 250easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon S3 Event Notifications to trigger a Lambda function that processes incoming files. Recently, the Lambda function has been timing out for large files (>100 MB). The data engineer wants to improve the pipeline to handle large files reliably. Which solution is the MOST scalable and cost-effective?

Question 251hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a streaming pipeline that ingests data from an Amazon Kinesis Data Stream (with 5 shards) into Amazon S3. The data must be transformed using a complex stateful operation that cannot be done in a Lambda function (limited to 15 minutes). The engineer needs a solution that can maintain state across multiple records. Which service should be used?

Question 252easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest data from a MySQL database into Amazon S3 in near real-time. The database is running on EC2. The data engineer wants to minimize the impact on the source database. Which service should be used?

Question 253mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer uses AWS Glue to process data from S3. The Glue job frequently fails with 'Out of Memory' errors. The job reads several large compressed files. What is the MOST effective way to resolve this issue without changing the code?

Question 254hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company is building a data lake on S3. They have a large volume of CSV files (hundreds of GB) in a source bucket. They need to convert them to Parquet, partition by date, and ensure the data is encrypted at rest with SSE-KMS. The pipeline must be triggered automatically when new files arrive. Which THREE steps should be part of the solution? (Choose THREE.)

Question 255easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company is streaming clickstream data from a website into Amazon Kinesis Data Streams. The data must be transformed in near real-time and stored in Amazon S3 for analytics. Which AWS service should be used to transform the data as it is ingested?

Question 256mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest data from an on-premises Oracle database to Amazon S3 daily. The data volume is 500 GB per day, and the network bandwidth is 200 Mbps. The requirement is to minimize the impact on the source database and ensure data integrity. Which combination of AWS services should be used?

Question 257hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The Firehose delivery stream has a buffer size of 64 MB and a buffer interval of 300 seconds. The data volume is 1 GB per minute, and the average record size is 1 KB. The data must be delivered to S3 within 5 minutes of ingestion. The engineer notices that some files are being delivered after 10 minutes. What is the most likely cause?
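
The numbers in this scenario can be checked with a little arithmetic. Firehose delivers a buffer when either the size threshold or the interval threshold is reached, whichever comes first. The sketch below estimates how quickly the 64 MB buffer fills at the stated ingest rate and how many records per second that rate implies. It assumes binary units (1 GB = 1024 MB, 1 MB = 1024 KB) and does not attempt to identify the cause of the delay; the function and variable names are illustrative only.

```python
def seconds_to_fill(buffer_mb: float, ingest_mb_per_s: float) -> float:
    """Time for incoming data to reach the buffer size threshold."""
    return buffer_mb / ingest_mb_per_s


BUFFER_MB = 64
BUFFER_INTERVAL_S = 300
RECORD_KB = 1

ingest_mb_per_s = 1024 / 60                           # 1 GB per minute
fill_s = seconds_to_fill(BUFFER_MB, ingest_mb_per_s)  # size threshold
flush_after_s = min(fill_s, BUFFER_INTERVAL_S)        # whichever threshold is hit first
records_per_s = ingest_mb_per_s * 1024 / RECORD_KB

print(f"buffer fills in {fill_s:.2f} s, flush expected after {flush_after_s:.2f} s")
print(f"about {records_per_s:,.0f} records per second")
```

At this rate the size threshold is reached long before the 300-second interval, so the buffer settings alone would not explain a 10-minute delay.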

Question 258easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineering team needs to transform CSV files to Parquet format after they land in an S3 bucket. The transformation should be triggered automatically as soon as a new file arrives. Which AWS service is best suited for this task?

Question 259mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue ETL jobs to process data from an S3 data lake. The job reads data in CSV format, transforms it, and writes to Parquet. The job runs daily and takes 2 hours to complete. The data volume is increasing by 20% each month. The engineer wants to reduce the job runtime. Which action is most effective?

Question 260hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon Kinesis Data Streams to ingest IoT sensor data. The data is processed by an AWS Lambda function that transforms the records and writes to an Amazon S3 bucket. Recently, the Lambda function has been failing with 'Rate exceeded' errors for the S3 PUT API calls. The data volume is 10 MB/s with average record size 2 KB. What should be done to resolve this issue?
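
To see why the write pattern matters here, the request rate can be worked out from the stated volume. Amazon S3 supports at least 3,500 PUT/COPY/POST/DELETE requests per second per prefix. The sketch below assumes, purely for illustration, that the function writes one object per record to a single prefix (the question does not say so), and then shows the effect of grouping records into larger objects. Binary units are assumed and all names are hypothetical.

```python
S3_PUT_PER_PREFIX_PER_S = 3500  # documented S3 baseline per prefix


def records_per_second(mb_per_s: float, record_kb: float) -> float:
    """Record rate implied by a data rate and an average record size."""
    return mb_per_s * 1024 / record_kb


def puts_per_second(record_rate: float, records_per_object: int) -> float:
    """S3 PUT rate when records_per_object records are written per object."""
    return record_rate / records_per_object


rate = records_per_second(10, 2)        # 10 MB/s of 2 KB records
one_per_record = puts_per_second(rate, 1)
batched = puts_per_second(rate, 500)    # 500 records per object (hypothetical batch size)

print(f"{rate:.0f} records/s")
print(f"one object per record: {one_per_record:.0f} PUT/s "
      f"(over the baseline: {one_per_record > S3_PUT_PER_PREFIX_PER_S})")
print(f"500 records per object: {batched:.2f} PUT/s")
```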

Question 261easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest data from multiple SaaS applications (Salesforce, Marketo) and load it into Amazon Redshift. The data must be transformed before loading. Which AWS service should be used to build the ingestion pipelines?

Question 262mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is using Amazon EMR to transform large datasets stored in S3. The cluster runs once a day and takes 3 hours. The engineer notices that the cluster is idle for 30 minutes at the start while waiting for resources. What is the most cost-effective way to reduce the idle time?

Question 263hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue DataBrew for data preparation. The data source is an S3 bucket with millions of small CSV files (each < 1 MB). The DataBrew project takes a long time to load the sample data. What is the most likely cause and solution?

Question 264mediummulti select
Read the full Data Ingestion and Transformation explanation →

A company ingests IoT data into an S3 bucket using AWS IoT Core rules. The data is in JSON format, and each record is about 500 bytes. The data volume is 5 GB per day. The company wants to convert the data to Parquet format and partition it by year/month/day. Which TWO AWS services can be used together to achieve this with minimal operational overhead?

Question 265hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company runs a real-time analytics platform using Amazon Kinesis Data Streams. The data is consumed by multiple consumers: one for real-time dashboard (using Lambda) and one for long-term storage (using Firehose to S3). The Kinesis stream has 10 shards. Each record is 1 KB, and the total incoming data rate is 5 MB/s. The Lambda consumer is falling behind and processing latency exceeds 10 seconds. Which TWO actions should be taken to resolve the issue?
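
A quick per-shard calculation helps separate ingest limits from consumer limits in this scenario. Each Kinesis Data Streams shard accepts up to 1 MB/s or 1,000 records/s of writes and serves up to 2 MB/s of reads, and that read throughput is shared among consumers that do not use enhanced fan-out. The sketch below assumes records are spread evenly across shards and uses binary units; the helper name is made up for this example.

```python
SHARD_WRITE_MB_S = 1.0
SHARD_WRITE_RECORDS_S = 1000
SHARD_READ_MB_S = 2.0  # shared by all standard (non enhanced fan-out) consumers


def per_shard_load(total_mb_s: float, record_kb: float, shards: int):
    """Return (MB/s, records/s) per shard, assuming an even key distribution."""
    mb_s = total_mb_s / shards
    return mb_s, mb_s * 1024 / record_kb


mb_s, records_s = per_shard_load(total_mb_s=5, record_kb=1, shards=10)

print(f"per shard: {mb_s} MB/s, {records_s:.0f} records/s")
print("within write limits:",
      mb_s <= SHARD_WRITE_MB_S and records_s <= SHARD_WRITE_RECORDS_S)
```

Ingest sits at roughly half of each shard's write limit, which suggests looking at how the consumers read and process each shard instead of at the producer side.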

Question 266mediummulti select
Read the full Data Ingestion and Transformation explanation →

A company is designing a data ingestion pipeline for real-time clickstream data. The data must be ingested with low latency (< 1 second) and then processed for real-time analytics. The processed data should be stored in Amazon S3 for batch analytics. Which THREE services should be used together?

Question 267hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A financial services company ingests real-time stock trade data from multiple exchanges into Amazon Kinesis Data Streams. Each trade record is a JSON object with fields: trade_id, symbol, price, quantity, timestamp. The stream has 5 shards. The data is consumed by an AWS Lambda function that aggregates trades per symbol every minute and writes the results to an Amazon DynamoDB table for a real-time dashboard. Recently, the dashboard has been showing outdated data, and the Lambda function is experiencing high error rates. The CloudWatch logs show 'ProvisionedThroughputExceededException' errors from DynamoDB. The DynamoDB table has 10 read capacity units (RCU) and 10 write capacity units (WCU). The average trade volume is 5,000 trades per second across all symbols, and there are 100 symbols. The Lambda function is configured with a batch size of 100 and a 1-minute window. The data volume is expected to double in the next month. As a data engineer, what is the most appropriate course of action?
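
The gap between the provisioned capacity and the workload can be estimated directly. One DynamoDB write capacity unit covers one standard write per second for an item of up to 1 KB. The sketch below compares two write patterns: one write per trade, and one aggregated write per symbol per minute. It assumes items of at most 1 KB, and the question does not state which pattern the function actually follows, so both are shown. `wcu_needed` is a hypothetical helper.

```python
import math


def wcu_needed(writes_per_second: float, item_kb: float = 1.0) -> int:
    """WCUs for standard writes: one WCU per write per second per 1 KB of item size."""
    return math.ceil(writes_per_second * math.ceil(item_kb))


TRADES_PER_S = 5000
SYMBOLS = 100
PROVISIONED_WCU = 10

per_trade = wcu_needed(TRADES_PER_S)        # one write for every trade
per_minute = wcu_needed(SYMBOLS / 60)       # one write per symbol per minute
doubled_per_trade = wcu_needed(2 * TRADES_PER_S)

print(f"write per trade: {per_trade} WCU (provisioned: {PROVISIONED_WCU})")
print(f"one aggregate per symbol per minute: {per_minute} WCU")
print(f"write per trade after volume doubles: {doubled_per_trade} WCU")
```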

Question 268mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A retail company uses AWS Glue ETL jobs to process sales data from an S3 data lake. The source data is partitioned by year/month/day in CSV format. The Glue job reads the latest day's data, performs transformations (e.g., cleaning, aggregating), and writes the results to a separate S3 bucket. The job runs on a schedule every day at 2 AM. Recently, the job has been failing intermittently with the error 'AnalysisException: Path does not exist: s3://source-bucket/year=2024/month=02/day=30/'. The engineer verifies that the folder 'day=30' does not exist because February has only 29 days in 2024. The job is reading data from a hardcoded path. The company expects the job to handle variable days per month automatically. What should the engineer do to fix the issue?
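
One way to avoid a hardcoded day is to derive the partition path from real calendar arithmetic. The sketch below builds the path with Python's `datetime` module, which never produces a date that does not exist. It assumes that "the latest day's data" means the calendar day before the run date; `partition_path` is a hypothetical helper, and only the bucket name is taken from the error message.

```python
from datetime import date, timedelta


def partition_path(run_date: date, bucket: str = "source-bucket") -> str:
    """Build the year/month/day partition path for the day before run_date."""
    day = run_date - timedelta(days=1)  # assumption: the 2 AM run processes the previous day
    return f"s3://{bucket}/year={day:%Y}/month={day:%m}/day={day:%d}/"


print(partition_path(date(2024, 3, 1)))
print(partition_path(date(2023, 3, 1)))
```

For a run on 1 March the helper resolves to the last real day of February in either year, so no request is made for a day such as 30 February.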

Question 269easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A startup is building a data pipeline to ingest user activity logs from a mobile app. The logs are sent in real-time via HTTP POST requests. The data volume is low (a few hundred requests per second) but can spike to a few thousand during promotions. The team wants to store the logs in Amazon S3 for analysis. They also need to be able to query the data using Amazon Athena with minimal latency. The data must be transformed from JSON to Parquet and partitioned by date. The team is considering using Amazon API Gateway with AWS Lambda to receive the logs and write to S3. However, they are concerned about Lambda cold starts and the complexity of handling spikes. Which alternative solution should they choose?

Question 270mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company is streaming IoT data from thousands of devices into Amazon Kinesis Data Streams. The data must be transformed in real time before being stored in Amazon S3. Which service should be used to perform the transformation as the data streams through Kinesis?

Question 271hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline for clickstream data from a mobile app. The data volume varies, with occasional spikes up to 10 MB/s. The pipeline must persist the raw data in Amazon S3 and make it available for near-real-time analytics via Amazon Athena. Which combination of services minimizes cost and operational overhead?

Question 272easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company wants to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is 500 GB per transfer. Which AWS service is most appropriate for this batch ingestion?

Question 273mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company is using Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data must be transformed from JSON to Parquet format before landing in S3. The transformation logic is simple: convert the JSON schema to Parquet. Which approach meets the requirements with the least operational overhead?

Question 274hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that is experiencing high error rates when writing to an S3 bucket. The error logs indicate 'AccessDenied' errors. The S3 bucket policy allows access from the Firehose service, but the errors persist. What is the most likely cause?

Question 275easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest data from multiple SaaS sources (e.g., Salesforce, Marketo) into Amazon S3 for analytics. Which AWS service is designed for this purpose?

Question 276mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company is ingesting log files from multiple EC2 instances into Amazon S3 using the CloudWatch agent. The logs are delivered to a CloudWatch Logs group, and a subscription filter sends them to a Lambda function for transformation, then to Firehose. The Firehose stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The logs are critical and must be available in S3 within 5 minutes. What is the most cost-effective way to reduce the delivery latency?

Question 277hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline for real-time financial transactions. The pipeline must ensure exactly-once processing semantics and must handle duplicate records that may occur due to retries. Which combination of AWS services can achieve exactly-once processing?

Question 278easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest data from an on-premises Hadoop cluster into Amazon S3 for archival and analysis. The total data volume is 50 TB. The migration must be completed within one week. The on-premises network has a 1 Gbps connection to AWS. Which AWS service should be used?
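
Whether a network transfer fits the deadline is a matter of simple arithmetic. The sketch below estimates the transfer time for 50 TB over a 1 Gbps link, using decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second). The `utilization` parameter is an assumption added for illustration, since a link is rarely usable at its full rated speed.

```python
def transfer_days(size_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Estimated days needed to move size_tb over a link at the given utilization."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86400


print(f"at 100% utilization: {transfer_days(50, 1):.2f} days")
print(f"at 80% utilization:  {transfer_days(50, 1, 0.8):.2f} days")
```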

Question 279mediummulti select
Read the full Data Ingestion and Transformation explanation →

A company is using AWS Glue to run ETL jobs that read from Amazon S3 and write to Amazon Redshift. The jobs are failing intermittently with 'Out of Memory' errors. Which TWO actions should the data engineer take to resolve this issue? (Choose TWO.)

Question 280hardmulti select
Read the full Data Ingestion and Transformation explanation →

A company is ingesting streaming data from social media feeds using Amazon Kinesis Data Streams. The data is consumed by multiple applications: one for real-time sentiment analysis and another for archival to S3. The data must be processed in order for each social media post. Which TWO approaches meet the requirements? (Choose TWO.)

Question 281easymulti select
Read the full Data Ingestion and Transformation explanation →

A company is designing a data lake on Amazon S3. The data ingestion pipeline must handle both structured and unstructured data. The data must be cataloged for easy discovery. Which THREE services should be included in the solution? (Choose THREE.)

Question 282hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A company runs a real-time analytics platform that ingests data from thousands of sensors via Amazon Kinesis Data Streams. Each sensor sends a JSON payload every second. The data is consumed by a fleet of EC2 instances running a custom consumer application. Recently, the consumer has been falling behind, with the iterator age exceeding 10 minutes. The company has already increased the number of shards to 100, but the problem persists. The consumer application is single-threaded per shard and uses the Kinesis Client Library (KCL). The CPU utilization on the EC2 instances is below 30%. What should the data engineer do to reduce the iterator age?

Question 283mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A media company ingests video files from content partners into an Amazon S3 bucket. Each video file is 10-50 GB. Upon upload, an AWS Lambda function is triggered to extract metadata (e.g., resolution, duration) and store it in DynamoDB. The company now wants to also generate a thumbnail image for each video. The thumbnail generation is CPU-intensive and can take up to 5 minutes per video. The Lambda function has a maximum execution time of 15 minutes. The company has noticed that some thumbnail generation tasks are timing out. What should the data engineer do to reliably generate thumbnails for all videos?

Question 284easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A small startup is building a data pipeline to ingest customer orders from a web application into Amazon Redshift for analytics. The orders are written to an Amazon RDS MySQL database. The startup wants to replicate the orders to Redshift in near-real time (within 5 minutes) with minimal operational overhead. The data volume is low, averaging 100 new orders per minute. The startup has a single data engineer who is also responsible for other tasks. What is the simplest solution?

Question 285mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest streaming data from thousands of IoT devices into AWS for near-real-time analytics. The data volume varies significantly and can spike unpredictably. The engineer wants to minimize operational overhead and ensure that data is durably stored as soon as it arrives. Which AWS service combination should the engineer use?

Question 286hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is troubleshooting a Lambda function that reads from a Kinesis Data Stream, processes records, and writes to a Kinesis Data Firehose delivery stream. The Firehose delivery stream is configured to deliver data to an S3 bucket. The Lambda function is failing with an access denied error. The IAM policy attached to the Lambda execution role is shown in the exhibit. Which permission is missing?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:GetRecords",
        "kinesis:GetShardIterator",
        "kinesis:DescribeStream",
        "kinesis:ListShards"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"
    },
    {
      "Effect": "Allow",
      "Action": [
        "firehose:PutRecord",
        "firehose:PutRecordBatch"
      ],
      "Resource": "arn:aws:firehose:us-east-1:123456789012:deliverystream/my-firehose"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}
Question 287easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A company wants to migrate on-premises data to Amazon S3 using AWS DataSync. The data is stored on an NFS file server and the total volume is 50 TB. The network bandwidth between the on-premises data center and AWS is 1 Gbps (gigabit per second). What is the primary factor that will determine the total time required for the initial data transfer?

Question 288hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer runs an AWS Glue ETL job that reads from a large Amazon S3 source (several terabytes of CSV files) and writes transformed data to an S3 bucket in Parquet format. The job fails with the error shown in the exhibit. The job uses 10 workers of the G.1X worker type. The engineer needs to resolve the failure with minimal cost increase. What should the engineer do?

Exhibit

Refer to the exhibit.

Error log from AWS Glue job:
```
An error occurred while calling o123.pyWriteDynamicFrame.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 8, ip-10-0-1-45.ec2.internal): java.lang.OutOfMemoryError: Java heap space
```
Question 289mediummultiple choice
Read the full Data Ingestion and Transformation explanation →

A company wants to ingest data from a SaaS application into Amazon S3. The SaaS application supports streaming data via HTTP POST requests. The data volume is approximately 100 MB per hour, and the company needs to store the raw data in S3 for archival and later analysis. Which approach is the most cost-effective and operationally efficient?

Question 290easymultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to transform JSON data from a Kinesis Data Stream into Parquet format and store it in an S3 data lake. The transformation includes simple field mapping and data type conversions. Which AWS service is the most cost-effective for performing this transformation in near-real-time?

Question 291hardmultiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer runs an AWS Glue crawler that is configured to crawl an S3 bucket named 'my-data-lake' and update the Glue Data Catalog. The crawler fails with an access denied error. The IAM role attached to the crawler has the policy shown in the exhibit. What is the likely cause of the failure?

Exhibit

Refer to the exhibit.

IAM policy for an IAM role used by an AWS Glue crawler:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::my-data-lake/*",
                "arn:aws:s3:::my-data-lake"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable"
            ],
            "Resource": "*"
        }
    ]
}
```
Question 292 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company is using AWS Database Migration Service (DMS) to migrate a 2 TB Oracle database to Amazon Aurora PostgreSQL. The migration must have minimal downtime. The source database is highly active with continuous writes. Which DMS migration type and additional configuration should the engineer use?

Question 293 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to schedule an AWS Glue ETL job to run every hour and process new data that arrives in an S3 bucket. The job should only process files that have been added since the last run. Which approach should the engineer use to track which files have been processed?

Question 294 • medium • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline for real-time clickstream data from a website. The data must be ingested with low latency (seconds) and made available for multiple consumer applications, including a dashboard that refreshes every minute and a machine learning model that processes data in near-real-time. The engineer needs to choose a streaming ingestion service. Which TWO services meet these requirements? (Select TWO.)

Question 295 • hard • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer is troubleshooting a Kinesis Data Streams consumer application that is falling behind. The stream has 10 shards and is receiving 5 MB/s of data. The consumer uses the Kinesis Client Library (KCL) with a single worker. The worker is processing all 10 shards but is experiencing high latency and checkpointing delays. Which THREE actions should the engineer take to improve consumer performance? (Select THREE.)

Question 296 • easy • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to transform data in an S3 data lake using AWS Glue ETL. The data is in CSV format and needs to be converted to Parquet with partitioning by date. The engineer wants to minimize the number of files written to S3 to improve query performance. Which TWO configuration options should the engineer use? (Select TWO.)

Question 297 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company runs an e-commerce platform that generates clickstream data from user interactions on their website. The data is sent as JSON objects via HTTP POST to an API Gateway endpoint, which triggers a Lambda function that writes each record to a Kinesis Data Stream (100 shards). A second Lambda function consumes the stream, transforms the data (enriches with geolocation from a DynamoDB table), and writes to a Kinesis Data Firehose delivery stream that delivers Parquet files to an S3 data lake every 5 minutes. The system has been working for months, but recently the Firehose delivery stream started showing 'DeliveryFailed' errors for a subset of records. The errors point to 'InvalidData' from the Lambda transformation. The engineer reviews the Lambda transformation code and notices that the geolocation lookup occasionally fails because the DynamoDB table has a throttling issue. The engineer needs to handle these failures gracefully so that records that fail enrichment are still delivered to S3 with a null geolocation field, without blocking other records. Which course of action should the engineer take?

Question 298 • medium • multiple choice
Read the full VPN explanation →

A data engineer is responsible for ingesting log files from a fleet of on-premises servers into Amazon S3 for central analysis. Each server generates log files that are rotated every hour, resulting in files of about 500 MB each. The total daily data volume is approximately 1 TB. The network connection between the on-premises data center and AWS is a 100 Mbps VPN. The engineer needs to ensure that all log files are transferred to S3 within 24 hours of generation without data loss. The engineer is considering using AWS DataSync. However, the initial setup shows that the transfer speed is insufficient to meet the 24-hour SLA. What should the engineer do to meet the requirement?
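
The feasibility of the 24-hour window in this scenario can be checked with a quick calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 100 Mbps = 10^8 bits per second); the function name and the 80% utilization figure are hypothetical and are not part of the question.

```python
def transfer_hours(data_bytes: float, link_bps: float, utilization: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link of link_bps bits per second."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = (data_bytes * 8) / (link_bps * utilization)
    return seconds / 3600


DAILY_BYTES = 1e12  # about 1 TB per day (decimal units, an assumption)
VPN_BPS = 100e6     # 100 Mbps VPN link

ideal = transfer_hours(DAILY_BYTES, VPN_BPS)
# 80% effective utilization is a hypothetical figure for protocol and VPN overhead.
realistic = transfer_hours(DAILY_BYTES, VPN_BPS, utilization=0.8)
print(f"ideal: {ideal:.1f} h, at 80% utilization: {realistic:.1f} h")
```

Under these assumptions the daily volume needs roughly 22 hours even at full line rate, which leaves almost no headroom inside a 24-hour window.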

Question 299 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is setting up a data pipeline to ingest data from an Amazon RDS for MySQL database into Amazon S3 using AWS Glue ETL. The Glue job uses a JDBC connection to read from the MySQL database. The job runs successfully, but the engineer notices that the job is taking longer than expected. The MySQL database is 500 GB in size and the Glue job uses 10 workers of type G.1X. The engineer wants to improve the performance of the extraction phase. The database is actively used by other applications, so the engineer must minimize the impact on the source database. Which approach should the engineer take?

Question 300 • medium • multiple choice
Study the full Python automation breakdown →

A company uses AWS Glue to process data from multiple sources. The data is stored in an Amazon S3 data lake. The company needs to transform the data using a custom Python library that is not available in the default Glue environment. What is the MOST efficient way to make this library available to the Glue jobs?

Question 301 • easy • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a real-time streaming pipeline to ingest clickstream data from a website into Amazon S3. The data must be transformed before storage. Which TWO AWS services can be used together to build this pipeline? (Choose TWO.)

Question 302 • medium • multi select
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to run ETL jobs daily. The jobs consume data from an Amazon RDS for MySQL database and write results to Amazon S3. The company wants to minimize the impact on the source database during extraction. Which THREE actions should the data engineer take to achieve this? (Choose THREE.)

Question 303 • hard • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer is building a data ingestion pipeline using AWS Glue. The source is an Amazon DynamoDB table, and the target is an Amazon S3 data lake in Parquet format. The pipeline must handle large volumes and ensure exactly-once processing. Which THREE features should the engineer use together to achieve this? (Choose THREE.)

Question 304 • medium • multi select
Read the full Data Ingestion and Transformation explanation →

A company is ingesting IoT sensor data from thousands of devices using Amazon Kinesis Data Streams. The data is consumed by a Lambda function that transforms and writes to Amazon S3. The company notices that occasionally records are dropped. The data engineer needs to identify the cause and prevent data loss. Which TWO actions should the data engineer take? (Choose TWO.)

Question 305 • easy • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest data from multiple on-premises relational databases into Amazon S3 for analytics. The data must be transformed and loaded daily. Which THREE AWS services should the engineer use together to build this pipeline? (Choose THREE.)

Question 306 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A financial services company ingests real-time stock trade data using Amazon Kinesis Data Streams with 10 shards. Each shard receives about 500 records per second, each record approximately 1 KB. The data is consumed by a single AWS Lambda function that transforms the data and writes to Amazon S3. The Lambda function is configured with 1024 MB memory and a timeout of 5 minutes. The company notices that the Lambda function is frequently throttled, and data ingestion lags behind. The Lambda function's CloudWatch metrics show that the iterator age is increasing, and the function's concurrency is maxed out at 1000. The data engineer needs to resolve the throttling issue without changing the Lambda function code. What should the data engineer do?

Question 307 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A retail company uses AWS Glue to process daily sales data from multiple CSV files stored in Amazon S3. The Glue job runs a PySpark script that reads the files, performs joins, and writes the output as Parquet. Recently, the job has been failing with 'Out of Memory' errors. The data volume has grown from 10 GB to 50 GB per day. The Glue job uses 10 DPUs and the standard worker type. The data engineer needs to fix the job without rewriting the script. What should the data engineer do?

Question 308 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A media company uses Amazon Kinesis Data Firehose to ingest log data from web servers into Amazon S3. The data is then processed by AWS Glue jobs. The company wants to ensure that data is delivered to S3 within 5 minutes of ingestion. Currently, the Firehose delivery stream is configured with a buffer interval of 300 seconds and a buffer size of 5 MB. The log data arrives at a rate of 2 MB per second. The data engineer notices that some log files are delayed by up to 10 minutes. The company cannot change the buffer size due to downstream requirements. What should the data engineer do to meet the 5-minute delivery requirement?
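
Firehose delivers a buffer when either the size hint or the interval hint is reached, whichever comes first. The sketch below works out which hint is hit first for the numbers in this scenario; the function name is illustrative, and the model assumes a steady arrival rate.

```python
def first_flush_trigger(rate_mb_per_s: float, buffer_mb: float, interval_s: float):
    """Return (trigger, seconds) for whichever buffering hint is reached first."""
    seconds_to_fill = buffer_mb / rate_mb_per_s
    if seconds_to_fill <= interval_s:
        return "size", seconds_to_fill
    return "interval", interval_s


# Values taken from the scenario: 2 MB/s arrival, 5 MB buffer, 300-second interval.
trigger, seconds = first_flush_trigger(rate_mb_per_s=2, buffer_mb=5, interval_s=300)
print(trigger, seconds)  # size 2.5
```

With a steady 2 MB/s the 5 MB size hint is reached in about 2.5 seconds, so under this simplified model the 300-second interval matters only when traffic drops well below the stated rate.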

Question 309 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A gaming company collects player event data from mobile devices. The data is sent to an Amazon API Gateway endpoint, which triggers an AWS Lambda function that writes the data to an Amazon DynamoDB table. The company wants to also store the data in Amazon S3 for historical analysis. The data volume is about 100 GB per day. The data engineer needs to design a solution to copy data from DynamoDB to S3 with minimal impact on the DynamoDB table. What should the data engineer do?

Question 310 • hard • multiple choice
Read the full NAT/PAT explanation →

A healthcare company processes patient records in near-real-time using Amazon Kinesis Data Streams. Each record contains sensitive personal health information (PHI). The data must be encrypted at rest and in transit. The company also needs to audit access to the data. The data engineer is designing the ingestion pipeline. Which combination of services and configurations meets these requirements?

Question 311 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A logistics company uses AWS Glue to process GPS data from delivery trucks. The data is stored in Amazon S3 as JSON files. The Glue job reads the JSON files, converts them to Parquet, and writes them back to S3. The company notices that the Glue job takes too long to complete. The data engineer wants to improve the job's performance without changing the code. What should the data engineer do?

Question 312 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A social media company ingests user activity data from multiple sources using Amazon Kinesis Data Firehose. The data is delivered to Amazon S3 in near-real-time. The company wants to transform the data by adding a timestamp and masking email addresses before storing it in S3. The transformation should be applied to all records. What is the most cost-effective way to implement this transformation?

Question 313 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

An e-commerce company uses AWS Glue to process clickstream data from its website. The data is stored in Amazon S3 in partitioned Parquet format by date and hour. A recent increase in traffic has caused the Glue job to fail with 'Java heap space' errors. The job runs with 10 DPUs and uses Spark's default configurations. The data engineer needs to resolve the memory issue without modifying the ETL script. What should the data engineer do?

Question 314 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3. The data engineer wants to minimize operational overhead and avoid managing any servers. Which AWS service should the data engineer use?

Question 315 • easy • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest streaming data from an e-commerce application into Amazon S3 for near-real-time analytics. The solution must handle variable throughput and allow reprocessing of failed records. Which TWO AWS services should the engineer use? (Choose two.)

Question 316 • medium • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineering team is building a pipeline to transform CSV files uploaded to Amazon S3 into Parquet format using AWS Glue. The transformation must be serverless and handle files that arrive at irregular intervals. Which TWO actions should the team take? (Choose two.)

Question 317 • hard • multi select
Read the full Data Ingestion and Transformation explanation →

A company ingests IoT sensor data into Amazon Kinesis Data Streams. The data must be enriched with device metadata from Amazon DynamoDB and then stored in Amazon S3 in Apache Parquet format. The solution must minimize latency and cost. Which THREE steps should a data engineer implement? (Choose three.)

Question 318 • hard • multiple choice
Read the full NAT/PAT explanation →

A financial services company ingests stock trade data from multiple exchanges into an Amazon S3 bucket (trade-bucket). Each exchange sends a CSV file every 5 minutes. The data must be transformed into Parquet format and partitioned by exchange and date (trade_date) for efficient querying using Amazon Athena. The pipeline must handle late-arriving data (files up to 2 hours late) and ensure exactly-once processing to avoid duplicates. Currently, a scheduled AWS Glue ETL job runs every hour, reads new CSV files, converts them to Parquet, and writes to an output bucket. However, the team is experiencing data duplication: if the job fails midway, upon retry it reprocesses all files in the input folder, causing duplicates in the output. Additionally, the job takes too long because it scans all files each run. The engineer must redesign the pipeline to eliminate duplicates and improve efficiency. What should the engineer do?

Question 319 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A retail company uses Amazon Kinesis Data Firehose to ingest clickstream data from its website into an Amazon S3 bucket. The data includes fields: user_id, event_type, timestamp, page_url. Recently, the data engineering team noticed that some records have malformed JSON (missing commas, extra brackets) causing delivery failures to S3. The Firehose delivery stream is configured to retry failed records for 300 seconds, after which the records are sent to an S3 bucket for failed records. The team wants to transform the data to correct malformed JSON before delivery to the main S3 bucket. They need a solution that does not require managing servers and can handle high throughput. What should the team do?

Question 320 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A marketing analytics team needs to ingest customer transaction data from an on-premises PostgreSQL database into Amazon S3 for analysis. The data volume is about 10 GB daily, and the team wants to perform full refresh daily (truncate and load) into S3 as Parquet files. The company has a Direct Connect connection to AWS. The team needs a simple, managed solution that minimizes operational overhead. What should the team use?

Question 321 • medium • multiple choice
Read the full NAT/PAT explanation →

A logistics company ingests real-time GPS location data from thousands of delivery vehicles into Amazon Kinesis Data Streams. Each vehicle sends a JSON payload every 10 seconds containing vehicle_id, latitude, longitude, timestamp, and speed. The data must be stored in Amazon S3 for historical analysis, but the company wants to first aggregate the data per vehicle per minute (average speed, min/max coordinates) to reduce storage costs. The solution must be serverless and handle potential duplicate records without double-counting. What should the engineer do?

Question 322 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A healthcare company is building a data pipeline to ingest electronic health records (EHR) from hospitals. The data is sent as JSON files via SFTP to an on-premises server. The company wants to move this data to AWS using AWS Transfer Family (SFTP) and then process it with AWS Glue. Data sovereignty regulations require that all data remain within the EU (Frankfurt) region. The pipeline must detect when a new file arrives and start the Glue job automatically. The engineer has set up an AWS Transfer Family server in Frankfurt, and files are uploaded to an S3 bucket in the same region. However, the Glue job is not triggering automatically. The engineer needs to implement automated triggering. What should the engineer do?

Question 323 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A media company ingests large video files (up to 100 GB each) from content creators via Amazon S3 multipart uploads. After upload, the company needs to transcode the videos into multiple formats using AWS Elemental MediaConvert. The current pipeline uses S3 event notifications to trigger an AWS Lambda function that starts a MediaConvert job. However, for very large files, the Lambda function times out (15-minute limit) before the upload completes because the event is sent when the multipart upload is initiated, not when it completes. How should the engineer fix this issue?

Question 324 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest log files from multiple EC2 instances into Amazon S3. The logs are written to local disk on each instance. The engineer wants a simple agent-based solution that can collect, compress, and upload logs to S3 with minimal configuration. The solution must support incremental uploads (only new log lines) and handle log rotation. What should the engineer use?

Question 325 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A financial company needs to ingest real-time stock trade data from multiple sources and store it in Amazon S3 for compliance. The data must be delivered within 1 minute of the trade occurring. The data volume is approximately 10,000 records per second, with occasional spikes to 50,000 records per second. The engineer has set up Amazon Kinesis Data Streams with 10 shards and a Kinesis Data Firehose delivery stream that reads from the Kinesis stream and writes to S3. However, during spikes, the Firehose delivery stream falls behind, causing data to be delayed beyond the 1-minute SLA. What should the engineer do to meet the SLA without over-provisioning?
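
The gap between the provisioned stream and the spike load can be estimated from the per-shard write limit of 1,000 records per second. The sketch below covers only the record-count limit, because the scenario does not state a record size; the constant and function names are illustrative.

```python
import math

SHARD_WRITE_RECORDS_PER_S = 1_000  # per-shard write limit in provisioned mode (records/second)


def shards_for_record_rate(records_per_s: int) -> int:
    """Minimum shard count needed to accept records_per_s without write throttling."""
    return math.ceil(records_per_s / SHARD_WRITE_RECORDS_PER_S)


print(shards_for_record_rate(10_000))  # 10
print(shards_for_record_rate(50_000))  # 50
```

Ten shards match the baseline rate exactly, while the stated spikes would call for about five times that capacity by this estimate.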

Question 326 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A social media company ingests user activity data from multiple sources into Amazon S3. The data is in JSON format and includes fields: user_id, activity_type, timestamp, and metadata. The company wants to transform this data into a columnar format (Parquet) partitioned by date and activity_type for efficient querying with Amazon Athena. The pipeline must handle data that arrives up to 3 days late. Currently, a daily AWS Glue ETL job scans the entire S3 bucket for new files, transforms them, and writes to a separate output bucket. The job is taking longer as data volume grows, and the team wants to reduce processing time and cost. What should the engineer do?

Question 327 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A gaming company ingests player event data from mobile games into Amazon Kinesis Data Streams. Each event is a small JSON payload (<1 KB). The data must be delivered to Amazon S3 for analytics, and the company wants to minimize storage costs by aggregating events into larger files (e.g., 100 MB per file). The current setup uses Kinesis Data Firehose with a buffer size of 10 MB and a buffer interval of 60 seconds, but the resulting files are very small (average 5 MB) because the data volume is low. The engineer needs to ensure that files are at least 100 MB to reduce the number of S3 objects and lower costs. What should the engineer do?
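
A rough estimate of how long the buffer would need to stay open to reach the target file size follows from the observed output. The sketch below assumes the arrival rate stays at the observed average (about 5 MB per 60-second interval); the function name is hypothetical.

```python
def seconds_to_reach(target_mb: float, observed_mb: float, observed_interval_s: float) -> float:
    """Seconds needed to accumulate target_mb at the observed arrival rate."""
    # Multiply before dividing so the scenario's round numbers stay exact.
    return target_mb * observed_interval_s / observed_mb


needed = seconds_to_reach(target_mb=100, observed_mb=5, observed_interval_s=60)
print(needed, needed / 60)  # 1200.0 20.0
```

At the observed rate a 100 MB object needs about 20 minutes of data, which is worth comparing against the longest buffer interval the delivery stream allows before deciding whether buffering settings alone can meet the goal.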

Question 328 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineering team is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time and then loaded into an Amazon S3 bucket for long-term storage. Which AWS service should be used to perform the transformation and delivery to S3 with minimal operational overhead?

Question 329 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS DMS to migrate a 2 TB PostgreSQL database to Amazon Aurora PostgreSQL. The migration is taking longer than expected due to the initial load. Which AWS service can be used to accelerate the initial load by transferring the database files directly?

Question 330 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data pipeline ingests daily CSV files from an FTP server into an Amazon S3 bucket. The files must be converted to Parquet format and partitioned by date for efficient querying using Amazon Athena. Which AWS service is most suitable for this transformation?

Question 331 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineering team is ingesting data from multiple sources into Amazon S3 using AWS Glue ETL jobs. The jobs are failing intermittently with the error: 'Task ran out of memory'. The input data size varies widely from 100 MB to 10 GB per job. Which configuration change would best mitigate this issue?

Question 332 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company wants to ingest real-time clickstream data from a website into Amazon S3 with a maximum latency of 60 seconds. The data volume peaks at 500 MB/s. Which service should they use to buffer and deliver the data to S3?

Question 333 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a pipeline that ingests JSON logs from an application into Amazon S3. The logs contain a timestamp field. The pipeline must partition the data by date in S3 (e.g., year=2024/month=10/day=01). Which approach minimizes transformation effort?
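
The partition layout named in the question (year=2024/month=10/day=01) can be derived directly from the timestamp field. The sketch below assumes the timestamp is an ISO 8601 string and that timestamps without an offset are UTC; the function name is illustrative, and the sketch is independent of whichever service performs the partitioning.

```python
from datetime import datetime, timezone


def partition_prefix(timestamp: str) -> str:
    """Build a Hive-style date prefix from an ISO 8601 timestamp."""
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalize it.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    utc = parsed.astimezone(timezone.utc)
    return f"year={utc:%Y}/month={utc:%m}/day={utc:%d}"


print(partition_prefix("2024-10-01T10:00:00Z"))  # year=2024/month=10/day=01
```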

Question 334 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company is using AWS DMS to replicate data from an on-premises Oracle database to Amazon RDS for MySQL. The replication is working, but the target table has a different schema. Which DMS feature should be used to transform the source schema to match the target?

Question 335 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest data from an Amazon S3 bucket into Amazon Redshift for analytics. The data is in CSV format and the Redshift table already exists. Which service can be used to perform this ingestion with minimal configuration?

Question 336 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A streaming application sends data to Amazon Kinesis Data Streams. The data must be enriched with reference data from an Amazon DynamoDB table in real-time. Which AWS service can be used to perform this enrichment with minimal latency?

Question 337 • medium • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineering team is designing a data ingestion pipeline for a social media analytics platform. The pipeline must handle up to 100,000 events per second with less than 1 second processing latency. Which TWO services should be used together to meet these requirements?

Question 338 • hard • multi select
Read the full Data Ingestion and Transformation explanation →

A company is migrating its data warehouse from on-premises to Amazon Redshift. The migration involves copying 50 TB of data from an S3 bucket to Redshift. The network bandwidth is limited to 1 Gbps. Which TWO approaches should the team use to complete the transfer within 7 days?

Question 339 • easy • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest JSON files from an Amazon S3 bucket into an Amazon DynamoDB table. The files are uploaded every hour. Which THREE services can be used together to build this ingestion pipeline?

Question 340 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer runs an AWS Glue ETL job that writes output to an S3 bucket. The job fails with the error shown. What is the most likely cause?

Exhibit

aws glue get-job-runs --job-name my-etl-job
{
  "JobRuns": [
    {
      "Id": "jr_123",
      "JobRunState": "FAILED",
      "StartedOn": "2024-10-01T10:00:00Z",
      "CompletedOn": "2024-10-01T10:05:00Z"
    }
  ]
}
Question 341 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer is troubleshooting an AWS Lambda function that reads from an S3 bucket and writes to a Kinesis Data Stream. The Lambda function fails with an AccessDeniedException when calling the kinesis:PutRecords API. Which change is needed to the IAM policy?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::data-bucket",
        "arn:aws:s3:::data-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/input-stream"
    }
  ]
}
Question 342 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

Refer to the exhibit. A data engineer is using a Kinesis Data Stream with 2 shards. The producer uses a partition key that is the user ID (a UUID). The consumer is falling behind. Which change would improve throughput?

Exhibit

aws kinesis describe-stream --stream-name my-stream
{
  "StreamDescription": {
    "StreamName": "my-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "170141183460469231731687303715884105727"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49617280354433721362922140867345427375946737258393878530"
        }
      },
      {
        "ShardId": "shardId-000000000001",
        "HashKeyRange": {
          "StartingHashKey": "170141183460469231731687303715884105728",
          "EndingHashKey": "340282366920938463463374607431768211455"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49617280354433721362922140867345427375946737258393878531"
        }
      }
    ]
  }
}
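
Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below applies that mapping to the two ranges listed in the exhibit so the spread of UUID partition keys can be inspected; the helper names are illustrative and the random UUIDs stand in for real user IDs.

```python
import hashlib
import uuid
from collections import Counter

# Hash key ranges copied from the exhibit (inclusive bounds).
SHARDS = {
    "shardId-000000000000": (0, 170141183460469231731687303715884105727),
    "shardId-000000000001": (170141183460469231731687303715884105728,
                             340282366920938463463374607431768211455),
}


def shard_for_hash(hash_key: int) -> str:
    """Return the shard whose hash key range contains hash_key."""
    for shard_id, (start, end) in SHARDS.items():
        if start <= hash_key <= end:
            return shard_id
    raise ValueError("hash key outside all shard ranges")


def shard_for_partition_key(partition_key: str) -> str:
    """Map a partition key to a shard via the MD5 hash of the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return shard_for_hash(int.from_bytes(digest, "big"))


# Simulate 10,000 random UUID partition keys (stand-ins for user IDs).
counts = Counter(shard_for_partition_key(str(uuid.uuid4())) for _ in range(10_000))
print(counts)
```

In a run like this the keys should split close to evenly across both shards, which suggests that partition key skew is unlikely to be what holds the consumer back in this scenario.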
Question 343 • easy • multiple choice
Read the full NAT/PAT explanation →

A company receives streaming clickstream data from its website. The data must be ingested with low latency and transformed in real time before being stored in Amazon S3. Which AWS service combination is most suitable for this use case?

Question 344 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline to load millions of small JSON files from an on-premises FTP server into Amazon S3. The pipeline should minimize cost and operational overhead. Which approach is most suitable?

Question 345 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is transformed using an AWS Lambda function. Recently, the transformation errors have increased due to Lambda timeouts. The data engineer needs to diagnose and resolve the issue without losing data. What should the engineer do?

Question 346 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer needs to ingest data from an Amazon RDS for PostgreSQL database into Amazon S3 on a daily basis. The data volume is approximately 500 GB per day. Which service is most appropriate for this task?

Question 347 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses AWS Glue to transform data in Amazon S3. The transformation logic is complex and involves multiple steps. The data engineer wants to implement a workflow that handles dependencies and retries on failure. Which AWS service should be used to orchestrate the Glue jobs?

Question 348 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is building a streaming pipeline using Amazon Kinesis Data Streams and AWS Lambda. The Lambda function processes records and writes to Amazon DynamoDB. The engineer notices that the Lambda function is throttled during high traffic. Which action should the engineer take to reduce throttling?

Question 349 • easy • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company needs to ingest data from multiple SaaS applications into Amazon S3. The data sources provide REST APIs. Which AWS service can be used to build a fully managed data ingestion pipeline without writing custom code?

Question 350 • medium • multiple choice
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data lake on Amazon S3. The data ingestion pipeline must handle both batch and streaming data. The engineer wants to use a single service to ingest both types of data. Which service should the engineer choose?

Question 351 • hard • multiple choice
Read the full Data Ingestion and Transformation explanation →

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is transformed using an AWS Lambda function. Some records fail transformation and are lost because the Lambda function throws an exception. The data engineer needs to capture the failed records for analysis without affecting the pipeline. What should the engineer do?

Question 352 • medium • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer is designing a data ingestion pipeline for streaming data from IoT devices. The devices send JSON messages every second. The engineer needs to ingest the data with low latency and store it in Amazon S3 in Parquet format. Which TWO services should the engineer use together?

Question 353 • hard • multi select
Read the full Data Ingestion and Transformation explanation →

A data engineer is building a pipeline to ingest data from an on-premises Oracle database into Amazon S3. The pipeline must capture change data (CDC) in near real-time and handle schema changes. Which TWO AWS services should the engineer use?

Question 354 • medium • multi select

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data engineer needs to transform the data before delivery. Which THREE options can be used to perform the transformation?

Question 355 • hard • multiple choice

Refer to the exhibit. An IAM policy is attached to an AWS Glue ETL job. The job reads from the Kinesis stream 'input-stream' and writes to S3 bucket 'data-lake-bucket'. The job fails with an access denied error. Which missing permission is most likely the cause?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::data-lake-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/input-stream"
    }
  ]
}
```
Question 356 • medium • multiple choice

Refer to the exhibit. A data engineer is using a Kinesis data stream with one shard. The application writes 2,000 records per second, each 1 KB. The PutRecord calls are frequently throttled. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
$ aws kinesis describe-stream --stream-name data-stream
{
  "StreamDescription": {
    "StreamName": "data-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/data-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "340282366920938463463374607431768211455"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49640055075767719372430087390896614568350141964484706306"
        }
      }
    ],
    "EnhancedMonitoring": [],
    "EncryptionType": "KMS",
    "KeyId": "arn:aws:kms:us-east-1:123456789012:key/abc12345-...",
    "RetentionPeriodHours": 24
  }
}
```
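
The write rate in this scenario can be compared against per-shard write limits with a little arithmetic. The following is a minimal sketch, assuming the provisioned-mode limits of 1,000 records per second and 1 MiB per second per shard, and treating "1 KB" as 1,024 bytes. `shards_needed` is a hypothetical helper for illustration, not an AWS API.

```python
import math

# Assumed per-shard write limits for a provisioned Kinesis data stream.
MAX_RECORDS_PER_SEC = 1_000
MAX_BYTES_PER_SEC = 1024 * 1024  # 1 MiB


def shards_needed(records_per_sec: int, record_bytes: int) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_count = math.ceil(records_per_sec / MAX_RECORDS_PER_SEC)
    by_bytes = math.ceil(records_per_sec * record_bytes / MAX_BYTES_PER_SEC)
    return max(by_count, by_bytes, 1)


# Scenario from the question: 2,000 records/s at 1 KB each on a one-shard stream.
required = shards_needed(2_000, 1024)
print(f"shards required: {required}, shards provisioned: 1")
```

Under these assumptions both the record-count limit and the byte limit call for more than the single shard shown in the exhibit.
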
Question 357 • easy • multiple choice

Refer to the exhibit. An S3 event notification is configured to trigger an AWS Lambda function when objects are created in 'my-bucket'. The Lambda function processes the JSON file and writes results to Amazon DynamoDB. The function fails with a timeout error. Which action should the engineer take to resolve the issue?

Exhibit

Refer to the exhibit.

```
{
  "Records": [
    {
      "eventVersion": "2.0",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "s3SchemaVersion": "1.0",
        "bucket": {
          "name": "my-bucket",
          "arn": "arn:aws:s3:::my-bucket"
        },
        "object": {
          "key": "data/2024/01/01/file.json",
          "size": 1024,
          "eTag": "abc123"
        }
      }
    }
  ]
}
```
Question 358 • easy • multiple choice

A data engineering team is ingesting streaming data from IoT devices using AWS IoT Core and needs to process the data in near real-time with minimal code. Which AWS service should they use to transform the data before storing it in Amazon S3?

Question 359 • medium • multiple choice

A company uses Amazon S3 to store raw data and needs to transform it into Parquet format for analytics. The transformation job runs daily on a schedule. Which AWS service is BEST suited for this task?

Question 360 • hard • multiple choice

A company is ingesting data from multiple on-premises databases into AWS using AWS Database Migration Service (DMS). The data must be continuously replicated with minimal downtime. However, the source databases do not support native CDC. What should the data engineer do to enable continuous replication?

Question 361 • medium • multiple choice

A data engineering team is designing a data ingestion pipeline that will receive millions of small JSON files per hour from external partners via API. The files should be stored in Amazon S3 and then transformed into Parquet for querying. Which approach is MOST cost-effective and scalable?

Question 362 • easy • multiple choice

A company wants to import data from an external FTP server into Amazon S3 on a daily basis. The data volumes are moderate. Which AWS service is MOST suitable for this task?

Question 363 • hard • multiple choice

A data engineer is troubleshooting a slow AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job processes 10 GB of CSV data. The engineer notices that the job runs with a single DPU and takes longer than expected. Which change would MOST likely improve performance?

Question 364 • medium • multiple choice

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data must be transformed (e.g., enriched with user location) before being stored in Amazon S3. Which architecture is MOST efficient for this transformation?

Question 365 • hard • multiple choice

A company has a 100 TB dataset stored on-premises in a Hadoop cluster. They want to ingest this data into Amazon S3 for processing with AWS Glue. The company has a limited time window and a slow internet connection. Which strategy is MOST appropriate?

Question 366 • easy • multiple choice

A data engineer needs to ingest streaming data from a social media API into Amazon S3 for batch analytics. The data arrives at a rate of 500 records per second. Which service should be used to capture the stream?

Question 367 • medium • multi select

A company is designing a data ingestion pipeline for real-time sensor data from thousands of devices. The data must be processed with low latency and stored in Amazon S3. Which TWO services would be appropriate for this use case? (Choose TWO.)

Question 368 • hard • multi select

A company uses AWS Glue to transform data from Amazon S3 into Parquet format. The job fails with an out-of-memory error for large files. Which TWO actions can resolve this issue? (Choose TWO.)

Question 369 • medium • multi select

A company is building a data lake on Amazon S3 and needs to ingest data from multiple sources. The ingestion must be automated and handle schema changes. Which THREE services can be used together to achieve this? (Choose THREE.)

Question 370 • medium • multiple choice

An IAM policy is attached to an AWS Glue job. The job needs to read from and write to S3 buckets, and also trigger other Glue jobs. The job is failing with an AccessDenied error when trying to write to a bucket named 'example-bucket'. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "*"
    }
  ]
}
Question 371 • hard • multiple choice

A data engineer runs the describe-stream command and sees the output shown in the exhibit. The stream has a retention period of 24 hours. The engineer needs to ensure that consumers can replay data for up to 7 days. Which action is required?

Exhibit

Refer to the exhibit.

Command: aws kinesis describe-stream --stream-name my-stream

{
  "StreamDescription": {
    "StreamName": "my-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "ParentShardId": null,
        "AdjacentParentShardId": null,
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "340282366920938463463374607431768211455"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49611900753075948581609125847874523344139236273755832322"
        }
      }
    ],
    "EnhancedMonitoring": [],
    "EncryptionType": "NONE",
    "KeyId": null,
    "RetentionPeriodHours": 24,
    "StreamCreationTimestamp": 1672531200.0
  }
}
Question 372 • easy • multiple choice

An IAM policy includes the resource ARN shown in the exhibit for CloudWatch Logs. A data engineer needs to allow a Lambda function to create log streams and put log events to the log group 'my-log-group'. However, the Lambda function is failing with an access denied error. What is the issue?

Exhibit

Refer to the exhibit.

Resource: "arn:aws:logs:us-east-1:123456789012:log-group:my-log-group:*"
Question 373 • medium • multiple choice

A company wants to ingest streaming data from thousands of IoT devices into Amazon S3 with minimal latency and then transform the data using Spark SQL. Which AWS service should be used for data ingestion?

Question 374 • hard • multiple choice

A data engineering team is troubleshooting a slow AWS Glue ETL job that reads from an Amazon DynamoDB table and writes to Amazon S3 in Parquet format. The job processes 50 GB of data. Which action would most effectively improve job performance?

Question 375 • easy • multiple choice

A company needs to ingest data from an on-premises Oracle database into Amazon Redshift for analytics. The data volume is 500 GB and the network bandwidth is limited. Which AWS service should be used for the initial one-time data migration?

Question 376 • medium • multiple choice

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by an AWS Lambda function that writes to Amazon DynamoDB. The Lambda function is seeing high error rates due to DynamoDB write throttling. Which action should be taken to reduce throttling?

Question 377 • hard • multiple choice

A data pipeline uses AWS Glue to read from an Amazon S3 bucket containing millions of small CSV files (each < 1 MB). The ETL job is slow. Which optimization would most improve performance?

Question 378 • easy • multiple choice

A company wants to transform data in Amazon S3 using SQL queries without provisioning servers. The transformations are ad-hoc and run occasionally. Which service should be used?

Question 379 • medium • multiple choice

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format and needs to be converted to Parquet. However, the conversion is failing. What is the most likely cause?

Question 380 • hard • multiple choice

A company is designing a data ingestion pipeline for real-time analytics. The source is a relational database, and the target is Amazon Redshift. The pipeline must handle schema changes in the source database automatically. Which combination of services should be used?

Question 381 • easy • multiple choice

A company has a nightly batch job that processes 100 GB of data from an Amazon S3 bucket and loads it into an Amazon Redshift table. The job currently runs on an Amazon EMR cluster. Which service would reduce operational overhead while providing similar functionality?

Question 382 • medium • multi select

Which TWO AWS services can be used to ingest streaming data from a mobile application into Amazon S3 for near-real-time analytics? (Choose 2.)

Question 383 • hard • multi select

Which THREE factors should be considered when selecting a data ingestion service for a high-volume, real-time streaming pipeline that requires exactly-once processing semantics? (Choose 3.)

Question 384 • easy • multi select

Which TWO AWS services can be used to transform data in an Amazon S3 data lake before loading into Amazon Redshift? (Choose 2.)

Question 385 • medium • multiple choice

The exhibit shows an AWS CLI command and its output. A data engineer wants to copy only objects larger than 10 MB from the S3 bucket to another bucket for processing. Which approach should be used to automate this task?

Exhibit

Refer to the exhibit.

aws s3api list-objects --bucket my-data-lake --prefix logs/2023/ --query 'Contents[?Size>`10000000`]'

Output:
[
  {
    "Key": "logs/2023/01/01/app.log",
    "Size": 15000000
  },
  {
    "Key": "logs/2023/01/02/app.log",
    "Size": 20000000
  }
]
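
The size filter in the exhibit's query can be expressed as an ordinary list filter, which is roughly what any automated copy step would need to evaluate per object. This is a minimal sketch over hard-coded sample data; the third entry and the `larger_than` helper are hypothetical and added only for illustration.

```python
# Sample listing mirroring the exhibit's output; a real listing would come
# from the S3 API rather than a hard-coded list.
objects = [
    {"Key": "logs/2023/01/01/app.log", "Size": 15000000},
    {"Key": "logs/2023/01/02/app.log", "Size": 20000000},
    {"Key": "logs/2023/01/03/app.log", "Size": 4096},  # hypothetical small object
]

THRESHOLD_BYTES = 10_000_000  # same literal as the query filter in the exhibit


def larger_than(items, threshold):
    """Keep only the entries whose Size exceeds the threshold."""
    return [item for item in items if item["Size"] > threshold]


selected = larger_than(objects, THRESHOLD_BYTES)
for item in selected:
    print(item["Key"], item["Size"])
```

Note that the literal 10000000 is 10 MB in decimal units; a threshold of 10 MiB would be 10,485,760 bytes.
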
Question 386 • hard • multiple choice

The exhibit shows an IAM policy attached to a role used by an AWS Glue ETL job. The job reads from an S3 bucket and writes to another S3 bucket. However, the job fails with an access denied error when trying to write to the output bucket. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::my-data-lake/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun",
                "glue:GetJobRun"
            ],
            "Resource": "*"
        }
    ]
}
Question 387 • easy • multiple choice

The exhibit shows the output of describing an Amazon Kinesis Data Stream. A producer is sending records but the consumer is not receiving all records. What is the most likely cause?

Exhibit

Refer to the exhibit.

aws kinesis describe-stream --stream-name clickstream

Output:
{
  "StreamDescription": {
    "StreamName": "clickstream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "ParentShardId": null,
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "340282366920938463463374607431768211455"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49639287029282832212345678901234567890"
        }
      }
    ],
    "RetentionPeriodHours": 24
  }
}
Question 388 • medium • multiple choice

A data pipeline ingests CSV files from an S3 bucket into a Redshift table using the COPY command. Recently, files with inconsistent column delimiters (some use pipes, others use commas) have been arriving. The pipeline must handle both delimiters without manual intervention. What is the MOST efficient solution?

Question 389 • hard • multiple choice

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The data is then processed by a Kinesis Data Analytics application running SQL queries. The analytics application is falling behind and processing records with increasing latency. The stream has 4 shards, and the average record size is 5 KB. What is the MOST effective way to improve processing latency?

Question 390 • easy • multiple choice

A data engineer needs to ingest data from a relational database (MySQL) into Amazon S3 for analytics. The database is 500 GB and the job must run daily with incremental updates. Which AWS service is BEST suited for this task?

Question 391 • medium • multiple choice

A company uses AWS Lambda to process messages from an Amazon SQS queue. The messages contain JSON payloads that need to be transformed and written to an Amazon DynamoDB table. Recently, the Lambda function has been timing out and messages are being sent to the dead-letter queue (DLQ). What is the BEST way to troubleshoot and resolve this issue?

Question 392 • hard • multiple choice

A data pipeline uses Amazon Kinesis Data Firehose to ingest log data from web servers and deliver it to Amazon S3. The data is then transformed by an AWS Glue job before being loaded into Amazon Redshift. The pipeline must handle a sudden spike in log volume without data loss. Which configuration change is MOST appropriate?

Question 393 • easy • multiple choice

A company wants to ingest streaming data from IoT devices into Amazon S3 using Amazon Kinesis Data Firehose. The data must be transformed from JSON to Parquet format before landing in S3. What is the SIMPLEST way to achieve this?

Question 394 • medium • multiple choice

A data engineer is designing a pipeline to ingest change data capture (CDC) events from an Amazon RDS for PostgreSQL database into Amazon S3. The CDC events are captured using AWS DMS. The data must be available for querying within 5 minutes of the change. Which approach meets these requirements?

Question 395 • hard • multiple choice

A company uses Amazon Kinesis Data Analytics for real-time anomaly detection on clickstream data. The application uses a sliding window of 1 minute. The data engineer notices that the application is producing incorrect results because late-arriving records are not being handled properly. What should the data engineer do to ensure late records are included in the window calculations?

Question 396 • easy • multiple choice

A company needs to ingest data from a self-managed Apache Kafka cluster running on EC2 into Amazon S3. The data must be delivered in near real-time. Which AWS service is BEST suited for this task?

Question 397 • medium • multi select

A data engineer is designing a pipeline to ingest daily CSV files from an SFTP server into Amazon S3. The files are large (up to 10 GB) and must be encrypted in transit. The pipeline should be fully managed and serverless where possible. Which TWO services should be used together to achieve this? (Choose TWO.)

Question 398 • hard • multi select

A company is ingesting IoT sensor data into Amazon Kinesis Data Streams. Each sensor sends a JSON payload every second. The data must be transformed and aggregated in real-time before being stored in Amazon DynamoDB. Which THREE services should be used together in the pipeline? (Choose THREE.)

Question 399 • medium • multi select

A company wants to use AWS Glue to transform data stored in Amazon S3. The data is partitioned by date and includes both CSV and Parquet files. The transformation should be optimized for cost and performance. Which THREE actions should the data engineer take? (Choose THREE.)

Question 400 • medium • multiple choice

Refer to the exhibit. An IAM policy is attached to an EC2 instance role that runs a data ingestion application. The application reads files from an S3 bucket 'data-lake-primary' and sends records to a Kinesis stream named 'clickstream'. The application is failing with an 'AccessDenied' error when trying to read from S3. What is the MOST likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::data-lake-primary/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream"
    }
  ]
}
Question 401 • hard • multiple choice

Refer to the exhibit. A data engineer runs the describe-stream command and sees this output. The application is writing records to the stream but is experiencing high write latency. The average record size is 50 KB, and the write rate is 1500 records per second. What is the MOST likely cause of the latency?

Exhibit

aws kinesis describe-stream --stream-name my-data-stream

{
  "StreamDescription": {
    "StreamName": "my-data-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-data-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000001",
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "113427455640312821154458202477256070485"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "496103355238277109119582415771234567890123456789"
        }
      },
      {
        "ShardId": "shardId-000000000002",
        "HashKeyRange": {
          "StartingHashKey": "113427455640312821154458202477256070486",
          "EndingHashKey": "226854911280625642308916404954512140970"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "496103355238277109119582415771234567890123456790"
        }
      }
    ]
  }
}
Question 402 • easy • multiple choice

Refer to the exhibit. A data engineer runs this AWS Glue Data Catalog DDL statement to create a table. The CSV files in 's3://my-bucket/sales/' use a pipe delimiter (|) instead of a comma. What change is needed to correctly read the data?

Exhibit

CREATE EXTERNAL TABLE IF NOT EXISTS my_database.sales (
  order_id INT,
  customer_name STRING,
  product STRING,
  amount DECIMAL(10,2),
  order_date STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ','
)
LOCATION 's3://my-bucket/sales/'
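
The effect of a delimiter mismatch like the one described can be seen with a small local experiment. The sketch below uses Python's `csv` module as a stand-in for the table's SerDe, so it only illustrates the general behaviour, not how the Data Catalog or a query engine actually parses files; the sample row is hypothetical.

```python
import csv
import io

# Hypothetical pipe-delimited row shaped like the sales table's columns.
sample = "1001|Jane Doe|Widget|19.99|2024-01-01\n"


def parse(text, delimiter):
    """Split each line of text into fields using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


print(parse(sample, ","))  # the whole line arrives as a single field
print(parse(sample, "|"))  # five fields, one per column
```

With the comma delimiter the entire line lands in the first column; with the pipe delimiter each value maps to its own column.
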
Question 403 • medium • multiple choice

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time analytics. The data volume is variable and can spike unpredictably. The solution must be serverless and minimize operational overhead. Which AWS service should be used for ingestion?

Question 404 • medium • multiple choice

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The on-premises network has a 1 Gbps link to AWS. The transfer must complete within 5 days. Which solution is MOST cost-effective and meets the requirements?
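
The time constraint in this scenario comes down to simple arithmetic on data volume and link speed. A rough sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and ignoring protocol overhead; the 80% utilization figure is a hypothetical illustration, not a value from the question.

```python
def transfer_days(data_bytes: float, link_bits_per_sec: float, utilization: float = 1.0) -> float:
    """Days needed to move data_bytes over a link at the given utilization."""
    seconds = data_bytes * 8 / (link_bits_per_sec * utilization)
    return seconds / 86_400


DATA_BYTES = 50 * 10**12  # 50 TB, decimal units (assumption)
LINK_BPS = 1 * 10**9      # 1 Gbps

print(f"100% utilization: {transfer_days(DATA_BYTES, LINK_BPS):.2f} days")
print(f" 80% utilization: {transfer_days(DATA_BYTES, LINK_BPS, 0.8):.2f} days")
```

At full line rate the transfer takes roughly 4.6 days, so how much of the link is actually available to the transfer decides whether the 5-day window is realistic.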

Question 405 • hard • multiple choice

A company is ingesting streaming data from multiple sources using Amazon Kinesis Data Streams. The data is then processed by an AWS Lambda function that transforms the records and writes them to an Amazon S3 bucket. The Lambda function is failing intermittently with timeout errors. The average record size is 5 KB, and the shard count is 2. What is the MOST likely cause of the timeout errors?

Question 406 • easy • multiple choice

A data engineer needs to transform JSON data from an S3 bucket into Parquet format and load it into Amazon Redshift. The transformation must be performed incrementally as new data arrives. Which AWS service is BEST suited for this task?

Question 407 • hard • multiple choice

A company is using AWS Database Migration Service (DMS) to migrate a 2 TB MySQL database to Amazon Aurora MySQL. The migration is taking longer than expected. The source database is in a different AWS region. Which change would MOST likely improve the migration speed?

Question 408 • easy • multiple choice

A company is streaming data from an application to Amazon Kinesis Data Streams. The data must be transformed in real time and then stored in Amazon S3 in Parquet format. Which AWS service should be used for the transformation step?

Question 409 • medium • multiple choice

A company is using AWS Glue to run ETL jobs that process data from Amazon S3 and load it into Amazon Redshift. The jobs are failing with the error 'Unable to connect to Redshift cluster'. The Redshift cluster is in the same VPC as the Glue job. What is the MOST likely cause?

Question 410 • easy • multiple choice

A company needs to ingest data from an on-premises Oracle database into Amazon S3 on a daily basis. The data volume is about 100 GB per day. Which AWS service is BEST suited for this task?

Question 411 • medium • multiple choice

A company is using Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. The data is delivered in JSON format. The company wants to convert the data to Apache Parquet format before delivery to reduce storage costs and improve query performance. How can this be achieved?

Question 412 • medium • multi select

A company is using AWS Glue to process data from an Amazon S3 data lake. The Glue job runs daily and transforms data into multiple output formats. Which TWO actions can the company take to optimize the Glue job's performance and reduce costs? (Choose TWO.)

Question 413 • hard • multi select

A company is running a 10-node Amazon EMR cluster to process data from Amazon S3. The cluster is using Apache Spark for transformations. The data processing is taking longer than expected. Which THREE actions can improve the performance of the Spark jobs on EMR? (Choose THREE.)

Question 414 • easy • multi select

A company is using AWS Glue to catalog data in Amazon S3. The data is stored in CSV format, but the schema is not consistent across all files. Which TWO actions can the company take to handle schema evolution and ensure the Glue Data Catalog is up to date? (Choose TWO.)

Question 415 • hard • multiple choice

A data engineer is reviewing the S3 Lifecycle policy for a data lake bucket. The goal is to archive log data after 30 days and delete it after 365 days, and delete temporary data after 1 day. What is wrong with the current configuration?

Exhibit

Refer to the exhibit.

```
$ aws s3api get-bucket-lifecycle-configuration --bucket my-data-lake
{
  "Rules": [
    {
      "ID": "Archive logs",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    },
    {
      "ID": "Delete temp data",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "temp/"
      },
      "Expiration": {
        "Days": 1
      }
    }
  ]
}
```
Question 416 • medium • multiple choice

A data engineer is troubleshooting an AWS Glue job that is failing with an Access Denied error when trying to read data from an S3 bucket. The IAM policy attached to the Glue job's IAM role is shown in the exhibit. What is the likely cause of the failure?

Exhibit

Refer to the exhibit.
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "*"
    }
  ]
}
```
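
When reading a policy like this, it can help to check which ARNs the `Resource` pattern actually covers. The sketch below uses Python's `fnmatchcase` as an approximation of wildcard matching in IAM resource ARNs; that equivalence is an assumption made for illustration, and the two sample ARNs are hypothetical.

```python
from fnmatch import fnmatchcase

# Resource pattern from the exhibit's S3 statement.
PATTERN = "arn:aws:s3:::my-bucket/*"

# Hypothetical ARNs a job might touch: one object and the bucket itself.
object_arn = "arn:aws:s3:::my-bucket/input/part-0000.csv"
bucket_arn = "arn:aws:s3:::my-bucket"

print(fnmatchcase(object_arn, PATTERN))  # object-level ARN
print(fnmatchcase(bucket_arn, PATTERN))  # bucket-level ARN
```

The pattern covers object ARNs but not the bucket ARN, which matters because bucket-level actions such as `s3:ListBucket` are evaluated against the bucket ARN. Whether that explains a given failure depends on which calls the job makes.
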
Question 417 • hard • multiple choice

A data engineer is troubleshooting a Kinesis Data Streams application that is experiencing high latency. The stream has 2 shards. The application is using a single Kinesis Client Library (KCL) worker to process all shards. Which change will MOST likely reduce latency?

Exhibit

Refer to the exhibit.

```
$ aws kinesis describe-stream --stream-name my-data-stream
{
  "StreamDescription": {
    "StreamName": "my-data-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-data-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "ParentShardId": null,
        "AdjacentParentShardId": null,
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "170141183460469231731687303715884105727"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49598123064581000000000000000000000000000000000000000001",
          "EndingSequenceNumber": null
        }
      },
      {
        "ShardId": "shardId-000000000001",
        "HashKeyRange": {
          "StartingHashKey": "170141183460469231731687303715884105728",
          "EndingHashKey": "340282366920938463463374607431768211455"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49598123064581000000000000000000000000000000000000000002",
          "EndingSequenceNumber": null
        }
      }
    ]
  }
}
```
Question 418 • easy • multiple choice

A company is using Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data is delivered in 5-minute intervals. The company wants to shorten the delivery interval to 1 minute to get data faster. Which parameter should be changed in the Firehose delivery stream configuration?

Question 419 • easy • multiple choice

A data engineering team needs to ingest real-time streaming data from thousands of IoT devices and transform the data before storing it in Amazon S3. Which AWS service is most suitable for performing the transformation step in near real-time?

Question 420 • medium • multiple choice

A company uses Amazon Kinesis Data Firehose to ingest application logs into an Amazon S3 bucket. The logs are in JSON format. The data engineering team wants to convert the logs from JSON to Parquet format before landing in S3. What is the most cost-effective way to achieve this?

Question 421 • hard • multiple choice

A company is ingesting streaming data from social media feeds using Amazon Kinesis Data Streams. The data volume peaks at 10,000 records per second, and each record is up to 1 KB. The company needs to archive the raw data in Amazon S3 in near real-time and also make it available for real-time analytics using Amazon Kinesis Data Analytics. What is the MOST efficient architecture to meet these requirements?

Question 422 • medium • multiple choice

A data engineer needs to design a data ingestion pipeline that ingests CSV files from an Amazon S3 bucket, transforms the data by adding a timestamp column, and loads it into an Amazon Redshift table. The pipeline should run automatically whenever a new file is uploaded to the S3 bucket. Which AWS service should be used to trigger the transformation?

Question 423 • hard • multiple choice

A company uses Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data is in JSON format and contains a 'timestamp' field with a Unix epoch value. The company wants to partition the S3 objects by year, month, day, and hour based on the timestamp. What is the MOST efficient method to achieve this?
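
Whichever mechanism performs the partitioning, the underlying step is converting the epoch value into year, month, day, and hour components. A minimal sketch of that conversion, assuming the epoch is in seconds and interpreted as UTC; the Hive-style `year=/month=/day=/hour=` layout and the `partition_prefix` helper are hypothetical choices for illustration.

```python
from datetime import datetime, timezone


def partition_prefix(epoch_seconds: int) -> str:
    """Build a Hive-style year/month/day/hour prefix from a Unix epoch (UTC)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


# 1704067200 is 2024-01-01 00:00:00 UTC.
print(partition_prefix(1704067200))
```

If the field holds milliseconds rather than seconds, it would need to be divided by 1,000 before the conversion.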

Question 424 • easy • multiple choice

A company needs to ingest data from a relational database into Amazon S3 for analytics. The database is an Amazon RDS MySQL instance. Which AWS service should be used for a one-time historical data load?

Question 425 • medium • multiple choice

A company is using Amazon Kinesis Data Streams with a Lambda consumer to process clickstream data. The data rate is high and the Lambda function is falling behind, resulting in increased processing latency. What is the MOST effective way to improve throughput?

Question 426 • hard • multiple choice

A data engineer is designing a data ingestion pipeline for JSON files landing in an Amazon S3 bucket. The pipeline must transform the data (e.g., flatten nested structures) and load it into Amazon Redshift. The transformation logic is complex and may evolve frequently. Which approach provides the MOST flexibility and ease of maintenance?

Question 427 • easy • multiple choice

A company needs to ingest streaming data from multiple sources and store it in Amazon S3. The data volume is up to 5 GB per hour. What is the MOST cost-effective ingestion service?

Question 428 • medium • multi select

A company is using AWS Glue ETL to transform and load data from Amazon S3 to Amazon Redshift. The data engineer notices that the job is taking longer than expected. Which TWO actions can improve the job performance?

Question 429 • hard • multi select

A company needs to ingest data from a MySQL database into Amazon S3 using AWS DMS. The data changes frequently and the requirement is to capture changes in near real-time. Which THREE configurations are necessary?

Question 430 • easy • multi select

A company is building a data lake on Amazon S3 and needs to ingest data from various on-premises sources. Which TWO AWS services can be used to transfer data securely over the internet?

Question 431 • medium • multiple choice

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time using custom Python code before being stored in Amazon S3. Which AWS service should be used to perform this transformation?

Question 432 • hard • multiple choice

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3 using AWS DMS. Change data capture (CDC) must be enabled to capture ongoing changes. Which additional AWS service is required to store the transaction logs for CDC?

Question 433 • easy • multiple choice

A company wants to ingest data from multiple SaaS applications into Amazon S3 using a fully managed service that supports schema discovery and transformation. Which AWS service should they use?

Question 434 • medium • multiple choice

A data engineer is troubleshooting an AWS Glue ETL job that fails with an OutOfMemory error when processing large JSON files from Amazon S3. The files contain deeply nested structures. Which approach should the engineer take to resolve this issue?

Question 435 • hard • multiple choice

A company is using AWS Lake Formation to manage permissions on data in Amazon S3. They need to ingest data from an external source into a new database 'sales_db' and a table 'transactions' using AWS Glue. The IAM role used by Glue must have the minimal permissions to create the database and table in the Data Catalog and write data to the S3 location. Which combination of permissions should be granted?

Question 436 • easy • multiple choice

A company needs to ingest data from an on-premises SQL Server database into Amazon Redshift. The data volume is less than 1 TB and the network bandwidth is limited. Which AWS service should be used for the initial full load?

Question 437 • medium • multiple choice

A data engineer is designing a data ingestion pipeline for real-time clickstream data using Amazon Kinesis Data Streams. The data must be transformed using AWS Lambda and then stored in Amazon S3 in Parquet format. Which Lambda event source mapping configuration should be used to minimize the number of Lambda invocations while ensuring data is processed within 60 seconds?

Question 438 • hard • multiple choice

A company is using Amazon MSK (Managed Streaming for Apache Kafka) to ingest real-time data. They need to transform the data using custom Java code before writing to Amazon S3. The transformation must be fault-tolerant and provide exactly-once semantics. Which AWS service should be used?

Question 439 • easy • multiple choice

A company needs to ingest data from Amazon S3 into Amazon Redshift for analytics. The data arrives in CSV format with headers and may contain duplicate rows. Which Redshift command should be used to load the data while handling duplicates?

Question 440 • medium • multi select

A company is building a data lake on Amazon S3 and needs to ingest data from multiple sources. Which of the following AWS services can be used to ingest and transform data in near real-time? (Select TWO.)

Question 441 • hard • multi select

A company uses AWS DMS to replicate data from an Amazon RDS for MySQL database to Amazon S3. Which TWO configurations are required to enable continuous change data capture (CDC) from MySQL?

Question 442 • medium • multi select

A data engineer is designing a data pipeline that uses AWS Glue to transform data stored in Amazon S3. The transformation logic must be written in Python and should handle schema evolution automatically. Which THREE features or configurations should the engineer use? (Select THREE.)

Question 443 • medium • multiple choice

A company is using AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job fails intermittently with a 'MemoryError' when the stream has a sudden spike in data volume. Which configuration change would best prevent this error?

Question 444 • easy • multiple choice

A data engineer needs to ingest on-premises CSV files into Amazon S3 every hour. The files are less than 1 GB each. Which service is the most cost-effective and requires the least operational overhead?

Question 445 • hard • multiple choice

A company uses AWS Glue to transform data in Amazon S3. The transformation logic is written in Python and references several libraries that are not included in the default Glue environment. Which approach should the data engineer use to make these libraries available?

Question 446 • medium • multiple choice

A data engineer needs to ingest data from an Amazon RDS for MySQL database into Amazon S3 on a daily basis. The data volume is about 50 GB per day. The engineer wants to minimize the impact on the source database. Which AWS service should be used?

Question 447 • easy • multiple choice

An e-commerce company wants to capture clickstream data from its website and store it in Amazon S3 for analytics. The data arrives continuously and the company needs near-real-time processing. Which solution is most appropriate?

Question 448 • hard • multiple choice

A data engineer is designing a data ingestion pipeline for IoT sensor data. The sensors send JSON messages every second. The data must be available in Amazon S3 within 5 minutes and must be transformed (JSON to Parquet) before storage. Which combination of services meets these requirements?

Question 449 • medium • multiple choice

A company uses AWS Glue crawlers to populate the Data Catalog from data in Amazon S3. The crawler fails to update the schema when new columns are added to the CSV files. What is the most likely cause?

Question 450 • hard • multiple choice

A data engineer is ingesting data from a third-party API into Amazon S3 using AWS Lambda. The API returns a JSON payload of up to 10 MB per request. The Lambda function runs every minute. Occasionally, the function times out after 15 seconds. What is the most likely cause?

Question 451 • easy • multiple choice

A company needs to transfer 20 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited and the transfer must complete within one week. Which service should the company use?

Question 452 • medium • multi select

Which TWO AWS services can be used to ingest streaming data into Amazon S3 with minimal code? (Choose two.)

Question 453 • hard • multi select

Which THREE factors should a data engineer consider when choosing between AWS Glue and Amazon EMR for a data transformation job? (Choose three.)

Question 454 • medium • multi select

Which TWO actions can improve the performance of an AWS Glue ETL job that processes large datasets in Amazon S3? (Choose two.)

Question 455 • medium • multiple choice

Refer to the exhibit. An IAM policy is attached to a role used by an AWS Glue job. The job fails with an 'AccessDenied' error when trying to write to 's3://my-bucket/output/'. What is the most likely cause?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "*"
    }
  ]
}

Question 456 • hard • multiple choice

Refer to the exhibit. A data engineer runs this AWS CLI command to create a Glue job. The job processes JSON files in an S3 bucket and writes Parquet files to another bucket. After the first successful run, the job re-processes all input files instead of only new files. What is the most likely cause?

Exhibit

Refer to the exhibit.

aws glue create-job \
  --default-arguments "{\"job-bookmark-option\": \"job-bookmark-enable\"}" \
  --name my-json-to-parquet-job \
  --role MyGlueServiceRole \
  --command Name=GlueETL \
  --max-retries 0

Question 457 • easy • multiple choice

Refer to the exhibit. A data engineer runs this CLI command on an S3 bucket. The data is ingested from multiple sources. Which AWS service would be best to process these files in a single batch transformation?

Exhibit

Refer to the exhibit.

$ aws s3api list-objects --bucket my-data-lake --prefix logs/2023/ --query 'Contents[].Size'
1048576,2097152,524288,1572864
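
The sizes in the exhibit are reported in bytes. A short conversion, sketched below with illustrative variable names, makes the scale of the workload easier to judge.

```python
# Object sizes in bytes, copied from the exhibit output.
sizes_bytes = [1_048_576, 2_097_152, 524_288, 1_572_864]

MIB = 1024 * 1024  # bytes per mebibyte

# Convert each object size and total them up.
sizes_mib = [size / MIB for size in sizes_bytes]
total_mib = sum(sizes_mib)

print(sizes_mib, total_mib)
```

The listing covers four objects of at most 2 MiB each, so the batch is only a few mebibytes in total.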

Question 458 • easy • multiple choice

A data engineer needs to ingest streaming data from thousands of IoT devices and immediately process each record with minimal latency. Which AWS service should be used as the ingestion point?

Question 459 • easy • multiple choice

A company wants to schedule a nightly batch job to copy data from an on-premises PostgreSQL database to Amazon S3. The solution must minimize operational overhead. Which AWS service should be used?

Question 460 • easy • multiple choice

A data engineer needs to transform JSON data from an S3 bucket into Parquet format for efficient querying with Amazon Athena. The transformation must be serverless and event-driven. Which approach meets these requirements?

Question 461 • medium • multiple choice

A company is ingesting streaming data into Kinesis Data Streams. The consumer application experiences high latency due to a single shard bottleneck. What is the most effective way to reduce latency?

Question 462 • medium • multiple choice

A data engineering team uses AWS Glue ETL jobs to process data daily. They notice that job run times are increasing as data volume grows. Which action will most effectively improve performance without changing the code?

Question 463 • medium • multiple choice

A company uses Amazon Kinesis Data Firehose to deliver data to an S3 bucket. The data must be delivered within 60 seconds of ingestion. Currently, the delivery takes 3 minutes due to large buffer sizes. How should the engineer adjust the Firehose configuration?

Question 464 • hard • multiple choice

A company ingests millions of small files (1-10 KB) into Amazon S3 every hour. These files are then processed by AWS Glue ETL jobs. The Glue jobs are slow because of the overhead of reading many small files. Which strategy will most effectively improve Glue job performance?
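
The per-file overhead described here shrinks when many small inputs are read as one unit of work. The sketch below is a plain-Python illustration of that idea, greedily packing file sizes into groups up to a target size. It is not Glue's implementation, and the function name and the 1 MiB target are assumptions chosen for the example.

```python
from typing import List


def group_small_files(sizes_bytes: List[int], target_bytes: int) -> List[List[int]]:
    """Greedily pack consecutive file sizes into groups of at most target_bytes."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sizes_bytes:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 1,000 files of 10 KiB each, grouped toward a 1 MiB target.
files = [10 * 1024] * 1000
groups = group_small_files(files, 1024 * 1024)
print(len(files), "files ->", len(groups), "groups")
```

Fewer, larger read units mean fewer tasks and fewer S3 requests per byte processed, which is the effect any small-file strategy is after.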

Question 465 • hard • multiple choice

A data pipeline uses Amazon Kinesis Data Streams with enhanced fan-out consumers. The team notices that one consumer falls behind and data accumulates. Which action will help this consumer catch up without affecting other consumers?

Question 466 • hard • multiple choice

A company uses AWS Glue to transform data in an S3 data lake. The transformation logic requires joining two large datasets that are each hundreds of gigabytes. The Glue job runs out of memory. Which configuration change will most likely resolve this issue?

Question 467 • easy • multi select

A company wants to ingest streaming data from social media feeds into AWS for real-time analytics. Which TWO services can directly ingest streaming data without writing custom code? (Choose TWO.)

Question 468 • medium • multi select

A data engineer needs to schedule a nightly ETL job that reads from an Amazon RDS database and writes to Amazon S3 in Parquet format. The solution must be serverless and minimize cost. Which TWO AWS services should be used? (Choose TWO.)

Question 469 • hard • multi select

A company uses Amazon Kinesis Data Firehose to deliver data to an S3 bucket. The data contains personally identifiable information (PII) that must be redacted before storage. Which THREE actions can achieve this requirement? (Choose THREE.)

Question 470 • easy • multiple choice

A data engineer attached this IAM policy to a Lambda function used to transform data in S3. The function is unable to write output to the bucket. What is the most likely reason?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::data-lake-bucket/*"
    }
  ]
}

Question 471 • medium • multiple choice

A data engineer runs the command shown. The consumer application is unable to read data older than 24 hours. What is the most likely cause?

Exhibit

Refer to the exhibit.

aws kinesis describe-stream --stream-name my-data-stream
{
  "StreamDescription": {
    "StreamName": "my-data-stream",
    "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-data-stream",
    "StreamStatus": "ACTIVE",
    "Shards": [
      {
        "ShardId": "shardId-000000000000",
        "HashKeyRange": {
          "StartingHashKey": "0",
          "EndingHashKey": "340282366920938463463374607431768211455"
        },
        "SequenceNumberRange": {
          "StartingSequenceNumber": "49614583901234567890123456789012345678901234567890123456"
        }
      }
    ],
    "EnhancedMonitoring": [],
    "EncryptionType": "KMS",
    "KeyId": "arn:aws:kms:us-east-1:123456789012:key/abc123",
    "RetentionPeriodHours": 24
  }
}

Question 472 • hard • multiple choice

A data engineer reviews the Glue job configuration. The job fails when processing large datasets. The error message indicates out-of-memory in the executors. Which change to the job configuration will most directly address this issue?

Exhibit

Refer to the exhibit.

aws glue get-job --job-name transform-job
{
  "Job": {
    "Name": "transform-job",
    "Role": "arn:aws:iam::123456789012:role/GlueServiceRole",
    "Command": {
      "Name": "glueetl",
      "ScriptLocation": "s3://glue-scripts/transform.py",
      "PythonVersion": "3"
    },
    "DefaultArguments": {
      "--job-language": "python",
      "--TempDir": "s3://glue-temp/"
    },
    "MaxRetries": 0,
    "AllocatedCapacity": 5,
    "Timeout": 30,
    "MaxCapacity": 5,
    "WorkerType": "Standard",
    "NumberOfWorkers": 5,
    "GlueVersion": "3.0"
  }
}

Question 473 • medium • multiple choice

A company uses AWS Glue to catalog data in Amazon S3. The data arrives in Parquet format, but the crawler fails to update the schema when new columns are added. What is the most likely cause?

Question 474 • hard • multiple choice

A data engineering team needs to ingest streaming data from thousands of IoT devices. The data must be processed in near real-time and stored in Amazon S3 in Apache Parquet format partitioned by device_id and timestamp. Which combination of services should the team use to minimize operational overhead and cost?

Question 475 • easy • multiple choice

A company uses AWS Glue ETL jobs to transform data and load it into Amazon Redshift. The jobs are failing with 'Out of Memory' errors. What is the most cost-effective way to resolve this issue without changing the transformation logic?

Question 476 • hard • multiple choice

A company needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The network bandwidth is limited to 1 Gbps, and the transfer must complete within 10 days. The data is compressible. Which solution is MOST appropriate?
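
A quick back-of-the-envelope calculation helps when weighing the deadline against the link speed. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and ignores protocol overhead and compression; the `utilization` parameter and the function name are illustrative additions, not values from the scenario.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Estimate days needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8                      # total bits to transfer
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                        # seconds per day


# Scenario values: 50 TB over a 1 Gbps link.
print(round(transfer_days(50, 1), 2))        # link fully dedicated to the transfer
print(round(transfer_days(50, 1, 0.5), 2))   # hypothetical 50% sustained utilization
```

How much of the link is really available to the transfer is the deciding factor, and compressing the data before or during transfer shortens both estimates.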

Question 477 • medium • multiple choice

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. The delivery occasionally fails due to 'ThrottlingException' from S3. What should the team do to resolve this issue without losing data?

Question 478 • easy • multiple choice

A data engineer needs to transform JSON data into CSV format using AWS Glue. The transformation is simple and must be executed on a schedule. Which Glue component is MOST suitable?

Question 479 • hard • multiple choice

A company uses AWS DMS to migrate an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration completes, but the target table has more rows than the source. Which is the MOST likely cause?

Question 480 • medium • multiple choice

A company wants to ingest data from SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for analytics. The data volume is moderate and updates occur frequently. Which AWS service is BEST suited for this task?

Question 481 • easy • multiple choice

A company uses AWS Lambda to process events from an S3 bucket. The Lambda function writes transformed data to another S3 bucket. Occasionally, the Lambda invocation fails with 'ResourceNotFoundException'. What is the MOST likely cause?

Question 482 • hard • multi select

A company uses AWS Glue to process large datasets. The Glue job occasionally fails with 'DiskFull' errors. Which TWO actions should the engineer take to resolve this issue? (Choose two.)

Question 483 • medium • multi select

A company needs to ingest streaming data from an existing Amazon Kinesis data stream into Amazon S3 with partitioning by date. Which TWO services can accomplish this with minimal coding? (Choose two.)

Question 484 • hard • multi select

A company uses AWS DMS to continuously replicate data from an on-premises SQL Server to Amazon Aurora MySQL. The replication lag is increasing. Which THREE actions can reduce the lag? (Choose three.)

Question 485 • medium • multiple choice

A company ingests streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time using custom Python code before being stored in Amazon S3. Which AWS service should be used to perform this transformation?

Question 486 • hard • multiple choice

A data engineer is troubleshooting a daily batch ingestion pipeline that uses AWS Glue to read CSV files from Amazon S3 and write Parquet files to another S3 bucket. The job runs successfully but takes significantly longer than expected. The engineer notices that the input data is highly skewed with many small files. Which is the most effective optimization to reduce job duration?

Question 487 • easy • multiple choice

A company wants to ingest real-time clickstream data from a website into Amazon S3 with minimal code. The data should be delivered within 60 seconds of generation. Which AWS service should be used?

Question 488 • medium • multiple choice

A data engineer is designing a data ingestion pipeline to load data from an on-premises Oracle database into Amazon Redshift. The pipeline must capture changes (inserts, updates, deletes) with low latency and minimal impact on the source database. Which combination of AWS services should the engineer use?

Question 489 • easy • multiple choice

A company is using Amazon Kinesis Data Firehose to ingest data into Amazon S3. The data must be transformed from JSON to Parquet format before delivery. Which feature should be enabled on the Firehose delivery stream?

Question 490 • hard • multiple choice

A data engineer is designing a streaming ingestion pipeline using Amazon Kinesis Data Streams. The stream has 10 shards, and the data volume is expected to grow by 50% over the next month. The engineer needs to ensure that the pipeline can scale without manual intervention. Which approach should be used?

Question 491 • medium • multiple choice

A company needs to ingest data from multiple SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for analytics. The data sources have different schemas and update frequencies. Which AWS service should be used to build this ingestion pipeline with minimal code?

Question 492 • easy • multiple choice

A data engineer is using AWS Glue to run an ETL job that reads data from Amazon DynamoDB and writes to Amazon Redshift. The job fails with a 'ThroughputExceededException' error. What is the most likely cause?

Question 493 • hard • multiple choice

A company is using Amazon Kinesis Data Streams with a Lambda consumer to process real-time events. The Lambda function is triggered by a DynamoDB stream to update a counter. Recently, the counter has been inaccurate due to duplicate processing. What is the most likely cause?

Question 494 • medium • multi select

A data engineer is designing a data ingestion pipeline that uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The engineer wants to ensure that the data is organized in a directory structure by year, month, day, and hour. Which TWO configurations should the engineer set on the Firehose delivery stream? (Choose TWO.)

Question 495 • hard • multi select

A company is ingesting data from multiple sources into Amazon S3 using AWS Glue. The data is then transformed using Apache Spark on Amazon EMR. The data engineer wants to reduce the cost of storing and processing data by compressing the ingested files. Which THREE file formats support compression and are commonly used with Spark? (Choose THREE.)

Question 496 • easy • multi select

A data engineer is building a data ingestion pipeline that uses AWS Lambda to process records from Amazon Kinesis Data Streams. The Lambda function writes the processed data to Amazon DynamoDB. Which TWO factors affect the maximum number of concurrent Lambda executions for this stream? (Choose TWO.)
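
For a Kinesis event source mapping, Lambda processes one batch per shard at a time by default, and the parallelization factor setting (1 to 10) allows more concurrent batches per shard. The sketch below models that relationship for a standard event source mapping; the function name is illustrative, and account-level or reserved concurrency limits are deliberately left out.

```python
def max_concurrent_batches(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent Lambda invocations for one Kinesis stream."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization_factor must be between 1 and 10")
    return shard_count * parallelization_factor


# Hypothetical stream with 4 shards.
print(max_concurrent_batches(4))      # default parallelization factor
print(max_concurrent_batches(4, 10))  # maximum parallelization factor
```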

Question 497 • medium • multiple choice

Refer to the exhibit. A data engineer created this IAM policy for a Lambda function that reads from a Kinesis stream and writes to an S3 bucket. The Lambda function fails with an 'AccessDenied' error when trying to write to S3. What is the missing permission?

Exhibit

Refer to the exhibit.

IAM Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::data-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetRecords",
        "kinesis:GetShardIterator"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/input-stream"
    }
  ]
}

Question 498 • hard • multiple choice

Refer to the exhibit. A data engineer runs this AWS Glue job but it fails with an error that the table 'orders' does not exist in the 'sales_db' database. The engineer has verified that the table exists in the AWS Glue Data Catalog. What is the most likely cause of the error?

Exhibit

Refer to the exhibit.

AWS Glue Job Script (PySpark):
```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sales_db", table_name = "orders", transformation_ctx = "datasource0")

datasource1 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", connection_options = {"paths": ["s3://data-lake/raw/"]}, format = "json", transformation_ctx = "datasource1")

job.commit()
```

Question 499 • easy • multiple choice

Refer to the exhibit. A data engineer is configuring a Kinesis Data Firehose delivery stream. The stream is expected to receive bursts of 10 MB of data every 2 minutes. What is the maximum time it will take for data to be delivered to S3 during a burst?

Exhibit

Refer to the exhibit.

Kinesis Data Firehose Configuration (Partial):
- DeliveryStreamName: "my-firehose"
- Destination: "s3"
- S3DestinationConfiguration:
    BucketARN: "arn:aws:s3:::data-bucket"
    Prefix: "data/"
    ErrorOutputPrefix: "errors/"
    BufferingHints:
      IntervalInSeconds: 300
      SizeInMBs: 5
- ProcessingConfiguration:
    Enabled: True
    Processors:
      - Type: "Lambda"
        Parameters:
          - ParameterName: "LambdaArn"
            ParameterValue: "arn:aws:lambda:us-east-1:123456789012:function:transform"
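
The two buffering hints in the exhibit act as thresholds, and delivery is triggered by whichever one is reached first. The sketch below is a simplified model of that rule using the exhibit's values as defaults; it ignores the separate buffering applied to the Lambda processor, and the function name and return labels are assumptions made for illustration.

```python
from typing import Optional


def flush_trigger(buffered_mb: float, elapsed_s: float,
                  size_mb: float = 5, interval_s: float = 300) -> Optional[str]:
    """Return which buffering hint has been reached, or None if neither has."""
    if buffered_mb >= size_mb:
        return "size"
    if elapsed_s >= interval_s:
        return "interval"
    return None


print(flush_trigger(buffered_mb=5, elapsed_s=10))    # size threshold reached first
print(flush_trigger(buffered_mb=1, elapsed_s=300))   # interval reached first
print(flush_trigger(buffered_mb=1, elapsed_s=10))    # still buffering
```

In this model the interval bounds how long a record can wait in the buffer, while the size threshold can cause an earlier delivery when data arrives quickly.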

Question 500 • easy • multiple choice

A company streams clickstream data from websites to Amazon Kinesis Data Streams. A Lambda function processes each record and writes it to Amazon S3. Recently, the function has been timing out under high load. Which solution should a data engineer implement to handle the increased throughput?
