DEA-C01 Practice Questions

Question 1

A data pipeline uses Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data volume spikes occasionally, causing the Firehose buffer to fill up and leading to increased delivery latency. The latency must remain under 60 seconds. What should be done to minimize latency?

Accepted Answer

Reduce the buffer interval to 60 seconds.

Firehose delivers data when either the buffer size or the buffer interval threshold is reached, whichever comes first. Setting the buffer interval to 60 seconds means Firehose waits at most 60 seconds before delivering to S3, even when the buffer size threshold has not been reached. This caps buffering latency in line with the 60-second requirement.

Answer

Enable GZIP compression on the Firehose delivery stream.

Answer

Increase the buffer size to 128 MB to accommodate larger batches.

Answer

Switch to Kinesis Data Streams with a Lambda consumer.
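
The "whichever threshold is reached first" rule from the accepted answer to Question 1 can be sketched as follows. This is a minimal illustration, not Firehose's implementation: `should_flush` is a hypothetical helper, and the 5 MB size value is an assumed example. `SizeInMBs` and `IntervalInSeconds` are the field names Firehose uses in its buffering hints.

```python
# Sketch of the "whichever threshold is hit first" flush rule.
# should_flush is a hypothetical helper, not part of any AWS SDK.

# Shape of the buffering hints in an S3 destination configuration.
# The values are assumed examples.
BUFFERING_HINTS = {"SizeInMBs": 5, "IntervalInSeconds": 60}


def should_flush(buffered_bytes: int, elapsed_seconds: float, hints: dict) -> bool:
    """Return True when either the size or the interval threshold is reached."""
    size_limit_bytes = hints["SizeInMBs"] * 1024 * 1024
    return (
        buffered_bytes >= size_limit_bytes
        or elapsed_seconds >= hints["IntervalInSeconds"]
    )


if __name__ == "__main__":
    one_mb = 1024 * 1024
    # Quiet period: only 1 MB buffered, but the interval has elapsed.
    print(should_flush(1 * one_mb, 60, BUFFERING_HINTS))
    # Spike: the size threshold is reached long before the interval.
    print(should_flush(5 * one_mb, 10, BUFFERING_HINTS))
    # Neither threshold reached yet.
    print(should_flush(1 * one_mb, 10, BUFFERING_HINTS))
```

Under this rule, a shorter interval bounds how long a record can sit in the buffer during quiet periods, while spikes are flushed earlier by the size threshold.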

Question 2

An e-commerce company uses AWS Glue to run ETL jobs that transform clickstream data from Amazon S3. The job reads Parquet files, performs aggregations, and writes the results to Amazon Redshift. The job runs successfully but takes longer than expected. The data volume is increasing. Which design change would MOST improve the job's performance?

Accepted Answer

Increase the number of Glue worker nodes (DPUs) for the job.

Adding Glue worker nodes (DPUs) scales the distributed processing capacity of the ETL job, so it can process larger volumes of Parquet data in parallel. This is the most straightforward way to reduce execution time when data volume is growing, because AWS Glue distributes the workload across the additional workers.

Answer

Write the aggregated results to a single large file instead of multiple partitions.

Answer

Convert the Parquet files to CSV to simplify the schema.

Answer

Replace the Redshift target with Amazon Redshift Spectrum.

Question 3

A data engineering team uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. They notice that the application's checkpointing is failing intermittently, causing data reprocessing. The application uses a large state. Which configuration change should the team make to improve checkpoint reliability?

Accepted Answer

Increase the checkpointing interval.

A longer checkpointing interval reduces how often checkpoints run, giving each checkpoint more time to complete before the next one starts. This eases backpressure and resource contention, which matters for large state: checkpointing large state is I/O and CPU intensive and can fail when intervals are too tight.

Answer

Disable checkpointing to avoid failures.

Answer

Switch the state backend from in-memory to RocksDB.

Answer

Increase the parallelism of the application.

Question 4

A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job reads JSON records and writes Parquet to Amazon S3. Recently, the job started failing with 'Out of Memory' errors. Which change is MOST likely to resolve the issue?

Accepted Answer

Increase the number of DPUs allocated to the Glue job.

An 'Out of Memory' error in AWS Glue indicates that the job's allocated resources are insufficient for the data volume or processing complexity. Increasing the number of DPUs (Data Processing Units) increases the available memory and compute capacity, which is the most straightforward fix for OOM errors in Glue streaming jobs. It addresses the root cause, resource exhaustion, by scaling the job horizontally.

Answer

Enable compression on the Kinesis stream.

Answer

Change the output format from Parquet to ORC.

Answer

Reduce the streaming batch size in the Glue job configuration.

Question 5

A data engineer is designing a serverless data ingestion pipeline that uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed using AWS Lambda before being written to S3. Which two steps are required to enable this transformation? (Select TWO.)

Accepted Answer

Configure a Lambda function as a data transformation source in the Firehose delivery stream.

Amazon Kinesis Data Firehose can be configured to invoke a Lambda function for data transformation. Firehose passes incoming records to the function, which processes them and returns the transformed records before they are delivered to the S3 destination. The second required step is for the Lambda function to return data in the format that Firehose expects, including a record ID, a result status, and base64-encoded data; otherwise the transformation fails.

Answer

Set up an S3 event notification to trigger the Lambda function on object creation.

Answer

Subscribe the Lambda function to the CloudWatch Logs log group for the Firehose stream.

Answer

Have the Lambda function write the transformed data directly to the S3 bucket.
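
The response format mentioned in the accepted answer to Question 5 (record ID, result status, base64-encoded data) can be sketched as a transformation handler. This is a minimal sketch, assuming JSON payloads; the `processed` flag is a hypothetical transformation added only for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the expected shape."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: tag each record as processed.
            payload["processed"] = True
            data = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, TypeError):
            # Hand records that cannot be parsed back unchanged, marked as failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Every incoming `recordId` is echoed back exactly once, with a `result` of `Ok` or `ProcessingFailed` (a third status, `Dropped`, is available for records that should be discarded on purpose).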

Question 6

A company runs a nightly AWS Glue ETL job that reads from a JDBC source (PostgreSQL) and writes to S3 in Parquet format. The job takes over 6 hours, but the SLA requires completion within 4 hours. The source table has 500 million rows and is updated frequently. Which approach will most reliably reduce job duration?

Accepted Answer

Partition the source table by year and use pushdown predicates in the Glue job.

Partitioning the source table by year and using pushdown predicates lets AWS Glue read only the relevant partitions from PostgreSQL, drastically reducing the data scanned and transferred. This addresses the 500 million row volume and the frequent updates by minimizing the JDBC read workload, which is the primary bottleneck in the 6-hour runtime.

Answer

Enable job bookmark and schedule the job to run more frequently.

Answer

Use multiple JDBC connections in parallel by setting 'hashexpression' and 'hashfield'.

Answer

Increase the number of DPUs for the Glue job to 100.

Question 7

Match each AWS database service to its primary use case.

Question 8

A company uses Amazon DynamoDB with on-demand capacity. They notice higher than expected costs due to a sudden spike in read traffic from a reporting job. The reporting job scans the entire table daily. What is the most cost-effective way to reduce costs while maintaining the same reporting output?

Accepted Answer

Use a Global Secondary Index (GSI) with a sort key that matches the reporting query pattern.

A Global Secondary Index (GSI) with a sort key tailored to the reporting query pattern lets the reporting job query only the relevant items instead of scanning the entire table. This reduces the read request units consumed, directly lowering costs under on-demand capacity, which bills per read request unit. The reporting output remains identical because the GSI returns the same data filtered by the query pattern.

Answer

Enable DynamoDB Accelerator (DAX) for caching.

Answer

Set a TTL attribute to automatically expire old data.

Answer

Reduce the read capacity units (RCU) in the table.
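
The cost argument in the accepted answer to Question 8 comes down to how much data each operation reads. The arithmetic can be sketched as below, assuming the standard accounting of one read request unit per strongly consistent read of up to 4 KB and half a unit for an eventually consistent read. The table and result sizes are hypothetical, and per-request rounding and GSI write costs are ignored for simplicity.

```python
import math

RRU_BLOCK_BYTES = 4 * 1024  # one read request unit covers up to 4 KB


def read_request_units(bytes_read: int, strongly_consistent: bool = False) -> float:
    """Estimate read request units for reading a given number of bytes."""
    blocks = math.ceil(bytes_read / RRU_BLOCK_BYTES)
    return float(blocks) if strongly_consistent else blocks / 2


if __name__ == "__main__":
    gib = 1024 ** 3
    full_scan = read_request_units(50 * gib)      # hypothetical 50 GiB table
    targeted_query = read_request_units(1 * gib)  # hypothetical 1 GiB of relevant items
    print(f"scan:  {full_scan:,.0f} read request units")
    print(f"query: {targeted_query:,.0f} read request units")
    print(f"ratio: {full_scan / targeted_query:.0f}x")
```

The saving scales with the fraction of the table that the report actually needs, so the benefit of the index depends on how selective the reporting query is.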

Question 9

A data engineer needs to migrate an on-premises MySQL database to Amazon RDS for MySQL with minimal downtime. Which approach should they use?

Accepted Answer

Use AWS Database Migration Service (DMS) with ongoing replication from the source database.

AWS DMS with ongoing replication (change data capture, CDC) is the correct approach because it allows continuous synchronization from the on-premises MySQL source to the RDS target, enabling a cutover with minimal downtime. Unlike one-time export/import tools, DMS captures ongoing changes during the migration, so the target stays up to date until you switch over.

Answer

Use mysqldump to export the database and import into RDS.

Answer

Create an RDS read replica and promote it.

Answer

Use AWS Schema Conversion Tool (SCT) to convert the schema and then copy data.
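
A full load followed by ongoing replication, as described in the accepted answer to Question 9, corresponds to the `full-load-and-cdc` migration type of a DMS replication task. The sketch below only builds the task parameters as a dictionary and does not call AWS. `build_task_params`, the task identifier, and the ARN values are hypothetical placeholders.

```python
import json


def build_task_params(source_arn: str, target_arn: str, instance_arn: str) -> dict:
    """Build parameters for a DMS replication task (full load plus CDC)."""
    # Selection rule that includes every table in every schema.
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-all-tables",
                "object-locator": {"schema-name": "%", "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": "mysql-to-rds-migration",  # hypothetical name
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Full load first, then continuous change data capture.
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
    }


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    params = build_task_params("arn:source", "arn:target", "arn:instance")
    print(json.dumps(params, indent=2))
```

A dictionary of this shape could be passed as keyword arguments to the DMS client's `create_replication_task` call once real endpoint and instance ARNs exist.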

Question 10

A data engineer attaches the above IAM policy to an IAM user. The user tries to download an object from my-bucket using the AWS CLI without specifying SSE headers. The object is stored with SSE-S3. Will the download succeed?

Accepted Answer

No, because the request does not include the required encryption header.

The outcome depends on the policy in the exhibit rather than on the S3 API itself: Amazon S3 decrypts SSE-S3 objects transparently on download and does not require an encryption header on GET requests. If the policy's `s3:GetObject` statement is conditioned on the `x-amz-server-side-encryption` header, a request that omits the header does not satisfy the condition, so the statement does not apply and the request is denied even though the action itself is listed in the policy.

Answer

No, because the object is encrypted and the user does not have decrypt permission.

Answer

Yes, because the object is encrypted with SSE-S3, which uses AES256.

Answer

Yes, because the policy allows s3:GetObject on the bucket.

Question 11

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data arrives as JSON via AWS IoT Core, and must be stored in Amazon S3 in partitioned Parquet format. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which combination of services should the engineer use?

Accepted Answer

Amazon Kinesis Data Firehose with data transformation via AWS Lambda, delivering to Amazon S3.

Amazon Kinesis Data Firehose is the correct choice because it can ingest streaming data from AWS IoT Core, invoke an AWS Lambda function to transform the JSON records on the way to Parquet, and deliver the data to Amazon S3 with automatic partitioning. It also supports buffering and retry logic to handle late-arriving data (up to 1 hour) and provides exactly-once delivery to S3 when configured with the appropriate error handling and idempotent transformations.

Answer

Amazon Kinesis Data Streams with AWS Lambda for transformation and Amazon S3.

Answer

Amazon Simple Queue Service (SQS) with AWS Lambda for transformation and Amazon S3.

Answer

AWS Glue streaming jobs consuming from Amazon Kinesis Data Streams and writing to Amazon S3.
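
For the partitioned layout and late-arriving data in Question 11, one common approach is to derive the S3 prefix from each record's event time rather than its arrival time, so a late record still lands in the partition for the hour it describes. The sketch below assumes ISO 8601 timestamps with an explicit UTC offset. `partition_prefix` and the `sensor-data` base prefix are hypothetical names.

```python
from datetime import datetime, timezone


def partition_prefix(event_time_iso: str, base: str = "sensor-data") -> str:
    """Build a Hive-style S3 prefix from a record's event time (UTC)."""
    ts = datetime.fromisoformat(event_time_iso).astimezone(timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


if __name__ == "__main__":
    # A reading produced at 23:59 UTC that arrives after midnight still
    # maps to the 23:00 partition of the day it was produced.
    print(partition_prefix("2024-03-05T23:59:30+00:00"))
```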

Question 12

A data engineer needs to transform JSON data from an S3 bucket using AWS Glue. The JSON contains nested arrays and objects. Which Glue transform is best suited for flattening nested structures?

Accepted Answer

Relationalize

The Relationalize transform is specifically designed to flatten nested JSON structures (arrays and objects) into a set of related tables, making it ideal for this use case. It handles complex nesting by creating separate DynamicFrames for nested arrays and linking them to the parent table through generated keys, which is what is needed when ingesting JSON with nested arrays and objects into a relational format.

Answer

Unnest

Answer

ResolveChoice

Answer

Map
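
The idea behind Relationalize in Question 12 can be illustrated in plain Python: nested objects become dotted column names, and each nested array becomes its own child table. This is a simplified sketch of the concept, not the AWS Glue implementation; the function below and the table and column naming (`root_items`, `index`, `val.`) are assumptions chosen for illustration.

```python
def relationalize(record: dict, root: str = "root") -> dict:
    """Flatten one nested record into a mapping of table name to rows."""
    tables = {root: [{}]}

    def walk(value, prefix, row, table_name):
        if isinstance(value, dict):
            # Nested objects are flattened into dotted column names.
            for key, child in value.items():
                walk(child, f"{prefix}.{key}" if prefix else key, row, table_name)
        elif isinstance(value, list):
            # Nested arrays move to a child table; the parent row keeps
            # a reference to that table.
            child_table = f"{table_name}_{prefix}"
            rows = tables.setdefault(child_table, [])
            row[prefix] = child_table
            for index, item in enumerate(value):
                child_row = {"index": index}
                rows.append(child_row)
                walk(item, "val", child_row, child_table)
        else:
            row[prefix] = value

    walk(record, "", tables[root][0], root)
    return tables


if __name__ == "__main__":
    order = {
        "order_id": 7,
        "customer": {"name": "Ana", "city": "Lima"},
        "items": [{"sku": "a1", "qty": 2}, {"sku": "b2", "qty": 1}],
    }
    for name, rows in relationalize(order).items():
        print(name, rows)
```

In this sketch the `customer` object is folded into the root row, while the `items` array is split into a second table with one row per element.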

Question 13

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3. The data volume is about 500 GB initially, with daily incremental updates of 10 GB. The pipeline must minimize operational overhead. Which AWS service should be used for the initial and incremental loads?

Accepted Answer

AWS Database Migration Service (DMS) with change data capture (CDC) to Amazon S3.

AWS DMS with CDC is the correct choice because it supports continuous replication from Oracle to Amazon S3 with minimal overhead. It handles both the initial 500 GB full load and ongoing 10 GB daily increments via change data capture, without requiring custom code or complex pipeline management.

Answer

AWS Glue with a JDBC connection and incremental crawl.

Answer

Amazon Kinesis Data Firehose with a custom producer.

Answer

AWS Data Pipeline with a SQL activity and HiveCopyActivity.

Question 14

A company has a Glue ETL job that reads from an Amazon RDS for MySQL table and writes to Amazon S3. The job runs hourly and processes new records based on a 'last_modified' timestamp column. Recently, the job started missing some records because the timestamp in MySQL is stored with microsecond precision but Glue's job bookmark only tracks second precision. Which solution addresses this issue?

Accepted Answer

Use a job parameter to store the last processed timestamp with millisecond precision and query records greater than that value.

AWS Glue job bookmarks track timestamps with only second precision, so records whose timestamps differ only in the sub-second part of the same second are missed. Storing the last processed timestamp with millisecond precision in a custom job parameter, and querying for records greater than that value, bypasses the bookmark limitation and captures all new or modified records.

Answer

Increase the job frequency to every 30 minutes.

Answer

Run a full refresh of the table each time instead of incremental.

Answer

Modify the MySQL table to use a DATE data type instead of TIMESTAMP.
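
The custom watermark approach from the accepted answer to Question 14 can be sketched with plain Python datetimes. The sketch keeps the full sub-second value of `last_modified`, which is an assumption that goes slightly beyond the millisecond precision named in the answer. `select_new_rows` and the sample rows are hypothetical.

```python
from datetime import datetime


def select_new_rows(rows, watermark):
    """Return rows modified strictly after the watermark, plus the new watermark."""
    new_rows = [row for row in rows if row["last_modified"] > watermark]
    new_watermark = max(
        (row["last_modified"] for row in new_rows), default=watermark
    )
    return new_rows, new_watermark


if __name__ == "__main__":
    # Three rows modified within the same second.
    rows = [
        {"id": 1, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 250000)},
        {"id": 2, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 250001)},
        {"id": 3, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 900000)},
    ]
    # Row 1 was processed in the previous run.
    watermark = datetime(2024, 1, 1, 12, 0, 0, 250000)
    new_rows, watermark = select_new_rows(rows, watermark)
    print([row["id"] for row in new_rows])
    # Serialize with sub-second precision before storing it as a job parameter.
    print(watermark.isoformat(timespec="microseconds"))
```

Because the comparison uses the untruncated timestamp, rows 2 and 3 are picked up even though they share a second with the previously processed row.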

Question 15

A data engineer is ingesting CSV files from an Amazon S3 bucket into a Glue Data Catalog table. The files have headers, but some files have extra columns not present in the first file. The engineer wants the Glue crawler to automatically detect the schema. Which crawler configuration option should be used?

Accepted Answer

Configure the crawler to 'Create a single schema for each S3 path' and enable 'Merge tables'.

When CSV files have varying schemas (extra columns), the Glue crawler must be configured to 'Create a single schema for each S3 path' with 'Merge tables' enabled. This configuration instructs the crawler to union the schemas from all files in the S3 path, adding new columns as they appear, rather than creating a separate table for each schema variation.

Answer

Configure the crawler to 'Inherit schema from table' and set the table name.

Answer

Configure the crawler to 'Create a single schema for each S3 path' without enabling 'Merge tables'.

Answer

Configure the crawler to 'Create a single schema for each S3 path' and set 'Each file as a separate table'.
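
The schema union described in the accepted answer to Question 15 can be illustrated in plain Python by merging CSV headers while preserving column order. This is a conceptual sketch of the merge, not the crawler's own logic; `merge_headers` and the sample files are hypothetical.

```python
import csv
import io


def merge_headers(csv_texts):
    """Union the header columns of several CSV files, keeping first-seen order."""
    merged = []
    for text in csv_texts:
        header = next(csv.reader(io.StringIO(text)), [])
        for column in header:
            if column not in merged:
                merged.append(column)
    return merged


if __name__ == "__main__":
    first_file = "id,name\n1,Ana\n"
    later_file = "id,name,email\n2,Luis,luis@example.com\n"
    print(merge_headers([first_file, later_file]))
```

Rows from files that lack a later column would carry an empty or null value for it in the merged table.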

AWS Certified Data Engineer Associate DEA-C01 practice questions

A company is building a data lake on Amazon S3. Data arrives from multiple sources in JSON, CSV, and Avro formats. The data must be transformed to Parquet and partitioned by date and source. Which TWO services can perform this transformation with minimal custom code? (Choose TWO.)

A company uses AWS Glue to process CSV files from an S3 bucket. The job fails intermittently with a 'SchemaDetectionError' for files that have inconsistent column counts. What is the most efficient way to handle this?

A company uses AWS Data Pipeline to copy data from DynamoDB to S3 daily. Recently, the pipeline started failing with 'ThrottlingException' errors. The DynamoDB table has on-demand capacity. Which action should be taken to resolve the issue?

Arrange the steps to set up cross-region replication for an S3 bucket.

Arrange the steps to implement data encryption at rest for an Amazon Redshift cluster using AWS KMS.

Arrange the steps to create an AWS Glue job that transforms data from Amazon S3 to Amazon Redshift in the correct order.

Order the steps to set up an Amazon EMR cluster for processing data in S3 using Spark.

A company ingests IoT sensor data into Kinesis Data Streams. The data is then processed by a Lambda function that aggregates readings and writes to DynamoDB. The Lambda function is experiencing high error rates due to throttling. Which TWO actions would reduce throttling?
