Scenario-based practice

Hard Difficulty Questions

Practise for the AWS Certified Data Engineer Associate (DEA-C01) exam with original exam-style scenarios that cover every exam domain. Each question comes with a detailed explanation, wrong-answer analysis, and common exam traps.

Scenario questions: 20
Exam code: DEA-C01
Vendor: Amazon Web Services

Scenario guide

How to approach hard difficulty questions

These are the questions most candidates get wrong. They require connecting multiple concepts, reading tricky output, or knowing edge-case behaviour that isn't on most study cards. Practising them trains you to operate under uncertainty — a necessary skill on the real exam.

Quick answer

Hard difficulty questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Practice set

Practice scenarios

Question 1 (hard, multiple choice)

An e-commerce company uses AWS Glue to run ETL jobs that transform clickstream data from Amazon S3. The job reads Parquet files, performs aggregations, and writes the results to Amazon Redshift. The job runs successfully but takes longer than expected. The data volume is increasing. Which design change would MOST improve the job's performance?

Question 2 (hard, multiple choice)

A data engineering team uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. They notice that the application's checkpointing is failing intermittently, causing data reprocessing. The application uses a large state. Which configuration change should the team make to improve checkpoint reliability?

Question 3 (hard, multiple choice)

A company runs a nightly AWS Glue ETL job that reads from a JDBC source (PostgreSQL) and writes to S3 in Parquet format. The job takes over 6 hours, but the SLA requires completion within 4 hours. The source table has 500 million rows and is updated frequently. Which approach will most reliably reduce job duration?

Question 4 (hard, multiple choice)

A company uses Amazon DynamoDB with on-demand capacity. They notice higher than expected costs due to a sudden spike in read traffic from a reporting job. The reporting job scans the entire table daily. What is the most cost-effective way to reduce costs while maintaining the same reporting output?

Question 5 (hard, multiple choice)

A data engineer attaches the IAM policy shown in the exhibit to an IAM user. The user tries to download an object from my-bucket using the AWS CLI without specifying SSE headers. The object is stored with SSE-S3. Will the download succeed?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
```
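
One way to reason about the exhibit is to model how a `StringEquals` condition behaves when the condition key is missing from the request. The sketch below is a simplified illustration, not the IAM evaluation engine: `string_equals_matches`, `statement_allows`, and both request contexts are hypothetical, and resource matching is left out. It assumes that a plain GET request carries no `x-amz-server-side-encryption` header, and it relies on the IAM rule that a string operator without `IfExists` does not match when the key is absent from the request context.

```python
from typing import Dict

# The statement from the exhibit, reproduced as a Python dict.
STATEMENT = {
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*",
    "Condition": {
        "StringEquals": {"s3:x-amz-server-side-encryption": "AES256"}
    },
}


def string_equals_matches(condition: Dict[str, str], context: Dict[str, str]) -> bool:
    """Toy StringEquals: each key must be present in the context and equal."""
    return all(
        key in context and context[key] == expected
        for key, expected in condition.items()
    )


def statement_allows(statement: dict, action: str, context: Dict[str, str]) -> bool:
    """Return True if this single Allow statement applies (resource check omitted)."""
    if statement["Effect"] != "Allow" or statement["Action"] != action:
        return False
    condition = statement.get("Condition", {}).get("StringEquals", {})
    return string_equals_matches(condition, context)


# Assumed request contexts (hypothetical): a plain GET sends no SSE header.
plain_get: Dict[str, str] = {}
get_with_header = {"s3:x-amz-server-side-encryption": "AES256"}

print(statement_allows(STATEMENT, "s3:GetObject", plain_get))        # False
print(statement_allows(STATEMENT, "s3:GetObject", get_with_header))  # True
```

Under this model the Allow statement does not apply to the plain GET, so a user with no other permissions would fall through to an implicit deny, regardless of how the object is encrypted at rest.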

Question 6 (hard, multiple choice)

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data arrives as JSON via AWS IoT Core, and must be stored in Amazon S3 in partitioned Parquet format. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which combination of services should the engineer use?
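
The late-arrival requirement in this scenario is easier to picture with a small example. The sketch below is an assumption-laden illustration, not a reference design: `partition_prefix` is a hypothetical helper, the Hive-style `year=/month=/day=/hour=` layout is one common convention rather than something the question specifies, and timestamps are assumed to be timezone-aware ISO 8601 strings. It shows why partitions are usually derived from the event time rather than the arrival time.

```python
from datetime import datetime, timezone


def partition_prefix(event_time_iso: str) -> str:
    """Build a Hive-style S3 prefix from the event's own timestamp (UTC)."""
    event_time = datetime.fromisoformat(event_time_iso).astimezone(timezone.utc)
    return (
        f"year={event_time:%Y}/month={event_time:%m}/"
        f"day={event_time:%d}/hour={event_time:%H}"
    )


# A reading taken at 09:59:30 UTC that reaches the pipeline 40 minutes later
# still belongs to the hour=09 partition, not to the hour in which it arrived.
late_event = "2024-05-01T09:59:30+00:00"
print(partition_prefix(late_event))  # year=2024/month=05/day=01/hour=09
```

Because an event can land in a partition up to an hour after that partition's window has closed, whichever services are chosen must be able to write to (or rewrite) an older partition without duplicating records.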

Question 7 (hard, multiple choice)

A company has a Glue ETL job that reads from an Amazon RDS for MySQL table and writes to Amazon S3. The job runs hourly and processes new records based on a 'last_modified' timestamp column. Recently, the job started missing some records because the timestamp in MySQL is stored with microsecond precision but Glue's job bookmark only tracks second precision. Which solution addresses this issue?
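
The precision mismatch described in this scenario can be reproduced with a toy model. The sketch below does not reflect how AWS Glue job bookmarks are implemented internally; `truncate_to_second` and `select_new_rows` are hypothetical helpers, and the timestamps are made up. It only illustrates how a watermark compared at second granularity can skip a row written later within the same second.

```python
from datetime import datetime
from typing import List


def truncate_to_second(ts: datetime) -> datetime:
    """Drop the microsecond component, as a second-precision watermark would."""
    return ts.replace(microsecond=0)


def select_new_rows(
    rows: List[datetime], watermark: datetime, second_precision: bool
) -> List[datetime]:
    """Return rows considered 'new' relative to the stored watermark."""
    if second_precision:
        # Both sides are compared at whole-second granularity.
        return [r for r in rows if truncate_to_second(r) > watermark]
    return [r for r in rows if r > watermark]


# Run 1 processed rows up to 12:00:00.400000.
first_run_max = datetime(2024, 1, 1, 12, 0, 0, 400000)
# A row written afterwards, but still inside the same second.
late_row = datetime(2024, 1, 1, 12, 0, 0, 900000)

# Second-precision watermark: the late row is skipped.
print(select_new_rows([late_row], truncate_to_second(first_run_max), True))   # []
# Microsecond-precision watermark: the late row is picked up.
print(select_new_rows([late_row], first_run_max, False))
```

In this model the fix is to track progress with a key whose precision matches the source column, or with a key that does not depend on timestamp precision at all.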

Question 8 (hard, multi-select)

A data engineer is troubleshooting an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but 5% of records are missing after the load. The engineer suspects data consistency issues. Which THREE actions could help diagnose and resolve the problem? (Choose THREE.)

Question 9 (hard, multi-select)

A company uses Amazon RDS for MySQL as a source for AWS DMS to replicate data to S3. The replication task is failing with 'OutOfMemory' errors on the DMS instance. The source table has 10 million rows with large BLOB columns. Which THREE changes would most likely resolve the issue?

Question 10 (hard, multiple choice)

A data engineer is setting up an Amazon Kinesis Data Analytics application to process streaming data from a Kinesis data stream named "input-stream". The application uses a reference data source from an S3 bucket. The engineer has attached the IAM policy shown in the exhibit to the application's IAM role. When starting the application, the engineer receives an 'AccessDeniedException' error. Which additional permission is required?

Exhibit

Refer to the exhibit.

```
"Effect": "Allow",
"Action": [
  "kinesis:DescribeStream",
  "kinesis:GetShardIterator",
  "kinesis:GetRecords",
  "kinesis:ListShards"
],
"Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/input-stream"
```

Question 11 (hard, multi-select)

A company is migrating a legacy data warehouse to Amazon Redshift. They need to choose a distribution style to minimize data movement during joins. Which THREE factors should they consider?

Question 12 (hard, multi-select)

A data engineer is designing a data lake on Amazon S3. The data must be immutable and support high-throughput streaming ingestion. Which THREE features should the engineer consider? (Select THREE.)

Question 13 (hard, multiple choice)

A data team runs a daily AWS Glue ETL job that processes data from an Amazon Redshift cluster and writes results to Amazon S3. The job completes successfully but takes 2 hours longer than expected. The job uses the JDBC connection to Redshift. The Redshift cluster is 4 dc2.large nodes. The Glue job has 10 workers of type G.1X. Which change would MOST likely reduce the job duration?

Question 14 (hard, multi-select)

A data engineer is designing a data lake on Amazon S3 for analytics. The data includes sensitive PII that must be encrypted at rest. The company requires that the encryption keys be managed by the company's own hardware security module (HSM) and rotated every 90 days. Which TWO options meet these requirements? (Choose TWO.)

Question 15 (hard, multi-select)

A data engineer is troubleshooting an AWS Glue job that reads from an Amazon RDS for PostgreSQL database using a JDBC connection. The job fails with the error 'java.sql.SQLException: No suitable driver'. Which TWO actions should the engineer take to resolve this issue? (Select TWO.)

Question 16 (hard, multiple choice)

A data engineer is troubleshooting an AWS Lake Formation permissions issue. A user is able to query an Amazon Athena table but cannot see the underlying S3 data in the AWS Glue Data Catalog. The user has been granted SELECT permission on the table in Lake Formation. What is the most likely cause?

Question 17 (hard, multiple choice)

A data engineer is troubleshooting an AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but writes duplicate rows into Redshift. The source data is static and does not contain duplicates. Which configuration change is most likely to resolve this issue?

Question 18 (hard, multi-select)

A data engineer is setting up an Amazon Redshift cluster for a data warehouse. The cluster will store historical sales data and support complex analytical queries. To optimize query performance and manage storage, the engineer needs to choose appropriate distribution styles and sort keys for a large fact table 'sales' and several dimension tables. Which TWO of the following design decisions are BEST practices?

Question 19 (hard, multiple choice)

A data engineering team uses Amazon Redshift for analytics. They notice that queries on a large fact table are slow. The table is distributed using DISTSTYLE ALL. Which design change would most likely improve query performance?
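
A quick back-of-the-envelope calculation helps when weighing distribution styles in a scenario like this one. The sketch below is a rough illustration only: `rows_stored` is a hypothetical helper, the row count and node count are invented, and it ignores compression, skew, and AUTO distribution. It relies on the fact that `DISTSTYLE ALL` keeps a full copy of the table on every node, while `KEY` and `EVEN` spread a single copy across the cluster.

```python
def rows_stored(table_rows: int, node_count: int, diststyle: str) -> int:
    """Estimate the total rows physically stored across the cluster."""
    style = diststyle.upper()
    if style == "ALL":
        # Every node holds the whole table.
        return table_rows * node_count
    if style in {"KEY", "EVEN"}:
        # One copy, split across the nodes.
        return table_rows
    raise ValueError(f"unsupported distribution style: {diststyle}")


# Hypothetical fact table: 1 billion rows on a 4-node cluster.
print(rows_stored(1_000_000_000, 4, "ALL"))  # 4000000000
print(rows_stored(1_000_000_000, 4, "KEY"))  # 1000000000
```

The multiplied storage also means every load or update has to be applied on every node, which is why `ALL` is generally reserved for small, slowly changing tables.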

Question 20 (hard, multi-select)

A data engineer is designing a data lake on Amazon S3 with AWS Lake Formation. The data lake contains personally identifiable information (PII). The company has a policy that only users who have completed data privacy training can access the PII data. The training status is stored in an external identity provider (IdP) as an attribute. The data engineer needs to enforce this policy using Lake Formation. Which THREE steps should the data engineer take? (Choose THREE.)

These DEA-C01 practice questions are part of Courseiva's free Amazon Web Services certification practice question bank. Courseiva provides original exam-style DEA-C01 questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.