Sample questions
AWS Certified Data Engineer Associate DEA-C01 practice questions
A data pipeline uses Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data volume spikes occasionally, causing the Firehose buffer to fill up and leading to increased delivery latency. The latency must remain under 60 seconds. What should be done to minimize latency?
- A
Enable GZIP compression on the Firehose delivery stream.
Why wrong: Compression adds CPU overhead and may increase latency.
- B
Increase the buffer size to 128 MB to accommodate larger batches.
Why wrong: Larger buffer size increases latency as it waits to fill.
- C
Switch to Kinesis Data Streams with a Lambda consumer.
Why wrong: This changes architecture but does not directly reduce Firehose latency.
- D
Reduce the buffer interval to 60 seconds.
This forces delivery every 60 seconds, meeting the latency requirement.
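As a rough illustration of why the buffer interval bounds latency here: Firehose delivers a buffer when either the size threshold or the interval is reached, whichever comes first. The sketch below models only that rule; the function name and the throughput figure are hypothetical, not taken from the question.

```python
def seconds_until_flush(buffer_size_mb: float, buffer_interval_s: float,
                        throughput_mb_per_s: float) -> float:
    """Time until a buffer is delivered under the 'whichever comes first' rule."""
    if throughput_mb_per_s <= 0:
        # Nothing is filling the buffer, so only the interval applies.
        return buffer_interval_s
    time_to_fill = buffer_size_mb / throughput_mb_per_s
    return min(time_to_fill, buffer_interval_s)


# Hypothetical workload: 128 MB buffer, 0.5 MB/s of incoming data.
print(seconds_until_flush(128, 300, 0.5))  # 256.0 (buffer fills before the interval)
print(seconds_until_flush(128, 60, 0.5))   # 60 (interval triggers first)
```

With a 60-second interval the wait can never exceed 60 seconds, regardless of how large the size threshold is.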
An e-commerce company uses AWS Glue to run ETL jobs that transform clickstream data from Amazon S3. The job reads Parquet files, performs aggregations, and writes the results to Amazon Redshift. The job runs successfully but takes longer than expected. The data volume is increasing. Which design change would MOST improve the job's performance?
- A
Write the aggregated results to a single large file instead of multiple partitions.
Why wrong: Single file reduces parallelism and increases shuffle overhead.
- B
Convert the Parquet files to CSV to simplify the schema.
Why wrong: CSV is less efficient than Parquet for columnar storage and compression.
- C
Replace the Redshift target with Amazon Redshift Spectrum.
Why wrong: Spectrum is for querying S3, not for loading transformed data into Redshift.
- D
Increase the number of Glue worker nodes (DPUs) for the job.
More workers parallelize tasks and reduce runtime.
A data engineering team uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. They notice that the application's checkpointing is failing intermittently, causing data reprocessing. The application uses a large state. Which configuration change should the team make to improve checkpoint reliability?
- A
Disable checkpointing to avoid failures.
Why wrong: Disabling checkpointing removes fault tolerance.
- B
Switch the state backend from in-memory to RocksDB.
Why wrong: RocksDB helps manage large state but does not directly fix checkpoint failures.
- C
Increase the parallelism of the application.
Why wrong: Higher parallelism can increase checkpoint overhead.
- D
Increase the checkpointing interval.
Longer intervals reduce checkpoint frequency and associated failures.
A company uses AWS Glue to process streaming data from Amazon Kinesis Data Streams. The job reads JSON records and writes Parquet to Amazon S3. Recently, the job started failing with 'Out of Memory' errors. Which change is MOST likely to resolve the issue?
- A
Enable compression on the Kinesis stream.
Why wrong: Compression on Kinesis does not affect Glue memory.
- B
Change the output format from Parquet to ORC.
Why wrong: ORC is similar to Parquet; format change does not fix OOM.
- C
Increase the number of DPUs allocated to the Glue job.
More DPUs provide more memory and CPU.
- D
Reduce the streaming batch size in the Glue job configuration.
Why wrong: Reducing batch size can help but does not address the root cause of insufficient memory.
A data engineer is designing a serverless data ingestion pipeline that uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed using AWS Lambda before being written to S3. Which two steps are required to enable this transformation? (Select TWO.)
- A
Set up an S3 event notification to trigger the Lambda function on object creation.
Why wrong: This triggers Lambda after data is written, not before.
- B
Configure a Lambda function as a data transformation source in the Firehose delivery stream.
This enables Firehose to invoke Lambda for transformation.
- C
Ensure the Lambda function returns the transformed data in the format required by Firehose.
Firehose expects a specific response format from the Lambda function.
- D
Subscribe the Lambda function to the CloudWatch Logs log group for the Firehose stream.
Why wrong: This is for monitoring, not for data transformation.
- E
Have the Lambda function write the transformed data directly to the S3 bucket.
Why wrong: This bypasses Firehose and is not the intended design.
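To make the "format required by Firehose" concrete, here is a minimal sketch of a transformation handler. Each returned record carries the original recordId, a result of Ok, Dropped, or ProcessingFailed, and base64-encoded data. The transformation itself (adding a processed flag) is a hypothetical placeholder.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the expected shape."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: flag the record as processed.
            payload["processed"] = True
            encoded = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": encoded}
            )
        except (ValueError, TypeError):
            # Undecodable or non-object payloads are returned unchanged and marked failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Every incoming recordId must appear in the response, which is why failed records are still returned rather than skipped.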
A company runs a nightly AWS Glue ETL job that reads from a JDBC source (PostgreSQL) and writes to S3 in Parquet format. The job takes over 6 hours, but the SLA requires completion within 4 hours. The source table has 500 million rows and is updated frequently. Which approach will most reliably reduce job duration?
- A
Enable job bookmark and schedule the job to run more frequently.
Why wrong: Bookmarks help with incremental processing but not with the initial full load.
- B
Use multiple JDBC connections in parallel by setting 'hashexpression' and 'hashfield'.
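Why wrong: Setting 'hashfield' or 'hashexpression' lets Glue read the table over parallel JDBC connections, which can speed up extraction, but every run still reads all 500 million rows, so it does not reduce the amount of data processed.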
- C
Partition the source table by year and use pushdown predicates in the Glue job.
This reduces the data scanned by filtering on partition columns.
- D
Increase the number of DPUs for the Glue job to 100.
Why wrong: More DPUs can help but are limited by source parallelism and may not reduce time enough.
Match each AWS database service to its primary use case.
Relational database with managed operations
NoSQL key-value and document database
In-memory caching for low latency
Graph database for connected data
Time-series data for IoT and analytics
A company uses Amazon DynamoDB with on-demand capacity. They notice higher than expected costs due to a sudden spike in read traffic from a reporting job. The reporting job scans the entire table daily. What is the most cost-effective way to reduce costs while maintaining the same reporting output?
- A
Enable DynamoDB Accelerator (DAX) for caching.
Why wrong: DAX adds cost and may not eliminate scans fully.
- B
Use a Global Secondary Index (GSI) with a sort key that matches the reporting query pattern.
A GSI allows efficient querying instead of scanning, reducing read costs.
- C
Set a TTL attribute to automatically expire old data.
Why wrong: TTL removes old data but does not optimize the reporting query.
- D
Reduce the read capacity units (RCU) in the table.
Why wrong: On-demand capacity cannot be manually reduced; it auto-scales.
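The cost difference between a full scan and a targeted query can be estimated with simple arithmetic. In on-demand mode, one read request unit covers a strongly consistent read of up to 4 KB, and an eventually consistent read costs half as much. The sketch below is an estimate only; the table and result sizes are hypothetical.

```python
import math


def read_request_units(bytes_read: int, strongly_consistent: bool = False) -> float:
    """Estimate read request units for an operation that reads bytes_read."""
    units = math.ceil(bytes_read / 4096)
    return units if strongly_consistent else units / 2


GB = 1024 ** 3
MB = 1024 ** 2

# Hypothetical sizes: a 50 GB table versus a query that touches 500 MB.
full_scan = read_request_units(50 * GB)
targeted_query = read_request_units(500 * MB)
print(full_scan, targeted_query, full_scan / targeted_query)
```

Because reads are billed on the data read rather than the data returned, a query against a suitable index costs far less than scanning the whole table.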
A data engineer needs to migrate an on-premises MySQL database to Amazon RDS for MySQL with minimal downtime. Which approach should they use?
- A
Use mysqldump to export the database and import into RDS.
Why wrong: This requires significant downtime for large databases.
- B
Use AWS Database Migration Service (DMS) with ongoing replication from the source database.
DMS with ongoing replication minimizes downtime by continuously syncing changes.
- C
Create an RDS read replica and promote it.
Why wrong: Read replica is for RDS instances, not on-premises.
- D
Use AWS Schema Conversion Tool (SCT) to convert the schema and then copy data.
Why wrong: SCT is for heterogeneous migrations, not for minimal downtime.
A data engineer attaches the IAM policy shown in the exhibit to an IAM user. The user tries to download an object from my-bucket using the AWS CLI without specifying SSE headers. The object is stored with SSE-S3. Will the download succeed?
Exhibit
Refer to the exhibit.
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
```
- A
No, because the object is encrypted and the user does not have decrypt permission.
Why wrong: SSE-S3 does not require separate decrypt permission.
- B
No, because the request does not include the required encryption header.
The condition requires the request to have x-amz-server-side-encryption: AES256.
- C
Yes, because the object is encrypted with SSE-S3, which uses AES256.
Why wrong: The condition is on the request header, not the object encryption.
- D
Yes, because the policy allows s3:GetObject on the bucket.
Why wrong: The condition restricts the Allow.
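The reasoning can be sketched with a heavily simplified model of how a StringEquals condition is evaluated: if the key is absent from the request context, the condition does not match, the Allow statement does not apply, and the request falls through to an implicit deny. This is an illustration only, not the real IAM evaluation engine, and the function name is hypothetical.

```python
def string_equals_matches(condition: dict, request_context: dict) -> bool:
    """Return True only if every StringEquals key is present and equal."""
    for key, expected in condition.get("StringEquals", {}).items():
        # A key missing from the request context counts as a mismatch.
        if request_context.get(key) != expected:
            return False
    return True


condition = {"StringEquals": {"s3:x-amz-server-side-encryption": "AES256"}}

# A download sent without SSE headers: the key is not in the context.
download_request = {}
# Hypothetical request that does carry the header.
request_with_header = {"s3:x-amz-server-side-encryption": "AES256"}

print(string_equals_matches(condition, download_request))     # False
print(string_equals_matches(condition, request_with_header))  # True
```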
A data engineer is designing a data ingestion pipeline for IoT sensor data. The data arrives as JSON via AWS IoT Core, and must be stored in Amazon S3 in partitioned Parquet format. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which combination of services should the engineer use?
- A
Amazon Kinesis Data Streams with AWS Lambda for transformation and Amazon S3.
Why wrong: Requires custom exactly-once logic; Lambda has no native Parquet conversion.
- B
Amazon Simple Queue Service (SQS) with AWS Lambda for transformation and Amazon S3.
Why wrong: SQS does not natively convert to Parquet or partition data.
- C
AWS Glue streaming jobs consuming from Amazon Kinesis Data Streams and writing to Amazon S3.
Why wrong: Glue streaming jobs are more complex and do not directly integrate with IoT Core.
- D
Amazon Kinesis Data Firehose with data transformation via AWS Lambda, delivering to Amazon S3.
Firehose supports Parquet conversion and partitioning; Lambda handles transformation.
A data engineer needs to transform JSON data from an S3 bucket using AWS Glue. The JSON contains nested arrays and objects. Which Glue transform is best suited for flattening nested structures?
- A
Unnest
Why wrong: Unnest is not a standard Glue transform.
- B
ResolveChoice
Why wrong: ResolveChoice handles schema ambiguities, not nesting.
- C
Relationalize
Relationalize flattens nested structures into separate DynamicFrames.
- D
Map
Why wrong: Map applies a function to each row, not for unnesting.
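To show what "flattening into separate frames" means, here is a plain-Python sketch of the idea: nested objects become dotted column names, and each array is moved into its own child table linked by a key. This is a loose imitation for illustration, not the Glue implementation; the table and column names are assumptions.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def relationalize_sketch(records, root="root"):
    """Split records into a root table plus one child table per array field."""
    tables = {root: []}
    for row_id, record in enumerate(records):
        row = {}
        for column, value in flatten(record).items():
            if isinstance(value, list):
                child = f"{root}_{column}"
                row[column] = row_id  # key that links to the child table
                for index, item in enumerate(value):
                    tables.setdefault(child, []).append(
                        {"id": row_id, "index": index, "val": item}
                    )
            else:
                row[column] = value
        tables[root].append(row)
    return tables


orders = [{"id": 7, "customer": {"name": "Ana", "address": {"city": "Lima"}},
           "items": ["a", "b"]}]
for name, rows in relationalize_sketch(orders).items():
    print(name, rows)
```

The output has one flat root table and one child table for the items array, which is the shape that loads cleanly into a relational target.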
A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3. The data volume is about 500 GB initially, with daily incremental updates of 10 GB. The pipeline must minimize operational overhead. Which AWS service should be used for the initial and incremental loads?
- A
AWS Database Migration Service (DMS) with change data capture (CDC) to Amazon S3.
DMS supports full load and CDC with low overhead.
- B
AWS Glue with a JDBC connection and incremental crawl.
Why wrong: Glue is not optimized for CDC; requires custom logic.
- C
Amazon Kinesis Data Firehose with a custom producer.
Why wrong: Firehose is for streaming, not database replication.
- D
AWS Data Pipeline with a SQL activity and HiveCopyActivity.
Why wrong: Data Pipeline is a legacy service and adds more operational overhead.
A company has a Glue ETL job that reads from an Amazon RDS for MySQL table and writes to Amazon S3. The job runs hourly and processes new records based on a 'last_modified' timestamp column. Recently, the job started missing some records because the timestamp in MySQL is stored with microsecond precision but Glue's job bookmark only tracks second precision. Which solution addresses this issue?
- A
Use a job parameter to store the last processed timestamp with microsecond precision and query records greater than that value.
This acts as a custom bookmark that keeps the full precision of the source column.
- B
Increase the job frequency to every 30 minutes.
Why wrong: Running the job more often does not fix the precision mismatch.
- C
Run a full refresh of the table each time instead of incremental.
Why wrong: Reloading the whole table every hour is inefficient and scales poorly as the table grows.
- D
Modify the MySQL table to use a DATE data type instead of TIMESTAMP.
Why wrong: A DATE column drops the time component entirely.
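The effect of the precision mismatch can be reproduced with a small in-memory example. The rows and the helper name are hypothetical; the point is that comparing at second precision skips a row modified later within the same second as the stored watermark.

```python
from datetime import datetime

rows = [
    {"id": 1, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 100000)},
    {"id": 2, "last_modified": datetime(2024, 1, 1, 12, 0, 0, 900000)},
    {"id": 3, "last_modified": datetime(2024, 1, 1, 12, 0, 1, 500)},
]


def new_ids(rows, watermark, precision="microsecond"):
    """Return ids of rows modified strictly after the watermark."""
    def trim(ts):
        # Second precision discards the fractional part before comparing.
        return ts.replace(microsecond=0) if precision == "second" else ts

    return [r["id"] for r in rows if trim(r["last_modified"]) > trim(watermark)]


# The previous run stopped after row 1.
last_seen = rows[0]["last_modified"]

print(new_ids(rows, last_seen, precision="second"))  # [3]  (row 2 is missed)
print(new_ids(rows, last_seen))                      # [2, 3]
```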
A data engineer is ingesting CSV files from an Amazon S3 bucket into a Glue Data Catalog table. The files have headers, but some files have extra columns not present in the first file. The engineer wants the Glue crawler to automatically detect the schema. Which crawler configuration option should be used?
- A
Configure the crawler to 'Inherit schema from table' and set the table name.
Why wrong: That option is for updating existing tables, not schema inference.
- B
Configure the crawler to 'Create a single schema for each S3 path' and enable 'Merge tables'.
This merges schemas from all files in the path.
- C
Configure the crawler to 'Create a single schema for each S3 path' without enabling 'Merge tables'.
Why wrong: Only uses the first file's schema.
- D
Configure the crawler to 'Create a single schema for each S3 path' and set 'Each file as a separate table'.
Why wrong: This creates a separate table for each file, which is not what the engineer wants.
A company is building a data lake on Amazon S3. Data arrives from multiple sources in JSON, CSV, and Avro formats. The data must be transformed to Parquet and partitioned by date and source. Which TWO services can perform this transformation with minimal custom code? (Choose TWO.)
- A
Amazon EMR with Spark
EMR can run Spark for large-scale transformations.
- B
AWS Lake Formation
Why wrong: Lake Formation is for data lake management, not transformation.
- C
Amazon Athena CTAS queries
Why wrong: Athena is for querying, not transformation pipelines.
- D
AWS Glue ETL jobs
Glue provides built-in transforms and can write Parquet.
- E
Amazon Kinesis Data Firehose
Why wrong: Firehose is for streaming, not batch transformation.
A data engineer is troubleshooting an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but 5% of records are missing after the load. The engineer suspects data consistency issues. Which THREE actions could help diagnose and resolve the problem? (Choose THREE.)
- A
Use the Redshift COPY command with a manifest file to load data.
Manifest file ensures all files are loaded.
- B
Increase the number of DPUs for the Glue job.
Why wrong: More DPUs improve performance but not data consistency.
- C
Enable Glue job bookmarks to track processed files.
Bookmarks prevent reprocessing or missing files.
- D
Use a staging table in Redshift with a transaction to commit.
Why wrong: Adds complexity; not a direct diagnostic step.
- E
Review the job's CloudWatch Logs for any error messages.
Logs may show partial failures.
A company uses AWS Glue to process CSV files from an S3 bucket. The job fails intermittently with a 'SchemaDetectionError' for files that have inconsistent column counts. What is the most efficient way to handle this?
- A
Use the 'mergeSchema' option when reading the DynamicFrame.
'mergeSchema' allows Glue to handle schemas that evolve over time.
- B
Convert all CSV files to Parquet format using a separate preprocessing job.
Why wrong: This adds complexity and does not solve the schema inconsistency during the initial read.
- C
Define a fixed schema in the Glue job using 'apply_mapping' to map columns.
Why wrong: This requires manual mapping and does not handle varying columns.
- D
Set the job to 'ignore' schema mismatches in the job parameters.
Why wrong: There is no such parameter; schema mismatches cause errors.
A company uses AWS Data Pipeline to copy data from DynamoDB to S3 daily. Recently, the pipeline started failing with 'ThrottlingException' errors. The DynamoDB table has on-demand capacity. Which action should be taken to resolve the issue?
- A
Increase the write capacity units of the DynamoDB table.
Why wrong: The issue is read throttling, not write.
- B
Replace Data Pipeline with AWS Glue using a DynamoDB connector.
Why wrong: Glue also reads from DynamoDB and may encounter the same throttling.
- C
Configure the pipeline to use a retry strategy with exponential backoff.
Retries with backoff alleviate throttling by slowing down requests.
- D
Disable the pipeline's retry logic and increase the timeout.
Why wrong: Without retries, the pipeline will fail on the first throttling.
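A retry strategy with exponential backoff can be sketched in a few lines. This is a generic illustration, not Data Pipeline configuration: ThrottlingError and call_with_backoff are hypothetical names, and the delays are arbitrary.

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a throttling error raised by a client call."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation, retrying on throttling with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The random jitter spreads retries from concurrent workers so they do not all hit the table again at the same moment.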
Arrange the steps to set up cross-region replication for an S3 bucket.
Arrange the steps to implement data encryption at rest for an Amazon Redshift cluster using AWS KMS.
Arrange the steps to create an AWS Glue job that transforms data from Amazon S3 to Amazon Redshift in the correct order.
Order the steps to set up an Amazon EMR cluster for processing data in S3 using Spark.
A company ingests IoT sensor data into Kinesis Data Streams. The data is then processed by a Lambda function that aggregates readings and writes to DynamoDB. The Lambda function is experiencing high error rates due to throttling. Which TWO actions would reduce throttling?
- A
Increase the number of shards in the Kinesis stream.
Why wrong: More shards increase parallelism but do not directly reduce Lambda throttling.
- B
Increase the batch size in the Lambda event source mapping.
Larger batches mean fewer invocations, reducing throttling.
- C
Decrease the batch window in the Lambda event source mapping.
Why wrong: Shorter windows increase invocations, worsening throttling.
- D
Configure DynamoDB to use on-demand capacity mode.
On-demand mode eliminates write throttling from DynamoDB.
- E
Increase the Lambda reserved concurrency to 1000.
Why wrong: Reserved concurrency sets a limit but does not prevent throttling if the account limit is exceeded.
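The effect of batch size on invocation count is simple arithmetic. The sketch below ignores the batch window and per-shard behavior, and the record counts are hypothetical.

```python
import math


def invocations_needed(record_count: int, batch_size: int) -> int:
    """Number of invocations needed to consume record_count records."""
    return math.ceil(record_count / batch_size)


# Hypothetical backlog of 10,000 records.
print(invocations_needed(10_000, 100))    # 100
print(invocations_needed(10_000, 1_000))  # 10
```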