The AWS Data Engineer Associate (DEA-C01) exam validates your ability to design, build, and maintain data pipelines and analytics infrastructure on AWS. It bridges the gap between developer and data analyst: you need to know how to ingest data at scale, store it efficiently, transform it with the right tools, and make it available for analysis. If you build the data infrastructure that data scientists and analysts consume, this exam certifies that foundational engineering capability.
Data ingestion patterns form the foundation of any data engineering architecture.

- Kinesis Data Streams: real-time streaming — shards are the unit of capacity (1 MB/s write, 2 MB/s read per shard), retention up to 365 days, replay capability for late consumers.
- Kinesis Data Firehose: managed delivery to S3, Redshift, OpenSearch, and Splunk — buffered (by size or time), optional inline Lambda transformation, no consumer management required.
- Kinesis Data Analytics (now Amazon Managed Service for Apache Flink): real-time SQL or Flink processing on streams — window functions, anomaly detection, aggregations.
- Amazon MSK (Managed Streaming for Apache Kafka): fully managed Kafka — bring existing Kafka workloads to AWS, use Kafka Connect for source/sink integrations, MSK Serverless for variable throughput.
- AWS Glue: serverless ETL — Glue Crawlers automatically discover schemas and populate the Data Catalog, Glue ETL jobs run PySpark or Python Shell scripts for transformation, Glue DataBrew offers visual no-code data preparation.
- AWS Glue Data Catalog: central metadata repository — tables, partitions, schemas — integrated with Athena, Redshift Spectrum, and EMR for schema-on-read queries.
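Shard capacity drives stream sizing: a stream must have enough shards to satisfy the write rate, the record rate, and the aggregate read rate at once. A minimal sketch of that calculation, using the per-shard limits above (plus the per-shard write quota of 1,000 records/s) and a hypothetical workload:

```python
import math

# Per-shard Kinesis Data Streams limits: 1 MB/s or 1,000 records/s
# on writes, 2 MB/s on reads (per registered consumer).
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0

def shards_needed(write_mb_s: float, records_s: int, read_mb_s: float) -> int:
    """Return the minimum shard count that satisfies all three limits."""
    by_write = math.ceil(write_mb_s / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_s / WRITE_RECORDS_PER_SHARD)
    by_read = math.ceil(read_mb_s / READ_MB_PER_SHARD)
    return max(by_write, by_records, by_read, 1)

# Example: 5 MB/s of ~2 KB events (2,560 records/s), one consumer reading 5 MB/s.
print(shards_needed(5.0, 2560, 5.0))  # 5 (write throughput is the binding limit)
```

Note that the write-throughput constraint, not the record count or read side, dictates the shard count here; resharding changes the answer only when a different limit binds.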
Storage layer decisions for data engineering.

- S3 data lake: the foundation — use S3 Intelligent-Tiering for cost optimisation, partition data by date or category (year=2025/month=01/day=15/) for Athena and Redshift Spectrum query efficiency, and enable S3 Event Notifications to trigger Glue or Lambda on new file arrival.
- Apache Iceberg on S3: open table format for the data lake — ACID transactions on S3 files, time travel (query previous snapshots), schema evolution without rewriting data — supported natively by Athena, Glue, and EMR.
- Amazon Redshift: fully managed columnar data warehouse — Redshift Serverless for variable workloads (pay per RPU-hour), provisioned clusters for consistent high-volume workloads. Distribution styles: KEY (co-locate matching rows for join performance), EVEN (round-robin, balanced load), ALL (replicate small dimension tables to every node). Sort keys: COMPOUND (sequential queries on leading columns), INTERLEAVED (multiple columns equally weighted).
- Redshift Spectrum: query S3 data directly from Redshift SQL — no data movement, separates compute from storage.
- DynamoDB Streams + Kinesis Data Streams integration: capture all DynamoDB changes for downstream processing.
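The Hive-style `year=/month=/day=` layout only prunes partitions if every writer builds keys identically (zero-padded, lowercase). A small sketch of a prefix builder, with the bucket and table path as hypothetical examples:

```python
from datetime import date

def partition_prefix(table_root: str, d: date) -> str:
    """Build a Hive-style year=/month=/day= S3 prefix so Athena and
    Redshift Spectrum can prune partitions instead of scanning everything.
    Months and days are zero-padded so string ordering matches date order."""
    return f"{table_root}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

# "s3://my-lake/events" is a placeholder for your data lake location.
print(partition_prefix("s3://my-lake/events", date(2025, 1, 15)))
# s3://my-lake/events/year=2025/month=01/day=15/
```

Every object written under that prefix then lands in a partition a Glue Crawler can register, and a `WHERE year = 2025 AND month = 1` predicate skips all other prefixes.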
Transformation and orchestration are the glue of data engineering.

- AWS Glue workflows: chain crawlers and jobs into dependency graphs — triggered on a schedule or by events.
- AWS Step Functions: orchestrate multi-step workflows involving Lambda, Glue, EMR, and ECS — Standard workflows (audit trail, executions up to 1 year), Express workflows (high volume, up to 5 minutes — for streaming pipelines).
- Amazon EMR: managed Hadoop/Spark/Hive/Presto clusters — ephemeral clusters for cost efficiency (terminate after the job completes), EMR Serverless for fully managed compute. Use EMR for large-scale Spark transformations that exceed Glue's capabilities (complex ML feature engineering, custom Spark configurations).
- Amazon Athena: serverless SQL on S3 — pay per TB scanned, so use partitioning and columnar formats (Parquet, ORC) to reduce scan volume and cost. Athena Federated Query: query data in RDS, DynamoDB, and on-premises sources alongside S3 via Lambda data source connectors.
- AWS Lake Formation: security and governance layer for the data lake — column-level and row-level permissions on Glue Data Catalog tables, cross-account data sharing with Lake Formation permissions.
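Athena's pay-per-TB-scanned model is why partitioning and columnar formats matter financially, not just for latency. A quick sketch of the cost arithmetic, assuming the widely published $5-per-TB list price (verify current regional pricing) and illustrative scan sizes:

```python
PRICE_PER_TB = 5.00  # assumed Athena list price per TB scanned; check current pricing

def athena_query_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB) -> float:
    """Estimate the cost of one Athena query from bytes scanned (1 TB = 1024**4 bytes)."""
    return (bytes_scanned / 1024**4) * price_per_tb

# Hypothetical comparison: a full scan of 500 GB of raw CSV vs. the same query
# hitting 5 GB after partition pruning plus Parquet column projection.
full_scan = athena_query_cost(500 * 1024**3)
pruned = athena_query_cost(5 * 1024**3)
print(f"${full_scan:.2f} vs ${pruned:.4f}")  # $2.44 vs $0.0244
```

Per query the difference looks small; multiplied across thousands of dashboard refreshes a day, pruning is the dominant cost lever.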
Production data engineering requires quality and governance controls.

- AWS Glue Data Quality: define data quality rules using DQDL (Data Quality Definition Language) — completeness, uniqueness, accuracy, and consistency rules — run as part of Glue ETL jobs.
- Deequ: an open-source data quality library for Spark, maintained by AWS — used on EMR for programmatic quality checks.
- Data lineage: AWS Glue lineage tracking records data origins and transformations automatically — visualised in the Glue console.
- Security: S3 bucket policies and IAM for data lake access control, Lake Formation for fine-grained column and row permissions, KMS CMK encryption for S3 and Redshift, VPC endpoints for private access to S3 and Redshift from EMR and Glue.
- Monitoring: CloudWatch metrics for Kinesis (GetRecords.IteratorAgeMilliseconds tracks consumer lag — a high age means consumers are falling behind), Glue job metrics (bytes read/written, error counts), Redshift query monitoring rules (alert on long-running or memory-intensive queries).
- AWS CloudTrail data events: log every S3 object access and DynamoDB API call — essential for data audit compliance.
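The reason `GetRecords.IteratorAgeMilliseconds` matters is its relationship to retention: once the oldest unread record's age reaches the stream's retention window, records expire before consumers ever see them. A minimal sketch of an alerting check; the 50% warning threshold is an illustrative assumption, not an AWS recommendation:

```python
def consumer_lag_status(iterator_age_ms: int, retention_hours: int = 24,
                        warn_fraction: float = 0.5) -> str:
    """Classify a GetRecords.IteratorAgeMilliseconds reading against the
    stream's retention window. warn_fraction (0.5 here) is an arbitrary
    illustrative threshold; tune it to your recovery time."""
    retention_ms = retention_hours * 3600 * 1000
    if iterator_age_ms >= retention_ms:
        return "DATA_LOSS"        # records are already expiring unread
    if iterator_age_ms >= warn_fraction * retention_ms:
        return "FALLING_BEHIND"   # consumers lag badly; scale out or fix them
    return "OK"

print(consumer_lag_status(13 * 3600 * 1000))  # FALLING_BEHIND (13 h lag vs 24 h retention)
```

In practice you would put this threshold into a CloudWatch alarm on the metric rather than polling it yourself; the arithmetic for choosing the threshold is the same.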
AWS Glue and Amazon EMR do the same transformation work
Glue is managed Spark with a simplified development model — great for standard ETL. EMR gives full control over Spark, Hive, Presto, and Hadoop — needed for complex ML feature engineering, custom Spark configurations, or workloads that need fine-grained cluster tuning.
Storing all data in S3 in CSV format is sufficient for analytics
CSV is row-oriented and uncompressed — Athena scans the entire file regardless of the columns you query. Columnar formats like Parquet store data by column, compressed, and support predicate pushdown. A Parquet file is typically 10-100x cheaper to query in Athena than the equivalent CSV.
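A back-of-the-envelope model shows where that 10-100x comes from: a row format forces a scan of every byte, while a columnar format reads only the requested column chunks, already compressed. The column counts and 4:1 compression ratio below are illustrative assumptions, not measured figures:

```python
def bytes_scanned(total_bytes: int, cols_needed: int, total_cols: int,
                  columnar: bool, compression_ratio: float = 1.0) -> int:
    """Rough model of query scan volume. Row formats (CSV) read everything;
    columnar formats (Parquet) read only the needed columns' chunks,
    shrunk further by compression. Ignores row-group pruning, which
    typically helps columnar formats even more."""
    if not columnar:
        return total_bytes
    return int(total_bytes * (cols_needed / total_cols) / compression_ratio)

# Hypothetical: a 100 GB table, querying 2 of 20 columns, 4:1 Parquet compression.
csv = bytes_scanned(100 * 10**9, 2, 20, columnar=False)
parquet = bytes_scanned(100 * 10**9, 2, 20, columnar=True, compression_ratio=4)
print(csv // parquet)  # 40x less data scanned, hence ~40x cheaper in Athena
```

Predicate pushdown on sorted row groups widens the gap further, which is how real workloads reach the top of the 10-100x range.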
Kinesis Data Streams and Kinesis Data Firehose are interchangeable
Kinesis Data Streams is a low-latency streaming buffer you consume directly with custom consumers (Lambda, KCL app, Flink). Firehose is a managed delivery service — it buffers and delivers to a destination (S3, Redshift, Splunk) without you writing consumer code. Choose Streams when you need real-time processing; Firehose when you need managed near-real-time delivery.