MLS-C01 · topic practice

Data Engineering practice questions

Practise the Data Engineering domain of the AWS Certified Machine Learning Specialty (MLS-C01) exam with original exam-style scenarios, answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security
20 questions · Domain: Data Engineering

What the exam tests

What to know about Data Engineering

Data Engineering questions test whether you can apply a concept in context, not just recognise a definition. They focus on:

  • How the topic appears in realistic exam-style scenarios.
  • Which detail in the question changes the correct answer.
  • How to eliminate plausible but wrong options.
  • How to connect the question back to the wider exam objective.

Watch out for

Common Data Engineering exam traps

  • Answering from memory before reading the full scenario.
  • Missing a constraint such as cost, availability, security, scope or command context.
  • Choosing a broad answer when the question asks for the most specific fix.
  • Ignoring why the wrong options are tempting.

Practice set

Data Engineering questions

20 questions · select your answer, then reveal the explanation

Question 1 · medium · multiple choice

A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by a Lambda function that writes records to an S3 bucket. Recently, the number of shards was increased from 2 to 4 to handle higher throughput. After the change, the Lambda function started processing records with increased latency and some records were being written out of order. What is the MOST likely cause?
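
As background for this scenario: Kinesis Data Streams assigns a record to a shard by hashing its partition key with MD5 into a 128-bit integer and picking the shard whose hash key range contains that value, and ordering is only preserved within a single shard. The sketch below simulates that mapping. It assumes the hash space is split evenly between shards (real ranges depend on how the stream was resharded), and the function name `shard_for_key` and the sample keys are hypothetical. It shows how the key-to-shard mapping shifts when the shard count changes; it does not by itself settle the answer.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is a 128-bit integer


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the (evenly split, assumed) hash range holding the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the last shard absorbs any remainder of the hash space.
    return min(hash_value // range_size, shard_count - 1)


# Hypothetical partition keys, mapped before and after going from 2 to 4 shards.
keys = [f"session-{i}" for i in range(8)]
before = {k: shard_for_key(k, 2) for k in keys}
after = {k: shard_for_key(k, 4) for k in keys}

for k in keys:
    print(k, "shard", before[k], "->", after[k])
```

Records that share a partition key always land in the same shard under a given layout, which is why per-key ordering depends on the consumer handling parent and child shards correctly after a reshard.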

A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?

Question 4 · medium · multiple choice

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be partitioned by year, month, and day. The delivery stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The data producer sends about 1 MB per second. The data is arriving in S3 but the partitions are not being created as expected. What is the MOST likely reason?
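
The numbers in this scenario are worth working through before reading the options. Firehose flushes a buffer when either the size or the interval threshold is reached, whichever comes first. The sketch below computes which threshold fires first for the stated rate and shows what a year/month/day key prefix could look like. The Hive-style `year=/month=/day=` layout and both function names are assumptions for illustration; the scenario does not say which prefix layout the team expects.

```python
from datetime import datetime, timezone
from typing import Tuple


def first_buffer_trigger(rate_mb_per_s: float, size_mb: float, interval_s: float) -> Tuple[str, float]:
    """Return which buffer threshold is hit first and after how many seconds."""
    seconds_to_fill = size_mb / rate_mb_per_s
    if seconds_to_fill < interval_s:
        return ("size", seconds_to_fill)
    return ("interval", interval_s)


def date_partition_prefix(ts: datetime) -> str:
    """Build a Hive-style date prefix (assumed layout) for an S3 object key."""
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


# Values from the scenario: about 1 MB/s, 5 MB buffer size, 60 s buffer interval.
trigger, seconds = first_buffer_trigger(1.0, 5.0, 60.0)
print(trigger, seconds)  # size 5.0

print(date_partition_prefix(datetime(2024, 3, 7, tzinfo=timezone.utc)))
# year=2024/month=03/day=07/
```

At this rate the size threshold is reached roughly every 5 seconds, so objects are delivered far more often than the 60-second interval suggests. Whether that matters for the partitioning problem depends on how the prefix is configured.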

An ML team is building a recommendation system. The training data includes user-item interactions stored in Amazon DynamoDB. The team wants to export this data to S3 in Parquet format for use with Amazon SageMaker. The export should be incremental (only new or changed records) and run daily. Which approach meets these requirements with MINIMAL operational overhead?

A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses a ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?
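
For orientation, the input mode is set per channel in the training job's input data configuration, and SageMaker accepts `File`, `FastFile`, and `Pipe`. The sketch below builds a channel dictionary in the shape used by the `CreateTrainingJob` API without calling AWS. The helper name `training_channel` and the bucket URI are hypothetical, and the loop simply lists every mode rather than recommending one.

```python
from typing import Any, Dict

VALID_INPUT_MODES = {"File", "FastFile", "Pipe"}


def training_channel(s3_uri: str, input_mode: str, channel_name: str = "train") -> Dict[str, Any]:
    """Build one InputDataConfig channel entry; raises on an unknown input mode."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"unknown input mode: {input_mode!r}")
    return {
        "ChannelName": channel_name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Hypothetical bucket; print the channel's mode for each valid option.
for mode in sorted(VALID_INPUT_MODES):
    channel = training_channel("s3://example-bucket/train/", mode)
    print(channel["ChannelName"], channel["InputMode"])
```

When comparing the options, check which modes copy the data to instance storage before training starts and which ones stream it from S3.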

Question 7 · medium · multiple choice

A company uses AWS Glue ETL jobs to process data from multiple sources. The job fails with the error: 'An error occurred while calling o123.pyWriteDynamicFrame. Insufficient memory.' The job runs on a G.1X worker type with 10 workers. What should be changed to resolve this error?

A company uses Amazon Redshift as a data warehouse. They need to load 50 TB of clickstream data from S3 into Redshift daily. The data arrives in 5-minute intervals as gzipped CSV files. The target table has a sort key and a distribution key. The load must complete within 2 hours. Which approach is MOST efficient?
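
A quick back-of-the-envelope calculation shows the scale this scenario implies. The sketch below derives the sustained throughput needed to load 50 TB in 2 hours and the number of 5-minute arrival windows per day. It assumes decimal units (1 TB = 10^12 bytes), which the scenario does not specify, and the function names are hypothetical.

```python
TB = 10 ** 12  # decimal terabyte (assumption: the scenario does not say TB vs TiB)
GB = 10 ** 9


def required_throughput_gb_per_s(total_bytes: int, window_hours: float) -> float:
    """Sustained rate needed to move total_bytes within the load window."""
    return total_bytes / (window_hours * 3600) / GB


def arrival_windows_per_day(interval_minutes: int) -> int:
    """Number of arrival intervals in 24 hours."""
    return (24 * 60) // interval_minutes


print(round(required_throughput_gb_per_s(50 * TB, 2), 2))  # 6.94
print(arrival_windows_per_day(5))                          # 288
```

Sustaining roughly 7 GB/s means the load has to run in parallel across the cluster, so the way files are split and presented to the load is the detail to watch in the answer choices.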

A machine learning engineer needs to process a large dataset that does not fit on a single Amazon SageMaker notebook instance's EBS volume. The data is stored in S3. What is the MOST efficient way to access the data from the notebook?

Question 10 · medium · multiple choice

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to a sink. The application is failing with an 'OutOfMemoryError'. The application has parallelism set to 4 and uses 1 Kinesis Processing Unit (KPU). What is the MOST likely cause and solution?

An organization stores sensitive customer data in S3. A data pipeline uses AWS Glue to transform the data and load it into Amazon Redshift. The security team requires that data be encrypted at rest in S3 and in transit between S3 and Glue, and between Glue and Redshift. Which configuration meets these requirements?

Question 12 · medium · multiple choice

A data science team is building a real-time fraud detection system. Transactions are streamed via Amazon Kinesis Data Streams, and a Lambda function performs feature engineering and invokes an Amazon SageMaker endpoint for predictions. The team notices that the Lambda function is timing out and causing data loss. Which solution should the team implement to process the stream reliably and at low latency?

A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?

Question 14 · easy · multiple choice

A data engineer is building a data pipeline to process user clickstream data. The data arrives as JSON files in an S3 bucket. The pipeline must transform the JSON into Parquet format and partition by date and event type, then make the data available for Amazon Athena queries. The engineer needs a fully managed, serverless solution with minimal operational overhead. Which combination of AWS services should the engineer use?

Question 15 · medium · multiple choice

A team is using Amazon SageMaker to train a model on a dataset that is 500 GB in size, stored as CSV files in S3. The training job takes 2 hours using a single ml.p3.2xlarge instance. The team wants to reduce training time to under 30 minutes. The model architecture supports distributed training. Which solution will achieve this goal with the LEAST amount of code changes?
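
The target in this scenario is a speed-up of more than 4x (2 hours down to under 30 minutes). The sketch below estimates the smallest instance count that reaches it under a simple scaling model in which each extra instance adds a fixed fraction of one instance's throughput. That model, the `efficiency` parameter, and the function name are assumptions for illustration; real distributed training rarely scales this cleanly.

```python
def min_instances(current_minutes: float, target_minutes: float, efficiency: float = 1.0) -> int:
    """Smallest instance count whose estimated time is strictly under the target.

    Assumed model: speedup(n) = 1 + (n - 1) * efficiency.
    """
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    n = 1
    while current_minutes / (1 + (n - 1) * efficiency) >= target_minutes:
        n += 1
    return n


print(min_instances(120, 30))       # 5 with ideal linear scaling
print(min_instances(120, 30, 0.7))  # 6 if each extra instance adds 70%
```

Four instances with perfect scaling would land at exactly 30 minutes, which does not satisfy "under 30 minutes", so the estimate starts at five.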

A company processes large streams of IoT sensor data using Amazon Kinesis Data Streams with 100 shards. Each sensor reading is about 1 KB. The data is consumed by an Amazon EMR cluster running Spark Streaming jobs. The team notices that the Spark Streaming job's processing time is gradually increasing, and the stream is falling behind. They suspect the issue is due to skewed data distribution across shards. Which approach should the team take to diagnose and resolve the issue?
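
Diagnosing skew usually starts with per-shard traffic figures, for example from shard-level CloudWatch metrics once enhanced monitoring is enabled on the stream. The sketch below takes such per-shard record counts and flags shards that receive well above the average. The counts are made up, and the `hot_shards` helper and its 2x threshold are hypothetical choices rather than an AWS-defined rule.

```python
from statistics import mean
from typing import Dict, List


def hot_shards(records_per_shard: Dict[str, int], threshold: float = 2.0) -> List[str]:
    """Return shard IDs whose record count exceeds threshold x the mean."""
    if not records_per_shard:
        return []
    average = mean(records_per_shard.values())
    return sorted(s for s, count in records_per_shard.items() if count > threshold * average)


# Made-up sample: 100 shards, one of which receives far more records.
sample = {f"shardId-{i:012d}": 1000 for i in range(100)}
sample["shardId-000000000007"] = 40000

print(hot_shards(sample))  # ['shardId-000000000007']
```

If a few shards stand out like this, the partition key is the usual suspect, since every record with the same key is routed to the same shard.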

Question 17 · medium · multi-select

A data engineering team is designing a data lake on AWS for machine learning workloads. The data includes structured, semi-structured, and unstructured data. The team needs to ensure that the data is cataloged, easily discoverable, and can be queried by Amazon Athena and Amazon EMR. The team also wants to enforce fine-grained access control at the column and row level for sensitive data. Which combination of AWS services should the team use? (Select TWO.)

A company is building a real-time anomaly detection system for network traffic logs. The logs are ingested via Amazon Kinesis Data Streams and processed with an Amazon SageMaker endpoint for inference. The team needs to ensure that the inference results are stored durably and can be replayed for model retraining. The system must handle at least 10,000 records per second with low latency. Which three AWS services should the team use to build this architecture? (Select THREE.)
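
The 10,000 records per second figure can be turned into a rough shard estimate. In provisioned mode, each Kinesis shard accepts up to 1,000 records per second or 1 MB per second of writes, whichever limit is hit first. The sketch below applies both limits. The 1 KB and 2 KB record sizes are assumptions because the scenario does not state one, 1 MB is treated as 1,024 KB, and the function name is hypothetical.

```python
import math

RECORDS_PER_SHARD_PER_S = 1000  # per-shard write limit (records)
KB_PER_SHARD_PER_S = 1024       # per-shard write limit (1 MB/s, taken as 1,024 KB)


def min_shards_for_writes(records_per_s: int, record_kb: float) -> int:
    """Minimum shard count that satisfies both per-shard write limits."""
    by_count = math.ceil(records_per_s / RECORDS_PER_SHARD_PER_S)
    by_bytes = math.ceil(records_per_s * record_kb / KB_PER_SHARD_PER_S)
    return max(by_count, by_bytes)


print(min_shards_for_writes(10_000, 1.0))  # 10
print(min_shards_for_writes(10_000, 2.0))  # 20
```

This is a floor rather than a sizing recommendation: uneven partition keys or traffic spikes call for headroom above it.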

A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?

Question 20 · medium · multiple choice

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?

Focused Data Engineering sessions

Start a Data Engineering only practice session

Every question in these sessions is drawn from the Data Engineering domain — nothing else.

Related practice questions

Related MLS-C01 topic practice pages

Move into related areas when this topic feels solid.

Frequently asked questions

What does the MLS-C01 exam test about Data Engineering?
Data Engineering questions test whether you can apply the concept in context, not just recognise a definition.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Data Engineering questions in a focused session?
Yes — the session launcher on this page draws every question from the Data Engineering domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other MLS-C01 topics?
Use the topic links above to move to related areas, or go back to the MLS-C01 question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the MLS-C01 exam covers. They are not copied from any real exam or dump site.