How should I use these Data Engineering practice questions?

Read each scenario carefully and choose your answer before revealing the explanation. Then check why your choice was right or wrong. Repeat until the reasoning feels automatic.

Can I practise just Data Engineering questions in a focused session?

Yes — use the session launcher on this page to start a 10-, 20-, 30- or 50-question session drawn entirely from the Data Engineering domain.

MLS-C01 · topic practice

Data Engineering practice questions

Practise Data Engineering questions for the AWS Certified Machine Learning Specialty (MLS-C01) exam: original exam-style scenarios with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security

20 questions · Domain: Data Engineering

What the exam tests

What to know about Data Engineering

Data Engineering questions test whether you can apply the concept in context, not just recognise a definition.

▸How the topic appears in realistic exam-style scenarios.
▸Which detail in the question changes the correct answer.
▸How to eliminate plausible but wrong options.
▸How to connect the question back to the wider exam objective.

Watch out for

Common Data Engineering exam traps

▸Answering from memory before reading the full scenario.
▸Missing a constraint such as cost, availability, security, scope or command context.
▸Choosing a broad answer when the question asks for the most specific fix.
▸Ignoring why the wrong options are tempting.

Practice set

Data Engineering questions

20 questions · select your answer, then reveal the explanation

Question 1 · medium · multiple choice

A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?


A. Use SageMaker File input mode and increase the EBS volume size to 1 TB.
Why wrong: A larger EBS volume does not reduce the I/O wait time for downloading data.

B. Use SageMaker Pipe input mode to stream data directly from S3.
Why correct: Pipe mode streams data on the fly, so the full dataset does not have to be downloaded first, which reduces I/O wait time.

C. Convert the CSV files to Parquet format and use File input mode.
Why wrong: Parquet reduces storage and improves read speed, but File mode still downloads the entire dataset to EBS, causing I/O wait.

D. Load the data into an Amazon EFS file system and mount it to the training instance.
Why wrong: EFS adds network latency and cost without addressing the fundamental I/O wait issue.
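
As a rough illustration of the File versus Pipe choice in Question 1, the sketch below builds a training-channel dictionary in the shape used by the SageMaker CreateTrainingJob API (`InputDataConfig`) and estimates how long a full download would take before File-mode training could start. Nothing is sent to AWS. The bucket name, the prefix, the helper names `build_channel` and `full_download_seconds`, and the 200 MB/s throughput figure are assumptions made for illustration, not values from the question.

```python
# Sketch only: builds the channel dictionary in the shape used by the
# SageMaker CreateTrainingJob API (InputDataConfig). Nothing is sent to AWS.

VALID_INPUT_MODES = {"File", "Pipe", "FastFile"}


def build_channel(channel_name, s3_uri, input_mode, content_type="text/csv"):
    """Return one InputDataConfig entry for a training job request."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": channel_name,
        "InputMode": input_mode,  # "Pipe" streams from S3; "File" copies first
        "ContentType": content_type,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def full_download_seconds(dataset_gb, throughput_mb_per_s):
    """Rough time to copy the whole dataset before File-mode training starts."""
    return dataset_gb * 1024 / throughput_mb_per_s


# Hypothetical bucket and prefix, for illustration only.
pipe_channel = build_channel("train", "s3://example-bucket/daily-csv/", "Pipe")
file_channel = build_channel("train", "s3://example-bucket/daily-csv/", "File")

# Assumed 200 MB/s sustained download rate; real throughput varies.
wait_seconds = full_download_seconds(500, 200)
print(pipe_channel["InputMode"], file_channel["InputMode"], round(wait_seconds))
```

The two channels differ only in `InputMode`. Under the assumed throughput, the File-mode copy of the 500 GB dataset is the up-front wait that streaming avoids.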

Data Engineering questions

A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?

A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses a ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?

A company uses AWS Glue ETL jobs to process data from multiple sources. The job fails with the error: 'An error occurred while calling o123.pyWriteDynamicFrame. Insufficient memory.' The job runs on a G.1X worker type with 10 workers. What should be changed to resolve this error?

A machine learning engineer needs to process a large dataset that does not fit on a single Amazon SageMaker notebook instance's EBS volume. The data is stored in S3. What is the MOST efficient way to access the data from the notebook?

A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?
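
The last scenario names `ProvisionedThroughputExceededException`, which Kinesis Data Streams returns when a request exceeds a shard's throughput limits. The sketch below does not answer the question. It only shows the arithmetic for comparing traffic against per-shard limits, assuming provisioned capacity mode and shared-throughput consumers (no enhanced fan-out). The helper name `shards_needed` and the traffic figures are invented for illustration.

```python
import math

# Per-shard limits for Kinesis Data Streams in provisioned mode.
WRITE_MB_PER_S = 1.0        # ingest, megabytes per second per shard
WRITE_RECORDS_PER_S = 1000  # ingest, records per second per shard
READ_MB_PER_S = 2.0         # shared read throughput per shard


def shards_needed(in_mb_per_s, in_records_per_s, out_mb_per_s):
    """Smallest shard count that keeps every per-shard limit satisfied."""
    return max(
        math.ceil(in_mb_per_s / WRITE_MB_PER_S),
        math.ceil(in_records_per_s / WRITE_RECORDS_PER_S),
        math.ceil(out_mb_per_s / READ_MB_PER_S),
        1,
    )


# Hypothetical clickstream load: 3.5 MB/s and 4,200 records/s written,
# 7 MB/s read in total by shared consumers.
required = shards_needed(3.5, 4200, 7.0)
print(required)
```

In this made-up case the record rate, not the byte rate, is the binding limit, which is the kind of detail a scenario question can hinge on.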

Related MLS-C01 topic practice pages

Data Engineering practice questions

Machine Learning Implementation and Operations practice questions

Modeling practice questions

Exploratory Data Analysis practice questions

MLS-C01 fundamentals practice questions

MLS-C01 scenario practice questions

MLS-C01 troubleshooting practice questions
