Amazon Web Services · Free Practice Questions · Last reviewed May 2026
24 real exam-style questions organised by domain, each with the correct answer highlighted and a plain-English explanation of why it's right and why the others are wrong.
A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?
Use SageMaker File input mode and increase the EBS volume size to 1 TB.
Use SageMaker Pipe input mode to stream data directly from S3.
Pipe mode streams data on-the-fly, eliminating the need to download the full dataset, thus reducing I/O wait time.
Convert the CSV files to Parquet format and use File input mode.
Load the data into an Amazon EFS file system and mount it to the training instance.
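As a rough sketch of what the Pipe-mode answer looks like in practice, the input channel passed to the CreateTrainingJob API might resemble the following. The bucket, prefix, channel name, and the `build_channel` helper are hypothetical placeholders, and only the input-channel portion of the request is shown.

```python
# Sketch of one InputDataConfig channel for CreateTrainingJob.
# The bucket, prefix, and channel name are hypothetical placeholders.
def build_channel(s3_uri: str, input_mode: str = "Pipe") -> dict:
    """Return a channel definition that streams (Pipe) or downloads (File) data."""
    if input_mode not in ("Pipe", "File", "FastFile"):
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": "train",
        "InputMode": input_mode,
        "ContentType": "text/csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


channel = build_channel("s3://example-bucket/daily-csv/")
print(channel["InputMode"])
```

Switching the same channel to File mode only changes the `InputMode` value, which makes it easy to compare start-up time between the two modes.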
A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by a Lambda function that writes records to an S3 bucket. Recently, the number of shards was increased from 2 to 4 to handle higher throughput. After the change, the Lambda function started processing records with increased latency and some records were being written out of order. What is the MOST likely cause?
The S3 bucket is not configured with versioning, causing overwrites.
The Lambda function is reading from the oldest sequence number, causing high IteratorAgeSeconds.
The Lambda function’s reserved concurrency is too low for the increased shard count.
The partition key used by the producer does not ensure that related records go to the same shard after resharding.
After resharding, the mapping of partition keys to shards changes. If ordering matters, the partition key must be chosen to keep related records together.
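The key-to-shard mapping can be illustrated with a small sketch. Kinesis hashes each partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The even split of the hash space, the shard indices, and the sample keys below are assumptions made for illustration; real streams can have uneven ranges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # partition keys are hashed into a 128-bit space


def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, read as a 128-bit unsigned integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count: int) -> list[tuple[int, int]]:
    """Split the hash space into equal contiguous [start, end] ranges (an assumption)."""
    width = HASH_SPACE // shard_count
    return [
        (i * width, HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * width - 1)
        for i in range(shard_count)
    ]


def shard_for(partition_key: str, ranges: list[tuple[int, int]]) -> int:
    """Return the index of the range that contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash key outside all ranges")


two_shards = even_shard_ranges(2)
four_shards = even_shard_ranges(4)
for key in ("user-1", "user-2", "session-abc"):  # hypothetical partition keys
    print(key, shard_for(key, two_shards), "->", shard_for(key, four_shards))
```

A given key always lands on exactly one shard for a given set of ranges, so ordering per key is preserved only when every related record carries the same partition key.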
A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?
Amazon Athena
Amazon EMR with Spark
AWS Glue
AWS Glue is a serverless ETL service that can perform the transformation efficiently.
AWS Data Pipeline
A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be partitioned by year, month, and day. The delivery stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The data producer sends about 1 MB per second. The data is arriving in S3 but the partitions are not being created as expected. What is the MOST likely reason?
The data is encrypted with AWS KMS and Firehose cannot write to encrypted buckets.
The delivery stream does not have dynamic partitioning enabled with the appropriate custom prefix.
Without dynamic partitioning and the correct prefix, Firehose will not partition the data by year/month/day.
The buffer interval is too short for the data volume, causing incomplete records.
The S3 bucket has versioning enabled, which prevents partitioning.
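To make the target layout concrete, the sketch below builds the kind of year/month/day key prefix the question describes. It only illustrates the resulting S3 key shape; Firehose derives the prefix from its own prefix expressions, and the `partition_prefix` helper and `clicks/` base prefix are hypothetical.

```python
from datetime import datetime, timezone


def partition_prefix(event_time: datetime, base: str = "clicks/") -> str:
    """Build a Hive-style year/month/day prefix from a timestamp, in UTC."""
    utc = event_time.astimezone(timezone.utc)
    return f"{base}year={utc:%Y}/month={utc:%m}/day={utc:%d}/"


example = partition_prefix(datetime(2026, 5, 3, 23, 30, tzinfo=timezone.utc))
print(example)  # clicks/year=2026/month=05/day=03/
```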
An ML team is building a recommendation system. The training data includes user-item interactions stored in Amazon DynamoDB. The team wants to export this data to S3 in Parquet format for use with Amazon SageMaker. The export should be incremental (only new or changed records) and run daily. Which approach meets these requirements with MINIMAL operational overhead?
Use the DynamoDB Export to S3 feature and schedule it daily with AWS Glue.
Use DynamoDB Streams with AWS Lambda to write changes to S3 in Parquet format.
Streams capture changes in near-real-time, enabling incremental exports with minimal overhead.
Use a script that scans the DynamoDB table and filters by last updated timestamp.
Set up an Amazon EMR cluster running Spark jobs to read DynamoDB and write to S3.
A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses an ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?
Local input mode
Batch input mode
File input mode
File mode downloads data to the local file system, making it available for training.
Pipe input mode
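In File mode, SageMaker copies each channel to `/opt/ml/input/data/<channel>` inside the training container before the script starts. A minimal sketch of how a training script might discover those files follows; the `list_channel_files` helper and the `train` channel name are assumptions for illustration.

```python
from pathlib import Path

# File mode places each channel under /opt/ml/input/data/<channel>.
DEFAULT_INPUT_ROOT = Path("/opt/ml/input/data")


def list_channel_files(channel: str = "train", root: Path = DEFAULT_INPUT_ROOT) -> list[Path]:
    """Return the files downloaded for one channel, sorted by path."""
    channel_dir = Path(root) / channel
    if not channel_dir.is_dir():
        return []  # channel not present, e.g. when running outside SageMaker
    return sorted(p for p in channel_dir.rglob("*") if p.is_file())
```

Passing a different `root` makes the same function usable in local tests, where the SageMaker directory does not exist.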
Want more Data Engineering practice?
A company is using Amazon SageMaker to train a deep learning model. The training job is failing with an error 'CUDA out of memory'. The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model architecture and batch size are appropriate for this instance size. Which action is most likely to resolve this error?
Reduce the number of epochs.
Increase the number of GPUs by using a distributed training instance type.
Enable automatic mixed precision (AMP) training to reduce memory usage.
AMP runs eligible operations in FP16, which can substantially reduce activation memory and often resolves out-of-memory errors.
Use a smaller instance type to force lower memory usage.
A data scientist is deploying a model using Amazon SageMaker. The model endpoint needs to handle real-time inference requests with low latency. The model is a large ensemble of 10 deep learning models, each approximately 500 MB. What is the most cost-effective deployment strategy that meets the low-latency requirement?
Deploy each model to a separate endpoint and use a load balancer.
Use a single endpoint with multiple instances behind it.
Use a SageMaker batch transform job to process inference requests in batches.
Use a SageMaker multi-model endpoint to host all models on one or more instances.
Multi-model endpoints efficiently host multiple models on shared instances, reducing cost.
A company is using Amazon SageMaker to train a model with a custom algorithm. The training script reads data from an S3 bucket using boto3. The training job fails with an 'AccessDenied' error when trying to access the S3 bucket. The IAM role attached to the SageMaker notebook instance has full S3 access. What is the most likely cause?
The S3 bucket has a bucket policy that denies access from the SageMaker service.
The SageMaker execution role used for the training job does not have S3 access permissions.
The training job uses its own execution role, which must be granted S3 access.
The training script is using an incorrect S3 bucket name.
The SageMaker training job is not configured to use the S3 VPC endpoint.
A machine learning engineer is deploying a model using AWS Lambda for real-time inference. The model is a scikit-learn RandomForestClassifier with 100 trees, serialized as a pickle file of 150 MB. The Lambda function has 3 GB memory allocated. However, the inference requests are timing out after 30 seconds. What is the most likely cause?
scikit-learn is not compatible with AWS Lambda.
The Lambda function does not have enough memory to load the model.
The model is loaded from S3 on every invocation, causing high latency.
Loading the 150 MB model from S3 inside the handler repeats the download and deserialization on every request. Load the model outside the handler so warm invocations reuse it; a cold start still pays the load cost once.
The Lambda function timeout is set too low; increase it to 5 minutes.
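The load-once pattern described above can be sketched as follows. The `_load_model` function is a hypothetical stand-in for downloading and unpickling the real model, and `LOAD_COUNT` exists only to show how often the expensive step runs.

```python
_MODEL = None    # module-level cache, kept alive across warm invocations
LOAD_COUNT = 0   # counts how many times the expensive load actually ran


def _load_model():
    """Hypothetical stand-in for fetching the pickle from S3 and deserializing it."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"n_trees": 100}


def get_model():
    """Load the model on first use, then reuse the cached object."""
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_model()
    return _MODEL


def handler(event, context=None):
    """Lambda-style entry point that never reloads the model when warm."""
    model = get_model()
    features = event.get("features", [])
    return {"n_trees": model["n_trees"], "n_features": len(features)}


handler({"features": [1.0, 2.0]})
handler({"features": [3.0]})
print(LOAD_COUNT)  # 1
```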
A data scientist is using Amazon SageMaker for hyperparameter tuning. The tuning job uses a Bayesian optimization strategy. After 10 training jobs, the objective metric (validation accuracy) has plateaued at 0.85. The data scientist wants to explore more diverse hyperparameter combinations. What should the data scientist do?
Decrease the exploration weight in the tuning job configuration.
Switch to random search strategy.
Increase the exploration weight in the tuning job configuration.
Increasing exploration weight prompts the algorithm to try more diverse combinations.
Increase the number of parallel training jobs.
An IAM policy is attached to a SageMaker execution role. A data scientist tries to create a training job using a custom algorithm stored in an ECR repository. The training job fails with an 'AccessDenied' error when pulling the Docker image from ECR. What is the missing permission?
ecr:GetDownloadUrlForLayer and ecr:BatchGetImage on the ECR repository
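These permissions are required to download image layers and manifests from ECR. The role also needs ecr:GetAuthorizationToken, and typically ecr:BatchCheckLayerAvailability, to complete a pull.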
ecr:PutImage on the ECR repository
s3:GetObject on the ECR repository
sagemaker:CreateTrainingJob on the ECR resource
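For reference, a policy granting the pull permissions discussed above might be assembled as in this sketch. The repository ARN and the `ecr_pull_policy` helper are hypothetical; only the action names come from ECR itself.

```python
import json


def ecr_pull_policy(repository_arn: str) -> dict:
    """Build an IAM policy document that allows pulling images from one repository."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": repository_arn,
            },
            {
                # GetAuthorizationToken cannot be scoped to a single repository.
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
        ],
    }


# Hypothetical account ID and repository name.
policy = ecr_pull_policy("arn:aws:ecr:us-east-1:123456789012:repository/example-algo")
print(json.dumps(policy, indent=2))
```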
Want more Machine Learning Implementation and Operations practice?
A data scientist is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (99% negative class, 1% positive class). The model currently achieves 99% accuracy but fails to detect most positive cases. Which metric should the data scientist primarily use to evaluate model performance?
ROC AUC
F1 score
F1 score is the harmonic mean of precision and recall, so it penalises a model that misses most positive cases even when accuracy is high.
Recall
Accuracy
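A short worked example shows why accuracy misleads on this kind of data. The confusion-matrix counts below are made up for illustration: 1,000 samples with 10 positives, of which the model finds only one.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts: 10 positives in 1,000 samples, only 1 detected.
metrics = classification_metrics(tp=1, fp=0, fn=9, tn=990)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy stays above 99% here while recall is 0.1 and F1 is about 0.18, which is the gap the question is pointing at.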
A team is building a product recommendation system using matrix factorization in Amazon SageMaker. They notice that the model's training loss decreases steadily but validation loss starts increasing after 5 epochs. What is the most likely cause?
Underfitting
Not enough training data
Learning rate too high
Overfitting
The model is memorizing the training data.
A company is using Amazon SageMaker to train a deep learning model on a large dataset. The training job is taking too long. The team wants to reduce training time without changing the model architecture. Which action should they take?
Increase the learning rate by a factor of 10
Use SageMaker's distributed training with multiple instances
Distributed training parallelizes the workload.
Reduce the number of epochs
Reduce the batch size
A data scientist is deploying a regression model in Amazon SageMaker that predicts housing prices. The model shows high bias (underfitting). Which action is most likely to reduce bias?
Reduce the amount of training data
Increase regularization strength
Use a simpler model
Add more features or increase model complexity
More complex models can capture patterns better.
A machine learning engineer is training a neural network on Amazon SageMaker using a custom Docker container. The training job fails with an error: 'CUDA out of memory.' The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model and data fit into memory when using batch size 32, but the engineer wants to maximize GPU utilization. Which approach should the engineer use to fix the out-of-memory error while maintaining efficient training?
Enable mixed precision training
Reduce batch size to 1
Use a CPU-only instance
Implement gradient accumulation with a larger effective batch size
Gradient accumulation sums gradients over several smaller batches before each optimizer step, simulating a larger batch without the extra GPU memory.
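The idea can be checked on a toy problem. The sketch below uses a one-parameter linear model with mean squared error and plain Python rather than a deep learning framework; the data, the starting weight, and the helper names are all illustrative assumptions.

```python
import math


def mse_grad(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: list[tuple[float, float]], micro_batch_size: int) -> float:
    """Average micro-batch gradients; assumes len(batch) is a multiple of micro_batch_size."""
    micro_batches = [
        batch[i:i + micro_batch_size] for i in range(0, len(batch), micro_batch_size)
    ]
    total = 0.0
    for micro in micro_batches:
        # Scale each contribution so the sum equals the full-batch mean gradient.
        total += mse_grad(w, micro) / len(micro_batches)
    return total


data = [(float(x), 3.0 * x + 1.0) for x in range(1, 65)]       # 64 toy samples
full = mse_grad(0.5, data)                                     # one batch of 64
accum = accumulated_grad(0.5, data, micro_batch_size=32)       # two batches of 32
print(math.isclose(full, accum, rel_tol=1e-9))
```

Only one micro-batch of 32 needs to be held in memory at a time, yet the resulting update matches an effective batch size of 64.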
A data scientist is training a deep learning model using Amazon SageMaker. The training loss is decreasing, but the validation loss starts increasing after 10 epochs. The model is overfitting. Which TWO actions should the data scientist take to reduce overfitting? (Choose TWO.)
Increase the number of layers
Remove L2 regularization
Increase the number of training steps
Add dropout layers
Dropout regularizes by randomly dropping neurons.
Add early stopping based on validation loss
Early stopping halts training once validation loss stops improving, before the model overfits further.
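A patience-based rule is one common way to implement early stopping. The sketch below is framework-independent; the `patience` and `min_delta` parameters and the loss values are illustrative assumptions.

```python
from typing import Optional


def early_stopping_epoch(val_losses: list[float], patience: int = 3,
                         min_delta: float = 0.0) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Validation loss bottoms out at epoch 5, then rises.
losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.51, 0.53, 0.56]
stop = early_stopping_epoch(losses, patience=3)
print(stop)  # 8
```

In practice the weights from the best epoch (epoch 5 here) are usually restored after stopping.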
Want more Modeling practice?
A data scientist is exploring a dataset of customer transactions. The dataset has 1 million rows and 50 columns. The target variable is a binary flag indicating whether a customer churned. The data scientist runs a correlation matrix on all numerical features and finds that two features have a correlation coefficient of 0.98. Which action should be taken to improve model performance?
Create an interaction term between the two features.
Remove one of the two highly correlated features from the dataset.
Removing one feature eliminates multicollinearity, simplifying the model and improving interpretability.
Increase the regularization parameter (e.g., lambda) in the model.
Apply mean-centering to both features to reduce correlation.
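Dropping one feature from a highly correlated pair can be sketched in a few lines. The column names, values, and the 0.95 threshold below are made up for illustration, and the helper assumes no column is constant.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_correlated(columns: dict[str, list[float]], threshold: float = 0.95) -> list[str]:
    """Keep the first column of each highly correlated group and drop later ones."""
    kept: list[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical features: the second is almost a rescaled copy of the first.
features = {
    "total_spend": [10, 20, 30, 40, 50],
    "total_spend_with_tax": [11, 22.5, 32, 44, 56],
    "visits": [3, 1, 4, 1, 5],
}
print(drop_correlated(features))  # ['total_spend', 'visits']
```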
A team is building a regression model to predict house prices. The dataset includes a column 'zip_code' with 100 unique values. The data scientist one-hot encodes this column, resulting in 100 new binary columns. The model shows poor performance on a validation set. What is the most likely cause?
One-hot encoding introduced multicollinearity among the binary columns.
One-hot encoding reduced the number of features, causing underfitting.
The one-hot encoding introduced high variance, but the validation set has low variance.
The model suffers from the curse of dimensionality due to the large number of features.
With 100 additional sparse features, the model may overfit and not generalize well.
During exploratory data analysis, a data scientist plots the distribution of a numerical feature and observes a heavy right skew. The feature has many outliers at the high end. Which transformation is most appropriate to reduce skewness?
Apply a log transformation to the feature.
Log transformation compresses high values and can make the distribution more symmetric.
Apply z-score normalization.
Apply one-hot encoding.
Apply min-max scaling.
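The effect of a log transformation on skew can be measured directly. The values below are invented to mimic a right-skewed feature with a few large outliers, and `log1p` is used as one common variant that also handles zeros.

```python
import math


def skewness(values: list[float]) -> float:
    """Population skewness: third central moment divided by variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 2, 3, 3, 4, 5, 8, 40, 250]      # illustrative right-skewed data
logged = [math.log1p(v) for v in raw]        # log(1 + v) compresses large values

print(f"raw skewness:    {skewness(raw):.2f}")
print(f"logged skewness: {skewness(logged):.2f}")
```

The transformed feature is still somewhat skewed, but noticeably less so; z-score and min-max scaling would leave the skewness unchanged because they are linear.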
A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?
The imputation will introduce bias if the missing values are not random.
Imputation using median is computationally expensive for large datasets.
The imputed values may reduce the variance of the 'age' distribution.
Replacing missing values with a constant reduces the variability of the feature.
The imputed values will increase the variance of the feature, leading to overfitting.
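The variance reduction is easy to demonstrate. In the sketch below, 3 of 10 hypothetical `age` values are missing, matching the 30% in the question, and each is replaced with the observed median.

```python
import statistics


def impute_median(values: list) -> list:
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


ages = [22, None, 35, 41, None, 58, 29, None, 47, 63]   # illustrative data
observed = [a for a in ages if a is not None]
imputed = impute_median(ages)

print(f"variance of observed values: {statistics.pvariance(observed):.1f}")
print(f"variance after imputation:   {statistics.pvariance(imputed):.1f}")
```

Every imputed value sits at the centre of the distribution, so the spread shrinks even though no real observation changed.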
A data scientist is exploring a dataset with 500 features and 10,000 samples. The data scientist computes the pairwise correlation matrix and finds that many features have correlations above 0.9. The data scientist wants to reduce the dataset to 50 features while preserving as much variance as possible. Which technique should be used?
Remove all but one feature from each group of highly correlated features.
Apply Principal Component Analysis (PCA) and keep the top 50 principal components.
PCA finds orthogonal directions of maximum variance and can reduce dimensionality effectively.
Use Linear Discriminant Analysis (LDA) to project to 50 dimensions.
Use t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce to 50 dimensions.
A data scientist is performing exploratory data analysis on a dataset with 10,000 rows and 20 features. The target variable is binary. The data scientist observes that one feature has 15% missing values. Which TWO actions are appropriate to handle this missing data? (Choose TWO.)
Replace missing values with the mode of the feature.
Identify and remove outliers from the feature.
Use multiple imputation to fill in the missing values.
Multiple imputation creates several plausible imputed datasets and combines results.
Delete all rows that contain missing values for this feature.
If missingness is random and 15% is acceptable, listwise deletion is straightforward.
Drop the entire feature from the dataset.
Want more Exploratory Data Analysis practice?
The MLS-C01 exam has 65 questions and must be completed in 180 minutes. The passing score is 750/1000.
Scenario-based questions covering exam objectives with detailed answer explanations.
The exam covers 4 domains: Data Engineering, Machine Learning Implementation and Operations, Modeling, Exploratory Data Analysis. Questions are weighted by domain — higher-weight domains appear more on your actual exam.
No. These are original exam-style practice questions written against the official Amazon Web Services MLS-C01 exam objectives. They are not copied from the real exam. Courseiva focuses on genuine understanding, not memorisation of braindumps.
Courseiva tracks your accuracy per domain and routes you toward weak areas automatically. Free, no account required.