Objective 2.0

Exploratory Data Analysis

MLS-C01 Practice Questions

Use this page to practise Exploratory Data Analysis questions for this certification. Focus on how the exam tests exploratory data analysis in scenario format — understanding the why behind each answer builds more durable knowledge than memorising options.

What this objective tests

MLS-C01 Exploratory Data Analysis — Key Topics

Exploratory Data Analysis questions on this certification test your ability to apply exploratory data analysis concepts in scenario-based situations.

Core Exploratory Data Analysis concepts and how they apply in real-world cloud scenarios.
How to apply exploratory data analysis techniques correctly and verify the outcome.
Troubleshooting exploratory data analysis issues by interpreting error output and system state.
Cloud best practices and Exploratory Data Analysis design trade-offs tested by this certification.

Common exam traps

Where candidates lose marks on Exploratory Data Analysis

⚠ Selecting the most expensive service when a simpler managed option meets the requirement.
⚠ Forgetting that cloud resources must be explicitly secured; defaults are rarely secure.
⚠ Choosing a global service fix when the issue is region-specific.
⚠ Overlooking cost implications of cross-region data transfer in architecture questions.

MLS-C01 Exploratory Data Analysis — Practice Questions

30 questions from this objective

Question 2, medium, multiple choice

A team is building a regression model to predict house prices. The dataset includes a column 'zip_code' with 100 unique values. The data scientist one-hot encodes this column, resulting in 100 new binary columns. The model shows poor performance on a validation set. What is the most likely cause?
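To get a feel for the zip_code scenario above, the following minimal sketch one-hot encodes a made-up column and measures how wide and sparse the result is. The row count (5 houses per zip code) and the helper name `one_hot` are assumptions for illustration only; the question does not state the dataset size, and this is not an answer key.

```python
def one_hot(values):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical data: 100 distinct zip codes, 5 houses each (500 rows in total).
zip_codes = [str(10000 + i) for i in range(100) for _ in range(5)]

categories, encoded = one_hot(zip_codes)
n_rows, n_cols = len(encoded), len(encoded[0])

# How many rows actually "switch on" each new column, and how dense the matrix is.
ones_per_column = [sum(row[j] for row in encoded) for j in range(n_cols)]
density = sum(ones_per_column) / (n_rows * n_cols)

print(n_cols, max(ones_per_column), density)
```

With few rows per category, each new column is supported by only a handful of examples, which is worth weighing when reading the answer options.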

During exploratory data analysis, a data scientist plots the distribution of a numerical feature and observes a heavy right skew. The feature has many outliers at the high end. Which transformation is most appropriate to reduce skewness?
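As a rough illustration of the right-skew scenario above, this sketch measures skewness before and after a log transform. The sample values and the `skewness` helper are assumptions made up for the example; `math.log1p` is used so that zeros would not break the transform.

```python
import math
import statistics

def skewness(xs):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)

# Hypothetical right-skewed feature: mostly small values, a few very large ones.
feature = [1, 2, 2, 3, 3, 3, 4, 5, 8, 12, 40, 150, 900]

# log1p(x) = log(1 + x) compresses the long right tail.
log_feature = [math.log1p(x) for x in feature]

skew_raw = skewness(feature)
skew_log = skewness(log_feature)
print(round(skew_raw, 2), round(skew_log, 2))
```

The transformed feature is still somewhat skewed in this toy sample, but far less than the raw values.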

A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?
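One way to explore the median-imputation scenario above is to compare the spread of the column before and after filling the gaps. The ages below are invented for the sketch, with 3 of 10 values missing to mirror the 30% figure.

```python
import statistics

# Hypothetical 'age' column; None marks a missing value (3 of 10 rows).
ages = [22, 25, None, 31, 38, None, 45, 52, None, 67]

observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)

# Every missing entry receives the same value.
imputed = [median_age if a is None else a for a in ages]

var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)
print(median_age, round(var_observed, 1), round(var_imputed, 1))
```

Because every gap is filled with one identical value, the imputed column clusters more tightly around its centre than the observed data did.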

A data scientist is performing exploratory data analysis on a dataset with 10,000 rows and 20 features. The target variable is binary. The data scientist observes that one feature has 15% missing values. Which TWO actions are appropriate to handle this missing data? (Choose TWO.)

A data scientist is analyzing a dataset with a target variable that is heavily imbalanced (e.g., 99% negative class, 1% positive class). Which exploratory data analysis technique is most appropriate to understand the relationship between features and the target before modeling?

During EDA, a data scientist notices that a feature has a high proportion of missing values (e.g., 70%). The feature is continuous and expected to be important based on domain knowledge. What is the best approach to handle this?

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?
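For the highly correlated features described above, one possible preparation step is to keep a single representative from each correlated group. The sketch below is a greedy filter under assumed data; the feature names, values, and the `drop_correlated` helper are hypothetical, and dimensionality reduction is another route the exam options may cover.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |r| <= threshold against every feature kept so far."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: 'sqm' is 'sqft' in different units, so it is redundant.
features = {
    "sqft": [800, 950, 1100, 1500, 2100, 2600],
    "sqm": [74.3, 88.3, 102.2, 139.4, 195.1, 241.5],
    "bedrooms": [3, 1, 4, 2, 5, 2],
}

kept = drop_correlated(features)
print(kept)
```

The result depends on iteration order: the first feature of a correlated pair is the one that survives.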

A data scientist is analyzing a time-series dataset and wants to check for stationarity. Which EDA technique is most appropriate?
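An informal way to look at stationarity during EDA is to compare summary statistics across time windows: a stationary series should show roughly constant mean and spread. The sketch below does this on two invented series; the `window_stats` helper is hypothetical, and this is a visual-style check rather than a formal unit-root test such as the augmented Dickey-Fuller test.

```python
import statistics

def window_stats(series, window):
    """Mean and standard deviation over consecutive non-overlapping windows."""
    stats = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        stats.append((statistics.fmean(chunk), statistics.pstdev(chunk)))
    return stats

# Hypothetical series: a flat oscillation, and the same oscillation on an upward trend.
flat = [10 + (1 if t % 2 else -1) for t in range(40)]
trending = [value + 0.5 * t for t, value in enumerate(flat)]

flat_means = [m for m, _ in window_stats(flat, 10)]
trend_means = [m for m, _ in window_stats(trending, 10)]
print(flat_means)
print(trend_means)
```

A window mean that drifts steadily, as in the second series, is a sign that the series is not stationary in its raw form.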

During EDA, a data scientist creates a scatter matrix of numerical features and notices that some features have a funnel-shaped pattern (variance increases with the mean). What is the appropriate transformation to stabilize variance?
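The funnel-shaped pattern above can be reproduced with simulated data whose noise scales with its level. This sketch, under that assumption, compares the spread of a low-level and a high-level group before and after taking logarithms; the `simulate` helper, noise level, and seed are made up for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def simulate(level, n=500, rel_noise=0.2):
    """Hypothetical positive feature whose spread grows in proportion to its level."""
    return [level * math.exp(rng.gauss(0, rel_noise)) for _ in range(n)]

low, high = simulate(10), simulate(1000)

# Ratio of standard deviations between the two groups, raw versus log scale.
raw_ratio = statistics.pstdev(high) / statistics.pstdev(low)
log_ratio = (statistics.pstdev([math.log(x) for x in high])
             / statistics.pstdev([math.log(x) for x in low]))
print(round(raw_ratio, 1), round(log_ratio, 2))
```

On the raw scale the high-level group is far more spread out; on the log scale the two spreads are of similar size.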

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous feature?
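Two widely used univariate outlier rules are the interquartile-range (IQR) fence and the z-score threshold. The sketch below applies both to an invented sample; the answer options for this question are not shown on the page, so treat it as background rather than as the answer key. The multiplier 1.5 and threshold 3.0 are conventional defaults, not values taken from the question.

```python
import statistics

def iqr_outliers(xs, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

def zscore_outliers(xs, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [x for x in xs if abs(x - mean) / sd > threshold]

# Hypothetical continuous feature with one extreme value.
values = [12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 17, 17, 18, 19, 95]

print(iqr_outliers(values))
print(zscore_outliers(values))
```

The z-score rule relies on the mean and standard deviation, which the outlier itself inflates, so it can miss extremes in small samples where the IQR rule still flags them.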

Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?

A data scientist is trying to read a CSV file from S3 bucket 'my-bucket' with key 'training/data.csv' using an IAM role with the attached policy shown in the exhibit. The read operation fails with an Access Denied error. What is the most likely cause?

Exhibit

Refer to the exhibit. A data scientist ran an S3 Select query on a large CSV file stored in Amazon S3. The output shows only 2 records returned, but the data scientist expected thousands. The file size is 10 GB. What is the MOST likely reason for the small result set?

Exhibit

Drag and drop the steps to create a data processing job using Amazon SageMaker Processing in the correct order.

Drag and drop the steps to use Amazon SageMaker Feature Store for feature engineering in the correct order.

Match each SageMaker feature to its description.

Match each ML model evaluation concept to its definition.

A data scientist is analyzing a dataset with 500 features and 10,000 samples. After running a correlation matrix, they find that many feature pairs have correlation >0.95. What is the most appropriate next step to improve model performance?

A data scientist is analyzing a dataset with missing values. The missing data mechanism is missing at random (MAR). Which imputation method is most appropriate to preserve relationships between variables?
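To see why model-based imputation can preserve relationships between variables when data are missing at random, the sketch below contrasts mean imputation with a single regression imputation on invented data. The column meanings, values, and the `fit_line` helper are assumptions; multiple-imputation methods extend this idea by also modelling uncertainty.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical rows of (years_experience, salary_k); None marks a missing salary.
rows = [(1, 42), (2, 45), (3, None), (4, 55), (5, 58), (6, None), (8, 72), (10, None)]

complete = [(x, y) for x, y in rows if y is not None]
slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])

# Mean imputation ignores x; regression imputation uses the observed x of each row.
mean_y = statistics.fmean([y for _, y in complete])
mean_imputed = [mean_y if y is None else y for _, y in rows]
model_imputed = [intercept + slope * x if y is None else y for x, y in rows]

print([round(v, 1) for v in mean_imputed])
print([round(v, 1) for v in model_imputed])
```

The regression-imputed salaries rise with experience, while the mean-imputed ones are identical regardless of the other variable.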

Which TWO actions are appropriate when dealing with outliers in a dataset during exploratory data analysis? (Select TWO.)

Which THREE techniques are commonly used for feature engineering in exploratory data analysis? (Select THREE.)
