MLS-C01 · topic practice

Exploratory Data Analysis practice questions

Use this page to practise Exploratory Data Analysis questions for this certification. Focus on how the exam tests exploratory data analysis in scenario format — understanding the why behind each answer builds more durable knowledge than memorising options.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security
20 questions · Domain: Exploratory Data Analysis

What the exam tests

What to know about Exploratory Data Analysis

Exploratory Data Analysis questions on this certification test your ability to apply exploratory data analysis concepts in scenario-based situations.

Core Exploratory Data Analysis concepts and how they apply in real-world cloud scenarios.

How to perform exploratory data analysis correctly and verify the outcome.

Troubleshooting exploratory data analysis issues by interpreting error output and system state.

Cloud best practices and Exploratory Data Analysis design trade-offs tested by this certification.

Watch out for

Common Exploratory Data Analysis exam traps

  • Selecting the most expensive service when a simpler managed option meets the requirement.
  • Forgetting that cloud resources must be explicitly secured — defaults are rarely secure.
  • Choosing a global service fix when the issue is region-specific.
  • Overlooking cost implications of cross-region data transfer in architecture questions.

Practice set

Exploratory Data Analysis questions

20 questions · select your answer, then reveal the explanation

A data scientist is exploring a dataset of customer transactions. The dataset has 1 million rows and 50 columns. The target variable is a binary flag indicating whether a customer churned. The data scientist runs a correlation matrix on all numerical features and finds that two features have a correlation coefficient of 0.98. Which action should be taken to improve model performance?
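The detection step described in this scenario (scanning a correlation matrix for near-duplicate features) can be sketched as follows. This is a minimal illustration using only the standard library; the column names, the toy values, and the 0.95 threshold are hypothetical and not part of the question.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical toy data: annual_spend is almost a multiple of monthly_spend.
columns = {
    "monthly_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "annual_spend": [121.0, 239.0, 362.0, 478.0, 601.0],
    "support_calls": [3.0, 1.0, 4.0, 1.0, 5.0],
}
flagged = highly_correlated_pairs(columns)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.4f}")
```

On a dataset of the size in the question, a vectorised library would normally replace the hand-written `pearson` helper; the pairwise logic stays the same.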

A team is building a regression model to predict house prices. The dataset includes a column 'zip_code' with 100 unique values. The data scientist one-hot encodes this column, resulting in 100 new binary columns. The model shows poor performance on a validation set. What is the most likely cause?

During exploratory data analysis, a data scientist plots the distribution of a numerical feature and observes a heavy right skew. The feature has many outliers at the high end. Which transformation is most appropriate to reduce skewness?
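The skew seen in a plot like this can also be quantified before any transformation is chosen. The sketch below computes the moment-based skewness coefficient with the standard library; the sample values are hypothetical and only illustrate a long right tail.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (positive = right skew)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical feature with one extreme high-end value.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]
symmetric = [1, 2, 3, 4, 5]

print(f"right-skewed sample: {skewness(right_skewed):.2f}")
print(f"symmetric sample:    {skewness(symmetric):.2f}")
```

Recomputing the coefficient after a candidate transformation gives a quick check of whether the skew was actually reduced.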

A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?
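To make the scenario concrete, the sketch below applies median imputation to a hypothetical 'age' column with 30% of its values missing and compares the spread before and after. The values are invented for illustration; only the technique comes from the question.

```python
import statistics

# Hypothetical 'age' column: 3 of 10 values (30%) are missing.
ages = [22, None, 25, 31, None, 38, 44, 52, None, 67]

observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)

# Replace every missing entry with the median of the observed values.
imputed = [median_age if a is None else a for a in ages]

var_before = statistics.pvariance(observed)
var_after = statistics.pvariance(imputed)
print(f"median used for imputation: {median_age}")
print(f"variance of observed values: {var_before:.1f}")
print(f"variance after imputation:   {var_after:.1f}")
```

Comparing the two variances is one way to see how stacking many rows at a single value changes the shape of the distribution.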

A data scientist is exploring a dataset with 500 features and 10,000 samples. The data scientist computes the pairwise correlation matrix and finds that many features have correlations above 0.9. The data scientist wants to reduce the dataset to 50 features while preserving as much variance as possible. Which technique should be used?

A data scientist is performing exploratory data analysis on a dataset with 10,000 rows and 20 features. The target variable is binary. The data scientist observes that one feature has 15% missing values. Which TWO actions are appropriate to handle this missing data? (Choose TWO.)

A data scientist is analyzing a dataset of customer reviews. The dataset contains a text column 'review' and a numerical rating from 1 to 5. The data scientist wants to create features for sentiment analysis. Which THREE preprocessing steps should be applied to the text data before feature extraction? (Choose THREE.)

A data scientist is analyzing a dataset with a target variable that is heavily imbalanced (e.g., 99% negative class, 1% positive class). Which exploratory data analysis technique is most appropriate to understand the relationship between features and the target before modeling?

During EDA, a data scientist notices that a feature has a high proportion of missing values (e.g., 70%). The feature is continuous and expected to be important based on domain knowledge. What is the best approach to handle this?

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?

A data scientist is analyzing a time-series dataset and wants to check for stationarity. Which EDA technique is most appropriate?

Question 12 · easy · multiple choice

During EDA, a data scientist creates a scatter matrix of numerical features and notices that some features have a funnel-shaped pattern (variance increases with the mean). What is the appropriate transformation to stabilize variance?

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous feature?
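As a point of reference for this question, one widely used univariate check is the 1.5 × IQR rule. The sketch below is an illustration only: the function name, the multiplier default, and the sample values are assumptions, and the rule is not presented here as the answer key.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical transaction amounts with one extreme value.
amounts = [12, 15, 14, 13, 16, 15, 14, 13, 15, 120]
outliers = iqr_outliers(amounts)
print(outliers)
```

Because the fences are built from quartiles rather than the mean, a single extreme value does not widen them much, which is why the rule is often paired with a box plot.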

Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?

Question 15 · medium · multiple choice

A data scientist is performing exploratory data analysis on a dataset containing customer transactions. The dataset has 1 million rows with 50 features, including numerical and categorical variables. The goal is to identify patterns and potential data quality issues before building a model. Which approach should the data scientist take to efficiently explore the data?

A data scientist is trying to read a CSV file from S3 bucket 'my-bucket' with key 'training/data.csv' using an IAM role with the attached policy shown in the exhibit. The read operation fails with an Access Denied error. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/training/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/training/"
    }
  ]
}
```
Question 17 · hard · multiple choice

A data scientist is building a fraud detection model using a dataset of 500,000 credit card transactions. The dataset contains 20 features, including transaction amount, merchant category, time since last transaction, and customer age. The target variable 'is_fraud' has 0.1% positive examples. Initial EDA reveals that the transaction amount distribution is highly skewed with a long tail. Also, there are missing values in the 'customer_age' field (5% missing). The data scientist needs to prepare the data for training a binary classifier. Which combination of preprocessing steps should the data scientist apply to address these issues and improve model performance? (Select TWO.)

Question 18 · medium · multiple choice

A machine learning engineer is working on a customer churn prediction project. The dataset contains 100,000 records with 15 features, including customer demographics, account information, and usage patterns. The target variable 'churned' is binary with 15% positive examples. During EDA, the engineer notices that the feature 'tenure' (number of months the customer has been with the company) has a multimodal distribution with peaks at 1, 12, 24, and 36 months. Also, the feature 'monthly_charges' has a strong positive correlation with 'total_charges' (correlation coefficient = 0.95). The engineer wants to build a logistic regression model. Which preprocessing steps should the engineer take to address these issues? (Select TWO.)

A data scientist is analyzing a dataset with 100 features and 10,000 observations. The target variable is binary (0/1). Initial exploratory data analysis reveals that many features have missing values, high correlation with each other, and non-normal distributions. The data scientist wants to identify the most important features for predicting the target while reducing dimensionality. Which TWO actions should the data scientist take? (Choose two.)

Refer to the exhibit. A data scientist ran an S3 Select query on a large CSV file stored in Amazon S3. The output shows only 2 records returned, but the data scientist expected thousands. The file size is 10 GB. What is the MOST likely reason for the small result set?

Exhibit

Refer to the exhibit.

```
# S3 Select query result on a CSV file
SELECT * FROM s3object s WHERE s."age" > 30 AND s."city" = 'New York'

# Result:
{
  "Payload": [
    {"Records": {"Payload": "name,age,city\nAlice,35,New York\nBob,40,New York\n"}},
    {"Stats": {"Details": {"BytesScanned": 1024, "BytesProcessed": 512, "BytesReturned": 64}}}
  ]
}
```
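When working through an exhibit like this, it can help to read the numbers programmatically. The sketch below parses the records string and stats copied from the exhibit; the exhibit's simplified structure is taken as given (a live SDK response is delivered as an event stream of bytes), and treating "10 GB" as 10 × 1024³ bytes is an assumption.

```python
import csv
import io

# Values copied from the exhibit above.
records_payload = "name,age,city\nAlice,35,New York\nBob,40,New York\n"
stats = {"BytesScanned": 1024, "BytesProcessed": 512, "BytesReturned": 64}

# Parse the CSV text carried in the Records payload.
rows = list(csv.DictReader(io.StringIO(records_payload)))

# Assumption: "10 GB" is interpreted as 10 * 1024**3 bytes.
file_size_bytes = 10 * 1024**3
scanned_fraction = stats["BytesScanned"] / file_size_bytes

print(f"rows returned: {len(rows)}")
print(f"fraction of file scanned: {scanned_fraction:.2e}")
```

Setting the row count next to the scanned fraction is a useful habit when a query returns far fewer records than expected.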


Focused Exploratory Data Analysis sessions

Start an Exploratory Data Analysis-only practice session

Every question in these sessions is drawn from the Exploratory Data Analysis domain — nothing else.

Related practice questions

Related MLS-C01 topic practice pages

Move into related areas when this topic feels solid.

Frequently asked questions

What does the MLS-C01 exam test about Exploratory Data Analysis?
Exploratory Data Analysis questions on this certification test your ability to apply exploratory data analysis concepts in scenario-based situations.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Exploratory Data Analysis questions in a focused session?
Yes — the session launcher on this page draws every question from the Exploratory Data Analysis domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other MLS-C01 topics?
Use the topic links above to move to related areas, or go back to the MLS-C01 question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the MLS-C01 exam covers. They are not copied from any real exam or dump site.