MLS-C01 Exploratory Data Analysis — All Questions With Answers

Question 1 (medium, multiple choice)

A data scientist is exploring a dataset of customer transactions. The dataset has 1 million rows and 50 columns. The target variable is a binary flag indicating whether a customer churned. The data scientist runs a correlation matrix on all numerical features and finds that two features have a correlation coefficient of 0.98. Which action should be taken to improve model performance?
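As a rough illustration of how such a pair might surface during EDA, the sketch below computes pairwise Pearson correlations over a few made-up columns and lists the pairs above a cutoff. The column names, the values, and the 0.9 threshold are assumptions for illustration and do not come from the question.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical numerical features; "b" is almost a rescaled copy of "a".
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

THRESHOLD = 0.9  # assumed cutoff for "highly correlated"

# Collect every feature pair whose absolute correlation exceeds the cutoff.
high_pairs = [
    (left, right, round(pearson(columns[left], columns[right]), 3))
    for left, right in combinations(columns, 2)
    if abs(pearson(columns[left], columns[right])) > THRESHOLD
]
print(high_pairs)
```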

Question 2 (hard, multiple choice)

A team is building a regression model to predict house prices. The dataset includes a column 'zip_code' with 100 unique values. The data scientist one-hot encodes this column, resulting in 100 new binary columns. The model shows poor performance on a validation set. What is the most likely cause?

Question 3 (easy, multiple choice)

During exploratory data analysis, a data scientist plots the distribution of a numerical feature and observes a heavy right skew. The feature has many outliers at the high end. Which transformation is most appropriate to reduce skewness?

Question 4 (medium, multiple choice)

A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?
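The imputation step described in this scenario can be sketched as follows. The ages are made up, and the spread is computed before and after filling only so the effect on the distribution can be inspected; this is an illustrative sketch, not a recommended pipeline.

```python
import statistics

# Hypothetical 'age' column with 3 of 10 values missing (30%).
ages = [22, None, 31, 38, None, 45, 52, 25, None, 67]

observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)

# Replace every missing entry with the median of the observed values.
filled = [median_age if a is None else a for a in ages]

# Population standard deviation before and after imputation.
spread_before = statistics.pstdev(observed)
spread_after = statistics.pstdev(filled)
print(median_age, round(spread_before, 2), round(spread_after, 2))
```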

Question 5 (hard, multiple choice)

A data scientist is exploring a dataset with 500 features and 10,000 samples. The data scientist computes the pairwise correlation matrix and finds that many features have correlations above 0.9. The data scientist wants to reduce the dataset to 50 features while preserving as much variance as possible. Which technique should be used?

Question 6 (medium, multi-select)

A data scientist is performing exploratory data analysis on a dataset with 10,000 rows and 20 features. The target variable is binary. The data scientist observes that one feature has 15% missing values. Which TWO actions are appropriate to handle this missing data? (Choose TWO.)

Question 7 (hard, multi-select)

A data scientist is analyzing a dataset of customer reviews. The dataset contains a text column 'review' and a numerical rating from 1 to 5. The data scientist wants to create features for sentiment analysis. Which THREE preprocessing steps should be applied to the text data before feature extraction? (Choose THREE.)

Question 8 (medium, multiple choice)

A data scientist is analyzing a dataset with a target variable that is heavily imbalanced (e.g., 99% negative class, 1% positive class). Which exploratory data analysis technique is most appropriate to understand the relationship between features and the target before modeling?

Question 9 (easy, multiple choice)

During EDA, a data scientist notices that a feature has a high proportion of missing values (e.g., 70%). The feature is continuous and expected to be important based on domain knowledge. What is the best approach to handle this?

Question 10 (hard, multiple choice)

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?

Question 11 (medium, multiple choice)

A data scientist is analyzing a time-series dataset and wants to check for stationarity. Which EDA technique is most appropriate?

Question 12 (easy, multiple choice)

During EDA, a data scientist creates a scatter matrix of numerical features and notices that some features have a funnel-shaped pattern (variance increases with the mean). What is the appropriate transformation to stabilize variance?

Question 13 (medium, multi-select)

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous feature?

Question 14 (hard, multi-select)

Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?

Question 15 (medium, multiple choice)

A data scientist is performing exploratory data analysis on a dataset containing customer transactions. The dataset has 1 million rows with 50 features, including numerical and categorical variables. The goal is to identify patterns and potential data quality issues before building a model. Which approach should the data scientist take to efficiently explore the data?

Question 16 (hard, multiple choice)

A data scientist is trying to read a CSV file from S3 bucket 'my-bucket' with key 'training/data.csv' using an IAM role with the attached policy shown in the exhibit. The read operation fails with an Access Denied error. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/training/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/training/"
    }
  ]
}
```

Question 17 (hard, multi-select)

A data scientist is building a fraud detection model using a dataset of 500,000 credit card transactions. The dataset contains 20 features, including transaction amount, merchant category, time since last transaction, and customer age. The target variable 'is_fraud' has 0.1% positive examples. Initial EDA reveals that the transaction amount distribution is highly skewed with a long tail. Also, there are missing values in the 'customer_age' field (5% missing). The data scientist needs to prepare the data for training a binary classifier. Which combination of preprocessing steps should the data scientist apply to address these issues and improve model performance? (Select TWO.)

Question 18 (medium, multi-select)

A machine learning engineer is working on a customer churn prediction project. The dataset contains 100,000 records with 15 features, including customer demographics, account information, and usage patterns. The target variable 'churned' is binary with 15% positive examples. During EDA, the engineer notices that the feature 'tenure' (number of months the customer has been with the company) has a multimodal distribution with peaks at 1, 12, 24, and 36 months. Also, the feature 'monthly_charges' has a strong positive correlation with 'total_charges' (correlation coefficient = 0.95). The engineer wants to build a logistic regression model. Which preprocessing steps should the engineer take to address these issues? (Select TWO.)

Question 19 (medium, multi-select)

A data scientist is analyzing a dataset with 100 features and 10,000 observations. The target variable is binary (0/1). Initial exploratory data analysis reveals that many features have missing values, high correlation with each other, and non-normal distributions. The data scientist wants to identify the most important features for predicting the target while reducing dimensionality. Which TWO actions should the data scientist take? (Choose two.)

Question 20 (hard, multiple choice)

Refer to the exhibit. A data scientist ran an S3 Select query on a large CSV file stored in Amazon S3. The output shows only 2 records returned, but the data scientist expected thousands. The file size is 10 GB. What is the MOST likely reason for the small result set?

Exhibit

Refer to the exhibit.

```
# S3 Select query result on a CSV file
SELECT * FROM s3object s WHERE s."age" > 30 AND s."city" = 'New York'

# Result:
{
  "Payload": [
    {"Records": {"Payload": "name,age,city\nAlice,35,New York\nBob,40,New York\n"}},
    {"Stats": {"Details": {"BytesScanned": 1024, "BytesProcessed": 512, "BytesReturned": 64}}}
  ]
}
```

Question 21 (easy, multiple choice)

A machine learning engineer is working on a regression problem to predict house prices. The dataset contains 500,000 rows and 20 features, including 'sqft_living', 'bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', 'yr_built', 'zipcode', and 'lat'. After performing exploratory data analysis, the engineer notices that the 'sqft_living' feature has a right-skewed distribution with a long tail. The 'zipcode' feature is categorical with 70 unique values. The 'lat' feature is continuous. The engineer wants to prepare the data for a linear regression model. Which action should the engineer take to improve model performance?

Question 22 (medium, drag order)

Drag and drop the steps to create a data processing job using Amazon SageMaker Processing in the correct order.

Question 23 (medium, drag order)

Drag and drop the steps to use Amazon SageMaker Feature Store for feature engineering in the correct order.

Question 24 (medium, matching)

Match each SageMaker feature to its description.

Matches

Managed compute to train a model

Host a model for real-time inference

Run inference on a batch of data

Jupyter notebook for exploration

Run data processing scripts

Question 25 (medium, matching)

Match each ML model evaluation concept to its definition.

Matches

Model performs well on training data but poorly on unseen data

Model fails to capture underlying patterns in data

Error from wrong assumptions in the learning algorithm

Error from sensitivity to small fluctuations in training data

Balance between underfitting and overfitting

Question 26 (easy, multiple choice)

A data scientist is analyzing a dataset with 500 features and 10,000 samples. After running a correlation matrix, they find that many feature pairs have correlation >0.95. What is the most appropriate next step to improve model performance?

Question 27 (medium, multiple choice)

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transactions. They notice that the target variable is highly imbalanced: 99% of samples belong to class 0 and 1% to class 1. Which technique should they use to address this imbalance before training a classification model?
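The observation in this scenario, a 99% / 1% class split, is usually the result of a simple count of the target column. A minimal sketch, assuming a made-up list of labels in place of the real target:

```python
from collections import Counter

# Hypothetical binary target with the 99% / 1% split described above.
labels = [0] * 99 + [1] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the target column.
proportions = {cls: n / total for cls, n in counts.items()}
print(proportions)
```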

Question 28 (hard, multiple choice)

A data scientist is analyzing a dataset with missing values. The missing data mechanism is missing at random (MAR). Which imputation method is most appropriate to preserve relationships between variables?

Question 29 (easy, multi-select)

Which TWO actions are appropriate when dealing with outliers in a dataset during exploratory data analysis? (Select TWO.)

Question 30 (medium, multi-select)

Which THREE techniques are commonly used for feature engineering in exploratory data analysis? (Select THREE.)

Question 31 (hard, multi-select)

Which TWO statements about handling categorical variables in exploratory data analysis are correct? (Select TWO.)

Question 32 (easy, multiple choice)

A data scientist runs the above AWS CLI command on a file in S3. What can be concluded from the output?

Question 33 (medium, multiple choice)

A data scientist is troubleshooting access to an S3 bucket. The IAM policy shown in the exhibit is attached to their role. What is the likely result when they try to list objects in the 'confidential' folder?

Exhibit

Refer to the exhibit.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::my-bucket",
                "arn:aws:s3:::my-bucket/*"
            ]
        },
        {
            "Effect": "Deny",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::my-bucket/confidential/*",
            "Condition": {
                "StringNotEquals": {
                    "aws:sourceVpce": "vpce-12345678"
                }
            }
        }
    ]
}
```

Question 34 (hard, multiple choice)

A data scientist is using Amazon Athena to query a CSV file stored in S3. The error shown in the exhibit occurs. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
ERROR: Could not read CSV file 's3://bucket/data.csv':
Error: (103) The CSV file contains a row with 5 fields, but the header has 4 fields.
Row 1502: "2023-01-15","A","B","C","D"
```

Question 35 (easy, multiple choice)

A data scientist is reviewing a dataset and notices that the distribution of a numerical feature is heavily right-skewed with a long tail. Which visualization is most appropriate to assess the distribution?

Question 36 (medium, multiple choice)

A data scientist is working with a dataset that includes a 'timestamp' column. They want to create features that capture seasonality. Which feature engineering approach is most appropriate?

Question 37 (hard, multiple choice)

During exploratory data analysis, a data scientist discovers that a feature has a variance of 0.01, while other features have variances around 1.0. Which action should be taken?

Question 38 (easy, multiple choice)

A data scientist wants to understand the relationship between a categorical feature with 3 levels and a continuous target variable. Which visualization is most appropriate?

Question 39 (medium, multiple choice)

A data scientist is analyzing a dataset and finds that the target variable has a bimodal distribution. Which preprocessing step is most appropriate before modeling?

Question 40 (hard, multiple choice)

A data scientist is performing exploratory data analysis on text data. They want to identify the most common terms and their frequencies. Which approach should they use?

Question 41 (medium, multiple choice)

A data scientist is analyzing a dataset with missing values in several features. The dataset is large (10 million rows) and stored in an S3 bucket as CSV files. The scientist wants to use AWS Glue to catalog the data and then use Amazon Athena to query it. However, the missing values are causing errors in downstream machine learning models. Which approach should the scientist take to handle missing values during exploratory data analysis?

Question 42 (easy, multiple choice)

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transaction records. The dataset includes a column 'transaction_date' with timestamps. The engineer wants to derive features such as day of the week, hour, and month for modeling. Which AWS service can be used directly to extract these features without writing custom code?

Question 43 (hard, multiple choice)

A data scientist is exploring a dataset with 500 features and 100,000 observations for a regression problem. The scientist notices that many features are highly correlated with each other. Which technique should the scientist use to reduce multicollinearity and improve model interpretability during exploratory data analysis?

Question 44 (medium, multiple choice)

A machine learning team is analyzing a dataset with a target variable that is highly imbalanced (99% negative class, 1% positive class). They want to understand the distribution and relationships before modeling. Which exploratory data analysis technique is most appropriate to visualize the imbalance and guide resampling strategy?

Question 45 (easy, multiple choice)

A data analyst is using Amazon SageMaker Studio to perform exploratory data analysis on a dataset stored in S3. The analyst wants to generate summary statistics and visualizations quickly. Which built-in feature of SageMaker Studio should the analyst use?

Question 46 (hard, multiple choice)

A data scientist is working with a dataset containing geospatial coordinates (latitude and longitude) of customer locations. The scientist wants to engineer features such as distance to the nearest store, and cluster customers into regions. Which AWS service is best suited for performing geospatial analysis and clustering during exploratory data analysis?

Question 47 (medium, multiple choice)

A machine learning engineer is analyzing a dataset with a mix of categorical and numerical features. The engineer wants to understand the correlation between categorical features and the target variable. Which statistical test is most appropriate for measuring association between a categorical feature and a binary target?

Question 48 (easy, multiple choice)

A data analyst is exploring a dataset and wants to identify outliers in a numerical feature. Which visualization technique is most effective for detecting outliers?

Question 49 (hard, multiple choice)

A team is performing exploratory data analysis on a dataset containing 10 million records stored in Amazon S3. They want to sample the data efficiently to build a representative subset for initial modeling. Which sampling method should they use to minimize bias and ensure the sample reflects the population distribution?

Question 50 (medium, multi-select)

A data scientist is using Amazon SageMaker to perform exploratory data analysis on a dataset with missing values and outliers. Which TWO actions should the scientist take to understand the data quality? (Choose TWO.)

Question 51 (hard, multi-select)

A machine learning engineer is analyzing a dataset with a large number of features (p >> n). The engineer suspects that many features are irrelevant. Which THREE methods are suitable for feature selection during exploratory data analysis? (Choose THREE.)

Question 52 (easy, multi-select)

A data analyst is using AWS Glue to catalog datasets for exploratory analysis. The analyst wants to understand the schema and data types. Which TWO tools can the analyst use to view the schema of a table in the AWS Glue Data Catalog? (Choose TWO.)

Question 53 (easy, multiple choice)

A data scientist is analyzing a dataset with 10,000 rows and 50 columns. The target variable is binary. Which technique is most appropriate for identifying the most important features for predicting the target?

Question 54 (medium, multiple choice)

A company has a dataset with a large number of missing values in several columns. The data scientist wants to impute missing values without introducing bias. Which approach should be used?

Question 55 (hard, multiple choice)

During exploratory data analysis on a dataset with 1 million rows, a data scientist notices that the distribution of the target variable is highly imbalanced (99% class A, 1% class B). Which technique should be applied to address this imbalance before model training?

Question 56 (easy, multiple choice)

A data scientist wants to visualize the correlation between a continuous feature and a binary target variable. Which plot is most appropriate?

Question 57 (medium, multiple choice)

A data scientist is exploring a dataset and finds that the variance of a feature is 0. What should be done with this feature?

Question 58 (hard, multiple choice)

A company stores customer transaction data in Amazon S3. A data scientist needs to perform exploratory data analysis using Amazon SageMaker. The dataset is 500 GB in CSV format. Which approach is most cost-effective and time-efficient for initial data profiling?

Question 59 (easy, multiple choice)

A data scientist is analyzing a dataset and notices that the distribution of a continuous feature is heavily right-skewed. Which transformation is most likely to make the distribution more symmetric?

Question 60 (medium, multiple choice)

During EDA, a data scientist finds that two features have a Pearson correlation coefficient of 0.95. What is the primary concern when using these features together in a linear regression model?

Question 61 (hard, multiple choice)

A data scientist is performing EDA on a dataset of customer churn. The dataset includes a categorical feature 'Region' with 100 unique values. What is the best way to encode this feature for a tree-based model?

Question 62 (easy, multi-select)

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous dataset? (Select TWO.)

Question 63 (medium, multi-select)

Which THREE of the following are common issues that can be identified during exploratory data analysis? (Select THREE.)

Question 64 (hard, multi-select)

Which TWO of the following are valid reasons to use a sample of the data during exploratory data analysis instead of the full dataset? (Select TWO.)

Question 65 (medium, multiple choice)

A data scientist runs the above AWS CLI command. What does the command do?

Question 66 (hard, multiple choice)

A data scientist is setting up an IAM policy for a SageMaker notebook instance that needs to read and write data in the 'training/' folder of an S3 bucket, and also list objects in the bucket. Does the policy satisfy the requirements?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/training/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket",
      "Condition": {
        "StringLike": {
          "s3:prefix": "training/*"
        }
      }
    }
  ]
}
```

Question 67 (easy, multiple choice)

A data scientist receives the error shown in the exhibit during model training. What is the most likely cause?

Exhibit

Refer to the exhibit.

CloudWatch Logs snippet:

```
2023-07-01T10:00:00 ERROR: Model training failed: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Traceback:
  File "train.py", line 45, in <module>
    model.fit(X_train, y_train)
  File "sklearn/linear_model/_logistic.py", line 1523, in fit
    ...
```
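A check of the kind that would reveal the values named in this error message can be sketched with the standard library alone. The `X_train` contents below are made up and the helper name `find_non_finite` is hypothetical; the real training script and data are not shown in the exhibit.

```python
import math


def find_non_finite(rows):
    """Return (row_index, column_index) for every NaN or infinite entry."""
    return [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if not math.isfinite(value)
    ]


# Hypothetical feature matrix containing one NaN and one infinity.
X_train = [
    [1.0, 2.0],
    [float("nan"), 3.0],
    [4.0, float("inf")],
]

bad_cells = find_non_finite(X_train)
print(bad_cells)
```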

Question 68 (medium, multiple choice)

A data scientist is performing exploratory data analysis on a dataset containing customer transactions. The dataset has a column 'transaction_date' with timestamps in string format. Which AWS service can be used to parse the timestamps and extract features like day of week and hour?
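Independently of which service runs it, the transformation itself (parsing a string timestamp and deriving day of week and hour) can be sketched in plain Python. The timestamp format and the helper name `extract_time_features` are assumptions, since the question does not state how the strings are formatted.

```python
from datetime import datetime


def extract_time_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and derive simple calendar features."""
    parsed = datetime.strptime(timestamp, fmt)
    return {
        "day_of_week": parsed.weekday(),  # Monday is 0, Sunday is 6
        "hour": parsed.hour,
        "month": parsed.month,
    }


# Hypothetical 'transaction_date' value in the assumed format.
features = extract_time_features("2023-01-15 14:30:00")
print(features)
```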

Question 69 (hard, multiple choice)

A machine learning engineer is analyzing a dataset with high cardinality categorical features. They want to reduce the number of categories by grouping rare categories into an 'Other' category. Which Amazon SageMaker processing job capability is best suited for this task?

Question 70 (easy, multiple choice)

A data analyst needs to visualize the distribution of a numerical feature in a dataset. Which AWS service can be used to create a histogram directly from data stored in S3 without writing code?

Question 71 (medium, multiple choice)

A team is exploring a dataset with missing values in multiple columns. They want to decide whether to drop rows or impute values. Which approach is most appropriate for exploratory data analysis?

Question 72 (hard, multiple choice)

A data scientist is performing EDA on a large dataset (10 TB) stored in S3. They need to compute summary statistics for each column. Which approach is most cost-effective and efficient?

Question 73 (easy, multiple choice)

A company has customer feedback data stored in CSV files in S3. The data includes a 'feedback_text' column. Which AWS service is best suited for performing sentiment analysis as part of exploratory data analysis?

Question 74 (medium, multiple choice)

During EDA, a data scientist finds that a feature has a skewed distribution. They want to apply a log transformation to make it more Gaussian-like. Which Amazon SageMaker feature is most appropriate for this transformation?
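The effect of the log transformation mentioned here can be checked numerically. The sketch below uses made-up values and `log1p` (the log of 1 + x, chosen here so zeros would not break the transform), and compares a moment-based skewness estimate before and after; it says nothing about which SageMaker feature applies the transform.

```python
import math


def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature with a long upper tail.
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 100]
transformed = [math.log1p(v) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(transformed)
print(round(skew_before, 2), round(skew_after, 2))
```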

Question 75 (hard, multiple choice)

A data engineer is exploring a dataset with a timestamp column and wants to resample the data to a consistent 1-hour frequency. The data is irregularly spaced. Which approach is most efficient using AWS services?

Question 76 (easy, multiple choice)

A data analyst wants to check for duplicate rows in a dataset stored in S3. Which AWS service can be used to run a SQL query to count duplicates without moving the data?
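The shape of such a duplicate-counting query can be tried locally. The sketch below uses Python's built-in `sqlite3` with an in-memory table purely as a stand-in for the real query engine; the table name, columns, and rows are made up, and SQL dialect details may differ on the AWS side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("a", 10.0), ("b", 5.0), ("a", 10.0), ("c", 7.5), ("a", 10.0)],
)

# Group on every column and keep only groups that occur more than once.
duplicate_groups = conn.execute(
    """
    SELECT customer_id, amount, COUNT(*) AS occurrences
    FROM transactions
    GROUP BY customer_id, amount
    HAVING COUNT(*) > 1
    """
).fetchall()
conn.close()
print(duplicate_groups)
```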

Question 77 (medium, multi-select)

Which TWO of the following are appropriate techniques for handling missing data during exploratory data analysis? (Select TWO.)

Question 78 (hard, multi-select)

Which THREE of the following are best practices for feature engineering during EDA? (Select THREE.)

Question 79 (easy, multi-select)

Which TWO AWS services can be used to visualize data distributions as part of exploratory data analysis? (Select TWO.)

Question 80 (medium, multiple choice)

The exhibit shows the result of an Athena query. What does the value '5000' represent?

Exhibit

Refer to the exhibit. Consider the following AWS CLI output from an Amazon Athena query: 

```
QueryExecutionId: "12345678-1234-1234-1234-123456789012"
Query: "SELECT COUNT(*) FROM my_table WHERE col1 IS NULL"
Status: "SUCCEEDED"
ResultConfiguration:
  OutputLocation: "s3://my-bucket/athena-results/"
ResultSet:
  Rows:
  - Data:
    - VarCharValue: "_col0"
  - Data:
    - VarCharValue: "5000"
```

Question 81 (hard, multiple choice)

The exhibit shows an IAM policy for a SageMaker notebook. A data scientist wants to use the notebook to run an Athena query and then load the results into a pandas DataFrame. Which action is NOT possible with this policy?

Exhibit

Refer to the exhibit. Consider the following IAM policy attached to a SageMaker notebook instance:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::my-training-data/*",
                "arn:aws:s3:::my-training-data"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryResults"
            ],
            "Resource": "*"
        }
    ]
}
```

Question 82 (easy, multiple choice)

The exhibit shows a data quality report for a column named 'age'. Which potential data issue should be investigated further?

Exhibit

Refer to the exhibit. Consider the following output from an Amazon SageMaker Data Wrangler data quality report:

```
Column: 'age'
Missing: 2.3%
Mean: 38.5
Median: 37.0
StdDev: 15.2
Min: 0
Max: 120
Unique: 85
```

Question 83 (medium, multiple choice)

A data scientist is performing exploratory data analysis on a dataset with missing values. The dataset contains a column 'income' with 20% missing values. The income distribution is right-skewed. Which imputation method is most appropriate to preserve the skewness?

Question 84 (easy, multiple choice)

A company has a dataset with 1 million rows and 500 features. They want to reduce dimensionality for visualization. Which technique is most suitable for preserving global structure?

Question 85 (hard, multiple choice)

A data scientist is analyzing a dataset with many categorical features. The target variable is binary. Which statistical test should be used to assess the association between each categorical feature and the target?

Question 86 (medium, multiple choice)

A company is performing EDA on a dataset with 10,000 rows and 200 columns. They run a correlation matrix and find many high correlations (|r| > 0.9). What is the best approach to address multicollinearity before modeling?

Question 87easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with a column 'transaction_date'. They want to create features for day of week and month. What is the correct AWS service to schedule a recurring ETL job for this transformation?

Question 88hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset where the target variable is highly imbalanced (1% positive class). They are performing EDA. Which metric is most appropriate for evaluating class separation in the feature space?

Question 89mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A company stores sensor data in Amazon S3. A data scientist wants to explore the data using SQL without moving it. Which AWS service should they use?

Question 90easymultiple choice

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist notices that a numeric feature 'age' has outliers beyond 3 standard deviations. What is the most appropriate first step?

Question 91hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with 100 features. They want to identify which features are most predictive of the target using a model-agnostic method. Which technique should they use?

Question 92mediummulti select

Read the full Exploratory Data Analysis explanation →

Which TWO statements about handling missing data during EDA are correct? (Select TWO.)

Question 93hardmulti select

Read the full Exploratory Data Analysis explanation →

Which THREE are common techniques for detecting outliers in a univariate dataset? (Select THREE.)
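
Two rules often applied to a single numeric column are the z-score rule and the 1.5 x IQR fence. The sketch below implements both with the standard library; the sample data, the threshold of 3 and the multiplier of 1.5 are conventional defaults assumed here, not values taken from the question.

```python
import statistics

# Hypothetical readings: twenty typical values plus one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]


def zscore_outliers(data, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` sample std devs."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > threshold]


def iqr_outliers(data, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


print(zscore_outliers(values))
print(iqr_outliers(values))
```

The z-score rule depends on the mean and standard deviation, which the outlier itself inflates, so on very small samples it can miss a point that the IQR fence catches.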

Question 94easymulti select

Read the full Exploratory Data Analysis explanation →

Which TWO are appropriate visualizations for exploring the distribution of a single numeric variable? (Select TWO.)

Question 95mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist runs the above AWS CLI command and gets the output. The object size is 1 GB. They try to open the CSV file in Amazon Athena but get an error. What is the most likely cause?

Exhibit (AWS CLI command and output not shown)

Question 96hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist creates the above IAM policy and attaches it to a role used by an Amazon SageMaker notebook instance. When trying to save a file to the S3 bucket, the operation fails. What is the missing permission?

Exhibit

Refer to the exhibit.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-bucket/*"
        }
    ]
}
```

Question 97mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing server logs stored in Amazon CloudWatch Logs. The above snippet shows three log entries. They want to count the number of 500 errors per minute using CloudWatch Logs Insights. Which query should they use?

Exhibit

Refer to the exhibit.

```
2019-09-01 12:00:01 ERROR 500 Server Error: /api/v1/users
2019-09-01 12:00:02 INFO 200 OK: /api/v1/users
2019-09-01 12:00:03 ERROR 500 Server Error: /api/v1/users
```
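
To make the intended aggregation concrete, here is a local Python sketch that counts 500 responses per minute over the three exhibit lines. It is not a CloudWatch Logs Insights query; it only assumes that each line is laid out as date, time, level, status code, as in the exhibit.

```python
from collections import Counter

log_lines = [
    "2019-09-01 12:00:01 ERROR 500 Server Error: /api/v1/users",
    "2019-09-01 12:00:02 INFO 200 OK: /api/v1/users",
    "2019-09-01 12:00:03 ERROR 500 Server Error: /api/v1/users",
]


def count_500_per_minute(lines):
    """Count lines whose status code is 500, grouped by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Expected layout: date, time, level, status code, message...
        if len(parts) >= 4 and parts[3] == "500":
            minute = f"{parts[0]} {parts[1][:5]}"
            counts[minute] += 1
    return dict(counts)


per_minute = count_500_per_minute(log_lines)
print(per_minute)
```

The same two steps, filtering on the status code and then bucketing timestamps into one-minute bins, are what a per-minute count in any query language has to express.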

Question 98mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with 100 features and 10,000 samples. The target variable is highly imbalanced (1% positive class). Which exploratory data analysis step is most critical before model training?

Question 99hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A company uses Amazon SageMaker to train a regression model. After training, the data scientist notices that the training loss decreases but validation loss increases after a few epochs. Which EDA technique could have helped predict this behavior?

Question 100easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset containing customer transactions. The dataset has a column 'transaction_amount' with values ranging from $0.01 to $10,000. Which EDA step is most appropriate to detect skewed distribution?

Question 101hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is evaluating a dataset for building a fraud detection model. The dataset has 1 million transactions, but only 500 are fraudulent. The engineer wants to understand the distribution of fraudulent vs. non-fraudulent transactions over time. Which EDA visualization is most suitable?

Question 102mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset that has missing values in 30% of rows for a categorical feature 'city'. Which EDA step should be performed before deciding on imputation?

Question 103easymultiple choice

Read the full Exploratory Data Analysis explanation →

A team has a dataset with 500 features and wants to reduce dimensionality. During EDA, they compute the variance of each feature. Which finding would most likely lead to feature removal?

Question 104hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist uses Amazon SageMaker Data Wrangler to explore a dataset. The target column is 'price' (continuous). Which EDA analysis would best help decide between linear regression and tree-based models?

Question 105easymultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is analyzing a text classification dataset with 50,000 documents. Which EDA step is most important to understand the vocabulary size and frequency distribution?

Question 106mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with a timestamp column. They want to detect seasonality. Which visualization is most appropriate?

Question 107mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with 50 features. Which TWO EDA techniques are most effective for detecting multicollinearity?

Question 108hardmulti select

Read the full Exploratory Data Analysis explanation →

A machine learning team is analyzing a dataset with 10,000 rows and 200 features. They suspect data leakage due to time-based features. Which THREE EDA checks should they perform?

Question 109easymulti select

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist notices that a numeric feature 'age' has values ranging from 0 to 150, but expects adult ages between 18 and 100. Which TWO steps should the scientist take to investigate?

Question 110mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with 500 features and 10,000 rows. The target variable is binary. After training a logistic regression model, the coefficients show many non-zero values but the model has low accuracy on the test set. Which EDA step should the data scientist perform next to improve model performance?

Question 111hardmultiple choice

Read the full Exploratory Data Analysis explanation →

An ML engineer is performing EDA on a dataset of customer transactions. The dataset has 1 million rows and 20 columns, including a 'transaction_amount' column. The engineer notices that 5% of the transaction amounts are negative because of data entry errors; the rest are positive. Which approach is most appropriate for handling these negative values during EDA?

Question 112easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset and wants to understand the distribution of a continuous feature. Which visualization is most appropriate for identifying skewness and potential outliers?

Question 113mediummultiple choice

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist finds that a feature 'age' has 30% missing values. The dataset has 100,000 rows. Which imputation strategy is most robust if the data is not missing at random (MNAR) and the missingness is related to the age value itself?

Question 114hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning team is building a model to predict customer churn. The dataset has 20 features and 50,000 rows. After initial EDA, they notice that the target variable 'churn' is highly imbalanced (5% churn, 95% non-churn). Which EDA step should the team prioritize to address this imbalance before model training?

Question 115easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with both numerical and categorical features. Which technique is best for detecting multicollinearity among numerical features?

Question 116mediummultiple choice

Read the full Exploratory Data Analysis explanation →

An ML team is analyzing a time series dataset of daily website traffic. They notice a pattern where traffic spikes every Sunday. Which EDA technique should they use to confirm this seasonality?
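
One way to check a suspected weekly cycle is to compute the autocorrelation of the series at several lags and see whether lag 7 stands out. The sketch below does this with plain Python on a made-up four-week series with a Sunday spike; the traffic numbers are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance_sum / variance_sum


# Hypothetical daily visits, Monday to Sunday, repeated for four weeks.
week = [100, 102, 98, 101, 99, 103, 180]
traffic = week * 4

acf = {lag: round(autocorrelation(traffic, lag), 2) for lag in range(1, 8)}
strongest_lag = max(acf, key=acf.get)
print(acf)
print(strongest_lag)
```

A peak at lag 7 that repeats at lags 14 and 21 on a longer series is the usual signature of weekly seasonality; a seasonal decomposition plot would show the same pattern visually.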

Question 117hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset containing text reviews. The goal is to classify sentiment. During EDA, they compute the word frequency distribution. They notice that the most frequent words are common stop words like 'the', 'and', 'a'. Which action should they take to improve the feature representation for modeling?

Question 118easymultiple choice

Read the full Exploratory Data Analysis explanation →

After loading a dataset into a pandas DataFrame, a data scientist runs df.info() and sees that a column 'income' has object dtype. What does this indicate, and what EDA step should be taken?
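
An object dtype on a column that should be numeric usually means at least one value could not be parsed as a number. The sketch below shows one way to find those values with pandas, using `pd.to_numeric` with `errors="coerce"`. The sample strings are hypothetical, and the column is built with an explicit object dtype so the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical raw values: a thousands separator and a text placeholder.
df = pd.DataFrame(
    {"income": pd.Series(["52000", "61,500", "N/A", "48000"], dtype=object)}
)

# Strip thousands separators, then coerce anything non-numeric to NaN.
cleaned = df["income"].str.replace(",", "", regex=False)
df["income_num"] = pd.to_numeric(cleaned, errors="coerce")

# Rows that failed conversion are the ones to inspect by hand.
bad_rows = df[df["income_num"].isna()]
print(bad_rows)
```

Looking at `bad_rows` before imputing or dropping anything shows whether the failures are formatting issues, placeholders for missing data, or real errors.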

Question 119mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with 100 features. They want to reduce dimensionality by removing highly correlated features. Which TWO approaches are appropriate? (Choose TWO.)

Question 120hardmulti select

Read the full Exploratory Data Analysis explanation →

During EDA of a dataset for a regression problem, a data scientist notices that the target variable has a right-skewed distribution. Which THREE transformations are appropriate to address this skewness? (Choose THREE.)
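
The effect of a transformation on skewness can be measured directly. The sketch below compares the moment-based skewness of a made-up right-skewed target before and after a square-root and a log transform; the data are hypothetical. Box-Cox, another common choice, is available as `scipy.stats.boxcox` and is left out only to keep this example dependency-free.

```python
import math


def skewness(data):
    """Moment-based (biased) skewness: m3 / m2 ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed target values.
target = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

raw_skew = skewness(target)
sqrt_skew = skewness([math.sqrt(x) for x in target])
log_skew = skewness([math.log1p(x) for x in target])

print(round(raw_skew, 2), round(sqrt_skew, 2), round(log_skew, 2))
```

On this sample the log transform reduces skewness more than the square root does, which is typical when the tail is long; both require non-negative inputs.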

Question 121easymulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with categorical variables. Which TWO EDA techniques are appropriate for understanding the relationship between a categorical feature and a continuous target? (Choose TWO.)

Question 122mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring log files stored in S3. They ran the above AWS CLI command. What does the output indicate about the data, and what EDA step should be taken next?

Exhibit (AWS CLI command and output not shown)

Question 123hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is trying to upload a CSV file to an S3 bucket using the AWS CLI without specifying server-side encryption. The upload fails with an AccessDenied error. Based on the bucket policy exhibit, what is the most likely cause?

Exhibit

Refer to the exhibit.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::data-bucket/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-server-side-encryption": "AES256"
                }
            }
        }
    ]
}
```

Question 124mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing application logs in JSON format. Based on the exhibit, which EDA insight is most valuable for troubleshooting?

Exhibit

Refer to the exhibit.

```
{
  "Logs": [
    {
      "timestamp": "2023-10-01T10:00:00Z",
      "level": "ERROR",
      "message": "NullPointerException: Cannot invoke method on null object"
    },
    {
      "timestamp": "2023-10-01T10:01:00Z",
      "level": "ERROR",
      "message": "NullPointerException: Cannot invoke method on null object"
    },
    {
      "timestamp": "2023-10-01T10:02:00Z",
      "level": "WARN",
      "message": "Connection timeout"
    },
    {
      "timestamp": "2023-10-01T10:03:00Z",
      "level": "ERROR",
      "message": "NullPointerException: Cannot invoke method on null object"
    }
  ]
}
```
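
A quick frequency count over the exhibit's entries shows which level and message pair dominates. The sketch below parses the same four entries with the standard library and tallies them; only the variable names are assumptions.

```python
import json
from collections import Counter

# The exhibit's log entries, reproduced as a JSON string.
raw = """
{
  "Logs": [
    {"timestamp": "2023-10-01T10:00:00Z", "level": "ERROR",
     "message": "NullPointerException: Cannot invoke method on null object"},
    {"timestamp": "2023-10-01T10:01:00Z", "level": "ERROR",
     "message": "NullPointerException: Cannot invoke method on null object"},
    {"timestamp": "2023-10-01T10:02:00Z", "level": "WARN",
     "message": "Connection timeout"},
    {"timestamp": "2023-10-01T10:03:00Z", "level": "ERROR",
     "message": "NullPointerException: Cannot invoke method on null object"}
  ]
}
"""

logs = json.loads(raw)["Logs"]

# Tally identical (level, message) pairs to surface recurring problems.
counts = Counter((entry["level"], entry["message"]) for entry in logs)
top_issue, top_count = counts.most_common(1)[0]
print(top_issue, top_count)
```

Grouping by message rather than by level alone separates one recurring error from several unrelated ones that happen to share a severity.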

Question 125easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with missing values. Which technique is most appropriate for imputing missing values in a numerical feature that follows a normal distribution?

Question 126mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is exploring a dataset with 50 features. Some features are highly correlated. Which technique should the engineer use to reduce dimensionality while preserving variance?

Question 127hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a large number of categorical features. The target variable is binary. Which technique should the scientist use to assess the relationship between each categorical feature and the target?

Question 128easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is visualizing the distribution of a numerical feature that is heavily right-skewed. Which visualization technique is most appropriate?

Question 129mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A company has a dataset with a timestamp column and multiple numerical metrics. They want to identify seasonality and trends. Which AWS service is best suited for this analysis?

Question 130hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a binary target variable. The dataset is highly imbalanced (99% negative class). Which metric is most appropriate for evaluating the model's performance during exploratory data analysis?

Question 131easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist wants to understand the distribution of a categorical feature with 100 unique values. Which visualization is most appropriate?

Question 132mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset that contains a feature with many outliers. Which transformation should the scientist apply to reduce the impact of outliers?

Question 133hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a dataset with mixed data types (numerical, categorical, text). The goal is to identify clusters of similar records. Which technique is most appropriate?

Question 134easymulti select

Read the full Exploratory Data Analysis explanation →

Which TWO of the following are common techniques for detecting outliers in a numerical feature?

Question 135mediummulti select

Read the full Exploratory Data Analysis explanation →

Which THREE of the following are appropriate data visualization techniques for exploring the relationship between two numerical variables?

Question 136hardmulti select

Read the full Exploratory Data Analysis explanation →

Which TWO of the following are appropriate methods for handling missing data in a dataset?

Question 137easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a dataset with missing values. The dataset contains a column 'age' with some missing entries. Which technique is most appropriate for imputing missing values in the 'age' column if the data is normally distributed?

Question 138mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a large dataset stored in S3. The analysis reveals high cardinality in a categorical feature with over 1 million unique values. What is the best approach to handle this before training a model?

Question 139hardmultiple choice

Read the full Exploratory Data Analysis explanation →

During exploratory data analysis, a data scientist observes a strong correlation (r=0.95) between two numeric features. The model to be trained is a linear regression. What is the most appropriate action?

Question 140easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a target variable that is binary (0/1). Which visualization is most appropriate to explore the relationship between a continuous feature and the target?

Question 141mediummultiple choice

Read the full Exploratory Data Analysis explanation →

In exploratory data analysis, a data scientist notices that the distribution of a feature 'income' is heavily right-skewed. Which transformation is most appropriate to reduce skewness?

Question 142hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with 1,000 features and only 200 samples. The goal is to build a binary classifier. Which technique should be used first during exploratory data analysis to reduce dimensionality and avoid overfitting?

Question 143easymultiple choice

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist finds that a categorical feature 'city' has 500 unique values but only 10 cities account for 90% of the data. What is a recommended way to handle the rare categories?

Question 144mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA and observes that a feature 'purchase_amount' has many zeros and a long tail of positive values. What type of model would be appropriate for this target variable?

Question 145hardmultiple choice

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist plots the distribution of a feature and sees a bimodal pattern. What does this likely indicate?

Question 146mediummulti select

Read the full Exploratory Data Analysis explanation →

Which TWO are appropriate techniques for detecting outliers in a dataset during exploratory data analysis?

Question 147hardmulti select

Read the full Exploratory Data Analysis explanation →

Which THREE are valid reasons to perform feature scaling during exploratory data analysis?

Question 148easymulti select

Read the full Exploratory Data Analysis explanation →

Which TWO are common steps in exploratory data analysis?

Question 149mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with missing values in a numeric column. The missing rate is 30% and the data is not missing completely at random. Which imputation method should the data scientist avoid to minimize bias?

Question 150easymultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is exploring a dataset with 500 features and 10,000 samples. To reduce dimensionality for visualization, which technique is most suitable if the goal is to preserve global data structure?

Question 151hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is examining a dataset for a binary classification problem. The target variable has a 1:1000 imbalance. Which technique should be used to assess model performance during exploratory data analysis?

Question 152mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is working with a time series dataset that shows increasing variance over time. To stabilize the variance before modeling, which transformation is most appropriate?

Question 153hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A team is analyzing a dataset with many categorical features. They notice that one feature has 1,000 unique values but a long tail where most values appear only once. Which encoding method is most appropriate to avoid overfitting?

Question 154easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset and finds that the correlation between two features is 0.95. What should the data scientist do to address multicollinearity before training a linear regression model?

Question 155mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is analyzing a dataset and observes that the distribution of a continuous feature is heavily right-skewed. Which transformation is most likely to make the distribution approximately normal?

Question 156hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset containing text reviews. To understand the most common words, the data scientist generates a word cloud. Which preprocessing step is most important to ensure the word cloud reflects meaningful content?

Question 157easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is exploring a dataset and notices that the target variable has a Poisson distribution. Which type of model is most appropriate for this target?

Question 158mediummulti select

Read the full Exploratory Data Analysis explanation →

Which TWO techniques are appropriate for detecting outliers in a univariate numeric dataset?

Question 159hardmulti select

Read the full Exploratory Data Analysis explanation →

Which THREE of the following are common causes of multicollinearity in a linear regression model?

Question 160easymulti select

Read the full Exploratory Data Analysis explanation →

Which TWO of the following are benefits of feature scaling for machine learning algorithms?

Question 161mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with missing values in several columns. The dataset contains both numerical and categorical features. Which approach should the data scientist use to handle missing values while minimizing bias and preserving relationships in the data?

Question 162hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is preparing a dataset for training a binary classification model. The target variable is highly imbalanced (95% negative, 5% positive). The engineer needs to split the data into training and test sets while maintaining the class distribution in both sets. Which method should the engineer use?
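
Stratified splitting means sampling each class separately so that both partitions keep the original class ratio. The standard-library sketch below illustrates the idea on a made-up 95/5 label vector; the function name and the 20% test fraction are assumptions. In practice this is commonly done with scikit-learn's `train_test_split(..., stratify=y)`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class by the same fraction."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 95% negative, 5% positive.
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels)

train_pos_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_pos_rate, test_pos_rate)
```

A purely random split on data this imbalanced can leave the test set with too few positives to evaluate on, which is the failure stratification avoids.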

Question 163easymultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is analyzing feature distributions in a dataset and notices that one feature has a long tail. Which transformation is most appropriate to reduce skewness and make the distribution more normal?

Question 164hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a high-dimensional dataset with 500 features. The scientist wants to visualize the data in 2D to check for clusters. Which dimensionality reduction technique should the scientist use that preserves global structure and is computationally efficient for large datasets?

Question 165easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is examining a scatter plot of two variables and notices a strong positive correlation. Which of the following is a valid conclusion?

Question 166mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset containing 10,000 observations and 100 features. The scientist wants to detect outliers in the dataset. Which method is most appropriate for outlier detection in a high-dimensional space?

Question 167mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is building a data pipeline that aggregates customer transaction data. The engineer notices that some transactions have duplicate entries due to a system error. Which approach should the engineer use to identify and remove duplicates based on a unique transaction ID?
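
Removing duplicates by a key comes down to keeping the first row seen for each key value. The sketch below shows that logic in plain Python on hypothetical records; with pandas the equivalent call is `DataFrame.drop_duplicates(subset=["transaction_id"])`.

```python
# Hypothetical transactions; T1 appears twice because of a system error.
transactions = [
    {"transaction_id": "T1", "amount": 25.0},
    {"transaction_id": "T2", "amount": 40.0},
    {"transaction_id": "T1", "amount": 25.0},
    {"transaction_id": "T3", "amount": 12.5},
]


def drop_duplicate_ids(rows, key="transaction_id"):
    """Keep the first row for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


deduped = drop_duplicate_ids(transactions)
print([row["transaction_id"] for row in deduped])
```

Counting how many rows were dropped, and checking whether the dropped rows differ from the kept ones in any other column, is a useful sanity check before discarding them.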

Question 168hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning team is analyzing feature importance in a dataset with many categorical features. They plan to use a tree-based model. Which encoding method should they use to handle high-cardinality categorical features without creating too many dummy variables?

Question 169easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst wants to understand the distribution of a continuous variable. Which visualization is most appropriate for this purpose?

Question 170mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset and finds that two features have a Pearson correlation coefficient of 0.95. Which TWO actions should the data scientist consider? (Choose two.)

Question 171hardmulti select

Read the full Exploratory Data Analysis explanation →

A data engineer is performing exploratory data analysis on a dataset with 1 million rows and 50 features. The engineer wants to identify missing values and outliers. Which THREE approaches should the engineer use? (Choose three.)

Question 172easymulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with a binary target variable. Which TWO metrics are appropriate for evaluating the balance of the target classes? (Choose two.)

Question 173hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data science team at a financial services company is building a fraud detection model using a dataset of credit card transactions. The dataset contains 10 million rows and 20 features, including transaction amount, merchant category, time since last transaction, and customer ID. The target variable 'is_fraud' is highly imbalanced: only 0.1% of transactions are fraudulent. The team is performing exploratory data analysis (EDA) on a sample of 100,000 rows. They compute the correlation matrix and find that 'transaction amount' has a correlation of 0.02 with 'is_fraud'. They also plot the distribution of 'transaction amount' and see that it is heavily right-skewed with a long tail. The team wants to understand the relationship between 'transaction amount' and fraud more deeply before feature engineering. They have access to Amazon SageMaker and can run processing jobs. Which course of action is most appropriate?

Question 174easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a dataset with missing values. They want to understand the distribution of each feature and identify outliers. Which AWS service can be used to create visualizations such as histograms and box plots without writing any code?

Question 175easymultiple choice

Read the full Exploratory Data Analysis explanation →

During exploratory data analysis, a data scientist notices that the target variable is highly imbalanced. Which technique should be used to address this issue before training a classification model?

Question 176mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset stored in an Amazon S3 bucket. The dataset contains both numerical and categorical features. The scientist wants to compute summary statistics (mean, median, standard deviation) for all numerical features and count the distinct values for categorical features. Which AWS service is most appropriate for this task with minimal coding?

Question 177mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a time-series dataset and observes a strong upward trend and seasonal patterns. The scientist needs to make the data stationary for modeling. Which transformation should be applied?
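
Differencing is a common way to remove trend and seasonality: a seasonal difference subtracts the value one season earlier, and a first difference subtracts the previous value. The sketch below applies both to a synthetic series; the trend slope and the period of 4 are assumptions chosen so the result is easy to verify.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: linear trend (+2 per step) plus a period-4 seasonal pattern.
season = [0, 5, 0, -5]
series = [2 * t + season[t % 4] for t in range(16)]

# Seasonal differencing removes the repeating pattern; the trend leaves a constant.
seasonal_diff = difference(series, lag=4)

# A further first difference removes that constant as well.
stationary = difference(seasonal_diff, lag=1)

print(seasonal_diff)
print(stationary)
```

Real data will not difference down to exact constants, so a stationarity test or a plot of the differenced series is normally used to decide whether more differencing is needed.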

Question 178hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with high cardinality categorical features (e.g., user IDs with millions of unique values). They want to visualize the relationship between these categorical features and a continuous target variable. Which approach is most effective for EDA?

Question 179hardmultiple choice

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist discovers that two numerical features have a Pearson correlation coefficient of 0.95. Which action should the scientist take to avoid multicollinearity in a linear regression model?

Question 180easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset and wants to check for missing values. Which method is most appropriate to identify the percentage of missing values per column?
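
The percentage of missing values per column is the count of missing entries divided by the row count. The sketch below computes it in plain Python on hypothetical records where `None` marks a missing value; with pandas the equivalent expression is `df.isna().mean() * 100`.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "city": "Austin", "income": None},
    {"age": None, "city": "Boston", "income": 52000},
    {"age": 41, "city": None, "income": None},
    {"age": 29, "city": "Denver", "income": 61000},
]


def missing_percent(records):
    """Map each column name to the percentage of records where it is None."""
    n = len(records)
    columns = records[0].keys()
    return {
        col: 100.0 * sum(r[col] is None for r in records) / n
        for col in columns
    }


pct = missing_percent(rows)
print(pct)
```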

Question 181mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with many features. They suspect some features are redundant due to high pairwise correlations. Which technique can help identify groups of correlated features?

Question 182hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a large dataset (10 TB) stored in Amazon S3. The dataset is in CSV format and has many columns. The scientist wants to quickly compute summary statistics (mean, min, max, count) for each column without moving the data. Which approach is most cost-effective and efficient?

Question 183mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with mixed data types (numerical and categorical). Which TWO visualizations are most appropriate for understanding the distribution of categorical features?

Question 184hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset and suspects the presence of outliers that could affect the mean and standard deviation. Which TWO methods are robust to outliers for measuring central tendency and dispersion?
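
The sensitivity of a statistic to outliers can be seen by computing it with and without a single extreme value. The sketch below compares mean and standard deviation against median and interquartile range on a made-up sample; the data and the extreme value of 500 are hypothetical.

```python
import statistics


def summary(data):
    """Return mean, sample stdev, median and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.fmean(data),
        "stdev": statistics.stdev(data),
        "median": statistics.median(data),
        "iqr": q3 - q1,
    }


# Hypothetical sample, then the same sample with one extreme value appended.
clean = [48, 50, 51, 49, 52, 50, 47, 53]
with_outlier = clean + [500]

before = summary(clean)
after = summary(with_outlier)
print(before)
print(after)
```

On this sample the mean doubles and the standard deviation grows by well over an order of magnitude, while the median stays at 50 and the IQR changes only slightly.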

Question 185mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with a binary target variable. Which THREE techniques can help assess the relationship between a continuous feature and the target?

Question 186mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset containing customer transaction records stored in Amazon S3 as CSV files. The dataset has 500 columns and 2 million rows. The scientist wants to perform EDA to understand data types, missing values, and summary statistics for each column. They need to do this quickly and without writing custom code. The scientist has access to AWS Glue DataBrew and Amazon SageMaker Data Wrangler. Which approach should the scientist take?

Question 187hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a large number of missing values in several columns. The dataset is stored in an Amazon S3 bucket and is about 5 TB in size. The scientist wants to understand the pattern of missingness (e.g., is it missing completely at random, missing at random, or not missing at random) before deciding on an imputation strategy. The scientist has access to AWS Glue DataBrew and Amazon SageMaker Studio. Which approach should the scientist take to best understand the missing data patterns?

Question 188easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset that contains customer demographics and purchase history. The dataset has a column 'age' with some values that are negative or unreasonably high (e.g., 200). The scientist wants to identify and handle these outliers. The scientist is using a SageMaker notebook with pandas. Which approach should the scientist take to effectively handle these outliers?
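
A rough sketch of how implausible `age` values could be separated from plausible ones before deciding what to do with them. It uses plain Python rather than pandas so it runs anywhere; the bounds 0 and 120 and the helper name `split_by_plausible_range` are assumptions chosen for illustration, not values given in the question.

```python
def split_by_plausible_range(ages, low=0, high=120):
    """Split ages into plausible values and suspect values (out of range or missing).

    The default bounds are illustrative domain assumptions, not fixed rules.
    """
    plausible, suspect = [], []
    for age in ages:
        if age is not None and low <= age <= high:
            plausible.append(age)
        else:
            suspect.append(age)
    return plausible, suspect


# Hypothetical column values, including a negative age, an impossible age, and a gap.
ages = [34, -5, 200, 61, 0, 47, None]
plausible, suspect = split_by_plausible_range(ages)

print("plausible:", plausible)
print("suspect:  ", suspect)
```

In pandas, a comparable mask could be built with `df['age'].between(0, 120)`, assuming a DataFrame named `df`; whether suspect rows are dropped, capped, or imputed remains a judgment call for the analyst.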

Question 189easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with missing values in several columns. The dataset is stored in an S3 bucket. What is the most efficient method to identify the percentage of missing values per column using AWS services?

Question 190mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is performing exploratory data analysis on a large dataset stored in S3 using Amazon Athena. The dataset contains a timestamp column 'event_time' of type string. The engineer wants to analyze daily trends. Which approach is the most cost-effective and efficient?

Question 191hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with 500 features and 100,000 observations. The target variable is binary. The dataset contains highly correlated features and some categorical variables with high cardinality. Which combination of techniques should the data scientist use to reduce dimensionality while preserving interpretability for EDA?

Question 192mediummultiple choice

Read the full Exploratory Data Analysis explanation →

An organization stores streaming data in Amazon Kinesis Data Streams. A data analyst wants to perform real-time exploratory data analysis on the incoming data to detect anomalies. Which AWS service should the analyst use to run SQL queries on the streaming data?

Question 193easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is using Amazon SageMaker Data Wrangler for exploratory data analysis. The dataset contains a column with missing values that are encoded as 'NA' strings. The data scientist wants to treat these as missing values during the import. Which step should the data scientist take?

Question 194hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is performing EDA on a dataset with 1 million rows and 200 columns. The dataset is stored in S3 as CSV files. The engineer notices that some columns have a high proportion of zeros. What is the best approach to determine if these zeros represent missing data or actual zero values?

Question 195mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a skewed target variable for a regression problem. During EDA, the scientist wants to transform the target variable to approximate a normal distribution. Which transformation should the scientist apply first?

Question 196easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is using Amazon QuickSight to explore a dataset with 10 million rows. The analyst wants to create a histogram of a numerical column. However, the query is taking too long. Which action should the analyst take to improve performance without losing accuracy?

Question 197mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with mixed data types (numerical, categorical, text). The dataset is stored in S3. Which TWO AWS services can be used to directly perform statistical summaries and visualizations without writing custom code?

Question 198hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with several categorical features and a binary target. The scientist wants to check for association between each categorical feature and the target. Which THREE statistical tests are appropriate?

Question 199mediummulti select

Read the full Exploratory Data Analysis explanation →

A data engineer is exploring a large dataset in Amazon Athena. The dataset is partitioned by date and stored in Parquet format. The engineer wants to check the number of distinct values in a column for a specific date range. Which THREE practices reduce query cost and improve performance?

Question 200hardmultiple choice

Read the full Exploratory Data Analysis explanation →

The exhibit shows an Athena query result from a table. What is the output of the query?

Question 201hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working on a customer churn prediction project for a telecom company. The dataset contains 50,000 records with 25 features, including 'tenure' (number of months customer stayed), 'monthly_charges', 'total_charges', 'contract_type' (month-to-month, one year, two year), 'payment_method', and a target 'churn' (Yes/No). The data is stored in an S3 bucket as a single CSV file. The scientist uses Amazon SageMaker Data Wrangler to perform EDA. After importing the data, the scientist notices that the 'total_charges' column has many missing values (about 20% of rows). The scientist suspects that missing values occur only for customers with tenure = 0 (new customers). After verifying that suspicion, the scientist wants to handle the missing values appropriately. Which course of action should the scientist take?

Question 202mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is performing EDA on a dataset containing user activity logs from a mobile app. The dataset has 10 million rows and includes columns: 'user_id', 'event_type', 'timestamp', 'device_type', and 'session_duration'. The engineer uses Amazon Athena to query the data stored in S3 as CSV files. The engineer runs a query to find the average session_duration per device_type, but the query takes over 5 minutes and scans 100 GB of data. The engineer wants to reduce query cost and improve performance for future EDA. The dataset is not partitioned, and the engineer anticipates frequent queries filtering on 'timestamp' and 'device_type'. Which action will most effectively reduce data scanned?

Question 203mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset of customer reviews for a retail company. The dataset contains text reviews, star ratings (1-5), and customer metadata. The scientist wants to perform sentiment analysis to classify reviews as positive or negative. During EDA, the scientist uses Amazon SageMaker Data Wrangler to visualize the distribution of star ratings and notices that 90% of reviews are 4 or 5 stars, while only 2% are 1 star. The scientist is concerned about class imbalance. Which approach should the scientist take to address the imbalance before modeling?

Question 204easymulti select

Read the full Exploratory Data Analysis explanation →

A data scientist wants to understand the distribution and missing values in a large dataset stored in Amazon S3. Which TWO AWS services can be used directly for this exploratory data analysis? (Choose TWO.)

Question 205mediummulti select

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is analyzing a dataset with 500 features and suspects multicollinearity. Which TWO techniques can help identify and address multicollinearity during exploratory data analysis? (Choose TWO.)

Question 206hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a time-series dataset of website traffic. The dataset contains hourly page views for the past two years. The scientist wants to analyze seasonality and trends. Which THREE techniques are appropriate for this analysis? (Choose THREE.)

Question 207hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a large dataset stored in Amazon S3 (100 GB, CSV format, 500 columns). The dataset contains customer transaction records with features such as transaction amount, timestamp, customer ID, and numerous categorical variables (e.g., product category, payment method, location). The scientist wants to understand the distribution of transaction amounts across product categories and identify any outliers. The scientist is using pandas on an Amazon SageMaker notebook instance of type ml.t3.medium. However, loading the entire dataset into a DataFrame with pd.read_csv('s3://bucket/data.csv') crashes the notebook with a memory error. The scientist also suspects that some categorical columns have high cardinality (e.g., product category has thousands of unique values), and several columns contain missing values. What is the MOST efficient approach to perform the EDA without modifying the original dataset or using additional AWS services?

Options:

A) Use the SageMaker SDK to launch a parallel processing job with PySpark and read the data into a Spark DataFrame, then compute statistics and visualize with matplotlib.

B) Use pandas with the chunksize parameter to iterate through the dataset in chunks, compute per-chunk statistics, and aggregate the results; for high-cardinality columns, use value_counts() with dropna=False and then plot the top 20 categories.

C) Use the S3 Select API to filter rows and columns before loading into pandas, reducing the data size; then use pandas for EDA.

D) Use SageMaker Data Wrangler to import the dataset, create a flow to handle missing values and reduce cardinality, and export a sample to the notebook for analysis.
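
The chunked-processing idea mentioned in the options can be sketched as follows. This is a pure-Python stand-in for iterating with the pandas `chunksize` parameter, not the pandas API itself; the column names `category` and `amount`, the tiny in-memory CSV, and the helper names are all assumptions for illustration. Note that counts and sums combine exactly across chunks, whereas order statistics such as the median do not.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows so only one chunk is in memory."""
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def mean_amount_by_category(csv_file, chunk_size=2):
    """Aggregate per-category count and sum chunk by chunk, then derive means."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    missing = 0
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            if row["amount"] == "":
                missing += 1  # track missing amounts instead of silently dropping them
                continue
            counts[row["category"]] += 1
            sums[row["category"]] += float(row["amount"])
    means = {cat: sums[cat] / counts[cat] for cat in counts}
    return means, missing


# Hypothetical sample standing in for the large CSV object.
sample = io.StringIO(
    "category,amount\n"
    "books,10\n"
    "books,30\n"
    "toys,5\n"
    "toys,\n"
    "books,20\n"
)
means, missing = mean_amount_by_category(sample)
print(means, "missing amounts:", missing)
```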

Question 208easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with 10 features and observes that the correlation between feature A and feature B is 0.98. Which action should be taken to address multicollinearity before training a linear regression model?

Question 209mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer trains a binary classifier on an imbalanced dataset where the positive class represents 1% of the data. After training, the model achieves 99% accuracy but only 10% recall on the positive class. Which metric should the engineer focus on to evaluate the model's performance on the minority class?

Question 210hardmultiple choice

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist plots the distribution of a numeric feature and observes that it is right-skewed. The feature will be used as input to a linear model. Which transformation should the data scientist apply?
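
To make the effect of a transformation on right skew concrete, here is a small sketch that measures sample skewness before and after applying `log1p`. The data are invented, and the log transform is used only as one common example of a skew-reducing transformation; it assumes non-negative values.

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature: most values are small, a few are very large.
feature = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 100]

# log1p(x) = log(1 + x), which is defined at zero, unlike log(x).
transformed = [math.log1p(v) for v in feature]

skew_before = skewness(feature)
skew_after = skewness(transformed)
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The transformed feature is still somewhat right-skewed in this sample, but noticeably less so than the raw values.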

Question 211easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist has a dataset with 500 features and wants to reduce dimensionality for visualization. Which technique is most appropriate for identifying the two components that capture the most variance?

Question 212mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer runs a SQL query on Amazon Athena to explore a dataset stored in S3 as CSV. The query returns zero rows for a column that should have numeric values. Which step should the engineer take to diagnose the issue?

Question 213hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with missing values in 3 of 20 features. The missing rate is 5% for each feature. The scientist wants to preserve as much data as possible while avoiding bias. Which imputation strategy is most appropriate?

Question 214easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist uses Amazon SageMaker Data Wrangler to explore a dataset and notices that the target variable is highly imbalanced. Which technique should the data scientist apply to balance the dataset before training?

Question 215mediummultiple choice

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist finds that a numeric feature has many outliers. The feature will be used in a linear regression model. Which approach should the scientist take to handle the outliers?

Question 216hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist examines a dataset with 100 features and suspects that some features are redundant due to high pairwise correlations. Which EDA technique should the scientist use to systematically identify groups of highly correlated features?
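
One way to scan for redundant features is to compute every pairwise Pearson correlation and list the pairs above a cutoff. The sketch below does this in plain Python on three invented columns; the 0.9 threshold and the names `pearson` and `high_correlation_pairs` are assumptions for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def high_correlation_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| at or above the threshold."""
    pairs = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical features: "b" is nearly a multiple of "a", "c" is unrelated.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
pairs = high_correlation_pairs(columns)
for name_a, name_b, r in pairs:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

With many features, the same pair list can feed a heatmap or a clustering of the correlation matrix to reveal groups rather than isolated pairs.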

Question 217easymulti select

Read the full Exploratory Data Analysis explanation →

Which TWO actions should a data scientist take when exploring a dataset that contains missing values and outliers? (Select TWO.)

Question 218mediummulti select

Read the full Exploratory Data Analysis explanation →

Which THREE techniques are commonly used in exploratory data analysis to understand the relationships between features and the target variable? (Select THREE.)

Question 219hardmulti select

Read the full Exploratory Data Analysis explanation →

Which TWO statements about handling missing data during exploratory data analysis are correct? (Select TWO.)

Question 220mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with 10 million rows and 500 features. The target variable is binary. The dataset is stored in an Amazon S3 bucket. The data scientist wants to quickly identify which features have the highest correlation with the target variable. Which approach is MOST efficient?

Question 221easymultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transaction records. The dataset has missing values in the 'age' column and outliers in the 'amount' column. Which combination of techniques should the engineer use to handle these issues during EDA?

Question 222hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with 1 million records and 20 features. The target variable is continuous. The scientist wants to identify non-linear relationships between features and the target. Which technique is MOST suitable for this purpose during exploratory data analysis?

Question 223mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A company is preparing a dataset for training a binary classification model. The dataset has a severe class imbalance (1% positive class). The data scientist wants to understand the impact of this imbalance on model performance before sampling. Which exploratory analysis step is MOST critical?

Question 224easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset that contains text reviews and a numeric rating (1-5). The goal is to predict the rating from the review text. During EDA, the scientist wants to check if there are any spelling errors or unusual characters. Which tool is BEST suited for this task?

Question 225hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is performing EDA on a time-series dataset of server metrics (CPU, memory, disk I/O) collected every minute. The dataset contains 2 years of data. The engineer suspects there are seasonal patterns and wants to decompose the time series for one metric. Which AWS service can be used to perform this decomposition natively?

Question 226mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with 100 features. After generating pair plots, the scientist notices that many features have skewed distributions. Which transformation should the scientist apply to make the distributions more Gaussian-like for modeling?

Question 227easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with 500,000 rows and 10 columns. The dataset is stored in an S3 bucket as CSV files. The scientist wants to generate summary statistics (mean, median, min, max) for all numeric columns. Which service allows the quickest ad-hoc analysis without provisioning any infrastructure?

Question 228hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with 100,000 observations and 50 features. The scientist uses a Jupyter notebook on Amazon SageMaker. During EDA, the scientist runs a command to check for missing values and notices that 20% of the data in one feature is missing. The missing values are not random; they are correlated with another feature. Which imputation method is MOST appropriate?

Question 229mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target is binary. The scientist wants to reduce dimensionality while preserving information related to the target. Which TWO methods are appropriate?

Question 230hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with mixed data types (numeric, categorical, text). The dataset has 5 million rows. The scientist wants to understand the relationships between variables and identify potential data quality issues. Which THREE tools are suitable for this analysis?

Question 231easymulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset that contains geolocation coordinates (latitude and longitude) and timestamps. The scientist wants to visualize the data to check for spatial and temporal patterns. Which TWO AWS services can be used for this visualization?

Question 232mediummultiple choice

Read the full Exploratory Data Analysis explanation →

An ML engineer runs the AWS CLI command shown in the exhibit to list files in a training data bucket. The engineer notices that the three CSV files have different sizes but the same number of columns. What is the MOST likely cause of the size variation?

Question 233hardmultiple choice

Read the full Exploratory Data Analysis explanation →

An IAM policy is attached to a data scientist's role. The scientist is trying to list objects in the 'data-bucket' using Amazon Athena. The query fails with an access denied error. What is the MOST likely reason?

Exhibit

Refer to the exhibit.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::data-bucket",
                "arn:aws:s3:::data-bucket/*"
            ]
        },
        {
            "Effect": "Deny",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::data-bucket/sensitive/*"
        }
    ]
}

Question 234mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A DevOps engineer runs the CloudWatch Logs Insights query shown in the exhibit on the log group for an ML training job. The result shows a spike in ERROR messages at a specific hour. What should the engineer do next to identify the root cause?

Exhibit

Refer to the exhibit.

CloudWatch Logs Insights query:
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(1h)
| sort @timestamp desc
| limit 10

Question 235easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist runs a SQL query on an Amazon Athena table and notices that the query scans a large amount of data. Which approach would reduce the amount of data scanned without changing the SQL logic?

Question 236mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer ingests streaming data into Amazon Kinesis Data Streams. The data science team needs to analyze the data using Amazon SageMaker notebooks. What is the most efficient way to provide access to the stream data for ad-hoc exploration?

Question 237hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning team is building a fraud detection model. The dataset is highly imbalanced (99.9% legitimate, 0.1% fraudulent). Which EDA technique is most important to apply before modeling?

Question 238mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset containing customer transaction records. The target variable is 'churn' (1 = churned, 0 = not churned). Which TWO actions should the scientist take to understand the data distribution and prepare for modeling?

Question 239hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a large dataset of images stored in Amazon S3. The dataset is used to train a computer vision model. Which THREE EDA steps are appropriate for this image dataset?

Question 240easymulti select

Read the full Exploratory Data Analysis explanation →

A data analyst is performing EDA on a tabular dataset with 500 features. The goal is to reduce dimensionality before modeling. Which TWO techniques are appropriate for this task?

Question 241mediummultiple choice

Read the full Exploratory Data Analysis explanation →

Refer to the exhibit. A data scientist is unable to query a table in Amazon Athena that is located in the 'my-data-bucket' S3 bucket. The IAM policy shown is attached to the scientist's role. What is the most likely reason for the failure?

Exhibit

Refer to the exhibit.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-data-bucket/*",
        "arn:aws:s3:::my-data-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryResults"
      ],
      "Resource": "*"
    }
  ]
}

Question 242hardmultiple choice

Read the full Exploratory Data Analysis explanation →

Refer to the exhibit. A data scientist runs the AWS CLI command shown and gets the output. The scientist wants to create an Athena table over all log files in the 'logs/2023/' prefix, including files smaller than 1000 bytes. Which approach achieves this?

Question 243mediummultiple choice

Read the full Exploratory Data Analysis explanation →

Refer to the exhibit. A data scientist is using AWS Glue ETL jobs to process data from a source database. The job logs show repeated timeout errors. Which EDA step should the scientist perform to diagnose the issue?

Exhibit

Refer to the exhibit.

[ERROR] 2023-01-15T10:30:00.000Z 12345678-1234-1234-1234-123456789012
Task timed out after 300.00 seconds

[ERROR] 2023-01-15T10:35:00.000Z 12345678-1234-1234-1234-123456789012
Task timed out after 300.00 seconds

Question 244easymulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset that contains both numerical and categorical features. The target variable is continuous. Which TWO EDA techniques should the scientist use to understand relationships between features and the target?

Question 245hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with many missing values. The scientist wants to decide on an imputation strategy. Which THREE considerations are important for choosing the imputation method?

Question 246easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset and wants to identify outliers in a numerical feature. The feature is not normally distributed. Which technique is robust to non-normal distributions?

Question 247mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a time series dataset of daily website visits. The scientist wants to identify any seasonality patterns. Which visualization is most appropriate?

Question 248hardmultiple choice

Read the full Exploratory Data Analysis explanation →

Refer to the exhibit. A data scientist queries the table with 'SELECT COUNT(*) FROM mytable' in Athena and gets a count of 1000. However, the scientist knows there are 1500 data files in the S3 location. What is the most likely reason for the discrepancy?

Question 249easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with many features and wants to detect multicollinearity. Which technique should the scientist use?

Question 250mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with missing values in several columns. The dataset contains customer demographic information and purchase history. Which approach should the data scientist take to handle missing values without introducing bias into the dataset?

Question 251hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is performing exploratory data analysis on a large dataset stored in Amazon S3 using AWS Glue. The dataset contains a mix of numeric and categorical features. The engineer wants to efficiently compute summary statistics (e.g., mean, median, standard deviation) for the numeric columns. Which AWS service or feature should the engineer use to achieve this with minimal setup?

Question 252easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is exploring a dataset with a target variable that is highly imbalanced. The minority class represents only 1% of the data. Which technique should the analyst use to better understand the relationships between features and the minority class?

Question 253mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a time series dataset of daily website traffic. The scientist notices a strong weekly seasonality. To better understand the underlying patterns, which decomposition method should the scientist use to separate the trend, seasonal, and residual components?
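
The idea of separating trend, seasonal, and residual components can be sketched with a classical additive decomposition: a centered moving average estimates the trend, and the average detrended value at each position in the cycle estimates the seasonal pattern. This is a simplified illustration on synthetic data, assuming an odd period (7 days) and an additive model; it is not a specific library's implementation, and the function names are hypothetical.

```python
def centered_moving_average(series, window):
    """Centered moving average for an odd window; edges are left as None."""
    half = window // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        trend[i] = sum(series[i - half:i + half + 1]) / window
    return trend


def decompose_additive(series, period):
    """Split a series into trend, a repeating seasonal pattern, and residuals."""
    trend = centered_moving_average(series, period)
    buckets = [[] for _ in range(period)]
    for i, (value, t) in enumerate(zip(series, trend)):
        if t is not None:
            buckets[i % period].append(value - t)
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # center the pattern so it sums to zero
    pattern = [m - offset for m in means]
    residual = [
        None if t is None else value - t - pattern[i % period]
        for i, (value, t) in enumerate(zip(series, trend))
    ]
    return trend, pattern, residual


# Synthetic daily traffic: linear growth plus a weekly pattern that sums to zero.
weekly = [-3, -1, 0, 1, 2, 4, -3]
series = [100 + 0.5 * day + weekly[day % 7] for day in range(28)]

trend, pattern, residual = decompose_additive(series, period=7)
print("estimated weekly pattern:", [round(p, 2) for p in pattern])
```

Because the synthetic series has no noise, the estimated pattern matches `weekly` and the residuals are essentially zero; real traffic data would leave a non-trivial residual component.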

Question 254hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning team is working with a dataset containing high-dimensional sparse features, such as text data represented as bag-of-words. The team wants to reduce dimensionality while preserving the structure of the sparse matrix. Which technique is most appropriate for this scenario?

Question 255easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is examining the distribution of a continuous variable and notices that its histogram is heavily skewed to the right. Which transformation should the analyst apply to make the distribution more symmetrical?

Question 256mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a dataset with both numerical and categorical features. The scientist wants to visualize the pairwise relationships between numerical features and also see the distribution of each feature. Which type of plot should the scientist use?

Question 257hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A team is analyzing a dataset with many categorical features that have high cardinality (e.g., ZIP code, user ID). They want to explore relationships between these features and a continuous target variable. Which approach is most appropriate for visualizing these relationships without overwhelming the viewer?

Question 258easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is investigating a dataset where the target variable is binary (0/1). The analyst wants to check for multicollinearity among the numerical features. Which statistical measure should the analyst use?

Question 259mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with many features and suspects that some features are highly correlated. Which TWO methods can the scientist use to detect and handle multicollinearity before building a linear regression model?

Question 260hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a continuous target variable and suspects that the relationship between a predictor and the target is non-linear. Which THREE techniques can the scientist use to explore and model this non-linearity?

Question 261easymulti select

Read the full Exploratory Data Analysis explanation →

A data analyst is performing exploratory data analysis on a dataset and notices that there are outliers in several numerical columns. Which TWO methods can the analyst use to identify outliers?

Question 262mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset stored as a single 2 GB object in S3. The scientist wants to read only a subset of the file (e.g., the first 1000 lines) to perform initial data inspection. Which approach should the scientist take to minimize data transfer and cost?

Question 263hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is trying to list objects in an S3 bucket named 'my-bucket' using the AWS CLI command: `aws s3 ls s3://my-bucket/`. The command fails with an access denied error. The IAM policy attached to the scientist's role is shown in the exhibit. What is the most likely cause of the failure?

Exhibit

Refer to the exhibit.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::my-bucket",
                "arn:aws:s3:::my-bucket/*"
            ],
            "Condition": {
                "StringEquals": {
                    "s3:ExistingObjectTag/data-type": "training"
                }
            }
        }
    ]
}
```

Question 264easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is querying the AWS Glue Data Catalog table shown in the exhibit. The engineer runs an Athena query: SELECT * FROM transactions WHERE year=2023. The query returns results quickly. However, a subsequent query: SELECT * FROM transactions WHERE amount > 100 takes a long time. What is the most likely reason for the performance difference?

Question 265easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with 100 features and wants to identify which features are most correlated with the target variable. Which AWS service is most appropriate for this task?

Question 266mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A company is building a classification model and discovers that the target variable is imbalanced: 95% of samples belong to class A and 5% to class B. The data scientist needs to understand the distribution of numeric features for each class. Which approach is most appropriate?

Question 267hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is performing exploratory data analysis on a large dataset stored in Amazon S3 (10 TB in CSV format). The dataset has 2000 columns and 50 million rows. The engineer needs to compute summary statistics (mean, median, standard deviation) for each numeric column and identify missing values. Which approach is MOST cost-effective and time-efficient?

Question 268 (easy, multiple choice)

A machine learning team is reviewing a dataset for a regression problem. They notice that the target variable has a right-skewed distribution. Which transformation should they consider applying to the target variable to improve model performance?

Question 269 (medium, multiple choice)

A data scientist is working with a dataset containing customer transactions. The dataset has a column named 'transaction_date' with timestamp values. The scientist wants to create new features such as day of week, hour, and whether the transaction occurred on a weekend. Which AWS service provides built-in feature engineering capabilities for datetime columns?
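
As a rough illustration of the datetime features named in this question, the sketch below derives day of week, hour, and a weekend flag using only Python's standard library. The function name `datetime_features` and the sample timestamp are assumptions made for the example; it does not represent any AWS service's built-in transform.

```python
from datetime import datetime


def datetime_features(ts: datetime) -> dict:
    """Derive simple calendar features from a single timestamp (hypothetical helper)."""
    day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "day_of_week": day_of_week,
        "hour": ts.hour,
        "is_weekend": day_of_week >= 5,  # Saturday or Sunday
    }


# Hypothetical transaction timestamp: Saturday, 6 January 2024, 14:30.
example = datetime_features(datetime(2024, 1, 6, 14, 30))
print(example)
```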

Question 270 (hard, multiple choice)

A data scientist is analyzing clickstream data from a website. The data is stored in Amazon S3 as JSON files, each containing nested arrays. The scientist needs to flatten the nested structures and compute user session durations. Which approach is most efficient for this EDA task?
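
To make the task in this scenario concrete, here is a minimal sketch of flattening nested clickstream records and computing session durations in plain Python. The record layout (`user_id`, `sessions`, `events`, `ts`) is entirely hypothetical, and the sketch says nothing about which AWS service is the most efficient choice at scale.

```python
from datetime import datetime

# Hypothetical nested clickstream records; the real schema is not given in the question.
records = [
    {"user_id": "u1", "sessions": [
        {"session_id": "s1", "events": [
            {"ts": "2024-01-15T10:00:00", "page": "/home"},
            {"ts": "2024-01-15T10:07:30", "page": "/cart"},
        ]},
    ]},
    {"user_id": "u2", "sessions": [
        {"session_id": "s2", "events": [
            {"ts": "2024-01-15T11:00:00", "page": "/home"},
        ]},
    ]},
]


def flatten(nested):
    """Yield one flat row per event from the nested user -> session -> event layout."""
    for user in nested:
        for session in user["sessions"]:
            for event in session["events"]:
                yield {
                    "user_id": user["user_id"],
                    "session_id": session["session_id"],
                    "ts": datetime.fromisoformat(event["ts"]),
                    "page": event["page"],
                }


def session_durations(rows):
    """Return seconds between the first and last event of each (user, session)."""
    bounds = {}
    for row in rows:
        key = (row["user_id"], row["session_id"])
        first, last = bounds.get(key, (row["ts"], row["ts"]))
        bounds[key] = (min(first, row["ts"]), max(last, row["ts"]))
    return {key: (last - first).total_seconds() for key, (first, last) in bounds.items()}


durations = session_durations(flatten(records))
print(durations)
```

A session with a single event gets a duration of zero here; whether that is the right convention depends on how sessions are defined.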

Question 271 (easy, multiple choice)

During exploratory data analysis, a data scientist notices that a categorical feature 'city' has over 1,000 unique values. The dataset has 10,000 rows. Which technique should the scientist consider to reduce the cardinality of this feature?

Question 272 (medium, multiple choice)

A data scientist is performing EDA on a dataset with 500 features. The dataset has a mix of numeric and categorical features. The scientist wants to identify which features have a strong nonlinear relationship with the target variable. Which technique is most appropriate?

Question 273 (hard, multiple choice)

A data engineer is running an Amazon SageMaker Data Wrangler flow on a dataset with 5 million rows. The flow includes several transformations. The engineer wants to validate the data quality by checking for missing values and outliers before training. Which approach is most efficient?

Question 274 (easy, multi select)

Which TWO of the following are common techniques for handling missing values in a dataset during exploratory data analysis? (Select TWO.)

Question 275 (medium, multi select)

Which THREE of the following are valid techniques for detecting outliers in a dataset during exploratory data analysis? (Select THREE.)

Question 276 (hard, multi select)

Which TWO of the following are best practices for exploratory data analysis when using Amazon SageMaker Data Wrangler? (Select TWO.)

Question 277 (easy, multiple choice)

A data scientist needs to understand the distribution of a numeric feature in a dataset stored in Amazon S3. Which AWS service can be used to run a quick exploratory query without setting up a server?

Question 278 (medium, multiple choice)

A data scientist is analyzing a dataset with 10 million rows and 50 columns. The target variable is highly imbalanced (99% negative, 1% positive). Which approach is most appropriate for exploratory data analysis before modeling?

Question 279 (hard, multiple choice)

During exploratory data analysis, a data scientist notices that the correlation matrix of features shows many pairs with absolute correlation > 0.95. The dataset includes both numerical and categorical variables. Which technique is most appropriate to reduce multicollinearity while preserving the most information?

Question 280 (easy, multiple choice)

A data scientist is analyzing a dataset with a timestamp column. The goal is to identify seasonality and trends. Which visualization technique is most suitable?

Question 281 (medium, multiple choice)

A company uses Amazon SageMaker Data Wrangler to perform exploratory data analysis. They want to detect outliers in a numerical column using the Interquartile Range (IQR) method. Which transformation should they apply in Data Wrangler?
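
For readers unfamiliar with the IQR method mentioned here, the following is a minimal sketch of the underlying arithmetic in plain Python. It is not Data Wrangler's implementation; the helper name `iqr_bounds`, the sample values, the conventional multiplier of 1.5, and the "inclusive" quantile convention are all assumptions made for the example.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical numeric column with one suspicious value.
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
low, high = iqr_bounds(values)
outliers = [v for v in values if v < low or v > high]
print(low, high, outliers)
```

Different quantile conventions shift the fences slightly, so borderline points may be flagged by one tool and not another.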

Question 282 (hard, multiple choice)

A data scientist is performing EDA on a dataset with 1 million rows. They suspect the dataset contains duplicate rows. Which approach is most efficient to identify duplicates in Amazon SageMaker Studio?

Question 283 (easy, multiple choice)

Which AWS service can be used to generate a data profile (including histograms, correlations, and statistics) for a dataset stored in Amazon S3 without writing code?

Question 284 (medium, multiple choice)

A data scientist is exploring a dataset with many missing values. They want to understand the pattern of missingness before deciding on imputation. Which approach is most appropriate?

Question 285 (hard, multiple choice)

A data scientist is using Amazon SageMaker Studio notebooks for EDA. They want to share a reproducible report that includes code, visualizations, and narrative text with their team. Which approach should they use?

Question 286 (easy, multi select)

Which TWO approaches are appropriate for handling missing categorical data during exploratory data analysis? (Choose two.)

Question 287 (medium, multi select)

Which THREE actions are valid steps in exploratory data analysis when working with a new dataset? (Choose three.)

Question 288 (hard, multi select)

Which TWO techniques can be used to detect multicollinearity among numerical features during exploratory data analysis? (Choose two.)

Question 289 (easy, multiple choice)

Refer to the exhibit. A data scientist lists files in an S3 bucket. The dataset is split into train, test, and validation sets. What is the most likely issue with this data split?

Exhibit

Refer to the exhibit.
```
aws s3 ls s3://my-bucket/data/
2024-01-01 12:00:00    1024 train.csv
2024-01-01 12:00:01    2048 test.csv
2024-01-01 12:00:02     512 validation.csv
```

Question 290 (medium, multiple choice)

Refer to the exhibit. A data scientist is unable to run an Amazon Athena query on data in `my-bucket`. The IAM policy shown is attached to the user. What is the most likely reason for the failure?

Exhibit

Refer to the exhibit.
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```

Question 291 (hard, multiple choice)

Refer to the exhibit. A data scientist is running an Amazon EMR Spark job for exploratory data analysis on a large dataset. The job fails with the error shown. What is the most appropriate action to resolve this?

Exhibit

Refer to the exhibit.
```
$ cat /var/log/syslog | grep "OutOfMemory"
2024-01-15 10:30:45 ERROR OutOfMemoryError: Java heap space
   at org.apache.spark.sql.catalyst.expressions.GenerateMutableProjection.apply(Unknown Source)
```

Question 292 (medium, multiple choice)

A data scientist runs a SageMaker notebook and uses pandas to explore a dataset. The dataset contains 500,000 rows and 20 columns, including a 'timestamp' column. After loading the data into a DataFrame, the memory usage is unexpectedly high. What is the most likely cause?

Question 293 (hard, multiple choice)

A machine learning engineer is analyzing a dataset for a regression problem. The target variable has a long-tail distribution with extreme outliers. The engineer wants to reduce the influence of outliers while preserving the relative order of values. Which data transformation should the engineer apply to the target variable?

Question 294 (easy, multiple choice)

A data scientist is exploring a dataset with 100 features. The goal is to build a binary classification model. The dataset is highly imbalanced with 95% negative class and 5% positive class. The data scientist wants to understand the relationship between features and the target. Which technique is most appropriate for initial exploratory analysis?

Question 295 (medium, multiple choice)

A company is storing customer transaction data in Amazon S3 as CSV files. A data scientist uses AWS Glue to crawl the data and create a table in the AWS Glue Data Catalog. When querying the table with Amazon Athena, the data scientist notices that some columns have NULL values where data should exist. The data scientist examines the raw CSV files and confirms the data is present. What is the most likely cause of the NULL values?

Question 296 (hard, multiple choice)

A data scientist is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a dataset. The dataset contains a feature 'age' with values ranging from 0 to 120. The data scientist wants to detect outliers. Which built-in transform in Data Wrangler is most appropriate for this task?

Question 297 (easy, multiple choice)

A data scientist is analyzing a dataset with numerical features and a binary target variable. The data scientist creates a pairplot and notices that one feature has a bimodal distribution when colored by the target class. What does this observation suggest?

Question 298 (hard, multiple choice)

A data scientist is performing EDA on a dataset of 1 million images stored in Amazon S3. Each image is 100x100 pixels in RGB format. The data scientist wants to compute the mean pixel value per channel across the entire dataset. Which approach is most efficient?
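
The statistic asked about here can be computed in a single streaming pass, without holding every image in memory. The sketch below shows the idea in plain Python; the helper name `channel_means` and the tiny nested-list "images" are hypothetical stand-ins, and a real pipeline would decode images from S3 and likely use a numerical library instead of Python loops.

```python
def channel_means(images):
    """Accumulate per-channel sums over all images, then divide once at the end."""
    sums = [0, 0, 0]  # running totals for the R, G, and B channels
    pixel_count = 0
    for image in images:  # any iterable works, including a lazy generator
        for row in image:
            for r, g, b in row:
                sums[0] += r
                sums[1] += g
                sums[2] += b
                pixel_count += 1
    return tuple(total / pixel_count for total in sums)


# Two hypothetical 1x2 RGB images, each pixel an (R, G, B) tuple.
tiny_images = [
    [[(0, 10, 20), (10, 20, 30)]],
    [[(20, 30, 40), (30, 40, 50)]],
]
means = channel_means(tiny_images)
print(means)
```

Because only running sums and a count are kept, the same accumulation can be split across workers and the partial sums combined afterwards.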

Question 299 (medium, multiple choice)

A data scientist is analyzing a dataset containing customer reviews. The data scientist wants to understand the most common words used in positive and negative reviews. Which AWS service is most suitable for this task?

Question 300 (easy, multiple choice)

A data scientist loads a large dataset from Amazon S3 into a pandas DataFrame using a SageMaker notebook. The dataset contains a mix of numeric and categorical features. The data scientist wants to quickly check for missing values. Which pandas function is most appropriate?

Question 301 (medium, multi select)

A data scientist is exploring a dataset with 50 features and a binary target. The data scientist computes the correlation matrix and finds that two features, X1 and X2, have a correlation coefficient of 0.95. Which TWO actions should the data scientist consider? (Choose 2.)
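
To put the 0.95 correlation in perspective, one common diagnostic is the variance inflation factor (VIF). In the special case where a feature is regressed on just one other feature, R² equals the squared Pearson correlation, so VIF = 1 / (1 − r²). The sketch below works that arithmetic; the function name is hypothetical, and with more than two features the VIF would require regressing each feature on all the others.

```python
def vif_two_features(r: float) -> float:
    """VIF for the two-feature case, where R^2 equals the squared correlation r^2."""
    return 1.0 / (1.0 - r ** 2)


vif = vif_two_features(0.95)
print(round(vif, 2))
```

Values above roughly 5 to 10 are often treated as a sign of problematic multicollinearity, although the threshold is a rule of thumb, not a fixed standard.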

Question 302 (hard, multi select)

A data scientist is performing EDA on a dataset stored in Amazon S3 using Amazon Athena. The dataset is partitioned by date, and each partition contains CSV files. The data scientist notices that some queries return zero rows for partitions that should have data. Which THREE steps should the data scientist take to troubleshoot? (Choose 3.)

Question 303 (easy, multi select)

A data scientist is analyzing a dataset with a mix of numerical and categorical features. The target variable is binary. The data scientist wants to visualize the distribution of a numerical feature across the two target classes. Which TWO visualization techniques are appropriate? (Choose 2.)

Question 304 (medium, multiple choice)

A data scientist runs the AWS CLI command shown in the exhibit to list objects larger than 100 KB in an S3 bucket. The data scientist wants to understand the size distribution of these files. What is the most significant limitation of this approach for EDA?

Question 305 (hard, multiple choice)

A data scientist is granted the IAM policy shown in the exhibit. The data scientist can query the 'data-lake-bucket' using Athena and get results. However, when the data scientist tries to run a CTAS (CREATE TABLE AS SELECT) query in Athena to write results to a new S3 location, the query fails. What is the most likely reason?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::data-lake-bucket",
        "arn:aws:s3:::data-lake-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryResults"
      ],
      "Resource": "*"
    }
  ]
}
```

Question 306 (easy, multiple choice)

A data scientist is investigating an application that logs errors to Amazon CloudWatch Logs. The data scientist runs the CloudWatch Logs Insights query shown in the exhibit. The query returns no results, even though the data scientist knows errors have occurred. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
# CloudWatch Logs Insights query
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)
| sort @timestamp desc
```

Question 307 (medium, multiple choice)

A data scientist is analyzing a dataset with a target variable that is highly imbalanced (only 1% positive class). The goal is to build a binary classifier. During exploratory data analysis, which metric is MOST appropriate to evaluate the performance of different sampling strategies before model training?

Question 308 (hard, multiple choice)

A data engineer is performing exploratory data analysis on a dataset stored in Amazon S3 using AWS Glue DataBrew. The dataset contains a column 'age' with missing values. DataBrew's profile shows that the column has 5% missing values, a mean of 45, and a standard deviation of 15. Which imputation strategy should the engineer recommend to minimize bias if the missing data is Missing at Random (MAR)?

Question 309 (easy, multiple choice)

During exploratory data analysis, a data scientist notices that the Pearson correlation coefficient between two continuous variables is 0.85. What does this indicate?
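
As background for this question, the sketch below computes the Pearson correlation coefficient from its definition (covariance divided by the product of the standard deviations) in plain Python. The function name and the toy data are assumptions for illustration. The second example is included as a reminder that the coefficient measures linear association only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Perfectly linear toy data.
r_linear = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# y = x^2 on a symmetric range: fully dependent, but not linearly.
x_sym = [-2, -1, 0, 1, 2]
r_quadratic = pearson_r(x_sym, [v ** 2 for v in x_sym])

print(r_linear, r_quadratic)
```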

Question 310 (hard, multiple choice)

A machine learning engineer is analyzing a dataset that contains a categorical feature 'country' with 200 unique values. The target variable is binary. The engineer wants to use this feature in a linear model. Which encoding method should be applied during EDA to prepare the data for modeling, considering the high cardinality?

Question 311 (medium, multiple choice)

A data analyst is performing exploratory data analysis on a dataset with 100 features. The analyst wants to identify which features contribute most to the variance in the data. Which technique should the analyst use?

Question 312 (easy, multiple choice)

During EDA, a data scientist discovers that a numerical feature 'income' has a skewness of 3.5. Which transformation should the scientist apply to make the distribution more symmetric?

Question 313 (medium, multiple choice)

A data scientist is exploring a dataset with a large number of features. The scientist suspects that some features are redundant because they are highly correlated with each other. Which technique should the scientist use during EDA to identify and remove such redundant features?

Question 314 (easy, multiple choice)

In exploratory data analysis, a data scientist notices that the distribution of a continuous variable is bimodal. The scientist suspects that the two modes correspond to two different groups in the data. Which visualization is MOST appropriate to confirm this suspicion?

Question 315 (hard, multiple choice)

A data scientist is analyzing a dataset with 1 million rows and 50 features. The scientist wants to detect outliers in a numerical feature 'transaction_amount' which has a long right tail. The scientist suspects that outliers are due to data entry errors and should be removed. Which outlier detection method is MOST robust for this scenario?

Question 316 (medium, multi select)

A data scientist is performing EDA on a dataset with 500,000 rows and 20 columns. The dataset contains missing values in some columns. Which TWO approaches are appropriate for handling missing data during EDA? (Choose 2)

Question 317 (hard, multi select)

A data scientist is evaluating feature engineering options for a dataset containing a categorical variable 'education_level' with values: High School, Bachelor, Master, PhD. The target variable is continuous. Which THREE encoding methods are appropriate for this ordinal categorical variable? (Choose 3)

Question 318 (easy, multi select)

During EDA, a data scientist generates a pairplot of the dataset and observes that two features have a Pearson correlation coefficient of 0.95. Which TWO conclusions can the scientist draw from this observation? (Choose 2)

Question 319 (medium, multiple choice)

Refer to the exhibit. A data scientist plans to read this CSV file into memory for exploratory data analysis using pandas. The instance has 8 GB of RAM. What is the MOST likely issue the scientist will encounter?

Question 320 (hard, multiple choice)

Refer to the exhibit. A data scientist is setting up an IAM policy for EDA on a data lake. The scientist needs to run exploratory SQL queries using Amazon Athena and save results to a new S3 bucket. What is a critical missing permission in this policy?

Exhibit

Refer to the exhibit.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::data-lake-bucket",
                "arn:aws:s3:::data-lake-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetPartitions"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryResults"
            ],
            "Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/primary"
        }
    ]
}
```

Question 321 (easy, multiple choice)

Refer to the exhibit. A data scientist examines a sample of data and notices that all columns are numeric. The scientist wants to check for multicollinearity. Which statistic should be computed from this sample?

Question 322 (medium, multiple choice)

A data scientist is analyzing a dataset with 50 features and 10,000 samples. After generating a correlation matrix, they notice several pairs of features have correlation coefficients above 0.95. What should the data scientist do to prepare the data for linear regression?

Question 323 (hard, multiple choice)

A team is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a large dataset stored in S3. The dataset contains missing values, outliers, and categorical variables with high cardinality. The team wants to understand data distributions and relationships before modeling. Which combination of Data Wrangler features should they use?

Question 324 (easy, multiple choice)

A machine learning engineer notices that the target variable in a regression dataset has a long-tailed distribution. Which visualization technique is most appropriate to assess the distribution before applying a log transformation?

Question 325 (medium, multiple choice)

A data analyst is using Amazon Athena to query a partitioned dataset in S3. They notice that queries are scanning more data than expected. Which step should they take during exploratory data analysis to optimize query performance?

Question 326 (hard, multiple choice)

A data scientist is performing exploratory data analysis on a dataset with mixed data types: numerical, categorical, and text. They want to use Amazon SageMaker Data Wrangler to create a quick visualization dashboard. Which set of transformations should they apply in Data Wrangler to handle all data types appropriately?

Question 327 (easy, multiple choice)

During exploratory data analysis, a machine learning engineer finds that a dataset has a significant number of missing values in a categorical feature with 10 levels. Which approach should they take to handle these missing values before modeling?

Question 328 (medium, multiple choice)

A data scientist uses Amazon QuickSight to visualize a dataset and observes that a numerical feature has a skewness of 2.5 and a kurtosis of 8. Which transformation should they apply to make the distribution more normal?
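
For reference, the two statistics quoted in this question can be computed from central moments. The sketch below uses the population (biased) definitions in plain Python; the function name and the sample values are hypothetical. Note that some tools report excess kurtosis (kurtosis minus 3) instead of the raw value, so it is worth checking which convention a given report uses.

```python
def skewness_and_kurtosis(values):
    """Population skewness (m3 / m2^1.5) and non-excess kurtosis (m4 / m2^2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical right-skewed sample: one large value pulls the tail out.
skewness, kurtosis = skewness_and_kurtosis([1, 1, 1, 1, 10])
print(skewness, kurtosis)
```

A normal distribution has skewness 0 and kurtosis 3 under these definitions, which is the baseline the values in the question are being compared against.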

Question 329 (hard, multiple choice)

A data engineer is using AWS Glue to catalog a dataset with 200 columns. During exploratory data analysis, they run a crawler and then view the table schema in the AWS Glue Data Catalog. They notice that many columns are inferred as 'string' even though they contain numeric values. What is the most likely cause?

Question 330 (easy, multiple choice)

A data scientist wants to identify outliers in a dataset with 1,000 samples and 5 numerical features. Which technique is most appropriate for univariate outlier detection?

Question 331 (medium, multi select)

Which TWO actions are appropriate during exploratory data analysis when you discover that a categorical feature has 50 unique values (high cardinality)?

Question 332 (hard, multi select)

Which THREE techniques are commonly used to detect multicollinearity in a dataset during exploratory data analysis?

Question 333 (easy, multi select)

A data analyst is exploring a dataset with a binary target variable. Which TWO visualizations are most useful for understanding the relationship between a numerical feature and the target?

Question 334 (easy, multiple choice)

A data scientist is analyzing a dataset with 1,000 features. They suspect many features are redundant and want to reduce dimensionality before training a model. Which technique is most appropriate for identifying the most important features?

Question 335 (medium, multiple choice)

A company runs a real-time fraud detection system using Amazon SageMaker. The model is deployed as a SageMaker endpoint and returns predictions within milliseconds. Recently, the model's accuracy has degraded due to data drift. The data scientists want to monitor the model's performance continuously. What is the most effective way to detect data drift?

Question 336 (hard, multiple choice)

A data scientist is working on a binary classification problem with a highly imbalanced dataset (1% positive class). They have applied oversampling using SMOTE and trained a logistic regression model. The model achieves 99% accuracy on the test set, but the recall for the positive class is only 5%. What is the most likely cause?
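
The combination of 99% accuracy and 5% recall may look contradictory at first, so here is a small worked example of the arithmetic. The confusion-matrix counts are hypothetical, chosen only to match a 1% positive rate; the sketch shows how the two metrics can diverge and does not diagnose the cause in the scenario.

```python
def accuracy_and_recall(tp, fp, tn, fn):
    """Compute accuracy and positive-class recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    return accuracy, recall


# Hypothetical test set: 10,000 samples, 100 of them positive (1%).
# The model finds only 5 of the 100 positives and raises 5 false alarms.
accuracy, recall = accuracy_and_recall(tp=5, fp=5, tn=9895, fn=95)
print(f"accuracy={accuracy:.2%}, recall={recall:.2%}")
```

Because 99% of the samples are negative, the true negatives dominate the accuracy figure, and missing almost every positive barely moves it.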

Question 337 (easy, multiple choice)

A data scientist needs to analyze a dataset stored in Amazon S3 as CSV files. The dataset contains 100 columns, and the data scientist wants to quickly understand the distribution of each column, including missing values, data types, and basic statistics. Which AWS service is best suited for this task?

Question 338 (medium, multiple choice)

A data scientist is using Amazon SageMaker to train a model. The training dataset contains missing values in several features. The data scientist wants to impute missing values using the median of each feature. Which approach is most appropriate?

Question 339 (hard, multiple choice)

A company has a large dataset of customer transactions stored in Amazon Redshift. A data scientist wants to perform EDA using Python libraries like pandas and matplotlib. The dataset is too large to fit into memory on a single EC2 instance. What is the most efficient approach?

Question 340 (easy, multiple choice)

During exploratory data analysis, a data scientist notices that a feature has a highly skewed distribution. Which transformation is most likely to make the distribution approximately normal?

Question 341 (medium, multiple choice)

A data scientist is building a regression model to predict house prices. The dataset includes a feature 'zip_code' with 1,000 unique values. What is the best way to handle this categorical feature in the exploratory data analysis phase?

Question 342 (hard, multiple choice)

A data scientist is analyzing a dataset for a binary classification problem. The dataset has 10,000 samples and 200 features. After splitting into training (80%) and test (20%), the data scientist trains a decision tree classifier and achieves 100% accuracy on the training set but only 55% on the test set. Which step should the data scientist take first to address this issue?

Question 343 (easy, multi select)

Which TWO actions are appropriate when handling missing data in a dataset for machine learning? (Select TWO.)

Question 344 (medium, multi select)

Which THREE techniques are commonly used to detect outliers in a dataset? (Select THREE.)

Question 345 (hard, multi select)

Which TWO statements about data leakage in machine learning are correct? (Select TWO.)

Question 346 (easy, multiple choice)

A data scientist needs to understand the distribution of a continuous variable in a large dataset stored in Amazon S3. Which AWS service is most appropriate for quickly generating summary statistics and visualizations?

Question 347 (medium, multiple choice)

A data scientist is analyzing a dataset with missing values. The missing data is not random and is correlated with other features. Which imputation method is most appropriate to minimize bias?

Question 348 (hard, multiple choice)

A data scientist is performing feature engineering on a dataset with high cardinality categorical features (e.g., ZIP codes with thousands of unique values). Which technique is most effective for reducing dimensionality while preserving predictive power?

Question 349 (medium, multiple choice)

A data scientist is using Amazon SageMaker Data Wrangler to explore a dataset. They notice that a feature has a very high correlation (0.95) with the target variable. What should they do to avoid overfitting?

Question 350 (easy, multiple choice)

A data scientist needs to detect outliers in a dataset with multiple features that follow different distributions. Which method is most robust for multivariate outlier detection?

Question 351 (medium, multiple choice)

A data scientist is exploring a dataset containing customer transactions. They want to create a feature that captures the average purchase amount per customer over the last 30 days. Which approach is most efficient in Amazon SageMaker Processing?
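
To make the feature in this scenario concrete, the sketch below computes an average purchase amount per customer over a trailing 30-day window in plain Python. The tuple layout, the helper name, and the reference date are hypothetical, and the sketch does not address how the computation should be run in SageMaker Processing.

```python
from collections import defaultdict
from datetime import date, timedelta


def avg_purchase_last_30_days(transactions, as_of):
    """Average amount per customer for transactions in the 30 days ending at as_of."""
    cutoff = as_of - timedelta(days=30)
    totals = defaultdict(lambda: [0.0, 0])  # customer_id -> [sum, count]
    for customer_id, tx_date, amount in transactions:
        if cutoff < tx_date <= as_of:  # keep only transactions inside the window
            totals[customer_id][0] += amount
            totals[customer_id][1] += 1
    return {cid: total / count for cid, (total, count) in totals.items()}


# Hypothetical (customer_id, date, amount) rows.
transactions = [
    ("c1", date(2024, 3, 1), 20.0),
    ("c1", date(2024, 3, 20), 40.0),
    ("c1", date(2024, 1, 5), 500.0),  # outside the window, ignored
    ("c2", date(2024, 3, 25), 10.0),
]
features = avg_purchase_last_30_days(transactions, as_of=date(2024, 3, 30))
print(features)
```

Customers with no purchases in the window are simply absent from the result; whether to fill them with zero or leave them missing is a separate modeling decision.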

Question 352 (hard, multiple choice)

A data scientist is working with a dataset that has imbalanced classes (1% positive). They want to explore the data before modeling. Which visualization technique is most appropriate to understand the distribution of features with respect to the target class?

Question 353 (easy, multiple choice)

A data scientist wants to understand the statistical relationship between two categorical variables in a dataset. Which test is most appropriate?

Question 354 (medium, multiple choice)

A data scientist is analyzing a dataset with a time series component. They suspect there is a weekly seasonality. Which technique should they use to confirm this?

Question 355 (easy, multi select)

A data scientist is performing exploratory data analysis on a dataset with missing values. Which TWO approaches are appropriate for handling missing data in a way that retains as much data as possible?

Question 356 (medium, multi select)

A data scientist is exploring a dataset with skewed numerical features. Which THREE transformations can help make the features more normally distributed?

Question 357 (hard, multi select)

A data scientist is analyzing a dataset with high multicollinearity. Which TWO techniques can help identify and address multicollinearity?

Question 358 (medium, multiple choice)

A data scientist ran an AWS Glue ETL job that failed with the error shown. What is the most likely cause?

Question 359 (hard, multiple choice)

A data scientist queried an Athena table and got only one row back, but the CSV file is 1 MB. What is the most likely reason?

Question 360hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working on a predictive maintenance project for a manufacturing company. Sensor data is collected every second from 100 machines and stored in an Amazon S3 bucket as Parquet files, partitioned by machine_id and date. The dataset is massive (10 TB) and contains over 2000 features per machine. The data scientist needs to perform exploratory data analysis to identify which features are most predictive of machine failure. They have access to Amazon SageMaker Studio with a SageMaker Data Wrangler flow. The initial data exploration is taking too long due to the volume of data. The data scientist wants to speed up the analysis without losing accuracy in feature selection. Which course of action is most appropriate?

Question 361easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist wants to understand the distribution of a continuous feature before training a model. Which visualization is most appropriate?

Question 362mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist runs a logistic regression and obtains a model with 95% accuracy on the training set. However, the model performs poorly on the test set. Which exploratory data analysis step should have been performed to identify this issue?

Question 363hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A company's dataset contains a feature 'zip_code' with 500 unique values. The data scientist wants to use this feature in a linear model. Which EDA step is most important before feature engineering?

Question 364easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with many features and wants to identify which features are most correlated with the target variable. Which EDA technique should be used?

Question 365mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist uses Amazon SageMaker Data Wrangler to perform EDA on a large dataset stored in S3. The data scientist notices that the target variable is highly imbalanced. Which SageMaker Data Wrangler transform can be used to address this during data preparation?

Question 366hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a time series dataset of daily sales. The data scientist observes a pattern that repeats every 7 days. Which characteristic of the time series is being observed?

Question 367easymultiple choice

Read the full Exploratory Data Analysis explanation →

During EDA, a data scientist finds that a feature has a skewness value of 2.5. What does this indicate about the data distribution?
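
To make the skewness statistic in this question concrete, the following is a minimal sketch of how a sample skewness value can be computed by hand. The helper name `sample_skewness` and both sample lists are hypothetical, and the formula used is the Fisher-Pearson moment coefficient, which is an assumption about which definition the question intends.

```python
from statistics import fmean


def sample_skewness(values):
    """Fisher-Pearson skewness: third central moment / second central moment ** 1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no spread, so report zero skew
    return m3 / m2 ** 1.5


# Hypothetical samples: one with a single large value on the high end, one symmetric.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 50]
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_tailed))  # positive: the long tail is on the right
print(sample_skewness(symmetric))     # zero: deviations cancel out
```

A positive value arises when the large deviations sit above the mean; a negative value would arise from a long left tail.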

Question 368mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist uses SageMaker Studio to run EDA on a dataset with 500 features. The goal is to reduce dimensionality before modeling. Which EDA technique should the data scientist use to understand the variance explained by each feature?

Question 369hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset containing text reviews. The goal is to build a sentiment analysis model. Which EDA step is most critical before feature extraction?

Question 370mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with both numeric and categorical features. Which TWO techniques are appropriate for visualizing the relationship between a numeric feature and a binary categorical target?

Question 371hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with missing values. Which THREE methods are appropriate for handling missing data during EDA and preprocessing?

Question 372easymulti select

Read the full Exploratory Data Analysis explanation →

A data scientist wants to identify outliers in a dataset. Which TWO techniques are commonly used for outlier detection during EDA?
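
As an illustration of one widely used rule, the sketch below flags values outside the interquartile-range (IQR) fences. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample values are assumptions for illustration; it is not presented as the expected answer to the question.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one extreme value.
measurements = [10, 12, 11, 13, 12, 11, 14, 10, 13, 1000]
print(iqr_outliers(measurements))
```

Because the fences are built from quartiles rather than the mean and standard deviation, a single extreme value does little to move them.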

Question 373mediummultiple choice

Read the full Exploratory Data Analysis explanation →

Refer to the exhibit. A data scientist is unable to read a CSV file from the S3 bucket 'my-bucket' using SageMaker. The IAM policy attached to the SageMaker execution role is shown. What is the most likely cause of the failure?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket/*",
        "arn:aws:s3:::my-bucket"
      ]
    }
  ]
}

Question 374hardmultiple choice

Read the full Exploratory Data Analysis explanation →

Refer to the exhibit. A data scientist runs the AWS CLI command shown to explore the contents of an S3 bucket. The command returns an empty array. However, the data scientist knows there are objects larger than 1000 bytes in the bucket. What is the most likely reason for the empty result?

Exhibit (CLI command not reproduced)

Question 375mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working on a project to predict customer churn for a telecom company. The dataset includes 50,000 records with 20 features, including customer demographics, account information, and service usage. The data scientist uses Amazon SageMaker Studio and loads the data into a pandas DataFrame. During EDA, the data scientist notices that the target variable 'churn' has only 10% positive cases. Additionally, several features have missing values: 'income' has 5% missing, 'age' has 2% missing, and 'total_charges' has 1% missing. The data scientist also observes that 'income' is highly skewed with a long right tail, and 'age' is moderately skewed. The data scientist wants to handle missing values and prepare the data for modeling. Which course of action is most appropriate?

Question 376hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is building a model to predict housing prices using a dataset with 100,000 records and 50 features. The features include 'sqft_living', 'sqft_lot', 'bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', etc. The data scientist uses Amazon SageMaker Data Wrangler for EDA. Upon reviewing the data, the data scientist finds that 'sqft_living' has a correlation of 0.7 with 'sqft_above' (square footage above ground) and 0.6 with 'sqft_basement'. Also, 'grade' (overall grade of the house) is highly correlated with 'condition' (0.8). The target variable 'price' is right-skewed. The data scientist plans to use a linear regression model. Which set of actions should the data scientist take to improve model performance?

Question 377easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset of online retail transactions. The dataset contains 500,000 rows and 10 columns: 'TransactionID', 'CustomerID', 'ProductID', 'Quantity', 'UnitPrice', 'TransactionDate', 'PaymentMethod', 'ShippingAddress', 'Country', and 'TotalAmount'. The data scientist loads the data into a SageMaker notebook and performs initial EDA. The data scientist finds that 'UnitPrice' has a range from $0.01 to $10,000, with a mean of $50 and a median of $20. 'Quantity' ranges from -10 to 100, with negative values indicating returns. 'TotalAmount' is calculated as Quantity * UnitPrice. The data scientist also notices that 2% of the 'CustomerID' values are missing, and 1% of 'ProductID' values are missing. There are no missing values in other columns. The data scientist wants to clean the data and prepare it for customer segmentation. Which course of action is most appropriate?
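
The checks described in this scenario can be sketched as a small row-level validation. The column names come from the question; the function name `quality_flags`, the flag labels, the rounding tolerance, and the sample rows are hypothetical. The sketch only surfaces the issues and does not decide how to treat them.

```python
from math import isclose


def quality_flags(row):
    """Return a list of data-quality flags for one transaction row (a dict)."""
    flags = []
    if row.get("CustomerID") is None:
        flags.append("missing_customer_id")
    if row.get("ProductID") is None:
        flags.append("missing_product_id")
    if row["Quantity"] < 0:
        flags.append("return")  # negative quantities indicate returns
    expected = row["Quantity"] * row["UnitPrice"]
    # Assumed tolerance of half a cent to absorb floating-point rounding.
    if not isclose(row["TotalAmount"], expected, abs_tol=0.005):
        flags.append("total_mismatch")
    return flags


# Hypothetical rows covering each case.
rows = [
    {"CustomerID": "C1", "ProductID": "P1", "Quantity": 2, "UnitPrice": 20.0, "TotalAmount": 40.0},
    {"CustomerID": None, "ProductID": "P2", "Quantity": 1, "UnitPrice": 5.0, "TotalAmount": 5.0},
    {"CustomerID": "C2", "ProductID": "P3", "Quantity": -10, "UnitPrice": 0.01, "TotalAmount": -0.1},
    {"CustomerID": "C3", "ProductID": None, "Quantity": 3, "UnitPrice": 50.0, "TotalAmount": 100.0},
]
for row in rows:
    print(quality_flags(row))
```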

Question 378mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with 500 features and notices that many features are highly correlated. Which AWS service can be used to automatically reduce dimensionality by identifying and removing redundant features before training a model?

Question 379easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a dataset stored in Amazon S3 using Amazon SageMaker Studio. The dataset has missing values in several columns. Which approach is the MOST efficient way to handle missing values within SageMaker Studio?

Question 380hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a target variable that is highly imbalanced (99% negative class, 1% positive class). The dataset has 10 million rows. The goal is to train a binary classifier. Which technique should be applied during exploratory data analysis to best address the imbalance?

Question 381mediummultiple choice

Read the full Exploratory Data Analysis explanation →

During exploratory data analysis, a data scientist notices that the distribution of a continuous feature is heavily right-skewed. Which transformation should be applied to make the distribution more symmetric for linear regression?

Question 382easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with 1,000 features. The goal is to select the most important features for a regression model. Which technique can be used to rank feature importance quickly?

Question 383hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a timestamp column and several numeric measurements. The goal is to detect seasonality and trends. Which AWS service can be used directly from SageMaker Studio to perform this analysis without writing code?

Question 384mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset that contains both numerical and categorical features. During EDA, they want to understand the relationship between a categorical feature with 10 unique values and the target variable. Which visualization is most appropriate?

Question 385easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist needs to profile a large dataset in Amazon S3 to understand its schema, data types, and quality. Which AWS service can automatically generate a data profile with statistics and visualizations?

Question 386hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with 1 million rows and 50 features. The dataset includes a column 'user_id' with unique identifiers, a column 'event_date' with timestamps, and other columns. Which TWO actions should the data scientist take to understand data quality issues?

Question 387mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a binary target variable. The dataset has 50,000 rows and 200 features. The data scientist wants to identify which features are most predictive. Which TWO methods are appropriate for feature selection during EDA?

Question 388hardmulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset with 10 million rows. The dataset has a column 'income' with outliers. The data scientist wants to detect and handle outliers. Which THREE approaches are appropriate?

Question 389mediummultiple choice

Read the full Exploratory Data Analysis explanation →

Refer to the exhibit. A data scientist is using an IAM role with this policy to run a SageMaker processing job that reads data from S3. The job fails with an access error. What is the most likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateProcessingJob"
      ],
      "Resource": "*"
    }
  ]
}

Question 390hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset stored in Amazon S3 (100 GB, CSV format) using Amazon SageMaker Studio. The dataset contains 500 columns and 10 million rows. The data scientist wants to understand the distribution of each column, detect missing values, and identify outliers. However, the SageMaker Studio notebook instance runs out of memory when loading the entire dataset into a pandas DataFrame. The data scientist needs to complete the EDA efficiently without modifying the source data. What should the data scientist do?

Question 391mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a dataset containing customer transaction records. The dataset includes columns: 'transaction_id', 'customer_id', 'transaction_amount', 'transaction_date', and 'product_category'. The data scientist wants to check for duplicate transactions and identify any suspicious patterns, such as multiple transactions from the same customer on the same day with the same amount. The dataset has 5 million rows. The data scientist is using a SageMaker Studio notebook with an ml.t3.medium instance. The data is stored in S3. What is the most efficient way to perform this analysis?
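
The two checks named in this scenario, duplicate transaction IDs and repeated same-customer, same-day, same-amount transactions, amount to grouping and counting. Below is a small in-memory sketch of that logic only. The function names and sample records are hypothetical, and it makes no claim about which service or engine should run the check at 5 million rows.

```python
from collections import Counter


def duplicate_ids(transactions):
    """Transaction IDs that appear more than once."""
    id_counts = Counter(t["transaction_id"] for t in transactions)
    return [tid for tid, n in id_counts.items() if n > 1]


def repeated_same_day_amounts(transactions):
    """(customer, date, amount) combinations that occur more than once."""
    key_counts = Counter(
        (t["customer_id"], t["transaction_date"], t["transaction_amount"])
        for t in transactions
    )
    return {key: n for key, n in key_counts.items() if n > 1}


# Hypothetical records.
transactions = [
    {"transaction_id": "T1", "customer_id": "C1", "transaction_date": "2024-01-05", "transaction_amount": 19.99},
    {"transaction_id": "T2", "customer_id": "C1", "transaction_date": "2024-01-05", "transaction_amount": 19.99},
    {"transaction_id": "T3", "customer_id": "C2", "transaction_date": "2024-01-05", "transaction_amount": 19.99},
    {"transaction_id": "T3", "customer_id": "C2", "transaction_date": "2024-01-06", "transaction_amount": 5.00},
]
print(duplicate_ids(transactions))
print(repeated_same_day_amounts(transactions))
```

The same group-and-count pattern carries over to SQL (`GROUP BY ... HAVING COUNT(*) > 1`) or a DataFrame `groupby`.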

Question 392easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is starting a new machine learning project and needs to understand the dataset. The dataset is stored as CSV files in Amazon S3, with a total size of 50 GB. The data scientist wants to quickly get summary statistics (count, mean, standard deviation, min, max) for each numerical column, and also check for missing values. The data scientist has access to SageMaker Studio. What is the most efficient way to achieve this?

Question 393easymulti select

Read the full Exploratory Data Analysis explanation →

Which TWO of the following are common techniques for detecting outliers in a dataset?

Question 394mediummulti select

Read the full Exploratory Data Analysis explanation →

A data scientist is performing exploratory data analysis on a dataset with 100 features. They want to identify which features are most correlated with the target variable. Which THREE methods are appropriate for this task?

Question 395hardmulti select

Read the full Exploratory Data Analysis explanation →

A data engineer is analyzing a large dataset stored in Amazon S3 using AWS Glue and Amazon Athena. They notice that queries against a table with many small files are slow. Which TWO actions can improve query performance?

Question 396easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working on a project to predict customer churn. The dataset contains 50,000 rows and 20 features, including categorical variables like 'Region' (10 categories) and 'SubscriptionType' (5 categories). The target variable is binary (churn or not). During exploratory data analysis, they plot the distribution of each feature and notice that 'Region' has a highly imbalanced distribution: one region accounts for 80% of the data. Which of the following is the most appropriate next step?

Question 397easymultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning team is analyzing a dataset with numerical features. They compute the pairwise correlation matrix and find that two features, 'X1' and 'X2', have a correlation coefficient of 0.98. The team plans to train a linear regression model. Which of the following actions should the team take to avoid multicollinearity issues?

Question 398easymultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is performing EDA on a dataset containing timestamps of user logins. They want to understand daily login patterns. The timestamp column is in Unix epoch format (integer). Which of the following is the most appropriate transformation to extract day-of-week patterns?
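
For reference, converting an integer Unix epoch value into a day of the week can be sketched as follows. The sketch assumes the timestamps are in seconds (not milliseconds) and interprets them in UTC; the function name `day_of_week` and the sample login times are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_of_week(epoch_seconds):
    """Map a Unix epoch timestamp (seconds, UTC) to a weekday name."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return DAY_NAMES[moment.weekday()]  # weekday(): Monday is 0


# Hypothetical login timestamps, starting from 1970-01-01 (a Thursday).
logins = [0, 86400, 86400 + 3600, 4 * 86400]
logins_per_day = Counter(day_of_week(t) for t in logins)
print(logins_per_day)
```

If the users are concentrated in one region, converting to the local time zone before extracting the weekday would avoid logins near midnight being counted on the wrong day.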

Question 399mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is working with a dataset that contains a 'Price' column. After plotting a histogram, they observe that the distribution is right-skewed with many extreme high values. They plan to use a linear model that assumes normally distributed errors. Which of the following transformations should they apply to the 'Price' column to make it more normally distributed?

Question 400mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data engineer is exploring a dataset with 1 million rows and 50 features. They notice that some features have missing values. The 'Age' column has 5% missingness, and 'Income' has 20% missingness. The target variable is 'LoanDefault' (binary). The engineer wants to impute missing values. Which of the following strategies is most appropriate?

Question 401mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is analyzing a dataset with a binary target variable. They compute the correlation matrix and find that all features have correlations between -0.1 and 0.1 with the target. They suspect that the relationship might be non-linear. Which of the following techniques should they use to detect non-linear relationships?

Question 402mediummultiple choice

Read the full Exploratory Data Analysis explanation →

A machine learning engineer is examining a dataset containing text reviews. They want to convert the text into numerical features for a model. During EDA, they notice that the word 'the' appears in almost every review, while words like 'excellent' appear rarely. Which of the following techniques should they use to reduce the impact of very common words?

Question 403hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is performing EDA on a high-dimensional dataset with 500 features. They want to visualize the data in 2D to check for clusters. They first apply PCA and get a 2D projection that shows no clear structure. They suspect that the data lies on a non-linear manifold. Which of the following techniques should they try next?

Question 404hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data analyst is examining a dataset with a target variable that has three classes: A, B, C. They plot the distribution of a feature 'X' for each class and notice that for classes A and B, the distributions are bimodal, while for class C it is unimodal. They want to assess whether feature 'X' is useful for separating the classes. Which of the following metrics should they compute to quantify the separability?

Question 405hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A team is building a model to predict house prices. They have a dataset with features like 'SquareFootage', 'Bedrooms', 'YearBuilt', and 'Neighborhood'. They notice that 'SquareFootage' has a few extreme values (e.g., 50,000 sq ft) that are likely data entry errors. They want to handle these outliers without losing all the data. Which of the following approaches is most robust?

Question 406hardmultiple choice

Read the full Exploratory Data Analysis explanation →

A data scientist is exploring a dataset with 200 features. They compute the pairwise correlation matrix and notice that many features have correlations above 0.95. They want to reduce redundancy before modeling. Which of the following techniques is most appropriate for identifying and removing highly correlated features?
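
The step of identifying highly correlated feature pairs can be sketched as a thresholded scan of pairwise Pearson correlations. The function names, the 0.95 threshold, and the three sample columns are hypothetical. The sketch only lists candidate pairs and leaves open which member of each pair to drop.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.95):
    """List (name_a, name_b, r) for every pair whose |r| exceeds the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical columns: f2 is roughly 2 * f1, f3 is unrelated.
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "f3": [5, 1, 4, 2, 3],
}
for name_a, name_b, r in highly_correlated_pairs(features):
    print(name_a, name_b, round(r, 3))
```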