CCNA Exploratory Data Analysis Questions

75 of 406 questions · Page 3/6 · Exploratory Data Analysis · Answers revealed

151
MCQeasy

A data scientist is working on a project to predict customer churn. The dataset contains 50,000 rows and 20 features, including categorical variables like 'Region' (10 categories) and 'SubscriptionType' (5 categories). The target variable is binary (churn or not). During exploratory data analysis, they plot the distribution of each feature and notice that 'Region' has a highly imbalanced distribution: one region accounts for 80% of the data. Which of the following is the most appropriate next step?

A.Apply one-hot encoding to the 'Region' feature.
B.Remove the 'Region' feature from the dataset.
C.Group rare categories into an 'Other' category.
D.Oversample the minority classes in the target variable.
AnswerC

This reduces sparsity and helps the model learn patterns for rare categories.

Why this answer

Option C is correct because a heavily imbalanced categorical feature can cause the model to ignore rare categories; grouping rare levels into an 'Other' category can improve model performance. Option B is wrong because removing the feature could discard useful information. Option A is wrong because one-hot encoding does not address the imbalance.

Option D is wrong because oversampling addresses target imbalance, not feature imbalance.
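The grouping step can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function name `group_rare_categories`, the 5% `min_share` cutoff, and the toy region names are all assumptions made for the example.

```python
from collections import Counter

def group_rare_categories(values, min_share=0.05, other_label="Other"):
    """Replace categories whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    keep = {cat for cat, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other_label for v in values]

# Hypothetical 'Region' column: one dominant region plus several rare ones.
regions = ["North"] * 80 + ["South"] * 10 + ["East"] * 4 + ["West"] * 3 + ["Central"] * 3
grouped = group_rare_categories(regions, min_share=0.05)
print(Counter(grouped))
```

The cutoff is a judgment call; in practice it would be chosen after looking at the category frequencies, and the same mapping would be reused on any new data.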

152
MCQmedium

A machine learning team is analyzing a dataset with a target variable that is highly imbalanced (99% negative class, 1% positive class). They want to understand the distribution and relationships before modeling. Which exploratory data analysis technique is most appropriate to visualize the imbalance and guide resampling strategy?

A.Confusion matrix on a sample of the data
B.Scatterplot matrix of all features colored by class
C.Box plots of each feature grouped by the target class
D.Bar chart of class frequencies and a correlation heatmap
AnswerD

Bar chart shows imbalance clearly; correlation heatmap helps identify features related to the target.

Why this answer

Option D is correct because a bar chart of class counts clearly shows the imbalance, and a correlation heatmap helps understand feature relationships with the target. Option B is wrong because a scatterplot matrix is for continuous variables, not for a binary target. Option C is wrong because box plots show the distribution of continuous features by class, but not the imbalance itself.

Option A is wrong because a confusion matrix is for model evaluation, not for initial data exploration.

153
MCQhard

A data scientist is performing feature engineering on a dataset with high cardinality categorical features (e.g., ZIP codes with thousands of unique values). Which technique is most effective for reducing dimensionality while preserving predictive power?

A.Hash encoding
B.One-hot encoding
C.Target encoding
D.Label encoding
AnswerC

Correct: Target encoding reduces cardinality by using target statistics, preserving predictive power.

Why this answer

Option C is correct because target encoding (mean encoding) replaces each category with the mean of the target variable for that category, which captures predictive signal in a single numeric column. Option B is wrong because one-hot encoding creates one column per category, leading to high dimensionality. Option D is wrong because label encoding implies an ordering that may not exist.

Option A is wrong because hashing can cause collisions and loss of information.
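A minimal sketch of mean (target) encoding follows. The smoothing toward the global mean is an added assumption, not something the question specifies; it is a common way to keep rare categories from getting extreme values. The function names and ZIP codes are hypothetical, and the mapping is assumed to be fitted on training rows only so the target does not leak into validation data.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed mean-target value per category from training rows only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean; rare categories shrink more.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean

def apply_target_encoding(categories, mapping, default):
    """Encode categories; unseen categories fall back to the default value."""
    return [mapping.get(cat, default) for cat in categories]

# Hypothetical training data: ZIP code and a binary target.
train_zip = ["10001", "10001", "10001", "94105", "94105", "60601"]
train_y = [1, 1, 0, 0, 0, 1]
mapping, global_mean = fit_target_encoding(train_zip, train_y, smoothing=2.0)
encoded = apply_target_encoding(["10001", "94105", "99999"], mapping, default=global_mean)
print(encoded)
```

Whatever the number of distinct ZIP codes, the output is one numeric column, which is the dimensionality benefit the explanation refers to.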

154
MCQeasy

During exploratory data analysis, a machine learning engineer finds that a dataset has a significant number of missing values in a categorical feature with 10 levels. Which approach should they take to handle these missing values before modeling?

A.Impute missing values with the mean of the feature.
B.Create a new category labeled 'Missing' for missing values.
C.Drop all rows with missing values.
D.Impute missing values with the mode of the feature.
AnswerB

Preserves the missingness pattern and avoids bias.

Why this answer

Option B is correct because creating a separate 'Missing' category preserves the missingness pattern and avoids data loss or bias from imputation for categorical features. Option C is incorrect because dropping rows with missing values may discard valuable data. Option A is incorrect because mean imputation is for numerical features, not categorical.

Option D is incorrect because mode imputation may introduce bias if missingness is not random.

155
MCQeasy

A data scientist is analyzing a dataset of online retail transactions. The dataset contains 500,000 rows and 10 columns: 'TransactionID', 'CustomerID', 'ProductID', 'Quantity', 'UnitPrice', 'TransactionDate', 'PaymentMethod', 'ShippingAddress', 'Country', and 'TotalAmount'. The data scientist loads the data into a SageMaker notebook and performs initial EDA. The data scientist finds that 'UnitPrice' has a range from $0.01 to $10,000, with a mean of $50 and a median of $20. 'Quantity' ranges from -10 to 100, with negative values indicating returns. 'TotalAmount' is calculated as Quantity * UnitPrice. The data scientist also notices that 2% of the 'CustomerID' values are missing, and 1% of 'ProductID' values are missing. There are no missing values in other columns. The data scientist wants to clean the data and prepare it for customer segmentation. Which course of action is most appropriate?

A.Impute missing 'CustomerID' with the mean of 'CustomerID' and missing 'ProductID' with the mode.
B.Remove all rows with any missing values.
C.Keep negative 'Quantity' and treat them as errors; replace them with the median of positive quantities.
D.Remove rows with negative 'Quantity' to focus on purchases. Impute missing 'CustomerID' and 'ProductID' with a placeholder such as 'Unknown'.
AnswerD

Negative quantities are returns; imputing with 'Unknown' preserves rows.

Why this answer

Option D is correct because negative quantities are returns and should be removed if the goal is to model purchase behavior, and missing CustomerID and ProductID can be imputed with 'Unknown' to avoid data loss. Option A is wrong because mean imputation for CustomerID is not valid (it is a categorical identifier). Option B is wrong because removing all rows with any missing values would discard up to 3% of the data.

Option C is wrong because negative quantities are meaningful as returns, not errors.

156
MCQmedium

A machine learning engineer is analyzing a dataset with a mix of categorical and numerical features. The engineer wants to understand the correlation between categorical features and the target variable. Which statistical test is most appropriate for measuring association between a categorical feature and a binary target?

A.Pearson correlation coefficient
B.ANOVA (Analysis of Variance)
C.Chi-squared test of independence
D.Mutual information
AnswerC

Chi-squared test tests association between two categorical variables.

Why this answer

Option C is correct because the Chi-squared test of independence is used to determine if there is a significant association between two categorical variables, which is applicable here. Option A is wrong because Pearson correlation is for continuous variables. Option B is wrong because ANOVA is for comparing means across groups, but assumes continuous target.

Option D is wrong because Mutual Information can be used but is not a statistical test with a p-value.
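As a worked illustration of the chi-squared test of independence, the sketch below computes the Pearson statistic for a contingency table by hand. The counts and the 'plan type' feature are made up for the example; in practice a statistics library would also return the p-value.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts. Rows: plan type (Basic, Premium); columns: target (stayed, churned).
observed = [[30, 20], [45, 5]]
stat, dof = chi_squared_statistic(observed)
print(stat, dof)
```

For one degree of freedom the 5% critical value is about 3.84, so a statistic well above that suggests the feature and the target are not independent.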

157
MCQeasy

A data scientist is starting a new machine learning project and needs to understand the dataset. The dataset is stored as CSV files in Amazon S3, with a total size of 50 GB. The data scientist wants to quickly get summary statistics (count, mean, standard deviation, min, max) for each numerical column, and also check for missing values. The data scientist has access to SageMaker Studio. What is the most efficient way to achieve this?

A.Use AWS Glue Crawler to infer schema and then query with Athena.
B.Write a PySpark script in a SageMaker notebook to compute statistics.
C.Load a sample into Amazon QuickSight and use SPICE to compute statistics.
D.Use SageMaker Data Wrangler to import the data and generate a data quality report.
AnswerD

Data Wrangler provides summary statistics and missing value analysis.

Why this answer

SageMaker Data Wrangler can profile the data without writing code. Option A is wrong because Glue Crawler creates a schema but not statistics. Option B is wrong because writing a Spark job is overkill.

Option C is wrong because QuickSight requires data import.

158
MCQhard

A data scientist examines a dataset with 100 features and suspects that some features are redundant due to high pairwise correlations. Which EDA technique should the scientist use to systematically identify groups of highly correlated features?

A.Generate a correlation matrix and visualize it as a heatmap.
B.Plot histograms for each feature.
C.Create scatter plots for each pair of features.
D.Use box plots to identify outliers.
AnswerA

Heatmap of correlation matrix quickly reveals high pairwise correlations.

Why this answer

Option A is correct because a correlation matrix heatmap visually identifies high correlations. Option B is wrong because histograms show univariate distributions. Option C is wrong because scatter plots are for pairs, not systematic.

Option D is wrong because box plots show outliers.

159
MCQhard

During EDA, a data scientist plots the distribution of a feature and sees a bimodal pattern. What does this likely indicate?

A.The data may contain two distinct groups.
B.The feature has missing values.
C.The feature contains outliers.
D.The feature needs to be standardized.
AnswerA

Bimodal suggests mixture of two populations.

Why this answer

Option A is correct because a bimodal distribution often indicates two underlying subpopulations. Option B is wrong because missing values cause spikes, not two modes. Option C is wrong because outliers cause tails, not two peaks.

Option D is wrong because scaling does not create bimodality.

160
Multi-Selectmedium

A data scientist is performing EDA on a dataset with 100 features. They want to reduce dimensionality by removing highly correlated features. Which TWO approaches are appropriate? (Choose TWO.)

Select 2 answers
A.Use feature importance from a random forest to select top features.
B.Remove features with low variance using VarianceThreshold.
C.Compute a correlation matrix and remove one feature from each pair with correlation >0.95.
D.Use Principal Component Analysis (PCA) and select components that explain 95% of variance.
E.Apply L1 regularization (Lasso) during model training to zero out coefficients of correlated features.
AnswersC, D

This directly removes redundant features.

Why this answer

Options C and D are correct. Option C: Removing one feature from each pair with correlation >0.95 directly reduces redundancy. Option D: Using PCA creates uncorrelated components.

Option E is wrong because L1 regularization is a modeling technique, not EDA. Option A is wrong because feature importance from tree-based models is not specifically for removing correlated features. Option B is wrong because a variance threshold removes near-constant features, not correlated ones.
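The correlation-threshold approach can be sketched as follows. This is one simple greedy variant, assumed for illustration: features are visited in order and a feature is dropped when it correlates above the threshold with one already kept. The feature names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.95):
    """Keep features in order, skipping any with |r| above threshold against a kept one."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: the same height in two units, plus an unrelated column.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40, 22, 35, 51, 28],
}
kept = drop_correlated(features)
print(kept)
```

Which member of a correlated pair to keep is arbitrary here (the first one seen); a real workflow might prefer the feature with fewer missing values or a clearer meaning.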

161
MCQhard

During exploratory data analysis on a dataset with 1 million rows, a data scientist notices that the distribution of the target variable is highly imbalanced (99% class A, 1% class B). Which technique should be applied to address this imbalance before model training?

A.Randomly undersample the majority class to match the minority class size
B.Apply standard scaling to all features
C.Use PCA to reduce dimensionality and oversample in principal component space
D.Use SMOTE to generate synthetic samples for the minority class
AnswerD

SMOTE creates synthetic examples to balance classes.

Why this answer

Option D is correct because SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the minority class, balancing the dataset. Option A is wrong because random undersampling can discard important data. Option B is wrong because scaling does not address imbalance.

Option C is wrong because PCA does not fix imbalance.

162
MCQmedium

A data scientist is exploring a dataset containing customer transactions. They want to create a feature that captures the average purchase amount per customer over the last 30 days. Which approach is most efficient in Amazon SageMaker Processing?

A.Use Amazon Athena SQL query with GROUP BY
B.Use PySpark with window functions in SageMaker Processing
C.Use a Python script with a for loop to calculate per customer
D.Use pandas groupby and rolling functions
AnswerB

Correct: PySpark window functions are optimized for large-scale grouped rolling aggregates.

Why this answer

Option B is correct because using PySpark in SageMaker Processing with window functions is efficient for grouped time-series aggregations. Option C is wrong because iterating over rows is inefficient in Python. Option A is wrong because SQL in Athena may be simpler but requires moving data.

Option D is wrong because pandas may not scale to large datasets.

163
MCQeasy

A data analyst is exploring a dataset with a target variable that is highly imbalanced. The minority class represents only 1% of the data. Which technique should the analyst use to better understand the relationships between features and the minority class?

A.Apply SMOTE to the dataset before analysis.
B.Use random sampling to reduce the dataset size.
C.Scale the features using Min-Max scaling.
D.Use stratified sampling to create a balanced sample for analysis.
AnswerD

Stratified sampling preserves class proportions.

Why this answer

Option D is correct because stratified sampling ensures the minority class is proportionally represented in the sample, allowing meaningful analysis. Option B is wrong because random sampling may miss the minority class entirely. Option A is wrong because SMOTE is for generating synthetic data, not for exploratory analysis.

Option C is wrong because feature scaling does not address class imbalance.

164
MCQhard

A data scientist is performing EDA on a time series dataset of daily sales. The data scientist observes a pattern that repeats every 7 days. Which characteristic of the time series is being observed?

A.Stationarity
B.Autocorrelation
C.Seasonality
D.Trend
AnswerC

Seasonality is a periodic pattern with a fixed frequency.

Why this answer

A pattern that repeats at a fixed frequency (every 7 days) is called seasonality. Option D is wrong because trend is a long-term increase or decrease. Option B is wrong because autocorrelation measures correlation with lagged values; it is a tool for detecting a repeating pattern, not the pattern itself.

Option A is wrong because stationarity refers to constant mean/variance over time.
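One way to confirm a suspected weekly pattern is to compare the autocorrelation at lag 7 with the other lags. The sketch below does this on a synthetic, noise-free series; the sales figures are invented for the example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

# Hypothetical daily sales: the same Monday-to-Sunday shape repeated for 8 weeks.
weekly_pattern = [100, 95, 90, 92, 110, 150, 140]
sales = weekly_pattern * 8

acf = {lag: autocorrelation(sales, lag) for lag in range(1, 8)}
best_lag = max(acf, key=acf.get)
print(best_lag, round(acf[best_lag], 3))
```

On real data the peak at lag 7 would be less clean, and a trend would usually be removed first so that it does not inflate every lag.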

165
MCQeasy

During exploratory data analysis, a data scientist plots the distribution of a numerical feature and observes a heavy right skew. The feature has many outliers at the high end. Which transformation is most appropriate to reduce skewness?

A.Apply a log transformation to the feature.
B.Apply z-score normalization.
C.Apply one-hot encoding.
D.Apply min-max scaling.
AnswerA

Log transformation compresses high values and can make the distribution more symmetric.

Why this answer

A log transformation compresses the range of the data, reducing the impact of extreme values and pulling in the long tail of a right-skewed distribution. This makes the feature more normally distributed, which is often required for linear models and many statistical tests. It is the standard technique for handling positive-valued features with heavy right skew.

Exam trap

AWS often tests the distinction between scaling (which changes range) and transformation (which changes distribution shape), so the trap here is that candidates might pick min-max scaling or z-score normalization thinking they handle outliers, but they only rescale without fixing skewness.

How to eliminate wrong answers

Option B is wrong because z-score normalization (standardization) centers the data around zero with unit variance but does not change the shape of the distribution; it will still be skewed. Option C is wrong because one-hot encoding is used for categorical features, not for transforming numerical features to reduce skewness. Option D is wrong because min-max scaling rescales the feature to a fixed range (e.g., [0,1]) but does not alter the distribution's skewness; outliers remain outliers in the scaled range.
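The difference between rescaling and transforming can be checked numerically. The sketch below, using made-up prices, measures skewness before and after a log transform and after min-max scaling; `math.log1p` (log of 1 + x) is used as an assumption so that small values stay well-behaved.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

# Hypothetical right-skewed feature with one very large value.
prices = [1, 2, 2, 3, 3, 4, 5, 8, 20, 150]

logged = [math.log1p(p) for p in prices]
lo, hi = min(prices), max(prices)
scaled = [(p - lo) / (hi - lo) for p in prices]

print(skewness(prices), skewness(logged), skewness(scaled))
```

Min-max scaling is a linear map, so the skewness is unchanged; only the log transform alters the shape of the distribution.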

166
MCQmedium

A company is performing EDA on a dataset with 10,000 rows and 200 columns. They run a correlation matrix and find many high correlations (|r| > 0.9). What is the best approach to address multicollinearity before modeling?

A.Standardize all features
B.Calculate Variance Inflation Factor (VIF) and remove features with VIF > 10
C.Use Lasso regression with cross-validation
D.Apply Principal Component Analysis (PCA) to all features
AnswerB

VIF identifies highly correlated features for removal.

Why this answer

Option B is correct because VIF measures how strongly a variable is correlated with the others; removing high-VIF variables reduces multicollinearity. Option A is wrong because scaling does not remove correlation. Option D is wrong because PCA creates orthogonal components but reduces interpretability.

Option C is wrong because Lasso can handle multicollinearity but may not be the best EDA step.
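For reference, the usual definition of the variance inflation factor is shown below, where $R_j^2$ is the coefficient of determination obtained by regressing feature $j$ on all the other features. The threshold of 10 is a rule of thumb rather than a fixed standard.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 10 \iff R_j^2 > 0.9
```

In other words, a VIF above 10 means more than 90% of the variance of that feature is explained by the remaining features.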

167
MCQeasy

A data scientist is analyzing a dataset with a timestamp column. The goal is to identify seasonality and trends. Which visualization technique is most suitable?

A.Time series line plot of the target variable over time.
B.Box plot of the target variable grouped by day of week.
C.Scatter plot of the target variable vs. the timestamp.
D.Heatmap of correlation between all features.
AnswerA

Line plots are standard for time series data.

Why this answer

Option A is correct because a time series line plot is standard for visualizing trends and seasonality over time. Option C (scatter plot) is for two numerical variables; Option D (heatmap) shows correlation; Option B (box plot) shows distribution.

168
MCQhard

A data scientist is trying to read a CSV file from S3 bucket 'my-bucket' with key 'training/data.csv' using an IAM role with the attached policy shown in the exhibit. The read operation fails with an Access Denied error. What is the most likely cause?

A.The policy does not include the s3:ListBucket permission, which is required to access the object.
B.The object is encrypted with SSE-KMS and the role does not have kms:Decrypt permission.
C.The resource ARN in the first statement should be 'arn:aws:s3:::my-bucket/training' without the wildcard.
D.The policy explicitly denies s3:GetObject because of the second statement with the trailing slash.
AnswerA

To read an S3 object, the principal needs both s3:GetObject on the object and s3:ListBucket on the bucket (or at least the bucket-level permission to allow access). The policy only grants object-level permissions, not bucket-level ListBucket.

Why this answer

The s3:GetObject permission alone is insufficient to read an object from S3 when the request is made via the AWS Console or certain SDK operations that first list the bucket's contents. The s3:ListBucket permission is required for the ListObjects API call, which is often implicitly invoked to resolve the object key path. Without it, the read operation fails with an Access Denied error even if the GetObject permission is granted.

Exam trap

AWS often tests the subtle distinction between object-level permissions (GetObject) and bucket-level permissions (ListBucket), where candidates mistakenly assume that granting GetObject alone is sufficient for all read operations, ignoring that many S3 interactions implicitly require ListBucket to resolve the object path.

How to eliminate wrong answers

Option B is wrong because the question does not mention any encryption settings on the object, and the error is Access Denied, not a KMS-related permission error (which would typically return a 400 Bad Request with a KMS-specific message). Option C is wrong because the resource ARN 'arn:aws:s3:::my-bucket/training/*' correctly grants access to all objects under the 'training/' prefix; removing the wildcard would restrict access to a single object named 'training' (without a trailing slash), which is not the intended scope. Option D is wrong because the second statement with a trailing slash ('arn:aws:s3:::my-bucket/training/') does not explicitly deny s3:GetObject; it only grants s3:GetObject on objects with keys starting with 'training/' (the trailing slash is part of the prefix pattern, not a denial).

169
Multi-Selectmedium

Which THREE of the following are valid techniques for detecting outliers in a dataset during exploratory data analysis? (Select THREE.)

Select 3 answers
A.Z-score method: flag points with absolute Z-score > 3.
B.Linear regression residuals.
C.Isolation Forest algorithm.
D.K-means clustering.
E.Interquartile Range (IQR) method: flag points outside 1.5*IQR from quartiles.
AnswersA, C, E

Z-score is a standard outlier detection technique.

Why this answer

Z-score, IQR, and Isolation Forest are all common outlier detection methods. Option B (Linear regression residuals) is not for outlier detection. Option D (K-means) can be used for clustering but not primarily for outlier detection.
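The two rule-based methods can be sketched with the standard library. The thresholds (3 standard deviations, 1.5 times the IQR) come from the options above; the sample values are invented, and the quartiles use the default method of `statistics.quantiles`, which is one of several conventions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements clustered near 10 with one extreme value.
values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
          11, 9, 10, 12, 10, 9, 11, 10, 10, 100]

print(zscore_outliers(values))
print(iqr_outliers(values))
```

Note that the Z-score rule depends on the mean and standard deviation, which the outlier itself inflates; with very small samples no point can reach a Z-score of 3, so the IQR rule is often the more robust of the two.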

170
MCQhard

A data scientist is granted the IAM policy shown in the exhibit. The data scientist can query the 'data-lake-bucket' using Athena and get results. However, when the data scientist tries to run a CTAS (CREATE TABLE AS SELECT) query in Athena to write results to a new S3 location, the query fails. What is the most likely reason?

A.The policy does not grant athena:CreateTable permission.
B.The policy does not grant s3:PutObject permission on the bucket.
C.The policy does not grant permissions to the Glue Data Catalog.
D.The policy uses a wildcard for Athena actions, which is not allowed.
AnswerB

CTAS queries write output to S3, requiring s3:PutObject.

Why this answer

Option B is correct because the policy allows s3:GetObject and s3:ListBucket, but not s3:PutObject, which is required for CTAS queries. Option A is wrong because the policy uses resource-level permissions for S3. Option C is wrong because Athena does not require Glue Data Catalog permissions for CTAS if the table metadata is already stored.

Option D is wrong because the policy does not restrict Athena resource ARNs.

171
MCQmedium

During EDA, a data scientist finds that a numeric feature has many outliers. The feature will be used in a linear regression model. Which approach should the scientist take to handle the outliers?

A.Remove all rows with outlier values.
B.Apply a logarithmic transformation to the feature.
C.Standardize the feature using Z-score normalization.
D.Cap the feature values at the 1st and 99th percentiles.
AnswerD

Capping limits extreme values, reducing their influence.

Why this answer

Option D is correct because capping winsorizes outliers, reducing their impact while retaining data. Option A is wrong because removing all outliers may lose information. Option B is wrong because log transform reduces skew but does not remove outliers.

Option C is wrong because scaling does not mitigate outlier influence.
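Capping at the 1st and 99th percentiles can be sketched as below. The function name and the sample data are hypothetical, and the percentile cut points use the `inclusive` method of `statistics.quantiles`, which is an assumption; other percentile conventions give slightly different caps.

```python
import statistics

def cap_at_percentiles(values, lower_pct=1, upper_pct=99):
    """Clip values to the [lower_pct, upper_pct] percentile range (winsorizing)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical feature: values 1..99 plus one extreme value at each end.
values = [-500] + list(range(1, 100)) + [5000]
capped = cap_at_percentiles(values)

print(min(capped), max(capped), len(capped))
```

Every row is kept; only the two extreme values are pulled in to the cap, which is why this approach retains data while limiting leverage in a linear regression.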

172
Multi-Selecthard

A data scientist is analyzing a dataset with missing values. Which THREE methods are appropriate for handling missing data during EDA and preprocessing?

Select 3 answers
A.Remove rows with any missing values
B.Impute missing values with the mean of the column
C.Replace missing values with 0
D.Ignore missing values and proceed with modeling
E.Impute missing values with the median of the column
AnswersA, B, E

Listwise deletion is acceptable if missing is MCAR and few rows.

Why this answer

Option A (remove rows with missing values) is valid if missingness is random and affects few rows. Option B (impute with mean) is common for numeric data. Option E (impute with median) is robust to outliers.

Option C is wrong because using a constant 0 can introduce bias. Option D is wrong because ignoring missing values in models causes errors.

173
MCQeasy

A data scientist is analyzing a dataset with numerical features and a binary target variable. The data scientist creates a pairplot and notices that one feature has a bimodal distribution when colored by the target class. What does this observation suggest?

A.The feature is irrelevant and should be removed.
B.The feature is likely predictive of the target.
C.The feature contains outliers that need to be removed.
D.The feature has missing values that need to be imputed.
AnswerB

Different distributions for each class indicate the feature can separate the classes.

Why this answer

Option B is correct because a bimodal distribution separated by class indicates the feature can help distinguish between classes. Option D is wrong because bimodality does not necessarily imply missing values. Option A is wrong because the pattern suggests the feature is useful, not irrelevant.

Option C is wrong because bimodality is not an indication of outliers.

174
MCQhard

A data scientist is working on a binary classification problem with a highly imbalanced dataset (1% positive class). They have applied oversampling using SMOTE and trained a logistic regression model. The model achieves 99% accuracy on the test set, but the recall for the positive class is only 5%. What is the most likely cause?

A.SMOTE was applied before splitting the data into training and test sets
B.The model is overfitting due to lack of regularization
C.Accuracy is not a suitable metric for imbalanced data
D.Logistic regression is inappropriate for imbalanced datasets
AnswerA

Applying SMOTE before the train/test split leaks synthetic samples into the test set.

Why this answer

Option A is correct because if SMOTE was applied before splitting, synthetic samples leak information from the test set into the training set, leading to overoptimistic accuracy but poor generalization. Option D is wrong because logistic regression can handle imbalanced data, though it may not capture complex patterns. Option C is wrong because accuracy is a poor metric for imbalanced data, but the low recall indicates a problem beyond metric choice.

Option B is wrong because while L2 regularization might help, its absence would not cause such a discrepancy between accuracy and recall.
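The ordering that avoids this leak is: split first, then oversample the training portion only. The sketch below illustrates that ordering with a deliberately simplified oversampler that interpolates between random pairs of minority rows. It is not full SMOTE (which interpolates toward k-nearest minority neighbours); the function name, data, and seed are assumptions for the example.

```python
import random

def split_then_oversample(rows, labels, test_fraction=0.2, seed=0):
    """Hold out a test set first, then balance only the training set."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    train_x = [rows[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    test_x = [rows[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]

    # Synthetic rows are built from original training minority rows only.
    minority = [x for x, y in zip(train_x, train_y) if y == 1]
    n_needed = train_y.count(0) - len(minority)
    if minority:
        for _ in range(max(n_needed, 0)):
            a, b = rng.choice(minority), rng.choice(minority)
            t = rng.random()
            train_x.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            train_y.append(1)
    return train_x, train_y, test_x, test_y

# Hypothetical data: 200 rows, 5% positive class.
rows = [[float(i), float(i % 7)] for i in range(200)]
labels = [1 if i % 20 == 0 else 0 for i in range(200)]
train_x, train_y, test_x, test_y = split_then_oversample(rows, labels)

print(train_y.count(0), train_y.count(1), len(test_x))
```

The test set keeps its original class balance and contains only real rows, so recall measured on it reflects how the model would behave on unseen data.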

175
MCQmedium

A data scientist uses Amazon SageMaker Data Wrangler to perform EDA on a large dataset stored in S3. The data scientist notices that the target variable is highly imbalanced. Which SageMaker Data Wrangler transform can be used to address this during data preparation?

A.Standard scaling
B.Principal component analysis (PCA)
C.SMOTE (Synthetic Minority Over-sampling)
D.One-hot encoding
AnswerC

SMOTE generates synthetic samples to balance the target distribution.

Why this answer

SMOTE is available in SageMaker Data Wrangler to generate synthetic samples for the minority class. Option D is wrong because one-hot encoding is for categorical features. Option A is wrong because standard scaling normalizes numeric features.

Option B is wrong because principal component analysis reduces dimensionality.

176
MCQhard

A data scientist is performing exploratory data analysis on a dataset with mixed data types: numerical, categorical, and text. They want to use Amazon SageMaker Data Wrangler to create a quick visualization dashboard. Which set of transformations should they apply in Data Wrangler to handle all data types appropriately?

A.Use the built-in analysis: summary statistics for numerical, word cloud for text, and frequency for categorical.
B.Convert all features to numerical using one-hot encoding and then create a scatter matrix.
C.Apply TF-IDF vectorization to text and then run k-means clustering.
D.Use PCA to reduce dimensionality and then visualize the first two components.
AnswerA

These are appropriate EDA visualizations for different data types.

Why this answer

Option A is correct because Data Wrangler's built-in analysis includes summary statistics for numerical features, word clouds for text, and frequency counts for categorical features. These are appropriate for initial EDA. Option D is incorrect because PCA is for dimensionality reduction, not EDA.

Option C is incorrect because TF-IDF is a feature engineering step and k-means clustering is a modeling step, not EDA. Option B is incorrect because one-hot encoding is not suited to free-text features.

177
Multi-Selecteasy

Which TWO of the following are common techniques for handling missing values in a dataset during exploratory data analysis? (Select TWO.)

Select 2 answers
A.Apply feature scaling to normalize the data.
B.Remove rows or columns with missing values if they are few.
C.Use Principal Component Analysis (PCA) to reduce dimensionality.
D.Apply one-hot encoding to the missing values.
E.Impute missing values with the mean or median of the column.
AnswersB, E

Deletion is a valid approach when missing data is minimal.

Why this answer

Imputation with mean/median and removing rows/columns are standard techniques. Options A (feature scaling), C (PCA), and D (one-hot encoding) are not for handling missing values.

178
MCQmedium

A machine learning engineer is analyzing a dataset and observes that the distribution of a continuous feature is heavily right-skewed. Which transformation is most likely to make the distribution approximately normal?

A.Square root transformation
B.Exponential transformation
C.Log transformation
D.Box-Cox transformation with lambda = 0
AnswerC

Log transformation is standard for right-skewed data.

Why this answer

Option C is correct because a log transformation compresses the right tail and is effective for right-skewed data. Square root (A) is less effective for heavy skew. Exponential (B) would worsen skew.

Box-Cox (D) is a family of transformations that includes the log as the lambda = 0 case, but the plain log transformation is the most common and straightforward choice.
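For reference, the standard one-parameter Box-Cox transformation for a positive-valued feature $x$ is written below; it shows why the log transformation is the $\lambda = 0$ member of the family.

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln x, & \lambda = 0
\end{cases}
\qquad x > 0
```

The $\lambda = 0$ case is the limit of the first expression as $\lambda \to 0$, and $\lambda = 0.5$ corresponds to a shifted and rescaled square root.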

179
MCQmedium

A team is exploring a dataset with missing values in multiple columns. They want to decide whether to drop rows or impute values. Which approach is most appropriate for exploratory data analysis?

A.Impute missing values with the mean of each column
B.Analyze the missing data pattern using visualizations and summary statistics
C.Drop all rows with missing values to ensure data quality
D.Use Amazon SageMaker Data Wrangler to automatically impute missing values
AnswerB

Understanding the missing data pattern is crucial before deciding on imputation or deletion.

Why this answer

Option B is correct because during EDA, it is important to first understand the pattern and extent of missing data before deciding on treatment. Option C is wrong because dropping rows without analysis may discard valuable data. Option A is wrong because imputing without understanding the missing mechanism may introduce bias.

Option D is wrong because EDA does not require using a specific AWS service.

180
MCQhard

Refer to the exhibit. A data scientist ran an S3 Select query on a large CSV file stored in Amazon S3. The output shows only 2 records returned, but the data scientist expected thousands. The file size is 10 GB. What is the MOST likely reason for the small result set?

A.The file needs to be indexed by S3 Select before querying.
B.The city column may have leading/trailing spaces or case differences.
C.The CSV file contains nested arrays that S3 Select cannot parse.
D.S3 Select does not support the WHERE clause on CSV files.
AnswerB

String comparison is exact; variations cause mismatches, reducing results.

Why this answer

S3 Select performs exact string matching by default, so if the WHERE clause filters on the city column, any leading/trailing spaces or case differences will cause mismatches, returning far fewer rows than expected. The query likely used a literal like 'New York' while the data contains ' New York ' or 'new york', resulting in only 2 matches instead of thousands.

Exam trap

AWS often tests the nuance that S3 Select does not automatically trim or normalize string data, so candidates mistakenly assume the query engine handles such common data quality issues.

How to eliminate wrong answers

Option A is wrong because S3 Select does not require indexing; it scans the entire file and applies the query on the fly. Option C is wrong because S3 Select can parse CSV files with nested arrays as long as the CSV is well-formed (e.g., quoted fields), and nested arrays are not inherently unsupported. Option D is wrong because S3 Select fully supports the WHERE clause on CSV files, including standard SQL predicates.
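
The exact-match behavior described above can be reproduced locally. This is a minimal sketch in plain Python, not S3 Select itself; the sample values and the 'New York' literal are hypothetical.

```python
# Hypothetical sample of the "city" column; values are made up for illustration.
cities = ["New York", " New York ", "new york", "NEW YORK", "Boston", "New York"]

target = "New York"

# Exact comparison, analogous to WHERE city = 'New York'
exact_matches = [c for c in cities if c == target]

# Normalized comparison: trim whitespace and ignore case on both sides
normalized_matches = [
    c for c in cities if c.strip().lower() == target.strip().lower()
]

print(len(exact_matches))       # 2
print(len(normalized_matches))  # 5
```

In an S3 Select query, a comparable fix would likely be to wrap the column in TRIM and LOWER inside the WHERE clause and compare against a lowercase literal.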

181
MCQmedium

A data engineer is performing EDA on a dataset containing user activity logs from a mobile app. The dataset has 10 million rows and includes columns: 'user_id', 'event_type', 'timestamp', 'device_type', and 'session_duration'. The engineer uses Amazon Athena to query the data stored in S3 as CSV files. The engineer runs a query to find the average session_duration per device_type, but the query takes over 5 minutes and scans 100 GB of data. The engineer wants to reduce query cost and improve performance for future EDA. The dataset is not partitioned, and the engineer anticipates frequent queries filtering on 'timestamp' and 'device_type'. Which action will most effectively reduce data scanned?

A.Partition the table by date derived from timestamp and convert to Parquet.
B.Use random sampling to query a subset of data.
C.Convert the data to Parquet format and use columnar storage.
D.Partition the table by device_type.
AnswerA

Combining partitioning and columnar storage maximizes reduction in scanned data.

Why this answer

Option A is correct because partitioning by date (derived from timestamp) allows partition pruning when filtering by timestamp, and Parquet's columnar layout lets Athena read only the columns a query needs. Option C is wrong because converting to Parquet alone helps, but without partitioning, full scans still occur. Option D is wrong because it partitions only by device_type, while time-based filters are also common.

Option B is wrong because sampling loses accuracy.

182
Multi-Selecthard

Which THREE of the following are common causes of multicollinearity in a linear regression model?

Select 3 answers
A.Including a polynomial term (e.g., x^2) along with the original variable
B.Including interaction terms between independent variables
C.Including all dummy variables for a categorical feature
D.Having two or more predictors that are highly correlated
E.Presence of outliers in the target variable
AnswersA, C, D

Polynomial terms are correlated with the original variable.

Why this answer

Options A, C, and D are correct. The dummy variable trap occurs when all categories are included without dropping one. Highly correlated predictors directly cause multicollinearity.

Including polynomial terms creates correlation with the original variable. Option B (interaction terms) can also cause multicollinearity but is a less common source. Option E (outliers in the target variable) does not cause multicollinearity.

183
MCQeasy

A data scientist receives the above error during model training. What is the most likely cause?

A.The training data contains missing or infinite values.
B.The learning rate is too high.
C.The data format is incorrect; expected CSV but received JSON.
D.The instance type lacks sufficient memory.
AnswerA

The error explicitly states 'Input contains NaN, infinity or a value too large'.

Why this answer

Option A is correct. The error indicates NaN or infinite values in the input data. Option B is wrong because the error is about data, not hyperparameters.

Option D is wrong because the error is not about memory. Option C is wrong because the error is not about data format.
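
A quick way to confirm this cause before retraining is to scan the input for NaN or infinite entries. The sketch below uses only the standard library; the function name `find_bad_values` and the sample matrix are hypothetical.

```python
import math

def find_bad_values(rows):
    """Return (row_index, col_index) for every NaN or infinite entry."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if math.isnan(value) or math.isinf(value):
                bad.append((i, j))
    return bad

# Hypothetical feature matrix with one NaN and one infinity
X = [
    [1.0, 2.5],
    [float("nan"), 3.0],
    [4.0, float("inf")],
]

bad_cells = find_bad_values(X)
print(bad_cells)  # [(1, 0), (2, 1)]
```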

184
MCQhard

A company's dataset contains a feature 'zip_code' with 500 unique values. The data scientist wants to use this feature in a linear model. Which EDA step is most important before feature engineering?

A.Check the proportion of missing values
B.Compute the frequency of each zip code
C.Plot a histogram of the feature
D.Calculate the correlation between zip code and the target
AnswerB

Knowing frequency helps decide which categories to combine or how to encode.

Why this answer

Because zip codes are categorical with high cardinality, analyzing the frequency distribution helps decide how to group or encode them (e.g., target encoding). Option C is wrong because histograms are meant for continuous variables. Option D is wrong because correlation is meant for numeric features, and zip codes are categorical labels.

Option A is wrong because the proportion of missing values is unrelated to cardinality handling.
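
As a rough illustration of the frequency check, the sketch below counts each zip code and folds rare ones into a shared bucket. The sample values and the `min_count` threshold are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical zip_code column
zip_codes = ["10001", "10001", "10001", "94105", "94105", "60601", "73301"]

counts = Counter(zip_codes)
min_count = 2  # assumed threshold below which a category counts as "rare"

# Map rare zip codes to a shared "OTHER" bucket
grouped = [z if counts[z] >= min_count else "OTHER" for z in zip_codes]
grouped_counts = Counter(grouped)

print(counts.most_common())
print(grouped_counts)
```

The frequency table is what informs the choice: a long tail of codes seen only once or twice suggests grouping or a smoothed encoding rather than one column per code.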

185
MCQhard

A data scientist is performing EDA on a dataset with 1 million rows. They suspect the dataset contains duplicate rows. Which approach is most efficient to identify duplicates in Amazon SageMaker Studio?

A.Write a Python script that loops through each row and compares to a set of seen rows.
B.Use pandas drop_duplicates and then check the length difference.
C.Use DuckDB SQL query: SELECT COUNT(*) - COUNT(DISTINCT *) FROM table.
D.Use Amazon Athena to query the S3 data with COUNT(DISTINCT *).
AnswerC

DuckDB efficiently processes large DataFrames in-memory.

Why this answer

Option C is correct because DuckDB is an in-process SQL OLAP database that can run on a single machine and efficiently handle large datasets. Option A (Python loop) is slow; Option B (pandas drop_duplicates) may be memory-intensive; Option D (Athena) is serverless but incurs cost and latency.

186
MCQhard

A data scientist is working with a dataset containing text reviews. The goal is to build a sentiment analysis model. Which EDA step is most critical before feature extraction?

A.Calculating the vocabulary size
B.Creating a word cloud
C.Removing stop words
D.Checking the distribution of sentiment labels
AnswerD

Class imbalance can significantly impact model performance.

Why this answer

Checking for class imbalance in sentiment labels is critical because it can bias the model. Option C is wrong because stop word removal is part of preprocessing, not EDA. Option B is wrong because word clouds are for visualization, not a critical step.

Option A is wrong because vocabulary size is not a primary concern at this stage.

187
MCQhard

A data scientist is exploring a dataset with 500 features and 100,000 observations for a regression problem. The scientist notices that many features are highly correlated with each other. Which technique should the scientist use to reduce multicollinearity and improve model interpretability during exploratory data analysis?

A.Compute mutual information between each feature and the target, and keep only the top 50 features.
B.Apply Principal Component Analysis (PCA) to reduce the feature space.
C.Use Lasso regression to select features with non-zero coefficients.
D.Calculate Variance Inflation Factor (VIF) for each feature and remove those with VIF > 10.
AnswerD

VIF quantifies how much a feature is explained by other features; high VIF indicates multicollinearity.

Why this answer

Option D is correct because Variance Inflation Factor (VIF) is a standard metric to detect multicollinearity, and features with high VIF can be removed. Option B is wrong because PCA creates new features that are linear combinations, reducing interpretability. Option C is wrong because Lasso can be used for feature selection but is a modeling step, not exploratory analysis.

Option A is wrong because mutual information measures dependency between a feature and the target, not multicollinearity among features.

188
MCQmedium

A data scientist is analyzing a dataset with 500 features and 10,000 rows. The target variable is binary. After training a logistic regression model, the coefficients show many non-zero values but the model has low accuracy on the test set. Which EDA step should the data scientist perform next to improve model performance?

A.Apply Principal Component Analysis (PCA) to reduce dimensionality.
B.Collect more training data to improve generalization.
C.Normalize the features using StandardScaler.
D.Use correlation analysis or mutual information to select the most relevant features.
AnswerD

Feature selection removes irrelevant features, reducing noise and overfitting.

Why this answer

Option D is correct because feature selection helps reduce noise and overfitting, improving model accuracy. Option C is wrong because scaling does not reduce the number of features. Option A is wrong because PCA may lose interpretability and is not directly aimed at reducing overfitting due to irrelevant features.

Option B is wrong because more data does not necessarily address the issue of irrelevant features.

189
MCQmedium

A data scientist is analyzing a dataset with missing values in several features. The dataset is large (10 million rows) and stored in an S3 bucket as CSV files. The scientist wants to use AWS Glue to catalog the data and then use Amazon Athena to query it. However, the missing values are causing errors in downstream machine learning models. Which approach should the scientist take to handle missing values during exploratory data analysis?

A.Use Amazon SageMaker Data Wrangler to create a data flow that imputes missing values and export the transformed dataset to S3.
B.Use AWS Glue ETL jobs with a custom transformation script that uses the AWS Glue library to drop or impute missing values before writing to a new dataset.
C.Use Amazon Redshift Spectrum with an external table to query the data and use SQL COALESCE to handle missing values on the fly.
D.Use Amazon Athena to run SQL queries that impute missing values and write the results to a new table.
AnswerB

AWS Glue provides native transforms such as DropNullFields and FillMissingValues, and custom scripts allow handling missing values efficiently at scale.

Why this answer

Option B is correct because AWS Glue provides built-in transforms to handle missing values during the ETL process, and using a custom script with the AWS Glue library allows fine-grained control. Option D is wrong because Athena is a query engine and is not intended for data cleaning pipelines. Option A is wrong because SageMaker Data Wrangler is for interactive data preparation, not for large-scale automated ETL.

Option C is wrong because Redshift Spectrum is for querying, not for cleaning missing values.

190
MCQeasy

A data scientist is investigating an application that logs errors to Amazon CloudWatch Logs. The data scientist runs the CloudWatch Logs Insights query shown in the exhibit. The query returns no results, even though the data scientist knows errors have occurred. What is the most likely cause?

A.The stats count() function is misspelled.
B.The filter pattern is case-sensitive and the log messages use a different case for 'error'.
C.The query sorts by timestamp descending, which hides results.
D.The bin(5m) function is not supported in CloudWatch Logs Insights.
AnswerB

CloudWatch Logs Insights is case-sensitive; 'ERROR' will not match 'Error'.

Why this answer

Option B is correct because the query is case-sensitive; 'ERROR' may not match 'error' or 'Error'. Option C is wrong because the sort order does not affect whether results are returned. Option A is wrong because the query uses correct syntax.

Option D is wrong because bin(5m) is valid if there are logs within the time range.

191
Multi-Selecteasy

A data scientist wants to understand the distribution and missing values in a large dataset stored in Amazon S3. Which TWO AWS services can be used directly for this exploratory data analysis? (Choose TWO.)

Select 2 answers
A.Amazon SageMaker Data Wrangler
B.AWS CloudTrail
C.Amazon Athena
D.Amazon EMR
E.AWS Glue DataBrew
AnswersC, E

Athena can run SQL queries to compute distributions and count nulls.

Why this answer

Amazon Athena allows SQL-based queries on S3 data, including aggregation for distribution analysis. AWS Glue DataBrew provides visual profiling to detect missing values and distributions. SageMaker Data Wrangler is also a valid choice but is not a direct service for S3 data without additional steps; EMR requires cluster setup.

192
MCQmedium

A company is storing customer transaction data in Amazon S3 as CSV files. A data scientist uses AWS Glue to crawl the data and create a table in the AWS Glue Data Catalog. When querying the table with Amazon Athena, the data scientist notices that some columns have NULL values where data should exist. The data scientist examines the raw CSV files and confirms the data is present. What is the most likely cause of the NULL values?

A.The CSV files have different schemas (e.g., different columns) across partitions.
B.Athena is configured to skip corrupted records, causing NULLs.
C.The Glue crawler incorrectly inferred the data type of the columns.
D.The CSV files use a custom delimiter that the Glue crawler does not recognize.
AnswerA

Schema evolution causes missing columns to appear as NULL when queried.

Why this answer

Option A is correct because the Glue crawler infers the schema from the first few files; if later files have different schemas (e.g., more columns), the extra data is not captured. Option D is wrong because the crawler handles these CSV files without SerDe issues. Option B is wrong because Athena does not modify data.

Option C is wrong because the issue is schema mismatch, not data type inference.

193
Multi-Selecthard

A data scientist is evaluating feature engineering options for a dataset containing a categorical variable 'education_level' with values: High School, Bachelor, Master, PhD. The target variable is continuous. Which THREE encoding methods are appropriate for this ordinal categorical variable? (Choose 3)

Select 3 answers
A.One-hot encoding
B.Target encoding (mean of target per category)
C.Hash encoding (using feature hashing)
D.Label encoding (e.g., High School=0, Bachelor=1, Master=2, PhD=3)
E.Binary encoding (convert to binary representation)
AnswersA, B, D

One-hot encoding is a safe option that does not assume any order, though it increases dimensionality.

Why this answer

Options A, B, and D are correct because label encoding preserves ordinality, target encoding captures the relationship with the target, and one-hot encoding is a safe fallback. Option E is wrong because binary encoding assumes nominal categories. Option C is wrong because hash encoding loses interpretability and may cause collisions.
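
For the label (ordinal) encoding case, the key point is that the integer codes must follow the real order of the categories rather than alphabetical order. A minimal sketch, with a hypothetical column of values:

```python
# Explicit order for the ordinal variable, lowest to highest
EDUCATION_ORDER = ["High School", "Bachelor", "Master", "PhD"]
level_to_code = {level: code for code, level in enumerate(EDUCATION_ORDER)}

# Hypothetical column values
education = ["Master", "High School", "PhD", "Bachelor", "Master"]

encoded = [level_to_code[v] for v in education]
print(encoded)  # [2, 0, 3, 1, 2]
```

An alphabetical encoder would place Bachelor before High School, which is why the mapping is spelled out explicitly here.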

194
MCQeasy

A data scientist wants to understand the statistical relationship between two categorical variables in a dataset. Which test is most appropriate?

A.Chi-squared test
B.Pearson correlation coefficient
C.Student's t-test
D.ANOVA test
AnswerA

The chi-squared test checks for association between categorical variables.

Why this answer

Option A is correct because the chi-squared test checks independence between categorical variables. Option D is wrong because ANOVA is for continuous vs categorical. Option B is wrong because Pearson correlation is for continuous variables.

Option C is wrong because the t-test compares means of two groups.
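
To make the test concrete, the sketch below computes the Pearson chi-squared statistic for a contingency table by hand, without a continuity correction. The 2x2 counts are hypothetical.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows = plan type, columns = churned / stayed
observed = [
    [30, 70],
    [10, 90],
]

chi2 = chi_squared_statistic(observed)
print(round(chi2, 2))  # 12.5
```

With one degree of freedom, a statistic of 12.5 exceeds the 3.84 critical value at the 5% level, so independence would be rejected for this made-up table.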

195
Multi-Selecthard

A data scientist is analyzing a dataset with several categorical features and a binary target. The scientist wants to check for association between each categorical feature and the target. Which THREE statistical tests are appropriate?

Select 3 answers
A.ANOVA
B.Pearson correlation coefficient
C.Chi-square test of independence
D.Mutual information
E.Cramér's V
AnswersC, D, E

Tests association between two categorical variables.

Why this answer

Options C, D, and E are correct. Chi-square test of independence is for categorical-categorical association. Cramér's V is a measure of association based on chi-square.

Mutual information is a non-parametric measure that can capture non-linear dependencies. Option A is wrong because ANOVA is for categorical vs continuous. Option B is wrong because Pearson correlation is for continuous variables.
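
Cramér's V rescales the chi-square statistic to the range 0 to 1 using V = sqrt(chi2 / (n * min(r - 1, c - 1))), where r and c are the numbers of rows and columns. A minimal sketch without bias correction, on a hypothetical table:

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Hypothetical table: rows = categorical feature levels, columns = target classes
feature_vs_target = [
    [30, 70],
    [10, 90],
]

v = cramers_v(feature_vs_target)
print(round(v, 2))  # 0.25
```

Because V is normalized, it can be compared across features with different numbers of categories, which the raw chi-square statistic cannot.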

196
MCQhard

A data scientist is performing EDA on a dataset of customer churn. The dataset includes a categorical feature 'Region' with 100 unique values. What is the best way to encode this feature for a tree-based model?

A.Replace each category with its frequency in the dataset
B.Use the feature as a categorical variable directly in the tree-based model
C.Label encode the feature (assign integers 0-99)
D.One-hot encode the feature
AnswerB

Many tree-based models (e.g., LightGBM, CatBoost) handle high-cardinality categoricals efficiently.

Why this answer

Option B is correct because many tree-based implementations (e.g., LightGBM, CatBoost) support high-cardinality categorical features natively, without manual encoding. Option D is wrong because one-hot encoding creates 100 columns, causing sparsity. Option C is wrong because label encoding imposes ordinality.

Option A is wrong because frequency encoding maps categories with the same count to the same value, so distinct regions can become indistinguishable.

197
Multi-Selecteasy

Which TWO of the following are common techniques for detecting outliers in a dataset?

Select 2 answers
A.Z-score
B.Interquartile range (IQR) method
C.Principal Component Analysis (PCA)
D.K-means clustering
E.Standard scaling
AnswersA, B

Z-score measures how many standard deviations a point is from the mean; values beyond a threshold (e.g., 3) are outliers.

Why this answer

Z-score identifies outliers based on standard deviations from the mean. IQR method uses quartile ranges to flag points outside 1.5*IQR. Standard scaling, PCA, and K-means are not primarily outlier detection methods.
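
Both rules can be sketched with the standard library. The data below are hypothetical, and the thresholds (|z| > 3 and 1.5 * IQR) are the conventional defaults rather than fixed requirements.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical numeric feature: 20 typical values plus one extreme value
values = [10, 11, 12, 13] * 5 + [95]

# Z-score rule: flag points more than 3 standard deviations from the mean
mu = mean(values)
sigma = pstdev(values)
z_outliers = [v for v in values if abs(v - mu) / sigma > 3]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]

print(z_outliers)    # [95]
print(iqr_outliers)  # [95]
```

In very small samples a single extreme value inflates the standard deviation enough that a z-score threshold of 3 may never trigger, which is one reason the IQR rule is often checked alongside it.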

198
MCQhard

An IAM policy is attached to a data scientist's role. The scientist is trying to list objects in the 'data-bucket' using Amazon Athena. The query fails with an access denied error. What is the MOST likely reason?

A.The policy does not allow s3:ListBucket on the bucket.
B.The policy has a syntax error.
C.The query is trying to read data from the 'sensitive/' prefix.
D.The s3:GetObject action is explicitly denied for all objects.
AnswerC

Deny overrides Allow for that prefix.

Why this answer

Option C is correct because Athena needs s3:GetObject to read data, and the Deny statement blocks access to objects under the 'sensitive/' prefix; a query that reads from that prefix fails because an explicit Deny overrides any Allow. Option A is wrong because ListBucket is allowed.

Option D is wrong because the Deny covers only the 'sensitive/' prefix, not all objects. Option B is wrong because the policy is valid.

199
MCQmedium

A data scientist is performing EDA on a dataset with a timestamp column. They want to detect seasonality. Which visualization is most appropriate?

A.Box plot of value grouped by month
B.Bar chart of average value per month
C.Line plot of value over time
D.Scatter plot of timestamp vs. value
AnswerC

A line plot over time makes recurring seasonal patterns visible.

Why this answer

Option C is correct because a time series line plot clearly shows seasonal patterns. Option B is wrong because a bar chart of monthly averages may not show seasonality within months. Option D is wrong because a scatter plot of timestamp vs. value may be cluttered.

Option A is wrong because a box plot by month shows distribution, not trend over time.

200
Multi-Selecteasy

Which TWO of the following are common techniques for detecting outliers in a numerical feature?

Select 2 answers
A.Chi-square test
B.Standard deviation
C.Interquartile Range (IQR)
D.Z-score
E.Principal Component Analysis (PCA)
AnswersC, D

Under the IQR rule, outliers are points more than 1.5*IQR below Q1 or above Q3.

Why this answer

Z-score and IQR are standard outlier detection methods. PCA can detect outliers but is not a common direct method. Chi-square is for categorical association.

Standard deviation alone is not a method.

201
MCQeasy

A data scientist is performing EDA on a dataset with 500,000 rows and 10 columns. The dataset is stored in an S3 bucket as CSV files. The scientist wants to generate summary statistics (mean, median, min, max) for all numeric columns. Which service allows the quickest ad-hoc analysis without provisioning any infrastructure?

A.AWS Glue ETL
B.Amazon Athena
C.Amazon SageMaker Data Wrangler
D.Amazon QuickSight
AnswerB

Serverless SQL query service.

Why this answer

Option B is correct because Amazon Athena is serverless and can query data in S3 directly using SQL. Option C is wrong because SageMaker Data Wrangler requires a SageMaker Studio environment to be set up first. Option A is wrong because AWS Glue ETL requires job setup.

Option D is wrong because QuickSight is for visualization, not direct summary statistics.

202
MCQmedium

A data scientist is performing exploratory data analysis on a dataset with both numerical and categorical features. The scientist wants to visualize the pairwise relationships between numerical features and also see the distribution of each feature. Which type of plot should the scientist use?

A.Pair plot (scatter matrix) with histograms on the diagonal.
B.Box plot for each feature.
C.Heatmap of the correlation matrix.
D.Correlation matrix with numbers.
AnswerA

Shows pairwise scatter plots and distributions.

Why this answer

Option A is correct because a pair plot (scatter matrix) shows pairwise scatter plots and histograms on the diagonal. Option D is wrong because a correlation matrix does not show distributions. Option C is wrong because a heatmap only shows correlation values.

Option B is wrong because a box plot does not show pairwise relationships.

203
MCQeasy

A data scientist needs to profile a large dataset in Amazon S3 to understand its schema, data types, and quality. Which AWS service can automatically generate a data profile with statistics and visualizations?

A.Amazon Athena
B.AWS Glue DataBrew
C.Amazon QuickSight
D.Amazon Redshift
AnswerB

DataBrew can profile data and generate statistics.

Why this answer

AWS Glue DataBrew provides data profiling capabilities. Option A is wrong because Athena is a query service. Option D is wrong because Redshift is a data warehouse.

Option C is wrong because QuickSight is for visualization after data is prepared.

204
MCQeasy

A data scientist wants to visualize the correlation between a continuous feature and a binary target variable. Which plot is most appropriate?

A.Scatter plot with feature on x-axis and target on y-axis
B.Histogram of the feature
C.Box plot of the feature grouped by target class
D.Bar chart of target class counts
AnswerC

Box plot compares distributions across two groups.

Why this answer

Option C is correct because a box plot shows the distribution of the continuous feature across categories of the binary target, highlighting differences. Option A is wrong because a scatter plot is for two continuous variables. Option B is wrong because a histogram shows the distribution of a single variable.

Option D is wrong because a bar chart is for categorical vs categorical or counts.

205
MCQmedium

A company is preparing a dataset for training a binary classification model. The dataset has a severe class imbalance (1% positive class). The data scientist wants to understand the impact of this imbalance on model performance before sampling. Which exploratory analysis step is MOST critical?

A.Compute the correlation matrix of all features with the target variable.
B.Check for missing values and outliers in the dataset.
C.Perform PCA and visualize the first two principal components colored by class.
D.Plot the distribution of each feature separately for the positive and negative classes.
AnswerD

Overlapping distributions indicate difficulty in classification.

Why this answer

Option D is correct because analyzing the distribution of features across classes can reveal separability and potential issues. Option A is wrong because correlation with target is not the primary concern. Option B is wrong because missing values are not the immediate concern.

Option C is wrong because PCA is not necessary at this stage.

206
MCQeasy

A data scientist is working with a dataset that contains text reviews and a numeric rating (1-5). The goal is to predict the rating from the review text. During EDA, the scientist wants to check if there are any spelling errors or unusual characters. Which tool is BEST suited for this task?

A.Amazon SageMaker Data Wrangler with a custom transform for text cleaning.
B.Amazon Athena with SQL queries to find anomalies.
C.Amazon Comprehend to detect syntax and entities.
D.Amazon QuickSight to create word clouds.
AnswerC

Comprehend can analyze text for structure.

Why this answer

Option C is correct because Amazon Comprehend can detect entities, key phrases, and syntax. It does not detect spelling errors directly, so a custom solution may still be needed for that, but it can help surface unusual patterns in the text, and it is the only option listed that is an AWS AI service for text processing.

Option A is wrong because SageMaker Data Wrangler is for tabular data. Option B is wrong because Athena is for SQL. Option D is wrong because QuickSight is for visualization.

207
MCQmedium

A data scientist is analyzing a dataset with missing values in several columns. The dataset contains customer demographic information and purchase history. Which approach should the data scientist take to handle missing values without introducing bias into the dataset?

A.Drop all rows with any missing values.
B.Impute missing values with the mean of each column.
C.Replace missing values with a constant, such as 0.
D.Use multiple imputation to estimate missing values.
AnswerD

Multiple imputation accounts for uncertainty and reduces bias.

Why this answer

Option D is correct because multiple imputation accounts for the uncertainty of missing values by creating multiple imputed datasets and combining results, reducing bias compared to single imputation or deletion methods. Option A is wrong because dropping rows with missing values can introduce bias if the missingness is not completely random. Option B is wrong because mean imputation can reduce variance and bias relationships.

Option C is wrong because using a constant value (e.g., 0) is arbitrary and can distort the data distribution.
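
The variance-shrinking effect of mean imputation mentioned above is easy to see on a toy column. This sketch is illustrative only; the observed values and the number of missing entries are assumptions.

```python
from statistics import mean, variance

observed = [2.0, 4.0, 6.0, 8.0]  # hypothetical non-missing values
n_missing = 2                    # assumed number of missing entries

# Single (mean) imputation: every missing entry gets the same value
fill = mean(observed)
imputed = observed + [fill] * n_missing

var_before = variance(observed)  # sample variance of the observed values
var_after = variance(imputed)    # sample variance after mean imputation

print(var_before, var_after)
```

Every imputed value sits exactly at the mean, so the spread shrinks even though no real information was added. Multiple imputation avoids this by drawing several plausible values per missing entry.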

208
MCQeasy

A data scientist is analyzing a dataset with many features and wants to identify which features are most correlated with the target variable. Which EDA technique should be used?

A.Box plots grouped by target
B.Scatter plot matrix
C.Histogram of each feature
D.Correlation matrix
AnswerD

Correlation matrix provides a compact view of pairwise correlations.

Why this answer

A correlation matrix shows pairwise correlations between all numeric features and the target. Option B is wrong because a scatter plot matrix shows each pair visually but does not quantify correlation and becomes hard to read with many features. Option C is wrong because histograms show distributions, not correlations.

Option A is wrong because box plots show distributions per category, not correlations.

209
MCQhard

A data engineer is using AWS Glue to catalog a dataset with 200 columns. During exploratory data analysis, they run a crawler and then view the table schema in the AWS Glue Data Catalog. They notice that many columns are inferred as 'string' even though they contain numeric values. What is the most likely cause?

A.The data is stored in JSON format, which only supports string types.
B.The crawler sample size is too small, and the sampled rows contain non-numeric values.
C.The data is stored in Parquet format, which does not support numeric types.
D.The column names contain special characters that prevent type inference.
AnswerB

The crawler samples a subset; if the sample includes non-numeric values, it infers string.

Why this answer

Option B is correct because the crawler samples data and may not see enough numeric values if the sample size is small or if the first few rows contain non-numeric values (e.g., headers or missing values). Option D is incorrect because the crawler does not rely on column names for type inference. Option C is incorrect because Parquet supports numeric types and stores its schema in the file; type inference applies to formats such as CSV.

Option A is incorrect because JSON supports numeric types as well as strings, although the crawler can still infer types incorrectly.

210
MCQhard

During EDA, a data scientist discovers that two numerical features have a Pearson correlation coefficient of 0.95. Which action should the scientist take to avoid multicollinearity in a linear regression model?

A.Remove one of the features
B.Apply PCA to the two features
C.Use Ridge regression to penalize coefficients
D.Create polynomial features from the correlated pair
E.Apply min-max scaling to both features
AnswerA

Removing one feature eliminates multicollinearity and retains interpretability.

Why this answer

High correlation indicates multicollinearity, which can be addressed by removing one of the correlated features. Option B is wrong because PCA reduces dimensionality but loses interpretability. Option C is wrong because regularization (e.g., Ridge) can handle multicollinearity but does not remove it; removing one feature is simpler.

Option D is wrong because polynomial features introduce more multicollinearity. Option E is wrong because scaling does not address correlation.
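
To connect the correlation of 0.95 with the VIF rule of thumb: when a model has exactly two predictors, the R-squared from regressing one on the other equals r squared, so VIF = 1 / (1 - r^2). The helper name below is hypothetical, and the shortcut does not hold for three or more predictors.

```python
def vif_from_pairwise_r(r):
    """VIF for a model with exactly two predictors whose correlation is r."""
    return 1.0 / (1.0 - r ** 2)

vif = vif_from_pairwise_r(0.95)
print(round(vif, 2))  # 10.26
```

A value just above 10 is past the commonly used VIF > 10 cutoff, which is consistent with dropping one of the two features.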

211
MCQmedium

Refer to the exhibit. A data scientist is unable to query a table in Amazon Athena that is located in the 'my-data-bucket' S3 bucket. The IAM policy shown is attached to the scientist's role. What is the most likely reason for the failure?

A.The policy does not allow decrypting data encrypted with AWS KMS.
B.The policy does not allow athena:StartQueryExecution.
C.The policy does not allow s3:GetObject on the bucket.
D.The policy does not allow s3:PutObject to write query results to an S3 bucket.
AnswerD

Athena writes results to S3, requiring s3:PutObject.

Why this answer

Athena queries also require permission to write query results to an S3 bucket, typically 's3:PutObject' on the output location. The policy lacks that permission. Option C is wrong because the policy allows s3:GetObject and s3:ListBucket.

Option B is wrong because athena:StartQueryExecution is allowed. Option A is wrong because there is no encryption restriction in the policy.

212
Multi-Selectmedium

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target is binary. The scientist wants to reduce dimensionality while preserving information related to the target. Which TWO methods are appropriate?

Select 2 answers
A.Principal Component Analysis (PCA)
B.Autoencoders
C.L1-regularized logistic regression
D.Mutual information-based feature selection
E.t-Distributed Stochastic Neighbor Embedding (t-SNE)
AnswersC, D

Can perform feature selection by shrinking coefficients to zero.

Why this answer

Options C and D are correct. Mutual information selection keeps the features with the highest dependency on the target, and L1-regularized logistic regression can drive coefficients to zero for feature selection. Option A is wrong because PCA is unsupervised and may discard target-related variance.

Option E is wrong because t-SNE is for visualization only. Option B is wrong because autoencoders are unsupervised.

213
MCQhard

A data scientist is working with a dataset containing text reviews. The goal is to classify sentiment. During EDA, they compute the word frequency distribution. They notice that the most frequent words are common stop words like 'the', 'and', 'a'. Which action should they take to improve the feature representation for modeling?

A.Use n-grams instead of unigrams to capture phrase patterns.
B.Add more stop words to the default list to remove even more common words.
C.Remove the stop words from the text before creating the bag-of-words representation.
D.Apply stemming to reduce words to their root forms.
AnswerC

Stop words are usually not informative for sentiment; removing them reduces noise.

Why this answer

Option C is correct because removing stop words focuses on content words that carry sentiment. Option B is wrong because adding more stop words would remove even more potentially useful words. Option D is wrong because stemming reduces words to root forms but does not address stop words.

Option A is wrong because n-grams capture phrases but still include stop words.
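
The effect of stop word removal on a word frequency table can be sketched as follows. The reviews and the short stop word list are made up for illustration; a real project would more likely use an established list from an NLP library.

```python
from collections import Counter

# Small illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"the", "and", "a", "is", "was", "it"}

# Hypothetical reviews
reviews = [
    "the food was great and the service was great",
    "a terrible experience and the food was cold",
]

tokens = [w for review in reviews for w in review.lower().split()]

all_counts = Counter(tokens)
content_counts = Counter(w for w in tokens if w not in STOP_WORDS)

print(all_counts.most_common(3))
print(content_counts.most_common(3))
```

Before filtering, the top of the table is dominated by words such as 'the' and 'was'; afterwards the most frequent tokens are content words like 'great' and 'food'.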

214
MCQhard

A data scientist is analyzing a dataset where the target variable is highly imbalanced (1% positive class). They are performing EDA. Which metric is most appropriate for evaluating class separation in the feature space?

A.Area Under the ROC Curve (AUC-ROC)
B.Accuracy
C.F1 score
D.Log loss
AnswerA

AUC-ROC measures separability independent of class distribution.

Why this answer

Option A is correct because AUC-ROC is robust to class imbalance and measures separability. Option B is wrong because accuracy is misleading on imbalanced data. Option C is wrong because F1 score is a model evaluation metric, not an EDA tool.

Option D is wrong because log loss is a probabilistic metric.
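To make the contrast between accuracy and AUC-ROC concrete, here is a small sketch that computes AUC from its pairwise definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative). The data and the `auc_roc` helper are hypothetical, and the quadratic loop is only suitable for small samples.

```python
def auc_roc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties count as 0.5. O(n_pos * n_neg), so intended for small samples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical 1% positive dataset: 99 negatives, 1 positive.
labels = [0] * 99 + [1]
feature = [float(i) for i in range(99)] + [90.5]

# Always predicting the majority class looks excellent on accuracy...
accuracy_all_negative = sum(1 for y in labels if y == 0) / len(labels)

# ...but a constant score has no separating power at all.
auc_constant = auc_roc(labels, [0.0] * len(labels))

# The single feature, used directly as a score, separates the classes fairly well.
auc_feature = auc_roc(labels, feature)
```

Here `accuracy_all_negative` is 0.99 while `auc_constant` is 0.5, which is why accuracy alone says little about separability on imbalanced data.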

215
MCQhard

A data engineer is exploring a dataset with a timestamp column and wants to resample the data to a consistent 1-hour frequency. The data is irregularly spaced. Which approach is most efficient using AWS services?

A.Use Amazon EMR with Spark
B.Use AWS Glue with built-in transforms
C.Use Amazon Athena with SQL window functions
D.Use Amazon SageMaker Processing with a custom script
AnswerD

SageMaker Processing allows custom scripts for flexible resampling.

Why this answer

Option D is correct because Amazon SageMaker Processing jobs allow custom scripts (e.g., using pandas resample) to handle irregular time series, and they are fully managed. Option C is wrong because Amazon Athena is a query engine and cannot resample easily. Option B is wrong because AWS Glue is more suited for batch ETL and may be overkill.

Option A is wrong because Amazon EMR requires cluster management and is more complex for simple resampling.
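The kind of logic such a custom script would carry can be sketched without any dependencies. This is a standard-library stand-in for the pandas resample call mentioned above, not the pandas API itself; the `resample_hourly_mean` helper and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def resample_hourly_mean(records):
    """Average irregular (timestamp, value) readings into consecutive 1-hour bins.

    Hours with no readings are kept in the output with a value of None,
    so the result has a consistent 1-hour frequency.
    """
    buckets = defaultdict(list)
    for ts, value in records:
        # Truncate each timestamp to the start of its clock hour.
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    if not buckets:
        return []
    hour, last = min(buckets), max(buckets)
    out = []
    while hour <= last:
        vals = buckets.get(hour)
        out.append((hour, sum(vals) / len(vals) if vals else None))
        hour += timedelta(hours=1)
    return out

# Hypothetical irregularly spaced readings.
readings = [
    (datetime(2024, 1, 1, 0, 5), 10.0),
    (datetime(2024, 1, 1, 0, 50), 20.0),
    (datetime(2024, 1, 1, 2, 15), 30.0),
]
hourly = resample_hourly_mean(readings)
```

The empty 01:00 bin is kept as `None` so that a later step can decide whether to interpolate, forward-fill, or leave the gap.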

216
MCQhard

A data scientist is analyzing a dataset with 100,000 observations and 50 features. The scientist uses a Jupyter notebook on Amazon SageMaker. During EDA, the scientist runs a command to check for missing values and notices that 20% of the data in one feature is missing. The missing values are not random; they are correlated with another feature. Which imputation method is MOST appropriate?

A.Median imputation
B.Listwise deletion (remove rows with missing values)
C.Mean imputation
D.Multiple imputation by chained equations (MICE)
AnswerD

Models missing values using other features.

Why this answer

Option D is correct because MICE imputes each missing value from the other features, accounting for correlations. Option C is wrong because mean imputation ignores correlation. Option A is wrong because median imputation also ignores correlation.

Option B is wrong because removing rows loses data and may introduce bias.

217
MCQmedium

A data scientist is exploring a dataset of customer transactions. The dataset has 1 million rows and 50 columns. The target variable is a binary flag indicating whether a customer churned. The data scientist runs a correlation matrix on all numerical features and finds that two features have a correlation coefficient of 0.98. Which action should be taken to improve model performance?

A.Create an interaction term between the two features.
B.Remove one of the two highly correlated features from the dataset.
C.Increase the regularization parameter (e.g., lambda) in the model.
D.Apply mean-centering to both features to reduce correlation.
AnswerB

Removing one feature eliminates multicollinearity, simplifying the model and improving interpretability.

Why this answer

Two features with a correlation coefficient of 0.98 are nearly perfectly multicollinear. This inflates the variance of coefficient estimates in linear models, making them unstable and reducing interpretability. Removing one of the highly correlated features is a standard dimensionality reduction technique that mitigates multicollinearity without significant information loss, as the remaining feature captures almost the same variance.

Exam trap

AWS often tests the misconception that regularization alone fixes multicollinearity, but regularization only penalizes coefficient magnitude, not the linear dependency between features.

How to eliminate wrong answers

Option A is wrong because creating an interaction term between two nearly perfectly correlated features would introduce even more severe multicollinearity (the interaction term will be highly correlated with the original features), worsening model stability. Option C is wrong because increasing the regularization parameter (e.g., lambda in L2 regularization) can shrink coefficients but does not eliminate the underlying multicollinearity; the model remains sensitive to small data changes and coefficient interpretation is still problematic. Option D is wrong because mean-centering only shifts the features' means to zero and does not change the correlation coefficient between them; it has no effect on multicollinearity.
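The claim that mean-centering leaves the correlation unchanged is easy to check numerically. The sketch below uses hypothetical feature values and hand-written `pearson` and `center` helpers so that it runs on the standard library alone.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def center(values):
    """Subtract the mean so the result has mean zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]

# Hypothetical pair of nearly collinear features.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r_raw = pearson(x, y)
r_centered = pearson(center(x), center(y))
```

`r_raw` and `r_centered` agree up to floating-point error, because the Pearson formula already subtracts the means internally.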

218
MCQeasy

A data scientist needs to understand the distribution of a numeric feature in a dataset stored in Amazon S3. Which AWS service can be used to run a quick exploratory query without setting up a server?

A.Amazon Redshift
B.Amazon EMR
C.Amazon Athena
D.AWS Glue
AnswerC

Athena is serverless and allows SQL queries directly on S3 data.

Why this answer

Option C is correct because Amazon Athena allows serverless SQL queries on data in S3. Option A (Amazon Redshift) is a data warehouse; Option B (Amazon EMR) requires cluster setup; Option D (AWS Glue) is for ETL.

219
MCQmedium

A data scientist is analyzing a time-series dataset and wants to check for stationarity. Which EDA technique is most appropriate?

A.Plot the autocorrelation function (ACF).
B.Use time-series cross-validation.
C.Perform the Augmented Dickey-Fuller (ADF) test.
D.Create a scatter plot of the series against its lag.
AnswerC

ADF test formally tests for unit root (non-stationarity).

Why this answer

The Augmented Dickey-Fuller (ADF) test is a formal statistical hypothesis test specifically designed to check for stationarity in a time series. It tests the null hypothesis that a unit root is present, indicating non-stationarity, against the alternative of stationarity. This makes it the most appropriate EDA technique for directly assessing stationarity.

Exam trap

AWS often tests the distinction between visual EDA techniques (like ACF plots) and formal statistical tests (like ADF), trapping candidates who confuse diagnostic plots with hypothesis testing for stationarity.

How to eliminate wrong answers

Option A is wrong because plotting the autocorrelation function (ACF) is a visual diagnostic for identifying autocorrelation patterns and model order (e.g., AR or MA terms), but it does not provide a formal statistical test for stationarity. Option B is wrong because time-series cross-validation is a model evaluation technique used to assess predictive performance, not a method for testing stationarity. Option D is wrong because a scatter plot of the series against its lag can reveal linear relationships and autocorrelation, but it lacks a formal hypothesis test and cannot definitively confirm or reject stationarity.

220
MCQeasy

A data scientist is analyzing a dataset with 500 features and 10,000 samples. After running a correlation matrix, they find that many feature pairs have correlation >0.95. What is the most appropriate next step to improve model performance?

A.Collect more training data to reduce the impact of correlated features.
B.Increase the regularization parameter in the model.
C.Apply principal component analysis (PCA) to reduce dimensionality.
D.Remove all features with correlation above 0.95.
AnswerC

PCA reduces multicollinearity by transforming correlated features into orthogonal components.

Why this answer

Option C is correct because high correlation indicates multicollinearity, which can be addressed by dimensionality reduction techniques like PCA. Option A is wrong because adding more data does not fix multicollinearity. Option D is wrong because removing all correlated features may discard useful information.

Option B is wrong because increasing regularization can help but is not the most appropriate first step for a large number of correlated features.

221
MCQhard

A team is performing exploratory data analysis on a dataset containing 10 million records stored in Amazon S3. They want to sample the data efficiently to build a representative subset for initial modeling. Which sampling method should they use to minimize bias and ensure the sample reflects the population distribution?

A.Stratified random sampling
B.Simple random sampling
C.Systematic sampling
D.Reservoir sampling
AnswerA

Stratified sampling ensures representation from all strata, reducing bias.

Why this answer

Option A is correct because stratified random sampling ensures that each subgroup (stratum) is proportionally represented, which is important for imbalanced data. Option B is wrong because simple random sampling may miss rare subgroups. Option C is wrong because systematic sampling can introduce bias if there is periodicity.

Option D is wrong because reservoir sampling is for streaming data, not for static datasets.
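A minimal sketch of proportional stratified sampling, assuming the data fits in memory and each row carries a stratum label. The `stratified_sample` helper, the `segment` field, and the 5% rare group are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum (at least one row per stratum)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 50 "rare" rows and 950 "common" rows.
rows = [{"id": i, "segment": "rare" if i < 50 else "common"} for i in range(1000)]
sample = stratified_sample(rows, key=lambda r: r["segment"], fraction=0.1)

rare_in_sample = sum(1 for r in sample if r["segment"] == "rare")
```

The sample keeps the population's 5% share of the rare segment exactly, whereas a simple random draw of 100 rows would only match it on average.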

222
MCQeasy

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transaction records. The dataset has missing values in the 'age' column and outliers in the 'amount' column. Which combination of techniques should the engineer use to handle these issues during EDA?

A.Impute missing age values with the median and cap outliers in 'amount' using the interquartile range (IQR) method.
B.Remove rows with missing age and apply log transformation to 'amount'.
C.Impute missing age values with a constant (e.g., 0) and cap outliers using mean ± 3*std.
D.Impute missing age values with the mean and remove outliers in 'amount' using z-score.
AnswerA

Median is robust; IQR handles outliers.

Why this answer

Option A is correct because median imputation is robust to outliers, and IQR-based capping is standard for outlier handling. Option B is wrong because removing rows with missing age may lose data. Option C is wrong because imputing a constant such as 0 distorts the age distribution, and mean ± 3*std thresholds are themselves inflated by outliers.

Option D is wrong because mean imputation is sensitive to outliers, and z-score filtering with mean/std is also sensitive to outliers.
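The two techniques can be sketched with the standard library as follows. The `impute_median` and `cap_iqr` helpers and the sample values are hypothetical, missing values are represented as `None`, and the 1.5 multiplier is the conventional default rather than something the question specifies.

```python
from statistics import median, quantiles

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [m if v is None else v for v in values]

def cap_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

# Hypothetical columns: 'age' with gaps, 'amount' with one extreme value.
ages = [25, None, 31, 40, None, 29]
amounts = [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 900.0]

ages_imputed = impute_median(ages)
amounts_capped = cap_iqr(amounts)
```

Capping keeps the row with the extreme amount in the dataset but pulls its value back to the upper fence, so no observations are lost.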

223
Multi-Selecthard

A data scientist is analyzing a dataset and suspects the presence of outliers that could affect the mean and standard deviation. Which TWO methods are robust to outliers for measuring central tendency and dispersion?

Select 2 answers
A.Interquartile range (IQR)
B.Range
C.Standard deviation
D.Median
E.Mean
AnswersA, D

IQR is robust to outliers.

Why this answer

Median and interquartile range (IQR) are robust to outliers. Mean and standard deviation are sensitive to outliers. Range is also sensitive.
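A quick numerical check of this, using hypothetical values and the standard library only: one extreme value is swapped into a small sample and each statistic is recomputed.

```python
from statistics import mean, median, quantiles, stdev

def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def value_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)

# Hypothetical sample, then the same sample with its largest value
# replaced by an extreme outlier.
clean = [10, 11, 12, 13, 14, 15, 16]
dirty = clean[:-1] + [1000]

summary = {
    name: (fn(clean), fn(dirty))
    for name, fn in [
        ("median", median),
        ("iqr", iqr),
        ("mean", mean),
        ("stdev", stdev),
        ("range", value_range),
    ]
}
```

In `summary`, the median and IQR are identical for both samples, while the mean, standard deviation, and range all change sharply.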

224
Multi-Selectmedium

A data scientist is performing exploratory data analysis on a dataset with 10,000 rows and 20 features. The target variable is binary. The data scientist observes that one feature has 15% missing values. Which TWO actions are appropriate to handle this missing data? (Choose TWO.)

Select 2 answers
A.Replace missing values with the mode of the feature.
B.Identify and remove outliers from the feature.
C.Use multiple imputation to fill in the missing values.
D.Delete all rows that contain missing values for this feature.
E.Drop the entire feature from the dataset.
AnswersC, D

Multiple imputation creates several plausible imputed datasets and combines results.

Why this answer

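Option C is correct because multiple imputation is a robust statistical technique that accounts for uncertainty in missing values by creating multiple complete datasets, analyzing each, and pooling results. This is particularly appropriate for a dataset with 10,000 rows and 20 features, as it preserves the sample size and avoids bias that simpler methods might introduce. Option D is also keyed as correct: deleting the affected rows still leaves about 8,500 of the 10,000 rows, although this is only unbiased when the values are missing completely at random.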

Exam trap

AWS often tests the misconception that mode imputation (Option A) is a safe default for missing data, but it ignores feature relationships and can distort distributions, whereas multiple imputation is preferred for non-trivial missingness.

225
MCQmedium

During EDA, a data scientist finds that two features have a Pearson correlation coefficient of 0.95. What is the primary concern when using these features together in a linear regression model?

A.The model will underfit because of redundant information
B.Heteroscedasticity will be introduced
C.The model will overfit due to redundant features
D.Multicollinearity will make coefficient estimates unstable
AnswerD

High correlation between predictors leads to multicollinearity, increasing standard errors.

Why this answer

Option D is correct because high correlation indicates multicollinearity, which can destabilize coefficient estimates in linear regression. Option C is wrong because overfitting is not directly caused by correlation. Option A is wrong because high correlation does not cause underfitting.

Option B is wrong because heteroscedasticity is unrelated to correlation.
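One way to quantify the instability is the variance inflation factor (VIF). In the special case of exactly two predictors, the VIF of either one reduces to 1 / (1 - r^2), where r is their Pearson correlation. The helper name below is hypothetical, and the sketch assumes that two-predictor case only.

```python
from math import sqrt

def vif_two_predictors(r):
    """VIF for either predictor in a model with exactly two predictors.

    With two predictors, regressing one on the other gives R^2 = r^2,
    so VIF = 1 / (1 - r^2).
    """
    if abs(r) >= 1:
        raise ValueError("perfectly collinear predictors have infinite VIF")
    return 1.0 / (1.0 - r * r)

vif = vif_two_predictors(0.95)

# Standard errors scale with the square root of the VIF.
se_inflation = sqrt(vif)
```

At r = 0.95 the coefficient variance is inflated roughly tenfold relative to uncorrelated predictors, so the standard errors are a little over three times larger.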
