CCNA Exploratory Data Analysis Questions

75 of 406 questions · Page 4/6 · Exploratory Data Analysis · Answers revealed

226
MCQhard

A data scientist is analyzing a dataset with high cardinality categorical features (e.g., user IDs with millions of unique values). They want to visualize the relationship between these categorical features and a continuous target variable. Which approach is most effective for EDA?

A.Group rare categories into an 'Other' category and use box plots
B.Apply one-hot encoding and use scatter plots
C.Use a bar chart with all categories on x-axis
D.Remove the categorical features from analysis
E.Apply feature hashing and visualize the hashed values
AnswerA

Grouping reduces cardinality and box plots effectively show relationship with target.

Why this answer

For high-cardinality categorical features, grouping rare categories into an 'Other' category reduces cardinality and allows meaningful visualizations such as box plots. Option B is wrong because one-hot encoding creates too many columns and is not suitable for visualization. Option C is wrong because visualizing millions of categories on one axis is not feasible.

Option D is wrong because removing the feature loses information. Option E is wrong because feature hashing is for modeling, not EDA visualization.
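
As a rough illustration of grouping rare categories, the sketch below lumps infrequent values into 'Other' with pandas and then summarizes the target per group, which is the information a box plot would display. The column names, the toy data, and the `min_count` threshold are assumptions made for this example only.

```python
import pandas as pd


def lump_rare_categories(series: pd.Series, min_count: int,
                         other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep frequent categories, replace everything else.
    return series.where(series.isin(frequent), other_label)


# Hypothetical toy data: user IDs and a continuous target.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3", "u4"],
    "spend": [10.0, 12.0, 11.0, 30.0, 28.0, 5.0, 50.0],
})
df["user_group"] = lump_rare_categories(df["user_id"], min_count=2)

# Per-group summary statistics (what a box plot would draw).
summary = df.groupby("user_group")["spend"].describe()
print(summary)
# With matplotlib installed: df.boxplot(column="spend", by="user_group")
```

In practice the threshold would be chosen from the category frequency distribution rather than fixed in advance.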

227
Multi-Selectmedium

Which THREE of the following are common issues that can be identified during exploratory data analysis? (Select THREE.)

Select 3 answers
A.Multicollinearity between features
B.High latency in API endpoints
C.Gradient vanishing in neural networks
D.Class imbalance in the target variable
E.Missing values in features
AnswersA, D, E

High correlation between features can be detected via correlation matrix.

Why this answer

Multicollinearity occurs when two or more features in a dataset are highly correlated, meaning they contain redundant information. During exploratory data analysis (EDA), correlation matrices and variance inflation factor (VIF) calculations can reveal this issue, which can destabilize linear regression models and inflate coefficient standard errors.

Exam trap

Cisco often tests the boundary between data-level issues (EDA) and model training issues, so candidates mistakenly select gradient vanishing (a deep learning optimization problem) or API latency (an operational concern) as EDA findings.
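
The three data-level issues named in this question can each be checked with a one-line pandas call. The sketch below is only an illustration on made-up data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: x2 is nearly a multiple of x1, x3 has gaps,
# and the label is imbalanced.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "x3": [5.0, np.nan, 1.0, 7.0, np.nan, 2.0],
    "label": [0, 0, 0, 0, 0, 1],
})

missing = df.isna().sum()                                # missing values per column
class_share = df["label"].value_counts(normalize=True)   # class imbalance
corr = df[["x1", "x2", "x3"]].corr()                     # pairwise correlation

print(missing)
print(class_share)
print(corr.round(2))
```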

228
Multi-Selectmedium

A data scientist is performing EDA on a dataset with both numeric and categorical features. Which TWO techniques are appropriate for visualizing the relationship between a numeric feature and a binary categorical target?

Select 2 answers
A.Histogram
B.Stacked bar chart
C.Violin plot grouped by target
D.Box plot grouped by target
E.Scatter plot
AnswersC, D

Violin plots show distribution and density across categories.

Why this answer

Option D (box plot grouped by target) shows the distribution of a numeric feature across categories. Option C (violin plot) combines a box plot with a density estimate. Option B is wrong because stacked bar charts are for categorical vs categorical comparisons.

Option A is wrong because a histogram shows the distribution of a single variable. Option E is wrong because scatter plots are for two numeric variables.

229
MCQhard

A data scientist is working with a dataset that has imbalanced classes (1% positive). They want to explore the data before modeling. Which visualization technique is most appropriate to understand the distribution of features with respect to the target class?

A.Box plots grouped by class
B.Parallel coordinates plot
C.Histograms overlaid by class
D.Scatter plot matrix
AnswerB

Correct: Parallel coordinates can display multiple features and highlight class separations.

Why this answer

Option B is correct because a parallel coordinates plot can show feature patterns for the minority vs the majority class in high dimensions. Option D is wrong because scatter plot matrices become cluttered with many features. Option C is wrong because histograms are univariate and do not show interaction.

Option A is wrong because box plots are univariate.

230
MCQeasy

A data scientist is exploring a dataset with a column 'transaction_date'. They want to create features for day of week and month. What is the correct AWS service to schedule a recurring ETL job for this transformation?

A.Amazon Athena
B.AWS Glue
C.Amazon SageMaker
D.AWS Lambda
AnswerB

Glue is a managed ETL service.

Why this answer

Option B is correct because AWS Glue is a serverless ETL service. Option C is wrong because SageMaker is for ML modeling. Option A is wrong because Athena is for querying.

Option D is wrong because Lambda runs serverless functions but is not a full ETL service.

231
MCQeasy

A data scientist is exploring a dataset and wants to identify outliers in a numerical feature. The feature is not normally distributed. Which technique is robust to non-normal distributions?

A.Compute the Median Absolute Deviation (MAD) and flag values with MAD > 3.
B.Use the IQR method: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
C.Calculate the Z-score and flag values with |Z| > 3.
D.Flag values more than 3 standard deviations from the mean.
AnswerB

Does not assume normality; uses robust quartiles.

Why this answer

The Interquartile Range (IQR) method does not assume normality and uses percentiles. Option C (Z-score) assumes normality. Option D (mean ± 3σ) also assumes normality.

Option A (MAD) is robust, but it is normally applied through a modified Z-score; the IQR method is the more common choice for non-normal data.
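
A minimal sketch of the IQR rule described in this question, using NumPy. The function name and the sample data are illustrative assumptions; the 1.5 multiplier is the one quoted in the question.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value is an IQR outlier."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical skewed sample with one extreme value.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]
mask = iqr_outlier_mask(data)
print(np.asarray(data)[mask])
```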

232
MCQhard

A data scientist is using Amazon Athena to query a CSV file stored in S3. The above error occurs. What is the most likely cause?

A.The CSV file uses a different delimiter than comma.
B.The CSV file is missing a header row.
C.The CSV file is too large for Athena to process.
D.The CSV file has inconsistent number of columns in some rows.
AnswerD

The error indicates row 1502 has 5 fields while header has 4.

Why this answer

Option D is correct because the error clearly states that a row has more fields than the header. Option A is wrong because the error is about a field count mismatch in a specific row, not the delimiter. Option C is wrong because the error points to a row's field count, not to file size.

Option B is wrong because the header is present and read correctly.

233
MCQeasy

A data scientist has a dataset with 500 features and wants to reduce dimensionality for visualization. Which technique is most appropriate for identifying the two components that capture the most variance?

A.t-Distributed Stochastic Neighbor Embedding (t-SNE)
B.Linear Discriminant Analysis (LDA)
C.Principal Component Analysis (PCA)
D.K-means clustering
AnswerC

PCA projects data onto directions of maximum variance.

Why this answer

Option C is correct because PCA is designed to find principal components that maximize variance. Option A is wrong because t-SNE is for visualization but not variance-based; it focuses on preserving local structure. Option B is wrong because LDA is supervised and requires labels.

Option D is wrong because K-means is clustering, not dimensionality reduction.
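
To make "directions of maximum variance" concrete, here is a small sketch that computes a two-component PCA projection with NumPy's SVD. It is an illustration under assumed random data, not a replacement for a library implementation; `pca_2d` is a hypothetical helper name.

```python
import numpy as np


def pca_2d(X):
    """Project X onto its first two principal components.

    Returns the 2D projection and the explained-variance ratio of
    those two components.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)        # variance share per component
    return Xc @ Vt[:2].T, explained[:2]


# Hypothetical data: five features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])
projected, ratio = pca_2d(X)
print(projected.shape, ratio.round(3))
```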

234
MCQmedium

A data scientist is analyzing a dataset with 500 features and notices that many features are highly correlated. Which AWS service can be used to automatically reduce dimensionality by identifying and removing redundant features before training a model?

A.AWS Glue
B.Amazon SageMaker Data Wrangler
C.Amazon QuickSight
D.Amazon Athena
AnswerB

Provides built-in transformations including correlation analysis and dimensionality reduction.

Why this answer

Amazon SageMaker Data Wrangler provides built-in transformations including correlation analysis and dimensionality reduction. Option C is wrong because QuickSight is for visualization, not feature reduction. Option A is wrong because Glue is for ETL but lacks automatic dimensionality reduction.

Option D is wrong because Athena is a query service.

235
MCQeasy

A data analyst is using Amazon SageMaker Studio to perform exploratory data analysis on a dataset stored in S3. The analyst wants to generate summary statistics and visualizations quickly. Which built-in feature of SageMaker Studio should the analyst use?

A.SageMaker Ground Truth
B.SageMaker Data Wrangler
C.SageMaker Autopilot
D.SageMaker Clarify
AnswerB

Data Wrangler provides visual EDA capabilities like summary stats and charts.

Why this answer

Option B is correct because SageMaker Data Wrangler is a visual data preparation tool integrated into SageMaker Studio that provides summary statistics, histograms, and correlation matrices without code. Option C is wrong because SageMaker Autopilot automates model building, not EDA. Option D is wrong because SageMaker Clarify is for bias detection and explainability.

Option A is wrong because SageMaker Ground Truth is for labeling.

236
MCQmedium

A data scientist is using Amazon SageMaker to train a model. The training dataset contains missing values in several features. The data scientist wants to impute missing values using the median of each feature. Which approach is most appropriate?

A.Drop all rows that contain missing values
B.Compute the median on the entire dataset, then split into training and test sets
C.Impute missing values with zero for all features before splitting
D.Compute the median of each feature on the training set only, then impute both training and test sets using that median
AnswerD

Computing the median on the training set only avoids data leakage.

Why this answer

Option D is correct because the median should be computed on the training set only to avoid data leakage, then applied to both training and test sets. Option C is wrong because imputing with zero may not be appropriate. Option B is wrong because computing the median on the entire dataset and then splitting causes data leakage.

Option A is wrong because dropping rows with missing values may discard useful data and is not imputation.
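
The leakage-free order of operations can be sketched in pandas as follows: split first, compute medians on the training split, then reuse those medians for both splits. The frames and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split that has already been made.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0],
                      "income": [1.0, np.nan, 3.0, 100.0]})
test = pd.DataFrame({"age": [np.nan, 50.0],
                     "income": [np.nan, 2.0]})

# Medians come from the training split only.
train_medians = train.median()

# The same training medians fill gaps in both splits.
train_imputed = train.fillna(train_medians)
test_imputed = test.fillna(train_medians)

print(train_medians)
print(test_imputed)
```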

237
Multi-Selecthard

Which THREE are common techniques for detecting outliers in a univariate dataset? (Select THREE.)

Select 3 answers
A.Cook's distance
B.DBSCAN clustering
C.Z-score
D.Interquartile range (IQR) method
E.Modified Z-score using median absolute deviation (MAD)
AnswersC, D, E

Z-score measures how many standard deviations an observation is from the mean.

Why this answer

Options C, D, and E are correct. Option B is wrong because DBSCAN is a multivariate clustering method. Option A is wrong because Cook's distance is for regression diagnostics.
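
For reference, a modified Z-score based on the median absolute deviation can be sketched as below. The 0.6745 constant and the 3.5 cutoff follow the commonly cited Iglewicz and Hoaglin convention; returning zeros when the MAD is zero is a simplifying assumption of this example.

```python
import numpy as np


def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # Simplification for this sketch: no spread, so nothing is flagged.
        return np.zeros_like(values)
    return 0.6745 * (values - median) / mad


# Hypothetical sample with one extreme value.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
scores = modified_z_scores(data)
outliers = data[np.abs(scores) > 3.5]
print(outliers)
```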

238
Multi-Selecteasy

A data scientist is working with a dataset that contains both numerical and categorical features. The target variable is continuous. Which TWO EDA techniques should the scientist use to understand relationships between features and the target?

Select 2 answers
A.Generate a confusion matrix for the target variable.
B.Compute the silhouette score for each feature.
C.Create scatter plots of numerical features against the target variable.
D.Use box plots to compare target distribution across categorical feature categories.
E.Plot a histogram of the target variable.
AnswersC, D

Reveals linear/nonlinear relationships.

Why this answer

Scatter plots for numerical features vs a continuous target and box plots for categorical features vs a continuous target are standard. Option A (confusion matrix) is for classification. Option E (histogram of target) is univariate.

Option B (silhouette score) is for clustering.

239
Multi-Selectmedium

Which TWO statements about handling missing data during EDA are correct? (Select TWO.)

Select 2 answers
A.Dropping columns with >50% missing values is always recommended.
B.Mean imputation preserves the variance of the original distribution.
C.If data are missing completely at random (MCAR), listwise deletion yields unbiased estimates.
D.Multiple imputation (MICE) is always the safest method regardless of missing data mechanism.
E.Imputing with the median is more robust to outliers than imputing with the mean.
AnswersC, E

Under MCAR, missingness is independent of data, so deletion is unbiased.

Why this answer

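Options C and E are correct. Option D is wrong because MICE is a multivariate imputation method and is not necessarily the safest under every missing data mechanism. Option A is wrong because dropping columns with many missing values is not always recommended; it can discard useful information.

Option B is wrong because mean imputation reduces variance.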
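
A quick numeric check of two claims in this question, on made-up data: filling gaps with the mean shrinks the sample variance, and the median is less affected by an extreme value than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one outlier and two missing values.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan, np.nan])

mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

print("mean:", s.mean(), "median:", s.median())
print("variance before:", s.var())
print("variance after mean imputation:", mean_filled.var())
```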

240
MCQhard

A data scientist is performing EDA on a high-dimensional dataset with 500 features. They want to visualize the data in 2D to check for clusters. They first apply PCA and get a 2D projection that shows no clear structure. They suspect that the data lies on a non-linear manifold. Which of the following techniques should they try next?

A.Use Independent Component Analysis (ICA).
B.Use Linear Discriminant Analysis (LDA).
C.Apply PCA again with more components.
D.Use t-distributed Stochastic Neighbor Embedding (t-SNE).
AnswerD

t-SNE is a non-linear technique that preserves local structure for visualization.

Why this answer

Option D is correct because t-SNE is designed for non-linear dimensionality reduction and visualization. Option C is wrong because PCA is linear, so adding components does not help. Option B is wrong because LDA is supervised.

Option A is wrong because ICA separates independent components and is not intended for visualization.

241
MCQmedium

A data scientist is exploring a dataset and finds that the variance of a feature is 0. What should be done with this feature?

A.Remove the feature from the dataset
B.Create interaction terms with other features
C.Apply Min-Max scaling to normalize the feature
D.Impute missing values using the mean
AnswerA

Constant feature provides no predictive power.

Why this answer

Option A is correct because zero variance means the feature is constant and provides no information for modeling; it should be removed. Option C is wrong because scaling does not change constant values. Option D is wrong because imputation is for missing values, not constant ones.

Option B is wrong because an interaction with a constant feature adds no new information.
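
One simple way to find and drop constant columns in pandas is sketched below; it counts distinct values, so it also works for non-numeric columns. The frame is a made-up example.

```python
import pandas as pd

# Hypothetical frame with one constant column.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "constant": [7, 7, 7],
    "b": ["x", "y", "x"],
})

# A column with at most one distinct value carries no information.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
reduced = df.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```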

242
MCQmedium

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transactions. They notice that the target variable is highly imbalanced: 99% of samples belong to class 0 and 1% to class 1. Which technique should they use to address this imbalance before training a classification model?

A.Train the model on the raw data without any modification.
B.Apply SMOTE to generate synthetic samples for the minority class.
C.Use accuracy as the evaluation metric and train on the raw data.
D.Under-sample the majority class to match the minority class size.
AnswerB

SMOTE creates synthetic minority samples, helping balance the dataset.

Why this answer

Option B is correct because SMOTE generates synthetic samples for the minority class, which is effective for imbalanced datasets. Option C is wrong because accuracy is not a good metric for imbalanced data. Option D is wrong because under-sampling discards majority class data.

Option A is wrong because using raw data without handling imbalance typically leads to poor minority class performance.

243
MCQeasy

A data analyst is examining a scatter plot of two variables and notices a strong positive correlation. Which of the following is a valid conclusion?

A.The relationship is linear
B.One variable causes the other
C.The two variables are related, but causation cannot be inferred
D.The relationship can be used to accurately predict one variable from the other
AnswerC

Correlation does not imply causation.

Why this answer

Option C is correct because correlation indicates a relationship but does not imply causation. Option B is wrong for the same reason: correlation does not imply causation. Option D is wrong because correlation alone does not provide a prediction model.

Option A is wrong because correlation does not guarantee linearity; the relationship could be non-linear.

244
MCQeasy

A data scientist wants to understand the distribution of a continuous feature before training a model. Which visualization is most appropriate?

A.Scatter plot
B.Box plot
C.Histogram
D.Bar chart
AnswerC

Histograms display the frequency distribution of a continuous variable.

Why this answer

A histogram is the standard tool for showing the distribution of a single continuous variable. Option A is wrong because scatter plots compare two variables. Option D is wrong because bar charts are for categorical data.

Option B is wrong because box plots show summary statistics, not the full distribution shape.

245
MCQeasy

A data scientist is analyzing a dataset with 10,000 rows and 50 columns. The target variable is binary. Which technique is most appropriate for identifying the most important features for predicting the target?

A.Use t-SNE to reduce dimensionality and inspect clusters
B.Run K-means clustering and examine cluster centroids
C.Train a Random Forest classifier and use feature_importances_
D.Apply PCA and select components with highest variance
AnswerC

Random Forest provides feature importance scores based on impurity reduction.

Why this answer

Option C is correct because Random Forest feature importance is a well-known method for ranking features in classification tasks. Option D is wrong because PCA is unsupervised and does not use the target. Option B is wrong because K-means is clustering, not feature selection.

Option A is wrong because t-SNE is for visualization, not feature importance.

246
Multi-Selectmedium

Which THREE techniques are commonly used for feature engineering in exploratory data analysis? (Select THREE.)

Select 3 answers
A.Extracting date/time components like day of week or hour.
B.Using principal component analysis (PCA) to create new features.
C.Applying one-hot encoding to numerical features.
D.Creating interaction features between variables.
E.Binning continuous variables into discrete intervals.
AnswersA, D, E

Temporal features often reveal patterns.

Why this answer

Option A is correct because extracting date/time components such as day of week, hour, or month from a timestamp is a standard feature engineering technique. It transforms a single datetime column into multiple categorical or cyclical features that can reveal temporal patterns like weekly seasonality or peak hours, which are often critical for time-series models.

Exam trap

Cisco often tests the distinction between feature engineering (creating new features from existing data) and dimensionality reduction (PCA) or encoding (one-hot encoding), leading candidates to mistakenly select PCA as a feature engineering technique when it is actually a preprocessing step for reducing feature space.
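
Extracting date components is short in pandas. The sketch below assumes a hypothetical `transaction_date` column stored as ISO-formatted strings.

```python
import pandas as pd

# Hypothetical transactions with a string date column.
df = pd.DataFrame({"transaction_date": ["2023-01-01", "2023-02-14", "2023-07-04"]})

dates = pd.to_datetime(df["transaction_date"])
df["day_of_week"] = dates.dt.dayofweek   # Monday=0 ... Sunday=6
df["month"] = dates.dt.month             # 1 to 12

print(df)
```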

247
Multi-Selecthard

A data scientist is exploring a dataset with mixed data types (numeric, categorical, text). The dataset has 5 million rows. The scientist wants to understand the relationships between variables and identify potential data quality issues. Which THREE tools are suitable for this analysis?

Select 3 answers
A.AWS Glue DataBrew
B.AWS Data Pipeline
C.Amazon SageMaker Data Wrangler
D.Amazon Athena
E.Amazon Kinesis Data Analytics
AnswersA, C, D

Data profiling and visualization.

Why this answer

Options A, C, and D are correct. AWS Glue DataBrew can profile data, visualize distributions, and detect anomalies. Amazon SageMaker Data Wrangler provides interactive data preparation and visualization.

Amazon Athena can be used to run SQL queries for data quality checks. Option B is wrong because AWS Data Pipeline is for workflow orchestration, not EDA. Option E is wrong because Amazon Kinesis Data Analytics is for streaming data, not batch EDA.

248
MCQeasy

A data scientist is analyzing a dataset with 100 features and wants to identify which features are most correlated with the target variable. Which AWS service is most appropriate for this task?

A.Amazon QuickSight
B.Amazon Athena
C.AWS Glue DataBrew
D.Amazon SageMaker Data Wrangler
AnswerD

Data Wrangler provides data analysis and feature correlation within SageMaker Studio.

Why this answer

Amazon SageMaker Data Wrangler provides built-in data analysis and visualization capabilities, including correlation analysis, making it suitable for this task. Option A (Amazon QuickSight) is a BI tool for dashboards, not embedded data wrangling. Option C (AWS Glue DataBrew) is oriented toward data preparation and ETL-style jobs.

Option B (Amazon Athena) is for querying data in S3.

249
Multi-Selecthard

A data scientist is performing EDA on a dataset with 10 million rows. The dataset has a column 'income' with outliers. The data scientist wants to detect and handle outliers. Which THREE approaches are appropriate?

Select 3 answers
A.Calculate z-scores and flag values beyond 3 standard deviations
B.Apply min-max scaling to the column
C.Convert the column to one-hot encoding
D.Visualize the distribution with box plots
E.Use the interquartile range (IQR) to identify outliers
AnswersA, D, E

Z-score is a common method.

Why this answer

The IQR method, z-scores, and visualizations such as box plots are standard for outlier detection. Option B is wrong because min-max scaling does not handle outliers. Option C is wrong because one-hot encoding is for categorical data.

250
MCQhard

A machine learning engineer is analyzing a dataset that contains a categorical feature 'country' with 200 unique values. The target variable is binary. The engineer wants to use this feature in a linear model. Which encoding method should be applied during EDA to prepare the data for modeling, considering the high cardinality?

A.Target encoding with cross-validation
B.Label encoding
C.Frequency encoding
D.One-hot encoding
AnswerA

Target encoding captures the relationship with the target, and cross-validation prevents data leakage.

Why this answer

Option A is correct because target encoding (or mean encoding) replaces each category with the mean of the target, which is suitable for high cardinality in linear models but requires careful validation to avoid overfitting. Option D is wrong because one-hot encoding would create 199 dummy variables, leading to high dimensionality. Option B is wrong because label encoding imposes an arbitrary ordinal relationship.

Option C is wrong because frequency encoding may not capture the relationship with the target.
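
One way to read "target encoding with cross-validation" is out-of-fold encoding: each row's category is encoded using target means computed on the other folds only. The sketch below is a hand-rolled illustration with pandas and NumPy; the function name, fold assignment, and global-mean fallback for unseen categories are assumptions of this example, not a specific library's API.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with the target mean of its category from other folds."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    global_mean = target.mean()
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    # Random, roughly equal-sized fold assignment.
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(categories)) % n_splits

    for fold in range(n_splits):
        held_out = fold_ids == fold
        # Category means computed without the held-out rows.
        means = target[~held_out].groupby(categories[~held_out]).mean()
        encoded[held_out] = categories[held_out].map(means)

    # Categories unseen in the other folds fall back to the global mean.
    return encoded.fillna(global_mean)


# Hypothetical toy data: country codes and a binary target.
country = ["US", "US", "DE", "DE", "FR", "US", "DE", "FR", "US", "DE"]
y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
encoded = out_of_fold_target_encode(country, y)
print(encoded.round(2).tolist())
```

Smoothing toward the global mean for rare categories is a common refinement that this sketch leaves out.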

251
Matchingmedium

Match each SageMaker feature to its description.


Concepts
Matches

Managed compute to train a model

Host a model for real-time inference

Run inference on a batch of data

Jupyter notebook for exploration

Run data processing scripts

Why these pairings

These are core components of SageMaker.

252
MCQmedium

A data scientist is performing exploratory data analysis on a dataset with missing values. The dataset contains a column 'income' with 20% missing values. The income distribution is right-skewed. Which imputation method is most appropriate to preserve the skewness?

A.Impute with the mean income
B.Impute with the median income
C.Drop rows with missing income
D.Impute with the mode income
AnswerB

Median is robust to skewness and preserves the distribution shape.

Why this answer

Option B is correct because the median is robust to skewness and preserves the central tendency without affecting the skewness as much as the mean. Option A is wrong because the mean is sensitive to outliers and would reduce skewness. Option D is wrong because the mode is for categorical data.

Option C is wrong because dropping rows reduces sample size.

253
Multi-Selecteasy

A data analyst is exploring a dataset with a binary target variable. Which TWO visualizations are most useful for understanding the relationship between a numerical feature and the target?

Select 2 answers
A.Pie chart of the feature
B.Bar chart of the feature
C.Histogram with overlaid target classes
D.Box plot grouped by target class
E.Scatter plot of the feature versus target
AnswersC, D

Shows how the feature distribution differs by class.

Why this answer

Options C and D are correct. C: A histogram with overlaid target classes shows how the feature distribution differs by class. D: A box plot grouped by target class shows distribution differences.

Option E is incorrect because a scatter plot is for two numerical variables. Option B is incorrect because a bar chart is for categorical features. Option A is incorrect because a pie chart is for proportions, not relationships.

254
MCQeasy

A data scientist wants to understand the distribution of a categorical feature with 100 unique values. Which visualization is most appropriate?

A.Histogram
B.Bar chart
C.Scatter plot
D.Pie chart
AnswerB

Bar charts are ideal for displaying categorical frequencies.

Why this answer

A bar chart is the most appropriate visualization for displaying the distribution of a categorical feature with 100 unique values because it uses discrete bars to represent the frequency or proportion of each category. Unlike a histogram, which requires continuous numeric bins, a bar chart preserves the distinct categories and allows clear comparison of counts across all 100 levels.

Exam trap

Cisco often tests the distinction between histograms (for continuous data) and bar charts (for categorical data), and candidates mistakenly choose histogram because they confuse 'distribution' with 'numeric distribution' without recognizing the categorical nature of the feature.

How to eliminate wrong answers

Option A is wrong because a histogram is designed for continuous numeric data and groups values into bins, which is inappropriate for categorical features and would obscure the distinct categories. Option C is wrong because a scatter plot is used to visualize the relationship between two continuous variables, not the distribution of a single categorical feature. Option D is wrong because a pie chart, while usable for categorical data, becomes unreadable and misleading with 100 unique values due to overlapping small slices and difficulty comparing proportions; bar charts are far superior for many categories.

255
Multi-Selecteasy

Which TWO of the following are benefits of feature scaling for machine learning algorithms?

Select 2 answers
A.Eliminates the effect of outliers
B.Reduces the need for feature selection
C.Improves performance of decision tree algorithms
D.Faster convergence of gradient descent
E.Prevents features with larger magnitudes from dominating distance-based algorithms
AnswersD, E

Scaling ensures all features contribute equally to the gradient.

Why this answer

Option D is correct because feature scaling, typically via standardization (z-score) or min-max normalization, ensures that gradient descent converges faster. Without scaling, features with larger numerical ranges dominate the gradient updates, causing the algorithm to oscillate and require more iterations to reach the optimum. Scaling produces a more spherical contour of the loss function, allowing gradient descent to take more direct steps toward the minimum.

Exam trap

The trap here is that candidates often assume feature scaling universally improves all algorithms, but Cisco specifically tests that tree-based models (like decision trees) are scale-invariant, making option C a common distractor.

256
MCQhard

Refer to the exhibit. A data scientist runs the AWS CLI command shown and gets the output. The scientist wants to create an Athena table over all log files in the 'logs/2023/' prefix, including files smaller than 1000 bytes. Which approach achieves this?

A.Create the table using LOCATION 's3://my-bucket/logs/2023/' which includes all files under that prefix.
B.Create the table and add a WHERE clause to include small files.
C.Ask the S3 team to remove the size restriction on the bucket.
D.Modify the CLI command to remove the size filter and re-run it before creating the table.
AnswerA

The table location covers all files regardless of size.

Why this answer

The CLI command filters objects larger than 1000 bytes, but the scientist wants all files in the prefix. The Athena table definition should point to the entire prefix without size filtering. Option D is wrong because the command was run locally and its filter does not affect Athena.

Option B is wrong because adding a WHERE clause in Athena only filters after scanning. Option C is wrong because there is no size restriction on the bucket; the scientist can create the table over the whole prefix as it is.

257
MCQmedium

An ML team is analyzing a time series dataset of daily website traffic. They notice a pattern where traffic spikes every Sunday. Which EDA technique should they use to confirm this seasonality?

A.Plot the time series data with a line plot
B.Compute autocorrelation at different lags
C.Create a scatter plot of traffic vs. day of week
D.Plot a histogram of the traffic values
AnswerA

A line plot over time directly reveals seasonal patterns.

Why this answer

Option A is correct because a line plot over time clearly shows weekly patterns. Option D is wrong because histograms show the distribution of values, not time patterns. Option B is wrong because autocorrelation measures correlation with lagged values, but visual inspection is more direct for confirming seasonality.

Option C is wrong because scatter plots show relationships between two variables.

258
MCQeasy

A data engineer is querying the AWS Glue Data Catalog table shown in the exhibit. The engineer runs an Athena query: SELECT * FROM transactions WHERE year=2023. The query returns results quickly. However, a subsequent query: SELECT * FROM transactions WHERE amount > 100 takes a long time. What is the most likely reason for the performance difference?

A.The data is compressed, and the first query benefits from compression.
B.The first query uses a partition column (year), allowing partition pruning, while the second query does not.
C.The data is stored in Parquet format, which is optimized for columnar access.
D.The second query is not optimized because it uses 'SELECT *'.
AnswerB

Partition pruning reduces data scanned.

Why this answer

Option B is correct because the table is partitioned by year and month. The first query filters on a partition column (year), so Athena prunes partitions and scans only the relevant data. The second query filters on a non-partition column (amount), so Athena scans all partitions.

Option C is wrong because the data format is text (CSV), not Parquet. Option A is wrong because compression is not mentioned. Option D is wrong because both queries use SELECT *; the difference is that the second query does not filter on a partition column.

259
MCQmedium

A data scientist runs a SageMaker notebook and uses pandas to explore a dataset. The dataset contains 500,000 rows and 20 columns, including a 'timestamp' column. After loading the data into a DataFrame, the memory usage is unexpectedly high. What is the most likely cause?

A.The DataFrame created an index column on the timestamp field, doubling memory usage.
B.The default data types inferred by pandas are unnecessarily large for the actual data ranges.
C.The DataFrame only loaded a sample of the data, but the sample size was too large.
D.The CSV file was compressed, and pandas inflated it in memory.
AnswerB

Pandas uses int64/float64 by default, which can be optimized by downcasting.

Why this answer

Option B is correct because pandas reads numeric columns as int64 or float64 by default, which uses 8 bytes per value. For 500,000 rows, even a single numeric column can consume significant memory. Option A is wrong because pandas does not automatically index columns.

Option C is wrong because pandas loads all rows by default rather than a sample. Option D is wrong because CSV files are plain text and do not have compression overhead.
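
Downcasting with `pd.to_numeric` is one way to shrink default 64-bit columns. The sketch below uses a small synthetic frame; the column names and value ranges are assumptions chosen so that narrower types are safe.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: small integers and modest floats stored as 64-bit types.
df = pd.DataFrame({
    "quantity": np.arange(1000, dtype="int64") % 100,
    "price": np.linspace(0.0, 10.0, 1000),
})
before = df.memory_usage(deep=True).sum()

# Ask pandas for the smallest numeric type that can hold the values.
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"bytes before: {before}, after: {after}")
```

Downcasting floats trades precision for memory, so it is worth checking that the reduced precision is acceptable for the analysis.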

260
MCQhard

A team is analyzing a dataset with many categorical features that have high cardinality (e.g., ZIP code, user ID). They want to explore relationships between these features and a continuous target variable. Which approach is most appropriate for visualizing these relationships without overwhelming the viewer?

A.Group categories into top K levels and use a box plot for each group.
B.Compute a correlation matrix using Pearson correlation.
C.Create a scatter plot with each category as a different color.
D.Use a heatmap to show pairwise chi-square statistics.
AnswerA

Aggregating categories makes the plot interpretable.

Why this answer

Option A is correct because aggregating the categories into groups (e.g., top 10 levels) and then using a bar plot or box plot makes the visualization interpretable. Option C is wrong because scatter plots are not suitable for categorical data. Option D is wrong because a heatmap of pairwise chi-square tests is for categorical-categorical relationships, not categorical-continuous.

Option B is wrong because a correlation matrix with Pearson correlation is for numerical data.
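A minimal sketch of the top-K grouping step, assuming a hypothetical `zip_code` column and `spend` target (both invented for illustration). It keeps the K most frequent levels, folds the rest into 'Other', and prints the per-group summary a box plot would draw.

```python
import pandas as pd

# Hypothetical data: a categorical 'zip_code' column and a continuous target.
df = pd.DataFrame({
    "zip_code": ["10001"] * 5 + ["94105"] * 4 + ["60601"] * 3
                + ["73301", "02108", "98101"],
    "spend": [20, 22, 19, 25, 21, 40, 38, 42, 41, 30, 31, 29, 5, 90, 55],
})

TOP_K = 3  # assumed cutoff; in practice choose it from the frequency table

# Keep the K most frequent levels and relabel everything else as 'Other'.
top_levels = df["zip_code"].value_counts().nlargest(TOP_K).index
df["zip_group"] = df["zip_code"].where(df["zip_code"].isin(top_levels), "Other")

# Five-number summary per group: the same numbers a box plot visualizes.
summary = df.groupby("zip_group")["spend"].describe()[["min", "25%", "50%", "75%", "max"]]
print(summary)
```

With matplotlib installed, `df.boxplot(column="spend", by="zip_group")` would draw the corresponding plot.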

261
MCQeasy

A data scientist is performing EDA on a dataset that contains customer demographics and purchase history. The dataset has a column 'age' with some values that are negative or unreasonably high (e.g., 200). The scientist wants to identify and handle these outliers. The scientist is using a SageMaker notebook with pandas. Which approach should the scientist take to effectively handle these outliers?

A.Apply standard scaling to the 'age' column
B.Impute the outlier values with the mean of the column
C.Define reasonable bounds based on domain knowledge and filter or cap the outliers
D.Remove the 'age' column entirely
AnswerC

Domain knowledge provides logical bounds to handle outliers appropriately.

Why this answer

Using domain knowledge to define a valid age range (e.g., 0-120) and filtering out or capping outliers is the most appropriate approach. Option D is wrong because removing the entire column loses information. Option B is wrong because imputing with the mean distorts the distribution if outliers are extreme.

Option A is wrong because standard scaling does not handle outliers; the scaled values are still affected by them.
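The two handling options can be sketched in pandas as follows. The sample values are invented, and the 0-120 bounds simply follow the example range given in the explanation.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, -3, 47, 200, 61, 130, 0]})

# Assumed domain bounds for a plausible human age.
AGE_MIN, AGE_MAX = 0, 120

is_valid = df["age"].between(AGE_MIN, AGE_MAX)  # inclusive on both ends

filtered = df[is_valid]                                 # option 1: drop invalid rows
capped = df["age"].clip(lower=AGE_MIN, upper=AGE_MAX)   # option 2: cap to the bounds

print(filtered["age"].tolist())
print(capped.tolist())
```

Filtering discards the suspect rows entirely, while capping keeps them but replaces the value with the nearest bound; which is preferable depends on whether the rest of the row is trusted.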

262
MCQmedium

A data scientist is analyzing application logs in JSON format. Based on the exhibit, which EDA insight is most valuable for troubleshooting?

A.There is a recurring NullPointerException error.
B.All logs occurred at the same timestamp.
C.There is a connection timeout issue.
D.Most logs are at WARN level.
AnswerA

Three out of four logs are the same error, indicating a pattern.

Why this answer

Option A is correct because the repeated NullPointerException suggests a recurring issue that needs immediate attention. Option C is wrong because the connection timeout appears only once. Option B is wrong because there are multiple timestamps.

Option D is wrong because WARN level appears only once.

263
MCQhard

During EDA, a data scientist plots the distribution of a numeric feature and observes that it is right-skewed. The feature will be used as input to a linear model. Which transformation should the data scientist apply?

A.Square transformation
B.Log transformation
C.One-hot encoding
D.Standardization (Z-score)
AnswerB

Log transformation compresses the tail and reduces right skewness.

Why this answer

A right-skewed distribution indicates that the feature has a long tail on the right, which can violate the linear model assumption of normally distributed errors. The log transformation compresses the high values and expands the low values, making the distribution more symmetric and stabilizing variance, which improves linear model performance.

Exam trap

Cisco often tests the misconception that standardization or scaling fixes skewness, but candidates must remember that only shape-altering transformations like log or Box-Cox address non-normality, not just rescaling.

How to eliminate wrong answers

Option A is wrong because a square transformation amplifies skewness by increasing the spread of high values, making the distribution even more right-skewed. Option C is wrong because one-hot encoding is used for categorical features, not for transforming the distribution of numeric features. Option D is wrong because standardization (Z-score) centers and scales the data but does not change the shape of the distribution, so it does not address skewness.
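To make the contrast between the log transformation and standardization concrete, here is a small sketch on a synthetic lognormal feature (the data is generated, not from any real dataset). It compares the sample skewness of the raw, log-transformed, and standardized versions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature; stands in for the real column.
rng = np.random.default_rng(0)
feature = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

# log1p(x) = log(1 + x), which stays defined when x is 0.
logged = np.log1p(feature)

# Standardization only shifts and rescales, so the shape is unchanged.
standardized = (feature - feature.mean()) / feature.std()

print(f"raw skew:          {feature.skew():.2f}")
print(f"log1p skew:        {logged.skew():.2f}")
print(f"standardized skew: {standardized.skew():.2f}")
```

The standardized skewness matches the raw skewness up to floating-point error, while the log-transformed skewness is much closer to zero.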

264
Multi-Selectmedium

A data scientist is performing EDA on a dataset with mixed data types (numerical and categorical). Which TWO visualizations are most appropriate for understanding the distribution of categorical features?

Select 2 answers
A.Histogram
B.Box plot
C.Pie chart
D.Scatter plot
E.Bar chart
AnswersC, E

Pie charts show proportions of categories.

Why this answer

A bar chart shows the count of each category, and a pie chart shows the proportion. Both are suitable for categorical data. Options A, B, and D are for numerical data.

265
MCQeasy

A data analyst is exploring a dataset and notices that the target variable has a Poisson distribution. Which type of model is most appropriate for this target?

A.Poisson regression
B.Linear regression
C.Cox proportional hazards model
D.Logistic regression
AnswerA

Poisson regression models count data with Poisson distribution.

Why this answer

Poisson regression is the correct choice because it is specifically designed for modeling count data where the target variable follows a Poisson distribution, which is characterized by non-negative integer values and a variance equal to the mean. This aligns directly with the data analyst's observation of a Poisson-distributed target, making Poisson regression the most appropriate generalized linear model (GLM) for this scenario.

Exam trap

The trap here is that candidates may confuse Poisson regression with logistic regression or linear regression, mistakenly applying a model for binary outcomes or continuous data to count data, without recognizing that the Poisson distribution's unique properties require a specialized GLM.

How to eliminate wrong answers

Option B is wrong because linear regression assumes a normally distributed target variable with constant variance, which is violated when the target follows a Poisson distribution (count data with variance equal to the mean). Option C is wrong because Cox proportional hazards model is a survival analysis technique for time-to-event data with censoring, not for modeling a Poisson-distributed count target. Option D is wrong because logistic regression models binary or ordinal outcomes using a logit link function, not count data with a Poisson distribution.
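One quick EDA check that a count target is plausibly Poisson is to compare its mean and variance. The sketch below uses synthetic counts (the rate of 3.0 is an arbitrary assumption) and computes the dispersion ratio, which should be close to 1 for Poisson-like data.

```python
import numpy as np

rng = np.random.default_rng(42)
counts = rng.poisson(lam=3.0, size=100_000)  # synthetic count target

mean = counts.mean()
variance = counts.var()
dispersion = variance / mean  # near 1 suggests the Poisson assumption is reasonable

print(f"mean={mean:.3f} variance={variance:.3f} dispersion={dispersion:.3f}")
```

A ratio well above 1 would point to overdispersion, in which case a plain Poisson model may fit poorly.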

266
MCQmedium

A data scientist runs the above AWS CLI command and gets the output. The object size is 1 GB. They try to open the CSV file in Amazon Athena but get an error. What is the most likely cause?

A.The file format is not supported by Athena
B.The file exceeds the maximum CSV file size that Athena can query without partitioning
C.The file is not compressed with gzip
D.The file is too large for Athena to query at all
AnswerB

Athena has a 100 MB limit for CSV files when not partitioned.

Why this answer

Option B is correct because Athena has a 100 MB per file limit for CSV queries without partitioning. Option C is wrong because gzip compression is supported but not required. Option D is wrong because 1 GB is large but Athena can handle larger files with proper partitioning.

Option A is wrong because CSV is supported.

267
MCQmedium

A data scientist is performing exploratory data analysis on a dataset containing customer transactions. The dataset has a column 'transaction_date' with timestamps in string format. Which AWS service can be used to parse the timestamps and extract features like day of week and hour?

A.Amazon Athena
B.Amazon SageMaker Studio
C.AWS Glue
D.AWS Data Pipeline
AnswerC

AWS Glue provides built-in transformations for timestamp parsing and feature extraction.

Why this answer

Option C is correct because AWS Glue provides built-in transformations to parse timestamps and extract date/time features. Option A is wrong because Amazon Athena is a query service, not a transformation service. Option B is wrong because Amazon SageMaker Studio is an IDE, not a data transformation service.

Option D is wrong because AWS Data Pipeline is a workflow orchestration service, not a timestamp parsing tool.
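For intuition about the feature extraction itself, here is what the same step looks like in pandas. This is not AWS Glue's API, only a local sketch with invented timestamps in an assumed `YYYY-MM-DD HH:MM:SS` format.

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_date": [
        "2024-01-15 09:30:00",
        "2024-01-20 18:05:00",
        "2024-02-04 23:59:59",
    ],
})

# Parse the strings into datetimes using an explicit format.
ts = pd.to_datetime(df["transaction_date"], format="%Y-%m-%d %H:%M:%S")

df["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
df["day_name"] = ts.dt.day_name()
df["hour"] = ts.dt.hour
print(df)
```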

268
Multi-Selecteasy

A data analyst is performing exploratory data analysis on a dataset and notices that there are outliers in several numerical columns. Which TWO methods can the analyst use to identify outliers?

Select 2 answers
A.Create a scatter plot matrix to visually inspect.
B.Calculate z-scores and flag any data points with |z| > 3.
C.Use a box plot to visualize the interquartile range (IQR) and identify points outside the whiskers.
D.Compare the mean and median of each column.
E.Plot a histogram and look for gaps.
AnswersB, C

Z-scores provide a statistical threshold for outliers.

Why this answer

Options B and C are correct. Box plots use the IQR to identify outliers as points more than 1.5*IQR beyond the quartiles. Z-scores identify outliers as points with |z| > 3 (assuming normal distribution).

Option D is wrong because mean and median are measures of central tendency, not outlier detection. Option E is wrong because histograms show distribution shape but do not explicitly identify outliers. Option A is wrong because pairwise scatter plots may show outliers but are not a systematic method.
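Both rules can be written in a few lines of pandas. The sketch below uses an invented `amount` column with one extreme value and applies the |z| > 3 rule and the 1.5*IQR rule side by side.

```python
import pandas as pd

# Hypothetical column: 29 typical values plus one extreme value.
values = pd.Series([50 + (i % 5) for i in range(29)] + [500], name="amount")

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# IQR rule (the box plot whisker rule): flag points beyond 1.5*IQR from the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers.tolist())
print(iqr_outliers.tolist())
```

On small samples the z-score rule can miss outliers because a single extreme value inflates the standard deviation, which is one reason the IQR rule is often preferred for skewed data.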

269
Drag & Dropmedium

Drag and drop the steps to create a data processing job using Amazon SageMaker Processing in the correct order.

Steps
Order

Why this order

Processing requires script creation, data upload, job configuration, execution, and verification.

270
MCQmedium

A data scientist uses Amazon QuickSight to visualize a dataset and observes that a numerical feature has a skewness of 2.5 and a kurtosis of 8. Which transformation should they apply to make the distribution more normal?

A.Standardize the feature using Z-score normalization.
B.Apply a Box-Cox transformation with lambda=0.5.
C.Apply Min-Max scaling to the range [0,1].
D.Apply a log transformation.
AnswerD

Log transformation reduces right skewness.

Why this answer

Option D is correct because a skewness of 2.5 indicates right skew, and a log transformation is commonly used to reduce skewness. Option A is incorrect because standardization does not change distribution shape. Option C is incorrect because Min-Max scaling does not change skewness.

Option B is not the best choice: a Box-Cox transformation requires positive data, and with lambda=0.5 it is essentially a square-root transform, which reduces skew less than the log (the lambda=0 case). Box-Cox is a valid, more general solution, but in the context of this question, log is the most direct answer.

271
MCQmedium

A data scientist is working with a dataset that contains both numerical and categorical features. During EDA, they want to understand the relationship between a categorical feature with 10 unique values and the target variable. Which visualization is most appropriate?

A.Heatmap
B.Box plot
C.Histogram
D.Scatter plot
AnswerB

Box plot shows target distribution across categories.

Why this answer

A box plot shows the distribution of the target across categories. Option D is wrong because scatter plots are for two numerical features. Option C is wrong because a histogram shows the distribution of a single numerical feature.

Option A is wrong because heatmaps typically show correlation between numerical features.

272
MCQhard

A data scientist is analyzing a dataset with a large number of missing values in several columns. The dataset is stored in an Amazon S3 bucket and is about 5 TB in size. The scientist wants to understand the pattern of missingness (e.g., is it missing completely at random, missing at random, or not missing at random) before deciding on an imputation strategy. The scientist has access to AWS Glue DataBrew and Amazon SageMaker Studio. Which approach should the scientist take to best understand the missing data patterns?

A.Use Amazon SageMaker Data Wrangler to create a flow and analyze missingness visually
B.Use AWS Glue DataBrew's data quality and missing data reports
C.Use AWS Glue ETL jobs with PySpark to compute missingness statistics
D.Use Amazon Athena to run queries to find missing values per column
AnswerB

DataBrew's reports visualize missing data patterns and correlations.

Why this answer

AWS Glue DataBrew provides a missing data report that includes patterns and correlations of missingness, such as heatmaps and bar charts. This helps determine the type of missingness. Option A is wrong because SageMaker Data Wrangler does not have built-in missingness pattern analysis.

Option D is wrong because Athena is for querying, not pattern analysis. Option C is wrong because Glue ETL jobs require custom coding and are less efficient for exploratory analysis.
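To show what "patterns and correlations of missingness" means in practice, here is a small pandas sketch on an invented sample (not DataBrew's API or output). It computes the missing rate per column and the correlation between missingness indicators.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in practice this would be a sample of the 5 TB dataset.
df = pd.DataFrame({
    "income": [50.0, np.nan, 62.0, np.nan, 48.0, np.nan],
    "employer": ["a", None, "b", None, "c", None],
    "age": [30, 41, np.nan, 29, 35, 52],
})

missing = df.isna()                        # True where a value is missing
rate = missing.mean()                      # fraction missing per column
co_missing = missing.astype(int).corr()    # correlation of missingness indicators

print(rate)
print(co_missing)
```

Columns whose indicators are strongly correlated tend to go missing together, which is evidence against the data being missing completely at random.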

273
Multi-Selecthard

Which TWO of the following are best practices for exploratory data analysis when using Amazon SageMaker Data Wrangler? (Select TWO.)

Select 2 answers
A.Store all intermediate results in Amazon Athena for querying.
B.Use Data Wrangler's built-in data visualizations to explore feature distributions and relationships.
C.Use Amazon EMR to run Spark jobs for data profiling.
D.Always export the data to Amazon QuickSight for analysis before transformation.
E.Export the Data Wrangler flow as a Jupyter notebook to share with the team.
AnswersB, E

Built-in visualizations enable quick EDA.

Why this answer

Using Data Wrangler's built-in visualizations for quick analysis and exporting the flow as a Jupyter notebook for reproducibility are best practices. Option D (QuickSight) is a separate service and not a required step before transformation. Option C (EMR) is not needed.

Option A (Athena) is a query service, not a store for intermediate results.

274
Multi-Selecthard

A machine learning team is analyzing a dataset with 10,000 rows and 200 features. They suspect data leakage due to time-based features. Which THREE EDA checks should they perform?

Select 3 answers
A.Plot distribution of each feature in training vs. test sets
B.Apply PCA and check if first two components separate train/test
C.Check whether the dataset is sorted by time and if any feature uses future information
D.Compare feature correlations with target in training and test sets
E.Perform k-means clustering on the whole dataset
AnswersA, C, D

Train/test comparisons and a time-order check surface leakage.

Why this answer

Option D is correct because a feature whose correlation with the target differs between training and test sets may indicate leakage. Option C is correct because checking the time ordering (a chronological split) can reveal if future data leaks into training. Option A is correct because distribution differences between train and test sets can indicate leakage (e.g., train has future data).

Option E is wrong because clustering is not directly helpful for leakage detection. Option B is wrong because PCA is for dimensionality reduction, not leakage detection.

275
Multi-Selectmedium

Which TWO actions are appropriate during exploratory data analysis when you discover that a categorical feature has 50 unique values (high cardinality)?

Select 2 answers
A.Group rare categories into a single 'Other' category.
B.Apply one-hot encoding to create 50 dummy variables.
C.Apply label encoding to assign integers to each category.
D.Drop the feature entirely.
E.Use feature hashing (hashing trick) to reduce dimensionality.
AnswersA, E

Reduces cardinality while keeping most information.

Why this answer

Options A and E are correct. A: Grouping rare categories into an 'Other' category reduces cardinality while preserving information. E: Using feature hashing can transform high-cardinality categorical features into a fixed-size vector.

Option B is incorrect because one-hot encoding creates many columns, which can be problematic. Option D is incorrect because dropping the feature may lose important information. Option C is incorrect because label encoding implies ordinality, which may not exist.
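The hashing trick can be sketched with the standard library alone. The helper name `hash_bucket`, the bucket count of 8, and the `store_*` category names below are all illustrative assumptions.

```python
import hashlib

N_BUCKETS = 8  # assumed size of the hashed feature space


def hash_bucket(category: str, n_buckets: int = N_BUCKETS) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


categories = [f"store_{i}" for i in range(50)]  # 50 hypothetical levels
buckets = [hash_bucket(c) for c in categories]

print(sorted(set(buckets)))
```

Because 50 levels are mapped onto 8 buckets, distinct categories will collide; that loss of resolution is the price of a fixed-size representation.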

276
MCQhard

A data scientist is analyzing a dataset with a binary target variable. The dataset is highly imbalanced (99% negative class). Which metric is most appropriate for evaluating the model's performance during exploratory data analysis?

A.Accuracy
B.Precision
C.F1 Score
D.Area Under the ROC Curve (AUC-ROC)
AnswerD

AUC-ROC is insensitive to class imbalance and provides a global measure of performance.

Why this answer

The Area Under the ROC Curve (AUC-ROC) is robust to class imbalance and measures the trade-off between true positive rate and false positive rate across all thresholds. Accuracy is misleading because predicting the negative class for every sample already scores 99%. F1 score depends on a single chosen threshold.

Precision on its own ignores false negatives, so it is less comprehensive.

277
MCQhard

The exhibit shows an Athena query result from a table. What is the output of the query?

A.3, 3, 4
B.2, 3, 3
C.2, 4, 4
D.2, 4, 3
AnswerC

Correct counts: col2 non-null=2, rows=4, distinct col1=4.

Why this answer

Option C is correct. COUNT(col2) counts non-null values in col2: rows 1 and 3 have non-null values, so 2. COUNT(*) counts all rows: 4. COUNT(DISTINCT col1) counts distinct col1 values: A, B, C, D = 4.

Option A is wrong because COUNT(col2) is 2, not 3. Option B is wrong because COUNT(*) is 4, not 3.

Option D is wrong because COUNT(DISTINCT col1) is 4, not 3.
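The three counting semantics have direct pandas analogues. The exhibit is not reproduced here, so the frame below is a reconstruction from the explanation: the col1 values and NULL positions follow the text, while the non-null col2 values (10 and 30) are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A", "B", "C", "D"],
    "col2": [10, None, 30, None],  # rows 1 and 3 non-null, rows 2 and 4 NULL
})

count_col2 = int(df["col2"].count())   # COUNT(col2): non-null values only
count_star = len(df)                   # COUNT(*): all rows
count_distinct = df["col1"].nunique()  # COUNT(DISTINCT col1)

print(count_col2, count_star, count_distinct)  # prints: 2 4 4
```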

278
MCQmedium

The exhibit shows the result of an Athena query. What does the value '5000' represent?

A.The total number of rows in the table
B.The number of rows where col1 is NULL
C.The number of rows where col1 is not NULL
D.The number of distinct values in col1
AnswerB

The query counts rows with NULL in col1.

Why this answer

Option B is correct because the query counts rows where col1 IS NULL, and the result '5000' is that count. Option D is wrong because the query is not selecting distinct values. Option C is wrong because the query counts NULL rows, not non-NULL.

Option A is wrong because total row count was not queried.

279
MCQhard

A data scientist is exploring a dataset with 500 features and 10,000 samples. The data scientist computes the pairwise correlation matrix and finds that many features have correlations above 0.9. The data scientist wants to reduce the dataset to 50 features while preserving as much variance as possible. Which technique should be used?

A.Remove all but one feature from each group of highly correlated features.
B.Apply Principal Component Analysis (PCA) and keep the top 50 principal components.
C.Use Linear Discriminant Analysis (LDA) to project to 50 dimensions.
D.Use t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce to 50 dimensions.
AnswerB

PCA finds orthogonal directions of maximum variance and can reduce dimensionality effectively.

Why this answer

Principal Component Analysis (PCA) is the correct technique because it performs an orthogonal linear transformation that projects the original 500 features into a new coordinate system where the axes (principal components) are ordered by the variance they capture. By keeping the top 50 principal components, the data scientist retains the maximum possible variance in the reduced 50-dimensional space, directly addressing the goal of preserving variance while handling high multicollinearity.

Exam trap

Cisco often tests the distinction between unsupervised variance-preserving techniques (PCA) and supervised or visualization-specific techniques (LDA, t-SNE), leading candidates to mistakenly choose LDA for dimensionality reduction without recognizing its supervised nature and dimension limit.

How to eliminate wrong answers

Option A is wrong because simply removing all but one feature from each group of highly correlated features is a heuristic that does not guarantee preserving maximum variance; it discards potentially useful information and does not leverage the correlation structure to create new, uncorrelated features. Option C is wrong because Linear Discriminant Analysis (LDA) is a supervised technique that requires class labels to maximize class separability, not variance preservation, and it can project to at most (number of classes - 1) dimensions, which is typically far fewer than 50. Option D is wrong because t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear, stochastic dimensionality reduction technique primarily used for visualization of high-dimensional data in 2 or 3 dimensions; it does not preserve global variance structure and is not suitable for reducing to 50 dimensions while retaining maximum variance.
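PCA can be sketched with NumPy's SVD on centred data. The synthetic matrix below (1,000 samples, 100 correlated features, 20 kept components) is a scaled-down stand-in for the 10,000 x 500 dataset in the question; all sizes are assumptions chosen to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 features driven by 10 latent factors plus small noise,
# so many columns are highly correlated with each other.
latent = rng.normal(size=(1_000, 10))
X = latent @ rng.normal(size=(10, 100)) + 0.05 * rng.normal(size=(1_000, 100))

k = 20  # number of principal components to keep

Xc = X - X.mean(axis=0)                         # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:k].T                       # scores on the top-k components

explained = (S ** 2) / np.sum(S ** 2)           # variance ratio per component
print(X_reduced.shape, f"{explained[:k].sum():.4f}")
```

The rows of `Vt` are the principal directions ordered by singular value, so keeping the first `k` retains the most variance any `k`-dimensional linear projection can.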

280
Matchingmedium

Match each ML model evaluation concept to its definition.

Concepts
Matches

Model performs well on training data but poorly on unseen data

Model fails to capture underlying patterns in data

Error from wrong assumptions in the learning algorithm

Error from sensitivity to small fluctuations in training data

Balance between underfitting and overfitting

Why these pairings

These are fundamental concepts in model evaluation.

281
MCQhard

A company uses Amazon SageMaker to train a regression model. After training, the data scientist notices that the training loss decreases but validation loss increases after a few epochs. Which EDA technique could have helped predict this behavior?

A.Create box plots of each feature to identify outliers
B.Plot learning curves showing training and validation loss over epochs
C.Generate residual plots to check heteroscedasticity
D.Plot confusion matrix on the validation set
AnswerB

Learning curves expose diverging training and validation loss.

Why this answer

Option B is correct because plotting learning curves (training and validation loss vs. epochs) would show overfitting. Option D is wrong because a confusion matrix is for classification. Option A is wrong because box plots show outliers but not overfitting.

Option C is wrong because residual plots are for linear regression assumptions.

282
Multi-Selectmedium

A data scientist is using Amazon SageMaker to perform exploratory data analysis on a dataset with missing values and outliers. Which TWO actions should the scientist take to understand the data quality? (Choose TWO.)

Select 2 answers
A.Build a scatterplot matrix to visualize pairwise relationships
B.Use histograms to visualize the distribution of each numerical feature
C.Plot a confusion matrix to assess class separation
D.Create a correlation matrix to identify redundant features
E.Generate summary statistics using df.describe() in a SageMaker notebook
AnswersB, E

Histograms reveal outliers, skewness, and missing data patterns (e.g., zero counts).

Why this answer

Option E is correct because generating summary statistics helps identify missing counts and outliers via min/max. Option B is correct because visualizing distributions with histograms helps spot outliers and skewness. Option D is wrong because a correlation matrix does not directly show missing values or outliers.

Option C is wrong because a confusion matrix is for classification models, not for data exploration. Option A is wrong because a scatterplot matrix shows pairwise relationships, not missing values.
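A minimal sketch of the summary-statistics step, using an invented two-column frame. `describe()` reports the non-null count and min/max per column, and `isna().sum()` gives the missing count directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": [12.5, 14.0, np.nan, 13.2, 950.0],  # one missing value, one extreme value
    "items": [1, 2, 2, np.nan, 3],
})

stats = df.describe()      # count, mean, std, min, quartiles, max (non-null values only)
missing = df.isna().sum()  # number of missing values per column

print(stats)
print(missing)
```

A `count` below the row total signals missing values, and a `max` far above the 75% quartile (950.0 here) hints at an outlier. With matplotlib installed, `df.hist(bins=30)` would draw the histograms.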

283
Multi-Selectmedium

Which THREE techniques are commonly used to detect outliers in a dataset? (Select THREE.)

Select 3 answers
A.Interquartile range (IQR)
B.k-means clustering
C.Principal component analysis (PCA)
D.Z-score
E.Isolation Forest
AnswersA, D, E

Z-score, IQR, and Isolation Forest are standard outlier detection techniques.

Why this answer

Options A, D, and E are correct. Z-score and IQR are standard statistical methods, and Isolation Forest is a machine learning algorithm for anomaly detection. Option C is wrong because PCA is primarily a dimensionality reduction technique; it can support outlier detection in some contexts but is not a common choice.

Option B is wrong because k-means clustering is for clustering, not specifically for outlier detection.

284
MCQhard

A team is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a large dataset stored in S3. The dataset contains missing values, outliers, and categorical variables with high cardinality. The team wants to understand data distributions and relationships before modeling. Which combination of Data Wrangler features should they use?

A.Generate a data quality report, view histograms, and create scatter plots for selected features.
B.Drop rows with missing values and visualize box plots for numerical features.
C.Use imputation to handle missing values and one-hot encoding for categorical features.
D.Generate a data quality report and a correlation heatmap.
AnswerA

Data quality report provides summary statistics and missing values; histograms and scatter plots show distributions and relationships.

Why this answer

Option A is correct because Data Wrangler's data quality report provides summary statistics and missing value analysis, the histogram visualization shows distributions, and scatter plots reveal relationships between variables. Option D is incorrect because Data Wrangler does not include correlation heatmaps directly.

Option C is incorrect because imputation and one-hot encoding are transformations, not EDA steps. Option B is incorrect because dropping rows with missing values is data preparation, not initial EDA.

285
MCQhard

A data scientist is performing exploratory data analysis on a large dataset stored in Amazon S3 (100 GB, CSV format, 500 columns). The dataset contains customer transaction records with features such as transaction amount, timestamp, customer ID, and numerous categorical variables (e.g., product category, payment method, location). The scientist wants to understand the distribution of transaction amounts across different product categories and identify any outliers. They have an Amazon SageMaker notebook instance running on an ml.t3.medium instance and are using pandas. However, when trying to load the entire dataset into a DataFrame using pd.read_csv('s3://bucket/data.csv'), the notebook crashes with a memory error. Additionally, the scientist suspects that some categorical columns have high cardinality (e.g., product category has thousands of unique values), and there are missing values in several columns. What is the MOST efficient approach to perform the EDA without modifying the original dataset or using additional AWS services?

A.Use the SageMaker SDK to launch a parallel processing job with PySpark and read the data into a Spark DataFrame, then compute statistics and visualize with matplotlib.
B.Use the S3 Select API to filter rows and columns before loading into pandas, reducing the data size; then use pandas for EDA.
C.Use SageMaker Data Wrangler to import the dataset, create a flow to handle missing values and reduce cardinality, and export a sample to the notebook for analysis.
D.Use pandas with chunksize parameter to iterate through the dataset in chunks, compute per-chunk statistics, and aggregate results; for high-cardinality columns, use value_counts() with dropna=False and then plot the top 20 categories.
AnswerD

Directly solves memory issue by chunking; handles high cardinality by limiting to top categories; no extra services needed.

Why this answer

Option D is correct because it directly addresses the memory issue by processing data in chunks and handles high-cardinality categorical columns by focusing on top categories, all within the existing pandas environment without additional services. Option A requires PySpark, which is not set up on the current instance and adds complexity. Option B, S3 Select, can reduce data size but cannot perform the aggregation needed (e.g., distribution across categories) without pulling all rows; it's more suitable for simple filtering.

Option C, SageMaker Data Wrangler, is a separate service that requires additional setup and is not the most efficient for an ad-hoc EDA; it also modifies the workflow.
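The chunked approach can be sketched as below. A tiny in-memory CSV stands in for the S3 object, and the column names, chunk size, and the '<missing>' label are illustrative assumptions; relabelling missing categories keeps them visible in the counts, which is the role `dropna=False` plays in the option text.

```python
import io
from collections import Counter

import pandas as pd

# Small in-memory CSV standing in for the 100 GB file on S3.
rows = [("books", 10), ("toys", 5), ("books", 30), ("", 7),
        ("games", 12), ("books", 20), ("toys", 15)]
csv_text = "product_category,amount\n" + "\n".join(f"{c},{a}" for c, a in rows)

counts = Counter()  # running category counts across chunks
total, n = 0.0, 0   # running sum and row count for the overall mean

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3,
                         usecols=["product_category", "amount"]):
    # Relabel missing categories so they are counted rather than dropped.
    cats = chunk["product_category"].fillna("<missing>")
    counts.update(cats.value_counts().to_dict())
    total += chunk["amount"].sum()
    n += len(chunk)

top_categories = counts.most_common(20)  # top 20 levels for plotting
mean_amount = total / n

print(top_categories)
print(f"mean amount: {mean_amount:.2f}")
```

Only one chunk is held in memory at a time, and `usecols` avoids parsing the hundreds of columns that the analysis does not need.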

286
MCQhard

A data scientist creates the above IAM policy and attaches it to a role used by an Amazon SageMaker notebook instance. When trying to save a file to the S3 bucket, the operation fails. What is the missing permission?

A.kms:Decrypt
B.s3:ListBucket
C.kms:GenerateDataKey
D.s3:GetObject
AnswerC

If the bucket uses SSE-KMS, PutObject requires kms:GenerateDataKey to encrypt the object.

Why this answer

Option C is correct because, if the bucket uses SSE-KMS, writing an object requires kms:GenerateDataKey in addition to s3:PutObject so that the object can be encrypted. The failure on save therefore points to missing encryption permissions. Option B is wrong because s3:ListBucket is for listing.

Option A is wrong because kms:Decrypt is for reading. Option D is wrong because s3:GetObject is for reading.

287
MCQhard

A data scientist is examining a dataset for a binary classification problem. The target variable has a 1:1000 imbalance. Which technique should be used to assess model performance during exploratory data analysis?

A.Area under the Precision-Recall curve
B.F1 score
C.Area under the ROC curve
D.Cohen's kappa
AnswerA

PR AUC is sensitive to class imbalance and focuses on the positive class.

Why this answer

With a 1:1000 class imbalance, the positive class is extremely rare. The Area Under the Precision-Recall curve (AUPRC) focuses on the performance of the positive class and is sensitive to changes in precision and recall, making it a robust metric for imbalanced datasets. Unlike ROC AUC, which can be overly optimistic when negatives dominate, AUPRC provides a realistic assessment of model performance on the minority class.

Exam trap

The trap here is that candidates often default to ROC AUC as the universal metric for classification, not realizing that in extreme imbalance, ROC AUC can be misleadingly high because the false positive rate is diluted by the vast number of true negatives.

How to eliminate wrong answers

Option B (F1 score) is wrong because it is a threshold-dependent metric that evaluates a single point on the precision-recall curve, not the overall performance across all thresholds, and it can be misleading when comparing models without a fixed threshold. Option C (Area under the ROC curve) is wrong because ROC AUC is insensitive to class imbalance; it treats false positive rate (which is dominated by the majority class) equally, often yielding deceptively high scores even when the model fails to identify the minority class. Option D (Cohen's kappa) is wrong because it measures inter-rater agreement adjusted for chance, which is not a standard metric for binary classification model evaluation and does not specifically address the imbalance problem.

288
Multi-Selecteasy

A data scientist wants to identify outliers in a dataset. Which TWO techniques are commonly used for outlier detection during EDA?

Select 2 answers
A.Box plot
B.Heatmap
C.Z-score analysis
D.Bar chart
E.Pearson correlation coefficient
AnswersA, C

Box plots show outliers as points outside the whiskers.

Why this answer

Option A (box plot) identifies outliers as points beyond the whiskers. Option C (Z-score) flags points with |Z| > 3. Option D is wrong because bar charts are for categorical data.

Option E is wrong because the Pearson correlation coefficient measures linear relationship. Option B is wrong because heatmaps show correlations.

289
MCQeasy

A data scientist is performing exploratory data analysis on a dataset stored in Amazon S3 using Amazon SageMaker Studio. The dataset has missing values in several columns. Which approach is the MOST efficient way to handle missing values within SageMaker Studio?

A.Run a Jupyter notebook on a local machine to clean the data and upload back to S3.
B.Use SageMaker Data Wrangler to impute missing values with mean, median, or mode.
C.Use AWS Glue to run a find-and-replace operation.
D.Write a custom Python script using pandas to drop rows with missing values.
AnswerB

Data Wrangler provides a visual interface for imputation.

Why this answer

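Option B is correct because SageMaker Data Wrangler provides a visual interface inside SageMaker Studio for handling missing values. Option D is wrong because writing a custom pandas script is less efficient, and dropping rows discards data. Option C is wrong because AWS Glue is a separate service outside SageMaker Studio.

Option A is wrong because cleaning the data in a notebook on a local machine and uploading it back to S3 is not efficient.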

290
MCQeasy

A data scientist is performing exploratory data analysis on a dataset with missing values. The dataset contains a column 'age' with some missing entries. Which technique is most appropriate for imputing missing values in the 'age' column if the data is normally distributed?

A.Drop all rows with missing values.
B.Replace missing values with the mode of the column.
C.Replace missing values with the mean of the column.
D.Replace missing values with the median of the column.
AnswerC

Mean imputation is suitable for normally distributed data.

Why this answer

Option C is correct because mean imputation is appropriate for normally distributed data. Option D is wrong because median imputation is better suited to skewed data. Option B is wrong because mode imputation is intended for categorical data.

Option A is wrong because dropping rows reduces the sample size.
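
The difference between mean and median imputation can be sketched as follows. This is a minimal illustration using only the Python standard library; the `impute` helper and the sample lists are hypothetical and are not part of the question.

```python
import statistics

def impute(values, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical, roughly symmetric 'age' sample: mean and median agree.
ages = [25, 30, None, 35, 40]
ages_mean = impute(ages, "mean")

# Hypothetical skewed sample: one large value pulls the mean upward,
# while the median stays near the bulk of the data.
skewed = [20, 22, 25, None, 95]
skewed_mean = impute(skewed, "mean")
skewed_median = impute(skewed, "median")
```

On the symmetric sample both strategies fill the same value; on the skewed sample the mean fill is much larger than the median fill, which is why the median is usually preferred for skewed columns.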

291
Multi-Selecteasy

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous dataset? (Select TWO.)

Select 2 answers
A.Z-score method
B.IQR (Interquartile Range) method
C.Box plot visualization
D.Pearson correlation coefficient
E.K-means clustering
AnswersA, B

Z-scores beyond a threshold (e.g., 3) indicate outliers.

Why this answer

Options A and B are correct. The Z-score method flags points more than a threshold (e.g., 3) standard deviations from the mean. IQR-based outlier detection flags points more than 1.5*IQR below the first quartile or above the third quartile.

Option E is wrong because K-means clustering is typically used on multivariate data. Option C is wrong because a box plot visualizes outliers but is not a detection technique per se; it relies on the IQR rule. Option D is wrong because correlation is a bivariate measure.
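
Both rules can be sketched in a few lines. This is an illustrative example using the standard library only; the function names, the thresholds of 3 and 1.5, and the sample data are assumptions chosen for the demonstration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing can be flagged
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical univariate sample with one extreme value.
data = [10, 11, 12, 13, 12, 11] * 3 + [100]
z_flagged = zscore_outliers(data)
iqr_flagged = iqr_outliers(data)
```

Note that the Z-score rule depends on the mean and standard deviation, which the outliers themselves inflate, so on very small samples it can fail to flag anything; the IQR rule is less affected by this.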

292
MCQhard

A data scientist is exploring a dataset with 200 features. They compute the pairwise correlation matrix and notice that many features have correlations above 0.95. They want to reduce redundancy before modeling. Which of the following techniques is most appropriate for identifying and removing highly correlated features?

A.Compute mutual information between each feature and the target.
B.Apply PCA and keep the first 50 components.
C.Use Lasso regression to select features.
D.Perform hierarchical clustering on the correlation matrix and select one feature per cluster.
AnswerD

This systematically removes redundancy while retaining representative features.

Why this answer

Option D is correct because hierarchical clustering on the correlation matrix groups correlated features, after which one representative can be selected from each cluster. Option B is wrong because PCA creates new combined features rather than identifying and removing the original redundant ones. Option C is wrong because Lasso performs feature selection but tends to pick arbitrarily among highly correlated features.

Option A is wrong because mutual information with the target does not capture pairwise redundancy between features directly.
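
One way to picture this approach is the sketch below. It is a simplified stand-in, not a full dendrogram: it links any two features whose absolute Pearson correlation reaches a threshold and takes the connected groups, which matches single-linkage clustering cut at distance 1 - threshold. The feature names, values, and the 0.95 threshold are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant columns)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_groups(columns, threshold=0.95):
    """Group features linked by |r| >= threshold (single-linkage cut)."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path compression.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

# Hypothetical features: 'sqm' is nearly a rescaled copy of 'sqft'.
features = {
    "sqft": [1000, 1500, 2000, 2500, 3000],
    "sqm": [93, 139, 186, 232, 279],
    "age": [30, 5, 22, 11, 40],
}
groups = correlated_groups(features)
kept = [group[0] for group in groups]  # keep one representative per group
```

Keeping the first feature of each group is an arbitrary choice for the sketch; in practice the representative might be chosen by missing-value rate or by relevance to the target.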

293
MCQhard

A team is building a model to predict house prices. They have a dataset with features like 'SquareFootage', 'Bedrooms', 'YearBuilt', and 'Neighborhood'. They notice that 'SquareFootage' has a few extreme values (e.g., 50,000 sq ft) that are likely data entry errors. They want to handle these outliers without losing all the data. Which of the following approaches is most robust?

A.Cap 'SquareFootage' at the 99th percentile value.
B.Replace extreme values with the mean of 'SquareFootage'.
C.Apply log transformation to 'SquareFootage'.
D.Remove rows where 'SquareFootage' is above 3 standard deviations from the mean.
AnswerA

Capping limits extremes while retaining the records.

Why this answer

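Option A is correct because capping at a percentile (e.g., the 99th) limits extreme values while keeping the data points. Option D is wrong because removing rows beyond 3 standard deviations discards the other useful values in those rows, and the threshold itself is inflated by the extreme values. Option C is wrong because a log transformation compresses the scale but does not fix erroneous entries.

Option B is wrong because replacing extreme values with the mean distorts the distribution, and the mean is itself pulled upward by the outliers.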
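
Percentile capping can be sketched as below. This is a minimal example using the standard library; the `cap_at_percentile` helper and the synthetic square-footage values are assumptions made for illustration.

```python
import statistics

def cap_at_percentile(values, pct=99):
    """Cap values at the given percentile; returns (capped_values, cap)."""
    # quantiles(..., n=100) returns 99 cut points; index pct-1 is the pct-th.
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    cap = cut_points[pct - 1]
    return [min(v, cap) for v in values], cap

# Hypothetical 'SquareFootage' column: 199 plausible values plus one
# likely data entry error.
sqft = [1000 + 10 * i for i in range(199)] + [50000]
capped, cap = cap_at_percentile(sqft)
```

Every row is retained; only the values above the cap are pulled down to it, so the rest of each record stays available for modeling.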

294
MCQmedium

A data scientist is building a regression model to predict house prices. The dataset includes a feature 'zip_code' with 1,000 unique values. What is the best way to handle this categorical feature in the exploratory data analysis phase?

A.One-hot encode the zip_code feature
B.Apply target encoding using the mean house price per zip code
C.Replace zip_code with the frequency of each zip code in the dataset
D.Use label encoding: assign each zip code a unique integer
AnswerB

Target encoding captures the relationship between zip code and price without creating 1,000 columns.

Why this answer

Option B is correct because target encoding (mean encoding) captures the relationship between the category and the target, and is suitable for high-cardinality features. Option A is wrong because one-hot encoding would create too many dummy variables. Option D is wrong because label encoding implies an ordering that zip codes do not have.

Option C is wrong because frequency encoding may not capture price variation well.
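
A bare-bones version of target encoding is sketched below, using only the standard library. The zip codes and prices are made up, and the sketch omits smoothing and out-of-fold computation, which are commonly added when the encoding feeds a model so that the target does not leak into the feature.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means

# Hypothetical data: zip codes and house prices.
zip_codes = ["10001", "10001", "94105", "94105", "60601"]
prices = [500000, 700000, 1200000, 1000000, 300000]
encoded, zip_means = target_encode(zip_codes, prices)
```

The 1,000 distinct zip codes in the question would collapse into a single numeric column in the same way.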

295
Multi-Selectmedium

A data engineer is exploring a large dataset in Amazon Athena. The dataset is partitioned by date and stored in Parquet format. The engineer wants to check the number of distinct values in a column for a specific date range. Which THREE practices reduce query cost and improve performance?

Select 3 answers
A.Use the COUNT(DISTINCT column) function.
B.Filter the query with a WHERE clause on the partition column.
C.Use ORDER BY to sort the results.
D.Use SELECT * to retrieve all columns.
E.Ensure the table is columnar (Parquet) to reduce I/O.
AnswersA, B, E

Efficiently counts distinct values without fetching all rows.

Why this answer

Options A, B, and E are correct. Filtering on the partition column limits the data scanned to the relevant date partitions. COUNT(DISTINCT column) reads only the column of interest and returns a single aggregated value instead of all rows. Parquet's columnar layout lets Athena read only the referenced columns, which reduces I/O.

Option D is wrong because SELECT * scans all columns. Option C is wrong because ORDER BY adds a sort step and does nothing to reduce the data scanned.

296
Multi-Selecthard

A data scientist is performing exploratory data analysis on a time-series dataset of website traffic. The dataset contains hourly page views for the past two years. The scientist wants to analyze seasonality and trends. Which THREE techniques are appropriate for this analysis? (Choose THREE.)

Select 3 answers
A.Moving average smoothing
B.Box plot by month
C.Time series decomposition (additive or multiplicative)
D.Linear regression on time index
E.Autocorrelation (ACF) plot
AnswersA, C, E

Smoothing reveals underlying trend.

Why this answer

Option C (decomposition) separates the time series into trend, seasonal, and residual components. Option E (autocorrelation plot) helps identify seasonality. Option A (moving average) smooths the series to reveal trends.

Option D (linear regression on the time index) is not typical for seasonal decomposition. Option B (box plot by month) can show seasonal patterns but is less common for analyzing trend.
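
Moving-average smoothing and autocorrelation can be illustrated with a short sketch. It uses plain Python and a toy series with a period of 4; the function names and data are assumptions, and real hourly traffic would use much longer windows and lags (e.g., 24 or 168).

```python
def moving_average(series, window):
    """Simple moving average over a sliding window of the given size."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return cov / var

# Hypothetical page-view series repeating every 4 steps.
views = [10, 20, 30, 20] * 6
smoothed = moving_average(views, window=4)  # window equals the period
acf_at_period = autocorrelation(views, lag=4)
acf_at_half_period = autocorrelation(views, lag=2)
```

A window equal to the seasonal period averages the cycle away and leaves the trend (flat in this toy series), while the autocorrelation is strongly positive at the seasonal lag and negative at half of it.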

297
MCQmedium

A data scientist is analyzing a dataset and finds that the target variable has a bimodal distribution. Which preprocessing step is most appropriate before modeling?

A.Standardize the target variable to have mean 0 and variance 1.
B.Remove outliers from the target variable.
C.Consider clustering to separate the two modes and model them separately.
D.Apply a log transformation to the target variable.
AnswerC

Bimodal distribution may indicate two subpopulations.

Why this answer

Option C is correct because clustering can identify natural groups, which can be treated as separate modeling tasks. Option D is wrong because a log transformation is suited to skewed unimodal distributions. Option A is wrong because standardizing does not change the shape of the distribution.

Option B is wrong because removing outliers would not address bimodality.

298
MCQmedium

A data scientist ran an AWS Glue ETL job that failed with the error shown. What is the most likely cause?

A.The CSV file has a header mismatch
B.The DataFrame does not have a column named 'age'
C.The schema is evolving incorrectly
D.The data type of 'age' is incompatible
AnswerB

Correct: The error states 'age' is not in the input columns.

Why this answer

Option B is correct because the error indicates the column 'age' is not found in the input data, which has columns [id, name, salary]. Option A is wrong because the error is about a missing column, not a header mismatch. Option C is wrong because schema evolution would add the column rather than cause this error.

Option D is wrong because there is no indication of a data type issue.

299
MCQeasy

A machine learning engineer is analyzing a text classification dataset with 50,000 documents. Which EDA step is most important to understand the vocabulary size and frequency distribution?

A.Compute TF-IDF matrix
B.Plot frequency of each word in a bar chart
C.Generate bigram collocations
D.Plot histogram of document lengths
AnswerB

Word frequencies show vocabulary size and how usage is distributed across terms.

Why this answer

Option B is correct because the word frequency distribution (e.g., Zipfian) helps decide the vocabulary cutoff. Option A is wrong because TF-IDF is a transformation, not an EDA step. Option D is wrong because the document length distribution describes length, not vocabulary.

Option C is wrong because bigram analysis is more advanced; basic word frequency comes first.
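
Counting word frequencies takes only a few lines. The sketch below uses `collections.Counter` with naive lowercase whitespace tokenization, which is an assumption made for brevity; the three documents are hypothetical stand-ins for the 50,000-document corpus.

```python
from collections import Counter

def word_frequencies(documents):
    """Count word occurrences across documents (lowercased, split on whitespace)."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts

# Hypothetical miniature corpus.
docs = ["The cat sat", "the dog sat", "the bird flew"]
counts = word_frequencies(docs)
vocab_size = len(counts)             # number of distinct words
top_words = counts.most_common(2)    # most frequent words first
```

Plotting the sorted counts (for example as a bar chart of the top terms) then shows how quickly frequency falls off, which informs where to cut the vocabulary.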

300
MCQmedium

A data scientist is performing EDA on a time series dataset of daily website visits. The scientist wants to identify any seasonality patterns. Which visualization is most appropriate?

A.Correlation matrix of visits with lagged versions of itself.
B.Scatter plot of visits against the day of the month.
C.Histogram of daily visit counts.
D.Line plot with day on x-axis and visits on y-axis, highlighting weekends.
AnswerD

Reveals periodic patterns over time.

Why this answer

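Option D is correct because a time series line plot with marked intervals (e.g., highlighted weekends) can reveal seasonal patterns. Option C (histogram) shows the distribution of visit counts, not patterns over time. Option B (scatter plot of visits against the day of the month) is less effective for seasonality because it collapses the timeline and obscures weekly cycles.

Option A (correlation matrix of lagged values) summarizes lag relationships numerically but does not directly show the pattern over time.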
