CCNA Exploratory Data Analysis Questions

75 of 406 questions · Page 2/6 · Exploratory Data Analysis topic · Answers revealed

76
MCQhard

A machine learning engineer is analyzing a dataset for a regression problem. The target variable has a long-tail distribution with extreme outliers. The engineer wants to reduce the influence of outliers while preserving the relative order of values. Which data transformation should the engineer apply to the target variable?

A.Min-max normalization
B.Box-Cox transformation
C.Rank transformation
D.Log transformation
AnswerC

Rank transformation replaces values with their rank order, making the distribution uniform and robust to outliers.

Why this answer

Option C is correct because the rank transformation maps values to their ranks, which bounds the impact of outliers while preserving order. Option B is wrong because Box-Cox requires strictly positive values and may not reduce outlier influence enough. Option D is wrong because a log transformation can reduce skew but still allows extreme values to remain influential.

Option A is wrong because min-max scaling does not reduce outlier influence; it only rescales values linearly, so the outliers squeeze the remaining values into a narrow band.
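
As a rough illustration of the rank transformation, the sketch below replaces each value with its 1-based rank and gives tied values the average of their ranks. The `rank_transform` helper and the sample `target` values are made up for this example, not taken from the question.

```python
def rank_transform(values):
    """Return 1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


# Hypothetical target values with one extreme outlier.
target = [12.0, 15.0, 11.0, 14.0, 9000.0]
ranks = rank_transform(target)
print(ranks)  # [2.0, 4.0, 1.0, 3.0, 5.0]
```

The outlier 9000.0 ends up only one rank step above 15.0, while the ordering of all values is unchanged.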

77
MCQmedium

A data scientist is performing EDA on a time-series dataset and observes a strong upward trend and seasonal patterns. The scientist needs to make the data stationary for modeling. Which transformation should be applied?

A.Apply one-hot encoding
B.Apply PCA
C.Apply min-max scaling
D.Apply differencing to the series
E.Apply logarithmic transformation
AnswerD

Differencing removes trends and seasonality, making the series stationary.

Why this answer

Differencing is a common technique for making a time series stationary: first-order differencing removes a trend, and seasonal differencing (subtracting the value one season earlier) removes seasonality. Option E is wrong because a logarithmic transformation stabilizes variance but does not remove trends. Option C is wrong because min-max scaling does not address trends.

Option A is wrong because one-hot encoding is for categorical variables. Option B is wrong because PCA is for dimensionality reduction.
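
A minimal sketch of differencing on a synthetic series may help here. The `difference` helper, the linear trend, and the period-4 pattern are all assumptions chosen for illustration.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Made-up series: linear trend (+2 per step) plus a repeating pattern of period 4.
pattern = [0, 5, 0, -5]
series = [2 * t + pattern[t % 4] for t in range(16)]

# First-order differencing removes the trend but keeps the repeating pattern.
first_diff = difference(series, lag=1)

# Seasonal differencing at the period removes the pattern; here it also turns
# the linear trend into a constant.
seasonal_diff = difference(series, lag=4)

print(first_diff[:8])
print(seasonal_diff)
```

In this toy case the seasonal difference is constant, which is the kind of flat behaviour a stationary series is expected to show.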

78
MCQhard

A data scientist is exploring a large dataset (10 TB) stored in Amazon S3. The dataset is in CSV format and has many columns. The scientist wants to quickly compute summary statistics (mean, min, max, count) for each column without moving the data. Which approach is most cost-effective and efficient?

A.Import the data into Amazon SageMaker Data Wrangler
B.Launch an Amazon EMR cluster with Spark
C.Use S3 Select to compute statistics
D.Use Amazon Athena with SQL queries
E.Use AWS Glue DataBrew to profile the data
AnswerD

Athena queries data in place with no data movement and pay-per-query pricing.

Why this answer

Using Amazon Athena with SQL queries allows serverless querying of data in S3, and is cost-effective (pay per query). It can compute summary statistics directly on the data without moving it. Option A is wrong because SageMaker Data Wrangler requires importing data into SageMaker, which may incur transfer costs and time.

Option E is wrong because Glue DataBrew also profiles data but may be more expensive for large datasets. Option C is wrong because S3 Select works on a single object at a time and supports only a limited subset of SQL. Option B is wrong because launching an EMR cluster adds overhead and cost for a simple task.

79
MCQmedium

A data scientist is analyzing a dataset with missing values in several columns. The dataset contains both numerical and categorical features. Which approach should the data scientist use to handle missing values while minimizing bias and preserving relationships in the data?

A.Use multiple imputation (e.g., MICE) to impute missing values
B.Use forward-fill to propagate the last observed value
C.Delete all rows with missing values
D.Replace missing values with the mean or median of each column
AnswerA

MICE models each variable as a function of others, preserving relationships and reducing bias.

Why this answer

Option A is correct because MICE (Multiple Imputation by Chained Equations) models each variable with missing values as a function of the other variables, preserving relationships. Option C is wrong because listwise deletion can introduce bias and discards data. Option D is wrong because mean/median imputation reduces variance and weakens relationships between features.

Option B is wrong because forward-fill is intended for ordered data such as time series.

80
MCQeasy

Refer to the exhibit. A data scientist examines a sample of data and notices that all columns are numeric. The scientist wants to check for multicollinearity. Which statistic should be computed from this sample?

A.Correlation matrix (Pearson)
B.Chi-square test of independence
C.Variance Inflation Factor (VIF)
D.Covariance matrix
AnswerA

A correlation matrix can reveal high pairwise correlations.

Why this answer

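Option A is correct because the correlation matrix shows pairwise Pearson correlations, which can indicate high collinearity. Option C is wrong because VIF requires fitting a regression of each feature on all the others, which needs more observations than features and may not be reliable on a small sample. Option B is wrong because the chi-square test is for categorical variables.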

Option D is wrong because covariance alone is scale-dependent.
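
To make the pairwise check concrete, here is a small sketch that builds a Pearson correlation matrix by hand. The column names and values are invented for the example; in practice a library routine would normally be used instead.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical numeric columns: two are the same quantity in different units.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
names = list(columns)
corr = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, [round(corr[a][b], 3) for b in names])
```

The two height columns correlate at close to 1 even though their covariance depends on the units, which is why the correlation matrix is easier to read than the covariance matrix.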

81
MCQhard

A data scientist is performing EDA on a dataset with missing values in 3 of 20 features. The missing rate is 5% for each feature. The scientist wants to preserve as much data as possible while avoiding bias. Which imputation strategy is most appropriate?

A.Remove rows with any missing values.
B.Impute missing values with the mean of each feature.
C.Use K-Nearest Neighbors (KNN) imputation.
D.Impute missing values with the median of each feature.
AnswerD

Median is robust and retains data.

Why this answer

Option D is correct because median imputation is robust to outliers and preserves the dataset size. Option A is wrong because dropping rows with missing values would lose roughly 14% of the data (1 − 0.95³ ≈ 0.143, assuming missingness is independent across the three features). Option B is wrong because mean imputation can be affected by outliers.

Option C is wrong because KNN imputation may introduce bias and is computationally expensive.
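
The share of rows lost to listwise deletion can be checked with a short calculation. This sketch assumes missingness is independent across the three features, which the question does not state explicitly.

```python
missing_rate = 0.05   # per-feature missing rate from the question
n_features = 3        # features that contain missing values

# A row survives only if all three features are present (independence assumed).
fraction_rows_kept = (1 - missing_rate) ** n_features
fraction_rows_lost = 1 - fraction_rows_kept

print(f"rows lost: {fraction_rows_lost:.1%}")
```

If the missing values tend to occur in the same rows, the actual loss would be smaller, down to 5% in the extreme case.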

82
MCQmedium

A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?

A.The imputation will introduce bias if the missing values are not random.
B.Imputation using median is computationally expensive for large datasets.
C.The imputed values may reduce the variance of the 'age' distribution.
D.The imputed values will increase the variance of the feature, leading to overfitting.
AnswerC

Replacing missing values with a constant reduces the variability of the feature.

Why this answer

Imputing missing values with the median of the observed data artificially concentrates imputed values around the center of the distribution. This reduces the overall variance of the 'age' column because the imputed values do not reflect the natural spread of the data, potentially distorting downstream analyses like regression or clustering that rely on variance structure.

Exam trap

The exam often tests the subtle distinction between bias (which is a general risk of any imputation under non-random missingness) and variance reduction (which is a specific, guaranteed statistical consequence of constant-value imputation).

How to eliminate wrong answers

Option A is wrong because while imputation can introduce bias if data are not missing at random (MNAR), the question specifically asks about a drawback of using median imputation; the bias concern is not unique to median imputation and is a general risk of any imputation method under MNAR, not the primary technical drawback described. Option B is wrong because computing the median is O(n) with efficient algorithms and is not computationally expensive even for large datasets; mean or median imputation is among the cheapest imputation methods. Option D is wrong because median imputation reduces variance, not increases it; increased variance would be a concern with methods like mean imputation with added noise, not with simple median imputation.
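
The variance-shrinking effect of median imputation is easy to see numerically. The sketch below uses invented `ages` and an assumed count of three missing entries.

```python
import statistics

# Hypothetical observed ages; three further values are assumed to be missing.
ages = [22, 25, 31, 38, 44, 52, 67]
n_missing = 3

median_age = statistics.median(ages)
imputed = ages + [median_age] * n_missing  # fill every gap with the same constant

var_before = statistics.pvariance(ages)
var_after = statistics.pvariance(imputed)

print(f"median used for imputation: {median_age}")
print(f"variance before: {var_before:.1f}, after: {var_after:.1f}")
```

Every imputed value sits at the center of the distribution, so the variance after imputation is lower than the variance of the observed values alone.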

83
Multi-Selecteasy

Which TWO AWS services can be used to visualize data distributions as part of exploratory data analysis? (Select TWO.)

Select 2 answers
A.AWS Glue
B.Amazon QuickSight
C.Amazon Athena
D.Amazon Comprehend
E.Amazon SageMaker Data Wrangler
AnswersB, E

QuickSight provides interactive dashboards and visualizations.

Why this answer

Amazon QuickSight is a cloud-native business intelligence service that can visualize data distributions through histograms, box plots, scatter plots, and other chart types, making it suitable for exploratory data analysis. Amazon SageMaker Data Wrangler provides a visual interface to create data distribution charts (e.g., histograms, bar charts) directly within the data preparation workflow, enabling quick inspection of feature distributions before model building.

Exam trap

The exam often tests the misconception that AWS Glue or Athena can visualize data distributions because they are used in data preparation or querying, but neither provides native charting or plotting capabilities; they only return raw data or tabular results.

84
Multi-Selectmedium

A data scientist is analyzing a dataset and finds that two features have a Pearson correlation coefficient of 0.95. Which TWO actions should the data scientist consider? (Choose two.)

Select 2 answers
A.Combine the two features into a single feature using PCA or averaging
B.Add interaction terms between the features
C.Increase regularization strength in the model
D.Remove one of the correlated features
E.Apply standard scaling to both features
AnswersA, D

Combining captures information from both while reducing dimensionality.

Why this answer

Options A and D are correct. High correlation can lead to multicollinearity, so removing one feature (D) or combining the two (A) are valid approaches. Option C is wrong because increasing regularization mitigates the effects of multicollinearity but does not remove the redundancy itself.

Option E is wrong because scaling does not affect correlation. Option B is wrong because adding interaction terms can increase multicollinearity.

85
MCQeasy

A machine learning engineer is performing exploratory data analysis on a dataset containing customer transaction records. The dataset includes a column 'transaction_date' with timestamps. The engineer wants to derive features such as day of the week, hour, and month for modeling. Which AWS service can be used directly to extract these features without writing custom code?

A.AWS Glue ETL with built-in timestamp transforms
B.Amazon Athena with SQL date functions
C.Amazon QuickSight
D.Amazon SageMaker Data Wrangler
AnswerA

AWS Glue provides transforms like 'ExtractTimestamp' to derive date components without custom code.

Why this answer

Option A is correct because AWS Glue provides built-in transforms in its ETL jobs to parse timestamps and extract date/time components. Option B is wrong because Athena is a query engine and can extract date parts using SQL, but that requires writing SQL queries, so it is not a no-code solution. Option D is wrong because SageMaker Data Wrangler is a visual tool that can create features, but it requires a SageMaker Studio environment.

Option C is wrong because QuickSight is a visualization tool, not a feature engineering tool.

86
MCQhard

A data engineer is performing EDA on a time-series dataset of server metrics (CPU, memory, disk I/O) collected every minute. The dataset contains 2 years of data. The engineer suspects there are seasonal patterns and wants to decompose the time series for one metric. Which AWS service can be used to perform this decomposition natively?

A.Amazon SageMaker Canvas
B.Amazon Athena with SQL window functions
C.Amazon QuickSight
D.AWS Glue DataBrew
AnswerA

Canvas provides time-series analysis and decomposition.

Why this answer

Option A is correct because Amazon SageMaker Canvas supports time-series forecasting with built-in decomposition. Option D is wrong because AWS Glue DataBrew does not provide time-series decomposition. Option C is wrong because Amazon QuickSight does not decompose time series.

Option B is wrong because Amazon Athena does not have decomposition functions.

87
MCQhard

A data scientist is performing EDA on a dataset containing text reviews. To understand the most common words, the data scientist generates a word cloud. Which preprocessing step is most important to ensure the word cloud reflects meaningful content?

A.Stop word removal
B.Part-of-speech tagging
C.Stemming
D.Tokenization
AnswerA

Stop word removal eliminates common, uninformative words.

Why this answer

Option A is correct because removing stop words (common words like 'the', 'and') ensures that the word cloud highlights meaningful words. Stemming (C) may not be necessary for a word cloud. Tokenization (D) is fundamental but not the most critical step for meaningfulness.

POS tagging (B) is overkill.

88
Multi-Selectmedium

Which THREE techniques are commonly used in exploratory data analysis to understand the relationships between features and the target variable? (Select THREE.)

Select 3 answers
A.Use box plots to compare feature distributions across target classes.
B.Perform K-means clustering on the features.
C.Compute the correlation matrix between features and target.
D.Generate scatter plots or pair plots to visualize feature interactions.
E.Apply Principal Component Analysis (PCA) to reduce dimensions.
AnswersA, C, D

Box plots by class reveal differences in feature distributions.

Why this answer

Options A, C, and D are correct. A: Box plots by target class show distribution differences. C: The correlation matrix quantifies linear relationships.

D: Pair plots allow visual inspection of multiple relationships. E is wrong because PCA is for dimensionality reduction, not for examining feature-target relationships. B is wrong because clustering is unsupervised and does not directly examine feature-target relationships.

89
MCQmedium

A machine learning engineer is working on a customer churn prediction project. The dataset contains 100,000 records with 15 features, including customer demographics, account information, and usage patterns. The target variable 'churned' is binary with 15% positive examples. During EDA, the engineer notices that the feature 'tenure' (number of months the customer has been with the company) has a multimodal distribution with peaks at 1, 12, 24, and 36 months. Also, the feature 'monthly_charges' has a strong positive correlation with 'total_charges' (correlation coefficient = 0.95). The engineer wants to build a logistic regression model. Which preprocessing steps should the engineer take to address these issues? (Select TWO.)

A.Bin the 'tenure' feature into categorical groups (e.g., 0-6, 7-12, 13-24, 25-36, 36+) to capture the non-linear relationship.
B.Remove one of the correlated features, such as 'total_charges', to reduce multicollinearity.
C.Apply log transformation to the 'tenure' feature to make it unimodal.
D.Create polynomial features up to degree 3 for 'tenure' to capture non-linearity.
E.Standardize all numerical features to have mean 0 and variance 1.
AnswerA, B

Binning can effectively capture the peaks in the distribution and model the non-linear effect of tenure on churn.

Why this answer

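Option A is correct because binning the 'tenure' feature into categorical groups (e.g., 0-6, 7-12, 13-24, 25-36, 36+) captures the multimodal distribution and non-linear relationship with churn, which logistic regression (a linear model) cannot model directly. This transforms the feature into a format that allows the model to learn different churn probabilities for each tenure segment without imposing a linear assumption. Option B is correct because a correlation of 0.95 between 'monthly_charges' and 'total_charges' makes the logistic regression coefficients unstable and hard to interpret; dropping one of the two removes the redundancy.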

Exam trap

The exam often tests the misconception that standardizing or transforming features alone can fix non-linear relationships or multicollinearity, but these steps do not address the root cause of multimodal distributions or high feature correlation in linear models like logistic regression.

How to eliminate wrong answers

Option C is wrong because applying a log transformation to 'tenure' would not make a multimodal distribution unimodal; log transformations are used to reduce right skewness, not to eliminate multiple peaks, and would distort the natural grouping at contract milestones. Option D is wrong because creating polynomial features up to degree 3 for 'tenure' would introduce multicollinearity and overfitting without addressing the multimodal nature, and logistic regression still assumes a linear relationship in the log-odds space, which polynomial terms do not resolve for discrete peaks. Option E is wrong because standardizing numerical features (mean 0, variance 1) is a general best practice for gradient descent convergence but does not address the multimodal distribution of 'tenure' or the multicollinearity between 'monthly_charges' and 'total_charges'; it is not a targeted preprocessing step for the issues described.
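
The binning step can be sketched as follows, using the bin edges quoted in the question. The `tenure_bin` and `one_hot` helpers and the sample tenures are hypothetical, and dropping the first bin as a reference level is an assumption made to avoid perfectly collinear indicator columns in logistic regression.

```python
BINS = ["0-6", "7-12", "13-24", "25-36", "36+"]


def tenure_bin(months):
    """Map tenure in months to one of the bin labels used in the question."""
    if months <= 6:
        return "0-6"
    if months <= 12:
        return "7-12"
    if months <= 24:
        return "13-24"
    if months <= 36:
        return "25-36"
    return "36+"


def one_hot(label, drop_first=True):
    """One-hot encode a bin label; the first bin is the reference level if dropped."""
    levels = BINS[1:] if drop_first else BINS
    return [1 if label == level else 0 for level in levels]


tenures = [1, 12, 24, 36, 48]  # made-up customers
encoded = [one_hot(tenure_bin(m)) for m in tenures]
for months, row in zip(tenures, encoded):
    print(months, tenure_bin(months), row)
```

Each indicator column gets its own coefficient, so the model can assign a different churn log-odds to each tenure segment.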

90
MCQeasy

A data scientist is exploring a dataset and wants to understand the distribution of a continuous feature. Which visualization is most appropriate for identifying skewness and potential outliers?

A.Bar chart
B.Scatter plot
C.Box plot
D.Heatmap
AnswerC

Box plots display median, quartiles, and outliers, ideal for assessing skewness and outliers.

Why this answer

Option C is correct because a box plot explicitly shows median, quartiles, and outliers. Option B is wrong because scatter plots show relationships between two variables, not the distribution of one. Option A is wrong because bar charts are for categorical data.

Option D is wrong because heatmaps show correlations, not distribution.

91
MCQmedium

A data scientist is analyzing a dataset with a time series component. They suspect there is a weekly seasonality. Which technique should they use to confirm this?

A.Plot the time series line chart
B.Compute autocorrelation function (ACF)
C.Perform Fourier transform
D.Compute a 7-day moving average
AnswerB

Correct: ACF at lag 7 will be significant if weekly seasonality exists.

Why this answer

Option B is correct because the autocorrelation function (ACF) shows a peak at lag 7 (for daily data) when weekly seasonality is present. Option A is wrong because a line plot can show patterns but is subjective. Option D is wrong because a moving average smooths the data and may hide seasonality.

Option C is wrong because a Fourier transform analyzes frequency content and can also reveal the period, but the ACF is the simpler and more direct check.
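
A small sketch of the autocorrelation check, assuming daily observations and a made-up weekly pattern repeated for eight weeks. The `autocorrelation` helper uses the usual sample ACF definition (lagged cross-products divided by the total sum of squares).

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Hypothetical daily values with a weekend peak, repeated for 8 weeks.
weekly_pattern = [10, 12, 11, 13, 12, 20, 22]
series = weekly_pattern * 8

acf = {lag: autocorrelation(series, lag) for lag in range(1, 15)}
for lag, value in acf.items():
    print(lag, round(value, 3))
```

The largest value appears at lag 7, with a smaller echo at lag 14, which is the signature of a weekly cycle in daily data.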

92
Multi-Selecthard

A data scientist is analyzing a dataset of customer reviews. The dataset contains a text column 'review' and a numerical rating from 1 to 5. The data scientist wants to create features for sentiment analysis. Which THREE preprocessing steps should be applied to the text data before feature extraction? (Choose THREE.)

Select 3 answers
A.Standardize the text data using z-score normalization.
B.Apply stemming to reduce words to their root form.
C.Tokenize the text into individual words.
D.Convert all text to lowercase.
E.Remove common stop words (e.g., 'the', 'and', 'is').
AnswersB, D, E

Stemming groups related words, reducing feature dimensionality.

Why this answer

Option B is correct because stemming reduces words to their root form (e.g., 'running' to 'run'), which consolidates variations of the same word and reduces feature dimensionality. This is a standard preprocessing step before feature extraction in NLP tasks like sentiment analysis, as it helps the model generalize across different word forms.

Exam trap

The exam often tests the distinction between preprocessing steps that are specific to text (like stemming, lowercasing, stop word removal) versus those meant for numerical data (like normalization), and candidates may mistakenly apply scaling techniques to text or forget that tokenization is a prerequisite but not always listed as a separate 'correct' step in multi-select questions.
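
The text-specific steps can be sketched in a few lines. The stop word list here is a tiny illustrative set, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's; both are assumptions for this example.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "was", "it", "this"}


def naive_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The delivery was delayed and the packaging is damaged")
print(tokens)  # ['delivery', 'delay', 'packag', 'damag']
```

Note that tokenization still happens inside `preprocess`, since the later steps operate on individual words.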

93
Multi-Selectmedium

Which TWO techniques are appropriate for detecting outliers in a univariate numeric dataset?

Select 2 answers
A.Cook's distance
B.Mahalanobis distance
C.Z-score method
D.Interquartile range (IQR) method
E.DBSCAN clustering
AnswersC, D

Z-score flags points beyond a threshold (e.g., |z|>3).

Why this answer

Options C and D are correct: the Z-score method identifies outliers based on standard deviations from the mean, and the IQR method flags points that fall far outside the interquartile range. DBSCAN (E) is a clustering method typically used on multivariate data. Mahalanobis distance (B) is multivariate.

Cook's distance (A) measures influence in regression.
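
Both univariate methods can be sketched with the standard library. The thresholds (|z| > 3 and 1.5 × IQR) are the conventional defaults rather than anything mandated by the question, and the data values are made up.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical measurements with a single extreme value.
data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16] * 2 + [95]

print(zscore_outliers(data))  # [95]
print(iqr_outliers(data))     # [95]
```

The z-score method depends on the mean and standard deviation, which the outlier itself inflates, so on very small samples the IQR method tends to be the more reliable of the two.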

94
MCQhard

A data engineer is running an Amazon SageMaker Data Wrangler flow on a dataset with 5 million rows. The flow includes several transformations. The engineer wants to validate the data quality by checking for missing values and outliers before training. Which approach is most efficient?

A.Use Data Wrangler's data quality and insights report to generate a report with statistics and visualizations.
B.Export the transformed data to S3 and query with Amazon Athena.
C.Use Amazon EMR with Spark to compute statistics.
D.Import the data into Amazon QuickSight and create dashboards.
AnswerA

Data Wrangler has a built-in report for data quality.

Why this answer

Using Data Wrangler's built-in data quality and insights report is the most efficient way to get statistics and detect issues without custom code. Option B (Athena) requires exporting the data and writing SQL queries. Option D (QuickSight) requires importing the data and building dashboards manually.

Option C (EMR) is overkill.

95
MCQmedium

A data scientist is analyzing a dataset with 100 features and 10,000 samples. The target variable is highly imbalanced (1% positive class). Which exploratory data analysis step is most critical before model training?

A.Apply PCA and visualize the first two principal components
B.Compute pairwise correlation matrix among all features
C.Impute missing values using mean imputation
D.Plot the histogram of the target variable
AnswerD

Plotting the target distribution reveals the class imbalance that drives sampling and metric choices.

Why this answer

Option D is correct because understanding the distribution of the target variable is essential for imbalanced datasets in order to choose appropriate sampling techniques or evaluation metrics. Option B is wrong because correlation analysis is less critical than the target distribution. Option A is wrong because PCA is a dimensionality reduction technique and does not reveal the class imbalance.

Option C is wrong because missing value imputation is important but not the most critical step for imbalance.

96
MCQeasy

During EDA, a data scientist discovers that a numerical feature 'income' has a skewness of 3.5. Which transformation should the scientist apply to make the distribution more symmetric?

A.Standardization (Z-score)
B.Square transformation
C.Log transformation
D.Min-Max scaling
AnswerC

Log transformation compresses the tail and reduces right skewness.

Why this answer

Option C is correct because a log transformation is commonly used for right-skewed positive data to reduce skewness. Option A is wrong because standardization does not change skewness. Option D is wrong because Min-Max scaling does not change the shape of the distribution.

Option B is wrong because a square transformation would increase right skewness.
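
The effect of the log transformation on skewness can be checked directly. The `income` values below are invented, and `skewness` is the simple moment-based estimate (third central moment divided by the variance to the power 1.5), which may differ slightly from bias-corrected library versions.

```python
import math


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed incomes (all strictly positive, as log requires).
income = [20_000, 25_000, 30_000, 32_000, 35_000, 40_000, 45_000, 60_000, 120_000, 900_000]
log_income = [math.log(v) for v in income]

# Standardizing is a linear rescaling, so it leaves skewness unchanged.
mean = sum(income) / len(income)
sd = math.sqrt(sum((v - mean) ** 2 for v in income) / len(income))
z_income = [(v - mean) / sd for v in income]

print(f"raw: {skewness(income):.2f}")
print(f"z-score: {skewness(z_income):.2f}")
print(f"log: {skewness(log_income):.2f}")
```

The log pulls in the long right tail, so its skewness is lower than that of the raw values, while the standardized values keep the same skewness as the raw ones.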

97
MCQeasy

A company has customer feedback data stored in CSV files in S3. The data includes a 'feedback_text' column. Which AWS service is best suited for performing sentiment analysis as part of exploratory data analysis?

A.Amazon Comprehend
B.Amazon Rekognition
C.Amazon Textract
D.Amazon Lex
AnswerA

Comprehend provides sentiment analysis as a managed service.

Why this answer

Option A is correct because Amazon Comprehend is a natural language processing (NLP) service that can perform sentiment analysis directly. Option D is wrong because Amazon Lex is for conversational interfaces, not text analysis. Option B is wrong because Amazon Rekognition is for image and video analysis.

Option C is wrong because Amazon Textract is for extracting text from documents, not sentiment.

98
MCQeasy

A data scientist is analyzing a dataset with 1,000 features. They suspect many features are redundant and want to reduce dimensionality before training a model. Which technique is most appropriate for identifying the most important features?

A.Apply principal component analysis (PCA) and select the top components
B.Use L1 regularization (Lasso) to shrink coefficients to zero
C.Train a random forest and remove features with low importance
D.Compute the correlation matrix and remove features with high correlation
AnswerA

PCA reduces dimensionality by keeping the components that capture the most variance.

Why this answer

Option A is correct because principal component analysis (PCA) is a dimensionality reduction technique that identifies the principal components capturing the most variance. Option D is wrong because the correlation matrix only shows pairwise linear relationships, not importance. Option B is wrong because regularization can shrink coefficients but is not a dedicated dimensionality reduction technique.

Option C is wrong because random forests can provide feature importance but are not a dimensionality reduction technique per se.

99
MCQhard

A company has a large dataset of customer transactions stored in Amazon Redshift. A data scientist wants to perform EDA using Python libraries like pandas and matplotlib. The dataset is too large to fit into memory on a single EC2 instance. What is the most efficient approach?

A.Launch an Amazon SageMaker notebook instance with an attached EBS volume large enough to hold the data
B.Use Amazon Athena Federated Query to run SQL queries against Redshift and retrieve aggregated results
C.Use a SQLAlchemy connection to read the entire table into a pandas DataFrame and sample it
D.Export the Redshift table to Amazon S3 in Parquet format, then use pandas to read the Parquet files
AnswerB

Athena Federated Query pushes the work into SQL and returns only aggregated results.

Why this answer

Option B is correct because Amazon Athena allows querying Redshift data directly via federated queries, returning only aggregated results and avoiding the need to move large datasets. Option C is wrong because reading the entire table into a local DataFrame would exceed memory. Option D is wrong because exporting to S3 and then reading with pandas still requires loading all the data into memory.

Option A is wrong because a larger EBS volume adds disk space, but the SageMaker notebook's memory is still limited.

100
MCQmedium

A machine learning engineer is performing exploratory data analysis on a large dataset stored in S3 using Amazon Athena. The dataset contains a timestamp column 'event_time' of type string. The engineer wants to analyze daily trends. Which approach is the most cost-effective and efficient?

A.Create a view that casts the column to timestamp and query the view.
B.Use the CAST function in the SELECT statement to convert the string to timestamp.
C.Convert the data to Parquet format with a timestamp column and re-query.
D.Partition the table by date derived from the event_time string and query using partition filtering.
AnswerD

Partition pruning reduces data scanned; can use date_format or substring to derive partition key.

Why this answer

Option D is correct because partitioning the table by a date derived from the event_time string lets Athena prune partitions when the query filters on date, reducing the data scanned. Option B is wrong because CAST in the SELECT statement still scans all data. Option A is wrong because creating a view does not reduce the data scanned.

Option C is wrong because converting to Parquet is beneficial but not the most direct for the given task.

101
Multi-Selectmedium

Which TWO of the following are appropriate techniques for handling missing data during exploratory data analysis? (Select TWO.)

Select 2 answers
A.Ignore missing values and proceed with modeling
B.Replace missing values with -1 to indicate missing
C.Impute missing values using mean or median for numerical features
D.Visualize the missing data pattern using heatmaps or bar charts
E.Delete all rows with any missing values
AnswersC, D

Mean/median imputation is a common EDA technique.

Why this answer

Options C and D are correct. Visualizing missing data patterns (D) helps understand the missingness mechanism. Using imputation methods like mean/median (C) is common during EDA.

Option E is wrong because deleting all rows with missing values may discard too much data. Option A is wrong because ignoring missing values can lead to errors. Option B is wrong because replacing with -1 can distort the data.

102
MCQeasy

A data analyst is examining the distribution of a continuous variable and notices that its histogram is heavily skewed to the right. Which transformation should the analyst apply to make the distribution more symmetrical?

A.Box-Cox transformation with lambda=2.
B.Logarithmic transformation (log).
C.Standardization (z-score).
D.Square root transformation.
AnswerB

Log transformation reduces right skewness.

Why this answer

Option B is correct because the log transformation is commonly used to reduce right skewness by compressing the long tail. Option D is wrong because the square root transformation is less effective for severe skewness. Option A is wrong because Box-Cox with lambda = 2 is essentially a square transformation, which stretches the right tail and increases right skewness; within the Box-Cox family, the log corresponds to lambda = 0.

Option C is wrong because standardization does not change the shape of the distribution.

103
Multi-Selecteasy

Which TWO actions are appropriate when handling missing data in a dataset for machine learning? (Select TWO.)

Select 2 answers
A.Use a machine learning model to predict missing values based on other features
B.Drop all rows that contain any missing value
C.Impute missing values with the mean or median of the feature
D.Remove the feature entirely if it contains missing values
E.Fill missing values with zero
AnswersA, C

Mean/median imputation and model-based imputation both retain data instead of discarding it.

Why this answer

Options A and C are correct. Imputing with the mean or median is a common technique, and using a model to predict missing values is also valid. Option B is wrong because dropping all rows with missing values can discard too much data.

Option E is wrong because filling with zeros may not be appropriate for all features. Option D is wrong because removing the feature entirely may lose important information.

104
MCQhard

A data scientist queried an Athena table and got only one row back, but the CSV file is 1 MB. What is the most likely reason?

A.The table is partitioned but the partition is not correctly defined
B.The CSV file contains only one row
C.The table is not an external table
D.Athena does not support CSV format
AnswerA

Correct: If date partition is not correctly mapped, the filter may return no data.

Why this answer

Option A is correct because the file is large but the query returned only one row, suggesting the table's partition mapping is wrong; the WHERE clause on date may not match the actual partition. Option C is wrong because whether the table is external does not change how many rows are read from the file. Option B is wrong because a 1 MB file likely has many rows.

Option D is wrong because Athena supports CSV.

105
MCQmedium

A data scientist is performing EDA on a dataset with 500 features. The dataset has a mix of numeric and categorical features. The scientist wants to identify which features have a strong nonlinear relationship with the target variable. Which technique is most appropriate?

A.Use ANOVA to compare feature means across target classes.
B.Compute Pearson correlation coefficients.
C.Calculate mutual information between each feature and the target.
D.Perform chi-squared tests for each feature.
AnswerC

Mutual information measures any dependency, including nonlinear.

Why this answer

Mutual information can capture any kind of dependency (including nonlinear) between features and target. Option B (Pearson correlation) captures only linear relationships. Option D (chi-squared test) is for categorical features.

Option A (ANOVA) is for comparing means across groups.
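
A toy comparison shows why mutual information is preferred for nonlinear relationships. The sketch below assumes discrete values, so probabilities can be estimated by counting; continuous features would need binning or a dedicated estimator. The helpers and the data (y = x²) are made up for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences, by counting."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)


# y depends entirely on x, but the relationship is symmetric, not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r = pearson(x, y)
mi = mutual_information(x, y)
print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f} nats")
```

Pearson correlation is zero here even though y is fully determined by x, while the mutual information is clearly positive.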

106
MCQeasy

Which AWS service can be used to generate a data profile (including histograms, correlations, and statistics) for a dataset stored in Amazon S3 without writing code?

A.Amazon QuickSight
B.AWS Glue DataBrew
C.Amazon Athena
D.Amazon SageMaker Data Wrangler
AnswerD

Data Wrangler provides visual data profiling.

Why this answer

Option D is correct because Amazon SageMaker Data Wrangler provides a visual interface to create data profiles. Option A (QuickSight) is for visualization, not profiling; Option B (Glue DataBrew) also profiles but Data Wrangler is more integrated with SageMaker; Option C (Athena) is for querying.

107
MCQhard

Refer to the exhibit. A data scientist queries the table with 'SELECT COUNT(*) FROM mytable' in Athena and gets a result of 1000 rows. However, the scientist knows there are 1500 data files in the S3 location. What is the most likely reason for the discrepancy?

A.Some files may use a different delimiter (e.g., tab) and are not parsed correctly, resulting in zero rows from those files.
B.The table schema does not match the data, causing some files to be skipped.
C.Some files may be empty or contain only headers, so they contribute 0 rows.
D.Athena skips files larger than a certain size to prevent scanning too much data.
AnswerA

If delimiter is not comma, rows are not parsed, reducing count.

Why this answer

The table is not partitioned (empty PartitionKeys), and the SerDe expects comma-delimited files. If some files use a different delimiter (e.g., tab), those rows may not be parsed correctly, leading to fewer rows counted. Option B is wrong because the table schema matches the data.

Option C is wrong because Athena can handle many small files, and the count should still include all rows. Option D is wrong because Athena does not skip files based on size unless explicitly filtered.

108
Multi-Selecteasy

A data scientist is analyzing a dataset with a mix of numerical and categorical features. The target variable is binary. The data scientist wants to visualize the distribution of a numerical feature across the two target classes. Which TWO visualization techniques are appropriate? (Choose 2.)

Select 2 answers
A.Heatmap of the correlation matrix
B.Stacked bar chart of the feature binned
C.Overlapping histograms with transparency
D.Side-by-side boxplots
E.Scatter plot with color-coded classes
AnswersC, D

Histograms show distribution shapes; transparency allows comparison.

Why this answer

Option D is correct because side-by-side boxplots show distribution and outliers across categories. Option C is correct because overlapping histograms with transparency show distribution shapes. Option E is wrong because scatter plots require two numerical variables.

Option A is wrong because heatmaps are for correlation or contingency tables. Option B is wrong because bar charts are for categorical data, not numerical distributions.

109
MCQmedium

A data scientist is working with a dataset that has missing values in 30% of rows for a categorical feature 'city'. Which EDA step should be performed before deciding on imputation?

A.Check if missingness is related to other features or random
B.Impute missing values with the mode of the column
C.Drop all rows with missing values
D.Encode the city feature using label encoding
AnswerA

Analyzing the missingness pattern guides the choice of imputation strategy.

Why this answer

Option A is correct because analyzing the missingness pattern (e.g., MCAR, MAR, MNAR) guides the imputation strategy. Option B is wrong because mode imputation is a method, not a diagnostic step. Option D is wrong because label encoding is for modeling, not for deciding on imputation.

Option C is wrong because dropping rows may lose data; the missingness needs to be understood first.

110
Multi-Selecteasy

Which TWO are appropriate visualizations for exploring the distribution of a single numeric variable? (Select TWO.)

Select 2 answers
A.Heatmap
B.Histogram
C.Bar chart
D.Scatter plot
E.Box plot
AnswersB, E

Histogram displays frequency distribution of a single numeric variable.

Why this answer

Options B and E are correct. Option D is wrong because a scatter plot is for two variables. Option A is wrong because a heatmap shows correlation.

Option C is wrong because a bar chart is for categorical data.

111
MCQmedium

A data scientist is analyzing a dataset with a binary target variable. They compute the correlation matrix and find that all features have correlations between -0.1 and 0.1 with the target. They suspect that the relationship might be non-linear. Which of the following techniques should they use to detect non-linear relationships?

A.ANOVA test
B.Spearman's rank correlation
C.Pearson correlation coefficient
D.Mutual information
AnswerD

Measures any kind of dependency, linear or non-linear.

Why this answer

Option D is correct because mutual information can capture any dependency. Option C is wrong because Pearson correlation captures only linear relationships. Option B is wrong because Spearman captures monotonic relationships but not all non-linear ones.

Option A is wrong because ANOVA compares a continuous variable across categorical groups.
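
As a rough illustration of why a near-zero correlation does not rule out dependence, the sketch below compares Pearson correlation with a simple histogram-based mutual information estimate on synthetic data where y = x². The helper names (`pearson`, `binned_mutual_information`), the bin count, and the data are assumptions made for this example; in practice a library estimator such as scikit-learn's `mutual_info_classif` would more likely be used.

```python
import math
import random
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def binned_mutual_information(xs, ys, bins=10):
    """Estimate mutual information (in nats) using equal-width bins."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0
        # Clamp so the maximum value falls into the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in joint.items()
    )

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(5000)]
y = [v * v for v in x]                                 # non-linear, symmetric dependence
noise = [random.uniform(-1, 1) for _ in range(5000)]   # unrelated to x

r_xy = pearson(x, y)                          # close to zero
mi_xy = binned_mutual_information(x, y)       # clearly positive
mi_noise = binned_mutual_information(x, noise)  # close to zero
```

The binned estimate is biased and sensitive to the bin count, so it is only meant to show the contrast between the two measures, not to serve as a production estimator.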

112
MCQeasy

During exploratory data analysis, a data scientist notices that a feature has a highly skewed distribution. Which transformation is most likely to make the distribution approximately normal?

A.Log transformation
B.Min-max scaling
C.One-hot encoding
D.Standardization (z-score)
AnswerA

Log transformation compresses the long right tail of a skewed distribution.

Why this answer

Option A is correct because log transformation is commonly used to reduce right skewness. Option D is wrong because standardization does not change the shape of the distribution. Option B is wrong because min-max scaling does not change the shape either.

Option C is wrong because one-hot encoding is for categorical variables, not continuous ones.
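
The effect on skewness can be checked numerically. The following is a minimal sketch on synthetic log-normal data; the `skewness` helper and the sample itself are assumptions for illustration. It also assumes strictly positive values, since the logarithm is undefined at zero and below (a shifted variant such as `math.log1p` is a common workaround when zeros occur).

```python
import math
import random

def skewness(values):
    """Third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(42)
# Synthetic right-skewed feature: log-normal values are strictly positive.
feature = [random.lognormvariate(0.0, 1.0) for _ in range(10_000)]
logged = [math.log(v) for v in feature]

skew_before = skewness(feature)  # strongly positive
skew_after = skewness(logged)    # near zero, since log of log-normal is normal
```

Standardizing or min-max scaling `feature` instead would leave `skew_before` unchanged, because both are linear rescalings.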

113
MCQmedium

A data scientist is analyzing a dataset with 10 million rows and 50 columns. The target variable is highly imbalanced (99% negative, 1% positive). Which approach is most appropriate for exploratory data analysis before modeling?

A.Remove all negative examples and analyze only the positive ones.
B.Take a random sample of 100,000 rows from the entire dataset.
C.Take a stratified sample that preserves the 99:1 ratio.
D.Up-sample the minority class to balance the dataset before analysis.
AnswerC

Stratified sampling ensures representation of both classes.

Why this answer

Option C is correct because stratified sampling preserves the class proportion in the sample, which is critical for imbalanced data. Option B (random sample) may miss positives; Option D (up-sample minority) changes the distribution; Option A (remove negatives) loses information.
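
A proportion-preserving sample can be sketched in a few lines. The function name `stratified_sample`, the row layout, and the synthetic 99:1 data below are hypothetical and chosen only to make the idea concrete.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample the same fraction from every class so class proportions are kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sample = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))  # keep at least one row per class
        sample.extend(rng.sample(members, k))
    return sample

# Synthetic data: 99% negative (0), 1% positive (1).
rows = [{"id": i, "label": 1 if i % 100 == 0 else 0} for i in range(100_000)]
sample = stratified_sample(rows, lambda r: r["label"], fraction=0.01)
positive_share = sum(r["label"] for r in sample) / len(sample)
```

Here the 1,000-row sample contains 10 positives and 990 negatives, matching the original ratio; a plain random sample of the same size would only match it on average.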

114
MCQeasy

A data scientist needs to detect outliers in a dataset with multiple features that follow different distributions. Which method is most robust for multivariate outlier detection?

A.Z-score threshold
B.Interquartile range (IQR)
C.DBSCAN clustering
D.Isolation Forest
AnswerD

Correct: Isolation Forest works well for multivariate data without distributional assumptions.

Why this answer

Option D is correct because Isolation Forest is an ensemble method that isolates anomalies effectively in high-dimensional spaces without assuming a distribution. Option A is wrong because Z-score assumes a normal distribution. Option B is wrong because IQR is univariate.

Option C is wrong because DBSCAN is for clustering, not specifically for outlier detection.

115
Multi-Selecteasy

Which TWO actions should a data scientist take when exploring a dataset that contains missing values and outliers? (Select TWO.)

Select 2 answers
A.Calculate the percentage of missing values per column.
B.Normalize all features using Min-Max scaling.
C.Remove all rows with outliers.
D.Impute missing values with the mean immediately.
E.Visualize the distribution of each feature using histograms.
AnswersA, E

Missing value counts inform imputation strategy.

Why this answer

Options A and E are correct. A: Reporting the percentage of missing values per column is a standard EDA step. E: Visualizing the distribution helps identify shape and outliers.

D is wrong because imputation should be done after analysis. C is wrong because removing outliers without analysis is premature. B is wrong because normalization is not a first step.
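
A per-column missingness report is simple to compute. The sketch below uses plain dictionaries with `None` marking a missing value; the helper name and the tiny dataset are made up for illustration. With pandas, `df.isna().mean() * 100` gives the same kind of report.

```python
def missing_percentage(rows, columns):
    """Return {column: percent of rows where the value is None}."""
    total = len(rows)
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }

# Made-up rows; None marks a missing value.
rows = [
    {"age": 34, "city": "Austin", "income": 52000},
    {"age": None, "city": "Boston", "income": 61000},
    {"age": 29, "city": None, "income": None},
    {"age": 41, "city": None, "income": 58000},
]
report = missing_percentage(rows, ["age", "city", "income"])
```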

116
MCQmedium

A data scientist is analyzing a dataset with a target variable that is heavily imbalanced (e.g., 99% negative class, 1% positive class). Which exploratory data analysis technique is most appropriate to understand the relationship between features and the target before modeling?

A.Randomly sample 10% of the data and plot feature distributions by class.
B.Apply PCA to reduce dimensionality, then visualize the first two components.
C.Use stratified sampling to create a balanced subset, then compute correlation matrices and box plots.
D.Focus only on the majority class features to avoid bias.
AnswerC

Stratified sampling ensures both classes are represented, enabling meaningful EDA.

Why this answer

Option C is correct because sampling within each class stratum lets you control how many rows come from each class, so you can build a balanced subset for exploratory analysis. Computing correlation matrices and box plots on this balanced subset reveals feature-target relationships without being overwhelmed by the majority class, which is critical for imbalanced datasets like 99% negative vs. 1% positive.

Exam trap

The trap here is that candidates may think random sampling (Option A) is sufficient for EDA, but they overlook that severe class imbalance (99:1) makes random samples uninformative for the minority class, whereas stratified sampling explicitly addresses this by ensuring both classes are represented in the analysis subset.

How to eliminate wrong answers

Option A is wrong because random sampling of 10% of the data will likely preserve the original class imbalance (99:1), so feature distributions by class will still be dominated by the negative class, obscuring patterns for the rare positive class. Option B is wrong because PCA is an unsupervised dimensionality reduction technique that does not use the target variable; the first two components may capture variance unrelated to the target, and the resulting visualization may not highlight class-specific separations. Option D is wrong because focusing only on the majority class features ignores the minority class entirely, which is the very class of interest in imbalanced problems; this approach would miss important discriminative features and introduce bias.

117
MCQeasy

A data scientist runs the above AWS CLI command on a file in S3. What can be concluded from the output?

A.The ETag can be used for integrity checking.
B.The file is 1 GB in size.
C.The object has S3 versioning enabled.
D.The file has not been preprocessed.
AnswerA

For objects uploaded in a single part as plaintext or with SSE-S3, the ETag is an MD5 digest of the object data, so it can be used to detect changes.

Why this answer

Option A is correct because the ETag can be used to verify file integrity. Option B is wrong because ContentLength reports the size in bytes, and a ContentLength of 1048576 is 1 MB, not 1 GB. Option C is wrong because the ETag is not a version ID; S3 versioning uses VersionId.

Option D is wrong because preprocessed metadata indicates the file has been processed.
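
An integrity check against an ETag might look like the sketch below. It assumes the object was uploaded in a single part without SSE-KMS, which is the case where the ETag is a plain MD5 digest; multipart uploads produce a different ETag format and would not match. The function names and the sample bytes are hypothetical, and the ETag here is built locally rather than fetched from S3.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex-encoded MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()

def matches_etag(data: bytes, etag: str) -> bool:
    """Compare local bytes against an ETag; S3 returns the value wrapped in double quotes."""
    return md5_hex(data) == etag.strip('"')

local_bytes = b"id,amount\n1,9.99\n"
# Hypothetical ETag, constructed from the same bytes purely for illustration.
reported_etag = '"' + md5_hex(local_bytes) + '"'

unchanged = matches_etag(local_bytes, reported_etag)
modified = matches_etag(local_bytes + b"2,5.00\n", reported_etag)
```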

118
MCQeasy

A data scientist is using Amazon SageMaker Data Wrangler for exploratory data analysis. The dataset contains a column with missing values that are encoded as 'NA' strings. The data scientist wants to treat these as missing values during the import. Which step should the data scientist take?

A.Configure a custom missing value symbol 'NA' in the import settings of Data Wrangler.
B.Use the 'Impute' transform to fill 'NA' with the mean of the column.
C.Use the 'Replace missing' transform to replace 'NA' with null after import.
D.Use the 'Drop missing' transform to remove rows containing 'NA'.
AnswerA

Data Wrangler supports custom missing value symbols during data import.

Why this answer

Option A is correct because Data Wrangler allows specifying custom missing value symbols during import. Option D is wrong because dropping rows is premature. Option C is wrong because replacing after import is less efficient.

Option B is wrong because imputation comes only after the 'NA' strings are recognized as missing values.

119
MCQhard

The exhibit shows an IAM policy for a SageMaker notebook. A data scientist wants to use the notebook to run an Athena query and then load the results into a pandas DataFrame. Which action is NOT possible with this policy?

A.Read the Athena query results from the output S3 location
B.Start an Athena query execution
C.Read a specific object from the my-training-data bucket
D.List objects in the my-training-data bucket
AnswerA

The policy only allows read on my-training-data, not the Athena output bucket.

Why this answer

Option A is correct because the policy allows s3:GetObject only on the 'my-training-data' bucket, but Athena writes results to a different S3 location (specified in the query configuration). Without permission to read from that output bucket, the notebook cannot load the results. Option B is wrong because Athena actions are allowed.

Option D is wrong because listing the training data bucket is allowed. Option C is wrong because reading individual objects from the training data bucket is allowed.

120
MCQeasy

A data scientist uses Amazon SageMaker Data Wrangler to explore a dataset and notices that the target variable is highly imbalanced. Which technique should the data scientist apply to balance the dataset before training?

A.Synthetic Minority Oversampling Technique (SMOTE)
B.One-hot encoding of the target variable
C.Random undersampling of the majority class
D.Min-Max scaling of all features
AnswerA

SMOTE creates synthetic minority samples to balance the dataset.

Why this answer

Option A is correct because SMOTE generates synthetic samples for the minority class. Option C is wrong because it discards majority class data. Option D is wrong because scaling does not balance classes.

Option B is wrong because encoding is for categorical features.

121
MCQeasy

A data analyst is using Amazon QuickSight to explore a dataset with 10 million rows. The analyst wants to create a histogram of a numerical column. However, the query is taking too long. Which action should the analyst take to improve performance without losing accuracy?

A.Change the data source to Amazon Athena directly with a limit clause.
B.Reduce the number of bins in the histogram.
C.Use a sample of the data (e.g., 1 million rows) for the histogram.
D.Import the dataset into SPICE (Super-fast, Parallel, In-memory Calculation Engine).
AnswerD

SPICE accelerates queries by loading data into memory.

Why this answer

Option D is correct because the SPICE in-memory engine speeds up queries by caching data. Option C is wrong because sampling loses accuracy. Option B is wrong because reducing the bin count is a trade-off.

Option A is wrong because Athena is a query engine that QuickSight already uses, and a LIMIT clause would drop rows; SPICE is the better fit.

122
MCQeasy

After loading a dataset into a pandas DataFrame, a data scientist runs df.info() and sees that a column 'income' has object dtype. What does this indicate, and what EDA step should be taken?

A.The column has missing values; impute them.
B.The column contains strings; convert to numeric using pd.to_numeric() and investigate non-convertible values.
C.Normalize the column to a 0-1 range.
D.The column is already numeric; proceed.
AnswerB

Conversion to numeric is necessary for analysis; non-convertible values may indicate errors.

Why this answer

Option B is correct because object dtype typically indicates string or mixed types; converting to numeric allows mathematical operations. Option A is wrong because object dtype does not necessarily indicate missing values. Option D is wrong because an object-dtype column is not numeric.

Option C is wrong because normalization is done after conversion.
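
The conversion step can be sketched as follows, assuming pandas is available. The `income` values are invented for the example; `errors="coerce"` makes `pd.to_numeric` return NaN for anything it cannot parse, which makes the problem values easy to list.

```python
import pandas as pd

# Hypothetical 'income' column that was read in as strings.
df = pd.DataFrame({"income": ["52000", "61,000", "N/A", "58000", "unknown"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
converted = pd.to_numeric(df["income"], errors="coerce")

# Values that were present but failed to convert deserve a closer look.
bad_values = df.loc[converted.isna() & df["income"].notna(), "income"].tolist()
```

In this sample, `bad_values` contains the entry with a thousands separator and the two placeholder strings, each of which calls for a different cleanup decision.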

123
MCQhard

A data analyst is examining a dataset with a target variable that has three classes: A, B, C. They plot the distribution of a feature 'X' for each class and notice that for classes A and B, the distributions are bimodal, while for class C it is unimodal. They want to assess whether feature 'X' is useful for separating the classes. Which of the following metrics should they compute to quantify the separability?

A.ANOVA F-statistic between feature X and the target.
B.Variance ratio (between-group variance / within-group variance).
C.Chi-square test of independence.
D.Mutual information between X and the target.
AnswerA

ANOVA tests if the mean of X differs across classes.

Why this answer

Option A is correct because the ANOVA F-statistic tests whether means across groups are significantly different. Option C is wrong because chi-square is for categorical features. Option D is wrong because mutual information is used for feature selection but doesn't directly test separability.

Option B is wrong because variance ratio is not standard.

124
MCQeasy

A data scientist is exploring a dataset with 10 features and observes that the correlation between feature A and feature B is 0.98. Which action should be taken to address multicollinearity before training a linear regression model?

A.Use Principal Component Analysis (PCA) to combine them.
B.Apply Min-Max scaling to both features.
C.Remove one of the two features from the dataset.
D.Add polynomial features to both.
AnswerC

Dropping one highly correlated feature directly addresses multicollinearity.

Why this answer

Option C is correct because dropping one of the highly correlated features reduces redundancy and mitigates multicollinearity. Option B is wrong because scaling does not address collinearity. Option A is wrong because PCA creates orthogonal components but reduces interpretability; dropping a feature is more straightforward.

Option D is wrong because adding polynomial features increases correlation.

125
Multi-Selecthard

Which TWO statements about handling missing data during exploratory data analysis are correct? (Select TWO.)

Select 2 answers
A.Missing values can be ignored during EDA and handled during model training.
B.Visualizing the pattern of missingness can help determine if data is missing at random.
C.Understanding the missing data mechanism (MCAR, MAR, MNAR) is important for choosing an imputation strategy.
D.Listwise deletion (removing rows with missing values) is always safe and unbiased.
E.Imputing missing values with the mean preserves the original variance.
AnswersB, C

Missingness patterns inform assumptions about missing data mechanisms.

Why this answer

Options B and C are correct. B: Visualizing missing patterns (e.g., with a missingno matrix) is a good EDA practice. C: Understanding the mechanism (MCAR, MAR, MNAR) is critical for choosing an imputation method.

D is wrong because listwise deletion can introduce bias if the data is not MCAR. E is wrong because mean imputation reduces variance. A is wrong because missing values should be handled before model training, not after.

126
MCQeasy

A data scientist is exploring a dataset containing customer transactions. The dataset has a column 'transaction_amount' with values ranging from $0.01 to $10,000. Which EDA step is most appropriate to detect skewed distribution?

A.Create a count plot of transaction amounts
B.Create a correlation heatmap of all numeric columns
C.Create a histogram of transaction amounts
D.Create a scatter plot of transaction amount vs. customer age
AnswerC

A histogram makes skewness visible at a glance.

Why this answer

Option C is correct because a histogram or density plot reveals skewness visually. Option A is wrong because a count plot is for categorical data. Option D is wrong because a scatter plot shows the relationship between two variables.

Option B is wrong because a heatmap shows correlation, not skewness.

127
MCQeasy

A data scientist wants to identify outliers in a dataset with 1,000 samples and 5 numerical features. Which technique is most appropriate for univariate outlier detection?

A.Principal component analysis (PCA)
B.Interquartile range (IQR) method
C.Mahalanobis distance
D.Z-score with a threshold of 3
AnswerB

IQR is robust and suitable for univariate outlier detection.

Why this answer

Option B is correct because the IQR method flags points more than 1.5 × IQR below the first quartile or above the third quartile, which is robust to non-normal distributions. Option D is incorrect because Z-score assumes normality and is sensitive to extreme outliers. Option C is incorrect because Mahalanobis distance is for multivariate outliers.

Option A is incorrect because PCA is for dimensionality reduction, not outlier detection.
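
The IQR rule can be written out directly. The sketch below uses the standard library's `statistics.quantiles` with its default quantile method; the function name `iqr_outliers` and the sample values are assumptions for illustration, and other quantile conventions can shift the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up single feature with one obviously extreme value.
data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]
outliers = iqr_outliers(data)
```

Because the fences are built from quartiles, the extreme value does not inflate its own threshold the way it would inflate the mean and standard deviation used by a Z-score.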

128
MCQmedium

A data analyst is using Amazon Athena to query a partitioned dataset in S3. They notice that queries are scanning more data than expected. Which step should they take during exploratory data analysis to optimize query performance?

A.Convert the data to Parquet format.
B.Use S3 Select to filter data before querying.
C.Increase the number of workers in Athena.
D.Check the partition metadata to ensure queries are pruning partitions.
AnswerD

Verifying partition structure ensures efficient partition pruning.

Why this answer

Option D is correct because checking the partition metadata using SHOW PARTITIONS or querying the information_schema helps verify that the partition structure is correct and that queries are using partition pruning. Option B is incorrect because S3 Select filters the contents of individual objects and does not affect how Athena prunes partitions. Option A is incorrect because converting to Parquet improves columnar scanning but does not address partition misuse.

Option C is incorrect because increasing workers does not fix incorrect partition usage.

129
Multi-Selecthard

A data scientist is performing EDA on a dataset stored in Amazon S3 using Amazon Athena. The dataset is partitioned by date, and each partition contains CSV files. The data scientist notices that some queries return zero rows for partitions that should have data. Which THREE steps should the data scientist take to troubleshoot? (Choose 3.)

Select 3 answers
A.Verify that the CSV files exist in the S3 bucket for the specific partition.
B.Run MSCK REPAIR TABLE to add new partitions to the Glue Data Catalog.
C.Convert the CSV files to Parquet format.
D.Check the data types of the columns used in the query's WHERE clause.
E.Re-run the query with a LIMIT clause to force partition discovery.
AnswersA, B, D

Files may have been moved or deleted.

Why this answer

Option A is correct because manually checking files confirms data presence. Option B is correct because MSCK REPAIR TABLE adds partitions not yet registered. Option D is correct because incorrect data types can cause filters to exclude rows.

Option E is wrong because a LIMIT clause does not trigger partition discovery; the query does not automatically update partitions. Option C is wrong because converting to Parquet is not a troubleshooting step.

130
MCQeasy

In exploratory data analysis, a data scientist notices that the distribution of a continuous variable is bimodal. The scientist suspects that the two modes correspond to two different groups in the data. Which visualization is MOST appropriate to confirm this suspicion?

A.Box plot
B.Bar chart
C.Histogram with overlaid densities by group
D.Scatter plot
AnswerC

Overlaying densities by group allows visual comparison of the two modes.

Why this answer

Option C is correct because a histogram with hue (color) for each group can show the separate distributions. Option A is wrong because a box plot shows summary statistics but not the shape. Option D is wrong because a scatter plot is for two continuous variables.

Option B is wrong because a bar chart is for categorical data.

131
Multi-Selectmedium

Which THREE of the following are appropriate data visualization techniques for exploring the relationship between two numerical variables?

Select 3 answers
A.Scatter plot
B.Hexbin plot
C.Box plot
D.Bar chart
E.Pair plot
AnswersA, B, E

Scatter plots directly show the relationship between two numerical variables.

Why this answer

Scatter plot, hexbin plot, and pair plot are designed for bivariate numerical relationships. Bar chart is for categorical. Box plot is for numerical vs categorical.

132
MCQeasy

During exploratory data analysis, a data scientist notices that the target variable is highly imbalanced. Which technique should be used to address this issue before training a classification model?

A.Apply PCA to reduce dimensionality
B.Remove outliers from the majority class
C.Use cross-validation to evaluate the model
D.Apply feature scaling to all features
E.Use SMOTE to generate synthetic samples for the minority class
AnswerE

SMOTE is a standard technique for imbalanced classification.

Why this answer

SMOTE (Synthetic Minority Over-sampling Technique) is a popular method for handling imbalanced datasets by generating synthetic samples for the minority class. Option B is wrong because removing outliers does not address class imbalance. Option D is wrong because feature scaling does not affect imbalance.

Option A is wrong because PCA is for dimensionality reduction, not imbalance. Option C is wrong because cross-validation is a model evaluation technique, not an imbalance solution.

133
MCQhard

A data engineer is preparing a dataset for training a binary classification model. The target variable is highly imbalanced (95% negative, 5% positive). The engineer needs to split the data into training and test sets while maintaining the class distribution in both sets. Which method should the engineer use?

A.Use k-fold cross-validation and then split the data
B.Oversample the minority class first, then do a random split
C.Perform a simple random 80/20 split
D.Use stratified random sampling to split the data
AnswerD

Stratified split preserves class proportions in each subset.

Why this answer

Option D is correct because stratified random sampling ensures the proportion of classes is preserved in both training and test sets. Option C is wrong because simple random sampling may result in uneven distribution. Option B is wrong because oversampling should be done after splitting to avoid data leakage.

Option A is wrong because k-fold cross-validation is not a split method.

134
MCQmedium

Refer to the exhibit. A data scientist is unable to read a CSV file from the S3 bucket 'my-bucket' using SageMaker. The IAM policy attached to the SageMaker execution role is shown. What is the most likely cause of the failure?

A.The policy does not allow the s3:GetObject action
B.The policy does not grant read access to the bucket
C.The bucket uses server-side encryption with AWS KMS (SSE-KMS) and the policy lacks kms:Decrypt permission
D.The policy does not include s3:ListBucket action
AnswerC

KMS-encrypted objects require kms:Decrypt permission.

Why this answer

The policy grants s3:GetObject and s3:ListBucket, so the S3 permissions needed to read the object are present. A plausible remaining cause is that the bucket uses SSE-KMS and the policy does not include kms:Decrypt. Option C is correct because if the bucket uses KMS encryption, the role also needs KMS permissions to read the objects.

Option A is wrong because GetObject is present. Option D is wrong because ListBucket is present. Option B is wrong because the policy allows read access.

135
Multi-Selectmedium

A data scientist is performing EDA on a dataset with 500,000 rows and 20 columns. The dataset contains missing values in some columns. Which TWO approaches are appropriate for handling missing data during EDA? (Choose 2)

Select 2 answers
A.Use forward fill to propagate the last observed value
B.Remove all rows with any missing value (listwise deletion)
C.Create an indicator column to flag whether the value was missing, then impute with a placeholder
D.Impute missing values with the mean of each column
E.Impute missing values with the median for numerical columns and mode for categorical columns
AnswersC, E

This retains the information about missingness and is a common practice.

Why this answer

Options C and E are correct because imputation with median/mode is robust, and flagging missingness with an indicator variable preserves information. Option B is wrong because listwise deletion can introduce bias and reduce sample size. Option D is wrong because mean imputation is sensitive to outliers.

Option A is wrong because forward fill is for time series, not general EDA.
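
Combining a missingness indicator with median imputation could look like the sketch below. The function name, the `_was_missing` suffix, and the sample rows are hypothetical; `None` stands in for a missing value.

```python
import statistics

def add_missing_flag_and_impute(rows, column):
    """Add '<column>_was_missing' and fill None with the median of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(observed)
    result = []
    for r in rows:
        was_missing = r[column] is None
        new_row = dict(r)  # copy so the original rows stay untouched
        new_row[f"{column}_was_missing"] = was_missing
        new_row[column] = median if was_missing else r[column]
        result.append(new_row)
    return result

# Made-up rows with one missing value and one extreme value.
rows = [{"income": 40}, {"income": None}, {"income": 60}, {"income": 1000}]
imputed = add_missing_flag_and_impute(rows, "income")
```

With these values the median fill is 60, whereas a mean fill would be pulled toward the extreme value of 1000.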

136
MCQeasy

A data analyst is performing EDA on a dataset containing timestamps of user logins. They want to understand daily login patterns. The timestamp column is in Unix epoch format (integer). Which of the following is the most appropriate transformation to extract day-of-week patterns?

A.Convert the timestamps to datetime objects and extract the day-of-week.
B.Convert the timestamps to string and split into date and time.
C.Apply min-max scaling to the timestamp values.
D.Bin the timestamps into 1-hour intervals.
AnswerA

This enables grouping by day of the week to analyze patterns.

Why this answer

Option A is correct because converting to datetime allows extracting day-of-week. Option D is wrong because binning into hours loses day information. Option B is wrong because converting to string does not facilitate analysis.

Option C is wrong because scaling does not help.
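
The conversion from epoch seconds to a weekday can be sketched with the standard library. The timestamps below are made up, and interpreting them in UTC is an assumption; real login data may need a local time zone instead. In pandas, `pd.to_datetime(series, unit="s").dt.day_name()` performs the equivalent step.

```python
from collections import Counter
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_of_week(epoch_seconds):
    """Convert Unix epoch seconds to a weekday name, interpreted in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return DAY_NAMES[moment.weekday()]  # weekday(): Monday is 0

# Hypothetical login timestamps (seconds since 1970-01-01 UTC).
logins = [0, 86_400, 1_700_000_000, 1_700_003_600]
logins_per_day = Counter(day_of_week(ts) for ts in logins)
```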

137
MCQeasy

A data analyst needs to visualize the distribution of a numerical feature in a dataset. Which AWS service can be used to create a histogram directly from data stored in S3 without writing code?

A.Amazon Athena
B.Amazon SageMaker Studio
C.Amazon QuickSight
D.AWS Glue
AnswerC

QuickSight provides no-code visualizations like histograms.

Why this answer

Option C is correct because Amazon QuickSight is a BI service that can connect to S3 data and create histograms without coding. Option B is wrong because Amazon SageMaker Studio requires code or a notebook. Option A is wrong because Amazon Athena outputs query results, not visualizations directly.

Option D is wrong because AWS Glue is for ETL, not visualization.

138
MCQhard

A data science team at a financial services company is building a fraud detection model using a dataset of credit card transactions. The dataset contains 10 million rows and 20 features, including transaction amount, merchant category, time since last transaction, and customer ID. The target variable 'is_fraud' is highly imbalanced: only 0.1% of transactions are fraudulent. The team is performing exploratory data analysis (EDA) on a sample of 100,000 rows. They compute the correlation matrix and find that 'transaction amount' has a correlation of 0.02 with 'is_fraud'. They also plot the distribution of 'transaction amount' and see that it is heavily right-skewed with a long tail. The team wants to understand the relationship between 'transaction amount' and fraud more deeply before feature engineering. They have access to AWS SageMaker and can run processing jobs. Which course of action is most appropriate?

A.Conclude that 'transaction amount' is not predictive because the correlation is near zero
B.Train a random forest model on the sample and use feature importance to assess the predictive power of 'transaction amount'
C.Create bins for 'transaction amount' (e.g., 0-10, 10-50, 50-100, 100+) and compute the fraud rate per bin to detect any non-linear patterns
D.Apply a log transformation to 'transaction amount' to reduce skewness and re-run the correlation analysis
AnswerC

Binning and examining fraud rates per bin can reveal non-linear relationships.

Why this answer

Option C is correct because binning the transaction amount and computing fraud rates per bin can reveal non-linear relationships that correlation might miss. Option D is wrong because log transformation does not reveal the relationship with the target. Option A is wrong because a near-zero correlation can mask non-linearity.

Option B is wrong because feature importance from a tree model is more appropriate after feature engineering, not during EDA.
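
Computing a positive rate per bin can be sketched as follows. The bin edges follow the example ranges in the question, while the function name and the ten transactions are invented purely to show the mechanics; they are not real fraud data.

```python
import bisect

def rate_per_bin(amounts, labels, edges):
    """Positive-label rate within each amount bin defined by ascending edges."""
    counts = [0] * (len(edges) + 1)
    positives = [0] * (len(edges) + 1)
    for amount, label in zip(amounts, labels):
        idx = bisect.bisect_right(edges, amount)  # bin index for this amount
        counts[idx] += 1
        positives[idx] += label
    # None marks an empty bin, where a rate is undefined.
    return [p / c if c else None for p, c in zip(positives, counts)]

# Made-up sample; edges give bins <10, 10 to <50, 50 to <100, and 100+.
amounts = [5, 8, 12, 30, 45, 60, 75, 150, 400, 900]
is_fraud = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1]
rates = rate_per_bin(amounts, is_fraud, edges=[10, 50, 100])
```

In this toy sample the rate is elevated in both the smallest and the largest bin, a U-shaped pattern that a single linear correlation coefficient would understate. With a 0.1% base rate, real bins need enough rows for the per-bin rates to be stable.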

139
MCQhard

A data scientist is performing EDA on a dataset of 1 million images stored in Amazon S3. Each image is 100x100 pixels in RGB format. The data scientist wants to compute the mean pixel value per channel across the entire dataset. Which approach is most efficient?

A.Use Amazon SageMaker Processing with a custom Python script that iterates over S3 objects and aggregates pixel values.
B.Use Amazon Athena with a SQL query on the image metadata stored in a CSV file.
C.Use AWS Glue ETL to read images and compute the mean.
D.Use a SageMaker notebook instance with a large instance type to load all images into memory and compute the mean.
AnswerA

SageMaker Processing can distribute the workload across multiple instances for efficient computation.

Why this answer

Option A is correct because SageMaker Processing with a custom script can distribute the computation across multiple instances, making it efficient for large datasets. Option B is wrong because Athena is designed for structured data, not images, and a CSV file of image metadata does not contain the pixel values. Option C is wrong because AWS Glue is built for ETL on tabular data, not image processing.

Option D is wrong because loading all images into memory on a single notebook instance does not scale.
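
The aggregation that such a script performs can be sketched without any AWS calls. In this hedged example, in-memory NumPy arrays stand in for the S3 objects and `np.array_split` stands in for sharding across instances; the function names `partial_channel_stats` and `combine` are hypothetical. The point is that each worker only needs to emit a per-channel sum and a pixel count, which are cheap to merge.

```python
import numpy as np

def partial_channel_stats(images):
    """Per-channel pixel sum and pixel count for one shard of HxWx3 images."""
    total = np.zeros(3, dtype=np.float64)
    n_pixels = 0
    for img in images:
        # Accumulate in float64 so uint8 pixel values cannot overflow.
        total += img.reshape(-1, 3).sum(axis=0, dtype=np.float64)
        n_pixels += img.shape[0] * img.shape[1]
    return total, n_pixels

def combine(partials):
    """Merge (sum, count) pairs from all shards into per-channel means."""
    total = sum(t for t, _ in partials)
    n = sum(n for _, n in partials)
    return total / n

# Stand-in data: 40 random 100x100 RGB images instead of 1 million S3 objects.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(40, 100, 100, 3), dtype=np.uint8)

shards = np.array_split(images, 4)            # one shard per hypothetical worker
partials = [partial_channel_stats(s) for s in shards]
mean_per_channel = combine(partials)
print(mean_per_channel)
```

Because sums and counts are additive, no worker ever has to hold the full dataset in memory.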

140
MCQhard

A data scientist is analyzing a dataset with 1 million records and 20 features. The target variable is continuous. The scientist wants to identify non-linear relationships between features and the target. Which technique is MOST suitable for this purpose during exploratory data analysis?

A.Visualize the correlation matrix heatmap of all features.
B.Apply Principal Component Analysis (PCA) and examine the loadings.
C.Calculate mutual information scores between each feature and the target.
D.Compute Pearson correlation coefficients between each feature and the target.
AnswerC

Mutual information captures non-linear dependencies.

Why this answer

Option C is correct because mutual information captures any kind of statistical dependency, including non-linear ones. Option A is wrong because a correlation matrix heatmap shows pairwise linear correlations among features, not non-linear relationships with the target. Option B is wrong because PCA is for dimensionality reduction and does not describe feature-target relationships.

Option D is wrong because Pearson correlation measures only linear relationships.
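
The difference between the two measures can be illustrated with a small sketch. It assumes a crude histogram-based estimate of mutual information (the helper `binned_mutual_information` is hypothetical and biased for small samples) and a synthetic quadratic relationship; it is not a production estimator.

```python
import numpy as np

def binned_mutual_information(x, y, bins=20):
    """Crude mutual information estimate (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-1.0, 1.0, size=n)            # feature with a non-linear effect
noise_feature = rng.uniform(-1.0, 1.0, size=n)  # feature unrelated to the target
y = x**2 + 0.05 * rng.normal(size=n)          # symmetric, so linear correlation ~ 0

pearson = np.corrcoef(x, y)[0, 1]
mi_x = binned_mutual_information(x, y)
mi_noise = binned_mutual_information(noise_feature, y)

print(f"Pearson(x, y)      = {pearson:.3f}")
print(f"MI(x, y)           = {mi_x:.3f}")
print(f"MI(noise, y)       = {mi_noise:.3f}")
```

In practice a library estimator such as scikit-learn's `mutual_info_regression` is a more robust choice than this histogram sketch.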

141
MCQeasy

A data scientist is analyzing a dataset with a target variable that is binary (0/1). Which visualization is most appropriate to explore the relationship between a continuous feature and the target?

A.Scatter plot of the feature vs. the target.
B.Bar chart of the feature.
C.Box plot of the feature grouped by target.
D.Histogram of the feature.
AnswerC

Box plots compare distributions across categories.

Why this answer

Option C is correct because a box plot grouped by target shows how the distribution of the continuous feature differs between the two classes. Option A is wrong because a scatter plot is suited to two continuous variables. Option B is wrong because a bar chart is for categorical features.

Option D is wrong because a histogram of the feature alone shows the distribution of one variable and does not involve the target.

142
MCQmedium

A data scientist runs a logistic regression and obtains a model with 95% accuracy on the training set. However, the model performs poorly on the test set. Which exploratory data analysis step should have been performed to identify this issue?

A.Generating a correlation matrix of features
B.Log transformation of skewed features
C.Checking for class imbalance in the target variable
D.Creating a heatmap of missing values
AnswerC

Class imbalance can lead to high training accuracy but poor generalization.

Why this answer

Checking for class imbalance is critical because it can cause a model to predict the majority class and still achieve high accuracy, but fail on the minority class in unseen data. Option A is wrong because a correlation matrix helps with multicollinearity, not class imbalance. Option B is wrong because log transformation addresses skewness, not class imbalance.

Option D is wrong because missing value heatmaps show missing data patterns.

143
MCQeasy

A data scientist is analyzing a dataset with missing values in several columns. The dataset is stored in an S3 bucket. What is the most efficient method to identify the percentage of missing values per column using AWS services?

A.Use Amazon SageMaker Notebook with pandas to load the dataset and compute missing percentages.
B.Use Amazon QuickSight to connect to S3 and calculate missing value percentages via calculated fields.
C.Use Amazon Athena to query the data with SQL using COUNT(*) and CASE statements to compute missing percentage per column.
D.Use AWS Glue Crawler to infer schema and view missing values statistics in the AWS Glue Data Catalog.
AnswerC

Athena is serverless and can query S3 data directly, making it efficient for this task.

Why this answer

Option C is correct because Amazon Athena runs SQL queries directly on data in S3, and COUNT with CASE expressions can compute missing-value percentages efficiently without moving data. Option A is wrong because a SageMaker Notebook requires manual coding and is less efficient for quick checks. Option B is wrong because QuickSight is a visualization tool, not a tool for direct SQL-based analysis.

Option D is wrong because an AWS Glue Crawler only catalogs metadata and does not perform data analysis.
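
The COUNT and CASE pattern can be tried locally before running it in Athena. The sketch below uses an in-memory SQLite database purely as a stand-in for Athena, and the table name `transactions` and its columns are hypothetical; the shape of the query is the part that carries over.

```python
import sqlite3

# In-memory SQLite stands in for Athena; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (amount REAL, merchant TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(10.0, "a"), (None, "b"), (5.5, None), (None, "c")],
)

columns = ["amount", "merchant"]

# One expression per column: share of rows where the column is NULL.
select_parts = [
    f"100.0 * SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) / COUNT(*) "
    f"AS {c}_missing_pct"
    for c in columns
]
query = f"SELECT {', '.join(select_parts)} FROM transactions"

row = conn.execute(query).fetchone()
missing_pct = dict(zip(columns, row))
print(missing_pct)
conn.close()
```

A single pass over the table yields the percentage for every column, which is why this is cheaper than loading the data into a notebook first.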

144
MCQhard

During exploratory data analysis, a data scientist observes a strong correlation (r=0.95) between two numeric features. The model to be trained is a linear regression. What is the most appropriate action?

A.Apply standardization to both features.
B.Remove one of the correlated features.
C.Use L2 regularization (Ridge regression) without removing features.
D.Create an interaction term between the two features.
AnswerB

Removing one of the correlated features reduces multicollinearity in linear regression.

Why this answer

Option B is correct because a correlation of 0.95 indicates multicollinearity, which can be addressed by removing one of the two features. Option A is wrong because standardization does not change the correlation between features. Option C is wrong because regularization helps but is not the first step; removal is simpler.

Option D is wrong because an interaction term adds another correlated feature and increases multicollinearity.

145
MCQeasy

A data analyst wants to understand the distribution of a continuous variable. Which visualization is most appropriate for this purpose?

A.Box plot
B.Bar chart
C.Histogram
D.Scatter plot
AnswerC

Histogram displays the distribution of a single continuous variable.

Why this answer

Option C is correct because a histogram shows the frequency distribution of a continuous variable. Option A is wrong because a box plot shows summary statistics but not the full distribution. Option B is wrong because a bar chart is for categorical data.

Option D is wrong because a scatter plot shows the relationship between two variables.

146
MCQeasy

A data scientist is analyzing a dataset with missing values. Which technique is most appropriate for imputing missing values in a numerical feature that follows a normal distribution?

A.Mean imputation
B.Standard deviation imputation
C.Mode imputation
D.Median imputation
AnswerA

Mean imputation preserves the mean of the normal distribution.

Why this answer

Mean imputation is suitable for normally distributed data because it preserves the mean. Median imputation is preferred when the data is skewed or contains outliers, which is not the case here. Mode imputation is for categorical data.

Standard deviation is a measure of spread, not an imputation method.

147
MCQhard

A team is building a regression model to predict house prices. The dataset includes a column 'zip_code' with 100 unique values. The data scientist one-hot encodes this column, resulting in 100 new binary columns. The model shows poor performance on a validation set. What is the most likely cause?

A.One-hot encoding introduced multicollinearity among the binary columns.
B.One-hot encoding reduced the number of features, causing underfitting.
C.The one-hot encoding introduced high variance, but the validation set has low variance.
D.The model suffers from the curse of dimensionality due to the large number of features.
AnswerD

With 100 additional sparse features, the model may overfit and not generalize well.

Why this answer

One-hot encoding 'zip_code' with 100 unique values creates 100 sparse binary features. Each column is nonzero only for the rows in a single ZIP code, so many coefficients are estimated from few examples and the model can fit noise in the training data. Among the options listed, this effect of a large, sparse feature space on generalization is what option D describes.

Option D is therefore the best answer: the added dimensions make overfitting likely, which shows up as poor performance on the validation set.

Exam trap

AWS often tests the misconception that one-hot encoding mainly causes multicollinearity, when the primary risk with high-cardinality categories is overfitting due to the sparse, high-dimensional representation.

How to eliminate wrong answers

Option A is wrong because multicollinearity is not the most likely cause here: one-hot columns are mutually exclusive, and although keeping all 100 alongside an intercept makes them linearly dependent (the dummy variable trap), that mainly affects how the coefficients can be estimated and interpreted. Option B is wrong because one-hot encoding increases the number of features rather than reducing it, so it cannot cause underfitting through feature reduction. Option C is wrong because one-hot encoding can increase model variance (overfitting), but that says nothing about the variance of the validation set; the issue is that the model overfits the training data.

148
MCQhard

A data engineer is performing EDA on a dataset with 1 million rows and 200 columns. The dataset is stored in S3 as CSV files. The engineer notices that some columns have a high proportion of zeros. What is the best approach to determine if these zeros represent missing data or actual zero values?

A.Check correlation of zero columns with other features; if low, assume zeros are missing.
B.Calculate the percentage of zeros and compare with other columns; if unusually high, treat as missing.
C.Use AWS Glue Data Catalog to view column statistics and infer missing values.
D.Consult the data source documentation or domain experts to understand the meaning of zero values.
AnswerD

Domain knowledge is crucial for accurate interpretation of data.

Why this answer

Option D is correct because domain knowledge and documentation are the most reliable ways to understand the meaning of zeros. Option A is wrong because statistical methods cannot distinguish missing values from actual zeros without context. Option B is wrong because comparing the percentage of zeros with other columns might be misleading.

Option C is wrong because the catalog metadata may not have this detail.

149
Multi-Selecthard

Which TWO techniques can be used to detect multicollinearity among numerical features during exploratory data analysis? (Choose two.)

Select 2 answers
A.Apply Principal Component Analysis (PCA) and examine loadings.
B.Compute a correlation matrix and look for pairs with absolute correlation > 0.8.
C.Perform a t-test between each pair of features.
D.Calculate Variance Inflation Factor (VIF) for each feature.
E.Use a chi-square test of independence.
AnswersB, D

High correlation indicates multicollinearity.

Why this answer

Options B and D are correct. B: A correlation matrix shows pairwise correlations; high absolute values indicate collinearity. D: Variance Inflation Factor (VIF) quantifies how much of a feature's variance is explained by the other features.

A: PCA reduces dimensionality but does not detect collinearity directly. C: t-tests compare means. E: Chi-square tests independence for categorical variables.
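
Both checks can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic features: the `vif` helper is hypothetical, the 0.8 correlation cutoff comes from the question, and the VIF cutoff of 10 is only a common rule of thumb, not something the question specifies.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R^2) from regressing
    each feature on all the other features plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
n = 2_000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)     # nearly a copy of x1
x3 = rng.normal(size=n)                # independent of the others
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)    # pairwise correlation matrix
vifs = vif(X)

high_pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)
              if abs(corr[i, j]) > 0.8]
print("pairs with |r| > 0.8:", high_pairs)
print("VIF per feature:", np.round(vifs, 1))
```

The correlation matrix only catches pairwise relationships, while VIF also flags a feature that is explained by a combination of several others.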

150
Multi-Selecthard

Which TWO statements about data leakage in machine learning are correct? (Select TWO.)

Select 2 answers
A.Using the target variable to filter features before splitting leads to data leakage
B.Applying SMOTE after splitting the dataset prevents data leakage
C.Applying standardization on the entire dataset before splitting into training and test sets can cause data leakage
D.Using cross-validation eliminates all possible data leakage
E.For time series data, using a random train-test split is recommended to avoid data leakage
AnswersA, C

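Leakage occurs when information from outside the training set, such as test-set statistics or the target, influences preprocessing or feature selection.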

Why this answer

Options A and C are correct. Scaling before splitting is a classic source of data leakage. Using target information to filter features is also leakage.

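Option B is not selected because applying SMOTE after splitting, to the training set only, avoids one source of leakage but does not by itself prevent leakage from other sources. Option D is wrong because cross-validation prevents leakage only when preprocessing is fit inside each fold; it does not eliminate every possible source. Option E is wrong because time series data calls for a time-based split; a random split lets the model train on observations from the future.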
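
The standardization case can be made concrete with a short sketch. It assumes synthetic data and a simple 80/20 random split; the variable names are illustrative. The scaling statistics are computed from the training rows only and then reused for the test rows, and the leaky variant is shown only for contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1_000, 3))

# Split first (assumption: a plain random 80/20 split is appropriate here).
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:800], idx[800:]
X_train, X_test = X[train_idx], X[test_idx]

# Correct: fit the scaling statistics on the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma      # reuse training statistics

# Leaky variant, for contrast: statistics computed on all rows,
# so the test rows influence how the training data is transformed.
leaky_mu = X.mean(axis=0)

print("train-only mean:", np.round(mu, 4))
print("full-data mean: ", np.round(leaky_mu, 4))
```

The two sets of statistics differ only slightly on data like this, but the principle is the same for any fitted preprocessing step: fit on the training split, then apply to the test split.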
