Knowledge + Practice

CCNA Exploratory Data Analysis Questions

31 of 406 questions · Page 6/6 · Exploratory Data Analysis topic · Answers revealed


376

MCQeasy

A data scientist is exploring a dataset and wants to check for missing values. Which method is most appropriate to identify the percentage of missing values per column?

A.Use Amazon S3 Select to query missing values

B.Use Amazon Athena to run a SELECT COUNT(*) query

C.Use Amazon QuickSight to create a missing value dashboard

D.Use AWS Glue Crawler to detect missing values

E.Use pandas .isnull().sum() in a SageMaker notebook

AnswerE

This is a direct and efficient way to count missing values per column.

Why this answer

Using pandas .isnull().sum() in a SageMaker notebook is a standard way to count missing values per column; dividing by the row count (or calling .isnull().mean()) turns the counts into percentages. Option A is wrong because S3 Select is for filtering the contents of S3 objects, not for data analysis. Option B is wrong because Athena requires SQL, and a SELECT COUNT(*) query only counts rows, so it is less direct for EDA.

Option C is wrong because QuickSight is for visualization, not for programmatic missing-value analysis. Option D is wrong because a Glue Crawler discovers schema, not missing values.
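As a minimal sketch of the pandas approach, the snippet below computes both the count and the percentage of missing values per column. The DataFrame and its column names are hypothetical; in a notebook the frame would normally be loaded from a file instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29],
    "city": ["Lyon", "Oslo", None, None],
    "income": [52000, 61000, 58000, 47000],
})

missing_count = df.isnull().sum()        # missing cells per column
missing_pct = df.isnull().mean() * 100   # same information as a percentage

print(missing_pct.sort_values(ascending=False))
```

Taking the mean of the boolean mask works because True counts as 1 and False as 0, so the mean is the fraction of missing rows.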


377

MCQmedium

A machine learning engineer is examining a dataset containing text reviews. They want to convert the text into numerical features for a model. During EDA, they notice that the word 'the' appears in almost every review, while words like 'excellent' appear rarely. Which of the following techniques should they use to reduce the impact of very common words?

A.Apply TF-IDF transformation.

B.Remove stopwords from the text.

C.Use word2vec embeddings.

D.Use a bag-of-words representation.

AnswerA

TF-IDF reduces the weight of terms that appear frequently across documents.

Why this answer

Option A is correct because TF-IDF downweights words that appear in most documents. Option B is wrong because removing stopwords helps with a fixed word list but does not adjust weights by frequency beyond that. Option C is wrong because word2vec focuses on context, not frequency weighting.

Option D is wrong because a bag-of-words representation uses raw counts and applies no weighting.
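To illustrate the downweighting, here is a small sketch of the textbook TF-IDF formula (term frequency times log(N / document frequency)) in plain Python. The reviews are hypothetical, and library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant, so their numbers would differ.

```python
import math
from collections import Counter

# Hypothetical toy reviews, already lowercased and split on whitespace.
docs = [
    "the food was excellent".split(),
    "the service was slow".split(),
    "the room was clean and the staff friendly".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Return {term: tf * idf} for one document, with idf = log(N / df)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = tf_idf(docs[0])
print(weights)
```

With this unsmoothed formula, a word that occurs in every document (such as 'the') gets an idf of log(1) = 0, while a rare word such as 'excellent' keeps a positive weight.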


378

MCQeasy

A data scientist is exploring a dataset with 100 features. The goal is to build a binary classification model. The dataset is highly imbalanced with 95% negative class and 5% positive class. The data scientist wants to understand the relationship between features and the target. Which technique is most appropriate for initial exploratory analysis?

A.Remove the minority class samples and analyze the majority class only.

B.Use stratified sampling to create a balanced subset for visualization and correlation analysis.

C.Use random sampling to select 10% of the data for EDA.

D.Apply SMOTE to the dataset before performing EDA.

AnswerB

Stratified sampling preserves the proportion of each class and ensures the minority class is included in the analysis.

Why this answer

Option B is correct because stratified sampling ensures that the minority class is adequately represented in the sample. Option A is wrong because removing the minority class would prevent analyzing the target. Option C is wrong because random sampling may under-represent or miss the rare class.

Option D is wrong because SMOTE is a data augmentation technique for training, not for EDA.


379

MCQhard

A data scientist is performing EDA on a dataset with 100 features. They want to identify which features are most predictive of the target using a model-agnostic method. Which technique should they use?

A.Pearson correlation matrix

B.L1 regularization

C.SHAP values

D.Permutation feature importance

AnswerD

Permutation importance works with any model and measures drop in performance when a feature is shuffled.

Why this answer

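Option D is correct because permutation importance is model-agnostic and measures feature importance by shuffling one feature at a time. Option A is wrong because a correlation matrix is bivariate and captures only linear relationships. Option B is wrong because L1 regularization is model-specific.

Option C is wrong here because SHAP values are typically computed with model-specific explainers (such as TreeSHAP), although a model-agnostic kernel variant exists.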
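The shuffling idea can be sketched in a few lines of NumPy. Everything below is an illustrative assumption: the synthetic data, the least-squares model, and the helper names (predict, r2, permutation_importance). Any fitted model with a predict function could take the model's place, and in practice the score is usually computed on held-out data rather than on the training set as done here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: only the first feature drives the target.
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=500)

# Stand-in model: ordinary least squares fitted with NumPy.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(features):
    return features @ coef

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance(X, y, n_repeats=10):
    """Mean drop in R^2 when each column is shuffled in turn."""
    baseline = r2(y, predict(X))
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - r2(y, predict(X_perm)))
        importances.append(float(np.mean(drops)))
    return importances

importances = permutation_importance(X, y)
print(importances)
```

Because the procedure only calls predict and a scoring function, it never looks inside the model, which is what makes it model-agnostic.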


380

MCQmedium

A data scientist is performing EDA and observes that a feature 'purchase_amount' has many zeros and a long tail of positive values. What type of model would be appropriate for this target variable?

A.Zero-inflated negative binomial regression.

B.Linear regression after log transformation.

C.Logistic regression on binary indicator of purchase.

D.Poisson regression.

AnswerA

Handles excess zeros and overdispersion.

Why this answer

Option A is correct because zero-inflated models handle excess zeros, and the negative binomial component handles overdispersion. Option B is wrong because the log of zero is undefined and linear regression assumes normally distributed errors. Option C is wrong because logistic regression is for binary outcomes.

Option D is wrong because Poisson regression is for count data without excess zeros.


381

Multi-Selecteasy

A data scientist is exploring a dataset with a binary target variable. Which TWO metrics are appropriate for evaluating the balance of the target classes? (Choose two.)

Select 2 answers

A.Count plot of the target variable

B.Histogram of a feature

C.Scatter plot of two features colored by target

D.value_counts() on the target column

E.Correlation matrix of all features

AnswersA, D

Count plot shows frequency of each class.

Why this answer

Options A and D are correct. A count plot and value_counts() both show class frequencies. Option B is wrong because a histogram of a feature shows the distribution of that feature, not of the target.

Option C is wrong because a scatter plot shows the relationship between two numeric variables. Option E is wrong because a correlation matrix shows relationships between features.


382

MCQhard

A data scientist is analyzing a dataset with 500 features and 100,000 observations. The target variable is binary. The dataset contains highly correlated features and some categorical variables with high cardinality. Which combination of techniques should the data scientist use to reduce dimensionality while preserving interpretability for EDA?

A.Apply Principal Component Analysis (PCA) to all features and then train a model on the top 50 components.

B.Use mutual information to select top features and apply label encoding to categorical variables.

C.Use chi-squared test to select top features and one-hot encode categorical variables.

D.Apply correlation-based feature selection to remove highly correlated pairs, then use target encoding for high-cardinality categorical variables.

AnswerD

Correlation filter reduces redundancy; target encoding converts categoricals to numeric without increasing dimensionality.

Why this answer

Option D is correct because correlation-based feature selection removes redundant features, and target encoding handles high-cardinality categoricals without expanding dimensions. Option A is wrong because PCA reduces interpretability and does not handle categoricals. Option B is wrong because mutual information helps with feature selection but does not address high cardinality directly, and label encoding imposes an arbitrary order on the categories.

Option C is wrong because the chi-squared test applies only to categorical features, and one-hot encoding high-cardinality variables explodes the number of dimensions.


383

MCQeasy

During exploratory data analysis, a data scientist notices that a categorical feature 'city' has over 1,000 unique values. The dataset has 10,000 rows. Which technique should the scientist consider to reduce the cardinality of this feature?

A.Apply label encoding to assign numeric labels.

B.Group low-frequency categories into a single 'other' category.

C.Apply one-hot encoding to all categories.

D.Apply frequency encoding to replace each category with its frequency.

AnswerB

Grouping rare categories reduces cardinality effectively.

Why this answer

Grouping rare categories into an 'other' bucket is a common technique to reduce cardinality. Option A (label encoding) does not reduce the number of unique values. Option C (one-hot encoding) would create too many columns.

Option D (frequency encoding) replaces categories with their frequency but can still leave a large number of distinct values.
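A minimal pandas sketch of grouping rare categories follows. The city values and the min_count threshold of 2 are hypothetical choices for illustration; a real threshold would depend on the dataset.

```python
import pandas as pd

# Hypothetical 'city' column; the threshold is an arbitrary illustrative choice.
cities = pd.Series(["Paris", "Paris", "Paris", "Lyon", "Lyon", "Nice", "Metz", "Brest"])
min_count = 2

counts = cities.value_counts()
rare = counts[counts < min_count].index            # categories seen fewer than min_count times
grouped = cities.where(~cities.isin(rare), "other")

print(cities.nunique(), "->", grouped.nunique())
```

Series.where keeps the values where the condition holds and substitutes 'other' elsewhere, so frequent categories pass through unchanged.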


384

MCQhard

A data scientist is working on a customer churn prediction project for a telecom company. The dataset contains 50,000 records with 25 features, including 'tenure' (number of months customer stayed), 'monthly_charges', 'total_charges', 'contract_type' (month-to-month, one year, two year), 'payment_method', and a target 'churn' (Yes/No). The data is stored in an S3 bucket as a single CSV file. The scientist uses Amazon SageMaker Data Wrangler to perform EDA. After importing the data, the scientist notices that the 'total_charges' column has many missing values (about 20% of rows). The scientist suspects that missing values occur only for customers with tenure = 0 (new customers). After verifying that suspicion, the scientist wants to handle the missing values appropriately. Which course of action should the scientist take?

A.Use a regression model to predict total_charges based on other features.

B.Impute missing total_charges with the mean of non-missing values.

C.Drop all rows with missing total_charges to avoid bias.

D.Impute missing total_charges with 0, since missing values correspond to customers with tenure=0.

AnswerD

Given the pattern, total_charges should be 0 for new customers; imputing with 0 preserves data integrity.

Why this answer

Option D is correct because if total_charges is missing only for tenure=0, those customers have not been billed yet, so total_charges should be 0. Imputing with 0 is appropriate. Option A is wrong because using a model to predict missing values is overkill and may introduce error when the true value is known to be 0.

Option B is wrong because imputing with the mean would assign incorrect values to new customers. Option C is wrong because dropping rows with missing total_charges would remove all new customers, biasing the dataset.
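The verify-then-impute sequence described in the question could look like the sketch below in pandas. The rows are hypothetical; only the column names come from the question, and Data Wrangler itself would perform the equivalent step through its own transforms.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the column names from the question.
df = pd.DataFrame({
    "tenure": [0, 12, 0, 24],
    "monthly_charges": [29.9, 70.0, 45.5, 99.0],
    "total_charges": [np.nan, 840.0, np.nan, 2376.0],
})

# Step 1: verify that every missing value belongs to a tenure == 0 customer.
missing = df["total_charges"].isna()
assert (df.loc[missing, "tenure"] == 0).all()

# Step 2: impute 0 only for those rows, leaving any other gaps untouched.
mask = missing & (df["tenure"] == 0)
df.loc[mask, "total_charges"] = 0.0
```

Restricting the fill to the tenure == 0 rows keeps the imputation tied to the verified pattern rather than blanket-filling the column.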


385

MCQhard

A machine learning engineer is evaluating a dataset for building a fraud detection model. The dataset has 1 million transactions, but only 500 are fraudulent. The engineer wants to understand the distribution of fraudulent vs. non-fraudulent transactions over time. Which EDA visualization is most suitable?

A.Bar chart of transaction count per day with colors for fraud status

B.Scatter plot of transactions over time colored by fraud status

C.Box plot of transaction amount per month grouped by fraud status

D.Line plot of daily fraud rate and non-fraud rate

AnswerD

A line plot of the daily fraud and non-fraud rates shows how each class changes over time.

Why this answer

Option D is correct because a time series line plot with two lines (fraud vs. non-fraud) shows temporal patterns. Option A is wrong because, with only 500 fraudulent transactions in 1 million, the fraud counts would be barely visible next to the non-fraud counts in a bar chart of daily counts. Option B is wrong because a scatter plot with 1 million points is overwhelming.

Option C is wrong because box plot shows distribution per time period but not temporal trend.


386

MCQhard

A data scientist is analyzing a dataset with many categorical features. The target variable is binary. Which statistical test should be used to assess the association between each categorical feature and the target?

A.Pearson correlation coefficient

B.Chi-squared test of independence

C.ANOVA

D.Kolmogorov-Smirnov test

AnswerB

Chi-squared tests association between categorical variables.

Why this answer

Option B is correct because the chi-squared test is used to test independence between categorical variables. Option A is wrong because Pearson correlation is for continuous variables. Option C is wrong because ANOVA is for a continuous target.

Option D is wrong because the Kolmogorov-Smirnov test is for distribution comparison.
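As a rough sketch of what the test computes, the snippet below derives the chi-squared statistic from a contingency table with NumPy. The table (contract type against churn) is hypothetical. In practice a library routine such as scipy.stats.chi2_contingency would also return the p-value.

```python
import numpy as np

# Hypothetical contingency table: rows = contract type, columns = churn (No, Yes).
observed = np.array([
    [30, 70],
    [60, 40],
    [80, 20],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if the feature and the target were independent.
expected = row_totals @ col_totals / total

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"chi2={chi2:.2f}, dof={dof}")
```

The statistic is then compared against a chi-squared distribution with dof degrees of freedom; a large value suggests the feature and the target are not independent.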


387

MCQhard

Refer to the exhibit. A data scientist is running an Amazon EMR Spark job for exploratory data analysis on a large dataset. The job fails with the error shown. What is the most appropriate action to resolve this?

A.Reduce the number of worker nodes.

B.Convert the input data to Parquet format.

C.Increase the executor memory in Spark configuration.

D.Increase the driver memory.

AnswerC

More memory per executor prevents heap overflow.

Why this answer

Option C is correct because increasing the executor memory in the Spark configuration lets each executor handle more data. Option A (fewer nodes) reduces resources; Option B (Parquet) may help but does not directly address memory; Option D (more driver memory) may not help if the executors are the ones running out of memory.


388

MCQmedium

A data scientist is exploring log files stored in S3. They ran the above AWS CLI command. What does the output indicate about the data, and what EDA step should be taken next?

A.All log files are about 150KB-200KB in size.

B.There are 3 objects in the bucket under the prefix.

C.There are 3 log files larger than 100KB in the specified prefix.

D.The prefix 'logs/2023/' contains exactly 3 objects.

AnswerC

The command filters by size >100000 bytes and returns keys and sizes.

Why this answer

Option C is correct because the command filters objects larger than 100000 bytes, and the output shows three large files. Option A is wrong because the output shows only the three files that passed the filter, not all files. Option B is wrong because the command does not count all objects; it filters by size.

Option D is wrong because the command explicitly filters by size, so the output is not all objects.


389

Multi-Selecthard

Which THREE techniques are commonly used to detect multicollinearity in a dataset during exploratory data analysis?

Select 3 answers

A.Heatmap of missing values

B.Eigenvalue analysis from PCA

C.Correlation matrix

D.Variance Inflation Factor (VIF)

E.Scatter matrix of all features

AnswersB, C, D

Near-zero eigenvalues indicate linear dependencies.

Why this answer

Options B, C, and D are correct. C: A correlation matrix shows pairwise correlations; high values indicate collinearity. D: Variance Inflation Factor (VIF) quantifies how much a feature is explained by the others.

B: Eigenvalues from PCA can indicate multicollinearity if some are near zero. Option E is incorrect because a scatter matrix shows pairwise relationships but is not a measure of multicollinearity. Option A is incorrect because a heatmap of missing values is unrelated to multicollinearity.
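The VIF idea can be sketched directly from its definition, VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing feature j on the remaining features. The synthetic features and the vif helper below are assumptions for illustration, not a library API.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical features: x2 is almost a copy of x1, x3 is independent.
x1 = rng.normal(size=300)
x2 = x1 + 0.05 * rng.normal(size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF for column j: regress it on the other columns and return 1 / (1 - R^2)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(others)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r2 = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

A VIF close to 1 means the feature is not explained by the others; common rules of thumb treat values above roughly 5 or 10 as a sign of problematic collinearity.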


390

Multi-Selecteasy

A data scientist is working with a dataset that contains geolocation coordinates (latitude and longitude) and timestamps. The scientist wants to visualize the data to check for spatial and temporal patterns. Which TWO AWS services can be used for this visualization?

Select 2 answers

A.Amazon Comprehend

B.Amazon SageMaker Data Wrangler

C.Amazon Rekognition

D.Amazon QuickSight

E.AWS Glue

AnswersB, D

Can create scatter plots with coordinates.

Why this answer

Options B and D are correct. Amazon QuickSight supports geospatial charts and time-series. Amazon SageMaker Data Wrangler can also visualize geospatial data as scatter plots.

Option A is wrong because Amazon Comprehend is for NLP. Option C is wrong because Amazon Rekognition is for image/video analysis. Option E is wrong because AWS Glue is for ETL.


391

MCQmedium

A machine learning engineer is exploring a dataset with 50 features. Some features are highly correlated. Which technique should the engineer use to reduce dimensionality while preserving variance?

A.Principal Component Analysis (PCA)

B.Factor Analysis

C.t-Distributed Stochastic Neighbor Embedding (t-SNE)

D.Linear Discriminant Analysis (LDA)

AnswerA

PCA reduces dimensionality by finding components that maximize variance.

Why this answer

PCA (Principal Component Analysis, Option A) is the standard technique for dimensionality reduction by projecting data onto principal components that capture maximum variance. LDA (Option D) is supervised and aims to separate classes. t-SNE (Option C) is for visualization.

Factor analysis (Option B) assumes latent factors.


392

Multi-Selecthard

Which THREE are valid reasons to perform feature scaling during exploratory data analysis?

Select 3 answers

A.To improve performance of distance-based algorithms like KNN.

B.To change the shape of the feature distribution.

C.To increase the number of features.

D.To ensure features have zero mean and unit variance.

E.To reduce the effect of outliers by clipping values.

AnswersA, D, E

Distance algorithms are sensitive to scale.


393

MCQeasy

A data scientist is analyzing a dataset and notices that the distribution of a continuous feature is heavily right-skewed. Which transformation is most likely to make the distribution more symmetric?

A.Log transformation (natural log)

B.Min-Max scaling

C.One-hot encoding

D.Square transformation

AnswerA

Log transformation compresses high values, reducing right skew.

Why this answer

Option A is correct because log transformation is commonly used to reduce right skewness. Option B is wrong because Min-Max scaling does not change the shape of the distribution. Option C is wrong because one-hot encoding is for categorical features.

Option D is wrong because square transformation increases skewness.
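A quick way to see the effect is to compare skewness before and after the transform. The sketch below uses a hypothetical log-normal feature and a hand-written skewness helper (the third standardised moment); both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def skewness(values):
    """Sample skewness: third standardised moment."""
    centred = values - values.mean()
    return float((centred ** 3).mean() / values.std() ** 3)

# Hypothetical right-skewed feature drawn from a log-normal distribution.
feature = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Natural log; np.log1p (log of 1 + x) is a common alternative when zeros are present.
transformed = np.log(feature)

print(skewness(feature), skewness(transformed))
```

The raw feature has a strongly positive skewness, while the log-transformed values are close to symmetric because the log of a log-normal variable is normally distributed.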


394

MCQeasy

A data scientist wants to understand the relationship between a categorical feature with 3 levels and a continuous target variable. Which visualization is most appropriate?

A.Correlation matrix

B.Line chart

C.Box plot grouped by category

D.Scatter plot

AnswerC

Box plots compare distributions across categories.

Why this answer

A box plot grouped by category (Option C) is the most appropriate visualization because it directly compares the distribution of a continuous target variable across the three levels of a categorical feature. It displays median, quartiles, and potential outliers for each group, making it ideal for understanding central tendency, spread, and skewness in a side-by-side comparison.

Exam trap

The trap here is that candidates often confuse the purpose of a scatter plot (for two continuous variables) with the need to compare a continuous variable across categories, leading them to choose Option D instead of recognizing that a grouped box plot is the standard tool for this task.

How to eliminate wrong answers

Option A is wrong because a correlation matrix is used to quantify linear relationships between continuous variables, not between a categorical feature and a continuous target. Option B is wrong because a line chart is designed to show trends over a continuous or time-ordered axis, not to compare distributions across discrete categories. Option D is wrong because a scatter plot visualizes the relationship between two continuous variables; it cannot effectively display a categorical feature with only three levels without overplotting or requiring jittering, and it does not summarize distributional properties like median or quartiles.


395

MCQmedium

A data scientist is troubleshooting access to an S3 bucket. The above IAM policy is attached to their role. What is the likely result when they try to list objects in the 'confidential' folder?

A.Access is allowed because the Allow statement grants s3:ListBucket.

B.Access is denied unconditionally.

C.Access is allowed only if the request uses HTTPS.

D.Access is denied if the request does not originate from the specified VPC endpoint.

AnswerD

The condition requires the request to come from vpce-12345678 to allow access.

Why this answer

Option D is correct because the Deny statement explicitly denies s3:* actions on the confidential folder unless the request comes from a specific VPC endpoint. If the request comes from outside that endpoint or from a different VPC, it is denied. Option A is wrong because an explicit Deny overrides the Allow.

Option B is wrong because the Deny is conditional. Option C is wrong because the condition restricts access by VPC endpoint, not by HTTPS.


396

MCQhard

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?

A.Apply PCA to all features to decorrelate them.

B.Standardize all features using StandardScaler.

C.For each highly correlated pair, remove one feature based on domain knowledge or higher correlation with target.

D.Randomly drop half of the correlated features.

AnswerC

This reduces redundancy while retaining predictive power.

Why this answer

Option C is correct because when features are highly correlated (e.g., > 0.95), they introduce multicollinearity, which can destabilize coefficient estimates in linear models and reduce interpretability. Removing one feature from each correlated pair based on domain knowledge or its correlation with the target variable preserves predictive power while reducing redundancy. This approach is more targeted than PCA, which transforms features into uncorrelated components but sacrifices interpretability and may not align with the binary target.

Exam trap

A commonly tested misconception is that PCA is the default solution for multicollinearity, but the trap here is that PCA transforms features into uninterpretable components, whereas removing correlated features directly preserves the original feature space and domain relevance.

How to eliminate wrong answers

Option A is wrong because PCA decorrelates features by projecting them onto orthogonal components, but it does not remove features—it creates new synthetic features that are linear combinations of the originals, losing interpretability and potentially discarding target-specific information. Option B is wrong because standardizing features (e.g., using StandardScaler) only scales them to zero mean and unit variance, which does not address multicollinearity; it is a preprocessing step for algorithms sensitive to feature scales, not a remedy for correlated features. Option D is wrong because randomly dropping half of the correlated features ignores the relationship between features and the target variable, which can discard informative predictors and degrade model performance; a principled selection based on domain knowledge or target correlation is required.
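The selection rule in Option C (for each highly correlated pair, keep the feature that is more correlated with the target) can be sketched with pandas as follows. The synthetic features, the 0.95 threshold taken from the question, and the tie-breaking by target correlation alone are illustrative assumptions; domain knowledge could replace that last step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1_000

# Hypothetical features: f2 is nearly a duplicate of f1.
f1 = rng.normal(size=n)
df = pd.DataFrame({
    "f1": f1,
    "f2": f1 + 0.1 * rng.normal(size=n),
    "f3": rng.normal(size=n),
})
target = pd.Series(2.0 * df["f1"] + df["f3"] + rng.normal(size=n), name="target")

corr = df.corr().abs()                      # pairwise feature correlations
target_corr = df.corrwith(target).abs()     # each feature's correlation with the target

to_drop = set()
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a in to_drop or b in to_drop:
            continue
        if corr.loc[a, b] > 0.95:
            # Drop whichever feature of the pair is less correlated with the target.
            to_drop.add(a if target_corr[a] < target_corr[b] else b)

reduced = df.drop(columns=sorted(to_drop))
print(reduced.columns.tolist())
```

Unlike PCA, the surviving columns are original features, so they keep their names and meaning.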


397

MCQeasy

A data scientist is performing EDA on a dataset with both numerical and categorical features. Which technique is best for detecting multicollinearity among numerical features?

A.Chi-square test of independence

B.Box plots for each numerical feature

C.Correlation matrix with heatmap

D.Pair plot

AnswerC

Correlation matrix shows pairwise linear correlations, indicating multicollinearity.

Why this answer

Option C is correct because a correlation matrix quantifies linear relationships between numerical features. Option A is wrong because the chi-square test is for categorical associations. Option B is wrong because box plots show distribution, not relationships.

Option D is wrong because pair plots visualize scatter plots but do not provide a quantitative measure of multicollinearity.


398

Multi-Selecthard

A data scientist is analyzing a dataset with a continuous target variable and suspects that the relationship between a predictor and the target is non-linear. Which THREE techniques can the scientist use to explore and model this non-linearity?

Select 3 answers

A.Apply logistic regression to binarize the target.

B.Compute the Pearson correlation coefficient between the predictor and target.

C.Add polynomial features (e.g., x^2, x^3) and check if model performance improves.

D.Fit a decision tree regressor and examine feature importance.

E.Create a scatter plot and overlay a LOESS (local regression) smooth curve.

AnswersC, D, E

Polynomial features capture non-linearity in linear models.

Why this answer

Options C, D, and E are correct. Scatter plots with a LOESS curve visually reveal non-linearity. Polynomial features allow linear models to capture non-linear relationships.

Decision trees can model non-linear interactions without explicit feature engineering. Option B is wrong because correlation measures only linear relationships. Option A is wrong because logistic regression is for binary outcomes.
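The contrast between Pearson correlation and polynomial features can be shown with a small NumPy sketch. The quadratic data and the r2_for_degree helper are hypothetical; np.polyfit stands in for a linear model with added polynomial terms.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical predictor with a quadratic effect on the target.
x = rng.uniform(-3, 3, size=400)
y = x ** 2 + rng.normal(scale=0.5, size=400)

def r2_for_degree(degree):
    """Fit a polynomial of the given degree by least squares and return R^2."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    return 1.0 - residuals.var() / y.var()

pearson = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {pearson:.2f}")
print(f"R^2 linear = {r2_for_degree(1):.2f}, quadratic = {r2_for_degree(2):.2f}")
```

A near-zero Pearson coefficient alongside a large jump in R^2 from the linear to the quadratic fit is the kind of evidence that points to a non-linear relationship.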


399

MCQeasy

A machine learning team is analyzing a dataset with numerical features. They compute the pairwise correlation matrix and find that two features, 'X1' and 'X2', have a correlation coefficient of 0.98. The team plans to train a linear regression model. Which of the following actions should the team take to avoid multicollinearity issues?

A.Perform PCA on the dataset to reduce dimensionality.

B.Add an interaction term between X1 and X2 to the model.

C.Standardize both features using Z-score normalization.

D.Remove one of the two highly correlated features.

AnswerD

This directly addresses multicollinearity by eliminating redundancy.

Why this answer

Option D is correct because removing one of the highly correlated features reduces multicollinearity. Option A is wrong because PCA is not necessary for just two correlated features. Option B is wrong because adding interaction terms increases multicollinearity.

Option C is wrong because standard scaling does not address correlation.


400

MCQmedium

A company is building a classification model and discovers that the target variable is imbalanced: 95% of samples belong to class A and 5% to class B. The data scientist needs to understand the distribution of numeric features for each class. Which approach is most appropriate?

A.Run a t-test for each feature to determine statistical significance between classes.

B.Generate box plots for each feature using Amazon QuickSight.

C.Use Amazon SageMaker Data Wrangler to create histograms for each feature, grouped by class label.

D.Compute the correlation matrix between features and the target.

AnswerC

Histograms grouped by class provide a clear view of feature distributions across classes.

Why this answer

Using Amazon SageMaker Data Wrangler to generate histograms segmented by class is a straightforward way to visualize feature distributions for each class. Option A (t-tests) may be used later but doesn't provide distribution visualization. Option B (box plots) is a good alternative but not as comprehensive as histograms for distribution shape.

Option D (correlation matrix) does not show class-wise distribution.


401

MCQhard

A data scientist is analyzing a dataset with missing values. The missing data mechanism is missing at random (MAR). Which imputation method is most appropriate to preserve relationships between variables?

A.Remove all rows with any missing values.

B.Use k-nearest neighbors imputation.

C.Use multiple imputation by chained equations (MICE).

D.Replace missing values with the mean of the column.

AnswerC

MICE models each variable with missing values conditional on others, suitable for MAR.

Why this answer

Option C is correct because multiple imputation by chained equations (MICE) handles MAR well by modeling each variable with missing values conditional on the others. Option A is wrong because dropping rows with missing data reduces sample size and can introduce bias. Option D is wrong because mean imputation underestimates variance.

Option B is wrong because KNN imputation produces a single imputed value per cell and does not reflect imputation uncertainty, so it may not be optimal for MAR.


402

MCQhard

A data scientist is analyzing a dataset with 1 million rows and 50 features. The scientist wants to detect outliers in a numerical feature 'transaction_amount' which has a long right tail. The scientist suspects that outliers are due to data entry errors and should be removed. Which outlier detection method is MOST robust for this scenario?

A.Interquartile range (IQR) with multiplier 1.5

B.Mahalanobis distance

C.Z-score with threshold 3

D.DBSCAN clustering

AnswerA

IQR method is non-parametric and robust to skewness.

Why this answer

Option A is correct because the IQR method is robust to skewed distributions and does not assume normality. Option B is wrong because Mahalanobis distance assumes multivariate normality. Option C is wrong because Z-score assumes normality.

Option D is wrong because DBSCAN is computationally expensive on 1 million rows and may not be practical for univariate outlier detection.
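A minimal NumPy sketch of the IQR rule with the 1.5 multiplier from Option A is shown below. The amounts, including the two injected data-entry errors, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical right-skewed amounts plus two injected data-entry errors at the end.
amounts = np.concatenate([
    rng.lognormal(mean=3.0, sigma=0.5, size=1_000),
    [5_000.0, 12_000.0],
])

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences

is_outlier = (amounts < lower) | (amounts > upper)
cleaned = amounts[~is_outlier]
print(is_outlier.sum(), "flagged; upper fence =", round(upper, 2))
```

Because quartiles barely move when a few extreme values are added, the fences stay stable. On a long-tailed feature the rule will also flag some genuine tail values, so flagged rows are worth inspecting before removal.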


403

MCQhard

Refer to the exhibit. A data scientist runs the AWS CLI command shown to explore the contents of an S3 bucket. The command returns an empty array. However, the data scientist knows there are objects larger than 1000 bytes in the bucket. What is the most likely reason for the empty result?

A.The query syntax is incorrect; backticks should not be used

B.The command should use list-objects instead of list-objects-v2

C.The --query parameter is not supported by list-objects-v2

D.The AWS CLI is not configured with the correct region for the bucket

AnswerD

If the bucket is in a different region, the command returns no results.

Why this answer

JMESPath uses backticks for literal values, so a filter such as --query "Contents[?Size > `1000`]" is valid syntax and the command appears correct.

The objects could be under a different prefix, but the most likely reason is that the bucket is not in the CLI's default region and the command is missing the --region parameter. Option D is correct.

Option A is wrong because the backtick syntax is valid. Option B is wrong because list-objects-v2 does list objects. Option C is wrong because --query is a global AWS CLI option and works with list-objects-v2.


404

MCQhard

A data scientist is performing EDA on a large dataset (10 TB) stored in S3. They need to compute summary statistics for each column. Which approach is most cost-effective and efficient?

A.Use an AWS Glue ETL job with PySpark to compute statistics

B.Use Amazon Athena with SQL queries

C.Download the dataset to an Amazon SageMaker Studio notebook and use pandas

D.Launch an Amazon EMR cluster and use Spark SQL

AnswerB

Athena is serverless, cost-effective, and efficient for ad-hoc queries.

Why this answer

Option B is correct because Amazon Athena uses a serverless query engine that scales automatically and charges per query based on data scanned, making it cost-effective for large datasets. Option A is wrong because AWS Glue Spark jobs have overhead and are more suited to complex ETL. Option C is wrong because downloading 10 TB to SageMaker Studio may incur high data transfer costs and requires significant local storage.

Option D is wrong because Amazon EMR requires provisioning clusters and is more expensive for simple statistics.


405

MCQhard

A machine learning team is analyzing feature importance in a dataset with many categorical features. They plan to use a tree-based model. Which encoding method should they use to handle high-cardinality categorical features without creating too many dummy variables?

A.One-hot encoding

B.Label encoding

C.Target encoding

D.Frequency encoding

AnswerC

Target encoding replaces categories with the target mean, preserving information without increasing dimensionality.

Why this answer

Option C is correct because target encoding replaces categories with the mean of the target, which is efficient and works well with tree models. Option A is wrong because one-hot encoding creates many columns for high cardinality. Option B is wrong because label encoding imposes ordinality.

Option D is wrong because frequency encoding may not capture predictive information.
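A bare-bones version of target encoding in pandas might look like the sketch below. The data is hypothetical, and the fallback to the global mean for unseen categories is an assumption. To limit target leakage, the category means are normally computed on training folds only, often with smoothing.

```python
import pandas as pd

# Hypothetical training data with a binary target.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "churn": [1, 0, 1, 1, 0, 0],
})

global_mean = train["churn"].mean()
category_mean = train.groupby("city")["churn"].mean()

# Replace each category with the mean target observed for it.
train["city_encoded"] = train["city"].map(category_mean)

# Unseen categories (here 'Z') fall back to the global mean.
new_rows = pd.Series(["B", "Z"])
encoded_new = new_rows.map(category_mean).fillna(global_mean)
print(encoded_new.tolist())
```

The encoded column is a single numeric feature regardless of how many distinct categories exist, which is why it avoids the dimensionality growth of one-hot encoding.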


406

MCQmedium

A data analyst is performing exploratory data analysis on a dataset with 100 features. The analyst wants to identify which features contribute most to the variance in the data. Which technique should the analyst use?

A.K-means clustering

B.Principal Component Analysis (PCA)

C.t-Distributed Stochastic Neighbor Embedding (t-SNE)

D.Linear Discriminant Analysis (LDA)

AnswerB

PCA decomposes the data into components that capture the maximum variance.

Why this answer

Option B is correct because PCA is a dimensionality reduction technique that identifies the directions (principal components) that maximize variance. Option A is wrong because K-means is clustering, not variance analysis. Option C is wrong because t-SNE is for visualization and does not provide variance contributions.

Option D is wrong because LDA is supervised and requires labels.
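The variance decomposition behind PCA can be sketched with a singular value decomposition in NumPy. The synthetic three-feature dataset is an assumption for illustration; a library PCA implementation would report the same explained-variance ratios.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: three features, the first two strongly correlated.
base = rng.normal(size=(500, 1))
X = np.hstack([base, base + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])

X_centred = X - X.mean(axis=0)
# Singular values of the centred data give the variance captured by each component.
_, singular_values, components = np.linalg.svd(X_centred, full_matrices=False)
explained_variance = singular_values ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

print(np.round(explained_ratio, 3))
# Rows of `components` hold each component's loadings on the original features.
print(np.round(components[0], 2))
```

The loadings show which original features contribute to each high-variance direction, which is how PCA answers the question of which features drive the variance.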

