Knowledge + Practice

CCNA Exploratory Data Analysis Questions

75 of 406 questions · Page 5/6 · Exploratory Data Analysis · Answers revealed


301

MCQ · easy

A company has a dataset with 1 million rows and 500 features. They want to reduce dimensionality for visualization. Which technique is most suitable for preserving global structure?

A.Autoencoder

B.t-Distributed Stochastic Neighbor Embedding (t-SNE)

C.Linear Discriminant Analysis (LDA)

D.Principal Component Analysis (PCA)

Answer: D

PCA preserves global variance.

Why this answer

Option D is correct because PCA is a linear technique that preserves global variance. Option A is wrong because autoencoders are more complex and not primarily intended for visualization. Option B is wrong because t-SNE focuses on local structure. Option C is wrong because LDA requires labels.


302

Multi-Select · medium

A data scientist is performing EDA on a dataset with a binary target variable. Which THREE techniques can help assess the relationship between a continuous feature and the target?

Select 3 answers

A.Scatter plot against another continuous feature

B.KDE plot grouped by target

C.Histogram colored by target

D.Bar chart of feature values

E.Box plot grouped by target

Answers: B, C, E

KDE plots show smoothed density per class.

Why this answer

Box plots (comparing distributions for each class), histograms (overlaid or side by side), and KDE plots (probability density) are all effective for visualizing the relationship between a continuous feature and a binary target. Option A (scatter plot against another continuous feature) relates two continuous variables rather than the feature and the target. Option D (bar chart) is for categorical features.


303

MCQ · medium

A data scientist is exploring a dataset with a large number of features. The scientist suspects that some features are redundant because they are highly correlated with each other. Which technique should the scientist use during EDA to identify and remove such redundant features?

A.Chi-square test

B.Principal Component Analysis (PCA)

C.Correlation matrix heatmap

D.Variance Inflation Factor (VIF)

Answer: D

VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity.

Why this answer

Option D is correct because VIF quantifies multicollinearity by measuring how much the variance of a coefficient is inflated due to correlation with other features. Option A is wrong because the chi-square test is for categorical features. Option B is wrong because PCA creates new features rather than identifying redundant ones. Option C is wrong because a correlation matrix shows only pairwise correlations, whereas VIF also captures a feature's correlation with combinations of other features.
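As a rough illustration of the VIF idea, the sketch below covers only the special case of two predictors. There, regressing one feature on the other gives an R² equal to the squared Pearson correlation, so VIF = 1 / (1 - r²). The feature names and values are made up for the example, and a dataset with many features would need a full regression per feature instead.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_features(x1, x2):
    """VIF when x1 and x2 are the only predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical features: the same height in two units, plus an unrelated one.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
shoe_size = [38, 44, 39, 45, 41]

print(vif_two_features(height_cm, height_in))  # very large: redundant pair
print(vif_two_features(height_cm, shoe_size))  # close to 1: little overlap
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity.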


304

MCQ · hard

Refer to the exhibit. A data scientist is setting up an IAM policy for EDA on a data lake. The scientist needs to run exploratory SQL queries using Amazon Athena and save results to a new S3 bucket. What is a critical missing permission in this policy?

A.s3:ListBucket on the output bucket

B.s3:PutObject on the output S3 bucket

C.glue:GetDatabase

D.athena:StopQueryExecution

Answer: B

Athena needs to write query results to an S3 bucket.

Why this answer

Option B is correct because Athena writes query results to an S3 bucket, which requires s3:PutObject permission on the output bucket. Option A is wrong because s3:ListBucket is included. Option C is wrong because glue:GetTable is already included. Option D is wrong because the policy includes the necessary Athena permissions.


305

MCQ · hard

A data scientist is analyzing a dataset with a large number of categorical features. The target variable is binary. Which technique should the scientist use to assess the relationship between each categorical feature and the target?

A.ANOVA

B.Point-biserial correlation

C.Cramér's V

D.Chi-square test of independence

Answer: D

Chi-square tests association between two categorical variables.

Why this answer

The chi-square test of independence is appropriate for testing association between a categorical feature and a binary target. ANOVA compares a continuous variable across groups. Mutual information measures dependency but is not a hypothesis test. Point-biserial correlation is for one continuous and one binary variable. Cramér's V is a measure of association strength derived from the chi-square statistic.
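To make the relationship between the chi-square statistic and Cramér's V concrete, here is a minimal sketch that computes both from a contingency table using only the standard library. The table counts and the "plan type" feature are hypothetical, and the p-value lookup is left out.

```python
import math

def chi_square(table):
    """Return (chi2, degrees_of_freedom, cramers_v) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the target were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    rows, cols = len(row_totals), len(col_totals)
    dof = (rows - 1) * (cols - 1)
    cramers_v = math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))
    return chi2, dof, cramers_v

# Hypothetical counts: rows = plan type (basic, premium), columns = churn (no, yes).
table = [[30, 10], [20, 40]]
chi2, dof, v = chi_square(table)
print(f"chi2={chi2:.2f}, dof={dof}, V={v:.2f}")
```

The statistic drives the hypothesis test, while Cramér's V rescales it to a 0 to 1 effect size that can be compared across features with different numbers of categories.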


306

MCQ · hard

A machine learning team is building a model to predict customer churn. The dataset has 20 features and 50,000 rows. After initial EDA, they notice that the target variable 'churn' is highly imbalanced (5% churn, 95% non-churn). Which EDA step should the team prioritize to address this imbalance before model training?

A.Remove outliers in the majority class to balance the dataset.

B.Analyze the distribution of each feature separately for churn and non-churn groups.

C.Perform stratified cross-validation to ensure balanced folds.

D.Apply Principal Component Analysis (PCA) to reduce noise.

Answer: B

This helps identify which features differentiate the classes and informs whether resampling or cost-sensitive methods are needed.

Why this answer

Option B is correct because understanding the distribution of features across churn and non-churn classes helps identify which features drive churn. Option A is wrong because removing outliers may worsen imbalance. Option C is wrong because cross-validation is a modeling step, not EDA. Option D is wrong because PCA reduces dimensionality but does not address imbalance.


307

MCQ · easy

During EDA, a data scientist finds that a feature has a skewness value of 2.5. What does this indicate about the data distribution?

A.The distribution is right-skewed

B.The distribution is symmetric

C.The distribution is left-skewed

D.The distribution has no outliers

Answer: A

Positive skewness indicates a long right tail.

Why this answer

A skewness greater than 1 indicates a highly right-skewed (positive skew) distribution, meaning the tail extends to the right. Option B is wrong because symmetric distributions have skewness near 0. Option C is wrong because left skew is negative. Option D is wrong because skewness describes shape, not the presence of outliers specifically.
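The sign convention can be checked with a small sketch of the moment-based (Fisher-Pearson) skewness coefficient, the third central moment divided by the 1.5 power of the second. The sample values are invented for illustration, and library implementations may apply a bias correction that this sketch omits.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]          # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]    # mirror image

print(skewness(symmetric))     # 0.0
print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
```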


308

MCQ · hard

A data scientist is working on a predictive maintenance project for a manufacturing company. Sensor data is collected every second from 100 machines and stored in an Amazon S3 bucket as Parquet files, partitioned by machine_id and date. The dataset is massive (10 TB) and contains over 2000 features per machine. The data scientist needs to perform exploratory data analysis to identify which features are most predictive of machine failure. They have access to Amazon SageMaker Studio with a SageMaker Data Wrangler flow. The initial data exploration is taking too long due to the volume of data. The data scientist wants to speed up the analysis without losing accuracy in feature selection. Which course of action is most appropriate?

A.Switch to using Amazon EMR with Spark to perform distributed feature selection on the full dataset

B.Reduce the data to a single partition by concatenating all files and use only one machine's data

C.Use SageMaker Data Wrangler to create a stratified sample by machine_id and date, then analyze the sample

D.Use Amazon Athena to query a random sample of rows from the dataset

Answer: C

Stratified sampling preserves the distribution of key variables and reduces data size.

Why this answer

Option C is correct because using SageMaker Data Wrangler's sampling capabilities allows faster exploration while preserving statistical properties for feature selection. Option A is wrong because running distributed feature selection on the full 10 TB dataset with Amazon EMR adds cluster setup and still processes all of the data. Option B is wrong because reducing to a single partition loses time series context. Option D is wrong because a random sample of rows may break time series ordering.
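The idea behind stratified sampling can be sketched in plain Python: group rows by the stratification key, then draw the same fraction from every group so no machine or date disappears from the sample. This is not Data Wrangler's implementation; the function name, row layout, and fraction are assumptions made for the illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows from every stratum (at least one row each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical sensor rows: 2 machines x 2 dates x 50 readings each.
rows = [
    {"machine_id": m, "date": d, "reading": i}
    for m in ("m1", "m2")
    for d in ("2024-01-01", "2024-01-02")
    for i in range(50)
]
sample = stratified_sample(rows, key=lambda r: (r["machine_id"], r["date"]), fraction=0.1)
print(len(sample))  # 20 rows, 5 from each (machine_id, date) stratum
```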


309

MCQ · medium

A data engineer runs a SQL query on Amazon Athena to explore a dataset stored in S3 as CSV. The query returns zero rows for a column that should have numeric values. Which step should the engineer take to diagnose the issue?

A.Verify that the S3 bucket has encryption enabled.

B.Run an AWS Glue crawler to update the table schema.

C.Add a partition to the table for the date column.

D.Check the table schema in AWS Glue Data Catalog to ensure the column data type is correct.

Answer: D

Incorrect data type can cause Athena to return null values.

Why this answer

Option D is correct because checking the schema and column data types can reveal issues such as unquoted commas or a wrong format. Option A is wrong because the issue is likely with data types, not encryption. Option B is wrong because crawling does not change data types if the schema is inferred incorrectly. Option C is wrong because adding a partition won't fix data type issues.


310

MCQ · medium

During EDA, a data scientist finds that a feature has a skewed distribution. They want to apply a log transformation to make it more Gaussian-like. Which Amazon SageMaker feature is most appropriate for this transformation?

A.Amazon SageMaker Data Wrangler

B.Amazon SageMaker Clarify

C.Amazon SageMaker JumpStart

D.Amazon SageMaker Ground Truth

Answer: A

Data Wrangler provides a visual interface for transformations like log scaling.

Why this answer

Option A is correct because SageMaker Data Wrangler provides a visual interface to apply transformations like log scaling without writing code. Option B is wrong because SageMaker Clarify is for bias detection and explainability. Option C is wrong because SageMaker JumpStart is for pre-built models. Option D is wrong because SageMaker Ground Truth is for labeling, not transformation.


311

MCQ · hard

A data scientist is analyzing clickstream data from a website. The data is stored in Amazon S3 as JSON files, each containing nested arrays. The scientist needs to flatten the nested structures and compute user session durations. Which approach is most efficient for this EDA task?

A.Use Amazon EMR with Apache Spark to process the data.

B.Use Amazon Athena with JSON SerDe to query the data and compute session duration with SQL.

C.Use AWS Glue DataBrew to flatten the JSON and create new columns for session duration.

D.Use Amazon QuickSight to visualize the raw data without flattening.

Answer: C

DataBrew is built for data preparation and can handle nested JSON visually.

Why this answer

AWS Glue DataBrew provides a visual interface to flatten nested JSON and compute derived metrics like session duration without writing code. Option A (EMR with Spark) is more complex. Option B (Athena with JSON SerDe) can query the data but requires SQL that handles arrays. Option D (QuickSight) is visualization only.


312

MCQ · hard

A company stores customer transaction data in Amazon S3. A data scientist needs to perform exploratory data analysis using Amazon SageMaker. The dataset is 500 GB in CSV format. Which approach is most cost-effective and time-efficient for initial data profiling?

A.Use Amazon S3 Select to sample rows directly from S3

B.Load the entire dataset into a SageMaker notebook instance and use pandas

C.Convert the data to Parquet format and then use Athena to query

D.Use AWS Glue ETL to transform the data and then analyze in Athena

Answer: A

S3 Select allows efficient querying of a subset without full data movement.

Why this answer

Option A is correct because Amazon S3 Select can query a subset of rows from S3 without loading the entire dataset, enabling quick profiling. Option B is wrong because loading the full dataset is expensive and slow. Option C is wrong because converting to Parquet adds overhead for initial profiling. Option D is wrong because Glue ETL processes the full dataset.


313

MCQ · medium

A data scientist is performing EDA on a dataset containing customer transaction records. The dataset includes columns: 'transaction_id', 'customer_id', 'transaction_amount', 'transaction_date', and 'product_category'. The data scientist wants to check for duplicate transactions and identify any suspicious patterns, such as multiple transactions from the same customer on the same day with the same amount. The dataset has 5 million rows. The data scientist is using a SageMaker Studio notebook with a ml.t3.medium instance. The data is stored in S3. What is the most efficient way to perform this analysis?

A.Use a SageMaker Spark processing job with PySpark to aggregate and detect duplicates.

B.Use Amazon Athena to run SQL queries to find duplicates.

C.Load the entire dataset into a pandas DataFrame and use groupby operations.

D.Use AWS Glue DataBrew to create a profile and manually inspect.

Answer: A

Spark can handle large data efficiently.

Why this answer

Using Spark on SageMaker allows distributed processing of large data. Option B is wrong because Athena requires SQL queries and external setup. Option C is wrong because pandas may run out of memory. Option D is wrong because DataBrew is for profiling but not custom duplicate analysis.
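Whichever engine runs it, the duplicate check itself is a group-and-count over (customer, date, amount). The sketch below shows that logic on a tiny in-memory list with made-up rows; in Spark or SQL the same idea would be a GROUP BY on those three columns keeping groups with a count above 1.

```python
from collections import Counter

# Hypothetical transactions using the column names from the question.
transactions = [
    {"transaction_id": 1, "customer_id": "c1", "transaction_amount": 25.0, "transaction_date": "2024-03-01"},
    {"transaction_id": 2, "customer_id": "c1", "transaction_amount": 25.0, "transaction_date": "2024-03-01"},
    {"transaction_id": 3, "customer_id": "c1", "transaction_amount": 40.0, "transaction_date": "2024-03-01"},
    {"transaction_id": 4, "customer_id": "c2", "transaction_amount": 25.0, "transaction_date": "2024-03-01"},
    {"transaction_id": 5, "customer_id": "c1", "transaction_amount": 25.0, "transaction_date": "2024-03-01"},
]

def suspicious_groups(rows):
    """Return (customer, date, amount) combinations that occur more than once."""
    counts = Counter(
        (r["customer_id"], r["transaction_date"], r["transaction_amount"]) for r in rows
    )
    return {combo: n for combo, n in counts.items() if n > 1}

flagged = suspicious_groups(transactions)
print(flagged)  # {('c1', '2024-03-01', 25.0): 3}
```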


314

MCQ · easy

A data scientist is performing exploratory data analysis on a dataset with missing values. They want to understand the distribution of each feature and identify outliers. Which AWS service can be used to create visualizations such as histograms and box plots without writing any code?

A.Amazon EMR

B.AWS Glue

C.Amazon QuickSight

D.Amazon SageMaker Studio

E.Amazon Athena

Answer: C

QuickSight provides code-free visualizations like histograms and box plots.

Why this answer

Amazon QuickSight is a serverless, machine learning-powered business intelligence service that allows users to create interactive dashboards and visualizations without writing code. Option A is wrong because Amazon EMR is a big data platform not primarily for visualization. Option B is wrong because AWS Glue is used for ETL, not visualization. Option D is wrong because SageMaker Studio requires coding for custom visualizations. Option E is wrong because Amazon Athena is a query service.


315

MCQ · hard

A machine learning engineer is performing exploratory data analysis on a large dataset stored in Amazon S3 using AWS Glue. The dataset contains a mix of numeric and categorical features. The engineer wants to efficiently compute summary statistics (e.g., mean, median, standard deviation) for the numeric columns. Which AWS service or feature should the engineer use to achieve this with minimal setup?

A.Launch an Amazon EMR cluster and use Spark.

B.Use AWS Glue DataBrew to profile the dataset.

C.Use Amazon Athena to run SQL queries on the data.

D.Use Amazon SageMaker Data Wrangler.

Answer: B

DataBrew provides an easy interface for profiling and statistics.

Why this answer

Option B is correct because AWS Glue DataBrew provides a visual interface to profile data and compute summary statistics without writing code. Option A is wrong because Amazon EMR requires cluster setup and management. Option C is wrong because Amazon Athena requires SQL queries and more manual effort. Option D is wrong because Amazon SageMaker Data Wrangler is a good tool but requires more configuration than DataBrew for simple summary statistics.


316

MCQ · medium

A data analyst is working with a time series dataset that shows increasing variance over time. To stabilize the variance before modeling, which transformation is most appropriate?

A.First-order differencing

B.Box-Cox transformation

C.Log transformation

D.Min-max scaling

Answer: C

Log transformation compresses high values and stabilizes increasing variance.

Why this answer

Option C is correct because the log transformation is commonly used to stabilize variance when variance increases with the mean. Differencing (A) is for trend/seasonality, not variance. Box-Cox (B) is more general but requires positive data. Min-max scaling (D) does not stabilize variance.
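A small numeric sketch of why the log helps: when the spread grows in proportion to the level, taking logs turns that multiplicative spread into a constant additive one. The two groups below are invented so that the second is exactly ten times the first.

```python
import math
import statistics

low = [9.0, 10.0, 11.0]       # early period: small level, small spread
high = [90.0, 100.0, 110.0]   # later period: 10x the level, 10x the spread

raw_spread = (statistics.stdev(low), statistics.stdev(high))
log_spread = (
    statistics.stdev([math.log(v) for v in low]),
    statistics.stdev([math.log(v) for v in high]),
)

print(raw_spread)  # standard deviation grows with the level
print(log_spread)  # roughly equal after the log transform
```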


317

Multi-Select · medium

A data scientist is exploring a dataset containing customer transaction records. The target variable is 'churn' (1 = churned, 0 = not churned). Which TWO actions should the scientist take to understand the data distribution and prepare for modeling?

Select 2 answers

A.Apply Principal Component Analysis (PCA) to reduce dimensionality.

B.Train a gradient boosting model to identify important features.

C.Plot the frequency of the target variable to check for class imbalance.

D.Check for missing values in each column and decide on an imputation strategy.

E.Convert categorical variables into one-hot encoded vectors.

Answers: C, D

Essential to detect imbalance.

Why this answer

Visualizing class imbalance and identifying missing values are fundamental EDA steps. Option A (PCA) is for dimensionality reduction, not initial EDA. Option B (gradient boosting) is modeling, not EDA. Option E (one-hot encoding) is a preprocessing step for categorical variables, not an EDA action.


318

Multi-Select · hard

Which THREE of the following are best practices for feature engineering during EDA? (Select THREE.)

Select 3 answers

A.Remove all outliers from the dataset

B.Standardize all features to have zero mean and unit variance

C.Apply log transformation to highly skewed features

D.Create interaction features between numeric variables

E.Encode categorical variables using one-hot encoding

Answers: C, D, E

Log transformation reduces skewness.

Why this answer

Option C is correct because applying a log transformation to highly skewed features helps normalize their distribution, reducing the impact of extreme values and making the data more suitable for many machine learning algorithms that assume normally distributed features. This is a common technique during exploratory data analysis (EDA) to stabilize variance and improve model performance, especially for linear models and neural networks.

Exam trap

The exam often tests the misconception that all preprocessing steps, like outlier removal and standardization, should be performed during EDA. In fact, EDA is for understanding data distributions and relationships, while transformations and scaling are part of data preprocessing that may follow EDA based on the insights gained.


319

MCQ · easy

A data scientist is performing EDA on a dataset with 1,000 features. The goal is to select the most important features for a regression model. Which technique can be used to rank feature importance quickly?

A.Calculate the correlation coefficient of each feature with the target

B.Use t-SNE to visualize feature relationships

C.Run k-means clustering and use cluster centroids

D.Apply Principal Component Analysis (PCA) and examine component loadings

Answer: A

Quick and provides a ranking.

Why this answer

Correlation analysis with the target variable is a quick way to rank features. Option B is wrong because t-SNE is for visualization. Option C is wrong because k-means is a clustering method. Option D is wrong because PCA is unsupervised and does not rank features by importance to the target.
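A minimal sketch of such a ranking, assuming a numeric target and using the absolute Pearson correlation as the score. The feature names and values are hypothetical, and Pearson correlation only captures linear relationships, so a low score does not prove a feature is useless.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_by_correlation(features, target):
    """Sort feature names by |r| with the target, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical housing data.
features = {
    "noise": [3, 1, 4, 1, 5],
    "rooms": [2, 3, 3, 4, 5],
    "sqft": [50, 70, 90, 110, 130],
}
price = [100, 140, 180, 220, 260]

ranking = rank_by_correlation(features, price)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```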


320

MCQ · medium

A data scientist is working with a dataset that contains a feature with many outliers. Which transformation should the scientist apply to reduce the impact of outliers?

A.Min-max scaling

B.Log transformation

C.Standardization (z-score)

D.Binning

Answer: B

Log transformation reduces skewness and dampens outlier effects.

Why this answer

Log transformation compresses the range of values and reduces the impact of outliers. Standardization (z-score) does not reduce outlier impact. Min-max scaling is sensitive to outliers.

Square root transformation is less effective than log for large outliers. Binning loses information.


321

MCQ · easy

A data analyst wants to check for duplicate rows in a dataset stored in S3. Which AWS service can be used to run a SQL query to count duplicates without moving the data?

A.Amazon Athena

B.Amazon Redshift Spectrum

C.Amazon SageMaker Studio

D.AWS Glue

Answer: A

Athena can run SQL queries on S3 data to count duplicates.

Why this answer

Option A is correct because Amazon Athena allows running SQL queries directly on data in S3, including counting duplicates. Option B is wrong because Amazon Redshift Spectrum can query data in S3 but requires a Redshift cluster. Option C is wrong because Amazon SageMaker Studio is an IDE, not a query service. Option D is wrong because AWS Glue is an ETL service, not a query engine.


322

Multi-Select · easy

During EDA, a data scientist generates a pairplot of the dataset and observes that two features have a Pearson correlation coefficient of 0.95. Which TWO conclusions can the scientist draw from this observation? (Choose 2)

Select 2 answers

A.The two features may be multicollinear

B.The two features have a strong linear relationship

C.The two features move in opposite directions

D.The two features are statistically independent

E.One feature causes the other

Answers: A, B

High correlation between features can cause multicollinearity in regression models.

Why this answer

Options A and B are correct because a high correlation indicates a strong linear relationship and suggests multicollinearity. Option C is wrong because a high positive correlation means the features move together, not in opposite directions. Option D is wrong because a correlation of 0.95 indicates strong linear dependence, not independence. Option E is wrong because correlation does not imply causation.


323

MCQ · easy

During EDA, a data scientist creates a scatter matrix of numerical features and notices that some features have a funnel-shaped pattern (variance increases with the mean). What is the appropriate transformation to stabilize variance?

A.Apply log transformation.

B.Standardize the features using Z-scores.

C.Apply a sine transformation.

D.Apply Box-Cox transformation with lambda=0.

Answer: A

Log transformation stabilizes variance when variance increases with mean.

Why this answer

A funnel-shaped pattern in a scatter matrix indicates heteroscedasticity, where variance increases with the mean. The log transformation is appropriate because it compresses the scale of the data, making the variance more constant across the range of values, which stabilizes variance for right-skewed or multiplicative data.

Exam trap

The exam often tests the distinction between transformations that stabilize variance (log, Box-Cox) and those that only standardize (Z-scores) or are domain-specific (sine). Candidates may incorrectly choose Box-Cox with lambda=0 thinking it is a separate technique, missing that the log transformation is the canonical answer for funnel-shaped heteroscedasticity.

How to eliminate wrong answers

Option B is wrong because standardizing using Z-scores centers and scales the data to unit variance but does not address the relationship between variance and mean; it assumes homoscedasticity and can amplify heteroscedasticity. Option C is wrong because a sine transformation is periodic and used for cyclical or angular data, not for stabilizing variance in funnel-shaped patterns. Option D is wrong because Box-Cox with lambda=0 is equivalent to the log transformation only when the data is positive, but the Box-Cox transformation is a family of power transformations; specifying lambda=0 directly is redundant and the question asks for the appropriate transformation, not a specific parameterization.
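The equivalence mentioned for Option D can be verified numerically. The sketch below implements the standard one-parameter Box-Cox formula, (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, and shows that a very small λ gives almost the same value as the log. The helper name and sample value are chosen for the illustration.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

x = 5.0
print(box_cox(x, 0))      # exactly log(5)
print(box_cox(x, 1e-6))   # approaches log(5) as lambda shrinks toward 0
print(box_cox(x, 1))      # lambda = 1 only shifts the data: x - 1
```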


324

MCQ · medium

A data engineer is exploring a dataset with 1 million rows and 50 features. They notice that some features have missing values. The 'Age' column has 5% missingness, and 'Income' has 20% missingness. The target variable is 'LoanDefault' (binary). The engineer wants to impute missing values. Which of the following strategies is most appropriate?

A.Impute missing 'Age' with median and 'Income' with median.

B.Impute missing 'Age' with mode and 'Income' with mode.

C.Use a k-NN model to predict missing values.

D.Drop all rows with missing values.

Answer: A

Median is robust to outliers and suitable for skewed distributions.

Why this answer

Option A is correct because median imputation is robust to outliers and simple for numerical features. Option B is wrong because mode imputation is for categorical features. Option C is wrong because model-based imputation is complex for initial EDA. Option D is wrong because dropping rows with any missing values could remove up to 25% of the data.
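A minimal sketch of median imputation, assuming missing values are represented as None. The income figures are invented and include one extreme value to show why the median is preferred over the mean here.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical incomes with two missing entries and one extreme value.
income = [42000, None, 38000, 1_000_000, None, 51000]
imputed = impute_median(income)
print(imputed)  # missing entries become 46500.0; the mean would have been 282750
```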


325

MCQ · medium

An ML engineer runs the AWS CLI command above to list files in a training data bucket. The engineer notices that the three CSV files have different sizes but the same number of columns. What is the MOST likely cause of the size variation?

A.The files are compressed with different algorithms.

B.Some files have duplicate headers.

C.The files contain a different number of rows.

D.The files have different column data types.

Answer: C

Row count directly affects file size.

Why this answer

Option C is correct because the number of rows can vary between files, leading to different file sizes. Option A is wrong because compression would be applied uniformly. Option B is wrong because S3 does not add headers multiple times. Option D is wrong because different column types would cause inconsistent schemas, but the engineer says the files have the same number of columns.


326

MCQ · easy

A machine learning engineer is analyzing feature distributions in a dataset and notices that one feature has a long tail. Which transformation is most appropriate to reduce skewness and make the distribution more normal?

A.Apply one-hot encoding

B.Apply a log transformation

C.Apply min-max normalization

D.Apply standardization (Z-score)

Answer: B

Log transformation compresses the long tail and reduces skewness.

Why this answer

Option B is correct because log transformation is commonly used to reduce right skewness. Option A is wrong because one-hot encoding is for categorical variables. Option C is wrong because min-max scaling does not change distribution shape. Option D is wrong because standardization does not reduce skewness.


327

MCQ · hard

A data scientist is working with a dataset containing geospatial coordinates (latitude and longitude) of customer locations. The scientist wants to engineer features such as distance to the nearest store, and cluster customers into regions. Which AWS service is best suited for performing geospatial analysis and clustering during exploratory data analysis?

A.Amazon SageMaker with custom Python scripts using scikit-learn and Geopy

B.Amazon Athena with PostGIS extensions

C.AWS Glue with geospatial transforms

D.Amazon Location Service

Answer: A

SageMaker allows custom code for distance calculations and clustering using libraries like scikit-learn.

Why this answer

Option A is correct because Amazon SageMaker provides built-in algorithms like K-Means for clustering, and the scientist can use custom code with libraries like Geopy to compute distances. Option B is wrong because Amazon Athena with PostGIS is for querying geospatial data, not for clustering. Option C is wrong because AWS Glue is for ETL, not for analysis and clustering. Option D is wrong because Amazon Location Service is for maps and location tracking, not for analytical clustering.
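The "distance to the nearest store" feature could be sketched as follows. Instead of Geopy, this example swaps in a hand-written haversine formula so it runs with the standard library alone; the coordinates, the spherical-Earth radius of 6371 km, and the function names are assumptions for the illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; treats the Earth as a sphere

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_store_km(customer, stores):
    """Distance from one customer to the closest store."""
    return min(haversine_km(*customer, *store) for store in stores)

# Hypothetical locations: a customer in London, stores in Paris and Berlin.
customer = (51.5074, -0.1278)
stores = [(48.8566, 2.3522), (52.5200, 13.4050)]
print(round(nearest_store_km(customer, stores)))  # distance to the Paris store, in km
```

The resulting column can then be fed to a clustering algorithm alongside the raw coordinates.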


328

MCQ · easy

A data scientist is visualizing the distribution of a numerical feature that is heavily right-skewed. Which visualization technique is most appropriate?

A.Histogram with linear scale

B.Scatter plot

C.Box plot with log scale

D.Q-Q plot

Answer: C

Box plot with log scale handles skewness and shows outliers.

Why this answer

A box plot with log scale is effective for skewed data as it shows outliers and distribution shape after transformation. Histogram with log scale also works. KDE is similar to histogram.

Q-Q plot checks normality. Scatter plot is for two variables.


329

MCQ · easy

A data scientist runs a SQL query on an Amazon Athena table and notices that the query scans a large amount of data. Which approach would reduce the amount of data scanned without changing the SQL logic?

A.Partition the table on a column that is frequently used in WHERE clauses.

B.Convert the data from CSV to JSON format.

C.Store the data in Parquet format without partitioning.

D.Use GZIP compression on the data files.

Answer: A

Partitioning prunes data and reduces scanned bytes.

Why this answer

Partitioning the table on a frequently filtered column limits the data scanned to relevant partitions. Option B is wrong because converting to JSON does not reduce the amount scanned. Option C is wrong because Parquet is columnar and can reduce the scan, but without partitioning, Athena still scans entire columns. Option D is wrong because compressing reduces storage but not scan size unless combined with a columnar format.


330

MCQ · hard

A machine learning engineer is analyzing a dataset with high cardinality categorical features. They want to reduce the number of categories by grouping rare categories into an 'Other' category. Which Amazon SageMaker processing job capability is best suited for this task?

A.Amazon SageMaker Processing

B.Amazon SageMaker Data Wrangler

C.AWS Glue Studio

D.Amazon SageMaker Autopilot

Answer: A

Processing jobs allow custom scripts for flexible data transformation.

Why this answer

Option A is correct because Amazon SageMaker Processing jobs can run custom scripts that can handle high cardinality categorical features using libraries like pandas. Option B is wrong because SageMaker Data Wrangler is a visual tool, but it may not be as flexible for custom grouping logic. Option C is wrong because AWS Glue Studio is a visual ETL tool, but it may not be as tightly integrated with SageMaker. Option D is wrong because SageMaker Autopilot is for automated ML, not for custom data processing.
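The grouping logic such a custom script would apply can be sketched in a few lines. This version uses only the standard library rather than pandas, and the threshold, label, and city values are hypothetical.

```python
from collections import Counter

def group_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical high-cardinality column: two frequent cities and three rare ones.
cities = ["NYC"] * 5 + ["LA"] * 4 + ["Boise", "Tulsa", "Fargo"]
grouped = group_rare(cities, min_count=2)
print(Counter(grouped))  # NYC: 5, LA: 4, Other: 3
```

In practice the frequency counts should be learned on the training split only, so that validation and test data are mapped with the same category list.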


331

MCQ · easy

Refer to the exhibit. A data scientist lists files in an S3 bucket. The dataset is split into train, test, and validation sets. What is the most likely issue with this data split?

A.The files are not partitioned by date.

B.The training file is missing a header row.

C.The training set is smaller than the test set, which is unusual.

D.The test file should be in JSON format.

Answer: C

Typically the training set is the largest split.

Why this answer

Option C is correct because the training set (1024 bytes) is smaller than the test set (2048 bytes), which is unusual; the training set is typically the largest split. Option A (partitioning by date) is not evident from the listing. Option B (missing header) cannot be inferred. Option D (JSON format) is wrong because CSV format is fine.


332

MCQ · medium

A data scientist is analyzing a dataset containing customer reviews. The data scientist wants to understand the most common words used in positive and negative reviews. Which AWS service is most suitable for this task?

A.Amazon Rekognition

B.Amazon Comprehend

C.Amazon Polly

D.Amazon Transcribe

Answer: B

Comprehend provides sentiment analysis and key phrase extraction.

Why this answer

Option B is correct because Amazon Comprehend can perform sentiment analysis and extract key phrases. Option A is wrong because Amazon Rekognition is for image/video analysis. Option C is wrong because Amazon Polly is a text-to-speech service.

Option D is wrong because Amazon Transcribe is for speech-to-text.


333

MCQ · medium

A data scientist is analyzing a dataset with a target variable that is highly imbalanced (only 1% positive class). The goal is to build a binary classifier. During exploratory data analysis, which metric is MOST appropriate to evaluate the performance of different sampling strategies before model training?

A.Root Mean Squared Error (RMSE)

B.Area Under the Receiver Operating Characteristic Curve (AUC ROC)

C.F1 score

D.Accuracy

AnswerB

AUC ROC is threshold-independent and robust to class imbalance.

Why this answer

Option B is correct because the AUC ROC is threshold-independent, is largely insensitive to class distribution, and provides a robust measure of separability between classes. Option D is wrong because accuracy is misleading for imbalanced data. Option A is wrong because RMSE is a regression metric.

Option C is wrong because the F1 score depends on a threshold and can be affected by sampling.

334

MCQmedium

Refer to the exhibit. A data scientist plans to read this CSV file into memory for exploratory data analysis using pandas. The instance has 8 GB of RAM. What is the MOST likely issue the scientist will encounter?

A.The file contains too many rows for pandas to handle

B.The file is too large to load into memory on this instance

C.The file is not in CSV format despite the ContentType

D.The file is not accessible because of insufficient permissions

AnswerB

1 GB CSV file may require >8 GB RAM when loading into pandas.

Why this answer

Option B is correct because the file is approximately 1 GB (1073741824 bytes), and pandas often needs several times the on-disk size in memory while parsing a CSV (3-5x is a common rule of thumb, and more for string-heavy columns), which can exhaust the RAM available on an 8 GB instance. Option D is wrong because the file is accessible (HTTP 200). Option C is wrong because the content type is text/csv, which is correct.

Option A is wrong because, although the metadata indicates 10 million rows, the row count itself is not the limit; memory is the main issue.
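
When a CSV is too large to load at once, one common workaround is to stream it in chunks and accumulate statistics incrementally. The sketch below assumes pandas is available; the helper name `summarize_in_chunks`, the column name `amount`, and the tiny in-memory CSV are hypothetical stand-ins for a real file.

```python
import io
import pandas as pd

def summarize_in_chunks(csv_source, column, chunksize=100_000):
    """Stream a CSV in chunks and accumulate count, mean, min, and max for one column."""
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    # usecols limits parsing to the one column we need; chunksize bounds memory per step.
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        if values.empty:
            continue
        count += len(values)
        total += float(values.sum())
        lo = min(lo, float(values.min()))
        hi = max(hi, float(values.max()))
    return {"count": count, "mean": total / count, "min": lo, "max": hi}

# Tiny in-memory stand-in for a large file (hypothetical data).
demo_csv = io.StringIO("amount\n1.0\n2.0\n3.0\n4.0\n")
stats = summarize_in_chunks(demo_csv, "amount", chunksize=2)
print(stats)
```

Statistics such as the median cannot be accumulated this simply; for those, sampling or a query engine may be a better fit.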

335

MCQeasy

A team has a dataset with 500 features and wants to reduce dimensionality. During EDA, they compute the variance of each feature. Which finding would most likely lead to feature removal?

A.Some features have high correlation with each other

B.Some features have negative covariance with the target

C.Some features have very high variance

D.Some features have near-zero variance

AnswerD

Near-zero variance features carry little information.

Why this answer

Option D is correct because near-zero variance features provide little information and can be removed. Option C is wrong because high variance is often useful. Option A is wrong because high correlation between two features might warrant removal of one, but variance is not the direct indicator.

Option B is wrong because negative covariance is still informative.
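
A minimal sketch of the variance check, assuming pandas is available. The function name `near_zero_variance_columns`, the threshold of 1e-3, and the sample columns are illustrative assumptions; a sensible threshold depends on the scale of each feature.

```python
import pandas as pd

def near_zero_variance_columns(df, threshold=1e-3):
    """Return names of numeric columns whose variance is at or below the threshold."""
    variances = df.select_dtypes(include="number").var()
    return variances[variances <= threshold].index.tolist()

# Hypothetical data: two nearly constant columns and one informative column.
df = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "almost_constant": [5.0, 5.0, 5.0, 5.01],
    "informative": [0.2, 3.1, 7.4, 9.9],
})
to_drop = near_zero_variance_columns(df)
print(to_drop)
```

Because variance depends on units, it can help to compare features on a common scale before applying a single threshold.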

336

MCQmedium

A data scientist runs the above AWS CLI command. What does the command do?

A.It lists objects larger than 1,000,000 bytes under the data/ prefix.

B.It counts the number of objects larger than 1 MB.

C.It lists objects created after January 2023.

D.It lists objects larger than 1 MB in size.

AnswerA

The --query filters Size > '1000000', which is 1,000,000 bytes.

Why this answer

Option A is correct. The command lists objects in the bucket under the prefix 'data/' that are larger than 1,000,000 bytes (Size > '1000000') and returns their keys. Option D is wrong because the threshold is 1,000,000 bytes, not 1 MB (1 MB = 1,048,576 bytes).

Option B is wrong because the command returns keys, not counts. Option C is wrong because it filters by size, not date.

337

MCQeasy

A data scientist is reviewing a dataset and notices that the distribution of a numerical feature is heavily right-skewed with a long tail. Which visualization is most appropriate to assess the distribution?

A.Box plot

B.Line chart

C.Scatter plot

D.Histogram with a logarithmic scale on the x-axis

AnswerD

Log scale helps visualize skewed distributions.

Why this answer

Option D is correct because a histogram with a log scale can handle skewed data. Option A is wrong because a box plot shows quartiles but not the full distribution shape. Option C is wrong because a scatter plot is for two variables.

Option B is wrong because a line chart is for time series.

338

Multi-Selectmedium

A data scientist is analyzing a dataset with a binary target variable. The dataset has 50,000 rows and 200 features. The data scientist wants to identify which features are most predictive. Which TWO methods are appropriate for feature selection during EDA?

Select 2 answers

A.Mutual information

B.Chi-squared test

C.LSTM neural network

D.K-means clustering

E.Principal Component Analysis (PCA)

AnswersA, B

Can measure dependency between features and target.

Why this answer

The chi-squared test applies to categorical features, and mutual information can handle both categorical and continuous features. Option E is wrong because PCA is unsupervised and does not use the target. Option D is wrong because k-means is a clustering method.

Option C is wrong because an LSTM is a model, not an EDA technique.

339

MCQhard

An ML engineer is performing EDA on a dataset of customer transactions. The dataset has 1 million rows and 20 columns, including a 'transaction_amount' column. The engineer notices that 5% of the transaction amounts are negative, which are data entry errors. The rest are positive. Which approach is most appropriate for handling these negative values during EDA?

A.Impute the negative values with the median of positive transaction amounts.

B.Remove rows with negative transaction amounts from the dataset.

C.Take the absolute value of the negative transaction amounts.

D.Cap the negative values at zero.

AnswerB

Removing erroneous data points cleans the dataset without introducing bias.

Why this answer

Option B is correct because the negative values are errors and likely distort the distribution; removing them is straightforward and valid. Option C is wrong because taking absolute values would incorrectly treat errors as legitimate values. Option A is wrong because negative values are not missing, so imputation is not appropriate.

Option D is wrong because capping may still retain erroneous values.

340

MCQhard

A data scientist is trying to upload a CSV file to an S3 bucket using the AWS CLI without specifying server-side encryption. The upload fails with an AccessDenied error. Based on the bucket policy exhibit, what is the most likely cause?

A.The upload request did not specify the required server-side encryption.

B.The bucket does not exist.

C.The data scientist does not have any permissions to the bucket.

D.The data scientist used the wrong AWS region.

AnswerA

The condition requires s3:x-amz-server-side-encryption to be AES256.

Why this answer

Option A is correct because the policy requires that all PutObject requests include server-side encryption with AES256. Option C is wrong because the policy allows GetObject and PutObject, subject to a condition. Option D is wrong because the condition concerns encryption, not the Region.

Option B is wrong because the error is AccessDenied, not NoSuchBucket.

341

MCQhard

A data scientist is performing exploratory data analysis on a high-dimensional dataset with 500 features. The scientist wants to visualize the data in 2D to check for clusters. Which dimensionality reduction technique should the scientist use that preserves global structure and is computationally efficient for large datasets?

A.t-SNE

B.Linear Discriminant Analysis (LDA)

C.PCA

D.UMAP

AnswerC

PCA is linear, fast, and preserves global variance.

Why this answer

Option C is correct because PCA is linear, fast, and preserves global variance structure. Option A is wrong because t-SNE is non-linear, slower, and focuses on local structure. Option D is wrong because UMAP can be slow and is also non-linear.

Option B is wrong because LDA is supervised and requires labels.

342

MCQmedium

During EDA, a data scientist finds that a feature 'age' has 30% missing values. The dataset has 100,000 rows. Which imputation strategy is most robust if the data is not missing at random (MNAR) and the missingness is related to the age value itself?

A.Impute with the mean age

B.Impute with a random sample from the observed ages

C.Impute with the median age

D.Create a separate category indicating missingness and impute with a placeholder

AnswerD

Creating a missing category captures the information that the value is missing, which is informative under MNAR.

Why this answer

Option D is correct because missingness related to the value itself means that the missing data are systematically different; creating a 'missing' category allows the model to learn the pattern. Option A is wrong because mean imputation reduces variance and ignores the systematic difference. Option C is wrong because median imputation has similar issues.

Option B is wrong because random imputation introduces noise without capturing the missingness pattern.
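
The missing-indicator idea can be sketched as follows, assuming pandas and NumPy. The helper name `add_missing_indicator` and the placeholder value -1 are illustrative choices, not part of the question.

```python
import numpy as np
import pandas as pd

def add_missing_indicator(df, column, placeholder=-1):
    """Add a 0/1 '<column>_missing' flag, then fill the gaps with a placeholder."""
    out = df.copy()
    out[f"{column}_missing"] = out[column].isna().astype(int)
    out[column] = out[column].fillna(placeholder)
    return out

# Hypothetical data with two missing ages.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan]})
result = add_missing_indicator(df, "age")
print(result)
```

The placeholder should be a value that cannot occur naturally, so that the model does not confuse it with a real age.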

343

Multi-Selecthard

A data scientist is analyzing a large dataset of images stored in Amazon S3. The dataset is used to train a computer vision model. Which THREE EDA steps are appropriate for this image dataset?

Select 3 answers

A.Compute the distribution of image dimensions (height and width).

B.Check for corrupted or unreadable image files.

C.Decompose the time series of image timestamps to detect seasonality.

D.Visualize a sample of images from each class to verify labels.

E.Perform tokenization and stop word removal on image filenames.

AnswersA, B, D

Important for resizing and batching.

Why this answer

Verifying labels on a sample of images from each class, identifying corrupt images, and analyzing image dimensions are standard EDA steps for image data. Option C (time series decomposition) is irrelevant. Option E (text preprocessing) applies to text data.

344

MCQeasy

A machine learning team is reviewing a dataset for a regression problem. They notice that the target variable has a right-skewed distribution. Which transformation should they consider applying to the target variable to improve model performance?

A.Apply StandardScaler to the target variable.

B.Apply MinMaxScaler to the target variable.

C.Apply log transformation to the target variable.

D.Apply one-hot encoding to the target variable.

AnswerC

Log transformation reduces right skewness.

Why this answer

Log transformation is commonly applied to right-skewed data to make it more normally distributed, which can improve model performance. Option A (StandardScaler) is for scaling, not skewness. Option B (MinMaxScaler) also doesn't address skewness.

Option D (One-hot encoding) is for categorical variables.

345

Multi-Selecthard

During EDA of a dataset for a regression problem, a data scientist notices that the target variable has a right-skewed distribution. Which THREE transformations are appropriate to address this skewness? (Choose THREE.)

Select 3 answers

A.Log transformation

B.StandardScaler (z-score normalization)

C.Box-Cox transformation

D.Yeo-Johnson transformation

E.Min-Max scaling

AnswersA, C, D

Log transformation compresses large values, reducing right skew.

Why this answer

Options A, C, and D are correct. Log transformation, Box-Cox, and Yeo-Johnson are common transformations for right-skewed data. Option B is wrong because StandardScaler only standardizes and does not reduce skewness.

Option E is wrong because Min-Max scaling does not affect skewness.
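
To see the effect on skewness, the sketch below draws a synthetic right-skewed target (an assumption for illustration) and compares sample skewness before and after a log transform, using NumPy and pandas. Box-Cox and Yeo-Johnson are available in SciPy as `scipy.stats.boxcox` and `scipy.stats.yeojohnson`; they are not used here to keep the example dependency-light.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive, right-skewed target (hypothetical data).
rng = np.random.default_rng(0)
target = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000))

skew_before = target.skew()
# np.log needs strictly positive values; np.log1p is a common choice when zeros occur.
skew_after = np.log(target).skew()

# Linear rescaling (z-score here) leaves the shape, and therefore the skewness, unchanged.
standardized = (target - target.mean()) / target.std()
skew_standardized = standardized.skew()

print(skew_before, skew_after, skew_standardized)
```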

346

MCQmedium

A company stores sensor data in Amazon S3. A data scientist wants to explore the data using SQL without moving it. Which AWS service should they use?

A.Amazon EMR

B.Amazon Redshift

C.Amazon QuickSight

D.Amazon Athena

AnswerD

Athena queries data directly in S3 using SQL.

Why this answer

Amazon Athena is the correct choice because it is a serverless interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL without any data movement or infrastructure management. Athena uses Presto under the hood and charges only for the data scanned per query, making it ideal for ad-hoc exploratory analysis on sensor data stored in S3.

Exam trap

The trap here is that candidates often confuse Amazon Athena with Amazon EMR or Redshift, thinking they need a full cluster or data warehouse for SQL queries, but Athena is specifically designed for serverless, direct S3 querying with no data movement.

How to eliminate wrong answers

Option A is wrong because Amazon EMR is a managed big data platform that requires provisioning and managing clusters (e.g., Hadoop, Spark), which involves moving or processing data in a separate compute layer, not querying it directly in S3 with SQL without setup. Option B is wrong because Amazon Redshift is a data warehouse that requires loading data from S3 into its own storage before querying, violating the 'without moving it' requirement. Option C is wrong because Amazon QuickSight is a business intelligence (BI) visualization tool, not a SQL query engine; it can connect to Athena but cannot directly run SQL queries on S3 data on its own.

347

Multi-Selecthard

A data scientist is analyzing a dataset with many missing values. The scientist wants to decide on an imputation strategy. Which THREE considerations are important for choosing the imputation method?

Select 3 answers

A.The mechanism of missingness (MCAR, MAR, MNAR).

B.The class imbalance of the target variable.

C.The percentage of missing values in each feature.

D.The distribution of the feature (e.g., skewed, normal).

E.The feature importance according to a random forest model.

AnswersA, C, D

Determines whether imputation is valid.

Why this answer

Missing data mechanism (MCAR/MAR/MNAR), proportion of missing values, and feature distribution (skewness, outliers) all affect imputation choice. Option E (feature importance) is not directly relevant. Option B (class imbalance) concerns the classification target, not the imputation method.

348

MCQhard

A data engineer is performing exploratory data analysis on a large dataset stored in Amazon S3 (10 TB in CSV format). The dataset has 2000 columns and 50 million rows. The engineer needs to compute summary statistics (mean, median, standard deviation) for each numeric column and identify missing values. Which approach is MOST cost-effective and time-efficient?

A.Use Amazon Redshift Spectrum to query the data directly from S3.

B.Load the data into Amazon SageMaker Data Wrangler and compute statistics interactively.

C.Convert the data to Apache Parquet format, then use Amazon Athena to run SQL queries for statistics.

D.Use AWS Glue ETL to compute statistics and write results to S3.

AnswerC

Parquet reduces data scanned, and Athena is cost-effective for ad-hoc queries.

Why this answer

Using Amazon Athena with columnar formats like Parquet after converting from CSV reduces query costs and improves performance. Option B (SageMaker Data Wrangler) may struggle with 2000 columns. Option D (AWS Glue ETL) is more expensive and slower for simple statistics.

Option A (Redshift Spectrum) requires setting up a Redshift cluster, which is overkill.

349

Multi-Selectmedium

Which TWO are appropriate techniques for detecting outliers in a dataset during exploratory data analysis?

Select 2 answers

A.Z-score method (assuming normal distribution)

B.One-hot encoding

C.Principal component analysis (PCA)

D.t-SNE

E.Interquartile range (IQR) method

AnswersA, E

Z-score identifies outliers based on standard deviations.
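
Both methods can be sketched in a few lines of NumPy. The function names, the cutoffs (3 standard deviations and 1.5 times the IQR), and the sample data are conventional but illustrative assumptions.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical data: thirteen ordinary readings and one extreme value.
data = np.array([10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 10, 12, 11, 100], dtype=float)
z_mask = zscore_outliers(data)
iqr_mask = iqr_outliers(data)
print(data[z_mask], data[iqr_mask])
```

The Z-score method is sensitive to the outliers themselves, because they inflate the mean and standard deviation; the IQR method is more robust on skewed or small samples.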

350

MCQmedium

In exploratory data analysis, a data scientist notices that the distribution of a feature 'income' is heavily right-skewed. Which transformation is most appropriate to reduce skewness?

A.Standardization (z-score).

B.Square transformation.

C.Min-max scaling.

D.Log transformation.

AnswerD

Log transformation reduces right skew.

Why this answer

Log transformation is the most appropriate technique to reduce right skewness in a feature like 'income' because it compresses the long tail of high values while expanding the lower end, making the distribution more symmetric. This is particularly effective for income data, which often follows a log-normal distribution, and is a standard preprocessing step in machine learning to improve model performance.

Exam trap

The trap here is that candidates confuse scaling techniques (which change range or variance) with transformations that alter distribution shape, leading them to pick standardization or min-max scaling as a fix for skewness.

How to eliminate wrong answers

Option A is wrong because standardization (z-score) centers and scales the data to have mean 0 and standard deviation 1, but it does not change the shape of the distribution, so skewness remains. Option B is wrong because a square transformation amplifies larger values even more, which would worsen right skewness rather than reduce it. Option C is wrong because min-max scaling linearly rescales the data to a fixed range (e.g., [0,1]), which preserves the original distribution shape and does not address skewness.

351

MCQhard

A data scientist is analyzing a dataset for a binary classification problem. The dataset has 10,000 samples and 200 features. After splitting into training (80%) and test (20%), the data scientist trains a decision tree classifier and achieves 100% accuracy on the training set but only 55% on the test set. Which step should the data scientist take first to address this issue?

A.Use cross-validation to evaluate model performance

B.Collect more training data

C.Add more features to the model

D.Prune the decision tree to reduce complexity

AnswerD

Pruning reduces the overfitting indicated by the gap between training and test accuracy.

Why this answer

Option D is correct because the large discrepancy between training and test accuracy indicates overfitting, and pruning the decision tree (e.g., limiting max_depth) reduces overfitting. Option C is wrong because more features may worsen overfitting. Option B is wrong because more data may help but is not the first step.

Option A is wrong because cross-validation evaluates model performance but does not directly fix overfitting; pruning does.

352

Multi-Selecthard

A data scientist is performing EDA on a dataset with 1 million rows and 50 features. The dataset includes a column 'user_id' with unique identifiers, a column 'event_date' with timestamps, and other columns. Which TWO actions should the data scientist take to understand data quality issues?

Select 2 answers

A.Analyze missing value patterns across columns

B.Check for duplicate rows based on 'user_id' and 'event_date'

C.Drop the 'user_id' column to reduce dimensionality

D.Use PCA to reduce dimensions and visualize

E.Train a random forest model to identify feature importance

AnswersA, B

Missing value analysis is key for data quality.

Why this answer

Checking for duplicate rows and analyzing missing values are fundamental steps in EDA. Option C is wrong because dropping user_id before analysis may lose information. Option E is wrong because training a model is not part of EDA.

Option D is wrong because PCA is not a data quality check.
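
Both checks are short in pandas. The sketch below uses a small hypothetical frame with the `user_id` and `event_date` columns from the question plus an assumed `amount` column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one duplicated (user_id, event_date) pair and one missing amount.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "event_date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03"]),
    "amount": [9.5, 9.5, np.nan, 4.0],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# keep=False marks every row that belongs to a duplicated key, not just the repeats.
dup_mask = df.duplicated(subset=["user_id", "event_date"], keep=False)

print(missing_share)
print(df[dup_mask])
```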

353

MCQeasy

A data scientist needs to understand the distribution of a continuous variable in a large dataset stored in Amazon S3. Which AWS service is most appropriate for quickly generating summary statistics and visualizations?

A.AWS Glue

B.Amazon Athena

C.Amazon QuickSight

D.Amazon SageMaker Studio

AnswerC

Correct: QuickSight can directly connect to S3 and create interactive dashboards with summary statistics.

Why this answer

Option C is correct because Amazon QuickSight integrates with S3 and can generate histograms and summary statistics without needing to move data. Option D is wrong because SageMaker Studio is for model building, not quick ad-hoc analysis. Option B is wrong because Athena is for querying, not visualization.

Option A is wrong because Glue is for ETL, not analysis.

354

Multi-Selectmedium

Which THREE actions are valid steps in exploratory data analysis when working with a new dataset? (Choose three.)

Select 3 answers

A.Check the data types of each column.

B.Generate descriptive statistics (mean, std, min, max).

C.Fit a linear regression model to identify important features.

D.Split the dataset into training and test sets.

E.Create histograms for numerical features.

AnswersA, B, E

Understanding data types is essential.

Why this answer

Options A, B, and E are correct. A: Checking data types is fundamental. B: Descriptive statistics summarize distributions.

E: Visualizing with histograms reveals patterns. D: Splitting into train/test sets is a modeling step, not EDA. C: Fitting a linear model is a modeling step, not EDA.

355

Multi-Selecthard

A data scientist is analyzing a dataset with high multicollinearity. Which TWO techniques can help identify and address multicollinearity?

Select 2 answers

A.Plot a correlation matrix

B.Apply Lasso regression

C.Use Recursive Feature Elimination (RFE)

D.Use Principal Component Analysis (PCA)

E.Compute Variance Inflation Factor (VIF)

AnswersD, E

Correct: PCA creates uncorrelated components.

Why this answer

The correct options are D and E. VIF (E) quantifies multicollinearity; PCA (D) creates orthogonal components. Option A is wrong because a correlation matrix shows only pairwise correlations.

Option B is wrong because Lasso performs feature selection but does not identify multicollinearity. Option C is wrong because RFE is for feature selection, not multicollinearity detection.

356

Multi-Selectmedium

A data scientist is exploring a dataset with 50 features. Which TWO EDA techniques are most effective for detecting multicollinearity?

Select 2 answers

A.Box plots of each feature

B.Variance Inflation Factor (VIF) analysis

C.Scatter plots of each feature pair

D.Histograms of each feature

E.Correlation matrix visualized as heatmap

AnswersC, E

Scatter plots of feature pairs reveal linear relationships.

Why this answer

Option C is correct because scatter plots of feature pairs can reveal linear relationships. Option E is correct because a correlation matrix shown as a heatmap quantifies pairwise correlations. Option D is wrong because a histogram shows a distribution, not a relationship.

Option B is wrong because VIF is a numerical diagnostic, whereas the question asks for visual EDA techniques. Option A is wrong because a box plot shows a univariate distribution.

357

MCQmedium

A data scientist is exploring a dataset with 10 million rows and 500 features. The target variable is binary. The dataset is stored in an Amazon S3 bucket. The data scientist wants to quickly identify which features have the highest correlation with the target variable. Which approach is MOST efficient?

A.Use Amazon SageMaker Data Wrangler to import the dataset from S3 and generate a correlation matrix.

B.Use Amazon QuickSight to create scatter plots for each feature vs. target.

C.Use Amazon Athena with SQL queries to compute correlation coefficients.

D.Use AWS Glue ETL to compute pairwise correlations and output to Amazon Redshift.

AnswerA

Data Wrangler provides interactive data exploration and correlation analysis.

Why this answer

Option A is correct because Amazon SageMaker Data Wrangler can connect to S3, perform correlation analysis, and generate a correlation matrix without writing code. Option D is wrong because AWS Glue ETL is for ETL pipelines, not interactive exploration. Option C is wrong because Amazon Athena is for SQL queries, not correlation analysis.

Option B is wrong because Amazon QuickSight is for visualization, not statistical correlation.

358

MCQmedium

A data scientist is working with a dataset that contains a 'Price' column. After plotting a histogram, they observe that the distribution is right-skewed with many extreme high values. They plan to use a linear model that assumes normally distributed errors. Which of the following transformations should they apply to the 'Price' column to make it more normally distributed?

A.Apply log transformation (log(Price)).

B.Apply square transformation (Price^2).

C.Apply min-max scaling to the 'Price' column.

D.Bin the 'Price' values into equal-width intervals.

AnswerA

Log transformation compresses the tail and makes the distribution more symmetric.

Why this answer

Option A is correct because log transformation is commonly used for right-skewed data to reduce skewness. Option C is wrong because min-max scaling does not change distribution shape. Option B is wrong because square transformation increases skewness.

Option D is wrong because binning loses information.

359

MCQhard

A data scientist is analyzing a dataset with a target variable that is highly imbalanced (99% negative class, 1% positive class). The dataset has 10 million rows. The goal is to train a binary classifier. Which technique should be applied during exploratory data analysis to best address the imbalance?

A.Assign higher class weights to the minority class

B.Random undersampling of the majority class

C.Synthetic Minority Oversampling Technique (SMOTE)

D.Collect more data for the minority class

AnswerB

Feasible for large datasets and can balance classes.

Why this answer

For large datasets, undersampling the majority class is feasible and can be effective. Option C is wrong because SMOTE generates synthetic samples but may be computationally expensive for 10M rows. Option A is wrong because class weights are set during training, not EDA.

Option D is wrong because collecting more data is not guaranteed to fix imbalance.
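
Random undersampling can be sketched with pandas alone. The helper name `undersample_majority`, the column names, and the fixed `random_state` are illustrative assumptions; this version downsamples every class to the size of the smallest one.

```python
import pandas as pd

def undersample_majority(df, target, random_state=0):
    """Downsample every class to the size of the smallest class, then shuffle."""
    minority_size = int(df[target].value_counts().min())
    parts = [
        group.sample(n=minority_size, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    balanced = pd.concat(parts)
    return balanced.sample(frac=1.0, random_state=random_state).reset_index(drop=True)

# Hypothetical data: 1% positive class.
df = pd.DataFrame({"x": range(1000), "label": [1] * 10 + [0] * 990})
balanced = undersample_majority(df, "label")
print(balanced["label"].value_counts())
```

Undersampling discards majority-class rows, so it is usually applied only to the training split, with evaluation done on data that keeps the original class ratio.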

360

MCQmedium

A data scientist is working with a dataset that includes a 'timestamp' column. They want to create features that capture seasonality. Which feature engineering approach is most appropriate?

A.Bin timestamps into fixed intervals.

B.Convert timestamp to Unix epoch seconds.

C.Extract hour of day and apply sine/cosine transformation.

D.One-hot encode the timestamp column.

AnswerC

Sine/cosine encoding preserves cyclic nature.

Why this answer

Option C is correct because sine and cosine transformations capture cyclic patterns like time of day or day of year. Option D is wrong because one-hot encoding creates many sparse features. Option B is wrong because converting to a Unix timestamp loses the cyclic nature.

Option A is wrong because binning loses granularity.
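
A minimal sketch of the cyclic encoding, assuming pandas and NumPy. The function name `encode_hour_cyclically` and the sample timestamps are hypothetical. The point of the two-column encoding is that 23:00 and 00:00 end up close together, which a raw hour value (23 vs. 0) would not achieve.

```python
import numpy as np
import pandas as pd

def encode_hour_cyclically(timestamps):
    """Map hour of day onto the unit circle as (sin, cos) features."""
    hours = pd.Series(pd.to_datetime(timestamps)).dt.hour
    angle = 2 * np.pi * hours / 24
    return pd.DataFrame({"hour_sin": np.sin(angle), "hour_cos": np.cos(angle)})

# Hypothetical timestamps: 23:00, 00:00, and 12:00.
encoded = encode_hour_cyclically(
    ["2024-01-01 23:00", "2024-01-02 00:00", "2024-01-02 12:00"]
)

def distance(i, j):
    """Euclidean distance between two encoded rows."""
    return float(np.hypot(encoded.hour_sin[i] - encoded.hour_sin[j],
                          encoded.hour_cos[i] - encoded.hour_cos[j]))

print(distance(0, 1), distance(1, 2))
```

The same pattern applies to day of week or day of year by changing the period from 24 to 7 or 365.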

361

Multi-Selectmedium

A data scientist is exploring a dataset with many features and suspects that some features are highly correlated. Which TWO methods can the scientist use to detect and handle multicollinearity before building a linear regression model?

Select 2 answers

A.Apply Principal Component Analysis (PCA) and use all components.

B.Standardize all features to have zero mean and unit variance.

C.Compute Variance Inflation Factor (VIF) for each feature and remove features with VIF > 10.

D.Use stepwise feature selection.

E.Use Ridge regression (L2 regularization) to shrink coefficients.

AnswersC, E

VIF detects multicollinearity; removing high VIF features reduces it.

Why this answer

Options C and E are correct. VIF is a standard measure for detecting multicollinearity; removing features with high VIF reduces multicollinearity. Ridge regression (L2 regularization) can handle multicollinearity by penalizing large coefficients.

Option A is wrong because PCA reduces dimensionality but makes models less interpretable and does not directly handle multicollinearity in the original features. Option D is wrong because stepwise selection does not specifically address multicollinearity. Option B is wrong because standardization does not affect collinearity.
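
VIF for feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features. An equivalent shortcut, used in the sketch below, is to read the VIFs off the diagonal of the inverse correlation matrix. The function name and the synthetic data are assumptions for illustration; only NumPy is required.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column, computed as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic data: x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

vifs = variance_inflation_factors(X)
high_vif = [i for i, v in enumerate(vifs) if v > 10]  # threshold from the question
print(vifs, high_vif)
```

In practice, features are usually removed one at a time, with VIFs recomputed after each removal, because dropping one of a correlated pair lowers the VIF of the other.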

362

MCQmedium

A machine learning engineer trains a binary classifier on an imbalanced dataset where the positive class represents 1% of the data. After training, the model achieves 99% accuracy but only 10% recall on the positive class. Which metric should the engineer focus on to evaluate the model's performance on the minority class?

A.F1 score

B.Accuracy

C.AUC-ROC

D.Precision

AnswerA

F1 score considers both precision and recall, giving a better measure for imbalanced data.

Why this answer

Option A is correct because the F1 score balances precision and recall, which is suitable for imbalanced datasets. Option B is wrong because accuracy can be misleading with imbalance. Option D is wrong because precision alone ignores recall.

Option C is wrong because AUC-ROC may still be high even with poor recall.
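
A worked example makes the gap between accuracy and F1 concrete. The confusion-matrix counts below are hypothetical, chosen to reproduce the scenario in the question (1% positives, 99% accuracy, 10% recall); the number of false positives is an added assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# 10,000 samples, 100 of them positive (1%). Assumed outcome:
# 10 true positives, 90 false negatives, 10 false positives, 9,890 true negatives.
tp, fn, fp, tn = 10, 90, 10, 9890

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, precision, recall, f1)
```

With these counts the accuracy is 0.99 while the F1 score is about 0.17, which exposes the weak minority-class performance that accuracy hides.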

363

MCQhard

A data scientist is trying to list objects in an S3 bucket named 'my-bucket' using the AWS CLI command: `aws s3 ls s3://my-bucket/`. The command fails with an access denied error. The IAM policy attached to the scientist's role is shown in the exhibit. What is the most likely cause of the failure?

A.The condition on the ListBucket action requires all objects to have the tag 'data-type'='training', which may not be satisfied.

B.The IAM policy does not include the s3:ListBucket action.

C.The policy does not grant access to the bucket because it uses 'my-bucket' instead of the full ARN.

D.The condition should use 'StringLike' instead of 'StringEquals'.

AnswerA

The condition on ListBucket is problematic and may cause denial.

Why this answer

Option A is correct because the s3:ListBucket permission is conditioned on s3:ExistingObjectTag/data-type being 'training'. ListBucket is a bucket-level action, whereas this condition key describes individual objects, so the condition is not satisfied for a list request and the request is denied. A condition of this kind on ListBucket is unusual and should generally be avoided or replaced with a key that applies to the list request.

Option B is wrong because the policy includes s3:ListBucket on the bucket ARN. Option C is wrong for the same reason: the statement already references the bucket ARN. Option D is wrong because changing the operator from StringEquals to StringLike would not make an object-tag condition apply to a list request.

364

Multi-Selecthard

Which TWO of the following are valid reasons to use a sample of the data during exploratory data analysis instead of the full dataset? (Select TWO.)

Select 2 answers

A.Remove bias from the original dataset

B.Ensure rare events are captured in the analysis

C.Improve model accuracy by reducing noise

D.Reduce memory usage and computation time

E.Enable interactive data visualization with large datasets

AnswersD, E

Sampling allows faster iteration with smaller data.

Why this answer

Options D and E are correct. Sampling reduces memory usage and speeds up interactive analysis. Option B is wrong because sampling can miss rare events.

Option C is wrong because model accuracy typically decreases with less data. Option A is wrong because sampling does not remove bias from the original data.

365

MCQhard

A team is analyzing a dataset with many categorical features. They notice that one feature has 1,000 unique values but a long tail where most values appear only once. Which encoding method is most appropriate to avoid overfitting?

A.Target encoding

B.Label encoding

C.One-hot encoding

D.Count encoding

AnswerD

Count encoding replaces categories with their frequency, reducing dimensionality and handling rare values.

Why this answer

Option D is correct because count encoding uses frequency counts, which can capture information for rare categories without creating high dimensionality. One-hot encoding (C) would create 1,000 columns. Target encoding (A) can cause overfitting.

Label encoding (B) implies ordinality.
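
Count encoding is a one-liner in pandas. The helper name `count_encode` and the sample categories are hypothetical.

```python
import pandas as pd

def count_encode(series):
    """Replace each category with the number of times it occurs in the series."""
    counts = series.value_counts()
    return series.map(counts)

# Hypothetical feature with one frequent value and a tail of singletons.
cities = pd.Series(["paris", "paris", "paris", "oslo", "lima", "quito"])
encoded = count_encode(cities)
print(encoded.tolist())
```

All singleton categories collapse to the same value, so the 1,000 distinct labels become a single numeric column. To avoid leakage, the counts would normally be computed on the training split and then applied to validation and test data.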

366

MCQmedium

An organization stores streaming data in Amazon Kinesis Data Streams. A data analyst wants to perform real-time exploratory data analysis on the incoming data to detect anomalies. Which AWS service should the analyst use to run SQL queries on the streaming data?

A.Amazon Kinesis Data Analytics

B.Amazon SageMaker

C.AWS Glue

D.Amazon Athena

AnswerA

Kinesis Data Analytics supports SQL queries on streaming data for real-time analysis.

Why this answer

Option A is correct because Amazon Kinesis Data Analytics enables running SQL queries on streaming data in real time. Option D is wrong because Athena is for batch queries on S3. Option C is wrong because Glue is for ETL, not real-time SQL.

Option B is wrong because SageMaker is for ML model training, not streaming SQL.

367

MCQmedium

A data scientist is analyzing a dataset with missing values. The missing data is not random and is correlated with other features. Which imputation method is most appropriate to minimize bias?

A.Last observation carried forward

B.Multiple imputation using MICE

C.Listwise deletion

D.Mean imputation

AnswerB

MICE models missing values using other features, suitable for non-random missingness.

Why this answer

Option B is correct because Multiple Imputation by Chained Equations (MICE) accounts for relationships between features and preserves variability. Option D is wrong because mean imputation can bias estimates when data is not missing completely at random. Option C is wrong because dropping rows reduces sample size and may introduce bias.

Option A is wrong because last observation carried forward is for time series.

368

MCQmedium

A data scientist is analyzing a dataset with missing values in a numeric column. The missing rate is 30% and the data is not missing completely at random. Which imputation method should the data scientist avoid to minimize bias?

A.Mean imputation

B.Model-based imputation using linear regression

C.k-Nearest Neighbors imputation

D.Multiple imputation using chained equations

AnswerA

Mean imputation can introduce bias and reduce variance, especially when data is not missing completely at random.

Why this answer

Option A is correct because mean imputation can introduce bias when data is not missing completely at random: it reduces variance and distorts relationships between features. Options D (multiple imputation using chained equations) and B (model-based imputation using linear regression) are appropriate for non-random missing data. Option C (k-NN imputation) can also be used and is typically less biased than mean imputation.
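
The effect of mean imputation on the mean and variance can be illustrated with a small NumPy sketch. The data is synthetic and noise-free (y = 2x + 1), and 30% of y is missing only at high x, so the missingness is not completely at random. The regression step is a single model-based imputation, a simplified stand-in for illustration; it is not MICE, which cycles over all incomplete columns and draws several imputed datasets.

```python
import numpy as np

# Synthetic data: y depends linearly on x.
x = np.arange(10, dtype=float)
y_true = 2 * x + 1

# 30% of y is missing, and only where x is large (not MCAR).
missing = x >= 7
y_obs = np.where(missing, np.nan, y_true)

# Mean imputation: fill every gap with the mean of the observed values.
y_mean = np.where(missing, np.nanmean(y_obs), y_obs)

# Regression imputation: fit y ~ x on observed rows, predict the gaps.
slope, intercept = np.polyfit(x[~missing], y_obs[~missing], deg=1)
y_reg = np.where(missing, slope * x + intercept, y_obs)

for name, col in [("true", y_true), ("mean-imputed", y_mean), ("regression", y_reg)]:
    print(f"{name:>13}: mean={col.mean():.2f} variance={col.var():.2f}")
```

Mean imputation pulls the column mean toward the observed (low) values and shrinks the variance, while the model-based fill recovers the relationship with x.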

369

MCQeasy

During exploratory data analysis, a data scientist notices that the Pearson correlation coefficient between two continuous variables is 0.85. What does this indicate?

A.A causal relationship between the two variables

B.A strong positive linear relationship

C.A weak negative linear relationship

D.No relationship between the variables

AnswerB

Values close to 1 indicate a strong positive linear relationship.

Why this answer

Option B is correct because a correlation of 0.85 indicates a strong positive linear relationship. Option D is wrong because 0.85 is far from 0. Option A is wrong because correlation does not imply causation.

Option C is wrong because 0.85 is neither weak nor negative.
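
The word "linear" matters here. A short NumPy sketch on synthetic data shows a noisy linear relationship producing a high Pearson coefficient, while a perfectly deterministic but symmetric quadratic relationship produces a coefficient of essentially zero:

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)

# Linear relationship plus alternating +2/-2 noise.
noise = 2.0 * np.where(np.arange(x.size) % 2 == 0, 1.0, -1.0)
y_linear = x + noise

# Deterministic but non-linear (symmetric) relationship.
y_quadratic = x ** 2

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    r = {r_linear:.2f}")
print(f"quadratic: r = {r_quadratic:.2f}")
```

A coefficient near zero therefore rules out a linear relationship, not a relationship in general, and a high coefficient says nothing about which variable, if either, causes the other.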

370

Multi-Selectmedium

A data scientist is exploring a dataset with 50 features and a binary target. The data scientist computes the correlation matrix and finds that two features, X1 and X2, have a correlation coefficient of 0.95. Which TWO actions should the data scientist consider? (Choose 2.)

Select 2 answers

A.Apply a log transformation to X1 and X2.

B.Remove one of the highly correlated features from the dataset.

C.Apply Principal Component Analysis (PCA) to the feature set.

D.Create an interaction term between X1 and X2.

E.Impute missing values for X1 and X2.

AnswersB, C

Removing one feature reduces multicollinearity.

Why this answer

Option B is correct because high correlation indicates multicollinearity; removing one feature reduces redundancy. Option C is correct because PCA can create uncorrelated components. Option D is wrong because adding interaction terms would increase multicollinearity.

Option E is wrong because correlation does not imply missing values. Option A is wrong because log transformation does not address correlation between features.
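
One way to act on a correlation matrix is a greedy filter that keeps a feature only if it is not highly correlated with any feature already kept. The sketch below assumes a hypothetical helper `drop_correlated` and a threshold of 0.9; both are illustrative choices, not a standard API.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Return the names of columns kept after dropping near-duplicates."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # columns are features
    kept = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not too correlated with a kept column.
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic features: x2 is almost a rescaled copy of x1.
x1 = np.arange(10, dtype=float)
x3 = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)
x2 = 2 * x1 + 0.1 * x3
X = np.column_stack([x1, x2, x3])

kept = drop_correlated(X, ["x1", "x2", "x3"])
print(kept)  # ['x1', 'x3']
```

Which member of a correlated pair to drop is a judgment call; the greedy order used here simply favors whichever column comes first.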

371

MCQmedium

A data scientist is working with a dataset containing 10,000 observations and 100 features. The scientist wants to detect outliers in the dataset. Which method is most appropriate for outlier detection in a high-dimensional space?

A.Use Z-score to identify points beyond 3 standard deviations

B.Use Isolation Forest

C.Use Mahalanobis distance

D.Use interquartile range (IQR) for each feature

AnswerB

Isolation Forest is designed for high-dimensional data and does not assume distribution.

Why this answer

Option B is correct because Isolation Forest is effective for high-dimensional data and uses tree-based isolation. Option A is wrong because Z-score assumes normality and is univariate. Option D is wrong because IQR is univariate and not suitable for high dimensions.

Option C is wrong because Mahalanobis distance assumes multivariate normality and is sensitive to dimensionality.
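
A minimal sketch with scikit-learn's `IsolationForest`, assuming scikit-learn is installed. The data is synthetic: 200 Gaussian points in 100 dimensions plus one planted outlier, and the `contamination` value of 0.01 is an assumption chosen for this toy data rather than a recommended default.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 inliers in 100 dimensions, plus one planted outlier far from the bulk.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 100))
outlier = np.full((1, 100), 8.0)
X = np.vstack([inliers, outlier])

model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = model.fit_predict(X)    # -1 = outlier, 1 = inlier
scores = model.score_samples(X)  # lower score = more anomalous

print("flagged rows:", np.where(labels == -1)[0])
print("most anomalous row:", scores.argmin())
```

Points that can be separated from the rest with few random splits receive low scores, which is why no distributional assumption is needed.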

372

MCQhard

A data scientist uses Amazon SageMaker Data Wrangler to explore a dataset. The target column is 'price' (continuous). Which EDA analysis would best help decide between linear regression and tree-based models?

A.Compute variance inflation factor (VIF) for features

B.Check linear relationships between features and target

C.Detect outliers using Z-score

D.Identify class imbalance in the target

AnswerB

Checking linearity shows whether a linear model's core assumption holds.

Why this answer

Option B is correct because checking linearity (e.g., scatter plots of features vs. target) is fundamental for linear model assumptions. Option A is wrong because multicollinearity affects linear regression but not tree models. Option D is wrong because class imbalance is for classification.

Option C is wrong because outlier detection is important but not the primary factor for model selection.

373

MCQmedium

A company uses Amazon SageMaker Data Wrangler to perform exploratory data analysis. They want to detect outliers in a numerical column using the Interquartile Range (IQR) method. Which transformation should they apply in Data Wrangler?

A.Impute

B.Normalize

C.Handle outliers

D.Binning

AnswerC

This transform supports IQR method.

Why this answer

Option C is correct because Data Wrangler has a built-in 'Handle outliers' transform that allows IQR-based detection. Option B (Normalize) scales data; Option D (Binning) groups values; Option A (Impute) fills missing values.
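
The IQR rule itself can be sketched in a few lines of NumPy. This illustrates the general method on made-up values with the conventional 1.5 multiplier; it is not Data Wrangler's implementation, whose exact parameters are not described here.

```python
import numpy as np

# Made-up numeric column with one obvious extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"fences: [{lower}, {upper}]")
print("outliers:", outliers)
```

Because the fences are built from quartiles rather than the mean and standard deviation, a single extreme value does not widen them.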

374

Multi-Selectmedium

A data scientist is exploring a dataset with skewed numerical features. Which THREE transformations can help make the features more normally distributed?

Select 3 answers

A.Min-max scaling

B.Standardization (Z-score)

C.Yeo-Johnson transformation

D.Box-Cox transformation

E.Log transformation

AnswersC, D, E

Yeo-Johnson works for both positive and negative values.

Why this answer

Correct options: C, D, E. Log transform (E), Box-Cox (D), and Yeo-Johnson (C) are common for normalizing skewed data. Option B is wrong because standardization (Z-score) does not change distribution shape.

Option A is wrong because min-max scaling does not fix skewness.
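
The difference between reshaping and rescaling can be checked numerically. The sketch below uses synthetic right-skewed data and a hand-written `skewness` helper (a hypothetical name, computing the standard third standardized moment). It covers only the log transform and the two scalers; Box-Cox and Yeo-Johnson are omitted.

```python
import numpy as np

def skewness(v):
    """Third standardized moment: 0 for a symmetric distribution."""
    return ((v - v.mean()) ** 3).mean() / v.std() ** 3

# Synthetic right-skewed, strictly positive feature.
x = np.exp(np.linspace(0, 5, 200))

log_x = np.log(x)                             # changes the shape
z_score = (x - x.mean()) / x.std()            # linear rescaling only
min_max = (x - x.min()) / (x.max() - x.min()) # linear rescaling only

for name, v in [("raw", x), ("log", log_x), ("z-score", z_score), ("min-max", min_max)]:
    print(f"{name:>8}: skewness = {skewness(v):.3f}")
```

Standardization and min-max scaling are linear maps, so they leave skewness unchanged, whereas the log transform compresses the long right tail.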

375

Multi-Selecteasy

During EDA, a data scientist notices that a numeric feature 'age' has values ranging from 0 to 150, but expects adult ages between 18-100. Which TWO steps should the scientist take to investigate?

Select 2 answers

A.Remove all rows with age > 100

B.Compute summary statistics (min, max, percentiles)

C.Apply log transformation to normalize the distribution

D.Impute age values outside 18-100 with the mean

E.Create a box plot to visualize outliers

AnswersB, E

Summary statistics and box plots reveal extreme values without altering the data.

Why this answer

Option E is correct because box plots show outliers. Option B is correct because summary statistics (min, max, percentiles) reveal extreme values. Option A is wrong because removing outliers before understanding context is premature.

Option C is wrong because log transformation changes scale but does not help identify outliers. Option D is wrong because mean imputation would distort distribution.
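
A first investigative pass might look like the following sketch, which uses made-up `age` values and only inspects the data without modifying it. The 18-100 bounds come from the question; everything else is illustrative.

```python
import numpy as np

# Made-up 'age' column containing a few implausible values.
ages = np.array([22, 35, 47, 51, 29, 64, 0, 150, 18, 100, 3, 41])

# Summary statistics: extremes and percentiles.
p5, p50, p95 = np.percentile(ages, [5, 50, 95])
print(f"min={ages.min()} p5={p5:.1f} median={p50:.1f} p95={p95:.1f} max={ages.max()}")

# List the values outside the expected adult range for manual review.
suspicious = ages[(ages < 18) | (ages > 100)]
print("outside 18-100:", suspicious)
```

Listing the suspicious values first makes it possible to ask whether they are data-entry errors, placeholder codes, or legitimate records before deciding to remove or impute anything.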
