CCNA Exploratory Data Analysis Questions

1
MCQ · easy

A machine learning engineer is exploring a dataset with 500 features and 10,000 samples. To reduce dimensionality for visualization, which technique is most suitable if the goal is to preserve global data structure?

A. t-Distributed Stochastic Neighbor Embedding (t-SNE)
B. Locally Linear Embedding (LLE)
C. Principal Component Analysis (PCA)
D. Uniform Manifold Approximation and Projection (UMAP)
Answer: C

PCA preserves global variance (covariance structure).

Why this answer

PCA is the most suitable technique for preserving global data structure when reducing dimensionality because it is a linear method that maximizes variance along orthogonal principal components, capturing the overall covariance structure of the 500 features. Unlike nonlinear neighbor-embedding methods, a PCA projection keeps large-scale relationships (e.g., relative distances between well-separated clusters) as faithfully as a low-dimensional linear view allows, making it a good fit for visualizing high-dimensional data when the goal is to see broad patterns.

Exam trap

The exam often tests the misconception that nonlinear methods like t-SNE or UMAP are always better for visualization. The trap here is that they sacrifice global structure for local detail, making PCA the correct choice when the question explicitly states 'preserve global data structure.'

How to eliminate wrong answers

Option A is wrong because t-SNE is a nonlinear technique that focuses on preserving local neighborhoods and pairwise similarities, often distorting global structure (e.g., cluster sizes and distances) to create visually separable clusters. Option B is wrong because LLE is a nonlinear manifold learning method that preserves local linear relationships between neighbors, but it does not guarantee preservation of global structure and can fail with high-dimensional data (500 features) due to the curse of dimensionality. Option D is wrong because UMAP, while faster than t-SNE, is also a nonlinear technique designed to preserve local and some global structure but prioritizes topological connectivity over global variance, making it less suitable than PCA when the explicit goal is to maintain the overall data covariance and global distances.
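
A minimal sketch of the PCA projection discussed above, using synthetic data. The array sizes, the two latent factors, and the random seed are illustrative assumptions, not part of the question. It centers the data, takes the top two right singular vectors, and reports how much of the total variance the 2-D view retains.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 1,000 samples, 50 features driven by 2 latent factors plus noise.
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 50))

# PCA can be computed as the SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first two principal components for plotting.
X_2d = Xc @ Vt[:2].T

# Fraction of total variance captured by each component.
explained = S**2 / np.sum(S**2)
retained_2d = float(explained[:2].sum())
print(X_2d.shape, round(retained_2d, 3))
```

Because the projection is a single linear map applied to every point, large-scale distances are distorted only by the variance that the discarded components carried.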

2
MCQ · medium

A company has a dataset with a large number of missing values in several columns. The data scientist wants to impute missing values without introducing bias. Which approach should be used?

A. Remove rows with any missing values
B. Use iterative imputation (MICE) to model missing values
C. Replace missing values with the mode of each column
D. Replace missing values with the mean of each column
Answer: B

MICE uses relationships among variables to impute, reducing bias.

Why this answer

Option B is correct because MICE (Multiple Imputation by Chained Equations) models each variable with missing values as a function of the other variables, which reduces bias. Option A is wrong because dropping rows loses data. Option C is wrong because mode imputation may introduce bias if missingness is not random. Option D is wrong because mean imputation shrinks variance and distorts relationships between variables.
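
A hedged sketch comparing mean imputation with chained-equation imputation on synthetic data. It assumes scikit-learn is installed and uses its `IterativeImputer`, a MICE-inspired imputer that by default returns a single completed dataset rather than several. The data, missingness rate, and seed are made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + 0.1 * rng.normal(size=500)   # x2 is strongly related to x1
X_true = np.column_stack([x1, x2])

# Hide roughly 20% of x2 at random.
X_missing = X_true.copy()
mask = rng.random(500) < 0.2
X_missing[mask, 1] = np.nan

X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
X_iter = IterativeImputer(random_state=0).fit_transform(X_missing)

# Mean absolute error on the entries that were hidden.
err_mean = np.abs(X_mean[mask, 1] - X_true[mask, 1]).mean()
err_iter = np.abs(X_iter[mask, 1] - X_true[mask, 1]).mean()
print(f"mean imputation error: {err_mean:.3f}, iterative imputation error: {err_iter:.3f}")
```

The iterative imputer recovers the hidden values more closely here because it can use `x1` to predict `x2`, whereas mean imputation ignores that relationship.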

3
MCQ · hard

A machine learning team is working with a dataset containing high-dimensional sparse features, such as text data represented as bag-of-words. The team wants to reduce dimensionality while preserving the structure of the sparse matrix. Which technique is most appropriate for this scenario?

A. t-distributed Stochastic Neighbor Embedding (t-SNE).
B. Truncated Singular Value Decomposition (SVD).
C. Linear Discriminant Analysis (LDA).
D. Principal Component Analysis (PCA) using the covariance matrix.
Answer: B

Truncated SVD works efficiently on sparse matrices.

Why this answer

Option B is correct because Truncated SVD (e.g., sklearn's TruncatedSVD) is designed for sparse matrices: it does not center the data, so the matrix stays sparse while most of the variance is preserved. Option A is wrong because t-SNE is for visualization, not for general dimensionality reduction that preserves structure. Option C is wrong because LDA is a supervised method and requires labels. Option D is wrong because PCA with the covariance matrix requires a dense matrix and is computationally expensive for sparse data.
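
A small sketch of Truncated SVD applied directly to a sparse matrix, assuming scikit-learn and SciPy are available. The random matrix stands in for a bag-of-words matrix; its shape, density, and the number of components are arbitrary choices for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse "document-term" matrix: 200 documents, 1,000 terms, about 1% non-zero.
X = sparse.random(200, 1000, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)   # accepts the sparse matrix directly, no densifying

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio (sum):", float(svd.explained_variance_ratio_.sum()))
```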

4
MCQ · hard

During exploratory data analysis, a data scientist discovers that a feature has a variance of 0.01, while other features have variances around 1.0. Which action should be taken?

A. Scale the feature to have unit variance.
B. Apply a log transformation to the feature.
C. Impute missing values in the feature.
D. Consider removing the feature or applying variance threshold.
Answer: D

Near-zero variance features are often uninformative.

Why this answer

Option D is correct because features with near-zero variance contribute little information and may cause numerical instability. Option A is wrong because scaling to unit variance amplifies noise. Option B is wrong because a log transformation does not increase variance meaningfully. Option C is wrong because low variance does not imply missing values.
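
A minimal variance-threshold sketch in NumPy (scikit-learn's `VarianceThreshold` offers the same idea as a transformer). The cutoff of 0.05 is a hypothetical value chosen for this toy data. Variance depends on a feature's units, so a low value is only meaningful when features are on comparable scales.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three features with variance near 1.0 and one with variance near 0.01.
X = np.column_stack([
    rng.normal(scale=1.0, size=1000),
    rng.normal(scale=1.0, size=1000),
    rng.normal(scale=0.1, size=1000),   # std 0.1 -> variance about 0.01
    rng.normal(scale=1.0, size=1000),
])

threshold = 0.05                 # hypothetical cutoff; choose per dataset
variances = X.var(axis=0)
keep = variances > threshold     # boolean mask of features to retain
X_filtered = X[:, keep]

print(np.round(variances, 3), keep)
```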

5
MCQ · medium

Refer to the exhibit. A data scientist is using an IAM role with this policy to run a SageMaker processing job that reads data from S3. The job fails with an access error. What is the most likely cause?

A. The policy does not allow sagemaker:CreateProcessingJob
B. The policy does not allow s3:PutObject
C. The policy does not allow s3:ListBucket
D. The policy does not allow s3:GetObject
Answer: C

ListBucket is required to list objects in the bucket.

Why this answer

The processing job needs both s3:GetObject and s3:ListBucket to read objects, and the policy lacks s3:ListBucket. Option A is wrong because sagemaker:CreateProcessingJob is allowed. Option B is wrong because s3:PutObject is not needed for reading. Option D is wrong because the policy allows s3:GetObject.

6
MCQ · easy

A machine learning engineer is working on a regression problem to predict house prices. The dataset contains 500,000 rows and 20 features, including 'sqft_living', 'bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', 'yr_built', 'zipcode', and 'lat'. After performing exploratory data analysis, the engineer notices that the 'sqft_living' feature has a right-skewed distribution with a long tail. The 'zipcode' feature is categorical with 70 unique values. The 'lat' feature is continuous. The engineer wants to prepare the data for a linear regression model. Which action should the engineer take to improve model performance?

A. Remove the 'sqft_living' feature because it violates the normality assumption.
B. Apply a log transformation to the 'sqft_living' feature.
C. One-hot encode the 'zipcode' feature to capture location effects.
D. Apply standard scaling (z-score) to the 'sqft_living' feature.
Answer: B

Log transformation reduces right skewness, making the distribution more symmetric.

Why this answer

Linear regression does not require features to be normally distributed, but a right-skewed feature like 'sqft_living' with a long tail can make its relationship with the target nonlinear and give a few extreme values outsized influence on the fit. Applying a log transformation compresses the long tail, making the distribution more symmetric and helping the model learn a linear relationship between the feature and the target. This is a standard preprocessing step for skewed features in regression tasks.

Exam trap

The exam often tests the misconception that standard scaling (z-score) can fix skewness, when in reality it only shifts and rescales the values to zero mean and unit variance without altering the shape of the distribution.

How to eliminate wrong answers

Option A is wrong because removing the 'sqft_living' feature outright discards valuable information; linear regression does not require strict normality of features, only that residuals are normally distributed, and skewness can be addressed via transformation. Option C is wrong because one-hot encoding 'zipcode' with 70 unique values would create 69 dummy variables, which is acceptable but not the most impactful action for improving model performance given the stated issue of skewness in 'sqft_living'. Option D is wrong because standard scaling (z-score) only centers and scales the data, which does not address right skewness; it would preserve the long tail and fail to make the distribution more normal.
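
The contrast between standard scaling and a log transformation can be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for 'sqft_living' (the distribution parameters are assumptions) and computes sample skewness before and after each transformation.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))

rng = np.random.default_rng(0)
# Synthetic right-skewed stand-in for 'sqft_living' (log-normal values).
sqft = rng.lognormal(mean=7.5, sigma=0.5, size=10_000)

zscored = (sqft - sqft.mean()) / sqft.std()   # standard scaling
logged = np.log1p(sqft)                       # log transformation

print("raw:    ", round(skewness(sqft), 2))
print("z-score:", round(skewness(zscored), 2))
print("log1p:  ", round(skewness(logged), 2))
```

Standard scaling leaves the skewness unchanged because it is an affine map, while the log transformation brings it close to zero for this data.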

7
MCQ · medium

Refer to the exhibit. A data scientist is using AWS Glue ETL jobs to process data from a source database. The job logs show repeated timeout errors. Which EDA step should the scientist perform to diagnose the issue?

A. Test network connectivity from the Glue job to the source database using telnet.
B. Check the source database table sizes and row counts over time.
C. Switch the Glue ETL job type from Spark to Python shell to reduce overhead.
D. Increase the Glue job timeout to 600 seconds and rerun.
Answer: B

Identifies if data volume growth causes timeouts.

Why this answer

The timeout suggests the job is taking longer than the max 300 seconds. Analyzing source data volume trends helps determine whether data size has increased, causing longer processing time. Option A is wrong because the error is about timeout, not connectivity. Option C is wrong because changing the job type does not address the root cause. Option D is wrong because increasing the timeout without understanding data growth is a temporary fix.

8
MCQ · medium

A data scientist is analyzing a dataset of customer reviews for a retail company. The dataset contains text reviews, star ratings (1-5), and customer metadata. The scientist wants to perform sentiment analysis to classify reviews as positive or negative. During EDA, the scientist uses Amazon SageMaker Data Wrangler to visualize the distribution of star ratings and notices that 90% of reviews are 4 or 5 stars, while only 2% are 1 star. The scientist is concerned about class imbalance. Which approach should the scientist take to address the imbalance before modeling?

A. Downsample the majority class to create a balanced dataset.
B. Use random oversampling of the minority class to balance the dataset.
C. Use accuracy as the evaluation metric since it is simple.
D. Use the F1-score as the evaluation metric to account for imbalance.
Answer: D

F1-score balances precision and recall, appropriate for imbalanced classes.

Why this answer

Option D is correct because using F1-score as the evaluation metric accounts for class imbalance better than accuracy. Option A is wrong because downsampling the majority class loses valuable data. Option B is wrong because random oversampling can lead to overfitting and is not always best. Option C is wrong because accuracy is misleading for imbalanced datasets.
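
To see why accuracy misleads on imbalanced data, consider a degenerate classifier that always predicts the majority class. The sketch below uses made-up counts (90 positive, 10 negative reviews) and hand-written metric helpers, so nothing here reflects a specific library's API.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive):
    """F1 for the class named `positive`: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: report 0 rather than divide by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 90 positive reviews and 10 negative reviews (illustrative counts).
y_true = ["pos"] * 90 + ["neg"] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = ["pos"] * 100

print(accuracy(y_true, y_pred))             # 0.9
print(f1(y_true, y_pred, positive="neg"))   # 0.0
```

The classifier never finds a negative review, yet accuracy still reports 0.9; the F1-score on the minority class exposes the failure.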

9
MCQ · hard

A data scientist is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a dataset. The dataset contains a feature 'age' with values ranging from 0 to 120. The data scientist wants to detect outliers. Which built-in transform in Data Wrangler is most appropriate for this task?

A. Handle Outliers
B. Scale and Normalize
C. Handle Missing
D. Encode Categorical
Answer: A

This transform includes outlier detection methods such as IQR and z-score.

Why this answer

Option A is correct because the 'Handle Outliers' transform provides methods like IQR and z-score to detect and handle outliers. Option B is wrong because 'Scale and Normalize' transforms data but does not detect outliers. Option C is wrong because 'Handle Missing' deals with missing values, not outliers. Option D is wrong because 'Encode Categorical' is for categorical features.
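
The two detection rules named above can be sketched in NumPy. This illustrates the general IQR and z-score rules on made-up ages; it is not Data Wrangler's implementation, and the cutoffs (1.5 × IQR and 2.5 standard deviations) are assumed for the example.

```python
import numpy as np

ages = np.array([23, 31, 35, 38, 41, 44, 47, 52, 58, 63, 119], dtype=float)

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
iqr_outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

# z-score rule: flag values more than k standard deviations from the mean.
k = 2.5  # hypothetical cutoff; 3 is also common
z = (ages - ages.mean()) / ages.std()
z_outliers = ages[np.abs(z) > k]

print(iqr_outliers, z_outliers)
```

In small samples an extreme value inflates the standard deviation it is measured against, so the z-score rule can be less sensitive than the IQR rule; a cutoff of 3 would not flag 119 here.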

10
MCQ · hard

A data scientist is performing exploratory data analysis on text data. They want to identify the most common terms and their frequencies. Which approach should they use?

A. Perform sentiment analysis on the text.
B. Apply Latent Dirichlet Allocation (LDA) to extract topics.
C. Create a bag-of-words matrix and compute term frequencies.
D. Use word2vec to generate word embeddings.
Answer: C

Bag-of-words directly provides term counts.

Why this answer

Option C is correct because a bag-of-words matrix (a simple count vectorizer) gives term counts directly. A term frequency-inverse document frequency (TF-IDF) vectorizer would provide weighted frequencies instead, but the question asks for the most common terms and their frequencies, so plain counts are appropriate. Option A is wrong because sentiment analysis is not about frequency. Option B is wrong because LDA is a topic model. Option D is wrong because word2vec produces embeddings, not frequencies.
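
A minimal term-count sketch using only the standard library. The three documents and the simple regex tokenizer are illustrative assumptions; a production pipeline would more likely use a vectorizer with stop-word handling.

```python
import re
from collections import Counter

docs = [
    "great product and fast shipping",
    "fast delivery, great price",
    "product arrived late but great support",
]

# Tokenize: lowercase and keep alphabetic runs only (a deliberately simple tokenizer).
tokens = [tok for doc in docs for tok in re.findall(r"[a-z]+", doc.lower())]

term_counts = Counter(tokens)
print(term_counts.most_common(3))
```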

11
MCQ · medium

A data scientist is working with a dataset containing customer transactions. The dataset has a column named 'transaction_date' with timestamp values. The scientist wants to create new features such as day of week, hour, and whether the transaction occurred on a weekend. Which AWS service provides built-in feature engineering capabilities for datetime columns?

A. Amazon SageMaker Data Wrangler
B. Amazon Athena
C. AWS Glue ETL
D. Amazon EMR
Answer: A

Data Wrangler has built-in datetime feature extraction.

Why this answer

Amazon SageMaker Data Wrangler includes built-in transformations for datetime features, such as extracting day, month, and hour. Option B (Amazon Athena) can extract date parts with SQL but not as a feature engineering step. Option C (AWS Glue ETL) requires custom code. Option D (Amazon EMR) requires more manual effort.
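
For reference, the datetime features mentioned in the question (day of week, hour, weekend flag) look like this when derived by hand. This is a plain-Python sketch with hypothetical timestamps, not the Data Wrangler transform itself.

```python
from datetime import datetime

def datetime_features(ts: datetime) -> dict:
    """Derive simple calendar features from one timestamp."""
    return {
        "day_of_week": ts.weekday(),       # Monday = 0 ... Sunday = 6
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,   # Saturday or Sunday
    }

# Hypothetical 'transaction_date' values: a Friday morning and a Saturday night.
transactions = [datetime(2024, 3, 15, 9, 30), datetime(2024, 3, 16, 22, 5)]
for ts in transactions:
    print(ts.isoformat(), datetime_features(ts))
```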

12
MCQ · medium

Refer to the exhibit. A data scientist is unable to run an Amazon Athena query on data in `my-bucket`. The IAM policy shown is attached to the user. What is the most likely reason for the failure?

A. The ListBucket action is not granted.
B. Athena needs s3:PutObject permission to write results.
C. The data is encrypted with SSE-C.
D. The bucket does not exist.
Answer: B

Athena writes output to S3.

Why this answer

Option B is correct because Athena also requires `s3:PutObject` to write query results to an output location. Option A is wrong because ListBucket is already allowed; Option C is wrong because encryption is not the issue; Option D is wrong because the bucket exists.

13
MCQ · medium

A data scientist is exploring a dataset with many missing values. They want to understand the pattern of missingness before deciding on imputation. Which approach is most appropriate?

A. Compute the correlation matrix of the features with missing values.
B. Drop all rows with any missing values.
C. Impute all missing values with the mean of each column.
D. Visualize the missingness using a heatmap or bar chart.
Answer: D

Visualization helps identify patterns like monotonic or random missingness.

Why this answer

Option D is correct because a heatmap of missing values (using libraries like missingno) visually shows patterns. Option A (correlation matrix) does not show missingness patterns; Option B (drop rows) is premature; Option C (mean imputation) assumes MCAR.
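
The quantities behind a missingness heatmap or bar chart can be computed directly. The sketch below uses a toy NumPy array with hypothetical column names; the boolean mask is what a heatmap would draw, and the per-column fractions are what a bar chart would show.

```python
import numpy as np

# Toy data: rows are records, columns are features; NaN marks a missing value.
X = np.array([
    [1.0, np.nan, 3.0],
    [2.0, np.nan, np.nan],
    [3.0, 5.0,    6.0],
    [4.0, 6.0,    np.nan],
])
columns = ["age", "income", "tenure"]   # hypothetical column names

missing = np.isnan(X)                   # boolean mask, the raw material for a heatmap
frac_missing = missing.mean(axis=0)     # per-column missing fraction (bar chart input)

# Co-missingness: how often two columns are missing in the same row.
co_missing = missing.T.astype(int) @ missing.astype(int)

for name, frac in zip(columns, frac_missing):
    print(f"{name}: {frac:.0%} missing")
print(co_missing)
```

Large off-diagonal entries in `co_missing` suggest that columns tend to be missing together, which is a hint that the missingness is not completely random.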

14
MCQ · easy

A data scientist is exploring a dataset and finds that the correlation between two features is 0.95. What should the data scientist do to address multicollinearity before training a linear regression model?

A. Remove one of the two features
B. Apply L2 regularization
C. Standardize the features
D. Apply Principal Component Analysis
Answer: A

Removing one feature eliminates the high correlation.

Why this answer

Option A is correct because removing one of the highly correlated features reduces multicollinearity. Regularization (B) like Ridge can help but does not remove multicollinearity. Scaling (C) does not affect correlation. PCA (D) changes interpretability.

15
MCQ · medium

A data scientist is analyzing server logs stored in Amazon CloudWatch Logs. The above snippet shows three log entries. They want to count the number of 500 errors per minute using CloudWatch Logs Insights. Which query should they use?

A. fields @timestamp, status | filter status = 500 | stats count() by bin(1m)
B. fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(1m)
C. fields @timestamp, @message | filter @message like /500/ | sort @timestamp desc
D. fields @timestamp, @message | filter @message like /500/ | stats count() by bin(1m)
Answer: D

Correctly filters for 500 status code and counts per minute.

Why this answer

Option D is correct because the query filters for messages containing 500 and counts the matches in one-minute bins. Option A is wrong because it filters by fields that may not exist. Option B is wrong because it filters error messages, not status codes. Option C is wrong because it uses sort without stats.

16
MCQ · medium

A data scientist is exploring a dataset with 100 features. After generating pair plots, the scientist notices that many features have skewed distributions. Which transformation should the scientist apply to make the distributions more Gaussian-like for modeling?

A. Log transformation
B. Yeo-Johnson transformation
C. Standard scaling (z-score normalization)
D. Box-Cox transformation
Answer: B

Works for any real values.

Why this answer

Option B is correct because Yeo-Johnson can handle both positive and negative values. Option A is wrong because log transform only works for positive values. Option C is wrong because standard scaling does not fix skewness. Option D is wrong because Box-Cox also requires positive values.
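
A hedged sketch of the difference between Yeo-Johnson and Box-Cox, assuming scikit-learn's `PowerTransformer` is available. The shifted exponential data is synthetic and chosen so that the feature is right-skewed and contains negative values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float).ravel()
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))

rng = np.random.default_rng(0)
# Right-skewed feature shifted so that it contains negative values.
x = (rng.exponential(scale=2.0, size=5000) - 1.0).reshape(-1, 1)

yj = PowerTransformer(method="yeo-johnson")   # fits its lambda parameter from the data
x_yj = yj.fit_transform(x)
skew_before, skew_after = skewness(x), skewness(x_yj)
print("skew before:", round(skew_before, 2), "after:", round(skew_after, 2))

# Box-Cox rejects the same data because it is not strictly positive.
try:
    PowerTransformer(method="box-cox").fit(x)
    box_cox_ok = True
except ValueError:
    box_cox_ok = False
print("Box-Cox accepted the data:", box_cox_ok)
```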

17
MCQ · easy

During EDA, a data scientist notices that a numeric feature 'age' has outliers beyond 3 standard deviations. What is the most appropriate first step?

A. Use the feature as-is in the model
B. Apply a log transformation to the feature
C. Remove all rows with outlier values
D. Investigate the source of the outliers
Answer: D

Understanding outliers guides proper handling.

Why this answer

Option D is correct because investigating the outliers helps determine whether they are data errors or valid values. Option A is wrong because modeling with unexamined noise is not prudent. Option B is wrong because a transformation is a way of handling outliers, not a first step. Option C is wrong because removing rows without investigation may lose valuable data.

18
MCQ · easy

A data analyst is investigating a dataset where the target variable is binary (0/1). The analyst wants to check for multicollinearity among the numerical features. Which statistical measure should the analyst use?

A. Variance Inflation Factor (VIF).
B. Mutual information between features and target.
C. Chi-square test of independence.
D. Pearson correlation coefficient between each pair of features.
Answer: A

VIF measures how much a feature is explained by other features.

Why this answer

Option A is correct because Variance Inflation Factor (VIF) quantifies how well a feature is explained by all the other features combined. Option B is wrong because mutual information measures dependence but does not specifically detect multicollinearity. Option C is wrong because chi-square is for categorical variables. Option D is wrong because Pearson correlation only measures pairwise linear relationships, not multicollinearity among multiple features.
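
A minimal NumPy sketch of VIF, computed as 1 / (1 - R²) from regressing each feature on the others. The helper `vif` and the synthetic features are written for this illustration; statsmodels provides a comparable function for real use.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # add intercept
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + b + 0.1 * rng.normal(size=1000)   # nearly a linear combination of a and b
d = rng.normal(size=1000)                 # unrelated feature

vifs = vif(np.column_stack([a, b, c, d]))
print(np.round(vifs, 1))
```

Feature `c` is only moderately correlated with `a` or `b` individually (roughly 0.7 each), yet its VIF is very large because the two together explain it almost completely. That is the case pairwise correlation misses.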

19
MCQ · medium

A data scientist is performing EDA on a dataset with many features. They suspect some features are redundant due to high pairwise correlations. Which technique can help identify groups of correlated features?

A. Use t-SNE to visualize feature relationships
B. Apply PCA and examine the loadings
C. Compute mutual information between each feature and the target
D. Use chi-square test for each pair
E. Create a correlation matrix and visualize with a heatmap
Answer: E

A correlation matrix heatmap clearly shows correlated feature groups.

Why this answer

A correlation matrix with a heatmap visualizes pairwise correlations and helps identify groups of correlated features. Option A is wrong because t-SNE is for visualization of high-dimensional data, not for correlation analysis. Option B is wrong because PCA reduces dimensionality but does not show feature groups directly. Option C is wrong because mutual information measures dependency but not specifically linear correlation. Option D is wrong because chi-square test is for categorical associations.
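
The matrix behind such a heatmap can be computed and scanned for strongly correlated pairs. This sketch uses synthetic features and a hypothetical cutoff of 0.9; plotting is left out so the example stays dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=500)
features = {
    "f0": base,
    "f1": base + 0.1 * rng.normal(size=500),    # near-duplicate of f0
    "f2": rng.normal(size=500),                 # independent feature
    "f3": -base + 0.1 * rng.normal(size=500),   # strongly negatively correlated with f0
}
names = list(features)
X = np.column_stack([features[n] for n in names])

corr = np.corrcoef(X, rowvar=False)   # Pearson correlation matrix (heatmap input)

# List feature pairs whose absolute correlation exceeds a hypothetical cutoff.
cutoff = 0.9
pairs = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > cutoff
]
print(pairs)
```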

20
Drag & Drop · medium

Drag and drop the steps to use Amazon SageMaker Feature Store for feature engineering in the correct order.

Why this order

The Feature Store workflow is: define a feature group, ingest features, query them, use them for training, and maintain the feature group.

21
Multi-Select · easy

A data analyst is performing EDA on a tabular dataset with 500 features. The goal is to reduce dimensionality before modeling. Which TWO techniques are appropriate for this task?

Select 2 answers
A. t-distributed Stochastic Neighbor Embedding (t-SNE).
B. Principal Component Analysis (PCA).
C. k-fold cross-validation to assess model performance.
D. Chi-square test for independence between features.
E. One-hot encoding of categorical variables.
Answers: A, B

t-SNE reduces dimensions for visualization.

Why this answer

PCA and t-SNE are both dimensionality reduction techniques. PCA is linear, t-SNE is nonlinear. Option C (cross-validation) is for model evaluation. Option D (Chi-square test) is for feature selection with categorical targets, not dimensionality reduction. Option E (one-hot encoding) expands features.

22
MCQ · hard

A data engineer is performing exploratory data analysis on a dataset stored in Amazon S3 using AWS Glue DataBrew. The dataset contains a column 'age' with missing values. DataBrew's profile shows that the column has 5% missing values, a mean of 45, and a standard deviation of 15. Which imputation strategy should the engineer recommend to minimize bias if the missing data is Missing at Random (MAR)?

A. Replace missing values with the mean (45)
B. Remove rows with missing 'age' values
C. Replace missing values with the median
D. Use multiple imputation to generate several plausible values and combine results
Answer: D

Multiple imputation preserves the natural variability and provides valid statistical inferences under MAR.

Why this answer

Option D is correct because multiple imputation provides unbiased estimates under MAR by accounting for uncertainty. Option A is wrong because mean imputation reduces variance and can bias relationships. Option B is wrong because dropping rows reduces sample size and may introduce bias if missingness is related to other variables. Option C is wrong because median imputation is robust but still single imputation.

23
Multi-Select · easy

Which TWO are common steps in exploratory data analysis?

Select 2 answers
A. Training a machine learning model.
B. Checking for missing values.
C. Visualizing distributions of features.
D. Deploying the model to production.
E. Tuning hyperparameters.
Answers: B, C

Missing value analysis is a key step.

24
MCQ · medium

A data scientist is analyzing a time series dataset of daily website traffic. The scientist notices a strong weekly seasonality. To better understand the underlying patterns, which decomposition method should the scientist use to separate the trend, seasonal, and residual components?

A. Additive decomposition using moving averages.
B. Use STL (Seasonal and Trend decomposition using Loess).
C. Fit an ARIMA model and examine residuals.
D. Apply an ETS (Error, Trend, Seasonal) model.
Answer: B

STL is robust and flexible for any seasonality.

Why this answer

Option B is correct because STL decomposition is robust to outliers and can handle any seasonality period, making it suitable for daily data with weekly seasonality. Option A is wrong because classical decomposition assumes additive seasonality and is less robust. Option C is wrong because ARIMA is a forecasting model, not a decomposition method. Option D is wrong because ETS is an exponential smoothing model, not primarily for decomposition.

25
Multi-Select · easy

Which TWO approaches are appropriate for handling missing categorical data during exploratory data analysis? (Choose two.)

Select 2 answers
A. Use one-hot encoding to represent missingness as a binary feature.
B. Impute with the mode (most frequent) of the column.
C. Treat missing values as a separate 'Unknown' category.
D. Drop all rows with missing values in that column.
E. Impute missing values with the mean of the column.
Answers: B, C

Mode is a simple imputation for categorical data.

Why this answer

Options B and C are correct. B: Using mode imputation is a simple baseline. C: Creating an 'Unknown' category preserves information about missingness. A: One-hot encoding requires values first; D: Dropping rows may lose data; E: Mean is for numerical data.

26
MCQ · hard

A data scientist is exploring a dataset with 1,000 features and only 200 samples. The goal is to build a binary classifier. Which technique should be used first during exploratory data analysis to reduce dimensionality and avoid overfitting?

A. Compute pairwise correlations and remove highly correlated features.
B. Apply L1 regularization (Lasso) to select features.
C. Use t-SNE to visualize clusters and reduce dimensions.
D. Use principal component analysis (PCA) to reduce dimensions.
Answer: D

PCA reduces dimensionality while preserving variance.

Why this answer

Option D is correct because PCA is an unsupervised dimensionality reduction technique suitable for high-dimensional data with few samples. Option A is wrong because feature selection based on correlation may miss interactions. Option B is wrong because L1 regularization is model-dependent and not part of EDA. Option C is wrong because t-SNE is for visualization, not feature reduction for modeling.

27
MCQ · medium

A data scientist is exploring a dataset stored as a single 2 GB object in S3. The scientist wants to read only a subset of the file (e.g., the first 1000 lines) to perform initial data inspection. Which approach should the scientist take to minimize data transfer and cost?

A. Use the AWS CLI to download the entire file and then use head to get the first lines.
B. Use S3 Select with a SQL query to retrieve the first 1000 rows.
C. Use the S3 Range header to read the first 1 MB of the file and parse lines.
D. Use Amazon Athena to query the file with LIMIT 1000.
Answer: B

S3 Select efficiently retrieves only the required subset.

Why this answer

Option B is correct because S3 Select allows retrieving a subset of data using SQL-like queries, reducing data transfer. Option A is wrong because downloading the entire 2 GB file is inefficient and costly. Option C is wrong because an S3 range read (with the Range header) retrieves bytes, whereas the user wants lines; it is possible but less convenient, since the requested byte range may end mid-line or contain fewer than 1000 lines. Option D is wrong because Athena scans the entire file; it is not efficient for a single large file.

28
MCQ · medium

A data scientist is performing exploratory data analysis on a dataset containing customer transactions. The dataset has 1 million rows with 50 features, including numerical and categorical variables. The goal is to identify patterns and potential data quality issues before building a model. Which approach should the data scientist take to efficiently explore the data?

A. Use AWS Glue DataBrew to profile the dataset, view data quality reports, and visualize distributions.
B. Use Amazon Athena to run SQL queries and generate summary statistics.
C. Use Amazon SageMaker Data Wrangler to import the data and create a flow for feature engineering.
D. Use Amazon SageMaker Ground Truth to label the data and then analyze the labels.
Answer: A

DataBrew provides an interactive interface for data profiling, cleaning, and visualization, making it suitable for EDA.

Why this answer

AWS Glue DataBrew is purpose-built for visual data preparation and profiling without writing code. It can directly profile the 1-million-row dataset, automatically generate data quality reports (e.g., missing values, outliers, data types), and provide distribution visualizations for both numerical and categorical features, making it the most efficient choice for exploratory data analysis.

Exam trap

The exam often tests the distinction between tools for exploratory data analysis and tools for data transformation or labeling, leading candidates to confuse SageMaker Data Wrangler (feature engineering) or Athena (SQL querying) with a dedicated profiling tool like DataBrew.

How to eliminate wrong answers

Option B is wrong because Amazon Athena is a serverless query engine for analyzing data in S3 using SQL, but it does not provide built-in profiling, data quality reports, or visualizations; it requires manual SQL queries to generate summary statistics, which is less efficient for exploratory analysis. Option C is wrong because Amazon SageMaker Data Wrangler is designed for importing, transforming, and creating feature engineering flows, not for initial data profiling and quality assessment; its primary purpose is preparing data for model training, not exploratory analysis. Option D is wrong because Amazon SageMaker Ground Truth is a data labeling service for creating labeled datasets, not for exploratory data analysis or profiling; using it to analyze labels would be an incorrect and inefficient use of the service.

29
MCQ · medium

A data scientist is exploring a dataset stored in an Amazon S3 bucket. The dataset contains both numerical and categorical features. The scientist wants to compute summary statistics (mean, median, standard deviation) for all numerical features and count the distinct values for categorical features. Which AWS service is most appropriate for this task with minimal coding?

A. Amazon Athena
B. AWS Glue ETL jobs
C. AWS Glue DataBrew
D. Amazon SageMaker Data Wrangler
E. Amazon EMR
Answer: C

DataBrew offers visual data profiling with summary statistics and distinct counts.

Why this answer

AWS Glue DataBrew provides a visual interface for data preparation and profiling, including summary statistics and distinct value counts, without writing code. Option A is wrong because Amazon Athena requires SQL queries. Option B is wrong because AWS Glue ETL jobs require coding in Python or Scala. Option D is wrong because Amazon SageMaker Data Wrangler requires integration with SageMaker and may require more setup. Option E is wrong because Amazon EMR requires cluster management and coding.

30
MCQ · easy

A machine learning engineer notices that the target variable in a regression dataset has a long-tailed distribution. Which visualization technique is most appropriate to assess the distribution before applying a log transformation?

A. Bar chart
B. Histogram with density curve
C. Box plot
D. Scatter plot
Answer: B

Histogram and density curve show the distribution shape, including long tails.

Why this answer

Option B is correct because a histogram or density plot clearly shows the shape and spread of the distribution, including long tails. Option A is incorrect because bar charts are for categorical data. Option C is incorrect because box plots show quartiles and outliers but not the full distribution shape. Option D is incorrect because scatter plots show relationships between two variables, not univariate distribution.

31
MCQ · hard

A data scientist is analyzing a dataset with a timestamp column and several numeric measurements. The goal is to detect seasonality and trends. Which AWS service can be used directly from SageMaker Studio to perform this analysis without writing code?

A. Amazon Forecast
B. Amazon SageMaker Data Wrangler
C. Amazon QuickSight ML Insights
D. AWS Glue DataBrew
Answer: B

Includes built-in time series analysis.

Why this answer

SageMaker Data Wrangler includes built-in time series analysis capabilities such as seasonality detection. Option A is wrong because Forecast is for forecasting, not exploratory analysis. Option C is wrong because QuickSight ML Insights requires data to be in QuickSight. Option D is wrong because Glue DataBrew is for data preparation but not specifically for time series analysis.

32
MCQ · hard

A data scientist is analyzing a dataset stored in Amazon S3 (100 GB, CSV format) using Amazon SageMaker Studio. The dataset contains 500 columns and 10 million rows. The data scientist wants to understand the distribution of each column, detect missing values, and identify outliers. However, the SageMaker Studio notebook instance runs out of memory when loading the entire dataset into a pandas DataFrame. The data scientist needs to complete the EDA efficiently without modifying the source data. What should the data scientist do?

A.Write a script that loads only a random 10% sample of rows to reduce memory usage.
B.Use AWS Glue ETL to transform the data into Parquet format and then load into pandas.
C.Launch a larger notebook instance with more memory (e.g., ml.r5.24xlarge) and reload the data.
D.Use Amazon SageMaker Data Wrangler to create a data flow that samples and profiles the data.
AnswerD

Data Wrangler can handle large datasets efficiently.

Why this answer

Using SageMaker Data Wrangler allows processing data in a distributed manner without loading everything into memory. Option C is wrong because increasing instance size may still not be enough and is costly. Option B is wrong because Glue ETL is more complex and not integrated with Studio.

Option A is wrong because an ad hoc 10% sample loaded by a script may miss important data.

33
MCQeasy

During EDA, a data scientist finds that a categorical feature 'city' has 500 unique values but only 10 cities account for 90% of the data. What is a recommended way to handle the rare categories?

A.Group rare categories into a single 'Other' category.
B.Apply label encoding to all categories.
C.One-hot encode all 500 categories.
D.Drop all rows with rare categories.
AnswerA

Reduces cardinality and retains data.

Why this answer

Option A is correct because grouping rare categories into 'Other' reduces cardinality while keeping every row. Option C is wrong because one-hot encoding all 500 categories creates many sparse features. Option D is wrong because dropping the rows may lose information.

Option B is wrong because label encoding still leaves 500 distinct values and imposes an arbitrary order on them.
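
The grouping step can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the function name `group_rare_categories`, the 5% `min_share` cutoff, and the sample city names are all assumptions made for the example.

```python
from collections import Counter

def group_rare_categories(values, min_share=0.05, other_label="Other"):
    """Replace categories whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    frequent = {cat for cat, n in counts.items() if n / total >= min_share}
    return [v if v in frequent else other_label for v in values]

# Hypothetical 'city' column: two dominant cities and three rare ones.
cities = ["Seattle"] * 50 + ["Austin"] * 45 + ["Boise", "Fargo", "Provo", "Boise", "Fargo"]
grouped = group_rare_categories(cities, min_share=0.05)
category_counts = Counter(grouped)
print(category_counts)
```

No rows are dropped; only the labels of the rare categories change, so the feature ends up with three levels instead of five.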

34
MCQhard

A data scientist is performing exploratory data analysis on a dataset with mixed data types (numerical, categorical, text). The goal is to identify clusters of similar records. Which technique is most appropriate?

A.DBSCAN
B.Hierarchical clustering
C.K-means clustering
D.K-prototypes clustering
AnswerD

K-prototypes is designed for mixed numerical and categorical data.

Why this answer

K-prototypes extends k-means to handle mixed data by combining Euclidean distance for numerical and Hamming distance for categorical. K-means only works with numerical data. DBSCAN works on numerical data.

Hierarchical clustering typically uses numerical distance. Gower distance can be used but is less common in clustering algorithms.
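
The mixed dissimilarity described above can be sketched as follows. This is an illustrative assumption of how the two parts are combined (squared Euclidean distance on the numeric part plus a weighted count of categorical mismatches); the function name `mixed_dissimilarity`, the weight `gamma`, and the sample records are hypothetical.

```python
def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma times categorical mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical_part = sum(x != y for x, y in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part

# Hypothetical records: (scaled age, scaled income) and (plan, region).
record_a = ([0.2, 0.5], ["basic", "west"])
record_b = ([0.2, 0.5], ["basic", "east"])
record_c = ([0.9, 0.1], ["premium", "east"])

d_ab = mixed_dissimilarity(*record_a, *record_b)  # differs in one category only
d_ac = mixed_dissimilarity(*record_a, *record_c)  # differs numerically and in both categories
print(d_ab, d_ac)
```

The weight `gamma` controls how much a categorical mismatch counts relative to numeric distance, which is why numeric features are usually scaled first.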

35
Multi-Selecteasy

A data analyst is using AWS Glue to catalog datasets for exploratory analysis. The analyst wants to understand the schema and data types. Which TWO tools can the analyst use to view the schema of a table in the AWS Glue Data Catalog? (Choose TWO.)

Select 2 answers
A.Amazon Athena
B.Amazon Redshift query editor
C.Amazon QuickSight
D.AWS Glue console
E.Amazon S3 console
AnswersA, D

Athena can query the Glue Data Catalog using SHOW CREATE TABLE or INFORMATION_SCHEMA.

Why this answer

Option A is correct because Athena can query INFORMATION_SCHEMA to view table schemas. Option D is correct because the Glue console displays table schemas directly. Option C is wrong because QuickSight is a visualization tool, not a schema viewer.

Option E is wrong because S3 does not store schemas; it stores objects. Option B is wrong because the Redshift query editor returns data and does not show Glue Data Catalog schemas directly.

36
MCQmedium

A data scientist is working on a project to predict customer churn for a telecom company. The dataset includes 50,000 records with 20 features, including customer demographics, account information, and service usage. The data scientist uses Amazon SageMaker Studio and loads the data into a pandas DataFrame. During EDA, the data scientist notices that the target variable 'churn' has only 10% positive cases. Additionally, several features have missing values: 'income' has 5% missing, 'age' has 2% missing, and 'total_charges' has 1% missing. The data scientist also observes that 'income' is highly skewed with a long right tail, and 'age' is moderately skewed. The data scientist wants to handle missing values and prepare the data for modeling. Which course of action is most appropriate?

A.Impute 'income' with median, 'age' with median, 'total_charges' with median, and use SMOTE to handle class imbalance after splitting the data.
B.Remove all rows with any missing values, and use random oversampling to handle class imbalance.
C.Impute 'income' with mode, 'age' with mode, 'total_charges' with mode, and use SMOTE after splitting.
D.Impute all missing values with the mean of each column, and use stratified sampling to handle class imbalance.
AnswerA

Median is robust to skewness. SMOTE is appropriate for imbalance.

Why this answer

Option A is correct because median imputation is robust for skewed data, and SMOTE can address the target imbalance when applied to the training set after splitting. Option D is wrong because mean imputation is sensitive to skewness. Option B is wrong because removing rows with missing values could discard up to about 8% of the data, which is significant.

Option C is wrong because mode is for categorical data, not continuous.
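
A small sketch of why the median is the safer fill value for a skewed column. The `income` values and the helper `impute_with_median` are made up for illustration; in practice the fill value would normally be computed on the training split only.

```python
import statistics

def impute_with_median(values):
    """Fill None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical right-skewed 'income' column with two missing entries.
income = [32_000, 35_000, None, 41_000, 38_000, 1_200_000, None, 36_000]
observed = [v for v in income if v is not None]
mean_fill = statistics.mean(observed)      # pulled upward by the single large value
median_fill = statistics.median(observed)  # stays near the typical value
imputed = impute_with_median(income)
print(f"mean fill: {mean_fill:.0f}, median fill: {median_fill:.0f}")
```

The single large value drags the mean far above every typical income, so mean imputation would insert values that resemble almost no real record.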

37
MCQhard

A data scientist is building a model to predict housing prices using a dataset with 100,000 records and 50 features. The features include 'sqft_living', 'sqft_lot', 'bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', etc. The data scientist uses Amazon SageMaker Data Wrangler for EDA. Upon reviewing the data, the data scientist finds that 'sqft_living' has a correlation of 0.7 with 'sqft_above' (square footage above ground) and 0.6 with 'sqft_basement'. Also, 'grade' (overall grade of the house) is highly correlated with 'condition' (0.8). The target variable 'price' is right-skewed. The data scientist plans to use a linear regression model. Which set of actions should the data scientist take to improve model performance?

A.Apply standard scaling to all numeric features and use the data as is, since linear regression is robust to multicollinearity.
B.Remove all features that have correlation >0.5 with any other feature to eliminate multicollinearity, and apply standard scaling to all numeric features.
C.Apply principal component analysis (PCA) to all features to reduce dimensionality, and then fit linear regression on the principal components.
D.Apply log transformation to the target variable 'price' to reduce skewness, and remove either 'sqft_above' or 'sqft_living' and either 'grade' or 'condition' to handle multicollinearity.
AnswerD

Log transform addresses skewness; removing one of each pair reduces multicollinearity.

Why this answer

Option D is correct because log-transforming the target addresses skewness, and removing one feature from each highly correlated pair reduces multicollinearity. Option B is wrong because removing every feature with correlation above 0.5 may discard predictive power. Option A is wrong because standard scaling does not fix skewness or multicollinearity.

Option C is wrong because PCA on all features may lose interpretability and does not target the specific issues.

38
MCQhard

A data scientist is setting up an IAM policy for a SageMaker notebook instance that needs to read and write data in the 'training/' folder of an S3 bucket, and also list objects in the bucket. Does the policy satisfy the requirements?

A.Yes, the policy correctly grants the required permissions.
B.No, the policy must also include s3:DeleteObject for data cleaning.
C.No, the policy misses s3:GetObject for the bucket itself.
D.No, the condition on ListBucket is invalid.
AnswerA

The policy grants read/write on objects under training/ and list with prefix condition.

Why this answer

Option A is correct. The policy allows GetObject and PutObject on the training/ folder, and ListBucket with a condition restricting the prefix to training/*, which allows listing only that prefix. Option B is wrong because deleting objects is not among the stated requirements, so the policy works without s3:DeleteObject.

Option C is wrong because s3:GetObject applies to objects rather than to the bucket itself, so the policy is sufficient. Option D is wrong because the condition is valid.

39
MCQmedium

During exploratory data analysis, a data scientist notices that the distribution of a continuous feature is heavily right-skewed. Which transformation should be applied to make the distribution more symmetric for linear regression?

A.Standardization (z-score)
B.One-hot encoding
C.Min-max scaling
D.Log transformation
AnswerD

Log transformation reduces right skewness.

Why this answer

Log transformation is commonly used for right-skewed data. Option C is wrong because min-max scaling does not change distribution shape. Option A is wrong because standardization does not fix skewness.

Option B is wrong because one-hot encoding is for categorical features.
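
The effect of the log transformation can be checked numerically. The sketch below uses a hand-written population skewness function and a made-up geometric sequence as the right-skewed feature; `math.log1p` is used here as one common variant because it tolerates zeros.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed feature: values grow geometrically (1, 2, 4, ..., 2048).
feature = [2 ** k for k in range(12)]
log_feature = [math.log1p(v) for v in feature]  # log(1 + x)

skew_before = skewness(feature)
skew_after = skewness(log_feature)
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Min-max scaling or standardization applied to the same values would leave the skewness unchanged, because both are linear rescalings.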

40
MCQmedium

A data scientist is analyzing a dataset with a skewed target variable for a regression problem. During EDA, the scientist wants to transform the target variable to approximate a normal distribution. Which transformation should the scientist apply first?

A.Quantile transformation
B.Min-Max scaling
C.Log transformation
D.Box-Cox transformation
AnswerD

Box-Cox automatically finds the best power transformation to achieve normality.

Why this answer

Option D is correct because Box-Cox is designed to make data more normally distributed; it requires strictly positive values. Option C is wrong because log is a special case of Box-Cox but less general. Option B is wrong because scaling does not change distribution shape.

Option A is wrong because quantile transformation can be used but may overfit; Box-Cox is parametric and often preferred.
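
To make the relationship between Box-Cox and the log transformation concrete, here is a minimal sketch of the Box-Cox formula for a fixed λ. The function name `box_cox` and the sample `target` values are assumptions for illustration; library implementations such as `scipy.stats.boxcox` additionally estimate λ from the data, which this sketch does not do.

```python
import math

def box_cox(values, lam):
    """Box-Cox power transform for strictly positive values and a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

target = [1.0, 2.0, 5.0, 20.0, 100.0]  # hypothetical skewed target

as_log = box_cox(target, 0)         # lambda = 0 is exactly the log transform
as_sqrt = box_cox(target, 0.5)      # lambda = 0.5 is a shifted, scaled square root
near_zero = box_cox(target, 1e-6)   # a tiny lambda approaches the log transform
print(as_log, as_sqrt)
```

Because the log transformation is just the λ = 0 member of this family, searching over λ can only match or improve on it with respect to the fitting criterion.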

41
MCQmedium

A data engineer is using Amazon SageMaker Data Wrangler to perform exploratory data analysis on a large dataset stored in S3. The analysis reveals high cardinality in a categorical feature with over 1 million unique values. What is the best approach to handle this before training a model?

A.Apply one-hot encoding.
B.Use label encoding to convert categories to integers.
C.Drop the high-cardinality feature.
D.Use target encoding based on the mean of the target variable per category.
AnswerD

Target encoding reduces cardinality and captures target relationship.

Why this answer

Option D is correct because target encoding is effective for high cardinality. Option A is wrong because one-hot encoding would create too many columns. Option B is wrong because label encoding may introduce spurious ordinal relationships.

Option C is wrong because dropping the feature may lose important information.
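
A bare-bones sketch of target encoding, assuming the mapping is fitted on training rows only and unseen categories fall back to the global mean. The helper names `fit_target_encoding` and `apply_target_encoding` and the toy data are hypothetical; real pipelines usually add smoothing or cross-fitting to limit leakage.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Map each category to the mean target value seen for it in the training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

def apply_target_encoding(categories, mapping, default):
    """Encode categories; unseen ones fall back to a default such as the global mean."""
    return [mapping.get(cat, default) for cat in categories]

# Hypothetical training rows: a category column and a binary target.
train_cats = ["a", "a", "b", "b", "b", "c"]
train_y = [1, 0, 1, 1, 0, 0]
mapping = fit_target_encoding(train_cats, train_y)
global_mean = sum(train_y) / len(train_y)

# Encode new rows, including a category never seen during fitting.
encoded = apply_target_encoding(["a", "b", "z"], mapping, default=global_mean)
print(mapping, encoded)
```

However many unique categories exist, the result is a single numeric column, which is what makes the approach practical at very high cardinality.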

42
MCQmedium

A data scientist runs the AWS CLI command shown in the exhibit to list objects larger than 100 KB in an S3 bucket. The data scientist wants to understand the size distribution of these files. What is the most significant limitation of this approach for EDA?

A.The command only returns objects larger than 100 KB, not equal to.
B.The command may return incomplete results if there are more than 1000 objects.
C.The command uses the wrong query syntax and will fail.
D.The command does not return the file names, only sizes.
AnswerB

S3 list-objects returns up to 1000 objects per call; pagination is required for more.

Why this answer

Option B is correct because the CLI command may not return all objects if there are more than 1000, as S3 list-objects paginates by default. Option D is wrong because the command returns size and key, which are sufficient. Option A is wrong because the command does include objects of exactly 100 KB.

Option C is wrong because the query syntax is correct for filtering.

43
MCQmedium

A data scientist is using Amazon SageMaker Data Wrangler to explore a dataset. They notice that a feature has a very high correlation (0.95) with the target variable. What should they do to avoid overfitting?

A.Use L2 regularization in the model
B.Apply PCA to reduce dimensionality
C.Standardize the feature using StandardScaler
D.Remove the feature from the dataset
AnswerD

High correlation with the target can indicate data leakage; removing the feature is safest.

Why this answer

Option D is correct because a feature with correlation 0.95 may be directly derived from the target or leaking information; removing it prevents overfitting. Option B is wrong because PCA is for dimensionality reduction but may still include the leak. Option A is wrong because regularization helps but may not fully address leakage.

Option C is wrong because scaling does not address correlation.

44
MCQhard

A data scientist is using Amazon SageMaker Studio notebooks for EDA. They want to share a reproducible report that includes code, visualizations, and narrative text with their team. Which approach should they use?

A.Save the notebook as an .ipynb file and share it via Amazon S3.
B.Use Amazon SageMaker Clarify to generate an EDA report.
C.Export the results to Amazon QuickSight and create a dashboard.
D.Use Amazon SageMaker Autopilot to generate a report.
AnswerA

Notebooks combine code, output, and narrative.

Why this answer

Option A is correct because a Jupyter notebook (in Studio) contains code and markdown and can be shared. Option C (QuickSight dashboard) is for interactive dashboards; Option D (SageMaker Autopilot) is for automated ML; Option B (SageMaker Clarify) is for bias detection.

45
MCQeasy

A data scientist loads a large dataset from Amazon S3 into a pandas DataFrame using a SageMaker notebook. The dataset contains a mix of numeric and categorical features. The data scientist wants to quickly check for missing values. Which pandas function is most appropriate?

A.df.info()
B.df.describe()
C.df.shape
D.df.isnull().sum()
AnswerD

This returns the sum of null values per column.

Why this answer

Option D is correct because df.isnull().sum() returns the count of missing values per column. Option A is wrong because df.info() provides column data types and non-null counts, but not missing value counts directly. Option B is wrong because df.describe() summarizes only numeric columns by default.

Option C is wrong because df.shape returns the dimensions, not missing values.
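
A short illustration of the call on a small, made-up DataFrame (the column names and values are assumptions for the example). Taking the mean of the same boolean mask gives the share of missing values, which is often easier to read on wide datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with numeric and categorical columns and a few gaps.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "city": ["Lima", "Oslo", None, None],
    "spend": [10.5, 20.0, 7.25, 3.0],
})

missing_counts = df.isnull().sum()         # number of missing values per column
missing_share = df.isnull().mean() * 100   # same information as a percentage
print(missing_counts)
```

Both numeric NaN values and None entries in the categorical column are counted as missing.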

46
Multi-Selecteasy

A data scientist is exploring a dataset with categorical variables. Which TWO EDA techniques are appropriate for understanding the relationship between a categorical feature and a continuous target? (Choose TWO.)

Select 2 answers
A.Correlation matrix
B.Violin plots
C.Scatter plot with categorical variable on x-axis
D.Bar chart of category counts
E.Side-by-side box plots
AnswersB, E

Violin plots show density and distribution across categories.

Why this answer

Options B and E are correct. Side-by-side box plots show the distribution of the continuous variable across categories. Violin plots combine a box plot with a density estimate.

Option C is wrong because a scatter plot is for two continuous variables. Option D is wrong because a bar chart of counts shows frequency, not the relationship with the target. Option A is wrong because a correlation matrix is for numerical features.

47
MCQhard

A data scientist is building a fraud detection model using a dataset of 500,000 credit card transactions. The dataset contains 20 features, including transaction amount, merchant category, time since last transaction, and customer age. The target variable 'is_fraud' has 0.1% positive examples. Initial EDA reveals that the transaction amount distribution is highly skewed with a long tail. Also, there are missing values in the 'customer_age' field (5% missing). The data scientist needs to prepare the data for training a binary classifier. Which combination of preprocessing steps should the data scientist apply to address these issues and improve model performance? (Select TWO.)

A.Use SMOTE to generate synthetic samples of the minority class.
B.Apply standard scaling to all numerical features.
C.Apply log transformation to the transaction amount to reduce skewness.
D.Impute missing values in customer_age with the mean of the non-missing values.
E.Drop the transaction amount feature because of its skewness.
AnswersC, D

Log transformation is effective for reducing right skewness and can make the distribution more Gaussian-like, which benefits many models.

Why this answer

Option C is correct because applying a log transformation to the highly skewed transaction amount reduces skewness and compresses the dynamic range, which helps many machine learning algorithms (especially those sensitive to feature scales like logistic regression or SVM) converge faster and perform better. This is a standard technique for handling right-skewed distributions without losing data. Option D is also correct because imputing the 5% of missing customer_age values with the mean keeps those rows available for training instead of discarding them.

Exam trap

The trap here is that candidates often confuse handling skewness with scaling—they may choose standard scaling (Option B) thinking it addresses skewness, but standard scaling only centers and scales the data, not corrects the shape of the distribution.

How to eliminate wrong answers

Option A is wrong because SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the minority class, but with only 0.1% fraud cases (500 out of 500,000), SMOTE would create an extremely large synthetic dataset that risks overfitting and does not address the skewed transaction amount or missing values. Option B is wrong because standard scaling (z-score normalization) is not appropriate for highly skewed features like transaction amount; scaling after log transformation would be valid, but applying standard scaling directly to a skewed distribution does not reduce skewness and can still leave the feature non-Gaussian, harming model performance. Option E is wrong because dropping the transaction amount feature due to skewness discards valuable predictive information; skewness can be corrected via transformation (e.g., log) rather than deletion, which would reduce model accuracy.

48
Multi-Selectmedium

Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous feature?

Select 2 answers
A.Apply a Random Forest classifier to predict outliers.
B.Use Z-score and flag values with absolute Z-score > 3.
C.Remove any value that is more than one standard deviation from the mean.
D.Use DBSCAN clustering with default parameters.
E.Use the interquartile range (IQR) and flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
AnswersB, E

Z-score >3 is a common outlier threshold.

Why this answer

The Z-score method (Option B) is a standard statistical technique for detecting outliers in a univariate continuous feature. It measures how many standard deviations a data point is from the mean, and flagging values with an absolute Z-score greater than 3 is a common threshold because, under a normal distribution, approximately 99.7% of data falls within three standard deviations, making points beyond this likely outliers.

Exam trap

Cisco often tests the misconception that removing values more than one standard deviation from the mean is a valid outlier detection technique, when in fact it removes a large portion of normal data and is not a standard practice.
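
Both rules are easy to try on a toy feature. In this sketch the helper names and the sample values are assumptions; quartiles come from `statistics.quantiles`, whose default interpolation can differ slightly from other libraries.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical feature: 30 ordinary values and one extreme value.
values = [48, 49, 50, 51, 52] * 6 + [500]

z_flagged = zscore_outliers(values)
iqr_flagged = iqr_outliers(values)
print(z_flagged, iqr_flagged)
```

The Z-score rule depends on the mean and standard deviation, which the outlier itself inflates, while the IQR rule relies on quartiles and is less affected by extreme values.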

49
Multi-Selecteasy

A data scientist is performing exploratory data analysis on a dataset with missing values. Which TWO approaches are appropriate for handling missing data in a way that retains as much data as possible?

Select 2 answers
A.Use a model that handles missing values natively, such as XGBoost
B.Replace missing values with a constant like -999
C.Impute missing values with the median
D.Drop columns with high missing percentage
E.Drop rows with any missing value
AnswersA, C

Some algorithms can handle missing values without imputation.

Why this answer

Correct options: A and C. Option A (use algorithms that handle missing values natively) avoids deletion. Option C (median imputation) retains all rows and fills missing values.

Option E is wrong because dropping rows reduces data. Option D is wrong because dropping columns loses features. Option B is wrong because replacing with a constant may bias the feature.

50
Multi-Selecthard

Which TWO of the following are appropriate methods for handling missing data in a dataset?

Select 2 answers
A.Dropping features with more than 50% missing values
B.Mean imputation for all features
C.Multiple imputation
D.Using algorithms that handle missing values internally (e.g., XGBoost)
E.Listwise deletion (removing rows with missing values)
AnswersC, D

Multiple imputation accounts for uncertainty by creating multiple datasets.

Why this answer

Multiple imputation and using algorithms that handle missing values (e.g., XGBoost) are valid. Listwise deletion reduces sample size. Mean imputation may bias distributions.

Dropping features with many missing values may lose information.

51
Multi-Selecthard

A machine learning engineer is analyzing a dataset with a large number of features (p >> n). The engineer suspects that many features are irrelevant. Which THREE methods are suitable for feature selection during exploratory data analysis? (Choose THREE.)

Select 3 answers
A.Fit a Lasso regression model and select features with non-zero coefficients
B.Remove features with variance below a threshold (e.g., <0.01)
C.Remove features with high pairwise correlation (e.g., >0.95)
D.Calculate mutual information between each feature and the target, and keep top k features
E.Apply Principal Component Analysis (PCA) and select top components
AnswersB, C, D

Low-variance features provide little information and can be removed.

Why this answer

Option C is correct because correlation-based feature selection removes highly correlated features. Option D is correct because mutual information measures relevance to the target. Option B is correct because a variance threshold removes low-variance features.

Option E is wrong because PCA is a dimensionality reduction technique, not a feature selection method. Option A is wrong because Lasso regression is a modeling technique, not typically used in EDA.

52
MCQhard

A machine learning team is building a fraud detection model. The dataset is highly imbalanced (99.9% legitimate, 0.1% fraudulent). Which EDA technique is most important to apply before modeling?

A.Normalize all numerical features to have zero mean and unit variance.
B.Remove outliers from the dataset using the IQR method.
C.Create a stratified train-test split to preserve the class distribution.
D.Perform correlation analysis to remove highly correlated features.
AnswerC

Ensures the rare class appears in both training and test sets.

Why this answer

Stratified sampling ensures the rare class is represented in train/test splits, preserving the imbalanced ratio for evaluation. Option A is wrong because normalization does not address imbalance. Option D is wrong because correlation analysis is not specific to imbalance.

Option B is wrong because removing outliers could eliminate fraud cases.
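
The idea behind a stratified split can be sketched without any library: split each class separately, then combine the pieces. The function `stratified_split` and the synthetic labels below are illustrative assumptions; in practice scikit-learn's `train_test_split` with its `stratify` argument is the usual route.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one example of every class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 0.1% positive, as in the question.
labels = [1] * 10 + [0] * 9_990
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_positives = sum(labels[i] for i in test_idx)
train_positives = sum(labels[i] for i in train_idx)
print(train_positives, test_positives)
```

With a purely random split of these labels, a test set containing no fraud cases at all is a realistic outcome, which would make recall impossible to evaluate.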

53
MCQeasy

A data scientist needs to analyze a dataset stored in Amazon S3 as CSV files. The dataset contains 100 columns, and the data scientist wants to quickly understand the distribution of each column, including missing values, data types, and basic statistics. Which AWS service is best suited for this task?

A.AWS Glue DataBrew
B.Amazon SageMaker Data Wrangler
C.Amazon QuickSight
D.Amazon Athena
AnswerA

DataBrew provides visual data profiling without writing code.

Why this answer

Option A is correct because AWS Glue DataBrew provides visual data profiling and preparation without writing code. Option D is wrong because Amazon Athena is an interactive query service, not a profiling tool. Option C is wrong because Amazon QuickSight is for visualization, not data profiling.

Option B is wrong because SageMaker Data Wrangler is for feature engineering within SageMaker, but DataBrew is simpler for initial exploration.

54
Multi-Selecthard

Which TWO statements about handling categorical variables in exploratory data analysis are correct? (Select TWO.)

Select 2 answers
A.When a categorical feature has high cardinality, consider grouping rare categories.
B.Target encoding always avoids data leakage.
C.One-hot encoding creates binary columns for each category.
D.Label encoding is suitable for nominal categorical variables.
E.Categorical variables should always be dropped if they have many unique values.
AnswersA, C

Grouping reduces dimensionality and overfitting.

Why this answer

Option A is correct because high-cardinality categorical features can lead to overfitting and sparse representations. Grouping rare categories into a single 'Other' bucket reduces dimensionality and noise, improving model generalization without losing significant predictive signal.

Exam trap

Cisco often tests the misconception that label encoding is safe for nominal data, when in fact it imposes an ordinal relationship that can distort model performance.

55
MCQmedium

A data engineer is building a data pipeline that aggregates customer transaction data. The engineer notices that some transactions have duplicate entries due to a system error. Which approach should the engineer use to identify and remove duplicates based on a unique transaction ID?

A.Sort the data by transaction ID and then check consecutive rows for equality
B.Use fuzzy matching to find similar transaction IDs
C.Group by all columns and aggregate with sum
D.Use the drop_duplicates method on the transaction ID column
AnswerD

drop_duplicates removes exact duplicate rows based on specified columns.

Why this answer

Option D is correct because dropping duplicates based on the transaction ID is straightforward and efficient. Option C is wrong because grouping by all columns and summing may lose information. Option B is wrong because fuzzy matching is for approximate matches, not exact duplicates.

Option A is wrong because sorting and then comparing consecutive rows is more complex than needed.
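
A brief sketch of the pandas approach on made-up transactions (the column names and values are assumptions). `duplicated` is shown alongside `drop_duplicates` because it helps to inspect the flagged rows before removing them.

```python
import pandas as pd

# Hypothetical transactions; T2 was recorded twice by the system error.
transactions = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T2", "T3"],
    "customer": ["ann", "bob", "bob", "cy"],
    "amount": [20.0, 35.5, 35.5, 12.0],
})

# Flag every row whose transaction_id has already appeared.
is_duplicate = transactions.duplicated(subset=["transaction_id"], keep="first")

# Keep the first occurrence of each transaction_id.
deduplicated = transactions.drop_duplicates(subset=["transaction_id"], keep="first")
print(deduplicated)
```

Passing `subset` restricts the comparison to the ID column; without it, rows count as duplicates only when every column matches.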

56
Multi-Selectmedium

A data scientist is performing exploratory data analysis on a dataset with 100 features. They want to identify which features are most correlated with the target variable. Which THREE methods are appropriate for this task?

Select 3 answers
A.Pearson correlation coefficient
B.Variance threshold
C.One-hot encoding
D.Feature importance from a random forest
E.Mutual information
AnswersA, D, E

Measures linear correlation between each feature and the target.

Why this answer

Pearson correlation captures linear relationships. Mutual information captures non-linear dependencies. Feature importance from a tree-based model provides a ranking.

Variance threshold (Option B) looks only at each feature's own variance and ignores the target, and one-hot encoding (Option C) is an encoding step rather than a measure of association. Spearman correlation, which captures monotonic relationships, would also be valid but is not among the options, so the correct choices are A, D, and E.

57
MCQeasy

A data scientist is exploring a dataset with many features and wants to detect multicollinearity. Which technique should the scientist use?

A.Calculate the Variance Inflation Factor (VIF) for each feature.
B.Compute the Pearson correlation matrix between features.
C.Perform ANOVA on each feature against the target.
D.Create pairwise scatter plots of all features.
AnswerA

VIF measures how much the variance of a regression coefficient is inflated due to collinearity.

Why this answer

Variance Inflation Factor (VIF) is a standard metric for detecting multicollinearity. Option D (pairwise scatter plots) can hint at it but not quantify it. Option B (Pearson correlation matrix) shows pairwise linear correlation but not multicollinearity among multiple variables.

Option C (ANOVA) is for comparing means.
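
VIF can be computed directly from its definition, 1 / (1 - R²), where R² comes from regressing one feature on all the others. The sketch below uses NumPy least squares on synthetic data; the function name and the generated features are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)  # nearly a linear combination

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
independent_vifs = variance_inflation_factors(np.column_stack([x1, x2]))
print(vifs, independent_vifs)
```

Here `x3` is only moderately correlated with `x1` or `x2` individually, yet it is almost fully determined by the two together. That is the situation a pairwise correlation matrix understates and VIF exposes.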

58
Multi-Selectmedium

A data scientist is performing EDA on a dataset with mixed data types (numerical, categorical, text). The dataset is stored in S3. Which TWO AWS services can be used to directly perform statistical summaries and visualizations without writing custom code?

Select 2 answers
A.Amazon SageMaker Studio
B.AWS Glue DataBrew
C.Amazon Athena
D.Amazon SageMaker Data Wrangler
E.Amazon QuickSight
AnswersD, E

Data Wrangler offers visual data analysis and built-in visualizations.

Why this answer

Options D and E are correct. SageMaker Data Wrangler provides a visual interface for data preparation and analysis with built-in transforms and visualizations. QuickSight is a BI service that can connect to S3 data and create dashboards with statistical summaries.

Option C is wrong because Athena is primarily a SQL query engine, not a visualization tool. Option B is wrong because Glue DataBrew is for data preparation but requires some configuration. Option A is wrong because SageMaker Studio is an IDE, not a direct analysis service.

59
Multi-Selecthard

A data engineer is analyzing a large dataset stored in Amazon S3 using AWS Glue and Amazon Athena. They notice that queries against a table with many small files are slow. Which TWO actions can improve query performance?

Select 2 answers
A.Use Athena's automatic compression
B.Increase the number of Glue DPUs
C.Convert files to Apache Parquet format
D.Decrease the number of partitions
E.Use a larger number of partitions
AnswersC, E

Columnar storage reduces I/O and improves compression.

Why this answer

Converting the files to a columnar format such as Apache Parquet (Option C) reduces I/O and improves compression. Partitioning the data (Option E) limits the amount of data scanned. Compacting small files into larger ones would also reduce overhead, but it is not listed as an option here.

The correct pair is C and E.

60
MCQmedium

A company has a dataset with a timestamp column and multiple numerical metrics. They want to identify seasonality and trends. Which AWS service is best suited for this analysis?

A.Amazon SageMaker Canvas
B.Amazon CloudWatch
C.Amazon QuickSight
D.Amazon Athena
AnswerC

QuickSight offers time series analysis and forecasting capabilities.

Why this answer

Amazon QuickSight provides built-in time series visualization and forecasting. SageMaker Canvas is for ML models without code. Athena is for querying.

CloudWatch is for monitoring AWS resources.

61
MCQmedium

A data scientist is working with a dataset containing customer transaction records stored in Amazon S3 as CSV files. The dataset has 500 columns and 2 million rows. The scientist wants to perform EDA to understand data types, missing values, and summary statistics for each column. They need to do this quickly and without writing custom code. The scientist has access to AWS Glue DataBrew and Amazon SageMaker Data Wrangler. Which approach should the scientist take?

A.Use Amazon SageMaker Data Wrangler to import the data and generate a report
B.Use Amazon Athena to run SELECT statements on each column
C.Use AWS Glue DataBrew to create a profile job that outputs data quality reports
D.Use AWS Glue ETL jobs with PySpark to compute statistics
AnswerC

DataBrew's profile job automatically computes statistics and detects missing values.

Why this answer

AWS Glue DataBrew provides a visual interface for data profiling and can handle large datasets without writing code. It automatically detects data types, missing values, and summary statistics. Option A is wrong because SageMaker Data Wrangler requires more manual setup and coding.

Option B is wrong because Athena requires SQL queries and is not a profiling tool. Option D is wrong because Glue ETL jobs require writing code.

62
MCQmedium

A DevOps engineer runs the CloudWatch Logs Insights query shown above on the log group for an ML training job. The result shows a spike in ERROR messages at a specific hour. What should the engineer do next to identify the root cause?

A.Modify the query to display the actual @message for the hour with the spike.
B.Remove the filter on ERROR to see all messages.
C.Change the bin to 5m to see more detailed spikes.
D.Increase the limit to 50 to see more hours.
AnswerA

Directly see error details.

Why this answer

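Option A is correct because examining the actual error messages during that hour can reveal the cause. Option C is wrong because the query already uses 1h bins and bin(5m) may be too granular; finer bins narrow the time window without showing what failed. Option B is wrong because removing the ERROR filter adds unrelated messages instead of isolating the errors.

Option D is wrong because the spike is already identified.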

63
Multi-Selecthard

A data engineer is performing exploratory data analysis on a dataset with 1 million rows and 50 features. The engineer wants to identify missing values and outliers. Which THREE approaches should the engineer use? (Choose three.)

Select 3 answers
A.Create a correlation heatmap of all features
B.Use a DataFrame.info() method to see non-null counts
C.Plot box plots for all features simultaneously
D.Use a missingno matrix to visualize missing data patterns
E.Use a DataFrame.describe() to view summary statistics
AnswersB, D, E

info() shows non-null counts and data types.

Why this answer

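Options B, D, and E are correct. B (DataFrame.info()) gives the non-null count per column, D (the missingno matrix) provides a visual summary of missing-data patterns, and E (DataFrame.describe()) gives summary statistics whose min, max, and quartiles hint at outliers. Option C is wrong because box plots are for continuous variables, not for all features.

Option A is wrong because a correlation heatmap does not show missing values or outliers.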
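
As a rough illustration of what info() and describe() report, here is a minimal sketch on a made-up five-row DataFrame. The column names and values are hypothetical and only serve to show the calls; the missingno matrix is left out to keep the example dependency-light.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
df = pd.DataFrame({
    "age": [34, 29, None, 41, 38],
    "income": [52000.0, 48000.0, 51000.0, None, 950000.0],
    "segment": ["a", "b", "b", None, "a"],
})

# info() prints the dtype and non-null count of every column.
df.info()

# The same non-null counts as a Series, plus the share of missing values.
non_null = df.notna().sum()
missing_share = df.isna().mean()

# describe() summarizes the numeric columns: count, mean, std, min,
# quartiles, and max. An extreme max relative to the quartiles hints at outliers.
summary = df.describe()
print(summary)
```

In this toy data the income maximum sits far above the 75th percentile, which is the kind of signal that would prompt a closer look with a box plot.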

64
MCQmedium

A data scientist uses SageMaker Studio to run EDA on a dataset with 500 features. The goal is to reduce dimensionality before modeling. Which EDA technique should the data scientist use to understand the variance explained by each feature?

A.Histogram of the target variable
B.Scree plot of principal components
C.Heatmap of feature correlations
D.Box plot of each feature
AnswerB

Scree plot displays variance explained by each component.

Why this answer

A scree plot from PCA shows the eigenvalues or variance explained by each principal component, helping decide how many components to retain. Option C is wrong because a heatmap of correlations shows pairwise relationships, not variance. Option A is wrong because a histogram shows distribution.

Option D is wrong because a box plot shows summary statistics.
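
The numbers behind a scree plot can be sketched as follows, assuming NumPy is available. The data is randomly generated for illustration, and the eigenvalues of the covariance matrix stand in for what a PCA library would report as explained variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 features, two of them strongly related.
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

# Center the data, then take the eigenvalues of the covariance matrix.
# eigvalsh returns them in ascending order, so reverse to get descending.
Xc = X - X.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Share of total variance carried by each principal component.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

for i, (r, c) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")
```

Plotting explained_ratio against the component index gives the scree plot; the point where the curve flattens is a common heuristic for how many components to keep.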

65
Multi-Selecthard

Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?

Select 3 answers
A.Check the proportion of missing values for each feature.
B.Compute pairwise correlation coefficients between numerical features.
C.Encode all categorical features using label encoding for simplicity.
D.Include all categorical features with high cardinality as-is in the model.
E.Visualize the distribution of numerical features using histograms and box plots.
AnswersA, B, E

Missing value analysis is a key EDA step.

Why this answer

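Option A is correct because checking the proportion of missing values for each feature is a fundamental step in exploratory data analysis (EDA). It helps identify data quality issues, such as systematic missingness, which can bias downstream modeling and inform decisions about imputation strategies or feature exclusion. Options B and E are likewise standard EDA steps: pairwise correlations reveal redundant numerical features, and histograms and box plots reveal skew and outliers.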

Exam trap

The trap here is that candidates may assume label encoding is harmless for categorical features, but it imposes an artificial order that can distort model behavior, especially in linear and distance-based models that treat the integer codes as magnitudes.

66
Multi-Selectmedium

A data scientist is analyzing a dataset with 100 features and 10,000 observations. The target variable is binary (0/1). Initial exploratory data analysis reveals that many features have missing values, high correlation with each other, and non-normal distributions. The data scientist wants to identify the most important features for predicting the target while reducing dimensionality. Which TWO actions should the data scientist take? (Choose two.)

Select 2 answers
A.Use chi-squared test to rank features by p-value.
B.Apply Principal Component Analysis (PCA) to reduce dimensionality.
C.Perform a t-test for each feature to compare means between classes.
D.Calculate Pearson correlation coefficients between features and target.
E.Compute mutual information between each feature and the target.
AnswersB, E

PCA reduces dimensionality by creating uncorrelated components, handling multicollinearity.

Why this answer

B is correct because Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated features into a set of linearly uncorrelated principal components, effectively handling high correlation and reducing the feature space. It does not require normality assumptions and can work with missing values after imputation, making it suitable for this dataset. E is correct because mutual information measures the dependence between each feature and the target, including non-linear dependence, without assuming normality.

Exam trap

The exam often tests the misconception that simple univariate tests (like Pearson correlation or chi-squared) are sufficient for feature selection in high-dimensional, non-normal data, when in fact their assumptions about linearity and distribution do not hold here.
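
To make mutual information concrete, here is a minimal sketch for discrete features using only the standard library. The helper name mutual_information and the toy sequences are hypothetical. Continuous features would first need binning or a dedicated estimator such as scikit-learn's mutual_info_classif.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Hypothetical binary target and two candidate features.
target = [0, 0, 0, 0, 1, 1, 1, 1]
informative = ["lo", "lo", "lo", "lo", "hi", "hi", "hi", "hi"]
uninformative = ["a", "b", "a", "b", "a", "b", "a", "b"]

mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
print(mi_informative, mi_uninformative)
```

The first feature determines the target completely and scores one full bit, while the second is independent of the target and scores zero, so ranking by this score puts the informative feature first.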

67
MCQeasy

During EDA, a data scientist notices that a feature has a high proportion of missing values (e.g., 70%). The feature is continuous and expected to be important based on domain knowledge. What is the best approach to handle this?

A.Remove the feature entirely to avoid bias.
B.Create a binary indicator for missingness and impute the continuous values with the median.
C.Impute missing values with -1 since it is out of range.
D.Drop all rows with missing values in that feature.
AnswerB

This captures both the pattern of missingness and the distribution.

Why this answer

Option B is correct because it preserves the predictive signal from the feature while accounting for the pattern of missingness. Creating a binary indicator allows the model to learn whether missingness itself is informative, and median imputation is robust to outliers for a continuous feature. This approach avoids the bias of dropping the feature entirely and is more principled than arbitrary out-of-range imputation.

Exam trap

The trap here is that candidates often choose to drop the feature or rows without considering that missingness can be a meaningful signal, and that a binary indicator combined with robust imputation is a standard technique for high-missingness continuous features.

How to eliminate wrong answers

Option A is wrong because removing a feature with 70% missing values discards potentially important domain-driven signal, and the missingness itself may be informative. Option C is wrong because imputing with -1 (an arbitrary out-of-range value) can distort the feature's distribution and introduce a false signal that the model may misinterpret as a valid numeric relationship. Option D is wrong because dropping all rows with missing values in that feature would discard 70% of the dataset, leading to severe sample size reduction and potential selection bias.
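
The indicator-plus-median approach can be sketched in plain Python as below. The values are made up, and in practice the median would be computed on the training split only and reused for validation and test data.

```python
from statistics import median

# Hypothetical continuous feature; None marks a missing reading.
values = [12.0, None, 15.5, None, None, 14.0, 98.0, None]

# Binary indicator: 1 where the original value was missing, 0 otherwise.
was_missing = [1 if v is None else 0 for v in values]

# Median of the observed values only; robust to the 98.0 outlier.
fill = median(v for v in values if v is not None)

# Imputed feature: observed values are kept, gaps receive the median.
imputed = [fill if v is None else v for v in values]

print(was_missing)
print(imputed)
```

Both columns are then passed to the model, so it can learn from the observed values and from the fact that a value was missing.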

68
MCQeasy

A data analyst is exploring a dataset and wants to identify outliers in a numerical feature. Which visualization technique is most effective for detecting outliers?

A.Line chart
B.Histogram
C.Scatter plot
D.Box plot
AnswerD

Box plots display outliers as individual points outside the whiskers.

Why this answer

Option D is correct because a box plot explicitly shows quartiles and potential outliers as points beyond the whiskers. Option B is wrong because a histogram shows distribution but outliers may be in low-frequency bins. Option C is wrong because a scatter plot shows relationship between two variables, not univariate outliers.

Option A is wrong because a line chart is for time series.
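
The rule that a box plot uses to draw points beyond the whiskers is commonly the 1.5 × IQR rule, which can be sketched without any plotting library. The data below is hypothetical, and plotting libraries may compute quartiles slightly differently from statistics.quantiles.

```python
from statistics import quantiles

# Hypothetical numerical feature with one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 95]

# First and third quartiles, and the interquartile range.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Tukey's fences: values more than 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)
```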

69
Multi-Selectmedium

A machine learning engineer is analyzing a dataset with 500 features and suspects multicollinearity. Which TWO techniques can help identify and address multicollinearity during exploratory data analysis? (Choose TWO.)

Select 2 answers
A.Apply t-SNE for visualization
B.Apply Principal Component Analysis (PCA)
C.Calculate Variance Inflation Factor (VIF) for each feature
D.Generate a correlation matrix heatmap
E.Use Lasso regression to select features
AnswersC, D

A VIF above roughly 5 to 10 is a common rule-of-thumb sign of multicollinearity.

Why this answer

Options C and D are correct. The Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is inflated due to multicollinearity. A correlation matrix heatmap shows pairwise correlations, making highly correlated pairs easy to spot. Option B is wrong because PCA reduces dimensionality but does not directly identify multicollinearity.

Option E is wrong because Lasso regression addresses multicollinearity via regularization but is a modeling step. Option A is wrong because t-SNE is for visualization of high-dimensional data.
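
For intuition, VIF is defined as 1 / (1 - R²), where R² comes from regressing one feature on all the others. With exactly two predictors, that R² equals the squared Pearson correlation, which allows a small standard-library sketch. The helper names and data are hypothetical; for the general case a library routine such as statsmodels' variance_inflation_factor is the usual choice.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def vif_two_predictors(xs, ys):
    """VIF in the two-predictor special case: 1 / (1 - r^2)."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)


height_cm = [150, 160, 165, 170, 180, 185]
height_in = [59.1, 63.0, 65.0, 66.9, 70.9, 72.8]  # nearly the same information
shoe_size = [42, 38, 44, 39, 43, 40]              # unrelated in this toy sample

vif_redundant = vif_two_predictors(height_cm, height_in)
vif_independent = vif_two_predictors(height_cm, shoe_size)
print(vif_redundant, vif_independent)
```

The redundant pair yields a VIF far above the usual threshold, while the unrelated pair stays close to 1.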

70
MCQmedium

A company runs a real-time fraud detection system using Amazon SageMaker. The model is deployed as a SageMaker endpoint and receives predictions within milliseconds. Recently, the model's accuracy has degraded due to data drift. The data scientists want to monitor the model's performance continuously. What is the most effective way to detect data drift?

A.Store all incoming requests in Amazon S3 and use Amazon Athena to run periodic SQL queries for drift detection
B.Set up Amazon CloudWatch anomaly detection on the endpoint's invocation count and latency metrics
C.Enable Amazon SageMaker Model Monitor to capture inference data and compare it against a baseline dataset
D.Use Amazon CloudWatch Logs Insights to analyze inference logs and set custom alarms
AnswerC

Model Monitor compares captured inference data against a baseline to detect drift.

Why this answer

Option C is correct because SageMaker Model Monitor can automatically detect data drift by comparing incoming data against a baseline. Option D is wrong because CloudWatch Logs Insights can query logs but not automatically detect drift. Option A is wrong because storing requests in S3 and using Athena is batch-oriented and not automated.

Option B is wrong because CloudWatch anomaly detection on invocation count and latency is generic and not specialized for ML model drift.
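
The idea of comparing incoming data against a baseline can be illustrated with a generic drift statistic. The sketch below computes a Population Stability Index (PSI) for one feature in plain Python. It is not the SageMaker Model Monitor API and not necessarily the statistic Model Monitor uses; the function names, bin edges, and data are all assumptions made for illustration.

```python
from math import log


def bin_shares(values, edges, eps=1e-6):
    """Share of values in each [edges[i], edges[i+1]) bin, floored at eps."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    # The eps floor avoids log(0) and division by zero for empty bins.
    return [max(c / len(values), eps) for c in counts]


def psi(baseline, current, edges):
    """Population Stability Index between two samples of one feature."""
    b = bin_shares(baseline, edges)
    c = bin_shares(current, edges)
    return sum((ci - bi) * log(ci / bi) for bi, ci in zip(b, c))


edges = [float("-inf"), 25, 50, 75, 100, float("inf")]
baseline = [10, 20, 30, 40, 45, 55, 60, 70, 80, 90]
shifted = [v + 40 for v in baseline]  # simulated drift in the live data

psi_same = psi(baseline, list(baseline), edges)
psi_shifted = psi(baseline, shifted, edges)
print(psi_same, psi_shifted)
```

An unchanged distribution scores zero and the shifted one scores well above it; a monitoring job would compute such a statistic on a schedule and raise an alert when it crosses a chosen threshold.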

71
Multi-Selecteasy

Which TWO actions are appropriate when dealing with outliers in a dataset during exploratory data analysis? (Select TWO.)

Select 2 answers
A.Replace the mean with the median for numerical features.
B.Apply log transformation to reduce the impact of extreme values.
C.Remove all outliers without further investigation.
D.Use visualization techniques like box plots to identify outliers.
E.Assume outliers are errors and delete them.
AnswersB, D

Log transformation can compress skewed distributions and reduce outlier influence.

Why this answer

Option B is correct because applying a log transformation compresses the range of the data, reducing the influence of extreme values without removing them. This is a common technique in exploratory data analysis for right-skewed distributions, as it can make the data more normally distributed and improve the performance of models that assume normality. Option D is correct because box plots make outliers visible so they can be investigated before any decision to transform or remove them.

Exam trap

The exam often tests the distinction between data transformation techniques (like log transformation) and data removal or replacement strategies, trapping candidates who think that simply changing a summary statistic (mean to median) or deleting outliers without investigation is a proper handling method.
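
The compressing effect of a log transformation can be seen in a few lines of standard-library Python. The amounts below are hypothetical, and log1p is used as an assumption so that zero values would remain valid.

```python
from math import log1p

# Hypothetical right-skewed feature, e.g. transaction amounts.
amounts = [5, 8, 12, 20, 35, 60, 10_000]

# log1p(x) = ln(1 + x); unlike a plain log, it is defined at x = 0.
logged = [log1p(a) for a in amounts]

# Compare how far the largest value sits from the smallest, before and after.
raw_ratio = max(amounts) / min(amounts)
log_ratio = max(logged) / min(logged)
print(f"max/min before: {raw_ratio:.0f}, after: {log_ratio:.1f}")
```

The extreme value is still the largest after the transformation, so no information about ordering is lost; only its leverage over the rest of the data shrinks.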

72
MCQmedium

A data engineer ingests streaming data into Amazon Kinesis Data Streams. The data science team needs to analyze the data using Amazon SageMaker notebooks. What is the most efficient way to provide access to the stream data for ad-hoc exploration?

A.Create an AWS Lambda function to transform and write data to DynamoDB, then query DynamoDB from the notebook.
B.Configure a Kinesis Firehose delivery stream to deliver data to an S3 bucket, then query the data from the notebook using Athena.
C.Install the Kinesis Agent on the SageMaker notebook instance and configure it to write data to a local file.
D.Use the Kinesis connector for Spark to read data directly from the stream into a Spark DataFrame in the notebook.
AnswerD

Direct, real-time access for ad-hoc exploration.

Why this answer

Using the Kinesis connector for Spark in a SageMaker notebook allows reading from the stream directly. Option B is wrong because delivering the data to S3 through Firehose adds latency and additional steps. Option C is wrong because Kinesis Agent is for data producers, not consumers.

Option A is wrong because a Lambda transformation and a DynamoDB table are not needed for exploration.

73
MCQmedium

A data scientist is analyzing a dataset with 50 features and 10,000 samples. After generating a correlation matrix, they notice several pairs of features have correlation coefficients above 0.95. What should the data scientist do to prepare the data for linear regression?

A.Apply PCA to reduce dimensionality to 10 components.
B.Remove one feature from each highly correlated pair.
C.Drop all features with correlation above 0.95.
D.Standardize all features using StandardScaler.
AnswerB

Reduces multicollinearity while retaining most information.

Why this answer

Option B is correct because high correlation between features indicates multicollinearity, which can destabilize linear regression coefficients. Removing one feature from each highly correlated pair reduces redundancy. Option C is wrong because dropping all correlated features may discard useful information.

Option D is wrong because standardizing does not address multicollinearity. Option A is wrong because PCA creates new features that lose interpretability.
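
Removing one feature from each highly correlated pair can be sketched as a simple greedy pass over the pairs. The feature names, values, and the choice to keep the first feature of each pair are assumptions; in practice domain knowledge or missingness usually decides which one stays.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical features; the two price columns carry almost the same signal.
features = {
    "price_usd": [10, 20, 30, 40, 50],
    "price_eur": [9.2, 18.5, 27.4, 37.1, 45.9],
    "quantity": [3, 1, 4, 1, 5],
}

threshold = 0.95
dropped = set()
for a, b in combinations(features, 2):
    if a in dropped or b in dropped:
        continue  # one side of this pair is already gone
    if abs(pearson(features[a], features[b])) > threshold:
        dropped.add(b)  # keep the first feature of the pair, drop the second

kept = [name for name in features if name not in dropped]
print(kept)
```

Only one column of the redundant pair is removed, so the information it carries is still available to the regression.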

74
MCQeasy

The exhibit shows a data quality report for a column named 'age'. Which potential data issue should be investigated further?

A.The mean and median are significantly different
B.The minimum age of 0 and maximum age of 120 may be outliers
C.The missing value rate of 2.3% is too high
D.The number of unique values (85) is too high
AnswerB

Age 0 and 120 are likely data errors.

Why this answer

Option B is correct because ages of 0 and 120 are likely data entry errors and should be investigated. Option C is wrong because 2.3% missing is relatively low and may be acceptable. Option A is wrong because the mean and median are close.

Option D is wrong because 85 unique values for age is reasonable.

75
MCQhard

During exploratory data analysis, a data scientist notices that the correlation matrix of features shows many pairs with absolute correlation > 0.95. The dataset includes both numerical and categorical variables. Which technique is most appropriate to reduce multicollinearity while preserving the most information?

A.Apply Principal Component Analysis (PCA) to the features.
B.Use only one-hot encoded categorical features.
C.Apply L1 regularization during model training.
D.Remove one feature from each highly correlated pair.
AnswerA

PCA reduces dimensionality and decorrelates features.

Why this answer

Option A is correct because PCA is a dimensionality reduction technique that handles multicollinearity by creating orthogonal components. Option D (remove one feature from each pair) is ad hoc; Option C (L1 regularization) is for modeling, not EDA; Option B (use only one-hot encoded features) loses information.
