CCNA Pde Analysis Ml Questions

15 of 90 questions · Page 2/2 · Pde Analysis Ml topic · Answers revealed

76
MCQ · hard

A data engineer needs to split time-series data for training a forecasting model. The data is sorted by timestamp. The engineer wants to avoid leakage where future data influences training. Which data splitting approach should they use?

A. Use k-fold cross-validation with random assignment
B. Use stratified splitting on the target variable
C. Perform a random 80/20 split on the entire dataset
D. Use a time-series aware split: first 80% of data by timestamp for training, last 20% for testing
Answer: D

This preserves temporal order and avoids leakage.

Why this answer

For time-series, the only safe split is to use an earlier contiguous block for training and a later block for testing, preserving temporal order. Random splits would cause leakage. K-fold cross-validation on time-series requires special techniques like forward chaining, not standard k-fold.

Stratified split is for classification.
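The chronological split described above can be sketched in a few lines. This is a minimal illustration in plain Python, not a BigQuery or Vertex AI API; the `chronological_split` helper, the `ts` and `y` field names, and the sample rows are all assumptions made for the example.

```python
from datetime import date, timedelta


def chronological_split(rows, train_fraction=0.8):
    """Split rows into an earlier training block and a later test block."""
    # Sort defensively so the cut always falls on a point in time.
    ordered = sorted(rows, key=lambda r: r["ts"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten hypothetical daily observations.
rows = [{"ts": date(2024, 1, 1) + timedelta(days=i), "y": i * 1.5} for i in range(10)]
train, test = chronological_split(rows)

# Every training timestamp precedes every test timestamp, so no future leaks in.
assert max(r["ts"] for r in train) < min(r["ts"] for r in test)
print(len(train), len(test))  # 8 2
```

A random split over the same rows would usually fail the assertion, which is exactly the leakage the question warns about.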

77
MCQ · medium

A company uses Looker to define business logic in LookML. They need to create a new measure that calculates the average order value, defined as total revenue divided by number of orders. Which LookML syntax should they use?

A. measure: avg_order_value { type: sum; sql: ${revenue} / ${order_count} ;; }
B. measure: avg_order_value { type: average; sql: ${revenue} / ${order_count} ;; }
C. dimension: avg_order_value { type: number; sql: ${revenue} / ${order_count} ;; }
D. dimension: avg_order_value { type: average; sql: ${revenue} / ${order_count} ;; }
Answer: B

Correct syntax for a measure that computes an average of a ratio.

Why this answer

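Measures in LookML are declared with a type and a sql expression. Only option B declares a measure of type average: measure: avg_order_value { type: average; sql: ${revenue} / ${order_count} ;; }. Option A sums the ratio instead of averaging it, and options C and D declare dimensions, which are not aggregated (type: average is also not a valid dimension type).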

78
MCQ · hard

You have a BigQuery table 'events' with a TIMESTAMP column 'event_time'. You need to compute, for each event, the difference in seconds from the previous event of the same user. Which window function should you use?

A. FIRST_VALUE(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
B. LEAD(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
C. LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
D. ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time)
Answer: C

LAG accesses the previous event, then you can use TIMESTAMP_DIFF to compute difference.

Why this answer

LAG() allows accessing the previous row in a partition. Combined with TIMESTAMP_DIFF, you can compute the difference. LEAD() accesses next row.

ROW_NUMBER() and FIRST_VALUE() are not suitable.
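To make the LAG plus TIMESTAMP_DIFF logic concrete, here is a plain-Python analogue of what the window function computes. It is only a sketch of the semantics, not BigQuery code; the `seconds_since_previous` helper and the sample events are hypothetical.

```python
from datetime import datetime


def seconds_since_previous(events):
    """For each (user_id, event_time), return seconds since that user's previous event.

    Mirrors TIMESTAMP_DIFF(event_time, LAG(event_time) OVER (...), SECOND):
    the first event of each user has no previous row, so its gap is None.
    """
    last_seen = {}
    result = []
    # Sorting by (user, time) plays the role of PARTITION BY user_id ORDER BY event_time.
    for user_id, event_time in sorted(events, key=lambda e: (e[0], e[1])):
        previous = last_seen.get(user_id)
        gap = None if previous is None else int((event_time - previous).total_seconds())
        result.append((user_id, event_time, gap))
        last_seen[user_id] = event_time
    return result


events = [
    ("u1", datetime(2024, 1, 1, 12, 0, 0)),
    ("u2", datetime(2024, 1, 1, 12, 0, 30)),
    ("u1", datetime(2024, 1, 1, 12, 1, 30)),
    ("u1", datetime(2024, 1, 1, 12, 1, 45)),
]
gaps = [gap for _, _, gap in seconds_since_previous(events)]
print(gaps)  # [None, 90, 15, None]
```

Swapping "previous" for "next" in the loop would give the LEAD behaviour, which answers a different question (time until the following event).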

79
MCQ · hard

A company wants to use BigQuery's PIVOT operator to transform their sales data. They have a table with columns: 'year', 'quarter', 'revenue'. They want to create a report where each row is a year and each column is a quarter (Q1, Q2, Q3, Q4) showing revenue. Which SQL statement is correct?

A. SELECT * FROM sales PIVOT(SUM(revenue) FOR quarter IN ('Q1','Q2','Q3','Q4'))
B. SELECT * FROM sales PIVOT(revenue FOR quarter IN (Q1, Q2, Q3, Q4))
C. PIVOT sales ON quarter USING SUM(revenue)
D. SELECT * FROM (SELECT year, quarter, revenue FROM sales) PIVOT(SUM(revenue) FOR quarter IN (Q1, Q2, Q3, Q4))
Answer: A

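Correct syntax: an aggregate function plus a quoted list of pivot values.

Why this answer

PIVOT in BigQuery requires an aggregate function, the pivot column, and a list of constant pivot values. Option A supplies all three: SUM(revenue), quarter, and the string literals 'Q1' through 'Q4'. Option B omits the aggregate function, option C is not BigQuery syntax, and option D lists the quarters as unquoted identifiers rather than constants. Every column not named in the PIVOT clause (here, year) becomes a grouping column, so wrapping the source in a subquery such as (SELECT year, quarter, revenue FROM sales) is only needed when the table has extra columns that should not be grouped on.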
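The reshaping that PIVOT performs can be sketched in plain Python: group by the remaining column (year), aggregate revenue per listed quarter, and leave a gap where a quarter has no rows. The `pivot_revenue` helper and the sample rows below are hypothetical and only illustrate the semantics.

```python
from collections import defaultdict


def pivot_revenue(rows, quarters=("Q1", "Q2", "Q3", "Q4")):
    """Return {year: {quarter: summed revenue}}, with None where a quarter has no rows."""
    totals = defaultdict(dict)
    for year, quarter, revenue in rows:
        by_quarter = totals[year]  # every year gets an output row (implicit grouping)
        if quarter in quarters:    # values outside the IN list are not aggregated
            by_quarter[quarter] = by_quarter.get(quarter, 0) + revenue
    return {
        year: {q: by_quarter.get(q) for q in quarters}
        for year, by_quarter in sorted(totals.items())
    }


sales = [
    (2023, "Q1", 100),
    (2023, "Q1", 50),
    (2023, "Q2", 70),
    (2024, "Q1", 120),
    (2024, "Q4", 90),
]
report = pivot_revenue(sales)
print(report[2023])  # {'Q1': 150, 'Q2': 70, 'Q3': None, 'Q4': None}
print(report[2024])  # {'Q1': 120, 'Q2': None, 'Q3': None, 'Q4': 90}
```

The None entries correspond to the NULLs that SUM produces when no input rows match a pivot value.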

80
MCQ · hard

You are building a machine learning pipeline for credit risk assessment. The dataset has a severe class imbalance (1% default rate). You want to use AutoML Tables on Vertex AI. Which strategy should you incorporate to handle imbalance?

A. Downsample the majority class to a 50-50 ratio
B. Apply SMOTE in a Dataflow pipeline before training
C. Upsample the minority class using BigQuery SQL
D. Use the `class_weight` parameter in the AutoML Tables model
Answer: D

AutoML Tables supports adjusting class weights to handle imbalance.

Why this answer

AutoML Tables automatically applies class imbalance handling (e.g., class weighting) by default. You can adjust the weight strategy. SMOTE is not directly supported in AutoML Tables; you would need custom training.

Downsampling and upsampling are manual steps not needed.
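For intuition about what class weighting does, the sketch below computes the widely used "balanced" heuristic, weight = n_samples / (n_classes * class_count). This is an assumption chosen for illustration; it does not describe how AutoML Tables computes weights internally, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical dataset with a 1% default rate: 10 defaults, 990 non-defaults.
labels = [1] * 10 + [0] * 990
weights = balanced_class_weights(labels)
print(weights[1])  # 50.0

# After weighting, both classes contribute the same total weight to the loss.
assert abs(10 * weights[1] - 990 * weights[0]) < 1e-9
```

Each default then counts roughly 99 times as much as a non-default, without discarding or duplicating any rows.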

81
MCQ · easy

To enable data lineage tracking in BigQuery, which feature should be activated?

A. BigQuery Audit Logs
B. Data Catalog
C. Dataplex Lineage
D. BigQuery Lineage API
Answer: D

The Lineage API provides data lineage for BigQuery assets.

Why this answer

BigQuery Lineage API allows tracking data lineage for tables and views. Dataplex Lineage also provides lineage but requires Dataplex. BigQuery Audit Logs capture metadata changes but not lineage specifically.

Data Catalog is for metadata management but not lineage tracking.

82
MCQ · medium

You train a BigQuery ML linear regression model to predict house prices. The model has high bias during evaluation. Which action BEST reduces bias?

A. Decrease the learning rate in the training options
B. Add more features like number of bedrooms and square footage
C. Remove features that have low correlation with the label
D. Increase L2 regularization
Answer: B

Adding relevant features helps capture patterns, reducing bias.

Why this answer

High bias indicates underfitting. Adding informative features (including engineered ones such as polynomial terms) increases model capacity and reduces bias. Increasing L2 regularization (option D) constrains the model further and increases bias.

Removing features (option C) also tends to increase bias. Decreasing the learning rate (option A) does not address underfitting; it may only slow convergence.
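A toy example shows why an informative feature reduces bias. The sketch below fits a one-feature least-squares line twice on synthetic data whose true relationship is quadratic: once on the raw feature, once on an engineered squared feature. The data and the `fit_line` and `mse` helpers are hypothetical, and this is plain Python rather than BigQuery ML.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return mean_y - slope * mean_x, slope


def mse(xs, ys, intercept, slope):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]  # the true relationship is quadratic

# Model 1: raw feature only. A straight line cannot follow the curve (high bias).
a1, b1 = fit_line(xs, ys)
underfit_error = mse(xs, ys, a1, b1)

# Model 2: the engineered feature x^2 lets a linear model capture the pattern.
squared = [x * x for x in xs]
a2, b2 = fit_line(squared, ys)
better_error = mse(squared, ys, a2, b2)

print(underfit_error, better_error)  # 12.0 0.0
```

Lowering the learning rate or adding regularization would leave the first model just as unable to represent the curve.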

83
Multi-Select · hard

You are designing a data pipeline for ML training with Vertex AI. You need to split time-series data into train/validation/test sets without leaking future data. Which THREE practices should you follow?

Select 3 answers
A. Use a sliding window validation approach for hyperparameter tuning.
B. Ensure that all data points for a given time period are in the same split.
C. Use Looker to generate the splits automatically.
D. Randomly assign rows to each split to ensure statistical distribution.
E. Use a date column to define the split boundaries.
Answers: A, B, E

Correct: sliding window respects time order.

Why this answer

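Time-series data requires temporal splitting: define split boundaries with a date column (E), keep every data point from a given time period in the same split (B), and use sliding-window validation for hyperparameter tuning (A). Random assignment (D) leaks future data into training. Looker (C) is a BI tool and does not generate ML splits.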
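Sliding-window validation can be sketched as a generator of index folds over periods that are already ordered by date. This is one common variant (fixed-size training window that slides forward); the `sliding_window_splits` helper and its parameters are hypothetical names, not a Vertex AI API.

```python
def sliding_window_splits(n_periods, train_size, val_size, step=None):
    """Yield (train_indices, val_indices) over periods already ordered by date."""
    step = step or val_size
    start = 0
    while start + train_size + val_size <= n_periods:
        train = list(range(start, start + train_size))
        val = list(range(start + train_size, start + train_size + val_size))
        yield train, val
        start += step


folds = list(sliding_window_splits(n_periods=10, train_size=4, val_size=2))
for train, val in folds:
    # Validation periods always come strictly after the training periods.
    assert max(train) < min(val)
    print(train, val)
# [0, 1, 2, 3] [4, 5]
# [2, 3, 4, 5] [6, 7]
# [4, 5, 6, 7] [8, 9]
```

Because the indices refer to whole periods rather than rows, all rows from one period land in the same split.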

84
MCQ · hard

A retail company uses BigQuery to store sales data and wants to forecast weekly demand for the next 8 weeks using historical data from the past 2 years. They need to account for seasonality and holidays. Which BigQuery ML model type and configuration is most appropriate?

A. ARIMA_PLUS with holiday_region parameter
B. Boosted tree classifier
C. Linear regression with engineered time features
D. Time-series DECOMPOSE model
Answer: A

ARIMA_PLUS is designed for time-series forecasting with automatic seasonality detection and holiday support.

Why this answer

BigQuery ML's ARIMA_PLUS model is designed for time-series forecasting, automatically detecting seasonality and handling holiday effects via the holiday_region parameter. Linear regression would require manual feature engineering for time components. Time-series DECOMPOSE is not a model type.

Boosted trees are not natively time-series aware without feature engineering.

85
MCQ · medium

A data team uses Looker Studio to create a report that combines data from two different BigQuery tables: one with sales transactions and another with customer demographics. They need to join these tables in the report without writing SQL. Which feature should they use?

A. Data blending
B. Creating a report with multiple charts
C. Custom query in BigQuery connector
D. Calculated fields
Answer: A

Data blending allows combining data from different sources via a common key without SQL.

Why this answer

Looker Studio's data blending feature allows combining data from multiple sources (including BigQuery tables) using a common key, without writing SQL. It provides a graphical interface to define joins. Creating a custom query requires SQL.

Looker Studio reports support multiple charts, but blending is the specific feature for joining data. Calculated fields transform data within a single source.

86
Multi-Select · medium

A data engineer is planning a time-series forecasting model using BigQuery ML ARIMA+ on a dataset with daily sales data spanning 3 years. Which TWO actions are required to prepare the data for ARIMA+? (Choose 2.)

Select 2 answers
A. Create a partition on the time column to improve performance.
B. Remove any rows with NULL values in the time column.
C. Sort the data by the time column in ascending order.
D. Ensure the time column is of type DATE or TIMESTAMP.
E. Encode the target variable using one-hot encoding.
Answers: C, D

ARIMA+ expects the data to be ordered by time.

Why this answer

ARIMA+ requires a time column and a numeric target column. The time column must be in a date/timestamp format. Additionally, the data should be sorted by time.

Missing values should be handled (e.g., filled with 0 or interpolated) but that's not a requirement of the function itself.

87
Multi-Select · easy

A data engineer is preparing a dataset for ML training in Vertex AI. The dataset includes a timestamp column, a categorical column with high cardinality (1000 distinct values), and a numerical column with outliers. Which two preprocessing steps should they apply? (Choose TWO)

Select 2 answers
A. Drop the timestamp column
B. Label encode the categorical column
C. Winsorize the numerical column to cap outliers
D. Normalize the numerical column using Z-score
E. One-hot encode the categorical column
Answers: B, C

Label encoding maps categories to integers, reducing dimensionality.

Why this answer

One-hot encoding for high cardinality may be too sparse; label encoding (ordinal encoder) is more common. Winsorizing clips outliers.
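Both preprocessing steps are simple enough to sketch in plain Python. The percentile rule used for the caps and the alphabetical ordering of the label mapping are assumptions made for this example; `winsorize`, `label_encode`, and the sample data are hypothetical.

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to the lower and upper percentile caps instead of dropping them."""
    ordered = sorted(values)
    low = ordered[int(lower_pct * (len(ordered) - 1))]
    high = ordered[int(upper_pct * (len(ordered) - 1))]
    return [min(max(v, low), high) for v in values]


def label_encode(categories):
    """Map each distinct category to an integer; returns (codes, mapping)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping


amounts = [10, 12, 11, 13, 12, 11, 10, 14, 13, 500]  # 500 is an outlier
capped = winsorize(amounts, lower_pct=0.1, upper_pct=0.9)
print(capped)  # [10, 12, 11, 13, 12, 11, 10, 14, 13, 14]

stores = ["s3", "s1", "s2", "s1"]
codes, mapping = label_encode(stores)
print(codes)  # [2, 0, 1, 0]
```

Label encoding keeps a single integer column regardless of cardinality, whereas one-hot encoding 1000 distinct values would add 1000 sparse columns.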

88
MCQ · easy

Which Google Cloud service would you use to create a unified data catalog that automatically captures lineage from BigQuery, Cloud Storage, and other sources?

A. Cloud Composer
B. Dataflow
C. Data Catalog
D. Dataplex
Answer: D

Dataplex includes a unified catalog, lineage, and governance.

Why this answer

Dataplex provides a unified data catalog (Universal Catalog) with automated lineage, discovery, and governance across GCP. Data Catalog is the older standalone service; Dataplex is the recommended unified solution. Cloud Composer and Dataflow are orchestration/processing tools.

89
Multi-Select · medium

A data team wants to use Approximate Aggregation Functions in BigQuery to get faster query results. Which two functions can they use? (Choose 2)

Select 2 answers
A. APPROX_SUM
B. APPROX_AVG
C. APPROX_QUANTILES
D. APPROX_COUNT_DISTINCT
E. APPROX_MEDIAN
Answers: C, D

APPROX_QUANTILES returns approximate quantile boundaries, and APPROX_COUNT_DISTINCT returns an approximate distinct count.

Why this answer

BigQuery provides APPROX_COUNT_DISTINCT for approximate distinct counts and APPROX_QUANTILES for approximate quantiles. Other approximate functions include APPROX_TOP_COUNT and APPROX_TOP_SUM.
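The shape of an APPROX_QUANTILES result is easy to misread, so here is an exact plain-Python analogue: asking for `number` quantiles yields `number + 1` boundaries, from the minimum to the maximum. BigQuery returns approximations of these values; the `quantile_boundaries` helper and its index rule are assumptions for illustration only. This also shows why no APPROX_MEDIAN function is needed: an approximate median is typically read from APPROX_QUANTILES(x, 2)[OFFSET(1)].

```python
def quantile_boundaries(values, number):
    """Exact analogue of APPROX_QUANTILES(x, number): returns number + 1 boundaries."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / number)] for i in range(number + 1)]


values = list(range(1, 102))  # 1..101
quartiles = quantile_boundaries(values, 4)
print(quartiles)  # [1, 26, 51, 76, 101]

# The median is the middle boundary when asking for 2 quantiles.
median = quantile_boundaries(values, 2)[1]
print(median)  # 51
```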

90
Multi-Select · medium

A data scientist needs to perform feature engineering for a machine learning model using Vertex AI. They want to preprocess data using a pipeline that includes scaling, one-hot encoding, and handling missing values. Which TWO services can they use to define and execute this preprocessing pipeline? (Choose 2.)

Select 2 answers
A. Cloud Dataproc
B. Vertex AI Pipelines
C. BigQuery SQL with ML.TRANSFORM
D. Cloud Dataflow
E. Cloud Functions
Answers: B, C

Vertex AI Pipelines lets you build and run end-to-end ML pipelines, including preprocessing; BigQuery SQL with ML.TRANSFORM applies preprocessing directly in SQL.

Why this answer

Vertex AI Pipelines is the recommended service for building and running ML pipelines, including preprocessing steps. Alternatively, you can use BigQuery SQL for feature engineering directly on the data, then export the processed data for training. Cloud Dataflow is an option for batch/streaming data processing but is not specific to ML pipelines.

Cloud Functions and Dataproc are less suitable for this purpose.
