Free PDE Preparing and Using Data for Analysis Practice Questions (2026)

Q: What does the Preparing and Using Data for Analysis domain cover on the PDE exam?

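The Preparing and Using Data for Analysis domain is part of the PDE exam blueprint published by Google Cloud. The practice questions on this page cover topics such as BigQuery ML model training and forecasting, BigQuery analytic and approximate aggregation functions, Looker and Looker Studio, Vertex AI, Dataplex data quality and lineage, and cross-cloud queries with BigQuery Omni.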

Q: How many Preparing and Using Data for Analysis questions are on the PDE exam?

The Preparing and Using Data for Analysis domain is one of the weighted domains on the PDE exam. The Courseiva question bank has 90 practice questions for this domain.

Q: How can I practice Preparing and Using Data for Analysis questions for PDE?

Click any of the 90 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Preparing and Using Data for Analysis domain.

All PDE Preparing and Using Data for Analysis questions (90)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A data engineer wants to train a linear regression model in BigQuery ML to predict sales. The training data includes a categorical feature with 1000+ unique values. Which method is most appropriate to handle this feature in the CREATE MODEL statement?

You need to create a Looker model that defines a 'sales' view based on a BigQuery table, with a measure for total revenue. Which LookML object defines the table and dimensions?

A company uses Looker Studio to build dashboards from BigQuery data. They notice that queries take several seconds to return. They want to improve performance without changing the schema or adding materialized views. Which option should they use?

A data scientist is training a binary classification model on an imbalanced dataset (95% negative, 5% positive) using AutoML Tables. Which strategy should they use to handle the class imbalance?

You need to split a time-series dataset into training and evaluation sets for a forecasting model. The data is ordered by timestamp. Which splitting technique should you use?

Which BigQuery SQL function returns the rank of a row within a window, with gaps in the ranking for ties?

A company uses Dataplex to manage data quality across multiple BigQuery datasets. They want to define a data quality rule that checks if a column 'email' contains a valid email format. Which Dataplex feature should they use?

A data engineer needs to query data across BigQuery (in Google Cloud) and Snowflake (in AWS) without moving the data. Which service should they use?

You want to train a custom TensorFlow model on Vertex AI using a managed Jupyter notebook environment. Which service should you use?

A company uses Looker to define business logic in LookML. They need to create a new measure that calculates the average order value, defined as total revenue divided by number of orders. Which LookML syntax should they use?

A data scientist wants to import a pre-trained TensorFlow model into BigQuery ML for batch predictions. The model is stored in a Cloud Storage bucket. Which statement is correct?

You need to track the lineage of data in BigQuery, showing how tables are derived from other tables via queries. Which service provides this capability?

A data engineer needs to build a feature engineering pipeline using Vertex AI Pipelines. The pipeline should preprocess data, train a model, and deploy it. Which two components are required to define the pipeline? (Choose 2)

A company uses AutoML Tables to train a classification model. They want to improve model performance by engineering new features from existing timestamp columns. Which three techniques can they apply within AutoML Tables? (Choose 3)

A data team wants to use Approximate Aggregation Functions in BigQuery to get faster query results. Which two functions can they use? (Choose 2)

A data engineer needs to create a BigQuery ML model for predicting customer churn using a dataset with 10 million rows and 50 features. The dataset is highly imbalanced (5% churn). Which approach should the engineer use to handle class imbalance during model training?

A financial analytics team uses Looker to explore BigQuery data. They need to allow business users to filter by a custom date range that is not tied to an existing dimension. The date range must be user-input at query time. What is the best approach in Looker?

A data scientist wants to train a custom TensorFlow model on Vertex AI using a managed Jupyter notebook. Which Vertex AI service should they use to set up a notebook environment with pre-installed deep learning frameworks?

A retail company uses BigQuery to store sales data and wants to forecast weekly demand for the next 8 weeks using historical data from the past 2 years. They need to account for seasonality and holidays. Which BigQuery ML model type and configuration is most appropriate?

A data engineer needs to query data from BigQuery and another cloud provider's storage (AWS S3) using a single SQL query. The data must not be moved or copied to GCP. Which Google Cloud service should they use?

What is the primary purpose of Vertex AI Feature Store?

A company uses Looker Studio to create dashboards from BigQuery data. They notice that dashboard queries take several seconds to load. They want to improve performance without changing the underlying data or creating materialized views. Which option should they use?

A data engineer is building a production ML pipeline on Vertex AI. The pipeline must preprocess features (e.g., scaling, encoding) and then train a model. The preprocessing logic must be reusable for serving predictions. Which Vertex AI component should they use?

To enable data lineage tracking in BigQuery, which feature should be activated?

A company needs to predict whether a product image contains a specific defect. They have 10,000 labeled images and want to build a model quickly without writing custom code or training from scratch. Which GCP service should they use?

A data engineer needs to split time-series data for training a forecasting model. The data is sorted by timestamp. The engineer wants to avoid leakage where future data influences training. Which data splitting approach should they use?

A data team uses Looker Studio to create a report that combines data from two different BigQuery tables: one with sales transactions and another with customer demographics. They need to join these tables in the report without writing SQL. Which feature should they use?

A data engineer needs to implement data quality rules and governance policies across multiple data lakes in GCP. They want to automatically discover and catalog data assets, and enforce row-level security. Which two services should they use? (Select TWO)

A data analyst wants to compute the rank of sales per region and also the difference in sales between consecutive months for each region. Which BigQuery analytic functions should they use? (Select TWO)

A company wants to use BigQuery ML to build a recommendation system for movies. The data includes user IDs, movie IDs, and ratings. Which BigQuery ML model types are suitable for this? (Select TWO)

You train a BigQuery ML linear regression model to predict house prices. The model has high bias during evaluation. Which action BEST reduces bias?

You are building a real-time fraud detection system using BigQuery streaming and a BQML logistic regression model. The model must be retrained every hour with new labeled data. What is the MOST cost-effective approach to serve predictions with low latency?

You need to analyze customer churn and want to understand the rank of each customer's churn probability within their subscription plan. Which BigQuery window function computes the relative ranking from 1 (highest probability) to N?

You have a BigQuery table with sales data and want to pivot product categories into columns. Which SQL clause should you use?

Your Looker dashboard uses a BigQuery connection. You notice that some queries take over a minute. Which service can you enable to cache results in memory for sub-second Looker queries?

You are building a machine learning pipeline for credit risk assessment. The dataset has a severe class imbalance (1% default rate). You want to use AutoML Tables on Vertex AI. Which strategy should you incorporate to handle imbalance?

Which Google Cloud service would you use to create a unified data catalog that automatically captures lineage from BigQuery, Cloud Storage, and other sources?

You need to create a time-series forecast for inventory demand using BigQuery ML. The data includes daily sales for 5 years. Which model type should you use?

You are using Vertex AI Feature Store to serve features for online predictions. Your model requires features from multiple sources with low latency (<10ms). Which type of serving should you use?

You want to quickly estimate the number of distinct visitors to your website from a large BigQuery table. Which function provides an approximate count with low latency?

You are migrating a large on-premises data warehouse to BigQuery. The data includes sensitive PII columns that must be masked for certain users. Which BigQuery feature can automatically redact PII in query results based on user roles?

You need to build a Looker model that joins multiple tables from BigQuery. Which LookML object defines the relationship between tables?

A data scientist wants to use Vertex AI Workbench for exploratory data analysis. Which TWO statements are true about Vertex AI Workbench?

You are designing a data pipeline for ML training with Vertex AI. You need to split time-series data into train/validation/test sets without leaking future data. Which THREE practices should you follow?

You want to query data across Google Cloud and AWS using a single SQL interface without moving data. Which TWO services can you use?

You are building a forecasting model to predict daily sales for the next 90 days using historical sales data with clear seasonality and trend. You want to use BigQuery ML with minimal manual tuning. Which model type should you choose?

You have a BigQuery table 'orders' with columns order_id, customer_id, order_amount, and order_date. You need to rank customers by total spend per month, assigning the rank 1 to the highest spender. Which SQL function should you use in a window clause?

Your team uses Looker to develop a model on top of BigQuery. The data is partitioned by ingestion time, and analysts frequently query the last 7 days. However, Looker queries are scanning the entire table, causing high costs. Which change is the best approach?

You are building a multi-cloud analytics solution to join data from Google Cloud and AWS S3. You need to query the S3 data using BigQuery without moving it. Which Google Cloud service should you use?

You need to preprocess tabular data for training a classification model using Vertex AI. The dataset has missing values in numerical columns and categorical columns with high cardinality. Which Vertex AI service provides automated feature engineering and preprocessing as part of the pipeline?

A retailer wants to use machine learning to predict customer churn based on transaction history and demographic data. The dataset has 500 features, many of which are correlated. The data is highly imbalanced: only 2% churn. They need to deploy a model that provides feature importance and is interpretable. Which model type should they use in BigQuery ML?

Your team uses Looker Studio to build dashboards on top of BigQuery. The dashboards are slow when filtering on a high-cardinality dimension (e.g., user ID). You want to improve performance without changing the underlying BigQuery table design. Which action should you take?

You are using BigQuery ML to train a matrix factorization model for a recommendation system. The training data consists of user-item interactions. You notice that the model is overfitting. Which of the following hyperparameter changes would most likely reduce overfitting?

You need to track data lineage from a BigQuery table through a series of transformations and into a Vertex AI model training pipeline. Which Google Cloud service provides automated data lineage tracking?

You are building a binary classification model using AutoML Tables on Vertex AI. The dataset has a severe class imbalance (1% positive class). Which strategy should you use to handle the imbalance?

You have a BigQuery table 'events' with a TIMESTAMP column 'event_time'. You need to compute, for each event, the difference in seconds from the previous event of the same user. Which window function should you use?

You are using Looker to model data from BigQuery. You have a dimension that should be filtered by a user attribute (e.g., user's region). Which LookML concept allows you to apply dynamic row-level security based on user attributes?

You need to improve query performance by reducing the amount of data read. Which two BigQuery features accomplish this? (Choose TWO)

You are building a time-series forecasting model with BigQuery ML. Which three steps should you perform to properly split the data and evaluate the model? (Choose THREE)

You need to implement data quality rules on a Dataplex lake to ensure that critical columns are not null and meet certain constraints. Which two Dataplex features can you use? (Choose TWO)

A data engineer needs to train a linear regression model in BigQuery ML using a table with 10 million rows. The model will predict sales based on features like advertising spend, seasonality, and store location. Which SQL statement should they use to create and train the model?

A data analyst wants to rank products by sales within each category. They need to assign a unique rank to each product, with no gaps in the ranking numbers (i.e., ties should have different ranks). Which window function should they use?

A company uses BigQuery and wants to reduce query costs by using BI Engine for Looker Studio dashboards. The data is stored in a BigQuery dataset with 5 TB of frequently accessed tables. The dashboards run dozens of concurrent queries. What is the recommended approach to enable BI Engine acceleration?

A data engineer needs to build a LookML model in Looker to define business logic and relationships for a new dataset. They want to create an 'explore' that joins an 'orders' view with a 'customers' view. Where should they define this join?

A company uses Vertex AI Workbench notebooks for data exploration and model development. They want to ensure that the notebook environment can access BigQuery data using the same permissions as the user's Google Cloud account. What is the recommended setup?

Which BigQuery SQL function returns an approximate count of distinct values in a large column, trading some accuracy for faster results than COUNT(DISTINCT)?

A data scientist is using AutoML Tables to build a classification model for predicting customer churn. The dataset is highly imbalanced (only 1% churn). Which strategy should they use to handle the class imbalance within AutoML Tables?

An organization wants to integrate BigQuery Omni to query data stored in AWS S3. They have set up the necessary connections. What is the primary benefit of using BigQuery Omni over simply copying the data to BigQuery?

A company uses Dataplex to manage data lakes on Google Cloud. They want to enforce data quality rules on a BigQuery table, such as ensuring that an 'email' column is not null and matches a regex pattern. Which Dataplex feature should they use?

A machine learning engineer needs to deploy a custom TensorFlow model for online predictions with low latency. The model is already trained and saved in SavedModel format. Which Vertex AI service should they use?

Which BigQuery function can be used to retrieve the value of a column from the previous row within a partition, ordered by a timestamp?

A company wants to use BigQuery's PIVOT operator to transform their sales data. They have a table with columns: 'year', 'quarter', 'revenue'. They want to create a report where each row is a year and each column is a quarter (Q1, Q2, Q3, Q4) showing revenue. Which SQL statement is correct?

A data engineer is planning a time-series forecasting model using the BigQuery ML ARIMA_PLUS model type on a dataset with daily sales data spanning 3 years. Which TWO actions are required to prepare the data for ARIMA_PLUS? (Choose 2.)

A company wants to track data lineage for their BigQuery tables to understand how data flows from source to derived tables. Which TWO Google Cloud services can be used to capture and visualize data lineage? (Choose 2.)

A data scientist needs to perform feature engineering for a machine learning model using Vertex AI. They want to preprocess data using a pipeline that includes scaling, one-hot encoding, and handling missing values. Which TWO services can they use to define and execute this preprocessing pipeline? (Choose 2.)

A company wants to train a machine learning model to predict customer churn using BigQuery ML. The dataset has a severe class imbalance (only 2% churn). Which approach should the data engineer take to handle this imbalance within BigQuery ML?

A data engineer needs to design a data pipeline that ingests streaming data from Cloud Pub/Sub, performs real-time aggregations, and loads the results into BigQuery for dashboarding. Which Google Cloud service should they use for the streaming aggregation step?

A company uses BigQuery Omni to query data stored in AWS S3. They need to join this data with data in BigQuery (GCP). The dataset in AWS is large (10 TB) and frequently updated. Which approach minimizes data movement and cost?

A data engineer is building a Looker Studio dashboard that requires a calculated field to compute the running total of sales per day per store. Which Looker Studio function should they use?

A company wants to use AutoML Tables to build a classification model on a dataset with 100 features and 500,000 rows. They need to deploy the model for online predictions with low latency (<100 ms). Which deployment option should they choose?

A data engineer needs to implement data lineage tracking for a BigQuery data warehouse. They want to automatically capture column-level lineage from ETL jobs run by Dataform and from manual SQL queries executed in the BigQuery console. Which approach meets these requirements?

A company is using Looker to explore their BigQuery data. They have defined a LookML model with an 'explore' that joins two views: 'orders' and 'customers'. The join is a left join. They want to ensure that only customers with orders are shown when exploring. Which LookML parameter should they modify?

A data engineer is building a feature store for ML models using Vertex AI Feature Store. The features are computed daily from BigQuery and need to be available for both online predictions (low latency) and offline training. Which two actions must the engineer take? (Choose TWO)

An e-commerce company uses BigQuery to analyze customer behavior. They need to compute the number of distinct customers per day, approximate quantiles of purchase amounts, and assign a row number per customer partition by date. Which BigQuery SQL functions should they use? (Choose THREE)

A company uses Dataplex to manage data quality across multiple BigQuery datasets. They need to define data quality rules that check for null values in critical columns and enforce uniqueness constraints. Which two Dataplex features should they use? (Choose TWO)

A data engineer is preparing a dataset for ML training in Vertex AI. The dataset includes a timestamp column, a categorical column with high cardinality (1000 distinct values), and a numerical column with outliers. Which two preprocessing steps should they apply? (Choose TWO)

A company wants to use BigQuery ML to train a time-series forecasting model on historical sales data. The data is recorded daily for 3 years. They need to evaluate model accuracy using time-series aware cross-validation. Which two options should they configure in the CREATE MODEL statement? (Choose TWO)

A data engineer needs to build a real-time dashboard in Looker Studio that displays live sales data from BigQuery. The dashboard must refresh every minute. The underlying BigQuery table is updated continuously via streaming inserts. Which two approaches can reduce query cost and latency? (Choose TWO)

A company stores sensitive customer data in BigQuery. They need to implement column-level security to restrict access to personally identifiable information (PII) columns. Which two BigQuery features can they use together? (Choose TWO)

A data engineer is using Vertex AI Workbench to develop a custom ML model. They want to store and version datasets, track experiments, and register models. Which three Vertex AI services should they use? (Choose THREE)

Other PDE exam domains

Designing Data Processing Systems

Ingesting and Processing the Data

Storing the Data

Maintaining and Automating Data Workloads

Building and operationalizing data processing systems

Operationalizing machine learning models

Ensuring solution quality

Frequently asked questions

What does the Preparing and Using Data for Analysis domain cover on the PDE exam?

The Preparing and Using Data for Analysis domain covers the key concepts tested in this area of the PDE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PDE domains — no account required.

How many Preparing and Using Data for Analysis questions are in the PDE question bank?

The Courseiva PDE question bank contains 90 questions in the Preparing and Using Data for Analysis domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Preparing and Using Data for Analysis for PDE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Preparing and Using Data for Analysis questions for PDE?

Yes — the session launcher on this page draws questions exclusively from the Preparing and Using Data for Analysis domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
