Practice PDE Preparing and Using Data for Analysis questions with full explanations on every answer.
Click any question to see the full explanation and answer options, or start a focused practice session above.
1. A data engineer wants to train a linear regression model in BigQuery ML to predict sales. The training data includes a categorical feature with 1000+ unique values. Which method is most appropriate to handle this feature in the CREATE MODEL statement?
2. You need to create a Looker model that defines a 'sales' view based on a BigQuery table, with a measure for total revenue. Which LookML object defines the table and dimensions?
3. A company uses Looker Studio to build dashboards from BigQuery data. They notice that queries take several seconds to return. They want to improve performance without changing the schema or adding materialized views. Which option should they use?
4. A data scientist is training a binary classification model on an imbalanced dataset (95% negative, 5% positive) using AutoML Tables. Which strategy should they use to handle the class imbalance?
5. You need to split a time-series dataset into training and evaluation sets for a forecasting model. The data is ordered by timestamp. Which splitting technique should you use?
6. Which BigQuery SQL function returns the rank of a row within a window, with gaps in the ranking for ties?
7. A company uses Dataplex to manage data quality across multiple BigQuery datasets. They want to define a data quality rule that checks if a column 'email' contains a valid email format. Which Dataplex feature should they use?
8. A data engineer needs to query data across BigQuery (in Google Cloud) and Snowflake (in AWS) without moving the data. Which service should they use?
9. You want to train a custom TensorFlow model on Vertex AI using a managed Jupyter notebook environment. Which service should you use?
10. A company uses Looker to define business logic in LookML. They need to create a new measure that calculates the average order value, defined as total revenue divided by number of orders. Which LookML syntax should they use?
11. A data scientist wants to import a pre-trained TensorFlow model into BigQuery ML for batch predictions. The model is stored in a Cloud Storage bucket. Which statement is correct?
12. You need to track the lineage of data in BigQuery, showing how tables are derived from other tables via queries. Which service provides this capability?
13. A data engineer needs to build a feature engineering pipeline using Vertex AI Pipelines. The pipeline should preprocess data, train a model, and deploy it. Which two components are required to define the pipeline? (Choose 2)
14. A company uses AutoML Tables to train a classification model. They want to improve model performance by engineering new features from existing timestamp columns. Which three techniques can they apply within AutoML Tables? (Choose 3)
15. A data team wants to use Approximate Aggregation Functions in BigQuery to get faster query results. Which two functions can they use? (Choose 2)
16. A data engineer needs to create a BigQuery ML model for predicting customer churn using a dataset with 10 million rows and 50 features. The dataset is highly imbalanced (5% churn). Which approach should the engineer use to handle class imbalance during model training?
17. A financial analytics team uses Looker to explore BigQuery data. They need to allow business users to filter by a custom date range that is not tied to an existing dimension. The date range must be user-input at query time. What is the best approach in Looker?
18. A data scientist wants to train a custom TensorFlow model on Vertex AI using a managed Jupyter notebook. Which Vertex AI service should they use to set up a notebook environment with pre-installed deep learning frameworks?
19. A retail company uses BigQuery to store sales data and wants to forecast weekly demand for the next 8 weeks using historical data from the past 2 years. They need to account for seasonality and holidays. Which BigQuery ML model type and configuration is most appropriate?
20. A data engineer needs to query data from BigQuery and another cloud provider's storage (AWS S3) using a single SQL query. The data must not be moved or copied to GCP. Which Google Cloud service should they use?
21. What is the primary purpose of Vertex AI Feature Store?
22. A company uses Looker Studio to create dashboards from BigQuery data. They notice that dashboard queries take several seconds to load. They want to improve performance without changing the underlying data or creating materialized views. Which option should they use?
23. A data engineer is building a production ML pipeline on Vertex AI. The pipeline must preprocess features (e.g., scaling, encoding) and then train a model. The preprocessing logic must be reusable for serving predictions. Which Vertex AI component should they use?
24. To enable data lineage tracking in BigQuery, which feature should be activated?
25. A company needs to predict whether a product image contains a specific defect. They have 10,000 labeled images and want to build a model quickly without writing custom code or training from scratch. Which GCP service should they use?
26. A data engineer needs to split time-series data for training a forecasting model. The data is sorted by timestamp. The engineer wants to avoid leakage where future data influences training. Which data splitting approach should they use?
27. A data team uses Looker Studio to create a report that combines data from two different BigQuery tables: one with sales transactions and another with customer demographics. They need to join these tables in the report without writing SQL. Which feature should they use?
28. A data engineer needs to implement data quality rules and governance policies across multiple data lakes in GCP. They want to automatically discover and catalog data assets, and enforce row-level security. Which two services should they use? (Select TWO)
29. A data analyst wants to compute the rank of sales per region and also the difference in sales between consecutive months for each region. Which BigQuery analytic functions should they use? (Select TWO)
30. A company wants to use BigQuery ML to build a recommendation system for movies. The data includes user IDs, movie IDs, and ratings. Which BigQuery ML model types are suitable for this? (Select TWO)
31. You train a BigQuery ML linear regression model to predict house prices. The model has high bias during evaluation. Which action BEST reduces bias?
32. You are building a real-time fraud detection system using BigQuery streaming and a BQML logistic regression model. The model must be retrained every hour with new labeled data. What is the MOST cost-effective approach to serve predictions with low latency?
33. You need to analyze customer churn and want to understand the rank of each customer's churn probability within their subscription plan. Which BigQuery window function computes the relative ranking from 1 (highest probability) to N?
34. You have a BigQuery table with sales data and want to pivot product categories into columns. Which SQL clause should you use?
35. Your Looker dashboard uses a BigQuery connection. You notice that some queries take over a minute. Which service can you enable to cache results in memory for sub-second Looker queries?
36. You are building a machine learning pipeline for credit risk assessment. The dataset has a severe class imbalance (1% default rate). You want to use AutoML Tables on Vertex AI. Which strategy should you incorporate to handle imbalance?
37. Which Google Cloud service would you use to create a unified data catalog that automatically captures lineage from BigQuery, Cloud Storage, and other sources?
38. You need to create a time-series forecast for inventory demand using BigQuery ML. The data includes daily sales for 5 years. Which model type should you use?
39. You are using Vertex AI Feature Store to serve features for online predictions. Your model requires features from multiple sources with low latency (<10ms). Which type of serving should you use?
40. You want to quickly estimate the number of distinct visitors to your website from a large BigQuery table. Which function provides an approximate count with low latency?
41. You are migrating a large on-premises data warehouse to BigQuery. The data includes sensitive PII columns that must be masked for certain users. Which BigQuery feature can automatically redact PII in query results based on user roles?
42. You need to build a Looker model that joins multiple tables from BigQuery. Which LookML object defines the relationship between tables?
43. A data scientist wants to use Vertex AI Workbench for exploratory data analysis. Which TWO statements are true about Vertex AI Workbench?
44. You are designing a data pipeline for ML training with Vertex AI. You need to split time-series data into train/validation/test sets without leaking future data. Which THREE practices should you follow?
45. You want to query data across Google Cloud and AWS using a single SQL interface without moving data. Which TWO services can you use?
46. You are building a forecasting model to predict daily sales for the next 90 days using historical sales data with clear seasonality and trend. You want to use BigQuery ML with minimal manual tuning. Which model type should you choose?
47. You have a BigQuery table 'orders' with columns order_id, customer_id, order_amount, and order_date. You need to rank customers by total spend per month, assigning the rank 1 to the highest spender. Which SQL function should you use in a window clause?
48. Your team uses Looker to develop a model on top of BigQuery. The data is partitioned by ingestion time, and analysts frequently query the last 7 days. However, Looker queries are scanning the entire table, causing high costs. Which change should you implement?
49. You are building a multi-cloud analytics solution to join data from Google Cloud and AWS S3. You need to query the S3 data using BigQuery without moving it. Which Google Cloud service should you use?
50. You need to preprocess tabular data for training a classification model using Vertex AI. The dataset has missing values in numerical columns and categorical columns with high cardinality. Which Vertex AI service provides automated feature engineering and preprocessing as part of the pipeline?
51. A retailer wants to use machine learning to predict customer churn based on transaction history and demographic data. The dataset has 500 features, many of which are correlated. The data is highly imbalanced: only 2% churn. They need to deploy a model that provides feature importance and is interpretable. Which model type should they use in BigQuery ML?
52. Your team uses Looker Studio to build dashboards on top of BigQuery. The dashboards are slow when filtering on a high-cardinality dimension (e.g., user ID). You want to improve performance without changing the underlying BigQuery table design. Which action should you take?
53. You are using BigQuery ML to train a matrix factorization model for a recommendation system. The training data consists of user-item interactions. You notice that the model is overfitting. Which of the following hyperparameter changes would most likely reduce overfitting?
54. You need to track data lineage from a BigQuery table through a series of transformations and into a Vertex AI model training pipeline. Which Google Cloud service provides automated data lineage tracking?
55. You are building a binary classification model using AutoML Tables on Vertex AI. The dataset has a severe class imbalance (1% positive class). Which strategy should you use to handle the imbalance?
56. You have a BigQuery table 'events' with a TIMESTAMP column 'event_time'. You need to compute, for each event, the difference in seconds from the previous event of the same user. Which window function should you use?
57. You are using Looker to model data from BigQuery. You have a dimension that should be filtered by a user attribute (e.g., user's region). Which LookML concept allows you to apply dynamic row-level security based on user attributes?
58. You need to select two BigQuery features that improve query performance by reducing the amount of data read. Which two options accomplish this? (Choose TWO)
59. You are building a time-series forecasting model with BigQuery ML. Which three steps should you perform to properly split the data and evaluate the model? (Choose THREE)
60. You need to implement data quality rules on a Dataplex lake to ensure that critical columns are not null and meet certain constraints. Which two Dataplex features can you use? (Choose TWO)
61. A data engineer needs to train a linear regression model in BigQuery ML using a table with 10 million rows. The model will predict sales based on features like advertising spend, seasonality, and store location. Which SQL statement should they use to create and train the model?
62. A data analyst wants to rank products by sales within each category. They need to assign a unique rank to each product, with no gaps in the ranking numbers (i.e., ties should have different ranks). Which window function should they use?
63. A company uses BigQuery and wants to reduce query costs by using BI Engine for Looker Studio dashboards. The data is stored in a BigQuery dataset with 5 TB of frequently accessed tables. The dashboards run dozens of concurrent queries. What is the recommended approach to enable BI Engine acceleration?
64. A data engineer needs to build a LookML model in Looker to define business logic and relationships for a new dataset. They want to create an 'explore' that joins an 'orders' view with a 'customers' view. Where should they define this join?
65. A company uses Vertex AI Workbench notebooks for data exploration and model development. They want to ensure that the notebook environment can access BigQuery data using the same permissions as the user's Google Cloud account. What is the recommended setup?
66. Which BigQuery SQL function returns an approximate count of distinct values in a large column, trading some accuracy for faster results than COUNT(DISTINCT)?
67. A data scientist is using AutoML Tables to build a classification model for predicting customer churn. The dataset is highly imbalanced (only 1% churn). Which strategy should they use to handle the class imbalance within AutoML Tables?
68. An organization wants to integrate BigQuery Omni to query data stored in AWS S3. They have set up the necessary connections. What is the primary benefit of using BigQuery Omni over simply copying the data to BigQuery?
69. A company uses Dataplex to manage data lakes on Google Cloud. They want to enforce data quality rules on a BigQuery table, such as ensuring that an 'email' column is not null and matches a regex pattern. Which Dataplex feature should they use?
70. A machine learning engineer needs to deploy a custom TensorFlow model for online predictions with low latency. The model is already trained and saved in SavedModel format. Which Vertex AI service should they use?
71. Which BigQuery function can be used to retrieve the value of a column from the previous row within a partition, ordered by a timestamp?
72. A company wants to use BigQuery's PIVOT operator to transform their sales data. They have a table with columns: 'year', 'quarter', 'revenue'. They want to create a report where each row is a year and each column is a quarter (Q1, Q2, Q3, Q4) showing revenue. Which SQL statement is correct?
73. A data engineer is planning a time-series forecasting model using BigQuery ML ARIMA+ on a dataset with daily sales data spanning 3 years. Which TWO actions are required to prepare the data for ARIMA+? (Choose 2.)
74. A company wants to track data lineage for their BigQuery tables to understand how data flows from source to derived tables. Which TWO Google Cloud services can be used to capture and visualize data lineage? (Choose 2.)
75. A data scientist needs to perform feature engineering for a machine learning model using Vertex AI. They want to preprocess data using a pipeline that includes scaling, one-hot encoding, and handling missing values. Which TWO services can they use to define and execute this preprocessing pipeline? (Choose 2.)
76. A company wants to train a machine learning model to predict customer churn using BigQuery ML. The dataset has a severe class imbalance (only 2% churn). Which approach should the data engineer take to handle this imbalance within BigQuery ML?
77. A data engineer needs to design a data pipeline that ingests streaming data from Cloud Pub/Sub, performs real-time aggregations, and loads the results into BigQuery for dashboarding. Which Google Cloud service should they use for the streaming aggregation step?
78. A company uses BigQuery Omni to query data stored in AWS S3. They need to join this data with data in BigQuery (GCP). The dataset in AWS is large (10 TB) and frequently updated. Which approach minimizes data movement and cost?
79. A data engineer is building a Looker Studio dashboard that requires a calculated field to compute the running total of sales per day per store. Which Looker Studio function should they use?
80. A company wants to use AutoML Tables to build a classification model on a dataset with 100 features and 500,000 rows. They need to deploy the model for online predictions with low latency (<100 ms). Which deployment option should they choose?
81. A data engineer needs to implement data lineage tracking for a BigQuery data warehouse. They want to automatically capture column-level lineage from ETL jobs run by Dataform and from manual SQL queries executed in the BigQuery console. Which approach meets these requirements?
82. A company is using Looker to explore their BigQuery data. They have defined a LookML model with an 'explore' that joins two views: 'orders' and 'customers'. The join is a left join. They want to ensure that only customers with orders are shown when exploring. Which LookML parameter should they modify?
83. A data engineer is building a feature store for ML models using Vertex AI Feature Store. The features are computed daily from BigQuery and need to be available for both online predictions (low latency) and offline training. Which two actions must the engineer take? (Choose TWO)
84. An e-commerce company uses BigQuery to analyze customer behavior. They need to compute the number of distinct customers per day, approximate quantiles of purchase amounts, and assign a row number per customer partition by date. Which BigQuery SQL functions should they use? (Choose THREE)
85. A company uses Dataplex to manage data quality across multiple BigQuery datasets. They need to define data quality rules that check for null values in critical columns and enforce uniqueness constraints. Which two Dataplex features should they use? (Choose TWO)
86. A data engineer is preparing a dataset for ML training in Vertex AI. The dataset includes a timestamp column, a categorical column with high cardinality (1000 distinct values), and a numerical column with outliers. Which two preprocessing steps should they apply? (Choose TWO)
87. A company wants to use BigQuery ML to train a time-series forecasting model on historical sales data. The data is recorded daily for 3 years. They need to evaluate model accuracy using time-series aware cross-validation. Which two options should they configure in the CREATE MODEL statement? (Choose TWO)
88. A data engineer needs to build a real-time dashboard in Looker Studio that displays live sales data from BigQuery. The dashboard must refresh every minute. The underlying BigQuery table is updated continuously via streaming inserts. Which two approaches can reduce query cost and latency? (Choose TWO)
89. A company stores sensitive customer data in BigQuery. They need to implement column-level security to restrict access to personally identifiable information (PII) columns. Which two BigQuery features can they use together? (Choose TWO)
90. A data engineer is using Vertex AI Workbench to develop a custom ML model. They want to store and version datasets, track experiments, and register models. Which three Vertex AI services should they use? (Choose THREE)
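Several of the questions above ask how to split time-ordered data so that future observations cannot influence training. As a rough illustration only, the sketch below shows one common approach, a chronological split, in plain Python. The function name `chronological_split`, the sample rows, and the 80/20 ratio are hypothetical choices made for this example; they are not part of any Google Cloud API or of the answer explanations in the question bank.

```python
from datetime import date, timedelta


def chronological_split(rows, train_fraction=0.8):
    """Split (timestamp, value) rows so every training row precedes every evaluation row."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Sort by timestamp first; the split point is then a position in time, not a random draw.
    ordered = sorted(rows, key=lambda row: row[0])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical daily sales data: 10 consecutive days, deliberately supplied out of order.
start = date(2024, 1, 1)
rows = [(start + timedelta(days=d), 100 + d) for d in (3, 0, 9, 1, 7, 2, 8, 4, 6, 5)]

train, evaluation = chronological_split(rows)
```

Shuffling the rows at random before splitting would break the ordering guarantee this sketch relies on, which is why random splits are usually avoided for forecasting data.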
The Preparing and Using Data for Analysis domain covers the key concepts tested in this area of the PDE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PDE domains — no account required.
The Courseiva PDE question bank contains 90 questions in the Preparing and Using Data for Analysis domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
The session launcher on this page draws questions exclusively from the Preparing and Using Data for Analysis domain, so you can practice this domain on its own. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.