This chapter covers machine learning (ML) on Google Cloud, focusing on the services and tools that enable building, training, deploying, and managing ML models. For the GCDL exam, this topic appears in Domain 3 (Data Analytics AI), Objective 3.2, and typically accounts for 10-15% of exam questions. Understanding the ML workflow, key Google Cloud services like Vertex AI, AI Platform, and pre-trained APIs, and the differences between AI and ML is essential. This chapter provides the depth you need to answer scenario-based questions about selecting the right ML service, understanding the ML pipeline, and managing models in production.
Imagine a car factory assembly line. Each car moves through stations: welding, painting, engine installation, and final inspection. At the final inspection station, a quality inspector checks each car for defects. The inspector uses a checklist: paint finish, engine noise, brake response. If a car passes, it moves to shipping. If it fails, it is sent back to a specific station for rework. A machine learning pipeline on Google Cloud works in much the same way. Data flows through stages: ingestion, preprocessing, training, evaluation, and prediction. At the evaluation stage, a model is 'inspected' using metrics like accuracy or precision. If performance is below a threshold (e.g., accuracy < 90%), the model is sent back for retraining with adjusted hyperparameters. The inspector’s checklist corresponds to evaluation metrics; the rework station corresponds to hyperparameter tuning. Just as the factory can have multiple inspectors for different checks, Google Cloud AI Platform can run multiple evaluation jobs. The factory also uses sensors to collect data on each car's build time and defect rate, which feeds back to improve the assembly process; this mirrors Vertex AI's model monitoring and continuous training. Without the inspector, defective cars would ship to customers; without evaluation, poor models would serve predictions, leading to business errors. The analogy is mechanistic: each step has a specific input, transformation, and output, with feedback loops for improvement.
What is Machine Learning on Google Cloud?
Machine learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn and improve from experience without being explicitly programmed. On Google Cloud, ML services provide infrastructure and tools to build, train, deploy, and manage ML models at scale. The exam tests your understanding of the ML workflow and the specific Google Cloud services that support each stage.
The ML Workflow (Pipeline)
The ML pipeline consists of several stages:
1. Data Ingestion and Preparation: Collect raw data from sources like Cloud Storage, BigQuery, or streaming data via Pub/Sub. Data must be cleaned, transformed, and labeled. On Google Cloud, you can use Cloud Data Fusion, Dataflow, or Dataprep for data preparation.
2. Model Training: Use algorithms to learn patterns from data. Training can be done using:
   - Vertex AI: Unified ML platform for training with custom code, AutoML, or pre-built algorithms.
   - AI Platform Training: Legacy service for distributed training.
   - Notebooks (Vertex AI Workbench): Jupyter-based environment for experimentation.
3. Model Evaluation: Assess model performance using metrics like accuracy, precision, recall, F1-score, and AUC-ROC. Vertex AI provides evaluation pipelines and model comparison.
4. Model Deployment: Deploy the trained model to a serving endpoint for predictions. Vertex AI offers:
   - Endpoints: Managed prediction service with autoscaling.
   - Batch Predictions: For large-scale offline predictions.
5. Model Monitoring: Monitor deployed models for data drift, concept drift, and performance degradation. Vertex AI Model Monitoring can trigger retraining alerts.
6. MLOps (CI/CD for ML): Automate the pipeline using Vertex AI Pipelines, Cloud Build, and Cloud Composer.
Key Services and Their Roles
- Vertex AI: The central ML platform that brings together AutoML, custom training, prediction, and MLOps. It unifies the experience across the ML lifecycle. As of 2024, Vertex AI is the recommended service for most ML workloads.
- AutoML: A no-code/low-code solution for training high-quality models on image, text, video, and tabular (structured) data. It automatically searches for the best model architecture and hyperparameters.
- Pre-trained APIs: Google Cloud offers ready-to-use ML models via APIs:
  - Vision API: Image classification, object detection, OCR.
  - Natural Language API: Sentiment analysis, entity extraction, syntax analysis.
  - Translation API: Language translation.
  - Speech-to-Text and Text-to-Speech: Audio transcription and synthesis.
  - Document AI: Document processing and understanding.
  - Video Intelligence API: Video content analysis.
  - Dialogflow: Conversational AI for chatbots.
- AI Platform: Legacy service for training and prediction (still supported but Vertex AI is preferred).
- Cloud TPU: Tensor Processing Units for accelerated training of large models.
- BigQuery ML: Enables creating and executing ML models using SQL queries on data in BigQuery. Supports model types like linear regression, logistic regression, k-means clustering, matrix factorization, and deep neural networks (via TensorFlow).
- Cloud Dataflow and Dataproc: Data processing engines that can be part of the ML pipeline.
- Cloud Composer: Managed Apache Airflow for orchestrating ML workflows.
AutoML Details
AutoML allows you to build custom models without writing code. You provide labeled data, and it automatically trains and tunes multiple models to find the best one. AutoML supports:
- AutoML Tables: For structured/tabular data (regression, classification).
- AutoML Vision: For image classification and object detection.
- AutoML Natural Language: For text classification, entity extraction, and sentiment analysis.
- AutoML Translation: For custom translation models.
- AutoML Video Intelligence: For video classification and object tracking.
Pre-trained APIs vs. Custom Models
Pre-trained APIs: Best for common use cases where Google's generic models suffice. They are quick to integrate, require no training data, and are cost-effective for standard tasks.
Custom Models (Vertex AI): Necessary when you have domain-specific data, need higher accuracy, or require control over the model architecture. Custom training requires more effort and data.
BigQuery ML
BigQuery ML enables ML directly in BigQuery using SQL. Supported models include:
Linear regression (for forecasting)
Logistic regression (for classification)
K-means (for clustering)
Matrix factorization (for recommendation)
Time series models (ARIMA-based, model type ARIMA_PLUS)
Deep neural networks (using TensorFlow)
Boosted tree models (XGBoost)
Imported TensorFlow models
You can create a model with a CREATE MODEL statement and then predict using ML.PREDICT.
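As a rough illustration of those two statements, the Python snippet below only assembles the SQL text; it does not contact BigQuery. The dataset, table, and column names (`mydataset.churn_model`, `mydataset.customers`, `churned`, and so on) are hypothetical, and in practice the strings would be run from the BigQuery console or submitted through a client library.

```python
# Hypothetical names: replace with your own dataset, tables, and columns.
MODEL = "mydataset.churn_model"
TRAIN_TABLE = "mydataset.customers"
NEW_TABLE = "mydataset.new_customers"

# CREATE MODEL trains a logistic regression classifier on the SELECT result.
# input_label_cols names the column the model should learn to predict.
create_model_sql = f"""
CREATE OR REPLACE MODEL `{MODEL}`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `{TRAIN_TABLE}`
""".strip()

# ML.PREDICT applies the trained model to rows that have no label yet.
predict_sql = f"""
SELECT *
FROM ML.PREDICT(
  MODEL `{MODEL}`,
  (SELECT tenure_months, monthly_spend, support_tickets FROM `{NEW_TABLE}`)
)
""".strip()

print(create_model_sql)
print()
print(predict_sql)
```

Swapping `model_type` for another supported value (for example `KMEANS`, which needs no label column) changes the kind of model while the overall pattern stays the same.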
Vertex AI Workbench
Vertex AI Workbench provides Jupyter-based notebooks for data science and ML development. It integrates with BigQuery, Cloud Storage, and other services. You can use it for exploratory data analysis, model training, and experimentation.
MLOps Components
Vertex AI Pipelines: Executes ML pipelines defined as code (using Kubeflow Pipelines or TensorFlow Extended).
Model Registry: Central repository for storing, versioning, and deploying models.
Feature Store: Managed repository for ML features that can be shared across models.
Model Monitoring: Detects data drift and concept drift in deployed models.
Continuous Training: Automates retraining with new data.
Training and Prediction Options
- Training:
  - Custom training: You provide training code in a container (Docker) and specify machine types (e.g., n1-standard-8) or use accelerators (GPUs, TPUs).
  - AutoML: No-code training.
  - Hyperparameter tuning: Vertex AI can automatically search for best hyperparameters.
- Prediction:
  - Online prediction: Low-latency requests to a deployed endpoint. Supports autoscaling based on traffic.
  - Batch prediction: Process large sets of data asynchronously. Output is written to Cloud Storage.
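To make the online prediction option more concrete, here is a minimal sketch of how a request to a deployed endpoint might be assembled. It only builds the URL and JSON body and sends nothing. The project, region, and endpoint ID are hypothetical placeholders, the fields inside each instance depend entirely on the deployed model, and a real call would also need an OAuth access token in the `Authorization` header.

```python
import json

# Hypothetical identifiers, for illustration only.
PROJECT = "my-project"
REGION = "us-central1"
ENDPOINT_ID = "1234567890"


def build_predict_request(instances):
    """Return the URL and JSON body for an online prediction call."""
    url = (
        f"https://{REGION}-aiplatform.googleapis.com/v1/"
        f"projects/{PROJECT}/locations/{REGION}/endpoints/{ENDPOINT_ID}:predict"
    )
    # Each element of "instances" is one input the model should score.
    body = json.dumps({"instances": instances})
    return url, body


# The feature names below are made up; a real model defines its own schema.
url, body = build_predict_request([{"amount": 42.5, "country": "DE"}])
print(url)
print(body)
```

Batch prediction has no equivalent per-request call: the inputs are listed in a job definition and the results are read from the output location once the job finishes.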
Integration with Other Services
Cloud Storage: Store training data, model artifacts, and prediction results.
BigQuery: Store and query large datasets; use BigQuery ML for SQL-based ML.
Pub/Sub: Ingest streaming data for real-time predictions.
Cloud Functions: Trigger ML pipelines on events.
Cloud Logging and Monitoring: Track model performance and system health.
Exam-Relevant Numbers and Defaults
Vertex AI supports up to 10 nodes for AutoML training by default (can be increased).
AutoML training time limits: up to 72 hours for some tasks.
Vertex AI Endpoints have a default timeout of 60 seconds for online prediction requests.
Batch prediction outputs are stored in Cloud Storage in JSON or CSV format.
BigQuery ML models are stored in BigQuery datasets.
Vertex AI Workbench instances can be stopped after inactivity (default 180 minutes).
Common Pitfalls
Confusing AutoML with pre-trained APIs: AutoML trains a custom model on your data; pre-trained APIs use Google's generic models.
Thinking BigQuery ML only supports linear models: It supports various model types including deep neural networks.
Assuming Vertex AI replaces all legacy services: AI Platform still exists but Vertex AI is the recommended path.
Overlooking data preprocessing: Garbage in, garbage out — data quality is critical for ML success.
Define Business Problem and Data
The first step is to clearly define the business problem that ML will solve (e.g., predicting customer churn, classifying images). Determine whether the problem requires regression, classification, clustering, or another ML task. Identify and collect the necessary data from sources like BigQuery, Cloud Storage, or external databases. Ensure data quality, label data if needed, and consider data privacy and compliance. This step sets the foundation for the entire ML project.
Prepare and Preprocess Data
Raw data often contains missing values, outliers, or inconsistent formats. Use Google Cloud services like Cloud Data Fusion, Dataprep (by Trifacta), or Dataflow to clean, transform, and normalize data. Feature engineering may involve creating new features from existing ones. Split data into training, validation, and test sets. Store processed data in Cloud Storage or BigQuery. This step is crucial because model performance heavily depends on data quality.
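The split into training, validation, and test sets can be sketched in a few lines of plain Python. This is a minimal example that assumes the rows fit in memory; the 80/10/10 proportions and the fixed seed are illustrative choices, not values prescribed by any Google Cloud service.

```python
import random


def split_dataset(rows, train_pct=80, validation_pct=10, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, validation, test


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

The test set should stay untouched until the final evaluation; tuning against it leaks information and inflates the reported metrics.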
Choose Model Type and Train
Select a model type based on the problem: regression for continuous values, classification for categories, etc. Decide between using a pre-trained API, AutoML, or custom training. For custom training, write training code (e.g., TensorFlow, PyTorch) and package it in a container. Submit a training job to Vertex AI, specifying machine type, accelerators (GPUs/TPUs), and hyperparameters. Vertex AI supports distributed training for large datasets. Training time varies from minutes to days.
Evaluate Model Performance
After training, evaluate the model on the held-out test set using appropriate metrics: accuracy, precision, recall, F1-score for classification; RMSE, MAE for regression; AUC-ROC for binary classifiers. Vertex AI provides evaluation pipelines that compute these metrics. If performance is unsatisfactory, adjust hyperparameters, try different algorithms, or improve data preprocessing. This step may involve multiple iterations.
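The metrics named above are simple to compute by hand, which helps when reading an evaluation report. The sketch below uses plain Python and small made-up label lists. The 0.90 accuracy threshold is a hypothetical gate for illustration, not a platform default.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """Mean absolute error and root mean squared error."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


# Made-up labels and predictions for eight test examples.
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])

ACCURACY_THRESHOLD = 0.90  # hypothetical quality gate
needs_another_iteration = clf["accuracy"] < ACCURACY_THRESHOLD
print(clf, reg, needs_another_iteration)
```

On imbalanced data, precision and recall usually say more than accuracy: a model that always predicts the majority class can score high accuracy while missing every positive case.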
Deploy Model to Endpoint
Once the model meets performance thresholds, deploy it to a Vertex AI Endpoint for online predictions, or set up a batch prediction job. For online endpoints, configure machine type, autoscaling (min/max nodes), and traffic splitting for A/B testing. The endpoint provides a REST API for prediction requests. Batch predictions are submitted as jobs and results are written to Cloud Storage. Deploying multiple versions allows for canary deployments.
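Traffic splitting can be pictured as weighted random routing between model versions. The simulation below is a sketch of that idea only, not how the managed endpoint is implemented. The version names `model_v1` and `model_v2` and the 90/10 split are hypothetical.

```python
import random


def validate_traffic_split(split):
    """A traffic split maps model versions to percentages that sum to 100."""
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    return split


def route_request(split, rng):
    """Pick the model version that serves one request, weighted by the split."""
    versions = list(split)
    weights = [split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


# Canary-style split: most traffic stays on the current version.
split = validate_traffic_split({"model_v1": 90, "model_v2": 10})
rng = random.Random(0)  # seeded so the simulation is repeatable
counts = {version: 0 for version in split}
for _ in range(10_000):
    counts[route_request(split, rng)] += 1
print(counts)
```

If the new version performs well on its small share, the percentages can be shifted gradually until it takes all traffic; if not, setting its share to 0 rolls it back without redeploying.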
Monitor and Retrain Model
After deployment, monitor the model for data drift (changes in input data distribution) and concept drift (changes in relationship between input and output). Vertex AI Model Monitoring can send alerts when drift exceeds thresholds. Set up continuous training pipelines to retrain the model periodically or when drift is detected. Use Vertex AI Pipelines to automate the retraining process. Model versioning in Model Registry helps track changes.
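Data drift comes down to comparing the distribution of a feature at training time with its distribution in live traffic. The sketch below uses Jensen-Shannon divergence as one possible distance measure for a categorical feature. The feature values, the 0.05 threshold, and the choice of statistic are assumptions for illustration; a managed monitoring service has its own configuration and defaults.

```python
import math
from collections import Counter


def distribution(values):
    """Relative frequency of each category in a list of values."""
    counts = Counter(values)
    return {key: count / len(values) for key, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = no overlap."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture distribution.
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Made-up payment-method feature: training baseline vs. recent serving traffic.
training = distribution(["card"] * 70 + ["wallet"] * 30)
serving = distribution(["card"] * 40 + ["wallet"] * 60)

DRIFT_THRESHOLD = 0.05  # hypothetical alerting threshold
score = js_divergence(training, serving)
drift_detected = score > DRIFT_THRESHOLD
print(round(score, 4), drift_detected)
```

Concept drift is harder to detect this way because it needs ground-truth labels for recent predictions; input-distribution checks like this one only cover data drift.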
Enterprise Scenario 1: Retail Customer Churn Prediction
A large e-commerce company wants to predict which customers are likely to churn in the next 30 days. They have historical purchase data, customer support interactions, and website clickstream data stored in BigQuery. They use AutoML Tables to train a binary classification model without writing code. The data is exported from BigQuery to Cloud Storage, and AutoML automatically splits it, trains multiple models, and selects the best one based on AUC-ROC. The model is deployed to a Vertex AI Endpoint. The company integrates the endpoint into their CRM system to trigger retention offers. Challenges include handling imbalanced classes (churn is rare) and ensuring the model updates daily with new data. They use Vertex AI Pipelines to automate retraining every night. Misconfiguration could lead to outdated models causing poor predictions, so they monitor data drift with Vertex AI Model Monitoring.
Enterprise Scenario 2: Medical Image Classification
A healthcare provider needs to classify X-ray images as normal or abnormal. They have a dataset of 100,000 labeled images stored in Cloud Storage. They use AutoML Vision to train a custom image classification model. The model achieves 95% accuracy on the test set. They deploy the model to a Vertex AI Endpoint with a GPU for low-latency predictions. The endpoint is integrated into the hospital's PACS system. For compliance, they need to log all predictions and retrain the model when new labeled data arrives. They use Vertex AI Feature Store to store image embeddings for reuse. A common pitfall is overfitting to the training set due to limited data; they use AutoML's built-in regularization. If the model is deployed without monitoring, data drift (e.g., new X-ray machines producing different images) could degrade performance without detection.
Enterprise Scenario 3: Real-Time Fraud Detection
A financial services company processes credit card transactions in real time and needs to detect fraudulent transactions with sub-100ms latency. They use a custom TensorFlow model trained on historical transaction data. The model is deployed to a Vertex AI Endpoint with autoscaling from 1 to 10 nodes. They use Cloud Pub/Sub to stream transactions to a Cloud Run service that calls the endpoint. The endpoint returns a fraud probability score. They monitor model performance using Cloud Monitoring and set up alerts if prediction latency exceeds 200ms. They also use Vertex AI Model Monitoring to detect concept drift (changing fraud patterns). A common misconfiguration is setting autoscaling min nodes too low during traffic spikes, causing latency spikes. They also use A/B testing with traffic splitting to test new model versions.
What GCDL Tests on Machine Learning
Objective 3.2, 'Identify Google Cloud solutions for data analytics and AI/ML', requires you to know which service to use for different ML scenarios. The exam is scenario-based: given a business requirement, choose the right Google Cloud service. Key areas:
- When to use AutoML vs. pre-trained APIs vs. custom training: AutoML for custom models without coding; pre-trained APIs for standard tasks; custom training for unique needs or higher accuracy.
- Vertex AI as the unified platform: Know that Vertex AI combines AutoML, custom training, prediction, and MLOps.
- BigQuery ML for SQL-based ML: When the data is in BigQuery and the user wants to build models using SQL.
- Pre-trained APIs: Know the specific APIs: Vision, Natural Language, Translation, Speech-to-Text, Text-to-Speech, Document AI, Video Intelligence, Dialogflow.
- ML workflow stages: Data prep, training, evaluation, deployment, monitoring.
- MLOps concepts: Pipelines, model registry, feature store, continuous training.
Common Wrong Answers and Traps
Choosing AI Platform instead of Vertex AI: AI Platform is legacy; Vertex AI is the current recommended service. The exam may include both, but Vertex AI is the correct answer for new projects.
Selecting a pre-trained API when custom model is needed: If the scenario mentions using proprietary data or domain-specific requirements, AutoML or custom training is correct, not a pre-trained API.
Thinking BigQuery ML only supports linear models: BigQuery ML supports multiple model types including deep neural networks. The exam may test this by offering 'linear regression only' as a distractor.
Confusing AutoML with pre-trained APIs: AutoML trains a model on your data; pre-trained APIs use Google's data. The exam may describe a scenario where a company has labeled data and wants a custom model — AutoML is correct.
Overlooking MLOps: The exam may ask about model monitoring or retraining. Know that Vertex AI Model Monitoring detects drift and can trigger retraining pipelines.
Specific Numbers and Terms
Vertex AI Endpoint default timeout: 60 seconds.
AutoML training time limit: up to 72 hours for some tasks.
Batch prediction output format: JSON or CSV in Cloud Storage.
BigQuery ML model types: linear regression, logistic regression, k-means, matrix factorization, time series, deep neural networks, boosted trees.
Pre-trained APIs: Vision, Natural Language, Translation, Speech-to-Text, Text-to-Speech, Document AI, Video Intelligence, Dialogflow.
Edge Cases
When data is not labeled: Use pre-trained APIs or unsupervised learning (e.g., clustering with BigQuery ML k-means).
Real-time vs. batch: Online prediction for low-latency; batch for large volumes.
Multi-cloud or hybrid: Vertex AI can be used with Anthos for deployment on-premises or other clouds.
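For the unlabeled-data case, clustering groups similar records without any target column. The sketch below is a bare-bones one-dimensional k-means loop in plain Python, meant only to show the idea behind the algorithm. It is not how BigQuery ML implements it, and the spend values are made up.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups with a basic k-means loop (k >= 2)."""
    lo, hi = min(values), max(values)
    # Deterministic start: k evenly spaced centroids between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up monthly spend per customer, with no labels attached.
monthly_spend = [12, 15, 14, 90, 95, 88]
centroids, clusters = kmeans_1d(monthly_spend, k=2)
print(centroids)
print(clusters)
```

The algorithm finds the low-spend and high-spend groups on its own; deciding what those groups mean for the business is still up to a person.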
How to Eliminate Wrong Answers
If the scenario mentions 'no coding' or 'minimal ML expertise', look for AutoML or pre-trained APIs.
If the scenario mentions 'SQL' and 'BigQuery', the answer is BigQuery ML.
If the scenario mentions 'custom algorithm' or 'specific architecture', choose custom training on Vertex AI.
If the scenario mentions 'monitoring for drift', Vertex AI Model Monitoring is key.
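The elimination rules above amount to a small keyword lookup. The function below is a study aid that encodes them literally; the function name and return strings are invented for this sketch, and real exam scenarios need careful reading, not string matching.

```python
def suggest_service(scenario):
    """Map scenario keywords to the service the elimination rules point to."""
    text = scenario.lower()
    if "sql" in text and "bigquery" in text:
        return "BigQuery ML"
    if "drift" in text:
        return "Vertex AI Model Monitoring"
    if "custom algorithm" in text or "specific architecture" in text:
        return "Custom training on Vertex AI"
    if "no coding" in text or "minimal ml expertise" in text:
        return "AutoML or pre-trained APIs"
    return "No rule matched"


examples = [
    "Analysts want to build a model with SQL on data already in BigQuery.",
    "The team has minimal ML expertise and wants to classify product photos.",
    "Researchers need a specific architecture for their model.",
    "Operations wants monitoring for drift on a deployed model.",
]
for scenario in examples:
    print(suggest_service(scenario), "<-", scenario)
```

When the lookup returns "AutoML or pre-trained APIs", the deciding question is whether the scenario mentions the company's own labeled data: if it does, AutoML fits; if a generic task is enough, a pre-trained API fits.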
Vertex AI is the unified ML platform for building, training, deploying, and managing ML models on Google Cloud.
AutoML trains custom models without code; pre-trained APIs provide ready-to-use models for common tasks.
BigQuery ML enables creating ML models using SQL on data in BigQuery, supporting multiple model types including deep neural networks.
The ML pipeline includes data ingestion, preparation, training, evaluation, deployment, and monitoring.
Vertex AI Model Monitoring detects data drift and concept drift in deployed models.
MLOps practices like pipelines, model registry, and continuous training are supported by Vertex AI.
For real-time predictions, deploy models to Vertex AI Endpoints; for batch predictions, use batch prediction jobs.
Pre-trained APIs include Vision, Natural Language, Translation, Speech-to-Text, Text-to-Speech, Document AI, Video Intelligence, and Dialogflow.
These come up on the exam all the time. Here's how to tell them apart.
AutoML
No coding required; upload data and get a model.
Automatically searches for best model architecture and hyperparameters.
Limited to supported data types (image, text, tabular, video).
Less control over model internals.
Suitable for users with limited ML expertise.
Custom Training
Requires writing training code (TensorFlow, PyTorch, etc.).
Full control over architecture, hyperparameters, and training process.
Can use any ML framework or custom algorithms.
More flexible for complex or novel models.
Requires ML expertise and more effort.
Mistake
AutoML and pre-trained APIs are the same thing.
Correct
AutoML trains a custom model on your data, while pre-trained APIs use Google's generic models trained on public datasets. AutoML requires your labeled data; pre-trained APIs work out-of-the-box without your data.
Mistake
BigQuery ML only supports linear regression.
Correct
BigQuery ML supports multiple model types including linear regression, logistic regression, k-means clustering, matrix factorization, time series, deep neural networks (via TensorFlow), and boosted trees.
Mistake
Vertex AI is just a renamed AI Platform.
Correct
Vertex AI is a unified ML platform that integrates AutoML, custom training, prediction, and MLOps capabilities that were previously separate. AI Platform is a legacy service; Vertex AI is the recommended platform going forward.
Mistake
You must use GPUs for all ML training on Google Cloud.
Correct
GPUs are optional and beneficial for certain workloads like deep learning, but many models (e.g., simple regression, small datasets) can be trained efficiently on CPUs. Vertex AI allows you to choose machine types with or without accelerators.
Mistake
Pre-trained APIs can be fine-tuned with your own data.
Correct
Most pre-trained APIs (Vision, Natural Language, etc.) do not support fine-tuning. For custom models, you must use AutoML or custom training. However, some APIs like Document AI allow custom models through AutoML.
Vertex AI is the successor to AI Platform. It unifies AutoML, custom training, prediction, and MLOps into a single platform. AI Platform is a legacy service that is still available but Vertex AI is recommended for new projects. The exam expects you to choose Vertex AI for most ML scenarios.
Use pre-trained APIs when you have a common use case (e.g., image classification, sentiment analysis) and Google's generic model meets your accuracy needs. Use AutoML when you have labeled data specific to your domain and need a custom model with higher accuracy, but you lack ML expertise to write code.
BigQuery ML models are primarily designed for batch prediction using ML.PREDICT. For real-time predictions, you should export the model to Vertex AI and deploy it to an endpoint. BigQuery ML can still cover near-real-time needs when streaming data lands in BigQuery and the prediction queries are short.
Vertex AI Model Monitoring continuously checks deployed models for data drift (changes in input distribution) and concept drift (changes in the relationship between input and output). It is important because models can degrade over time due to changes in real-world data. Monitoring triggers alerts and can initiate retraining pipelines.
BigQuery ML supports linear regression, logistic regression, k-means clustering, matrix factorization, time series (ARIMA-based, model type ARIMA_PLUS), deep neural networks (via TensorFlow), boosted trees (XGBoost), and imported TensorFlow models. It can also train AutoML models, which run on Vertex AI behind the scenes.
Deploy a trained model to a Vertex AI Endpoint. You need to upload the model artifact to Vertex AI Model Registry, create an endpoint, and deploy the model to the endpoint. You can configure machine type, autoscaling, and traffic splitting. The endpoint provides a REST API for sending prediction requests.
Online prediction provides real-time, low-latency responses to individual requests (e.g., via REST API). Batch prediction processes a large set of inputs asynchronously and writes results to Cloud Storage. Online prediction is suitable for interactive applications; batch prediction is for offline processing of large datasets.