GCDL | Chapter 47 of 101 | Objective 3.1

Data Types and Analytics Workloads

This chapter covers the types of data (structured, semi-structured, unstructured) and analytics workloads (batch, stream, interactive, machine learning) on Google Cloud. Understanding these concepts is fundamental for the GCDL exam, as approximately 10% of questions touch on data types and analytics workloads. You will learn how to match the right Google Cloud services to specific data and workload requirements, a skill tested in scenario-based questions.

25 min read
Intermediate
Updated May 31, 2026

Library Data Workflow Analogy

Imagine a large public library that receives shipments of thousands of new books every day. Each book is an individual data point. The library's workflow for processing these books mirrors how data analytics workloads handle raw data. First, the loading dock receives the books (data ingestion). Then, workers sort them by genre, language, and condition (data cleansing and transformation). Next, the books are cataloged with metadata like title, author, and ISBN, and placed on shelves in the correct section (data storage). When a patron wants to find all books on a specific topic, a librarian uses the catalog to locate them (data querying). However, if the librarian needs to understand trends—like which genres are most popular among children—they would need to analyze checkout records (structured data) and maybe even read summaries (unstructured data). This deeper analysis is like an analytics workload: it requires not just finding individual books but aggregating, filtering, and deriving insights from the entire collection. The library might use a separate research desk with specialized tools (like BigQuery) to handle these complex queries without disrupting daily operations. Just as the library must decide whether to process books immediately or batch them, analytics systems choose between stream and batch processing based on latency needs.

How It Actually Works

What Are Data Types and Analytics Workloads?

Data types classify the format and structure of data. Analytics workloads describe the processing patterns used to derive insights from data. On Google Cloud, choosing the right service for a given data type and workload is critical for cost, performance, and scalability.

Structured Data

Structured data adheres to a predefined schema with rows and columns. Each column has a data type (e.g., integer, string, date). Examples include relational database tables, CSV files, and spreadsheet exports. Structured data is highly organized, making it easy to query using SQL. In Google Cloud, structured data is typically stored in:

- Cloud SQL: For OLTP workloads (e.g., transaction processing).
- Cloud Spanner: For globally distributed, strongly consistent OLTP.
- BigQuery: For OLAP and data warehousing (columnar storage).
- Datastore/Firestore: For NoSQL document/entity storage (semi-structured but often used with structured properties).

Structured data is the most efficient for exact-match lookups and aggregations. The trade-off is low schema flexibility: every record must fit the schema, and schema changes can be costly.

Semi-Structured Data

Semi-structured data has some organizational properties but does not require a rigid schema. It often uses tags or markers to separate data elements. Common formats include JSON, Avro, Parquet, and XML. Semi-structured data is self-describing: each record can have different fields. Google Cloud services that handle semi-structured data:

- Cloud Storage: Stores files in any format.
- BigQuery: Natively supports JSON and Avro with schema autodetection.
- Dataproc: Processes semi-structured data using Hadoop/Spark.
- Pub/Sub: Messages are often JSON or Avro.

Semi-structured data offers flexibility but may require more processing to parse and query efficiently. BigQuery's JSON functions allow querying nested fields: JSON_VALUE and JSON_QUERY are the current standard functions, while JSON_EXTRACT is the older legacy form.
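To make "each record can have different fields" concrete, here is a minimal sketch in plain Python (not BigQuery). The sample records and the `extract_city` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical newline-delimited JSON records; note that the fields differ per record.
RAW_RECORDS = """\
{"id": 1, "name": "Ada", "address": {"city": "London"}}
{"id": 2, "name": "Grace", "tags": ["navy", "cobol"]}
{"id": 3}
"""


def parse_records(text: str) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def extract_city(record: dict):
    """Return the nested address.city value, or None when the path is absent."""
    return record.get("address", {}).get("city")


records = parse_records(RAW_RECORDS)
cities = [extract_city(r) for r in records]
field_sets = [sorted(r.keys()) for r in records]
print(cities)
print(field_sets)
```

The reader has to tolerate missing paths on every access, which is the extra parsing work that semi-structured data trades for its flexibility.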

Unstructured Data

Unstructured data has no predefined structure. Examples include text documents, images, audio, video, and binary files. It is the most abundant type of data, often estimated at around 80% of enterprise data. Processing unstructured data often requires machine learning or specialized tools. Google Cloud services:

- Cloud Storage: Primary store for unstructured data.
- Cloud Vision API: Extracts text and labels from images.
- Cloud Speech-to-Text: Converts audio to text.
- Document AI: Parses documents (e.g., invoices).
- Vertex AI: For custom ML models.

Unstructured data is challenging to search and analyze directly; transformations (e.g., OCR, transcription) are usually needed.

Analytics Workloads

Analytics workloads are categorized by processing style, latency, and data volume.

#### Batch Processing

Batch processing handles large volumes of data at scheduled intervals. It is ideal for non-time-sensitive tasks like daily reports, ETL jobs, and historical analysis. Google Cloud services:

- Cloud Dataproc: Managed Hadoop/Spark for batch jobs.
- Cloud Dataflow: Supports both batch and stream (unified model).
- BigQuery: Can be used for batch queries.
- Cloud Composer: Orchestrates batch workflows.

Batch jobs typically run for minutes to hours. They are cost-effective because resources can be scaled down after completion.
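As a rough illustration of the batch pattern, the sketch below aggregates a full day's accumulated records in a single pass. It is plain Python, not a Dataproc or Dataflow job, and the transaction rows and `daily_sales` function are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical point-of-sale rows that accumulated before the scheduled run.
TRANSACTIONS = [
    {"store": "S1", "amount": 20.0, "ts": "2024-03-01T09:15:00"},
    {"store": "S1", "amount": 5.5, "ts": "2024-03-01T17:40:00"},
    {"store": "S2", "amount": 12.0, "ts": "2024-03-01T11:05:00"},
    {"store": "S1", "amount": 7.0, "ts": "2024-03-02T08:00:00"},
]


def daily_sales(rows):
    """Aggregate total sales per (date, store) over the complete data set."""
    totals = defaultdict(float)
    for row in rows:
        day = datetime.fromisoformat(row["ts"]).date().isoformat()
        totals[(day, row["store"])] += row["amount"]
    return dict(totals)


summary = daily_sales(TRANSACTIONS)
for key in sorted(summary):
    print(key, summary[key])
```

The job sees all of the input before it produces any output, which is why batch results are complete but never fresher than the last run.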

#### Stream Processing

Stream processing ingests and processes data in real time as it arrives. Latency is measured in seconds or milliseconds. Use cases include fraud detection, IoT telemetry, and live dashboards. Google Cloud services:

- Cloud Dataflow: Exactly-once semantics, auto-scaling.
- Pub/Sub: Ingestion and delivery of streams.
- BigQuery: Streaming inserts for near-real-time analytics.
- Dataproc: Spark Streaming.

Stream processing must handle out-of-order data, late arrivals, and exactly-once or at-least-once guarantees.
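The effect of out-of-order and late data can be sketched with a toy tumbling-window aggregator. This is a simplified model in plain Python, not the Dataflow or Apache Beam API: the watermark is assumed to be the largest event time seen so far, and `window_sums` and its parameters are hypothetical names.

```python
from collections import defaultdict


def window_sums(events, window_size=10, allowed_lateness=0):
    """Sum values per tumbling event-time window.

    events: iterable of (event_time, value) pairs in ARRIVAL order.
    The watermark is simply the largest event time seen so far, which is
    a simplification of how real engines estimate it.
    Returns (sums_by_window_start, dropped_events).
    """
    sums = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for event_time, value in events:
        watermark = max(watermark, event_time)
        start = (event_time // window_size) * window_size
        window_end = start + window_size
        # A window closes once the watermark passes its end plus the lateness budget.
        if watermark >= window_end + allowed_lateness:
            dropped.append((event_time, value))
        else:
            sums[start] += value
    return dict(sums), dropped


# The event with time 8 arrives after the event with time 12 (out of order).
ARRIVALS = [(1, 5), (4, 5), (12, 1), (8, 5), (15, 1)]

strict, strict_dropped = window_sums(ARRIVALS, allowed_lateness=0)
lenient, lenient_dropped = window_sums(ARRIVALS, allowed_lateness=5)
print(strict, strict_dropped)
print(lenient, lenient_dropped)
```

With no lateness budget the late event is discarded and the first window undercounts; with a budget of 5 time units it is still folded into the correct window.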

#### Interactive Analytics

Interactive analytics allows users to query data with sub-second to few-second latency. It supports iterative exploration and ad-hoc analysis. Google Cloud services:

- BigQuery: On-demand petabyte-scale SQL analytics.
- Looker: Business intelligence and visualization.
- Dataplex: Data management, discovery, and governance across data lakes and warehouses (supports data mesh and lakehouse designs rather than running queries itself).

Interactive analytics relies on columnar storage, caching, and distributed query engines.
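Why columnar storage helps analytical queries can be shown with a small sketch that counts how many stored values each layout must touch to sum one column. The table and function names are hypothetical, and this models the idea only, not BigQuery's actual storage format.

```python
# Hypothetical table with three columns, stored two different ways.
ROWS = [
    {"order_id": 1, "region": "EU", "amount": 30},
    {"order_id": 2, "region": "US", "amount": 45},
    {"order_id": 3, "region": "EU", "amount": 25},
]


def to_columnar(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_row_store(rows):
    """Row layout: every full row is read even though one field is needed."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # the whole row comes off storage
        total += row["amount"]
    return total, values_read


def total_column_store(columns):
    """Column layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


COLUMNS = to_columnar(ROWS)
row_result = total_row_store(ROWS)
column_result = total_column_store(COLUMNS)
print(row_result)
print(column_result)
```

Both layouts return the same total, but the columnar one reads a third of the values here, and the gap widens as tables gain columns.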

#### Machine Learning Workloads

ML workloads involve training and deploying models. Training is compute-intensive; inference can be real-time or batch. Google Cloud services:

- Vertex AI: End-to-end ML platform, including model deployment.
- BigQuery ML: Create models using SQL.
- Cloud TPUs: Specialized hardware for training.
- AI Platform: Legacy predecessor of Vertex AI; it may still appear in older material.

ML workloads blur the line between analytics and AI.

How They Interact

A typical analytics pipeline ingests data (stream or batch), stores it (structured/unstructured), processes it (transformation, enrichment), and serves it (queries, dashboards, ML). For example, IoT sensor data (a stream of structured or semi-structured readings, such as JSON messages) is ingested via Pub/Sub, processed with Dataflow (stream), stored in BigQuery (structured), and analyzed with Looker (interactive).

Key Values and Defaults

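BigQuery: the first 1 TiB of on-demand query data processed each month is free; the maximum row size for streaming inserts is 10 MB; concurrent-query limits depend on the project's quota settings and have changed over time, so confirm them in the current quotas documentation.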

Pub/Sub: 10 MB maximum message size; typical delivery latency on the order of 100 milliseconds; unacknowledged messages retained for 7 days by default.

Dataflow: Min 1 worker, max 1000; auto-scaling interval 30 seconds.

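Dataproc: idle clusters are not deleted unless scheduled deletion is configured (for example, with the --max-idle flag); preemptible or Spot worker VMs are substantially cheaper than standard VMs (Google advertises discounts of 60-91% for Spot VMs).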

Configuration and Verification

To check data types in BigQuery:

SELECT column_name, data_type FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS` WHERE table_name = 'mytable';

To verify stream processing latency in Dataflow:

gcloud dataflow jobs list --region=us-central1 --status=active
# JOB_ID is a positional argument; look for the system lag metric in the service metrics
gcloud dataflow metrics list JOB_ID --region=us-central1 --source=service

Related Technologies

Data types and workloads are tightly coupled with storage, compute, and networking. For example, Cloud Storage classes (Standard, Nearline, Coldline, Archive) affect cost and retrieval latency for unstructured data. VPC Service Controls can restrict data access. Cloud NAT allows outbound connections from private instances.

Exam Tips

The GCDL exam tests your ability to recommend the right service for a given scenario. Focus on:

Distinguishing batch vs. stream: look for words like "real-time," "immediate," "continuous" vs. "daily," "scheduled," "overnight."

Structured vs. unstructured: if the data has a schema, it's structured; if it's images/video, it's unstructured.

Semi-structured: JSON, Avro, Parquet are typical.

Interactive vs. batch: interactive means user-driven, ad-hoc queries that return in seconds or less; batch means scheduled, large-scale jobs.

Walk-Through

1. Ingest Data into Pub/Sub

A producer application publishes messages to a Pub/Sub topic. Each message can be up to 10 MB. By default, a subscription retains unacknowledged messages for 7 days; topic-level message retention can be configured for up to 31 days. Subscribers acknowledge the messages they have processed, and unacknowledged messages are redelivered. This step decouples data production from consumption.
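The acknowledge-or-redeliver behavior described in this step can be modeled with a toy in-memory queue. This sketch is not the Pub/Sub client library; `ToySubscription` and its methods are hypothetical names used only to show at-least-once delivery.

```python
from collections import deque


class ToySubscription:
    """In-memory stand-in for a subscription; NOT the Pub/Sub client library."""

    def __init__(self):
        self._pending = deque()   # messages waiting to be delivered
        self._outstanding = {}    # delivered but not yet acknowledged

    def publish(self, msg_id, data):
        self._pending.append((msg_id, data))

    def pull(self):
        """Deliver the next message and remember it until it is acked."""
        if not self._pending:
            return None
        msg_id, data = self._pending.popleft()
        self._outstanding[msg_id] = data
        return msg_id, data

    def ack(self, msg_id):
        self._outstanding.pop(msg_id, None)

    def expire_ack_deadlines(self):
        """Requeue everything unacknowledged, as if the ack deadline passed."""
        for msg_id, data in self._outstanding.items():
            self._pending.append((msg_id, data))
        self._outstanding.clear()


sub = ToySubscription()
sub.publish("m1", "reading=21.5")
sub.publish("m2", "reading=22.0")

first = sub.pull()
second = sub.pull()
sub.ack(first[0])            # m1 was processed successfully
sub.expire_ack_deadlines()   # m2 was never acked, so it is requeued
redelivered = sub.pull()
print(redelivered)
```

Because a message can be delivered more than once, downstream processing should be idempotent or deduplicate by message ID.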

2. Stream Process with Dataflow

A Dataflow pipeline reads from the Pub/Sub subscription. Dataflow processes messages continuously with exactly-once processing semantics rather than in fixed micro-batches (micro-batching is the Spark Streaming model), and groups them into windows defined by the pipeline. It applies transformations (e.g., parsing JSON, aggregating counts) and writes results to BigQuery. Auto-scaling adjusts workers based on backlog.

3. Store in BigQuery Tables

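Dataflow writes streaming inserts into BigQuery. Streamed rows are typically available for querying within a few seconds; they can remain in the streaming buffer for up to 90 minutes before they become available for copy and export operations. The Storage Write API is the recommended, higher-throughput alternative to legacy streaming inserts. Data is stored in a columnar format (Capacitor) optimized for analytical queries.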

4. Query with Interactive SQL

A data analyst runs a SQL query from Looker or the BigQuery console. BigQuery uses a distributed query engine (Dremel) to scan only relevant columns. Results typically return in seconds for terabytes of data. Queries are charged per byte processed (on-demand) or by reserved slot capacity (BigQuery editions, which replaced the older flat-rate pricing).
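The link between column pruning and on-demand cost can be worked through with simple arithmetic. In the sketch below the column sizes and the price per TiB are invented for illustration; they are not current BigQuery prices, so check the pricing page for real figures.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def on_demand_cost(bytes_processed, price_per_tib):
    """Cost of one query when billed per byte processed."""
    return bytes_processed / TIB * price_per_tib


# Hypothetical table: size of each column in bytes.
COLUMN_BYTES = {
    "order_id": 1 * TIB,
    "customer_note": 6 * TIB,
    "amount": 1 * TIB,
}
ASSUMED_PRICE_PER_TIB = 5.0  # illustrative only; not a quoted price


def bytes_scanned(selected_columns):
    """A columnar engine reads only the columns a query references."""
    return sum(COLUMN_BYTES[c] for c in selected_columns)


narrow = on_demand_cost(bytes_scanned(["amount"]), ASSUMED_PRICE_PER_TIB)
wide = on_demand_cost(bytes_scanned(COLUMN_BYTES), ASSUMED_PRICE_PER_TIB)
print(narrow, wide)
```

Under these assumptions, selecting only `amount` costs one eighth of a query that reads every column, which is why avoiding `SELECT *` is a common cost recommendation.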

5. Visualize with Looker Dashboard

Looker connects to BigQuery and runs queries to populate dashboards. It uses a semantic model (LookML) to define business logic. Dashboards refresh periodically or on demand, so this step gives decision-makers near-real-time insights.

What This Looks Like on the Job

Enterprise Scenario 1: Retail Sales Analytics

A large retailer ingests point-of-sale transactions from thousands of stores. Each transaction is a JSON message with store ID, product SKU, quantity, price, and timestamp. The data is semi-structured. The retailer uses Pub/Sub to ingest these messages in real time. A Dataflow pipeline cleans the data (removing malformed records), enriches it with product catalog data from Cloud SQL, and writes to BigQuery. Every night, a batch job runs to compute daily sales summaries. The interactive dashboard in Looker allows regional managers to see sales by hour. If autoscaling is misconfigured, Dataflow workers may not scale fast enough during Black Friday, causing backlog and delayed insights. Setting an adequate maximum worker count and enabling Streaming Engine helps avoid this.

Enterprise Scenario 2: Healthcare Claims Processing

A health insurance company receives claims as PDF documents (unstructured). They use Document AI to extract structured fields (member ID, date, amount). The extracted data (structured) is stored in BigQuery. A batch Dataflow job runs nightly to validate claims against member eligibility (stored in Cloud Spanner). Fraud detection models in Vertex AI score claims in real-time using a stream processing pipeline. The company also archives raw PDFs in Cloud Storage with lifecycle policies to move to Nearline after 90 days. Common issues include Document AI misclassification due to poor OCR, leading to incorrect data. Tuning Document AI processors and validating output with a secondary check (e.g., regex) mitigates this.
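The "secondary check (e.g., regex)" mentioned above might look like the following sketch. The field formats (two letters plus six digits for the member ID, ISO dates, two-decimal amounts) are assumptions made up for this example, and `validate_claim` is a hypothetical helper, not part of Document AI.

```python
import re
from datetime import datetime

# Hypothetical formats; a real insurer would define its own rules.
MEMBER_ID_RE = re.compile(r"^[A-Z]{2}\d{6}$")
AMOUNT_RE = re.compile(r"^\d+\.\d{2}$")


def validate_claim(fields: dict) -> list:
    """Return the names of extracted fields that fail a sanity check."""
    problems = []
    if not MEMBER_ID_RE.match(fields.get("member_id", "")):
        problems.append("member_id")
    try:
        datetime.strptime(fields.get("date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("date")
    if not AMOUNT_RE.match(fields.get("amount", "")):
        problems.append("amount")
    return problems


good = {"member_id": "AB123456", "date": "2024-02-29", "amount": "120.50"}
# Letter 'O' misread for digit '0' and an impossible date, as poor OCR might produce.
bad = {"member_id": "AB12345O", "date": "2024-02-30", "amount": "120.50"}
good_problems = validate_claim(good)
bad_problems = validate_claim(bad)
print(good_problems)
print(bad_problems)
```

Records with a non-empty problem list can be routed to manual review instead of being loaded straight into the warehouse.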

Enterprise Scenario 3: IoT Sensor Monitoring

A manufacturing plant has thousands of sensors emitting temperature, pressure, and vibration readings every second. This is a high-velocity stream of structured data. They use Pub/Sub for ingestion, Dataflow for stream processing (filtering out-of-range values, computing rolling averages), and BigQuery for storage. A real-time dashboard in Looker alerts operators when thresholds are exceeded. For historical analysis, they run batch queries on BigQuery. Scaling issues arise when sensor data spikes; using Pub/Sub flow control and Dataflow autoscaling helps. Misconfigured windows in Dataflow can cause incorrect aggregates. Allowed lateness is 0 by default, so late readings are dropped unless you raise it; using event-time processing with an appropriate allowed lateness keeps aggregates accurate.
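The two transformations named in this scenario, filtering out-of-range values and computing rolling averages, can be sketched in a few lines of plain Python. The range limits and window length below are illustrative assumptions, and `rolling_average` is a hypothetical function rather than a Dataflow transform.

```python
from collections import deque


def rolling_average(readings, window=3, low=-40.0, high=150.0):
    """Yield the mean of the last `window` in-range readings after each valid one.

    The range limits and window length are illustrative defaults.
    """
    recent = deque(maxlen=window)
    for value in readings:
        if not (low <= value <= high):
            continue  # discard out-of-range sensor values
        recent.append(value)
        yield sum(recent) / len(recent)


TEMPS = [20.0, 22.0, 999.0, 24.0, 26.0]
averages = list(rolling_average(TEMPS))
print(averages)
```

The spurious 999.0 reading never enters the window, so it cannot distort the averages that drive operator alerts.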

How GCDL Actually Tests This

GCDL Objective Codes

This chapter covers objectives under Domain 3: Data Analytics and AI, specifically objective 3.1 (Identify the different types of data and analytics workloads). The exam tests your ability to:

Differentiate between structured, semi-structured, and unstructured data.

Identify batch vs. stream processing scenarios.

Recommend appropriate Google Cloud services for each workload.

Common Wrong Answers and Why Candidates Choose Them

1. Choosing Cloud Storage for all unstructured data: While Cloud Storage is a great fit, candidates often forget that processing unstructured data (e.g., images) requires additional services like Vision API or Vertex AI. The exam may ask for the *complete* solution.

2. Selecting Cloud SQL for analytics workloads: Cloud SQL is for OLTP, not OLAP. Candidates see "SQL" and think analytics. The correct answer is BigQuery for large-scale analytics.

3. Confusing Pub/Sub and Dataflow: Pub/Sub is the ingestion/messaging layer; Dataflow is the processing layer. A common question: "Which service processes streaming data?" Answer: Dataflow (or Dataproc for Spark Streaming).

4. Assuming all real-time processing needs stream: Some scenarios can be handled with near-real-time batch (e.g., micro-batches every few seconds). The exam uses words like "immediate" vs. "within minutes."

Specific Numbers and Terms

BigQuery streaming buffer: streamed rows are queryable within seconds, but can take up to 90 minutes to become available for copy and export operations.

Pub/Sub message retention: subscriptions retain unacknowledged messages for 7 days by default; topic-level retention can be set up to 31 days.

Dataflow exactly-once processing: relies on checkpointing and deduplication of records.

Structured data: schema-on-write; semi-structured: schema-on-read.

Edge Cases

Late-arriving data: In stream processing, Dataflow handles late data via allowed lateness (default 0). The exam may ask about handling out-of-order events.

Schema changes: BigQuery supports schema autodetection for CSV/JSON but can fail if types mismatch. Use --autodetect flag carefully.

Cost optimization: For batch workloads, using preemptible (Spot) VMs on Dataproc reduces cost. For interactive queries, use BigQuery capacity-based pricing (slot reservations under BigQuery editions, formerly flat-rate) for predictable costs.

How to Eliminate Wrong Answers

If the scenario mentions "real-time" or "continuous," eliminate batch-only services (e.g., Cloud Dataproc without streaming).

If data is described as "images" or "videos," eliminate structured-only services (e.g., Cloud SQL).

If latency is sub-second, eliminate batch processing.

Use the process of elimination: identify data type first, then workload pattern, then match to service.
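The elimination order above (data type first, then workload pattern) can be expressed as a crude keyword heuristic. The keyword lists are drawn from the signal words in this chapter, the `classify` function is hypothetical, and real exam scenarios need more judgment than string matching.

```python
STREAM_WORDS = ("real-time", "immediate", "continuous")
BATCH_WORDS = ("daily", "scheduled", "overnight")
UNSTRUCTURED_WORDS = ("image", "video", "audio", "pdf")
SEMI_STRUCTURED_WORDS = ("json", "avro", "parquet", "xml")


def classify(scenario: str) -> dict:
    """Guess data type and workload from keywords in a scenario description."""
    text = scenario.lower()
    # Step 1: identify the data type.
    if any(word in text for word in UNSTRUCTURED_WORDS):
        data_type = "unstructured"
    elif any(word in text for word in SEMI_STRUCTURED_WORDS):
        data_type = "semi-structured"
    else:
        data_type = "structured"
    # Step 2: identify the workload pattern.
    if any(word in text for word in STREAM_WORDS):
        workload = "stream"
    elif any(word in text for word in BATCH_WORDS):
        workload = "batch"
    else:
        workload = "interactive"
    return {"data_type": data_type, "workload": workload}


first_guess = classify("JSON click events must drive a real-time dashboard")
second_guess = classify("Scanned images are processed in an overnight job")
print(first_guess)
print(second_guess)
```

Once both labels are known, the remaining step is matching them to a service, for example stream plus semi-structured pointing toward Pub/Sub and Dataflow.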

Key Takeaways

Structured data has a fixed schema; semi-structured (JSON, Avro) is self-describing; unstructured (images, video) has no schema.

Batch processing is scheduled and handles large volumes; stream processing is continuous with low latency.

BigQuery is the primary service for interactive analytics on structured/semi-structured data.

Pub/Sub decouples data producers and consumers; Dataflow provides unified batch and stream processing.

Unstructured data requires ML APIs or custom models for analysis.

The GCDL exam tests matching data type and workload to the correct Google Cloud service.

Common traps: confusing Cloud SQL (OLTP) with BigQuery (OLAP), and Pub/Sub (ingestion) with Dataflow (processing).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Batch Processing

Processes data in scheduled intervals (e.g., hourly, daily).

High latency (minutes to hours).

Cost-effective with preemptible VMs.

Best for historical analysis and ETL.

Google Cloud services: Dataproc, BigQuery (batch), Dataflow (batch mode).

Stream Processing

Processes data as it arrives (real-time).

Low latency (seconds to milliseconds).

Higher cost due to always-on resources.

Best for real-time dashboards, alerts, and fraud detection.

Google Cloud services: Dataflow (streaming), Pub/Sub, BigQuery (streaming inserts).

Watch Out for These

Mistake

BigQuery only supports structured data.

Correct

BigQuery supports semi-structured data like JSON and Avro with schema autodetection, and it can reference unstructured files in Cloud Storage through object tables. However, it is optimized for structured and semi-structured data.

Mistake

Stream processing always uses Pub/Sub.

Correct

Pub/Sub is a common ingestion service, but stream processing can also use Kafka (via Dataproc) or Cloud Storage (event-triggered Cloud Functions). Dataflow can read from multiple sources.

Mistake

Batch processing cannot handle large data.

Correct

Batch processing is specifically designed for large volumes (petabytes). Services like Dataproc and BigQuery handle massive datasets efficiently.

Mistake

Unstructured data cannot be analyzed.

Correct

Unstructured data can be analyzed using ML APIs (e.g., Vision, Natural Language) or custom models. It can also be transformed into structured data for analysis.

Mistake

Cloud SQL is suitable for data warehousing.

Correct

Cloud SQL is for OLTP (row-based storage, low latency writes). Data warehousing requires columnar storage and high-throughput queries, which BigQuery provides.

Frequently Asked Questions

What is the difference between structured and semi-structured data?

Structured data has a rigid schema (e.g., relational tables with rows and columns). Semi-structured data has tags or markers but no fixed schema; each record can have different fields. Example: a CSV file is structured; a JSON file is semi-structured because fields can vary per object.

When should I use batch processing vs. stream processing?

Use batch processing when data is not time-sensitive and can be processed in large chunks at scheduled intervals (e.g., nightly reports). Use stream processing when you need immediate insights, such as fraud detection or live dashboards. The key factor is latency tolerance.

Can BigQuery handle streaming data?

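Yes, BigQuery supports streaming inserts via the tabledata.insertAll API or the Storage Write API. Streamed rows are typically available for querying within seconds, although they can stay in the streaming buffer for up to 90 minutes before they are available for copy and export operations. When events need to be transformed or aggregated before they land, combine Pub/Sub and Dataflow.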

What is the best service for processing unstructured data?

Cloud Storage stores unstructured data. To analyze it, use specialized APIs: Cloud Vision for images, Cloud Speech-to-Text for audio, Document AI for documents, or Vertex AI for custom models. Dataproc can also process unstructured data using Spark.

What does 'schema-on-read' mean?

Schema-on-read means the schema is applied when data is read, not when it is written. This is common with semi-structured data like JSON in BigQuery. It offers flexibility because the schema can change without rewriting data.
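The contrast with schema-on-write can be sketched as follows. This is plain Python rather than BigQuery; the two-field schema and both helper functions are hypothetical.

```python
import json

SCHEMA = {"id": int, "name": str}  # hypothetical fixed schema


def write_with_schema(table: list, row: dict) -> None:
    """Schema-on-write: reject rows that do not match before storing them."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(row)}")
    for field, expected in SCHEMA.items():
        if not isinstance(row[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(row)


def read_with_schema(raw_lines: list, fields: list) -> list:
    """Schema-on-read: keep raw text and pick fields only when querying."""
    result = []
    for line in raw_lines:
        record = json.loads(line)
        result.append({f: record.get(f) for f in fields})
    return result


table = []
write_with_schema(table, {"id": 1, "name": "Ada"})
try:
    write_with_schema(table, {"id": 2, "email": "g@example.com"})
    rejected = False
except ValueError:
    rejected = True

raw = ['{"id": 1, "name": "Ada"}', '{"id": 2, "email": "g@example.com"}']
projected = read_with_schema(raw, ["id", "email"])
print(rejected, projected)
```

The write path refuses the record with an unexpected field, while the read path stores everything and simply returns `None` where a requested field is missing.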

How does Dataflow handle late-arriving data?

Dataflow uses event-time processing with watermarks and allowed lateness. By default, allowed lateness is 0 seconds. You can configure it to handle late data, which will be included in windowed aggregations if within the allowed lateness period.

What is the difference between Cloud SQL and BigQuery?

Cloud SQL is a relational database for OLTP (transactional workloads) with row-based storage. BigQuery is a data warehouse for OLAP (analytics) with columnar storage. Use Cloud SQL for operational apps, BigQuery for large-scale queries and reporting.
