GCDL, Chapter 95 of 101, Objective 3.1

Google Cloud Data Catalog

This chapter covers Google Cloud Data Catalog, a fully managed metadata management service that enables data discovery, governance, and lineage tracking across your cloud data landscape. For the GCDL exam, Data Catalog appears in the Data Analytics AI domain (Objective 3.1) and typically accounts for 5-8% of exam questions. Understanding Data Catalog's architecture, key features, and integration points is essential for any cloud professional tasked with managing data assets at scale.

25 min read
Intermediate
Updated May 31, 2026

Library Card Catalog for Cloud Data

Imagine a massive university library with millions of books stored across multiple buildings, floors, and special collections. Without a central card catalog, a student would have to physically wander through every shelf to find a book on quantum computing. The card catalog is a metadata repository: it doesn't hold the books themselves, but it records each book's title, author, subject, location (building, floor, shelf number), and a unique call number. When a student searches for 'quantum computing' in the catalog, they instantly see a list of all relevant books with their exact locations. The catalog also tracks which books are checked out, when they are due back, and which rare books require special permission.

Now, if the library acquires a new book, a librarian must create a new catalog entry with all metadata before the book is placed on a shelf. Similarly, if a book is moved to a different shelf, the catalog entry must be updated. Without this discipline, the catalog becomes inaccurate, and students waste time searching for books that aren't where the catalog says they are.

Google Cloud Data Catalog works exactly like this: it's a fully managed metadata management service that helps organizations discover, understand, and manage their data assets across Google Cloud and beyond. It automatically crawls and catalogs metadata from BigQuery, Cloud Storage, Pub/Sub, and other sources, and it allows users to enrich that metadata with business context, tags, and classifications. Just as a library catalog enables efficient book discovery, Data Catalog enables efficient data discovery and governance.

How It Actually Works

What is Google Cloud Data Catalog?

Google Cloud Data Catalog is a fully managed, scalable metadata management service that helps organizations discover, manage, and govern their data assets. It provides a unified view of metadata across Google Cloud services (BigQuery, Cloud Storage, Pub/Sub, Dataflow, Dataproc, etc.) and external systems via integrations. Data Catalog is not a data storage system; it stores only metadata — information about your data — not the data itself. The service automatically crawls and catalogs technical metadata (schemas, data types, partitioning, clustering) and allows users to add business metadata (descriptions, tags, classifications, ownership).

Why Data Catalog Exists

In large organizations, data is scattered across hundreds of datasets, tables, files, and streams. Without a central metadata repository, data analysts and scientists can spend, by commonly quoted estimates, up to 80% of their time just finding and understanding data. Data Catalog addresses this by providing:

Data Discovery: Search and browse all data assets across the organization.

Data Understanding: View schema, lineage, and business context for each asset.

Data Governance: Apply policies, classifications, and access controls based on metadata.

Data Lineage: Track how data flows from source to destination (in preview during the exam timeframe).

How Data Catalog Works Internally

Data Catalog operates as a metadata service that stores and indexes metadata in a scalable, searchable store. The key components are:

1. Metadata Sources: Data Catalog connects to various Google Cloud services and external systems via built-in connectors. For BigQuery, it automatically crawls all datasets, tables, views, and routines. For Cloud Storage, it crawls buckets and objects. For Pub/Sub, it crawls topics and subscriptions. Each crawl extracts technical metadata such as schema, partitioning, clustering, and timestamps.

2. Metadata Store: All crawled metadata is stored in a centralized, highly available metadata store. This store is partitioned and indexed for fast search. The metadata includes both technical metadata (automatically extracted) and business metadata (manually added via tags and descriptions).

3. Search and Discovery: Data Catalog provides a search interface (console, API, and SDK) that uses full-text search and faceted filtering. Users can search by asset name, description, tags, columns, and data types. The search index is built from the metadata store and updated in near real-time as metadata changes.

4. Tag Templates and Tags: Tags are the primary mechanism for adding business metadata. A tag template defines a set of fields (e.g., PII, sensitivity, data steward). Users create tags based on these templates and attach them to data assets (datasets, tables, columns, files). Tags are inheritable: a tag on a dataset can be inherited by its tables unless overridden.

5. Policy Tags (with Data Catalog): Data Catalog integrates with BigQuery column-level security via policy tags. You create policy tags in a taxonomy, attach them to BigQuery columns, and then grant the Fine-Grained Reader role (`roles/datacatalog.categoryFineGrainedReader`) on a policy tag to the principals who may read the tagged columns. Policy tags are distinct from template-based tags: they carry no field values and exist to drive access control, which BigQuery enforces at query time.
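The division of labor in item 5 can be sketched in a few lines of Python. This is a toy model, not the BigQuery or Data Catalog API: the `PolicyTag` class, the `can_read_column` function, and the sample principals are hypothetical names chosen for illustration. It shows only the idea that the policy tag is metadata and that the read decision depends on who has been granted access on that tag.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PolicyTag:
    """Toy stand-in for a policy tag: a label plus the principals allowed to read."""
    name: str
    readers: set[str] = field(default_factory=set)


def can_read_column(principal: str, column_tags: dict[str, PolicyTag], column: str) -> bool:
    """Return True if the column is untagged or the principal is a reader of its tag."""
    tag = column_tags.get(column)
    if tag is None:
        return True  # no policy tag attached: column-level security does not apply
    return principal in tag.readers


# Hypothetical table with one sensitive column.
pii = PolicyTag(name="pii_high", readers={"analyst-a@example.com"})
column_tags = {"ssn": pii}

print(can_read_column("analyst-a@example.com", column_tags, "ssn"))   # True
print(can_read_column("analyst-b@example.com", column_tags, "ssn"))   # False
print(can_read_column("analyst-b@example.com", column_tags, "city"))  # True
```

In the real service the check is performed by BigQuery at query time; the sketch only mirrors the shape of the decision.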

Key Features and Defaults

Automatic Crawling: For BigQuery, Data Catalog automatically crawls all datasets and tables within a project. No manual configuration required. For Cloud Storage, you must enable the Data Catalog API and grant appropriate permissions. The crawl interval is typically within minutes of asset creation.

Search Indexing: Metadata is indexed for search within seconds of being written to the metadata store.

Tag Inheritance: Tags attached to a dataset are inherited by all tables in that dataset. You can override inheritance at the table or column level.

Tag Templates: You can create up to 1000 tag templates per project (soft limit, can be increased). Each template can have up to 100 fields.

Policy Tags: Policy tags are a type of tag that controls access to BigQuery columns. They are created in Data Catalog and used in BigQuery's column-level security.

Lineage: Data lineage (preview) shows how data moves through pipelines. It requires the Data Lineage API to be enabled and uses events from services like Dataflow, Dataproc, and BigQuery.

Configuration and Verification Commands

To use Data Catalog, you must enable the Data Catalog API in your project:

gcloud services enable datacatalog.googleapis.com

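To inspect a tag template (the tag-templates command group has no list subcommand, so you describe a template by ID or find templates through catalog search):

gcloud data-catalog tag-templates describe TEMPLATE_ID --location=LOCATION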

To create a tag template:

gcloud data-catalog tag-templates create TEMPLATE_ID \
    --location=LOCATION \
    --display-name="Display Name" \
    --field=id=field_id,type=string,display-name="Field Display",required=TRUE \
    --field=id=field_id2,type='enum(Value1|Value2)',display-name="Field Display 2"

To attach a tag to a BigQuery table:

gcloud data-catalog tags create \
    --entry=ENTRY_NAME \
    --tag-template=TEMPLATE_ID \
    --tag-file=tag.json
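The `--tag-file` flag points at a file holding the tag's field values. A minimal sketch of producing such a file follows, assuming the file is a flat JSON object that maps tag template field IDs to values. The IDs `field_id` and `field_id2` are the placeholder IDs from the template command above, and the values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: one JSON object mapping template field IDs to values.
tag_fields = {
    "field_id": "owned by the data platform team",  # string field
    "field_id2": "Value1",                          # must be one of the enum values
}

# Written to a scratch directory here; point this wherever you run gcloud.
out_dir = Path(tempfile.mkdtemp())
path = out_dir / "tag.json"
path.write_text(json.dumps(tag_fields, indent=2) + "\n", encoding="utf-8")

# Read the file back to confirm it is valid JSON before handing it to gcloud.
loaded = json.loads(path.read_text(encoding="utf-8"))
print(path)
print(loaded)
```

Generating the file from a dictionary keeps the field IDs in one place, which helps when the same tag is applied to many entries.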

To search for assets:

gcloud data-catalog search "sensitive" --include-project-ids=PROJECT_ID
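Search strings can combine a free-text keyword with qualifiers. The helper below is a hypothetical sketch, assuming the qualifier syntax described in the Data Catalog search documentation (for example `name:`, `column:`, `type=`, `system=`). `build_query` is not part of any Google library; it only assembles the string you would pass to the search command.

```python
def build_query(keyword: str = "", **qualifiers: str) -> str:
    """Join a free-text keyword with qualifier predicates into one search string.

    'type' and 'system' are written with '=' and all other qualifiers with ':'.
    Terms are separated by spaces, which search treats as a logical AND.
    """
    exact = {"type", "system"}
    parts = [keyword] if keyword else []
    for key, value in qualifiers.items():
        op = "=" if key in exact else ":"
        parts.append(f"{key}{op}{value}")
    return " ".join(parts)


print(build_query("sensitive", type="table", system="bigquery"))
# sensitive type=table system=bigquery
print(build_query(column="email"))
# column:email
```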

Integration with Related Technologies

BigQuery: Data Catalog automatically catalogs all BigQuery datasets, tables, views, and routines. It also integrates with BigQuery's column-level security via policy tags.

Cloud Storage: Data Catalog can crawl Cloud Storage buckets and objects. You must grant the Storage Object Viewer role to the Data Catalog service account.

Pub/Sub: Data Catalog crawls Pub/Sub topics and schemas.

Dataflow and Dataproc: Data Catalog can capture lineage information from Dataflow and Dataproc pipelines (in preview).

IAM: Access to Data Catalog resources (entries, tag templates, tags) is controlled by IAM roles like datacatalog.admin, datacatalog.tagEditor, datacatalog.viewer, datacatalog.tagTemplateCreator, and datacatalog.tagTemplateViewer.

Cloud DLP: Data Catalog can be used with Cloud Data Loss Prevention (DLP) to automatically classify sensitive data. DLP inspection results can be written as tags to Data Catalog entries.

Metadata Model

Data Catalog uses an entry-based metadata model:

Entry Group: A container for entries. For example, entries for BigQuery resources live in the system-managed @bigquery entry group, while custom entries go into entry groups you create.

Entry: A single data asset, such as a BigQuery table, a Cloud Storage bucket, or a Pub/Sub topic. Each entry has a unique name (fully qualified resource name) and contains technical metadata.

Tag: A piece of business metadata attached to an entry or a column. Tags are instances of tag templates.

Tag Template: A schema that defines the fields of a tag.
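The relationships in this model can be sketched with plain Python dataclasses. This is an illustrative data structure only, not the client library's types: the class and attribute names below are hypothetical, and real entries carry far more technical metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TagTemplate:
    template_id: str
    field_ids: list[str]


@dataclass
class Tag:
    template: TagTemplate
    values: dict[str, object]
    column: Optional[str] = None  # None means the tag applies to the whole entry


@dataclass
class Entry:
    name: str
    tags: list[Tag] = field(default_factory=list)


@dataclass
class EntryGroup:
    group_id: str
    entries: list[Entry] = field(default_factory=list)


template = TagTemplate("sensitivity", ["classification", "data_steward"])
orders = Entry(name="orders")
orders.tags.append(Tag(template, {"classification": "Internal"}))
orders.tags.append(Tag(template, {"classification": "Confidential"}, column="card_number"))
group = EntryGroup("sales", [orders])

for tag in orders.tags:
    scope = tag.column or "<whole entry>"
    print(scope, tag.values["classification"])
```

The point to notice is that one template backs many tags, and that a tag's scope is either the whole entry or a single column.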

Performance and Scale

Data Catalog is designed to handle millions of entries. There are quotas and limits:

Maximum entry size: 1 MB

Maximum number of tags per entry: 1000

Maximum number of tag templates per project: 1000 (default)

Maximum number of entries per entry group: 10,000 (default)

API requests: 6000 requests per minute per project (default)
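Quota figures are easier to reason about with a quick calculation. The sketch below uses the default request quota quoted above to estimate a lower bound on how long a bulk job would take, assuming one API request per operation and no retries. The function name `min_minutes` is hypothetical.

```python
import math


def min_minutes(total_requests: int, per_minute_quota: int = 6000) -> int:
    """Lower bound, in whole minutes, to issue total_requests under a per-minute quota."""
    return math.ceil(total_requests / per_minute_quota)


# Tagging 50,000 tables at one request each needs at least 9 minutes of quota.
print(min_minutes(50_000))  # 9
print(min_minutes(6_000))   # 1
```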

Security and Access Control

Data Catalog uses IAM for access control. Key roles:

roles/datacatalog.admin: Full control over all Data Catalog resources.

roles/datacatalog.tagEditor: Can attach and modify tags on entries but cannot manage IAM.

roles/datacatalog.viewer: Read-only access to entries and tags.

roles/datacatalog.tagTemplateCreator: Can create tag templates.

roles/datacatalog.tagTemplateViewer: Read-only access to tag templates.

Permissions are evaluated at the resource level. For example, to attach a tag to a BigQuery table, you need bigquery.tables.updateTag on the table and datacatalog.tagTemplates.use on the tag template.

Common Use Cases

1. Data Discovery: Analysts search for 'customer' and find all tables with that term in their name or description, across all projects.

2. Data Governance: A data steward tags all tables containing PII with a 'PII' tag. This tag can then be used to enforce access policies via BigQuery column-level security.

3. Data Lineage: A data engineer traces the lineage of a report table back to its source tables and transformation pipelines to understand data quality issues.

4. Automated Classification: Cloud DLP scans Cloud Storage files and BigQuery tables, and its findings are automatically written as tags to Data Catalog, marking sensitive data.
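The automated classification use case boils down to turning inspection findings into tag field values. Here is a minimal sketch of that mapping, assuming findings arrive as (column, infoType, likelihood) tuples. The function name, the tag field IDs `has_pii` and `pii_types`, and the sample findings are hypothetical; the likelihood levels follow Cloud DLP's naming.

```python
LIKELIHOOD_ORDER = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]


def findings_to_tag_fields(findings, threshold="LIKELY"):
    """Collapse (column, info_type, likelihood) findings into per-column tag fields.

    Findings below the likelihood threshold are ignored.
    """
    cutoff = LIKELIHOOD_ORDER.index(threshold)
    result = {}
    for column, info_type, likelihood in findings:
        if LIKELIHOOD_ORDER.index(likelihood) < cutoff:
            continue
        fields = result.setdefault(column, {"has_pii": True, "pii_types": []})
        if info_type not in fields["pii_types"]:
            fields["pii_types"].append(info_type)
    return result


sample_findings = [
    ("email", "EMAIL_ADDRESS", "VERY_LIKELY"),
    ("email", "EMAIL_ADDRESS", "LIKELY"),
    ("notes", "PHONE_NUMBER", "POSSIBLE"),
    ("ssn", "US_SOCIAL_SECURITY_NUMBER", "LIKELY"),
]
tag_fields_by_column = findings_to_tag_fields(sample_findings)
print(tag_fields_by_column)
```

The `notes` column is left untagged because its only finding falls below the threshold, which is the kind of tuning decision a governance team would make deliberately.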

Exam Relevance

For the GCDL exam, focus on:

The purpose of Data Catalog: centralized metadata management for data discovery and governance.

How tags and tag templates work for adding business context.

Integration with BigQuery column-level security via policy tags.

Automatic crawling of BigQuery and Cloud Storage.

The difference between technical metadata (automatically extracted) and business metadata (user-defined).

The role of Data Catalog in a broader data governance strategy.

Common exam scenarios include: choosing the right service for metadata management, understanding how to tag sensitive data, and identifying the benefits of Data Catalog for data discovery.

Walk-Through

1. Enable Data Catalog API

Before using Data Catalog, you must enable the Data Catalog API in your Google Cloud project. This is done via the Cloud Console, gcloud command, or through Terraform. The command is `gcloud services enable datacatalog.googleapis.com`. Once enabled, the service can start crawling metadata from supported sources. Without this step, attempts to search or tag assets will fail with a permission error. The API must be enabled in each project where you want to use Data Catalog. Enabling the API does not automatically start crawling; you must also grant appropriate permissions to the Data Catalog service account.

2. Configure IAM Permissions

After enabling the API, you must grant IAM roles to users and service accounts. For example, to allow a user to search and view metadata, assign `roles/datacatalog.viewer`. To allow attaching and editing tags, assign `roles/datacatalog.tagEditor`; to allow creating tag templates, assign `roles/datacatalog.tagTemplateCreator`. For BigQuery crawling, the Data Catalog service account needs `roles/bigquery.metadataViewer` on the project or dataset level. For Cloud Storage crawling, the service account needs `roles/storage.objectViewer`. Without proper IAM, users cannot search, tag, or even see assets. The Data Catalog service account is automatically created when you enable the API; its email is in the format `service-<project-number>@gcp-sa-datacatalog.iam.gserviceaccount.com`.

3. Automatic Crawling Begins

Once the API is enabled and permissions are set, Data Catalog automatically crawls metadata from supported sources. For BigQuery, all existing datasets and tables are scanned within minutes. New assets are crawled shortly after creation (typically within 5 minutes). For Cloud Storage, you must explicitly enable crawling by granting the Storage Object Viewer role to the Data Catalog service account. Crawling extracts technical metadata such as schema, data types, partitioning, clustering, and creation timestamps. This metadata is stored in the Data Catalog metadata store and indexed for search. No manual intervention is required for BigQuery; it happens automatically.

4. Create Tag Templates

To add business context, you first create tag templates. A tag template defines a reusable schema with fields. For example, a 'Sensitivity' template might have fields: 'classification' (enum: Public, Internal, Confidential, Restricted) and 'data_steward' (string). You can create templates via the Console, gcloud, or API. The gcloud command is `gcloud data-catalog tag-templates create`. Templates can have up to 100 fields, each with a type (string, double, bool, timestamp, enum, etc.). Fields can be required or optional. Once created, templates can be used to create tags on any entry. Templates are regional resources; they exist in a specific location (e.g., us-central1).
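A template acts as a schema that tag values are checked against. The sketch below models the 'Sensitivity' template from this step as a plain dictionary and validates candidate tag values against it. It is a local illustration, not the service's validation logic: the dictionary layout, the `validate_tag` function, and the choice of which field is required are assumptions.

```python
SENSITIVITY_TEMPLATE = {
    "classification": {
        "type": "enum",
        "values": {"Public", "Internal", "Confidential", "Restricted"},
        "required": True,   # assumed for illustration
    },
    "data_steward": {"type": "string", "required": False},
}


def validate_tag(template: dict, values: dict) -> list[str]:
    """Return a list of problems; an empty list means the tag fits the template."""
    errors = []
    for field_id, spec in template.items():
        if spec.get("required") and field_id not in values:
            errors.append(f"missing required field: {field_id}")
    for field_id, value in values.items():
        spec = template.get(field_id)
        if spec is None:
            errors.append(f"unknown field: {field_id}")
        elif spec["type"] == "enum" and value not in spec["values"]:
            errors.append(f"invalid enum value for {field_id}: {value!r}")
        elif spec["type"] == "string" and not isinstance(value, str):
            errors.append(f"expected a string for {field_id}")
    return errors


print(validate_tag(SENSITIVITY_TEMPLATE, {"classification": "Confidential", "data_steward": "jane"}))
print(validate_tag(SENSITIVITY_TEMPLATE, {"classification": "Secret"}))
print(validate_tag(SENSITIVITY_TEMPLATE, {"data_steward": "jane"}))
```

Running a check like this before calling the API can catch typos in enum values early, although the service remains the source of truth.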

5. Attach Tags to Entries

With tag templates created, you can attach tags to entries (datasets, tables, columns, files). Tags are instances of a template. For example, you attach a 'Sensitivity' tag to a BigQuery table and set the classification to 'Confidential'. Tags can be attached at the entry level (e.g., table) or at the column level (e.g., a specific column). Tags are inheritable: if you tag a dataset, all tables in that dataset inherit that tag unless overridden. You can attach multiple tags to the same entry (up to 1000). Tags are created via the Console, gcloud (`gcloud data-catalog tags create`), or API. Once attached, the tags are indexed and become searchable.

6. Search and Discover Assets

After metadata and tags are in place, users can search for assets using the Data Catalog search interface (console or API). The search supports full-text queries, faceted filtering by project, location, type, and tags. For example, searching for 'customer' returns all entries with 'customer' in their name, description, or tags. The search index is updated in near real-time. Users can click on an entry to view its full metadata, including schema, tags, and lineage (if enabled). This step is the primary value of Data Catalog: enabling data discovery across the organization. Without tags, search is limited to technical metadata; with tags, users can find assets by business context.
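The last point, that tags widen what search can find, can be illustrated with a toy in-memory search. This is a simplified sketch, not how the Data Catalog index works: the entries, their tags, and the `search` function are hypothetical, and matching is a plain case-insensitive substring test.

```python
entries = [
    {"name": "crm.customer_accounts", "description": "Customer master data", "tags": {}},
    {"name": "sales.tx_2024", "description": "Point-of-sale transactions",
     "tags": {"domain": "customer analytics"}},
    {"name": "ops.server_logs", "description": "Raw syslog export", "tags": {}},
]


def search(entries, keyword, use_tags=True):
    """Return names of entries whose name, description, or tag values contain keyword."""
    kw = keyword.lower()
    hits = []
    for entry in entries:
        haystack = [entry["name"], entry["description"]]
        if use_tags:
            haystack += [str(value) for value in entry["tags"].values()]
        if any(kw in text.lower() for text in haystack):
            hits.append(entry["name"])
    return hits


print(search(entries, "customer", use_tags=False))  # ['crm.customer_accounts']
print(search(entries, "customer"))                  # ['crm.customer_accounts', 'sales.tx_2024']
```

The transactions table never mentions customers in its name or description, so it only surfaces once its business tag is searchable.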

What This Looks Like on the Job

Enterprise Scenario 1: Financial Services Data Governance

A large bank uses Data Catalog to manage thousands of BigQuery tables across multiple projects. They have strict regulatory requirements for data privacy (GDPR, CCPA). They create tag templates for 'PII Classification' (fields: PII Type, Sensitivity, Data Steward) and 'Retention Policy' (fields: Retention Period, Legal Hold). Using Cloud DLP, they automatically scan tables for PII and write findings as Data Catalog tags. This allows them to enforce column-level security in BigQuery using policy tags. For example, columns tagged as 'Confidential' are restricted to users with the appropriate IAM role. The bank also uses Data Catalog's search to help analysts quickly find approved datasets, reducing data discovery time from hours to minutes. A common pitfall is not granting the Data Catalog service account the necessary BigQuery metadata viewer role, causing crawling to fail silently. Another is over-tagging: creating too many tags on a single table can exceed the 1000-tag limit, causing errors.

Enterprise Scenario 2: Retail Data Lake

A retail company ingests data from multiple sources into Cloud Storage and BigQuery. They use Data Catalog to catalog all files in Cloud Storage buckets (parquet, CSV, JSON) and BigQuery tables. They create a 'Data Source' tag template to track the origin of each dataset (e.g., 'POS System', 'Web Analytics'). Data engineers use Data Catalog to find the correct source tables for building data pipelines, avoiding duplicate or stale data. They also use Data Catalog's lineage (preview) to trace data from raw files to transformed tables, helping with debugging and impact analysis. One challenge is that Cloud Storage crawling requires manual enabling and appropriate permissions; if not configured, files remain invisible in Data Catalog. Additionally, if buckets have millions of objects, crawling can take time and may hit API rate limits. The company mitigates this by using bucket-level filters to exclude temporary or staging directories.

Enterprise Scenario 3: Healthcare Data Compliance

A healthcare organization uses Data Catalog to manage PHI (Protected Health Information) across BigQuery and Cloud Storage. They create a 'PHI' tag template with fields like 'PHI Category' (enum: Direct Identifier, Quasi-Identifier) and 'De-identification Status'. They use Cloud DLP to automatically classify data and write tags. They then use BigQuery column-level security with policy tags to restrict access to PHI columns to only authorized researchers. Data Catalog's search allows auditors to quickly find all assets containing PHI and verify that appropriate controls are in place. A common misconfiguration is not setting up tag inheritance correctly: if a dataset is tagged as 'PHI', but a table in that dataset is accidentally tagged as 'Non-PHI', the table-level tag overrides the dataset tag, potentially exposing data. The organization trains data stewards to always use inheritance unless there is a specific reason to override.

How GCDL Actually Tests This

What GCDL Tests on Data Catalog (Objective 3.1)

The GCDL exam focuses on the following aspects of Data Catalog:

Purpose: Understand that Data Catalog is a metadata management service for data discovery and governance.

Key Features: Automatic crawling of BigQuery and Cloud Storage, tag templates and tags, integration with BigQuery column-level security via policy tags.

Benefits: Reduces time spent finding data, enables business context enrichment, supports data governance.

Integration: Understand how Data Catalog works with BigQuery, Cloud Storage, Cloud DLP, and IAM.

Use Cases: Data discovery, data governance, automated classification, lineage tracking (preview).

Common Wrong Answers and Why Candidates Choose Them

1. Wrong Answer: 'Data Catalog stores a copy of the actual data.' Why chosen: Candidates confuse metadata storage with data storage. Correction: Data Catalog stores only metadata, not the data itself.

2. Wrong Answer: 'Data Catalog automatically encrypts data at rest.' Why chosen: Candidates assume any data service includes encryption. Correction: Encryption is handled by the underlying storage (e.g., BigQuery, Cloud Storage), not by Data Catalog.

3. Wrong Answer: 'Data Catalog can directly control access to BigQuery tables.' Why chosen: Candidates think tags are access control mechanisms. Correction: Data Catalog tags (including policy tags) are metadata; access control is enforced by BigQuery using IAM and policy tags. Data Catalog itself does not enforce access.

4. Wrong Answer: 'Data Catalog requires manual entry of all metadata.' Why chosen: Candidates underestimate automatic crawling. Correction: Technical metadata is automatically crawled; business metadata (tags) is manually added.

Specific Numbers, Values, and Terms on the Exam

Tag template limit: 1000 per project (default)

Tags per entry: 1000 max

Entry group size: 10,000 entries per group (default)

API rate limit: 6000 requests per minute per project

Roles: datacatalog.admin, datacatalog.tagEditor, datacatalog.viewer, datacatalog.tagTemplateCreator, datacatalog.tagTemplateViewer

Policy tags: used for BigQuery column-level security

Cloud DLP integration: DLP findings can be written as Data Catalog tags

Lineage: preview feature during exam timeframe

Edge Cases and Exceptions

Data Catalog does not support all Google Cloud services out of the box. For example, Cloud SQL and Spanner require custom integrations or third-party connectors.

Tag inheritance: a tag on a dataset is inherited by tables, but not by views or routines unless explicitly attached.

Deleting a tag template also deletes every tag created from it: the delete request must be made with the force option set, so review existing tags before removing a template.

Data Catalog is regional; metadata is stored in the region where the entry group is created. Search is global across regions.

If a BigQuery table is deleted, its Data Catalog entry is automatically removed within a few hours.

How to Eliminate Wrong Answers

If the question mentions 'metadata management', 'data discovery', or 'data governance', Data Catalog is likely the answer.

If the question mentions 'automatic crawling of BigQuery and Cloud Storage', think Data Catalog.

If the question mentions 'column-level security in BigQuery', think policy tags, which are created in Data Catalog.

Eliminate options that claim Data Catalog stores data, encrypts data, or controls access directly.

Remember that Data Catalog is not a data processing or storage service; it is purely metadata.

Key Takeaways

Data Catalog is a fully managed metadata management service for data discovery and governance.

It automatically crawls metadata from BigQuery, Cloud Storage, Pub/Sub, and more.

Tag templates define the schema for business metadata; tags are instances attached to entries.

Policy tags in Data Catalog enable BigQuery column-level security.

Data Catalog integrates with Cloud DLP to automatically classify sensitive data and write tags.

IAM roles like datacatalog.admin and datacatalog.viewer control access to Data Catalog resources.

Data Catalog does not store actual data; it only stores metadata.

Tag inheritance: tags on a dataset are inherited by its tables unless overridden.

Data Catalog is regional; metadata is stored in the entry group's region, but search is global.

Lineage tracking is in preview and requires the Data Lineage API.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Google Cloud Data Catalog

Manages metadata for data discovery and governance.

Automatically crawls technical metadata from BigQuery, Cloud Storage, etc.

Allows manual tagging with business context.

Integrates with BigQuery column-level security via policy tags.

Provides search and browse for data assets.

Cloud DLP (Data Loss Prevention)

Scans data content for sensitive information (PII, credentials).

Classifies and redacts sensitive data.

Can be used to inspect data in BigQuery, Cloud Storage, and other sources.

Can write classification results as Data Catalog tags.

Focuses on data security and compliance, not metadata management.

Google Cloud Data Catalog

Centralized metadata management across services.

Supports tagging and search for data discovery.

Integrates with BigQuery column-level security.

Does not manage data lakes or provide data processing.

Lightweight; no need to organize data into zones.

Dataplex (Data Lake Management)

Manages data lakes with zones, lakes, and assets.

Provides data quality, lifecycle management, and governance.

Includes a built-in metadata catalog (Data Catalog is part of Dataplex).

Offers data processing and integration with Dataflow and Dataproc.

More comprehensive for data lake governance.

Watch Out for These

Mistake

Data Catalog stores a copy of the actual data from BigQuery tables.

Correct

Data Catalog stores only metadata (schema, description, tags), never the data rows. The actual data remains in BigQuery. This is a fundamental distinction: metadata vs. data.

Mistake

Data Catalog automatically encrypts data at rest for BigQuery tables.

Correct

Encryption is handled by the underlying storage service (BigQuery, Cloud Storage). Data Catalog does not perform any encryption; it only manages metadata about the data.

Mistake

Data Catalog tags directly control access to BigQuery tables and columns.

Correct

Tags are metadata; they do not enforce access. However, policy tags (a special type of tag) can be used by BigQuery to enforce column-level security. Access control is still managed via IAM in BigQuery, not in Data Catalog.

Mistake

Data Catalog requires manual entry of all metadata for every asset.

Correct

Technical metadata (schema, partitioning, etc.) is automatically crawled for BigQuery and Cloud Storage. Only business metadata (tags, descriptions) requires manual input. This automatic crawling is a key benefit.

Mistake

Data Catalog can crawl any Google Cloud service without configuration.

Correct

Data Catalog has built-in connectors for BigQuery, Cloud Storage, Pub/Sub, and a few others. For services like Cloud SQL or Spanner, you need custom integrations. Also, Cloud Storage crawling requires explicit permission grants.

Frequently Asked Questions

What is the difference between a tag template and a tag in Data Catalog?

A tag template is a reusable schema that defines the fields for a tag. For example, a 'Sensitivity' template might have fields like 'classification' (enum) and 'data_steward' (string). A tag is an instance of that template attached to a specific data asset (e.g., a table). You create templates once, then create many tags from them. Templates are regional resources, while tags are attached to entries.

How does Data Catalog integrate with BigQuery column-level security?

Data Catalog lets you create policy tags, which are organized into taxonomies and are separate from template-based tags. You attach a policy tag to a column in a BigQuery table's schema, then grant the Fine-Grained Reader role (`roles/datacatalog.categoryFineGrainedReader`) on that policy tag to the users who may read the column. Users without that role cannot query the tagged column, even if they can read the rest of the table. The policy tag itself is just metadata; the enforcement happens in BigQuery.

Can Data Catalog automatically crawl Cloud Storage buckets?

Yes, but you must explicitly enable it by granting the Storage Object Viewer role to the Data Catalog service account. Once granted, Data Catalog will crawl bucket metadata and object metadata (name, size, type, timestamps). It does not crawl the contents of objects. You can also configure bucket-level filters to exclude certain directories.

What happens to Data Catalog entries when a BigQuery table is deleted?

When a BigQuery table is deleted, the corresponding Data Catalog entry is automatically removed within a few hours. This is part of the lifecycle management. Tags attached to that entry are also deleted. Deleting a tag template has a similar cascading effect: the delete request must be forced, and all tags created from the template are removed with it.

Is Data Catalog available in all Google Cloud regions?

Data Catalog is available in most Google Cloud regions, but it is a regional service. When you create an entry group (e.g., for a BigQuery dataset), you specify a location. Metadata is stored in that region. However, search queries are global and can retrieve entries from any region. Some regions may have limitations; check the official documentation for the latest list.

How does Data Catalog help with data governance?

Data Catalog supports data governance by providing a central place to define and apply business metadata (tags) that can represent data classifications, ownership, retention policies, and more. It integrates with Cloud DLP for automated classification and with BigQuery for column-level security. This enables organizations to enforce policies consistently across all data assets.

What are the default quotas for Data Catalog?

Key default quotas include: up to 1000 tag templates per project, up to 1000 tags per entry, up to 10,000 entries per entry group, and 6000 API requests per minute per project. These are soft limits and can be increased by requesting a quota increase from Google Cloud Support.
