
Cloud Data Loss Prevention (DLP) API

This chapter covers the Google Cloud Data Loss Prevention (DLP) API, a critical service for inspecting, classifying, and de-identifying sensitive data. For the ACE exam, understanding DLP API's capabilities, configuration, and integration with other GCP services is essential, as questions on data protection and compliance appear in roughly 5-8% of the exam. You will learn how to create inspection jobs, configure infoType detectors, apply de-identification transforms, and manage templates.

Updated May 31, 2026

The Privacy-Aware Mailroom Sorter

Imagine a corporate mailroom that processes all incoming and outgoing packages. The sorter, a highly trained employee, has a set of rules: scan every package for specific patterns like 'SSN', 'credit card number', or 'medical record'. When she finds a label containing 'SSN-123-45-6789', she doesn't just remove it; she applies a transformation: she replaces the digits with 'XXX-XX-XXXX' while keeping the package intact. She also logs what she found and where. If a package contains a photo with a face, she blurs the face using a stencil. The sorter can also be instructed to only redact certain types of data, or to simply count how many packages contain sensitive info without altering them. She works autonomously, but her actions are reviewed by an auditor. In Google Cloud DLP API, the 'sorter' is the DLP inspection engine, the 'packages' are data streams (text, images, structured data), and the 'rules' are infoType detectors and de-identification templates. The sorter's log is the DLP API's findings, and the auditor is the Cloud Audit Logs. Just as the sorter never reads the content of a package for personal curiosity, the DLP API operates on data without storing or reusing it beyond the inspection job.

How It Actually Works

What is Cloud DLP API?

The Cloud Data Loss Prevention (DLP) API is a fully managed service that helps you discover, classify, and protect sensitive data across Google Cloud, on-premises, and other clouds. It uses over 150 built-in detectors (infoTypes) to identify data like credit card numbers, social security numbers, passport numbers, and custom patterns you define. The API can inspect text, images (via OCR), and structured data (like BigQuery tables or CSV files). Once sensitive data is found, you can de-identify it using techniques like masking, tokenization, encryption, or date shifting.

How DLP API Works Internally

When you submit a request to the DLP API, the following steps occur:

1. Content Reception: The API receives your data: inline text, a Cloud Storage file, a BigQuery table, or a Datastore kind. If the content is an image, the API first performs OCR to extract text.

2. Inspection Configuration: You specify which infoTypes to look for (e.g., US_SOCIAL_SECURITY_NUMBER, CREDIT_CARD_NUMBER). You can also define custom infoTypes using dictionaries or regex patterns. The configuration includes a minimum likelihood threshold (e.g., LIKELY or VERY_LIKELY) to reduce false positives.

3. Inspection Engine: The engine scans the content and matches it against the configured infoTypes. For each match, it records the location (byte range), the detected infoType, and a likelihood rating from VERY_UNLIKELY to VERY_LIKELY. The engine handles overlapping matches and prioritizes higher-likelihood findings.

4. De-identification: If you specify a de-identification template, the engine applies transforms to the matching content. Transforms can be simple (e.g., replace with a token) or complex (e.g., date shifting within a range, cryptographic pseudonymization with a key).

5. Output: For inline requests, the API returns the findings and, for de-identification calls, the transformed content. For jobs, findings are saved through job actions, typically to a BigQuery table you specify.
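
The request side of the steps above can be sketched as the JSON body of an inline inspection call. This is a minimal Python sketch that only builds the payload as a dictionary and does not contact the service. The helper name `build_inspect_request` is hypothetical, and the field names (`item`, `inspectConfig`, `infoTypes`, `minLikelihood`, `includeQuote`) are assumed to follow the DLP v2 REST representation, so check them against the current API reference.

```python
import json


def build_inspect_request(text, info_types, min_likelihood="LIKELY"):
    """Build the JSON body for an inline inspection request (no API call is made)."""
    return {
        # The content to scan, passed inline as a plain string.
        "item": {"value": text},
        "inspectConfig": {
            # Each detector is referenced by its infoType name.
            "infoTypes": [{"name": name} for name in info_types],
            # Findings below this likelihood are dropped by the service.
            "minLikelihood": min_likelihood,
            # Ask for the matched text to be echoed back in each finding.
            "includeQuote": True,
        },
    }


body = build_inspect_request(
    "My phone number is 555-123-4567",
    ["PHONE_NUMBER", "US_SOCIAL_SECURITY_NUMBER"],
)
print(json.dumps(body, indent=2))
```

The same dictionary shape can be reused as the `inspectConfig` of a template or a job, which is one reason to keep the builder separate from the code that sends the request.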

Key Components and Defaults

InfoType: A type of sensitive data. Built-in examples: EMAIL_ADDRESS, PHONE_NUMBER. Custom infoTypes can be defined with CustomInfoType using a dictionary or regex. The default minimum likelihood is POSSIBLE, so findings rated VERY_UNLIKELY or UNLIKELY are not returned unless you lower the threshold.

Inspection Template: A reusable configuration that includes the list of infoTypes, likelihood threshold, and rules. You create templates via the console or API, then reference them in jobs.

De-identification Template: Contains transforms to apply to findings. Common transforms, each configured as a PrimitiveTransformation:

- ReplaceWithInfoType: replaces the match with the infoType name.
- CharacterMaskConfig: masks characters (e.g., XXXX-XXXX-XXXX-1234).
- CryptoReplaceFfxFpeConfig: format-preserving encryption.
- DateShiftConfig: shifts dates by a random number of days within a range.

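Job Triggers: You can schedule DLP jobs to run periodically (e.g., every 24 hours) using DLP job triggers, or by calling the API from Cloud Scheduler. A trigger has no default schedule; you set its recurrence period explicitly, and the shortest allowed interval is one day.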

Inspection Job: A long-running operation that scans data in Cloud Storage, BigQuery, or Datastore. Results are written to a BigQuery table you specify. Job status can be viewed in the console or via API.

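Content Item: The data to inspect in an inline request. A ContentItem carries a plain string value, a Table (rows and headers), or a ByteContentItem (for images and other binary content). Jobs do not take a ContentItem; they point at stored data with a StorageConfig.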
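
To make the likelihood threshold concrete, here is a small local illustration of how a minimum likelihood narrows a set of findings. The service applies this filter on its side; the function `filter_findings` and the flat finding dictionaries are hypothetical simplifications, not the API's response format.

```python
# Likelihood levels in ascending order of confidence, as named by the DLP API.
LIKELIHOOD_ORDER = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]


def filter_findings(findings, min_likelihood):
    """Keep findings whose likelihood is at or above min_likelihood."""
    threshold = LIKELIHOOD_ORDER.index(min_likelihood)
    return [
        f for f in findings
        if LIKELIHOOD_ORDER.index(f["likelihood"]) >= threshold
    ]


# Hypothetical findings, simplified from what an inspect response contains.
sample = [
    {"infoType": "PHONE_NUMBER", "likelihood": "VERY_LIKELY"},
    {"infoType": "PERSON_NAME", "likelihood": "POSSIBLE"},
    {"infoType": "US_SOCIAL_SECURITY_NUMBER", "likelihood": "UNLIKELY"},
]

kept = filter_findings(sample, "LIKELY")
print([f["infoType"] for f in kept])  # ['PHONE_NUMBER']
```

Raising the threshold trades recall for precision: the PERSON_NAME match rated POSSIBLE disappears at LIKELY, which is the behavior you want for noisy detectors and the wrong one if missing a match is costly.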

Configuration and Verification Commands

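To use the DLP API, enable it in your project and grant an appropriate IAM role, such as roles/dlp.user (inspect, redact, and de-identify content), roles/dlp.reader (read-only access to jobs and templates), or roles/dlp.admin (full control). Treat the gcloud examples below as illustrative: flag names and the availability of the dlp command group depend on your Cloud SDK release, so confirm them with gcloud dlp --help, or use the REST API or client libraries instead.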

Example: Inspect text using gcloud

gcloud dlp text inspect \
    --content="My phone number is 555-123-4567" \
    --info-types=PHONE_NUMBER \
    --min-likelihood=UNLIKELY

Example: Create an inspection template

gcloud dlp templates create \
    --inspect-template-id=custom-template \
    --display-name="Custom Template" \
    --info-types=US_SOCIAL_SECURITY_NUMBER,CREDIT_CARD_NUMBER \
    --min-likelihood=LIKELY

Example: Create a de-identification template

gcloud dlp templates create \
    --deidentify-template-id=my-deid-template \
    --display-name="DeID Template" \
    --replace-with-info-type

Example: Run an inspection job on a Cloud Storage bucket

gcloud dlp jobs create \
    --project=my-project \
    --inspect-job=gs://my-bucket/ \
    --inspect-template=projects/my-project/inspectTemplates/custom-template \
    --output-bigquery-table=my-project:my_dataset.dlp_results

Verification: Use gcloud dlp jobs list and gcloud dlp jobs describe JOB_ID to check status and view findings.

Interaction with Related Technologies

Cloud Storage: DLP can scan objects in buckets. You specify the bucket and optional file filters (e.g., *.txt).

BigQuery: DLP can scan table columns. You can inspect all rows or a sample. Results are written to a designated BigQuery table.

Cloud Datastore: DLP can scan entities of a kind.

Cloud Logging: DLP operations are logged in Cloud Audit Logs.

Cloud KMS: DLP can use customer-managed encryption keys (CMEK) for cryptographic transforms (e.g., CryptoReplaceFfxFpeConfig).

Cloud Dataflow: For large-scale streaming or batch de-identification, you can use DLP templates with Dataflow pipelines.

Performance and Limitations

Inspection Quotas: The DLP API enforces per-project request quotas, expressed as rate limits (requests per minute) rather than a daily allowance. Check the Quotas page in the console for your project's current values. For large datasets, use job-based scanning rather than many inline requests.

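Content Size Limits: Inline content requests are limited to about 0.5 MB per request. For larger content, use Cloud Storage or BigQuery jobs.

Image Scanning: Images sent inline count toward the request size limit; image redaction requests allow a larger payload (4 MB). Split or downscale images that exceed the limit.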

InfoType Limits: You can specify up to 200 infoTypes per request (built-in + custom).

Pricing

DLP pricing is based on the amount of data processed (per byte) and the number of infoType inspections. De-identification operations are charged separately. Free tier includes 5,000 units per month (1 unit = 1 MB scanned).

Walk-Through

1. Enable DLP API and IAM

Before using DLP, you must enable the API in your project via the console or `gcloud services enable dlp.googleapis.com`. Then, grant IAM roles to users or service accounts: `roles/dlp.user` to inspect, redact, and de-identify content, `roles/dlp.reader` for read-only access to jobs and templates, or `roles/dlp.admin` for full control. If you use Cloud KMS keys, also grant `roles/cloudkms.cryptoKeyEncrypterDecrypter` on the key. Without proper IAM, API calls fail with a 403 (permission denied) error.

2. Define InfoTypes and Templates

Create inspection templates that specify which infoTypes to detect and the minimum likelihood threshold (e.g., `LIKELY`). You can also create de-identification templates with transforms. Templates are stored in the project and can be reused across multiple jobs. Use `gcloud dlp templates create` or the console. For custom infoTypes, define a dictionary or regex pattern. Templates reduce configuration duplication and enforce consistent policies.
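
As a sketch of what such a template might contain, the following builds a request body for creating an inspection template with one built-in infoType and one regex-based custom infoType. The employee ID format (`EMP-` plus six digits), the infoType name `EMPLOYEE_ID`, and the helper `build_inspect_template` are hypothetical; the field names are assumed to match the DLP v2 REST representation. No API call is made.

```python
import re

# Hypothetical employee ID format: "EMP-" followed by exactly six digits.
EMPLOYEE_ID_PATTERN = r"EMP-\d{6}"


def build_inspect_template(template_id, display_name):
    """Build a request body for creating an inspection template."""
    return {
        "templateId": template_id,
        "inspectTemplate": {
            "displayName": display_name,
            "inspectConfig": {
                "infoTypes": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                "customInfoTypes": [
                    {
                        # Name reported in findings for this custom detector.
                        "infoType": {"name": "EMPLOYEE_ID"},
                        "regex": {"pattern": EMPLOYEE_ID_PATTERN},
                        # Likelihood assigned to every regex match.
                        "likelihood": "LIKELY",
                    }
                ],
                "minLikelihood": "LIKELY",
            },
        },
    }


template = build_inspect_template("custom-template", "Custom Template")

# Sanity-check the pattern locally before sending it to the service.
matches = re.findall(EMPLOYEE_ID_PATTERN, "Badge EMP-004217 issued; EMP-12 rejected")
print(matches)  # ['EMP-004217']
```

Testing the pattern locally first is cheap insurance, though the service uses its own regex engine, so keep custom patterns simple enough to behave the same in both.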

3. Choose Data Source and Method

Decide whether to inspect inline content (text, image, table) or run a job on a data store (Cloud Storage, BigQuery, Datastore). For inline, use the `projects.content.inspect` or `projects.content.deidentify` methods. For jobs, use `projects.dlpJobs.create` with a `StorageConfig`. Inline calls are synchronous and return results in the response; jobs are asynchronous, can handle large datasets, and can be scheduled with job triggers.
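
The decision can be expressed as a small helper. This is only a sketch of the reasoning: the function name `choose_method`, the source labels, and the limit value passed in are all assumptions for illustration, and the actual per-request size limit should be taken from the current limits documentation.

```python
def choose_method(source, size_bytes, inline_limit_bytes):
    """Return 'inline' for small in-memory content, otherwise 'job'."""
    storage_sources = {"cloud_storage", "bigquery", "datastore"}
    if source in storage_sources:
        # Data at rest in a supported store is scanned asynchronously.
        return "job"
    if size_bytes > inline_limit_bytes:
        # Too large for one synchronous request: stage it in storage first.
        return "job"
    return "inline"


# Assumed per-request limit, used here only to exercise the helper.
ASSUMED_LIMIT = 512 * 1024

print(choose_method("text", 2_000, ASSUMED_LIMIT))             # inline
print(choose_method("text", 10 * 1024 * 1024, ASSUMED_LIMIT))  # job
print(choose_method("bigquery", 0, ASSUMED_LIMIT))             # job
```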

4. Submit Inspection Request

For inline requests, send a JSON payload with the content and configuration; the response includes findings and, for de-identify calls, the transformed content. For jobs, specify the source (e.g., `cloud_storage_options` with `file_set` and `bytes_limit_per_file`); the API returns a job name, which you poll with `projects.dlpJobs.get` until the job finishes. The inspection engine processes data in chunks, respecting byte limits.
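
A job request for a Cloud Storage source might look like the body built below. This is a hedged sketch: `build_storage_inspect_job` is a hypothetical helper, the bucket, project, dataset, and table names are placeholders, and the camelCase field names are assumed to mirror the DLP v2 REST representation of an inspect job with a save-findings action.

```python
def build_storage_inspect_job(bucket_url, template_name, project, dataset, table,
                              bytes_limit_per_file=None):
    """Build a request body for an asynchronous Cloud Storage inspection job."""
    storage_options = {"fileSet": {"url": bucket_url}}
    if bytes_limit_per_file is not None:
        # 64-bit integers are conventionally sent as strings in JSON.
        storage_options["bytesLimitPerFile"] = str(bytes_limit_per_file)
    return {
        "inspectJob": {
            "storageConfig": {"cloudStorageOptions": storage_options},
            # Reuse a stored inspection template instead of an inline config.
            "inspectTemplateName": template_name,
            "actions": [
                {
                    # Write findings to a BigQuery table when the job completes.
                    "saveFindings": {
                        "outputConfig": {
                            "table": {
                                "projectId": project,
                                "datasetId": dataset,
                                "tableId": table,
                            }
                        }
                    }
                }
            ],
        }
    }


job_body = build_storage_inspect_job(
    "gs://my-bucket/*",
    "projects/my-project/inspectTemplates/custom-template",
    "my-project", "my_dataset", "dlp_results",
    bytes_limit_per_file=1_048_576,
)
print(job_body["inspectJob"]["storageConfig"])
```

Capping the bytes scanned per file is a common way to bound cost on buckets with a few very large objects, at the price of missing matches past the cap.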

5. Review Findings and Remediate

After the job completes, review the findings stored in the output BigQuery table. Findings include infoType, location, and likelihood. Use this information to assess risk and apply remediation, such as de-identification or access controls. You can rerun jobs with different templates or thresholds. For continuous compliance, set up job triggers to scan new data automatically. Cloud Audit Logs record all DLP API calls for auditing.
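
Once findings are exported, a first triage pass is often just a count by infoType and likelihood. The sketch below works on a hypothetical list of rows already pulled from the output table; the flat keys `info_type` and `likelihood` are simplifications chosen for illustration, not the exact output schema.

```python
from collections import Counter


def summarize(findings):
    """Count findings per (infoType, likelihood) pair."""
    return Counter((f["info_type"], f["likelihood"]) for f in findings)


# Hypothetical rows, as they might look after querying the findings table.
rows = [
    {"info_type": "US_SOCIAL_SECURITY_NUMBER", "likelihood": "VERY_LIKELY"},
    {"info_type": "US_SOCIAL_SECURITY_NUMBER", "likelihood": "VERY_LIKELY"},
    {"info_type": "EMAIL_ADDRESS", "likelihood": "LIKELY"},
    {"info_type": "PERSON_NAME", "likelihood": "POSSIBLE"},
]

summary = summarize(rows)
for (info_type, likelihood), count in summary.most_common():
    print(f"{info_type:28} {likelihood:12} {count}")
```

High counts at high likelihood are the natural place to start remediation; a long tail at POSSIBLE usually signals that the threshold or the detector list needs tuning.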

What This Looks Like on the Job

Enterprise Scenario 1: Healthcare Compliance (HIPAA)

A healthcare organization stores patient records in Cloud Storage and BigQuery. They use DLP API to scan all new files and database tables for protected health information (PHI) like patient names, SSNs, and medical record numbers. They create an inspection template with infoTypes US_SOCIAL_SECURITY_NUMBER, PERSON_NAME, and MEDICAL_RECORD_NUMBER. A job trigger runs daily to scan new objects. Findings are written to a BigQuery table for compliance reporting. When PHI is found, a Cloud Function triggers a de-identification job that replaces SSNs with tokens using format-preserving encryption (FPE) with a Cloud KMS key. Common pitfalls: leaving the likelihood threshold at its default (POSSIBLE), which can produce many false positives for broad detectors such as PERSON_NAME, or not granting the DLP service account access to the KMS key, causing de-identification to fail.

Enterprise Scenario 2: Financial Data Protection (PCI DSS)

A fintech company processes credit card transactions. They use DLP API to inspect logs and application data for credit card numbers (infoType CREDIT_CARD_NUMBER). They set the minimum likelihood to VERY_LIKELY so that numbers that merely resemble credit card numbers are not masked. They use a de-identification template with CharacterMaskConfig to mask all but the last four digits (e.g., XXXX-XXXX-XXXX-1234). The de-identified logs are stored in a separate bucket for analytics. Performance consideration: scanning millions of log lines per day requires using job-based scanning with BigQuery output. A typical misconfiguration is using ReplaceWithInfoType instead of masking, which replaces the credit card number with the literal string 'CREDIT_CARD_NUMBER' and breaks downstream processing.
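
The masking in this scenario can be sketched in two parts: a de-identification config that expresses the rule, and a local function that mimics the effect so you can see the expected output. The function `mask_chars` is a hypothetical stand-in written for illustration, not the service's implementation, and the config field names are assumed to follow the DLP v2 REST representation.

```python
# De-identification config: mask the first 12 digits, leaving hyphens alone.
mask_config = {
    "infoTypeTransformations": {
        "transformations": [
            {
                "infoTypes": [{"name": "CREDIT_CARD_NUMBER"}],
                "primitiveTransformation": {
                    "characterMaskConfig": {
                        "maskingCharacter": "X",
                        "numberToMask": 12,
                        "charactersToIgnore": [{"charactersToSkip": "-"}],
                    }
                },
            }
        ]
    }
}


def mask_chars(value, number_to_mask, masking_char="X", ignore="-"):
    """Mask the first number_to_mask characters, skipping any in ignore."""
    out, masked = [], 0
    for ch in value:
        if ch in ignore or masked >= number_to_mask:
            out.append(ch)  # separators and the unmasked tail pass through
        else:
            out.append(masking_char)
            masked += 1
    return "".join(out)


print(mask_chars("4111-1111-1111-1234", 12))  # XXXX-XXXX-XXXX-1234
```

Because the separators are skipped rather than counted, the same rule produces the same visible tail for card numbers written with or without hyphens.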

Enterprise Scenario 3: Multi-Cloud Data Discovery

A multinational corporation uses a hybrid cloud environment with data in AWS S3 and on-premises. Because DLP storage jobs read from Google Cloud sources, they first export the data to Cloud Storage and set up a weekly job to scan the exports. Custom infoTypes detect employee IDs and passport numbers in multiple formats. The DLP API's findings are exported to Security Command Center for centralized risk management. Scale: scanning 10 TB of data per week requires careful quota management and using multiple jobs. A common failure: hitting the project's request rate quota when using inline scanning for many small files.

How ACE Actually Tests This

The ACE exam tests DLP API under Objective 5.2 'Manage data loss prevention and classification'. Expect 2-4 questions that assess your ability to configure inspection jobs, apply de-identification, and interpret findings. Key areas:

1. InfoType detection: Know built-in infoTypes (e.g., US_SOCIAL_SECURITY_NUMBER, CREDIT_CARD_NUMBER, EMAIL_ADDRESS). The exam may ask which infoType detects a specific pattern. Memorize the naming convention: a US_ prefix for US-specific types, and unprefixed names such as PERSON_NAME and PHONE_NUMBER for general ones.

2. Likelihood thresholds: The default minimum likelihood is POSSIBLE. The exam loves to test that raising the threshold to LIKELY or VERY_LIKELY reduces false positives. A common wrong answer names UNLIKELY or VERY_LIKELY as the default.

3. De-identification transforms: You must know the difference between ReplaceWithInfoType, CharacterMaskConfig, and CryptoReplaceFfxFpeConfig. The exam may present a scenario where you need format-preserving encryption to maintain data format for downstream systems. The wrong answer is often ReplaceWithInfoType when format preservation is needed.

4. Job vs. inline: Inline inspection is for small content (roughly 0.5 MB per request or less). Jobs are for large datasets. A common trap: using inline for a 10 MB file.

5. Templates: Templates are reusable configurations. The exam tests that you can create a template and reference it in a job. A wrong answer might suggest configuring everything inline without templates.

6. Integration with Cloud Storage and BigQuery: Know that DLP can scan Cloud Storage buckets and BigQuery tables. The output of a job is written to BigQuery. A common wrong answer is that results are written to Cloud Storage.

7. IAM roles: roles/dlp.admin grants full control; roles/dlp.user can inspect, redact, and de-identify content; roles/dlp.reader is read-only. The exam may ask which role is needed to run de-identification. The correct answer is roles/dlp.user (or roles/dlp.admin). A wrong answer is a read-only role such as roles/dlp.reader.

8. Custom infoTypes: You can define them using a dictionary or regex. The exam may ask how to detect proprietary employee IDs.

9. Image scanning: DLP can scan images via OCR. The exam may include a scenario where text in an image contains sensitive data.

10. Pricing and quotas: The free tier includes 5,000 units per month. Quotas are per project. A wrong answer might assume unlimited usage.

Edge cases: DLP cannot scan encrypted files (e.g., encrypted with your own keys before upload). It can scan files encrypted with CMEK if you grant access. Also, DLP does not scan nested fields in BigQuery RECORD types by default; you need to specify the field path.

Key Takeaways

DLP API inspects and de-identifies sensitive data using over 150 built-in infoTypes.

Inspection templates and de-identification templates enable reusable configurations.

Minimum likelihood threshold default is POSSIBLE; increase to LIKELY or VERY_LIKELY to reduce false positives.

Inline inspection supports roughly 0.5 MB of content per request; use job-based scanning for larger datasets.

DLP can scan Cloud Storage, BigQuery, and Datastore; results are written to BigQuery for jobs.

De-identification transforms include ReplaceWithInfoType, CharacterMaskConfig, and CryptoReplaceFfxFpeConfig.

IAM roles: roles/dlp.admin (full control), roles/dlp.user (inspect, redact, and de-identify content), roles/dlp.reader (read-only).

Custom infoTypes can be defined using dictionaries or regular expressions.

DLP integrates with Cloud KMS for cryptographic transforms using customer-managed keys.

DLP operations are logged in Cloud Audit Logs for compliance.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Inline Inspection

Synchronous: returns results immediately

Limited to roughly 0.5 MB of content per request

Best for small text or images

Subject to the project's request rate quota

Results returned in API response

Job-Based Inspection

Asynchronous: long-running operation

Handles large datasets (GBs to TBs)

Required for Cloud Storage, BigQuery, Datastore

Quota based on bytes scanned, not number of requests

Results written to BigQuery table

Watch Out for These

Mistake

DLP API automatically de-identifies all data in a project without any configuration.

Correct

DLP is not automatic; you must explicitly create inspection jobs or use the API to inspect and de-identify data. It does not scan all data by default. You choose which sources to scan.

Mistake

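The default minimum likelihood is 'UNLIKELY', so every low-confidence match is reported unless you raise it.

Correct

The default is 'POSSIBLE'. Findings rated 'VERY_UNLIKELY' or 'UNLIKELY' are dropped unless you lower the threshold. 'POSSIBLE' can still include many false positives, so raise the threshold to 'LIKELY' or 'VERY_LIKELY' in production to reduce noise.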

Mistake

DLP can inspect data in any Google Cloud service, including Compute Engine disks.

Correct

DLP can inspect Cloud Storage, BigQuery, and Datastore. It does not directly inspect Compute Engine persistent disks or Cloud SQL databases. You must export data to a supported source first.

Mistake

The DLP API can de-identify data in place (e.g., modify the original Cloud Storage object).

Correct

DLP returns de-identified content in the API response or writes results to a new BigQuery table or Cloud Storage location. It never modifies the original source data. You must copy the de-identified output to the original location if needed.

Mistake

You can use DLP to scan data in other clouds (AWS, Azure) directly without moving data to GCP.

Correct

DLP can only scan data that is accessible via Cloud Storage, BigQuery, or Datastore. To scan data in other clouds, you must first export it to Cloud Storage or use a connector.

Frequently Asked Questions

What is the difference between DLP API and Data Loss Prevention in Security Command Center?

DLP API is the core service for scanning and de-identifying data. Security Command Center (SCC) uses DLP findings as part of its asset inventory and threat detection. SCC can surface DLP findings as part of its dashboard, but the actual scanning is performed by DLP API. For the exam, know that DLP API is the tool you configure directly, while SCC consumes its output.

How do I scan a BigQuery table with DLP?

Create an inspection job with `big_query_options` specifying the table. You can inspect all rows or a sample (e.g., `rows_limit=10000`). The job writes findings to a BigQuery table you specify. Use the console or `gcloud dlp jobs create` with `--inspect-job` pointing to the table.
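
For reference, the storage portion of such a job could be built as follows. This is a minimal sketch with placeholder project, dataset, and table names; `build_bigquery_storage_config` is a hypothetical helper, and the field names (`bigQueryOptions`, `tableReference`, `rowsLimit`, `sampleMethod`) are assumed to be the REST equivalents of the options named above.

```python
def build_bigquery_storage_config(project, dataset, table, rows_limit=None):
    """Build the storage config for inspecting a BigQuery table."""
    options = {
        "tableReference": {
            "projectId": project,
            "datasetId": dataset,
            "tableId": table,
        }
    }
    if rows_limit is not None:
        # Sample a bounded number of rows instead of scanning the whole table.
        options["rowsLimit"] = str(rows_limit)
        # Start sampling at a random offset rather than always from the top.
        options["sampleMethod"] = "RANDOM_START"
    return {"bigQueryOptions": options}


storage_config = build_bigquery_storage_config(
    "my-project", "my_dataset", "patients", rows_limit=10000
)
print(storage_config)
```

Sampling keeps exploratory scans cheap; for a compliance scan that must cover every row, leave `rows_limit` unset.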

Can DLP scan images for text?

Yes, DLP can inspect images for text using OCR. Supported image types include JPEG, PNG, and BMP, and the image must fit within the inline request size limit. The API extracts text and then inspects it for sensitive infoTypes. Use `ByteContentItem` with `type=IMAGE` (or a specific type such as `IMAGE_PNG`).
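
In a JSON request, binary content has to be base64-encoded. The sketch below shows one way to wrap image bytes into an item; `build_image_item` is a hypothetical helper, the bytes used are only a PNG file signature standing in for a real image, and the `byteItem` field name is assumed to be the REST form of `ByteContentItem`.

```python
import base64


def build_image_item(image_bytes, image_type="IMAGE_PNG"):
    """Wrap raw image bytes as a content item for an inline request."""
    return {
        "byteItem": {
            "type": image_type,
            # JSON cannot carry raw bytes, so the payload is base64 text.
            "data": base64.b64encode(image_bytes).decode("ascii"),
        }
    }


# Placeholder bytes: just the PNG signature, not a complete image.
fake_png = b"\x89PNG\r\n\x1a\n"
item = build_image_item(fake_png)
print(item["byteItem"]["type"], len(item["byteItem"]["data"]))
```

Keep in mind that base64 inflates the payload by about a third, which matters when an image is close to the request size limit.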

What is format-preserving encryption (FPE) in DLP?

FPE via `CryptoReplaceFfxFpeConfig` encrypts sensitive data while preserving the original format (e.g., a 16-digit credit card number remains 16 digits). It uses a cryptographic key (from Cloud KMS) and a surrogate infoType. This is useful for maintaining data structure in databases.
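
A de-identification config using FPE might be assembled like this. Treat it as a sketch under assumptions: `build_fpe_config` is a hypothetical helper, the wrapped key and KMS key name are placeholders you would replace with real values, the `NUMERIC` alphabet assumes digit-only input, and the field names are assumed to follow the DLP v2 REST representation.

```python
def build_fpe_config(info_type, wrapped_key_b64, kms_key_name, surrogate_name):
    """Build a de-identify config that applies format-preserving encryption."""
    return {
        "infoTypeTransformations": {
            "transformations": [
                {
                    "infoTypes": [{"name": info_type}],
                    "primitiveTransformation": {
                        "cryptoReplaceFfxFpeConfig": {
                            "cryptoKey": {
                                "kmsWrapped": {
                                    # Data key, encrypted by the KMS key below.
                                    "wrappedKey": wrapped_key_b64,
                                    "cryptoKeyName": kms_key_name,
                                }
                            },
                            # Output stays within digits, like the input.
                            "commonAlphabet": "NUMERIC",
                            # Label attached to tokens so they can be found
                            # again for re-identification.
                            "surrogateInfoType": {"name": surrogate_name},
                        }
                    },
                }
            ]
        }
    }


fpe = build_fpe_config(
    "CREDIT_CARD_NUMBER",
    "PLACEHOLDER_WRAPPED_KEY_BASE64",
    "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key",
    "CC_TOKEN",
)
print(list(fpe["infoTypeTransformations"]["transformations"][0]["primitiveTransformation"]))
```

The surrogate infoType is what makes the transform reversible in practice: without a label on the tokens, a later re-identification request has no way to locate them in free text.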

How do I schedule recurring DLP scans?

Use job triggers. Create a trigger with a schedule (e.g., every 24 hours) and associate it with an inspection job configuration. The trigger automatically creates a new job at each interval. You can also use Cloud Scheduler to call the DLP API on a cron schedule.
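
A trigger definition could be sketched as the request body below. The helper `build_job_trigger`, the trigger ID, and the bucket URL are hypothetical, and the field names (`jobTrigger`, `triggers`, `schedule`, `recurrencePeriodDuration`) are assumed to follow the DLP v2 REST representation, where the interval is written as a number of seconds with an `s` suffix.

```python
def build_job_trigger(trigger_id, inspect_job, every_hours=24):
    """Build a request body for a recurring inspection job trigger."""
    seconds = every_hours * 3600
    return {
        "triggerId": trigger_id,
        "jobTrigger": {
            "displayName": f"Scan every {every_hours}h",
            # The same inspect job definition used for a one-off job.
            "inspectJob": inspect_job,
            "triggers": [
                {"schedule": {"recurrencePeriodDuration": f"{seconds}s"}}
            ],
            # A trigger must be in a healthy state to fire.
            "status": "HEALTHY",
        },
    }


# Minimal inspect job definition with a placeholder bucket.
inspect_job = {
    "storageConfig": {"cloudStorageOptions": {"fileSet": {"url": "gs://my-bucket/*"}}},
    "inspectConfig": {"infoTypes": [{"name": "US_SOCIAL_SECURITY_NUMBER"}]},
}

trigger = build_job_trigger("daily-ssn-scan", inspect_job)
print(trigger["jobTrigger"]["triggers"][0]["schedule"])
```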

What happens if DLP finds sensitive data? Can it automatically redact it?

DLP can automatically de-identify data if you provide a de-identification template in the same request. For jobs, you can configure de-identification as part of the job. However, DLP does not modify the original source; it writes de-identified output to a new location. For automatic redaction in real-time, you would use DLP with a Cloud Function or a proxy.

What are the costs of using DLP API?

DLP pricing is based on the amount of data processed. The first 5,000 units (each unit = 1 MB scanned) per month are free. After that, you pay per unit. De-identification transforms also incur charges. There are no charges for creating templates or managing jobs, only for data processing.
