
PII Detection and Extraction

This chapter covers PII (Personally Identifiable Information) detection and extraction using Azure AI Language, a core topic in the NLP domain of the AI-900 exam (Objective 4.3: Identify features of NLP workloads on Azure). You will learn what PII is, how Azure detects and extracts it, and the key capabilities of the PII detection API. Approximately 5-7% of AI-900 exam questions touch on this area, typically asking you to identify the correct service for a given scenario or to recall specific entity types. Mastery of this chapter will help you answer questions about data privacy, compliance, and text analytics confidently.

Updated May 31, 2026

PII Detection Like a Mail Sorter

Think of a large postal sorting facility that processes millions of letters daily. Some envelopes carry sensitive information, such as social security numbers, credit card details, or medical record numbers, written visibly on the outside. Trained inspectors scan every envelope before it enters the sorting machine, working from a checklist of patterns: nine-digit numbers (like SSNs), 16-digit card numbers, phone numbers, and email addresses. When an inspector spots a match, they flag the envelope with a red sticker and route it to a secure holding area. They never open the envelope; they look only at the visible text. After sorting, a supervisor reviews the flagged items, confirms each detection, and decides whether to let the envelope through (with the sensitive data removed or masked) or to discard it.

In Azure AI Language, PII detection works similarly: the service scans text for spans matching predefined entity types (like US social security number or credit card number), labels each detected entity with its type and a confidence score, and returns the results. You can redact the entities by replacing them with asterisks or tags, much like the supervisor masking the sensitive text before forwarding the envelope. The service does not understand the meaning of the data. It relies on pattern matching and on machine learning models trained on labeled examples. Just as the inspectors rely on a checklist, Azure's models rely on a curated list of entity types and regular expressions, plus contextual clues from surrounding words.

How It Actually Works

What is PII Detection and Why Does It Matter?

Personally Identifiable Information (PII) refers to any data that can be used to identify an individual. Examples include names, addresses, phone numbers, email addresses, social security numbers, passport numbers, credit card numbers, and medical record numbers. PII detection is the process of automatically identifying such information within text documents. This is crucial for compliance with data protection regulations like GDPR, HIPAA, and CCPA, which require organizations to protect sensitive data. Azure AI Language provides a pre-built PII detection feature that can identify, categorize, and optionally redact PII entities from text.

How Azure AI Language PII Detection Works Internally

Azure AI Language (formerly Text Analytics) is a cloud-based service that provides natural language processing capabilities. Its PII feature can be understood as a combination of machine learning models and rule-based pattern matching. The exact internals are not something the exam tests, but a useful mental model is that when you submit a text document, the service performs the following steps:

1. Text Preprocessing: The text is tokenized into words and sentences. The language is taken from the language field of each document; if it is omitted, the service assumes English.

2. Entity Recognition: The service runs a named entity recognition (NER) model trained on a large corpus of labeled text. This model identifies entities like persons, locations, and organizations. For PII, a specialized model is used that focuses on sensitive entity types. The model uses contextual clues (e.g., words like 'SSN' or 'credit card') to improve accuracy.

3. Pattern Matching: For structured PII types (e.g., social security numbers, credit card numbers, phone numbers), the service uses regular expressions to find exact matches. For example, a US social security number pattern is \d{3}-\d{2}-\d{4}. This step ensures high precision for well-defined formats.

4. Confidence Scoring: Each detected entity is assigned a confidence score between 0 and 1. As a rule of thumb, scores above 0.8 are treated as high confidence. With the API version shown in this chapter, filtering by score is something you do in your own code after the response arrives.

5. Redaction: The response includes a redacted copy of the text in which each detected entity is masked, by default with asterisks of the same length. Some API versions can instead substitute a placeholder such as [SSN]. This is useful for masking sensitive data before sharing documents.
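The pattern-matching and redaction steps can be illustrated with a small local sketch. This is a toy model, not Azure's implementation: it uses only the SSN pattern quoted above, and the function names find_ssn_candidates and mask_spans are made up for this example.

```python
import re

# Dashed US SSN pattern from the step above (illustration only).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_candidates(text):
    """Return text/offset/length records for every dashed SSN-like match."""
    return [
        {"text": m.group(), "offset": m.start(), "length": len(m.group())}
        for m in SSN_PATTERN.finditer(text)
    ]


def mask_spans(text, spans, mask_char="*"):
    """Replace each span with mask characters of the same length."""
    chars = list(text)
    for span in spans:
        start = span["offset"]
        end = start + span["length"]
        chars[start:end] = mask_char * span["length"]
    return "".join(chars)


sample = "My SSN is 123-45-6789."
candidates = find_ssn_candidates(sample)
print(candidates)
print(mask_spans(sample, candidates))
```

A real detector also weighs surrounding words, which is why the service returns a confidence score rather than a plain yes or no.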

Key Entity Types and Categories

Azure AI Language supports a comprehensive list of PII entity types. These are organized into categories. On the AI-900 exam, you are expected to know the most common ones:

US Social Security Number (SSN): Pattern ###-##-####. Example: 123-45-6789.

Credit Card Number: 13-16 digit numbers, often grouped in 4s. Example: 4111-1111-1111-1111.

Phone Number: US phone numbers with area code, e.g., (555) 123-4567.

Email Address: Standard email format, e.g., user@example.com.

IP Address: IPv4 or IPv6 addresses.

Person Name: Full names detected via NER.

Physical Address: Street addresses, including city, state, zip.

Date of Birth: Dates in various formats.

Bank Account Number: Patterns vary by country.

Passport Number: Country-specific formats.

Driver's License Number: Country/state-specific formats.

Medical Record Number: Typically alphanumeric.

Health Plan ID: Insurance identifiers.

For a complete list, refer to the Azure documentation. The exam may ask you to identify which entity type is detected from a given example.

API Configuration and Usage

To use PII detection, you call the REST API endpoint: https://<your-resource-name>.cognitiveservices.azure.com/language/:analyze-text?api-version=2023-04-01. The request body includes the text and the kind of analysis: PiiEntityRecognition. Here is an example request:

{
  "kind": "PiiEntityRecognition",
  "parameters": {
    "modelVersion": "latest",
    "domain": "phi",
    "piiCategories": ["USSocialSecurityNumber", "CreditCardNumber"]
  },
  "analysisInput": {
    "documents": [
      {
        "id": "1",
        "language": "en",
        "text": "My SSN is 123-45-6789 and my credit card is 4111-1111-1111-1111."
      }
    ]
  }
}

The domain parameter can be set to "phi" (Protected Health Information) to include healthcare-specific entities. The piiCategories parameter allows you to filter which entity types to detect. If omitted, all supported types are detected.
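As a rough sketch, the same request body can be assembled in Python. The helper name build_pii_request is hypothetical; the field names mirror the JSON example above, and the optional parameters are left out of the body when they are not supplied.

```python
import json


def build_pii_request(texts, language="en", domain=None, categories=None):
    """Build an analyze-text body for PII detection from a list of strings."""
    parameters = {"modelVersion": "latest"}
    if domain is not None:
        parameters["domain"] = domain  # e.g. "phi"
    if categories:
        parameters["piiCategories"] = list(categories)
    documents = [
        {"id": str(i), "language": language, "text": text}
        for i, text in enumerate(texts, start=1)
    ]
    return {
        "kind": "PiiEntityRecognition",
        "parameters": parameters,
        "analysisInput": {"documents": documents},
    }


body = build_pii_request(
    ["My SSN is 123-45-6789."],
    domain="phi",
    categories=["USSocialSecurityNumber"],
)
print(json.dumps(body, indent=2))
```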

The response includes a list of detected entities with their offsets, lengths, types, and confidence scores:

{
  "kind": "PiiEntityRecognitionResults",
  "results": {
    "documents": [
      {
        "id": "1",
        "redactedText": "My SSN is *********** and my credit card is *******************.",
        "entities": [
          {
            "text": "123-45-6789",
            "category": "USSocialSecurityNumber",
            "offset": 10,
            "length": 11,
            "confidenceScore": 0.85
          },
          {
            "text": "4111-1111-1111-1111",
            "category": "CreditCardNumber",
            "offset": 44,
            "length": 19,
            "confidenceScore": 0.95
          }
        ],
        "warnings": []
      }
    ],
    "errors": [],
    "modelVersion": "2023-01-01"
  }
}

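Note the redactedText field: it shows the original text with each detected entity replaced by asterisks of the same length. This character masking is the default redaction behavior. Newer API versions also expose a redactionPolicy parameter; setting its policyKind to "entityMask" replaces each entity with its category label instead of asterisks. Check the reference for the API version you target, because older versions may not accept this parameter.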
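To see how offset and length relate to the redacted text, here is a minimal local sketch that rebuilds a masked string from entity spans. It assumes offsets can be used directly as Python string indices, which holds for plain ASCII text like this sample. The function name redact_from_entities and the bracketed tag format are illustrative, not the service's exact output.

```python
def redact_from_entities(text, entities, mask_char="*", use_category=False):
    """Mask each entity span, or replace it with a [Category] tag."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["offset"], reverse=True):
        start = entity["offset"]
        end = start + entity["length"]
        if use_category:
            replacement = "[" + entity["category"] + "]"
        else:
            replacement = mask_char * entity["length"]
        redacted = redacted[:start] + replacement + redacted[end:]
    return redacted


text = "My SSN is 123-45-6789 and my credit card is 4111-1111-1111-1111."
entities = [
    {"category": "USSocialSecurityNumber",
     "offset": text.index("123-45-6789"), "length": 11},
    {"category": "CreditCardNumber",
     "offset": text.index("4111-1111-1111-1111"), "length": 19},
]
print(redact_from_entities(text, entities))
print(redact_from_entities(text, entities, use_category=True))
```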

Interacting with Related Technologies

PII detection is often used in conjunction with other Azure AI services:

Azure AI Search (formerly Azure Cognitive Search): Redact PII before indexing so that sensitive documents become searchable without exposing personal data.

Azure Logic Apps / Power Automate: Automate workflows that detect and redact PII in incoming emails or documents.

Azure Data Factory: Process large volumes of text data and apply PII detection as a transformation step.

Microsoft Purview Compliance Portal: Integrate with Azure AI Language for data classification and policy enforcement.

Performance and Scale Considerations

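Limits differ between the synchronous and asynchronous APIs. At the time of writing, Microsoft's service limits list 5,120 characters per document for synchronous requests, 125,000 characters across all documents for an asynchronous request, and 5 documents per PII detection request. The 1,000-documents-per-request figure that is sometimes quoted applies to language detection, not PII detection. Throughput also depends on your pricing tier (Free tier: 5,000 text records per month; Standard tier: pay as you go). For high-volume scenarios, use the asynchronous analyze-text/jobs endpoint (a long-running operation) and spread the work across multiple requests.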
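Because the limits depend on the feature, API version, and tier, a client usually treats them as configuration. The sketch below splits documents into request-sized batches and flags oversized documents. Both helper names and the limit values passed in are assumptions for illustration; substitute the figures from the current service limits page.

```python
def make_batches(documents, max_docs_per_request):
    """Split a list of documents into consecutive request-sized batches."""
    if max_docs_per_request < 1:
        raise ValueError("max_docs_per_request must be at least 1")
    return [
        documents[i:i + max_docs_per_request]
        for i in range(0, len(documents), max_docs_per_request)
    ]


def oversized_ids(documents, max_chars_per_document):
    """Return the ids of documents whose text exceeds the character limit."""
    return [d["id"] for d in documents if len(d["text"]) > max_chars_per_document]


docs = [{"id": str(i), "language": "en", "text": f"Note {i}"} for i in range(1, 8)]
batches = make_batches(docs, max_docs_per_request=5)  # limit value is an assumption
print([len(batch) for batch in batches])
print(oversized_ids(docs, max_chars_per_document=5120))  # limit value is an assumption
```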

Verification and Testing

You can test PII detection using:

Azure Language Studio: A no-code UI where you can paste text and see detected entities interactively.

REST API: Use tools like Postman or curl.

SDKs: Client libraries are available for Python, C# (.NET), Java, and JavaScript.

Example curl command:

curl -X POST "https://<your-resource-name>.cognitiveservices.azure.com/language/:analyze-text?api-version=2023-04-01" \
-H "Ocp-Apim-Subscription-Key: <your-key>" \
-H "Content-Type: application/json" \
-d '{
  "kind": "PiiEntityRecognition",
  "parameters": {
    "modelVersion": "latest"
  },
  "analysisInput": {
    "documents": [
      {
        "id": "1",
        "language": "en",
        "text": "Contact me at john.doe@example.com or call (555) 123-4567."
      }
    ]
  }
}'
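For readers who prefer Python over curl, the following sketch builds the same HTTP request with the standard library. It only constructs the request object; nothing is sent until you call urllib.request.urlopen. The function name build_pii_http_request and the example resource name are placeholders, not real values.

```python
import json
import urllib.request


def build_pii_http_request(endpoint, key, text, api_version="2023-04-01"):
    """Build (but do not send) a POST request for PII detection."""
    url = f"{endpoint.rstrip('/')}/language/:analyze-text?api-version={api_version}"
    body = {
        "kind": "PiiEntityRecognition",
        "parameters": {"modelVersion": "latest"},
        "analysisInput": {
            "documents": [{"id": "1", "language": "en", "text": text}]
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_pii_http_request(
    "https://example-resource.cognitiveservices.azure.com",  # placeholder endpoint
    "<your-key>",
    "Contact me at john.doe@example.com or call (555) 123-4567.",
)
print(request.get_method(), request.full_url)
# To actually send it (requires a real endpoint and key):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```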

Summary of Internal Mechanics

The service is best thought of as a pre-trained language model (plausibly transformer-based, in the style of BERT) fine-tuned for entity recognition.

Pattern matching complements the model for structured entities with well-defined formats, where it gives precise matches.

The confidence score is derived from the model's output probabilities.

Redaction is performed server-side before returning the redacted text.

For synchronous requests, your text is processed and not retained after the response is sent. Results of asynchronous jobs are kept temporarily so that you can retrieve them.

This understanding is critical for the AI-900 exam, where you may be asked about the capabilities, limitations, and appropriate use cases of PII detection.

Walk-Through

1

Create Azure AI Language Resource

In the Azure portal, create a Language resource (formerly Text Analytics). Choose a region (e.g., East US), pricing tier (Free F0 for testing, Standard S for production), and resource group. Note the endpoint and key. This resource provides access to the PII detection API. The Free tier allows 5,000 text records per month and has lower rate limits than the Standard tier; check the service limits documentation for current values. For production, use the Standard tier, which is billed by usage.

2

Prepare Text Input

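Gather the text documents you want to analyze. Keep each document within the documented size limit (5,120 characters for synchronous requests at the time of writing) and split longer texts into chunks. PII detection accepts only a few documents per request, so large collections must be spread across several requests or sent through the asynchronous API. Specify the language (e.g., 'en'); if you omit it, the service assumes English. For best results, use clean UTF-8 encoded text.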

3

Call the Analyze Text API

Send a POST request to the endpoint with the kind set to 'PiiEntityRecognition'. Include your subscription key in the header. The request body must include the documents array and optional parameters like modelVersion, domain (for PHI), and piiCategories (to filter entities). The API returns a JSON response with detected entities and redacted text.

4

Parse the Response

Examine the response JSON. Each document object contains a 'redactedText' field and an 'entities' array. Each entity has 'text', 'category', 'offset', 'length', and 'confidenceScore'. Use the confidence score to filter out low-confidence detections (typically keep > 0.8). The offset is character-based from the start of the document.
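A minimal sketch of this parsing step, assuming the response has already been decoded into a Python dictionary with the shape shown earlier in the chapter. The function name high_confidence_entities and the sample values, including the low-scoring Person entity, are invented for illustration.

```python
def high_confidence_entities(response, min_score=0.8):
    """Collect (document id, category, text, score) for confident detections."""
    kept = []
    for document in response["results"]["documents"]:
        for entity in document["entities"]:
            if entity["confidenceScore"] >= min_score:
                kept.append(
                    (document["id"], entity["category"],
                     entity["text"], entity["confidenceScore"])
                )
    return kept


# Hand-written sample in the response shape; not real service output.
sample_response = {
    "kind": "PiiEntityRecognitionResults",
    "results": {
        "documents": [
            {
                "id": "1",
                "redactedText": "...",
                "entities": [
                    {"text": "123-45-6789", "category": "USSocialSecurityNumber",
                     "offset": 10, "length": 11, "confidenceScore": 0.85},
                    {"text": "Order", "category": "Person",
                     "offset": 30, "length": 5, "confidenceScore": 0.41},
                ],
                "warnings": [],
            }
        ],
        "errors": [],
    },
}
print(high_confidence_entities(sample_response))
```

Whether 0.8 is the right cutoff depends on how costly a missed entity is compared with an over-redaction, so treat the default here as a starting point.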

5

Handle Redaction and Export

The redactedText field contains the original text with detected PII replaced by asterisks. You can then store or further process this redacted text. For compliance, avoid persisting the original text unless you have a reason to keep it. You can also use the detected entities to generate reports on PII occurrences. For synchronous requests, the service does not retain your text after the response is sent.

What This Looks Like on the Job

Enterprise Scenario 1: Healthcare Compliance (HIPAA)

A hospital system needs to de-identify patient notes before sharing them with researchers. They use Azure AI Language's PII detection with the domain parameter set to 'phi' to detect Protected Health Information (PHI) such as medical record numbers, health plan IDs, and patient names. They process thousands of documents daily via a batch pipeline using Azure Data Factory. The redacted notes are stored in Azure Blob Storage with access restricted to authorized researchers. A common misconfiguration is forgetting to set the domain parameter, which would miss PHI-specific entities like 'HealthPlanID' or 'MedicalRecordNumber'. Performance considerations: they size each request to the documented per-request limits and monitor API usage to avoid throttling. When the API returns low confidence scores for certain entities, they route those documents to a manual review process.

Enterprise Scenario 2: Customer Support Email Redaction

A financial services company wants to automatically redact sensitive information from customer support emails before storing them in a CRM. They use Azure Logic Apps to trigger PII detection whenever a new email arrives. The email body is sent to the API, and the redacted text is saved to the CRM. The company configures the piiCategories parameter to detect only credit card numbers, social security numbers, and bank account numbers, so that names (which are needed for customer identification) are not redacted. A common pitfall is that the API may report false positives for credit card numbers in digit strings that are not valid card numbers (e.g., a long order number). The company mitigates this by using a confidence threshold of 0.9 and by adding a post-processing step that checks each candidate number against the Luhn checksum.
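The Luhn check mentioned in this scenario is a standard checksum and is easy to sketch. The function name luhn_valid is just for this example, and the check runs in your own code after the API response arrives; it is not part of the service.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))
print(luhn_valid("4111-1111-1111-1112"))
```

Passing the checksum only shows that a number is well formed. It does not prove the number belongs to a real card, so it works as a filter for false positives and not as proof of a true positive.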

Enterprise Scenario 3: Data Loss Prevention (DLP) in Documents

A legal firm uses Azure Cognitive Search to index internal documents. Before indexing, they run PII detection to redact sensitive client information. They use an entity-mask redaction policy (available in newer API versions) to replace each PII entity with its category label (e.g., '[SSN]') for easier review. The firm submits documents through the asynchronous API, in batches sized to the documented request limits, to handle large volumes without hitting rate limits. A common issue is that the API may not detect all PII due to unusual formatting (e.g., an SSN written as '123456789' without dashes). To address this, they preprocess documents to normalize formats before detection. They also use custom entity recognition for domain-specific identifiers not covered by the built-in types.
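The normalization step in this scenario could look something like the sketch below, which rewrites bare nine-digit runs into the dashed SSN layout before the text is sent for detection. The function name and the choice to rewrite every nine-digit run are assumptions made for illustration. A real pipeline would need extra context checks, because this rule also reformats order numbers and other nine-digit values.

```python
import re

# Exactly nine consecutive digits, not part of a longer digit run.
UNDASHED_SSN = re.compile(r"(?<!\d)(\d{3})(\d{2})(\d{4})(?!\d)")


def normalize_ssn_format(text):
    """Rewrite bare nine-digit runs as ###-##-#### (may over-match)."""
    return UNDASHED_SSN.sub(r"\1-\2-\3", text)


print(normalize_ssn_format("Client SSN 123456789 on file."))
print(normalize_ssn_format("Tracking 1234567890 shipped."))
```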

How AI-900 Actually Tests This

Exam Focus: AI-900 Objective 4.3 (Identify features of NLP workloads on Azure)

The AI-900 exam tests your ability to identify the correct Azure service for a given NLP scenario. For PII detection, the key service is Azure AI Language (specifically the PII detection feature). The exam will not ask you to write API calls, but you must know:

The service name: 'Azure AI Language' (formerly Text Analytics).

The feature name: 'PII detection' or 'Personally Identifiable Information detection'.

Common entity types: SSN, credit card number, phone number, email, IP address, person name, physical address.

The ability to redact detected entities.

The domain parameter for PHI (Protected Health Information).

Common Wrong Answers and Why Candidates Choose Them

1. Wrong: 'Azure Cognitive Search' – Candidates see 'search' and think of finding data, but Cognitive Search is for indexing and querying, not real-time entity detection. PII detection is a pre-processing step that can feed into Cognitive Search.

2. Wrong: 'Azure Form Recognizer' – Candidates confuse document processing with text analytics. Form Recognizer extracts structured data from forms, not general PII from free text.

3. Wrong: 'Azure Bot Service' – Candidates think of conversational AI, but Bot Service is for building chatbots, not analyzing text for PII.

4. Wrong: 'Azure Machine Learning' – Candidates assume custom models are needed, but Azure AI Language provides pre-built PII detection without custom training.

Specific Numbers and Terms

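Maximum document size: 5,120 characters per document for synchronous requests; 125,000 characters per request for asynchronous requests.

Maximum batch size for PII detection: 5 documents per request (1,000 applies to language detection).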

Confidence score range: 0 to 1.

Default model version: 'latest'.

Domain parameter values: 'none' (default) or 'phi'.

Redaction policy (newer API versions): policyKind 'characterMask' (default, asterisks), 'entityMask' (category label), or 'noMask'.

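Entity categories: Over 20 types. API identifiers include USSocialSecurityNumber, CreditCardNumber, PhoneNumber, Email, IPAddress, Person, and Address. Dates, bank account numbers, passport numbers, driver's license numbers, and health-related identifiers are also covered, often under country-specific category names, so check the documentation for the exact identifier strings.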

Edge Cases and Exceptions

The API may return false positives for numbers that look like SSNs but are not (e.g., order numbers). Confidence score helps filter.

PII in non-English text: The service supports multiple languages, but accuracy may be lower for less common languages.

The 'phi' domain includes additional entities like 'MedicalRecordNumber', 'HealthPlanID', and 'DrugCode'.

Redaction only works for detected entities; if the model misses something, it remains in the redacted text.

How to Eliminate Wrong Answers

If the scenario mentions 'redact' or 'mask' sensitive data, the answer is likely Azure AI Language with PII detection.

If the scenario mentions 'compliance' or 'GDPR', think PII detection.

If the scenario is about extracting information from forms (like invoices), use Form Recognizer, not PII detection.

If the scenario is about building a chatbot, use Bot Service or question answering in Azure AI Language (the successor to QnA Maker), not PII detection.

Always match the specific capability to the service: PII detection is a feature of Azure AI Language.

Key Takeaways

PII detection is a feature of Azure AI Language (not a separate service).

Common entity types tested on AI-900: SSN, credit card number, phone number, email, IP address, person name, physical address.

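Document limits: 5,120 characters per document for synchronous requests, 125,000 characters per asynchronous request, and 5 documents per PII detection request (verify against the current service limits).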

Use the 'domain' parameter set to 'phi' to detect Protected Health Information (PHI).

Default redaction replaces each character with an asterisk; newer API versions offer an entity-mask redaction policy that replaces the entity with its category label.

Confidence scores range 0-1; typical threshold is 0.8 for high confidence.

Synchronous requests are not retained after the response is sent; asynchronous job results are kept only temporarily for retrieval.

For custom entities, use Custom Entity Recognition (Custom NER) instead of pre-built PII detection.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure AI Language PII Detection

Pre-built model for common PII types (SSN, credit card, etc.)

No training required; works out of the box

Supports redaction of detected entities

Limited to predefined entity categories

Best for standard compliance scenarios (GDPR, HIPAA)

Custom Entity Recognition (Custom NER)

Requires training with labeled data

Can detect custom entities not in pre-built list (e.g., employee IDs, contract numbers)

No built-in redaction; you must implement it yourself

More flexible but requires more effort

Best for domain-specific or proprietary data

Watch Out for These

Mistake

PII detection can detect all types of personal data, including biometric data.

Correct

Azure AI Language's built-in PII detection covers common types like SSN, credit card, phone, email, etc., but does not detect biometric data (e.g., fingerprints, iris scans). For custom data, you would need Custom Entity Recognition.

Mistake

PII detection stores the text on Microsoft servers for training.

Correct

For synchronous calls, the text is processed and not retained after the response is sent; asynchronous job results are kept only temporarily so you can retrieve them. Microsoft does not use your text to train its prebuilt models.

Mistake

You must train a custom model to detect PII.

Correct

Azure AI Language provides a pre-trained PII detection model that works out of the box. No custom training is needed. Custom Entity Recognition is for detecting custom entities not covered by the pre-built model.

Mistake

PII detection only works for English text.

Correct

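The service supports multiple languages including English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, and more. Specify the language code in the request; if it is omitted, the service assumes English, so run language detection first when the language is unknown.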

Mistake

Redaction replaces PII with the entity type name by default.

Correct

The default redaction policy replaces each character of the detected entity with an asterisk. To replace the entity with its type label (e.g., '[SSN]'), use the entity-mask option of the 'redactionPolicy' parameter, which is available in newer API versions.


Frequently Asked Questions

What is the difference between PII detection and Custom Entity Recognition?

PII detection is a pre-built feature of Azure AI Language that identifies common sensitive data types like SSN, credit card numbers, and email addresses. It requires no training and includes built-in redaction. Custom Entity Recognition (Custom NER) allows you to train a model to detect custom entities specific to your domain (e.g., product codes, employee IDs). It requires labeled training data and does not include redaction. For the AI-900 exam, remember that PII detection is pre-built and Custom NER is for custom entities.

Can I use PII detection to redact information in real-time?

Yes, the synchronous API returns results quickly (typically under 1 second for small texts). For real-time applications, you can call the API per request. However, be mindful of rate limits, which are lower on the Free tier than on the Standard tier. For high-throughput real-time scenarios, consider using the Standard tier and possibly caching results.

Does PII detection work for images or PDFs?

No, the API works on plain text. To extract text from images or PDFs, use Azure AI Vision's OCR (optical character recognition) or Azure Form Recognizer, then pass the extracted text to PII detection.

How do I handle false positives in PII detection?

You can filter by confidence score (e.g., only accept entities with score > 0.8). You can also use the piiCategories parameter to limit detection to specific entity types. For example, if you are only interested in credit card numbers, set piiCategories to ['CreditCardNumber']. Additionally, you can implement post-processing validation (e.g., Luhn algorithm for credit cards).

What languages does PII detection support?

The service supports over 30 languages, including English, Spanish, French, German, Italian, Portuguese, Chinese (Simplified and Traditional), Japanese, Korean, Arabic, Hindi, and more. The accuracy varies by language; English has the highest accuracy. Specify the language in the request; if you omit it, the service assumes English, so run language detection first when the language is unknown.

Is PII detection available in all Azure regions?

Azure AI Language is available in most Azure regions, including East US, West Europe, Southeast Asia, etc. Check the Azure Products by Region page for the latest availability. Some regions may have restricted features due to compliance requirements.

Can I use PII detection to detect PII in audio or video?

Not directly. First, transcribe audio to text using Azure Speech-to-Text, then run PII detection on the transcribed text. Similarly, for video, extract audio and transcribe it. The PII detection API only accepts text input.
