Data Classification in Microsoft Purview

This chapter covers data classification in Microsoft Purview, a core component of the Microsoft Purview compliance portal. Data classification is the process of identifying and labeling sensitive data across your organization's digital estate, including files, emails, and databases. For the SC-900 exam, this topic appears in approximately 5-10% of questions, primarily in the context of compliance solutions (Objective 4.2). Understanding how classification works, the types of classifiers, and how they integrate with sensitivity labels is essential for passing the exam and for real-world data governance.

Updated May 31, 2026

The Library Color-Coding System for Sensitive Documents

Imagine a large corporate library where every document must be labeled with a colored sticker based on its sensitivity: red for top secret, yellow for internal only, green for public. A librarian (the data classification engine) walks through the shelves (your data estate) with a set of rules: if a document contains the phrase 'patent pending', it gets a red sticker; if it has an employee's social security number, it gets a red sticker; if it's a press release, it gets a green sticker. The librarian can also learn from examples: you show her 10 documents that are red and 10 that are green, and she builds a pattern (machine learning classifier) to label similar ones. Once labeled, the library's access control system (the protection settings attached to sensitivity labels) can automatically restrict who can check out red-labeled books, encrypt them, or add watermarks. Without the color-coding (classification), the access control system has no way to know which books need protection. This is how Microsoft Purview Data Classification works: it scans content using sensitive information types, trainable classifiers, or exact data match to identify sensitive items, which can then be assigned sensitivity labels that enforce protection actions.

How It Actually Works

What is Data Classification in Microsoft Purview?

Data classification in Microsoft Purview is the process of automatically or manually identifying sensitive data (e.g., credit card numbers, health records, intellectual property) so that metadata tags called sensitivity labels can be applied to it. These labels then drive protective actions such as encryption, access restrictions, and visual markings (watermarks, headers/footers). The classification engine is part of the Microsoft Purview compliance portal and works across Microsoft 365 services (Exchange Online, SharePoint Online, OneDrive for Business, Microsoft Teams) and on-premises file shares via the Microsoft Purview Information Protection scanner.

Why Does Data Classification Exist?

Organizations generate vast amounts of data, and much of it is sensitive. Without classification, there is no way to consistently apply protection. Regulatory frameworks (GDPR, HIPAA, PCI-DSS, etc.) require organizations to know where sensitive data resides and to protect it appropriately. Data classification provides the foundation for:

Data Loss Prevention (DLP): DLP policies use classification to detect and block unauthorized sharing of sensitive data.

Sensitivity Labels: Labels are applied based on classification results, enabling encryption, rights management, and visual markings.

Microsoft Purview Data Map: Classification metadata enriches the data catalog for discovery and governance.

How Data Classification Works Internally

Data classification in Purview operates through a combination of content scanning and pattern matching. The process involves several stages:

1. Content Crawling: The Microsoft Purview Information Protection scanner (for on-premises files) or the built-in service (for cloud workloads) scans content. For SharePoint Online and OneDrive, scanning is performed by the Microsoft 365 service itself. For on-premises file shares, the scanner is installed on a Windows Server and communicates with the Microsoft Purview Information Protection service (formerly Azure Information Protection).

2. Classifier Execution: Each item is evaluated against a set of classifiers. Classifiers are rules or models that detect specific types of sensitive information. There are three main types:

- Sensitive Information Types (SITs): Predefined or custom regex-based patterns. Examples include:
  - Credit Card Number: matches the Luhn checksum and pattern of major card issuers.
  - U.S. Social Security Number: matches format XXX-XX-XXXX with valid area numbers.
  - Azure Storage Account Key: matches an 88-character base64 string.
- Trainable Classifiers: Machine learning models that learn from sample data you provide. You feed the classifier positive and negative examples (e.g., 50 resumes and 50 non-resumes), and it builds a model to classify similar content.
- Exact Data Match (EDM): Uses a database of exact sensitive values (e.g., a list of employee IDs) to match against content. The database is hashed and stored in a secure table.

3. Confidence Levels: For each match, the engine calculates a confidence level. SITs have a minimum confidence level parameter (default 75, range 1-100). Only items meeting or exceeding this threshold are classified. Trainable classifiers output a probability score; you set a threshold (e.g., 0.7) to decide when to apply a label.

4. Label Application: Based on classification results, the engine can automatically apply a sensitivity label (auto-labeling) or suggest one. Auto-labeling policies are configured in the Microsoft Purview compliance portal under Information Protection > Auto-labeling. These policies can be set to run in simulation mode first to see impact without enforcement.
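As a rough illustration of stage 2, the sketch below shows how a pattern-plus-checksum sensitive information type such as Credit Card Number might be evaluated. It is a minimal Python sketch, not Purview's actual implementation: the simplified regex `CARD_PATTERN` and the helper names `luhn_valid` and `find_card_numbers` are assumptions made for this example.

```python
import re

# Hypothetical, simplified pattern: 13-19 digits, optionally separated
# by single spaces or hyphens. Real SIT definitions are more elaborate.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return pattern matches that also pass the checksum."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]


sample = "Card 4111 1111 1111 1111 on file; order 1234 5678 9012 3456."
print(find_card_numbers(sample))  # ['4111 1111 1111 1111']
```

The second number has the right shape but fails the checksum, which is why a checksum step cuts false positives compared with a bare regex.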

Key Components, Values, Defaults, and Timers

Sensitive Information Types (SITs): Over 200 predefined types. Custom SITs can be created using regex with keyword lists and proximity rules. For example, a custom SIT for a project code might be [A-Z]{3}-\d{4} with a keyword list ["Project", "Code"] within 300 characters.

Trainable Classifiers: Microsoft provides a catalog of pre-trained, out-of-the-box classifiers (e.g., Resume, Source Code, Financial Document); the catalog has grown over time, so treat any fixed count as a snapshot. You can create up to 50 custom trainable classifiers per tenant. Training requires at least 50 positive samples and 50 negative samples, and at least 200 samples in total are recommended for better accuracy.

Exact Data Match (EDM): Supports up to 100 million rows of sensitive data. The data is hashed using SHA-256 and salted. The schema is uploaded as an XML file, and the actual data is uploaded as a CSV.

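Auto-labeling Policies: Can target all locations (Exchange, SharePoint, OneDrive, Teams) or specific ones. Policies start in simulation mode so you can review the matches before turning on enforcement; a policy can also be set to turn on automatically if it is not edited within 7 days of simulation.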

Scanner: For on-premises, the scanner runs on Windows Server 2016 or later. It can scan up to 1 million files per day depending on hardware.
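The custom SIT example in the list above (the pattern [A-Z]{3}-\d{4} with the keywords "Project" and "Code" within 300 characters) can be sketched in Python as follows. This is only an illustration of the pattern-plus-proximity idea, not how Purview evaluates rules internally; the function name `find_project_codes` and the simple substring keyword check are assumptions.

```python
import re

PROJECT_CODE = re.compile(r"\b[A-Z]{3}-\d{4}\b")  # pattern from the example above
KEYWORDS = ("project", "code")                     # supporting keywords, lowercased
PROXIMITY = 300                                    # characters on either side of a match


def find_project_codes(text: str) -> list:
    """Return pattern matches that have a supporting keyword nearby."""
    lowered = text.lower()
    hits = []
    for match in PROJECT_CODE.finditer(text):
        start = max(0, match.start() - PROXIMITY)
        window = lowered[start:match.end() + PROXIMITY]
        # Simplification: substring check, so "decode" would also count as "code".
        if any(keyword in window for keyword in KEYWORDS):
            hits.append(match.group())
    return hits


print(find_project_codes("Project kickoff: ABC-1234 approved."))  # ['ABC-1234']
print(find_project_codes("Part number XYZ-9876 shipped."))        # []
```

Requiring a nearby keyword is what keeps a generic pattern like three letters and four digits from matching every part number in the tenant.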

Configuration and Verification

To view classification results, navigate to Microsoft Purview compliance portal > Data Classification > Overview. Here you can see:

Top sensitive information types detected.

Top sensitivity labels applied.

Data classification activity over time.

You can also use Content Explorer to see a list of items with their classification. For example, a CSV file with credit card numbers will show the SIT "Credit Card Number" and any applied label.

Interaction with Related Technologies

Data classification feeds into:

Sensitivity Labels: Classification triggers label assignment. Labels can then enforce encryption, rights management, and visual markings.

DLP Policies: DLP rules can reference sensitivity labels or SITs. For example, a DLP policy might block sharing of any file labeled "Highly Confidential" with external users.

Microsoft Purview compliance portal (formerly the Microsoft 365 compliance center): All classification data is aggregated here for reporting and alerts.

Microsoft Purview Data Map: Classification metadata is exported to the data map for cataloging and discovery.
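To make the DLP interaction above concrete, here is a small Python sketch of a rule that blocks sharing of files labeled "Highly Confidential" with external users. It models the decision logic only; the `ShareRequest` type, the `contoso.com` domain, and the `evaluate` function are hypothetical and do not correspond to any Purview API.

```python
from dataclasses import dataclass

INTERNAL_DOMAIN = "contoso.com"          # hypothetical tenant domain
BLOCKED_LABELS = {"Highly Confidential"}  # labels that must not leave the organization


@dataclass(frozen=True)
class ShareRequest:
    file_label: str
    recipient: str  # email address of the person the file is shared with


def is_external(recipient: str, internal_domain: str = INTERNAL_DOMAIN) -> bool:
    """Treat any address outside the internal domain as external."""
    return recipient.rsplit("@", 1)[-1].lower() != internal_domain


def evaluate(request: ShareRequest) -> str:
    """Return 'block' when a protected label is shared externally, else 'allow'."""
    if request.file_label in BLOCKED_LABELS and is_external(request.recipient):
        return "block"
    return "allow"


print(evaluate(ShareRequest("Highly Confidential", "partner@fabrikam.com")))  # block
print(evaluate(ShareRequest("Highly Confidential", "ana@contoso.com")))       # allow
```

Note that the rule never inspects file content: it relies entirely on the label that classification and labeling put there earlier.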

Exam-Relevant Details

The SC-900 exam focuses on the purpose of data classification and its role in compliance, not on deep technical configuration of custom SITs or EDM.

You need to know the three types of classifiers: sensitive information types, trainable classifiers, and exact data match.

Understand that classification precedes protection: you classify first, then apply labels or DLP.

Remember that classification can be automatic (via policies) or manual (user applies label).

Know that the Microsoft Purview Information Protection scanner is used for on-premises file shares.

Be aware that classification is content-based, not just metadata-based. It scans actual file content and email body.

Common Misunderstandings

Classification is not the same as labeling. Classification is the process of identifying sensitive data; labeling is the action of applying a tag.

Trainable classifiers are not the same as SITs. SITs are regex-based; trainable classifiers use machine learning.

Auto-labeling can be run in simulation mode before enforcement.

The scanner can do more than report. In discovery mode it only reports classification results, but when it is configured to enforce, it can also apply sensitivity labels (and the protection they carry) to on-premises files.

Walk-Through

1

Identify Sensitive Data Sources

First, you must inventory where sensitive data resides: Exchange Online mailboxes, SharePoint Online sites, OneDrive accounts, Microsoft Teams channels, and on-premises file shares. For cloud locations, Microsoft 365 automatically indexes content. For on-premises, you deploy the Microsoft Purview Information Protection scanner on a Windows Server. The scanner requires a service account with read permissions to the file shares. You specify which repositories to scan (e.g., as UNC paths) in a content scan job. The scanner then crawls the file shares, reading file content (including Office documents, PDFs, text files, and images via OCR with the add-on). The crawl runs on demand or continuously, depending on the scan job's schedule setting.

2

Configure Classifiers and Policies

In the Microsoft Purview compliance portal, you define which classifiers to use. For sensitive information types, you can use built-in ones (e.g., ABA Routing Number, SWIFT Code) or create custom ones. For trainable classifiers, you upload sample documents and train the model. Then you create an auto-labeling policy: choose a name, select classifiers (e.g., Credit Card Number, Trainable Classifier for Resumes), choose locations (e.g., SharePoint, Exchange), and set the action (apply a sensitivity label, e.g., "Confidential"). You must decide whether to run in simulation mode first. The policy is then published and applied to content.

3

Scan and Classify Content

Once the policy is active, the classification engine scans content. For cloud locations, scanning occurs in near real-time as files are created or modified. For on-premises, the scanner runs on its schedule. During scanning, each file's content is evaluated against the configured classifiers. For SITs, the engine looks for patterns and calculates a confidence score. For trainable classifiers, the model outputs a probability. If the score meets the threshold, the item is marked as matching that classifier. Multiple classifiers can match a single item. The results are stored in the classification activity log.
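The thresholding described in this step can be sketched as follows. This is a minimal Python illustration, assuming the example values used in this chapter (a SIT confidence threshold of 75 and a trainable-classifier probability threshold of 0.7); the `Match` type and `classify_item` function are hypothetical names.

```python
from dataclasses import dataclass

# Example thresholds from this chapter: SIT confidence 75, probability 0.7.
THRESHOLDS = {"sit": 75, "trainable": 0.7}


@dataclass(frozen=True)
class Match:
    classifier: str
    kind: str     # "sit" or "trainable"
    score: float  # confidence (1-100) for SITs, probability (0-1) for trainable


def classify_item(matches, thresholds=THRESHOLDS):
    """Return the names of every classifier whose score meets its threshold."""
    return [m.classifier for m in matches if m.score >= thresholds[m.kind]]


matches = [
    Match("Credit Card Number", "sit", 85),
    Match("U.S. Social Security Number", "sit", 65),
    Match("Resume", "trainable", 0.91),
]
print(classify_item(matches))  # ['Credit Card Number', 'Resume']
```

A single item can come back with several classifier names, which mirrors the point that multiple classifiers can match the same file.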

4

Apply Sensitivity Labels

If the auto-labeling policy is set to enforce (not simulation), the matching items automatically receive the specified sensitivity label. In the cloud (SharePoint, OneDrive, Exchange), the service applies the label at the item level. For on-premises files, the scanner either reports what it finds (discovery mode) or, when configured to enforce, applies the label to the files itself. The label can then trigger protection actions: encryption (via the Azure Rights Management service), visual markings (watermark, header, footer), and access restrictions. Users see the label in Office apps. The label is stored as metadata in the file.
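The difference between simulation and enforcement can be shown with a short Python sketch. It is a conceptual model only: the `Item` and `AutoLabelPolicy` classes are hypothetical and are not Purview objects or cmdlets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    name: str
    detected: frozenset          # classifier names that matched this item
    label: Optional[str] = None  # currently applied sensitivity label, if any


@dataclass
class AutoLabelPolicy:
    classifiers: frozenset  # classifier names the policy looks for
    label: str              # label to apply on a match
    simulation: bool = True

    def run(self, items):
        """Return names of matching items; apply the label only when enforcing."""
        matched = []
        for item in items:
            if item.detected & self.classifiers:
                matched.append(item.name)
                if not self.simulation:
                    item.label = self.label
        return matched


items = [
    Item("cards.csv", frozenset({"Credit Card Number"})),
    Item("menu.docx", frozenset()),
]
policy = AutoLabelPolicy(frozenset({"Credit Card Number"}), "Confidential")

print(policy.run(items), items[0].label)  # ['cards.csv'] None
policy.simulation = False
print(policy.run(items), items[0].label)  # ['cards.csv'] Confidential
```

Both runs report the same matches; only the enforcing run changes the item, which is the point of reviewing a simulation first.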

5

Monitor and Refine Classification

After deployment, you monitor classification results in the Data Classification dashboard. You can view the number of items classified by type, the top locations, and false positive/negative rates. Use Content Explorer to drill into specific items. If too many false positives occur, adjust the confidence threshold for SITs or retrain trainable classifiers with more samples. You can also export classification reports for compliance audits. The process is iterative: as new data types emerge, you update classifiers and policies.

What This Looks Like on the Job

Enterprise Scenario 1: Healthcare Compliance (HIPAA)

A large hospital network must ensure that patient health information (PHI) is protected. They use Microsoft Purview to classify emails and documents containing PHI. They create a custom sensitive information type that detects medical record numbers (format: MRN-XXXXXXXX) and a trainable classifier for clinical notes. They deploy the on-premises scanner to file shares containing legacy patient records. The auto-labeling policy applies the "Highly Confidential" sensitivity label to any item matching these classifiers. The label enforces encryption and restricts access to authorized clinical staff. The compliance team monitors the dashboard weekly. A common issue: false positives from non-PHI documents that happen to contain similar patterns (e.g., a research paper with a sample MRN). They adjust the confidence threshold from 75 to 85 and add keyword exclusions (e.g., "sample" or "test") to reduce noise.

Enterprise Scenario 2: Financial Services (PCI-DSS)

A bank must protect credit card numbers (PANs) across its environment. They use the built-in "Credit Card Number" sensitive information type. They also use Exact Data Match (EDM) to detect a list of compromised card numbers from a breach database. EDM requires uploading a CSV of card numbers, hashed and salted. The auto-labeling policy applies the "Confidential - Financial" label. The DLP policy blocks sharing of labeled files externally. Performance consideration: EDM matching is slower than regex because it involves hash comparison. The bank schedules scanning during off-peak hours. A misconfiguration: if the EDM schema does not include the correct column delimiter, no matches occur. The admin must validate the schema XML carefully.
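The hash-based matching in this scenario can be illustrated with a minimal Python sketch. It shows the general idea of comparing salted SHA-256 hashes instead of plaintext; the salt value, the normalization step, the sample IDs, and the function names are assumptions for this example and do not reflect the actual EDM upload tooling.

```python
import hashlib
import re

SALT = b"example-salt"  # hypothetical; a real deployment would use a secret salt


def hash_value(value: str, salt: bytes = SALT) -> str:
    """Normalize a value and return its salted SHA-256 hex digest."""
    normalized = value.strip().lower()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()


def build_table(values) -> set:
    """Build the lookup table; only hashes are stored, never plaintext."""
    return {hash_value(v) for v in values}


def edm_matches(text: str, table: set, candidate_pattern: str) -> list:
    """Hash each candidate token found in the text and keep exact matches."""
    candidates = re.findall(candidate_pattern, text)
    return [token for token in candidates if hash_value(token) in table]


table = build_table(["EMP-10045", "EMP-20391"])
text = "Reviewed by EMP-10045 and EMP-99999."
print(edm_matches(text, table, r"EMP-\d{5}"))  # ['EMP-10045']
```

Because every candidate must be hashed before lookup, this approach does more work per item than a regex alone, which is consistent with the performance consideration mentioned above.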

Enterprise Scenario 3: Intellectual Property Protection

A technology company wants to prevent source code leaks. They use the built-in trainable classifier "Source Code" and create a custom SIT for internal project codenames (e.g., "Project Aurora"). The auto-labeling policy applies the "Top Secret" label to source code files and documents with codenames. The label applies a watermark "CONFIDENTIAL" and restricts access to the development team only. The scanner runs on file shares containing legacy code repositories. A common problem: the trainable classifier may misclassify non-code text (e.g., a document describing code architecture) as source code. The admin retrains the classifier with more negative examples (e.g., design documents, emails). The classifier improves over time.
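Why more negative examples help can be seen in a toy model. The Python sketch below uses a tiny Laplace-smoothed word-frequency (naive Bayes style) classifier as a stand-in; Purview's trainable classifiers are not documented here as using this technique, and the sample texts, the function names, and the deliberately contrived negative sample are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z_]+", text.lower())


def train(positives, negatives):
    """Count word frequencies in the positive and negative sample sets."""
    pos = Counter(t for doc in positives for t in tokenize(doc))
    neg = Counter(t for doc in negatives for t in tokenize(doc))
    return pos, neg, set(pos) | set(neg)


def probability(model, text):
    """Estimate P(positive) from smoothed word log-odds, assuming equal priors."""
    pos, neg, vocab = model
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    log_odds = 0.0
    for token in tokenize(text):
        if token in vocab:  # ignore words never seen in training
            log_odds += math.log((pos[token] + 1) / pos_total)
            log_odds -= math.log((neg[token] + 1) / neg_total)
    return 1 / (1 + math.exp(-log_odds))


code_samples = ["def main return value", "import os def run return"]
other_samples = ["meeting agenda for monday", "please review the attached report"]
design_doc = "design note the function should return a value"

before = probability(train(code_samples, other_samples), design_doc)
# Retrain with an extra negative example that resembles design documents.
other_samples.append("design note the function should return a value to the caller")
after = probability(train(code_samples, other_samples), design_doc)
print(before > 0.7, after < 0.5)  # True True
```

Before retraining, the design note shares words such as "return" and "value" with the code samples and scores above a 0.7 threshold; once a similar document is supplied as a negative, the same text scores well below it.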

How SC-900 Actually Tests This

What SC-900 Tests on Data Classification

The SC-900 exam objective 4.2 focuses on "Describe the capabilities of Microsoft Purview Data Classification." Specifically, you need to know:

The purpose of data classification: To identify sensitive data and enable protection.

Types of classifiers: Sensitive information types (regex-based), trainable classifiers (ML-based), and exact data match (database lookup).

How classification integrates with sensitivity labels and DLP.

The role of the Microsoft Purview Information Protection scanner for on-premises data.

Simulation mode for auto-labeling policies.

Common Wrong Answers and Why Candidates Choose Them

1. "Classification and labeling are the same thing." Wrong because classification is the identification step; labeling is the application of a tag. Candidates confuse the terms because they are often used together.

2. "Trainable classifiers are the same as sensitive information types." Wrong because SITs use regex, while trainable classifiers use machine learning. Candidates may think all classifiers are rules-based.

3. "Data classification only works for cloud data." Wrong because the scanner supports on-premises file shares. Candidates forget the on-premises capability.

4. "Auto-labeling policies apply labels immediately without testing." Wrong because simulation mode exists. Candidates assume automatic means immediate enforcement.

Specific Numbers and Terms That Appear on the Exam

Sensitive Information Types (SITs): Over 200 predefined types.

Trainable Classifiers: a catalog of built-in (pre-trained) classifiers, plus up to 50 custom.

Exact Data Match (EDM): Up to 100 million rows.

Confidence Level: Default 75 (range 1-100).

Scanner: Runs on Windows Server 2016+.

Edge Cases and Exceptions

Images: Classification can detect text in images via OCR (requires OCR add-on license).

Encrypted files: Classification cannot scan encrypted files unless decrypted (e.g., by using the service's decryption capability).

File size limit: The scanner can skip files larger than a configurable size (default 100 MB).

How to Eliminate Wrong Answers

If an answer says classification is only for email, eliminate it (works for files, Teams, etc.).

If an answer says trainable classifiers require no training, eliminate it (they require sample data).

If an answer says classification automatically encrypts data, eliminate it (classification only identifies; labeling encrypts).

Look for keywords: "identify" vs "protect", "regex" vs "ML", "cloud only" vs "on-premises".

Key Takeaways

Data classification identifies sensitive data using sensitive information types, trainable classifiers, or exact data match.

Classification is the prerequisite for applying sensitivity labels and DLP policies.

The Microsoft Purview Information Protection scanner enables classification of on-premises file shares.

Auto-labeling policies can run in simulation mode to test impact before enforcement.

There are over 200 predefined sensitive information types covering common regulations.

Trainable classifiers require at least 50 positive and 50 negative samples for training.

Exact Data Match (EDM) uses hashed values for matching, not plaintext.

Classification results are visible in the Data Classification dashboard and Content Explorer.

Confidence level for SITs defaults to 75 (range 1-100); adjust to reduce false positives.

Data classification supports both cloud (Exchange, SharePoint, OneDrive, Teams) and on-premises data.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Sensitive Information Types (SITs)

Based on regex patterns and keyword lists.

No training required; works out of the box.

Over 200 predefined types for common data (credit cards, SSNs).

Custom SITs can be created with regex and keywords.

Best for structured data with consistent formats.

Trainable Classifiers

Based on machine learning models.

Requires training with sample documents (at least 50 positives and 50 negatives).

Built-in (pre-trained) classifiers are available (e.g., Resume, Source Code).

Up to 50 custom trainable classifiers per tenant.

Best for unstructured data where patterns vary (e.g., contracts, resumes).

Auto-labeling (Automatic)

Labels are applied automatically based on classification policies.

Can run in simulation mode to test before enforcement.

Applies to data at rest (existing files in SharePoint and OneDrive) and data in transit (email as it is sent through Exchange).

Requires no user intervention; consistent enforcement.

Can be configured to apply labels to all matching content across the tenant.

Manual Labeling by Users

Users manually select a sensitivity label from Office apps (e.g., Outlook, Word).

Depends on user training and compliance; inconsistent if users forget.

Applies only to content the user is creating or editing.

Gives users flexibility to choose the appropriate label.

Often combined with policy settings that apply a default label or require users to choose one (mandatory labeling).

Watch Out for These

Mistake

Data classification and sensitivity labeling are the same thing.

Correct

Classification is the process of identifying sensitive data; labeling is the action of applying a metadata tag (sensitivity label) to that data. They are distinct steps in the protection workflow.

Mistake

Trainable classifiers work immediately without any training data.

Correct

Trainable classifiers require a training phase where you provide at least 50 positive and 50 negative sample documents. The model learns from these examples before it can classify new content.

Mistake

Data classification only works on files stored in the cloud (SharePoint, OneDrive).

Correct

Microsoft Purview Data Classification also supports on-premises file shares via the Microsoft Purview Information Protection scanner, which can scan local and network file systems.

Mistake

Auto-labeling policies immediately enforce labeling on all existing content.

Correct

Auto-labeling policies can be run in simulation mode first, which shows what labels would be applied without actually applying them. Enforcement is only activated after you turn off simulation.

Mistake

Exact Data Match (EDM) uses the actual sensitive values in plaintext for matching.

Correct

EDM hashes the sensitive data using SHA256 with a salt before uploading. Matching is done against the hashed values, so the plaintext values are never stored in the cloud.

Frequently Asked Questions

What is the difference between data classification and sensitivity labels in Microsoft Purview?

Data classification is the process of identifying sensitive data using classifiers (SITs, trainable, EDM). Sensitivity labels are the metadata tags applied to that data to enforce protection (encryption, access restrictions, visual markings). Classification comes first; labeling is the action. For example, a classifier detects a credit card number, and then a sensitivity label like 'Confidential' is automatically applied.

Can data classification scan on-premises file shares?

Yes, using the Microsoft Purview Information Protection scanner. This scanner runs on a Windows Server and can scan local and network file shares. It reports classification results to the Microsoft Purview compliance portal. In discovery mode the scanner only reports what it finds; when configured to enforce, it can also apply sensitivity labels to the files it scans.

What are the three types of classifiers in Microsoft Purview Data Classification?

The three types are: (1) Sensitive Information Types (SITs) – regex-based patterns for common data like credit cards and SSNs; (2) Trainable Classifiers – machine learning models that learn from sample documents; (3) Exact Data Match (EDM) – matches against a database of exact values (hashed). All three can be used in auto-labeling policies.

How do I test auto-labeling policies before enforcing them?

When creating an auto-labeling policy, you can choose to run it in simulation mode. In simulation mode, the policy scans content and reports which items would be labeled, but no labels are actually applied. You can review the matched items in the policy's simulation results. After analysis, you can turn the policy on to enforce labeling.

What is the minimum number of samples needed to train a custom trainable classifier?

You need at least 50 positive samples and 50 negative samples, with a total of at least 200 samples recommended for optimal accuracy. The samples should be representative of the content you want to classify. After training, you test the classifier and can refine it with additional samples if needed.

Can data classification detect sensitive data in images?

Yes, if you have the OCR (Optical Character Recognition) add-on license, the scanner can extract text from images (JPEG, PNG, TIFF, BMP) and then apply classifiers. Without OCR, images are not scanned for text. OCR is available as an add-on for the Microsoft Purview Information Protection scanner.

What happens if a file is encrypted? Can data classification scan it?

By default, encrypted files cannot be scanned because the content is not accessible. However, Microsoft Purview can decrypt certain encrypted files (e.g., those protected by Azure Information Protection) if the scanner has the appropriate decryption rights. Other encryption (e.g., third-party) will block scanning.
