SC-900 · Chapter 102 of 103 · Objective 4.2

Trainable Classifiers for Content Classification

This chapter covers trainable classifiers in Microsoft 365, a key component of Microsoft Purview's compliance solutions. Trainable classifiers enable automatic content classification based on machine learning models that learn from sample documents. Understanding how to create, train, and apply trainable classifiers is essential for the SC-900 exam, which tests your ability to distinguish between trainable classifiers and other classification methods like sensitive information types (SITs) and exact data match (EDM). Expect approximately 5-10% of exam questions to touch on content classification, with trainable classifiers being a significant subtopic.

Updated May 31, 2026

The Document Sorting Machine

Imagine a postal sorting facility that receives millions of pieces of mail daily and must sort them into bins: bills, personal letters, advertisements, and legal documents. Sorting is based not on the sender's name or the envelope color but on the content inside. A high-speed scanner opens each envelope and reads the text. A simple rule-based machine would look only for fixed keywords: if it finds "invoice," "amount due," or "payment terms," the letter goes into the bills bin; if it sees "Your credit card has been approved" or "limited-time offer," it goes to advertisements. The facility manager can go further by providing sample documents of each type, and the machine learns from these samples what to look for. Importantly, the trained machine does not just match exact words; it takes context into account. For instance, it learns that "Dear" followed by a name is likely a personal letter, but "Dear Valued Customer" might be an advertisement. This machine is a trainable classifier: it is trained on example content and can then automatically categorize new documents based on what it learned.

How It Actually Works

What Are Trainable Classifiers?

Trainable classifiers are machine learning models in Microsoft 365 that automatically identify and classify content based on its meaning and context, not just by matching predefined patterns or keywords. They are part of Microsoft Purview's data classification and compliance capabilities. Trainable classifiers are "trainable" because you can train them with your own sample documents to recognize specific types of content relevant to your organization, such as contracts, resumes, or financial reports.

Why Trainable Classifiers Exist

Traditional classification methods like sensitive information types (SITs) rely on regex patterns, keywords, or checksums to detect sensitive data (e.g., credit card numbers, Social Security numbers). However, many document types are not defined by such rigid patterns. For example, a contract can vary widely in wording but still be recognizable by its structure, clauses, and legal language. Trainable classifiers address this gap by using machine learning to understand the semantic content of documents.
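To make the contrast concrete, here is a minimal Python sketch of the pattern-plus-checksum style of detection that SITs rely on. It is an illustration only, not Microsoft's implementation: the regex, the function names, and the 13-19 digit range are assumptions chosen for the example. Notice that nothing in it could recognize a contract, which has no fixed pattern to match.

```python
import re

# Assumed pattern for the example: 13-19 digits, optionally separated
# by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings that match the pattern and pass the checksum."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

Calling `find_card_numbers("Card: 4111 1111 1111 1111 on file")` returns the well-known test number, while a sixteen-digit order number that fails the checksum is ignored. The rule is precise but rigid, which is exactly the gap trainable classifiers are meant to fill.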

How Trainable Classifiers Work Internally

Trainable classifiers use a machine learning algorithm called a linear classifier, specifically a support vector machine (SVM). The training process involves three main phases:

1. Seed Phase: You provide at least 50 sample documents that are positive examples of the content you want to classify (e.g., 50 contracts). You can also optionally provide negative examples (documents that are similar but not the target type). The classifier analyzes these documents, extracting features such as words, phrases, and their frequency (TF-IDF - term frequency-inverse document frequency). It then builds an initial model.

2. Review Phase: After seeding, the classifier processes a larger set of documents (up to 500) from your environment and assigns each a probability score (0-100) indicating how likely it is to match your target content. You review these results, confirming true positives and correcting false positives/negatives. Each confirmed document is fed back into the model, refining it.

3. Publish Phase: Once you are satisfied with the accuracy (typically after reviewing at least 200 documents and achieving a confidence level above 70%), you publish the classifier. It then becomes available for use in auto-labeling policies, retention labels, sensitivity labels, DLP policies, and Microsoft 365 compliance features.
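The seed-then-score idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the model Purview runs: it computes TF-IDF vectors and scores new text with a simple centroid-difference linear rule (a Rocchio-style stand-in chosen for brevity, not an SVM). All function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def build_idf(docs: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency for every term in the seed set."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def tfidf(text: str, idf: dict[str, float]) -> dict[str, float]:
    """L2-normalised TF-IDF vector; terms unseen in the seed set are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def centroid(vectors: list[dict[str, float]]) -> dict[str, float]:
    total = Counter()
    for v in vectors:
        total.update(v)  # adds the weights term by term
    return {t: s / len(vectors) for t, s in total.items()}


def train(positives: list[str], negatives: list[str]):
    """Seed step: learn one linear weight per term from labelled samples."""
    idf = build_idf(positives + negatives)
    pos_c = centroid([tfidf(d, idf) for d in positives])
    neg_c = centroid([tfidf(d, idf) for d in negatives])
    terms = set(pos_c) | set(neg_c)
    weights = {t: pos_c.get(t, 0.0) - neg_c.get(t, 0.0) for t in terms}
    return idf, weights


def score(text: str, model) -> float:
    """Positive scores lean toward the target type, negative scores away."""
    idf, weights = model
    return sum(w * weights.get(t, 0.0) for t, w in tfidf(text, idf).items())


positives = [
    "This agreement is made between the parties and sets out the payment terms",
    "The parties agree that this agreement shall terminate upon breach of its terms",
]
negatives = [
    "Team lunch is on Friday, please reply with your order",
    "Reminder: the office will be closed on Monday for the holiday",
]
model = train(positives, negatives)
```

With this toy seed set, `score("The parties enter into this agreement under the following terms", model)` is positive and `score("Please reply if you are joining lunch on Friday", model)` is negative. A real review phase would then feed reviewer corrections back as additional labelled samples and retrain.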

Key Components and Defaults

Number of seed samples: Minimum 50 positive samples; up to 500 can be provided. Microsoft recommends 100-200 for best results.

Review phase sample size: The classifier processes up to 500 documents for review, but you must review at least 200 before publishing.

Confidence threshold: The default threshold for classification is 70%. You can adjust this in the classifier settings. Higher thresholds reduce false positives but may miss some matches.

Supported languages: English, French, German, Italian, Spanish, Dutch, Portuguese, Chinese (Simplified), Japanese, Korean, and more. The classifier automatically detects the language of the document.

Supported file types: Word documents (.docx), PDFs, PowerPoint presentations (.pptx), Excel spreadsheets (.xlsx), and text files (.txt). Emails and meeting invitations are not supported for training but can be classified after publishing.

Retraining: You can retrain a published classifier by adding more seed samples or providing feedback on its classifications.
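The threshold trade-off described above is easy to see with a small worked example. The sketch below is hypothetical: the scores and labels are made up, and `confusion_at` is an illustrative helper, not a Purview API. It counts outcomes when every item scoring at or above the threshold is treated as a match.

```python
def confusion_at(threshold: int, scored_items: list[tuple[int, bool]]) -> dict[str, int]:
    """Count true/false positives and negatives at a given score threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for item_score, is_match in scored_items:
        predicted = item_score >= threshold
        if predicted and is_match:
            counts["tp"] += 1
        elif predicted and not is_match:
            counts["fp"] += 1
        elif not predicted and is_match:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# (score from 0 to 100, whether the document really is the target type)
items = [
    (95, True), (88, True), (74, True), (66, True),
    (81, False), (72, False), (40, False), (15, False),
]

print(confusion_at(70, items))  # {'tp': 3, 'fp': 2, 'fn': 1, 'tn': 2}
print(confusion_at(85, items))  # {'tp': 2, 'fp': 0, 'fn': 2, 'tn': 4}
```

Raising the threshold from 70 to 85 removes both false positives in this sample but doubles the missed matches, which is the behavior the confidence threshold entry describes.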

Configuration and Verification

Trainable classifiers are created, trained, and published in the Microsoft Purview compliance portal: navigate to Data classification > Trainable classifiers. Security & Compliance PowerShell does not provide documented cmdlets for creating or training a trainable classifier; cmdlets such as Get-DlpSensitiveInformationType work with sensitive information types, which are a separate classification method.

Interaction with Related Technologies

Trainable classifiers can be used in:

Auto-labeling policies: Automatically apply sensitivity labels or retention labels to documents that match the classifier.

Data Loss Prevention (DLP) policies: Trigger DLP rules when content matches a trainable classifier.

Microsoft 365 compliance features: Used in solutions such as communication compliance and insider risk management.

Microsoft 365 Defender: Can be used for data spillage detection.

Limitations

Trainable classifiers cannot be trained on emails or meeting invitations directly; they work only on documents for training.

A trainable classifier is a separate mechanism from sensitive information types (SITs); it does not use SIT patterns, built-in or custom, as part of its model.

Training requires a minimum of 50 positive samples, and the quality of samples heavily impacts accuracy.

The classifier is not real-time; it processes documents via a scheduled scan (every 24 hours by default).

Once published, you cannot delete a classifier; you can only retire it (stop using it).

Exam Relevance

For SC-900, you need to know:

The difference between trainable classifiers and SITs (trainable classifiers use machine learning; SITs use pattern matching).

The three phases: seed, review, publish.

Minimum counts: 50 seed samples, and at least 200 reviewed documents before publishing.

Default confidence threshold: 70%.

Use cases: contracts, invoices, resumes, any content that does not have a fixed pattern.

The fact that trainable classifiers are part of Microsoft Purview, not Azure or Defender.

Common Trap Patterns

Trap 1: Confusing trainable classifiers with sensitive information types (SITs). SITs are for detecting patterns like credit card numbers; trainable classifiers are for content types without fixed patterns.

Trap 2: Thinking you can train classifiers on emails. You cannot; training is only on documents.

Trap 3: Believing that trainable classifiers require no human review. The review phase is mandatory before publishing.

Trap 4: Assuming the classifier works instantly. It processes documents on a schedule (every 24 hours).

Walk-Through

1

Identify content type to classify

Determine the type of content you want to automatically classify. This could be contracts, resumes, invoices, or any document type that does not have a fixed pattern like a credit card number. Ensure you have at least 50 sample documents that are positive examples of this content type. Optionally, gather negative examples (similar but not matching) to improve accuracy. The samples must be stored in SharePoint Online, OneDrive, or Exchange Online (as attachments).

2

Create the trainable classifier in Purview

Navigate to Microsoft Purview compliance portal > Data classification > Trainable classifiers. Click 'Create trainable classifier'. Provide a name and description. Upload your seed samples (minimum 50). You can upload them as a .zip file containing the documents. The system will process the samples and begin building the initial model. This phase is called the 'seed phase'.

3

Review and refine classifier results

After seeding, the classifier enters the 'review phase'. It scans up to 500 documents from your tenant and assigns each a confidence score. You must review at least 200 of these documents to confirm whether the classifier correctly identified them. For each document, you mark it as 'Yes' (correct match), 'No' (false positive), or 'Skip'. This feedback is used to retrain the model. The more accurate feedback you provide, the better the classifier becomes.

4

Publish the classifier

Once you have reviewed at least 200 documents and are satisfied with the accuracy (typically a confidence score above 70%), you can publish the classifier. Publishing makes it available for use in auto-labeling policies, DLP policies, and other compliance features. After publishing, the classifier will continuously scan new and modified documents in your environment.

5

Apply the classifier in policies

After publishing, you can use the classifier in various policies. For example, create an auto-labeling policy to automatically apply a sensitivity label to documents classified as 'Contract'. Or add the classifier as a condition in a DLP policy to block sharing of classified content. The classifier appears in the list of conditions as 'Content contains a trainable classifier'.
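The gating logic in this walk-through can be summarized as a small readiness check. The sketch below is a study aid, not a Purview API: the constants are the figures quoted in this chapter, the names are hypothetical, and two assumptions are made for the example (skipped items do not count as reviewed, and "accuracy" is the share of reviewed predictions marked 'Yes').

```python
from dataclasses import dataclass

# Figures quoted in this chapter.
MIN_SEED_SAMPLES = 50
MIN_REVIEWED_ITEMS = 200
TARGET_AGREEMENT = 0.70


@dataclass
class ReviewTally:
    yes: int = 0   # reviewer confirmed the classifier's prediction
    no: int = 0    # reviewer rejected the prediction
    skip: int = 0  # assumption: skipped items do not count as reviewed

    @property
    def reviewed(self) -> int:
        return self.yes + self.no

    @property
    def agreement(self) -> float:
        return self.yes / self.reviewed if self.reviewed else 0.0


def publish_blockers(seed_count: int, tally: ReviewTally) -> list[str]:
    """Return the reasons a classifier is not ready to publish (empty if ready)."""
    blockers = []
    if seed_count < MIN_SEED_SAMPLES:
        blockers.append(f"need {MIN_SEED_SAMPLES - seed_count} more seed samples")
    if tally.reviewed < MIN_REVIEWED_ITEMS:
        blockers.append(f"need {MIN_REVIEWED_ITEMS - tally.reviewed} more reviewed items")
    if tally.agreement < TARGET_AGREEMENT:
        blockers.append(f"agreement {tally.agreement:.0%} is below {TARGET_AGREEMENT:.0%}")
    return blockers
```

For example, `publish_blockers(150, ReviewTally(yes=170, no=30, skip=12))` returns an empty list, while `publish_blockers(30, ReviewTally(yes=90, no=60))` reports all three shortfalls.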

What This Looks Like on the Job

Scenario 1: Automating Contract Classification

A large law firm receives thousands of contracts annually. It needs every contract to carry the sensitivity label 'Highly Confidential' and to be retained for 7 years. The firm trains a classifier on 150 sample contracts, publishes it, and uses it in an auto-labeling policy that applies the sensitivity label and a retention label. The firm also creates a DLP policy that blocks sharing of contracts with external users. In production, the classifier processes documents daily. A common issue is false positives, where the classifier labels non-contract legal documents (e.g., memos) as contracts. To mitigate this, the firm periodically reviews the classifier's accuracy and retrains it with additional negative samples.

Scenario 2: Detecting Resumes in HR Systems

A multinational company wants to automatically detect resumes uploaded to SharePoint and apply a retention label of 2 years. They create a trainable classifier for resumes. The HR team provides 200 resumes from previous hiring rounds. After publishing, the classifier scans all SharePoint sites. However, they encounter performance issues: the classifier takes up to 24 hours to process new documents, which is too slow for real-time DLP. They learn that trainable classifiers are not real-time; they run on a daily schedule. For near-real-time detection, they combine the classifier with a sensitive information type that detects common resume phrases like "Work Experience" and "Education."

Scenario 3: Financial Report Classification

A bank needs to classify quarterly financial reports to ensure they are encrypted before being shared. They train a classifier on 100 financial reports. During the review phase, they notice the classifier has a high false positive rate, flagging press releases as financial reports. They add 50 negative samples (press releases) and retrain. After retraining, accuracy improves. They publish the classifier and use it in a DLP policy that triggers encryption. A misconfiguration occurs when the DLP policy is set to block all sharing of classified content, but some reports need to be shared with auditors. They create an exception for specific recipients. This highlights the importance of testing policies in audit mode first.

How SC-900 Actually Tests This

Exactly What SC-900 Tests

SC-900 objective 4.2 covers "Describe the capabilities of Microsoft Purview." Specifically, you must understand trainable classifiers as a method for content classification. The exam focuses on:

Distinguishing trainable classifiers from sensitive information types (SITs): SITs use pattern matching; trainable classifiers use machine learning.

The three phases: seed, review, publish.

Minimum requirements: 50 seed samples, 200 reviewed documents before publishing.

Default confidence threshold: 70%.

Use cases: Contracts, resumes, invoices, any content without a fixed pattern.

Where to manage: Microsoft Purview compliance portal > Data classification > Trainable classifiers.

Common Wrong Answers and Why

1. Wrong: "Trainable classifiers can be trained on emails." Why: Candidates confuse the ability to classify emails (yes, after publishing) with training on emails (no, training is only on documents).

2. Wrong: "You need at least 500 seed samples." Why: The minimum is 50; 500 is the maximum for seed samples. Candidates misremember the review phase sample size (500) as the seed requirement.

3. Wrong: "Trainable classifiers are real-time." Why: They run on a schedule (every 24 hours). Candidates may assume DLP policies are real-time, but the classifier itself is not.

4. Wrong: "You can delete a published classifier." Why: Once published, a classifier cannot be deleted; only retired. Candidates may think they can remove it like any other object.

Specific Numbers and Terms

Minimum seed samples: 50

Recommended seed samples: 100-200

Review phase sample size: Up to 500, must review at least 200

Default confidence threshold: 70%

Supported file types for training: .docx, .pdf, .pptx, .xlsx, .txt

Phases: Seed, Review, Publish

Edge Cases

If you do not have 50 seed samples, you cannot create a classifier. The exam may present a scenario where a company wants to classify a rare document type and ask what is required.

If you publish a classifier with low accuracy, you can retrain it by adding more seed samples or providing feedback. The exam may test that retraining is possible.

The classifier can only be used in the same tenant where it was trained. You cannot export or import classifiers between tenants.

How to Eliminate Wrong Answers

If the question mentions "pattern matching" or "regular expressions," it is about SITs, not trainable classifiers.

If the question says "requires no human intervention," it is likely wrong because the review phase requires human feedback.

If the question says "works instantly," it is wrong because classifiers run on a schedule.

If the question says "can be trained on emails," it is wrong; only documents are used for training.

Key Takeaways

Trainable classifiers use machine learning (SVM) to classify content based on meaning, not patterns.

Minimum 50 seed samples are required to create a classifier; recommended 100-200.

The three phases are: seed, review, publish.

You must review at least 200 documents in the review phase before publishing.

Default confidence threshold is 70%; can be adjusted.

Trainable classifiers are managed in Microsoft Purview compliance portal under Data classification.

They can be used in auto-labeling, DLP, communication compliance, and retention policies.

Training is only on documents; emails cannot be used as seed samples.

Published classifiers can be retrained but not deleted.

Classifiers run on a 24-hour schedule, not real-time.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Trainable Classifiers

Uses machine learning (SVM) to classify content

Requires at least 50 seed samples for training

Can identify content without fixed patterns (e.g., contracts)

Requires human review before publishing

Supported file types: .docx, .pdf, .pptx, .xlsx, .txt

Sensitive Information Types (SITs)

Uses pattern matching (regex, keywords, checksums)

No training required; predefined or custom with XML

Identifies specific sensitive data (e.g., credit card numbers)

Can be used immediately after creation

Supports more file types including emails and images (OCR)

Watch Out for These

Mistake

Trainable classifiers are the same as sensitive information types (SITs).

Correct

SITs use pattern matching (regex, keywords) to detect specific sensitive data like credit card numbers. Trainable classifiers use machine learning to identify content based on meaning and context, such as contracts or resumes, which do not have fixed patterns.

Mistake

You need at least 500 seed samples to create a trainable classifier.

Correct

The minimum number of seed samples is 50. You can provide up to 500, but 50 is the minimum. The review phase processes up to 500 documents, which may cause confusion.

Mistake

Trainable classifiers can be trained on emails and meeting invitations.

Correct

Training is only supported on documents (Word, PDF, PowerPoint, Excel, text). Emails and meeting invitations cannot be used as seed samples. However, after publishing, the classifier can scan and classify emails.

Mistake

Once published, a trainable classifier cannot be modified or retrained.

Correct

You can retrain a published classifier by adding more seed samples or providing feedback on its classifications. You cannot delete a published classifier, but you can retire it.

Mistake

Trainable classifiers provide real-time classification.

Correct

Trainable classifiers run on a scheduled basis (every 24 hours by default). They are not real-time. For near-real-time needs, combine with other methods like SITs.

Frequently Asked Questions

What is the difference between a trainable classifier and a sensitive information type?

A trainable classifier uses machine learning to identify content based on its meaning and context, such as contracts or resumes. A sensitive information type (SIT) uses pattern matching (regex, keywords, checksums) to detect specific sensitive data like credit card numbers or Social Security numbers. Trainable classifiers require training with sample documents, while SITs are predefined or custom-built with rules.

How many seed samples do I need to create a trainable classifier?

You need a minimum of 50 positive seed samples. Microsoft recommends 100-200 for best accuracy. You can also provide up to 500 seed samples. Additionally, you can optionally provide negative samples to improve the classifier's ability to distinguish your target content from similar content.

Can I train a trainable classifier on emails?

No, you cannot train a trainable classifier directly on emails. Training is only supported on documents such as Word (.docx), PDF, PowerPoint (.pptx), Excel (.xlsx), and text (.txt) files. However, after the classifier is published, it can scan and classify emails and other content in your environment.

How long does it take for a trainable classifier to process documents?

Trainable classifiers run on a scheduled basis, typically every 24 hours. They are not real-time. When you create or retrain a classifier, it may take up to 24 hours for the classifier to process documents and update its results. For near-real-time classification, consider using sensitive information types or exact data match.

Can I delete a published trainable classifier?

No, once a trainable classifier is published, it cannot be deleted. You can only retire it, which stops it from being used in policies. Retiring a classifier does not remove it from the list; it remains in a retired state. If you need to completely remove it, you must contact Microsoft support.

What is the default confidence threshold for a trainable classifier?

The default confidence threshold is 70%. This means that the classifier must be at least 70% confident that a document matches the trained content type before it is classified. You can adjust this threshold in the classifier settings. A higher threshold reduces false positives but may increase false negatives.

Can I use a trainable classifier in multiple tenants?

No, trainable classifiers are tenant-specific. They cannot be exported or imported between tenants. Each tenant must create and train its own classifiers. This is because the training data and model are stored within the tenant's environment for security and privacy reasons.
