This chapter covers Exact Data Match (EDM) for sensitive info types in Microsoft Purview, a key topic under Compliance Solutions for the SC-900 exam. EDM enables organizations to detect and protect sensitive data with high precision by matching against custom databases of exact values. Approximately 5-10% of SC-900 exam questions touch on data classification and sensitive info types, with EDM being a specific focus within objective 4.3. Understanding EDM's mechanism, configuration, and use cases is essential for exam success.
Exact Data Match (EDM) is like a company that maintains a private database of employee fingerprints for high-security access. In a typical fingerprint scanner, the system compares your finger against a built-in library of generic fingerprint patterns. But with EDM, the company has a custom database containing only the fingerprints of its employees—each fingerprint corresponds to a specific employee ID, name, and role. When you scan your finger, the system doesn't just check if it matches a generic pattern; it checks against the exact fingerprints in the database. If it finds a match, it can immediately identify you by name and role and grant access accordingly. The database is carefully protected, and only authorized administrators can add or remove fingerprints. This is exactly how EDM works for sensitive info types: you define a custom database of sensitive data (like employee IDs or credit card numbers) that you want to detect. Microsoft Purview Information Protection uses this database to scan content—emails, documents, etc.—and when it finds an exact match, it can apply protection policies. The key is that EDM uses exact matches, not patterns, so it's highly accurate but requires you to maintain the database. Just as the fingerprint database must be kept up-to-date with new hires and departures, your EDM database must be refreshed with current data to avoid false positives or missed detections.
What is Exact Data Match (EDM) and Why It Exists
Exact Data Match (EDM) is a feature in Microsoft Purview Information Protection that allows organizations to create custom sensitive information types based on exact values stored in a database. Unlike built-in sensitive info types (e.g., credit card numbers, social security numbers) that rely on pattern recognition and regular expressions, EDM uses a hash-based matching engine to compare content against a predefined dataset. This enables detection of specific, structured data such as employee IDs, customer account numbers, or proprietary part numbers that follow no standard pattern.
The primary reason EDM exists is to address the limitations of pattern-based detection. Pattern-based sensitive info types can generate false positives because they match any string that fits the pattern, even if the actual data is not sensitive. For example, a built-in credit card number detector might flag a test number like "4111-1111-1111-1111" even though it's not a real credit card. EDM eliminates this by only matching against a curated list of actual sensitive values. This is critical for compliance with regulations like GDPR, HIPAA, and PCI DSS, where false positives waste investigation time and false negatives expose risk.
On the SC-900 exam, you need to understand that EDM is an add-on to the built-in sensitive info types. It is not a replacement but a complement for scenarios where pattern matching is insufficient.
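The difference between pattern-based and exact-value detection can be sketched in a few lines of Python. This is a toy model, not Microsoft's implementation: the regular expression is a simplified 16-digit card pattern, and `KNOWN_CARDS` is a hypothetical curated list standing in for an uploaded EDM dataset.

```python
import re

# Simplified 16-digit card pattern (illustrative only, not a built-in type definition).
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")

# Hypothetical curated list of real sensitive values (the "exact" dataset).
KNOWN_CARDS = {"4539148803436467"}


def digits_only(value: str) -> str:
    """Strip everything except digits so formatting differences do not matter."""
    return re.sub(r"\D", "", value)


def pattern_hits(text: str) -> list[str]:
    """Pattern-based detection: anything shaped like a card number is flagged."""
    return CARD_PATTERN.findall(text)


def exact_hits(text: str) -> list[str]:
    """Exact-value detection: only candidates present in the curated list are flagged."""
    return [c for c in pattern_hits(text) if digits_only(c) in KNOWN_CARDS]


sample = "Test card 4111-1111-1111-1111 and customer card 4539 1488 0343 6467."
print(pattern_hits(sample))  # both numbers match the pattern
print(exact_hits(sample))    # only the number in the curated list remains
```

The pattern flags the test number as well as the customer number, while the exact lookup keeps only the value that is actually in the dataset.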
How EDM Works Internally
EDM operates through a multi-step process involving schema definition, data upload, hashing, and matching. Here's the detailed mechanism:
Schema Definition: You define a schema that describes the structure of your sensitive data. The schema specifies column names and types (e.g., string, integer). Each column can be designated as searchable (primary key) or non-searchable (additional info). For example, a schema for employee data might have columns: EmployeeID (searchable), Name, Department.
Data Upload: You upload the actual sensitive data in a CSV file that matches the schema. The data is stored in a secure database in the Microsoft Purview compliance portal. The upload process uses a secure connection and the data is encrypted at rest.
Hashing and Indexing: The EDM system hashes the content of searchable columns using a one-way cryptographic hash (SHA-256). The hashes are stored in an index. The original plaintext data is not stored in the index; only hashes are kept for matching. The non-searchable columns are stored as plaintext in the database but are not indexed for search.
Matching Process: When content is scanned (e.g., an email or document), the scanner extracts text and computes hashes for strings that match the defined primary key pattern (usually a regular expression). These computed hashes are compared against the indexed hashes. If a match is found, the scanner retrieves the corresponding row from the database (including non-searchable columns) and flags the content as containing sensitive information.
Policy Enforcement: Once matched, a sensitivity label or retention policy can be applied automatically. For example, an email containing a real employee ID might be automatically encrypted.
Key technical detail: The hash is computed on the normalized value, not on the raw text. Normalization trims surrounding whitespace and converts the text to uppercase, so "EMP123" and "emp123 " both produce the same hash and both match.
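The normalize, hash, and compare steps described above can be modeled roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the optional `salt` parameter and the way it is prepended are illustrative, and the real service's internals are not public in this level of detail.

```python
import hashlib


def normalize(value: str) -> str:
    """Trim surrounding whitespace and uppercase, as described in the text."""
    return value.strip().upper()


def hash_value(value: str, salt: bytes = b"") -> str:
    """One-way SHA-256 hash of the normalized value (salt handling is an assumption)."""
    return hashlib.sha256(salt + normalize(value).encode("utf-8")).hexdigest()


def build_index(values: list[str], salt: bytes = b"") -> set[str]:
    """The index holds only hashes, never the plaintext values."""
    return {hash_value(v, salt) for v in values}


def find_matches(candidates: list[str], index: set[str], salt: bytes = b"") -> list[str]:
    """Hash each candidate string from the scanned content and look it up in the index."""
    return [c for c in candidates if hash_value(c, salt) in index]


index = build_index(["EMP123", "EMP456"])
print(find_matches(["emp123 ", "EMP12", "EMP1234", "EMP456"], index))
# → ['emp123 ', 'EMP456']
```

Because the comparison is hash against hash, a substring such as `EMP12` or a longer string such as `EMP1234` produces a completely different digest and never matches.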
Key Components, Values, Defaults, and Timers
- Schema: Defined in XML format. Example:
    <?xml version="1.0" encoding="utf-8"?>
    <EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">
      <DataStore name="EmployeeRecords" description="Employee data">
        <Field name="EmployeeID" searchable="true" />
        <Field name="Name" searchable="false" />
        <Field name="Department" searchable="false" />
      </DataStore>
    </EdmSchema>
- CSV File: Must have a header row matching field names. Maximum file size: 5 GB per upload. Maximum number of rows: 10 million per table.
- Refresh Interval: After uploading data, there is a delay before it becomes available for scanning. Typically 24-48 hours for initial indexing, but subsequent refreshes (deltas) can take a few hours.
- Primary Key: At least one field must be designated as searchable (primary key). This field is hashed and indexed.
- Hashing Algorithm: SHA-256, with salt per tenant for security.
- Retention: Data is stored for up to 90 days after the last upload. If no new upload occurs, the data is purged.
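As a rough illustration of the "at least one searchable field" rule, the schema example can be checked locally before it is submitted. This is a sketch using Python's standard XML parser; the helper names are hypothetical and the check is not part of any Microsoft tooling.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the schema example in this chapter.
EDM_NS = {"edm": "http://schemas.microsoft.com/office/2018/edm"}

SCHEMA_XML = """<EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">
  <DataStore name="EmployeeRecords" description="Employee data">
    <Field name="EmployeeID" searchable="true" />
    <Field name="Name" searchable="false" />
    <Field name="Department" searchable="false" />
  </DataStore>
</EdmSchema>"""


def searchable_fields(schema_xml: str) -> list[str]:
    """Return the names of fields marked searchable="true"."""
    root = ET.fromstring(schema_xml)
    fields = root.findall("edm:DataStore/edm:Field", EDM_NS)
    return [f.get("name") for f in fields if f.get("searchable") == "true"]


def validate_schema(schema_xml: str) -> list[str]:
    """Raise if the schema has no searchable field; otherwise return the searchable names."""
    names = searchable_fields(schema_xml)
    if not names:
        raise ValueError("schema needs at least one searchable field")
    return names


print(validate_schema(SCHEMA_XML))  # → ['EmployeeID']
```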
Configuration and Verification Commands
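Configuration is done via the Microsoft Purview compliance portal or from the command line. Schemas are managed with Security & Compliance PowerShell cmdlets; the data file itself is hashed and uploaded with the EDM Upload Agent command-line tool (EdmUploadAgent.exe) rather than with a PowerShell cmdlet. Key commands (file paths are examples):
- Connect to Security & Compliance PowerShell:
    Connect-IPPSSession
- Create Schema (the schema file is passed as a byte array):
    New-DlpEdmSchema -FileData ([System.IO.File]::ReadAllBytes("C:\schema.xml"))
- Authorize the EDM Upload Agent:
    EdmUploadAgent.exe /Authorize
- Upload Data (hash and upload in one step):
    EdmUploadAgent.exe /UploadData /DataStoreName EmployeeRecords /DataFile C:\data.csv /HashLocation C:\hash /Schema C:\schema.xml
- Verify Upload Status:
    EdmUploadAgent.exe /GetSession /DataStoreName EmployeeRecords
- List Schemas:
    Get-DlpEdmSchema
- Remove Schema:
    Remove-DlpEdmSchema -Identity "EmployeeRecords"
Important: After uploading, you must create a sensitive info type that uses the EDM schema. This is done in the compliance portal, or by importing a rule package XML with the New-DlpSensitiveInformationTypeRulePackage cmdlet.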
Interaction with Related Technologies
EDM works alongside other Microsoft Purview features:
Sensitive Info Types: EDM-based types are a subclass of sensitive info types. They appear in the same list and can be used in DLP policies, auto-labeling, and retention labels.
Data Loss Prevention (DLP): DLP policies can include EDM-based sensitive info types as conditions. For example, a DLP rule might block external sharing of documents containing EDM-matched employee IDs.
Auto-Labeling: Sensitivity labels can be configured to apply automatically when EDM content is detected.
Microsoft 365 Defender: EDM detection can feed into alerts and investigations.
Information Barriers: EDM can be used to identify content that should be restricted between groups.
EDM does not replace but enhances these features by providing exact match capability.
Common Pitfalls and Exam Traps
False sense of security: EDM only detects exact matches. If the data in the database is stale, it may miss new sensitive data. Regular refreshes are critical.
Case sensitivity: Normalization makes matching case-insensitive, but candidates often think it's case-sensitive.
Hash storage: The exam may test that only hashes of searchable fields are stored in the index, not plaintext.
Refresh delay: The 24-48 hour initial indexing delay is a common exam point.
Maximum rows: 10 million rows per table. If you exceed this, you need multiple tables or a different approach.
Schema changes: Once data is uploaded, you cannot modify the schema. You must delete and recreate.
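For the row-limit pitfall, the arithmetic of splitting a large dataset into several tables is straightforward. The sketch below assumes the 10 million row figure quoted in this chapter and uses a hypothetical helper name; how the rows are actually partitioned across tables is a design decision for the organization.

```python
# Row limit per table as quoted in this chapter.
MAX_ROWS_PER_TABLE = 10_000_000


def split_ranges(total_rows: int, limit: int = MAX_ROWS_PER_TABLE) -> list[tuple[int, int]]:
    """Return half-open (start, end) row ranges, one per table, each within the limit."""
    return [(start, min(start + limit, total_rows)) for start in range(0, total_rows, limit)]


ranges = split_ranges(25_000_000)
print(len(ranges))  # → 3
print(ranges)       # → [(0, 10000000), (10000000, 20000000), (20000000, 25000000)]
```

Each range would become its own schema and data file, since the tables are independent of one another.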
Specific Numbers and Values for the Exam
Maximum file size: 5 GB
Maximum rows: 10 million
Initial indexing delay: 24-48 hours
Subsequent refresh delay: a few hours
Data retention without upload: 90 days
Hash algorithm: SHA-256 with per-tenant salt
At least one searchable field required
PowerShell module: Exchange Online PowerShell (V2)
Edge Cases the Exam Loves
Empty database: If the database is empty, no matches occur. The exam may ask what happens if no data is uploaded.
Partial matches: EDM requires exact match on the searchable field. A partial match (e.g., substring) does not trigger detection.
Multiple tables: You can have multiple schemas/tables, but each is independent.
Data expiration: After 90 days without refresh, the data is deleted. The exam may test that you must re-upload.
Schema deletion: Deleting a schema also deletes the uploaded data. Re-upload is required.
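The data-expiration edge case comes down to simple date arithmetic. The following sketch assumes the 90-day figure quoted in this chapter; the function names and the 30-day warning window are hypothetical choices for illustration.

```python
from datetime import date, timedelta

# Retention window as quoted in this chapter.
RETENTION_DAYS = 90


def expiry_date(last_upload: date) -> date:
    """Date on which the data would be purged if no new upload occurs."""
    return last_upload + timedelta(days=RETENTION_DAYS)


def days_remaining(last_upload: date, today: date) -> int:
    """Days left before expiry (negative once the data has already expired)."""
    return (expiry_date(last_upload) - today).days


def needs_refresh(last_upload: date, today: date, warn_days: int = 30) -> bool:
    """True when the remaining time is inside the warning window."""
    return days_remaining(last_upload, today) <= warn_days


print(expiry_date(date(2024, 1, 1)))                      # → 2024-03-31
print(days_remaining(date(2024, 1, 1), date(2024, 3, 1)))  # → 30
print(needs_refresh(date(2024, 1, 1), date(2024, 3, 1)))   # → True
```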
Summary of Mechanism for Exam Recall
Think of EDM as a three-step pipeline: (1) Define structure (schema), (2) Upload actual values (CSV), (3) Create sensitive info type that references the schema. When scanning, the system hashes candidate strings and compares against stored hashes. Match triggers policy. No match, no action.
Define the Schema
Create an XML schema that describes the structure of your sensitive data. The schema includes field names and specifies which fields are searchable (primary key) and which are not. The searchable fields are the ones that will be hashed and indexed for matching. For example, if you want to detect employee IDs, you define a field 'EmployeeID' as searchable. Non-searchable fields like 'Name' and 'Department' are stored as plaintext and returned when a match occurs, but they are not used for matching. The schema must be uploaded to Microsoft Purview using PowerShell or the compliance portal. Once uploaded, it cannot be modified; you must delete and recreate it if changes are needed. The schema XML uses a specific namespace: http://schemas.microsoft.com/office/2018/edm.
Upload the Data File
Prepare a CSV file that contains the actual sensitive data. The CSV must have a header row that matches the field names in the schema. Each row represents a record. The data is hashed and uploaded with the EDM Upload Agent command-line tool (EdmUploadAgent.exe /UploadData). The file size must not exceed 5 GB, and the table can have up to 10 million rows. During upload, the data is encrypted and stored securely. The system then begins indexing the hashed searchable fields. The initial indexing process can take 24-48 hours. Subsequent refreshes (delta uploads) take less time, typically a few hours. After uploading, you can check the status with EdmUploadAgent.exe /GetSession. The data is retained for 90 days without a new upload; after that, it is automatically deleted.
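Many upload failures can be caught before the upload starts. The sketch below is a hypothetical pre-flight check, not part of the upload tooling: it compares the header row with the schema field names and applies the size and row limits quoted in this chapter (treating 5 GB as 5 * 1024**3 bytes is an assumption). It reads the CSV from a string for brevity; a real check would stream the file from disk.

```python
import csv
import io

# Limits as quoted in this chapter; the byte interpretation of "5 GB" is an assumption.
MAX_BYTES = 5 * 1024**3
MAX_ROWS = 10_000_000


def preflight(csv_text: str, schema_fields: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means the file looks uploadable."""
    problems = []
    if len(csv_text.encode("utf-8")) > MAX_BYTES:
        problems.append("file exceeds 5 GB")
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return problems + ["file is empty"]
    if header != schema_fields:
        problems.append("header row does not match schema field names")
    row_count = sum(1 for row in reader if row)  # skip blank lines
    if row_count == 0:
        problems.append("no data rows")
    elif row_count > MAX_ROWS:
        problems.append("more than 10 million rows")
    return problems


fields = ["EmployeeID", "Name", "Department"]
good = "EmployeeID,Name,Department\nEMP123,Ada Example,Research\n"
missing_header = "EMP123,Ada Example,Research\n"

print(preflight(good, fields))            # → []
print(preflight(missing_header, fields))  # header mismatch and no data rows
```

In the second case the first data row is read as the header, so the check reports both a header mismatch and an absence of data rows.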
Create the Sensitive Info Type
After the schema and data are in place, you must create a custom sensitive information type that references the EDM schema. This is done in the Microsoft Purview compliance portal under Classification > Sensitive info types > Create. You specify the schema name and the primary key field. Optionally, you can define a confidence level and additional detection logic. The sensitive info type can then be used in DLP policies, auto-labeling, and retention labels. The exam expects you to know that the sensitive info type must be created after the data upload; otherwise, it will have no data to match against. The sensitive info type is what triggers detection when content is scanned.
Test and Verify Detection
Once the sensitive info type is created, you should test it to ensure it detects the sensitive data correctly. You can use the compliance portal's test feature or create a sample document containing the sensitive values and run a DLP policy scan. Note that detection only works if the content contains an exact match of the searchable field value. Partial matches or variations (e.g., different formatting) will not be detected unless they normalize to the same value. The system uses normalization (trim, uppercase) to reduce false negatives. After testing, you can adjust the schema or data if needed. Remember that schema changes require deletion and re-creation, which also deletes the uploaded data. So plan carefully.
Implement Policies and Monitor
With the EDM-based sensitive info type active, you can incorporate it into DLP policies, auto-labeling policies, and retention labels. For example, you might create a DLP policy that blocks emails containing employee IDs from being sent outside the organization. You can also monitor detection events in the compliance portal's Activity explorer. Regular data refreshes are necessary to keep the database current. Automate the upload process using PowerShell scripts or Microsoft 365 APIs. The exam may ask about the importance of refreshing data to avoid stale matches. Also, be aware that EDM detection works across Exchange Online, SharePoint Online, OneDrive for Business, and Teams messages.
Enterprise Scenario 1: Healthcare Provider Protecting Patient IDs
A large hospital network needs to ensure that patient medical record numbers (MRNs) are not accidentally exposed in emails or documents. MRNs follow an internal format that is not a standard pattern (e.g., 'MRN-2024-0001234'). Built-in sensitive info types cannot detect these because they lack a common pattern. The hospital uses EDM to create a database of all active patient MRNs. The schema has one searchable field (MRN) and non-searchable fields (PatientName, DateOfBirth). The CSV contains millions of rows. They upload the data weekly to capture new patients and remove discharged ones. A DLP policy is configured to block external sharing of any document containing an MRN match. In production, they encountered an issue: the initial upload of 8 million rows took nearly 48 hours to index. They learned to schedule uploads during low-activity periods. Another challenge was that some MRNs appeared in clinical notes with leading zeros or spaces; normalization handled this. False positives were nearly eliminated compared to pattern-based detection. Performance is generally good, but the DLP scanning latency increased slightly due to the hash lookup. They monitor the Activity explorer for false positives and adjust the database accordingly.
Enterprise Scenario 2: Financial Services Firm Detecting Proprietary Account Numbers
A financial services firm uses proprietary account numbers that are 16-digit alphanumeric codes. These codes are unique per customer and are considered sensitive under GDPR. The firm uses EDM to detect these codes in internal communications and documents. They created a schema with the account number as searchable and additional fields like customer name and risk category. The data is refreshed daily via an automated PowerShell script that exports from their CRM. They encountered a common misconfiguration: the schema initially had two searchable fields, which caused unexpected behavior because EDM only uses one primary key for matching. They corrected it to one searchable field. Another issue was that the CSV file exceeded 5 GB due to including many non-searchable columns. They optimized by only including necessary fields. The DLP policy applies a 'Confidential' label to any document containing a matching account number. They also use auto-labeling in SharePoint to protect documents stored in libraries. The firm reports that EDM reduced false positives by 95% compared to a custom regex pattern they previously used.
Enterprise Scenario 3: Government Agency Classifying Personnel Data
A government agency needs to classify documents containing employee social security numbers (SSNs) but only for current employees. They have a database of 500,000 active employees. They use EDM with a schema where SSN is the searchable field. They upload data monthly. The challenge was that SSNs appear in various formats (with dashes, spaces, no dashes). Normalization handles this by stripping dashes and spaces before hashing. They also had to ensure that the CSV file did not contain any SSNs of terminated employees to avoid false positives. They automated the upload with a script that exports only active employees. The DLP policy is configured to encrypt any email containing an SSN match when sent outside the agency. They also use retention labels to retain documents containing SSNs for 7 years. The agency found that the 90-day data retention without refresh was a risk; they set up a reminder to upload every 60 days. In one incident, a data upload failed due to a format error in the CSV (missing header), and the old data expired, causing a gap in detection. They now have monitoring alerts for upload failures.
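The delimiter handling mentioned in Scenario 3 can be pictured as one more normalization step before hashing. This is a minimal sketch, assuming dashes and whitespace are the only delimiters to strip; the helper names are hypothetical and the sample value is a well-known placeholder, not a real SSN.

```python
import hashlib
import re


def normalize_ssn(raw: str) -> str:
    """Remove dashes and whitespace so every formatting variant collapses to one value."""
    return re.sub(r"[-\s]", "", raw)


def ssn_hash(raw: str) -> str:
    """SHA-256 digest of the normalized value."""
    return hashlib.sha256(normalize_ssn(raw).encode("utf-8")).hexdigest()


variants = ["123-45-6789", "123 45 6789", "123456789"]
print({normalize_ssn(v) for v in variants})  # → {'123456789'}
print(len({ssn_hash(v) for v in variants}))  # → 1
```

All three formats hash to the same digest, so a single row in the dataset covers every way the number might be written.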
Objective Coverage
This topic aligns with SC-900 objective 4.3: Describe the capabilities of Microsoft Purview Information Protection. Specifically, EDM is part of 'sensitive information types'. The exam expects you to understand:
What EDM is and when to use it (exact match, custom data)
The high-level process: schema, data upload, sensitive info type
Key limitations: 10 million rows, 5 GB file size, 24-48 hour initial index delay
That EDM uses hashing (SHA-256) for matching, not plaintext comparison
That EDM is an extension of built-in sensitive info types
Common Wrong Answers and Why Candidates Choose Them
'EDM uses pattern matching like built-in types' – Candidates confuse EDM with pattern-based types. Reality: EDM uses exact match against a database of hashed values. Pattern matching is for built-in types.
'EDM can detect partial matches' – Candidates think a substring match works. Reality: EDM requires an exact match on the entire searchable field value. Partial matches are not detected.
'EDM stores plaintext data in the index' – Candidates assume the data is stored as-is. Reality: Only SHA-256 hashes of searchable fields are stored in the index; plaintext is stored separately for non-searchable fields.
'EDM data is immediately available after upload' – Candidates ignore the indexing delay. Reality: Initial indexing takes 24-48 hours; subsequent refreshes take a few hours.
'You can modify the schema after data upload' – Candidates think schemas are mutable. Reality: You must delete and recreate the schema, which also deletes the data.
Specific Exam-Verbatim Terms and Values
'Exact Data Match' is the full name.
'Schema', 'data file', 'sensitive info type' are key terms.
Maximum rows: 10 million.
Maximum file size: 5 GB.
Initial indexing delay: 24-48 hours.
Data retention without refresh: 90 days.
Hash algorithm: SHA-256.
At least one field must be 'searchable'.
PowerShell cmdlets: New-DlpEdmSchema, Get-DlpEdmSchema, Remove-DlpEdmSchema. Data upload and status checks use the EDM Upload Agent (EdmUploadAgent.exe).
Edge Cases and Exceptions
Empty database: If no data is uploaded, the sensitive info type will never trigger a match. The exam may ask what happens if the data file is empty.
Multiple searchable fields: Only one primary key is used for matching. Additional searchable fields are ignored for matching but stored.
Data expiration: After 90 days without a new upload, the data is purged. The sensitive info type will stop matching.
Schema deletion: Deleting the schema removes the data and the sensitive info type becomes non-functional.
CSV format errors: If the CSV has missing headers or wrong column order, the upload fails. The exam may test troubleshooting.
How to Eliminate Wrong Answers
If a question mentions 'pattern', 'regex', or 'format', it is likely about built-in sensitive info types, not EDM.
If a question says 'immediately available', it is wrong; EDM has a delay.
If a question says 'plaintext matching', it is wrong; EDM uses hashes.
If a question says 'modify schema without data loss', it is wrong; you must recreate.
If a question mentions 'partial match', it is wrong; EDM requires exact match.
Look for keywords like 'custom database', 'exact values', 'hash', 'upload', 'schema' to identify EDM scenarios.
Study Tips
Memorize the three-step process: schema, data, sensitive info type.
Know the exact numbers: 10 million rows, 5 GB, 24-48 hours, 90 days.
Understand the hash mechanism: SHA-256, one-way, normalized input.
Practice with PowerShell commands (even if just reading them).
Review the exam blueprint for objective 4.3 and note any updates.
EDM detects sensitive data by exact match against a custom database of hashed values.
The three-step process: define schema, upload data (CSV), create sensitive info type.
Maximum 10 million rows per table and 5 GB per file upload.
Initial indexing delay is 24-48 hours; data is retained for 90 days without refresh.
Hash algorithm is SHA-256 with per-tenant salt; only hashes of searchable fields are indexed.
Schema cannot be modified after data upload; must delete and recreate.
EDM is used for custom, non-patterned sensitive data like employee IDs or account numbers.
EDM works with DLP, auto-labeling, and retention policies in Microsoft Purview.
Built-in and EDM-based sensitive info types come up on the exam all the time. Here's how to tell them apart.
Built-in Sensitive Info Types
Uses pattern recognition (regex) and keywords.
No custom data upload required; works out-of-box.
Can detect data that follows a standard format (e.g., credit card numbers).
Higher false positive rate due to pattern matching.
Immediately available; no indexing delay.
Exact Data Match (EDM) Sensitive Info Types
Uses exact match against a custom database of hashed values.
Requires schema definition and data upload.
Detects custom, proprietary data that has no standard pattern.
Very low false positive rate because only exact matches trigger.
Initial indexing delay of 24-48 hours; subsequent refreshes take hours.
Mistake
EDM can detect any sensitive data automatically without configuration.
Correct
EDM requires manual setup: you must define a schema, upload a CSV file of actual sensitive values, and create a sensitive info type. It does not automatically discover data.
Mistake
EDM stores the original plaintext data in its index for matching.
Correct
Only SHA-256 hashes of searchable fields are stored in the index. Plaintext is stored separately for non-searchable fields but is not used for matching. This protects the actual sensitive data.
Mistake
EDM can match partial values, such as a substring of an employee ID.
Correct
EDM requires an exact match on the entire searchable field value. For example, if the database contains 'EMP123', only the exact string 'EMP123' will match; 'EMP12' or 'EMP1234' will not.
Mistake
Once uploaded, EDM data is immediately available for scanning.
Correct
Initial indexing takes 24-48 hours. Subsequent refreshes take a few hours. Data is not available until indexing completes.
Mistake
You can modify the schema after uploading data without losing the data.
Correct
Schema modifications are not allowed. You must delete the schema (which also deletes the uploaded data) and recreate it with the new schema, then re-upload the data.
Exact Data Match (EDM) is a feature that allows you to create custom sensitive information types based on exact values from a database. You define a schema, upload a CSV file of sensitive data, and then create a sensitive info type that references the schema. When content is scanned, the system hashes candidate strings and compares them against hashed values in the database. Only exact matches trigger detection. EDM is used for data that does not follow a standard pattern, such as proprietary employee IDs.
The initial indexing process takes 24-48 hours. Subsequent refreshes (delta uploads) typically take a few hours. During indexing, the system hashes the searchable fields and builds an index. Data is not available for scanning until indexing completes. The exam may test this delay as a key characteristic.
The maximum file size for a single upload is 5 GB. The maximum number of rows per table is 10 million. If you have more than 10 million rows, you need to split the data into multiple tables or use a different approach. These limits are important for exam questions about scalability.
EDM does not detect partial matches; it requires an exact match on the entire searchable field value. However, normalization is applied: whitespace is trimmed, and text is converted to uppercase before hashing. So 'EMP123 ' and 'emp123' both normalize to 'EMP123' and will match. But 'EMP12' or 'EMP1234' will not match. This is a common exam trap.
EDM stores only SHA-256 hashes of the searchable fields in the index. The original plaintext values are stored in a separate encrypted database and are only retrieved when a match occurs. Non-searchable fields are stored as plaintext but are not indexed. This ensures that even if the index is compromised, the actual sensitive values cannot be reversed from the hashes.
Data in the EDM database is retained for 90 days from the last upload. If no new data is uploaded within that period, the data is automatically deleted. The sensitive info type will then stop matching any content. To maintain detection, you must upload fresh data before the 90-day expiration. The exam may test this retention period.
Once a schema is created and data is uploaded, you cannot modify the schema. To change it, you must delete the existing schema (which also deletes all uploaded data), recreate it with the new structure, and then re-upload the data. Plan your schema carefully before the initial upload.