Create and manage exact data match sensitive info types


Exact data match (EDM) lets you create a sensitive information type (SIT) that uses exact data values to identify and protect sensitive information. With EDM, you can ensure your SIT:

Is easily updated: Quickly adapt to changes in your sensitive data.
Reduces false positives: Accurately identify the correct information, minimizing mistakes.
Fits structured data: Works well with organized data sets.
Ensures privacy: Keeps sensitive data secure and private, even from Microsoft.
Integrates across services: Functions with a range of Microsoft cloud services for better data governance.

For example, if you have customer account numbers, EDM specifically flags those numbers only, which significantly lessens the risk of incorrect flags.

EDM-based classification enables you to create custom sensitive information types that match exact values from a database. The database can hold up to 100 million rows of data and can be refreshed daily to reflect changes such as new or departing employees, patients, or clients, so your custom sensitive information types stay current and relevant.

What's different in an EDM SIT

An EDM SIT is different from standard SITs because it matches exact data values instead of relying only on patterns or keywords. It also includes a few specific concepts:

Schema

The schema is an XML file that serves as the blueprint for your EDM SIT. It defines:

The name of the schema, later referred to as the DataStore.
Field names that correspond to the columns in your sensitive information source table.
Which fields are searchable, allowing precise control over what the SIT will check.
A configurable match to refine your search, such as case sensitivity or ignoring punctuation.
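
As a rough illustration of the items above, the following Python sketch builds a small schema document with the standard library. The namespace URI and the element and attribute names (EdmSchema, DataStore, Field, searchable, caseInsensitive, ignoredDelimiters) are assumptions about the schema layout and should be checked against the current EDM schema reference. The DataStore name and field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace and element/attribute names; verify before real use.
EDM_NS = "http://schemas.microsoft.com/office/2018/edm"


def build_schema(datastore_name, fields):
    """Return an EDM-style schema document as an XML string.

    fields: list of dicts with a "name" key and optional "searchable",
    "case_insensitive", and "ignored_delimiters" keys.
    """
    ET.register_namespace("", EDM_NS)
    root = ET.Element(f"{{{EDM_NS}}}EdmSchema")
    store = ET.SubElement(
        root,
        f"{{{EDM_NS}}}DataStore",
        {"name": datastore_name, "description": "Example schema", "version": "1"},
    )
    for field in fields:
        attrs = {"name": field["name"]}
        if field.get("searchable"):
            attrs["searchable"] = "true"
        if field.get("case_insensitive"):
            attrs["caseInsensitive"] = "true"
        if field.get("ignored_delimiters"):
            attrs["ignoredDelimiters"] = ",".join(field["ignored_delimiters"])
        ET.SubElement(store, f"{{{EDM_NS}}}Field", attrs)
    return ET.tostring(root, encoding="unicode")


schema_xml = build_schema(
    "CustomerRecords",  # hypothetical DataStore name
    [
        {"name": "AccountNumber", "searchable": True, "ignored_delimiters": ["-"]},
        {"name": "FirstName", "case_insensitive": True},
        {"name": "FamilyName", "case_insensitive": True},
    ],
)
print(schema_xml)
```

Only AccountNumber is marked searchable here, which mirrors the idea that the schema controls which fields the SIT checks.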

Sensitive information source table

The sensitive information source table is the actual dataset used for matching. It contains:

Column headers representing the field names (like First Name, Last Name, Date of Birth).
Rows representing individual records, with each cell containing the specific value for its field.

Example table:

First name	Family name	Date of birth
Isaiah	Langer	05-05-1960
Ana	Bowman	11-24-1971
Oscar	Ward	02-12-1998
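
A table like this one could be prepared as a delimited text file. The sketch below assumes a comma-separated layout and uses hypothetical header names without spaces. It writes a header row followed by one line per record and rejects rows whose length doesn't match the header.

```python
import csv
import io

HEADERS = ["FirstName", "FamilyName", "DateOfBirth"]  # hypothetical field names
ROWS = [
    ["Isaiah", "Langer", "05-05-1960"],
    ["Ana", "Bowman", "11-24-1971"],
    ["Oscar", "Ward", "02-12-1998"],
]


def write_source_table(stream, headers, rows):
    """Write a header row plus one line per record, rejecting ragged rows."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(headers)
    for row in rows:
        if len(row) != len(headers):
            raise ValueError(f"expected {len(headers)} values, got {len(row)}")
        writer.writerow(row)


# An in-memory buffer keeps the example self-contained; a real table
# would be written to a file instead.
buffer = io.StringIO()
write_source_table(buffer, HEADERS, ROWS)
table_text = buffer.getvalue()
print(table_text)
```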

Rule package

The rule package in an EDM SIT defines:

Matches specify the primary element used for exact lookups, such as a regular expression or a function.
Classification determines the type of sensitive information being searched for.
Confidence levels measure the likelihood of a match, based on how much supporting evidence is present.
Proximity defines the allowable character distance between the primary and supporting elements.
Supporting elements provide additional context, improving accuracy by reducing false positives and increasing confidence. For instance, finding "SSN" near a social security number helps confirm it.

Primary and secondary support elements

In an EDM SIT, the primary element is the key data point you're looking to protect, such as a social security number or credit card number. You must match the primary element to an existing SIT that Microsoft Purview can already identify.

Once the primary element is detected, EDM looks for a secondary supporting element, like finding the term "SSN" near the actual social security number. This further confirms the identification, increasing confidence in the match.

Supporting elements don't always need fixed patterns, but if they contain multiple words, a defined pattern is required.
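
The interplay of primary element, supporting elements, proximity, and confidence can be pictured with a small Python sketch. This is an illustration of the idea only, not how Microsoft Purview implements matching. The pattern, the supporting terms, the 300-character window, and the confidence thresholds are all hypothetical.

```python
import re

# Illustrative SSN-like shape for the primary element (an assumption).
PRIMARY_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SUPPORTING_TERMS = ["SSN", "social security"]  # hypothetical supporting evidence
PROXIMITY = 300  # hypothetical window size, in characters


def find_matches(text, proximity=PROXIMITY):
    """Return a (value, confidence) pair for each primary element in text."""
    results = []
    lowered = text.lower()
    for match in PRIMARY_PATTERN.finditer(text):
        # Look for supporting terms within `proximity` characters on either side.
        window_start = max(0, match.start() - proximity)
        window_end = min(len(text), match.end() + proximity)
        window = lowered[window_start:window_end]
        support = sum(1 for term in SUPPORTING_TERMS if term.lower() in window)
        # More supporting evidence means higher confidence.
        if support >= 2:
            confidence = "high"
        elif support == 1:
            confidence = "medium"
        else:
            confidence = "low"
        results.append((match.group(), confidence))
    return results


print(find_matches("Employee SSN: 123-45-6789 on file."))
print(find_matches("Order reference 987-65-4321"))
```

The first call finds one supporting term near the number, while the second finds none, so the same pattern yields different confidence depending on context.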

Create an EDM-based SIT

Creating an EDM SIT is a multi-phase process. You can use either the new experience or the classic experience, depending on your needs.

The new EDM experience

The new EDM experience integrates schema creation and SIT definition into a streamlined workflow:

Simplified workflow: The new EDM experience streamlines the process by combining schema and SIT creation, reducing steps and guiding you through mapping data fields to predefined SITs.
Additional guardrails to ensure better performance: Alerts you when fields are too broad, helping you avoid inefficient matches and ensuring high performance.

The classic EDM experience

You can toggle back and forth between the new and classic experiences, but we recommend using the new experience unless your needs fall into one or more of these four use cases:

Multiple SITs per schema: The classic experience allows for multiple SITs to be mapped to a single schema, which isn't possible in the new experience.
Managing more than 10 SITs: If you need to create or manage more than 10 EDM SITs, you need to use the classic experience. The new experience creates one schema per SIT, and attempting to create an eleventh schema with it generates an error. Because the classic experience lets you map multiple EDM SITs to the same schema, you can have more than 10 EDM SITs there.
Custom schema names: The classic experience lets you specify custom names for your EDM schemas, unlike the new experience that auto-generates schema names.
Editing existing schemas: If you need to edit schemas created in the classic experience or uploaded via PowerShell, you must use the classic experience, as the new experience doesn't support this functionality.

Use the following procedure to create an EDM SIT with the new EDM experience.

Sign in to the Microsoft Purview portal, then navigate to Solutions > Information Protection > Classifiers > EDM classifiers.
Make sure the New EDM experience toggle is set to On.
Select Create EDM classifier.
Review the Familiarize yourself with the steps needed to put your classifier to work page, then select Create EDM classifier.
On the Name and describe your EDM classifier page, name the SIT and add a description. The system uses this name, appended with the word schema, for the associated schema it generates.
Select Next.
On the Choose a method for defining your schema page, select the method you want to use for your schema: either Upload a file containing sample data, or Manually define your data structure.

Best practice is to upload a sample data file. The rest of this procedure assumes this option.
Select Next.
On the Upload your sample file page, select your sample file and then select Upload file. Select Next.

If errors display during the upload, address them and then try again.
On the Select primary elements page:
1. In the Primary element column, select your primary element. Each primary element must be mapped to a SIT. Best practice is to select fields that show Full match under the Match Validation column.
2. In the Match mode column for each field, designate which of the following matching options to apply:
  - Option 1: Do nothing to accept the system-suggested SIT.
  - Option 2: Expand the dropdown menu. Under Sensitive Info type (SIT), choose the pencil (Edit) icon and then select another existing SIT.
  - Option 3: Under Match mode select Single token.
  - Option 4: Under Match mode select Multi-token.
Select Next.
Configure settings for data in selected columns.
- The toggle Use the same settings for all columns is set to On by default. If you want to use separate settings for each data field, set the toggle to Off.
- The Data in columns are case-insensitive option is selected by default. To enforce case-sensitive detection, uncheck this box.
- If needed, select the option to Ignore delimiters and punctuation for data in all columns. You can then either select the delimiters and punctuation marks you want to ignore from a list, or enter custom delimiters and punctuation marks to ignore.
On the Review settings and finish page, select Submit.
On the You successfully created an EDM classifier page, capture the Schema name. This name is required when hashing and uploading the sensitive information source table to ensure proper mapping of the data to the schema.
Once you've captured the schema name, select Done.
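
The column settings in the procedure above (case sensitivity and ignored delimiters) can be thought of as a normalization step applied to both the stored value and the candidate text before they are compared. The Python sketch below illustrates that idea under this assumption. The function names are hypothetical, and the service's actual comparison logic may differ.

```python
def normalize(value, case_insensitive=True, ignored_delimiters=""):
    """Strip ignored delimiters and optionally fold case before comparing."""
    if ignored_delimiters:
        # Remove every character listed in ignored_delimiters.
        value = value.translate({ord(ch): None for ch in ignored_delimiters})
    if case_insensitive:
        value = value.casefold()
    return value


def values_match(candidate, stored, **settings):
    """Compare two values after applying the same settings to both."""
    return normalize(candidate, **settings) == normalize(stored, **settings)


# Case-insensitive (the default), ignoring hyphens.
print(values_match("ACCT-001-XY", "acct001xy", ignored_delimiters="-"))
# Case-sensitive comparison: the same values no longer match.
print(values_match("ACCT-001", "acct-001", case_insensitive=False))
```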

Once you create your EDM schema, the next step is to hash and upload your sensitive data so that it can be used securely for classification. For detailed steps on hashing and uploading your source table, see Hash and upload the sensitive information source table for exact data match sensitive information types.
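
To see why hashing keeps the source values private, consider the following conceptual Python sketch. It is not the algorithm used by Microsoft's hashing and upload tooling. The salted SHA-256 scheme and the function names are assumptions chosen only to show that hashed values can be compared without exposing the originals.

```python
import hashlib
import secrets


def hash_value(value, salt):
    """Return a salted SHA-256 digest of one cell value (illustrative scheme)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def hash_table(rows, salt):
    """Hash every cell so that only digests, not plain values, are shared."""
    return [[hash_value(cell, salt) for cell in row] for row in rows]


salt = secrets.token_bytes(16)  # random salt generated locally
rows = [["Isaiah", "Langer", "05-05-1960"]]
hashed = hash_table(rows, salt)

# A candidate value matches only if its digest, computed with the same
# salt, equals a stored digest.
print(hash_value("Isaiah", salt) == hashed[0][0])
print(hash_value("Isaac", salt) == hashed[0][0])
```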