Use multi-token matching for corroborative evidence in EDM SITs

Article
12/11/2023

To use multi-token matching to identify corroborative evidence for your exact data match (EDM) sensitive information types (SITs), you must explicitly opt in to multi-token support for each multi-token corroborative evidence field. You can opt in either through the new EDM UI experience or through the schema XML update process. The following sections describe each method.

Opt-in through the new EDM UI experience

The new EDM UI experience guides you through the process of creating or editing an EDM schema and uploading your data. You can access it from the Microsoft Purview Compliance Portal, under Data classification > Sensitive info types > Exact data match.

When you upload a sample file, EDM automatically maps each field to the SIT that can best detect the sample data in that field. If none of the SITs in your environment can detect the sample data in a field, EDM defaults to single-token matching. Single-token matching means that EDM compares the hashes of individual strings identified in DLP policy fields only with individual strings in the content. The field selected as the primary element for your schema must be mapped to a SIT. In contrast, corroborative evidence fields can either be mapped to a SIT or be configured for single-token or multi-token matching.

If the sample data in a policy field contains multi-token values and EDM is operating in single-token mode, a warning states that using single-token matching might result in missed detections. In this case, you can switch to multi-token matching. When you use multi-token matching, EDM compares the hashes of consecutive tokens (strings) in the policy fields against consecutive tokens in the content, up to the maximum number of tokens supported by this feature. The current maximum is five (5) tokens. For example, if your field contains the value "Jane Doe", EDM compares the hashes of "Jane", "Doe", and "Jane Doe" with the content. It then produces a match if the hash of any of these combinations is identical to a hashed value in your data source.
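The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace tokenization, an unsalted SHA-256 hash, and the helper names `sha256_hex`, `candidate_spans`, and `matches` are all hypothetical and do not reflect how EDM actually tokenizes or hashes data.

```python
import hashlib

MAX_TOKENS = 5  # current maximum stated in this article


def sha256_hex(text: str) -> str:
    # Stand-in hash for illustration; not the real EDM hashing or salting scheme.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def candidate_spans(content: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Return every run of 1..max_tokens consecutive whitespace-separated tokens."""
    tokens = content.split()
    spans = []
    for start in range(len(tokens)):
        for length in range(1, max_tokens + 1):
            if start + length > len(tokens):
                break
            spans.append(" ".join(tokens[start:start + length]))
    return spans


def matches(field_value: str, content: str, multi_token: bool) -> bool:
    """Compare the hashed field value with the hashed spans of the content."""
    target = sha256_hex(field_value)
    max_tokens = MAX_TOKENS if multi_token else 1
    return any(sha256_hex(span) == target
               for span in candidate_spans(content, max_tokens))
```

With the content "Patient Jane Doe was admitted", `matches("Jane Doe", content, multi_token=True)` finds the two-token span, while the same call with `multi_token=False` only sees individual tokens and misses it.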

Tip

To ensure multi-token match detection, trim multi-token corroborative evidence fields to the maximum number of tokens supported. Alternatively, map the multi-token fields to a SIT that can fully detect the multi-token data.
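Before trimming, it can help to find the values that exceed the limit. The following is a minimal sketch, assuming whitespace tokenization (the real EDM tokenizer may split values differently); `token_count` and `values_over_limit` are hypothetical helper names. It only flags values, because which tokens to keep when trimming depends on your data.

```python
MAX_TOKENS = 5  # current maximum stated in this article


def token_count(value: str) -> int:
    """Count whitespace-separated tokens (an assumption about tokenization)."""
    return len(value.split())


def values_over_limit(values, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Return the values that contain more tokens than the supported maximum."""
    return [value for value in values if token_count(value) > max_tokens]


conditions = [
    "Type 2 diabetes",
    "Chronic obstructive pulmonary disease with acute exacerbation",
]
over_limit = values_over_limit(conditions)
```

Here `over_limit` contains only the second value, which has seven tokens.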

Conversely, if you select multi-token matching for a field that contains only single-token data, a warning states that using multi-token matching might result in higher latency. The higher latency occurs because multi-token matching is less efficient than single-token matching. Multi-token matching also isn't required if the EDM data that is later hashed and uploaded is expected to contain only single-token values for that field. In general, if any of the fields you use for corroborative evidence can be mapped to a SIT, map them instead of relying on single-token or multi-token matching.
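One plausible contributor to that latency is the number of token runs that must be considered. The sketch below counts them for a text of a given length; it is an illustrative assumption based on the matching description in this article, not a statement about how EDM is implemented, and `candidate_span_count` is a hypothetical name.

```python
def candidate_span_count(token_count: int, max_tokens: int) -> int:
    """Count runs of 1..max_tokens consecutive tokens in a text of token_count tokens."""
    return sum(token_count - length + 1
               for length in range(1, min(max_tokens, token_count) + 1))


single = candidate_span_count(100, 1)  # individual tokens only
multi = candidate_span_count(100, 5)   # runs of up to five tokens
```

For a 100-token text, `single` is 100 and `multi` is 490, so under these assumptions there are nearly five times as many candidates to hash and compare.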

Make sure that your sample data is fully representative of your production data. If your sample file doesn't include multi-token values, but your production data does, you might not get a warning for multi-token matches. For instance, this can occur in cases where your sample data includes only single-token values for a FirstName field, but your production data includes multi-token values for the same field.
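One way to check this before uploading is to compare which columns contain multi-token values in each file. The following is a rough sketch, assuming CSV input with a header row and whitespace tokenization; the function names are hypothetical, and the check says nothing about how EDM itself evaluates sample files.

```python
import csv
import io


def multi_token_columns(csv_text: str) -> set[str]:
    """Return the columns in which at least one value has more than one token."""
    reader = csv.DictReader(io.StringIO(csv_text))
    found = set()
    for row in reader:
        for column, value in row.items():
            # Skip missing values and overflow cells (DictReader stores those as lists).
            if isinstance(value, str) and len(value.split()) > 1:
                found.add(column)
    return found


def columns_missing_from_sample(sample_csv: str, production_csv: str) -> set[str]:
    """Columns that are multi-token in production data but not in the sample."""
    return multi_token_columns(production_csv) - multi_token_columns(sample_csv)


sample = "FirstName,SSN\nJane,123-45-6789\n"
production = "FirstName,SSN\nMary Ann,987-65-4321\n"
gaps = columns_missing_from_sample(sample, production)
```

In this example `gaps` is `{"FirstName"}`, which signals that the sample file would not trigger the multi-token warning for that field.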

Opt-in through the EDM schema XML update process

You can also use PowerShell to create or update an XML schema that implements multi-token matching for corroborative evidence.

Here's a sample XML schema with multi-token settings for protecting patient records:

<EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">
  <DataStore name="DemoMTCorroborativeEvidenceSchema"
             description="Test" version="1">
    <Field name="Name"
           searchable="false"
           caseInsensitive="true"
           ignoredDelimiters=""
           isMultiToken="false"/>
    <Field name="SSN"
           searchable="true"
           caseInsensitive="true"
           ignoredDelimiters=""
           isMultiToken="false"/>
    <Field name="PatientID"
           searchable="false"
           caseInsensitive="true"
           ignoredDelimiters=""
           isMultiToken="false"/>
    <Field name="MedicalCondition"
           searchable="false"
           caseInsensitive="true"
           ignoredDelimiters=""
           isMultiToken="true"/>
    <Field name="Address"
           searchable="false"
           caseInsensitive="true"
           ignoredDelimiters=""
           isMultiToken="true"/>
  </DataStore>
</EdmSchema>

You configure multi-token support for each corroborative evidence field through the new XML attribute isMultiToken, which can be set to true or false. EDM treats any field with isMultiToken set to true as a multi-token field and compares it with consecutive strings in the content, up to the maximum number of tokens supported by this feature. EDM treats fields with isMultiToken set to false, or without this attribute, as single-token fields and compares them only with individual strings in the content.
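To review which fields in a schema file are opted in, you could read the isMultiToken setting with a short script. This is a minimal sketch using Python's standard `xml.etree.ElementTree` module; `field_token_modes` is a hypothetical helper, and treating a missing attribute as false mirrors the behavior described above.

```python
import xml.etree.ElementTree as ET

EDM_NS = "http://schemas.microsoft.com/office/2018/edm"


def field_token_modes(schema_xml: str) -> dict[str, bool]:
    """Map each field name to True (multi-token) or False (single-token)."""
    root = ET.fromstring(schema_xml)
    modes = {}
    for field in root.iter(f"{{{EDM_NS}}}Field"):
        # A missing isMultiToken attribute is treated as single-token.
        raw = field.get("isMultiToken", "false")
        modes[field.get("name")] = raw.strip().lower() == "true"
    return modes
```

Running it against the sample schema in this section would report MedicalCondition and Address as multi-token fields and the remaining fields as single-token.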

For more information about what an EDM schema is, see the Schema section of the Learn about exact data match based sensitive information types article.

For details about creating an EDM schema, see Create your EDM schema and SIT.
