Posted in Classification, Governance, Information Management, Microsoft 365, Records management, Retention and disposal

Classifying records in Microsoft 365

There are three main options in Microsoft 365 to apply recordkeeping classification terms to (some) records:

  • Metadata columns added to SharePoint sites, including those added to Content Types and/or added directly to document libraries.
  • Taxonomy terms stored in the central Term Store, including those added as site columns and added to site content types and/or added directly to document libraries. The only difference with the first option is that with the Term Store the classification terms are stored and managed centrally and are therefore available to every SharePoint site.
  • Retention labels that: (a) ‘map’ to classification terms; (b) are linked with a File Plan that includes the classification terms; (c) are either the same as (a) or (b) and are used in with a Document Understanding Model in SharePoint Syntex; or (d) the same as (a) or (b) and used with conjunction with Trainable Classifiers.

The first two options can only be applied to content stored in SharePoint. Retention labels may be applied to emails and OneDrive content. None of the three options can be applied to Teams chats. Also note that there is no connection between the SharePoint Term Store and the File Plan, both of which can be used to store classification terms.

This post:

  • Defines the meaning of classification from a recordkeeping point of view.
  • Describes each of the above options and their limits.
  • Discusses the requirement to classify records and other options in Microsoft 365.

What is classification?

Humans are natural-born classifiers. We see it in the way we store cutlery or linen, or other household items or personal records.

Business records also need some form of classification. But what does that mean? The 2002 version of the records management standard ISO 15489, defines classification as:

‘the systematic identification and arrangement of business activities and/or records into categories according to logically structured conventions, methods and procedural rules represented in a classification system’. (ISO 15489.1 2017 clause 3.5).

The standard also states (4.2.1) that a classification scheme based on business activities, along with a records disposition authority and a security and access classification scheme, were the principal instruments used in records management operations.

The classification of records in business is important to establish their context and help finding them.

Microsoft 365 includes various options to apply classification terms to records.

Metadata columns in SharePoint

The simplest way to classify records stored in SharePoint document libraries is to either create site columns containing the classification terms and add those columns to document libraries, or create them directly in those libraries.

Benefits

Adding site or library columns is relatively simple. As classification terms are usually in the form of a (hierarchical) list, it is simple to add one choice or lookup column for function and another for activities.

A lookup column can bring across a value from another column when an item is selected; for example, if the look up list places ‘Accounting’ (Activity) in the same list row as ‘Financial Management’, selection of ‘Accounting’ will bring across ‘Financial Management’ as a separate (linked) column.

Default values (or even one value) can be set meaning that records added to a library (that only contains records with those classification terms) can be assigned the same classification terms each time without user intervention.

Negatives

SharePoint choice or lookup columns do not allow for hierarchical views or values to be displayed from the list view so the context for the classification terms may not be obvious unless both function and activity are listed.

The Term Store

The Term Store, also known as the Managed Metadata Service (MMS) has existed in SharePoint as a option to create and centrally manage classification and taxonomy terms in SharePoint only for at least a decade.

In 2020, access to the Term Store was re-located from its previous location (https://tenantname-admin.sharepoint.com/_layouts/15/TermStoreManager.aspx) to the SharePoint Online admin portal under the ‘Content Services’ section:

The location of the Term Store in the SharePoint admin center

Organisations can create multiple sets of taxonomies or ‘term groups’ (e.g, ‘BCS’ or ‘People’) within the Term Store. Each Term Group consists of the following:

  • Term Sets. These generally could map to a business function. Each Term Set has a name and description, and four tabs with the following information: (a) General: Owner, Stakeholders, Contact, Unique ID (GUID); (b) Usage settings: Submission policy, Available for tagging, Sort order; (c) Navigation: Use term set for site navigation or faceted navigation – both disabled by default; (d) Advanced: Translation options, custom properties.
  • Terms. These generally could map to an activity. Each Term has a name and three tabs: (a) General: Language, translation, synonyms and description; (b) Usage settings: Available for tagging, Member of (Term Set), Unique ID (GUID); (c) Advanced: Shared custom properties, Local custom properties.

In the example below, the Term Set (function) of ‘Community Relations’ has three Terms (activities).

Once they have been created in the Term Store, term set or terms can be added to a SharePoint site, either as a new site or local library/list column, as shown in the two screenshots below:

First, create a new column and select Managed Metadata

Then scroll down to Term Set Settings and choose the term set to be used.

Once added as a site column, the new column may be added to a Content Type that is added to a library, or directly on the library or list.

A Term Store-based column added to a library via a Content Type.

Benefits

The primary benefit using the central Term Store terms via a Managed Metadata column is that the Term Store is the ‘master’ classification scheme providing consistency in classification terms for all SharePoint sites.

As we will see below, Term Store terms may be used to help with the application of retention labels (which themselves may ‘map’ to classification terms in a function/activity-based retention schedule).

Negatives

Using metadata terms from the Term Store is almost identical with using a choice or lookup column. The only real difference is that the Term Store provides a ‘master’ and consistent list of classification (and other) terms.

Term store classification terms, including in Content Types, may only be used on a minority of SharePoint sites.

Additionally:

  • It is not possible to select a Term Set (e.g., the function level), only a Term within a Term Set.
  • Only the selected classification Term appears in the library metadata, without the parent Term Set or visual hierarchy reference to that Term Set – see screenshot below. Technically only that Term is searchable. It is not possible to view a global listing of all records classified according to function and activity.
  • If multiple choices are allowed, a record may be classified according to more than one Term. This may cause issues with grouping, sorting or filtering the content of a library in views.
How the Term appears in the library column.

As we will see below, there is no connection between the classification Terms in Term Sets and the categorisation options available when creating new retention labels via a File Plan. ‘Business Function’ or ‘Category’ choices in the File Plan do not connect with the Term Store.

Term Store terms and Content Types can only be used to classify content stored in SharePoint.

Retention labels

Retention labels in Microsoft 365 can be used in an indirect way to classify records in SharePoint, email and OneDrive because they can be ‘mapped’ to classification elements.

For example, a label may be based on the following elements:

  • Function: Financial Management
  • Activity: Accounting
  • Description: Accounting records
  • Retention: 7 years

Every retention label contains the following options:

  • Name. The name can provide simple details of the classification, for example: ‘Financial Management Accounting – 7 years’.
  • Description for users. This can be the full wording of the retention class.
  • Description for admins. This can contain details of how to apply or interpret the class, if required.
  • Retention settings (e.g., 7 years after date created/modified or label applied).

Benefits

Where the classification terms map to a retention class, the process of applying a retention label to an individual record, email or OneDrive content could potentially be seen as classifying those records against the classification scheme.

The Data Classification section in the Microsoft 365 Compliance portal provides an overview of the volume of records in SharePoint, OneDrive or Exchange that have a specific retention class:

Negatives

Not every record in every SharePoint document library may be subject to a retention label. Many records (for example in Teams-based SharePoint sites) may be subject to a ‘back end’ retention policy applied to the entire site (which creates a Preservation Hold library).

A retention label applied to a record doesn’t actually add any classification terms to the record.

Retention labels don’t map in any way to Term Store classification terms, except in SharePoint Syntex – see below (but this only applies to SharePoint content).

Retention labels/File Plan combination

The File Plan option (Records Management > File Plan, requires E5 licences) can also be used to add classification terms to a retention label as shown in the screenshot below. Note that there is no link with the Term Store.

Benefits

Records (including emails) that have been assigned a retention label could, in theory, be regarded as having been classified in this way because the label contains (or references) the classification terms.

Negatives

When applied to content in SharePoint, OneDrive or Exchange, retention labels linked with the File Plan do not show the File Plan classification terms. It may be possible to write a script that displays all records with the terms from the File Plan, but it may be easier to do this using the Data Classification option described above.

Retention labels/SharePoint Syntex combination

SharePoint Syntex provides a way to apply retention labels to records, stored in SharePoint, that have been identified through the Document Understanding Model process.

Benefits

As can be seen in the screenshot above, each new DU model allows similar types of records (in the example above, ‘Statements of Work’) to be associated with a new or existing Content Type that can include a Term Store Term – for SharePoint records only – and a retention label. This provides three types of ‘classification’:

  • Grouping by record type (e.g., Statement of Work, Invoice)
  • Linking (of sorts) between the records ‘classified’ in this way and a Term Store term added as a metadata column to the Content Type.
  • Assigning of a retention label. This provides the same form of retention label-based classification described above.

Furthermore, if the Extraction option is also used, data extracted to SharePoint columns can be based on choices listed in the Term Store metadata.

Negatives

SharePoint Syntex only works for records – and only those records that have some form of consistency – stored in SharePoint.

Retention labels/trainable classifiers combination

Trainable classifiers are another way that could be used to identify related records and apply a retention label to those records. Microsoft 365 includes six ‘out of the box’ trainable classifiers that will not be of much value to records managers for the classification of records:

  • Source code
  • Harassment
  • Profanity
  • Threat
  • Resume
  • Offensive language (to be deprecated)

The creation of new trainable classifiers requires an E5 licence; they are created through the Data Classification area of the Microsoft 365 Compliance admin portal. Machine Learning is used to identify related records to create the trainable classifiers.

Once created, a retention label may be auto-applied to content stored in SharePoint or Exchange mailboxes using the classifier.

The option to auto-apply a label based on a trainable classifier.

Benefits

The primary outcome (from a recordkeeping classification point of view) of using trainable classifiers is the application of a retention label to content stored in SharePoint and Exchange mailboxes. It can also be used to apply a sensitivity label to that content.

Negatives

It is unlikely that every record will be classified according to every classification option.

Trainable classifiers only work with SharePoint and Exchange mailboxes.

Classifying records per workload

The options are summarised below for each main workload:

  • SharePoint: Use local site or library columns, Term Store terms or retention labels (mapped to a File Plan as necessary), applied manually or automatically, including via SharePoint Syntex or trainable classifiers.
  • Exchange mailboxes: The only feasible option to classify these records is to manually or auto-apply retention labels that are mapped to a classification, including a trainable classifer.
  • OneDrive: Manually or auto-apply retention labels mapped to a classification.
  • Teams. It is not possible to classify Teams chats with the options available.

Is classification necessary?

The classification model described in ISO 15489 and other standards was based on the idea that records would be stored in a central recordkeeping system where they would be subject to and tagged by the terms contained a classification scheme, often applied at the aggregation level (e.g., a file).

Microsoft 365 is not a recordkeeping system but a collection of multiple applications that may create or capture records, primarily in Exchange mailboxes, SharePoint, OneDrive and MS Teams (and also Yammer).

There is no central option to classify records in the recordkeeping sense. The closest options are:

  • The grouping of records in SharePoint sites (and Teams, each of which has a SharePoint site) and libraries that map to business functions and activities.
  • The use of metadata, either terms set in the central Term Store or created in local sites/libraries, to ‘classify’ individual records (including emails) stored in SharePoint document libraries. Each item in the library might have a default classification, or could be classified differently.
  • The use of retention labels that ‘map’ to function/activity pairs in a records disposal authority/schedule. These may be applied, manually or automatically, to content stored in SharePoint, OneDrive and Exchange mailboxes.

Neither of the above may apply, or be applied consistently, to all SharePoint sites, Exchange mailboxes, OneDrive accounts. And neither can be applied to Teams chats.

A different approach to this problem is required, one that likely will likely involve greater use of Artificial Intelligence (AI) and Machine Learning (ML) methods to identify and enable the grouping of records, and provide visualisations of the records so-classified.

Image: Werribee Mansion, Victoria, Australia stairwell (Andrew Warland photo)

Posted in Artificial Intelligence, Classification, Electronic records, Information Management, Microsoft 365, Records management, Retention and disposal

Can Microsoft technology classify records better than a human?

In late 2012, IDM magazine published an article I co-authored with Umi Asma Mokhtar in Malaysia titled ‘Can technology classify records better than a human?’

The article drew on research into recent advances in technology to assist in legal discovery, known as ‘computer-assisted coding’, or ‘predictive coding’, including the following two articles:

Grossman and Cormack’s article noted that ‘a technology-assisted review process involves the interplay of humans and computers to identify the documents in a collection that are responsive to a production request, or to identify those documents that should be withheld on the basis of privilege‘. By contrast, an ‘exhaustive manual review’ required ‘one or more humans to examine each and every document in the collection, and to code them as response (or privileged) or not‘.

The article noted, somewhat gently, that ‘relevant literature suggests that manual review is far from perfect’.

Peck’s article contained similar conclusions. He also noted how computer-based coding was based on a initial ‘seed set’ of documents identified by a human; the computer then identified the properties of those documents and used that to code other similar documents. ‘As the senior reviewer continues to code more sample documents, the computer predicts the reviewer’s coding‘ (hence predictive coding).

By 2011, this new technology was challenging old methods of manual review and classification. Despite some scepticism and slow uptake (for example, see this 2015 IDM article ‘Predictive Coding – What happened to the next big thing?‘), by 2021, it had become an accepted option to support discovery, sometimes involving offshore processing for high volumes of content.

Meanwhile, in an almost unnoticed part of the technology woods, Microsoft acquired Equivio in January 2015. In its press release ‘Microsoft acquires Equivio, provider of machine learning-powered compliance solutions‘, Microsoft stated that the product:

‘… applies machine learning … enabling users to explore large, unstructured sets of data and quickly find what is relevant. It uses advanced text analytics to perform multi-dimensional analyses of data collections, intelligently sorting documents into themes, grouping near-duplicates, isolating unique data, and helping users quickly identify the documents they need. As part of this process, users train the system to identify documents relevant to a particular subject, such as a legal case or investigation. This iterative process is more accurate and cost effective than keyword searches and manual review of vast quantities of documents.’ 

It added that the product would be deployed in Office 365.

Classifying records

The concept of classification for records was defined in paragraph 7.3 of part 1 of the Australian Standard (AS) 4390, released in 1996. The standard defined classification as:

‘… the process of devising and applying schemes based on the business activities generating records, whereby they are categorised in systematic and consistent ways to facilitate their capture, retrieval, maintenance and disposal. Classification includes the determination of naming conventions, user permissions and security restrictions on records’.

The definition provided a number of examples of how the classification of business activities could act as a ‘powerful tool to assist in many of the processes involved in the management of records, resulting from those activities’. This included ‘determining appropriate retention periods for records’.

The only problem with the concept was the assumption that all records could be classified in this way, in a singular recordkeeping system. Unless they were copied to that system, emails largely escaped classification.

Fast forward to 2020

Managing all digital records according to recordkeeping standards has always been a problem. Electronic records management (ERM) systems managed the records that were copied into them, but a much higher percentage remained outside its control – in email systems, network files shares and, increasingly over the past 10 years, created and captured on host of alternative systems including third-party and social media platforms.

By the end of 2019, Microsoft had built a comprehensive single ecosystem to create, capture and manage digital content, including most of the records that would have been previously consigned to an ERMS. And then COVID appeared and working from home become common. All of a sudden (almost), it had to be possible to work online. Online meeting and collaboration systems such as Microsoft Teams took off, usually in parallel with email. Anything that required a VPN to access became a problem.

2021 – Automated classification for records (maybe)

The Microsoft 365 ecosystem generated a huge volume of new content scattered across four main workloads – Exchange/Outlook, SharePoint, OneDrive and Teams. A few other systems such as Yammer also added to the mix.

Most of this information was not subject to any form of classification in the recordkeeping sense. The Microsoft 365 platform included the ability to apply retention policies to content but there was a disconnect between classification and retention.

Microsoft announced Project Cortex at Ignite in 2019. According to the announcement, Project Cortex:

  • Uses advanced AI to deliver insights and expertise in the apps that are used every day, to harness collective knowledge and to empower people and teams to learn, upskill and innovate faster.
  • Uses AI to reason over content across teams and systems, recognizing content types, extracting important information, and automatically organizing content into shared topics like projects, products, processes and customers.
  • Creates a knowledge network based on relationships among topics, content, and people.

Project Cortex drew on technological capabilities present in Azure’s Cognitive Services and the Microsoft Graph. It is not known to what extent the Equivio product, acquired in 2015, was integrated with these solutions but, from all the available details, it appears the technology is at least connected in one way or another.

During Ignite 2020, Microsoft announced SharePoint Syntex and trainable classifiers, either of which could be deployed to classify information and apply retention rules.

Trainable classifiers

Trainable classifiers were made generally available (GA) in January 2021.

Trainable classifiers sound very similar to the predictive coding capability that appeared from 2011. However, they:

  • Use the power of Machine Learning (ML) to identify categories of information. This is achieved by creating an initial ‘seed’ of data in a SharePoint library, creating a new trainable classifier and pointing it at the seed, then reviewing the outcomes. More content is added to ensure accuracy.
  • Can be used to identify similar content in Exchange mailboxes, SharePoint sites, OneDrive for Business accounts, and Microsoft 365 Groups and apply a pre-defined retention label to that content.

In theory, this means it might be possible to identify a set of similar records – for example, financial documents – and apply the same retention label to them. The Content Explorer in the Compliance admin portal will list the records that are subject to that label.

SharePoint Syntex

SharePoint Syntex was announced at Ignite in September 2020 and made generally available in early 2021.

The original version of Syntex (as part of Project Cortex) was targeted at the ability to extract metadata from forms, a capability that has existed with various other scanning/OCR products for at least a decade. The capability that was released in early 2021 included the base metadata extraction capability as well as a broader capability to classify content and apply a retention label.

The two Syntex capabilities, described in a YouTube video from Microsoft titled ‘Step-by-Step: How to Build a Document Understanding Model using Project Cortex‘, are:

  • Classification. This capability involves the following steps: (a) Creation of (SharePoint site) Content Center; (b) Creation of a Document Understanding Model (DUM) for each ‘type’ of record; the DUM can create a new content type or point to an existing one; the DUM can also link with the retention label to be applied; (c) Creation of an initial seed of records (positives and a couple of negatives); (d) Creation of Explanations that help the model find records by phrase, proximity, or pattern (matching, e.g., dates); (e) Training; (f) Applying the model to SharePoint sites or libraries. The outcome of the classification is that matching records in the location where it is pointed are assigned to the Content Type (replacing any previous one) and tagged with a retention label (also replacing any previous one).
  • Extraction. This capability has similar steps to the classification option except that the Explanations identify what metadata is to be extracted from where (again based on phrase, proximity or pattern) to what metadata column. The outcome of extraction is that the matching records include the extracted metadata in the library columns (in addition to the Content Type and retention label).

As with trainable classifiers, Syntex uses Machine Learning to classify records, but Syntex also has the ability to extract metadata. Syntex can only classify or extract data from SharePoint libraries.

Trainable classifiers or Syntex?

Both options require the organisation to create an initial seed of content and to use Machine Learning to develop an understanding of the content, in order to classify it.

The models are similar, the primary difference is that trainable classifiers can work on content stored in email, SharePoint and OneDrive, whereas Syntex is currently restricted to SharePoint.

Predictive coding

On 18 March 2021, Microsoft announced the pending (April 2021) preview release of an enhanced predictive coding module for advanced eDiscovery in Microsoft 365.

The announcement, pointing to this roadmap item, noted that eDiscovery managers would be able to create and train relevance models within Advanced eDiscovery using as few as 50 documents, to prioritize review.

So, can Microsoft technology classify records better than humans?

In their 1999 book ‘Sorting Things Out: Classification and its Consequences‘ (MIT Press), Geoffrey Bowker and Susan Leigh Star noted that ‘to classify is human’ and that classification was ‘the sleeping beauty of information science’ and ‘the scaffolding of information infrastructures’.

But they also noted how ‘each standard and category valorizes some point or view and silences another. Standards and classifications (can) produce advantage or suffering’ (quote from review in link above).

Technology-based classification in theory is impartial. It categorises what it finds through machine learning and algorithms. But, technology-based classification requires human review of the initial and subsequent seeds. Accordingly such classification has the potential to be skewed according to the way the reviewer’s bias or predilections, the selection of one set of preferred or ‘matching’ records over another.

Ultimately, a ‘match’ is based on a scoring ‘relevancy’ algorithm. Perhaps the technology can classify better than humans, but whether the classification is accurate may depend on the human to make accurate, consistent and impartial decisions.

Either way, the manual classification of records is likely to go the same way as the manual review of legal documents for discovery.

Image source: Providence Public Library Flickr

Posted in Classification, Compliance, Information Management, Records management, Retention and disposal

Classifying records in Microsoft 365

The classification of records is fundamental recordkeeping activity. It is defined in the international standard ISO 15489-1:2016 (Information and Documentation – Records Management) as the ‘systematic identification and/or arrangement of business activities and/or records into categories according to logically structured conventions, methods and procedural rules‘. (Terms and Definitions, 3.4)

The purpose of classification is defined by State Records NSW as follows: ‘In records management, records are classified according to the business functions and activities which generate the records. This functional approach to classification means that classification can be used for a range of records management purposes, including appraisal and disposal, determining handling, storage and security requirements, and setting user permissions, as well as providing a basis for titling and indexing‘. (Records Classification, accessed 13 January 2021.)

The ever-increasing volume of digital records, the many different ways to create them, and the multitude of record types that are created and storage locations, have made it more difficult to accurately and consistently manually classify records, including through the creation of pre-defined ‘containers’ or aggregations based on classification terms. Despite this, the requirement to link the classification of records with their retention and and disposal remains.

For over three decades, Microsoft’s applications and technology platforms have been used to create, capture, store and manage records. Some of these records (in the earlier period) were printed and placed on paper files, or stored (from around 2000) in dedicated electronic document and records management (EDRM) systems.

But the volume and type of digital content, including with new types of records (e.g., chat messages) and storage locations, continues to grow. In response, Microsoft invested heavily in addressing the need to classify records ‘at scale’.

This post looks at various ways to classify records, for retention and disposition purposes, in Microsoft 365.

The old-school, manual method – metadata

Most of the records in Microsoft 365 will be created, captured or stored in one of the four primary workloads: Exchange mailboxes, SharePoint sites/libraries, MS Teams chats (a ‘compliance copy’ of which is stored in Exchange mailboxes), and OneDrive for Business libraries. Some records may also exist in Yammer or other web page content (e.g., intranet).

Most SharePoint sites as well as Teams (that have a SharePoint site) will be created according to some form of business need to create, capture, store and share records; that is, the site or team purpose may be based on a business function or activity. This way of grouping records may in some ways be used as a way to classify records – by SharePoint site (e.g., function) or document library (e.g., activity).

Records may be stored in multiple document libraries, or within a folder structure of a single library.

A number of methods (some of which rely on others) can be used to add classification (and other) metadata to records stored in SharePoint document libraries:

1 – Creating the classification taxonomy in the Managed Metadata Service (MMS)/Term Store via the SharePoint Admin portal – Content Services – Term store, and then applying these terms in content types that are then deployed in SharePoint sites.

An example of Business Classification Terms in the MMS

2 – Creating global content types from the SharePoint Admin portal, in the Content Services – Content type gallery area (see ‘Finance Document’ example below) and then deploying these in specific SharePoint sites where site columns that contain classification terms will be added.

3 – Creating site columns that contain classification terms, including from the MMS, and adding these to global or site content types or document libraries where they can be applied to records.

In this example, the site column ‘BCS Function’ maps to the MMS BCS terms

4 – Creating site content types and adding site columns (including MMS-based columns), then adding these content types to document libraries.

In this example, the MMS-based column now appears in the library columns

But, most of the above is somewhat complicated and cumbersome and would normally only be used for and manually applied to specific types of records.

The simplest way to apply BCS/File Plan terms at the document (or document set) level is to (a) store records to the same BCS function or activity in the same library, (b) create site or library columns with default values and add these to the library. This way means that the default terms are applied automatically as soon as a new record is uploaded, including when shared/inherited from the site columns added to a document set that ‘contains’ a document content type.

Example metadata columns shared from the document set content type

However, keep in mind that SharePoint is just one of the workloads where records are stored.

Records in the form of emails, chats and ‘personal’ content (as well as Yammer messages and web pages) are created in and stored across the other workloads. Some attempt may be made to copy these other records (especially emails) in SharePoint sites but it starts to get complicated or impossible to do so with things like Teams chat messages.

In most cases (and according to Microsoft’s own recommendations), it is better to leave the records where they were created or captured (‘in place’), and apply centralised compliance controls (classification, retention labels and policies) to this content.

Leaving the records in place in this way does not exclude the ability to create SharePoint sites and document libraries in those sites that map to classification terms, and/or use the site column approach described above but these are more likely to be exceptions.

In fact, some form of logical structure is almost certain anyway as most end-users will probably want to access and manage information in their own specific work context (the Team/SharePoint site).

Trainable classifiers

Since not all records are stored in SharePoint and the ever-increasing volume of digital content stored across the Microsoft 365 platform, Microsoft needed to find a way to classify records ‘at scale’.

The solution was to use machine learning (ML) via trainable classifiers accessed in the ‘Data Classification’ section of the Microsoft 365 Compliance portal. This capability is only available to E5 licences.

The trainable classifiers solution was released to General Availability on 12 January 2021 (‘Announcing GA of machine learning trainable classifiers for your compliance needs‘, accessed 13 January 2021).

See the Microsoft web page ‘Learn about trainable classifiers‘ to learn more about this option. To quote from that page:

This classification method is particularly well suited to content that isn’t easily identified by either the manual or automated pattern matching methods. This method of classification is more about training a classifier to identify an item based on what the item is, not by elements that are in the item (pattern matching).’

Organisations (including E3 licence holders) may make use of five pre-defined trainable classifiers (Resumes, Source Code, Targeted Harassment, Profanity or Threat. A sixth classifier ‘Offensive language’, is to be deprecated). Custom classifiers require an E5 licence.

Custom classifiers require ‘significantly more work’ than the pre-existing classifiers and the process is quite involved (see the process flow diagram in the ‘Learn about’ page link above) but in summary it involves the following steps:

  • Creating the custom classifier.
  • Creating a set of manually selected example records (50 to 500) in a dedicated SharePoint Online site as the ‘seed’. This would include a range of emails in the seed examples.
  • Testing the classifier with the seeded documents.
  • Re-training with additional content – both positive and negative matches.

Once the classifier is published, it can be used to identify and classify related content across SharePoint Online, Exchange, and OneDrive (but not Teams).

The page ‘Default crawled file name extensions and parsed file types‘ provides details of all the record types that can be classified in this way. Note it is not clear if trainable classifiers can crawl the compliance copy of Teams chat messages stored in hidden folders in Exchange mailboxes.

Label-based retention policies can then be automatically applied to content that has been identified through the trainable classifier.

However, note that the classifier does not ‘group’, aggregate or ‘present’ (list) the records for review (except broadly via the Content Explorer); however, the label applied to the records can be searched via the ‘Content Search’ option in the Compliance portal. This is a much better option than not having any idea how many records of a particular classification may exist in Exchange mailboxes, OneDrive accounts, or general SharePoint sites. It requires some degree of ‘letting go’ of the ability to view and browse content classified this way, and trusting the system.

The main limit with trainable classifiers is that it requires an E5 or E5 compliance licence.

The other limitation is the management of the disposition of records that have been identified with trainable classifiers and had a label-based retention policy applied. There are significant shortcomings with the current ‘Disposition Review’ process, specifically the lack of adequate metadata to review records due for disposal or the details of what has been destroyed.

SharePoint Syntex

Another (but limited) option might be to use SharePoint Syntex (see ‘Introduction to Microsoft SharePoint Syntex‘ for an overview), although its range is limited to SharePoint and – it seems – only records that have a relatively consistent structure and format.

SharePoint Syntex evolved out of Project Cortex’s ability to extract and capture metadata from records. It can also be used through its ‘Document Understanding Model‘ (DUM) to provide a way to classify records stored in SharePoint Online (only). It makes use of a ‘seeding’ model that is similar to trainable classifiers (and may be based on the same underlying AI engine).

Broadly speaking, the DUM works on the basis of loading a small ‘seed’ set of (relatively consistently formated) example files into a dedicated Content Center (or Centers). This is very similar to the process of using trainable classifiers, except that the latter does not require a ‘content center’ SharePoint site to be created.

  • The example files are ‘trained’ by being ‘classified’ through the document understanding process based on a set of ‘explanation types‘ that are used to help find the relevant content. The three explanation types are: (a) phrase list (a list of words, phrases, numbers, or other characters used in the document or information that you are extracting); (b) pattern list (patterns of numbers, letters, or other characters); and (c) proximity (describes how close other explanations are to each other).
  • The document understanding model (DUM) produced through the explanation types is associated (and deployed) with a new or existing content type. 
  • Once applied to a SharePoint site library, the DUM/content type provides the basis for identifying and tagging (with metadata) other similar records in the location (e.g., the library) where the DUM has been deployed. 
  • If the documents have consistent content such as invoices, certain data from those documents can be extracted as metadata. 

Retention labels may be applied to records classified using SharePoint Syntex, as described on this page ‘Apply a retention label to a document understanding model‘.

Summing up – which one should be used?

The answer to this question will depend on your compliance requirements.

Smaller organisations may be able to set up SharePoint sites and document libraries with site columns/metadata that maps to their business classification scheme or file plan, and copy emails to those libraries. There may be little need to use AI-based classification methods.

In large and more complex organisations (with E5 licences), especially those with a lot of content stored across Exchange mailboxes and SharePoint sites (including Teams-based sites) there will most certainly be a need for some form of AI-based classification in addition to classification-mapped SharePoint sites (and Teams).

Organisations with E3 licences might use the manual methods described above for specific types of records, and consider acquiring additional E5 Compliance licences to make use of trainable classifiers or SharePoint Syntex for other records.