The classification of records is fundamental recordkeeping activity. It is defined in the international standard ISO 15489-1:2016 (Information and Documentation – Records Management) as the ‘systematic identification and/or arrangement of business activities and/or records into categories according to logically structured conventions, methods and procedural rules‘. (Terms and Definitions, 3.4)
The purpose of classification is defined by State Records NSW as follows: ‘In records management, records are classified according to the business functions and activities which generate the records. This functional approach to classification means that classification can be used for a range of records management purposes, including appraisal and disposal, determining handling, storage and security requirements, and setting user permissions, as well as providing a basis for titling and indexing‘. (Records Classification, accessed 13 January 2021.)
The ever-increasing volume of digital records, the many different ways to create them, and the multitude of record types that are created and storage locations, have made it more difficult to accurately and consistently manually classify records, including through the creation of pre-defined ‘containers’ or aggregations based on classification terms. Despite this, the requirement to link the classification of records with their retention and and disposal remains.
For over three decades, Microsoft’s applications and technology platforms have been used to create, capture, store and manage records. Some of these records (in the earlier period) were printed and placed on paper files, or stored (from around 2000) in dedicated electronic document and records management (EDRM) systems.
But the volume and type of digital content, including with new types of records (e.g., chat messages) and storage locations, continues to grow. In response, Microsoft invested heavily in addressing the need to classify records ‘at scale’.
This post looks at various ways to classify records, for retention and disposition purposes, in Microsoft 365.
The old-school, manual method – metadata
Most of the records in Microsoft 365 will be created, captured or stored in one of the four primary workloads: Exchange mailboxes, SharePoint sites/libraries, MS Teams chats (a ‘compliance copy’ of which is stored in Exchange mailboxes), and OneDrive for Business libraries. Some records may also exist in Yammer or other web page content (e.g., intranet).
Most SharePoint sites as well as Teams (that have a SharePoint site) will be created according to some form of business need to create, capture, store and share records; that is, the site or team purpose may be based on a business function or activity. This way of grouping records may in some ways be used as a way to classify records – by SharePoint site (e.g., function) or document library (e.g., activity).
Records may be stored in multiple document libraries, or within a folder structure of a single library.
A number of methods (some of which rely on others) can be used to add classification (and other) metadata to records stored in SharePoint document libraries:
1 – Creating the classification taxonomy in the Managed Metadata Service (MMS)/Term Store via the SharePoint Admin portal – Content Services – Term store, and then applying these terms in content types that are then deployed in SharePoint sites.
2 – Creating global content types from the SharePoint Admin portal, in the Content Services – Content type gallery area (see ‘Finance Document’ example below) and then deploying these in specific SharePoint sites where site columns that contain classification terms will be added.
3 – Creating site columns that contain classification terms, including from the MMS, and adding these to global or site content types or document libraries where they can be applied to records.
4 – Creating site content types and adding site columns (including MMS-based columns), then adding these content types to document libraries.
But, most of the above is somewhat complicated and cumbersome and would normally only be used for and manually applied to specific types of records.
The simplest way to apply BCS/File Plan terms at the document (or document set) level is to (a) store records to the same BCS function or activity in the same library, (b) create site or library columns with default values and add these to the library. This way means that the default terms are applied automatically as soon as a new record is uploaded, including when shared/inherited from the site columns added to a document set that ‘contains’ a document content type.
However, keep in mind that SharePoint is just one of the workloads where records are stored.
Records in the form of emails, chats and ‘personal’ content (as well as Yammer messages and web pages) are created in and stored across the other workloads. Some attempt may be made to copy these other records (especially emails) in SharePoint sites but it starts to get complicated or impossible to do so with things like Teams chat messages.
In most cases (and according to Microsoft’s own recommendations), it is better to leave the records where they were created or captured (‘in place’), and apply centralised compliance controls (classification, retention labels and policies) to this content.
Leaving the records in place in this way does not exclude the ability to create SharePoint sites and document libraries in those sites that map to classification terms, and/or use the site column approach described above but these are more likely to be exceptions.
In fact, some form of logical structure is almost certain anyway as most end-users will probably want to access and manage information in their own specific work context (the Team/SharePoint site).
Since not all records are stored in SharePoint and the ever-increasing volume of digital content stored across the Microsoft 365 platform, Microsoft needed to find a way to classify records ‘at scale’.
The solution was to use machine learning (ML) via trainable classifiers accessed in the ‘Data Classification’ section of the Microsoft 365 Compliance portal. This capability is only available to E5 licences.
The trainable classifiers solution was released to General Availability on 12 January 2021 (‘Announcing GA of machine learning trainable classifiers for your compliance needs‘, accessed 13 January 2021).
See the Microsoft web page ‘Learn about trainable classifiers‘ to learn more about this option. To quote from that page:
‘This classification method is particularly well suited to content that isn’t easily identified by either the manual or automated pattern matching methods. This method of classification is more about training a classifier to identify an item based on what the item is, not by elements that are in the item (pattern matching).’
Organisations (including E3 licence holders) may make use of five pre-defined trainable classifiers (Resumes, Source Code, Targeted Harassment, Profanity or Threat. A sixth classifier ‘Offensive language’, is to be deprecated). Custom classifiers require an E5 licence.
Custom classifiers require ‘significantly more work’ than the pre-existing classifiers and the process is quite involved (see the process flow diagram in the ‘Learn about’ page link above) but in summary it involves the following steps:
- Creating the custom classifier.
- Creating a set of manually selected example records (50 to 500) in a dedicated SharePoint Online site as the ‘seed’. This would include a range of emails in the seed examples.
- Testing the classifier with the seeded documents.
- Re-training with additional content – both positive and negative matches.
Once the classifier is published, it can be used to identify and classify related content across SharePoint Online, Exchange, and OneDrive (but not Teams).
The page ‘Default crawled file name extensions and parsed file types‘ provides details of all the record types that can be classified in this way. Note it is not clear if trainable classifiers can crawl the compliance copy of Teams chat messages stored in hidden folders in Exchange mailboxes.
Label-based retention policies can then be automatically applied to content that has been identified through the trainable classifier.
However, note that the classifier does not ‘group’, aggregate or ‘present’ (list) the records for review (except broadly via the Content Explorer); however, the label applied to the records can be searched via the ‘Content Search’ option in the Compliance portal. This is a much better option than not having any idea how many records of a particular classification may exist in Exchange mailboxes, OneDrive accounts, or general SharePoint sites. It requires some degree of ‘letting go’ of the ability to view and browse content classified this way, and trusting the system.
The main limit with trainable classifiers is that it requires an E5 or E5 compliance licence.
The other limitation is the management of the disposition of records that have been identified with trainable classifiers and had a label-based retention policy applied. There are significant shortcomings with the current ‘Disposition Review’ process, specifically the lack of adequate metadata to review records due for disposal or the details of what has been destroyed.
Another (but limited) option might be to use SharePoint Syntex (see ‘Introduction to Microsoft SharePoint Syntex‘ for an overview), although its range is limited to SharePoint and – it seems – only records that have a relatively consistent structure and format.
SharePoint Syntex evolved out of Project Cortex’s ability to extract and capture metadata from records. It can also be used through its ‘Document Understanding Model‘ (DUM) to provide a way to classify records stored in SharePoint Online (only). It makes use of a ‘seeding’ model that is similar to trainable classifiers (and may be based on the same underlying AI engine).
Broadly speaking, the DUM works on the basis of loading a small ‘seed’ set of (relatively consistently formated) example files into a dedicated Content Center (or Centers). This is very similar to the process of using trainable classifiers, except that the latter does not require a ‘content center’ SharePoint site to be created.
- The example files are ‘trained’ by being ‘classified’ through the document understanding process based on a set of ‘explanation types‘ that are used to help find the relevant content. The three explanation types are: (a) phrase list (a list of words, phrases, numbers, or other characters used in the document or information that you are extracting); (b) pattern list (patterns of numbers, letters, or other characters); and (c) proximity (describes how close other explanations are to each other).
- The document understanding model (DUM) produced through the explanation types is associated (and deployed) with a new or existing content type.
- Once applied to a SharePoint site library, the DUM/content type provides the basis for identifying and tagging (with metadata) other similar records in the location (e.g., the library) where the DUM has been deployed.
- If the documents have consistent content such as invoices, certain data from those documents can be extracted as metadata.
Retention labels may be applied to records classified using SharePoint Syntex, as described on this page ‘Apply a retention label to a document understanding model‘.
Summing up – which one should be used?
The answer to this question will depend on your compliance requirements.
Smaller organisations may be able to set up SharePoint sites and document libraries with site columns/metadata that maps to their business classification scheme or file plan, and copy emails to those libraries. There may be little need to use AI-based classification methods.
In large and more complex organisations (with E5 licences), especially those with a lot of content stored across Exchange mailboxes and SharePoint sites (including Teams-based sites) there will most certainly be a need for some form of AI-based classification in addition to classification-mapped SharePoint sites (and Teams).
Organisations with E3 licences might use the manual methods described above for specific types of records, and consider acquiring additional E5 Compliance licences to make use of trainable classifiers or SharePoint Syntex for other records.