Posted in Classification, Governance, Information Management, Microsoft 365, Records management, Retention and disposal

Classifying records in Microsoft 365

There are three main options in Microsoft 365 to apply recordkeeping classification terms to (some) records:

  • Metadata columns added to SharePoint sites, including those added to Content Types and/or added directly to document libraries.
  • Taxonomy terms stored in the central Term Store, including those added as site columns, added to site content types, and/or added directly to document libraries. The only difference from the first option is that Term Store classification terms are stored and managed centrally and are therefore available to every SharePoint site.
  • Retention labels that: (a) ‘map’ to classification terms; (b) are linked with a File Plan that includes the classification terms; (c) are the same as (a) or (b) and are used with a Document Understanding Model in SharePoint Syntex; or (d) are the same as (a) or (b) and are used in conjunction with Trainable Classifiers.

The first two options can only be applied to content stored in SharePoint. Retention labels may be applied to emails and OneDrive content. None of the three options can be applied to Teams chats. Also note that there is no connection between the SharePoint Term Store and the File Plan, both of which can be used to store classification terms.

This post:

  • Defines the meaning of classification from a recordkeeping point of view.
  • Describes each of the above options and their limits.
  • Discusses the requirement to classify records and other options in Microsoft 365.

What is classification?

Humans are natural-born classifiers. We see it in the way we store cutlery or linen, or other household items or personal records.

Business records also need some form of classification. But what does that mean? The records management standard ISO 15489 defines classification as:

‘the systematic identification and arrangement of business activities and/or records into categories according to logically structured conventions, methods and procedural rules represented in a classification system’. (ISO 15489.1 2017 clause 3.5).

The standard also states (4.2.1) that a classification scheme based on business activities, along with a records disposition authority and a security and access classification scheme, are the principal instruments used in records management operations.

The classification of records in business is important to establish their context and to help find them.

Microsoft 365 includes various options to apply classification terms to records.

Metadata columns in SharePoint

The simplest way to classify records stored in SharePoint document libraries is to either create site columns containing the classification terms and add those columns to document libraries, or create them directly in those libraries.

Benefits

Adding site or library columns is relatively simple. As classification terms are usually in the form of a (hierarchical) list, it is simple to add one choice or lookup column for the function and another for the activity.

A lookup column can bring across a value from another column when an item is selected; for example, if the lookup list places ‘Accounting’ (Activity) in the same list row as ‘Financial Management’ (Function), selection of ‘Accounting’ will bring across ‘Financial Management’ as a separate (linked) column.

Default values can be set, meaning that records added to a library (one that only contains records with those classification terms) can be assigned the same classification terms each time without user intervention.
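The lookup and default-value behaviour described above can be sketched as a small illustration. The lookup list contents, column names and the `classify` helper are all hypothetical, not a SharePoint API:

```python
# Illustrative sketch only: a hypothetical lookup list pairing each
# Activity with its Function, plus a library-level default value.

LOOKUP_LIST = {
    "Accounting": "Financial Management",
    "Budgeting": "Financial Management",
}

DEFAULT_ACTIVITY = "Accounting"  # the library's default value

def classify(activity=None):
    """Return the metadata columns a record added to the library would receive."""
    chosen = activity or DEFAULT_ACTIVITY
    # Selecting the Activity 'brings across' the linked Function column
    return {"Activity": chosen, "Function": LOOKUP_LIST[chosen]}

print(classify())             # default applied without user intervention
print(classify("Budgeting"))  # explicit selection still links the Function
```

The point of the sketch is that the user only ever picks (or defaults to) one value; the linked function arrives for free.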

Negatives

SharePoint choice or lookup columns do not allow hierarchical views or values to be displayed in the list view, so the context for the classification terms may not be obvious unless both function and activity are listed.

The Term Store

The Term Store, also known as the Managed Metadata Service (MMS), has existed in SharePoint as an option to create and centrally manage classification and taxonomy terms for at least a decade.

In 2020, access to the Term Store was re-located from its previous location (https://tenantname-admin.sharepoint.com/_layouts/15/TermStoreManager.aspx) to the SharePoint Online admin portal under the ‘Content Services’ section:

The location of the Term Store in the SharePoint admin center

Organisations can create multiple sets of taxonomies or ‘term groups’ (e.g., ‘BCS’ or ‘People’) within the Term Store. Each Term Group consists of the following:

  • Term Sets. These generally could map to a business function. Each Term Set has a name and description, and four tabs with the following information: (a) General: Owner, Stakeholders, Contact, Unique ID (GUID); (b) Usage settings: Submission policy, Available for tagging, Sort order; (c) Navigation: Use term set for site navigation or faceted navigation – both disabled by default; (d) Advanced: Translation options, custom properties.
  • Terms. These generally could map to an activity. Each Term has a name and three tabs: (a) General: Language, translation, synonyms and description; (b) Usage settings: Available for tagging, Member of (Term Set), Unique ID (GUID); (c) Advanced: Shared custom properties, Local custom properties.

In the example below, the Term Set (function) of ‘Community Relations’ has three Terms (activities).
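The hierarchy just described (Term Group → Term Set → Term) can be modelled as a minimal sketch. The group and set names echo the example above, but the three activity terms and the `parent_function` helper are invented for illustration; this is not the SharePoint object model:

```python
# Minimal model of the Term Store hierarchy: Group -> Set (function) -> Term (activity).
# GUIDs are generated to mirror the 'Unique ID (GUID)' field each object carries.
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Term:                      # generally maps to an activity
    name: str
    guid: str = field(default_factory=lambda: str(uuid4()))

@dataclass
class TermSet:                   # generally maps to a function
    name: str
    terms: list = field(default_factory=list)

@dataclass
class TermGroup:
    name: str
    term_sets: list = field(default_factory=list)

# Hypothetical activity terms under the 'Community Relations' function
bcs = TermGroup("BCS", [
    TermSet("Community Relations",
            [Term("Enquiries"), Term("Events"), Term("Visits")]),
])

def parent_function(group, term_name):
    """Find the Term Set (function) that contains a given Term (activity)."""
    for ts in group.term_sets:
        if any(t.name == term_name for t in ts.terms):
            return ts.name
    return None

print(parent_function(bcs, "Events"))
```

Note that, as discussed under ‘Negatives’ below, SharePoint itself only stores the selected Term in the library column; a helper like this parent lookup is exactly what the library view does not give you.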

Once they have been created in the Term Store, a term set or its terms can be added to a SharePoint site, either as a new site column or a local library/list column, as shown in the two screenshots below:

First, create a new column and select Managed Metadata

Then scroll down to Term Set Settings and choose the term set to be used.

Once added as a site column, the new column may be added to a Content Type that is added to a library, or directly on the library or list.

A Term Store-based column added to a library via a Content Type.

Benefits

The primary benefit of using the central Term Store terms via a Managed Metadata column is that the Term Store acts as the ‘master’ classification scheme, providing consistency in classification terms for all SharePoint sites.

As we will see below, Term Store terms may be used to help with the application of retention labels (which themselves may ‘map’ to classification terms in a function/activity-based retention schedule).

Negatives

Using metadata terms from the Term Store is almost identical to using a choice or lookup column. The only real difference is that the Term Store provides a ‘master’ and consistent list of classification (and other) terms.

In practice, Term Store classification terms, including those in Content Types, may end up being used on only a minority of SharePoint sites.

Additionally:

  • It is not possible to select a Term Set (e.g., the function level), only a Term within a Term Set.
  • Only the selected classification Term appears in the library metadata, without the parent Term Set or visual hierarchy reference to that Term Set – see screenshot below. Technically only that Term is searchable. It is not possible to view a global listing of all records classified according to function and activity.
  • If multiple choices are allowed, a record may be classified according to more than one Term. This may cause issues with grouping, sorting or filtering the content of a library in views.

How the Term appears in the library column.

As we will see below, there is no connection between the classification Terms in Term Sets and the categorisation options available when creating new retention labels via a File Plan. ‘Business Function’ or ‘Category’ choices in the File Plan do not connect with the Term Store.

Term Store terms and Content Types can only be used to classify content stored in SharePoint.

Retention labels

Retention labels in Microsoft 365 can be used in an indirect way to classify records in SharePoint, email and OneDrive because they can be ‘mapped’ to classification elements.

For example, a label may be based on the following elements:

  • Function: Financial Management
  • Activity: Accounting
  • Description: Accounting records
  • Retention: 7 years

Every retention label contains the following options:

  • Name. The name can provide simple details of the classification, for example: ‘Financial Management Accounting – 7 years’.
  • Description for users. This can be the full wording of the retention class.
  • Description for admins. This can contain details of how to apply or interpret the class, if required.
  • Retention settings (e.g., 7 years after date created/modified or label applied).
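The ‘mapping’ from classification elements to a label can be sketched as follows. The field names and the `label_from_classification` helper are illustrative, not the Microsoft 365 label schema:

```python
# Sketch of how classification elements might 'map' onto the retention
# label fields listed above. All names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RetentionLabel:
    name: str
    description_for_users: str
    description_for_admins: str
    retention_years: int
    retention_trigger: str

def label_from_classification(function, activity, description, years):
    return RetentionLabel(
        name=f"{function} {activity} - {years} years",
        description_for_users=description,
        description_for_admins=f"Apply to {activity} records under {function}.",
        retention_years=years,
        retention_trigger="date created/modified or label applied",
    )

label = label_from_classification(
    "Financial Management", "Accounting", "Accounting records", 7)
print(label.name)
```

The label name carries the function/activity pair, which is what makes applying the label an indirect act of classification.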

Benefits

Where the classification terms map to a retention class, the process of applying a retention label to an individual record, email or OneDrive content could potentially be seen as classifying those records against the classification scheme.

The Data Classification section in the Microsoft 365 Compliance portal provides an overview of the volume of records in SharePoint, OneDrive or Exchange that have a specific retention class.

Negatives

Not every record in every SharePoint document library may be subject to a retention label. Many records (for example in Teams-based SharePoint sites) may be subject to a ‘back end’ retention policy applied to the entire site (which creates a Preservation Hold library).

A retention label applied to a record doesn’t actually add any classification terms to the record.

Retention labels don’t map in any way to Term Store classification terms, except in SharePoint Syntex – see below (but this only applies to SharePoint content).

Retention labels/File Plan combination

The File Plan option (Records Management > File Plan, requires E5 licences) can also be used to add classification terms to a retention label as shown in the screenshot below. Note that there is no link with the Term Store.

Benefits

Records (including emails) that have been assigned a retention label could, in theory, be regarded as having been classified in this way because the label contains (or references) the classification terms.

Negatives

When applied to content in SharePoint, OneDrive or Exchange, retention labels linked with the File Plan do not show the File Plan classification terms. It may be possible to write a script that displays all records with the terms from the File Plan, but it may be easier to do this using the Data Classification option described above.

Retention labels/SharePoint Syntex combination

SharePoint Syntex provides a way to apply retention labels to records, stored in SharePoint, that have been identified through the Document Understanding Model process.

Benefits

As can be seen in the screenshot above, each new DU model allows similar types of records (in the example above, ‘Statements of Work’) to be associated with a new or existing Content Type that can include a Term Store Term – for SharePoint records only – and a retention label. This provides three types of ‘classification’:

  • Grouping by record type (e.g., Statement of Work, Invoice)
  • Linking (of sorts) between the records ‘classified’ in this way and a Term Store term added as a metadata column to the Content Type.
  • Assigning of a retention label. This provides the same form of retention label-based classification described above.

Furthermore, if the Extraction option is also used, data extracted to SharePoint columns can be based on choices listed in the Term Store metadata.

Negatives

SharePoint Syntex only works for records – and only those records that have some form of consistency – stored in SharePoint.

Retention labels/trainable classifiers combination

Trainable classifiers are another way that could be used to identify related records and apply a retention label to those records. Microsoft 365 includes six ‘out of the box’ trainable classifiers that will not be of much value to records managers for the classification of records:

  • Source code
  • Harassment
  • Profanity
  • Threat
  • Resume
  • Offensive language (to be deprecated)

The creation of new trainable classifiers requires an E5 licence; they are created through the Data Classification area of the Microsoft 365 Compliance admin portal. Machine Learning is used to identify related records to create the trainable classifiers.

Once created, a retention label may be auto-applied to content stored in SharePoint or Exchange mailboxes using the classifier.

The option to auto-apply a label based on a trainable classifier.

Benefits

The primary outcome (from a recordkeeping classification point of view) of using trainable classifiers is the application of a retention label to content stored in SharePoint and Exchange mailboxes. It can also be used to apply a sensitivity label to that content.

Negatives

It is unlikely that trainable classifiers will identify and classify every record against every classification option.

Trainable classifiers only work with SharePoint and Exchange mailboxes.

Classifying records per workload

The options are summarised below for each main workload:

  • SharePoint: Use local site or library columns, Term Store terms or retention labels (mapped to a File Plan as necessary), applied manually or automatically, including via SharePoint Syntex or trainable classifiers.
  • Exchange mailboxes: The only feasible option to classify these records is to manually or auto-apply retention labels that are mapped to a classification, including via a trainable classifier.
  • OneDrive: Manually or auto-apply retention labels mapped to a classification.
  • Teams: It is not possible to classify Teams chats with the options available.
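The per-workload summary above can be restated as a simple lookup table (a paraphrase of the text, not an exhaustive feature matrix):

```python
# Which classification options the post identifies for each main workload.
CLASSIFICATION_OPTIONS = {
    "SharePoint": ["site/library columns", "Term Store terms", "retention labels"],
    "Exchange mailboxes": ["retention labels"],
    "OneDrive": ["retention labels"],
    "Teams chats": [],   # no classification option available
}

def can_classify(workload):
    return bool(CLASSIFICATION_OPTIONS.get(workload))

print(can_classify("SharePoint"))   # True
print(can_classify("Teams chats"))  # False
```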

Is classification necessary?

The classification model described in ISO 15489 and other standards was based on the idea that records would be stored in a central recordkeeping system where they would be subject to, and tagged with, the terms contained in a classification scheme, often applied at the aggregation level (e.g., a file).

Microsoft 365 is not a recordkeeping system but a collection of multiple applications that may create or capture records, primarily in Exchange mailboxes, SharePoint, OneDrive and MS Teams (and also Yammer).

There is no central option to classify records in the recordkeeping sense. The closest options are:

  • The grouping of records in SharePoint sites (and Teams, each of which has a SharePoint site) and libraries that map to business functions and activities.
  • The use of metadata, either terms set in the central Term Store or created in local sites/libraries, to ‘classify’ individual records (including emails) stored in SharePoint document libraries. Each item in the library might have a default classification, or could be classified differently.
  • The use of retention labels that ‘map’ to function/activity pairs in a records disposal authority/schedule. These may be applied, manually or automatically, to content stored in SharePoint, OneDrive and Exchange mailboxes.

None of the above may apply, or be applied consistently, to all SharePoint sites, Exchange mailboxes or OneDrive accounts. And none can be applied to Teams chats.

A different approach to this problem is required, one that will likely involve greater use of Artificial Intelligence (AI) and Machine Learning (ML) methods to identify and enable the grouping of records, and provide visualisations of the records so classified.

Image: Werribee Mansion, Victoria, Australia stairwell (Andrew Warland photo)

Posted in Conservation and preservation, Digital preservation, Electronic records, Records management, Retention and disposal

The challenge of identifying born-digital records

A recent ‘functional and efficiency’ review into the National Archives of Australia (also known as the ‘Tune Review’, published on 30 January 2021) noted the ‘rapid and ever-evolving challenges of the digital world’.

It stated that ‘the definition of a ‘record’ needs to reflect current international standards, be more directly applied to digital technologies, and more clearly provide for direct capture of records that are susceptible to deletion, such as emails, texts or online messages’.

The review also highlighted the difficulties associated with ingesting digital records ‘via manual intensive activities (due to lack of interoperable systems)’ and proposed a new model based on the ‘continuous automated appraisal of [Agency] digital records that would require a combination of artificial intelligence and skilled archivists’.

The review underlined the challenges of identifying and managing born-digital records, and the need for better solutions.

This post explores the challenges of accurately identifying born-digital records in order to manage them.

Identifying and protecting records

Records usually provide evidence of something that happened – an action, an activity or process, a decision, or a current state (including a photograph or video record). They may have or be associated with descriptive metadata used to provide context to the records and guide or determine retention.

Like all other types of evidence, their authenticity, integrity and reliability should be protected for as long as they must be kept.

In the paper world, this outcome was achieved by storing physical records (including the printed version of born-digital records) on paper files or in physical storage spaces.

For the past twenty years or so, this outcome was achieved for (some) digital records by (mostly manually) copying them from a network drive or email system (or via a connector) to a dedicated electronic records management (ERM) system and then ‘locking’ them in that system to prevent unauthorised change or deletion. Most ERM systems consisted of a database for the metadata and an associated network drive file store for the objects.

The main problem with this centralised storage model – however good it might be at protecting copies of records stored in it – was that the original versions, along with all the other records that were not identified or could not be copied to the ERMS, remained where they were created or captured.

And the records stored ‘in’ the ERMS were actually stored on a network file share on a server that was (a) accessible to IT, and (b) almost always backed up. So, yet more copies existed.

The challenge of born-digital records

There are several key challenges with born-digital records:

  • Consistently and accurately identifying (or ‘declaring’) all records in all formats created or captured in all locations. For too long, the focus has primarily been on emails and anything that can be saved to a network drive, with the onus of identifying a record placed on end-users.
  • Ensuring their authenticity, reliability and integrity over time. For records stored in the ERMS, this has usually involved locking them from edit, including through the ‘declaration’ process, or preventing deletion. But in almost all cases, the original version (in email, on the network drives), could continue to be modified. Other records that were not identified or stored in an ERMS may be deleted.
  • Ensuring that born-digital records will remain accessible for as long as they are required.

It is not possible to consistently and accurately identify, manually or even automatically, every born-digital record that an organisation creates or captures, let alone ensure its authenticity, reliability, integrity and accessibility over time. Only a small percentage of born-digital records are copied to an ERMS.

Records remain hidden in personal mailboxes, personal drives and third-party (often unauthorised) systems. Records may exist in multiple forms and formats, sometimes created or stored in ‘private’ systems or on social media platforms. They may take the form of text or instant messages or social networking posts and threads. They may be drawings, images, voice or video recordings.

Even if a record is identified, it is not always possible to save it to an ERMS. Text or instant messages on mobile devices are a case in point that has been a problem for at least two decades. More recent examples include chat messages, reactions (emojis, comments), and recordings of online meetings.

And even if a high percentage of born-digital records could be stored in the ERMS, the original versions will almost always remain where they were created or captured.

A different approach is needed.

Triaging records?

One approach to the problem would be to accept that not all records have equal value. That is, not all records need to be managed the same way.

To some degree, this way of thinking is already reflected in the class structure of records retention schedules and the attention paid to each class:

  • Records that have permanent or archival value and need to be transferred to archival institutions.
  • Specific types of records that must be created or kept by the organisation for minimum periods (sometimes quite long, but not ‘forever’), for legal, compliance or auditing purposes.
  • Records that are not subject to legal or compliance requirements but which the organisation decides to keep for a minimum period of time.
  • Everything else.

Triaging records means that they can be managed as required at each level, but nothing is missed. It requires a risk management approach.

For records of permanent value, or records subject to legal or compliance requirements, it means ensuring that these records receive the most attention and that every effort is made to ensure they can be identified (declared) and managed accordingly. This would include ensuring that it is possible to identify and capture these records in the systems used to create or capture them – for example, key emails.

A similar approach would be taken to records that need to be kept for legal, compliance or auditing purposes but with an understanding that some of these records (e.g., emails) may remain in the original system where they were created or captured. Technological solutions may be used to identify or tag these records. The destruction of these records should be subject to some form of review and a record kept of the approval and what was destroyed.

All other records would remain stored wherever they were created or captured and be subject to minimum retention periods, after which they can be destroyed without review – but with a record kept of the basic metadata of each record (including its original storage location).
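The triage tiers above can be sketched as a set of handling rules. The tier names and rule fields paraphrase the text; the structure itself, and the review rule assumed for the ‘business-decided’ tier, are illustrative assumptions:

```python
# A sketch of triage: each tier of records gets its own handling rules.
TRIAGE_RULES = {
    "archival":         {"keep": "permanent; transfer to archives",  "review_destruction": True},
    "legal_compliance": {"keep": "minimum period (may be long)",     "review_destruction": True},
    "business_decided": {"keep": "minimum period",                   "review_destruction": False},  # assumption
    "everything_else":  {"keep": "minimum period, in place; log basic metadata",
                         "review_destruction": False},
}

def destruction_needs_review(tier):
    """Does destroying records in this tier require review and a kept record of approval?"""
    return TRIAGE_RULES[tier]["review_destruction"]

print(destruction_needs_review("legal_compliance"))  # True
print(destruction_needs_review("everything_else"))   # False
```

The value of writing the rules down like this is that ‘nothing is missed’: every record falls into exactly one tier, even if most tiers receive very little attention.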

Protecting – or proving – the authenticity, integrity and reliability of records

The assumption behind the protection of records is that they should not be changed or deleted.

The reality, with digital records, is that they may change at any time through new threads, new revisions, new chats, or even through photoshopping.

A more realistic approach may be to use information about what was changed, by whom, and when – not to protect the record but to provide an evidentiary trail to prove what it is or was. The ‘smoking gun’ evidence for most born-digital records is the metadata that is recorded when it was captured or modified, not (necessarily) the added descriptive metadata.

For example:

  • Someone may author a document (metadata records each revision, and each revision can be viewed).
  • The document may be approved electronically (recorded in metadata).
  • Someone then modifies the approved version.
  • All of the above is recorded in the ‘modified’, ‘modified by’ and approval metadata.
  • The system should (or may) also record who viewed the record, and when.
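The sequence above can be sketched as an append-only audit trail: each event captures who did what and when, and the trail itself (rather than a locked copy) evidences what the record is or was. Record IDs, actor names and helper functions are all illustrative:

```python
# Sketch of an append-only audit trail as the evidentiary record.
from datetime import datetime, timezone

audit_trail = []

def log_event(record_id, action, actor):
    audit_trail.append({
        "record": record_id,
        "action": action,        # e.g. 'created', 'approved', 'modified', 'viewed'
        "actor": actor,
        "at": datetime.now(timezone.utc).isoformat(),
    })

log_event("DOC-001", "created", "alice")
log_event("DOC-001", "approved", "bob")
log_event("DOC-001", "modified", "carol")   # a change AFTER approval is visible

def events_after_approval(record_id):
    """Return any modifications made after the record was approved."""
    seen_approval, changes = False, []
    for e in audit_trail:
        if e["record"] != record_id:
            continue
        if e["action"] == "approved":
            seen_approval = True
        elif seen_approval and e["action"] == "modified":
            changes.append(e)
    return changes

print(len(events_after_approval("DOC-001")))  # 1 post-approval modification
```

The ‘smoking gun’ here is not the document content but the fact that the trail shows a modification after approval, by whom, and when.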

EXIF metadata stored on images provides a similar form of evidence (and may even include GPS information).

Which record is more likely to be accepted as evidence:

  • A record stored in an EDRMS, versions or revisions of which may exist in multiple other places, including on network file shares, email systems and even backup tapes; or
  • A record stored in a system that shows the full set of metadata about access and changes, or the most recent thread of an email discussion?

Conclusions

At the end of the day, it should be possible to confirm the authenticity, reliability and integrity of records based on information/metadata that forms part of the born-digital record: who created it and when, the context in which it was created, and its relationship with other records.

Perhaps, instead of focussing on trying to identify and capture all born-digital objects that might be records and ‘protecting’ a version of that record, it may be more practical and easier to leave most records where they were created or captured (and retained by retention policies) and use change or revision metadata to provide evidence of authenticity.

This may, in the end, be a much easier way to protect the authenticity of records than having to rely on manual identification or declaration.

Posted in Records management, Electronic records, Retention and disposal, Information Management, Classification, Microsoft 365, Artificial Intelligence

Can Microsoft technology classify records better than a human?

In late 2012, IDM magazine published an article I co-authored with Umi Asma Mokhtar in Malaysia titled ‘Can technology classify records better than a human?’

The article drew on research into recent advances in technology to assist in legal discovery, known as ‘computer-assisted coding’ or ‘predictive coding’, including articles by Grossman and Cormack and by Peck.

Grossman and Cormack’s article noted that ‘a technology-assisted review process involves the interplay of humans and computers to identify the documents in a collection that are responsive to a production request, or to identify those documents that should be withheld on the basis of privilege‘. By contrast, an ‘exhaustive manual review’ required ‘one or more humans to examine each and every document in the collection, and to code them as responsive (or privileged) or not‘.

The article noted, somewhat gently, that ‘relevant literature suggests that manual review is far from perfect’.

Peck’s article contained similar conclusions. He also noted how computer-based coding was based on an initial ‘seed set’ of documents identified by a human; the computer then identified the properties of those documents and used that to code other similar documents. ‘As the senior reviewer continues to code more sample documents, the computer predicts the reviewer’s coding‘ (hence predictive coding).

By 2011, this new technology was challenging old methods of manual review and classification. Despite some scepticism and slow uptake (for example, see this 2015 IDM article ‘Predictive Coding – What happened to the next big thing?‘), by 2021, it had become an accepted option to support discovery, sometimes involving offshore processing for high volumes of content.

Meanwhile, in an almost unnoticed part of the technology woods, Microsoft acquired Equivio in January 2015. In its press release ‘Microsoft acquires Equivio, provider of machine learning-powered compliance solutions‘, Microsoft stated that the product:

‘… applies machine learning … enabling users to explore large, unstructured sets of data and quickly find what is relevant. It uses advanced text analytics to perform multi-dimensional analyses of data collections, intelligently sorting documents into themes, grouping near-duplicates, isolating unique data, and helping users quickly identify the documents they need. As part of this process, users train the system to identify documents relevant to a particular subject, such as a legal case or investigation. This iterative process is more accurate and cost effective than keyword searches and manual review of vast quantities of documents.’ 

It added that the product would be deployed in Office 365.

Classifying records

The concept of classification for records was defined in paragraph 7.3 of part 1 of the Australian Standard (AS) 4390, released in 1996. The standard defined classification as:

‘… the process of devising and applying schemes based on the business activities generating records, whereby they are categorised in systematic and consistent ways to facilitate their capture, retrieval, maintenance and disposal. Classification includes the determination of naming conventions, user permissions and security restrictions on records’.

The definition provided a number of examples of how the classification of business activities could act as a ‘powerful tool to assist in many of the processes involved in the management of records, resulting from those activities’. This included ‘determining appropriate retention periods for records’.

The only problem with the concept was the assumption that all records could be classified in this way, in a singular recordkeeping system. Unless they were copied to that system, emails largely escaped classification.

Fast forward to 2020

Managing all digital records according to recordkeeping standards has always been a problem. Electronic records management (ERM) systems managed the records that were copied into them, but a much higher percentage remained outside their control – in email systems, network file shares and, increasingly over the past 10 years, created and captured on a host of alternative systems including third-party and social media platforms.

By the end of 2019, Microsoft had built a comprehensive single ecosystem to create, capture and manage digital content, including most of the records that would have been previously consigned to an ERMS. And then COVID appeared and working from home became common. All of a sudden (almost), it had to be possible to work online. Online meeting and collaboration systems such as Microsoft Teams took off, usually in parallel with email. Anything that required a VPN to access became a problem.

2021 – Automated classification for records (maybe)

The Microsoft 365 ecosystem generated a huge volume of new content scattered across four main workloads – Exchange/Outlook, SharePoint, OneDrive and Teams. A few other systems such as Yammer also added to the mix.

Most of this information was not subject to any form of classification in the recordkeeping sense. The Microsoft 365 platform included the ability to apply retention policies to content but there was a disconnect between classification and retention.

Microsoft announced Project Cortex at Ignite in 2019. According to the announcement, Project Cortex:

  • Uses advanced AI to deliver insights and expertise in the apps that are used every day, to harness collective knowledge and to empower people and teams to learn, upskill and innovate faster.
  • Uses AI to reason over content across teams and systems, recognizing content types, extracting important information, and automatically organizing content into shared topics like projects, products, processes and customers.
  • Creates a knowledge network based on relationships among topics, content, and people.

Project Cortex drew on technological capabilities present in Azure’s Cognitive Services and the Microsoft Graph. It is not known to what extent the Equivio product, acquired in 2015, was integrated with these solutions but, from all the available details, it appears the technology is at least connected in one way or another.

During Ignite 2020, Microsoft announced SharePoint Syntex and trainable classifiers, either of which could be deployed to classify information and apply retention rules.

Trainable classifiers

Trainable classifiers were made generally available (GA) in January 2021.

Trainable classifiers sound very similar to the predictive coding capability that appeared from 2011. However, they:

  • Use the power of Machine Learning (ML) to identify categories of information. This is achieved by creating an initial ‘seed’ of data in a SharePoint library, creating a new trainable classifier and pointing it at the seed, then reviewing the outcomes. More content is added to ensure accuracy.
  • Can be used to identify similar content in Exchange mailboxes, SharePoint sites, OneDrive for Business accounts, and Microsoft 365 Groups and apply a pre-defined retention label to that content.

In theory, this means it might be possible to identify a set of similar records – for example, financial documents – and apply the same retention label to them. The Content Explorer in the Compliance admin portal will list the records that are subject to that label.

SharePoint Syntex

SharePoint Syntex was announced at Ignite in September 2020 and made generally available in early 2021.

The original version of Syntex (as part of Project Cortex) was targeted at the ability to extract metadata from forms, a capability that has existed with various other scanning/OCR products for at least a decade. The capability that was released in early 2021 included the base metadata extraction capability as well as a broader capability to classify content and apply a retention label.

The two Syntex capabilities, described in a YouTube video from Microsoft titled ‘Step-by-Step: How to Build a Document Understanding Model using Project Cortex‘, are:

  • Classification. This capability involves the following steps: (a) creation of a (SharePoint site) Content Center; (b) creation of a Document Understanding Model (DUM) for each ‘type’ of record – the DUM can create a new content type or point to an existing one, and can also link to the retention label to be applied; (c) creation of an initial seed of records (positives and a couple of negatives); (d) creation of Explanations that help the model find records by phrase, proximity, or pattern matching (e.g., dates); (e) training; (f) applying the model to SharePoint sites or libraries. The outcome is that matching records in the target location are assigned to the Content Type (replacing any previous one) and tagged with a retention label (also replacing any previous one).
  • Extraction. This capability follows similar steps, except that the Explanations identify what metadata is to be extracted, from where (again based on phrase, proximity or pattern), and to which metadata column. The outcome is that matching records include the extracted metadata in the library columns (in addition to the Content Type and retention label).

As with trainable classifiers, Syntex uses Machine Learning to classify records, but Syntex also has the ability to extract metadata. Syntex can only classify or extract data from SharePoint libraries.

Trainable classifiers or Syntex?

Both options require the organisation to create an initial seed of content and to use Machine Learning to develop an understanding of the content, in order to classify it.

The models are similar; the primary difference is that trainable classifiers can work on content stored in email, SharePoint and OneDrive, whereas Syntex is currently restricted to SharePoint.

Predictive coding

On 18 March 2021, Microsoft announced the pending (April 2021) preview release of an enhanced predictive coding module for advanced eDiscovery in Microsoft 365.

The announcement, pointing to this roadmap item, noted that eDiscovery managers would be able to create and train relevance models within Advanced eDiscovery using as few as 50 documents, to prioritize review.

So, can Microsoft technology classify records better than humans?

In their 1999 book ‘Sorting Things Out: Classification and its Consequences‘ (MIT Press), Geoffrey Bowker and Susan Leigh Star noted that ‘to classify is human’ and that classification was ‘the sleeping beauty of information science’ and ‘the scaffolding of information infrastructures’.

But they also noted how ‘each standard and category valorizes some point or view and silences another. Standards and classifications (can) produce advantage or suffering’ (quote from review in link above).

Technology-based classification is, in theory, impartial: it categorises what it finds through machine learning and algorithms. But technology-based classification requires human review of the initial and subsequent seeds. Accordingly, such classification has the potential to be skewed by the reviewer’s biases or predilections, such as the selection of one set of preferred or ‘matching’ records over another.

Ultimately, a ‘match’ is based on a ‘relevancy’ scoring algorithm. Perhaps the technology can classify better than humans, but whether the classification is accurate may still depend on humans making accurate, consistent and impartial decisions.

Either way, the manual classification of records is likely to go the same way as the manual review of legal documents for discovery.

Image source: Providence Public Library Flickr

Posted in Classification, Compliance, Information Management, Records management, Retention and disposal

Classifying records in Microsoft 365

The classification of records is a fundamental recordkeeping activity. It is defined in the international standard ISO 15489-1:2016 (Information and Documentation – Records Management) as the ‘systematic identification and/or arrangement of business activities and/or records into categories according to logically structured conventions, methods and procedural rules’. (Terms and Definitions, 3.4)

The purpose of classification is defined by State Records NSW as follows: ‘In records management, records are classified according to the business functions and activities which generate the records. This functional approach to classification means that classification can be used for a range of records management purposes, including appraisal and disposal, determining handling, storage and security requirements, and setting user permissions, as well as providing a basis for titling and indexing‘. (Records Classification, accessed 13 January 2021.)

The ever-increasing volume of digital records, the many different ways to create them, and the multitude of record types and storage locations have made it more difficult to classify records manually in an accurate and consistent way, including through the creation of pre-defined ‘containers’ or aggregations based on classification terms. Despite this, the requirement to link the classification of records with their retention and disposal remains.

For over three decades, Microsoft’s applications and technology platforms have been used to create, capture, store and manage records. Some of these records (in the earlier period) were printed and placed on paper files, or stored (from around 2000) in dedicated electronic document and records management (EDRM) systems.

But the volume and type of digital content, including with new types of records (e.g., chat messages) and storage locations, continues to grow. In response, Microsoft invested heavily in addressing the need to classify records ‘at scale’.

This post looks at various ways to classify records, for retention and disposition purposes, in Microsoft 365.

The old-school, manual method – metadata

Most of the records in Microsoft 365 will be created, captured or stored in one of the four primary workloads: Exchange mailboxes, SharePoint sites/libraries, MS Teams chats (a ‘compliance copy’ of which is stored in Exchange mailboxes), and OneDrive for Business libraries. Some records may also exist in Yammer or other web page content (e.g., intranet).

Most SharePoint sites, as well as Teams (which each have a SharePoint site), will be created according to some form of business need to create, capture, store and share records; that is, the site or team purpose may be based on a business function or activity. This grouping can itself be used as a form of classification – by SharePoint site (e.g., function) or document library (e.g., activity).

Records may be stored in multiple document libraries, or within a folder structure of a single library.

A number of methods (some of which rely on others) can be used to add classification (and other) metadata to records stored in SharePoint document libraries:

1 – Creating the classification taxonomy in the Managed Metadata Service (MMS)/Term Store via the SharePoint Admin portal – Content Services – Term store, and then applying these terms in content types that are then deployed in SharePoint sites.

An example of Business Classification Terms in the MMS

2 – Creating global content types from the SharePoint Admin portal, in the Content Services – Content type gallery area (see ‘Finance Document’ example below) and then deploying these in specific SharePoint sites where site columns that contain classification terms will be added.

3 – Creating site columns that contain classification terms, including from the MMS, and adding these to global or site content types or document libraries where they can be applied to records.

In this example, the site column ‘BCS Function’ maps to the MMS BCS terms

4 – Creating site content types and adding site columns (including MMS-based columns), then adding these content types to document libraries.

In this example, the MMS-based column now appears in the library columns

But most of the above is somewhat complicated and cumbersome, and would normally only be used for specific types of records, to which it is applied manually.

The simplest way to apply BCS/File Plan terms at the document (or document set) level is to (a) store records relating to the same BCS function or activity in the same library, and (b) create site or library columns with default values and add these to the library. This way, the default terms are applied automatically as soon as a new record is uploaded, including when shared/inherited from the site columns added to a document set that ‘contains’ a document content type.

Example metadata columns shared from the document set content type
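The default-value approach described above can also be scripted. The following is a sketch only, using the PnP.PowerShell module against a connected tenant; the site URL, library name, column name and choice values are hypothetical examples, and it cannot be run without a live SharePoint Online site.

```powershell
# Sketch only: assumes the PnP.PowerShell module is installed and that
# the account used has permission to modify the (example) target site.
Connect-PnPOnline -Url "https://tenantname.sharepoint.com/sites/Finance" -Interactive

# Add a choice column for the (hypothetical) BCS activity directly to the library
Add-PnPField -List "Documents" `
    -DisplayName "BCS Activity" `
    -InternalName "BCSActivity" `
    -Type Choice `
    -Choices "Accounts Payable","Accounts Receivable" `
    -AddToDefaultView

# Set a library-wide default value so new uploads are tagged automatically
Set-PnPDefaultColumnValues -List "Documents" -Field "BCSActivity" -Value "Accounts Payable"
```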

However, keep in mind that SharePoint is just one of the workloads where records are stored.

Records in the form of emails, chats and ‘personal’ content (as well as Yammer messages and web pages) are created in and stored across the other workloads. Some attempt may be made to copy these records (especially emails) to SharePoint sites, but this becomes complicated or impossible with content such as Teams chat messages.

In most cases (and according to Microsoft’s own recommendations), it is better to leave the records where they were created or captured (‘in place’), and apply centralised compliance controls (classification, retention labels and policies) to this content.

Leaving the records in place in this way does not exclude the ability to create SharePoint sites (and document libraries in those sites) that map to classification terms, and/or to use the site column approach described above, but these are more likely to be exceptions.

In fact, some form of logical structure is almost certain anyway as most end-users will probably want to access and manage information in their own specific work context (the Team/SharePoint site).

Trainable classifiers

Because not all records are stored in SharePoint, and the volume of digital content stored across the Microsoft 365 platform keeps increasing, Microsoft needed to find a way to classify records ‘at scale’.

The solution was to use machine learning (ML) via trainable classifiers, accessed in the ‘Data Classification’ section of the Microsoft 365 Compliance portal. This capability is only available with E5 licences.

The trainable classifiers solution was released to General Availability on 12 January 2021 (‘Announcing GA of machine learning trainable classifiers for your compliance needs‘, accessed 13 January 2021).

See the Microsoft web page ‘Learn about trainable classifiers‘ to learn more about this option. To quote from that page:

This classification method is particularly well suited to content that isn’t easily identified by either the manual or automated pattern matching methods. This method of classification is more about training a classifier to identify an item based on what the item is, not by elements that are in the item (pattern matching).’

Organisations (including E3 licence holders) may make use of five pre-defined trainable classifiers: Resumes, Source Code, Targeted Harassment, Profanity, and Threat. (A sixth classifier, ‘Offensive language’, is to be deprecated.) Custom classifiers require an E5 licence.

Custom classifiers require ‘significantly more work’ than the pre-existing classifiers, and the process is quite involved (see the process flow diagram on the ‘Learn about’ page linked above), but in summary it comprises the following steps:

  • Creating the custom classifier.
  • Creating a ‘seed’ of 50 to 500 manually selected example records, including a range of emails, in a dedicated SharePoint Online site.
  • Testing the classifier with the seeded documents.
  • Re-training with additional content – both positive and negative matches.

Once the classifier is published, it can be used to identify and classify related content across SharePoint Online, Exchange, and OneDrive (but not Teams).

The page ‘Default crawled file name extensions and parsed file types‘ provides details of all the record types that can be classified in this way. Note it is not clear if trainable classifiers can crawl the compliance copy of Teams chat messages stored in hidden folders in Exchange mailboxes.

Label-based retention policies can then be automatically applied to content that has been identified through the trainable classifier.

However, note that the classifier does not ‘group’, aggregate or ‘present’ (list) the records for review (except broadly via the Content Explorer); the label applied to the records can, instead, be searched via the ‘Content Search’ option in the Compliance portal. This is a much better position than having no idea how many records of a particular classification exist in Exchange mailboxes, OneDrive accounts, or general SharePoint sites, but it requires some degree of ‘letting go’ of the ability to view and browse content classified this way, and trusting the system.

The main limitation of trainable classifiers is that they require an E5 or E5 Compliance licence.

The other limitation is the management of the disposition of records that have been identified by trainable classifiers and had a label-based retention policy applied. There are significant shortcomings in the current ‘Disposition Review’ process, specifically the lack of adequate metadata both for reviewing records due for disposal and for recording the details of what has been destroyed.

SharePoint Syntex

Another (but limited) option might be to use SharePoint Syntex (see ‘Introduction to Microsoft SharePoint Syntex‘ for an overview), although its range is limited to SharePoint and – it seems – only records that have a relatively consistent structure and format.

SharePoint Syntex evolved out of Project Cortex’s ability to extract and capture metadata from records. It can also be used through its ‘Document Understanding Model‘ (DUM) to provide a way to classify records stored in SharePoint Online (only). It makes use of a ‘seeding’ model that is similar to trainable classifiers (and may be based on the same underlying AI engine).

Broadly speaking, the DUM works by loading a small ‘seed’ set of (relatively consistently formatted) example files into a dedicated Content Center (or Centers). This is very similar to the process of using trainable classifiers, except that trainable classifiers do not require a ‘content center’ SharePoint site to be created.

  • The example files are ‘trained’ by being ‘classified’ through the document understanding process based on a set of ‘explanation types‘ that are used to help find the relevant content. The three explanation types are: (a) phrase list (a list of words, phrases, numbers, or other characters used in the document or information that you are extracting); (b) pattern list (patterns of numbers, letters, or other characters); and (c) proximity (describes how close other explanations are to each other).
  • The document understanding model (DUM) produced through the explanation types is associated (and deployed) with a new or existing content type. 
  • Once applied to a SharePoint site library, the DUM/content type provides the basis for identifying and tagging (with metadata) other similar records in the location (e.g., the library) where the DUM has been deployed. 
  • If the documents have consistent content such as invoices, certain data from those documents can be extracted as metadata. 

Retention labels may be applied to records classified using SharePoint Syntex, as described on this page ‘Apply a retention label to a document understanding model‘.

Summing up – which one should be used?

The answer to this question will depend on your compliance requirements.

Smaller organisations may be able to set up SharePoint sites and document libraries with site columns/metadata that map to their business classification scheme or file plan, and copy emails to those libraries. There may be little need to use AI-based classification methods.

In large and more complex organisations (with E5 licences), especially those with a lot of content stored across Exchange mailboxes and SharePoint sites (including Teams-based sites), there will almost certainly be a need for some form of AI-based classification in addition to classification-mapped SharePoint sites (and Teams).

Organisations with E3 licences might use the manual methods described above for specific types of records, and consider acquiring additional E5 Compliance licences to make use of trainable classifiers or SharePoint Syntex for other records.

Posted in Compliance, Electronic records, Governance, Information Management, Legal, Microsoft 365, Microsoft Teams, Products and applications, Records management, Retention and disposal, SharePoint Online, Yammer

All the ways SharePoint sites can be created

SharePoint is a core foundational element in Microsoft 365. It is primarily used for the storage of digital objects (including pages) in document libraries and rows and columns of data in lists. It is ubiquitous and almost impossible to remove from a Microsoft 365 licence because it ‘powers’ so many different things.

While the idea that anyone can easily create a SharePoint site seems good in some ways, from a recordkeeping point of view it starts to look like network file shares all over again.

Microsoft’s response to the default ‘free for all’ ability to create SharePoint sites is to use the so-called ‘records management’ functionality (via the more expensive E5 licence) to auto-classify content and auto-apply retention labels. The problem is that those (more expensive) options provide limited functionality, including inadequate metadata to make decisions on disposal, and similarly inadequate metadata (for records subject to disposition review labels only) as ‘proof of disposition’.

So, records managers are more often than not left with a network file share-like sprawl of uncontrolled content.

Unfortunately, creating a new SharePoint site is fairly easy, almost as easy as creating a folder on a … network file share.

The following is a list of the main ways a person can create a SharePoint site. Have I missed any?

1. Via a PowerShell script

As described in the Microsoft docs web page ‘Create SharePoint Online sites and add users with PowerShell’. The script reads a CSV file (‘SiteCollections.csv’) and looks something like the following (see the link for more details):

Import-Csv C:\users\MyAlias\desktop\SiteCollections.csv | ForEach-Object {
    # Create one site per row of the CSV file
    New-SPOSite -Owner $_.Owner -StorageQuota $_.StorageQuota -Url $_.Url `
        -NoWait -ResourceQuota $_.ResourceQuota -Template $_.Template `
        -TimeZoneID $_.TimeZoneID -Title $_.Name
}

This option allows an administrator to provision many SharePoint sites in bulk.
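For a single site, a minimal call might look like the following sketch. The URL, owner and title are example values, and a connection to the tenant is assumed to have been established first with Connect-SPOService; it cannot be run without a live tenant.

```powershell
# Sketch only: example values; assumes Connect-SPOService has been run
# against the tenant admin URL (e.g., https://tenantname-admin.sharepoint.com).
# STS#3 is the team site (no Microsoft 365 Group) template.
New-SPOSite -Url "https://tenantname.sharepoint.com/sites/Finance" `
    -Owner "admin@tenantname.onmicrosoft.com" `
    -StorageQuota 1024 `
    -Template "STS#3" `
    -Title "Finance"
```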

2. Via the SharePoint Admin portal (+ Create)

This option allows the creation of three main types of sites: modern team sites (Team site), communication sites, and non-Microsoft 365 Group-linked sites (Other options).

3. By creating a Microsoft 365 Group

Microsoft 365 Groups are created in the Microsoft 365 Admin portal, in the Groups section, Add a group > Microsoft 365. This is also where Security Groups and Distribution Lists (both collectively known as ‘AD Groups’) are created.


Every new Microsoft 365 Group creates both a SharePoint site and an Exchange mailbox that is visible in the Outlook application (under ‘Groups’) of everyone who is an Owner or a Member of the Group.

The new Group creation process allows the Group email address to be created (it really should be the same as the Group name), the Group to be made public or private, and a new Team to be created.

Because the Microsoft 365 Group name becomes the SharePoint site (URL) name, it is a good idea to consider naming conventions.

4. By an end-user creating a new Team in MS Teams

Unless the creation of Microsoft 365 Groups is restricted, an end-user can create a new SharePoint site (possibly without realising it) by creating a new Team in MS Teams. Nothing in the creation process indicates that (a) they will create a SharePoint site or a Microsoft 365 Group, or (b) they will be the Owner of the Team, Group and SharePoint site, and therefore have responsibility for managing the Team/Group membership.

Every new Team creates a Microsoft 365 Group, which always has a SharePoint site and an Exchange Online mailbox that is not visible in Outlook.

5. By creating a Private Channel in MS Teams

If the option is not disabled in the MS Teams admin portal under Teams > Teams Policies, end users will be able to create a private channel within a Team. Every private channel creates a new SharePoint site with a name that is an extension of the ‘parent’ Team site name.

For example, if the parent site name is ‘Finance’ and the private channel is named ‘Invoice chat’, the new SharePoint site will be ‘Finance-Invoicechat’. This new site is not connected with the ‘parent’ site and is not visible in the list of Active Sites in the SharePoint admin portal (so the SharePoint Admin won’t know it exists). It is only visible in the list of Sites under the Resources section of the Microsoft 365 Admin portal.

A private channel does not create a new Microsoft 365 Group. A ‘compliance copy’ of the chats in the private channel is stored in the Exchange Online mailboxes of the individual participants in the chat.

6. By the Teams Admin creating a new Team

The MS Teams admin area includes the ability for the Teams admin to go to Manage Teams, click +Add and create a new Team.

As with the end-user creation process, a new Team creates a Microsoft 365 Group that has an Exchange mailbox and a SharePoint site.

7. From the end-user SharePoint portal (+ Create site)

If not disabled, end users can create a new SharePoint site by clicking on ‘+ Create site’ from the SharePoint portal – https://tenantname.sharepoint.com/_layouts/15/sharepoint.aspx

This process creates a Microsoft 365 Group that has a SharePoint site and an Exchange mailbox. It also creates a new Team with the same name.

It is recommended that the ability for end-users to create new sites this way is disabled, at least initially. This is done from the SharePoint admin portal under Settings > Site Creation.

8. From OneDrive for Business as a ‘shared library’

This option is relatively new. When end-users open their OneDrive for Business, they will see ‘Create shared library’ under the heading ‘Shared libraries’, directly below a list of sites they have access to (these are actually SharePoint sites; when you click on a site name, it confusingly displays the document libraries as … folders).

9. When a new Plan is created in Planner

If end-users open the Planner app, they will see ‘New Plan’ on the top left. This opens a dialogue to create a New Plan or add one to an existing Microsoft 365 Group. The process of creating a new Plan creates a new Microsoft 365 Group with a SharePoint site.

10. When a new Yammer community is created

End users with access to Yammer can click on ‘Create a Community’ from Yammer.

To quote from the Microsoft 365 documentation ‘Join and create a community in Yammer‘: ‘When a new Office 365 connected Yammer community is created, it gets a new SharePoint site, SharePoint document library, OneNote notebook, plan in Microsoft Planner, and shows up in the Global Address Book.’

Why have Microsoft allowed this?

It’s a smarter way to manage access.

Some years back, Microsoft moved away from the idea of Security Groups that give access to individual IT resources, towards individual Microsoft 365 Groups that provide access to multiple resources across Microsoft 365. One Microsoft 365 Group controls access to a SharePoint site, an Exchange mailbox, a Team, a Plan, and a Yammer Community. Security Groups don’t have that sort of functionality.

The trade-off is that you get all of these options with a Microsoft 365 Group, whether you like it or not.

But, some of the decisions don’t seem to make sense.

  • Why allow end-users to create a private channel in Teams when they can simply use the 1:1 chat area?
  • Why allow the creation of a so-called ‘Shared Library’ from OneDrive, limited to and controlled by the person who created it, when a SharePoint site provides that functionality?
  • Why does an end-user need an Exchange mailbox (for the Microsoft 365 Group) when they create a new site from the ‘Create site’ option in SharePoint?
  • And why does a new Plan create a SharePoint site? For what purpose?

Perhaps there is a reason for it. It’s just not clear.

Posted in Archiving third party content, Connectors, Conservation and preservation, Electronic records, Information Management, Microsoft 365, Microsoft Graph, Records management, Retention and disposal, Solutions

Using Microsoft 365 connectors to support records management

Microsoft 365 includes a range of connectors, in three categories, that can be used to support the management of records created by other applications. The three categories are:

  • Search connectors, that find content created by and/or stored in a range of internal and external applications, including social media.
  • Archive connectors, that import and archive content created by third-party applications.
  • API connectors, that support business processes such as capturing email attachments.

This post describes how these connectors can assist with the management of records.

The recordkeeping dilemma

Finding, capturing and managing records across an ever increasing volume of digital content and content types has been one of the biggest challenges for recordkeeping since the early 2000s.

The primary method of managing digital records for most of the past 20 years has been to require digital records (mostly emails and other digital content created on file shares) to be saved to or stored in an electronic document and records management system (EDRMS). The EDRMS was established as ‘the’ recordkeeping system for the organisation.

EDRM systems were also used to manage paper records which, over the past 20 years, have mostly contained the printed version of born-digital records that remain stored in the systems where they were created or captured.

There were two fundamental flaws in the EDRMS model. The first was an expectation that end-users would be willing to save digital records to the EDRMS. The second was that the original digital record remained in place where it was created or captured, usually ignored but often the source of rich pickings for eDiscovery.

The introduction of web-based email and document storage systems, smart phones, social media and personal messaging applications from around 2005 (in addition to already existing SMS/text messaging) further challenged the concept of a centralised recordkeeping system. In many cases, the only option to save these records was to print and scan, screenshot and save the image, or save to PDF, none of which were particularly effective in capturing the full set of records.

The hasty introduction from early 2020 of ‘work from home’ applications such as Zoom and Microsoft Teams has been a further blow to these methods.

In place records management

To the chagrin of records managers around the world, Microsoft never made it easy to save an email from Outlook to another system. Emails stubbornly remained stored in Exchange mailboxes with no sign of integration with file shares.

And for good reason – they have a different purpose and architecture to support that purpose. It would be similar to asking when it would be possible to create and send an email in Word.

The introduction of Office 365 (later Microsoft 365) from the mid 2010s changed the paradigm from a centralised model (where records were all copied to a central location and the originals left where they were created or captured) to a de-centralised or ‘in place’ model (where records are mostly left where they were created or captured).

The decentralised model does not exclude the ability to store copies of some records (e.g., emails) in other applications (e.g., SharePoint document libraries), but these are exceptions to the general rule.

It also does not exclude the ability to import or migrate content from third-party applications where necessary for recordkeeping purposes.

Microsoft 365 connectors

Microsoft 365 includes a wide range of options to connect with both internal and external systems. Many of these connectors simplify business processes and support integration models.

Connectors may also be used to support recordkeeping requirements, in three broad categories.

The three connectors

Archive connectors

Archive connectors allow organisations to import and archive data from third-party systems such as social media, instant messaging and document collaboration* platforms. Most of this data will be stored in Exchange mailboxes, where it can be subject to retention policies, eDiscovery and legal holds.

(*This option is still limited via connectors, but also see below under Search).

The social media and instant messaging data that can be archived in this way currently includes Facebook (business pages), LinkedIn company page data, Twitter, Webex Teams, Webpages, WhatsApp, Workplace from Facebook, Zoom Meetings. For the full listing, and a detailed description of what is required to connect each service, see this Microsoft description ‘Archive third-party data‘.

An important thing to keep in mind is that the data will be archived to an Exchange mailbox; this will require an account to be created for the purpose. Any data archived to the mailbox will contribute to the overall storage quotas.

Search connectors

Search connectors (also known as Microsoft Graph connectors) index third-party data so that it appears in Microsoft search results, including via Bing (the ‘Work’ tab), from office.com, and via SharePoint Online.

Most ECM/EDRM systems are listed, which means that organisations that continue to use those systems can allow end-users to find content from a single search point, only surfacing content that users are permitted to see.

The following is an example of what a Bing search looks like in the ‘Work’ tab (when enabled).

Example Bing search showing the Work tab

Note: as at 17 November 2020, Microsoft’s page ‘Overview of Microsoft Graph connectors‘ (which includes a very helpful architecture diagram) states that these are ‘currently in preview status available for tenants in Targeted release.’

There are two main types of search connector:

  • Microsoft built: Azure Data Lake Storage Gen2, Azure DevOps, Azure SQL, Enterprise websites, MediaWiki, Microsoft SQL, and ServiceNow.
  • Partner built: includes the following on-premises and online document management/ECM/EDRM connectors – Alfresco, Alfresco Content Services, Box, Confluence, Documentum, Facebook Workplace, File Share (on prem), File System (on prem), Google Drive, IBM Connections, Lotus Notes, iManage, MicroFocus Content Manager (HPE Records Manager, HP TRIM), Objective, OneDrive, Open Text, Oracle, SharePoint (on prem), Slack, Twitter, Xerox DocuShare, Yammer.

See the ‘Microsoft Graph connectors gallery’ web page for the full set of current connectors.
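As an illustration of what indexing third-party content via a Graph connector actually involves, the sketch below builds the JSON body for a single external item, following the shape of the Microsoft Graph connectors ingestion API (an ACL, a set of searchable properties, and the full-text content). The connection ID, item ID, group name and property values are all hypothetical; a real deployment would send this payload with an authenticated HTTP PUT, which is not attempted here.

```python
import json

# Hypothetical connection and item identifiers, for illustration only.
CONNECTION_ID = "contosorecords"
ITEM_ID = "record-1001"

def build_external_item(title: str, url: str, acl_group: str) -> dict:
    """Build the body of a Graph connectors ingestion request.

    The shape follows the externalItem resource: an ACL list,
    a set of searchable properties, and the full-text content.
    """
    return {
        "acl": [
            {"type": "group", "value": acl_group, "accessType": "grant"}
        ],
        "properties": {
            "title": title,
            "url": url,
        },
        "content": {
            "value": "Full text of the record goes here.",
            "type": "text",
        },
    }

item = build_external_item(
    title="Records disposal authority 2020",
    url="https://example.org/records/1001",
    acl_group="records-managers",
)

# The item would be sent with an authenticated PUT to an endpoint of this form:
endpoint = (
    "https://graph.microsoft.com/v1.0/external/connections/"
    f"{CONNECTION_ID}/items/{ITEM_ID}"
)

print(json.dumps(item, indent=2))
```

Because the ACL travels with each item, search results surfaced from the connector respect the source system's permissions – the ‘only surfacing content that users are permitted to see’ behaviour described above.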

A consideration when deploying search connectors is the quality of the data that will be surfaced via searches. Duplicate content is likely to be a problem in identifying the single – or most recent – source of truth of any particular digital record, especially when the organisation has required records to be copied from one system (mailbox/file share) to another (EDRMS).

API Connectors

API connectors provide a way for Microsoft 365 to access and use content, including in third-party applications. To quote from the Microsoft ‘Connectors’ web page:

‘A connector is a proxy or a wrapper around an API that allows the underlying service to talk to Microsoft Power Automate, Microsoft Power Apps, and Azure Logic Apps. It provides a way for users to connect their accounts and leverage a set of pre-built actions and triggers to build their apps and workflows.’

To see the complete list and for more information about each connector, see the Microsoft web page ‘Connector reference overview’.

Each connector provides two things:

  • Actions. These are changes initiated by an end-user.
  • Triggers. There are two types of triggers: Polling and Push. Triggers may notify the app when a specific event occurs, resulting in an action. See the above web page for more details.
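The difference between the two trigger types can be shown with a small sketch – plain Python, not the actual connector framework, and the event source and handler names are hypothetical. A polling trigger repeatedly asks the service ‘anything new since my last check?’, while a push trigger is invoked by the service itself as each event occurs.

```python
from typing import Callable

# A toy event source standing in for a third-party service.
events = ["invoice-created", "invoice-paid", "invoice-archived"]

def poll(cursor: int) -> tuple:
    """Polling trigger: the client asks for events after its cursor."""
    new = events[cursor:]
    return new, len(events)

def register_push(handler: Callable[[str], None]) -> None:
    """Push trigger: the service calls the handler as events occur."""
    for event in events:  # simulates the service emitting events
        handler(event)

# Polling: two rounds of 'anything new since last time?'
seen, cursor = poll(0)       # first poll returns everything so far
more, cursor = poll(cursor)  # nothing new has arrived yet
assert more == []

# Push: the registered handler fires an action per event.
actions = []
register_push(lambda e: actions.append(f"action for {e}"))
```

In records management terms, either style can be used to initiate an action (such as capturing a copy of a record) when a specific event occurs in a connected system.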

API connectors can support records management requirements in different ways (such as triggering an action when a specific event occurs) but they should not be confused with archiving or search connectors.

Summing up

The connectors available in Microsoft 365 support the model of keeping records in place where they were first created or captured. They enable organisations to archive data from third-party cloud applications, search for data in those (and on-premise) applications, and trigger actions based on events.

The use of connectors should be part of an overall strategic plan for managing records across the organisation. This may include a business decision to continue using an ECM/EDRMS in addition to the content created and captured in Microsoft 365. Ideally, however, the content in the ECM/EDRMS should not be a copy of what already exists in Microsoft 365.

Posted in Access controls, Conservation and preservation, Digital preservation, Electronic records, Exchange 2010, Exchange 2013, Exchange Online, Information Management, Records management, Retention and disposal, XML

The enduring problem of emails as records

Ever since emails first appeared as a way to communicate more than 30 years ago, they have been a problem for records management, for two main reasons:

  • Emails (and attachments) are created and captured in a separate (email) system, and are stored in mailboxes that are inaccessible to records managers (a bit like ‘personal’ drives).
  • The only way to manage them in the context of other records was/is to print and file or copy them to a separate recordkeeping system, leaving the originals in place.

Thirty-plus years of email has left a trail of mostly inaccessible digital debris. An unknown volume of records remains locked away in ‘personal’ and archived mailboxes. Often, the only way to find these records is via legal eDiscovery, but even that can be limited in terms of how far back you can go.

Options for the preservation of legacy emails

The Council on Information and Library Resources (CLIR) published a detailed report in August 2018 titled ‘The Future of Email Archives: A Report from the Task Force on Technical Approaches to Email Archives’.

The report noted (from page 58) three common approaches to the preservation of legacy emails:

  • Bit-Level Preservation
  • Migration (to MBOX, EML or even XML)
  • Emulation

In March 2020, the Australian IDM magazine published a follow-up article by one of the CLIR report authors, Chris Prom. The article, titled ‘The Future of Past Email is PDF’, suggested that PDF may be (or may become) a more suitable long-term solution for the preservation of legacy emails.

Preservation is one thing, but what about access?

There is little point in preserving important records if they cannot be accessed. The two must go together. In fact, preservation without the ability to access a record is not all that different from destruction through negligence.

Assuming emails can be migrated to a long-term and accessible format, what then?

No one (except perhaps well-funded archival institutions) is seriously likely to attempt to move or copy individual legacy emails to pre-defined and pre-existing containers or aggregations of other records. This would be like printing individual emails and storing them in the same paper file or box in which other records on the same subject are stored.

Access to legacy emails in a digitally accessible, metadata-rich format like PDF provides a range of potential opportunities to ‘harvest’ and make use of the content, including through machine learning and artificial intelligence.
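As a small illustration of what ‘metadata-rich’ access makes possible, the sketch below uses Python's standard email library to parse a made-up legacy email and extract the header metadata that any preservation rendition (PDF or otherwise) would need to carry with it. The message itself is entirely hypothetical.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A minimal, made-up legacy email in RFC 822 format.
raw = """\
From: sender@example.org
To: recipient@example.org
Subject: Legacy project correspondence
Date: Tue, 17 Nov 2020 09:30:00 +1000

Please find the draft terms below.
"""

msg = message_from_string(raw)

# Header metadata that should travel with any preservation rendition.
metadata = {
    "from": msg["From"],
    "to": msg["To"],
    "subject": msg["Subject"],
    "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
}

body = msg.get_payload()
print(metadata)
```

Once headers and body are separated out like this, the content can be indexed, searched, and analysed at scale – which is exactly what makes a migrated, accessible format more useful than emails locked in old mailboxes.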

These options have been available for close to twenty years in the eDiscovery world, but primarily to support specific legal requirements.

Search, discovery and retention/disposal tools available in the Microsoft 365 Compliance portal, along with the underlying Graph and AI tools (including SharePoint Syntex) provide the potential to manage legacy content, including emails.

The starting point is migrating all those old legacy emails to an accessible format.

Posted in Compliance, Electronic records, Exchange Online, Information Management, Microsoft Teams, Records management, Retention and disposal, Security

Using MS Teams without an Exchange Online mailbox

When people chat in Microsoft Teams (MS Teams), a ‘compliance’ copy of the chat is saved to either personal or (Microsoft 365) Group mailboxes. This copy is subject to retention policies, and can be found and exported via Content Search.

But what happens if there is no Exchange Online mailbox? It seems the chats become inaccessible, which could be an issue from a recordkeeping and compliance point of view.

This post explains what happens, and why it may not be a good idea (from a compliance and recordkeeping point of view) to disable the Exchange Online mailbox option as part of licence provisioning.

Licences and Exchange Online mailboxes

When an end-user is allocated a licence for Microsoft 365, a decision (sometimes incorporated into a script) is made about which of the purchased licences – and apps in those licences – will be assigned to that person.

E1, E3 and E5 licences include ‘Exchange Online’ as an option under ‘Apps’. This option is checked by default (along with many of the other options), but it can be disabled (as shown below).

If the checkbox option is disabled as part of the licence assignment process (not after), the end-user won’t have an Exchange Online mailbox and so won’t see the Outlook option when they log on to the office.com portal. (Note: if they have an on-premise mailbox, that will continue to exist; nothing changes.)

Having an Exchange Online mailbox is important if end-users are using MS Teams, because the ‘compliance’ copy of 1:1 chat messages in MS Teams is stored in a hidden folder (/Conversation History/Team Chat) in the Exchange Online mailbox of every participant in the chat. If the mailbox doesn’t exist, those copies aren’t made, so the chats aren’t accessible and may be deleted.

If end-users chat with other end-users who don’t have an Exchange mailbox, as shown in the example below, the same thing happens – no compliance copy is kept. The chat remains inaccessible (unless the Global Admins take over the account).

The exchange above, between Roger Bond and Charles, includes some specific key words. As we will see below, these chats cannot be found via a Content Search.

(On a related note, if the ability to create private channels is enabled and end-users create a private channel and chat there, those chats are also not saved, because compliance copies of private channel chats are stored in the mailboxes of the individual participants – which, in this scenario, don’t exist.)

Searching for chats when no mailbox exists

As we can see above, the word ‘mosquito’ was contained in the chat messages between Roger and Charles.

Content Searches are carried out via the Compliance portal and are more or less the same as eDiscovery searches in that they are created as cases.

From the Content Search option, a new search is created by clicking on ‘+New Search’, as shown below. The word ‘mosquito’ has been added as a keyword.

We then need to determine where the search will look. In this case the search will look through all the options shown below, including all mailboxes and Teams messages.

When the search was run, the results area shows the words ‘No results found’.

Clicking on ‘Status details’ in the search results displays the following information – ‘0 items’ found. The ‘5 unindexed items’ figure is unrelated to this search; it simply indicates that five items in the searched locations are not indexed.

Double-checking the results

To confirm the results were accurate, another search was conducted where the end-user originally did not have a mailbox, and then was assigned one.

If the end-user didn’t have a mailbox but the other recipient/s of the message did, the Content Search found one copy of the chat message in the mailbox of the other participants. Only one item was found.

When the Exchange Online option was enabled for the end-user who previously did not have a mailbox (so they were now assigned a mailbox), a copy of the chat was found in the mailbox of both participants, as shown in the details below (‘2 items’).

Summary and implications

In summary:

  • If end users chat in the 1:1 area of MS Teams and don’t have an Exchange Online mailbox, no compliance copy of the chat will be saved, and so it will not be found via Content Search.
  • If any of the participants in the 1:1 chat have an Exchange Online mailbox, the chat will appear in the mailboxes of those participants.
  • If all participants in the 1:1 chat have an Exchange Online mailbox, the chat will be found in the mailbox of all participants.
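The behaviour summarised above can be modelled in a few lines. This is an illustrative sketch only – not how Exchange actually stores chats – but it captures the logic: a compliance copy is written to each participant's mailbox only if that mailbox exists, and a content search can only look inside mailboxes that exist. The user names and messages are hypothetical.

```python
# Toy model: a tenant where only some users have Exchange Online mailboxes.
mailboxes = {"alice": []}  # bob has no mailbox yet

def send_chat(participants, message):
    """A compliance copy lands only in mailboxes that exist."""
    for user in participants:
        if user in mailboxes:
            mailboxes[user].append(message)

def content_search(keyword):
    """Count compliance copies containing the keyword, across all mailboxes."""
    return sum(
        keyword in msg
        for items in mailboxes.values()
        for msg in items
    )

# Alice (mailbox) chats with Bob (no mailbox): only one copy is kept.
send_chat(["alice", "bob"], "quarterly mosquito report")
assert content_search("mosquito") == 1

# Give Bob a mailbox and chat again: a copy lands in both mailboxes.
mailboxes["bob"] = []
send_chat(["alice", "bob"], "updated mosquito report")
```

After the second chat, the same search finds two copies of the new message – matching the ‘1 item’ versus ‘2 items’ results described in the Content Search tests above.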

Further to the above:

  • If end users can delete chats (via Teams policies) and don’t have a mailbox, no copy of the chat will exist.
  • If end-users with a mailbox can delete Teams chats, but a retention policy has been applied to the chats, the chats will be retained as per the retention policy (in a hidden folder).

And finally, if you allow private channels, end-users can create private channels in the Organisation Team. The chats in these private channels are usually stored in the personal mailboxes of participants (not the Group mailbox) – so, where participants have no mailbox, these chats will also be inaccessible and cannot be found via Content Search.

The implication of the above is that, if you need to ensure that personal chat messages can be accessed (via Content Search), the participants in the chat must have an Exchange Online mailbox.

Further, if you allow deletion of chats but need to be able to recover them for compliance purposes, a retention policy should be applied to Teams 1:1 chat.

Posted in Electronic records, Information Management, Microsoft 365, Records management, Retention and disposal

A modern way to manage the retention of digital records

In his April 2007 article titled ‘Useful Void: The Art of Forgetting in the Age of Ubiquitous Computing’ (Harvard University RWP07-022), Viktor Mayer-Schönberger noted that the default human behaviour for millennia was to forget. Only information that needed to be kept would be retained. He noted that the digital world had changed the default to remembering, and that the concept of forgetting needed to be re-introduced through the active deletion of digital content that does not need to be retained.

The harsh reality is that there is now so much digital information in the world, including digital content created and captured by individual organisations, that active deletion of content that does not need to be retained seems an almost impossible task.

This post explores issues with the traditional model of records retention in the digital world, and why newer options, such as the records retention capability of Microsoft 365, offer a more effective way to manage the retention and disposal of records and all other digital content.

The traditional retention model

The traditional model of managing the retention and disposal/disposition of records was based on the ability to apply a retention policy to a group or aggregation of information identified as records. For the most part, those paper records were the only copy that existed (with some allowance for working and carbon copies).

The model worked reasonably well for paper records, but started to falter when paper records became the printed versions of born-digital records, and where the original digital versions remained where they were created or captured – on network file shares, in email systems, and on backups. Although, technically, the official record was on a file, a digital version was likely to remain, overlooked, on network file shares or in an email mailbox after the paper version was destroyed at the end of the retention period.

How many of us have had to wade through the content of old network file shares to examine the content, determine its value, and perhaps see if it can even still be accessed? Or do the same with old backup tapes?

The volume of unmanaged digital content, not subject to any retention policy, only continued to increase. This situation continued to worsen when electronic document and records management (EDRM) systems were introduced from the late 1990s. End-users had to copy records to the EDRMS, thereby creating yet another digital copy, in addition to the born-digital originals stored in mailboxes or file shares.

Even if the record in the EDRMS was destroyed, there was a good chance the original ‘uncontrolled’ version of the digital record – along with an unknown volume of digital records that probably should have been consigned to the EDRMS but weren’t – remained in email mailboxes, on file shares, or on a backup tape somewhere.

eDiscovery was born.

The emergence of new forms of digital records, including instant messages, social media, and smart-phone based chat and other apps from the early 2000s only added to the volume of digital content, much of which was stored in third-party cloud-based and mobile-device accessible applications, completely out of the reach of the organisation trying to manage its records.

Modern retention management

A modern approach to retention management should be based on the following principles:

  • Information, not just records, should only be kept for as long as it is required.
  • It is no longer possible to accurately and/or consistently identify and capture all records in a single recordkeeping system.
  • Duplication of digital content can be reduced by creating and capturing records in place, promoting ‘working out loud’, co-authoring and sharing (no more attachments and private copies).

None of the above points excludes the ability to manage certain types of records at a more granular level where this is required. But these records, or the location in which they are created or captured, should not be regarded as the only form of record.

Ideally, these records should be created (or captured) directly in the system where they are to be managed – not copied to it.

Change management is necessary

Some of these new ways of working are likely to come up against deeply ingrained behaviours, many of which go back several decades and have contributed to a reluctance to ‘forget’ and destroy old digital content, including:

  • hiding/hoarding content in personal drives (and personal cloud-based systems or on USB drives);
  • communicating by email, the content of which is inaccessible to anyone else;
  • attaching documents to emails;
  • printing and filing born-digital content; and
  • sometimes, scanning/digitising the printed copies of born-digital records and saving them back to a digital system.

What about destruction?

Records managers in organisations moving away from the authorised destruction of digital content identified as records, to the destruction of all digital content (including identified records), need to consider what is required to achieve this outcome, and the implications for existing processes and practices (including those described above).

  • Some activities will remain unchanged. For example, the need to review certain types of records before they are destroyed (aka ‘disposition review’), to seek approval for that destruction, and to keep a record of what was destroyed.
  • Some activities are new and can replace other existing actions and activities. For example, the application of retention policies to mailboxes can remove the requirement to back up those mailboxes.
  • Some activities or outcomes may be challenging. For example, the automatic destruction, without review, of digital content that is not the subject of more granular retention requirements, such as emails in mailboxes or documents in personal working drives. This content will simply disappear after the retention period expires.

How Microsoft 365 can support modern retention management

Microsoft recognised some time ago that it was becoming increasingly difficult to manage the volume and variety of digital content being created every day by organisations.

Existing and newly released functionality in the Compliance portal of Microsoft 365 includes the ability to create and apply label-based retention policies to specific types of records (including automatically, based on machine learning capabilities), as well as broader ‘workload’-specific retention policies (e.g., for mailboxes, SharePoint sites, OneDrive accounts, and MS Teams chats). This capability helps organisations to focus retention requirements on the records that need to be retained, while destroying digital content that is no longer relevant and can be forgotten.
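The interplay between granular labels and broad workload policies can be sketched as a toy model – illustrative only, with hypothetical label names and retention periods, and none of the actual Microsoft 365 machinery. Each item is tested against a specific label rule if it has one, falling back to the broad workload default, and content past its retention period becomes eligible for disposal.

```python
from datetime import date

# Hypothetical retention rules: label-based rules are specific,
# the workload policy is the broad fallback.
LABEL_RULES = {"contract": 7, "invoice": 5}  # years after creation
WORKLOAD_DEFAULT = 2                          # years, e.g. for all OneDrive content

def disposal_due(created, label=None):
    """Date on which an item becomes eligible for disposal.

    (Ignores the 29 February edge case for simplicity.)
    """
    years = LABEL_RULES.get(label, WORKLOAD_DEFAULT)
    return created.replace(year=created.year + years)

today = date(2020, 11, 17)

# A labelled contract is kept for 7 years; an unlabelled document only 2.
assert disposal_due(date(2015, 1, 1), "contract") == date(2022, 1, 1)
assert disposal_due(date(2015, 1, 1)) == date(2017, 1, 1)

# The unlabelled 2015 document is already eligible for deletion;
# the labelled contract is not.
assert disposal_due(date(2015, 1, 1)) <= today
assert disposal_due(date(2015, 1, 1), "contract") > today
```

The point of the sketch is the fallback: everything gets some retention outcome, so unlabelled content is forgotten by default rather than accumulating indefinitely.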

Instead of directing end-users to identify records and copy them from one system to another (thereby creating two versions), Microsoft 365 allows end-users to create and capture records in place, providing a single source of truth that can be shared (rather than attached), be the subject of co-authoring, and protected from unauthorised changes (and even downloads).

Limitations with Microsoft 365

It is important to keep in mind that there are some limitations with the current (October 2020) retention capability in Microsoft 365.

  • Retention and disposal is based on individual digital objects, not aggregations. There are limited ways to group individual records by the original aggregations in which they may have been stored (e.g., document libraries in SharePoint).
  • Only the (minimal) details of records that were subject to a disposition review are recorded in the ‘disposed items’ listing, and this is only kept for a year (but can be exported). No record is kept of any other destroyed record, except in audit logs (for a limited period).
  • The metadata details kept for records destroyed after a disposition review are minimal – the document type and name, the date destroyed, and who destroyed it.
  • When records are destroyed from SharePoint document libraries or lists, the library or list remains with no record kept of what was previously stored there. It is not possible to leave a ‘stub’ for a destroyed record.

Summing up

The primary outcome from introducing modern ways to manage retention will be that all digital content, not just content that has been identified as records or copied to a recordkeeping system, will be subject to some form of retention and disposal management.

In other words, it is a change from exception-based retention (where all the other digital content is overlooked) to a more holistic method of retention, with both granular controls on certain types of records where required, and a broader retention capability that allows us to forget content that is no longer relevant – the ‘redundant, outdated and trivial’ (ROT) content often scattered across network file shares.