Posted in Electronic records, Governance, Information Management, Microsoft 365, Microsoft Teams, Records management, Retention and disposal, SharePoint Online

A basic retention model for Microsoft Teams

In my previous post about managing inactive Teams, the third option listed was to apply retention policies to those Teams. It included the graphic below.

This post provides more details of a basic retention model that can be applied to both active and inactive Teams.

Key takeaways

Key takeaways from this post for records and information managers:

  • Every Team has a ‘Posts’ (group chat messages) and ‘Files’ (documents etc) tab, and usually also starts with a Wiki tab (which can be removed). Other tabs may be added via the + option.
  • A Team in Microsoft Teams is not a single container or aggregation for the capture and storage of records. Almost all the records in a Team are stored either in a hidden folder in Exchange Online (EXO) mailboxes (posts) or in SharePoint Online (SPO) sites (files). Some records (conversations) may also be created and captured in the EXO mailbox of the associated Microsoft 365 (M365) Group.
  • It is not possible to apply a single retention policy to a Team; at least two separate policies will be required – one policy for the Team channel posts of EVERY team, and one or more policies for the content captured in SPO sites (files) or groups of sites.
  • Some records, created in and accessible from Teams, may be stored in other M365 applications (e.g., Tasks, Forms, WhiteBoard, etc) or third-party applications. It is not possible to apply any Microsoft 365 retention policy to records created by or captured in these applications.
  • Records and information managers should have access to the details (not necessarily the content) of every M365 Group, Team, and SPO site in order to establish a plan for the creation and application of retention policies to Teams. At a minimum, they should be assigned the Global Reader role (for details of M365 Groups and SPO sites) and the Compliance admin role (for retention policies).
  • It is relatively easy to overcomplicate the retention model for Teams, for example by applying separate retention labels to different folders and sub-folders in each channel ‘files’ tab.
  • Try to keep the model simple for as long as possible.

Core components of a Team

The main components of every Team are shown in the diagram below. If private channels are not allowed in the organisation, ignore the top two left and right elements.

The relationship of a Team to its M365 Group, Exchange mailbox and SharePoint site, showing where the content is stored (dotted lines).

As shown in the diagram above:

  • Every Team is directly linked with an M365 Group. Every M365 Group has an Exchange Online (EXO) mailbox and a SharePoint Online (SPO) site.
    • The Team, M365 Group, SPO site, and mailbox address (teamname@) all share the same name. The original name (which should be brief, <20 characters if possible) and the display name may be different.
    • The Owners and Members of the Team are the Owners and Members of the M365 Group and those Groups are added to the SPO site Owners and Members permission groups respectively.
  • A ‘compliance copy’ of every post in a normal channel is copied from the Azure-based Teams chat service (which is always inaccessible) to a hidden folder of the EXO mailbox of the M365 Group linked with the Team.
    • Where private channels are allowed, a ‘compliance copy’ of every post in a private channel is copied to a hidden folder of the ‘personal’ EXO mailboxes of all participants in the private channel.
  • Any content created or captured in the ‘Files’ tab of the Team channels is stored in the SPO site of the M365 Group linked with the Team. If any lists are created, they are either stored on the same SPO site or are linked from another site.
    • Where private channels are allowed, a separate SPO site is created (using the name of the ‘parent’ site followed by a hyphen then the private channel name, e.g., parentsitename-privatechannelnamesite). Any content created or captured in the ‘Files’ tab is stored in that SPO site.

So, a Team is a combination of at least four elements: the Teams user-interface (and back-end database), an M365 Group, a SPO site, and an EXO mailbox. The mailbox is used for three main purposes:

  • Email-based ‘conversations’ (when used).
  • Calendaring.
  • Storage of Teams posts.

This is why it is not possible to apply a single retention policy to a Team.
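The storage picture described above can be summarised as a simple lookup. This is only an illustrative sketch in Python; the dictionary keys and location labels are my own shorthand for the elements described in this post, not names used by Microsoft 365.

```python
# Where each kind of Team content is stored, as described in this post.
# Keys and labels are illustrative shorthand, not Microsoft 365 identifiers.
TEAM_CONTENT_STORAGE = {
    "channel posts": "hidden folder in the M365 Group's EXO mailbox",
    "private channel posts": "hidden folder in each participant's personal EXO mailbox",
    "channel files": "SPO site of the M365 Group",
    "private channel files": "separate SPO site for the private channel",
    "Group conversations": "M365 Group's EXO mailbox",
}


def distinct_locations(storage=TEAM_CONTENT_STORAGE):
    """Return the distinct storage locations used by a single Team."""
    return sorted(set(storage.values()))


if __name__ == "__main__":
    for kind, location in TEAM_CONTENT_STORAGE.items():
        print(f"{kind}: {location}")
    print(f"{len(distinct_locations())} separate locations for one Team")
```

Because one Team's content is spread across several locations, no single policy location can cover all of it.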

The basic retention model

The basic retention model for Teams assumes the following:

  • If the organisation’s retention schedule/disposal authority does not include coverage for Team posts (chat messages) and also general Team chats, there is a legally defensible policy that defines how long Team channel (including private channel) posts (and chats) will be retained. Note: This policy will define a single retention period for ALL posts, with a separate policy for ALL chats.
  • Records and information managers know the details of every M365 Group, Team (including number of private channels) and SPO site (including last activity and number of files).
  • One or more retention policies will be created for SPO sites.
  • One or more retention policies may be created for M365 Groups.
  • Unless it is done ‘manually’, there will be no review process before the content is destroyed at the end of the retention period.
  • No label-based retention policies will be applied (at this point). They may be added later as required (see below).
  • Unless the option to auto-expire M365 Groups is used, there will be a manual process to delete inactive and empty M365 Groups or Teams; deleting either will also delete the linked SPO site.

Creating retention policies

Retention policies are created in the Information Governance section of the M365 Compliance admin portal under ‘Retention policies’.

Generally speaking, organisations should not create many of these policies as they should ideally target entire workloads (all SPO sites, all EXO mailboxes, etc) or in some cases major groupings (e.g., EXO mailboxes of senior executives, all other mailboxes).

And remember, these policies do NOT destroy the container (Team, SPO site, EXO mailbox), only the content in those containers.

Every new retention policy has three parts.

Name

The name of the retention policy should be easily recognisable, for example ‘Teams channel posts 7 years’ (all encompassing, for all channel posts, see next dot point), or ‘General SPO site retention 7 years’. The name section also includes a description that should always be used to link the policy to details in a retention schedule/disposal authority or corporate policy.

Location

The ‘location’ element is where the complexity arises as it is not possible to create a single retention policy for all the elements in a Team. Selecting either ‘Teams channel messages’ or ‘Teams private channel messages’ will disable all other options. It is not possible to select ‘SharePoint sites’ or ‘Microsoft 365 Groups’ AND any of the Teams options in the same policy.

Because of this limitation, at least two separate retention policies will be required for a basic retention model, with an additional one for private channels (if required):

  • A retention policy for either all or selected SharePoint sites, including private channel sites. The simplest model is to create a single retention policy for all SharePoint sites. This creates a preservation hold library on every site, retaining all deleted content for the minimum period required. Alternatively, and especially if there is a way to ‘group’ SPO sites (e.g., all project team sites), create retention policies for those groups and add in the site names. Always keep in mind that a retention policy applied to the SPO site has no connection with or impact on the channel posts.
  • A retention policy for all Teams channel messages. Note that this cannot include or exclude any Teams – it’s all or none. Depending on the retention selected for channel posts (next point), this could mean that channel posts are destroyed before (or after) the Team’s SPO content.
  • A retention policy for all Teams private channel posts. Similar to the previous point, this is an ‘all or none’ policy.
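The location rule described above can be expressed as a small check. This is a sketch of my reading of the portal behaviour, not an official API: the location names are the ones mentioned in this post (the portal offers others), and the function name is hypothetical.

```python
# Location names follow the options described in this post. The rule below is
# an assumption based on the behaviour described here, not an official API.
TEAMS_LOCATIONS = {"Teams channel messages", "Teams private channel messages"}
OTHER_LOCATIONS = {"SharePoint sites", "Microsoft 365 Groups"}


def is_valid_location_set(locations):
    """Return True if the locations could be combined in one retention policy."""
    selected = set(locations)
    if not selected or not selected <= (TEAMS_LOCATIONS | OTHER_LOCATIONS):
        return False  # empty selection or an unknown location name
    if selected & TEAMS_LOCATIONS:
        # Selecting a Teams location disables all other options,
        # so it must be the only location in its policy.
        return len(selected) == 1
    return True


if __name__ == "__main__":
    print(is_valid_location_set(["Teams channel messages"]))
    print(is_valid_location_set(["Teams channel messages", "SharePoint sites"]))
```

A check like this makes it clear why the basic model ends up with at least two policies, and a third where private channels are in use.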

If the Team is also making use of the M365 Group’s ‘conversations’ in Outlook, consideration may also be given to creating a retention policy for M365 Groups (or included/excluded Groups). This policy will cover (a) Group ‘conversations’ and (b) the SharePoint site linked with the Group/Team. It will NOT cover the Team channel posts that may be stored in the M365 Group EXO mailbox. Note: It is possible to select just the M365 Group mailbox OR the M365 Group’s SPO site in this policy via a PowerShell script.

Retention period

Retention options are shown in the screenshot below. These options are the same for every retention policy.

Retention policies either automatically delete content after a minimum period or do nothing (includes the ‘retain items forever’ option). There is no disposition review. This means that the content in the SPO site and Team channel (including any ‘deleted’ content, which is not actually deleted, just hidden) simply disappears when the retention period expires.

Retention variations

Organisations may of course have different requirements or decide to apply retention differently. Each of these will still be some variation on the above model.

In most cases, there should be at least one retention policy in place for each of the different elements that make up a Team – the M365 Group, the SPO site, the channel posts, the private channel posts. Whether those policies have the same retention period will be up to the organisation to determine, but in all cases, the details should be documented somewhere as currently this information is not easily available.

Retention labels

It is not possible to apply retention labels to Teams channel or private channel posts (or chats). There is only one option, and that is a single retention policy for each of these.

Retention labels may be applied to the content stored in the Team’s linked SPO site, and these may be applied instead of using retention policies. This may be an effective model when combined with auto-expiry of M365 Groups, as auto-expiry will not occur if the content is subject to an active retention policy or retention label.

However, applying labels to the content stored in each Team channel ‘files’ tab has the potential to be a very complicated model that will become almost impossible to monitor or manage in time.

Each channel ‘files’ tab maps to a folder with the same name in the Documents library of the linked SPO site. As each Team channel may have been created for the records of a different subject with a different retention requirement, this means that each folder (or potentially even sub-folders) in the library may have a different label.

As retention labels (and policies) apply to individual items in the library (but not the folder), this means that individual items, stored in folders, that are subject to disposition review will come up for review in the future.

The application of multiple retention labels to folders within the single Document library of the SPO site is already complicated; having to review some of the individual items as part of a disposition review in the future is just adding to the complexity.

My view is that Teams should, as far as possible, ‘contain’ records relating to the same subject with the same single retention period that can be applied to the entire SPO site. Applying individual labels to folders or sub-folders within a single document library is a complex model both to apply and manage into the future.

What to do with empty Teams?

As noted already, retention policies (and labels) do not delete the SPO site, Team or M365 Group, only the content stored in them. Each of these ‘containers’ remains after the content within it has been destroyed.

Accordingly, it is advisable for records and information managers to (a) have access to the details of every SPO site, Team and M365 Group and (b) work closely with IT to determine when these containers can be deleted (and document that activity). Otherwise, the M365 environment will be left with the hollow shells of sites, Teams and Groups.

Further reading

The following Microsoft links provide further details on this subject.

Learn about retention policies and retention labels

Learn about retention for Microsoft Teams

Learn about retention for SharePoint and OneDrive

Create and configure retention policies

Apply retention labels to files in SharePoint or OneDrive

Teams messages about retention policies

Featured image: http://www.pexels.com

Posted in Microsoft Teams, Products and applications, Records management, Retention and disposal

Managing inactive Teams in Microsoft Teams

The rapid and often uncontrolled rollout of Microsoft (MS) Teams as part of Microsoft 365 (M365) deployments from early 2020 has become a headache for many records and information managers. In many organisations, inactive Teams – some with no owners and inaccessible to records managers – litter the M365 landscape.

The introduction of private channels in 2020 added a new layer of complexity for the management of inactive Teams.

This post examines three ways to manage inactive Teams, especially those that may contain records.

  • Auto-expiration (and deletion) of M365 Groups.
  • Archiving Teams.
  • Applying (separate) retention policies to the elements that make up each Team.

It assumes that records and information managers will or should:

  • Take a leading role or be involved in decisions with IT departments around the creation of new Teams and the management of inactive Teams and their associated SPO sites.
  • Have access to the details of all active and inactive M365 Groups, Teams (including private channels), and SharePoint sites, including through role assignment (e.g., Global Reader, Compliance admin).
  • Know how and where Teams stores content in different applications.
  • Be directly involved in decisions about the creation and application of retention policies to Teams content, and disposition actions when those policies expire.
  • Where appropriate, be made the owners of inactive Teams (and M365 Groups) to allow them to review the content of that Team.

Option 1 – Auto-expiry of M365 Groups

Every Team in MS Teams is directly connected with an M365 Group; a Team uses the M365 Group’s EXO mailbox and SPO site for the storage of content. Therefore, if the M365 Group is destroyed, so are the Team and all its content.

Microsoft 365 includes the ability to automatically ‘expire’ and then delete all or selected M365 Groups after a given period of inactivity.

The Group’s expiration option is set in the Azure Active Directory (AAD) admin portal under Groups > Settings > General. This option includes renewal notifications (which will appear in Teams) and the ability to select specific M365 Groups (the default is None).

Azure AD Group Expiration

Pros of auto-expiry

Automatically expiring and then deleting M365 Groups can be a simple way to clean up inactive Groups and the linked Teams, based on the last activity of the Group or in the Team (SPO site, EXO email-based ‘conversations’, or channel posts). This may be particularly effective for general Teams that have been hardly used and/or known not to contain records.

Auto-expiry may be a useful option in conjunction with retention policies; M365 Groups and linked Teams that are also subject to retention policies will be retained beyond the expiry date.

If the expiry notification is missed or overlooked and the Team is soft-deleted, the M365 Group (and its associated Team content) can be restored for up to 30 days, and the SPO site can be recovered for up to 93 days. Beyond those periods, the deleted M365 Group and all the content associated with it (including Teams) is irrecoverable.
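The two recovery windows can be worked out from the deletion date with simple date arithmetic. The sketch below assumes the 30 and 93 day figures quoted in this post; the function name and dictionary keys are hypothetical.

```python
from datetime import date, timedelta

GROUP_RESTORE_DAYS = 30  # restore window for a soft-deleted M365 Group (figure from this post)
SITE_RECOVERY_DAYS = 93  # recovery window for the linked SPO site (figure from this post)


def recovery_deadlines(deleted_on):
    """Return the dates on which the Group and SPO site recovery windows end."""
    return {
        "m365_group": deleted_on + timedelta(days=GROUP_RESTORE_DAYS),
        "spo_site": deleted_on + timedelta(days=SITE_RECOVERY_DAYS),
    }


if __name__ == "__main__":
    for element, deadline in recovery_deadlines(date(2021, 1, 1)).items():
        print(f"{element}: recoverable until about {deadline.isoformat()}")
```

Recording these dates when a Group is deleted gives records managers a clear window in which to check whether anything needs to be recovered.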

Cons of auto-expiry

Auto-expiry is effectively auto-deletion without review. This option may work best for organisations with a relatively low number of Groups and/or where there is low concern or risk of deleting records prematurely. Organisations that are concerned about the deletion of records without review should be cautious of this approach.

Note that even if auto-expiry is set, this will not destroy any M365 Group or Team that is still subject to a retention policy – see below.

For more information about auto-expiry of M365 Groups, see the Microsoft docs page ‘Microsoft 365 group expiration policy’ and also ‘Team expiration and renewal’, which shows how the M365 Group expiration notification works in Teams.

Option 2 – Archiving Teams

Any Team in MS Teams can be archived either by the MS Teams admin (via the admin portal), or by a Team Owner via the gear icon at the bottom left of the MS Teams application, next to ‘Join or Create a Team’. Clicking the gear icon opens a list of Teams; at the far right, the three-dot menu includes the options (including ‘Archive Team’) listed below.

The list of options for each Team.

The process of archiving a Team includes the option to make the linked SharePoint site read only, and makes the Team’s channels read only.

If the SPO site is not also made read only, the members of the Team can continue to upload and edit content via the Team’s channels or via the SPO site directly (and also via File Explorer for synced libraries).

Teams that have been archived appear in a separate ‘Archived’ section, from where they can be ‘restored’ (un-archived, made editable again) provided they are not subject to an auto-expiry policy or retention policies.

Pros of archiving Teams

Archiving Teams (and making the linked SPO site read only) may be a useful way to prevent any further changes to those Teams, but it does not do more than that. Additional options, including either auto-expiry (for low-risk Teams) or retention policies (for Teams with records) should be considered to ensure that inactive archived Teams are destroyed when this is allowed.

Archiving Teams may also be a useful way to ‘tag’ Teams that cease to be active, making them more easily identifiable for retention or disposal.

Cons of archiving Teams

Archiving Teams is not an effective or safe way to ensure that any records contained in the Team remain unchanged for as long as the Team still exists. It simply makes the Team’s channels read-only, and may also make the SharePoint site read only, if that option is selected.

If an archived Team is subject to an auto-expiry policy, it will be destroyed (with prior notification) after a specified period. A better option for Teams used to create or capture records would be to apply retention policies to the Team.

For more information about archiving Teams, see this Microsoft docs page ‘Archive or delete a team’.

Option 3 – Apply retention policies

This is probably the most complex area of M365 for records and information managers to understand given the multiple elements that make up MS Teams. Careful planning is necessary before any retention policy is applied, based on a thorough understanding of the structure of Teams and where the content is stored.

As a starting point, it is important to understand that:

  • A single retention policy cannot be applied to all the content of a Team and its associated M365 Group (private channel posts, channel posts, SPO files, Outlook ‘conversations’). Multiple retention policies will be required.
  • It is NOT possible to apply retention labels to either Teams public or private channel posts. These can only be covered by retention policies. Retention labels could be applied to content stored in the SPO site.

The model for applying retention to Teams (not the 1:1 chats area) may include up to four separate retention policies (and also retention labels):

  • One or more retention policies for the Team (non private) channel posts. These policies will apply to the compliance copies of those posts stored in a hidden folder of the linked M365 Group’s EXO mailbox.
  • One or more retention policies for the Team’s private channel posts if they exist. These policies will apply to the compliance copies of those posts stored in a hidden folder in the EXO mailbox of all members of the private channel.
  • One or more retention policies for the Team’s files stored in the SPO site. Additional retention labels may also be applied (see below).
  • If the mailbox is used for Group conversations, one or more retention policies for the M365 Group, which includes coverage for both the emails and the files.

So, each Team could potentially be subject to up to four separate retention policies.
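The list above can be turned into a simple rule for working out which policies a given Team needs. This is an illustrative Python sketch only; the function name, parameters and policy labels are hypothetical and simply mirror the four bullet points.

```python
def policies_needed(has_private_channels, uses_group_conversations):
    """Return the kinds of retention policy a Team needs under the model above."""
    # Every Team has channel posts and files in its SPO site.
    policies = ["Teams channel posts", "SPO site files"]
    if has_private_channels:
        policies.append("Teams private channel posts")
    if uses_group_conversations:
        policies.append("M365 Group (conversations and files)")
    return policies


if __name__ == "__main__":
    print(policies_needed(has_private_channels=False, uses_group_conversations=False))
    print(policies_needed(has_private_channels=True, uses_group_conversations=True))
```

A Team with no private channels and no Group conversations needs only two kinds of policy; one that uses both needs all four.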

Retention policies that could apply to every Team, or groups of Teams

In addition to the above, retention labels may be applied either ‘manually’ or automatically (including via trainable classifiers or SharePoint Syntex) to content stored in the SPO site (the channel files – each channel is a folder in the default Documents library). These labels will likely have retention periods that are longer than the retention policy and may include disposition review.

An even more complex model is to apply multiple retention labels to the channel-linked folders (and sub-folders) in the SPO site’s Documents library. This model is fraught with complexity in terms of future disposition review and would be the equivalent of applying retention policies to different folders and subfolders in a network file share.

Pros of applying retention policies (and labels)

Retention policies ensure that content is not destroyed for the period set in the retention policy.

Retention policies are better than auto-expiry because they capture any content that is ‘deleted’ by end-users for the life of the policy. They are better than ‘archiving’ Teams as they set a minimum retention period, protect the content from destruction during that time (‘in place holds’), then destroy the content.

Retention policies could also be used in conjunction with the other two options as necessary. For example, there may be some Teams that contain no records and could simply be deleted via the auto-expiry option. If they contain records, a retention policy will retain the content for as long as required.

Cons of applying retention policies

The main negative of applying retention policies is the complexity of the model, and knowing what has been applied and where. This is especially true if there are many Teams. Consultation and coordinated planning between RM/IM and IT, and documentation of the model, are all essential.

Unfortunately, the Microsoft 365 Compliance admin portal does not provide a single view of what policies have been applied where. Unless a third-party application is used, the only way to achieve this is by recording the details of the policies in – say – a spreadsheet or a SharePoint list.
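A policy register does not need to be elaborate. The following Python sketch writes one as a CSV file that can be opened in Excel; the column names are my own suggestion and the two rows are made-up examples based on the policy names used earlier, so adjust them to suit your own model.

```python
import csv
import io

# Suggested (hypothetical) columns for a register of retention policies.
FIELDS = ["policy_name", "location", "scope", "retention_period", "authority_reference"]

# Made-up example rows; replace with the organisation's actual policies.
example_policies = [
    {
        "policy_name": "Teams channel posts 7 years",
        "location": "Teams channel messages",
        "scope": "All Teams",
        "retention_period": "7 years",
        "authority_reference": "(link to retention schedule entry)",
    },
    {
        "policy_name": "General SPO site retention 7 years",
        "location": "SharePoint sites",
        "scope": "All sites",
        "retention_period": "7 years",
        "authority_reference": "(link to retention schedule entry)",
    },
]


def write_register(policies, stream):
    """Write the policy register to any text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(policies)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_register(example_policies, buffer)
    print(buffer.getvalue())
```

The same columns would work equally well as the columns of a SharePoint list.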

Retention policies do not include the option for disposition review, so records and information managers might need to consider the requirement to find a way to document the disposition (deletion) process and retain a record of what was destroyed.

By actively monitoring Teams, records and information managers should know when the content in Teams is due for destruction, allowing time to extract metadata (where possible) and other information.

For more information about applying retention to Teams and SPO, see these Microsoft docs pages: ‘Learn about retention for Teams’, ‘Learn about retention for SharePoint and OneDrive’ and also ‘Limits for retention policies and retention label policies’.

Concluding comments

All of the above underlines why records and information managers need to know what Teams exist, where the records are stored, and be proactively involved in decisions about what happens to inactive Teams.

As long as retention policies have been correctly applied to the various parts of the Team, that content will be retained for minimum periods. End-users may think they are deleting content, but it remains stored and accessible via a Content Search.

Feature Image Credit: David Yu (image 2081166, via Pexels)

Posted in Electronic records, Information Management, Planner, Records management, Retention and disposal, Tasks

Managing tasks as records in Microsoft 365 Planner/Tasks

There are several ways to create, record and assign tasks in organisations. These may include:

  • Personal tasks (or calendar entries) in email applications such as Outlook, or set via the Microsoft ‘To Do’ application.
  • Team and Group-based tasks created and managed in various ways, including on physical white boards, via Microsoft 365 Planner/Tasks or ‘Tasks by Planner for Teams’.
  • Project-based tasks, including in Microsoft Project or other similar applications. Depending on the type of project (e.g., agile or waterfall), this may also involve tasks pinned on Kanban boards.
  • Activity-based tasks, including in dedicated task-based software such as Jira, Trello, etc.

This post describes the three main elements of tasks in Planner/Tasks (including via Teams), where the records are stored, and recordkeeping considerations.

An important point to consider while reading this post is whether you regard tasks in Planner (or Tasks by Planner for Teams) as records. If your answer is yes, then you will need to think about how these records will be managed.

(Thanks to the team at Office 365 for IT Pros for some of the detail in this post.)

What is Planner?

The Planner option in office.com

To quote from the e-book ‘Office 365 for IT Pros’, Microsoft Planner (also known as ‘Tasks by Planner and To Do’ in Teams) is ‘a lightweight task-oriented planning application’ that is based on membership of Microsoft 365 Groups.

The Planner app in Teams

While there is some functional similarity between Microsoft Project and Planner, organisations soon learn (or will need to learn) which one is most appropriate for their business needs. Based on my own experience:

  • MS Project is best for tracking activities and tasks for major projects.
  • Planner is useful for general group task assignment and tracking of those tasks.

What are the three main elements of tasks in Planner?

Every task in Planner has three main elements:

  • Data. The details of the task itself including the ‘bucket’ it belongs to, progress, priority, dates, notes and a checklist.
  • Attachments. This may include either uploaded documents or links. Two tasks cannot have the same attachment, for reasons explained below.
  • Comments. These are effectively ‘conversations’.

When a new task is added via Planner or Teams (Tasks by Planner for Teams) via the ‘+ Add task’ option, an end-user simply needs to enter the task name, set a due date (if required), and assign it (if required).

Adding a task

After the new task has been created, the end-user may click on the three dot menu to add a label, assign the task, copy it, copy a link to it, move it, or delete it. Note that deleting a task does NOT delete any attachments or comments.

Task 3-dot menu options

The end-user may also click on the name of the task, which offers the options shown below to add attachments or make comments.

What is stored where?

Task data

According to Office 365 for IT Pros, ‘Planner stores the metadata for plans, including information describing the tasks and buckets that make up each plan, in an Azure data service’. (The country in which your Planner data is stored depends on where your Microsoft 365 tenant’s data is located.)

The accessible metadata about each plan can be seen when the plan is exported to Excel.

  • Task ID (for example: QXkIWsgkqkO5rLu5pvfMhQgAEyXz)
  • Task Name
  • Bucket Name
  • Progress
  • Priority
  • Assigned To
  • Created By
  • Created Date
  • Start Date
  • Due Date
  • Late (true/false)
  • Completed Date
  • Completed By
  • Description (= Notes)
  • Completed Checklist Items
  • Checklist Items
  • Labels

As can be seen, the Plan metadata does not include or show references to attachments or comments. There is no way of knowing from the exported data if the task had any attachments or comments.
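The gap is easy to confirm by checking the export's column headings. The Python sketch below assumes the Excel export has been re-saved as CSV and that the headings match the list above (without the bracketed notes); the function names are hypothetical.

```python
import csv
import io

# Column headings as listed above; assumed to match the export exactly.
EXPORT_COLUMNS = [
    "Task ID", "Task Name", "Bucket Name", "Progress", "Priority",
    "Assigned To", "Created By", "Created Date", "Start Date", "Due Date",
    "Late", "Completed Date", "Completed By", "Description",
    "Completed Checklist Items", "Checklist Items", "Labels",
]


def read_export(stream):
    """Read a CSV copy of the plan export and return one dict per task."""
    reader = csv.DictReader(stream)
    missing = [c for c in EXPORT_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Export is missing columns: {missing}")
    return list(reader)


def columns_mentioning(term):
    """Return the export columns whose heading contains the given term."""
    return [c for c in EXPORT_COLUMNS if term.lower() in c.lower()]


if __name__ == "__main__":
    # Neither search finds a column, so the export cannot show
    # whether a task had attachments or comments.
    print(columns_mentioning("attachment"))
    print(columns_mentioning("comment"))
```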

Task attachments

Any task can have attachments or links to other content. When uploaded ‘from computer’, these attachments are not stored in Planner but in the Documents library of the Team’s SharePoint site (the ‘Files’ tab), at the same level as (public) channel folders, as described in detail below. There is no option to choose where they will be saved.

This can be quite confusing, especially as all attachments uploaded from a computer, for all tasks, may be stored in the same location without reference to the task. (This underlines the importance of saving the required attachments to the Teams channel Files tab first.)

In the example below, the Teams channel ‘New Sites’ has a plan named ‘New sites tasks’. A task (‘Does this seem right’) has been added with an attachment ‘ExamplePDFA’. (Note, the visual of the document is a check-box option; only one visual can be displayed if there are multiple attachments).

Example task with an attachment.

As noted already, if uploaded from a computer, an attachment is actually stored in the Documents library at the same level as the channel folders, which means it is not visible from the Files tab for the channel, as can be seen in the screenshot below.

The task attachments are NOT stored in the channel Files tab

To get to the task attachments from Teams you have two options:

  • Go to the ‘General’ channel, click on the ‘Files’ tab, then click on the ‘Documents’ option (to the left of ‘> General’). ALL attachments to ALL tasks for every channel in the entire Team are stored in this location. This needs to be kept in mind if anyone syncs the library to File Explorer as there is no indication that these attachments belong to a task in Planner.
  • By clicking on ‘Open in SharePoint’ and then navigating to the top of the Documents library as can be seen below.

In the same way that the task data exported to Excel does not show any reference to attachments, attachments uploaded from a computer (or, for that matter, attachments from Teams files) show no reference to the related task.
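One way to spot likely task attachments is to list the files that sit at the top level of the Documents library rather than inside a channel folder. The Python sketch below assumes you already have a list of file paths relative to the library root (for example from a library export); apart from ‘ExamplePDFA’ and the ‘New Sites’ channel, the file names and extensions are made up for illustration.

```python
from pathlib import PurePosixPath


def root_level_files(library_paths):
    """Return files stored at the top of the Documents library.

    Expects file paths (not folders) relative to the library root.
    Files inside a channel folder have at least two path parts.
    """
    return [p for p in library_paths if len(PurePosixPath(p).parts) == 1]


if __name__ == "__main__":
    paths = [
        "General/Agenda.docx",       # hypothetical file in the General channel folder
        "New Sites/Site list.xlsx",  # hypothetical file in the New Sites channel folder
        "ExamplePDFA.pdf",           # uploaded task attachment (extension assumed)
        "ExamplePDFA 1.pdf",         # second upload of the same attachment
    ]
    for path in root_level_files(paths):
        print(path)
```

This only identifies candidates; anything else saved directly at the top of the library will appear in the same list.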

From a retention point of view:

  • If retention labels have been applied to the Team’s folders in SharePoint, these labels will not apply to uploaded documents linked with tasks.
  • If a retention policy has been applied to the entire site, then these attachments will be deleted in line with that policy.

The following could happen:

  • Anyone with delete rights, not knowing why these uploaded documents exist, could simply delete them.
  • A member of the Team or Group could add more content to the library at the same level as the uploaded attachments, especially if they are working via File Explorer. (Keep in mind that a new channel is NOT created when a new folder is created in the library at the same level as the channel linked folders.)

Also, if the person who created or is editing the tasks ‘removes’ the document from the three dot menu next to an existing attachment, that attachment is not deleted from the library, which is why there are two documents titled ExamplePDFA above, one with the extra ‘ 1’.

Removing an attachment doesn’t delete it, adding to the potential confusion.

Although it may be difficult to enforce in reality, asking end-users to attach or create a link to a document already stored in a Teams Files tab is better practice.

Task Comments

Task Comments are threaded conversations that are captured in the Microsoft 365 Group’s mailbox. If the Team was created first, the M365 Group mailbox will not be visible to the end users in their Outlook client. However, they will receive a copy of the conversation in their normal inbox.

In the example task below, which was created in a Team with a visible Outlook mailbox, there is one initial comment to indicate the task was created, then two additional notes.

In the Outlook client, each of these added comments is visible as a thread ‘in reply’ to the original task.

Curiously, the copy that appears in the end-user’s Inbox also shows the retention period for all other Inbox emails. It is not clear if this retention policy will apply to the task conversations or not.

The header of the thread in the Inbox shows a retention policy, not visible in the one above.

Managing records in Planner/Tasks

Are tasks records?

If organisations decide that tasks are records, they will need to consider how they will be managed given:

  • The way that Planner stores task data, attachments, and comments separately. Planner task data is made visible via the Teams interface; it is not stored in Teams.
  • The ability for members of Teams to create multiple plans with multiple tasks with multiple uploaded attachments (all stored in the same location without reference to the task it relates to).
  • The fact that a Group/Team may create a range of different types of content, not just in Teams.
  • The inability to apply retention policies to tasks in Planner, while retention policies might affect uploaded attachments, Teams files or comments as conversations in Outlook.
  • The inability to close or archive a plan, or export all the content as a single entity.

At a minimum, all the task data could be exported to Excel and stored somewhere – perhaps even on the Team’s SharePoint site. The exported data will not include any attachments or comments (neither of which is referenced in the Excel export). One problem with this approach may be deciding when and if the task data is to be exported, and if the original plan should then be deleted – who is responsible?

If organisations decide that tasks are not records, they should still consider how to manage the various elements of each task and plan from a retention point of view.

  • At what point can a plan be deleted? Does the deletion need to be recorded somewhere?
  • What if the Team decides to delete it anyway? There is currently no information governance/retention coverage for Planner but attachments and comments (if any) may remain.

Perhaps the easiest approach is to regard Planner tasks as low-level working content, not really records, in the same way that tasks in the former Outlook were generally overlooked as being records.

Posted in In Place Records, Records management, Retention and disposal

Setting retention labels on folders in SharePoint document libraries

A common question asked by many organisations is whether Microsoft 365 (M365) retention policies – labels in particular – can be applied to folders in SharePoint document libraries so the content in those folders will have the same label.

The quick answer is yes, but it is a manual process and – for all its perceived benefits – is likely to be more of an administrative and support burden than it is worth. Folders should NOT be thought of as the replacement for ‘files’ (aggregations of individual records), but more like dividers in a lever arch file (= the document library).

This post describes how labels can be applied to and work with folders, including in SharePoint sites linked with Teams. It also suggests alternative options.

How retention labels are applied to a library

Retention labels are created in the M365 Compliance portal under either the Information Governance > Labels or Records Management > File Plan sections.

Where labels are created in the Information Governance section of the Compliance portal

Labels created in the Compliance portal do not do anything once created; they must be applied to content in various ways to make them work. This includes by:

  • Publishing one or more labels as part of a retention policy to various locations including SharePoint sites, Exchange mailboxes and OneDrive (but not Teams – see screenshot of available locations below). In this scenario, each label will be visible to – and selectable by – end-users.
  • Auto-applying them to the same locations based on various options, including (for E5) trainable classifiers.
  • Adding them to Content Types used in SharePoint Syntex.
Locations where labels can be published

Publishing labels to SharePoint sites

When one or more labels are published to SharePoint sites they don’t do anything until they are ‘manually’ enabled through one of the following options:

  • On each individual document library via the library settings option ‘Apply labels to items in this list or library’ (see screenshot below)
  • On each individual folder in a library (via the information panel, see screenshot below)
  • On each individual object in the library (also via the information panel)

Applying a label to the library

A label can be applied to the entire document library via Library Settings, as shown in the screenshot below.

Note that the ‘None’ option is shorter if no labels have been published here

If the drop-down option is set to ‘None’, and there are no options to choose from, it means that no labels have been published to this SharePoint site.

If labels exist, they will appear in the drop down list (below the default ‘None’). Note that only one label can be set as the default for the library. If the check box ‘Apply label to existing items in the library’ is selected, this will apply the label to all existing items. It will also likely override any existing label that may have been applied.

When the retention label has been applied to a library, the label only applies to the non-folder objects stored in the library as can be seen in the option below. That is, the retention label is NOT applied to the folders by default.

Document library folders without retention labels

The retention label can be seen when the folder is opened:

Content inside a folder, with retention labels applied

Applying a retention label to a folder or document

It is also possible to apply a retention label to a folder or object stored in the folder via the information panel, even when a default library label has been set, as shown below. This can be done on each individual folder including, for Teams-based sites, each folder that maps to a channel.

When applied to a folder in this way, any content stored in the folder will inherit that retention label.

Documents stored inside the June folder have inherited the folder’s label

If a default label has already been applied at the library level, the folder-based label will replace it. In testing this, however, one of the original default labels wasn’t replaced automatically (as shown below), although it could be changed manually via the information panel.
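The behaviour described in this section can be summarised as a small model. The Python sketch below is purely illustrative: it does not call any Microsoft 365 API, and the class and function names (`Library`, `Folder`, `effective_label`) and the label names are hypothetical. It assumes the expected behaviour described above, namely that a library default label applies to documents but not to folders, and that a label set on a folder takes precedence for the documents stored inside it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Folder:
    name: str
    label: Optional[str] = None  # label set on the folder via the information panel
    documents: list[str] = field(default_factory=list)


@dataclass
class Library:
    default_label: Optional[str] = None  # label set in Library Settings
    folders: list[Folder] = field(default_factory=list)
    root_documents: list[str] = field(default_factory=list)


def effective_label(library: Library, document: str) -> Optional[str]:
    """Return the label a document is expected to carry in this simplified model."""
    for folder in library.folders:
        if document in folder.documents:
            # A folder label, if set, replaces the library default for its contents.
            return folder.label or library.default_label
    if document in library.root_documents:
        return library.default_label
    raise KeyError(document)


library = Library(
    default_label="General - 7 years",
    folders=[
        Folder("June", label="Finance - 10 years", documents=["invoice.pdf"]),
        Folder("July", documents=["notes.docx"]),
    ],
    root_documents=["readme.txt"],
)

print(effective_label(library, "invoice.pdf"))  # folder label takes precedence
print(effective_label(library, "notes.docx"))   # falls back to the library default
```

The model deliberately ignores nested folders and the case, noted in testing, where an existing default label is not replaced automatically.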

Implications for Teams-based records (Files)

Every Team in MS Teams has an associated SharePoint site linked with the underlying Microsoft 365 Group.

Every non-private channel in the Team maps to a folder in the Documents library of the SharePoint site as can be seen in the two screenshots below. (Every private channel has a separate SharePoint site that would be covered by a separate retention policy).

The Team’s channels
Four channel-linked folders in a Teams-based SharePoint site

Keep in mind that retention labels remove the ability to delete objects stored in the library (including via the ‘Files’ tab in a Team). If end-users are working in Teams, this could be annoying and potentially put them off using Teams. However, end-users can remove the label by navigating to the SharePoint site and removing the label via the Information panel.

Why folder-based retention labels may not be a good idea

The default options to apply retention labels to content stored in SharePoint document libraries are:

  • By applying them at the library level. This can apply the label to all existing (and future) content stored in the library but does not apply to folders.
  • Through the auto-application of labels.
  • Via SharePoint Syntex using labels on Content Types.

Applying retention labels to individual folders in a document library is a labour-intensive manual process, one that may be a waste of time given the potential number of libraries that can exist and the ease with which end users can remove the labels.

Additionally, applying retention labels to the channel-linked folders of Teams may be pointless if end users:

  • Store documents at the same level as the channel-linked folders; that is, ‘above’ the folder structure.
  • Create new folders via a synced library or SharePoint. These folders are not linked to channels.
  • Create new libraries in the SharePoint site.

Keep it simple

It is very easy to deliberately or inadvertently establish over-complicated retention settings for content stored in SharePoint, especially as there is currently no simple way to see what label has been applied where.

Given the retention period linked with retention policies generally, there is a good chance that the person who applied the labels may not be around when the retention period expires, or to keep an eye on what has been applied or changed over time.

The best retention intentions may be overruled by practical necessity.

The best retention model, in my opinion, is a simple one that does not get in the way of end-users but ensures that records will be kept for a minimum period required. So, instead of applying retention labels to folders, especially on Teams-based SharePoint site libraries, it is recommended to:

  • Start by trying to avoid mixing content with different retention periods in the same SharePoint site or Team, or document library. That will make it easier to manage the retention outcomes. (If you can’t avoid mixing content, you may need to use auto-application of labels including via Syntex or trainable classifiers).
  • Use ‘back-end’ safety net retention policies applied to all SharePoint sites. This ensures a minimum retention period and does not get in the way of end-user activities.
  • Use retention labels on site libraries where more granular retention is required. Ideally, apply them as the default to all the content in a single document library (including the default library for all Teams-based SharePoint sites) and – preferably – only apply the labels when the content is inactive and the library can be made read only, to protect the records from that point.
  • Only use multiple labels on folders when (a) all the labels applied to the site relate to the same function/activity pair or subject matter, and (b) the content is largely inactive. Ideally, avoid folder-based retention to avoid complication in the future.

Posted in Classification, Governance, Information Management, Microsoft 365, Records management, Retention and disposal

Classifying records in Microsoft 365

There are three main options in Microsoft 365 to apply recordkeeping classification terms to (some) records:

  • Metadata columns added to SharePoint sites, including those added to Content Types and/or added directly to document libraries.
  • Taxonomy terms stored in the central Term Store, including those added as site columns and added to site content types and/or added directly to document libraries. The only difference with the first option is that with the Term Store the classification terms are stored and managed centrally and are therefore available to every SharePoint site.
  • Retention labels that: (a) ‘map’ to classification terms; (b) are linked with a File Plan that includes the classification terms; (c) are either the same as (a) or (b) and are used with a Document Understanding Model in SharePoint Syntex; or (d) are the same as (a) or (b) and used in conjunction with Trainable Classifiers.

The first two options can only be applied to content stored in SharePoint. Retention labels may be applied to emails and OneDrive content. None of the three options can be applied to Teams chats. Also note that there is no connection between the SharePoint Term Store and the File Plan, both of which can be used to store classification terms.

This post:

  • Defines the meaning of classification from a recordkeeping point of view.
  • Describes each of the above options and their limits.
  • Discusses the requirement to classify records and other options in Microsoft 365.

What is classification?

Humans are natural-born classifiers. We see it in the way we store cutlery or linen, or other household items or personal records.

Business records also need some form of classification. But what does that mean? The 2002 version of the records management standard, ISO 15489, defines classification as:

‘the systematic identification and arrangement of business activities and/or records into categories according to logically structured conventions, methods and procedural rules represented in a classification system’. (ISO 15489.1 2017 clause 3.5).

The standard also states (4.2.1) that a classification scheme based on business activities, along with a records disposition authority and a security and access classification scheme, were the principal instruments used in records management operations.

The classification of records in business is important to establish their context and to help find them.

Microsoft 365 includes various options to apply classification terms to records.

Metadata columns in SharePoint

The simplest way to classify records stored in SharePoint document libraries is to either create site columns containing the classification terms and add those columns to document libraries, or create them directly in those libraries.

Benefits

Adding site or library columns is relatively simple. As classification terms are usually in the form of a (hierarchical) list, it is simple to add one choice or lookup column for function and another for activities.

A lookup column can bring across a value from another column when an item is selected; for example, if the lookup list places ‘Accounting’ (Activity) in the same list row as ‘Financial Management’ (Function), selecting ‘Accounting’ will bring across ‘Financial Management’ as a separate (linked) column.

Default values (or even one value) can be set meaning that records added to a library (that only contains records with those classification terms) can be assigned the same classification terms each time without user intervention.
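The lookup and default-value behaviour described above can be illustrated with a minimal Python sketch. It is not SharePoint code: the dictionary simply stands in for the lookup list, the function name `classify` is hypothetical, and apart from the ‘Accounting’ / ‘Financial Management’ pair taken from the example above, the rows are made up.

```python
from typing import Optional

# A stand-in for the lookup list: each activity row also holds its parent function.
LOOKUP_LIST = {
    "Accounting": "Financial Management",
    "Budgeting": "Financial Management",  # hypothetical row
    "Recruitment": "Personnel",           # hypothetical row
}


def classify(activity: Optional[str] = None, default: str = "Accounting") -> dict:
    """Selecting an activity brings its function across as a linked value.

    If no activity is chosen, the library's default value is used instead.
    """
    chosen = activity or default
    return {"Function": LOOKUP_LIST[chosen], "Activity": chosen}


print(classify("Recruitment"))  # user picks an activity
print(classify())               # no user input: the default value applies
```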

Negatives

SharePoint choice or lookup columns do not allow for hierarchical views or values to be displayed from the list view so the context for the classification terms may not be obvious unless both function and activity are listed.

The Term Store

The Term Store, also known as the Managed Metadata Service (MMS), has existed for at least a decade as an option to create and centrally manage classification and taxonomy terms in SharePoint only.

In 2020, access to the Term Store was re-located from its previous location (https://tenantname-admin.sharepoint.com/_layouts/15/TermStoreManager.aspx) to the SharePoint Online admin portal under the ‘Content Services’ section:

The location of the Term Store in the SharePoint admin center

Organisations can create multiple sets of taxonomies or ‘term groups’ (e.g., ‘BCS’ or ‘People’) within the Term Store. Each Term Group consists of the following:

  • Term Sets. These generally could map to a business function. Each Term Set has a name and description, and four tabs with the following information: (a) General: Owner, Stakeholders, Contact, Unique ID (GUID); (b) Usage settings: Submission policy, Available for tagging, Sort order; (c) Navigation: Use term set for site navigation or faceted navigation – both disabled by default; (d) Advanced: Translation options, custom properties.
  • Terms. These generally could map to an activity. Each Term has a name and three tabs: (a) General: Language, translation, synonyms and description; (b) Usage settings: Available for tagging, Member of (Term Set), Unique ID (GUID); (c) Advanced: Shared custom properties, Local custom properties.

In the example below, the Term Set (function) of ‘Community Relations’ has three Terms (activities).

Once they have been created in the Term Store, term sets or terms can be added to a SharePoint site, either as a new site column or as a local library/list column, as shown in the two screenshots below:

First, create a new column and select Managed Metadata

Then scroll down to Term Set Settings and choose the term set to be used.

Once added as a site column, the new column may be added to a Content Type that is added to a library, or directly on the library or list.

A Term Store-based column added to a library via a Content Type.

Benefits

The primary benefit of using the central Term Store terms via a Managed Metadata column is that the Term Store is the ‘master’ classification scheme, providing consistency in classification terms for all SharePoint sites.

As we will see below, Term Store terms may be used to help with the application of retention labels (which themselves may ‘map’ to classification terms in a function/activity-based retention schedule).

Negatives

Using metadata terms from the Term Store is almost identical with using a choice or lookup column. The only real difference is that the Term Store provides a ‘master’ and consistent list of classification (and other) terms.

Term store classification terms, including in Content Types, may only be used on a minority of SharePoint sites.

Additionally:

  • It is not possible to select a Term Set (e.g., the function level), only a Term within a Term Set.
  • Only the selected classification Term appears in the library metadata, without the parent Term Set or visual hierarchy reference to that Term Set – see screenshot below. Technically only that Term is searchable. It is not possible to view a global listing of all records classified according to function and activity.
  • If multiple choices are allowed, a record may be classified according to more than one Term. This may cause issues with grouping, sorting or filtering the content of a library in views.
How the Term appears in the library column.

As we will see below, there is no connection between the classification Terms in Term Sets and the categorisation options available when creating new retention labels via a File Plan. ‘Business Function’ or ‘Category’ choices in the File Plan do not connect with the Term Store.

Term Store terms and Content Types can only be used to classify content stored in SharePoint.

Retention labels

Retention labels in Microsoft 365 can be used in an indirect way to classify records in SharePoint, email and OneDrive because they can be ‘mapped’ to classification elements.

For example, a label may be based on the following elements:

  • Function: Financial Management
  • Activity: Accounting
  • Description: Accounting records
  • Retention: 7 years

Every retention label contains the following options:

  • Name. The name can provide simple details of the classification, for example: ‘Financial Management Accounting – 7 years’.
  • Description for users. This can be the full wording of the retention class.
  • Description for admins. This can contain details of how to apply or interpret the class, if required.
  • Retention settings (e.g., 7 years after date created/modified or label applied).
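As a rough illustration of how the classification elements can be ‘mapped’ into a label, the following Python sketch builds a label name from the example elements above and works out when a retention period would end. It is a conceptual model only: `RetentionClass` and its methods are hypothetical names, and the calendar-year arithmetic is an assumption rather than a description of how Microsoft 365 calculates retention dates.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionClass:
    function: str
    activity: str
    description: str
    retention_years: int

    def label_name(self) -> str:
        """Build a label name that carries the classification terms."""
        return f"{self.function} {self.activity} - {self.retention_years} years"

    def retention_ends(self, start: date) -> date:
        """End of the retention period, counted from a start date
        (for example date created, date modified or date labelled)."""
        try:
            return start.replace(year=start.year + self.retention_years)
        except ValueError:
            # 29 February with no matching day in the target year.
            return start.replace(year=start.year + self.retention_years, day=28)


accounting = RetentionClass(
    "Financial Management", "Accounting", "Accounting records", 7
)
print(accounting.label_name())
print(accounting.retention_ends(date(2021, 3, 15)))
```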

Benefits

Where the classification terms map to a retention class, the process of applying a retention label to an individual record, email or OneDrive content could potentially be seen as classifying those records against the classification scheme.

The Data Classification section in the Microsoft 365 Compliance portal provides an overview of the volume of records in SharePoint, OneDrive or Exchange that have a specific retention class:

Negatives

Not every record in every SharePoint document library may be subject to a retention label. Many records (for example in Teams-based SharePoint sites) may be subject to a ‘back end’ retention policy applied to the entire site (which creates a Preservation Hold library).

A retention label applied to a record doesn’t actually add any classification terms to the record.

Retention labels don’t map in any way to Term Store classification terms, except in SharePoint Syntex – see below (but this only applies to SharePoint content).

Retention labels/File Plan combination

The File Plan option (Records Management > File Plan, requires E5 licences) can also be used to add classification terms to a retention label as shown in the screenshot below. Note that there is no link with the Term Store.

Benefits

Records (including emails) that have been assigned a retention label could, in theory, be regarded as having been classified in this way because the label contains (or references) the classification terms.

Negatives

When applied to content in SharePoint, OneDrive or Exchange, retention labels linked with the File Plan do not show the File Plan classification terms. It may be possible to write a script that displays all records with the terms from the File Plan, but it may be easier to do this using the Data Classification option described above.

Retention labels/SharePoint Syntex combination

SharePoint Syntex provides a way to apply retention labels to records, stored in SharePoint, that have been identified through the Document Understanding Model process.

Benefits

As can be seen in the screenshot above, each new DU model allows similar types of records (in the example above, ‘Statements of Work’) to be associated with a new or existing Content Type that can include a Term Store Term – for SharePoint records only – and a retention label. This provides three types of ‘classification’:

  • Grouping by record type (e.g., Statement of Work, Invoice)
  • Linking (of sorts) between the records ‘classified’ in this way and a Term Store term added as a metadata column to the Content Type.
  • Assigning of a retention label. This provides the same form of retention label-based classification described above.

Furthermore, if the Extraction option is also used, data extracted to SharePoint columns can be based on choices listed in the Term Store metadata.

Negatives

SharePoint Syntex only works for records – and only those records that have some form of consistency – stored in SharePoint.

Retention labels/trainable classifiers combination

Trainable classifiers are another way that could be used to identify related records and apply a retention label to those records. Microsoft 365 includes six ‘out of the box’ trainable classifiers that will not be of much value to records managers for the classification of records:

  • Source code
  • Harassment
  • Profanity
  • Threat
  • Resume
  • Offensive language (to be deprecated)

The creation of new trainable classifiers requires an E5 licence; they are created through the Data Classification area of the Microsoft 365 Compliance admin portal. Machine Learning is used to identify related records to create the trainable classifiers.

Once created, a retention label may be auto-applied to content stored in SharePoint or Exchange mailboxes using the classifier.

The option to auto-apply a label based on a trainable classifier.

Benefits

The primary outcome (from a recordkeeping classification point of view) of using trainable classifiers is the application of a retention label to content stored in SharePoint and Exchange mailboxes. It can also be used to apply a sensitivity label to that content.

Negatives

It is unlikely that every record will be classified according to every classification option.

Trainable classifiers only work with SharePoint and Exchange mailboxes.

Classifying records per workload

The options are summarised below for each main workload:

  • SharePoint: Use local site or library columns, Term Store terms or retention labels (mapped to a File Plan as necessary), applied manually or automatically, including via SharePoint Syntex or trainable classifiers.
  • Exchange mailboxes: The only feasible option to classify these records is to manually or auto-apply retention labels that are mapped to a classification, including via a trainable classifier.
  • OneDrive: Manually or auto-apply retention labels mapped to a classification.
  • Teams: It is not possible to classify Teams chats with the options available.

Is classification necessary?

The classification model described in ISO 15489 and other standards was based on the idea that records would be stored in a central recordkeeping system where they would be subject to and tagged by the terms contained in a classification scheme, often applied at the aggregation level (e.g., a file).

Microsoft 365 is not a recordkeeping system but a collection of multiple applications that may create or capture records, primarily in Exchange mailboxes, SharePoint, OneDrive and MS Teams (and also Yammer).

There is no central option to classify records in the recordkeeping sense. The closest options are:

  • The grouping of records in SharePoint sites (and Teams, each of which has a SharePoint site) and libraries that map to business functions and activities.
  • The use of metadata, either terms set in the central Term Store or created in local sites/libraries, to ‘classify’ individual records (including emails) stored in SharePoint document libraries. Each item in the library might have a default classification, or could be classified differently.
  • The use of retention labels that ‘map’ to function/activity pairs in a records disposal authority/schedule. These may be applied, manually or automatically, to content stored in SharePoint, OneDrive and Exchange mailboxes.

None of the above may apply, or be applied consistently, to all SharePoint sites, Exchange mailboxes or OneDrive accounts. And none can be applied to Teams chats.

A different approach to this problem is required, one that will likely involve greater use of Artificial Intelligence (AI) and Machine Learning (ML) methods to identify and enable the grouping of records, and provide visualisations of the records so-classified.

Image: Werribee Mansion, Victoria, Australia stairwell (Andrew Warland photo)

Posted in Conservation and preservation, Digital preservation, Electronic records, Records management, Retention and disposal

The challenge of identifying born-digital records

A recent ‘functional and efficiency’ review into the National Archives of Australia (also known as the ‘Tune Review’, published on 30 January 2021) noted the ‘rapid and ever-evolving challenges of the digital world’.

It stated that ‘the definition of a ‘record’ needs to reflect current international standards, be more directly applied to digital technologies, and more clearly provide for direct capture of records that are susceptible to deletion, such as emails, texts or online messages’.

The review also highlighted the difficulties associated with ingesting digital records ‘via manual intensive activities (due to lack of interoperable systems)’ and proposed a new model based on the ‘continuous automated appraisal of [Agency] digital records that would require a combination of artificial intelligence and skilled archivists’.

The review underlined the challenges of identifying and managing born-digital records, and the need for better solutions.

This post explores the challenges of accurately identifying born-digital records in order to manage them.

Identifying and protecting records

Records usually provide evidence of something that happened – an action, an activity or process, a decision, or a current state (including a photograph or video record). They may have or be associated with descriptive metadata used to provide context to the records and guide or determine retention.

Like all other types of evidence, the authenticity, integrity and reliability of records should be protected for as long as they must be kept.

In the paper world, this outcome was achieved by storing physical records (including the printed version of born-digital records) on paper files or in physical storage spaces.

For the past twenty years or so, this outcome was achieved for (some) digital records by (mostly manually) copying them from a network drive or email system (or via a connector) to a dedicated electronic records management (ERM) system and then ‘locking’ them in that system to prevent unauthorised change or deletion. Most ERM systems consisted of a database for the metadata and an associated network drive file store for the objects.

The main problem with this centralised storage model – however good it might be at protecting copies of records stored in it – was that the original versions, along with all the other records that were not identified or could not be copied to the ERMS, remained where they were created or captured.

And the records stored ‘in’ the ERMS were actually stored on a network file share on a server that was (a) accessible to IT, and (b) almost always backed up. So, yet more copies existed.

The challenge of born-digital records

There are several key challenges with born-digital records:

  • Consistently and accurately identifying (or ‘declaring’) all records in all formats created or captured in all locations. For too long, the focus has primarily been on emails and anything that can be saved to a network drive with the onus of identifying a record on end-users.
  • Ensuring their authenticity, reliability and integrity over time. For records stored in the ERMS, this has usually involved locking them from edit, including through the ‘declaration’ process, or preventing deletion. But in almost all cases, the original version (in email, on the network drives), could continue to be modified. Other records that were not identified or stored in an ERMS may be deleted.
  • Ensuring that born-digital records will remain accessible for as long as they are required.

It is not possible to consistently and accurately manually (or even automatically) identify every born-digital record that an organisation creates or captures to ensure their authenticity, reliability, integrity or accessibility over time. Only a small percentage of born-digital records are copied to an ERMS.

Records remain hidden in personal mailboxes, personal drives and third-party (often unauthorised) systems. Records may exist in multiple forms and formats, sometimes created or stored in ‘private’ systems or on social media platforms. They may take the form of text or instant messages or social networking posts and threads. They may be drawings, images, voice or video recordings.

Even if a record is identified, it is not always possible to save it to an ERMS. Text or instant messages on mobile devices are a case in point that has been a problem for at least two decades. More recent examples include chat messages, reactions (emojis, comments), and recordings of online meetings.

And even if a high percentage of born-digital records could be stored in the ERMS, the original versions will almost always remain where they were created or captured.

A different approach is needed.

Triaging records?

One approach to the problem would be to accept that not all records have equal value. That is, not all records need to be managed the same way.

To some degree, this way of thinking is already reflected in the classes that make up records retention schedules and the attention paid to each:

  • Records that have permanent or archival value and need to be transferred to archival institutions.
  • Specific types of records that must be created or kept by the organisation for minimum periods (sometimes quite long but not ‘forever’), for legal, compliance or auditing purposes.
  • Records that are not subject to legal or compliance requirements but which the organisation decides to keep for a minimum period of time.
  • Everything else.

Triaging records means that they can be managed as required at each level, but nothing is missed. It requires a risk management approach.

For records that have permanent value, or are subject to legal or compliance requirements, it means ensuring that these records receive the most attention and that every effort is made to ensure that they are, and can be, identified (declared) and managed accordingly. This would include ensuring that it is possible to identify and capture these records in the systems used to create or capture them, for example, key emails.

A similar approach would be taken to records that need to be kept for legal, compliance or auditing purposes but with an understanding that some of these records (e.g., emails) may remain in the original system where they were created or captured. Technological solutions may be used to identify or tag these records. The destruction of these records should be subject to some form of review and a record kept of the approval and what was destroyed.

All other records would remain stored wherever they were created or captured, subject to minimum retention periods after which they can be destroyed without review, but with a record kept of the basic metadata of each record (including original storage location).
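The triage idea can be expressed as a small decision rule. The following is a minimal sketch only: the tier names, the three yes/no questions and the handling flags are assumptions made for illustration, and the treatment of records kept by organisational decision is a guess, because the approach above does not spell it out.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PERMANENT = "permanent or archival value"
    REGULATED = "legal, compliance or auditing requirement"
    ORGANISATIONAL = "kept by organisational decision"
    EVERYTHING_ELSE = "everything else"


@dataclass(frozen=True)
class Handling:
    declare: bool                    # actively identify (declare) the record
    review_before_destruction: bool  # destruction needs review and approval
    log_destruction_metadata: bool   # keep basic metadata of what was destroyed


# Assumed mapping for illustration; not a prescribed set of rules.
HANDLING = {
    Tier.PERMANENT: Handling(True, True, True),
    Tier.REGULATED: Handling(True, True, True),
    Tier.ORGANISATIONAL: Handling(False, False, True),
    Tier.EVERYTHING_ELSE: Handling(False, False, True),
}


def triage(archival_value: bool, legal_requirement: bool,
           organisational_need: bool) -> Tier:
    """Return the highest tier that applies, so that nothing is missed."""
    if archival_value:
        return Tier.PERMANENT
    if legal_requirement:
        return Tier.REGULATED
    if organisational_need:
        return Tier.ORGANISATIONAL
    return Tier.EVERYTHING_ELSE


tier = triage(archival_value=False, legal_requirement=True,
              organisational_need=False)
print(tier.name, HANDLING[tier])
```

Every record falls into exactly one tier, and even the lowest tier keeps a trace of what was destroyed.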

Protecting – or proving – the authenticity, integrity and reliability of records

The assumption behind the protection of records is that they should not be changed or deleted.

The reality, with digital records, is that they may change at any time through new threads, new revisions, new chats, or even through photoshopping.

A more realistic approach may be to use information about what was changed, by whom, and when – not to protect the record but to provide an evidentiary trail to prove what it is or was. The ‘smoking gun’ evidence for most born-digital records is the metadata that is recorded when it was captured or modified, not (necessarily) the added descriptive metadata.

For example:

  • Someone may author a document (metadata records each revision, and each revision can be viewed).
  • The document may be approved electronically (recorded in metadata).
  • Someone then modifies the approved version.
  • All of the above is recorded in the ‘modified’, ‘modified by’ and approval metadata.
  • The metadata should (or may) also record who viewed the record, and when.

EXIF metadata stored on images provides a similar form of evidence (and may even include GPS information).
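The document example above can be sketched as a simple event trail. This is an illustration only: the class names, the action names and the sample users are hypothetical, and real systems record this metadata in their own formats.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Event:
    action: str      # assumed values: "created", "modified", "approved", "viewed"
    user: str
    when: datetime


@dataclass
class RecordTrail:
    name: str
    events: list = field(default_factory=list)

    def log(self, action: str, user: str, when: datetime) -> None:
        self.events.append(Event(action, user, when))

    def modified_after_approval(self) -> list:
        """Return every modification made after the first approval."""
        approved_at = None
        changes = []
        for event in sorted(self.events, key=lambda e: e.when):
            if event.action == "approved" and approved_at is None:
                approved_at = event.when
            elif event.action == "modified" and approved_at is not None:
                changes.append(event)
        return changes


trail = RecordTrail("Policy.docx")
trail.log("created", "author", datetime(2021, 3, 1, 9, 0))
trail.log("modified", "author", datetime(2021, 3, 2, 10, 0))
trail.log("approved", "manager", datetime(2021, 3, 3, 11, 0))
trail.log("modified", "someone", datetime(2021, 3, 4, 12, 0))
trail.log("viewed", "auditor", datetime(2021, 3, 5, 13, 0))

for change in trail.modified_after_approval():
    print(change.user, change.when.isoformat())
```

The trail does not prevent the change to the approved version; it makes the change, the person and the time visible afterwards.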

Which record is more likely to be accepted as evidence:

  • A record stored in an EDRMS, versions or revisions of which may exist in multiple other places, including on network file shares, email systems and even backup tapes; or
  • A record stored in a system that shows the full set of metadata about access and changes, or the most recent thread of an email discussion?

Conclusions

At the end of the day, it should be possible to confirm the authenticity, reliability and integrity of records based on information/metadata that forms part of the born-digital record: who created it and when, the context in which it was created and its relationship with other records.

Perhaps, instead of focussing on trying to identify and capture all born-digital objects that might be records and ‘protecting’ a version of that record, it may be more practical and easier to leave most records where they were created or captured (and retained by retention policies) and use change or revision metadata to provide evidence of authenticity.

This may, in the end, be a much easier way to protect the authenticity of records than having to rely on manual identification or declaration.

Posted in Artificial Intelligence, Classification, Electronic records, Information Management, Microsoft 365, Records management, Retention and disposal

Can Microsoft technology classify records better than a human?

In late 2012, IDM magazine published an article I co-authored with Umi Asma Mokhtar in Malaysia titled ‘Can technology classify records better than a human?’

The article drew on research into recent advances in technology to assist in legal discovery, known as ‘computer-assisted coding’, or ‘predictive coding’, including the following two articles:

Grossman and Cormack’s article noted that ‘a technology-assisted review process involves the interplay of humans and computers to identify the documents in a collection that are responsive to a production request, or to identify those documents that should be withheld on the basis of privilege’. By contrast, an ‘exhaustive manual review’ required ‘one or more humans to examine each and every document in the collection, and to code them as responsive (or privileged) or not’.

The article noted, somewhat gently, that ‘relevant literature suggests that manual review is far from perfect’.

Peck’s article contained similar conclusions. He also noted how computer-based coding was based on an initial ‘seed set’ of documents identified by a human; the computer then identified the properties of those documents and used that to code other similar documents. ‘As the senior reviewer continues to code more sample documents, the computer predicts the reviewer’s coding’ (hence predictive coding).

By 2011, this new technology was challenging old methods of manual review and classification. Despite some scepticism and slow uptake (for example, see this 2015 IDM article ‘Predictive Coding – What happened to the next big thing?‘), by 2021, it had become an accepted option to support discovery, sometimes involving offshore processing for high volumes of content.

Meanwhile, in an almost unnoticed part of the technology woods, Microsoft acquired Equivio in January 2015. In its press release ‘Microsoft acquires Equivio, provider of machine learning-powered compliance solutions‘, Microsoft stated that the product:

‘… applies machine learning … enabling users to explore large, unstructured sets of data and quickly find what is relevant. It uses advanced text analytics to perform multi-dimensional analyses of data collections, intelligently sorting documents into themes, grouping near-duplicates, isolating unique data, and helping users quickly identify the documents they need. As part of this process, users train the system to identify documents relevant to a particular subject, such as a legal case or investigation. This iterative process is more accurate and cost effective than keyword searches and manual review of vast quantities of documents.’ 

It added that the product would be deployed in Office 365.

Classifying records

The concept of classification for records was defined in paragraph 7.3 of part 1 of the Australian Standard (AS) 4390, released in 1996. The standard defined classification as:

‘… the process of devising and applying schemes based on the business activities generating records, whereby they are categorised in systematic and consistent ways to facilitate their capture, retrieval, maintenance and disposal. Classification includes the determination of naming conventions, user permissions and security restrictions on records’.

The definition provided a number of examples of how the classification of business activities could act as a ‘powerful tool to assist in many of the processes involved in the management of records, resulting from those activities’. This included ‘determining appropriate retention periods for records’.

The only problem with the concept was the assumption that all records could be classified in this way, in a singular recordkeeping system. Unless they were copied to that system, emails largely escaped classification.

Fast forward to 2020

Managing all digital records according to recordkeeping standards has always been a problem. Electronic records management (ERM) systems managed the records that were copied into them, but a much higher percentage remained outside their control: in email systems, network file shares and, increasingly over the past 10 years, on a host of alternative systems including third-party and social media platforms.

By the end of 2019, Microsoft had built a comprehensive single ecosystem to create, capture and manage digital content, including most of the records that would have been previously consigned to an ERMS. And then COVID appeared and working from home became common. All of a sudden (almost), it had to be possible to work online. Online meeting and collaboration systems such as Microsoft Teams took off, usually in parallel with email. Anything that required a VPN to access became a problem.

2021 – Automated classification for records (maybe)

The Microsoft 365 ecosystem generated a huge volume of new content scattered across four main workloads – Exchange/Outlook, SharePoint, OneDrive and Teams. A few other systems such as Yammer also added to the mix.

Most of this information was not subject to any form of classification in the recordkeeping sense. The Microsoft 365 platform included the ability to apply retention policies to content but there was a disconnect between classification and retention.

Microsoft announced Project Cortex at Ignite in 2019. According to the announcement, Project Cortex:

  • Uses advanced AI to deliver insights and expertise in the apps that are used every day, to harness collective knowledge and to empower people and teams to learn, upskill and innovate faster.
  • Uses AI to reason over content across teams and systems, recognizing content types, extracting important information, and automatically organizing content into shared topics like projects, products, processes and customers.
  • Creates a knowledge network based on relationships among topics, content, and people.

Project Cortex drew on technological capabilities present in Azure’s Cognitive Services and the Microsoft Graph. It is not known to what extent the Equivio product, acquired in 2015, was integrated with these solutions but, from all the available details, it appears the technology is at least connected in one way or another.

During Ignite 2020, Microsoft announced SharePoint Syntex and trainable classifiers, either of which could be deployed to classify information and apply retention rules.

Trainable classifiers

Trainable classifiers were made generally available (GA) in January 2021.

Trainable classifiers sound very similar to the predictive coding capability that appeared from 2011. However, they:

  • Use the power of Machine Learning (ML) to identify categories of information. This is achieved by creating an initial ‘seed’ of data in a SharePoint library, creating a new trainable classifier and pointing it at the seed, then reviewing the outcomes. More content is added to ensure accuracy.
  • Can be used to identify similar content in Exchange mailboxes, SharePoint sites, OneDrive for Business accounts, and Microsoft 365 Groups and apply a pre-defined retention label to that content.

In theory, this means it might be possible to identify a set of similar records – for example, financial documents – and apply the same retention label to them. The Content Explorer in the Compliance admin portal will list the records that are subject to that label.
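The seed-based idea can be illustrated with a toy example. This is a sketch under stated assumptions, not how Microsoft's trainable classifiers work internally: the word-weight scoring, the threshold, the function names and the label name are all hypothetical and chosen only to show how a human-labelled seed drives later decisions.

```python
import re
from collections import Counter


def tokens(text: str) -> set:
    """Lower-case the text and return its distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def train(seed: list) -> dict:
    """seed is a list of (text, is_match) pairs labelled by a human reviewer.

    Each word's weight is the number of matching seed documents containing it
    minus the number of non-matching seed documents containing it.
    """
    positive, negative = Counter(), Counter()
    for text, is_match in seed:
        (positive if is_match else negative).update(tokens(text))
    return {word: positive[word] - negative[word]
            for word in set(positive) | set(negative)}


def score(weights: dict, text: str) -> int:
    return sum(weights.get(word, 0) for word in tokens(text))


def label_for(weights: dict, text: str, threshold: int = 1):
    """Return a hypothetical retention label if the text scores high enough."""
    return "Financial records" if score(weights, text) >= threshold else None


seed = [
    ("Invoice for consulting services, total amount due", True),
    ("Purchase order and invoice payment terms", True),
    ("Team lunch menu for Friday", False),
]
weights = train(seed)

print(label_for(weights, "Invoice payment overdue"))
print(label_for(weights, "Friday lunch order"))
```

Changing which seed documents are marked as matches changes the weights, and therefore which content later receives the label.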

SharePoint Syntex

SharePoint Syntex was announced at Ignite in September 2020 and made generally available in early 2021.

The original version of Syntex (as part of Project Cortex) was targeted at the ability to extract metadata from forms, a capability that has existed with various other scanning/OCR products for at least a decade. The capability that was released in early 2021 included the base metadata extraction capability as well as a broader capability to classify content and apply a retention label.

The two Syntex capabilities, described in a YouTube video from Microsoft titled ‘Step-by-Step: How to Build a Document Understanding Model using Project Cortex‘, are:

  • Classification. This capability involves the following steps: (a) Creation of (SharePoint site) Content Center; (b) Creation of a Document Understanding Model (DUM) for each ‘type’ of record; the DUM can create a new content type or point to an existing one; the DUM can also link with the retention label to be applied; (c) Creation of an initial seed of records (positives and a couple of negatives); (d) Creation of Explanations that help the model find records by phrase, proximity, or pattern (matching, e.g., dates); (e) Training; (f) Applying the model to SharePoint sites or libraries. The outcome of the classification is that matching records in the location where it is pointed are assigned to the Content Type (replacing any previous one) and tagged with a retention label (also replacing any previous one).
  • Extraction. This capability has similar steps to the classification option except that the Explanations identify what metadata is to be extracted from where (again based on phrase, proximity or pattern) to what metadata column. The outcome of extraction is that the matching records include the extracted metadata in the library columns (in addition to the Content Type and retention label).
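The three kinds of Explanation (phrase, proximity and pattern) can be pictured with ordinary text matching. The sketch below is an analogy only and does not use or reproduce any Syntex API: the function names, the sample text, the date pattern and the column name are assumptions for illustration.

```python
import re


def has_phrase(text: str, phrase: str) -> bool:
    """Phrase: the exact words appear somewhere in the text."""
    return phrase.lower() in text.lower()


def find_pattern(text: str, regex: str) -> list:
    """Pattern: return every piece of text with the given shape, e.g. a date."""
    return re.findall(regex, text)


def within_proximity(text: str, first: str, second: str, max_gap: int = 5) -> bool:
    """Proximity: two words occur within max_gap words of each other."""
    words = re.findall(r"\w+", text.lower())
    first_at = [i for i, w in enumerate(words) if w == first.lower()]
    second_at = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_at for j in second_at)


text = "Invoice number 4471 issued on 12/03/2021 with total due 950.00"
DATE = r"\b\d{2}/\d{2}/\d{4}\b"

# Classification: do the explanations suggest this is an invoice?
is_invoice = has_phrase(text, "invoice number") and within_proximity(text, "total", "due", 3)

# Extraction: copy the matched value into a (hypothetical) metadata column.
columns = {"IssueDate": find_pattern(text, DATE)[0]} if is_invoice else {}
print(is_invoice, columns)
```

In this analogy the same matching rules serve both purposes: they help decide what a document is, and they locate the values to copy into library columns.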

As with trainable classifiers, Syntex uses Machine Learning to classify records, but Syntex also has the ability to extract metadata. Syntex can only classify or extract data from SharePoint libraries.

Trainable classifiers or Syntex?

Both options require the organisation to create an initial seed of content and to use Machine Learning to develop an understanding of the content, in order to classify it.

The models are similar; the primary difference is that trainable classifiers can work on content stored in email, SharePoint and OneDrive, whereas Syntex is currently restricted to SharePoint.

Predictive coding

On 18 March 2021, Microsoft announced the pending (April 2021) preview release of an enhanced predictive coding module for advanced eDiscovery in Microsoft 365.

The announcement, pointing to this roadmap item, noted that eDiscovery managers would be able to create and train relevance models within Advanced eDiscovery using as few as 50 documents, to prioritize review.

So, can Microsoft technology classify records better than humans?

In their 1999 book ‘Sorting Things Out: Classification and its Consequences‘ (MIT Press), Geoffrey Bowker and Susan Leigh Star noted that ‘to classify is human’ and that classification was ‘the sleeping beauty of information science’ and ‘the scaffolding of information infrastructures’.

But they also noted how ‘each standard and category valorizes some point of view and silences another. Standards and classifications (can) produce advantage or suffering’ (quote from review in link above).

Technology-based classification in theory is impartial. It categorises what it finds through machine learning and algorithms. But technology-based classification requires human review of the initial and subsequent seeds. Accordingly, such classification has the potential to be skewed by the reviewer’s biases or predilections, such as the selection of one set of preferred or ‘matching’ records over another.

Ultimately, a ‘match’ is based on a scoring ‘relevancy’ algorithm. Perhaps the technology can classify better than humans, but whether the classification is accurate may depend on the human making accurate, consistent and impartial decisions.

Either way, the manual classification of records is likely to go the same way as the manual review of legal documents for discovery.

Image source: Providence Public Library Flickr