Posted in Conservation and preservation, Digital preservation, Electronic records, Records management, Retention and disposal

The challenge of identifying born-digital records

A recent ‘functional and efficiency’ review into the National Archives of Australia (also known as the ‘Tune Review’, published on 30 January 2021) noted the ‘rapid and ever-evolving challenges of the digital world’.

It stated that ‘the definition of a ‘record’ needs to reflect current international standards, be more directly applied to digital technologies, and more clearly provide for direct capture of records that are susceptible to deletion, such as emails, texts or online messages’.

The review also highlighted the difficulties associated with ingesting digital records ‘via manual intensive activities (due to lack of interoperable systems)’ and proposed a new model based on the ‘continuous automated appraisal of [Agency] digital records that would require a combination of artificial intelligence and skilled archivists’.

The review underlined the challenges of identifying and managing born-digital records, and the need for better solutions.

This post explores the challenges of accurately identifying born-digital records in order to manage them.

Identifying and protecting records

Records usually provide evidence of something that happened – an action, an activity or process, a decision, or a current state (including a photograph or video record). They may have or be associated with descriptive metadata used to provide context to the records and guide or determine retention.

Like all other types of evidence, the authenticity, integrity and reliability of records should be protected for as long as they must be kept.

In the paper world, this outcome was achieved by storing physical records (including the printed version of born-digital records) on paper files or in physical storage spaces.

For the past twenty years or so, this outcome was achieved for (some) digital records by (mostly manually) copying them from a network drive or email system (or via a connector) to a dedicated electronic records management (ERM) system and then ‘locking’ them in that system to prevent unauthorised change or deletion. Most ERM systems consisted of a database for the metadata and an associated network drive file store for the objects.

The main problem with this centralised storage model – however good it might be at protecting copies of records stored in it – was that the original versions, along with all the other records that were not identified or could not be copied to the ERMS, remained where they were created or captured.

And the records stored ‘in’ the ERMS were actually stored on a network file share on a server that was (a) accessible to IT, and (b) almost always backed up. So, yet more copies existed.

The challenge of born-digital records

There are several key challenges with born-digital records:

  • Consistently and accurately identifying (or ‘declaring’) all records in all formats created or captured in all locations. For too long, the focus has primarily been on emails and anything that can be saved to a network drive with the onus of identifying a record on end-users.
  • Ensuring their authenticity, reliability and integrity over time. For records stored in the ERMS, this has usually involved locking them from edit, including through the ‘declaration’ process, or preventing deletion. But in almost all cases, the original version (in email, on the network drives), could continue to be modified. Other records that were not identified or stored in an ERMS may be deleted.
  • Ensuring that born-digital records will remain accessible for as long as they are required.

It is not possible to consistently and accurately manually (or even automatically) identify every born-digital record that an organisation creates or captures to ensure their authenticity, reliability, integrity or accessibility over time. Only a small percentage of born-digital records are copied to an ERMS.

Records remain hidden in personal mailboxes, personal drives and third-party (often unauthorised) systems. Records may exist in multiple forms and formats, sometimes created or stored in ‘private’ systems or on social media platforms. They may take the form of text or instant messages or social networking posts and threads. They may be drawings, images, voice or video recordings.

Even if a record is identified, it is not always possible to save it to an ERMS. Text or instant messages on mobile devices are a case in point that has been a problem for at least two decades. More recent examples include chat messages, reactions (emojis, comments), and recordings of online meetings.

And even if a high percentage of born-digital records could be stored in the ERMS, the original versions will almost always remain where they were created or captured.

A different approach is needed.

Triaging records?

One approach to the problem would be to accept that not all records have equal value. That is, not all records need to be managed the same way.

To some degree, this way of thinking is already reflected in classes in the structure of records retention schedules and the attention paid to each:

  • Records that have permanent or archival value and need to be transferred to archival institutions.
  • Specific types of records that must be created or kept by the organisation for minimum periods (sometimes quite long but not ‘forever’), for legal, compliance or auditing purposes.
  • Records that are not subject to legal or compliance requirements but which the organisation decides to keep for a minimum period of time.
  • Everything else.

Triaging records means that they can be managed as required at each level, but nothing is missed. It requires a risk management approach.

For records of permanent value, or records that are subject to legal or compliance requirements, it means ensuring that these records receive the most attention and that every effort is made to ensure that they are, and can be, identified (declared) and managed accordingly. This would include ensuring that it is possible to identify and capture these records in the systems used to create or capture them, for example, key emails.

A similar approach would be taken to records that need to be kept for legal, compliance or auditing purposes but with an understanding that some of these records (e.g., emails) may remain in the original system where they were created or captured. Technological solutions may be used to identify or tag these records. The destruction of these records should be subject to some form of review and a record kept of the approval and what was destroyed.

All other records would remain stored wherever they were created or captured and would be subject to minimum retention periods, after which they can be destroyed without review, but with a record kept of the basic metadata of each record (including original storage location).
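The triage levels described above can be sketched as a small lookup. This is only an illustration: the tier names, the `Handling` fields and the `triage` function are hypothetical, and the mapping of each tier to a handling rule is an assumption drawn from the paragraphs above rather than a prescribed scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PERMANENT = "permanent or archival value"
    REGULATED = "legal, compliance or auditing requirement"
    DISCRETIONARY = "kept by organisational decision"
    OTHER = "everything else"


@dataclass(frozen=True)
class Handling:
    actively_declared: bool    # every effort made to identify (declare) the record
    may_remain_in_place: bool  # can stay in the system where it was created
    disposal: str              # what happens at the end of the retention period


# Assumed mapping; a real schedule would be set by the organisation.
HANDLING = {
    Tier.PERMANENT: Handling(True, False, "transfer to archival institution"),
    Tier.REGULATED: Handling(True, True, "destroy after review; keep approval record"),
    Tier.DISCRETIONARY: Handling(False, True, "destroy without review; keep basic metadata"),
    Tier.OTHER: Handling(False, True, "destroy without review; keep basic metadata"),
}


def triage(archival_value, legal_requirement, organisational_decision):
    """Return the highest-priority tier that applies, so nothing is missed."""
    if archival_value:
        return Tier.PERMANENT
    if legal_requirement:
        return Tier.REGULATED
    if organisational_decision:
        return Tier.DISCRETIONARY
    return Tier.OTHER
```

Because the function always falls through to `Tier.OTHER`, every record receives some handling rule, which is the point of triage: effort is concentrated at the top tiers without leaving anything unmanaged.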

Protecting – or proving – the authenticity, integrity and reliability of records

The assumption behind the protection of records is that they should not be changed or deleted.

The reality, with digital records, is that they may change at any time through new threads, new revisions, new chats, or even through photoshopping.

A more realistic approach may be to use information about what was changed, by whom, and when – not to protect the record but to provide an evidentiary trail to prove what it is or was. The ‘smoking gun’ evidence for most born-digital records is the metadata that is recorded when it was captured or modified, not (necessarily) the added descriptive metadata.

For example:

  • Someone may author a document (metadata records each revision, and each revision can be viewed).
  • The document may be approved electronically (recorded in metadata).
  • Someone then modifies the approved version.
  • All of the above is recorded in the ‘modified’, ‘modified by’ and approval metadata.
  • The metadata should (or may) also record who viewed the record, and when.

EXIF metadata stored on images provides a similar form of evidence (and may even include GPS information).
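A minimal sketch of such an evidentiary trail follows. It assumes each change is stored as an event recording who did what and when. Chaining each event to the previous one with a SHA-256 hash is an added assumption (not something the systems discussed here are claimed to do), used to show how later tampering with the trail could be detected. The function names and sample addresses are hypothetical.

```python
import hashlib
import json


def _digest(body):
    # Stable hash of an event body (keys sorted so the result is repeatable).
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(trail, actor, action, when):
    """Append one event, linking it to the hash of the previous event."""
    prev_hash = trail[-1]["hash"] if trail else ""
    body = {"actor": actor, "action": action, "when": when, "prev": prev_hash}
    trail.append({**body, "hash": _digest(body)})
    return trail


def verify(trail):
    """Return True only if no event has been altered, removed or reordered."""
    prev_hash = ""
    for event in trail:
        body = {key: event[key] for key in ("actor", "action", "when", "prev")}
        if event["prev"] != prev_hash or event["hash"] != _digest(body):
            return False
        prev_hash = event["hash"]
    return True


trail = []
append_event(trail, "author@example.org", "created", "2021-03-01T09:00:00+00:00")
append_event(trail, "manager@example.org", "approved", "2021-03-02T10:00:00+00:00")
append_event(trail, "author@example.org", "modified", "2021-03-03T11:00:00+00:00")
```

The record itself is free to change; what is protected is the account of how it changed. Editing any earlier event (for example, changing ‘approved’ to something else) causes `verify` to fail.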

Which record is more likely to be accepted as evidence:

  • A record stored in an EDRMS, versions or revisions of which may exist in multiple other places, including on network file shares, email systems and even backup tapes; or
  • A record stored in a system that shows the full set of metadata about access and changes, or the most recent thread of an email discussion?

Conclusions

At the end of the day, it should be possible to confirm the authenticity, reliability and integrity of records based on information/metadata that forms part of the born-digital record: who created it and when, the context in which it was created and its relationship with other records.

Perhaps, instead of focussing on trying to identify and capture all born-digital objects that might be records and ‘protecting’ a version of that record, it may be more practical and easier to leave most records where they were created or captured (and retained by retention policies) and use change or revision metadata to provide evidence of authenticity.

This may, in the end, be a much easier way to protect the authenticity of records than having to rely on manual identification or declaration.

Posted in Artificial Intelligence, Classification, Electronic records, Information Management, Microsoft 365, Records management, Retention and disposal

Can Microsoft technology classify records better than a human?

In late 2012, IDM magazine published an article I co-authored with Umi Asma Mokhtar in Malaysia titled ‘Can technology classify records better than a human?’

The article drew on research into recent advances in technology to assist in legal discovery, known as ‘computer-assisted coding’, or ‘predictive coding’, including the following two articles:

Grossman and Cormack’s article noted that ‘a technology-assisted review process involves the interplay of humans and computers to identify the documents in a collection that are responsive to a production request, or to identify those documents that should be withheld on the basis of privilege’. By contrast, an ‘exhaustive manual review’ required ‘one or more humans to examine each and every document in the collection, and to code them as responsive (or privileged) or not’.

The article noted, somewhat gently, that ‘relevant literature suggests that manual review is far from perfect’.

Peck’s article contained similar conclusions. He also noted how computer-based coding was based on an initial ‘seed set’ of documents identified by a human; the computer then identified the properties of those documents and used that to code other similar documents. ‘As the senior reviewer continues to code more sample documents, the computer predicts the reviewer’s coding’ (hence predictive coding).

By 2011, this new technology was challenging old methods of manual review and classification. Despite some scepticism and slow uptake (for example, see this 2015 IDM article ‘Predictive Coding – What happened to the next big thing?’), by 2021, it had become an accepted option to support discovery, sometimes involving offshore processing for high volumes of content.

Meanwhile, in an almost unnoticed part of the technology woods, Microsoft acquired Equivio in January 2015. In its press release ‘Microsoft acquires Equivio, provider of machine learning-powered compliance solutions’, Microsoft stated that the product:

‘… applies machine learning … enabling users to explore large, unstructured sets of data and quickly find what is relevant. It uses advanced text analytics to perform multi-dimensional analyses of data collections, intelligently sorting documents into themes, grouping near-duplicates, isolating unique data, and helping users quickly identify the documents they need. As part of this process, users train the system to identify documents relevant to a particular subject, such as a legal case or investigation. This iterative process is more accurate and cost effective than keyword searches and manual review of vast quantities of documents.’ 

It added that the product would be deployed in Office 365.

Classifying records

The concept of classification for records was defined in paragraph 7.3 of part 1 of the Australian Standard (AS) 4390, released in 1996. The standard defined classification as:

‘… the process of devising and applying schemes based on the business activities generating records, whereby they are categorised in systematic and consistent ways to facilitate their capture, retrieval, maintenance and disposal. Classification includes the determination of naming conventions, user permissions and security restrictions on records’.

The definition provided a number of examples of how the classification of business activities could act as a ‘powerful tool to assist in many of the processes involved in the management of records, resulting from those activities’. This included ‘determining appropriate retention periods for records’.

The only problem with the concept was the assumption that all records could be classified in this way, in a singular recordkeeping system. Unless they were copied to that system, emails largely escaped classification.

Fast forward to 2020

Managing all digital records according to recordkeeping standards has always been a problem. Electronic records management (ERM) systems managed the records that were copied into them, but a much higher percentage remained outside their control – in email systems, network file shares and, increasingly over the past 10 years, created and captured on a host of alternative systems including third-party and social media platforms.

By the end of 2019, Microsoft had built a comprehensive single ecosystem to create, capture and manage digital content, including most of the records that would have been previously consigned to an ERMS. And then COVID appeared and working from home became common. All of a sudden (almost), it had to be possible to work online. Online meeting and collaboration systems such as Microsoft Teams took off, usually in parallel with email. Anything that required a VPN to access became a problem.

2021 – Automated classification for records (maybe)

The Microsoft 365 ecosystem generated a huge volume of new content scattered across four main workloads – Exchange/Outlook, SharePoint, OneDrive and Teams. A few other systems such as Yammer also added to the mix.

Most of this information was not subject to any form of classification in the recordkeeping sense. The Microsoft 365 platform included the ability to apply retention policies to content but there was a disconnect between classification and retention.

Microsoft announced Project Cortex at Ignite in 2019. According to the announcement, Project Cortex:

  • Uses advanced AI to deliver insights and expertise in the apps that are used every day, to harness collective knowledge and to empower people and teams to learn, upskill and innovate faster.
  • Uses AI to reason over content across teams and systems, recognizing content types, extracting important information, and automatically organizing content into shared topics like projects, products, processes and customers.
  • Creates a knowledge network based on relationships among topics, content, and people.

Project Cortex drew on technological capabilities present in Azure’s Cognitive Services and the Microsoft Graph. It is not known to what extent the Equivio product, acquired in 2015, was integrated with these solutions but, from all the available details, it appears the technology is at least connected in one way or another.

During Ignite 2020, Microsoft announced SharePoint Syntex and trainable classifiers, either of which could be deployed to classify information and apply retention rules.

Trainable classifiers

Trainable classifiers were made generally available (GA) in January 2021.

Trainable classifiers sound very similar to the predictive coding capability that appeared from 2011. However, they:

  • Use the power of Machine Learning (ML) to identify categories of information. This is achieved by creating an initial ‘seed’ of data in a SharePoint library, creating a new trainable classifier and pointing it at the seed, then reviewing the outcomes. More content is added to ensure accuracy.
  • Can be used to identify similar content in Exchange mailboxes, SharePoint sites, OneDrive for Business accounts, and Microsoft 365 Groups and apply a pre-defined retention label to that content.

In theory, this means it might be possible to identify a set of similar records – for example, financial documents – and apply the same retention label to them. The Content Explorer in the Compliance admin portal will list the records that are subject to that label.
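The seed-and-match idea can be illustrated with a toy example. This sketch is not Microsoft’s algorithm, which is not described here; it swaps in a simple bag-of-words cosine similarity, and the function names, the sample text, the label and the 0.3 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    # Lower-case words only; numbers and punctuation are ignored.
    return re.findall(r"[a-z]+", text.lower())


def seed_profile(seed_docs):
    """Combine the word counts of the human-selected seed documents."""
    profile = Counter()
    for doc in seed_docs:
        profile.update(tokens(doc))
    return profile


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def label_for(text, profile, label, threshold=0.3):
    """Return the retention label if the text is similar enough to the seed."""
    return label if cosine(Counter(tokens(text)), profile) >= threshold else None


seed = [
    "Invoice number 1001 total amount due payment terms",
    "Invoice 1002 amount due on receipt payment",
]
profile = seed_profile(seed)
invoice_label = label_for("Invoice 1003 total amount due", profile, "Financial records")
minutes_label = label_for("Minutes of the committee meeting held on Monday", profile, "Financial records")
```

Here the invoice-like text is labelled and the meeting minutes are not. The outcome depends entirely on what the human placed in the seed and where the threshold is set, which is why reviewing the results and adding more content matters.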

SharePoint Syntex

SharePoint Syntex was announced at Ignite in September 2020 and made generally available in early 2021.

The original version of Syntex (as part of Project Cortex) was targeted at the ability to extract metadata from forms, a capability that has existed with various other scanning/OCR products for at least a decade. The capability that was released in early 2021 included the base metadata extraction capability as well as a broader capability to classify content and apply a retention label.

The two Syntex capabilities, described in a YouTube video from Microsoft titled ‘Step-by-Step: How to Build a Document Understanding Model using Project Cortex’, are:

  • Classification. This capability involves the following steps: (a) Creation of (SharePoint site) Content Center; (b) Creation of a Document Understanding Model (DUM) for each ‘type’ of record; the DUM can create a new content type or point to an existing one; the DUM can also link with the retention label to be applied; (c) Creation of an initial seed of records (positives and a couple of negatives); (d) Creation of Explanations that help the model find records by phrase, proximity, or pattern (matching, e.g., dates); (e) Training; (f) Applying the model to SharePoint sites or libraries. The outcome of the classification is that matching records in the location where it is pointed are assigned to the Content Type (replacing any previous one) and tagged with a retention label (also replacing any previous one).
  • Extraction. This capability has similar steps to the classification option except that the Explanations identify what metadata is to be extracted from where (again based on phrase, proximity or pattern) to what metadata column. The outcome of extraction is that the matching records include the extracted metadata in the library columns (in addition to the Content Type and retention label).

As with trainable classifiers, Syntex uses Machine Learning to classify records, but Syntex also has the ability to extract metadata. Syntex can only classify or extract data from SharePoint libraries.
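To make the ‘phrase, proximity, or pattern’ idea concrete, the following is a rough sketch using regular expressions. It is not how Syntex is implemented; the function name, the date pattern and the character-based `max_gap` measure of proximity are assumptions for illustration only.

```python
import re

# Pattern explanation: a date written as d/m/yyyy or dd/mm/yyyy.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def extract_near_phrase(text, phrase, pattern=DATE_PATTERN, max_gap=20):
    """Return the first pattern match that starts within max_gap characters
    after an occurrence of the phrase, or None if there is no such match."""
    for found in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        match = pattern.search(text, found.end())
        if match and match.start() - found.end() <= max_gap:
            return match.group()
    return None


sample = "Invoice date: 14/03/2021. Payment due 13/04/2021."
invoice_date = extract_near_phrase(sample, "Invoice date")
payment_due = extract_near_phrase(sample, "Payment due")
```

The phrase locates the right part of the document, the pattern recognises the value, and the proximity limit stops a date elsewhere in the text from being picked up by mistake. The extracted value would then be written to a metadata column.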

Trainable classifiers or Syntex?

Both options require the organisation to create an initial seed of content and to use Machine Learning to develop an understanding of the content, in order to classify it.

The models are similar; the primary difference is that trainable classifiers can work on content stored in email, SharePoint and OneDrive, whereas Syntex is currently restricted to SharePoint.

Predictive coding

On 18 March 2021, Microsoft announced the pending (April 2021) preview release of an enhanced predictive coding module for advanced eDiscovery in Microsoft 365.

The announcement, pointing to this roadmap item, noted that eDiscovery managers would be able to create and train relevance models within Advanced eDiscovery using as few as 50 documents, to prioritize review.

So, can Microsoft technology classify records better than humans?

In their 1999 book ‘Sorting Things Out: Classification and its Consequences’ (MIT Press), Geoffrey Bowker and Susan Leigh Star noted that ‘to classify is human’ and that classification was ‘the sleeping beauty of information science’ and ‘the scaffolding of information infrastructures’.

But they also noted how ‘each standard and category valorizes some point of view and silences another. Standards and classifications (can) produce advantage or suffering’ (quote from review in link above).

Technology-based classification is, in theory, impartial. It categorises what it finds through machine learning and algorithms. But technology-based classification requires human review of the initial and subsequent seeds. Accordingly, such classification has the potential to be skewed by the reviewer’s bias or predilections, through the selection of one set of preferred or ‘matching’ records over another.

Ultimately, a ‘match’ is based on a scoring ‘relevancy’ algorithm. Perhaps the technology can classify better than humans, but whether the classification is accurate may depend on the human to make accurate, consistent and impartial decisions.

Either way, the manual classification of records is likely to go the same way as the manual review of legal documents for discovery.

Image source: Providence Public Library Flickr

Posted in Access controls, Information Management, Information Security, Microsoft 365, Microsoft Teams, Office 365 Groups, SharePoint Online

Understanding permission groups in Teams and SharePoint

One of the most confusing aspects of Teams and SharePoint in Microsoft 365 is the relationship between permission groups used to control access to both of these resources. This is especially the case as every Team in MS Teams has an associated SharePoint site (the ‘Files’ tab).

This post explains how permission groups work between MS Teams, Microsoft 365 Groups and SharePoint.

SharePoint permission groups

Before discussing how Teams permissions relate to SharePoint, here is a brief reminder of how SharePoint permissions work.

SharePoint has always had three default permission groups, each prefixed by the URL name of the site (that is, the name of the site followed by the word Owners, Members or Visitors).

Site Owners

  • People (including in a Group, see below) added to the Owners permission group have full access (full control) to all parts of the site and are usually responsible for managing the SharePoint site. There would normally be two or three site owners.

Site Members

  • People (including in a Group, see below) added to the Members permission group have add/edit (contribute) rights.

Site Visitors

  • People added to the Visitors permission group have read-only (view) rights.

These permissions are set at the site level and inherited by everything in the site, unless that inheritance is broken and unique permissions are applied. Additional permission groups can be created as necessary but most SharePoint sites only use the default Owners, Members and Visitors groups.
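The inheritance behaviour can be modelled in a few lines. This is a simplified sketch, not the SharePoint API: the `Securable` class, its fields and the sample site, library and folder are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Securable:
    """A site, library, folder or item that either inherits or has unique permissions."""
    name: str
    parent: Optional["Securable"] = None
    # Maps permission group -> level. None means "inherit from parent".
    unique_permissions: Optional[Dict[str, str]] = None

    def effective_permissions(self):
        node = self
        # Walk up until something with its own (unique) permissions is found.
        while node.unique_permissions is None and node.parent is not None:
            node = node.parent
        return node.unique_permissions or {}


site = Securable("Site", unique_permissions={
    "Owners": "full control", "Members": "add/edit", "Visitors": "read-only",
})
library = Securable("Documents", parent=site)
# Inheritance broken on this folder: only Owners keep access.
folder = Securable("Confidential", parent=library,
                   unique_permissions={"Owners": "full control"})
```

The library inherits all three groups from the site, while the folder with broken inheritance exposes only what was set on it directly. Anything created inside that folder would, in turn, inherit from the folder.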

Microsoft 365 Groups

Microsoft 365 Groups were introduced in 2017 and control access to resources, like Security Groups.

However, unlike Security Groups, which usually provide access to individual resources (such as a single SharePoint site, or Line of Business (LOB) system), Microsoft 365 Groups control access to multiple linked Microsoft 365 resources.

Microsoft 365 groups, distribution lists, mail-enabled security groups, and security groups (collectively referred to as Active Directory (AD) groups) are all created in the ‘Groups’ area of the Microsoft 365 Admin portal.

When a new group is created, the following options appear.

As noted above, Microsoft 365 groups are recommended. It is important to understand the relationship between Microsoft 365 groups, Teams and SharePoint.

A new group has a visible mailbox and a Team is created by default

When a new Microsoft 365 group is created (from the dialogue above), at least one Owner must be specified. The Owner/s are responsible for managing the Members group. Creating the group also creates:

  • An Exchange mailbox with the same email @ name as the Microsoft 365 group. The mailbox is visible in Outlook to the members of the Group.
  • A SharePoint site with the same URL name as the Microsoft 365 group.
  • By default (unless the checkbox is unchecked), a new Team is also created in MS Teams.

When a new Team is created from MS Teams, or a new SharePoint Team site is created, it creates:

  • A Microsoft 365 Group with an Exchange mailbox and a SharePoint site (‘Files’ tab).
  • The name of the Team becomes the name of the Group and the SharePoint site.
  • The mailbox is not visible in Outlook and is only used for calendaring and for the storage of Teams chats (in a hidden folder).

Importantly, when a new Microsoft 365 group or Team is created (which creates a Microsoft 365 group), the Group Owners: (a) are the same as the Team Owners and (b) are added to the SharePoint Owners permission group, as explained below.

Group/Team Owners and Members

In other words, the Microsoft 365 group owners (group) is added to the SharePoint site owners permission group – a ‘group within a group’.

That is, the Microsoft 365 group controls access to the Team and the SharePoint site as shown in the diagram below. Security Groups may also be added to the Microsoft 365 Group site, but this does not provide access to the Team.

The relationship between Microsoft 365 Groups, Teams and SharePoint

This ‘group within a group’ model is visible from the ‘Site Permissions’ section of the gear/cog icon as shown below (the name of the Microsoft 365 Group/Team/SharePoint site is ‘SharePoint Admin’). The SharePoint Admin Group Owners (group) is in the SharePoint site owners group, and the SharePoint Admin Group Members (group) is in the Site members group.

If a mouse hovers over the Group ‘icon’ (in the above example, GO or GM), it is possible to view the members of the Group and, for Owners, to modify that list. Confusingly, the ‘GM’ in the SharePoint site permissions group becomes ‘SG’ in the drop down list.

You can also see the ‘group within group’ model from the back-end ‘Advanced permissions’ section of the SharePoint site, but you cannot manage the Microsoft 365 Group members here.

Implementing the model

As with Security Groups, the members of Microsoft 365 Groups will usually be a logical group of people who require access to something, in this case access to the SharePoint site or the Team (for chat, files, or other resources).

The main thing to remember is that membership of the (backend) Microsoft 365 Group provides access to BOTH the Team and the Team’s SharePoint site (the ‘Files’ tab in a Team).

  • Every Team in MS Teams will usually consist of the members of a logical group with a common interest – a business unit, project team, or with some other work relationship, for example, the members of a committee. The Team Owners are responsible for managing the Team Members.
  • The Team Owners are the SharePoint site owners and are responsible for managing the site if they decide to access it directly. The Team Members are the SharePoint site members and have the ability to add or edit content, usually via the ‘Files’ tab in Teams.

Note: Security Groups with the same members as Microsoft 365 Groups (and Teams) may already exist. There is no need to add a Security Group if it has the same members as a Microsoft 365 Group.

A Group/Team does not have visitors with read-only rights. Every Member of the Team has add/edit access to both the Team and its associated SharePoint site.

  • If there is a requirement to give specific other people either add/edit or read-only access to the SharePoint site, that outcome is achieved by adding people by name, or a Security Group, to either the SharePoint Members or Visitors group.
  • If there is a requirement to give everyone in the organisation either add/edit rights, or read only access, to the SharePoint site, that outcome is achieved by adding ‘Everyone except external users’ to either the SharePoint Members or Visitors group.
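The ‘group within a group’ model and the two options above can be sketched as follows. This is a simplified model rather than any Microsoft API: the data structures, the `m365:` prefix used to mark a nested group, and the user names are all hypothetical.

```python
EVERYONE = "Everyone except external users"

# The backend Microsoft 365 Group: membership here gives access to the Team.
m365_group = {"owners": {"alice"}, "members": {"bob", "carol"}}

# SharePoint permission groups hold named people or nested groups.
site_groups = {
    "Owners": {"m365:owners"},      # Group Owners nested in site Owners
    "Members": {"m365:members"},    # Group Members nested in site Members
    "Visitors": {"dave", EVERYONE}, # added directly to the SharePoint group
}

LEVELS = [("Owners", "full control"), ("Members", "add/edit"), ("Visitors", "read-only")]


def in_site_group(user, group_name, is_external=False):
    for principal in site_groups[group_name]:
        if principal == EVERYONE and not is_external:
            return True
        if principal.startswith("m365:"):
            if user in m365_group[principal[len("m365:"):]]:
                return True
        elif principal == user:
            return True
    return False


def site_access(user, is_external=False):
    """Return the highest SharePoint access level the user has, or None."""
    for group_name, level in LEVELS:
        if in_site_group(user, group_name, is_external):
            return level
    return None


def team_access(user):
    """Only membership of the Microsoft 365 Group gives access to the Team."""
    return user in m365_group["owners"] | m365_group["members"]
```

In this model, `dave` can read the SharePoint site because he was added to the Visitors group, but he has no access to the Team because he is not in the Microsoft 365 Group. `bob` has add/edit access to both.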

External guests may also be added to the Team and the Team’s SharePoint site.