(Note – while drafting this post I became aware of an MA Dissertation on the subject of ‘Artificial Intelligence and Record-keeping’ being developed by Mohamed Ben Tahayekt at University College London. I have not had access to this material but I believe some of the concepts may be similar to those outlined in this post.)
Digital records have long been thought of (and described) as being ‘unstructured’.
The reality, however, is that almost all contemporary text-based digital records are made up of a defined, structured and mostly open or accessible package of data that is based on standards. For example:
- Microsoft Word, PowerPoint and Excel documents are all based on an XML structure (indicated by the ‘x’ on the end of the file extension) described in ISO/IEC 29500 and ECMA 376.
- Google Docs exist only in an online format (described in this Google site); to access them offline they must be converted to one of the following formats: Office Open XML (ISO/IEC 29500), ODT (ISO/IEC 26300), PDF or HTML.
- Emails are now mostly based on the Internet Message Format (IMF), standardized by RFC 5322.
- PDFs are based on the open standard ISO 32000.
All of these standards support interoperability between systems (and devices). (See my post about Metadata Payloads for more information on this subject).
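To make the point about structure concrete, here is a minimal Python sketch of reading the descriptive properties held inside an Office document. It relies only on the fact that an ISO/IEC 29500 file is a zip package with a core properties part at `docProps/core.xml`. The sample package, names and dates are invented for illustration, and the package built here is deliberately incomplete (it is not a valid Word document).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical core.xml content, standing in for a real document's properties.
SAMPLE_CORE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Vaccine rollout plan</dc:title>
  <dc:creator>Alex Example</dc:creator>
  <cp:lastModifiedBy>Sam Example</cp:lastModifiedBy>
  <dcterms:created xsi:type="dcterms:W3CDTF">2021-03-01T09:00:00Z</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">2021-03-04T16:30:00Z</dcterms:modified>
</cp:coreProperties>
"""


def build_sample_package() -> bytes:
    """Create an in-memory zip laid out like a (very incomplete) .docx package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", SAMPLE_CORE_XML)
    return buffer.getvalue()


def read_core_properties(package_bytes: bytes) -> dict:
    """Return selected core properties from an OOXML-style package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    wanted = {
        "title": "dc:title",
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
    }
    return {name: root.findtext(path, namespaces=NS) for name, path in wanted.items()}


properties = read_core_properties(build_sample_package())
print(properties)
```

With a real file, the bytes would come from the document itself rather than from `build_sample_package()`; properties that are absent simply come back as `None`.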
Binary objects, including digital photos and images (whether stored on their own or embedded in text-based documents), are an exception to the above. Even so, most binary objects are stored with a range of metadata that describes them.

Given that text-based digital records are already full of readable and accessible structured data (and binary objects come with a range of descriptive metadata), is it possible to manage digital records as self-contained data objects?
Records and context
Digital content (records and non-records) will always be captured, saved to or stored somewhere:
- In email mailboxes. Emails of course may include attachments that duplicate records stored elsewhere in the system, or are not stored anywhere else – e.g., received from outside the organisation.
- In a drive/folder structure in a network file share location, including ‘personal’ drives.
- In a library/folder in online file storage and collaboration platforms, including ‘personal’ online storage locations.
- In corporate enterprise ‘social’ platforms such as the intranet.
- In corporate messaging and chat applications.
Some of the above may have well-defined ‘filing’ or storage structures (including folders) that are used to store or ‘file’ records. Some of these may include the ability to classify and categorise records, and add additional metadata.

In an organisational setting, all of this digital content will be created, sent/received, or modified by someone listed in Active Directory (AD), a system that generally links employees through their organisational structure. Additionally, employees are likely to belong to several AD Groups that further define relationships between them.
These relationships are important as they help us to understand the context for records.
Isolating records from other content
But one of the challenges for any organisation is knowing what is a record and what isn’t. Perhaps that isn’t as important as it sounds, if all the digital content is considered a potential record.
Organisations create or receive and store a lot of digital content, and a lot of this content has traditionally been kept (on backup tapes) for a long time to support disaster recovery and investigation purposes.
Only a percentage of this content is likely to fit the standard definition of a record – ‘evidence of business activities’.
And some digital content may not obviously be a record until it is connected with or related to other content or activities. For example, a simple email that says ‘Yes’ or ‘OK’ may be the record of agreement to something that doesn’t form part of any other obvious records until it is identified as being a record.
In traditional electronic recordkeeping systems, there was often no guarantee that the records copied there included every record that existed on a given subject. Additionally, a record stored in a recordkeeping system may be of relevance in other contexts.
The key to what a record might be is the word ‘evidence’; this is exactly what lawyers look for when they conduct eDiscovery activities.
Rather than assume all records can be accurately found and managed amidst the volume of all digital content, it may be more efficient and accurate to assume all digital content is a record and then apply rules and tools to manage that content, with the aim of identifying records and their potential context based on the data contained in the individual digital objects and their relationships with both other records and people.
In other words, find records amongst the entire content, rather than seeking to isolate only those digital objects that are identified as records and copy them to another location while leaving the originals, and potentially other related records, in place. Managing records this way avoids the problem of email threads or chats that continue after the copy has been made, or a new copy of a Word document appearing.
How can we achieve this outcome?
There are three potential ways to manage records as data.
The first is to understand, even in general terms, that digital content is not unstructured, and to learn more about how it is structured. Some simple examples:
- Every email (and instant message) has a sender, recipient, date sent and date received. Most (but not all) have a subject. The text-based body of the email provides an additional form of accessible data. A quick look at email headers reveals a huge amount about the email.
- Every document (and web page) has an author, created date, modified date, last modified by, and a name. They also have a large amount of other data, a lot of which is visible in the Properties section.
- Photographs are stored as binary objects but have a range of EXIF metadata that includes the creation date, information about the camera settings, and may also include the name of the person who created it, as well as a GPS location.
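The email example above can be sketched in a few lines of Python using the standard library's `email` package. The message below is invented for illustration (the addresses, identifiers and subject are hypothetical), and a real mailbox export would need more handling than this, for example multipart bodies and attachments.

```python
from email import message_from_string
from email.policy import default
from email.utils import parsedate_to_datetime

# Hypothetical message: a one-word reply whose meaning depends on its headers.
RAW_MESSAGE = """\
From: Alex Example <alex@example.org>
To: Sam Example <sam@example.org>
Subject: Re: Contract variation
Date: Mon, 01 Mar 2021 09:15:00 +1100
Message-ID: <abc123@example.org>
In-Reply-To: <xyz789@example.org>

Yes
"""


def describe_email(raw: str) -> dict:
    """Pull the structured data out of a simple, single-part text email."""
    message = message_from_string(raw, policy=default)
    return {
        "sender": str(message["From"]),
        "recipient": str(message["To"]),
        "subject": str(message["Subject"]),
        "sent": parsedate_to_datetime(str(message["Date"])),
        "message_id": str(message["Message-ID"]),
        # In-Reply-To links this message to the one it answers.
        "in_reply_to": str(message["In-Reply-To"]),
        "body": message.get_content().strip(),
    }


record = describe_email(RAW_MESSAGE)
for field, value in record.items():
    print(f"{field}: {value}")
```

The body on its own ("Yes") says almost nothing; the subject and the `In-Reply-To` header are what connect it to the thing being agreed to.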

The second is to understand that digital content may include added data or metadata. This added data may relate to or derive from the location where the record is stored, or may be added by end-users as part of their work. It may include a unique identifier and information about the aggregation where it is stored, as well as recordkeeping classification terms. Additionally, it may include both process metadata (modified by, and when) and security or access control metadata. Depending on where it is stored, this additional metadata may be embedded with the document properties (the metadata payload).
The third is to have access to (ideally) all digital content across the organisation, and the necessary tools (or access to people who have those tools and can provide usable output) to search and retrieve, relate, and manage all digital content on any given subject or context through to disposal. A very simple example of this is to run a Power BI report across the network file shares.
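A report across file shares starts from an inventory of what is there. As a rough illustration of that first step, the Python sketch below walks a folder tree and collects basic file system metadata for each file. It is not how a reporting tool does it, just a stand-in under simple assumptions: the "share" is a temporary folder created for the demonstration, and the file names are hypothetical.

```python
import os
import tempfile
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def inventory(root: Path) -> list[dict]:
    """List every file under root with its basic file system metadata."""
    rows = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = Path(folder) / name
            stat = path.stat()
            rows.append({
                "path": str(path.relative_to(root)),
                "extension": path.suffix.lower(),
                "size_bytes": stat.st_size,
                "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
            })
    return rows


# Demonstration against a throwaway folder standing in for a file share.
with tempfile.TemporaryDirectory() as tmp:
    share = Path(tmp)
    (share / "projects").mkdir()
    (share / "projects" / "plan.docx").write_bytes(b"placeholder")
    (share / "projects" / "budget.xlsx").write_bytes(b"placeholder")
    (share / "notes.docx").write_bytes(b"draft")
    rows = inventory(share)

# A first, very coarse view: how much content of each type is there?
by_extension = Counter(row["extension"] for row in rows)
total_bytes = sum(row["size_bytes"] for row in rows)
print(by_extension, total_bytes)
```

The resulting rows could be written to CSV and loaded into whichever reporting tool is available.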
And lastly, while there will always be some form of ‘local’ aggregation for specific records where all the records are stored in the one place (mailbox, document library, folder), the only way to establish an aggregation of all digital records on a given subject or context using data only is through the use of advanced searches and/or eDiscovery tools and/or data reporting or visualisations and/or artificial intelligence to find, link and relate content.
Linking and relating content
The diagram below, from Microsoft about its Graph technology, provides a simple example of how content can be linked and related through its data.

There are now many data analytics and data visualisation tools that help to understand digital content. These tools are just one part of the picture.
Data analytics tools (such as ‘Constellation’, a joint project between the Australian Signals Directorate and the Australian CSIRO) are a starting point to understand digital content – including digital content from line of business systems.

These tools might be used to identify content or people related to a given subject, through chat messages, emails, documents etc, including content that is already linked through its own context – the mailbox or a SharePoint library. From that information it would be possible to build a picture – types and volume of content, people, and the relationships between them.
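The kind of picture described here can be approximated, in a very reduced form, with a few lines of Python. The sketch below assumes the metadata has already been extracted into simple dictionaries; the items, people and container names are all invented, and real analytics tools work with far richer data than this.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, hand-made metadata for content that mentions one subject.
ITEMS = [
    {"type": "email", "people": ["alex", "sam"], "container": "mailbox:alex"},
    {"type": "chat", "people": ["sam", "jo"], "container": "chat:project"},
    {"type": "document", "people": ["alex", "jo"], "container": "library:vaccines"},
    {"type": "email", "people": ["alex", "sam"], "container": "mailbox:sam"},
]


def summarise(items):
    """Count content by type and count how often each pair of people co-occurs."""
    volume_by_type = Counter(item["type"] for item in items)
    links = Counter()
    for item in items:
        # Each pair of people attached to the same item counts as one link.
        for pair in combinations(sorted(set(item["people"])), 2):
            links[pair] += 1
    return volume_by_type, links


volume_by_type, links = summarise(ITEMS)
print(volume_by_type)
print(links.most_common())
```

The pair counts are the edges of a simple graph: people are the nodes, and the weight of each edge is the number of items that connect them.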
A global search should be able to retrieve, and if necessary export, the content, keeping in mind always that the nature of digital content means it may continue to be modified or new content may be added at any time.
As searches improve, narrower sets of content allow more granular analysis and visualisation, and the identification of sub-sets of records within broader sets. For example, within the potentially large group of ‘everything about COVID’, just the narrower set ‘Vaccines’.
All of this could be achieved through the data that makes up the digital content. And many data-driven organisations are likely to be doing just this, using a range of business intelligence tools to understand the information available to them, in both line of business systems and other content.
Can we manage records as data?
Perhaps ‘manage’ is not the right word, or at least not in the sense of expecting digital records to be managed as exceptions to the rest of the digital content.
But there is a lot more we can do to make this outcome possible. We can leave the records where they are stored or captured, we can apply local structure to those records, or security controls. We can keep records of changes that are made. We can apply retention rules that prevent the destruction of any record, or potential record, before it can be legally destroyed.
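A retention rule of this kind is, at its simplest, a date calculation. The Python sketch below is one hedged way to express it: the seven-year period, the idea of counting from the date of last action, and the `on_hold` flag are all assumptions made for the example, not a statement of any particular retention schedule.

```python
from datetime import date


def earliest_disposal_date(last_action: date, retention_years: int) -> date:
    """Return the first date on which the content may be destroyed."""
    try:
        return last_action.replace(year=last_action.year + retention_years)
    except ValueError:
        # 29 February in a non-leap target year: fall forward to 1 March.
        return date(last_action.year + retention_years, 3, 1)


def can_destroy(last_action: date, retention_years: int, today: date,
                on_hold: bool = False) -> bool:
    """Allow destruction only once the retention period has fully elapsed."""
    if on_hold:
        # A hold (for example during an investigation) always blocks destruction.
        return False
    return today >= earliest_disposal_date(last_action, retention_years)


# Hypothetical item last actioned on 30 June 2015 under a seven-year rule.
print(earliest_disposal_date(date(2015, 6, 30), 7))
print(can_destroy(date(2015, 6, 30), 7, today=date(2021, 1, 1)))
print(can_destroy(date(2015, 6, 30), 7, today=date(2022, 6, 30)))
```

Because the rule only needs dates that the content already carries, it can be applied where the content sits, without first copying anything to a separate system.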
Instead of ‘managing’ records as exceptions, we can leave the data where it was created or stored, and use a range of tools to help us understand and manage it.
This will allow us to manage records as data and finally achieve the ‘semantic office’ I wrote about in 2010.