Data-based records

(Note – the image above is a small ticket dated 1956 from my grandmother’s visit to Denmark. I used this because the word ‘Kontrolbillet’ seemed appropriate for this post.)

My previous post asked whether it is possible to manage records as data. The queries that followed made it apparent that the data-based nature of contemporary digital content formats, especially Office documents, is not well known.

This post describes the data structure of a typical Word document, to help explain why such records could be seen (and managed) as self-contained data sets.

Just to be clear, the idea of managing records as data does not remove the need or business requirement to store and manage records in ‘local’ aggregations or context – a SharePoint document library or a mailbox for example (less so a network file share because of the limited metadata, but still possible). These aggregations will generally map to business activities, can have specific metadata requirements and can be used to control access to and retention of records as long as they need to be managed.

Managing records as data is a more holistic data analytics concept that allows organisations to better understand and analyse records amidst the volume of all other digital content. It should help to ensure that all records on a given subject or context are managed appropriately through time, and that, wherever possible, only one copy exists.

A document in a SharePoint Online library

For this example, a document is stored in a SharePoint Online document library called ‘Client Agreements’. The library has a set of metadata columns that must be added to every record. The library uses document sets, but it could equally use metadata or folders; the important point is that metadata is added to the library.

The metadata added to the library can be anything, including terms from a business classification scheme. The metadata can be mandatory or optional, and can be set as default options – for example, you may want every document in a library to automatically have the same function and activity terms.

In the screenshot below, we can see the document library with two document sets (a type of folder). The library has four added metadata options: Client Name, Client Reference, ClientRef, and Date of Birth (not visible in the screenshot but we’ll see it later).

The metadata properties

Here are the metadata columns for the library. As we will see below in the actual data, a metadata column name with a space between words ends up with additional characters (‘_x0020_’) replacing each space.

The columns from ‘Created’ downwards are all default columns
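The escaped spaces are easy to reverse when reading the data. As a minimal sketch, assuming the ‘_xHHHH_’ pattern (four hex digits for the character code) is the only escaping that needs handling, a column's internal name can be turned back into its display form like this. The function name is my own, not part of any SharePoint API.

```python
import re

# Internal names escape some characters as _xHHHH_, where HHHH is the
# character code in hex (a space becomes _x0020_).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_internal_name(name: str) -> str:
    """Turn an escaped name such as 'Client_x0020_Name' into 'Client Name'."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), name)


for column in ("Client_x0020_Name", "Client_x0020_Reference", "ClientRef"):
    print(column, "->", decode_internal_name(column))
```

Names without escapes, such as ‘ClientRef’, pass through unchanged.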

When I open the Harpin ‘folder’, I can see the metadata columns next to a document. In this case they were added to the document automatically as the documents inherit the same metadata properties as the document set. This is set via the Document Set settings – ‘Shared Columns’:

Alternatively, the metadata can be added to each new individual document when the document is added.

If the Harpin document is selected as shown below …

… the information panel on the far right shows the metadata properties for the document (and also the activity – when the document was modified and by whom, and who viewed it):

As this particular document is a Word template added to a content type in the library, an end user can select it when they create a new document in the library, as shown in the screenshot below. Alternatively, the ‘Client Folder’ option allows them to create a new document set folder with all the metadata that relates to the client; this data is then inherited by any document created in the library:

If the document is opened, you can click on File – Info and see the metadata properties already added to the document in the library. These properties remain with the document even if it is downloaded and/or attached to an email. If Document IDs have been enabled, that metadata value is also added to the document properties, meaning we can see that it came from a SharePoint library (and which one):

Because it is used as a template, the Word document can make use of the metadata added to the record in the body of the document, in addition to the metadata forming part of the properties for the document.

The metadata properties have been automatically added to the body of the document

The XML properties

Let’s now look at the XML of the document.

Download the document to an accessible location. Using the Command Prompt (CMD), change the document's file extension to .zip (File Explorer only lets you do this when file name extensions are shown). In File Explorer, the original file will now have the extension .zip. In the list below, the other file with a similar name is a copy, but the size is identical.

Now, unzip the zip file (right click, Extract All).

Here is the top level output, which is standard for all Word documents.
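The same listing can be produced in code without renaming anything, because Python's standard zipfile module reads the package whatever its extension. The sketch below builds a small stand-in package in memory so that it runs on its own; the part names mirror the folders discussed in this post, the contents are placeholders, and the function name is my own.

```python
import io
import zipfile


def list_package_parts(source) -> list[str]:
    """Return the part names inside an Office package (.docx, .xlsx, .pptx).

    `source` can be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


# Stand-in package built in memory; a real .docx contains more parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    for part in ("[Content_Types].xml", "_rels/.rels", "customXml/item1.xml",
                 "docProps/core.xml", "word/document.xml"):
        demo.writestr(part, "<placeholder/>")
buffer.seek(0)

for name in list_package_parts(buffer):
    print(name)
```

With a real file, passing its path (for example the downloaded .docx) in place of `buffer` should give the full list of parts.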

Open the ‘customXml’ folder and you will see a set of XML files:

Open item1.xml and you will see the custom properties, which include both the Document ID and the path of the original SharePoint site and library. Just to be clear, the Document ID ends in ‘119’, which is the actual document; the original document set folder’s ID ends in ‘118’ (scroll up to check):

To avoid ‘_x0020_’ in the metadata properties, don’t use spaces in the metadata column names

As can be seen, the document that was downloaded has the unique Document ID embedded in the metadata. Note that this ID will change if the document is uploaded to a different library.
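Once the custom properties are XML, reading them takes only a few lines. This is a rough sketch, not the actual file from this post: the sample below is a hypothetical stand-in with placeholder namespaces, element names and values, and the function simply collects every leaf element regardless of namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for customXml/item1.xml. The namespaces, element
# names and values are illustrative assumptions, not the real file.
SAMPLE_ITEM = """<p:properties xmlns:p="urn:example:properties">
  <documentManagement>
    <Client_x0020_Name xmlns="urn:example:columns">Harpin</Client_x0020_Name>
    <ClientRef xmlns="urn:example:columns">C-0001</ClientRef>
    <_dlc_DocId xmlns="urn:example:columns">EXAMPLE-1-119</_dlc_DocId>
  </documentManagement>
</p:properties>"""


def read_custom_properties(xml_text: str) -> dict[str, str]:
    """Collect every non-empty leaf element as {local name: text}."""
    properties = {}
    for element in ET.fromstring(xml_text).iter():
        if len(element) == 0 and element.text and element.text.strip():
            # Drop the '{namespace}' prefix that ElementTree adds to tags.
            local_name = element.tag.rsplit("}", 1)[-1]
            properties[local_name] = element.text.strip()
    return properties


for key, value in read_custom_properties(SAMPLE_ITEM).items():
    print(f"{key}: {value}")
```

Ignoring namespaces keeps the sketch short; code meant for real use would probably match on the full qualified names instead.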

In the ‘docProps’ folder we find three XML files:

In the ‘core.xml’ file we see the Dublin Core (DC) metadata that you see in the document Properties above. You can add all the Dublin Core metadata to the library; the columns are built in to every library, which means that every document can have all that metadata.

The actual content (the body) of the document is found in the ‘word’ folder of the XML files. Here is the content of that ‘word’ folder:
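In a standard package, the Dublin Core properties live in docProps/core.xml and use the `dc`, `dcterms` and `cp` namespaces. The following is a minimal sketch of reading a few of them with Python's standard library. The sample XML and its values are made up for illustration, and only four properties are looked up.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Made-up sample in the shape of docProps/core.xml.
SAMPLE_CORE = """<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:title>Client Agreement</dc:title>
  <dc:creator>Example Author</dc:creator>
  <cp:lastModifiedBy>Example Editor</cp:lastModifiedBy>
  <dcterms:created>2020-01-01T09:00:00Z</dcterms:created>
</cp:coreProperties>"""


def read_core_properties(xml_text: str) -> dict[str, str]:
    """Return the selected core properties that are present in the XML."""
    root = ET.fromstring(xml_text)
    wanted = ("dc:title", "dc:creator", "cp:lastModifiedBy", "dcterms:created")
    found = {}
    for path in wanted:
        element = root.find(path, NAMESPACES)
        if element is not None:
            found[path] = element.text
    return found


print(read_core_properties(SAMPLE_CORE))
```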

In the screenshot below you can see some of the ‘document.xml’ content including the metadata that has been added in the body of the document (separately from the properties of the document).
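To show how the body text sits in ‘document.xml’, here is a small sketch that pulls the text out of each paragraph. It relies on the WordprocessingML structure of paragraphs (`w:p`) containing runs (`w:r`) containing text (`w:t`). The sample is a heavily simplified stand-in with invented values; a real document.xml carries far more markup.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's '{uri}' tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Simplified stand-in for word/document.xml with invented values.
SAMPLE_DOCUMENT = """<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Client Name: </w:t></w:r><w:r><w:t>Harpin</w:t></w:r></w:p>
    <w:p><w:r><w:t>Client Reference: C-0001</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def paragraph_texts(xml_text: str) -> list[str]:
    """Join the text runs of each paragraph into one string per paragraph."""
    root = ET.fromstring(xml_text)
    texts = []
    for paragraph in root.iter(W + "p"):
        runs = [node.text or "" for node in paragraph.iter(W + "t")]
        texts.append("".join(runs))
    return texts


for line in paragraph_texts(SAMPLE_DOCUMENT):
    print(line)
```

Word often splits one visible sentence across several runs, which is why the runs are joined per paragraph rather than printed one by one.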

All this metadata is accessible and is used by the Microsoft Graph.

Excel files

Excel files are interesting because, in a sense, they contain data within data. Here is some data in a spreadsheet:

This data is – strangely – stored in two different XML files. The text (including the column headings) is stored here: \xl\sharedStrings.xml:

The values are stored according to each worksheet. For example: \xl\worksheets\sheet1.xml (first two rows only)
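The two files fit together through an index. In SpreadsheetML, a cell marked `t="s"` holds a position in sharedStrings.xml rather than the text itself, while numeric cells hold their value directly. The sketch below resolves both kinds of cell. The sample data is invented and much smaller than a real workbook, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, in ElementTree's '{uri}' tag form.
S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

# Invented stand-ins for xl/sharedStrings.xml and xl/worksheets/sheet1.xml.
SAMPLE_SHARED_STRINGS = """<sst
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <si><t>Client</t></si>
  <si><t>Amount</t></si>
  <si><t>Harpin</t></si>
</sst>"""

SAMPLE_SHEET = """<worksheet
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c></row>
    <row r="2"><c r="A2" t="s"><v>2</v></c><c r="B2"><v>1250</v></c></row>
  </sheetData>
</worksheet>"""


def read_cells(shared_xml: str, sheet_xml: str) -> dict[str, str]:
    """Map each cell reference (such as 'A1') to its displayed text."""
    strings = ["".join(t.text or "" for t in item.iter(S + "t"))
               for item in ET.fromstring(shared_xml).iter(S + "si")]
    cells = {}
    for cell in ET.fromstring(sheet_xml).iter(S + "c"):
        value = cell.find(S + "v")
        if value is None:
            continue  # empty cell
        if cell.get("t") == "s":
            # Shared string: the stored value is an index into `strings`.
            cells[cell.get("r")] = strings[int(value.text)]
        else:
            cells[cell.get("r")] = value.text
    return cells


print(read_cells(SAMPLE_SHARED_STRINGS, SAMPLE_SHEET))
```

Storing each distinct string once and pointing to it from the cells is what makes the split look strange at first sight.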

A note about emails

Emails do not have the same XML-based structure as Office documents and generally cannot have additional metadata added (except as tags).

Emails in Outlook (sent or received) become ‘.msg’ files if saved to another location from Outlook.

The ‘.msg’ format is based on CFB_3, or compound file binary format, a format that was also used by earlier versions of Microsoft Office documents. It is ‘a general-purpose file format that provides a file-system-like structure within a file for the storage of arbitrary, application-specific streams of data’. (Source: Microsoft web page on Compound File Binary File Format).
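The difference between the two container formats is visible in the first few bytes of a file. As a minimal sketch, a compound file starts with the eight-byte signature D0 CF 11 E0 A1 B1 1A E1, while a zip-based Office document normally starts with ‘PK’ followed by bytes 03 04. The function name and return labels below are my own, and the check looks only at the header, not at whether the file is a valid message or document.

```python
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file header
ZIP_SIGNATURE = b"PK\x03\x04"                      # zip local file header


def container_type(first_bytes: bytes) -> str:
    """Classify a file by its leading bytes: 'cfb', 'zip' or 'unknown'."""
    if first_bytes.startswith(CFB_SIGNATURE):
        return "cfb"
    if first_bytes.startswith(ZIP_SIGNATURE):
        return "zip"
    return "unknown"


# Fabricated headers stand in for real files so the sketch runs on its own.
print(container_type(CFB_SIGNATURE + b"\x00" * 8))
print(container_type(ZIP_SIGNATURE + b"\x14\x00"))
print(container_type(b"plain text"))
```

To test a real file, read its first eight bytes in binary mode and pass them to the function.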

Copies of Microsoft Teams chat messages are also stored in a hidden folder in Exchange mailboxes, as instant messages. They cannot be accessed directly but should be considered as a type of archive copy – the originals are stored in a separate database.

If emails are saved to a SharePoint document library, they can be described with additional metadata while stored in the library, but this metadata does not become part of the core metadata of the email or remain with it if it is downloaded, as it does with other Office documents.

In any case, whether emails remain in Exchange/Outlook mailboxes or are copied and stored in SharePoint or other Microsoft-based locations, the metadata content in them is accessible via searches.

Active Directory completes the relationships

Every digital record has an author and is likely to have contributors (modified by). Every email is sent and received by someone. All of the internal names linked with digital content are recorded in an organisation’s Active Directory. Employees are also likely to be added to Security Groups (sometimes known as AD Groups) that provide a way to control access to IT resources.

The links between document-based content (documents, emails) and the people in AD Security Groups make it possible to establish relationships between content, people and business activities.

A final word

Importantly, managing records as data does NOT remove or exclude the business need or requirement to aggregate (e.g., in document libraries, mailboxes), manage through time, and then destroy or transfer records according to business requirements. Instead, it enhances this capability by ensuring that all records about a given subject or context can be identified and that, as much as possible, only one copy of the record exists.
