Posted in Electronic records, Information Management, Microsoft Graph, Office 365, Semantic Office, XML

Metadata Payloads in the Digital World

For at least twenty years, a core tenet of both document and records management has been the metadata that defined records. A number of metadata schemas were developed over the years, including the well-known Dublin Core (http://dublincore.org/documents/dces/), which defines 15 core metadata elements for digital content:

  • Contributor
  • Coverage
  • Creator
  • Date
  • Description
  • Format
  • Identifier
  • Language
  • Publisher
  • Relation
  • Rights
  • Source
  • Subject
  • Title
  • Type

Introduction of XML-based documents

In parallel with the development of metadata schemas, XML-based document formats (e.g., .docx, .odt) appeared from the early 2000s and brought a new way of both structuring and describing documents. Instead of being external to the document, metadata could be embedded within the document, making it effectively a type of ‘metadata payload’.

Around the same time that XML-based documents were introduced, I wrote about the ‘Semantic Office’. The Semantic Office drew on the same ideas developed and implemented for the ‘Semantic Web’. Conceptually, the idea was quite simple – just as web pages would contain their own embedded metadata in the form of Resource Description Framework (RDF) triples (subject – predicate – object, e.g., sky – is – blue), common office content such as Outlook emails, Word documents and Excel spreadsheets could carry its own embedded metadata ‘payload’.

Some of this metadata is visible in the Properties pane of a record, but only as descriptive terms, not as metadata defined against a specific schema.

The (mostly overlooked and under-reported) outcome of the introduction of XML-based documents was that a document could be stored anywhere and be found again based on the embedded metadata – as opposed to finding it through metadata that was created and managed separately from the record (for example, in a document management system). For some reason, however, the predominant and persistent model for document management has been to store metadata about a document separately from the document.

In most document and records management systems since the late 1990s, digital records (emails included, if they are saved to the DRMS) were/are stored in secure file shares while the metadata about the record (including its ‘file’ or ‘container’ identifier) was stored in a separate database. Visually this gives the user the illusion that the records are stored ‘in’ a container even though they are actually stored in a network file share.

This pervasive document management model is conceptually similar to the way computers record metadata about documents stored in a Windows NT File System (NTFS) in the Windows Master File Table (MFT). MFT entries include details of the size, time and date stamps, permissions, and so on. It assumes that the actual location of the record is recorded in the metadata.

How XML-based documents embed metadata

XML-based Office documents (as well as PDFs and image files), however, retain core metadata information within the document itself. The information is accessible regardless of where the document is stored.

Ironically (perhaps) it may be different from any external metadata used to describe the document.

To view the embedded metadata in a Word document you only need to rename it to .zip and then unzip it. Extracting a zipped Word document reveals (in most cases) several folders and one XML file:

  • [trash] – contains ‘dat’ files (may not be present in all documents)
  • _rels – contains the ‘.rels’ XML document
  • customXml – contains a number of ‘item’ and ‘itemProps’ XML documents
  • docProps – contains three very small files: app.xml, core.xml, custom.xml
  • word – contains a range of XML files and additional folders with other XML files.
  • [Content_Types].xml
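
The rename-and-unzip step can also be done programmatically, because a .docx file is an ordinary ZIP archive. The following is a minimal Python sketch using only the standard library; the function names and the tiny stand-in package are hypothetical and exist only so the example runs without a real document.

```python
import io
import zipfile


def list_package_parts(docx_bytes: bytes) -> list[str]:
    """Return the sorted names of every part inside a .docx (ZIP) package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(package.namelist())


def build_demo_package() -> bytes:
    """Build a tiny stand-in package (not a valid Word file) for illustration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("docProps/core.xml", "<coreProperties/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


# With a real file, read its bytes instead of using the stand-in:
#   with open("example.docx", "rb") as handle:   # hypothetical file name
#       parts = list_package_parts(handle.read())
parts = list_package_parts(build_demo_package())
```

For a real Word document the returned list should include entries under the folders described above, such as docProps and word.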

In one example Word document downloaded from a SharePoint library, the file ‘item4.xml’ in the ‘customXml’ folder contained both XML namespace (xmlns) information as well as the embedded document management elements (highlighted in bold):

A separate XML document, also located in the ‘customXml’ folder, contained the following core properties, including most of the Dublin Core elements listed above (but note that they are all blank).

Arguably, the body of the record is also a form of metadata, enclosed by the terms <body>text</body>. In the example document downloaded from SharePoint, the body of the document is contained in the file ‘document.xml’ under the ‘word’ folder of the package.

    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se wp14">
    <w:body>
      <w:p w14:paraId="195D8795" w14:textId="77777777" w:rsidR="0001502C" w:rsidRDefault="00880316">
        <w:r>
          <w:t>Test document</w:t>
        </w:r>
      </w:p>
      <w:p w14:paraId="195D8796" w14:textId="77D86E32" w:rsidR="006832E2" w:rsidRDefault="006832E2" w:rsidP="006832E2">
        <w:r>
          <w:t>Lorem ipsum (and the rest of the text, deleted for brevity)</w:t>
        </w:r>
        <w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>
      </w:p>
      <w:sectPr w:rsidR="006832E2">
        <w:pgSz w:w="11906" w:h="16838"/>
        <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/>
        <w:cols w:space="708"/>
        <w:docGrid w:linePitch="360"/>
      </w:sectPr>
    </w:body>
    </w:document>
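
To show how the body text can be pulled back out of ‘document.xml’, here is a small Python sketch. It assumes the standard WordprocessingML namespace for the ‘w’ prefix; the function name and the cut-down sample XML are hypothetical, and the approach is deliberately simple (it does not treat tables, text boxes or tracked changes specially).

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "w" prefix in WordprocessingML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(document_xml: str) -> list[str]:
    """Return the text of each <w:p> paragraph by joining its <w:t> runs."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = paragraph.iter(f"{{{W_NS}}}t")
        paragraphs.append("".join(run.text or "" for run in runs))
    return paragraphs


# Cut-down stand-in for word/document.xml (illustrative only).
SAMPLE_DOCUMENT_XML = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Test document</w:t></w:r></w:p>
    <w:p><w:r><w:t>Lorem ipsum</w:t></w:r></w:p>
  </w:body>
</w:document>"""

texts = paragraph_texts(SAMPLE_DOCUMENT_XML)
```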

Other core metadata elements are contained in the ‘core.xml’ file:
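
As a rough illustration of reading that file programmatically, the Python sketch below opens a package in memory and returns the Dublin Core elements found in ‘docProps/core.xml’. The namespace URIs are the standard ones for Dublin Core and for Open Packaging core properties; the function names and the stand-in package are hypothetical and are not taken from the example document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def dublin_core_properties(docx_bytes: bytes) -> dict[str, str]:
    """Map each Dublin Core element in docProps/core.xml to its text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    prefix = f"{{{DC_NS}}}"
    return {
        child.tag[len(prefix):]: (child.text or "")
        for child in root
        if child.tag.startswith(prefix)
    }


def build_demo_package() -> bytes:
    """Build a stand-in package holding only a small core.xml part."""
    core_xml = (
        f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dc="{DC_NS}">'
        "<dc:title>Test document</dc:title>"
        "<dc:creator/>"  # blank, as embedded elements often are
        "<cp:revision>1</cp:revision>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core_xml)
    return buffer.getvalue()


properties = dublin_core_properties(build_demo_package())
```

Elements outside the Dublin Core namespace (such as the revision number in the stand-in) are skipped, so the result contains only the title and the blank creator.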

Why is this important?

The existence of – and ability to make use of – embedded metadata seems to have been overlooked since the introduction of these types of records over 15 years ago. This may have been primarily because no-one had a system in place to access or use that data in any meaningful way.

Instead, most records continued to be defined by metadata that is created or captured and managed separately from the record itself.

The problems with storing metadata separately from the record are that: (a) the external metadata may be different from the embedded metadata, and (b) the external metadata may unnecessarily limit or restrict the ability to see the record in different contexts.

For example, one person may assign a specific metadata term, such as a function from the Business Classification Scheme (BCS) to the digital record, or assign it to a specific ‘container’. Some time later, another person may try to find the same record but discover it is not in the same file, or assigned to the same function term. They are likely to be looking for the record in or from a completely different context.

The only way they may be able to find it is by doing a general search that includes the body or content of the records, something I found to be the case in real life scenarios where users couldn’t find the records they were looking for based on metadata searches.

Of course, metadata is still important, but my point is the difference between embedded metadata that can be added when the document is saved to a document library, and external metadata that is stored separately from the digital record.

Being able to use the metadata embedded in records, wherever they are stored, is a much more powerful way to find and connect this information, similar to the way the application of metadata to web pages facilitates access.

Records Description Framework

A core part of the world wide web is the application of metadata to web pages to facilitate their discovery in a highly connected world. The core elements of this metadata are defined in the World Wide Web Consortium (W3C)’s Resource Description Framework, or RDF.

To quote the World Wide Web Consortium (W3C):

‘RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a “triple”). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications. This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.’ (Source: https://www.w3.org/RDF/)
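
The triple model described in that quote can be sketched in a few lines of Python. This is only an in-memory illustration of subject – predicate – object statements forming a directed, labelled graph, not an RDF library; the example.org URIs and the helper names are hypothetical, while the two predicate URIs are the Dublin Core ‘creator’ and ‘subject’ elements.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str    # node the edge starts from
    predicate: str  # label naming the relationship
    obj: str        # node the edge points to


def match(triples: list[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> list[Triple]:
    """Return triples matching every field given; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
FINANCE = "https://example.org/terms/finance"

GRAPH = [
    Triple("https://example.org/docs/report.docx", DC_CREATOR, "https://example.org/people/alice"),
    Triple("https://example.org/docs/report.docx", DC_SUBJECT, FINANCE),
    Triple("https://example.org/docs/budget.xlsx", DC_SUBJECT, FINANCE),
]

# Find every resource linked to the "finance" term, wherever it is stored.
finance_docs = [t.subject for t in match(GRAPH, predicate=DC_SUBJECT, obj=FINANCE)]
```

Because the query follows named links rather than folder locations, both documents are found even though nothing else connects them.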

It is perhaps not surprising that Microsoft named the analytic engine behind Office 365 the Microsoft Graph.

According to Microsoft:

‘Microsoft Graph is made up of resources connected by relationships. For example, a user can be connected to a group through a memberOf relationship, and to another user through a manager relationship. Your app can traverse these relationships to access these connected resources and perform actions on them through the API. You can also get valuable insights and intelligence about the data from Microsoft Graph. For example, you can get the popular files trending around a particular user, or get the most relevant people around a user.’ (Source: https://developer.microsoft.com/en-us/graph/docs/concepts/overview)
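
The relationships named in that quote correspond to paths in the Microsoft Graph REST API. The Python sketch below only builds the request URLs for a given user and makes no network call; an actual request would also need an OAuth access token with suitable permissions. The helper name and the example address are hypothetical.

```python
from urllib.parse import quote

GRAPH_ROOT = "https://graph.microsoft.com/v1.0"

# Relationship name -> path segment beneath /users/{id}.
RELATIONSHIPS = {
    "memberOf": "memberOf",           # groups the user belongs to
    "manager": "manager",             # the user's manager
    "trending": "insights/trending",  # files trending around the user
    "people": "people",               # most relevant people around the user
}


def relationship_url(user_id: str, relationship: str) -> str:
    """Build the Graph URL that traverses one relationship from a user."""
    if relationship not in RELATIONSHIPS:
        raise ValueError(f"unknown relationship: {relationship}")
    return f"{GRAPH_ROOT}/users/{quote(user_id, safe='')}/{RELATIONSHIPS[relationship]}"


manager_url = relationship_url("alice@example.com", "manager")
```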

The RDF model is also used in knowledge management applications such as Protégé, which supports the creation and use of RDF/XML ontologies.

Implications

In my opinion, the implications of XML-based office content (which has been around for over 10 years now) are quite important for records management theory and practice.

In a SharePoint document library, as in traditional EDRM systems, documents are visually displayed ‘in’ the library. Each document nevertheless retains its own originally assigned metadata even if it is downloaded – unless the user uses the ‘Check for Issues’ – ‘Inspect Document’ option from the Info panel to remove it.

The ability to store metadata properties directly in the document facilitates locating and retrieving documents that have the same, similar or related properties via the Microsoft Graph. In the same way that web pages use RDF triples, this allows otherwise unconnected resources to be linked and presented to the user (subject to any security controls) automatically, based on their specific context.

In other words, instead of records being locked to a specific container based on their metadata being stored in a database, records could be discovered and linked wherever they are located based on their embedded metadata.

Relevance of W3 XML schema to Office 365 content

The use of RDF-based metadata embedded in Office documents in Office 365 means that this data can be used to link resources in a way that supports the discovery of the resources. It allows for cross-linking of information. Documents with metadata payloads are one of the many resources that can be connected in this way.

For example, ‘… a user can be connected to a group through a ‘memberOf’ relationship, and to another user through a manager relationship. Your app can traverse these relationships to access these connected resources and perform actions on them through the API. You can also get valuable insights and intelligence about the data from Microsoft Graph. For example, you can get the popular files trending around a particular user, or get the most relevant people around a user.’ (Source: https://developer.microsoft.com/en-us/graph/docs/concepts/overview)

‘Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications. This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.’ (Source: https://www.w3.org/RDF/)

Posted in Electronic records, Governance, Information Management, Office 365, Products and applications, Records management, Retention and disposal, SharePoint Online

Four observations about Office 365/SharePoint Online and records management

The following is a slightly modified version of four points I made recently to a records management professional, responding to the point that ‘many CIOs are rolling out Office 365 and SharePoint Online to replace traditional recordkeeping systems such as TRIM/CM etc’.

First, generally speaking, records managers have traditionally not had a strong technical knowledge and/or weren’t close to the IT team.

Even if they managed TRIM/CM/other EDRM it was usually as the front end admin, not the back end technical IT admin, which remained with IT. Conversely, IT people have generally never had much knowledge of how to manage records (it is not usually part of their skill set).

There was almost always a gap (technical, organisational, communication etc) between the records area and IT; consequently, IT departments have rolled out SharePoint and more recently Office 365 without reference to (or the feeling they even needed to refer to) records managers, and often without a solid architecture and planning for implementing and managing SharePoint (or Office 365).

Into the space between IT and records (but usually closer to IT) have stepped various vendors offering products that, they say, do the records management that SharePoint does not do.

This by the way is not a criticism of those vendors as such, but there has been a tendency to buy their products without really understanding what the base product can do. This has been the case for many IT products – back in 2006/7 I was part of a team looking to acquire a major ECM product and was a trained system administrator. The product itself could do exactly what was required without any modifications; the problem was that the client (the company I worked for) wanted modifications that required consulting work. Close to a million dollars later in consulting fees, the product was still unused.

I’m also concerned at the way some vendors pitch the suitability or ‘compliance’ of their products in relation to add-ons to SharePoint for managing records. I had one telling me in all sincerity that their product ‘complied with ISO 15489’, which was interesting to hear since there is no compliance framework. The same vendor’s salesman was not aware of ISO 16175 when I asked about it.

Second, from SharePoint 2010 onwards, Microsoft implemented a range of new records management functionality to meet minimum (mostly corporate rather than government) requirements for managing records.

That new functionality included many more features than most people knew about. One Australian consultant (John Wise) identified that SharePoint 2010 met 88% of the requirements of the then ICA standard that became ISO 16175 Part 2. For most non-government organisations that didn’t need the level of information security found in government, it was closer to 95%, and the 5% remaining was not particularly important for most organisations. With the introduction of both retention/disposal policy management, and information security classifications, via the Security and Compliance Centre in the Office 365 admin portal, SharePoint meets almost all requirements listed in ISO 16175 that do not refer to legacy systems.

In many respects, by ignoring ‘traditional’ ways that other EDRM systems have managed records, Microsoft introduced a brand new paradigm for managing records, underlined by the idea that digital records do not work the same way as paper records.

In my view, many older EDRM products failed to adapt to the new digital world and continued to enforce the concept that records must be ‘moved’ (saved to) a container in the recordkeeping system just as paper records had to be saved onto a single subject file. As long as Exchange and network file shares remained completely separate, this meant (and continues to mean) that the original versions of those records always remained in Exchange/network file shares even after they were copied to the EDRM.

A much smarter model, which SharePoint Online offers via both the create and save processes, is to allow people to save non-email records directly to SharePoint, including in synchronised document libraries in File Explorer; the document libraries can have default metadata applied to content types, and retention policies can be applied to those libraries. Emails can be moved automatically via Flow, or retained in the mailboxes with Office 365 retention policies applied. Recordkeeping happens in the background; people don’t have to fill in a form every time they want to save a record to the system.

Microsoft have centralised records management across the Office 365 environment. For example, the creation and management of records disposal/retention classes (called ‘classification policies’) is now carried out in the Security and Compliance Admin centre of the Office 365 portal. Records managers need to be assigned specific roles to do what they need to do (and I would argue, the corporate records managers should also be Site Collection Administrators on every site, preferably via a Security Group).

It doesn’t matter if the record is in Exchange or in SharePoint (or some of the other Office 365 applications); a classification policy can be applied wherever it is. When implemented correctly (based on a good architecture model), classification policies can provide the recordkeeping context required to link records over time.

Third, just like a home subscription to Office 365 with cloud storage is more cost effective than buying the product as before, most IT organisations have seen the benefits of moving their enterprise agreement licensing from per-device licence (where the licence is based on the computer) to a per-user licence (where the user can use the product on multiple machines including mobile devices or from home). This has also allowed them to shift storage (and the costs of maintaining servers, including technical staff) from their own or hosted data centres to the Microsoft cloud (which, ironically, may be in the same hosted data centre).

One large organisation that I’m familiar with had around 30TB of storage in the data centre; by acquiring Office 365 E3/E1 licences, they had 45TB – PLUS, 1TB for each user’s OneDrive. I suspect this point is not known to most records managers (first point above), who simply see the CIOs introducing or rolling out Office 365 for no obvious reason.

Fourth, SharePoint has traditionally been many things to different people because it has always had a dual nature – publishing/intranet and team sites.

This is no different in SharePoint Online but the options to customise are now fewer (thankfully). Communication sites are a simple and elegant way to publish information, while team sites (including Office 365 Group-based team sites) are more or less the functional replacement for network drives (OneDrive for Business replaces personal drives).

In my opinion, it is important for anyone getting involved with SharePoint to understand this – that SharePoint Online is NOT the same as the ‘old’ SharePoint on-premise that could be customised to do just about anything.

Keep it simple, using the very rich ‘out of the box’ options, and it begins to make more sense. Plus, as noted already, users can synchronise SharePoint document libraries to File Explorer and work from there, so their experience can be more or less exactly what it is now using network drives.

Can you manage records in SharePoint Online? Absolutely, keeping in mind that SharePoint Online is very much a part of the Office 365 ecosystem and should not be considered a standalone application as it was when installed in an on-premise server.

Records managers need to get up to speed (quickly, in my opinion, although I’ve been saying it for years) not only with the recordkeeping functionality already in SharePoint Online, but also with the Office 365 portal and the relevant parts of the Security and Compliance Admin Centre, including classification policies, eDiscovery options and audit options. They should also be SharePoint System Administrators (to give them access to the SharePoint Admin portal) and Site Collection Administrators.