Archive for the ‘Semantic Office’ Category

Metadata Payloads in the Digital World

March 19, 2019

For at least twenty years, a core tenet of both document and records management has been the metadata that defined records. A number of metadata schemas were developed over the years, including the well-known Dublin Core (http://dublincore.org/documents/dces/), which defined 15 core metadata elements for digital content:

  • Contributor
  • Coverage
  • Creator
  • Date
  • Description
  • Format
  • Identifier
  • Language
  • Publisher
  • Relation
  • Rights
  • Source
  • Subject
  • Title
  • Type

Introduction of XML based documents

Parallel with the development of metadata schemas, the arrival of XML-based documents (e.g., .docx, .odt) from the early 2000s introduced a new way of both structuring and describing documents. Instead of being external to the document, metadata could be embedded within the document, making it effectively a type of ‘metadata payload’.

Around the same time that XML-based documents were introduced, I wrote about the ‘Semantic Office’. The Semantic Office drew on the same ideas developed and implemented for the ‘Semantic Web’. Conceptually, the idea was quite simple – just as web pages would contain their own embedded metadata in the form of Resource Description Framework (RDF) triples (subject – predicate – object, e.g., sky – is – blue), common office documents such as those created in Outlook, Word and Excel could carry their own embedded metadata ‘payload’.

Some of this metadata is visible in the Properties pane of a record, but only as descriptive terms, not as metadata defined against a specific schema.

The (mostly overlooked and under-reported) outcome of the introduction of XML-based documents was that a document could be stored anywhere and be found again based on the embedded metadata – as opposed to finding it through metadata that was created and managed separately from the record (for example, in a document management system). For some reason, however, the predominant and persistent model for document management has been to store metadata about a document separately from the document.

In most document and records management systems since the late 1990s, digital records (emails included, if they are saved to the DRMS) were/are stored in secure file shares while the metadata about the record (including its ‘file’ or ‘container’ identifier) was stored in a separate database. Visually this gives the user the illusion that the records are stored ‘in’ a container even though they are actually stored in a network file share.

This pervasive document management model is conceptually similar to the way computers record metadata about documents stored in a Windows NT File System (NTFS) in the Windows Master File Table (MFT). MFT entries include details of the size, time and date stamps, permissions, and so on. It assumes that the actual location of the record is recorded in the metadata.

How XML-based documents embed metadata

XML-based Office documents (as well as PDFs and image files), however, retain core metadata information within the document itself. The information is accessible regardless of where the document is stored.

Ironically (perhaps) it may be different from any external metadata used to describe the document.

To view the embedded metadata in a Word document you only need to change its file extension to .zip and then unzip it. Extracting a zipped Word document reveals (in most cases) several folders and one XML file:

  • [trash] – contains ‘dat’ files (may not be present in all documents)
  • _rels – contains the ‘.rels’ XML document
  • customXml – contains a number of ‘item’ and ‘itemProps’ XML documents
  • docProps – contains three very small files: app.xml, core.xml, custom.xml
  • word – contains a range of XML files and additional folders with other XML files.
  • [Content_Types].xml
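
The same package listing can be produced without renaming the file, because a .docx file is already a zip archive. The following is a minimal sketch in Python using only the standard library; the function names are hypothetical and chosen for illustration.

```python
import zipfile


def list_package_parts(path):
    """Return the names of all files stored inside an Office package."""
    # A .docx file is a zip archive, so no renaming is needed to open it.
    with zipfile.ZipFile(path) as package:
        return sorted(package.namelist())


def top_level_entries(path):
    """Return the distinct top-level folders and files in the package."""
    # 'docProps/core.xml' -> 'docProps'; '[Content_Types].xml' stays as is.
    return sorted({name.split("/", 1)[0] for name in list_package_parts(path)})
```

Calling top_level_entries on a Word document should return entries such as ‘_rels’, ‘docProps’ and ‘word’, although the exact set varies from document to document.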

In one example Word document downloaded from a SharePoint library, the file ‘item4.xml’ in the ‘customXml’ folder contained both XML namespace (xmlns) information as well as the embedded document management elements (highlighted in bold):

A separate XML document also located in the ‘customXml’ folder contained the following core properties, including most of the Dublin Core elements listed above (but note that they are all blank).

Arguably, the body of the record is also a form of metadata, enclosed by the terms <body>text</body>. In the example document downloaded from SharePoint, the body of the document is contained in the file ‘document.xml’ under the ‘word’ folder of the package.

    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se wp14">
      <w:body>
        <w:p w14:paraId="195D8795" w14:textId="77777777" w:rsidR="0001502C" w:rsidRDefault="00880316">
          <w:r>
            <w:t>Test document</w:t>
          </w:r>
        </w:p>
        <w:p w14:paraId="195D8796" w14:textId="77D86E32" w:rsidR="006832E2" w:rsidRDefault="006832E2" w:rsidP="006832E2">
          <w:r>
            <w:t>Lorem ipsum (and the rest of the text, deleted for brevity)</w:t>
          </w:r>
          <w:bookmarkStart w:id="0" w:name="_GoBack"/>
          <w:bookmarkEnd w:id="0"/>
        </w:p>
        <w:sectPr w:rsidR="006832E2">
          <w:pgSz w:w="11906" w:h="16838"/>
          <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/>
          <w:cols w:space="708"/>
          <w:docGrid w:linePitch="360"/>
        </w:sectPr>
      </w:body>
    </w:document>
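
As a rough illustration of how the body text in the listing above can be read programmatically, the sketch below collects the text of each paragraph (w:p) from its text runs (w:t). It uses only the Python standard library; the function names are hypothetical, and it ignores tables, headers, footnotes and other structures a real document may contain.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace bound to the 'w:' prefix in WordprocessingML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_from_document_xml(xml_bytes):
    """Return one string per <w:p> element, joining its <w:t> text runs."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = paragraph.iter(f"{{{W_NS}}}t")
        paragraphs.append("".join(run.text or "" for run in runs))
    return paragraphs


def read_body_text(path):
    """Read 'word/document.xml' from a .docx package and return its paragraphs."""
    with zipfile.ZipFile(path) as package:
        return paragraphs_from_document_xml(package.read("word/document.xml"))
```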

Other core metadata elements are contained in the ‘core.xml’ file:
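
The elements in ‘core.xml’ can be read in much the same way as any other XML file. The sketch below is one possible approach, not the only one: it assumes the standard namespaces used by Office packages for core properties (including the Dublin Core namespaces) and returns whatever elements it finds as a dictionary. The function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces commonly declared in docProps/core.xml.
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
PREFIXES = {uri: prefix for prefix, uri in NAMESPACES.items()}


def parse_core_properties(xml_bytes):
    """Map each child of the root element to 'prefix:name' -> text."""
    root = ET.fromstring(xml_bytes)
    properties = {}
    for element in root:
        tag = element.tag
        if tag.startswith("{"):
            # ElementTree writes namespaced tags as '{uri}localname'.
            uri, _, local_name = tag[1:].partition("}")
            key = f"{PREFIXES.get(uri, uri)}:{local_name}"
        else:
            key = tag
        properties[key] = (element.text or "").strip()
    return properties


def read_core_properties(path):
    """Read 'docProps/core.xml' from an Office package."""
    with zipfile.ZipFile(path) as package:
        return parse_core_properties(package.read("docProps/core.xml"))
```

Blank elements come back as empty strings, which makes it easy to see which Dublin Core elements a document actually carries.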

Why is this important?

The existence of – and ability to make use of – embedded metadata seems to have been overlooked since the introduction of these types of records over 15 years ago. This may have been primarily because no-one had a system in place to access or use that data in any meaningful way.

Instead, most records continued to be defined by metadata that is created or captured and managed separately from the record itself.

The problems with storing metadata separately from the record are that: (a) the external metadata may be different from the embedded metadata, and (b) the external metadata may unnecessarily limit or restrict the ability to see the record in different contexts.

For example, one person may assign a specific metadata term, such as a function from the Business Classification Scheme (BCS) to the digital record, or assign it to a specific ‘container’. Some time later, another person may try to find the same record but discover it is not in the same file, or assigned to the same function term. They are likely to be looking for the record in or from a completely different context.

The only way they may be able to find it is by doing a general search that includes the body or content of the records, something I found to be the case in real life scenarios where users couldn’t find the records they were looking for based on metadata searches.

Of course, metadata is still important, but my point is the difference between embedded metadata that can be added when the document is saved to a document library, and external metadata that is stored separately from the digital record.

Being able to leverage the metadata embedded in records, wherever they are stored, is a much more powerful way to find and use this information, similar to the way the application of metadata to web pages facilitates access.

Records Description Framework

A core part of the world wide web is the application of metadata to web pages to facilitate their discovery in a highly connected world. The core elements of this metadata are defined in the World Wide Web Consortium (W3C)’s Resource Description Framework, or RDF.

To quote the World Wide Web Consortium (W3C):

‘RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a “triple”). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications. This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.’ (Source: https://www.w3.org/RDF/)
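
The triple model described in the quote can be sketched in a few lines of Python. This is a deliberate simplification: real RDF names subjects and predicates with URIs, whereas the example below uses plain strings, and every name in it apart from sky – is – blue is hypothetical.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object).
TRIPLES = [
    ("sky", "is", "blue"),
    ("report.docx", "creator", "alice"),
    ("minutes.docx", "creator", "alice"),
    ("alice", "memberOf", "records-team"),
]


def build_graph(triples):
    """Build a directed, labelled graph: subject -> list of (predicate, object)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def subjects_with(triples, predicate, obj):
    """Return every subject linked to obj by the given predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]
```

Here subjects_with(TRIPLES, "creator", "alice") returns both documents, even though nothing in the data says where either one is stored.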

It is perhaps not surprising that Microsoft named the analytic engine behind Office 365 the Microsoft Graph.

According to Microsoft:

‘Microsoft Graph is made up of resources connected by relationships. For example, a user can be connected to a group through a memberOf relationship, and to another user through a manager relationship. Your app can traverse these relationships to access these connected resources and perform actions on them through the API. You can also get valuable insights and intelligence about the data from Microsoft Graph. For example, you can get the popular files trending around a particular user, or get the most relevant people around a user.‘ (Source: https://developer.microsoft.com/en-us/graph/docs/concepts/overview)
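
To make the idea of traversing relationships concrete, the sketch below builds the request URLs for the relationships mentioned in the quote. It only constructs the URLs and does not send any request; calling the API also requires an access token with appropriate permissions, which is outside the scope of this example. The helper name and the sample user identifier are hypothetical.

```python
from urllib.parse import quote

GRAPH_ROOT = "https://graph.microsoft.com/v1.0"

# Relationship paths that correspond to the examples in the quote.
RELATIONSHIPS = {
    "groups": "memberOf",
    "manager": "manager",
    "trending_files": "insights/trending",
    "relevant_people": "people",
}


def relationship_url(user_id, relationship):
    """Return the Graph URL for one relationship of the given user."""
    path = RELATIONSHIPS[relationship]
    return f"{GRAPH_ROOT}/users/{quote(user_id, safe='@')}/{path}"
```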


The RDF model is also used in knowledge management applications such as Protégé, which supports the creation and use of RDF/XML ontologies.

Implications

In my opinion, the implications of XML-based office content (which has been around for over 10 years now) are quite important for records management theory and practice.

In SharePoint, as in traditional EDRM systems, documents are visually displayed ‘in’ the document library. However, each document retains its own originally assigned metadata even if it is downloaded – unless the user uses the ‘Check for Issues’ – ‘Inspect Document’ option from the Info panel to remove it.

The ability to store metadata properties directly in the document facilitates locating and retrieving documents that have the same, similar or related properties via the Microsoft Graph. In the same way that web pages use RDF triples, this allows otherwise unconnected resources to be linked and presented to the user (subject to any security controls) automatically, based on their specific context.

In other words, instead of records being locked to a specific container based on their metadata being stored in a database, records could be discovered and linked wherever they are located based on their embedded metadata.
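
As a small-scale illustration of discovery by embedded metadata, the sketch below walks a folder tree and returns every .docx file whose embedded Dublin Core creator matches a given name, regardless of which folder it sits in. It is a simplified stand-in for what a search or graph engine does at scale, uses only the Python standard library, and the function names are hypothetical.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core 'creator' element as written in docProps/core.xml.
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"


def embedded_creator(path):
    """Return the creator embedded in the document, or None if unavailable."""
    try:
        with zipfile.ZipFile(path) as package:
            root = ET.fromstring(package.read("docProps/core.xml"))
    except (zipfile.BadZipFile, KeyError, ET.ParseError):
        # Not a zip package, no core.xml part, or malformed XML.
        return None
    element = root.find(DC_CREATOR)
    return element.text if element is not None else None


def find_by_creator(root_dir, creator):
    """Return paths of all .docx files under root_dir with the given creator."""
    matches = []
    for folder, _subfolders, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.lower().endswith(".docx"):
                path = os.path.join(folder, filename)
                if embedded_creator(path) == creator:
                    matches.append(path)
    return sorted(matches)
```

The folder a document lives in plays no part in the match; only the metadata carried inside the file does.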

Relevance of W3 XML schema to Office 365 content

The use of RDF-based metadata embedded in Office documents in Office 365 means that this data can be used to link resources in a way that supports the discovery of the resources. It allows for cross-linking of information. Documents with metadata payloads are one of the many resources that can be connected in this way.

For example, ‘… a user can be connected to a group through a ‘memberOf’ relationship, and to another user through a manager relationship. Your app can traverse these relationships to access these connected resources and perform actions on them through the API. You can also get valuable insights and intelligence about the data from Microsoft Graph. For example, you can get the popular files trending around a particular user, or get the most relevant people around a user.’ (Source: https://developer.microsoft.com/en-us/graph/docs/concepts/overview)

‘Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications. This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.’ (Source: https://www.w3.org/RDF/)


Folders uninterrupted – the enduring problem of folders

November 19, 2015

In September 2014 I presented at the Records and Information Management Professionals Australasia (RIMPA) annual conference in Adelaide, South Australia, on the subject of folders and how it is possible to manage digital content differently using products like SharePoint.

Last week a couple of people in completely different parts of the company I work for demonstrated clearly the problem of folders in Outlook and on network drives. The first had at least 100 folders and sub-folders in her Outlook where she said she ‘filed’ everything so she could find it again. The other person showed me a newly created section of the network drives with hundreds of new folders and sub-folders set up by someone who didn’t like or get the existing agreed folder structure that has been in place for many years.

In both cases I asked the same question – how do you find anything? The answer, in most cases, is that the person clicks on folder names that they created or are familiar with, until they find what they are looking for. Search was usually relegated to a last-ditch attempt to find things but was not considered to be very helpful or accurate. Both individuals lamented the duplication of information in folders, duplication they often didn’t know about until the inevitable clash of versions that occurs regularly.

At a workshop/presentation I did for another organisation recently, I was told that the organisation relied heavily on what are called ‘titling conventions’ to enable the information to be found. That is, the names of files, folders and documents follow a pre-defined model.

In reply to my comment that such titling was surely less valuable when you could search for the content in their document management system, I was advised that they didn’t yet have the ability to search by the content of documents (a resource issue). They could only search for content based on titles, once again reinforcing the need for folders (with unambiguous and consistent titles) to find that information.

Breaking the habit

As I have said many times, folders are a very hard habit to break. Given the generally poor ability to search and find content in email and network drives (and in some EDRM systems – still), search is relegated to second place and folders and pre-defined titling conventions continue to dominate where content searching is not available.

For at least fifteen years, working across a range of organisations, I have had the ability to search within the content of digital records. This includes being able to search the content of scanned documents that have been subject to optical character recognition (OCR). In this digital age I’d be very disappointed if the only way I could find content was by navigating labyrinthine folder structures on email or network drives.

I have fewer than five sub-folders under my work email inbox. I use those additional folders to route specific content such as junk mail, SharePoint notifications or personal email. I don’t save anything to a network drive and only use SharePoint to store and retrieve records and other corporate information. Why? Because search is really good.

I know this is not common practice around the organisation. Common practice is to create and maintain multiple folders and squirrel useful content away in network drive folders. It’s been that way for 30 years.

I don’t think folders are going to vanish from the landscape anytime soon – they are no more likely to vanish in the next 10 years than email, although both are slowly morphing over time into something different, particularly as ‘collaboration’ tools, search, and the presentation of information via analytics engines are introduced.

Content finds you

Facebook and Google are successful for a reason – they both present content (including advertising) relevant to the user based on her or his digital content. A simple analogy is to compare the corner store where you know you can go to get milk with the ability of digital devices to remind a user they need to buy a specific type of milk because (a) their internet-linked refrigerator has noticed that milk is running low and (b) they regularly post to social media (or even in a single email) about their preference for this type of milk. It’s creepy, but it works.

Many contemporary office environments are not much different – users know they find their digital content in the folders (the little corner stores) they created or have learned about. But they don’t know how much other interesting or directly relevant content is there and they don’t bother looking because they know it will be a poor and time-wasting experience.

They don’t know what they don’t know.

My experience over the past five years with SharePoint (and with similar products before it) is that users are resistant to ‘having’ to store documents in a different location to what they are used to (i.e. network drives).

Users worry their content will be lost, it’s ‘yet another place’ to store and find content, and the user experience (compared with folders) is alien. This doesn’t stop them consigning endless information to social media and being the recipients of targeted advertising and suggested friends based on their content. The contrast between the two is so extraordinary that it is sometimes hard to believe that users do this, checking in on Facebook’s content-driven feed while ‘navigating’ through endless folders, many with names that are about as vague as you can get.

Where is this heading?

In recent weeks Microsoft has outlined what it sees as the future of the office and their Office. Three key elements of that future are:

  • SharePoint (on-premises and online) and OneDrive. These two will not replace folders completely but will give users a much better (and more importantly, more mobile) experience with storing and finding content.
  • Office 365 Groups. Groups of people who have a common interest and can communicate across platforms, whether it be via email or other collaboration tools including SharePoint or Yammer.
  • Analytics as the way we will harness and access our information, via interfaces like Delve that serve up relevant information based on the user’s digital content. Just like Facebook.

Does Delve deliver on the ‘Semantic Office’?

Many years ago I wrote on this blog about what I called at the time the ‘Semantic Office’. In that article I asked whether we will one day experience a time when we can truly harness the digital content in our business systems to link and present relevant information to users based on the same methods found on the internet at that time (for example in eBay, Amazon etc).

I think Microsoft has finally figured this out with Delve. My guess is that, by 2020, Delve or a version of it will be the most common way users first access their content.

So where does this leave folders?

Folders aren’t going to go away anytime soon. I’d like to think that Microsoft took a close look at how they could deprecate folders in email and drives and realised that this was akin to giving up on email – it wasn’t going to happen.

Instead, Microsoft decided to introduce new tools that can capture and – more importantly, perhaps – present information in completely different ways that digital natives expect and even take for granted. As someone said recently, when a large group of young people were asked to send an email, many responded with ‘what’s email, why can’t I Inbox you?’.

Change is happening, slowly, along with the culture shift that must accompany it. For every older ‘paper native’ worker who retires, a younger digital native enters the workforce, bringing with them new expectations both in terms of work experience and the tools they use.

Just like the clacking of manual typewriters in offices was eventually replaced by the slightly smoother sound of electronic typewriters, and then the click-clack of keyboards hooked to the back of computers, the silent touch screens of the future will eventually dominate and then themselves be replaced by something else.

Habits are hard to break, but change happens.