Ever since computers appeared in the workplace, it has been common practice to download, copy or move documents, or attach them to emails. Consequently, documents are frequently duplicated (including in multiple mailboxes) and, in the absence of useful metadata to assist in identification, it can be difficult to locate the original ‘source of truth’.
This post describes how Microsoft Office documents stored in SharePoint can retain ‘source of origin’ and other metadata even when they are downloaded, attached to emails or moved, thereby helping to identify the original source or storage location.
This feature of SharePoint can be highlighted to encourage end-users to use links instead of downloading or moving documents, or attaching them to emails.
The XML structure of Office documents
Since 2007, Microsoft Office documents (Word, Excel and PowerPoint) have been based on an open (i.e., accessible) XML structure that embeds metadata inside the actual document structure, a form of ‘metadata payload’. My earlier post from March 2019 titled ‘Metadata payloads in the digital world’ described this structure in considerable detail.
When Office documents are saved to SharePoint document libraries, and the document ID site collection feature is enabled, ‘metadata payloads’ can help to identify the original source of the document. Other embedded metadata may also be used to identify different aspects of the document.
How is metadata embedded in Office documents?
For the purpose of this post, we will examine what happens to the embedded metadata of a document when it is downloaded, attached to an email, copied, or moved. The document is stored in an example SharePoint library used to create agreements, as described in the linked post.
Library metadata columns
The example document library contained four metadata elements in addition to the standard ‘system’ metadata (e.g., date created, created by, date modified, modified by, etc.):
- Document ID. A site collection feature that must be enabled to appear.
- Title. A standard Dublin Core (DC) metadata element in every library. Can be renamed.
- Client DOB. A custom site column linked with a site content type.
- Client Address. A custom site column linked with a site content type.
All the available metadata columns in a SharePoint library, including the core system generated metadata (Date Created, Created (by), Date Modified, Modified (by), and Title plus document IDs if enabled) can be seen in the Library Settings under the ‘Columns’ section.
Viewing a document’s metadata values in the library
When any type of document has been added to a document library, the document’s metadata values can be viewed by (a) checking the circle to the left of the document and (b) clicking on the ‘information’ icon on the top right. This opens the details panel, in which the metadata about the document is displayed.
Viewing the metadata values of Office documents
In addition to the above, the metadata of every Office document (Word/Excel/PowerPoint) saved to a SharePoint document library can also be accessed when the document is opened in the desktop application (not in the online version).
To access this information, click ‘File’ – ‘Info’ and ‘Show All Properties’ on the right hand side, as shown below. These metadata values remain with the document even when it is downloaded or attached to an email – see below for further information. We can see the Document ID Value that defines where the document was originally saved.
Viewing the metadata values of Office documents inside the XML structure
To view the metadata values inside the XML structure of the document (the ‘metadata payload’):
- Download the file (e.g., to a local computer).
- Using the Command Prompt (CMD), change the file extension from ‘docx’ to ‘zip’ (‘ren filename.docx filename.zip’). The document will now appear to be a zip file.
- Right click and use the ‘Extract all’ option to unzip the file (usually to the same location).
The unzipped Word document now appears as shown below, a collection of folders. This structure is common to all Word documents.
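The same top-level structure can also be listed without renaming or extracting anything, because a .docx file is already a zip archive. The following is a minimal Python sketch of that idea; the function name `list_top_level_entries` is made up for illustration and is not part of any tool described in this post.

```python
import zipfile


def list_top_level_entries(docx_source):
    """Return the sorted top-level folders and files inside an Office document.

    docx_source may be a file path or an open binary file object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        # Entry names inside a zip use '/' as the separator, so the first
        # segment of each name is the top-level folder (or a top-level file).
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})
```

For a Word document downloaded from a SharePoint library, calling this with the path to the file would be expected to show folders such as ‘customXml’, ‘docProps’ and ‘word’.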
The document’s metadata values will be found in the ‘customXml’ folder, in one of the XML files (usually ‘item2’, but sometimes another). There we can see the Client DOB, Client Address and DocID values for the document we downloaded.
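Because the values are not always in ‘item2’, it can be quicker to scan every item file in the ‘customXml’ folder. The sketch below does this in Python under a simplifying assumption: it reports every element that contains text and has no child elements, rather than relying on specific element names. The function names are hypothetical, and the element names returned will be whatever internal column names SharePoint wrote into the file.

```python
import re
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_custom_xml_values(docx_source):
    """Map each customXml item file to {element name: text} for its leaf elements."""
    found = {}
    with zipfile.ZipFile(docx_source) as archive:
        for name in archive.namelist():
            # Look only at customXml/item1.xml, item2.xml, ... and skip the
            # accompanying itemProps files and the _rels folder.
            if not re.fullmatch(r"customXml/item\d+\.xml", name):
                continue
            root = ET.fromstring(archive.read(name))
            values = {}
            for element in root.iter():
                text = (element.text or "").strip()
                # Keep leaf elements that actually carry a value.
                if len(element) == 0 and text:
                    values[local_name(element.tag)] = text
            if values:
                found[name] = values
    return found
```

Item files that hold no values are left out of the result, so the remaining entries point directly at the file that carries the library metadata.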
The ‘Title’ metadata field, however, is one of the standard Dublin Core (DC) metadata fields. The value of that field, along with several other core metadata elements (prefixed with ‘dc:’), is stored in the ‘docProps’ folder, in the XML file named ‘core’.
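The Dublin Core values can be read in the same way. This is a small Python sketch, assuming the file is a standard Office Open XML document with a ‘docProps/core.xml’ part; the function name `read_dublin_core` is illustrative only. It returns just the elements in the Dublin Core namespace, labelled with the ‘dc:’ prefix.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set (the 'dc:' prefix in core.xml).
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def read_dublin_core(docx_source):
    """Return the dc: elements from docProps/core.xml as {'dc:name': text}."""
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    prefix = "{" + DC_NAMESPACE + "}"
    return {
        # ElementTree expands 'dc:title' to '{namespace-uri}title'.
        "dc:" + element.tag[len(prefix):]: (element.text or "")
        for element in root.iter()
        if element.tag.startswith(prefix)
    }
```

The result is a plain dictionary, so the title can be looked up with `.get("dc:title")`; the key is simply absent if the document has no title set.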
What happens to the metadata values if a document is downloaded, emailed, uploaded, copied or moved?
Downloading and attaching to an email
When Office documents are downloaded from SharePoint, or attached to an email, the metadata values assigned to the document in the library remain with the document and are visible via the Properties section or in the XML structure, as shown in the example above.
- Non-Office documents do not retain the metadata values assigned in the library when they are downloaded from SharePoint, or attached to an email.
- Documents that have their own metadata payload, such as the EXIF data in images or metadata in PDFs, will retain that metadata when the document is downloaded or attached to an email.
Uploading back to SharePoint
When a non-Office document is uploaded back to the same SharePoint library with the same name, the document will become a new version (if enabled) and metadata will need to be added. If it doesn’t have the same name, it is regarded as a new document.
When an Office document with the same name is uploaded back to the same SharePoint library, the document becomes a new version (if enabled). The original metadata properties remain assigned to the new version.
If an Office document is renamed and then uploaded, the document becomes a new document in the library and is assigned a new document ID but the other custom metadata properties remain.
When an Office or non-Office document is uploaded to a different library, the document will be assigned a new document ID and lose all other metadata previously assigned to it unless those metadata properties are exactly the same in the destination library.
Copying to a different SharePoint site/library
When an Office or non-Office document is copied to a different library, the document loses all its original metadata (including the document ID) and takes on the metadata of the new library, unless the metadata properties in the new library are exactly the same. If they are not, the copy is regarded as a new document. Only the most recent version is copied.
Moving to a different SharePoint site/library
When an Office or non-Office document with custom metadata is moved to a different document library using the ‘Move to’ command, a message will display showing that the document properties will be lost if the document is moved.
If the option to ‘Move anyway’ is accepted, the document is moved to the new location. It will lose all the metadata from the original library (unless the same metadata columns also exist in the destination library), but – curiously – the original document ID remains with the document, as can be seen in the file’s Properties pane as well as in the XML.
Office documents stored in SharePoint document libraries retain their original metadata properties, including the document ID, if they are downloaded or attached to an email.
Office and non-Office documents may also retain their original metadata if they are copied or moved to a new library with the same metadata columns. Documents that are moved retain their original document ID regardless.
The persistence of metadata, at least in Office documents and especially the document ID, helps to identify that the document has been downloaded or came from a SharePoint document library, making it much easier to identify the original source (via the document ID and/or other metadata).
This feature in turn should encourage the greater use of links to documents instead of downloading, attaching or moving them.