Abstract
The basic concepts and theory behind the idea of a ‘semantic web’ are already present in the XML-based documents produced by Word 2007 and OpenOffice.org. This has potential implications for the way records could be managed within organisations, leading to the possibility of a ‘semantic office’.
Background
The conceptual basis for a ‘semantic web’ was first defined by Tim Berners-Lee and Mark Fischetti in their 1999 book ‘Weaving the Web’ (HarperSanFrancisco. chapter 12. ISBN 9780062515872.).
At its core, the concept behind the Semantic Web is to give computers the ability to understand the meaning (‘semantics’) of information in a web page, by presenting that information in a machine-understandable format. In very simple terms, this can be achieved through the use of pre-defined, agreed terms to describe informational elements contained within a web page as individual elements of data. This has led to the suggestion of a ‘World Wide Database’ (Nova Spivack, 2005).
Until recently, the primary way to present information that could be read in a browser was to define it using hypertext mark-up language, or HTML (current standard version HTML 4), which (like many mark-up languages) derives from the original Standard Generalised Markup Language (SGML).
While HTML is an effective way of presenting and formatting information read by a browser, it (and its successors, including XHTML and more recently HTML 5) does not allow for the definition of all individual informational elements contained within the web page. Extensible Mark-up Language (XML, released in 1998), the Web Ontology Language (OWL) and the Resource Description Framework (RDF, released in 1999), on the other hand, allow individual informational elements to be defined. The framework for defining information in this way is described below.
The concept of describing individual elements within a web page has its origins in the object-oriented programming languages of the late 1980s and early 1990s. It can be argued that these origins can be traced further back to the very early 1980s and the development of relational databases that could link, analyse and present data.
In 1988, B. Pernici of the Politecnico di Milano presented a paper to the Conference on Office Information Systems in Palo Alto. The paper, ‘Supporting OIS design through semantic queries’, proposed a semantic query language to assist in the retrieval of information from conceptual schemas in office systems.
This was, possibly, the first reference to the Semantic Office.
Developments within the field of semantic representation of knowledge appear to have focussed on two main areas from the mid 1980s – the management of data within database systems, and the management of informational or data elements within web based information.
In the early to mid 2000s, possibly with this background and other drivers, both Microsoft and OpenOffice.org began to work on new XML-based document file format standards. Microsoft’s version, ‘Office Open XML’, was published in November 2008 as ISO/IEC 29500:2008 (also released as ECMA-376 Office Open XML File Formats, 2nd edition, in December 2008). OpenOffice.org’s ‘OpenOffice.org XML’ was published in November 2006 as ISO/IEC 26300:2006 Open Document Format for Office Applications (OpenDocument) v1.0.
Defining Semantics
The broad concept behind the Semantic Web is that a document (in this case a web page) can contain elements of information (or data) that are described within pre-defined ‘categories’. This allows other data described in the same category but held in other documents (web pages, or documents on a server accessible through the same web page) to be found, retrieved and, more importantly, used in potentially completely different contexts.
In a wide sense, it should allow accessible information in any part of the internet to be retrieved and used in this way. It turns what was unstructured information into structured information.
As described on the main page of http://semanticweb.org/wiki/Main_Page, ‘The Semantic Web is the extension of the World Wide Web that enables people to share content beyond the boundaries of applications and websites.’
At the core of the Semantic Web is agreement on how information should be described, in the form of metadata (‘information about information’). One of the best-known metadata sets is the Dublin Core, which consists of 15 simple elements including: Title, Creator, Subject, Description, Date.
Metadata sets are often conceptually the same thing as ontologies, taxonomies and even (to hark back to their origins) data dictionaries, particularly when the sets become quite complex and have relationships with other sets. The terms are often used interchangeably, depending on the background of the person using them. Web Ontology Language (OWL) is a technology for developing ontologies.
Agreed sets are known as schemas; the schemas used in a Semantic Web page are declared at the beginning as XML namespaces, or XMLNS (for example xmlns:dc="http://purl.org/dc/terms/"). The presence of this declaration at the top of a web page means that the page contains metadata elements drawn from Dublin Core somewhere within it (for example: <span property="dc:title">How to Publish Linked Data on the Web</span>).
One ontology that has become very common on the net in recent years is ‘Friend of a Friend’, or FOAF. FOAF is a way to describe people and their relationships in an agreed format. FOAF includes metadata elements such as: Person, name, nick, homepage, weblog, knows, interest, plan, based_near, age, OnlineAccount, Group, member, and so on. This would appear in a web page like this: <h1 property="foaf:name">Andrew Warland</h1>
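To illustrate how a program might pick up metadata marked up in this way, the following is a minimal sketch in Python using only the standard library. The class and function names (PropertyExtractor, extract_properties) are illustrative assumptions, and the sketch deliberately ignores nested elements and namespace resolution; it is not a full RDFa processor.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements that carry a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []     # finished (property, text) pairs
        self._open = None   # (tag, property) of the element currently being read
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        # Simplification: only one annotated element is tracked at a time.
        if prop and self._open is None:
            self._open = (tag, prop)
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            self.pairs.append((self._open[1], "".join(self._text).strip()))
            self._open = None


def extract_properties(markup):
    """Return every (property, text) pair found in the markup, in document order."""
    parser = PropertyExtractor()
    parser.feed(markup)
    parser.close()
    return parser.pairs


sample = '<h1 property="foaf:name">Andrew Warland</h1>'
pairs = extract_properties(sample)
```

Here pairs holds the FOAF property name alongside the text it describes, which is the point at which a page stops being only formatted text and starts to behave like data.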
These agreed metadata sets, ontologies or schemas are presented on web pages in XML format, as shown above, using the agreed Resource Description Framework (RDF) for describing information. RDF in its simplest form consists of ‘triples’ based on: subject, predicate, object.
For example: ‘The title of the book is The Semantic Office’. Here, ‘book’ is the subject, ‘title’ is the predicate, and ‘The Semantic Office’ is the object. To define this in a web page, we would first have to choose an appropriate schema. In this case, Dublin Core seems appropriate, declared with a namespace such as xmlns:dc="http://purl.org/dc/terms/".
Within the body of the web page, we would then include:
<dc:title>The Semantic Office</dc:title>
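The same subject, predicate and object can be written down as a data structure. The sketch below is an illustration only: the subject URI under example.org is a made-up identifier for the book, and the helper names (Triple, expand, to_ntriples) are assumptions rather than part of any RDF library. It prints the triple as a single line in the N-Triples notation.

```python
from typing import NamedTuple

# Prefix-to-namespace map; the dc namespace URI is the one quoted earlier in the text.
NAMESPACES = {"dc": "http://purl.org/dc/terms/"}


class Triple(NamedTuple):
    subject: str    # URI identifying the thing being described
    predicate: str  # prefixed property name, e.g. "dc:title"
    obj: str        # literal value


def expand(prefixed_name):
    """Turn a prefixed name such as dc:title into a full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return NAMESPACES[prefix] + local


def to_ntriples(triple):
    """Serialise a triple with a literal object as one N-Triples line."""
    literal = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{triple.subject}> <{expand(triple.predicate)}> "{literal}" .'


# Hypothetical identifier for the book; any stable URI would do.
book = Triple("http://example.org/books/semantic-office", "dc:title", "The Semantic Office")
print(to_ntriples(book))
```

Keeping the predicate as a prefixed name and expanding it only on output mirrors the way the namespace declaration and the element are separated in the web page itself.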
The relationship to Office Documents
To see the XML contents of a Microsoft docx or xlsx document, simply rename the file extension from docx or xlsx to zip. You can then open the zipped package and explore the contents.
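The same exploration can be done without renaming anything, because the package is an ordinary zip archive. The following is a minimal sketch using Python's standard zipfile module; the function names are illustrative assumptions.

```python
import zipfile


def list_package_parts(path):
    """Return the names of the parts (files) inside an Office Open XML package."""
    with zipfile.ZipFile(path) as package:
        return package.namelist()


def read_part(path, part_name):
    """Return the text of one part of the package, decoded as UTF-8."""
    with zipfile.ZipFile(path) as package:
        return package.read(part_name).decode("utf-8")
```

Calling list_package_parts on a Word 2007 file would typically show parts such as docProps/core.xml and word/document.xml, which read_part can then return as XML text.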
The following shows the XML-based content embedded within a recent sample docx document. As you can see, this way of presenting information closely resembles the way it is presented in Semantic Web formatted documents.
Within the folder ‘docProps’:
-xmlns:dc="http://purl.org/dc/elements/1.1/"
-<dc:title>Test document</dc:title>
Within the folder ‘word’:
-xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
-xmlns:o="urn:schemas-microsoft-com:office:office"
-xmlns:v="urn:schemas-microsoft-com:vml"
-xmlns:w10="urn:schemas-microsoft-com:office:word"
and within the document a range of pre-defined information built around the schemas listed above.
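As a sketch of how the Dublin Core title shown above could be read programmatically, the following Python function opens the package and looks up the dc:title element in docProps/core.xml. The function name is an illustrative assumption, and the sketch assumes the title is stored under the Dublin Core namespace listed above.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace as declared in the docProps folder of the sample document.
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_dc_title(path):
    """Return the dc:title stored in docProps/core.xml, or None if it is absent."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    element = root.find(f".//{{{DC_NS}}}title")
    return element.text if element is not None else None
```

For the sample document above, this would be expected to return ‘Test document’.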
What are we seeing here? Microsoft Word documents presented in almost the same way that web pages in the Semantic Web are formed.
If we include recordkeeping metadata in the same structure, we potentially achieve an internal electronic office environment, accessible via a browser interface, that allows information to be stored, accessed and used in the same way that information in the Semantic Web is used.
So how do we achieve a Semantic Office?
According to Microsoft, this type of additional information is a customised extension of the document information properties, and can only be added using InfoPath. But, once it’s there, it can then be utilised again and again in standardised ways.
What does this mean? Potentially, it means that end users can create documents with all the required recordkeeping metadata embedded within the document.
Of course, some of this information is already there in the form of basic document properties. But the ability to extend this basic set has ramifications for the way information is created, stored, found, retrieved and used, and these closely resemble the way information is being presented in the Semantic Web.
When we consider how almost every application to manage documents and records is now browser based, the ability to apply Semantic Web and Web 2.0/3.0 tools to this information within the enterprise means that records and the information content of those records could be used in ways that were never previously considered possible. Some products, including EMC’s Documentum, now include an XML store in addition to the traditional file store and relational database (see http://www.emc.com/products/detail/software/xml-store.htm).
For example, instead of consigning documents to pre-defined containers or folders within a file plan, it might instead be possible to define the container, title and classification, and to place information about the author and her/his organisational context within the document metadata automatically, as part of the recordkeeping metadata schema. This, in a sense, is an encapsulated object, and is not new, but the ability to do it through the original document (e.g. Word) is new.
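A rough sketch of what such an encapsulated set of recordkeeping metadata might look like is given below. Everything in it is hypothetical: the namespace under example.org, the element names and the field list are assumptions made for illustration, not a published recordkeeping schema or anything Word itself produces.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET

# Hypothetical namespace for an in-house recordkeeping schema (not a real standard).
RK_NS = "http://example.org/recordkeeping/"


@dataclass
class RecordMetadata:
    title: str
    container: str
    classification: str
    author: str
    organisational_unit: str


def to_xml(metadata):
    """Serialise the metadata as one XML element per field, under the rk prefix."""
    ET.register_namespace("rk", RK_NS)
    root = ET.Element(f"{{{RK_NS}}}record")
    for field in fields(metadata):
        child = ET.SubElement(root, f"{{{RK_NS}}}{field.name}")
        child.text = getattr(metadata, field.name)
    return ET.tostring(root, encoding="unicode")
```

A fragment of this kind, stored inside the document package, would let the container and classification travel with the document rather than being implied by the folder it happens to sit in.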
Copyright – Andrew Warland 2010
References
Konsynski, B.R., Bracket, L.C., and Bracket, W.E., ‘A model for specification of office communications’, IEEE Trans. on Comm., Vol. COM-30, N. 1, Jan. 1982.
Nutt, G.J. and Ricci, P.A., ‘Quinault: an office modeling system’, Computer, May 1981.
Shipman, D.W., ‘The functional data model and the data language DAPLEX’, ACM Transactions on Database Systems (TODS), Vol. 6, N. 1, pp. 140-173, March 1981.
Pernici, B., Barbic, F., Fugini, M.G., Maiocchi, R., Rames, J.R., and Rolland, C., ‘C-TODOS: An automatic tool for office system conceptual design’, Politecnico di Milano, Electronics Dept., Rep. N. 87-15, 1987.
Ding, L., Zhou, L., Finin, T., and Joshi, A., ‘How the Semantic Web is Being Used: An Analysis of FOAF’, Proceedings of the 38th International Conference on System Sciences, January 2005.
Spivack, Nova. ‘Towards a world wide database’. Blog post 27 October 2005. http://novaspivack.typepad.com/nova_spivacks_weblog/2005/10/towards_a_world.html accessed 23 January 2010.
Capossela, Chris. ‘An Open Letter from Chris Capossela, Senior Vice President, Microsoft Office’. http://www.microsoft.com/interop/letters/ChrisCapOpenLetter.mspx
http://xmlns.com/foaf/spec/
From the article:
Linked Data is, in short, the future of the web. (We’ve said this before when discussing the semantic web.) It’s perhaps most clearly articulated by Berners-Lee, the inventor of the “original” web, himself: up until now, the web has been a network of linked pages. But it is becoming a network of linked data.
Linking data involves exposing data: publishing it and making it accessible. In Berners-Lee’s vision, publishing data online per evolving publishing standards permits formerly invisible (i.e. public, but virtually impossible to access) and propriety data to be linked with other data.
…
It’s too bad government agencies and other organizations continue to “hug their data” in order to protect it from misinterpretation, and themselves from scrutiny, says Rosling. He suggests we get over it, practice non-attachment, and liberate data so that it could be used and interpreted in ways that lead to positive change.
http://www.hypios.com/thinking/2010/02/02/state-of-the-web-governments-move-towards-linked-data/