The Semantic Puzzle

Andreas Blumauer

Insights into Nature’s Data Publishing Portal

In recent years, Nature has adopted linked data technologies on a broader scale. Andreas Blumauer was intrigued to discover more about the strategy and the technologies behind it. He had the opportunity to talk with Michele Pasin and Tony Hammond, the architects of Nature’s data publishing portal.

 

Semantic Puzzle: Nature’s data publishing portal is one of the most renowned ones in the linked data community. Could you talk a bit about its history? Why was this project initiated and who have been the brains behind it since then?

Michele Pasin: We have been involved with semantic technologies at Macmillan since 2010. At the time it was primarily my colleague Tony Hammond who saw the potential of these technologies for metadata management and data sharing. Tony set up the data.nature.com portal in April 2012 (and expanded it in July 2012), in the context of a broader company initiative aimed at moving towards a ‘digital first’ publication workflow.

The data.nature.com platform was essentially a public RDF output of some of the metadata embedded in our XML articles archive. This included a SPARQL endpoint for data about articles published by NPG from 1845 through to the present day. Additionally the datasets include NPG product and subject ontologies. These datasets are available under a Creative Commons Zero waiver.
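
To give a flavour of what such an endpoint offered, here is a minimal sketch of the kind of query a client could have issued against it. The endpoint has since been retired (see below), and both the endpoint URL and the Dublin Core properties used here are illustrative assumptions rather than the exact NPG vocabulary.

```python
# A minimal sketch of querying a SPARQL endpoint for article metadata.
# The data.nature.com endpoint is no longer live; the endpoint URL and the
# Dublin Core terms below are assumptions for illustration only.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://data.nature.com/sparql"  # hypothetical, now retired

QUERY = """
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?article ?title ?date
WHERE {
  ?article dc:title ?title ;
           dc:date  ?date .
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["article"]["value"], "-", binding["title"]["value"])
```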

The data.nature.com platform was only for external use though, so it was essentially detached from the products end users would see on nature.com. Still, it allowed us to develop a better understanding of how to make use of these tools within our existing technology stack. It is important to remember that over the years the company has invested a considerable amount of resources in an XML-centered architecture, so finding a solution that could leverage the legacy infrastructure alongside these new technologies has always been a fundamental requirement for us.

More recently, in 2013 we started working on a new hybrid linked data platform, this time with a much stronger focus on supporting our internal applications. That’s pretty much around the time I joined the company. In essence, we made the point that in order to achieve stronger interoperability levels within our systems we had to create an architecture where RDF is as core to the publishing workflow as XML is. (By the way, if you are interested in the details, we presented a paper about this at ISWC 2014.) As part of this phase, we also built a more sophisticated set of ontologies used for encoding the semantics of our data, together with improved versions of the datasets previously released.

The nature.com ontologies portal came out in early 2015 as the result of this second phase of work. On the portal one can find extensive documentation about all of our models, as well as periodic downloads in various RDF formats. The idea is to make it easier for people – both within the enterprise and externally – to access, understand and reuse our linked data.

At the same time, since the user engagement level on data.nature.com was not as good as expected, we decided to terminate that service. We plan to keep releasing periodic snapshots of the datasets and the ontologies we are using, but we will not offer a public endpoint in the immediate future.

Semantic Puzzle: As one of your visions, you state that your “primary reason for adopting linked data technologies is quite simply better metadata management”. How did you deal with metadata before you started this transition? What has changed since then, also from a business point of view?

Michele Pasin: Our pre–linked data approach to dealing with metadata and enterprise taxonomies is probably not unheard of, especially within similarly sized companies: a vast array of custom-made solutions, varying from simple Word documents sitting on someone’s computer, to Excel spreadsheets or, in the best of cases, database tables in one of our production systems. Of course, there were also a number of ad-hoc applications and scripts responsible for reading and updating these metadata sources, as often they would be critical to one or more systems in the publishing workflow (e.g. think of the journal’s master list, or the list of approved article-types).

[Figure: Linked data aims]

It is worth stressing that the lack of a unified technical infrastructure was a key problem, of course, but not the only one. In fact, I would argue that addressing the lack of a centralized data governance approach was even more crucial. For example, most often you would not know who, or which department, was in charge of a particular controlled vocabulary or metadata specification. In some cases, no single source of truth was actually available, because different people or groups were in charge of specific aspects of a single specification (due to their differing interests).

Hence you need a certain amount of management buy-in to implement such a wide-ranging approach to metadata; moving to a single platform and technical solution based on linked data was fundamental, but an equally fundamental organizational change was also needed. Even more so if one considers that this is not a time-boxed project but rather an ongoing process: an approach which pays off only as much as you can guarantee that, as new products and services get launched, they all subscribe to the same metadata management ‘philosophy’.

Semantic Puzzle: One of the promises of Linked Data is that by “using a common data model and a common naming architecture, users can begin to realize the benefits and efficiencies of web scaling.” Could you describe in a bit more detail the ecosystem in which your content workflows and publishing processes are embedded (internally and externally), and why the use of standards is important for this?

Tony Hammond: We operate with an XML-based workflow for documents where we receive XML from our suppliers and store that within an XML database (MarkLogic). Increasingly we are beginning to move towards a dynamic publishing solution from that database. We are also using the database to provide a full-text search across all our content. In the past we had various workflows and a small number of different DTDs to reconcile, although we are currently converging on a single DTD. To facilitate search across this mixed XML content we abstracted certain key metadata elements into a common header. This was managed organically and was somewhat unpredictable both in terms of content model and naming.

By moving to a linked data solution for managing our metadata, based on a single core ontology, we bypass our normalized metadata header and start to build on a new, simpler data model (triples) with a common naming architecture. In effect, we have moved from nominally normalized metadata to super-normalized metadata which uses web standards for data (URI, RDF, OWL).
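
As a rough sketch of what this super-normalized, triple-based model looks like in practice, the snippet below builds a handful of statements about an article with rdflib. The npg: namespace, the article URI and the property names are illustrative assumptions, not the actual nature.com ontologies.

```python
# A minimal sketch of the triple-based data model: every entity and property
# gets a URI, and metadata becomes simple subject-predicate-object statements.
# The npg: namespace and property names below are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

NPG = Namespace("http://ns.nature.com/terms/")         # assumed namespace
ART = URIRef("http://www.nature.com/articles/12345")   # hypothetical article URI

g = Graph()
g.bind("npg", NPG)
g.bind("dcterms", DCTERMS)

g.add((ART, RDF.type, NPG.Article))
g.add((ART, DCTERMS.title, Literal("A hypothetical article title")))
g.add((ART, DCTERMS.issued, Literal("2015-01-01")))
g.add((ART, NPG.hasSubject, URIRef("http://ns.nature.com/subjects/photosynthesis")))

print(g.serialize(format="turtle"))
```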

Semantic Puzzle: Your contents are also multimedia (image, video, …). How do you embed this non-textual content into your linked data ecosystem? Which gateways, tools and connectors are used to bridge your linked data environment with multimedia?

Tony Hammond: Some years ago we embarked on a new initiative internally to streamline our production workflows. Our brief was to support a distributed content warehouse where digital assets would be stored in various locations. The idea was to abstract out our storage concerns and to maintain pointers to the various storage subsystems along with other physical characteristics required for accessing that storage.

In practice our main content was housed as XML documents within a MarkLogic XML database, and associated media assets (e.g. images) were primarily stored on the filesystem, with some secondary asset types (e.g. videos) being sourced from cloud services.

To relate a physical asset (e.g. an XML document, or a JPEG file) to the underlying concept (e.g. an article, or an image) we made use of XMP packets (a technology developed by Adobe Systems and standardized through ISO), which as simple RDF/XML descriptions allowed us to capture metadata about physical characteristics and to relate those properties to our data model. An XMP packet is a description of one physical resource and could be simply linked to the related conceptual resource.
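
To illustrate the idea, the sketch below parses a stripped-down XMP-style RDF/XML packet with rdflib and links the physical asset it describes to a conceptual resource. The specific properties recorded (format, storage location) and the linking property are assumptions for illustration, not NPG’s actual model.

```python
# A minimal sketch: an XMP-style RDF/XML packet describing one physical asset,
# parsed with rdflib and linked to the conceptual resource it manifests.
# The properties and the linking predicate are hypothetical.
from rdflib import Graph, Namespace, URIRef

XMP_PACKET = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/assets/fig1.jpg">
    <dc:format>image/jpeg</dc:format>
    <dc:source>filesystem://images/fig1.jpg</dc:source>
  </rdf:Description>
</rdf:RDF>
"""

g = Graph()
g.parse(data=XMP_PACKET, format="xml")

# Link the physical asset to the conceptual image it represents
# (the property name is hypothetical).
EX = Namespace("http://example.org/terms/")
asset = URIRef("http://example.org/assets/fig1.jpg")
concept = URIRef("http://example.org/images/figure-1")
g.add((asset, EX.manifestationOf, concept))

print(g.serialize(format="turtle"))
```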

We started this project with an RDF triplestore for maintaining and querying our metadata, but over time we moved towards a hybrid technology where our semantic descriptions were buried within XML documents as RDF/XML descriptions and could be queried within an XML context using XQuery to deliver a highly performant JSON API. These semantic descriptions enclosed minimal XMP documents which described the storage entities.
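
In NPG’s stack this querying happens in XQuery inside MarkLogic; purely as an illustration of the same idea, the Python/lxml sketch below locates an RDF/XML description embedded in an XML document and exposes it as JSON. The document structure and field names are assumptions.

```python
# A rough analogue of the hybrid approach: an RDF/XML description embedded in
# an XML document is located and surfaced as JSON. In NPG's stack this is done
# with XQuery in MarkLogic; this Python/lxml version only illustrates the idea.
import json
from lxml import etree

DOC = b"""<article xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <front>
    <rdf:RDF>
      <rdf:Description rdf:about="http://example.org/articles/12345">
        <dc:title>A hypothetical article</dc:title>
        <dc:format>application/xml</dc:format>
      </rdf:Description>
    </rdf:RDF>
  </front>
  <body>...</body>
</article>"""

NS = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      "dc": "http://purl.org/dc/elements/1.1/"}

root = etree.fromstring(DOC)
desc = root.find(".//rdf:Description", namespaces=NS)

record = {
    "id": desc.get("{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"),
    "title": desc.findtext("dc:title", namespaces=NS),
    "format": desc.findtext("dc:format", namespaces=NS),
}
print(json.dumps(record, indent=2))
```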

Semantic Puzzle: Nature links its datasets to external ones, e.g. to DBpedia or MeSH. Who exactly is benefiting from this and how?

Michele Pasin: I would say that there are at least two reasons why we did this. First, we wanted to maximize the potential reuse of our datasets and models within the semantic web. Building owl:sameAs relationships to other vocabularies, or marking up our ontology classes and properties with subclass/subproperty relationships pointing to external vocabularies, is a way to be good ‘linked data citizens’. Moreover, this is a deliberate attempt to counterbalance one of our key design principles: minimal commitment to external vocabularies. This approach to data modeling means that we tend to create our own models and define them within our own namespaces, rather than building production-level software against third-party ontologies. It is worth pointing out that this is not because we think our ontologies are better, but because we want our data architecture to reflect as closely as possible the ontological commitment of a publishing enterprise with decades of established business practices, naming conventions etc. In other words, we aimed at creating a very cohesive and robust domain model, one which is resilient to external changes but that also supports semantic interoperability by providing a number of links and mappings to other semantic web standards.
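
A minimal sketch of what such mappings look like follows, using rdflib. The in-house npg: terms are assumptions; the DBpedia and SKOS URIs follow those vocabularies’ public URI patterns.

```python
# A minimal sketch of 'linked data citizen' mappings: owl:sameAs links and
# subclass relationships pointing from an in-house vocabulary to external ones.
# The npg: terms are assumptions; DBpedia/SKOS URIs follow public patterns.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL, RDFS

NPG = Namespace("http://ns.nature.com/terms/")  # assumed in-house namespace

g = Graph()
g.bind("owl", OWL)
g.bind("npg", NPG)

# Equate an in-house subject term with the corresponding DBpedia resource.
g.add((URIRef("http://ns.nature.com/subjects/photosynthesis"),
       OWL.sameAs,
       URIRef("http://dbpedia.org/resource/Photosynthesis")))

# Anchor an in-house class to a widely used external vocabulary.
g.add((NPG.Subject, RDFS.subClassOf,
       URIRef("http://www.w3.org/2004/02/skos/core#Concept")))

print(g.serialize(format="turtle"))
```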

Pointing to external vocabularies is a way to be good ‘linked data citizens’

The second reason for creating these links is to enable more innovative discovery services. For example, a nature.com subject page about photosynthesis could surface encyclopedic materials automatically retrieved from DBpedia; or it could provide links to highly cited articles retrieved from PubMed using MeSH mappings. This just scratches the surface of what one could do. The real difficulty is how to do it in such a way that the overall user experience improves, rather than adding to the information overload the majority of internet users already have to deal with. So at the moment, while the data people (us) are focusing on building a rich network of entities for our knowledge graph, the UX and front end teams are exploring design and interaction models that truly take advantage of these functionalities. Hopefully we will see these activities continue to converge!
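
As a sketch of the DBpedia half of that example, the query below pulls an English abstract for the photosynthesis resource from DBpedia’s public SPARQL endpoint. It is illustrative only and not how the nature.com subject pages are actually built.

```python
# A sketch of surfacing encyclopedic material from DBpedia for a subject page.
# Illustration only, not the actual nature.com implementation.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract
WHERE {
  <http://dbpedia.org/resource/Photosynthesis> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
LIMIT 1
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["abstract"]["value"][:200], "...")
```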

Semantic Puzzle: How do you deal with data quality management in general, and how can linked data technologies help to improve it?

Tony Hammond: We can distinguish between two main types of data: documents and ontologies. (And by ontologies we also mean thesauri and taxonomies.) Our documents are created by our suppliers using XML and are amenable to some data validations. We use automated DTD validation in our new workflow and manual DTD validation in the older workflows. We also use Schematron rulesets to validate certain data points, but these address only certain elements. We have a couple of hundred Schematron rules which implement various business rules and are also synchronized with our ontologies.
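
For readers unfamiliar with Schematron, here is a minimal sketch of such a rule, run here with lxml’s ISO Schematron support. The rule itself (requiring every article to declare an article-type attribute) is a hypothetical example, not one of NPG’s actual rules.

```python
# A minimal Schematron sketch: assert that every <article> declares an
# article-type attribute. The rule is hypothetical, not an actual NPG rule.
from lxml import etree
from lxml.isoschematron import Schematron

SCHEMA = etree.XML(b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="article">
      <assert test="@article-type">An article must declare an article-type.</assert>
    </rule>
  </pattern>
</schema>
""")

DOC = etree.XML(b'<article article-type="research"><title>Example</title></article>')

schematron = Schematron(SCHEMA)
print("valid:", schematron.validate(DOC))  # True if the assertion holds
```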

Our ontologies, on the other hand, are by their nature more curated datasets. These are mastered as RDF Turtle files and stored within GitHub. They are currently maintained by hand, although we are beginning now to transition some of our taxonomies to the PoolParty taxonomy manager. We have a build process for deploying the ontologies to our XML database where they are combined with our XML documents. During this build process we both validate the RDF and run SPIN rules over the datasets, which can validate data elements as well as expanding the dataset with new triples from rules-based inferencing.
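
SPIN essentially packages SPARQL queries as rules stored in RDF. As a sketch of the inferencing step only (not the actual build process, which runs SPIN inside the XML database), the snippet below runs an equivalent plain SPARQL CONSTRUCT with rdflib and adds the derived triples back into the dataset. The vocabulary is hypothetical.

```python
# A sketch of rules-based inferencing of the kind SPIN enables: a SPARQL
# CONSTRUCT derives new triples, which are added back to the dataset.
# The vocabulary is hypothetical; the real build uses SPIN rules, not rdflib.
from rdflib import Graph

DATA = """
@prefix npg: <http://ns.nature.com/terms/> .
@prefix ex:  <http://example.org/> .

ex:article1 npg:hasSubject ex:photosynthesis .
ex:photosynthesis npg:broader ex:plantScience .
"""

RULE = """
PREFIX npg: <http://ns.nature.com/terms/>
CONSTRUCT { ?article npg:hasSubject ?broader . }
WHERE     { ?article npg:hasSubject ?s . ?s npg:broader ?broader . }
"""

g = Graph()
g.parse(data=DATA, format="turtle")

# Expand the dataset with inferred subject assignments.
for triple in g.query(RULE):
    g.add(triple)

print(g.serialize(format="turtle"))
```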

Semantic Puzzle: For a publisher like Nature it is somehow “natural” that Linked Data is used. How could other industries make use of these principles for information management?

Tony Hammond: The main reason for using linked data is not to do with publishing the data (and indeed many other data models are generally used for data publishing), but with the desire to join one dataset with other datasets – or rather, the data within a dataset to the data within other datasets. It is for this reason that we make use of URIs as common (global) names for data points. Linking data is not just a goal in publishing data but applies equally when consuming data from various sources and integrating over those data sources within an internal environment. Indeed, arguably, the biggest use case for linked data is within private enterprises rather than surfaced on the open web. Once that point is appreciated, no industry is any more or less disposed to using linked data than any other; it is a means to maximize the data surface that a company operates over.

The biggest use case for linked data is within private enterprises rather than surfaced on the open web

Semantic Puzzle: Where are the limits of Linked Data from your perspective, and do you believe they will ever be overcome?

Tony Hammond: The limits to using linked data are more to do with top-down vs bottom-up approaches in dealing with data, i.e. linked data vs big data, or data curation vs data crunching. Linked data makes use of global names (URIs), schemas and ontologies. It is highly structured, organized data.

Now, whether it is feasible to bring this level of organization to data at large or whether data crunching will provide the appropriate insights over the data is an open question. Our expectation is that we will still need to use ontologies – and hence linked data – as an organizing principle, or reference, to guide us in processing large datasets and for sharing those data organizations. The question may be how much human curation is required in assembling these ontologies.

Michele Pasin: On a more practical level, I’d say that the biggest problem with linked data is still its rather limited adoption on a large scale. I’m referring in particular to the data publishing and reuse aspect. On this front, we really struggled to get the levels of uptake the business was expecting from us. Consider this: we have been publishing metadata for our entire archive since 2012 (approx. 1.2m documents, resulting in almost half a billion triples). However, very few people made use of these data, either in the form of bulk downloads or via the SPARQL API we once hosted (and that was then retired due to low usage). This is in stark contrast with other, arguably less flexible, services we make available, e.g. the OpenSearch APIs or a JSON REST service, which often see significant traffic.

Last year we gave a paper at the Linked Science workshop (affiliated with ISWC 2015) with the specific intent to address the problem within that community. What seemed to emerge is that possibly this has to do with the same reason why this technology has been so useful to us. RDF is an extremely flexible and powerful model; however, when it comes to data consumption and access, the average user cares more about simplicity than flexibility. Also, outside linked data circles we all know that the standard tech for APIs is JSON and REST, rather than RDF and SPARQL.

Lowering the bar to the adoption of semantic tech

The good news though is that we are seeing more initiatives aimed at bridging these two worlds. One that we are keeping an eye on, for example, is JSON-LD. The way this format hides various RDF complexities behind a familiar JSON structure makes it an ideal candidate for a linked data publishing product with a much wider user base. Which is exactly what we are looking for: lowering the bar to the adoption of semantic tech.
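
A small sketch of that point follows: with a context in place, RDF serializes into JSON that an ordinary web developer can consume without thinking about triples. This assumes rdflib 6+ (which bundles a JSON-LD serializer); the vocabulary is the same hypothetical one used in the earlier sketches.

```python
# A sketch of how JSON-LD hides RDF behind familiar JSON. Assumes rdflib 6+
# (with its built-in JSON-LD serializer); the vocabulary is hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

NPG = Namespace("http://ns.nature.com/terms/")
art = URIRef("http://www.nature.com/articles/12345")

g = Graph()
g.add((art, DCTERMS.title, Literal("A hypothetical article title")))
g.add((art, NPG.hasSubject, URIRef("http://ns.nature.com/subjects/photosynthesis")))

context = {
    "title": "http://purl.org/dc/terms/title",
    "subject": {"@id": "http://ns.nature.com/terms/hasSubject", "@type": "@id"},
}

# To a JSON consumer this is just a plain object with 'title' and 'subject'.
print(g.serialize(format="json-ld", context=context, indent=2))
```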

 

About Michele Pasin

Michele Pasin is an information architect and product manager with a focus on enterprise metadata management and semantic technologies.

Michele currently works for Springer Nature, a publishing company resulting from the May 2015 merger of Springer Science+Business Media and Holtzbrinck Publishing Group’s Nature Publishing Group, Palgrave Macmillan, and Macmillan Education.

He has recently taken up the role of product manager for the knowledge graph project, an initiative whose goal is to bring together various preexisting linked data repositories, plus a number of other structured and unstructured data sources, into a unified, highly integrated knowledge discovery platform. Before that, he worked on projects like nature.com’s subject pages (a dynamic section of the website that allows users to navigate content by topic) and the nature.com ontologies portal (a public repository of linked open data).

He holds a PhD in semantic web technologies from the Knowledge Media Institute (The Open University, UK) and advanced degrees in logic and philosophy of language from the University of Venice (Italy). Previously, he was a research associate at King’s College Department of Digital Humanities (London), where he worked on a number of cultural informatics projects such as the People of Medieval Scotland and the Art of Making in Antiquity. (Online portfolio: http://www.michelepasin.org/projects/)

Michele Pasin will give a keynote at this year’s SEMANTiCS conference (SEMANTiCS 2016).

About Tony Hammond

Tony Hammond is a data architect with a primary focus in the general area of machine-readable description technologies. He has been actively involved in developing industry standards for network identifiers and metadata frameworks. He has had experience working on both sides of the scientific publishing information chain, from international research centres to leading publishing houses. His background is in physics with astrophysics.

Tony currently works for Springer Nature, a publishing company resulting from the May 2015 merger of Springer Science+Business Media and Holtzbrinck Publishing Group’s Nature Publishing Group, Palgrave Macmillan, and Macmillan Education.