The Semantic Puzzle

Andreas Blumauer

From Taxonomies via Ontologies to Knowledge Graphs

With the rise of linked data and the semantic web, terms like ‘ontology’, ‘vocabulary’, ‘thesaurus’ or ‘taxonomy’ are frequently picked up by information managers, search engine specialists and data engineers to describe ‘knowledge models’ in general. In many cases the terms are used without any specific meaning, which leads many people to the basic question:

What are the differences between a taxonomy, a thesaurus, an ontology and a knowledge graph?

This article sheds light on this discussion by guiding you through an example which starts from a taxonomy, introduces an ontology and finally exposes a knowledge graph (linked data graph) to be used as the basis for semantic applications.

1. Taxonomies and thesauri

Taxonomies and thesauri are closely related species of controlled vocabularies which describe relations between concepts and their labels, including synonyms, most often in several languages. Such structures can be used as a basis for domain-specific entity extraction or text categorization services. Here is an example of a taxonomy about the Apollo programme, created with PoolParty Thesaurus Server:

Apollo programme taxonomy

The nodes of a taxonomy represent various types of ‘things’ (so-called ‘resources’): the topmost level (orange) is the root node of the taxonomy, purple nodes are so-called ‘concept schemes’, followed by ‘top concepts’ (dark green) and ordinary ‘concepts’ (light green). In 2009 the W3C introduced the Simple Knowledge Organization System (SKOS) as a standard for the creation and publication of taxonomies and thesauri. The SKOS ontology comprises only a few classes and properties. The most important types of resources are Concept, ConceptScheme and Collection. The hierarchical relation between concepts is ‘broader’, with its inverse ‘narrower’. Thesauri most often also cover non-hierarchical relations between concepts, like the symmetric property ‘related’. Every concept has at least one ‘preferred label’ and can have numerous synonyms (‘alternative labels’). Whereas a taxonomy can be envisaged as a tree, thesauri most often contain polyhierarchies: a concept can be the child node of more than one parent node. By including polyhierarchical and also non-hierarchical relations between concepts, a thesaurus should be envisaged as a network (graph) of nodes rather than a simple tree.
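The structure described above can be sketched in a few lines of plain Python. This is an illustrative model, not a SKOS library or the PoolParty API; the class and label names are assumptions chosen for the example. It shows preferred/alternative labels, the ‘broader’ relation, and a polyhierarchy in which one concept has two parent nodes:

```python
# Minimal sketch of a SKOS-like thesaurus structure (plain Python,
# hypothetical names -- not an actual SKOS implementation).

class Concept:
    def __init__(self, pref_label, alt_labels=None):
        self.pref_label = pref_label            # skos:prefLabel
        self.alt_labels = alt_labels or []      # synonyms (skos:altLabel)
        self.broader = []                       # parent concepts (skos:broader)
        self.related = []                       # associative links (skos:related)

    def narrower(self, concepts):
        """Inverse of 'broader': concepts whose parents include self."""
        return [c for c in concepts if self in c.broader]

# A tiny Apollo-themed fragment (example content, not the real taxonomy):
mission = Concept("Apollo 11", alt_labels=["AS-506"])
crewed = Concept("Crewed mission")
lunar = Concept("Lunar mission")
mission.broader = [crewed, lunar]   # polyhierarchy: two broader concepts
all_concepts = [mission, crewed, lunar]

# 'narrower' falls out of 'broader' automatically:
assert mission in crewed.narrower(all_concepts)
assert mission in lunar.narrower(all_concepts)
```

Because ‘Apollo 11’ has two broader concepts, the structure is already a graph rather than a tree, which is exactly what distinguishes a thesaurus from a simple taxonomy.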

2. Ontologies

Ontologies are often perceived as complex in contrast to the rather simple taxonomies and thesauri. The limitations of taxonomies and SKOS-based vocabularies in general become obvious as soon as one tries to describe a specific relation between two concepts: ‘Neil Armstrong’ is not only unspecifically ‘related’ to ‘Apollo 11’; he was the ‘commander of’ this particular Apollo mission. Therefore we have to extend the SKOS ontology by two classes (‘Astronaut’ and ‘Mission’) and the property ‘commander of’, which is the inverse of ‘commanded by’.

Apollo ontology relations

The SKOS concept with the preferred label ‘Buzz Aldrin’ has to be classified as an ‘Astronaut’ in order to be described by specific relations and attributes like ‘is lunar module pilot of’ or ‘birthDate’. Introducing additional ontologies to expand the expressivity of SKOS-based vocabularies follows the ‘pay-as-you-go’ strategy of the linked data community. The PoolParty knowledge modelling approach suggests starting with SKOS and then extending this simple knowledge model with other knowledge graphs, ontologies, annotated documents and legacy data. This paradigm can be memorized as a rule: ‘Start SKOS, grow big’.
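The extension described above can be sketched as a handful of triples. The `ex:` prefix is a hypothetical example namespace, and the inference step is a deliberately minimal stand-in for what an OWL reasoner would do with an `owl:inverseOf` declaration:

```python
# Sketch: two classes ('Astronaut', 'Mission') and an inverse property
# pair, expressed as triples. 'ex:' is an example namespace.

triples = [
    ("Neil Armstrong", "rdf:type", "ex:Astronaut"),
    ("Apollo 11", "rdf:type", "ex:Mission"),
    ("Neil Armstrong", "ex:commanderOf", "Apollo 11"),
]

# 'commanderOf' is declared as the inverse of 'commandedBy'; a reasoner
# can then materialize the inverse triples:
inverse = {"ex:commanderOf": "ex:commandedBy"}
inferred = [(o, inverse[p], s) for s, p, o in triples if p in inverse]

assert ("Apollo 11", "ex:commandedBy", "Neil Armstrong") in inferred
```

The point is that once the relation has a name ('commander of') instead of the unspecific 'related', both directions of the fact become queryable.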

3. Knowledge Graphs

Knowledge graphs are all around (e.g. DBpedia, Freebase, etc.). Based on W3C’s Semantic Web Standards, such graphs can be used to further enrich your SKOS knowledge models. In combination with an ontology, specific knowledge about a certain resource can be obtained with a simple SPARQL query. As an example, the fact that Neil Armstrong was born on August 5th, 1930 can be retrieved from DBpedia. Watch this YouTube video which demonstrates how ‘linked data harvesting’ works with PoolParty.
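A lookup of this kind boils down to one SPARQL triple pattern. The sketch below builds the query text for DBpedia's public endpoint (sending it would need an HTTP client and a live endpoint) and then evaluates the same pattern against a tiny local graph to show what the match does:

```python
# Sketch of the SPARQL lookup described above.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?birthDate WHERE {
  dbr:Neil_Armstrong dbo:birthDate ?birthDate .
}
"""

# The same triple pattern, evaluated against an in-memory stand-in graph:
graph = {("dbr:Neil_Armstrong", "dbo:birthDate", "1930-08-05")}
results = [o for s, p, o in graph
           if s == "dbr:Neil_Armstrong" and p == "dbo:birthDate"]
assert results == ["1930-08-05"]
```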

Knowledge graphs can be envisaged as a network of all kinds of things which are relevant to a specific domain or organization. They are not limited to abstract concepts and relations but can also contain instances of things like documents and datasets.

Why should I transform my content and data into a large knowledge graph?

The answer is simple: to be able to run complex queries over the entirety of your information. By breaking up data silos, there is a high probability that query results become more valid.

With PoolParty Semantic Integrator, content and documents from SharePoint, Confluence, Drupal etc. can be transformed automatically to integrate them into enterprise knowledge graphs.

Taxonomies, thesauri, ontologies, linked data graphs including enterprise content and legacy data – all kinds of information can become part of an enterprise knowledge graph, which can be stored in a linked data warehouse. Based on technologies like Virtuoso, such data warehouses can serve as complex question answering systems with excellent performance and scalability.

4. Conclusion

In the early days of the semantic web, we constantly discussed whether taxonomies, ontologies or linked data graphs would be part of the solution. Again and again, discussions like ‘Did the current data-driven world kill ontologies?’ are being led. My proposal is: try to combine all of them. Embrace every method which makes meaningful information out of data. Stop denouncing communities which don’t follow one or the other aspect of the semantic web (e.g. reasoning or SKOS). Let’s put the pieces together – together!


Thomas Thurner

Energy Buildings Performance Scenarios as Linked Open Data

The reduction of greenhouse gas emissions is one of the big global challenges for the coming decades. (Linked) open data on this multi-domain challenge is key for addressing the issues in policy, construction, energy efficiency, production and more. Today – on World Environment Day 2014 – a new (linked open) data initiative contributes to this effort: GBPN’s Data Endpoint for Building Energy Performance Scenarios.

GBPN (The Global Buildings Performance Network) provides the full data set of a recently conducted global scenario analysis for saving energy in the building sector worldwide, projected from 2005 to 2050. The multidimensional dataset includes parameters like housing types, building vintages and energy uses for various climate zones and regions, and is freely available for full use and re-use as open data under the CC-BY 3.0 France license.

To explore this data easily, the Semantic Web Company has developed an interactive query/filtering tool which allows users to create graphs and tables by slicing this multidimensional data cube. Selected results can be exported as open data in open formats (RDF and CSV) and also queried via a provided SPARQL endpoint (a semantic-web-based data API). A built-in query builder makes the use, learning and understanding of SPARQL easy – for advanced users as well as for non-experts and beginners.


The LOD-based information and data system is part of the Semantic Web Company’s recent PoolParty Semantic Drupal developments. It is based on OpenLink’s Virtuoso 7 QuadStore, holding and calculating ~235 million triples, and makes use of the RDF ETL tool UnifiedViews as well as D2R Server for RDF conversion. The underlying GBPN ontology runs on PoolParty 4.2 and also serves a powerful domain-specific news aggregator realized with SWC’s sOnr webminer. Together with other energy-efficiency-related Linked Open Data initiatives like REEEP, NREL, BPIE and others, GBPN’s recent initiative is a contribution towards a broader availability of data supporting action against global warming – as Dr. Peter Graham, Executive Director of GBPN, emphasized: “…data and modelling of building energy use has long been difficult or expensive to access – yet it is critical to policy development and investment in low-energy buildings. With the release of the BEPS open data model, GBPN are providing free access to the world’s best aggregated data analyses on building energy performance.”

The Linked Open Data (LOD) is modelled using the RDF Data Cube Vocabulary (a W3C recommendation), with 17 dimensions in the cube. In total there are 235 million triples available in RDF, including links to DBpedia and Geonames – linking the indicators years, climate zones, regions and building types as well as user scenarios.
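Conceptually, slicing such a cube means fixing some dimensions and keeping the observations that match. The sketch below uses hypothetical field names and toy numbers (the real cube has 17 dimensions and 235 million triples) purely to illustrate the operation:

```python
# Sketch of data-cube slicing: observations as dimension/value records,
# a slice fixes some dimensions. (Hypothetical fields and values.)

observations = [
    {"year": 2005, "region": "EU", "building_type": "residential", "energy_use": 120.0},
    {"year": 2050, "region": "EU", "building_type": "residential", "energy_use": 80.0},
    {"year": 2005, "region": "EU", "building_type": "commercial",  "energy_use": 60.0},
]

def slice_cube(obs, **fixed):
    """Keep only observations matching all fixed dimension values."""
    return [o for o in obs if all(o[k] == v for k, v in fixed.items())]

# Fix region and building type, leave 'year' free:
eu_res = slice_cube(observations, region="EU", building_type="residential")
assert [o["year"] for o in eu_res] == [2005, 2050]
```

The Data Cube vocabulary expresses exactly this structure in RDF (observations, dimensions, measures), which is what makes the filtering tool's slices exportable as RDF and queryable via SPARQL.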

Tassilo Pellegrini

Linked Data in the Content Value Chain or Why Dynamic Semantic Publishing makes sense …

In 2012 Jem Rayfield released an insightful post about the BBC’s Linked Data strategy during the Olympic Games 2012. In this post he coined the term “Dynamic Semantic Publishing”, referring to

“the technology strategy the BBC Future Media department is using to evolve from a relational content model and static publishing framework towards a fully dynamic semantic publishing (DSP) architecture.”

According to Rayfield this approach is characterized by

“a technical architecture that combines a document/content store with a triple-store proves an excellent data and metadata persistence layer for the BBC Sport site and indeed future builds including BBC News mobile.”

The technological characteristics are further described as …

  • A triple-store that provides a concise, accurate and clean implementation methodology for describing domain knowledge models.
  • An RDF graph approach that provides ultimate modelling expressivity, with the added advantage of deductive reasoning.
  • SPARQL to simplify domain queries, with the associated underlying RDF schema being more flexible than a corresponding SQL/RDBMS approach.
  • A document/content store that provides schema flexibility; schema independent storage; versioning, and search and query facilities across atomic content objects.
  • Combining a model expressed as RDF to reference content objects in a scalable document/content-store provides a persistence layer that uses the best of both technical approaches.
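The last bullet, combining an RDF model with a document/content store, can be sketched with two in-memory stand-ins joined by shared content IDs. These structures and names are illustrative assumptions, not the BBC's actual stack:

```python
# Sketch of the hybrid persistence pattern: metadata lives as triples
# (the "triple store"), content bodies live in a "document store", and
# shared content IDs join the two.

doc_store = {
    "doc-1": {"headline": "Men Walk On Moon", "body": "..."},
}
triple_store = [
    ("doc-1", "about", "Apollo 11"),
    ("doc-1", "about", "Neil Armstrong"),
]

def docs_about(topic):
    """Query the graph for matching content IDs, then fetch the documents."""
    ids = {s for s, p, o in triple_store if p == "about" and o == topic}
    return [doc_store[i] for i in ids]

assert docs_about("Apollo 11")[0]["headline"] == "Men Walk On Moon"
```

The division of labour is the point: the graph answers "which content is about X?" flexibly, while the document store handles versioned, schema-free content objects.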

So what are actually the benefits of Linked Data from a non-technical perspective?

Benefits of Linked (Meta)Data

Semantic interoperability is crucial in building cost-efficient IT systems that integrate numerous data sources. Since 2009, the Linked Data paradigm has emerged as a lightweight approach to improve data portability in federated IT systems. By building on Semantic Web standards, the Linked Data approach offers significant benefits compared to conventional data integration approaches. According to Auer [1], these are:

  • De-referencability. IRIs are not just used for identifying entities, but since they can be used in the same way as URLs they also enable locating and retrieving resources describing and representing these entities on the Web.
  • Coherence. When an RDF triple contains IRIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (and described in the source dataset using namespace A) with the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked.
  • Integrability. Since all Linked Data sources share the RDF data model, which is based on a single mechanism for representing information, it is very easy to attain a syntactic and simple semantic integration of different Linked Data sets. A higher-level semantic integration can be achieved by employing schema and instance matching techniques and expressing found matches again as alignments of RDF vocabularies and ontologies in terms of additional triple facts.
  • Timeliness. Publishing and updating Linked Data is relatively simple, thus facilitating timely availability. In addition, once a Linked Data source is updated it is straightforward to access and use the updated data source, since time-consuming and error-prone extraction, transformation and loading is not required.
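The "coherence" point above can be made concrete with a single triple: when subject and object IRIs come from different namespaces, the triple itself is the link between two datasets. The IRIs below are examples chosen for illustration:

```python
# Sketch: one RDF link whose subject and object live in different
# namespaces, interlinking two datasets. (Example IRIs.)

triple = ("http://dbpedia.org/resource/Vienna",     # entity in dataset A
          "http://www.w3.org/2002/07/owl#sameAs",   # typed link
          "http://sws.geonames.org/2761369")        # entity in dataset B

def namespaces(t):
    """Return the namespace part of the subject and object IRIs."""
    s, _, o = t
    return s.rsplit("/", 1)[0], o.rsplit("/", 1)[0]

ns_subject, ns_object = namespaces(triple)
assert ns_subject != ns_object   # the triple bridges two datasets
```

De-referencability then means either IRI can also be fetched over HTTP to retrieve a description of the entity, which is what makes such links navigable in practice.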

On top of these technological principles Linked Data promises to improve the reusability and richness (in terms of depth and broadness) of content thus adding significant value to the content value chain.

Linked Data in the Content Value Chain

According to Cisco, communication within electronic networks has become increasingly content-centric: for the period from 2011 to 2016, Cisco reports an increase of 90% in video content, 76% in gaming content, 36% in VoIP and 36% in file sharing being transmitted electronically. Hence it is legitimate to ask what role Linked Data takes in the content production process. Here we can distinguish five sequential steps: 1) content acquisition, 2) content editing, 3) content bundling, 4) content distribution and 5) content consumption. As illustrated in the figure below, Linked Data can contribute to each step by supporting the associated intrinsic production function [2].

Linked Data in the Content Value Chain


  • Content acquisition is mainly concerned with the collection, storage and integration of relevant information necessary to produce a content item. In the course of this process information is being pooled from internal or external sources for further processing.
  • The editing process entails all necessary steps that deal with the semantic adaptation, interlinking and enrichment of data. Adaptation can be understood as a process in which acquired data is provided in a way that it can be re-used within editorial processes. Interlinking and enrichment are often performed via processes like annotation and/or referencing, which enrich documents either by disambiguating existing concepts or by providing background knowledge for deeper insights.
  • The bundling process is mainly concerned with the contextualisation and personalisation of information products. It can be used to provide customized access to information and services, e.g. by using metadata for the device-sensitive delivery of content, or to compile thematically relevant material into landing pages or dossiers, thus improving the navigability, findability and reuse of information.
  • In a Linked Data environment the process of content distribution mainly deals with the provision of machine-readable and semantically interoperable (meta-)data via Application Programming Interfaces (APIs) or SPARQL Endpoints. These can be designed either to serve internal purposes so that data can be reused within controlled environments (i.e. within or between organizational units) or for external purposes so that data can be shared between anonymous users (i.e. as open SPARQL Endpoints on the Web).
  • The last step in the content value chain deals with content consumption. This entails any means that enable a human user to search for and interact with content items in a pleasant and purposeful way. According to this view, this step mainly deals with end-user applications that make use of Linked Data to provide access to content items (e.g. via search or recommendation engines) and to generate deeper insights (e.g. by providing reasonable visualizations).


There is definitely a place for Linked Data in the content value chain, hence we can expect that Dynamic Semantic Publishing is here to stay. Linked Data can add significant value to the content production process and carries the potential to incrementally expand the business portfolio of publishers and other content-centric businesses. But the concrete added value is highly context-dependent and open to discussion. Technological feasibility is easily contradicted by strategic business considerations, a lack of cultural adaptability, legacy issues like dual licensing, technological path dependencies or simply a lack of resources. Nevertheless, Linked Data should be considered a fundamental principle of next-generation content management, as it provides a radically new environment for value creation.

More about the topic – live

Linked Data in the content value chain is also one of the topics on the agenda of this year’s SEMANTiCS 2014. Listen to keynote speaker Sofia Angeletou and others to learn more about next-generation content management.


[1] Auer, Sören (2011). Creating Knowledge Out of Interlinked Data. In: Proceedings of WIMS’11, May 25–27, 2011, pp. 1–8

[2] Pellegrini, Tassilo (2012). Integrating Linked Data into the Content Value Chain: A Review of News-related Standards, Methodologies and Licensing Requirements. In: Presutti, Valentina; Pinto, Sofia S.; Sack, Harald; Pellegrini, Tassilo (eds.): Proceedings of I-Semantics 2012, 8th International Conference on Semantic Systems. ACM International Conference Proceeding Series, pp. 94–102

Thomas Thurner

Linked Data at the BBC: Connecting Content around the Things that matter to Audiences

This year’s SEMANTiCS conference presents some brilliant speakers, among them Sofia Angeletou with her keynote on the 2nd-generation Linked Data strategy at the BBC.

Quote: “The vision of semantic publishing in the BBC has shifted from supporting high profile events to connecting the BBC’s content around things that matter to the audience. To this end, we have increased the application of linked data to domains other than sports such as news, education and music with the intention that the content we produce can be reused and discovered through a multitude of channels.”

In her keynote, Sofia will outline the technological and cultural factors that have influenced the BBC’s adoption of linked data – a talk reflecting on the early assumptions the BBC made, their effects on the development of the platform, and the way the BBC is addressing them now.
A talk people working in the media and publishing industry should not miss, and one of several highlights the completely revamped SEMANTiCS conference will provide to you.

Register for SEMANTiCS 2014 in Leipzig / Germany.
