Andreas Blumauer

Linked data based search: Make use of linked data to provide means for complex queries

Two live demos of PoolParty Semantic Integrator demonstrate new ways to retrieve information based on linked data technologies

data visualisation

Linked data graphs can be used to annotate and categorize documents. By transforming text into RDF graphs and linking them with LOD like DBpedia, Geonames, MeSH etc. completely new ways to make queries over large document repositories become possible.

An online-demo illustrates those principles: Imagine you were an information officer at the Global Health Observatory of the World Health Organisation. You inform policy makers about the global situation in specific disease areas to direct support to the required health support programs. For your research you need data about disease prevalence in relation with socioeconomic factors.

Datasets and technology

About 160.000 scientific abstracts from PubMed, linked to three different disease categories were collected. Abstracts were automatically annotated with PoolParty Extractor, based on terms from the Medical Subject Headings (MeSH) and Geonames that are organized in a SKOS thesaurus, managed with PoolParty Thesaurus Server. Abstracts were transformed to RDF and stored in Virtuoso RDF store. In the next step, it is easy to combine these data sets within the triple store with large linked data sources like DBPedia, Geonames or Yago. The use of linked data makes it easy to e.g. group annotated countries by the Human Development Index (HDI). The hierarchical structure of the thesaurus was used to collect all concepts that are connected to a specific disease.

This demo was developed based on the libraries sgvizler to visualize SPARQL results. AngularJS was used to dynamically replace variables in SPARQL query templates.

Another example of linked data based search in the field of renewable energy can be tried out here.

Links:

Thomas Thurner

European Data Forum’s Call for Contributions

Flyer2014_150_200The European Data Forum (EDF) is an annual meeting place for industry, research, policy makers, and community initiatives to discuss the challenges and opportunities of data in Europe, especially in the light of recent developments such as Open Data, Linked Data and Big Data.

EDF 2014 will be held in Athens, Greece on March 19-20, 2014. The program will consist of a mixture of presentations, panels and networking sessions by industry, academics, policy makers, and community initiatives. Topics will cover a wide spectrum of research and technology development, applications, and socio-economic aspects of the data value chain.

Call for Contributions

EDF 2014 is seeking inspiring presentations addressing the topics listed below:

  • Innovative research and technology for Open Data, Linked Data and Big Data
  • Applications
  • Socio-economic and policy issues
  • Data visions for the future

Proposals for presentations should be submitted as a single PDF file at https://www.easychair.org/conferences/?conf=edf2014 or sent via e-mail to edf2014@data-forum.eu. Proposals will be reviewed by the Organizing Committee of EDF 2014 according to their relevance to the scope and purpose of the event.

Important Dates

  • Submission of proposals:  December 10 2013, 22.00pm CET
  • Notification of acceptance or rejection: early January 2014
  • Full EDF2014 program available: end of January 2014

Download the Call for Contributions in pdf format.

Andreas Blumauer

The LOD cloud is dead, long live the trusted LOD cloud

The ongoing debate around the question whether ‘there is money in linked data or not’ has now been formulated more poignantly by Prateek Jain (one of the authors of the original article) recently: He is asking, ‘why linked open data hasn’t been used that much so far besides for research projects?‘.

I believe there are two reasons (amongst others) for the low uptake of LOD in non-academic settings which haven’t been discussed in detail until today:

1. The LOD cloud covers mainly ‘general knowledge‘ in contrast to ‘domain knowledge

Since most organizations live on their internal knowledge which they combine intelligently with very specific (and most often publicly available) knowledge (and data), they would benefit from LOD only if certain domains were covered. A frequently quoted ‘best practice’ for LOD is that portion of data sets which is available at Bio2RDF. This part of the LOD cloud has been used again and again by the life sciences industry due to its specific information and its highly active maintainers.

We need more ‘micro LOD clouds’ like this.

Another example for such is the one which represents the German Library Linked Open Data Cloud (thanks to Adrian Pohl for this pointer!) or the Clean Energy Linked Open Data Cloud:

reegle-lod-cloud

I believe that the first generation of LOD cloud has done a great job. It has visualised the general principles of linked data and was able to communicate the idea behind. It even helped – at least in the very first versions of it – to identify possibly interesting data sets. And most of all: it showed how fast the cloud was growing and attracted a lot of attention.

But now it’s time to clean up:

A first step should be to make a clear distinction between the section of the LOD cloud which is open and which is not. Datasets without licenses should be marked explicitly, because those are the ones which are most problematic for commercial use, not the ones which are not open.

A second improvement could be made by making some quality criteria clearly visible. I believe that the most important one is about maintenance and authorship: Who takes responsibility for the quality and trustworthiness of the data? Who exactly is the maintainer?

This brings me to the second and most important reason for the low uptake of LOD in commercial applications:

2. Most datasets of the LOD cloud are maintained by a single person or by nobody at all (at least as stated on datahub.io)

Would you integrate a web service which is provided by a single, maybe private person into a (core-)application of your company? Wouldn’t you prefer to work with data and services provided by a legal entity which has high reputation at least in its own knowledge domain? We all know: data has very little value if it’s not maintained in a professional manner. An example for a ‘good practice’ is the integrated authority file provided by German National Library. I think this is a trustworthy source, isn’t it? And we can expect that it will be maintained in the future.

It’s not the data only which is linked in a LOD cloud, most of all it’s the people and organizations ‘behind the datasets’ that will be linked and will co-operate and communicate based on their datasets. They will create on top of their joint data infrastructure efficient collaboration platforms, like the one in the area of clean energy – the ‘Trusted Clean Energy LOD Cloud‘:

reegle.info trusted links

REEEP and its reegle-LD platform has become a central hub in the clean energy community. Not only data-wise but also as an important cooperation partner in a network of NGOs and other types of stakeholders which promote clean energy globally.

Linked Data has become the basis for more effective communication in that sector.

To sum up: To publish LOD which is interesting for the usage beyond research projects, datasets should be specific and trustworthy (another example is the German labor law thesaurus by Wolters Kluwer). I am not saying that datasets like DBpedia are waivable. They serve as important hubs in the LOD cloud, but for non-academic projects based on LOD we need an additional layer of linked open datasets, the Trusted LOD cloud.

 

Andreas Blumauer

There’s Money in Linked Data

I believe that the ongoing debate whether there ‘is money in linked (open) data or not’ is a bit misleading. ‘Linked (open) data’ is not only the data itself. It’s much more, even more than yet another technology stack. Linked data is most of all a set of principles how to organize information in agile organizations that are embedded in fast moving and dynamic environments. And from this perspective there is a huge amount of money in it – but let me refine that a bit later.

networkMan

Crying out loud in 2013 that ‘there is no money in linked data’ is an important step towards the right direction because it points out that data publishers should be more precise with data licensing. Although quite flexible licensing models would already exist – it’s the people (and probably other legal entities) who forget to publish their data together with statements about the ‘openness’ of it. As a result, the data remains closed for commercial users. This hasn’t been properly noticed in the early days of the linked open data cloud since commercial users haven’t been around at all (in contrast to academic institutions which considered the LOD cloud to be a wonderful playground). It’s the same thing with linked data as a technology and linked data as a set of standards: the standards and the technology stack are mature now (just think about Virtuoso’s brilliant SPARQL performance, for example), but most people from IT still wouldn’t have things like URIs, RDF and SPARQL off the top of their head when they seek solutions for powerful data integration methodologies.

Why is that?

I believe that so far ‘linked data’ has always been perceived by people from outside the linked data core-community only as a new way to organize data on the web, thus technologies are still not mature for enterprises.

But the truth is, that linked data has at least a threefold nature. Linked data is

  1. a method to organize information in general, not only on the web but also in enterprises
  2. a set of standards which is flexible and expressive enough to link data across boundaries (organizational, political, philosophical), cultures and languages
  3. a way of using IT and information in a quite intuitive way, very close to the patterns like human beings tend to create realities, thus comprehensible also for non-techies.

I think that technologists have made a brilliant job so far with creating the linked data technology stack, its underlying standards, triple-stores and quad-stores, reasoners etc., and for specialists it’s absolutely clear why this kind of technologies will outperform traditional databases, BI-tools, search engines etc. by far.

But: the crucial point now is that enterprises have to adapt linked data technologies inside their corporate boundaries (and not only for SEO purposes or the like). The key question is not whether there is enough LOD out there for app-makers or not. High-quality LOD will be produced very quickly as soon as there are commercial consumers like large enterprises. I am not talking about use cases for linked data in the fields of data publishing or SEO.

The main driver for the further Linked Data development will be enterprises which embrace LD technologies for their internal information management.

It’s true that there are already some large companies (like Daimler - meet them at this year’s I-SEMANTICS in Graz!) dealing with that question but to be honest: there is not the same hype around ‘linked data’ as we can see with ‘big data’. IBM, Microsoft & Co. are not that interested in linked data of course because it is a platform by itself and doesn’t foresee any kind of lucrative lock-in effects. Internet companies like Google and Facebook make use of linked data quite hesitantly. Although Facebook’s Graph Search or Google’s Knowledge Graph contain large portions of this kind of technology, Google would never say ‘oh, we are a semantic web company now, we make heavy use of linked data, and of course we will also contribute to the LOD cloud.’

Why is that? Simply spoken, because through the glasses of Google, Facebook & Co. the internet is a huge machine which produces data for them. Not the other way around.

But shouldn’t the enterprise customers themselves be interested in a cost-effective way of information management? They are, but as stated before, they haven’t perceived linked data as such, although it clearly is.

To develop technologies, we need critical questions, and of course the most critical ones always come from the inside of a community or movement. But time has come to spread the good news for the ‘outside’.

  • Yes, databases which rely on linked data standards have become mature and enough performing for many query types so that they outperform even ‘traditional’ relational databases
  • Yes, also issues which are critical for enterprise usage like privacy and security have been solved by most linked data technology vendors
  • Yes, there is a critical mass of available LOD sources (for example UK Ordnance Survey) and also of high-quality thesauri and ontologies (for example Wolter Kluwer’s working law thesaurus) to be reused in corporate settings
  • Yes, there is a volume of developers and consultants on the labor market (in the U.S. as well as in the E.U.) which is big enough to being able to execute large linked data projects
  • Yes, there are tons of business cases that can benefit from linked data. Linked data and semantic web technologies should be considered as core technologies for any information architecture, at least in larger corporations
  • Yes, SPARQL Query Language is not only a second SQL but comes with some brilliant features like transitive queries which help to save a lot of time when developing applications like business intelligence reporting and analysis
  • Yes, Linked Data has the potential to become the basis for a large variety of tools which help decision-makers (not only in enterprises but also in politics) to become true ‘digerati’ instead of being degraded to masters of the ‘bullshit bingo’.

Yes, this list can be further extended and it is a core element for the further expansion of the LOD cloud. It’s the enterprises that will drive the next level of maturity of the linked data landscape. Because at the end of the day it’s only them who will pay or have already paid the bill for open (government) data.