Andreas Blumauer

Why SKOS thesauri matter – the next generation of semantic technologies

Many so-called “semantic technologies” still do nothing more than pure statistical analysis of text. Sure, this is better than simple full-text search, but there is still plenty of room to improve search, especially for more sophisticated applications like “similarity search”: the search for similar documents to enable cross-reading or recommendation systems.

Providers of first-generation semantic technologies calculate rather basic “semantic networks” by co-occurrence analysis, which sometimes yields disappointing results. Bearing in mind that Google just bought a company (“Google buys Metaweb”) which has been working on one of the largest knowledge bases in the world, we can assume that some of the last miles towards a semantic search engine can be covered by applying thesauri or other structured knowledge bases.

The PoolParty team recently developed a demo application that shows how thesauri improve search results on top of second-generation semantic technologies. With PoolParty, SKOS-based controlled vocabularies can be managed and enriched with linked data. The PoolParty Tag & Content Recommender analyzes virtually any text or website to recommend corresponding tags, concepts from (in this case) STW (Standard Thesaurus für Wirtschaft) and DBpedia, and the respective articles from Wikipedia.

STW, which was developed by the German National Library of Economics (ZBW), provides vocabulary on any economic subject: about 6,000 standardized subject headings and about 18,000 entry terms to support individual keywords.

This background knowledge is used in this demo app to improve the search for similar documents dramatically:

Similarity between two documents can be calculated not only on a key-phrase basis but also on a rather conceptual basis. Even if two documents do not have one single word or phrase in common they can be identified as “similar documents”.

This can be achieved because thousands of important relations between economic subjects are represented in the domain specific thesaurus. Thus, in this special case best results are achieved with documents from economics (for instance from Econstor) but of course for other recommender systems thesauri from other domains can be used instead of STW.
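The idea of concept-level similarity can be illustrated with a minimal sketch. It assumes each document has already been tagged with thesaurus concepts, expands those tags with directly related concepts, and compares the expanded sets with a Jaccard overlap. The toy relations, the function names and the choice of Jaccard are assumptions made for illustration; they are not taken from STW or from the demo application.

```python
# Toy thesaurus: concept -> directly related or broader concepts.
# Hypothetical data for illustration, not real STW content.
RELATED = {
    "inflation": {"monetary policy", "price level"},
    "interest rate": {"monetary policy", "central bank"},
}


def expand(concepts, related=RELATED):
    """Return the concepts plus their direct thesaurus neighbours."""
    expanded = set(concepts)
    for concept in concepts:
        expanded |= related.get(concept, set())
    return expanded


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def concept_similarity(doc_a_concepts, doc_b_concepts):
    """Compare two documents on their thesaurus-expanded concept sets."""
    return jaccard(expand(doc_a_concepts), expand(doc_b_concepts))


doc_a = {"inflation"}      # concepts tagged on document A
doc_b = {"interest rate"}  # concepts tagged on document B

print(jaccard(doc_a, doc_b))             # 0.0: no tag in common
print(concept_similarity(doc_a, doc_b))  # 0.2: both relate to "monetary policy"
```

The two documents share no tag, yet they score above zero once the thesaurus relations are taken into account, which is the effect described above.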

Nevertheless, this approach can also be improved, and this development is underway: SKOS thesauri enriched with Linked Data do an even better job. Such third-generation semantic technologies are currently being developed by the LASSO project and the LOD2 project, two innovative projects in the area of linked data and the semantic web.

Andreas Blumauer

What if the biggest web company bought one of the central semantic web players?

Well, exactly this happened yesterday: Google bought Metaweb, the provider of Freebase. Freebase is an important hub in the linked data cloud, providing 12 million entities with uniform resource identifiers, most of them linked to other semantic web datasets like DBpedia or New York Times. For example, Google's page on Freebase offers a rich source of machine-readable facts about the company.

What does this mean for the Semantic Web community, which has been working on a smarter web for the last decade?
Well, a lot… First of all, it's good to hear that Google will continue to develop Freebase as a free and open database for everyone, saying “… we would be delighted if other web companies use and contribute to the data.”

Until yesterday, many companies were still not fully convinced that the Semantic Web would play a central role in the further development of the Internet. Now the game has changed. The entity-driven approach to developing web applications has only just begun.

We will keep on reporting and discussing how Google will influence the development of the Semantic Web – and if I had one free wish: Please add RDF(a) to the Freebase widgets!

Tassilo Pellegrini

Linking Open Data to Thesaurus Management

The Vienna-based company punkt. netServices is just about to release a demo version of their PoolParty service, a SKOS-based thesaurus management tool with linked data capabilities. I had the chance to pre-read a white paper and test their service. Here is a brief overview. You can also try a demo.

Purpose

PoolParty was conceived to facilitate various applications like

  • Semantic search engines
  • Recommender systems (similarity search)
  • Corporate bookmarking
  • Annotation and tag recommender systems
  • Autocomplete services and faceted browsing.

These use cases can be achieved either by using PoolParty stand-alone or by integrating it with existing Enterprise Search Engines, Document Management Systems or Enterprise Wikis.

Thesaurus Management

PoolParty aims to be easy to use for people without a strong Semantic Web background or special technical skills. The GUI is entirely web-based and utilizes AJAX, so the user can, for example, quickly merge two concepts via drag & drop. An overview of the thesaurus is available as a tree or a graph view of the concepts.

PoolParty also helps to semi-automatically add concepts to a thesaurus as it can be used to analyse documents (e.g. web pages or PDF files) relevant to a thesaurus’ domain in order to glean candidate terms. This is done by the key-phrase extractor of KEA. The extracted terms can be selected by the user, thereby becoming “free concepts” which later can be integrated into the thesaurus, turning them into “approved concepts”.

Documents can be searched in various ways: by keyword search in the full text, by searching for their tags, or by semantic search and similarity search. Semantic search takes into account not only a concept’s preferred label, but also its synonyms and the labels of its related concepts. The user can manually remove query terms used in semantic search, and the boost values for the various relations considered in semantic search can also be adjusted. The recommendation mechanism that calculates document similarity works in the same way.
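The query expansion described above can be sketched as follows. This is an illustration under stated assumptions, not PoolParty's actual implementation: the `Concept` class, the `expand_query` function and the boost values are hypothetical, and only the preferred label of each related concept is added.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Minimal stand-in for a SKOS concept (hypothetical structure)."""
    pref_label: str
    alt_labels: list = field(default_factory=list)  # synonyms
    related: list = field(default_factory=list)     # related Concept objects


# Assumed default boosts per relation type; adjustable by the caller.
DEFAULT_BOOSTS = {"pref": 1.0, "alt": 0.8, "related": 0.4}


def expand_query(concept, boosts=DEFAULT_BOOSTS, removed=()):
    """Map each query term to its boost, skipping terms the user removed."""
    terms = {}

    def add(label, weight):
        if label in removed:
            return
        # If a label is reachable via several relations, keep the highest boost.
        terms[label] = max(weight, terms.get(label, 0.0))

    add(concept.pref_label, boosts["pref"])
    for label in concept.alt_labels:
        add(label, boosts["alt"])
    for neighbour in concept.related:
        add(neighbour.pref_label, boosts["related"])
    return terms


rum = Concept("Rum")
mojito = Concept("Mojito", alt_labels=["Mojito cocktail"], related=[rum])

print(expand_query(mojito))
# {'Mojito': 1.0, 'Mojito cocktail': 0.8, 'Rum': 0.4}
print(expand_query(mojito, removed={"Rum"}))
# {'Mojito': 1.0, 'Mojito cocktail': 0.8}
```

The resulting term-to-boost mapping could then be handed to whichever search backend is in use.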

PoolParty by default also publishes a Semantic Wiki version of its thesauri, which provides an alternative way to browse and edit concepts. Through this feature anyone can get read access to a thesaurus, and optionally also edit, add or delete labels of concepts. Search and autocomplete functions are available here as well. The Wiki’s XHTML source is also enriched with RDFa, thereby exposing all RDF metadata associated with a concept to be picked up by RDF search engines and crawlers. (See two examples: Cocktail thesaurus, Standard Thesaurus for Economics)

PoolParty also supports the import of thesauri in SKOS (including several consistency checks) or Zthes format. Those functionalities can also be consumed as stand-alone web services via PoolParty SKOS Services. Additionally, lists of concepts and their labels can be imported via CSV files.

Linked (Open) Data

PoolParty not only publishes its thesauri as Linked Open Data (in addition to a SPARQL endpoint), but it also consumes LOD in order to expand thesauri with information from LOD sources.

Concepts in the thesaurus can be linked to e.g. DBpedia via a service like Georgi Kobilarov's DBpedia lookup service, which takes the label of a concept and returns possible matching candidates. The system suggests relevant resources from DBpedia and the user can select the one that matches the concept from the thesaurus, thereby creating a skos:exactMatch relation between the concept URI in PoolParty and the DBpedia URI. The same approach can be used to link to other SKOS thesauri available as Linked Data.
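The outcome of this linking step is a single RDF statement. The sketch below shows how such a statement could be serialized as an N-Triples line once the user has picked a candidate. The concept URI under example.org, the hard-coded candidate list and the function names are assumptions for illustration; the lookup service itself is not called.

```python
# The skos:exactMatch property URI from the SKOS core vocabulary.
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"


def exact_match_triple(concept_uri, target_uri):
    """Serialize a skos:exactMatch link as one N-Triples line."""
    return f"<{concept_uri}> <{SKOS_EXACT_MATCH}> <{target_uri}> ."


def link_concept(concept_uri, candidates, chosen_index):
    """Link a concept to the candidate the user selected from a suggestion list."""
    return exact_match_triple(concept_uri, candidates[chosen_index])


# Hypothetical suggestions, standing in for the response of a lookup service.
candidates = [
    "http://dbpedia.org/resource/Mojito",
    "http://dbpedia.org/resource/Mojito_(disambiguation)",
]

triple = link_concept("http://example.org/thesaurus/mojito", candidates, 0)
print(triple)
```

Keeping the user's choice as an explicit index mirrors the workflow above: the system only suggests, and the link is created when a person confirms the match.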

Other triples can also be retrieved from the target data source, e.g. the DBpedia abstract can become a skos:definition and geographical coordinates can be imported and be used to display the location of a concept on the map, where appropriate. The DBpedia category information may also be used to retrieve additional concepts of that category as siblings of the concept in focus, in order to populate the thesaurus.

PoolParty is capable of importing a SKOS thesaurus from a Linked Data server, and may also receive updates to thesauri imported this way. This feature has been implemented in the course of the KiWi project funded by the European Commission. KiWi also contains SKOS thesauri and exposes them as LOD. Both systems can read a thesaurus via the other’s LOD interfaces and may write it to their own store. This is facilitated by special Linked Data URIs that return, for example, all the top concepts of a thesaurus with pointers to the URIs of their narrower concepts, which allows other systems to retrieve a complete thesaurus through iterative dereferencing of concept URIs.
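Iterative dereferencing amounts to a graph traversal. The following sketch walks from a top-concepts URI down through the narrower concepts, breadth-first. To stay self-contained it replaces the HTTP request with a dictionary lookup; the `ex:` identifiers, the `REMOTE` data and the function names are hypothetical and do not reflect the actual URI scheme of PoolParty or KiWi.

```python
from collections import deque

# Stand-in for a remote Linked Data server: URI -> URIs of narrower concepts.
# A real client would issue an HTTP GET per URI and parse the returned RDF.
REMOTE = {
    "ex:top-concepts": ["ex:drinks"],
    "ex:drinks": ["ex:cocktails", "ex:juices"],
    "ex:cocktails": ["ex:mojito"],
    "ex:juices": [],
    "ex:mojito": [],
}


def dereference(uri):
    """Return the narrower-concept URIs listed at `uri` (simulated)."""
    return REMOTE.get(uri, [])


def harvest(top_concepts_uri, fetch=dereference):
    """Collect the whole thesaurus by following narrower links breadth-first."""
    narrower_of = {}
    queue = deque(fetch(top_concepts_uri))
    while queue:
        uri = queue.popleft()
        if uri in narrower_of:  # already visited (polyhierarchy or cycle)
            continue
        narrower_of[uri] = list(fetch(uri))
        queue.extend(narrower_of[uri])
    return narrower_of


thesaurus = harvest("ex:top-concepts")
for concept, narrower in thesaurus.items():
    print(concept, "->", narrower)
```

The visited check matters because a concept may have more than one broader concept and would otherwise be fetched repeatedly.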

Additionally, KiWi and PoolParty publish lists of concepts created, modified, merged or deleted within user-specified time frames. With this information the systems can learn about updates to one of their thesauri in an external system. They can then compare the versions of concepts in both stores and may write the corresponding updates to their own store.

This means each system decides autonomously which data it accepts, and there is no risk of a system pushing data that might lead to inconsistencies into an external store. Data transfer and communication are achieved using REST/HTTP; no other protocols or middleware are necessary. Also, no rights management for external systems is needed, which otherwise would have to be configured separately for each source.
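The pull-based update model can be sketched in a few lines. The receiving system reads a change list, fetches the current remote version of each listed concept, compares it with its own copy and applies only what its own policy accepts. The data layout, the `pull_updates` function and the acceptance rule are assumptions for illustration (merges are not modelled separately); the sketch does not reproduce the actual exchange format between KiWi and PoolParty.

```python
def pull_updates(local_store, change_list, fetch_remote, accept):
    """Apply remote changes to local_store; return the URIs actually changed."""
    applied = []
    for uri, change in change_list:
        if change == "deleted":
            # Deletions are also subject to the local acceptance policy.
            if uri in local_store and accept(uri, None):
                del local_store[uri]
                applied.append(uri)
            continue
        remote_concept = fetch_remote(uri)
        # Write only if the versions differ and the local policy agrees.
        if local_store.get(uri) != remote_concept and accept(uri, remote_concept):
            local_store[uri] = remote_concept
            applied.append(uri)
    return applied


# Hypothetical local and remote stores: URI -> concept data.
local = {
    "ex:mojito": {"prefLabel": "Mojito"},
    "ex:old": {"prefLabel": "Old concept"},
}
remote = {
    "ex:mojito": {"prefLabel": "Mojito cocktail"},
    "ex:new": {"prefLabel": ""},  # inconsistent: empty preferred label
}
changes = [("ex:mojito", "modified"), ("ex:new", "created"), ("ex:old", "deleted")]


def has_label(uri, concept):
    """Local policy: accept deletions, reject concepts without a preferred label."""
    return concept is None or bool(concept.get("prefLabel"))


applied = pull_updates(local, changes, remote.get, has_label)
print(applied)  # ['ex:mojito', 'ex:old']
print(local)    # {'ex:mojito': {'prefLabel': 'Mojito cocktail'}}
```

Because the remote side only publishes and never writes, the inconsistent concept `ex:new` simply never reaches the local store.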

Technology

The software is written in Java and utilizes the SAIL API, so it can be used with various triple stores. The thesaurus management itself (viewing, creating and editing SKOS concepts and their relationships) can be done in an AJAX Frontend based on Yahoo User Interface (YUI). Editing of labels can alternatively be done in a Wiki style HTML frontend. For key-phrase extraction from documents PoolParty uses a modified version of the KEA 5 API, which is extended for the use of controlled vocabularies stored in a SAIL Repository (this module is available under GNU GPL). The analysed documents can be stored and indexed in Lucene/Solr or any other (enterprise) search system along with extracted and semantically related concepts.

Thomas Thurner

1000-and-one pulldowns

Luckily, the time has come when semantic search techniques are finding their way into theme portals that provide knowledge. Nearly once a week a new knowledge portal with built-in semantic search pops up. They deal with environmental issues, health care, economy etc. These sites are good examples of how the vision of a knowledge web is fostered by semantic technologies. Besides general knowledge portals like Wolfram Alpha, such focused approaches will be great showcases for “a” semantic web (even if they are not based on “the” RDF semantic web) in the next few months.

But the potential of these semantic theme portals is often substantially reduced by their poor usability. You get lost in categories and flags, you get puzzled by pulldowns, mouseovers and embedded hierarchies; sometimes it is a mess of 1001 functions. You need to understand the underlying semantic concept to get oriented within these applications, and this is not the goal of the exercise. Search has to be easy.

To show the potential of semantic technologies, we need good examples, which offer good usability. This is a call to everyone to provide such examples.

See my favorites:

  • NextBio, a platform that enables life science researchers to search, discover, and share knowledge locked within public and proprietary data
  • reegle, the Search Engine for Renewable Energy and Energy Efficiency
  • CultureSampo, a Finnish cultural heritage platform for institutional organizations as well as private citizens