Andreas Blumauer

Has Google hijacked the Semantic Web?

Google recently launched the ‘Knowledge Graph’ (GKG), which “understands real-world entities and their relationships to one another: things, not strings.” Has Google hijacked the idea of the ‘Semantic Web’, or at least its vocabulary?

Sean Golliher has compared the most central concepts of the SemWeb community with Google’s wording in his blog post. For instance, Google doesn’t talk about ‘Linked data’ or ‘URIs’ but rather about ‘things and their relationships’. We don’t know whether Google uses standards like RDF, but GKG evidently implements many concepts and ideas developed by the SemWeb community in recent years. Some people complain that Google should clearly state that this is an implementation of the ‘Semantic Web’ (which was not invented by Google); others say that most concepts, like ‘taxonomies’, have been around for hundreds of years anyway.

I believe that both sides now have a great chance to work together. Whether Google’s goal, to “build the next generation of search, which taps into the collective intelligence of the web and understands the world a bit more like people do”, can be reached is a matter of the intelligence of the employees, and a lot of potential can be found within the semantic web community. If Google gives credit where it is due, semantic web people will be a bit more inspired to support an ecosystem built around GKG, and it won’t be long until an ‘Open Knowledge Graph’ fits together with Google’s revenue model.

Andreas Blumauer

Introducing SKOSsy – generate thesauri on the fly!

Imagine you could generate a thesaurus of quite good quality for nearly any knowledge domain you can think of! Sounds impossible? Does it remind you of all the promises made by text mining software that generates “semantic nets” from scratch?

Let me introduce you to SKOSsy. I will explain what this web service can do for you:

SKOSsy generates SKOS-based thesauri in German or English for a domain you are interested in. Not any domain, but nearly any: SKOSsy extracts its data from DBpedia, so it can cover anything that is in DBpedia. SKOSsy therefore works well whenever a first seed thesaurus has to be generated for a certain organisation or project. If you load the automatically generated thesaurus into an editor like PoolParty Thesaurus Manager (PPT), you can start to enrich the knowledge model with additional concepts, relations and links to other LOD sources. The point is that you don’t have to start your thesaurus project from scratch.

Let me give you an example: Imagine you are working for a company which is an international plant builder and you would like to index several thousand documents the “semantic way”. You have to walk through the following steps:

  1. Identify proper categories in Wikipedia/DBpedia which describe best what your business or your domain is all about. Those categories should contain pages / resources which are related to the documents you would like to index. For example: http://dbpedia.org/resource/Category:Metalworking or http://dbpedia.org/resource/Category:Industrial_automation
  2. After you have selected proper categories, SKOSsy will traverse DBpedia for you, collect all resources, their hierarchical and non-hierarchical relations, alternative labels, definitions and other properties, and put them together as a valid SKOS thesaurus; this step takes a couple of minutes. (Find the resulting vocabulary here)
  3. Load the resulting thesaurus into PPT, explore it, improve it and enrich it with additional facts.
  4. After you’re done, you can generate a tailor-made text extractor by using PoolParty Extractor (PPX), which is the second component of the PoolParty product family.
  5. With PPX and its extraction model especially curated for your special use case you can extract named entities from your documents automatically and index your documents in a meaningful way.
  6. After a few seconds your semantic search engine is ready to be used. PoolParty Semantic Search (PPS), which is the third PoolParty component, offers some nice facilities like categorized auto-complete, faceted search, content recommendation (similarity search) and smart search suggestions to ease your life as a knowledge worker.
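
Steps 1 and 2 can be pictured roughly as follows. This is only an illustrative sketch, not SKOSsy's actual implementation: it walks a small hand-written category table (a stand-in for DBpedia, so nothing is fetched from the web) and writes the result as SKOS in Turtle syntax. The names `TOY_CATEGORIES`, `collect_concepts` and `to_turtle`, the depth limit and the `example.org` base URI are all assumptions made for the example.

```python
from collections import deque

# Hypothetical stand-in for DBpedia category data:
# category name -> (subcategories, member resources).
TOY_CATEGORIES = {
    "Metalworking": (["Welding"], ["Forging", "Casting"]),
    "Welding": ([], ["Arc_welding"]),
}


def collect_concepts(seeds, categories, max_depth=2):
    """Breadth-first walk from the seed categories.

    Returns a dict mapping each concept to its broader concept
    (None for the seeds themselves).
    """
    broader = {}
    queue = deque((seed, None, 0) for seed in seeds)
    while queue:
        name, parent, depth = queue.popleft()
        if name in broader:
            continue  # already reached via another path
        broader[name] = parent
        if depth >= max_depth or name not in categories:
            continue  # leaf resource or depth limit reached
        subcategories, members = categories[name]
        for child in subcategories + members:
            queue.append((child, name, depth + 1))
    return broader


def to_turtle(broader, base="http://example.org/thesaurus/"):
    """Serialise the collected concepts as a minimal SKOS thesaurus in Turtle."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for name, parent in broader.items():
        label = name.replace("_", " ")
        lines.append(f"<{base}{name}> a skos:Concept ;")
        if parent is not None:
            lines.append(f"    skos:broader <{base}{parent}> ;")
        lines.append(f'    skos:prefLabel "{label}"@en .')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    concepts = collect_concepts(["Metalworking"], TOY_CATEGORIES)
    print(to_turtle(concepts))
```

A real seed thesaurus would also carry alternative labels, definitions and non-hierarchical relations, as described in step 2; they are left out here to keep the sketch short.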

Over the last years we have constantly discussed the application of thesauri and other knowledge models to improve search. Many people understood straight away why thesaurus-based search is most often much better than search algorithms purely based on statistics. Of course, the main objection has always been that the costs of establishing a “good-enough” thesaurus, let alone a “high-quality” one, are too high.

With SKOSsy in place those kinds of arguments become weaker and weaker. To sum up,

  • SKOSsy makes heavy use of Linked Data sources, especially DBpedia
  • SKOSsy can generate SKOS thesauri for virtually any domain within a few minutes
  • Such thesauri can be improved, curated and extended to one’s individual needs, but they usually serve as “good-enough” knowledge models for any semantic search application you like
  • SKOSsy-based semantic search usually outperforms search algorithms based on statistics, since the underlying thesauri contain high-quality information about relations, labels and disambiguation
  • SKOSsy works perfectly together with the PoolParty product family

If you are interested in the results produced by SKOSsy, just send us a short note about your domain or your project and we will invite you to become a beta tester or prepare a demo for you.

Andreas Blumauer

“Thesaurus-based search engines will become mainstream in the near future”

The results of the survey titled “Do controlled vocabularies matter?”, which was conducted by Semantic Web Company from May until June 2011, are now public. Over 150 participants from 27 countries draw a picture of the current and future usage of controlled vocabularies.

Here are three of the most interesting outcomes of this questionnaire – the whole report can be found and downloaded on issuu:

Do you think enterprises and other organizations can significantly benefit from using Linked Data?

The answer is a clear YES. A subsequent question also reveals that organisations of all sizes hold about the same opinion concerning linked data. Only a few people think that linked data is a “niche thing”. In general, over 90% of the participants think that most or at least some organisations can benefit from using linked data.

Do you think that search engines which utilize thesauri to improve results will become mainstream?

The results of this question are amazing: Two thirds of the participants think that thesaurus-based search already is mainstream or will become so in the near future. Scepticism towards this development seems to be low; at the very least, a clear majority expects thesaurus-based search engines to become mainstream in the near future.

 

How important is the usage of standards like SKOS for controlled vocabularies?

The results speak for themselves. The majority of the participants are convinced that standards like SKOS are important for their daily work. In August 2009 W3C announced the new SKOS standard; now, nearly two years later, it looks like this standard has truly arrived. 48.7% stated that standards like SKOS are very important and 29.1% voted for “relevant”.

 

As an overall result of the survey it can be stated that the Semantic Web community has done a great job of convincing controlled vocabulary people of the benefits of SKOS and linked data. On the other hand, only 3-5% are aware of SPARQL as a valuable resource for building standard APIs around controlled vocabularies, which would lower the costs of implementing such knowledge organization systems.

Many thanks to all participants of this survey!

Andreas Blumauer

Florian Bauer: I like to view “linked data” as a “single worldwide API”

Florian Bauer is REEEP’s Operations and IT Director, responsible for the overall operational management of the organisation, the product management of reegle (the search engine for renewable energy and energy efficiency) and the management of the IT landscape of REEEP.

The PoolParty team had the chance to talk with Florian about reegle, the information gateway on clean energy.

Could you please give us a brief overview of reegle – what are the goals you are pursuing with this platform?

The main aim of the reegle information gateway (http://www.reegle.info) is to provide a one-stop gateway to comprehensive, high-quality and up-to-date information on clean energy. By making this information accessible to stakeholders in the field around the world, and by presenting it in a user-friendly and intuitive format, reegle directly helps to facilitate the transition to low-carbon energy.

The website provides information on renewable energy, energy efficiency and climate change and their various sub-sectors at a global level, and some reegle services actually combine raw data sets from several different sources, put these datasets into context and thus provide enriched information.

reegle is an offshoot of the Renewable Energy & Energy Efficiency Partnership (REEEP), a non-profit, specialist change agent aiming to catalyze the market for renewable energy and energy efficiency, with a primary focus on emerging markets and developing countries.

The new reegle data portal (data.reegle.info), launched in 2011, has established reegle as a publisher and consumer of Linked Open Data in the energy sector. It provides key clean energy datasets free for re-use using Linked Open Data W3C standards.

reegle consists of two components: one is the semantic search engine (http://www.reegle.info/), the other is the linked data portal (http://data.reegle.info/) – What are your target groups, and which typical problems of the clean energy domain can you solve with these services?

For reegle.info, our target groups are primarily project developers, financiers and government policy-makers. These users can access high-quality information on clean energy-related issues with the set of tools we provide: a special web search, a catalogue of more than 1700 key stakeholders, a map view for geographical browsing, a clean energy glossary, and an energy country profiles function.

The energy country profiles are typical of what we’re trying to achieve. Here, we take information from many different providers and combine it all to present one comprehensive information dossier on renewable energy and energy efficiency in that particular country. This means that in one location you have the country’s most important energy-related information ranging from key statistics, and current regulations to key players in the energy field in both public and private sectors.

For our data portal, the target group is a more technical one: primarily IT developers and open data specialists who want to create new mash-ups and integrate data from reegle into other websites. One of the first using these reegle data sets is the OpenEI.org website, another key portal in the energy field.

Open data is not the same as linked open data. Why did you choose to build your services around W3C’s linked data paradigm and/or standards like RDF?

Tim Berners-Lee once mentioned that he likes to compare the progressive ways of offering data with the “stars system” used to rate hotels. You get:

* for making data public (in any format)
** for machine-readable formats (structured data)
*** if the data is offered in a non-proprietary format
**** if you use URIs to identify things, so people can point to your datasets
***** for linking to other people’s data to provide context

So, as you can imagine, our goal is for reegle to be firmly in the 5-star category, and to establish reegle as an avant-garde tool in energy data.
I also like to view “linked data” as a “single worldwide API”. If the old web was like a huge book, the new semantic web is like a huge database, and SPARQL is the way to ask for information – by sending a query through the SPARQL endpoint. RDF is the language that offers all possibilities to describe a given dataset with all of the necessary information, including any links to other datasets. Therefore RDF data and SPARQL endpoints provide a powerful tool to find and filter datasets and are crucial, fundamental parts of the semantic web’s architectural layers. On reegle, the SPARQL endpoint and the description of the structure of our RDF files are online on our clean energy open data portal.
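
To illustrate what “sending a query through the SPARQL endpoint” means in practice, here is a minimal sketch that only prepares such a request using Python's standard library. It follows the general SPARQL protocol convention of passing the query text in a `query` URL parameter. The endpoint URL is a placeholder rather than reegle's real address, and the query itself is just an assumed example that lists SKOS concepts with their English labels.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (assumption); replace with a real SPARQL endpoint URL.
ENDPOINT = "http://example.org/sparql"

# Example query (assumption): ten SKOS concepts with their English labels.
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label WHERE {
  ?concept a skos:Concept ;
           skos:prefLabel ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Prepare (but do not send) an HTTP GET request for a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialisation of SPARQL query results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    request = build_sparql_request(ENDPOINT, QUERY)
    print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would actually send it; that step is left out here because the endpoint above is not a real one.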

You also decided to build a SKOS-based domain thesaurus for clean energy, which now plays an important role in improving the search experience at reegle.
What experience have you gained so far from this effort? Which obstacles did you have to overcome?

The SKOS-based renewable energy thesaurus can be seen as the “heart” of reegle as it provides the basis for a lot of related services in reegle, including the refinement suggestions for search results, the auto-completion options and the glossary links between defined terms and their synonyms and related terms.
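
A rough idea of how a thesaurus can drive auto-completion and refinement suggestions is sketched below. This is not reegle's implementation: the tiny in-memory `THESAURUS`, its terms and the function names are assumptions made for illustration, with `alt` standing in for SKOS alternative labels (synonyms) and `related` for related concepts.

```python
# Hypothetical mini-thesaurus: preferred label -> synonyms and related terms.
THESAURUS = {
    "solar energy": {"alt": ["solar power"], "related": ["photovoltaics"]},
    "photovoltaics": {"alt": ["PV"], "related": ["solar energy"]},
    "wind energy": {"alt": ["wind power"], "related": []},
}


def autocomplete(prefix, thesaurus):
    """Return preferred labels whose preferred or alternative label starts with prefix."""
    wanted = prefix.lower()
    hits = []
    for preferred, entry in thesaurus.items():
        labels = [preferred] + entry["alt"]
        if any(label.lower().startswith(wanted) for label in labels):
            hits.append(preferred)
    return sorted(hits)


def refinements(term, thesaurus):
    """Suggest related concepts for a preferred label (empty list if unknown)."""
    entry = thesaurus.get(term)
    return list(entry["related"]) if entry else []


if __name__ == "__main__":
    print(autocomplete("pv", THESAURUS))
    print(refinements("solar energy", THESAURUS))
```

Because synonyms are matched as well, typing “pv” leads the user to the preferred term “photovoltaics”, which is the kind of behaviour a purely string-based completion would not offer.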

We decided to use SKOS because we think it is the best language for building a formal and controlled vocabulary for thesauri in a semantic web context, without adding too much complexity. Although it is a simple language, you really still need IT experts to use it to build a thesaurus – domain experts with additional IT skills (hard to find!).

So in our case, we decided to use a scalable and easy-to-use thesaurus server called “PoolParty”. Using this system drastically reduced the complexity, and allowed us to concentrate on the actual building of the thesaurus with our domain experts, and to spend less time on transferring the knowledge into data sets.

What are your future plans with reegle?

Currently we’re working on restructuring the site to better highlight our new added-value services such as the clean energy country profiles. We are also planning to further develop our thesaurus to include climate-compatible development terms, and we’ll soon release a WordPress plug-in to insert this thesaurus into clean energy blogs. One of the most exciting projects we are currently working on is the development of “dossier pages”, where we will provide relevant information on several topics mashed up on one page using semantic web technologies. This is part of the EU-funded SCMS (“semantic content management system”) project.