Andreas Blumauer

Automatic text analytics using DBpedia and PoolParty – A Live Demo

Let me show you the steps needed to build a high-quality text mining application that can annotate and categorize texts and documents from nearly any domain. With our approach of thesaurus-based text mining, your documents can also be linked to the world of linked (open) data: enrich your documents with data from the LOD cloud!

Step 1. Generate a thesaurus by using a linked data source like DBpedia

As recently reported, SWC has developed a tool called SKOSsy, which can be used to extract seed thesauri from DBpedia. In our example I will generate a knowledge model describing the domain of “digital photography”. This step took around 15 minutes.

Step 2. Load the thesaurus into PoolParty and improve it to your needs

After the seed thesaurus has been loaded into PoolParty Thesaurus Manager, you have many ways to enhance the knowledge model further: add more categories, synonyms, relations, etc. In this example I use the seed thesaurus without any further improvements. This step took approximately 2 minutes.

Step 3. Generate an automatic text extractor on top of your thesaurus

This step took a couple of seconds and produced a fast and reliable text mining application on top of PoolParty Extractor, ready to be used to enrich your documents with data from the LOD cloud.
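To make the idea of thesaurus-based extraction more concrete, here is a minimal sketch in Python. It is not the PoolParty Extractor API or its algorithm; it only illustrates the general technique of matching thesaurus labels (preferred labels and synonyms) in a text and returning the linked concept URIs. The tiny thesaurus, its synonyms, and the function names `build_index` and `annotate` are assumptions made up for this illustration.

```python
import re

# Hypothetical seed thesaurus for illustration only:
# preferred label -> (concept URI, list of synonyms).
THESAURUS = {
    "Digital camera": ("http://dbpedia.org/resource/Digital_camera", ["digicam"]),
    "Image sensor": ("http://dbpedia.org/resource/Image_sensor", ["imaging sensor"]),
    "Raw image format": ("http://dbpedia.org/resource/Raw_image_format", ["RAW file"]),
}


def build_index(thesaurus):
    """Map every lower-cased label (preferred or synonym) to its concept."""
    index = {}
    for pref_label, (uri, synonyms) in thesaurus.items():
        for label in [pref_label, *synonyms]:
            index[label.lower()] = (pref_label, uri)
    return index


def annotate(text, thesaurus=THESAURUS):
    """Return matched concepts with their URIs, most frequent first."""
    index = build_index(thesaurus)
    # Try longer labels first so multi-word labels win over shorter ones.
    labels = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    counts = {}
    for match in pattern.finditer(text):
        pref_label, uri = index[match.group(1).lower()]
        entry = counts.setdefault(uri, {"label": pref_label, "uri": uri, "count": 0})
        entry["count"] += 1
    # Sort by descending frequency, then alphabetically by preferred label.
    return sorted(counts.values(), key=lambda e: (-e["count"], e["label"]))


if __name__ == "__main__":
    sample = "My digicam has a large image sensor. The digital camera saves every RAW file."
    for concept in annotate(sample):
        print(concept["count"], concept["label"], concept["uri"])
```

This sketch does plain case-insensitive label matching only. It does no stemming, so a plural such as "digital cameras" would not match, and it does no disambiguation or relevance scoring beyond counting occurrences.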

You can try it out here: PPX Live-Demo

To try the extractor on your own, please take a look at the image above, which shows a proper configuration. You have to insert the following UUID in the form: d35d4ddb-adc3-4ea5-b027-deacac03e391

Since our example is all about ‘digital photography’, we recommend using text samples (or fragments of them) like these to test the quality of PPX-based text analytics:

Let us know what you think about this straightforward approach and about the quality of the results. We believe that thesaurus-based text mining is in many cases an alternative to other approaches, especially if you want to enrich your content with information from the upcoming web of data.

Of course we would be happy to generate other demos in the areas of your interest! Just get in contact with us by using our contact form.

Tassilo Pellegrini

I-SEMANTICS 2011: Best Paper Award & Triplification Challenge Winners

This year the I-SEMANTICS conference gave away prizes for the best scientific paper and the most promising triplifications.

The best paper award went to Pablo N. Mendes, Max Jakob, Andrés García-Silva and Christian Bizer for their contribution DBpedia Spotlight: Shedding Light on the Web of Documents.

Abstract: The paper impressively shows how Linked Open Data can be utilized as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, the authors developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence. The authors compare their approach with the state of the art in disambiguation, and evaluate their results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of the system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.

For the 4th time, I-SEMANTICS hosted the Triplification Challenge, an event aimed at stimulating the availability of large quantities of RDF data and showcasing practical applications built on top of them. The Challenge consisted of a general “open data track” and a dedicated “open government data track”, with one winner selected for each. The prize money of 1000 Euro each was sponsored by Wolters Kluwer Germany.

The “open data track” award went to Daniel Garijo, Boris Villazón and Oscar Corcho for their contribution A Provenance-Aware Linked Data Application for Trip Management and Organization.

Abstract: The authors present El Viajero, an application for exploiting, managing and organizing Linked Data in the domain of news and blogs about travelling. El Viajero makes use of several heterogeneous datasets to help users to plan future trips, and relies on the Open Provenance Model for modeling the provenance information of the resources.

The “open government data track” award went to John Erickson, Yongmei Shi, Li Ding, Eric Rozell, Jin Zheng and Jim Hendler for their contribution TWC International Open Government Dataset Catalog.

Abstract: The TWC International Open Government Dataset Catalog (IOGDC) integrates a diverse selection of more than 70 government dataset catalogs from around the world. IOGDC demonstrates a practical dataset catalog metadata model for integrating diverse dataset catalogs collected from the real world and linking those catalogs into the Linked Data Cloud. IOGDC’s faceted browsing and search interface provides a scalable and reconfigurable solution for finding and browsing open government datasets, which also offers a compelling demonstration of the value of a common metadata model for open government dataset catalogs. We believe that the vocabulary choices demonstrated by IOGDC highlight the potential for useful Linked Data applications to be created from open government catalogs and will encourage the adoption of such a standard worldwide.

All papers are available in the ACM Digital Library.

We thank all participants for their contributions and wish the winners all the best for their future work!

 

Thomas Schandl

Which kind of controlled vocabularies matter?

The intermediate results of the Controlled Vocabularies Survey contain an interesting finding on the question of which types of knowledge models are currently the best fit for actual use in applications.

So far, 143 people whose organizations already make use of controlled vocabularies have answered the question “Which kind of controlled vocabulary do you use or plan to use in your applications?”.
The results so far show that lightweight models like taxonomies and thesauri are somewhat preferred over ontologies:

Taxonomies are the favorite, as 73.6% of participants use or plan to use them, followed by thesauri (62%) and ontologies (61.2%), while simple glossaries lag considerably behind with a usage of 31.4%.

This survey will close in about a week, so please take this chance to make your opinions on this topic count! You can find the questions here; it will take 5-10 minutes to answer them.

All participants will gain access to a report with the results within the following month. The most interesting results will be made public on this blog.

Pascal Hitzler

Semantic Web and Emerging Trends in Scholarly Publishing

In my capacity as one of the Editors-in-Chief of the Semantic Web journal (the other one is Krzysztof Janowicz; the journal is published by IOS Press), I was recently invited to talk about the journal at Allen Press’ seminar Emerging Trends in Scholarly Publishing. This seminar is an annual event which draws decision makers from the scholarly publishing industry to hear about and discuss recent developments and hot topics related to their profession. This year’s event had a session on “Semantic Enrichment”, and one on “Rethinking the Structure of Peer Review.” All presentations, including videos, are available from the Allen Press website.

The invited speaker of the “Semantic Enrichment” session was Pam Harley, Vice President, Product & Market Development of Semedica, a division of Silverchair. Pam gave a high-level account of the possibilities and added value that come with Semantic Enrichment, in a way suitable for a non-technical audience. I personally benefited in particular from the large variety of reasons for adopting Semantic Technologies in publishing which she presented and discussed in her talk (see also her slides).

My presentation (see also the slides) about the Semantic Web journal was part of the “Rethinking the Structure of Peer Review” session, and was focused on the open and transparent review process which we have adopted for the journal. After the presentation, throughout the event, I received ample feedback and remarks which in particular commended us for setting up a realistic improvement of the review process while avoiding radical changes which are likely to meet too much resistance from researchers. I certainly agree with this assessment. The presentation also contains a bit of information on how the journal is doing (in short: it’s doing great).

The seminar was a very enjoyable experience. In particular, it was enlightening to learn about publishers’ perspectives on scientific publishing, reviewing processes, and emerging revenue models. It was also nice to see that Semantic Web as a technology has a natural place in these discussions and is seeing more and more adoption in practice.

If you’re curious to learn more, have a look at the videos of the presentations.
