Andreas Blumauer

Automatic text analytics using DBpedia and PoolParty – A Live Demo

Let me show you which steps are needed to generate a high-quality text mining application, ready to annotate and categorize any kind of text or document covering nearly any domain. With our approach of thesaurus-based text mining, your documents can also be linked to the world of linked (open) data: enrich your documents with data from the LOD cloud!

Step 1. Generate a thesaurus by using a linked data source like DBpedia

As recently reported, SWC has developed a tool called SKOSsy, which can be used to extract seed thesauri from DBpedia. In our example I will generate a knowledge model describing the domain of “digital photography”. This step took around 15 minutes.
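To give an idea of what extracting a seed thesaurus from DBpedia can involve, here is a minimal Python sketch that builds a SPARQL query for the sub-categories of a DBpedia category. It is not SKOSsy's actual method, which is not described here. The function name build_seed_query and its parameters are hypothetical, and the sketch assumes that DBpedia links categories with skos:broader and labels them with skos:prefLabel.

```python
def build_seed_query(category: str, max_depth: int = 2, limit: int = 500) -> str:
    """Build a SPARQL query that lists narrower categories of a DBpedia category.

    Assumption: DBpedia categories are linked with skos:broader and carry
    a skos:prefLabel. The function only builds the query text; sending it
    to a SPARQL endpoint is left to the caller.
    """
    root = "http://dbpedia.org/resource/Category:" + category.replace(" ", "_")
    # Bounded property path: skos:broader followed 1..max_depth times.
    path = "|".join("/".join(["skos:broader"] * depth)
                    for depth in range(1, max_depth + 1))
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT DISTINCT ?concept ?label WHERE {\n"
        f"  ?concept ({path}) <{root}> ;\n"
        "           skos:prefLabel ?label .\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(build_seed_query("Digital photography"))
```

The depth limit keeps the seed thesaurus focused: following the category hierarchy without a bound tends to drift into unrelated domains.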

Step 2. Load the thesaurus into PoolParty and improve it to your needs

After the seed thesaurus has been loaded into PoolParty Thesaurus Manager, you have many possibilities to enhance the knowledge model further: add more categories, synonyms, relations etc. In this example I use the seed thesaurus without any further improvements. This step took approximately 2 minutes.

Step 3. Generate an automatic text extractor on top of your thesaurus

This step took a couple of seconds and produced a fast and reliable text mining application on top of PoolParty Extractor, ready to be used to enrich your documents with data from the LOD cloud.
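The basic idea behind thesaurus-based extraction can be sketched in a few lines of Python. This is only an illustration of the principle, not how PoolParty Extractor is implemented: the concept URIs, the labels and the function names build_matcher and annotate are all made up for this example.

```python
import re

# Hypothetical mini-thesaurus: concept URI -> preferred label and synonyms.
THESAURUS = {
    "http://example.org/concept/dslr": ["digital single-lens reflex camera", "DSLR"],
    "http://example.org/concept/sensor": ["image sensor", "CCD", "CMOS sensor"],
    "http://example.org/concept/raw": ["raw image format", "RAW file"],
}


def build_matcher(thesaurus):
    """Compile one regex matching any label; longer labels are tried first."""
    label_to_uri = {label.lower(): uri
                    for uri, labels in thesaurus.items()
                    for label in labels}
    ordered = sorted(label_to_uri, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, label_to_uri


def annotate(text, thesaurus=THESAURUS):
    """Return a dict mapping concept URIs to their number of occurrences."""
    pattern, label_to_uri = build_matcher(thesaurus)
    counts = {}
    for match in pattern.finditer(text):
        uri = label_to_uri[match.group(1).lower()]
        counts[uri] = counts.get(uri, 0) + 1
    return counts


if __name__ == "__main__":
    print(annotate("Every DSLR has an image sensor. A dslr can save a RAW file."))
```

Because synonyms map to the same concept URI, different wordings of the same idea end up as one annotation, and the URI is what allows the document to be linked to other data sources.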

You can try it out here: PPX Live-Demo

To try the extractor on your own, please take a look at the image above, which shows a proper configuration. You have to insert the following UUID in the form: d35d4ddb-adc3-4ea5-b027-deacac03e391

Since our example is all about ‘digital photography’, we recommend using text samples (or some fragments) like these to test the quality of PPX-based text analytics:

Let us know what you think about this straightforward approach and about the quality of the results. We believe that thesaurus-based text mining is in many cases an alternative to other approaches, especially if you want to enrich your content with information from the upcoming web of data.

Of course we would be happy to generate other demos in the areas of your interest! Just get in contact with us by using our contact form.

Thomas Thurner

Vienna Semantic Web Meetup – the next season

Started in mid-2009, the Vienna Semantic Web Meetup (VSWM) is now entering its third year. Hosted by various partners, from media to culture and from corporate to academic, this regular gathering now counts over 200 members. As is a good tradition at VSWM, people from abroad drop by, giving input and new insights. The next season of VSWM will continue this mixture of international connection and informal meeting by putting two upcoming topics on the agenda.

Digital Identity on the Semantic Web
Thursday, April 7, 2011

While recent developments in ICT make it easier for companies and consumers to reach each other, they can also scatter your personal information more widely, making life easier for criminals. On the other hand, public institutions and government agencies are collecting personal data too, so personal data is processed without the consent (or even the knowledge) of the respective citizen. As we know, leaks in this field may expose sensitive personal data as well. The misuse of personal data can be restricted; this is a challenge for both the technological and the legal domain. This meetup takes a look at how Semantic Web technologies can live up to their responsibility in this emerging field.

  • Christof Tschohl (BIM)
    Ludwig Boltzmann Institute for Human Rights
  • Mischa Tuffield (Garlik)
    A Standards-based, Open and Privacy-aware Social Web (W3C)

>> read more, and register for free

Portals, Apps and Visualizations for Open Government Data
Wednesday, June 15, 2011

Picking up Keith Andrews’ suggestion, this is a MeetUp focusing on tools, services and projects dealing with visualization, app creation and portals/catalogs for Open [Government] Data. As this MeetUp is on the eve of Austria’s first Open Government Data Conference (OGD2011), we expect to meet experts and enthusiasts from Austria and abroad.

  • Keith Andrews (IICM)
    Institute for Information Processing and Computer Supported New Media at Graz University of Technology
  • Andreas Blumauer (SWC)
    Storing, searching, serving Open Government Data – getting an overview on the growing market for open data solutions

>> read more, and register for free



Thomas Schandl

Drupal and the Semantic Web – Interview with Stéphane Corlosquet

Stéphane Corlosquet has been the main driving force in incorporating Semantic Web capabilities into Drupal. In the recent release of Drupal 7, Semantic Web technologies became part of the core of this popular CMS, which is used to power at least 1% of all the world’s web sites.

Drupal is the leading CMS when it comes to implementing Semantic Web standards. What are the reasons for this, what makes Drupal such a good fit for Semantic Web technologies?

Historically, Drupal is known to be web standards compliant. It supported the RDF-based aggregation format known as RSS 1.0 as early as 2001, which was later upgraded to RSS 2.0. The Drupal community prides itself on valid HTML code, not only for the code generated by Drupal, but also by taking the extra step of automatically fixing faulty HTML entered by its users. Drupal has been using XHTML since its version 4.0 in 2002. The next logical step beyond XHTML was to add a layer of semantics with the RDFa standard, a W3C recommendation published in 2008.

There are definitely many reasons that contributed to the addition of RDFa into Drupal 7. The first comes from the Drupal project lead, Dries Buytaert, who is passionate about the web and open source. Secondly, the growing Drupal community is very web savvy and includes many experts from different backgrounds in accessibility, CSS, HTML, security etc. As a result, every release of Drupal includes many of the latest standards. The community meets twice a year at conferences (DrupalCons), and these events play a great role in hashing out what technologies or designs will be incorporated into the next version of Drupal. Because of the flexibility of its internal architecture, Drupal is able to keep up with the latest web standards. Content in Drupal is very structured, and Drupal provides site administrators with a user interface to build the site structure they want, using entity types, content types, fields and taxonomies for categorization. When it comes to other CMSs, Joomla!’s community appears to be more fragmented, with a core software that is not as extensible as Drupal, and WordPress is more of a blogging platform, so turning it into a full-blown CMS can be challenging. Both WordPress and Joomla! are in fact adapting the concept of Drupal’s Content Construction Kit (CCK) to their software, but they have not yet reached the same level of maturity as Drupal.

A common objection to the adoption of Semantic Web technologies is that the learning curve is steep and that it is too complicated for many web developers to get into it. How can Drupal 7 change that? Which features accessible for the average web site operator will it offer?

Semantic Web technologies don’t have to be complicated when applied to simple use cases! We purposely chose only a subset of Semantic Web technologies to integrate into the core of Drupal, keeping the learning curve for Drupal developers and users as low as possible. The main technology is RDFa, which includes the notions of vocabularies (a schema, or collection of attributes) as well as Compact URIs (CURIEs), which make the authoring of RDFa easier. In fact, some web developers might have come across these notions before when working with Dublin Core in meta tags, such as dc:title or dc:date.
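To illustrate the CURIE notion: a CURIE such as dc:title is shorthand for a full URI, formed by replacing the prefix with the namespace it stands for. The following Python sketch shows that expansion under simple assumptions. The function name expand_curie and the prefix table are illustrative only and are not part of Drupal; the namespace URIs are the published ones for Dublin Core elements and FOAF.

```python
# Prefix table: the prefix names are a convention, the namespace URIs
# are the ones published for Dublin Core elements and FOAF.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a CURIE like 'dc:title' into a full URI."""
    prefix, separator, reference = curie.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"Unknown or missing prefix in CURIE: {curie!r}")
    return prefixes[prefix] + reference


if __name__ == "__main__":
    print(expand_curie("dc:title"))
    print(expand_curie("dc:date"))
```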

Which benefits will web site owners get when they switch to a semantics enabled Drupal 7?

Google and Bing increasingly rely on machine-readable structured data from the websites that they crawl. The design of Drupal 7 embeds semantic meta data that makes machine-to-machine (M2M) search native for a Drupal 7 website. RDFa can add value by giving search engines more details such as the latitude and longitude of a venue for display on a map; or providing the ISO date format for localization and proper display in the search results for different countries.
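As a rough illustration of the venue and date example, the Python sketch below generates an HTML fragment with RDFa attributes. It is not the markup Drupal 7 actually emits. The function name event_rdfa, the choice of dc:title, dc:date and the W3C Basic Geo properties geo:lat and geo:long, and the sample event values are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone
from html import escape


def event_rdfa(name, start, lat, lon):
    """Render an HTML fragment carrying RDFa for a title, a date and a position.

    The human-readable text can be localized freely, while the 'content'
    attributes carry machine-readable values (ISO 8601 date, decimal degrees).
    """
    return (
        '<div xmlns:dc="http://purl.org/dc/elements/1.1/" '
        'xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">\n'
        f'  <span property="dc:title">{escape(name)}</span>\n'
        f'  <span property="dc:date" content="{start.isoformat()}">'
        f'{start.strftime("%d.%m.%Y")}</span>\n'
        f'  <span property="geo:lat" content="{lat}"></span>\n'
        f'  <span property="geo:long" content="{lon}"></span>\n'
        "</div>"
    )


if __name__ == "__main__":
    # Hypothetical sample values, chosen only for illustration.
    start = datetime(2011, 4, 7, 19, 0, tzinfo=timezone(timedelta(hours=2)))
    print(event_rdfa("Semantic Web Meetup", start, 48.2082, 16.3738))
```

The visible date can follow local conventions, while a crawler reads the unambiguous ISO value from the content attribute.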

What are your hopes regarding the development of other applications that either provide or consume data from D7 sites? Which improvements of standards, best practices or (lightweight) ontologies in the Semantic Web community would you like to see?

Services like Sig.ma are already able to collect semantic data from different sources and display it in new ways in the form of mash-ups. Eventually, these services that consume semantic data will not be just Drupal specific, as more platforms jump on the Semantic Web bandwagon. What I hope to see as improvements or best practices in the future are more well-maintained vocabularies. Many of the existing vocabularies are over-engineered, and some fail to dereference properly. There is also some work to be done in order to improve the tooling made available to web developers, as well as in introducing the simple concepts of Linked Data to web developers via easy-to-read documentation.

Thank you for this interview, Stéphane!

Tassilo Pellegrini

LOD2 Kick Off Meeting in Leipzig

From September 6 – 8, 2010 we kicked off the LOD2 project in Leipzig / Germany. LOD2 is funded by the European Commission within the 7th Framework Programme (Grant Agreement No. 257943) and consists of 10 partners from 7 countries. Its main aim is to integrate and syndicate linked data with large-scale, existing applications and showcase the benefits in three application scenarios: 1) Media & Publishing, 2) Enterprise Data Management and 3) Open Government Data. The resulting tools, methods and data sets have the potential to change the Web as we know it today. (You can download the project flyer here.)

The first day was dedicated to the general introduction of the project partners, which are Universität Leipzig (Germany), Centrum Wiskunde & Informatica (Netherlands), National University of Ireland in Galway (Ireland), Freie Universität Berlin (Germany), OpenLink Software (United Kingdom), Semantic Web Company (Austria), TenForce (Belgium), Exalead (France), Wolters Kluwer Deutschland (Germany) and Open Knowledge Foundation (United Kingdom). Below you see a picture of the kick-off team.

During the morning of the second day a first introduction to the technical components took place. The picture below shows an abstraction of the LOD2 high level architecture.

Orri Erling and Hugh Williams from OpenLink introduced Virtuoso, which will be used as one of the storage technologies in the LOD2 stack. The second knowledge store technology will be MonetDB introduced by Peter Boncz from CWI. Both systems will also be used as a kind of benchmark laboratory for hosting and querying linked data.

Christian Bizer from FU Berlin talked about Silk and D2R. In combination they will be used to discover relationships and similarities between entities within different linked data sources, which is generally called identity resolution.
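The core idea of identity resolution can be shown with a small Python sketch that compares the labels of entities from two data sets and proposes links for pairs that are similar enough. This is a simplified illustration and not how Silk works: the function names, the example URIs and the use of a plain string similarity from Python's standard library are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_links(source, target, threshold=0.9):
    """Return (source_uri, target_uri, score) for sufficiently similar labels.

    'source' and 'target' map entity URIs to labels. Comparing every pair
    is quadratic, which is fine for a sketch but not for large data sets.
    """
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            score = similarity(s_label, t_label)
            if score >= threshold:
                links.append((s_uri, t_uri, round(score, 3)))
    return links


if __name__ == "__main__":
    # Hypothetical entities from two different data sources.
    source = {"http://example.org/a/leipzig": "Leipzig"}
    target = {
        "http://example.org/b/1": "leipzig",
        "http://example.org/b/2": "Berlin",
    }
    print(find_links(source, target))
```

Each resulting pair could then be published as a link between the two data sets, for example with owl:sameAs, once the threshold has been tuned against known matches.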

Giovanni Tummarello from DERI introduced Sindice and Sig.ma with a focus on how to update, validate and reuse data that is available on the web, and on how to support the production of professional, collaboratively governed linked data, especially for enterprise use. Besides that, an important aspect will be how to handle the large amounts of generated data. According to Giovanni, scaling the infrastructure and using appropriate hardware will be central to bringing the Sindice index into enterprise stacks, e.g. as an approach for lightweight data consolidation.

Norman Heino from AKSW, University of Leipzig, introduced OntoWiki and Semantic Pingback. OntoWiki will be used at the interface layer for producing, annotating, browsing and querying linked data and presenting it to the end user in various GUIs. Semantic Pingback’s aim is to interlink the Web 2.0 with the Semantic Web by backwards-compatible RPCs (remote procedure calls). It detects new typed or untyped external links, manages the GET and POST commands and takes care of server autodiscovery.

Andreas Blumauer from Semantic Web Company demonstrated PoolParty as a smart editor for metadata in enterprise stacks. Like OntoWiki, PoolParty also addresses the interface level of LOD2, especially when it comes to generating, editing and linking metadata to documents, primarily based on SKOS. PoolParty deliberately uses thesauri as a mapping layer to discover similarities between documents, generate tag recommendations for their annotation and publish the vocabularies used as Linked Data.

In the afternoon we continued with individual breakout sessions to discuss work package interdependencies and to start profiling the use cases and requirements engineering in more detail.

The third day started with an introduction by Stefano Bertolo, the scientific project officer responsible for the LOD2 project on the EC side, who pointed out that LOD2 is an important project for the European Web of Data and that the EC is, among other things, especially interested in the Open Government Data use case of LOD2.

After this introduction, the three use cases were presented: A) Jonathan Gray (OKFN) spoke about the Open Gov Data use case, followed by B) Amar-Djalil MEZAOUR (Exalead) on the Linked Business Data use case and C) Christian Dirschl (Wolters Kluwer) on the use case of LOD in the publishing & media industry.

Central to the success of LOD2 will be a smart handling of all the integration issues which will come up in the course of the project. Here TenForce, an integration specialist from Belgium, will have the lead. CEO Bastiaan Deblieck gave a detailed outlook on the methodologies and presented a comprehensive overview of how the integration issues will be approached from a SCRUM perspective.

After a presentation about LOD2 project dissemination, training and community building activities by Martin Kaltenböck (Semantic Web Company), there were several discussions until the successful kick-off meeting was closed by project lead Sören Auer (Universität Leipzig) at 4:00 pm on 8 September 2010.

Up-to-date news can be found on the LOD2 project website as well as on the LOD2project Twitter stream (and on Twitter using #lod2)…

Stay tuned!