Andreas Blumauer

Automatic text analytics using DBpedia and PoolParty – A Live Demo

Let me show you the steps needed to build a high-quality text mining application that is ready to annotate and categorize any kind of text or document in nearly any domain. With our approach of thesaurus-based text mining, your documents can also be linked to the world of linked (open) data: enrich your documents with data from the LOD cloud!

Step 1. Generate a thesaurus by using a linked data source like DBpedia

As recently reported, SWC has developed a tool called SKOSsy which can be used to extract seed thesauri from DBpedia. In our example I will generate a knowledge model describing the domain of “digital photography”. This step took around 15 minutes.
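To give a rough idea of what pulling a seed vocabulary from DBpedia can look like, here is a minimal sketch. It is not how SKOSsy works internally (the post does not describe that). It only assumes that DBpedia models its category hierarchy with skos:broader and skos:prefLabel, and that the public endpoint at https://dbpedia.org/sparql accepts a `format` parameter alongside the standard `query` parameter. The function names and the chosen category are illustrative.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def seed_query(category_uri, limit=100):
    """Build a SPARQL query listing the direct sub-categories of a category."""
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?concept ?label WHERE {\n"
        f"  ?concept skos:broader <{category_uri}> ;\n"
        "           skos:prefLabel ?label .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Encode the query as a GET request URL (nothing is sent here)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


# Illustrative starting point for the "digital photography" domain.
query = seed_query("http://dbpedia.org/resource/Category:Digital_photography", limit=10)
url = request_url(query)
```

Fetching `url` with any HTTP client would return candidate concepts; repeating the query for each result walks further down the category tree.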

Step 2. Load the thesaurus into PoolParty and improve it to your needs

After the seed thesaurus has been loaded into PoolParty Thesaurus Manager, you have many possibilities to enhance the knowledge model further: add more categories, synonyms, relations etc. In this example I use the seed thesaurus without any further improvements. This step took approximately 2 minutes.

Step 3. Generate an automatic text extractor on top of your thesaurus

This step took a couple of seconds and produced a fast and reliable text mining application on top of PoolParty Extractor, ready to be used to enrich your documents with data from the LOD cloud.
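The basic idea behind thesaurus-based annotation can be sketched in a few lines. This is only an illustration of the principle, not the PoolParty Extractor implementation: it matches preferred and alternative labels against the text and returns the URI of the matching concept. The `Concept` class, the function names and the three sample concepts are assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    uri: str
    pref_label: str
    alt_labels: tuple = ()


def build_index(concepts):
    """Map every lower-cased label (preferred or alternative) to its concept."""
    index = {}
    for concept in concepts:
        for label in (concept.pref_label, *concept.alt_labels):
            index[label.lower()] = concept
    return index


def annotate(text, index):
    """Return (matched text, concept URI) pairs in order of appearance."""
    if not index:
        return []
    # Longest labels first, so "digital camera" wins over "camera".
    labels = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), index[m.group(0).lower()].uri) for m in pattern.finditer(text)]


# Illustrative mini-thesaurus for the "digital photography" domain.
concepts = [
    Concept("http://dbpedia.org/resource/Digital_camera", "Digital camera", ("digicam",)),
    Concept("http://dbpedia.org/resource/Camera", "Camera"),
    Concept("http://dbpedia.org/resource/Image_sensor", "Image sensor"),
]
index = build_index(concepts)
hits = annotate("A digital camera records light on an image sensor; any camera needs a lens.", index)
```

Because every hit carries a concept URI, the annotated document can be linked to whatever else is known about that concept in the LOD cloud. A production extractor would add stemming, disambiguation and scoring on top of this.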

You can try it out here: PPX Live-Demo

To try the extractor on your own, please take a look at the image above, which shows a proper configuration. You have to insert the following UUID in the form: d35d4ddb-adc3-4ea5-b027-deacac03e391

Since our example is all about ‘digital photography’, we recommend using text samples (or some fragments) like these to test the quality of PPX-based text analytics:

Let us know what you think about this straightforward approach and about the quality of the results. We believe that thesaurus-based text mining is in many cases an alternative to other approaches, especially if you want to enrich your content with information from the upcoming web of data.

Of course, we would be happy to generate other demos in your areas of interest! Just get in touch with us using our contact form.

Helmut Nagy

The ESA vocabulary site – Making Publishing and Reusing Vocabularies Easier

Looking back at the interview we conducted with Les Kneebone (project manager of the vocabulary projects at Education Services Australia) in November 2010, we can see that ESA was one of the early adopters of SKOS as a standard for thesaurus development. Les said then: “We had already identified SKOS as an important standard for ScOT so it was natural to select PoolParty as our new thesaurus management tool”. Around a year later, ESA's vocabulary site went online with PoolParty as its basis.

We asked Les to comment on his statement from last year, and he confirmed that SKOS continues to be central to the ESA vocabulary business model. It has also been important for ESA that PoolParty has been flexible enough to support continued publication of non-RDF formats, especially IMS VDEX.

In the course of this project it became more and more obvious that SKOS can be used not only as yet another format for publishing thesauri but also as a unified model for building thesauri in general. This approach made possible several improvements to ESA's vocabulary development model and maintenance process. All data is stored as RDF in a triple store, and SKOS and RDF are flexible formats that support interoperability and interchangeability of data. As a result, many manual transformations that had to be done before are no longer needed, and all other systems using the vocabularies are fed dynamically by PoolParty, which offers the data in the formats they need (see image below).

Changes in ESA’s vocabulary development model

Les states that while some manual processes still exist to support legacy systems, PoolParty ensures the integrity and richness of ESA data. Support and customizations for legacy systems can be achieved in the confidence that the linked-data capabilities are centrally managed and stored in the PoolParty triple store.
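The idea of keeping one SKOS model and deriving every output format from it can be illustrated with a small sketch. It is not ESA's or PoolParty's actual pipeline: the example.org URIs, the tuple-based triple representation and the two export functions are assumptions made for this example. Only the SKOS and RDF property URIs are real.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/vocab/"  # hypothetical vocabulary namespace

# One source of truth: (subject, predicate, object) triples.
# Objects are ("uri", value) or ("literal", value, language).
triples = [
    (BASE + "photography", RDF_TYPE, ("uri", SKOS + "Concept")),
    (BASE + "photography", SKOS + "prefLabel", ("literal", "Photography", "en")),
    (BASE + "digital-photography", RDF_TYPE, ("uri", SKOS + "Concept")),
    (BASE + "digital-photography", SKOS + "prefLabel", ("literal", "Digital photography", "en")),
    (BASE + "digital-photography", SKOS + "broader", ("uri", BASE + "photography")),
]


def to_ntriples(triples):
    """Serialize the model as N-Triples, one statement per line."""
    lines = []
    for subject, predicate, obj in triples:
        if obj[0] == "uri":
            rendered = f"<{obj[1]}>"
        else:
            escaped = obj[1].replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"@{obj[2]}'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


def to_term_list(triples):
    """Derive a flat (label, broader label) view for a non-RDF consumer."""
    labels = {s: o[1] for s, p, o in triples if p == SKOS + "prefLabel"}
    broader = {s: o[1] for s, p, o in triples if p == SKOS + "broader"}
    return [(labels[uri], labels.get(broader.get(uri))) for uri in sorted(labels, key=labels.get)]


ntriples = to_ntriples(triples)
term_list = to_term_list(triples)
```

Both outputs are computed from the same triples, so an edit to the model shows up in every export without a manual transformation step.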

From the publishing perspective, the previous vocabulary publishing site has been replaced by the PoolParty Linked Data Frontend (LD-Frontend), which has been customized especially for this project to offer more flexibility in the display and layout of the data. Similar to the frontend for the Austrian Geological Survey mentioned in a previous blog post, the LD-Frontend has been adapted to the ESA styleguide, and the display of the data in the HTML view of the frontend has been made more user-friendly (see screenshot below).

From ESA’s perspective, Les commented that for the vocabulary manager, edits to the frontend styles and templates are intuitive and can be tested in staging environments. He also stated that support is important for publishing, and that SWC was very responsive.

Example ESA linked data frontend

Of course we asked Les to give a preview of the next steps for ESA. He stated that they include language translation projects so that its vocabularies, especially Schools Online Thesaurus (ScOT), can be accessed by wider markets and by students of other languages. He also stated that PoolParty handles multi-lingual thesauri very well.

We here at SWC are glad to see PoolParty used in more and more applications and usage scenarios. We are looking forward to the next steps in this project and to seeing how the data offered by the ESA vocabulary site is used in other applications.

Thanks to Les Kneebone from ESA for his contribution to this blog post.

Andreas Blumauer

WordPress plugin to make use of linked data

The PoolParty team has recently published an improved version of its WordPress plugin, which enables linked data enrichment of blogs. For this, a SKOS-based vocabulary has to be uploaded or retrieved from a SPARQL endpoint. Users and developers benefit from

  • automatic annotation of all blog entries displayed as tooltips
  • a comfortable search facility with auto-complete over all concepts from the linked thesaurus including semantic search over the whole blog
  • an integrated thesaurus browser, plus
  • a corresponding linked data frontend including RDF/XML serialization of the underlying thesaurus + SPARQL endpoint
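As an illustration of the auto-complete feature in the list above, the following sketch does a prefix lookup over concept labels. It is not the plugin's code. The function names and sample labels are made up, and a real implementation would typically query the thesaurus or its SPARQL endpoint instead of an in-memory list.

```python
from bisect import bisect_left


def build_autocomplete(labels):
    """Sort labels case-insensitively, keeping the original spelling."""
    return sorted((label.lower(), label) for label in labels)


def suggest(entries, prefix, limit=5):
    """Return up to `limit` labels that start with `prefix` (case-insensitive)."""
    key = prefix.lower()
    # A 1-tuple sorts before any 2-tuple with the same first element,
    # so this finds the first entry whose lower-cased label is >= key.
    start = bisect_left(entries, (key,))
    suggestions = []
    for lowered, original in entries[start:]:
        if not lowered.startswith(key) or len(suggestions) == limit:
            break
        suggestions.append(original)
    return suggestions


entries = build_autocomplete(["Digital camera", "Digital photography", "Image sensor", "Aperture"])
matches = suggest(entries, "dig")
```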

All details about the new version 2.2.3 can be read here.

Andreas Blumauer

“Thesaurus-based search engines will become mainstream in the near future”

The results of the survey titled “Do controlled vocabularies matter?”, which was conducted by Semantic Web Company from May to June 2011, are now public. More than 150 participants from 27 countries paint a picture of current and future usage behaviour in the realm of controlled vocabularies.

Here are three of the most interesting outcomes of this questionnaire – the whole report can be found and downloaded on issuu:

Do you think enterprises and other organizations can significantly benefit from using Linked Data?

The answer is a clear YES. A subsequent question also reveals that organisations of all sizes hold about the same opinion on linked data. Only a few people think that linked data is a “niche thing”. Overall, more than 90% of the participants think that most or at least some organisations can benefit from using linked data.

Do you think that search engines which utilize thesauri to improve results will become mainstream?

The results of this question are striking: two thirds of the participants think that thesaurus-based search is already mainstream or will become so in the near future. Scepticism towards this development seems to be low.

How important is the usage of standards like SKOS for controlled vocabularies?

The results speak for themselves. The majority of the participants are convinced that standards like SKOS are important for their daily work. In August 2009 W3C announced the new SKOS standard. Now, nearly two years later, it looks like this standard has clearly arrived: 48.7% stated that standards like SKOS are very important and 29.1% voted for “relevant”, together 77.8%.

As an overall result of the survey, it can be stated that the Semantic Web community has done a great job of convincing the controlled vocabulary people of the benefits of SKOS and linked data. On the other hand, only 3-5% are aware of SPARQL as a valuable resource for building standard APIs around controlled vocabularies, which can lower costs when implementing such knowledge organization systems.
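Part of what makes SPARQL attractive as a standard API is that every endpoint can answer in the same W3C-defined JSON results format, so client code does not need to change from one vocabulary server to the next. The sketch below parses such a response into plain dictionaries. The sample response and its example.org URI are hypothetical; only the structure of the JSON (head/vars, results/bindings) follows the standard.

```python
import json


def parse_bindings(payload):
    """Turn a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({var: binding[var]["value"] for var in variables if var in binding})
    return rows


# Hypothetical response to a query selecting ?concept and ?label.
SAMPLE = """
{
  "head": {"vars": ["concept", "label"]},
  "results": {"bindings": [
    {"concept": {"type": "uri", "value": "http://example.org/vocab/aperture"},
     "label": {"type": "literal", "xml:lang": "en", "value": "Aperture"}}
  ]}
}
"""

rows = parse_bindings(SAMPLE)
```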

Many thanks to all participants of this survey!