Andreas Blumauer

Automatic text analytics using DBpedia and PoolParty – A Live Demo

Let me show you the steps needed to build a high-quality text mining application that can annotate and categorize texts and documents from nearly any domain. With our thesaurus-based approach to text mining, your documents can also be linked to the world of linked (open) data: enrich your documents with data from the LOD cloud!

Step 1. Generate a thesaurus by using a linked data source like DBpedia

As recently reported, SWC has developed a tool called SKOSsy which can be used to extract seed thesauri from DBpedia. In our example I will generate a knowledge model describing the domain of “digital photography”. This step took around 15 minutes.

Step 2. Load the thesaurus into PoolParty and adapt it to your needs

After the seed thesaurus has been loaded into PoolParty Thesaurus Manager, you have many options for enhancing the knowledge model further: add more categories, synonyms, relations etc. In this example I use the seed thesaurus without any further improvements. This step took approximately 2 minutes.

Step 3. Generate an automatic text extractor on top of your thesaurus

This step took a couple of seconds and produced a fast and reliable text mining application on top of PoolParty Extractor, ready to be used to enrich your documents with data from the LOD cloud.
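To make the idea of thesaurus-based extraction more tangible, here is a minimal sketch in Python. It is not how PoolParty Extractor works internally; it only shows the basic principle of matching thesaurus labels (preferred labels and synonyms) against a text and returning the concept URIs. The toy thesaurus, the function names and the choice of labels are my own assumptions for illustration.

```python
import re

# Toy thesaurus: concept URI -> labels (preferred label first, then synonyms).
# URIs and labels are illustrative only, not taken from the demo thesaurus.
THESAURUS = {
    "http://dbpedia.org/resource/Digital_single-lens_reflex_camera": [
        "digital single-lens reflex camera",
        "DSLR",
    ],
    "http://dbpedia.org/resource/Image_sensor": ["image sensor"],
    "http://dbpedia.org/resource/Shutter_speed": ["shutter speed", "exposure time"],
}


def build_index(thesaurus):
    """Map each lower-cased label to its concept URI."""
    index = {}
    for uri, labels in thesaurus.items():
        for label in labels:
            index[label.lower()] = uri
    return index


def annotate(text, index):
    """Return (start, end, matched text, concept URI) for each label found."""
    # Longer labels come first so multi-word labels win over shorter ones.
    labels = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), index[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Calling `annotate(some_text, build_index(THESAURUS))` yields one tuple per match. Because synonyms point to the same URI as the preferred label, "DSLR" and "digital single-lens reflex camera" end up annotated with the same concept.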

You can try it out here: PPX Live-Demo

To try the extractor on your own, take a look at the image above, which shows a proper configuration. You have to insert the following UUID in the form: d35d4ddb-adc3-4ea5-b027-deacac03e391

Since our example is all about ‘digital photography’, we recommend using text samples (or fragments of them) like these to test the quality of PPX-based text analytics:

Let us know what you think about this straightforward approach and about the quality of the results. We believe that thesaurus-based text mining is in many cases an alternative to other approaches, especially if you want to enrich your content with information from the emerging web of data.

Of course we would be happy to generate other demos in the areas of your interest! Just get in contact with us by using our contact form.

Andreas Blumauer

WordPress plugin to make use of linked data

The PoolParty team has recently published an improved version of its WordPress plugin, which enables linked data enrichment of blogs. For this, a SKOS-based vocabulary has to be uploaded or retrieved from a SPARQL endpoint. Users and developers benefit from

  • automatic annotation of all blog entries displayed as tooltips
  • a comfortable search facility with auto-complete over all concepts from the linked thesaurus including semantic search over the whole blog
  • an integrated thesaurus browser, plus
  • a corresponding linked data frontend including RDF/XML serialization of the underlying thesaurus + SPARQL endpoint

All details about the new version 2.2.3 can be read here.

Andreas Blumauer

Introducing SKOSsy – generate thesauri on the fly!

Imagine you could generate a thesaurus of quite good quality for nearly any knowledge domain you can think of! Sounds impossible? Reminds you of all the promises made by text mining software which generates “semantic nets” from scratch?

Let me introduce you to SKOSsy. I will explain what this web service can do for you:

SKOSsy generates SKOS-based thesauri in German or in English for a domain you are interested in. Not any domain, but nearly any: SKOSsy extracts data from DBpedia, so it can cover anything which is in DBpedia. Thus, SKOSsy works well whenever a first seed thesaurus should be generated for a certain organisation or project. If you load the automatically generated thesaurus into an editor like PoolParty Thesaurus Manager (PPT), you can start to enrich the knowledge model with additional concepts, relations and links to other LOD sources. In other words, you no longer have to start your thesaurus project from scratch.

Let me give you an example: Imagine you are working for a company which is an international plant builder and you would like to index several thousands of documents the “semantic way”. You have to walk through the following steps:

  1. Identify proper categories in Wikipedia/DBpedia which describe best what your business or your domain is all about. Those categories should contain pages / resources which are related to the documents you would like to index. For example: http://dbpedia.org/resource/Category:Metalworking or http://dbpedia.org/resource/Category:Industrial_automation
  2. After you have selected proper categories, SKOSsy will traverse DBpedia for you, collect all resources, their hierarchical and non-hierarchical relations, alternative labels, definitions and other properties, and put them together as a valid SKOS thesaurus; this step takes a couple of minutes. (Find the resulting vocabulary here)
  3. Load the resulting thesaurus into PPT, explore it, improve it and enrich it with additional facts.
  4. After you're done, you can generate a tailor-made text extractor by using PoolParty Extractor (PPX), which is the second component of the PoolParty product family.
  5. With PPX and its extraction model especially curated for your special use case you can extract named entities from your documents automatically and index your documents in a meaningful way.
  6. After a few seconds your semantic search engine is ready to be used. PoolParty Semantic Search (PPS), which is the third PoolParty component, offers facilities like categorized auto-complete, faceted search, content recommendation (similarity search) and smart search suggestions to ease your life as a knowledge worker.
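The traversal in step 2 can be sketched roughly as follows. This is not SKOSsy's actual algorithm, just a simplified illustration under my own assumptions: a small in-memory stand-in replaces DBpedia, all category and resource names are toy data, and every resource is simply placed below the category it belongs to via skos:broader.

```python
from collections import deque

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://dbpedia.org/resource/"

# Toy stand-in for DBpedia (illustrative names only):
# category -> sub-categories, and category -> member resources.
SUBCATEGORIES = {
    "Category:Metalworking": ["Category:Welding", "Category:Casting"],
    "Category:Welding": [],
    "Category:Casting": [],
}
MEMBERS = {
    "Category:Metalworking": ["Machining"],
    "Category:Welding": ["Arc_welding"],
    "Category:Casting": ["Die_casting"],
}


def collect(seed_categories, max_depth=2):
    """Breadth-first walk from the seeds; returns (child, parent) pairs."""
    broader = []
    seen = set(seed_categories)
    queue = deque((category, 0) for category in seed_categories)
    while queue:
        category, depth = queue.popleft()
        for resource in MEMBERS.get(category, []):
            broader.append((resource, category))
        if depth < max_depth:
            for sub in SUBCATEGORIES.get(category, []):
                broader.append((sub, category))
                if sub not in seen:  # guard against cycles in the category graph
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return broader


def label_of(name):
    """Derive a readable label from a resource name."""
    return name.split(":")[-1].replace("_", " ")


def to_skos_turtle(seed_categories, broader):
    """Serialize the collected hierarchy as simple Turtle statements."""
    lines = [f"@prefix skos: <{SKOS}> ."]
    for seed in seed_categories:
        lines.append(
            f'<{BASE}{seed}> a skos:Concept ; skos:prefLabel "{label_of(seed)}"@en .'
        )
    for child, parent in broader:
        lines.append(
            f'<{BASE}{child}> a skos:Concept ; '
            f'skos:prefLabel "{label_of(child)}"@en ; '
            f'skos:broader <{BASE}{parent}> .'
        )
    return "\n".join(lines)
```

The `max_depth` parameter is the interesting design choice here: category graphs tend to drift away from the original topic the deeper you go, so limiting the depth keeps the seed thesaurus focused on the selected domain.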

Over the last years we have constantly discussed the application of thesauri and other knowledge models to improve search. Many people understood straight away why thesaurus-based search is most often much better than search algorithms purely based on statistics. Of course, the main counterargument has always been that the costs of establishing a “good-enough” thesaurus, let alone a “high-quality” one, are too high.

With SKOSsy in place those kinds of arguments become weaker and weaker. To sum up,

  • SKOSsy makes heavy use of Linked Data sources, especially DBpedia
  • SKOSsy can generate SKOS thesauri for virtually any domain within a few minutes
  • Such thesauri can be improved, curated and extended to one's individual needs, but they usually serve as “good-enough” knowledge models for any semantic search application you like
  • SKOSsy-based semantic search usually outperforms search algorithms based purely on statistics, since the underlying thesauri contain high-quality information about relations, labels and disambiguation
  • SKOSsy works perfectly together with the PoolParty product family

If you are interested in the results produced by SKOSsy, just send us a short note about your domain or your project and we will invite you to become a beta tester or prepare a demo for you.

Thomas Schandl

PoolParty 3.0 and its all new Linked Data framework

The new major release of PoolParty boasts new Linked Data capabilities that further unlock the potential of the Semantic Web to improve your metadata management, to enhance your data with external knowledge and to ease data integration efforts within your organization and with your partners.

In PoolParty 3.0 we created a Linked Data interlinking editor, making it easier than ever to add your own lookup and interlinking services (even for non-RDF sources), and made the Linked Data publishing front-end fully customizable in design, layout and in terms of which parts of your content will be displayed.

But let’s start at the beginning:

Step 1 – Hook into the Linked Data Cloud!

In the era of the rapidly growing Linked Data Cloud your knowledge models don’t need to stay isolated from the outside world anymore. Simply use PoolParty’s new and improved lookup service to find matching resources from the Linked Open Data Cloud (e.g. from DBpedia).

Imagine having different data models that all refer to the same product categories and world regions. Once you have them represented in PoolParty you can use its lookup service to find matching resources from the Linked Data Cloud. In this way you will get globally used identifiers for your product categories and regions, usually in the form of a URI like http://dbpedia.org/resource/Berlin. This eases your internal data integration efforts, and it can aid the data exchange with partners or customers and enables hassle-free distributed management of knowledge models.
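To illustrate the idea behind such a lookup, here is a small Python sketch. It does not show PoolParty's lookup service itself; it only builds a plain SPARQL query that searches for resources by label (for example to be sent to a public SPARQL endpoint such as DBpedia's) and records a confirmed match as a skos:exactMatch statement. The function names and the example.org URI are hypothetical.

```python
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"


def label_lookup_query(label, lang="en", limit=5):
    """Build a SPARQL query for resources whose rdfs:label equals `label`."""
    # Escape backslashes and double quotes so the label is a valid literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?resource WHERE {\n"
        f'  ?resource rdfs:label "{escaped}"@{lang} .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def exact_match_triple(local_uri, external_uri):
    """Return an N-Triples line linking a local concept to an external URI."""
    return f"<{local_uri}> <{SKOS_EXACT_MATCH}> <{external_uri}> ."


# Hypothetical local concept linked to its globally used identifier.
link = exact_match_triple(
    "http://example.org/regions/berlin",
    "http://dbpedia.org/resource/Berlin",
)
```

Once such a link is stored, two data models that both point to http://dbpedia.org/resource/Berlin can be joined on that URI, regardless of how each of them labels the concept locally.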

Image 1: Lookup of concept ‘Austria’ and selection of properties and values to be imported

 

With PoolParty 3.0 we increased the number of included lookup services: DBpedia, Geonames, Wordnet, Umbel, Yago, Freebase, Sindice, dmoz and LCSH. BBC Wildlife, Enis and Gemet are available on request.

Step 2 – Pull in Semantic Data!

There is a vast amount of Linked Data out there just waiting to be leveraged for thesaurus creation and extension. To that end we had a close look at our interlinking module and decided to enhance it in a way that it becomes more of a Linked Data editor.

Once you have a base thesaurus in PoolParty and hooked a couple of your concepts into the cloud as described above, you can proceed to pull in the good stuff that comes with the Linked Data resources you have found.

Image 2: Imported Linked Data for concept ‘London’

 

As you can see in the image above, you can extend your local thesaurus with labels, definitions and all kinds of other information. In the case of countries, for example, this could be their population, GDP, spoken languages, famous people born there, newspaper articles related to the political situation, and so on.

Now PoolParty 3.0 takes this approach a couple of steps further. You can not only specify which of your local concepts corresponds to which Linked Data resource and grab all semantic information that comes with this resource, but now you are able to selectively pick out the data items you are interested in and even transform the predicates they originally came with. Just switch them to whatever custom properties you created or want to re-use from any ontology (see an example in Image 1).
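In essence, this selective import is a filter plus a predicate mapping. The following Python sketch shows the principle on a handful of triples; it is an illustration under my own assumptions, not PoolParty's implementation. The imported values, the example.org URIs and the custom property `hasCapital` are made up for the example.

```python
DBO = "http://dbpedia.org/ontology/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"

# Triples that came with the linked resource (toy data for illustration).
IMPORTED = [
    ("http://dbpedia.org/resource/Austria", DBO + "capital",
     "http://dbpedia.org/resource/Vienna"),
    ("http://dbpedia.org/resource/Austria", RDFS_LABEL, "Republic of Austria"),
    ("http://dbpedia.org/resource/Austria", DBO + "wikiPageID", "12345"),
]

# Only predicates listed here are imported; each one is switched to the
# property we want to use locally (hasCapital is a hypothetical custom property).
PREDICATE_MAP = {
    DBO + "capital": "http://example.org/schema/hasCapital",
    RDFS_LABEL: SKOS_ALT_LABEL,
}


def import_selected(triples, local_concept, predicate_map):
    """Keep only mapped predicates and attach the values to the local concept."""
    return [
        (local_concept, predicate_map[predicate], obj)
        for _subject, predicate, obj in triples
        if predicate in predicate_map
    ]


selected = import_selected(
    IMPORTED, "http://example.org/thesaurus/austria", PREDICATE_MAP
)
```

Everything that is not listed in the mapping (here the page ID) is simply left out, so the local thesaurus only grows by the data items you have explicitly chosen.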

In this way you can easily enrich your own knowledge models with external information – which in turn can be utilized for better content recommendation, easier data integration and improved search services.

Step 3 – Publish your Linked Data in Style

Previous PoolParty versions already offered the possibility to instantly publish your thesauri, taxonomies or vocabularies and display their concepts as HTML while additionally providing machine-readable RDF versions of them. This means that anyone using PoolParty's intuitive GUI can become a W3C standards-compliant Linked Data publisher without having to know anything about Semantic Web technicalities.
Of course you don’t need to publish all your valuable models, just choose the parts that safely can be shared with the public and keep everything else behind your firewall, available only to you and trusted partners!

In this new release of PoolParty the design of all pages on the Linked Data front-end is now under your full control. You can use your own style sheets and create views on your data with Velocity templates. It is even possible to develop project- and thesaurus-specific templates and layouts, so they can have an individual look and display different predicates and their values.

Take a look at PoolParty's standard linked data frontend!

The following images show a PoolParty default Linked Data page and a custom-made Linked Data page of a PoolParty concept that has some DBpedia info imported.

Image 3: PoolParty default Linked Data page

Image 4: Custom Linked Data page of ScOT thesaurus (courtesy of Educational Services Australia)

 

Step 4 – Unlock new Linked Data Sources

With PoolParty 3.0 you are in no way limited to DBpedia, Freebase, Geonames and the other lookup services that PoolParty provides out of the box: you can add your own non-Semantic Web data sources to the mix, thereby enabling you to boldly go where no Linked Data tool has gone before.

Maybe you have a product thesaurus and want to specify which products are related to patents that can be found with Google Patents?
Or you want to interlink concepts from a company taxonomy with related articles from the Guardian’s search service or any other newspaper that provides a search API?

All those sources are not available as RDF, so how can you re-use them easily as data sources for Linked Data style interlinking? For such cases PoolParty introduces the Unified Lookup API, which makes it easy to turn almost any third party Web API into a source for interlinking your concepts with third party resources as described above.
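The actual interface of the Unified Lookup API is not described in this post, so the following Python sketch only illustrates the general pattern behind it: a thin adapter that turns the JSON response of some search API into a uniform list of link candidates. The response layout, the field names and the example.org URLs are hypothetical.

```python
import json


def to_candidates(raw_json, items_key, uri_field, label_field):
    """Turn a JSON search response into a list of {uri, label} candidates."""
    payload = json.loads(raw_json)
    candidates = []
    for item in payload.get(items_key, []):
        uri = item.get(uri_field)
        label = item.get(label_field)
        if uri and label:  # skip incomplete results
            candidates.append({"uri": uri, "label": label})
    return candidates


# Hypothetical response of a newspaper search API.
SAMPLE_RESPONSE = json.dumps({
    "results": [
        {"webUrl": "http://example.org/articles/1", "webTitle": "Solar power on the rise"},
        {"webUrl": "http://example.org/articles/2", "webTitle": "New wind parks planned"},
        {"webTitle": "Result without a URL"},
    ]
})

candidates = to_candidates(SAMPLE_RESPONSE, "results", "webUrl", "webTitle")
```

Since every source is reduced to the same `uri`/`label` shape, the interlinking step does not need to know whether a candidate came from an RDF store or from a plain web API; only the three field names differ per source.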

This makes it possible to interlink your concepts with many kinds of data out there, be it New York Times articles, UN data, synonym services, abbreviations, press releases, legal information – or any web API important for your knowledge domain.

That being said, if you have suggestions for additional lookup services that you think are interesting, let us know!

To gain a first-hand impression of the new PoolParty, just apply for a demo account!