Andreas Blumauer

State-of-the-art Text Mining: PoolParty Extractor 2.1.1 released

PoolParty Extractor (PPX) is part of the PoolParty product family and forms the basis for state-of-the-art text mining applications.

The idea behind PPX is to underpin automatic text mining algorithms with domain-specific knowledge from thesauri and linked data sources. This is a precondition for extracting meaning from unstructured information more precisely and with higher performance. PoolParty Extractor supports the following application scenarios:

  • automatic document categorisation
  • named entity extraction based on concepts from thesauri or other knowledge models
  • text analysis to improve semantic indexing
  • automatic transformation of unstructured text to an RDF-based linked data source
  • linking and enrichment of text with structured data from databases or XML-documents
  • extended indexing by using inflected forms of words and by splitting of compound words
  • generation and continuous improvement of thesauri by text corpus analysis
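
To make the second scenario more concrete, here is a minimal sketch of extraction based on thesaurus concepts. It is not the PPX implementation: the concept URIs, the labels and the function names are assumptions made up for illustration. Every label of a concept (preferred or alternative) is matched in the text, and the hits are counted per concept.

```python
import re

# Hypothetical mini-thesaurus: concept URI -> labels (preferred label first).
THESAURUS = {
    "http://example.org/concept/digital-camera": ["digital camera", "digicam"],
    "http://example.org/concept/image-sensor": ["image sensor", "CCD"],
}


def build_matcher(thesaurus):
    """Compile one regex that matches any label, trying longer labels first."""
    label_to_uri = {}
    for uri, labels in thesaurus.items():
        for label in labels:
            label_to_uri[label.lower()] = uri
    ordered = sorted(label_to_uri, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, label_to_uri


def extract_concepts(text, thesaurus):
    """Return a dict mapping concept URI -> number of label hits in the text."""
    pattern, label_to_uri = build_matcher(thesaurus)
    counts = {}
    for match in pattern.finditer(text):
        uri = label_to_uri[match.group(1).lower()]
        counts[uri] = counts.get(uri, 0) + 1
    return counts


hits = extract_concepts(
    "The digicam uses a CCD; a digital camera needs an image sensor.",
    THESAURUS,
)
for uri, count in hits.items():
    print(uri, count)
```

Because synonyms resolve to the same concept, "digicam" and "digital camera" are counted together, which is the main difference from plain keyword matching.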

PoolParty Extractor can be integrated smoothly with third-party systems like CMS, DMS, communication platforms, wikis etc. PPX is fully based on Java and provides an HTTP API. Integrations with SharePoint, Confluence, WordPress and others exist. Please tell us about your use case!

The latest release 2.1.1 of PPX further extends the capabilities to extract meaning from text with high precision and high performance:

  • Use of tf-idf (term frequency-inverse document frequency)
    • Creation of a text corpus for tf-idf
    • Use of tf-idf calculation during extraction
    • Corpus / thesaurus alignment
      • Show missing concepts
      • Show unused concepts
  • Use regular expressions to match specific patterns in texts
  • Use parts of the thesaurus as dynamic components for regular expressions
  • Calculate inflected forms (at the moment for German)
    • Word forms are added to the extraction model and used during extraction
    • List of inflected forms can be imported to thesaurus
  • Split compound words (at the moment for German)
PPX can be tested online as a web service. Please send us a short note describing your interest and we will provide further details.

Andreas Blumauer

PoolParty Thesaurus Manager 3.1 with auto-population feature was presented at SemTechBiz 2012 in San Francisco

A new PoolParty Thesaurus Manager (PPT) release was presented at this year's Semantic Technology & Business Conference in San Francisco: Version 3.1.0 is a major release offering many new functionalities and improvements, including auto-population of thesauri and linked data knowledge models.

The main new features are:

  • Autopopulation of Thesauri from DBpedia
    The SKOSsy functionality has been integrated into PPT. You can assign DBpedia categories to concepts and then autopopulate your thesaurus based on data from DBpedia.

  • Linked Data Based Synonym and Translation Service
    You can add labels (pref, alt, hidden) to the concepts of your thesaurus based on suggestions for synonyms and translations provided by data from DBpedia.
  • ADMS Description for Projects
    Metadata for PoolParty projects can now be published according to the Asset Description Metadata Schema (ADMS) developed by the joinup project of the European Union.

  • Windows Theme
    A new theme has been added based on the Windows GUI guidelines.
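
The three label types mentioned above (pref, alt, hidden) correspond to the SKOS properties skos:prefLabel, skos:altLabel and skos:hiddenLabel. As a rough illustration of what a concept enriched with such labels looks like, the sketch below serialises one concept as Turtle. The concept URI, the example labels and the helper function are assumptions for illustration and are not taken from PoolParty.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def concept_to_turtle(uri, pref, alt=(), hidden=(), lang="en"):
    """Serialise one SKOS concept and its labels as a Turtle snippet."""

    def literal(text):
        # Escape backslashes and double quotes for a Turtle string literal.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"@{lang}'

    statements = ["a skos:Concept", f"skos:prefLabel {literal(pref)}"]
    statements += [f"skos:altLabel {literal(label)}" for label in alt]
    statements += [f"skos:hiddenLabel {literal(label)}" for label in hidden]
    body = " ;\n    ".join(statements)
    return f"@prefix skos: <{SKOS}> .\n\n<{uri}>\n    {body} .\n"


turtle = concept_to_turtle(
    "http://example.org/concept/digital-camera",  # hypothetical URI
    pref="digital camera",
    alt=["digicam"],            # synonym, shown to users
    hidden=["digtal camera"],   # common misspelling, used for matching only
)
print(turtle)
```

A suggested synonym would typically end up as an alternative label, while a suggested translation would be added as a preferred label with a different language tag.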

Andreas Koller from Semantic Web Company: “SemTechBiz 2012 was a great success for us, we had a lot of talks with people from various industries at our booth. Demonstrating how building knowledge models on top of linked data sources can improve text mining for example, attracted wide interest. We enjoyed the whole conference, the location and the support from the organization team.”

To get an overview of all changes made in Release 3.1.0, take a look at the Release Notes.

Andreas Blumauer

Re-vamped PoolParty Knowledge Discoverer has been released

The PoolParty team has released a brand-new version of its Knowledge Discoverer to showcase the power of knowledge models in combination with linked data and text mining.

First of all: PoolParty Knowledge Discoverer is less a search engine in the ‘classical’ sense than a tool for collecting context information about documents which deal with domain-specific ‘things’ like persons, places, companies etc.

PoolParty Knowledge Discoverer

Don't expect to find a pizzeria in your neighbourhood with this kind of tool. If you want to build a similar tool, take a look at the PoolParty product family.

How does it work?

Provide some text by

  • typing your topic,
  • retrieving text from a URL or
  • entering a text directly into the editor

PoolParty will analyse your text.

Now you will get smart recommendations and context information:

  • related content from Wikipedia
  • categories related to the text
  • images related to the text
  • tags relevant for the text

For example: if you want a quick overview of an interesting article in ‘The Guardian’ about open data, just click on the bookmarklet, which can be installed to use the Knowledge Discoverer instantly, and you will be redirected to the following page.

The tool is a blueprint for many use cases in different sectors. Here are some examples:

  • find relations between open positions and applicants in your recruiting database
  • find those pieces of your technical documentation which are related to a concrete description of a customer's problem
  • save time when analysing new markets by collecting and linking information about your target market from different databases

Interested? Wanna see how this could work in established platforms like Confluence? Come to Atlassian Summit or SemTechBiz (both to be held in San Francisco) next week and visit us at the PoolParty booth!

Andreas Blumauer

Automatic text analytics using DBpedia and PoolParty – A Live Demo

Let me show you which steps have to be taken to generate a high-quality text mining application, ready to be used to annotate and to categorize any kind of text or documents covering nearly any domain. With our approach of thesaurus-based text mining, your documents can also be linked to the world of linked (open) data: enrich your documents with data from the LOD cloud!

Step 1. Generate a thesaurus by using a linked data source like DBpedia

As recently reported, SWC has developed a tool called SKOSsy which can be used to extract seed thesauri from DBpedia. In our example I will generate a knowledge model describing the domain of “digital photography”. This step took around 15 minutes.
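
To give an idea of the kind of data such a seed thesaurus can be built from, here is a small sketch that prepares a SPARQL query for the subcategories of a DBpedia category. It does not describe how SKOSsy works internally. The category name, the function names and the use of the public DBpedia endpoint are assumptions for illustration. The sketch only builds the request URL and does not send it.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint (assumed to be the data source here).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def subcategory_query(category, limit=100):
    """Build a SPARQL query for the direct subcategories of a category.

    DBpedia models the Wikipedia category tree with skos:broader,
    pointing from the narrower category to the broader one.
    """
    return f"""PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sub ?label WHERE {{
  ?sub skos:broader <http://dbpedia.org/resource/Category:{category}> ;
       rdfs:label ?label .
  FILTER (lang(?label) = "en")
}} LIMIT {limit}"""


def request_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


query = subcategory_query("Digital_photography")  # hypothetical seed category
print(query)
print(request_url(query))
```

Each returned subcategory could become a narrower concept of the seed concept, with its label as the preferred label. Repeating the query for the results walks further down the category tree.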

Step 2. Load the thesaurus into PoolParty and improve it to your needs

After the seed thesaurus has been loaded into PoolParty Thesaurus Manager you have many possibilities to enhance the knowledge model further: Add more categories, synonyms, relations etc. In this example I use the seed-thesaurus without any further improvements. This step took approximately 2 minutes.

Step 3. Generate an automatic text extractor on top of your thesaurus

This step took a couple of seconds and produced a fast and reliable text mining application on top of PoolParty Extractor, ready to be used to enrich your documents with data from the LOD cloud.

You can try it out here: PPX Live-Demo

To try the extractor on your own, please take a look at the image above, which shows a proper configuration. You have to insert the following UUID in the form: d35d4ddb-adc3-4ea5-b027-deacac03e391

Since our example is all about ‘digital photography’, we recommend using text samples (or some fragments) like these to test the quality of PPX-based text analytics:

Let us know what you think about this straightforward approach and about the quality of the results. We believe that thesaurus-based text mining is in many cases an alternative to other approaches, especially if you want to enrich your content with information from the upcoming web of data.

Of course we would be happy to generate other demos in the areas of your interest! Just get in contact with us by using our contact form.