Andreas Blumauer

“Thesaurus-based search engines will become mainstream in the near future”

The results of the survey “Do controlled vocabularies matter?”, conducted by Semantic Web Company from May until June 2011, are now public. Over 150 participants from 27 countries paint a picture of the current and future use of controlled vocabularies.

Here are three of the most interesting outcomes of this questionnaire – the whole report can be found and downloaded on issuu:

Do you think enterprises and other organizations can significantly benefit from using Linked Data?

The answer is a clear YES. A subsequent question also reveals that organisations of all sizes hold about the same opinion on linked data. Only a few people think that linked data is a “niche thing”. Overall, more than 90% of the participants think that most or at least some organisations can benefit from using linked data.

Do you think that search engines which utilize thesauri to improve results will become mainstream?

The results of this question are amazing: two thirds of the participants think that thesaurus-based search is already mainstream or will become so in the near future. Scepticism towards this development seems to be low, since a clear majority expects thesaurus-based search engines to become mainstream soon.

 

How important is the usage of standards like SKOS for controlled vocabularies?

The results speak for themselves. The majority of the participants are convinced that standards like SKOS are important for their daily work. In August 2009 W3C announced the new SKOS standard; now, nearly two years later, it looks as if the standard has truly arrived. 48.7% stated that standards like SKOS are “very important” and 29.1% chose “relevant”.

 

As an overall result of the survey, it can be stated that the Semantic Web community has done a great job of convincing controlled vocabulary practitioners of the benefits of SKOS and linked data. On the other hand, only 3-5% are aware of SPARQL as a valuable resource for building standard APIs around controlled vocabularies, which would lower the cost of implementing such knowledge organization systems.

Many thanks to all participants of this survey!

Thomas Schandl

Which kind of controlled vocabularies matter?

Looking at the intermediate results of the Controlled Vocabularies Survey, one interesting finding concerns which types of knowledge models are currently the best fit for actual use in applications.

So far, 143 people whose organizations already make use of controlled vocabularies have answered the question “Which kind of controlled vocabulary do you use or plan to use in your applications?”.
The results so far show that lightweight models like taxonomies and thesauri are somewhat preferred over ontologies:

Taxonomies are the favorite, as 73.6% of participants use or plan to use them, followed by thesauri (62%) and ontologies (61.2%), while simple glossaries lag considerably behind with a usage of 31.4%.

This survey will close in about a week, so please take this chance to make your opinions on this topic count! You can find the questions here; answering them takes 5-10 minutes.

All participants will gain access to a report with the results within the following month. The most interesting results will be made public on this blog.

Andreas Blumauer

Controlled vocabularies: “Data integration is king”

Just recently, a survey about “Controlled vocabularies” and their significance for enterprise information management was started. So far, 143 participants have responded and completed the survey at least partially. To give a first example of the findings, I would like to take a closer look at the question: What are the main application areas of controlled vocabularies from your perspective?

The intermediate result is a bit surprising: neither “Semantic Search” nor “Support of multilingual applications” was considered the most important application. Instead, it turned out that “Data Integration” is king:



The bar graph shows the weighted value of each application candidate (1.0 would mean 100% agreement that this is an important application area of controlled vocabularies). Regarding the top candidate “data integration”:

  • 57.4% said “very important”
  • 29.8% “relevant”
  • 7.4% “somewhat relevant”
  • 2.1% “not relevant”
  • 3.2% “Don't know”
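
To illustrate how a weighted value on a 0.0 to 1.0 scale can be derived from such response shares, here is a minimal sketch in Python. The shares are the ones listed above; the weights per answer option are an assumption made purely for illustration, since the weighting scheme actually used for the bar graph is not stated here. The result therefore need not match the bar in the chart.

```python
# Response shares for "data integration" as listed above (in percent).
shares = {
    "very important": 57.4,
    "relevant": 29.8,
    "somewhat relevant": 7.4,
    "not relevant": 2.1,
    "don't know": 3.2,
}

# ASSUMPTION: illustrative weights only; the survey's real weighting is not given.
weights = {
    "very important": 1.0,
    "relevant": 2.0 / 3.0,
    "somewhat relevant": 1.0 / 3.0,
    "not relevant": 0.0,
    "don't know": 0.0,
}


def weighted_value(shares, weights):
    """Return the weighted share on a 0.0-1.0 scale.

    A value of 1.0 is reached only if 100% of the answers fall into
    an option with weight 1.0.
    """
    return sum(shares[answer] * weights[answer] for answer in shares) / 100.0


score = weighted_value(shares, weights)
print(f"weighted value for data integration: {score:.2f}")
```

Under these assumed weights, “Don't know” answers count as zero; excluding them and renormalizing would be an equally plausible choice and would give a slightly higher value.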

If you don't think this should be the final result, please help us get a better overview of what's going on in the controlled vocabulary community. The survey is open until May 18th, 2011, and all participants will gain access to a report with the results within the following month. The most interesting results will be made public on this blog.

Jana Herwig

Combining Closed and Open Data Classification Mechanisms in an Extended Thesaurus

In the next session, Rolf Sint gave us insights into his approach to combining closed and open data classification mechanisms, which is informed by the findings of his master’s thesis. Probably the most widely used retrieval method for digital content is full-text search; Google and Yahoo’s indexing methods, for instance, rely on full-text search. For this method to work, the search words must be contained within the content, which leads to obvious problems with synonyms, ambiguities or the different lexical inventories of different languages. The advantages are that full-text search is easy to use and that no maintenance is required, as this responsibility rests with the content providers.

On the other end of the spectrum, within open data classification mechanisms, we have social tagging. Tagging (in general) means that a user assigns labels to content items. The advantage here is that content is immediately classified; as such, tagging is an easy way to provide metadata for content, in particular as the user does not have to think about (arbitrary, system-dictated) structures. However, this leads to problems if singulars and plurals are used simultaneously, if synonyms are used, if spelling mistakes occur, and so on. With tags, the exact same spelling has to be used if items are to be assigned to the same group. But if done collectively (and that is what social tagging is about), the wisdom of crowds can improve the signal-to-noise ratio significantly: see the miracle of the tag cloud.
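
The exact-spelling problem can be made concrete with a small Python sketch. It is not taken from Rolf's thesis; the function names and the normalization rules are hypothetical and deliberately naive. It only shows how tags that differ in case, whitespace or a plural ending could be folded into one group, and it does nothing about synonyms or spelling mistakes.

```python
from collections import defaultdict


def normalize_tag(tag: str) -> str:
    """Naive normalization (illustrative assumption, not a real stemmer):
    trim and collapse whitespace, lowercase, strip a simple plural ending."""
    t = " ".join(tag.split()).lower()
    if t.endswith("ies") and len(t) > 4:
        return t[:-3] + "y"          # "ontologies" -> "ontology"
    if t.endswith("s") and not t.endswith(("ss", "us")) and len(t) > 3:
        return t[:-1]                # "tags" -> "tag"
    return t


def group_tags(tags):
    """Group raw user tags by their normalized form."""
    groups = defaultdict(list)
    for tag in tags:
        groups[normalize_tag(tag)].append(tag)
    return dict(groups)


raw_tags = ["Tag", "tags", " TAGS ", "ontology", "Ontologies", "thesaurus"]
groups = group_tags(raw_tags)
for key, members in groups.items():
    print(key, "<-", members)
```

Without such a step, each of the six raw tags above would form its own group; with it, they collapse into three.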

What Rolf proposed in his thesis was to combine the two approaches. In his design, he used an extended thesaurus as an instrument to achieve vocabulary control – we’re looking at an extended thesaurus here, because it’s not simply built around a taxonomy, but expanded by tags that were assigned by users and integrated using a vocabulary management tool.
Extended Thesaurus
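
As a rough illustration of the idea, the following Python sketch models an extended thesaurus as controlled concepts that carry both editor-maintained synonyms and integrated user tags, and uses it to expand a full-text query. All class, method and field names are hypothetical, and the substring search stands in for a real index; this is a sketch under those assumptions, not Rolf's actual design or tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    pref_label: str                                # controlled, preferred term
    alt_labels: set = field(default_factory=set)   # editor-maintained synonyms
    user_tags: set = field(default_factory=set)    # tags integrated from users
    broader: Optional[str] = None                  # parent in the taxonomy

    def all_labels(self):
        return {self.pref_label} | self.alt_labels | self.user_tags


class ExtendedThesaurus:
    """Hypothetical container: a taxonomy extended by user tags."""

    def __init__(self):
        self.concepts = {}

    def add_concept(self, pref_label, alt_labels=(), broader=None):
        self.concepts[pref_label] = Concept(
            pref_label, {a.lower() for a in alt_labels}, set(), broader
        )

    def integrate_tag(self, tag, pref_label):
        """Attach a user tag to an existing controlled concept."""
        self.concepts[pref_label].user_tags.add(tag.lower())

    def expand(self, term):
        """Return every label of the concept that matches the term."""
        term = term.lower()
        for concept in self.concepts.values():
            if term in concept.all_labels():
                return concept.all_labels()
        return {term}  # unknown term: fall back to plain full-text search


def search(docs, thesaurus, term):
    """Substring search over all labels the term expands to."""
    terms = thesaurus.expand(term)
    return [d for d in docs if any(t in d.lower() for t in terms)]


thesaurus = ExtendedThesaurus()
thesaurus.add_concept("car", alt_labels={"automobile"}, broader="vehicle")
thesaurus.integrate_tag("motorcar", "car")

docs = ["A motorcar for sale", "Bicycle repair tips"]
hits = search(docs, thesaurus, "automobile")
print(hits)
```

A plain full-text search for “automobile” would miss the first document because the word does not occur in it; the expansion finds it through the user tag that was integrated into the controlled concept.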

This extended thesaurus can be applied in multiple ways.