The Semantic Puzzle

Jana Herwig

Combining Closed and Open Data Classification Mechanisms in an Extended Thesaurus

In the next session, Rolf Sint gave us insights into his approach to combining closed and open data classification mechanisms, which is informed by the findings of his master’s thesis. Probably the most widely used retrieval method for digital content is full-text search; the indexing methods of Google and Yahoo, for instance, rely on it. For this method to work, the search terms must actually occur in the content, which leads to obvious problems with synonyms, ambiguities and the different vocabularies of different languages. The advantages are that full-text search is easy to use and that no maintenance is required, as this responsibility rests with the content providers.
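To make the synonym problem concrete, here is a minimal sketch of a naive full-text search. The function name and the two sample documents are made up for illustration and are not taken from Rolf’s thesis; real search engines are of course far more sophisticated.

```python
def full_text_search(query, documents):
    """Return the ids of documents whose text contains every query word."""
    words = query.lower().split()
    return [
        doc_id
        for doc_id, text in documents.items()
        if all(word in text.lower().split() for word in words)
    ]


# Hypothetical sample content: both documents are about the same thing,
# but only one of them uses the word the user types.
documents = {
    "d1": "The car was parked outside",
    "d2": "An automobile is a wheeled motor vehicle",
}

print(full_text_search("car", documents))         # ['d1']
print(full_text_search("automobile", documents))  # ['d2']
```

Neither query finds both documents, because the match is purely literal.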

On the other end of the spectrum, within open data classification mechanisms, we have social tagging. Tagging (in general) means that a user assigns labels to content items. The advantage here is that content is immediately classified; as such, tagging is an easy way to provide metadata for content, in particular as the user does not have to think about (arbitrary, system-dictated) structures. However, this leads to problems if singulars and plurals are used side by side, if synonyms are used, if spelling mistakes occur, and so on. With tags, the exact same spelling has to be used if items are to be assigned to the same group. But if done collectively (and that is what social tagging is about), the wisdom of crowds can improve the signal-to-noise ratio significantly; see the miracle of the tag cloud.
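The spelling problem can be illustrated with a small sketch that groups raw tags under a normalized form. The normalization rule used here (lowercase, trim, strip a trailing “s”) is a deliberately naive assumption for illustration; it handles simple plurals but does nothing about synonyms or genuine spelling mistakes, which is where a controlled vocabulary comes in.

```python
from collections import defaultdict


def normalize(tag):
    """Lowercase, trim, and strip a trailing plural 's' (a naive rule)."""
    t = tag.strip().lower()
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t


def group_tags(tags):
    """Group raw tags by their normalized form."""
    groups = defaultdict(list)
    for tag in tags:
        groups[normalize(tag)].append(tag)
    return dict(groups)


# Hypothetical tags as different users might type them.
raw_tags = ["Jaguar", "jaguars", "jaguar ", "cats", "cat"]
print(group_tags(raw_tags))
```

Without the normalization step, these five tags would form five separate groups.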

What Rolf proposed in his thesis was to combine the two approaches. In his design, he used an extended thesaurus as an instrument to achieve vocabulary control. It is an extended thesaurus because it is not simply built around a taxonomy, but expanded by tags that were assigned by users and integrated using a vocabulary management tool.
Extended Thesaurus
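One way to picture such an extended thesaurus as a data structure is sketched below. The class and field names (`Concept`, `ExtendedThesaurus`, `integrate_tag`) are hypothetical and chosen for this post; they are not the design from Rolf’s thesis, only an assumption about what “taxonomy plus integrated user tags” might look like in code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Concept:
    preferred: str                                    # controlled term
    synonyms: Set[str] = field(default_factory=set)   # maintained by editors
    broader: Optional[str] = None                     # parent in the taxonomy
    user_tags: Set[str] = field(default_factory=set)  # community terminology


class ExtendedThesaurus:
    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}

    def add_concept(self, preferred, synonyms=(), broader=None):
        self.concepts[preferred] = Concept(preferred, set(synonyms), broader)

    def integrate_tag(self, tag, preferred):
        """Attach a user-assigned tag to an existing concept."""
        self.concepts[preferred].user_tags.add(tag.strip().lower())

    def lookup(self, term) -> List[str]:
        """Return the preferred labels of all concepts that know this term."""
        t = term.strip().lower()
        return [
            c.preferred
            for c in self.concepts.values()
            if t == c.preferred or t in c.synonyms or t in c.user_tags
        ]


thesaurus = ExtendedThesaurus()
thesaurus.add_concept("big cat", broader="animal")
thesaurus.add_concept("jaguar", synonyms={"panthera onca"}, broader="big cat")
thesaurus.integrate_tag("Jag", "jaguar")  # a tag coined by the community

print(thesaurus.lookup("jag"))  # ['jaguar']
```

The point of the sketch is that a community tag resolves to the same concept as the controlled term, so both vocabularies lead to the same content.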

This extended thesaurus can be applied in multiple ways. During a tag event, for instance, the user can be assisted by questions like “Did you mean…” if a term is ambiguous:

Tag Assistant
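A “Did you mean…” prompt of this kind could be sketched as follows. The `SENSES` table and the `tag_assistant` function are invented for illustration; in the actual design, the senses would presumably come from the extended thesaurus rather than from a hard-coded dictionary.

```python
# Hypothetical sense inventory: term -> known meanings.
SENSES = {
    "jaguar": ["jaguar (animal)", "Jaguar (car)"],
    "thesaurus": ["thesaurus (vocabulary)"],
}


def tag_assistant(term, senses=SENSES):
    """Return a clarifying question if the term has more than one sense."""
    candidates = senses.get(term.strip().lower(), [])
    if len(candidates) > 1:
        return "Did you mean: " + " or ".join(candidates) + "?"
    return None  # unambiguous or unknown term: accept the tag as typed


print(tag_assistant("Jaguar"))
print(tag_assistant("thesaurus"))
```

Only the ambiguous term triggers a question; the user is not interrupted when the tag already maps to a single concept.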

Search can be improved, too: when a user submits a search query, related terms can be suggested by drawing on the thesaurus. For example, the term ‘jaguar’ would call up related terms, allowing the user to refine the query and clarify that he or she is looking for the predatory animal, not the car.

Screen Related Query
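The query refinement step might look roughly like this in code. The `RELATED` table, its entries and the two function names are assumptions made for this sketch; they only illustrate the idea of suggesting related terms per meaning and letting the user pick one.

```python
# Hypothetical excerpt of a thesaurus: query term -> sense -> related terms.
RELATED = {
    "jaguar": {
        "animal": ["big cat", "predator", "panthera onca"],
        "car": ["automobile", "sports car"],
    },
}


def suggest(query, related=RELATED):
    """Return related terms for a query, grouped by sense."""
    return related.get(query.strip().lower(), {})


def refine(query, sense, related=RELATED):
    """Expand the query with the related terms of the sense the user chose."""
    return [query] + suggest(query, related).get(sense, [])


print(suggest("jaguar"))
print(refine("jaguar", "animal"))
```

Once the user has chosen the animal sense, the expanded term list can be passed on to the search, so that content about big cats is found even when it never mentions the word ‘jaguar’.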

In the long term, using an extended thesaurus as a light-weight ontology can reduce the amount of work needed to maintain a vocabulary. What’s special in Rolf’s proposal is that the controlled vocabulary also contains the terminology of the community. The user is thus able to navigate within the communal information space and, as a result, problems with homonyms, synonyms and different languages would be reduced.

A paper in which Rolf and two of his colleagues explain this approach in more detail is currently being prepared for publication: Güntner, G., Sint, R., Westenthaler, R. (2008): “Ein Ansatz zur Unterstützung traditioneller Klassifikation durch Social Tagging”. Tagungsband des ExpertInnenworkshops “Social Tagging in der Wissensorganisation – Perspektiven und Potenziale”, 2008 (im Druck). Further details about the publication can be obtained from Rolf.
