Thomas Schandl

Linked data based thesaurus management in collaborative settings

The creation and management of controlled vocabularies in companies often takes place in a distributed manner. Different departments in different branch offices often prefer to create their own vocabularies rather than contribute to one large central knowledge model.

How to model divergent views on one concept?

Such a central model is not only much harder to manage; there is also the general problem that different departments such as marketing, quality assurance and R&D will have divergent views on the model and its concepts. These different perspectives on one and the same concept are hard to unify in a single model.

Think of a company that sells mobile phones and wants to create a model of its line of products. It wants to utilize this model in the context of its online shop as well as in the context of its user support forum. While the structure of the model (i.e. the relationships between the products) might be very similar or the same in both contexts, there will be differences in which properties of the products are actually relevant in the respective contexts.

In the model of the marketing department there might be a concept for a “Phantastax StamiMaxx” cell phone with a definition “The StamiMaxx has a powerful battery and is great for professionals who travel a lot”. They might relate it to manufacturer “ACME Corporation” and to several concepts representing different features like “Android OS”, “Multi-touch touchscreen”, etc.
The very same phone has different properties that are interesting from the Quality Assurance department’s perspective. They might call it by a more specific name like “Phantastax i3000 StamiMaxx S”, have a different definition for it like “3G cell phone implementing the new WTF3000 protocol, …” and relate it to concepts representing known problems and their solutions.

Now they face the task of integrating these different models, as it is not desirable to use a bunch of isolated models within one company.

Support of collaborative work on distributed models

To support this kind of collaborative work on distributed knowledge models, we would like to link the concepts of the models, just as we link documents on the World Wide Web. Fortunately, the Simple Knowledge Organisation System (SKOS) offers mapping properties that can be used to define relationships between concepts from different knowledge models.

For example, when we want to say that the concept “Phantastax StamiMaxx” in the product line thesaurus refers to the same real world entity as the concept “Phantastax i3000 StamiMaxx S” in the Quality Assurance thesaurus, we can use skos:exactMatch to express that. If we want to express that the concepts are merely similar, skos:closeMatch could be used.

The other SKOS mapping properties express a hierarchical (narrowMatch, broadMatch) or an associative (relatedMatch) mapping relation between concepts from different concept schemes. With those we can say that my Samsung Galaxy concept has a skos:broadMatch “Smartphone” in the product line vocabulary and a skos:relatedMatch “ACME Corporation” in a controlled vocabulary about Tech companies.
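The mapping relations described above can be written down as plain RDF triples. The following is a minimal sketch rather than PoolParty's actual output: the SKOS namespace is the real W3C one, but the three example.org namespaces and the concept identifiers are hypothetical placeholders invented for the phone example.

```python
# The SKOS namespace is the real W3C one; the three thesaurus namespaces
# below are hypothetical placeholders for this example.
SKOS = "http://www.w3.org/2004/02/skos/core#"
PRODUCTS = "http://example.org/products/"
QA = "http://example.org/qa/"
COMPANIES = "http://example.org/companies/"

# The five SKOS mapping properties mentioned in the text.
MAPPING_PROPERTIES = {
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
}


def mapping_triple(subject, prop, obj):
    """Return one N-Triples line linking two concepts with a SKOS mapping property."""
    if prop not in MAPPING_PROPERTIES:
        raise ValueError(f"not a SKOS mapping property: {prop}")
    return f"<{subject}> <{SKOS}{prop}> <{obj}> ."


triples = [
    # Marketing and QA describe the same real world phone.
    mapping_triple(PRODUCTS + "phantastax-stamimaxx", "exactMatch",
                   QA + "phantastax-i3000-stamimaxx-s"),
    # The QA concept has a broader match in the product line vocabulary.
    mapping_triple(QA + "phantastax-i3000-stamimaxx-s", "broadMatch",
                   PRODUCTS + "smartphone"),
    # An associative link into a vocabulary about companies.
    mapping_triple(PRODUCTS + "phantastax-stamimaxx", "relatedMatch",
                   COMPANIES + "acme-corporation"),
]

print("\n".join(triples))
```

Because each mapping is a single triple between two URIs, either department can add or remove a link without touching the other department's thesaurus.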

Modularisation of knowledge models

In this way SKOS thesaurus management systems like PoolParty make it possible to modularise knowledge models, represent concepts in their different contexts and consequently enable collaborative work on those models: The marketing guy can work on his model with the concept properties focused on sales without disrupting the work of the quality assurance expert on her own thesaurus. Later one or both of them can create the skos:exactMatch link between the concepts that are the same, as seen in the “Exact Matching Concepts” box in the PoolParty screenshot below.

Enrich your knowledge: Get connected with the LOD Cloud

Going a step further, the models could be connected to external knowledge, e.g. a source from the Linked Open Data (LOD) Cloud. Once we establish links to LOD hubs like DBpedia, we can import additional information for our concepts or use it to establish whether similar concepts from different models really refer to the same real world resource.

Tassilo Pellegrini

Linking Open Data to Thesaurus Management

The Vienna-based company punkt. netServices is just about to release a demo version of their PoolParty service, a SKOS-based thesaurus management tool with linked data capabilities. I had the chance to pre-read a white paper and test their service. Here is a brief overview. You can also try a demo.

Purpose

PoolParty was conceived to facilitate various applications like:

  • Semantic search engines
  • Recommender systems (similarity search)
  • Corporate bookmarking
  • Annotation- & tag recommender systems
  • Autocomplete services and faceted browsing.

These use cases can be achieved either by using PoolParty stand-alone or by integrating it with existing enterprise search engines, document management systems or enterprise wikis.

Thesaurus Management

PoolParty aims to be easy to use for people without a strong Semantic Web background or special technical skills. The GUI is entirely web-based and utilizes AJAX, so the user can e.g. quickly merge two concepts via drag & drop. An overview of the thesaurus can be gained with a tree or a graph view of the concepts.


PoolParty also helps to semi-automatically add concepts to a thesaurus as it can be used to analyse documents (e.g. web pages or PDF files) relevant to a thesaurus’ domain in order to glean candidate terms. This is done by the key-phrase extractor of KEA. The extracted terms can be selected by the user, thereby becoming “free concepts” which later can be integrated into the thesaurus, turning them into “approved concepts”.

Documents can be searched in various ways: by keyword search in the full text, by searching for their tags, or by semantic search and similarity search. Semantic search takes into account not only a concept’s preferred label, but also its synonyms and the labels of its related concepts. The user may manually remove query terms used in semantic search, and the boost values for the various relations considered in semantic search may also be adjusted. The recommendation mechanism for document similarity calculation works in the same way.
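To make the idea of boosted query expansion concrete, here is a minimal sketch. It is not PoolParty's implementation: the toy thesaurus, the boost values and the function name `expand_query` are all assumptions made for illustration, and only the preferred labels of related concepts are added.

```python
# Toy thesaurus: all concepts, labels and boost values are invented for
# illustration and do not come from PoolParty.
THESAURUS = {
    "smartphone": {
        "pref_label": "Smartphone",
        "alt_labels": ["smart phone"],
        "related": ["touchscreen"],
    },
    "touchscreen": {
        "pref_label": "Touchscreen",
        "alt_labels": ["touch screen"],
        "related": [],
    },
}

# Hypothetical default weights per relation type; a user could adjust them.
DEFAULT_BOOSTS = {"pref_label": 1.0, "alt_label": 0.8, "related": 0.5}


def expand_query(concept_id, boosts=DEFAULT_BOOSTS, removed=()):
    """Map each query term for a concept to its boost value."""
    concept = THESAURUS[concept_id]
    terms = {concept["pref_label"]: boosts["pref_label"]}
    for label in concept["alt_labels"]:
        terms.setdefault(label, boosts["alt_label"])
    for related_id in concept["related"]:
        related_label = THESAURUS[related_id]["pref_label"]
        terms.setdefault(related_label, boosts["related"])
    # Terms the user removed manually are dropped from the expanded query.
    for term in removed:
        terms.pop(term, None)
    return terms


print(expand_query("smartphone"))
print(expand_query("smartphone", removed=["Touchscreen"]))
```

The resulting term weights could then be handed to whatever search backend is in use; how that is done depends on the search system and is left out here.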

PoolParty by default also publishes a Semantic Wiki version of its thesauri, which provides an alternative way to browse and edit concepts. Through this feature anyone can get read access to a thesaurus, and optionally also edit, add or delete labels of concepts. Search and autocomplete functions are available here as well. The Wiki’s XHTML source is also enriched with RDFa, thereby exposing all RDF metadata associated with a concept to be picked up by RDF search engines and crawlers. (See two examples: Cocktail thesaurus, Standard Thesaurus for Economics)

PoolParty also supports the import of thesauri in SKOS (including several consistency checks) or Zthes format. Those functionalities can also be consumed as stand-alone web services via PoolParty SKOS Services. Additionally, lists of concepts and their labels can be imported via CSV files.

Linked (Open) Data

PoolParty not only publishes its thesauri as Linked Open Data (in addition to a SPARQL endpoint), but it also consumes LOD in order to expand thesauri with information from LOD sources.

Concepts in the thesaurus can be linked to e.g. DBpedia via a service like Georgi Kobilarov’s DBpedia lookup service, which takes the label of a concept and returns possible matching candidates. The system suggests relevant resources from DBpedia and the user can select the one that matches the concept from his thesaurus, thereby creating a skos:exactMatch relation between the concept URI in PoolParty and the DBpedia URI. The same approach can be used to link to other SKOS thesauri available as Linked Data.


Other triples can also be retrieved from the target data source, e.g. the DBpedia abstract can become a skos:definition, and geographical coordinates can be imported and used to display the location of a concept on a map, where appropriate. The DBpedia category information may also be used to retrieve additional concepts of that category as siblings of the concept in focus, in order to populate the thesaurus.

PoolParty is capable of importing a SKOS thesaurus from a Linked Data server, and may also receive updates to thesauri imported this way. This feature has been implemented in the course of the KiWi project funded by the European Commission. KiWi also contains SKOS thesauri and exposes them as LOD. Both systems can read a thesaurus via the other’s LOD interfaces and may write it to their own store. This is facilitated by special Linked Data URIs that return e.g. all the top-concepts of a thesaurus, with pointers to the URIs of their narrower concepts, which allow other systems to retrieve a complete thesaurus through iterative dereferencing of concept URIs.
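The iterative dereferencing described above amounts to a graph traversal that starts at the top-concept list and follows the pointers to narrower concepts. The sketch below illustrates this under stated assumptions: the `REMOTE` dictionary stands in for a Linked Data server, the `dereference` function is a hypothetical placeholder for an HTTP GET plus RDF parsing, and all URIs are invented.

```python
from collections import deque

# Stand-in for a Linked Data server: maps a URI to the narrower-concept URIs
# that its description points to. All URIs are hypothetical.
REMOTE = {
    "http://example.org/thesaurus/top": [
        "http://example.org/c/phones",
        "http://example.org/c/companies",
    ],
    "http://example.org/c/phones": ["http://example.org/c/smartphones"],
    "http://example.org/c/smartphones": [],
    "http://example.org/c/companies": [],
}


def dereference(uri):
    """Hypothetical fetch; a real client would issue an HTTP GET and parse RDF."""
    return REMOTE[uri]


def harvest(top_concepts_uri):
    """Breadth-first walk from the top-concept list down the narrower links."""
    seen = set()
    order = []
    queue = deque(dereference(top_concepts_uri))
    while queue:
        uri = queue.popleft()
        if uri in seen:
            continue  # a concept may be reachable via several broader concepts
        seen.add(uri)
        order.append(uri)
        queue.extend(dereference(uri))
    return order


print(harvest("http://example.org/thesaurus/top"))
```

The `seen` set matters because a thesaurus is not necessarily a strict tree: a concept with two broader concepts would otherwise be fetched twice.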

Additionally, KiWi and PoolParty publish lists of concepts created, modified, merged or deleted within user-specified time-frames. With this information the systems can learn about updates to one of their thesauri in an external system. They can then compare the versions of concepts in both stores and may write the corresponding updates to their own store.

This means each system decides autonomously which data it accepts, and there is no risk of a system pushing data that might lead to inconsistencies into an external store. Data transfer and communication are achieved using REST/HTTP; no other protocols or middleware are necessary. Also, no rights management for external systems is needed, which otherwise would have to be configured separately for each source.
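The pull-based update scheme can be sketched as follows. This is an illustration under assumptions, not the KiWi or PoolParty code: the change-list format (dictionaries with "id", "action" and "modified" keys), the timestamps and the `accept` policy are all hypothetical.

```python
def plan_updates(local, remote_changes, accept):
    """Decide which remote changes to apply to the local store.

    local: dict mapping concept id to its last-modified timestamp
           (ISO 8601 strings, which compare correctly as text)
    remote_changes: list of dicts with "id", "action" and "modified" keys
                    (a hypothetical change-list format)
    accept: predicate that gives the local system the final say
    """
    plan = []
    for change in remote_changes:
        local_modified = local.get(change["id"])
        is_newer = local_modified is None or change["modified"] > local_modified
        if is_newer and accept(change):
            plan.append((change["action"], change["id"]))
    return plan


# Invented sample data.
local_store = {"c1": "2009-05-01T10:00:00", "c2": "2009-05-03T09:00:00"}
change_list = [
    {"id": "c1", "action": "modified", "modified": "2009-05-02T08:00:00"},
    {"id": "c2", "action": "modified", "modified": "2009-05-02T08:00:00"},
    {"id": "c3", "action": "created", "modified": "2009-05-02T12:00:00"},
    {"id": "c4", "action": "deleted", "modified": "2009-05-02T12:00:00"},
]

# Example policy: this system never accepts remote deletions.
print(plan_updates(local_store, change_list, lambda c: c["action"] != "deleted"))
```

Since the receiving side computes the plan itself, nothing is ever written into its store by the remote system, which is the autonomy property described above.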

Technology

The software is written in Java and utilizes the SAIL API, so it can be used with various triple stores. The thesaurus management itself (viewing, creating and editing SKOS concepts and their relationships) can be done in an AJAX Frontend based on Yahoo User Interface (YUI). Editing of labels can alternatively be done in a Wiki style HTML frontend. For key-phrase extraction from documents PoolParty uses a modified version of the KEA 5 API, which is extended for the use of controlled vocabularies stored in a SAIL Repository (this module is available under GNU GPL). The analysed documents can be stored and indexed in Lucene/Solr or any other (enterprise) search system along with extracted and semantically related concepts.

Andreas Blumauer

Metaweb’s Jamie Taylor: “Freebase provides a large and user extensible vocabulary for RDF/RDFa”

Jamie Taylor, Metaweb


Andreas Blumauer from Semantic Web Company (SWC) talked with Jamie Taylor, Minister of Information at Metaweb Technologies Inc., about Freebase & Linked Data and Google’s announcement that it will use RDFa.

SWC: At ISWC 2008 Freebase became “officially” part of the LOD Cloud. What exactly has changed since that time?

Jamie: Since Freebase is a community writable semantic database, the addition of the RDF interface allows anyone to publish data into the LOD cloud. LOD applications can access any Freebase Topic through the RDF interface by constructing a URI from the Freebase identifier. But perhaps more importantly, because entities in Freebase can be annotated with multiple identifiers, Freebase Topics can also be retrieved via URIs constructed from the identifiers used by other systems and data sets.
For instance, the movie Blade Runner can be referred to as http://rdf.freebase.com/ns/en.blade_runner, but it can also be referenced as http://rdf.freebase.com/ns/authority.netflix.movie.70053131 using the Netflix identifier, http://rdf.freebase.com/ns/authority.imdb.title.tt0083658 using the IMDB identifier, or as http://rdf.freebase.com/ns/wikipedia.en.Dangerous_Days using a Wikipedia wikiword (which in this case is a Wikipedia redirect to the wikiword Blade_Runner).
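A minimal sketch of this URI construction, assuming that a slash-separated Freebase identifier such as /en/blade_runner maps to its RDF URI by replacing the slashes with dots. That rule is inferred from the example URIs in this answer, and the function name `freebase_rdf_uri` is hypothetical.

```python
# Base namespace taken from the example URIs in the interview.
RDF_BASE = "http://rdf.freebase.com/ns/"


def freebase_rdf_uri(freebase_id):
    """Turn a slash-separated id such as /en/blade_runner into an RDF URI.

    Assumption: slashes in the id become dots in the URI, as the example
    URIs suggest.
    """
    if not freebase_id.startswith("/"):
        raise ValueError("expected an id starting with '/'")
    return RDF_BASE + freebase_id.strip("/").replace("/", ".")


for identifier in (
    "/en/blade_runner",
    "/authority/netflix/movie/70053131",
    "/authority/imdb/title/tt0083658",
    "/wikipedia/en/Dangerous_Days",
):
    print(freebase_rdf_uri(identifier))
```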
Freebase also provides a user maintained mapping of how these identifiers can be used to address resources in other LOD systems. The sameas.freebase.com schema can tell an LOD user that the Freebase Blade Runner Topic can also be found in DBpedia using Wikipedia identifiers or how musical artists can be found at the BBC using Musicbrainz identifiers.  In fact, the Freebase RDF interface uses the sameas.freebase.com schema to create the owl:sameAs links in the RDF output allowing the user community to expand the interconnections between Freebase and the LOD Cloud.
Linked Data providers are also using the strong identifiers in Freebase to identify entities such as companies and locations in their own data sets. When they find an entity that is not represented in Freebase, they simply add the entity to Freebase and use the newly minted Freebase identifier. This permits anyone using their data to understand how their entities relate to any of the more than 5 million things interconnected within Freebase.

The RDF interface can also be used to reference the Freebase type system, giving LOD data set providers vocabularies across a wide range of subject areas. And because anyone can expand Freebase’s data model, data providers can use our schema development tools to build and extend these vocabularies to suit their needs.
Freebase was not designed for ephemeral or fast changing data, like weather conditions or stock ticks. But this type of information is well suited for publication as Linked Data. Freebase entities representing a location or company can be annotated with references to LOD services that provide these types of volatile data. Similarly, Linked Data provides a great way to disseminate very fine-grained information that might be associated with a scientific study or financial report. Linked Data provides a seamless transition from Freebase, where a user (or application) can run a query with constraints that run across a wide range of types to find entities of interest, along with the LOD services that provide access to temporal or high resolution data not available in Freebase.
We recently demonstrated MQL Extensions which allows the Metaweb Query Language to use data from other systems as a part of the query constraint and result set.  While MQL Extensions are user extensible and work with a wide array of systems,  this capability makes the connection between Freebase and the LOD Cloud even more transparent.
For example, because US companies that are registered with the SEC are annotated with a CIK code in Freebase, and the sameas.freebase.com schema indicates that the CIK annotation can be used to create a URI that is dereferenceable at rdfabout.com, it is possible to write an MQL query that asks who is on the board of financial services companies that trade on NASDAQ and are headquartered in California (and, using another MQL Extension, you can ask for their stock price as well!).

SWC: Many organisations are very interested in Linking Open Data now but they are still not sure if they can benefit from publishing data on the web – what’s your experience so far?

Jamie: Linked Open Data provides a simple, standard way for organizations to distribute structured data. For most organizations, providing access to data is another important outlet to announce the availability of higher value services. For organizations involved in building or selling physical goods, the bits representing what they provide are not the goods themselves, but a way of attracting potential customers. Making catalogs and specification sheets available in electronic form, so other applications can connect buyers to their physical goods, is simply an effective marketing system. Even for firms involved in electronic services, providing access to open structured data is generally a lead-in to value added services. For instance, if I ran a service collecting hard-to-find information about manufacturing relationships between medium sized businesses, I would publish open company profiles covering things like market size, industry and location for the medium-sized businesses I tracked, so potential users of the premium data would know I had the coverage they were looking for.

SWC: Just recently Google announced that it will use RDFa to enhance its search results. What do you think?

Jamie: We are excited about Google’s announcement. Yahoo’s use of RDFa for Search Monkey and Google’s announcement give RDFa users tangible benefits. The Search Monkey team was very quick to realize that because users can create data models in Freebase, and because the elements of those models all have strong RDF identifiers, Freebase provides a large and user extensible vocabulary for RDF/RDFa (see the list of vocabularies). When a user wants to create a Search Monkey application that works with their film review site, they need not invent a new vocabulary (that will probably be used only once); they can use the Freebase Film Domain vocabulary, which supports over 63,000 instances in Freebase alone.
Similarly, with over 5 million well described Topics in Freebase and over 14,000,000 Named Objects (Topics, images, musical tracks and documents), when a user wants to unambiguously identify a subject or object in RDF/RDFa, Freebase has an extremely large collection of identifiers to draw from. These cover people, places, companies, movies, music, books and a wide variety of other subjects. If Freebase doesn’t have the entity the user is looking for, they can of course add it themselves and make use of the identifier immediately. I think this is why Google used some Freebase identifiers in their examples. We hope that with Yahoo and Google’s support for RDFa the web will become a strongly annotated source of data which can support a wide range of user applications.

SWC: Thank you, Jamie!

Andreas Blumauer

Linked Data in Enterprises – some ideas for business models

This morning I wrote a short blog post philosophizing about linked data and its value for enterprises. I asked a couple of questions and, at its core, I was wondering: “Which services and key players will drive the web of data in the next few months?”

In the meantime I had the pleasure of listening to Talis’ Semantic Web Gang Podcast (January 2009, with Tom Tague from Calais), and some answers came to mind:

  1. Some service providers will provide the highest accuracy regarding the links or tags (and the “things” behind them) they provide for a given resource or document (like Open Calais does). Tom Tague mentioned quite often in the podcast how important disambiguation is for providing the highest quality.
  2. Some will provide end-points to a given “thing” like a company or a person, in addition to free ones like DBpedia, but they will always try to refer to established URIs like the ones in DBpedia or Open Calais (e.g. IBM’s URI @ Calais). Those companies will provide more facts, for example about a person, than those which are now available for free. They will build on the LOD infrastructure and will live in symbiosis with group number 3. They will control to whom additional facts are given, but they will build on exactly the same interoperable framework as the “Linking Open Data” community does.
  3. Some companies will build applications on top of the linked data infrastructure. They have two kinds of knowledge: Who has the best end-points to a complex “thing” which consists of a couple of other atomic things (which necessarily exist in the web of data)? Who is interested in such a mashup?

My prediction: One possible business model will be pretty much the same as the one iTunes is built upon at the moment: You can listen to a song for free – but only for a couple of seconds; if you want more, you pay 99 cents.

If you want to know a little bit about Werner Faymann (who is Austria’s prime minister), you go to an application which makes use of DBpedia (or the like), starting at http://dbpedia.org/page/Werner_Faymann.

If you pay 99 cents (or a bit more…) you get even more facts about Mr. Faymann, nicely mashed up with other facts from the LOD cloud and with special content from some other linked data sources, produced at relatively low cost due to the high interoperability the Semantic Web provides – thanks to W3C and the whole community.