Thomas Schandl

Linked data based thesaurus management in collaborative settings

The creation and management of controlled vocabularies in companies often takes place in a distributed manner. Different departments in different branch offices often prefer to create their own vocabularies rather than maintain one large central knowledge model to which everyone contributes.

How to model divergent views on one concept?

Such a central model is not only much harder to manage, but there is also the general problem that different departments like marketing, quality assurance, R&D, etc. will have divergent views on the model and its concepts. These different perspectives on one and the same concept are hard to unify in a single model.

Think of a company that sells mobile phones and wants to create a model of its line of products. It wants to utilize this model in the context of its online shop as well as in the context of its user support forum. While the structure of the model (i.e. the relationships between the products) might be very similar or the same in both contexts, there will be differences in which properties of the products are actually relevant in the respective contexts.

In the model of the marketing department there might be a concept for a “Phantastax StamiMaxx” cell phone with a definition “The StamiMaxx has a powerful battery and is great for professionals who travel a lot”. They might relate it to manufacturer “ACME Corporation” and to several concepts representing different features like “Android OS”, “Multi-touch touchscreen”, etc.
The very same phone has different properties that are interesting from the Quality Assurance department’s perspective. They might call it by a more specific name like “Phantastax i3000 StamiMaxx S”, have a different definition for it like “3G cell phone implementing the new WTF3000 protocol, …” and relate it to concepts representing known problems and their solutions.

Now they face the task of integrating these different models, as it is not desirable to use a bunch of isolated models within one company.

Support of collaborative work on distributed models

To support this kind of collaborative work on distributed knowledge models, we would like to link the concepts of the models, just as we link documents on the World Wide Web. Fortunately, the Simple Knowledge Organization System (SKOS) offers mapping properties that can be used to define relationships between concepts from different knowledge models.

For example, when we want to say that concept “Phantastax StamiMaxx” in the product line thesaurus refers to the same real world entity as concept “Phantastax i3000 StamiMaxx S” in the Quality Assurance thesaurus, we can use skos:exactMatch to express that. If we want to express that the concepts are merely similar, skos:closeMatch could be used.

The other SKOS mapping properties express a hierarchical (narrowMatch, broadMatch) or an associative (relatedMatch) mapping relation between concepts from different concept schemes. With those we can say that my Samsung Galaxy concept has a skos:broadMatch “Smartphone” in the product line vocabulary and a skos:relatedMatch “ACME Corporation” in a controlled vocabulary about Tech companies.
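
To make the mapping properties a bit more tangible, here is a minimal Python sketch that writes the mappings from the examples above as plain triples and prints them as N-Triples. The example.org namespaces and concept identifiers are made up for illustration. The helper also adds the reverse statements that follow from the SKOS definitions (exactMatch, closeMatch and relatedMatch are symmetric, and narrowMatch is the inverse of broadMatch). A real thesaurus tool would normally take care of this for you.

```python
# Illustrative only: the example.org namespaces and concept IDs are hypothetical.
SKOS = "http://www.w3.org/2004/02/skos/core#"
PRODUCTS = "http://example.org/products/"
QA = "http://example.org/qa/"
COMPANIES = "http://example.org/companies/"

# Each mapping is a (subject, mapping property, object) triple of URIs.
mappings = [
    (PRODUCTS + "stamimaxx", SKOS + "exactMatch", QA + "i3000-stamimaxx-s"),
    (QA + "i3000-stamimaxx-s", SKOS + "broadMatch", PRODUCTS + "smartphone"),
    (QA + "i3000-stamimaxx-s", SKOS + "relatedMatch", COMPANIES + "acme-corporation"),
]

# Symmetric mapping properties and the broadMatch/narrowMatch inverse pair.
SYMMETRIC = {SKOS + "exactMatch", SKOS + "closeMatch", SKOS + "relatedMatch"}
INVERSE = {
    SKOS + "broadMatch": SKOS + "narrowMatch",
    SKOS + "narrowMatch": SKOS + "broadMatch",
}


def with_inverses(triples):
    """Return the triples plus the reverse statements implied by SKOS."""
    result = set(triples)
    for s, p, o in triples:
        if p in SYMMETRIC:
            result.add((o, p, s))
        elif p in INVERSE:
            result.add((o, INVERSE[p], s))
    return result


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in sorted(triples))


all_mappings = with_inverses(mappings)
print(to_ntriples(all_mappings))
```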

Modularisation of knowledge models

In this way SKOS thesaurus management systems like PoolParty make it possible to modularise knowledge models, represent concepts in their different contexts and consequently enable collaborative work on those models: The marketing guy can work on his model with the concept properties focused on sales without disrupting the work of the quality assurance expert on her own thesaurus. Later one or both of them can create the skos:exactMatch link between the concepts that are the same, as seen in the “Exact Matching Concepts” box in the screenshot of PoolParty below.

Enrich your knowledge: Get connected with the LOD Cloud

Going a step further, the models could be connected to external knowledge, e.g. a source from the Linked Open Data (LOD) Cloud. Once we establish links from our concepts to LOD hubs like DBpedia, we can import additional information for these concepts or use the links to establish whether similar concepts from different models really refer to the same real world resource.
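
The last point, using a LOD hub as a common reference, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the local concept URIs are hypothetical, the links to DBpedia are entered by hand, and two concepts are treated as candidates for the same real world resource as soon as they share at least one external link.

```python
# Hypothetical links from local concepts to DBpedia resources.
# In practice these would be skos:exactMatch statements kept in each thesaurus.
external_links = {
    "http://example.org/products/android-os": {
        "http://dbpedia.org/resource/Android_(operating_system)",
    },
    "http://example.org/qa/android": {
        "http://dbpedia.org/resource/Android_(operating_system)",
    },
    "http://example.org/products/multi-touch-touchscreen": {
        "http://dbpedia.org/resource/Multi-touch",
    },
}


def shared_resources(concept_a, concept_b, links):
    """Return the external resources that both concepts are linked to."""
    return links.get(concept_a, set()) & links.get(concept_b, set())


def same_real_world_entity(concept_a, concept_b, links):
    """True if the two concepts share at least one external link."""
    return bool(shared_resources(concept_a, concept_b, links))


print(same_real_world_entity(
    "http://example.org/products/android-os",
    "http://example.org/qa/android",
    external_links,
))
```

A shared link is a hint rather than proof, so a match found this way is best treated as a suggestion for a human editor to confirm.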

Thomas Schandl

Transforming spreadsheets into SKOS with Google Refine

Looking for high quality enterprise vocabularies we recently turned our attention to the Global Industry Classification Standard (GICS), which is an industry taxonomy designed to categorize any private company. It was developed by Morgan Stanley Capital International and Standard & Poor’s and is mainly used by the global financial community to aid in the investment research process.

It is available for download as .xls spreadsheet files in several languages. Of course it would be much better to have this valuable taxonomy in a standard and machine-readable format. The Simple Knowledge Organization System SKOS is a perfect fit for a taxonomy like GICS. But how to turn a spreadsheet into SKOS with minimal manual effort?

I chose to try Google Refine for this task, as recently a promising RDF extension had been released by DERI’s Fadi Maali and Richard Cyganiak.

Google Refine is “a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases”. It was previously known as Freebase Gridworks and has been developed further by Google since its acquisition of Metaweb.

Google Refine UI

Refine is a very useful tool to filter and consequently transform rows, columns and cells according to customizable patterns.

After applying all necessary transformations to the spreadsheet one can edit the “RDF Skeleton”, where the columns can be mapped to literals, RDF properties and RDF classes (which can be imported from their namespaces).

Editing the RDF Skeleton
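
To give an idea of what such a column mapping produces, here is a small Python sketch that turns a few spreadsheet rows into SKOS statements in Turtle syntax. It is not how the RDF extension works internally. The three-column layout (code, label, parent) is a simplified assumption rather than the structure of the published GICS files, the sample rows are for illustration only, and the example.org namespace is hypothetical.

```python
import csv
import io

BASE = "http://example.org/gics/"  # hypothetical namespace for the concepts

# Simplified, illustrative layout: one row per category with its parent code.
SAMPLE = """code,label,parent
10,Energy,
1010,Energy,10
101010,Energy Equipment & Services,1010
"""


def rows_to_turtle(csv_text, lang="en"):
    """Map each row to a skos:Concept with notation, label and broader link."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = row["code"]
        # Escape backslashes and quotes so the label is a valid Turtle string.
        label = row["label"].replace("\\", "\\\\").replace('"', '\\"')
        props = [
            "a skos:Concept",
            f'skos:notation "{code}"',
            f'skos:prefLabel "{label}"@{lang}',
        ]
        if row["parent"]:
            props.append(f"skos:broader <{BASE}{row['parent']}>")
        lines.append(f"<{BASE}{code}> " + " ;\n    ".join(props) + " .")
    return "\n".join(lines)


turtle = rows_to_turtle(SAMPLE)
print(turtle)
```

The lang parameter hints at how the same mapping could be reused for spreadsheets in other languages, with only the language tag of the labels changing.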

Once you have your valid SKOS model ready, you can export it in RDF/XML or Turtle format. Then you may want to load it into an ontology editor like Protégé or a thesaurus management tool like PoolParty in order to build upon it or connect it to other knowledge models. With PoolParty the GICS taxonomy can also be utilized to tag and categorize documents and to provide semantic search and faceted navigation, and it can be published as Linked Data without further effort.

GICS loaded in PoolParty

Working with Refine and its RDF extension was easy and fun. It’s even possible to isolate and save the transformation steps done with Refine, so one can re-apply them to similarly structured spreadsheets. This came in very handy as GICS is published in nine languages and as many separate, identically structured spreadsheets.

Thomas Schandl

Drupal and the Semantic Web – Interview with Stéphane Corlosquet

Stéphane Corlosquet has been the main driving force in incorporating Semantic Web capabilities into Drupal. In the recent release of Drupal 7, Semantic Web technologies became part of the core of this popular CMS, which is used to power at least 1% of all the world’s web sites.

Drupal is the leading CMS when it comes to implementing Semantic Web standards. What are the reasons for this, what makes Drupal such a good fit for Semantic Web technologies?

Historically, Drupal is known to be web standard compliant. It supported the RDF-based aggregation format known as RSS 1.0 as early as 2001, which was later upgraded to RSS 2.0. The Drupal community prides itself on valid HTML code, not only for the code generated by Drupal, but also by taking the extra step of automatically fixing faulty HTML entered by its users. Drupal has been using XHTML since its version 4.0 in 2002. The next logical step beyond XHTML was to add a layer of semantics with the RDFa standard, a W3C recommendation published in 2008.

There are definitely many reasons that contributed to the addition of RDFa into Drupal 7. The first comes from the Drupal project lead, Dries Buytaert, who is passionate about the web and open source. Secondly, the growing Drupal community is very web savvy and includes many experts from different backgrounds in accessibility, CSS, HTML, security etc. As a result, every release of Drupal incorporates many of the latest standards. The community meets twice a year at conferences (DrupalCons), and these events play a great role in hashing out what technologies or designs will be incorporated into the next version of Drupal. Because of the flexibility of its internal architecture, Drupal is able to keep up with the latest web standards. Content in Drupal is very structured and provides site administrators with a user interface to build the site structure they want, using entity types, content types, fields and taxonomies for categorization. When it comes to other CMSs, Joomla!’s community appears to be more fragmented, with a core software that is not as extensible as Drupal, and WordPress is more of a blogging platform, so turning it into a full-blown CMS can be challenging. Both WordPress and Joomla! are in fact adapting the concept of Drupal’s Content Construction Kit (CCK) to their software but they have not yet reached the same level of maturity as Drupal.

A common objection to the adoption of Semantic Web technologies is that the learning curve is steep and that it is too complicated for many web developers to get into it. How can Drupal 7 change that? Which features accessible for the average web site operator will it offer?

Semantic Web technologies don’t have to be complicated when applied to simple use cases! We purposely chose only a subset of Semantic Web technologies to integrate into the core of Drupal, keeping the learning curve for the Drupal developers and users as low as possible. The main technology is RDFa, which includes the notions of vocabularies (a schema, or collection of attributes) as well as Compact URIs (CURIEs), which make the authoring of RDFa easier. In fact, some web developers might have come across these notions before when working with Dublin Core in meta tags, such as dc:title or dc:date.
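
For readers who have not met CURIEs before, the idea can be shown in a few lines of Python: a CURIE such as dc:title is a prefix plus a local name, and the prefix stands for a vocabulary namespace. The prefix table below is chosen for illustration and is an assumption, not a dump of what a given Drupal site declares.

```python
# Illustrative prefix bindings; a real page declares its own prefixes.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a CURIE like 'dc:title' into the full URI it abbreviates."""
    prefix, separator, reference = curie.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"no namespace declared for CURIE {curie!r}")
    return prefixes[prefix] + reference


print(expand_curie("dc:title"))
print(expand_curie("dc:date"))
```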

Which benefits will web site owners get when they switch to a semantics enabled Drupal 7?

Google and Bing increasingly rely on machine-readable structured data from the websites that they crawl. The design of Drupal 7 embeds semantic meta data that makes machine-to-machine (M2M) search native for a Drupal 7 website. RDFa can add value by giving search engines more details such as the latitude and longitude of a venue for display on a map; or providing the ISO date format for localization and proper display in the search results for different countries.
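
The date case can be illustrated with a short Python sketch: the human-readable text stays as it is, while the machine-readable ISO 8601 value travels in the content attribute of the RDFa markup. The function name and the choice of dc:date as the property are assumptions made for this example, and the markup a real Drupal 7 site emits may differ in detail.

```python
from datetime import datetime, timezone
from html import escape


def rdfa_date_span(when, display_text, prop="dc:date"):
    """Wrap a displayed date in a span carrying the ISO 8601 value as RDFa."""
    iso_value = when.isoformat()
    return (
        f'<span property="{escape(prop)}" content="{escape(iso_value)}" '
        f'datatype="xsd:dateTime">{escape(display_text)}</span>'
    )


published = datetime(2011, 1, 5, 12, 0, tzinfo=timezone.utc)
span = rdfa_date_span(published, "5 January 2011")
print(span)
```

A consumer that understands RDFa can read the content attribute and format the date for its own locale, regardless of how the site chose to display it.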

What are your hopes regarding the development of other applications that either provide or consume data from D7 sites? Which improvements of standards, best practices or (lightweight) ontologies in the Semantic Web community would you like to see?

Services like Sig.ma are already able to collect semantic data from different sources and display it in new ways in the form of mash-ups. Eventually, these services that consume semantic data will not be just Drupal specific, as more platforms jump on the Semantic Web bandwagon. What I hope to see as improvements or best practices in the future are more well-maintained vocabularies. Many of the existing vocabularies are over-engineered, and some fail to dereference properly. There is also some work to be done in order to improve the tooling made available to web developers as well as introducing the simple concepts of Linked Data to web developers via easy-to-read documentation.

Thank you for this interview, Stéphane!

Thomas Schandl

Report of Linked Data Camp Vienna

Earlier this month the first ever Linked Data Camp took place in Vienna at the Quartier für Digitale Kunst. This two-day event attracted about 35 people to discuss and to jointly work on novel applications for the Web of Data.

The first day started off with a keynote by Richard Cyganiak from DERI Galway’s Linked Data Research Center. He talked about the technical challenges that have to be overcome to allow for more Linked Data applications over heterogeneous RDF data. These challenges revolve around discovery of and access to Linked Data, identifier and schema reconciliation, data fusion, quality assessment, aggregation, analytics and mining.
As Richard pointed out, the good news is “that linked data makes it possible that different people do the different steps, e.g., the publisher can help doing the identifier reconciliation by publishing sameAs links, and 3rd parties can help with access by providing a single SPARQL store over multiple related but independent datasets.” Check out the transcript or slides for Richard’s talk.
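
Identifier reconciliation via sameAs links, as mentioned in the quote, can be sketched in Python. This is a generic illustration and not something presented in the keynote: it groups URIs connected by sameAs statements into clusters using a simple union-find structure, and all URIs in the usage example are hypothetical.

```python
def reconcile(same_as_links):
    """Group URIs connected by sameAs links into clusters of identifiers."""
    parent = {}

    def find(uri):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in same_as_links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


# Hypothetical sameAs statements published by three different datasets.
links = [
    ("http://a.example.org/vienna", "http://b.example.org/Wien"),
    ("http://b.example.org/Wien", "http://c.example.org/city/1"),
    ("http://a.example.org/linz", "http://c.example.org/city/2"),
]
clusters = reconcile(links)
for cluster in clusters:
    print(sorted(cluster))
```

Treating sameAs as strictly transitive is a simplification. In practice some published links are wrong, which is one reason quality assessment shows up in the list of challenges.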

Linked Data Camp Vienna Working Groups

After this keynote, participants presented their topics of interest in Lightning Talks and working groups formed. Some of their outcomes can be found online:
One group worked on the topic of “Dataset Dynamics”. As data in Linked Data sets changes, clients that depend on this data need to be notified about these changes. You can read about their proposed solutions here.
Another group had a go at “Expert search and profiling on the Semantic Web”; their discussions are summarized in this blog post.
Andreas Langegger demonstrated XLWrap, which is a versatile RDF wrapper for spreadsheets. A lot of feature requests from participants came up (see here), so he and others worked on this handy application.
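
To illustrate the “Dataset Dynamics” problem mentioned above, here is a minimal Python sketch that compares two snapshots of a dataset and reports which triples were added or removed. This is a generic illustration, not the working group’s proposal, and all URIs are hypothetical. A publisher could expose such a change set so that dependent clients do not have to fetch the whole dataset again.

```python
def diff_snapshots(old_triples, new_triples):
    """Compare two snapshots of a dataset and return the change set."""
    old_set, new_set = set(old_triples), set(new_triples)
    return {
        "added": new_set - old_set,
        "removed": old_set - new_set,
    }


# Hypothetical snapshots: each triple is a (subject, predicate, object) tuple.
EX = "http://example.org/"
before = [
    (EX + "camp", EX + "participants", "30"),
    (EX + "camp", EX + "location", "Vienna"),
]
after = [
    (EX + "camp", EX + "participants", "35"),
    (EX + "camp", EX + "location", "Vienna"),
]

changes = diff_snapshots(before, after)
print("added:", sorted(changes["added"]))
print("removed:", sorted(changes["removed"]))
```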

On day 2 Leigh Dodds from Talis talked about “Rights Statements on the Web of Data” (slides and transcript). Leigh raised awareness of the issue that the majority of LOD sources do not have licensing information associated with their data. This of course conflicts with the proposed openness of Linked “Open” Data, as it is doubtful whether these sources can be used for commercial purposes.

The organizers from the universities of Linz and Vienna, Joanneum Research, Gnowsis, DERI Galway, STI Innsbruck and the Semantic Web Company would like to thank all participants for making the camp a success! As with VoCamps anyone can organize a Linked Data Camp, so we hope for more camps in 2010!