Thomas Schandl

Drupal and the Semantic Web – Interview with Stéphane Corlosquet

Stéphane Corlosquet has been the main driving force in incorporating Semantic Web capabilities into Drupal. In the recent release of Drupal 7, Semantic Web technologies became part of the core of this popular CMS, which is used to power at least 1% of all the world’s web sites.

Drupal is the leading CMS when it comes to implementing Semantic Web standards. What are the reasons for this? What makes Drupal such a good fit for Semantic Web technologies?

Historically, Drupal is known to be web standards compliant. It supported the RDF-based aggregation format known as RSS 1.0 as early as 2001, which was later upgraded to RSS 2.0. The Drupal community prides itself on valid HTML code, not only for the code generated by Drupal, but also by taking the extra step of automatically fixing faulty HTML entered by its users. Drupal has been using XHTML since its version 4.0 in 2002. The next logical step beyond XHTML was to add a layer of semantics with the RDFa standard, a W3C recommendation published in 2008.

There are definitely many reasons that contributed to the addition of RDFa into Drupal 7. The first comes from the Drupal project lead, Dries Buytaert, who is passionate about the web and open source. Secondly, the growing Drupal community is very web savvy and includes many experts from different backgrounds in accessibility, CSS, HTML, security etc. As a result, every release of Drupal includes many of the latest standards. The community meets twice a year at conferences (DrupalCons); these events play a great role in hashing out what technologies or designs will be incorporated into the next version of Drupal. Because of the flexibility of its internal architecture, Drupal is able to keep up with the latest web standards. Content in Drupal is very structured, and Drupal provides site administrators with a user interface to build the site structure they want, using entity types, content types, fields and taxonomies for categorization. When it comes to other CMSs, Joomla!’s community appears to be more fragmented, with a core software that is not as extensible as Drupal’s, and WordPress is more of a blogging platform, so turning it into a full-blown CMS can be challenging. Both WordPress and Joomla! are in fact adapting the concept of Drupal’s Content Construction Kit (CCK) to their software but they have not yet reached the same level of maturity as Drupal.

A common objection to the adoption of Semantic Web technologies is that the learning curve is steep and that it is too complicated for many web developers to get into it. How can Drupal 7 change that? Which features accessible for the average web site operator will it offer?

Semantic Web technologies don’t have to be complicated when applied to simple use cases! We purposely chose only a subset of Semantic Web technologies to integrate into the core of Drupal, keeping the learning curve for Drupal developers and users as low as possible. The main technology is RDFa, which includes the notions of vocabularies (a schema, or collection of attributes) as well as Compact URIs (CURIEs), which make the authoring of RDFa easier. In fact, some web developers might have come across these notions before when working with Dublin Core in meta tags, such as dc:title or dc:date.
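
As an illustration of the CURIE notion mentioned above, here is a minimal Python sketch of how a compact form such as dc:title expands into a full URI. The function name expand_curie and the prefix table are hypothetical and are not part of Drupal; the two namespace URIs are the published Dublin Core namespaces.

```python
# Illustrative prefix table (an assumption for this sketch, not Drupal's own mapping).
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a compact URI such as 'dc:title' into a full URI."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    # A CURIE is resolved by simple concatenation: namespace URI + reference.
    return prefixes[prefix] + reference


print(expand_curie("dc:title"))  # http://purl.org/dc/elements/1.1/title
```

The short form is what an author types into the markup; the expanded URI is what a consuming application actually compares against.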

Which benefits will web site owners get when they switch to a semantics enabled Drupal 7?

Google and Bing increasingly rely on machine-readable structured data from the websites that they crawl. The design of Drupal 7 embeds semantic metadata that makes machine-to-machine (M2M) search native for a Drupal 7 website. RDFa can add value by giving search engines more details, such as the latitude and longitude of a venue for display on a map, or by providing the ISO date format for localization and proper display in the search results for different countries.
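
To make the date and location examples concrete, the following Python sketch renders small RDFa fragments in which the machine-readable value sits in a content attribute next to the human-readable label. It is only an illustration under stated assumptions: the helper names, the sample date and the sample coordinates are made up, and the choice of dc:date and the W3C Basic Geo properties geo:lat and geo:long is an assumption rather than a description of what Drupal 7 emits.

```python
from datetime import datetime, timezone
from html import escape


def rdfa_date(prop, when, label):
    """Wrap a human-readable date label with its ISO 8601 value."""
    iso = when.isoformat()
    return (f'<span property="{escape(prop)}" content="{escape(iso)}">'
            f'{escape(label)}</span>')


def rdfa_geo(lat, lon):
    """Emit latitude and longitude as two empty spans carrying only data."""
    return (f'<span property="geo:lat" content="{lat}"></span>'
            f'<span property="geo:long" content="{lon}"></span>')


# Sample values, chosen only for illustration.
when = datetime(2010, 9, 6, 9, 0, tzinfo=timezone.utc)
print(rdfa_date("dc:date", when, "6 September 2010"))
print(rdfa_geo(51.3397, 12.3731))
```

A crawler can read the content attributes without parsing the visible text, which is what allows it to localize the date or place the venue on a map.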

What are your hopes regarding the development of other applications that either provide or consume data from D7 sites? Which improvements of standards, best practices or (lightweight) ontologies in the Semantic Web community would you like to see?

Services like Sig.ma are already able to collect semantic data from different sources and display it in new ways in the form of mash-ups. Eventually, these services that consume semantic data will not be just Drupal specific, as more platforms jump on the Semantic Web bandwagon. What I hope to see as improvements or best practices in the future are more well-maintained vocabularies. Many of the existing vocabularies are over-engineered, and some fail to dereference properly. There is also some work to be done in order to improve the tooling made available to web developers, as well as introducing the simple concepts of Linked Data to web developers via easy-to-read documentation.

Thank you for this interview, Stéphane!

Tassilo Pellegrini

LOD2 Kick Off Meeting in Leipzig

From September 6 – 8, 2010 we kicked off the LOD2 project in Leipzig / Germany. LOD2 is funded by the European Commission within the 7th Framework Programme (Grant Agreement No. 257943) and brings together 10 partners from 7 countries. Its main aim is to integrate and syndicate linked data with large-scale, existing applications and showcase the benefits in three application scenarios: 1) Media & Publishing, 2) Enterprise Data Management and 3) Open Government Data. The resulting tools, methods and data sets have the potential to change the Web as we know it today. (You can download the project flyer here.)

The first day was dedicated to the general introduction of the project partners, which are Universität Leipzig (Germany), Centrum Wiskunde & Informatica (Netherlands), National University of Ireland in Galway (Ireland), Freie Universität Berlin (Germany), OpenLink Software (United Kingdom), Semantic Web Company (Austria), TenForce (Belgium), Exalead (France), Wolters Kluwer Deutschland (Germany) and Open Knowledge Foundation (United Kingdom). Below you see a picture of the kick-off team.

During the morning of the second day a first introduction to the technical components took place. The picture below shows an abstraction of the LOD2 high level architecture.

Orri Erling and Hugh Williams from OpenLink introduced Virtuoso, which will be used as one of the storage technologies in the LOD2 stack. The second knowledge store technology will be MonetDB, introduced by Peter Boncz from CWI. Both systems will also be used as a kind of benchmark laboratory for hosting and querying linked data.

Christian Bizer from FU Berlin talked about Silk and D2R. In combination they will be used to discover relationships and similarities between entities within different linked data sources – generally called identity resolution.

Giovanni Tummarello from DERI introduced Sindice and Sig.ma, focusing on how to update, validate and reuse data that is available on the web and how to support the production of professional, collaboratively governed linked data, especially for enterprise use. Besides that, an important aspect will be how to handle the large amounts of generated data. According to Giovanni, scaling the infrastructure and using appropriate hardware will be central in bringing the Sindice index into enterprise stacks, i.e. as an approach for lightweight data consolidation.

Norman Heino from AKSW University of Leipzig introduced OntoWiki and Semantic Pingback. OntoWiki will be used at the interface layer for producing, annotating, browsing and querying linked data and presenting it to the end user in various GUIs. Semantic Pingback’s aim is to interlink the Web 2.0 with the Semantic Web via backwards-compatible RPCs (remote procedure calls). It detects new typed or untyped external links, manages the GET and POST requests, and takes care of server autodiscovery.

Andreas Blumauer from Semantic Web Company demonstrated PoolParty as a smart editor for metadata in enterprise stacks. Like OntoWiki, PoolParty also addresses the interface level of LOD2, especially when it comes to generating, editing and linking metadata to documents, primarily based on SKOS. PoolParty deliberately uses thesauri as a mapping layer to discover similarities of documents, generate tag recommendations for their annotation and publish used vocabularies as Linked Data.

In the afternoon we continued with individual breakout sessions to discuss work package interdependencies and start profiling the use cases and requirements engineering in more detail.

The third day started with an introduction by Stefano Bertolo, the scientific project officer responsible for the LOD2 project on the EC side, who pointed out that LOD2 is an important project for the European Web of Data and that the EC is especially interested in, among others, the Open Government Data use case of LOD2.

After this introduction, the three use cases were presented: A) Jonathan Gray (OKFN) talked about the Open Government Data use case, followed by B) Amar-Djalil MEZAOUR (Exalead) speaking about the Linked Business Data use case and C) Christian Dirschl (Wolters Kluwer) giving a talk about the LOD in the publishing & media industry use case.

Central to the success of LOD2 will be smart handling of all the integration issues which will come up in the course of the project. Here TenForce, an integration specialist from Belgium, will have the lead. CEO Bastiaan Deblieck gave a detailed outlook on the methodologies and presented a comprehensive overview of how the integration issues will be approached from a SCRUM perspective.

After a presentation about LOD2 project dissemination, training and community building activities by Martin Kaltenböck (Semantic Web Company), there were several discussions until the successful kick-off meeting was closed by project lead Sören Auer (Universität Leipzig) at 4:00 pm on 8 September 2010.

Up-to-date news can be accessed on the LOD2 project website as well as on the LOD2project Twitter stream (and on Twitter using #lod2)…

Stay tuned!

Andreas Blumauer

Interview with David Huynh: “The user interface design must inform the back-end design”

Linked Data is evolving fast. A huge amount of RDF data is available and ready for exciting new applications. Unfortunately, the bottleneck is still the availability of Semantic Web user front-ends which demonstrate the power of linked data. To a certain degree BBC Music beta is the first commercial platform which makes heavy use of linked data. With Parallax, David Huynh has shown that some of the most interesting Semantic Web applications can be built around browsing and search tools that support complex queries.

Andreas Blumauer from Semantic Web Company (SWC) talked with David Huynh, “Interaction Scientist” at Metaweb, the company which developed Freebase, an “open, shared database of the world’s knowledge”.

SWC: David, you have been working for MIT’s Simile Project and now for Metaweb Technologies – two “building blocks” of the Semantic Web. Could you tell us a bit about your ongoing work at Metaweb?

David: My official title at Metaweb is “Interaction Scientist,” and so my main focus is coming up with novel interaction designs for Metaweb’s platform and products, and prototyping them to some extent to evaluate their effectiveness. Parallax was one such prototype that has gathered much excitement within Metaweb and the Semantic Web community at large. And the Freebase query editor 2.0 shows my interaction designs at the other end of the spectrum – targeting developers rather than just end-users.
I’ve also learned that data-centric user interfaces and interaction designs can only be as good as the data allows them to be. So I am also dedicating some of my time toward analyzing the data we have and improving its quality so that I can design even better interactions.

Freebase Query Editor 2.0 from David Huynh on Vimeo.

SWC: With Parallax you have introduced a new way to search and explore data: Could you explain the “set-based browsing paradigm”?

David: In the browsing paradigm of the original Web, while looking at a web page, you can only click on one hyperlink to get to one other web page. But in a lot of cases, the hyperlinks on that web page can be grouped into different groups based on what they mean to the human reader: these are the links that lead to reviews, these are the links that lead to authors, these are the links that lead to vendors, etc.
Now if the computer actually knows what these links mean, then you can tell it to follow several of those links that mean the same thing: follow all the links that lead to authors. Think of it as powered browsing: the computer does the work of following several similar browsing paths at the same time – going from a set of things (web pages or data entries) to a similarly related set of things – and making all of that information available for your perusal in one shot. It is a paradigm shift compared to how we browse the Web today. And it’s only possible when the computer is capable of telling which link is similar to which other link. And that capability, in turn, will be made possible by the Data Web.
(See this unpublished paper which goes into depth about this concept)
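
The set-based step described in this answer, going from a set of things to the set of things reached through one kind of link, can be sketched in a few lines of Python. This is a toy illustration and not Parallax’s implementation; the triples, identifiers and the function name follow are all made up.

```python
# Toy data: (subject, link type, object) triples. All names are invented.
TRIPLES = [
    ("book:1", "author", "person:ann"),
    ("book:1", "review", "review:r1"),
    ("book:2", "author", "person:bob"),
    ("book:2", "author", "person:ann"),
    ("book:3", "vendor", "shop:x"),
]


def follow(items, link_type, triples=TRIPLES):
    """Go from a set of things to the set reached via one kind of link."""
    items = set(items)
    return {o for s, p, o in triples if s in items and p == link_type}


books = {"book:1", "book:2"}
authors = follow(books, "author")
print(sorted(authors))  # ['person:ann', 'person:bob']
```

Because the result is again a set, the same step can be applied repeatedly, for example from books to authors and from those authors onward to whatever they link to.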

SWC: Linked Data is evolving fast. A huge amount of RDF data is available and ready for exciting new applications. Unfortunately, one bottleneck is still the availability of Semantic Web user front-ends which demonstrate the power of linked data. Do you think that the Semantic Web is more a server technology than an end-user experience?

David: I have never thought of the Semantic Web as either a server technology or an end-user experience. I only care about usefulness, and then a matching amount of usability to make that usefulness accessible to people, especially those without Computer Science expertise.
I find that it’s so much easier to explain to people and get them excited about “immediate, personal, local benefits” of a particular technology than about “long-term, communal, global benefits” of a vision. For most people, the former must be experienced and felt often before the latter can appear vaguely appealing enough to call for action. I’m lazy – I don’t like to spend effort convincing people of visions; I only want to entice people into using the tools that I have created.
So if Parallax is considered a success, it is so not just because of its technologies and research contributions, but also because the accompanying screencast explained it in a way that people who cared nothing about the Semantic Web could understand why Parallax would be useful to them. This was achieved by pointing out limitations of existing web technologies as already experienced and understood by a lot of web users, and then illustrating concretely a possible solution enabled by data web technologies.
Perhaps I could venture further and say that the dichotomy of server technologies and end-user experience is what’s holding back Semantic Web user interface efforts. For those who don’t have expertise in design, it is a comfort to think that once the back-end technologies are solid, then it’s just a matter of putting on some polish, a.k.a. user interfaces from their point of view, to make the whole package appealing. This approach is wrong. The user interface design must inform the back-end design. Otherwise, the user interface will almost always reflect the internal system model, and that’s usually very dissonant with how users think and behave. Recall all the Semantic Web interfaces you have seen that force users to think in terms of triples or of raw URIs. Those were made by starting from the data model, not from user needs.

SWC: Quite often I hear people saying: Where is the Semantic Web? – I still can’t “see” it! How could the linking open data community make use of user interfaces like Exhibit, Piggy Bank or Parallax? Is the set-based browsing paradigm a universal way to browse linked data or just one possible way?

David: My research prototypes embody a number of UI ideas that are quite transferable to other platforms. Most of my code is open source, too. This, by the way, is rarer than it should be: research prototypes often fall apart as soon as, or even sooner than, the relevant research papers get presented at conferences, and research code rots rather than gets offered free for reuse. This is sad, because reusable data needs reusable code to proliferate even more widely, but there is no reward system for making research code reusable, or for keeping research prototypes running. So perhaps people can’t “see” the Semantic Web because research prototypes are not presented in appealing and comprehensible ways, and they break down and disappear too quickly.
Regarding the set-based browsing paradigm, it is most certainly not the only way to browse linked data. It is just the first good one that came to my mind, around 2005. But it was not until 2008 that I actually got around to implementing it for real. One of the factors so important in its feasibility is the quality of data in Freebase, compared to other data sources that I had access to. Even the simple fact that a lot of Freebase topics have images makes Parallax look a lot more interesting and useful. People like to see pictures rather than raw URIs. And the diversity of types of data helps illustrate the browsing paradigm of Parallax – that ability to shift focus from one set of things to another set of things, even across seemingly very unrelated domains of information, such as from politicians to their celebrity friends in the movie industry.
So, perhaps one of the main challenges in adopting Parallax ideas on any arbitrary RDF data set is curating the data sufficiently for the purpose of presenting it. In fact, if you don’t know how some data is to be presented and used, there’s no way for you to determine if that data is of sufficient quality. User needs and interface designs drive back-end implementation and data curation, not the other way around. It’s a simple idea, really, but it can be hard to adopt if one is fixated on data alone.

SWC: Do you plan new versions of Parallax? When will it become part of Freebase or of even more Linked Data Sources?

David: I’ve done a few further experiments with the ideas in Parallax, but they are not ready for public use yet. Freebase data makes my job much easier by allowing me to focus mostly on interaction designs rather than mostly on data quality, or rather, fighting the lack of data quality, for the purpose of presenting it. So I’ll start with Freebase data and we’ll see where it takes me.

SWC: What else are you working on at the moment?

David: As mentioned briefly earlier, reusable data needs reusable code to proliferate widely. That gives you a hint at an effort that I’m involved with.

SWC: Many thanks, David!

About David François Huynh

Christoph Wieser

Tim Berners-Lee: “We need data on the Web to work better together”

Today, the 18th WWW conference started in Madrid, Spain. In his opening talk, Tim Berners-Lee outlined the status quo of the current Web and focused on areas for ongoing research.

According to Tim Berners-Lee, the Web is still static and consists mostly of archived HTML and PDF documents. There is still a need for a read/write Web, and the standards are still not used to a sufficient extent. Changes in the Web are the ‘move to mobile’ and the rise of ‘advertising to being a science’.

Besides the still existing challenges of the current Web, additional ones have arrived. Web Applications, Open Social Networking and Open Linked Data are among the areas of current interest.

Web Applications are supposed to become new computing platforms and need a serious clean trust system. In the future Web Applications could offer a decentralized modular installation like a webized Debian.

Open Social Networking has become a great application in the Web. Currently it suffers from the ‘Social Silo Problem’. Users often have accounts on several platforms like Facebook or MySpace. The platforms, however, are separated from each other like in a field of silos. The challenge of the Semantic Web Community is now to interconnect the silos via RDF, OWL, HTTP, and SPARQL. A further requirement named by Tim Berners-Lee is a focus on a secure Web ID.

Open Linked Data attracted the attention of Tim Berners-Lee most of all. Being one of the chairs of the co-located workshop ‘Linked Data on the Web’, he stressed that “we need data on the Web to work better together” in government, enterprise, and science. Open Linked Data could be a wizard for users of existing relational database systems. As a query language, he proposed a federated/delegated SPARQL.

Finally, Tim Berners-Lee described the role of researchers in those challenges. Researchers should ‘build a platform for others that follow’. In doing so, one should not assume what people will use the platform for.

(Report by Christoph Wieser / Salzburg Research)