Thomas Schandl

Short Semantic MediaWiki Tutorial (with link to sandbox)

On the occasion of the recent publication of our book, Social Semantic Web, we have created an accompanying wiki for you to explore the contents of the book and obtain information about its authors. Staying true to the motto “Eat your own dog food”, the Semantic Web Company has used a semantic wiki for that purpose.

We opted for Semantic MediaWiki (SMW) and the extensions Semantic Forms and Semantic Drilldown. In this blog post we’ll take a look at the handy features you get with these. This short tutorial is based on my SMW demonstration at the Web of Data Practitioners’ Days in Vienna two weeks ago.

As the book is in German, the wiki is set up in German, too, but that shouldn’t be a problem for understanding the demonstrated features. For the following examples, we have created a mirror of our production wiki, so don’t hesitate to edit and play with this mirror wiki (we might refresh it occasionally, so don’t write any data into the wiki that you don’t also have stored elsewhere). This tutorial is going to take you through the following SMW features:

  • Automatically created lists
  • Faceted search
  • Semantic queries
  • Entering data via forms
  • RDF export

So let’s see what these features hold for us.

  • Automatically created lists

A common problem in wikis like Wikipedia is the effort it takes to create and maintain lists such as the list of the EU’s largest cities. Keeping such lists up to date is both laborious and error-prone; as a result, many useful lists we can imagine don’t exist on Wikipedia at all, like a list of the world’s largest corporations with a CEO younger than 35.

In SMW it is easy to create all kinds of lists with queries. This page for the book’s table of contents is an example. View its source on the wiki to see the inline queries used to generate the page:

[Image: Semantic MediaWiki query]

Because it comes from a query, the list is generated afresh any time the table of contents page is called up. If the data on an article’s page has changed, it will also be updated in that list. In a regular MediaWiki, by contrast, one has to manually update the data in both places (the article and the list), which, apart from the extra work, also makes errors and inconsistencies much more likely.
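To make the principle concrete outside the wiki, here is a toy model in Python. It is not Semantic MediaWiki code: the page titles, the property names `chapter` and `author`, and both functions are made up for illustration. The point is only that the list is computed from the annotations on each page, so there is no second copy to keep in sync.

```python
# Toy model of a query-generated list (all names hypothetical, not SMW code).
# Each "page" carries its own annotations; the list is computed on every view.
pages = {
    "Article A": {"chapter": 2, "author": "Author X"},
    "Article B": {"chapter": 1, "author": "Author Y"},
}


def query_list(pages, sort_by):
    """Return page titles ordered by one annotated property."""
    ordered = sorted(pages.items(), key=lambda item: item[1][sort_by])
    return [title for title, _props in ordered]


def render_toc(pages):
    """Build the table of contents afresh from the current annotations."""
    return [
        f"{pages[title]['chapter']}. {title} ({pages[title]['author']})"
        for title in query_list(pages, "chapter")
    ]
```

Changing an annotation on one page and calling `render_toc(pages)` again yields the updated list without touching the list itself.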

  • Faceted search

Take a look at the list of articles page…

Jana Herwig

Multimedia in the Web of Data – Annotating and Interlinking Photos, Music, Multimedia [WOD-PD]

The Web of Data Practitioners Days concluded with the session on Multimedia in the Web of Data, the first part of which was led by Ansgar Scherp (University of Koblenz-Landau, Germany).

Multimedia content, as Ansgar pointed out, is hardly annotated, badly organized, and hardly ever looked at again – just think of the 300-something pics you might take on an average weekend getaway and never touch again. Annotating multimedia content requires a lot of work and dedication, so most of the time these pictures eventually disappear in the “digital shoe box” that is your photo management software.

The most obvious remedy is to annotate content as early as possible, ideally at the moment of creation and directly on your portable camera (formerly known as a mobile phone). Ansgar suggested providing incentives to encourage picture annotation – professionals could, for instance, receive a higher financial reward if they deliver pictures that are already annotated. And of course there are ‘Games with a purpose’ such as Google Image Labeler, where players tag images in pairs, with and against each other, and are rewarded with the entertainment factor of the game.

The slide below shows what has happened (or will happen) to the process of creating photo books in the digital age and the age of mashups:

[Image: Ansgar Scherp's slides]

After all, this is the age of the social semantic web, so why not try and (re-)use the content, structure and contexts that other users have already created on the web? Content augmentation, within the scope that Ansgar is concerned with, consists of reusing content and structures (e.g. from sources such as Flickr, Wikipedia and Geonames), made possible through the definition of rules, e.g.:

  • If there are two or fewer pictures on a page*
  • then automatically augment the page with additional photos using location information.

* Page here means a page in the album you are currently working on – you probably took a picture of yourself and your friend in Paris, and even though you went to the Centre Pompidou, you forgot to actually take a pic of the building itself – well, let the web be your library!
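Read as code, the rule above might look like the following Python sketch. Everything here is an assumption made for illustration: the `AlbumPage` class, the `find_photos_by_location` callable (a stand-in for a lookup against a source such as Flickr, with no real API involved), and the decision to add every photo the lookup returns. The talk did not specify how many photos to add.

```python
from dataclasses import dataclass, field


@dataclass
class AlbumPage:
    """One page of the album being edited (hypothetical structure)."""
    location: str
    photos: list = field(default_factory=list)


def augment_page(page, find_photos_by_location, threshold=2):
    """Apply the rule: with `threshold` or fewer photos, add photos for the location.

    `find_photos_by_location` is a stand-in for a web lookup; it takes a
    location name and returns a list of photo identifiers.
    """
    if len(page.photos) > threshold:
        return page  # enough photos already, rule does not fire
    extra = [p for p in find_photos_by_location(page.location) if p not in page.photos]
    return AlbumPage(page.location, page.photos + extra)
```

Passing the lookup in as a parameter keeps the rule itself independent of whichever web source supplies the additional photos.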

So the goal is clear: develop a procedure for applying automatic content augmentation in the creation of good photo books.

But what makes a ‘good’ photo book anyway? Here are some of the results of a structural analysis of real, human-created photobooks conducted at CeWe Color:

  • % of photos with faces: 36%
  • Number of album pages: 16.96
  • Photos per page: 6.69
  • Text fields per page: 1.45
  • % of pages with text: 87%

Many rules can be derived from the structural analysis and applied in turn to the creation of photo books, e.g. rules like this one:

  • If the text is located in the upper third of a page
  • if the font size is equal to or larger than 16 points
  • if the number of words is less than 10
  • if there is no caption on the page that has a bigger font size
  • then this page is the title
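The four conditions of this rule translate almost directly into a predicate. The Python sketch below is one possible reading, not the implementation presented in the talk: the `TextField` structure, the use of a fractional `top` coordinate (0.0 at the top edge of the page), and the choice to test only the top edge of the field against the upper third are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TextField:
    """A text field on an album page (hypothetical structure)."""
    text: str
    font_size: float  # in points
    top: float        # top edge as a fraction of page height, 0.0 = top of page


def matches_title_rule(candidate, captions_on_page):
    """Check the four conditions of the title rule for one text field."""
    in_upper_third = candidate.top < 1 / 3
    large_enough = candidate.font_size >= 16
    short_enough = len(candidate.text.split()) < 10
    no_bigger_caption = all(
        caption.font_size <= candidate.font_size for caption in captions_on_page
    )
    return in_upper_third and large_enough and short_enough and no_bigger_caption
```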

Ansgar recommended xSmart, which he described as a “context-driven authoring tool for page-based multimedia presentations.”

Ansgar’s presentation was followed by two more: one by Yves Raimond on Interlinking Music on the Web of Data, and one on Interlinking Multimedia. Despite my best intentions, I did not manage to cover these two in detail, but at least I gathered the links to relevant resources from all three sessions…

Jana Herwig

Session 4: Using the Web of Data [WOD-PD]

This morning’s first session was dedicated to Using the Web of Data, or, as Alan Dix put it: “In the end, it’s not about data – it’s about use!” Alan and Richard Cyganiak were the keynoters for this session.

Alan Dix is a Professor at the Computing Department of Lancaster University, and author (with Janet Finlay, Gregory Abowd, and Russell Beale) of Human-Computer Interaction.

To start with, Alan pointed to the two sides of achieving the web of data: Firstly generating the web of data (a billion triples, as mighty as this may sound, is actually tiny, says Alan) and then, secondly, accessing the web of data.

With regard to generating the Web of Data, Alan distinguished between top-down and bottom-up approaches. Among the former he counted the creation of the web of data from legacy sources (i.e. where you take existing data, e.g. structured data, and semantically lift them) and web scraping such as DBpedia’s extraction of data from Wikipedia.

N.B.: This notion of ‘top-down’ does not imply a hierarchical relationship, but rather means that there is already a plan for what is going to be put on the web of data (e.g. ‘all semi-structured information on Wikipedia’ or ‘dataset XY from project Z’). The bottom-up idea here implies that data is added as the result of an action or interaction as users go along, e.g. relationships are created as a user expands his or her social network. On Amazon, for instance, user interaction is used to generate semantics: people do not tell Amazon what they like, they simply buy it.

Having relationships of course does not imply yet that these relationships are part of the Semantic Web. Or, as Alan put it, “why should I be RDFizing my online presence if none of my friends are?”

Please take a look at the PDF of Alan’s slides (2.4 MB) – what I cannot reproduce here is a chart he developed, which was very useful for describing current scenarios on the web and which posed a twofold question:

Does a website/platform have the web of data implemented? YES/NO
Is the web of data on a website/platform apparent to the user? YES/NO

The possible combinations (YES/YES, YES/NO, NO/YES, NO/NO) provide a good heuristic tool for describing what is currently available, with and without the Semantic Web. Take, for instance, the shiny interface of Talis’ Project Cenote: Cenote’s vision is to “make library data visible in many contexts, inside and outside of the library, making the data much more accessible and visible to a wider audience – benefiting current and potential users of library services wherever they are.” On Cenote, the user doesn’t see that the Web of Data is in there – it is actually implemented, but not in a way that is apparent to the user.

On the other end of the spectrum, you have a platform like Facebook: Alan referred to Facebook as “the user’s own web of data”, i.e. web of relationships: The user is aware of these relationships (they actually shape his interaction and communication with the site), and the (numerous!) apps on Facebook continually add relationships, but, regrettably, insulated from one another and not using RDF (and don’t you try to take data out of Facebook!).

Two examples of public data that Alan cited and that grow as people/institutions add data to them are Freebase (the “open database of the world’s information” – see previous posts on this blog about Freebase) and Swivel. Swivel allows people, institutions, anyone to upload and explore data, also featuring official data sources such as (links go to their Swivel pages): New York Federal Reserve Bank, UNESCO Institute for Statistics, DukeResearch or EUROSTAT. According to Alan, there is already more data on Swivel now than in the whole Linked Data cloud.

Alan also mentioned the Social Graph API. Yesterday evening, Luca Hammer (one of the web 2.0 people who had joined the Open Hacking Session) introduced me to the WordPress plugin “Meet your commenters”, which uses Social Graph to find social relations on the web and adds these data to the commenter profiles it creates in WordPress.

On a different note: I took some time today to explore Alan’s homepage and found the cute Christmas Crackers application, which was first developed in 1999 and is now also available on Facebook. As trivial as it may sound at first, sending virtual Christmas crackers (with more than 5000 possible combinations!) is a good showcase for developing Human Interaction Scenarios, and a number of papers have been written about the application. Here is the case study which Alan recommends to begin with: Designing experience – virtual Christmas Crackers.

The abstract and a list of links to all websites and demos Alan discussed can be found here. Full reference: A. Dix and R. Cyganiak (2008). Using the Web of Data. Keynote at WOD-PD 2008 | Web of Data Practitioners Days, Vienna, Austria – Oct 22-23, 2008. http://www.hcibook.com/alan/papers/WOD-PD-2008/

Even if you have not met Richard Cyganiak in person, you have certainly come across one of his creations: The Linked Data Cloud. Richard is a research assistant at DERI Galway. In his demo, he gave us the opportunity to gain hands-on experience, introducing a tool he dubbed Snorql, which is basically an easier-to-use version of a SPARQL endpoint, as it already has the required prefixes ‘pre-installed’.

Using the Snorql interface, we could explore the dataset we had created collaboratively during Keith Alexander and Yves Raimond’s session. Writing SPARQL queries manually can be a challenge, and is next to impossible if you (like me) don’t know the syntax. But today we could just copy and paste all the queries from a website Richard had put up prior to his session – thanks a lot for the excellent preparation and demonstration!
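To illustrate what ‘pre-installed’ prefixes save you, here is a small Python sketch that prepends a fixed set of PREFIX declarations to a query body. This is not how Snorql is implemented; the function name and the choice of three prefixes are assumptions for the example. The rdf, rdfs and foaf namespace URIs themselves are the standard ones.

```python
# Prefixes that a query form might declare on the user's behalf
# (the selection is illustrative, the namespace URIs are the standard ones).
DEFAULT_PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def with_prefixes(query_body, prefixes=DEFAULT_PREFIXES):
    """Return the query with one PREFIX line per namespace placed in front."""
    header = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
    return f"{header}\n{query_body}"


example_query = with_prefixes(
    "SELECT ?person ?name WHERE { ?person rdf:type foaf:Person ; foaf:name ?name }"
)
```

With the header in place, the query body can use short names such as `foaf:name` instead of spelling out full URIs.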

Richard also showed a couple of RDF browsers in action, e.g. the Tabulator Plugin (“a Firefox extension which allows Firefox to handle data as well as documents”), or the Marbles Linked Data browser which is running right on beckr.org/marbles; enter, for instance http://api.talis.com/stores/wod-pd-sandbox/items/People/JanaHerwig (learn more about Marbles here).

Thank you, Alan and Richard – the combination of talk and demo was indeed a perfect intro towards using the Web of Data.

Jana Herwig

Bringing (Legacy) Data to the Web [WOD-PD]

The third session at WOD-PD was dedicated to “Bringing (Legacy) Data to the Web”, and led by Sören Auer (University of Leipzig, Germany) and Orri Erling (OpenLink Software).

Sören Auer described the difference between the Web 1.0, 2.0 and 3.0 as follows: On the Web 1.0, you had many websites that provided unstructured, mainly textual content. On the Web 2.0, you have a few large websites that specialise in specific content types. And, finally, on the Web 3.0, there are many websites which contain, and are able to semantically syndicate, arbitrarily structured content.

So why would we need another web? What you cannot do with the current web is find answers to seemingly complex, yet in reality pretty mundane questions such as: Where in Leipzig do I find an apartment that is close to bilingual, German-French child care facilities? Are there any ERP service providers which have offices in Vienna and Berlin? Who are the researchers in South-East Asia currently working on database-related topics?

Sören further discussed three of the present means of bringing relational data to the web: Triplify (a web application plugin that exposes data from relational databases in RDF), D2RQ (a declarative language to describe mappings between relational database schemata and OWL/RDFS ontologies, developed at Free University Berlin), and Virtuoso Universal Server (a middleware and database engine hybrid delivering, for instance, data integration for SQL, RDF, XML and Web Services). With respect to Triplify, Sören – who is Triplify’s founder and main developer at AKSW Uni Leipzig – showed and discussed the configuration for WordPress 2.1, which can be found here (click here for more configurations, e.g. for Joomla, OpenConf and Drupal). The next aim for Triplify is to become an integral part of end-user web app distributions.
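What all three tools have in common is that rows in relational tables end up as RDF statements. The Python sketch below shows that general idea only; it is not a Triplify configuration or a D2RQ mapping. The `posts` table, the `http://example.com/` namespaces and the one-predicate-per-column convention are assumptions made for the example, and all values are emitted as plain literals in N-Triples syntax.

```python
import sqlite3

BASE = "http://example.com/"         # hypothetical namespace for resources
VOCAB = "http://example.com/vocab/"  # hypothetical namespace for predicates


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def table_to_ntriples(conn, table, key):
    """Turn each row into triples: one subject per row, one predicate per column.

    `table` is interpolated into the SQL, so it must come from trusted code.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    lines = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key]}>"
        for column, value in record.items():
            if column == key or value is None:
                continue  # the key is already in the subject URI; skip NULLs
            lines.append(f'{subject} <{VOCAB}{column}> "{escape_literal(value)}" .')
    return lines


def demo():
    """Build a tiny in-memory table and export it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
    conn.execute("INSERT INTO posts VALUES (?, ?, ?)", (1, 'Hello "web" of data', "jana"))
    return table_to_ntriples(conn, "posts", "id")
```

A real mapping would also link rows to each other via foreign keys and reuse existing vocabularies instead of inventing one predicate per column.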

An important question raised by Sören was: How do next generation search engines know that something has changed on the web of data? He suggested three approaches:

  1. Always try to crawl everything (this may sound silly – but that’s actually what is happening on the current web)
  2. Ping a central update notification service – e.g. PingTheSemanticWeb.com – which works as a showcase, but will probably not scale once the data web is really deployed.
  3. Each linked data endpoint publishes an update log – e.g. with Triplify, as a special folder inside the Triplify namespace, e.g. http://example.com/Triplify/update
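The third approach can be sketched from the crawler’s side in Python. The log format used here, a list of (URI, timestamp) pairs, is purely an assumption for illustration and not the format Triplify publishes; the sketch only shows why an update log lets a search engine skip everything that has not changed.

```python
from datetime import datetime


def changed_since(update_log, last_crawl):
    """Return the URIs with at least one logged change after `last_crawl`.

    `update_log` is a list of (uri, changed_at) pairs in a hypothetical format;
    a URI that changed several times is reported once.
    """
    return sorted({uri for uri, changed_at in update_log if changed_at > last_crawl})


# Example log an endpoint might publish (made-up URIs and dates).
example_log = [
    ("http://example.com/a", datetime(2008, 10, 20)),
    ("http://example.com/b", datetime(2008, 10, 23)),
    ("http://example.com/b", datetime(2008, 10, 24)),
    ("http://example.com/c", datetime(2008, 10, 25)),
]
```

A crawler that last visited on 22 October would re-fetch only the resources returned by `changed_since(example_log, datetime(2008, 10, 22))` instead of crawling everything again.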

Also discussed by Sören and worth checking out is Reuters’ Semantic proxy – the demo went live in late September.

Orri Erling, the lead developer of the Virtuoso team, addressed the issue of mapping relational databases to RDF with OpenLink Virtuoso. In his talk, he discussed the pros and cons of an RDF data warehouse:

Pros

  • Even query performance across all data
  • Possibility of forward-chaining inference
  • Some SPARQL features may be better supported, e.g. unspecified predicates

Cons

  • Keeping data up-to-date
  • Complex set up, needs dedicated servers: you don’t build them on a whim

What Virtuoso delivers is a mapping of SPARQL to SQL against any existing schema (whether stored in Virtuoso or elsewhere); a physical quad-store (quad as in quadruple, not as in quad-bike); and federated/local relational database management systems (RDBMS).

A more detailed discussion of the requirements for Relational-to-RDF Mapping is available on Orri’s blog, where he discusses it in the light of his own experience. A PowerPoint presentation of a previous talk he gave to the W3C RDB2RDF Incubator Group can be downloaded here: Mapping Relational Databases to RDF with OpenLink Virtuoso (PPT, 115KB). His summary of the group discussions around the same topic, Requirements for Relational to RDF Mapping, can be found here.

Orri also showed the Virtuoso billion triples demo which, according to the corresponding blogpost, “is being worked on at the time of submission and may be shown online by appointment.” The demo was a submission to the Billion Triples Challenge.
